PUP-26: Mitigations for Validator Downtime

Attributes

  • Author(s): @iajrz @otto_vargas @jdaugherty
  • Parameters:
    • MinSignedPerWindow
    • SignedBlocksWindow
    • DowntimeJailDuration
    • SlashFractionDowntime
  • Current Values:
    • MinSignedPerWindow: 0.6
    • SignedBlocksWindow: 10
    • DowntimeJailDuration: 3600000000000
    • SlashFractionDowntime: 0.000001
  • New Values:
    • MinSignedPerWindow: 0.8
    • SignedBlocksWindow: 5
    • DowntimeJailDuration: 14400000000000
    • SlashFractionDowntime: 0.001

Summary

When validators go offline or spread bad gossip due to misconfigurations, their behavior can disrupt service and put the chain at risk of halting. This proposal aims to make it more expensive for a validator to go offline and to scale mitigations with the danger posed to network performance and resilience. It does so by removing non-responsive validator nodes from the validator pool faster and by increasing penalties for misconfigured validators.

Abstract

The MinSignedPerWindow represents the minimum fraction of blocks that a validator node must sign during the given window. If a validator falls under this percentage (currently 60%, or 6 out of 10 blocks), the validator gets jailed. Increasing this value to 80%, together with the shorter SignedBlocksWindow described next, would decrease the time before a validator triggers this mitigation from the current 1 hour and 15 minutes to 30 minutes.

In addition to increasing the MinSignedPerWindow, the SignedBlocksWindow should be halved from the current 10 blocks to 5 blocks. Validators would then need to sign 4 out of every 5 blocks rather than 8 out of 10 blocks, which allows the network to react more quickly to validator misconfigurations.

As part of the penalty for failing to sign an adequate number of blocks in the given window, we propose that DowntimeJailDuration be increased from 4 blocks (1 hour) to a minimum of 16 blocks (4 hours), which is approximately 0.6% of a validator’s monthly availability (4 of roughly 720 hours).

To additionally mitigate and correct for this downtime, increasing SlashFractionDowntime will create proportional penalties for validators that do not sign enough blocks in the above windows.
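As a rough illustration of the parameter changes described above, the following Python sketch converts the current and proposed values into blocks and minutes. It assumes that DowntimeJailDuration is expressed in nanoseconds (which is what the 3600000000000 = 1 hour pairing implies) and uses the ~15 minute block time quoted in this proposal. The helper name `describe` is hypothetical and is not part of Pocket Core.

```python
from fractions import Fraction
import math

BLOCK_MINUTES = 15             # approximate block time quoted in the proposal
NS_PER_HOUR = 3_600 * 10**9    # assumption: jail duration is in nanoseconds


def describe(min_signed: str, window: int, jail_ns: int) -> dict:
    """Translate the raw parameters into blocks, minutes, and hours."""
    # Fraction avoids floating-point surprises when multiplying 0.6 or 0.8.
    required = math.ceil(Fraction(min_signed) * window)
    missable = window - required
    jail_hours = Fraction(jail_ns, NS_PER_HOUR)
    return {
        "required_signatures": required,
        "missable_blocks": missable,
        "window_minutes": window * BLOCK_MINUTES,
        "tolerated_downtime_minutes": missable * BLOCK_MINUTES,
        "jail_hours": float(jail_hours),
        "jail_blocks": int(jail_hours * 60 / BLOCK_MINUTES),
    }


current = describe("0.6", 10, 3_600_000_000_000)
proposed = describe("0.8", 5, 14_400_000_000_000)

for name, values in (("current", current), ("proposed", proposed)):
    print(name, values)
```

Under these assumptions, the current values tolerate 4 missed blocks per 10-block window with a 4-block jail, while the proposed values tolerate 1 missed block per 5-block window with a 16-block jail.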

Motivation

These parameter updates aim to reinforce desired validator behavior in anticipation of potential downstream impacts of recent and upcoming protocol upgrades and software releases:

  • The release of PIP-22 (RC-0.9.0) has increased validator diversity and the likelihood that servicers will become validators due to the introduction of stake-weighted reward bins, which creates the potential for influxes of validator misconfigurations that should be dis-incentivized with more aggressive slashing.

  • The release of Lean Pocket (RC-0.9.2) will introduce the ability for servicers to run stacked nodes, which increases the danger posed by validator misconfigurations for servicers who have consolidated into a more vertical model that interacts with misconfigured validators through bad gossip, which is spread more effectively, resulting in disagreements among validators about state, leading to consequences from stuck nodes to chain halts.

Rationale

MinSignedPerWindow and SignedBlocksWindow work in tandem; the idea is to reduce the amount of time a validator’s vote still counts toward consensus when it’s going to be offline for an indeterminate amount of time.

Currently, validators must sign 60% of blocks in the SignedBlocksWindow (6/10 blocks). The window has a fixed rollover point, meaning that if a node gets stuck after the 6th block of the window, it would be counted as a voter for up to 9 blocks before being removed from the validator pool: 4 from the first window and 5 from the next. In terms of time, this means a validator node could be stuck for up to 2 hours and 15 minutes (since each block is ~15 mins) before it’s removed from the validator pool. Furthermore, the validator node can be out of service for one out of every two and a half hours with no negative consequences.

The proposed changes to SignedBlocksWindow make it so that 4 out of 5 blocks need to be signed, which corresponds to an increase of MinSignedPerWindow from 60% to 80% to keep these parameters closely coupled. This reduces the worst-case absenteeism scenario to 45 minutes (one block from an ending window and two blocks from the next window), increasing the level of service to a maximum downtime of 15 minutes every 75 minutes.
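The detection times discussed above can be reproduced with a small simulation. This is a minimal sketch of the model as the proposal describes it, not the actual Pocket Core jailing logic: it assumes fixed, non-overlapping windows whose miss counter resets at rollover, and that a validator is removed as soon as its misses in the current window exceed what MinSignedPerWindow tolerates. The function names are hypothetical.

```python
from fractions import Fraction
import math

BLOCK_MINUTES = 15  # approximate block time quoted in the proposal


def blocks_until_jailed(offset: int, window: int, missable: int) -> int:
    """Blocks a validator still counts as a voter after going offline.

    `offset` is the zero-based position in the window of the first
    missed block.
    """
    position, misses, offline_blocks = offset, 0, 0
    while True:
        offline_blocks += 1
        misses += 1
        if misses > missable:
            return offline_blocks
        position += 1
        if position == window:  # fixed rollover: the miss counter resets
            position, misses = 0, 0


def detection_minutes(min_signed: str, window: int) -> tuple[int, int]:
    """Return (fastest, slowest) detection time in minutes."""
    required = math.ceil(Fraction(min_signed) * window)
    missable = window - required
    counts = [blocks_until_jailed(k, window, missable) for k in range(window)]
    return min(counts) * BLOCK_MINUTES, max(counts) * BLOCK_MINUTES


print("present :", detection_minutes("0.6", 10))
print("proposed:", detection_minutes("0.8", 5))
```

Trying every possible offset within a window gives the best and worst case for each parameter set: 75 and 135 minutes with the present values, 30 and 45 minutes with the proposed ones.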

The combination of the factors mentioned above and the introduction of larger (stake-weighted) nodes creates an incentive for both large servicers and validators to run tight infrastructure with solid monitoring and recovery capabilities, improving overall network health.

In line with the above changes, DowntimeJailDuration would be increased so that the price of downtime is higher. The jail duration count starts when the node is taken out of the validator pool, which today means a validator has to be offline for over an hour to be jailed for a single hour. We propose a jail duration of four hours so that the punishment is significant relative to the punishable downtime, which is reached either 30 minutes or 45 minutes after a validator goes offline, depending on window timing.

Breakdown of Current vs. Future state of Downtime Duration

| Scenario | Present Parameters | Future Parameters |
|---|---|---|
| Slowest Detection | 135 minutes | 45 minutes |
| Fastest Detection | 75 minutes | 30 minutes |

As an additional means of reinforcing desired behavior through monetary disincentives, SlashFractionDowntime needs to increase to be proportional to the risk it is meant to deter. An increase to 0.1% of the staked amount, combined with a mean validator stake that is higher than in the past, means that this level of slashing is going to be felt.

This is particularly important because misconfiguration is the most common reason these penalties would be applied. The proposed value changes are set against the average validator stake of 70k $POKT, which would result in the validator losing ~3% of their monthly rewards every time this penalty was incurred.

Breakdown of Current vs. Future state of reward penalties (using validator stakes as of Oct. 12th, 2022)

| Scenario | Present Parameters (0.0001% of stake) | Future Parameters (0.1% of stake) |
|---|---|---|
| Smallest Validator Stake | 63,200 uPOKT | 63,200,000 uPOKT |
| Average Validator Stake | 72,184 uPOKT | 72,184,000 uPOKT |
| Highest Validator Stake | 333,333 uPOKT | 333,333,000 uPOKT |
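The slash amounts above can be checked with a few lines of Python. This sketch assumes 1 POKT = 1,000,000 uPOKT and back-computes the stakes in POKT from the amounts listed (63,200, 72,184, and 333,333 POKT), so those stake figures are derived rather than quoted. `slash_upokt` is a hypothetical helper.

```python
from fractions import Fraction

UPOKT_PER_POKT = 1_000_000  # assumption: 1 POKT = 1,000,000 uPOKT


def slash_upokt(stake_pokt: int, slash_fraction: str) -> int:
    """Amount slashed, in uPOKT, for one downtime penalty."""
    return int(stake_pokt * UPOKT_PER_POKT * Fraction(slash_fraction))


# Stakes in POKT, back-computed from the slash amounts listed above.
stakes = {"smallest": 63_200, "average": 72_184, "highest": 333_333}

for label, stake in stakes.items():
    present = slash_upokt(stake, "0.000001")
    future = slash_upokt(stake, "0.001")
    print(f"{label:>8}: {present:>9,} uPOKT -> {future:>13,} uPOKT")

# Relative increase in the slash fraction itself, as a percentage.
increase_percent = (Fraction("0.001") / Fraction("0.000001") - 1) * 100
print(f"increase: {int(increase_percent):,}%")
```

Exact fractions are used instead of floats so that multiplying by 0.000001 does not introduce rounding noise in the uPOKT amounts.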

The proposed adjustments to the above parameters would reinforce desired validator behavior, mitigate the impact of validator misconfigurations on network health, and protect service nodes as new reward and node configuration models change the makeup of, and node interactions with, the validator pool.

Dissenting Opinions

The validator penalty is too aggressive

The penalty is intentionally aggressive to be proportional to the threat. Given that this proposal aims to incentivize less vigilant validators to prioritize quality, we believe this penalty is enough to feel it without being an outsized threat to profitability (assuming the configurations of misbehaving validators are corrected accordingly).

2 hours of jailing is not enough for validators to notice

When used in conjunction with MinSignedPerWindow we believe that the increased chance of jailing makes 2 hours a good starting point that incentivizes validators to debug and address downtime issues before the jailing period is over (and, ideally, moving forward with improved monitoring, etc.).

Edit: Following forum discussions below, this value was deemed too low and has been upped throughout the proposal to set the new value of the DowntimeJailDuration parameter to 14400000000000 (4 hours).

Why can’t we increase MinSignedPerWindow without decreasing SignedBlocksWindow?

While increasing MinSignedPerWindow to 80% (8 out of 10 blocks) achieves similar results, it does not impact the time it takes for the network to react to this misconfiguration. Decreasing SignedBlocksWindow allows the network to react twice as fast to misconfigured validators and remove them from the validator pool, which is better for network consensus.

Copyright

Copyright and related rights waived via CC0.

7 Likes

I strongly support this proposal, but I wonder if it goes far enough. We’ve seen significant validation lags in some instances (@BenVan has tracked this in the past), and even two hours doesn’t seem like enough time for some of the node runners to respond to an issue affecting validation, depending on their level of monitoring and reporting. Four hours might be a better target here.

I’m also OK with the higher slash, and higher still if this doesn’t prove to be an effective incentive for maintaining validator performance.

4 Likes

I support this proposal. Very straightforward. I concur w @Jinx that 4 hr rather than 2 hr may be a better value for DowntimeJailDuration. Could even go higher.

I’m fine with the proposed SlashFractionDowntime as a start. It is easy enough to come back and increase it if there is need. I note that in Andy’s double-sign proposal he proposed a 2% slash for double signing, which is 20x more aggressive than the downtime slash being proposed here. But appropriately so, I’m sure.

…speaking of which, I don’t recall PUP-24 ever being put to a vote. What happened there?

1 Like

This is a very simple and low-risk change that results in a large increase in the chain’s security and health.

I support this proposal.

5 Likes

Goes without saying, I support this proposal. Nothing unreasonable. We’ve had multiple years to get our validator hosting game up to speed.

To Jinx’s point - if validators are still causing issues in the network after this PUP, we should investigate the whys and potentially suggest harsher penalties if it’s not a software fault.

4 Likes

I strongly support this proposal and anything more that can be done to increase the reliability of the validator nodes. I also agree with @Jinx that the slash rate might need to be higher to make this effective. But that could always be increased over time I suppose.

3 Likes

In agreement with this proposal and others’ comments here. This can be a great first step, which can be re-evaluated for additional adjustments if it is not deemed effective enough.

2 Likes

Throwing my support here as well!

3 Likes

Thank you, everyone, for your support. After reviewing these comments and discussing with the internal team, we are aligned on increasing the DowntimeJailDuration from the proposed 7200000000000 to 14400000000000. Edits have been made above to incorporate that requested change.

Regarding SlashFractionDowntime, if anyone has a specific value that they want to propose for additional debate, we are open to that. Otherwise, we will leave this as is for now, as it’s a 99,900% increase to the slash.

Pending any further discussion we will aim to move this to a vote in the upcoming days. Thanks again!

2 Likes

I’m all for increasing the penalty. But I personally think reducing the downtime before jailing isn’t a good idea, as it massively reduces the time to rectify any issues before being jailed. Currently it’s not really an issue because slashing is peanuts, but increasing the punishment makes this much more impactful.

Having a 30min jail window basically means you are extremely unlikely to fix an issue before being jailed and thus slashed. Pretty much all other networks I can think of (and have worked with) have a 12-48hr downtime window here which gives ample time to mitigate any issues.

You could have a failover node, but this imposes other risks when automated (i.e., double signing), and being on hand for a manual 30min switch 24/7 seems optimistic.

Also, has the risk of people below the threshold running nodes with higher stakes but setups not optimised for validation (pruned DBs and so on) been considered here? Having shorter windows will result in more nodes potentially being jailed, which could have a knock-on effect if it brings these nodes into the set (if I remember correctly, incorrectly pruned nodes caused halts in the past).

2 Likes

Appreciate the response. Those are all fair points, but I still believe that the aforementioned parameters are a fair tradeoff with the goal of increasing network security.

I wanted to make an open request to those on the thread so we can be more data-driven about the values chosen.

Pretty much all other networks I can think of (and have worked with) have a 12-48hr downtime window here which gives ample time to mitigate any issues

Could you provide a few examples of networks with a high or low downtime window?

and being on hand for a manual 30min switch 24/7 seems optimistic.

Would be great to get uptime metrics from validator operators to understand how frequently they go down.

2 Likes

Sorry for my absence on this thread @Andy-Liquify! Appreciate your detailed comment. And thanks for chiming in @Olshansky

Could you provide a few examples of networks with a high or low downtime window?

I am going to use ETH and SOL as two contrasting examples here.

ETH: Validators inactive for 4 epochs (25.6 min) will receive inactive penalties that will get worse with every epoch until the entire stake is slashed.

SOL: Validators are “benched” and not rewarded, but are not penalized. After Hetzner blocked their servers, 22% of the validator network was offline. I don’t think I need to go into detail as to why we don’t want to follow Solana design patterns.

basically means you are extremely unlikely to fix an issue before being jailed and thus slashed.

Correct, which is the intention of this proposal. There should not be issues in validator configurations to begin with, and we hope that validator operators only have to learn this lesson once.

You could have a fail over node but this imposes other risk when automated (I. E. Double signing) and being on hand for a manual 30min switch 24/7 seems optimistic.

Double signing is not an issue for the proposal although it is a network risk.

Also have the risk of people below the threshold running nodes with higher stakes but setups not optimised for validation (pruned DBs and so on) been considered here?

Smart question. I can’t find evidence of pruned DBs causing chain halts in the past, but understand the concern. Nodes should not be pruned, let alone incorrectly. With these mitigations we hope to incentivize accountability across the board by making risk > reward.

These measures are intended to inspire exactly this kind of thinking: “what if I’m running a risky setup?”. Those optimizing only for rewards cannot control the narrative. If nodes are willing to pose this much risk to the network, the risk to them should be equally high. We hope these reinforcements in addition to properly educating network participants about the risk will actually result in fewer or no outages at all.

Thanks again and hope this context is helpful!

5 Likes

ETA: RC-0.9.3 Persistence Replacement is a release that is planned to go live before these parameter updates would be introduced to the network. With a ~41% reduction in overall disk usage, pruning should be discouraged/mitigated as well @Andy-Liquify :pray:

1 Like

Thanks @jdaugherty for a really thoughtful proposal and responses to the various queries

I fully support this proposal too

2 Likes

This proposal is hard to accept.
The downtime window of 30 minutes is tiny.
We can assume that leaves us with 20 minutes to fix the issue after catching it.

Downloading an official Pocket snapshot and restoring it takes roughly 4 hours, and many issues in the past required downloading the snapshot.
This change would encourage all operators to work with fully pruned nodes.

If we want to compare this to other networks:

ATOM:
  • MinSignedPerWindow: 0.05
  • SignedBlocksWindow: 10,000
  • DowntimeJailDuration: 600s
  • SlashFractionDowntime: 0.01

Osmosis:
  • MinSignedPerWindow: 0.05
  • SignedBlocksWindow: 30,000
  • DowntimeJailDuration: 60s
  • SlashFractionDowntime: 0.00

In the case of ATOM, we are looking at ~15 hours, and ~45 hours for Osmosis.
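For readers who want to see where figures of this size come from, here is a hedged Python sketch. The block time is not given in this thread, so the 6-second value below is purely an assumption for illustration; the function names are hypothetical as well. The number of missable blocks follows directly from the parameters listed above.

```python
from fractions import Fraction
import math


def missable_blocks(min_signed: str, window: int) -> int:
    """Blocks a validator may miss in one window without being jailed."""
    return window - math.ceil(Fraction(min_signed) * window)


def downtime_hours(min_signed: str, window: int, block_seconds: float) -> float:
    """Tolerated downtime per window, in hours, for a given block time."""
    return missable_blocks(min_signed, window) * block_seconds / 3600


ASSUMED_BLOCK_SECONDS = 6.0  # assumption for illustration, not from the thread

for name, window in (("ATOM", 10_000), ("Osmosis", 30_000)):
    blocks = missable_blocks("0.05", window)
    hours = downtime_hours("0.05", window, ASSUMED_BLOCK_SECONDS)
    print(f"{name}: {blocks:,} missable blocks, about {hours:.1f} hours")
```

With that assumed block time the results land in the same range as the ~15 and ~45 hours quoted above. A slightly shorter block time would bring them closer still.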

@jdaugherty As for ETH, I’m not an expert, but based on Proof-of-stake rewards and penalties | ethereum.org, isn’t the penalty only applied if 1/3 of the stake is unavailable?

2 Likes

Thanks for the replies guys

But you can’t always pin the issues on the actual node runner. It has been shown in the past (and currently) that nodes can fail to meet consensus and corrupt themselves if there is a sudden drop in validators during voting rounds. This is amplified when people use lite client nodes as validators (even though it is not recommended, people will do it!). We see this happening regularly on our fleet of ~180 validators. We have really stepped up our mitigation and automation to deal with this, and throughout the month of November we have had no jailing, but our mitigation needs >2 blocks to kick in to avoid false positives due to any delayed propagation.

I was more referring to a risk of running automated failover here, which more operators will be inclined to do once these values are in place. This is not a bad thing if done correctly, but orchestrating failover and avoiding double signing is tricky!

Practically all Cosmos-based chains have >12hr windows for downtime; @Dominik has suggested some above.

Awesome, I did spot this on Git (just a disclaimer: we DO NOT run any pruned databases, but I hear rumors of people doing so, and like I mentioned, I’m like 99% sure this caused issues in the past some years back now; will dig through Discord later if I get a chance).

Like I have said prior, I’m all for harsher penalties, but the windows suggested here are far too short. You can’t assume all problems arise due to negligence from the operator. I think these windows will have a big impact on validators that follow best practices but are punished based on factors outside of their control, and the churn in validators caused by excess jailing may have second-order effects on the network.

4 Likes

Based on Andy’s response, the team is trying to solve the symptom, not the problem.
Two main issues force people to run Lean:

  • Resources needed for a validator (especially storage + overhead for the snapshot). If you could run a pruned node on Pocket, you would require two vCPUs, 4 GB, and 50 GB. For whatever reason, Pocket does not support pruning and requires validators to run the indexer.

  • Optimizing rewards. If non-custodial validators’ profit increases with the stake size, this will reduce the number of validators people try to run.

2 Likes

Thanks @Dominik and @Andy-Liquify for your thoughtful and well reasoned replies. Definitely making me think carefully about this.

I thought this was really insightful:

nodes can fail to meet consensus and corrupt themselves if there is a sudden drop in validators during voting rounds.

Do you have recommendations on the right balance here, then? Would keeping SignedBlocksWindow: 10 but updating MinSignedPerWindow: 0.8 make more sense in your opinion? I can discuss suggestions with the other proposers, so I’m curious about your thoughts on how we can react to these validators faster (which is a goal of this proposal in addition to penalizing them more harshly).

I do want to say that I’m not super swayed by the Cosmos comparisons, despite Pocket using Tendermint. Cosmos and Osmosis only have 175 and 135 validators, respectively, which left the latter susceptible to a $5mm exploit. Our 1000 validator pool gives us more security, and given the value proposition of our network is reliability and redundancy, the security and diversity of the network needs to be protected with on-chain accountability.

As for ETH, I’m not an expert, but based on Proof-of-stake rewards and penalties | ethereum.org , isn’t the penalty only applied if 1/3 of the stake is unavailable?

This is referring to the “Inactivity Leak” which is an event where the Beacon chain has not finalized for 4 epochs, which is a different event than a single validator being inactive for x epochs.

I was referring to penalties, which “bleed” the reward the validator would have received from the stake for each epoch for missing the target and source votes.

I did conflate the two, however, and ETH applies this penalty per epoch (6.4 minutes), not after being inactive for 4 epochs.

Anyway, just wanted to clarify that. Mainly would love to hear your counterproposals on the right balance with SignedBlocksWindow to consider before moving forward with voting on this proposal. Thanks again, both, for your engagement!

1 Like