Chain Halt Status Update

To those of you who are unaware, we are currently experiencing a chain halt, which means we don’t have the 67% consensus required between nodes to produce new blocks. We have been providing updates in the #node-runner Discord channel (which you can join by clicking the robot emoji in #welcome) whenever we have new developments to share, but we should have been providing more frequent public updates. Now that we have a solid grasp of the issue and how to resolve it, this will be the first of a regular cadence of status updates we’ll be sharing until the chain gets moving again.

The Issue

The cause of the chain halt was a deterministic app hash error caused by the new transaction indexer introduced in RC-0.6.3. The transaction indexer is only used for consensus in one place, replay protection, and there RC-0.6.3 and previous releases handle a particular edge case differently.

Unlike a chain halt that results from node downtime, this required identifying the cause (to determine if a hotfix is required) and coordinating with node runners to update their software. This is why the chain halt was taking longer to resolve than might otherwise be expected.

Now the nodes are on the software we need them to be on, but there’s a different issue stemming from the chain halt itself. The longer the halt persists, the more voting rounds occur (72 in total so far) and the more memory nodes have to retain. That makes it harder to keep nodes from crashing, and harder to get 67% of nodes to stay caught up on the round data and vote in sync. Solving this is our main focus now.

The Backup

One thing that is important to highlight, since not everyone may be aware of the backup mechanisms we have in place: service to applications has remained uninterrupted for the duration of the halt. This is because the majority of applications use the Pocket Dashboard to connect to Pocket, and we have built-in backup nodes that ensure applications’ relays continue to be serviced in any event.

The Solution

Once we identified that the root cause was the transaction indexer in RC-0.6.3, we coordinated with the largest node runners to get them all updated and ensure that 67% of the network is operating by the rules of the new transaction indexer. This has been completed successfully.

Now we are working to collectively disregard the 72 unsuccessful voting rounds, to lighten the load and make it easier to get that next block produced. As I write this, the core devs are working on a patch that will skip these voting rounds for just this block height. Node runners will be provided time to update to this new version once it is released, with a deadline upon which upgraded nodes will wake, to account for different time zones. Once 67% have updated to the new version and the nodes wake, nodes will achieve consensus on the next voting round, and finally produce the next block.
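To make the 67% requirement concrete, here is a minimal sketch of the check it implies. The function and parameter names are hypothetical and are not taken from pocket-core. The 67% figure is the one quoted in this update; Tendermint-style consensus is usually stated as needing more than two-thirds of voting power, so treat the exact boundary as an assumption.

```python
def has_quorum(upgraded_power: int, total_power: int, threshold_percent: int = 67) -> bool:
    """Return True when the upgraded validators hold at least
    threshold_percent of the total voting power.

    All names here are illustrative; 67 is the figure quoted in the update.
    """
    if total_power <= 0:
        raise ValueError("total_power must be positive")
    # Integer arithmetic avoids floating-point rounding right at the boundary.
    return upgraded_power * 100 >= total_power * threshold_percent
```

For example, `has_quorum(66, 100)` is false while `has_quorum(67, 100)` is true, which is why every additional upgraded validator matters when the network is hovering near the threshold.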

We recommend that ALL node runners continue paying attention to these updates: the more nodes that upgrade to the new patch, the quicker we’ll unhalt the chain.

The Implications

  • The chain will not be resuming until the wake deadline after the above patch is released. Edit: this is now May 31st 6pm EST.
  • The long tail of node runners who do not participate in this hotfix will miss a block, which will result in jailing for any node that also missed 3 out of the past 9 blocks. However, after this block they will continue to be in consensus.
  • Node runners who update to the patch should revert to 0.6.3 afterwards, because it is confirmed that the patch will only affect the next block (26197) and 0.6.3 will remain in consensus moving forward.
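The jailing condition in the second bullet can be read as a simple count. The sketch below takes the wording literally: missing the upcoming block on top of 3 misses in the past 9 blocks leads to jailing, so the threshold of 4 misses is inferred from that sentence and is not taken from the protocol parameters. The function name and arguments are hypothetical.

```python
from typing import Sequence


def would_be_jailed(
    past_misses: Sequence[bool],
    misses_next_block: bool = True,
    jail_threshold: int = 4,  # assumption: inferred from "3 out of the past 9" plus the next block
) -> bool:
    """past_misses holds one flag per recent block, True meaning the block was missed."""
    missed = sum(1 for m in past_misses if m)
    if misses_next_block:
        missed += 1
    return missed >= jail_threshold
```

Under these assumptions, a node with 2 misses in the past 9 blocks survives missing the hotfix block, while a node with 3 misses does not.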

The Silver Linings

  • A blockchain is only as good as its node runners and our node runners have really stepped up this weekend, coordinating around the clock with our core devs to diagnose the halt, keep their nodes up, and cooperate across time zones to achieve consensus. It is seriously heartening to see the camaraderie that has been displayed during this challenging time and bodes well for the future resilience of our community.
  • Once the chain gets moving again, we have enough consensus on 0.6.3 to activate all of the 0.6.X features: UpdateStake functionality (which will be part of a new process for whitelisting new chains more rapidly), higher network stability, Protobuf encoding for easier SDK/client development, etc.

We have now released the patch mentioned above and (assuming everything goes smoothly) can expect the chain to resume around 6pm EST.

RC-0.6.3.2 hotfix

You can find the hotfix release here: Release RC-0.6.3.2 · pokt-network/pocket-core · GitHub

RC-0.6.3.2 works exactly as RC-0.6.3 except for the following patches:

  1. It will sleep the entire process (hardcoded) until May 31st 6pm EST.
  2. It will skip WAL file replay at height 26197, avoiding the long-winded round replay times and resource issues being experienced by validators in the network.
  3. It will skip to voting round 100.
  4. Once all the nodes running this version wake at the same time, there will be a fixed 20 minute window for the proposal block to be created and gossiped by the selected node, after which the rest of the validators will vote to get the block produced.
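The timing in patches 1 and 4 can be sketched as follows. This is only an illustration of the arithmetic, not the hotfix's actual code. Assumptions: "EST" is read literally as UTC-5 (if US Eastern daylight time was meant, the offset would be UTC-4), the year in the usage example is a placeholder because the announcement gives only the month and day, and all names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "EST" taken literally as a fixed UTC-5 offset.
EST = timezone(timedelta(hours=-5), "EST")


def sleep_duration(deadline: datetime, now: datetime) -> timedelta:
    """How long a node started at `now` sleeps before waking at `deadline`.

    A node started after the deadline does not sleep at all.
    """
    return max(deadline - now, timedelta(0))


def proposal_window_end(deadline: datetime, window_minutes: int = 20) -> datetime:
    """End of the fixed window in which the proposal block is created and gossiped."""
    return deadline + timedelta(minutes=window_minutes)
```

A usage example, with the year chosen purely for illustration:

```python
deadline = datetime(2021, 5, 31, 18, 0, tzinfo=EST)  # year is an assumption
started = datetime(2021, 5, 31, 14, 30, tzinfo=EST)
print("Sleeping for", sleep_duration(deadline, started))
print("Proposal window closes at", proposal_window_end(deadline))
```

Because every node computes its sleep against the same absolute deadline, nodes that start at different times still wake together.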

How to participate in the network recovery?

  1. Upgrade your nodes to RC-0.6.3.2
  2. Restart your nodes. You will see a log line that says: Sleeping for <duration>. If you are unsure whether you upgraded, run the pocket version command, which should return the Sleeping for <duration> output.
  3. Wait until the deadline expires and your nodes wake up.

You don’t need to wait until 6pm EST to upgrade and restart your node. You can upgrade and restart them now, then they’ll automatically wake at 6pm EST. By pre-deploying and starting the node as early as possible, you will ensure synchronicity once the 6pm deadline comes.

Can we keep RC-0.6.3.2 even after block 26197 is produced?

You can; however, there could be unexpected side effects from the hotfix. We have already tested and confirmed that RC-0.6.3 nodes continue to operate as usual, so we recommend reverting to RC-0.6.3 once the next block is produced. To revert, simply upgrade your node using 0.6.3 as the new version.

Do we need to upgrade non-validators?

No. Keep your non-validators as they are and do not restart them, as you might experience resource issues due to the WAL file replay mechanism, which will try to replay the thousands of votes that have been cast in the halted block so far.

Do we need to manipulate the validator settings or configs in any shape or form?

There’s no need to update anything in your setup.

The chain is now moving again!

Current state of the network

The network is currently experiencing a high number of validator updates, because we only barely had enough voting power to get the network through. This makes it hard to keep nodes up, because updating the nodes’ internal databases generates a very large amount of I/O.

How can we go back to normal?

Keeping your validators running, even through constant restart cycles, at least to the point where they can cast a vote will help us reach stability faster.

How can I troubleshoot my nodes?

  1. Shut down your node, delete the cs.wal file in your datadir, and restart your node.
  2. Increase your node resources if possible.
  3. Disable all log levels except for *:error in your config.json.
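For step 1, a small helper like the one below could locate and remove the WAL while the node is shut down. It is a sketch under assumptions: the exact location of cs.wal inside the datadir is not given here, so the helper searches for it by name, and it handles cs.wal being either a file or a directory. The function names are hypothetical, and the default is a dry run that only reports what would be removed.

```python
import shutil
from pathlib import Path
from typing import List


def find_wal_paths(datadir: str, name: str = "cs.wal") -> List[Path]:
    """Return every path called `name` under datadir (empty if datadir is missing)."""
    root = Path(datadir)
    if not root.is_dir():
        return []
    return sorted(root.rglob(name))


def delete_wal(datadir: str, dry_run: bool = True) -> List[Path]:
    """Remove the consensus WAL. Only run this while the node is shut down.

    With dry_run=True nothing is deleted; the matching paths are just returned.
    """
    targets = find_wal_paths(datadir)
    if not dry_run:
        for path in targets:
            if path.is_dir():
                shutil.rmtree(path)  # cs.wal stored as a directory
            else:
                path.unlink()  # cs.wal stored as a single file
    return targets
```

Running `delete_wal(datadir)` first and checking the returned paths before calling it again with `dry_run=False` reduces the chance of deleting the wrong thing.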

Since getting the chain moving again, we have validated 4 blocks.

The reason for this low number is the difficulty of attaining 67% consensus on pre-votes while the backlog of data gets pushed through and nodes struggle with resources.

What’s Next?

  • Now that we’ve passed 4 blocks, and because non-0.6.3 nodes will be missing all of these blocks, non-0.6.3 nodes will start to be jailed for missing blocks. This should help ensure 67% consensus.
  • After 10-20 blocks, all of the backlogged transactions should be flushed from P2P, and claims and proofs processed, which should make resource requirements lighter for nodes.

Action Items

  • If you haven’t already, upgrade your node to 0.6.3. If you don’t, you’ll be jailed, and until then you’re making it harder for us to achieve consensus. Upgrade guide is here: Release RC-0.6.3 · pokt-network/pocket-core · GitHub
  • Open the /v1 endpoints on your node’s Service URL, so that we are able to detect the version you are currently running, which helps bring more visibility to the network.
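To check that your own /v1 endpoint is reachable from outside, something like the following sketch may help. The function names and the example URL are hypothetical, and the shape of the response body (assumed here to be a short text containing the version) is an assumption rather than something stated in this update.

```python
from urllib.request import urlopen


def version_url(service_url: str) -> str:
    """Build the /v1 URL for a node's Service URL, tolerating a trailing slash."""
    return service_url.rstrip("/") + "/v1"


def fetch_version(service_url: str, timeout: float = 5.0) -> str:
    """Fetch whatever the /v1 endpoint returns, decoded as text.

    Raises an exception from urllib if the endpoint is closed or unreachable.
    """
    with urlopen(version_url(service_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()
```

For instance, `fetch_version("https://node.example.com:443")` would request `https://node.example.com:443/v1`; an error or timeout suggests the endpoint is not open to the public.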