Chain Halt Post-Mortem & QA Enhancements

JackALaing · July 9, 2021, 8:15pm

On June 30th 2021, the Pocket Network blockchain halted due to a number of factors that we’ll outline below. During the halt, service to applications continued thanks to fallback mechanisms we have built into the Dashboard. This post seeks to provide full clarity on what caused the halt, how we resolved it, and the measures we’re taking to minimize future risks.

What Happened

Initial Cause of the Halt

There was a bug in RC-0.6.3 through RC-0.6.3.2 that deleted a node’s signing info when an edit stake transaction was submitted while the node was out of jail. This both prevented the affected nodes from signing blocks and prevented non-signing nodes from being jailed.

This bug was not present in mainnet until the edit stake feature was activated when the consensus rule change of PIP-4 took place. It was not caught in testing due to a large jailing buffer in cleanroom testing environments and due to the lack of adoption in testnet (more on that later).

This bug was not present in Beta-0.6.4, causing RC-0.6.3 and Beta-0.6.4 to have incompatible states.

When too much voting power moved from RC-0.6.3 to Beta-0.6.4, as a result of larger-than-usual adoption of the beta due to rumors that it boosted node revenue, the split state escalated into consensus failure, because less than 67% of voting power agreed on any one state.
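To make the threshold concrete, here is a minimal sketch of the quorum arithmetic. It assumes the standard Tendermint rule that a block is committed only when strictly more than two-thirds of total voting power backs the same state. The function name, state labels, and power figures are hypothetical and are not mainnet data.

```python
from fractions import Fraction

# Tendermint-style consensus commits a block only when strictly more than
# two-thirds of the total voting power signs the same state.
QUORUM = Fraction(2, 3)


def committed_state(power_by_state):
    """Return the state backed by more than 2/3 of voting power, else None."""
    total = sum(power_by_state.values())
    for state, power in power_by_state.items():
        if Fraction(power, total) > QUORUM:
            return state
    return None  # no state reaches quorum, so no block can be committed


# Hypothetical voting-power splits (illustrative numbers only).
before = {"rc-0.6.3-state": 80, "beta-0.6.4-state": 20}
after = {"rc-0.6.3-state": 60, "beta-0.6.4-state": 40}

print(committed_state(before))  # rc-0.6.3-state
print(committed_state(after))   # None
```

In the second split neither side can commit a block on its own, which is the situation the paragraph above describes.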

Complications in Addressing the Halt

For security reasons, the proposer for the next block in Pocket Core is non-deterministic, unlike in vanilla Tendermint. This safety feature, which prevents a type of attack commonly called “grinding”, made it harder, under unfavorable P2P conditions, to get the nodes to agree at a moment that was crucial for the chain to restart. As a consequence, the implementation of a fix was delayed.

Silver Linings

The work done since the May 28th chain halt allowed our core development team to get volatile network environments set up fast, heavily test patches, and catch some unwanted features before they were live in production.

We also coordinated the response to the halt more effectively than last time, both in terms of clear communication and more efficiently working with key node runners.

The Fix

The final resolution of the halt was achieved with a combination of measures:

The deployment of a code change in 6.3.3 that avoids the loss of signature information
Ensuring deterministic proposer selection by removing lastCommitInfo
A servicing fix where the session generation algorithm was using the legacy selection mechanism
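As an illustration of what deterministic proposer selection means in general, the sketch below derives the proposer only from inputs that every node already shares. This is not Pocket Core's actual algorithm, which this post does not describe; the function name, the `(address, stake)` structure, and the choice of seed inputs are all assumptions made for the example.

```python
import hashlib


def select_proposer(validators, prev_block_hash, height):
    """Pick a stake-weighted proposer from inputs all nodes share.

    validators: list of (address, stake) pairs (hypothetical structure).
    """
    ordered = sorted(validators)  # canonical order, independent of local state
    total_stake = sum(stake for _, stake in ordered)
    if total_stake <= 0:
        raise ValueError("validator set is empty or has zero stake")
    # The seed depends only on committed chain data, so it is hard to
    # predict far in advance but identical on every honest node.
    seed = hashlib.sha256(prev_block_hash + height.to_bytes(8, "big")).digest()
    ticket = int.from_bytes(seed, "big") % total_stake
    for address, stake in ordered:
        if ticket < stake:
            return address
        ticket -= stake
    raise AssertionError("unreachable: ticket is always below total_stake")


# Two nodes holding the same validator set in different local orders.
validators = [("node-a", 50), ("node-b", 30), ("node-c", 20)]
block_hash = bytes.fromhex("ab" * 32)
node_1 = select_proposer(validators, block_hash, 100)
node_2 = select_proposer(list(reversed(validators)), block_hash, 100)
print(node_1 == node_2)  # True
```

Any input that can differ from node to node would break this property. The post names lastCommitInfo as the input that was removed, though it does not explain the mechanism.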

How We’re Embedding these Lessons into QA Enhancements

Leaning into QA Process and Tools

Since our QA process allowed us to react fast and avoid provoking further issues, we will be leaning into developing and perfecting the toolset around it.

Slowing Down Pocket Core 0.X Development

As we alluded to in the previous chain halt post, Pocket has begun the journey of transitioning from Tendermint towards a custom stack for Pocket 1.0. This implies rebuilding much of our architecture, which we are working on with a heavy focus on quality features. We’ll be publishing a vision/roadmap post soon with more details.

In line with this transition, we will be slowing down the pace of 0.X development while attention is focused on Pocket 1.0.

Now that we have the key features required to support and scale a multi-chain ecosystem, as demonstrated by the recent whitelisting of several new chains, we can slow down the progress of 0.X development and let the current feature set drive ecosystem growth.

We will still work on bug fixes for the latest release, and if there’s overwhelming demand for a new feature we will ship it, but the priority now is 1.0.

Building 1.0 with Quality in Mind

As we are building 1.0, we’re not only improving the tech: we’re establishing world-class operational capabilities to support growth, both in scale and in scope. In other words: we’re not only making a sturdy building, but a process to maintain it and safely grow it in the direction the community needs it to grow.

Test Networks

We have already reaped the benefits of the seeds sown in QA and will continue to enhance QA with features and processes that will serve both 0.x and 1.0.

One major area in which we’re enhancing QA, not just for our core developers but for the community as a whole, is a revamped suite of test networks that will more tightly integrate the community with our core developers’ QA processes. Node runner feedback is relatively ad-hoc today, so these changes will be the first step towards formalizing a more rigorous process that enables a tighter feedback loop.

Localnet

We are putting together a lightweight package that will enable anyone to spin up a small network locally. Using this, you’ll be able to run different versions of Pocket Core and test the behavior of different setups and protocol changes. This will help our core developers to evaluate their work more efficiently and help community members to get started contributing to the protocol. Getting localnets into the hands of our developers and community members will be the first step to ensuring we have a full community-centric feedback loop for testing new versions.

Devnet

Devnet will be a testnet-like environment where the bleeding edge software will be deployed. The main difference between devnet and testnet will be its volatility. This will be a deliberately transient network; if we have to re-genesis we will re-genesis. Here we’ll be able to test new version deployments, node migration from one version to another, and the interactions of these deployments; we’ll simulate any scenario to identify edge cases.

The main purpose of devnet will be to catch the most obvious bugs in new version deployments. As a standard practice, all new production versions must withstand devnet testing for a minimum period of time. More significant changes, such as consensus rule changes, will be required to run on devnet for longer to allow consensus edge cases to emerge.
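A release gate of this kind could be expressed as a simple check, sketched below. The change categories, the function name, and especially the durations are placeholders chosen for illustration; this post does not specify the actual minimum periods.

```python
from datetime import datetime, timedelta

# Placeholder soak periods: the actual durations are not specified here.
MIN_SOAK = {
    "standard": timedelta(days=7),
    "consensus_rule_change": timedelta(days=21),
}


def ready_for_production(change_type, deployed_to_devnet_at, now):
    """True once a release has run on devnet for its minimum soak period."""
    return now - deployed_to_devnet_at >= MIN_SOAK[change_type]


deployed = datetime(2021, 7, 1)
check_time = datetime(2021, 7, 10)  # nine days after the devnet deployment
print(ready_for_production("standard", deployed, check_time))               # True
print(ready_for_production("consensus_rule_change", deployed, check_time))  # False
```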

Testnet

Finally, and perhaps most importantly, we’ll be re-prioritizing testnet as a bona fide production environment.

We had de-prioritized it due to low adoption from node runners, but now that we have the tooling in place to enable incentive bootstrapping for Settlers of New Chains, we can use the same tooling to enable continuous incentives for the Pocket Network testnet chain. When we are ready to deploy and scale the testnet, we will submit a proposal to the DAO to add the testnet RelayChainID (0002) to the Settlers incentive program in perpetuity. Node runners using testnet will just have to point a mainnet node to their testnet node in order to receive compensation for their support of testnet. We believe the inflation would be justified since running testnet nodes is just as much a public good for the ecosystem as providing mainnet service is.

The main purpose of testnet will be to serve as a low-stakes production-like sandbox that developers and node runners can use to learn the ropes. Developers of community tools, such as Node Pilot, will be able to stage changes in testnet before deploying to mainnet. Rookie node runners will be provided with a baseline of traffic to their RelayChain nodes to test their RelayChain setups before deploying to mainnet.

This will all serve to establish an environment in which the community can foster the same world-class QA that our core developers are applying internally.

We’d like to once again thank the node runners and community members who showed true grit in getting through this situation, and we encourage you to share your thoughts on how we can continue to improve together as a network and as a community.