At around 3am UTC on May 28th, 2021, the Pocket Network blockchain halted due to a transaction indexing conflict between 0.6.2 nodes and prior versions. During the halt, service to applications continued thanks to fallback mechanisms we have built into the Dashboard. This post explains what caused the halt, how we resolved it, and the measures we’re taking to minimize the risk of future halts.
The original transaction indexer, based on Cosmos events, was broken. In 0.6.1 and prior, transactions were often mis-indexed, so they could not always be queried correctly via CLI or RPC. On top of that, every transaction was indexed, even invalid ones, as long as it had the right number of bytes. This constituted a potential attack vector: an attacker could have filled every node’s indexer with garbage data, bloating disks and potentially causing mass node failures.
To address this, we released 0.6.2 with a brand new transaction indexer. We wrote the unit tests and they all passed. As far as we were concerned, this was a backwards compatible change (not a consensus rule change) and did not need to be protected within the governance upgrade process, meaning the change could be active as soon as nodes started using the new version.
The issue came from the fact that the transaction indexer also plays an important role in consensus: replay protection. Because every transaction submitted to the blockchain would have been indexed, a node can prevent a transaction from being replayed by checking it against the transaction indexer when it is submitted. However, an edge case occurred on block 27196, where two invalid transactions were resubmitted multiple times, non-maliciously, potentially as a result of nodes’ mempools repopulating between restarts.
Since nodes on 0.6.1 and earlier had incorrectly indexed the invalid transactions, they treated the resubmissions as a replay attack (code 6 invalid transaction). Nodes on 0.6.2 and later hadn’t indexed the transactions, so they correctly reported an invalid transaction without enough funds (code 4 invalid transaction). Transaction codes matter for consensus because different codes produce different block hashes, so the two groups of nodes computed different states, ultimately halting the chain.
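The mismatch can be illustrated with a minimal sketch. This is not the actual Pocket Core code: the function names, the placeholder transaction IDs, and the way the codes are hashed are all assumptions made for illustration. Only the two result codes (4 and 6) come from the incident itself.

```python
import hashlib

# Result codes as described in this post; everything else is illustrative.
CODE_INSUFFICIENT_FUNDS = 4
CODE_REPLAY = 6


def result_code(tx_id: str, indexed: set) -> int:
    """Return the code a node assigns to a resubmitted invalid transaction."""
    if tx_id in indexed:
        # The node has seen this transaction before, so it reports a replay.
        return CODE_REPLAY
    # Otherwise the transaction is evaluated normally and fails on funds.
    return CODE_INSUFFICIENT_FUNDS


def results_hash(codes: list) -> str:
    """Hash the ordered result codes (a stand-in for the real hashing scheme)."""
    payload = ",".join(str(code) for code in codes).encode()
    return hashlib.sha256(payload).hexdigest()


replayed = ["tx_a", "tx_b"]    # the two invalid transactions (placeholder IDs)
old_index = {"tx_a", "tx_b"}   # 0.6.1 and earlier indexed them despite being invalid
new_index = set()              # 0.6.2 and later never indexed them

old_codes = [result_code(tx, old_index) for tx in replayed]
new_codes = [result_code(tx, new_index) for tx in replayed]

# Different codes feed into different hashes, so the two groups cannot agree.
print(results_hash(old_codes) == results_hash(new_codes))
```

Both groups agree that the transactions are invalid. They disagree only on why, and that alone is enough to split the hash they vote on.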
The code mismatch error was recognized almost immediately, but proving it took what amounted to an 8 hour sprint. Once we had proved it and ruled out any other causes, we decided the optimal solution would be to have all nodes on 0.6.1 and earlier upgrade to 0.6.3 in order to achieve 67% consensus on the transaction codes.
Coordinating with the largest node runners to upgrade them to 0.6.3 would go on to take another 12-24 hours. As a result, by the time we had a 67% majority on 0.6.3, the consensus voting round count was all the way up to 71. In a chain halt resulting from node downtime, the round count will not increase. However, in a chain halt resulting from divided consensus, the rounds keep climbing. If a 0.6.1 node proposed the block, 0.6.3 nodes would say it’s invalid, and vice versa, round after round.
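A simplified sketch of why the rounds kept failing follows. It assumes that a node accepts a block only when the proposer runs the same indexer behaviour, and that a block commits only with strictly more than two thirds of voting power, which is the Tendermint threshold. The function names and voting power shares are hypothetical.

```python
from fractions import Fraction


def has_quorum(power_for: int, total_power: int) -> bool:
    """True when strictly more than 2/3 of voting power agrees."""
    return Fraction(power_for, total_power) > Fraction(2, 3)


def round_commits(proposer_version: str, power_by_version: dict) -> bool:
    """A block commits only if the nodes that accept it hold a quorum.

    Simplifying assumption: only nodes on the proposer's side of the
    indexer change consider the proposed block valid.
    """
    total = sum(power_by_version.values())
    return has_quorum(power_by_version[proposer_version], total)


split = {"old": 50, "new": 50}      # hypothetical shares during the halt
upgraded = {"old": 33, "new": 67}   # hypothetical shares after the upgrade push

# While the network is split, no proposer can gather a quorum.
print(round_commits("old", split), round_commits("new", split))

# Once enough power has upgraded, rounds with an upgraded proposer can commit.
print(round_commits("new", upgraded), round_commits("old", upgraded))
```

Under these assumptions, every round in the split network fails regardless of who proposes, which is why the round count kept climbing.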
Now, although we had achieved the 67% adoption of 0.6.3 that we would need to get the blocks moving, we couldn’t keep enough of these nodes alive to actually achieve consensus in practice. When a node votes on a round, it needs to replay all of the votes from previous rounds, all the way back to round zero. This is unfortunately a consequence of a property of the Tendermint consensus algorithm (a PBFT-style algorithm) that some other blockchains do not share: it provides immediate consistency rather than eventual consistency. When a node replays these votes, it has to hold them in memory and write them to a .wal file (write-ahead log). For this reason, the higher the round, the more resource intensive it is to vote on a block. By round 71 it was a herculean effort to keep the 0.6.3 nodes alive because of the sheer size of the .wal file. The chain was actively trying to self-heal, but too many nodes would time out and die, and each restart forced a node to begin catching up from scratch all over again.
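To give a rough sense of scale, the sketch below estimates an upper bound on the number of votes a node might have to replay at a given round. It assumes every validator casts both a prevote and a precommit in every earlier round, which overstates the real figure since offline nodes cast no votes. The validator count is the approximate figure quoted later in this post, and the function name is hypothetical.

```python
def votes_to_replay(validators: int, current_round: int, vote_types: int = 2) -> int:
    """Upper bound on votes from rounds 0..current_round-1 at a single height.

    Assumes each validator casts one vote of each type (prevote and
    precommit) in every earlier round.
    """
    return validators * vote_types * current_round


APPROX_VALIDATORS = 5500  # approximate figure, not an exact count

for round_number in (1, 10, 71):
    print(round_number, votes_to_replay(APPROX_VALIDATORS, round_number))
```

The bound grows linearly with the round number, so under these assumptions a node at round 71 faces roughly 71 times the replay work of a node at round 1.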
Over this 48h period, our largest node runners had really stepped up to help us diagnose the issue and achieve consensus again. We therefore all collectively decided upon the optimal solution to get the chain moving again. The key requirements were:
- Skip the voting rounds to eliminate resource requirements
- Minimize moving parts to minimize risk of failure
We coded up and released 0.6.3.2, a hotfix that would do the following:
- Make nodes ignore voting information from the internal db for rounds less than 100 (only on height 27196).
- Sleep all nodes until 6pm EDT on Monday, May 31st, thus allowing nodes to jump to round 100 at the same time and achieve consensus in 1 block’s time (~15 minutes).
Our largest node runners had from around 9am EDT on May 31st to update to this hotfixed version.
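The two behaviours of the hotfix can be sketched as follows. This is an illustration only, not the code shipped in 0.6.3.2: the function and constant names are hypothetical, and EDT is modelled as a fixed UTC-4 offset.

```python
from datetime import datetime, timedelta, timezone

HALT_HEIGHT = 27196     # the only height the hotfix applied to
RESUME_ROUND = 100      # stored votes below this round are ignored
EDT = timezone(timedelta(hours=-4))
RESUME_AT = datetime(2021, 5, 31, 18, 0, tzinfo=EDT)  # Monday 6pm EDT


def ignore_stored_vote(height: int, vote_round: int) -> bool:
    """True if a vote loaded from the node's internal db should be skipped."""
    return height == HALT_HEIGHT and vote_round < RESUME_ROUND


def seconds_until_resume(now: datetime) -> float:
    """How long a node should sleep before rejoining consensus."""
    return max(0.0, (RESUME_AT - now).total_seconds())


# A node that updated at 9am EDT would sleep for nine hours.
updated_at = datetime(2021, 5, 31, 9, 0, tzinfo=EDT)
print(seconds_until_resume(updated_at) / 3600)
```

Keying the check to a single height keeps the change narrow: once the chain moves past block 27196, the condition can never be true again.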
As we approached 6pm EDT, we all gathered in a voice channel to witness the chain coming back to life. The Discord voice channel was home to at least 30 core devs and node runners, most running on just a few hours of sleep. As the first nodes came to life, calm voices reported signs of life in a manner that reminded us of mission control. The less technical among us made themselves useful by sharing NASA gifs to ease the tension.
And, just like that, the chain was moving again.
Our community is our strongest asset
Despite the timing of this crisis happening over the Memorial Day long weekend, our node runners really stepped up to help us diagnose the issue and coordinate a resolution. Many pulled all-nighters alongside our core devs, an act of camaraderie that we do not take lightly. Now more than ever we understand that it is our community that will ultimately make Pocket the most resilient Web3 infrastructure in the world.
Improvements in crisis-response communications
While our updates were high-quality once they got started, we were far too late getting started, and we’ll own up to this.
Not everyone is in our node running channels, and we should have been more transparent externally so that no one was left wondering why the chain had halted. In addition to focusing too much on our inner node runner communities, we tied our updates to milestones rather than to the passage of time. We felt that communication without an action plan can be discouraging, so we spent all of our focus and attention on working with node runners to bring the crisis to a prompt resolution. However, as explained in Resolution Difficulties above, the finish line got pushed back and comms along with it.
In hindsight, we should have published more intermediate updates explaining what we believed to be the cause and how we were working with key node runners to test and implement solutions. We will be sure to do this in the future.
Improvements in our development and quality assurance lifecycles
We are currently overhauling our software development lifecycle and quality assurance cycles. When issues like these are caught by neither our Unit Test suite nor our Functional Test suite, it points to a more fundamental problem in our development pipeline. This is something we had been working to resolve even before this crisis.
We are working on a more inclusive software development process that will enable the community to discuss specs, releases and documentation in the earlier stages of the development cycle, allowing us to capture feedback and provide evidence that will raise the level of confidence in Pocket Network software from the first stages of development. We’re looking forward to the continued support of our community as we go through this phase shift in how we work.
Reliance on Tendermint
Initially, Pocket Network set out to solve a very specific problem: create the best decentralized infrastructure network for Web 3.0. We chose Tendermint because we wanted to focus on solving that problem and not have to build an entire peer-to-peer, storage and consensus layer from scratch. However, every syncing issue, every database corruption, and every case of transactions not being properly propagated is an issue in the Tendermint base layer we chose to build Pocket Network on top of. None of this code is covered by our QA suites, because it has its own QA and audit process owned by the Tendermint core developers.
With every new Pocket Core release, we have pushed the limits of what is possible with Tendermint. Among the Cosmos SDK-derived chains, Cosmos has 125 validators, Polygon has 100 validators, BSC has 21 validators, while Pocket Network has ~5,500 validators. This is in large part due to differing use cases: the primary purpose of the other chains’ validators is to validate transactions efficiently, while validation is secondary to Pocket’s core purpose of servicing API requests, which requires more nodes. However, as demonstrated by the various bugs we have had to release patches for, it is clear that we’re pushing beyond the limits of what Tendermint can handle for Pocket’s use case. The custom transaction indexer is one example of the work the Pocket Core team is doing to decouple our development from Tendermint as a base layer. In order to grow into the use-case specific utility network we set out to become, we plan to continue this process at all layers of the blockchain stack. Tendermint and the Cosmos team have done great work, laying the foundation that enabled us to build an MVP more quickly, but the time has come for us to begin transitioning.
Should everyone be running on 0.6.3.2?
No. This was a hotfix designed to coordinate the next block and get the chain moving again, as described above. We advise all nodes to downgrade back to 0.6.3 because this is the more stable version.
Did we lose any transactions during the halt?
No. While the chain was halted, all transactions were being gossiped between nodes and held in mempools, so they were eventually included in a block.