PUP-8: Increase MaxValidators from 5,000 to 50,000

JackALaing · June 11, 2021, 4:36am

Attributes

Author(s): @JackALaing @luyzdeleon @andrew @varoten
Parameter: MaxValidators
Current Value: 5,000
New Value: 50,000

Summary

Update the MaxValidators parameter from 5,000 to 50,000 to avoid a bug that is causing Tendermint to treat nodes differently if they are outside of the MaxValidators limit. We recently crossed 5,000 unjailed nodes, which is why the bug has now become apparent. Increasing the value of the parameter will postpone the effects of the bug until we release our next protocol upgrade (0.7). It will also have the added benefit of enabling the long-tail of nodes to participate in proposing blocks and ensuring all service nodes can continue to use jailing as a graceful method of removing themselves from service.

Motivation

We discovered an error in the logic of Pocket Core, wherein it tried to re-jail an already jailed validator:

cannot jail already jailed validator

After some investigation, we found that this is caused by a bug which results in Tendermint and Pocket Core treating nodes differently if they fall outside the MaxValidators set, which is currently set to 5,000. Nodes whose stake is outside the top 5,000 largest stakes are jailed in the eyes of Pocket Core, whereas Tendermint thinks the node is still active in the validator set and therefore obligated to cast a vote on a block. However, when that “blank vote” comes through Pocket Core, Pocket Core still thinks it is a jailed node and puts the node through the normal lifecycle of punishing a validator that didn’t cast a vote in time, triggering a single slash (0.00001%) every 10 blocks even after the validator has been jailed.
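To make the mismatch concrete, here is a toy model in Python. It is not Pocket Core or Tendermint code: the `Node` type, the `affected_nodes` helper, and the stake figures are all hypothetical, and the sketch only illustrates ranking unjailed nodes by stake against a MaxValidators cutoff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    # Hypothetical, simplified view of a staked node.
    address: str
    stake: int
    jailed: bool = False


def affected_nodes(nodes, max_validators):
    """Return unjailed nodes ranked below the MaxValidators cutoff.

    In this toy model these are the nodes the two layers disagree about:
    treated as jailed on one side, still expected to vote on the other.
    """
    unjailed = [n for n in nodes if not n.jailed]
    ranked = sorted(unjailed, key=lambda n: n.stake, reverse=True)
    return ranked[max_validators:]


# Eight illustrative nodes with made-up stakes and a cutoff of 5.
nodes = [Node(f"node{i}", stake=15_000 + i) for i in range(8)]
outside = affected_nodes(nodes, max_validators=5)
print([n.address for n in outside])
```

With a cutoff of 5, the three lowest-staked nodes fall outside the set. With any cutoff at or above the number of unjailed nodes, the returned list is empty.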

Outside of this bug, there are other reasons to increase the parameter. We no longer have the scalability limitations that were referenced originally in PUP-4. In fact, thanks to @BenVan’s peering enhancement efforts, we are finding that the resources of service nodes beyond 5k are equivalent to the resources of validators within 5k, and block times are remaining healthy. Scalability no longer being a concern, increasing the parameter would provide the following benefits:

Providing more nodes, particularly the long-tail of node runners, with the opportunity to participate in block validation and earn proposer block rewards.
Maintaining jailing as a graceful exit option for all nodes, thus optimizing quality of service (when separation of validation/servicing is in effect, jailing is not an option for nodes outside of the MaxValidators limit, which means they have no graceful way to remove themselves from the service cycle).

Rationale

We ran two different tests on an internal test network and successfully replicated mainnet’s behaviour, including the issue described above. These tests confirmed to us that raising the max_validators param to a higher number will make the problem self-heal by the same logic we use to coordinate the current set of 5,000 nodes. We’re recommending 50,000 as the new value to postpone the effects of the bug until we release 0.7 (the next protocol upgrade).
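As a rough illustration of why a larger limit postpones the problem, the arithmetic can be sketched as below. The `nodes_outside_limit` helper and the node count of 5,200 are assumptions for illustration only; the proposal states just that the network recently crossed 5,000 unjailed nodes.

```python
def nodes_outside_limit(unjailed_nodes: int, max_validators: int) -> int:
    """Count unjailed nodes that do not fit inside the validator set."""
    return max(0, unjailed_nodes - max_validators)


UNJAILED = 5_200  # hypothetical count, slightly above the old limit

for limit in (5_000, 50_000):
    outside = nodes_outside_limit(UNJAILED, limit)
    headroom = max(0, limit - UNJAILED)
    print(f"MaxValidators={limit}: outside={outside}, headroom={headroom}")
```

Under these assumed numbers, the old limit leaves 200 nodes outside the set, while the new limit leaves none and keeps room for growth before the limit is reached again.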

Dissenting Opinions

Can’t we ask altruistic node runners to bring us down below 5k?

This is a solution if nodes within the top 5k are the ones doing the jailing/unstaking and if no one else subsequently spins up nodes. However, it would be infeasible to convince the collective node runner community not to spin more nodes up until the bug is patched, especially since they’d be able to avoid slashing penalties for themselves if they stake within the 5k. The more reliable solution is to change the parameter, especially given we no longer have scalability concerns, as explained above.

Analyst(s)

Pocket Network Inc’s blockchain devs - Luis, Andrew & Otto

Copyright

Copyright and related rights waived via CC0.

JackALaing · June 11, 2021, 4:43am

This being a minor parameter change, which eliminates the effects of an active bug, and has no downsides, it seems to me there’s no reason to wait on voting.

Here’s the link to vote: Snapshot

arnaud-skillz · June 11, 2021, 7:31am

@JackALaing @luyzdeleon

Agreed on the bug fix, which is urgent.
But how will the Tendermint P2P peering and consensus engines/components hold up with more than 5k validators on the network? (5k being their official limit.)

Won’t it create network instability? Has it been tested?
That was the initial concern of PUP-4, right?

Thanks,
Arnaud

JackALaing · June 11, 2021, 7:33am

I touched on this in the proposal. We’re actually finding the scalability issues that we were previously concerned about to be much less pressing, thanks to better peering practices across the board.

arnaud-skillz · June 11, 2021, 7:40am

Alright, noted, thanks @BenVan.

Please note that we are still experiencing significant issues on our side at SkillZ (jailing and unsync).
Shared with @o_rourke and @luyzdeleon yesterday during a call.

We’ll be monitoring the effect of this upgrade.

We’ll vote asap.

tdephuoc · June 11, 2021, 8:02am

It seems that @BenVan’s solution (and SkillZ’s implementation of peering enhancement) leads to increased centralisation: a large set of nodes rely on a smaller subset of key nodes. How is this evaluated against the goal of having a network of independent full nodes?

Happy to have that conversation somewhere else on the forum if this thread is focused on getting the PUP voted on ASAP (which we’ll do, as Arnaud mentioned).

o_rourke · June 11, 2021, 11:07am

Transaction submitted and successful. Thanks everyone.

{
"logs": null,
"txhash": "267770F1249DE7723B2249C9F290B3DEDD48F03B052AB61A9D3C5DC7CD840226"
}

You can check the new value with a pocket query params call.

JackALaing · June 11, 2021, 8:23pm

This is a worthwhile talking point. In general, my view is that decentralization is a balancing act while we scale and mature the protocol. I don’t view off-chain peering practices as problematic, because there’s nothing stopping people from changing their peers; it may introduce more technical centralization but not political centralization. But yes, let’s continue the conversation elsewhere in the forum if you feel inclined, and we can get @BenVan’s take too.

JackALaing · July 3, 2021, 1:50am