Block Sizes, Claims and Proofs in the Multi-Gateway Era

RawthiL · February 21, 2024, 9:05pm

Edits:

22/02/2024 : Corrections and added more data splitting the sample by gateway.

The intention of this thread is to share data on the issue raised by @BenVan on Discord.

I will try to follow @Olshansky’s proposed format.

What’s the problem?

Block size is increasing, currently at ~7.5 MB out of a maximum of 8 MB.

Things node runners could / should do

Node runners can activate the pocket_config.prevent_negative_reward_claim parameter available in the latest version of Pokt. However, this won’t fix the issue: only about 0.2% of claims meet the prevent_negative_reward_claim condition.
I believe that no other action can be taken by node runners without working against their economic incentives.
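As a rough illustration of the condition discussed above, the sketch below flags a claim as “negative reward” when the expected mint is lower than the fees paid to settle it. The function name, the per-relay reward and the fee values are hypothetical placeholders, not the actual pocket-core logic or current on-chain parameters.

```python
def is_negative_reward_claim(relays: int,
                             reward_per_relay_upokt: float,
                             claim_fee_upokt: int,
                             proof_fee_upokt: int) -> bool:
    """Return True when settling the claim would cost more than it mints."""
    expected_mint = relays * reward_per_relay_upokt
    total_fees = claim_fee_upokt + proof_fee_upokt
    return expected_mint < total_fees


# Illustrative values only (assumptions, not network parameters).
REWARD = 200.0      # uPOKT minted per relay
CLAIM_FEE = 10_000  # uPOKT paid for the claim transaction
PROOF_FEE = 10_000  # uPOKT paid for the proof transaction

# 50 relays would mint 10,000 uPOKT against 20,000 uPOKT in fees.
print(is_negative_reward_claim(50, REWARD, CLAIM_FEE, PROOF_FEE))
# 1,000 relays would mint 200,000 uPOKT against 20,000 uPOKT in fees.
print(is_negative_reward_claim(1_000, REWARD, CLAIM_FEE, PROOF_FEE))
```

A node that skips such claims saves the fees, but as noted above only a small share of claims falls into this category.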

Things gateways could / should do

Re-think their relay load balancing strategies.
I don’t know much about how and what they do, but there is a clear change in the distribution of relays. The total number of claims/proofs has increased 2x since 2023-10 (before the second gateway).
We have no information to tell whether the addition of the new gateway is the source of this or whether it is rooted in the ever-changing Quality of Service (QoS) and routing strategies of gateways.

List of parameters we could change on chain

IMPORTANT NOTE: This is an exhaustive list, NOT A RECOMMENDATION. It is here just for visibility.

pos/BlocksPerSession : Increasing the blocks per session reduces the number of claims/proofs that are the bulk of the block size.
pocketcore/BlockByteSize : Making the block bigger to get more space.
pocketcore/SessionNodeCount : Reducing the nodes in a session will reduce the number of nodes posting claim / proofs.
pocketcore/MinimumNumberOfProofs: Increasing the minimum number of proofs to submit a claim will reduce the posted claim / proofs.
auth/FeeMultipliers: Setting a higher fee for claims will disincentivize the nodes from posting claims for fewer/cheaper relays (they would lose money).
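To make the direction of these levers concrete, here is a toy model of how many claim/proof transactions land in each block. It assumes every claiming node posts exactly one claim and one proof per session and that sessions are spread evenly over blocks; the function name and the input values are hypothetical, not measured network values.

```python
def claim_proof_txs_per_block(active_sessions: int,
                              nodes_per_session: int,
                              blocks_per_session: int,
                              claiming_fraction: float = 1.0) -> float:
    """Toy estimate of claim + proof transactions per block."""
    # Each claiming node posts one claim and one proof per session.
    txs_per_session = 2 * nodes_per_session * claiming_fraction
    # Sessions last `blocks_per_session` blocks, so the load is spread over them.
    return active_sessions * txs_per_session / blocks_per_session


# Hypothetical baseline: 1,000 sessions, 24 nodes each, 4 blocks per session.
baseline = claim_proof_txs_per_block(1_000, 24, 4)
longer_sessions = claim_proof_txs_per_block(1_000, 24, 8)   # pos/BlocksPerSession doubled
fewer_nodes = claim_proof_txs_per_block(1_000, 12, 4)       # pocketcore/SessionNodeCount halved
fewer_claims = claim_proof_txs_per_block(1_000, 24, 4, 0.9) # 10% of claims filtered out

print(baseline, longer_sessions, fewer_nodes, fewer_claims)
```

In this simplified model, doubling the blocks per session or halving the nodes per session both halve the transaction count, while a higher minimum number of proofs or higher fees act through `claiming_fraction`.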

Collected data

Block size evolution since height 8000

The blue bars represent the block state size, the size can be read using the left Y axis.
The green line represents the number of TXs that are caused by claims and proofs. The orange line represents the remainder of the TXs in the network. The number of TXs for both lines can be read using the right Y axis.
It’s important to note that the second gateway was introduced in November 2023 and this correlates with an increase of the block size from ~4.5 MB to ~6 MB. Once more, correlation does not imply causation.
In the last month, the block size has jumped again, reaching a critical value of ~7.5 MB.
The number of transactions that are not related to relays (e.g. wPOKT) does not seem to have grown in a meaningful way, so this is probably not the source of the problem.

Distribution of the number of relays per claim

This figure shows how the number of relays per claim is distributed. A higher “Density” value (Y axis) means that more claims (in the sample) have the amount of relays indicated by the X axis. The colors indicate whether the sample corresponds to the multigateway network or not.
We call “multigateway” a network with more than one gateway. Specifically:

multigateway=False: A period of 500 blocks in October 2023, specifically from block 111000 to 111500. In this period only Grove was online.
multigateway=True: A period of 500 blocks in February 2024, specifically from block 122687 to 123187. In this period Grove and Nodies were online.

We can see that the distribution shifted notably between these two samples. Before (blue distribution) there was a larger number of claims that packed from 5000 to 10000 relays each; now (orange distribution) almost all claims pack fewer than 5000 relays, and most of them pack only ~1000 relays.

The same information can be seen by means of a cumulative distribution plot:

We can see that in the current state of the network most claims are filled with fewer than 2500 relays. The orange trace shows that more than 80% of the claims have 2500 relays or fewer, while before (blue trace) only 50% of the claims carried this amount.
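For readers who want to reproduce this kind of reading from their own data, the sketch below computes one point of an empirical cumulative distribution, plus the count of very small claims. The sample values are synthetic and the helper names are hypothetical; the real analysis used the claims from the two 500-block windows.

```python
from bisect import bisect_left, bisect_right


def fraction_at_or_below(relays_per_claim, limit):
    """Share of claims whose relay count is <= limit (one point of the CDF)."""
    ordered = sorted(relays_per_claim)
    return bisect_right(ordered, limit) / len(ordered)


def count_below(relays_per_claim, limit):
    """Number of claims whose relay count is strictly below limit."""
    ordered = sorted(relays_per_claim)
    return bisect_left(ordered, limit)


# Synthetic relays-per-claim sample (illustrative only).
sample = [40, 800, 950, 1_100, 1_900, 2_400, 2_500, 3_200, 6_000, 9_500]

print(fraction_at_or_below(sample, 2_500))  # 7 of the 10 claims -> 0.7
print(count_below(sample, 100))             # only the 40-relay claim -> 1
```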

It’s important to remember here that the reduction of the free tier occurred in this same period; in November 2023 the daily average was ~1 B relays. This can partially explain the change in the number of relays per claim. What it cannot explain is the tails near zero in the cumulative density plot. We can look closely at the claims that had fewer than 100 relays and count them:

When we compare these tails, which highlight the number of low-relay claims, we see that the difference between the samples is remarkable. This is probably due to a change in the way that gateways work, regardless of the amount of relays.

Number of Claims and Thresholds

Here we analyze the number of claims in both samples of 500 blocks (before and after multigateway). In the following table we show the number of claims that resulted in minting less than the total fees (we call them “backfire” since the node runner lost money), the number of correct claims (that resulted in revenue) and the proportion of these two groups.

multigateway	Backfire Claims	Correct Claims	Proportion
False	0	738827	0.000 %
True	3307	1549049	0.213 %
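The figures in this table can be re-derived with a few lines of arithmetic. The sketch below assumes that “Proportion” means backfire claims as a share of all claims in the sample (dividing by correct claims only rounds to the same value); the variable and function names are just for illustration.

```python
# Claim counts copied from the table above.
samples = {
    False: {"backfire": 0, "correct": 738_827},
    True: {"backfire": 3_307, "correct": 1_549_049},
}


def total_claims(row):
    return row["backfire"] + row["correct"]


def backfire_share_pct(row):
    """Backfire claims as a percentage of all claims in the sample."""
    return 100 * row["backfire"] / total_claims(row)


# How much the total number of claims grew between the two samples.
growth = total_claims(samples[True]) / total_claims(samples[False])

print(round(backfire_share_pct(samples[False]), 3))  # 0.0
print(round(backfire_share_pct(samples[True]), 3))   # 0.213
print(round(growth, 2))                              # 2.1
```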

We can see that the “backfire” began after the introduction of the new gateway, probably caused by the number of claims that are now packed with very few relays. It’s also important to note that the total number of claims roughly doubled (just like the block size, causality here) even though the total number of relays in the network was sharply reduced (almost half today compared to October 2023). This could be partially explained by the extra apps that the new gateway is using.
From this table we can tell that activating pocket_config.prevent_negative_reward_claim won’t solve the problem.
If we were to set pocketcore/MinimumNumberOfProofs higher, it would have to be much higher than expected to reduce the number of claims in a meaningful way. This can be seen in the following table:

Percentile	Threshold value (number of relays)
1 %	101
10 %	513
25 %	633
50 %	879
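A minimal sketch of how such thresholds can be derived from a sample of relays-per-claim values, using the nearest-rank percentile definition. The sample is synthetic and the helper name is hypothetical; the original analysis may have used a different percentile method.

```python
import math


def percentile_threshold(relays_per_claim, pct):
    """Nearest-rank percentile: smallest value with at least pct% of claims at or below it."""
    ordered = sorted(relays_per_claim)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic sample: 20 claims carrying 50, 100, ..., 1000 relays.
values = list(range(50, 1_050, 50))

for pct in (1, 10, 25, 50):
    print(pct, percentile_threshold(values, pct))
```

Setting pocketcore/MinimumNumberOfProofs near the value returned for a given percentile would cut roughly that share of claims.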

For clarity we calculated the total proportion of apps that produced claims in each of these thresholds:

It can be seen that after the second threshold, corresponding to the 10th percentile, almost all apps are included. In other words, ~99% of all apps produce 90% of all relays. This indicates that the low-relay claims are not the fault of a given set of apps.

By Gateway Data

Using the thresholds/percentiles presented in the last section, we proceed to create a view of the number of claims corresponding to each gateway.

In this image we can see that the claims in the lower threshold values (from 1% to 10% percentiles) originate mostly in the Nodies gateway.

Let’s also look at the proportion for each threshold; this represents how many claims we would cut from each gateway if we applied the given threshold.

How Shannon solves this

Not sure, ~~Probabilistic Proofs~~ Relay Mining only solves for a high number of relays, not a low one.
We should set some higher threshold and compensate by distributing rewards using the total session relays (a hybrid approach between salary distribution and claiming).
I need more time to think…

BenVan · February 21, 2024, 9:33pm

Greatly appreciate this thoughtful, thorough, and rapid information dump. Additional thought is indeed required.

Thank you very much.

Olshansky · February 22, 2024, 6:03am

@RawthiL Firstly, ty so much for the detailed response and so promptly!

Next Steps

Below are what I’d suggest as next steps based on the data and discussion available…

Actionable now:

Gateways: Gateways should re-evaluate how their gigastakes are distributed and used. See the Gateway Q&A.
DAO: We should re-evaluate what an effective pocketcore/MinimumNumberOfProofs would be: one that helps the network but does not have a considerable impact on the economics (i.e. these will become un-claimable sessions).

Education:

PNF Education: PNF should take this into consideration as more gateway stakes are distributed (to whom, how much, educating the gateway, etc…). Cc @Dermot

Backup bandaids:

These are last resort backups in case of emergency in order of least impact on network economics/tokenomics.

Increase pocketcore/BlockByteSize
- Though LeanPOKT makes this less costly, it should not be the immediate go-to option for scalability reasons.
Decrease pocketcore/SessionNodeCount
- Though this will work, it will cost some decentralization and optionality w.r.t gateways managing QoS.
Increase pos/BlocksPerSession
- Though this will work, given the long 15-minute block time we already have, slowing things down even more is not preferable.

Follow-up Questions for @BenVan

@BenVan I have a couple of questions I wanted to follow up on.

1. You mentioned you changed something in your nodes to make sure the network has less bloat. Do you mind sharing & reiterating what that is?
2. @BenVan what are the current SSD storage requirements for a single pocket node for you? It could be a good data point to decide how viable the block size option is.
3. You suggested increasing pocketcore/MinimumNumberOfProofs to something in the low 10s.
3.1 Am I remembering correctly?
3.2 My understanding from the code is that NumOfProofs is literally the number of relays, so like @RawthiL said, it has to be in the hundreds to have substantial impact.

Gateway Q&A

Specific data we need to understand w.r.t gateways are:

How many app stakes are available to each gateway?
How many of the available stakes are used?
How are they used w.r.t. stake concentration?
Is there an opportunity to consolidate stake?

I’ll work on this with the backend team for Grove.

@poktblade Could you look into this on behalf of nodies?

Shannon

Relay Mining (with the relay threshold) will enable a single session to hold billions/trillions of relays by modulating the difficulty.
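The idea of modulating difficulty can be sketched as follows: only relays whose hash falls below a target are kept, and the total volume is estimated by scaling the kept count by the inverse of the selection probability. This is not Shannon’s implementation; the hash function, the relay encoding and the way difficulty is expressed (leading zero bits) are all assumptions made for illustration.

```python
import hashlib


def is_mined(relay_bytes: bytes, difficulty_bits: int) -> bool:
    """Keep a relay only if its SHA-256 digest starts with `difficulty_bits` zero bits."""
    value = int.from_bytes(hashlib.sha256(relay_bytes).digest(), "big")
    return value < (1 << (256 - difficulty_bits))


def estimate_total(mined_count: int, difficulty_bits: int) -> int:
    """Each kept relay stands for about 2**difficulty_bits served relays."""
    return mined_count * (1 << difficulty_bits)


# Synthetic relays (hypothetical payloads).
relays = [f"relay-{i}".encode() for i in range(20_000)]
DIFFICULTY_BITS = 4  # roughly 1 relay in 16 is kept

mined = sum(is_mined(r, DIFFICULTY_BITS) for r in relays)
estimate = estimate_total(mined, DIFFICULTY_BITS)

print(f"kept {mined} of {len(relays)} relays, estimated total {estimate}")
```

Raising the difficulty keeps the stored count small however many relays a session serves, which is why this helps with very high volumes but does nothing for sessions that only carry a handful of relays.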

The Probabilistic Proof document we put together (so not all claims need a proof) will solve this part.

Nits / Questions for @RawthiL

Add a unit whenever the x-axis is amount so it’s more self-descriptive
Found a couple small typos that spell-check should catch. E.g. gruops
pocketcore/SessionNodeCount is mentioned twice
Do you know why this link shows 80MB for the block size: https://poktscan.com/explore?tab=blocks

Context

@BenVan posted the following in #node-chat.

And during the ecosystem call, we decided to collect the following data to understand next steps:

Screenshot 2024-02-21 at 9.19.04 PM

RawthiL · February 22, 2024, 12:23pm

There we show the total block size, which is composed of the block_size + state_size.
To see only the block_size (what we are dealing with here), you can see the table just below:

I edited the post to address the suggestions and changed the axis names of several figures to make them more self-explanatory. I also edited the block size figure: I was using only the last block of each month for size, and now I use the mean block size. Luckily, there was no change in the reported behavior.

BenVan · February 22, 2024, 3:09pm

I just hardcoded MinimumNumberOfProofs to 35 (with regard to proof submission only). As @RawthiL has shown, this will have minimal effect on the network as a whole, but it will eliminate negative value txs and reduce tx count by ~1%

BenVan · February 22, 2024, 4:15pm

I think that this is our least disruptive and most effective “bang for the buck” action. I would argue that the effects on decentralization are minimal because of the wide adoption of lean clients. Any given session no longer represents 25 unique targets as it did when we adopted that number.
I do not have data to back it up, but my guess is that an average session currently contains fewer than 20 physically different nodes.

fredt · February 22, 2024, 6:23pm

This option doesn’t work for Gateways. Gateways rely on having node diversity from multiple stakes (i.e., Gateways want to have hundreds of nodes available at any given time; this is trending in the wrong direction). Node diversity is critical to provide the Quality of Service that Gateway customers require.

If this happened, gateways would probably inadvertently create even more bloat, since they would want to achieve the same number of nodes per chain per session as they have today, thus resulting in staking more apps.

Shooting from the hip: Perhaps changing the minimum relays to post a claim could help with this? QoS probing isn’t going away anytime soon and future gateways will likely consume even more RPCs across more nodes per session. At the same rate, if there were even more relays on network, doesn’t this same issue continue to grow?

I will comment that I do not believe moving any sort of QoS or artificial metrics akin to QoS onchain is the correct answer and I believe it would be woefully premature. This type of measure can create artificial restrictions and second-order effects (gamification) where the incentives for QoS onchain directly compete with those of the gateways.

I recognize the complexity of this issue and am happy to contribute any way I can to devise a solution that is copacetic to all personas in the ecosystem.

RawthiL · February 22, 2024, 7:08pm

I’ve added more data; it’s of special interest to see how the claims that are below each of the presented thresholds are divided by gateway. The Nodies gateway seems to be generating more low-relay claims. This is not proof of causation of any kind, as the construction of these samples depends on a lot of things, like overall traffic. I also want to highlight that even if Nodies stops doing these relays, it won’t change the long-term issue that will keep arising with each new gateway.

Olshansky · February 22, 2024, 8:12pm

@JackALaing @Dermot et al

I wanted to reiterate the context, suggestions, next steps, backup plans, etc in my post above: Block Sizes, Claims and Proofs in the Multi-Gateway Era - #3 by Olshansky

To tl;dr next steps

1. We need data from @poktblade w.r.t. nodies gateway-related questions. @RawthiL’s data above suggests this is a necessary first step.
2. In parallel to (1), we can start collecting similar data from Grove’s side.
3. In parallel to (1) & (2), I can continue looking into MinimumNumberOfProofs with @toshi.
4. We should only onboard new gateways once (1) and (2) are resolved, and the best practices are well-defined. This can be an iteration on top of the PR that’s currently in review: [Docs] AAT related comments & documentation improvements by Olshansk · Pull Request #1598 · pokt-network/pocket-core · GitHub

BenVan · February 23, 2024, 3:32pm

I understand that Gateways “want” certain things. And, I submit that their “wants” are exactly why we only allow vetted and trusted parties to operate as Gateways. The current (V0) environment requires Gateways to act with care and to hold the overall best interest of the network above their personal goal maximization.

We do not currently know exactly how our “Trusted Gateways” are behaving and are forced to undertake research projects like this one (THANK YOU @RawthiL for helping us see) in order to keep the network running smoothly.

Changing minimum proofs per claim was the first mitigation strategy that was proposed and remains one of the top considerations. Unfortunately, we lack visibility into the actual number of test transactions that our “Trusted Gateways” are consuming which is making it difficult to figure out what an appropriate minimum number would be.

Thank you. Please confirm the exact number of test transactions per node per session which the Grove Gateway is producing and [if that number is variable] help us understand the circumstances under which it varies.

Olshansky · February 26, 2024, 7:43pm

I wanted to consolidate a few different threads happening across different forums to make sure we’re on the same page.

Urgency - This is very important but not ultra urgent. As long as we hold off on adding more gateways/appStakes, we should be okay for now.

1.1 @Dermot @b3n - It’s critical that we hold off on introducing more gateways & sharing app stakes until this is resolved.

1.2 @BenVan - I know we’re nearing the block size limit but would you agree with the level of urgency assuming (1.1)?

Data Collection

2.1 @RawthiL thanks again for everything!
2.2 I’ll keep looking for any data we can identify within Grove, but no major changes (that would affect this) to app stakes have been made in recent past.
2.3 @poktblade I wanted to reiterate my asks above. Specifically, could you share:
2.3.1 App Stakes nodies has: (address, chain, stake)
2.3.2 This would help us understand if there’s an opportunity to consolidate app stakes on a per chain basis. Again, this won’t be necessary in Shannon, but we have to work around the limitations today.
2.4 @BenVan Do you mind sharing any other requests for data you might have?

Future Best Practices

It is outside the scope of this discussion, but we’ll provide more details on (2.3.2) for future Morse Gateways.

The tl;dr for now is: instead of having N apps staked for chain_id, have 1.
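One way to spot consolidation opportunities from the (address, chain, stake) data requested in 2.3.1 is to group app stakes by chain and flag every chain served by more than one app. The addresses, chain IDs and stake amounts below are made up, and the helper name is hypothetical.

```python
from collections import defaultdict


def consolidation_candidates(app_stakes):
    """app_stakes: iterable of (address, chain_id, stake) tuples.

    Returns, for each chain with more than one staked app, the number of
    apps and their combined stake.
    """
    apps = defaultdict(int)
    combined = defaultdict(int)
    for _address, chain_id, stake in app_stakes:
        apps[chain_id] += 1
        combined[chain_id] += stake
    return {
        chain: {"apps": apps[chain], "combined_stake": combined[chain]}
        for chain in apps
        if apps[chain] > 1
    }


# Made-up example data.
stakes = [
    ("app-a", "chain-1", 1_000),
    ("app-b", "chain-1", 2_500),
    ("app-c", "chain-1", 500),
    ("app-d", "chain-2", 4_000),
]

print(consolidation_candidates(stakes))
```

Here the three apps staked for the first chain could in principle become a single app, which would mean one session per chain instead of three.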

cc @JackALaing

RawthiL · February 27, 2024, 2:53pm

More information on what will happen if we increase the pocketcore/MinimumNumberOfProofs to achieve the reduction in this table:

Percentile (~size reduction)	`pocketcore/MinimumNumberOfProofs` (Threshold)
1 %	101
10 %	513
25 %	633
50 %	879

First the overall network traffic will be reduced as:

Network traffic change	`pocketcore/MinimumNumberOfProofs` (Threshold)
-0.05 %	101
-1.79 %	513
-8.19 %	633
-21.35 %	879
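The gap between the share of claims removed and the share of traffic lost comes from the fact that the removed claims are the smallest ones. A minimal sketch with synthetic relays-per-claim values (the helper name is hypothetical):

```python
def effect_of_threshold(relays_per_claim, threshold):
    """Return (% of claims dropped, % of relays dropped) for a minimum-relays threshold."""
    dropped = [r for r in relays_per_claim if r < threshold]
    claims_cut = 100 * len(dropped) / len(relays_per_claim)
    traffic_cut = 100 * sum(dropped) / sum(relays_per_claim)
    return claims_cut, traffic_cut


# Synthetic sample (illustrative only).
sample = [50, 400, 900, 1_500, 2_500, 4_000, 6_000, 8_000, 9_000, 10_000]

claims_cut, traffic_cut = effect_of_threshold(sample, 513)
print(f"claims cut: {claims_cut:.1f} %, traffic cut: {traffic_cut:.2f} %")
```

In this toy sample a threshold of 513 removes 20% of the claims but only about 1% of the relays, the same qualitative pattern as in the table above.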

But the effect will be more important on a per-chain basis, this can be seen here:

There are many chains that will lose more than 25% of their traffic even if we only reduce the block size by 10%, by setting a threshold at 513 relays per claim. These are small chains. For completeness, you can see the full list of chains that will lose a significant amount of relays in the table below:

Full Table

Size Reduction	Threshold	Traffic reduction higher than	affected chains
1 %	101	10%	No chain
		25%	No chain
		35%	No chain
		50%	No chain
10 %	513	10%	[‘0005’, ‘0027’, ‘0053’, ‘0056’]
		25%	[‘0056’]
		35%	No chain
		50%	No chain
25%	633	10%	[‘0005’, ‘000F’, ‘0027’, ‘0053’, ‘0054’, ‘0056’, ‘0070’, ‘0077’, ‘0079’]
		25%	[‘0005’, ‘000F’, ‘0053’, ‘0056’]
		35%	[‘0005’, ‘0053’]
		50%	No chain
50%	879	10%	A lot…
		25%	A lot…
		35%	[‘0005’, ‘000F’, ‘0022’, ‘0026’, ‘0027’, ‘0028’, ‘0044’, ‘0049’, ‘0051’, ‘0053’, ‘0054’, ‘0063’, ‘0070’, ‘0072’, ‘0076’, ‘0077’, ‘0079’]
		50%	[‘0005’, ‘000F’, ‘0022’, ‘0026’, ‘0027’, ‘0028’, ‘0049’, ‘0051’, ‘0053’, ‘0054’, ‘0063’, ‘0070’, ‘0072’, ‘0076’, ‘0077’, ‘0079’]
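A table like this can be built by applying the threshold chain by chain and listing the chains whose lost share exceeds each cut-off. The sketch below uses made-up chain IDs and claims, and the helper name is hypothetical.

```python
from collections import defaultdict


def chains_over_cut(claims, threshold, min_cut_pct):
    """claims: iterable of (chain_id, relays) pairs.

    Returns the chains that would lose more than min_cut_pct % of their
    relays if claims with fewer than `threshold` relays were dropped.
    """
    total = defaultdict(int)
    lost = defaultdict(int)
    for chain_id, relays in claims:
        total[chain_id] += relays
        if relays < threshold:
            lost[chain_id] += relays
    return sorted(c for c in total if 100 * lost[c] / total[c] > min_cut_pct)


# Made-up claims: one large chain and one small chain.
claims = [
    ("AAAA", 9_000), ("AAAA", 400),
    ("BBBB", 300), ("BBBB", 450), ("BBBB", 700),
]

for cut in (1, 25, 50):
    print(cut, chains_over_cut(claims, 513, cut))
```

The small chain loses more than half of its relays at this threshold while the large one loses only a few percent, which is the effect the table highlights.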

Given the effect that this has on per-chain relays, I think it is not realistic to think that we can free up more than 25% of block space without affecting the ecosystem (using this method).
If we intend to fit more than one additional gateway, we will probably need to change more than a single parameter.

shane · February 27, 2024, 5:53pm

Hey everyone, I created a Block Size War Room in Discord. Obviously the forum is good for long-form discussions, but for coordinating and quick info gathering, Discord can be useful.

Feel free to utilize it

fredt · March 6, 2024, 6:09pm

Grove gateway has permission to invoke over 1000 gigastakes (exact number I can confirm with the foundation), but we actively invoke ~261 of them.

For each app stake, there are 24 nodes per stake per session, up to 15 chains (with 1 session per chain) per stake and 261 (active) appstakes. This implies that we are currently able to invoke 93,960 total nodes in session and 3915 active sessions at any given time.

Even if we were to reduce the total nodes per session per chain to 5, you achieve a ~79% decrease of nodes in session (19,575), but you do not reduce the number of sessions. If you reduce the number of chains per app to 5, then you reduce both the nodes in session (31,320) and the number of sessions (1305 active sessions) by ~67%. This also reduces the number of top quality nodes that have an opportunity to produce outsized rewards. To that point, Gateways require node diversity in session to provide QoS that end users need and are willing to engage with. The logical conclusion from a Gateway’s perspective is then to up the number of appstakes they actively invoke to the limit (~1000) to make up the difference, resulting in the same outcome (and possibly worse!).
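The arithmetic in this post can be re-derived from its stated parameters (261 active app stakes, up to 15 chains per stake, 24 nodes per session). The sketch below prints the session and node counts for each scenario together with the percentage reductions, so they can be compared against the figures quoted in the post; the helper names are just for illustration.

```python
def capacity(app_stakes: int, chains_per_stake: int, nodes_per_session: int):
    """Return (active sessions, total nodes in session)."""
    sessions = app_stakes * chains_per_stake  # one session per chain per stake
    return sessions, sessions * nodes_per_session


def pct_decrease(before: int, after: int) -> float:
    return 100 * (before - after) / before


base_sessions, base_nodes = capacity(261, 15, 24)
few_nodes_sessions, few_nodes_nodes = capacity(261, 15, 5)     # 5 nodes per session
few_chains_sessions, few_chains_nodes = capacity(261, 5, 24)   # 5 chains per app

print(base_sessions, base_nodes)
print(few_nodes_sessions, few_nodes_nodes,
      round(pct_decrease(base_nodes, few_nodes_nodes), 1))
print(few_chains_sessions, few_chains_nodes,
      round(pct_decrease(base_nodes, few_chains_nodes), 1),
      round(pct_decrease(base_sessions, few_chains_sessions), 1))
```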

I understand the nuance in increasing the blocksize (and how it affects scale) as well as the implications of a consensus-breaking change this late in the lifecycle of Morse, but I would advocate doubling the blocksize (again) to try and carry the existing ecosystem through until Shannon (when this whole conversation becomes moot). In addition, I would recommend Suppliers lowering the minimum relays per claim. While I understand this is a “tax” on Suppliers, I believe it is a tradeoff to scale for the time being.

I believe this solution paves the way for the 3rd+ gateway to come onchain in Morse and hopefully brings an increase in relays to build momentum as we prepare to launch Shannon. Eager to see if other creative solutions pop up to this complex problem. (also pls check my arithmetic and parameter values above )

Olshansky · March 7, 2024, 2:22am

tl;dr My personal recommendation on behalf of Grove, and as a public representative of the Pocket Network ecosystem, is to:

Double the BlockSize from 8MB to 16MB, spending ~ 1 month at 12MB in between
Increase MinimumNumberOfProofs from 10 to 500

The following is an opinionated and intentionally simple tradeoff table I’ve put together with @fredt & @shane to help drive to a solution.

Each of these is simply a governance transaction that can be executed by the foundation if approved by the DAO w/o any client or consensus-breaking changes.


	On-Chain Parameter	Pros	Cons	Nuances	Additional Context
Block Size	`pocketcore/BlockByteSize`	Enables the ability to handle more on-chain Sessions, enabling more Gateways & more Chains	Costs for every supplier (i.e. node runner) will increase	Introduces overhead to communicate new resource requirements	20MB blocks were tested on TestNet when the upgrade from 4MB to 8MB took place
# of Relays Per Session	`pocketcore/MinimumNumberOfProofs`	Reduces the number of on-chain Sessions, enabling more Gateways & more Chains	Rewards and relays for low-volume chains decrease	Could affect tokenomics if the number becomes too high.	See the amazing data & analysis from @Ramiro Rodríguez Colmeiro above
# Nodes Per Session	`pocketcore/SessionNodeCount`	skipped	skipped	Out of scope given the requirements of this discussion since it impacts one or more of QoS, decentralization & tokenomics	A new forum thread should be started for this discussion.
Max Chains	`pocketCore/MaximumChains`	skipped	skipped	Out of scope given the requirements of this discussion since it impacts one or more of QoS, decentralization & tokenomics	A new forum thread should be started for this discussion.

For additional details related to how many applications Grove’s gateway is handling, I have used the query below to collect the number of relays handled since March 1st showing the 216 app stakes actively being used by Grove’s Gateway. The full csv is available here.

SELECT
  protocol_app_public_key,
  CONCAT('[', STRING_AGG(DISTINCT chain_id, ', '), ']') AS list_of_chain_ids,
  COUNT(*) AS total_relays
FROM
  `portal-gb-prod.DWH._prod`
WHERE
  TIMESTAMP_TRUNC(ts, DAY) >= TIMESTAMP("2024-03-01")
  AND protocol_app_public_key IS NOT NULL
  AND TRIM(protocol_app_public_key) <> ""
GROUP BY
  protocol_app_public_key
ORDER BY
  total_relays DESC;

poktblade · March 7, 2024, 8:38am

I advocate for increasing the block space until Shannon is complete, as long as consensus can keep up. The majority of node operators utilize GeoMesh and typically have one or two master nodes at any given time. Most of them should be using pruned options, so the impact on cost and storage increase should be minimal. We should monitor the P2P IO usage of the network, block times, and the actual increase in storage for node operators as we expand the block space. I’ve requested Poktscan to chart the data directory over time as the state size increased, and it seems to show minimal impact.

Economically speaking, I believe node operators should be open to increasing block space, as that has a strong correlation to more potential relays from gateway operators.

Given that POKT doesn’t have to operate with extremely low block times, this could be the option with the lowest friction without requiring gateway operators to change their QoS strategies. If the network allows gateways to send relays to ‘n’ nodes, then we should permit them to do so. Even if gateways attempt to optimize fully with respect to block space, as @fredt mentions, this will decrease potential QoS checks and the ability for gateways to evenly distribute relays in an altruistic fashion. Reducing ‘SessionNodeCount’ or ‘MaximumChains’ doesn’t seem very fruitful; all this does is shift the problem to the gateway operator, who will likely just request more app stakes. Over time, the same number of sessions and claims will occupy more block space anyway.

Finally, even if we are able to optimize on the number of claims submitted on the network, this likely won’t be fruitful for the scale that the Foundation is shooting for, which is an additional 2 or more gateway operators in the network. The result of decreasing the aforementioned two parameters will not buy enough block space for these gateway operators.

Olshansky · March 7, 2024, 5:36pm

@poktblade Do you mind sharing similar information for nodies as I did above for Grove? I’ve been told it’s also available on poktscan so would appreciate your help to add this level of transparency & communication with the ecosystem.

“I’ve requested Poktscan to chart the data directory over time as the state size increased, and it seems to show minimal impact.”

@RawthiL Is this something your team would add?

Olshansky · March 7, 2024, 6:15pm

Please like this comment if the recommendation seems sensible. I will move this to an official proposal early next week if there are no objections.

RawthiL · March 7, 2024, 6:56pm

My conversation with @poktblade was around the same subjects that we already discussed here. We were even of the same mind about increasing block size and min proofs. Not sure why he did not post earlier.

The graph:

is just another way to represent the same bar plot of block size, with the added data size. But since the latter was not that important to this discussion (it does not scale with claims/proofs as the state does), I left it out.

We also analyzed app usage and claim frequency, the difference between gateways was interesting but irrelevant for this topic, as we cannot force gateways into specific strategies. Both @poktblade and @fredt agree on this I think. Consequently I left that graph out also.

Dermot · March 8, 2024, 12:06pm

Thanks @RawthiL and @BenVan for kickstarting this conversation

And for all the input and leadership from @Olshansky @shane @poktblade and @fredt on the forum and behind the scenes

It looks like we will soon be able to onboard new gateways pre-Shannon. Four additional gateways (including Liquify who are now waiting for app stakes) are in the final onboarding stages, so I hope everyone can continue to rally around this effort and get it over the line ASAP. It’s really appreciated.

In the meantime, all those interested in how app stakes are best managed, please see here (if you haven’t already) to comment on any best practice instructions you believe new gateways should follow to get the best results in terms of QoS and network health.