PUP-25: Non-Linear Stake Weighting For PIP-22

@RawthiL, et al. I apologize again for letting my emotions get the best of me and letting my comments get out of hand. You guys have put in a tremendous amount of work, which I respect immensely. I will focus on the technical issue at hand going forward.

Your expectation was well-founded and was met prior to my reply. As I promise to lay off any form of throwing barbs, it would be welcome to see that reciprocated.

My suggestion to apply the knowledge you have gained about nuances within the cherry picker to optimizing the cherry picker was not rhetorical but sincere. You guys probably have a better empirical understanding of the cherry picker than just about anyone in the ecosystem. For example, if the volume of relays occurring in the flat ramp-up phase of a session is causing a QoS problem, then let’s shorten the transition period. It won’t even take a DAO vote; just the making of the case to the dev team to change a cherry picker parameter. I’ve already fixed a couple of problems in the cherry picker; I would be happy to collaborate with you guys on further improvements.

Right now, what I promise to do is shut up and re-engage with the PDF to see what I’m missing from the first read.

3 Likes

After reading through the proposal document, the basic thesis of @RawthiL’s argument is actually quite simple. I’ll share a summary that I wrote for myself to help my understanding, in case anyone else finds it useful:

At the beginning of any given pocket session, high QoS nodes will lose some relays to low QoS nodes, as it takes time for the cherry picker to measure latency and gradually allocate more relays to the high QoS nodes.

Now that PUP-21/22 is active, a 15k node needs to attend far more sessions than previously to earn the same reward. If you’re operating a low QoS 15k node, this is a good thing - more sessions means more relays gained at session beginnings. If you’re operating a high QoS 15k node, you lose more relays for the same reason.

This issue is reversed for 60k nodes because (pokt-for-pokt) they earn their rewards over 1/4 the sessions. A high QoS 60k node loses less reward to lower QoS nodes, and a low QoS 60k node gets fewer opportunities to scavenge relays.

It’s important to understand this in the context of a total POKT daily mint rate that is fixed by FRENS. So the higher reward given to low QoS 15k and high QoS 60k nodes must come at the expense of the rest.

I have no idea if this thesis is correct or not, but I do think it’s worthy of investigation. I’ve read through all the comments listed, and thus far I don’t see anyone refuting the thesis. Nobody has proven it wrong or explained why it is invalid. Closest I’ve seen is @msa6867 acknowledging the effect may be real, but that it is likely only causing a ±10% variation from before. This may be true, but needs to be proven.

Is it possible for the Pocket team to release relay data from the portals, which we could then use to independently and quantitatively determine the true scale of this effect? Back at InfraCon there was a lot of talk of Postgres database dumps, which would be invaluable in resolving this debate. I completely understand we wouldn’t want to simply take the data provided by Poktscan at face value.

Couple of side points:

  1. Given this thesis is at least plausible and has not been discredited, ad hominem attacks are totally unjustified in my opinion.

  2. IF this thesis is proven to be accurate and the level of disadvantage can be determined quantitatively, we should address it. I have total faith that all of us only want a genuinely fair and level playing field for all node runners, big or small.

2 Likes

For what it is worth, the DAO should be funding a project like the one @steve / Dabble set up on their own to verify the parameters, performance, and variance of PIP-22.

I am shocked that Steve had to fund and run this test on his own; it should have accompanied the implementation of these changes, showing actual data from real-world testing.

I for one would like to see collaboration with / help for Steve, with his results overlaid and cross-referenced against cherry picker data, to validate any of these hypotheses being thrown around.

Second, Steve should probably get some kind of grant/support from the DAO, as the real-world research and testing he is doing is very beneficial and, imo, should have been part of the original proposal deliverables.

3 Likes

Agreed! Especially if he is digging into cherry picker data to do so, as that takes a lot of work.

First of all, absolute kudos to you for taking this stab at creating a summary! That is phenomenal. I cannot answer for poktscan as to whether your summary accurately captures their thesis or not; they will have to answer, and I eagerly await their reply. I was going to ask them to please create an executive summary of the form you have created - free of incendiary words like “unfair” and whatnot, containing just the facts and theses.

Whether or not what you write accurately represents poktscan’s thesis, the thesis you present is pretty straightforward to debunk. The problem is in the third paragraph.

I would rewrite the paragraph as follows:

"This issue is exactly the same for 60k nodes even though (pokt-for-pokt) they earn their rewards over 1/4 the sessions. A high QoS 60k node loses one quarter the relays to lower QoS nodes over the course of a day but each loss of relay causes a 4x loss of potential reward resulting in an identical loss of rewards as tat experienced by four high-QoS 15k nodes. Conversely, a low QoS 60k node gets one quarter the opportunities to scavenge relays but each scavenged relay results in 4x rewards that must be accounted toward to FREN-allowed total daily rewards and thus has the identical effect to the scavenged relays of four low-QoS 15k nodes.

I reiterate that I think taking a stab at refining the parameter settings within the cherry picker is a worthwhile spend of time and energy.

1 Like

I totally agree. However, I would like to make it clear that this thesis has nothing to do with the proposal being presented here, and if it were proven to be real, the solution would also have nothing to do with the solution this proposal is advocating. This discussion / research point should have its own thread rather than being conflated with this one.

This proposal should be withdrawn.

2 Likes

This is subjective. As you know, the cherry picker (and the gateway as a whole) and our protocol network are practically coupled. It’s how we resolved our QoS issues; I thought this was understood.

If the thesis is true, then I do not see a problem with using protocol parameters to address this issue. I did voice my concerns in PUP-21 about whether there were any fairness issues when it comes to consolidation, and it was addressed by stating that we could leverage the protocol parameters to address any unfairness, as seen here: PUP-21 Setting parameter values for the PIP-22 new parameter set - #60 by msa6867

Going into it deeper, I did mention that if we set it at 1 and there was any perceived unfairness, we would see pushback from 60K node runners when it came to decreasing it.

These parameters decide whether someone should max out on consolidation or not, especially with linear staking. Bumping it down may not be as easy as it sounds if the majority of the network decides to consolidate to the max. I believe there is less friction in being conservative and bumping it up, rather than convincing a load of max-consolidated nodes that their rewards are going to be cut if we need to dial back.

I agree with @StephenRoss in general, and I appreciate PoktScan’s time in all of this, but we can’t take what they say as fact unless it’s been peer-reviewed by other entities. Coming up with a solution is going to take engineering time, data validation, and plenty of cognitive overhead. What that means is that, in general, we’re put in a rather shitty position, and the whole network and community are hurt by that.

2 Likes

I understand, but I do agree with @benvan that this should really be split into two separate proposals - or at least two separate white papers - as I think there are two very distinct issues being raised:

Issue 1: In light of the nuanced manner in which PIP-22 interacts with the cherry picker in an environment of nodes with wildly disparate QoS levels, would a different set of parameter values and/or a different methodology by which PNF sets those values lead to a better-behaved system than the current values and methodology, and if so, what would be the optimal settings and methodology?

Issue 2: Is the current incentive structure, which gives a node staked to 60k the same rewards as four nodes staked to 15k, in the best interest of the system (seeing that it leads over time to node runners almost exclusively compounding their rewards vertically rather than horizontally)? Or is it better to incentivize only the least cost-efficient node runners to compound vertically while incentivizing the others to compound horizontally, and if so, what are the optimal parameter settings to achieve the desired balance between the incentive to compound vertically and the incentive to compound horizontally?

Neither I nor @BenVan nor anyone else is opposed to a peer-reviewed consideration of issue 1, and I have committed to go back and give the pdf a completely fresh and thorough reading. But I do object to the conflation of the two topics, particularly since the current incentive structure (making it more cost effective over time to stack vertically rather than horizontally) was precisely the expressed will of the community in the approved PIP-22 and PUP-21, given the current environment of being “over-provisioned”.

Even at that, I am willing to take a fresh look at issue 2 if they really make a good case that it leads to a better-behaved system, but I reiterate that I want to consider the questions separately and not have issue 2 be conflated with, or hidden in the subtext of, issue 1.

And why do I say the issues are getting conflated? Because the proposed new parameter value (exponent = 0.7) is so far out of balance with anything that would be needed to address issue 1 that it can only be motivated by issue 2. And yet it is being proffered as part of the deep technical discussion of issue 1. A back-of-the-envelope calculation shows that even if poktscan is absolutely correct in every aspect of their technical thesis, a value of exponent = 0.9 is more than sufficient to restore the desired balance, whereas the proposed value of 0.7 tips the scales so far in the other direction as to cause a system imbalance twice as large and twice as unfair as the imbalance/unfairness they are trying to fix. But I admittedly make this statement prior to giving the fresh re-read of the pdf which I promised, and perhaps I will discover there some argument that I missed the first time as to why 0.7 (rather than 0.9 or some other value) is needed to address issue 1.
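For intuition only, here is a minimal sketch of the kind of back-of-the-envelope comparison being referred to. It assumes a deliberately simplified model in which a node’s reward weight scales as (stake / 15k)^exponent and total daily emissions are fixed; the real PIP-22 parameterization (ceiling, floor, ServicerStakeWeightMultiplier) is more involved, so treat the numbers as illustrative rather than as the actual on-chain math:

```python
# Simplified, assumed model: reward weight ~ (stake / 15_000) ** exponent,
# with total emissions fixed, so only relative weights matter.
def per_pokt_ratio(exponent: float) -> float:
    """Reward per staked POKT of one 60k node relative to four 15k nodes."""
    weight_60k = (60_000 / 15_000) ** exponent        # one 60k node
    weight_4x15k = 4 * (15_000 / 15_000) ** exponent  # four 15k nodes
    return weight_60k / weight_4x15k

for e in (1.0, 0.9, 0.8, 0.7):
    print(f"exponent={e}: 60k node earns {per_pokt_ratio(e):.3f}x per POKT "
          f"vs four 15k nodes")
# exponent=1.0 -> 1.000 (parity); 0.9 -> ~0.871; 0.7 -> ~0.660
# i.e. under this toy model, 0.7 penalizes full consolidation by roughly a third.
```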

1 Like

This is a fair and prescient concern that you raised. I have confidence in the caliber of our community: if a legitimate, peer-reviewed, substantiated, and verified need to drastically lower the exponent were found, then, while they would not be happy, they would not block the change. That being said, I stand by my original reply: I am confident that any such need to modify the exponent will not turn out to be anything so drastic as 0.7, but will be much more modest, such that in reality it would lead to much less friction than feared.

1 Like

In starting to relook at the pdf I got the impression that it would be important to go back and absorb poktscan’s May Infracon whitepaper and presentation before proceeding. I’m going to pause for a day or two to try to really understand the May whitepaper and its implications for the topic at hand.

I will not go through all the answers since my last post in order to keep this post concise.

First I want to say that the personal attacks are not useful for the community. As a community we need to try our best to keep discussions polite and within the topic at hand.

I’m relieved that the topic shifted rapidly and regained focus on what is important: proving the issue we are observing right or wrong. Thanks @poktblade and @StephenRoss for taking the time to go through our document and give your opinions. I also thank @msa6867 for agreeing to give our document a second look, as he is one of the most prepared contributors in our community.

We have no problem with going through a review of all the data or performing new experiments and metrics to clarify this issue. That’s the way knowledge is built. We are currently tracking the nodes provided by @steve (plus 20 more of our own nodes in a similar set-up). This is a long-running experiment that we hope will give us some hard evidence to work with. We are also talking to other community members in order to produce new metrics, some of which we had not thought of. We even have an open channel with @msa6867 and collaborate with him directly outside the public channels.

Finally, I want to say that we can use this subject either to create further division in our already small community or to help consolidate our community knowledge.
It is not important whether this proposal passes or not, as long as the arguments of each side are defended with data and respect.

1 Like

I’m also convinced that we’re all on the same side and want the same thing - clarity and fairness. Thanks for reminding us that we’re all aligned here.

It seems that the root issue is that tracking rewards and understanding the math/models behind them has gotten too complex for many of us - or maybe just me. I used to feel pretty confident about my understanding of why my nodes were over- or under-performing; I’m not so confident anymore. For the same reason, I can’t take sides on this proposal and would have to abstain from voting if it were put to a vote now.

Thanks for the comments @beezy - I would also love to see collaboration to validate some of the math/models and hypotheses. I’m trying to figure out how we can validate them in a way that is not subjective, is easy to understand, and is backed by concrete data.

To that end, perhaps everyone could answer a couple of simple questions to see if we all agree on some basic assumptions?

First Question

If I have 60K POKT to stake, I should see similar rewards whether I choose to stake four 15K nodes, two 30K nodes, a 45K and a 15K node, or one 60K node. And this should hold even if I have 600K or 6M POKT to stake. This assumes, of course, that all of the nodes are configured exactly the same way - for example, if all the nodes were running on a single server running LeanPocket. Do we all agree with this?
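As an illustration of the assumption behind this question, here is a minimal sketch using a simplified linear weighting model (reward weight proportional to stake / 15k, identical QoS assumed, second-order effects ignored); it just shows that every split of 60K POKT carries the same total weight:

```python
# Simplified assumption: with linear PIP-22 weighting and identical QoS,
# a node's expected reward is proportional to stake / 15_000.
def total_weight(stakes_in_pokt):
    return sum(stake / 15_000 for stake in stakes_in_pokt)

splits = {
    "four 15K nodes":  [15_000] * 4,
    "two 30K nodes":   [30_000] * 2,
    "a 45K and a 15K": [45_000, 15_000],
    "one 60K node":    [60_000],
}
for name, stakes in splits.items():
    print(f"{name}: total weight = {total_weight(stakes)}")  # all 4.0
```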

Second Question

If all the nodes are configured exactly the same way, they would have the same QoS. Is that a good assumption also?


All I’m trying to do at this point is validate those two assumptions, and that’s the purpose of the test I’m running. If we all agree with those basic assumptions, I’ll share the addresses of the 250 nodes that we’ve rolled out for the test, and everyone can monitor the results over time and see what we see.

If anyone doesn’t agree, please suggest how a test might be set up in a way that is not subjective, is easy to understand, and will provide concrete data to validate outcomes.

We all have the same objective here; let’s work together and figure this out.

3 Likes

Yes, I agree with your assumptions/questions, and I look forward to tracking your results.

I caution you to be careful of early results, as the variance of a 250-node set can take quite a while to settle out. I’ll try to find you a formula so that you can compute the p-value yourself.
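For illustration only (this is not necessarily the formula being referred to), here is a minimal sketch of one common way to attach a p-value to such a comparison, assuming you have per-node daily reward-per-POKT figures for two cohorts and use Welch’s two-sample t-test; the placeholder data is made up:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder data: replace with observed per-node reward-per-POKT values.
cohort_15k = rng.normal(loc=1.00, scale=0.15, size=120)
cohort_60k = rng.normal(loc=0.97, scale=0.15, size=30)

# Welch's two-sample t-test (does not assume equal variances).
t_stat, p_value = stats.ttest_ind(cohort_15k, cohort_60k, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # small p => cohorts likely differ
```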

Just for fun: here’s a 24-hour picture of the variance between buckets among the (anonymized) major providers. All of them have significantly more than 250 nodes.

2 Likes

Re the first question: I would say that yes, theoretically they ought to see similar rewards whether 4x 15k, 2x 30k, 1x 60k, 15k + 45k, etc. And by “theoretically” I mean to include the claim that all possible second-order effects that may have bearing on the system as a whole ought to have zero bearing as pertains to identically configured nodes. The only reason I equivocate with the qualifier “theoretically” is that there is so much variation in the random rolling of the dice that splitting 250 nodes into two 125-node groups is borderline in terms of being able to build up a data set that is statistically significant within a time period short enough to be considered static.

Here’s what I mean. Suppose that you had done the experiment in April, so that PIP-22 had absolutely no bearing on anything, and you randomly split your nodes (assuming they were identical) into two groups of 125. I would expect that in any given week you might see up to 30% or more variation in rewards between the two groups just from the simple variable of rolling the dice. I’m guessing it would take about two months of data collection to even out those random variations and draw any interesting conclusion. And you are at the mercy of the system remaining more or less static during that period.
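A minimal sketch of the kind of dice-rolling being described here. All parameters (sessions won per week, the per-session reward distribution) are made-up assumptions chosen only to show how two groups of identical nodes can diverge purely by chance; nothing here is a network measurement:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative assumptions, not network measurements.
NODES = 250                  # identical nodes, split into two groups of 125
SESSIONS_PER_WEEK = 3        # average sessions "won" per node per week
LOG_MEAN, LOG_SIGMA = 3.0, 1.4  # heavy-tailed per-session reward distribution

def weekly_group_gap() -> float:
    sessions = rng.poisson(SESSIONS_PER_WEEK, size=NODES)
    rewards = np.array([rng.lognormal(LOG_MEAN, LOG_SIGMA, n).sum()
                        for n in sessions])
    group_a, group_b = rewards[:125].sum(), rewards[125:].sum()
    return abs(group_a - group_b) / ((group_a + group_b) / 2)

gaps = [weekly_group_gap() for _ in range(2000)]
print(f"median weekly gap: {np.median(gaps):.1%}, "
      f"95th percentile: {np.percentile(gaps, 95):.1%}")
```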

Suppose, on the other hand, that new nodes are actively being added to the system, so that the system is not static over two months. Then the random rolling of the dice that favored one half over the first couple of weeks gets less chance of evening out later on, because later on there are more competing nodes. In that manner, pure randomness can get baked in over time and never even out.

Nonetheless, I think 250 nodes, if identical, should give some interesting results over a couple of months. And if there is another node runner with, say, 1-2k identical nodes, then all the better, as we can probably make do with a few weeks of their data given the larger sample size.

Re “if all the nodes were running on a single server running LeanPocket”, I would say “especially if all the nodes were running on a single server running LeanPocket”, as the more that can be confirmed to be identical in configuration, hardware, and location, the better.

Re the second question, I would say yes, with the same caveat regarding gathering sufficient statistics (e.g., at least two months for comparing two test groups of 125 nodes each). I have not done the math to figure out how big the sample size (nodes x days) needs to be to draw statistically significant conclusions, and it is no trivial task given how many moving parts of variability there are in the system - especially given the effect someone mentioned earlier, where, say, USA-based nodes may get called on to service relays in Asia and show bad QoS stats compared to identically configured nodes that serviced relays in the USA. It takes quite a large sample size to even that kind of thing out. Of course, we can look at QoS on a per-region basis to get rid of this effect, but that means all the more days of data collection are needed to reach a sufficiently large sample size.

In your case, since you are looking for statistically significant variations between bins, you have to consider the node count per bin to figure out how long you need to collect data before you can draw any conclusion. Last I checked you have about 40 in bin 4, 40 in bin 3, 60 in bin 2, and the rest in bin 1. I really like the more or less even spread between bins 2-4; that helps. But 40 nodes may not be enough for two months to be sufficient data collection. It’s hard to say. I think your best approach is to randomly assign identically configured 15k nodes into different groups of approximately 40 nodes each. Once you collect enough data that the statistical variation between your purely random groups of forty 15k nodes is sufficiently small, that can serve as a good empirical argument that your bin 2-4 groupings have also reached a sufficiently large sample size to test the hypothesis of whether or not there are bin-related variations. Hope that makes sense.

Can you describe the setup?

Thanks for the clear/direct answers @BenVan. If you could suggest how long you think we’d need to run the test to be confident in the results that would be super helpful. Thanks again!

Thank you for your responses also, @msa6867 - although I read your response as a solid “maybe”. I’d like to get a consensus on what we think will work, to remove any subjectivity when we see the results. If what you’re saying is that you believe there is no way to accurately determine whether PIP-22 is working the way it should be, that’s a bigger issue in my opinion. But to confirm that’s not what you’re saying: can you tell us if you think it is possible to conduct a test like the one we’re discussing that will tell us definitively whether PIP-22 is working the way it should be? If so, can you also tell us, in as few words as possible, what that test looks like?

Here is more detail regarding how the test is set up. As I’ve shared, there are 250 total nodes representing a total stake of 7.2M POKT.

NOTE: The 15K nodes actually have 15,500 POKT staked to provide a little buffer over the minimum stake threshold, which is why the total stake comes to 7.26M rather than 7.2M.

The question I’m trying to answer is:

Is it better to stake 15K, 30K, 45K, or 60K nodes? Or, does it make no difference?

This is what I’m most interested in knowing for sure. So the test is looking at this question with 1.8M POKT staked in each of the bins (15K, 30K, 45K, and 60K). As I understand how PIP-22 should work, it seems to me that all of the cohorts should see the same or nearly the same rewards over some period of time - which, I know, also needs to be determined.

Each of the four cohorts is split evenly across two servers. The purpose of the split was to simulate a lower-end and a higher-end config. Both servers are dedicated (no other apps are running on them) and are running LeanPocket. The general hardware configuration is:

Server 1 (higher-end config)

  • CPU: 16 Core x 3.0 GHz (AMD Epyc 7302P)

  • RAM: 128 GB

  • Storage: NVMe (2 x 960 GB Hardware RAID 1)

  • Location: Philadelphia

Server 2 (lower-end config)

  • CPU: 8 Core x 3.6 GHz (Intel i7-9700K)

  • RAM: 64 GB

  • Storage: NVMe (2 x 960 GB Hardware RAID 1)

  • Location: Saint Louis

Both servers have the exact same chains.json file so all of the nodes are relaying to the exact same RPC endpoints.

NOTE: The goal of the test is not to optimize for rewards but rather to test PIP-22. If it were optimizing for rewards, we would use different chains.json configurations / relay node endpoints that we know would perform better.

If there is anything more you’d like to know about the test setup, I’m happy to answer any relevant questions.

3 Likes

I love the openness and fairness of your test, but unfortunately, I don’t think it’s anywhere near large enough to achieve statistical significance in the next few months.

Even your largest bin (120 nodes) represents (120 / 30,000) less than 1/2 of 1% of the network. That’s equivalent to testing one block every two days for the network as a whole. And as @msa6867 has pointed out, the network node composition will be changing during this period of time, which will necessitate an even larger sample to achieve the same level of confidence.

The problem gets worse because the net difference (18%) which PUP-25 claims to exist is claimed by others to be ±10%, and by still others to be ±3.5%. The SWMM adjustment factor has a greater impact than that. The lower the “true” difference is (if, in fact, there is any difference), the harder it will be to bring it into focus.

Now for the good news…sort of:
Although I cannot provide a time estimate of when your data will have significance, I can give you a way to determine when your sample size has reached a level at which it is no longer meaningless. That is:

Measure the internal variance of your 120 node bin using the same sample size as that of the comparison bin (30).
How to do this:
1.) Cut the 120-node sample into 8 different pre-defined sub-sets of 30:
Set one: numbers 1-30
Set two: numbers 31-60
Set three: numbers 61-90
Set four: numbers 91-120
Set five: every fourth node IE: 1,5,9,13,17 etc.
Set six: every fourth node +1: IE 2,6,10,14,18 etc.
Set seven: every fourth node +2: IE 3,7,11,15,19 etc.
Set eight: every fourth node +3: IE 4,8,12,16,20 etc.

As long as the absolute value of the difference between ANY TWO of the eight sets is greater than the difference between bucket 1 and bucket 4, your data is 100% insufficient. Once you reach the point where the max difference between your internal sets is smaller than the bucket difference, you can subtract the max internal difference from the bucket difference and get an idea of what the data is saying. (Note: this does not imply significance; it just lets you read the data correctly.)
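A minimal sketch of this check, assuming you have a cumulative reward-per-POKT figure for each node in the 120-node bucket-1 bin and in the 30-node bucket-4 bin (the array names and placeholder data are illustrative):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
# Placeholder data: replace with observed cumulative reward-per-POKT per node.
bucket1 = rng.normal(1.00, 0.2, size=120)  # the 120 x 15k nodes
bucket4 = rng.normal(1.00, 0.2, size=30)   # the 30 x 60k nodes

# The eight pre-defined 30-node subsets of bucket 1 described above:
# four contiguous blocks plus four strided ("every fourth node") sets.
subsets = [bucket1[i:i + 30] for i in range(0, 120, 30)]
subsets += [bucket1[offset::4] for offset in range(4)]

subset_means = [s.mean() for s in subsets]
max_internal_gap = max(abs(a - b) for a, b in combinations(subset_means, 2))
bucket_gap = abs(bucket1.mean() - bucket4.mean())

if max_internal_gap > bucket_gap:
    print("data still 100% insufficient")
else:
    # Rough read, per the description above (not a significance test).
    print(f"apparent bucket difference ~ {bucket_gap - max_internal_gap:.4f}")
```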

3 Likes

Sorry, I was long-winded. My answer is a solid yes, with the same caveat that @benvan raised, plus the added caveat that if the system behavior changes substantially over the time period needed to gain statistical significance, you’ll have to find creative ways to factor out the effects of the system changes.

I like your setup. Are you gathering cherry picker data (median latency, P90 latency, success rate, etc.)?

The controlling sample size is 15 (the 60k bin on each server). Pretty small; two months may or may not be long enough. As you gather data and want to visualize results, here is a convenient way to graph the data to quickly see how far along you are on the road to gathering enough data:

  • Randomly assign each group of 30 30k nodes into 2 subgroups
  • Randomly assign each group of 60 15k nodes into 4 subgroups
  • Plot the subgroups against each other (within a single bin/server combo)
  • Once the variation within all the different subgroups becomes negligible for many days in a row of adding to the cumulative data set, you can be fairly confident that you have collected enough data.

I would really like to see some other providers join the fun. I don’t think there are any that can give more bin 2 and bin 3 coverage than you are providing, but I know there are at least three providers that could do a bin 1 vs bin 4 comparison with 10x or more the sample size, and I would encourage them to do so and report their results as well.

Could you elaborate on the divergence between optimizing for rewards and optimizing for gathering test data? I can guess (“we chose chains known to have the highest USE1,2 volume” or “we chose chains known to have the least demand-side variation day to day” or “we chose chains known to have the most supply-side competition from other nodes”)… but I would love to hear directly from you rather than guessing.

Also, I will go back and reiterate what @beezy said earlier: in addition to all the time and effort going into the test, if you are foregoing optimal rewards in order to conduct it, then the DAO should reimburse you for the loss of income you are incurring, as this is a service to the community as a whole.

1 Like

I love good news - even when I don’t fully understand it. :slightly_smiling_face:

So, I won’t pretend I know why you think what you outlined will work (and no need to explain). But the fact that you’re providing a path to getting an objective result makes me happy.

So, if you know all of the addresses of the nodes in the test, can you:

  1. Tell us if you feel the sample is meaningful
  2. Let us know how much more we’d need to stake if our sample size is too small

I don’t need to understand the math/statistics - I’d just like to get consensus on what an objective test looks like.

2 Likes

I think @benvan and I are giving more or less the same path forward, just in different words. Can you provide a CSV with cumulatives to date? E.g.:

address, stake_amt, location, cum_rewards

If you provide a link to that data, then either @benvan or I (or both of us independently) could probably turn it around within an hour and post a visual (either privately or publicly, at your discretion) showing how much variation there is between the randomized subgroups we are speaking of and, consequently, how far along you are on the road to gathering sufficient data.
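For what it’s worth, a minimal sketch of the kind of quick turnaround being described, assuming a CSV with the columns suggested above (the file name and exact column names are assumptions):

```python
import pandas as pd

# Assumed file and column names, matching the suggested CSV layout above.
df = pd.read_csv("cohort_rewards.csv")  # address, stake_amt, location, cum_rewards

# Normalize rewards by stake so bins are directly comparable, then summarize
# cumulative reward-per-POKT across stake amounts and locations.
df["reward_per_pokt"] = df["cum_rewards"] / df["stake_amt"]
summary = (df.groupby(["stake_amt", "location"])["reward_per_pokt"]
             .agg(["count", "mean", "std"]))
print(summary)
```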

This will be a much more efficient use of time than only giving addresses and having to go back and regather, block by block, all the data that you have already gathered.

Or, equivalently, if poktscan has been gathering this data, you can just let the poktscan team know that they have your permission to share the collected data with me, and they can get a CSV to me on our private channel.

1 Like

At this point we have not aggregated any data. We just staked the nodes in the cohorts, and we’re tracking the rewards using the blockchain data. The blockchain is the source of truth, so shouldn’t it be used as opposed to a generated CSV?