Re first question: I would say that yes, theoretically they ought to see similar rewards whether 4x 15k, 2x 30k, 60k, 15k+45k, etc. And by “theoretically” I INCLUDE meaning to say that all possible second order effects that may have bearing on the system as a whole ought to have zero bearing as pertains to identically-configured nodes. The only reason I equivocate with the qualifier “theoretically” is because there is so much variation in the random rolling of the dice that splitting 250 nodes into two 125-node groups is borderllne in terms of being able to build up a sufficient set of data that is statistically significant in a time period that is short enough to be considered static.
Here’s what I mean. Suppose that you had done the experiment in April so that PIP-22 has absolutely no bearing on anything, and you randomly split your nodes (assuming they were identical) into two groups of 125 nodes. I would expect that on any given week you may see up to 30% or more variation in rewards between the two groups just from the simple variable of rolling the dice. I’m guessing it would take about two months of data collection to even out those random variations to draw any interesting conclusion. And you are at the mercy of the system remaining more-or-less static during that period.
Suppose on the other hand that new nodes are actively being added to the system so that the system is not static over two months. Then the random rolling of the dice that randomly favored one half over the first couple weeks gets less chance of of evening out later on because later on there are more competing nodes… in that manner pure randomness can get baked in over time and not even out.
Nonetheless, I think 250 nodes, if identical, should give some interesting results over a couple months. And if there is another node runner with say 1-2k identical nodes then all the better as we can probably make due for them with a few weeks of data given their larger sample size.
Re, “if all the nodes were running on a single server running LeanPocket” I would say, “especially if all the nodes were running on a single server running LeanPocket”, as the more that can be confirmed to be identical in configuration, hardware and location, the better.
Re second question, I would say yes, with the same caveat re gathering sufficient statistics (eg at least two months for comparing two test groups of 125 nodes each, etc.) I have not done the math to figure out how big the sample size (nodes x days) needs to be to draw statistically-significant conclusions and it is no trivial task to do so given how many moving parts of variability there are in the system. Especially given the effect that someone mentioned earlier where say USA based nodes may get called on to service relays in asia and have bad QoS stats compared to identically configured nodes that serviced relays in usa. It takes quite a large sample size to even that kind of stuff out. Of course we can look at QoS on a per-region basis to get rid of this effect but that means all the more days of data collection need to be gathered to get a sufficiently large sample size.
In your case, since you are looking for statistically-significant variations between bins, you have to consider the node count per bin to figure out how long you need to collect data before you can draw any conclusion. Last I checked you have about 40 in bin 4, 40 in bin 3, 60 in bin 2 and the rest in bin 1. I really like the more or less even spread between bin 2-4. That helps. But 40 nodes may not be enough to have 2 months be enough data collection. Its hard to say. I think your best approach is to randomly assign identically configured 15k nodes into different groups of approx 40 nodes each. Once you collect enough data such that the statistical variations between your purely random groups of forty 15k nodes is sufficiently small, that can be used as a good empirical argument to say that your bin 2-4 groupings have also reach a sufficiently large sample size to test the hypothesis of whether or not there are bin-related variations. Hope that makes sense