Gateway Optimal Orchestration and Deployment (GOOD)

AlexF · February 9, 2023, 11:54pm

It averages out to 150ms because that is the cut off for the Cherry Picker’s top bucket.
The cost to deploy a region to AWS is minimal – Elasticache service is the only thing that requires spinning up. It only costs money once it receives traffic.
Pocket is changing providers, so the AWS routing will no longer apply.
Reducing the number of Gateway endpoints will undoubtedly cause poorer latency for the end users.
You are centralizing the network and reducing the possibility of local service (user in Paris hits Paris gateway and gets Paris nodes).
Your assumption that users are only centralized in those 3 cities is incorrect.

RawthiL · February 10, 2023, 1:26pm

Hi Alex, thanks for the comments.

If you mean the current global latency, which is 155 ms, I think that is just a coincidence. Each gateway has its own Cherry Picker process and, as seen in Table 1 of the report, the averages at each location differ greatly from the CP threshold of 150 ms:

This is because the data obtained from the CP tracking process does not report “buckets”; it reports the actual latency of the relays. Also, node-runners do not seem to be targeting 150 ms; they go as low as they can. This can be seen in POKTscan’s Geo tab for any provider.

I cannot give you precise figures on the cost reduction, as I have no visibility into the costs involved. Our claim comes from a simple concept: removing things should reduce costs. In the end, it is PNI who decides if this is worth doing.

It is true that the report as it stands won’t translate directly to GCP (the new PNI provider). However, the methodology can easily be converted to GCP by simply replacing the ping matrix used in the calculations. This matrix can be found in data/cloudping_p50_1W_02-02-2023.csv in the repository. We believe the results will not vary much, since the selected gateways represent clear clusters, but we are open to being proven wrong.

That depends heavily on the implementation. We have shown that if internal routing is applied this is not true; in fact, latency will improve.
If nothing is done and only the gateways are removed, the apps might align to the nodes or implement services such as Argo to improve their latency to the remaining locations. This will also reduce their ping, since the round-trip to the gateway is avoided. Anyway, it is very difficult to add this to a report without making hard assumptions; for that reason we only provided the resulting scenario under controlled conditions.

This is not correct: a user in Paris hits the Paris gateway but gets the nodes that are the best in their session.
Maybe Paris is not the best example, since Europe is very well connected and a node in Paris is the same as one in Frankfurt as far as the CP is concerned. Take Hong Kong, for instance: right now the users of Polygon in Hong Kong hit ap-east-1 and are paired with nodes that are not in Hong Kong, because there are too few of them there. The average session won’t have a Hong Kong node, since fewer than half the nodes have less than 150 ms latency in ap-east-1, as can be seen in POKTscan’s Chains page.
We agree that it would be better to have more gateways and node-runners deploying nodes all over the world, but the current market conditions are not ideal to put that economic stress on the ecosystem.

We make no assumptions about where the nodes are deployed. The three selected cities are the result of an optimization problem, not an assumption.
We believe that around 70% of the deployed nodes are in those regions, but that played no role in the results presented here. You can easily check this by modifying the latency and/or traffic of any non-selected region and re-running the algorithms (by modifying the grouped_df dataframe).
We never set, hint at, or guide the process to select a given location or a certain number of them.
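
To illustrate the kind of sensitivity check described above, here is a minimal sketch. It is not the report's algorithm: it is an exhaustive search over a fixed number of gateways (the report does not fix that number), and the region names, ping values and traffic figures are all hypothetical placeholders held in plain dicts rather than in grouped_df. Each region is assumed to route to whichever selected gateway has the lowest ping.

```python
from itertools import combinations


def weighted_latency(ping, traffic, gateways):
    """Traffic-weighted mean latency (ms) when every region is routed
    to its lowest-ping gateway among the selected ones."""
    total_traffic = sum(traffic.values())
    weighted_sum = sum(
        traffic[src] * min(ping[src][gw] for gw in gateways)
        for src in traffic
    )
    return weighted_sum / total_traffic


def best_gateways(ping, traffic, k):
    """Try every set of k regions and keep the one with the lowest
    traffic-weighted latency (brute force, fine for a few dozen regions)."""
    return min(
        combinations(sorted(ping), k),
        key=lambda gws: weighted_latency(ping, traffic, gws),
    )


# Hypothetical region-to-region ping matrix in ms (illustrative values only).
ping = {
    "A": {"A": 5, "B": 20, "C": 150, "D": 160},
    "B": {"A": 20, "B": 5, "C": 140, "D": 150},
    "C": {"A": 150, "B": 140, "C": 5, "D": 30},
    "D": {"A": 160, "B": 150, "C": 30, "D": 5},
}
# Hypothetical relays per region (illustrative values only).
traffic = {"A": 100, "B": 400, "C": 300, "D": 50}

chosen = best_gateways(ping, traffic, k=2)

# Sensitivity check: raise the traffic of a region that was not selected
# and re-run the search to see whether the selection moves.
shifted_traffic = dict(traffic, D=5000)
chosen_after_shift = best_gateways(ping, shifted_traffic, k=2)

print(chosen, chosen_after_shift)
```

Swapping in a ping matrix measured on another provider only changes the `ping` input; the search itself stays the same.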

AlexF · February 10, 2023, 2:35pm

The numbers are clearly clustering around 150ms except in underserved regions.

I had full visibility and the cost per region is minimal. A few hundred dollars per month when it has no traffic and is scaled down. They only cost money when they receive traffic; therefore the entire premise of this post is moot. You’re not saving money by spinning down regions.

Please don’t tell me how the cherry picker and sessions work. I wrote the code. There are many nodes in Paris; I ran hundreds there and in Amsterdam. The datacenters were cheaper than Frankfurt. Given that there are now dozens of nodes in each session, the likelihood that a Paris user will get a few Paris nodes is good.

Traffic shifts over time. For the first year of the protocol there was almost NO APAC traffic at all. Historically AP-SE-1 has been the biggest but over the past few months, the north regions took over. By reducing the number of gateways, you are betting that traffic doesn’t shift.

Take US-West as an easy example. With the only gateway now in the east, every single user request coming from California will enjoy an additional 80+ ms of roundtrip for each request.

RawthiL · February 10, 2023, 3:09pm

Thanks for the insight, this should be discussed with the PNI then.

Great, then you can do the math and calculate those odds. The fact that you ran hundreds of nodes in Paris does not tell us anything: there are 20K nodes in the network, and the CP won’t care if a node is in Paris or Frankfurt as long as it has less than 150 ms latency.
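
For anyone who wants to run those numbers, a rough sketch follows. It assumes session nodes are drawn uniformly at random, without replacement, from the whole network (a hypergeometric model), which ignores per-chain staking and any other selection rule. The 20K network size comes from this thread; the session size of 24 and the 10% local share are assumptions made only for illustration.

```python
from math import comb


def p_at_least(local_nodes, total_nodes, session_size, k):
    """Probability that at least k nodes in a session are local, when the
    session is a uniform random sample without replacement.
    Requires k <= session_size."""
    all_sessions = comb(total_nodes, session_size)
    # Probability of drawing fewer than k local nodes.
    p_fewer = sum(
        comb(local_nodes, i) * comb(total_nodes - local_nodes, session_size - i)
        for i in range(k)
    ) / all_sessions
    return 1 - p_fewer


# Assumed inputs: 20K nodes (from the thread), 10% of them local and
# 24 nodes per session (both illustrative assumptions).
p_one_local = p_at_least(local_nodes=2000, total_nodes=20000, session_size=24, k=1)
p_three_local = p_at_least(local_nodes=2000, total_nodes=20000, session_size=24, k=3)

print(f"P(at least 1 local node):  {p_one_local:.3f}")
print(f"P(at least 3 local nodes): {p_three_local:.3f}")
```

Having a local node in the session is not the same as being served by it, so these figures are an upper bound on how often a user actually gets a local node.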

We are aware that traffic shifts.

A lot has changed in the last year.
If the token were still at 3 u$d, we would not be proposing this.
If PNI were not trying to reduce costs, we would not be proposing this.
If V0 had enough support for node-runners to happily run their business, we would not be proposing this.

Nobody wants to drop support. Ideally we would be buying hardware every day and expanding to new regions, but that’s not the landscape today. If it weren’t for the community (mainly node-runners), V0 would probably have crashed. I believe that this is what we need in order to regroup and make it to V1. We need to focus on maintaining V0 and helping with V1. The node-running race can take a break.

That is already happening: less than 10% of the nodes are local in US-West. You can work out the odds of one of your us-west nodes being in a session.

AlexF · February 10, 2023, 4:20pm

The bottom line is that this proposal does not reduce costs in any significant way. The regions cost a few hundred dollars each per month to spin up before they receive traffic. It will significantly affect service. The fact that no one on the team has raised this to Arthur is disturbing.

shane · February 10, 2023, 4:55pm

Hmmm… not sure how to reconcile the different perspectives here. This is what PNI is saying regarding reducing costs:

I was always under the impression that gateways were lightweight, but PNI is saying otherwise.

ArtSabintsev · February 10, 2023, 4:55pm

hey @AlexF!

Great to hear from you. Would you mind hopping on a call with POKTscan and some folks from PNI next week?

ArtSabintsev · February 10, 2023, 6:37pm

We’ll be chatting on Tuesday. Will follow up after.

AlexF · February 10, 2023, 6:49pm

They are wrong. Spinning up a new region costs around $500 per month with the minimum level of Elasticache. Nothing else costs money in a region (aside from minor costs like VPC creation) until it receives traffic.

All this proposal does is shuffle the costs of traffic to another region.
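
The arithmetic behind this claim can be sketched as follows. The figure of roughly $500 per idle region per month is the one quoted above; the region counts and the traffic bill are hypothetical, and the sketch assumes that the cost of serving a given amount of traffic is the same in every region, which may not hold exactly.

```python
def monthly_cost(regions, fixed_per_region, traffic_cost):
    """Total monthly bill: a fixed idle cost per region plus the cost of
    serving the traffic, assumed independent of where it is served."""
    return regions * fixed_per_region + traffic_cost


def fixed_saving(regions_before, regions_after, fixed_per_region=500):
    """Saving from closing regions when traffic cost only moves elsewhere:
    the traffic term cancels and only the idle cost is recovered."""
    return (regions_before - regions_after) * fixed_per_region


# Hypothetical scenario: 10 regions reduced to 3, same total traffic bill.
traffic_bill = 40_000  # placeholder value, not a real figure
before = monthly_cost(10, 500, traffic_bill)
after = monthly_cost(3, 500, traffic_bill)

print(before - after, fixed_saving(10, 3))
```

Under these assumptions the saving does not depend on the traffic bill at all, only on the number of regions closed.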

RawthiL · February 10, 2023, 7:34pm

Even when the cost is shuffled and no net reduction exists for PNI, the rest of the benefits will hold:

Reduction of costs for node runners, both in infra and human resources (I know that this is not 500 u$d/month).
Reduction of latency, just by re-routing the traffic using either AWS or GCP.
Easier estimation of network cost.
(and others mentioned in first post)

AlexF · February 10, 2023, 8:35pm

I don’t believe this is true. The Gateway system already uses AWS Global Accelerator which has hundreds of PoPs and keeps the traffic on the internal AWS network as soon as it reaches one of those points.

Your example in your second post is incorrect. Because the user near AP-NE-2 has to travel with their request over the public internet to AP-SE-1, nothing is gained. You’ve changed what part of the request travels over the public internet, but there is still the same amount of public internet being used.

There is no logical way that reducing the number of Gateways will reduce end-user latency. I suspect we will see this effect in action as the other Gateways are shut down.

RawthiL · February 10, 2023, 8:54pm

I understand that once in AWS, the traffic between regions will move through the AWS Global Backbone, which is not the same as the public internet (and is why pings are lower in the data matrix). The amount of public internet being used is different; the example is quite clear about that.
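
One way to make this disagreement concrete is to write each route as a list of legs, mark which legs cross the public internet, and compare totals. The sketch below does only that. Every latency value and leg name is a hypothetical placeholder, not a measurement and not taken from the report; which route wins depends entirely on the real numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leg:
    name: str
    ms: float
    public: bool  # True if this leg crosses the public internet


def summarize(route):
    """Return (total latency, latency spent on the public internet) in ms."""
    total = sum(leg.ms for leg in route)
    public = sum(leg.ms for leg in route if leg.public)
    return total, public


# Hypothetical route 1: a gateway exists in the user's region, but the
# serving node is far away and reached over the public internet.
via_local_gateway = [
    Leg("user -> local gateway", 10, public=True),
    Leg("local gateway -> distant node", 90, public=True),
]

# Hypothetical route 2: the local gateway is closed, the request enters the
# provider network at a nearby point of presence and crosses the backbone.
via_remote_gateway = [
    Leg("user -> nearest point of presence", 10, public=True),
    Leg("point of presence -> remote gateway (backbone)", 70, public=False),
    Leg("remote gateway -> nearby node", 15, public=True),
]

for label, route in [("local", via_local_gateway), ("remote", via_remote_gateway)]:
    total, public = summarize(route)
    print(f"{label}: total {total} ms, public internet {public} ms")
```

Replacing the placeholder values with measured pings for a given user location, gateway and node set would show which of the two positions holds in that case.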

msa6867 · February 21, 2023, 8:16pm

@RawthiL would you mind sharing your analysis (assuming you have done one) of how, e.g., 20 vs 3 would play out in terms of QoS in V1, where things like setting location are enabled?

RawthiL · February 21, 2023, 9:37pm

I’m not sure exactly what you mean by “in terms of V1”.
If you want to know what QoS would be in a network where the relays are only served by local nodes, then I cannot tell.
Today it would be really difficult to know where nodes are located and in which geozones they would be staked in V1. In a perfect world, only local nodes would be staked in each geozone, resulting in better QoS, but this cannot be enforced and non-local nodes could also be staked in any geozone. This would reduce the QoS to values that are at the limits of the MinimumTestScoreThreshold.
Also, I think that V1 won’t have 20 geozones, so the routing problem presented here will be present within each geozone in V1, as more than one gateway will exist in each geozone.

addison · February 27, 2023, 2:58pm

Hey @ArtSabintsev @AlexF

Any update regarding this discussion?

ArtSabintsev · February 27, 2023, 3:38pm

hey,

We’re going to continue as is, but may not get down to 3 regions. Plan is to go slowly - spin one down, monitor latency and cost, and make a decision to continue.

Alex did believe there were some regions that could be spun down, but not as many as we believed.

@fredt for visibility.

addison · February 27, 2023, 7:52pm

Sounds good. What’s the methodology for monitoring latency?

ArtSabintsev · February 28, 2023, 3:37am

I’d defer to @fredt here since he’s overseeing this initiative.

fredt · February 28, 2023, 2:45pm

Our Team has been focused heavily on the Portal v2 work and QoS over any regional reductions. We have found cost savings in other areas for the time being that are satisfactory and have put this initiative on hold.

I can provide more information as we get back to this workload around methodology, key metrics, etc.

RawthiL · February 28, 2023, 4:15pm

You are not going to retire any portal then?
Not even those that are really close to each other? For example: keep us-west-1 and close us-west-2, keep eu-west-1 and close eu-west-2/eu-west-3, or keep us-east-1 and close us-east-2.

QoS? Are you planning to change the Cherry Picker process? Is there any place I can follow this?