Grove’s Approach to Quality

Introduction

Since the original release of Portal V2 in July 2023, what was formerly known as “the cherry picker” has not existed, and our approach to Quality of Service has been drastically different.

Our view as a Gateway has been that the Pocket Network functions as a permissionless, adversarial network, where reliability is always in question and never guaranteed. We have no preference for suppliers and only prioritize nodes based on the metrics that users demand from us.

To be abundantly clear:

  • The concept of “buckets” for latency does not exist.
  • The allowed distance between block heights has gotten tighter over time.
  • The penalties in our systems have gotten much stronger and stricter.

The truth of the matter is that the game has gotten much more difficult for Suppliers, driven by the demands of end users and the maturation of a commoditized industry. Customers at scale, beyond individual users, shop for vendors based on the standards set by web2 SaaS products. Most customers do not care that our service is decentralized – they are fixated on generating margin off of our service to bring revenue into their business. Any fluctuation in quality negatively impacts both Grove and our customers, and end users can churn off either platform. It has only become easier for end users to shop for and switch between RPC providers, and with that ease comes the necessity of providing best-in-class service to retain paying customers.

What is Quality of Service?

While there are many definitions of Quality of Service, we have targeted a typical enterprise-grade QoS for web2 SaaS services. This includes offering SLAs for Uptime and, in the unique scenario of a decentralized network, what we have defined as Success Rate for each service.

A successful relay is when a user sends a valid request and gets a valid response.

This means that invalid requests (4XX) do not count towards our Success Rate metric, but any valid request that returns a less-than-satisfactory response as a result of an internal Grove Portal error (5XX) or a node error does negatively affect the Success Rate.

To be more concrete, the following conditions negatively affect our Success Rate metric per service:

  • The Portal fails to process the relay to Pocket Network
  • The Portal Dispatchers fail to dispatch the relay to Pocket Network
  • The selected Node fails to return a specification-valid response (i.e. one that can be unmarshalled)
  • The selected Node fails to return the expected data the user requested
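
To make the bookkeeping concrete, here is a minimal sketch of how such outcomes could be folded into a Success Rate number. The type and helper names are hypothetical; this is an illustration of the counting rules above, not Grove’s actual implementation.

```go
package main

import "fmt"

// RelayOutcome is a hypothetical classification of a single relay.
type RelayOutcome int

const (
	OutcomeSuccess       RelayOutcome = iota // valid request, valid response
	OutcomeInvalidClient                     // 4XX-style client error: excluded from the metric
	OutcomePortalError                       // Portal failed to process or dispatch the relay
	OutcomeNodeError                         // node returned an unparseable or wrong response
)

// SuccessRate counts only valid requests in the denominator: client errors
// are skipped, while Portal (5XX) and node failures count as failures.
func SuccessRate(outcomes []RelayOutcome) float64 {
	var valid, success int
	for _, o := range outcomes {
		if o == OutcomeInvalidClient {
			continue // invalid requests do not affect the metric
		}
		valid++
		if o == OutcomeSuccess {
			success++
		}
	}
	if valid == 0 {
		return 1.0
	}
	return float64(success) / float64(valid)
}

func main() {
	outcomes := []RelayOutcome{
		OutcomeSuccess, OutcomeSuccess, OutcomeInvalidClient,
		OutcomePortalError, OutcomeNodeError, OutcomeSuccess,
	}
	fmt.Printf("success rate: %.2f%%\n", SuccessRate(outcomes)*100) // 60.00%
}
```

The key detail is the denominator: client-side 4XX errors are skipped entirely, so only relays that the Portal and its nodes were responsible for answering are counted.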

We do not currently have SLAs on Success Rate, but our end goal with QoS is to be able to provide the same 99.95% (or higher) SLA for Success Rate.
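
For intuition on what that target implies, here is a back-of-the-envelope error budget calculation (assuming a 30-day month purely for illustration; this is not part of any SLA wording):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	month := 30 * 24 * time.Hour // assume a 30-day month for illustration
	for _, sla := range []float64{0.99, 0.9995} {
		budget := time.Duration((1 - sla) * float64(month))
		fmt.Printf("%.2f%% -> ~%s of allowed failure per month\n", sla*100, budget.Round(time.Minute))
	}
}
```

At 99.95%, roughly 22 minutes of failed or unavailable service per month exhausts the budget, versus more than seven hours at 99%.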

The reasoning behind this focus is that users and businesses cannot function on less than these guarantees. Having an Uptime and Success Rate guarantee of 99% is the bare minimum for any SaaS. From our own experience, our deal flow increased drastically when we introduced our SLA.

It is important to note that, even with 18+ months of innovation on this front, these are the per-chain Success Rates for July 2024:

blockchain Success Rate
amoy-testnet-archival 99.93%
oasys-mainnet 99.89%
metis-mainnet 99.88%
iotex-mainnet 99.83%
oasys-mainnet-archival 99.79%
evmos-mainnet 99.76%
blast-archival 99.74%
moonriver-mainnet 99.69%
bsc-archival 99.64%
arbitrum-sepolia-archival 99.61%
klaytn-mainnet 99.54%
gnosischain-archival 99.53%
harmony-0 99.53%
fraxtal-archival 99.30%
fantom-mainnet 99.25%
sepolia-archival 99.23%
osmosis-mainnet 99.18%
scroll-testnet 99.16%
sepolia 98.97%
poly-mainnet 98.74%
opbnb-archival 98.72%
poly-archival 98.72%
eth-mainnet 98.57%
kava-mainnet 98.36%
arbitrum-one 98.31%
avax-archival 98.28%
boba-mainnet 98.28%
optimism-archival 98.16%
gnosischain-mainnet 98.06%
eth-trace 98.00%
optimism-mainnet 97.79%
near-mainnet 97.50%
zklink-nova-archival 97.29%
pokt-archival 97.27%
base-mainnet 97.07%
scroll-mainnet 97.07%
polygon-zkevm-mainnet 96.79%
zksync-era-mainnet 96.45%
celestia-archival 95.64%
sui-mainnet 95.57%
celo-mainnet 95.25%
solana-mainnet 95.02%
holesky-fullnode-testnet 94.71%
kava-mainnet-archival 94.20%
fuse-mainnet 92.39%
base-testnet 91.05%
optimism-sepolia-archival 90.97%
avax-mainnet 90.41%
radix-mainnet 88.69%
mainnet 88.02%
bsc-mainnet 87.18%
moonbeam-mainnet 83.82%
avax-dfk 80.01%
solana-mainnet-custom 79.54%
eth-archival 70.09%
celestia-testnet-da-archival 66.67%
celestia-consensus-archival 40.00%

In summary, there is still a lot of work to do to achieve 99%+ Success Rates across chains, and the difficulty of improving this metric grows exponentially as it approaches 100%.

Algorithms and Approach

Our algorithms use a multi-pronged approach to solve Quality of Service and ensure our SLAs and Success Rates.

The Portal is optimized for this problem statement: users want the correct data as fast as they can get it.

The RPC Trilemma summarizes well what users are looking for: Reliable, Performant, and Cheap infrastructure. When selecting an RPC provider, users often take Reliability for granted – this is a standard established by web2 providers that is a great foundation for Web3. Next is Performance, which from an end user’s perspective is easiest to understand as speed. At scale this also relates to throughput, but the simplest proxy is sheer latency. Last is cost: we are doing everything we can to create a cheap environment for RPC, and we even believe that RPC trends towards being “free” on a longer timescale.

Latency

The top node by latency wins relays. There are no buckets. Nodes are ranked by the latency they provide per request. Being the fastest is always the best.
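
As a rough sketch of what “fastest wins, no buckets” means in practice (the node struct, addresses, and latencies below are hypothetical; the real Portal tracks latency per request and per service):

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// node pairs a node address with its most recently observed latency.
type node struct {
	Address string
	Latency time.Duration
}

// rankByLatency sorts nodes fastest-first: no buckets, just raw latency.
func rankByLatency(nodes []node) []node {
	sort.Slice(nodes, func(i, j int) bool { return nodes[i].Latency < nodes[j].Latency })
	return nodes
}

func main() {
	candidates := []node{
		{"pokt1aaa...", 180 * time.Millisecond},
		{"pokt1bbb...", 42 * time.Millisecond},
		{"pokt1ccc...", 95 * time.Millisecond},
	}
	ranked := rankByLatency(candidates)
	fmt.Println("relay goes to:", ranked[0].Address) // the fastest node wins
}
```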

Checks

We use a series of checks to ensure that nodes are adhering to the specifications we put out for them.

  • A sync check to ensure every node is within a certain block tolerance (called allowance)
  • Checks to guarantee that chains contain the full ledger when staked for archival services
  • Checks on EVM chains to ensure they are representing the right service.

We also include a number of checks for US and EU compliance, as well as checks for edge cases in which adversarial node runners attempt to misrepresent their ability to perform the staked and advertised services.
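
To illustrate the first of these checks, here is a minimal sketch of a sync check with a block allowance. The function name, the allowance value, and the reference-height logic are hypothetical, not the Portal’s actual code.

```go
package main

import "fmt"

// withinAllowance reports whether a node's reported block height is close
// enough to the highest height observed among the session's nodes.
// allowance is the maximum number of blocks a node may lag behind.
func withinAllowance(nodeHeight, referenceHeight, allowance uint64) bool {
	if nodeHeight >= referenceHeight {
		return true // at or ahead of the reference height
	}
	return referenceHeight-nodeHeight <= allowance
}

func main() {
	const allowance = 2 // hypothetical tolerance, in blocks
	reference := uint64(20_000_000)
	for _, h := range []uint64{20_000_000, 19_999_999, 19_999_990} {
		fmt.Printf("height %d in sync: %v\n", h, withinAllowance(h, reference, allowance))
	}
}
```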

Pre and Post Processing

The Portal executes a considerable amount of pre- and post-processing on relays to ensure that they adhere either to the JSON-RPC specification or to the RESTful API specification provided by the service’s source of truth. Returning out-of-specification responses, such as non-200 responses for JSON-RPC or non-4XX responses for RESTful requests, will result in penalization.
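
To make the post-processing idea concrete, here is a rough sketch of spec-level validation for a JSON-RPC response. The field checks mirror the JSON-RPC 2.0 specification, but the function and error wording are hypothetical, not the Portal’s actual code.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// jsonRPCResponse models the fields a JSON-RPC 2.0 response must carry.
type jsonRPCResponse struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      json.RawMessage `json:"id"`
	Result  json.RawMessage `json:"result,omitempty"`
	Error   json.RawMessage `json:"error,omitempty"`
}

// validateJSONRPC unmarshals a raw node response and applies basic
// spec-level checks; any failure here would count against the node.
func validateJSONRPC(body []byte) error {
	var resp jsonRPCResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return fmt.Errorf("unmarshallable response: %w", err)
	}
	if resp.JSONRPC != "2.0" {
		return errors.New(`missing or wrong "jsonrpc" version`)
	}
	if resp.Result == nil && resp.Error == nil {
		return errors.New(`response carries neither "result" nor "error"`)
	}
	return nil
}

func main() {
	good := []byte(`{"jsonrpc":"2.0","id":1,"result":"0x10"}`)
	bad := []byte(`<html>502 Bad Gateway</html>`)
	fmt.Println(validateJSONRPC(good)) // <nil>
	fmt.Println(validateJSONRPC(bad))  // unmarshallable response: ...
}
```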

Penalties

Penalties are used to disqualify nodes and are commensurate with the error. These range from temporary timeouts of ~1min or less, up to and including permanent bans from consideration by the Portal. Penalties are always assessed on a per node address basis and never on a domain basis.

We do reserve the right to ban on a domain basis should we be required to do so by regulation or law, or should we conclude that certain domains are continuously acting in bad faith.
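
A minimal sketch of this bookkeeping is below: per-address penalties with temporary timeouts and permanent bans. The durations, struct, and method names are hypothetical; the real system weighs penalties by the type and severity of the error.

```go
package main

import (
	"fmt"
	"time"
)

// penaltyBook tracks penalties per node address, never per domain.
type penaltyBook struct {
	bannedUntil map[string]time.Time
	permanent   map[string]bool
}

func newPenaltyBook() *penaltyBook {
	return &penaltyBook{bannedUntil: map[string]time.Time{}, permanent: map[string]bool{}}
}

// Penalize applies a timeout commensurate with the error, or a permanent ban.
func (p *penaltyBook) Penalize(addr string, timeout time.Duration, permanent bool) {
	if permanent {
		p.permanent[addr] = true
		return
	}
	p.bannedUntil[addr] = time.Now().Add(timeout)
}

// Eligible reports whether a node may currently be considered for relays.
func (p *penaltyBook) Eligible(addr string) bool {
	if p.permanent[addr] {
		return false
	}
	return time.Now().After(p.bannedUntil[addr])
}

func main() {
	book := newPenaltyBook()
	book.Penalize("pokt1aaa...", 30*time.Second, false) // minor error: short timeout
	book.Penalize("pokt1bbb...", 0, true)               // continued bad faith: permanent ban
	fmt.Println(book.Eligible("pokt1aaa..."), book.Eligible("pokt1bbb...")) // false false
	fmt.Println(book.Eligible("pokt1ccc..."))                               // true (no penalties)
}
```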

New Services and Scalability

With all of the above, adding new services introduces considerable scalability challenges. While we have very much streamlined the generic EVM blockchain with a JSON-RPC interface, as reflected in our Success Rates, we are still far from achieving perfection on RESTful services, let alone services with other interfaces.

We have come to understand that one of the greatest challenges and bottlenecks we will face in the coming months and years is enabling quality, SLA-backed services on POKT Network and attracting their users. That is why we are building Path.

Path and QoS

Path is set to replace our Portal Middleware and will see continued, iterative releases that incorporate more and more pieces of our QoS. Our goal with Path is to enable new gateways to provide an SLA out of the box.

With the official release of Path in Q4 2024 / Q1 2025, we hope to enable the network effects of multiple gateways and users providing QoS checks, allowing us to scale and add more services more quickly, with a high quality of service, to POKT Network.

Please use the Path Roadmap as a reference to understand the approach and timelines there.

  • Path Alpha release: September 2024
  • Path Beta release: Q4 2024

Conclusion

Quality of Service is a moving target – a living and breathing organism – that is still very much in its infancy when being guaranteed on top of a decentralized network. We’re excited at what the future holds but understand that there is still so much work to be done on this front.

We’re hopeful that with the release of Path, and the onboarding of new gateways, we can, as a community, increase the velocity at which we improve Quality of Service and onboard more users to Pocket Network. We plan to take all of our learnings and release them over time to Path, and Path will enable others to contribute their learnings so we can grow the number of high-quality gateways together; scaling both the services offered on Pocket and their quality above and beyond their web2 counterparts as an evolution to the open internet.


Thanks for sharing how Grove’s QoS works at a high level.
Whether or not one agrees with how Grove tackles QoS is irrelevant, as we should expect many gateways implementing whatever they see fit for their business.

I’m more interested in how you plan to enable the community to improve QoS. As you say, it is an open problem, and to some extent it reminds me of what we built for AI RPCs.

  • Will Path have some sort of “modules” that a user can opt to implement (or develop/share)?
  • Will you be developing an endpoint to share metrics with third-party services? IMO it is important to know live measurements in order to develop new ones.
  • Yes, we plan to have a modular approach to QoS on Path, with a mix of modules or components and some embedded checks we believe any gateway would want.
  • This is a feature worth discussing for Path. I think there is value in having some shared metrics. Here is a challenge question: most of the reason we do not share our metrics today is that the cost of sharing them is considerable. Why would gateways share their metrics? What incentive is there to share them with another organization? How does having a shared metrics pool increase the ROI or revenues of a gateway?

This is not a question of gateway ROI, but of community service. Also, one thing is having the possibility of sharing the metrics, and another is actually sharing them.

I’m not saying that portals should give in-depth data or provide relay distributions; a per-node overview is enough. We were able to improve many things just by having a call with you and getting minimal intuitions about network performance. If we want to have the best possible nodes in the network, we need a way to measure them.
Yes, this can create the problem of “a metric is no longer a metric if it becomes an objective,” but as you have said, QoS is a living target and we should not expect the metrics to remain fixed in time. Also, hiding metrics does not mean that they cannot be gamed; it only means that fewer eyes get to see whether they are being gamed.

In a few words, I would like Path to have endpoints ready to share data, the same way Nodie’s Gateway server does. Enabling them or not depends on the portal, but IF the portal owner decides to share the data, doing so should be as simple as possible, with almost no friction.
