Introduction
Since the original release of Portal V2 in July 2023, what was formerly known as “the cherry picker” has not existed and the approach to Quality of Service has been drastically different.
Our view as a Gateway has been that the Pocket Network functions as a permissionless, adversarial network, where reliability is always in question and never guaranteed. We have no preference for suppliers and only prioritize nodes based on the metrics that users demand from us.
To be abundantly clear:
- The concept of “buckets” for latency does not exist.
- The allowed distance between block heights has gotten tighter over time.
- The penalties in our systems have gotten much stronger and stricter.
The truth of the matter is that the game has gotten much more difficult for Suppliers, and this is because of the demands of end users and the maturation of a commoditized industry. Customers operating at scale, as opposed to individual users, shop for vendors based on the standards set by web2 SaaS products. Most customers do not care that our service is decentralized – they are fixated on generating margin off of our service to bring in revenue to their business. Any fluctuations in quality negatively impact both Grove and our customers, because end users can churn off of either platform. It has only become easier for end users to shop for and switch between RPC providers, and with this ease has come the necessity of providing best-in-class services to retain paying customers.
What is Quality of Service?
While there are many definitions of Quality of Service, we have targeted a typical enterprise-grade QoS for web2 SaaS services. This includes offering SLAs for Uptime and, in the unique scenario of a decentralized network, what we have defined as Success Rate for each service.
A successful relay is when a user sends a valid request and gets a valid response.
This means that invalid requests (4XX) do not count towards our Success Rate metric, but any valid request that fails to return a satisfactory response because of an internal Grove Portal error (5XX) or a node error does negatively affect the Success Rate.
To be more specific, the following conditions negatively affect our Success Rate metric per service:
- The Portal fails to process the relay to Pocket Network
- The Portal Dispatchers fail to dispatch the relay to Pocket Network
- The selected Node fails to return a specification-valid response (i.e. one that can be unmarshalled)
- The selected Node fails to return the expected data the user requested
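The definition above can be sketched in a few lines. This is a minimal illustration, not the Portal's implementation: the `Relay` record and its field names are hypothetical, and it assumes each relay can be reduced to a status code plus a flag saying whether the node's response was usable.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Relay:
    """One relay outcome. Field names are illustrative, not the Portal's schema."""
    status_code: int            # HTTP-style status returned to the user
    node_response_valid: bool   # response unmarshalled and held the expected data


def success_rate(relays: List[Relay]) -> Optional[float]:
    """Share of valid requests that received a valid response.

    Invalid requests (4XX) are dropped from the denominator entirely.
    Returns None when there were no valid requests to score.
    """
    valid_requests = [r for r in relays if not 400 <= r.status_code < 500]
    if not valid_requests:
        return None
    successes = sum(
        1 for r in valid_requests
        if r.status_code < 400 and r.node_response_valid
    )
    return successes / len(valid_requests)


relays = [
    Relay(200, True),    # valid request, valid response: success
    Relay(200, False),   # node returned unusable data: counts against the rate
    Relay(500, False),   # internal portal error: counts against the rate
    Relay(400, False),   # invalid request: ignored
]
rate = success_rate(relays)  # one success out of three valid requests
```

Note that the 4XX relay changes neither the numerator nor the denominator, which is why a misbehaving client cannot drag the metric down.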
We do not currently have SLAs on Success Rate, but our end goal with QoS is to be able to provide the same 99.95% (or higher) SLA for Success Rate.
The reasoning behind this focus is that users and businesses cannot function on less than these guarantees. Having an Uptime and Success Rate guarantee of 99% is the bare minimum for any SaaS. From our own experience, our deal flow increased drastically when we introduced our SLA.
It is important to note that, even with 18+ months of innovation on this front, these are the per-chain success rates for July 2024:
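To put these percentages in perspective, the arithmetic below converts a guarantee into an error budget. It is only a worked illustration: the one-million-relay volume and the 30-day window are assumed round numbers, not Grove figures.

```python
def allowed_failures(total_relays: int, guarantee_pct: float) -> int:
    """Number of failed relays a given Success Rate guarantee tolerates."""
    return round(total_relays * (100.0 - guarantee_pct) / 100.0)


def allowed_downtime_minutes(days: int, guarantee_pct: float) -> float:
    """Minutes of downtime a given Uptime guarantee tolerates over `days`."""
    total_minutes = days * 24 * 60
    return round(total_minutes * (100.0 - guarantee_pct) / 100.0, 1)


budget_99 = allowed_failures(1_000_000, 99.0)      # 10000 failed relays
budget_9995 = allowed_failures(1_000_000, 99.95)   # 500 failed relays
downtime_9995 = allowed_downtime_minutes(30, 99.95)  # 21.6 minutes
```

Moving from 99% to 99.95% shrinks the error budget by a factor of twenty, which is why each additional fraction of a percent is harder to win than the last.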
Blockchain | Success Rate |
---|---|
amoy-testnet-archival | 99.93% |
oasys-mainnet | 99.89% |
metis-mainnet | 99.88% |
iotex-mainnet | 99.83% |
oasys-mainnet-archival | 99.79% |
evmos-mainnet | 99.76% |
blast-archival | 99.74% |
moonriver-mainnet | 99.69% |
bsc-archival | 99.64% |
arbitrum-sepolia-archival | 99.61% |
klaytn-mainnet | 99.54% |
gnosischain-archival | 99.53% |
harmony-0 | 99.53% |
fraxtal-archival | 99.30% |
fantom-mainnet | 99.25% |
sepolia-archival | 99.23% |
osmosis-mainnet | 99.18% |
scroll-testnet | 99.16% |
sepolia | 98.97% |
poly-mainnet | 98.74% |
opbnb-archival | 98.72% |
poly-archival | 98.72% |
eth-mainnet | 98.57% |
kava-mainnet | 98.36% |
arbitrum-one | 98.31% |
avax-archival | 98.28% |
boba-mainnet | 98.28% |
optimism-archival | 98.16% |
gnosischain-mainnet | 98.06% |
eth-trace | 98.00% |
optimism-mainnet | 97.79% |
near-mainnet | 97.50% |
zklink-nova-archival | 97.29% |
pokt-archival | 97.27% |
base-mainnet | 97.07% |
scroll-mainnet | 97.07% |
polygon-zkevm-mainnet | 96.79% |
zksync-era-mainnet | 96.45% |
celestia-archival | 95.64% |
sui-mainnet | 95.57% |
celo-mainnet | 95.25% |
solana-mainnet | 95.02% |
holesky-fullnode-testnet | 94.71% |
kava-mainnet-archival | 94.20% |
fuse-mainnet | 92.39% |
base-testnet | 91.05% |
optimism-sepolia-archival | 90.97% |
avax-mainnet | 90.41% |
radix-mainnet | 88.69% |
mainnet | 88.02% |
bsc-mainnet | 87.18% |
moonbeam-mainnet | 83.82% |
avax-dfk | 80.01% |
solana-mainnet-custom | 79.54% |
eth-archival | 70.09% |
celestia-testnet-da-archival | 66.67% |
celestia-consensus-archival | 40.00% |
In summary, there is still a lot of work to do to achieve 99%+ Success Rates on all chains, and the difficulty of increasing this metric grows exponentially as it approaches 100%.
Algorithms and Approach
Our algorithms use a multi-pronged approach to solve Quality of Service and ensure our SLAs and Success Rates.
The Portal is optimized for this problem statement: users want the correct data as fast as they can get it.
The RPC Trilemma summarizes well what users are looking for: Reliable, Performant, and Cheap Infrastructure. When selecting an RPC provider, users often take Reliability for granted – this is a standard established by web2 providers that is a great foundation for Web3. Next is Performance, which is easiest to understand from an end user perspective as speed. At scale, this also relates to throughput, but the easiest proxy is sheer latency. Last is Cost: we are doing everything we can to create a cheap environment for RPC. We even believe that RPC trends towards being “free” on a longer timescale.
Latency
The top node by latency wins relays. There are no buckets. Nodes are ranked by the latency they provide per request. Being the fastest is always the best.
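A bucket-free ranking can be sketched as a plain sort. The text does not say how per-request latencies are aggregated, so this example assumes a simple mean of recent samples; the function name and node addresses are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def rank_by_latency(samples_ms: Dict[str, List[float]]) -> List[str]:
    """Return node addresses ordered fastest first.

    Assumption: each node is scored by the mean of its recent latency samples.
    There is no bucketing, so a node at 42 ms strictly outranks one at 43 ms.
    Nodes with no samples are left out of the ranking.
    """
    scored = {addr: mean(vals) for addr, vals in samples_ms.items() if vals}
    return sorted(scored, key=scored.get)


observed = {
    "node_a": [42.0, 44.0],    # mean 43 ms
    "node_b": [41.0, 43.0],    # mean 42 ms
    "node_c": [120.0, 80.0],   # mean 100 ms
}
ranking = rank_by_latency(observed)  # ['node_b', 'node_a', 'node_c']
```

With buckets, `node_a` and `node_b` would likely have been treated as equals; with a strict ranking, the one-millisecond difference decides the order.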
Checks
We use a series of checks to ensure that nodes are adhering to the specifications we put out for them.
- A sync check to ensure every node is within a certain block tolerance (called allowance)
- Checks to guarantee that chains contain the full ledger when staked for archival services
- Checks on EVM chains to ensure they are representing the right service.
We also include a number of checks for US and EU compliance as well as some edge cases with regards to adversarial node runners attempting to misrepresent their ability to perform the staked and advertised services.
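The sync check is the simplest of these to illustrate. The sketch below assumes the reference height is the highest block height reported by any node in the set; the actual reference used by the Portal, and the allowance values per chain, are not stated here, so treat both as placeholders.

```python
from typing import Dict, List


def passes_sync_check(node_height: int, reference_height: int, allowance: int) -> bool:
    """True if the node is no more than `allowance` blocks behind the reference."""
    return reference_height - node_height <= allowance


def filter_synced(heights: Dict[str, int], allowance: int) -> List[str]:
    """Keep nodes within the allowance of the highest reported height.

    Assumption: the reference is the maximum height reported by any node.
    A production gateway may use a different or more robust reference.
    """
    if not heights:
        return []
    reference = max(heights.values())
    return [
        addr for addr, height in heights.items()
        if passes_sync_check(height, reference, allowance)
    ]


heights = {"node_a": 1_000_000, "node_b": 999_998, "node_c": 999_990}
synced = filter_synced(heights, allowance=5)  # node_c is 10 blocks behind
```

Tightening the allowance from 5 to 1 in this example would also exclude `node_b`, which shows how a stricter tolerance directly narrows the pool of eligible nodes.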
Pre and Post Processing
The Portal executes a considerable amount of pre- and post-processing on relays to ensure that they adhere either to the JSON-RPC specification or to the RESTful API specification provided by the service's source of truth. Returning out-of-specification responses, such as non-200 responses for JSON-RPC or non-4XX responses for RESTful requests, will result in penalization.
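For the JSON-RPC case, a post-processing check might look like the following. It is a sketch of basic JSON-RPC 2.0 conformance for a single (non-batch) response, not the Portal's actual validation; the function name is hypothetical.

```python
import json


def is_valid_jsonrpc_response(http_status: int, body: str) -> bool:
    """Check a single JSON-RPC 2.0 response for basic conformance."""
    if http_status != 200:           # non-200 is treated as out of specification
        return False
    try:
        payload = json.loads(body)   # the body must unmarshal cleanly
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict) or payload.get("jsonrpc") != "2.0":
        return False
    if "id" not in payload:
        return False
    # JSON-RPC 2.0 requires exactly one of "result" or "error".
    return ("result" in payload) != ("error" in payload)


ok = is_valid_jsonrpc_response(200, '{"jsonrpc":"2.0","id":1,"result":"0x10"}')
bad_status = is_valid_jsonrpc_response(502, '{"jsonrpc":"2.0","id":1,"result":"0x10"}')
bad_body = is_valid_jsonrpc_response(200, "<html>Bad Gateway</html>")
```

A well-formed JSON-RPC error object still passes this check: it is a specification-valid response, even though the call itself did not succeed.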
Penalties
Penalties are used to disqualify nodes and are commensurate with the error. These range from temporary timeouts of ~1min or less, up to and including permanent bans from consideration by the Portal. Penalties are always assessed on a per node address basis and never on a domain basis.
We do reserve the right to ban by domain if we are required to by regulation or law, or if we conclude that certain domains are continuously acting in bad faith.
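One way to model this range of penalties is a table of "blocked until" timestamps keyed by node address. The class below is a hypothetical sketch: the 60-second default mirrors the ~1min timeout mentioned above, but the real durations and the mapping from error type to penalty are not specified here.

```python
import math
from typing import Dict


class PenaltyBook:
    """Tracks disqualifications keyed by node address (illustrative only)."""

    def __init__(self) -> None:
        self._blocked_until: Dict[str, float] = {}

    def timeout(self, address: str, now: float, seconds: float = 60.0) -> None:
        """Temporarily disqualify a node; never shortens an existing penalty."""
        until = now + seconds
        current = self._blocked_until.get(address, 0.0)
        self._blocked_until[address] = max(until, current)

    def ban(self, address: str) -> None:
        """Permanently remove a node from consideration."""
        self._blocked_until[address] = math.inf

    def is_eligible(self, address: str, now: float) -> bool:
        """True once any penalty on this address has expired."""
        return now >= self._blocked_until.get(address, 0.0)


book = PenaltyBook()
book.timeout("node_a", now=0.0)   # blocked until t = 60 s
book.ban("node_b")                # blocked forever
```

Because `timeout` takes the later of the two deadlines, a short timeout applied after a ban does not accidentally lift the ban.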
New Services and Scalability
With all of the above, adding new services introduces a considerable number of scalability issues. While we have very much streamlined the generic EVM blockchain with a JSON-RPC interface, as reflected in our Success Rates, we are still far from achieving perfection on the RESTful services, let alone services with other interfaces.
We have come to understand that one of the greatest challenges and bottlenecks we will face in the coming months and years is enabling quality, SLA-backed services on POKT Network and attracting their users. This is why we are pursuing Path.
Path and QoS
Path is set to replace our Portal Middleware and will have continued, iterative releases that include more and more pieces of our QoS. Our goal with Path is to enable new gateways to provide an SLA out-of-the-box.
With the official release of Path in Q4 2024 / Q1 2025, we hope to enable the network effects of having multiple gateways and users provide QoS checks, allowing us to scale and add high-quality services to POKT Network more quickly.
Please use the Path Roadmap as a reference to understand the approach and timelines there.
- Path Alpha release: September 2024
- Path Beta release: Q4 2024
Conclusion
Quality of Service is a moving target – a living and breathing organism – that is still very much in its infancy when being guaranteed on top of a decentralized network. We’re excited at what the future holds but understand that there is still so much work to be done on this front.
We’re hopeful that with the release of Path, and the onboarding of new gateways, we can, as a community, increase the velocity at which we improve Quality of Service and onboard more users to Pocket Network. We plan to take all of our learnings and release them over time to Path, and Path will enable others to contribute their learnings so we can grow the number of high-quality gateways together, scaling both the services offered on Pocket and their quality above and beyond their web2 counterparts as an evolution of the open internet.