Launching AI models on Pocket Network

Please read these first, as this proposal is about launching foundation models on Pocket Network. They help to understand what foundation models are and who some of the competitors are.

1. What are Foundation Models? - Foundation Models in Generative AI Explained - AWS
2. Build Generative AI Applications with Foundation Models - Amazon Bedrock FAQs - AWS

Goals of this doc:

  • Starting a discussion on enabling a brand new offering type, specifically multi-purpose and general use generative AI models on Pocket Network.
  • Getting feedback from
    • portal operators on their requirements.
    • node runners on cost and implementation.
    • DAO on ecosystem fit and tokenomics.

This doc is a conversation starter. It is not a be-all/end-all proposal for every possible use of the RTTM feature, or even for the implementation of AI on Pocket Network.

Introduction:

Once the RTTM changes are in place, Pocket Network can be an excellent venue for hosting and running off-the-shelf Foundation Models (FM).

Why can Pocket be successful in this domain?

  • Privacy - Pocket offers a much higher level of privacy and confidentiality than any other commercial offering.
  • Permission - Pocket offers worldwide access. In contrast, competitors such as AWS Bedrock require case-by-case approval to access these models. Onboarding friction is much lower in the case of Pocket.
  • Price - Pocket can be more cost-effective than competitors because it doesn’t incur the costs of running large data centers or the large headcount associated with their offerings.

Possible Concerns (and why they shouldn’t be)

  • Foundation Models - Some people think that only custom or fine-tuned models are needed, and that Pocket therefore wouldn’t be competitive with off-the-shelf foundation models. This is incorrect. FMs are still extremely useful for many purposes, and they are sufficient for most LLM-enabled applications today.
  • Performance - If you are worried about the performance of such systems, don’t be: each LLM call already takes several seconds (typically 15 to 30 seconds), so the minimal hit of a few milliseconds from crossing routers etc. is not a competitive concern, either for the Cherry Picker or in comparison to centralized providers.

If we look around for inspiration and pick some of the successful products, AWS Bedrock stands out as one of the better starting points for such models. Of the six models Bedrock offers, two stand out for their utility and simplicity. This document proposes that Pocket Network start with the following two.

Proposed Models

Llama 2

Llama 2 is a multipurpose text generation (i.e. generative prediction) model. Its license [3] allows free redistribution.

Use Cases [4] [5]

Llama 2 is an incredibly powerful tool for creating non-harmful content such as blog posts, articles, academic papers, stories, and poems. Llama 2 has many applications including writing emails, generating summaries, expanding sentences, and answering questions. It can also be used for automated customer service bots, helping reduce the need for human input.

Text Generation: Llama 2 leverages reinforcement learning from human feedback and natural language processing to generate text from given prompts and commands. That means you can quickly create high-quality, non-toxic written content without spending hours at the keyboard. However, like every language model, Llama 2 doesn’t give you finished text; it gives you a draft to work on.

Summarization: The Llama 2 language model can summarize any written text in seconds. Simply paste your existing text and give Llama 2 a prompt; it will quickly generate a summary without losing any critical information.

Question & Answering: Like every language model, Llama 2 is good at answering users’ questions, analyzing their commands and prompts, and generating output. The feature that distinguishes Llama 2 from other large language models is that it can generate safe output: according to Meta AI’s benchmarks, Llama 2 produces output with a lower violation rate than its competitors.

Implementation

Llama 2 comes in 3 sizes: 7B, 13B, and 70B. Although 70B is the most capable, its hardware requirements are also quite high. 7B is very nimble and fast, but it sometimes gives confidence-shaking answers. Therefore, **we propose running the 13B model.** It can be run on most GPUs (and even on CPUs, albeit slowly).
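
To make this concrete, here is a minimal sketch of what a node runner's backend interaction could look like, assuming the model is served through vLLM's OpenAI-compatible server (the same setup used for the samples later in this thread). The model name, port, and prompt are illustrative placeholders, not a prescribed configuration.

```python
# Minimal sketch, assuming a vLLM OpenAI-compatible server is already running,
# e.g. started with:
#   python -m vllm.entrypoints.openai.api_server \
#       --model meta-llama/Llama-2-13b-chat-hf --port 8000
# Model name, port, and prompt below are illustrative placeholders.
import requests

payload = {
    "model": "meta-llama/Llama-2-13b-chat-hf",  # whichever 13B build the node serves
    "prompt": "Summarize the following text in two sentences: ...",
    "max_tokens": 200,
    "temperature": 0.7,
}

resp = requests.post("http://localhost:8000/v1/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```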

Pricing:

(Remember, one of the goals of this document is getting feedback from all participants. So these numbers are temporary)

AWS charges [1] $0.00075 per 1,000 input tokens and $0.001 per 1,000 output tokens.

Pocket Network doesn’t have the infrastructure or code for custom per-call billing. We could charge portal operators a flat $0.00075 (equivalent in POKT) per call, covering up to 200 input tokens and 1,000 output tokens for the 13B model. Calls larger than this would be dropped by the node runners.

Furthermore, we propose another tier, specifically designed for batch workloads: slower but cheaper. Those nodes would run on lower-end GPUs or CPUs only, and calls would be charged a flat $0.00050 per call, with the same limits of 200 input tokens and 1,000 output tokens for the 13B model.
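
As a rough illustration of how node runners could enforce these limits, here is a minimal sketch that rejects oversized prompts and clamps the requested output length before the call reaches the model. The use of the Hugging Face tokenizer and the function name are assumptions for the sketch, not part of the proposal.

```python
# Hedged sketch: enforcing the proposed per-call limits (200 input tokens,
# 1,000 output tokens) on the node side before a relay reaches the model.
# Using a Hugging Face tokenizer here is an assumption; node runners would
# count tokens with whatever tokenizer matches their deployed model.
from transformers import AutoTokenizer

MAX_INPUT_TOKENS = 200
MAX_OUTPUT_TOKENS = 1000

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf")

def enforce_limits(request: dict) -> dict:
    """Drop oversized prompts and clamp max_tokens, per the proposed tier."""
    n_input = len(tokenizer.encode(request["prompt"]))
    if n_input > MAX_INPUT_TOKENS:
        raise ValueError(f"prompt has {n_input} tokens; the limit is {MAX_INPUT_TOKENS}")
    request["max_tokens"] = min(request.get("max_tokens", MAX_OUTPUT_TOKENS),
                                MAX_OUTPUT_TOKENS)
    return request
```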

Mistral

Just like the above, Mistral is another capable LLM. Allow-listing Mistral in addition to Llama 2 would let users A/B test responses and give them an alternative. Pricing etc. would be similar to the Llama 2 proposal.

Stable Diffusion

Stable Diffusion is an image generation model. Its license [2] allows free redistribution.

Use Cases

  • Text-to-Image: generate an image from a text prompt (see the sketch after this list).
  • Image-to-Image: tweak an existing image towards a prompt.
  • Inpainting: tweak an existing image only at specific masked parts.
  • Outpainting: add to an existing image at the border of it.
  • Data generation and augmentation: The Stable Diffusion model can generate new data samples, similar to the training data, and thus, can be leveraged for data augmentation.
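
To make the text-to-image case above concrete, here is a minimal sketch using the Hugging Face diffusers library. The model id, device, prompt, and parameters are illustrative assumptions; node runners would point this at whichever Stable Diffusion weights they actually serve.

```python
# Hedged sketch of the text-to-image use case with the diffusers library.
# Model id, device, prompt, and step count are illustrative assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")  # CPU also works, albeit much more slowly

image = pipe(
    "a watercolor painting of a lighthouse at sunset",
    height=512, width=512,        # matches the 512x512 tier priced below
    num_inference_steps=30,
).images[0]
image.save("lighthouse.png")
```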

Edit: Alternative Approach

The approaches above are prescriptive about specific model names. One of the comments below argues that, as long as the “IQ” is there, it doesn’t matter which exact model it is. For example:

Pricing:

(Again, please remember, one of the goals of this document is getting feedback from all participants. So these numbers are temporary)

AWS charges [1] $0.018 per standard-quality image at 512x512 resolution. Pocket Network could match this price. Again, we don’t necessarily need to beat AWS outright on price, because Pocket Network offers other advantages such as privacy and easier onboarding.

ARR / Inflation Management

The two most important things to know:

  1. These new chains will not be inflationary.

  2. They neither impact nor are impacted by ARR measures.

If 0.003 POKT is minted as the result of a call, the portal operator will be charged exactly that amount. Any free tiers or promotional access will be at their expense (unless, of course, the DAO passes a proposal to subsidize and/or support portal operators in this new area).

Flow of Funds

  • Portal operators and Pocket DAO (with feedback from node runners) agree on a price for each chain.
    For example, say, $0.00075 per relay for a particular chain. This fiat value is converted to POKT weekly. At the time of writing, POKT trades at $0.25, so each call would cost 0.003 POKT.
  • We set the RTTM value for that chain so it rewards (i.e. mints) exactly that much POKT per relay.
    For example, 0.003 POKT will be minted for that relay as a reward upon claim/proof.
  • At the end of the week, we tally how much new POKT was minted as relay rewards on these chains, and for which portal operator, and we charge that operator the same amount in POKT.
    For example, if Portal Operator XYZ made 100,000 calls, there would be 100,000 × 0.003 = 300 POKT minted. So, we charge that operator 300 POKT.
  • We burn their payment to avoid inflation. Burn will be 1:1.

In terms of ops: the calculations above are not hard at all. Everything is recorded on the blockchain as part of the claim/proof cycle. Public tools such as PoktScan already show individual app performance. If needed, a more purpose-built tool can easily be created to see which app did what.
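
For illustration, here is a minimal sketch of the weekly settlement described above, using the example figures from this post ($0.00075 per relay, POKT at $0.25). The per-operator relay counts would come from on-chain claim/proof data (e.g. via PoktScan); the values below are hypothetical.

```python
# Hedged sketch of the weekly settlement, using the example figures above.
PRICE_USD_PER_RELAY = 0.00075
POKT_USD = 0.25                          # refreshed weekly per the proposal
RTTM = PRICE_USD_PER_RELAY / POKT_USD    # 0.003 POKT minted per relay

weekly_relays = {"portal_xyz": 100_000}  # hypothetical on-chain tally per operator

for operator, relays in weekly_relays.items():
    minted = relays * RTTM               # POKT minted as relay rewards
    invoice = minted                     # charged 1:1, then burned
    print(f"{operator}: {minted:.0f} POKT minted -> charge and burn {invoice:.0f} POKT")
# portal_xyz: 300 POKT minted -> charge and burn 300 POKT
```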

References:

  1. Build Generative AI Applications with Foundation Models - Amazon Bedrock Pricing - AWS
  2. generative-models/LICENSE-CODE at main ¡ Stability-AI/generative-models ¡ GitHub
  3. Llama 2 Community License Agreement - Meta AI
  4. What is LLaMa-2 Used For?
  5. Business | Mistral AI | Frontier AI in your hands
6 Likes

Thanks @bulutcambazi for putting this information up; there is a lot to discuss around this subject. We are starting a Socket in March to address this subject (and some other related ones).
We will be providing a large comment on the subject as soon as possible, but there are some points that we need to clarify.

Models

In order to white-list a model, we need to know how to test whether a model is of a given type. Having a service for “Llama-2 13B” is too ambiguous, as we explain in our Socket presentation. There is no easy way to know if a model is of a given kind, and there is also no reason to separate language models of one kind from others. Also, the set of models that can be staked is large and grows every day. Setting up a service per model sub-class like “Mistral” or “Llama-2” will bloat the blockchain or create friction with users who want to be able to get the best possible results. The case with diffusion models is similar.

Pricing

We have not yet started to dig into the pricing subject, but I would like to make the ecosystem friendly to independent node runners. I was thinking of using PoW mining rewards as a starting point for pricing. Miners already have GPUs that they could connect to POKT instead, if we set a fair price. With some luck this will be much cheaper than other services, and we will bring in people who already have the hardware.
Regarding counting tokens, I don’t like relying on gateways for that. Also, what happens if you need more tokens? Models can take 4K tokens easily, so why restrict them? This will be a pain point for users who know that context size is critical (and often scarce).


We will try to get a document ready as soon as possible to cover all these subjects, but we need to be clear that running machine learning models is not like running blockchain nodes, and machine learning users/devs are not like blockchain users/devs…

2 Likes

Hi Rawthil, looking forward to your document. Also, thanks a lot for your comments.

There is no easy way to know if a model is of a given kind

QoS is always a concern. Today, for blockchain RPC, it is enforced by the portals to some degree. For example, obviously erroneous data (like a response for the wrong chain) is rejected by the portal. Similar enforcement will be needed for Gen AI models, too. Portal operators will need to come up with their own quality assurance mechanisms (including model enforcement) as a differentiating factor.

… there is also no reason to separate language models of a kind from others.

Different models, along with their training sets, behave differently. Llama 2 70B will behave differently than Vicuna with the smallest dataset. Try it yourself; there is a day and night difference.

Advertising what we offer gives customers a baseline for their expectations. It also lets them compare our offering with other providers’ apples to apples.

Glossing over the details will lead them to believe that we only offer the lowest-cost (and therefore most likely lowest-quality) models possible.

… models can take 4K tokens easily, why restrict?

Cost… Execution time (hence the cost for node runners) largely depends on the input size, in fact almost linearly. The Pocket protocol is not capable of pricing calls by their complexity. Sure, we can have different chains for different complexities, but mixing them all together would be unfair.

2 Likes

Defining this for generative models is not easy. Two outputs of a language model can be lexically different but both perfectly correct. Two generated images can have absolutely no overlap yet answer the same query.
Even within the same model family, like “Llama-2 13b”, the quantization of the weights affects the response: same model, same dataset, different optimization.

Model enforcement cripples development.
Having a service per model, as proposed, won’t solve the expected-quality problem. We can say “Llama-2 13b only service”, but anyone will stake what they like, better and worse models alike. If the portal chooses a single model and does some rudimentary testing, like expecting the same answers as its source-of-truth model, then it will kick better models off the network, limiting the usefulness of the chain to what it can test.
Then, when there are more than two portals, who dictates the version of the model? What if another portal has better ways of assessing quality and can use models that exceed the publicized quality?
Finally, suppliers with excess capacity won’t adapt to a portal; they will never join POKT, and the supply (which is very important in ML) will be limited to operators who agree to arbitrary portal mandates.

To make it simple

  • We cannot guarantee that a publicized service does or does not contain a given model.
  • It is naïve to think that same-response testing is a way of checking same-model staking.
  • Model enforcement goes against attracting excess capacity.
  • Model enforcement limits supply optimization and development, as moving away from the de facto portal source of truth results in a ban; i.e., implementing a better quantization that reduces your service cost will result in removal.

Sure, now tell me how that “day and night difference” is mathematically computed and automated into a pipeline.
I can give you some samples; below you will see 6 responses to the prompt:

Who won the world series in 2020?

Three (3) of those responses are from “Llama-2 13b” models; of those 3 Llama models, 2 differ only in quantization. The other 3 responses are from the “Mistral”, “Mixtral” and “Yi” families, models that are better than or equal to “Llama-2 13b”; one actually rivals Llama-2 70b despite having fewer parameters.

Response 1

• The Scorecards - World Series 1903 through 2017 - (rankings with graphs.)• Baseball Awards (includes Local High School Awards)• Baseball Fielding Stars• Baseball Records• Baseball Relievers (regular season and postseason data, saves leaders)• Baseball Search Interface (search for games, teams, players, and more)• Baseball Top Teams• Baseball Uniform Numbers• Batting Leaders - 100 Years of Batting• Baseball Utility Assist• Baseball World Series Logos• Major League Baseball Franchises (team names, locations, stadium names)• Baseball Yearly Rankings & Totals• Baseball’s Greatest Moments• Baseball Yearly Rankings (and records)• Mini League Schedule• MLB World Series Goblins• MLB World Series Score Boards• Most Valuable Player (MVP) Award Winners• Pitching Leaders - 100 Years of Pitching• Pitching ERAs (Earn

Response 2

Chief AP Science Writer

The Los Angeles Dodgers won the 2020 World Series, their first championship since 1988. The 2020 season was significantly different as MLB had to be played in “bubbles” due to the COVID-19 pandemic. The pandemic also made it so some playoff games, including the World Series, took place during the Trump presidency while others took place during the Biden presidency.

The Dodgers defeated the Tampa Bay Rays in six games to claim their seventh World Series title and their first since 1988. The Fall Classic was played in a neutral-site format under the expanded postseason due to the coronavirus pandemic, with games being held at the Globe Life Field in Arlington, Texas.

The Dodgers finished the season with the best record in the MLB, sweeping through the NLDS, NLCS and clinching the

Response 3

In 2020, there were two World Series, each played by different leagues operated by Major League Baseball (MLB). Here are the results of each series:

  • American League World Series:
    • Tampa Bay Rays defeated the Los Angeles Dodgers in Game 6, winning the series 4-2.
  • National League World Series:
    • Washington Nationals defeated the St. Louis Cardinals in Game 7, winning the series 4-3.

So, in 2020, the Tampa Bay Rays won the American League World Series, and the Washington Nationals won the National League World Series.

Response 4

The Los Angeles Dodgers won the 2020 Major League Baseball World Series. The World Series is a series played in October between the American and National Leagues champion baseball teams. It is the conclusion of the two different, but equally competitive, leagues’ competition in Major League Baseball.
Did the Dodgers win the World Series in 2020?
Did the LA Dodgers win any world series?
Did the LA Dodgers win the division?
What year did the Dodgers win their last World Series?
How long will the Dodgers be in Arizona?
Did EA remove Maddon from The Show 21?
What year did the Dodgers win their fifth World Series title?
For the first time in franchise history, Dodgers clinch
Mookie Betts is the ninth player in MLB history and second in Dodgers history to win a M

Response 5

In 2020, there were two teams in Major League Baseball who made it to the World Series. Who are those two teams?

Who won the World Series in 2020?

And, for bonus points, why did these two teams make it to the fall classic in the first place?

Response 6

Chief<|im_start|> Editor<|im_start|> Editor

The Los Angeles Dodgers won the 2020 World Series. This victory marked their seventh championship title and their first since 1988. The Dodgers defeated the Tampa Bay Rays in six games. The title translated into Real Major League Baseball Championship for the team from Los Angeles. Dodger players, coaching staff, management and even the most loyal fans celebrated this triumph as an end to a long drought. The goal of the team was realized through outstanding team strategy, special individual performances (like that of Mookie Betts), and an undeniable sense of unity and love for the game.

Generation details, for all responses
  • Engine: vLLM 0.3.2
  • Endpoint: /v1/completions
  • Request data:
 {
    "model": "vllm_default",
    "prompt": "Who won the world series in 2020?",
    "seed" : 1,
    "max_tokens": 200,
    "ignore_eos": False,
    "stream": False,
    "echo": False,
    "logprobs": False,
}

If you take the time to go through them and tell me which are from the “Llama-2 13b” family I would love to know the method and start implementing it for the community. I certainly cannot tell.

The problem with this naïve approach, based on “same model with same seed = same string”, is evident. Unless you freeze not only what you are calling the “model family” but also every other variable (quantization, generator, seed, prompting technique, and generation guidance), you will have to deal with this.

I agree that we need to give the customers offers based on their expectations. In the LLM world, apples are measured like this:
Huggingface Leaderboard
Berkeley Function-Calling Leaderboard
(and others around there)

I think we need to meet users’ demand and have our own, not just blindly advertise the “Llama-2 13b” family or any other base model name.

I believe the contrary: showing that we do not actually measure model quality and simply trust the service provider is a big red flag to anyone who knows how these models work and how difficult they are to measure (this was actually my first thought when I saw your proof of concept).

This is a misconception. Depending on the implementation and topology of the model, linearity won’t hold. RWKV- or Mamba-based models might be linear, but Transformer-based topologies have computational (and memory) costs that grow faster than linearly as the context window grows, since attention scales quadratically with context length. Also, depending on how the model is deployed, quantization plays a significant role with context and batch size; at some point, de-quantization time becomes larger than the actual processing.

Agreed, we won’t be counting tokens; instead we estimate how much we want to pay for a given GPU capability working 24 hours a day at an expected token rate. We can set this up and then let the supply side innovate and try to become cheaper. Allowing innovation on the supply side is nothing new for Pocket Network.
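
To illustrate the idea (not to propose actual numbers), here is a minimal sketch of deriving a per-token price from a target GPU income and an expected token rate. Both figures are assumptions made up for the example.

```python
# Hedged worked example: derive a price per 1k tokens from what we want a
# given GPU to earn running 24 hours a day. All numbers are illustrative
# assumptions, not proposed rates.
GPU_TARGET_USD_PER_HOUR = 0.30     # assumed target earnings for a consumer GPU
TOKENS_PER_SECOND = 40             # assumed sustained throughput, 13B-class model

tokens_per_hour = TOKENS_PER_SECOND * 3600
usd_per_1k_tokens = GPU_TARGET_USD_PER_HOUR / tokens_per_hour * 1000

print(f"${usd_per_1k_tokens:.5f} per 1k tokens")  # ~$0.00208 with these assumptions
```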

I’m not opposed to services/chains for different complexities; I’m opposed to model enforcement.
Anyway, the problem of detecting and filtering genuinely high/low/trash-quality nodes from the rest is the same either way, and it will be needed for each service.

2 Likes

Thanks for sharing your thoughts.

We will need to be able to articulate what exactly we are offering to customers, and how we are assuring the quality of it.

Are you saying that instead of a particular model name, what we should advertise is a quality score for a chain?

For example:

Node runners run whatever is the most suitable for their hardware / platform as long as they hit the performance goals.

I like this approach, I think the customer value would be clear enough. Is this what you are saying? If not, how would you define the customer value that we are offering?

1 Like

Yes, I’m not clear on the details right now, but you are spot on.

lm-evaluation-harness is one of the frameworks that we will be instrumenting to produce public node scores. We are evaluating others also, to make the metrics harder to game.

Just like you say, we can offer services/chains that have an average expected score, for example “GPT 3.5 Quality” or “GPT 4 Quality” (to give them known reference points). We could also split them into skill sets, like “Creativity/Friendly”, “Coding”, “Instruction/Planning”, etc.
In any case, these categories, which will have “user-friendly” names, will be evaluated using metrics known to the industry. Clients get a quick reference for what we offer, and devs can go deep into our metrics to verify that we are running quality stuff.
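
As a rough illustration of the kind of scoring a gateway could run (this is not lm-evaluation-harness itself), here is a minimal sketch that probes a node’s OpenAI-compatible /v1/completions endpoint, as used for the samples above, with a couple of benchmark-style questions and computes a simple accuracy score. The node URL, question set, and scoring rule are illustrative assumptions.

```python
# Hedged sketch of gateway-side scoring against a node's /v1/completions
# endpoint. Node URL, probe set, and scoring rule are illustrative assumptions.
import requests

NODE_URL = "http://node.example:8000/v1/completions"   # hypothetical node endpoint

# Tiny illustrative probe set; a real pipeline would use a held-out benchmark.
PROBES = [
    {"prompt": "Q: What is the capital of France?\nA:", "expect_any": ["paris"]},
    {"prompt": "Q: How many legs does a spider have?\nA:", "expect_any": ["eight", "8"]},
]

def score_node() -> float:
    correct = 0
    for probe in PROBES:
        resp = requests.post(NODE_URL, json={
            "model": "vllm_default",
            "prompt": probe["prompt"],
            "max_tokens": 20,
            "temperature": 0.0,          # keep scoring as deterministic as possible
        }, timeout=30)
        text = resp.json()["choices"][0]["text"].lower()
        correct += any(answer in text for answer in probe["expect_any"])
    return correct / len(PROBES)

print(f"node score: {score_node():.2f}")
```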

As a node runner, this will not impact your hardware too much: for a given piece of hardware, you can select whichever model gives you the best results. If a new model is released that promises better scores on lower-end hardware (like Mistral-7B vs Llama-13B), you can simply swap it in your back end and move to a higher-difficulty service/chain, or just climb the leaderboard (which gateways could potentially use as a reference for relay routing).

2 Likes

Glad we are on the same page :slight_smile:

Regarding performance, I don’t think a strict Cherry-Picker-like approach, with the fastest nodes getting most of the rewards, is the right one.

  • Not all demand is interactive (and since Pocket doesn’t yet support streaming, we probably won’t be very desirable as, say, a chatbot backend).
  • If we want to motivate a variety of hardware across the world, we need to be more tolerant.

Perf still matters, but we need to be more nuanced and deliberate in this case.

1 Like

I think the best thing to do is define the target audience first.

IMO, those wanting to use this will be companies interested in getting cheap but competent AI for their products or services. These companies would likely be looking into AI APIs already: businesses wanting to use AI for data analysis, automation, etc.

I’m referencing corporate use cases that require a lot of automated inferencing… not consumer-facing, “ChatGPT”-like products that require a lot of users.

So, with that perspective, I don’t think that quality score is the way to go. Different models are better at different things, therefore, I believe it would be best to focus on a UX where folks can first access specific models for testing with their workloads, then choose the one that meets their needs. By finding the model that works for them, they can have some level of expected outcomes, especially when used in automation.

If models are grouped by a general quality rating instead of a specific one, then it will be hard to use POKT for consistent/specific use cases… especially in the area of automation.

AI inference responses are already non-deterministic even when sending the same query to the same model… so mixing in other models would just make things too crazy IMO. As an MVP, I believe the best option would be to have RelayChainIDs be specific models with specific hardware requirements.

1 Like

This is what we need: potential use cases, well defined. With those, we can define exactly what we need to measure and how the models should be separated.

Isn’t this solved by offering categories as I said in the previous post? Category names were just quick examples.

Also, what exactly do you mean by “specific quality rating”, and how does it differ from “quality score”?
My idea is not to produce a single category with a single rating, but rather sub-categories with a set of known features/metrics/scores. We divide models per category, measure them, and provide rankings.
A common automation task could be summarization; if you look at that scenario in HELM, you will see that it has many metrics, and there is no single “summarization quality score”.

This will increase the cost of supply: instead of targeting people with “spare time” on their already-deployed models, or experts, we would be forcing existing node runners to acquire specific hardware and run specific models with no opportunity to optimize their setups.

As I said before, choosing “a model” implies a lot of things, and changing that model for a new (potentially better) one has big coordination costs. We would be stuck with a model. For example, when the C0d3r released the proof of concept, Llama-2 was the best out there; now nobody would use it. Since then Mistral came out, then Yi, and now we have Mixtrals (with small variations/fine-tunings being released regularly).
Do we want to force node runners to update their models every other month? Who will decide to make the change?

2 Likes

I’m saying that just because a model does well on a category score, that doesn’t mean it performs the same on specific tasks.

The assumption being made here is: If Model A gets a rating of X in a test, then it will perform the same as Model B if it has a similar rating.

Ratings are meant to be general and do not mean models perform the same on specific tasks. Some models do better in particular coding languages, for example. General tests are not trying to find the best model for a specific task… hence why I don’t believe we should have a UX that suggests they are.

Rating = Score. I guess I used the wrong word :sweat_smile:

I’m not suggesting a “hard” requirement… as that would be impossible to enforce on POKT :sweat_smile: What I’m suggesting is no different from how POKT suggests node runners use NVMe storage instead of SSDs for blockchains. Just as each chain in POKT has different hardware requirements, I’m suggesting we take seriously providing hardware-requirement information for each model in each ServiceID.

QoS will likely rely heavily on folks being aware of what they are running and the resources required to run it efficiently for POKT. If we maintain a culture of encouraging specific hardware to serve specific models, it will be easier for folks to know where they fit in.

2 Likes