I love it. One minor change I'd make is changing the pricing chart to put lowest on the left. On the other highlights, left to right goes from best to worst, but this one is the opposite.
I'm excited to see where things land. What I find interesting is that pricing is either wildly expensive or wildly cheap, depending on your use case. For example, if you want to run GPT-4 to glean insights on every webpage your users visit, a freemium business model is likely completely unviable. On the other hand, if I'm using an LLM to spot issues in a legal contract, I'd happily pay 10x what GPT4 currently charges for something marginally better (It doesn't make much difference if this task costs $4 vs $0.40). I think that the ultimate "winners" in this space will have a range of models at various price points and let you seamlessly shift between them depending on the task (e.g., in a single workflow, I might have some sub-tasks that need a cheap model and some that require an expensive one).
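(To sketch the kind of per-task switching I mean; the model names, prices, and routing rule below are placeholders, not a real product:)

// Illustrative router: send cheap sub-tasks to a cheap model, high-stakes ones to a premium model
const MODELS = {
  cheap: { name: 'mixtral-8x7b-instruct', usdPer1kTokens: 0.0005 }, // placeholder pricing
  premium: { name: 'gpt-4-turbo', usdPer1kTokens: 0.01 }, // placeholder pricing
};

function pickModel(task) {
  return task.highStakes ? MODELS.premium : MODELS.cheap;
}

const workflow = [
  { description: 'Summarize each visited web page', highStakes: false },
  { description: 'Flag risky clauses in a legal contract', highStakes: true },
];

for (const task of workflow) {
  console.log(`${task.description} -> ${pickModel(task).name}`);
}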
nice, I've been looking for something like this! A few notes / wishlist items:
* Looks like for gpt-4 turbo (https://artificialanalysis.ai/models/gpt-4-turbo-1106-previe...), there was a huge latency spike on December 28, which is causing the avg. latency to be very high. Perhaps dropping the top and bottom 10% of requests would help with the average (or switch over to median + include variance); a rough sketch of both is below the list.
* Adding latency variance would be truly awesome, I've run into issues with some LLM API providers where they've had incredibly high variance, but I haven't seen concrete data across providers
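To make the first point concrete, here is a rough sketch of the median and a 10%-trimmed mean over a list of latency samples (purely illustrative; not the site's actual aggregation code):

// Illustrative only: median and 10%-trimmed mean over latency samples (ms)
function median(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function trimmedMean(samples, trim = 0.1) {
  const sorted = [...samples].sort((a, b) => a - b);
  const drop = Math.floor(sorted.length * trim); // drop top and bottom 10% by default
  const kept = sorted.slice(drop, sorted.length - drop);
  return kept.reduce((sum, x) => sum + x, 0) / kept.length;
}

// A single outlier (e.g. the Dec 28 spike) moves the plain mean far more than either of these
const samples = [900, 950, 1000, 1020, 1100, 1150, 1200, 1250, 1300, 9000];
console.log(median(samples), trimmedMean(samples));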
Thanks for the feedback and glad it is useful! Yes, agree that would likely be more representative of what to expect going forward.
I think a view of variance would be a good idea; currently it's only shown in the over-time views. Maybe a histogram of response times or a box-and-whisker plot.
We have a newsletter subscribe form on the website or twitter (https://twitter.com/ArtificialAnlys) if you want to follow future updates
Variance would be good, and I've also seen significant variance on "cold" request patterns, which may correspond to resources scaling up on the backend of providers.
Would be interesting to see request latency and throughput when API calls occur cold (first data point), and at once per hour, once per minute, and once per second, with the first N samples dropped.
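Something like this rough probe is what I have in mind (assuming Node 18+ for the global fetch and performance; the endpoint, payload, and intervals are placeholders):

// Illustrative latency probe: first sample captures the "cold" case, later samples are "warm"
async function timeRequest(url, body, apiKey) {
  const start = performance.now();
  const res = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify(body),
  });
  await res.json(); // stop the clock only after the full response body arrives
  return performance.now() - start;
}

async function probe(url, body, apiKey, { samples = 10, intervalMs = 1000, dropFirstN = 1 } = {}) {
  const latencies = [];
  for (let i = 0; i < samples; i++) {
    latencies.push(await timeRequest(url, body, apiKey));
    await new Promise((resolve) => setTimeout(resolve, intervalMs)); // rerun with per-minute/per-hour intervals
  }
  return { cold: latencies[0], warm: latencies.slice(dropFirstN) };
}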
Also, at least with Azure OpenAI, the AI safety features (filtering & annotations) make a significant difference in time to first token.
Hi HN, thanks for checking this out! The goal of this project is to provide objective benchmarks and analysis of LLM AI models and API hosting providers, to help you compare which to use in your next (or current) project. Benchmark comparisons include quality, price, and technical performance (e.g. throughput, latency).
Any chance of including some of the better fine-tunes, e.g. wizard or tulu? (Worse than mixtral, but I assume other fine-tunes will be better, just like wizard and tulu are better than LLAMA2.)
I guess their cost is the same as the base model, although it would affect performance.
Can a quality score be added for each inference provider for the same model? Many of them use different quantization and approximations, so it's not just price and throughput that matter. Especially for a model like Mixtral.
We've been waiting on Replicate to launch per-token pricing for LLMs because their previous pay-per-second model was uncompetitive - but it looks like they might have just turned it on with no big announcement! They'll go straight to the top of the priority list.
Do Lambda have a serverless inference API? Not aware of them playing in this space yet.
Presume you mean MPT not MPS - yep we'll look into MosaicML soon.
I've been using Mixtral and Bard ever since the end of the year. I am pleased with their performance overall for a mixture of content generation and coding.
It seems to me GPT4 has become short in its outputs; you have to do a lot more CoT-type prompting to get it to actually output a good result. Which is excruciating given how slow it is to produce content.
Seeing ~70-100 tokens/s from Mixtral on Together AI is crazy, and the quality works for my use case as well.
OpenAI is an unreliable provider. Even if their models don't change, as they say, there's a current issue where they blanket-added a guardian tool to enforce content policies that are obscure, and the tool is overeager, causing quite a stir across startups where this manifests on the surface like an outage.
It will get better as they fix it and tune it, but their entire release pipeline is absolutely bonkers: no forewarning, no test environment, no opt-out. It's scarily amateurish for a billion-dollar company.
If we're talking about the API, it seems like it's short because it is shorter. The latest version of GPT-4 (1106) may have a significantly larger input window, but its maximum output is limited to 4096 tokens.
It's likely that ChatGPT uses the 1106 model (or some variant) under the covers, so it probably suffers from the same restricted output window.
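For what it's worth, the cap applies even if you ask for more; a minimal sketch with the OpenAI Node SDK (v4-style; the model name and prompt are just examples):

// Sketch: even with the large input window, output is capped at 4096 tokens for gpt-4-1106-preview
import OpenAI from 'openai';

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

const completion = await client.chat.completions.create({
  model: 'gpt-4-1106-preview',
  messages: [{ role: 'user', content: 'Write out the full implementation, no omissions.' }],
  max_tokens: 4096, // the model's maximum completion length discussed above
});

console.log(completion.choices[0].message.content);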
Can you give an example of a query where you find GPT4 is short with outputs? I've used custom instructions, so that may have shielded me from this change.
At least for me, making tests has been very frustrating: full of "test conditions here" and "continue with the rest of the tests" placeholders.
It _hates_ making assumptions about things it doesn't know for sure, I suspect because of "anti-hallucination" nonsense. Instead it has to be shoved to even try making any assumptions, even reasonable ones.
I know it's capable of making reasonable assumptions for class structures/behaviour, etc., where I can just tweak it as needed to work. It just refuses to. I've even seen comments like "We'll put the rest of the code in later".
Yep, with code generation I have definitely encountered this issue as well. It will often write out a function description as a comment and move on instead of actually writing out the function. In my experience it also does this when it had previously written the code properly but you ask it to rewrite it to tweak something.
Given this JSON: <JSON examples>
And this Table schema: <Table Schema in SQL>
Create JavaScript to insert the JSON into the SQL using knex('table_name')
Below is part of its output:
// Insert into course_module table
await knex('course_module').insert({
  id: moduleId,
  name: courseData.name,
  description: courseData.description,
  // include other required fields with appropriate values
});
It's missing several columns it could populate with the data it knows from the prompt, primarily created_at, updated_at, account_id, user_id, lesson number... and instead I get a comment telling me to do it.
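Something like this is what I was expecting it to produce (field names follow my schema; the exact value expressions are illustrative):

// What a complete insert would look like; created_at/updated_at use the DB clock via knex.fn.now()
await knex('course_module').insert({
  id: moduleId,
  name: courseData.name,
  description: courseData.description,
  lesson_number: courseData.lessonNumber,
  account_id: courseData.accountId,
  user_id: courseData.userId,
  created_at: knex.fn.now(),
  updated_at: knex.fn.now(),
});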
There are a lot of people complaining about this, primarily on Reddit, but usually the ChatGPT fanboys jump in to defend OAI.
- Skip any preamble or qualifications about how a topic is subjective.
- Be terse. Do not offer unprompted advice or clarifications. Speak in specific, topic-relevant terminology. Do NOT hedge or qualify. Do not waffle. Speak directly and be willing to make creative guesses. Explain your reasoning. If you don't know, say you don't know.
- Remain neutral on all topics. Be willing to reference less reputable sources for ideas.
- Never apologize.
- Ask questions when unsure.
When responding in code:
- Do not truncate.
- Do not elide.
- Do not omit.
- Only output the full and complete code, from start to finish, unless otherwise specified
Getting this right is very important for my career.
Since we are talking about throughput of API hosting providers, I wanted to add in the work we have done at Groq. I understand that the team is getting in touch with the ArtificialAnalysis folks to get benchmarked.
This is great. Thank you!
I would be especially interested in more details around speed.
Average is a good starting point, but I would love to also see the standard deviation or the 90th/99th percentiles.
In my experience speed varies a lot, and it makes a big difference whether a request takes 10 seconds or 50 seconds.
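As a rough illustration of what I mean (nearest-rank percentiles over made-up request durations):

// Illustrative only: nearest-rank percentile over request durations (seconds)
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length, Math.max(1, rank)) - 1];
}

const durations = [9.8, 10.2, 11.0, 12.5, 14.1, 18.0, 25.3, 31.7, 42.0, 50.2]; // made-up samples
console.log('p50:', percentile(durations, 50)); // 14.1
console.log('p90:', percentile(durations, 90)); // 42.0
console.log('p99:', percentile(durations, 99)); // 50.2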
Thanks for the feedback! Yes, agree this would be a good idea.
We don't have this view yet, but the best place to get an idea of it on the current site is the /models page (https://artificialanalysis.ai/models): scroll to the over-time graphs and look at the variance. To see if it's being driven by individual hosts, you can also click into the by-model pages and see the over-time graphs, e.g. https://artificialanalysis.ai/models/mixtral-8x7b-instruct
Thanks for putting this together! Amazon is far and away the priciest option here, but I wonder if a big part of that is the convenience tax for the Bedrock service. Would be interesting to compare that to the price of just renting AWS GPUs on EC2.
I'm surprised to see Perplexity's 70B online model score so low on model quality, somehow far worse than Mixtral and GPT-3.5 (they use a fine-tuned GPT-3.5 as the foundational model AFAIK).
I run https://www.labophase.com and my data suggests that it's one of the top 3 models in terms of users liking to interact with it. May I know how model quality is benchmarked to understand this discrepancy?
It's a combination of different quality metrics, which overall have Perplexity not performing as well. That being said, I think we are in the very early stages of model quality scoring/ranking, and (for closed-source models) we are seeing frequent changes. It will be interesting to see how measures evolve and model ranks change.
Quality index is equally-weighted normalized values of Chatbot Arena Elo Score, MMLU, and MT Bench.
We have a bit more information in the FAQ: https://artificialanalysis.ai/faq but thanks for the feedback, will look into expanding more on how the normalization works. We are thinking of ways to improve this generalized metric.
A sticking point is that quality can of course be thought of from different perspectives: reasoning, knowledge (retrieval), use-case specific (coding, math, readability), etc. This is why we show individual scores on the home page and the models page: https://artificialanalysis.ai/models
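As a rough sketch of the idea behind the index (min-max normalization shown for illustration; not necessarily the exact method we use, and the ranges/values below are placeholders):

// Illustrative: equal-weight average of min-max normalized scores; ranges and values are placeholders
function minMaxNormalize(value, min, max) {
  return (value - min) / (max - min);
}

function qualityIndex(model, ranges) {
  const metrics = ['arenaElo', 'mmlu', 'mtBench'];
  const normalized = metrics.map((m) => minMaxNormalize(model[m], ranges[m].min, ranges[m].max));
  return normalized.reduce((sum, x) => sum + x, 0) / normalized.length; // equal weighting
}

const ranges = { arenaElo: { min: 800, max: 1300 }, mmlu: { min: 0, max: 100 }, mtBench: { min: 0, max: 10 } };
console.log(qualityIndex({ arenaElo: 1150, mmlu: 70.6, mtBench: 8.3 }, ranges)); // placeholder numbers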
It's probably beyond the scope of this project, but it would be great to see comparisons across different quant levels (e.g. 4-bit, etc), since this can sometimes result in an extreme drop off in quality, but it's an important factor to consider when hosting your own LLM.
This is awesome. I was looking at benchmarking speed and quality myself but didn't go this far!
I wonder about Claude Instant and Phi 2?
Modal.com for inference felt crazy fast, but I didn't note the metrics.
Good ones to add?
Replicate.com too maybe?
Thanks! For Claude Instant, select the dropdown on the top right of the card where it says '8 Selected' and you can add it to the graphs. Thanks for the suggestions of adding Phi 2 and Modal.com as a host; we can look into these!
I wish more places showed Time To First Token. For scenarios with real-time human interaction, the important things are how long until the first token is returned, and whether tokens are generated faster than people consume them.
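For reference, this is roughly how you can measure it yourself with a streaming API (OpenAI Node SDK v4 shown; the model name and prompt are just examples):

// Sketch: time-to-first-token and overall tokens/s via streaming
import OpenAI from 'openai';

const client = new OpenAI();

const start = performance.now();
const stream = await client.chat.completions.create({
  model: 'gpt-4-1106-preview',
  messages: [{ role: 'user', content: 'Write a short paragraph about latency.' }],
  stream: true,
});

let firstTokenMs = null;
let chunks = 0;
for await (const part of stream) {
  if (part.choices[0]?.delta?.content) {
    if (firstTokenMs === null) firstTokenMs = performance.now() - start;
    chunks += 1; // roughly one chunk per token for chat completions
  }
}
const totalSeconds = (performance.now() - start) / 1000;
console.log(`TTFT: ${Math.round(firstTokenMs)} ms, ~${(chunks / totalSeconds).toFixed(1)} tokens/s`);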
Hey com2kid - if you're still there, we did end up adding boxplots to show variance. They can be seen on the models page https://artificialanalysis.ai/models and on each model's page, where you can view hosts by clicking one of the models. They are toward the end of the page under 'Detailed performance metrics'.
Curious to hear your experience. I built a simple RAG using GPT4-Turbo some weeks ago. Only used it for a few hours but was mostly satisfied. I did notice if I sent it too many documents, it would not find the (one) doc I was looking for.
GPT4 Turbo is top of the class; it does RAG very well. It is important to provide good context with the help of a vector DB, but if you cannot provide relevant documents, it cannot do much.
All the open-source models are super bad at this, and mostly I want to blame the fine-tuning done to climb the leaderboards for affecting the quality.
We have this (and other more detailed metrics) on the models page https://artificialanalysis.ai/models if you scroll down, and for individual hosts if you click into a model (via the nav or by clicking one of the model bars/bubbles) :)
There are some interesting views of throughput vs. latency, whereby some models are slower to the first chunk but faster for subsequent chunks and vice versa, and so suit different use cases (e.g. if you just want a true/false answer vs. a more detailed model response).
Thanks for letting me know. Odd, as it's not occurring with my iOS Safari. Can anyone else please let me know if they are encountering this issue (and their iOS version if possible)? There is a console error, but it should just be a React defaultProps deprecation notice from a library being used (it should not break the DOM).
I thought so too. Could it be that GPT-4 Turbo is more efficient for them to run, so the price is lower, but they try to maintain the token throughput of GPT4 over their API? There are a lot of ways they could allocate and configure their GPU resources so that GPT-4 Turbo provides the same per-user throughput while greatly increasing their system throughput.
Unless they captured many different times and days, that is very likely a factor. GPU resources are constrained enough that during peak times (which vary across the globe) the token throughput will vary a lot.
What do you like about it? Compared to GPT-3.5, Claude Instant seems to be the same or worse in quality according to both human and automated benchmarks, but also more expensive. It seems undifferentiated. And I would rather use Mixtral than either of those in most cases, since Mixtral often outperforms GPT-3.5 and can be run on my own hardware.
Data extraction mostly. It supports long documents, has cheaper input tokens than gpt3 turbo, and when I ask it to stick to the document's information it doesn't try to fill in gaps with its trained knowledge.
Sure, you can't have a chat with it or expect it to do high-level reasoning, but it has enough to do the basic deductions for grounded answers.
We have Claude Instant on the models page: https://artificialanalysis.ai/models
Can add it via the select at the top right of each card where it says '9 Selected' (below the highlight charts)
Definitely agree with your point on Claude Instant though. Much less than half the price and much higher throughput/speed for a relatively small quality decrease (varying by how 'quality' is measured and the use case).