I love it. One minor change I'd make is changing the pricing chart to put lowest on the left. On the other highlights, left to right goes from best to worst, but this one is the opposite.
I'm excited to see where things land. What I find interesting is that pricing is either wildly expensive or wildly cheap, depending on your use case. For example, if you want to run GPT-4 to glean insights on every webpage your users visit, a freemium business model is likely completely unviable. On the other hand, if I'm using an LLM to spot issues in a legal contract, I'd happily pay 10x what GPT4 currently charges for something marginally better (It doesn't make much difference if this task costs $4 vs $0.40). I think that the ultimate "winners" in this space will have a range of models at various price points and let you seamlessly shift between them depending on the task (e.g., in a single workflow, I might have some sub-tasks that need a cheap model and some that require an expensive one).
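(To sketch the kind of per-task switching I mean; the model names, prices, and routing rule below are placeholders, not a real product:)

// Illustrative router: send cheap sub-tasks to a cheap model, high-stakes ones to a premium model
const MODELS = {
  cheap: { name: 'mixtral-8x7b-instruct', usdPer1kTokens: 0.0005 }, // placeholder pricing
  premium: { name: 'gpt-4-turbo', usdPer1kTokens: 0.01 }, // placeholder pricing
};

function pickModel(task) {
  return task.highStakes ? MODELS.premium : MODELS.cheap;
}

const workflow = [
  { description: 'Summarize each visited web page', highStakes: false },
  { description: 'Flag risky clauses in a legal contract', highStakes: true },
];

for (const task of workflow) {
  console.log(`${task.description} -> ${pickModel(task).name}`);
}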
nice, I've been looking for something like this! A few notes / wishlist items:
* Looks like for gpt-4 turbo (https://artificialanalysis.ai/models/gpt-4-turbo-1106-previe...), there was a huge latency spike on December 28, which is causing the avg. latency to be very high. Perhaps dropping the top and bottom 10% of requests would help with the average (or switch over to median + include variance); a rough sketch of both is below the list.
* Adding latency variance would be truly awesome, I've run into issues with some LLM API providers where they've had incredibly high variance, but I haven't seen concrete data across providers
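To make the first point concrete, here is a rough sketch of the median and a 10%-trimmed mean over a list of latency samples (purely illustrative; not the site's actual aggregation code):

// Illustrative only: median and 10%-trimmed mean over latency samples (ms)
function median(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function trimmedMean(samples, trim = 0.1) {
  const sorted = [...samples].sort((a, b) => a - b);
  const drop = Math.floor(sorted.length * trim); // drop top and bottom 10% by default
  const kept = sorted.slice(drop, sorted.length - drop);
  return kept.reduce((sum, x) => sum + x, 0) / kept.length;
}

// A single outlier (e.g. the Dec 28 spike) moves the plain mean far more than either of these
const samples = [900, 950, 1000, 1020, 1100, 1150, 1200, 1250, 1300, 9000];
console.log(median(samples), trimmedMean(samples));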
Thanks for the feedback and glad it is useful! Yes, agree that would likely be more representative of what to expect going forward.
I think a view of variance would be a good idea; currently it's only shown in the over-time views. Maybe a histogram of response times or a box-and-whisker plot.
We have a newsletter subscribe form on the website or twitter (https://twitter.com/ArtificialAnlys) if you want to follow future updates
Variance would be good, and I've also seen significant variance on "cold" request patterns, which may correspond to resources scaling up on the backend of providers.
Would be interesting to see request latency and throughput when API calls occur cold (first data point), and at once per hour, once per minute, and once per second, with the first N samples dropped.
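Something like this rough probe is what I have in mind (assuming Node 18+ for the global fetch and performance; the endpoint, payload, and intervals are placeholders):

// Illustrative latency probe: first sample captures the "cold" case, later samples are "warm"
async function timeRequest(url, body, apiKey) {
  const start = performance.now();
  const res = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify(body),
  });
  await res.json(); // stop the clock only after the full response body arrives
  return performance.now() - start;
}

async function probe(url, body, apiKey, { samples = 10, intervalMs = 1000, dropFirstN = 1 } = {}) {
  const latencies = [];
  for (let i = 0; i < samples; i++) {
    latencies.push(await timeRequest(url, body, apiKey));
    await new Promise((resolve) => setTimeout(resolve, intervalMs)); // rerun with per-minute/per-hour intervals
  }
  return { cold: latencies[0], warm: latencies.slice(dropFirstN) };
}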
Also, at least with Azure OpenAI, the AI safety features (filtering & annotations) make a significant difference in time to first token.
Hi HN, thanks for checking this out! The goal of this project is to provide objective benchmarks and analysis of LLM AI models and API hosting providers, to help you compare which to use in your next (or current) project. Benchmark comparisons include quality, price, and technical performance (e.g. throughput, latency).
Any chance of including some of the better fine-tunes, e.g. wizard or tulu? (Worse than mixtral, but I assume other fine-tunes will be better, just like wizard and tulu are better than LLAMA2.)
I guess their cost is the same as the base model, although it would affect performance.
Can a quality score be added for each inference provider for the same model? Many of them use different quantization and approximations, so it's not just price and throughput that matter. Especially for a model like Mixtral.
We've been waiting on Replicate to launch per-token pricing for LLMs because their previous pay-per-second model was uncompetitive - but it looks like they might have just turned it on with no big announcement! They'll go straight to the top of the priority list.
Do Lambda have a serverless inference API? Not aware of them playing in this space yet.
Presume you mean MPT not MPS - yep we'll look into MosaicML soon.
I've been using Mixtral and Bard ever since the end of the year. I am pleased with their performance overall for a mixture of content generation and coding.
It seems to me GPT4 has become short in its outputs; you have to do a lot more CoT-type prompting to get it to actually output a good result. Which is excruciating given how slow it is to produce content.
Seeing ~70-100 tokens/s from Mixtral on Together AI is crazy, and the quality works for my use case as well.
OpenAI is an unreliable provider. Even if their models don't change, as they say, there's a current issue where they blanket-added a guardian tool to enforce content policies that are obscure, and the tool is overeager, causing quite a stir across startups where this manifests on the surface like an outage.
It will get better as they fix it and tune it, but their entire release pipeline is absolutely bonkers: no forewarning, no test environment, no opt-out. It's scarily amateurish for a billion-dollar company.
If we're talking about the API, it seems like it's short because it is shorter. The latest version of GPT-4 (1106) may have a significantly larger input window, but its maximum output is limited to 4096 tokens.
It's likely that ChatGPT uses the 1106 model (or some variant) under the covers, so it probably suffers from the same restricted output window.
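For what it's worth, the cap applies even if you ask for more; a minimal sketch with the OpenAI Node SDK (v4-style; the model name and prompt are just examples):

// Sketch: even with the large input window, output is capped at 4096 tokens for gpt-4-1106-preview
import OpenAI from 'openai';

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

const completion = await client.chat.completions.create({
  model: 'gpt-4-1106-preview',
  messages: [{ role: 'user', content: 'Write out the full implementation, no omissions.' }],
  max_tokens: 4096, // the model's maximum completion length discussed above
});

console.log(completion.choices[0].message.content);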
Can you give an example of a query where you find GPT4 is short with outputs? I've used custom instructions, so that may have shielded me from this change.
At least for me, making tests has been very frustrating: full of "test conditions here" and "continue with the rest of the tests" placeholders.
It _hates_ making assumptions about things it doesn't know for sure, I suspect because of "anti-hallucination" nonsense. Instead it has to be shoved to even try making any assumptions, even reasonable ones.
I know it's capable of making reasonable assumptions for class structures/behaviour, etc., where I can just tweak it as needed to work. It just refuses to. I've even seen comments like "We'll put the rest of the code in later".
Yep, with code generation I have definitely encountered this issue as well. It will often write out a function description as a comment and move on instead of actually writing out the function. In my experience it also does this when it had previously written the code properly but you ask it to rewrite it to tweak something.
Given this JSON: <JSON examples>
And this Table schema: <Table Schema in SQL>
Create JavaScript to insert the JSON into the SQL using knex('table_name')
Below is part of its output:
// Insert into course_module table
await knex('course_module').insert({
  id: moduleId,
  name: courseData.name,
  description: courseData.description,
  // include other required fields with appropriate values
});
It's missing several columns it could populate with the data it knows from the prompt, primarily created_at, updated_at, account_id, user_id, lesson number... and instead I get a comment telling me to do it.
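Something like this is what I was expecting it to produce (field names follow my schema; the exact value expressions are illustrative):

// What a complete insert would look like; created_at/updated_at use the DB clock via knex.fn.now()
await knex('course_module').insert({
  id: moduleId,
  name: courseData.name,
  description: courseData.description,
  lesson_number: courseData.lessonNumber,
  account_id: courseData.accountId,
  user_id: courseData.userId,
  created_at: knex.fn.now(),
  updated_at: knex.fn.now(),
});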
There are a lot of people complaining about this, primarily on Reddit, but usually the ChatGPT fanboys jump in to defend OAI.
- Skip any preamble or qualifications about how a topic is subjective.
- Be terse. Do not offer unprompted advice or clarifications. Speak in specific, topic-relevant terminology. Do NOT hedge or qualify. Do not waffle. Speak directly and be willing to make creative guesses. Explain your reasoning. If you don't know, say you don't know.
- Remain neutral on all topics. Be willing to reference less reputable sources for ideas.
- Never apologize.
- Ask questions when unsure.
When responding in code:
- Do not truncate.
- Do not elide.
- Do not omit.
- Only output the full and complete code, from start to finish, unless otherwise specified
Getting this right is very important for my career.
Since we are talking about throughput of API hosting providers, I wanted to add in the work we have done at Groq. I understand that the team is getting in touch with the ArtificialAnalysis folks to get benchmarked.
This is great. Thank you!
I would be especially interested in more details around speed.
Average is a good starting point, but I would love to also see the standard deviation or the 90th/99th percentiles.
In my experience speed varies a lot, and it makes a big difference whether a request takes 10 seconds or 50 seconds.
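As a rough illustration of what I mean (nearest-rank percentiles over made-up request durations):

// Illustrative only: nearest-rank percentile over request durations (seconds)
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length, Math.max(1, rank)) - 1];
}

const durations = [9.8, 10.2, 11.0, 12.5, 14.1, 18.0, 25.3, 31.7, 42.0, 50.2]; // made-up samples
console.log('p50:', percentile(durations, 50)); // 14.1
console.log('p90:', percentile(durations, 90)); // 42.0
console.log('p99:', percentile(durations, 99)); // 50.2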
Thanks for the feedback! Yes, agree this would be a good idea.
We don't have this view yet, but the best place to get an idea of it on the current site is the /models page (https://artificialanalysis.ai/models): scroll to the over-time graphs and look at the variance. To see if it's being driven by individual hosts, you can also click into the by-model pages and see the over-time graphs, e.g. https://artificialanalysis.ai/models/mixtral-8x7b-instruct
Thanks for putting this together! Amazon is far and away the priciest option here, but I wonder if a big part of that is the convenience tax for the Bedrock service. Would be interesting to compare that to the price of just renting AWS GPUs on EC2.
I'm surprised to see Perplexity's 70B online model score so low on model quality, somehow far worse than Mixtral and GPT-3.5 (they use a fine-tuned GPT-3.5 as the foundational model AFAIK).
I run https://www.labophase.com and my data suggests that it's one of the top 3 models in terms of users liking to interact with it. May I know how model quality is benchmarked to understand this discrepancy?
It's a combination of different quality metrics, which overall have Perplexity not performing as well. That being said, I think we are in the very early stages of model quality scoring/ranking, and (for closed-source models) we are seeing frequent changes. It will be interesting to see how measures evolve and model ranks change.
Quality index is equally-weighted normalized values of Chatbot Arena Elo Score, MMLU, and MT Bench.
We have a bit more information in the FAQ: https://artificialanalysis.ai/faq but thanks for the feedback, will look into expanding more on how the normalization works. We are thinking of ways to improve this generalized metric.
A sticking point is that quality can of course be thought of from different perspectives: reasoning, knowledge (retrieval), use-case specific (coding, math, readability), etc. This is why we show individual scores on the home page and the models page: https://artificialanalysis.ai/models
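As a rough sketch of the idea behind the index (min-max normalization shown for illustration; not necessarily the exact method we use, and the ranges/values below are placeholders):

// Illustrative: equal-weight average of min-max normalized scores; ranges and values are placeholders
function minMaxNormalize(value, min, max) {
  return (value - min) / (max - min);
}

function qualityIndex(model, ranges) {
  const metrics = ['arenaElo', 'mmlu', 'mtBench'];
  const normalized = metrics.map((m) => minMaxNormalize(model[m], ranges[m].min, ranges[m].max));
  return normalized.reduce((sum, x) => sum + x, 0) / normalized.length; // equal weighting
}

const ranges = { arenaElo: { min: 800, max: 1300 }, mmlu: { min: 0, max: 100 }, mtBench: { min: 0, max: 10 } };
console.log(qualityIndex({ arenaElo: 1150, mmlu: 70.6, mtBench: 8.3 }, ranges)); // placeholder numbers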
It's probably beyond the scope of this project, but it would be great to see comparisons across different quant levels (e.g. 4-bit, etc), since this can sometimes result in an extreme drop off in quality, but it's an important factor to consider when hosting your own LLM.
This is awesome. I was looking at benchmarking speed and quality myself but didn't go this far!
I wonder about Claude Instant and Phi 2?
Modal.com for inference felt crazy fast, but I didn't note the metrics.
Good ones to add?
Replicate.com too maybe?
Thanks! For Claude Instant, select the dropdown on the top right of the card where it says '8 Selected' and you can add it to the graphs. Thanks for the suggestions of adding Phi 2 and Modal.com as a host; we can look into these!
I wish more places showed Time To First Token. For scenarios with real-time human interaction, the important things are how long until the first token is returned, and whether tokens are generated faster than people consume them.
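For reference, this is roughly how you can measure it yourself with a streaming API (OpenAI Node SDK v4 shown; the model name and prompt are just examples):

// Sketch: time-to-first-token and overall tokens/s via streaming
import OpenAI from 'openai';

const client = new OpenAI();

const start = performance.now();
const stream = await client.chat.completions.create({
  model: 'gpt-4-1106-preview',
  messages: [{ role: 'user', content: 'Write a short paragraph about latency.' }],
  stream: true,
});

let firstTokenMs = null;
let chunks = 0;
for await (const part of stream) {
  if (part.choices[0]?.delta?.content) {
    if (firstTokenMs === null) firstTokenMs = performance.now() - start;
    chunks += 1; // roughly one chunk per token for chat completions
  }
}
const totalSeconds = (performance.now() - start) / 1000;
console.log(`TTFT: ${Math.round(firstTokenMs)} ms, ~${(chunks / totalSeconds).toFixed(1)} tokens/s`);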
Hey com2kid - if you're still there, we did end up adding boxplots to show variance. They can be seen on the models page https://artificialanalysis.ai/models and on each model's page, where you can view hosts by clicking one of the models. They are toward the end of the page under 'Detailed performance metrics'.
Curious to hear your experience. I built a simple RAG using GPT4-Turbo some weeks ago. Only used it for a few hours but was mostly satisfied. I did notice if I sent it too many documents, it would not find the (one) doc I was looking for.
GPT4 Turbo is top of the class; it does RAG very well. It is important to provide good context with the help of a vector DB, but if you cannot provide relevant documents, it cannot do much.
All the open-source models are super bad at this, and mostly I want to blame the fine-tuning done to climb the leaderboards for affecting the quality.
We have this (and other more detailed metrics) on the models page https://artificialanalysis.ai/models if you scroll down, and for individual hosts if you click into a model (via the nav or by clicking one of the model bars/bubbles) :)
There are some interesting views of throughput vs. latency, whereby some models are slower to the first chunk but faster for subsequent chunks and vice versa, and so suit different use cases (e.g. if you just want a true/false answer vs. a more detailed model response).
Thanks for letting me know. Odd, as it's not occurring with my iOS Safari. Can anyone else please let me know if they are encountering this issue (and their iOS version if possible)? There is a console error, but it should just be a React defaultProps deprecation notice from a library being used (it should not break the DOM).
I thought so too. Could it be that GPT-4 Turbo is more efficient for them to run, so the price is lower, but they try to maintain the token throughput of GPT4 over their API? There are a lot of ways they could allocate and configure their GPU resources so that GPT-4 Turbo provides the same per-user throughput while greatly increasing their system throughput.
Unless they captured many different times and days, that is very likely a factor. GPU resources are constrained enough that during peak times (which vary across the globe) the token throughput will vary a lot.
What do you like about it? Compared to GPT-3.5, Claude Instant seems to be the same or worse in quality according to both human and automated benchmarks, but also more expensive. It seems undifferentiated. And I would rather use Mixtral than either of those in most cases, since Mixtral often outperforms GPT-3.5 and can be run on my own hardware.
Data extraction mostly. It supports long documents, has cheaper input tokens than gpt3 turbo, and when I ask it to stick to the document's information it doesn't try to fill in gaps with its trained knowledge.
Sure, you can't have a chat with it or expect it to do high-level reasoning, but it has enough to do the basic deductions for grounded answers.
We have Claude Instant on the models page: https://artificialanalysis.ai/models
Can add it via the select at the top right of each card where it says '9 Selected' (below the highlight charts)
Definitely agree with your point on Claude Instant though. Much less than half the price and much higher throughput/speed for a relatively small quality decrease (varying by how 'quality' is measured and the use case).