I'm not aware of 100 coding benchmarks, but there are over 100 LLM benchmarks. This makes sense, as there will eventually be at least one benchmark for each human task.
In addition to automated benchmarks, there are also human-rated evaluations, such as Chatbot Arena.
I manually tested DeepSeek v3 against Claude 3.5 Sonnet. In my human evaluation, Claude 3.5 Sonnet outperformed DeepSeek v3, and it also leads DeepSeek v3 on SWE Bench. So the post title's claim that "DeepSeek v3 beats Claude 3.5 Sonnet and is way cheaper" is wrong.
That said, I was surprised by how well it performed. It's fast. Ironically, I have a paid Claude Team plan, and while I was running these evaluations, Claude was experiencing performance issues (https://status.anthropic.com) while DeepSeek v3 was not. That says something about the state of chip sale restrictions.