I recommend looking at SWE-bench to get a sense of what this product actually achieves: https://www.swebench.com/. The benchmark measures how well a system can resolve real-world GitHub issues. They report comparisons against SOTA models like GPT-4 and Claude 2 (I'd like to see it compared against Claude 3 Opus as well): this product scores 13.86%, versus 4.80% for Claude 2. So for those saying they tried models in the past and they didn't work for their use case, maybe this one will be better?