The prompt, for those interested. I find it pretty underspecified, but maybe that's the point. For example, "Business operating hours" could be expanded a little, because "Closed - Opens at XX" is still non-processable in both cases.
You are an expert in Web Scraping, so you are capable to find the information in HTML and label them accordingly. Please return the final result in JSON.
Data to scrape:
title: Name of the business
type: The business nature like Cafe, Coffee Shop, many others
phone: The phone number of the business
address: Address of the business, can be a state, country or a full address
years_in_business: Number of years since the business started
hours: Business operating hours
rating: Rating of the business
reviews: Number of reviews on the business
price: Typical spending on the business
description: Extra information that is not mentioned yet in any of the data
service_options: Array of shopping options from the business, for example, in store shopping, delivery and many others. It should be in format -> option_name: true
is_operating: Whether the business is operating
HTML:
{html}
This should be higher up. This whole blog post is mostly worthless because the way they are extracting data is less than optimal.
Lower-end models do not have the attention to complete tasks like this; GPT-4 Turbo generally will. But to have an optimal pipeline you should really be splitting these tasks into individual units: extract each attribute you want independently, then combine the results however you want. Asking for JSON upfront is also suboptimal.
I have high confidence that I could accomplish this task using a lower end model with a high degree of accuracy.
Edit: I am not suggesting that an LLM is better than whatever traditional parsing methods they may use, simply that the way they are doing it is wrong from an LLM workflow standpoint.
Also, my (limited) experience with prompts tells me that you want to invest more into the “You are” part. I’ll share my understanding, corrections are appreciated.
LLMs aren’t people even in a chat-roleplaying sense. They complete a “document” that can be a plot, a book, a transcript of a conversation. The “AI” side in the chat isn’t an LLM itself, it’s a character (and so are you, it completes your “You: …” replies too - that’s where the driver app stops it and allows you to step in). So everything you put in that header is very important. There are two places where you can do that: right in the chat, as in TFA, or in the “character card” (idk if GPTs have it, no GPT access for me). I found out that properly crafting a character card makes a huge difference and can resolve whole classes of issues.
Idk what will work best in this case, but I’d start with describing what sort of bot it is, how it deals with unclear or incomplete information, how amazing it is (yes, really), its soft/tech skills and problem solving abilities, what other people think of it, their experience and so on. Maybe I would add a few examples of interactions in a free form. Then in the task message I’d give it more specific details about that json.
One more note - at least for 8x7B, the “You are” in the chat is a much weaker instruction than a character card, even if the context is still empty. I low-key believe that’s because it’s a second-class prompt, i.e. the chat document starts with “This is a conversation with a helpful AI bot which yada yada” in… mind, and then in that chat that AI character gets asked to turn into something else, which poisons the setting.
Simply asking the default AI card represents 0.1% of what’s possible and doesn’t give the best results. Prompt Engineering is real.
> I have high confidence that I could accomplish this task using a lower end model with a high degree of accuracy.
Same. I think that no matter how good a model is, this prompt just isn’t a professional task statement and leaves too much to decide. It’s a task that you, as a regular human, would hate to receive.
The prompt does not matter as much as the workflow, which is described above. 1) Extract one attribute at a time. 2) Don't ask for JSON during extraction, though on small binary attributes it might not matter as much. 3) Combine the data later.
There are noticeable differences in how different models perform against the same raw prompt, but generally the workflow is what matters more. The raw text prompt will depend on what model you are using, as there are those differences, but I don't think it's at the level of "prompt engineering" we had a year ago.
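A minimal sketch of that three-step workflow, assuming a hypothetical `ask_llm` callable that takes a prompt string and returns the model's text. The attribute descriptions come from the prompt quoted earlier; the prompt wording, the NONE convention and `fake_llm` are made up for illustration and are not from the article.

```python
import json
from typing import Callable, Dict, Optional

# Subset of the attributes listed in the original prompt.
ATTRIBUTES: Dict[str, str] = {
    "title": "Name of the business",
    "phone": "The phone number of the business",
    "rating": "Rating of the business",
}


def extract_attribute(
    ask_llm: Callable[[str], str], html: str, name: str, description: str
) -> Optional[str]:
    """Step 1 and 2: one attribute per call, plain-text answer, no JSON."""
    prompt = (
        f"From the HTML below, return only the value for '{name}' "
        f"({description}). If it is not present, answer NONE.\n\n{html}"
    )
    answer = ask_llm(prompt).strip()
    return None if answer.upper() == "NONE" else answer


def extract_all(ask_llm: Callable[[str], str], html: str) -> str:
    """Step 3: combine the independent answers and serialize them in code."""
    record = {
        name: extract_attribute(ask_llm, html, name, description)
        for name, description in ATTRIBUTES.items()
    }
    return json.dumps(record)


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call, for demonstration only."""
    if "'title'" in prompt:
        return "Blue Door Cafe"
    if "'rating'" in prompt:
        return "4.5"
    return "NONE"


if __name__ == "__main__":
    print(extract_all(fake_llm, "<div>...</div>"))
```

The JSON is produced by `json.dumps`, not by the model, so it is always well formed regardless of which model sits behind `ask_llm`.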
Character-level operations are difficult for LLMs. Because of tokenization they don't really "perceive" strings as a list of characters. There are LLMs that ingest bytes, but they are intended to process binary data.
Why tough? To me, it seems more prone to issues (hallucinations, prompt injections etc). It is also slower and more expensive at the same time. I also think it is harder to implement properly, and you need to add way more tests in order to be confident it works.
Generally I agree with your point, but there is some value in a parser that doesn’t have to be updated when the underlying HTML changes.
Whether or not this benefit outweighs the significant problems (cost, speed, accuracy and determinism) is up to the use case. For most use cases I can think of, the speed and accuracy of an actual parser would be preferable.
However, in situations where one is parsing highly dynamic HTML (eg if each business type had slightly different output, or you are scraping a site which updates the structure frequently and breaks your hand written parser) then this could be worth the accuracy loss.
You could employ an LLM to give you updated queries when the format changes. This is something where they should shine. And you get something that you can audit and exhaustively test.
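A rough sketch of that split, where the LLM only proposes the queries and plain code does the extraction. The (tag, class) query format, the `FieldExtractor` class and the sample HTML are assumptions made up for illustration; the Python standard library has no CSS selector engine, so this stands in for real selectors or XPath.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple

Query = Tuple[str, str]  # (tag name, class name), e.g. proposed by an LLM


class FieldExtractor(HTMLParser):
    """Captures the first text found inside the first element matching each query."""

    def __init__(self, queries: Dict[str, Query]) -> None:
        super().__init__()
        self.queries = queries
        self.result: Dict[str, Optional[str]] = {field: None for field in queries}
        self._current: Optional[str] = None  # field whose text is being captured

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None:
        classes = (dict(attrs).get("class") or "").split()
        for field, (want_tag, want_class) in self.queries.items():
            if self.result[field] is None and tag == want_tag and want_class in classes:
                self._current = field
                return

    def handle_data(self, data: str) -> None:
        if self._current is not None and data.strip():
            self.result[self._current] = data.strip()
            self._current = None

    def handle_endtag(self, tag: str) -> None:
        # Stop capturing at any closing tag so empty elements do not leak text.
        self._current = None


def extract(html: str, queries: Dict[str, Query]) -> Dict[str, Optional[str]]:
    parser = FieldExtractor(queries)
    parser.feed(html)
    parser.close()
    return parser.result


if __name__ == "__main__":
    sample = '<div><span class="name big">Blue Door Cafe</span><span class="rating">4.5</span></div>'
    print(extract(sample, {"title": ("span", "name"), "rating": ("span", "rating")}))
```

The queries are plain data, so they can be reviewed, stored next to test fixtures, and regenerated by the LLM only when a fixture starts failing.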
There are so many applications for LLMs where having a perfect score is much more important than speed, because getting it wrong is so expensive, damaging, or time consuming to resolve for an organisation.
If you need a perfect score, don't use LLMs. This seems obvious to me, even given the state of the art LLMs. I am a heavy user of GPT4 and I wouldn't bet $1000 bucks on it being 100% reliable for any non-trivial task.
Businesses do bet on imperfect and even criminal models all the time (way before LLMs existed)... they call it cost of doing business when they get it wrong or get caught.
Humans running multishot with a mixture of experts are close to perfect. You can't compare a multishot mixture-of-experts AI to a single human; humans don't work in isolation.
Machine learning models will get better for sure. We don't know if LLMs are the end game, though, and it's not certain that this particular technique is what we'll need to reach the next level.
Or they might not get better. It could be that we are at a local optimum for that sort of thing, and major improvements will have to wait (perhaps for a very long time) for radical new technologies.
Maybe, but it certainly hasn’t been the arc of the past few years. I don’t know how anyone could look at this and assume that it’s likely to slow down.
I remember talking to a radiologist who said he was sure something like this was coming like ten years ago where instead of a radiologist looking at scans manually, a machine would go through a lot of images and flag some for manual review.
My professor (Sir Michael Brady) at university 14 years ago set up a company to do this very thing, and he already had reliable models back before 2010. I believe their company was called Oxford Imaging or something similar.
Yep, everyone seems to forget that ML was available before 2021. Had a conversation recently with my former colleague who learned about some plastic packaging company which used "AI" to predict client orders and inform them about scheduling implications. When I told him that you don't need Transformers and 30GB models for that, he was quasi-confused, cause he kinda knew it but the hype just overtook his knowledge.
In ML courses, you’re taught to try simpler methods and models before turning to more complex ones. I think that’s something that hasn’t made it into the mainstream yet.
A lot of people seem to be using GPT-4 for tasks like text classification and NER, and they’d be much better off fine-tuning a BERT model instead. In vision, too, transformers are great but a lot of times, a CNN is all you really need.
Yes and no. Countless teams have solved exactly this problem at universities and research groups across the world. Technically it's pretty much a solved problem. The hard part is getting the systems out of the labs and certified as an actual product and convincing hospitals and doctors to actually use them.
Changing a single pixel is usually not enough to confuse convolutional neural networks. Even so, human supervision will probably always be quite important.
I've tried to apply it to parsing HTML, as in this article, as part of a pretty long pipeline. I'm using DeepInfra with Mixtral 8x7B and I'm still unsure if I'm going to use it in production.
The problem I'm finding is that the time I wanted to save maintaining selectors and the like is time that I'm spending writing wrapper code and dealing with the mistakes it makes. Some are OK and I can deal with them, others are pretty annoying because it's difficult to deal with them in a deterministic manner.
I've also tried with GPT-4 but it's way more expensive, and despite what this guy got, it also makes mistakes.
I don't really care about inference speed, but I do care about price and correctness.
Might be a silly question, but if you want determinism in this, why don't you get the LLM to write the deterministic code, and use that instead? Interesting experiment, though!
In fact, what about a hybrid of what you're doing now? Initially, you use an LLM to generate examples. And then from those examples, you use that same LLM to write deterministic code?
Have you tried swapping Mistral 8x7B with either command-r 34B, Qwen 1.5 70B, or miqu 70B? Those are all superior in my experience, though suited for slightly different tasks, so experimentation is needed.
Parsing HTML and tagsoup is IMHO not the right application for LLMs since these are ultimately structured formats. LLMs are for NLP tasks, like extracting meaning out of unstructured and ambiguous text. The computational cost of an LLM chewing through even a moderately-sized document can be more efficiently spent on sophisticated parser technologies that have been around for decades, which can also to a degree deal with ambiguous and irregular grammars. LLMs should be able to help you write those.
Yeah I agree - just an hour ago I was dealing with an LLM that was missing a "not" thus inverting the meaning of a rather important simulation parameter!
It makes much more sense to me to have the LLM infer the correct query for extracting data on the page. Much faster and more reliable, and it wouldn't really be a problem to have a human in the loop every now and then.
All the places I see AI being applicable to my work don't require a perfect score, and a threshold is actually much more useful, especially where multiple factors come together to make evaluation to a single value hard.
The problem is asking for facts. LLMs are not a database: they know stuff, but it is compressed, so expect wrong facts, wrong names, wrong dates, wrong anything.
We will need an LLM as a front end that generates a query to fetch the facts from the internet or a database, then maybe formats the facts for your consumption.
This is called Retrieval Augmented Generation (RAG). The LLM driver recognizes a query, it gets sent to a vector database or to an external system (could be another LLM...), and the answer is placed in the context. It's a common strategy to work around their limited context length, but it tends to be brittle. Look for survey papers.
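A toy sketch of the retrieve-then-place-in-context step. Real setups typically use embeddings and a vector database; the word-overlap scoring here is only a stand-in, and the function names, prompt wording and sample documents are made up for illustration.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercased set of alphanumeric words; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: List[str], k: int = 1) -> List[str]:
    """Return the k documents sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, documents: List[str], k: int = 1) -> str:
    """Place the retrieved text in the context ahead of the question."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    docs = [
        "The cafe opens at 8 am on weekdays.",
        "Each Groq chip has 230 MB of SRAM.",
    ]
    print(build_prompt("How much SRAM does a Groq chip have?", docs))
```

The brittleness mentioned above shows up in `retrieve`: if the wrong document ranks first, the model answers from the wrong context and has no way to notice.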
That’s exactly it. It’s ok for LLMs to not know everything, because they _should_ have a means to look up information. What are some projects where this obvious approach is implemented/tried?
But then you need an LLM that can separate grammar from facts. Current LLMs don't know the difference, and that is the main source of these issues: these models treat facts like grammar. That worked well enough to excite people but probably won't get us to a good state.
The weird thing with LLM hallucinations is that the model will usually acknowledge its mistake and correct itself if you call it out. My question is why can't LLMs include a sub-routine to check themselves before answering.
Simply asking itself something like "this answer may not be correct, are you sure you're right?"
>The weird thing with LLM hallucinations is that the model will usually acknowledge its mistake and correct itself if you call it out.
From what I've tested, all of the current models will see a prompt like "are you sure that's correct" and respond "no, I was incorrect [here's some other answer]", irrespective of the accuracy of the original statement.
> My question is why can't LLMs include a sub-routine to check themselves before answering.
Because LLMs don't work in a way for that to be possible if you operate them on their own.
Here is the debug output of my local instance of Mistral-Instruct 8x7B. The prompt from me was 'What is poop spelled backwards?'. It answered 'puoP'. Let's see how it got there starting with it processing my prompt into tokens:
It picked 'pu' even though it was only a ~4% chance of being correct, then instead of picking 'op' it picked 'o'. The last token was a 100% probability of being 'P'.
Output: puoP
At no time did it write 'puoP' as a complete word nor does it know what 'puoP' is. It has no way of evaluating whether that is the right answer or not. You would need a different process to do that.
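One reading of that debug output is that the sampler simply drew a low-probability token. A small sketch of the difference between greedy decoding and sampling; the token names and probabilities are hypothetical, chosen only so that 'pu' sits at 4%, and are not taken from the actual run.

```python
import random
from typing import Dict

# Hypothetical next-token distribution; only the 4% figure echoes the comment.
HYPOTHETICAL_PROBS: Dict[str, float] = {"po": 0.60, "op": 0.36, "pu": 0.04}


def greedy_token(probs: Dict[str, float]) -> str:
    """Always pick the single most likely token."""
    return max(probs, key=probs.get)


def sample_token(probs: Dict[str, float], rng: random.Random) -> str:
    """Draw one token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_token(HYPOTHETICAL_PROBS, rng) for _ in range(10_000)]
    print("greedy:", greedy_token(HYPOTHETICAL_PROBS))
    print("share of 'pu' when sampling:", draws.count("pu") / len(draws))
```

Either way, each step only scores the next token. Nothing in this loop looks at the finished string, which is why checking the answer needs a separate pass.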
That is a common bullshitting strategy: talk a lot of bullshit, and then backtrack and acknowledge you were wrong when people push back. That way they will think you know way more than you do. Many people will see through that, but most will just think you are a humble expert who can acknowledge when you are wrong, rather than someone who always acknowledges being wrong even when you aren't.
People have a really hard time catching such bullshitting from humans, which is why free form interviews don't work.
It's because there's no entity that is actually acknowledging anything. It's generating an answer to your prompt. You can gaslight it into treating anything as wrong or correct.
They simply don't work that way. You are asking it for an answer, it will give you one since all it can do is extrapolate from its training data.
Good prompting and certain adjustments to the text generation parameters might help prevent hallucinations, but it's not an exact science since it depends on how the model was trained. Also, an LLM's training data, frankly, contains a lot of bulls*t.
> If I ask an LLM a very complex and specific question 500 times, if it just doesn't know the facts you'll still get the wrong answer 500 times.
Think the commenter meant use another model/LLM which could give a different answer, then let them vote on the result. Like "old fashioned AI" did with ensemble learning.
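A minimal sketch of that voting idea, assuming the answers have already been collected from several different models. The normalization and the "more than half" threshold are choices made for illustration.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(answers: List[str], min_share: float = 0.5) -> Optional[str]:
    """Return the normalized answer given by more than min_share of the models.

    Returns None when no answer clears the threshold, so the caller can
    escalate to a human or another check instead of trusting a split vote.
    """
    normalized = [answer.strip().lower() for answer in answers if answer and answer.strip()]
    if not normalized:
        return None
    winner, count = Counter(normalized).most_common(1)[0]
    return winner if count / len(normalized) > min_share else None


if __name__ == "__main__":
    print(majority_vote(["Paris", "paris ", "Lyon"]))
    print(majority_vote(["Paris", "Lyon"]))
```

Voting only helps when the models make different mistakes; models trained on the same data can agree on the same wrong answer.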
This test is interesting as a general high-level metric, but overall the way they are extracting data using an LLM is suboptimal, so I don't think the takeaway means much. You could extract this type of data using a low-end model like 8x7B with a high degree of accuracy.
Mixtral works very well with JSON output in my personal experience. The GPT family is excellent, of course, and I would bet Claude and Gemini are pretty good. Mixtral, however, is the smallest of the models and the most efficient.
Especially running on Groq's infrastructure it's blazing fast. In some examples I ran on Groq's API, the query was completed in 70ms. Groq has released API libraries for Python and JavaScript; I wrote a simple Rust example here of how to use the API [1].
Groq's API reports how long it takes to generate the tokens for each request. 70ms for a page of text is well over 100 times faster than GPT, and faster than every other capable model. Accounting for internet latency and whatever queue might exist, the user receives the response within a second, but how fast would this model run locally? Fast enough to generate natural language tokens, generate a synthetic voice, listen again and decode the next request the user might speak to it, all in real time.
With a technology like that, why not talk to internet services with just APIs and no web interface at all? Just functions exposed on the internet that take JSON as input, validate it, and send JSON back to the user? The same goes for every other interface and button around. Why press buttons on every electric appliance, and not just talk to the machine using a JSON schema? Why should users on an internet forum, every time a comment is added, have to press the add comment button, instead of just talking and saying "post it"? Pretty annoying actually.
Groq will soon support function calling. At that point, you would want to describe your data specification and use function calling to do extraction. Tools such as Pydantic and Instructor are good starting points.
Interesting post, but the prompt is missing? How do the LLMs generate the keys? It's likely the mistakes could be corrected with a better prompt or a post check?
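A sketch of what such a post check could look like, assuming the model returns a JSON object as a string. The expected keys are a subset of the fields in the prompt; the list of "missing" markers and the function name are made up for illustration.

```python
import json
from typing import Any, Dict, List, Tuple

EXPECTED_KEYS = {"title", "phone", "rating", "reviews"}
# Strings that models tend to emit for absent values; treated as null here.
MISSING_MARKERS = {"", "n/a", "na", "none", "null", "not available"}


def post_check(raw_json: str) -> Tuple[Dict[str, Any], List[str]]:
    """Normalize missing values to None and report keys that were not asked for."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    cleaned: Dict[str, Any] = {}
    for key in sorted(EXPECTED_KEYS):
        value = data.get(key)
        if isinstance(value, str) and value.strip().lower() in MISSING_MARKERS:
            value = None
        cleaned[key] = value
    unexpected = sorted(set(data) - EXPECTED_KEYS)
    return cleaned, unexpected


if __name__ == "__main__":
    raw = '{"title": "Blue Door Cafe", "phone": "N/A", "rating": 4.5, "stars": 5}'
    print(post_check(raw))
```

This does not catch wrong values, only wrong shapes, but it makes outputs from different models comparable before any accuracy scoring.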
Also, Google SERP page is deterministic (always has the same structure for the same kind of queries), so it would probably be much more effective to use AI to write a parser, and then refine it and use that?
The example here has HTML with a somewhat fixed format. It would indeed have been better to have samples with different format and aiming for a low error rate.
If you are scraping a limited amount of sites, you could for each site ask the LLM for parsing code from some samples, review that, and move on.
Sorry to be nit-picky, but that's the essence of these benchmarks: Mistral putting "N/A" for not available is weird. N/A is not applicable in every use I have ever seen, and they DON'T mean the same thing. I would expect null for not available and N/A for not applicable.
The first entry on Google is Wikipedia [1] for me:
> N/A (or sometimes n/a or N.A.) is a common abbreviation in tables and lists for the phrase not applicable, not available, not assessed, or no answer.
That's interesting, Wikipedia is not on the first page for me. My first hit is the Cambridge dictionary (and then a bunch of other dictionaries). I'm flying right now but IP geolocation puts me in the US.
Meaning of n/a in English
written abbreviation for not applicable: used on a form to show that you are not giving the information asked for because the question is not intended for you or your situation: If a question does not apply to you, please put N/A in the box provided. COMMERCE.
In a data table "not available" is usually the right word for it, like if you have a list of national statistics then some of the values won't be available due to political reasons etc. But all of those mean basically the same thing to the end user: this value isn't there.
I'm from Northern Europe, so not a native English speaker, but in my experience the first reading that comes to mind is Not Available.
If I was to code something and for whatever reason some data wasn't available I would use N/A.
"Not applicable" doesn't feel right to me about N/A.
For instance if there is a table of comparison and for whatever reason there is data missing for some entity, while there should be, I would use N/A. So not applicable feels wrong for me for that reason alone.
LLM performance is about parallelism but also memory bandwidth.
Groq delivers this kind of speed by networking many, many chips together with high bandwidth interconnect. Each chip has only 230 MB of SRAM[0].
From the linked reference:
"In the case of the Mixtral model, Groq had to connect 8 racks of 9 servers each with 8 chips per server. That’s a total of 576 chips to build up the inference unit and serve the Mixtral model."
That's eight racks with ~132GB of memory for the model. A single H100 has 80GB and can serve Mixtral without issue (albeit at lower performance).
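The arithmetic behind those figures, using only the numbers from the quote and the 230 MB per chip mentioned above (decimal units, 1 GB = 1000 MB). The variable names are mine.

```python
import math

RACKS = 8
SERVERS_PER_RACK = 9
CHIPS_PER_SERVER = 8
SRAM_MB_PER_CHIP = 230
H100_MEMORY_GB = 80

# Total chips in the inference unit described in the quote.
total_chips = RACKS * SERVERS_PER_RACK * CHIPS_PER_SERVER

# Aggregate on-chip SRAM across the whole unit.
total_sram_gb = total_chips * SRAM_MB_PER_CHIP / 1000

# How many such chips it takes just to match one H100's memory capacity.
chips_per_h100_capacity = math.ceil(H100_MEMORY_GB * 1000 / SRAM_MB_PER_CHIP)

if __name__ == "__main__":
    print(total_chips, total_sram_gb, chips_per_h100_capacity)
```

This compares capacity only; it says nothing about bandwidth, which is where the SRAM approach gets its speed.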
If you consider the requirements for actual real-world inference serving workloads you need to serve multiple models, multiple versions of models, LoRA adapters, sentence embeddings models (for RAG), etc the economics and physical footprint alone get very challenging.
It's an interesting approach and clearly very, very fast but I'm curious to see how they do in the market:
1) This analysis uses cloud GPU costs for Nvidia pricing. Cloud providers make significant margin on their GPU instances. If you look at qty 1 retail Nvidia DGX, Lambda Hyperplane, etc and compare it to cloud GPU pricing (inference needs to run 24x7) break even on hardware vs cloud is less than seven months depending on what your costs are for hosting the hardware.
2) Nvidia has incredibly high margins.
3) CUDA.
There are some special cases where tokens per second and time to first token are incredibly important (as the article states - real time agents, etc) but overall I think actual real-world production use or deployment of Groq is a pretty challenging proposition.
The Mixtral mixture-of-experts model has way fewer parameters active during inference, and Groq has special-purpose hardware (and probably less concurrent demand).
This is a significant understatement. ChatGPT has an estimated 100m monthly active users.
Groq gets featured on HN from time to time but is otherwise almost completely unknown. According to their stats they have done something like 15m requests total since launch. ChatGPT likely does this in hours (or less).
There's no reason for OpenAI to release the model. They have close to 100% of the market anyway, and releasing GPT-5 likely won't grow the total market, as it is an incremental leap. And it's an open secret that most other models used GPT-4 synthetic data for training to come close to it.
They would likely wait till any model performs better than GPT 4 for the same price
The same reasoning would have applied to GPT-3.5. In hindsight, you can say that it was obviously a good idea to build and ship GPT-4. But hindsight is 20/20.
There are a few differences. Firstly, GPT-3.5 wasn't ahead of PaLM etc. from Google, which was published at the same time as GPT-4.
Secondly, GPT-4 grew the overall AI market. I doubt GPT-5 would do that: according to all the sources, interviews and leaks, GPT-5 won't be a big leap over GPT-4, as the model size and training data won't be significantly larger. (I could be wrong in my assumption that GPT-5 would just be an incremental gain, though.)
Claude 3 Opus is in the capability ballpark of GPT-4, GPT-3.5 has alternatives that are cheaper (Claude 3 Haiku) or cheaper and work offline (Qwen 1.5, Mixtral, …).
Is Claude 3 Opus generating more profits and taking a considerable number of customers from OpenAI? I'm not seeing that yet. Granted, I'm in Europe (outside of the EU) so I can't pay for Opus, but I guess that kinda confirms my statement. GPT-4 is still a good product and there is no market pressure to release GPT-5.
According to Sam Altman in a podcast with Lex Fridman this week, there is no real indication that it will be dropped this year. They will release a new model, but it might not be GPT-5
Which is an indication of nothing. In which world would Sam A. drop any kind of info about such a sensitive topic? If anything, this could just be deception before a massive drop.
Could also be resetting expectations for people who've been expecting GPT-5 (or just GPT-4.5) sooner - been a year now since GPT-4 was released.
The other odd thing from Altman was saying that GPT-4 sucks.
I think the context for both announcements is the recent release of Anthropic's Claude-3, which in its largest "Opus" form beats GPT-4 across the board in benchmarks.
I personally think OpenAI/Altman is a bit scared that any moat/lead they had has disappeared and they are now being out-competed by Anthropic (Claude). Remember that Anthropic as a company was only formed (by core members of the OpenAI LLM team) at the same time as GPT-3 was released, so in the same time it took OpenAI to go from GPT-3 to GPT-4, Anthropic have gone from nothing -> Claude-1 -> Claude-2 -> Claude-3 which beats GPT-4 !!
Anthropic have also had quite a bit of success attracting corporate business, quite a bit of which is more long-term in nature (sharing details of expected future model capabilities so that partners can target those).
So, I think OpenAI is running a bit scared, and I'd interpret this non-announcement of some model (4.5 or 5) "coming soonish" to be them just waving the flag and saying "we'll be back on top soon", which they presumably will be, briefly, when their next release(s) do come out. Altman's odd "GPT-4 sucks" statement might be meant to downplay Claude-3 "Opus" which beats it.
For all the posturing and crypto hate on HN, we're entering a world where it's socially acceptable to use 1000W of computing power and 5 seconds of inference time to parse a tiny HTML fragment which would take microseconds with traditional methods - and people are cheering about it. Time for some self-reflection? That's not very green.
Crypto energy requirements go up as the currency gets more traction.
TFA shows that Groq is many times faster than GPT-4.
Up to 18x, Groq claims. Faster means less energy.
So I think it's just a matter of time until these things become ridiculously power efficient (eg run on phones in sub second times)
Presumably. Less time the giant chip has to draw power for computation.
The point is that everyone's interested in making AI power efficient, while crypto's proof of work is a competition for more power burned hashing and throwing away the result.
It's still a monstrosity compared to a traditional parser. You can even be fancy and use complex parsers that backtrack and can deal with mildly context-sensitive languages (as required for HTML, XML, and many programming languages), and you'd still be more efficient.
This is a valid point, but we are still in the early stages of AI/LLMs, so one would expect the speed and efficiency to improve drastically (perhaps accuracy too) over the coming years.
At least AI & LLMs have large scale practical applications as opposed to crypto (IMO).
It's also interesting to think that IBM released an 8-trillion parameter model back in the 1980s [0]. Granted it was an n-gram model so it's not exactly an apples-to-apples comparison with today's models, but still, quite crazy to think about.
Interesting to see Robert Mercer, the former co-CEO of Renaissance Technologies, is one of the authors on that paper. He is a former IBMer. If his name is unfamiliar, he is a reclusive character who was a major funder of Breitbart, Cambridge Analytica and the Republican candidate in the 2016 presidential election.
I wouldn't call the early McCulloch & Pitts work quite "full-fledged". Also backpropagation, essential for multilayer perceptrons, was not a thing until the 1980s.
It was thought of as early as the 1960s by Rosenblatt but he did not come up with a practical implementation at the time. Lotsa things look obvious in hindsight.
You're partially right. It's obvious that the solution is to combine traditional programming with AI, using traditional programming wherever possible because it's greener. Assuming you want things to turn out well in every possible future scenario, your decisions only matter if AGI isn't right around the corner. So assume it isn't right around the corner. Then there's going to be some interesting combining-together of manual human intervention, traditional software, and AI. We'll need to charge more for some uses of electricity, to incentivise turning AI into traditional software wherever possible.
> We'll need to charge more for some uses of electricity, to incentivise turning AI into traditional software wherever possible.
I don't understand this. This adds bureaucracy and I don't see why different uses need to be charged differently if they all use energy the same.
In other words, if energy costs X per unit, and an inefficient (AI) software takes 30 units and an efficient (traditional) software takes 10 units, then it is already cheaper to run the efficient software, and thus people are already incentivised to do so. There's no need to charge differently. If one day AI turns out to only need 5 units, turning more efficient, then just charge them for 5X. People will gravitate towards the new, efficient AI software naturally then.
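Spelling that arithmetic out, with the price X set to a hypothetical 1.0 and the unit counts taken from the comment. The labels and function name are mine.

```python
from typing import Dict

PRICE_PER_UNIT = 1.0  # "X" in the comment; the actual value does not change the ranking


def running_cost(units: float, price_per_unit: float = PRICE_PER_UNIT) -> float:
    """Same price per unit of energy regardless of what the energy is used for."""
    return units * price_per_unit


def cheapest(options: Dict[str, float]) -> str:
    """Name of the option with the lowest running cost."""
    return min(options, key=lambda name: running_cost(options[name]))


today = {"inefficient AI": 30, "traditional software": 10}
later = {**today, "efficient AI": 5}

if __name__ == "__main__":
    print(cheapest(today))
    print(cheapest(later))
```

With a flat price the cheaper option is always the one using fewer units, which is the commenter's point: the incentive already exists without use-specific pricing.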
Websites will never be fast, will they? Even with 1000x more compute than now, they will just do everything through LLM calls and be just as slow as now.
It's almost a fallacy at this point to declare something bad simply because of the existence of carbon emissions, without first comparing the benefits of what is being produced, and the alternative tradeoffs.
To be fair to GP, they did compare it to alternatives (dumb HTML parsing), but failed to consider versatile HTML parsing or other uses for Groq LLM.
While energy remains cheap and human minds remain expensive, it always makes sense to use AI to reduce human effort.
If one cares about the environment, a carbon cap/tax is what one should campaign for. Then carbon-based energy sources will be curtailed, energy costs will go up, and AI like this will be pushed to become more energy efficient, or other methods will be used instead.
It is a nice idea in principle but ends up being a political tool and a tariff on goods and services of your own country. A global and corruption free carbon tax might work but that is impossible to achieve.
The only way it's gonna work is if a bunch of countries get together, agree a carbon cap/tax, and then tell other countries that they need to join the scheme if they want to trade goods with the group.
One way to combat corruption is to ask an international panel of experts to assess how many extra emissions came from non-official sources in each country and reduce next year's cap by that amount. Then countries have an incentive to stamp out corruption.
I don't know. Corruption gets easier with increased centralization. I think a far better approach is to innovate our way out of it. If carbon-free energy sources are less expensive then the problem will essentially solve itself. A global carbon tax will inevitably lose some portion of global GDP to corruption. That money would likely be better spent in other ways.
Basically, carbon tax is the accountant's solution, innovation is the engineer's.
Carbon-free will take a really long time to be cheaper.
As soon as demand for oil starts to drop, so will oil prices, and I suspect they could go down by a factor of 10 or more and oil-rich nations would still think it worthwhile to exploit at least some reserves.
What a ridiculous complaint. Energy efficiency won't remain static, and even if it were, it's not up to you to decide how to best leverage the available electricity.
Unless you live in a dictatorship it's definitely up to us to decide... Otherwise you leave your voice to the top 0.0001% business owners and expect them to work for your good and not for their own interests
Also read about the rebound effect. Planes are twice as efficient as they were 100 years ago yet they pollute infinitely more as a whole.
There is nothing ridiculous about the comment you're replying to
Yes, you are right, and the future depends on innovation and on using more electricity, with a large percentage of it coming from renewable sources. I don't want to go live on the farm myself.