This is a pretty useless post. You could also follow any of the same thousand tutorials about LLaMA and just use the Hugging Face-format weights that are already uploaded to Hugging Face...
I dunno what's worse: the pointless commentary, the needless gatekeeping, the superfluous white knighting, or the fact that we're getting upvotes for all this nonsense.
1. There is a list of open-source fine-tuning datasets on millions of topics: anime, Lord of the Rings, D&D, customer service responses, finance, code in many programming languages, children's books, religions, philosophies, etc. Every topic imaginable, sort of like a Wikipedia or Reddit of fine-tuning datasets.
2. Users can select one or more available datasets as well as upload their own private datasets.
3. Users can turn-key fine-tune LLaMA 2 or other pre-trained models.
Right now, doing this kind of thing is way beyond the capability of the common user.
I personally don't see a future where common users will ever have to know the phrase "fine-tuning" or worry about it. The most I can see is "Do you consent to share your information with Apple/Meta/X/Microsoft/OpenAI's knowledge engine?" and if you agree, everything they have on you will power an extremely powerful all-encompassing knowledge engine. Probably with some daily recommendations to integrate a new domain into it, like, "We noticed you're into Lord of the Rings, so we went ahead and made your knowledge engine familiar with the collected works of Tolkien, all historical academic and modern interpretations and criticisms, transcripts of the movies, and generative AI fan fiction capabilities."
I don't think the major barrier to the idea would be consumer awareness. For the near term the major barrier will be cost. Just as one example, together.ai offers a fine-tuning service at an advertised cost of $0.001 per 1k tokens used [1]. That will get pricey for even small datasets. No doubt this will come down, but I don't see consumers paying $1000 for a customized AI model that they then have to pay inference costs to run. Maybe once we get consumer devices with AI accelerators (e.g. the Apple Neural Engine) capable of running sufficiently capable LLMs, customers will be willing to customize and run locally.
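For a sense of scale, here's the back-of-the-envelope math at that advertised rate (a rough sketch; the dataset sizes are purely illustrative):

```python
def fine_tune_cost(dataset_tokens, epochs, price_per_1k_tokens=0.001):
    """Estimated fine-tuning cost in dollars: every token is billed on each epoch."""
    return dataset_tokens * epochs * price_per_1k_tokens / 1000

# A modest 10M-token personal dataset for 3 epochs:
small = fine_tune_cost(10_000_000, 3)       # ~ $30
# A 1B-token corpus seen once already hits the $1000 mark:
large = fine_tune_cost(1_000_000_000, 1)    # ~ $1000
```

So the $1000 figure corresponds to roughly a billion training tokens, which is why this pricing matters more for large corpora than for small personal datasets.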
The second point is, we don't know whether fine-tuned models, vector search, or ever-more-massive general-purpose LLMs are the right way to go.
But for business-to-business, I think this might be a viable business. If you had a whole bunch of ready-to-go open-source fine-tuning datasets for commercial applications, you might find a market of businesses that want to run their own models for a variety of reasons.
This sounds like a great fit for Cerebras, if they can set up the text database front end.
They could host the text database for free, and then offer an "oh look, you can train llama on this text right now for cheaper than an Nvidia box" button on every listing.
Then charge through the nose for private business training (kinda like they do now, but charging more).
I agree that it would be almost impossible to defend this kind of business, especially if you stayed committed to open-source datasets. It would come down to the UX and the community if you hoped to survive. Probably long-term you would either have to get into your own pre-trained models, fight the commodity hosting business or aim to get acquired.
This would initially be a community, like Wikipedia, Reddit, Github, etc. People who are passionate about the future of AI, believe in the value of open source data and want their voice to be part of a community of data that will be used to train AIs in the future.
In my wildest dreams, and even reasonably, you could incentivize people with a digital currency. I was thinking something along the lines of a community where members could stake some money ($100/$1000). They would then get "ownership" and moderating rights over the contents of a dataset. Other people could submit content to their dataset, which the moderators could allow or deny. Accepting content would distribute some share of the stake in the form of tokens. The moderators would then be able to re-sell the data in the set to people who want to fine-tune AIs on it. The value of the tokens associated with that dataset would go up, distributing some portion of the profit to the moderators and the contributors.
Can someone share a good tutorial on how to prepare the data? And for fine-tuning, does a 3090 have enough VRAM? I want to do what the author mentioned by fine-tuning the model on my personal data, but I'm not sure how to prepare the data. I tried using vector search + LLM but I find the results very subpar when using a local LLM.
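There's no single standard for data prep, but a common approach (assuming you want the Alpaca-style instruction format that many LLaMA fine-tuning scripts expect; the example pairs and filename here are hypothetical) is to flatten your documents into JSONL records:

```python
import json

def to_records(pairs):
    """Convert (instruction, response) pairs into Alpaca-style training records."""
    return [
        {
            "instruction": instruction,
            "input": "",      # optional extra context; left empty here
            "output": response,
        }
        for instruction, response in pairs
    ]

def write_jsonl(records, path):
    # One JSON object per line, the format most training scripts ingest directly.
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

# Hypothetical personal data turned into Q/A pairs, by hand or with a stronger model:
pairs = [
    ("Summarize my 2021 trip notes.", "You spent two weeks in Portugal..."),
    ("What's my usual coffee order?", "A flat white with oat milk."),
]
write_jsonl(to_records(pairs), "train.jsonl")
```

As for VRAM: a 24GB 3090 is generally enough to fine-tune a 7B model if you use a 4-bit QLoRA-style adapter rather than full fine-tuning.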
I'm looking forward to this! Are you using an adapter (I don't see it mentioned in your article)? I was under the impression you cannot fit 7B at 4-bit since it'll take 25GB of VRAM or so.
I've been a bit out of the loop in this area but would like to get back into it given how much has changed in the LLM landscape in the last 1-2 years. What models are small enough to play with on Colab? Or am I going to have to spin up my own GPU box on AWS to be able to mess around with these models?
Hey, you could use a template on brev.dev to spin up a GPU box with the model and a Jupyter notebook. Alternatively, the Falcon 7B model should be small enough for Colab.
Is there any tutorial on how to use HuggingFace LLaMA 2-derived models? They don't have checkpoint files of the original LLaMA and can't be used with Meta's provided inference code; instead they use .bin files. I am only interested in Python code, so no llama.cpp.
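For pure Python, those .bin files are just ordinary `transformers` checkpoints, so you don't need Meta's inference code at all. A rough sketch (assumes `transformers`, `torch`, and `accelerate` are installed, a GPU is available, and you've been granted access to the gated meta-llama repo; the heavy imports are kept inside the function so the file loads without them):

```python
def build_llama2_prompt(system, user):
    # Prompt template used by the Llama-2 *chat* variants; the base -hf models
    # take plain text without this [INST] wrapper.
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

def generate(prompt, model_name="meta-llama/Llama-2-7b-chat-hf", max_new_tokens=128):
    # Local imports so the helper above stays usable without torch/transformers.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,  # half precision: ~14GB of VRAM for 7B
        device_map="auto",          # needs the accelerate package
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```

Usage would be something like `generate(build_llama2_prompt("You are helpful.", "Hi!"))`; note the 7B model download is on the order of 13GB.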
I'd reconsider your rejection of llama.cpp if I were you. You can always call out to it from Python, but llama.cpp is by far the most active project in this space, and they've gotten the UX to the point where it's extremely simple to use.
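Calling out to it from Python usually goes through the llama-cpp-python bindings rather than a subprocess. A minimal sketch (assumes llama-cpp-python is installed and you have a quantized model file locally; the path is made up, and the import is kept local so the file loads without the package):

```python
def ask_local_llama(prompt, model_path="./llama-2-7b.ggmlv3.q4_0.bin", max_tokens=128):
    # Local import: llama-cpp-python wraps the compiled llama.cpp library.
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, n_ctx=2048, verbose=False)
    out = llm(prompt, max_tokens=max_tokens, stop=["</s>"])
    # Completions come back OpenAI-style: a dict with a "choices" list.
    return out["choices"][0]["text"]
```

The 4-bit quantized 7B file is only ~4GB, which is what makes this practical on ordinary consumer hardware.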
This user on HuggingFace has all the models ready to go in GGML format and quantized at various sizes, which saves a lot of bandwidth:
There was a post yesterday about a 500-line single-file C implementation of llama2 with no dependencies. The llama2 architecture is hard-coded. It shouldn't be too hard to port to Python.
Found the repo, couldn't easily find the HN thread.
Sure. I worked at a company that produced tens of thousands of human-written summaries of news data a year. This was costly and slow, but our clients really valued them. Back in 2019 we fine-tuned an LLM to help. We put a lot of effort into creating a human-in-the-loop experience, highlighting parts of speech that were commonly hallucinated and ensuring that we were allowing humans to focus on the things humans are good at.
We also released some of the data as a free dataset with a commercial option for all of it. This was more successful than I thought it would be and was hoovered up by the kind of people that buy these datasets.
It will have been surpassed by recent developments now, but it was an incredibly enjoyable project.
Large corporates, financial services. Use cases were needle-in-a-haystack-style searching, internal comms, following research topics over time, external newsletters, that kinda stuff. It wasn't particularly high margin but it was a fun business.
Awesome! To protect your privacy on HN, please email nparker2050@gmail.com and let me know whether you prefer getting on a call or keep things in writing. Looking forward to hearing from you!
Here are some actually useful links:
https://blog.ovhcloud.com/fine-tuning-llama-2-models-using-a...
https://huggingface.co/meta-llama/Llama-2-70b-hf
https://huggingface.co/meta-llama/Llama-2-7b-hf