
Honestly you're better off rolling your own (but avoid LangChain like the plague). The actual implementation is simple but the devil is in the details - specifically how you chunk your documents to generate vector embeddings. Every time I've tried to apply general-purpose RAG tools to specific types of documents like medical records, internal knowledge bases, case law, datasheets, and legislation, it's been a mess.

Best case scenario, you can come up with a chunking strategy specific to your use case that will make it work: stuff like grouping all the paragraphs/tables about a register together, grouping tables of physical properties in a datasheet with the table title, or grouping the paragraphs of a PCB layout guideline into a single unit. You also have to figure out how much overlap to allow between the different types of chunks and how many dimensions you need in the output vectors. You then have to link chunks together, so that when your RAG matches a register description it knows to include the chunk with the actual documentation, and the LLM can use the documentation chunk instead of just the description chunk. I've had to train many a classifier to get this part even remotely usable in nontrivial use cases like case law.
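For illustration, here's a minimal sketch of that linking idea (all names are invented, not from any particular library): each chunk records which sibling chunks must ride along with it, and after vector search you expand the hit set before handing it to the LLM.

    from dataclasses import dataclass, field

    @dataclass
    class Chunk:
        chunk_id: str
        text: str
        kind: str                       # e.g. "register_description", "register_docs"
        linked_ids: list[str] = field(default_factory=list)

    def expand_hits(hits: list[Chunk], index: dict[str, Chunk]) -> list[Chunk]:
        # After vector search, pull in every linked chunk so the LLM sees the
        # full documentation, not just the snippet that happened to match.
        seen, out = set(), []
        for hit in hits:
            for c in [hit] + [index[i] for i in hit.linked_ids if i in index]:
                if c.chunk_id not in seen:
                    seen.add(c.chunk_id)
                    out.append(c)
        return out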

Worst case scenario, you have to finetune your own embedding model, because the colloquialisms the general-purpose ones are trained on have little overlap with the terms of art and jargon used in the documents (this is especially bad for legal and highly technical texts IME). This generally requires thousands of examples created by an expert in the field.
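As a rough sketch of what that fine-tuning looks like (using the sentence-transformers 2.x API; the base model name and the legal example pair are placeholders):

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    # Expert-written (query, relevant passage) pairs; you'd want thousands.
    train_examples = [
        InputExample(texts=["estoppel by deed",
                            "A grantor who conveys land by warranty deed is later barred..."]),
    ]
    model = SentenceTransformer("all-MiniLM-L6-v2")
    loader = DataLoader(train_examples, shuffle=True, batch_size=32)
    loss = losses.MultipleNegativesRankingLoss(model)  # other in-batch pairs act as negatives
    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
    model.save("domain-tuned-embedder")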




Disclosure: I'm an engineer at LangChain, primarily focused on LangGraph. I'm new to the team, though - and I'd really like to understand your perspective a bit better. If we're gritting the wheels for you rather than greasing them, I _really_ want to know about it!

> Every time I've tried to apply general purpose RAG tools to specific types of documents like medical records, internal knowledge base, case law, datasheets, and legislation, it's been a mess.

Would it be fair to paraphrase you as saying that people should avoid using _any_ library's ready-made components for a RAG pipeline, or do you think there's something specific to LangChain that is making it harder for people to achieve their goals when they use it? Either way, is there more detail that you can share on this? Even if it's _any_ library - what are we all getting wrong?

Not trying to correct you here - rather stating my perspective in hopes that you'll correct it (pretty please) - but my take as someone who was a user before joining the company is that LangChain is a good starting point because of the _structure_ it provides, rather than the specific components.

I don't know what the specific design intent was (again, new to the team!) but just candidly, as a user, I tend to look at the components as stand-ins that'll help me get something up and running super quickly so I can start building out evals. I might be unusual in this, but I tend to think that until I have evals, I don't really have any idea whether my changes are actually improvements. Once I have evals running against something that does _roughly_ what I want it to do, I can start optimizing the end-to-end workflow. I suspect in 99.9% of cases that'll involve replacing some (many?) of our prebuilt components with custom ones that are more tailored to your specific task.

Complete side note, but for anyone looking at LangChain to build out RAG stuff today, I'd advise using LangGraph for structuring your end-to-end process. You can still pull in components for individual process steps from LangChain (or any other library you prefer) as needed, and you can still use LangChain pipelines as individual workflow steps if you want to, but I think you'll find that LangGraph is a more flexible foundation to build upon when it comes to defining the structure of your overall workflow.
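For a sense of what that structure looks like, here's a minimal two-node LangGraph sketch (the retriever and generator bodies are stubs you'd swap for real components; API as currently documented):

    from typing import TypedDict
    from langgraph.graph import StateGraph, START, END

    class RAGState(TypedDict):
        question: str
        docs: list[str]
        answer: str

    def retrieve(state: RAGState) -> dict:
        return {"docs": ["stub doc about " + state["question"]]}  # swap in a real retriever

    def generate(state: RAGState) -> dict:
        return {"answer": f"stub answer from {len(state['docs'])} docs"}  # swap in an LLM call

    graph = StateGraph(RAGState)
    graph.add_node("retrieve", retrieve)
    graph.add_node("generate", generate)
    graph.add_edge(START, "retrieve")
    graph.add_edge("retrieve", "generate")
    graph.add_edge("generate", END)
    app = graph.compile()
    # app.invoke({"question": "what is the reset value of CTRL_REG1?"})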


> This generally requires thousands of examples created by an expert in the field.

Or an AI model pretending to be an expert in the field... (this has worked well in a few niche domains where I've used it)
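A hedged sketch of that approach, assuming an OpenAI-compatible endpoint (the prompt, persona, and model name are all placeholders):

    from openai import OpenAI

    client = OpenAI()

    PROMPT = (
        "You are a senior attorney. Given the passage below from a court opinion, "
        "write 3 search queries a lawyer might type that this passage answers. "
        "Return one query per line.\n\nPassage:\n{passage}"
    )

    def synthesize_queries(passage: str) -> list[str]:
        resp = client.chat.completions.create(
            model="gpt-4o",  # any capable model; placeholder name
            messages=[{"role": "user", "content": PROMPT.format(passage=passage)}],
        )
        return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]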


Don't forget to finetune the reranker too if you end up doing the embedding model. That tends to have an outsized effect on performance for out-of-distribution content.
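A sketch of the reranker half, again with the sentence-transformers 2.x CrossEncoder API (base model and example pairs are placeholders):

    from torch.utils.data import DataLoader
    from sentence_transformers import CrossEncoder, InputExample

    pairs = [  # in-domain (query, passage) pairs with relevance labels
        InputExample(texts=["CTRL_REG1 reset value",
                            "CTRL_REG1 resets to 0x07, enabling all three axes..."], label=1.0),
        InputExample(texts=["CTRL_REG1 reset value",
                            "The device is packaged in a 16-pin LGA..."], label=0.0),
    ]
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", num_labels=1)
    model.fit(train_dataloader=DataLoader(pairs, shuffle=True, batch_size=16), epochs=1)
    model.save("domain-tuned-reranker")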


I've been looking up chunking techniques, but resources on this are scarce. What's your recommendation?


It’s the big unsolved problem and nobody’s talking about it. I’ve had some decent success asking an expensive model to generate the chunks and combining that with document location, and my next plan for an upcoming project is to do that hierarchically, but there’s no generally accepted solution yet.
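Roughly, the "expensive model generates the chunks" idea looks like this (OpenAI-compatible client; the prompt, delimiter, and model name are all placeholders):

    from openai import OpenAI

    client = OpenAI()

    def llm_chunk(doc_id: str, pages: list[str]) -> list[dict]:
        chunks = []
        for page_no, page in enumerate(pages, start=1):
            resp = client.chat.completions.create(
                model="gpt-4o",  # the "expensive model"; placeholder
                messages=[{"role": "user", "content":
                    "Split the following page into self-contained chunks, one per topic. "
                    "Separate chunks with a line containing only '---'.\n\n" + page}],
            )
            for piece in resp.choices[0].message.content.split("\n---\n"):
                # keep the document location alongside each chunk
                chunks.append({"doc": doc_id, "page": page_no, "text": piece.strip()})
        return chunks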

RAG's big problem is turning PDFs into chunks - both the parsing problem and the chunking problem. I paid someone to do the parsing into markdown for a project recently (including table data summaries) and it worked well. MathPix has a good API for this, but it only works sensibly for PDFs that don't have insane layouts, and many do.
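For reference, the DIY baseline (not the paid pipeline above, and not MathPix - it falls apart on exactly those insane layouts) is just per-page text extraction, e.g. with PyMuPDF:

    import fitz  # PyMuPDF

    def pdf_to_pages(path: str) -> list[str]:
        with fitz.open(path) as doc:
            return [page.get_text("text") for page in doc]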


The data source I have is a filesystem with docs, PDFs, graphs, etc.

Will need to expand folder names and file abbreviations, do repetitive analysis to find footers and headers, locate titles on first pages, and dedupe a lot. It seems like some kind of content+hierarchy+keywords+subtitle combination will need to be vectorized, like a card catalog.
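Something like this sketch, maybe (the abbreviation map and field layout are made up):

    ABBREVIATIONS = {"mfg": "manufacturing", "qa": "quality assurance"}  # example map

    def expand(name: str) -> str:
        parts = name.replace("_", " ").replace("-", " ").split()
        return " ".join(ABBREVIATIONS.get(p.lower(), p) for p in parts)

    def catalog_card(path: str, title: str, keywords: list[str], body: str) -> str:
        # The string that actually gets embedded: hierarchy + title + keywords + content.
        hierarchy = " / ".join(expand(p) for p in path.strip("/").split("/"))
        return (f"Location: {hierarchy}\nTitle: {title}\n"
                f"Keywords: {', '.join(keywords)}\n\n{body}")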


Not the person you asked, but it depends on what you're trying to chunk. I've written a standalone chunking library for an app I'm building: https://github.com/rmusser01/tldw/blob/main/App_Function_Lib...

It's set up so that you can perform whatever type of chunking you might prefer.


If there's a list of techniques and their optimal use cases, I haven't found it. I started writing one for the day job, but then GraphRAG happened, and Gartner is saying all RAG will be GraphRAG.

You can't fight Gartner, no matter how wrong they are, so the work stopped; now everything is a badly implemented graph.

That's a long way of saying: if there is a comparison, a link would be most appreciated.


Semantic chunking is where I would start now. Also check this out: https://github.com/chonkie-ai/chonkie
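If it helps, the core of semantic chunking is small enough to sketch by hand (this is not chonkie's API, just the idea: embed sentences and cut a new chunk when similarity between adjacent sentences drops below a threshold):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def semantic_chunks(sentences: list[str], threshold: float = 0.6) -> list[str]:
        if not sentences:
            return []
        embs = model.encode(sentences, normalize_embeddings=True)
        chunks, current = [], [sentences[0]]
        for i in range(1, len(sentences)):
            # cosine similarity between adjacent sentences (vectors are unit-normalized)
            if float(np.dot(embs[i], embs[i - 1])) < threshold:
                chunks.append(" ".join(current))
                current = []
            current.append(sentences[i])
        chunks.append(" ".join(current))
        return chunks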


> but avoid LangChain like the plague

Can you elaborate on this?

I have a proof-of-concept RAG system implemented with LangChain, but would like input before committing to this framework.


LangChain is considered complicated to get started with, despite offering probably the widest range of functionality. If you're already comfortable with LangChain, feel free to ignore that advice.


I've had great luck just base64'ing images and asking Qwen 2.5 VL to both parse them to markdown and generate a title, description, and list of keywords (this seems to work well on tables and charts). My plan is to split PDFs into PNGs first, then run those against Qwen async, then put the results into a vector database (haven't gotten around to that quite yet).
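In case it's useful, a sketch of that loop, assuming Qwen 2.5 VL is served behind an OpenAI-compatible endpoint such as vLLM (URL and model name are placeholders):

    import base64
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    def describe_page(png_path: str) -> str:
        with open(png_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        resp = client.chat.completions.create(
            model="Qwen/Qwen2.5-VL-7B-Instruct",  # whatever name the server exposes
            messages=[{"role": "user", "content": [
                {"type": "text", "text": "Parse this page to markdown, then give a title, "
                                         "a short description, and a list of keywords."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]}],
        )
        return resp.choices[0].message.content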


How does the base64 output become useful / usable information to an LLM?


No idea, but Qwen 2.5 VL seems to understand it all quite well.


Why avoid LangChain?



