
Honestly you're better off rolling your own (but avoid LangChain like the plague). The actual implementation is simple but the devil is in the details - specifically how you chunk your documents to generate vector embeddings. Every time I've tried to apply general-purpose RAG tools to specific types of documents like medical records, internal knowledge bases, case law, datasheets, and legislation, it's been a mess.

Best case scenario, you can come up with a chunking strategy specific to your use case that will make it work: stuff like grouping all the paragraphs/tables about a register together, grouping tables of physical properties in a datasheet with the table title, or grouping the paragraphs of a PCB layout guideline into a single unit. You also have to figure out how much overlap to allow between the different types of chunks and how many dimensions you need in the output vectors. You then have to link chunks together, so that when your RAG matches a register description it knows to include the chunk with the actual documentation, and the LLM can use the documentation chunk instead of just the description chunk. I've had to train many a classifier to get this part even remotely usable in nontrivial use cases like case law.
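For illustration, here's a minimal sketch of that linking idea (all names are invented, not from any particular library): each chunk records which sibling chunks must ride along with it, and after vector search you expand the hit set before handing it to the LLM.

    from dataclasses import dataclass, field

    @dataclass
    class Chunk:
        chunk_id: str
        text: str
        kind: str                       # e.g. "register_description", "register_docs"
        linked_ids: list[str] = field(default_factory=list)

    def expand_hits(hits: list[Chunk], index: dict[str, Chunk]) -> list[Chunk]:
        # After vector search, pull in every linked chunk so the LLM sees the
        # full documentation, not just the snippet that happened to match.
        seen, out = set(), []
        for hit in hits:
            for c in [hit] + [index[i] for i in hit.linked_ids if i in index]:
                if c.chunk_id not in seen:
                    seen.add(c.chunk_id)
                    out.append(c)
        return out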

Worst case scenario, you have to finetune your own embedding model, because the colloquialisms the general-purpose ones are trained on have little overlap with the terms of art and jargon used in the documents (this is especially bad for legal and highly technical texts IME). This generally requires thousands of examples created by an expert in the field.
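As a rough sketch of what that fine-tuning looks like (using the sentence-transformers 2.x API; the base model name and the legal example pair are placeholders):

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    # Expert-written (query, relevant passage) pairs; you'd want thousands.
    train_examples = [
        InputExample(texts=["estoppel by deed",
                            "A grantor who conveys land by warranty deed is later barred..."]),
    ]
    model = SentenceTransformer("all-MiniLM-L6-v2")
    loader = DataLoader(train_examples, shuffle=True, batch_size=32)
    loss = losses.MultipleNegativesRankingLoss(model)  # other in-batch pairs act as negatives
    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
    model.save("domain-tuned-embedder")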




Disclosure: I'm an engineer at LangChain, primarily focused on LangGraph. I'm new to the team, though - and I'd really like to understand your perspective a bit better. If we're gritting the wheels for you rather than greasing them, I _really_ want to know about it!

> Every time I've tried to apply general purpose RAG tools to specific types of documents like medical records, internal knowledge base, case law, datasheets, and legislation, it's been a mess.

Would it be fair to paraphrase you as saying that people should avoid using _any_ library's ready-made components for a RAG pipeline, or do you think there's something specific to LangChain that is making it harder for people to achieve their goals when they use it? Either way, is there more detail that you can share on this? Even if it's _any_ library - what are we all getting wrong?

Not trying to correct you here - rather stating my perspective in hopes that you'll correct it (pretty please) - but my take as someone who was a user before joining the company is that LangChain is a good starting point because of the _structure_ it provides, rather than the specific components.

I don't know what the specific design intent was (again, new to the team!) but just candidly, as a user, I tend to look at the components as stand-ins that'll help me get something up and running super quickly so I can start building out evals. I might be unusual in this, but I tend to think that until I have evals, I don't really have any idea whether my changes are actually improvements. Once I have evals running against something that does _roughly_ what I want it to do, I can start optimizing the end-to-end workflow. I suspect in 99.9% of cases that'll involve replacing some (many?) of our prebuilt components with custom ones that are more tailored to your specific task.

Complete side note, but for anyone looking at LangChain to build out RAG stuff today, I'd advise using LangGraph for structuring your end-to-end process. You can still pull in components for individual process steps from LangChain (or any other library you prefer) as needed, and you can still use LangChain pipelines as individual workflow steps if you want to, but I think you'll find that LangGraph is a more flexible foundation to build upon when it comes to defining the structure of your overall workflow.
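For a sense of what that structure looks like, here's a minimal two-node LangGraph sketch (the retriever and generator bodies are stubs you'd swap for real components; API as currently documented):

    from typing import TypedDict
    from langgraph.graph import StateGraph, START, END

    class RAGState(TypedDict):
        question: str
        docs: list[str]
        answer: str

    def retrieve(state: RAGState) -> dict:
        return {"docs": ["stub doc about " + state["question"]]}  # swap in a real retriever

    def generate(state: RAGState) -> dict:
        return {"answer": f"stub answer from {len(state['docs'])} docs"}  # swap in an LLM call

    graph = StateGraph(RAGState)
    graph.add_node("retrieve", retrieve)
    graph.add_node("generate", generate)
    graph.add_edge(START, "retrieve")
    graph.add_edge("retrieve", "generate")
    graph.add_edge("generate", END)
    app = graph.compile()
    # app.invoke({"question": "what is the reset value of CTRL_REG1?"})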


> This generally requires thousands of examples created by an expert in the field.

Or an AI model pretending to be an expert in the field... (this has worked well in a few niche domains where I've used it)
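A hedged sketch of that approach, assuming an OpenAI-compatible endpoint (the prompt, persona, and model name are all placeholders):

    from openai import OpenAI

    client = OpenAI()

    PROMPT = (
        "You are a senior attorney. Given the passage below from a court opinion, "
        "write 3 search queries a lawyer might type that this passage answers. "
        "Return one query per line.\n\nPassage:\n{passage}"
    )

    def synthesize_queries(passage: str) -> list[str]:
        resp = client.chat.completions.create(
            model="gpt-4o",  # any capable model; placeholder name
            messages=[{"role": "user", "content": PROMPT.format(passage=passage)}],
        )
        return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]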


Don't forget to finetune the reranker too if you end up doing the embedding model. That tends to have an outsized effect on performance for out-of-distribution content.
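A sketch of the reranker half, again with the sentence-transformers 2.x CrossEncoder API (base model and example pairs are placeholders):

    from torch.utils.data import DataLoader
    from sentence_transformers import CrossEncoder, InputExample

    pairs = [  # in-domain (query, passage) pairs with relevance labels
        InputExample(texts=["CTRL_REG1 reset value",
                            "CTRL_REG1 resets to 0x07, enabling all three axes..."], label=1.0),
        InputExample(texts=["CTRL_REG1 reset value",
                            "The device is packaged in a 16-pin LGA..."], label=0.0),
    ]
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", num_labels=1)
    model.fit(train_dataloader=DataLoader(pairs, shuffle=True, batch_size=16), epochs=1)
    model.save("domain-tuned-reranker")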


I've been looking up chunking techniques, but resources on this are scarce. What's your recommendation?


It’s the big unsolved problem and nobody’s talking about it. I’ve had some decent success asking an expensive model to generate the chunks and combining that with document location, and my next plan for an upcoming project is to do that hierarchically, but there’s no generally accepted solution yet.
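Roughly, the "expensive model generates the chunks" idea looks like this (OpenAI-compatible client; the prompt, delimiter, and model name are all placeholders):

    from openai import OpenAI

    client = OpenAI()

    def llm_chunk(doc_id: str, pages: list[str]) -> list[dict]:
        chunks = []
        for page_no, page in enumerate(pages, start=1):
            resp = client.chat.completions.create(
                model="gpt-4o",  # the "expensive model"; placeholder
                messages=[{"role": "user", "content":
                    "Split the following page into self-contained chunks, one per topic. "
                    "Separate chunks with a line containing only '---'.\n\n" + page}],
            )
            for piece in resp.choices[0].message.content.split("\n---\n"):
                # keep the document location alongside each chunk
                chunks.append({"doc": doc_id, "page": page_no, "text": piece.strip()})
        return chunks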

RAG's big problem is turning PDFs into chunks - both the parsing problem and the chunking problem. I paid someone to do the parsing into markdown for a project recently (including table data summaries) and it worked well. MathPix has a good API for this, but it only works sensibly for PDFs that don't have insane layouts, and many do.
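For reference, the DIY baseline (not the paid pipeline above, and not MathPix - it falls apart on exactly those insane layouts) is just per-page text extraction, e.g. with PyMuPDF:

    import fitz  # PyMuPDF

    def pdf_to_pages(path: str) -> list[str]:
        with fitz.open(path) as doc:
            return [page.get_text("text") for page in doc]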


The data source I have is a filesystem with docs, PDFs, graphs, etc.

Will need to expand folder names and file abbreviations, do repetitive analysis to find footers and headers, locate titles on first pages, and dedupe a lot. It seems like some kind of content+hierarchy+keywords+subtitle combination will need to be vectorized, like a card catalog.
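Something like this sketch, maybe (the abbreviation map and field layout are made up):

    ABBREVIATIONS = {"mfg": "manufacturing", "qa": "quality assurance"}  # example map

    def expand(name: str) -> str:
        parts = name.replace("_", " ").replace("-", " ").split()
        return " ".join(ABBREVIATIONS.get(p.lower(), p) for p in parts)

    def catalog_card(path: str, title: str, keywords: list[str], body: str) -> str:
        # The string that actually gets embedded: hierarchy + title + keywords + content.
        hierarchy = " / ".join(expand(p) for p in path.strip("/").split("/"))
        return (f"Location: {hierarchy}\nTitle: {title}\n"
                f"Keywords: {', '.join(keywords)}\n\n{body}")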


Not the person you asked, but it depends on what you're trying to chunk. I've written a standalone chunking library for an app I'm building: https://github.com/rmusser01/tldw/blob/main/App_Function_Lib...

It's set up so that you can perform whatever type of chunking you might prefer.


If there's a list of techniques and their optimal use cases, I haven't found it. I started writing one for the day job, but then GraphRAG happened, and Gartner is saying all RAG will be GraphRAG.

You can't fight Gartner, no matter how wrong they are, so the work stopped; now everything is a badly implemented graph.

That's a long way of saying: if there is a comparison, a link would be most appreciated.


Semantic chunking is where I would start now. Also check this out: https://github.com/chonkie-ai/chonkie
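If it helps, the core of semantic chunking is small enough to sketch by hand (this is not chonkie's API, just the idea: embed sentences and cut a new chunk when similarity between adjacent sentences drops below a threshold):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def semantic_chunks(sentences: list[str], threshold: float = 0.6) -> list[str]:
        if not sentences:
            return []
        embs = model.encode(sentences, normalize_embeddings=True)
        chunks, current = [], [sentences[0]]
        for i in range(1, len(sentences)):
            # cosine similarity between adjacent sentences (vectors are unit-normalized)
            if float(np.dot(embs[i], embs[i - 1])) < threshold:
                chunks.append(" ".join(current))
                current = []
            current.append(sentences[i])
        chunks.append(" ".join(current))
        return chunks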


> but avoid LangChain like the plague

Can you elaborate on this?

I have a proof-of-concept RAG system implemented with LangChain, but would like input before committing to this framework.


LangChain is considered complicated to get started with, despite offering probably the widest range of functionality. If you're already comfortable with LangChain, feel free to ignore that advice.


I've had great luck just base64'ing images and asking Qwen 2.5 VL to both parse them to markdown and generate a title, description, and list of keywords (this seems to work well on tables and charts). My plan is to split PDFs into PNGs first, then run those against Qwen async, then put the results into a vector database (haven't gotten around to that quite yet).
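In case it's useful, a sketch of that loop, assuming Qwen 2.5 VL is served behind an OpenAI-compatible endpoint such as vLLM (URL and model name are placeholders):

    import base64
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    def describe_page(png_path: str) -> str:
        with open(png_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        resp = client.chat.completions.create(
            model="Qwen/Qwen2.5-VL-7B-Instruct",  # whatever name the server exposes
            messages=[{"role": "user", "content": [
                {"type": "text", "text": "Parse this page to markdown, then give a title, "
                                         "a short description, and a list of keywords."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]}],
        )
        return resp.choices[0].message.content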


How does the base64 output become useful / usable information to an LLM?


No idea, but Qwen 2.5 VL seems to understand it all quite well.


Why avoid LangChain?



