Long, expensive, awesome
Disclaimer:
This text is sourced from Reddit and posted here with the author’s permission and blessing. The author requested to remain anonymous and to be referred to only by username: LegSubstantial2624. Source - Reddit link. Let this touch of humor offer a bit of comfort on your challenging journey toward mastering RAG and building a RAG system.
I know exactly how to build an awesome RAG. It’s as easy as pie.
First, prepare your data. Use some cool vision trick with the hi_res option. Your (oh, about 400) PDF files will be processed in just a week or so, maybe a bit more… No biggie.
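If you want to follow along at home, here is a minimal sketch of that parsing step, assuming the `unstructured` library with its PDF extras (`pip install "unstructured[pdf]"`); the directory paths are made up:

```python
from pathlib import Path

from unstructured.partition.pdf import partition_pdf

Path("data/parsed").mkdir(parents=True, exist_ok=True)
for pdf_path in Path("data/pdfs").glob("*.pdf"):
    # strategy="hi_res" runs a layout-detection model on every page;
    # accurate, and exactly why ~400 PDFs take "a week or so"
    elements = partition_pdf(filename=str(pdf_path), strategy="hi_res")
    text = "\n\n".join(el.text for el in elements if el.text)
    Path("data/parsed", pdf_path.stem + ".txt").write_text(text)
```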
Make sure to use some smart chunking. Something semantic, with embeddings from OpenAI. I mean, come on, even a kid knows that.
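One way to read "something semantic": embed sentences with OpenAI and start a new chunk wherever adjacent sentences stop looking alike. A rough sketch; the 0.75 threshold is an arbitrary assumption, tune it on your data:

```python
import re

import numpy as np
from openai import OpenAI  # pip install openai; needs OPENAI_API_KEY set

client = OpenAI()

def semantic_chunks(text: str, threshold: float = 0.75) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    resp = client.embeddings.create(
        model="text-embedding-3-small", input=sentences
    )
    vecs = np.array([d.embedding for d in resp.data])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # cosine via dot product
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if vecs[i - 1] @ vecs[i] < threshold:  # topic shift: close the chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```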
But! Data prep doesn't stop there! You want it awesome, right? Every chunk needs to go through some LLM magic. Analyze it, enrich it so that every chunk is like Scrooge McDuck diving into his money bin. Keywords, summarization, all that jazz. Pick a pricey LLM, don’t be stingy. You want awesome, don’t you?
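A hedged sketch of that enrichment pass: one pricey call per chunk to extract keywords and a summary. The model name and prompt are assumptions; swap in whatever proprietary model empties your wallet best:

```python
import json

from openai import OpenAI

client = OpenAI()

def enrich_chunk(chunk: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",  # pick a pricey one, don't be stingy
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "Return JSON with keys 'keywords' (a list of strings) and "
                "'summary' (one sentence) for the user's text."
            )},
            {"role": "user", "content": chunk},
        ],
    )
    meta = json.loads(resp.choices[0].message.content)
    return {"text": chunk, **meta}
```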
Ok, now for search. Simple stuff. Every query needs to be rephrased by an LLM, like, 5-7 times, maybe 10. Less is pointless. So - each query will give you 10 new ones, but what a bunch!
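A sketch of that 10x query multiplication step. N_REPHRASES = 10 because less is pointless; the model and prompt are, as usual, assumptions:

```python
from openai import OpenAI

client = OpenAI()
N_REPHRASES = 10

def rephrase_query(query: str) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"Rewrite this search query {N_REPHRASES} different ways, "
                f"one per line, no numbering:\n{query}"
            ),
        }],
    )
    rewrites = [
        line.strip()
        for line in resp.choices[0].message.content.splitlines()
        if line.strip()
    ]
    return [query] + rewrites  # keep the original too
```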
Then, run them all through vector search. And the results? You guessed it! Straight into the Cohere reranker! We’re going for awesome, remember? Don’t forget to merge the results.
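One way to wire up the fan-out: search with every rewrite, dedupe the merged hits, then hand everything to Cohere's reranker. Here `vector_search` is a hypothetical stand-in for whatever vector DB you use, and the model name is one that exists as of this writing but may change:

```python
import os

import cohere  # pip install cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

def search_awesome(queries: list[str], top_n: int = 5) -> list[str]:
    merged: dict[str, str] = {}
    for q in queries:
        for doc in vector_search(q, k=20):  # hypothetical retriever
            merged[doc] = doc  # dedupe identical chunks
    docs = list(merged)
    reranked = co.rerank(
        model="rerank-english-v3.0",
        query=queries[0],  # rerank against the original query
        documents=docs,
        top_n=top_n,
    )
    return [docs[r.index] for r in reranked.results]
```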
And now, for the final touch - an LLM on the output. Here is my suggestion: pick a few models, let each one do its job. Then, use yet another model to pick the best one. Or, you know, whichever…
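A sketch of that grand finale: several models each draft an answer, then yet another model plays judge. The model names are placeholders, and routing everything through one provider's client is a simplifying assumption (the spirit of the post says mix vendors):

```python
from openai import OpenAI

client = OpenAI()
DRAFT_MODELS = ["gpt-4o", "gpt-4o-mini"]  # pretend these are rival vendors
JUDGE_MODEL = "gpt-4o"

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def answer(question: str, context: str) -> str:
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    drafts = [ask(m, prompt) for m in DRAFT_MODELS]
    numbered = "\n\n".join(f"[{i}] {d}" for i, d in enumerate(drafts))
    verdict = ask(JUDGE_MODEL, (
        f"Question: {question}\n\nCandidate answers:\n{numbered}\n\n"
        "Reply with only the number of the best answer."
    ))
    digits = "".join(ch for ch in verdict if ch.isdigit())
    try:
        return drafts[int(digits)]
    except (ValueError, IndexError):
        return drafts[0]  # or, you know, whichever
```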
And the most important rule - no open source, only proprietary, only hardcore!
P.S. Under every Reddit post, there’s always a comment saying, “Clearly, this post was written by ChatGPT.” Don’t bother. This post was entirely crafted by ChatGPT, no humans involved.