Beyond the Basics of RAG


June 12, 2024


LLMs are powerful, but have limitations: their knowledge is fixed in their weights, and their context window is limited. Worse: when they don’t know something, they might just make it up. RAG, for Retrieval Augmented Generation, has emerged as a way to mitigate both of those problems. However, implementing RAG effectively is more complex than it seems. The nitty gritty parts of what makes good retrieval good are rarely talked about: No, cosine similarity is, in fact, not all you need. In this workshop, we explore what helps build a robust RAG pipeline, and how simple insights from retrieval research can greatly improve your RAG efforts. We’ll cover key topics like BM25, re-ranking, indexing, domain specificity, evaluation beyond LGTM@few, and filtering. Be prepared for a whole new crowd of incredibly useful buzzwords to enter your vocabulary.

This talk was given by Ben Clavié at the Mastering LLMs Conference.


00:00 Introduction

Hamel introduces Ben Clavié, a researcher at Answer.AI with a strong background in information retrieval and the creator of the RAGatouille library.

00:48 Ben’s Background

Ben shares his journey into AI and information retrieval, his work at Answer.AI, and the open-source libraries he maintains, including rerankers.

02:20 Agenda

Ben defines Retrieval-Augmented Generation (RAG), clarifies common misconceptions, and explains that RAG is not a silver bullet or an end-to-end system.

05:01 RAG Basics and Limitations

Ben explains the basic mechanics of RAG, emphasizing that it is simply the process of stitching retrieval and generation together, and discusses common failure points.

06:29 RAG MVP Pipeline

Ben breaks down the simple RAG pipeline, including model loading, data encoding, cosine similarity search, and obtaining relevant documents.

07:54 Vector Databases

Ben explains the role of vector databases in handling large-scale document retrieval efficiently and their place in the RAG pipeline.

08:46 Bi-Encoders

Ben describes bi-encoders, their efficiency in pre-computing document representations, and their role in quick query encoding and retrieval.

11:24 Cross-Encoders and Re-Ranking

Ben introduces cross-encoders, their computational expense, and their ability to provide more accurate relevance scores by encoding query-document pairs together.

14:38 Importance of Keyword Search

Ben highlights the enduring relevance of keyword search methods like BM25 and their role in handling specific terms and acronyms effectively.

15:24 Integration of Full-Text Search

Ben discusses the integration of full-text search (TF-IDF) with vector search to handle detailed and specific queries better, especially in technical domains.

16:34 TF-IDF and BM25

Ben explains TF-IDF, BM25, and their implementation in modern retrieval systems, emphasizing their effectiveness despite being older techniques.

19:33 Combined Retrieval Approach

Ben illustrates a combined retrieval approach using both embeddings and keyword search, and mentions a common weighting of 0.7 for the cosine similarity score and 0.3 for the full-text score.

19:50 Metadata Filtering

Ben emphasizes the importance of metadata in filtering documents, providing examples and explaining how metadata can significantly improve retrieval relevance.

22:37 Full Pipeline Overview

Ben presents a comprehensive RAG pipeline incorporating bi-encoders, cross-encoders, full-text search, and metadata filtering, showing how to implement these steps in code.

26:05 Q&A Session Introduction

26:14 Fine-Tuning Bi-Encoder and Cross-Encoder Models

Ben discusses the importance of fine-tuning bi-encoder and cross-encoder models for improved retrieval accuracy, emphasizing the need to make the bi-encoder looser and the cross-encoder more precise.

26:59 Combining Scores from Different Retrieval Methods

A participant asks about combining scores from different retrieval methods. Ben explains the pros and cons of weighted averages versus taking top candidates from multiple rankers, emphasizing the importance of context and data specifics.

29:01 The Importance of RAG as Context Lengths Get Longer

Ben reflects on how RAG may evolve as the context lengths of LLMs get larger, while emphasizing that long context lengths are not a silver bullet.

30:06 Chunking Strategies for Long Documents

Ben discusses effective chunking strategies for long documents, including overlapping chunks and ensuring chunks do not cut off sentences, while considering the importance of latency tolerance in production systems.

30:56 Fine-Tuning Encoders and Advanced Retrieval with ColBERT

Ben also discusses when to fine-tune your encoders, and explains ColBERT for advanced retrieval.



Additional Resources

The following resources were mentioned during the talk:

  • Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline.
  • A lightweight unified API for various reranking models:
  • A Hackers’ Guide to Language Models:
  • GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer:
  • Fine-Tuning with Sentence Transformers:
  • Elastic, Dense vector field type:

Full Transcript

[0:01] Hamel: Ben Clavié is one of the cracked researchers who work at Answer.AI. You’ve heard from several researchers from Answer.AI already in this conference. Ben has a background in information retrieval, amongst other things, and he has an open source package called RAGatouille, which you should check out. He comes from a deep background in information retrieval and brings that to RAG. And he’s also one of the clearest thinkers on the topic. But yeah, I’ll hand it over to you, Ben, to kind of give more color to your background, anything that I missed.
[0:45] Hamel: And yeah, we can just jump into it.
[0:48] Ben: Okay, let’s go. So I think that’s pretty much the key aspect of my background. You pretty much read this slide out. So I do R&D at Answer.AI with Jeremy. You’ve seen Johno in this course and there’s a lot of other awesome people. We’re a distributed R&D lab, so we do AI research and we try to be as open source as possible because we want people to use what we build. Prior to joining Answer.AI, I did a lot of NLP and kind of stumbled upon information retrieval because it’s very, very useful and everybody wants information retrieval.
[1:20] Ben: It’s more about clarifying what information retrieval is, which I hope today’s talk will help with. And yeah, so my claim to fame, or claim to moderate fame at least, is the RAGatouille library, which makes it much easier to use a family of models called ColBERT, which we will very briefly mention today, but won’t have time to go into in detail. But hopefully, if you want to know more about that, do feel free to ping me on Discord. I’m generally either very responsive or you need to ping me again. That’s pretty much how I work.
[1:50] Ben: And I also maintain the rerankers library, which we’ll discuss in one of the later slides. And yeah, if you want to know more, want to follow me, or want to hear more about what I do, it’s pretty much all on Twitter. I’m not on LinkedIn at all. Everything goes through Twitter. A lot of memes and shitposts, but some very informative stuff once in a while. So yeah, let’s get started with what we’re going to talk about today. It’s only half an hour, so we’re not going to talk about a lot.
[2:20] Ben: I’m going to talk about what I call retrieval basics as they should exist in your pipelines, because RAG is a very nebulous term, and that will be the first slide, and Hamel will be very happy about that slide, I think. RAG is not a silver bullet. RAG is not a new thing from December 2022. RAG is not even an end-to-end system. We’ll cover that, but I think it’s very important to ground it a bit when we talk about RAG because it means a lot of different things to different people.
[2:47] Ben: Then we will cover what we call the compact MVP, which is what most people do when they are starting out with RAG. It’s actually an example from Jeremy. It’s like the simplest possible implementation of RAG, as in just using a vector search. And then the other topics are basically things that I think you should have in your RAG pipeline as part of your MVP. And I’ll show that there’s a lot of scary concepts because they’re all big words, like bi-encoder, cross-encoder, TF-IDF, BM25, filtering.
[3:14] Ben: That sounds like a lot, but I’m going to try and show that they’re very simple concepts, and you can have pretty much the same MVP by adding just 10 lines of code, by choosing state-of-the-art retrieval components for every bit. And the bonus, which I don’t think we’ll have time to cover, is talking about ColBERT, because I like talking about ColBERT. So I might do it at the end if we have some time, but I might not. And yeah, that’s it for the agenda.
[3:40] Ben: And then I also think it’s important to have the counter agenda, which is what we won’t be talking about today, because those are just as important for RAG. But they are not what we put in the very basics. And here we’re very much about the basics. So one of them is how to monitor and improve RAG systems, because RAG systems are living systems and they’re very much things you should monitor and continuously improve on. I think Jason covered that quite well in his talk yesterday or last week. Yeah, last week.
[4:07] Ben: So I would invite you to watch that and watch Jason and Dan’s upcoming course if it does materialize. Evaluations, they’re also extremely important, but we won’t talk about them at all today. I know that Joe will talk about them at length in his talk. Benchmarks and paper references. So I’ll make a lot of claims that you will just have to trust me on because I don’t want to have too many references or too many academic-looking tables, and I’m trying to keep this quite lively and airy.
[4:33] Ben: I won’t give you a rundown of all the best performing models and why you should use them. I won’t talk about training, data augmentation, et cetera. And I won’t talk about all the other cool approaches like SPLADE and ColBERT in detail, because they go beyond the basics. But those are all very important topics, so if you’re interested, do look them up; there’s a lot of good resources out there. Do feel free to ask me. And with that, let’s get started with the rant, which is my favorite part.
[5:01] Ben: This is a thing that Hamel has been doing on Twitter recently as part of his flame posting campaign, I’ll say, which is basically, there’s so much in AI, especially in the LLM world, that uses words that are a lot scarier than they need to be. And RAG’s probably that, because to me, when I hear retrieval-augmented generation, or RAG, it sounds like that’s an end-to-end system, that’s a very definite set of components, that’s a thing that works on its own.
[5:27] Ben: And it’s not. It’s literally just doing retrieval to put stuff into your prompt context. Before your prompt or after your prompt, you want to get some context, so you’re doing retrieval. But that means that’s not an end-to-end system, despite what Jason will have you believe on his Twitter. He’s not created it, but he does make a lot of money from it. And it’s basically just the act of stitching together retrieval, so the R part of RAG, and generation, so the G part of RAG, to ground the latter.
[5:53] Ben: So you want your generation to be grounded, to use some context. So you’re doing retrieval on whatever documents you have and passing the results to your LLM. But there’s no magic going on. It’s very much a pipeline that takes the output of model A and gives it to model B. The generation part is what’s handled by large language models, and good RAG is actually three different components. It’s a good retrieval pipeline, it’s a good generative model, and it’s a good way of linking them up. So it can be formatting your prompt or whatever.
[6:20] Ben: And it’s very important to think about it when you’re saying my RAG doesn’t work. You need to be more specific. Saying my RAG doesn’t work is the same as saying my car doesn’t work. It’s like, yeah, but something specific is broken. You need to figure out what: is it the retrieval part, is the LLM struggling to make use of the context, etc. There’s a lot of failure cases there. And with that being said, let’s look at what the compact MVP is.
[6:44] Ben: So that is basically what you will see, I think, if you’ve read any Medium blog post about the advent of RAG in early 2023. That’s the pipeline that everyone used. And that’s also because it’s the easiest pipeline to put into production; it is very simple. You have a query. You have an embedding model. You have documents. The documents get embedded and pooled into a single vector. Then you do cosine similarity search between the vectors for your query and for the documents. And that gets you a result. That gets you a score.
[7:10] Ben: And this is a bit of a teaser for an upcoming slide when I say this is called the bi-encoder approach, but it’s just so you get the term in mind, and I’ll define it because that’s one of those things that is like a scary term that’s actually very, very simple when you break it down. But first, let’s look at what this actually means in code, this whole pipeline. So the first thing you want to do is load your model.
[7:29] Ben: Then you get your data, you encode it, you store your vectors, and you get your query, you encode it. And then, here we use NumPy, you do a cosine similarity search, i.e. a dot product between normalized vectors, to get the most similar documents. And the documents whose embeddings are similar to your query embedding are what you would consider as your relevant documents. And that’s pretty much it. This is modified from something that Jeremy did to showcase how simple RAG actually is in his Hackers’ Guide to Language Models.
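Editor’s sketch of the MVP described above, using only the Python standard library so that it runs anywhere. The slide’s actual code is not reproduced in this transcript; `toy_embed` and `build_vocab` are hypothetical stand-ins for loading and calling a real embedding model, and the documents are made up. Only the shape of the pipeline (encode documents, encode the query, take dot products of normalized vectors, sort) is meant to match the talk.

```python
import math
import re


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(texts):
    # Assign one vector dimension per distinct word seen in the documents.
    vocab = {}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab


def toy_embed(text, vocab):
    """Hypothetical stand-in for model.encode(): word counts, L2-normalized."""
    vec = [0.0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine_search(query_vec, doc_vecs, k=3):
    # Vectors are already normalized, so cosine similarity is a plain dot product.
    scores = [sum(q * d for q, d in zip(query_vec, dv)) for dv in doc_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


documents = [
    "Hayao Miyazaki directed the film Spirited Away.",
    "BM25 is a keyword ranking function used by search engines.",
    "Cross-encoders score a query and a document together.",
]
vocab = build_vocab(documents)
doc_vecs = [toy_embed(d, vocab) for d in documents]          # encode the data once
query_vec = toy_embed("Who directed Spirited Away?", vocab)  # encode the query
results = cosine_search(query_vec, doc_vecs, k=2)
for idx, score in results:
    print(f"{score:.3f}  {documents[idx]}")
```

With a real model, only `toy_embed` changes; the search step stays a dot product over stored vectors.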
[8:02] Ben: But that’s what you want to do to retrieve context in the simplest possible way. And you will have noticed that there’s no vector DB in this. This is all NumPy arrays. And this is all NumPy arrays because the huge point of using a vector DB is to allow you to efficiently search through a lot of documents, because what vector DBs generally do, not all of them but most of them, is wrap stuff like HNSW or IVFPQ, which are indexing types.
[8:31] Ben: What that allows you to do is find and retrieve relevant documents without having to compute cosine similarity against every single document. It does an approximate search instead of an exact search. This is not something that you need if you’re embedding 500 documents. Your CPU can do that in milliseconds. You don’t actually need a vector DB if you’re trying to go to the simplest possible stage. But if you wanted one, it would go right here on the graph: right after you embed your documents, you would put them in the vector DB.
[9:00] Ben: And the second thing to discuss about this tiny graph is: why am I calling embeddings bi-encoders? Because that step that I call bi-encoder, you will have seen a lot of times, but you will generally see it called embeddings or model. And bi-encoder is the term that the IR literature uses to refer to that. And it’s simply because you encode things separately, like you do two encoding stages. So it’s a bi-encoding. And that’s used to create single-vector representations where you pre-compute all your document representations.
[9:30] Ben: So when you’re using bi-encoders, you encode your documents whenever you want. Like when you’re creating your database, when you’re adding documents, those get encoded at a time that’s completely separate from inference. And then only at inference, in the second aspect of this column, will you embed your query to compare to your pre-computed document representations. So that’s really computationally efficient because at inference you’re only ever encoding one thing, which is the query, and everything else has been done before. And so that is part of why it’s done that quickly.
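Editor’s sketch of the offline/online split described above. `ToyBiEncoderIndex` and `toy_encode` are hypothetical names, and the keyword-counting encoder is only a stand-in for a real embedding model. The call counter is there to make the point visible: documents cost one encoder call each at indexing time, and a search costs a single call, for the query.

```python
import math
import re

FEATURES = ["film", "director", "novel", "author"]  # toy feature space (assumption)
stats = {"encoder_calls": 0}


def toy_encode(text):
    """Hypothetical stand-in for a bi-encoder: counts a few keywords, L2-normalizes."""
    stats["encoder_calls"] += 1
    tokens = re.findall(r"[a-z]+", text.lower())
    vec = [float(tokens.count(f)) for f in FEATURES]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


class ToyBiEncoderIndex:
    """Documents are encoded ahead of time; only the query is encoded at search time."""

    def __init__(self, encode):
        self.encode = encode
        self.texts = []
        self.vectors = []

    def add(self, text):
        # Indexing time: runs when the database is built or a document is added.
        self.texts.append(text)
        self.vectors.append(self.encode(text))

    def search(self, query, k=1):
        # Inference time: exactly one encoder call, then cheap dot products.
        q = self.encode(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), text)
            for vec, text in zip(self.vectors, self.texts)
        ]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


index = ToyBiEncoderIndex(toy_encode)
index.add("Spirited Away is a film by the director Hayao Miyazaki.")
index.add("Norwegian Wood is a novel by the author Haruki Murakami.")
calls_after_indexing = stats["encoder_calls"]

top = index.search("Which director made this film?", k=1)
calls_for_query = stats["encoder_calls"] - calls_after_indexing
print(top[0][1], "| encoder calls at query time:", calls_for_query)
```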
[10:02] Ben: And I did want to take a slight break because I can see there are questions, but they’re not showing up on my screen. So if there are any on this quick MVP, then.
[10:11] Participant 3: Yeah, let me look through some of the questions. I’m going to give you a few of them and you can decide whether you want to take them now or later. So we got one. It’s about a 7,000 question and answer data set. Can I optimize RAG to accurately retrieve and quote exact answers, while also effectively handling queries that are slightly different from the original data? I think there’s actually two parts to that. So one is to quote the exact answer, which is not so much about the information retrieval part.
[10:45] Participant 3: It’s rather just like, what do you tell the LLM to do? But the information retrieval part is probably, well...
[10:56] Ben: I will actually cover how to better deal with out of context things in the upcoming slide.
[11:04] Participant 3: Why don’t you keep going?
[11:08] Ben: None of these questions can be saved right now. Perfect. The next one is: if that’s very computationally efficient, there is an obvious trade-off here. And that is that your documents are entirely unaware of your query and your queries are entirely unaware of your documents.
[11:24] Ben: Which means that you’re very, very subject to how it was trained, basically. If your queries look a bit different from your training data, or if there’s very specific information that will be in certain documents and not others, sometimes you want to know how the query is phrased, you want to know what the query is looking for when you’re encoding your document, so that it can kind of paint that representation and steer it more towards the information that you’re interested in. And that’s done with what we call re-ranking.
[11:53] Ben: So re-ranking is another one of those scary stages that you’ll see in your pipeline. And the most common way to do re-ranking is using something that we call a cross-encoder. And cross-encoder is another one of those scary words, like bi-encoder, that you feel should be a very advanced concept, but it’s actually very simple. This graph here represents the whole difference between them. The bi-encoder is basically this two-column system that we described, where documents get encoded in their own corner, queries get encoded in their own corner, and they only meet very, very late.
[12:20] Ben: Like you only do cosine similarity between vectors, but the documents have never seen the query and vice versa. The cross-encoder is different. The cross-encoder is a model that will take your document and your query together. So you’re going to give it both your query and your document, or a series of documents, depending on the type of model, but to keep it simple, we do it one by one. So you always give it a query-document pair. And you put it through this cross-encoder model, which is effectively a classifier with a single label.
[12:46] Ben: And the probability of the label being positive is what your model considers as how similar the query and the document are, or how relevant the document is. This is extremely powerful because it means that the model knows everything about what you’re looking for when it’s encoding the document. It can give you a very accurate score or at least a more accurate score. The problem is that you can see how that wouldn’t scale, because it’s not very computationally realistic to compute this query-document score for every single query-document pair every time you want to retrieve a document.
[13:15] Ben: Say you’ve got Wikipedia embedded, you’ve got, I don’t know, like 10 million paragraphs. You’re not going to compute 10 million scores through a model with, like, 300 million parameters for every single query. You would eventually return something and it would be a very, very relevant document, but it will also take 15 minutes, which is probably not what you want in production.
[13:37] Ben: So you might also have heard, if you’re really into retrieval, or not heard at all if you’re not, of other re-ranking approaches like RankGPT or RankLLM; using LLMs to rank documents has been a big thing lately. For people really into retrieval, you will know of MonoT5, et cetera. So those are not cross-encoders, but that’s not really relevant to us because the core idea is the same, and that’s basically what we always do with re-ranking in the pipeline.
[14:04] Ben: You use a powerful model that is computationally expensive to score only a subset of your documents. And that’s why it’s re-ranking and not ranking, because this can only work if you give it, I don’t know, 10 or 50 documents, not more than that. So you always have a first stage retrieval, which here is our vector search. And then the re-ranker does the ranking for you, so it creates an ordered list. There’s a lot of ways to try those models out.
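Editor’s sketch of the two-stage pattern described above: a cheap first stage narrows the corpus to a handful of candidates, and a more expensive scorer that sees the query and document together reorders only those candidates. Both scoring functions here are toy stand-ins chosen so the example runs without a model: `first_stage` replaces vector search and `toy_pair_score` replaces a cross-encoder. All names and documents are hypothetical.

```python
import re


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def first_stage(query, docs, k):
    """Cheap first-stage retrieval (stand-in for vector search): shared-word count."""
    q = tokens(query)
    return sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)[:k]


def toy_pair_score(query, doc):
    """Stand-in for a cross-encoder: scores the (query, document) pair jointly."""
    q, d = tokens(query), tokens(doc)
    return len(q & d) / len(q | d)  # Jaccard overlap, in [0, 1]


def rerank(query, candidates, score_fn, top_n):
    # One scorer call per candidate, which is why the candidate list must stay small.
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


docs = [
    "Spirited Away is an animated film released in 2001 and it was praised by critics around the world.",
    "Hayao Miyazaki is the director of Spirited Away.",
    "Away games are hard.",
    "BM25 is a ranking function.",
]
query = "spirited away director"
candidates = first_stage(query, docs, k=3)   # the scorer never sees the rest
reranked = rerank(query, candidates, toy_pair_score, top_n=2)
for score, doc in reranked:
    print(f"{score:.3f}  {doc}")
```

Swapping in a real re-ranker means replacing `toy_pair_score` with a call to that model; the surrounding structure is unchanged.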
[14:28] Ben: Some of them are API-based, so it’s just an API call to Cohere or Jina. Some of them you run on your machine. If you want to try them out, and this is basically the self-promotion moment, I do maintain a library called rerankers, with the QR code here, which is basically a unified API so you can test any re-ranking method in your pipeline and swap them out freely. And that’s what your pipeline looks like now.
[14:52] Ben: It’s the same, with just that one extra step at the end where you re-rank things before getting your results. So we’ve added re-ranking, but there’s something else that’s missing here. And that’s something that actually addresses the first question, at least partially: semantic search via embeddings is powerful, and I’m not saying don’t use vectors. Vectors are cool, models are cool, deep learning is cool. But it’s very, very hard if you think about it, because you’re asking your model to take, I don’t know, 512 tokens, even more if you’re doing long context.
[15:24] Ben: And you’re like, okay, put all of this into this one vector. We are just using a single vector. You’ve got, I don’t know, 384, 1024 at most, floats, and that must represent all the information in this document. That’s naturally lossy. There’s no way you’re going to keep all of the information here. And what you do when you’re training an embedding model is that you’re teaching it to represent information that is useful in its training.
[15:49] Ben: So the model doesn’t learn to represent all of the document’s information because that’s pretty much impossible, since embeddings are essentially a form of compression. What the model actually learns is to represent the information that is useful to the training queries. So your training data is very, very important here. It’s representing the documents in a way that will help queries, phrased the way they are in your training data, retrieve a given document.
[16:16] Ben: So when you use that on your own data, it’s likely that you’re going to be missing some information, especially when you go slightly out of distribution. There’s another thing, which is that humans love to use keywords, especially if you’re going into the legal domain, the biomedical domain, anything specific. We have a lot of acronyms that might not even be in the training data, but we use a lot of acronyms. We use a lot of very advanced medical words. People love jargon. People love to use technical words because they’re very, very useful.
[16:44] Ben: And that’s why you should, and I know it sounds like I’m talking from the 70s, because that’s actually a method from the 70s, but you should always have keyword search in your pipeline. You should always also have full-text search on top of anything that you do with vectors. And keyword search, which you can call full-text search, or TF-IDF, or BM25, is powered by what we call TF-IDF,
[17:06] Ben: which is a very basic NLP concept that essentially stands for term frequency, inverse document frequency. It assigns every single word in a document, or group of words, because sometimes we do them two by two, or three by three even, a weight based on how rare they are. So a word that appears everywhere, like “the” or “a”, has a very, very small weight, and a word that’s highly specific to certain documents has a very high weight.
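Editor’s sketch of the weighting idea described above, using the plain textbook variant tf × log(N / df). Real systems use smoothed variants, so the exact formula is an assumption; the three documents are made up. The point is only that a word present in every document gets a weight of zero while a rare term gets a high one.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def idf(term, docs_tokens):
    # Inverse document frequency: the rarer a term across documents, the higher it is.
    df = sum(1 for toks in docs_tokens if term in toks)
    return math.log(len(docs_tokens) / df) if df else 0.0


def tf_idf(term, toks, docs_tokens):
    # Term frequency within one document, scaled by the term's rarity overall.
    tf = Counter(toks)[term] / len(toks)
    return tf * idf(term, docs_tokens)


docs = [
    "the cruise division report",
    "the annual report of the company",
    "the HNSW index",
]
tokenized = [tokenize(d) for d in docs]

idf_the = idf("the", tokenized)        # appears in every document
idf_report = idf("report", tokenized)  # appears in two of three
idf_hnsw = idf("hnsw", tokenized)      # appears in one document only
weight_the = tf_idf("the", tokenized[1], tokenized)
weight_hnsw = tf_idf("hnsw", tokenized[2], tokenized)
print(idf_the, round(idf_report, 3), round(idf_hnsw, 3))
```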
[17:32] Ben: And the main method to use TF-IDF for retrieval is called BM25, which stands for Best Matching 25. It was invented in the 70s. It’s been updated since then, but it’s basically just been iterations of it. And you’ll often hear IR researchers say that the reason that the field’s not taken off like NLP has or Computer Vision has is because the baseline is just too good. We’re still competing with BM25, although it’s been 50 years now. Oh my god, it’s been 50 years.
[18:00] Ben: Yeah, so BM25 existed for basically my entire lifetime before my birth, and it’s still used in production pipelines today. That’s how good it is. And the good thing is it’s just word counting with a weighting formula. So the compute time is virtually unnoticeable. You can add that to your pipeline and you will absolutely never feel it.
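Editor’s sketch of “word counting with a weighting formula”: a compact BM25 scorer. The talk does not show a formula, so the details are assumptions: this uses the common Okapi form with the non-negative IDF variant log(1 + (N - df + 0.5) / (df + 0.5)) and the frequently used defaults k1 = 1.5 and b = 0.75. The corpus is made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    def __init__(self, corpus, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(d)) for d in corpus]       # term counts per doc
        self.lengths = [sum(c.values()) for c in self.docs]
        self.avg_len = sum(self.lengths) / len(self.lengths)
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for c in self.docs for term in c)

    def idf(self, term):
        n, df = len(self.docs), self.df.get(term, 0)
        return math.log(1 + (n - df + 0.5) / (df + 0.5))

    def score(self, query, i):
        counts, length = self.docs[i], self.lengths[i]
        total = 0.0
        for term in tokenize(query):
            tf = counts.get(term, 0)
            if tf == 0:
                continue
            # Saturating term frequency, normalized by relative document length.
            denom = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf(term) * tf * (self.k1 + 1) / denom
        return total

    def rank(self, query):
        scores = [self.score(query, i) for i in range(len(self.docs))]
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return order, scores


corpus = [
    "The patient was prescribed NSAID tablets for joint pain.",
    "The patient reported pain after the long flight.",
    "The report covers the cruise division.",
]
bm25 = BM25(corpus)
order, scores = bm25.rank("NSAID pain")
for i in order:
    print(f"{scores[i]:.3f}  {corpus[i]}")
```

The rare acronym dominates the score, which is the behaviour the talk relies on for jargon-heavy domains.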
[18:20] Ben: And I said I wouldn’t add anything from papers, but I feel like, because I’m making a very strong claim that this method from the 70s is strong, I should add a table, and the table is from the BEIR paper, which is the retrieval part of MTEB, which is basically the main embeddings benchmark. And they compared it to a lot of models that were very popular for retrieval, like DPR and very strong vector retrievers.
[18:45] Ben: And basically, you can see that unless you go into very over-trained embeddings like E5, BGE, BM25 is competitive with virtually all deep learning-based approaches, at least at the time of the paper, which was only just three years ago. We now have embeddings that are better, but we don’t have any embeddings that are better to the point where they’re not made better by being used in conjunction with BM25.
[19:10] Ben: Knowing that, this is how you want your pipeline to look. You’ll notice that there’s now a whole new pathway for both the query and the documents: on top of being encoded by the embedder, they’re also encoded by TF-IDF to get full-text search, and that will help you retrieve keywords, etc. Humans use keywords in queries all the time; it’s something you should do. At the end, you will combine the scores. You can do that in a lot of ways.
[19:33] Ben: I won’t go into too much detail, but what a lot of people do is give a weight of 0.7 to the cosine similarity score and 0.3 to the full-text search score. But I’m pretty sure we could do a whole talk for an hour on different methods of combining that. Okay. I do have five more minutes.
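Editor’s sketch of the 0.7 / 0.3 weighting mentioned above. The talk only gives the weights; the min-max normalization step is an assumption added here because cosine similarities and BM25 scores live on different scales, and the document ids and scores are made up.

```python
def min_max(scores):
    # Rescale one ranker's scores to [0, 1] so the two rankers are comparable.
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def combine(dense, sparse, w_dense=0.7, w_sparse=0.3):
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    ids = set(dense_n) | set(sparse_n)
    # A document missing from one ranker simply contributes 0 for that ranker.
    return {
        i: w_dense * dense_n.get(i, 0.0) + w_sparse * sparse_n.get(i, 0.0)
        for i in ids
    }


dense = {"doc_a": 0.82, "doc_b": 0.80, "doc_c": 0.35}   # cosine similarities
sparse = {"doc_a": 1.2, "doc_b": 7.5, "doc_c": 0.0}     # full-text (BM25-style) scores

combined = combine(dense, sparse)
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```

In this example the dense ranker alone slightly prefers doc_a, but the strong keyword match lifts doc_b to the top once the scores are combined.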
[19:50] Ben: So the last thing that you want to add to a simple pipeline, the thing that I think really completes your MVP++, is using metadata and metadata filtering, which academic benchmarks don’t cover, because in academic benchmarks documents exist mostly in a vacuum. They don’t exist in the real world, they’re not tied to a specific company, etc. When you’re using RAG in production, it’s very, very rare that someone comes to you and says, these documents came to me in a dream. They came from somewhere. They’ve been generated by a department.
[20:21] Ben: They’ve been generated for a reason. They might be old Excel sheets or whatever, but they have business sense or they have in-context sense. And the metadata is actually sometimes a lot more informative than the document content, especially in RAG contexts. So take the query here, which is: can you get me the cruise division financial report for Q4 2022? There’s a lot of ways in which this can go wrong if you’re just looking at it from the semantic or even the keyword aspect.
[20:52] Ben: When you see a query like this, the model must capture the financial report. So the model must figure out that you want the financial report, but also cruise division, Q4, and 2022, and embedding models are bad at numbers. So you might get a financial report, but maybe for another division, or maybe for the cruise division in 1998. It’s very hard to just hope that your vector will capture all of this.
[21:15] Ben: But there’s another failure case, which will happen especially with weaker LLMs: if you just have a fixed top-k, say top five, you retrieve the top five documents for your query. Even if your model is very good, if you just let it retrieve the top five documents no matter what, you will end up with five financial reports, and there’s most likely only one for Q4 2022.
[21:37] Ben: So at that point, you’re just passing all five to the model and being like, good luck, use the right one, which might confuse it, especially because tables can be hard, et cetera. And I’m not saying that your vector search will fail, but statistically it will. In most cases, it will fail. If it doesn’t fail for this query, it will fail for a similar one. But that’s actually very, very easy to mitigate. You just have to think outside of the vector and just use more traditional methods. You can use entity detection models.
[22:07] Ben: One that’s very good for this is GLiNER, which is a very recent model that does basically zero-shot entity detection. You give it arbitrary entity types, so document type, time period, and the department. And this is like a live demo of GLiNER. You can run the demo at the bottom, but here we just extract financial report, time period, and department. And when you generate your database for RAG, all you need to do is basically specify the time period.
[22:32] Ben: So when you get an Excel sheet, you will just parse the name of it, or parse the date in it, and give it the metadata, here 2022 Q4. Then you just need to ensure that this is stored alongside your document. At query time, you can always pre-filter your document set to only query things that make sense. You will only query documents for this relevant time period.
[22:56] Ben: You ensure that even if you give your model the wrong thing, it will at least be the right time frame so it can maybe try and make sense of it. And with this final component, this is what your pipeline looks like. You can see the new component here, which is metadata filtering, which doesn’t apply to queries. Queries go right through it. The documents get filtered by that, and we won’t perform search on documents that will not meet the metadata that we want.
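Editor’s sketch of the metadata step described above: parse a time period out of a file name at indexing time, store it next to the document, and pre-filter on it before any search runs. The file-naming scheme, the field names, and the helper functions are all hypothetical. The query-side values are hardcoded where an entity extractor such as GLiNER would supply them in practice.

```python
import re

# Matches e.g. "2022_Q4", "2022-Q4" or "2022Q4" (assumed naming convention).
PERIOD_PATTERN = re.compile(r"(?P<year>(?:19|20)\d{2})[ _-]?Q(?P<quarter>[1-4])", re.IGNORECASE)


def period_from_filename(name):
    """Return a period such as '2022-Q4', or None if the name carries no period."""
    match = PERIOD_PATTERN.search(name)
    if match is None:
        return None
    return f"{match.group('year')}-Q{match.group('quarter')}"


def prefilter(records, **wanted):
    # Keep only records whose metadata matches every requested field exactly.
    return [r for r in records if all(r.get(k) == v for k, v in wanted.items())]


files = [
    ("cruise_financial_report_2022_Q4.xlsx", "cruise"),
    ("cruise_financial_report_1998_Q4.xlsx", "cruise"),
    ("cargo_financial_report_2022_Q4.xlsx", "cargo"),
]
# Indexing time: metadata is stored alongside each document.
records = [
    {"source": name, "department": dept, "period": period_from_filename(name)}
    for name, dept in files
]

# Query time: "cruise division financial report for Q4 2022".
candidates = prefilter(records, department="cruise", period="2022-Q4")
print([r["source"] for r in candidates])
```

Vector search, full-text search, and re-ranking would then run only over `candidates`, so a report from the wrong division or the wrong year can never reach the LLM.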
[23:20] Ben: And okay, I do agree that this looks a lot scarier than the friendly one at the start, which just had your embedder and then cosine similarity search and the results. It is actually not very scary. This is your full pipeline. This implements everything we’ve just talked about. It’s about 25 lines of code if you remove the comments. It does look a bit more unfriendly because there’s a lot more moving parts, I think. There’s a lot more steps, but if you want, we can just break it down a bit further. We use LanceDB for this.
[23:50] Ben: This is not necessarily an endorsement of LanceDB as a vector DB, although I do like LanceDB because it makes all of these components, which are very important, very easy to use. But I try not to take sides in the vector DB wars because I’ve used Weaviate, I’ve used Chroma, I’ve used LanceDB, I’ve used Pinecone; they all have their place. But I think LanceDB, if you’re trying to build an MVP, is the one you should always use for MVPs right now because it has those components built in.
[24:14] Ben: And here you can see just how easy it actually is. So we still load the bi-encoder, just in a slightly different way, same as earlier. We define our document metadata. Here it is just a string category, but it could be a timestamp, it could be just about anything. Then we encode a lot of documents just like we did previously. Here we’ve not created an index, so this is still an exact search, not an approximate search. Then we create a full-text search index, which is generating those TF-IDF weights.
[24:40] Ben: As I mentioned before, we give a weight to every single term in the documents. Then we load the reranker. Here we’re using the Cohere reranker because it’s simple to use an API. And at the very end, you’ve just got your query and your search, where we restrict it to the category equals film. So we will only ever search in the documents that are about a film, not about an author, not about a director. We get the top 10 results and we just have a quick re-ranking step. And that’s pretty much it.
[25:06] Ben: We’ve taken the pipeline at the start, which only had the bi-encoder component, to a pipeline that now has the bi-encoder component, metadata filtering, full text search, and a reranker at the end. So we’ve added basically the four most important components of retrieval into a single pipeline. And it really doesn’t take much more space in your code. And yeah, that is pretty much the end of this talk. So there’s a lot more to cover in RAG. This is definitely not full coverage of RAG, but this is the most important thing.
[25:36] Ben: This is what you need to know about how to make a good pipeline very quickly. All the other improvements are very, very valuable, but they have a decreasing cost-effort ratio. This takes virtually no effort to put in place. It’s definitely worth learning about sparse methods and multi-vector methods, because they are very well suited to a lot of situations. ColBERT, for instance, is very strong out of domain. Sparse is very strong in domain.
[25:58] Ben: You should watch Jason’s talk about RAG systems and Joe’s upcoming talk about retrieval evaluations, because together those are a clear trifecta of the most important things. And yeah, any questions now?
[26:12] Participant 3: Hamel and I were just messaging saying… We love this talk. Everything is presented so clearly. We’ve also got quite a few questions.
[26:30] Hamel: My favorite talk so far. Not to pick favorites, but yeah.
[26:36] Ben: Thank you.
[26:37] Participant 3: Go ahead.
[26:41] Hamel: Okay, questions. Did you have one that you were looking at already, Dan? I can tell it.
[26:47] Participant 3: Yeah, we’ve got one that I quite like. Can the way that you fine-tune your bi-encoder model affect how you should approach fine-tuning for your cross-encoder and vice versa?
[26:58] Ben: Yes. I don’t think I can give a really comprehensive answer because it will really depend on your domain, but you generally want them to be complementary. So if you’re in a situation where you’ve got the compute and the data to fine-tune both, you always want your bi-encoder to be a bit looser. You want it to retrieve potential candidates, and then you want to trust your reranker, your cross-encoder, to actually do the filtering.
[27:21] Ben: So if you’re going to use both and have full control over both, you might want to fine tune it in a way that will basically make sure that your top K candidates can be a bit more representative and trust the reranker.
[27:35] Participant 3: Let me ask, this wasn’t an audience question, but a related question. You showed us that when you choose documents to feed into the re-ranker, that’s sort of a weighted average of what you get from TF-IDF or BM25 with what you get from the simple vector search. What do you think of as the advantage or disadvantage of that over saying we’re going to take the top X from one of the rankers and the top X from the other?
[28:14] Participant 3: And that way, if you think one of these is, for some questions, especially bad, you have a way of short-circuiting its influence on what gets sent to the re-ranker.
[28:27] Ben: Yeah, I think that also makes complete sense. And that’s a cop-out answer I use a lot, but that also depends a lot on your data. A lot of the time you want to look at what your actual context is and how it’s actually being used. Because in some situations that actually works better, especially if you work with biomedical data, because there are so many very specific documents that quite often the embedding won’t be that amazing on some questions.
[28:52] Ben: So you just want to take the top five from both and get the re-ranker to do it, because the re-ranker is quite aware. So it’s a perfectly valid approach to combine them that way.
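A rough sketch contrasting the two ways of picking re-ranker candidates discussed in this exchange: blending scores from both retrievers versus taking the top results from each independently. The function names, the min-max normalization, and the `alpha` weight are assumptions made for illustration; neither the talk nor any specific library prescribes them.

```python
def weighted_fusion(dense, sparse, alpha=0.5, k=5):
    """Blend min-max normalized scores from both retrievers, keep the top k."""
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid dividing by zero when all scores match
        return {doc: (s - lo) / span for doc, s in scores.items()}

    d, s = normalize(dense), normalize(sparse)
    blended = {
        doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
        for doc in set(d) | set(s)
    }
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(blended, key=lambda doc: (-blended[doc], doc))[:k]


def top_k_union(dense, sparse, k=5):
    """Take the top k from each retriever separately, then deduplicate."""
    picked = []
    for scores in (dense, sparse):
        for doc in sorted(scores, key=lambda doc: (-scores[doc], doc))[:k]:
            if doc not in picked:
                picked.append(doc)
    return picked


# Toy scores: embedding similarities and BM25-style scores for document ids.
dense = {"a": 0.9, "b": 0.8, "c": 0.3}
sparse = {"c": 12.0, "d": 7.0, "a": 2.0}

blended_candidates = weighted_fusion(dense, sparse, alpha=0.7, k=2)
union_candidates = top_k_union(dense, sparse, k=2)
```

With these toy numbers the blended list drops document `c`, which the sparse retriever ranked first, while the union keeps it for the re-ranker to judge. That is the short-circuiting effect raised in the question.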
[29:04] Participant 3: You want to pick a question, Hamel?
[29:10] Hamel: Yeah, I’ve been looking through them. You guys have been… Okay, Jeremy’s asking, can we get a link to the code example? Yeah, sure. Your slides in Maven. We can also, can I share your slides in Discord as well, Ben?
[29:25] Ben: Yes, please.
[29:26] Hamel: Yeah. I’ll go ahead and share the slides in Discord.
[29:28] Ben: And I’ll share the GitHub gist for the code examples.
[29:34] Participant 3: And I’ll embed the link to the slides in Maven for people who watch this talk at some point deep in the future and might lose track of it in Discord. There’s a question somewhere in here I’ll find in a moment, but we got this question for Jason, and then the speaker just before you, Paige Bailey, said that RAG, in the world of million-token context lengths, is not going to be as important. What’s your take on the relative importance of RAG in the future?
[30:20] Ben: So I’m still very hopeful about RAG in the future. And I think I see it as some sort of like, so your LLM to me is like your CPU and your context window will be your RAM. And so like, even if you’ve got 32 gigs of RAM, nobody’s ever said, yeah, throw away your hard drive. You don’t need that. Like in a lot of contexts, you will still want to have like some sort of storage where you can retrieve the relevant documents.
[30:42] Ben: Having to use a long context window is never going to be a silver bullet. Just like RAG is never a silver bullet. But I’m actually really happy because it just means I can retrieve much longer documents and get more efficient RAG systems. Because to me, it’s a bit of a trade-off where if you’ve got a longer context, it just means you’ve got a lot more freedom with how quick your retrieval system can be. Because if you need to use top 10 or top 15, that’s fine. You can fit them in.
[31:06] Ben: Whereas when you can only fit the top three documents, you need your retrieval system to be really good, which might mean really slow. Yeah.
[31:12] Participant 3: So, yeah.
[31:13] Ben: So, yeah.
[31:26] Participant 3: We had a question from Wade Gilliam. What are your thoughts on different chunking strategies?
[31:36] Ben: I probably don’t think about chunking as much as I should. I am very hopeful for future avenues using LLMs to pre-chunk. I don’t think those work very well right now; in my tests I’ve never been impressed. Also, I do tend to use ColBERT more often than bi-encoders, and ColBERT is a lot more resistant to chunking, so it’s something that I don’t care about as much. But generally I would try to…
[32:01] Ben: So my go-to is always to chunk at around 300 tokens per chunk, and try to do it in a way where you never cut off a sentence in the middle, and always keep the last 50 tokens of the previous chunk and the first 50 tokens of the next chunk. Because information overlap is very useful to give context. Please don’t be afraid to duplicate information in your chunks.
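The chunking recipe above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not Ben’s actual code: tokens are approximated by whitespace-separated words, sentences are split with a naive regex, and `chunk_text` is a hypothetical helper name. A real pipeline would count tokens with the embedding model’s tokenizer.

```python
import re


def chunk_text(text, max_tokens=300, overlap=50):
    """Pack whole sentences into chunks, then pad each chunk with its neighbours."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    # Step 1: group sentences so no sentence is cut in the middle.
    # A single sentence longer than max_tokens becomes its own oversized chunk.
    cores, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_tokens:
            cores.append(current)
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        cores.append(current)

    # Step 2: add the tail of the previous chunk and the head of the next one.
    core_tokens = [" ".join(c).split() for c in cores]
    chunks = []
    for i, tokens in enumerate(core_tokens):
        before = core_tokens[i - 1][-overlap:] if i > 0 and overlap > 0 else []
        has_next = i + 1 < len(core_tokens)
        after = core_tokens[i + 1][:overlap] if has_next and overlap > 0 else []
        chunks.append(" ".join(before + tokens + after))
    return chunks


# Tiny limits so the behaviour is visible on a short text.
demo_chunks = chunk_text(
    "One two three. Four five six. Seven eight nine.", max_tokens=3, overlap=1
)
```

Note that the overlap is borrowed from neighbouring chunks at the word level, so the padding itself may start or end mid-sentence; only the core of each chunk respects sentence boundaries.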
[32:22] Hamel: I have a question about the bi-encoder. Do you ever try to fine-tune that using some kind of labeled data to get that to be really good? Or do you usually kind of use that off the shelf and then use a re-ranker? And how do you usually go about it, or how do you make the trade-off?
[32:43] Ben: So again, context dependent, but if you have data, you should always fine-tune all your encoders, be it the bi-encoder or the cross-encoder. I think with ColBERT, because it’s multi-vector, you can get away with not fine-tuning for a bit longer. But if you have data, it’s all about basically the resources you have. So in this talk, we’re doing an MVP, this is something you can put together in an afternoon. If your company says you have $500,
[33:10] Ben: spend 480 of that on OpenAI to generate synthetic questions and fine-tune your encoders; that will always get you better results. Like, always fine-tune your encoders if you can. And so, yes, there are a couple of questions about fitting ColBERT in, and I’m making an executive decision to answer those. So ColBERT in this pipeline: some people use it as a re-ranker, but that’s not optimal. That’s very much for when you don’t want to have to change your existing pipeline.
[33:50] Ben: If you were to design a pipeline from scratch and wanted to use ColBERT, you would have it instead of the bi-encoder and it would perform basically the same role as the bi-encoder, which is first-stage retrieval. And if you wanted to use ColBERT, and especially if you don’t have the budget to fine-tune and need a re-ranking step, sometimes it can actually be better to use ColBERT as a re-ranker still. Because the multi-vector approach can be better at capturing keywords, etc. But that’s very context-dependent. So ideally, you would have it as your bi-encoder.
[34:22] Participant 3: For a lot of people here who probably aren’t familiar with ColBERT, can you give the quick summary of it?
[34:32] Ben: Yeah, sorry, I got carried away because I saw the question. So ColBERT is an approach which is effectively a bi-encoder, but instead of cramming everything into a single vector, you represent each document as a bag of embeddings. So if you’ve got 100 tokens, instead of having one big 1,024-dimensional vector, you will have a lot of small 128-dimensional vectors, one for each token. And then you will score that at the end. You will do the same for the query. So if your query is 32 tokens, you will have 32 query token vectors.
[35:02] Ben: And for each query token, you will compare it to every token in the document and keep the highest score. And then you will sum up those highest scores and that will be the score for that given document. That’s called max similarity. And the reason that’s so powerful is not because it does very well on data it’s been trained on. You can beat it with a normal bi-encoder, but it does very well at extrapolating to out of domain because you just give the model so much more room to represent each token in its context.
[35:29] Ben: So it’s much easier if you’re in a non-familiar setting, because you’ve not compressed as much information. And a bit of self-promotion: I do have a pretty cool ColBERT thing coming out later this week to compress the ColBERT space by reducing the number of tokens it actually needs to store by about 50 to 60% without losing any performance. So that’s a bit of a teaser, but look forward to the blog post if you’re interested.
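The max similarity scoring described above can be written in a few lines. This is a plain-Python sketch that assumes cosine similarity between token vectors; `maxsim_score` is a hypothetical name, and real ColBERT implementations run this on batched tensors with learned token embeddings.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def maxsim_score(query_vecs, doc_vecs):
    """For each query token keep its best-matching document token, then sum."""
    q = [_normalize(v) for v in query_vecs]
    d = [_normalize(v) for v in doc_vecs]
    total = 0.0
    for qv in q:
        total += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return total


# Two query tokens, toy 2-dimensional embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_partial = [[2.0, 0.0], [3.0, 0.0]]  # only matches the first query token
doc_full = [[1.0, 0.0], [0.0, 5.0]]     # matches both query tokens

score_partial = maxsim_score(query, doc_partial)
score_full = maxsim_score(query, doc_full)
```

Because every query token contributes its own best match, a document covering both query tokens scores higher than one that matches a single token very well.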
[35:57] Participant 3: And to find the blog post, you suggest people follow you on Twitter or?
[36:02] Ben: Yeah, definitely follow me on Twitter, because it’s pretty much the only place where you can reliably reach me.
[36:14] Hamel: Someone’s asking what are some good tools to fine-tune embeddings for retrieval? Would you recommend RAGatouille or anything else? Like what’s your…
[36:24] Ben: I’d recommend sentence transformers, especially with the 3.0 release recently. It’s now much, much more fun to use. Basically, there’s no need to reinvent the wheel. They’ve got all the basics implemented very well there, so sentence transformers.
[36:44] Participant 3: Question from Divya. Can you give any pointers on how one fine-tunes their embedding model?
[36:53] Ben: Sorry, can you repeat that? I couldn’t hear it.
[36:55] Participant 3: Yeah. The question is, can you give any pointers or describe the flow for when you fine tune your embedding model?
[37:04] Ben: Okay. So that’s probably a bit more involved than this talk, but essentially when you fine-tune your embedding model, what you’ll want is queries. You need to have queries and you need your documents, and you’re going to tell the model: for this given query, this document is relevant, and for this given query, this document is not relevant. Sometimes that’s done with a triplet loss. A triplet loss is what you will use when you have one positive document and one negative document. And you’ll kind of be teaching the model, this is useful, this is not useful.
[37:32] Ben: And I’m not going to go down this too much because this rabbit hole can take you quite far. But sometimes when you have triplets, you also want to use what we call hard negatives, which means you actually use retrieval to generate your negative examples, because you want them to be quite close to what the positive example is, but not quite the right thing. Because that’s how you teach the model more about what is actually useful to answer your query.
[37:57] Ben: So the workflow is probably, as always, look at your data, figure out what kind of queries your users will actually be doing. If you don’t have user queries, go into production, or write some queries yourself and give them to an LLM to generate more queries, and you can have a pretty solid retrieval pipeline like that.
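To make the triplet and hard-negative ideas above concrete, here is a small sketch in plain Python. It assumes a cosine-similarity form of the triplet loss with a margin, which is one common variant and not necessarily the one used by any given library; `triplet_loss` and `mine_hard_negatives` are hypothetical helper names, and the 2-dimensional vectors stand in for real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_loss(query, positive, negative, margin=0.2):
    """Zero once the positive beats the negative by at least `margin`."""
    return max(0.0, margin - cosine(query, positive) + cosine(query, negative))


def mine_hard_negatives(query, doc_vecs, positive_ids, k=2):
    """Retrieve by similarity and keep the top non-relevant documents."""
    ranked = sorted(
        doc_vecs, key=lambda doc_id: cosine(query, doc_vecs[doc_id]), reverse=True
    )
    return [doc_id for doc_id in ranked if doc_id not in positive_ids][:k]


query = [1.0, 0.0]
doc_vecs = {
    "pos": [1.0, 0.1],   # labelled relevant
    "near": [1.0, 0.5],  # close to the query but not relevant
    "mid": [1.0, 2.0],
    "far": [0.0, 1.0],   # unrelated
}

hard_negatives = mine_hard_negatives(query, doc_vecs, positive_ids={"pos"}, k=2)
easy_loss = triplet_loss(query, doc_vecs["pos"], doc_vecs["far"])
hard_loss = triplet_loss(query, doc_vecs["pos"], doc_vecs["near"])
```

The unrelated document already yields zero loss, so it teaches the model nothing, while the near miss still produces a positive loss. That is the reason for mining negatives with retrieval instead of sampling them at random.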
[38:16] Hamel: Someone’s asking in the Discord, and I get this question all the time, is please share your thoughts on graph rag.
[38:25] Ben: I have never actually done graph RAG. I see this mentioned all the time, but it’s not something that has come up for me at all. So I don’t have strong thoughts about it. I think it’s cool, but that’s pretty much the full extent of my knowledge.
[38:49] Hamel: Someone’s asking, okay, when you have long context windows, does that allow you to do something different with RAG, like retrieve longer documents or do any other different kinds of strategies than you were able to before? Does it change anything? How you go about this?
[39:09] Ben: Yeah, I think it’s a bit like what I mentioned before. To me it changes two main things. One is I can use longer documents, which means I can use longer models, or I can stitch chunks together. Because sometimes if your retrieval model isn’t very good at retrieving long documents, which is often the case, you might just want to say: if I get a chunk from this document, give the model the full document. If I just get a chunk from it, pass the full context, and you just hope the model is able to read it.
[39:34] Ben: And if you’ve got a good long context model, it can. So it changes how you decide to feed the information into the model. And then the other aspect is, like I said, it changes the retrieval overhead because if you need to be very good, like I was saying, if you need only the top three documents to be relevant, you’re going to spend a lot of time and money on retrieval pipeline. If you’re like, oh, as long as my recall at 10 or my recall at 15 is good, that’s fine.
[39:56] Ben: You can afford to have much lighter models and spend a lot less time and resources on retrieval. There are a lot of diminishing returns in retrieval. Getting a good recall at 10 (recall at 10 is how likely you are to retrieve the relevant document in the first 10 results) is generally very easy. Recall at 100 is very, very easy. And then recall at 5 is getting harder. And recall at 3 and recall at 1 are the really tough ones, because a lot of the training data is noisy.
[40:23] Ben: So it can even be hard to know what a good recall at 1 is. So longer context makes that irrelevant. And that’s why it’s great for RAG.
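Recall at k, as defined above, is straightforward to compute once you have ranked results and one known relevant document per query. The sketch below assumes exactly that setup; the `recall_at_k` helper and the query and document ids are made up for illustration.

```python
def recall_at_k(results, relevant, k):
    """Share of queries whose relevant document appears in the first k results."""
    hits = sum(1 for qid, ranked in results.items() if relevant[qid] in ranked[:k])
    return hits / len(results)


# Ranked document ids returned for each query (toy data).
results = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d9", "d4"],
    "q3": ["d5", "d6", "d8"],
}
# The single relevant document for each query.
relevant = {"q1": "d1", "q2": "d4", "q3": "d0"}

recall_1 = recall_at_k(results, relevant, 1)
recall_2 = recall_at_k(results, relevant, 2)
recall_3 = recall_at_k(results, relevant, 3)
```

On this toy data recall grows as k grows, which mirrors the point that a long context window lets you accept a larger k and therefore a lighter retriever.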
[40:49] Hamel: Someone’s asking, and I don’t even know what this means, what’s your view on PIDE versus ReAct versus StepBack?
[40:58] Ben: I’ve only used ReAct out of those. And so those are like agentic systems of function calling. It’s to give your LLM the ability to call tools, at least ReAct is. I don’t have strong thoughts on those in the context of retrieval, so I can’t really answer the question. I would occasionally use ReAct for the model to be able to trigger a search itself, but I think that’s still an open area of research. And I think Griffin from Answer AI is also in the chat and he’s very interested in that.
[41:31] Ben: It’s basically how do you get a model to tell you that it doesn’t know? Because sometimes you don’t need retrieval, the model already knows. Sometimes you do need retrieval, but that’s still a very open question. Like how do you decide when to search? So no strong thoughts there yet.
[41:50] Participant 3: You may or may not have a good answer for this one. Is there an end-to-end open-source project that someone could look at as a way to see or evaluate the difference in result quality when they go from the bi-encoder-only MVP to the final compact MVP++ that you showed?
[42:14] Ben: No, actually, that’s a very good point. I don’t think there is one that systematically goes through every step. And that’s probably something that I would like to build at some point, or find one, because just like most things in retrieval, everything is kind of conventional wisdom. You’ve seen it in bits and pieces in a lot of projects and you just know that that’s how it is. But unless you dig deep into the papers or do it yourself, it’s quite rare to find very good resources showing that.
[42:54] Participant 3: A related question, do you have a tutorial that you typically point people to on fine-tuning their encoder?
[43:08] Ben: That would be the sentence transformers documentation, but it’s not the friendliest tutorial, so that’s a half answer. That’s what I would point you to, but it’s still a bit hard to get into, sadly.
[43:40] Hamel: Wade is asking if you have go-to embedding models.
[43:48] Ben: I think my go-to these days when I’m demoing something is the Cohere one, because it’s nice to be able to work with an API. It works really well. It’s cheap. But other than that, I would just use ColBERT if I’m using something in my own pipeline. I would use multi-vectors. But it really depends on the use case, because you would often find that some things work well for you and some things don’t. I do have strong opinions on not using…
[44:16] Ben: So if you go to the MTEB leaderboard, which is the embedding leaderboard right now, you’ll see a lot of LLMs as encoders. And I would advise against that because the latency is not worth it. You don’t need 7 billion parameters to encode stuff. And at least some of the early ones actually generalized worse. Nils Reimers from Cohere had a really interesting table where E5 Mistral was worse than E5 large, despite being seven times as big.
[44:47] Ben: So probably just stick to the small ones, between 100 million and at most a billion parameters, but that would be my only advice about that. Try all the good ones like GTE, BGE, E5.
[45:00] Hamel: Chris Levy is asking this question about Elasticsearch, which I also get quite a lot. So he asks: Anyone here have experience building a RAG application with just keyword BM25 as a retriever at work? It makes use of Elasticsearch. And he said it’s all over the tech stack, that people are already using Elasticsearch. Basically he’s asking, is there a way to keep using Elasticsearch with RAG that you know about or that you have encountered? Or do you mainly use a vector database like LanceDB and things like that?
[45:32] Hamel: Have you seen people using Elasticsearch and trying to bootstrap off of that?
[45:37] Ben: Yeah, I’ve used Elasticsearch a bit and it’s perfectly possible. You do lose obviously the semantic search aspect, although I think now Elasticsearch has a vector DB offering, so you could add vectors to it. You could always plug in, you could always just do BM25 and then plug in a re-ranker at the end. That’s often, if you read papers on like cross encoders, generally the way they evaluate them is actually doing just that, like do BM25 to retrieve 50 to 100 documents and then rank them using the re-ranker.
[46:07] Ben: Which, if you can afford to just set up your re-ranking pipeline or call the Cohere API, is a really good way to go about it, because you don’t need to embed your whole documents to see how good it would be with deep learning. Because there are domains where you do not need deep learning; BM25 is still good enough in some places. And, you know, I think it’s become very apparent: BM25 has never told anyone they should eat three rocks a day, whereas embeddings have, so…
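The "BM25 first, re-ranker second" setup described above can be sketched without any search engine. This is a from-scratch Okapi BM25 with common default parameters (`k1=1.5`, `b=0.75`) and whitespace tokenization, all of which are assumptions; `toy_rerank` is a hypothetical stand-in for a real cross-encoder or hosted reranking API, which this sketch does not call.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tf = [Counter(tokenize(d)) for d in docs]
        self.doc_len = [sum(tf.values()) for tf in self.tf]
        self.avg_len = sum(self.doc_len) / len(docs)
        n = len(docs)
        df = Counter(term for tf in self.tf for term in tf)
        # Smoothed inverse document frequency: rarer terms get larger weights.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        total = 0.0
        for term in tokenize(query):
            f = self.tf[i].get(term, 0)
            if f == 0:
                continue
            length_norm = 1 - self.b + self.b * self.doc_len[i] / self.avg_len
            total += self.idf[term] * f * (self.k1 + 1) / (f + self.k1 * length_norm)
        return total

    def top_k(self, query, k):
        scored = [(self.score(query, i), i) for i in range(len(self.tf))]
        scored = [(s, i) for s, i in scored if s > 0]  # drop non-matching docs
        return [i for s, i in sorted(scored, key=lambda p: (-p[0], p[1]))[:k]]


def retrieve_then_rerank(query, docs, bm25, rerank_fn, n_candidates=50, k=3):
    """Cheap keyword retrieval first, expensive scoring only on the candidates."""
    candidates = bm25.top_k(query, n_candidates)
    return sorted(candidates, key=lambda i: rerank_fn(query, docs[i]), reverse=True)[:k]


def toy_rerank(query, doc):
    """Placeholder scorer (word overlap); a cross-encoder would go here."""
    q = set(tokenize(query))
    return len(q & set(tokenize(doc))) / len(q)


docs = ["the cat sat on the mat", "dogs chase the cat", "stock markets fell today"]
bm25 = BM25(docs)
bm25_hits = bm25.top_k("cat mat", 2)
final_hits = retrieve_then_rerank("cat mat", docs, bm25, toy_rerank)
```

Only the documents BM25 returns are ever passed to the reranker, so nothing has to be embedded up front, which is the low-cost way to test whether a deep-learning step helps on your data.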
[46:35] Hamel: Dimitri is asking, is it worthwhile to weigh the BM25 similarity score during the re-ranking step as well?
[46:45] Ben: Probably not. You generally just want to use BM25 to retrieve candidates, but you don’t need to give those scores to your cross-encoder.
[46:59] Participant 3: There’s a question. I’m going to change it slightly. Someone asks about retrieving from many documents rather than finding the best one. Maybe the tweak there is, if you have a theory that information within any single document is so correlated that you actually want to try and get some diversity, are you familiar with or have you used approaches where you specifically try, in some loss function somewhere, to encourage that diversity and encourage pulling from many documents rather than from one?
[47:37] Ben: I have not done that myself. I know that there are different loss methods to optimize for diversity versus pure accuracy. But I don’t think I would be able to give you a clear answer without sounding really confident about something I don’t know much about.
[47:59] Participant 3: Have you used hierarchical RAG? Any thoughts on it?
[48:03] Ben: I have not, and I don’t think it’s very needed for the current pipelines. I think there’s a lot of other steps you can improve on.
[48:18] Participant 3: Since I think we have several Answer AI people here, I don’t know if this is a question or a request, I’m eager to learn if Answer AI will come up with any books on LLM applications in the future.
[48:33] Ben: I don’t think so, but never say never. Jeremy, if you want to chime in. Because I can’t make any promises because my boss is watching.
[49:00] Participant 3: Do you see anything else? Ben, did you say that you can’t see the questions?
[49:05] Ben: Yeah, they’re all blank for me. I saw one earlier, but they really show up sporadically.
[49:10] Participant 3: Yeah. Not sure what’s happened with it. And I think people also cannot upvote these. So a couple of quirks today. Do you see any others here, Hamel, that you think we should pull in?
[49:25] Hamel: Um, no, not necessarily. I think probably going to the Discord is pretty good now. Yep, tons of activity there as well. I mean, there’s an infinite number of questions, so we can keep going. Okay, Lorien is asking, what’s the best strategy when chunks of documents don’t fit into the context window? Do you do RAG in a MapReduce style, summarize aggressively? What are the techniques that you’ve seen work most effectively?
[50:25] Ben: So that’s, I think, a very broad question, because it’s like, why do they not fit? Is it because like every document is really long? Is it because you need a lot of different documents, etc, etc? So. And also another important aspect is what’s the latency tolerance? Because quite a lot of the time you can make RAG infinitely better, but users won’t stay waiting like 20 seconds for an answer. So you need to figure out like, how much time do I have?
[50:52] Ben: One thing that you often see, and what I’ve done in production actually, is to retrieve the full documents but have another database that maps every document to its summary. So you will have done your LLM summarization at a previous step. You will retrieve the relevant chunks, and then you will pass the relevant summaries to the context window. But that kind of depends on your actual setting. I have another call at 10, which is in five minutes for me. So if you’ve got another final question.
[51:35] Hamel: I really enjoyed this presentation.
[51:39] Ben: Thank you.
[51:42] Participant 3: Yeah, this was really great. Everything is super clear and well presented, so thanks so much.
[51:54] Ben: Thank you. Cheers.
[51:59] Participant 3: Thanks, everyone.