Boosting RAG: Choosing the Best Embedding & Reranker Models


When you're building a Retrieval Augmented Generation (RAG) pipeline, one crucial component is the Retriever. You have a variety of embedding models to choose from, including OpenAI, CohereAI, and open-source sentence transformers. Additionally, there are several rerankers available from CohereAI and sentence transformers. But with so many options, how do you determine the best combination for top-notch retrieval performance? How do you know which embedding model works best with your data? Or which reranker gives the biggest boost to your results?


In this blog post, we'll use the Retrieval Evaluation module from LlamaIndex to quickly identify the best mix of embedding and reranker models. Let's dive in!


First, let's understand the metrics used in Retrieval Evaluation:


To measure the effectiveness of our retrieval system, we primarily rely on two widely accepted metrics: Hit Rate and Mean Reciprocal Rank (MRR).


Hit Rate: This metric calculates the percentage of queries where the correct answer is found within the top-k retrieved documents. In simpler terms, it tells us how often our system gets the right answer within the top few guesses.


Mean Reciprocal Rank (MRR): For each query, MRR evaluates the system's accuracy by looking at the rank of the highest-placed relevant document. It's the average of the reciprocals of these ranks across all the queries. So, if the first relevant document is the top result, the reciprocal rank is 1; if it's second, the reciprocal rank is 1/2, and so on.
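
To make the two definitions concrete, here is a minimal sketch in plain Python that computes both metrics by hand. It assumes exactly one relevant document per query, and the query and document IDs are made up for illustration; it is not the LlamaIndex implementation.

```python
from typing import Dict, List


def hit_rate(retrieved: Dict[str, List[str]], expected: Dict[str, str]) -> float:
    """Fraction of queries whose expected document appears in the retrieved list."""
    hits = sum(1 for q, doc in expected.items() if doc in retrieved.get(q, []))
    return hits / len(expected)


def mean_reciprocal_rank(retrieved: Dict[str, List[str]], expected: Dict[str, str]) -> float:
    """Average of 1/rank of the expected document; a miss contributes 0."""
    total = 0.0
    for q, doc in expected.items():
        ranked = retrieved.get(q, [])
        if doc in ranked:
            total += 1.0 / (ranked.index(doc) + 1)
    return total / len(expected)


# Toy data: three queries with their top-3 results (all IDs are hypothetical)
retrieved = {
    "q1": ["d1", "d7", "d9"],  # expected doc at rank 1 -> reciprocal rank 1
    "q2": ["d4", "d2", "d8"],  # expected doc at rank 2 -> reciprocal rank 1/2
    "q3": ["d5", "d6", "d0"],  # expected doc missing  -> reciprocal rank 0
}
expected = {"q1": "d1", "q2": "d2", "q3": "d3"}

print(hit_rate(retrieved, expected))              # 2 of 3 queries are hits
print(mean_reciprocal_rank(retrieved, expected))  # (1 + 1/2 + 0) / 3
```

Note that the hit rate ignores where in the top-k the document lands, while MRR rewards placing it higher, which is why a reranker can raise MRR even when the hit rate barely moves.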


Now that we understand the metrics, let's move on to the experiment. You can also follow along using our Google Colab Notebook.


Setting Up the Environment
pip install llama-index sentence-transformers cohere anthropic voyageai protobuf pypdf

Setting Up the Keys
import openai

openai_api_key = 'YOUR OPENAI API KEY'
cohere_api_key = 'YOUR COHEREAI API KEY'
anthropic_api_key = 'YOUR ANTHROPIC API KEY'
openai.api_key = openai_api_key

Download the Data

We will use the Llama2 paper for this experiment. Let's download the paper.

!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "llama2.pdf"

Load the Data

Let's load the data. We will use the pages from the start of the paper through page 36 for the experiment, excluding the table of contents, references, and appendix. The data is then parsed into nodes, which represent the chunks of data we want to retrieve.

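# Import paths assume the pre-0.10 llama-index package layout used throughout this post
from llama_index import SimpleDirectoryReader
from llama_index.node_parser import SimpleNodeParser

documents = SimpleDirectoryReader(input_files=["llama2.pdf"]).load_data()
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)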

Generating Question-Context Pairs

For evaluation, we create a dataset of question-context pairs: each entry is a question together with the chunk of our data that answers it. To avoid biasing the evaluation toward any of the embedding or reranker providers being compared, we use an Anthropic LLM to generate the pairs.


Let's initialize a prompt template to generate question-context pairs:


from llama_index.evaluation import generate_question_context_pairs
from llama_index.llms import Anthropic

# Prompt to generate questions
qa_generate_prompt_tmpl = """\
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, generate only questions based on the below query.
You are a Professor. Your task is to setup {num_questions_per_chunk} questions for an upcoming quiz/examination. \
The questions should be diverse in nature across the document. The questions should not contain options, not start with Q1/Q2. \
Restrict the questions to the context information provided.\
"""

llm = Anthropic(api_key=anthropic_api_key)
qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, qa_generate_prompt_tmpl=qa_generate_prompt_tmpl,
    num_questions_per_chunk=2)

Filtering the Dataset

The LLM sometimes returns a preamble such as "Here are 2 questions based on the provided context." as if it were a question. We filter out those entries.

from llama_index.evaluation import EmbeddingQAFinetuneDataset

def filter_qa_dataset(qa_dataset):
    # Keep only the queries that do not contain the preamble phrases
    filtered_queries = {k: v for k, v in qa_dataset.queries.items() if 'Here are 2' not in v and 'Here are two' not in v}
    # relevant_docs maps query ID -> list of node IDs, so filter it by the surviving query IDs
    filtered_relevant_docs = {k: v for k, v in qa_dataset.relevant_docs.items() if k in filtered_queries}
    # Create a new instance of EmbeddingQAFinetuneDataset with the filtered data
    return EmbeddingQAFinetuneDataset(queries=filtered_queries, corpus=qa_dataset.corpus, relevant_docs=filtered_relevant_docs)

qa_dataset = filter_qa_dataset(qa_dataset)
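
The filtering idea can be tried out on plain dictionaries. The sketch below assumes that the queries mapping goes from query ID to question text and that the relevant-documents mapping goes from query ID to a list of node IDs; the helper name `filter_pairs` and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

# Phrases that mark an LLM preamble rather than a real question
PREAMBLE_MARKERS = ("Here are 2", "Here are two")


def filter_pairs(
    queries: Dict[str, str],
    relevant_docs: Dict[str, List[str]],
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Drop preamble entries and the relevant-doc records that belong to them."""
    kept_queries = {
        qid: text
        for qid, text in queries.items()
        if not any(marker in text for marker in PREAMBLE_MARKERS)
    }
    # The doc mapping holds node IDs, not text, so it is filtered by query ID
    kept_docs = {qid: docs for qid, docs in relevant_docs.items() if qid in kept_queries}
    return kept_queries, kept_docs


# Made-up sample: the first "question" is really a preamble
queries = {
    "q1": "Here are 2 questions based on the provided context:",
    "q2": "What data was Llama 2 pretrained on?",
}
relevant_docs = {"q1": ["node-1"], "q2": ["node-1"]}

kept_queries, kept_docs = filter_pairs(queries, relevant_docs)
print(kept_queries)
print(kept_docs)
```

Filtering the second mapping by ID keeps the two dictionaries in sync, so no query is left without its relevant document and vice versa.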

Creating a Custom Retriever

To find the optimal retriever, we combine an embedding model with a reranker. We start with a base VectorIndexRetriever and then introduce a reranker to refine the results. For this experiment, we set similarity_top_k to 10 and picked the top 5 with the reranker. Here is the code using OpenAIEmbedding.

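from llama_index import ServiceContext, VectorStoreIndex
from llama_index.embeddings import OpenAIEmbedding
from llama_index.retrievers import VectorIndexRetriever

embed_model = OpenAIEmbedding()
service_context = ServiceContext.from_defaults(llm=None, embed_model=embed_model)
vector_index = VectorStoreIndex(nodes, service_context=service_context)
vector_retriever = VectorIndexRetriever(index=vector_index, similarity_top_k=10)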

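from typing import List

from llama_index.indices.query.schema import QueryBundle, QueryType
from llama_index.retrievers import BaseRetriever, VectorIndexRetriever
from llama_index.schema import NodeWithScore

class CustomRetriever(BaseRetriever):
    """Custom retriever that runs a vector search and then optionally reranks the results."""
    def __init__(
        self,
        vector_retriever: VectorIndexRetriever,
        reranker=None,
    ) -> None:
        """Initialize parameters."""
        self._vector_retriever = vector_retriever
        # Optional node postprocessor (for example CohereRerank); None disables reranking
        self._reranker = reranker
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve nodes given query."""
        retrieved_nodes = self._vector_retriever.retrieve(query_bundle)

        if self._reranker is not None:
            # Rerank the candidates; the reranker is expected to be configured to keep the top 5
            retrieved_nodes = self._reranker.postprocess_nodes(retrieved_nodes, query_bundle)
        else:
            # No reranker: keep the top 5 by embedding similarity
            retrieved_nodes = retrieved_nodes[:5]

        return retrieved_nodes

    async def _aretrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Asynchronously retrieve nodes given query. Delegates to the synchronous path."""
        return self._retrieve(query_bundle)

    async def aretrieve(self, str_or_query_bundle: QueryType) -> List[NodeWithScore]:
        if isinstance(str_or_query_bundle, str):
            str_or_query_bundle = QueryBundle(str_or_query_bundle)
        return await self._aretrieve(str_or_query_bundle)

# Baseline without a reranker; pass reranker=... to evaluate a reranked combination
custom_retriever = CustomRetriever(vector_retriever)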
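
The two-stage idea behind this retriever (a wide first-stage search, then a reranker that narrows the shortlist) can be illustrated without any external service. The sketch below is a toy under stated assumptions: the similarity scores are made up, and `overlap_score` is a hypothetical word-overlap stand-in for a real reranking model.

```python
from typing import Callable, List, Tuple


def retrieve_then_rerank(
    query: str,
    candidates: List[Tuple[str, float]],
    rerank_score: Callable[[str, str], float],
    first_stage_k: int = 10,
    final_k: int = 5,
) -> List[str]:
    """Keep the first_stage_k best candidates by similarity, then reorder them with the reranker."""
    # Stage 1: shortlist by the (precomputed) embedding similarity score
    shortlist = sorted(candidates, key=lambda c: c[1], reverse=True)[:first_stage_k]
    # Stage 2: rescore each shortlisted text against the query and keep the best final_k
    rescored = sorted(shortlist, key=lambda c: rerank_score(query, c[0]), reverse=True)
    return [text for text, _ in rescored[:final_k]]


def overlap_score(query: str, text: str) -> float:
    """Hypothetical reranker: number of distinct words shared by query and text."""
    return float(len(set(query.split()) & set(text.split())))


# Made-up candidates as (text, first-stage similarity) pairs
candidates = [
    ("llama 2 is a family of language models", 0.80),
    ("the weather is sunny", 0.75),
    ("llama 2 chat is tuned with rlhf", 0.70),
    ("unrelated text", 0.10),
]

top = retrieve_then_rerank(
    "how is llama 2 chat tuned", candidates, overlap_score, first_stage_k=3, final_k=2
)
print(top)
```

In this toy run the third candidate by similarity moves to the top after reranking, which is the kind of reordering that MRR is sensitive to.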

Evaluation

To evaluate our retriever, we compute the Mean Reciprocal Rank (MRR) and Hit Rate metrics.

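from llama_index.evaluation import RetrieverEvaluator

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=custom_retriever)
# Top-level await works in a notebook; in a script, wrap the call in asyncio.run(...)
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)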
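
The evaluator returns one result per query, so the per-query values still need to be averaged into a single Hit Rate and MRR. A minimal sketch of that step follows; it assumes each result can be reduced to a dictionary of metric values (in LlamaIndex this is believed to be exposed as `metric_vals_dict`, which should be treated as an assumption), and the helper name `summarize` and the sample numbers are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def summarize(metric_dicts: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over all per-query result dictionaries."""
    names = metric_dicts[0].keys()
    return {name: mean(d[name] for d in metric_dicts) for name in names}


# Made-up per-query values, e.g. [r.metric_vals_dict for r in eval_results]
per_query = [
    {"hit_rate": 1.0, "mrr": 1.0},
    {"hit_rate": 1.0, "mrr": 0.5},
    {"hit_rate": 0.0, "mrr": 0.0},
]

summary = summarize(per_query)
print(summary)
```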

Results:

We tested various embedding models and re-rankers. Here are the models we considered:


Embedding Models:

  • OpenAI Embedding

  • Voyage Embedding

  • CohereAI Embedding (v2.0/v3.0)

  • Jina Embeddings

  • BAAI/bge-large-en


Rerankers:

  • CohereAI

  • bge-reranker-base

  • bge-reranker-large
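
One way to organize such a comparison is a sweep over every embedding and reranker pairing, including the no-reranker baseline. The sketch below shows only the bookkeeping: `run_sweep`, `dummy_evaluate`, and the model names are hypothetical, and the scores it produces are placeholders, not measurements. In practice the `evaluate` callable would build the retriever for the pairing and run the retrieval evaluation.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Optional


def run_sweep(
    embeddings: List[str],
    rerankers: List[Optional[str]],
    evaluate: Callable[[str, Optional[str]], Dict[str, float]],
) -> List[Dict[str, Any]]:
    """Evaluate every (embedding, reranker) pairing and sort the rows by MRR, best first."""
    rows = []
    for embedding, reranker in product(embeddings, rerankers):
        metrics = evaluate(embedding, reranker)
        rows.append({"embedding": embedding, "reranker": reranker or "none", **metrics})
    return sorted(rows, key=lambda row: row["mrr"], reverse=True)


def dummy_evaluate(embedding: str, reranker: Optional[str]) -> Dict[str, float]:
    """Placeholder scorer returning arbitrary numbers; replace with a real evaluation."""
    score = 0.5 + (0.1 if reranker else 0.0) + (0.05 if embedding == "embed_b" else 0.0)
    return {"hit_rate": score, "mrr": score}


# Hypothetical model names; None stands for the no-reranker baseline
rows = run_sweep(["embed_a", "embed_b"], [None, "rerank_x"], dummy_evaluate)
for row in rows:
    print(row)
```

Keeping the baseline in the same sweep makes it easy to read off how much each reranker adds for a given embedding.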


The table below shows the evaluation results based on the Hit Rate and Mean Reciprocal Rank (MRR) metrics:


[Table with evaluation results]


Performance by Embedding
  • OpenAI: Performs exceptionally well, especially with CohereRerank and bge-reranker-large, indicating strong compatibility with reranking tools.

  • bge-large: Shows significant improvement with rerankers, particularly CohereRerank.

  • llm-embedder: Benefits greatly from reranking, especially with CohereRerank.

  • Cohere: The latest v3.0 embeddings outperform v2.0 and significantly improve with the integration of CohereRerank.

  • Voyage: Has strong initial performance, further amplified by CohereRerank.

  • JinaAI: Sees notable gains with bge-reranker-large, indicating that reranking significantly boosts its performance.


Impact of Re-rankers
  • Without Reranker: Provides the baseline performance for each embedding.

  • bge-reranker-base: Generally improves both hit rate and MRR across embeddings.

  • bge-reranker-large: Frequently offers the highest or near-highest MRR for embeddings.

  • CohereRerank: Consistently enhances performance across all embeddings, often providing the best or near-best results.


Necessity of Rerankers

The data clearly indicates the significance of rerankers in refining search results. Nearly all embeddings benefit from reranking, showing improved hit rates and MRRs.


Overall Superiority

The combinations of OpenAI + CohereRerank and Voyage + bge-reranker-large emerge as top contenders when considering both hit rate and MRR.

However, the consistent improvement brought by CohereRerank/bge-reranker-large across various embeddings makes them the standout choice for enhancing search quality, regardless of the embedding used.


Conclusions

In this blog post, we have demonstrated how to evaluate and enhance retriever performance using various embeddings and re-rankers. Here are our final conclusions:


  1. Embeddings: OpenAI and Voyage embeddings, especially when paired with CohereRerank/bge-reranker-large, set the gold standard for both hit rate and MRR.

  2. Re-rankers: The influence of re-rankers, particularly CohereRerank/bge-reranker-large, cannot be overstated. They play a key role in improving the MRR for many embeddings.

  3. Foundation is Key: Choosing the right embedding for the initial search is essential; even the best re-ranker can't help much if the basic search results aren't good.

  4. Integration: To get the best out of retrievers, it's important to find the right mix of embeddings and re-rankers. This study highlights the importance of careful testing and finding the best pairing.

