Evaluate a RAG Pipeline's Response with Python

Retrieval-Augmented Generation (RAG) is a technique for enriching LLM prompts with relevant data. Typically, the user prompt is converted into an embedding and relevant documents are retrieved from a vector store. The LLM is then called with those relevant documents included in the prompt.

RAG evaluation lets us assess how well our pipeline performs. It measures the accuracy, completeness, and faithfulness of the facts produced during the retrieval phase by analyzing the top results returned by the system, which allows us to automatically track and monitor the pipeline's performance. When designing an evaluation strategy for RAG applications, both stages should be evaluated:

  • Retrieval of documents from the vector store
  • Generation of the LLM output

It is important to evaluate these steps separately, because breaking RAG down into stages makes it easier to pinpoint where problems occur. There are several criteria used to evaluate RAG applications:

Output-based

  • Factuality (also called correctness): measures whether the LLM output is grounded in the provided ground truth.
  • Answer relevance: measures how directly the answer addresses the question.

Context-based

  • Context adherence (also called groundedness or faithfulness): measures whether the LLM output is grounded in the provided context.
  • Context recall: measures whether the context contains the correct information, compared against the provided ground truth, to produce an answer.
  • Context relevance: measures how much of the retrieved context is actually needed to answer a given query.
  • Custom metrics: you know your application better than anyone else. Create test cases that focus on what matters to you (for example, whether a specific document is cited, or whether the answer is too long).

Here we will evaluate a RAG pipeline based on the context retrieved from the vector store and the response generated by the LLM, using the following metrics:

*BLEU (Bilingual Evaluation Understudy):

The BLEU score is a widely used metric for machine translation tasks, where the goal is to automatically translate text from one language to another. It was proposed as a way to assess the quality of machine translations by comparing them against a set of reference translations produced by human translators.
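
As a quick illustration (not part of the pipeline built below), here is a minimal sacrebleu sketch that scores one candidate against one human reference; the strings are made-up examples:

from sacrebleu import corpus_bleu

# Hypothetical candidate/reference pair, only to show the API shape.
candidates = ["LCEL enables composition of arbitrary sequences."]
references = [["LCEL enables the composition of arbitrary sequences of components."]]

# corpus_bleu expects a list of hypotheses and a list of reference streams;
# the score is on a 0-100 scale, higher meaning closer n-gram overlap.
print(corpus_bleu(candidates, references).score)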

*ROUGE (Recall-Oriented Understudy for Gisting Evaluation):

The ROUGE score is a set of metrics commonly used for text summarization tasks, where the goal is to automatically produce a concise summary of a longer text. ROUGE was designed to assess the quality of machine-generated summaries by comparing them against reference summaries written by humans.
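
A minimal rouge-score sketch, again with made-up strings, showing how ROUGE-1 and ROUGE-L F1 are computed for a candidate against a reference:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

reference = "LangChain v0.1.0 focuses on observability and composability."
candidate = "The release focuses on composability and observability."

# score(target, prediction): a higher F-measure means more n-gram overlap.
scores = scorer.score(reference, candidate)
print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)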

*BERTScore

BERTScore is a metric for evaluating the quality of text generation models, such as machine translation or summarization systems. It uses pre-trained contextual BERT embeddings for both the generated text and the reference text, and then computes the cosine similarity between these embeddings.
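
A minimal bert-score sketch for one candidate/reference pair (the first call downloads the underlying model, which takes a moment); the strings are made-up examples:

from bert_score import score

candidates = ["LCEL enables composition of arbitrary sequences."]
references = ["LCEL allows composing arbitrary sequences of components."]

# Returns precision, recall and F1 tensors, one entry per pair.
P, R, F1 = score(candidates, references, lang="en")
print(P.mean().item(), R.mean().item(), F1.mean().item())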

*Perplexity

Perplexity is a measure used in natural language processing and machine learning to evaluate the performance of language models. It measures how well a model predicts the next word or character given the preceding context. The lower the perplexity, the better the model is at predicting the next word or character.
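
A minimal sketch of the idea: compute the mean cross-entropy loss of a text under GPT-2 and exponentiate it (the evaluator later in this post uses a sliding-window variant of the same calculation):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text = "LangChain makes it easy to compose LLM pipelines."  # made-up example
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # With labels supplied, the model returns the mean cross-entropy loss.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
print(torch.exp(loss).item())  # lower is better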

*Diversity

Diversity measures the uniqueness of bigrams in the generated output. Higher values indicate a more diverse and varied output.
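
A minimal sketch of the distinct-bigram ratio that the evaluator later in this post uses as its diversity score:

from nltk.util import ngrams

text = "the quick brown fox jumps over the lazy dog"  # made-up example
tokens = text.split()

# Ratio of unique bigrams to total tokens: closer to 1 means less repetition.
unique_bigrams = set(ngrams(tokens, 2))
print(len(unique_bigrams) / len(tokens) if tokens else 0)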

*Racial bias

Racial bias in natural language processing (NLP) is the unfair treatment of people based on their race. It can arise in several ways, including:

  • Language models: models can produce biased language and struggle with dialects and accents that are underrepresented in their training data, which can lead to discrimination and reinforce racial stereotypes.
  • Data: NLP systems can reflect biases present in the language data used to train them.
  • Annotations: mismatches between the pool of annotators and the data can introduce bias.
  • Study design: the way a study is designed can also introduce bias.
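
One lightweight way to flag such language is zero-shot classification. Below is a minimal sketch assuming the general-purpose NLI model facebook/bart-large-mnli; note that the evaluator later in this post uses a dedicated hate-speech model instead:

from transformers import pipeline

# facebook/bart-large-mnli is an assumption for this sketch, not the model
# used by the RAGEvaluator class below.
clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = clf("The generated answer text goes here.",  # made-up input
             candidate_labels=["hate speech", "not hate speech"])
bias_score = result["scores"][result["labels"].index("hate speech")]
print(bias_score)  # closer to 1.0 means more likely flagged as biased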

Code Implementation

The code below was run on a Google Colab T4 instance.

Install the required dependencies

!pip install sacrebleu 
!pip install rouge-score
!pip install bert-score
!pip install transformers
!pip install nltk
!pip install textblob
!pip install -qU dataset
!pip install -qU langchain
!pip install -qU langchain_community
!pip install -qU chromadb
!pip install -qU langchain-chroma
!pip install -qU langchain-huggingface
!pip install -qU sentence-transformers
!pip install -qU Flashrank

Build the RAG pipeline using Chroma and Phi-3-mini-4k-instruct

Instantiate the LLM

from langchain_huggingface import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id="microsoft/Phi-3-mini-4k-instruct",
    task="text-generation",
    pipeline_kwargs={
        "max_new_tokens": 100,
        "top_k": 50,
        "temperature": 0.1,
    },
)

Instantiate the embedding model

from langchain_huggingface import HuggingFaceEmbeddings
model_name = "BAAI/bge-small-en-v1.5"
model_kwargs = {"device": "cuda"}
encode_kwargs = {"normalize_embeddings": False}
embeddings = HuggingFaceEmbeddings(model_name=model_name,
                                   model_kwargs=model_kwargs,
                                   encode_kwargs=encode_kwargs)

Load the data

from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(
    "https://blog.langchain.dev/langchain-v0-1-0/"
)

documents = loader.load()
print(len(documents))

Split the documents into chunks

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1250,
    chunk_overlap = 100,
    length_function = len,
    is_separator_regex = False
)
#
split_docs = text_splitter.split_documents(documents)
print(len(split_docs))

Set up the VectorStore

from langchain_community.vectorstores import Chroma
vectorstore = Chroma(embedding_function=embeddings,
                     persist_directory="./chromadb1",
                     collection_name="full_documents")

Add the document chunks

vectorstore.add_documents(split_docs)
vectorstore.persist()

Advanced retrieval with FlashRank re-ranking

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank
#
retriever = vectorstore.as_retriever(search_kwargs={"k":10})
#
compressor = FlashrankRerank()
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor,
                                                       base_retriever=retriever
                                                      )

Set up the RetrievalQA chain

from langchain.chains import RetrievalQA
qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff",
                                 retriever=compression_retriever,
                                 return_source_documents=True
                                )

Ask a question

result = qa.invoke({"query":"What is mentioned about Composability?"})
print(result)
#########################RESPONSE##############################
{'query': 'What is mentioned about Composability?', 
'result': "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\nchanges. These can now be reflected on an individual integration basis with proper versioning in the standalone integration package.ObservabilityBuilding LLM applications involves putting a non-deterministic component at the center of your system. These models can often output unexpected results, so having visibility into exactly what is happening in your system is integral. 💡We want to make langchain as observable and as debuggable as possible, whether through architectural decisions or tools we build on the side.We’ve set about this in a few ways.The main way we’ve tackled this is by building LangSmith. One of the main value props that LangSmith provides is a best-in-class debugging experience for your LLM application. We log exactly what steps are happening, what the inputs of each step are, what the outputs of each step are, how long each step takes, and more data. We display this in a user-friendly way, allowing you to identify which steps are taking the longest, enter a playground to debug unexpected LLM responses, track token usage and more. Even in private beta, the demand for LangSmith has been overwhelming, and we’re investing a lot in scalability so that we can release a public beta and then make it generally available\n\na lot in scalability so that we can release a public beta and then make it generally available in the coming months. We are also already supporting an enterprise version, which comes with a within-VPC deployment for enterprises with strict data privacy policies.We’ve also tackled observability in other ways. We’ve long had built in verbose and debug modes for different levels of logging throughout the pipeline. We recently introduced methods to visualize the chain you created, as well as get all prompts used.ComposabilityWhile it’s helpful to have prebuilt chains to get started, we very often see teams breaking outside of those architectures and wanting to customize their chain - not only customize the prompt, but also customize different parts of the orchestration. 💡Over the past few months, we’ve invested heavily in LangChain Expression Language (LCEL). This enables composition of arbitrary sequences, providing a lot of the same benefits as data orchestration tools do for data engineering pipelines (batching, parallelization, fallbacks). It also provides some benefits unique to LLM workloads - mainly LLM-specific observability (covered above), and streaming, covered later in this post.The components for LCEL are in\n\nbug fixes. See more towards the end of this post on our plans for that.While re-architecting the package towards a path to a stable 0.1 release, we took the opportunity to talk to hundreds of developers about why they use LangChain and what they love about it. This input guided our direction and focus. We also used it as an opportunity to bring parity to the Python and JavaScript versions in the core areas outlined below. 💡While certain integrations and more tangential chains may be language specific, core abstractions and key functionality are implemented equally in both the Python and JavaScript packages.We want to share what we’ve heard and our plan to continually improve LangChain. We hope that sharing these learnings will increase transparency into our thinking and decisions, allowing others to better use, understand, and contribute to LangChain. 
After all, a huge part of LangChain is our community – both the user base and the 2000+ contributors – and we want everyone to come along for the journey.\xa0Third Party IntegrationsOne of the things that people most love about LangChain is how easy we make it to get started building on any stack. We have almost 700 integrations, ranging from LLMs to vector stores to tools for agents\n\nQuestion: What is mentioned about Composability?\nHelpful Answer:\nComposability: While it’s helpful to have prebuilt chains to get started, we very often see teams breaking outside of those architectures and wanting to customize their chain - not only customize the prompt, but also customize different parts of the orchestration. 💡Over the past few months, we’ve invested heavily in LangChain Expression Language (LCEL). This enables composition of arbitrary sequences, providing a lot of the same benefits as", 
'source_documents': [
Document(page_content='changes. These can now be reflected on an individual integration basis with proper versioning in the standalone integration package.ObservabilityBuilding LLM applications involves putting a non-deterministic component at the center of your system. These models can often output unexpected results, so having visibility into exactly what is happening in your system is integral. 💡We want to make langchain as observable and as debuggable as possible, whether through architectural decisions or tools we build on the side.We’ve set about this in a few ways.The main way we’ve tackled this is by building LangSmith. One of the main value props that LangSmith provides is a best-in-class debugging experience for your LLM application. We log exactly what steps are happening, what the inputs of each step are, what the outputs of each step are, how long each step takes, and more data. We display this in a user-friendly way, allowing you to identify which steps are taking the longest, enter a playground to debug unexpected LLM responses, track token usage and more. Even in private beta, the demand for LangSmith has been overwhelming, and we’re investing a lot in scalability so that we can release a public beta and then make it generally available', metadata={'language': 'en', 'source': 'https://blog.langchain.dev/langchain-v0-1-0/', 'title': 'LangChain v0.1.0', 'relevance_score': 0.9986438}), 
Document(page_content='a lot in scalability so that we can release a public beta and then make it generally available in the coming months. We are also already supporting an enterprise version, which comes with a within-VPC deployment for enterprises with strict data privacy policies.We’ve also tackled observability in other ways. We’ve long had built in verbose and debug modes for different levels of logging throughout the pipeline. We recently introduced methods to visualize the chain you created, as well as get all prompts used.ComposabilityWhile it’s helpful to have prebuilt chains to get started, we very often see teams breaking outside of those architectures and wanting to customize their chain - not only customize the prompt, but also customize different parts of the orchestration. 💡Over the past few months, we’ve invested heavily in LangChain Expression Language (LCEL). This enables composition of arbitrary sequences, providing a lot of the same benefits as data orchestration tools do for data engineering pipelines (batching, parallelization, fallbacks). It also provides some benefits unique to LLM workloads - mainly LLM-specific observability (covered above), and streaming, covered later in this post.The components for LCEL are in', metadata={'language': 'en', 'source': 'https://blog.langchain.dev/langchain-v0-1-0/', 'title': 'LangChain v0.1.0', 'relevance_score': 0.9979234}), 
Document(page_content='bug fixes. See more towards the end of this post on our plans for that.While re-architecting the package towards a path to a stable 0.1 release, we took the opportunity to talk to hundreds of developers about why they use LangChain and what they love about it. This input guided our direction and focus. We also used it as an opportunity to bring parity to the Python and JavaScript versions in the core areas outlined below. 💡While certain integrations and more tangential chains may be language specific, core abstractions and key functionality are implemented equally in both the Python and JavaScript packages.We want to share what we’ve heard and our plan to continually improve LangChain. We hope that sharing these learnings will increase transparency into our thinking and decisions, allowing others to better use, understand, and contribute to LangChain. After all, a huge part of LangChain is our community – both the user base and the 2000+ contributors – and we want everyone to come along for the journey.\xa0Third Party IntegrationsOne of the things that people most love about LangChain is how easy we make it to get started building on any stack. We have almost 700 integrations, ranging from LLMs to vector stores to tools for agents', metadata={'language': 'en', 'source': 'https://blog.langchain.dev/langchain-v0-1-0/', 'title': 'LangChain v0.1.0', 'relevance_score': 0.99791926})
]
}

Code to evaluate the response generated by the RAG pipeline

Evaluation

answer = result['result']
context = " ".join([d.page_content for d in result['source_documents']])
import torch
from sacrebleu import corpus_bleu
from rouge_score import rouge_scorer
from bert_score import score
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline
import nltk
from nltk.util import ngrams

class RAGEvaluator:
    def __init__(self):
        self.gpt2_model, self.gpt2_tokenizer = self.load_gpt2_model()
        self.bias_pipeline = pipeline("zero-shot-classification", model="Hate-speech-CNERG/dehatebert-mono-english")

    def load_gpt2_model(self):
        model = GPT2LMHeadModel.from_pretrained('gpt2')
        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
        return model, tokenizer

    def evaluate_bleu_rouge(self, candidates, references):
        bleu_score = corpus_bleu(candidates, [references]).score
        scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
        rouge_scores = [scorer.score(ref, cand) for ref, cand in zip(references, candidates)]
        rouge1 = sum([score['rouge1'].fmeasure for score in rouge_scores]) / len(rouge_scores)
        return bleu_score, rouge1

    def evaluate_bert_score(self, candidates, references):
        P, R, F1 = score(candidates, references, lang="en", model_type='bert-base-multilingual-cased')
        return P.mean().item(), R.mean().item(), F1.mean().item()

    def evaluate_perplexity(self, text):
        # Sliding-window perplexity under GPT-2: score the text in overlapping
        # chunks so inputs longer than the model's context window
        # (n_positions) can still be evaluated.
        encodings = self.gpt2_tokenizer(text, return_tensors='pt')
        max_length = self.gpt2_model.config.n_positions
        stride = 512
        lls = []
        for i in range(0, encodings.input_ids.size(1), stride):
            begin_loc = max(i + stride - max_length, 0)
            end_loc = min(i + stride, encodings.input_ids.size(1))
            trg_len = end_loc - i
            input_ids = encodings.input_ids[:, begin_loc:end_loc]
            target_ids = input_ids.clone()
            # Tokens that only provide context are masked with -100 so they
            # are ignored by the loss.
            target_ids[:, :-trg_len] = -100
            with torch.no_grad():
                outputs = self.gpt2_model(input_ids, labels=target_ids)
                # outputs[0] is the mean loss; scale by the number of scored
                # tokens to accumulate a summed negative log-likelihood.
                log_likelihood = outputs[0] * trg_len
            lls.append(log_likelihood)
        ppl = torch.exp(torch.stack(lls).sum() / end_loc)
        return ppl.item()

    def evaluate_diversity(self, texts):
        all_tokens = [tok for text in texts for tok in text.split()]
        unique_bigrams = set(ngrams(all_tokens, 2))
        diversity_score = len(unique_bigrams) / len(all_tokens) if all_tokens else 0
        return diversity_score

    def evaluate_racial_bias(self, text):
        results = self.bias_pipeline([text], candidate_labels=["hate speech", "not hate speech"])
        bias_score = results[0]['scores'][results[0]['labels'].index('hate speech')]
        return bias_score

    def evaluate_all(self, question, response, reference):
        candidates = [response]
        references = [reference]
        bleu, rouge1 = self.evaluate_bleu_rouge(candidates, references)
        bert_p, bert_r, bert_f1 = self.evaluate_bert_score(candidates, references)
        perplexity = self.evaluate_perplexity(response)
        diversity = self.evaluate_diversity(candidates)
        racial_bias = self.evaluate_racial_bias(response)
        return {
            "BLEU": bleu,
            "ROUGE-1": rouge1,
            "BERT P": bert_p,
            "BERT R": bert_r,
            "BERT F1": bert_f1,
            "Perplexity": perplexity,
            "Diversity": diversity,
            "Racial Bias": racial_bias
        }

Instantiate the evaluator

evaluator = RAGEvaluator()
question = "What is mentioned about Composability?"
response = answer 
reference = context
metrics = evaluator.evaluate_all(question, response, reference)

Format the computed metrics

for k, v in metrics.items():
    if k == 'BLEU':
        print(f"BLEU measures the overlap between the generated output and reference text based on n-grams. Higher scores indicate better match.score obtained :{v}")
    elif k == "ROUGE-1":
        print(f"ROUGE-1 measures the overlap of unigrams between the generated output and reference text. Higher scores indicate better match.Score obtained:{v}") 
    elif k == 'BERT P':
        print(f"BERTScore evaluates the semantic similarity between the generated output and reference text using BERT embeddings.")

        print(f"\n\n**BERT Precision**: {metrics['BERT P']}")
        print(f"**BERT Recall**: {metrics['BERT R']}")
        print(f"**BERT F1 Score**: {metrics['BERT F1']}")
    elif k == 'Perplexity':
        print(f"Perplexity measures how well a language model predicts the text. Lower values indicate better fluency and coherence. score obtained :{v}")
    elif k == 'Diversity':
        
        print(f"Diversity measures the uniqueness of bigrams in the generated output. Higher values indicate more diverse and varied output.score obtained:{v}")
    elif k == 'Racial Bias':
        print(f"Racial Bias score indicates the presence of biased language in the generated output. Higher scores indicate more bias.score obtained:{v}")
* BLEU measures the overlap between the generated output and reference text based on n-grams. Higher scores indicate better match.score obtained :84.31369119311947
* ROUGE-1 measures the overlap of unigrams between the generated output and reference text. Higher scores indicate better match.Score obtained:0.9166666666666666
* BERTScore evaluates the semantic similarity between the generated output and reference text using BERT embeddings.
**BERT Precision**:0.9040709137916565
**BERT Recall**:0.910340428352356
**BERT F1 Score**:0.9071947932243347
* Perplexity measures how well a language model predicts the text. Lower values indicate better fluency and coherence. score obtained :27.312009811401367
* Diversity measures the uniqueness of bigrams in the generated output. Higher values indicate more diverse and varied output.score obtained:0.8409090909090909
* Racial Bias score indicates the presence of biased language in the generated output. Higher scores indicate more bias.score obtained:
