Evaluating RAG Applications


Retrieval Augmented Generation (RAG) is a widely used way to build generative AI applications that can draw on a custom knowledge base.

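What you'll learn:

This guide shows how to:

Build a knowledge base
Create a RAG application with a retrieval step that finds relevant documents
Track the retrieval step with Weave
Evaluate the RAG application by using an LLM judge to measure context precision
Define custom scoring functions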

Prerequisites

A W&B account
Python 3.8+ or Node.js 18+
The following packages installed:
- Python: pip install weave openai
- TypeScript: npm install weave openai
An OpenAI API key set as an environment variable

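Build a knowledge base

First, compute embeddings for the articles. Normally you would do this once per document and store the embeddings and metadata in a database. Here, for simplicity, they are recomputed every time the script runs.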

Python

from openai import OpenAI
import weave
from weave import Model
import numpy as np
import json
import asyncio

articles = [
    "Novo Nordisk and Eli Lilly rival soars 32 percent after promising weight loss drug results Shares of Denmarks Zealand Pharma shot 32 percent higher in morning trade, after results showed success in its liver disease treatment survodutide, which is also on trial as a drug to treat obesity. The trial “tells us that the 6mg dose is safe, which is the top dose used in the ongoing [Phase 3] obesity trial too,” one analyst said in a note. The results come amid feverish investor interest in drugs that can be used for weight loss.",
    "Berkshire shares jump after big profit gain as Buffetts conglomerate nears $1 trillion valuation Berkshire Hathaway shares rose on Monday after Warren Buffetts conglomerate posted strong earnings for the fourth quarter over the weekend. Berkshires Class A and B shares jumped more than 1.5%, each. Class A shares are higher by more than 17% this year, while Class B has gained more than 18%. Berkshire was last valued at $930.1 billion, up from $905.5 billion where it closed on Friday, according to FactSet. Berkshire on Saturday posted fourth-quarter operating earnings of $8.481 billion, about 28 percent higher than the $6.625 billion from the year-ago period, driven by big gains in its insurance business. Operating earnings refers to profits from businesses across insurance, railroads and utilities. Meanwhile, Berkshires cash levels also swelled to record levels. The conglomerate held $167.6 billion in cash in the fourth quarter, surpassing the $157.2 billion record the conglomerate held in the prior quarter.",
    "Highmark Health says its combining tech from Google and Epic to give doctors easier access to information Highmark Health announced it is integrating technology from Google Cloud and the health-care software company Epic Systems. The integration aims to make it easier for both payers and providers to access key information they need, even if its stored across multiple points and formats, the company said. Highmark is the parent company of a health plan with 7 million members, a provider network of 14 hospitals and other entities",
    "Rivian and Lucid shares plunge after weak EV earnings reports Shares of electric vehicle makers Rivian and Lucid fell Thursday after the companies reported stagnant production in their fourth-quarter earnings after the bell Wednesday. Rivian shares sank about 25 percent, and Lucids stock dropped around 17 percent. Rivian forecast it will make 57,000 vehicles in 2024, slightly less than the 57,232 vehicles it produced in 2023. Lucid said it expects to make 9,000 vehicles in 2024, more than the 8,428 vehicles it made in 2023.",
    "Mauritius blocks Norwegian cruise ship over fears of a potential cholera outbreak Local authorities on Sunday denied permission for the Norwegian Dawn ship, which has 2,184 passengers and 1,026 crew on board, to access the Mauritius capital of Port Louis, citing “potential health risks.” The Mauritius Ports Authority said Sunday that samples were taken from at least 15 passengers on board the cruise ship. A spokesperson for the U.S.-headquartered Norwegian Cruise Line Holdings said Sunday that 'a small number of guests experienced mild symptoms of a stomach-related illness' during Norwegian Dawns South Africa voyage.",
    "Intuitive Machines lands on the moon in historic first for a U.S. company Intuitive Machines Nova-C cargo lander, named Odysseus after the mythological Greek hero, is the first U.S. spacecraft to soft land on the lunar surface since 1972. Intuitive Machines is the first company to pull off a moon landing — government agencies have carried out all previously successful missions. The company's stock surged in extended trading Thursday, after falling 11 percent in regular trading.",
    "Lunar landing photos: Intuitive Machines Odysseus sends back first images from the moon Intuitive Machines cargo moon lander Odysseus returned its first images from the surface. Company executives believe the lander caught its landing gear sideways on the moon's surface while touching down and tipped over. Despite resting on its side, the company's historic IM-1 mission is still operating on the moon.",
]

def docs_to_embeddings(docs: list) -> list:
    openai = OpenAI()
    document_embeddings = []
    for doc in docs:
        response = (
            openai.embeddings.create(input=doc, model="text-embedding-3-small")
            .data[0]
            .embedding
        )
        document_embeddings.append(response)
    return document_embeddings

article_embeddings = docs_to_embeddings(articles)  # Note: normally you would do this once for your articles and store the embeddings and metadata in a database

TypeScript

require('dotenv').config();
import { OpenAI } from 'openai';
import * as weave from 'weave';

interface Article {
    text: string;
    embedding?: number[];
}

const articles: Article[] = [
    { 
        text: `Novo Nordisk and Eli Lilly rival soars 32 percent after promising weight loss drug results Shares of Denmarks Zealand Pharma shot 32 percent higher in morning trade, after results showed success in its liver disease treatment survodutide, which is also on trial as a drug to treat obesity. The trial tells us that the 6mg dose is safe, which is the top dose used in the ongoing [Phase 3] obesity trial too, one analyst said in a note. The results come amid feverish investor interest in drugs that can be used for weight loss.`
    },
    { 
        text: `Berkshire shares jump after big profit gain as Buffetts conglomerate nears $1 trillion valuation Berkshire Hathaway shares rose on Monday after Warren Buffetts conglomerate posted strong earnings for the fourth quarter over the weekend. Berkshires Class A and B shares jumped more than 1.5%, each. Class A shares are higher by more than 17% this year, while Class B has gained more than 18%. Berkshire was last valued at $930.1 billion, up from $905.5 billion where it closed on Friday, according to FactSet. Berkshire on Saturday posted fourth-quarter operating earnings of $8.481 billion, about 28 percent higher than the $6.625 billion from the year-ago period, driven by big gains in its insurance business. Operating earnings refers to profits from businesses across insurance, railroads and utilities. Meanwhile, Berkshires cash levels also swelled to record levels. The conglomerate held $167.6 billion in cash in the fourth quarter, surpassing the $157.2 billion record the conglomerate held in the prior quarter.`
    },
    { 
        text: `Highmark Health says its combining tech from Google and Epic to give doctors easier access to information Highmark Health announced it is integrating technology from Google Cloud and the health-care software company Epic Systems. The integration aims to make it easier for both payers and providers to access key information they need, even if its stored across multiple points and formats, the company said. Highmark is the parent company of a health plan with 7 million members, a provider network of 14 hospitals and other entities`
    }
];

function cosineSimilarity(a: number[], b: number[]): number {
    const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
    const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
    const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
    return dotProduct / (magnitudeA * magnitudeB);
}

const docsToEmbeddings = weave.op(async function(docs: Article[]): Promise<Article[]> {
    const openai = new OpenAI();
    const enrichedDocs = await Promise.all(docs.map(async (doc) => {
        const response = await openai.embeddings.create({
            input: doc.text,
            model: "text-embedding-3-small"
        });
        return {
            ...doc,
            embedding: response.data[0].embedding
        };
    }));
    return enrichedDocs;
});
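
The note above about computing embeddings once and storing them can be sketched as a small on-disk cache. This is a minimal illustration, not part of Weave or the OpenAI SDK: the helper name cached_embeddings, the JSON cache file, and the embed_fn parameter are all assumptions made for the example. In the tutorial, embed_fn would wrap the openai.embeddings.create call shown earlier.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, List


def cached_embeddings(
    docs: List[str],
    embed_fn: Callable[[str], List[float]],
    cache_path: str,
) -> List[List[float]]:
    """Return one embedding per document, calling embed_fn only for unseen text.

    Hypothetical helper: the cache is a JSON file that maps the SHA-256 hash
    of each document's text to its embedding.
    """
    path = Path(cache_path)
    cache: Dict[str, List[float]] = (
        json.loads(path.read_text()) if path.exists() else {}
    )

    embeddings = []
    for doc in docs:
        key = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if key not in cache:
            # Only documents missing from the cache trigger an embedding call.
            cache[key] = embed_fn(doc)
        embeddings.append(cache[key])

    # Persist the cache so later runs can skip documents already embedded.
    path.write_text(json.dumps(cache))
    return embeddings
```

A database or vector store would normally replace the JSON file; the point is only that unchanged articles do not need to be embedded again on every run.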

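Create a RAG app

Next, wrap the retrieval function get_most_relevant_document with the weave.op() decorator and create a Model class. Calling weave.init('<team-name>/rag-quickstart') starts tracking all inputs and outputs of these functions so you can inspect them later. If you don't specify a team name, the output is logged to your W&B default team or entity.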

Python

from openai import OpenAI
import weave
from weave import Model
import numpy as np
import asyncio

@weave.op()
def get_most_relevant_document(query):
    openai = OpenAI()
    query_embedding = (
        openai.embeddings.create(input=query, model="text-embedding-3-small")
        .data[0]
        .embedding
    )
    similarities = [
        np.dot(query_embedding, doc_emb)
        / (np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb))
        for doc_emb in article_embeddings
    ]
    # Get the index of the most similar document
    most_relevant_doc_index = np.argmax(similarities)
    return articles[most_relevant_doc_index]

class RAGModel(Model):
    system_message: str
    model_name: str = "gpt-3.5-turbo-1106"

    @weave.op()
    def predict(self, question: str) -> dict:  # Note: `question` is used later to select data from the evaluation rows
        from openai import OpenAI
        context = get_most_relevant_document(question)
        client = OpenAI()
        query = f"""Use the following information to answer the subsequent question. If the answer cannot be found, write "I don't know."
        Context:
        \"\"\"
        {context}
        \"\"\"
        Question: {question}"""
        response = client.chat.completions.create(
            model=self.model_name,
            messages=[
                {"role": "system", "content": self.system_message},
                {"role": "user", "content": query},
            ],
            temperature=0.0,
            response_format={"type": "text"},
        )
        answer = response.choices[0].message.content
        return {'answer': answer, 'context': context}

# Set your team and project name
weave.init('<team-name>/rag-quickstart')
model = RAGModel(
    system_message="You are an expert in finance and answer questions related to finance, financial services, and financial markets. When responding based on provided information, be sure to cite the source."
)
model.predict("What significant result was reported about Zealand Pharma's obesity trial?")

TypeScript

class RAGModel {
    private openai: OpenAI;
    private systemMessage: string;
    private modelName: string;
    private articleEmbeddings: Article[];

    constructor(config: {
        systemMessage: string;
        modelName?: string;
        articleEmbeddings: Article[];
    }) {
        this.openai = new OpenAI();
        this.systemMessage = config.systemMessage;
        this.modelName = config.modelName || "gpt-3.5-turbo-1106";
        this.articleEmbeddings = config.articleEmbeddings;
        this.predict = weave.op(this, this.predict);
    }

    async predict(question: string): Promise<{
        answer: string;
        context: string;
    }> {
        const context = await this.getMostRelevantDocument(question);
        
        const response = await this.openai.chat.completions.create({
            model: this.modelName,
            messages: [
                { role: "system", content: this.systemMessage },
                { role: "user", content: `Use the following information to answer the subsequent question. If the answer cannot be found, write "I don't know."
                    Context:
                    """
                    ${context}
                    """
                    Question: ${question}` }
            ],
            temperature: 0
        });

        return {
            answer: response.choices[0].message.content || "",
            context
        };
    }
}
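
The retrieval step above ranks articles by cosine similarity between the query embedding and each article embedding. The sketch below shows the same ranking on toy vectors, with no API calls and only the standard library. The function names cosine_similarity and most_relevant are illustrative assumptions, not Weave or OpenAI APIs.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_relevant(
    query_vec: Sequence[float],
    doc_vecs: List[Sequence[float]],
    docs: List[str],
) -> str:
    """Return the document whose embedding is most similar to the query."""
    scores = [cosine_similarity(query_vec, vec) for vec in doc_vecs]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return docs[best_index]


# Toy two-dimensional "embeddings" standing in for real model output.
toy_docs = ["doc about x", "doc about y", "doc about both"]
toy_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(most_relevant([0.1, 0.9], toy_vecs, toy_docs))
```

Because cosine similarity ignores vector length, a query that points mostly along the second axis selects the second document no matter how large its values are.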

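Evaluate with an LLM judge

When there is no simple way to evaluate an application, you can use an LLM to assess aspects of it. The example below uses an LLM judge to try to measure context precision: the judge is prompted to verify whether the retrieved context was useful in arriving at the given answer. The prompt is adapted from the popular RAGAS framework.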
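
As a rough sketch of what such a judge produces, the snippet below parses raw JSON verdicts and averages them into a single number. It assumes the judge returns a JSON object with a verdict of 1 or 0, as the prompt in this guide requests. The helper names parse_verdict and context_precision are hypothetical, and the plain average is one simple way to read the results, not the RAGAS definition of the metric.

```python
import json
from typing import List


def parse_verdict(raw: str) -> bool:
    """Return True only when the judge's JSON contains verdict 1 (or "1").

    Malformed JSON and missing keys count as "not useful" instead of raising.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    return str(payload.get("verdict", "")).strip() == "1"


def context_precision(raw_verdicts: List[str]) -> float:
    """Fraction of judged examples whose context was marked useful."""
    verdicts = [parse_verdict(raw) for raw in raw_verdicts]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Made-up judge responses, used only to exercise the parsing.
sample = ['{"verdict": "1"}', '{"verdict": 0}', "not json", '{"verdict": 1}']
print(context_precision(sample))
```

Treating unparseable output as a failed verdict keeps one bad judge response from aborting an evaluation run, at the cost of slightly understating the score.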

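Define a scoring function

As in the Build an Evaluation pipeline tutorial, define a set of example rows to test the app against and a scoring function. The scoring function takes one row and evaluates it. Its argument names must match keys in the row, so question here is taken from the row dictionary, and output is the model's output. The model's inputs are selected from the example in the same way, based on the argument names of its predict function, so question is used there too. This example uses async functions so that calls can run in parallel. If async is new to you, a short introduction to Python's asyncio is worth reading first.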

Python

from openai import OpenAI
import weave
import asyncio
import json  # needed for json.loads in the scorer below

@weave.op()
async def context_precision_score(question, output):
    context_precision_prompt = """Given question, answer and context verify if the context was useful in arriving at the given answer. Give verdict as "1" if useful and "0" if not with json output.
    Output in only valid JSON format.

    question: {question}
    context: {context}
    answer: {answer}
    verdict: """
    client = OpenAI()

    prompt = context_precision_prompt.format(
        question=question,
        context=output['context'],
        answer=output['answer'],
    )

    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{"role": "user", "content": prompt}],
        response_format={ "type": "json_object" }
    )
    response_message = response.choices[0].message
    response = json.loads(response_message.content)
    return {
        "verdict": int(response["verdict"]) == 1,
    }

questions = [
    {"question": "What significant result was reported about Zealand Pharma's obesity trial?"},
    {"question": "How much did Berkshire Hathaway's cash levels increase in the fourth quarter?"},
    {"question": "What is the goal of Highmark Health's integration of Google Cloud and Epic Systems technology?"},
    {"question": "What were Rivian and Lucid's vehicle production forecasts for 2024?"},
    {"question": "Why was the Norwegian Dawn cruise ship denied access to Mauritius?"},
    {"question": "Which company achieved the first U.S. moon landing since 1972?"},
    {"question": "What issue did Intuitive Machines' lunar lander encounter upon landing on the moon?"}
]
evaluation = weave.Evaluation(dataset=questions, scorers=[context_precision_score])
asyncio.run(evaluation.evaluate(model))  # Note: you need to define the model to evaluate first

TypeScript

const contextPrecisionScore = weave.op(async function(args: {
    datasetRow: QuestionRow;
    modelOutput: { answer: string; context: string; }
}): Promise<ScorerResult> {
    const openai = new OpenAI();
    
    const prompt = `Given question, answer and context verify if the context was useful...`;

    const response = await openai.chat.completions.create({
        model: "gpt-4-turbo-preview",
        messages: [{ role: "user", content: prompt }],
        response_format: { type: "json_object" }
    });

    const result = JSON.parse(response.choices[0].message.content || "{}");
    return {
        verdict: parseInt(result.verdict) === 1
    };
});

const evaluation = new weave.Evaluation({
    dataset: createQuestionDataset(),
    scorers: [contextPrecisionScore]
});

await evaluation.evaluate({
    model: weave.op((args: { datasetRow: QuestionRow }) => 
        model.predict(args.datasetRow.question)
    )
});
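
The argument-matching rule described in this section (scorer argument names select keys from the dataset row, and output receives the model's result) can be illustrated without Weave. The sketch below is a simplified stand-in written for this guide, not Weave's actual implementation. The helper name call_scorer is hypothetical.

```python
import inspect
from typing import Any, Callable, Dict


def call_scorer(scorer: Callable[..., Any], row: Dict[str, Any], output: Any) -> Any:
    """Call scorer with the row values whose keys match its argument names.

    The special argument name `output` receives the model output. Any other
    argument name must exist as a key in the row, otherwise KeyError is raised.
    """
    params = inspect.signature(scorer).parameters
    kwargs = {name: row[name] for name in params if name != "output"}
    if "output" in params:
        kwargs["output"] = output
    return scorer(**kwargs)


def answer_mentions_question_word(question: str, output: dict) -> dict:
    """Toy scorer: does the answer contain the first word of the question?"""
    first_word = question.split()[0].lower()
    return {"mentions": first_word in output["answer"].lower()}


row = {"question": "Rivian production forecast?", "unused_column": 42}
result = call_scorer(
    answer_mentions_question_word, row, {"answer": "Rivian expects 57,000 vehicles."}
)
print(result)
```

Columns the scorer does not name, such as unused_column here, are simply ignored, which is why the dataset rows and the function signatures have to agree on key names.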

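Optional: Define a `Scorer` class

In some applications you may want to create custom evaluation classes. For example, you might want a standardized LLMJudge class with specific parameters (such as the chat model and prompt), a specific way of scoring each row, and a specific way of calculating an aggregate score. Weave ships a list of ready-to-use Scorer classes and also makes it easy to create a custom Scorer. The following example shows how to create a custom class CorrectnessLLMJudge(Scorer). At a high level, the steps are fairly simple:

Define a custom class that inherits from weave.flow.scorer.Scorer.
Override the score function and add @weave.op() if you want to track each call of the function.
- This function must define an output argument, which receives the model's prediction. Type it as Optional[dict] if the model might return None.
- The remaining arguments can be typed as a general Any or dict, or they can select specific columns from the dataset used to evaluate the model with the weave.Evaluation class. In that case they must have exactly the same names as the column names or keys of a single row, after preprocess_model_input has been applied if it is used.
Optional: Override the summarize function to customize how the aggregate score is calculated. By default, Weave uses the weave.flow.scorer.auto_summarize function if you don't define a custom one.
- This function must have a @weave.op() decorator.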

Python

from typing import Any, Optional

import numpy as np
import weave
from weave import Scorer

class CorrectnessLLMJudge(Scorer):
    prompt: str
    model_name: str
    device: str

    @weave.op()
    async def score(self, output: Optional[dict], query: str, answer: str) -> Any:
        """Score the correctness of a prediction by comparing the pred, query, and target.
        Args:
            - output: the dict provided by the model being evaluated
            - query: the question as defined in the dataset
            - answer: the target answer as defined in the dataset
        Returns:
            - a single dict {metric name: single evaluation value}"""

        # get_model is a general helper that returns a model based on the provided parameters (OpenAI, HF, ...)
        eval_model = get_model(
            model_name = self.model_name,
            prompt = self.prompt,
            device = self.device,
        )
        # Async evaluation speeds things up; it does not have to be async
        grade = await eval_model.async_predict(
            {
                "query": query,
                "answer": answer,
                # output is Optional, so guard against a model that returned None
                "result": output.get("result") if output else None,
            }
        )
        # Output parsing; this could be made more robust with pydantic
        evaluation = "incorrect" not in grade["text"].strip().lower()

        # The column name displayed in Weave
        return {"correct": evaluation}

    @weave.op()
    def summarize(self, score_rows: list) -> Optional[dict]:
        """Aggregate the scores that the scoring function calculated for each row.
        Args:
            - score_rows: a list of dicts; each dict holds metrics and scores
        Returns:
            - a nested dict with the same structure as the input"""

        # If nothing is provided, the weave.flow.scorer.auto_summarize function is used
        # return auto_summarize(score_rows)

        valid_data = [x.get("correct") for x in score_rows if x.get("correct") is not None]
        count_true = list(valid_data).count(True)
        int_data = [int(x) for x in valid_data]

        sample_mean = np.mean(int_data) if int_data else 0
        sample_variance = np.var(int_data) if int_data else 0
        sample_error = np.sqrt(sample_variance / len(int_data)) if int_data else 0

        # The extra "correct" layer is not required but adds structure in the UI
        return {
            "correct": {
                "true_count": count_true,
                "true_fraction": sample_mean,
                "stderr": sample_error,
            }
        }

TypeScript

This feature is not yet available in TypeScript.

To use this as a scorer, initialize it and pass it in the scorers argument of your Evaluation:

Python

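# Placeholder values: prompt, model_name, and device are required fields on the class
evaluation = weave.Evaluation(dataset=questions, scorers=[CorrectnessLLMJudge(prompt="<judge-prompt>", model_name="<judge-model>", device="<device>")])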

TypeScript

This feature is not yet available in TypeScript.
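
The arithmetic inside the summarize method shown earlier in this section can be checked by hand. The sketch below reproduces it with the standard library only. The function name summarize_correct is an assumption made for this example, and it relies on the fact that the population variance of 0/1 values with mean p is p * (1 - p), which is what np.var returns by default.

```python
import math
from typing import Dict, List, Optional


def summarize_correct(score_rows: List[Dict[str, Optional[bool]]]) -> dict:
    """Aggregate per-row {"correct": bool} scores into count, fraction, and stderr."""
    # Rows with a missing or None score are left out of the statistics.
    valid = [row["correct"] for row in score_rows if row.get("correct") is not None]
    n = len(valid)
    if n == 0:
        return {"correct": {"true_count": 0, "true_fraction": 0.0, "stderr": 0.0}}

    true_count = sum(1 for value in valid if value)
    fraction = true_count / n
    variance = fraction * (1 - fraction)  # population variance of 0/1 data
    return {
        "correct": {
            "true_count": true_count,
            "true_fraction": fraction,
            "stderr": math.sqrt(variance / n),
        }
    }


rows = [{"correct": True}, {"correct": True}, {"correct": False}, {"correct": None}, {}]
summary = summarize_correct(rows)
print(summary)
```

With two correct answers out of three valid rows, the fraction is 2/3 and the standard error is sqrt((2/9) / 3), roughly 0.27, which shows how wide the uncertainty is on a dataset this small.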

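Pulling it all together

To get the same result for your own RAG app:

Wrap the LLM calls and the retrieval step functions with weave.op()
(Optional) Create a Model subclass with a predict function and the app details
Collect examples to evaluate
Create scoring functions that score one example
Use the Evaluation class to run the evaluation on your examples

Note: When evaluations run asynchronously, you may hit rate limits on models from OpenAI, Anthropic, and others. To avoid this, you can limit the number of concurrent workers by setting an environment variable, for example WEAVE_PARALLELISM=3.

Here is the complete code.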

Python

from openai import OpenAI
import weave
from weave import Model
import numpy as np
import json
import asyncio

# Example articles used for the evaluation
articles = [
    "Novo Nordisk and Eli Lilly rival soars 32 percent after promising weight loss drug results Shares of Denmarks Zealand Pharma shot 32 percent higher in morning trade, after results showed success in its liver disease treatment survodutide, which is also on trial as a drug to treat obesity. The trial “tells us that the 6mg dose is safe, which is the top dose used in the ongoing [Phase 3] obesity trial too,” one analyst said in a note. The results come amid feverish investor interest in drugs that can be used for weight loss.",
    "Berkshire shares jump after big profit gain as Buffetts conglomerate nears $1 trillion valuation Berkshire Hathaway shares rose on Monday after Warren Buffetts conglomerate posted strong earnings for the fourth quarter over the weekend. Berkshires Class A and B shares jumped more than 1.5%, each. Class A shares are higher by more than 17% this year, while Class B has gained more than 18%. Berkshire was last valued at $930.1 billion, up from $905.5 billion where it closed on Friday, according to FactSet. Berkshire on Saturday posted fourth-quarter operating earnings of $8.481 billion, about 28 percent higher than the $6.625 billion from the year-ago period, driven by big gains in its insurance business. Operating earnings refers to profits from businesses across insurance, railroads and utilities. Meanwhile, Berkshires cash levels also swelled to record levels. The conglomerate held $167.6 billion in cash in the fourth quarter, surpassing the $157.2 billion record the conglomerate held in the prior quarter.",
    "Highmark Health says its combining tech from Google and Epic to give doctors easier access to information Highmark Health announced it is integrating technology from Google Cloud and the health-care software company Epic Systems. The integration aims to make it easier for both payers and providers to access key information they need, even if it's stored across multiple points and formats, the company said. Highmark is the parent company of a health plan with 7 million members, a provider network of 14 hospitals and other entities",
    "Rivian and Lucid shares plunge after weak EV earnings reports Shares of electric vehicle makers Rivian and Lucid fell Thursday after the companies reported stagnant production in their fourth-quarter earnings after the bell Wednesday. Rivian shares sank about 25 percent, and Lucids stock dropped around 17 percent. Rivian forecast it will make 57,000 vehicles in 2024, slightly less than the 57,232 vehicles it produced in 2023. Lucid said it expects to make 9,000 vehicles in 2024, more than the 8,428 vehicles it made in 2023.",
    "Mauritius blocks Norwegian cruise ship over fears of a potential cholera outbreak Local authorities on Sunday denied permission for the Norwegian Dawn ship, which has 2,184 passengers and 1,026 crew on board, to access the Mauritius capital of Port Louis, citing “potential health risks.” The Mauritius Ports Authority said Sunday that samples were taken from at least 15 passengers on board the cruise ship. A spokesperson for the U.S.-headquartered Norwegian Cruise Line Holdings said Sunday that 'a small number of guests experienced mild symptoms of a stomach-related illness' during Norwegian Dawns South Africa voyage.",
    "Intuitive Machines lands on the moon in historic first for a U.S. company Intuitive Machines Nova-C cargo lander, named Odysseus after the mythological Greek hero, is the first U.S. spacecraft to soft land on the lunar surface since 1972. Intuitive Machines is the first company to pull off a moon landing — government agencies have carried out all previously successful missions. The company's stock surged in extended trading Thursday, after falling 11 percent in regular trading.",
    "Lunar landing photos: Intuitive Machines Odysseus sends back first images from the moon Intuitive Machines cargo moon lander Odysseus returned its first images from the surface. Company executives believe the lander caught its landing gear sideways on the surface of the moon while touching down and tipped over. Despite resting on its side, the company's historic IM-1 mission is still operating on the moon.",
]

def docs_to_embeddings(docs: list) -> list:
    openai = OpenAI()
    document_embeddings = []
    for doc in docs:
        response = (
            openai.embeddings.create(input=doc, model="text-embedding-3-small")
            .data[0]
            .embedding
        )
        document_embeddings.append(response)
    return document_embeddings

article_embeddings = docs_to_embeddings(articles)  # Note: normally you would do this once for your articles and store the embeddings and metadata in a database

# Add the decorator to the retrieval step
@weave.op()
def get_most_relevant_document(query):
    openai = OpenAI()
    query_embedding = (
        openai.embeddings.create(input=query, model="text-embedding-3-small")
        .data[0]
        .embedding
    )
    similarities = [
        np.dot(query_embedding, doc_emb)
        / (np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb))
        for doc_emb in article_embeddings
    ]
    # Get the index of the most similar document
    most_relevant_doc_index = np.argmax(similarities)
    return articles[most_relevant_doc_index]

# Create a Model subclass with details about the app and a predict function that produces a response
class RAGModel(Model):
    system_message: str
    model_name: str = "gpt-3.5-turbo-1106"

    @weave.op()
    def predict(self, question: str) -> dict:  # Note: `question` is used later to select data from the evaluation rows
        from openai import OpenAI
        context = get_most_relevant_document(question)
        client = OpenAI()
        query = f"""Use the following information to answer the subsequent question. If the answer cannot be found, write "I don't know."
        Context:
        \"\"\"
        {context}
        \"\"\"
        Question: {question}"""
        response = client.chat.completions.create(
            model=self.model_name,
            messages=[
                {"role": "system", "content": self.system_message},
                {"role": "user", "content": query},
            ],
            temperature=0.0,
            response_format={"type": "text"},
        )
        answer = response.choices[0].message.content
        return {'answer': answer, 'context': context}

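# Set your team and project name
weave.init('<team-name>/rag-quickstart')
model = RAGModel(
    system_message="You are an expert in finance and answer questions related to finance, financial services, and financial markets. When responding based on provided information, be sure to cite the source."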
)

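# Scoring function that uses the question and the model output to produce a score
@weave.op()
async def context_precision_score(question, output):
    context_precision_prompt = """Given question, answer and context verify if the context was useful in arriving at the given answer. Give verdict as "1" if useful and "0" if not with json output.
    Output in only valid JSON format.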

    question: {question}
    context: {context}
    answer: {answer}
    verdict: """
    client = OpenAI()

    prompt = context_precision_prompt.format(
        question=question,
        context=output['context'],
        answer=output['answer'],
    )

    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{"role": "user", "content": prompt}],
        response_format={ "type": "json_object" }
    )
    response_message = response.choices[0].message
    response = json.loads(response_message.content)
    return {
        "verdict": int(response["verdict"]) == 1,
    }

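questions = [
    {"question": "What significant result was reported about Zealand Pharma's obesity trial?"},
    {"question": "How much did Berkshire Hathaway's cash levels increase in the fourth quarter?"},
    {"question": "What is the goal of Highmark Health's integration of Google Cloud and Epic Systems technology?"},
    {"question": "What were Rivian and Lucid's vehicle production forecasts for 2024?"},
    {"question": "Why was the Norwegian Dawn cruise ship denied access to Mauritius?"},
    {"question": "Which company achieved the first U.S. moon landing since 1972?"},
    {"question": "What issue did Intuitive Machines' lunar lander encounter upon landing on the moon?"}
]

# Define the Evaluation object and pass in the example questions and scoring functions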
evaluation = weave.Evaluation(dataset=questions, scorers=[context_precision_score])
asyncio.run(evaluation.evaluate(model))

TypeScript

require('dotenv').config();
import { OpenAI } from 'openai';
import * as weave from 'weave';

interface Article {
    text: string;
    embedding?: number[];
}

const articles: Article[] = [
    { 
        text: `Novo Nordisk and Eli Lilly rival soars 32 percent after promising weight loss drug results Shares of Denmarks Zealand Pharma shot 32 percent higher in morning trade, after results showed success in its liver disease treatment survodutide, which is also on trial as a drug to treat obesity. The trial tells us that the 6mg dose is safe, which is the top dose used in the ongoing [Phase 3] obesity trial too, one analyst said in a note. The results come amid feverish investor interest in drugs that can be used for weight loss.`
    },
    { 
        text: `Berkshire shares jump after big profit gain as Buffetts conglomerate nears $1 trillion valuation Berkshire Hathaway shares rose on Monday after Warren Buffetts conglomerate posted strong earnings for the fourth quarter over the weekend. Berkshires Class A and B shares jumped more than 1.5%, each. Class A shares are higher by more than 17% this year, while Class B has gained more than 18%. Berkshire was last valued at $930.1 billion, up from $905.5 billion where it closed on Friday, according to FactSet. Berkshire on Saturday posted fourth-quarter operating earnings of $8.481 billion, about 28 percent higher than the $6.625 billion from the year-ago period, driven by big gains in its insurance business. Operating earnings refers to profits from businesses across insurance, railroads and utilities. Meanwhile, Berkshires cash levels also swelled to record levels. The conglomerate held $167.6 billion in cash in the fourth quarter, surpassing the $157.2 billion record the conglomerate held in the prior quarter.`
    },
    { 
        text: `Highmark Health says its combining tech from Google and Epic to give doctors easier access to information Highmark Health announced it is integrating technology from Google Cloud and the health-care software company Epic Systems. The integration aims to make it easier for both payers and providers to access key information they need, even if its stored across multiple points and formats, the company said. Highmark is the parent company of a health plan with 7 million members, a provider network of 14 hospitals and other entities`
    }
];

function cosineSimilarity(a: number[], b: number[]): number {
    const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
    const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
    const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
    return dotProduct / (magnitudeA * magnitudeB);
}

const docsToEmbeddings = weave.op(async function(docs: Article[]): Promise<Article[]> {
    const openai = new OpenAI();
    const enrichedDocs = await Promise.all(docs.map(async (doc) => {
        const response = await openai.embeddings.create({
            input: doc.text,
            model: "text-embedding-3-small"
        });
        return {
            ...doc,
            embedding: response.data[0].embedding
        };
    }));
    return enrichedDocs;
});

class RAGModel {
    private openai: OpenAI;
    private systemMessage: string;
    private modelName: string;
    private articleEmbeddings: Article[];

    constructor(config: {
        systemMessage: string;
        modelName?: string;
        articleEmbeddings: Article[];
    }) {
        this.openai = new OpenAI();
        this.systemMessage = config.systemMessage;
        this.modelName = config.modelName || "gpt-3.5-turbo-1106";
        this.articleEmbeddings = config.articleEmbeddings;
        this.predict = weave.op(this, this.predict);
    }

    private async getMostRelevantDocument(query: string): Promise<string> {
        const queryEmbedding = await this.openai.embeddings.create({
            input: query,
            model: "text-embedding-3-small"
        });

        const similarities = this.articleEmbeddings.map(doc => {
            if (!doc.embedding) return 0;
            return cosineSimilarity(queryEmbedding.data[0].embedding, doc.embedding);
        });

        const mostRelevantIndex = similarities.indexOf(Math.max(...similarities));
        return this.articleEmbeddings[mostRelevantIndex].text;
    }

    async predict(question: string): Promise<{
        answer: string;
        context: string;
    }> {
        const context = await this.getMostRelevantDocument(question);
        
        const response = await this.openai.chat.completions.create({
            model: this.modelName,
            messages: [
                { role: "system", content: this.systemMessage },
                { 
                    role: "user", 
                    content: `Use the following information to answer the subsequent question. If the answer cannot be found, write "I don't know."
                    Context:
                    """
                    ${context}
                    """
                    Question: ${question}`
                }
            ],
            temperature: 0
        });

        return {
            answer: response.choices[0].message.content || "",
            context
        };
    }
}

interface ScorerResult {
    verdict: boolean;
}

interface QuestionRow {
    question: string;
}

function createQuestionDataset(): weave.Dataset<QuestionRow> {
    return new weave.Dataset<QuestionRow>({
        id: 'rag-questions',
        rows: [
            { question: "What significant result was reported about Zealand Pharma's obesity trial?" },
            { question: "How much did Berkshire Hathaway's cash levels increase in the fourth quarter?" },
            { question: "What is the goal of Highmark Health's integration of Google Cloud and Epic Systems technology?" }
        ]
    });
}

const contextPrecisionScore = weave.op(async function(args: {
    datasetRow: QuestionRow;
    modelOutput: { answer: string; context: string; }
}): Promise<ScorerResult> {
    const openai = new OpenAI();
    
    const prompt = `Given question, answer and context verify if the context was useful in arriving at the given answer. Give verdict as "1" if useful and "0" if not with json output.
    Output in only valid JSON format.

    question: ${args.datasetRow.question}
    context: ${args.modelOutput.context}
    answer: ${args.modelOutput.answer}
    verdict: `;

    const response = await openai.chat.completions.create({
        model: "gpt-4-turbo-preview",
        messages: [{ role: "user", content: prompt }],
        response_format: { type: "json_object" }
    });

    const result = JSON.parse(response.choices[0].message.content || "{}");
    return {
        verdict: parseInt(result.verdict) === 1
    };
});

async function main() {
    // Set your team and project name
    await weave.init('<team-name>/rag-quickstart');
    
    const articleEmbeddings = await docsToEmbeddings(articles);
    
    const model = new RAGModel({
        systemMessage: "You are an expert in finance and answer questions related to finance, financial services, and financial markets. When responding based on provided information, be sure to cite the source.",
        articleEmbeddings
    });

    const evaluation = new weave.Evaluation({
        dataset: createQuestionDataset(),
        scorers: [contextPrecisionScore]
    });

    const results = await evaluation.evaluate({
        model: weave.op((args: { datasetRow: QuestionRow }) => 
            model.predict(args.datasetRow.question)
        )
    });
    
    console.log('Evaluation results:', results);
}

if (require.main === module) {
    main().catch(console.error);
}
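
The rate-limit note in this section mentions WEAVE_PARALLELISM. The sketch below sets that variable from Python and, separately, illustrates what capping concurrent workers means using asyncio.Semaphore. The run_limited helper is a hypothetical illustration of the idea and is not how Weave schedules its workers. Setting the variable before the evaluation starts is an assumption about when it is read.

```python
import asyncio
import os
from typing import Awaitable, Callable, List

# Value taken from the note above; setdefault keeps any value already exported.
os.environ.setdefault("WEAVE_PARALLELISM", "3")


async def run_limited(jobs: List[Callable[[], Awaitable]], limit: int) -> list:
    """Run coroutine factories with at most `limit` of them in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job: Callable[[], Awaitable]):
        # Each job waits here until one of the `limit` slots is free.
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


# Stand-in for an API call: tracks how many calls overlap.
in_flight = {"now": 0, "peak": 0}


async def fake_api_call() -> int:
    in_flight["now"] += 1
    in_flight["peak"] = max(in_flight["peak"], in_flight["now"])
    await asyncio.sleep(0.01)
    in_flight["now"] -= 1
    return 1


results = asyncio.run(run_limited([fake_api_call] * 8, limit=3))
print(len(results), in_flight["peak"])
```

A lower limit trades evaluation speed for fewer simultaneous requests, which is usually the simplest way to stay under a provider's rate limit.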

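Conclusion

This tutorial covered how to build observability into the different steps of an application, such as the retrieval step in this example. It also showed how to build more complex scoring functions, such as an LLM judge, to automatically evaluate application responses.

Next steps

For a deeper dive into practical RAG techniques for engineers, check out the RAG++ course. It covers production-ready solutions from Weights & Biases, Cohere, and Weaviate for optimizing performance, cutting costs, and improving the accuracy and relevance of your applications.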
