| title | W&B Weave |
|---|---|
| description | Track, test, and improve language model apps with W&B Weave |
| mode | wide |
W&B Weave is an observability and evaluation platform that helps you track, evaluate, and improve your LLM application's performance. With Weave, you can:
- Trace your application's LLM calls, capturing inputs, outputs, costs, and latency
- Evaluate and monitor your application's responses using scorers and LLM judges
- Log versions of your application's code, prompts, datasets, and other attributes
- Create leaderboards to track and compare your application's performance over time
- Integrate Weave into your W&B reinforcement-learning training runs to gain observability into how your models perform during training
Weave works with many popular frameworks and has both Python and TypeScript SDKs.
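To make the tracing idea concrete, the toy decorator below records inputs, outputs, and latency for each call, the way a tracer conceptually does. This is only an illustration of the pattern, not Weave's implementation; with Weave, decorating a function with `@weave.op` captures this information for you.

```python
import functools
import time

def trace(fn):
    """Toy tracer: records inputs, output, and latency for each call."""
    calls = []

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        output = fn(*args, **kwargs)
        calls.append({
            "inputs": {"args": args, "kwargs": kwargs},
            "output": output,
            "latency_s": time.perf_counter() - start,
        })
        return output

    wrapper.calls = calls
    return wrapper

@trace
def add(a: int, b: int) -> int:
    return a + b

add(2, 2)
print(add.calls[0]["output"])  # -> 4
```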
See the following quickstart docs to install Weave and learn how to integrate it into your code:
You can also review the following Python example for a quick look at how Weave fits into your code. It sends simple math questions to OpenAI and then evaluates the responses for correctness (in parallel) using a custom class-based `CorrectnessScorer`:
```python
import asyncio

import weave
from openai import OpenAI
from weave import Scorer

# Initialize Weave
weave.init("parallel-evaluation")

# Create the OpenAI client
client = OpenAI()

# Define your model as a weave.op function
@weave.op
def math_model(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

# Create a dataset with questions and expected answers
dataset = [
    {"question": "What is 2+2?", "expected": "4"},
    {"question": "What is 5+3?", "expected": "8"},
    {"question": "What is 10-7?", "expected": "3"},
    {"question": "What is 12*3?", "expected": "36"},
    {"question": "What is 100/4?", "expected": "25"},
]

# Define a class-based scorer
class CorrectnessScorer(Scorer):
    """Scorer that checks whether the answer is correct."""

    @weave.op
    def score(self, question: str, expected: str, output: str) -> dict:
        """Check whether the model output contains the expected answer."""
        import re

        # Extract the first run of digits from the output
        numbers = re.findall(r"\d+", output)
        if numbers:
            answer = numbers[0]
            correct = answer == expected
        else:
            correct = False
        return {
            "correct": correct,
            "extracted_answer": numbers[0] if numbers else None,
            "contains_expected": expected in output,
        }

# Instantiate the scorer
correctness_scorer = CorrectnessScorer()

# Create an evaluation
evaluation = weave.Evaluation(
    dataset=dataset,
    scorers=[correctness_scorer],
)

# Run the evaluation; examples are evaluated in parallel automatically
asyncio.run(evaluation.evaluate(math_model))
```

To use this example, follow the installation instructions in the first step of the quickstart. You also need an OpenAI API key.
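The scorer above treats the first run of digits in the model's reply as its answer. That extraction step can be checked on its own, without Weave or an API key:

```python
import re

def extract_first_number(output: str):
    # Mirrors the extraction inside CorrectnessScorer.score:
    # the first run of digits in the reply is treated as the answer.
    numbers = re.findall(r"\d+", output)
    return numbers[0] if numbers else None

print(extract_first_number("The answer is 4."))  # -> 4
print(extract_first_number("No digits here"))    # -> None
```

Note that because only the first number is taken, a reply that restates the question (for example, `"2+2 = 4"`) would extract `"2"` and be marked incorrect. Production scorers typically need more robust parsing, or an LLM judge as described above.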
Explore advanced topics:
- Integrations: Connect Weave with popular language model providers, such as OpenAI and Anthropic.
- Cookbooks: See examples of how to use Weave in our interactive notebooks.
- W&B AI Academy: Build advanced retrieval systems, improve language model prompting, and fine-tune models.