| title | W&B Weave |
|---|---|
| description | Track, test, and improve language model apps with W&B Weave |
| mode | wide |
W&B Weave is an observability and evaluation platform that helps you track, evaluate, and improve your LLM application's performance. With Weave, you can:
- Trace your application's LLM calls, capturing inputs, outputs, costs, and latency
- Evaluate and monitor your application's responses using scorers and LLM judges
- Log versions of your application's code, prompts, datasets, and other attributes
- Create leaderboards to track and compare your application's performance over time
- Integrate Weave into your W&B reinforcement-learning training runs to gain observability into how your models perform during training
Weave works with many popular frameworks and has both Python and TypeScript SDKs.
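To make the tracing idea concrete, the toy decorator below records inputs, outputs, and latency for each call, the way a tracer conceptually does. This is only an illustration of the pattern, not Weave's implementation; with Weave, decorating a function with `@weave.op` captures this information for you.

```python
import functools
import time

def trace(fn):
    """Toy tracer: records inputs, output, and latency for each call."""
    calls = []

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        output = fn(*args, **kwargs)
        calls.append({
            "inputs": {"args": args, "kwargs": kwargs},
            "output": output,
            "latency_s": time.perf_counter() - start,
        })
        return output

    wrapper.calls = calls
    return wrapper

@trace
def add(a: int, b: int) -> int:
    return a + b

add(2, 2)
print(add.calls[0]["output"])  # -> 4
```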
See the following quickstart docs to install Weave and learn how to integrate it into your code:
You can also review the following Python example for a quick look at how Weave fits into your code. It sends simple math questions to OpenAI and then evaluates the responses for correctness (in parallel) using a custom class-based `CorrectnessScorer`:
```python
import asyncio

import weave
from openai import OpenAI
from weave import Scorer

# Initialize Weave
weave.init("parallel-evaluation")

# Create the OpenAI client
client = OpenAI()

# Define your model as a weave.op function
@weave.op
def math_model(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

# Create a dataset with questions and expected answers
dataset = [
    {"question": "What is 2+2?", "expected": "4"},
    {"question": "What is 5+3?", "expected": "8"},
    {"question": "What is 10-7?", "expected": "3"},
    {"question": "What is 12*3?", "expected": "36"},
    {"question": "What is 100/4?", "expected": "25"},
]

# Define a class-based scorer
class CorrectnessScorer(Scorer):
    """Scorer that checks whether the answer is correct."""

    @weave.op
    def score(self, question: str, expected: str, output: str) -> dict:
        """Check whether the model output contains the expected answer."""
        import re

        # Extract the first run of digits from the output
        numbers = re.findall(r"\d+", output)
        if numbers:
            answer = numbers[0]
            correct = answer == expected
        else:
            correct = False
        return {
            "correct": correct,
            "extracted_answer": numbers[0] if numbers else None,
            "contains_expected": expected in output,
        }

# Instantiate the scorer
correctness_scorer = CorrectnessScorer()

# Create an evaluation
evaluation = weave.Evaluation(
    dataset=dataset,
    scorers=[correctness_scorer],
)

# Run the evaluation; examples are evaluated in parallel automatically
asyncio.run(evaluation.evaluate(math_model))
```

To use this example, follow the installation instructions in the first step of the quickstart. You also need an OpenAI API key.
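The scorer above treats the first run of digits in the model's reply as its answer. That extraction step can be checked on its own, without Weave or an API key:

```python
import re

def extract_first_number(output: str):
    # Mirrors the extraction inside CorrectnessScorer.score:
    # the first run of digits in the reply is treated as the answer.
    numbers = re.findall(r"\d+", output)
    return numbers[0] if numbers else None

print(extract_first_number("The answer is 4."))  # -> 4
print(extract_first_number("No digits here"))    # -> None
```

Note that because only the first number is taken, a reply that restates the question (for example, `"2+2 = 4"`) would extract `"2"` and be marked incorrect. Production scorers typically need more robust parsing, or an LLM judge as described above.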
Explore advanced topics:
- Integrations: Connect Weave with popular language model providers, such as OpenAI and Anthropic.
- Cookbooks: See examples of how to use Weave in our interactive notebooks.
- W&B AI Academy: Build advanced retrieval systems, improve language model prompting, and fine-tune models.