braintrust
A Python library for logging data to Braintrust. braintrust is distributed as a library on PyPI.
Quickstart
Install the library with pip.
pip install braintrust
Then, run a simple experiment with the following code (replace YOUR_API_KEY with your Braintrust API key):
import braintrust

experiment = braintrust.init(project="PyTest", api_key="YOUR_API_KEY")
experiment.log(
    input={"test": 1},
    output="foo",
    expected="bar",
    scores={
        "n": 0.5,
    },
    metadata={
        "id": 1,
    },
)
print(experiment.summarize())
API Reference
braintrust.logger
Span Objects
class Span(ABC)
A Span encapsulates logged data and metrics for a unit of work. This interface is shared by all span implementations.
We suggest using one of the various start_span methods instead of creating Spans directly. See Span.start_span for full details.
id
@property
@abstractmethod
def id() -> str
Row ID of the span.
span_id
@property
@abstractmethod
def span_id() -> str
Span ID of the span. This is used to link spans together.
root_span_id
@property
@abstractmethod
def root_span_id() -> str
Span ID of the root span in the full trace.
log
@abstractmethod
def log(**event)
Incrementally update the current span with new data. The event will be batched and uploaded behind the scenes.
Arguments:
**event: Data to be logged. See Experiment.log for full details.
start_span
@abstractmethod
def start_span(name,
               span_attributes={},
               start_time=None,
               set_current=None,
               **event)
Create a new span. This is useful if you want to log more detailed trace information beyond the scope of a single log event. Data logged over several calls to Span.log will be merged into one logical row.
We recommend running spans within context managers (with start_span(...) as span) to automatically mark them as current and ensure they are terminated. If you wish to start a span outside a context manager, be sure to terminate it with span.end().
Arguments:
name: The name of the span.
span_attributes: Optional additional attributes to attach to the span, such as a type name.
start_time: Optional start time of the span, as a timestamp in seconds.
set_current: If true (the default), the span will be marked as the currently-active span for the duration of the context manager. Unless the span is bound to a context manager, it will not be marked as current. Equivalent to calling with braintrust.with_current(span).
**event: Data to be logged. See Experiment.log for full details.
Returns:
The newly-created Span
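For illustration, here is a minimal sketch of nesting spans under an experiment (the project, span names, and logged values are hypothetical):
import braintrust

experiment = braintrust.init(project="PyTest")
with experiment.start_span(name="answer-question") as span:
    # Child spans are linked to the parent via span_id/root_span_id.
    with span.start_span(name="retrieve-context") as child:
        child.log(input="What is Braintrust?", output=["doc-1", "doc-2"])
    span.log(output="Braintrust is ...", scores={"accuracy": 1.0})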
end
@abstractmethod
def end(end_time=None) -> float
Terminate the span. Returns the end time logged to the row's metrics. After calling end, you may not invoke any further methods on the span object, except for the property accessors.
Will be invoked automatically if the span is bound to a context manager.
Arguments:
end_time: Optional end time of the span, as a timestamp in seconds.
Returns:
The end time logged to the span metrics.
close
@abstractmethod
def close(end_time=None) -> float
Alias for end.
init
def init(project: str,
         experiment: str = None,
         description: str = None,
         dataset: "Dataset" = None,
         update: bool = False,
         base_experiment: str = None,
         is_public: bool = False,
         api_url: str = None,
         api_key: str = None,
         org_name: str = None,
         disable_cache: bool = False,
         set_current: bool = None)
Log in, and then initialize a new experiment in a specified project. If the project does not exist, it will be created.
Remember to close your experiment when it is finished by calling Experiment.close. We recommend binding the experiment to a context manager (with braintrust.init(...) as experiment) to automatically mark it as current and ensure it is terminated.
Arguments:
project: The name of the project to create the experiment in.
experiment: The name of the experiment to create. If not specified, a name will be generated automatically.
description: An optional description of the experiment.
dataset: (Optional) A dataset to associate with the experiment. The dataset must be initialized with braintrust.init_dataset before passing it into the experiment.
update: If the experiment already exists, continue logging to it.
base_experiment: An optional experiment name to use as a base. If specified, the new experiment will be summarized and compared to this experiment. Otherwise, it will pick an experiment by finding the closest ancestor on the default (e.g. main) branch.
is_public: An optional parameter to control whether the experiment is publicly visible to anybody with the link or privately visible to only members of the organization. Defaults to private.
api_url: The URL of the Braintrust API. Defaults to https://www.braintrustdata.com.
api_key: The API key to use. If the parameter is not specified, will try to use the BRAINTRUST_API_KEY environment variable. If no API key is specified, will prompt the user to log in.
org_name: (Optional) The name of a specific organization to connect to. This is useful if you belong to multiple.
disable_cache: Do not use cached login information.
set_current: If true (default), set the currently-active experiment to the newly-created one. Unless the experiment is bound to a context manager, it will not be marked as current. Equivalent to calling with braintrust.with_current(experiment).
Returns:
The experiment object.
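For example, a minimal sketch of binding the experiment to a context manager (the project name and logged values are hypothetical):
import braintrust

with braintrust.init(project="PyTest") as experiment:
    experiment.log(input={"test": 1}, output="foo", expected="bar", scores={"accuracy": 0.5})
    print(experiment.summarize())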
init_dataset
def init_dataset(project: str,
                 name: str = None,
                 description: str = None,
                 version: "str | int" = None,
                 api_url: str = None,
                 api_key: str = None,
                 org_name: str = None,
                 disable_cache: bool = False)
Create a new dataset in a specified project. If the project does not exist, it will be created.
Remember to close your dataset when it is finished by calling Dataset.close. We recommend wrapping the dataset within a context manager (with braintrust.init_dataset(...) as dataset) to ensure it is terminated.
Arguments:
project: The name of the project to create the dataset in.
name: The name of the dataset to create. If not specified, a name will be generated automatically.
description: An optional description of the dataset.
version: An optional version of the dataset (to read). If not specified, the latest version will be used.
api_url: The URL of the Braintrust API. Defaults to https://www.braintrustdata.com.
api_key: The API key to use. If the parameter is not specified, will try to use the BRAINTRUST_API_KEY environment variable. If no API key is specified, will prompt the user to log in.
org_name: (Optional) The name of a specific organization to connect to. This is useful if you belong to multiple.
disable_cache: Do not use cached login information.
Returns:
The dataset object.
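For example, a minimal sketch of creating a dataset inside a context manager (the project and dataset names are hypothetical):
import braintrust

with braintrust.init_dataset(project="PyTest", name="my-dataset") as dataset:
    dataset.insert(input={"question": "1+1"}, output="2")
    print(dataset.summarize())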
login
def login(api_url=None,
          api_key=None,
          org_name=None,
          disable_cache=False,
          force_login=False)
Log into Braintrust. This will prompt you for your API token, which you can find at https://www.braintrustdata.com/app/token. This method is called automatically by init().
Arguments:
api_url: The URL of the Braintrust API. Defaults to https://www.braintrustdata.com.
api_key: The API key to use. If the parameter is not specified, will try to use the BRAINTRUST_API_KEY environment variable. If no API key is specified, will prompt the user to log in.
org_name: (Optional) The name of a specific organization to connect to. This is useful if you belong to multiple.
disable_cache: Do not use cached login information.
force_login: Log in again, even if you have already logged in (by default, this function will exit quickly if you have already logged in).
log
def log(**event)
Log a single event to the current experiment. The event will be batched and uploaded behind the scenes.
Arguments:
**event: Data to be logged. See Experiment.log for full details.
Returns:
The id of the logged event.
summarize
def summarize(summarize_scores=True, comparison_experiment_id=None)
Summarize the current experiment, including the scores (compared to the closest reference experiment) and metadata.
Arguments:
summarize_scores: Whether to summarize the scores. If False, only the metadata will be returned.
comparison_experiment_id: The experiment to compare against. If None, the most recent experiment on the comparison_commit will be used.
Returns:
ExperimentSummary
current_experiment
def current_experiment() -> Optional["Experiment"]
Returns the currently-active experiment (set by with braintrust.init(...) or with braintrust.with_current(experiment)). Returns None if no current experiment has been set.
current_span
def current_span() -> Span
Return the currently-active span for logging (set by with *.start_span or braintrust.with_current). If there is no active span, returns a no-op span object, which supports the same interface as spans but does no logging.
See Span for full details.
start_span
def start_span(name,
               span_attributes={},
               start_time=None,
               set_current=None,
               **event) -> Span
Toplevel function for starting a span. If there is a currently-active span, the new span is created as a subspan. Otherwise, if there is a currently-active experiment, the new span is created as a toplevel span. Otherwise, it returns a no-op span object.
Unless a name is explicitly provided, the name of the span will be the name of the calling function, or "root" if no meaningful name can be determined.
We recommend running spans bound to a context manager (with start_span) to automatically mark them as current and ensure they are terminated. If you wish to start a span outside a context manager, be sure to terminate it with span.end().
See Span.start_span for full details.
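A minimal sketch of this toplevel helper, assuming an experiment has been initialized and marked as current (the names and values are hypothetical):
import braintrust

with braintrust.init(project="PyTest") as experiment:
    # Created as a toplevel span of the currently-active experiment.
    with braintrust.start_span(name="process-request") as span:
        span.log(input={"query": "hello"}, output="world")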
with_current
def with_current(object: Union["Experiment", "SpanImpl", _NoopSpan])
Set the given experiment or span as current within the bound context manager (with braintrust.with_current(object)) and any asynchronous operations created within the block. The current experiment can be accessed with braintrust.current_experiment, and the current span with braintrust.current_span.
Arguments:
object: The experiment or span to be marked as current.
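A minimal sketch of marking an experiment as current explicitly, rather than relying on with braintrust.init(...) (the project name and logged values are hypothetical):
import braintrust

experiment = braintrust.init(project="PyTest", set_current=False)
with braintrust.with_current(experiment):
    # braintrust.log writes to the currently-active experiment.
    braintrust.log(input={"test": 1}, output="foo", scores={"accuracy": 1.0})
experiment.close()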
traced
def traced(*span_args, **span_kwargs)
Decorator to trace the wrapped function as a span. Can either be applied bare (@traced) or by providing arguments (@traced(*span_args, **span_kwargs)), which will be forwarded to the created span. See braintrust.start_span for details on how the span is created, and Span.start_span for full details on the span arguments.
Unless a name is explicitly provided in span_args or span_kwargs, the name of the span will be the name of the decorated function.
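A minimal sketch of the decorator, assuming an experiment has been initialized as current (the function, field names, and scoring logic are hypothetical):
import braintrust

@braintrust.traced
def classify(text):
    result = "positive" if "good" in text else "negative"
    # Log extra data to the span created for this function call.
    braintrust.current_span().log(input=text, output=result)
    return result

with braintrust.init(project="PyTest") as experiment:
    classify("this is a good example")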
Experiment Objects
class Experiment(ModelWrapper)
An experiment is a collection of logged events, such as model inputs and outputs, which represent a snapshot of your application at a particular point in time. An experiment is meant to capture more than just the model you use, and includes the data you use to test, pre- and post-processing code, comparison metrics (scores), and any other metadata you want to include.
Experiments are associated with a project, and two experiments are meant to be easily comparable via their inputs. You can change the attributes of the experiments in a project (e.g. scoring functions) over time, simply by changing what you log.
You should not create Experiment objects directly. Instead, use the braintrust.init() method.
log
def log(input=None,
        output=None,
        expected=None,
        scores=None,
        metadata=None,
        metrics=None,
        id=None,
        dataset_record_id=None,
        inputs=None)
Log a single event to the experiment. The event will be batched and uploaded behind the scenes.
Arguments:
input: The arguments that uniquely define a test case (an arbitrary, JSON serializable object). Later on, Braintrust will use the input to know whether two test cases are the same between experiments, so they should not contain experiment-specific state. A simple rule of thumb is that if you run the same experiment twice, the input should be identical.
output: The output of your application, including post-processing (an arbitrary, JSON serializable object), that allows you to determine whether the result is correct or not. For example, in an app that generates SQL queries, the output should be the result of the SQL query generated by the model, not the query itself, because there may be multiple valid queries that answer a single question.
expected: The ground truth value (an arbitrary, JSON serializable object) that you'd compare to output to determine if your output value is correct or not. Braintrust currently does not compare output to expected for you, since there are so many different ways to do that correctly. Instead, these values are just used to help you navigate your experiments while digging into analyses. However, we may later use these values to re-score outputs or fine-tune your models.
scores: A dictionary of numeric values (between 0 and 1) to log. The scores should give you a variety of signals that help you determine how accurate the outputs are compared to what you expect and diagnose failures. For example, a summarization app might have one score that tells you how accurate the summary is, and another that measures the word similarity between the generated and ground truth summary. The word similarity score could help you determine whether the summarization was covering similar concepts or not. You can use these scores to help you sort, filter, and compare experiments.
metadata: (Optional) A dictionary with additional data about the test example, model outputs, or just about anything else that's relevant, that you can use to help find and analyze examples later. For example, you could log the prompt, example's id, or anything else that would be useful to slice/dice later. The values in metadata can be any JSON-serializable type, but its keys must be strings.
metrics: (Optional) A dictionary of metrics to log. The following keys are populated automatically and should not be specified: "start", "end", "caller_functionname", "caller_filename", "caller_lineno".
id: (Optional) A unique identifier for the event. If you don't provide one, Braintrust will generate one for you.
dataset_record_id: (Optional) The id of the dataset record that this event is associated with. This field is required if and only if the experiment is associated with a dataset.
inputs: (Deprecated) The same as input (will be removed in a future version).
Returns:
The id of the logged event.
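As an illustration of these fields, here is a hedged sketch of logging one event for a hypothetical SQL-generation app (all names and values are made up):
import braintrust

experiment = braintrust.init(project="SQL-App")
experiment.log(
    input={"question": "How many users signed up last week?"},
    output={"rows": [[42]]},    # the result of running the generated query
    expected={"rows": [[42]]},  # the ground truth result
    scores={"result_match": 1.0},
    metadata={"prompt_version": "v2"},
)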
start_span
def start_span(name="root",
               span_attributes={},
               start_time=None,
               set_current=None,
               **event)
Create a new toplevel span. The name parameter is optional and defaults to "root".
See Span.start_span for full details.
summarize
def summarize(summarize_scores=True, comparison_experiment_id=None)
Summarize the experiment, including the scores (compared to the closest reference experiment) and metadata.
Arguments:
summarize_scores: Whether to summarize the scores. If False, only the metadata will be returned.
comparison_experiment_id: The experiment to compare against. If None, the most recent experiment on the origin's main branch will be used.
Returns:
ExperimentSummary
close
def close()
Finish the experiment and return its id. After calling close, you may not invoke any further methods on the experiment object.
Will be invoked automatically if the experiment is bound to a context manager.
Returns:
The experiment id.
SpanImpl Objects
class SpanImpl(Span)
Primary implementation of the Span interface. See the Span interface for full details on each method.
We suggest using one of the various start_span methods instead of creating Spans directly. See Span.start_span for full details.
Dataset Objects
class Dataset(ModelWrapper)
A dataset is a collection of records, such as model inputs and outputs, which represent data you can use to evaluate and fine-tune models. You can log production data to datasets, curate them with interesting examples, edit/delete records, and run evaluations against them.
You should not create Dataset objects directly. Instead, use the braintrust.init_dataset() method.
insert
def insert(input, output, metadata=None, id=None)
Insert a single record into the dataset. The record will be batched and uploaded behind the scenes. If you pass in an id, and a record with that id already exists, it will be overwritten (upsert).
Arguments:
input: The argument that uniquely defines an input case (an arbitrary, JSON serializable object).
output: The output of your application, including post-processing (an arbitrary, JSON serializable object).
metadata: (Optional) A dictionary with additional data about the test example, model outputs, or just about anything else that's relevant, that you can use to help find and analyze examples later. For example, you could log the prompt, example's id, or anything else that would be useful to slice/dice later. The values in metadata can be any JSON-serializable type, but its keys must be strings.
id: (Optional) A unique identifier for the event. If you don't provide one, Braintrust will generate one for you.
Returns:
The id of the logged record.
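A minimal sketch of the upsert behavior, assuming dataset was created with braintrust.init_dataset (the record values are made up):
record_id = dataset.insert(input={"question": "1+1"}, output="2")
# Passing the same id again overwrites the existing record (upsert).
dataset.insert(input={"question": "1+1"}, output="two", id=record_id)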
delete
def delete(id)
Delete a record from the dataset.
Arguments:
id: The id of the record to delete.
summarize
def summarize(summarize_data=True)
Summarize the dataset, including high level metrics about its size and other metadata.
Arguments:
summarize_data: Whether to summarize the data. If False, only the metadata will be returned.
Returns:
DatasetSummary
fetch
def fetch()
Fetch all records in the dataset.
for record in dataset.fetch():
    print(record)

# You can also iterate over the dataset directly.
for record in dataset:
    print(record)
Returns:
An iterator over the records in the dataset.
close
def close()
Terminate connection to the dataset and return its id. After calling close, you may not invoke any further methods on the dataset object.
Will be invoked automatically if the dataset is bound as a context manager.
Returns:
The dataset id.
ScoreSummary Objects
@dataclasses.dataclass
class ScoreSummary(SerializableDataClass)
Summary of a score's performance.
name
Name of the score.
score
Average score across all examples.
diff
Difference in score between the current and reference experiment.
improvements
Number of improvements in the score.
regressions
Number of regressions in the score.
ExperimentSummary Objects
@dataclasses.dataclass
class ExperimentSummary(SerializableDataClass)
Summary of an experiment's scores and metadata.
project_name
Name of the project that the experiment belongs to.
experiment_name
Name of the experiment.
project_url
URL to the project's page in the Braintrust app.
experiment_url
URL to the experiment's page in the Braintrust app.
comparison_experiment_name
The experiment the scores are baselined against.
scores
Summary of the experiment's scores.
DataSummary Objects
@dataclasses.dataclass
class DataSummary(SerializableDataClass)
Summary of a dataset's data.
new_records
New or updated records added in this session.
total_records
Total records in the dataset.
DatasetSummary Objects
@dataclasses.dataclass
class DatasetSummary(SerializableDataClass)
Summary of a dataset's scores and metadata.
project_name
Name of the project that the dataset belongs to.
dataset_name
Name of the dataset.
project_url
URL to the project's page in the Braintrust app.
dataset_url
URL to the dataset's page in the Braintrust app.
data_summary
Summary of the dataset's data.
braintrust.framework
EvalCase Objects
@dataclasses.dataclass
class EvalCase(SerializableDataClass)
An evaluation case. This is a single input to the evaluation task, along with an optional expected output and metadata.
EvalHooks Objects
class EvalHooks(abc.ABC)
An object that can be used to add metadata to an evaluation. This is passed to the task function.
meta
@abc.abstractmethod
def meta(**info) -> None
Adds metadata to the evaluation. This metadata will be logged to Braintrust. You can pass in metadata as keyword arguments, e.g. hooks.meta(foo="bar").
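A minimal sketch of a task function that records metadata through the hooks object (the field names are hypothetical):
def task(input, hooks):
    output = input * 2
    # Attach extra metadata to this evaluation case.
    hooks.meta(model="my-model", attempt=1)
    return output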
EvalScorerArgs Objects
class EvalScorerArgs(SerializableDataClass)
Arguments passed to an evaluator scorer. This includes the input, expected output, actual output, and metadata.
Evaluator Objects
@dataclasses.dataclass
class Evaluator()
An evaluator is an abstraction that defines an evaluation dataset, a task to run on the dataset, and a set of scorers to evaluate the results of the task. Each method attribute can be synchronous or asynchronous (for optimal performance, it is recommended to provide asynchronous implementations).
You should not create Evaluators directly if you plan to use the Braintrust eval framework. Instead, you should create them using the Eval() method, which will register them so that braintrust eval ... can find them.
name
The name of the evaluator. This corresponds to a project name in Braintrust.
data
Returns an iterator over the evaluation dataset. Each element of the iterator should be an EvalCase or a dict with the same fields as an EvalCase (input, expected, metadata).
task
Runs the evaluation task on a single input. The hooks object can be used to add metadata to the evaluation.
scores
A list of scorers to evaluate the results of the task. Each scorer can be a Scorer object or a function that takes input, output, and expected arguments and returns a Score object. The function can be async.
Eval
def Eval(name: str,
         data: Callable[[], Union[Iterator[EvalCase], AsyncIterator[EvalCase]]],
         task: Callable[[Input, EvalHooks], Union[Output, Awaitable[Output]]],
         scores: List[EvalScorer])
A function you can use to define an evaluator. This is a convenience wrapper around the Evaluator class.
Example:
Eval(
    name="my-evaluator",
    data=lambda: [
        EvalCase(input=1, expected=2),
        EvalCase(input=2, expected=4),
    ],
    task=lambda input, hooks: input * 2,
    scores=[
        NumericDiff,
    ],
)
Arguments:
name: The name of the evaluator. This corresponds to a project name in Braintrust.
data: Returns an iterator over the evaluation dataset. Each element of the iterator should be an EvalCase.
task: Runs the evaluation task on a single input. The hooks object can be used to add metadata to the evaluation.
scores: A list of scorers to evaluate the results of the task. Each scorer can be a Scorer object or a function that takes an EvalScorerArgs object and returns a Score object.
Returns:
An Evaluator object.
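Evaluators registered with Eval() are picked up by the braintrust eval command. For example, if the snippet above were saved to a file such as eval_math.py (a hypothetical name), you would typically point the CLI at that file, e.g. braintrust eval eval_math.py.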