ART writes a metrics row every time you call model.log(...). Those rows go to history.jsonl in the run directory and, if W&B logging is enabled, to W&B. Use this page for three things:
  • understand the metrics ART emits automatically
  • add task-specific metrics from your own rollout code
  • track external judge and API spend alongside training metrics
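Because the rows land in history.jsonl, you can inspect a run without W&B. The sketch below is a minimal reader, assuming each line of the file is a flat JSON object keyed by metric name (the exact row layout is an assumption, and load_history and series are hypothetical helper names, not part of ART).

```python
import json
from pathlib import Path


def load_history(path):
    """Read a JSONL metrics file into a list of dicts, skipping blank lines."""
    rows = []
    with Path(path).open() as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def series(rows, key):
    """Return the values logged under `key`, ignoring rows that lack it."""
    return [row[key] for row in rows if key in row]
```

For example, series(load_history(run_dir / "history.jsonl"), "reward/mean") would give the reward curve for a run, where run_dir is whatever directory your run writes to.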

What ART logs automatically

When you call await model.train(...) or await model.log(train_groups, split="train"), ART already logs most of the metrics you need to monitor a run.
Type           Examples
Reward         reward/mean, reward/std_dev, reward/exception_rate
Loss           loss/train, loss/entropy, loss/kl_div, loss/grad_norm, loss/learning_rate
Data           data/step_num_scenarios, data/step_num_trajectories, data/step_num_groups_submitted, data/step_num_groups_trainable
Train summary  train/num_groups_submitted, train/num_groups_trainable, train/num_trajectories
Time           time/wall_clock_sec, time/step_wall_s, time/step_trainer_s
Cost           costs/gpu on LocalBackend when GPU pricing is known
If ART has the inputs it needs, it also derives:
  • cumulative metrics such as time/cum/trainer_s, data/cum/num_unique_scenarios, and costs/cum/all
  • cost rollups such as costs/train, costs/eval, and costs/all
  • throughput metrics such as throughput/avg_trainer_tok_per_s and throughput/avg_actor_tok_per_s
Some metrics only appear when the backend or your code provides the underlying inputs. For example, throughput/avg_actor_tok_per_s requires both data/step_actor_tokens and time/step_actor_s.
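To see why both inputs are required, here is a rough sketch of how an average throughput could be derived from logged rows. It assumes the average is total tokens divided by total seconds over the rows that carry both keys. That formula is an assumption for illustration and may differ from ART's internal calculation, and avg_tok_per_s is a hypothetical helper.

```python
def avg_tok_per_s(
    rows,
    tokens_key="data/step_actor_tokens",
    seconds_key="time/step_actor_s",
):
    """Average tokens per second over rows that contain both inputs.

    Returns None when no row has both keys or the total time is zero,
    mirroring the idea that the metric is absent without its inputs.
    """
    total_tokens = 0.0
    total_seconds = 0.0
    for row in rows:
        if tokens_key in row and seconds_key in row:
            total_tokens += row[tokens_key]
            total_seconds += row[seconds_key]
    if total_seconds <= 0:
        return None
    return total_tokens / total_seconds
```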

Add task-specific outcome metrics

Attach metrics directly to each Trajectory when your rollout code knows whether an attempt succeeded, how many tools it called, or any other task-specific signal.
import art

# Scenario, SYSTEM_PROMPT, score_reward, is_correct, and count_tool_calls
# are assumed to be defined elsewhere in your task code.
async def rollout(model: art.Model, scenario: Scenario) -> art.Trajectory:
    trajectory = art.Trajectory(
        messages_and_choices=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": scenario.prompt},
        ],
        metadata={"scenario_id": scenario.id},
    )

    # Sample one completion and append it to the trajectory.
    completion = await model.openai_client().chat.completions.create(
        model=model.get_inference_name(),
        messages=trajectory.messages(),
    )
    trajectory.messages_and_choices.append(completion.choices[0])

    # Set the reward, then attach task-specific metrics as floats.
    trajectory.reward = score_reward(trajectory)
    trajectory.metrics["correct"] = float(is_correct(trajectory))
    trajectory.metrics["tool_calls"] = float(count_tool_calls(trajectory))
    return trajectory
On train steps, ART averages those rollout metrics and logs them under the reward/ namespace, such as reward/correct and reward/tool_calls. If you want to record one value per TrajectoryGroup instead of one per trajectory, pass metrics={...} when you build the group. ART logs those once per group, using keys like reward/group_difficulty on train steps.
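The averaging step can be pictured with the sketch below. It takes one metrics dict per trajectory and returns the mean of each key under the reward/ prefix. This is an illustration only: average_rollout_metrics is a hypothetical helper, and the choice to average each key over just the trajectories that report it is an assumption, not confirmed ART behavior.

```python
from collections import defaultdict


def average_rollout_metrics(trajectory_metrics, prefix="reward/"):
    """Average per-trajectory metrics and namespace the resulting keys.

    Each key is averaged over the trajectories that report it
    (an assumption about how missing keys are treated).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for metrics in trajectory_metrics:
        for name, value in metrics.items():
            sums[name] += float(value)
            counts[name] += 1
    return {f"{prefix}{name}": sums[name] / counts[name] for name in sums}
```

With two trajectories reporting correct values of 1.0 and 0.0, the step would show reward/correct as 0.5, which reads directly as a success rate.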

Add step-level metrics ART cannot infer

Use model.metrics_builder() for metrics that live outside individual trajectories, such as actor-side timing, token counts, or idle time.
builder = model.metrics_builder()

with builder.measure("time/step_actor_s"):
    result = await run_rollouts()

builder.add_data(
    step_num_scenarios=result.num_scenarios,
    step_actor_tokens=result.actor_tokens,
    scenario_ids=result.scenario_ids,
)
builder.add_idle_times(step_actor_idle_s=result.actor_idle_s)

await model.log(result.train_groups, split="train", step=result.step)
A few useful patterns:
  • log scenario_ids to unlock data/cum/num_unique_scenarios
  • log both data/step_actor_tokens and time/step_actor_s to unlock actor throughput metrics
  • log time/step_eval_s when eval runs happen outside the backend
  • use fully qualified keys like time/step_actor_s or data/step_actor_tokens for builder-managed metrics
ART flushes builder-managed metrics on the next model.log(...) or model.train(...) call.
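The measure-then-flush behavior can be sketched with a small stand-in class. StepMetrics below is hypothetical and is not ART's builder; it only illustrates the pattern of timing a block with a context manager, buffering values, and emitting them once on the next flush.

```python
import time
from contextlib import contextmanager


class StepMetrics:
    """Hypothetical stand-in for a metrics builder: buffers values until flush()."""

    def __init__(self):
        self._pending = {}

    @contextmanager
    def measure(self, key):
        """Time the enclosed block and accumulate the elapsed seconds under `key`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self._pending[key] = self._pending.get(key, 0.0) + elapsed

    def add(self, key, value):
        """Buffer a single value under a fully qualified key."""
        self._pending[key] = value

    def flush(self):
        """Return the buffered metrics and clear the buffer."""
        row, self._pending = self._pending, {}
        return row
```

Recording the time in a finally clause means a rollout that raises still contributes its elapsed time, which keeps step timing honest when some rollouts fail.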

Track judge and API costs

Use @track_api_cost when a function returns a provider response object with token usage. Wrap the relevant part of your code in a metrics context so ART knows whether the spend belongs to training or evaluation.
from art.metrics import track_api_cost

@track_api_cost(
    source="llm_judge/correctness",
    provider="openai",
    model_name="openai/gpt-4.1",
)
async def run_judge(client, messages):
    return await client.chat.completions.create(
        model="gpt-4.1",
        messages=messages,
    )

with model.metrics_builder("train").activate_context():
    await run_judge(judge_client, train_messages)

with model.metrics_builder("eval").activate_context():
    await run_judge(judge_client, eval_messages)
The next metrics row will include:
  • costs/train/llm_judge/correctness or costs/eval/llm_judge/correctness
  • rollups such as costs/train, costs/eval, and costs/all
  • cumulative totals such as costs/cum/all
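As a rough picture of how leaf costs relate to the rollups listed above, the sketch below sums leaf keys by their costs/train/ or costs/eval/ prefix. The rule that costs/all equals the sum of every leaf is an assumption for illustration (the text does not spell out ART's exact rollup rules), and roll_up_costs is a hypothetical helper.

```python
def roll_up_costs(leaf_costs):
    """Add per-split and overall totals to a dict of leaf cost metrics (USD)."""
    row = dict(leaf_costs)
    for split in ("train", "eval"):
        prefix = f"costs/{split}/"
        matching = [usd for key, usd in leaf_costs.items() if key.startswith(prefix)]
        # Only emit a split rollup when at least one leaf belongs to it.
        if matching:
            row[f"costs/{split}"] = sum(matching)
    # Assumption: the overall total is the sum of every leaf cost.
    row["costs/all"] = sum(leaf_costs.values())
    return row
```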
ART can price OpenAI and Anthropic responses from their usage fields. You must pass both provider and model_name to @track_api_cost. For custom pricing or unsupported models, register pricing on the builder:
builder = model.metrics_builder()
builder.register_model_pricing(
    "anthropic/my-custom-judge",
    prompt_per_million=1.2,
    completion_per_million=4.8,
)
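The per-million rates translate into a dollar figure in the usual way: tokens times rate, divided by one million. The sketch below shows that arithmetic. price_usage is a hypothetical helper, and how ART reads token counts from a provider's usage field is not shown here.

```python
def price_usage(prompt_tokens, completion_tokens, prompt_per_million, completion_per_million):
    """Return the USD cost of one response given per-million-token rates."""
    return (
        prompt_tokens * prompt_per_million
        + completion_tokens * completion_per_million
    ) / 1_000_000
```

At the rates registered above, a call with 2,000 prompt tokens and 500 completion tokens costs about (2000 × 1.2 + 500 × 4.8) / 1,000,000 = $0.0048.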

Track GPU cost on LocalBackend

LocalBackend can log costs/gpu automatically on train steps. ART currently auto-detects H200 pricing at $3/hour per GPU. For other hardware, pass an explicit override:
from art.local import LocalBackend

backend = LocalBackend(gpu_cost_per_hour_usd=2.25)
This lets ART include GPU spend in the same metrics stream as rewards, losses, and judge/API costs.
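For a sense of scale, GPU spend for a step is just hours times GPUs times the hourly rate. The sketch below assumes you supply the duration in seconds yourself. Which timer ART uses internally for costs/gpu is not stated here, and gpu_cost_usd is a hypothetical helper.

```python
def gpu_cost_usd(seconds, num_gpus, cost_per_hour_usd):
    """Return the USD cost of occupying `num_gpus` GPUs for `seconds`."""
    return seconds / 3600.0 * num_gpus * cost_per_hour_usd
```

At the $3/hour H200 rate, a 30-minute step on 8 GPUs comes to 0.5 × 8 × 3 = $12.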