`evalsig.io`¶

Readers and writers for every input format EVALSIG accepts. Each reader returns a RunFrame.

Reader registry¶

from evalsig.io import register_reader, get_reader, available_formats

register_reader(name, reader) -- add a new format under a short name.
get_reader(name) -- look one up.
available_formats() -- list everything currently registered.

The built-in formats are registered at import time: runframe, lm_eval, inspect, helm, and parquet.
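As a rough illustration of how a name-to-reader registry of this kind can work, here is a minimal standalone sketch. It is not EVALSIG's implementation: the `_READERS` dict, the error types, and the choice to let a repeated name replace the earlier reader are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical module-level table; EVALSIG's real storage may differ.
_READERS: Dict[str, Callable] = {}


def register_reader(name: str, reader: Callable) -> None:
    """Register `reader` under `name`; a repeated name replaces the old entry (assumed)."""
    if not callable(reader):
        raise TypeError(f"reader for {name!r} must be callable")
    _READERS[name] = reader


def get_reader(name: str) -> Callable:
    """Return the reader for `name`, or raise KeyError listing the known formats."""
    try:
        return _READERS[name]
    except KeyError:
        known = ", ".join(available_formats()) or "(none)"
        raise KeyError(f"unknown format {name!r}; available: {known}") from None


def available_formats() -> List[str]:
    """Return the registered format names in sorted order."""
    return sorted(_READERS)


# Usage: register a toy reader and look it up again.
register_reader("toy", lambda path: {"path": path})
reader = get_reader("toy")
```

Listing the known formats in the error message is a design choice that makes a mistyped format name easy to spot.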

`read_runframe_json` / `write_runframe_json`¶

EVALSIG's own JSON format, the canonical exchange shape.

from evalsig.io import read_runframe_json, write_runframe_json
run = read_runframe_json("baseline.json")
write_runframe_json(run, "out.json")

The schema is exported as RUNFRAME_SCHEMA (JSON Schema draft 2020-12). The validator is lightweight and runs before any RunFrame is constructed, so bad inputs fail fast with a clear message.
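The validate-before-construct pattern described above might look roughly like the following stdlib-only sketch. The required keys are borrowed from the RunFrame fields used in the custom-reader example on this page; the real RUNFRAME_SCHEMA may require more or fewer, and `validate_payload` and `load_checked` are hypothetical names.

```python
from typing import Any, Dict, List

# Assumed required keys; the real RUNFRAME_SCHEMA may differ.
REQUIRED_KEYS = ("run_id", "model_id", "task_id", "metric_name", "items")


def validate_payload(payload: Any) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    if not isinstance(payload, dict):
        return ["top level must be a JSON object"]
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in payload]
    items = payload.get("items", [])
    if not isinstance(items, list):
        problems.append("items must be a list")
        return problems
    for i, item in enumerate(items):
        if not isinstance(item, dict) or "item_id" not in item:
            problems.append(f"items[{i}] needs an item_id")
            continue
        score = item.get("score")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            problems.append(f"items[{i}].score must be a number")
    return problems


def load_checked(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Raise ValueError before any object construction if the payload is bad."""
    problems = validate_payload(payload)
    if problems:
        raise ValueError("invalid RunFrame JSON: " + "; ".join(problems))
    return payload  # a real reader would build the RunFrame here
```

Collecting every problem before raising means one run of the validator reports all the issues in a file, not only the first.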

`read_lm_eval_json`¶

Reads samples_*.jsonl from EleutherAI's lm-evaluation-harness.

from evalsig.io import read_lm_eval_json

run = read_lm_eval_json(
    "samples_mmlu_2026-05-16.jsonl",
    model_id="claude-x",
    task_id="mmlu",
    metric_name="acc",         # or 'exact_match', 'is_correct', etc.
    cluster_key="subject",     # optional; field on the doc to group by
)

The reader accepts several variants of the format: a JSON list of dicts, a dict that wraps the samples, or JSONL.
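To make the three layouts concrete, here is a small sketch of a loader that accepts all of them. It is an illustration, not the reader's actual code: `load_samples` is a hypothetical helper, and the wrapper key `"samples"` is an assumption.

```python
import json
from typing import Any, Dict, List


def load_samples(text: str) -> List[Dict[str, Any]]:
    """Parse sample text laid out as a JSON list, a wrapping dict, or JSONL."""
    stripped = text.strip()
    if not stripped:
        return []
    try:
        data = json.loads(stripped)
    except json.JSONDecodeError:
        # Not a single JSON document, so treat it as JSONL: one object per line.
        return [json.loads(line) for line in stripped.splitlines() if line.strip()]
    if isinstance(data, list):
        return data                      # a bare list of sample dicts
    if isinstance(data, dict) and isinstance(data.get("samples"), list):
        return data["samples"]           # assumed wrapper key: "samples"
    if isinstance(data, dict):
        return [data]                    # a one-line JSONL file parses as one dict
    raise ValueError("unrecognised samples layout")
```

Trying a whole-document parse first and falling back to line-by-line parsing keeps the caller from having to declare which layout a file uses.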

`read_inspect_log`¶

Reads JSON exports of Inspect AI .eval logs. Convert the log to JSON first, for example with inspect log dump run.eval > run.json.

from evalsig.io import read_inspect_log

run = read_inspect_log("run.json",
                       metric_name="accuracy",
                       cluster_key="passage_id")

Handles the common score.value shapes ("C"/"I", booleans, numbers).
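A sketch of how those score.value shapes could be mapped onto a single float follows. `score_to_float` is a hypothetical helper, and mapping "C" to 1.0 and "I" to 0.0 is the assumed convention. Any other letter raises an error in this sketch.

```python
from typing import Union


def score_to_float(value: Union[str, bool, int, float]) -> float:
    """Map a score value ("C"/"I", bool, or number) onto a float."""
    if isinstance(value, bool):          # check bool first: bool is a subclass of int
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        letter = value.strip().upper()
        if letter == "C":                # "C" = correct
            return 1.0
        if letter == "I":                # "I" = incorrect
            return 0.0
        return float(letter)             # numeric strings; raises ValueError otherwise
    raise TypeError(f"unsupported score value: {value!r}")
```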

`read_helm_scenario`¶

Reads HELM's scenario_state.json.

from evalsig.io import read_helm_scenario

run = read_helm_scenario("scenario_state.json",
                         metric_name="accuracy",
                         cluster_key="category")

Pulls result.success (bool) by default and falls back to a numeric metric in result[metric_name] or result.stats[metric_name].
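The lookup order can be sketched as below. This is an illustration under assumptions: `extract_score` is a hypothetical helper, `result` is treated as a plain dict, and raising KeyError when nothing matches is a choice made for the example.

```python
from typing import Any, Dict


def extract_score(result: Dict[str, Any], metric_name: str = "accuracy") -> float:
    """Return a score using success first, then the two numeric fallbacks."""
    success = result.get("success")
    if isinstance(success, bool):                    # 1. result.success
        return 1.0 if success else 0.0
    # 2. result[metric_name], then 3. result.stats[metric_name]
    for container in (result, result.get("stats")):
        if isinstance(container, dict):
            value = container.get(metric_name)
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                return float(value)
    raise KeyError(f"no success flag or numeric {metric_name!r} in result")
```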

`read_runframe_parquet` / `write_runframe_parquet`¶

The long-term storage format. One row per (run, item, epoch). Use the canonical PARQUET_SCHEMA (also exported) when writing your own ingestion paths.

from evalsig.io import read_runframe_parquet, write_runframe_parquet
write_runframe_parquet(run, "run.parquet")
back = read_runframe_parquet("run.parquet")

If a file holds multiple runs, pass run_id= to disambiguate.
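The disambiguation rule can be illustrated without a Parquet dependency by treating the file as a list of row dicts, one per (run, item, epoch). `select_run` and the column names other than `run_id` are assumptions for this sketch, as is the choice to raise when several runs are present and no `run_id` is given.

```python
from typing import Any, Dict, List, Optional


def select_run(rows: List[Dict[str, Any]],
               run_id: Optional[str] = None) -> List[Dict[str, Any]]:
    """Pick one run's rows out of a table with one row per (run, item, epoch)."""
    run_ids = sorted({row["run_id"] for row in rows})
    if run_id is None:
        if len(run_ids) != 1:
            raise ValueError(f"file holds {len(run_ids)} runs {run_ids}; pass run_id=")
        run_id = run_ids[0]              # unambiguous: exactly one run in the file
    selected = [row for row in rows if row["run_id"] == run_id]
    if not selected:
        raise KeyError(f"run_id {run_id!r} not found; available: {run_ids}")
    return selected
```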

`normalize`¶

Convenience wrapper for callers that want to import the alignment helper from evalsig.io.normalize rather than evalsig.compare.compare. Returns the aligned arrays plus any warning notes.
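For readers unfamiliar with the idea, aligning two runs usually means pairing scores by item and reporting anything that could not be paired. The sketch below shows one plausible version. It is not the EVALSIG helper: `align_scores`, its dict inputs, and the wording of the notes are all hypothetical.

```python
from typing import Dict, List, Tuple


def align_scores(a: Dict[str, float],
                 b: Dict[str, float]) -> Tuple[List[float], List[float], List[str]]:
    """Align two item_id -> score maps on their shared items, with warning notes."""
    shared = sorted(a.keys() & b.keys())     # deterministic order by item_id
    notes: List[str] = []
    only_a = len(a) - len(shared)
    only_b = len(b) - len(shared)
    if only_a:
        notes.append(f"{only_a} item(s) only in first run were dropped")
    if only_b:
        notes.append(f"{only_b} item(s) only in second run were dropped")
    return [a[k] for k in shared], [b[k] for k in shared], notes
```

Returning the notes alongside the arrays, instead of printing or raising, leaves it to the caller to decide whether a partial overlap is acceptable.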

Writing your own reader¶

Any function that turns a path into a RunFrame is a reader.

from evalsig.io import register_reader
from evalsig.types import RunFrame, ItemResult

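def read_my_format(path: str, **kw) -> RunFrame:
    # my_parser stands in for your own parsing code; it should yield dicts
    # with "id", "score", and (optionally) "group" keys.
    rows = my_parser(path)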
    return RunFrame(
        run_id=kw.get("run_id", path),
        model_id=kw.get("model_id", "unknown"),
        task_id=kw.get("task_id", "unknown"),
        metric_name=kw.get("metric_name", "accuracy"),
        items=[
            ItemResult(item_id=str(r["id"]),
                       score=float(r["score"]),
                       cluster_id=r.get("group"))
            for r in rows
        ],
    )

register_reader("my_format", read_my_format)

The CLI's --format will then accept my_format.
