A general-purpose, modular, and extensible platform for custom evaluations of AI models and applications.
Lake Merritt provides a standardized yet flexible environment for evaluating AI systems. With its Eval Pack architecture, you can run everything from quick, simple comparisons using a spreadsheet to complex, multi-stage evaluation pipelines defined in a single configuration file. This is an Alpha Version.
The platform is designed for:
The fastest way to understand Lake Merritt is to run an evaluation. Our hands-on guide will walk you through everything from a simple 60-second test to evaluating a complex AI agent, with no coding required for most examples.
➡️ Click here for the Hands-On Quick Start Guide
The guide covers six core workflows, designed to show you the full range and power of the platform:
Before starting, it’s important to understand the two fundamental workflows Lake Merritt supports. Your choice of mode determines what kind of data you need to provide and which capabilities are available.
Mode A (Evaluate Existing Outputs) is the most common use case. You provide a dataset that already contains the model’s performance data; your file must include `input`, `output`, and `expected_output` columns.

Mode B (Generate New Data) is for when you have test inputs but need to generate the outputs to be evaluated.
For this mode, your dataset requires only `input` and `expected_output` columns (the `output` column is not needed). The system will generate an `output` for each row and then run the scoring pipeline on the newly generated data.

Lake Merritt introduces a uniquely powerful workflow that dramatically accelerates the evaluation process, making it accessible to everyone on your team, regardless of their technical background. This approach allows you to bootstrap an entire evaluation lifecycle starting with nothing more than a list of inputs and a plain-text description of your goals.
This isn’t for generating production-grade, statistically perfect evaluation data. Instead, it’s an incredibly handy feature for a quick start and rapid iteration, allowing you to see how an entire evaluation would run before you invest heavily in manual data annotation.
Here’s how you can go from an idea to a full evaluation run in five steps:
Begin with the bare minimum: a simple CSV file containing only an `input` column. Your “idea” is a natural language explanation of what you’re trying to achieve—your success criteria, the persona you want the AI to adopt, and the business, legal, or risk rules it must follow. You can write this directly in the UI or in a markdown file.
Using Mode B: Generate New Data, you’ll run the “Generate Expected Outputs” sub-mode. The system will use your context to guide a powerful LLM, which will read each of your inputs and generate a high-quality, correctly formatted `expected_output` for every row. At the end of this step, you can download a brand new dataset, ready for evaluation.
With your new dataset in hand, you immediately run a second Mode B pass. This time, you’ll use the “Generate Outputs” sub-mode. You provide context for the model you want to test (the “Actor LLM”), and the system generates its `output` for each input, creating a complete three-column CSV.
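For orientation, a dataset at this stage has the standard three-column shape; the rows below are invented purely for illustration:

```csv
input,output,expected_output
"Summarize the refund policy in one sentence.","Refunds are available within 30 days.","Customers may request a full refund within 30 days of purchase."
"What is the support email address?","support@example.com","Contact support@example.com for help."
```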
Now, with a complete dataset of synthetically generated data, you can immediately run a Mode A: Evaluate Existing Outputs workflow. You can select scorers like LLM-as-a-Judge to see how the generated `output` values stack up against the generated `expected_output` values, getting a full report with scores and analysis.
This is the most crucial step. Having seen a full evaluation lifecycle, you and your team can now intelligently refine the process. You can go back and manually revise the generated `expected_output` values to better reflect reality-based context, edit the inputs to cover more edge cases, or adjust your initial context to improve the success criteria.
This workflow is a significant step forward for accessible, open-source AI evaluation tools, offering several unique advantages:
By transforming the tedious task of initial dataset creation into a creative and iterative process, Lake Merritt empowers teams to build better, safer, and more aligned AI systems faster than ever before.
Lake Merritt offers two UIs for running evaluations, catering to different needs.
This is the fastest way to get started. If you have a simple CSV file, you can upload it and configure scorers directly in the user interface. It’s perfect for quick checks and initial exploration.
Your CSV must use the standard column names (`input`, `output`, and `expected_output` for Mode A).

Eval Packs are the new, powerful way to use Lake Merritt. An Eval Pack is a YAML file where you declaratively define the entire evaluation process. This is the recommended path for any serious or recurring evaluation task.
The entire evaluation pipeline is defined in a single `.yaml` file. Built-in ingesters include:

- `csv`: For standard CSV files, supporting both Mode A and Mode B.
- `json`: For evaluating records in simple JSON or list-of-JSON formats.
- `generic_otel`: A powerful ingester for standard OpenTelemetry JSON traces, allowing evaluation of complex agent behavior by extracting fields from across an entire trace.
- `python`: For ultimate flexibility, run a custom Python script to ingest any data format and yield `EvaluationItem` objects.

Scorers range from deterministic options like `exact_match` and `fuzzy_match`, for clear and repeatable checks, to more specialized scorers such as `CriteriaSelectionJudge` and `ToolUsageScorer`.

The Eval Pack system is the most powerful feature of Lake Merritt. It turns your evaluation logic into a shareable, version-controllable artifact that is essential for rigorous testing.
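To make this concrete, here is a minimal, hypothetical Eval Pack sketch. The field names (`ingestion`, `pipeline`, `config`, and so on) are illustrative assumptions rather than the authoritative schema; treat the packs in `examples/eval_packs/` and `test_packs/` as the source of truth for the real format.

```yaml
# Hypothetical Eval Pack sketch. Field names are illustrative assumptions,
# not the official schema; use the shipped example packs as the reference.
name: "support-answer-quality"
version: "1.0"

ingestion:
  type: csv                      # built-in ingesters: csv, json, generic_otel, python

pipeline:
  - name: exact_match_check
    scorer: exact_match          # cheap, deterministic first pass
  - name: judge_answer_quality
    scorer: llm_judge            # LLM-as-a-Judge stage with a custom prompt
    config:
      prompt: |
        Compare the actual output to the expected output for this input
        and return a JSON score.
        Input: {input}
        Output: {output}
        Expected: {expected_output}
```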
A great way to start is by exploring the examples provided in the repository. The `examples/eval_packs/` directory contains ready-to-use packs for a variety of tasks. Use these as templates for your own custom evaluations.
Eval Packs are designed for both systematic and exploratory analysis.
For Routine & Ongoing Evals: Codify your team’s quality bar in an Eval Pack. Run the same pack against every new model version or prompt update to track regressions and improvements over time. This makes your evaluation process a reliable, repeatable part of your development lifecycle, suitable for integration into CI/CD pipelines.
For Ad-Hoc & Exploratory Evals: Quickly prototype a new evaluation idea without changing the core application code. Have a novel data format? Write a small Python ingester and reference it in your pack. Want to test a new scoring idea? Define a new pipeline stage with a custom prompt. An Eval Pack lets you experiment rapidly and share the entire evaluation strategy in a single file.
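For the “novel data format” case mentioned above, a custom ingester can be as small as a script that yields `EvaluationItem` objects. The sketch below is hypothetical: the `EvaluationItem` constructor and the hook the `python` ingester expects are assumptions about the interface; see `scripts/ingest_bbq_jsonl.py` in the repository for a working example.

```python
# Hypothetical custom ingester sketch. The EvaluationItem constructor and the
# hook name are assumptions; see scripts/ingest_bbq_jsonl.py for a real example.
import json
from pathlib import Path

from core.data_models import EvaluationItem  # assumed import path


def ingest(path: str):
    """Yield one EvaluationItem per line of a JSONL file."""
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        yield EvaluationItem(                # field names assumed to mirror the CSV columns
            input=record["question"],
            expected_output=record["answer"],
        )
```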
Eval Packs enable sophisticated evaluation designs:
- Staged scoring: Use a cheap `exact_match` scorer to filter easy passes, then run an expensive `llm_judge` only on the items that failed the initial check (a sketch follows below).
- Combined scorers: Use a `ToolUsageScorer` to validate tool calls and a separate `LLMJudgeScorer` to assess the quality of the agent’s final answer.

Versioning is crucial. The `version` field in your Eval Pack (e.g., `version: "1.1"`) is more than just metadata. When you change a prompt, adjust a threshold, or add a scorer, you should increment the version. This ensures that you can always reproduce past results and clearly track how your evaluation standards evolve over time.
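To illustrate the staged design described above, here is a purely hypothetical sketch; the `run_if` key is invented to convey the idea of gating an expensive judge on a cheap check, and may not match the actual pack syntax.

```yaml
# Hypothetical staged pipeline. The run_if key is invented for illustration;
# consult the shipped example packs for how conditional stages are expressed.
version: "1.2"                    # bumped after changing the judge prompt

pipeline:
  - name: cheap_filter
    scorer: exact_match
  - name: expensive_judge
    scorer: llm_judge
    run_if: "cheap_filter.failed" # hypothetical: judge only items that failed the filter
```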
Lake Merritt’s extensible architecture was designed to handle diverse and complex evaluation needs. To prove and showcase this power, we have added comprehensive, end-to-end support for two sophisticated benchmarks: the official Bias Benchmark for Question Answering (BBQ) and an initial, experimental Fiduciary Duty of Loyalty (FDL) evaluation inspired by BBQ’s methodology.
This new capability demonstrates how Lake Merritt can be used as a “Benchmark-in-a-Box,” where a single Eval Pack can codify the entire workflow—from ingesting complex, multi-file datasets to running generation, scoring, and calculating specialized, benchmark-specific aggregate metrics.
BBQ Quick Primer: The BBQ benchmark organizes questions into pairs of ambiguous vs. disambiguated contexts. It tests whether a model relies on harmful social stereotypes when a question is under-informative, and whether it can overcome those stereotypes when more context is provided. Results are aggregated into a final bias score that measures the model’s tendency to select stereotype-consistent answers. Our implementation mirrors these core concepts.
Integrating the BBQ benchmark was a perfect test of Lake Merritt’s modular design, leveraging a custom ingester, a deterministic scorer, and a new aggregator.
The entire BBQ workflow is defined in `test_packs/bbq_eval_pack.yaml` and uses the following components:

- `PythonIngester`: The BBQ dataset consists of multiple `.jsonl` data files and a separate `additional_metadata.csv`. Our `PythonIngester` is configured in the Eval Pack to point to a dedicated script (`scripts/ingest_bbq_jsonl.py`) that correctly parses this multi-file structure into clean `EvaluationItem`s.
- `ChoiceIndexScorer`: No LLM judge is needed. After the model generates its text-based answer, the `ChoiceIndexScorer` deterministically maps that answer back to one of the choices (A, B, or C) using fuzzy matching. This is fast, cheap, and repeatable.
- `BBQBiasScoreAggregator`: After all items are scored, this aggregator calculates the official BBQ metrics: Disambiguated Accuracy and the crucial Accuracy-Weighted Ambiguous Bias Score. These are then displayed directly in the summary report.

To run the BBQ evaluation:

Download the BBQ Dataset: First, clone or download the official BBQ repository to your machine. The ingester script expects the original directory structure (`data/` and `supplemental/` subdirectories).
Provide the Dataset Path: The `PythonIngester` needs to know where to find the dataset. Create a simple text file (e.g., `bbq_path.txt`) that contains a single line: the absolute path to the root of the BBQ repository you just downloaded.
```bash
# Example command on macOS or Linux
echo "/Users/yourname/path/to/your/BBQ_full" > bbq_path.txt
```
Launch Lake Merritt and Configure: Run `streamlit run streamlit_app.py` and ensure your API keys are set up in “System Configuration.” Then select `test_packs/bbq_eval_pack.yaml` as your Eval Pack and upload your `bbq_path.txt` file.

We’ve also adapted the BBQ methodology for a specialized legal and ethical domain: Fiduciary Duty of Loyalty. This eval tests whether an AI, when faced with a conflict of interest, correctly prioritizes the user’s interests.
This workflow is a powerful example of the full “idea to insight” lifecycle in Lake Merritt, which requires a human-in-the-loop process for creating a high-quality dataset.
First, generate a draft dataset by running:

```bash
python scripts/generate_fdl_dataset.py
```

This creates the `data/duty_of_loyalty_benchmark.csv` file.
(Crucial) Human Review Step: A domain expert (e.g., a lawyer) must review and, if necessary, correct the synthetically generated `expected_output` values in the CSV. An LLM’s initial attempt is a draft, not ground truth.
Run the Evaluation: Select `test_packs/fdl_eval_pack.yaml` as your Eval Pack and upload `data/duty_of_loyalty_benchmark.csv` as your data.

Together, these benchmarks show how the Eval Pack architecture cleanly separates data handling (`PythonIngester`), per-item logic (`ChoiceIndexScorer`), and dataset-level analysis (`BBQBiasScoreAggregator`). The `PythonIngester` allowed us to handle BBQ’s complex multi-file format without changing the core application.

Evaluating the complex behavior of AI agents is a primary use case for Lake Merritt. Instead of simple input/output pairs, you can evaluate an agent’s entire decision-making process captured in an OpenTelemetry trace.
To evaluate an agent trace, you:

- Choose an ingester designed for traces (`generic_otel` or `python`).
- Apply scorers like `ToolUsageScorer` or `LLMJudge` to evaluate aspects like tool usage and the quality of the agent’s final answer.
- Upload your trace file (e.g., `traces.json`) to run the evaluation.

A common source of errors, especially in the manual workflow, is an incorrectly formatted LLM-as-a-Judge prompt. If your prompt is missing the required placeholders (such as `{input}`), the LLM will not see your data and will return an error message instead of a JSON score. This will cause the item to be marked as “failed to score.” Make sure a custom prompt uses single curly-brace placeholders for `{input}`, `{output}`, and `{expected_output}`.
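As a reference, here is a hedged sketch of a custom judge prompt that uses these placeholders; the wording and the JSON fields it requests are illustrative, not the built-in default.

```text
You are an impartial evaluator. Compare the actual output to the expected
output for the given input, then respond with a JSON object containing a
numeric "score" between 0 and 1 and a brief "reasoning" string.

Input: {input}
Actual output: {output}
Expected output: {expected_output}
```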
The default prompt in the manual UI has been corrected, but be sure to use this syntax if you write a custom prompt.

To install Lake Merritt locally, clone the repository and set up a virtual environment:

```bash
git clone https://github.com/PrototypeJam/lake_merritt.git
cd lake_merritt

# Using uv (recommended)
uv venv
source .venv/bin/activate

# This installs the app, Streamlit, and all core dependencies,
# including python-dotenv for local .env file handling
uv pip install -e ".[test,dev]"
```

Copy `.env.template` to `.env` and add your API keys:

```bash
cp .env.template .env
```
Run the Streamlit app:

```bash
streamlit run streamlit_app.py
```
Navigate to the “Evaluation Setup” page to begin.
A free deployed version of Lake Merritt is currently available for low-volume live testing on Streamlit Community Cloud (subject to Streamlit’s terms and restrictions), at: https://lakemerritt.streamlit.app
This project emphasizes deep modularity. When adding new features:
- New scorers go in `core/scoring/` and inherit from `BaseScorer` (a hedged sketch follows below).
- New ingesters go in `core/ingestion/` and inherit from `BaseIngester`.
- Register new components in `core/registry.py` to make them available to the Eval Pack engine.
- Shared data models live in `core/data_models.py`.
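As a starting point, here is a hypothetical sketch of a custom scorer. The `BaseScorer` import path, the `score()` hook, and the return type are assumptions about the interface; consult the existing scorers in `core/scoring/` before copying this pattern.

```python
# Hypothetical custom scorer sketch. The import path, hook name, and return
# type are assumptions; mirror the real scorers in core/scoring/ instead.
from core.scoring import BaseScorer  # assumed import path


class KeywordPresenceScorer(BaseScorer):
    """Passes an item when every required keyword appears in its output."""

    def __init__(self, keywords: list[str]):
        self.keywords = keywords

    def score(self, item):  # assumed per-item hook receiving an EvaluationItem
        output = (item.output or "").lower()
        missing = [kw for kw in self.keywords if kw.lower() not in output]
        # A plain dict is returned for illustration; the real interface likely
        # expects a dedicated result object.
        return {"passed": not missing, "missing_keywords": missing}
```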
MIT License - see LICENSE file for details.
This section documents critical learnings from deploying Lake Merritt to Streamlit Community Cloud, including common issues, their root causes, and proven solutions.
Issue: “Oh no. Error running app” with logs showing:
Error: The current project could not be installed: No file/folder found for package ai-eval-workbench
Root Cause: Streamlit Community Cloud uses Poetry for dependency management, not setuptools. Even though `pyproject.toml` specified `build-backend = "setuptools.build_meta"`, Streamlit ignores this and uses Poetry.
Solution: Add Poetry configuration to disable package mode:
```toml
[tool.poetry]
package-mode = false
```
This tells Poetry to only install dependencies, not attempt to install the project as a package.
Issue: Deployment failures or unexpected behavior.
Root Cause: Mismatch between the Python version configured in Streamlit Cloud settings and the `runtime.txt` file.

Solution: Ensure `runtime.txt` matches the configured Python version exactly:

```
python-3.13
```
Issue: Import errors for new subdirectories (e.g., `ModuleNotFoundError: No module named 'core.otel'`).
Root Cause: When using setuptools locally, listing parent packages (e.g., “core”) automatically includes subdirectories. However, this may not work reliably in all deployment environments.
Initial Wrong Approach:
```toml
[tool.setuptools]
packages = ["app", "core", "core.otel", "core.scoring", "core.scoring.otel", "services", "utils"]
```
This caused conflicts because subdirectories were listed explicitly when the parent was already included.
Correct Approach:
```toml
[tool.setuptools]
packages = ["app", "core", "services", "utils"]
```
List only top-level packages; subdirectories are automatically included.
Issue: Changes pushed to GitHub but deployment still shows old errors.
Symptoms:
Root Cause: Streamlit Community Cloud aggressively caches deployments. Once a deployment fails, it may get stuck in a bad state.
Solutions (in order of effectiveness):
Strategy: Add temporary debug logging to identify exact failure point.
Example:
```python
# In streamlit_app.py
import streamlit as st

try:
    from core.logging_config import setup_logging
    print("✓ Core imports successful")
except ImportError as e:
    # Surface the failure in the Streamlit UI as well as the logs
    st.error(f"Import error: {e}")
    raise
```
This makes import errors visible in the deployment logs instead of just showing “Oh no.”
Critical Files:
- `pyproject.toml` - Dependencies and project metadata
- `runtime.txt` - Python version specification
- Package directories need `__init__.py` files to be recognized as packages

Do NOT Use:

- `requirements.txt` alongside `pyproject.toml` - This can cause conflicts

Before deploying to Streamlit Community Cloud:
- Verify that `runtime.txt` matches Streamlit’s Python version
- Ensure `pyproject.toml` has a `[tool.poetry]` section with `package-mode = false`
- Confirm that package directories contain `__init__.py` files

When deployment fails:
Remember: Streamlit Community Cloud’s deployment environment differs from local development. What works locally may need adjustments for cloud deployment.
Highlights:
Deeper Dives into Roadmap Items: