lake_merritt

Lake Merritt: AI Evaluation Workbench.

A general-purpose, modular, and extensible platform for custom evaluations of AI models and applications.


Overview

Lake Merritt provides a standardized yet flexible environment for evaluating AI systems. With its Eval Pack architecture, you can run everything from quick, simple comparisons using a spreadsheet to complex, multi-stage evaluation pipelines defined in a single configuration file. This is an Alpha Version.

The platform is designed for:

🚀 Quick Start: Try It Now!

The fastest way to understand Lake Merritt is to run an evaluation. Our hands-on guide will walk you through everything from a simple 60-second test to evaluating a complex AI agent, with no coding required for most examples.

➡️ Click here for the Hands-On Quick Start Guide

The guide covers six core workflows, designed to show you the full range and power of the platform:


Understanding the Two Evaluation Modes

Before starting, it’s important to understand the two fundamental workflows Lake Merritt supports. Your choice of mode determines what kind of data you need to provide and which capabilities are available.

Mode A: Evaluate Existing Outputs

This is the most common use case. You provide a dataset that already contains the outputs you want to evaluate, typically alongside the inputs that produced them and the expected outputs to compare against.

Mode B: Generate & Evaluate

This mode is for when you have test inputs but need to generate the outputs to be evaluated.

Mode B “Hold-My-Beer” Workflow: From Idea to Insight in Minutes

Lake Merritt introduces a uniquely powerful workflow that dramatically accelerates the evaluation process, making it accessible to everyone on your team, regardless of their technical background. This approach allows you to bootstrap an entire evaluation lifecycle starting with nothing more than a list of inputs and a plain-text description of your goals.

This isn’t for generating production-grade, statistically perfect evaluation data. Instead, it’s an incredibly handy feature for a quick start and rapid iteration, allowing you to see how an entire evaluation would run before you invest heavily in manual data annotation.

Here’s how you can go from an idea to a full evaluation run in five steps:

Step 1: Start with Only Inputs and an Idea

Begin with the bare minimum: a simple CSV file containing only an input column. Your “idea” is a natural language explanation of what you’re trying to achieve—your success criteria, the persona you want the AI to adopt, and the business, legal, or risk rules it must follow. You can write this directly in the UI or in a markdown file.
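
If you would rather script this step than hand-edit a spreadsheet, here is a minimal sketch using only the Python standard library. The single input column name follows the convention described above; the example prompts are purely illustrative.

import csv

# Purely illustrative prompts; replace with the inputs you actually care about.
inputs = [
    "Summarize our refund policy for a frustrated customer.",
    "Draft a one-sentence disclaimer for a financial newsletter.",
]

with open("inputs_only.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["input"])  # the single column this step needs
    for prompt in inputs:
        writer.writerow([prompt])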

Step 2: Generate Your “Gold Standard” (expected_output)

Using Mode B: Generate New Data, you’ll run the “Generate Expected Outputs” sub-mode. The system will use your context to guide a powerful LLM, which will read each of your inputs and generate a high-quality, correctly formatted expected_output for every row. At the end of this step, you can download a brand new dataset, ready for evaluation.

Step 3: Generate the Model’s Response (output)

With your new dataset in hand, you immediately run a second Mode B pass. This time, you’ll use the “Generate Outputs” sub-mode. You provide context for the model you want to test (the “Actor LLM”), and the system generates its output for each input, creating a complete three-column CSV.

Step 4: Run a Full, End-to-End Evaluation

Now, with a complete dataset of synthetically generated data, you can immediately run a Mode A: Evaluate Existing Outputs workflow. You can select scorers like LLM-as-a-Judge to see how the generated outputs stack up against the generated expected_outputs, getting a full report with scores and analysis.

Step 5: Iterate with Human Insight

This is the most crucial step. Having seen a full evaluation lifecycle, you and your team can now intelligently refine the process. You can go back and manually revise the generated expected_outputs to better reflect real-world expectations, edit the inputs to cover more edge cases, or adjust your initial context to improve the success criteria.

Why This is a Game-Changer for AI Evaluation

This workflow is a significant step forward for accessible, open-source AI evaluation tools, offering several unique advantages:

By transforming the tedious task of initial dataset creation into a creative and iterative process, Lake Merritt empowers teams to build better, safer, and more aligned AI systems faster than ever before.

Getting Started: Two Paths to Evaluation

Lake Merritt offers two workflows for running evaluations in the UI, catering to different needs.

Path 1: The Manual Workflow (For Quick Tests)

This is the fastest way to get started. If you have a simple CSV file, you can upload it and configure scorers directly in the user interface. It’s perfect for quick checks and initial exploration.

  1. Prepare Your Data: Create a CSV file with the required columns for your chosen mode (e.g., input, output, expected_output for Mode A); a scripted example follows this list.
  2. Navigate to “Evaluation Setup”: Select the “Configure Manually” option.
  3. Upload & Select: Upload your CSV and choose from a list of built-in scorers (Exact Match, Fuzzy Match, LLM Judge).
  4. Run: Click “Start Evaluation” to see your results.
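
If your existing results live in application logs rather than a spreadsheet, a small script can assemble the Mode A file for you. The sketch below uses only the standard library; the column names match those in step 1, and the records themselves are made up for illustration.

import csv

# Hypothetical records pulled from your own logs or test harness.
records = [
    {"input": "What is 2 + 2?", "output": "4", "expected_output": "4"},
    {"input": "Capital of France?", "output": "Paris, France", "expected_output": "Paris"},
]

with open("mode_a_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "output", "expected_output"])
    writer.writeheader()
    writer.writerows(records)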

Path 2: The Eval Pack Workflow (For Power and Repeatability)

This is the new, powerful way to use Lake Merritt. An Eval Pack is a YAML file where you declaratively define the entire evaluation process. This is the recommended path for any serious or recurring evaluation task.

  1. Create an Eval Pack: Define your data source, scorers, and configurations in a .yaml file (a quick sanity-check sketch follows this list).
  2. Navigate to “Evaluation Setup”: Select the “Upload Eval Pack” option.
  3. Upload Pack & Data: Upload your Eval Pack, then upload the corresponding data file (e.g., a CSV, a JSON trace file, etc.).
  4. Run: Click “Start Pack Evaluation” to execute your custom workflow.
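
Before uploading, it can be handy to confirm the file at least parses as YAML and carries a version field (see the versioning note below). This is only a minimal sanity check, not a validation of the full Eval Pack schema; the packs in examples/eval_packs/ are the reference for that. The sketch assumes PyYAML is installed and uses a placeholder filename.

import yaml  # PyYAML: pip install pyyaml

with open("my_eval_pack.yaml", "r", encoding="utf-8") as f:
    pack = yaml.safe_load(f)

# Minimal checks only; they do not validate the full Eval Pack schema.
assert isinstance(pack, dict), "An Eval Pack should be a YAML mapping"
assert "version" in pack, "Add a version field so results stay reproducible"
print(f"Pack parsed OK, version {pack['version']}")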

Core Features

Using Eval Packs

The Eval Pack system is the most powerful feature of Lake Merritt. It turns your evaluation logic into a shareable, version-controllable artifact that is essential for rigorous testing.

Where to Find Existing Eval Packs

A great way to start is by exploring the examples provided in the repository. The examples/eval_packs/ directory contains ready-to-use packs for a variety of tasks, including:

Use these as templates for your own custom evaluations.

The Power of Eval Packs: Routine & Ad-Hoc Evals

Eval Packs are designed for both systematic and exploratory analysis.

Creative Configurations & Versioning

Eval Packs enable sophisticated evaluation designs:

Versioning is crucial. The version field in your Eval Pack (e.g., version: "1.1") is more than just metadata. When you change a prompt, adjust a threshold, or add a scorer, you should increment the version. This ensures that you can always reproduce past results and clearly track how your evaluation standards evolve over time.

What’s New: Benchmark-in-a-Box with BBQ and FDL

Lake Merritt’s extensible architecture was designed to handle diverse and complex evaluation needs. To prove and showcase this power, we have added comprehensive, end-to-end support for two sophisticated benchmarks: the official Bias Benchmark for Question Answering (BBQ) and an initial, experimental Fiduciary Duty of Loyalty (FDL) evaluation inspired by BBQ’s methodology.

This new capability demonstrates how Lake Merritt can be used as a “Benchmark-in-a-Box,” where a single Eval Pack can codify the entire workflow—from ingesting complex, multi-file datasets to running generation, scoring, and calculating specialized, benchmark-specific aggregate metrics.

BBQ Quick Primer: The BBQ benchmark organizes questions into pairs of ambiguous vs. disambiguated contexts. It tests whether a model relies on harmful social stereotypes when a question is under-informative, and whether it can overcome those stereotypes when more context is provided. Results are aggregated into a final bias score that measures the model’s tendency to select stereotype-consistent answers. Our implementation mirrors these core concepts.
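
As a rough illustration of the kind of aggregate the primer describes (not the exact formula used by Lake Merritt's aggregator, which is defined in the codebase), a stereotype-alignment score can be computed as the share of non-"unknown" answers that pick the stereotype-consistent option, rescaled to the range -1 to 1. The field names below are hypothetical.

# Illustrative only; the pack's aggregator is the source of truth for the
# metrics Lake Merritt actually reports.
def stereotype_alignment(rows):
    """rows: dicts with 'prediction', 'biased_answer', and 'unknown_answer' keys."""
    answered = [r for r in rows if r["prediction"] != r["unknown_answer"]]
    if not answered:
        return 0.0
    biased = sum(1 for r in answered if r["prediction"] == r["biased_answer"])
    # -1 = never stereotype-consistent, +1 = always stereotype-consistent
    return 2 * (biased / len(answered)) - 1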

1. Official BBQ Fairness Benchmark Replication

Integrating the BBQ benchmark was a perfect test of Lake Merritt’s modular design, leveraging a custom ingester, a deterministic scorer, and a new aggregator.

How it Works in Lake Merritt

The entire BBQ workflow is defined in test_packs/bbq_eval_pack.yaml and uses the following components:

Step-by-Step: Running the BBQ Evaluation

  1. Download the BBQ Dataset: First, clone or download the official BBQ repository to your machine. The ingester script expects the original directory structure (data/ and supplemental/ subdirectories).

  2. Create the Path File: The PythonIngester needs to know where to find the dataset. Create a simple text file (e.g., bbq_path.txt) that contains a single line: the absolute path to the root of the BBQ repository you just downloaded.
    # Example command on macOS or Linux
    echo "/Users/yourname/path/to/your/BBQ_full" > bbq_path.txt
    
  3. Launch Lake Merritt and Configure: Run streamlit run streamlit_app.py and ensure your API keys are set up in “System Configuration.”

  4. Run the Eval Pack:
    • Navigate to Evaluation Setup and select “Upload Eval Pack”.
    • Upload test_packs/bbq_eval_pack.yaml.
    • When prompted for the data file, upload your bbq_path.txt file.
    • Provide a simple context for generation, such as: Respond with only the text of the single best option.
    • Click “Start Pack Run”.
  5. Analyze Results: Once the run completes, the most important output is in the Download Center. The Summary Report will contain a “BBQ Bias Score Scorecard” with the final, aggregated bias metrics.

2. Experimental Fiduciary Duty of Loyalty (FDL) Evals

We’ve also adapted the BBQ methodology for a specialized legal and ethical domain: Fiduciary Duty of Loyalty. This eval tests if an AI, when faced with a conflict of interest, correctly prioritizes the user’s interests.

This workflow is a powerful example of the full “idea to insight” lifecycle in Lake Merritt, and it relies on a human-in-the-loop step to produce a high-quality dataset.

Step-by-Step: Running the FDL Evaluation

  1. Generate the Baseline Dataset: First, run the provided script to generate a synthetic “gold standard” dataset.
    python scripts/generate_fdl_dataset.py
    

    This creates the data/duty_of_loyalty_benchmark.csv file.

  2. (Crucial) Human Review Step: A domain expert (e.g., a lawyer) must review and, if necessary, correct the synthetically generated expected_output values in the CSV. An LLM’s initial attempt is a draft, not ground truth.

  3. Run the Eval Pack:
    • In the UI, upload test_packs/fdl_eval_pack.yaml.
    • When prompted, upload your human-verified data/duty_of_loyalty_benchmark.csv.
    • Provide context for the generation step (e.g., You are a helpful AI assistant with a strict duty of loyalty to the user.) and run the evaluation.
  4. Analyze FDL-Specific Metrics: Your final summary report will include an “FDL Metrics Scorecard” with domain-specific metrics like Disambiguated Accuracy, Appropriate Clarification Rate, and Disclosure Success Rate.

Why Lake Merritt was a Good Fit for This

Advanced Use Case: Evaluating OpenTelemetry Traces

Evaluating the complex behavior of AI agents is a primary use case for Lake Merritt. Instead of simple input/output pairs, you can evaluate an agent’s entire decision-making process captured in an OpenTelemetry trace.

How it Works with Eval Packs

  1. Capture Traces: Instrument your AI agent to produce standard OpenTelemetry traces, preferably using the OpenInference semantic conventions (a minimal instrumentation sketch follows this list).
  2. Create an Eval Pack: Write a pack that specifies an OTel-compatible ingester (generic_otel or python).
  3. Define a Pipeline: Add stages that use trace-aware scorers like ToolUsageScorer or LLMJudge to evaluate aspects like:
    • Did the agent use the correct tool?
    • Was the agent’s final response consistent with its retrieved context?
    • Did the agent follow its instructions?
  4. Run in the UI: Upload your pack and your trace file (e.g., traces.json) to run the evaluation.
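
For step 1, a minimal instrumentation sketch with the OpenTelemetry Python SDK is shown below. It prints spans to the console rather than producing a file the ingester consumes, and the input.value / output.value attribute keys are an assumption based on OpenInference conventions; adapt both the exporter and the span structure to your actual agent.

# Requires: pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Send spans to stdout for inspection; swap in a real exporter for production traces.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("my-agent")  # hypothetical service name

with tracer.start_as_current_span("agent.step") as span:
    # Attribute keys follow OpenInference-style conventions (assumption).
    span.set_attribute("input.value", "What is the weather in Oakland?")
    span.set_attribute("output.value", "It is sunny and 72 degrees.")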

Key Gotcha: Jinja2 Prompts for LLM Judge

A common source of errors, especially in the manual workflow, is an incorrectly formatted LLM-as-a-Judge prompt. Judge prompts are Jinja2 templates, so placeholders must use Jinja2's double-brace syntax (e.g., {{ input }}) and reference fields that actually exist in your data, such as input, output, and expected_output.
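
If you want to check a template outside the UI, a small sketch with the jinja2 package is below. The field names are the standard three-column ones from Mode A and are assumptions here; they may differ for trace-based data.

from jinja2 import Template  # pip install jinja2

# Hypothetical judge prompt; note the double curly braces around each field.
judge_prompt = Template(
    "You are a strict grader.\n"
    "Question: {{ input }}\n"
    "Model answer: {{ output }}\n"
    "Reference answer: {{ expected_output }}\n"
    "Reply with a score from 0 to 1 and a one-sentence rationale."
)

print(judge_prompt.render(input="2 + 2?", output="4", expected_output="4"))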

Installation

  1. Clone the repository:
    git clone https://github.com/PrototypeJam/lake_merritt.git
    cd lake_merritt
    
  2. Create a virtual environment:
    # Using uv (recommended)
    uv venv
    source .venv/bin/activate
    
  3. Install dependencies:
    # This installs the app, Streamlit, and all core dependencies
    # including python-dotenv for local .env file handling
    uv pip install -e ".[test,dev]"
    
  4. Copy .env.template to .env and add your API keys:
    cp .env.template .env
    

Usage

Run the Streamlit app:

streamlit run streamlit_app.py

Navigate to the “Evaluation Setup” page to begin.

Live Web-Based Version

A free deployed version of Lake Merritt is currently available for low-volume live testing on Streamlit Community Cloud (subject to Streamlit’s terms and restrictions), at: https://lakemerritt.streamlit.app

Contributing

This project emphasizes deep modularity. When adding new features:

  1. Scorers go in core/scoring/ and inherit from BaseScorer (an illustrative sketch follows this list).
  2. Ingesters go in core/ingestion/ and inherit from BaseIngester.
  3. Register new components in core/registry.py to make them available to the Eval Pack engine.
  4. All data structures should be defined as Pydantic models in core/data_models.py.
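
As a purely illustrative sketch of points 1 and 3, a new scorer might look roughly like the following. The import path, the score method signature, the return shape, and the registration step are assumptions made for illustration only; mirror an existing scorer in core/scoring/ and the actual API in core/registry.py rather than copying this verbatim.

# Illustrative only: the import path, method signature, and return shape are
# assumptions. Check core/scoring/ and core/registry.py for the real interfaces.
from core.scoring import BaseScorer  # actual module path may differ


class ContainsExpectedScorer(BaseScorer):
    """Toy scorer: passes when the expected output appears inside the output."""

    def score(self, item):  # the real base class may expect a different signature
        passed = item.expected_output.lower() in item.output.lower()
        # Point 4 above says data structures should be Pydantic models from
        # core/data_models.py; a plain dict is used here only to keep the sketch short.
        return {"score": 1.0 if passed else 0.0, "passed": passed}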

License

MIT License - see LICENSE file for details.

Working With Streamlit Community Cloud

This section documents critical learnings from deploying Lake Merritt to Streamlit Community Cloud, including common issues, their root causes, and proven solutions.

Key Discovery: Streamlit Uses Poetry, Not Setuptools

Issue: “Oh no. Error running app” with logs showing:

Error: The current project could not be installed: No file/folder found for package ai-eval-workbench

Root Cause: Streamlit Community Cloud uses Poetry for dependency management, not setuptools. Even though pyproject.toml specified build-backend = "setuptools.build_meta", Streamlit ignores this and uses Poetry.

Solution: Add Poetry configuration to disable package mode:

[tool.poetry]
package-mode = false

This tells Poetry to only install dependencies, not attempt to install the project as a package.

Python Version Consistency

Issue: Deployment failures or unexpected behavior.

Root Cause: Mismatch between Python version configured in Streamlit Cloud settings and runtime.txt file.

Solution:

  1. Check Python version in Streamlit Cloud dashboard (e.g., 3.13)
  2. Ensure runtime.txt matches exactly:
    python-3.13
    

Package Discovery Issues

Issue: Import errors for new subdirectories (e.g., ModuleNotFoundError: No module named 'core.otel')

Root Cause: When using setuptools locally, listing parent packages (e.g., “core”) automatically includes subdirectories. However, this may not work reliably in all deployment environments.

Initial Wrong Approach:

[tool.setuptools]
packages = ["app", "core", "core.otel", "core.scoring", "core.scoring.otel", "services", "utils"]

This caused conflicts because subdirectories were listed explicitly when the parent was already included.

Correct Approach:

[tool.setuptools]
packages = ["app", "core", "services", "utils"]

List only top-level packages; subdirectories are automatically included.
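
One way to catch this class of problem before deploying is a quick import smoke test run in the same environment you install into; the module list below simply mirrors the packages discussed above.

# Quick smoke test for package discovery; raises ModuleNotFoundError on a miss.
import importlib

for mod in ["core", "core.otel", "core.scoring", "services", "utils"]:
    importlib.import_module(mod)
    print(f"ok: {mod}")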

Deployment Cache Issues

Issue: Changes pushed to GitHub but deployment still shows old errors.

Symptoms:

Root Cause: Streamlit Community Cloud aggressively caches deployments. Once a deployment fails, it may get stuck in a bad state.

Solutions (in order of effectiveness):

  1. Nuclear Option - Delete and Redeploy (Most Reliable):
    • Go to share.streamlit.io
    • Find your app
    • Click three dots menu → Delete
    • Create new app with same settings
  2. Force Fresh Pull:
    • Deploy from a different branch temporarily
    • Then switch back to main
  3. Reboot App:
    • From app page, click “Manage app”
    • Click three dots → “Reboot app”
    • Note: This may not clear all cache

Debugging Deployment Failures

Strategy: Add temporary debug logging to identify exact failure point.

Example:

# In streamlit_app.py
import streamlit as st  # needed for st.error below

try:
    from core.logging_config import setup_logging
    print("✓ Core imports successful")  # shows up in the deployment logs
except ImportError as e:
    st.error(f"Import error: {e}")  # surface the failure in the UI as well
    raise

This makes import errors visible in the deployment logs instead of just showing “Oh no.”

Project Structure Requirements

Critical Files:

Do NOT Use:

Common Error Messages and Solutions

  1. “installer returned a non-zero exit code”
    • Check pyproject.toml syntax
    • Verify all dependencies can be installed
    • Look for package name conflicts
  2. “Oh no. Error running app”
    • Check deployment logs for specific errors
    • Verify Python version consistency
    • Ensure Poetry configuration is correct
  3. “This app has gone over its resource limits”
    • Implement proper caching with TTL (see the sketch after this list)
    • Avoid loading large models repeatedly
    • Monitor memory usage in compute-heavy operations
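
For the caching point above, a minimal sketch with Streamlit's built-in cache decorator looks like this. The one-hour TTL and the load_dataset function are illustrative choices, not project defaults; cache whichever expensive reads or model calls your app repeats.

import pandas as pd
import streamlit as st


@st.cache_data(ttl=3600)  # re-run at most once per hour; tune for your workload
def load_dataset(path: str) -> pd.DataFrame:
    # Hypothetical loader; replace with whatever expensive read or API call you cache.
    return pd.read_csv(path)


df = load_dataset("mode_a_dataset.csv")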

Best Practices for Streamlit Deployment

  1. Always test locally first with the exact Python version
  2. Use pyproject.toml exclusively - don’t mix with requirements.txt
  3. Include Poetry configuration even if using setuptools locally
  4. Monitor deployment logs immediately after pushing changes
  5. When stuck, delete and redeploy rather than fighting cache issues
  6. Document non-obvious dependencies in comments

Deployment Checklist

Before deploying to Streamlit Community Cloud:

Getting Help

When deployment fails:

  1. Check logs at share.streamlit.io → Your App → Logs
  2. Add debug logging to identify exact failure point
  3. Search Streamlit forums for similar errors
  4. Consider the nuclear option: delete and redeploy

Remember: Streamlit Community Cloud’s deployment environment differs from local development. What works locally may need adjustments for cloud deployment.


Planned

Highlights:

Deeper Dives into Roadmap Items: