lake_merritt

Lake Merritt: AI Evaluation Workbench.

A general-purpose, modular, and extensible platform for custom evaluations of AI models and applications.


Overview

Lake Merritt provides a standardized yet flexible environment for evaluating AI systems. With its Eval Pack architecture, you can run everything from quick, simple comparisons using a spreadsheet to complex, multi-stage evaluation pipelines defined in a single configuration file. This is an Alpha Version.

The platform is designed for:

🚀 Quick Start: Try It Now!

The fastest way to understand Lake Merritt is to run an evaluation. Our hands-on guide will walk you through everything from a simple 60-second test to evaluating a complex AI agent, with no coding required for most examples.

➡️ Click here for the Hands-On Quick Start Guide

The guide covers six core workflows, designed to show you the full range and power of the platform:


Understanding the Two Evaluation Modes

Before starting, it’s important to understand the two fundamental workflows Lake Merritt supports. Your choice of mode determines what kind of data you need to provide and which capabilities are available.

Mode A: Evaluate Existing Outputs

This is the most common use case. You provide a dataset that already contains the outputs you want to evaluate, typically alongside the inputs that produced them and the expected outputs to compare against.

Mode B: Generate & Evaluate

This mode is for when you have test inputs but need to generate the outputs to be evaluated.

Mode B “Hold-My-Beer” Workflow: From Idea to Insight in Minutes

Lake Merritt introduces a uniquely powerful workflow that dramatically accelerates the evaluation process, making it accessible to everyone on your team, regardless of their technical background. This approach allows you to bootstrap an entire evaluation lifecycle starting with nothing more than a list of inputs and a plain-text description of your goals.

This isn’t for generating production-grade, statistically perfect evaluation data. Instead, it’s an incredibly handy feature for a quick start and rapid iteration, allowing you to see how an entire evaluation would run before you invest heavily in manual data annotation.

Here’s how you can go from an idea to a full evaluation run in five steps:

Step 1: Start with Only Inputs and an Idea

Begin with the bare minimum: a simple CSV file containing only an input column. Your “idea” is a natural language explanation of what you’re trying to achieve—your success criteria, the persona you want the AI to adopt, and the business, legal, or risk rules it must follow. You can write this directly in the UI or in a markdown file.
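
If you would rather script this step than hand-edit a spreadsheet, here is a minimal sketch using only the Python standard library. The single input column name follows the convention described above; the example prompts are purely illustrative.

import csv

# Purely illustrative prompts; replace with the inputs you actually care about.
inputs = [
    "Summarize our refund policy for a frustrated customer.",
    "Draft a one-sentence disclaimer for a financial newsletter.",
]

with open("inputs_only.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["input"])  # the single column this step needs
    for prompt in inputs:
        writer.writerow([prompt])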

Step 2: Generate Your “Gold Standard” (expected_output)

Using Mode B: Generate New Data, you’ll run the “Generate Expected Outputs” sub-mode. The system will use your context to guide a powerful LLM, which will read each of your inputs and generate a high-quality, correctly formatted expected_output for every row. At the end of this step, you can download a brand new dataset, ready for evaluation.

Step 3: Generate the Model’s Response (output)

With your new dataset in hand, you immediately run a second Mode B pass. This time, you’ll use the “Generate Outputs” sub-mode. You provide context for the model you want to test (the “Actor LLM”), and the system generates its output for each input, creating a complete three-column CSV.

Step 4: Run a Full, End-to-End Evaluation

Now, with a complete dataset of synthetically generated data, you can immediately run a Mode A: Evaluate Existing Outputs workflow. You can select scorers like LLM-as-a-Judge to see how the generated outputs stack up against the generated expected_outputs, getting a full report with scores and analysis.

Step 5: Iterate with Human Insight

This is the most crucial step. Having seen a full evaluation lifecycle, you and your team can now intelligently refine the process. You can go back and manually revise the generated expected_outputs to better reflect real-world expectations, edit the inputs to cover more edge cases, or adjust your initial context to improve the success criteria.

Why This is a Game-Changer for AI Evaluation

This workflow is a significant step forward for accessible, open-source AI evaluation tools, offering several unique advantages:

By transforming the tedious task of initial dataset creation into a creative and iterative process, Lake Merritt empowers teams to build better, safer, and more aligned AI systems faster than ever before.

Getting Started: Two Paths to Evaluation

Lake Merritt offers two workflows for running evaluations in the UI, catering to different needs.

Path 1: The Manual Workflow (For Quick Tests)

This is the fastest way to get started. If you have a simple CSV file, you can upload it and configure scorers directly in the user interface. It’s perfect for quick checks and initial exploration.

  1. Prepare Your Data: Create a CSV file with the required columns for your chosen mode (e.g., input, output, expected_output for Mode A); a scripted example follows this list.
  2. Navigate to “Evaluation Setup”: Select the “Configure Manually” option.
  3. Upload & Select: Upload your CSV and choose from a list of built-in scorers (Exact Match, Fuzzy Match, LLM Judge).
  4. Run: Click “Start Evaluation” to see your results.
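
If your existing results live in application logs rather than a spreadsheet, a small script can assemble the Mode A file for you. The sketch below uses only the standard library; the column names match those in step 1, and the records themselves are made up for illustration.

import csv

# Hypothetical records pulled from your own logs or test harness.
records = [
    {"input": "What is 2 + 2?", "output": "4", "expected_output": "4"},
    {"input": "Capital of France?", "output": "Paris, France", "expected_output": "Paris"},
]

with open("mode_a_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "output", "expected_output"])
    writer.writeheader()
    writer.writerows(records)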

Path 2: The Eval Pack Workflow (For Power and Repeatability)

This is the new, powerful way to use Lake Merritt. An Eval Pack is a YAML file where you declaratively define the entire evaluation process. This is the recommended path for any serious or recurring evaluation task.

  1. Create an Eval Pack: Define your data source, scorers, and configurations in a .yaml file (a quick sanity-check sketch follows this list).
  2. Navigate to “Evaluation Setup”: Select the “Upload Eval Pack” option.
  3. Upload Pack & Data: Upload your Eval Pack, then upload the corresponding data file (e.g., a CSV, a JSON trace file, etc.).
  4. Run: Click “Start Pack Evaluation” to execute your custom workflow.
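
Before uploading, it can be handy to confirm the file at least parses as YAML and carries a version field (see the versioning note below). This is only a minimal sanity check, not a validation of the full Eval Pack schema; the packs in examples/eval_packs/ are the reference for that. The sketch assumes PyYAML is installed and uses a placeholder filename.

import yaml  # PyYAML: pip install pyyaml

with open("my_eval_pack.yaml", "r", encoding="utf-8") as f:
    pack = yaml.safe_load(f)

# Minimal checks only; they do not validate the full Eval Pack schema.
assert isinstance(pack, dict), "An Eval Pack should be a YAML mapping"
assert "version" in pack, "Add a version field so results stay reproducible"
print(f"Pack parsed OK, version {pack['version']}")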

Core Features

Using Eval Packs

The Eval Pack system is the most powerful feature of Lake Merritt. It turns your evaluation logic into a shareable, version-controllable artifact that is essential for rigorous testing.

Where to Find Existing Eval Packs

A great way to start is by exploring the examples provided in the repository. The examples/eval_packs/ directory contains ready-to-use packs for a variety of tasks, including:

Use these as templates for your own custom evaluations.

The Power of Eval Packs: Routine & Ad-Hoc Evals

Eval Packs are designed for both systematic and exploratory analysis.

Creative Configurations & Versioning

Eval Packs enable sophisticated evaluation designs:

Versioning is crucial. The version field in your Eval Pack (e.g., version: "1.1") is more than just metadata. When you change a prompt, adjust a threshold, or add a scorer, you should increment the version. This ensures that you can always reproduce past results and clearly track how your evaluation standards evolve over time.

What’s New: Benchmark-in-a-Box with BBQ and FDL

Lake Merritt’s extensible architecture was designed to handle diverse and complex evaluation needs. To prove and showcase this power, we have added comprehensive, end-to-end support for two sophisticated benchmarks: the official Bias Benchmark for Question Answering (BBQ) and an initial, experimental Fiduciary Duty of Loyalty (FDL) evaluation inspired by BBQ’s methodology.

This new capability demonstrates how Lake Merritt can be used as a “Benchmark-in-a-Box,” where a single Eval Pack can codify the entire workflow—from ingesting complex, multi-file datasets to running generation, scoring, and calculating specialized, benchmark-specific aggregate metrics.

BBQ Quick Primer: The BBQ benchmark organizes questions into pairs of ambiguous vs. disambiguated contexts. It tests whether a model relies on harmful social stereotypes when a question is under-informative, and whether it can overcome those stereotypes when more context is provided. Results are aggregated into a final bias score that measures the model’s tendency to select stereotype-consistent answers. Our implementation mirrors these core concepts.
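
As a rough illustration of the kind of aggregate the primer describes (not the exact formula used by Lake Merritt's aggregator, which is defined in the codebase), a stereotype-alignment score can be computed as the share of non-"unknown" answers that pick the stereotype-consistent option, rescaled to the range -1 to 1. The field names below are hypothetical.

# Illustrative only; the pack's aggregator is the source of truth for the
# metrics Lake Merritt actually reports.
def stereotype_alignment(rows):
    """rows: dicts with 'prediction', 'biased_answer', and 'unknown_answer' keys."""
    answered = [r for r in rows if r["prediction"] != r["unknown_answer"]]
    if not answered:
        return 0.0
    biased = sum(1 for r in answered if r["prediction"] == r["biased_answer"])
    # -1 = never stereotype-consistent, +1 = always stereotype-consistent
    return 2 * (biased / len(answered)) - 1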

1. Official BBQ Fairness Benchmark Replication

Integrating the BBQ benchmark was a perfect test of Lake Merritt’s modular design, leveraging a custom ingester, a deterministic scorer, and a new aggregator.

How it Works in Lake Merritt

The entire BBQ workflow is defined in test_packs/bbq_eval_pack.yaml and uses the following components:

Step-by-Step: Running the BBQ Evaluation

  1. Download the BBQ Dataset: First, clone or download the official BBQ repository to your machine. The ingester script expects the original directory structure (data/ and supplemental/ subdirectories).

  2. Create the Path File: The PythonIngester needs to know where to find the dataset. Create a simple text file (e.g., bbq_path.txt) that contains a single line: the absolute path to the root of the BBQ repository you just downloaded.
    # Example command on macOS or Linux
    echo "/Users/yourname/path/to/your/BBQ_full" > bbq_path.txt
    
  3. Launch Lake Merritt and Configure: Run streamlit run streamlit_app.py and ensure your API keys are set up in “System Configuration.”

  4. Run the Eval Pack:
    • Navigate to Evaluation Setup and select “Upload Eval Pack”.
    • Upload test_packs/bbq_eval_pack.yaml.
    • When prompted for the data file, upload your bbq_path.txt file.
    • Provide a simple context for generation, such as: Respond with only the text of the single best option.
    • Click “Start Pack Run”.
  5. Analyze Results: Once the run completes, the most important output is in the Download Center. The Summary Report will contain a “BBQ Bias Score Scorecard” with the final, aggregated bias metrics.

2. Experimental Fiduciary Duty of Loyalty (FDL) Evals

We’ve also adapted the BBQ methodology for a specialized legal and ethical domain: Fiduciary Duty of Loyalty. This eval tests if an AI, when faced with a conflict of interest, correctly prioritizes the user’s interests.

This workflow is a powerful example of the full “idea to insight” lifecycle in Lake Merritt, and it relies on a human-in-the-loop step to produce a high-quality dataset.

Step-by-Step: Running the FDL Evaluation

  1. Generate the Baseline Dataset: First, run the provided script to generate a synthetic “gold standard” dataset.
    python scripts/generate_fdl_dataset.py
    

    This creates the data/duty_of_loyalty_benchmark.csv file.

  2. (Crucial) Human Review Step: A domain expert (e.g., a lawyer) must review and, if necessary, correct the synthetically generated expected_output values in the CSV. An LLM’s initial attempt is a draft, not ground truth.

  3. Run the Eval Pack:
    • In the UI, upload test_packs/fdl_eval_pack.yaml.
    • When prompted, upload your human-verified data/duty_of_loyalty_benchmark.csv.
    • Provide context for the generation step (e.g., You are a helpful AI assistant with a strict duty of loyalty to the user.) and run the evaluation.
  4. Analyze FDL-Specific Metrics: Your final summary report will include an “FDL Metrics Scorecard” with domain-specific metrics like Disambiguated Accuracy, Appropriate Clarification Rate, and Disclosure Success Rate.

Why Lake Merritt was a Good Fit for This

Advanced Use Case: Evaluating OpenTelemetry Traces

Evaluating the complex behavior of AI agents is a primary use case for Lake Merritt. Instead of simple input/output pairs, you can evaluate an agent’s entire decision-making process captured in an OpenTelemetry trace.

How it Works with Eval Packs

  1. Capture Traces: Instrument your AI agent to produce standard OpenTelemetry traces, preferably using the OpenInference semantic conventions (a minimal instrumentation sketch follows this list).
  2. Create an Eval Pack: Write a pack that specifies an OTel-compatible ingester (generic_otel or python).
  3. Define a Pipeline: Add stages that use trace-aware scorers like ToolUsageScorer or LLMJudge to evaluate aspects like:
    • Did the agent use the correct tool?
    • Was the agent’s final response consistent with its retrieved context?
    • Did the agent follow its instructions?
  4. Run in the UI: Upload your pack and your trace file (e.g., traces.json) to run the evaluation.
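
For step 1, a minimal instrumentation sketch with the OpenTelemetry Python SDK is shown below. It prints spans to the console rather than producing a file the ingester consumes, and the input.value / output.value attribute keys are an assumption based on OpenInference conventions; adapt both the exporter and the span structure to your actual agent.

# Requires: pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Send spans to stdout for inspection; swap in a real exporter for production traces.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("my-agent")  # hypothetical service name

with tracer.start_as_current_span("agent.step") as span:
    # Attribute keys follow OpenInference-style conventions (assumption).
    span.set_attribute("input.value", "What is the weather in Oakland?")
    span.set_attribute("output.value", "It is sunny and 72 degrees.")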

Key Gotcha: Jinja2 Prompts for LLM Judge

A common source of errors, especially in the manual workflow, is an incorrectly formatted LLM-as-a-Judge prompt. Judge prompts are Jinja2 templates, so placeholders must use Jinja2's double-brace syntax (e.g., {{ input }}) and reference fields that actually exist in your data, such as input, output, and expected_output.
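
If you want to check a template outside the UI, a small sketch with the jinja2 package is below. The field names are the standard three-column ones from Mode A and are assumptions here; they may differ for trace-based data.

from jinja2 import Template  # pip install jinja2

# Hypothetical judge prompt; note the double curly braces around each field.
judge_prompt = Template(
    "You are a strict grader.\n"
    "Question: {{ input }}\n"
    "Model answer: {{ output }}\n"
    "Reference answer: {{ expected_output }}\n"
    "Reply with a score from 0 to 1 and a one-sentence rationale."
)

print(judge_prompt.render(input="2 + 2?", output="4", expected_output="4"))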

Installation

  1. Clone the repository:
    git clone https://github.com/PrototypeJam/lake_merritt.git
    cd lake_merritt
    
  2. Create a virtual environment:
    # Using uv (recommended)
    uv venv
    source .venv/bin/activate
    
  3. Install dependencies:
    # This installs the app, Streamlit, and all core dependencies
    # including python-dotenv for local .env file handling
    uv pip install -e ".[test,dev]"
    
  4. Copy .env.template to .env and add your API keys:
    cp .env.template .env
    

Usage

Run the Streamlit app:

streamlit run streamlit_app.py

Navigate to the “Evaluation Setup” page to begin.

Live Web-Based Version

A free deployed version of Lake Merritt is currently available for low-volume live testing on Streamlit Community Cloud (subject to Streamlit’s terms and restrictions), at: https://lakemerritt.streamlit.app

Contributing

This project emphasizes deep modularity. When adding new features:

  1. Scorers go in core/scoring/ and inherit from BaseScorer (an illustrative sketch follows this list).
  2. Ingesters go in core/ingestion/ and inherit from BaseIngester.
  3. Register new components in core/registry.py to make them available to the Eval Pack engine.
  4. All data structures should be defined as Pydantic models in core/data_models.py.
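
As a purely illustrative sketch of points 1 and 3, a new scorer might look roughly like the following. The import path, the score method signature, the return shape, and the registration step are assumptions made for illustration only; mirror an existing scorer in core/scoring/ and the actual API in core/registry.py rather than copying this verbatim.

# Illustrative only: the import path, method signature, and return shape are
# assumptions. Check core/scoring/ and core/registry.py for the real interfaces.
from core.scoring import BaseScorer  # actual module path may differ


class ContainsExpectedScorer(BaseScorer):
    """Toy scorer: passes when the expected output appears inside the output."""

    def score(self, item):  # the real base class may expect a different signature
        passed = item.expected_output.lower() in item.output.lower()
        # Point 4 above says data structures should be Pydantic models from
        # core/data_models.py; a plain dict is used here only to keep the sketch short.
        return {"score": 1.0 if passed else 0.0, "passed": passed}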

License

MIT License - see LICENSE file for details.

Working With Streamlit Community Cloud

This section documents critical learnings from deploying Lake Merritt to Streamlit Community Cloud, including common issues, their root causes, and proven solutions.

Key Discovery: Streamlit Uses Poetry, Not Setuptools

Issue: “Oh no. Error running app” with logs showing:

Error: The current project could not be installed: No file/folder found for package ai-eval-workbench

Root Cause: Streamlit Community Cloud uses Poetry for dependency management, not setuptools. Even though pyproject.toml specified build-backend = "setuptools.build_meta", Streamlit ignores this and uses Poetry.

Solution: Add Poetry configuration to disable package mode:

[tool.poetry]
package-mode = false

This tells Poetry to only install dependencies, not attempt to install the project as a package.

Python Version Consistency

Issue: Deployment failures or unexpected behavior.

Root Cause: Mismatch between Python version configured in Streamlit Cloud settings and runtime.txt file.

Solution:

  1. Check Python version in Streamlit Cloud dashboard (e.g., 3.13)
  2. Ensure runtime.txt matches exactly:
    python-3.13
    

Package Discovery Issues

Issue: Import errors for new subdirectories (e.g., ModuleNotFoundError: No module named 'core.otel')

Root Cause: When using setuptools locally, listing parent packages (e.g., “core”) automatically includes subdirectories. However, this may not work reliably in all deployment environments.

Initial Wrong Approach:

[tool.setuptools]
packages = ["app", "core", "core.otel", "core.scoring", "core.scoring.otel", "services", "utils"]

This caused conflicts because subdirectories were listed explicitly when the parent was already included.

Correct Approach:

[tool.setuptools]
packages = ["app", "core", "services", "utils"]

List only top-level packages; subdirectories are automatically included.
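
One way to catch this class of problem before deploying is a quick import smoke test run in the same environment you install into; the module list below simply mirrors the packages discussed above.

# Quick smoke test for package discovery; raises ModuleNotFoundError on a miss.
import importlib

for mod in ["core", "core.otel", "core.scoring", "services", "utils"]:
    importlib.import_module(mod)
    print(f"ok: {mod}")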

Deployment Cache Issues

Issue: Changes pushed to GitHub but deployment still shows old errors.

Symptoms:

Root Cause: Streamlit Community Cloud aggressively caches deployments. Once a deployment fails, it may get stuck in a bad state.

Solutions (in order of effectiveness):

  1. Nuclear Option - Delete and Redeploy (Most Reliable):
    • Go to share.streamlit.io
    • Find your app
    • Click three dots menu → Delete
    • Create new app with same settings
  2. Force Fresh Pull:
    • Deploy from a different branch temporarily
    • Then switch back to main
  3. Reboot App:
    • From app page, click “Manage app”
    • Click three dots → “Reboot app”
    • Note: This may not clear all cache

Debugging Deployment Failures

Strategy: Add temporary debug logging to identify exact failure point.

Example:

# In streamlit_app.py
import streamlit as st  # needed for st.error below

try:
    from core.logging_config import setup_logging
    print("✓ Core imports successful")  # shows up in the deployment logs
except ImportError as e:
    st.error(f"Import error: {e}")  # surface the failure in the UI as well
    raise

This makes import errors visible in the deployment logs instead of just showing “Oh no.”

Project Structure Requirements

Critical Files:

Do NOT Use:

Common Error Messages and Solutions

  1. “installer returned a non-zero exit code”
    • Check pyproject.toml syntax
    • Verify all dependencies can be installed
    • Look for package name conflicts
  2. “Oh no. Error running app”
    • Check deployment logs for specific errors
    • Verify Python version consistency
    • Ensure Poetry configuration is correct
  3. “This app has gone over its resource limits”
    • Implement proper caching with TTL (see the sketch after this list)
    • Avoid loading large models repeatedly
    • Monitor memory usage in compute-heavy operations
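
For the caching point above, a minimal sketch with Streamlit's built-in cache decorator looks like this. The one-hour TTL and the load_dataset function are illustrative choices, not project defaults; cache whichever expensive reads or model calls your app repeats.

import pandas as pd
import streamlit as st


@st.cache_data(ttl=3600)  # re-run at most once per hour; tune for your workload
def load_dataset(path: str) -> pd.DataFrame:
    # Hypothetical loader; replace with whatever expensive read or API call you cache.
    return pd.read_csv(path)


df = load_dataset("mode_a_dataset.csv")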

Best Practices for Streamlit Deployment

  1. Always test locally first with the exact Python version
  2. Use pyproject.toml exclusively - don’t mix with requirements.txt
  3. Include Poetry configuration even if using setuptools locally
  4. Monitor deployment logs immediately after pushing changes
  5. When stuck, delete and redeploy rather than fighting cache issues
  6. Document non-obvious dependencies in comments

Deployment Checklist

Before deploying to Streamlit Community Cloud:

Getting Help

When deployment fails:

  1. Check logs at share.streamlit.io → Your App → Logs
  2. Add debug logging to identify exact failure point
  3. Search Streamlit forums for similar errors
  4. Consider the nuclear option: delete and redeploy

Remember: Streamlit Community Cloud’s deployment environment differs from local development. What works locally may need adjustments for cloud deployment.


Planned

Highlights:

Deeper Dives into Roadmap Items: