DiscoBench

DiscoBench is a modular benchmark for automated algorithm discovery in machine learning.

What is DiscoBench?

DiscoBench is an open-ended benchmark and research playground for developing automated algorithm discovery and AI scientist systems. It has a modular setup, emphasizes discovering algorithms that transfer across tasks, and covers a wide range of tasks. We hope DiscoBench helps advance algorithm discovery research by providing a large-scale, open-ended setting for evaluating AI research agents.

Key Features

  • Modular Architecture: Break down ML algorithms into composable components
  • Multiple Domains: Support for reinforcement learning, language modeling, computer vision, Bayesian optimization, and more
  • Flexible Configuration: Easy switching between baseline and experimental implementations
  • LLM-Ready: Designed for automated algorithm discovery using AI agents
  • Extensible: Simple framework for adding new tasks and domains

Quick Start

Installation

Install from source:

git clone git@github.com:AlexGoldie/discobench.git
cd discobench
make install

Or install with pip:

pip install discobench

Basic Usage

List available domains:

uv run discobench get-domains

Create a full task-domain codebase (with baseline implementations):

uv run discobench create-task --task-domain OnPolicyRL

Create an example task for algorithm discovery:

uv run discobench create-task --task-domain OnPolicyRL --example

See the full Usage Guide for detailed instructions.

Available Domains

To list the task domains DiscoBench currently supports, run the get-domains command shown under Basic Usage.

See the Domains page for detailed information about each domain.

How It Works

1. Modular Components

Each task domain is decomposed into modules. For example, OnPolicyRL includes:

  • loss.py: Objective function (e.g., PPO loss)
  • networks.py: Neural network architectures
  • optim.py: Optimization algorithms
  • train.py: Training loop logic

2. Base and Edit Implementations

Each module has two versions:

  • Base: Fully implemented, tested baseline
  • Edit: Template with function signatures for customization

3. Configuration-Driven

Control which modules use baseline vs. custom implementations via YAML config:

change_optim: true   # Use custom optimizer
change_loss: false   # Use baseline loss
change_networks: false
change_train: false
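
As a rough illustration of how such flags could map to implementation choices, the sketch below reads the four flags from a plain dictionary and reports which version of each module would be used. The helper name select_implementations and the "base"/"edit" labels are assumptions made for this example; they are not DiscoBench's actual API.

```python
# Hypothetical helper for illustration only; not part of DiscoBench's API.
CHANGE_PREFIX = "change_"


def select_implementations(config: dict) -> dict:
    """Map each `change_<module>` flag to 'edit' (custom) or 'base' (baseline)."""
    selection = {}
    for key, use_custom in config.items():
        if not key.startswith(CHANGE_PREFIX):
            continue  # ignore any unrelated config keys
        module = key[len(CHANGE_PREFIX):]
        selection[module] = "edit" if use_custom else "base"
    return selection


# Mirrors the YAML config, written as a dict so the sketch needs no YAML parser.
config = {
    "change_optim": True,
    "change_loss": False,
    "change_networks": False,
    "change_train": False,
}

selection = select_implementations(config)
for module, version in selection.items():
    print(f"{module}.py -> {version} implementation")
```

With the config shown, only the optimizer module would come from the Edit template; the other three modules would stay on their Base versions.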

4. Task Generation

DiscoBench assembles the configured modules into a complete, runnable task in task_src/:

discobench create-task --task-domain OnPolicyRL
cd task_src/OnPolicyRL
python run_main.py
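
The same steps can be scripted from Python. The sketch below only assembles the commands and, by default, prints them instead of running them. The function names and the dry_run switch are assumptions for illustration; the CLI flags are the ones shown in this document.

```python
import subprocess
from pathlib import Path


def create_task_command(task_domain: str, example: bool = False) -> list:
    """Build the create-task command using the flags shown in this document."""
    cmd = ["discobench", "create-task", "--task-domain", task_domain]
    if example:
        cmd.append("--example")
    return cmd


def generate_and_run(task_domain: str, root: Path = Path("task_src"), dry_run: bool = True):
    """Create the task, then run its entry point from the generated directory."""
    steps = [
        (create_task_command(task_domain), Path(".")),    # generate task_src/
        (["python", "run_main.py"], root / task_domain),  # run the assembled task
    ]
    for cmd, cwd in steps:
        if dry_run:
            print(f"[{cwd}] {' '.join(cmd)}")
        else:
            # check=True raises CalledProcessError if a step fails.
            subprocess.run(cmd, cwd=cwd, check=True)
    return steps


steps = generate_and_run("OnPolicyRL")
```

Pass dry_run=False to execute the commands; this assumes discobench is installed and on the PATH.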

Documentation

For Users

  • Usage Guide: CLI commands, Python API, and workflows
  • Domains: Available task domains and their modules

For Contributors

Example Use Cases

Algorithm Discovery with LLMs

Use DiscoBench to have AI agents discover new ML algorithms:

  1. Configure which modules should be generated by the LLM
  2. LLM writes implementations for those modules
  3. Evaluate performance across multiple tasks
  4. Iterate and refine based on results
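
The loop described above could be sketched as follows. Everything here is a stand-in: propose represents an LLM call, evaluate represents running a generated task and reading back a score, and the candidate names, task names, and scores are made up so the example runs on its own. None of this is DiscoBench's API.

```python
from typing import Callable, Sequence, Tuple


def discovery_loop(
    propose: Callable[[str], str],
    evaluate: Callable[[str, str], float],
    tasks: Sequence[str],
    n_iterations: int = 3,
) -> Tuple[str, float]:
    """Propose a candidate, score it on every task, keep the best, repeat."""
    best_candidate, best_score = "", float("-inf")
    feedback = ""
    for _ in range(n_iterations):
        candidate = propose(feedback)                      # LLM writes the module
        scores = [evaluate(candidate, t) for t in tasks]   # evaluate across tasks
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_candidate, best_score = candidate, mean_score
        feedback = f"{candidate} scored {mean_score:.2f}"  # feeds the next proposal
    return best_candidate, best_score


# Stubs with made-up values, used only to make the sketch runnable.
_candidates = iter(["v1", "v2", "v3"])
_fake_scores = {"v1": 0.25, "v2": 0.5, "v3": 0.375}


def stub_propose(feedback: str) -> str:
    return next(_candidates)


def stub_evaluate(candidate: str, task: str) -> float:
    return _fake_scores[candidate]


best_candidate, best_score = discovery_loop(
    stub_propose, stub_evaluate, tasks=["task_a", "task_b"]
)
print(best_candidate, best_score)
```

In a real setup, evaluate would run the generated task for each candidate, and the feedback string would carry whatever results the agent needs in order to refine its next attempt.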

Transfer Learning Research

Test if components discovered on one task generalize to others:

  1. Discover algorithm on training tasks
  2. Evaluate on held-out test tasks
  3. Measure generalization across domains
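
One simple way to summarize generalization is the difference between the mean score on training tasks and the mean score on held-out tasks. The sketch below assumes that scores are already normalized to a comparable scale; the task names and numbers are placeholders, not DiscoBench results.

```python
from statistics import mean


def generalization_gap(train_scores: dict, test_scores: dict) -> dict:
    """Compare mean performance on training tasks with held-out tasks."""
    train_mean = mean(train_scores.values())
    test_mean = mean(test_scores.values())
    return {
        "train_mean": train_mean,
        "test_mean": test_mean,
        "gap": train_mean - test_mean,  # larger gap = weaker transfer
    }


# Placeholder scores, assumed to be normalized per task.
train_scores = {"train_task_a": 0.75, "train_task_b": 0.25}
test_scores = {"heldout_task_c": 0.25}

report = generalization_gap(train_scores, test_scores)
print(report)
```

A gap near zero would suggest the discovered component transfers well; a large positive gap would suggest it is overfit to the training tasks.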

Project Structure

discobench/
├── tasks/              # Task domain implementations
│   ├── OnPolicyRL/
│   ├── LanguageModelling/
│   └── ...
├── utils/              # Core utilities
├── create_task.py      # Task generation logic
├── create_config.py    # Configuration utilities
└── cli.py              # Command-line interface

task_src/               # Generated task files (after running create-task)

Contributing

We welcome contributions! DiscoBench grows stronger with more tasks and domains.

Citation

If you use DiscoBench in your research, please cite:

    @article{goldie2025discobench,
      title={DiscoBench: An Open-Ended Benchmark For Algorithm Discovery},
      author={Alexander D. Goldie and Zilin Wang and Adrian Hayler and Deepak Nathani and Edan Toledo and Ken Thampiratwong and Aleksandra Kalisz and Michael Beukman and Alistair Letcher and Shashank Reddy and Clarisse Wibault and Theo Wolf and Charles O'Neill and Jakob N. Foerster and Shimon Whiteson and Roberta Raileanu},
      year={2025}
    }

License

This project is licensed under the terms specified in the LICENSE file.