DiscoBench

DiscoBench is a modular benchmark for automated algorithm discovery in machine learning.

What is DiscoBench?

DiscoBench is an open-ended benchmark and research playground for developing automated algorithm discovery and AI scientist systems. It has a modular setup, emphasizes discovering algorithms that transfer across tasks, and covers a wide range of task domains. We hope DiscoBench helps advance research in algorithm discovery by providing a large-scale, open-ended landscape for evaluating AI research agents.

Key Features

Modular Architecture: Break down ML algorithms into composable components
Multiple Domains: Support for reinforcement learning, language modeling, computer vision, Bayesian optimization, and more
Flexible Configuration: Easy switching between baseline and experimental implementations
LLM-Ready: Designed for automated algorithm discovery using AI agents
Extensible: Simple framework for adding new tasks and domains

Quick Start

Installation

Install from source:

git clone git@github.com:AlexGoldie/discobench.git
cd discobench
make install

Or install with pip:

pip install discobench

Basic Usage

List available domains:

uv run discobench get-domains

Create a full task-domain codebase (with baseline implementations):

uv run discobench create-task --task-domain OnPolicyRL

Create an example task for algorithm discovery:

uv run discobench create-task --task-domain OnPolicyRL --example

See the full Usage Guide for detailed instructions.
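When driving DiscoBench from a script rather than a shell, it can help to build the commands above programmatically. The following is a minimal sketch, not part of DiscoBench itself: `build_create_task_cmd` is a hypothetical helper name, and it only uses the `create-task`, `--task-domain`, and `--example` arguments shown above.

```python
import shlex


def build_create_task_cmd(task_domain, example=False, use_uv=True):
    """Build the argument list for `discobench create-task`.

    Hypothetical helper: it only assembles the flags shown in this page
    and does not run anything.
    """
    cmd = ["uv", "run"] if use_uv else []
    cmd += ["discobench", "create-task", "--task-domain", task_domain]
    if example:
        cmd.append("--example")
    return cmd


# Print the command as a shell-safe string instead of executing it.
print(shlex.join(build_create_task_cmd("OnPolicyRL", example=True)))
```

To actually run the command, the returned list could be passed to `subprocess.run(cmd, check=True)` from a directory where DiscoBench is installed.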

Available Domains

DiscoBench currently supports the following task domains:

OnPolicyRL: On-policy reinforcement learning (PPO-style algorithms)
OffPolicyRL: Off-policy reinforcement learning (DQN-style algorithms)
LanguageModelling: Pre-training language models
ComputerVisionClassification: Image classification tasks
BayesianOptimisation: Black-box optimization
BrainSpeechDetection: Neural signal analysis
ModelUnlearning: LLM unlearning tasks
UnsupervisedEnvironmentDesign: Environment curriculum learning
ContinualLearning: Learning under non-stationarity
GreenhouseGasPrediction: Predicting atmospheric greenhouse gas concentrations

See the Domains page for detailed information about each domain.

How It Works

1. Modular Components

Each task domain is decomposed into modules. For example, OnPolicyRL includes:

loss.py: Objective function (e.g., PPO loss)
networks.py: Neural network architectures
optim.py: Optimization algorithms
train.py: Training loop logic

2. Base and Edit Implementations

Each module has two versions:

Base: Fully implemented, tested baseline
Edit: Template with function signatures for customization
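To make the base/edit distinction concrete, here is a purely illustrative sketch. The function names, the signature, and the use of mean squared error are assumptions chosen for brevity; they are not taken from DiscoBench's actual `loss.py`, whose contents are not shown here.

```python
# Hypothetical "base" version: a fully implemented function.
def base_loss(predictions, targets):
    """Mean squared error over two equal-length sequences (stand-in loss)."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical "edit" version: same signature, body left to be filled in
# by a researcher or an AI agent.
def edit_loss(predictions, targets):
    """Template: implement a custom loss with the same inputs and output."""
    raise NotImplementedError("custom loss goes here")
```

Keeping the signature identical in both versions is what would let the rest of the codebase call either one without changes.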

3. Configuration-Driven

Control which modules use baseline vs. custom implementations via YAML config:

change_optim: true   # Use custom optimizer
change_loss: false   # Use baseline loss
change_networks: false
change_train: false
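As a rough illustration of how such flags could be interpreted, the sketch below maps each `change_<module>` flag to the version it selects. It is not DiscoBench's own config loader: the function names are hypothetical, and the tiny hand-written parser only handles flat `key: true|false` lines like those above (a real YAML library would be the better choice for anything more complex).

```python
CONFIG_TEXT = """\
change_optim: true   # Use custom optimizer
change_loss: false   # Use baseline loss
change_networks: false
change_train: false
"""


def parse_change_flags(text):
    """Parse flat `change_<module>: true|false` lines into {module: bool}."""
    flags = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if not key.startswith("change_"):
            continue
        flags[key[len("change_"):]] = value.strip().lower() == "true"
    return flags


def select_versions(flags):
    """Map each module to 'edit' (custom) or 'base' (baseline)."""
    return {module: ("edit" if changed else "base") for module, changed in flags.items()}


versions = select_versions(parse_change_flags(CONFIG_TEXT))
print(versions)
# {'optim': 'edit', 'loss': 'base', 'networks': 'base', 'train': 'base'}
```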

4. Task Generation

DiscoBench assembles the configured modules into a complete, runnable task in task_src/:

discobench create-task --task-domain OnPolicyRL
cd task_src/OnPolicyRL
python run_main.py

Documentation

For Users

Usage Guide: CLI commands, Python API, and workflows
Domains: Available task domains and their modules

For Contributors

Contributing Overview: How to add new tasks to DiscoBench
Dataset Integration: Adding new datasets to tasks

Example Use Cases

Algorithm Discovery with LLMs

Use DiscoBench to have AI agents discover new ML algorithms:

1. Configure which modules should be generated by the LLM
2. LLM writes implementations for those modules
3. Evaluate performance across multiple tasks
4. Iterate and refine based on results
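The propose, evaluate, and iterate cycle described above can be sketched as a toy loop. Everything here is a stand-in: `propose_candidate` replaces the LLM call, a "candidate" is just a single number rather than a module implementation, and `evaluate` is a synthetic score rather than a DiscoBench task run.

```python
import random


def propose_candidate(rng, history):
    """Stand-in for an LLM proposal: perturb the best candidate so far."""
    if not history:
        return 0.1  # arbitrary starting point
    best_candidate, _ = max(history, key=lambda item: item[1])
    return best_candidate * rng.choice([0.5, 2.0])


def evaluate(candidate, tasks):
    """Stand-in for running tasks: higher is better, averaged over tasks."""
    return sum(-abs(candidate - optimum) for optimum in tasks) / len(tasks)


def discovery_loop(tasks, iterations=5, seed=0):
    """Propose, evaluate, record, and repeat; return the best (candidate, score)."""
    rng = random.Random(seed)
    history = []
    for _ in range(iterations):
        candidate = propose_candidate(rng, history)
        score = evaluate(candidate, tasks)
        history.append((candidate, score))
    return max(history, key=lambda item: item[1])


best_candidate, best_score = discovery_loop(tasks=[0.2, 0.4])
print(best_candidate, best_score)
```

In a real setup, the history passed back to the proposer would typically include the generated code and its evaluation logs, not just a scalar score.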

Transfer Learning Research

Test if components discovered on one task generalize to others:

1. Discover algorithm on training tasks
2. Evaluate on held-out test tasks
3. Measure generalization across domains
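One simple way to summarize generalization is the gap between mean scores on training tasks and on held-out tasks. The sketch below assumes scores have already been normalized to a comparable scale; the task names and numbers are placeholders for illustration only, not DiscoBench results, and `generalization_gap` is a hypothetical helper.

```python
from statistics import mean


def generalization_gap(train_scores, test_scores):
    """Compare mean performance on training tasks vs. held-out tasks.

    Both arguments map task name -> score on a shared, normalized scale.
    A gap near zero suggests the discovered component transfers well.
    """
    train_mean = mean(train_scores.values())
    test_mean = mean(test_scores.values())
    return {"train_mean": train_mean, "test_mean": test_mean, "gap": train_mean - test_mean}


# Placeholder scores, not real measurements.
train = {"task_a": 0.75, "task_b": 0.25}
held_out = {"task_c": 0.25}

report = generalization_gap(train, held_out)
print(report)
# {'train_mean': 0.5, 'test_mean': 0.25, 'gap': 0.25}
```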

Project Structure

discobench/
├── tasks/              # Task domain implementations
│   ├── OnPolicyRL/
│   ├── LanguageModelling/
│   └── ...
├── utils/              # Core utilities
├── create_task.py      # Task generation logic
├── create_config.py    # Configuration utilities
└── cli.py              # Command-line interface

task_src/               # Generated task files (after running create-task)

Contributing

We welcome contributions! DiscoBench grows stronger with more tasks and domains.

Found a bug? Open an issue
Want to add a task? See the Contributing Guide
Adding datasets? Check the Dataset Integration Guide

Citation

If you use DiscoBench in your research, please cite:

    @article{goldie2025discobench,
      title={DiscoBench: An Open-Ended Benchmark For Algorithm Discovery},
      author={Alexander D. Goldie and Zilin Wang and Adrian Hayler and Deepak Nathani and Edan Toledo and Ken Thampiratwong and Aleksandra Kalisz and Michael Beukman and Alistair Letcher and Shashank Reddy and Clarisse Wibault and Theo Wolf and Charles O'Neill and Jakob N. Foerster and Shimon Whiteson and Roberta Raileanu},
      year={2025}
    }

Links

GitHub Repository: https://github.com/AlexGoldie/discobench
Documentation: https://AlexGoldie.github.io/discobench
Blog: https://alexgoldie.github.io/discobench-blog/
PyPI Package: Coming soon

License

This project is licensed under the terms specified in the LICENSE file.