
DiscoBench is an open-ended framework for evaluating automated algorithm discovery, for example via AI research agents. DiscoBench has a modular setup, an emphasis on discovering algorithms that transfer, and a huge diversity of tasks! We hope DiscoBench helps drive the frontier of research in algorithm discovery by providing a large-scale, open-ended landscape for evaluating AI research agents!

Figure 1: An overview of the DiscoBench system. A user selects a domain from a broad range of areas of machine learning, then selects which files from the list of available modules an AI research agent may edit. We support selecting any combination of modules, from just one to all of them!

Evaluating Automated Algorithm Discovery

One long-term goal of automated algorithm discovery systems is to safely automate AI research itself. To do so, we need to be able to measure the abilities of AI research agents. While benchmarks for this already exist, they suffer from fundamental limitations, including data contamination, poor-quality evaluations, and difficulty in assessing whether the methods discovered by an agent generalise to new tasks and domains.

We design DiscoBench specifically with the issues of current benchmarks in mind; as such, we hope that it can remain pertinent for a long time. Below, we explain why DiscoBench is so useful and describe some of the tasks that are currently implemented in it; expect these to grow over the next few months!

Limitations of Current Benchmarks

| Criteria | MLE-Bench | AstaBench | MLGym-Bench | DiscoBench |
|---|---|---|---|---|
| Algorithm Transfer | ❌ | ❌ | ❌ | ✅ |
| Task Diversity | 🟡 | ✅ | 🟡 | ✅ |
| Unbiased to Initialisation | 🟡 | ✅ | ❌ | ✅ |
| Contamination Resistant | ❌ | 🟡 | 🟡 | ✅ |
Table 1: A comparison of some common benchmarks used for research-agent evaluation. ✅ means a benchmark handles that criterion well, 🟡 means it handles it only partially, and ❌ means it struggles with that criterion.

What is DiscoBench?

DiscoBench is a new task-generation framework, and task suite, for algorithm discovery and AI research agents. As well as already providing a number of problems with which to measure an agent's performance at AI research, we believe it provides a useful foundation on which efforts in automated, open-ended scientific research can build.

DiscoBench is set up in a modular way, so you can specify which parts of a codebase an LLM edits; this enables unparalleled task diversity compared to other benchmarks in this space. We also differ from other benchmarks in how we define evaluation, with an emphasis on meta-test evaluation: we test the performance of discovered algorithms on held-out datasets or environments, without telling the LLM a priori what it will be evaluated on. Finally, DiscoBench is highly diverse; we offer tasks from different applied and foundational disciplines and support a broad range of public datasets and RL environments.

Figure 2: A visualisation of a typical DiscoBench setup. An agent will develop algorithms to train models on a set of meta-train datasets, making refinements based on the model's evaluation score. After a final algorithm is developed, it is used to train models on a held-out meta-test dataset, which is unknown to the agent. The models are evaluated to get a final performance metric.
Here are some of the main features of DiscoBench:

- A modular setup: you choose exactly which parts of a codebase the agent is allowed to edit.
- Meta-test evaluation: discovered algorithms are scored on held-out datasets and environments that the agent never sees during development.
- Broad task diversity: tasks span applied and foundational areas of machine learning, with support for a wide range of public datasets and RL environments.

DiscoBench is in continuous development, with a planned set of new tasks and an expanded suite of features to be added over the coming months! We are also always open to suggestions and open-source contributions.

Using DiscoBench

Getting started with DiscoBench is simple. Full installation instructions are available in our docs, as well as information about how to define your own DiscoBench configs.

Below, we provide an example of creating a DiscoBench task using our Python API. We also support interacting with DiscoBench through the DiscoBench CLI!

Quick Start

1. Install DiscoBench

pip install discobench

2. Select a task domain and build an example task

discobench.create_task(task_domain="task_domain", example=True, test=False)

This creates a requirements.txt and a full modular codebase. It also creates description.md, which explains the task to your agent, including a description of the task domain, the purpose of each module, and explanations of all meta-train datasets.
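For instance, a quick sanity check after building an example task might look like the sketch below. The "OnPolicyRL" domain string is taken from the task domains described later in this post (verify the exact name via discobench.utils.get_domains()), and the assumption that the generated files land in the current working directory is illustrative:

    import os
    import discobench

    # Build the example (meta-train) task for a concrete domain.
    discobench.create_task(task_domain="OnPolicyRL", example=True, test=False)

    # Check that the generated artifacts exist (assuming they are written to
    # the current working directory; the exact output location may differ).
    for expected in ["requirements.txt", "description.md"]:
        print(expected, "found:", os.path.exists(expected))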

3. Evaluate the algorithms

discobench.create_task(task_domain="task_domain", example=True, test=True)

This builds a new codebase for your test environments. You can conveniently call python run_main.py to run training with the discovered algorithms on all meta-test environments.

To ensure the LLM hasn't cheated on its evaluation, creating a test task will overwrite all files except the discovered modules. This means an agent can't, for instance, replace its evaluation accuracy with 100%!
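Putting the quick-start steps together, a minimal end-to-end sketch might look like the following. The "OnPolicyRL" domain string and the use of subprocess to launch run_main.py are illustrative assumptions, not a definitive workflow:

    import subprocess
    import discobench

    # 1. Build the example (meta-train) task for the agent to work on.
    discobench.create_task(task_domain="OnPolicyRL", example=True, test=False)

    # 2. ...let your research agent edit the editable modules here...

    # 3. Rebuild the codebase for the held-out meta-test environments.
    #    All files except the discovered modules are overwritten.
    discobench.create_task(task_domain="OnPolicyRL", example=True, test=True)

    # 4. Train and evaluate the discovered algorithms on all meta-test environments.
    subprocess.run(["python", "run_main.py"], check=True)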

Task Domains

Task domains are the high-level areas for which we have implemented full modular codebases, such as OnPolicyRL and ComputerVisionClassification. For a full list and descriptions of each domain, use:

discobench.utils.get_domains()

Modules

Modules are the components of each codebase which we can mark as editable or fixed for the LLM. We prevent cheating by overwriting all non-editable files when we create a test task. To see which modules are implemented for each domain, use:

discobench.utils.get_modules()

Configuration

The easiest way to control which modules and datasets are used is to edit a base config dictionary. This can be obtained using:

config_dict = discobench.utils.get_config(task_domain="task_domain")

Afterwards, you can feed your updated config to create a task:

discobench.create_task(task_domain="task_domain", config_dict=config_dict)
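As a sketch, and assuming the config dictionary exposes keys like the ones in the example config at the end of this post (change_loss, change_optim, train_task_id, and so on), selecting the editable modules and training environments might look like this; the "OnPolicyRL" domain string is illustrative:

    import discobench

    # Fetch the default config for a domain ("OnPolicyRL" is illustrative).
    config_dict = discobench.utils.get_config(task_domain="OnPolicyRL")

    # Mark only the loss module as editable and choose the meta-train environments.
    # These keys mirror the example config shown later in this post.
    config_dict["change_loss"] = True
    config_dict["change_optim"] = False
    config_dict["train_task_id"] = ["MinAtar/Breakout", "MinAtar/Freeway"]

    # Build the task from the updated config.
    discobench.create_task(task_domain="OnPolicyRL", config_dict=config_dict)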

Example Tasks

We implement a number of tasks from a range of AI research domains in DiscoBench. While these are continuously being iterated on and added to, we include a description of a few of our tasks, and their modules, below.

On-Policy RL

Modules: Optimiser, Loss, Network Architecture, Training Loop

Starting from code set up for a standard actor-critic RL training loop, the agent must find algorithms which maximise the return of an RL agent policy. We support different RL environments from MinAtar, Brax and Craftax.

Image Classification

Modules: Optimiser, Loss, Network Architecture, Data Preprocessing

Using code designed for training an image classifier, the agent must find algorithms which maximise the accuracy of image classification. We support a range of different datasets, from MNIST (simple) to CIFAR100.
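As a rough sketch of how this domain might be configured, a config edit could look like the following. The "ComputerVisionClassification" domain string comes from the task domains listed above, but the dataset identifiers and the change_preprocessing key are hypothetical and should be checked against get_modules() and the base config:

    import discobench

    # Fetch the default config for the image-classification domain.
    config_dict = discobench.utils.get_config(task_domain="ComputerVisionClassification")

    # Hypothetical keys: let the agent edit the data-preprocessing and loss modules,
    # train on MNIST, and hold out CIFAR100 for meta-test evaluation.
    config_dict["change_preprocessing"] = True
    config_dict["change_loss"] = True
    config_dict["train_task_id"] = ["MNIST"]
    config_dict["test_task_id"] = ["CIFAR100"]

    discobench.create_task(task_domain="ComputerVisionClassification", config_dict=config_dict)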

Model Unlearning

Modules: Loss

Unlearning involves training a pre-trained model to maintain certain behaviours while removing others, and is very important for AI safety! The agent must develop algorithms which maximise the desired capabilities while minimising the undesired capabilities or knowledge. This is a multi-objective task, in which the agent is given many different feedback signals to optimise internally.

Brain Speech Detection

Modules: Network Architecture, Loss, Optimiser

Research suggests we can detect speech from MEG signals (magnetic fields produced by electrical currents in the brain)! The agent must develop algorithms for detecting whether the brain is processing speech or silence across a number of different Sherlock Holmes novels.

Unsupervised Environment Design

Modules: Level Sampler, Training Step, Training Loop, Hyperparameter Config

Unsupervised Environment Design (UED) involves training an RL actor to be robust to different levels from a particular environment. The LLM agent must develop algorithms to maximise the performance of an RL actor on a held-out set of environments.

An Example Usage

Below, we provide a (slightly abridged) example usage of DiscoBench. Here, we work with Claude Code to design a new loss function for On-Policy RL. Claude Code receives feedback from its training environments: MinAtar Breakout and MinAtar Freeway. After Claude Code has developed its new loss functions and run its own experiments for 5 different algorithms, we test its chosen algorithm's performance on a held-out test set: MinAtar SpaceInvaders and MinAtar Asterix.

In DiscoBench, such a process is simple; all we have to do is create a task_config which reflects what we want. The following config example completely defines the task for Claude Code! We automatically build all of the files, requirements, and the task description; the only thing for the agent designer to add is a prompt! Easy!

train_task_id: [MinAtar/Breakout, MinAtar/Freeway]
test_task_id: [MinAtar/SpaceInvaders, MinAtar/Asterix]

source_path: task_src/OnPolicyRL
template_backend: default

change_optim: false
change_loss: true
change_networks: false
change_train: false
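Assuming this config is saved as a YAML file (the filename below is hypothetical) and that it maps directly onto the config dictionary described earlier, the same task could be built through the Python API roughly like so:

    import yaml
    import discobench

    # Load the YAML config shown above (filename is hypothetical).
    with open("task_config.yaml") as f:
        config_dict = yaml.safe_load(f)

    # Build the task for Claude Code from this config.
    discobench.create_task(task_domain="OnPolicyRL", config_dict=config_dict)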

Let's see how Claude Code did. Scroll to the bottom to see how Claude compares to PPO both in- and out-of-meta-distribution.

🛠️ Coming Soon

DiscoBench is in active development, with many new features in the pipeline. In particular, we are currently working on adding new task domains and expanding the existing suite of features.

Future Work

We think DiscoBench has the potential to open up many new research avenues! While we are continuously adding new tasks, there are already thousands of domain-module-dataset combinations in DiscoBench, and we would love to see new research ideas arise from the community exploring them!

Conclusion

We really hope you love using DiscoBench and that some really great things come out of it! Please consider contributing to our open-source repository, where we would love to see new tasks from the wider community as well as feature ideas! If you have any questions, please feel free to reach out to the core contributors!

Related Work

Our work is related to a bunch of other amazing research! In particular, in this blog we discuss:

MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering

MLGym: A New Framework and Benchmark for Advancing AI Research Agents

AstaBench: Rigorous Benchmarking of AI Agents With A Scientific Research Suite

Claude Code

Please also make sure to look into task_domain/utils/_reference.txt to see the origins of all of our code and datasets!

Citation

For attribution in academic contexts, please cite this work as

Goldie et al., "DiscoBench: An Open-Ended Benchmark For Algorithm Discovery", 2025.

BibTeX citation

    @article{goldie2025discobench,
      title={DiscoBench: An Open-Ended Benchmark For Algorithm Discovery},
      author={Alexander D. Goldie and Zilin Wang and Adrian Hayler and Deepak Nathani and Edan Toledo and Ken Thampiratwong and Aleksandra Kalisz and Michael Beukman and Alistair Letcher and Shashank Reddy and Clarisse Wibault and Theo Wolf and Charles O'Neill and Shimon Whiteson and Jakob N. Foerster and Roberta Raileanu},
      year={2025}
    }
    

Contributions

DiscoBench was a hugely collaborative effort. Below, we discuss the contributions made by each author:

* Lead Author

† Core Contributor

‡ Task Contributor