DiscoBench is an open-ended framework for evaluating automated algorithm discovery, e.g. via AI research agent systems. DiscoBench has a modular setup, an emphasis on discovering algorithms that transfer, and a huge diversity of tasks! We hope DiscoBench helps drive the frontier of research in algorithm discovery by providing a large-scale, open-ended landscape for evaluating AI research agents!
One long-term goal of automated algorithm discovery systems is to safely automate AI research itself. To do so, we need to be able to measure the ability of AI research agents. While benchmarks for this already exist, they suffer from fundamental limitations, including data contamination, poor-quality evaluations, and difficulty in assessing whether the methods an agent discovers generalise to new tasks and domains.
We design DiscoBench specifically with the issues of current benchmarks in mind; as such, we hope that it can remain pertinent for a long time. Below, we explain why DiscoBench is so useful and describe some of the tasks that are currently implemented in it; expect these to grow over the next few months!
Poor Evaluation: As in all machine learning, proper evaluation requires a train-test separation. Most existing benchmarks apply this at the model level, i.e. a successful algorithm is one that trains a model which generalises to a test dataset. This places the train-test boundary in the wrong place: since we are discovering algorithms at the meta-level, we should instead measure how well those algorithms transfer when used to train models on completely new datasets. In other words, our focus should be on meta-test performance!
Limited Diversity: Existing benchmarks require manually creating every single problem, which is laborious and repetitive. Even when a benchmark consists of many tasks, these tasks often focus on specific domains at the cost of breadth.
Bias To The Initialisation: Because of how they are implemented, many existing benchmarks initialise the agent's codebase from a fully working implementation. Beyond requiring a complete initial solution, which is non-trivial for hard problems, starting agents in this local minimum can limit the creativity that we hope to elicit from AI research agents. It also affects evaluation, since doing nothing can be a reasonably performant strategy.
Data Contamination: It is hard to accurately measure data contamination in benchmarks, especially when they use machine learning problems that have been publicly released. This problem is particularly prevalent for older Kaggle competitions, where first-place solutions are public, or HuggingFace datasets, where an agent could have seen the data in pretraining and use that to inform its solutions. Issues are especially likely when AI research agents know in advance which specific tasks they will be evaluated on.
| Criteria | MLE-Bench | AstaBench | MLGym Bench | DiscoBench |
|---|---|---|---|---|
| Algorithm Transfer | ❌ | ❌ | ❌ | ✅ |
| Task Diversity | 🟡 | ✅ | 🟡 | ✅ |
| Unbiased to Initialisation | 🟡 | ✅ | ❌ | ✅ |
| Contamination Resistant | ❌ | 🟡 | 🟡 | ✅ |
DiscoBench is a new task-generation framework, and task suite, for algorithm discovery and AI research agents. As well as already providing a number of problems with which to measure an agent's AI research ability, we believe it offers a useful foundation for efforts in automated, open-ended scientific research.
DiscoBench is set up in a modular way, so you can specify exactly which parts of a codebase an LLM may edit; this enables unparalleled task diversity compared to other benchmarks in this space. We also differ from other benchmarks in how we define evaluation, placing the emphasis on meta-test evaluation: we test the performance of discovered algorithms on held-out datasets or environments, without telling the LLM a priori what it will be evaluated on. And DiscoBench is very diverse; we offer tasks from a range of applied and foundational disciplines and support a broad range of public datasets and RL environments.
DiscoBench is in continuous development, with a planned set of tasks and an expanded suite of features to be added over the coming weeks! We are also always open to suggestions and open-source contributions.
Getting started with DiscoBench is simple. Full installation instructions are available in our docs, as well as information about how to define your own DiscoBench configs.
Below, we provide an example of creating a DiscoBench task using our Python API. We also support interacting with DiscoBench through the DiscoBench CLI!
1. Install DiscoBench
pip install discobench
2. Select a task domain and build an example task
discobench.create_task(task_domain="task_domain", example=True, test=False)
This creates a requirements.txt and a full modular codebase. It also creates description.md, which explains the task to your agent: it describes the task domain, the purpose of each module, and all meta-train datasets.
3. Evaluate the algorithms
discobench.create_task(task_domain="task_domain", example=True, test=True)
This builds a new codebase for your test environments. You can conveniently call python run_main.py to run training with the discovered algorithms on all meta-test environments.
To ensure the LLM hasn't cheated on its evaluation, creating a test task overwrites all files except the discovered modules. This means an agent can't, for instance, hard-code its evaluation accuracy to 100%!
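Putting the three steps together, here is a minimal sketch of the full workflow. This assumes the OnPolicyRL domain as an example, that your agent edits the generated modules between the two create_task calls, and that run_main.py is generated in the current working directory.

```python
import subprocess

import discobench

# 1. Build the meta-train codebase (requirements.txt, description.md, modules).
discobench.create_task(task_domain="OnPolicyRL", example=True, test=False)

# 2. Hand the generated codebase to your agent here; it should only edit the
#    modules that the task marks as editable (e.g. the loss module).

# 3. Rebuild the codebase for meta-test: all non-editable files are
#    overwritten, so the agent cannot tamper with the evaluation itself.
discobench.create_task(task_domain="OnPolicyRL", example=True, test=True)

# 4. Train with the discovered algorithm on every meta-test environment.
subprocess.run(["python", "run_main.py"], check=True)
```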
Task domains are the high-level areas for which we have implemented full modular codebases, such as OnPolicyRL and ComputerVisionClassification. For a full list and a description of each domain, use:
discobench.utils.get_domains()
Modules are the components of each codebase which can be marked as editable or fixed for the LLM. We prevent cheating by overwriting all non-editable files when we create a test task. To see which modules are implemented for each domain, use:
discobench.utils.get_modules()
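For example, a quick way to inspect what is available (assuming these helpers return printable collections; their exact return types may differ):

```python
import discobench

# All implemented task domains, e.g. OnPolicyRL, ComputerVisionClassification, ...
print(discobench.utils.get_domains())

# The editable/fixed modules available within each domain.
print(discobench.utils.get_modules())
```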
The easiest way to control which modules and datasets are used is to edit a base config dictionary, which can be obtained using:
config_dict = discobench.utils.get_config(task_domain="task_domain")
Afterwards, you can feed your updated config to create a task:
discobench.create_task(task_domain="task_domain", config_dict=config_dict)
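As a rough sketch of this workflow, and assuming the config dictionary exposes the same keys as the YAML example shown later in this post (change_loss, train_task_id, and so on; the exact schema may differ):

```python
import discobench

# Grab the default config for a domain (OnPolicyRL is used purely for illustration).
config_dict = discobench.utils.get_config(task_domain="OnPolicyRL")

# Assumed keys, mirroring the YAML example later in this post.
config_dict["change_loss"] = True   # let the agent edit only the loss module
config_dict["change_optim"] = False
config_dict["train_task_id"] = ["MinAtar/Breakout", "MinAtar/Freeway"]

# Build the task from the customised config.
discobench.create_task(task_domain="OnPolicyRL", config_dict=config_dict)
```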
We implement a number of tasks from a range of AI research domains in DiscoBench. These are continuously being iterated on and added to; below, we describe a few of our tasks and their modules.
Modules: Optimiser, Loss, Network Architecture, Training Loop
Starting from code set up for a standard actor-critic RL training loop, the agent must find algorithms which maximise the return of an RL agent policy. We support different RL environments from MinAtar, Brax and Craftax.
Modules: Optimiser, Loss, Network Architecture, Data Preprocessing
Using code designed for training an image classifier, the agent must find algorithms which maximise classification accuracy. We support a range of datasets, from the simple (MNIST) to the more challenging (CIFAR-100).
Modules: Loss
Unlearning involves training a pre-trained model to maintain certain behaviours while removing others, and is very important for AI safety! The agent must develop algorithms which preserve the desired capabilities while minimising the undesired capabilities or knowledge. This is a multi-objective task, in which the agent is given several different feedback signals to trade off internally.
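To give a flavour of what an agent-written Loss module might look like for this task, here is a toy, purely illustrative sketch (written in PyTorch for familiarity; DiscoBench's actual backend and module interface may differ) that scalarises a retain objective against a forget objective with a hypothetical forget_weight:

```python
import torch.nn.functional as F

def unlearning_loss(retain_logits, retain_labels,
                    forget_logits, forget_labels,
                    forget_weight: float = 0.5):
    """Toy scalarised multi-objective unlearning loss (illustrative only)."""
    # Keep performance on the behaviours we want to retain.
    retain_loss = F.cross_entropy(retain_logits, retain_labels)
    # Push down likelihood on the behaviours we want to remove
    # (a gradient-ascent-style term, hence the minus sign).
    forget_loss = -F.cross_entropy(forget_logits, forget_labels)
    return retain_loss + forget_weight * forget_loss
```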
Modules: Network Architecture, Loss, Optimiser
Research suggests we can detect speech from MEG signals (magnetic fields produced by electrical currents in the brain)! The agent must develop algorithms for detecting whether the brain is processing speech or silence across a number of different Sherlock Holmes novels.
Modules: Level Sampler, Training Step, Training Loop, Hyperparameter Config
Unsupervised Environment Design (UED) involves training an RL actor to be robust to different levels of a particular environment. The LLM agent must develop algorithms that maximise the performance of the RL actor on a held-out set of environments.
Below, we provide a (slightly abridged) example usage of DiscoBench. Here, we work with Claude Code to design a new loss function for On-Policy RL. Claude Code receives feedback from its training environments: MinAtar Breakout and MinAtar Freeway. After Claude Code has developed its new loss functions and run its own experiments for 5 different algorithms, we test its chosen algorithm's performance on a held-out test set: MinAtar SpaceInvaders and MinAtar Asterix.
In DiscoBench, such a process is simple: all we have to do is create a task_config which reflects what we want. The following config completely defines the task for Claude Code! All of the files, requirements and the task description are built automatically; the only thing left for the agent designer to add is a prompt. Easy!
train_task_id: [MinAtar/Breakout, MinAtar/Freeway]
test_task_id: [MinAtar/SpaceInvaders, MinAtar/Asterix]
source_path: task_src/OnPolicyRL
template_backend: default
change_optim: false
change_loss: true
change_networks: false
change_train: false
Let's see how Claude Code did. Scroll to the bottom to see how Claude compares to PPO in- and out-of-meta-distribution:
DiscoBench is in active development, with many new features in the pipeline. In particular, some of the things we are currently working on include:
We think DiscoBench has the potential to open up lots of new research avenues! While we are continuously adding new tasks, there are already thousands of domain-module-dataset combinations in DiscoBench! Here are a few examples of research ideas we'd love to see arise from using DiscoBench:
We really hope you love using DiscoBench, and that some really great things come out of it! Please consider contributing to our open-source repository, where we would love to see new tasks from the wider community, as well as feature ideas! If you have any questions, please feel free to reach out to the core contributors!
MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering
MLGym: A New Framework and Benchmark for Advancing AI Research Agents
AstaBench: Rigorous Benchmarking of AI Agents With A Scientific Research Suite
Please also make sure to look into task_domain/utils/_reference.txt to see the origins of all of our code and datasets!
For attribution in academic contexts, please cite this work as
Goldie et al., "DiscoBench: An Open-Ended Benchmark For Algorithm Discovery", 2025.
BibTeX citation
@article{goldie2025discobench,
title={DiscoBench: An Open-Ended Benchmark For Algorithm Discovery},
author={Alexander D. Goldie and Zilin Wang and Adrian Hayler and Deepak Nathani and Edan Toledo and Ken Thampiratwong and Aleksandra Kalisz and Michael Beukman and Alistair Letcher and Shashank Reddy and Clarisse Wibault and Theo Wolf and Charles O'Neill and Shimon Whiteson and Jakob N. Foerster and Roberta Raileanu},
year={2025}
}
DiscoBench was a hugely collaborative effort. Below, we discuss the contributions made by each author:
* Lead Author
† Core Contributor
‡ Task Contributor