DiscoBench is an open-ended framework for evaluating automated algorithm discovery, e.g. via AI research agent systems. DiscoBench has a modular setup, an emphasis on discovering algorithms that transfer, and a huge diversity of tasks! We hope DiscoBench helps drive the frontier of research in algorithm discovery by providing a large-scale, open-ended landscape for evaluating AI research agents!
One long-term goal of automated algorithm discovery systems is to safely automate AI research itself. To do so, we need to be able to measure the ability of AI research agents. While benchmarks for this already exist, they suffer from fundamental limitations, including data contamination, poor-quality evaluations, and difficulty in assessing whether the methods an agent discovers generalise to new tasks and domains.
We design DiscoBench specifically with the issues of current benchmarks in mind; as such, we hope that it can remain pertinent for a long time. Below, we explain why DiscoBench is so useful and describe some of the tasks that are currently implemented in it; expect these to grow over the next few months!
Poor Evaluation: As in all machine learning, proper evaluation requires a train-test separation. Most existing benchmarks apply this at the model level, i.e. a successful algorithm is one that trains a model which generalises to a test dataset. This places the train-test boundary in the wrong place: since we are discovering algorithms at the meta-level, we should instead measure how well those algorithms transfer when used to train models on completely new datasets. In other words, our focus should be on meta-test performance!
Limited Diversity: Existing benchmarks require manually creating every single problem, which is laborious and repetitive. Even when a benchmark consists of many tasks, these tasks often focus on specific domains at the cost of breadth.
Bias To The Initialisation: Because of how they are implemented, many existing benchmarks initialise the agent's codebase from a fully working implementation. Beyond requiring a complete initial solution, which is non-trivial for hard problems, starting agents in this local minimum can limit the creativity that we hope to elicit from AI research agents. It also affects evaluation, since doing nothing can be a reasonably performant strategy.
Data Contamination: It is hard to accurately measure data contamination in benchmarks, especially when they use machine learning problems that have been publicly released. This problem is particularly prevalent for older Kaggle competitions, where first-place solutions are public, or HuggingFace datasets, where an agent could have seen the data in pretraining and use that to inform its solutions. Issues are especially likely when AI research agents know in advance which specific tasks they will be evaluated on.
| Criteria | MLE-Bench | AstaBench | MLGym Bench | DiscoBench |
|---|---|---|---|---|
| Algorithm Transfer | ❌ | ❌ | ❌ | ✅ |
| Task Diversity | 🟡 | ✅ | 🟡 | ✅ |
| Unbiased to Initialisation | 🟡 | ✅ | ❌ | ✅ |
| Contamination Resistant | ❌ | 🟡 | 🟡 | ✅ |
DiscoBench is a new task-generation framework, and task suite, for algorithm discovery and AI research agents. As well as already providing a number of problems with which to measure an agent's AI research ability, we believe it offers a useful foundation for efforts in automated, open-ended scientific research.
DiscoBench is set up in a modular way, so you can specify exactly which parts of a codebase an LLM may edit; this enables unparalleled task diversity compared to other benchmarks in this space. We also differ from other benchmarks in how we define evaluation, placing the emphasis on meta-test evaluation: we test the performance of discovered algorithms on held-out datasets or environments, without telling the LLM a priori what it will be evaluated on. And DiscoBench is very diverse; we offer tasks from a range of applied and foundational disciplines and support a broad range of public datasets and RL environments.
DiscoBench is in continuous development, with a planned set of tasks and an expanded suite of features to be added over the coming weeks! We are also always open to suggestions and open-source contributions.
Getting started with DiscoBench is simple. Full installation instructions are available in our docs, as well as information about how to define your own DiscoBench configs.
Below, we provide an example of creating a DiscoBench task using our Python API. We also support interacting with DiscoBench through the DiscoBench CLI!
1. Install DiscoBench
pip install discobench
2. Select a task domain and build an example task
discobench.create_task(task_domain="task_domain", example=True, test=False)
This creates a requirements.txt and a full modular codebase. It also creates description.md, which explains the task to your agent: it describes the task domain, the purpose of each module, and all meta-train datasets.
3. Evaluate the algorithms
discobench.create_task(task_domain="task_domain", example=True, test=True)
This builds a new codebase for your test environments. You can conveniently call python run_main.py to run training with the discovered algorithms on all meta-test environments.
To ensure the LLM hasn't cheated on its evaluation, creating a test task overwrites all files except the discovered modules. This means an agent can't, for instance, hard-code its evaluation accuracy to 100%!
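Putting the three steps together, here is a minimal sketch of the full workflow. This assumes the OnPolicyRL domain as an example, that your agent edits the generated modules between the two create_task calls, and that run_main.py is generated in the current working directory.

```python
import subprocess

import discobench

# 1. Build the meta-train codebase (requirements.txt, description.md, modules).
discobench.create_task(task_domain="OnPolicyRL", example=True, test=False)

# 2. Hand the generated codebase to your agent here; it should only edit the
#    modules that the task marks as editable (e.g. the loss module).

# 3. Rebuild the codebase for meta-test: all non-editable files are
#    overwritten, so the agent cannot tamper with the evaluation itself.
discobench.create_task(task_domain="OnPolicyRL", example=True, test=True)

# 4. Train with the discovered algorithm on every meta-test environment.
subprocess.run(["python", "run_main.py"], check=True)
```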
Task domains are the high-level areas for which we have implemented full modular codebases, such as OnPolicyRL and ComputerVisionClassification. For a full list and a description of each domain, use:
discobench.utils.get_domains()
Modules are the components of each codebase which can be marked as editable or fixed for the LLM. We prevent cheating by overwriting all non-editable files when we create a test task. To see which modules are implemented for each domain, use:
discobench.utils.get_modules()
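For example, a quick way to inspect what is available (assuming these helpers return printable collections; their exact return types may differ):

```python
import discobench

# All implemented task domains, e.g. OnPolicyRL, ComputerVisionClassification, ...
print(discobench.utils.get_domains())

# The editable/fixed modules available within each domain.
print(discobench.utils.get_modules())
```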
The easiest way to control which modules and datasets are used is to edit a base config dictionary, which can be obtained using:
config_dict = discobench.utils.get_config(task_domain="task_domain")
Afterwards, you can feed your updated config to create a task:
discobench.create_task(task_domain="task_domain", config_dict=config_dict)
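As a rough sketch of this workflow, and assuming the config dictionary exposes the same keys as the YAML example shown later in this post (change_loss, train_task_id, and so on; the exact schema may differ):

```python
import discobench

# Grab the default config for a domain (OnPolicyRL is used purely for illustration).
config_dict = discobench.utils.get_config(task_domain="OnPolicyRL")

# Assumed keys, mirroring the YAML example later in this post.
config_dict["change_loss"] = True   # let the agent edit only the loss module
config_dict["change_optim"] = False
config_dict["train_task_id"] = ["MinAtar/Breakout", "MinAtar/Freeway"]

# Build the task from the customised config.
discobench.create_task(task_domain="OnPolicyRL", config_dict=config_dict)
```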
We implement a number of tasks from a range of AI research domains in DiscoBench. These are continuously being iterated on and added to; below, we describe a few of our tasks and their modules.
Modules: Optimiser, Loss, Network Architecture, Training Loop
Starting from code set up for a standard actor-critic RL training loop, the agent must find algorithms which maximise the return of an RL agent policy. We support different RL environments from MinAtar, Brax and Craftax.
Modules: Optimiser, Loss, Network Architecture, Data Preprocessing
Using code designed for training an image classifier, the agent must find algorithms which maximise classification accuracy. We support a range of datasets, from the simple (MNIST) to the more challenging (CIFAR-100).
Modules: Loss
Unlearning involves training a pre-trained model to maintain certain behaviours while removing others, and is very important for AI safety! The agent must develop algorithms which preserve the desired capabilities while minimising the undesired capabilities or knowledge. This is a multi-objective task, in which the agent is given several different feedback signals to trade off internally.
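To give a flavour of what an agent-written Loss module might look like for this task, here is a toy, purely illustrative sketch (written in PyTorch for familiarity; DiscoBench's actual backend and module interface may differ) that scalarises a retain objective against a forget objective with a hypothetical forget_weight:

```python
import torch.nn.functional as F

def unlearning_loss(retain_logits, retain_labels,
                    forget_logits, forget_labels,
                    forget_weight: float = 0.5):
    """Toy scalarised multi-objective unlearning loss (illustrative only)."""
    # Keep performance on the behaviours we want to retain.
    retain_loss = F.cross_entropy(retain_logits, retain_labels)
    # Push down likelihood on the behaviours we want to remove
    # (a gradient-ascent-style term, hence the minus sign).
    forget_loss = -F.cross_entropy(forget_logits, forget_labels)
    return retain_loss + forget_weight * forget_loss
```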
Modules: Network Architecture, Loss, Optimiser
Research suggests we can detect speech from MEG signals (magnetic fields produced by electrical currents in the brain)! The agent must develop algorithms for detecting whether the brain is processing speech or silence across a number of different Sherlock Holmes novels.
Modules: Level Sampler, Training Step, Training Loop, Hyperparameter Config
Unsupervised Environment Design (UED) involves training an RL actor to be robust to different levels of a particular environment. The LLM agent must develop algorithms that maximise the performance of the RL actor on a held-out set of environments.
Below, we provide a (slightly abridged) example usage of DiscoBench. Here, we work with Claude Code to design a new loss function for On-Policy RL. Claude Code receives feedback from its training environments: MinAtar Breakout and MinAtar Freeway. After Claude Code has developed its new loss functions and run its own experiments for 5 different algorithms, we test its chosen algorithm's performance on a held-out test set: MinAtar SpaceInvaders and MinAtar Asterix.
In DiscoBench, such a process is simple: all we have to do is create a task_config which reflects what we want. The following config completely defines the task for Claude Code! All of the files, requirements and the task description are built automatically; the only thing left for the agent designer to add is a prompt. Easy!
train_task_id: [MinAtar/Breakout, MinAtar/Freeway]
test_task_id: [MinAtar/SpaceInvaders, MinAtar/Asterix]
source_path: task_src/OnPolicyRL
template_backend: default
change_optim: false
change_loss: true
change_networks: false
change_train: false
Let's see how Claude Code did. Scroll to the bottom to see how Claude compares to PPO in- and out-of-meta-distribution:
DiscoBench is in active development, with many new features in the pipeline. In particular, some of the things we are currently working on include:
We think DiscoBench has the potential to open up lots of new research avenues! While we are continuously adding new tasks, there are already thousands of domain-module-dataset combinations in DiscoBench! Here are a few examples of research ideas we'd love to see arise from using DiscoBench:
We really hope you love using DiscoBench, and that some really great things come out of it! Please consider contributing to our open-source repository, where we would love to see new tasks from the wider community, as well as feature ideas! If you have any questions, please feel free to reach out to the core contributors!
MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering
MLGym: A New Framework and Benchmark for Advancing AI Research Agents
AstaBench: Rigorous Benchmarking of AI Agents With A Scientific Research Suite
Please also make sure to look into task_domain/utils/_reference.txt to see the origins of all of our code and datasets!
For attribution in academic contexts, please cite this work as
Goldie et al., "DiscoBench: An Open-Ended Benchmark For Algorithm Discovery", 2025.
BibTeX citation
@article{goldie2025discobench,
title={DiscoBench: An Open-Ended Benchmark For Algorithm Discovery},
author={Alexander D. Goldie and Zilin Wang and Adrian Hayler and Deepak Nathani and Edan Toledo and Ken Thampiratwong and Aleksandra Kalisz and Michael Beukman and Alistair Letcher and Shashank Reddy and Clarisse Wibault and Theo Wolf and Charles O'Neill and Shimon Whiteson and Jakob N. Foerster and Roberta Raileanu},
year={2025}
}
DiscoBench was a hugely collaborative effort. Below, we discuss the contributions made by each author:
* Lead Author
† Core Contributor
‡ Task Contributor