# How to Contribute a Task for DiscoBench
Thank you for your interest in making a task for DiscoBench! Your contribution is hugely appreciated and will help unlock new research in automated research and algorithm discovery using agentic LLMs.
## Goal
The goal of DiscoBench is to develop a series of modular tasks, where an ML codebase is broken into its constituent components, for LLMs to use when discovering new algorithms. Through configs, we can choose which modules should use default code (the original implementation) and which should be LLM-generated. We want to ensure that LLMs can produce performant, generalisable algorithms for AI research.
## Getting Started
- Follow the setup instructions from the DiscoBench repository to prepare your environment.
- Clone the repo and ensure everything runs correctly.
- Follow the guide below to create your own task.
## Directory Structure Example
Here, we will use OnPolicyRL as an example task structure. The OnPolicyRL directory looks as follows.
```
OnPolicyRL/
├── datasets/
│   ├── Brax/
│   ├── Craftax/
│   ├── GridWorld/
│   └── MinAtar/
│
├── templates/
│   ├── default/
│   │   ├── base/
│   │   ├── edit/
│   │   ├── main.py
│   │   ├── __init__.py
│   │   └── wrappers.py
│   │
│   ├── recurrent/
│   ├── transformer/
│   │
│   └── utils/
│       ├── _reference.txt
│       ├── description.md
│       ├── requirements.txt
│       ├── task_information.yaml
│       └── task_spec.yaml
│
└── task_config.yaml
```
## Step-by-Step Explanation
### datasets/
Contains each dataset (or environment) that your code can run with.
Each dataset folder should include:
- `description.md` – explains what the dataset/environment is (e.g., "This is Breakout!").
- `make_env.py` / `make_dataset.py` – loads and returns the dataset or environment. See `dataset_integration.md` for a more thorough explanation of how to handle datasets in your new DiscoBench task! (A minimal loader sketch follows this list.)
- Any dataset-specific configs or helper files.
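For illustration only, a minimal environment loader might look like the sketch below. The function name, signature, and the use of `gymnax` are assumptions here, not the actual DiscoBench interface – see `dataset_integration.md` for what your task should really expose.

```python
# datasets/MinAtar/make_env.py – illustrative sketch, not the actual DiscoBench code.
# Assumes a gymnax-backed environment; adapt to whatever library your task uses.
import gymnax


def make_env(env_name: str = "Breakout-MinAtar"):
    """Load and return the environment together with its default parameters."""
    env, env_params = gymnax.make(env_name)
    return env, env_params
```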
### templates/

The `templates/` directory contains all versions of your code templates.
#### Must contain

- `default/` – includes:
  - `base/`: fully implemented modules.
  - `edit/`: same file names as `base/`, but with function signatures, comments, and possibly some useful lines only. These are the files to be completed by an LLM.
  - `main.py`: the main entry point to the task.
  - Other necessary files like `wrappers.py`, or any model evaluation logic. Any non-modules should be stored outside of `base/` and `edit/`.
- `utils/` – meta-information and configuration files.
#### Example

- `base/optim.py`: the fully implemented optimiser module.
- `edit/optim.py`: the same file with only the function signature and guiding comments left for an LLM to complete.
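For illustration only (the real OnPolicyRL module is more involved), a `base/` and `edit/` pair could look like this, assuming an `optax`-based `get_optimizer` helper:

```python
# base/optim.py – illustrative sketch of a fully implemented module.
import optax


def get_optimizer(learning_rate: float = 3e-4, max_grad_norm: float = 0.5):
    """Return a gradient-clipped Adam optimiser."""
    return optax.chain(
        optax.clip_by_global_norm(max_grad_norm),
        optax.adam(learning_rate),
    )
```

```python
# edit/optim.py – same filename and signature, but the body is left for the LLM to complete.
def get_optimizer(learning_rate: float = 3e-4, max_grad_norm: float = 0.5):
    """Return the optimiser used for training."""
    # TODO: implement and return an optimiser here.
    raise NotImplementedError
```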

Note: files in `templates/` (outside `base/` and `edit/`) are shared – they are used regardless of which version (default or LLM-generated) is selected.
### utils/
This folder always contains:
- `description.md` – general task-domain description (e.g., what RL is).
- `requirements.txt` – dependencies required to run the benchmark.
- `task_information.yaml` – describes per-module prompts for `edit` codebases. Each `{module}_prompt` must match the corresponding filename (see the sketch after this list).
- `task_spec.yaml` – defines all files which need to be loaded to define a task. Also sets which files are fixed and which are modular.
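As a rough, hypothetical example of the prompt file (the exact schema may differ in your task; the only firm rule above is that each `{module}_prompt` key mirrors a module filename):

```yaml
# utils/task_information.yaml – hypothetical excerpt
loss_prompt: >
  Implement the training loss in loss.py. Keep the existing function
  signature and return a scalar loss.
optim_prompt: >
  Implement get_optimizer() in optim.py and return the optimiser used
  for training.
networks_prompt: >
  Implement the agent's network architecture in networks.py.
```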
### template_backends/

Folders like `transformer/` or `recurrent/` are optional backends that override specific files in `default/`.

Example:

- `transformer/networks.py` replaces `default/networks.py` with a transformer implementation.
- If implementing any additional backends, there should be an updated `task_information.yaml` in the backend folder for whichever modules have been overwritten.
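For instance, a backend that only swaps out the network module might be laid out as follows (one possible layout, inferred from the description above):

```
templates/
├── default/
│   ├── base/
│   ├── edit/
│   └── ...
└── transformer/
    ├── networks.py            # overrides default/networks.py
    └── task_information.yaml  # updated prompts for the overridden module(s)
```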
### task_config.yaml

Defines which modules use `base` or `edit` code. This is what anyone running the benchmark can use to configure the task.

It also:

- Specifies the dataset/environment
- Chooses the backend (default/recurrent/transformer)
- Defines where to save the task under `task_src/`
`task_spec.yaml` vs `task_config.yaml`:

- `task_spec.yaml` (in `utils/`): defines the structure of your task domain. It lists which files are fixed (always copied as-is) and which are module files (can have `base/` and `edit/` versions). This file is static and defines the task domain architecture.
- `task_config.yaml` (in the task root): defines the runtime configuration for a specific task instance. It specifies:
  - Which datasets to use (`train_task_id`, `test_task_id`)
  - Which modules should use `edit/` implementations (`change_loss: true`, `change_optim: false`, etc.)
  - Which backend to create the task with
  - Any task-specific settings

  This file is dynamic and can be modified to change which parts of the code are editable for participants.
### (Optional) models/

Contains the different pretrained models that your code relies on. These can optionally be included in your tasks if they involve finetuning or changing pretrained models.

Each model folder should include:

- `description.md` – an explanation of that model.
- `model_config.yaml` – everything needed to download the model from HuggingFace.

See `discobench/tasks/ModelUnlearning` for an example of how models can be used!
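A hypothetical `model_config.yaml` might look like the following. The field names here are assumptions for illustration only – copy the schema used in `discobench/tasks/ModelUnlearning` rather than this sketch.

```yaml
# models/<MODEL_NAME>/model_config.yaml – hypothetical sketch
model_name: meta-llama/Llama-3.2-1B   # HuggingFace repo id to download
revision: main                        # pinned revision for reproducibility
tokenizer_name: meta-llama/Llama-3.2-1B
```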
## How to Make a New Task
- Choose a codebase
  - Stay close to a known repo for verification and reproducibility.
  - Example: OnPolicyRL is derived from PureJaxRL.
- Identify modules
  - Generally, there are some easy modules to identify: `network`, `loss`, `optimizer`.
  - Optionally include `config`, `training loop`, or other unique artifacts.
- Split code into modules
  - Each module should ideally have a single purpose (e.g. `get_optimizer()`).
- Create `base/` and `edit/` folders
  - `base/`: complete implementations.
  - `edit/`: empty or commented versions, keeping function signatures and minimal guidance.
- Define a metric
  - Must return or print a performance metric, e.g. validation accuracy, test score after tuning, etc.
  - The logic for producing this metric must not reside in a module (otherwise the LLM could cheat)!
  - Be consistent across tasks!
- Create `task_spec.yaml`
  - List all modules and mark whether they're editable or fixed. This file defines the structure of your task and does not change. It lives in `utils/task_spec.yaml`. Below you can find an example `task_spec.yaml` file:

    ```yaml
    fixed_files:
      - train.py
      - evaluate.py
      - make_dataset.py
      - config.py
      - main.py

    module_files:
      - loss.py
      - networks.py
      - optim.py
    ```

- (Optional) Add backends (`transformer/`, `recurrent/`, etc.)
- Write metadata
  - Add `description.md`, `task_information.yaml`, `requirements.txt` inside `utils/`.
- Add datasets
  - Each under `datasets/`, with its own `description.md` and loader/configs.
- Verify your code
  - Ensure base code runs to expected performance.
  - Check `edit` code has correct signatures and structure.
  - You can temporarily replace edit with base code to verify functionality.
- Add `_reference.txt`
  - Include the original codebase and dataset citation or source link.
- Ensure `main.py` exists
  - This must be the entrypoint.
- Create `task_config.yaml`
  - This file lives in the task root directory (same level as `utils/`, `templates/`, etc.).
  - It specifies which datasets to use and which modules should use `edit/` implementations.
  - For every file listed in `module_files` in your `task_spec.yaml`, you must include a corresponding `change_<module_name>` entry (without the `.py` extension).

  Example from OnPolicyRL:

    ```yaml
    train_task_id: [MinAtar/Breakout, MinAtar/Freeway]
    test_task_id: [MinAtar/Breakout, MinAtar/SpaceInvaders, MinAtar/Freeway, MinAtar/Asterix, Brax/Ant]

    source_path: task_src/OnPolicyRL
    template_backend: default  # default, transformer, recurrent

    change_optim: false
    change_loss: true
    change_networks: false
    change_train: false
    ```

  - `train_task_id` and `test_task_id`: specify which datasets to use (must match dataset folder names under `datasets/`).
  - `change_<module>`: set to `true` to use the `edit/` version (participants can modify), `false` to use the `base/` version (fixed implementation).
  - Each module file from `task_spec.yaml`'s `module_files` list needs a corresponding `change_<module>` entry (e.g., `loss.py` → `change_loss`, `networks.py` → `change_networks`, `optim.py` → `change_optim`).
- Create an example config in `example_configs/<task_domain>.yaml`
  - This will create an example task for anyone who wants to test an agent on your task.

  Example from OnPolicyRL:

    ```yaml
    train_task_id: [MinAtar/Breakout, MinAtar/Freeway]
    test_task_id: [MinAtar/Asterix, MinAtar/SpaceInvaders]

    source_path: task_src/OnPolicyRL
    template_backend: default  # default, transformer, recurrent

    change_optim: true
    change_loss: true
    change_networks: false
    change_train: false
    ```

- Keep metrics outside modules
  - The main performance metric should not be computed inside a module (we don't want it to be possible to cheat)! A sketch of this separation is shown after this list.
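As a rough sketch of what this separation can look like (the function names below are illustrative, not DiscoBench APIs; `train.py` and `evaluate.py` are the fixed files from the example `task_spec.yaml` above):

```python
# main.py – illustrative sketch: the metric is computed and reported by fixed files only,
# so code in the editable modules cannot change how performance is measured.
from train import train_agent        # hypothetical helper in the fixed train.py
from evaluate import evaluate_agent  # hypothetical helper in the fixed evaluate.py


def main():
    agent_state = train_agent()          # training may call the editable modules internally
    score = evaluate_agent(agent_state)  # metric logic lives outside base/ and edit/
    print(f"final_score: {score}")       # return or print the performance metric


if __name__ == "__main__":
    main()
```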
Done! Your task is ready for integration.
## Dataset Integration
For detailed instructions on adding new datasets to your tasks, see our Dataset Integration Guide.
## Verifying Your Task
1. Generate the LLM-facing file system

   To test whether your task is runnable, try creating the file system as it would be used in DiscoBench with the command:

   ```bash
   python3 -m discobench.create_task --task_domain <TASK_NAME>
   ```

   This will populate `task_src/<TASK_NAME>`. The first check is therefore that the above runs through without any errors.
2. Verify that your code can run

   After you have verified that your task can be created with the command from step (1), it is time to actually run your code. There are many ways to do so. One easy way is to (i) set every `change_<module>` flag to `false` and (ii) include all datasets as train tasks in the `task_config.yaml`. Then re-run the script from step (1); you should be able to run the files in the file system created under `task_src/`. To test this, use `run_main.py`, which will run all files called `main.py` (a sketch of this flow appears after this list).
3. Make sure that all additional files are there

   Some files are needed to generate the LLM agent prompts but currently do not cause errors in steps (1) and (2), even when they are missing. They were already mentioned above; here is a compact checklist to make sure everything is in place:

   - `description.md` – general task-domain description (e.g., what RL is).
   - `requirements.txt` – dependencies required to run the benchmark.
   - `task_information.yaml` – describes per-module prompts for `edit` codebases. Each `{module}_prompt` must match the corresponding filename.
   - `_reference.txt` – original codebase citation or source link for attribution and reproducibility.
   - `datasets/<DATASET_NAME>/description.md` – must be provided for each dataset. Explains what the dataset/environment is (e.g., "This is Breakout!").
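Putting steps (1) and (2) together, an end-to-end check might look roughly like this. The `run_main.py` invocation is an assumption about your local setup; adjust paths and arguments as needed.

```bash
# Build the LLM-facing file system for your task, with all change_<module> flags set to false.
python3 -m discobench.create_task --task_domain <TASK_NAME>

# Run every generated main.py under task_src/ to confirm the base code trains end to end.
python3 run_main.py
```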
## Nice to Know
- Running pre-commit hooks on every commit can be annoying. You can disable them temporarily:

  ```bash
  git commit --no-verify
  ```

  Then, when you're ready to push:

  ```bash
  pre-commit run --all-files
  ```

  or simply commit again without `--no-verify`.
## Summary
Creating a DiscoBench task involves:
- Structuring your files (`datasets`, `templates`, `utils`).
- Separating full (`base`) and empty (`edit`) implementations.
- Adding metadata (`task_information.yaml`, `task_spec.yaml`).
- Ensuring reproducibility and attribution.
- Verifying your task with the creation script.

Follow this guide carefully – doing so makes our lives much easier when integrating your task!