πŸͺ© How to Contribute a Task for DiscoBench

Thank you for your interest in making a task for DiscoBench! Your contribution is hugely appreciated and will help unlock new research in automated research and algorithm discovery using agentic LLMs.


🎯 Goal

The goal of DiscoBench is to develop a series of modular tasks, where an ML codebase is broken into its constituent components, for LLMs to use when discovering new algorithms. Through configs, we can choose which modules should use default code (the original implementation) and which should be LLM-generated. We want to ensure that LLMs can produce performant, generalisable algorithms for AI research.


βš™οΈ Getting Started

  1. Follow the setup instructions from the DiscoBench repository to prepare your environment.
  2. Clone the repo and ensure everything runs correctly.
  3. Follow the guide below to create your own task.

πŸ“ Directory Structure Example

Here, we will use OnPolicyRL as an example task. Its directory structure looks as follows:

OnPolicyRL/
β”œβ”€β”€ datasets/
β”‚   β”œβ”€β”€ Brax/
β”‚   β”œβ”€β”€ Craftax/
β”‚   β”œβ”€β”€ GridWorld/
β”‚   └── MinAtar/
β”‚
β”œβ”€β”€ templates/
β”‚   β”œβ”€β”€ default/
β”‚   β”‚   β”œβ”€β”€ base/
β”‚   β”‚   β”œβ”€β”€ edit/
β”‚   β”‚   β”œβ”€β”€ main.py
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   └── wrappers.py
β”‚   β”‚
β”‚   β”œβ”€β”€ recurrent/
β”‚   β”œβ”€β”€ transformer/
β”‚   β”‚
β”‚   └── utils/
β”‚       β”œβ”€β”€ _reference.txt
β”‚       β”œβ”€β”€ description.md
β”‚       β”œβ”€β”€ requirements.txt
β”‚       β”œβ”€β”€ task_information.yaml
β”‚       └── task_spec.yaml
β”‚
└── task_config.yaml

🧩 Step-by-Step Explanation

🧠 datasets/

Contains each dataset (or environment) that your code can run with.

Each dataset folder should include:

  • description.md β€” explains what the dataset/environment is (e.g., β€œThis is Breakout!”).
  • make_env.py / make_dataset.py β€” loads and returns the dataset or environment. See dataset_integration.md for a more thorough explanation of how to handle datasets in your new DiscoBench task!
  • Any dataset-specific configs or helper files.

πŸ—οΈ templates/

The templates/ directory contains all versions of your code templates.

Must contain:

  • default/ β€” includes:

  • base/: fully implemented modules.

  • edit/: same file names as base/, but with function signatures, comments, and possibly some useful lines only. These are the files to be completed by an LLM.
  • main.py: the main entry point to the task.
  • Other necessary files like wrappers.py, or any model evaluation logic. Any non-modules should be stored outside of base/edit.
  • utils/ β€” meta-information and configuration files.

Example

  β€’ base/optim.py: the full base optim.py implementation.
  β€’ edit/optim.py: the stripped edit optim.py template.
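
As a rough illustration (not the actual DiscoBench files), the pair could look like this, assuming an optax-based module exposing a single get_optimizer() function:

```python
# base/optim.py β€” hypothetical fully implemented module.
import optax


def get_optimizer(learning_rate: float = 3e-4, max_grad_norm: float = 0.5):
    """Return the optimiser used by the training loop."""
    return optax.chain(
        optax.clip_by_global_norm(max_grad_norm),  # gradient clipping for stability
        optax.adam(learning_rate),
    )
```

```python
# edit/optim.py β€” same signature and docstring, body left for the LLM.


def get_optimizer(learning_rate: float = 3e-4, max_grad_norm: float = 0.5):
    """Return the optimiser used by the training loop."""
    # TODO: implement the optimiser here.
    raise NotImplementedError
```

The important property is that both versions expose exactly the same interface, so the rest of the code can import either one without modification.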

🧩 Files in templates/ (outside base/edit) are shared β€” used regardless of which version (default or LLM-generated) is selected.


🧰 utils/

This folder always contains:

  • description.md β€” general task-domain description (e.g., what RL is).

  • requirements.txt β€” dependencies required to run the benchmark.

  • task_information.yaml β€” describes per-module prompts for edit codebases. Each {module}_prompt must match the corresponding filename.

  • task_spec.yaml β€” defines all files which need to be loaded to define a task. Also sets which files are fixed and which are modular.


🧠 Template backends (transformer/, recurrent/)

Folders like transformer/ or recurrent/ are optional backends that override specific files in default/.

Example:

  • transformer/networks.py replaces default/networks.py with a transformer implementation.
  • If implementing any additional backends, there should be an updated task_information.yaml in the backend folder for whichever modules have been overwritten.

🧾 task_config.yaml

Defines which modules use base or edit code. This is what anyone running the benchmark can use to configure the task.

It also:

  • Specifies the dataset/environment
  • Chooses backend (default/recurrent/transformer)
  • Defines where to save the task under task_src/

task_spec.yaml vs task_config.yaml:

  • task_spec.yaml (in utils/): Defines the structure of your task domain. It lists which files are fixed (always copied as-is) vs which are module files (can have base/ and edit/ versions). This file is static and defines the task domain architecture.

  • task_config.yaml (in task root): Defines the runtime configuration for a specific task instance. It specifies:

    • Which datasets to use (train_task_id, test_task_id)
    • Which modules should use edit/ implementations (change_loss: true, change_optim: false, etc.)
    • Which backend to create the task with.
    • Any task-specific settings

This file is dynamic and can be modified to change which parts of the code are editable for participants.


πŸ€– (Optional) models/

Contains any pretrained models that your code relies on. Include this folder if your task involves fine-tuning or otherwise modifying pretrained models.

Each model folder should include:

  β€’ description.md - an explanation of that model.
  β€’ model_config.yaml - everything needed to download the model from HuggingFace.

See discobench/tasks/ModelUnlearning for an example of how models can be used!
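
As a sketch of how this fits together, a fixed (non-module) file in your task could read model_config.yaml and pull the model from HuggingFace. The key name below is an assumption, not the actual model_config.yaml schema:

```python
# Hypothetical loader living in a fixed file of a task that uses models/.
# "model_name" is an assumed key; use whatever your model_config.yaml defines.
import yaml
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_pretrained(model_dir: str):
    """Load the pretrained model described by <model_dir>/model_config.yaml."""
    with open(f"{model_dir}/model_config.yaml") as f:
        cfg = yaml.safe_load(f)
    model = AutoModelForCausalLM.from_pretrained(cfg["model_name"])
    tokenizer = AutoTokenizer.from_pretrained(cfg["model_name"])
    return model, tokenizer
```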


🧱 How to Make a New Task

  1. Choose a codebase

    β€’ Stay close to a known repo for verification and reproducibility.
    β€’ Example: OnPolicyRL is derived from PureJaxRL.

  2. Identify modules

    β€’ Generally, there are some easy modules to identify: network, loss, optimizer.
    β€’ Optionally include the config, training loop, or other unique artifacts.

  3. Split code into modules

    β€’ Each module should ideally have a single purpose (e.g. get_optimizer()).

  4. Create base and edit folders

    β€’ base/: complete implementations.
    β€’ edit/: empty or commented versions, keeping function signatures and minimal guidance.

  5. Define a metric

    β€’ Must return or print a performance metric, e.g. validation accuracy or test score after tuning.
    β€’ The logic for producing this metric must not reside in a module (otherwise the LLM could cheat)!
    β€’ Be consistent across tasks!

  6. Create task_spec.yaml

    β€’ List all modules and mark whether they're editable or fixed. This file defines the structure of your task and does not change. It lives in utils/task_spec.yaml. Below you can find an example task_spec.yaml file:

    ```yaml
    fixed_files:
      - train.py
      - evaluate.py
      - make_dataset.py
      - config.py
      - main.py

    module_files:
      - loss.py
      - networks.py
      - optim.py
    ```

  7. (Optional) Add backends (transformer/, recurrent/, etc.)

  8. Write metadata

    β€’ Add description.md, task_information.yaml, and requirements.txt inside utils/.

  9. Add datasets

    β€’ Each under datasets/, with its own description.md and loader/configs.

  10. Verify your code

    β€’ Ensure base code runs to expected performance.
    β€’ Check edit code has correct signatures and structure.
    β€’ You can temporarily replace edit with base code to verify functionality.

  11. Add _reference.txt

    β€’ Include original codebase and dataset citation or source link.

  12. Ensure main.py exists

    β€’ This must be the entrypoint.
  13. Create task_config.yaml

    β€’ This file lives in the task root directory (same level as utils/, templates/, etc.).
    β€’ It specifies which datasets to use and which modules should use edit/ implementations.
    β€’ For every file listed in module_files in your task_spec.yaml, you must include a corresponding change_<module_name> entry (without the .py extension).

    Example from OnPolicyRL:

    ```yaml
    train_task_id: [MinAtar/Breakout, MinAtar/Freeway]
    test_task_id: [MinAtar/Breakout, MinAtar/SpaceInvaders, MinAtar/Freeway, MinAtar/Asterix, Brax/Ant]

    source_path: task_src/OnPolicyRL
    template_backend: default # default, transformer, recurrent

    change_optim: false
    change_loss: true
    change_networks: false
    change_train: false
    ```

    β€’ train_task_id and test_task_id: Specify which datasets to use (must match dataset folder names under datasets/).
    β€’ change_<module>: Set to true to use the edit/ version (participants can modify), false to use the base/ version (fixed implementation).
    β€’ Each module file from task_spec.yaml's module_files list needs a corresponding change_<module> entry (e.g., loss.py β†’ change_loss, networks.py β†’ change_networks, optim.py β†’ change_optim).
  14. Create example_config in example_configs/<task_domain>.yaml

    β€’ This will create an example task for anyone who wants to test an agent on your task.

    Example from OnPolicyRL:

    ```yaml
    train_task_id: [MinAtar/Breakout, MinAtar/Freeway]
    test_task_id: [MinAtar/Asterix, MinAtar/SpaceInvaders]

    source_path: task_src/OnPolicyRL
    template_backend: default # default, transformer, recurrent

    change_optim: true
    change_loss: true
    change_networks: false
    change_train: false
    ```

  15. Keep metrics outside modules

    β€’ The main performance metric should not be computed inside a module (we don't want it to be possible to cheat)!

βœ… Done! Your task is ready for integration.


πŸ—‚οΈ Dataset Integration

For detailed instructions on adding new datasets to your tasks, see our Dataset Integration Guide.

πŸ§ͺ Verifying Your Task

  1. Generate the LLM-facing file system

To test whether your task is runnable, try creating the file system as it would be used in discobench with the command:

```bash
python3 -m discobench.create_task --task_domain <TASK_NAME>
```

This will populate:

task_src/<TASK_NAME>

The first check should therefore be that the above runs through without any errors.

  2. Verify that your code can run

    After you have verified that your task can be created with the command above, it is time to actually run your code. There are many ways to do so. One easy way is to (i) set every change_<module> flag to false and (ii) include all datasets as train tasks in task_config.yaml. Then re-run the command from step 1; you should be able to run the files in the file system created under task_src/. To test this, use run_main.py, which will run all files called main.py.

  3. Make sure that all additional files are there

Some files are needed to generate the LLM agent prompts, but their absence does not currently cause errors in steps (1) and (2). They were already mentioned above; here is a compact checklist to make sure everything is in place:

  • description.md β€” general task-domain description (e.g., what RL is).
  • requirements.txt β€” dependencies required to run the benchmark.
  • task_information.yaml β€” describes per-module prompts for edit codebases. Each {module}_prompt must match the corresponding filename.
  • _reference.txt β€” original codebase citation or source link for attribution and reproducibility.
  • datasets/<DATASET_NAME>/description.md β€” Must be provided for each dataset. Explains what the dataset/environment is (e.g., "This is Breakout!").

πŸ’‘ Nice to Know

  • Running pre-commit hooks on every commit can be annoying. You can disable them temporarily:

```bash
git commit --no-verify
```

Then, when you’re ready to push:

```bash
pre-commit run --all-files
```

or simply commit again without --no-verify.


🧭 Summary

Creating a DiscoBench task involves:

  1. Structuring your files (datasets, templates, utils).
  2. Separating full (base) and empty (edit) implementations.
  3. Adding metadata (task_information.yaml, task_spec.yaml).
  4. Ensuring reproducibility and attribution.
  5. Verifying your task with the creation script.

Follow this guide carefully β€” doing so makes our lives much easier when integrating your task! ✨