RL Loss Function Design Task
Combining our agent prompt with the pre-built description.md
Role
You are an AI research scientist designing a novel loss function for Reinforcement Learning.
Objective
Create the highest-performing loss function in loss.py for the training environments provided. This loss function will be tested on additional RL environments after you complete your work.
Key Files
- loss.py: Your main deliverable - implement your loss function here
- run_main.py: Executes all main.py scripts to run training
- main.py: Training scripts (likely multiple across different environments)
Execution Instructions
Running Code
python run_main.py
This finds and executes all main.py training scripts.
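For reference, a minimal sketch of what a script like run_main.py might do internally is shown below; the actual implementation provided with the task may differ, and the directory layout here is an assumption.

```python
# Hypothetical sketch of how run_main.py could locate and execute training
# scripts; the real run_main.py shipped with the task may work differently.
import pathlib
import subprocess
import sys

def run_all_main_scripts(root: str = ".") -> None:
    # Find every main.py under the task directory tree and run them one by one.
    for script in sorted(pathlib.Path(root).rglob("main.py")):
        print(f"Running {script} ...")
        result = subprocess.run([sys.executable, str(script)], check=False)
        if result.returncode != 0:
            print(f"{script} exited with code {result.returncode}")

if __name__ == "__main__":
    run_all_main_scripts()
```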
Important Constraints
- Symlink Latency: File symlinks take several seconds to update. Always wait 3-5 seconds after creating/modifying files before running code.
- JAX Compilation: Code uses JAX with JIT compilation (see the sketch after this list), which means:
- First runs will be slow (compilation overhead)
- Subsequent runs are faster
- Long execution times are expected
- Concurrency Limits: Do NOT queue multiple training runs simultaneously. Wait for each run to complete before starting the next.
- Monitor Results: Check each run's output carefully to ensure it completed successfully.
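To see the JIT compilation overhead in isolation, here is a small self-contained example (not part of the task code): the first call traces and compiles the function, while subsequent calls reuse the compiled version.

```python
import time
import jax
import jax.numpy as jnp

@jax.jit
def squared_error(pred, target):
    return jnp.mean((pred - target) ** 2)

x = jnp.ones((1000, 1000))
y = jnp.zeros((1000, 1000))

t0 = time.time()
squared_error(x, y).block_until_ready()  # first call: includes compilation
print(f"first call:  {time.time() - t0:.3f}s")

t0 = time.time()
squared_error(x, y).block_until_ready()  # second call: reuses compiled code
print(f"second call: {time.time() - t0:.3f}s")
```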
Research Approach
Expectations
- Innovate: Think outside the box. Don't just reimplement PPO, DQN, SAC, etc.
- Iterate: Test multiple ideas and refine based on results
- Experiment systematically: Make changes, test, analyze, repeat
- Track performance: Compare metrics across different loss function designs
Workflow
1. Read description.md thoroughly to understand the requirements
2. Examine the existing code structure and baseline implementation
3. Design and implement a novel loss function in loss.py
4. Run training: python run_main.py
5. Wait for completion (may take up to an hour; check in periodically along the way)
6. Analyze results and metrics
7. Iterate on your design
8. Repeat steps 3-7 until you achieve strong performance
Success Criteria
Your loss function should demonstrate strong performance on the training environments. The final loss.py will be evaluated on held-out test environments.
Tips
- Start by understanding the baseline
- Make incremental changes and test frequently
- Keep notes on what works and what doesn't
- Consider theoretical motivations for your design choices
- Balance exploration of new ideas with refinement of promising approaches
Automated algorithm discovery is a branch of machine learning focused on using computational search to create new, high-performance algorithms. Unlike traditional algorithm design, which relies on human creativity and expertise, this field automates the process of invention by systematically exploring a vast space of possible programs. The core idea is that a system generates candidate algorithms, evaluates their performance on benchmark tasks, and uses these results as feedback to guide the search towards more effective and efficient solutions.
Historically, automated algorithm discovery draws inspiration from evolutionary computation and genetic programming, which apply principles of natural selection to evolve computer programs. Early formalizations in the 1980s and 1990s established methods for representing algorithms as structures, like trees or graphs, that could be modified and combined. The generate-evaluate-refine loop—where a system proposes an algorithm, tests its correctness and efficiency, and iteratively improves it—remains central to all automated discovery frameworks.
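As an illustration only, the generate-evaluate-refine loop can be sketched as a simple hill-climbing search; the candidate representation, mutation operator, and evaluation function below are placeholders, not part of the provided task code.

```python
def discover(initial_candidate, mutate, evaluate, iterations=100):
    """Generic generate-evaluate-refine loop.

    mutate:   proposes a new candidate algorithm from an existing one
    evaluate: returns a scalar score on benchmark tasks (higher is better)
    """
    best, best_score = initial_candidate, evaluate(initial_candidate)
    for _ in range(iterations):
        candidate = mutate(best)      # generate
        score = evaluate(candidate)   # evaluate
        if score > best_score:        # refine: keep only improvements
            best, best_score = candidate, score
    return best, best_score
```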
In practice, automated algorithm discovery has been used to find faster sorting and hashing routines, optimize fundamental computations like matrix multiplication, and design novel neural network architectures. Here, the objective is to discover new machine learning algorithms. To do so, files have been created in discovered/ which you can use to implement new algorithms. The algorithms implemented in discovered/ will eventually be tested for their ability to generalize. For testing, these algorithms will be run with code that has the exact same format as that in the task folders shared with you. Therefore, it is important that any algorithms you implement maintain the exact same interface as that provided.
Below, we provide a description of the domain of machine learning in which you will be discovering algorithms.
Reinforcement learning is a branch of machine learning focused on training agents to make sequences of decisions in an environment to maximize a notion of cumulative reward. Unlike supervised learning, where models learn from labeled examples, RL agents learn through trial and error, receiving feedback in the form of rewards or penalties based on their actions. The core idea is that the agent explores the environment, evaluates the outcomes of its actions, and gradually improves its decision-making policy to achieve better long-term results.
Historically, RL draws inspiration from behavioral psychology, particularly the study of how animals learn from rewards and punishments. Early formalizations in the 1950s and 1960s laid the groundwork for algorithms that could handle Markov decision processes (MDPs), a mathematical framework for modeling decision-making under uncertainty. The agent-environment interaction loop—where the agent observes the state of the environment, takes an action, and receives a reward—remains central to all RL formulations.
The objective of reinforcement learning is to find a policy—a mapping from states to actions—that maximizes the expected cumulative reward over time. RL algorithms vary in approach, from value-based methods, which estimate the expected reward of actions, to policy-based methods, which directly optimize the agent’s behavior. Success in RL requires balancing exploration (trying new actions to discover rewards) and exploitation (leveraging known actions that yield high rewards).
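As a concrete example of the cumulative-reward objective, the discounted return of a single trajectory can be computed as follows; this is a standalone sketch, not part of the task code.

```python
import jax
import jax.numpy as jnp

def discounted_returns(rewards: jnp.ndarray, gamma: float = 0.99) -> jnp.ndarray:
    """Compute G_t = r_t + gamma * G_{t+1} for one trajectory of rewards."""
    def step(carry, r):
        g = r + gamma * carry
        return g, g
    # Scan backwards so each return accumulates all (discounted) future rewards.
    _, returns = jax.lax.scan(step, 0.0, rewards, reverse=True)
    return returns

print(discounted_returns(jnp.array([0.0, 0.0, 1.0, 0.0, 1.0])))
```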
In practice, RL has been applied to robotics, game playing, resource management, and recommendation systems, among other areas, where sequential decision-making is key. Understanding the principles of reward, policy, and environment dynamics is essential before implementing an RL algorithm, as these components shape how the agent learns and adapts.
Below, we provide descriptions of the environments in which you will be training. However, be aware that any code you develop may be applied to other RL environments too.
You should change the loss file, which can be found in loss.py. In deep learning, the loss provides an objective to minimize; in reinforcement learning, minimizing this objective corresponds to maximizing the return of an agent. You should not change the name of the function, loss_actor_and_critic, or its inputs.
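For orientation only, a minimal actor-critic style loss in JAX might look like the sketch below. The function name matches the required loss_actor_and_critic, but the argument list, the network apply function, and the advantage/target inputs are assumptions made for illustration; in practice you must keep the exact signature already defined in the provided loss.py.

```python
import jax
import jax.numpy as jnp

def loss_actor_and_critic(params, apply_fn, obs, actions, targets, advantages,
                          critic_coef=0.5, entropy_coef=0.01):
    # NOTE: these argument names are illustrative assumptions; keep the
    # signature from the provided loss.py unchanged.
    # Assumed shapes: logits (B, num_actions), values (B, 1) or (B,).
    logits, values = apply_fn(params, obs)
    log_probs = jax.nn.log_softmax(logits)
    chosen_log_probs = jnp.take_along_axis(
        log_probs, actions[:, None], axis=-1).squeeze(-1)

    # Policy term: increase log-probability of actions with positive advantage.
    actor_loss = -jnp.mean(chosen_log_probs * advantages)
    # Critic term: regress predicted values toward return targets.
    critic_loss = jnp.mean((values.reshape(-1) - targets) ** 2)
    # Entropy bonus: discourage premature collapse of the policy.
    entropy = -jnp.mean(jnp.sum(jnp.exp(log_probs) * log_probs, axis=-1))

    return actor_loss + critic_coef * critic_loss - entropy_coef * entropy
```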
DESCRIPTION
Breakout MinAtar is a simplified version of the classic Atari Breakout game. The player controls a paddle at the bottom of the screen and must bounce a ball to break rows of bricks at the top. The ball travels only along diagonals and bounces off when hitting the paddle or walls. The game continues until the ball hits the bottom of the screen or the maximum number of steps is reached.
OBSERVATION SPACE
The observation is an ndarray with shape (10, 10, 4), where the channels correspond to the following:
- Channel 0: paddle - position of the player's paddle
- Channel 1: ball - current position of the ball
- Channel 2: trail - indicates the ball's direction of movement
- Channel 3: brick - layout of the remaining bricks
Each channel contains binary values (0 or 1) indicating presence/absence of the respective element.
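For example, the paddle and ball positions could be recovered from the binary channels with a helper like the one below (hypothetical code, assuming the (10, 10, 4) row/column/channel layout described above).

```python
import jax.numpy as jnp

def paddle_and_ball_positions(obs: jnp.ndarray):
    """obs: (10, 10, 4) binary array; channels 0=paddle, 1=ball, 2=trail, 3=brick."""
    paddle_x = jnp.argmax(obs[:, :, 0].sum(axis=0))  # column containing the paddle
    ball_y = jnp.argmax(obs[:, :, 1].sum(axis=1))    # row containing the ball
    ball_x = jnp.argmax(obs[:, :, 1].sum(axis=0))    # column containing the ball
    return paddle_x, ball_y, ball_x
```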
ACTION SPACE
The action space consists of 3 discrete actions:
- 0: no-op (no movement)
- 1: move paddle left
- 2: move paddle right
TRANSITION DYNAMICS
- The paddle moves left or right based on the chosen action
- The ball moves diagonally and bounces off walls and the paddle
- When the ball hits a brick, the brick is destroyed
- When all bricks are cleared, a new set of three rows is added
- The ball's direction is indicated by the trail channel
REWARD
- +1 reward for each brick broken
- No negative rewards
STARTING STATE
- Paddle starts at position 4
- Ball starts at either (3,0) or (3,9) with corresponding diagonal direction
- Three rows of bricks are initialized at the top (rows 1-3)
EPISODE END
The episode ends if either of the following happens:
- Termination: The ball hits the bottom of the screen
- Truncation: The length of the episode reaches max_steps_in_episode (default: 1000)
STATE SPACE
The state consists of:
- ball_y: vertical position of ball (0-9)
- ball_x: horizontal position of ball (0-9)
- ball_dir: direction of ball movement (0-3)
- pos: paddle position (0-9)
- brick_map: 10x10 binary map of bricks
- strike: boolean indicating if ball hit something
- last_y, last_x: previous ball position
- time: current timestep
- terminal: whether episode has ended
DESCRIPTION
Freeway MinAtar is a simplified version of the classic Atari Freeway game. The player starts at the bottom of the screen and can travel up or down, but at reduced speed: the player only moves once every 3 frames. A reward of +1 is given when the player reaches the top of the screen. The player must navigate through traffic that moves horizontally across the screen while trying to reach the opposite side.
OBSERVATION SPACE
The observation is an ndarray with shape (10, 10, 4), where the channels correspond to the following:
- Channel 0: chicken - position of the player character
- Channel 1: cars - positions of moving cars/traffic
- Channel 2: trail - indicates recent position or movement
- Channel 3: background - static background elements
Each channel contains binary values (0 or 1) indicating presence/absence of the respective element.
ACTION SPACE
The action space consists of 3 discrete actions:
- 0: no-op (no movement)
- 1: move up
- 2: move down
TRANSITION DYNAMICS
- The player moves up or down based on the chosen action but at reduced speed (every 3 frames)
- Cars move horizontally across the screen at different speeds
- Player must avoid colliding with cars
- When the player reaches the top, they receive a reward and can continue
REWARD
- +1 reward for reaching the top of the screen
- No negative rewards for collisions
STARTING STATE
- Player starts at the bottom of the screen
- Cars spawn and move across different lanes
EPISODE END
The episode ends if either of the following happens:
- Termination: Maximum episode length reached
- Truncation: The length of the episode reaches max_steps_in_episode (default: 1000)