Training a SWE Agent from Scratch: Everything you need to know

5 minute read

Agentic RL Frameworks

1. ROLL

2. AReaL

3. SkyRL: Specialized for SWE Agent with Verl

4. Verl-agent: Multi-turn RL with Verl

Learning about AReaL

The key idea of AReaL is to separate inference and training completely so that it can support asynchronous rollouts.
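
Conceptually, the decoupling means rollout workers keep generating trajectories with whatever (possibly stale) policy snapshot they hold, while the trainer consumes finished trajectories from a queue and publishes updated weights when it can. The sketch below is a generic producer/consumer illustration of this idea in plain Python, not AReaL's actual API.

# Generic producer/consumer sketch of asynchronous rollouts (not AReaL's API).
import queue
import threading
import time

traj_queue = queue.Queue(maxsize=64)  # finished trajectories waiting for the trainer
policy_version = 0                    # the trainer bumps this; workers read it lazily

def rollout_worker(worker_id: int) -> None:
    # Keeps generating with whatever (possibly stale) policy snapshot it last saw.
    while True:
        snapshot = policy_version          # read once per episode
        time.sleep(0.01)                   # stand-in for env interaction + generation
        traj_queue.put({"worker": worker_id, "policy_version": snapshot, "reward": 1.0})

def trainer(num_steps: int, batch_size: int) -> None:
    global policy_version
    for step in range(num_steps):
        batch = [traj_queue.get() for _ in range(batch_size)]  # consume without pausing generation
        # ... the PPO update on `batch` would go here ...
        policy_version += 1                # publish new weights; workers pick them up later
        staleness = policy_version - min(t["policy_version"] for t in batch)
        print(f"step {step}: trained on trajectories up to {staleness} versions stale")

for i in range(4):
    threading.Thread(target=rollout_worker, args=(i,), daemon=True).start()
trainer(num_steps=3, batch_size=8)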

To understand and use it, the first step is to learn how the repo is organized and where to customize: which code files to get familiar with, and which places accept customization.

The first example to look at is main_async_ppo.py, the quick-start entry point.

  • You may need to modify the config class: from realhf.experiments.common.ppo_math_exp import PPOMATHConfig (see the sketch below).
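
To see what the quick-start experiment exposes for customization, one option is to dump the config's fields. This assumes PPOMATHConfig is a dataclass (realhf experiment configs generally are); if it is not, fall back to help(PPOMATHConfig).

# Sketch: list the fields of the quick-start experiment config.
# Assumption: PPOMATHConfig is a dataclass.
import dataclasses
from realhf.experiments.common.ppo_math_exp import PPOMATHConfig

for f in dataclasses.fields(PPOMATHConfig):
    print(f"{f.name}: {f.type} (default={f.default})")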

Learning about Verl

Verl-agent

  • In main_ppo.py, the following are added:
    • from agent_system.reward_manager.episode import EpisodeRewardManager
    • from agent_system.multi_turn_rollout import TrajectoryCollector
    • from agent_system.environments import make_envs
    • Then, when RayPPOTrainer is initialized, traj_collector, envs, and val_envs are passed in as additional arguments, and the reward manager is passed in as reward_fn (see the schematic sketch after this list).
  • In ray_trainer.py:

    • A complete new function is added:
      import numpy as np
      import torch
      from verl import DataProto

      def apply_invalid_action_penalty(data: DataProto, invalid_action_penalty_coef: float):
          """Subtract a penalty from the final-token reward (and the step rewards,
          if present) of every trajectory whose action was invalid."""
          reward_tensor = data.batch['token_level_scores']
          if 'step_rewards' in data.batch.keys():
              step_rewards = data.batch['step_rewards']
          for i in range(len(data)):
              data_item = data[i]  # DataProtoItem
              prompt_ids = data_item.batch['prompts']
              prompt_length = prompt_ids.shape[-1]
              # Number of non-padding response tokens for this trajectory.
              valid_response_length = data_item.batch['attention_mask'][prompt_length:].sum()
              action_valids = data_item.non_tensor_batch['is_action_valid'].astype(np.float32)
              action_invalids = torch.tensor(1 - action_valids, dtype=torch.float32, device=prompt_ids.device).squeeze(0)
              # Apply the invalid-action penalty to the reward of the last response token.
              # assert reward_tensor[i, valid_response_length - 1] != 0.0, f'i={i}'
              reward_tensor[i, valid_response_length - 1] -= invalid_action_penalty_coef * action_invalids

              if 'step_rewards' in data.batch.keys():
                  step_rewards[i] -= invalid_action_penalty_coef * action_invalids

          # Fraction of valid actions in the batch, logged as a metric.
          valid_action_ratio = np.mean(data.non_tensor_batch['is_action_valid'].astype(np.float32)).item()
          metrics = {'valid_action_ratio': valid_action_ratio}
          return data, metrics
      
    • In compute_advantage: more algorithm options are added, but they are still adopted from verl.
    • _timer is moved from mark_timer into this file.
    • The megatron strategy is removed from the _validate_config function.
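
For orientation, here is the schematic wiring referenced in the main_ppo.py item above. It is based only on the imports and argument names listed in these notes; the constructor arguments for TrajectoryCollector, EpisodeRewardManager, and RayPPOTrainer are assumptions (the real verl-agent signatures take more required arguments, such as worker mappings and resource pools), so treat this as a sketch rather than a drop-in snippet.

# Schematic only. Argument names follow the notes above; constructor signatures
# are assumptions and will differ from the actual verl-agent code.
from agent_system.environments import make_envs
from agent_system.multi_turn_rollout import TrajectoryCollector
from agent_system.reward_manager.episode import EpisodeRewardManager
from verl.trainer.ppo.ray_trainer import RayPPOTrainer  # verl-agent's patched trainer

def build_trainer(config, tokenizer, **base_trainer_kwargs):
    # make_envs is assumed to return train and validation environments.
    envs, val_envs = make_envs(config)
    traj_collector = TrajectoryCollector(config, tokenizer)      # assumed signature
    reward_fn = EpisodeRewardManager(tokenizer, num_examine=0)   # assumed signature

    return RayPPOTrainer(
        config=config,
        tokenizer=tokenizer,
        reward_fn=reward_fn,            # episode-level reward manager
        traj_collector=traj_collector,  # extra arguments added by verl-agent
        envs=envs,
        val_envs=val_envs,
        **base_trainer_kwargs,          # worker mappings, resource pools, etc.
    )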

Notes:
  • Line 117: verl now supports async actor rollout.
  • Verl added do_profile-related parts. Need to check them out.

About PPO

Others:

LTP connection

  1. The script should look like it is running a real training process.
  2. Redirect the output to a file in the blob for extra safety.
  3. A ready-to-run script to pull the latest code.

Machines:

    1. Local Windows machine
    2. Remote 4xA6000 (where the code is edited)
    3. GitHub cloud repo
    4. LTP machine

Setup:

  1. We connect to the remote 4xA6000 via SSH.
  2. We connect to the LTP machine locally via tmate.
  3. On the remote machine, we can push code to the GitHub cloud repo.
  4. On the LTP machine, we can pull from the GitHub cloud repo.

Workflow:

  1. We make changes to the code on the remote 4xA6000.
  2. We push the code to the GitHub cloud repo.
  3. We pull the code from the GitHub cloud repo on the LTP machine.
  4. We run the script on the LTP machine.

Quick script to prepare the environment:

  • We have a script in the Azure blob that pulls the latest code from the GitHub cloud repo (a sketch follows).
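
A minimal sketch of what such a helper could look like; the repo URL and working directory below are hypothetical placeholders, not the actual values used in this setup.

# Sketch of a "pull the latest code" helper. REPO_URL and WORKDIR are
# hypothetical placeholders.
import pathlib
import subprocess

REPO_URL = "https://github.com/<user>/<repo>.git"      # placeholder
WORKDIR = pathlib.Path.home() / "code" / "swe-agent"   # placeholder

def pull_latest() -> None:
    if (WORKDIR / ".git").exists():
        # Repo already cloned: fast-forward to the latest commit.
        subprocess.run(["git", "-C", str(WORKDIR), "pull", "--ff-only"], check=True)
    else:
        WORKDIR.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(["git", "clone", REPO_URL, str(WORKDIR)], check=True)

if __name__ == "__main__":
    pull_latest()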

Agentica rLLM

  • https://github.com/agentica-project/rllm/tree/main

1. Starting point: rllm.trainer.verl.train_agent_ppo

  • Use config.ppo_trainer as the base config.
  • Env: rllm.environments.swe.swe
  • Agent: rllm.agents.swe_agent

2. It imports AgentPPOTrainer from rllm.trainer.verl.agent_ppo_trainer and runs init_workers and fit_agent. Look into the AgentPPOTrainer class.

  • Branch 1: it inherits from verl.trainer.ppo.ray_trainer.RayPPOTrainer.

  • Implements init_workers: calls the super class and initializes an additional AsyncAgentExecutionEngine.
  • Implements init_envs_and_agents: initializes the environments and the agents concurrently. Agent initialization mostly amounts to passing in parameters. Both are then used to set up the AgentExecutionEngine.
  • Implements fit_agent: the main training function.
  1. If validating before training, call self._validate_agent to get evaluation results on the validation set.
  2. Start the iteration loop over epochs. For each epoch, iterate over the training data.
    • For each batch, since we need to sample k times for each task, we first repeat the batch k times.
    • Skipped parts: use_stepwise_advantage, use_rm (if using a reward model).
    • Then we roll out with the generate_agent_trajectory function and also get the metrics (reward).
    • We then calculate the values from the critic model.
    • Then we calculate the advantages:
      • Here we assume a static reward function instead of using a reward model.
      • We reject samples/tasks where all k rollouts are wrong or all k are correct: those are not useful for training (see the toy sketch after this list).
      • We calculate the old log probs from the old policy.
      • We calculate the ref log probs from the ref policy.
      • We use a compute_advantage function to calculate the advantages.
    • We skip the balance_batch part, which balances the number of valid tokens in the batch.
    • We first update the critic.
    • We skip the critic_warmup part.
    • We test on the validation set if the condition is met.
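
As a toy illustration of the rejection step and the group-baseline idea mentioned in the list above (this is not rLLM's actual compute_advantage), with k binary rewards per task:

# Toy illustration only: drop uninformative tasks and compute group-relative
# advantages from k binary rewards per task. Not rLLM's compute_advantage.
import numpy as np

def filter_and_compute_advantages(rewards_per_task):
    advantages = {}
    for task_id, rewards in rewards_per_task.items():
        rewards = np.asarray(rewards, dtype=np.float32)
        # Reject tasks where all k rollouts fail or all succeed: every rollout
        # would get the same advantage, so there is no learning signal.
        if rewards.min() == rewards.max():
            continue
        # Group-relative baseline: subtract the per-task mean reward.
        advantages[task_id] = rewards - rewards.mean()
    return advantages

# k = 4 rollouts per task
print(filter_and_compute_advantages({
    "task_a": [1, 0, 0, 1],   # mixed outcomes -> kept
    "task_b": [0, 0, 0, 0],   # all wrong -> rejected
}))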

Useful Commands

ACR

# List every repo (requires Catalog Lister)
az acr repository list -n msraairgroup -o table

# List tags in a given repo
az acr repository show-tags -n msraairgroup \
    --repository slimshetty/swebench-verified -o table

# Show properties of one tag (works even without Catalog Lister)
az acr repository show -n msraairgroup \
    --image slimshetty/swebench-verified:sweb.eval_x86_64.astropy__astropy-12907