Training a SWE Agent from Scratch: Everything you need to know
Agentic RL Frameworks
1. ROLL
2. AReaL
3. SkyRL: Specialized for SWE Agent with Verl
4. Verl-agent: Multi-turn RL with Verl
Learning about AReaL
The key idea of AReaL is to completely separate inference from training so that rollouts can be generated asynchronously.
To understand and use it, the first step is to learn how the repo is organized and where to customize: which code files to get familiar with, and where to plug in your own logic.
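As a mental model of that separation (a conceptual sketch, not AReaL's actual API; every function and class name here is a hypothetical placeholder), rollout workers keep producing trajectories into a queue while the trainer consumes whatever has finished, so training never blocks on generation:

```python
import asyncio
import random


class PolicyVersion:
    """Shared counter so rollout workers can tag trajectories with the policy version they used."""
    def __init__(self):
        self.v = 0


async def generate_rollout(version: int) -> dict:
    # Hypothetical stand-in for the inference engine producing one trajectory.
    await asyncio.sleep(random.uniform(0.05, 0.2))
    return {"policy_version": version, "reward": random.random()}


async def rollout_worker(queue: asyncio.Queue, policy: PolicyVersion):
    # Producer: keeps generating with whatever weights are currently deployed.
    while True:
        await queue.put(await generate_rollout(policy.v))


async def train_loop(queue: asyncio.Queue, policy: PolicyVersion, steps: int = 3, batch_size: int = 4):
    # Consumer: trains on whatever trajectories have finished, possibly from slightly stale policies.
    for step in range(steps):
        batch = [await queue.get() for _ in range(batch_size)]
        print(f"step {step}: trained on policy versions {[t['policy_version'] for t in batch]}")
        policy.v += 1  # pretend the updated weights were pushed back to the inference side


async def main():
    queue: asyncio.Queue = asyncio.Queue(maxsize=16)
    policy = PolicyVersion()
    workers = [asyncio.create_task(rollout_worker(queue, policy)) for _ in range(2)]
    await train_loop(queue, policy)
    for w in workers:
        w.cancel()


asyncio.run(main())
```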
The first example is main_async_ppo.py, which is a quick-start example.
- May need to modify a config file: from realhf.experiments.common.ppo_math_exp import PPOMATHConfig.
Learning about Verl
Verl-agent
- In main_ppo.py, the following imports are added:
  - from agent_system.reward_manager.episode import EpisodeRewardManager
  - from agent_system.multi_turn_rollout import TrajectoryCollector
  - from agent_system.environments import make_envs
- Then, when RayPPOTrainer is initialized, the traj_collector, envs, and val_envs are passed in as additional arguments, and the reward manager is passed in as reward_fn (see the sketch below).
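A rough sketch of that wiring, under simplified assumptions: the constructor arguments of EpisodeRewardManager and TrajectoryCollector, the return values of make_envs, and the names of the extra RayPPOTrainer arguments are guesses here, not verl-agent's verified API.

```python
# Sketch only: argument names and call signatures are assumptions, not verl-agent's exact API.
from agent_system.environments import make_envs
from agent_system.multi_turn_rollout import TrajectoryCollector
from agent_system.reward_manager.episode import EpisodeRewardManager
from verl.trainer.ppo.ray_trainer import RayPPOTrainer


def build_trainer(config, tokenizer, role_worker_mapping, resource_pool_manager):
    # Environments for training and validation, built from the config.
    envs, val_envs = make_envs(config)

    # Collects multi-turn trajectories by stepping the envs with the rollout workers.
    traj_collector = TrajectoryCollector(config, tokenizer)

    # Episode-level reward manager, used as the reward_fn inside the trainer.
    reward_fn = EpisodeRewardManager(tokenizer=tokenizer, num_examine=0)

    trainer = RayPPOTrainer(
        config=config,
        tokenizer=tokenizer,
        role_worker_mapping=role_worker_mapping,
        resource_pool_manager=resource_pool_manager,
        reward_fn=reward_fn,
        # Extra arguments added by verl-agent on top of the stock trainer:
        traj_collector=traj_collector,
        envs=envs,
        val_envs=val_envs,
    )
    return trainer
```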
In ray_trainer.py:
- Intact function added:

```python
import numpy as np
import torch

from verl import DataProto


def apply_invalid_action_penalty(data: DataProto, invalid_action_penalty_coef: float):
    """Subtract a penalty from the reward of samples whose action was invalid."""
    reward_tensor = data.batch['token_level_scores']
    if 'step_rewards' in data.batch.keys():
        step_rewards = data.batch['step_rewards']
    for i in range(len(data)):
        data_item = data[i]  # DataProtoItem
        prompt_ids = data_item.batch['prompts']
        prompt_length = prompt_ids.shape[-1]
        valid_response_length = data_item.batch['attention_mask'][prompt_length:].sum()
        action_valids = data_item.non_tensor_batch['is_action_valid'].astype(np.float32)
        action_invalids = torch.tensor(1 - action_valids, dtype=torch.float32,
                                       device=prompt_ids.device).squeeze(0)
        # invalid action penalty, applied at the last valid response token
        # assert reward_tensor[i, valid_response_length - 1] != 0.0, f'i={i}'
        reward_tensor[i, valid_response_length - 1] -= invalid_action_penalty_coef * action_invalids
        if 'step_rewards' in data.batch.keys():
            step_rewards[i] -= invalid_action_penalty_coef * action_invalids
    valid_action_ratio = np.mean(data.non_tensor_batch['is_action_valid'].astype(np.float32)).item()
    metrics = {'valid_action_ratio': valid_action_ratio}
    return data, metrics
```
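  In short: the penalty is subtracted at the last valid response token (the same position where the episode-level score sits), mirrored into step_rewards when step-wise rewards are present, and valid_action_ratio is returned as a training metric.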
- In compute_advantage: add more algorithm options, but they should still be adopted from verl.
- Moved _timer from mark_timer to this file (a minimal sketch of such a helper appears after these notes).
- Removed the megatron strategy from the _validate_config function.
- Intact function added:
Notes:
- Line 117: verl now supports async actor rollout.
- Verl added do_profile-related parts. Need to check them out.
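The _timer helper mentioned above can be written as a context manager that accumulates elapsed seconds into a dict. A minimal illustrative sketch (not the repo's actual implementation, whose signature may differ):

```python
import time
from contextlib import contextmanager


@contextmanager
def _timer(name: str, timing_raw: dict):
    # Record wall-clock seconds spent inside the `with` block under `name`.
    start = time.perf_counter()
    yield
    timing_raw[name] = timing_raw.get(name, 0.0) + time.perf_counter() - start


# Usage: accumulate per-phase timings within a training step.
timing_raw = {}
with _timer("gen", timing_raw):
    time.sleep(0.1)   # stand-in for rollout generation
with _timer("update_actor", timing_raw):
    time.sleep(0.05)  # stand-in for the policy update
print(timing_raw)
```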
About PPO
- 图解大模型RLHF系列之：人人都能看懂的PPO原理与源码解读 (Illustrated LLM RLHF series: PPO principles and source-code walkthrough that anyone can understand): intuitive and easy to understand
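For quick reference, the clipped surrogate objective at the heart of PPO, with probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ and advantage estimate $\hat{A}_t$:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right]$$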
Others:
- 从 tokenizer 视角来分析 Agentic 多轮训练的复杂性 (Analyzing the complexity of agentic multi-turn training from the tokenizer's perspective)
- Verl from a system perspective: what is "HybridFlow" or "Colocate"? How are GPUs allocated?
- https://zhuanlan.zhihu.com/p/12871616401: good for understanding the basics of Ray
- UltraScale Playbook: good for understanding the basics of large-scale distributed training
LTP connection
- The script should look like it is running a real training process; redirect the output to a file in the blob for extra safety.
- A ready-to-run script to pull the latest code.
Machines:
- Local Windows machine
- Remote 4xA6000: code
- GitHub cloud repo
- LTP machine
Setup:
- We connect to the remote 4xA6000 via SSH.
- We connect to the LTP machine via tmate locally.
- On the remote machine, we can push code to the GitHub cloud repo.
- On the LTP machine, we can pull from the GitHub cloud repo.
Workflow:
- We make some changes to the code on the remote 4xA6000.
- We push the code to the GitHub cloud repo.
- We pull the code from the GitHub cloud repo on the LTP machine.
- We run the script on the LTP machine.
Quick script to prepare the environment:
- We have a script in the Azure blob that can pull the latest code from the GitHub cloud repo; a sketch of what it could look like follows.
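A minimal sketch of such a script, assuming a hypothetical repo URL, checkout path, blob mount point, and training entry point (none of these paths or names come from the original notes):

```python
# Sketch only: REPO_URL, WORK_DIR, BLOB_LOG_DIR, and the entry point are hypothetical placeholders.
import datetime
import pathlib
import subprocess

REPO_URL = "https://github.com/<user>/<repo>.git"   # hypothetical
WORK_DIR = pathlib.Path.home() / "swe-agent-code"   # hypothetical checkout location
BLOB_LOG_DIR = pathlib.Path("/mnt/blob/logs")       # hypothetical blob mount for extra safety


def run(cmd, **kwargs):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True, **kwargs)


def main():
    # 1. Pull the latest code from the GitHub cloud repo (clone on first run).
    if (WORK_DIR / ".git").exists():
        run(["git", "-C", str(WORK_DIR), "pull", "--ff-only"])
    else:
        run(["git", "clone", REPO_URL, str(WORK_DIR)])

    # 2. Launch the training entry point and redirect stdout/stderr to a blob-backed file,
    #    so the run looks (and logs) like a real training process.
    BLOB_LOG_DIR.mkdir(parents=True, exist_ok=True)
    log_path = BLOB_LOG_DIR / f"train_{datetime.datetime.now():%Y%m%d_%H%M%S}.log"
    with open(log_path, "w") as log_file:
        run(["python", "main_ppo.py"], cwd=WORK_DIR, stdout=log_file, stderr=subprocess.STDOUT)


if __name__ == "__main__":
    main()
```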
Agentica rLLM
- https://github.com/agentica-project/rllm/tree/main
1. Start point: rllm.trainer.verl.train_agent_ppo
- Uses config.ppo_trainer as the base config.
- Env: rllm.environments.swe.swe
- Agent: rllm.agents.swe_agent
2. Runs from rllm.trainer.verl.agent_ppo_trainer import AgentPPOTrainer, then init_workers and fit_agent. Look into the AgentPPOTrainer class.
Branch 1: it inherits from verl.trainer.ppo.ray_trainer.RayPPOTrainer.
- Implements init_workers: calls the super class and initializes an additional AsyncAgentExecutionEngine.
- Implements init_envs_and_agents: initializes the environments and the agents concurrently. The agent initialization is more like passing in parameters. They are then used to set up the AgentExecutionEngine.
- Implements fit_agent: the main function.
  - If validating before training, call self._validate_agent to get an evaluation on the validation set.
  - Start the iteration loop over epochs. For each epoch, iterate over the training data.
  - For each batch, since we need to sample k times for each task, we first repeat the batch k times.
  - Skipped parts: use_stepwise_advantage, use_rm (if using a reward model).
  - Then we first roll out with the generate_agent_trajectory function, and also get the metrics (reward).
  - We then calculate the values from the critic model.
  - Then we calculate the advantages:
    - we assume a static reward function here instead of using a reward model
    - we reject samples/tasks where all k rollouts are wrong or all k are correct: those are not useful for training (a minimal filtering sketch follows this walkthrough)
    - we calculate the old log probs from the old policy
    - we calculate the ref log probs from the ref policy
    - we use a compute_advantage function to calculate the advantages
  - We skip the balance_batch part, which balances the number of valid tokens across the batch.
  - We first update the critic.
  - We skip the critic_warmup part.
  - We test with the validation set if the condition is met.
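A minimal sketch of that rejection step, assuming binary rewards laid out so that each task's k rollouts are consecutive (the tensor names and grouping layout are assumptions, not rLLM's actual code):

```python
import torch


def filter_uninformative_groups(rewards: torch.Tensor, k: int) -> torch.Tensor:
    """Boolean mask over samples: drop tasks whose k rollouts are all correct or all
    wrong, since group-relative advantages are zero for them."""
    grouped = rewards.view(-1, k)            # (num_tasks, k); consecutive k entries share a task
    all_correct = (grouped == 1.0).all(dim=1)
    all_wrong = (grouped == 0.0).all(dim=1)
    keep_task = ~(all_correct | all_wrong)   # keep only mixed-outcome tasks
    return keep_task.repeat_interleave(k)    # expand back to a per-sample mask


# Example: 3 tasks, k = 4 rollouts each
rewards = torch.tensor([1., 1., 1., 1.,   # task 0: all correct -> dropped
                        0., 1., 0., 1.,   # task 1: mixed       -> kept
                        0., 0., 0., 0.])  # task 2: all wrong   -> dropped
mask = filter_uninformative_groups(rewards, k=4)
print(mask)  # first and last four entries are False, the middle four are True
```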
Useful Commands
ACR
# List every repo (requires Catalog Lister)
az acr repository list -n msraairgroup -o table
# List tags in a given repo
az acr repository show-tags -n msraairgroup \
--repository slimshetty/swebench-verified -o table
# Show properties of one tag (works even without Catalog Lister)
az acr repository show -n msraairgroup \
--image slimshetty/swebench-verified:sweb.eval_x86_64.astropy__astropy-12907