Training a SWE Agent from Scratch: Everything you need to know
Agentic RL Frameworks
1. ROLL
2. AReaL
3. SkyRL: Specialized for SWE Agent with Verl
4. Verl-agent: Multi-turn RL with Verl
Learning about AReaL
The key idea of AReaL is to completely separate inference from training so that rollouts can be generated asynchronously.
To understand and use it, the first step is to learn how the repo is organized and where to customize: which code files to get familiar with, and where to plug in your own logic.
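As a mental model of that separation (a conceptual sketch, not AReaL's actual API; every function and class name here is a hypothetical placeholder), rollout workers keep producing trajectories into a queue while the trainer consumes whatever has finished, so training never blocks on generation:

```python
import asyncio
import random


class PolicyVersion:
    """Shared counter so rollout workers can tag trajectories with the policy version they used."""
    def __init__(self):
        self.v = 0


async def generate_rollout(version: int) -> dict:
    # Hypothetical stand-in for the inference engine producing one trajectory.
    await asyncio.sleep(random.uniform(0.05, 0.2))
    return {"policy_version": version, "reward": random.random()}


async def rollout_worker(queue: asyncio.Queue, policy: PolicyVersion):
    # Producer: keeps generating with whatever weights are currently deployed.
    while True:
        await queue.put(await generate_rollout(policy.v))


async def train_loop(queue: asyncio.Queue, policy: PolicyVersion, steps: int = 3, batch_size: int = 4):
    # Consumer: trains on whatever trajectories have finished, possibly from slightly stale policies.
    for step in range(steps):
        batch = [await queue.get() for _ in range(batch_size)]
        print(f"step {step}: trained on policy versions {[t['policy_version'] for t in batch]}")
        policy.v += 1  # pretend the updated weights were pushed back to the inference side


async def main():
    queue: asyncio.Queue = asyncio.Queue(maxsize=16)
    policy = PolicyVersion()
    workers = [asyncio.create_task(rollout_worker(queue, policy)) for _ in range(2)]
    await train_loop(queue, policy)
    for w in workers:
        w.cancel()


asyncio.run(main())
```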
The first example is main_async_ppo.py, which is a quick-start example.
- May need to modify a config file: from realhf.experiments.common.ppo_math_exp import PPOMATHConfig.
Learning about Verl
Verl-agent
- In main_ppo.py, the following imports are added:
  - from agent_system.reward_manager.episode import EpisodeRewardManager
  - from agent_system.multi_turn_rollout import TrajectoryCollector
  - from agent_system.environments import make_envs
- Then, when RayPPOTrainer is initialized, the traj_collector, envs, and val_envs are passed in as additional arguments, and the reward manager is passed in as reward_fn (see the sketch below).
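A rough sketch of that wiring, under simplified assumptions: the constructor arguments of EpisodeRewardManager and TrajectoryCollector, the return values of make_envs, and the names of the extra RayPPOTrainer arguments are guesses here, not verl-agent's verified API.

```python
# Sketch only: argument names and call signatures are assumptions, not verl-agent's exact API.
from agent_system.environments import make_envs
from agent_system.multi_turn_rollout import TrajectoryCollector
from agent_system.reward_manager.episode import EpisodeRewardManager
from verl.trainer.ppo.ray_trainer import RayPPOTrainer


def build_trainer(config, tokenizer, role_worker_mapping, resource_pool_manager):
    # Environments for training and validation, built from the config.
    envs, val_envs = make_envs(config)

    # Collects multi-turn trajectories by stepping the envs with the rollout workers.
    traj_collector = TrajectoryCollector(config, tokenizer)

    # Episode-level reward manager, used as the reward_fn inside the trainer.
    reward_fn = EpisodeRewardManager(tokenizer=tokenizer, num_examine=0)

    trainer = RayPPOTrainer(
        config=config,
        tokenizer=tokenizer,
        role_worker_mapping=role_worker_mapping,
        resource_pool_manager=resource_pool_manager,
        reward_fn=reward_fn,
        # Extra arguments added by verl-agent on top of the stock trainer:
        traj_collector=traj_collector,
        envs=envs,
        val_envs=val_envs,
    )
    return trainer
```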
In ray_trainer.py:
- Intact function added:

```python
import numpy as np
import torch

from verl import DataProto


def apply_invalid_action_penalty(data: DataProto, invalid_action_penalty_coef: float):
    """Subtract a penalty from the reward of samples whose action was invalid."""
    reward_tensor = data.batch['token_level_scores']
    if 'step_rewards' in data.batch.keys():
        step_rewards = data.batch['step_rewards']
    for i in range(len(data)):
        data_item = data[i]  # DataProtoItem
        prompt_ids = data_item.batch['prompts']
        prompt_length = prompt_ids.shape[-1]
        valid_response_length = data_item.batch['attention_mask'][prompt_length:].sum()
        action_valids = data_item.non_tensor_batch['is_action_valid'].astype(np.float32)
        action_invalids = torch.tensor(1 - action_valids, dtype=torch.float32,
                                       device=prompt_ids.device).squeeze(0)
        # invalid action penalty, applied at the last valid response token
        # assert reward_tensor[i, valid_response_length - 1] != 0.0, f'i={i}'
        reward_tensor[i, valid_response_length - 1] -= invalid_action_penalty_coef * action_invalids
        if 'step_rewards' in data.batch.keys():
            step_rewards[i] -= invalid_action_penalty_coef * action_invalids
    valid_action_ratio = np.mean(data.non_tensor_batch['is_action_valid'].astype(np.float32)).item()
    metrics = {'valid_action_ratio': valid_action_ratio}
    return data, metrics
```
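  In short: the penalty is subtracted at the last valid response token (the same position where the episode-level score sits), mirrored into step_rewards when step-wise rewards are present, and valid_action_ratio is returned as a training metric.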
- In compute_advantage: add more algorithm options, but they should still be adopted from verl.
- Moved _timer from mark_timer to this file (a minimal sketch of such a helper appears after these notes).
- Removed the megatron strategy from the _validate_config function.
- Intact function added:
Notes:
- Line 117: verl now supports async actor rollout.
- Verl added do_profile-related parts. Need to check them out.
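The _timer helper mentioned above can be written as a context manager that accumulates elapsed seconds into a dict. A minimal illustrative sketch (not the repo's actual implementation, whose signature may differ):

```python
import time
from contextlib import contextmanager


@contextmanager
def _timer(name: str, timing_raw: dict):
    # Record wall-clock seconds spent inside the `with` block under `name`.
    start = time.perf_counter()
    yield
    timing_raw[name] = timing_raw.get(name, 0.0) + time.perf_counter() - start


# Usage: accumulate per-phase timings within a training step.
timing_raw = {}
with _timer("gen", timing_raw):
    time.sleep(0.1)   # stand-in for rollout generation
with _timer("update_actor", timing_raw):
    time.sleep(0.05)  # stand-in for the policy update
print(timing_raw)
```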
About PPO
- 图解大模型RLHF系列之：人人都能看懂的PPO原理与源码解读 (Illustrated LLM RLHF series: PPO principles and source-code walkthrough that anyone can understand): intuitive and easy to understand
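For quick reference, the clipped surrogate objective at the heart of PPO, with probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ and advantage estimate $\hat{A}_t$:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right]$$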
Others:
- 从 tokenizer 视角来分析 Agentic 多轮训练的复杂性 (Analyzing the complexity of agentic multi-turn training from the tokenizer's perspective)
- Verl from a system perspective: what is "HybridFlow" or "Colocate"? How are GPUs allocated?
- https://zhuanlan.zhihu.com/p/12871616401: good for understanding the basics of Ray
- UltraScale Playbook: good for understanding the basics of large-scale distributed training
LTP connection
- The script should look like it is running a real training process; redirect the output to a file in the blob for extra safety.
- A ready-to-run script to pull the latest code.
Machines:
- Local Windows machine
- Remote 4xA6000: code
- GitHub cloud repo
- LTP machine
Setup:
- We connect to the remote 4xA6000 via SSH.
- We connect to the LTP machine via tmate locally.
- On the remote machine, we can push code to the GitHub cloud repo.
- On the LTP machine, we can pull from the GitHub cloud repo.
Workflow:
- We make some changes to the code on the remote 4xA6000.
- We push the code to the GitHub cloud repo.
- We pull the code from the GitHub cloud repo on the LTP machine.
- We run the script on the LTP machine.
Quick script to prepare the environment:
- We have a script in the Azure blob that can pull the latest code from the GitHub cloud repo; a sketch of what it could look like follows.
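A minimal sketch of such a script, assuming a hypothetical repo URL, checkout path, blob mount point, and training entry point (none of these paths or names come from the original notes):

```python
# Sketch only: REPO_URL, WORK_DIR, BLOB_LOG_DIR, and the entry point are hypothetical placeholders.
import datetime
import pathlib
import subprocess

REPO_URL = "https://github.com/<user>/<repo>.git"   # hypothetical
WORK_DIR = pathlib.Path.home() / "swe-agent-code"   # hypothetical checkout location
BLOB_LOG_DIR = pathlib.Path("/mnt/blob/logs")       # hypothetical blob mount for extra safety


def run(cmd, **kwargs):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True, **kwargs)


def main():
    # 1. Pull the latest code from the GitHub cloud repo (clone on first run).
    if (WORK_DIR / ".git").exists():
        run(["git", "-C", str(WORK_DIR), "pull", "--ff-only"])
    else:
        run(["git", "clone", REPO_URL, str(WORK_DIR)])

    # 2. Launch the training entry point and redirect stdout/stderr to a blob-backed file,
    #    so the run looks (and logs) like a real training process.
    BLOB_LOG_DIR.mkdir(parents=True, exist_ok=True)
    log_path = BLOB_LOG_DIR / f"train_{datetime.datetime.now():%Y%m%d_%H%M%S}.log"
    with open(log_path, "w") as log_file:
        run(["python", "main_ppo.py"], cwd=WORK_DIR, stdout=log_file, stderr=subprocess.STDOUT)


if __name__ == "__main__":
    main()
```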
Agentica rLLM
- https://github.com/agentica-project/rllm/tree/main
1. Start point: rllm.trainer.verl.train_agent_ppo
- Uses config.ppo_trainer as the base config.
- Env: rllm.environments.swe.swe
- Agent: rllm.agents.swe_agent
2. Runs from rllm.trainer.verl.agent_ppo_trainer import AgentPPOTrainer, then init_workers and fit_agent. Look into the AgentPPOTrainer class.
Branch 1: it inherits from verl.trainer.ppo.ray_trainer.RayPPOTrainer.
- Implements init_workers: calls the super class and initializes an additional AsyncAgentExecutionEngine.
- Implements init_envs_and_agents: initializes the environments and the agents concurrently. The agent initialization is more like passing in parameters. They are then used to set up the AgentExecutionEngine.
- Implements fit_agent: the main function.
  - If validating before training, call self._validate_agent to get an evaluation on the validation set.
  - Start the iteration loop over epochs. For each epoch, iterate over the training data.
  - For each batch, since we need to sample k times for each task, we first repeat the batch k times.
  - Skipped parts: use_stepwise_advantage, use_rm (if using a reward model).
  - Then we first roll out with the generate_agent_trajectory function, and also get the metrics (reward).
  - We then calculate the values from the critic model.
  - Then we calculate the advantages:
    - we assume a static reward function here instead of using a reward model
    - we reject samples/tasks where all k rollouts are wrong or all k are correct: those are not useful for training (a minimal filtering sketch follows this walkthrough)
    - we calculate the old log probs from the old policy
    - we calculate the ref log probs from the ref policy
    - we use a compute_advantage function to calculate the advantages
  - We skip the balance_batch part, which balances the number of valid tokens across the batch.
  - We first update the critic.
  - We skip the critic_warmup part.
  - We test with the validation set if the condition is met.
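A minimal sketch of that rejection step, assuming binary rewards laid out so that each task's k rollouts are consecutive (the tensor names and grouping layout are assumptions, not rLLM's actual code):

```python
import torch


def filter_uninformative_groups(rewards: torch.Tensor, k: int) -> torch.Tensor:
    """Boolean mask over samples: drop tasks whose k rollouts are all correct or all
    wrong, since group-relative advantages are zero for them."""
    grouped = rewards.view(-1, k)            # (num_tasks, k); consecutive k entries share a task
    all_correct = (grouped == 1.0).all(dim=1)
    all_wrong = (grouped == 0.0).all(dim=1)
    keep_task = ~(all_correct | all_wrong)   # keep only mixed-outcome tasks
    return keep_task.repeat_interleave(k)    # expand back to a per-sample mask


# Example: 3 tasks, k = 4 rollouts each
rewards = torch.tensor([1., 1., 1., 1.,   # task 0: all correct -> dropped
                        0., 1., 0., 1.,   # task 1: mixed       -> kept
                        0., 0., 0., 0.])  # task 2: all wrong   -> dropped
mask = filter_uninformative_groups(rewards, k=4)
print(mask)  # first and last four entries are False, the middle four are True
```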
Useful Commands
ACR
# List every repo (requires Catalog Lister)
az acr repository list -n msraairgroup -o table
# List tags in a given repo
az acr repository show-tags -n msraairgroup \
--repository slimshetty/swebench-verified -o table
# Show properties of one tag (works even without Catalog Lister)
az acr repository show -n msraairgroup \
--image slimshetty/swebench-verified:sweb.eval_x86_64.astropy__astropy-12907