A Lua script that lets ArduPilot learn to tune itself, live, within SITL.
Mentors: Nate Mailhot & Sanket Sharma | Timeframe: May → Sep 2025 | Project: GSoC
Hello ArduPilot community! I'm Bruno Andreoni Sarmento, a 4th‑year student at the Polytechnic School of the University of São Paulo (USP), pursuing a bachelor’s degree in Electrical Engineering. For the past two years I’ve been a member of Skyrats (USP’s autonomous‑drone team), where I’ve worked with many open‑source tools, ArduPilot being our core flight‑control framework.
I was inspired by other team members who took part in GSoC projects, and being selected for this awesome project is a dream come true!
Today, UAV parameter tuning still relies on manual adjustments, often requiring trial‑and‑error sweeps and test flights. This approach is:
- Time‑consuming and inefficient;
- Poorly reproducible;
- Not scalable for adaptive or autonomous systems.
Meanwhile, Reinforcement Learning (RL) offers a principled method for adaptive control, but the ArduPilot SITL environment lacks built‑in support for episodic interactions, which are essential for reliable training and evaluation.
This project builds a lightweight Lua‑based scripting layer on top of the existing SITL engine to enable episodic RL interactions.
The project introduces:
- Episodic Replay – a structured way to test and evaluate UAV behavior across multiple runs, starting from known or randomized states;
- State Reset Mechanism – using `sim:set_pose()` in Lua scripts to programmatically reset UAV position, velocity, and attitude between episodes;
- Metric Logging – automated telemetry collection (e.g., tracking error, convergence time) to evaluate each episode’s performance;
- Offline RL Phase (baseline) – log rich episodes and train an initial Q‑table/value‑function offline before any in‑place updates;
- Online RL Phase (final goal) – once the baseline is set, the same agent keeps learning live, updating parameters in real time during simulation.
This project builds the necessary infrastructure for online learning directly in SITL, unlocking new workflows for intelligent parameter tuning and experimental reinforcement learning within the ArduPilot ecosystem.
| Block | Role | Key ArduPilot / Lua APIs |
|---|---|---|
| Episode Manager | Detects episode termination (e.g. time‑out, crash, success) and fires a reset. | time:millis(), custom state checks |
| Teleport / Reset | Instantaneously re‑positions the vehicle and zeros attitude/velocity using sim:set_pose(). | sim:set_pose(instance, loc, orient, vel_bf, gyro) |
| Metrics Collector | Computes reward signals (tracking error, overshoot, energy) every tick. | ahrs:get_pitch(), vehicle:get_target_alt(), custom math |
| Reinforcement Learning | Updates the Q‑table / value function in‑place at ≈ 10 Hz and selects the next action. During the Offline RL Phase it runs in log‑only mode to gather state–action–reward traces for baseline training, switching to live updates in the Online RL Phase. | Pure Lua tables for the Q‑table; math lib |
| Policy Executor | Applies the action by tweaking parameters or sending direct control. | param:set("ATC_RAT_PIT_P", val), vehicle:set_target_roll() |
| Direct State Feed (EKF Opt 10) | Supplies the fused state to Lua via binding helpers (ahrs, ins, etc.). | Firmware internal |
| Sensor Simulation | Generates IMU/GPS from physics backend (Gazebo or built‑in). | SITL C++ |
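As a hedged illustration of the Episode Manager row, a termination check might look like the sketch below (it uses the global `millis()` binding; the time‑out, goal radius, and crash‑via‑disarm check are my own assumptions, not project code):

```lua
-- Hypothetical episode-termination check (sketch only, not project code).
local EPISODE_TIMEOUT_MS = 20000        -- assumed per-episode time-out
local GOAL_RADIUS_M      = 2            -- assumed success radius in metres
local episode_start_ms   = millis()

local function episode_done(goal_loc)
  if (millis() - episode_start_ms):toint() > EPISODE_TIMEOUT_MS then
    return true, "timeout"
  end
  if not arming:is_armed() then         -- crash failsafes normally disarm
    return true, "crash"
  end
  local pos = ahrs:get_location()
  if pos and pos:get_distance(goal_loc) < GOAL_RADIUS_M then
    return true, "success"
  end
  return false
end
```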
Each episode cycles through five steps:
- Reset → The script arms the vehicle and immediately calls `sim:set_pose()` to return to the starting state.
- Episode → The RL agent chooses an action (e.g., ±5 % on `ATC_RAT_RLL_P`).
- Flight → The controller responds; the Metrics Collector accumulates reward.
- Terminate → After N steps or when the goal is reached, the Episode Manager logs the result and triggers a new reset.
- Learn → The RL agent updates its policy online before the next episode begins.

Every tick is scheduled via `return loop, 100` (100 ms).
```lua
-- init
local episode   = EpisodeManager.new{max_steps = 200}
local policy    = QLearning.new{alpha = 0.1, gamma = 0.9, eps = 0.2}
local collector = Metrics.new()

function loop()
  if episode:done() then
    -- episode finished: teleport back to the start state and learn
    sim:set_pose(0, episode.start_loc, episode.start_ori)
    policy:learn(collector:get())
    episode:reset()
    collector:clear()
  else
    -- episode running: pick an action, apply it, record the transition
    local s = episode:get_state()
    local a = policy:select(s)
    PolicyExecutor.apply(a)
    collector:record(s, a)
  end
  return loop, 100 -- reschedule in 100 ms
end

return loop, 100
```

(Full implementation lives in https://github.com/b-andreoni/GSoC/blob/main/scripts)
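The block above is pseudocode; the full scripts live in the repository linked in parentheses. Purely as an illustrative sketch of the Policy Executor block (the action table, gain names, and clamping range are assumptions, not the project's actual values), `PolicyExecutor.apply` could look like:

```lua
-- Hypothetical policy executor (sketch): maps a discrete action index to
-- a +/-5 % tweak of a rate-controller gain via the param binding.
PolicyExecutor = {}

local ACTIONS = {
  { name = "ATC_RAT_RLL_P", factor = 1.05 },   -- raise roll rate P by 5 %
  { name = "ATC_RAT_RLL_P", factor = 0.95 },   -- lower roll rate P by 5 %
  { name = "ATC_RAT_PIT_P", factor = 1.05 },   -- raise pitch rate P by 5 %
  { name = "ATC_RAT_PIT_P", factor = 0.95 },   -- lower pitch rate P by 5 %
}

function PolicyExecutor.apply(action_index)
  local action = ACTIONS[action_index]
  if not action then
    return
  end
  local current = param:get(action.name)
  if current then
    -- clamp to an assumed safe range so exploration cannot destabilise the vehicle
    local new_val = math.max(0.01, math.min(0.5, current * action.factor))
    param:set(action.name, new_val)
  end
end
```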
This GSoC project is structured into five main phases:
| Phase | Timeframe | Key Deliverables |
|---|---|---|
| 1. Reset Mechanism | June 2025 | • Integrate sim:set_pose() into the SITL Lua API • Enable repeatable episode resets |
| 2. Offline RL Phase (baseline) | late June – July 2025 | • Collect episodic logs with a fixed policy • Train baseline value/Q‑function offline • Report baseline performance |
| 3. Online RL Loop | July 2025 | • Implement reward & metrics collection • Activate in‑place learning (SARSA/Q‑learning) |
| 4. Policy Refinement | August 2025 | • Add exploration strategies (e.g., ε‑decay) • Validate on varied flight scenarios |
| 5. Docs & Releases | Early Sep 2025 | • Publish user guide and examples • Open PRs/issues for community review |
I’d love to hear your field‑tested wisdom, war stories, and wish‑list items:
- Reset mechanics – any EKF glitches, timing tricks, or failsafes you’ve hit when using `sim:set_pose()`?
- Practical benefits – where could an in‑sim offline+online RL loop spare you manual PID sweeps or repetitive test flights?
Drop a comment on the PR, open an issue, or ping me on Discord (@bruno_as). Your input will directly shape the next commits.
Thanks in advance!
https://github.com/b-andreoni/GSoC/
- ArduPilot/ardupilot#29616
- https://github.com/ArduPilot/ardupilot/blob/master/libraries/AP_Scripting/examples/sim_arming_pos.lua
- https://github.com/ArduPilot/ardupilot/pull/29498/commits/51364c7f43af09c5f42d3e95315dc9c642dc093c
Reinforcement Learning is already possible
Hello again, ArduPilot community! Below is a concise overview of the milestones reached and obstacles overcome during the first half of GSoC 2025.
The initial task was to extend the example script sim_arming_pos.lua so that a teleport could be executed after arming, followed immediately by any mission of interest.
- AUTO mode - After arming, the vehicle is teleported anywhere in the world, takes off, and then switches to AUTO to fly a pre-saved mission.
- GUIDED mode - With AUTO working, waypoints are now injected dynamically via GUIDED, providing full path control from Lua (see the sketch below).
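As a rough sketch of this GUIDED-mode waypoint injection (the mode-number constant, the 20 m hop, and the 5 s scheduling interval are placeholder assumptions, not values from the actual scripts):

```lua
-- Hypothetical GUIDED-mode waypoint injection from Lua (sketch).
local COPTER_MODE_GUIDED = 4                    -- Copter mode number for GUIDED

local function goto_offset_north(metres)
  local target = ahrs:get_location()            -- current fused position
  if not target then
    return false
  end
  target:offset(metres, 0)                      -- shift the target north by `metres`
  return vehicle:set_target_location(target)    -- push the new GUIDED target
end

function update()
  if arming:is_armed() and vehicle:get_mode() == COPTER_MODE_GUIDED then
    goto_offset_north(20)                       -- placeholder 20 m hop
  end
  return update, 5000                           -- re-run every 5 s
end

return update, 5000
```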
A limitation emerged: subsequent teleports reused the initial NED offset, which quickly became impractical. This motivated a deeper change to sim:set_pose().
Thanks to @tridge, we already had a working set_pose!
However, episodic resets were still impossible: only the NED offset was updated, so the global position could not be changed more than once. The solution required editing ardupilot/libraries/SITL/SIM_Aircraft.cpp:
```cpp
// original
aircraft.position = aircraft.home.get_distance_NED_double(loc);

// replacement
aircraft.home     = loc;
aircraft.origin   = loc;
aircraft.position = Vector3d(0, 0, 0);
```

With home, origin, and position realigned, each call to sim:set_pose() now places the vehicle at the exact intended global location, enabling clean episode restarts.
The complete patch is already pushed to my public fork:
https://github.com/b-andreoni/ardupilot
Feel free to clone it instead of patching the file manually.
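With the patched fork, a minimal episodic reset from Lua might look like the sketch below, assuming the `sim:set_pose(instance, loc, orient, vel_bf, gyro)` signature listed in the architecture table of the first post; the coordinates are placeholders near the default SITL home.

```lua
-- Hypothetical episodic reset: teleport SITL instance 0 back to a fixed
-- global start location with level attitude and zeroed velocity and rates.
local function reset_episode()
  local start = Location()
  start:lat(-353632610)        -- placeholder latitude,  degrees * 1e7
  start:lng(1491652300)        -- placeholder longitude, degrees * 1e7
  start:alt(60000)             -- placeholder altitude,  centimetres

  local orient = Quaternion()
  orient:q1(1)                 -- identity quaternion: level attitude, yaw = 0

  sim:set_pose(0, start, orient, Vector3f(), Vector3f())
end
```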
To evaluate learning performance, a CSV logger was created that records:

```
elapsed_ms, fsm_state, flight_mode, is_armed,
lat_deg, lon_deg, alt_m,
vel_n, vel_e, vel_d,
roll, pitch, yaw, yaw_bhv,
gyro_x, gyro_y, gyro_z,
volt_v, curr_a
```
A flag is inserted whenever a reset occurs, allowing precise segmentation of episodes.
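A hedged sketch of how such a logger can be built from Lua, assuming the scripting sandbox's `io` file support and using only a subset of the fields above (the file name and column choice are placeholders):

```lua
-- Hypothetical CSV logger (sketch): writes a header, then one row per call
-- with a subset of the fields listed above. Attitude values are in radians.
local file = assert(io.open("rl_episode_log.csv", "w"), "could not open log file")
file:write("elapsed_ms,flight_mode,is_armed,lat_deg,lon_deg,alt_m,roll,pitch,yaw\n")

local start_ms = millis()

local function log_row()
  local loc = ahrs:get_location()
  if not loc then
    return                      -- no position estimate yet
  end
  file:write(string.format("%d,%d,%s,%.7f,%.7f,%.2f,%.3f,%.3f,%.3f\n",
    (millis() - start_ms):toint(),
    vehicle:get_mode(),
    tostring(arming:is_armed()),
    loc:lat() * 1e-7, loc:lng() * 1e-7, loc:alt() * 0.01,
    ahrs:get_roll(), ahrs:get_pitch(), ahrs:get_yaw()))
  file:flush()
end
```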
1.4 Customisable initial conditions at episodic resets
Reset hooks allow deterministic tweaks – e.g. yaw can be rotated by 45° at every episode start.
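Continuing the hypothetical `EpisodeManager` from the pseudocode in the first post, such a hook might be as small as the following (the `start_yaw` field and the hook mechanism are illustrative assumptions):

```lua
-- Hypothetical reset hook: rotate the requested start yaw by 45 degrees
-- on every episode so each run begins with a different heading.
local episode_count = 0

local function on_reset(episode)
  episode_count = episode_count + 1
  -- consumed later when the reset code builds the orientation quaternion
  -- passed to sim:set_pose()
  episode.start_yaw = math.rad(45) * (episode_count % 8)
end
```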
A lightweight online Q-learning loop was implemented to find the most energy-efficient order in which to visit five waypoints (inspired by the IMAV 2024 first task). Workflow:
- Teleport to home via `sim:set_pose()`.
- Take off to 10 m AGL.
- Select the next waypoint ε-greedily (ε = 0.2, decaying).
- Log voltage and current every 50 ms; the negative of the energy spent is the reward.
- Apply a Q-update at each waypoint (α = 0.1, γ = 0.9).
- End of episode: compare total energy to the best so far; repeat.
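As a hedged sketch of the ε-greedy selection and the tabular update with those constants (the state/action encoding is a simplified placeholder, and ε-decay is omitted for brevity):

```lua
-- Hypothetical tabular Q-learning core for the waypoint-ordering task.
local ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2
local Q = {}                               -- Q[state][action] -> value

local function q_get(s, a)
  Q[s] = Q[s] or {}
  return Q[s][a] or 0
end

-- epsilon-greedy choice among the waypoints not yet visited
local function select_action(state, remaining_wps)
  if math.random() < EPSILON then
    return remaining_wps[math.random(#remaining_wps)]
  end
  local best_a, best_v = remaining_wps[1], -math.huge
  for _, a in ipairs(remaining_wps) do
    local v = q_get(state, a)
    if v > best_v then best_a, best_v = a, v end
  end
  return best_a
end

-- reward = negative energy spent flying the last leg
local function q_update(s, a, reward, s_next, next_wps)
  local best_next = 0                      -- terminal state: no future value
  if #next_wps > 0 then
    best_next = -math.huge
    for _, a2 in ipairs(next_wps) do
      best_next = math.max(best_next, q_get(s_next, a2))
    end
  end
  local updated = q_get(s, a) + ALPHA * (reward + GAMMA * best_next - q_get(s, a))
  Q[s][a] = updated
end
```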
Convergence to the optimal path occurs after ≈ 200 episodes.
To reproduce the results, first build the patched fork:

```
git clone git@github.com:b-andreoni/ardupilot.git
cd ardupilot
./waf configure --board sitl
./waf copter
```

Then:

- Clone the scripts repo at https://github.com/b-andreoni/GSoC/tree/main: `git clone git@github.com:b-andreoni/GSoC.git`
- Enable Lua scripts: `param set SCR_ENABLE 1`.
- Copy `scripts/reinforcement_learning/rl_wp.lua` into the working `scripts` folder and run `sim_vehicle.py -v ArduCopter --map`.

Results so far:

- Teleport reliability: a stress test of consecutive resets shows no EKF divergence.
- Energy-optimal path: Q-learning finds the best path between waypoints.
- Publish additional toy examples (e.g., PID tuning for precision landing and waypoint centralization, plane flare optimization).
- Write a wiki-style guide highlighting the reset API, logging template, and common pitfalls.
- Polish the reinforcement-learning scripts.
- Gather community feedback.
- Stretch goal: Gazebo / Isaac Sim bridge - Investigate adapter layers for their reset functions, enabling cross-sim RL experiments.
Feedback and suggestions, mainly for desired tunable parameters, are very welcome!


