Code for running reinforcement-learning (RL) experiments on continuing (non-episodic) problems.
This repository contains code for (1) different RL algorithms, (2) some environments, and (3) the agent-environment loop for running experiments with different parameters and over multiple runs.
`agents/`:
- Prediction algorithms:
  - Differential TD (Wan, Naik, Sutton, 2021)
  - Differential TD(lambda) (Naik, Sutton, 2022; Naik, 2024)
  - Discounted TD(lambda) (Sutton, 1988)
  - Discounted TD(lambda) with reward centering (Naik, Wan, Tomar, Sutton, 2024; Naik, 2024)
  - Average-cost TD(lambda) (Tsitsiklis, Van Roy, 1999)
- Control algorithms:
  - Discounted Q-learning (Watkins, Dayan, 1992)
  - Discounted Sarsa (Rummery, Niranjan, 1994)
  - Discounted Q-learning with reward centering (Naik, Wan, Tomar, Sutton, 2024; Naik, 2024)
  - Discounted Sarsa with reward centering
  - Differential Q-learning (Wan, Naik, Sutton, 2021)
`environments/`:
- Some multi-armed bandits
- Acrobot
- An n-state random walk (Naik, Sutton, 2022)
- A couple of other simple diagnostic environments
- AccessControl, Catch, Puckworld, and other continuing environments can be run from my fork of Zhao et al.'s (2022) csuite.
`config_files/`: JSON files containing all the parameters required to run a particular experiment

`utils/`: various utilities and helper functions

`experiments.py`: contains the agent-environment interaction loop

`main.py`: used to start an experiment based on the parameters specified in `config_files/`
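A config file is a JSON object of experiment parameters. The keys below are purely illustrative (hypothetical) and are not the repository's actual schema; consult the files in `config_files/` for the real parameter names:

```json
{
  "environment": "accesscontrol",
  "agent": "q_learning_with_centering",
  "gamma": 0.99,
  "eta": 0.1,
  "step_size": 0.1,
  "num_steps": 100000,
  "num_runs": 30
}
```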
An example experiment can be run using:

```
python main.py --config-file='config_files/accesscontrol/test.json' --output-path='results/test_exp/'
```
Some basic plotting code is in plot_results_example.ipynb.
The prediction algorithms can be run with linear function approximation (using tile coding (see Sutton & Barto (2018): Section 9.5.4)) and tabular representations (via a one-hot encoding).
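The tabular case is recovered from the linear one because a one-hot feature vector turns the linear value estimate into a table lookup. A minimal sketch (`one_hot` is an illustrative helper, not the repository's API):

```python
import numpy as np

def one_hot(state_index, num_states):
    """Tabular features as a one-hot vector: linear function
    approximation over these features is exactly a lookup table."""
    phi = np.zeros(num_states)
    phi[state_index] = 1.0
    return phi

# With one-hot features, the linear estimate w @ phi(s) reduces to
# w[s], so the tabular algorithms are a special case of the linear ones.
weights = np.array([0.0, 0.5, 1.0])
assert weights @ one_hot(2, 3) == weights[2]
```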
The control algorithms can be run with tabular, linear, and non-linear function approximation. The non-linear algorithms are essentially Mnih et al.'s (2015) DQN, Naik, Wan, Tomar, and Sutton's (2024) DQN with reward centering, and the differential version of DQN.
There is a single algorithmic implementation; different parameter choices yield the different algorithms. For example, for the control algorithms, there is one implementation of a discounted algorithm with reward centering. Then:
- $\gamma \in [0,1),\ \eta = 0$: Discounted Q-learning
- $\gamma \in [0,1),\ \eta > 0$: Discounted Q-learning with reward centering
- $\gamma = 1,\ \eta > 0$: Differential Q-learning
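The unification above can be sketched as a single tabular update rule, following the TD-error-based updates in Wan, Naik, and Sutton (2021) and Naik et al. (2024). This is an illustrative sketch, not the repository's actual implementation; `centered_q_update` and its signature are hypothetical:

```python
import numpy as np

def centered_q_update(Q, r_bar, s, a, r, s_next, alpha, gamma, eta):
    """One step of a unified Q-learning update with reward centering.
    Parameter choices recover the different algorithms:
      gamma in [0,1), eta = 0: discounted Q-learning (r_bar stays 0)
      gamma in [0,1), eta > 0: discounted Q-learning with reward centering
      gamma = 1,      eta > 0: Differential Q-learning
    """
    delta = r - r_bar + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * delta          # action-value update
    r_bar += eta * alpha * delta      # reward-rate estimate update
    return Q, r_bar
```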
The code in this repository can run most—if not all—experiments in the following works:
- Naik, 2024: Reinforcement Learning for Continuing Problems Using Average Reward, Ph.D. Dissertation, University of Alberta. [Link]
- Naik, Wan, Tomar, Sutton, 2024: Reward Centering, RLC. [Link]
- Naik, Sutton, 2022: Multi-Step Average-Reward Prediction via Differential TD(lambda), RLDM. [Link]
- Wan*, Naik*, Sutton, 2021: Learning and Planning in Average-Reward Markov Decision Processes, ICML. [Link]
Note: Instead of maintaining multiple public repositories on GitHub for all the different projects in my Ph.D., I created this single repository, which can likely run every experiment in my dissertation.
However, I have not re-run all of those experiments with this unified codebase.
If you encounter unexpected results, feel free to reach out to me at abhisheknaik22296@gmail.com and I will be happy to work through them with you :)