[feature] Training AI agents with RL

Some libraries, such as ART, support efficient RL training of AI agents using Unsloth: https://docs.unsloth.ai/basics/reinforcement-learning-rl-guide/training-ai-agents-with-rl

Currently we support most dataset-based use cases with GRPO and DPO, but support for training agents with RL would also be useful!