VLM + VLA Integration System with LangGraph Memory and LeRobot Simulation and Hardware (Rust Port Planned)

Below, the language prompt "pick up the blue bottle and place it in the pink bowl" is executed with smolVLA on the LeRobot SO101 arm.

This document outlines a comprehensive system that integrates Vision Language Models (VLMs) for high-level reasoning with Vision Language Action (VLA) models for motor control, using LangGraph for memory management and LeRobot for simulation and hardware.

The model fine-tuned in the notebook takes three inputs: an image, a text instruction, and the robot state.
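As a rough illustration of the three-input structure, the sketch below shows one observation as the policy might receive it; the key names and shapes are assumptions for illustration, not the actual LeRobot dataset schema.

```python
# Minimal sketch of the three training inputs. Key names and shapes are
# placeholders, not the real LeRobot schema.
observation = {
    "image": [[0.0] * 4 for _ in range(3)],  # stand-in for a camera frame (real one is e.g. 480x640 RGB)
    "text": "Grab the blue bottle and put it in the pink bowl.",
    "robot_state": [0.0, 0.1, -0.2, 0.3, 0.0, 0.5],  # joint positions of the SO101 arm
}

def validate_observation(obs):
    """Check that all three modalities expected by the fine-tuned policy are present."""
    required = {"image", "text", "robot_state"}
    missing = required - obs.keys()
    if missing:
        raise KeyError(f"observation missing modalities: {missing}")
    return True
```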
VLM Layer
- Purpose: High-level reasoning, task planning, and tool calling
- Framework: LangChain/LangGraph with persistent memory
- Key Features:
- Natural language instruction processing
- Visual scene understanding
- Tool calling for complex tasks
- Memory-aware conversations
Memory Layer
- Framework: LangGraph with checkpointers
- Components:
- Short-term memory (conversation context)
- Long-term memory (cross-session persistence)
- Thread management for multi-user scenarios
- Persistent storage backends (SQLite, PostgreSQL, Redis)
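The checkpointer pattern behind these components can be sketched in plain Python: state is persisted per `thread_id` so a conversation can resume across sessions. This is NOT the LangGraph API, just a minimal illustration of the idea using the SQLite backend listed above.

```python
import sqlite3
import json

class SqliteCheckpointer:
    """Toy sketch of the checkpointer pattern LangGraph uses: conversation
    state is saved per thread_id so sessions can resume. Not the real API."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints (thread_id TEXT PRIMARY KEY, state TEXT)"
        )

    def put(self, thread_id, state):
        # Upsert the serialized state for this conversation thread.
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?)",
            (thread_id, json.dumps(state)),
        )
        self.conn.commit()

    def get(self, thread_id):
        row = self.conn.execute(
            "SELECT state FROM checkpoints WHERE thread_id = ?", (thread_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None

# Each user/session gets its own thread_id, which is what enables multi-user memory.
ckpt = SqliteCheckpointer()
ckpt.put("user-1", {"messages": ["pick up the blue bottle"]})
```

Swapping the SQLite connection for PostgreSQL or Redis changes only the storage backend, not the thread-keyed access pattern.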
VLA Layer
- Purpose: Low-level motor control and action generation
- Integration: OpenVLA, π0, or custom VLA models
- Key Features:
- Action tokenization
- Motor command generation
- Real-time inference (20-80Hz)
- Trajectory planning
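Action tokenization can be sketched as uniform binning of continuous joint commands into discrete tokens, the general scheme used by VLA models such as OpenVLA; the bin count and value range below are assumptions for illustration.

```python
def tokenize_action(values, low=-1.0, high=1.0, bins=256):
    """Discretize continuous joint commands into integer tokens.
    Range and bin count are illustrative, not a specific model's config."""
    tokens = []
    for v in values:
        v = min(max(v, low), high)  # clamp to the valid range
        tokens.append(int((v - low) / (high - low) * (bins - 1)))
    return tokens

def detokenize_action(tokens, low=-1.0, high=1.0, bins=256):
    """Map tokens back to bin-center continuous values for the motor controller."""
    return [low + (t + 0.5) * (high - low) / bins for t in tokens]

cmd = [0.0, 0.5, -1.0]
tokens = tokenize_action(cmd)
```

The round-trip error is bounded by the bin width (here 2/256 per dimension), which is what makes discrete token prediction usable for continuous control.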
Simulation Layer
- Platform: LeRobot framework
- Features:
- Multi-embodiment support
- Real-time physics simulation
- Dataset collection and replay
- Cross-platform compatibility
# Core frameworks
langgraph>=0.2.0
langchain>=0.3.0
transformers>=4.40.0
torch>=2.0.0
# LeRobot and robotics
lerobot
gymnasium
pybullet
# Vision and VLA models
opencv-python
pillow
open-vla # or specific VLA model
# Memory backends
# sqlite3 is built into Python (no pip install needed)
psycopg2-binary # PostgreSQL
redis # Redis backend

Input Processing
- Camera feeds provide RGB images
- Natural language instructions from users
- Previous conversation context from memory
VLM Processing
- Processes multimodal inputs (vision + language)
- Performs high-level reasoning and task decomposition
- Makes tool calls for complex operations
- Updates conversation memory
Action Generation
- VLA model receives high-level commands from VLM
- Generates low-level motor commands
- Produces action tokens for robot execution
Robot Execution
- LeRobot simulation executes commands
- Provides real-time feedback
- Updates environment state
Memory Updates
- Execution results stored in memory
- Context maintained across conversations
- Long-term learning patterns captured
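The five stages above form one control loop, sketched below. All class and method names are placeholders standing in for the real VLM, VLA, robot, and memory interfaces; the stubs exist only so the loop runs end to end.

```python
def run_pipeline_step(vlm, vla, robot, memory, instruction, thread_id):
    """One pass through the VLM -> VLA -> robot -> memory loop.
    All interfaces here are hypothetical stand-ins."""
    image = robot.get_camera_frame()                # 1. input processing
    context = memory.get(thread_id) or {"history": []}
    plan = vlm.reason(image, instruction, context)  # 2. high-level VLM reasoning
    actions = vla.generate_actions(image, plan)     # 3. low-level action tokens
    result = robot.execute(actions)                 # 4. robot execution
    context["history"].append({"plan": plan, "result": result})
    memory.put(thread_id, context)                  # 5. memory update
    return result

# Trivial stubs so the loop is runnable; real code wires in the VLM, the VLA
# policy, and the LeRobot interface instead.
class _Stub:
    def get_camera_frame(self): return "frame"
    def reason(self, img, instr, ctx): return f"plan: {instr}"
    def generate_actions(self, img, plan): return [1, 2, 3]
    def execute(self, actions): return "success"

class _Memory(dict):
    def put(self, k, v): self[k] = v

stub = _Stub()
mem = _Memory()
result = run_pipeline_step(stub, stub, stub, mem, "pick up the blue bottle", "t1")
```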
Roadmap
- Set up LangGraph with memory persistence
- Integrate OpenVLA or similar VLA model
- Connect to LeRobot simulation environment
- Implement basic VLM tool calling
- Multi-user memory management
- Real-time performance optimization
- Advanced tool integration
- Cross-embodiment support
- Scalability improvements
- Real robot integration
- Safety and robustness testing
- Rust-based porting
- VLM: Claude, GPT-4V, or open-source alternatives
- VLA: OpenVLA-7B, π0, SmolVLA, or custom models
- Memory Backend: SQLite (development), PostgreSQL (production)
- GPU: NVIDIA A1000 for large VLA models
- RAM: 32-64GB for model loading and memory management
- Storage: 1TB SSD recommended for fast memory operations
This system provides a robust foundation for building sophisticated robotic systems that combine the reasoning capabilities of large language models with the precise control needed for physical manipulation tasks.
NOTE: These instructions are for Ubuntu 22.04.
git clone https://github.com/s0um0r0y/LeRobot-Object-Manipulation-VLM --recurse-submodules .
conda install ffmpeg=7.1.1 -c conda-forge
cd lerobot && pip install -e .
# find the USB ports with these commands
conda activate lerobot && cd lerobot
lsusb
lerobot-find-port
sudo chmod 777 /dev/ttyACM0
sudo chmod 777 /dev/ttyACM1
# use this command below to find the camera
lerobot-find-cameras opencv
# teleoperation command with camera
lerobot-teleoperate \
--robot.type=so101_follower \
--robot.port=/dev/ttyACM0 \
--robot.id=my_awesome_follower_arm \
--teleop.type=so101_leader \
--teleop.port=/dev/ttyACM1 \
--teleop.id=my_awesome_leader_arm \
--robot.cameras="{ front: {type: opencv, index_or_path: /dev/video2, width: 640, height: 480, fps: 30}}" \
--display_data=true
# to record data for a small VLA dataset
# 5 episodes recorded so far; need at least 50
# Press Right Arrow (→): Early stop the current episode or reset time and move to the next.
# Press Left Arrow (←): Cancel the current episode and re-record it.
# Press Escape (ESC): Immediately stop the session, encode videos, and upload the dataset.
export HF_USER=s0um0r0y
rm -rf /home/soumoroy/.cache/huggingface/lerobot/s0um0r0y/record-test
lerobot-record \
--robot.type=so101_follower \
--robot.port=/dev/ttyACM0 \
--robot.id=my_awesome_follower_arm \
--robot.cameras="{ front: {type: opencv, index_or_path: /dev/video2, width: 640, height: 480, fps: 30}}" \
--teleop.type=so101_leader \
--teleop.port=/dev/ttyACM1 \
--teleop.id=my_awesome_leader_arm \
--display_data=true \
--dataset.repo_id=${HF_USER}/record-test \
--dataset.num_episodes=50 \
--dataset.reset_time_s=10 \
--resume=True \
--dataset.root=/home/soumoroy/.cache/huggingface/lerobot/s0um0r0y/record-test \
--dataset.single_task="Grab the blue bottle and put it in the pink bowl."
# replaying the first episode to check
lerobot-replay \
--robot.type=so101_follower \
--robot.port=/dev/ttyACM0 \
--robot.id=my_awesome_follower_arm \
--dataset.repo_id=${HF_USER}/record-test \
--dataset.episode=0
# setup for smolVLA
rm -rf outputs/train/my_smolvla
python src/lerobot/scripts/lerobot_train.py \
--policy.path=cijerezg/smolvla-test \
--policy.repo_id=${HF_USER}/my_smolvla_model \
--dataset.repo_id=${HF_USER}/record-test \
--batch_size=64 \
--steps=20000 \
--output_dir=outputs/train/my_smolvla \
--job_name=my_smolvla_training \
--policy.device=cuda \
--wandb.enable=true \
--resume=true
# evaluating fine-tuned smolVLA model
# in lerobot_dataset.py, change the line below so that exist_ok is True
# obj.root.mkdir(parents=True, exist_ok=True)
lerobot-record \
--robot.type=so101_follower \
--robot.port=/dev/ttyACM0 \
--robot.id=my_awesome_follower_arm \
--robot.cameras="{ front: {type: opencv, index_or_path: /dev/video2, width: 640, height: 480, fps: 30}}" \
--dataset.single_task="Grab the blue bottle and put it in the pink bowl." \
--dataset.repo_id=${HF_USER}/eval_blue_bottle_pink_bowl \
--dataset.episode_time_s=50 \
--dataset.num_episodes=10 \
--teleop.type=so101_leader \
--teleop.port=/dev/ttyACM1 \
--teleop.id=my_awesome_leader_arm \
--policy.path=${HF_USER}/my_smolvla \
--display_data=true
# The teleop flags above are optional; include them if you want to teleoperate between episodes.
- See urdf_testing for the environment setup
- Code for running smolVLA in simulation is in pybullet_sim/envs/run_pybullet_sim.py
- Custom dataset for lifting the blue parachute bottle and placing it in the pink bowl: custom dataset link
- smolVLA model fine-tuned for the blue-bottle and pink-bowl case: Hugging Face model link
- New dataset for lifting a black marker and placing it in a white bowl: custom dataset link
- smolVLA model fine-tuned for the black-marker and white-bowl case: Hugging Face model link
- I would like to thank Sai Vishwak for his guidance and support on this project.
@misc{soumo_roy_2025,
author = { Soumo Roy and saivishwak },
title = { black_marker_whitebowl_SO101 (Revision b702f88) },
year = 2025,
url = { https://huggingface.co/datasets/s0um0r0y/black_marker_whitebowl_SO101 },
doi = { 10.57967/hf/6677 },
publisher = { Hugging Face }
}
@article{shukor2025smolvla,
title={SmolVLA: A vision-language-action model for affordable and efficient robotics},
author={Shukor, Mustafa and Aubakirova, Dana and Capuano, Francesco and Kooijmans, Pepijn and Palma, Steven and Zouitine, Adil and Aractingi, Michel and Pascal, Caroline and Russi, Martino and Marafioti, Andres and others},
journal={arXiv preprint arXiv:2506.01844},
year={2025}
}
@inproceedings{todorov2012mujoco,
title={MuJoCo: A physics engine for model-based control},
author={Todorov, Emanuel and Erez, Tom and Tassa, Yuval},
booktitle={2012 IEEE/RSJ International Conference on Intelligent Robots and Systems},
pages={5026--5033},
year={2012},
organization={IEEE},
doi={10.1109/IROS.2012.6386109}
}
@misc{coumans2021,
author = {Erwin Coumans and Yunfei Bai},
title = {PyBullet, a Python module for physics simulation for games, robotics and machine learning},
howpublished = {\url{http://pybullet.org}},
year = {2016--2021}
}



