LeRobot-VLA-Object-Manipulation

  • VLM + VLA Integration System with LangGraph Memory, LeRobot Simulation, and Hardware support (Rust port planned)

  • Below, the language prompt "pick up the blue bottle and place it in the pink bowl" is executed with SmolVLA on the LeRobot SO101 arm

System Overview

  • This document outlines a system that integrates Vision Language Models (VLMs) for high-level reasoning with Vision Language Action (VLA) models for motor control, using LangGraph for memory management and LeRobot for simulation and hardware deployment.

  • The model takes three inputs (image, text, and robot state); the fine-tuning procedure is shown in the accompanying notebook


Architecture Components

1. Vision Language Model (VLM) Layer

  • Purpose: High-level reasoning, task planning, and tool calling
  • Framework: LangChain/LangGraph with persistent memory
  • Key Features:
    • Natural language instruction processing
    • Visual scene understanding
    • Tool calling for complex tasks
    • Memory-aware conversations
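The tool-calling path can be sketched as a simple registry dispatch. This is a minimal, stdlib-only sketch; `pick_and_place`, the tool-call dict format, and the registry are hypothetical placeholders, not the project's actual tools or VLM output schema:

```python
# Hypothetical tool the VLM can invoke to hand a sub-task to the VLA layer.
def pick_and_place(obj: str, target: str) -> str:
    return f"moving {obj} to {target}"

# Registry mapping tool names (as emitted by the VLM) to callables.
TOOLS = {"pick_and_place": pick_and_place}

def dispatch(tool_call: dict) -> str:
    """Route a parsed tool call from the VLM to the matching function."""
    fn = TOOLS[tool_call["name"]]
    return fn(**tool_call["args"])

result = dispatch({"name": "pick_and_place",
                   "args": {"obj": "blue bottle", "target": "pink bowl"}})
print(result)  # moving blue bottle to pink bowl
```

A real implementation would bind such tools via LangChain's tool-calling interface; the dispatch logic is the same idea.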

2. Memory Management System

  • Framework: LangGraph with checkpointers
  • Components:
    • Short-term memory (conversation context)
    • Long-term memory (cross-session persistence)
    • Thread management for multi-user scenarios
    • Persistent storage backends (SQLite, PostgreSQL, Redis)
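Conceptually, a checkpointer gives each conversation thread its own persisted history. Below is a stdlib-only sketch of that behavior, with an in-memory dict standing in for the SQLite/PostgreSQL/Redis backends; it is not the LangGraph API itself:

```python
from collections import defaultdict

class MemoryStore:
    """In-memory stand-in for a thread-scoped checkpointer backend."""

    def __init__(self):
        # One conversation history per thread_id (multi-user support).
        self._threads = defaultdict(list)

    def append(self, thread_id: str, role: str, content: str) -> None:
        self._threads[thread_id].append({"role": role, "content": content})

    def context(self, thread_id: str) -> list:
        # Short-term memory: the full history for one active thread.
        return list(self._threads[thread_id])

store = MemoryStore()
store.append("user-1", "user", "Pick up the blue bottle.")
store.append("user-1", "assistant", "Grasping the bottle now.")
store.append("user-2", "user", "Stop the robot.")
assert len(store.context("user-1")) == 2  # threads are isolated
```

Swapping the dict for a database table is what turns this short-term store into the cross-session long-term memory listed above.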

3. Vision Language Action (VLA) Layer

  • Purpose: Low-level motor control and action generation
  • Integration: OpenVLA, π0, or custom VLA models
  • Key Features:
    • Action tokenization
    • Motor command generation
    • Real-time inference (20-80Hz)
    • Trajectory planning
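Action tokenization can be illustrated with uniform binning: continuous joint commands are discretized into a fixed vocabulary and de-tokenized back into motor commands at execution time. The bin count and normalized range below are illustrative assumptions, not the settings of any particular VLA model:

```python
N_BINS = 256          # illustrative vocabulary size
LOW, HIGH = -1.0, 1.0  # normalized joint-command range (assumed)

def tokenize(action: float) -> int:
    """Map a continuous command to the nearest discrete bin index."""
    clipped = max(LOW, min(HIGH, action))
    return round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1))

def detokenize(token: int) -> float:
    """Map a bin index back to a continuous motor command."""
    return LOW + token / (N_BINS - 1) * (HIGH - LOW)

token = tokenize(0.5)
recovered = detokenize(token)
assert abs(recovered - 0.5) < (HIGH - LOW) / N_BINS  # quantization error bound
```

The round-trip error is bounded by the bin width, which is why finer vocabularies trade inference cost for control precision.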

4. Simulation Environment and Hardware Deployment

  • Platform: LeRobot framework
  • Features:
    • Multi-embodiment support
    • Real-time physics simulation
    • Dataset collection and replay
    • Cross-platform compatibility

Implementation Details

Core Dependencies

# Core frameworks
langgraph>=0.2.0
langchain>=0.3.0
transformers>=4.40.0
torch>=2.0.0

# LeRobot and robotics
lerobot
gymnasium
pybullet

# Vision and VLA models
opencv-python
pillow
open-vla  # or specific VLA model

# Memory backends
sqlite3  # Built-in
psycopg2-binary  # PostgreSQL
redis  # Redis backend

System Flow

  1. Input Processing

    • Camera feeds provide RGB images
    • Natural language instructions from users
    • Previous conversation context from memory
  2. VLM Processing

    • Processes multimodal inputs (vision + language)
    • Performs high-level reasoning and task decomposition
    • Makes tool calls for complex operations
    • Updates conversation memory
  3. Action Generation

    • VLA model receives high-level commands from VLM
    • Generates low-level motor commands
    • Produces action tokens for robot execution
  4. Robot Execution

    • LeRobot simulation executes commands
    • Provides real-time feedback
    • Updates environment state
  5. Memory Updates

    • Execution results stored in memory
    • Context maintained across conversations
    • Long-term learning patterns captured
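The five stages above can be wired together as a single control step. Every function here is a stub standing in for the real VLM, VLA, and LeRobot calls; only the data flow between stages is meant to be accurate:

```python
def vlm_plan(image, instruction, context):
    # Stage 2: high-level reasoning over multimodal input (stubbed).
    return {"skill": "pick_and_place",
            "object": "blue bottle", "target": "pink bowl"}

def vla_act(plan, robot_state):
    # Stage 3: low-level action generation from the high-level plan (stubbed).
    return [0.1, -0.2, 0.05]  # dummy joint deltas

def robot_execute(actions):
    # Stage 4: simulated execution returning feedback (stubbed).
    return {"success": True, "new_state": actions}

def run_step(image, instruction, robot_state, memory):
    plan = vlm_plan(image, instruction, memory)          # stages 1-2
    actions = vla_act(plan, robot_state)                 # stage 3
    feedback = robot_execute(actions)                    # stage 4
    memory.append({"plan": plan, "feedback": feedback})  # stage 5
    return feedback

memory = []
feedback = run_step(image=None, instruction="Grab the blue bottle.",
                    robot_state=[0.0, 0.0, 0.0], memory=memory)
assert feedback["success"] and len(memory) == 1
```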

Development Roadmap

Phase 1: Basic Integration

  • Set up LangGraph with memory persistence
  • Integrate OpenVLA or similar VLA model
  • Connect to LeRobot simulation environment
  • Implement basic VLM tool calling

Phase 2: Advanced Features

  • Multi-user memory management
  • Real-time performance optimization
  • Advanced tool integration
  • Cross-embodiment support

Phase 3: Production Deployment

  • Scalability improvements
  • Real robot integration
  • Safety and robustness testing
  • Rust-based port

Technical Considerations

Model Selection

  • VLM: Claude, GPT-4V, or open-source alternatives
  • VLA: OpenVLA-7B, π0, SmolVLA, or custom models
  • Memory Backend: SQLite (development), PostgreSQL (production)

Hardware Requirements

  • GPU: NVIDIA A100 (or comparable) for large VLA models
  • RAM: 32-64GB for model loading and memory management
  • Storage: 1TB SSD recommended for fast memory operations

This system provides a robust foundation for building sophisticated robotic systems that combine the reasoning capabilities of large language models with the precise control needed for physical manipulation tasks.

Hardware Setup Instructions

NOTE: These instructions are for Ubuntu 22.04

git clone https://github.com/s0um0r0y/LeRobot-Object-Manipulation-VLM --recurse-submodules .
conda install ffmpeg=7.1.1 -c conda-forge
cd lerobot && pip install -e .

# find the USB port with this command
conda activate lerobot && cd lerobot
lsusb
lerobot-find-port
sudo chmod 777 /dev/ttyACM0
sudo chmod 777 /dev/ttyACM1

# use this command below to find the camera
lerobot-find-cameras opencv

# teleoperation command with camera
lerobot-teleoperate \
    --robot.type=so101_follower \
    --robot.port=/dev/ttyACM0 \
    --robot.id=my_awesome_follower_arm \
    --teleop.type=so101_leader \
    --teleop.port=/dev/ttyACM1 \
    --teleop.id=my_awesome_leader_arm \
    --robot.cameras="{ front: {type: opencv, index_or_path: /dev/video2, width: 640, height: 480, fps: 30}}" \
    --display_data=true

# to record data for a small VLA dataset
# currently 5 episodes; aim for at least 50
# Press Right Arrow (→): Early stop the current episode or reset time and move to the next.
# Press Left Arrow (←): Cancel the current episode and re-record it.
# Press Escape (ESC): Immediately stop the session, encode videos, and upload the dataset.
export HF_USER=s0um0r0y
rm -rf /home/soumoroy/.cache/huggingface/lerobot/s0um0r0y/record-test
lerobot-record \
    --robot.type=so101_follower \
    --robot.port=/dev/ttyACM0 \
    --robot.id=my_awesome_follower_arm \
    --robot.cameras="{ front: {type: opencv, index_or_path: /dev/video2, width: 640, height: 480, fps: 30}}" \
    --teleop.type=so101_leader \
    --teleop.port=/dev/ttyACM1 \
    --teleop.id=my_awesome_leader_arm \
    --display_data=true \
    --dataset.repo_id=${HF_USER}/record-test \
    --dataset.num_episodes=50 \
    --dataset.reset_time_s=10 \
    --resume=True \
    --dataset.root=/home/soumoroy/.cache/huggingface/lerobot/s0um0r0y/record-test \
    --dataset.single_task="Grab the blue bottle and put it in the pink bowl."

# replaying the first episode to check
lerobot-replay \
    --robot.type=so101_follower \
    --robot.port=/dev/ttyACM0 \
    --robot.id=my_awesome_follower_arm \
    --dataset.repo_id=${HF_USER}/record-test \
    --dataset.episode=0 

# setup for smolVLA
rm -rf outputs/train/my_smolvla
python src/lerobot/scripts/lerobot_train.py \
  --policy.path=cijerezg/smolvla-test \
  --policy.repo_id=${HF_USER}/my_smolvla_model \
  --dataset.repo_id=${HF_USER}/record-test \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/train/my_smolvla \
  --job_name=my_smolvla_training \
  --policy.device=cuda \
  --wandb.enable=true \
  --resume=true

# evaluating fine-tuned smolVLA model
# in lerobot_dataset.py, set exist_ok=True on this line:
# obj.root.mkdir(parents=True, exist_ok=True)
# the three --teleop.* flags are optional; include them to teleoperate between episodes
lerobot-record \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.id=my_awesome_follower_arm \
  --robot.cameras="{ front: {type: opencv, index_or_path: /dev/video2, width: 640, height: 480, fps: 30}}" \
  --dataset.single_task="Grab the blue bottle and put it in the pink bowl." \
  --dataset.repo_id=${HF_USER}/eval_blue_bottle_pink_bowl \
  --dataset.episode_time_s=50 \
  --dataset.num_episodes=10 \
  --teleop.type=so101_leader \
  --teleop.port=/dev/ttyACM1 \
  --teleop.id=my_awesome_leader_arm \
  --policy.path=${HF_USER}/my_smolvla \
  --display_data=true

Pybullet simulation setup (work in progress)

  • see urdf_testing to see the environment setup


  • code for the smolVLA is present in pybullet_sim/envs/run_pybullet_sim.py


Custom dataset and model link for 52 episodes

Acknowledgement

  • I would like to thank Sai Vishwak for his guidance and support on this project.

Citation

@misc{soumo_roy_2025,
	author       = { Soumo Roy and saivishwak },
	title        = { black_marker_whitebowl_SO101 (Revision b702f88) },
	year         = 2025,
	url          = { https://huggingface.co/datasets/s0um0r0y/black_marker_whitebowl_SO101 },
	doi          = { 10.57967/hf/6677 },
	publisher    = { Hugging Face }
}

@article{shukor2025smolvla,
  title={SmolVLA: A vision-language-action model for affordable and efficient robotics},
  author={Shukor, Mustafa and Aubakirova, Dana and Capuano, Francesco and Kooijmans, Pepijn and Palma, Steven and Zouitine, Adil and Aractingi, Michel and Pascal, Caroline and Russi, Martino and Marafioti, Andres and others},
  journal={arXiv preprint arXiv:2506.01844},
  year={2025}
}

@inproceedings{todorov2012mujoco,
  title={MuJoCo: A physics engine for model-based control},
  author={Todorov, Emanuel and Erez, Tom and Tassa, Yuval},
  booktitle={2012 IEEE/RSJ International Conference on Intelligent Robots and Systems},
  pages={5026--5033},
  year={2012},
  organization={IEEE},
  doi={10.1109/IROS.2012.6386109}
}

@MISC{coumans2021,
author =   {Erwin Coumans and Yunfei Bai},
title =    {PyBullet, a Python module for physics simulation for games, robotics and machine learning},
howpublished = {\url{http://pybullet.org}},
year = {2016--2021}
}
