LeRobot-VLA-Object-Manipulation

  • VLM + VLA Integration System with LangGraph Memory, LeRobot Simulation, and Hardware support (Rust port planned)

  • Below, the language prompt "pick up the blue bottle and place it in the pink bowl" is executed with SmolVLA on the LeRobot SO101 arm

System Overview

  • This document outlines a system that integrates Vision Language Models (VLMs) for high-level reasoning with Vision Language Action (VLA) models for motor control, using LangGraph for memory management and LeRobot for simulation and hardware deployment.

  • The model takes three inputs (image, text, and robot state); the fine-tuning procedure is shown in the accompanying notebook


Architecture Components

1. Vision Language Model (VLM) Layer

  • Purpose: High-level reasoning, task planning, and tool calling
  • Framework: LangChain/LangGraph with persistent memory
  • Key Features:
    • Natural language instruction processing
    • Visual scene understanding
    • Tool calling for complex tasks
    • Memory-aware conversations
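The tool-calling path can be sketched as a simple registry dispatch. This is a minimal, stdlib-only sketch; `pick_and_place`, the tool-call dict format, and the registry are hypothetical placeholders, not the project's actual tools or VLM output schema:

```python
# Hypothetical tool the VLM can invoke to hand a sub-task to the VLA layer.
def pick_and_place(obj: str, target: str) -> str:
    return f"moving {obj} to {target}"

# Registry mapping tool names (as emitted by the VLM) to callables.
TOOLS = {"pick_and_place": pick_and_place}

def dispatch(tool_call: dict) -> str:
    """Route a parsed tool call from the VLM to the matching function."""
    fn = TOOLS[tool_call["name"]]
    return fn(**tool_call["args"])

result = dispatch({"name": "pick_and_place",
                   "args": {"obj": "blue bottle", "target": "pink bowl"}})
print(result)  # moving blue bottle to pink bowl
```

A real implementation would bind such tools via LangChain's tool-calling interface; the dispatch logic is the same idea.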

2. Memory Management System

  • Framework: LangGraph with checkpointers
  • Components:
    • Short-term memory (conversation context)
    • Long-term memory (cross-session persistence)
    • Thread management for multi-user scenarios
    • Persistent storage backends (SQLite, PostgreSQL, Redis)
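Conceptually, a checkpointer gives each conversation thread its own persisted history. Below is a stdlib-only sketch of that behavior, with an in-memory dict standing in for the SQLite/PostgreSQL/Redis backends; it is not the LangGraph API itself:

```python
from collections import defaultdict

class MemoryStore:
    """In-memory stand-in for a thread-scoped checkpointer backend."""

    def __init__(self):
        # One conversation history per thread_id (multi-user support).
        self._threads = defaultdict(list)

    def append(self, thread_id: str, role: str, content: str) -> None:
        self._threads[thread_id].append({"role": role, "content": content})

    def context(self, thread_id: str) -> list:
        # Short-term memory: the full history for one active thread.
        return list(self._threads[thread_id])

store = MemoryStore()
store.append("user-1", "user", "Pick up the blue bottle.")
store.append("user-1", "assistant", "Grasping the bottle now.")
store.append("user-2", "user", "Stop the robot.")
assert len(store.context("user-1")) == 2  # threads are isolated
```

Swapping the dict for a database table is what turns this short-term store into the cross-session long-term memory listed above.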

3. Vision Language Action (VLA) Layer

  • Purpose: Low-level motor control and action generation
  • Integration: OpenVLA, π0, or custom VLA models
  • Key Features:
    • Action tokenization
    • Motor command generation
    • Real-time inference (20-80Hz)
    • Trajectory planning
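Action tokenization can be illustrated with uniform binning: continuous joint commands are discretized into a fixed vocabulary and de-tokenized back into motor commands at execution time. The bin count and normalized range below are illustrative assumptions, not the settings of any particular VLA model:

```python
N_BINS = 256          # illustrative vocabulary size
LOW, HIGH = -1.0, 1.0  # normalized joint-command range (assumed)

def tokenize(action: float) -> int:
    """Map a continuous command to the nearest discrete bin index."""
    clipped = max(LOW, min(HIGH, action))
    return round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1))

def detokenize(token: int) -> float:
    """Map a bin index back to a continuous motor command."""
    return LOW + token / (N_BINS - 1) * (HIGH - LOW)

token = tokenize(0.5)
recovered = detokenize(token)
assert abs(recovered - 0.5) < (HIGH - LOW) / N_BINS  # quantization error bound
```

The round-trip error is bounded by the bin width, which is why finer vocabularies trade inference cost for control precision.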

4. Simulation Environment and Hardware Deployment

  • Platform: LeRobot framework
  • Features:
    • Multi-embodiment support
    • Real-time physics simulation
    • Dataset collection and replay
    • Cross-platform compatibility

Implementation Details

Core Dependencies

# Core frameworks
langgraph>=0.2.0
langchain>=0.3.0
transformers>=4.40.0
torch>=2.0.0

# LeRobot and robotics
lerobot
gymnasium
pybullet

# Vision and VLA models
opencv-python
pillow
open-vla  # or specific VLA model

# Memory backends
sqlite3  # Built-in
psycopg2-binary  # PostgreSQL
redis  # Redis backend

System Flow

  1. Input Processing

    • Camera feeds provide RGB images
    • Natural language instructions from users
    • Previous conversation context from memory
  2. VLM Processing

    • Processes multimodal inputs (vision + language)
    • Performs high-level reasoning and task decomposition
    • Makes tool calls for complex operations
    • Updates conversation memory
  3. Action Generation

    • VLA model receives high-level commands from VLM
    • Generates low-level motor commands
    • Produces action tokens for robot execution
  4. Robot Execution

    • LeRobot simulation executes commands
    • Provides real-time feedback
    • Updates environment state
  5. Memory Updates

    • Execution results stored in memory
    • Context maintained across conversations
    • Long-term learning patterns captured
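The five stages above can be wired together as a single control step. Every function here is a stub standing in for the real VLM, VLA, and LeRobot calls; only the data flow between stages is meant to be accurate:

```python
def vlm_plan(image, instruction, context):
    # Stage 2: high-level reasoning over multimodal input (stubbed).
    return {"skill": "pick_and_place",
            "object": "blue bottle", "target": "pink bowl"}

def vla_act(plan, robot_state):
    # Stage 3: low-level action generation from the high-level plan (stubbed).
    return [0.1, -0.2, 0.05]  # dummy joint deltas

def robot_execute(actions):
    # Stage 4: simulated execution returning feedback (stubbed).
    return {"success": True, "new_state": actions}

def run_step(image, instruction, robot_state, memory):
    plan = vlm_plan(image, instruction, memory)          # stages 1-2
    actions = vla_act(plan, robot_state)                 # stage 3
    feedback = robot_execute(actions)                    # stage 4
    memory.append({"plan": plan, "feedback": feedback})  # stage 5
    return feedback

memory = []
feedback = run_step(image=None, instruction="Grab the blue bottle.",
                    robot_state=[0.0, 0.0, 0.0], memory=memory)
assert feedback["success"] and len(memory) == 1
```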

Development Roadmap

Phase 1: Basic Integration

  • Set up LangGraph with memory persistence
  • Integrate OpenVLA or similar VLA model
  • Connect to LeRobot simulation environment
  • Implement basic VLM tool calling

Phase 2: Advanced Features

  • Multi-user memory management
  • Real-time performance optimization
  • Advanced tool integration
  • Cross-embodiment support

Phase 3: Production Deployment

  • Scalability improvements
  • Real robot integration
  • Safety and robustness testing
  • Rust-based port

Technical Considerations

Model Selection

  • VLM: Claude, GPT-4V, or open-source alternatives
  • VLA: OpenVLA-7B, π0, SmolVLA, or custom models
  • Memory Backend: SQLite (development), PostgreSQL (production)

Hardware Requirements

  • GPU: NVIDIA A100 (or comparable) for large VLA models
  • RAM: 32-64GB for model loading and memory management
  • Storage: 1TB SSD recommended for fast memory operations

This system provides a robust foundation for building sophisticated robotic systems that combine the reasoning capabilities of large language models with the precise control needed for physical manipulation tasks.

Hardware Setup Instructions

NOTE: These instructions are for Ubuntu 22.04

git clone https://github.com/s0um0r0y/LeRobot-Object-Manipulation-VLM --recurse-submodules .
conda install ffmpeg=7.1.1 -c conda-forge
cd lerobot && pip install -e .

# find the USB port with this command
conda activate lerobot && cd lerobot
lsusb
lerobot-find-port
sudo chmod 777 /dev/ttyACM0
sudo chmod 777 /dev/ttyACM1

# use this command below to find the camera
lerobot-find-cameras opencv

# teleoperation command with camera
lerobot-teleoperate \
    --robot.type=so101_follower \
    --robot.port=/dev/ttyACM0 \
    --robot.id=my_awesome_follower_arm \
    --teleop.type=so101_leader \
    --teleop.port=/dev/ttyACM1 \
    --teleop.id=my_awesome_leader_arm \
    --robot.cameras="{ front: {type: opencv, index_or_path: /dev/video2, width: 640, height: 480, fps: 30}}" \
    --display_data=true

# to record data for a small VLA dataset
# currently 5 episodes; aim for at least 50
# Press Right Arrow (→): Early stop the current episode or reset time and move to the next.
# Press Left Arrow (←): Cancel the current episode and re-record it.
# Press Escape (ESC): Immediately stop the session, encode videos, and upload the dataset.
export HF_USER=s0um0r0y
rm -rf /home/soumoroy/.cache/huggingface/lerobot/s0um0r0y/record-test
lerobot-record \
    --robot.type=so101_follower \
    --robot.port=/dev/ttyACM0 \
    --robot.id=my_awesome_follower_arm \
    --robot.cameras="{ front: {type: opencv, index_or_path: /dev/video2, width: 640, height: 480, fps: 30}}" \
    --teleop.type=so101_leader \
    --teleop.port=/dev/ttyACM1 \
    --teleop.id=my_awesome_leader_arm \
    --display_data=true \
    --dataset.repo_id=${HF_USER}/record-test \
    --dataset.num_episodes=50 \
    --dataset.reset_time_s=10 \
    --resume=True \
    --dataset.root=/home/soumoroy/.cache/huggingface/lerobot/s0um0r0y/record-test \
    --dataset.single_task="Grab the blue bottle and put it in the pink bowl."

# replaying the first episode to check
lerobot-replay \
    --robot.type=so101_follower \
    --robot.port=/dev/ttyACM0 \
    --robot.id=my_awesome_follower_arm \
    --dataset.repo_id=${HF_USER}/record-test \
    --dataset.episode=0 

# setup for smolVLA
rm -rf outputs/train/my_smolvla
python src/lerobot/scripts/lerobot_train.py \
  --policy.path=cijerezg/smolvla-test \
  --policy.repo_id=${HF_USER}/my_smolvla_model \
  --dataset.repo_id=${HF_USER}/record-test \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/train/my_smolvla \
  --job_name=my_smolvla_training \
  --policy.device=cuda \
  --wandb.enable=true \
  --resume=true

# evaluating fine-tuned smolVLA model
# in lerobot_dataset.py, set exist_ok=True on this line:
# obj.root.mkdir(parents=True, exist_ok=True)
# the three --teleop.* flags are optional; include them to teleoperate between episodes
lerobot-record \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.id=my_awesome_follower_arm \
  --robot.cameras="{ front: {type: opencv, index_or_path: /dev/video2, width: 640, height: 480, fps: 30}}" \
  --dataset.single_task="Grab the blue bottle and put it in the pink bowl." \
  --dataset.repo_id=${HF_USER}/eval_blue_bottle_pink_bowl \
  --dataset.episode_time_s=50 \
  --dataset.num_episodes=10 \
  --teleop.type=so101_leader \
  --teleop.port=/dev/ttyACM1 \
  --teleop.id=my_awesome_leader_arm \
  --policy.path=${HF_USER}/my_smolvla \
  --display_data=true

Pybullet simulation setup (work in progress)

  • see urdf_testing to see the environment setup


  • code for the smolVLA is present in pybullet_sim/envs/run_pybullet_sim.py


Custom dataset and model link for 52 episodes

Acknowledgement

  • I would like to thank Sai Vishwak for his guidance and support on this project.

Citation

@misc{soumo_roy_2025,
	author       = { Soumo Roy and saivishwak },
	title        = { black_marker_whitebowl_SO101 (Revision b702f88) },
	year         = 2025,
	url          = { https://huggingface.co/datasets/s0um0r0y/black_marker_whitebowl_SO101 },
	doi          = { 10.57967/hf/6677 },
	publisher    = { Hugging Face }
}

@article{shukor2025smolvla,
  title={SmolVLA: A vision-language-action model for affordable and efficient robotics},
  author={Shukor, Mustafa and Aubakirova, Dana and Capuano, Francesco and Kooijmans, Pepijn and Palma, Steven and Zouitine, Adil and Aractingi, Michel and Pascal, Caroline and Russi, Martino and Marafioti, Andres and others},
  journal={arXiv preprint arXiv:2506.01844},
  year={2025}
}

@inproceedings{todorov2012mujoco,
  title={MuJoCo: A physics engine for model-based control},
  author={Todorov, Emanuel and Erez, Tom and Tassa, Yuval},
  booktitle={2012 IEEE/RSJ International Conference on Intelligent Robots and Systems},
  pages={5026--5033},
  year={2012},
  organization={IEEE},
  doi={10.1109/IROS.2012.6386109}
}

@MISC{coumans2021,
author =   {Erwin Coumans and Yunfei Bai},
title =    {PyBullet, a Python module for physics simulation for games, robotics and machine learning},
howpublished = {\url{http://pybullet.org}},
year = {2016--2021}
}
