
Keeping your data in LanceDB fresh with CocoIndex

Background

This repo contains a demo of using CocoIndex, a data transformation framework for AI that provides incremental processing and data lineage out-of-the-box, with LanceDB, a modern lakehouse for multimodal AI.

The goal is to store a multimodal dataset of recipe data (images + text) in LanceDB and keep it fresh with CocoIndex.

Why use LanceDB?

One of the biggest benefits of using LanceDB over traditional databases or data lakes is that data which would otherwise be scattered across separate locations (with Parquet, for example, tables tend to store pointer URLs to images/videos/large binary blobs rather than the actual data itself) is colocated alongside the embeddings and metadata. This simplifies governance and distribution.

Another key distinguishing factor when using LanceDB is the ability to evolve the schema and data effortlessly -- Lance tables are "two-dimensional", meaning that they can grow both horizontally and vertically in essentially a zero-cost manner. Say you want to use an LLM to extract new features from one of the columns in a Lance table: you would run your pipeline, update the table schema to add a new column, and backfill it with the required values by running the transform.

In traditional data lakes (e.g., based on Iceberg), this would require a full table rewrite, but in Lance, only the new data is written (with no table locks while writes happen). This means that large teams working on multiple feature engineering tasks can write new columns simultaneously without affecting the layout on disk.
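As a rough sketch of what this looks like with LanceDB's Python API (assuming this repo's default database path and table name, and a hypothetical contains_nuts column):

import lancedb

db = lancedb.connect("recipe_lancedb")   # this repo's default database path
table = db.open_table("recipes")

# Add a new column; the SQL expression supplies the initial value for
# existing rows, without rewriting existing fragments
table.add_columns({"contains_nuts": "FALSE"})

# Backfill computed values (e.g., from an LLM pass over the ingredients);
# only the touched data is written
table.update(where="id = 42", values={"contains_nuts": True})  # illustrative id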

There are several more advantages to using LanceDB that leverage all the benefits of the Lance format, an open lakehouse format for multimodal AI, so we won't list them all here.

Why do incremental processing via CocoIndex?

Not all vector processing workloads are offline batch workloads. Consider this scenario: you have a user-facing application where users enter their recipes (along with images of the food/drink item that they prepared), and you want to persist the data to a multimodal storage engine. In this scenario, you typically don't begin with huge amounts of data. You accumulate data over time, as users add their creations. And the volume/velocity of the data aren't staggeringly high -- at times, there's only a trickle of data coming in, but at other times, you may observe larger volumes coming in at a higher velocity than normal.

For scenarios like this, incremental processing is an efficient technique that processes only new or changed data (deltas) since the last update, rather than reprocessing entire large datasets. This tends to reduce computation while lowering costs for near real-time analytics. CocoIndex is ideal for managing constantly evolving data sources, handling small batches of updates to keep data fresh with less overhead than full batch workloads.

CocoIndex uses a declarative approach to defining indexing "flows", involving source data and transformed data (either as an intermediate result or the final result to be put into targets). All data within the indexing flow has a schema determined at flow definition time, which aligns well with LanceDB's strictly typed, schema-driven storage mechanism.
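For a flavor of what a declarative flow looks like, here's a minimal sketch; the source fields, transform placement, and the LanceDB target's parameter names are illustrative assumptions, not this repo's actual flow definition.

import cocoindex

@cocoindex.flow_def(name="RecipeIndex")
def recipe_index_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Source: the JSON files emitted by data_generator.py
    data_scope["recipes"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="data", included_patterns=["*.json"]))

    collector = data_scope.add_collector()
    with data_scope["recipes"].row() as recipe:
        # Embedding/enrichment transforms go here; every field's schema is
        # known at flow definition time
        collector.collect(filename=recipe["filename"], content=recipe["content"])

    collector.export(
        "recipes",
        cocoindex.targets.LanceDB(db_uri="recipe_lancedb", table_name="recipes"),
        primary_key_fields=["filename"],
    )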

Dataset

We'll be using the food ingredients and recipes dataset from Kaggle. The data contains 13k+ recipes and images of food/drinks scraped from the Epicurious website. The dataset is multimodal, containing images, string arrays, and text.

Download the dataset from Kaggle (it comes as a file named archive.zip) and unzip it at the root level of this repository.

Setup

We'll use uv to manage the dependencies for this project. Run the following command to install the required Python libraries to get started.

uv sync

Tools used

The nomic-embed-text embedding model from Ollama is used for text embeddings. For image embeddings, the CLIP model from Hugging Face is used.
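For reference, generating an image embedding with the Hugging Face CLIP model looks roughly like this (the image path is a placeholder):

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("data/images/example.jpg")  # placeholder path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    embedding = model.get_image_features(**inputs)[0]  # 512-dimensional vector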

In addition to embedding, we'll be enriching the dataset with features (e.g., detecting allergens in recipes and storing them as new boolean columns). LLMs are used to make inferences on the recipe ingredients for this task. We'll be using DSPy (a declarative framework for programming, rather than prompting, LLMs) for this stage.
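Here's a minimal sketch of how such an enrichment step could be expressed in DSPy; the signature fields and the model served via Ollama are assumptions, not this repo's exact setup.

import dspy

# Assumed local model; any chat model served by Ollama would do
dspy.configure(lm=dspy.LM("ollama_chat/llama3.2", api_base="http://localhost:11434"))

class DetectAllergens(dspy.Signature):
    """Flag common allergens present in a recipe's ingredient list."""
    ingredients: list[str] = dspy.InputField()
    contains_nuts: bool = dspy.OutputField()
    contains_dairy: bool = dspy.OutputField()

detect = dspy.Predict(DetectAllergens)
pred = detect(ingredients=["flour", "butter", "almonds", "sugar"])
print(pred.contains_nuts, pred.contains_dairy)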

The following diagram shows where each tool sits in the overall workflow. CocoIndex provides a built-in target for LanceDB, and this is used to manage the data pipeline and persist the multimodal data in a LanceDB table.

Generate data

To simulate a scenario where we have data intermittently coming in from a source, we'll use the script data_generator.py. This script looks at the source data in the archive directory and writes JSON records of the source data. The JSON records also contain a path to the image file for the corresponding recipe ID, so that it can be easily located for ingestion into LanceDB.

uv run data_generator.py --start 0 --end 5

This writes out the first 5 recipe records to a JSON file at data/*.json and copies the corresponding image files to data/images/*.jpg.

To generate the next 5 records, pass the corresponding start and end indices.

uv run data_generator.py --start 5 --end 10

To delete existing records and start afresh, use the --refresh flag.

uv run data_generator.py --start 0 --end 10 --refresh

Running the script multiple times will generate multiple JSON files, one for each run. This mimics "real" data that may be coming from a push API in an application.

Running the CocoIndex flow

CocoIndex uses a Postgres database to track internal state about the source and the target for incremental processing. It's simple to get one running via Docker as follows:

docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/postgres.yaml) up

One-time flow

To run a one-time update in CocoIndex, use the following command:

cocoindex update main

CocoInsight (in free beta at the time of writing) can be used to view the index generation and understand the data lineage of the pipeline in a GUI. Start a local CocoInsight server as follows:

cocoindex server -ci main

Open the CocoInsight UI at https://cocoindex.io/cocoinsight. You can also run queries in the CocoInsight UI to test that the search functionality is working as intended.

Live updates

To run a CocoIndex server that watches for changes to the source or target and automatically runs updates when there is a change, use the following command:

cocoindex update main -L

Data compaction and why it's needed

LanceDB uses Lance tables under the hood. Unlike Parquet (which uses row groups and partitions to store data on disk), the Lance format uses fragments and tracks data versions via a manifest. In incremental data processing pipelines such as those run by CocoIndex, many small fragments can accumulate over time, which can impact query latency as the data grows. It's recommended to run compaction at periodic intervals (e.g., once every 7 days), or more frequently depending on the volume/velocity of commits to the storage layer in a given period.

The optimize() method can be used to run compaction periodically.

# Open your Lance table (this repo's default path and table name shown) and run compaction
import lancedb

lancedb.connect("recipe_lancedb").open_table("recipes").optimize()

This command performs the following:

  • Compaction: Merges small files into larger ones
  • Pruning: Removes old versions of the dataset
  • Indexing: Optimizes the indexes, adding new data to existing indexes

There is no need to compact tables too frequently, as this comes with computational overhead. LanceDB is highly performant up to millions of rows, so you can adjust the frequency of compaction gradually, based on the needs of your workloads.


[Optional] Running a pure LanceDB pipeline

To contrast the CocoIndex "incremental way" with the traditional batch processing approach, we provide sample scripts in the scripts/ directory. ingest.py contains code to ingest the recipe data into LanceDB. This step is optional (the aim of this repo is to show how to do it using the CocoIndex flow defined above).

The ingestion script also generates two kinds of embeddings:

  • Text embeddings on the instructions column (TODO: concatenate the title and instructions and embed that instead)
  • Image embeddings on the image binary column

The text embeddings use the nomic-embed-text model via Ollama, and the image embeddings use the openai/clip-vit-base-patch32 model from Hugging Face.

Run the script as follows:

cd scripts
# Overwrite the existing database
uv run ingest.py -o
# Or, append to an existing database (default mode)
uv run ingest.py

An upsert pipeline is used during ingestion so that duplicate data isn't written to the table. This means that as the script is run multiple times (as new data comes in), only records whose unique recipe id hasn't been seen before are written to the table.
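In LanceDB's Python API, such an upsert can be expressed with merge_insert keyed on the id column; a sketch, where new_records stands in for freshly generated rows matching the table schema:

import lancedb

db = lancedb.connect("recipe_lancedb")
table = db.open_table("recipes")

new_records = [...]  # freshly generated recipe rows matching the table schema

# Update rows whose id already exists and insert the rest, so duplicate
# ids are never appended twice
(
    table.merge_insert("id")
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute(new_records)
)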

Search web app (FastAPI + React)

Use the provided FastAPI app (app.py) and Vite frontend (frontend/) to run text->image or text->text searches against the LanceDB store. The UI offers buttons to switch between text search (nomic-embed-text via Ollama) and image search (CLIP text encoder against stored image embeddings). Images are served directly from /img backed by data/images.

Environment

Copy .env.example to .env and adjust as needed. Defaults assume:

  • CocoIndex (flow/updater) runs outside docker-compose; use your preferred command (e.g., cocoindex update main -L with the Postgres helper you already run).
  • LanceDB path /app/recipe_lancedb and table recipes (mounted from the repo).
  • Ollama reachable at http://host.docker.internal:11434 (set OLLAMA_HOST if different).

Run locally

  1. Ensure data and embeddings exist (e.g., uv run data_generator.py --start 0 --end 10 and cocoindex update main -L in a separate shell).
  2. Start the API:
    uv run uvicorn app:app --reload --host 0.0.0.0 --port 8000
  3. Start the frontend:
    cd frontend
    npm install
    npm run dev -- --host 0.0.0.0 --port 4173
  4. Open http://localhost:4173 and try queries like "yellow soup" or "latte art". Use the mode buttons to toggle between text and image search.

Run with Docker Compose

The compose file brings up only the FastAPI backend and Vite frontend. Run CocoIndex separately (e.g., with your existing Postgres helper compose).

docker compose up --build

Services: backend (the FastAPI API) and frontend (the Vite/React UI).

Data and LanceDB files are mounted from ./data and ./recipe_lancedb so you can reuse local artifacts between runs.

If you already have the CocoIndex Postgres helper running (e.g., via docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/postgres.yaml) up), keep running it separately; the backend/frontend compose here does not manage CocoIndex.

Querying the database

You can also query the database manually in LanceDB. The query.py script contains sample code to query the data once it's persisted to LanceDB.

uv run query.py

Two kinds of queries are run:

  • Query via a text embedding on the instruction_vector column
  • Query via a text-to-image embedding on the image_vector column

Each should return relevant top-k results based on the query.
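Roughly, the two query paths look like this (a sketch assuming the repo's default path, table, and column names):

import lancedb
import ollama
from transformers import CLIPModel, CLIPProcessor

db = lancedb.connect("recipe_lancedb")
table = db.open_table("recipes")

# Text-to-text: embed the query with nomic-embed-text via Ollama
text_vec = ollama.embeddings(model="nomic-embed-text", prompt="yellow soup")["embedding"]
text_hits = table.search(text_vec, vector_column_name="instruction_vector").limit(5).to_list()

# Text-to-image: embed the query with CLIP's text encoder and search the image vectors
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
inputs = proc(text=["latte art"], return_tensors="pt", padding=True)
image_query_vec = clip.get_text_features(**inputs)[0].detach().numpy()
image_hits = table.search(image_query_vec, vector_column_name="image_vector").limit(5).to_list()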

Inspect the database

To display the table's schema, row count, and indexes, run the inspect_db.py script.

uv run inspect_db.py
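Under the hood, the relevant LanceDB calls look roughly like this (assuming the repo's default path and table name):

import lancedb

db = lancedb.connect("recipe_lancedb")
table = db.open_table("recipes")

print(table.schema)          # Arrow schema of the table
print(table.count_rows())    # total row count
print(table.list_indices())  # vector/scalar indexes on the table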
