Anirudh Arrepu Niranjan M Suriyaa MM Rishi Ravi Amit Jomy
- Text Morph
- LightShift Remove
- Mood Lens
- HertzMark
Text Morph is a generative scene text editing engine. Instead of simple pixel overlay, it replaces text within natural images while strictly preserving the original typography, lighting, and perspective. The pipeline integrates state-of-the-art segmentation with a specialized diffusion model (GaMuSA) to ensure the new text is visually indistinguishable from the original scene.
Pipeline/Workflow
- Segmentation & Localization
- User Interaction: The user provides point prompts on the text region.
- Masking: SAM 2.1 Base generates high-precision segmentation masks from the point prompts. A YOLO detector trained on MSCOCO-Text localizes the bounding-box context for the text region.
- Generative Text Inpainting using GaMuSA: The core engine is a diffusion-based text editor. It uses Glyph Adaptive Mutual Self-Attention (GaMuSA) to inject the new text glyphs while locking onto the original font style and background texture.
- Latent Optimization: The model performs controlled sampling, so that the generated text stays consistent with the surrounding pixel context.
- High-Fidelity Reconstruction: The generated latent representation is decoded and resized to match the original input resolution, then blended into the edited region of the high-resolution image.
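The localization and reconstruction steps can be illustrated with a minimal NumPy sketch. It assumes a binary mask (such as one produced by SAM) and an edited patch (such as a decoded GaMuSA output); the helper names `mask_to_bbox`, `resize_nearest`, and `composite` are hypothetical and are not taken from the repository. Nearest-neighbour resizing and a hard paste stand in for whatever resampler and blending the actual pipeline uses.

```python
import numpy as np

def mask_to_bbox(mask, pad=0):
    """Return (top, left, bottom, right) of the nonzero region, padded and clipped."""
    rows = np.any(mask, axis=1)
    cols = np.any(mask, axis=0)
    if not rows.any():
        raise ValueError("mask is empty")
    top, bottom = np.where(rows)[0][[0, -1]]
    left, right = np.where(cols)[0][[0, -1]]
    h, w = mask.shape
    return (max(top - pad, 0), max(left - pad, 0),
            min(bottom + 1 + pad, h), min(right + 1 + pad, w))

def resize_nearest(patch, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, C) patch (placeholder resampler)."""
    in_h, in_w = patch.shape[:2]
    ys = np.arange(out_h) * in_h // out_h
    xs = np.arange(out_w) * in_w // out_w
    return patch[ys][:, xs]

def composite(image, edited_patch, mask, pad=0):
    """Paste the resized patch into the mask's bounding box; other pixels are untouched."""
    top, left, bottom, right = mask_to_bbox(mask, pad)
    out = image.copy()
    out[top:bottom, left:right] = resize_nearest(edited_patch, bottom - top, right - left)
    return out
```

Because only the bounding box is overwritten, every pixel outside the text region keeps its original full-resolution value.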
Key Design Decisions
- yolo_mscoco (custom trained)
We trained a custom YOLO-v11s model on the MSCOCO-Text dataset to obtain fine-grained, text-level bounding boxes. SAM produced only coarse region proposals, whereas our task required precise, per-text localization. The custom detector delivers dense, piecewise text boxes suitable for downstream OCR and layout analysis.
| Model | Params |
|---|---|
| AutoencoderKL | 83.65M |
| ControlUNetModel | 859.52M |
| LabelEncoder | 66.24M |
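As a quick sanity check on the table above, the short calculation below sums the three reported parameter counts and estimates the weight footprint at two precisions. The precisions are assumptions chosen for illustration; the source does not state which one the Text Morph checkpoints use.

```python
# Parameter counts in millions, copied from the table above.
params_m = {"AutoencoderKL": 83.65, "ControlUNetModel": 859.52, "LabelEncoder": 66.24}

total_m = sum(params_m.values())

def weights_mib(params_millions, bytes_per_param):
    """Approximate weight size in MiB for a given storage precision."""
    return params_millions * 1e6 * bytes_per_param / 2**20

print(f"total: {total_m:.2f}M parameters")
for name, bytes_per_param in (("fp32", 4), ("fp16", 2)):
    print(f"{name}: ~{weights_mib(total_m, bytes_per_param):.0f} MiB")
```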
- Since SAM 2 is deployed as a frozen foundation model (zero-shot), its computational profile is well documented by Meta AI. We focused our profiling efforts on the custom components (GaMuSA and Color Grading), where our novel optimizations were applied. See: Compute Profile for SAM
- Model
- Dataset
- AnyText: Multilingual Visual Text Generation and Editing
- TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control
The proposed workflow addresses this challenge by combining state-of-the-art segmentation and generative inpainting with 3D geometric reconstruction. The system uses SAM 2 for precise object isolation and LaMa for seamless removal to clear the obstruction. It then uses Marigold to generate high-fidelity depth maps and surface normals from the modified image. This enables a final shader computation step that re-lights the skin texture, so that the area where the shadow once fell matches the lighting of the rest of the face.
Pipeline/Workflow
- Segmentation for Region Masking using SAM 2: High-fidelity bounding boxes are generated from sparse point inputs, enabling precise localization of the target regions for subsequent processing.
- LaMa Inpainting: The LaMa inpainting model, which leverages Fast Fourier Convolutions (FFCs), inpaints the masked regions while preserving global structural coherence.
- Depth Estimation: Marigold performs denoising on the latent of the inpainted RGB image to hallucinate affine-invariant depth maps. This recovers high-resolution depth topology, including subtle facial curvature (e.g., nose, cheekbones), that is typically lost during the 2D inpainting stage.
- Surface Normal Reconstruction: Computes spatial gradients of the Marigold depth map to produce a per-pixel normal map that encodes the orientation of the skin surface relative to the camera viewpoint.
- Shader Computation & Relighting: Computes pixel intensity as a function of the reconstructed surface normals and the estimated light vector. This restores specular highlights and diffuse shading across the inpainted region, keeping it consistent with the global illumination of the portrait.
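The last two steps can be sketched as follows. This is a minimal illustration, not the repository's shader: it derives normals from depth with finite differences and applies a diffuse (Lambertian) term only, whereas the text above also mentions specular highlights. The function names, the `ambient` term, and the light vector are assumptions introduced for the example.

```python
import numpy as np

def normals_from_depth(depth):
    """Per-pixel unit normals from a depth map via finite differences."""
    dz_dy, dz_dx = np.gradient(depth.astype(np.float64))
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(dz_dx)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def lambert_shade(albedo, normals, light_dir, ambient=0.1):
    """Diffuse shading: albedo * (ambient + max(0, n . l)), clipped to [0, 1]."""
    l = np.asarray(light_dir, dtype=np.float64)
    l = l / np.linalg.norm(l)
    n_dot_l = np.clip(normals @ l, 0.0, None)
    return np.clip(albedo * (ambient + n_dot_l)[..., None], 0.0, 1.0)

# A flat surface facing the camera, lit head-on, keeps its albedo (ambient disabled).
flat = normals_from_depth(np.zeros((4, 4)))
shaded = lambert_shade(np.full((4, 4, 3), 0.5), flat, light_dir=(0, 0, 1), ambient=0.0)
```

Surfaces tilted away from the light receive a smaller `n . l` term and therefore render darker, which is the cue the relighting step uses to make the inpainted skin match the rest of the face.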
Key Design Decisions
Transition to Marigold LCM for Surface Normals
- Design: Adopted Marigold LCM (Latent Consistency Model) for direct surface-normal estimation.
- Rationale: The earlier approach, deriving normals from a predicted depth map, was inherently unstable, producing geometric noise and unreliable surface orientation in complex scenes. Marigold reframes normal estimation as a generative task, yielding far smoother and more coherent geometry. The LCM variant preserves this quality while significantly reducing latency, aligning well with our mobile-first constraints.
Transition to Custom-Trained LaMa Inpainting
- Design: Employed the LaMa inpainting architecture with Fast Fourier Convolutions (FFC), custom-trained on the Pipe dataset.
- Rationale: Although the ONNX baseline offered better runtime, the quality degradation was too severe for a creative tool. Our custom Fourier-based model captures long-range structure and global context far more effectively, resulting in cleaner, seamless inpainting.
To overcome the limitations of the standard implementation, we extensively optimized the LaMa inpainting repository for integration into our AI workflow. This involved substantial code-level modifications, which we maintain and document in a dedicated GitHub repository.
| Category | Module | Params | FLOPs |
|---|---|---|---|
| Model | Total | 865.92M | 761.498G |
| Core Component | time_embedding | 2.05M | 2.05M |
| Attention (Total) | All Attention Layers | 243.43M | 229.398G |
| ResNet (Total) | All ResNet Layers | 368.66M | 369.240G |
| Model Architecture | Value |
|---|---|
| Parameters (Total) | 865.92M |
| Parameters (Trainable) | 865.92M |
| Weights Size | 1651.62 MB |
| Complexity | 761.50 GFLOPs |
| Runtime Metric | Value |
|---|---|
| Peak VRAM Usage | 1,370.21 MB |
| Total Latency | 1,806.29 ms |
| System Throughput | 0.55 img/s |
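The figures in the two tables above can be cross-checked with a few lines of arithmetic. The sketch assumes the reported "MB" are binary megabytes (2^20 bytes) and that throughput is the reciprocal of the per-image latency; both are inferences from the numbers, not statements from the source.

```python
params = 865.92e6        # total parameters, from the architecture table
weights_mb = 1651.62     # reported weights size
latency_ms = 1806.29     # reported total latency per image

bytes_per_param = weights_mb * 2**20 / params
throughput = 1000.0 / latency_ms   # images per second

print(f"{bytes_per_param:.2f} bytes per parameter")
print(f"{throughput:.2f} images per second")
```

Under these assumptions the weights come out at about two bytes per parameter, which is consistent with half-precision storage, and the reported throughput value of 0.55 reads as images per second.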
- Models
- Datasets
Mood Lens is an emotion-based color grading engine. Rather than manually adjusting curves, the user provides an emotional intent (e.g., "Melancholy" or "Tenderness"). The system utilizes a Retrieval-Augmented Generation (RAG) pipeline to fetch style references and employs a Neural 3D LUT (Look-Up Table) to transfer the color palette while strictly preserving the original image's structure.
Pipeline/Workflow
- Retrieval Augmented Generation
- Embedding: The input image is encoded via CLIP ViT-B/32. We embed only visual features, ignoring text labels to preserve geometric structure.
- Search & Filter: We query FAISS for the top 300 visual neighbors, then apply a strict metadata filter on emotion to select the top 3 stylistic references.
- Feature Extraction: A pre-trained VGG19 extracts content features for structure and style Gram matrices for texture/color.
- Test-Time LUT Optimization: A lightweight trilinear 3D LUT is initialized to the identity mapping. We run ~100 iterations of backpropagation to minimize a perceptual loss on a downsampled proxy (256×256).
- High-Res Inference: The optimized LUT is applied to the original high-resolution image in under a millisecond. Because the LUT only remaps colors pixel by pixel, the image's spatial detail is left intact.
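The search-and-filter step of the retrieval stage can be sketched in plain Python. A brute-force cosine ranking stands in for the FAISS query here, and the record layout (`id`, `vec`, `emotion`) and the `retrieve` function are hypothetical, introduced only to show the oversample-then-filter logic with the top-300 and top-3 values mentioned above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vec, index, emotion, oversample=300, k=3):
    """Rank by visual similarity, keep the top `oversample`, then filter by emotion."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    neighbours = ranked[:oversample]
    matches = [item for item in neighbours if item["emotion"] == emotion]
    return [item["id"] for item in matches[:k]]

index = [
    {"id": "a", "vec": [1.0, 0.0], "emotion": "joy"},
    {"id": "b", "vec": [0.9, 0.1], "emotion": "melancholy"},
    {"id": "c", "vec": [0.0, 1.0], "emotion": "melancholy"},
    {"id": "d", "vec": [0.7, 0.7], "emotion": "melancholy"},
]
references = retrieve([1.0, 0.0], index, "melancholy", oversample=3)
```

With `oversample=3` the visually distant item `c` never reaches the emotion filter, so the references stay close to the query image even though `c` carries the right label.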
Key Design Decisions
- Image-Only Indexing Strategy
- Strategy: We deliberately excluded emotion text labels from the embedding process.
- Rationale: Text embeddings dilute structural information (e.g., "Sad" retrieves crying faces). By embedding only the image, we restrict the search to geometry and luminance, ensuring that the retrieved reference shares the same spatial features as the user's photo.
- Oversampling & Post-Filtering
- Strategy: Instead of fetching the top 3 matches directly, we fetch the top 300 visual neighbours and filter for the target emotion.
- Rationale: Vector search is approximate. This two-step process almost always yields candidates that are both visually compatible (structure) and thematically accurate (mood).
- 3D LUTs vs Generative Diffusion
- Design: We built a differentiable TrilinearLUT module instead of using Stable Diffusion/ControlNet.
- Rationale
- Mobile Feasibility: ~100k parameters vs. ~860M for U-Nets.
- Zero Hallucinations: A LUT is a pure colour mapping function. It operates on each pixel's colour independently, so it cannot move, distort, or invent objects, making it safe for professional editing.
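To make the LUT design concrete, here is a minimal NumPy sketch of an identity 3D LUT and its trilinear lookup. It is not the repository's TrilinearLUT module: the function names and the grid size of 33 are assumptions, and the test-time optimization loop is omitted.

```python
import numpy as np

def identity_lut(size=33):
    """LUT of shape (size, size, size, 3) where node (r, g, b) stores its own colour."""
    axis = np.linspace(0.0, 1.0, size)
    r, g, b = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.stack([r, g, b], axis=-1)

def apply_lut(image, lut):
    """Trilinearly interpolate `lut` at every RGB value of `image` (floats in [0, 1])."""
    size = lut.shape[0]
    pos = np.clip(image, 0.0, 1.0) * (size - 1)
    lo = np.minimum(np.floor(pos).astype(int), size - 2)   # lower corner of each cell
    frac = pos - lo                                        # position inside the cell
    out = np.zeros(image.shape, dtype=np.float64)
    for dr in (0, 1):
        for dg in (0, 1):
            for db in (0, 1):
                weight = ((frac[..., 0] if dr else 1 - frac[..., 0])
                          * (frac[..., 1] if dg else 1 - frac[..., 1])
                          * (frac[..., 2] if db else 1 - frac[..., 2]))
                corner = lut[lo[..., 0] + dr, lo[..., 1] + dg, lo[..., 2] + db]
                out += weight[..., None] * corner
    return out
```

Each output pixel depends only on the colour of the corresponding input pixel, which is why the mapping cannot shift edges or add structure; an identity LUT returns the image unchanged.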
To construct the vector database, we used the Unsplash dataset in accordance with its terms and conditions.
Unsplash Dataset Link
| Model Architecture | Value |
|---|---|
| Parameters (Total) | 0.11M |
| Parameters (Trainable) | 0.11M |
| Weights Size (Disk, MB) | 0.41 |
| Complexity (FLOPs, M) | 3.15 |
| Module | Number of Parameters | FLOPs |
|---|---|---|
| TrilinearLUT | 107.81K | 3.15M |
| Runtime Metrics | Value |
|---|---|
| VRAM | 1718.45 MB |
| Total Latency | 7135.04 ms |
| System Throughput | 0.14 img/s |
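The reported numbers are consistent with a 33-point grid per colour axis stored as 32-bit floats. Both the grid size and the precision are inferences from the tables, not values stated in the source; the check below shows the arithmetic.

```python
lut_size = 33                        # assumed grid points per colour axis
params = 3 * lut_size ** 3           # one RGB triple per grid node
weights_mb = params * 4 / 2**20      # assuming 32-bit floats and binary megabytes
latency_ms = 7135.04                 # reported total latency
throughput = 1000.0 / latency_ms     # images per second

print(f"{params / 1e3:.2f}K parameters, {weights_mb:.2f} MB, {throughput:.2f} img/s")
```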
- Models
- Datasets
- Learning Image-adaptive 3D Lookup Tables for High Performance Photo Enhancement in Real-time
- AdaInt: Learning Adaptive Intervals for 3D Lookup Tables on Real-time Image Enhancement
- NILUT: Conditional Neural Implicit 3D Lookup Tables for Image Enhancement
- CLIPstyler: Image Style Transfer with a Single Text Condition
To ensure content provenance and detect AI-generated outputs, we implemented a robust watermarking system. Unlike spatial watermarks, this system operates in the Spectral Domain (Frequency Domain), making it resilient to compression and cropping.
- Embedder Pipeline: The embedding process transforms the image into its frequency components to inject a learned signature into the mid-frequency bands.
- Spectral Transformation (FFT): The input image I is converted to the frequency domain using the Fast Fourier Transform. We shift the zero-frequency component to the center to make specific frequency bands easy to target.
$$F(u,v) = \mathrm{FFT}(I)$$
- Key-Based Pattern Generation: A dense neural network maps a unique 128-dim cryptographic key to a spatial 64×64 watermark pattern. This is bilinearly upsampled to match the image resolution, so every watermark is unique to its session/user.
- Perceptual Masking: To prevent visual artifacts, we calculate a texture map using a convolution layer.
- Rationale: The human eye is sensitive to noise in flat regions (e.g., clear sky) but insensitive in textured regions (e.g., foliage).
- Design: The watermark intensity is scaled down in flat areas and scaled up in textured areas using this map.
- Mid-Frequency Injection: Low frequencies carry color and structure, so modifying them ruins aesthetics. High frequencies carry noise, so changes made there are removed by JPEG compression. Mid frequencies are the robust middle ground.
$$F_{new} = F + (\alpha \times W_{pattern} \times M_{texture})$$
- Reconstruction: The modified frequency spectrum is converted back to the spatial domain via the inverse FFT to produce the final signed image.
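The embedder steps above can be sketched with NumPy's FFT routines. This is an illustration under several assumptions, not the trained embedder: the pattern and texture map are passed in as plain arrays instead of coming from the dense network and the convolution layer, the mid band is modelled as a fixed annulus whose `low` and `high` radii are made up for the example, and the formula is applied literally, with the texture map multiplied into the spectral update.

```python
import numpy as np

def mid_band_mask(h, w, low=0.1, high=0.4):
    """1.0 inside an annulus of normalised radius [low, high] around the centred DC term."""
    yy, xx = np.mgrid[:h, :w]
    radius = np.hypot((yy - h // 2) / (h / 2), (xx - w // 2) / (w / 2))
    return ((radius >= low) & (radius <= high)).astype(np.float64)

def embed(image, pattern, texture_map, alpha=0.0005):
    """F_new = F + alpha * W_pattern * M_texture, restricted to the mid band."""
    F = np.fft.fftshift(np.fft.fft2(image))            # centre the zero frequency
    band = mid_band_mask(*image.shape)
    F_new = F + alpha * pattern * texture_map * band
    # The update is not forced to be conjugate-symmetric, so keep the real part.
    return np.real(np.fft.ifft2(np.fft.ifftshift(F_new)))
```

The annulus excludes the centred zero-frequency term, so the image's mean brightness is untouched, and with `alpha=0` the function returns the input image up to floating-point error.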
- Frequency Domain Injection: We chose spectral injection over pixel patching because it is global: the signature is spread across the whole image rather than stored in one region. If a user crops away part of the image, such as the top-left corner, the remaining pixels still carry the signature, allowing the detector to identify the watermark.
- Adaptive Alpha Scaling: The parameter $\alpha=0.0005$ is extremely low. By combining this low baseline with the texture_map, we keep the Peak Signal-to-Noise Ratio (PSNR) high, meaning the watermarked image is visually indistinguishable from the original.
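For reference, PSNR can be computed as in the sketch below. The example perturbs every pixel by a uniform 0.0005 purely to give a sense of scale; this is an assumption for illustration, since the actual watermark perturbation varies per pixel and is set in the frequency domain.

```python
import math

def psnr(original, watermarked, peak=1.0):
    """PSNR in dB between two equal-length sequences of pixel values."""
    if len(original) != len(watermarked):
        raise ValueError("inputs must have the same length")
    mse = sum((a - b) ** 2 for a, b in zip(original, watermarked)) / len(original)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

original = [0.5] * 100
watermarked = [p + 0.0005 for p in original]
value_db = psnr(original, watermarked)
```

A uniform offset of 0.0005 on a unit-range image corresponds to roughly 66 dB, well above the range where differences are usually visible.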
- The watermarking injection operates as a near-instantaneous post-processing step. With an architectural footprint dominated by highly optimized Fast Fourier Transforms (FFT) and a compact dense projection head, the latency overhead is negligible (< 5ms).
| Field | Value |
|---|---|
| operating system | 24.04.1-Ubuntu |
| miniconda | 25.11.0 |
| docker | 29.0.0 |
# nvidia-container-toolkit installation
# -------------------------------------
# add the gpg key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# add the generic deb repository
echo "deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/$(dpkg --print-architecture) /" | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# install
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
- Clone the repository
# clone the repo
# --------------
git clone https://github.com/photosaverneedbackup-star/team37-adobe-14
cd team37-adobe-14
git submodule update --init
- Clone the repo
# change the directory
# --------------
cd text_morph
# create the conda environment
# ----------------------------
conda create -n text_morph python=3.8
conda activate text_morph
# install the requirements
# ------------------------
# NOTE: these versions are specifically written to avoid dependency issues,
# do not change them
pip install --upgrade torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117
# other dependencies
pip install -r requirements.txt
pip install huggingface_hub
# download the model weights
python weights_download.py --path ./TextCtrl/weights
(in case the above fails)
- Download the model weights for TextCtrl from Mega
- Move the TextCtrl weights to text_morph/TextCtrl/weights
# start the server
# ----------------
python server.py
# use the html files provided to interact with the server
# or use CURL to send commands
# building the docker image
# -------------------------
sudo docker build -t text_morph .
# running the docker image
# ------------------------
sudo docker run -p 7000:7000 text_morph
- Clone the repo
cd lightshift_remove
# create the conda environment
# ----------------------------
conda create -n lightshift_remove python=3.11.14
conda activate lightshift_remove
pip install -r requirements.txt
# start the standalone server
# ---------------------------
python server.py
docker build -t lightshift_remove .
docker run -it --gpus all lightshift_remove
# create the conda environment
# ----------------------------
conda env create -f environment.yaml
conda activate lama
export TORCH_HOME=$(pwd) && export PYTHONPATH=$(pwd)
- Download the dataset from Pipe Dataset and generate your validation masks from the images (or)
python pipe_download.py
- Change the path of your dataset in configs/training/orin.yaml by setting the variable data_root_dir to the output directory of pipe_download.py
- Train the model
python bin/train.py -cn lama-fourier location=orin
- Place your test images in the folder test_images/
python bin/predict.py model.path=$(pwd)/fourier indir=$(pwd)/test_images outdir=$(pwd)/test_images_output
# create the conda environment
conda create -n mood_lens python=3.11.14
conda activate mood_lens
# install the dependencies
pip install -r requirements.txt
- Download the Dataset [unsplash]
- Move the photos.csv000 to this directory
python mood_lens/download_dataset.py
- dataset_manifest.csv will be created
# get the gemini api key for this step and replace the line at the top
# of the file to caption the dataset with emotion
#
# NOTE: run caption_dataset.ipynb
#
# main.py will automatically build the index for the first time & reuse it
# for subsequent runs!
cd .. #exit out of the pipeline directory
uvicorn mood_lens.server:app
sudo docker run -it -p 8000:8000 -e dataset_path="/app/dataset" -v $(pwd)/dataset:/app/dataset -v $(pwd)/output:/app/output -v $(pwd)/misc:/app/misc colour-correction
cd hertz_mark
# create the conda environment
# ----------------------------
conda env create -f environment.yaml
conda activate hertz_mark
- Download the dataset from HertzMark [filmset from Kaggle]
- Move the dataset to data/
- To modify the dataset path, modify the DATA_FOLDER variable
# for training
python train.py
- To infer, run these commands:
python embed_inference.py #for embedding the input image with a watermark
python detect_inference.py #for detecting if the input image was edited by our pipeline
# uses the docker-compose.yml
docker compose up -d
# web prototype
cd client
npm i
npm start

| Field | Value |
|---|---|
| Architecture | x86_64 |
| CPU Modes | 32-bit, 64-bit |
| Address Sizes | 46-bit physical, 48-bit virtual |
| Byte Order | Little Endian |
| Model Name | 13th Gen Intel(R) Core(TM) i9-13900 |
| Field | Value |
|---|---|
| Total CPUs | 32 |
| Threads per Core | 2 |
| Max Frequency | 5600 MHz |
| Min Frequency | 800 MHz |
| Cache Level | Size | Instances |
|---|---|---|
| L1d | 896 KiB | 24 |
| L1i | 1.3 MiB | 24 |
| L2 | 32 MiB | 12 |
| L3 | 36 MiB | 1 |
| Field | Value |
|---|---|
| GPU Model | NVIDIA GeForce RTX 3050 OEM |
| Driver Version | 570.195.03 |
| CUDA Version | 12.8 |
| VRAM | 8192 MiB |
- GaMuSA - Glyph Adaptive Mutual Self Attention
- CLIP - Contrastive Language-Image Pretraining
- SAM - Segment Anything Model
- LaMa - Large Mask Inpainting
- YOLO - You Only Look Once
- All linked datasets are open source, and their licenses are attached.
- Terms for the required datasets are included in the repository.
- HertzMark has been implemented to ensure provenance.
- Repository: https://github.com/photosaverneedbackup-star/team37-adobe-14
- Drive Link: https://drive.google.com/drive/folders/1V2aOJvSMgiQv2A5QFmgA2ER-_sZHhoP1?usp=sharing






