
Sigma-Squad/Adobe-InterIIT-TechMeet-14.0

 
 


Team_37_Adobe_Technical_Report

Contributors

Anirudh Arrepu, Niranjan M, Suriyaa MM, Rishi Ravi, Amit Jomy

Market Research, Problem Understanding, Solution Description

Proposed Features

  • Text Morph
  • LightShift Remove
  • Mood Lens
  • HertzMark

Text Morph

Text Morph is a generative scene text editing engine. Instead of simple pixel overlay, it replaces text within natural images while strictly preserving the original typography, lighting, and perspective. The pipeline integrates state-of-the-art segmentation with a specialized diffusion model (GaMuSA) to ensure the new text is visually indistinguishable from the original scene.


Architecture & Workflow

text_morph_architecture


Pipeline/Workflow

  • Segmentation & Localization
    • User Interaction: The user provides point prompts on the text region.
    • Masking: We trigger SAM 2.1 Base to generate high-precision segmentation masks. YOLO (MSCOCO) is utilized to localize the bounding box context for the text region.
  • Generative Text Inpainting using GaMuSA The core engine is a diffusion-based text editor. It uses Glyph Adaptive Mutual Self-Attention to inject the new text glyphs while locking onto the original font style and background texture.
  • Latent Optimization The model performs controlled sampling. This controlled diffusion process ensures the generated text aligns perfectly with the surrounding pixel context.
  • High-Fidelity Reconstruction The generated latent representation is decoded and resized to match the original input resolution, blending the edited region seamlessly into the high-res image.
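The final reconstruction step reduces to masked alpha compositing of the decoded region into the high-resolution image. A minimal NumPy sketch (the function name, box-blur feathering, and inputs are illustrative, not the report's implementation):

```python
import numpy as np

def blend_edited_region(original, edited, mask, feather=2.0):
    """Composite an edited patch back into the original image.

    original, edited: float RGB arrays in [0, 1], shape (H, W, 3)
    mask:             float array in [0, 1], shape (H, W), 1 inside the edit
    feather:          half-width of a box blur that softens the mask edge
    """
    # Feather the mask with a cheap separable box blur so the seam is soft
    k = 2 * int(feather) + 1
    kernel = np.ones(k) / k
    soft = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, mask)
    soft = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, soft)
    # Alpha-composite: edited content inside the mask, original outside
    alpha = soft[..., None]
    return alpha * edited + (1.0 - alpha) * original
```

In the actual pipeline, `edited` would be the decoded, upsampled GaMuSA output and `mask` the SAM 2.1 segmentation mask.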

Key Design Decisions

  • yolo_mscoco (custom trained) We trained a custom YOLO-v11s model on the MSCOCO-Text dataset to obtain fine-grained, text-level bounding boxes. SAM produced only coarse region proposals, whereas our task required precise, per-text localization. The custom detector delivers dense, piecewise text boxes suitable for downstream OCR and layout analysis.

Weights & Biases

| Model | Params |
|---|---|
| AutoencoderKL | 83.65M |
| ControlUNetModel | 859.52M |
| LabelEncoder | 66.24M |

Compute Profile

gamusa_latency_profile

  • Since SAM 2 is deployed as a frozen foundation model (zero-shot), its computational profile is well documented by Meta AI. We therefore focused our profiling efforts on the custom components (GaMuSA and Color Grading), where our novel optimizations were applied. (Compute Profile for SAM)

Terms, Conditions & License

References


LightShift Remove

LightShift Remove addresses the challenge of removing an obstruction, such as a cast shadow, and restoring realistic lighting by combining state-of-the-art segmentation and generative inpainting with 3D geometric reconstruction. SAM 2 isolates the object precisely and LaMa removes it seamlessly; the system then utilizes Marigold to generate high-fidelity depth maps and surface normals from the modified image. This allows a final shader-computation step that physically re-lights the skin texture, ensuring that the area where the shadow once fell matches the lighting of the rest of the face perfectly.

Architecture & Workflow

relighting_architecture


Pipeline/Workflow

  • Segmentation for Region Masking using SAM 2 High-fidelity segmentation masks are generated from sparse point inputs, enabling precise localization of the target regions for subsequent processing.
  • LaMa Inpainting The LaMa-Inpainting model, which leverages Fast Fourier Convolutions (FFCs), is employed to inpaint the masked regions while preserving global structural coherence.
  • Depth Estimation using Marigold Performs denoising on the inpainted RGB latent space to hallucinate affine-invariant depth maps, recovering high-resolution depth topology, including subtle facial curvature (e.g., nose, cheekbones), that is typically lost during the 2D inpainting stage.
  • Surface Normal Reconstruction Computes spatial gradients derived from the Marigold depth map. Produces a per-pixel normal map that encodes the orientation of the skin surface relative to the camera viewpoint.
  • Shader Computation & Relighting Computes pixel intensity as a function of the reconstructed surface normals and the estimated light vector. Restores specular highlights and diffuse shading across the inpainted region, ensuring consistency with the global illumination characteristics of the portrait.
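The geometry steps above (normals from depth gradients, then a diffuse shading pass) can be sketched end to end. The hemisphere depth map and the overhead light vector below are synthetic stand-ins for Marigold's prediction and the estimated light direction:

```python
import numpy as np

def normals_from_depth(depth):
    """Per-pixel surface normals from spatial gradients of a depth map."""
    dz_dy, dz_dx = np.gradient(depth)
    # The un-normalised normal is (-dz/dx, -dz/dy, 1)
    n = np.dstack([-dz_dx, -dz_dy, np.ones_like(depth)])
    return n / np.linalg.norm(n, axis=2, keepdims=True)

def lambert_shade(normals, light_dir, albedo=1.0):
    """Pixel intensity as a function of surface orientation and light vector."""
    l = np.asarray(light_dir, dtype=float)
    l = l / np.linalg.norm(l)
    return albedo * np.clip(normals @ l, 0.0, 1.0)

# Synthetic hemisphere depth standing in for a Marigold prediction
y, x = np.mgrid[-1:1:64j, -1:1:64j]
depth = np.sqrt(np.clip(1.0 - x**2 - y**2, 0.0, None))
shading = lambert_shade(normals_from_depth(depth), light_dir=(0.0, 0.0, 1.0))
```

With the light overhead, the shading peaks where the surface faces the camera and falls off toward the silhouette, which is the behaviour the relighting step relies on to match the inpainted region to the rest of the portrait.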

Key Design Decisions

Transition to Marigold LCM for Surface Normals

  • Design Adopted Marigold LCM (Latent Consistency Model) for direct surface-normal estimation.
  • Rationale The earlier approach of deriving normals from a predicted depth map was inherently unstable, producing geometric noise and unreliable surface orientation in complex scenes. Marigold reframes normal estimation as a generative task, yielding far smoother and more coherent geometry. The LCM variant preserves this quality while significantly reducing latency, aligning well with our mobile-first constraints.

Transition to Custom-Trained LaMa Inpainting

  • Design Employed the LaMa Inpainting architecture with Fast Fourier Convolutions (FFC), custom-trained on the Pipe dataset.
  • Rationale Although the ONNX baseline offered better runtime, the quality degradation was too severe for a creative tool. Our custom Fourier-based model captures long-range structure and global context far more effectively, resulting in cleaner, seamless inpainting.
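As a toy illustration of why Fourier convolutions capture global context (this is not the LaMa implementation): a spectral branch scales every frequency of the feature map, with random weights here where LaMa uses learned ones, so a single input pixel influences every output pixel.

```python
import numpy as np

def spectral_transform(feat, weights):
    """One FFC-style spectral branch: pointwise scaling in the Fourier domain.

    feat:    real feature map, shape (H, W)
    weights: per-frequency scale, shape (H, W // 2 + 1) to match rfft2 output
    """
    spec = np.fft.rfft2(feat)  # to the frequency domain
    spec = spec * weights      # "convolution" with a global receptive field
    return np.fft.irfft2(spec, s=feat.shape)

# A single impulse input spreads across the entire output
feat = np.zeros((16, 16))
feat[3, 3] = 1.0
rng = np.random.default_rng(0)
weights = rng.standard_normal((16, 9))
out = spectral_transform(feat, weights)
```

A spatial 3×3 convolution would confine the impulse response to a small neighbourhood; the spectral branch spreads it over the whole map, which is why FFCs recover long-range structure in large masked regions.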

Use of LaMa-Inpainting

To overcome the limitations inherent in the standard implementation, we extensively optimized the LaMa-Inpainting repository for integration into our AI workflow. This involved substantial code-level modifications, culminating in a dedicated GitHub repository that maintains and documents our enhanced version.


Weights & Biases

| Category | Module | Params | FLOPs |
|---|---|---|---|
| Model | Total | 865.92M | 761.498G |
| Core Component | time_embedding | 2.05M | 2.05M |
| Attention (Total) | All Attention Layers | 243.43M | 229.398G |
| ResNet (Total) | All ResNet Layers | 368.66M | 369.240G |

| Model Architecture | Value |
|---|---|
| Parameters (Total) | 865.92M |
| Parameters (Trainable) | 865.92M |
| Weights Size | 1651.62 MB |
| Complexity | 761.50 GFLOPs |

Compute Profile

relighting_latency_profile


| Runtime Metric | Value |
|---|---|
| Peak VRAM Usage | 1,370.21 MB |
| Total Latency | 1,806.29 ms |
| System Throughput | 0.55 img/s |

Terms, Conditions & License


Mood Lens

Mood Lens is an emotion-based color grading engine. Rather than manually adjusting curves, the user provides an emotional intent (e.g., "Melancholy" or "Tenderness"). The system utilizes a Retrieval-Augmented Generation (RAG) pipeline to fetch style references and employs a Neural 3D LUT (Look-Up Table) to transfer the color palette while strictly preserving the original image's structure.


Architecture & Workflow

colour_correction_architecture_diagram


Pipeline/Workflow

  • Retrieval Augmented Generation
    • Embedding: The input image is encoded via CLIP ViT-B/32. We embed only visual features, ignoring text labels to preserve geometric structure.
    • Search & Filter: We query FAISS for the top 300 visual neighbors, then apply a strict metadata filter on emotion to select the top 3 stylistic references.
  • Feature Extraction A pre-trained VGG19 extracts Content Features for structure and Style Gram Matrices for texture/color.
  • Test-Time LUT Optimization A lightweight Trilinear 3D LUT is initialized as an identity mapping. We run ~100 iterations of backpropagation to minimize a perceptual loss on a downsampled proxy (256×256).
  • High-Res Inference The optimized LUT is applied to the original high-resolution image in sub-milliseconds, ensuring 100% detail preservation with zero quality loss.
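The identity-initialised trilinear LUT at the heart of this pipeline can be sketched in NumPy (the optimisation loop is omitted; the 33-point grid is an assumption, though 33³ × 3 = 107,811 values is consistent with the ~107.81K parameter count reported below):

```python
import numpy as np

def identity_lut(size=33):
    """A size^3 RGB lattice that maps every colour to itself."""
    axis = np.linspace(0.0, 1.0, size)
    r, g, b = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.stack([r, g, b], axis=-1)  # shape (size, size, size, 3)

def apply_lut(image, lut):
    """Trilinear interpolation of an RGB image through a 3D LUT."""
    size = lut.shape[0]
    pos = np.clip(image, 0.0, 1.0) * (size - 1)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, size - 1)
    f = pos - lo  # fractional offsets within the lattice cell, shape (..., 3)
    out = np.zeros_like(image)
    # Accumulate the 8 lattice corners of the cell, weighted trilinearly
    for dr in (0, 1):
        for dg in (0, 1):
            for db in (0, 1):
                ir = hi[..., 0] if dr else lo[..., 0]
                ig = hi[..., 1] if dg else lo[..., 1]
                ib = hi[..., 2] if db else lo[..., 2]
                w = ((f[..., 0] if dr else 1 - f[..., 0])
                     * (f[..., 1] if dg else 1 - f[..., 1])
                     * (f[..., 2] if db else 1 - f[..., 2]))
                out += w[..., None] * lut[ir, ig, ib]
    return out
```

Because the lattice starts as an identity mapping, iteration 0 reproduces the input exactly; backpropagation then only has to bend the lattice toward the style reference, and the optimised lattice applies to the full-resolution image at negligible cost.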

Key Design Decisions

  • Image-Only Indexing Strategy

    • Strategy We deliberately excluded emotion text labels from the embedding process.
    • Rationale Text embeddings dilute structural information (e.g., "Sad" retrieves crying faces). By embedding only the image, we restrict the search to geometry and luminance, ensuring that the retrieved reference shares the same spatial features as the user's photo.
  • Oversampling & Post-Filtering

    • Strategy Instead of fetching the top 3 matches directly, we fetch the top 300 visual neighbours and filter them for the target emotion.
    • Rationale Vector search is approximate. This two-step process all but guarantees candidates that are both visually compatible (structure) and thematically accurate (mood).
  • 3D LUTs vs Generative Diffusion

    • Design We built a differentiable TrilinearLUT module instead of using Stable Diffusion/ControlNet.
    • Rationale
      • Mobile Feasibility: ~100k parameters vs. ~860M for U-Nets.
      • Zero Hallucinations: A LUT is a pure colour mapping function. It is mathematically impossible for it to distort objects or add artefacts, making it safe for professional editing.
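The oversample-then-filter retrieval described above can be sketched with a brute-force cosine search standing in for FAISS (the embeddings and emotion tags below are synthetic; a real index would hold CLIP ViT-B/32 vectors):

```python
import numpy as np

def retrieve(query, index_vecs, emotions, target_emotion, k_visual=300, k_final=3):
    """Fetch top visual neighbours, then filter by emotion metadata."""
    q = query / np.linalg.norm(query)
    vecs = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    sims = vecs @ q
    # Step 1: oversample on pure visual similarity
    visual_top = np.argsort(-sims)[:k_visual]
    # Step 2: strict metadata filter, keeping the best k_final matches
    kept = [int(i) for i in visual_top if emotions[i] == target_emotion]
    return kept[:k_final]
```

Filtering after the visual search, rather than before, keeps the candidate pool ranked by structural similarity, so the surviving references share geometry and luminance with the query while still matching the requested mood.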

Dataset

For the vector database construction we used the Unsplash dataset, respecting its terms and conditions. Unsplash Dataset Link


Weights & Biases

| Model Architecture | Value |
|---|---|
| Parameters (Total) | 0.11M |
| Parameters (Trainable) | 0.11M |
| Weights Size (Disk, MB) | 0.41 |
| Complexity (FLOPs, M) | 3.15 |

| Module | Number of Parameters | FLOPs |
|---|---|---|
| TrilinearLUT | 107.81K | 3.15M |

Compute Profile

colour_correction_latency_profile


| Runtime Metric | Value |
|---|---|
| VRAM | 1718.45 MB |
| Total Latency | 7135.04 ms |
| System Throughput | 0.14 img/s |

Terms, Conditions & Licenses


References


HertzMark

To ensure content provenance and detect AI-generated outputs, we implemented a robust watermarking system. Unlike spatial watermarks, this system operates in the Spectral Domain (Frequency Domain), making it resilient to compression and cropping.

Architecture

synth_id_architecture


Embedder Architecture

  • Embedder Pipeline The embedding process transforms the image into its frequency components to inject a learned signature into the Mid Frequency bands.

  • Spectral Transformation (FFT) The input image I is converted to the frequency domain using the Fast Fourier Transform. We shift the zero-frequency component to the center so that specific frequency bands are easy to target. $$F(u,v)=\mathrm{FFT}(I)$$

  • Key-Based Pattern Generation A dense neural network maps a unique 128-dim cryptographic key to a spatial 64×64 watermark pattern. This is bilinearly upsampled to match the image resolution, ensuring every watermark is unique to its session/user.

  • Perceptual Masking: To prevent visual artifacts, we calculate a Texture Map using a convolution layer.

    • Rationale: The human eye is sensitive to noise in flat regions (e.g., clear sky) but insensitive in textured regions (e.g., foliage).
    • Design: The watermark intensity is scaled down in flat areas and scaled up in textured areas using this map.
  • Mid-Frequency Injection Low frequencies contain color/structure (modifying them ruins aesthetics). High frequencies contain noise (modifying them gets removed by JPEG compression). Mid-frequencies are the robust middle ground. $$F_{new}= F + (\alpha \times W_{pattern} \times M_{texture})$$

  • Reconstruction The modified frequency spectrum is converted back to the spatial domain via Inverse FFT to produce the final signed image.
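The embedder pipeline above can be sketched end to end. This is not the trained system: a seeded random field stands in for the learned key-to-pattern network, gradient magnitude stands in for the learned texture convolution, and the band edges are illustrative; only $\alpha = 0.0005$ comes from the report.

```python
import numpy as np

def embed_watermark(image, key, alpha=0.0005, band=(0.15, 0.45)):
    """Inject a key-derived pattern into the mid-frequency band of an image.

    image: float grayscale array in [0, 1], shape (H, W)
    key:   sequence of small non-negative ints, standing in for the 128-dim key
    """
    h, w = image.shape

    # Key -> spatial pattern; a seeded random field stands in for the
    # learned dense network + bilinear upsampling described above
    seed = int.from_bytes(np.asarray(key, dtype=np.uint8).tobytes(), "big")
    pattern = np.random.default_rng(seed).standard_normal((h, w))

    # Texture map: gradient magnitude stands in for the learned conv layer,
    # attenuating the watermark in flat regions and boosting it in textured ones
    gy, gx = np.gradient(image)
    texture = np.hypot(gx, gy)
    texture /= texture.max() + 1e-12

    # Centred spectrum so the mid-frequency band is an annulus around the origin
    F = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.hypot(yy / (h / 2), xx / (w / 2))
    mid_band = (radius > band[0]) & (radius < band[1])

    # F_new = F + alpha * W_pattern * M_texture, restricted to the mid band
    W_pattern = np.fft.fftshift(np.fft.fft2(pattern * texture))
    F_new = F + alpha * W_pattern * mid_band

    # Back to the spatial domain; the small imaginary residue from the
    # asymmetric perturbation is discarded (acceptable for a sketch)
    signed = np.real(np.fft.ifft2(np.fft.ifftshift(F_new)))
    return np.clip(signed, 0.0, 1.0)
```

Even in this sketch, $\alpha = 0.0005$ keeps the spatial perturbation far below one grey level, so the signed image stays visually indistinguishable from the original.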


Design Decisions

  • Frequency Domain Injection We chose spectral injection over pixel patching because it is global: if a user crops the top-left corner of the image, the frequency information remains intact, allowing the detector to still identify the watermark.
  • Adaptive Alpha Scaling The parameter $\alpha=0.0005$ is extremely low. By combining this low baseline with the texture_map, we ensure the Peak Signal-to-Noise Ratio remains high, meaning the watermarked image is visually indistinguishable from the original.

Compute Profile

  • The watermarking injection operates as a near-instantaneous post-processing step. With an architectural footprint dominated by highly optimized Fast Fourier Transforms (FFT) and a compact dense projection head, the latency overhead is negligible (< 5ms).

Terms, Conditions & Licenses

References

Appendix I - Installation Guide

Software

| Field | Value |
|---|---|
| Operating System | Ubuntu 24.04.1 |
| Miniconda | 25.11.0 |
| Docker | 29.0.0 |

Pre-requisites

# nvidia-container-toolkit installation
# -------------------------------------
# add the gpg key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

# add the generic deb repository
echo "deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/$(dpkg --print-architecture) /" | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# install
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
  • Clone the repository
# clone the repo
# --------------
git clone https://github.com/photosaverneedbackup-star/team37-adobe-14
cd team_37_adobe
git submodule init

Text Morph

  • Clone the repo
# change the directory
# --------------
cd text_morph

Path 1 : Via Conda & Standalone server

# create the conda environment
# ----------------------------
conda create -n text_morph python=3.8
conda activate text_morph
# install the requirements
# ------------------------
# NOTE: these versions are specifically written to avoid dependency issues,
# do not change them
pip install --upgrade torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117
# other dependencies
pip install -r requirements.txt
pip install huggingface_hub

# download the dataset
python weights_download.py --path ./TextCtrl/weights

In case the above step fails:

  • Download the model weights for TextCtrl from Mega
  • Move the TextCtrl weights to text_morph/TextCtrl/weights
# start the server
# ----------------
python server.py

# use the html files provided to interact with the server
# or use CURL to send commands

Path 2 : Docker Environment

# building the docker image
# -------------------------
sudo docker build -t text_morph .

# running the docker image
# ------------------------
sudo docker run -p 7000:7000 text_morph

Lightshift Remove

  • Clone the repo
cd lightshift_remove

Path 1 : Via the Conda Environment

# create the conda environment
# ----------------------------
conda create -n lightshift_remove python=3.11.14
conda activate lightshift_remove
pip install -r requirements.txt

# start the standalone server
# ---------------------------
python server.py

Path 2 : Docker Environment

docker build -t lightshift_remove .
docker run -it --gpus all lightshift_remove

Training LaMa-Fourier

Download and set up the repo
# create the conda environment
# ----------------------------
conda env create -f environment.yaml
conda activate lama
export TORCH_HOME=$(pwd) && export PYTHONPATH=$(pwd)
Training
  • Download the dataset from Pipe Dataset and generate your validation masks from the images, or run:
python pipe_download.py
  • Change the path of your dataset in configs/training/orin.yaml by setting the variable data_root_dir to the output directory of pipe_download.py

  • Train the model

python bin/train.py -cn lama-fourier location=orin
Inference
  • Place your test images in the folder test_images/
python bin/predict.py model.path=$(pwd)/fourier indir=$(pwd)/test_images outdir=$(pwd)/test_images_output

Mood Lens

Path 1 : Via Conda & Standalone server

# create the conda environment
conda create -n mood_lens python=3.11.14
conda activate mood_lens
# install the dependencies
pip install -r requirements.txt

Preparing the Dataset

  • Download the Dataset [unsplash]
  • Move the photos.csv000 to this directory
python mood_lens/download_dataset.py
  • dataset_manifest.csv will be created
# get the gemini api key for this step and replace the line at the top
# of the file to caption the dataset with emotion
#
# NOTE: run caption_dataset.ipynb
#
# main.py will automatically build the index for the first time & reuse it
# for subsequent runs!
cd .. #exit out of the pipeline directory
uvicorn mood_lens.server:app

Path 2 : Docker Environment

sudo docker run -it -p 8000:8000 -e dataset_path="/app/dataset" -v $(pwd)/dataset:/app/dataset -v $(pwd)/output:/app/output -v $(pwd)/misc:/app/misc colour-correction

HertzMark

cd hertz_mark

# create the conda environment
# ----------------------------
conda env create -f environment.yaml
conda activate hertz_mark
  • Download the dataset from HertzMark [filmset from Kaggle]
  • Move the dataset to data/
  • To modify the dataset path, modify the DATA_FOLDER variable
# for training
python train.py
  • To infer, run these commands:
python embed_inferency.py # for embedding the input image with a watermark
python detect_inference.py # for detecting whether the input image was watermarked by our pipeline

Production Ready Docker

# uses the docker-compose.yml
docker compose up -d

Website

# web prototype
cd client
npm i
npm start

Appendix II - Hardware

CPU


General Information

| Field | Value |
|---|---|
| Architecture | x86_64 |
| CPU Modes | 32-bit, 64-bit |
| Address Sizes | 46-bit physical, 48-bit virtual |
| Byte Order | Little Endian |
| Model Name | 13th Gen Intel(R) Core(TM) i9-13900 |

Topology

| Field | Value |
|---|---|
| Total CPUs | 32 |
| Threads per Core | 2 |
| Max Frequency | 5600 MHz |
| Min Frequency | 800 MHz |

Cache Summary

| Cache Level | Size | Instances |
|---|---|---|
| L1d | 896 KiB | 24 |
| L1i | 1.3 MiB | 24 |
| L2 | 32 MiB | 12 |
| L3 | 36 MiB | 1 |

GPU


General Information

| Field | Value |
|---|---|
| GPU Model | NVIDIA GeForce RTX 3050 OEM |
| Driver Version | 570.195.03 |
| CUDA Version | 12.8 |
| VRAM | 8192 MiB |

Appendix III - Abbreviations

  • GaMuSA - Glyph Adaptive Mutual Self-Attention
  • CLIP - Contrastive Language-Image Pretraining
  • SAM - Segment Anything
  • LaMa - Large Mask Inpainting
  • YOLO - You Only Look Once


About

An AI-powered Photoshop-like editor designed for the mobile devices of 2030. This project explores next-generation creative workflows, including font-aware text editing directly on images, emotion-driven color grading without manual curve adjustments, and intelligent object removal with realistic relighting. Built for the Adobe problem statement at Inter IIT Tech Meet 14.0.
