Offline-First Architecture & Data Synchronization Deep Dive

In mobile system design interviews, "Offline Support" is often a dedicated section or a major non-functional requirement. Unlike web apps, mobile apps must assume the network is unreliable.

Your Goal: Demonstrate how to architect an app that works seamlessly without internet, syncs efficiently when connectivity returns, and handles data conflicts gracefully.

1. Core Philosophy: The Single Source of Truth

The most critical architectural decision is defining the "Single Source of Truth" (SSOT).

The "Online-First" Mistake

Many candidates design apps that fetch data from the network and display it directly in the UI.

  • Problem: If the network fails, the screen is empty. If the user navigates away and back, they wait for a loader again.
  • Result: Poor UX and high data usage.

The "Offline-First" Solution (Repository Pattern)

The UI only observes the Local Database.

  1. Read: UI subscribes to the Local DB (e.g., Room Flow, Core Data NSFetchedResultsController).
  2. Write: User actions update the Local DB immediately.
  3. Sync: A background "Sync Engine" synchronizes the Local DB with the Remote API.

The Signal: This decouples the UI from the Network. The app feels instant because it's reading from local disk, regardless of network latency.
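
The read/write flow above can be sketched in a few lines. This is a minimal illustration, not a production design: an in-memory dictionary stands in for the local database, a callback list stands in for an observable query, and the names LocalStore, NoteRepository, and pending are hypothetical.

```python
from typing import Callable, Dict, List


class LocalStore:
    """Stand-in for an observable local database (assumption: in-memory)."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}
        self._observers: List[Callable[[Dict[str, str]], None]] = []

    def observe(self, callback: Callable[[Dict[str, str]], None]) -> None:
        # Emit the current snapshot immediately, then again on every change.
        self._observers.append(callback)
        callback(dict(self._rows))

    def upsert(self, note_id: str, title: str) -> None:
        self._rows[note_id] = title
        for callback in self._observers:
            callback(dict(self._rows))


class NoteRepository:
    """The UI talks only to this class; it never calls the network."""

    def __init__(self, store: LocalStore) -> None:
        self.store = store
        self.pending: List[str] = []  # IDs the sync engine still has to push

    def rename(self, note_id: str, title: str) -> None:
        self.store.upsert(note_id, title)  # 1. write locally, UI updates at once
        self.pending.append(note_id)       # 2. leave the upload to the sync engine


store = LocalStore()
repo = NoteRepository(store)
rendered: List[Dict[str, str]] = []
store.observe(rendered.append)   # the "UI" subscribes to the local store
repo.rename("n1", "Groceries")   # succeeds with no network at all
```

The "UI" receives an empty snapshot on subscription and the renamed note right after the write, without any network call on that path.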

2. Synchronization Strategies

How do you keep the Local DB and Remote Server in sync?

2.1 Full Sync vs. Delta Sync

  • Full Sync: Download the entire dataset every time.
    • Pros: Simple to implement. Guaranteed consistency.
    • Cons: High bandwidth, slow, battery drain. Only acceptable for tiny datasets (e.g., User Settings).
  • Delta Sync (Incremental Sync): Download only what changed since the last sync.
    • Mechanism: The client sends a sync marker to the server. The server returns only records modified after that point.
    • Option A: last_synced_timestamp
      • Flow:
        1. Client fetches all data. Server returns current server time (e.g., 2025-12-19T10:00:00Z).
        2. Client saves this timestamp locally.
        3. Next sync, client requests: GET /sync?since=2025-12-19T10:00:00Z.
        4. Server queries DB for updated_at > since.
      • Risk: Vulnerable to Clock Skew. If server instances have different times, or an update happens in the exact same millisecond as the sync, data might be missed.
    • Option B: sync_token (The "Opaque Cursor") - Recommended
      • Definition: A string or number generated by the server (e.g., v2_seq_98765) that acts as a bookmark. The client stores it blindly without interpreting it.
      • Flow:
        1. Response: Server returns data + token: {"data": [...], "sync_token": "v2_seq_98765"}.
        2. Request: Next sync, client sends it back: GET /sync?token=v2_seq_98765.
        3. Server Logic: Server decodes the token (e.g., maps it to a Global Sequence ID) and returns newer items.
      • Benefit: Stateless & Robust. Avoids clock skew entirely by using monotonic sequence IDs. Allows the backend to change versioning logic without breaking the mobile app.
    • Pros: Efficient, fast, saves battery.
    • Cons: Complex backend logic (requires "Soft Deletes" to sync deletions).
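
As a rough sketch of Option B, the toy server below stamps every write with a monotonic sequence number and hands the client an opaque token. The token format "v1_seq_<n>" and the class names SyncServer and SyncClient are assumptions made for illustration; a real backend would choose its own encoding.

```python
from typing import Dict, Optional, Tuple


class SyncServer:
    """Toy server that stamps every write with a monotonic sequence number."""

    def __init__(self) -> None:
        self._seq = 0
        self._rows: Dict[str, Tuple[int, str]] = {}  # id -> (seq, value)

    def write(self, row_id: str, value: str) -> None:
        self._seq += 1
        self._rows[row_id] = (self._seq, value)

    def sync(self, token: Optional[str]) -> dict:
        # Only the server understands the token (hypothetical "v1_seq_<n>").
        since = int(token.rsplit("_", 1)[1]) if token else 0
        changed = {
            row_id: value
            for row_id, (seq, value) in self._rows.items()
            if seq > since
        }
        return {"data": changed, "sync_token": f"v1_seq_{self._seq}"}


class SyncClient:
    def __init__(self) -> None:
        self.local: Dict[str, str] = {}
        self.token: Optional[str] = None  # stored verbatim, never parsed

    def pull(self, server: SyncServer) -> int:
        response = server.sync(self.token)
        self.local.update(response["data"])
        self.token = response["sync_token"]
        return len(response["data"])


server = SyncServer()
server.write("a", "1")
server.write("b", "2")
client = SyncClient()
first = client.pull(server)    # initial sync: every row
server.write("b", "3")
second = client.pull(server)   # delta sync: only the changed row
```

Because the client never parses the token, the server could later switch to a different encoding without a client release.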

2.2 Sync Direction

  • Pull (Down-sync): Fetching updates from the server to the device.
  • Push (Up-sync): Sending local changes (pending writes) to the server.

2.3 Advanced Sync Patterns

Operation Log (Event Sourcing)

  • The Concept: Instead of syncing the current state (e.g., "Note Title is 'Groceries'"), you sync the list of changes (e.g., "User A changed title to 'Groceries'").
  • How it works:
    • The server keeps an append-only log of all mutations.
    • The client says "Give me all operations starting from Offset 100."
    • The client "replays" these actions locally.
  • Pros: Preserves intent (e.g., distinguishing between "User set value to 0" and "User decremented value"). Easier to resolve conflicts.
  • Cons: If the client is very old (Offset 0), replaying the entire history is slow. Requires "Snapshots" to fix.
  • Read More: Martin Fowler on Event Sourcing
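
A minimal sketch of replaying an operation log from an offset. The operation format (kind, key, amount) and the helper names apply_op and replay are hypothetical; the point is that "set" and "increment" stay distinguishable, which a state snapshot would lose.

```python
from typing import Dict, List, Tuple

Op = Tuple[str, str, int]  # (kind, key, amount); a hypothetical operation format


def apply_op(state: Dict[str, int], op: Op) -> None:
    kind, key, amount = op
    if kind == "set":
        state[key] = amount
    elif kind == "increment":
        state[key] = state.get(key, 0) + amount
    else:
        raise ValueError(f"unknown operation: {kind}")


def replay(state: Dict[str, int], log: List[Op], offset: int) -> int:
    """Apply every operation from `offset` onward and return the new offset."""
    for op in log[offset:]:
        apply_op(state, op)
    return len(log)


server_log: List[Op] = [
    ("set", "stock", 10),
    ("increment", "stock", -1),
    ("increment", "stock", -1),
]

client_state: Dict[str, int] = {}
client_offset = replay(client_state, server_log, 0)  # catch up from the start

server_log.append(("set", "stock", 0))
client_offset = replay(client_state, server_log, client_offset)  # only the new op
```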

Merkle Trees (Hash Trees)

  • The Concept: Instead of tracking when data changed, you track the signature of the data.
  • How it works:
    1. Both the Client and Server organize their data records into a tree structure.
    2. Each leaf node is a hash of a data record. Each parent node is a hash of its children.
    3. The Sync Protocol: The client sends the "Root Hash" of its tree to the server.
    4. Comparison: If Client.RootHash == Server.RootHash, they are perfectly in sync (0 bytes transferred). If different, the server compares children hashes to traverse down and pinpoint specifically which record is out of sync.
  • Use Case: Blockchain, Git, Cassandra, and complex file syncing (like Dropbox).
  • Pros: Extremely bandwidth-efficient for verifying consistency of large datasets.
  • Cons: High computational cost (hashing) and complexity to maintain the tree.
  • Read More: Merkle Tree (Wikipedia)
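
The comparison step can be illustrated with a small binary hash tree. This sketch assumes both sides hold the same number of records in the same order, which real systems typically arrange by bucketing keys; SHA-256 and the helper names build_levels and diff are choices made here, not part of the description above.

```python
import hashlib
from typing import List


def h(data: str) -> str:
    return hashlib.sha256(data.encode()).hexdigest()


def build_levels(records: List[str]) -> List[List[str]]:
    """Return tree levels from the leaves (index 0) up to the root (last)."""
    level = [h(record) for record in records]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:  # odd count: pair the last hash with itself
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def diff(a: List[List[str]], b: List[List[str]]) -> List[int]:
    """Return indices of differing leaves, descending only into mismatches."""
    if a[-1] == b[-1]:
        return []  # root hashes match: nothing to transfer
    suspects = [0]
    for depth in range(len(a) - 1, 0, -1):
        mine, theirs = a[depth - 1], b[depth - 1]
        suspects = [
            child
            for parent in suspects
            for child in (2 * parent, 2 * parent + 1)
            if child < len(mine) and mine[child] != theirs[child]
        ]
    return suspects


client_records = ["a:1", "b:2", "c:3", "d:4", "e:5"]
server_records = ["a:1", "b:2", "c:3", "d:9", "e:5"]
out_of_sync = diff(build_levels(client_records), build_levels(server_records))
```

Only the subtree containing the changed record is walked, so the number of hashes compared grows with the tree depth rather than with the dataset size.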

Version Vectors (Vector Clocks)

  • The Concept: Instead of a single "Server Time", we track logical counters for every actor (device) that modifies data.
  • How it works:
    • State is tracked as [DeviceA: 5, DeviceB: 3, Server: 10].
    • This allows the system to distinguish between "Device A hasn't seen Device B's update" vs "Device A overwrote Device B's update."
  • Use Case: Peer-to-Peer systems or truly distributed offline-first apps where devices might sync directly with each other (rare in typical mobile interviews, but good for "Signal").
  • Read More: Vector Clocks (Wikipedia)
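
A small sketch of how two version vectors are compared. The function names compare and merge are hypothetical; the logic is the standard element-wise comparison, where "concurrent" means neither side has seen the other's update.

```python
from typing import Dict

Vector = Dict[str, int]


def compare(a: Vector, b: Vector) -> str:
    """Return 'equal', 'before', 'after', or 'concurrent' for a relative to b."""
    actors = set(a) | set(b)
    a_behind = any(a.get(x, 0) < b.get(x, 0) for x in actors)
    b_behind = any(b.get(x, 0) < a.get(x, 0) for x in actors)
    if a_behind and b_behind:
        return "concurrent"  # a true conflict: each side missed an update
    if a_behind:
        return "before"      # a is simply stale and can fast-forward to b
    if b_behind:
        return "after"
    return "equal"


def merge(a: Vector, b: Vector) -> Vector:
    """Element-wise maximum, recorded once the conflict has been resolved."""
    return {x: max(a.get(x, 0), b.get(x, 0)) for x in set(a) | set(b)}


stale = compare({"DeviceA": 5, "DeviceB": 3}, {"DeviceA": 5, "DeviceB": 4})
conflict = compare({"DeviceA": 6, "DeviceB": 3}, {"DeviceA": 5, "DeviceB": 4})
```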

Hash-Based "Check-In"

  • The Concept: A simplified "Merkle Tree" for a quick sanity check.
  • How it works:
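    • Client calculates a single hash of its entire dataset (e.g., md5(all_ids + timestamps)), iterating records in a deterministic order (such as sorted by ID) so that both sides hash identical input.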
    • Client sends this hash to the server.
    • Server compares it with its own calculation. If match -> Done. If mismatch -> Trigger full sync or standard delta sync.
  • Pros: Very easy to implement. Great for verifying consistency after a series of complex delta syncs.
  • Cons: Requires hashing the entire dataset, which can be slow for large databases.
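
A sketch of the check-in digest, assuming each record is reduced to an (id, updated_at) pair. It uses SHA-256 instead of the md5 mentioned above, and the function name dataset_digest is hypothetical. Sorting by ID matters: without a fixed order, identical datasets could produce different hashes.

```python
import hashlib
from typing import Dict


def dataset_digest(rows: Dict[str, str]) -> str:
    """Hash every (id, updated_at) pair in sorted order so both sides agree."""
    digest = hashlib.sha256()
    for row_id in sorted(rows):
        digest.update(f"{row_id}:{rows[row_id]}\n".encode())
    return digest.hexdigest()


client_rows = {"b": "2025-12-19T10:00:00Z", "a": "2025-12-18T09:00:00Z"}
server_rows = {"a": "2025-12-18T09:00:00Z", "b": "2025-12-19T10:00:00Z"}
in_sync = dataset_digest(client_rows) == dataset_digest(server_rows)

server_rows["b"] = "2025-12-19T11:30:00Z"  # a change the client has not seen
needs_sync = dataset_digest(client_rows) != dataset_digest(server_rows)
```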

2.4 Soft Deletes

You cannot physically delete a row on the server in a Delta Sync system, because the client won't know it's gone.

  • Solution: Use an is_deleted (tombstone) column.
  • Flow:
    1. Server marks item as is_deleted = true.
    2. Client requests changes since T.
    3. Server sends the "deleted" item.
    4. Client sees the flag and removes it from the Local DB (or keeps it hidden if undo is allowed).
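
The client side of this flow might look like the sketch below. The record shape (id, title, is_deleted) and the function name apply_changes are assumptions for illustration.

```python
from typing import Dict, List


def apply_changes(local: Dict[str, dict], changes: List[dict]) -> None:
    """Apply a delta response: upsert live rows, drop tombstoned ones."""
    for row in changes:
        if row["is_deleted"]:
            local.pop(row["id"], None)  # tolerate rows the client never had
        else:
            local[row["id"]] = row


local_db = {
    "n1": {"id": "n1", "title": "Groceries", "is_deleted": False},
    "n2": {"id": "n2", "title": "Old note", "is_deleted": False},
}
delta = [
    {"id": "n2", "title": "Old note", "is_deleted": True},   # tombstone
    {"id": "n3", "title": "New note", "is_deleted": False},
]
apply_changes(local_db, delta)
```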

3. Handling Local Writes (The "Pending" Queue)

When a user performs an action (e.g., "Like Tweet") while offline:

  1. Optimistic Update: Immediately update the UI to show the "Like" state (red heart).
  2. Persist Action: Store the action in a Persistent Queue (not just memory).
    • Why Persistent? If the app is killed before the network returns, the action must not be lost.
  3. Background Sync: When the network returns (via WorkManager on Android or BackgroundTasks on iOS), process the queue.
    • Success: Remove item from queue.
    • Failure (Transient): Retry with Exponential Backoff.
    • Failure (Permanent): Remove from queue and notify user (e.g., "Could not like tweet"). Revert the Optimistic Update.
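
One way to sketch the persistent queue is with SQLite, shown here through Python's sqlite3 module. The table layout, the class name PendingQueue, and the 'ok' / 'retry' / 'reject' result strings are all assumptions; the demo uses an in-memory database, whereas a real app would pass a file path so the queue survives process death.

```python
import sqlite3
from typing import Callable, List


class PendingQueue:
    def __init__(self, path: str) -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pending ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "action TEXT NOT NULL, "
            "attempts INTEGER NOT NULL DEFAULT 0)"
        )

    def enqueue(self, action: str) -> None:
        with self.db:  # commits when the block exits without an exception
            self.db.execute("INSERT INTO pending (action) VALUES (?)", (action,))

    def size(self) -> int:
        return self.db.execute("SELECT COUNT(*) FROM pending").fetchone()[0]

    def flush(self, send: Callable[[str], str]) -> List[str]:
        """Send actions in order; `send` returns 'ok', 'retry', or 'reject'."""
        rejected: List[str] = []
        rows = self.db.execute(
            "SELECT id, action FROM pending ORDER BY id"
        ).fetchall()
        for row_id, action in rows:
            result = send(action)
            with self.db:
                if result == "retry":
                    self.db.execute(
                        "UPDATE pending SET attempts = attempts + 1 WHERE id = ?",
                        (row_id,),
                    )
                    break  # transient failure: stop here and keep the order
                self.db.execute("DELETE FROM pending WHERE id = ?", (row_id,))
            if result == "reject":
                rejected.append(action)  # caller reverts the optimistic update
        return rejected


queue = PendingQueue(":memory:")  # assumption: use a file path in a real app
queue.enqueue("like:tweet42")
queue.enqueue("like:deleted_tweet")
offline = queue.flush(lambda action: "retry")
online = queue.flush(lambda action: "reject" if "deleted" in action else "ok")
```

Stopping at the first transient failure preserves ordering, which matters when later actions depend on earlier ones (for example, create before edit).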

4. Conflict Resolution Strategies

This is the hardest part of offline architecture. What happens if the user edits a note offline, but someone else edits the same note on the server?

Strategy A: Last Write Wins (LWW)

  • Logic: The system looks at the timestamp. The most recent update overwrites the other.
  • Pros: Easy to implement.
  • Cons: Data loss. (If I edit offline at 10:00, and you edit online at 10:05, my changes are wiped out when I eventually sync).
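
The data-loss case above takes only a few lines to reproduce. This sketch assumes each copy carries an updated_at field in comparable units; the function name last_write_wins is hypothetical.

```python
def last_write_wins(a: dict, b: dict) -> dict:
    """Keep whichever copy carries the later timestamp; ties favor `a`."""
    return a if a["updated_at"] >= b["updated_at"] else b


offline_edit = {"text": "Buy milk and eggs", "updated_at": 1000}  # edited at 10:00
online_edit = {"text": "Buy bread", "updated_at": 1005}           # edited at 10:05
winner = last_write_wins(offline_edit, online_edit)
```

The offline edit is dropped in full, even though the two edits may have touched unrelated parts of the note.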

Strategy B: Server Authority (The "rejected git push" approach)

  • Logic: The server's version is the truth. If the client tries to upload a stale version, the server rejects it (HTTP 409 Conflict).
  • Client Handling: The client must download the new server version and ask the user what to do.
  • Pros: Safe, prevents silent data loss.
  • Cons: Annoying UX ("Conflict detected, please resolve").
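
A sketch of the server-side check, assuming each note carries an integer version that the client echoes back as base_version. The class name NoteServer and the response shape are hypothetical; the status codes mirror the HTTP semantics described above.

```python
from typing import Dict, Tuple


class NoteServer:
    def __init__(self) -> None:
        self.notes: Dict[str, Tuple[int, str]] = {}  # id -> (version, text)

    def put(self, note_id: str, base_version: int, text: str) -> Tuple[int, dict]:
        """Accept the write only if the client edited the current version."""
        current_version, current_text = self.notes.get(note_id, (0, ""))
        if base_version != current_version:
            # Stale write: return the latest copy so the client can resolve it.
            return 409, {"version": current_version, "text": current_text}
        self.notes[note_id] = (current_version + 1, text)
        return 200, {"version": current_version + 1, "text": text}


server = NoteServer()
created = server.put("n1", 0, "draft")          # both devices now hold version 1
online = server.put("n1", 1, "online edit")     # accepted, becomes version 2
offline = server.put("n1", 1, "offline edit")   # stale base version, rejected
```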

Strategy C: Field-Level Merging

  • Logic: Merge non-conflicting fields automatically.
  • Example: User A updates Title. User B updates Description. Both changes are kept.
  • Pros: Reduces conflicts significantly.
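
Field-level merging is usually a three-way merge against the last version both sides agreed on. The sketch below assumes all three copies share the same fields; the function name merge_fields and the choice to keep the remote value for true conflicts are illustrative only.

```python
from typing import Dict, List, Tuple


def merge_fields(
    base: Dict[str, str], local: Dict[str, str], remote: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Merge per field; report fields that both sides changed differently."""
    merged: Dict[str, str] = {}
    conflicts: List[str] = []
    for field in base:
        local_changed = local[field] != base[field]
        remote_changed = remote[field] != base[field]
        if local_changed and remote_changed and local[field] != remote[field]:
            conflicts.append(field)
            merged[field] = remote[field]  # placeholder until resolved
        elif local_changed:
            merged[field] = local[field]
        else:
            merged[field] = remote[field]
    return merged, conflicts


base = {"title": "Trip", "description": "TBD"}
user_a = {"title": "Trip to Oslo", "description": "TBD"}           # edits title
user_b = {"title": "Trip", "description": "Flights booked"}        # edits description
merged, conflicts = merge_fields(base, user_a, user_b)
```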

Strategy D: CRDTs (Conflict-Free Replicated Data Types)

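  • Logic: Specialized data structures whose merge operation is commutative, associative, and idempotent, so replicas converge to the same state regardless of the order in which updates arrive.
  • Use Case: Collaborative editing (e.g., libraries such as Yjs and Automerge), distributed counters. Note that Google Docs is generally described as using Operational Transformation rather than CRDTs.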
  • The Signal: Mentioning CRDTs shows deep theoretical knowledge, but acknowledge they are complex to implement from scratch.
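
The simplest CRDT to show in an interview is probably a grow-only counter (G-Counter). In this sketch, each replica increments only its own slot and merging takes the per-replica maximum; the class and variable names are illustrative.

```python
from typing import Dict


class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.counts: Dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        for replica, count in other.counts.items():
            self.counts[replica] = max(self.counts.get(replica, 0), count)

    @property
    def value(self) -> int:
        return sum(self.counts.values())


phone = GCounter("phone")
tablet = GCounter("tablet")
phone.increment(2)    # two likes recorded offline on the phone
tablet.increment(3)   # three likes recorded offline on the tablet

phone.merge(tablet)
tablet.merge(phone)
phone.merge(tablet)   # merging again changes nothing (idempotent)
```

Both replicas end at the same total regardless of merge order, which is the convergence property that makes CRDTs attractive for offline edits.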

5. Mobile-Specific Components

  • Database:
    • Android: Room (SQLite wrapper). Strongly typed, observable.
    • iOS: Core Data (Object Graph) or SwiftData.
    • Cross-Platform: Realm (NoSQL, easy sync), SQLite (Raw).
  • Job Schedulers:
    • Android: WorkManager. The gold standard. Handles constraints (e.g., "Run only when on WiFi and Charging").
    • iOS: BGAppRefreshTask / BGProcessingTask. Stricter limits on execution time, and the system decides when tasks actually run.

8. Real-World Case Studies

  • Trello: (Engineering Blog)
    • Strategy: A complex "Command Queue" system that replays user actions.
    • Signal: Demonstrates how to handle "optimistic UI" with potentially thousands of offline edits.
  • Linear: (Engineering Blog)
    • Strategy: "Sync Engine" that treats the local database as a cache of the entire dataset.
    • Signal: Shows how high-performance apps prioritize local-first reads for speed.
  • CouchDB / PouchDB: (Website)
    • Strategy: Replication protocols built into the database layer.
    • Signal: Understanding "Replication" vs. "Custom Sync" trade-offs.

9. Summary Checklist for the Interview

  1. Define SSOT: "I will use the Repository Pattern with a local database as the Single Source of Truth."
  2. Define Sync Strategy: "I will implement Delta Sync using an opaque sync_token cursor to minimize bandwidth."
  3. Handle Offline Writes: "I will use a persistent operation queue and WorkManager to flush changes when connectivity returns."
  4. Address Conflicts: "For this use case, [Last Write Wins / User Prompt] is appropriate because..."
  5. Mention UX: "I will use Optimistic Updates to make the app feel responsive."

10. Common Pitfalls ("Red Flags")

  • "I'll use a boolean isOffline flag." -> Bad code smell. Avoid building separate logic paths. Always write to DB, let the Sync Engine handle the rest.
  • In-Memory Queues: -> Data loss if the app crashes. Always persist pending actions.
  • Infinite Retries: -> Battery drain. Always use Exponential Backoff and jitter.
  • Blocking the UI: -> Database and Network operations must happen on background threads.
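
For the retry pitfall, a delay calculation with exponential backoff and jitter can be sketched as follows. The "full jitter" variant, the base and cap values, and the function name backoff_delay are assumptions; the rng parameter exists only so the behavior can be checked deterministically.

```python
import random
from typing import Callable


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 300.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Return a delay in seconds for a zero-based attempt number.

    The ceiling doubles with each attempt up to `cap`; the actual delay is
    drawn uniformly below that ceiling so clients do not retry in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


delays = [backoff_delay(attempt) for attempt in range(5)]
```

On Android, WorkManager can apply a backoff policy on its own, so a hand-rolled calculation like this is mainly relevant when the sync engine schedules its own retries.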

11. Further Reading