Skip to content

dieterich-lab/RmapDifyChatbot

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RmapDifyChatbot

RmapDifyChatbot is a production-oriented Python project for operating a Dify-based academic assistant with explicit metadata routing.

Status Snapshot (2026-05-29)

  1. Iterative map/reduce workflow is stable for core two-turn handover: paper list -> summarize selected papers.
  2. Structured-output extractor handoff is now robust (parser prefers extractor JSON over raw free text).
  3. Final answer sanitation strips leaked <think> blocks, including malformed or unclosed variants.
  4. Two-pass bulk upload is integrated for local and SLURM execution paths via Python module invocation.

Overview

The project has two responsibilities:

  1. Main use-case: deploy and operate a metadata-aware Dify chatbot workflow.
  2. Secondary service: extract metadata from papers and upload documents into Dify datasets.

Current routing workflow (config/RMAP Chatbot Meta Routing.yml):

flowchart LR
		A[Start] --> B[Query Rewriter]
		B --> C{Question Classifier}

		C -->|Knowledge Route| D[Knowledge Retrieval]
		D --> E[Knowledge LLM]
		E --> H[Answer]

		C -->|Metadata Route| F[Parameter Extractor]
		F --> G[Metadata Router Code]
		G --> I[Metadata LLM]
		I --> H
Loading

Current iterative retrieval workflow (config/RMAP Chatbot Iterative Retrieval.yml):

flowchart LR
		A[Start] --> B[Query Rewriter]
		B --> C[JSON Metadata Extractor]
		C --> F{Paper List Empty?}

		F -->|Yes| G[Knowledge Retrieval]
		G --> H[Knowledge LLM]
		H --> Z[Answer]

		F -->|No| I[Paper Iterator]
		I --> J[Question Classifier]
		J -->|Count/List| K[Metadata Code]
		J -->|Content| L[KR with filter]
		L --> M[Paper Map LLM]
		K --> N[Iteration Aggregator]
		M --> N
		N --> O[Map/Reduce LLM]
		O --> Z
Loading

Installation

Requirements

  1. Python 3.11+
  2. Poetry

Setup

poetry install
poetry run dify-upload --help

Optional local environment file (for import/debug scripts):

source .secrets/dify_console_session.env

API credential map

Use different keys for different endpoint families:

  1. DIFY_APP_API_KEY (prefix app-): app runtime endpoints under /v1 (for example /v1/chat-messages, /v1/meta).
  2. DIFY_DATASET_API_KEY (prefix dataset-): dataset upload/metadata endpoints under /v1/datasets/... used by dify-upload.
  3. DIFY_CONSOLE_API_KEY: console-management endpoints under /console/api/... (workflow import, draft run).
  4. Cookie fallback (DIFY_CONSOLE_COOKIE + DIFY_CSRF_TOKEN): only for deployments where console API keys are not accepted.

Notes:

  1. The uploader currently supports DIFY_API_KEY as a backward-compatible alias for DIFY_DATASET_API_KEY.
  2. In this deployment, app keys are valid for /v1 but not for /console/api.

Main Use-Case: Set Up The Meta Routing Chatbot

1. Import workflow DSL into Dify

Preferred mode (console API key):

DIFY_BASE_URL="http://your-dify-host" \
DIFY_CONSOLE_API_KEY="<console_api_key>" \
AUTO_CONFIRM=true \
scripts/import_dify_dsl.sh "config/RMAP Chatbot Meta Routing.yml" --app-id "<app_id>"

Cookie fallback (for deployments without console API key support):

DIFY_BASE_URL="http://your-dify-host" \
DIFY_CONSOLE_COOKIE="..." \
DIFY_CSRF_TOKEN="..." \
AUTO_CONFIRM=true \
scripts/import_dify_dsl.sh "config/RMAP Chatbot Meta Routing.yml" --app-id "<app_id>" --allow-cookie-auth

2. Validate routing behavior

DIFY_BASE_URL="http://your-dify-host" \
DIFY_CONSOLE_COOKIE="..." \
DIFY_CSRF_TOKEN="..." \
scripts/debug_route_draft.sh \
	--app-id "<app_id>" \
	--allow-cookie-auth \
	--query "What are the main methods and findings of Sci-ModoM?" \
	--query "How many papers has Christoph Dieterich published?" \
	--query "Which papers have been (co-) authored by Christoph Dieterich?"

Or with console API key (if supported by your deployment):

DIFY_BASE_URL="http://your-dify-host" \
DIFY_CONSOLE_API_KEY="<console_api_key>" \
scripts/debug_route_draft.sh \
	--app-id "<app_id>" \
	--query "What are the main methods and findings of Sci-ModoM?"

Expected routes:

  1. Content questions -> Knowledge Route
  2. Count/list/filter questions -> Metadata Route

Main Use-Case: Iterative Map/Reduce Routing

Import the iterative workflow config:

DIFY_BASE_URL="http://your-dify-host" \
DIFY_CONSOLE_COOKIE="..." \
DIFY_CSRF_TOKEN="..." \
AUTO_CONFIRM=true \
scripts/import_dify_dsl.sh "config/RMAP Chatbot Iterative Retrieval.yml" --app-id "<app_id>" --allow-cookie-auth

Validate a two-turn handover (author list -> summarize previous papers):

DIFY_BASE_URL="http://your-dify-host" \
DIFY_CONSOLE_COOKIE="..." \
DIFY_CSRF_TOKEN="..." \
scripts/debug_route_draft.sh \
	--app-id "<app_id>" \
	--allow-cookie-auth \
	--classifier-node-id "17786780005730" \
	--query "What papers did Christoph Dieterich author?" \
	--query "Can you summarize these papers?"

Notes:

  1. scripts/debug_route_draft.sh now reuses conversation_id automatically across multiple --query values in one run.

  2. You can also force a known conversation with --conversation-id "<uuid>".

  3. Iterative workflow now uses minimal memory handover: concrete paper entities from list queries are persisted in conversation.memory for follow-up summarize turns.

  4. Memory fallback is only applied for follow-up style prompts (for example: "these papers", "Fasse mir diese Papiere zusammen") to avoid contaminating unrelated turns.

  5. Hardened follow-up intents for map/reduce now include references like:

    • compare/contrast: "Compare these papers by methods and key findings"
    • ranking/subset: "Which one is the newest?", "top 3", "first two papers"
    • cross-lingual references: "diese Paper", "vergleiche diese", "welches davon"
  6. Milestone 2026-05-21 (Map/Reduce follow-up hardening):

    • Resolve Paper List now performs deterministic subset selection from conversation.memory for first/top/newest/oldest follow-ups.
    • Restored missing iteration Code node (17786780698570) to keep graph edges consistent and avoid draft runtime MISSING_NODE errors.
    • Verified two-turn flow for Which paper have been published by Christoph Dieterich -> Please summarize the first two papers. returns successfully (HTTP 200) in draft run.
  7. Milestone 2026-05-22 (Boss demo stabilization):

    • Split overloaded paper resolution logic into staged nodes (extractor parser, follow-up intent gate, memory subset selector, slim resolver merge) to make two-turn behavior explainable and robust.
    • Added final answer sanitization node (1778800001013) and routed the Answer node through cleaned_text to strip <think>...</think> leakage reliably.
    • Hardened reduce prompt with authoritative requested identities/order from resolver output to keep section 1/2 mapped to the requested first two papers.
    • Fixed YAML import fragility in single-quoted prompt blocks (apostrophe handling) and re-validated import success (HTTP 200).
    • Enforced deterministic fixed-subset formatting (**1. / **2. headers) even under sparse text context, so demo checks remain stable.
    • Re-tested two-turn draft flow (author list -> summarize first two) with HTTP 200 on both turns, no <think> in final answer, and both section markers present.
  8. Milestone 2026-05-29 (structured-output and sanitizer hardening):

    • JSON Metadata Extractor switched to structured output as primary machine-readable contract.
    • Parser node updated to accept both extractor_text and extractor_structured_output, with structured output taking precedence.
    • Extractor max token budget increased to reduce truncation risk in larger author-paper lists.
    • Metadata LLM max token budget increased (768 -> 1200) to improve long-answer completeness.
    • Final sanitizer hardened for both closed and unclosed <think> segments.
    • Validated two-turn runs with HTTP 200 on both turns and no <think> content in final answer.

Secondary Service: Metadata Extraction And Paper Upload

Use the CLI entrypoint:

poetry run dify-upload

If needed, provide dataset credentials via environment variables:

export DIFY_DATASET_API_KEY="dataset-..."
export DIFY_API_URL="http://your-dify-host/v1"
export DATASET_ID="<dataset_id>"

Common commands:

# Run default two-pass workflow
poetry run dify-upload default

# Run two-pass upload on one file
poetry run dify-upload two-pass --file "RMaP papers first funding period/your-file.pdf"

# Run diagnostics
poetry run dify-upload abc-test --file "RMaP papers first funding period/your-file.pdf"

# Preview extracted metadata
poetry run dify-upload metadata --file "RMaP papers first funding period/your-file.pdf"

# Process selected authors only
poetry run dify-upload selected-authors --author "Mark Helm" --author "Christoph Dieterich"

# Bulk processing
poetry run dify-upload bulk-two-pass --folder "RMaP papers first funding period"

# Quality report for extracted authors
poetry run dify-upload author-quality --folder "RMaP papers first funding period"

SLURM/GPU execution (recommended for full-folder runs with LLM fallback):

sbatch scripts/slurm_bulk_two_pass_ollama.sh

The SLURM script explicitly uses the Python module path to avoid wrapper ambiguity:

.venv/bin/python -m dify_uploader bulk-two-pass --folder "RMaP papers first funding period"

Hybrid extraction behavior in dify_uploader/author_extraction.py:

  1. Fast regex and heuristics.
  2. Optional LLM fallback via BAML for low-confidence cases.

BAML runtime example:

export BAML_OLLAMA_BASE_URL="http://127.0.0.1:11434/v1"
export BAML_OLLAMA_MODEL="qwen3:32b"
export AUTHOR_EXTRACTION_ENABLE_LLM_FALLBACK="true"

Next Steps (Execution Plan)

  1. Finish and verify the full 2-pass run for all PDFs in RMaP papers first funding period and capture success/failure counts in a run log.
  2. Add a concise post-run summary artifact (processed files, retries, failures, elapsed time) under reports/slurm/.
  3. Introduce a lightweight regression check for the two-turn path (list papers -> summarize those papers) after each workflow import.
  4. Add a small acceptance checklist for stakeholder demos (HTTP status, sanitizer check, section mapping check).

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors