RmapDifyChatbot is a production-oriented Python project for operating a Dify-based academic assistant with explicit metadata routing.
- Iterative map/reduce workflow is stable for core two-turn handover: paper list -> summarize selected papers.
- Structured-output extractor handoff is now robust (parser prefers extractor JSON over raw free text).
- Final answer sanitization strips leaked `<think>` blocks, including malformed or unclosed variants.
- Two-pass bulk upload is integrated for local and SLURM execution paths via Python module invocation.
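The extractor handoff and the sanitizer described in the list above can be sketched in a few lines. This is a minimal illustration, not the code of the actual workflow nodes: the function names `strip_think` and `parse_extractor` are hypothetical, and the regular expressions are one plausible way to cover closed and unclosed blocks.

```python
import json
import re

# Closed blocks: <think> ... </think>, case-insensitive, spanning newlines.
_CLOSED_THINK = re.compile(r"<think\b[^>]*>.*?</think\s*>", re.IGNORECASE | re.DOTALL)
# Unclosed block: an opening tag without a closing tag; drop everything after it.
_OPEN_THINK = re.compile(r"<think\b[^>]*>.*\Z", re.IGNORECASE | re.DOTALL)


def strip_think(text: str) -> str:
    """Remove closed and unclosed <think> segments from a final answer."""
    text = _CLOSED_THINK.sub("", text)
    text = _OPEN_THINK.sub("", text)
    return text.strip()


def parse_extractor(extractor_structured_output, extractor_text) -> dict:
    """Prefer the structured output; fall back to JSON in the raw text."""
    if isinstance(extractor_structured_output, dict) and extractor_structured_output:
        return extractor_structured_output
    try:
        parsed = json.loads(extractor_text)
    except (TypeError, ValueError):
        # Missing or non-JSON text: return an empty result instead of failing.
        return {}
    return parsed if isinstance(parsed, dict) else {}
```

Running the closed-block pattern first means a later unclosed `<think>` only removes the tail after it, so text between two blocks is kept.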
The project has two responsibilities:
- Main use-case: deploy and operate a metadata-aware Dify chatbot workflow.
- Secondary service: extract metadata from papers and upload documents into Dify datasets.
Current routing workflow (config/RMAP Chatbot Meta Routing.yml):
flowchart LR
A[Start] --> B[Query Rewriter]
B --> C{Question Classifier}
C -->|Knowledge Route| D[Knowledge Retrieval]
D --> E[Knowledge LLM]
E --> H[Answer]
C -->|Metadata Route| F[Parameter Extractor]
F --> G[Metadata Router Code]
G --> I[Metadata LLM]
I --> H
Current iterative retrieval workflow (config/RMAP Chatbot Iterative Retrieval.yml):
flowchart LR
A[Start] --> B[Query Rewriter]
B --> C[JSON Metadata Extractor]
C --> F{Paper List Empty?}
F -->|Yes| G[Knowledge Retrieval]
G --> H[Knowledge LLM]
H --> Z[Answer]
F -->|No| I[Paper Iterator]
I --> J[Question Classifier]
J -->|Count/List| K[Metadata Code]
J -->|Content| L[KR with filter]
L --> M[Paper Map LLM]
K --> N[Iteration Aggregator]
M --> N
N --> O[Map/Reduce LLM]
O --> Z
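
Read as plain control flow, the iterative retrieval graph above amounts to a fallback branch plus a map step and a reduce step. The sketch below is only an illustration of that shape: every callable is a hypothetical stand-in for a workflow node, and nothing here calls Dify.

```python
from collections.abc import Callable


def run_iterative_retrieval(
    papers: list[str],
    question: str,
    is_count_or_list: Callable[[str], bool],
    metadata_step: Callable[[str, str], str],
    content_step: Callable[[str, str], str],
    reduce_step: Callable[[str, list[str]], str],
    knowledge_step: Callable[[str], str],
) -> str:
    """Mirror the flowchart: fallback when no papers, otherwise map then reduce."""
    # "Paper List Empty?" -> Yes: plain knowledge retrieval.
    if not papers:
        return knowledge_step(question)
    partials = []
    for paper in papers:
        # The classifier sits inside the iterator, so it runs per paper.
        step = metadata_step if is_count_or_list(question) else content_step
        partials.append(step(paper, question))
    # Iteration Aggregator + Map/Reduce LLM: merge per-paper results.
    return reduce_step(question, partials)
```

For example, passing stub lambdas for each step makes the branch taken for a given question easy to inspect without a running Dify instance.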
- Python 3.11+
- Poetry
poetry install
poetry run dify-upload --help

Optional local environment file (for import/debug scripts):

source .secrets/dify_console_session.env

Use different keys for different endpoint families:
- `DIFY_APP_API_KEY` (prefix `app-`): app runtime endpoints under `/v1` (for example `/v1/chat-messages`, `/v1/meta`).
- `DIFY_DATASET_API_KEY` (prefix `dataset-`): dataset upload/metadata endpoints under `/v1/datasets/...` used by `dify-upload`.
- `DIFY_CONSOLE_API_KEY`: console-management endpoints under `/console/api/...` (workflow import, draft run).
- Cookie fallback (`DIFY_CONSOLE_COOKIE` + `DIFY_CSRF_TOKEN`): only for deployments where console API keys are not accepted.
Notes:
- The uploader currently supports `DIFY_API_KEY` as a backward-compatible alias for `DIFY_DATASET_API_KEY`.
- In this deployment, app keys are valid for `/v1` but not for `/console/api`.
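
The key-per-endpoint-family rule above can be expressed as a small lookup. This is a sketch under assumptions: `select_credential` is a hypothetical helper, not part of the project, and the cookie fallback is deliberately left out.

```python
import os
from collections.abc import Mapping


def select_credential(path: str, env: Mapping[str, str] = os.environ) -> tuple[str, str]:
    """Return (variable_name, value) for the endpoint family that `path` belongs to."""
    if path.startswith("/console/api"):
        names = ("DIFY_CONSOLE_API_KEY",)
    elif path.startswith("/v1/datasets"):
        # DIFY_API_KEY is the backward-compatible alias mentioned in the notes.
        names = ("DIFY_DATASET_API_KEY", "DIFY_API_KEY")
    elif path.startswith("/v1"):
        names = ("DIFY_APP_API_KEY",)
    else:
        raise ValueError(f"unknown endpoint family: {path}")
    for name in names:
        value = env.get(name)
        if value:
            return name, value
    raise KeyError(f"none of {names} is set")
```

The `/v1/datasets` check comes before the generic `/v1` check so that dataset calls never pick up an app key.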
Preferred mode (console API key):
DIFY_BASE_URL="http://your-dify-host" \
DIFY_CONSOLE_API_KEY="<console_api_key>" \
AUTO_CONFIRM=true \
scripts/import_dify_dsl.sh "config/RMAP Chatbot Meta Routing.yml" --app-id "<app_id>"

Cookie fallback (for deployments without console API key support):
DIFY_BASE_URL="http://your-dify-host" \
DIFY_CONSOLE_COOKIE="..." \
DIFY_CSRF_TOKEN="..." \
AUTO_CONFIRM=true \
scripts/import_dify_dsl.sh "config/RMAP Chatbot Meta Routing.yml" --app-id "<app_id>" --allow-cookie-auth

Debug routing on a draft run (cookie auth):

DIFY_BASE_URL="http://your-dify-host" \
DIFY_CONSOLE_COOKIE="..." \
DIFY_CSRF_TOKEN="..." \
scripts/debug_route_draft.sh \
--app-id "<app_id>" \
--allow-cookie-auth \
--query "What are the main methods and findings of Sci-ModoM?" \
--query "How many papers has Christoph Dieterich published?" \
--query "Which papers have been (co-) authored by Christoph Dieterich?"

Or with console API key (if supported by your deployment):
DIFY_BASE_URL="http://your-dify-host" \
DIFY_CONSOLE_API_KEY="<console_api_key>" \
scripts/debug_route_draft.sh \
--app-id "<app_id>" \
--query "What are the main methods and findings of Sci-ModoM?"

Expected routes:
- Content questions -> `Knowledge Route`
- Count/list/filter questions -> `Metadata Route`
Import the iterative workflow config:
DIFY_BASE_URL="http://your-dify-host" \
DIFY_CONSOLE_COOKIE="..." \
DIFY_CSRF_TOKEN="..." \
AUTO_CONFIRM=true \
scripts/import_dify_dsl.sh "config/RMAP Chatbot Iterative Retrieval.yml" --app-id "<app_id>" --allow-cookie-auth

Validate a two-turn handover (author list -> summarize previous papers):
DIFY_BASE_URL="http://your-dify-host" \
DIFY_CONSOLE_COOKIE="..." \
DIFY_CSRF_TOKEN="..." \
scripts/debug_route_draft.sh \
--app-id "<app_id>" \
--allow-cookie-auth \
--classifier-node-id "17786780005730" \
--query "What papers did Christoph Dieterich author?" \
--query "Can you summarize these papers?"

Notes:
- `scripts/debug_route_draft.sh` now reuses `conversation_id` automatically across multiple `--query` values in one run.
- You can also force a known conversation with `--conversation-id "<uuid>"`.
- Iterative workflow now uses minimal memory handover: concrete paper entities from list queries are persisted in `conversation.memory` for follow-up summarize turns.
- Memory fallback is only applied for follow-up style prompts (for example: "these papers", "Fasse mir diese Papiere zusammen") to avoid contaminating unrelated turns.
- Hardened follow-up intents for map/reduce now include references like:
  - compare/contrast: "Compare these papers by methods and key findings"
  - ranking/subset: "Which one is the newest?", "top 3", "first two papers"
  - cross-lingual references: "diese Paper", "vergleiche diese", "welches davon"
- Milestone 2026-05-21 (Map/Reduce follow-up hardening):
  - Resolve Paper List now performs deterministic subset selection from `conversation.memory` for `first/top/newest/oldest` follow-ups.
  - Restored missing iteration Code node (`17786780698570`) to keep graph edges consistent and avoid draft runtime `MISSING_NODE` errors.
  - Verified that the two-turn flow `Which paper have been published by Christoph Dieterich` -> `Please summarize the first two papers.` returns successfully (HTTP 200) in a draft run.
- Milestone 2026-05-22 (Boss demo stabilization):
  - Split overloaded paper resolution logic into staged nodes (extractor parser, follow-up intent gate, memory subset selector, slim resolver merge) to make two-turn behavior explainable and robust.
  - Added final answer sanitization node (`1778800001013`) and routed the Answer node through `cleaned_text` to strip `<think>...</think>` leakage reliably.
  - Hardened reduce prompt with authoritative requested identities/order from resolver output to keep section 1/2 mapped to the requested first two papers.
  - Fixed YAML import fragility in single-quoted prompt blocks (apostrophe handling) and re-validated import success (HTTP 200).
  - Enforced deterministic fixed-subset formatting (`**1.` / `**2.` headers) even under sparse text context, so demo checks remain stable.
  - Re-tested two-turn draft flow (`author list` -> `summarize first two`) with HTTP 200 on both turns, no `<think>` in final answer, and both section markers present.
- Milestone 2026-05-29 (structured-output and sanitizer hardening):
  - JSON Metadata Extractor switched to structured output as primary machine-readable contract.
  - Parser node updated to accept both `extractor_text` and `extractor_structured_output`, with structured output taking precedence.
  - Extractor max token budget increased to reduce truncation risk in larger author-paper lists.
  - Metadata LLM max token budget increased (`768 -> 1200`) to improve long-answer completeness.
  - Final sanitizer hardened for both closed and unclosed `<think>` segments.
  - Validated two-turn runs with HTTP 200 on both turns and no `<think>` content in final answer.
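
The follow-up gate and the deterministic subset selection described in the notes above could look roughly like this. It is a minimal sketch, not the code of the Resolve Paper List node: the marker phrases are taken from the example prompts in the notes, while the memory item shape (`title`, `year`) and both function names are assumptions made for illustration.

```python
import re

# Follow-up markers drawn from the example prompts in the notes (not exhaustive).
FOLLOW_UP = re.compile(
    r"\b(?:these papers|diese paper|diese papiere|vergleiche diese|welches davon|"
    r"which one|top \d+|first (?:\d+|one|two|three|four|five)(?: papers?)?)\b",
    re.IGNORECASE,
)
COUNT = re.compile(r"\b(?:first|top)\s+(\d+|one|two|three|four|five)\b", re.IGNORECASE)
WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def is_follow_up(query: str) -> bool:
    """Gate: only follow-up style prompts may reuse conversation memory."""
    return FOLLOW_UP.search(query) is not None


def select_subset(query: str, memory: list[dict]) -> list[dict]:
    """Pick papers from memory for first/top/newest/oldest follow-ups."""
    if not is_follow_up(query):
        return []  # unrelated turn: do not contaminate it with old papers
    lowered = query.lower()
    if "newest" in lowered:
        return [max(memory, key=lambda p: p["year"])] if memory else []
    if "oldest" in lowered:
        return [min(memory, key=lambda p: p["year"])] if memory else []
    match = COUNT.search(lowered)
    if match:
        token = match.group(1)
        count = int(token) if token.isdigit() else WORD_NUMBERS[token]
        return memory[:count]  # preserves the order of the earlier list answer
    return list(memory)
```

Keeping the selection rule-based rather than LLM-driven is what makes the chosen subset reproducible across runs, which matches the "deterministic" wording in the milestone.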
Use the CLI entrypoint:
poetry run dify-upload

If needed, provide dataset credentials via environment variables:
export DIFY_DATASET_API_KEY="dataset-..."
export DIFY_API_URL="http://your-dify-host/v1"
export DATASET_ID="<dataset_id>"

Common commands:
# Run default two-pass workflow
poetry run dify-upload default
# Run two-pass upload on one file
poetry run dify-upload two-pass --file "RMaP papers first funding period/your-file.pdf"
# Run diagnostics
poetry run dify-upload abc-test --file "RMaP papers first funding period/your-file.pdf"
# Preview extracted metadata
poetry run dify-upload metadata --file "RMaP papers first funding period/your-file.pdf"
# Process selected authors only
poetry run dify-upload selected-authors --author "Mark Helm" --author "Christoph Dieterich"
# Bulk processing
poetry run dify-upload bulk-two-pass --folder "RMaP papers first funding period"
# Quality report for extracted authors
poetry run dify-upload author-quality --folder "RMaP papers first funding period"

SLURM/GPU execution (recommended for full-folder runs with LLM fallback):
sbatch scripts/slurm_bulk_two_pass_ollama.sh

The SLURM script explicitly uses the Python module path to avoid wrapper ambiguity:
.venv/bin/python -m dify_uploader bulk-two-pass --folder "RMaP papers first funding period"

Hybrid extraction behavior in `dify_uploader/author_extraction.py`:
- Fast regex and heuristics.
- Optional LLM fallback via BAML for low-confidence cases.
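
The two-stage behavior above (fast heuristics first, LLM only when confidence is low) can be illustrated as follows. This is a sketch under assumptions and does not reproduce `dify_uploader/author_extraction.py`: the name pattern, the confidence score, the `0.8` threshold, and the `llm_fallback` callable (standing in for the BAML call) are all hypothetical.

```python
import re
from collections.abc import Callable
from typing import Optional

# Hypothetical heuristic: "Firstname [M.] Lastname" entries in a byline.
NAME = re.compile(r"[A-Z][a-z]+(?:\s[A-Z]\.)?\s[A-Z][a-z]+")


def extract_authors_fast(byline: str) -> tuple[list[str], float]:
    """Return (authors, confidence) using regex matching only."""
    parts = [p.strip() for p in re.split(r",|\band\b", byline) if p.strip()]
    authors = [p for p in parts if NAME.fullmatch(p)]
    # Confidence: share of byline entries that look like a full name.
    confidence = len(authors) / len(parts) if parts else 0.0
    return authors, confidence


def extract_authors(
    byline: str,
    llm_fallback: Optional[Callable[[str], list[str]]] = None,
    threshold: float = 0.8,
) -> list[str]:
    """Use the fast path when confident; otherwise defer to the fallback."""
    authors, confidence = extract_authors_fast(byline)
    if confidence >= threshold or llm_fallback is None:
        return authors
    return llm_fallback(byline)
```

With `llm_fallback=None` the function degrades to the regex result, which is one way a switch such as `AUTHOR_EXTRACTION_ENABLE_LLM_FALLBACK` could be honored by the caller.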
BAML runtime example:
export BAML_OLLAMA_BASE_URL="http://127.0.0.1:11434/v1"
export BAML_OLLAMA_MODEL="qwen3:32b"
export AUTHOR_EXTRACTION_ENABLE_LLM_FALLBACK="true"

Next steps:

- Finish and verify the full 2-pass run for all PDFs in `RMaP papers first funding period` and capture success/failure counts in a run log.
- Add a concise post-run summary artifact (processed files, retries, failures, elapsed time) under `reports/slurm/`.
- Introduce a lightweight regression check for the two-turn path (`list papers` -> `summarize those papers`) after each workflow import.
- Add a small acceptance checklist for stakeholder demos (HTTP status, sanitizer check, section mapping check).
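
One possible shape for the planned regression and acceptance checks is sketched below. It only inspects results that have already been collected, so it makes no assumption about how the draft run is invoked. The `TurnResult` type and `check_two_turn_run` function are hypothetical names, and the `**1.` / `**2.` markers follow the fixed-subset format mentioned in the milestones.

```python
import re
from dataclasses import dataclass


@dataclass
class TurnResult:
    status_code: int  # HTTP status of the draft run for this turn
    answer: str       # final answer text returned for this turn


def check_two_turn_run(turns: list[TurnResult], expected_sections: int = 2) -> list[str]:
    """Return the list of failed checks; an empty list means the run passed."""
    failures = []
    for index, turn in enumerate(turns, start=1):
        # HTTP status check.
        if turn.status_code != 200:
            failures.append(f"turn {index}: HTTP {turn.status_code}")
        # Sanitizer check: no opening or closing think tag may survive.
        if re.search(r"</?think\b", turn.answer, re.IGNORECASE):
            failures.append(f"turn {index}: <think> leaked into answer")
    if turns:
        # Section mapping check on the final (summarize) turn.
        final = turns[-1].answer
        for number in range(1, expected_sections + 1):
            if f"**{number}." not in final:
                failures.append(f"final answer: missing section marker **{number}.")
    return failures
```

Returning a list of failures instead of raising on the first one gives a demo checklist that reports every broken item in a single pass.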