RmapDifyChatbot

RmapDifyChatbot is a production-oriented Python project for operating a Dify-based academic assistant with explicit metadata routing.

Status Snapshot (2026-05-29)

Iterative map/reduce workflow is stable for core two-turn handover: paper list -> summarize selected papers.
Structured-output extractor handoff is now robust (parser prefers extractor JSON over raw free text).
Final answer sanitation strips leaked <think> blocks, including malformed or unclosed variants.
Two-pass bulk upload is integrated for local and SLURM execution paths via Python module invocation.

Overview

The project has two responsibilities:

Main use-case: deploy and operate a metadata-aware Dify chatbot workflow.
Secondary service: extract metadata from papers and upload documents into Dify datasets.

Current routing workflow (config/RMAP Chatbot Meta Routing.yml):

flowchart LR
		A[Start] --> B[Query Rewriter]
		B --> C{Question Classifier}

		C -->|Knowledge Route| D[Knowledge Retrieval]
		D --> E[Knowledge LLM]
		E --> H[Answer]

		C -->|Metadata Route| F[Parameter Extractor]
		F --> G[Metadata Router Code]
		G --> I[Metadata LLM]
		I --> H

Current iterative retrieval workflow (config/RMAP Chatbot Iterative Retrieval.yml):

flowchart LR
		A[Start] --> B[Query Rewriter]
		B --> C[JSON Metadata Extractor]
		C --> F{Paper List Empty?}

		F -->|Yes| G[Knowledge Retrieval]
		G --> H[Knowledge LLM]
		H --> Z[Answer]

		F -->|No| I[Paper Iterator]
		I --> J[Question Classifier]
		J -->|Count/List| K[Metadata Code]
		J -->|Content| L[KR with filter]
		L --> M[Paper Map LLM]
		K --> N[Iteration Aggregator]
		M --> N
		N --> O[Map/Reduce LLM]
		O --> Z

Installation

Requirements

Python 3.11+
Poetry

Setup

poetry install
poetry run dify-upload --help

Optional local environment file (for import/debug scripts):

source .secrets/dify_console_session.env

API credential map

Use different keys for different endpoint families:

DIFY_APP_API_KEY (prefix app-): app runtime endpoints under /v1 (for example /v1/chat-messages, /v1/meta).
DIFY_DATASET_API_KEY (prefix dataset-): dataset upload/metadata endpoints under /v1/datasets/... used by dify-upload.
DIFY_CONSOLE_API_KEY: console-management endpoints under /console/api/... (workflow import, draft run).
Cookie fallback (DIFY_CONSOLE_COOKIE + DIFY_CSRF_TOKEN): only for deployments where console API keys are not accepted.

Notes:

The uploader currently supports DIFY_API_KEY as a backward-compatible alias for DIFY_DATASET_API_KEY.
In this deployment, app keys are valid for /v1 but not for /console/api.

Main Use-Case: Set Up The Meta Routing Chatbot

1. Import workflow DSL into Dify

Preferred mode (console API key):

DIFY_BASE_URL="http://your-dify-host" \
DIFY_CONSOLE_API_KEY="<console_api_key>" \
AUTO_CONFIRM=true \
scripts/import_dify_dsl.sh "config/RMAP Chatbot Meta Routing.yml" --app-id "<app_id>"

Cookie fallback (for deployments without console API key support):

DIFY_BASE_URL="http://your-dify-host" \
DIFY_CONSOLE_COOKIE="..." \
DIFY_CSRF_TOKEN="..." \
AUTO_CONFIRM=true \
scripts/import_dify_dsl.sh "config/RMAP Chatbot Meta Routing.yml" --app-id "<app_id>" --allow-cookie-auth

2. Validate routing behavior

DIFY_BASE_URL="http://your-dify-host" \
DIFY_CONSOLE_COOKIE="..." \
DIFY_CSRF_TOKEN="..." \
scripts/debug_route_draft.sh \
	--app-id "<app_id>" \
	--allow-cookie-auth \
	--query "What are the main methods and findings of Sci-ModoM?" \
	--query "How many papers has Christoph Dieterich published?" \
	--query "Which papers have been (co-) authored by Christoph Dieterich?"

Or with console API key (if supported by your deployment):

DIFY_BASE_URL="http://your-dify-host" \
DIFY_CONSOLE_API_KEY="<console_api_key>" \
scripts/debug_route_draft.sh \
	--app-id "<app_id>" \
	--query "What are the main methods and findings of Sci-ModoM?"

Expected routes:

Content questions -> Knowledge Route
Count/list/filter questions -> Metadata Route

Main Use-Case: Iterative Map/Reduce Routing

Import the iterative workflow config:

DIFY_BASE_URL="http://your-dify-host" \
DIFY_CONSOLE_COOKIE="..." \
DIFY_CSRF_TOKEN="..." \
AUTO_CONFIRM=true \
scripts/import_dify_dsl.sh "config/RMAP Chatbot Iterative Retrieval.yml" --app-id "<app_id>" --allow-cookie-auth

Validate a two-turn handover (author list -> summarize previous papers):

DIFY_BASE_URL="http://your-dify-host" \
DIFY_CONSOLE_COOKIE="..." \
DIFY_CSRF_TOKEN="..." \
scripts/debug_route_draft.sh \
	--app-id "<app_id>" \
	--allow-cookie-auth \
	--classifier-node-id "17786780005730" \
	--query "What papers did Christoph Dieterich author?" \
	--query "Can you summarize these papers?"

Notes:

scripts/debug_route_draft.sh now reuses conversation_id automatically across multiple --query values in one run.
You can also force a known conversation with --conversation-id "<uuid>".
Iterative workflow now uses minimal memory handover: concrete paper entities from list queries are persisted in conversation.memory for follow-up summarize turns.
Memory fallback is only applied for follow-up style prompts (for example: "these papers", "Fasse mir diese Papiere zusammen") to avoid contaminating unrelated turns.
Hardened follow-up intents for map/reduce now include references like:
- compare/contrast: "Compare these papers by methods and key findings"
- ranking/subset: "Which one is the newest?", "top 3", "first two papers"
- cross-lingual references: "diese Paper", "vergleiche diese", "welches davon"
Milestone 2026-05-21 (Map/Reduce follow-up hardening):
- Resolve Paper List now performs deterministic subset selection from conversation.memory for first/top/newest/oldest follow-ups.
- Restored missing iteration Code node (17786780698570) to keep graph edges consistent and avoid draft runtime MISSING_NODE errors.
- Verified two-turn flow for Which paper have been published by Christoph Dieterich -> Please summarize the first two papers. returns successfully (HTTP 200) in draft run.
Milestone 2026-05-22 (Boss demo stabilization):
- Split overloaded paper resolution logic into staged nodes (extractor parser, follow-up intent gate, memory subset selector, slim resolver merge) to make two-turn behavior explainable and robust.
- Added final answer sanitization node (1778800001013) and routed the Answer node through cleaned_text to strip <think>...</think> leakage reliably.
- Hardened reduce prompt with authoritative requested identities/order from resolver output to keep section 1/2 mapped to the requested first two papers.
- Fixed YAML import fragility in single-quoted prompt blocks (apostrophe handling) and re-validated import success (HTTP 200).
- Enforced deterministic fixed-subset formatting (**1. / **2. headers) even under sparse text context, so demo checks remain stable.
- Re-tested two-turn draft flow (author list -> summarize first two) with HTTP 200 on both turns, no <think> in final answer, and both section markers present.
Milestone 2026-05-29 (structured-output and sanitizer hardening):
- JSON Metadata Extractor switched to structured output as primary machine-readable contract.
- Parser node updated to accept both extractor_text and extractor_structured_output, with structured output taking precedence.
- Extractor max token budget increased to reduce truncation risk in larger author-paper lists.
- Metadata LLM max token budget increased (768 -> 1200) to improve long-answer completeness.
- Final sanitizer hardened for both closed and unclosed <think> segments.
- Validated two-turn runs with HTTP 200 on both turns and no <think> content in final answer.

Secondary Service: Metadata Extraction And Paper Upload

Use the CLI entrypoint:

poetry run dify-upload

If needed, provide dataset credentials via environment variables:

export DIFY_DATASET_API_KEY="dataset-..."
export DIFY_API_URL="http://your-dify-host/v1"
export DATASET_ID="<dataset_id>"

Common commands:

# Run default two-pass workflow
poetry run dify-upload default

# Run two-pass upload on one file
poetry run dify-upload two-pass --file "RMaP papers first funding period/your-file.pdf"

# Run diagnostics
poetry run dify-upload abc-test --file "RMaP papers first funding period/your-file.pdf"

# Preview extracted metadata
poetry run dify-upload metadata --file "RMaP papers first funding period/your-file.pdf"

# Process selected authors only
poetry run dify-upload selected-authors --author "Mark Helm" --author "Christoph Dieterich"

# Bulk processing
poetry run dify-upload bulk-two-pass --folder "RMaP papers first funding period"

# Quality report for extracted authors
poetry run dify-upload author-quality --folder "RMaP papers first funding period"

SLURM/GPU execution (recommended for full-folder runs with LLM fallback):

sbatch scripts/slurm_bulk_two_pass_ollama.sh

The SLURM script explicitly uses the Python module path to avoid wrapper ambiguity:

.venv/bin/python -m dify_uploader bulk-two-pass --folder "RMaP papers first funding period"

Hybrid extraction behavior in dify_uploader/author_extraction.py:

Fast regex and heuristics.
Optional LLM fallback via BAML for low-confidence cases.

BAML runtime example:

export BAML_OLLAMA_BASE_URL="http://127.0.0.1:11434/v1"
export BAML_OLLAMA_MODEL="qwen3:32b"
export AUTHOR_EXTRACTION_ENABLE_LLM_FALLBACK="true"

Next Steps (Execution Plan)

Finish and verify the full 2-pass run for all PDFs in RMaP papers first funding period and capture success/failure counts in a run log.
Add a concise post-run summary artifact (processed files, retries, failures, elapsed time) under reports/slurm/.
Introduce a lightweight regression check for the two-turn path (list papers -> summarize those papers) after each workflow import.
Add a small acceptance checklist for stakeholder demos (HTTP status, sanitizer check, section mapping check).

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
config		config
dify_uploader		dify_uploader
scripts		scripts
.gitignore		.gitignore
README.md		README.md
poetry.lock		poetry.lock
poetry.toml		poetry.toml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RmapDifyChatbot

Status Snapshot (2026-05-29)

Overview

Installation

Requirements

Setup

API credential map

Main Use-Case: Set Up The Meta Routing Chatbot

1. Import workflow DSL into Dify

2. Validate routing behavior

Main Use-Case: Iterative Map/Reduce Routing

Secondary Service: Metadata Extraction And Paper Upload

Next Steps (Execution Plan)

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

RmapDifyChatbot

Status Snapshot (2026-05-29)

Overview

Installation

Requirements

Setup

API credential map

Main Use-Case: Set Up The Meta Routing Chatbot

1. Import workflow DSL into Dify

2. Validate routing behavior

Main Use-Case: Iterative Map/Reduce Routing

Secondary Service: Metadata Extraction And Paper Upload

Next Steps (Execution Plan)

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages