@MathisVerstrepen MathisVerstrepen commented Jan 5, 2026

Short Summary

This PR removes the dependency on the external arxivmd.org service for processing arXiv URLs. Instead, it implements local PDF-to-Markdown conversion using the arxiv2text library. This change aims to improve reliability by processing papers directly within the application infrastructure rather than relying on a third-party formatting service.

Technical Changes

  • Dependencies:
    • Added arxiv2text for converting arXiv PDFs to Markdown.
    • Added requests (likely a dependency for arxiv2text or general fetching).
  • File: api/app/services/web/web_extract.py:
    • Refactor: Converted _preprocess_url from a synchronous function to an async function.
    • Logic Update: The function now returns a tuple[str, bool] instead of a plain string. The boolean indicates whether the returned string is raw content (True) or a URL still to be fetched (False).
    • New Feature: Implemented local arXiv processing:
      • Extracts the paper ID and constructs the PDF URL.
      • Uses tempfile.TemporaryDirectory to manage temporary files during conversion.
      • Executes the blocking arxiv_to_md function in a separate thread using loop.run_in_executor to prevent blocking the main event loop.
    • Integration: Updated url_to_markdown to await _preprocess_url and immediately return content if the new flag indicates direct content generation.
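
The flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code merged in this PR: the names preprocess_url, ARXIV_RE, convert and fetch are hypothetical, the ID-matching pattern is a guess at how the paper ID might be extracted, and the converter is passed in as a plain blocking callable taking (pdf_url, output_dir) so the sketch does not depend on the exact arxiv_to_md signature.

```python
import asyncio
import re
import tempfile
from pathlib import Path
from typing import Awaitable, Callable

# Hypothetical pattern: the PR does not show how the paper ID is extracted.
ARXIV_RE = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})")


async def preprocess_url(
    url: str,
    convert: Callable[[str, str], None],  # assumed blocking converter: (pdf_url, output_dir)
) -> tuple[str, bool]:
    """Return (content, True) for arXiv links, or (url, False) for anything else."""
    match = ARXIV_RE.search(url)
    if match is None:
        return url, False  # not an arXiv link: the caller still has to fetch it

    pdf_url = f"https://arxiv.org/pdf/{match.group(1)}"
    loop = asyncio.get_running_loop()

    # The directory and everything in it is deleted when the with-block exits.
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Run the blocking conversion in the default thread pool so the
        # event loop stays responsive while the PDF is processed.
        await loop.run_in_executor(None, convert, pdf_url, tmp_dir)
        md_files = sorted(Path(tmp_dir).glob("*.md"))
        if not md_files:
            return url, False  # conversion produced nothing: fall back to fetching
        return md_files[0].read_text(encoding="utf-8"), True


async def url_to_markdown(
    url: str,
    convert: Callable[[str, str], None],
    fetch: Callable[[str], Awaitable[str]],  # hypothetical stand-in for the normal fetch path
) -> str:
    target, is_raw = await preprocess_url(url, convert)
    if is_raw:
        return target  # already Markdown, skip the fetch entirely
    return await fetch(target)
```

Passing the converter and fetcher as parameters is only a convenience for this sketch: it keeps the example runnable without network access and makes the (str, bool) contract easy to exercise with stubs.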

Impact and Scope

  • Functional Impact: Users submitting arXiv links will now have the content extracted directly by the Meridian backend rather than via a proxy service. The extraction logic is now self-contained.
  • System Reach:
    • Performance: Shifts the computational burden of PDF conversion to the local application server.
    • Reliability: Eliminates failure points associated with arxivmd.org downtime.
  • Breaking Changes: None externally. The internal signature of _preprocess_url changed, and its call sites within the module were updated accordingly.
  • Security/Resources: Uses temporary directories for file handling, ensuring cleanup after processing.
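
The cleanup guarantee mentioned under Security/Resources can be checked with a small standalone sketch. It is illustrative only and not taken from the PR: convert_in_temp_dir is a hypothetical helper, and the failure is simulated with a plain RuntimeError.

```python
import tempfile
from pathlib import Path


def convert_in_temp_dir(fail: bool) -> Path:
    """Write a file into a scratch directory, optionally fail, and return the directory path."""
    scratch = None
    try:
        with tempfile.TemporaryDirectory() as tmp_dir:
            scratch = Path(tmp_dir)
            (scratch / "paper.md").write_text("draft", encoding="utf-8")
            if fail:
                raise RuntimeError("simulated conversion failure")
    except RuntimeError:
        pass  # the with-block has already removed the directory at this point
    return scratch


# In both cases the directory no longer exists once the with-block has exited.
print(convert_in_temp_dir(fail=False).exists())
print(convert_in_temp_dir(fail=True).exists())
```

Because the context manager removes the directory on both normal exit and exceptions, a failed conversion should not leave PDF or Markdown files behind on the server.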

@MathisVerstrepen MathisVerstrepen self-assigned this Jan 5, 2026
@MathisVerstrepen MathisVerstrepen added the bug label ("Something isn't working as expected, or there's an error in existing functionality.") Jan 5, 2026
@MathisVerstrepen MathisVerstrepen changed the title from "Hotfix" to "Fix : Replace external arXivmd service with local arXiv-to-Markdown processing" Jan 5, 2026
@MathisVerstrepen MathisVerstrepen merged commit 491ed75 into main Jan 5, 2026
2 checks passed