Fix: Replace external arXivmd service with local arXiv-to-Markdown processing #268
Short Summary
This PR removes the dependency on the external `arxivmd.org` service for processing arXiv URLs. Instead, it implements local PDF-to-Markdown conversion using the `arxiv2text` library. This change aims to improve reliability by processing papers directly within the application infrastructure rather than relying on a third-party formatting service.

Technical Changes
- Adds `arxiv2text` for converting arXiv PDFs to Markdown.
- Adds `requests` (likely a dependency for `arxiv2text` or general fetching).
- `api/app/services/web/web_extract.py`:
  - Changes `_preprocess_url` from a synchronous function to an `async` function.
  - `_preprocess_url` now returns `tuple[str, bool]` instead of just a string. The boolean indicates if the returned string is raw content (`True`) or a URL to be fetched (`False`).
  - Uses `tempfile.TemporaryDirectory` to manage temporary files during conversion.
  - Runs the `arxiv_to_md` function in a separate thread using `loop.run_in_executor` to prevent blocking the main event loop.
  - Updates `url_to_markdown` to await `_preprocess_url` and immediately return content if the new flag indicates direct content generation.

Impact and Scope
- Removes exposure to `arxivmd.org` downtime.
- Function signatures (`_preprocess_url`) have changed, but usage within the module was updated.
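
The pattern described under Technical Changes can be sketched roughly as follows. This is an illustration, not the code from this PR: `fake_arxiv_to_md` is a hypothetical stand-in for the blocking `arxiv_to_md` call (its real signature is not shown here), the arXiv URL check is a simplified assumption, and the non-arXiv branch of `url_to_markdown` is a placeholder for the normal fetch path.

```python
import asyncio
import tempfile
from pathlib import Path


def fake_arxiv_to_md(url: str, output_dir: str) -> str:
    """Hypothetical blocking converter standing in for arxiv_to_md.

    It writes a Markdown file into output_dir and returns its text.
    """
    out_path = Path(output_dir) / "paper.md"
    out_path.write_text(f"# Converted from {url}\n", encoding="utf-8")
    return out_path.read_text(encoding="utf-8")


async def preprocess_url(url: str) -> tuple[str, bool]:
    """Return (value, is_content).

    is_content is True when value is already Markdown,
    and False when value is a URL that still has to be fetched.
    """
    # Assumption: a simple substring check is enough to spot arXiv links.
    if "arxiv.org" not in url:
        return url, False

    loop = asyncio.get_running_loop()
    # The temporary directory is deleted when the with-block exits.
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Run the blocking conversion in the default thread pool
        # so the event loop stays responsive.
        markdown = await loop.run_in_executor(
            None, fake_arxiv_to_md, url, tmp_dir
        )
    return markdown, True


async def url_to_markdown(url: str) -> str:
    value, is_content = await preprocess_url(url)
    if is_content:
        # Content was generated locally, so skip the fetch step.
        return value
    # Placeholder for the regular fetch-and-extract path.
    return f"(would fetch {value})"


if __name__ == "__main__":
    print(asyncio.run(url_to_markdown("https://arxiv.org/abs/0000.00000")))
```

Returning a flag alongside the string lets the caller tell generated content apart from a URL without inspecting the string itself, at the cost of every caller of `_preprocess_url` having to unpack the tuple.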