feat: add generic docstring extraction tools (Phase 2) by neuromechanist · Pull Request #115 · OpenScience-Collective/osa

neuromechanist · 2026-01-27T17:44:57Z

Summary

Implements Phase 2 of the EEGLab community integration (#100, part of epic #97).

Adds generic, reusable tools to extract and index MATLAB/Python docstrings from any community's codebase. These tools are designed to work for ANY community (not EEGLab-specific).

Key Components

Parsers

Python Parser (src/knowledge/python_parser.py): AST-based extraction of docstrings from functions, classes, methods, and modules
MATLAB Parser (src/knowledge/matlab_parser.py): Regex-based extraction of function/script documentation

Infrastructure

Database Schema: Added docstrings table with FTS5 search support
Sync System (src/knowledge/docstring_sync.py): Fetches files from GitHub API and indexes docstrings
Search (src/knowledge/search.py): FTS5-powered search with GitHub source links
CLI (src/cli/sync.py): osa sync docstrings command with language filters
Tools (src/tools/knowledge.py): LangChain tool factory for community integration

Testing

28 new tests (9 Python parser + 12 MATLAB parser + 7 integration)
All tests passing (78 total)
Real-world validation: Successfully synced 720 docstrings from sccn/eeglab

Usage

# Initialize database
osa sync init --community eeglab

# Sync MATLAB docstrings
osa sync docstrings --community eeglab --language matlab --repo sccn/eeglab

# Sync Python docstrings
osa sync docstrings --community eeglab --language python --repo sccn/liblsl

# Search docstrings
osa sync search "pop_loadset" --community eeglab

Closes

Closes #100

Fixes all critical and important issues from PR review #107: **Critical Fixes:** 1. Fix message state corruption on streaming errors - Track message indices explicitly to avoid race conditions - Prevent wrong messages from being removed on error - Properly handle streaming vs non-streaming error paths 2. Improve saveHistory() error handling - Distinguish error types (QuotaExceededError, SecurityError) - Provide actionable user feedback for each error type - Re-throw errors so callers know save failed - Log errors with structured context 3. Fix backend exception masking - Add specific handlers for ValueError (input errors) - Preserve HTTPException for proper HTTP error codes - Add error IDs to all backend errors for support tracking - Include structured logging context **Important Fixes:** 4. Add fetch timeout to streaming requests - 120 second timeout for connection + streaming - Prevents indefinite hangs on connection failures 5. Fix worker error detail loss - Check response.ok before processing - Pass through backend HTTP error codes (400, 403, 429, etc.) - Extract and forward backend error messages - Only categorize actual network/proxy errors 6. Improve reader resource leak handling - Check if reader exists before releasing - Log cleanup failures (indicates serious issues) - Don't silently swallow releaseError **Impact:** - Prevents data loss from message corruption - Better user feedback on storage failures - Proper error categorization for debugging - Support can track errors with error IDs - No more indefinite hangs - Users see actual backend errors (not generic 500) All syntax checks pass. Ready for production.

* fix: preserve prompt caching when tools are bound to models - Update CachingLLMWrapper.bind_tools() to wrap tool-bound models - Add stream() and astream() methods with caching support - Allow wrapping Runnable types (e.g., RunnableBinding) not just BaseChatModel - Add comprehensive test suite for caching functionality Fixes #104 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * feat: add comprehensive error handling and validation to CachingLLMWrapper - Add logging to all wrapper methods for debugging - Add validation to __init__ to prevent double-wrapping - Add error handling to bind_tools with tool validation - Add error handling to _add_cache_control with message validation - Add error handling to stream/astream with input validation - Add unit tests for all error conditions - Add async tests for ainvoke and astream methods - Coverage improved from 41% to 60% for litellm_llm.py * fix: replace unsafe eval with safe AST-based calculator in tests - Replace eval() with ast.parse() and safe operator mapping - Only allow basic arithmetic operations (+, -, *, /, %, **) - Prevents code injection in test calculator tool - Maintains full test functionality * fix: eliminate silent failures and improve error handling Critical fixes addressing PR review: - Replace silent message skipping with hard errors (ValueError) - Replace silent None content fallback with fail-fast validation - Replace overly broad exception catching with specific exception types - Add type annotations to stream/astream methods (Iterator/AsyncIterator) - Add comprehensive error logging to invoke/ainvoke/_generate/_agenerate - Improve hasattr checks with callable() validation - Add actionable guidance to NotImplementedError messages - Change non-list input logging from DEBUG to WARNING (cost impact) All tests still pass. Coverage remains 36% (error paths require real LLM tests). * docs: improve docstrings for clarity and accuracy Addressing documentation issues from PR review: - Expand class docstring to explain nested wrapper chain - Clarify bind_tools() two-step process (delegate then wrap) - Update _add_cache_control() to document strict fail-fast validation - Clarify is_cacheable_model() uses permissive heuristic - Add clarifying comment to CACHEABLE_MODELS constant All docstrings now accurately reflect implementation behavior. * docs: document two-tier testing approach for NO MOCKS policy Add comprehensive module docstring explaining: - Why FakeListChatModel is used for unit tests (wrapper mechanics) - Why real API calls are used for integration tests (LLM behavior) - Clear separation between testing wrapper logic vs LLM responses This addresses the code reviewer's recommendation to document the exception to the NO MOCKS policy for wrapper mechanics testing. --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

Add MATLAB and Python docstring parsers, sync system, and search tools for indexing code documentation from any community repository. Key components: - MATLAB parser: regex-based extraction of function/script comments - Python parser: AST-based extraction of docstrings - Sync system: fetch files from GitHub and index docstrings - Search: FTS5-powered docstring search with GitHub links - CLI: 'osa sync docstrings' command with language filters - Tools: LangChain tool factory for community integration Tests: 28 new tests (parsers + integration) Verified: 720 docstrings synced from sccn/eeglab Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Critical fixes: - Add GitHub auth headers to avoid rate limiting - Improve error handling with specific exception types - Track and report failed files to users - Make Python parser raise SyntaxError properly Important fixes: - Optimize method detection using parent map (O(n)) - Update FTS5 docs to clarify phrase-only search - Document branch hardcoding limitation Tests: - Update test to expect SyntaxError - Remove mocked error tests (violates NO MOCKS rule) - All 71 tests passing

neuromechanist and others added 5 commits January 26, 2026 22:17

chore: bump version to 0.5.3.dev0

48e4391

neuromechanist changed the base branch from develop to epic/issue-97-eeglab January 27, 2026 18:35

neuromechanist merged commit 2cb807b into epic/issue-97-eeglab Jan 27, 2026
5 checks passed

neuromechanist linked an issue Jan 27, 2026 that may be closed by this pull request

Phase 2: Docstring Extraction Tools #100

Closed

19 tasks

This was referenced Jan 27, 2026

Phase 2: Docstring Extraction Tools #100

Closed

feat: EEGLAB Community Assistant (Epic #97) #124

Merged

neuromechanist deleted the feature/issue-100-phase2-docstring-extraction branch January 28, 2026 15:18

neuromechanist mentioned this pull request Jan 28, 2026

feat: EEGLAB community assistant - full epic release #125

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add generic docstring extraction tools (Phase 2)#115

feat: add generic docstring extraction tools (Phase 2)#115
neuromechanist merged 5 commits into
epic/issue-97-eeglabfrom
feature/issue-100-phase2-docstring-extraction

neuromechanist commented Jan 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

neuromechanist commented Jan 27, 2026

Summary

Key Components

Parsers

Infrastructure

Testing

Usage

Closes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant