Skip to content

feat: add generic docstring extraction tools (Phase 2)#115

Merged
neuromechanist merged 5 commits into
epic/issue-97-eeglabfrom
feature/issue-100-phase2-docstring-extraction
Jan 27, 2026
Merged

feat: add generic docstring extraction tools (Phase 2)#115
neuromechanist merged 5 commits into
epic/issue-97-eeglabfrom
feature/issue-100-phase2-docstring-extraction

Conversation

@neuromechanist

Copy link
Copy Markdown
Member

Summary

Implements Phase 2 of the EEGLab community integration (#100, part of epic #97).

Adds generic, reusable tools to extract and index MATLAB/Python docstrings from any community's codebase. These tools are designed to work for ANY community (not EEGLab-specific).

Key Components

Parsers

  • Python Parser (src/knowledge/python_parser.py): AST-based extraction of docstrings from functions, classes, methods, and modules
  • MATLAB Parser (src/knowledge/matlab_parser.py): Regex-based extraction of function/script documentation

Infrastructure

  • Database Schema: Added docstrings table with FTS5 search support
  • Sync System (src/knowledge/docstring_sync.py): Fetches files from GitHub API and indexes docstrings
  • Search (src/knowledge/search.py): FTS5-powered search with GitHub source links
  • CLI (src/cli/sync.py): osa sync docstrings command with language filters
  • Tools (src/tools/knowledge.py): LangChain tool factory for community integration

Testing

  • 28 new tests (9 Python parser + 12 MATLAB parser + 7 integration)
  • All tests passing (78 total)
  • Real-world validation: Successfully synced 720 docstrings from sccn/eeglab

Usage

# Initialize database
osa sync init --community eeglab

# Sync MATLAB docstrings
osa sync docstrings --community eeglab --language matlab --repo sccn/eeglab

# Sync Python docstrings
osa sync docstrings --community eeglab --language python --repo sccn/liblsl

# Search docstrings
osa sync search "pop_loadset" --community eeglab

Closes

Closes #100

neuromechanist and others added 5 commits January 26, 2026 22:17
Fixes all critical and important issues from PR review #107:

**Critical Fixes:**
1. Fix message state corruption on streaming errors
   - Track message indices explicitly to avoid race conditions
   - Prevent wrong messages from being removed on error
   - Properly handle streaming vs non-streaming error paths

2. Improve saveHistory() error handling
   - Distinguish error types (QuotaExceededError, SecurityError)
   - Provide actionable user feedback for each error type
   - Re-throw errors so callers know save failed
   - Log errors with structured context

3. Fix backend exception masking
   - Add specific handlers for ValueError (input errors)
   - Preserve HTTPException for proper HTTP error codes
   - Add error IDs to all backend errors for support tracking
   - Include structured logging context

**Important Fixes:**
4. Add fetch timeout to streaming requests
   - 120 second timeout for connection + streaming
   - Prevents indefinite hangs on connection failures

5. Fix worker error detail loss
   - Check response.ok before processing
   - Pass through backend HTTP error codes (400, 403, 429, etc.)
   - Extract and forward backend error messages
   - Only categorize actual network/proxy errors

6. Improve reader resource leak handling
   - Check if reader exists before releasing
   - Log cleanup failures (indicates serious issues)
   - Don't silently swallow releaseError

**Impact:**
- Prevents data loss from message corruption
- Better user feedback on storage failures
- Proper error categorization for debugging
- Support can track errors with error IDs
- No more indefinite hangs
- Users see actual backend errors (not generic 500)

All syntax checks pass. Ready for production.
* fix: preserve prompt caching when tools are bound to models

- Update CachingLLMWrapper.bind_tools() to wrap tool-bound models
- Add stream() and astream() methods with caching support
- Allow wrapping Runnable types (e.g., RunnableBinding) not just BaseChatModel
- Add comprehensive test suite for caching functionality

Fixes #104

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* feat: add comprehensive error handling and validation to CachingLLMWrapper

- Add logging to all wrapper methods for debugging
- Add validation to __init__ to prevent double-wrapping
- Add error handling to bind_tools with tool validation
- Add error handling to _add_cache_control with message validation
- Add error handling to stream/astream with input validation
- Add unit tests for all error conditions
- Add async tests for ainvoke and astream methods
- Coverage improved from 41% to 60% for litellm_llm.py

* fix: replace unsafe eval with safe AST-based calculator in tests

- Replace eval() with ast.parse() and safe operator mapping
- Only allow basic arithmetic operations (+, -, *, /, %, **)
- Prevents code injection in test calculator tool
- Maintains full test functionality

* fix: eliminate silent failures and improve error handling

Critical fixes addressing PR review:
- Replace silent message skipping with hard errors (ValueError)
- Replace silent None content fallback with fail-fast validation
- Replace overly broad exception catching with specific exception types
- Add type annotations to stream/astream methods (Iterator/AsyncIterator)
- Add comprehensive error logging to invoke/ainvoke/_generate/_agenerate
- Improve hasattr checks with callable() validation
- Add actionable guidance to NotImplementedError messages
- Change non-list input logging from DEBUG to WARNING (cost impact)

All tests still pass. Coverage remains 36% (error paths require real LLM tests).

* docs: improve docstrings for clarity and accuracy

Addressing documentation issues from PR review:

- Expand class docstring to explain nested wrapper chain
- Clarify bind_tools() two-step process (delegate then wrap)
- Update _add_cache_control() to document strict fail-fast validation
- Clarify is_cacheable_model() uses permissive heuristic
- Add clarifying comment to CACHEABLE_MODELS constant

All docstrings now accurately reflect implementation behavior.

* docs: document two-tier testing approach for NO MOCKS policy

Add comprehensive module docstring explaining:
- Why FakeListChatModel is used for unit tests (wrapper mechanics)
- Why real API calls are used for integration tests (LLM behavior)
- Clear separation between testing wrapper logic vs LLM responses

This addresses the code reviewer's recommendation to document the
exception to the NO MOCKS policy for wrapper mechanics testing.

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Add MATLAB and Python docstring parsers, sync system, and search
tools for indexing code documentation from any community repository.

Key components:
- MATLAB parser: regex-based extraction of function/script comments
- Python parser: AST-based extraction of docstrings
- Sync system: fetch files from GitHub and index docstrings
- Search: FTS5-powered docstring search with GitHub links
- CLI: 'osa sync docstrings' command with language filters
- Tools: LangChain tool factory for community integration

Tests: 28 new tests (parsers + integration)
Verified: 720 docstrings synced from sccn/eeglab

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Critical fixes:
- Add GitHub auth headers to avoid rate limiting
- Improve error handling with specific exception types
- Track and report failed files to users
- Make Python parser raise SyntaxError properly

Important fixes:
- Optimize method detection using parent map (O(n))
- Update FTS5 docs to clarify phrase-only search
- Document branch hardcoding limitation

Tests:
- Update test to expect SyntaxError
- Remove mocked error tests (violates NO MOCKS rule)
- All 71 tests passing
@neuromechanist neuromechanist changed the base branch from develop to epic/issue-97-eeglab January 27, 2026 18:35
@neuromechanist neuromechanist merged commit 2cb807b into epic/issue-97-eeglab Jan 27, 2026
5 checks passed
@neuromechanist neuromechanist linked an issue Jan 27, 2026 that may be closed by this pull request
19 tasks
@neuromechanist neuromechanist deleted the feature/issue-100-phase2-docstring-extraction branch January 28, 2026 15:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Phase 2: Docstring Extraction Tools

1 participant