NDIF CLI#230

Draft
MichaelRipa wants to merge 56 commits into dev from poc-cli

@MichaelRipa
Member

No description provided.

MichaelRipa and others added 30 commits January 12, 2026 13:53
… S3 object storage functionality. Update BackendRequestModel to use 'compress' instead of 'zlib' for serialization. Refactor model deployment to support compression during save operations. Remove unused sandbox module and streamline protected object handling.
… Processor classes

This commit introduces a new configuration module for the queue system, centralizing environment variable management for the Dispatcher and Processor components. The QueueConfig class provides type-safe access to configuration values with defaults and validation.

Additionally, the Dispatcher and Processor classes have been refactored to utilize the new configuration module, improving code clarity and maintainability. The changes include enhanced docstrings for better understanding of class functionalities and attributes, as well as adjustments to how environment variables are accessed throughout the codebase.
…ketIO connection handling

This commit introduces a new configuration module, AppConfig, which centralizes environment variable management for the API service. The app.py file has been refactored to utilize this configuration for Redis connection, SocketIO settings, and request handling parameters. Additionally, the SocketIO connection handling has been improved with enhanced docstrings for better clarity on function purposes and parameters. This refactor aims to improve code maintainability and readability.
…dundant NNSIGHT_ERROR status handling. This change enhances code clarity and maintains consistent logging practices.
This commit introduces a new endpoint in the API service to fetch Python environment details, including the Python version and installed packages. It implements caching for efficiency and handles timeouts and errors gracefully. Additionally, the Dispatcher class has been updated to manage ENV events, ensuring the environment cache is cleared when necessary. The Controller class now includes a method to gather environment information from the Ray cluster, enhancing the overall functionality and maintainability of the service.
… it consistent with the intuitive understanding of the logs
…econnecting is slow.

Logs were RUNNING BEFORE deserializing, so it's part of the running time calculation
…andling. The traceback formatting has been removed, and only the exception message is now included in the response description, enhancing clarity and consistency in error reporting.
exit 1
fi

echo "Starting Ray head node..."

remove

@Butanium
Member

Butanium commented Feb 3, 2026

Issue: ndif info hangs when Ray is still initializing

Summary

ndif info can hang indefinitely when Ray is starting up but not yet fully ready to accept connections.

Symptoms

$ uv run ndif info
NDIF Session Information
============================================================
...
Quick Connectivity Check:
  ✓ Broker reachable at redis://localhost:6374
  ✓ Object store reachable at http://localhost:27018
  ✓ API reachable at http://localhost:5001
  ✓ Ray reachable at ray://localhost:10001
<hangs here, no "Ray Cluster Nodes:" output>

The command prints all info but never exits. Requires Ctrl+C to abort.

Root Cause

In cli/commands/info.py, the connectivity check flow is:

  1. check_ray(ray_address) does a socket-level check — only verifies the port is listening (cli/lib/checks.py:116)
  2. If the port is open, _show_ray_nodes(ray_address) is called
  3. Inside _show_ray_nodes(), it calls ray.init(address=ray_address, ...) with no timeout

The problem: Ray's port can be open (listening) before the cluster is fully initialized. The socket check passes, but ray.init() blocks waiting for the cluster to be ready.
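The gap between "port is open" and "service is ready" can be reproduced without Ray. The following is a minimal sketch using only the standard library. The names `port_is_open`, `service_answers`, and `demo_listening_but_not_ready` are hypothetical and are not the helpers in cli/lib/checks.py; the listener simply stands in for a server that has bound its port but is not yet handling requests.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Socket-level check: True if something accepts TCP connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def service_answers(host: str, port: int, timeout: float = 0.5) -> bool:
    """Application-level check: True only if the peer sends bytes back."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            conn.sendall(b"ping\n")
            return bool(conn.recv(1))
    except OSError:  # includes socket.timeout
        return False


def demo_listening_but_not_ready() -> tuple:
    """Bind and listen, but never accept() or reply to any connection."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(8)
        host, port = listener.getsockname()
        return port_is_open(host, port), service_answers(host, port)
    finally:
        listener.close()


if __name__ == "__main__":
    print(demo_listening_but_not_ready())
```

The kernel completes the TCP handshake for a listening socket even when the application never calls accept(), so the first check succeeds while the second times out. This is the same shape of mismatch described above between check_ray() and ray.init().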

Suggested Fix

Add user feedback to _show_ray_nodes() (and maybe a timeout?):

def _show_ray_nodes(ray_address: str):
    """Show Ray cluster nodes."""
    import click
    try:
        import ray
        if not ray.is_initialized():
            click.echo()
            click.echo("Connecting to Ray cluster (this may take a moment if Ray is starting)...")
            ray.init(
                address=ray_address,
                ignore_reinit_error=True,
                logging_level="error",
                _timeout=10,  # hypothetical: `_timeout` is not a documented ray.init() argument; verify it before relying on it
            )
        # ...
    except TimeoutError:
        click.echo("  Ray cluster is still initializing, skipping node info")
    except Exception:
        pass
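If ray.init() turns out not to accept a timeout argument, the time limit could be enforced from outside the call instead. A minimal sketch, assuming it is acceptable to run the blocking call in a background daemon thread (whether ray.init() behaves correctly off the main thread has not been verified here); `call_with_timeout` is a hypothetical helper, not part of the PR:

```python
import threading
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_timeout(func: Callable[[], T], timeout_s: float) -> T:
    """Run func in a daemon thread; raise TimeoutError if it takes too long.

    The worker is a daemon thread, so a call that is still blocked does not
    keep the process alive when the CLI command returns.
    """
    result = {}

    def runner() -> None:
        try:
            result["value"] = func()
        except BaseException as exc:  # re-raised in the calling thread below
            result["error"] = exc

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout_s)

    if worker.is_alive():
        raise TimeoutError(f"call did not finish within {timeout_s} s")
    if "error" in result:
        raise result["error"]
    return result["value"]


if __name__ == "__main__":
    # A fast call returns its value unchanged.
    print(call_with_timeout(lambda: "ready", timeout_s=1.0))
    # A slow call stands in for a blocking ray.init().
    try:
        call_with_timeout(lambda: time.sleep(5), timeout_s=0.2)
    except TimeoutError as exc:
        print(f"skipped: {exc}")
```

Inside _show_ray_nodes(), the ray.init(...) call could be wrapped as `call_with_timeout(lambda: ray.init(address=ray_address, ignore_reinit_error=True, logging_level="error"), 10)`, which keeps the existing `except TimeoutError` branch meaningful. The abandoned thread keeps running until the process exits, so this fits a short-lived CLI command better than a long-running service.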

Impact

  • Low severity — only affects the info command, not core functionality
  • Confusing UX when running ndif info shortly after ndif start

Workaround

Wait a few seconds after ndif start before running ndif info, or let the command keep running: it completes once Ray finishes initializing.

3 participants