[K8s] If the node pool decreases in size Cluster Controller crashes #264

@AlexCuadron

Description

To reproduce: create a cluster with more than one node, then gracefully shut down at least one node in that node pool.

Logs:

[gke-underscore-sky-burst-underscore-us-central1-underscore-test-cluster-v1 - k8 Manager] - 2024-06-16 15:58:54,560 - ERROR - Unexpected error: Node gke-test-cluster-v1-cpu-pool-118cb08c-rhft not found in cluster resources.
[gke_sky-burst_us-central1_test-cluster-v1 - Cluster Controller] - 2024-06-16 15:58:54,917 - ERROR - Traceback (most recent call last):
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 195, in get_cluster_status
    allocatable_capacity=self.allocatable_resources,
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 295, in allocatable_resources
    assert node_name in available_resources.keys(), (
AssertionError: Node gke-test-cluster-v1-cpu-pool-118cb08c-rhft not found in cluster resources.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/alex/Documents/skyflow/skyflow/skylet/cluster_controller.py", line 32, in heartbeat_error_handler
    yield
  File "/home/alex/Documents/skyflow/skyflow/skylet/cluster_controller.py", line 86, in run
    self.controller_loop()
  File "/home/alex/Documents/skyflow/skyflow/skylet/cluster_controller.py", line 93, in controller_loop
    cluster_status = self.manager_api.get_cluster_status()
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 218, in get_cluster_status
    allocatable_capacity=self.allocatable_resources,
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 295, in allocatable_resources
    assert node_name in available_resources.keys(), (
AssertionError: Node gke-test-cluster-v1-cpu-pool-118cb08c-rhft not found in cluster resources.

[gke_sky-burst_us-central1_test-cluster-v1 - Cluster Controller] - 2024-06-16 15:58:54,917 - ERROR - Encountered unusual error. Trying again.
[gke-underscore-sky-burst-underscore-us-central1-underscore-test-cluster-v1 - k8 Manager] - 2024-06-16 15:58:59,596 - ERROR - Unexpected error: Node gke-test-cluster-v1-cpu-pool-118cb08c-rhft not found in cluster resources.
[gke_sky-burst_us-central1_test-cluster-v1 - Cluster Controller] - 2024-06-16 15:58:59,895 - ERROR - Traceback (most recent call last):
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 195, in get_cluster_status
    allocatable_capacity=self.allocatable_resources,
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 295, in allocatable_resources
    assert node_name in available_resources.keys(), (
AssertionError: Node gke-test-cluster-v1-cpu-pool-118cb08c-rhft not found in cluster resources.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/alex/Documents/skyflow/skyflow/skylet/cluster_controller.py", line 32, in heartbeat_error_handler
    yield
  File "/home/alex/Documents/skyflow/skyflow/skylet/cluster_controller.py", line 86, in run
    self.controller_loop()
  File "/home/alex/Documents/skyflow/skyflow/skylet/cluster_controller.py", line 93, in controller_loop
    cluster_status = self.manager_api.get_cluster_status()
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 218, in get_cluster_status
    allocatable_capacity=self.allocatable_resources,
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 295, in allocatable_resources
    assert node_name in available_resources.keys(), (
AssertionError: Node gke-test-cluster-v1-cpu-pool-118cb08c-rhft not found in cluster resources.

[gke_sky-burst_us-central1_test-cluster-v1 - Cluster Controller] - 2024-06-16 15:58:59,896 - ERROR - Encountered unusual error. Trying again.
[gke-underscore-sky-burst-underscore-us-central1-underscore-test-cluster-v1 - k8 Manager] - 2024-06-16 15:59:04,592 - ERROR - Unexpected error: Node gke-test-cluster-v1-cpu-pool-118cb08c-rhft not found in cluster resources.
[gke_sky-burst_us-central1_test-cluster-v1 - Cluster Controller] - 2024-06-16 15:59:04,949 - ERROR - Traceback (most recent call last):
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 195, in get_cluster_status
    allocatable_capacity=self.allocatable_resources,
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 295, in allocatable_resources
    assert node_name in available_resources.keys(), (
AssertionError: Node gke-test-cluster-v1-cpu-pool-118cb08c-rhft not found in cluster resources.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/alex/Documents/skyflow/skyflow/skylet/cluster_controller.py", line 32, in heartbeat_error_handler
    yield
  File "/home/alex/Documents/skyflow/skyflow/skylet/cluster_controller.py", line 86, in run
    self.controller_loop()
  File "/home/alex/Documents/skyflow/skyflow/skylet/cluster_controller.py", line 93, in controller_loop
    cluster_status = self.manager_api.get_cluster_status()
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 218, in get_cluster_status
    allocatable_capacity=self.allocatable_resources,
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 295, in allocatable_resources
    assert node_name in available_resources.keys(), (
AssertionError: Node gke-test-cluster-v1-cpu-pool-118cb08c-rhft not found in cluster resources.

[gke_sky-burst_us-central1_test-cluster-v1 - Cluster Controller] - 2024-06-16 15:59:04,949 - ERROR - Encountered unusual error. Trying again.
[gke-underscore-sky-burst-underscore-us-central1-underscore-test-cluster-v1 - k8 Manager] - 2024-06-16 15:59:09,595 - ERROR - Unexpected error: Node gke-test-cluster-v1-cpu-pool-118cb08c-rhft not found in cluster resources.
[gke_sky-burst_us-central1_test-cluster-v1 - Cluster Controller] - 2024-06-16 15:59:09,886 - ERROR - Traceback (most recent call last):
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 195, in get_cluster_status
    allocatable_capacity=self.allocatable_resources,
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 295, in allocatable_resources
    assert node_name in available_resources.keys(), (
AssertionError: Node gke-test-cluster-v1-cpu-pool-118cb08c-rhft not found in cluster resources.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/alex/Documents/skyflow/skyflow/skylet/cluster_controller.py", line 32, in heartbeat_error_handler
    yield
  File "/home/alex/Documents/skyflow/skyflow/skylet/cluster_controller.py", line 86, in run
    self.controller_loop()
  File "/home/alex/Documents/skyflow/skyflow/skylet/cluster_controller.py", line 93, in controller_loop
    cluster_status = self.manager_api.get_cluster_status()
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 218, in get_cluster_status
    allocatable_capacity=self.allocatable_resources,
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 295, in allocatable_resources
    assert node_name in available_resources.keys(), (
AssertionError: Node gke-test-cluster-v1-cpu-pool-118cb08c-rhft not found in cluster resources.

[gke_sky-burst_us-central1_test-cluster-v1 - Cluster Controller] - 2024-06-16 15:59:09,887 - ERROR - Encountered unusual error. Trying again.
[gke-underscore-sky-burst-underscore-us-central1-underscore-test-cluster-v1 - k8 Manager] - 2024-06-16 15:59:14,619 - ERROR - Unexpected error: Node gke-test-cluster-v1-cpu-pool-118cb08c-rhft not found in cluster resources.
[gke_sky-burst_us-central1_test-cluster-v1 - Cluster Controller] - 2024-06-16 15:59:14,966 - ERROR - Traceback (most recent call last):
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 195, in get_cluster_status
    allocatable_capacity=self.allocatable_resources,
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 295, in allocatable_resources
    assert node_name in available_resources.keys(), (
AssertionError: Node gke-test-cluster-v1-cpu-pool-118cb08c-rhft not found in cluster resources.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/alex/Documents/skyflow/skyflow/skylet/cluster_controller.py", line 32, in heartbeat_error_handler
    yield
  File "/home/alex/Documents/skyflow/skyflow/skylet/cluster_controller.py", line 86, in run
    self.controller_loop()
  File "/home/alex/Documents/skyflow/skyflow/skylet/cluster_controller.py", line 93, in controller_loop
    cluster_status = self.manager_api.get_cluster_status()
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 218, in get_cluster_status
    allocatable_capacity=self.allocatable_resources,
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 295, in allocatable_resources
    assert node_name in available_resources.keys(), (
AssertionError: Node gke-test-cluster-v1-cpu-pool-118cb08c-rhft not found in cluster resources.

[gke_sky-burst_us-central1_test-cluster-v1 - Cluster Controller] - 2024-06-16 15:59:14,966 - ERROR - Encountered unusual error. Trying again.
[gke-underscore-sky-burst-underscore-us-central1-underscore-test-cluster-v1 - k8 Manager] - 2024-06-16 15:59:19,644 - ERROR - Unexpected error: Node gke-test-cluster-v1-cpu-pool-118cb08c-rhft not found in cluster resources.
[gke_sky-burst_us-central1_test-cluster-v1 - Cluster Controller] - 2024-06-16 15:59:19,952 - ERROR - Traceback (most recent call last):
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 195, in get_cluster_status
    allocatable_capacity=self.allocatable_resources,
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 295, in allocatable_resources
    assert node_name in available_resources.keys(), (
AssertionError: Node gke-test-cluster-v1-cpu-pool-118cb08c-rhft not found in cluster resources.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/alex/Documents/skyflow/skyflow/skylet/cluster_controller.py", line 32, in heartbeat_error_handler
    yield
  File "/home/alex/Documents/skyflow/skyflow/skylet/cluster_controller.py", line 86, in run
    self.controller_loop()
  File "/home/alex/Documents/skyflow/skyflow/skylet/cluster_controller.py", line 93, in controller_loop
    cluster_status = self.manager_api.get_cluster_status()
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 218, in get_cluster_status
    allocatable_capacity=self.allocatable_resources,
  File "/home/alex/Documents/skyflow/skyflow/cluster_manager/kubernetes/kubernetes_manager.py", line 295, in allocatable_resources
    assert node_name in available_resources.keys(), (
AssertionError: Node gke-test-cluster-v1-cpu-pool-118cb08c-rhft not found in cluster resources.

[gke_sky-burst_us-central1_test-cluster-v1 - Cluster Controller] - 2024-06-16 15:59:19,952 - ERROR - Encountered unusual error. Trying again.
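
The traceback shows only the `assert node_name in available_resources.keys()` line, so the surrounding logic is not visible here. One plausible reading is that `allocatable_resources` builds a per-node map from a node listing and then subtracts pod requests, and that a pod still references a node that has already left the pool. Under that assumption, a minimal sketch of a more tolerant version could skip such nodes and report them instead of asserting. The function name, argument shapes, and resource keys below are hypothetical and are not taken from `kubernetes_manager.py`:

```python
from typing import Dict, List, Tuple

# Hypothetical shape: node name -> {resource name -> amount}.
ResourceMap = Dict[str, Dict[str, float]]


def subtract_pod_requests(
    node_allocatable: ResourceMap,
    pod_requests: List[Tuple[str, Dict[str, float]]],
) -> Tuple[ResourceMap, List[str]]:
    """Return the remaining resources per node, plus the names of nodes
    that pods reference but that are absent from the node listing."""
    # Copy so the caller's snapshot is not mutated.
    available = {name: dict(res) for name, res in node_allocatable.items()}
    missing_nodes: List[str] = []
    for node_name, requests in pod_requests:
        if node_name not in available:
            # Assumed scenario: the node was removed from the pool while its
            # pods are still listed. Skip it instead of raising.
            if node_name not in missing_nodes:
                missing_nodes.append(node_name)
            continue
        for resource, amount in requests.items():
            current = available[node_name].get(resource, 0.0)
            # Clamp at zero so over-committed nodes do not go negative.
            available[node_name][resource] = max(current - amount, 0.0)
    return available, missing_nodes


# Usage with made-up node names and resource keys.
nodes = {"node-a": {"cpus": 4.0, "memory": 16.0}}
pods = [("node-a", {"cpus": 1.0}), ("node-gone", {"cpus": 2.0})]
available, missing = subtract_pod_requests(nodes, pods)
```

Returning the missing node names lets the caller log a warning once per node, rather than failing every heartbeat as in the logs above. Whether skipping is the right behaviour depends on how the controller is expected to treat pods on departed nodes, which this issue does not specify.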

Metadata

Assignees

No one assigned

    Labels

    bug: Something isn't working

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests