Conversation

@jameslamb
Collaborator

Fixes #5596
Fixes #7011

Moves Dockerfiles and CI pipelines for the container images used in Linux CI into this repo. New pipelines here will publish CI images to GitHub's container registry.

Other changes:

  • upgrades from PoCL 1.8 to 7.1 (no more building clang from source 😁)
  • moves aarch64 Linux wheels up to GLIBC 2.28 (manylinux_2_28)
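
As a quick sanity check for the manylinux_2_28 move: wheels tagged manylinux_2_28 only install on hosts with glibc >= 2.28. A stdlib-only sketch (the helper name is mine, not part of this PR) for verifying that a host qualifies:

```python
import platform


def glibc_at_least(required, libc_version=None):
    """Return True if the given (or detected) glibc version meets `required`.

    If `libc_version` is None, detect it from the running interpreter.
    """
    if libc_version is None:
        libc, libc_version = platform.libc_ver()
        if libc != "glibc":
            # musl or unknown libc: manylinux glibc tags don't apply
            return False
    req = tuple(int(x) for x in required.split("."))
    have = tuple(int(x) for x in libc_version.split(".")[:2])
    return have >= req
```

On a manylinux_2_28 container this returns True; on the old CentOS 7 images (glibc 2.17) it would return False.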

Notes for Reviewers

Post-merge cleanup

After this is merged and once we feel like it's working well, all of the following could be cleaned up:

@jameslamb jameslamb changed the title WIP: [ci] move CI container images to GHCR WIP: [ci] move CI container images to GHCR, workflows to this repo Dec 24, 2025
@letmaik
Member

letmaik commented Dec 29, 2025

@jameslamb I can delete the lightgbm.azurecr.io registry once it's not needed anymore.

else # in manylinux image
sudo yum update -y
sudo yum install -y \
clinfo \
@jameslamb (Collaborator, Author)

Picking a somewhat-arbitrary place to start a thread.

Right now, the images are building successfully but Python tests with device="gpu" are all failing.

The gpu source job (where LightGBM's default device is set to "gpu") has 238 failures like this:

lightgbm.basic.LightGBMError: Check failed: (best_split_info.right_count) > (0) at /__w/LightGBM/LightGBM/src/treelearner/serial_tree_learner.cpp, line 869 .

At a glance, these look like #3679.
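
For local reproduction, the failing configuration boils down to training with device="gpu" on PoCL. A hedged sketch of the parameter tweaks commonly suggested in threads like #3679 (these are real LightGBM parameters, but whether they help with this specific failure is untested):

```python
# Parameters sometimes suggested for the "Check failed:
# (best_split_info.right_count) > (0)" GPU error (see #3679).
# Untested against this PR's images; treat as a starting point only.
params = {
    "objective": "binary",
    "device": "gpu",
    "gpu_use_dp": True,      # double precision: slower, but avoids some
                             # single-precision histogram edge cases
    "max_bin": 63,           # LightGBM's GPU docs recommend 63 or 255
                             # bins; coarser bins sidestep some bad splits
    "min_data_in_leaf": 50,  # larger leaves keep right_count above zero
}
```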

The bdist_wheel jobs (which run a single test checking that OpenCL support was compiled in successfully) on both x86_64 and aarch64 are failing like this:

____________________________ test_cpu_and_gpu_work _____________________________

    @pytest.mark.skipif(
        os.environ.get("LIGHTGBM_TEST_DUAL_CPU_GPU", "0") != "1",
        reason="Set LIGHTGBM_TEST_DUAL_CPU_GPU=1 to test using CPU and GPU training from the same package.",
    )
    def test_cpu_and_gpu_work():
        # If compiled appropriately, the same installation will support both GPU and CPU.
        X, y = load_breast_cancer(return_X_y=True)
        data = lgb.Dataset(X, y)
    
        params_cpu = {"verbosity": -1, "num_leaves": 31, "objective": "binary", "device": "cpu"}
        cpu_bst = lgb.train(params_cpu, data, num_boost_round=10)
        cpu_score = log_loss(y, cpu_bst.predict(X))
    
        params_gpu = params_cpu.copy()
        params_gpu["device"] = "gpu"
        # Double-precision floats are only supported on x86_64 with PoCL
        params_gpu["gpu_use_dp"] = platform.machine() == "x86_64"
>       gpu_bst = lgb.train(params_gpu, data, num_boost_round=10)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/python_package_test/test_dual.py:32: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/root/miniforge/envs/test-env/lib/python3.13/site-packages/lightgbm/engine.py:297: in train
    booster = Booster(params=params, train_set=train_set)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/root/miniforge/envs/test-env/lib/python3.13/site-packages/lightgbm/basic.py:3615: in __init__
    _safe_call(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

ret = -1

    def _safe_call(ret: int) -> None:
        """Check the return value from C API call.
    
        Parameters
        ----------
        ret : int
            The return value from C API calls.
        """
        if ret != 0:
>           raise LightGBMError(_LIB.LGBM_GetLastError().decode("utf-8"))
E           lightgbm.basic.LightGBMError: No OpenCL device found

/root/miniforge/envs/test-env/lib/python3.13/site-packages/lightgbm/basic.py:310: LightGBMError

(build link)
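
"No OpenCL device found" means the ICD loader enumerated zero platforms, which is the same thing `clinfo` would report. A minimal stdlib probe (my own helper, assuming nothing beyond the standard OpenCL C API) for checking this inside a container:

```python
import ctypes
import ctypes.util


def count_opencl_platforms():
    """Count OpenCL platforms visible to the ICD loader.

    Returns None when no OpenCL runtime (libOpenCL) is installed at all,
    0 when the loader exists but finds no platforms (a missing or
    unreadable PoCL ICD file would look like this), otherwise the
    number of platforms.
    """
    path = ctypes.util.find_library("OpenCL")
    if path is None:
        return None
    lib = ctypes.CDLL(path)
    n = ctypes.c_uint(0)
    # clGetPlatformIDs(0, NULL, &n) fills n with the platform count;
    # a nonzero return code means "no platforms" for our purposes.
    if lib.clGetPlatformIDs(0, None, ctypes.byref(n)) != 0:
        return 0
    return n.value
```

If this prints 0 inside the image, the PoCL ICD registration (conventionally /etc/OpenCL/vendors/pocl.icd) is the first place to look; if it prints None, libOpenCL itself is missing.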

Ideas I'm looking into:

I'm going to focus on the gpu source builds first, because those don't rely on anything in https://github.com/microsoft/LightGBM/blob/master/cmake/IntegratedOpenCL.cmake and so should be a more minimal way to investigate this.

@jameslamb (Collaborator, Author)

Noticing that the CI job running with an NVIDIA GPU is working: https://github.com/microsoft/LightGBM/actions/runs/20581695041/job/59110391374?pr=7109

So I guess it's just that these jobs are no longer successfully targeting the host CPUs on the GitHub runners? I'll look into that.

Development

Successfully merging this pull request may close these issues:

  • [RFC] [ci] move management of CI images into this repo?
  • [gpu] upgrade to PoCL v3.0 for building Linux integrated OpenCL wheels
