Conversation

@jameslamb
Collaborator

Fixes #5596
Fixes #7011

Moves Dockerfiles and CI pipelines for the container images used in Linux CI into this repo. New pipelines here will publish CI images to GitHub's container registry.

Other changes:

  • upgrades from PoCL 1.8 to 7.1 (no more building clang from source 😁)
  • moves aarch64 Linux wheels up to GLIBC 2.28 (manylinux_2_28)
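
As a quick sanity check for the manylinux_2_28 move: wheels tagged manylinux_2_28 only install on hosts with glibc >= 2.28. A stdlib-only sketch (the helper name is mine, not part of this PR) for verifying that a host qualifies:

```python
import platform


def glibc_at_least(required, libc_version=None):
    """Return True if the given (or detected) glibc version meets `required`.

    If `libc_version` is None, detect it from the running interpreter.
    """
    if libc_version is None:
        libc, libc_version = platform.libc_ver()
        if libc != "glibc":
            # musl or unknown libc: manylinux glibc tags don't apply
            return False
    req = tuple(int(x) for x in required.split("."))
    have = tuple(int(x) for x in libc_version.split(".")[:2])
    return have >= req
```

On a manylinux_2_28 container this returns True; on the old CentOS 7 images (glibc 2.17) it would return False.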

Notes for Reviewers

Post-merge cleanup

After this is merged and once we feel like it's working well, all of the following could be cleaned up:

@jameslamb jameslamb changed the title WIP: [ci] move CI container images to GHCR WIP: [ci] move CI container images to GHCR, workflows to this repo Dec 24, 2025
@letmaik
Member

letmaik commented Dec 29, 2025

@jameslamb I can delete the lightgbm.azurecr.io registry once it's not needed anymore.

else # in manylinux image
sudo yum update -y
sudo yum install -y \
clinfo \
@jameslamb (Collaborator, Author)

Picking a somewhat-arbitrary place to start a thread.

Right now, the images are building successfully but Python tests with device="gpu" are all failing.

The gpu source job (where LightGBM's default device is set to "gpu") has 238 failures like this:

lightgbm.basic.LightGBMError: Check failed: (best_split_info.right_count) > (0) at /__w/LightGBM/LightGBM/src/treelearner/serial_tree_learner.cpp, line 869 .

At a glance, these look like #3679.
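
For local reproduction, the failing configuration boils down to training with device="gpu" on PoCL. A hedged sketch of the parameter tweaks commonly suggested in threads like #3679 (these are real LightGBM parameters, but whether they help with this specific failure is untested):

```python
# Parameters sometimes suggested for the "Check failed:
# (best_split_info.right_count) > (0)" GPU error (see #3679).
# Untested against this PR's images; treat as a starting point only.
params = {
    "objective": "binary",
    "device": "gpu",
    "gpu_use_dp": True,      # double precision: slower, but avoids some
                             # single-precision histogram edge cases
    "max_bin": 63,           # LightGBM's GPU docs recommend 63 or 255
                             # bins; coarser bins sidestep some bad splits
    "min_data_in_leaf": 50,  # larger leaves keep right_count above zero
}
```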

The bdist_wheel jobs (which run a single test checking that OpenCL support was compiled in successfully) on both x86_64 and aarch64 are failing like this:

____________________________ test_cpu_and_gpu_work _____________________________

    @pytest.mark.skipif(
        os.environ.get("LIGHTGBM_TEST_DUAL_CPU_GPU", "0") != "1",
        reason="Set LIGHTGBM_TEST_DUAL_CPU_GPU=1 to test using CPU and GPU training from the same package.",
    )
    def test_cpu_and_gpu_work():
        # If compiled appropriately, the same installation will support both GPU and CPU.
        X, y = load_breast_cancer(return_X_y=True)
        data = lgb.Dataset(X, y)
    
        params_cpu = {"verbosity": -1, "num_leaves": 31, "objective": "binary", "device": "cpu"}
        cpu_bst = lgb.train(params_cpu, data, num_boost_round=10)
        cpu_score = log_loss(y, cpu_bst.predict(X))
    
        params_gpu = params_cpu.copy()
        params_gpu["device"] = "gpu"
        # Double-precision floats are only supported on x86_64 with PoCL
        params_gpu["gpu_use_dp"] = platform.machine() == "x86_64"
>       gpu_bst = lgb.train(params_gpu, data, num_boost_round=10)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/python_package_test/test_dual.py:32: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/root/miniforge/envs/test-env/lib/python3.13/site-packages/lightgbm/engine.py:297: in train
    booster = Booster(params=params, train_set=train_set)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/root/miniforge/envs/test-env/lib/python3.13/site-packages/lightgbm/basic.py:3615: in __init__
    _safe_call(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

ret = -1

    def _safe_call(ret: int) -> None:
        """Check the return value from C API call.
    
        Parameters
        ----------
        ret : int
            The return value from C API calls.
        """
        if ret != 0:
>           raise LightGBMError(_LIB.LGBM_GetLastError().decode("utf-8"))
E           lightgbm.basic.LightGBMError: No OpenCL device found

/root/miniforge/envs/test-env/lib/python3.13/site-packages/lightgbm/basic.py:310: LightGBMError

(build link)
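
"No OpenCL device found" means the ICD loader enumerated zero platforms, which is the same thing `clinfo` would report. A minimal stdlib probe (my own helper, assuming nothing beyond the standard OpenCL C API) for checking this inside a container:

```python
import ctypes
import ctypes.util


def count_opencl_platforms():
    """Count OpenCL platforms visible to the ICD loader.

    Returns None when no OpenCL runtime (libOpenCL) is installed at all,
    0 when the loader exists but finds no platforms (a missing or
    unreadable PoCL ICD file would look like this), otherwise the
    number of platforms.
    """
    path = ctypes.util.find_library("OpenCL")
    if path is None:
        return None
    lib = ctypes.CDLL(path)
    n = ctypes.c_uint(0)
    # clGetPlatformIDs(0, NULL, &n) fills n with the platform count;
    # a nonzero return code means "no platforms" for our purposes.
    if lib.clGetPlatformIDs(0, None, ctypes.byref(n)) != 0:
        return 0
    return n.value
```

If this prints 0 inside the image, the PoCL ICD registration (conventionally /etc/OpenCL/vendors/pocl.icd) is the first place to look; if it prints None, libOpenCL itself is missing.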

Ideas I'm looking into:

I'm going to focus on the gpu source builds first, because those don't rely on anything in https://github.com/microsoft/LightGBM/blob/master/cmake/IntegratedOpenCL.cmake and so should be a more minimal way to investigate this.

@jameslamb (Collaborator, Author)

Noticing that the CI job running with an NVIDIA GPU is working: https://github.com/microsoft/LightGBM/actions/runs/20581695041/job/59110391374?pr=7109

So I guess it's just that these jobs are no longer successfully targeting the host CPUs on the GitHub runners? I'll look into that.

Development

Successfully merging this pull request may close these issues:

  • [RFC] [ci] move management of CI images into this repo?
  • [gpu] upgrade to PoCL v3.0 for building Linux integrated OpenCL wheels
