WIP: [ci] move CI container images to GHCR, workflows to this repo #7109
base: master
Conversation
@jameslamb I can delete the lightgbm.azurecr.io registry once it's not needed anymore.
else  # in manylinux image
    sudo yum update -y
    sudo yum install -y \
        clinfo \
Picking a somewhat-arbitrary place to start a thread.
Right now, the images are building successfully but Python tests with device="gpu" are all failing.
The gpu source job (where LightGBM's default device is set to "gpu") has 238 failures like this:
lightgbm.basic.LightGBMError: Check failed: (best_split_info.right_count) > (0) at /__w/LightGBM/LightGBM/src/treelearner/serial_tree_learner.cpp, line 869 .
Those look, at a glance, like #3679.
The bdist_wheel jobs (which just run a single test checking that OpenCL support was compiled in successfully) are failing like this on both x86_64 and aarch64:
____________________________ test_cpu_and_gpu_work _____________________________
@pytest.mark.skipif(
os.environ.get("LIGHTGBM_TEST_DUAL_CPU_GPU", "0") != "1",
reason="Set LIGHTGBM_TEST_DUAL_CPU_GPU=1 to test using CPU and GPU training from the same package.",
)
def test_cpu_and_gpu_work():
# If compiled appropriately, the same installation will support both GPU and CPU.
X, y = load_breast_cancer(return_X_y=True)
data = lgb.Dataset(X, y)
params_cpu = {"verbosity": -1, "num_leaves": 31, "objective": "binary", "device": "cpu"}
cpu_bst = lgb.train(params_cpu, data, num_boost_round=10)
cpu_score = log_loss(y, cpu_bst.predict(X))
params_gpu = params_cpu.copy()
params_gpu["device"] = "gpu"
# Double-precision floats are only supported on x86_64 with PoCL
params_gpu["gpu_use_dp"] = platform.machine() == "x86_64"
> gpu_bst = lgb.train(params_gpu, data, num_boost_round=10)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tests/python_package_test/test_dual.py:32:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/root/miniforge/envs/test-env/lib/python3.13/site-packages/lightgbm/engine.py:297: in train
booster = Booster(params=params, train_set=train_set)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/root/miniforge/envs/test-env/lib/python3.13/site-packages/lightgbm/basic.py:3615: in __init__
_safe_call(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
ret = -1
def _safe_call(ret: int) -> None:
"""Check the return value from C API call.
Parameters
----------
ret : int
The return value from C API calls.
"""
if ret != 0:
> raise LightGBMError(_LIB.LGBM_GetLastError().decode("utf-8"))
E lightgbm.basic.LightGBMError: No OpenCL device found
/root/miniforge/envs/test-env/lib/python3.13/site-packages/lightgbm/basic.py:310: LightGBMError
Ideas I'm looking into:
- maybe we need to update to a newer Boost to work with a new PoCL?
- maybe we need to update the OpenCL kernels in https://github.com/microsoft/LightGBM/blob/master/src/treelearner/ocl to match more platforms
I'm going to focus on the gpu source builds first, because those don't rely on anything in https://github.com/microsoft/LightGBM/blob/master/cmake/IntegratedOpenCL.cmake and so should be a more minimal way to investigate this.
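For context, a minimal sketch of what a "gpu source" build looks like, assuming the documented USE_GPU CMake option (the exact flags used in CI may differ):

```shell
# a minimal sketch, not the exact CI invocation: configure a "gpu source"
# build, which links OpenCL support against the system OpenCL loader and
# never touches cmake/IntegratedOpenCL.cmake
cmake -B build -S . -DUSE_GPU=ON
cmake --build build -j4
```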
Noticing that the CI job running with an NVIDIA GPU is working: https://github.com/microsoft/LightGBM/actions/runs/20581695041/job/59110391374?pr=7109
So I guess it's just that these jobs are no longer successfully targeting the host CPUs on the GitHub runners? I'll look into that.
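A quick way to test that theory (a sketch, assuming clinfo is available in the image, as installed in the diff above) is to check whether PoCL actually exposes the host CPU as an OpenCL device inside the container:

```shell
# list every OpenCL platform and device the runtime can see; if nothing is
# listed, LightGBM's "No OpenCL device found" error is expected
clinfo --list

# more detail on what each device actually is
clinfo | grep -E "Platform Name|Device Name|Device Type"
```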
Fixes #5596
Fixes #7011
Moves Dockerfiles and CI pipelines for the container images used in Linux CI into this repo. New pipelines here will publish CI images to GitHub's container registry.
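For example (the image name and tag below are hypothetical, just to illustrate the registry layout), CI jobs and local debugging would pull images straight from GHCR instead of lightgbm.azurecr.io:

```shell
# hypothetical image name/tag -- the real names will come from the new workflows
docker pull ghcr.io/microsoft/lightgbm/ci-manylinux_2_28_x86_64:latest
```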
Other changes:
- builds clang from source 😁
- raises the minimum GLIBC for aarch64 Linux wheels up to GLIBC 2.28 (manylinux_2_28)
Notes for Reviewers
Post-merge cleanup
After this is merged and once we feel like it's working well, all of the following could be cleaned up: