Fast Function Approximations lowering. #8566
Conversation
Force-pushed from bea8612 to 0de4dbc.
abadams left a comment:
Thanks so much for doing this, and sorry it took me so long to review it (I'm finally clawing out from under my deadlines). It generally looks good but I have some review comments.
What order should this be done in vs our change to strict_float? Is there any reason to delay this until after strict_float is changed?
```cpp
    return false;
}

if (const Call *c = e.as<Call>()) {
```
The need for this makes me think the extra args would be better just as flat extra args to the math intrinsic, instead of being packed into a struct.
Well, if I don't pass these as a make_struct(), CSE will still lift the arguments out of the call if you have several such calls. That makes it harder to figure out what the actual precision arguments to the Call node were in the lowering pass that wants to read them back: they might have become a Let or LetStmt, instead of a simple FloatImm.
CSE will never lift constants, so they should be fine.
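For context, here is a minimal sketch of the packing idea being discussed, assuming Halide's internal IR API (`IntImm::make`, `FloatImm::make`, `Call::make` with the `make_struct` intrinsic); this is not the PR's actual code:

```cpp
#include "Halide.h"
#include <vector>

using namespace Halide;
using namespace Halide::Internal;

// Pack the precision parameters as constant arguments of a make_struct
// intrinsic call. Because the arguments stay constants (IntImm/FloatImm),
// CSE will not lift them into Lets, and the lowering pass can read them
// back directly from the Call node's arguments.
Expr pack_precision_args(int objective, double max_abs_error) {
    std::vector<Expr> args = {
        IntImm::make(Int(32), objective),
        FloatImm::make(Float(64), max_abs_error),
    };
    return Call::make(Handle(), Call::make_struct, args, Call::Intrinsic);
}
```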
strict_float is already broken today in a few scenarios. All these transcendentals are broken right now with regard to strict_float(): every GPU backend has a different implementation of them, with different precisions. The good thing is that this PR actually paves the way towards fixing strict_float for transcendentals. Long term, if we have our own implementation for all of them, we can guarantee under strict_float that they will yield the same result. However, this would require us to have an FMA intrinsic as well, to be able to express the polynomials using Horner's method with FMA ops. So yeah, I'm definitely in favor of accepting that strict_float is badly broken, and moving forward with this PR.
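To make the FMA point concrete, a small self-contained sketch (not code from this PR) of a polynomial evaluated with Horner's method, where each step is a single fused multiply-add via `std::fma`:

```cpp
#include <cmath>

// Degree-4 polynomial c[0] + c[1]*x + ... + c[4]*x^4 evaluated with
// Horner's method. Each std::fma performs one rounding, which is the
// guarantee an FMA intrinsic would give consistently across backends.
// The coefficients are placeholders, not the PR's actual tables.
float poly4_horner(float x, const float c[5]) {
    float r = c[4];
    for (int i = 3; i >= 0; --i) {
        r = std::fma(r, x, c[i]);
    }
    return r;
}
```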
No worries! Thanks a lot for looking into this! I'll address your feedback, questions, and suggested improvements tomorrow, while this is all still fresh in your head.
Update: Never mind, I found another approach that works well for now.
Force-pushed from 3211d3a to f6f7fd0.
I took care of all the feedback, except for the make_struct wrapper for the precision arguments; that's for later. I updated the original post of the PR at the top. More info on the latest concerns and notes can be found there!
Force-pushed from f28a8b0 to 7000f21.
Force-pushed from a171ec1 to 6cebc56.
Update: I changed it to have accessor functions that simply return the static member. That seems to be the way it's done everywhere in Halide header files.
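As an illustration of that header pattern, a small sketch with hypothetical names (not the PR's actual members):

```cpp
#include <vector>

class ApproximationTables {
public:
    // Accessor that simply returns the static member, rather than
    // exposing the data member itself from the header.
    static const std::vector<float> &coefficients() {
        return coefficients_;
    }

private:
    // C++17 inline static definition keeps the header self-contained.
    static inline const std::vector<float> coefficients_{1.0f, 0.5f, 0.25f};
};
```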
@alexreinking Can you assess what this macOS buildbot is up to?
Force-pushed from 4ceca2c to 845d83a.
@abadams There seems to be an issue with the strict float behavior on the WebAssembly target. It is powered by LLVM, so it's weird that it doesn't work; the other LLVM-powered backends seem fine. Any clues what might be going wrong there? Is the Wasm runtime further simplifying and not respecting the stream of instructions as-is?
Commits:
- …Selectively disable some tests that require strict_float on GPU backends.
- …t float calculations for f64 and f16.
- Fix FloatImm codegen on several GPU backends. Fix the gpu_float16_intrinsics test: it was not really using many float16 ops at all, because fast_pow was historically casting to float. Implement a few quick workarounds for NVIDIA not properly implementing fp16 built-in functions.
- …ch failed on x87.
- …n case that's marked as supported by the GPU backend.) Change the printing style of float literals to use scientific notation with enough digits to be exact. Relax the performance test for fast_tanh on WebGPU. Bugfix float16 nan/inf constants on WebGPU. Separately print out the compilation log in runtime/opencl, as those logs can get very large, beyond the size of the HeapPrinter capacity.
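As a side note on the float-literal printing change in the commits above, a minimal sketch (not the PR's code) of printing a float with enough digits to round-trip exactly:

```cpp
#include <cstdio>
#include <limits>

// Print a float in scientific notation with max_digits10 significant
// digits (9 for IEEE binary32): enough that parsing the printed
// literal back recovers exactly the same bits.
void print_exact(float f) {
    std::printf("%.*e\n", std::numeric_limits<float>::max_digits10 - 1, f);
}
```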
Force-pushed from c05f2cc to 6c82133.
The big transcendental lowering update! Replaces #8388.
TODO
I still have to do:
Overview:
- Fast transcendentals implemented for: sin, cos, tan, atan, atan2, exp, log, expm1, tanh, asin, acos.
- Simple API to specify precision requirements. Default-initialized precision (`AUTO` without constraints) means "don't care about precision, as long as it's reasonable and fast", which gives you the highest chance of selecting a high-performance implementation based on hardware instructions. Optimization objectives `MULPE` (max ULP error) and `MAE` (max absolute error) are available (see the sketch after this list). Compared to the previous PR, I removed `MULPE_MAE` as I didn't see a good purpose for it.
- Tabular info on intrinsics and native functions (`native_cos`, `native_exp`, etc.) and fast variants (`fast::cos`, `fast::exp`, etc.), with their precision and speed, to select an appropriate implementation for lowering to something that is definitely not slower, while satisfying the precision requirements.
- Tabular info measuring the exact precision, obtained by exhaustively iterating over all floats in the polynomial's native interval. Both MULPE and MAE are measured for Float32. Precisions are not yet evaluated on f16 or f64; that is future work (which I have currently not planned). Precisions are measured by `correctness/determine_fast_function_approximation_metrics.cpp`.
- Performance tests validating that:
- Accuracy tests validating that:
- Drive-by fix for adding `libOpenCL.so.1` to the list of tested sonames for the OpenCL runtime.
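As referenced in the overview list above, here is a hypothetical illustration of what such a precision specification could look like; the struct and field names below are assumptions based on the overview text (AUTO, MULPE, MAE), not the PR's actual header:

```cpp
// Hypothetical illustration of the API shape described above.
struct ApproximationPrecision {
    enum Objective { AUTO, MULPE, MAE } objective = AUTO;
    float max_ulp_error = 0.0f;       // constraint used with MULPE
    float max_absolute_error = 0.0f;  // constraint used with MAE
};

// Default-initialized: "don't care, as long as it's reasonable and
// fast" -> best chance of lowering to a hardware instruction.
ApproximationPrecision dont_care;

// Request a max-ULP-error objective with an explicit bound.
ApproximationPrecision tight{ApproximationPrecision::MULPE, 64.0f};
```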
Review guide
- Precision arguments are packed into a `Call::make_struct` node with 4 parameters (see API below). This approximation-precision Call node survives until the lowering pass where the transcendentals are lowered; in that pass, the parameters are extracted again from the Call node's arguments. I conceptually like that, this way, they are bundled and clearly not at the same level as the actual mathematical arguments. Is this a good approach? In order for this to work, I had to stop `CSE` from extracting those precision arguments, and `StrictifyFloat` from recursing down into that struct and littering `strict_float` on those numbers. I have seen the `Call::bundle` intrinsic; perhaps that one is better for this purpose? @abadams
- `Float(16)` and `Float(64)` are not yet implemented/tested. The polynomial approximations should work correctly (although untested) for these other data types.
- `native_tan()` compiles to the same three instructions as I implemented on CUDA: `sin.approx.f32`, `cos.approx.f32`, `div.approx.f32` (see the sketch after this list). I haven't investigated AMD's documentation on available hardware instructions.
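For the CUDA point above, a minimal sketch (mine, not this PR's code) of a tan built from the approximate hardware instructions; `__sinf`, `__cosf`, and `__fdividef` are the CUDA intrinsics that compile to `sin.approx.f32`, `cos.approx.f32`, and `div.approx.f32`:

```cuda
// Illustrative only: tan(x) as approximate sin/cos followed by an
// approximate divide, matching the three instructions named above.
__device__ float native_tan_sketch(float x) {
    return __fdividef(__sinf(x), __cosf(x));
}
```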
Concerns
- `exp()` (regular `exp()`, not `fast_exp()`) claims to be bit-exact, which is proven wrong by the pre-existing test `correctness/vector_math.cpp`.
- `tan_f32(vec2<f32>)` did not get converted to `tan_f32(first element), tan_f32(second element)`.

API
Fixes #8243.