Fast Function Approximations lowering. #8566
Conversation
Force-pushed from bea8612 to 0de4dbc.
abadams left a comment:
Thanks so much for doing this, and sorry it took me so long to review it (I'm finally clawing out from under my deadlines). It generally looks good but I have some review comments.
What order should this be done in vs our change to strict_float? Is there any reason to delay this until after strict_float is changed?
```cpp
    return false;
}

if (const Call *c = e.as<Call>()) {
```
The need for this makes me think the extra args would be better just as flat extra args to the math intrinsic, instead of being packed into a struct.
Well, if I don't pass these as a make_struct(), CSE will still lift the arguments out of the call if you have several such calls. That makes it harder to figure out what the actual precision arguments to the Call node were in the lowering pass that wants to read them back: they might have become a Let or LetStmt, instead of a simple FloatImm.
CSE will never lift constants, so they should be fine.
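For context, here is a minimal sketch of the packing idea being discussed, assuming Halide's internal IR API (`IntImm::make`, `FloatImm::make`, `Call::make` with the `make_struct` intrinsic); this is not the PR's actual code:

```cpp
#include "Halide.h"
#include <vector>

using namespace Halide;
using namespace Halide::Internal;

// Pack the precision parameters as constant arguments of a make_struct
// intrinsic call. Because the arguments stay constants (IntImm/FloatImm),
// CSE will not lift them into Lets, and the lowering pass can read them
// back directly from the Call node's arguments.
Expr pack_precision_args(int objective, double max_abs_error) {
    std::vector<Expr> args = {
        IntImm::make(Int(32), objective),
        FloatImm::make(Float(64), max_abs_error),
    };
    return Call::make(Handle(), Call::make_struct, args, Call::Intrinsic);
}
```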
strict_float is already broken today in a few scenarios. All these transcendentals are broken right now with regard to strict_float(): every GPU backend has a different implementation of them, with different precisions. The good thing is that this PR actually paves the way towards fixing strict_float for transcendentals. Long term, if we have our own implementation for all of them, we can guarantee under strict_float that they will yield the same result. However, this would require us to have an FMA intrinsic as well, to be able to express the polynomials using Horner's method with FMA ops. So yeah, I'm definitely in favor of accepting that strict_float is badly broken, and moving forward with this PR.
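To make the FMA point concrete, a small self-contained sketch (not code from this PR) of a polynomial evaluated with Horner's method, where each step is a single fused multiply-add via `std::fma`:

```cpp
#include <cmath>

// Degree-4 polynomial c[0] + c[1]*x + ... + c[4]*x^4 evaluated with
// Horner's method. Each std::fma performs one rounding, which is the
// guarantee an FMA intrinsic would give consistently across backends.
// The coefficients are placeholders, not the PR's actual tables.
float poly4_horner(float x, const float c[5]) {
    float r = c[4];
    for (int i = 3; i >= 0; --i) {
        r = std::fma(r, x, c[i]);
    }
    return r;
}
```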
No worries! Thanks a lot for looking into this! I'll address your feedback, questions, and suggested improvements tomorrow, while this is all still fresh in your head.
Update: Never mind, I found another approach that works well for now.
Force-pushed from 3211d3a to f6f7fd0.
I took care of all the feedback, except for the make_struct wrapper for the precision arguments; that's for later. I updated the original post of the PR at the top. More info on the latest concerns and notes can be found there!
Force-pushed from f28a8b0 to 7000f21.
Force-pushed from a171ec1 to 6cebc56.
Update: I changed it to have accessor functions that simply return the static member. That seems to be the way it's done everywhere in Halide header files.
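As an illustration of that header pattern, a small sketch with hypothetical names (not the PR's actual members):

```cpp
#include <vector>

class ApproximationTables {
public:
    // Accessor that simply returns the static member, rather than
    // exposing the data member itself from the header.
    static const std::vector<float> &coefficients() {
        return coefficients_;
    }

private:
    // C++17 inline static definition keeps the header self-contained.
    static inline const std::vector<float> coefficients_{1.0f, 0.5f, 0.25f};
};
```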
@alexreinking Can you assess what this macOS buildbot is up to?
Force-pushed from 4ceca2c to 845d83a.
@abadams There seems to be an issue with the strict float behavior on the WebAssembly target. It is powered by LLVM, so it's weird that it doesn't work; the other LLVM-powered backends seem fine. Any clues what might be going wrong there? Is the Wasm runtime further simplifying and not respecting the stream of instructions as-is?
Commits:
- …Selectively disable some tests that require strict_float on GPU backends.
- …t float calculations for f64 and f16.
- Fix FloatImm codegen on several GPU backends. Fix the gpu_float16_intrinsics test: it was not really using many float16 ops at all, because fast_pow was historically casting to float. Implement a few quick workarounds for NVIDIA not properly implementing fp16 built-in functions.
- …ch failed on x87.
- …n case that's marked as supported by the GPU backend.) Change the printing style of float literals to use scientific notation with enough digits to be exact. Relax the performance test for fast_tanh on WebGPU. Bugfix float16 nan/inf constants on WebGPU. Separately print out the compilation log in runtime/opencl, as those logs can get very large, beyond the size of the HeapPrinter capacity.
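As a side note on the float-literal printing change in the commits above, a minimal sketch (not the PR's code) of printing a float with enough digits to round-trip exactly:

```cpp
#include <cstdio>
#include <limits>

// Print a float in scientific notation with max_digits10 significant
// digits (9 for IEEE binary32): enough that parsing the printed
// literal back recovers exactly the same bits.
void print_exact(float f) {
    std::printf("%.*e\n", std::numeric_limits<float>::max_digits10 - 1, f);
}
```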
Force-pushed from c05f2cc to 6c82133.
The big transcendental lowering update! Replaces #8388.
TODO
I still have to do:
Overview:
- Fast transcendentals implemented for: sin, cos, tan, atan, atan2, exp, log, expm1, tanh, asin, acos.
- Simple API to specify precision requirements. Default-initialized precision (`AUTO` without constraints) means "don't care about precision, as long as it's reasonable and fast", which gives you the highest chance of selecting a high-performance implementation based on hardware instructions. Optimization objectives `MULPE` (max ULP error) and `MAE` (max absolute error) are available (see the sketch after this list). Compared to the previous PR, I removed `MULPE_MAE` as I didn't see a good purpose for it.
- Tabular info on intrinsics and native functions (`native_cos`, `native_exp`, etc.) and fast variants (`fast::cos`, `fast::exp`, etc.), with their precision and speed, to select an appropriate implementation for lowering to something that is definitely not slower, while satisfying the precision requirements.
- Tabular info measuring the exact precision, obtained by exhaustively iterating over all floats in the polynomial's native interval. Both MULPE and MAE are measured for Float32. Precisions are not yet evaluated on f16 or f64; that is future work (which I have currently not planned). Precisions are measured by `correctness/determine_fast_function_approximation_metrics.cpp`.
- Performance tests validating that:
- Accuracy tests validating that:
- Drive-by fix for adding `libOpenCL.so.1` to the list of tested sonames for the OpenCL runtime.
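As referenced in the overview list above, here is a hypothetical illustration of what such a precision specification could look like; the struct and field names below are assumptions based on the overview text (AUTO, MULPE, MAE), not the PR's actual header:

```cpp
// Hypothetical illustration of the API shape described above.
struct ApproximationPrecision {
    enum Objective { AUTO, MULPE, MAE } objective = AUTO;
    float max_ulp_error = 0.0f;       // constraint used with MULPE
    float max_absolute_error = 0.0f;  // constraint used with MAE
};

// Default-initialized: "don't care, as long as it's reasonable and
// fast" -> best chance of lowering to a hardware instruction.
ApproximationPrecision dont_care;

// Request a max-ULP-error objective with an explicit bound.
ApproximationPrecision tight{ApproximationPrecision::MULPE, 64.0f};
```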
Review guide
- Precision arguments are packed into a `Call::make_struct` node with 4 parameters (see API below). This approximation-precision Call node survives until the lowering pass where the transcendentals are lowered; in that pass, the parameters are extracted again from the Call node's arguments. I conceptually like that, this way, they are bundled and clearly not at the same level as the actual mathematical arguments. Is this a good approach? In order for this to work, I had to stop `CSE` from extracting those precision arguments, and `StrictifyFloat` from recursing down into that struct and littering `strict_float` on those numbers. I have seen the `Call::bundle` intrinsic; perhaps that one is better for this purpose? @abadams
- `Float(16)` and `Float(64)` are not yet implemented/tested. The polynomial approximations should work correctly (although untested) for these other data types.
- `native_tan()` compiles to the same three instructions as I implemented on CUDA: `sin.approx.f32`, `cos.approx.f32`, `div.approx.f32` (see the sketch after this list). I haven't investigated AMD's documentation on available hardware instructions.
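For the CUDA point above, a minimal sketch (mine, not this PR's code) of a tan built from the approximate hardware instructions; `__sinf`, `__cosf`, and `__fdividef` are the CUDA intrinsics that compile to `sin.approx.f32`, `cos.approx.f32`, and `div.approx.f32`:

```cuda
// Illustrative only: tan(x) as approximate sin/cos followed by an
// approximate divide, matching the three instructions named above.
__device__ float native_tan_sketch(float x) {
    return __fdividef(__sinf(x), __cosf(x));
}
```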
Concerns
- `exp()` (regular `exp()`, not `fast_exp()`) claims to be bit-exact, which is proven wrong by the pre-existing test `correctness/vector_math.cpp`.
- `tan_f32(vec2<f32>)` did not get converted to `tan_f32(first element), tan_f32(second element)`.

API
Fixes #8243.