Improve speed of fsum by TylerSagendorf · Pull Request #828 · fastverse/collapse

TylerSagendorf · 2026-05-06T03:35:48Z

Description

Used multiple accumulators in fsum to increase instruction throughput. On the x86_64 benchmark below, this gave a roughly 2x speedup (1.88x to 1.97x) for double inputs, both unweighted and weighted with na.rm = FALSE.

Closes #824. Similar to #826 and #827.

Main Changes

Improved speed of fsum by using multiple accumulators
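
For readers unfamiliar with the technique, here is a minimal sketch of a multi-accumulator sum. It is not the code from this PR: the function names `sum_single` and `sum_multi` and the choice of four accumulators are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: a single accumulator. Every addition depends on the result
   of the previous one, so the loop is limited by add latency. */
double sum_single(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) s += x[i];
    return s;
}

/* Hypothetical multi-accumulator version: the four additions in one
   iteration are independent of each other, so the CPU can overlap them. */
double sum_multi(const double *x, size_t n) {
    assert(x != NULL || n == 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    /* Main loop: four elements per iteration, one per accumulator. */
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    /* Combine the partial sums, then add the 0 to 3 leftover elements. */
    double s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i) s += x[i];
    return s;
}
```

Because floating-point addition is not associative, `sum_multi` can differ from `sum_single` in the last few bits on general input; the two agree exactly only when every intermediate sum is exactly representable.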

Checklist

I have performed a self-review of my code.
I have commented on my code, particularly in hard-to-understand areas.
I have updated the documentation where applicable.

Additional Context

I tried to modify the integer-based sums and the grouped sums, but their loop bodies were too complex to benefit from this technique, so I reverted my changes. Similarly, there was no improvement in the speed of the weighted sums when na.rm = TRUE.
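
To illustrate the difference between the two weighted cases, here is a rough sketch. It is not taken from collapse: `wsum_multi`, `wsum_narm`, and the use of two accumulators are hypothetical. Without NA handling the loop body is a single multiply-add per element, which splits cleanly across accumulators. An NA-removing loop has to test both `x` and `w` for every element before accumulating.

```c
#include <math.h>
#include <stddef.h>

/* Weighted sum without NA handling, using two independent accumulators. */
double wsum_multi(const double *x, const double *w, size_t n) {
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        s0 += x[i] * w[i];
        s1 += x[i + 1] * w[i + 1];
    }
    if (i < n) s0 += x[i] * w[i];  /* leftover element when n is odd */
    return s0 + s1;
}

/* Hypothetical NA-removing weighted sum: an element is skipped when
   either the value or its weight is NaN, which adds a data-dependent
   branch to every iteration. */
double wsum_narm(const double *x, const double *w, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (isnan(x[i]) || isnan(w[i])) continue;
        s += x[i] * w[i];
    }
    return s;
}
```

The sketch only shows the shape of the two loop bodies. It makes no claim about why the na.rm = TRUE weighted case did not speed up in practice.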

Benchmarks were performed on an AMD Ryzen 5 7600X CPU (x86_64).

library(collapse)

n <- 1e6L

set.seed(0L)
x <- rnorm(n)
x_int <- sample.int(1e3L, size = n, replace = TRUE)
w <- runif(n)

bench::mark(
  # Doubles
  sum(x, na.rm = TRUE),
  sum(x, na.rm = FALSE),
  fsum(x),
  fsum(x, na.rm = FALSE),
  fsum(x, nthreads = 4),
  fsum(x, na.rm = FALSE, nthreads = 4),
  # Doubles, weighted
  fsum(x, w = w),
  fsum(x, w = w, na.rm = FALSE),
  fsum(x, w = w, nthreads = 4),
  fsum(x, w = w, na.rm = FALSE, nthreads = 4),
  # Integers
  sum(x_int, na.rm = TRUE),
  sum(x_int, na.rm = FALSE),
  fsum(x_int),
  fsum(x_int, na.rm = FALSE),
  fsum(x_int, nthreads = 4),
  fsum(x_int, na.rm = FALSE, nthreads = 4),
  iterations = 1e3L,
  check = FALSE
)

Only iterations/second are shown, for comparison purposes. A dash in the After column marks cases that this PR does not change:

                                                Before  After Improvement
 1 sum(x, na.rm = TRUE)                            738.     -
 2 sum(x, na.rm = FALSE)                           700.     -
 3 fsum(x)                                        3070.  5942.      1.94x
 4 fsum(x, na.rm = FALSE)                         3074.  6070.      1.97x
 5 fsum(x, nthreads = 4)                         11570. 21874.      1.89x
 6 fsum(x, na.rm = FALSE, nthreads = 4)          11764. 22379.      1.90x
 7 fsum(x, w = w)                                 1543.     -
 8 fsum(x, w = w, na.rm = FALSE)                  3046.  5883.      1.93x
 9 fsum(x, w = w, nthreads = 4)                   5995.     -
10 fsum(x, w = w, na.rm = FALSE, nthreads = 4)   11695. 22041.      1.88x
11 sum(x_int, na.rm = TRUE)                       2331.     -
12 sum(x_int, na.rm = FALSE)                      2333.     -
13 fsum(x_int)                                    2318.     -
14 fsum(x_int, na.rm = FALSE)                     2325.     -
15 fsum(x_int, nthreads = 4)                     18980.     -
16 fsum(x_int, na.rm = FALSE, nthreads = 4)      32257.     -

TylerSagendorf · 2026-05-08T04:30:09Z

Tested on a MacBook with an M4 Max chip (arm64). The use of multiple accumulators in fsum increased performance (iterations/second) by a factor of 1.77 to 3.84.

                                            Original   New  Improvement
1 fsum(x)                                      2008.  4930.  2.46x
2 fsum(x, na.rm = FALSE)                       2104.  8077.  3.84x
3 fsum(x, nthreads = 4L)                       2006.  4856.  2.42x
4 fsum(x, na.rm = FALSE, nthreads = 4L)        2111.  7856.  3.72x
5 fsum(x, w = w, na.rm = FALSE)                2031.  3606.  1.78x
6 fsum(x, w = w, na.rm = FALSE, nthreads = 4L) 2028.  3585.  1.77x

However, after enabling OpenMP (https://mac.r-project.org/openmp/) by updating the global Makevars to include the flags below, the new code was roughly 13-39% slower (0.61x to 0.87x) than the original fsum code built with the same flags.

CPPFLAGS += -Xclang -fopenmp
LDFLAGS += -lomp -fexperimental-library

                                             Original    New  Improvement
1 fsum(x)                                      10458.   6422.  0.61x
2 fsum(x, na.rm = FALSE)                       13098.  10775.  0.82x
3 fsum(x, nthreads = 4L)                       23276.  16448.  0.71x
4 fsum(x, na.rm = FALSE, nthreads = 4L)        26826.  23375.  0.87x
5 fsum(x, w = w, na.rm = FALSE)                 7513.   5399.  0.72x
6 fsum(x, w = w, na.rm = FALSE, nthreads = 4L) 18392.  15335.  0.83x
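
The numbers above do not by themselves explain the reversal. As background only: OpenMP provides a `simd` directive with a `reduction` clause, which lets the compiler vectorize a floating-point sum by keeping several partial sums of its own. Whether the fsum loops use this directive is not stated in this PR, so the sketch below (with the hypothetical name `sum_simd`) illustrates the OpenMP feature, not the package's code.

```c
#include <stddef.h>

/* Single-accumulator loop annotated for OpenMP SIMD reduction.
   With OpenMP enabled, the reduction clause allows the compiler to
   reorder the additions and vectorize the loop. Without OpenMP the
   pragma is ignored and this is a plain sequential sum. */
double sum_simd(const double *x, size_t n) {
    double s = 0.0;
    #pragma omp simd reduction(+:s)
    for (size_t i = 0; i < n; ++i) s += x[i];
    return s;
}
```

If the compiler already splits such a loop into vector lanes, manual unrolling on top of it may not help. That is an untested conjecture here, not a finding of this PR.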

Speed up fsum with multiple accumulators.

064248c

TylerSagendorf wants to merge 1 commit into fastverse:development from TylerSagendorf:development
