Skip to content

Improve speed of fsum#828

Open
TylerSagendorf wants to merge 1 commit intofastverse:developmentfrom
TylerSagendorf:development
Open

Improve speed of fsum#828
TylerSagendorf wants to merge 1 commit intofastverse:developmentfrom
TylerSagendorf:development

Conversation

@TylerSagendorf
Copy link
Copy Markdown

Description

Utilized multiple accumulators in fsum to enhance instruction throughput. This led to a 2x performance improvement in certain cases.

Closes #824. Similar to #826 and #827.

Main Changes

  • Improved speed of fsum by using multiple accumulators

Checklist

  • I have performed a self-review of my code.
  • I have commented on my code, particularly in hard-to-understand areas.
  • I have updated the documentation where applicable.

Additional Context

I tried to modify the integer-based sums and the grouped sums, but their loop bodies were too complex to benefit from this technique, so I reverted my changes. Similarly, there was no improvement in the speed of the weighted sums when na.rm = TRUE.

Benchmarks were performed on an AMD Ryzen 5 7600X CPU (x86_64).

library(collapse)

n <- 1e6L

set.seed(0L)
x <- rnorm(n)
x_int <- sample.int(1e3L, size = n, replace = TRUE)
w <- runif(n)

bench::mark(
  # Doubles
  sum(x, na.rm = TRUE),
  sum(x, na.rm = FALSE),
  fsum(x),
  fsum(x, na.rm = FALSE),
  fsum(x, nthreads = 4),
  fsum(x, na.rm = FALSE, nthreads = 4),
  # Doubles, weighted
  fsum(x, w = w),
  fsum(x, w = w, na.rm = FALSE),
  fsum(x, w = w, nthreads = 4),
  fsum(x, w = w, na.rm = FALSE, nthreads = 4),
  # Integers
  sum(x_int, na.rm = TRUE),
  sum(x_int, na.rm = FALSE),
  fsum(x_int),
  fsum(x_int, na.rm = FALSE),
  fsum(x_int, nthreads = 4),
  fsum(x_int, na.rm = FALSE, nthreads = 4),
  iterations = 1e3L,
  check = FALSE
)

Only showing iterations/second for comparison purposes:

                                                Before  After Improvement
 1 sum(x, na.rm = TRUE)                            738.     -
 2 sum(x, na.rm = FALSE)                           700.     -
 3 fsum(x)                                        3070.  5942.      1.94x
 4 fsum(x, na.rm = FALSE)                         3074.  6070.      1.97x
 5 fsum(x, nthreads = 4)                         11570. 21874.      1.89x
 6 fsum(x, na.rm = FALSE, nthreads = 4)          11764. 22379.      1.90x
 7 fsum(x, w = w)                                 1543.     -
 8 fsum(x, w = w, na.rm = FALSE)                  3046.  5883.      1.93x
 9 fsum(x, w = w, nthreads = 4)                   5995.     -
10 fsum(x, w = w, na.rm = FALSE, nthreads = 4)   11695. 22041.      1.88x
11 sum(x_int, na.rm = TRUE)                       2331.     -
12 sum(x_int, na.rm = FALSE)                      2333.     -
13 fsum(x_int)                                    2318.     -
14 fsum(x_int, na.rm = FALSE)                     2325.     -
15 fsum(x_int, nthreads = 4)                     18980.     -
16 fsum(x_int, na.rm = FALSE, nthreads = 4)      32257.     -

@TylerSagendorf
Copy link
Copy Markdown
Author

Tested on a Macbook with an M4 Max chip (arm64). The use of multiple accumulators in fsum increased performance (iterations/second) by a factor of 1.77 to 3.84.

                                            Original   New
1 fsum(x)                                      2008.  4930.  2.46x
2 fsum(x, na.rm = FALSE)                       2104.  8077.  3.84x
3 fsum(x, nthreads = 4L)                       2006.  4856.  2.42x
4 fsum(x, na.rm = FALSE, nthreads = 4L)        2111.  7856.  3.72x
5 fsum(x, w = w, na.rm = FALSE)                2031.  3606.  1.78x
6 fsum(x, w = w, na.rm = FALSE, nthreads = 4L) 2028.  3585.  1.77x

However, after enabling OpenMP (https://mac.r-project.org/openmp/) by updating the global Makevars to include the flags below, performance decreased by ~20-40% relative to the original fsum code.

CPPFLAGS += -Xclang -fopenmp
LDFLAGS += -lomp -fexperimental-library
                                             Original    New
1 fsum(x)                                      10458.   6422.  0.61x
2 fsum(x, na.rm = FALSE)                       13098.  10775.  0.82x
3 fsum(x, nthreads = 4L)                       23276.  16448.  0.71x
4 fsum(x, na.rm = FALSE, nthreads = 4L)        26826.  23375.  0.87x
5 fsum(x, w = w, na.rm = FALSE)                 7513.   5399.  0.72x
6 fsum(x, w = w, na.rm = FALSE, nthreads = 4L) 18392.  15335.  0.83x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant