In SIMD terminology, each vector has a certain “width” (number of lanes). A vector processor is able to perform two kinds of operations on a vector: vertical operations and horizontal operations.
Example: vertical addition of two `f32x4` vectors

```
%0      == | 2 | -3.5 |  0 | 7 |
             +      +    +   +
%1      == | 4 |  1.5 | -1 | 0 |
             =      =    =   =
%0 + %1 == | 6 | -2   | -1 | 7 |
```
Example: horizontal addition of two `i64x2` vectors

```
  %0 == | 1 | 3 |
          └─+───┘
            └───────┐
                    │
  %1 == | 4 | -1 |  │
          └─+──┘    │
            └───┐   │
                │   │
          ┌─────│───┘
          ▼     ▼
  %0 + %1 == | 4 | 3 |
```
The result of vertical operations, like vector negation `-a`, for a given lane does not depend on the result of the operation for the other lanes. The result of horizontal operations, like the vector sum reduction `a.sum()`, depends on the values of all vector lanes.
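As a small illustration, here is a sketch using the `packed_simd` crate (the same crate whose API the functions below use), contrasting the two kinds of operations on the values from the first diagram:

```rust
use packed_simd::f32x4;

fn main() {
    let a = f32x4::new(2., -3.5, 0., 7.);

    // Vertical: each lane of the result depends only on the same lane of `a`.
    let neg = -a;
    assert_eq!(neg, f32x4::new(-2., 3.5, 0., -7.));

    // Horizontal: the single scalar result depends on every lane of `a`.
    let total = a.sum();
    assert_eq!(total, 5.5); // 2 + (-3.5) + 0 + 7
}
```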
On virtually all architectures vertical operations are fast, while horizontal operations are, by comparison, very slow.
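The gap has a structural cause: a vertical operation maps to a single instruction with no cross-lane data flow, while a horizontal reduction over N lanes typically lowers to a chain of log2(N) dependent shuffle-and-add steps. A scalar sketch of that lowering for four lanes (illustrative only, not what any particular backend literally emits):

```rust
// Conceptual lowering of a 4-lane horizontal sum: log2(4) == 2 rounds,
// and each round must wait for the previous one to finish.
fn horizontal_sum_4(lanes: [f32; 4]) -> f32 {
    // Round 1: "shuffle" the upper half against the lower half, then add.
    let partial = [lanes[0] + lanes[2], lanes[1] + lanes[3]];
    // Round 2: one final add combines the two partial sums.
    partial[0] + partial[1]
}
```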
Consider the following two functions for computing the sum of all `f32` values in a slice:
```rust
use packed_simd::f32x4;

fn fast_sum(x: &[f32]) -> f32 {
    assert!(x.len() % 4 == 0);
    let mut sum = f32x4::splat(0.); // [0., 0., 0., 0.]
    for i in (0..x.len()).step_by(4) {
        // Vertical addition: accumulate four partial sums in parallel.
        sum += f32x4::from_slice_unaligned(&x[i..]);
    }
    // A single horizontal reduction at the very end.
    sum.sum()
}

fn slow_sum(x: &[f32]) -> f32 {
    assert!(x.len() % 4 == 0);
    let mut sum: f32 = 0.;
    for i in (0..x.len()).step_by(4) {
        // A horizontal reduction on every iteration of the loop.
        sum += f32x4::from_slice_unaligned(&x[i..]).sum();
    }
    sum
}
```
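Both functions compute the same mathematical sum. A quick sanity check (a sketch, with an input chosen so that both summation orders are exact):

```rust
fn main() {
    // 0 + 1 + ... + 1023 == 523776. Every intermediate sum is an integer
    // below 2^24, so both accumulation orders are exact here; for general
    // inputs the two orders may differ in the last few bits.
    let x: Vec<f32> = (0..1024).map(|i| i as f32).collect();
    assert_eq!(fast_sum(&x), 523776.0);
    assert_eq!(slow_sum(&x), 523776.0);
}
```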
The inner loop over the slice is where the bulk of the work actually happens. There, the `fast_sum` function performs only vertical operations, accumulating into a vector and doing a single horizontal reduction at the end, while the `slow_sum` function performs a horizontal vector operation on every iteration of the loop.
On all widely-used architectures, `fast_sum` is a large constant factor faster than `slow_sum`. You can run the `slice_sum` example and see for yourself. On the particular machine tested there, the algorithm using horizontal vector addition is 2.7x slower than the one using vertical vector operations!
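To reproduce the comparison on your own machine, a rough harness like the following can stand in for the example. This is a sketch only: it reuses `fast_sum` and `slow_sum` from above, and serious measurements should use a proper benchmark harness such as `criterion`; the exact ratio will vary by CPU.

```rust
use std::time::Instant;

fn main() {
    // A large input so the loop dominates the measurement.
    // Build with --release, or the comparison is meaningless.
    let x: Vec<f32> = (0..16_777_216).map(|i| (i % 9) as f32).collect();

    let t = Instant::now();
    let fast = fast_sum(&x);
    let fast_time = t.elapsed();

    let t = Instant::now();
    let slow = slow_sum(&x);
    let slow_time = t.elapsed();

    println!("fast_sum = {fast} ({fast_time:?})");
    println!("slow_sum = {slow} ({slow_time:?})");
}
```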