“Know how to solve every problem that has been solved.” “What I cannot create, I do not understand.” — Richard Feynman

Vectorization Experiments

What You'll Learn

How to measure vectorization speedup
Scalar vs auto-vectorized performance
Why speedup is not always 4x/8x
Load/store vs compute bottlenecks

Experiment: Dot Product

#include <chrono>
#include <iostream>
#include <vector>

using Clock = std::chrono::high_resolution_clock;
using Duration = std::chrono::nanoseconds;

// Scalar version
float dot_product_scalar(const std::vector<float>& a, const std::vector<float>& b) {
    float sum = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Compiler will auto-vectorize this (with -O3 -march=native)
float dot_product_vectorized(const std::vector<float>& a, const std::vector<float>& b) {
    float sum = 0.0f;
    #pragma GCC ivdep  // Hint: no dependencies
    for (size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

double benchmark(std::function<float()> func, int iterations = 100) {
    volatile float sink = 0;
    auto start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        sink = func();
    }
    auto end = Clock::now();
    auto elapsed = std::chrono::duration_cast<Duration>(end - start);
    return static_cast<double>(elapsed.count()) / iterations;
}

int main() {
    const size_t N = 1000000;
    std::vector<float> a(N, 1.0f);
    std::vector<float> b(N, 2.0f);
    
    double scalar_time = benchmark([&]() { return dot_product_scalar(a, b); });
    double vectorized_time = benchmark([&]() { return dot_product_vectorized(a, b); });
    
    std::cout << "Scalar: " << scalar_time << " ns\n";
    std::cout << "Vectorized: " << vectorized_time << " ns\n";
    std::cout << "Speedup: " << scalar_time / vectorized_time << "x\n";
    
    return 0;
}

Compile: g++ -O3 -march=native -o vectorization vectorization.cpp

What to Measure

Speedup: Ratio of scalar to vectorized time
Absolute times: ns per operation

Expected Shape of Results

Typical results:

Scalar: Slower, one element per instruction
Vectorized: Faster, multiple elements per instruction
Speedup: Often 2-4x, but not always 8x (limited by memory bandwidth)

Interpretation

Load/store vs compute: If operations are memory-bound, vectorization helps less. If compute-bound, vectorization helps more.

Why speedup is not always 4x/8x: Memory bandwidth can be the bottleneck. Processing 8 elements in parallel doesn't help if you can't load/store them fast enough.

Checklist

✓ Measured scalar vs vectorized performance
✓ Understood why speedup varies
✓ Checked assembly to verify vectorization