“Know how to solve every problem that has been solved.” “What I cannot create, I do not understand.” — Richard Feynman

Capstone Lab: High-Performance Memcpy

What You'll Learn

Apply all concepts from this section
Optimize a real kernel through measurement
Understand what changed and why it helped
Report optimization iterations

Task

Implement and optimize a high-performance memory copy function. Start with a baseline, then apply optimizations based on measurements.

Baseline Implementation

#include <chrono>
#include <cstddef>
#include <cstring>
#include <iostream>
#include <vector>

// steady_clock is monotonic, so intervals are not disturbed by system clock adjustments.
using Clock = std::chrono::steady_clock;
using Duration = std::chrono::nanoseconds;

void memcpy_baseline(void* dst, const void* src, size_t size) {
    const char* s = static_cast<const char*>(src);
    char* d = static_cast<char*>(dst);
    for (size_t i = 0; i < size; ++i) {
        d[i] = s[i];
    }
}

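// Returns the copy bandwidth of func in GB/s (1 GB = 1e9 bytes).
double benchmark_memcpy(void (*func)(void*, const void*, size_t), size_t size) {
    std::vector<char> src(size, 1);
    std::vector<char> dst(size, 0);

    // Warm-up run so one-time costs (page faults, cold caches) are not timed.
    func(dst.data(), src.data(), size);

    const int iterations = 100;
    auto start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        func(dst.data(), src.data(), size);
    }
    auto end = Clock::now();

    // Read the destination so the compiler cannot discard the copies as dead stores.
    volatile char sink = dst.empty() ? 0 : dst[size / 2];
    (void)sink;

    auto elapsed = std::chrono::duration_cast<Duration>(end - start);
    double seconds = elapsed.count() / 1e9;
    // Compute in double so size * iterations cannot overflow an integer type.
    double gb_per_sec = (static_cast<double>(size) * iterations / 1e9) / seconds;

    return gb_per_sec;
}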

int main() {
    size_t size = 100 * 1024 * 1024;  // 100 MiB (about 105 MB)
    double baseline_bw = benchmark_memcpy(memcpy_baseline, size);
    std::cout << "Baseline: " << baseline_bw << " GB/s\n";
    return 0;
}

Optimization Iterations

Iteration 1: Use std::memcpy

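Replace the byte-by-byte loop with std::memcpy. Standard library implementations are typically hand-tuned: they copy with wide vector loads and stores and choose different code paths depending on the size. An optimizing compiler may already turn the baseline loop into a memcpy call, so check the generated assembly before attributing a speedup to this change.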
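std::memcpy returns void*, so it cannot be passed directly to benchmark_memcpy, which expects a function returning void. A thin wrapper is one way around that. The name memcpy_std is an assumption made for this sketch, not part of the lab's starter code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical wrapper: adapts std::memcpy to the
// void (*)(void*, const void*, size_t) signature used by benchmark_memcpy.
void memcpy_std(void* dst, const void* src, std::size_t size) {
    if (size == 0) {
        return;  // nothing to copy; also avoids passing null pointers to memcpy
    }
    // std::memcpy requires valid, non-overlapping buffers.
    assert(dst != nullptr && src != nullptr);
    std::memcpy(dst, src, size);
}
```

With the baseline harness, this would be measured as benchmark_memcpy(memcpy_std, size).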

Iteration 2: Alignment

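Copy the leading bytes one at a time until the destination pointer is aligned (for example, to 8 or 16 bytes), copy the bulk with aligned word or vector accesses, then finish with any trailing bytes. An aligned access never straddles a cache-line or page boundary, which is where unaligned accesses pay most of their penalty on modern CPUs, and some instructions require alignment outright.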
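The head/body/tail structure can be sketched as follows. This is a minimal illustration that aligns only the destination to 8 bytes; the name memcpy_aligned and the choice of std::uint64_t as the word type are assumptions, and a tuned version would likely use wider vector accesses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch: align the destination, then copy 8 bytes at a time.
void memcpy_aligned(void* dst, const void* src, std::size_t size) {
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    constexpr std::size_t kWord = sizeof(std::uint64_t);

    // Head: single bytes until the destination is 8-byte aligned.
    while (size > 0 && reinterpret_cast<std::uintptr_t>(d) % kWord != 0) {
        *d++ = *s++;
        --size;
    }
    assert(size == 0 || reinterpret_cast<std::uintptr_t>(d) % kWord == 0);

    // Body: one word per step. The fixed-size std::memcpy calls avoid
    // strict-aliasing problems; the source may still be unaligned.
    while (size >= kWord) {
        std::uint64_t word;
        std::memcpy(&word, s, kWord);
        std::memcpy(d, &word, kWord);
        d += kWord;
        s += kWord;
        size -= kWord;
    }

    // Tail: remaining bytes (fewer than one word).
    while (size > 0) {
        *d++ = *s++;
        --size;
    }
}
```

Only the destination is aligned here because source and destination can have different offsets, so both cannot always be aligned at once.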

Iteration 3: Loop Unrolling

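Unroll the copy loop so that each iteration moves several words. This amortizes the loop counter and branch over more bytes and exposes independent loads and stores, giving the CPU more instruction-level parallelism (ILP) to exploit.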
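One possible shape for a manually unrolled loop is shown below: four 8-byte words (32 bytes) per iteration, with the loads grouped ahead of the stores so they are visibly independent. The name memcpy_unrolled and the unroll factor of 4 are assumptions for illustration; the best factor is something to determine by measurement.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch: copy 4 x 8 bytes per loop iteration.
void memcpy_unrolled(void* dst, const void* src, std::size_t size) {
    assert(size == 0 || (dst != nullptr && src != nullptr));
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    constexpr std::size_t kWord = sizeof(std::uint64_t);
    constexpr std::size_t kBlock = 4 * kWord;  // bytes moved per iteration

    while (size >= kBlock) {
        std::uint64_t w0, w1, w2, w3;
        // Four independent loads...
        std::memcpy(&w0, s + 0 * kWord, kWord);
        std::memcpy(&w1, s + 1 * kWord, kWord);
        std::memcpy(&w2, s + 2 * kWord, kWord);
        std::memcpy(&w3, s + 3 * kWord, kWord);
        // ...followed by four independent stores.
        std::memcpy(d + 0 * kWord, &w0, kWord);
        std::memcpy(d + 1 * kWord, &w1, kWord);
        std::memcpy(d + 2 * kWord, &w2, kWord);
        std::memcpy(d + 3 * kWord, &w3, kWord);
        d += kBlock;
        s += kBlock;
        size -= kBlock;
    }

    // Remainder: fewer than 32 bytes, copied one byte at a time.
    while (size > 0) {
        *d++ = *s++;
        --size;
    }
}
```

Compilers often unroll and vectorize simple loops on their own at higher optimization levels, so comparing the generated assembly with and without manual unrolling is part of the measurement.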

Iteration 4: Non-Temporal Stores (Optional)

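Use non-temporal stores (e.g., _mm_stream_si128) for large copies. These stores go through write-combining buffers instead of allocating destination lines in the cache, which avoids evicting useful data and skips the read-for-ownership that ordinary stores incur. _mm_stream_si128 requires a 16-byte-aligned destination, and because non-temporal stores are weakly ordered, issue _mm_sfence after the copy before other threads read the data. Expect a benefit only when the copy is much larger than the last-level cache.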
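A minimal sketch of a streaming copy with SSE2 intrinsics follows. The name memcpy_stream and the HAVE_SSE2_STREAM macro are assumptions for this example; on targets without SSE2 the sketch simply falls back to std::memcpy so it still compiles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // _mm_loadu_si128, _mm_stream_si128, _mm_sfence
#define HAVE_SSE2_STREAM 1
#endif

// Hypothetical sketch: copy using non-temporal 16-byte stores.
void memcpy_stream(void* dst, const void* src, std::size_t size) {
#ifdef HAVE_SSE2_STREAM
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);

    // Head: bytes until the destination is 16-byte aligned,
    // as _mm_stream_si128 requires.
    while (size > 0 && reinterpret_cast<std::uintptr_t>(d) % 16 != 0) {
        *d++ = *s++;
        --size;
    }

    // Body: unaligned load from the source, non-temporal store to the destination.
    while (size >= 16) {
        assert(reinterpret_cast<std::uintptr_t>(d) % 16 == 0);
        const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s));
        _mm_stream_si128(reinterpret_cast<__m128i*>(d), v);
        d += 16;
        s += 16;
        size -= 16;
    }
    // Order the streaming stores before any later stores.
    _mm_sfence();

    // Tail: remaining bytes (fewer than 16).
    while (size > 0) {
        *d++ = *s++;
        --size;
    }
#else
    std::memcpy(dst, src, size);  // fallback on targets without SSE2
#endif
}
```

Because the destination bypasses the cache, a benchmark that copies a small buffer repeatedly may show this variant as slower than std::memcpy; it is worth measuring at several sizes.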

Report Requirements

For each iteration, report:

What changed: Code modifications
Why it helped: CPU mechanism (cache, ILP, alignment, etc.)
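How to measure: Metrics (e.g., GB/s) and methodology (buffer size, iteration count, compiler and flags)
Results: Bandwidth in GB/s and improvement relative to the baseline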
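To keep reported numbers comparable across iterations, one option is to run each variant several times, take the median bandwidth, and print it alongside the speedup over the baseline. The helpers below are a sketch under that assumption; median_of and format_row are hypothetical names, not part of the lab's starter code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Median of repeated measurements; less sensitive to outliers than the mean.
double median_of(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    if (samples.size() % 2 == 1) {
        return samples[mid];
    }
    return (samples[mid - 1] + samples[mid]) / 2.0;
}

// One report line: bandwidth plus speedup relative to the baseline.
std::string format_row(const std::string& name, double gb_per_sec,
                       double baseline_gb_per_sec) {
    char buf[160];
    std::snprintf(buf, sizeof buf, "%s: %.2f GB/s (%.2fx baseline)",
                  name.c_str(), gb_per_sec, gb_per_sec / baseline_gb_per_sec);
    return std::string(buf);
}
```

For example, the bandwidths from five calls to benchmark_memcpy could be collected in a vector, reduced with median_of, and passed to format_row together with the baseline's median.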

Checklist

✓ Implemented baseline
✓ Applied 3+ optimization iterations
✓ Measured each iteration
✓ Documented what changed and why
✓ Reported results with metrics