Capstone Lab: High-Performance Memcpy
What You'll Learn
- Apply all concepts from this section
- Optimize a real kernel through measurement
- Understand what changed and why it helped
- Report optimization iterations
Task
Implement and optimize a high-performance memory copy function. Start with a baseline, then apply optimizations based on measurements.
Baseline Implementation
#include <chrono>
#include <cstddef>
#include <cstring>
#include <iostream>
#include <vector>
using Clock = std::chrono::high_resolution_clock;
using Duration = std::chrono::nanoseconds;
// Byte-by-byte copy: the reference point for every later iteration.
void memcpy_baseline(void* dst, const void* src, size_t size) {
    const char* s = static_cast<const char*>(src);
    char* d = static_cast<char*>(dst);
    for (size_t i = 0; i < size; ++i) {
        d[i] = s[i];
    }
}

// Copies `size` bytes `iterations` times and returns the bandwidth in GB/s.
double benchmark_memcpy(void (*func)(void*, const void*, size_t), size_t size) {
    std::vector<char> src(size, 1);
    std::vector<char> dst(size, 0);
    const int iterations = 100;

    func(dst.data(), src.data(), size);  // warm-up run, not timed

    auto start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        func(dst.data(), src.data(), size);
    }
    auto end = Clock::now();

    auto elapsed = std::chrono::duration_cast<Duration>(end - start);
    double seconds = elapsed.count() / 1e9;
    // Convert to double first so the byte count cannot overflow size_t.
    double gb_per_sec = (static_cast<double>(size) * iterations / 1e9) / seconds;
    return gb_per_sec;
}

int main() {
    size_t size = 100 * 1024 * 1024; // 100 MB
    double baseline_bw = benchmark_memcpy(memcpy_baseline, size);
    std::cout << "Baseline: " << baseline_bw << " GB/s\n";
    return 0;
}

Optimization Iterations
Iteration 1: Use std::memcpy
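Replace the byte-by-byte loop with a call to std::memcpy. Library implementations are typically hand-tuned: they copy in wide vector registers and pick different code paths depending on the size. An optimizing compiler may already vectorize the baseline loop or turn it into a memcpy call, so record the compiler flags and check the generated assembly before attributing a speedup to this change.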
Iteration 2: Alignment
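Copy single bytes until the destination pointer reaches an aligned address (for example, an 8- or 16-byte boundary), then copy the rest in aligned chunks. On recent x86 CPUs, unaligned loads and stores are usually cheap unless they cross a cache-line or page boundary, so measure whether this step helps. Some instructions, such as the non-temporal store in Iteration 4, require an aligned address.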
Iteration 3: Loop Unrolling
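Unroll the copy loop so that each iteration copies several chunks. This reduces per-iteration loop overhead (counter update, compare, branch) and exposes more independent loads and stores for instruction-level parallelism (ILP).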
Iteration 4: Non-Temporal Stores (Optional)
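Use non-temporal stores (e.g., _mm_stream_si128) to write the destination without pulling it into the cache. This can help when the copy is much larger than the last-level cache and the destination is not read again soon. _mm_stream_si128 requires a 16-byte-aligned destination, and the stores should be followed by a store fence (_mm_sfence) before other code relies on the data being visible.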
Report Requirements
For each iteration, report:
- What changed: Code modifications
- Why it helped: CPU mechanism (cache, ILP, alignment, etc.)
- How to measure: Metrics and methodology (buffer size, number of runs, compiler and flags)
- Results: Bandwidth improvement
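For the results line, two small helpers can make the numbers more robust and easier to compare. This is only a sketch: the names median_bandwidth and speedup are made up here, and taking the median of several runs is one reasonable methodology, not one the lab prescribes.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Median of repeated bandwidth measurements in GB/s.
// The median is less sensitive to occasional slow runs than the mean.
double median_bandwidth(std::vector<double> samples) {
    const std::size_t n = samples.size();
    if (n == 0) {
        return 0.0;  // no measurements taken
    }
    std::sort(samples.begin(), samples.end());
    if (n % 2 == 1) {
        return samples[n / 2];
    }
    return 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}

// Speedup of an optimized iteration relative to the baseline bandwidth.
double speedup(double baseline_gbps, double optimized_gbps) {
    return optimized_gbps / baseline_gbps;
}
```

Each iteration's report could then state the median bandwidth over several benchmark_memcpy runs together with its speedup over the baseline.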
Checklist
- ✓ Implemented baseline
- ✓ Applied 3+ optimization iterations
- ✓ Measured each iteration
- ✓ Documented what changed and why
- ✓ Reported results with metrics