SD-E2 Benchmark and Optimise¶
Goal: measure first, then change one thing that improves performance without breaking correctness.
This mini-project benchmarks two data layouts:
- AoS: Array of Structures
- SoA: Structure of Arrays
The SoA version is intentionally written in a way that prevents good optimisation.
Success criteria¶
bench_layoutbuilds and prints benchmark results- You record baseline vs improved results
- You keep results correct (same numeric output within tolerance)
Where to work¶
cd exercises/SD-E2-benchmark-and-optimize/starter
0. Build the starter¶
The first configure may download Google Benchmark if it is not already installed.
If you are offline and do not have a system install of Google Benchmark available, this project will not configure. In that case, use a prepared course environment (Codespace/devcontainer) or ask the instructor for a fallback.
cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build -j"$(nproc)"
1. Run the baseline benchmark¶
Run a few repetitions and only print aggregates (easier to compare):
./build/bench_layout \
--benchmark_format=console \
--benchmark_repetitions=3 \
--benchmark_report_aggregates_only=true
Write down the SoA vs AoS timings for each problem size.
Expected baseline behaviour:
- The SoA version can be slower than AoS because it uses expensive math (
std::pow) in the hot loop.
2. Identify the hot code¶
Open:
include/ParticleLayouts.hpp
Look at:
sd_e2::ParticlesSoA::sumEnergy()
The inner loop uses std::pow(x, 2). That is much more expensive than x * x and often blocks vectorisation.
3. Optimise one thing: remove std::pow(x, 2)¶
Replace the pow calls with plain multiplication.
Patch:
--- a/include/ParticleLayouts.hpp
+++ b/include/ParticleLayouts.hpp
@@
double sumEnergy() const {
double sum = 0.0;
for (size_t i = 0; i < px.size(); ++i) {
- // Intentionally slow: std::pow is expensive and blocks vectorization.
- const double p2 = std::pow(px[i], 2) + std::pow(py[i], 2) + std::pow(pz[i], 2);
- sum += std::sqrt(p2 + std::pow(mass[i], 2));
+ // Cheaper and compiler-friendly: multiply instead of pow(..., 2)
+ const double px2 = px[i] * px[i];
+ const double py2 = py[i] * py[i];
+ const double pz2 = pz[i] * pz[i];
+ const double m2 = mass[i] * mass[i];
+
+ const double p2 = px2 + py2 + pz2;
+ sum += std::sqrt(p2 + m2);
}
return sum;
}
Rebuild:
cmake --build build -j"$(nproc)"
4. Add a quick correctness check¶
Benchmarks should not be your only correctness signal.
Add a small executable that checks AoS and SoA compute the same sum (within tolerance).
Step 4.1: add a new source file¶
Create src/verify_layout.cpp:
#include "ParticleLayouts.hpp"
#include <cmath>
#include <cstddef>
#include <iostream>
int main() {
constexpr size_t N = 10000;
sd_e2::ParticlesAoS aos;
sd_e2::ParticlesSoA soa;
aos.resize(N);
soa.resize(N);
for (size_t i = 0; i < N; ++i) {
const double px = 0.1 * i;
const double py = 0.2 * i;
const double pz = 0.3 * i;
const double m = 0.511;
aos.particles[i].px = px;
aos.particles[i].py = py;
aos.particles[i].pz = pz;
aos.particles[i].mass = m;
soa.px[i] = px;
soa.py[i] = py;
soa.pz[i] = pz;
soa.mass[i] = m;
}
const double a = aos.sumEnergy();
const double b = soa.sumEnergy();
const double rel = std::abs(a - b) / (std::abs(a) + 1e-12);
std::cout << "AoS sum: " << a << "\n";
std::cout << "SoA sum: " << b << "\n";
std::cout << "Relative diff: " << rel << "\n";
if (rel > 1e-12) {
std::cerr << "Mismatch: optimisation changed the result too much\n";
return 1;
}
return 0;
}
Step 4.2: wire it into CMake¶
Edit CMakeLists.txt:
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@
add_executable(bench_layout
src/bench_layout.cpp
)
target_include_directories(bench_layout PRIVATE include)
target_link_libraries(bench_layout PRIVATE benchmark::benchmark)
+add_executable(verify_layout
+ src/verify_layout.cpp
+)
+target_include_directories(verify_layout PRIVATE include)
Rebuild and run the check:
cmake --build build -j"$(nproc)"
./build/verify_layout
5. Re-run the benchmark and compare¶
./build/bench_layout \
--benchmark_format=console \
--benchmark_repetitions=3 \
--benchmark_report_aggregates_only=true
Compare against the baseline numbers you wrote down. You should see SoA improve significantly.
Stretch¶
- Try
RelWithDebInfoand inspect the generated assembly (-S) to see vectorisation. - Add one more benchmark size (e.g.
->Arg(1000000)) and see if the crossover point changes. - Explore compiler flags like
-march=native(only when comparing on the same machine).