The prior tutorial showed how to determine whether a loop can be shuffled (parallelized) by examining access patterns, computing distance vectors, and applying the first-positive rule. This tutorial puts those ideas into code. We’ll write a mini LLVM pass that takes a loop nest and reports which levels are shuffleable. For pedagogical purposes, we keep the problem constrained and avoid LLVM’s built-in dependence analysis.
Problem Scope
Our pass handles loops matching this template:
for (int i = 0; i < N; i++)       // stride-1
  for (int j = 0; j < M; j++)     // stride-1
    ...
    A[i+c1][j+c2]... = A[i+c3][j+c4]...;  // subscript = IV ± constant
Requirements:
- Perfectly nested — no statements between loop headers
- Stride-1 induction variables — i++, not i += 2
- Linear subscripts — A[i][j-1] is fine; A[i+j], A[2*i], and A[B[i]] are not
- Subscript order matches loop order — A[i][j], not A[j][i]
- Dimensionality matches nest depth — 2D array in a 2-deep nest
- Read-after-write dependencies only
Out of scope: non-unit strides (i += 2), coupled subscripts (A[i+j]), symbolic offsets (A[i-N]), indirect indexing (A[B[i]]).
We use only LLVM’s LoopInfo pass to identify loop structure; everything else is our own pattern matching. No ScalarEvolution, no built-in dependence analysis—just enough IR traversal to extract what we need.
The IR Structure
We’ll use Case 1 from the prior tutorial as our running example:
for (int i = 1; i < N; i++) {
  for (int j = 0; j < M; j++) {
    A[i][j] = A[i-1][j] + 1;
  }
}
Here’s the corresponding control-flow graph:
The high-level loop structure visible in C has been lowered away. To analyze shuffleability, we must recover key information from the IR:
- Induction variables — the PHI nodes that count loop iterations
- Memory accesses — the load/store instructions and their subscripts
This requires pattern matching on the IR’s dataflow graph, where vertices are instructions and edges represent data dependencies.
Pattern 1: Induction Variables
Loop counters in LLVM IR appear as PHI nodes. A PHI selects between values depending on which predecessor block we came from. For a loop counter, it chooses between an initial value (on entry) and an incremented value (on the back-edge). In our example, %i takes the value 1 when entering from %entry, and %i.next when looping back from %for.exit.inner.
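Concretely, the outer counter in the example IR looks roughly like this (a sketch: the block label for.body.outer is hypothetical, but %i, %i.next, %entry, and %for.exit.inner follow the example's naming):

```llvm
for.body.outer:                              ; hypothetical label
  %i = phi i64 [ 1, %entry ], [ %i.next, %for.exit.inner ]
  ...
for.exit.inner:
  %i.next = add nsw i64 %i, 1                ; the back-edge value
```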
To identify stride-1 induction variables:
- Find all PHI nodes with exactly two incoming edges.
- Identify the back-edge—the one originating from within the loop.
- Verify that the back-edge value is an add of the PHI and constant 1. (In LLVM, each instruction doubles as the name of its result, so %i.next refers to both the add instruction and its output.)
| What we see in IR | What we conclude |
|---|---|
| PHI with 2 inputs: init value + back-edge value | Loop counter candidate |
| Back-edge value is add nsw %phi, 1 | Stride-1 induction variable |
Here’s the pattern-matching code:
// Find the back-edge input (the one from inside the loop)
Value *BackVal = nullptr;
for (unsigned i = 0; i < 2; i++)
  if (L->contains(Phi.getIncomingBlock(i)))
    BackVal = Phi.getIncomingValue(i);

// Check: is the stride 1?
auto *Add = dyn_cast_or_null<BinaryOperator>(BackVal);
if (Add && Add->getOpcode() == Instruction::Add) {
  // The constant may sit on either side of the add.
  unsigned ConstIdx = Add->getOperand(0) == &Phi ? 1 : 0;
  auto *C = dyn_cast<ConstantInt>(Add->getOperand(ConstIdx));
  if (C && C->getSExtValue() == 1)
    IVs.push_back(&Phi); // Found a stride-1 IV
}
Pattern 2: Memory Access Parsing
Next, we parse memory accesses. The goal: given a load or store, extract the base array and the access pattern—the induction variable and offset for each dimension.
In LLVM IR, array indexing uses two instructions:
- GEP (GetElementPtr) computes the address from a base pointer and indices
- load/store performs the actual memory access
We trace backward from each load/store to its GEP, then parse each index operand. An N-level loop nest produces a GEP with N+1 operands: the base pointer plus one index per dimension.
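For the load of A[i-1][j], the address computation might look roughly like this (a sketch: the array extent 64, the i32 element type, and the name %val.src are placeholders; %i.m.1 and %addr.src follow the example):

```llvm
%i.m.1    = sub nsw i64 %i, 1
%addr.src = getelementptr [64 x i32], ptr %A, i64 %i.m.1, i64 %j
%val.src  = load i32, ptr %addr.src
```

Note the GEP has three operands for the 2-deep nest: the base pointer %A plus one index per dimension, matching the N+1 rule above.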
Here’s the dataflow for the load in our example:
The tracing proceeds as follows:
- Start at load %addr.src.
- Follow the pointer operand to getelementptr %A, %i.m.1, %j.
- For each index, determine the subscript form:
  - A PHI node directly → offset 0 (e.g., A[i])
  - An add of a PHI and a constant → positive offset (e.g., A[i+1])
  - A sub of a PHI and a constant → negative offset (e.g., A[j-2])
| GEP index operand | Parsed subscript |
|---|---|
| %j (PHI directly) | {j, 0} |
| %i.m.1 → sub nsw %i, 1 | {i, -1} |
| add nsw %iv, C | {iv, +C} |
For our example:
- Load: GEP(%A, %i.m.1, %j) → {A, load, [{i, -1}, {j, 0}]}
- Store: GEP(%A, %i, %j) → {A, store, [{i, 0}, {j, 0}]}
Here’s the implementation:
// parseMemoryOp: trace load/store → GEP → subscripts
auto *GEP = dyn_cast<GetElementPtrInst>(Ptr);
if (!GEP) return std::nullopt; // not a direct array access
for (unsigned d = 0; d < GEP->getNumIndices(); d++) {
  auto Sub = parseSubscript(GEP->getOperand(1 + d), IVs);
  if (!Sub) return std::nullopt;
  Subs.push_back(*Sub);
}
// parseSubscript
// Case 1: operand is a PHI (induction variable with zero offset)
if (auto *Phi = dyn_cast<PHINode>(V))
  return LinearSubscript{Phi, 0};

// Case 2: operand is an ADD or SUB of a PHI and a constant
if (auto *BinOp = dyn_cast<BinaryOperator>(V)) {
  auto *Phi = dyn_cast<PHINode>(BinOp->getOperand(0));
  auto *C = dyn_cast<ConstantInt>(BinOp->getOperand(1));
  if (Phi && C) {
    int64_t Off = C->getSExtValue();
    if (BinOp->getOpcode() == Instruction::Sub) Off = -Off;
    return LinearSubscript{Phi, Off};
  }
}
// Anything else (coupled, scaled, or indirect subscripts) is out of scope.
return std::nullopt;
Computing Dependencies
With both accesses parsed, computing the dependency is straightforward arithmetic—no more pattern matching.
From our example:
- Load: [{i, -1}, {j, 0}]
- Store: [{i, 0}, {j, 0}]
Compute distance[d] = store.offset[d] - load.offset[d]:
- distance[0] = 0 - (-1) = 1
- distance[1] = 0 - 0 = 0
- Distance vector: (1, 0)
Apply the first-positive rule: scan left-to-right; the first non-zero component is at d=0, so the dependency is carried at depth 1. The outer loop cannot be shuffled; the inner loop can.
This is the same rule from the prior tutorial—only the representation changed.
// Compute distance vector
for (unsigned d = 0; d < Store.Subs.size(); d++)
  Dist.push_back(Store.Subs[d].Offset - Load.Subs[d].Offset);

// Apply first-positive rule
unsigned carryingDepth = 0;
for (unsigned d = 0; d < Dist.size(); d++) {
  if (Dist[d] != 0) {
    carryingDepth = d + 1;
    break;
  }
}
if (carryingDepth > 0)
  Result.Blocked[carryingDepth - 1] = true;
Building and Running
The full source is available on GitHub.
We provide seven test cases written as hand-crafted LLVM IR rather than Clang output. This avoids unexpected compiler optimizations and keeps the IR minimal and readable.
To build:
mkdir -p build && cd build
cmake -DLLVM_DIR="$(llvm-config --cmakedir)" ..
cmake --build . --parallel
cd ..
Run on a single test case:
opt -load-pass-plugin=./build/libToyDepAnalysis.so \
    -passes="toy-dep-analysis" \
    -disable-output \
    ./case1-vertical.ll
Run all test cases:
./build.sh test