The prior tutorial showed how to determine whether a loop can be shuffled (parallelized) by examining access patterns, computing distance vectors, and applying the first-positive rule. This tutorial puts those ideas into code. We’ll write a mini LLVM pass that takes a loop nest and reports which levels are shuffleable. For pedagogical purposes, we keep the problem constrained and avoid LLVM’s built-in dependence analysis.


Problem Scope

Our pass handles loops matching this template:

for (int i = 0; i < N; i++)                   // stride-1
  for (int j = 0; j < M; j++)                 // stride-1
    ...
      A[i+c1][j+c2]... = A[i+c3][j+c4]...;    // subscript = IV ± constant

Requirements:

  • Perfectly nested — no statements between loop headers
  • Stride-1 induction variables — i++, not i += 2
  • Linear subscripts — A[i][j-1] is fine; A[i+j], A[2*i], and A[B[i]] are not
  • Subscript order matches loop order — A[i][j], not A[j][i]
  • Dimensionality matches nest depth — 2D array in a 2-deep nest
  • Read-after-write dependencies only

Out of scope: non-unit strides (i += 2), coupled subscripts (A[i+j]), symbolic offsets (A[i-N]), indirect indexing (A[B[i]]).

We use only LLVM’s LoopInfo pass to identify loop structure; everything else is our own pattern matching. No ScalarEvolution, no built-in dependence analysis—just enough IR traversal to extract what we need.


The IR Structure

We’ll use Case 1 from the prior tutorial as our running example:

for (int i = 1; i < N; i++) {
    for (int j = 0; j < M; j++) {
        A[i][j] = A[i-1][j] + 1;
    }
}

Here’s the corresponding control-flow graph:

Control flow graph for Case 1

The high-level loop structure visible in C has been lowered away. To analyze shuffleability, we must recover key information from the IR:

  1. Induction variables — the PHI nodes that count loop iterations
  2. Memory accesses — the load/store instructions and their subscripts

This requires pattern matching on the IR’s dataflow graph, where vertices are instructions and edges represent data dependencies.

Pattern 1: Induction Variables

Loop counters in LLVM IR appear as PHI nodes. A PHI selects between values depending on which predecessor block we came from. For a loop counter, it chooses between an initial value (on entry) and an incremented value (on the back-edge). In our example, %i takes the value 1 when entering from %entry, and %i.next when looping back from %for.exit.inner.

Induction variable detection pattern

To identify stride-1 induction variables:

  1. Find all PHI nodes with exactly two incoming edges.
  2. Identify the back-edge—the one originating from within the loop.
  3. Verify that the back-edge value is an add of the PHI and constant 1. (In LLVM, each instruction doubles as the name of its result, so %i.next refers to both the add instruction and its output.)
What we see in IR                                    What we conclude
PHI with 2 inputs: init value + back-edge value      Loop counter candidate
Back-edge value is add nsw %phi, 1                   Stride-1 induction variable

Here’s the pattern-matching code:

// Find the back-edge input (the one from inside the loop)
Value *BackVal = nullptr;
for (unsigned i = 0; i < 2; i++)
  if (L->contains(Phi.getIncomingBlock(i)))
    BackVal = Phi.getIncomingValue(i);

// Check: is the back-edge value an add of this PHI and constant 1?
auto *Add = dyn_cast_or_null<BinaryOperator>(BackVal);
if (Add && Add->getOpcode() == Instruction::Add) {
  unsigned PhiIdx = Add->getOperand(0) == &Phi ? 0 : 1;
  auto *C = dyn_cast<ConstantInt>(Add->getOperand(1 - PhiIdx));
  if (Add->getOperand(PhiIdx) == &Phi && C && C->getSExtValue() == 1)
    IVs.push_back(&Phi);  // Found a stride-1 IV
}

Pattern 2: Memory Access Parsing

Next, we parse memory accesses. The goal: given a load or store, extract the base array and the access pattern—the induction variable and offset for each dimension.

In LLVM IR, array indexing uses two instructions:

  • GEP (GetElementPtr) computes the address from a base pointer and indices
  • load/store performs the actual memory access

We trace backward from each load/store to its GEP, then parse each index operand. An N-level loop nest produces a GEP with N+1 operands: the base pointer plus one index per dimension.

Here’s the dataflow for the load in our example:

Memory access dataflow

The tracing proceeds as follows:

  1. Start at load %addr.src.
  2. Follow the pointer operand to getelementptr %A, %i.m.1, %j.
  3. For each index, determine the subscript form:
    • A PHI node directly → offset 0 (e.g., A[i])
    • An add of a PHI and a constant → positive offset (e.g., A[i+1])
    • A sub of a PHI and a constant → negative offset (e.g., A[j-2])
GEP index operand           Parsed subscript
%j (PHI directly)           {j, 0}
%i.m.1 = sub nsw %i, 1      {i, -1}
add nsw %iv, C              {iv, +C}

For our example:

  • Load: GEP(%A, %i.m.1, %j){A, load, [{i, -1}, {j, 0}]}
  • Store: GEP(%A, %i, %j){A, store, [{i, 0}, {j, 0}]}

Here’s the implementation:

// parseMemoryOp: trace load/store → GEP → subscripts
auto *GEP = dyn_cast<GetElementPtrInst>(Ptr);
if (!GEP) return std::nullopt;  // pointer must come straight from a GEP
for (unsigned d = 0; d < GEP->getNumIndices(); d++) {
  auto Sub = parseSubscript(GEP->getOperand(1 + d), IVs);
  if (!Sub) return std::nullopt;
  Subs.push_back(*Sub);
}

// parseSubscript

// Case 1: operand is a PHI (induction variable with zero offset)
if (auto *Phi = dyn_cast<PHINode>(V))
  return LinearSubscript{Phi, 0};

// Case 2: operand is an add or sub of a PHI (left) and a constant (right)
if (auto *BinOp = dyn_cast<BinaryOperator>(V)) {
  auto Opc = BinOp->getOpcode();
  if (Opc == Instruction::Add || Opc == Instruction::Sub) {
    auto *Phi = dyn_cast<PHINode>(BinOp->getOperand(0));
    auto *C = dyn_cast<ConstantInt>(BinOp->getOperand(1));
    if (Phi && C) {
      int64_t Off = C->getSExtValue();
      if (Opc == Instruction::Sub) Off = -Off;
      return LinearSubscript{Phi, Off};
    }
  }
}

// Anything else (coupled, scaled, or indirect subscripts) is unsupported
return std::nullopt;

Computing Dependencies

With both accesses parsed, computing the dependency is straightforward arithmetic—no more pattern matching.

From our example:

  • Load: [{i, -1}, {j, 0}]
  • Store: [{i, 0}, {j, 0}]

Compute distance[d] = store.offset[d] - load.offset[d]:

  • distance[0] = 0 - (-1) = 1
  • distance[1] = 0 - 0 = 0
  • Distance vector: (1, 0)

Apply the first-positive rule: scan left-to-right; the first non-zero component is at d=0, so the dependency is carried at depth 1. The outer loop cannot be shuffled; the inner loop can.

This is the same rule from the prior tutorial—only the representation changed.

// Compute distance vector
for (unsigned d = 0; d < Store.Subs.size(); d++)
  Dist.push_back(Store.Subs[d].Offset - Load.Subs[d].Offset);

// Apply first-positive rule
unsigned carryingDepth = 0;
for (unsigned d = 0; d < Dist.size(); d++) {
  if (Dist[d] != 0) {
    carryingDepth = d + 1;
    break;
  }
}
if (carryingDepth > 0)
  Result.Blocked[carryingDepth - 1] = true;

Building and Running

The full source is available on GitHub.

We provide seven test cases written as hand-crafted LLVM IR rather than Clang output. This avoids unexpected compiler optimizations and keeps the IR minimal and readable.

To build:

# mkdir -p build && cd build
# cmake -DLLVM_DIR="$(llvm-config --cmakedir)" ..
# cmake --build . --parallel
# cd ..

Run on a single test case:

# opt -load-pass-plugin=./build/libToyDepAnalysis.so \
#     -passes="toy-dep-analysis" \
#     -disable-output \
#     ./case1-vertical.ll

Run all test cases:

# ./build.sh test