<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://danielep.xyz/feed.xml" rel="self" type="application/atom+xml"/><link href="https://danielep.xyz/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-05-27T20:41:01+00:00</updated><id>https://danielep.xyz/feed.xml</id><title type="html">Daniele Paliotta</title><subtitle></subtitle><entry><title type="html">MiniDynamo: torch.compile in 500 lines of Python</title><link href="https://danielep.xyz/blog/2026/minidynamo-torch-compile/" rel="alternate" type="text/html" title="MiniDynamo: torch.compile in 500 lines of Python"/><published>2026-05-27T15:12:00+00:00</published><updated>2026-05-27T15:12:00+00:00</updated><id>https://danielep.xyz/blog/2026/minidynamo-torch-compile</id><content type="html" xml:base="https://danielep.xyz/blog/2026/minidynamo-torch-compile/"><![CDATA[<div class="l-body"> <p><em>When you run code wrapped in <code class="language-plaintext highlighter-rouge">@torch.compile</code>, PyTorch intercepts Python bytecode execution, captures a graph of tensor operations, hands that graph to an optimizing compiler, and caches the result – all transparently. This article rebuilds the basic pieces of that system from scratch, with a small implementation designed to make the moving parts visible.</em></p> <p>Note: <em>A good part of the code was generated with coding agents. To understand it properly, I then went line by line through the implementation, added tests, wrote small benchmark scripts, and used the resulting toy system to build intuition.</em></p> </div> <h2 id="1-the-big-picture">1. The Big Picture</h2> <p>Deep learning frameworks all face the same tension at some point: Python is easy to write, but slow to execute. When writing modeling code in PyTorch, everything is convenient for the programmer. 
In a <code class="language-plaintext highlighter-rouge">forward()</code> method you can use <code class="language-plaintext highlighter-rouge">if</code> statements, call helper functions, print for debugging, and rely on ordinary Python control flow. In eager mode, though, GPU tensor programs are usually dispatched one operation at a time. Eleven elementwise operations can mean eleven separate kernel launches, eleven memory round-trips, and a CPU-side dispatcher sitting between each operation.</p> <p>This mode of execution is called <strong>eager mode</strong>, and it was probably the main reason PyTorch won over TensorFlow.</p> <p>If you remember the early static-graph TensorFlow days, this trade-off was obvious: eager Python is vastly more productive than building static computation graphs by hand. But it’s bad for performance.</p> <p><code class="language-plaintext highlighter-rouge">torch.compile</code>, introduced in PyTorch 2.0, attempts to resolve this tension. When you wrap a function in <code class="language-plaintext highlighter-rouge">torch.compile()</code> and call it, PyTorch does something remarkable: it captures a graph of your tensor operations (TorchDynamo takes care of this), then hands that graph to an optimizing compiler (Inductor) that can fuse multiple operations into a single kernel. One read from memory, all operations computed in registers, one write. Whenever the function is called again, PyTorch tries to reuse the same optimized graph, as long as the assumptions made during tracing still hold.</p> <p>But to do any of this, you first need to symbolically execute arbitrary Python and PyTorch code, and that’s the hard part.</p> <h3 id="the-graph-capture-problem">The graph capture problem</h3> <p>Getting a graph of tensor operations from a Python function is harder than it sounds. 
PyTorch went through years of earlier approaches, each useful but each with painful trade-offs, before landing on TorchDynamo:</p> <ul> <li> <p><strong><code class="language-plaintext highlighter-rouge">torch.jit.trace</code></strong> (2018): Run the function with real inputs and record which C++ operations fire. Problem: Python control flow is completely invisible to the tracer. An <code class="language-plaintext highlighter-rouge">if</code> statement gets silently baked in as whichever branch happened to execute during tracing. Shape-dependent Python logic is frozen the same way, so a trace can silently misbehave on inputs of a different shape.</p> </li> <li> <p><strong>TorchScript</strong> (<code class="language-plaintext highlighter-rouge">torch.jit.script</code>, 2018): Parse the Python source code into a restricted typed language that can be compiled. Problem: it required users to rewrite their code to fit a restricted Python subset. Most real-world code couldn’t be scripted without significant refactoring. This was ultimately a non-starter.</p> </li> <li> <p><strong>FX Tracing</strong> (<code class="language-plaintext highlighter-rouge">torch.fx.symbolic_trace</code>, 2021): Python-level symbolic tracing with proxy objects. Better, but still breaks on dynamic Python constructs, data-dependent control flow, and code whose behavior depends on concrete Python values.</p> </li> <li> <p><strong>Lazy Tensors</strong> (2021): Operate below the dispatcher, recording operations lazily and flushing them as a batch. Can optimize operations but can’t eliminate Python overhead because Python still runs eagerly.</p> </li> </ul> <p><strong>TorchDynamo</strong> (the component that powers <code class="language-plaintext highlighter-rouge">torch.compile</code>) took a fundamentally different approach: instead of working at the Python level or the C++ dispatcher level, it works at the <strong>bytecode level</strong>. Using PEP 523’s Frame Evaluation API, Dynamo installs a C-level hook that intercepts every Python frame <em>before</em> CPython’s interpreter runs it. 
It then walks the bytecode instructions, symbolically evaluating them to identify tensor operations and record them into an FX graph.</p> <h3 id="what-happens-when-you-call-torchcompile">What happens when you call <code class="language-plaintext highlighter-rouge">torch.compile</code></h3> <p><img src="/assets/img/mini-dynamo/pipeline.svg" alt="The torch.compile pipeline: first call runs every stage; subsequent calls with matching guards skip straight to EXECUTE."/></p> <ol> <li> <p><strong>Trace</strong>: Dynamo intercepts the Python frame via PEP 523 and walks the bytecode. Tensor operations are recorded into an FX graph. Much of the surrounding Python logic is handled outside the graph: some values are evaluated concretely, some assumptions become <em>guards</em>, and unsupported regions can trigger <em>graph breaks</em>. We will see in detail what this means.</p> </li> <li> <p><strong>Compile</strong>: The FX graph is passed to a compiler backend. The default backend is Inductor, which generates fused Triton kernels.</p> </li> <li> <p><strong>Guard</strong>: Dynamo records the assumptions made during tracing — tensor shapes, dtypes, devices, values of Python variables used in control flow — as boolean predicates.</p> </li> <li> <p><strong>Cache</strong>: The compiled function and its guards are stored together. On subsequent calls, if all guards pass, the compiled function is reused without re-tracing.</p> </li> <li> <p><strong>Execute</strong>: If guards pass, run the compiled function. If they fail (e.g., tensor shape changed), re-trace and compile, adding a new cache entry.</p> </li> </ol> <p>The key insight: <strong>tracing happens once, execution happens many times.</strong> The first call is slow (bytecode analysis + compilation), but subsequent calls with matching inputs can skip tracing and reuse the compiled result.</p> <h3 id="what-we-build">What we build</h3> <p>Now to the fun part. We will build this pipeline in ~500 lines of Python. 
Our implementation, <em>mini-dynamo</em>, is a deliberately tiny TorchDynamo-style tracer. It captures the same core ideas while leaving out the machinery needed for arbitrary real-world PyTorch programs.</p> <aside> <p><strong>What we build vs. what we skip.</strong> Real TorchDynamo handles control flow, nested function calls, user-defined classes, graph breaks, dynamic shapes, and hundreds of Python opcodes. Mini-dynamo handles straight-line tensor computations – enough to understand the architecture without drowning in edge cases. We skip PEP 523 and the machinery built on top of it: real Dynamo hooks into CPython frame evaluation and effectively re-implements a large chunk of Python execution logic in Python. We also skip AOTAutograd, so we only trace forward computations.</p> </aside> <hr/> <h2 id="2-architecture-the-five-components">2. Architecture: The Five Components</h2> <p>Before diving into each component, here’s the map. Mini-dynamo has five components, each corresponding to a distinct concern in the TorchDynamo architecture:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>              fn(x, y) + example args
                      │
                      ▼
          ┌───────────────────────┐
          │    compile() decorator │ ← __init__.py (orchestrator)
          │    Manages the cache,  │   Checks guards, dispatches
          │    wires everything    │   to trace/compile/guard
          └───┬─────────┬─────┬───┘
              │         │     │
              ▼         │     ▼
  ┌──────────────────┐  │  ┌─────────────┐
  │ Symbolic          │  │  │   Guards     │ ← guards.py
  │ Interpreter       │  │  │   Shape,     │   Boolean checks on
  │                   │  │  │   dtype,     │   input metadata
  │ Walks bytecodes,  │  │  │   device     │
  │ manipulates       │  │  └─────────────┘
  │ VariableTrackers  │  │
  │ on a stack,       │  │
  │ builds the Graph  │  │
  └────────┬──────────┘  │
           │             │
     ┌─────┘             │
     ▼                   ▼
  ┌──────────┐    ┌──────────────┐
  │  Graph   │───▶│  Compiler    │ ← compiler.py
  │  (IR)    │    │  Backend     │   Graph → Python source
  │          │    │              │   → exec() → callable
  └──────────┘    └──────────────┘
   graph.py         (or → Inductor)
</code></pre></div></div> <p>Here’s what each component does, and what it corresponds to in real TorchDynamo:</p> <table> <thead> <tr> <th style="text-align: left">Mini-dynamo</th> <th style="text-align: left">Real TorchDynamo</th> <th style="text-align: left">Role</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">symbolic_interpreter.py</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">InstructionTranslator</code> (~15,000 lines in <code class="language-plaintext highlighter-rouge">symbolic_convert.py</code>)</td> <td style="text-align: left">Walk bytecodes, build the graph. The heart of the system.</td> </tr> <tr> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">variable_tracker.py</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">VariableTracker</code> hierarchy (~50 subclasses across 30+ files)</td> <td style="text-align: left">Symbolic values on the interpreter’s stack. Tell the interpreter what kind of thing each value is (tensor? constant? torch function?) so it can decide whether to record a graph node or evaluate concretely.</td> </tr> <tr> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">graph.py</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">torch.fx.Graph</code> + <code class="language-plaintext highlighter-rouge">torch.fx.Node</code></td> <td style="text-align: left">The computation graph IR. A flat list of nodes, each describing one operation. This is the output of tracing and the input to compilation.</td> </tr> <tr> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">compiler.py</code></td> <td style="text-align: left">Compiler backends (Inductor, etc.)</td> <td style="text-align: left">Takes a finished graph and produces a callable. 
Our simple backend generates Python source with pre-resolved closures. Real Inductor generates fused Triton/C++ kernels. We will also build a converter from our graph format to FX graphs, so that Inductor can be used as a backend as well.</td> </tr> <tr> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">guards.py</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">torch._dynamo.guards</code> (C-accelerated)</td> <td style="text-align: left">Boolean predicates that encode the assumptions made during tracing. If guards pass on new inputs, the cached compiled function can be reused.</td> </tr> <tr> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">__init__.py</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">torch._dynamo.convert_frame</code></td> <td style="text-align: left">The orchestrator. Manages the guard-cache loop: check guards; on a hit, run the cached function; on a miss, trace → compile → guard → cache.</td> </tr> </tbody> </table> <p>Two components deserve special attention because their roles are easy to confuse:</p> <p><strong>The Graph is the output.</strong> It’s a pure data structure – a list of nodes describing “what tensor operations to perform.” It has no logic, no execution semantics. After tracing is done, it gets handed to the compiler. Think of it as a recipe.</p> <p><strong>VariableTrackers are the process.</strong> They’re the symbolic values that live on the interpreter’s stack <em>during</em> tracing. They tell the interpreter what type of thing each value is, so it can decide what to do with each bytecode instruction. When tracing finishes, they’re thrown away. Think of them as scaffolding – essential for constructing the graph, but not part of the final product.</p> <p>We need both because CPython’s bytecodes are untyped. 
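</p> <p>A minimal sketch of what “untyped” means here, using only the standard-library <code class="language-plaintext highlighter-rouge">dis</code> module. The helper <code class="language-plaintext highlighter-rouge">add</code> is a hypothetical example, not part of mini-dynamo, and the exact opcode names printed depend on the CPython version:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import dis


def add(a, b):
    return a + b


# The function is compiled to bytecode once, before any argument is seen,
# so the instruction emitted for `+` cannot depend on operand types.
opnames = [ins.opname for ins in dis.get_instructions(add)]
print(opnames)  # contains BINARY_ADD (CPython 3.10 and earlier) or BINARY_OP (3.11+)

# The same code object then serves very different operand types.
print(add(1, 2), add("a", "b"), add([1], [2]))
</code></pre></div></div> <p>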
When the interpreter sees <code class="language-plaintext highlighter-rouge">BINARY_ADD</code>, the opcode alone doesn’t say whether it is adding two tensors (→ record <code class="language-plaintext highlighter-rouge">torch.add</code> in the graph) or two integers (→ just compute the result). VariableTrackers carry the type information that lets it make this decision. The Graph records the decisions that were made.</p> <hr/> <h2 id="3-cpython-is-a-stack-machine">3. CPython Is a Stack Machine</h2> <p>To understand our symbolic interpreter, you need one fact about CPython: <strong>it’s a stack-based virtual machine.</strong> Every Python function compiles to a sequence of bytecode instructions that manipulate a <em>value stack</em> and a <em>locals array</em>.</p> <p>For <code class="language-plaintext highlighter-rouge">z = x + y</code>, CPython 3.10 and earlier emit the sequence below. From 3.11 on, <code class="language-plaintext highlighter-rouge">BINARY_ADD</code> is replaced by the generic <code class="language-plaintext highlighter-rouge">BINARY_OP</code>, and newer releases add or merge a few instructions, but the stack discipline is the same:</p> <table> <thead> <tr> <th style="text-align: left">Instruction</th> <th style="text-align: left">Stack (after)</th> <th style="text-align: left">Effect</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">LOAD_FAST x</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">[x]</code></td> <td style="text-align: left">Push local variable <code class="language-plaintext highlighter-rouge">x</code></td> </tr> <tr> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">LOAD_FAST y</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">[x, y]</code></td> <td style="text-align: left">Push local variable <code class="language-plaintext highlighter-rouge">y</code></td> </tr> <tr> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">BINARY_ADD</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">[x+y]</code></td> <td style="text-align: left">Pop two, push their sum</td> </tr> <tr> <td style="text-align: left"><code 
class="language-plaintext highlighter-rouge">STORE_FAST z</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">[]</code></td> <td style="text-align: left">Pop and store in local <code class="language-plaintext highlighter-rouge">z</code></td> </tr> </tbody> </table> <p>Our symbolic interpreter mirrors this exactly – same stack, same locals, same dispatch loop. The only difference: instead of real Python values, the stack holds <em>symbolic wrappers</em> that record operations into a graph.</p> <hr/> <h2 id="4-variabletrackers-the-symbolic-values">4. VariableTrackers: The Symbolic Values</h2> <p>Every value in our interpreter is a <code class="language-plaintext highlighter-rouge">VariableTracker</code>. Think of it as: <em>“I’m not a real value – I’m a description of a value that will exist at runtime.”</em></p> <p>We need exactly four types:</p> <h3 id="tensorvariable">TensorVariable</h3> <p>The most important type. It holds a <em>graph node</em> (its identity in the computation graph) and an <em>example value</em> (a real tensor with the same shape/dtype/device, used for metadata propagation).</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">TensorVariable</span><span class="p">(</span><span class="n">VariableTracker</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">node</span><span class="p">,</span> <span class="n">example_value</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">node</span> <span class="o">=</span> <span class="n">node</span>              <span class="c1"># Graph Node that produces this tensor
</span>        <span class="n">self</span><span class="p">.</span><span class="n">example_value</span> <span class="o">=</span> <span class="n">example_value</span>  <span class="c1"># Real tensor for shape tracking
</span></code></pre></div></div> <p>When the interpreter sees <code class="language-plaintext highlighter-rouge">x + y</code> where both are <code class="language-plaintext highlighter-rouge">TensorVariable</code>s, it doesn’t compute the runtime result that the user asked for. Instead, it:</p> <ol> <li>Creates a new <code class="language-plaintext highlighter-rouge">Node</code> in the graph: <code class="language-plaintext highlighter-rouge">call_function(torch.add, (x.node, y.node))</code></li> <li>Computes an example output for metadata propagation: <code class="language-plaintext highlighter-rouge">torch.add(x.example_value, y.example_value)</code></li> <li>Returns <code class="language-plaintext highlighter-rouge">TensorVariable(new_node, example_output)</code></li> </ol> <p>The example value flows forward through every operation, so at any point during tracing, we know the exact shape, dtype, and device of every intermediate tensor. We are still running individual example tensor ops during tracing to propagate metadata, but the output of tracing is the graph, not the eager result of the original function.</p> <h3 id="constantvariable">ConstantVariable</h3> <p>A value fully known at trace time: the <code class="language-plaintext highlighter-rouge">2</code> in <code class="language-plaintext highlighter-rouge">x * 2</code>, a dtype like <code class="language-plaintext highlighter-rouge">torch.float32</code>, a shape tuple. Constants don’t become graph nodes – they’re inlined directly into the operations that use them.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ConstantVariable</span><span class="p">(</span><span class="n">VariableTracker</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">value</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">value</span> <span class="o">=</span> <span class="n">value</span>  <span class="c1"># The actual Python value
</span></code></pre></div></div> <h3 id="torchvariable">TorchVariable</h3> <p>A reference to the <code class="language-plaintext highlighter-rouge">torch</code> module or one of its functions. When the interpreter encounters <code class="language-plaintext highlighter-rouge">LOAD_GLOBAL torch</code>, it pushes <code class="language-plaintext highlighter-rouge">TorchVariable(torch)</code>. When it then encounters <code class="language-plaintext highlighter-rouge">LOAD_ATTR relu</code>, it resolves <code class="language-plaintext highlighter-rouge">torch.relu</code> and pushes <code class="language-plaintext highlighter-rouge">TorchVariable(torch.relu)</code>.</p> <h3 id="methodvariable">MethodVariable</h3> <p>A bound tensor method like <code class="language-plaintext highlighter-rouge">x.sum</code>. Created when the interpreter accesses a method on a <code class="language-plaintext highlighter-rouge">TensorVariable</code>. It remembers <em>which tensor</em> and <em>which method</em>, so when called, it can record the correct graph node.</p> <aside> <p><strong>Real Dynamo has ~50 VariableTracker subclasses</strong>, covering lists, dicts, iterators, ranges, user-defined classes, <code class="language-plaintext highlighter-rouge">nn.Module</code>s, and more. Our four types suffice for straight-line tensor code.</p> </aside> <hr/> <h2 id="5-the-graph-ir">5. The Graph IR</h2> <p>As the interpreter runs, it records operations into a <code class="language-plaintext highlighter-rouge">Graph</code> – an ordered list of <code class="language-plaintext highlighter-rouge">Node</code> objects that form a DAG of the computation. 
This is a simplified version of <code class="language-plaintext highlighter-rouge">torch.fx.Graph</code>.</p> <p>Each <code class="language-plaintext highlighter-rouge">Node</code> has four key fields:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Node</span><span class="p">:</span>
    <span class="n">name</span><span class="p">:</span> <span class="nb">str</span>       <span class="c1"># Unique identifier, e.g. "add_0", "x"
</span>    <span class="n">op</span><span class="p">:</span> <span class="nb">str</span>         <span class="c1"># One of: "placeholder", "call_function", "call_method", "output"
</span>    <span class="n">target</span><span class="p">:</span> <span class="n">Any</span>     <span class="c1"># What to call (e.g., torch.add) or method name (e.g., "sum")
</span>    <span class="n">args</span><span class="p">:</span> <span class="nb">tuple</span>     <span class="c1"># Positional arguments -- can reference other Nodes
</span></code></pre></div></div> <p>Nodes come in four flavors:</p> <table> <thead> <tr> <th style="text-align: left"><code class="language-plaintext highlighter-rouge">op</code></th> <th style="text-align: left">Meaning</th> <th style="text-align: left">Example</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">placeholder</code></td> <td style="text-align: left">Function input</td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">x = placeholder</code></td> </tr> <tr> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">call_function</code></td> <td style="text-align: left">A function call on tensors</td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">add_0 = torch.add(x, y)</code></td> </tr> <tr> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">call_method</code></td> <td style="text-align: left">A method call on a tensor</td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">sum_0 = add_0.sum()</code></td> </tr> <tr> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">output</code></td> <td style="text-align: left">The return value</td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">return sum_0</code></td> </tr> </tbody> </table> <p>For the function:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">fn</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
    <span class="n">z</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span>
    <span class="n">w</span> <span class="o">=</span> <span class="n">z</span> <span class="o">*</span> <span class="mi">2</span>
    <span class="k">return</span> <span class="n">w</span><span class="p">.</span><span class="nf">sum</span><span class="p">()</span>
</code></pre></div></div> <p>The captured graph is:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Graph:
  x = placeholder
  y = placeholder
  add_0 = torch.add(x, y)
  mul_0 = torch.mul(add_0, 2)
  sum_0 = mul_0.sum()
  return sum_0
</code></pre></div></div> <p><img src="/assets/img/mini-dynamo/graph-ir.svg" alt="The same function before and after tracing. Function name, local variables, and Python operators dissolve into a pure data-flow DAG over tensor ops."/></p> <p>Notice the <code class="language-plaintext highlighter-rouge">2</code> in <code class="language-plaintext highlighter-rouge">torch.mul(add_0, 2)</code> – it’s a plain Python integer, not a <code class="language-plaintext highlighter-rouge">Node</code>. Constants are inlined into the args of the operations that consume them.</p> <hr/> <h2 id="6-the-symbolic-interpreter">6. The Symbolic Interpreter</h2> <p>This is the heart of mini-dynamo. The <code class="language-plaintext highlighter-rouge">SymbolicInterpreter</code> is what ties the previous pieces together: it reads bytecode, manipulates <code class="language-plaintext highlighter-rouge">VariableTracker</code>s on a stack, and writes nodes into the <code class="language-plaintext highlighter-rouge">Graph</code>. Everything else in the system either feeds into this loop or consumes its output.</p> <p>The intuition behind it is simple. When you run a Python function normally, CPython walks the bytecode and <em>actually executes</em> each instruction on real objects: integers get added, tensors get multiplied, methods get invoked. We want to do almost the same thing, except we don’t care about the result — we care about the <em>shape</em> of the computation. So instead of re-implementing CPython’s interpreter to produce a value, we re-implement just enough of it to produce a <strong>graph</strong>. 
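</p> <p>As a minimal sketch of that idea (not the mini-dynamo implementation), the snippet below runs one hand-written instruction list through two tiny loops: one over real values, one over node names. The names <code class="language-plaintext highlighter-rouge">PROGRAM</code>, <code class="language-plaintext highlighter-rouge">run_concrete</code>, and <code class="language-plaintext highlighter-rouge">run_symbolic</code> are hypothetical, and the instructions reuse the opcode names from section 3 rather than real CPython bytecode.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hand-written instructions for `z = x + y; return z` (illustrative only).
PROGRAM = [
    ("LOAD_FAST", "x"),
    ("LOAD_FAST", "y"),
    ("BINARY_ADD", None),
    ("STORE_FAST", "z"),
    ("LOAD_FAST", "z"),
    ("RETURN_VALUE", None),
]


def run_concrete(program, local_vars):
    """Value domain: the stack holds real Python objects."""
    stack, local_vars = [], dict(local_vars)
    for opname, arg in program:
        if opname == "LOAD_FAST":
            stack.append(local_vars[arg])
        elif opname == "STORE_FAST":
            local_vars[arg] = stack.pop()
        elif opname == "BINARY_ADD":
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs + rhs)
        elif opname == "RETURN_VALUE":
            return stack.pop()
        else:
            raise NotImplementedError(opname)


def run_symbolic(program, arg_names):
    """Graph domain: the stack holds node names, and operations are recorded."""
    graph = [f"{name} = placeholder" for name in arg_names]
    stack, local_vars = [], {name: name for name in arg_names}
    n_adds = 0
    for opname, arg in program:
        if opname == "LOAD_FAST":
            stack.append(local_vars[arg])
        elif opname == "STORE_FAST":
            local_vars[arg] = stack.pop()
        elif opname == "BINARY_ADD":
            rhs, lhs = stack.pop(), stack.pop()
            node = f"add_{n_adds}"
            n_adds += 1
            graph.append(f"{node} = add({lhs}, {rhs})")  # record, don't compute
            stack.append(node)
        elif opname == "RETURN_VALUE":
            graph.append(f"return {stack.pop()}")
            return graph
        else:
            raise NotImplementedError(opname)


print(run_concrete(PROGRAM, {"x": 1, "y": 2}))
print(run_symbolic(PROGRAM, ["x", "y"]))
</code></pre></div></div> <p>The two loops differ only in what <code class="language-plaintext highlighter-rouge">BINARY_ADD</code> and <code class="language-plaintext highlighter-rouge">RETURN_VALUE</code> do with the operands they pop.</p> <p>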
Same bytecode, same stack discipline, same dispatch loop — different domain.</p> <h3 id="two-interpreters-in-parallel">Two Interpreters in Parallel</h3> <p>The cleanest way to think about this is to picture two interpreters running side by side on the same bytecode, one real and one symbolic:</p> <table> <thead> <tr> <th style="text-align: left"> </th> <th style="text-align: left">CPython’s interpreter</th> <th style="text-align: left">Our symbolic interpreter</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>Stack holds</strong></td> <td style="text-align: left">Real Python objects</td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">VariableTracker</code>s</td> </tr> <tr> <td style="text-align: left"><strong>Locals hold</strong></td> <td style="text-align: left">Real values</td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">VariableTracker</code>s</td> </tr> <tr> <td style="text-align: left"><strong><code class="language-plaintext highlighter-rouge">BINARY_ADD</code> on two tensors</strong></td> <td style="text-align: left">Calls <code class="language-plaintext highlighter-rouge">torch.add</code>, produces a new tensor</td> <td style="text-align: left">Records <code class="language-plaintext highlighter-rouge">torch.add(x, y)</code> in the graph, pushes a new <code class="language-plaintext highlighter-rouge">TensorVariable</code> wrapping that node</td> </tr> <tr> <td style="text-align: left"><strong><code class="language-plaintext highlighter-rouge">BINARY_ADD</code> on two ints</strong></td> <td style="text-align: left">Computes <code class="language-plaintext highlighter-rouge">a + b</code></td> <td style="text-align: left">Computes <code class="language-plaintext highlighter-rouge">a + b</code> — constants are evaluated concretely</td> </tr> <tr> <td style="text-align: left"><strong><code class="language-plaintext highlighter-rouge">CALL_METHOD x.sum()</code></strong></td> 
<td style="text-align: left">Invokes the bound method</td> <td style="text-align: left">Records <code class="language-plaintext highlighter-rouge">x.sum()</code> in the graph</td> </tr> <tr> <td style="text-align: left"><strong>Unsupported opcode</strong></td> <td style="text-align: left">Executes it</td> <td style="text-align: left">Raises <code class="language-plaintext highlighter-rouge">NotImplementedError</code></td> </tr> <tr> <td style="text-align: left"><strong>Final output</strong></td> <td style="text-align: left">A return value</td> <td style="text-align: left">A finished <code class="language-plaintext highlighter-rouge">Graph</code></td> </tr> </tbody> </table> <p>Same shape, different domain. CPython operates on values; we operate on <em>descriptions</em> of values. At every bytecode step, the symbolic interpreter faces one recurring decision: <strong>record</strong> this operation into the graph, or <strong>evaluate</strong> it concretely on whatever constants and metadata we already know. That single choice, repeated across every instruction, is what produces the captured graph.</p> <h3 id="the-three-pieces-of-state">The Three Pieces of State</h3> <p>Just like CPython, our interpreter carries three pieces of state through its run:</p> <ul> <li><strong><code class="language-plaintext highlighter-rouge">self.stack</code></strong> — a list of <code class="language-plaintext highlighter-rouge">VariableTracker</code>s. Pushed to by <code class="language-plaintext highlighter-rouge">LOAD_*</code> opcodes, consumed by <code class="language-plaintext highlighter-rouge">BINARY_*</code>, <code class="language-plaintext highlighter-rouge">CALL_*</code>, <code class="language-plaintext highlighter-rouge">STORE_*</code>, and the rest.</li> <li><strong><code class="language-plaintext highlighter-rouge">self.locals</code></strong> — a dict mapping variable names to <code class="language-plaintext highlighter-rouge">VariableTracker</code>s. 
Initialized from the function’s arguments and updated on <code class="language-plaintext highlighter-rouge">STORE_FAST</code>.</li> <li><strong><code class="language-plaintext highlighter-rouge">self.graph</code></strong> – the <code class="language-plaintext highlighter-rouge">Graph</code> being built. Grows each time a tensor operation gets recorded.</li> </ul> <p>Everything the interpreter does is a transformation of these three. If you snapshotted them after every instruction, you’d have a complete movie of the trace.</p> <h3 id="correspondence-to-real-dynamo">Correspondence to Real Dynamo</h3> <p>Our <code class="language-plaintext highlighter-rouge">SymbolicInterpreter</code> is the direct analogue of TorchDynamo’s <code class="language-plaintext highlighter-rouge">InstructionTranslator</code> (in <code class="language-plaintext highlighter-rouge">torch/_dynamo/symbolic_convert.py</code>). The two share the same skeleton: a value stack, a locals dict, an FX-style graph being mutated, and one handler per opcode. The differences are in scope, not in kind:</p> <ul> <li>Real Dynamo handles ~200 opcodes including jumps, comparisons, exceptions, closures, and generator machinery. We handle ~15.</li> <li>Real Dynamo <em>inline-traces</em> into called functions: when <code class="language-plaintext highlighter-rouge">fn()</code> calls <code class="language-plaintext highlighter-rouge">helper()</code>, the symbolic interpreter recursively interprets <code class="language-plaintext highlighter-rouge">helper</code>’s bytecode itself rather than letting CPython run it, and keeps recording into the same graph, producing a single unified graph. Our walker only sees top-level bytecode.</li> <li>Real Dynamo emits <strong>guards</strong> on the fly as it makes assumptions (e.g. “I looked at <code class="language-plaintext highlighter-rouge">x.shape[0]</code> and treated it as <code class="language-plaintext highlighter-rouge">32</code>, so guard on that”). 
We emit guards after tracing, from the example inputs.</li> <li>Real Dynamo can <strong>break the graph</strong> when it hits something unsupported: it compiles what it has so far, lets the hard part run in plain Python, and resumes tracing afterwards. Our interpreter just halts with an exception: <code class="language-plaintext highlighter-rouge">NotImplementedError</code> for an unknown opcode, <code class="language-plaintext highlighter-rouge">RuntimeError</code> for a call it cannot classify.</li> </ul> <p>Everything past this point is a bottom-up tour of the walker itself: how it initializes, how the dispatch loop fetches and executes instructions, a worked example showing the stack and graph evolving side by side, and finally the call-dispatch logic that decides whether a given call becomes a graph node or a concrete Python call.</p> <h3 id="initialization">Initialization</h3> <p>When we begin tracing <code class="language-plaintext highlighter-rouge">fn(x, y)</code>, we create a <code class="language-plaintext highlighter-rouge">SymbolicInterpreter</code> with:</p> <ul> <li>A fresh <code class="language-plaintext highlighter-rouge">Graph</code></li> <li>An empty <code class="language-plaintext highlighter-rouge">stack</code></li> <li><code class="language-plaintext highlighter-rouge">locals</code> populated with a <code class="language-plaintext highlighter-rouge">TensorVariable</code> placeholder for each tensor argument and a <code class="language-plaintext highlighter-rouge">ConstantVariable</code> for every other argument</li> </ul> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">fn</span><span class="p">,</span> <span class="n">example_args</span><span class="p">):</span>
    <span class="n">self</span><span class="p">.</span><span class="n">fn</span> <span class="o">=</span> <span class="n">fn</span>
    <span class="n">self</span><span class="p">.</span><span class="n">graph</span> <span class="o">=</span> <span class="nc">Graph</span><span class="p">()</span>
    <span class="n">self</span><span class="p">.</span><span class="n">stack</span> <span class="o">=</span> <span class="p">[]</span>     <span class="c1"># Mirrors CPython's value stack, but holds VariableTrackers
</span>    <span class="n">self</span><span class="p">.</span><span class="nb">locals</span> <span class="o">=</span> <span class="p">{}</span>    <span class="c1"># Mirrors CPython's locals: name -&gt; VariableTracker
</span>
    <span class="c1"># fn.__code__ is the compiled CPython code object behind a function.
</span>    <span class="c1"># co_varnames is the tuple of *all* local names; the first co_argcount of
</span>    <span class="c1"># them are the declared parameters, in order. So this slice gives us just
</span>    <span class="c1"># the parameter names without pulling in interior locals.
</span>    <span class="n">code</span> <span class="o">=</span> <span class="n">fn</span><span class="p">.</span><span class="n">__code__</span>
    <span class="n">arg_names</span> <span class="o">=</span> <span class="n">code</span><span class="p">.</span><span class="n">co_varnames</span><span class="p">[:</span><span class="n">code</span><span class="p">.</span><span class="n">co_argcount</span><span class="p">]</span>

    <span class="c1"># Seed the locals dict with one tracker per argument:
</span>    <span class="c1">#   - tensors enter the graph as `placeholder` nodes (they're the inputs
</span>    <span class="c1">#     downstream nodes will reference);
</span>    <span class="c1">#   - non-tensors stay as concrete ConstantVariables, so the interpreter
</span>    <span class="c1">#     can use their actual Python value during tracing (e.g. for `if`
</span>    <span class="c1">#     conditions on ints, shape tuples, dtype objects).
</span>    <span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">example</span> <span class="ow">in</span> <span class="nf">zip</span><span class="p">(</span><span class="n">arg_names</span><span class="p">,</span> <span class="n">example_args</span><span class="p">):</span>
        <span class="k">if</span> <span class="nf">isinstance</span><span class="p">(</span><span class="n">example</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">):</span>
            <span class="n">node</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">graph</span><span class="p">.</span><span class="nf">placeholder</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
            <span class="n">self</span><span class="p">.</span><span class="nb">locals</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="nc">TensorVariable</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="n">example</span><span class="p">)</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">self</span><span class="p">.</span><span class="nb">locals</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="nc">ConstantVariable</span><span class="p">(</span><span class="n">example</span><span class="p">)</span>
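
# ---------------------------------------------------------------------------
# Illustrative aside, not part of mini-dynamo: a minimal standalone sketch of
# the co_varnames / co_argcount slice used above. `demo` and the variable
# names below are hypothetical, chosen only for this example.
def demo(x, y, scale=2):
    z = x + y  # `z` is an interior local, not a parameter
    return z * scale


demo_code = demo.__code__
# co_varnames lists the parameters first, then the interior locals.
all_local_names = demo_code.co_varnames
# Keeping the first co_argcount names drops the interior locals.
demo_arg_names = demo_code.co_varnames[:demo_code.co_argcount]
print(all_local_names)
print(demo_arg_names)
# Caveat: co_argcount does not count keyword-only parameters or *args and
# **kwargs, so this slice only covers plain positional parameters.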
</code></pre></div></div> <h3 id="the-main-loop">The Main Loop</h3> <p>The interpreter fetches instructions one by one and dispatches to handler methods:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="n">self</span><span class="p">):</span>
    <span class="c1"># dis.get_instructions decodes the function's bytecode into a flat list
</span>    <span class="c1"># of Instruction records — the same data CPython would dispatch on
</span>    <span class="c1"># internally. Each record knows its opname (e.g. "LOAD_FAST"), its
</span>    <span class="c1"># argument value, and where it sits in the bytecode.
</span>    <span class="n">instructions</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="n">dis</span><span class="p">.</span><span class="nf">get_instructions</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="n">fn</span><span class="p">))</span>
    <span class="k">for</span> <span class="n">inst</span> <span class="ow">in</span> <span class="n">instructions</span><span class="p">:</span>
        <span class="c1"># One handler method per opcode, conventionally named op_&lt;OPNAME&gt;
</span>        <span class="c1"># (e.g. op_LOAD_FAST, op_BINARY_ADD). CPython's ceval.c also dispatches on
</span>        <span class="c1"># the opcode (in C); here it is a Python attribute lookup. Anything
</span>        <span class="c1"># we haven't implemented falls through to NotImplementedError rather
</span>        <span class="c1"># than silently producing a wrong graph.
</span>        <span class="n">handler</span> <span class="o">=</span> <span class="nf">getattr</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="sa">f</span><span class="sh">"</span><span class="s">op_</span><span class="si">{</span><span class="n">inst</span><span class="p">.</span><span class="n">opname</span><span class="si">}</span><span class="sh">"</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">handler</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nc">NotImplementedError</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Unsupported bytecode: </span><span class="si">{</span><span class="n">inst</span><span class="p">.</span><span class="n">opname</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
        <span class="nf">handler</span><span class="p">(</span><span class="n">inst</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="n">graph</span>
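
# ---------------------------------------------------------------------------
# Illustrative aside, not part of mini-dynamo: a standalone sketch of the same
# getattr-based dispatch. `OpcodeLogger` is a hypothetical class that knows a
# single opcode and logs the rest instead of raising, which makes it easy to
# see which handlers a given function would need.
import dis


class OpcodeLogger:
    def __init__(self):
        self.handled = []
        self.unhandled = []

    def op_RETURN_VALUE(self, inst):
        self.handled.append(inst.opname)

    def run(self, fn):
        for inst in dis.get_instructions(fn):
            handler = getattr(self, f"op_{inst.opname}", None)
            if handler is None:
                self.unhandled.append(inst.opname)
            else:
                handler(inst)


def add(x, y):
    return x + y


logger = OpcodeLogger()
logger.run(add)
print(logger.handled, logger.unhandled)
# Opcode names differ between CPython versions (BINARY_ADD, for example,
# became BINARY_OP in 3.11), so the unhandled list depends on the interpreter.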
</code></pre></div></div> <h3 id="walking-through-an-example">Walking Through an Example</h3> <p>Let’s trace <code class="language-plaintext highlighter-rouge">fn(x, y)</code> where <code class="language-plaintext highlighter-rouge">fn</code> computes <code class="language-plaintext highlighter-rouge">z = x + y; w = z * 2; return w.sum()</code>. Before walking the full table, here’s the picture for the first four instructions — bytecode, stack, and graph all evolving in lockstep:</p> <p><img src="/assets/img/mini-dynamo/tracing.svg" alt="Tracing in motion: only the BINARY_ADD step actually touches the graph. Every other instruction is plumbing."/></p> <p>The full trace, including the multiplication and the method call:</p> <table> <thead> <tr> <th style="text-align: center">Step</th> <th style="text-align: left">Instruction</th> <th style="text-align: left">Stack</th> <th style="text-align: left">Graph (new node)</th> </tr> </thead> <tbody> <tr> <td style="text-align: center">1</td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">LOAD_FAST x</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">[TensorVar(x)]</code></td> <td style="text-align: left">–</td> </tr> <tr> <td style="text-align: center">2</td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">LOAD_FAST y</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">[TensorVar(x), TensorVar(y)]</code></td> <td style="text-align: left">–</td> </tr> <tr> <td style="text-align: center">3</td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">BINARY_ADD</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">[TensorVar(add_0)]</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">add_0 = torch.add(x, y)</code></td> </tr> <tr> <td style="text-align: center">4</td> <td 
style="text-align: left"><code class="language-plaintext highlighter-rouge">STORE_FAST z</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">[]</code></td> <td style="text-align: left">–</td> </tr> <tr> <td style="text-align: center">5</td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">LOAD_FAST z</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">[TensorVar(add_0)]</code></td> <td style="text-align: left">–</td> </tr> <tr> <td style="text-align: center">6</td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">LOAD_CONST 2</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">[TensorVar(add_0), ConstVar(2)]</code></td> <td style="text-align: left">–</td> </tr> <tr> <td style="text-align: center">7</td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">BINARY_MULTIPLY</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">[TensorVar(mul_0)]</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">mul_0 = torch.mul(add_0, 2)</code></td> </tr> <tr> <td style="text-align: center">8</td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">STORE_FAST w</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">[]</code></td> <td style="text-align: left">–</td> </tr> <tr> <td style="text-align: center">9</td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">LOAD_FAST w</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">[TensorVar(mul_0)]</code></td> <td style="text-align: left">–</td> </tr> <tr> <td style="text-align: center">10</td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">LOAD_METHOD sum</code></td> <td style="text-align: 
left"><code class="language-plaintext highlighter-rouge">[MethodVar(mul_0, "sum")]</code></td> <td style="text-align: left">–</td> </tr> <tr> <td style="text-align: center">11</td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">CALL_METHOD 0</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">[TensorVar(sum_0)]</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">sum_0 = mul_0.sum()</code></td> </tr> <tr> <td style="text-align: center">12</td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">RETURN_VALUE</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">[]</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">return sum_0</code></td> </tr> </tbody> </table> <p>Three things to notice:</p> <p><strong>Steps 3 and 7</strong> – when a binary operation involves a <code class="language-plaintext highlighter-rouge">TensorVariable</code>, the interpreter records a <code class="language-plaintext highlighter-rouge">torch.add</code> or <code class="language-plaintext highlighter-rouge">torch.mul</code> node in the graph and pushes a new <code class="language-plaintext highlighter-rouge">TensorVariable</code> wrapping that node. The constant <code class="language-plaintext highlighter-rouge">2</code> is passed directly into the node’s args.</p> <p><strong>Step 10</strong> – <code class="language-plaintext highlighter-rouge">LOAD_METHOD sum</code> on a <code class="language-plaintext highlighter-rouge">TensorVariable</code> produces a <code class="language-plaintext highlighter-rouge">MethodVariable</code> – not a graph node. The method hasn’t been <em>called</em> yet, just <em>looked up</em>. 
The graph node is created in step 11 when <code class="language-plaintext highlighter-rouge">CALL_METHOD</code> executes it.</p> <p><strong>Step 12</strong> – <code class="language-plaintext highlighter-rouge">RETURN_VALUE</code> marks the output. The graph is now complete.</p> <h3 id="the-call-dispatch-logic">The Call Dispatch Logic</h3> <p>The most interesting handler is <code class="language-plaintext highlighter-rouge">_handle_call</code>, which decides what to do when a function or method is called:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">_handle_call</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">fn</span><span class="p">,</span> <span class="n">args</span><span class="p">):</span>
    <span class="c1"># Known, supported torch function? → Record graph node.
</span>    <span class="k">if</span> <span class="nf">isinstance</span><span class="p">(</span><span class="n">fn</span><span class="p">,</span> <span class="n">TorchVariable</span><span class="p">)</span> <span class="ow">and</span> <span class="n">fn</span><span class="p">.</span><span class="n">value</span> <span class="ow">in</span> <span class="n">SUPPORTED_TORCH_FUNCTIONS</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="nf">_call_torch_function</span><span class="p">(</span><span class="n">fn</span><span class="p">.</span><span class="n">value</span><span class="p">,</span> <span class="n">args</span><span class="p">)</span>

    <span class="c1"># Tensor method (x.sum, x.reshape)? → Record graph node.
</span>    <span class="k">if</span> <span class="nf">isinstance</span><span class="p">(</span><span class="n">fn</span><span class="p">,</span> <span class="n">MethodVariable</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="nf">_call_tensor_method</span><span class="p">(</span><span class="n">fn</span><span class="p">,</span> <span class="n">args</span><span class="p">)</span>

    <span class="c1"># Unknown callable with tensor args? → Try tracing it anyway.
</span>    <span class="k">if</span> <span class="nf">isinstance</span><span class="p">(</span><span class="n">fn</span><span class="p">,</span> <span class="n">TorchVariable</span><span class="p">)</span> <span class="ow">and</span> <span class="nf">callable</span><span class="p">(</span><span class="n">fn</span><span class="p">.</span><span class="n">value</span><span class="p">):</span>
        <span class="n">has_tensors</span> <span class="o">=</span> <span class="nf">any</span><span class="p">(</span><span class="nf">isinstance</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">TensorVariable</span><span class="p">)</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">args</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">has_tensors</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="nf">_call_torch_function</span><span class="p">(</span><span class="n">fn</span><span class="p">.</span><span class="n">value</span><span class="p">,</span> <span class="n">args</span><span class="p">)</span>

    <span class="c1"># Pure Python on constants (int, len, etc.)? → Evaluate directly.
</span>    <span class="k">if</span> <span class="nf">isinstance</span><span class="p">(</span><span class="n">fn</span><span class="p">,</span> <span class="n">ConstantVariable</span><span class="p">)</span> <span class="ow">and</span> <span class="nf">callable</span><span class="p">(</span><span class="n">fn</span><span class="p">.</span><span class="n">value</span><span class="p">):</span>
        <span class="n">concrete_args</span> <span class="o">=</span> <span class="p">[</span><span class="n">self</span><span class="p">.</span><span class="nf">_to_concrete</span><span class="p">(</span><span class="n">a</span><span class="p">)</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">args</span><span class="p">]</span>
        <span class="k">return</span> <span class="nc">ConstantVariable</span><span class="p">(</span><span class="n">fn</span><span class="p">.</span><span class="nf">value</span><span class="p">(</span><span class="o">*</span><span class="n">concrete_args</span><span class="p">))</span>

    <span class="k">raise</span> <span class="nc">RuntimeError</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Don</span><span class="sh">'</span><span class="s">t know how to call </span><span class="si">{</span><span class="nf">type</span><span class="p">(</span><span class="n">fn</span><span class="p">).</span><span class="n">__name__</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
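
# ---------------------------------------------------------------------------
# Illustrative aside, not part of mini-dynamo: a torch-free sketch of the
# record-or-evaluate decision that _handle_call makes. `Sym`, `Const`, and
# `toy_call` are hypothetical names invented for this example; the real
# handler distinguishes more cases than this.
import operator


class Const:
    """A value known at trace time."""
    def __init__(self, value):
        self.value = value


class Sym:
    """A value only known at run time, identified by a graph node name."""
    def __init__(self, name):
        self.name = name


def toy_call(fn, args, recorded):
    if any(isinstance(a, Sym) for a in args):
        # At least one input is symbolic: record a node, return a new symbol.
        node_name = f"node_{len(recorded)}"
        arg_repr = [a.name if isinstance(a, Sym) else a.value for a in args]
        recorded.append((node_name, fn.__name__, arg_repr))
        return Sym(node_name)
    # Every input is a known constant: just run the Python call now.
    return Const(fn(*[a.value for a in args]))


recorded = []
five = toy_call(operator.add, [Const(2), Const(3)], recorded)  # evaluated
out = toy_call(operator.mul, [Sym("x"), five], recorded)       # recorded
print(five.value, out.name, recorded)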
</code></pre></div></div> <p>The branches above boil down to two outcomes, and this two-path dispatch is the core of the design: <strong>supported tensor operations are traced; supported pure-Python work on constants is evaluated.</strong> Anything outside that narrow subset is where mini-dynamo stops and raises an error. Real TorchDynamo is much more sophisticated: it can guard on Python values, rewrite bytecode, and resume after graph breaks. A <em>graph break</em> is Dynamo’s escape hatch for the “don’t know how to handle this” case: instead of giving up on the whole function, it compiles the graph it has built so far, hands control back to the regular Python interpreter to run the unsupported bit (a <code class="language-plaintext highlighter-rouge">print</code>, an unusual data structure, a call into a C extension), and then starts a fresh trace from the next instruction. A single Python function can therefore end up as several compiled graphs stitched together with plain eager code in between.</p> <p>For a concrete example, imagine your <code class="language-plaintext highlighter-rouge">forward</code> calls into a custom CUDA kernel via a <code class="language-plaintext highlighter-rouge">ctypes</code> binding or a third-party library that bypasses the PyTorch dispatcher. Dynamo can see the Python call site but has no way to introspect the C code on the other side of the FFI boundary, so it can’t represent that call as an FX node. Rather than failing the whole compile, it cuts the graph at that instruction: everything before the call becomes graph #1 (compiled with Inductor), the opaque CUDA call runs in eager Python against the materialized tensors, and whatever comes after starts graph #2. Each graph is compiled separately, so Inductor cannot fuse kernels across that boundary, which is exactly why, in practice, people work hard to eliminate graph breaks in hot code paths. 
A properly registered PyTorch custom op is a different story: because it participates in the dispatcher, Dynamo may be able to keep it as an operator in the graph even if Inductor treats it as an opaque call.</p> <hr/> <h2 id="7-the-compiler-backend">7. The Compiler Backend</h2> <p>The graph is now a clean IR of tensor operations. The compiler’s job is to turn it into a callable. In a real system like <code class="language-plaintext highlighter-rouge">torch.compile</code>, this is where the graph gets handed to Inductor for kernel fusion — the step that actually produces the speedup. <strong>Our educational backend is not that.</strong> It generates a plain Python function that re-dispatches to the same <code class="language-plaintext highlighter-rouge">torch</code> ops as the original, one at a time, with function lookups pre-resolved into a closure. Think of it as the minimum viable backend: it proves we captured the graph correctly and produces a standalone callable you could feed to real Inductor later. It is not where speed comes from.</p> <h3 id="what-the-compiler-generates">What the compiler generates</h3> <p>Our compiler walks the graph and generates a Python function with all function references <strong>pre-resolved in the closure</strong>:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Generated code for many_ops(x, y):
</span><span class="k">def</span> <span class="nf">compiled_fn</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
    <span class="n">add_0</span> <span class="o">=</span> <span class="nf">__fn_add_0</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>       <span class="c1"># __fn_add_0 = torch.add (in closure)
</span>    <span class="n">mul_0</span> <span class="o">=</span> <span class="nf">__fn_mul_0</span><span class="p">(</span><span class="n">add_0</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>   <span class="c1"># __fn_mul_0 = torch.mul (in closure)
</span>    <span class="n">sub_0</span> <span class="o">=</span> <span class="nf">__fn_sub_0</span><span class="p">(</span><span class="n">mul_0</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>   <span class="c1"># __fn_sub_0 = torch.sub (in closure)
</span>    <span class="bp">...</span>
    <span class="k">return</span> <span class="n">sum_0</span>
</code></pre></div></div> <p>Each <code class="language-plaintext highlighter-rouge">__fn_*</code> name is looked up in the generated function’s <code class="language-plaintext highlighter-rouge">__globals__</code> dict, i.e. the namespace the generated function is defined in (what this article loosely calls the closure). That replaces the <code class="language-plaintext highlighter-rouge">LOAD_GLOBAL torch</code> + <code class="language-plaintext highlighter-rouge">LOAD_ATTR add</code> pair in the original with a single global lookup. It’s worth being explicit about what this does and doesn’t save, because it’s tempting to read the previous paragraph as if we’ve <em>optimized</em> something. We haven’t, really. CPython’s bytecode dispatch runs on the order of tens of nanoseconds per instruction, while a single eager PyTorch op on the GPU spends microseconds in the C++ dispatcher, more microseconds launching the kernel, and then whatever the kernel itself takes. Dropping one attribute lookup per op saves at most a tiny fraction of the smallest of those costs; it’s negligible on real workloads. The real purpose of this step is not performance; it’s to produce a clean, self-contained callable that faithfully reproduces the graph. That makes it a useful sanity check that our tracer produced something equivalent to the original function. A real optimizing backend like Inductor would take the captured graph itself as input, not this generated Python source.</p> <h3 id="code-generation">Code Generation</h3> <p>The compiler walks the graph and emits one line of Python per node:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">compile_graph</span><span class="p">(</span><span class="n">graph</span><span class="p">):</span>
    <span class="c1"># Graph placeholders become the parameters of the generated function,
</span>    <span class="c1"># in the same order they appeared in the original `fn`.
</span>    <span class="n">param_names</span> <span class="o">=</span> <span class="p">[</span><span class="n">n</span><span class="p">.</span><span class="n">name</span> <span class="k">for</span> <span class="n">n</span> <span class="ow">in</span> <span class="n">graph</span><span class="p">.</span><span class="n">inputs</span><span class="p">]</span>
    <span class="n">signature</span> <span class="o">=</span> <span class="sh">"</span><span class="s">, </span><span class="sh">"</span><span class="p">.</span><span class="nf">join</span><span class="p">(</span><span class="n">param_names</span><span class="p">)</span>

    <span class="n">body_lines</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="c1"># `closure_vars` ends up as the globals dict for the exec()'d function.
</span>    <span class="c1"># Stashing the actual callables (torch.add, torch.mul, …) in here lets
</span>    <span class="c1"># generated code refer to them as plain names: one global lookup per call
</span>    <span class="c1"># instead of a LOAD_GLOBAL + LOAD_ATTR pair.
</span>    <span class="n">closure_vars</span> <span class="o">=</span> <span class="p">{}</span>

    <span class="k">for</span> <span class="n">node</span> <span class="ow">in</span> <span class="n">graph</span><span class="p">.</span><span class="n">nodes</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">node</span><span class="p">.</span><span class="n">op</span> <span class="o">==</span> <span class="sh">"</span><span class="s">placeholder</span><span class="sh">"</span><span class="p">:</span>
            <span class="k">continue</span>  <span class="c1"># Already covered by the function signature above.
</span>        <span class="k">elif</span> <span class="n">node</span><span class="p">.</span><span class="n">op</span> <span class="o">==</span> <span class="sh">"</span><span class="s">call_function</span><span class="sh">"</span><span class="p">:</span>
            <span class="c1"># Give this op a unique closure key, stash its target callable,
</span>            <span class="c1"># and emit a single line that invokes it.
</span>            <span class="n">closure_key</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"</span><span class="s">__fn_</span><span class="si">{</span><span class="n">node</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="sh">"</span>
            <span class="n">closure_vars</span><span class="p">[</span><span class="n">closure_key</span><span class="p">]</span> <span class="o">=</span> <span class="n">node</span><span class="p">.</span><span class="n">target</span>
            <span class="n">args_str</span> <span class="o">=</span> <span class="nf">_format_call_args</span><span class="p">(</span><span class="n">node</span><span class="p">.</span><span class="n">args</span><span class="p">)</span>
            <span class="n">body_lines</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">    </span><span class="si">{</span><span class="n">node</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s"> = </span><span class="si">{</span><span class="n">closure_key</span><span class="si">}</span><span class="s">(</span><span class="si">{</span><span class="n">args_str</span><span class="si">}</span><span class="s">)</span><span class="sh">"</span><span class="p">)</span>
        <span class="k">elif</span> <span class="n">node</span><span class="p">.</span><span class="n">op</span> <span class="o">==</span> <span class="sh">"</span><span class="s">call_method</span><span class="sh">"</span><span class="p">:</span>
            <span class="c1"># Methods are dispatched on the receiver, so there's nothing to
</span>            <span class="c1"># stash in the closure — we just write `&lt;self&gt;.&lt;method&gt;(...)`.
</span>            <span class="n">self_name</span> <span class="o">=</span> <span class="nf">_arg_to_str</span><span class="p">(</span><span class="n">node</span><span class="p">.</span><span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
            <span class="n">rest_args</span> <span class="o">=</span> <span class="nf">_format_call_args</span><span class="p">(</span><span class="n">node</span><span class="p">.</span><span class="n">args</span><span class="p">[</span><span class="mi">1</span><span class="p">:])</span>
            <span class="n">body_lines</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">    </span><span class="si">{</span><span class="n">node</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s"> = </span><span class="si">{</span><span class="n">self_name</span><span class="si">}</span><span class="s">.</span><span class="si">{</span><span class="n">node</span><span class="p">.</span><span class="n">target</span><span class="si">}</span><span class="s">(</span><span class="si">{</span><span class="n">rest_args</span><span class="si">}</span><span class="s">)</span><span class="sh">"</span><span class="p">)</span>
        <span class="k">elif</span> <span class="n">node</span><span class="p">.</span><span class="n">op</span> <span class="o">==</span> <span class="sh">"</span><span class="s">output</span><span class="sh">"</span><span class="p">:</span>
            <span class="n">body_lines</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">    return </span><span class="si">{</span><span class="nf">_arg_to_str</span><span class="p">(</span><span class="n">node</span><span class="p">.</span><span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>

    <span class="n">source</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"</span><span class="s">def compiled_fn(</span><span class="si">{</span><span class="n">signature</span><span class="si">}</span><span class="s">):</span><span class="se">\n</span><span class="sh">"</span> <span class="o">+</span> <span class="sh">"</span><span class="se">\n</span><span class="sh">"</span><span class="p">.</span><span class="nf">join</span><span class="p">(</span><span class="n">body_lines</span><span class="p">)</span>
    <span class="c1"># Two-step materialization. `compile()` (Python builtin, not ours) turns
</span>    <span class="c1"># the source string into a code object; `exec()` runs that code with
</span>    <span class="c1"># `closure_vars` as its globals. The side effect is that `compiled_fn`
</span>    <span class="c1"># is now defined inside `closure_vars`, ready to be pulled back out.
</span>    <span class="n">code</span> <span class="o">=</span> <span class="nf">compile</span><span class="p">(</span><span class="n">source</span><span class="p">,</span> <span class="sh">"</span><span class="s">&lt;mini-dynamo-compiled&gt;</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">exec</span><span class="sh">"</span><span class="p">)</span>
    <span class="nf">exec</span><span class="p">(</span><span class="n">code</span><span class="p">,</span> <span class="n">closure_vars</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">closure_vars</span><span class="p">[</span><span class="sh">"</span><span class="s">compiled_fn</span><span class="sh">"</span><span class="p">],</span> <span class="n">source</span>
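
# --- Illustrative sketch (an assumption, not part of mini_dynamo) -----------
# The same compile()/exec() pattern on a hand-written source string. The
# names `demo_source`, `demo_namespace`, and `demo_fn` are made up for this
# demo, and `operator.mul` stands in for a pre-resolved torch function.
import operator

demo_source = "def compiled_fn(x):\n    return __fn_mul_0(x, 2)"
demo_namespace = {"__fn_mul_0": operator.mul}    # plays the role of closure_vars
exec(compile(demo_source, "demo-source", "exec"), demo_namespace)
demo_fn = demo_namespace["compiled_fn"]          # pull the new function back out
print(demo_fn(21))    # 42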
</code></pre></div></div> <p>The <code class="language-plaintext highlighter-rouge">exec()</code> call creates the function in a namespace that contains the closure variables – the pre-resolved torch functions. This string-codegen trick is just our educational backend. Real TorchDynamo’s default path produces FX graphs and hands them to backends such as Inductor; it does not rely on this tiny Python source generator for performance.</p> <h3 id="the-jit-backend">The JIT Backend</h3> <p>For an additional step, we can trace the generated Python function with <code class="language-plaintext highlighter-rouge">torch.jit.trace</code> to get a TorchScript function:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">compile_graph_jit</span><span class="p">(</span><span class="n">graph</span><span class="p">,</span> <span class="n">example_inputs</span><span class="p">):</span>
    <span class="c1"># First, produce our usual Python source via compile_graph(). Then hand
</span>    <span class="c1"># that callable to torch.jit.trace, which re-records it as a single
</span>    <span class="c1"># TorchScript graph by running it once with the example inputs. The
</span>    <span class="c1"># result is a function where Python drops out of the per-op loop —
</span>    <span class="c1"># but each op still launches its own kernel; nothing is fused.
</span>    <span class="n">compiled_fn</span><span class="p">,</span> <span class="n">source</span> <span class="o">=</span> <span class="nf">compile_graph</span><span class="p">(</span><span class="n">graph</span><span class="p">)</span>
    <span class="n">traced_fn</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">jit</span><span class="p">.</span><span class="nf">trace</span><span class="p">(</span><span class="n">compiled_fn</span><span class="p">,</span> <span class="n">example_inputs</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">traced_fn</span><span class="p">,</span> <span class="n">source</span>
</code></pre></div></div> <p>This produces a TorchScript function in which the entire graph executes as a single C++ call, with no Python interpreter between operations. However, each operation still launches its own kernel. Nothing is fused, so the speedup is small (about 1.25× in the Section 10 benchmark). It’s included here because it demonstrates the concept of lowering a graph to a different runtime, which is what real backends like Inductor do (but with actual kernel fusion).</p> <hr/> <h2 id="8-guards-when-can-we-reuse-compiled-code">8. Guards: When Can We Reuse Compiled Code?</h2> <p>A compiled function makes assumptions about its inputs. The graph we traced for <code class="language-plaintext highlighter-rouge">fn(x, y)</code> with <code class="language-plaintext highlighter-rouge">x.shape = (3, 4)</code> might not be valid for <code class="language-plaintext highlighter-rouge">x.shape = (5, 6)</code> – different shapes could change broadcasting behavior, output sizes, or even which operations are valid.</p> <p><strong>Guards</strong> encode these assumptions as boolean checks:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@classmethod</span>
<span class="k">def</span> <span class="nf">from_example_inputs</span><span class="p">(</span><span class="n">cls</span><span class="p">,</span> <span class="n">example_args</span><span class="p">):</span>
    <span class="n">guard_set</span> <span class="o">=</span> <span class="nf">cls</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">arg</span> <span class="ow">in</span> <span class="nf">enumerate</span><span class="p">(</span><span class="n">example_args</span><span class="p">):</span>
        <span class="k">if</span> <span class="nf">isinstance</span><span class="p">(</span><span class="n">arg</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">):</span>
            <span class="c1"># Snapshot the shape *now*, while we still have the example tensor.
</span>            <span class="n">expected_shape</span> <span class="o">=</span> <span class="nf">tuple</span><span class="p">(</span><span class="n">arg</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
            <span class="n">guard_set</span><span class="p">.</span><span class="nf">add</span><span class="p">(</span><span class="nc">Guard</span><span class="p">(</span>
                <span class="c1"># The `idx=i, s=expected_shape` default arguments are the
</span>                <span class="c1"># standard Python trick for capturing loop variables *by value*
</span>                <span class="c1"># into a closure. Without them, every lambda would close over
</span>                <span class="c1"># the same `i` and `expected_shape` bindings and all end up
</span>                <span class="c1"># checking whatever those names held at the end of the loop.
</span>                <span class="k">lambda</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="n">idx</span><span class="o">=</span><span class="n">i</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="n">expected_shape</span><span class="p">:</span> <span class="nf">tuple</span><span class="p">(</span><span class="n">args</span><span class="p">[</span><span class="n">idx</span><span class="p">].</span><span class="n">shape</span><span class="p">)</span> <span class="o">==</span> <span class="n">s</span><span class="p">,</span>
                <span class="sa">f</span><span class="sh">"</span><span class="s">args[</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">].shape == </span><span class="si">{</span><span class="n">expected_shape</span><span class="si">}</span><span class="sh">"</span><span class="p">,</span>
            <span class="p">))</span>
            <span class="c1"># ... similarly for dtype and device
</span>    <span class="k">return</span> <span class="n">guard_set</span>
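
# --- Illustrative sketch (an assumption, not part of mini_dynamo) -----------
# Why the default-argument trick matters. The list names below are made up
# for this demo. A plain lambda looks up `i` only when it is called, so after
# the loop every lambda sees the final value. A default argument is evaluated
# once, at the moment the lambda is created.
late_bound = [lambda: i for i in range(3)]
value_bound = [lambda i=i: i for i in range(3)]
print([f() for f in late_bound])     # [2, 2, 2]
print([f() for f in value_bound])    # [0, 1, 2]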
</code></pre></div></div> <p>On each call, every guard is checked. If all pass, the cached compiled function is valid and we skip tracing entirely. If any guard fails, we retrace and compile for the new input signature, adding a new entry to the cache.</p> <h3 id="seeing-guards-in-action">Seeing Guards in Action</h3> <p>The easiest way to build intuition is to actually call a compiled function and watch the cache evolve. Take the same function we used earlier:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">torch</span>
<span class="kn">import</span> <span class="n">mini_dynamo</span>

<span class="nd">@mini_dynamo.compile</span>
<span class="k">def</span> <span class="nf">fn</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
    <span class="n">z</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span>
    <span class="n">w</span> <span class="o">=</span> <span class="n">z</span> <span class="o">*</span> <span class="mi">2</span>
    <span class="k">return</span> <span class="n">w</span><span class="p">.</span><span class="nf">sum</span><span class="p">()</span>
</code></pre></div></div> <p>On the <strong>first call</strong>, there’s no cache entry yet, so we fall through to the slow path: trace → compile → build guards. The guard set produced by <code class="language-plaintext highlighter-rouge">GuardSet.from_example_inputs</code> gets three guards per tensor argument — one for shape, one for dtype, one for device:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">randn</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">randn</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>
<span class="nf">fn</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>     <span class="c1"># ≈ milliseconds — full trace + compile
</span>
<span class="c1"># Inspect what got stored in the cache:
</span><span class="n">guard_set</span><span class="p">,</span> <span class="n">compiled_fn</span> <span class="o">=</span> <span class="n">fn</span><span class="p">.</span><span class="n">_cache</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="nf">print</span><span class="p">(</span><span class="n">guard_set</span><span class="p">)</span>
<span class="c1"># GuardSet([
#   args[0].shape == (3, 4)
#   args[0].dtype == torch.float32
#   args[0].device == cpu
#   args[1].shape == (3, 4)
#   args[1].dtype == torch.float32
#   args[1].device == cpu
# ])
</span></code></pre></div></div> <p>On the <strong>second call</strong>, the wrapper iterates the cache and calls <code class="language-plaintext highlighter-rouge">guard_set.check_all(a2, b2)</code>. Every guard is a tiny lambda (e.g. <code class="language-plaintext highlighter-rouge">tuple(args[0].shape) == (3, 4)</code>), so the whole check is a handful of Python comparisons — microseconds — and we jump straight to the compiled function without re-tracing:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a2</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">randn</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>   <span class="c1"># same shape/dtype/device → guards pass
</span><span class="n">b2</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">randn</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>
<span class="nf">fn</span><span class="p">(</span><span class="n">a2</span><span class="p">,</span> <span class="n">b2</span><span class="p">)</span>               <span class="c1"># ≈ microseconds of guard check + the compiled fn
</span></code></pre></div></div> <p>On the <strong>third call</strong>, we pass in tensors with a new shape. The shape guards on <code class="language-plaintext highlighter-rouge">args[0]</code> and <code class="language-plaintext highlighter-rouge">args[1]</code> both fail, <code class="language-plaintext highlighter-rouge">check_all</code> returns <code class="language-plaintext highlighter-rouge">False</code>, so the wrapper falls through to the slow path again: retrace, recompile, build a new guard set, append it to the cache. Now the cache has <em>two</em> entries, and future calls will check both in order:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a3</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">randn</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">)</span>   <span class="c1"># different shape → guards fail
</span><span class="n">b3</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">randn</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">)</span>
<span class="nf">fn</span><span class="p">(</span><span class="n">a3</span><span class="p">,</span> <span class="n">b3</span><span class="p">)</span>               <span class="c1"># ≈ milliseconds — cache miss, retrace
</span>
<span class="c1"># If you want to know *why* a call missed, ask the guard set:
</span><span class="nf">print</span><span class="p">(</span><span class="n">fn</span><span class="p">.</span><span class="n">_cache</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">].</span><span class="nf">failing_guards</span><span class="p">(</span><span class="n">a3</span><span class="p">,</span> <span class="n">b3</span><span class="p">))</span>
<span class="c1"># [Guard(args[0].shape == (3, 4)), Guard(args[1].shape == (3, 4))]
# And now len(fn._cache) == 2 — one entry per input signature we've seen.
</span></code></pre></div></div> <p>This is the mechanism in a nutshell: the first call pays the compile tax, identical calls are nearly free, and the cache grows by one entry every time the wrapper encounters a genuinely new input signature. In the worst case, where a function is called with a different shape every time, every call misses and <code class="language-plaintext highlighter-rouge">@compile</code> is pure overhead, which is why <strong>recompilation rate</strong> is one of the first things to look at when <code class="language-plaintext highlighter-rouge">torch.compile</code> isn’t giving you the speedup you expected.</p> <p><img src="/assets/img/mini-dynamo/cache-and-guards.svg" alt="One cache scan: the wrapper walks entries top-to-bottom and runs the first whose guards all pass. Entries further down never get checked on a hit."/></p> <p>This is the same trade-off real Dynamo makes, only at a smaller scale. The timings below are for our toy; a real Inductor compile typically takes seconds rather than milliseconds:</p> <table> <thead> <tr> <th style="text-align: left"> </th> <th style="text-align: left">First call</th> <th style="text-align: left">Subsequent calls (cache hit)</th> <th style="text-align: left">Shape change (cache miss)</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>Cost</strong></td> <td style="text-align: left">Full trace + compile</td> <td style="text-align: left">Guard checks only</td> <td style="text-align: left">Full retrace + compile</td> </tr> <tr> <td style="text-align: left"><strong>Typical time</strong></td> <td style="text-align: left">Milliseconds</td> <td style="text-align: left">Microseconds</td> <td style="text-align: left">Milliseconds</td> </tr> </tbody> </table> <aside> <p><strong>Real Dynamo’s guards are far more extensive.</strong> They check type IDs, object identity, dict version tags, global variable values, tensor strides, and more. They’re also implemented in C++ for speed. Our Python lambda guards demonstrate the concept.</p> </aside> <hr/> <h2 id="9-the-compile-decorator-putting-it-all-together">9. 
The <code class="language-plaintext highlighter-rouge">compile()</code> Decorator: Putting It All Together</h2> <p>The top-level API ties together all five pipeline stages:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">compile</span><span class="p">(</span><span class="n">fn</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="o">*</span><span class="p">,</span> <span class="n">backend</span><span class="o">=</span><span class="sh">"</span><span class="s">python</span><span class="sh">"</span><span class="p">):</span>
    <span class="c1"># The cache lives in this closure, so each @compile'd function gets its
</span>    <span class="c1"># own. Entries are appended in the order they were compiled; we scan
</span>    <span class="c1"># from the front on every call.
</span>    <span class="n">cache</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="nd">@functools.wraps</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">wrapper</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">):</span>
        <span class="c1"># Fast path: walk the cache and run the first entry whose guards
</span>        <span class="c1"># all pass on the current args. This is the path every steady-state
</span>        <span class="c1"># call takes.
</span>        <span class="k">for</span> <span class="n">guard_set</span><span class="p">,</span> <span class="n">compiled_fn</span> <span class="ow">in</span> <span class="n">cache</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">guard_set</span><span class="p">.</span><span class="nf">check_all</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">):</span>
                <span class="k">return</span> <span class="nf">compiled_fn</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">)</span>       <span class="c1"># Cache hit → fast path
</span>
        <span class="c1"># Slow path: nothing in the cache matches, so run the full pipeline
</span>        <span class="c1"># and append a new entry. The next call with the same signature
</span>        <span class="c1"># will hit it in the loop above.
</span>        <span class="n">graph</span> <span class="o">=</span> <span class="nc">SymbolicInterpreter</span><span class="p">(</span><span class="n">fn</span><span class="p">,</span> <span class="n">args</span><span class="p">).</span><span class="nf">run</span><span class="p">()</span>         <span class="c1"># STEP 1: Trace
</span>        <span class="n">compiled_fn</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="nf">compile_graph</span><span class="p">(</span><span class="n">graph</span><span class="p">)</span>                <span class="c1"># STEP 2: Compile
</span>        <span class="n">guard_set</span> <span class="o">=</span> <span class="n">GuardSet</span><span class="p">.</span><span class="nf">from_example_inputs</span><span class="p">(</span><span class="n">args</span><span class="p">)</span>       <span class="c1"># STEP 3: Guard
</span>        <span class="n">cache</span><span class="p">.</span><span class="nf">append</span><span class="p">((</span><span class="n">guard_set</span><span class="p">,</span> <span class="n">compiled_fn</span><span class="p">))</span>               <span class="c1"># STEP 4: Cache
</span>        <span class="k">return</span> <span class="nf">compiled_fn</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">)</span>                            <span class="c1"># STEP 5: Execute
</span>
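    <span class="c1"># Expose the cache so callers can inspect it as `fn._cache`, as the
</span>    <span class="c1"># examples in this article do.
</span>    <span class="n">wrapper</span><span class="p">.</span><span class="n">_cache</span> <span class="o">=</span> <span class="n">cache</span>
    <span class="k">return</span> <span class="n">wrapper</span>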
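
# --- Illustrative sketch (an assumption, not part of mini_dynamo) -----------
# The same guard-then-cache control flow with plain Python values instead of
# tensors. `toy_compile` and its type-based guard are hypothetical stand-ins:
# the "compile" step simply reuses `fn`, and the guard checks argument types
# rather than shape, dtype, and device.
import functools

def toy_compile(fn):
    cache = []    # list of (guard, compiled_fn) pairs, scanned front to back

    @functools.wraps(fn)
    def wrapper(*args):
        for guard, compiled_fn in cache:
            if guard(*args):                       # cache hit: skip "compilation"
                return compiled_fn(*args)
        expected = tuple(type(a) for a in args)    # snapshot the signature now
        guard = lambda *a, e=expected: tuple(type(v) for v in a) == e
        cache.append((guard, fn))                  # cache miss: add a new entry
        return fn(*args)

    wrapper._cache = cache    # exposed only so the demo can inspect it
    return wrapper

@toy_compile
def add(x, y):
    return x + y

add(1, 2)                 # miss: first entry, guarded on (int, int)
add(3, 4)                 # hit: same signature, cache unchanged
add(1.0, 2.0)             # miss: second entry, guarded on (float, float)
print(len(add._cache))    # 2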
</code></pre></div></div> <p>The previous sections built each component in isolation — the interpreter, the graph, the compiler, the guards. It’s worth seeing them compose end-to-end on a real call, with each intermediate artifact laid out explicitly. Take the same toy function as before, and imagine we’re running the very first call through the wrapper, stage by stage. Each stage produces something concrete you can print, so we’ll print it.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">torch</span>
<span class="kn">import</span> <span class="n">mini_dynamo</span>

<span class="nd">@mini_dynamo.compile</span>
<span class="k">def</span> <span class="nf">fn</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
    <span class="n">z</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span>
    <span class="n">w</span> <span class="o">=</span> <span class="n">z</span> <span class="o">*</span> <span class="mi">2</span>
    <span class="k">return</span> <span class="n">w</span><span class="p">.</span><span class="nf">sum</span><span class="p">()</span>

<span class="n">a</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">randn</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">randn</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>
</code></pre></div></div> <h3 id="step-1-trace">Step 1: Trace</h3> <p>The wrapper’s cache is empty, so we fall into the slow path. <code class="language-plaintext highlighter-rouge">SymbolicInterpreter(fn, (a, b)).run()</code> walks <code class="language-plaintext highlighter-rouge">fn</code>’s bytecode, pushing <code class="language-plaintext highlighter-rouge">VariableTracker</code>s on its stack, and records every tensor operation as a <code class="language-plaintext highlighter-rouge">Node</code>. It returns a <code class="language-plaintext highlighter-rouge">Graph</code>:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Graph:
  x = placeholder
  y = placeholder
  add_0 = torch.add(x, y)
  mul_0 = torch.mul(add_0, 2)
  sum_0 = mul_0.sum()
  return sum_0
</code></pre></div></div> <p>Notice what’s <em>not</em> there: no <code class="language-plaintext highlighter-rouge">z = ...</code>, no <code class="language-plaintext highlighter-rouge">w = ...</code>, no <code class="language-plaintext highlighter-rouge">STORE_FAST</code> noise, no <code class="language-plaintext highlighter-rouge">LOAD_GLOBAL torch</code> lookups. The intermediate local variables from the Python source have been flattened out into a straight-line DAG of tensor operations. The constant <code class="language-plaintext highlighter-rouge">2</code> is inlined directly into <code class="language-plaintext highlighter-rouge">torch.mul</code>’s args rather than becoming a node. This is precisely what makes the graph useful as an IR: it’s a pure description of <em>“what tensor ops, in what order, wired how”</em>, stripped of everything the compiler doesn’t care about.</p> <h3 id="step-2-compile">Step 2: Compile</h3> <p><code class="language-plaintext highlighter-rouge">compile_graph(graph)</code> walks those nodes and emits one line of Python per operation. It returns a callable plus the source string, which is worth looking at because it’s small enough to read in full:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">compiled_fn</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
    <span class="n">add_0</span> <span class="o">=</span> <span class="nf">__fn_add_0</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
    <span class="n">mul_0</span> <span class="o">=</span> <span class="nf">__fn_mul_0</span><span class="p">(</span><span class="n">add_0</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
    <span class="n">sum_0</span> <span class="o">=</span> <span class="n">mul_0</span><span class="p">.</span><span class="nf">sum</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">sum_0</span>
</code></pre></div></div> <p>The <code class="language-plaintext highlighter-rouge">__fn_add_0</code> and <code class="language-plaintext highlighter-rouge">__fn_mul_0</code> names aren’t magic — they’re just keys into the closure namespace the compiler builds alongside the source. That dict looks like <code class="language-plaintext highlighter-rouge">{"__fn_add_0": torch.add, "__fn_mul_0": torch.mul}</code>, and it becomes the globals for the <code class="language-plaintext highlighter-rouge">exec()</code> call that materializes the function. Each op still goes through <code class="language-plaintext highlighter-rouge">torch.add</code> and the full PyTorch dispatcher, and each still launches its own kernel — we haven’t fused anything, haven’t skipped the C++ dispatcher, haven’t avoided a single kernel launch.</p> <p>This backend’s role is purely to produce a faithful, standalone callable that does exactly what the captured graph says. Kernel fusion and dispatcher elimination are what happens when you hand the <em>same</em> graph to Inductor instead, which we get to in Section 10.</p> <h3 id="step-3-guard">Step 3: Guard</h3> <p><code class="language-plaintext highlighter-rouge">GuardSet.from_example_inputs((a, b))</code> inspects each tensor argument and builds three lambda guards per tensor — one each for shape, dtype, and device:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GuardSet([
  args[0].shape == (3, 4)
  args[0].dtype == torch.float32
  args[0].device == cpu
  args[1].shape == (3, 4)
  args[1].dtype == torch.float32
  args[1].device == cpu
])
</code></pre></div></div> <p>These six predicates are the contract: “the <code class="language-plaintext highlighter-rouge">compiled_fn</code> we just produced is valid as long as these hold”. The guard set isn’t attached to the tensors <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code>; it’s a set of <em>checks</em> that any future arguments must satisfy.</p> <h3 id="step-4-cache">Step 4: Cache</h3> <p>The pair <code class="language-plaintext highlighter-rouge">(guard_set, compiled_fn)</code> gets appended to the cache list. After this first call:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">print</span><span class="p">(</span><span class="nf">len</span><span class="p">(</span><span class="n">fn</span><span class="p">.</span><span class="n">_cache</span><span class="p">))</span>    <span class="c1"># → 1
</span></code></pre></div></div> <p>That’s the entire cache: one entry. The cache is per-<code class="language-plaintext highlighter-rouge">@compile</code>d function (it lives in the wrapper’s closure), and its order matters — on every subsequent call, we’ll scan it from index 0 upward, returning the first entry whose guards all pass.</p> <h3 id="step-5-execute">Step 5: Execute</h3> <p>Finally, we actually call <code class="language-plaintext highlighter-rouge">compiled_fn(a, b)</code> and return the result. The result is identical to what eager <code class="language-plaintext highlighter-rouge">fn(a, b)</code> would produce — we haven’t <em>changed</em> the computation, we’ve just reorganized how it’s dispatched:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">compiled_fn</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span> <span class="o">==</span> <span class="nf">fn</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>   <span class="c1"># → tensor(True)
</span></code></pre></div></div> <p>All five steps have run in service of this one call. The first-call latency (a few milliseconds on our example) is almost entirely spent in steps 1–3; step 5 is microseconds.</p> <h3 id="what-happens-on-the-second-and-third-calls">What Happens on the Second and Third Calls</h3> <p>Now the structure earns its keep. On the <strong>second call</strong> with the same shape/dtype/device, the wrapper iterates <code class="language-plaintext highlighter-rouge">cache</code>, finds that <code class="language-plaintext highlighter-rouge">guard_set.check_all(*args)</code> returns <code class="language-plaintext highlighter-rouge">True</code> on the first entry, and jumps directly to <code class="language-plaintext highlighter-rouge">compiled_fn(*args)</code>. Steps 1–4 are skipped entirely. The cache is still length 1.</p> <p>On the <strong>third call</strong> with <code class="language-plaintext highlighter-rouge">(5, 6)</code> tensors, <code class="language-plaintext highlighter-rouge">check_all</code> returns <code class="language-plaintext highlighter-rouge">False</code> on every existing entry (the shape guards fail). The wrapper falls through to the slow path again, traces a fresh graph, compiles a new function, builds a new guard set, and appends. Now:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">print</span><span class="p">(</span><span class="nf">len</span><span class="p">(</span><span class="n">fn</span><span class="p">.</span><span class="n">_cache</span><span class="p">))</span>    <span class="c1"># → 2
</span></code></pre></div></div> <p>Future calls will scan both entries in order: a <code class="language-plaintext highlighter-rouge">(3, 4)</code> call hits entry 0, a <code class="language-plaintext highlighter-rouge">(5, 6)</code> call fails entry 0’s guards and hits entry 1, and any brand-new shape falls through to a new compile and a third entry.</p> <p>This is the full pipeline in motion. Five stages, five concrete artifacts (a <code class="language-plaintext highlighter-rouge">Graph</code>, a <code class="language-plaintext highlighter-rouge">compiled_fn</code>, a <code class="language-plaintext highlighter-rouge">GuardSet</code>, a cache list, and a tensor result), each one handed to the next, cached so that steady-state calls skip all the expensive work. Real <code class="language-plaintext highlighter-rouge">torch.compile</code> is dramatically more sophisticated in every stage (keyword arguments, frame interception via PEP 523, inlined nested calls, dynamic shapes, C++ guard evaluation, per-code-object caches, graph breaks), but the spine is the same: <strong>trace → compile → guard → cache → execute</strong>.</p> <hr/> <h2 id="10-back-to-the-big-picture-where-does-speedup-actually-come-from">10. Back to the Big Picture: Where Does Speedup Actually Come From?</h2> <p>Now that we’ve built the full system, we can ask honestly: <strong>how much faster is it?</strong></p> <p>The short version: <strong>graph capture on its own produces no meaningful speedup.</strong> All of the win comes from what an optimizing backend does with the graph. 
It’s worth unpacking why, because a common and slightly misleading way to describe <code class="language-plaintext highlighter-rouge">torch.compile</code> is to say it “removes Python overhead.” That phrasing glosses over several very different costs that live between a user’s <code class="language-plaintext highlighter-rouge">x + y</code> and the kernel running on the GPU.</p> <h3 id="where-the-time-actually-goes">Where the Time Actually Goes</h3> <p>Here’s the rough per-op cost decomposition for an elementwise op in eager PyTorch on a modern CUDA setup:</p> <table> <thead> <tr> <th style="text-align: left">Cost</th> <th style="text-align: left">Typical scale per op</th> <th style="text-align: left">Who pays it</th> </tr> </thead> <tbody> <tr> <td style="text-align: left">CPython bytecode dispatch</td> <td style="text-align: left">tens of nanoseconds</td> <td style="text-align: left">The interpreter</td> </tr> <tr> <td style="text-align: left">Python-level method resolution, <code class="language-plaintext highlighter-rouge">__torch_function__</code></td> <td style="text-align: left">hundreds of nanoseconds</td> <td style="text-align: left">CPython + PyTorch’s Python bindings</td> </tr> <tr> <td style="text-align: left">PyTorch C++ dispatcher (device, autograd, vmap, …)</td> <td style="text-align: left">a few microseconds</td> <td style="text-align: left">libtorch</td> </tr> <tr> <td style="text-align: left">Kernel launch onto the CUDA / MPS stream</td> <td style="text-align: left">5–20 microseconds</td> <td style="text-align: left">The GPU driver</td> </tr> <tr> <td style="text-align: left">The kernel itself</td> <td style="text-align: left">nanoseconds to milliseconds</td> <td style="text-align: left">The GPU</td> </tr> </tbody> </table> <p>The first two rows are what most people mean when they say “Python overhead.” They are also the <em>smallest</em> rows. 
Our Python backend only touches those: it pre-resolves function lookups into a closure so each op skips one <code class="language-plaintext highlighter-rouge">LOAD_GLOBAL</code> + <code class="language-plaintext highlighter-rouge">LOAD_ATTR</code> pair. Nothing below that line changes — every op still boxes arguments into PyObjects, still traverses libtorch’s dispatch key logic, still waits on its own kernel launch.</p> <p>The numbers reflect this. For a function with 11 chained elementwise ops on 256×256 tensors:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Eager (original):            ~150 us
mini_dynamo (python):        ~140 us   (~1.07x — mostly noise)
mini_dynamo (jit):           ~120 us   (~1.25x)
torch.compile (inductor):     ~40 us   (~3.75x)
</code></pre></div></div> <p><img src="/assets/img/mini-dynamo/cost-decomposition.svg" alt="Where the time goes for 11 chained elementwise ops. Inductor's win is not from making each op faster — it's from collapsing 11 launches into a single fused kernel."/></p> <p>The Python backend’s ~1.07× is essentially within noise. We saved a handful of bytecodes per op, and that’s dwarfed by everything underneath.</p> <p>The JIT backend’s ~1.25× is small but real, and it’s worth understanding where it comes from — because it is <em>not</em> bytecode dispatch savings. <code class="language-plaintext highlighter-rouge">torch.jit.trace</code> wraps the generated function into a single TorchScript graph call, so from Python’s point of view the whole chain becomes one <code class="language-plaintext highlighter-rouge">call into C++</code>. Python drops out of the loop between ops, and some of the per-op dispatcher and Python↔C++ boundary-crossing work gets amortized. We’re nibbling at rows 2–3 of the table, not row 1.</p> <h3 id="where-the-real-speedup-comes-from">Where the Real Speedup Comes From</h3> <p>Inductor’s ~3.75× is a different beast entirely. It comes from operating <em>below</em> the dispatcher rather than saving a few interpreter instructions above it, and it relies on having a captured graph as input:</p> <ul> <li><strong>Kernel fusion.</strong> Inductor generates a single Triton (GPU) or C++ (CPU) kernel for a whole chain of memory-bound ops. Eager does 11 kernels, each reading from HBM, computing one op, writing back. Fusion does one read, all the arithmetic in registers, one write. For elementwise chains, softmax, layer norm, activations — almost everything that isn’t a matmul — this alone is the 3–5× number you see in PyTorch benchmarks.</li> <li><strong>Launch overhead collapse.</strong> Even after fusion, each kernel launch still costs microseconds. When the same shapes recur (e.g. 
the steady-state of a training loop), CUDA graph integration lets you record the launches once and replay them as a single stream op, eliminating the per-step dispatcher and launch costs.</li> <li><strong>Memory planning.</strong> With a full graph in hand, Inductor can plan intermediate buffers once and reuse them, avoiding the per-op allocator churn eager incurs.</li> </ul> <p>None of these live in our mini-dynamo’s backend, or could. They require the graph as input, and they operate on the <em>biggest</em> rows of the cost table — the dispatcher, the launch, and the kernel itself. That is the part of the stack where the microseconds actually live.</p> <h3 id="the-key-insight">The Key Insight</h3> <p><strong>Dynamo and Inductor are separate concerns, and they are not symmetric.</strong> Dynamo captures the graph; on its own, that gets you essentially nothing for performance. Inductor optimizes the graph; in normal deep-learning workloads, that is where the meaningful speedup comes from. The entire point of going through the elaborate machinery of a bytecode-level tracer is to hand an optimizing backend something it can fuse, schedule, and lower. Our mini-dynamo replaces only the Dynamo part — and because we produce a compatible graph, we can plug in the <em>real</em> Inductor backend and recover the actual speedup:</p> <p>We can convert our mini-dynamo graph into an <code class="language-plaintext highlighter-rouge">fx.GraphModule</code>, lower it to ATen ops, and pass it directly to <code class="language-plaintext highlighter-rouge">compile_fx_inner</code> – Inductor’s entry point. For the straight-line tensor programs that mini-dynamo supports, this can produce the same ATen graph and therefore the same fused kernels as real <code class="language-plaintext highlighter-rouge">torch.compile</code>. That’s what the parity tests in this repository validate. 
It is not a claim of general equivalence across arbitrary PyTorch programs.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">mini_dynamo_to_inductor</span><span class="p">(</span><span class="n">fn</span><span class="p">,</span> <span class="o">*</span><span class="n">example_inputs</span><span class="p">):</span>
    <span class="c1"># 1. Trace with our symbolic interpreter — produces a mini-dynamo Graph
</span>    <span class="c1">#    whose nodes call torch.add, torch.mul, etc.
</span>    <span class="n">graph</span> <span class="o">=</span> <span class="nc">SymbolicInterpreter</span><span class="p">(</span><span class="n">fn</span><span class="p">,</span> <span class="n">example_inputs</span><span class="p">).</span><span class="nf">run</span><span class="p">()</span>

    <span class="c1"># 2. Repackage our graph as a torch.fx.GraphModule, which is the format
</span>    <span class="c1">#    Inductor's pipeline accepts.
</span>    <span class="n">gm</span> <span class="o">=</span> <span class="nf">to_fx_graph_module</span><span class="p">(</span><span class="n">graph</span><span class="p">)</span>

    <span class="c1"># 3. make_fx re-traces gm one more time, this time under PyTorch's ATen
</span>    <span class="c1">#    dispatch layer. Surface-level ops (torch.add) get rewritten to their
</span>    <span class="c1">#    canonical ATen counterparts (torch.ops.aten.add.Tensor). Inductor
</span>    <span class="c1">#    works on ATen, not on the Python-facing torch API.
</span>    <span class="n">aten_gm</span> <span class="o">=</span> <span class="nf">make_fx</span><span class="p">(</span><span class="n">gm</span><span class="p">)(</span><span class="o">*</span><span class="n">example_inputs</span><span class="p">)</span>

    <span class="c1"># 4. Hand the ATen graph to Inductor's entry point, which does the actual
</span>    <span class="c1">#    kernel fusion and code generation. The returned `compiled` is the
</span>    <span class="c1">#    fused-kernel callable.
</span>    <span class="n">compiled</span> <span class="o">=</span> <span class="nf">compile_fx_inner</span><span class="p">(</span><span class="n">aten_gm</span><span class="p">,</span> <span class="nf">list</span><span class="p">(</span><span class="n">example_inputs</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">compiled</span>
</code></pre></div></div> <hr/> <h2 id="11-what-we-left-out">11. What We Left Out</h2> <p>Mini-dynamo demonstrates the architecture of TorchDynamo. But real Dynamo is a vastly more complex system. Here are the most important gaps:</p> <h3 id="pep-523-frame-evaluation">PEP 523 Frame Evaluation</h3> <p>Real Dynamo doesn’t use <code class="language-plaintext highlighter-rouge">dis.get_instructions()</code>. It installs a <strong>C-level frame evaluator</strong> via PEP 523 that intercepts every Python function call before CPython’s interpreter runs. This is transparent – no special calling convention needed – and enables:</p> <ul> <li> <p><strong>Function inlining:</strong> When <code class="language-plaintext highlighter-rouge">fn()</code> calls <code class="language-plaintext highlighter-rouge">helper()</code>, Dynamo intercepts the new frame and traces <em>into</em> it, capturing a single unified graph. Our bytecode walker only sees the top-level function.</p> </li> <li> <p><strong>Graph breaks:</strong> When Dynamo hits an unsupported operation (a <code class="language-plaintext highlighter-rouge">print()</code>, an unsupported data structure), it can <em>break the graph</em> – compiling what it has so far, executing the unsupported operation in normal Python, and resuming tracing after. Our interpreter simply raises <code class="language-plaintext highlighter-rouge">NotImplementedError</code>.</p> </li> </ul> <h3 id="control-flow">Control Flow</h3> <p>We skip all jump instructions (<code class="language-plaintext highlighter-rouge">POP_JUMP_IF_TRUE</code>, <code class="language-plaintext highlighter-rouge">FOR_ITER</code>, etc.). 
Real Dynamo handles control flow by specializing: if the branch condition is a tensor property known at trace time (like <code class="language-plaintext highlighter-rouge">x.shape[0] &gt; 5</code>), it evaluates it and traces only the taken branch, guarding on the condition.</p> <h3 id="dynamic-shapes">Dynamic Shapes</h3> <p>Our guards require exact shape matches. Real Dynamo supports <strong>dynamic shapes</strong> – symbolic integers that represent unknown dimensions. This avoids recompilation when batch size changes, at the cost of more complex guard logic and symbolic reasoning.</p> <h3 id="50-variabletracker-subclasses">50+ VariableTracker Subclasses</h3> <p>Our four types cover tensors, constants, torch functions, and tensor methods. Real Dynamo has trackers for lists, dicts, ranges, slices, iterators, <code class="language-plaintext highlighter-rouge">nn.Module</code> instances, user-defined classes, closures, generators, and more.</p> <hr/> <h2 id="12-summary">12. Summary</h2> <p><code class="language-plaintext highlighter-rouge">torch.compile</code> is not magic. It’s a well-structured pipeline:</p> <ol> <li><strong>Intercept</strong> Python execution at the bytecode level</li> <li><strong>Replay</strong> each instruction symbolically, recording tensor operations into a graph</li> <li><strong>Compile</strong> the graph with an optimizing backend</li> <li><strong>Guard</strong> against changes in input metadata</li> <li><strong>Cache</strong> the result for fast reuse</li> </ol> <p>The symbolic interpreter is a CPython emulator. The graph is an IR. The compiler is a code generator. The guards are boolean predicates. Each component is understandable in isolation, and together they explain how <code class="language-plaintext highlighter-rouge">torch.compile</code> can speed up PyTorch programs. 
In this mini implementation, the graph-capture machinery is the educational focus; the large speedups only arrive once you pair that captured graph with an optimizing backend like Inductor.</p> <div class="l-body"> <p><em>The full source code for mini-dynamo is in this repository. Every module is heavily commented and designed to be read linearly.</em></p> </div> <hr/> <div class="appendix"> <h2 id="appendix-file-map">Appendix: File Map</h2> <table> <thead> <tr> <th style="text-align: left">File</th> <th style="text-align: left">Purpose</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">mini_dynamo/__init__.py</code></td> <td style="text-align: left">The <code class="language-plaintext highlighter-rouge">compile()</code> decorator – ties together all five stages</td> </tr> <tr> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">mini_dynamo/symbolic_interpreter.py</code></td> <td style="text-align: left">The bytecode walker – CPython emulator on VariableTrackers</td> </tr> <tr> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">mini_dynamo/variable_tracker.py</code></td> <td style="text-align: left">Four symbolic value types</td> </tr> <tr> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">mini_dynamo/graph.py</code></td> <td style="text-align: left">The computation graph IR (<code class="language-plaintext highlighter-rouge">Node</code> + <code class="language-plaintext highlighter-rouge">Graph</code>)</td> </tr> <tr> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">mini_dynamo/compiler.py</code></td> <td style="text-align: left">Code generation backends (Python + JIT)</td> </tr> <tr> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">mini_dynamo/guards.py</code></td> <td style="text-align: left">Guard creation and checking</td> </tr> <tr> <td 
style="text-align: left"><code class="language-plaintext highlighter-rouge">examples/benchmark.py</code></td> <td style="text-align: left">Performance analysis: where speedup comes from</td> </tr> <tr> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">examples/benchmark_mps.py</code></td> <td style="text-align: left">GPU benchmark: Python vs JIT vs Inductor</td> </tr> <tr> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">examples/inductor_integration.py</code></td> <td style="text-align: left">Plugging into the real Inductor backend</td> </tr> </tbody> </table> </div>]]></content><author><name></name></author><summary type="html"><![CDATA[Rebuilding the core ideas behind torch.compile with a tiny TorchDynamo-style tracer.]]></summary></entry><entry><title type="html">On the AI Sentience Debate</title><link href="https://danielep.xyz/blog/2022/sentience/" rel="alternate" type="text/html" title="On the AI Sentience Debate"/><published>2022-11-07T15:12:00+00:00</published><updated>2022-11-07T15:12:00+00:00</updated><id>https://danielep.xyz/blog/2022/sentience</id><content type="html" xml:base="https://danielep.xyz/blog/2022/sentience/"><![CDATA[<p>This article is available on my <a href="https://danielepaliotta.substack.com/p/on-the-ai-sentience-debate">Substack</a>.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Some thoughts]]></summary></entry><entry><title type="html">Teaching Myself Quantum Mechanics, Part one.</title><link href="https://danielep.xyz/blog/2021/qm-1/" rel="alternate" type="text/html" title="Teaching Myself Quantum Mechanics, Part one."/><published>2021-02-04T15:12:00+00:00</published><updated>2021-02-04T15:12:00+00:00</updated><id>https://danielep.xyz/blog/2021/qm-1</id><content type="html" xml:base="https://danielep.xyz/blog/2021/qm-1/"><![CDATA[<p>This series is about me learning something in my spare time. 
More specifically, it’s about me learning quantum mechanics, as a way of getting back into physics, and as a way of doing good science.</p> <p>I’ll explain: I have recently started a PhD in machine learning at the University of Geneva. The main project I’ll be working on is about applying machine learning and statistical methods to physics, mainly high energy physics (HEP) and solar astronomy. I am part of a very strong multidisciplinary team of physicists and AI people, so I could probably do without a strong physics background. Still, I feel like having one can’t hurt, right?</p> <p>So there you have it: I want to study and understand physics in order to do better research.</p> <p>But that’s not the whole story. I also love physics, and I have tortured myself over and over as to whether I should have majored in physics rather than CS during my university career. This is not to say that I am any good at it. I only had one big physics exam during my bachelor’s (topics were kinematics, some classical mechanics, electromagnetism), and all the rest of my knowledge is made of bits and pieces coming from random YouTube lectures and non-technical books I have read over the years. So yeah, it’s finally time to get serious.</p> <h3 id="why-start-with-quantum-mechanics">Why start with Quantum Mechanics?</h3> <p>Because I find it exciting, and that’s probably enough. Starting with classical mechanics would likely bore me to death. I know QM is not easy or intuitive, and that the math can get hard. However, I don’t feel completely ill-equipped. Thanks to my engineering and ML background, I know a thing or two about probabilities, linear algebra, Fourier series and complex numbers, which I have been told are a substantial part of the required background.</p> <h3 id="what-is-your-study-method">What is your study method?</h3> <p>I usually try to be extremely deliberate about my study method. 
I won’t go into the details here, but I use active recall and spaced repetition tools such as Anki so that remembering what I study is no longer a random event but a purposeful act. There are many interesting resources about such techniques, such as <a href="http://augmentingcognition.com/ltm.html">this</a>. Michael Nielsen has even developed an <a href="https://quantum.country/qcvc">introduction to quantum computing</a> delivered in a “new mnemonic medium which makes it almost effortless to remember what you read” (hint: it’s active recall and spaced repetition).</p> <p>As for the learning resources, I have chosen two introductory books: one is from the Theoretical Minimum series by the great Leonard Susskind. This is supposed to be a very simple introduction, starting from first principles. I have already flipped through the pages, and there actually seems to be quite a lot of content, or at least more than you would expect from a book of this kind.</p> <p>The second book is more technical, and it comes up over and over as a suggestion for learning QM: Griffiths’ Introduction to Quantum Mechanics. Honestly, here I have no idea what to expect.</p> <p>So here’s the plan: I’ll go through both books simultaneously and try to solve most exercises. I’ll also create flashcards and diagrams of the concepts, and let Anki do the rest. Additionally, I’ll write this blog series, which hopefully helps me understand things better. So if you are here to learn, don’t just count on my writing.</p> <h3 id="requirements">Requirements</h3> <p>The requirements are basic linear algebra and basic probability. And complex numbers. Some notions of classical physics can also help.</p> <hr/> <h2 id="spins-and-qubits">Spins and Qubits</h2> <p>The fact that we cannot fully grasp some things about the inner workings of our universe should come as no surprise. After all, we evolved to survive, not to understand. 
Ok yes, we have a strong intuition for many classical-mechanical things, but the reason is that things that behave classically are closer to our experience than things that behave relativistically or in a quantum mechanical way. So what we can do here is simple: let’s just go with the math and with the experiment. I don’t expect any of the concepts I encounter to be “relatable”. I am going to trust the results of the experiments, and link them to the math that hopefully explains them. And I’ll start with the <strong>spin</strong>.</p> <p>Sooo, the spin is an extra degree of freedom that some particles have (all of them? I don’t know).</p> <p>This spin is a very quantum mechanical concept, which is why trying to visualize it classically would miss the point.</p> <p>Now, the spin of a particle is a quantum system in its own right. In fact, it’s the “simplest and the most quantum of systems”, according to Susskind. Just by going through a very simple experiment involving the spin, we can already start to notice some important differences between classical physics and quantum mechanics.</p> <p>The experiment involves a measuring system, which we call \(\mathcal{A}\), that records the state of the spin of our system (we’ll call it \(\sigma\)). We can orient the measuring apparatus in order to measure the spin along a specified axis. As a test, we orient the apparatus along the \(z\)-axis, and notice that our measurement takes the value \(+1\) or \(-1\).</p> <p>Now, if we reset the apparatus to measure the spin along the \(z\)-axis again, and assume the simple evolution law</p> \[\sigma(n+1) = \sigma(n)\] <p>which tells us that the state of the system remains unchanged through time, then the previous measurement should be confirmed, again and again. We say that the measurement of a state <strong>prepares the system.</strong></p> <p>In quantum experiments, performing the same measurement repeatedly gives the same answer until the system is prepared differently. 
We will see this in more detail now.</p> <p>After we have prepared the system in the state \(\sigma_z = 1\), let’s now rotate the apparatus by, say, 90 degrees, and measure the spin along the \(x\)-axis. Something weird happens. Sometimes we get \(+1\), sometimes we get \(-1\). It looks like the result is no longer deterministic, but it is random in a peculiar way. In fact, if we repeat the experiment multiple times, we find that the average value of the measurement along the \(x\)-axis is \(0\).</p> <p>This generalizes to any direction. We can pick any direction \(\hat{m}\) and prepare a spin so that the apparatus measures a \(+1\) in that direction. Then, we rotate the apparatus to any other direction \(\hat{n}\). Again, the result of the measurement will still be either \(+1\) or \(-1\), but the expected value will be equal to the projection of \(\hat{m}\) onto \(\hat{n}\), that is, \(\cos{\theta}\), with \(\theta\) being the angle between the vectors.</p> <p>In <em>bra-ket notation</em> (we’ll talk about this in more detail):</p> \[\langle{\sigma}\rangle = \hat{m} \cdot\hat{n} = \cos{\theta}\] <p>Notice that this is the same result that we would expect in classical physics for some vector quantity that we measure. However, in the classical sense, the result would be deterministic. In QM, each individual result is random, but the average converges to the classical result.</p> <h2 id="vector-spaces-inner-products-bases">Vector spaces, inner products, bases.</h2> <p>In quantum mechanics the space of the states is a vector space. We call its vectors <em>kets</em>. This is a <em>ket</em>: \(\vert a\rangle\). Some simple axioms are defined in this space: the sum of two kets is a ket, ket addition is commutative and associative, and some other things like the existence of a 0 vector, an additive inverse, and linearity. 
Up to this point, everything looks normal, and the vector space we have defined is practically the same as the space of 3-vectors in Euclidean space. However, the space of our <em>kets</em> is a complex vector space: its components are complex numbers, and multiplication by a scalar extends to complex scalars too! This is a very abstract concept, but it’s absolutely needed to make the theory work mathematically.</p> <p>Now, if you know something about complex numbers, you will likely remember the complex conjugate. Briefly, given a complex number \(z\), there exists a complex conjugate \(z^*\) that we obtain by reversing the sign of the imaginary part. So if \(z=x+iy = re^{i\theta}\), then \(z^*=x-iy = re^{-i\theta}\). Note that the product of a complex number and its complex conjugate is always real and non-negative, yielding \(r^2\) in our case.</p> <p>In the same way, a complex vector space has a <em>dual</em> space that is its complex conjugate vector space. In our case, for every <em>ket</em> \(\vert A\rangle\), we can define a <em>bra</em> vector in that dual space: \(\langle A \vert\), which is essentially its complex conjugate.</p> <p>Now, if you got up to this point, I am pretty sure you remember dot products in Euclidean space, right? Well, the dot product is a specific instance of so-called <em>inner products</em>, which we can also define in our complex space. In our quantum-mechanical magical world, the inner product is defined between a <em>bra</em> and a <em>ket</em>, and the notation is</p> \[\langle B|A \rangle\] <p>and, of course, the result is a complex number. 
These products are linear, and interchanging bras and kets corresponds to complex conjugation</p> \[\langle B|A \rangle = \langle A|B \rangle ^*\] <p>Concretely, the product is computed almost exactly like the dot product: it is the sum of the products of the components, except that the components of the bra are complex-conjugated first. If \(\alpha_i\) and \(\beta_i\) are the components of \(\vert A\rangle\) and \(\vert B\rangle\), then \(\langle B \vert A\rangle = \sum_i \beta_i^* \alpha_i\).</p> <p>We can also bring with us some familiar concepts from Euclidean spaces:</p> <ul> <li>A vector is <em>normalized</em> if its inner product with itself is 1.</li> <li>Two vectors are <em>orthogonal</em> if their inner product is 0.</li> </ul> <p>In our vector space, we can also define a <em>basis</em>, which is simply a set of vectors that can be used to derive any other vector in the space through linear combinations. It’s often useful, especially in QM, to talk about <em>orthonormal bases</em>, which are <em>bases</em> where the vectors are normalized and mutually orthogonal. Note that the number of vectors in a <em>basis</em> equals the dimension of the space.</p> <h3 id="quantum-states">Quantum States</h3> <p>With some math tools under our belt, let’s try to formalize the notion of a <em>spin state</em> from earlier. For this task, we will use vectors, and create a representation that captures what we know about how spins behave.</p> <p>Let’s start by labelling all the possible spin states along the three axes. When the apparatus \(\mathcal{A}\) is oriented along the \(z\)-axis, the two possible states are \(\sigma_z = \pm1\). We can call them <em>up</em> and <em>down</em> states and assign them ket vectors \(\vert u\rangle\) and \(\vert d\rangle\). So, when \(\mathcal{A}\) is oriented along the \(z\)-axis and registers \(+1\), the system is in state \(\vert u\rangle\). 
We also define the states for the directions along the \(x\)-axis and call them \(\vert r\rangle\) and \(\vert l\rangle\) (for <em>right</em> and <em>left</em>), and finally along the \(y\)-axis which we call \(\vert i\rangle\) and \(\vert o\rangle\) (<em>in</em> and <em>out</em>).</p> <p>Now, a consequence of this formalization is that the space of the states for a single spin has only two dimensions. This means that all possible states of a spin can be represented in a two-dimensional vector space. If we choose \(\vert u\rangle\) and \(\vert d\rangle\) as the basis vectors for this space, then, by the definition of a <em>basis</em>, we can write any other state as a linear combination (or <em>superposition</em>) of these two vectors.</p> <p>Thus, a generic state \(\vert A\rangle\) comes in the form</p> \[|A\rangle = \alpha_u|u\rangle + \alpha_d|d\rangle\] <p>where \(\alpha_u\) and \(\alpha_d\) are the components of \(\vert A\rangle\) along the two basis vectors.</p> <p>We can retrieve these components through an inner product (which is essentially a projection) of the state vector on the respective basis vectors:</p> \[\alpha_u = \langle u|A\rangle \\ \alpha_d = \langle d|A\rangle\] <p>It is important to remember that these components are complex numbers, and carry no physical meaning by themselves. However, their squared magnitudes do. In fact, given that the spin has been prepared in state \(\vert A\rangle\), and that the apparatus is oriented along \(z\), the quantity \(\alpha_u^* \alpha_u\) is the probability that the spin would be measured as \(\sigma_z = +1\) (an <em>up</em> spin along the \(z\)-axis).</p> <p>Obviously, the same holds for \(\alpha_d^*\alpha_d\), which is the probability of measuring \(\sigma_z = -1\).</p> <p>To be more precise, the quantities \(\alpha_u\) and \(\alpha_d\) are called probability amplitudes and are not actual probabilities. 
To get actual probabilities, we need their squared magnitudes, so that</p> \[P_u = \langle A|u\rangle\langle u|A\rangle \\ P_d = \langle A|d\rangle\langle d|A\rangle\] <p>Something important to notice is that the state \(\vert A\rangle\) is not what we measure. \(\vert A\rangle\) is something we know <em>before</em> the measurement. It is our knowledge of the state of the system, or how the system was <em>prepared</em>. It represents the <em>potential</em> possibilities of the values of our measurements. Once we measure the spin along the \(z\)-axis, however, we can only find it in \(\vert u\rangle\) or \(\vert d\rangle\).</p> <p>Two additional points are important. First, \(\vert u\rangle\) and \(\vert d\rangle\) have to be mutually orthogonal, meaning that</p> \[\langle u|d\rangle = 0 \\ \langle d|u\rangle = 0\] <p>But what does this mean? It means that the <em>up</em> and <em>down</em> states are physically distinct and mutually exclusive. If the spin is prepared in the <em>up</em> state, then it can’t be detected to be in the <em>down</em> state, and vice versa. This is true not just for the spin, but for any quantum system.</p> <p>Another point is that, since <em>up</em> and <em>down</em> are the only possible results of the measurement, their respective probabilities need to sum to 1. Mathematically,</p> \[\alpha_u^*\alpha_u + \alpha_d^*\alpha_d =1\] <p>which is the same as saying that \(\vert A\rangle\) is normalized:</p> \[\langle A| A\rangle = 1\] <p>And this is the last thing I’ll state for today, but it’s a very important principle: <em>the state of a system is represented by a unit vector in a vector space of states. Moreover, the squared magnitudes of the components of the state-vector, along particular basis vectors, represent probabilities for various experimental outcomes.</em> Prof. 
Susskind said so, so it must be true.</p> <h3 id="what-about-next-time">What about next time?</h3> <p>To recap, we ran some spin experiments, looked at the weird results we were getting, and developed a small part of a mathematical framework to represent these results, work with them, and one day make predictions.</p> <p>I think this is enough for today, as we went through quite a lot of material. In the next episode, we’ll derive state vectors for the spin in any direction, and state the principles of quantum mechanics!</p> <p>See ya!</p>]]></content><author><name></name></author><summary type="html"><![CDATA[A long series on learning quantum mechanics from scratch]]></summary></entry></feed>