AST Datom Streams: Bytecode Performance with Semantic Preservation

The Performance Paradox

Bytecode interpreters are fast. AST interpreters are slow. This is considered a fundamental trade-off in language implementation.

But what if this trade-off is an artifact of how we represent programs, not an inherent limitation?

In datom.world, we represent ASTs as streams of immutable facts. This seemingly simple change blurs the line between bytecode and AST interpretation, unlocking a hybrid architecture that delivers bytecode-like performance while preserving full semantic queryability.

Why Bytecode is Fast

Traditional bytecode interpreters achieve high performance through four key properties:

1. Sequential Memory Access

Bytecode is a linear array. The CPU can predict and prefetch instructions. Cache lines are used efficiently:

; Python bytecode (linear)
Instruction[0]: LOAD_FAST 0
Instruction[1]: LOAD_FAST 1
Instruction[2]: BINARY_ADD
Instruction[3]: RETURN_VALUE

; Memory layout: [LOAD_FAST][LOAD_FAST][BINARY_ADD][RETURN_VALUE]
; Cache-friendly: sequential access pattern

2. Minimal Indirection

The interpreter loop directly indexes into the instruction array. No pointer chasing:

(loop [pc 0]
  (when (< pc (alength bytecode))
    (let [instruction (aget bytecode pc)]
      (execute instruction)
      (recur (inc pc)))))

3. Compact Encoding

Each instruction is a dense binary representation—often just 1-3 bytes. Thousands of instructions fit in L1 cache.

4. No Pointer Chasing

Instructions are stored contiguously. No following pointers to heap-allocated objects.
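To make these four properties concrete, here is a minimal Python sketch of a sequential dispatch loop. The opcode names echo CPython's, but the encoding (a list of (opcode, argument) pairs) is simplified for illustration and is not real CPython bytecode:

```python
def run(bytecode, locals_):
    """Sequential dispatch loop: no pointer chasing, just an index into a linear array."""
    stack = []
    pc = 0
    while pc < len(bytecode):
        op, arg = bytecode[pc]
        if op == "LOAD_FAST":
            stack.append(locals_[arg])       # push a local variable
        elif op == "BINARY_ADD":
            b, a = stack.pop(), stack.pop()  # pop two operands
            stack.append(a + b)
        elif op == "RETURN_VALUE":
            return stack.pop()
        pc += 1                              # sequential: next instruction is adjacent

program = [("LOAD_FAST", 0), ("LOAD_FAST", 1),
           ("BINARY_ADD", None), ("RETURN_VALUE", None)]
print(run(program, [10, 32]))  # prints 42
```

The loop touches memory strictly left to right, which is exactly what hardware prefetchers are built for.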

Why AST Interpretation is Slow

Traditional AST interpreters suffer from the opposite characteristics:

1. Random Memory Access

AST nodes are scattered across the heap. Traversing a tree means jumping between distant memory locations:

; AST nodes scattered in heap
Node A @ 0x1000  →  Node B @ 0x5000  →  Node C @ 0x2000
                ↓
             Node D @ 0x8000

; Cache-hostile: each traversal step likely a cache miss

2. Pointer Chasing

Each parent→child traversal requires dereferencing a pointer. Modern CPUs cannot predict these patterns:

;; Traditional AST walk
(defn eval-ast [node]
  (case (:type node)
    :literal
    (:value node)
    :application
    (let [f (eval-ast (:operator node))          ; Pointer chase
          args (map eval-ast (:operands node))]  ; More pointer chases
      (apply f args))))

3. Verbose Structures

Each AST node is a full object or map with metadata, type tags, source locations. A simple addition might be 100+ bytes.

4. Recursive Traversal

Function call overhead for visiting each node. Stack frames, register spills, return address management.
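The same pointer-chasing walk can be sketched in Python, where each node is a separate heap-allocated dict and every child access is a dereference (the names `eval_ast` and `ENV` are illustrative, not part of datom.world):

```python
ENV = {"+": lambda a, b: a + b}  # toy environment for variable lookup

def eval_ast(node):
    """Recursive tree walk: each child access dereferences a heap object."""
    t = node["type"]
    if t == "literal":
        return node["value"]
    if t == "variable":
        return ENV[node["name"]]
    if t == "application":
        f = eval_ast(node["operator"])                    # pointer chase
        args = [eval_ast(o) for o in node["operands"]]    # more pointer chases
        return f(*args)
    raise ValueError("unknown node type: %s" % t)

tree = {"type": "application",
        "operator": {"type": "variable", "name": "+"},
        "operands": [{"type": "literal", "value": 10},
                     {"type": "literal", "value": 32}]}
print(eval_ast(tree))  # prints 42
```

Every nested dict can live anywhere on the heap, so each step of the recursion is a potential cache miss, plus a Python-level call frame.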

The Hidden Assumption

The bytecode-vs-AST trade-off assumes a fundamental dichotomy:

  • Bytecode: Fast execution, erased semantics, opaque to introspection
  • AST: Slow execution, rich semantics, fully queryable

But this assumes bytecode must erase semantics and ASTs must be tree structures.

What if both assumptions are wrong?

AST as Datom Stream: The Hybrid Architecture

When you represent an AST as a stream of datoms, you unlock a hybrid execution model that combines the best of both worlds.

Two Streams, Not One

The key insight: there are two different streams at play:

Stream 1: AST Structure Stream

The datoms that describe the program structure:

[node-1 :ast/type :application]
[node-1 :ast/operator node-2]
[node-1 :ast/operands [node-3 node-4]]
[node-2 :ast/type :variable]
[node-2 :ast/name '+]
[node-3 :ast/type :literal]
[node-3 :ast/value 10]
[node-4 :ast/type :literal]
[node-4 :ast/value 32]

This is the semantic layer. It preserves:

  • What operations mean (addition, not opcode 23)
  • Variable names and source locations
  • Type information and certainty metadata
  • Transformation lineage across languages
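As a sketch, this structure stream can be modeled in Python as plain (entity, attribute, value) tuples, with a small lookup helper; `attr` is a hypothetical name, and the attribute keywords are carried over from the stream above:

```python
ast_datoms = [
    ("node-1", ":ast/type", ":application"),
    ("node-1", ":ast/operator", "node-2"),
    ("node-1", ":ast/operands", ["node-3", "node-4"]),
    ("node-2", ":ast/type", ":variable"),
    ("node-2", ":ast/name", "+"),
    ("node-3", ":ast/type", ":literal"),
    ("node-3", ":ast/value", 10),
    ("node-4", ":ast/type", ":literal"),
    ("node-4", ":ast/value", 32),
]

def attr(datoms, entity, attribute):
    """Return the value of (entity, attribute), or None if no such fact exists."""
    return next((v for e, a, v in datoms if e == entity and a == attribute), None)

assert attr(ast_datoms, "node-3", ":ast/value") == 10
assert attr(ast_datoms, "node-2", ":ast/name") == "+"
```

Because the facts are flat, any traversal order is possible: the tree shape is recoverable but never imposed.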

Stream 2: Execution Stream

The datoms that capture what to do next:

[step-1 :exec/instruction :eval-operator]
[step-1 :exec/node node-1]
[step-1 :exec/order 0]
[step-2 :exec/instruction :eval-operand]
[step-2 :exec/node node-3]
[step-2 :exec/order 1]
[step-3 :exec/instruction :eval-operand]
[step-3 :exec/node node-4]
[step-3 :exec/order 2]
[step-4 :exec/instruction :apply]
[step-4 :exec/function node-2]
[step-4 :exec/args [node-3 node-4]]
[step-4 :exec/order 3]

This is the execution layer. It has:

  • Linear sequence (ordered by :exec/order)
  • Sequential access pattern (iterate by order, cache-friendly)
  • Links to AST nodes (semantic provenance preserved)
  • Queryable execution trace (still datoms!)

Compilation as Datalog Query

Here's where it gets powerful. You compile the AST structure stream into an execution stream using Datalog:

;; Datalog query: AST datoms → Execution datoms
(defn compile-to-execution-stream [db start-node]
  (d/q '[:find ?step ?instruction ?order
         :in $ ?start
         :where
         ;; Traverse AST and emit execution steps
         (execution-order ?start ?step ?instruction ?order)]
       db
       start-node))

This compilation phase:

  • Traverses the AST structure (once)
  • Generates a linear execution sequence
  • Preserves links back to AST nodes
  • Stores the result as datoms
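A minimal Python sketch of this compilation pass, assuming datoms as plain tuples and a single emission rule for applications; `compile_to_exec` is a hypothetical helper standing in for the Datalog rule set, not the actual implementation:

```python
ast_datoms = [
    ("node-1", ":ast/type", ":application"),
    ("node-1", ":ast/operator", "node-2"),
    ("node-1", ":ast/operands", ["node-3", "node-4"]),
    ("node-2", ":ast/type", ":variable"),
    ("node-2", ":ast/name", "+"),
    ("node-3", ":ast/type", ":literal"),
    ("node-3", ":ast/value", 10),
    ("node-4", ":ast/type", ":literal"),
    ("node-4", ":ast/value", 32),
]

def compile_to_exec(datoms, root):
    """Walk the AST structure datoms once, emitting a linear execution stream."""
    def attr(e, a):
        return next((v for e2, a2, v in datoms if e2 == e and a2 == a), None)

    steps, order = [], 0

    def emit(instruction, node):
        nonlocal order
        sid = "step-%d" % (order + 1)
        steps.extend([
            (sid, ":exec/instruction", instruction),
            (sid, ":exec/node", node),     # link back to the AST: provenance
            (sid, ":exec/order", order),   # linear sequence for fast iteration
        ])
        order += 1

    if attr(root, ":ast/type") == ":application":
        emit(":eval-operator", attr(root, ":ast/operator"))
        for operand in attr(root, ":ast/operands"):
            emit(":eval-operand", operand)
        emit(":apply", root)
    return steps

exec_stream = compile_to_exec(ast_datoms, "node-1")
```

The output is itself a stream of datoms, ordered by :exec/order and still linked, step by step, to the nodes that produced it.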

Where Do AST Datoms Come From?

Before execution can begin, the program must exist as a stream of AST datoms in the database. There are two primary paths for this to happen:

1. The Native Path: Structure Editing

In the pure Datom.World model, code is not written as text. Instead, a developer uses a structure editor that directly manipulates the AST. Because datoms are immutable facts, every change asserts new datoms or retracts old ones in the database. In this path, the AST datoms are the direct medium of creation, so no separate decomposition step is needed.

2. The Ingestion Path: The Yang Compiler

To bridge with the world of text-based source code, Datom.World provides the Yang compiler. This tool parses traditional source files (like Python or Clojure) and produces an in-memory AST map. This map must then be decomposed into the flat, canonical datom representation for storage in DaoDB.

Text Source Code → Yang Compiler → AST Map → Decomposition → AST Datoms in DB

This ingestion path is crucial for interoperability. Regardless of which path is taken, the end result is the same: a canonical, queryable representation of the program as a stream of AST datoms. The following execution model begins from this point.
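As an illustration of the decomposition step, here is a sketch using Python's standard ast module. The :ast/* attribute names are invented to echo this article's schema; they are not the real Yang output format:

```python
import ast
import itertools

counter = itertools.count(1)

def decompose(node):
    """Flatten a Python AST node into (entity, attribute, value) datoms."""
    eid = "node-%d" % next(counter)
    datoms = [(eid, ":ast/type", type(node).__name__)]
    for field, value in ast.iter_fields(node):
        if isinstance(value, ast.AST):
            child_id, child_datoms = decompose(value)
            datoms.append((eid, ":ast/%s" % field, child_id))  # link to child entity
            datoms.extend(child_datoms)
        elif isinstance(value, list):
            ids = []
            for item in value:
                if isinstance(item, ast.AST):
                    child_id, child_datoms = decompose(item)
                    ids.append(child_id)
                    datoms.extend(child_datoms)
            datoms.append((eid, ":ast/%s" % field, ids))
        elif value is not None:
            datoms.append((eid, ":ast/%s" % field, value))     # leaf value
    return eid, datoms

root = ast.parse("price + tax", mode="eval").body
root_id, datoms = decompose(root)
```

Running this over `price + tax` yields a flat fact stream: a BinOp entity linked to two Name entities carrying the original identifiers.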

The Execution Model

With both streams in place, execution becomes a two-phase process:

Phase 1: Compilation (Slow, Happens Once)

AST Structure Stream (Datoms)
         ↓
    Datalog Query
         ↓
Execution Stream (Datoms)

This phase:

  • Runs Datalog queries over AST datoms
  • Generates ordered execution steps
  • Preserves semantic links
  • Can be cached/reused

Phase 2: Execution (Fast, Bytecode-like)

;; Sequential iteration over execution stream
(defn run-execution-stream [db exec-stream]
  (reduce
   (fn [state step]
     (let [instruction (:exec/instruction step)
           node (:exec/node step)]
       (execute-step state instruction node)))
   initial-state
   (sort-by :exec/order exec-stream)))

This phase:

  • Iterates sequentially over execution datoms (cache-friendly)
  • No pointer chasing (linear traversal by order)
  • No Datalog overhead (pre-compiled)
  • Links to AST preserved (can query semantics during execution)
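A Python sketch of this sequential phase, grouping flat execution facts into one map per step and folding over them in :exec/order. The instruction set and the value/function lookup tables are illustrative:

```python
from collections import defaultdict

# Flat execution facts, in the shape of the stream shown earlier.
exec_datoms = [
    ("step-1", ":exec/instruction", ":eval-operand"),
    ("step-1", ":exec/node", "node-3"),
    ("step-1", ":exec/order", 0),
    ("step-2", ":exec/instruction", ":eval-operand"),
    ("step-2", ":exec/node", "node-4"),
    ("step-2", ":exec/order", 1),
    ("step-3", ":exec/instruction", ":apply"),
    ("step-3", ":exec/function", "node-2"),
    ("step-3", ":exec/order", 2),
]
node_values = {"node-3": 10, "node-4": 32}        # literal values from the AST stream
node_functions = {"node-2": lambda a, b: a + b}   # resolved from :ast/name '+

def run_stream(datoms):
    """Group facts per step, then iterate sequentially by :exec/order."""
    steps = defaultdict(dict)
    for e, a, v in datoms:
        steps[e][a] = v
    stack = []
    for step in sorted(steps.values(), key=lambda s: s[":exec/order"]):
        ins = step[":exec/instruction"]
        if ins == ":eval-operand":
            stack.append(node_values[step[":exec/node"]])
        elif ins == ":apply":
            b, a = stack.pop(), stack.pop()
            stack.append(node_functions[step[":exec/function"]](a, b))
    return stack[-1]

assert run_stream(exec_datoms) == 42
```

Note that each step still carries its :exec/node link, so semantic queries remain available mid-run without slowing the hot loop.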

The Critical Difference from Traditional Bytecode

Traditional bytecode erases semantics:

; Python bytecode
LOAD_FAST 0    ; What variable? From where?
LOAD_FAST 1    ; Why these two?
BINARY_ADD     ; Lost: this was (+ price tax) in the source
STORE_FAST 2   ; What am I storing?

Debugging requires external debug symbols. Queries like "where did this value come from?" are impossible at runtime.

Execution datom stream preserves semantics:

[step-1 :exec/instruction :load-var]
[step-1 :exec/node node-5]           ; Link to AST
[step-1 :exec/order 0]
[node-5 :ast/type :variable]         ; Semantic meaning
[node-5 :ast/name "price"]         ; Original name
[node-5 :ast/source-location [10 5]] ; Source position

You can query during execution:

  • "What AST node created this execution step?"
  • "Show me the semantic meaning of this operation"
  • "Trace this value back to source code location"
  • "Find all execution steps that mutate variable 'count'"
  • "What was the transformation lineage of this node?"

Performance Characteristics

How does this compare to traditional approaches?

Memory Layout

Approach                 Cache Behavior                       Indirection
Traditional AST          Random access (cache-hostile)        Pointer chasing every step
Traditional Bytecode     Sequential access (cache-friendly)   Direct array indexing
Execution Datom Stream   Sequential access (cache-friendly)   Linear iteration by order

Semantic Preservation

Approach                 Queryability          Provenance
Traditional AST          Full (slow queries)   Complete
Traditional Bytecode     None (erased)         Lost
Execution Datom Stream   Full (fast queries)   Complete via links

Compilation Overhead

Approach                 Compilation                 Runtime
Traditional AST          None (interpret directly)   Slow (pointer chasing)
Traditional Bytecode     Heavy (semantic erasure)    Fast (sequential)
Execution Datom Stream   Medium (Datalog query)      Fast (sequential + semantic)

The Blurred Line

The datom stream representation blurs the distinction between bytecode and AST:

Like Bytecode:

  • Linear execution sequence
  • Sequential memory access
  • Cache-friendly iteration
  • Pre-compiled for performance

Like AST:

  • High-level semantic operations
  • Queryable structure
  • Preserves type information
  • Links to source code

Unique Properties:

  • Execution steps are facts (queryable)
  • Semantic provenance via links (AST nodes)
  • Time-travel debugging built-in (transaction history)
  • Multi-dimensional queries during execution

Execution Modes

This architecture supports multiple execution strategies:

Mode 1: Direct Interpretation

AST Datoms → Walk structure directly via queries

Pros: Simple, no compilation
Cons: Slow (Datalog query per step)

Mode 2: Compiled Execution

AST Datoms → Compile to Execution Stream → Sequential iteration

Pros: Fast (bytecode-like), semantic preservation
Cons: Compilation overhead, two representations in memory

Mode 3: JIT-Style Hybrid

Start: Direct interpretation
Detect: Hot code paths
Compile: Hot paths to execution streams
Execute: Mix of interpreted + compiled

Pros: Best of both (fast hot paths, no upfront cost)
Cons: Complexity, profile-guided optimization needed
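A toy sketch of the hot-path detection in Python, assuming a hypothetical threshold and stand-in interpret/compile functions; real profile-guided tiering is far more involved:

```python
HOT_THRESHOLD = 3   # hypothetical cutoff for "hot"
call_counts = {}
compiled = {}       # node id -> pre-compiled executable

def execute(node_id, interpret, compile_node):
    """Interpret cold code; once a node is hot, compile it and reuse the cached version."""
    call_counts[node_id] = call_counts.get(node_id, 0) + 1
    if node_id not in compiled and call_counts[node_id] >= HOT_THRESHOLD:
        compiled[node_id] = compile_node(node_id)   # hot: build execution stream
    if node_id in compiled:
        return compiled[node_id]()                  # fast: sequential iteration
    return interpret(node_id)                       # cold: direct interpretation

log = []

def interpret(node_id):
    log.append("interpreted")
    return 42

def compile_node(node_id):
    def run():
        log.append("compiled")
        return 42
    return run

for _ in range(5):
    execute("node-1", interpret, compile_node)

assert log == ["interpreted", "interpreted", "compiled", "compiled", "compiled"]
```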

Queryability During Execution

The killer feature: you can query execution state as it runs.

Example: Find All Steps Touching a Variable

[:find ?step ?instruction ?order
 :where
 [?step :exec/node ?node]
 [?step :exec/instruction ?instruction]
 [?step :exec/order ?order]
 [?node :ast/name "count"]]

Returns all execution steps that reference the variable "count". Impossible with traditional bytecode.
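Without a Datalog engine at hand, the same join can be sketched in Python over tuple datoms (the data and the `steps_touching` helper are illustrative):

```python
ast_datoms = [
    ("node-5", ":ast/type", ":variable"), ("node-5", ":ast/name", "count"),
    ("node-7", ":ast/type", ":variable"), ("node-7", ":ast/name", "total"),
]
exec_datoms = [
    ("step-1", ":exec/node", "node-5"), ("step-1", ":exec/instruction", ":load-var"),
    ("step-2", ":exec/node", "node-7"), ("step-2", ":exec/instruction", ":load-var"),
    ("step-3", ":exec/node", "node-5"), ("step-3", ":exec/instruction", ":store-var"),
]

def steps_touching(name):
    """Join execution facts against AST facts on the shared node entity."""
    nodes = {e for e, a, v in ast_datoms if a == ":ast/name" and v == name}
    return [e for e, a, v in exec_datoms if a == ":exec/node" and v in nodes]

assert steps_touching("count") == ["step-1", "step-3"]
```

The join works only because both streams speak the same currency: entity ids shared between :exec/node and the AST facts.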

Example: Trace Value Provenance

[:find ?source-node ?transform-step
 :where
 [?current-step :exec/value ?val]
 [?current-step :exec/produced-by ?transform-step]
 [?transform-step :exec/node ?source-node]
 [?source-node :ast/source-location ?loc]]

Traces where a value came from, across transformations, back to source code.

Example: Find Type Certainty Boundaries

[:find ?call-step ?arg-node
 :where
 [?call-step :exec/instruction :apply]
 [?call-step :exec/function ?fn]
 [?fn :type/certainty :static]         ; High certainty function
 [?call-step :exec/args ?args]
 [?args :contains ?arg-node]
 [?arg-node :type/certainty :dynamic]] ; Low certainty data

Finds where dynamically-typed data flows into statically-typed functions. This query is impossible in systems that erase one dimension or the other.
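The same boundary query can be sketched in Python over tuple datoms; the entities and certainty values below are made-up examples:

```python
from collections import defaultdict

type_datoms = [
    ("node-9", ":type/certainty", ":static"),    # e.g. a statically-typed function
    ("node-7", ":type/certainty", ":dynamic"),   # e.g. data parsed from JSON
    ("node-8", ":type/certainty", ":static"),
]
exec_datoms = [
    ("step-4", ":exec/instruction", ":apply"),
    ("step-4", ":exec/function", "node-9"),
    ("step-4", ":exec/args", ["node-7", "node-8"]),
]

def certainty_boundaries(exec_datoms, type_datoms):
    """Find :apply steps where dynamic data flows into a static function."""
    cert = {e: v for e, a, v in type_datoms if a == ":type/certainty"}
    steps = defaultdict(dict)
    for e, a, v in exec_datoms:
        steps[e][a] = v
    return [(sid, arg)
            for sid, facts in steps.items()
            if facts.get(":exec/instruction") == ":apply"
            and cert.get(facts.get(":exec/function")) == ":static"
            for arg in facts.get(":exec/args", [])
            if cert.get(arg) == ":dynamic"]

assert certainty_boundaries(exec_datoms, type_datoms) == [("step-4", "node-7")]
```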

Memory Trade-offs

Storing both AST and execution streams has costs:

Traditional Bytecode

Text Source Code → AST (in-memory, transient) → Bytecode (discards AST)
Memory: 1x (bytecode only)

Datom Stream Hybrid

Canonical AST (datoms) → Execution Stream (datoms, can be cached)
Memory: 1x (AST only) or 2x (both cached)

But the execution stream can be regenerated from AST datoms. You can discard it and recompile when needed:

Memory-constrained mode:
  Store: AST datoms only
  Compile: On-demand to execution stream
  Discard: Execution stream after use

Performance mode:
  Store: Both AST and execution streams
  Cache: Execution stream indefinitely
  Query: Both streams simultaneously
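The two modes amount to a caching policy around the compilation pass. A Python sketch using functools.lru_cache, where `compile_stream` is a hypothetical stand-in for the Datalog query:

```python
from functools import lru_cache

compile_calls = 0

@lru_cache(maxsize=8)           # performance mode: keep recent execution streams
def compile_stream(root_node):
    """Stand-in for the Datalog compilation pass over AST datoms."""
    global compile_calls
    compile_calls += 1
    return (root_node, "execution-stream")   # placeholder result

compile_stream("node-1")
compile_stream("node-1")        # cache hit: no recompilation
assert compile_calls == 1

compile_stream.cache_clear()    # memory-constrained mode: discard the stream
compile_stream("node-1")        # regenerated on demand from the AST datoms
assert compile_calls == 2
```

Because the execution stream is a pure function of the AST datoms, discarding it loses nothing: it can always be rebuilt.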

The Spectrum of Designs

Different use cases favor different points on the spectrum:

Mobile Agents (Datom.World Focus)

  • Priority: Semantic preservation, queryability, migration
  • Execution speed: Secondary (network latency dominates)
  • Design: Store AST datoms, interpret directly or compile on-demand
  • Memory: Minimize (agents travel light)

High-Performance Backend

  • Priority: Execution speed
  • Queryability: Nice-to-have (debugging, observability)
  • Design: Compile to execution stream, keep both in memory
  • Memory: Abundant (servers have RAM)

Live Introspection / Debugger

  • Priority: Queryability, time-travel debugging
  • Execution speed: Moderate
  • Design: Record full execution history as datoms
  • Memory: Configurable retention (sliding window)

Comparison to Existing Systems

GraalVM/Truffle

GraalVM has a Universal AST for polyglot execution. But:

  • AST is tree-structured (pointer chasing)
  • No datom representation (limited queryability)
  • JIT compilation to machine code (erases semantics)
  • Runtime introspection limited

CodeQL/Glean

CodeQL stores ASTs as queryable databases. But:

  • Static analysis only (no execution)
  • No execution stream (query structure, not runtime)
  • No time dimension (snapshots, not streams)

Erlang BEAM

BEAM bytecode preserves some semantics:

  • Function names and arities visible
  • Message passing introspectable
  • But: Still bytecode (low-level), limited queryability

Datom.World / Yin.vm

Combines all three:

  • GraalVM: Polyglot execution via Universal AST
  • CodeQL: Queryable program structure via Datalog
  • BEAM: Live introspection of running systems
  • Plus: Time-travel, mobile code, multi-dimensional queries

The Fundamental Insight

The bytecode-vs-AST trade-off is not fundamental. It's an artifact of representation choices:

Traditional Assumption:

AST = In-memory Tree Structure → Pointer Chasing → Slow
Bytecode = Linear Array → Sequential Access → Fast
Performance requires Semantic Erasure

Datom Stream Reality:

Canonical AST = Datom Stream (queryable semantic layer)
Execution Plan = Datom Stream (linear execution sequence)
Compilation = Datalog Query (AST → Execution Plan)
Performance AND Semantics → Both Preserved

The key realization: linearity is orthogonal to semantics.

  • You can have a linear representation (fast) with high-level semantics (queryable)
  • You can link execution steps to AST nodes (provenance) without pointer chasing (sequential access)
  • You can compile once (Datalog query) and execute fast (linear iteration)
  • You can preserve both representations (AST + Execution) as queryable datoms

Conclusion: The False Dichotomy

For decades, language implementers accepted a false choice:

  • Fast execution OR rich semantics
  • Bytecode performance OR AST queryability
  • Compact representation OR introspection

Datom streams reveal this as a false dichotomy. You can have both:

  • Linear execution (bytecode-like performance)
  • Semantic preservation (AST-like queryability)
  • Provenance tracking (execution links to source)
  • Time-travel (transaction history built-in)
  • Multi-dimensional queries (structure, time, types, execution)

The secret: compilation as Datalog query. The AST structure stream becomes an execution stream via queries. Both remain datoms. Both remain queryable. Performance comes from sequential iteration. Semantics come from preserved links.

This is not just a better implementation technique. It's a different computational model. When programs are datom streams and interpreters are queries, the line between data and code, between structure and execution, between compile-time and run-time blurs and recombines into something new.

This is the architecture of datom.world. Everything is a datom stream. Everything is queryable. And execution is just another dimension.

Learn more: