AST Datom Streams: Bytecode Performance with Semantic Preservation
The Performance Paradox
Bytecode interpreters are fast. AST interpreters are slow. This is considered a fundamental trade-off in language implementation.
But what if this trade-off is an artifact of how we represent programs, not an inherent limitation?
In datom.world, we represent ASTs as streams of immutable facts. This seemingly simple change blurs the line between bytecode and AST interpretation, unlocking a hybrid architecture that delivers bytecode-like performance while preserving full semantic queryability.
Why Bytecode is Fast
Traditional bytecode interpreters achieve high performance through four key properties:
1. Sequential Memory Access
Bytecode is a linear array. The CPU can predict and prefetch instructions. Cache lines are used efficiently:
```text
; Python bytecode (linear)
Instruction[0]: LOAD_FAST 0
Instruction[1]: LOAD_FAST 1
Instruction[2]: BINARY_ADD
Instruction[3]: RETURN_VALUE

; Memory layout: [LOAD_FAST][LOAD_FAST][BINARY_ADD][RETURN_VALUE]
; Cache-friendly: sequential access pattern
```

2. Minimal Indirection
The interpreter loop directly indexes into the instruction array. No pointer chasing:
```clojure
;; Core dispatch loop: direct array indexing, no pointer chasing
(loop [pc 0]
  (when (< pc (alength bytecode))
    (let [instruction (aget bytecode pc)]
      (execute instruction)
      (recur (inc pc)))))
```

3. Compact Encoding
Each instruction is a dense binary representation—often just 1-3 bytes. Thousands of instructions fit in L1 cache.
4. No Pointer Chasing
Instructions are stored contiguously. No following pointers to heap-allocated objects.
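These four properties can be seen in a minimal, runnable sketch. Below is a hypothetical stack machine in Python (the opcode names echo CPython's, but the encoding is illustrative): a flat instruction array, a program counter that only increments, and direct indexing with no pointer chasing.

```python
# Hypothetical stack machine: a linear instruction array plus a
# program-counter loop. Opcode names echo CPython's but are illustrative.
def run(bytecode, locals_):
    stack, pc = [], 0
    while pc < len(bytecode):        # sequential access: pc only increments
        op, arg = bytecode[pc]       # direct indexing, no pointer chasing
        if op == "LOAD_FAST":
            stack.append(locals_[arg])
        elif op == "BINARY_ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "RETURN_VALUE":
            return stack.pop()
        pc += 1

# The listing from above, as a compact linear program
program = [("LOAD_FAST", 0), ("LOAD_FAST", 1),
           ("BINARY_ADD", None), ("RETURN_VALUE", None)]
```

With `locals_` set to `[10, 32]`, `run` walks the array once and returns 42; the whole program sits in a few contiguous cache lines.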
Why AST Interpretation is Slow
Traditional AST interpreters suffer from the opposite characteristics:
1. Random Memory Access
AST nodes are scattered across the heap. Traversing a tree means jumping between distant memory locations:
```text
; AST nodes scattered in heap
Node A @ 0x1000 → Node B @ 0x5000 → Node C @ 0x2000
                                          ↓
                                    Node D @ 0x8000

; Cache-hostile: each traversal step is likely a cache miss
```

2. Pointer Chasing
Each parent→child traversal requires dereferencing a pointer, and hardware prefetchers cannot predict these irregular access patterns:
```clojure
;; Traditional AST walk
(defn eval-ast [node]
  (case (:type node)
    :application
    (let [f    (eval-ast (:operator node))        ; pointer chase
          args (map eval-ast (:operands node))]   ; more pointer chases
      (apply f args))))
```

3. Verbose Structures
Each AST node is a full object or map with metadata, type tags, source locations. A simple addition might be 100+ bytes.
4. Recursive Traversal
Function call overhead for visiting each node. Stack frames, register spills, return address management.
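All four costs show up in even a tiny runnable walk. Here is the same addition as a traditional AST interpreter, sketched in Python with nested dicts standing in for heap-allocated nodes (the node shapes are illustrative): every field access is a dereference, and every node costs a function call.

```python
import operator

# Traditional AST walk: nested dicts stand in for heap-allocated nodes.
# Every child access is a pointer dereference; every node is a function call.
def eval_ast(node, env):
    t = node["type"]
    if t == "literal":
        return node["value"]
    if t == "variable":
        return env[node["name"]]
    if t == "application":
        f = eval_ast(node["operator"], env)                   # pointer chase
        args = [eval_ast(o, env) for o in node["operands"]]   # more chases
        return f(*args)

ast = {"type": "application",
       "operator": {"type": "variable", "name": "+"},
       "operands": [{"type": "literal", "value": 10},
                    {"type": "literal", "value": 32}]}
env = {"+": operator.add}
```

Evaluating `ast` yields 42, but only after four recursive calls and a dereference for every field — exactly the costs itemized above.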
The Hidden Assumption
The bytecode-vs-AST trade-off assumes a fundamental dichotomy:
- Bytecode: Fast execution, erased semantics, opaque to introspection
- AST: Slow execution, rich semantics, fully queryable
But this assumes bytecode must erase semantics and ASTs must be tree structures.
What if both assumptions are wrong?
AST as Datom Stream: The Hybrid Architecture
When you represent an AST as a stream of datoms, you unlock a hybrid execution model that combines the best of both worlds.
Two Streams, Not One
The key insight: there are two different streams at play:
Stream 1: AST Structure Stream
The datoms that describe the program structure:
```clojure
[node-1 :ast/type :application]
[node-1 :ast/operator node-2]
[node-1 :ast/operands [node-3 node-4]]
[node-2 :ast/type :variable]
[node-2 :ast/name '+]
[node-3 :ast/type :literal]
[node-3 :ast/value 10]
[node-4 :ast/type :literal]
[node-4 :ast/value 32]
```

This is the semantic layer. It preserves:
- What operations mean (addition, not opcode 23)
- Variable names and source locations
- Type information and certainty metadata
- Transformation lineage across languages
Stream 2: Execution Stream
The datoms that capture what to do next:
```clojure
[step-1 :exec/instruction :eval-operator]
[step-1 :exec/node node-1]
[step-1 :exec/order 0]
[step-2 :exec/instruction :eval-operand]
[step-2 :exec/node node-3]
[step-2 :exec/order 1]
[step-3 :exec/instruction :eval-operand]
[step-3 :exec/node node-4]
[step-3 :exec/order 2]
[step-4 :exec/instruction :apply]
[step-4 :exec/function node-2]
[step-4 :exec/args [node-3 node-4]]
[step-4 :exec/order 3]
```

This is the execution layer. It has:
- Linear sequence (ordered by `:exec/order`)
- Sequential access pattern (iterate by order, cache-friendly)
- Links to AST nodes (semantic provenance preserved)
- Queryable execution trace (still datoms!)
Compilation as Datalog Query
Here's where it gets powerful. You compile the AST structure stream into an execution stream using Datalog:
```clojure
;; Datalog query: AST datoms → Execution datoms
;; `execution-order` is a Datalog rule set (passed as %) that walks
;; the AST and emits ordered execution steps.
(defn compile-to-execution-stream [db rules start-node]
  (d/q '[:find ?step ?instruction ?order
         :in $ % ?start
         :where
         (execution-order ?start ?step ?instruction ?order)]
       db
       rules
       start-node))
```

This compilation phase:
- Traverses the AST structure (once)
- Generates a linear execution sequence
- Preserves links back to AST nodes
- Stores the result as datoms
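A minimal sketch of this compilation in Python, standing in for the Datalog rule set (the entity ids and attribute names are hypothetical): a single post-order walk over indexed AST datoms emits execution datoms tagged with an order.

```python
# Sketch of compilation: post-order walk over indexed AST datoms that
# emits execution datoms of shape (step, instruction, node, order).
# Entity ids and attribute names are hypothetical.
def index(datoms):
    return {(e, a): v for e, a, v in datoms}

def compile_node(db, node, steps):
    t = db[(node, ":ast/type")]
    if t in (":literal", ":variable"):
        steps.append((f"step-{len(steps)}", ":exec/push", node, len(steps)))
    elif t == ":application":
        for operand in db[(node, ":ast/operands")]:   # children first...
            compile_node(db, operand, steps)
        steps.append((f"step-{len(steps)}", ":exec/apply",
                      db[(node, ":ast/operator")], len(steps)))  # ...then apply
    return steps

ast_datoms = [
    ("node-1", ":ast/type", ":application"),
    ("node-1", ":ast/operator", "node-2"),
    ("node-1", ":ast/operands", ("node-3", "node-4")),
    ("node-2", ":ast/type", ":variable"),
    ("node-2", ":ast/name", "+"),
    ("node-3", ":ast/type", ":literal"),
    ("node-3", ":ast/value", 10),
    ("node-4", ":ast/type", ":literal"),
    ("node-4", ":ast/value", 32),
]
exec_stream = compile_node(index(ast_datoms), "node-1", [])
```

The tree is traversed once; what comes out is a flat, ordered list of steps, each still pointing at its AST node.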
Where Do AST Datoms Come From?
Before execution can begin, the program must exist as a stream of AST datoms in the database. There are two primary paths for this to happen:
1. The Native Path: Structure Editing
In the pure Datom.World model, code is not written as text. Instead, a developer uses a structure editor that directly manipulates the AST. Every change creates, updates, or retracts datoms in the database. In this path, the AST datoms are the medium of creation itself, so no separate decomposition step is needed.
2. The Ingestion Path: The Yang Compiler
To bridge with the world of text-based source code, Datom.World provides the Yang compiler. This tool parses traditional source files (like Python or Clojure) and produces an in-memory AST map. This map must then be decomposed into the flat, canonical datom representation for storage in DaoDB.
```text
Text Source Code → Yang Compiler → AST Map → Decomposition → AST Datoms in DB
```

This ingestion path is crucial for interoperability. Regardless of which path is taken, the end result is the same: a canonical, queryable representation of the program as a stream of AST datoms. The following execution model begins from this point.
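A sketch of that decomposition step, assuming a nested-map AST like the Yang compiler's output (the map shapes, id scheme, and `:ast/*` attribute names are hypothetical): walk the map, mint an entity id per node, and emit one datom per attribute.

```python
import itertools

# Sketch of decomposition: flatten a nested AST map into datoms.
# One entity id per node, one (entity, attribute, value) fact per field.
# Map shapes and the :ast/* attribute names are hypothetical.
def decompose(node, datoms, ids):
    eid = f"node-{next(ids)}"
    for key, val in node.items():
        if isinstance(val, dict):                  # a single child node
            datoms.append((eid, f":ast/{key}", decompose(val, datoms, ids)))
        elif isinstance(val, list):                # ordered children
            children = tuple(decompose(c, datoms, ids) for c in val)
            datoms.append((eid, f":ast/{key}", children))
        else:                                      # scalar attribute
            datoms.append((eid, f":ast/{key}", val))
    return eid

ast_map = {"type": "application",
           "operator": {"type": "variable", "name": "+"},
           "operands": [{"type": "literal", "value": 10},
                        {"type": "literal", "value": 32}]}
datoms = []
root = decompose(ast_map, datoms, itertools.count(1))
```

The nested map becomes nine flat facts rooted at `node-1`, ready to transact into the database.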
The Execution Model
With both streams in place, execution becomes a two-phase process:
Phase 1: Compilation (Slow, Happens Once)
```text
AST Structure Stream (Datoms)
            ↓
      Datalog Query
            ↓
Execution Stream (Datoms)
```

This phase:
- Runs Datalog queries over AST datoms
- Generates ordered execution steps
- Preserves semantic links
- Can be cached/reused
Phase 2: Execution (Fast, Bytecode-like)
```clojure
;; Sequential iteration over the execution stream
(defn run-execution-stream [initial-state exec-stream]
  (reduce
   (fn [state step]
     (let [instruction (:exec/instruction step)
           node        (:exec/node step)]
       (execute-step state instruction node)))
   initial-state
   (sort-by :exec/order exec-stream)))
```

This phase:
- Iterates sequentially over execution datoms (cache-friendly)
- No pointer chasing (linear traversal by order)
- No Datalog overhead (pre-compiled)
- Links to AST preserved (can query semantics during execution)
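A sketch of this fast phase in Python (the step and attribute shapes are illustrative, not DaoDB's actual schema): a flat loop over ordered steps with a value stack, where the `:exec` node link lets the interpreter read values and names straight from AST datoms.

```python
import operator

# Phase 2 sketch: flat loop over a pre-compiled execution stream for
# (+ 10 32). Step and attribute shapes are illustrative, not DaoDB's schema.
ast_index = {
    ("node-2", ":ast/name"): "+",
    ("node-3", ":ast/value"): 10,
    ("node-4", ":ast/value"): 32,
}
exec_stream = [
    ("step-1", ":exec/push-literal", "node-3", 0),
    ("step-2", ":exec/push-literal", "node-4", 1),
    ("step-3", ":exec/apply", "node-2", 2),
]

def run_stream(stream, db, env):
    stack = []
    for _step, instr, node, _order in sorted(stream, key=lambda s: s[3]):
        if instr == ":exec/push-literal":
            stack.append(db[(node, ":ast/value")])   # value read from AST datom
        elif instr == ":exec/apply":
            f = env[db[(node, ":ast/name")]]         # semantic link: the node
            b, a = stack.pop(), stack.pop()          # still knows it is "+"
            stack.append(f(a, b))
    return stack.pop()

result = run_stream(exec_stream, ast_index, {"+": operator.add})
```

Execution is one linear pass, yet every step can still answer "which AST node am I?" by following its node link.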
The Critical Difference from Traditional Bytecode
Traditional bytecode erases semantics:
```text
; Python bytecode
LOAD_FAST 0    ; What variable? From where?
LOAD_FAST 1    ; Why these two?
BINARY_ADD     ; Lost: this was (+ price tax) in the source
STORE_FAST 2   ; What am I storing?
```

Debugging requires external debug symbols. Queries like "where did this value come from?" are impossible at runtime.
Execution datom stream preserves semantics:
```clojure
[step-1 :exec/instruction :load-var]
[step-1 :exec/node node-5]               ; Link to AST
[step-1 :exec/order 0]

[node-5 :ast/type :variable]             ; Semantic meaning
[node-5 :ast/name "price"]               ; Original name
[node-5 :ast/source-location [10 5]]     ; Source position
```

You can query during execution:
- "What AST node created this execution step?"
- "Show me the semantic meaning of this operation"
- "Trace this value back to source code location"
- "Find all execution steps that mutate variable 'count'"
- "What was the transformation lineage of this node?"
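The first of these queries can be sketched in Python over datoms-as-tuples (the attribute names are hypothetical): it is a join between the execution stream and the AST structure stream.

```python
# Query sketch: which execution steps touch the variable "count"?
# A join across the execution stream and the AST structure stream.
# Attribute names are hypothetical.
datoms = [
    ("step-1", ":exec/node", "node-5"),
    ("step-1", ":exec/instruction", ":load-var"),
    ("step-2", ":exec/node", "node-6"),
    ("step-2", ":exec/instruction", ":store-var"),
    ("node-5", ":ast/name", "count"),
    ("node-6", ":ast/name", "total"),
]

def steps_touching(datoms, var_name):
    # First find the AST nodes naming the variable, then the steps linked to them
    nodes = {e for e, a, v in datoms if a == ":ast/name" and v == var_name}
    return sorted(e for e, a, v in datoms if a == ":exec/node" and v in nodes)

steps = steps_touching(datoms, "count")
```

In traditional bytecode, the name `count` is gone by execution time; here the join is trivial because both streams are facts.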
Performance Characteristics
How does this compare to traditional approaches?
Memory Layout
| Approach | Cache Behavior | Indirection |
|---|---|---|
| Traditional AST | Random access (cache-hostile) | Pointer chasing every step |
| Traditional Bytecode | Sequential access (cache-friendly) | Direct array indexing |
| Execution Datom Stream | Sequential access (cache-friendly) | Linear iteration by order |
Semantic Preservation
| Approach | Queryability | Provenance |
|---|---|---|
| Traditional AST | Full (slow queries) | Complete |
| Traditional Bytecode | None (erased) | Lost |
| Execution Datom Stream | Full (fast queries) | Complete via links |
Compilation Overhead
| Approach | Compilation | Runtime |
|---|---|---|
| Traditional AST | None (interpret directly) | Slow (pointer chasing) |
| Traditional Bytecode | Heavy (semantic erasure) | Fast (sequential) |
| Execution Datom Stream | Medium (Datalog query) | Fast (sequential + semantic) |
The Blurred Line
The datom stream representation blurs the distinction between bytecode and AST:
Like Bytecode:
- Linear execution sequence
- Sequential memory access
- Cache-friendly iteration
- Pre-compiled for performance
Like AST:
- High-level semantic operations
- Queryable structure
- Preserves type information
- Links to source code
Unique Properties:
- Execution steps are facts (queryable)
- Semantic provenance via links (AST nodes)
- Time-travel debugging built-in (transaction history)
- Multi-dimensional queries during execution
Execution Modes
This architecture supports multiple execution strategies:
Mode 1: Direct Interpretation
```text
AST Datoms → Walk structure directly via queries

Pros: Simple, no compilation
Cons: Slow (Datalog query per step)
```

Mode 2: Compiled Execution
```text
AST Datoms → Compile to Execution Stream → Sequential iteration

Pros: Fast (bytecode-like), semantic preservation
Cons: Compilation overhead, two representations in memory
```

Mode 3: JIT-Style Hybrid
```text
Start:   Direct interpretation
Detect:  Hot code paths
Compile: Hot paths to execution streams
Execute: Mix of interpreted + compiled

Pros: Best of both (fast hot paths, no upfront cost)
Cons: Complexity, profile-guided optimization needed
```

Queryability During Execution
The killer feature: you can query execution state as it runs.
Example: Find All Steps Touching a Variable
```clojure
[:find ?step ?instruction ?order
 :where
 [?step :exec/node ?node]
 [?step :exec/instruction ?instruction]
 [?step :exec/order ?order]
 [?node :ast/name "count"]]
```

Returns all execution steps that reference the variable `count`. Impossible with traditional bytecode.
Example: Trace Value Provenance
```clojure
[:find ?source-node ?transform-step
 :where
 [?current-step :exec/value ?val]
 [?current-step :exec/produced-by ?transform-step]
 [?transform-step :exec/node ?source-node]
 [?source-node :ast/source-location ?loc]]
```

Traces where a value came from, across transformations, back to its source code location.
Example: Find Type Certainty Boundaries
```clojure
[:find ?call-step ?arg-node
 :where
 [?call-step :exec/instruction :apply]
 [?call-step :exec/function ?fn]
 [?fn :type/certainty :static]            ; High-certainty function
 [?call-step :exec/args ?args]
 [?args :contains ?arg-node]
 [?arg-node :type/certainty :dynamic]]    ; Low-certainty data
```

Finds where dynamically-typed data flows into statically-typed functions. This query is impossible in systems that erase one dimension or the other.
Memory Trade-offs
Storing both AST and execution streams has costs:
Traditional Bytecode
```text
Text Source Code → AST (in-memory, transient) → Bytecode (discards AST)

Memory: 1x (bytecode only)
```

Datom Stream Hybrid

```text
Canonical AST (datoms) → Execution Stream (datoms, can be cached)

Memory: 1x (AST only) or 2x (both cached)
```

But the execution stream can be regenerated from AST datoms. You can discard it and recompile when needed:
Memory-constrained mode:

```text
Store:   AST datoms only
Compile: On-demand to execution stream
Discard: Execution stream after use
```

Performance mode:

```text
Store: Both AST and execution streams
Cache: Execution stream indefinitely
Query: Both streams simultaneously
```

The Spectrum of Designs
Different use cases favor different points on the spectrum:
Mobile Agents (Datom.World Focus)
- Priority: Semantic preservation, queryability, migration
- Execution speed: Secondary (network latency dominates)
- Design: Store AST datoms, interpret directly or compile on-demand
- Memory: Minimize (agents travel light)
High-Performance Backend
- Priority: Execution speed
- Queryability: Nice-to-have (debugging, observability)
- Design: Compile to execution stream, keep both in memory
- Memory: Abundant (servers have RAM)
Live Introspection / Debugger
- Priority: Queryability, time-travel debugging
- Execution speed: Moderate
- Design: Record full execution history as datoms
- Memory: Configurable retention (sliding window)
Comparison to Existing Systems
GraalVM/Truffle
GraalVM has a Universal AST for polyglot execution. But:
- AST is tree-structured (pointer chasing)
- No datom representation (limited queryability)
- JIT compilation to machine code (erases semantics)
- Runtime introspection limited
CodeQL/Glean
CodeQL stores ASTs as queryable databases. But:
- Static analysis only (no execution)
- No execution stream (query structure, not runtime)
- No time dimension (snapshots, not streams)
Erlang BEAM
BEAM bytecode preserves some semantics:
- Function names and arities visible
- Message passing introspectable
- But: Still bytecode (low-level), limited queryability
Datom.World / Yin.vm
Combines all three:
- GraalVM: Polyglot execution via Universal AST
- CodeQL: Queryable program structure via Datalog
- BEAM: Live introspection of running systems
- Plus: Time-travel, mobile code, multi-dimensional queries
The Fundamental Insight
The bytecode-vs-AST trade-off is not fundamental. It's an artifact of representation choices:
Traditional Assumption:
```text
AST = In-memory Tree Structure → Pointer Chasing → Slow
Bytecode = Linear Array → Sequential Access → Fast
Performance requires Semantic Erasure
```

Datom Stream Reality:

```text
Canonical AST = Datom Stream (queryable semantic layer)
Execution Plan = Datom Stream (linear execution sequence)
Compilation = Datalog Query (AST → Execution Plan)
Performance AND Semantics → Both Preserved
```

The key realization: linearity is orthogonal to semantics.
- You can have a linear representation (fast) with high-level semantics (queryable)
- You can link execution steps to AST nodes (provenance) without pointer chasing (sequential access)
- You can compile once (Datalog query) and execute fast (linear iteration)
- You can preserve both representations (AST + Execution) as queryable datoms
Conclusion: The False Dichotomy
For decades, language implementers accepted a false choice:
- Fast execution OR rich semantics
- Bytecode performance OR AST queryability
- Compact representation OR introspection
Datom streams reveal this as a false dichotomy. You can have both:
- Linear execution (bytecode-like performance)
- Semantic preservation (AST-like queryability)
- Provenance tracking (execution links to source)
- Time-travel (transaction history built-in)
- Multi-dimensional queries (structure, time, types, execution)
The secret: compilation as Datalog query. The AST structure stream becomes an execution stream via queries. Both remain datoms. Both remain queryable. Performance comes from sequential iteration. Semantics come from preserved links.
This is not just a better implementation technique. It's a different computational model. When programs are datom streams and interpreters are queries, the line between data and code, between structure and execution, between compile-time and run-time blurs and recombines into something new.
This is the architecture of datom.world. Everything is a datom stream. Everything is queryable. And execution is just another dimension.
Learn more:
- AST as Higher Dimensional Construction of Datom Streams (the five dimensions of AST datoms)
- Universal AST vs Assembly (why AST is low-level semantics, not low-level execution)
- Yin.vm: Chinese Characters for Programming Languages (the Universal Semantic AST)
- DaoDB (the Datalog database powering datom streams)
- Yin VM Documentation (technical implementation details)