Universal AST vs Assembly: High-Level Semantics in Low-Level Form

The Observation

Look at the Universal AST expressed as datom streams:

[node-1 :ast/type :data-with-operations]
[node-1 :ast/name "Counter"]
[node-1 :ast/constructor constructor-1]
[constructor-1 :ast/type :function]
[state-1 :ast/name "count"]
[state-1 :ast/mutability :mutable]
[op-1 :ast/name "increment"]
[op-1 :ast/mutates state-1]

Compare it to assembly code (generic pseudo-assembly, not any one instruction set):

MOV  R1, #0          ; count = 0
ADD  R1, R1, #1      ; count++
PUSH R1              ; save state
CALL increment       ; call function
POP  R1              ; restore state

They look similar. Both are flat sequences of small, explicit statements: instructions in one case, facts in the other. Both reference entities by identifier. Both are verbose. Is the Universal AST just assembly code for semantics?

Yes and No

The Universal AST is like assembly in form, but fundamentally different in abstraction level.

Like Assembly:

  • Flat structure (no syntax sugar, everything explicit)
  • Low-level primitives (datoms instead of machine instructions)
  • References by ID (node-1, state-1 like register names)
  • Verbose (every relationship is a separate fact)
  • Canonical (single representation, not multiple high-level languages)
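The flat, ID-based form can be sketched in a few lines of Python. This is a minimal sketch, assuming datoms are plain (entity, attribute, value) triples; the tuple encoding and the `entity_view` helper are illustrative assumptions, not Yin.vm's API. It also shows one place where the resemblance to an instruction listing already breaks down: the order of the facts carries no meaning.

```python
from collections import defaultdict

# Datoms from the opening example, encoded as (entity, attribute, value) triples.
# The tuple encoding is an assumption made for illustration.
DATOMS = [
    ("node-1", ":ast/type", ":data-with-operations"),
    ("node-1", ":ast/name", "Counter"),
    ("node-1", ":ast/constructor", "constructor-1"),
    ("constructor-1", ":ast/type", ":function"),
    ("state-1", ":ast/name", "count"),
    ("state-1", ":ast/mutability", ":mutable"),
    ("op-1", ":ast/name", "increment"),
    ("op-1", ":ast/mutates", "state-1"),
]

def entity_view(datoms):
    """Group flat datoms into one attribute map per entity."""
    entities = defaultdict(dict)
    for entity, attribute, value in datoms:
        entities[entity][attribute] = value
    return dict(entities)

# Unlike an instruction listing, shuffling the facts changes nothing.
assert entity_view(DATOMS) == entity_view(list(reversed(DATOMS)))
print(entity_view(DATOMS)["op-1"])
```

The sketch treats every attribute as single-valued, which holds for this example but would need revisiting for attributes that can repeat on one entity.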

Unlike Assembly:

  • Semantic, not mechanical (types, functions, scopes vs registers, memory addresses)
  • High-level abstractions preserved ("data-with-operations" not "load/store")
  • Queryable (Datalog queries vs linear scan)
  • Time-aware (transactions preserve history)
  • Multi-dimensional (structure, time, types, language, execution)
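The "time-aware" point can be made concrete with a small sketch. It assumes each datom carries a transaction number and an added/retracted flag, in the style of Datomic-like stores; the five-element tuple layout and the `as_of` helper are assumptions for illustration, not DaoDB's actual format.

```python
# Each entry: (entity, attribute, value, tx, added). Hypothetical layout.
# The log is assumed to be ordered by transaction number.
LOG = [
    ("state-1", ":ast/name", "count", 1, True),
    ("state-1", ":ast/mutability", ":mutable", 1, True),
    # Transaction 2 changes the mutability: retract the old fact, assert the new one.
    ("state-1", ":ast/mutability", ":mutable", 2, False),
    ("state-1", ":ast/mutability", ":immutable", 2, True),
]

def as_of(log, tx):
    """Return the set of (entity, attribute, value) facts that hold after transaction `tx`."""
    facts = set()
    for entity, attribute, value, fact_tx, added in log:
        if fact_tx > tx:
            continue  # this fact belongs to a later transaction
        if added:
            facts.add((entity, attribute, value))
        else:
            facts.discard((entity, attribute, value))
    return facts

print(as_of(LOG, 1))  # the program as it was after transaction 1
print(as_of(LOG, 2))  # the program after the change
```

Nothing is overwritten: the earlier state stays reachable by asking for an earlier transaction, which an assembly listing has no equivalent for.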

Assembly for What Layer?

The key question: assembly for what abstraction layer?

Machine Assembly: Hardware Abstraction

Machine assembly is the lowest level before hardware:

ADD R1, R2, R3    ; Hardware will add two registers

It abstracts:

  • Transistors → registers
  • Voltage levels → bits
  • Circuit paths → instructions

Below assembly is hardware. Above it is everything else (high-level languages, compilers, interpreters).

Universal AST: Semantic Abstraction

The Universal AST is the most primitive form of the semantic domain, with syntax as the only thing beneath it:

[node-1 :ast/type :map-operation]    ; Semantic operation

It abstracts:

  • Characters → AST nodes
  • Syntax → semantics
  • Language-specific constructs → semantic primitives

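Below the AST is syntax (text). Above it is... nothing. The AST is the highest semantic level. It does not have to be compiled into anything else to count as the program; compiling it to bytecode is an optional optimization. It is the program.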

The Inversion

This reveals an architectural inversion:

Traditional Stack:

High Level:  Python/Java/C++       (source code)
             ↓ compile
Mid Level:   Bytecode/IR            (intermediate)
             ↓ compile
Low Level:   Assembly               (machine code)
             ↓ assemble
Hardware:    CPU instructions

The source code is high-level and abstract. Assembly is low-level and concrete.

Yin.vm Stack:

Semantic:    Universal AST (datoms) (canonical code)
             ↓ render
Syntax:      Python/Java/C++        (views)
             ↓ compile (optional)
Execution:   Bytecode/Assembly      (optimization)

The Universal AST is canonical. Source code is a rendering. Assembly is an optimization.

The AST looks like assembly because it's the lowest level of the semantic domain, just as assembly is the lowest level of the execution domain.

Why This Matters: Different Optimization Targets

Compilers that target assembly optimize for execution speed. The Universal AST is designed for semantic preservation.

Assembly Optimizations:

  • Register allocation (minimize memory access)
  • Instruction scheduling (minimize pipeline stalls)
  • Dead code elimination (remove unused instructions)
  • Loop unrolling (reduce branching overhead)

All focused on: "How fast can the CPU execute this?"

Universal AST Optimizations:

  • Semantic equivalence (preserve meaning across languages)
  • Queryability (enable cross-dimensional Datalog queries)
  • Provenance tracking (preserve transformation history)
  • Type certainty (track static/dynamic confidence)

All focused on: "How completely can we preserve and query the semantics?"

Composability: The Key Difference

Assembly instructions don't compose well:

; Adding two numbers
MOV R1, #5
MOV R2, #3
ADD R3, R1, R2

; Adding two other numbers
MOV R4, #10
MOV R5, #7
ADD R6, R4, R5

There's no abstraction. You can't query "show me all additions" without scanning every instruction and pattern-matching opcodes.

Universal AST datoms compose through Datalog:

; Find all additions
[:find ?node
 :where
 [?node :ast/type :addition]]

; Find all operations on variable 'count'
[:find ?op
 :where
 [?op :ast/mutates ?var]
 [?var :ast/name "count"]]

The low-level form enables high-level queries. This is the opposite of assembly, where low-level form prevents high-level reasoning.
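To show why flat triples make this kind of query cheap to support, here is a minimal sketch of a Datalog-style pattern matcher in Python. It is not how DaoDB or any real Datalog engine is implemented: it uses a naive nested-loop join, and the `match` and `query` helpers, the tuple encoding, and the `add-1`/`add-2` entities are assumptions for illustration.

```python
def is_var(term):
    """Datalog-style variables are strings starting with '?'."""
    return isinstance(term, str) and term.startswith("?")

def match(pattern, datom, bindings):
    """Unify one pattern with one datom; return extended bindings or None."""
    new = dict(bindings)
    for term, value in zip(pattern, datom):
        if is_var(term):
            # Bind the variable, or check it against an earlier binding.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new

def query(find, where, datoms):
    """Evaluate a conjunction of patterns with a naive nested-loop join."""
    results = [{}]
    for pattern in where:
        results = [b2 for b in results for d in datoms
                   if (b2 := match(pattern, d, b)) is not None]
    return {tuple(b[v] for v in find) for b in results}

DATOMS = [
    ("add-1", ":ast/type", ":addition"),
    ("add-2", ":ast/type", ":addition"),
    ("state-1", ":ast/name", "count"),
    ("op-1", ":ast/name", "increment"),
    ("op-1", ":ast/mutates", "state-1"),
]

# Find all additions
print(query(["?node"], [("?node", ":ast/type", ":addition")], DATOMS))

# Find all operations on variable 'count'
print(query(["?op"],
            [("?op", ":ast/mutates", "?var"), ("?var", ":ast/name", "count")],
            DATOMS))
```

The whole matcher fits in about twenty lines because every fact has the same shape. Doing the same over assembly would require decoding opcodes and operands and tracking which register holds which value.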

The RISC Analogy

A better analogy might be RISC vs CISC, but for semantics instead of instructions.

CISC (Complex Instruction Set):

  • Rich, high-level instructions
  • One instruction can compare two strings (for example, x86 REPE CMPSB)
  • Lots of special cases
  • Hard to optimize (complex semantics per instruction)

RISC (Reduced Instruction Set):

  • Small set of simple instructions
  • String comparison built from LOAD, CMP, BRANCH
  • Everything composed from primitives
  • Easy to optimize (simple semantics per instruction)

The Universal AST is "RISC for semantics":

  • Small set of semantic primitives (data, computation, control flow, scope, effects)
  • Complex constructs composed from primitives (classes → data-with-operations)
  • Easy to query (simple, orthogonal datoms)
  • Easy to transform (compose primitives differently for different languages)
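A minimal sketch of the "classes → data-with-operations" composition, assuming a hypothetical `lower_class` helper and an ID scheme invented for illustration. The attributes `:ast/state` and `:ast/operation` are also assumptions; only `:ast/type`, `:ast/name`, `:ast/mutability`, and `:ast/mutates` appear in the example at the top of this post.

```python
from itertools import count

def lower_class(name, fields, methods):
    """Expand a class-like description into primitive datoms.

    fields:  list of field names (all treated as mutable in this sketch)
    methods: dict mapping method name -> list of field names it mutates
    """
    ids = count(1)
    node = f"node-{next(ids)}"
    datoms = [(node, ":ast/type", ":data-with-operations"),
              (node, ":ast/name", name)]
    field_ids = {}
    for field in fields:
        fid = field_ids[field] = f"state-{next(ids)}"
        datoms += [(node, ":ast/state", fid),          # hypothetical attribute
                   (fid, ":ast/name", field),
                   (fid, ":ast/mutability", ":mutable")]
    for method, mutated in methods.items():
        oid = f"op-{next(ids)}"
        datoms += [(node, ":ast/operation", oid),      # hypothetical attribute
                   (oid, ":ast/name", method)]
        datoms += [(oid, ":ast/mutates", field_ids[f]) for f in mutated]
    return datoms

for datom in lower_class("Counter", ["count"], {"increment": ["count"]}):
    print(datom)
```

No "class" primitive is needed: the construct is expressed entirely as data, operations, and the relationships between them, which is what lets the same datoms be recomposed for a language without classes.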

Why It Looks Like Assembly: Explicit Is Better

High-level code hides details:

# Python
items.sort()

// Java
Collections.sort(items);

;; Clojure
(sort items)

What's hidden:

  • Is this in-place or pure?
  • What's the comparison function?
  • What's the algorithm?
  • Does it mutate?
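The mutation question is easy to get wrong because the call sites look so similar. In Python, for instance, `list.sort()` sorts in place and returns `None`, while the built-in `sorted()` leaves its argument untouched and returns a new list:

```python
items = [3, 1, 2]

# sorted() is pure with respect to its argument: a new list comes back.
fresh = sorted(items)
assert fresh == [1, 2, 3]
assert items == [3, 1, 2]      # original order is untouched

# list.sort() sorts in place and returns None.
result = items.sort()
assert result is None
assert items == [1, 2, 3]      # the original list was mutated
```

Nothing in the source text of either call says which behavior you get; you have to know the library.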

The Universal AST makes it explicit. Here, for a pure sort that returns a new collection:

[node-1 :ast/type :sort-operation]
[node-1 :ast/collection items]
[node-1 :ast/comparator default-compare]
[node-1 :ast/mutates? false]           ; Pure
[node-1 :ast/algorithm :timsort]
[node-1 :ast/returns new-collection]

This looks verbose (like assembly). But the verbosity is semantic richness. Every fact is queryable. Every relationship is explicit.

Assembly is verbose about execution steps. The Universal AST is verbose about semantic relationships.

The Human Readability Problem

Assembly is hard to read because humans think in high-level abstractions, not register operations.

Is the Universal AST hard to read for the same reason?

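No. Because you don't read the Universal AST directly. You read the rendered syntax. For the pure sort described above:

# Python view
sorted(items)

// Java view (Java 16+)
items.stream().sorted().toList();

;; Clojure view
(sort items)

All three are views of the same Universal AST. The in-place calls items.sort() and Collections.sort(items) would be views of a different AST, one where :ast/mutates? is true. You pick the syntax you prefer. The canonical form (datoms) is for machines to query, not for humans to read.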
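Rendering can be sketched as a function from datoms to text. The sketch below assumes a hypothetical `render` helper and hard-coded string templates per target language; a real renderer would need far more than templates. It picks the call form from the `:ast/mutates?` fact, since a pure sort and an in-place sort are different programs. The pure Java template assumes Java 16+ for `Stream.toList()`.

```python
# Templates keyed by (language, mutates?). All entries are illustrative.
TEMPLATES = {
    ("python", False): "sorted({coll})",
    ("python", True): "{coll}.sort()",
    ("java", False): "{coll}.stream().sorted().toList()",
    ("java", True): "Collections.sort({coll});",
    ("clojure", False): "(sort {coll})",
}

def render(datoms, node, language):
    """Render one :sort-operation node as source text in `language`."""
    facts = {a: v for e, a, v in datoms if e == node}
    if facts[":ast/type"] != ":sort-operation":
        raise ValueError(f"cannot render {facts[':ast/type']}")
    key = (language, facts[":ast/mutates?"])
    if key not in TEMPLATES:
        raise ValueError(f"no {language} rendering for mutates?={key[1]}")
    return TEMPLATES[key].format(coll=facts[":ast/collection"])

SORT_NODE = [
    ("node-1", ":ast/type", ":sort-operation"),
    ("node-1", ":ast/collection", "items"),
    ("node-1", ":ast/mutates?", False),
]

for language in ("python", "java", "clojure"):
    print(render(SORT_NODE, "node-1", language))
```

Because the mutation fact is part of the AST, the renderer can refuse to produce a view that would silently change the program's meaning, as it does here for an in-place sort in Clojure.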

Similarly, you don't read assembly directly. You read C/C++/Rust, which compiles to assembly. But in Yin.vm, the direction is reversed: the AST is canonical, and syntax is a projection of it.

Assembly Compiles Down, AST Renders Up

This is the key inversion:

Traditional: Compile Down

C code → Assembly → Machine code
(lose semantics at each step)

Each step loses information. Assembly doesn't know what a "struct" is. Machine code doesn't know what a "loop" is. Information flows downward and is lost.

Yin.vm: Render Up

Universal AST → Python/Java/C++ syntax
(all semantics preserved, only syntax changes)

No information is lost. The AST contains complete semantics. Syntax is just a display choice. Information is projected upward without loss.

What About Performance?

Assembly exists because hardware needs instructions. Does the Universal AST introduce overhead?

Not at execution time. The Universal AST is the canonical representation, not the execution format. You can:

  1. Keep the AST in memory for queries (DaoDB)
  2. Compile to optimized bytecode for execution
  3. Preserve both simultaneously

Example:

  • AST (datoms) stored in DaoDB for queryability
  • Bytecode compiled from AST for fast execution
  • Both coexist. AST for introspection, bytecode for speed
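A minimal sketch of that coexistence, assuming a toy expression AST with hypothetical `:ast/left`, `:ast/right`, and `:ast/value` attributes, and with Python closures standing in for bytecode. The `compile_node` helper is an illustration, not Yin.vm's compiler.

```python
import operator

DATOMS = [
    ("n1", ":ast/type", ":addition"),
    ("n1", ":ast/left", "n2"),
    ("n1", ":ast/right", "n3"),
    ("n2", ":ast/type", ":literal"),
    ("n2", ":ast/value", 5),
    ("n3", ":ast/type", ":literal"),
    ("n3", ":ast/value", 3),
]

def compile_node(datoms, node):
    """Compile a node to a zero-argument callable; the datoms stay untouched."""
    facts = {a: v for e, a, v in datoms if e == node}
    if facts[":ast/type"] == ":literal":
        value = facts[":ast/value"]
        return lambda: value
    if facts[":ast/type"] == ":addition":
        left = compile_node(datoms, facts[":ast/left"])
        right = compile_node(datoms, facts[":ast/right"])
        return lambda: operator.add(left(), right())
    raise ValueError(f"unknown node type {facts[':ast/type']}")

run = compile_node(DATOMS, "n1")      # execution path: no datom lookups at run time
assert run() == 8

# The AST is still available for introspection after compilation.
additions = [e for e, a, v in DATOMS if a == ":ast/type" and v == ":addition"]
assert additions == ["n1"]
```

All datom lookups happen once, at compile time. The compiled callable and the queryable facts then live side by side, at the cost of the memory needed to hold both.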

This is what the blog post on type systems calls "full-speed-plus-full-introspection". You get compiled-code performance and AST-level queryability.

The Hierarchy

So where does the Universal AST sit in the abstraction hierarchy?

Abstraction Level     | Purpose
─────────────────────────────────────────────
Universal AST         | Semantic preservation
  (datoms)            | Queryability
                      | Cross-language translation
─────────────────────────────────────────────
Source Code           | Human readability
  (Python/Java/C++)   | Syntax preferences
─────────────────────────────────────────────
Bytecode/IR           | Execution optimization
  (JVM/LLVM)          | Platform independence
─────────────────────────────────────────────
Assembly              | Hardware abstraction
  (x86/ARM)           | Direct CPU control
─────────────────────────────────────────────
Machine Code          | Execution
  (binary)            | What hardware runs

The Universal AST is above source code in semantic level, but looks like assembly in form because it's the lowest level of its domain (semantics).

Conclusion: Low-Level Form, High-Level Semantics

Yes, the Universal AST expressed as datom streams looks like assembly. But:

  • Assembly is low-level execution (registers, memory, CPU)
  • Universal AST is low-level semantics (types, functions, scope)

Assembly optimizes for hardware execution. The Universal AST optimizes for semantic preservation and queryability.

Assembly loses information (source code → assembly). The Universal AST preserves information (AST → syntax views).

The similarity in form comes from both being canonical, low-level representations of their respective domains. But assembly sits at the bottom of the execution stack, while the Universal AST sits at the top of the semantic stack.

This is why Yin.vm inverts the traditional compilation model. The AST is not an intermediate representation. It's the final representation. Everything else is either a view (syntax) or an optimization (bytecode).
