Datalog as Compiler Infrastructure: Why DaoDB Makes Every Optimization Queryable

The Compiler Writer's Dilemma

Traditional compilers operate on limited views of the program. Each optimization pass walks the AST, performs local reasoning, annotates nodes, and passes control to the next stage. Global optimizations require building complex data structures—call graphs, control flow graphs, data flow analysis—each one a custom implementation.

But what if the entire program were already a queryable database? What if every optimization were just another Datalog query?

When you store ASTs as datoms in DaoDB, the program itself becomes a queryable semantic database—something almost no traditional compiler architecture offers. Compilation transforms from pipeline-based traversal into query-based reasoning. This unlocks optimizations that are prohibitively complex with traditional approaches—and makes the compiler itself programmable by users.

This is not just more powerful. It's categorically different. You're no longer implementing compiler passes. You're implementing semantic reasoning over code.

Traditional Compilation: The Pipeline Model

Traditional compilers represent ASTs as:

  • Pointer-based trees (scattered across heap memory)
  • Tightly coupled structs/classes (language-specific representations)
  • Opaque in-memory objects (held only by the compiler process)

This makes it extremely difficult—often impossible—to query across the entire codebase using arbitrary logic.

Consider a typical compiler pipeline:

Source Code
    ↓
  Parse → AST (in-memory tree)
    ↓
  Type Check (walk tree, annotate types)
    ↓
  Optimize (constant folding, dead code elimination)
    ↓
  Code Generation (emit bytecode/assembly)
    ↓
  Output

Each pass has limited context:

  • Constant folding: Only sees local expressions (can't propagate across functions)
  • Dead code elimination: Needs explicit call graph construction
  • Escape analysis: Requires complex pointer analysis
  • Type inference: Must build constraint systems manually

Every global optimization requires custom infrastructure. Want whole-program analysis? Build a new data structure. Want incremental compilation? Track dependencies manually. Want cross-language optimization? Good luck.

Fundamentally, the AST isn't queryable. It's just nested objects you must walk by hand.
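
To make "walk by hand" concrete, here is a sketch of the bespoke traversal you would write for even one trivial question, assuming a nested-map AST (the shape is illustrative, not any particular compiler's):

;; Answer one question ("find every function named fn-name") by hand.
;; Every new question the compiler needs to ask requires another walker.
(defn find-functions-named [ast fn-name]
  (let [here     (if (and (map? ast)
                          (= (:type ast) :function)
                          (= (:name ast) fn-name))
                   [ast]
                   [])
        children (cond
                    (map? ast)  (vals ast)
                    (coll? ast) ast
                    :else       [])]
    (into here (mapcat #(find-functions-named % fn-name) children))))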

Essential vs Accidental Complexity

This is why traditional ASTs feel "simple": they intentionally capture only a transient slice of meaning. The essential complexity of representing full semantics—spanning time, provenance, types, and execution state—is discarded as soon as machine code is emitted. Yin/DaoDB keep that complexity in the primary data model because it is inherent to reasoning about code.

Compiler Component       | Complexity Type | Description
Traditional compiler AST | Accidental      | Lightweight tree optimized for one-time linear traversal; semantics evaporate after codegen.
DaoDB / Universal AST    | Essential       | Canonical datom graph that preserves meaning across paradigms, time, and execution so every tool can query it.

Why avoid the essential model? Historically:

  1. Execution throughput trumped everything — the mission was "turn source into binaries fast." Any richer data model was treated as accidental complexity that slowed compilation.
  2. Abstraction mismatch trade-off — refactoring, static analysis, and debugging were punted to external tools. IDEs re-parse text heuristically, linters rebuild simplified graphs, debuggers rely on coarse line-number symbols. Each tool reconstructs its own partial truth, creating massive accidental complexity in the glue.

DaoDB instead centralizes the essential complexity once. The Universal AST becomes shared infrastructure for compilers, IDEs, analyzers, and runtimes. By absorbing the irreducible semantics upfront, it eliminates the downstream accidental complexity of keeping disparate models in sync.

Datalog Compilation: The Query Model

When you store the AST as datoms in DaoDB, you turn the entire program into a semantic graph:

[node-1 :ast/type :function]
[node-1 :ast/name "foo"]
[node-1 :ast/args [arg-1 arg-2]]
[call-1 :ast/operator node-1]
[call-1 :ast/args [value-1 value-2]]
[value-1 :ast/type :literal]
[value-1 :ast/value 42]
...

This relational representation can be queried and transformed using Datalog—a full logic programming language. This is what no classical compiler can do.
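
For a first taste, here is a query over the datoms above that finds every call site of foo anywhere in the program (in this example, call-1):

[:find ?call
 :where
 [?fn :ast/name "foo"]
 [?call :ast/operator ?fn]]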

Compilation becomes:

Source Code
    ↓
  Parse → Datoms (stored in DaoDB)
    ↓
  Issue Queries:
    - Type inference query
    - Optimization queries
    - Code generation query
    ↓
  Apply Transformations (add more datoms)
    ↓
  Query Again (incremental, composable)
    ↓
  Output

Every optimization is a declarative Datalog query. The entire program is always queryable. Context is unlimited.

Novel Optimizations: Trivial with Datalog

Let's explore optimizations that are extremely difficult with traditional AST walking but straightforward with Datalog queries.

1. Cross-Procedural Constant Propagation

Traditional approach: Build call graphs, track data flow across function boundaries, perform iterative fixed-point computation. Hundreds of lines of code.

Datalog approach: One query.

;; Find all functions that always return the same literal value
[:find ?fn ?return-value
 :where
 [?fn :ast/type :function]
 [?fn :ast/body ?body]

 ;; Body is a single literal return
 [?body :ast/type :literal]
 [?body :ast/value ?return-value]]
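
The inlining query below matches on :ast/always-returns, which is assumed to have been asserted from the first query's results. A sketch of that bridging rule, in the same defrule style used throughout this article:

(defrule always-returns-literal
  [?fn :ast/type :function]
  [?fn :ast/body ?body]
  [?body :ast/type :literal]
  [?body :ast/value ?return-value]
  =>
  [?fn :ast/always-returns ?return-value])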

;; Now inline those function calls
[:find ?call-site ?literal-value
 :where
 [?call-site :ast/type :application]
 [?call-site :ast/operator ?fn]
 [?fn :ast/always-returns ?literal-value]]

;; Result: Replace call-site with literal-value

The query automatically finds all call sites across the entire codebase. No manual call graph construction. No iterative algorithms. Just declarative logic.

2. Dead Code Elimination via Reachability

Traditional approach: Build call graph, mark reachable functions, sweep unmarked nodes.

Datalog approach: Recursive rules.

;; Define reachability as a recursive rule
;; Base case: main is reachable
(defrule reachable-from-main
  [?fn :ast/name "main"]
  =>
  [?fn :reachability/from-main true])

;; Recursive case: if A is reachable and calls B, B is reachable
(defrule transitive-reachability
  [?caller :reachability/from-main true]
  [?call :ast/parent-function ?caller]
  [?call :ast/calls ?callee]
  =>
  [?callee :reachability/from-main true])

;; Find all unreachable functions
[:find ?fn
 :where
 [?fn :ast/type :function]
 (not [?fn :reachability/from-main true])]

Datalog's recursive rules naturally express reachability. The database engine handles fixed-point computation automatically.
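
For readers who know Datomic-style Datalog, the same analysis can be sketched in classic rule syntax, passing the rule set via the % input:

;; Base case plus recursive case as classic Datalog rules
[[(reachable ?fn)
  [?fn :ast/name "main"]]
 [(reachable ?callee)
  (reachable ?caller)
  [?call :ast/parent-function ?caller]
  [?call :ast/calls ?callee]]]

;; Unreachable functions, using the rule set above
[:find ?fn
 :in $ %
 :where
 [?fn :ast/type :function]
 (not (reachable ?fn))]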

3. Escape Analysis for Stack Allocation

Traditional approach: Complex pointer analysis, track object lifetimes, analyze captures.

Datalog approach: Query for non-escaping allocations.

;; Find allocations that never escape their function scope
[:find ?alloc
 :where
 [?alloc :ast/type :allocation]
 [?alloc :ast/parent-function ?fn]

 ;; Not returned from the function
 (not [?fn :ast/returns ?alloc])

 ;; Not stored in a closure
 (not [?capture :closure/captures ?alloc])

 ;; Not passed to a function that could capture it
 (not-join [?alloc]
   [?call :ast/args ?args]
   [(contains? ?args ?alloc)]
   [?call :ast/operator ?callee]
   [?callee :closure/captures-args? true])]

;; These allocations can be stack-allocated instead of heap!

The query reasons about the entire program's data flow. It finds allocations that provably don't escape. Traditional compilers need specialized analysis frameworks for this.

4. Automatic Parallelization Detection

Traditional approach: Complex dependency analysis, loop transformations, conservative assumptions.

Datalog approach: Query for parallelizable operations.

;; Find map operations that can be parallelized
[:find ?map-op
 :where
 [?map-op :ast/type :map-operation]
 [?map-op :ast/function ?fn]
 [?map-op :ast/collection ?coll]

 ;; The mapped function is pure (no side effects)
 [?fn :purity/pure true]

 ;; The collection is immutable
 [?coll :mutability/immutable true]

 ;; No data dependencies between iterations
 (not [?map-op :data-flow/iteration-dependency true])]

;; Compiler can automatically emit parallel code!

Purity analysis, mutability tracking, and dependency analysis are all queryable properties. Compose them in one query to find parallelization opportunities.

5. Purity Analysis: Automatic Fixed-Point Computation

A function is pure if it has no side effects and only calls other pure functions. This is a recursive definition that requires fixed-point iteration.

What's a Fixed Point?

A fixed point is where f(x) = x. In compiler analysis, you start with incomplete information and iteratively refine it until nothing changes.
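
The pattern is a tiny loop in Clojure terms (a generic sketch; the compiler-specific version follows below):

;; Iterate step from base until the derived set stops growing.
(defn least-fixed-point [step base]
  (let [grown (into base (step base))]
    (if (= grown base)
      base                       ; f(x) = x: nothing new derived
      (recur step grown))))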

Traditional compiler approach (manual fixed-point iteration):

;; filter-functions and all-calls-pure? are assumed helpers over the program
(defn find-pure-functions [program]
  (loop [pure-set '#{+ - * /}]  ; Start: built-in primitives
    (let [new-pure-set
          (into pure-set
                (for [f (filter-functions program)
                      :when (all-calls-pure? f pure-set)]
                  f))]
      (if (= pure-set new-pure-set)
        pure-set                  ; Fixed point! Nothing changed
        (recur new-pure-set)))))  ; Keep iterating

You must:

  1. Manually implement the iteration loop
  2. Track what changed between iterations
  3. Check for convergence (fixed point reached)
  4. Ensure termination
  5. Handle dependencies correctly

This is painful and error-prone. Every recursive analysis needs custom iteration logic.

Datalog's Automatic Fixed-Point Semantics

Datalog has built-in fixed-point computation. You declare the rules, and the engine iterates automatically:

;; Base case: built-in primitives are pure
(defrule builtin-purity
  [?fn :ast/name ?name]
  [(contains? #{"+" "-" "*" "/"} ?name)]
  =>
  [?fn :purity/pure true])

;; Recursive rule: Datalog auto-iterates until fixed point!
(defrule function-purity
  [?fn :ast/type :function]

  ;; Doesn't perform IO
  (not [?fn :effects/performs :io])

  ;; Doesn't mutate state
  (not [?fn :effects/mutates :global-state])

  ;; All called functions are pure
  (not-join [?fn]
    [?call :ast/parent-function ?fn]
    [?call :ast/calls ?callee]
    (not [?callee :purity/pure true]))
  =>
  [?fn :purity/pure true])

The Datalog engine automatically:

  • Applies the rule repeatedly (iteration 1, 2, 3...)
  • Tracks newly derived facts
  • Stops when nothing new is inferred (fixed point reached)
  • Guarantees termination (stratified negation)
  • Finds the least fixed point (smallest set satisfying the rules)

You don't write the loop. You declare the logical relationship, and the engine finds the fixed point for you.

How the Fixed Point is Reached

Watch the Datalog engine converge:

Iteration 0 (base facts):
  + is pure
  * is pure

Iteration 1 (apply recursive rule):
  abs is pure (only calls +, *)
  double is pure (only calls *)

Iteration 2 (apply rule again):
  sum-of-squares is pure (only calls abs, +, double)

Iteration 3 (apply rule again):
  ... (no new facts derived)

Fixed point reached! ✓

Traditional compiler: Manual iteration with convergence checks.

Datalog: Declarative rules, automatic fixed-point.

Many Compiler Analyses Are Fixed-Point Computations

Purity analysis isn't unique. Most global compiler analyses require fixed-point iteration:

  • Live variable analysis — Which variables are used later? (iterate backward through CFG)
  • Reaching definitions — Which definitions reach a use? (iterate forward)
  • Constant propagation — Which variables have constant values? (iterate until stable)
  • Type inference — What are the types of expressions? (iterate constraint propagation)
  • Escape analysis — Which objects escape their scope? (iterate through call graph)
  • Reachability — Which functions are called from main? (transitive closure)

In traditional compilers: manual loops, convergence checks, worklist algorithms.

In Datalog: declarative recursive rules, automatic fixed-point.

This is a massive reduction in complexity. You write the logic once. The engine handles iteration, termination, and efficiency.
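
To pick one from the list above: live-variable analysis collapses into two recursive rules. A sketch, where :ast/uses, :ast/defines, and :cfg/succ are assumed attributes:

;; Base case: a variable is live at any node that uses it
(defrule live-at-use
  [?node :ast/uses ?var]
  =>
  [?node :liveness/live ?var])

;; Recursive case: live at a successor and not redefined here, so live here
(defrule live-through-node
  [?node :cfg/succ ?succ]
  [?succ :liveness/live ?var]
  (not [?node :ast/defines ?var])
  =>
  [?node :liveness/live ?var])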

Once purity facts are computed, they enable aggressive optimization: memoization, common subexpression elimination, reordering for parallelism.
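
For instance, memoization candidates become one more query (the profiling attributes here are assumptions):

;; Pure, expensive, and often called with repeated arguments
[:find ?fn
 :where
 [?fn :purity/pure true]
 [?fn :profiling/avg-cost-ms ?cost]
 [(> ?cost 1.0)]
 [?fn :profiling/repeat-arg-ratio ?ratio]
 [(> ?ratio 0.5)]]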

6. Type Inference via Constraint Propagation

Traditional approach: Build constraint systems, solve with unification algorithms.

Datalog approach: Types propagate via recursive rules.

;; Base case: Literals have known types
(defrule literal-type-inference
  [?literal :ast/type :literal]
  [?literal :ast/value ?v]
  [(integer? ?v)]
  =>
  [?literal :type/inferred :int])

;; Recursive case: Function application propagates types
(defrule application-type-inference
  [?call :ast/type :application]
  [?call :ast/operator ?fn]
  [?fn :type/signature ?sig]
  [?sig :type/return-type ?return-type]
  =>
  [?call :type/inferred ?return-type])

;; Variable types from bindings
(defrule variable-type-from-binding
  [?var :ast/type :variable]
  [?var :ast/name ?name]
  [?binding :scope/binds ?name]
  [?binding :scope/value ?value]
  [?value :type/inferred ?type]
  =>
  [?var :type/inferred ?type])

Types flow through the program automatically. No manual constraint solving. The Datalog engine handles propagation.

7. Security: Taint Tracking

Find code paths where untrusted user input reaches sensitive operations (SQL queries, shell commands, file access).

Traditional approach: Complex flow analysis, path-sensitive reasoning.

Datalog approach: Query for dangerous flows.

;; Find tainted data reaching sensitive operations
[:find ?sensitive-op ?tainted-value
 :where
 ;; Value originates from user input
 [?source :data-source/origin :user-input]
 [?source :security/tainted true]

 ;; Reaches a sensitive operation
 [?sensitive-op :security/sensitive true]
 [?sensitive-op :ast/type :sql-query]  ; or :shell-command, :file-write
 [?sensitive-op :ast/args ?args]
 [(contains? ?args ?tainted-value)]

 ;; Data flow from source to value
 [?source :data-flow/reaches ?tainted-value]

 ;; No sanitization in between
 (not-join [?source ?tainted-value]
   [?sanitizer :security/sanitizes true]
   [?source :data-flow/reaches ?sanitizer]
   [?sanitizer :data-flow/reaches ?tainted-value])]

;; → Security vulnerability detected!

Hand-coding this analysis over a traditional AST is prohibitively complex. With Datalog, it's declarative and composable.

What's Impossible in Traditional Compilers

Why can't LLVM, JVM, GCC, or any mainstream compiler do this?

Because their ASTs are pointer-based in-memory trees, not relational databases.

Example: Find Every Function Whose Last Argument is Unused

In DaoDB, this is trivial:

[:find ?fn
 :where
 [?fn :ast/type :function]
 [?fn :ast/args ?args]
 [(last ?args) ?arg]
 (not [_ :ast/use ?arg])]

Four lines. One query. Done.

In a traditional compiler:

  1. Hand-write a giant AST walker
  2. Track variable bindings across scopes
  3. Track all uses of each parameter
  4. Build custom data structures for def-use chains
  5. Handle edge cases (closures, captures, mutations)
  6. Maintain this fragile code as the language evolves

Painful. Fragile. Language-specific. And you'd have to reimplement it for every compiler optimization that needs whole-program analysis.

Why SSA and Control Flow Graphs Aren't Enough

Traditional compilers use SSA (Static Single Assignment) and CFGs (Control Flow Graphs) for optimization. These are powerful, but:

  • SSA gives you: Local flow analysis, simple def-use chains, single-assignment form
  • But it's still a graph you must walk by hand. No higher-order logic. No declarative queries.

Datalog subsumes SSA. Everything SSA provides can be encoded as datoms, but Datalog adds:

  • Higher-order logic (quantifiers, negation, recursion)
  • Declarative pathfinding (transitive closure for free)
  • Existential queries ("find a path where...")
  • Automatic join optimization (query planner handles it)
  • Multi-indexed stores (efficient access patterns)
  • Stratified views (derived facts from base facts)

It's not "SSA vs Datalog." It's Datalog subsumes SSA—encoded at a more general level.

Composable Optimizations

The real power emerges when you compose multiple analyses in a single query.

Example: Loop-Invariant Code Hoisting

Find pure function calls in hot loops that always receive the same arguments—these can be hoisted outside the loop.

[:find ?call ?loop ?optimization
 :where
 ;; Find a function call
 [?call :ast/type :application]
 [?call :ast/operator ?fn]

 ;; The function is pure (analysis 1: purity)
 [?fn :purity/pure true]

 ;; Call is inside a loop
 [?loop :ast/type :loop]
 [?call :ast/ancestor ?loop]

 ;; Loop is a hot path (analysis 2: profiling)
 [?loop :profiling/hot-path true]

 ;; Arguments are loop-invariant (analysis 3: data flow)
 [?call :data-flow/constant-args true]

 ;; → Optimization opportunity: Hoist call outside loop
 [(vector :hoist-invariant ?call ?loop) ?optimization]]

This query composes three separate analyses (purity, profiling, data flow) into one optimization pass. With traditional compilers, you'd need to manually coordinate these phases.

Whole-Program Optimizations

Because the entire program is in the database, whole-program reasoning is trivial.

Automatic Data Structure Selection

;; Find collections that should use different data structures
[:find ?collection ?better-structure
 :where
 [?collection :ast/type :list]

 ;; Analyzed usage patterns
 [?collection :usage-pattern/random-access-ratio ?ratio]
 [(> ?ratio 0.8)]  ; Mostly random access, not sequential

 ;; Never modified
 (not [?mutation :ast/mutates ?collection])

 ;; → Suggest using vector (O(1) indexing) instead of list
 [(str "Use vector instead of list for " ?collection) ?better-structure]]

Cross-Language Optimization

Find Python code that would run faster if translated to C++ for a specific hot path.

[:find ?py-fn ?estimated-speedup
 :where
 [?py-fn :ast/source-lang "Python"]
 [?py-fn :profiling/hot-path true]

 ;; Semantically translatable to C++ (no dynamic features)
 [?py-fn :ast/translatable-to "C++"]

 ;; Uses mostly numeric operations (C++ excels here)
 [?py-fn :ast/operation-types ?ops]
 [(count (filter #{:numeric-op} ?ops)) ?numeric-ops]
 [(count ?ops) ?total-ops]
 [(/ ?numeric-ops ?total-ops) ?numeric-ratio]
 [(> ?numeric-ratio 0.9)]

 ;; Estimate speedup based on similar migrations
 [(estimate-speedup ?py-fn "C++") ?estimated-speedup]
 [(> ?estimated-speedup 10.0)]]

;; → Suggest automatic translation for 10x+ speedup

This kind of polyglot optimization is unique to systems with a Universal Semantic AST.

Query-Driven JIT Specialization

Because DaoDB is also a runtime tuple store, you can mix static and dynamic information:

  • Runtime profile datoms ("this function is called 10,000 times/sec")
  • Static AST datoms ("this function's structure")
  • Type inference datoms ("arguments are always integers")
  • Specialization decisions ("generate SIMD version")

Query for JIT opportunities:

;; Which functions should be JIT-compiled?
[:find ?fn ?specialization-strategy
 :where
 ;; Hot path (runtime profile)
 [?fn :profiling/calls-per-sec ?rate]
 [(> ?rate 1000)]

 ;; Operates on homogeneous data types (runtime observation)
 [?fn :profiling/arg-types-stable true]
 [?fn :profiling/dominant-type ?type]

 ;; No escaped closures (static analysis)
 (not [?fn :closure/escapes true])

 ;; → Generate specialized version for this type
 [(vector :specialize-for-type ?fn ?type) ?specialization-strategy]]

This is structural intelligence that grows as the program runs. Traditional compilers have fixed pipelines and lose context after compilation. With DaoDB, compilation and execution continuously inform each other.

Incremental Compilation

Because everything is datoms with transaction IDs, incremental compilation is built-in.

;; Find what needs recompilation (?last-compile-tx supplied via :in)
[:find ?affected-fn
 :in $ ?last-compile-tx
 :where
 (or-join [?affected-fn ?last-compile-tx]
   ;; Functions modified since the last compile
   (and [?affected-fn :ast/type :function ?tx]
        [(> ?tx ?last-compile-tx)])
   ;; Functions that call modified functions
   (and [?modified-fn :ast/type :function ?tx]
        [(> ?tx ?last-compile-tx)]
        [?call :ast/calls ?modified-fn]
        [?call :ast/parent-function ?affected-fn]))]

;; Only recompile these functions

Add a datom → Query affected code → Recompile only what changed. Traditional compilers need explicit dependency tracking infrastructure.
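
A usage sketch, where recompilation-query names the query above and conn, last-compile-tx, and recompile! are hypothetical:

;; Recompile only the affected functions after a change lands
(doseq [[fn-id] (d/q recompilation-query @conn last-compile-tx)]
  (recompile! fn-id))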

The Killer Feature: User-Programmable Compilation

Users can write their own optimizations and lint rules without modifying the compiler.

Custom Lint Rule

;; User-defined: Detect nested loops over the same collection
[:find ?outer-loop ?inner-loop
 :where
 [?outer-loop :ast/type :for-loop]
 [?outer-loop :ast/collection ?coll]
 [?inner-loop :ast/type :for-loop]
 [?inner-loop :ast/collection ?coll]
 [?inner-loop :ast/ancestor ?outer-loop]]

;; → Warning: Nested loops over same collection (O(n²) - consider deduping)

Custom Optimization

;; User-defined: Find string concatenation in loops
[:find ?concat ?loop
 :where
 [?concat :ast/type :string-concatenation]
 [?loop :ast/type :loop]
 [?concat :ast/ancestor ?loop]]

;; → Suggest using StringBuilder instead (avoid O(n²) allocations)

Team-Specific Rules

;; Company policy: No file I/O from request handlers
[:find ?io-call ?handler
 :where
 [?handler :ast/type :http-handler]
 [?io-call :ast/type :file-operation]
 [?io-call :ast/ancestor ?handler]]

;; → Policy violation: File I/O in request handler (use async worker instead)

The compiler becomes programmable. Anyone who can write Datalog can extend the compiler's reasoning.

Historical Optimization Learning

Because datoms have transaction timestamps, you can query across time.

;; Find patterns that were successfully optimized in past versions
[:find ?current-code ?historical-optimization ?speedup
 :in $ $history
 :where
 ;; In historical database
 [$history ?old-code :ast/pattern ?pattern]
 [$history ?old-code :optimization/applied ?optimization]
 [$history ?optimization :performance/speedup ?speedup]
 [(> ?speedup 2.0)]  ; At least 2x speedup

 ;; Similar pattern exists in current code
 [?current-code :ast/similar-to ?pattern]
 [(vector ?optimization ?speedup) ?historical-optimization]]

;; → Apply optimizations that worked in the past!

The compiler learns from history. Profile-guided optimization becomes queryable across versions.

Comparison to State-of-the-Art Systems

How does this Datalog-based approach compare to other advanced compilation frameworks like GraalVM/Truffle or static analysis tools like CodeQL?

GraalVM/Truffle

GraalVM's Truffle framework also uses a universal AST for polyglot execution, which is a powerful concept. However, its architecture is fundamentally different:

  • AST Representation: The Truffle AST is a traditional in-memory, pointer-based tree. This leads to pointer-chasing during interpretation and lacks the rich, declarative queryability of a Datalog database.
  • Semantic Erasure: The JIT compiler aggressively optimizes and compiles the AST to machine code, erasing high-level semantics in the process. This makes runtime introspection difficult without separate, limited debugging information.
  • Limited Queryability: You cannot run arbitrary, whole-program Datalog queries on a Truffle AST. Each analysis requires writing a specific visitor or pass in Java.

CodeQL/Glean

CodeQL and Glean are closer in spirit, as they represent code as a queryable database for static analysis. But they are designed for a different purpose:

  • Static Analysis Only: These tools are for analyzing code at rest. They have no concept of execution, runtime state, or performance profiling.
  • No Execution Dimension: You can query the structure of the code, but not its runtime behavior. Queries like "find all functions that were called more than 1000 times" are impossible.
  • No Time Dimension: They analyze snapshots of code, not a continuous, versioned stream of datoms. Time-travel queries are not part of their model.

Yin.vm: A Synthesis

Yin.vm, powered by DaoDB, synthesizes the strengths of these systems while addressing their limitations:

  • Like GraalVM, it offers polyglot execution via a Universal AST.
  • Like CodeQL, it provides deep queryability of program structure via Datalog.
  • Uniquely, it stores the AST, type information, transformation history, and runtime execution state in the same queryable datom stream.
  • Plus: It enables capabilities beyond either system, such as true code mobility (serializable continuations), multi-dimensional queries (spanning structure, time, and execution), and a fully programmable compiler that users can extend with simple queries.

Summary of Differences

Feature            | GraalVM / Truffle                            | Yin.vm / DaoDB
AST Representation | In-memory, pointer-based tree                | Stream of queryable datoms in a Datalog DB
Compiler Model     | Procedural pipeline (traverse and transform) | Declarative queries (query and derive)
Queryability       | Limited to specific Java-based passes        | Built-in. Entire program & execution is a database.
Runtime Semantics  | Erased by JIT for performance                | Preserved. Hybrid model links execution to AST.
Typing Model       | Supports static and dynamic languages        | Unifies static/dynamic as a "continuum of certainty."
Extensibility      | Requires writing compiler passes in Java     | Programmable. Users write Datalog queries.
Code Mobility      | Not a primary design goal                    | Core feature. Continuations are serializable datoms.

Comparison: Traditional vs Datalog

Traditional Compiler

Pipeline-Based:
  Parse → Build AST (in-memory tree)
       → Type Check (annotate nodes)
       → Build Call Graph (custom data structure)
       → Build CFG (custom data structure)
       → Data Flow Analysis (custom algorithm)
       → Optimize (traverse tree, apply transformations)
       → Code Gen

Challenges:
  - Each analysis requires custom infrastructure
  - Limited context (local reasoning only)
  - Composing analyses is manual
  - Incremental compilation needs explicit dependency tracking
  - Users cannot extend optimization passes

Datalog Compiler

Query-Based:
  Parse → Store as Datoms (DaoDB)
       → Issue Queries:
         * Type inference (recursive rules)
         * Purity analysis (recursive rules)
         * Reachability (recursive rules)
         * Optimization queries (composition)
       → Apply Transformations (add datoms)
       → Query Again (incremental, composable)
       → Code Gen (query for execution plan)

Advantages:
  - Every analysis is a Datalog query
  - Unlimited context (whole-program reasoning)
  - Composing analyses is natural (join queries)
  - Incremental compilation built-in (transaction IDs)
  - Users can write custom queries (programmable compiler)

The Paradigm Shift: From Passes to Semantic Queries

Traditional compilers require compiler engineers to write each optimization by hand. Every pass is custom code that walks trees and updates annotations.

Yin + DaoDB shifts the burden:

  • The compiler writer doesn't implement optimization passes
  • They write queries that express semantic truths about the program
  • The optimizer naturally emerges from the logic

This is why storing code as datoms is so powerful. You're not implementing compiler passes. You're implementing semantic reasoning over code.

Traditional compilation is procedural:

  • Walk the tree
  • Update annotations
  • Build auxiliary data structures
  • Coordinate phases manually

Datalog compilation is declarative:

  • Store the program as facts
  • Declare what you want to find
  • Let the query engine handle reasoning
  • Compose analyses via joins

This is a fundamental shift in how we think about compilation:

  • From "traverse and transform" to "query and derive"
  • From "custom algorithms" to "declarative logic"
  • From "compiler internals" to "queryable database"
  • From "hand-coded passes" to "emergent optimization"

Optimizations That Would Be Research Papers Become One-Liners

Here's the ultimate example of the power gap between traditional compilers and Datalog:

Global Dead-Code Elimination Across Dynamic Languages

In a traditional compiler, this is a multi-month research project:

  1. Build call graph across all modules
  2. Handle dynamic dispatch
  3. Track reflection and metaprogramming
  4. Perform reachability analysis
  5. Handle edge cases (eval, dynamic loading)
  6. Write a paper about it

In Datalog:

;; Find unreachable code
(not [?use :ast/refers-to ?node])

One line.

That's it. The entire optimization. If no datom exists that references a node, the node is dead code.

This is the power of relational reasoning. Traditional compilers must construct the information. Datalog queries information that already exists.

Why This Changes Everything

1. Optimizations Become Queries

Every optimization is just another Datalog query. Want a new optimization? Write a query. Want to disable an optimization? Don't run that query. Want to customize an optimization? Fork the query.

2. Composition is Natural

Combine multiple analyses with joins. Traditional compilers require careful phase ordering and manual coordination. Datalog queries compose automatically.

3. Whole-Program Reasoning is Cheap

The entire program is in the database. No need to build call graphs, control flow graphs, or data flow graphs—just query for what you need.

4. Incremental Compilation is Free

Transaction IDs track changes. Query for affected code. Recompile incrementally. No explicit dependency tracking needed.

5. Users Can Extend the Compiler

Write custom lint rules, optimizations, analyses—all as Datalog queries. The compiler becomes programmable.

6. Cross-Language Optimization

With a Universal Semantic AST, you can optimize across language boundaries. Find Python hot paths, translate to C++, inline across languages.

7. Learning from History

Query past optimizations. What worked? What didn't? Apply successful patterns to new code automatically.

The same database that empowers compiler engineers is also accessible to new kinds of agents. When every optimization, execution trace, and user edit is a datom, even an external assistant—like a large language model—can participate in compilation without bespoke hooks.

LLMs as Compiler Optimization Agents

Once optimizations, transformations, provenance, and execution traces live as datoms, a large language model no longer needs privileged compiler hooks. It simply becomes another client of DaoDB—one that can query existing knowledge, propose new rules, and verify semantics using the same Datalog interface as the compiler.

1. LLM Queries Existing Optimizations

Because optimization passes are represented as facts, an LLM can inspect them directly:

;; "How does Yin.vm optimize tail calls?"
[:find ?before ?after ?rule
 :where
 [?opt :optimization/type :tail-call-elimination]
 [?opt :optimization/pattern-before ?before]
 [?opt :optimization/pattern-after ?after]
 [?opt :optimization/rule ?rule]]

The response returns the actual transformation patterns and rules. The LLM doesn't hallucinate compiler internals—it reads them.

2. LLM Proposes and Tests New Optimizations

LLMs can mine DaoDB for suspicious IR shapes, emit candidate rules, and immediately verify the impact:

;; Detect x + x
[:find ?expr ?x
 :where
 [?expr :ir/op :add]
 [?expr :ir/lhs ?x]
 [?expr :ir/rhs ?x]]

;; Proposed transform: x + x → x * 2
(defn optimize-double-add [conn]
  (doseq [[expr] (d/q '[:find ?expr
                        :where
                        [?expr :ir/op :add]
                        [?expr :ir/lhs ?x]
                        [?expr :ir/rhs ?x]]
                      @conn)]
    ;; :ir/op and :ir/rhs are cardinality-one, so asserting new
    ;; values replaces the old ones in the transaction
    (d/transact! conn
      [[:db/add expr :ir/op :mul]
       [:db/add expr :ir/rhs 2]])))

After the transaction, the LLM can ask DaoDB if instruction counts dropped, or if semantics still match sample executions. It's the same workflow compiler authors perform manually, but now expressed as data the LLM can manipulate.

3. LLM-Driven Transpilation via IR Queries

With a Universal AST, language translation becomes a pure query problem:

;; Map Python list-comprehension IR to JavaScript equivalent
[:find ?js-pattern
 :where
 [?opt :transpile/from "Python"]
 [?opt :transpile/to "JavaScript"]
 [?opt :transpile/ir-pattern :list-comp]
 [?opt :transpile/js-equivalent ?js-pattern]]

The LLM doesn't have to invent the translation. It queries previously recorded equivalences, applies them to the AST, and emits target syntax that shares semantic annotations with the source.

4. Learning from Historical Optimizations

DaoDB already records when an optimization was applied, who authored it, and the measured benefit. LLMs can mine this telemetry to prioritize future suggestions:

[:find ?opt (avg ?perf-gain) (count ?application)
 :where
 [?application :applied/optimization ?opt]
 [?application :applied/perf-gain ?perf-gain]]

;; Aggregates group by ?opt automatically; sort by average gain in the caller.

If loop unrolling historically delivers 30% speedups on similar IR, the LLM can recommend it first, or auto-trigger the corresponding rule.

5. Automatic Correctness Checks

DaoDB already records regression harness results as :ir/eval-to facts keyed by specific inputs. Verifying an LLM-proposed change is therefore just another query:

;; Ensure optimized IR still returns 25 when x=5
[:find ?value
 :where
 [?expr :ir/id "expr-1"]
 [?expr :ir/eval-to ?value {:input {:x 5}}]]

If pre- and post-optimization evaluations match for all sampled inputs, the proposal can be merged automatically or surfaced to a reviewer with proof.

6. Feedback Loop from User Edits

Developers often hand-tune code. DaoDB can capture those diffs as new optimization patterns, which LLMs then reuse:

[pattern-1 :learn/from-user true]
[pattern-1 :opt/before "indexed-loop-modify"]
[pattern-1 :opt/after "list-comprehension"]
[pattern-1 :opt/perf-gain 1.5]

The next time an indexed loop appears, the LLM can cite this pattern, show the expected speedup, and offer the comprehension rewrite as an automated fix.

Traditional compilers hide optimization logic inside imperative passes. DaoDB makes that logic explicit, queryable, and editable—exactly the substrate LLMs need to become useful co-optimizers rather than documentation parrots.

The Implementation: DaoDB

This vision requires a database designed for code. DaoDB is built specifically for storing ASTs as datoms:

  • Immutable datoms — Every fact is permanent, timestamped
  • Recursive rules — Express transitive properties (reachability, purity, types)
  • Efficient joins — Compose analyses without performance penalties
  • Transaction time — Query history, track evolution
  • Incremental indexing — Fast updates for incremental compilation

Traditional Datalog engines (Datomic, Datascript) weren't designed for compilers. DaoDB is optimized for code as data:

  • AST-specific indexes (parent-child, data flow, control flow)
  • Provenance tracking (which transformation created this datom?)
  • Multi-dimensional queries (structure, time, types, execution)
  • Integration with Yin.vm execution engine

Practical Example: End-to-End

Let's walk through a complete optimization pipeline.

Step 1: Parse and Store

;; Source code
(defn calculate-total [items]
  (reduce + (map :price items)))

;; Parsed to datoms
[fn-1 :ast/type :function]
[fn-1 :ast/name "calculate-total"]
[fn-1 :ast/params [param-1]]
[fn-1 :ast/body reduce-call]
[reduce-call :ast/type :application]
[reduce-call :ast/operator reduce-fn]
[reduce-call :ast/args [plus-fn map-call]]
[map-call :ast/type :application]
[map-call :ast/operator map-fn]
...

Step 2: Type Inference

;; Query: Infer types
[:find ?expr ?type
 :where
 [?expr :ast/type :application]
 [?expr :ast/operator ?fn]
 [?fn :ast/name "map"]
 [?fn :type/signature ?sig]
 [?sig :type/return-type ?type]]

;; Result: map-call has type (Seq Number)

Step 3: Purity Analysis

;; Query: Is calculate-total pure?
[:find ?fn
 :where
 [?fn :ast/name "calculate-total"]
 [?fn :purity/pure true]]

;; Yes! It only calls pure functions (map, reduce, +)

Step 4: Optimization - Fusion

;; Query: Find map-reduce fusion opportunities
[:find ?map-call ?reduce-call
 :where
 [?reduce-call :ast/type :application]
 [?reduce-call :ast/operator ?reduce-fn]
 [?reduce-fn :ast/name "reduce"]
 [?reduce-call :ast/args [?reducing-fn ?map-call]]
 [?map-call :ast/type :application]
 [?map-call :ast/operator ?map-fn]
 [?map-fn :ast/name "map"]

 ;; The reducing and mapping functions are pure
 [?reducing-fn :purity/pure true]
 [?map-fn :purity/pure true]]

;; Optimization: Fuse map and reduce into single pass (transducers)
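
For reference, the fused form this rewrite produces in plain Clojure, traversing items once with no intermediate sequence:

(defn calculate-total [items]
  (transduce (map :price) + items))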

Step 5: Code Generation

;; Query: Generate optimized execution plan
[:find ?step ?instruction ?order
 :where
 [?fn :ast/name "calculate-total"]
 [?fn :optimization/fused-transducer ?transducer]
 (execution-plan ?transducer ?step ?instruction ?order)]

;; Generates linear execution stream (bytecode-like)

Every step is a query. Every optimization is composable. Every analysis is reusable.

Conclusion: Capabilities No Existing Compiler Has

For decades, compilers have been procedural programs that walk trees and apply transformations. This model served us well, but it has fundamental limitations:

  • Local reasoning only
  • Custom infrastructure for each analysis
  • Manual coordination of phases
  • Limited extensibility

Storing ASTs as datoms in DaoDB transforms compilation into database queries. This unlocks capabilities that no existing compiler architecture has:

  • Whole-program reasoning — Query the entire codebase with arbitrary logic
  • Composable analyses — Join queries naturally, no manual coordination
  • Declarative optimizations — Describe what, not how
  • Incremental compilation — Built-in via transactions
  • User-programmable — Anyone can write queries to extend the compiler
  • Cross-language — Optimize across boundaries via Universal AST
  • Runtime-informed — Static and dynamic information coexist
  • Historical learning — Query past optimizations, apply successful patterns

This is not just more powerful—it's categorically different.

You're building:

  • A relational compiler (AST as queryable database)
  • A logic-programmable optimizer (Datalog queries as passes)
  • A continuation-centric runtime (execution streams)
  • All backed by a universal datom substrate (DaoDB)

Yin.vm isn't just implementing compiler passes. It's implementing semantic reasoning over code.

This is not just a better implementation technique. It's a paradigm shift. When the program is a database and optimizations are queries, the line between code and data blurs completely.

Compilers become queryable, programmable, composable. Users gain the power to reason about their code in ways that were previously reserved for compiler internals.

This is the architecture of datom.world. Everything is a datom. Everything is queryable. And compilation is just another dimension of queries.
