What Makes Datalog Datalog: Semantics, Not Syntax

The Question

Datomic implements Datalog with an EDN-based syntax that looks like structured data. Prolog-style Datalog uses predicates and horn clauses. Souffle uses a C-like syntax. LogicBlox has its own notation. SQL's WITH RECURSIVE enables Datalog-style queries.

Are these all actually Datalog? Or are some pretenders?

The answer: Datalog is defined by its semantics, not its syntax. If the evaluation model is the same, it's Datalog—no matter what the surface representation looks like.

What Datalog Actually Is

Datalog is a restricted subset of first-order logic with a specific evaluation strategy. At its core:

1. Relational Foundation

Data is stored as relations—sets of tuples. Not nested objects. Not trees. Flat tuples.

parent(alice, bob)
parent(bob, charlie)
parent(charlie, diana)

This is a relation parent with two columns. Each row is a fact.

2. Horn Clause Logic

Queries are Horn clauses—logical implications of the form:

head :- body₁, body₂, ..., bodyₙ

Read as: "The head is true if all body clauses are true."

Example:

ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).

This defines ancestor recursively:

  • Base case: If parent(X, Y), then ancestor(X, Y)
  • Recursive case: If parent(X, Z) and ancestor(Z, Y), then ancestor(X, Y)

3. Safe Recursion

Datalog supports recursive rules with guaranteed termination. This is enforced via:

  • Range restriction — Every variable in the head must appear in a positive (non-negated) body clause
  • Stratification — Negation cannot create circular dependencies
  • No function symbols — Only constants and variables (prevents infinite term generation)

Example of safe recursion:

reachable(X, Y) :- edge(X, Y).
reachable(X, Y) :- edge(X, Z), reachable(Z, Y).

This terminates because:

  • Variables X, Y, Z are bound by edge facts
  • No new nodes are generated (only existing nodes from edge)
  • The number of possible reachable facts is finite

4. Fixed-Point Evaluation

Datalog evaluates rules by iterating until a fixed point is reached:

Iteration 0: Base facts (from database)
Iteration 1: Apply rules → derive new facts
Iteration 2: Apply rules → derive more facts
...
Iteration N: Apply rules → no new facts
FIXED POINT REACHED ✓

This is called least fixed-point semantics. The result is the smallest set of facts that satisfies all rules.

5. Set Semantics

Results are sets, not bags (multisets) or lists:

Result: {(alice, bob), (alice, charlie)}

NOT: [(alice, bob), (alice, bob), (alice, charlie)]  # No duplicates!

Each fact is derived at most once. Adding the same rule multiple times doesn't change the result.

6. Monotonicity

Adding facts never removes previously derived conclusions (until you add stratified negation):

If ancestor(alice, bob) was derived,
Adding more parent() facts won't un-derive it.

This property enables incremental evaluation. You can add new facts and recompute only the affected conclusions.

7. Stratified Negation

Datalog supports negation, but it must be stratified—no circular dependencies through negation:

unreachable(X, Y) :- node(X), node(Y), not reachable(X, Y).

This is safe because:

  • reachable is fully computed first (stratum 0)
  • unreachable uses the result (stratum 1)
  • No circular dependency: unreachable doesn't feed back into reachable

Stratification ensures negation is computable and meaningful.

Syntax Is Just Surface Representation

Different Datalog implementations use wildly different syntaxes, but they all share the same evaluation model:

Prolog-Style (Classic)

parent(alice, bob).
parent(bob, charlie).

ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).

?- ancestor(alice, charlie).

This is the canonical Datalog syntax, borrowed from Prolog.

Datomic/Datascript (EDN/Clojure)

[:find ?ancestor ?descendant
 :where
 [?ancestor :parent ?child]
 [?child :parent ?descendant]]

This looks like structured data (EDN—Extensible Data Notation). It's a vector containing keywords and symbols.

Key insight: This is homoiconic. The query is data that Clojure code can manipulate:

;; Compose queries as data
(def base-query
  '[:find ?person
    :where
    [?person :person/name ?name]])

(def age-filter
  '[[?person :person/age ?age]
    [(> ?age 18)]])

;; Programmatically build queries
(concat base-query [:where] age-filter)

Despite the syntax difference, this is Datalog because:

  • Variables (?person, ?name) unify
  • Clauses are conjunctive (AND semantics)
  • Supports recursion via rules
  • Set semantics (:find returns a set)
  • Stratified negation (not, not-join)

Souffle (Research-Oriented)

.decl parent(x: symbol, y: symbol)
.decl ancestor(x: symbol, y: symbol)
.input parent

ancestor(x, y) :- parent(x, y).
ancestor(x, y) :- parent(x, z), ancestor(z, y).

.output ancestor

Souffle adds type declarations and explicit input/output directives. It compiles to C++ for high performance.

LogicBlox (Commercial)

ancestor(x, y) <- parent(x, y).
ancestor(x, y) <- parent(x, z), ancestor(z, y).

Uses <- instead of :-. Same semantics, different arrow.

SQL WITH RECURSIVE

WITH RECURSIVE ancestor AS (
  SELECT parent, child FROM parent_table
  UNION
  SELECT p.parent, a.child
  FROM parent_table p
  JOIN ancestor a ON p.child = a.parent
)
SELECT * FROM ancestor;

SQL's WITH RECURSIVE enables Datalog-style recursive queries. Same fixed-point evaluation, SQL syntax.

All Datalog, Different Clothes

These are all Datalog because they share the same semantic properties:

  • Relational model (tuples, not nested terms)
  • Horn clause logic (if-then rules)
  • Fixed-point semantics (iterate until stable)
  • Safe recursion (guaranteed termination)
  • Stratified negation (safe NOT)
  • Set semantics (no duplicates)
  • Monotonicity (adding facts preserves conclusions)

The syntax is irrelevant. Datalog is an evaluation strategy, not a programming language.

What Datalog Is NOT

To clarify the boundaries:

Datalog Is NOT Full Prolog

Prolog has:

  • Function symbols — Terms like f(g(h(x))) can nest infinitely
  • Backtracking search — Explores multiple solutions with chronological backtracking
  • Cut operator (!) — Non-logical control for pruning search
  • Non-monotonic updatesassert/retract modify the database
  • Turing-complete — Can express arbitrary computation (doesn't always terminate)

Datalog restricts:

  • No function symbols — Only constants and variables (prevents infinite term generation)
  • No backtracking — Set-based evaluation (all solutions at once)
  • No cut — Declarative only (no procedural control)
  • Monotonic reasoning — Facts only accumulate (until stratified negation)
  • Guaranteed termination — Finite domain + range restriction = always halts

Example of legal Prolog, illegal Datalog:

% Prolog: Function symbols allow infinite nesting
list([]).
list([H|T]) :- list(T).
append([], L, L).
append([H|T1], L2, [H|T3]) :- append(T1, L2, T3).

?- append([1,2], [3,4], X).  % Builds X = [1,2,3,4]

This uses function symbols ([H|T] is cons(H, T)). Datalog cannot express this because:

  • The term [1,[2,[3,[4,[]]]]] is a nested function
  • Prolog can generate infinitely many different terms
  • Datalog requires finite domains

Datalog Is NOT SQL (But Close)

SQL has:

  • Aggregate functionsSUM, AVG, COUNT, GROUP BY
  • Nulls — Three-valued logic (true/false/unknown)
  • Order-dependent operationsLIMIT, OFFSET, ORDER BY
  • Procedural extensions — Stored procedures, triggers
  • Imperative updatesUPDATE, DELETE, INSERT

Classic Datalog has:

  • Simple selection/projection/join — No aggregates
  • No nulls — Closed-world assumption (missing = false)
  • Order-independent — Set semantics (no ORDER BY)
  • Declarative rules only — No procedures
  • Immutable facts — No updates (monotonic reasoning)

Modern Datalog extensions (Datomic, Souffle) add aggregates, but core Datalog is simpler and more restricted than SQL.

Why the Restrictions Matter

Datalog's restrictions seem limiting. No function symbols? No loops? No updates? Why bother?

Because these restrictions unlock powerful guarantees:

1. Guaranteed Termination

Every Datalog query always terminates. No infinite loops. No non-termination.

This is why Datalog is perfect for compiler infrastructure. You can run arbitrary user-defined queries without worrying about hanging the compiler.

2. Efficient Evaluation

The restrictions enable aggressive optimization:

  • Join ordering — Query planner chooses optimal join order
  • Indexing — Multi-dimensional indexes for fast lookup
  • Semi-naive evaluation — Incremental fixed-point computation
  • Magic sets — Push down selections for efficiency
  • Tabling — Memoize intermediate results

These optimizations are impossible in full Prolog (backtracking and function symbols break them).

3. Parallel Evaluation

Datalog rules can be applied in parallel. No shared mutable state. No race conditions.

// Apply rules in parallel
parallel_for_each rule in rules:
    derive_facts(rule, current_facts)

// Merge results (set union)
new_facts = union(all derived facts)

This enables massive parallelism on modern hardware.

4. Incremental Maintenance

Monotonicity enables incremental updates:

Add new fact: parent(diana, eve)

Recompute only affected conclusions:
  ancestor(diana, eve)  ← new
  ancestor(charlie, eve) ← new (via diana)
  ancestor(bob, eve)     ← new (via charlie)
  ancestor(alice, eve)   ← new (via bob)

Everything else unchanged!

Traditional databases must recompute everything. Datalog recomputes only what changed.

5. Declarative Optimization

You write what you want, not how to compute it. The query planner chooses the execution strategy.

;; User writes this
[:find ?x ?z
 :where
 [?x :related-to ?y]
 [?y :connected-to ?z]
 [(> ?z 100)]]

;; Engine optimizes:
Option 1: Filter z > 100 first, then join
Option 2: Join first, then filter
Option 3: Use index on :connected-to for z > 100

Chooses best based on statistics!

This is why Datalog as compiler infrastructure is so powerful. Compiler writers express semantic truths, and the engine optimizes automatically.

The Power of 'Just' Datalog

Datalog seems simple compared to Prolog or SQL. But this simplicity is intentional:

  • Expressive enough for complex analyses (reachability, type inference, escape analysis)
  • Restricted enough for efficiency (guaranteed termination, parallelism, indexing)
  • Declarative enough for composability (rules are independent)
  • Safe enough for user-programmable compilers (no hanging queries)

This is the sweet spot for compiler optimization.

Homoiconicity: Why Datomic's Syntax Matters

While syntax doesn't change what Datalog is, Datomic's choice to represent queries as data structures (EDN) unlocks unique capabilities:

Queries as Data

;; A query is just a data structure
(def my-query
  '[:find ?person ?name
    :where
    [?person :person/name ?name]])

Programmatic Composition

;; Build queries dynamically
(defn add-age-filter [query min-age]
  (update query :where conj
    ['?person :person/age '?age]
    [list '> '?age min-age]))

(add-age-filter my-query 18)
;; => [:find ?person ?name
;;     :where
;;     [?person :person/name ?name]
;;     [?person :person/age ?age]
;;     [(> ?age 18)]]

Macro Generation

;; Generate queries from specifications
(defmacro defquery [name spec]
  `(def ~name
     ~(compile-spec-to-query spec)))

(defquery ancestors
  {:entity :person
   :relation :parent
   :transitive true})

Serialization

;; Serialize queries as EDN
(pr-str my-query)
;; => "[:find ?person ?name :where [?person :person/name ?name]]"

;; Send over network, store in database, etc.
;; Then read back:
(read-string query-string)

This homoiconicity (code as data) is why Lisp-family languages excel at metaprogramming. Datomic's EDN syntax makes queries first-class data that can be manipulated, composed, and generated programmatically.

DaoDB's Datalog Design

When building DaoDB for Yin.vm, the design choices are clear:

Syntax: Datomic's EDN-Based Datalog

DaoDB implements Datomic's syntax of Datalog—using Clojure's EDN (Extensible Data Notation) for homoiconic queries:

[:find ?fn ?optimization
 :where
 [?fn :ast/type :function]
 [?fn :purity/pure true]
 [?call :ast/calls ?fn]
 [?loop :ast/type :loop]
 [?call :ast/ancestor ?loop]
 [(vector :hoist-call ?call ?loop) ?optimization]]

Semantics: Pure Datalog

  • Horn clauses (rules with heads and bodies)
  • Fixed-point evaluation (iterate until stable)
  • Stratified negation (safe NOT)
  • Range restriction (guaranteed termination)
  • Set semantics (no duplicates)

Extensions: Modern Datalog Features

  • Aggregatessum, count, max for optimization metrics
  • Temporal queries — Query across transaction time (history)
  • Provenance — Which rule derived this fact? (for debugging)
  • User-defined predicates — Custom functions in queries
  • Pull API — Fetch entity graphs (Datomic-style)

Optimizations: High-Performance Datalog

  • Semi-naive evaluation — Incremental fixed-point (only new facts)
  • Multi-indexed EAVT — Fast lookup on entity/attribute/value/time
  • Join ordering — Cost-based query planner
  • Parallel evaluation — Rules applied concurrently
  • Compilation to execution streamsBytecode-like performance

The Essence: Evaluation Model, Not Surface Syntax

Datalog is not defined by its syntax. It's defined by:

  1. Relational model — Data is tuples
  2. Horn clause logic — Rules are if-then implications
  3. Fixed-point evaluation — Iterate until no new facts
  4. Safe recursion — Guaranteed termination
  5. Stratified negation — Computable NOT
  6. Set semantics — No duplicates
  7. Monotonicity — Facts accumulate

These properties make Datalog:

  • Powerful — Recursive queries, transitive closure, fixpoints
  • Efficient — Indexing, join ordering, parallelism
  • Safe — Always terminates, no infinite loops
  • Declarative — Describe what, not how
  • Composable — Rules are independent

This is why Datalog is the perfect foundation for compiler infrastructure. It provides:

  • Expressive power for complex analyses
  • Efficiency for whole-program reasoning
  • Safety for user-programmable optimizations
  • Declarative clarity for maintainability

Conclusion: Look Beyond the Syntax

When you see Datalog queries that look different:

  • Prolog-style: ancestor(X,Y) :- parent(X,Y).
  • Datomic-style: [:find ?x :where [?x :parent ?y]]
  • SQL-style: WITH RECURSIVE ancestor AS ...

Don't be fooled by the surface representation. Ask:

  • Does it use Horn clauses?
  • Does it evaluate via fixed-point?
  • Does it guarantee termination?
  • Does it support safe recursion?
  • Does it have set semantics?

If yes, it's Datalog—no matter what it looks like.

Syntax is just clothing. Semantics are the soul.

This is why datom.world chose Datalog as the foundation for DaoDB and Yin.vm. The evaluation model—not the syntax—is what makes Datalog the perfect query language for code as data.

Learn more: