What Makes Datalog Datalog: Semantics, Not Syntax
The Question
Datomic implements Datalog with an EDN-based syntax that looks like structured data. Prolog-style Datalog uses predicates and horn clauses. Souffle uses a C-like syntax. LogicBlox has its own notation. SQL's WITH RECURSIVE enables Datalog-style queries.
Are these all actually Datalog? Or are some pretenders?
The answer: Datalog is defined by its semantics, not its syntax. If the evaluation model is the same, it's Datalog—no matter what the surface representation looks like.
What Datalog Actually Is
Datalog is a restricted subset of first-order logic with a specific evaluation strategy. At its core:
1. Relational Foundation
Data is stored as relations—sets of tuples. Not nested objects. Not trees. Flat tuples.
parent(alice, bob)
parent(bob, charlie)
parent(charlie, diana)This is a relation parent with two columns. Each row is a fact.
2. Horn Clause Logic
Queries are Horn clauses—logical implications of the form:
head :- body₁, body₂, ..., bodyₙRead as: "The head is true if all body clauses are true."
Example:
ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).This defines ancestor recursively:
- Base case: If
parent(X, Y), thenancestor(X, Y) - Recursive case: If
parent(X, Z)andancestor(Z, Y), thenancestor(X, Y)
3. Safe Recursion
Datalog supports recursive rules with guaranteed termination. This is enforced via:
- Range restriction — Every variable in the head must appear in a positive (non-negated) body clause
- Stratification — Negation cannot create circular dependencies
- No function symbols — Only constants and variables (prevents infinite term generation)
Example of safe recursion:
reachable(X, Y) :- edge(X, Y).
reachable(X, Y) :- edge(X, Z), reachable(Z, Y).This terminates because:
- Variables
X,Y,Zare bound byedgefacts - No new nodes are generated (only existing nodes from
edge) - The number of possible
reachablefacts is finite
4. Fixed-Point Evaluation
Datalog evaluates rules by iterating until a fixed point is reached:
Iteration 0: Base facts (from database)
Iteration 1: Apply rules → derive new facts
Iteration 2: Apply rules → derive more facts
...
Iteration N: Apply rules → no new facts
FIXED POINT REACHED ✓This is called least fixed-point semantics. The result is the smallest set of facts that satisfies all rules.
5. Set Semantics
Results are sets, not bags (multisets) or lists:
Result: {(alice, bob), (alice, charlie)}
NOT: [(alice, bob), (alice, bob), (alice, charlie)] # No duplicates!Each fact is derived at most once. Adding the same rule multiple times doesn't change the result.
6. Monotonicity
Adding facts never removes previously derived conclusions (until you add stratified negation):
If ancestor(alice, bob) was derived,
Adding more parent() facts won't un-derive it.This property enables incremental evaluation. You can add new facts and recompute only the affected conclusions.
7. Stratified Negation
Datalog supports negation, but it must be stratified—no circular dependencies through negation:
unreachable(X, Y) :- node(X), node(Y), not reachable(X, Y).This is safe because:
reachableis fully computed first (stratum 0)unreachableuses the result (stratum 1)- No circular dependency:
unreachabledoesn't feed back intoreachable
Stratification ensures negation is computable and meaningful.
Syntax Is Just Surface Representation
Different Datalog implementations use wildly different syntaxes, but they all share the same evaluation model:
Prolog-Style (Classic)
parent(alice, bob).
parent(bob, charlie).
ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
?- ancestor(alice, charlie).This is the canonical Datalog syntax, borrowed from Prolog.
Datomic/Datascript (EDN/Clojure)
[:find ?ancestor ?descendant
:where
[?ancestor :parent ?child]
[?child :parent ?descendant]]This looks like structured data (EDN—Extensible Data Notation). It's a vector containing keywords and symbols.
Key insight: This is homoiconic. The query is data that Clojure code can manipulate:
;; Compose queries as data
(def base-query
'[:find ?person
:where
[?person :person/name ?name]])
(def age-filter
'[[?person :person/age ?age]
[(> ?age 18)]])
;; Programmatically build queries
(concat base-query [:where] age-filter)Despite the syntax difference, this is Datalog because:
- Variables (
?person,?name) unify - Clauses are conjunctive (AND semantics)
- Supports recursion via rules
- Set semantics (
:findreturns a set) - Stratified negation (
not,not-join)
Souffle (Research-Oriented)
.decl parent(x: symbol, y: symbol)
.decl ancestor(x: symbol, y: symbol)
.input parent
ancestor(x, y) :- parent(x, y).
ancestor(x, y) :- parent(x, z), ancestor(z, y).
.output ancestorSouffle adds type declarations and explicit input/output directives. It compiles to C++ for high performance.
LogicBlox (Commercial)
ancestor(x, y) <- parent(x, y).
ancestor(x, y) <- parent(x, z), ancestor(z, y).Uses <- instead of :-. Same semantics, different arrow.
SQL WITH RECURSIVE
WITH RECURSIVE ancestor AS (
SELECT parent, child FROM parent_table
UNION
SELECT p.parent, a.child
FROM parent_table p
JOIN ancestor a ON p.child = a.parent
)
SELECT * FROM ancestor;SQL's WITH RECURSIVE enables Datalog-style recursive queries. Same fixed-point evaluation, SQL syntax.
All Datalog, Different Clothes
These are all Datalog because they share the same semantic properties:
- Relational model (tuples, not nested terms)
- Horn clause logic (if-then rules)
- Fixed-point semantics (iterate until stable)
- Safe recursion (guaranteed termination)
- Stratified negation (safe NOT)
- Set semantics (no duplicates)
- Monotonicity (adding facts preserves conclusions)
The syntax is irrelevant. Datalog is an evaluation strategy, not a programming language.
What Datalog Is NOT
To clarify the boundaries:
Datalog Is NOT Full Prolog
Prolog has:
- Function symbols — Terms like
f(g(h(x)))can nest infinitely - Backtracking search — Explores multiple solutions with chronological backtracking
- Cut operator (
!) — Non-logical control for pruning search - Non-monotonic updates —
assert/retractmodify the database - Turing-complete — Can express arbitrary computation (doesn't always terminate)
Datalog restricts:
- No function symbols — Only constants and variables (prevents infinite term generation)
- No backtracking — Set-based evaluation (all solutions at once)
- No cut — Declarative only (no procedural control)
- Monotonic reasoning — Facts only accumulate (until stratified negation)
- Guaranteed termination — Finite domain + range restriction = always halts
Example of legal Prolog, illegal Datalog:
% Prolog: Function symbols allow infinite nesting
list([]).
list([H|T]) :- list(T).
append([], L, L).
append([H|T1], L2, [H|T3]) :- append(T1, L2, T3).
?- append([1,2], [3,4], X). % Builds X = [1,2,3,4]This uses function symbols ([H|T] is cons(H, T)). Datalog cannot express this because:
- The term
[1,[2,[3,[4,[]]]]]is a nested function - Prolog can generate infinitely many different terms
- Datalog requires finite domains
Datalog Is NOT SQL (But Close)
SQL has:
- Aggregate functions —
SUM,AVG,COUNT,GROUP BY - Nulls — Three-valued logic (true/false/unknown)
- Order-dependent operations —
LIMIT,OFFSET,ORDER BY - Procedural extensions — Stored procedures, triggers
- Imperative updates —
UPDATE,DELETE,INSERT
Classic Datalog has:
- Simple selection/projection/join — No aggregates
- No nulls — Closed-world assumption (missing = false)
- Order-independent — Set semantics (no ORDER BY)
- Declarative rules only — No procedures
- Immutable facts — No updates (monotonic reasoning)
Modern Datalog extensions (Datomic, Souffle) add aggregates, but core Datalog is simpler and more restricted than SQL.
Why the Restrictions Matter
Datalog's restrictions seem limiting. No function symbols? No loops? No updates? Why bother?
Because these restrictions unlock powerful guarantees:
1. Guaranteed Termination
Every Datalog query always terminates. No infinite loops. No non-termination.
This is why Datalog is perfect for compiler infrastructure. You can run arbitrary user-defined queries without worrying about hanging the compiler.
2. Efficient Evaluation
The restrictions enable aggressive optimization:
- Join ordering — Query planner chooses optimal join order
- Indexing — Multi-dimensional indexes for fast lookup
- Semi-naive evaluation — Incremental fixed-point computation
- Magic sets — Push down selections for efficiency
- Tabling — Memoize intermediate results
These optimizations are impossible in full Prolog (backtracking and function symbols break them).
3. Parallel Evaluation
Datalog rules can be applied in parallel. No shared mutable state. No race conditions.
// Apply rules in parallel
parallel_for_each rule in rules:
derive_facts(rule, current_facts)
// Merge results (set union)
new_facts = union(all derived facts)This enables massive parallelism on modern hardware.
4. Incremental Maintenance
Monotonicity enables incremental updates:
Add new fact: parent(diana, eve)
Recompute only affected conclusions:
ancestor(diana, eve) ← new
ancestor(charlie, eve) ← new (via diana)
ancestor(bob, eve) ← new (via charlie)
ancestor(alice, eve) ← new (via bob)
Everything else unchanged!Traditional databases must recompute everything. Datalog recomputes only what changed.
5. Declarative Optimization
You write what you want, not how to compute it. The query planner chooses the execution strategy.
;; User writes this
[:find ?x ?z
:where
[?x :related-to ?y]
[?y :connected-to ?z]
[(> ?z 100)]]
;; Engine optimizes:
Option 1: Filter z > 100 first, then join
Option 2: Join first, then filter
Option 3: Use index on :connected-to for z > 100
Chooses best based on statistics!This is why Datalog as compiler infrastructure is so powerful. Compiler writers express semantic truths, and the engine optimizes automatically.
The Power of 'Just' Datalog
Datalog seems simple compared to Prolog or SQL. But this simplicity is intentional:
- Expressive enough for complex analyses (reachability, type inference, escape analysis)
- Restricted enough for efficiency (guaranteed termination, parallelism, indexing)
- Declarative enough for composability (rules are independent)
- Safe enough for user-programmable compilers (no hanging queries)
This is the sweet spot for compiler optimization.
Homoiconicity: Why Datomic's Syntax Matters
While syntax doesn't change what Datalog is, Datomic's choice to represent queries as data structures (EDN) unlocks unique capabilities:
Queries as Data
;; A query is just a data structure
(def my-query
'[:find ?person ?name
:where
[?person :person/name ?name]])Programmatic Composition
;; Build queries dynamically
(defn add-age-filter [query min-age]
(update query :where conj
['?person :person/age '?age]
[list '> '?age min-age]))
(add-age-filter my-query 18)
;; => [:find ?person ?name
;; :where
;; [?person :person/name ?name]
;; [?person :person/age ?age]
;; [(> ?age 18)]]Macro Generation
;; Generate queries from specifications
(defmacro defquery [name spec]
`(def ~name
~(compile-spec-to-query spec)))
(defquery ancestors
{:entity :person
:relation :parent
:transitive true})Serialization
;; Serialize queries as EDN
(pr-str my-query)
;; => "[:find ?person ?name :where [?person :person/name ?name]]"
;; Send over network, store in database, etc.
;; Then read back:
(read-string query-string)This homoiconicity (code as data) is why Lisp-family languages excel at metaprogramming. Datomic's EDN syntax makes queries first-class data that can be manipulated, composed, and generated programmatically.
DaoDB's Datalog Design
When building DaoDB for Yin.vm, the design choices are clear:
Syntax: Datomic's EDN-Based Datalog
DaoDB implements Datomic's syntax of Datalog—using Clojure's EDN (Extensible Data Notation) for homoiconic queries:
[:find ?fn ?optimization
:where
[?fn :ast/type :function]
[?fn :purity/pure true]
[?call :ast/calls ?fn]
[?loop :ast/type :loop]
[?call :ast/ancestor ?loop]
[(vector :hoist-call ?call ?loop) ?optimization]]Semantics: Pure Datalog
- Horn clauses (rules with heads and bodies)
- Fixed-point evaluation (iterate until stable)
- Stratified negation (safe NOT)
- Range restriction (guaranteed termination)
- Set semantics (no duplicates)
Extensions: Modern Datalog Features
- Aggregates —
sum,count,maxfor optimization metrics - Temporal queries — Query across transaction time (history)
- Provenance — Which rule derived this fact? (for debugging)
- User-defined predicates — Custom functions in queries
- Pull API — Fetch entity graphs (Datomic-style)
Optimizations: High-Performance Datalog
- Semi-naive evaluation — Incremental fixed-point (only new facts)
- Multi-indexed EAVT — Fast lookup on entity/attribute/value/time
- Join ordering — Cost-based query planner
- Parallel evaluation — Rules applied concurrently
- Compilation to execution streams — Bytecode-like performance
The Essence: Evaluation Model, Not Surface Syntax
Datalog is not defined by its syntax. It's defined by:
- Relational model — Data is tuples
- Horn clause logic — Rules are if-then implications
- Fixed-point evaluation — Iterate until no new facts
- Safe recursion — Guaranteed termination
- Stratified negation — Computable NOT
- Set semantics — No duplicates
- Monotonicity — Facts accumulate
These properties make Datalog:
- Powerful — Recursive queries, transitive closure, fixpoints
- Efficient — Indexing, join ordering, parallelism
- Safe — Always terminates, no infinite loops
- Declarative — Describe what, not how
- Composable — Rules are independent
This is why Datalog is the perfect foundation for compiler infrastructure. It provides:
- Expressive power for complex analyses
- Efficiency for whole-program reasoning
- Safety for user-programmable optimizations
- Declarative clarity for maintainability
Conclusion: Look Beyond the Syntax
When you see Datalog queries that look different:
- Prolog-style:
ancestor(X,Y) :- parent(X,Y). - Datomic-style:
[:find ?x :where [?x :parent ?y]] - SQL-style:
WITH RECURSIVE ancestor AS ...
Don't be fooled by the surface representation. Ask:
- Does it use Horn clauses?
- Does it evaluate via fixed-point?
- Does it guarantee termination?
- Does it support safe recursion?
- Does it have set semantics?
If yes, it's Datalog—no matter what it looks like.
Syntax is just clothing. Semantics are the soul.
This is why datom.world chose Datalog as the foundation for DaoDB and Yin.vm. The evaluation model—not the syntax—is what makes Datalog the perfect query language for code as data.
Learn more:
- Datalog as Compiler Infrastructure (why Datalog makes every optimization queryable)
- AST as Higher Dimensional Construction of Datom Streams (code as queryable datoms)
- AST Datom Streams: Bytecode Performance with Semantic Preservation (execution + queries)
- DaoDB Documentation (the Datalog database powering datom.world)
- Yin VM Documentation (technical implementation)