Datom Representation and the Hidden Performance Cost

Datoms are elegantly simple: [e a v t m]. Entity, attribute, value, transaction, metadata entity ID. Five elements. Fixed-size tuple. The fifth element (m) is a metadata entity ID that references an entity containing causality, branching info, encryption context, or other interpreter-specific information.

But there's a performance trap hiding in this simplicity.

The Abstraction vs The Reality

In theory, a datom is a fixed-size tuple. Five elements. That should mean:

  • Predictable memory layout
  • Fast sequential access
  • Efficient parsing
  • Cache-friendly data structures

But in practice, the tuple has a fixed number of elements, but the elements themselves are not fixed-size.

The Variable-Size Problem

Consider what each element actually contains:

e  — Entity ID (could be UUID, integer, or reference)
a  — Attribute (could be keyword, string, or symbol)
v  — Value (could be anything: string, number, blob, reference)
t  — Transaction ID (typically integer or timestamp)
m  — Metadata entity ID (references entity with causality, context, or other meta-info)

None of these are guaranteed to be fixed-size.

Entity IDs

  • Small integers: 1-8 bytes
  • UUIDs: 16 bytes
  • Content-addressed hashes: 32+ bytes
  • Composite keys: variable length

Attributes

  • Short keywords: :name (4 bytes + overhead)
  • Namespaced keywords: :user/email-address (20+ bytes)
  • Fully qualified: :com.example.domain/attribute (30+ bytes)

Values

  • Booleans: 1 byte
  • Small integers: 1-8 bytes
  • Short strings: 10-50 bytes
  • Long strings: kilobytes
  • Blobs: megabytes
  • References: size of entity ID

Transaction IDs

  • Sequential integers: 4-8 bytes
  • Timestamps: 8 bytes
  • Distributed transaction IDs: 16+ bytes

Context Metadata

  • Null (no context): 0 bytes
  • Node ID: 4-16 bytes
  • Vector clock: 8 bytes × node count
  • Hash-based provenance: 32 bytes
  • Composite metadata: variable

Why This Matters: Parsing Performance

When you store datoms on disk or send them over the network, you need to serialize and deserialize them.

With fixed-size elements, parsing is trivial:

// Fixed-size parsing (fast)
struct Datom {
    uint64_t entity;      // 8 bytes
    uint64_t attribute;   // 8 bytes
    uint64_t value;       // 8 bytes
    uint64_t tx;          // 8 bytes
    uint64_t context;     // 8 bytes
};

// Total: 40 bytes, perfectly aligned
// Parsing: memcpy the whole thing

This is blazingly fast:

  • Sequential memory access (cache-friendly)
  • No branching (CPU-friendly)
  • SIMD-friendly (can parse multiple datoms in parallel)
  • Direct memory mapping (zero-copy possible)

But with variable-size elements, parsing becomes complex:

// Variable-size parsing (slow)
struct Datom {
    uint32_t e_len;       // Length prefix
    uint8_t* e_data;      // Entity bytes
    uint32_t a_len;       // Length prefix
    uint8_t* a_data;      // Attribute bytes
    uint32_t v_len;       // Length prefix
    uint8_t* v_data;      // Value bytes (could be huge!)
    uint32_t t_len;       // Length prefix
    uint8_t* t_data;      // Transaction bytes
    uint32_t c_len;       // Length prefix
    uint8_t* c_data;      // Context metadata bytes
};

// Parsing requires:
// 1. Read length prefix
// 2. Allocate/copy bytes
// 3. Repeat for each field
// 4. Handle variable offsets

The Performance Implications

1. Slower Sequential Scans

Fixed-size datoms can be scanned at memory bandwidth:

Read 40 bytes → Parse datom → Next datom (40 bytes away)

Variable-size datoms require tracking offsets:

Read 4 bytes (e_len) → Read e_len bytes →
Read 4 bytes (a_len) → Read a_len bytes →
Read 4 bytes (v_len) → Read v_len bytes → ...

This introduces branch mispredictions and variable memory jumps.

2. Index Bloat

Indexes (EAVT, AEVT, AVET, VAET) store datom references. With fixed-size datoms:

Index entry: [sort-key, offset]
Offset = datom_index × 40 bytes  (simple multiplication)

With variable-size datoms:

Index entry: [sort-key, offset]
Offset = sum of all previous datom sizes  (requires offset table)

You need an auxiliary offset index just to find datoms, adding memory overhead and indirection.

3. Cache Inefficiency

Fixed-size datoms pack predictably into cache lines (64 bytes):

Cache line 1: [datom1 (40 bytes), partial datom2 (24 bytes)]
Cache line 2: [rest of datom2 (16 bytes), datom3 (40 bytes), ...]

Variable-size datoms fragment cache utilization:

Cache line 1: [small datom (20 bytes), padding, next offset (4 bytes), ...]
Cache line 2: [huge value blob (5000 bytes spanning 78 cache lines)]
Cache line 80: [next small datom (25 bytes), ...]

Large values evict useful data from cache, thrashing performance.

4. Compression Challenges

Fixed-size datoms compress well with columnar encoding:

All entities:     [1, 2, 3, 4, 5, ...] → delta encoding
All attributes:   [1, 1, 1, 2, 2, ...] → run-length encoding
All values:       [100, 101, 105, ...] → delta + dictionary

Variable-size elements resist simple compression because:

  • Length prefixes add entropy
  • Value sizes vary unpredictably
  • Pointers/offsets break columnar patterns

Solutions: Trading Off Flexibility and Performance

1. Hybrid Representation: Inline Small, Reference Large

Store small values inline, large values out-of-line:

;; Small value (inline)
[42 :name "Alice" 1001 nil]

;; Large value (reference)
[42 :photo [:ref blob-store-id-xyz] 1001 nil]

This keeps most datoms fixed-size while allowing large values without bloat.

2. Separate Storage for Large Values

Store the datom index separately from large value blobs:

Datom index:  [e a v_ref t c]  (fixed size)
Blob store:   {v_ref → actual_large_value}

Index scans remain fast. Blob retrieval is explicit and rare.

3. Intern Common Values

Replace repeated strings/keywords with small integer IDs:

;; Before interning
[1 :user/email "alice@example.com" 1001 nil]
[2 :user/email "bob@example.com" 1002 nil]

;; After interning
[1 47 "alice@example.com" 1001 nil]  ;; 47 = :user/email
[2 47 "bob@example.com" 1002 nil]

Attributes especially benefit from interning - there are typically far fewer unique attributes than entities.

4. Fixed-Width Encoding for Common Cases

Use fixed-width encoding for the most common datom shapes:

Type 0: [u64 entity, u16 attr_id, u64 int_value, u64 tx, nil]
Type 1: [u64 entity, u16 attr_id, u32 str_ref, u64 tx, nil]
Type 2: [u64 entity, u16 attr_id, u64 ref_entity, u64 tx, nil]
Type 255: [variable, variable, variable, variable, variable]

Most datoms fit common patterns. Reserve variable encoding for edge cases.

5. Columnar Storage

Store datoms in columnar format instead of row format:

entities:    [1, 2, 3, 4, 5, ...]
attributes:  [1, 1, 1, 2, 2, ...]
values:      [100, 200, 300, 400, ...]
txs:         [1001, 1001, 1002, 1002, ...]
context:     [nil, nil, nil, nil, ...]

This enables:

  • Better compression (similar values group together)
  • SIMD processing (vectorized operations)
  • Skipping columns you don't need (projection pushdown)

6. Typed Streams: Go Channels for Datoms

What if, instead of one heterogeneous datom stream, you had typed streams per attribute?

This is the insight behind Go channels. When you create a chan int64, the element type is known up front: every element is exactly 8 bytes. No parsing. No type inspection. Just read and go. Type safety eliminates runtime overhead.


Apply this to datoms:

;; Type-erased (current): slow parsing
stream: [[1 :name "Alice" 1001 nil]
         [2 :age 30 1002 nil]
         [3 :photo blob-ref 1003 nil]]

;; Typed streams (proposed): fast parsing
:user/name-stream   → chan string-ref
:user/age-stream    → chan int64
:user/photo-stream  → chan blob-ref

Each attribute gets its own typed stream.

Why this is powerful:

  • Fixed-size per stream: Fast parsing, no length prefixes
  • Natural partitioning: Each attribute is isolated
  • Columnar automatically: All :user/name values together
  • Type safety: Compiler can verify at stream creation
  • SIMD-friendly: Homogeneous data = vectorization
  • Easy to extend: New attribute = new typed stream

The AEVT Index Connection

This isn't just theoretical - it maps directly to how datom indexes work.

The AEVT index (attribute, entity, value, transaction) groups datoms by attribute. If you store each attribute's datoms in a typed stream, you get:

AEVT[:user/name] = typed stream of (entity, string-ref, tx, ctx)
AEVT[:user/age]  = typed stream of (entity, int64, tx, ctx)
AEVT[:user/photo] = typed stream of (entity, blob-ref, tx, ctx)

The index IS the typed stream.

Benefits:

  • Index scans become sequential reads of typed data
  • No runtime type checking (type known at stream creation)
  • Compression works better (homogeneous values)
  • Can use specialized codecs per attribute type

Trade-offs

Typed streams aren't free:

  • More streams to manage: One per attribute (but you'd have indexes anyway)
  • Routing overhead: Must dispatch datoms to correct stream
  • Schema evolution: Changing types requires migration
  • Mixed-type attributes: Need union types or multiple streams

But in exchange, you get parsing performance that approaches raw memory bandwidth.

Hybrid: Schema-on-Write

The best approach: infer types on first write, then use typed streams:

;; First write: establish type
(transact! [[:db/add 1 :user/age 30]])
;; → Creates typed stream: :user/age → chan int64

;; Subsequent writes: use typed stream (fast!)
(transact! [[:db/add 2 :user/age 25]])
;; → Appends to typed stream (no parsing, direct write)

;; Type mismatch: error or coercion
(transact! [[:db/add 3 :user/age "thirty"]])
;; → Error: Expected int64, got string

This is schema-on-write: the first write defines the type, subsequent writes are validated and optimized.

DaoDB's Approach

DaoDB combines multiple strategies:

  • Typed attribute streams: Each attribute stored in a typed stream (AEVT index = typed streams)
  • Attribute interning: Attributes are always small integer IDs
  • Inline small values: Integers, small strings, booleans stored directly
  • Reference large values: Blobs and large strings stored separately
  • Columnar indexes: EAVT/AEVT/AVET indexes use columnar encoding
  • Type-tagged encoding: Common datom shapes use optimized fixed-width encodings

This balances:

  • Fast index scans (fixed-width common case)
  • Flexible value types (variable-width escape hatch)
  • Low memory overhead (interning and references)
  • Good compression (columnar + type tagging)

The Fundamental Tension

This is a microcosm of a deeper trade-off in datom.world:

Semantic flexibility vs execution performance

Universal representation vs specialized encoding

Datoms are semantically universal - they can represent any fact. But universality has a cost.

The art is in finding encodings that:

  • Preserve the logical model (everything is a datom)
  • Optimize the physical representation (not everything is encoded the same way)

This is exactly the same pattern as:

  • Relational algebra (logical) vs B-trees and hash indexes (physical)
  • Lambda calculus (logical) vs register machines (physical)
  • HTML/CSS (logical) vs GPU rasterization (physical)

The abstraction is pure. The implementation is pragmatic.

Conclusion: Abstractions Have Weight

[e a v t m] looks simple. Five elements. Fixed-size tuple.

But the moment you serialize it, you encounter reality.

Variable-size elements mean:

  • Slower parsing
  • More complex indexing
  • Cache inefficiency
  • Compression challenges

The solution isn't to abandon the abstraction - it's to recognize the implementation space beneath it.

Datoms are the semantic model. Interning, referencing, columnar encoding, and type tagging are the performance model.

Both are necessary. Neither is sufficient alone.

The abstraction gives you composability. The encoding gives you speed.

Learn More