Data-Oriented Programming
Data-Oriented Programming (DOP) is a way of thinking that designs programs around the shape, flow, transformation, and access patterns of data rather than around hierarchies of objects or classes.
However, this term is used in two distinct ways.
1. Data-Oriented Design (DOD)
Data-Oriented Design from the game/engine/performance optimization camp
Key concepts: CPU cache, memory layout, batch processing, ECS, hot/cold split
2. Data-Oriented Programming (DOP)
Immutable data-centric programming as described by Yehonathan Sharvit
Key concepts: separation of code and data, generic data structures, immutable data, pure transformation
The two share common ground even though they emphasize different things.
A program is a transformer that converts data from one form into another.
If Object-Oriented Programming (OOP) asks "which objects exchange messages with each other," Data-Oriented Programming starts by asking a different question.
What data exists?
How much of it is there?
How often is it read and written?
In what order is it accessed?
Into what form is it transformed?
Which data moves together, and which data should be kept separate?
Personal note: If OOP asks "what nouns exist in this world?", Data-Oriented Programming asks "so what exactly is sitting in memory, how many of them are there, and how are they laid out?"
OOP describes the world in a way that is easy for humans to understand, while Data-Oriented Programming re-describes the world in a way that is easy for the CPU to understand.
We write programs for humans, yet in the end we have to appease the CPU.
All of this contradiction stems from the fact that at runtime, it is the CPU that shows up for work, not the domain expert.
Why It Matters
A lot of code starts out looking clean, modeled neatly as objects.
This is especially true in game development, where the first model often looks like this.
Player
Enemy
Bullet
Item
Particle
But at actual runtime, these questions often become far more important.
How many bullets are there, tens of thousands?
Do we only need to update the position each frame?
Which fields are needed for collision detection?
Which data is needed for rendering?
Is this data laid out in contiguous memory?
When a single OOP object bundles position, velocity, rendering info, sound, network state, and behavior state all together, it reads naturally. But when a function that only needs position and velocity iterates over an array of such objects every frame, each cache line fill drags in the unneeded fields that sit next to the ones it wants.
For example, adding a VFX effect starts to stutter once the particle count grows, or adding animations forces you to optimize by hand. People try everything, starting with object pooling, but those fixes only go so far.
Data-oriented thinking aims to reduce that waste.
Object-centric:
A single object holds multiple responsibilities and data together.
Data-centric:
Keep data that is processed together in the same place,
and separate data that is not processed together.
DOD: Data-Oriented Design for Performance
In game engines, rendering, physics, simulation, and high-volume event processing, the same operations are repeated over large amounts of data. Mike Acton's talk "Data-Oriented Design and C++" is the landmark example that brought this thinking to widespread attention in the C++/game engine world.[1]
Update 100,000 positions
Check collisions for 50,000 bullets
Decrease lifetime for 1,000,000 particles
Evaluate 20,000 NPC states
The key question here is not "did we model our objects beautifully?" but rather "how predictably and contiguously does the CPU read data?"
AoS vs. SoA
Traditional object arrays in OOP are usually close to AoS (Array of Structures). Human intuition produces AoS, while the CPU's instinct demands SoA (Structure of Arrays).
struct Particle {
float x;
float y;
float vx;
float vy;
float life;
int color;
};
Particle particles[100000];
This structure is comfortable for the human eye, which perceives a particle as a single whole. But when a function that updates positions every frame iterates over this array, the CPU suffers. Even though a position update needs neither color nor life, those unnecessary fields are forcibly loaded alongside the relevant data during each cache line fill.
This is exactly Cache Pollution. Data-Oriented Design, by contrast, rearranges the world into SoA (Structure of Arrays).
struct Particles {
float x[100000];
float y[100000];
float vx[100000];
float vy[100000];
float life[100000];
int color[100000];
};
Now the position-update system sweeps through only the trajectory arrays it actually needs, in memory order, sequentially.
for (int i = 0; i < count; i++) {
x[i] += vx[i] * dt;
y[i] += vy[i] * dt;
}
The proportion of values within a single cache line that are actually needed for the computation increases. In other words, the same memory bandwidth delivers more useful work.
The point is not that "AoS is bad." There is no single correct shape for data; what matters is the ability to flip the layout of arrays at any time to match which data pieces a given job or system reads, and how often it reads them.
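The same layout can be sketched in TypeScript, where a typed array such as Float32Array is backed by one contiguous buffer. This is a minimal illustration rather than an engine design: ParticleArrays, createParticles, spawnParticle, and integratePositions are hypothetical names introduced here, and whether the layout actually pays off depends on the runtime and has to be measured.

```typescript
// SoA layout: one contiguous typed array per field.
// All names here are illustrative assumptions, not from a real engine.
type ParticleArrays = {
  count: number; // number of live particles
  x: Float32Array;
  y: Float32Array;
  vx: Float32Array;
  vy: Float32Array;
  life: Float32Array;
  color: Uint32Array;
};

function createParticles(capacity: number): ParticleArrays {
  return {
    count: 0,
    x: new Float32Array(capacity),
    y: new Float32Array(capacity),
    vx: new Float32Array(capacity),
    vy: new Float32Array(capacity),
    life: new Float32Array(capacity),
    color: new Uint32Array(capacity),
  };
}

// Appends one particle and returns its index.
function spawnParticle(
  p: ParticleArrays,
  x: number,
  y: number,
  vx: number,
  vy: number,
): number {
  const i = p.count;
  if (i >= p.x.length) throw new Error("particle capacity exceeded");
  p.x[i] = x;
  p.y[i] = y;
  p.vx[i] = vx;
  p.vy[i] = vy;
  p.count = i + 1;
  return i;
}

// The position update reads only x, y, vx, and vy; life and color are never touched.
function integratePositions(p: ParticleArrays, dt: number): void {
  for (let i = 0; i < p.count; i++) {
    p.x[i] += p.vx[i] * dt;
    p.y[i] += p.vy[i] * dt;
  }
}
```

A system that only fades particles would loop over life alone in the same way, which is the "flip the layout to match the job" idea in miniature.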
Personal note: The reason people of old used to refer to ships and cars with feminine pronouns was that the equipment was temperamental. In that same spirit, the computer is more than qualified to be called "she."
Modern CPUs do not fetch a single byte from memory. They fetch surrounding data together in cache line units.
That is why code that reads a contiguous array sequentially is fast.
Good approach:
x[0], x[1], x[2], x[3] ...
Bad approach:
objectA.position
objectQ.position
objectM.position
objectZ.position
The second approach may look clean in code, but from the CPU's perspective it is jumping all over the place.
The essence of Data-Oriented Design is not just one cache trick. Richard Fabian explains that Data-Oriented Design is an approach that looks beyond cache miss avoidance to consider the type, frequency, quantity, shape, and probability of data.[10]
type: What kind of data is it?
frequency: How often does it occur?
quantity: How much of it is there?
shape: What structure does it have?
probability: How often does each value or branch show up?
In other words, data-oriented thinking is broader than "data structure optimization." It treats the actual distribution and usage patterns of data as the starting point of design.
Hot Data and Cold Data
Data that is accessed frequently is called hot data, and data that is accessed rarely is called cold data.
Consider a game character with data like this.
position
velocity
health
name
description
inventory
questHistory
lastDialogueText
What is needed every frame is typically position, velocity, and health. By contrast, name, description, questHistory, and lastDialogueText are only needed when a UI panel opens or a dialogue event fires. Bundling both in the same object means the hot loop drags cold data along with it. Even though the actual computation only needs position and velocity, the object's memory layout can place infrequently used data such as name, description, and inventory near the same cache lines. Data-Oriented Design therefore separates data by access frequency.
hot:
- position
- velocity
- health
cold:
- name
- description
- inventory
- questHistory
- lastDialogueText
The key is keeping hot data small and contiguous. Data read by loops that run every frame should be packed as densely as possible, while data needed only occasionally should be moved to a separate structure. Rather than treating a character as "one realistic object," you split it into multiple data sets that match the access patterns at runtime.
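A rough TypeScript sketch of that split, assuming characters are addressed by a numeric index: the hot fields live in dense typed arrays, and the cold fields sit in a separate map that the per-frame loop never touches. HotCharacters, ColdCharacter, moveCharacters, and characterTooltip are hypothetical names used only for illustration.

```typescript
// Hot data: small, dense, read by the per-frame loop.
type HotCharacters = {
  count: number;
  posX: Float32Array;
  posY: Float32Array;
  velX: Float32Array;
  velY: Float32Array;
  health: Float32Array;
};

// Cold data: fetched by index only when a UI panel or dialogue event needs it.
type ColdCharacter = {
  displayName: string;
  description: string;
  questHistory: string[];
  lastDialogueText: string;
};

function createHotCharacters(count: number): HotCharacters {
  return {
    count,
    posX: new Float32Array(count),
    posY: new Float32Array(count),
    velX: new Float32Array(count),
    velY: new Float32Array(count),
    health: new Float32Array(count),
  };
}

// Per-frame loop: touches only the hot arrays.
function moveCharacters(hot: HotCharacters, dt: number): void {
  for (let i = 0; i < hot.count; i++) {
    hot.posX[i] += hot.velX[i] * dt;
    hot.posY[i] += hot.velY[i] * dt;
  }
}

// Occasional lookup: the only place that reads cold data.
function characterTooltip(
  cold: Map<number, ColdCharacter>,
  index: number,
): string {
  const entry = cold.get(index);
  return entry ? `${entry.displayName}: ${entry.description}` : "(no details)";
}
```

The shared index is what ties the two halves back together, so the character still exists as a concept even though no single object holds all of its fields.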
Personal note: The real reason the computer deserves to be called "she" is that she absolutely hates having stale conversation history (lastDialogueText) and the contents of a bag (inventory) rummaged through at random moments.
Ask her cleanly about where to go next (position) and how fast to get there (velocity), and she answers at blazing speed. But the moment you sneak in garbage like "hey, remember that quest when we first met? (questHistory)", she blocks you and the runtime freezes.
The safe approach is to keep only the daily essentials (Hot) compact and ready to serve, while the messy historical records (Cold) are strictly quarantined on a separate, password-protected external drive.
Maybe the programmer who truly masters DOP is a Casanova.
ECS and Data-Oriented Design
Entity Component System (ECS) is a common form in which Data-Oriented Design is implemented, though ECS itself is not the whole of data-oriented thinking. ECS keeps data out of object internals, separates it into components, and lets systems process matching data combinations in batches.
Entity:
ID. Usually just a number. The Entity itself holds almost no logic or data.
Component:
Data. Pure data bundles such as Position, Velocity, and Health.
System:
Logic. Processes entities that have a specific combination of components in batches.
Example:
MovementSystem:
Find all entities with Position + Velocity and
perform `position += velocity * dt`
In OOP, Player.update(), Enemy.update(), and Bullet.update() each mutate their own internal state.
In ECS, by contrast, MovementSystem collects everything that can move into the same data shape and processes it all at once in a batch.
Object-centric:
- Each object performs its own update.
- Logic is scattered inside individual objects.
- The same work is spread across multiple types.
ECS-centric:
- Systems batch-process groups of the same components.
- Data and logic are separated.
- The same operation is applied sequentially to data of the same shape.
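The Entity / Component / System split above can be sketched in a few lines of TypeScript. This is a deliberately naive version: EcsWorld, createWorld, createEntity, and movementSystem are hypothetical names, and the components are stored in Maps only for brevity, whereas production ECS libraries usually pack components into dense arrays.

```typescript
// Entity: just an ID.
type EntityId = number;

// Components: plain data with no methods.
type Vec2 = { x: number; y: number };

type EcsWorld = {
  nextId: EntityId;
  positions: Map<EntityId, Vec2>;
  velocities: Map<EntityId, Vec2>;
};

function createWorld(): EcsWorld {
  return { nextId: 0, positions: new Map(), velocities: new Map() };
}

function createEntity(world: EcsWorld): EntityId {
  const id = world.nextId;
  world.nextId += 1;
  return id;
}

// System: processes every entity that has both Position and Velocity.
function movementSystem(world: EcsWorld, dt: number): void {
  world.positions.forEach((pos, id) => {
    const vel = world.velocities.get(id);
    if (!vel) return; // no Velocity component, so this entity does not move
    pos.x += vel.x * dt;
    pos.y += vel.y * dt;
  });
}
```

Note that nothing here knows whether an entity is a player, an enemy, or a bullet; having the right components is the only thing that makes it "movable."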
Unity's DOTS (Data-Oriented Technology Stack) is a leading industry example of using ECS as the foundation for Data-Oriented Design support.[2]
Personal note: Unity puts ECS front and center and pitches it as "the paradigm is changing now," but 99% of the Asset Store is MonoBehaviour-based, and mixing the two in some "hybrid" arrangement produces a strange monster.
You start wondering whether you're actually making a game or just arm-wrestling the compiler. The problem is, the day I win that arm-wrestling match probably won't come in my lifetime.
DOP: Programming Centered on Immutable Data
The Data-Oriented Programming laid out by Sharvit differs somewhat from the game-engine-style DOD. The central problem here is not CPU cache but the complexity of information systems. The Manning book description also explains DOP as a paradigm that simplifies state management through "immutable generic data structures" and "non-mutating general-purpose functions."[3]
The core principles are the following four.[4]
1. Separate code from data.
2. Represent data using generic data structures.
3. Do not mutate data directly.
4. Separate the data schema from its representation.
As a practical technique, there is an accompanying principle: manipulate data with general-purpose functions. This means favoring general operations such as map, filter, reduce, pick, merge, assoc, update, and groupBy over class-specific methods.
OOP-style question:
"What methods should a Book object have?"
DOP-style question:
"What shape does Book data have,
and what pure transformation functions operate on that shape?"
OOP style:
class Book {
public constructor(
public title: string,
public author: string,
public checkedOut: boolean,
) {}
public checkout(): void {
this.checkedOut = true;
}
}
Personal note: Plaster readonly all over your code. It is the highest-ROI posturing that makes you look like a seasoned developer with just a few extra keystrokes, and it is also a physical restraint that prevents your future self from corrupting global state and blowing up the runtime five minutes before delivery.
DOP style:
type Book = {
readonly id: string;
readonly title: string;
readonly author: string;
readonly checkedOut: boolean;
};
type BookView = {
readonly title: string;
readonly status: "available" | "checkedOut";
};
const checkoutBook = (book: Book): Book => {
return {
...book,
checkedOut: true,
};
};
const toBookView = (book: Book): BookView => {
return {
title: book.title,
status: book.checkedOut ? "checkedOut" : "available",
};
};
// Externally it is 100% pure functions, but internally it uses local mutation to conserve memory.
// This is the difference between dogmatic FP and pragmatic DOP.
const groupBooksByAuthor = (
books: readonly Book[],
): ReadonlyMap<string, readonly Book[]> => {
const groups = new Map<string, Book[]>(); // Allowing internal temporary mutation
for (const book of books) {
let current = groups.get(book.author);
if (!current) {
current = [];
groups.set(book.author, current);
}
current.push(book); // Pushing elements in without recreating the array
}
return groups as ReadonlyMap<string, readonly Book[]>; // Closing with immutability (Readonly) on the way out and returning
};
This approach closely resembles functional programming. However, rather than foregrounding all of functional programming's abstractions, such as monads, typeclasses, and higher-kinded types, DOP focuses on keeping data representations simple and transforming that data with non-destructive functions.
Put differently, separating the schema from the data representation means you do not assume that data must be bound to a specific class constructor or its methods. Data can be represented as a plain map or object, and whether that data is valid can be verified by a separate schema or validator.
DOP Principle 1: Separate Code from Data
In DOP, data does not have methods. Data is data, and behavior is functions.
// Data
type Member = {
id: string;
name: string;
borrowedBookIds: string[];
};
// Behavior
function canBorrow(member: Member): boolean {
return member.borrowedBookIds.length < 5;
}
Sharvit observes that structures separating code and data tend to be composed of simpler parts than structures that intermix the two.[5]
From this perspective, an object's bundling of data and methods together is both a strength and a potential source of tight coupling. When an object's internal state and its methods are strongly bound together, it can become difficult to reuse, record, compare, or transmit that data in other contexts.
DOP tries to keep data as plain values and separate behavior into functions that take those values as input and return new values.
Advantages:
Functions are easy to test independently.
The same data is easy to reuse across multiple contexts.
Serialization, logging, diff, and replay become straightforward.
Costs:
Control over which functions access which data becomes weaker.
It is harder to discover usage through a list of methods, as you would with an object.
Separating data from functions can increase the number of files and modules.
Personal note: DOP is not saying "encapsulation is unnecessary." It is closer to asking, "does encapsulation necessarily require locking all data inside an object?"
But even that deep reflection becomes utterly useless philosophical self-indulgence the moment a project manager says, "Hey, the deadline is tomorrow, just make it public and wire it up fast." Philosophy is a luxury reserved for those who have already shipped.
DOP Principle 2: Represent Data with Generic Data Structures
DOP prefers to represent domain data using general data structures such as maps, dictionaries, objects, arrays, and lists rather than dedicated classes.[6]
const book = {
id: "book-1",
title: "Data-Oriented Programming",
author: "Yehonathan Sharvit",
tags: ["programming", "architecture"],
};
The point is not "do whatever you want with types." It is to let the data representation take advantage of the general-purpose data operations the ecosystem already provides.
const publicBookView = {
title: book.title,
author: book.author,
};
const serialized = JSON.stringify(publicBookView);
The more dedicated classes you have, the more dedicated conversion code each class demands. Conversely, when data is a plain object or map, operations like serialization, partial selection, merging, comparison, and diffing are easy to handle with general-purpose functions.
Class-centric:
Book.toJson()
Author.toJson()
Member.toJson()
Loan.toJson()
Generic data-centric:
JSON.stringify(data)
pick(data, keys)
merge(dataA, dataB)
diff(before, after)
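To make the generic side of that comparison concrete, here is a small sketch of what such helpers might look like when written by hand. pick, merge, and changedKeys are illustrative stand-ins defined here, not the API of any particular library, and changedKeys is only a shallow comparison using ===.

```typescript
// Plain data: any object with string keys (an assumption for this sketch).
type PlainData = Readonly<Record<string, unknown>>;

// Keeps only the listed keys that are actually present on the data.
function pick(data: PlainData, keys: readonly string[]): PlainData {
  const out: Record<string, unknown> = {};
  for (const key of keys) {
    if (Object.prototype.hasOwnProperty.call(data, key)) {
      out[key] = data[key];
    }
  }
  return out;
}

// Shallow merge into a new object; values from b win on conflicts.
function merge(a: PlainData, b: PlainData): PlainData {
  return { ...a, ...b };
}

// Shallow diff: the keys whose values differ (by ===) between two versions.
function changedKeys(before: PlainData, after: PlainData): string[] {
  const allKeys = Object.keys(before).concat(Object.keys(after));
  return allKeys.filter(
    (key, i) => allKeys.indexOf(key) === i && before[key] !== after[key],
  );
}
```

Because none of these functions mention Book, Author, or Loan, the same three helpers serve every kind of record in the system.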
There are costs too.
Typos in field names can hide until runtime.
You may get less help from IDE autocompletion and static typing.
It can be less efficient than `class`/`struct` field access in terms of performance.
Therefore, in languages like TypeScript, a pragmatic compromise is to annotate plain data shapes with explicit types rather than using DOP with untyped objects.
type BookData = {
id: string;
title: string;
author: string;
tags: string[];
};
DOP Principle 3: Treat Data as Immutable Values
In DOP, data is a value. The value itself does not change; a change is expressed by creating a new version.[7]
const before = {
id: "book-1",
checkedOut: false,
};
const after = {
...before,
checkedOut: true,
};
An important distinction:
Data values do not change.
A variable can change to point to a new data value.
This distinction is the same as immutability in functional programming. In practice, the following benefits are significant.
It is easy to compare the previous state with the new state.
Undo/redo, event replay, and audit logging become straightforward.
Shared mutable state problems in concurrent contexts are reduced.
In tests, it is easy to verify inputs and outputs by comparing values.
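The undo/redo claim is easy to see in a sketch: when every state is an immutable value, history is just a list of earlier values. UndoStack, commit, and undo below are hypothetical names, and the sketch covers undo only.

```typescript
// History of immutable values: earlier versions plus the current one.
type UndoStack<T> = {
  readonly past: readonly T[];
  readonly present: T;
};

// Records the current value in the past and makes `next` the present.
function commit<T>(stack: UndoStack<T>, next: T): UndoStack<T> {
  return { past: [...stack.past, stack.present], present: next };
}

// Steps back one version; returns the stack unchanged if there is no past.
function undo<T>(stack: UndoStack<T>): UndoStack<T> {
  if (stack.past.length === 0) return stack;
  return {
    past: stack.past.slice(0, -1),
    present: stack.past[stack.past.length - 1],
  };
}
```

No copying of old states is needed when committing, because nothing can modify them after the fact.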
Example:
function checkoutBook(state: LibraryState, memberId: string, bookId: string): LibraryState {
return {
...state,
loans: [
...state.loans,
{ memberId, bookId, checkedOutAt: new Date().toISOString() },
],
books: state.books.map(book =>
book.id === bookId ? { ...book, checkedOut: true } : book
),
};
}
This code does not secretly mutate internal object state. The input state and the output LibraryState are clearly defined. However, with large data, you need to pay attention to the cost of shallow versus deep copying and whether to use a structural-sharing library. This is where persistent data structures become relevant.
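The copying cost mentioned here can be made visible with a small experiment. The sketch below repeats the same update with explicit types (BookRecord, Loan, and LibraryState are assumed shapes, since the original does not define them) and takes the timestamp as a parameter, which is a change made here so that the function is deterministic.

```typescript
type BookRecord = { readonly id: string; readonly checkedOut: boolean };
type Loan = {
  readonly memberId: string;
  readonly bookId: string;
  readonly checkedOutAt: string;
};
type LibraryState = {
  readonly books: readonly BookRecord[];
  readonly loans: readonly Loan[];
};

// Same update as above; checkedOutAt is passed in (an assumption of this sketch).
function checkoutBookAt(
  state: LibraryState,
  memberId: string,
  bookId: string,
  checkedOutAt: string,
): LibraryState {
  return {
    ...state,
    loans: [...state.loans, { memberId, bookId, checkedOutAt }],
    books: state.books.map((book) =>
      book.id === bookId ? { ...book, checkedOut: true } : book,
    ),
  };
}

const stateBefore: LibraryState = {
  books: [
    { id: "book-1", checkedOut: false },
    { id: "book-2", checkedOut: false },
  ],
  loans: [],
};

const stateAfter = checkoutBookAt(
  stateBefore,
  "member-1",
  "book-1",
  "2024-01-01T00:00:00.000Z",
);

// The books array is new, but the untouched book-2 object is shared by reference.
const arrayWasCopied = stateAfter.books !== stateBefore.books;
const unchangedBookIsShared = stateAfter.books[1] === stateBefore.books[1];
```

Both flags come out true: the update allocates a new array of the same length on every call, while the unchanged elements are reused. That per-call array copy is the cost that persistent data structures aim to reduce.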
DOP Principle 4: Separate Schema from Data Representation
Because DOP represents data as generic structures, it does not bind the shape of data to a class definition. Instead, the schema is kept separate.[8]
const addBookRequestSchema = {
type: "object",
required: ["title", "author"],
properties: {
title: { type: "string" },
author: { type: "string" },
tags: {
type: "array",
items: { type: "string" },
},
},
};
This approach works well with tools like JSON Schema.[9]
Data:
{ "title": "DOP", "author": "Sharvit" }
schema:
title is a required string
author is a required string
tags is an optional array<string>
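To show what "the schema lives apart from the data" means at runtime, here is a toy validator for the schema above. It is explicitly not a JSON Schema implementation: it understands only the three constructs this example uses (required, string properties, and arrays of strings), and ObjectSchema, FieldSchema, and validate are names made up for this sketch. A real project would normally rely on an existing validator library instead.

```typescript
// Only the schema features used by the example above (an assumption of this sketch).
type FieldSchema =
  | { type: "string" }
  | { type: "array"; items: { type: "string" } };

type ObjectSchema = {
  type: "object";
  required: readonly string[];
  properties: Readonly<Record<string, FieldSchema>>;
};

const addBookRequestSchema: ObjectSchema = {
  type: "object",
  required: ["title", "author"],
  properties: {
    title: { type: "string" },
    author: { type: "string" },
    tags: { type: "array", items: { type: "string" } },
  },
};

// Returns a list of problems; an empty list means the data matches the schema.
function validate(schema: ObjectSchema, data: unknown): string[] {
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return ["expected an object"];
  }
  const record = data as Record<string, unknown>;
  const errors: string[] = [];
  for (const key of schema.required) {
    if (!(key in record)) errors.push(`missing required field: ${key}`);
  }
  for (const key of Object.keys(schema.properties)) {
    const value = record[key];
    if (value === undefined) continue; // optional field that is absent
    const field = schema.properties[key];
    if (field.type === "string") {
      if (typeof value !== "string") errors.push(`${key} must be a string`);
    } else if (
      !Array.isArray(value) ||
      !value.every((item) => typeof item === "string")
    ) {
      errors.push(`${key} must be an array of strings`);
    }
  }
  return errors;
}
```

The data being checked is an ordinary object with no knowledge of the schema, so the same schema object can also drive documentation or test-data generation.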
Advantages:
Validation of external request/response data becomes clear.
The schema can be reused for runtime validation, documentation, and test data generation.
During the exploration phase, you can attach a schema late, then make it strict once things stabilize.
Costs:
The connection between data and schema is looser than with classes.
If schema validation is skipped, runtime errors are discovered late.
Using static types alongside runtime schema leads to duplicate management overhead.
In TypeScript, you typically have the following options.
1. TypeScript types only
Great at compile time, but weak for validating external input at runtime.
2. Runtime schemas such as JSON Schema, Zod, or io-ts
Strong for validating external input, but introduces schema management overhead.
3. Synchronizing types and schemas with code generation tools
The most robust approach, but it makes the build pipeline more complex.
Personal note: DOP's "separate schema from data" is idealistic. Data is free, schema is optional, validation is explicit, and the developer enjoys philosophical peace.
Then you sit in your room drinking coffee and realize there are three days until the deadline. The moment that hits you, any gets plastered everywhere and DOP becomes DROP.
Separating the schema sounds great, but separation breaks down when the deadline is looming.
How Does DOP Handle Polymorphism?
In OOP, polymorphism is expressed primarily through class/interface hierarchies and method dispatch.
interface Shape {
area(): number;
}
In DOP, a kind or type field is placed on the data, and a general-purpose function branches on it or uses a dispatch table.
type Circle = {
kind: "circle";
radius: number;
};
type Rectangle = {
kind: "rectangle";
width: number;
height: number;
};
type Shape = Circle | Rectangle;
function area(shape: Shape): number {
switch (shape.kind) {
case "circle":
return Math.PI * shape.radius * shape.radius;
case "rectangle":
return shape.width * shape.height;
}
}
This approach resembles Algebraic Data Types (ADTs). TypeScript's discriminated unions, Rust's enums, F#'s discriminated unions, and Haskell's ADTs give strong support for exactly this problem.
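The dispatch-table variant mentioned above can be sketched as follows. AreaTable, areaByKind, and areaOf are hypothetical names; the mapped type is one possible way to make the compiler complain when a kind has no handler, not the only one.

```typescript
type Circle = { kind: "circle"; radius: number };
type Rectangle = { kind: "rectangle"; width: number; height: number };
type Shape = Circle | Rectangle;

// One handler per kind. Leaving out a kind is a compile-time error.
type AreaTable = {
  [K in Shape["kind"]]: (shape: Extract<Shape, { kind: K }>) => number;
};

const areaByKind: AreaTable = {
  circle: (shape) => Math.PI * shape.radius * shape.radius,
  rectangle: (shape) => shape.width * shape.height,
};

function areaOf(shape: Shape): number {
  // The cast is needed because the compiler cannot correlate shape.kind
  // with the matching table entry; the table's type keeps it sound.
  const handler = areaByKind[shape.kind] as (s: Shape) => number;
  return handler(shape);
}
```

Compared with the switch version, adding a new shape means adding one union member and one table entry, and the table itself is plain data that can be inspected or extended.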
From a DOP perspective, polymorphism is achievable without objects. The Manning book description also lists "polymorphism without objects" as a learning objective.[3]
Personal note: In school we were taught that switch statements are a bad habit to avoid, and that we should solve problems elegantly with "inheritance and overriding." But after debugging a few of those "elegant Java enterprise projects" twisted into dozens of layers, you find yourself ready to part with OOP inheritance forever. Fortunately for me, since I have very little in the way of real-life inheritance from my parents, I have absolutely no attachment to "inheritance" in code either.
Where DOP Fits Best
Sharvit-style DOP is particularly well-suited to information systems, that is, systems where the movement and transformation of data matters more than CPU cache optimization.
REST/GraphQL API
JSON request/response handling
Frontend application state
ETL / data pipeline
event enrichment
workflow state
Configuration file/policy file processing
Audit log/audit trail
In these systems, data already flows as JSON, maps, records, or tables. Rather than forcing it into deep class hierarchies, keeping data shapes and transformation functions explicit can be the simpler approach.
Where DOP Gets Dangerous
DOP reduces OOP's problems but has its own.
Field names scattered as string keys make the code vulnerable to typos.
It can become difficult to track which function expects which data shape.
Without schema validation, the code devolves into a "loose collection of maps."
Failing to understand the cost of immutable updates leads to performance problems.
The responsible party for enforcing domain invariants can disappear entirely.
Practical DOP therefore typically needs to be paired with the following safeguards.
Schemas like TypeScript types, JSON Schema, and Zod
pure function-centered testing
single-path state mutation
domain events and command handlers
diff, snapshot, audit log
Personal note: Misuse DOP and you escape "the complex maze of object-oriented code" only to fall into "the JSON swamp where anyone can touch anything."
Even after gaining freedom, you eventually need new constraints. Too much freedom is never a good thing.
Differences from OOP
OOP:
Data and behavior are kept together inside objects.
Methods protect the internal invariants of objects.
Objects collaborate through message/method calls.
Data-Oriented:
Data and behavior are separated.
The shape and flow of data are considered first.
Data of the same shape is processed in batches.
OOP is not bad. In particular, objects are a natural fit for things that require strong lifecycle and invariant management, such as external resources, drivers, file handles, and network connections. Richard Fabian also acknowledges that OOP can be the better choice for stable, large-scale concepts like file system handles or graphics APIs.[10]
Problems arise when everything starts as an object.
Objects carry too much data.
Inheritance hierarchies obscure data flow.
Memory is laid out by conceptual unit rather than by unit of work.
When the same operation is performed on many objects, cache locality degrades.
Data-Oriented Programming critiques that point.
Relationship with Functional Programming
Data-Oriented Programming overlaps significantly with functional programming.
Common ground:
- Avoids mutating data directly.
- Places transformation functions at the center.
- Aims to make inputs and outputs explicit.
- Reducing state changes makes testing easier.
Differences:
- Functional programming places greater emphasis on purity, composition, types, and controlling side effects.
- Data-Oriented Programming (DOP) places greater emphasis on the shape, volume, flow, layout, and access patterns of data.
Simply put:
Functional: Is this computation pure?
Data-Oriented: What shape does this data take as it flows?
Performance DOD: How is this data read from memory?
Advantages
1. Bulk data processing performance can improve.
2. Data flow becomes clearer.
3. Unnecessary object graphs can be reduced.
4. Batch processing and parallelization become easier.
5. Serialization, logging, replay, and testing become simpler.
6. It becomes easier to separate business data from transformation logic.
It is especially powerful in domains where the same operations are repeated over large amounts of data, such as games, simulation, rendering, physics, data pipelines, and event processing.
Disadvantages
1. The design looks less like "real-world nouns."
2. For small programs, it can become over-optimization.
3. Because data and logic are separated, tracing them can be difficult.
4. If misused, it devolves into a pile of global data tables and procedural code.
5. The Immutable Data approach requires attention to allocation and copying costs.
6. Performance-focused DOD requires an understanding of CPU, cache, and memory models.
Personal note: Right after learning about data-oriented design, you feel the urge to tear every class apart into arrays. It is better to hold that impulse for a moment.
Not every program is a game engine, and not every object is a criminal.
When to Use It
You process large volumes of similar data.
You repeatedly perform the same operations.
Performance bottlenecks originate from memory access.
The object graph is too complex for state tracking to be practical.
Serialization, persistence, transmission, and replay are important concerns.
You want to compare input data against output data in tests.
When to Be Careful
The data volume is small.
There are no performance bottlenecks.
Objects must strictly enforce domain invariants.
Managing the lifetime of external resources is the core concern.
The team is unfamiliar with memory layouts or batch processing.
It is a business application where simple CRUD matters more than abstraction.
This does not mean data-oriented thinking is useless in business applications. Bringing in performance-focused DOD as-is would be overkill, though. For business apps, Sharvit-style DOP, separating data from transformations and reducing mutation, is often the more practical choice.
Anti-Patterns
1. Tearing Everything into Arrays
Data-oriented thinking does not mean always using SoA. When a unit of work frequently needs the whole object, AoS is better.
Question:
Which fields does this operation actually read?
How often are those fields read together?
Changing the structure without asking this question just produces code that is harder to read.
2. Scattering Invariants After Separating Data from Logic
Separating data from logic is good, but scattering invariants across the codebase is dangerous.
Bad example:
Multiple systems modify `health` arbitrarily.
It is impossible to tell where it drops to 0 or below.
Death handling, UI updates, and event dispatching fall out of sync with each other.
Even in Data-Oriented Design, ownership of invariants is necessary.
Good example:
`DamageSystem` is solely responsible for decreasing health.
`DeathSystem` handles the `health <= 0` state.
The `HealthChanged` event is explicitly published.
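The good example can be sketched as two small systems over shared plain data. CombatState, damageSystem, and deathSystem are hypothetical names, and clamping health at 0 is an assumption made here to give the invariant a concrete form.

```typescript
type HealthChanged = {
  type: "HealthChanged";
  entityId: number;
  before: number;
  after: number;
};

type CombatState = {
  health: Map<number, number>;
  dead: Set<number>;
  events: HealthChanged[];
};

// DamageSystem: the only function allowed to decrease health.
function damageSystem(
  state: CombatState,
  entityId: number,
  amount: number,
): void {
  const before = state.health.get(entityId);
  if (before === undefined) return; // entity has no health component
  const after = Math.max(0, before - amount); // assumed invariant: never below 0
  state.health.set(entityId, after);
  state.events.push({ type: "HealthChanged", entityId, before, after });
}

// DeathSystem: the only place that reacts to health <= 0.
function deathSystem(state: CombatState): void {
  state.health.forEach((hp, entityId) => {
    if (hp <= 0) state.dead.add(entityId);
  });
}
```

UI updates and other listeners consume the HealthChanged events instead of writing to health themselves, so there is exactly one place to look when health goes wrong.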
3. Optimizing Without Measuring
Data-Oriented Design places emphasis on actual data and access patterns. Changing things based solely on a feeling that "it will probably be cache-friendly," without measuring, turns into design theater.
Measure first:
- Number of data items
- Access frequency
- hot path
- cache miss
- allocation
- branch miss
- frame time / latency
4. Blanket Rejection of OOP
OOP is strong at expressing invariants and lifecycle management, while Data-Oriented Programming is strong at bulk processing and data flow.
Where objects work well:
File handles, DB connections, transactions, UI widgets, external API clients
Where data-oriented design works well:
particle, transform, physics body, telemetry event, order rows, batch job
There is no need to pick only one; use both together.
Just as our parents told us to get along with our friends.
5. Overusing Sharvit-Style DOP (Immutability) in Performance-Critical DOD Environments
Just because both carry the label "data-oriented" does not mean DOD and DOP are perfectly interchangeable.
Sharvit-style DOP's core concept of immutable data processing inevitably involves spread operators (...spread) or new object allocations. If you produce a fresh immutable object for every state update inside a game engine's hot loop (DOD territory) that runs tens of thousands of times per frame, the garbage collector will scream and halt the runtime before you ever get to optimize the CPU cache.
Information systems (DOP): willingly pay the overhead of immutability and garbage collection in order to tame the complexity of state.
High-performance loops (DOD): overwrite data in place, without mercy, inside pre-allocated contiguous memory regions (arrays), so that the GC never has to get involved.
If you blindly mix the two paradigms without clearly distinguishing their respective bottlenecks (cognitive load vs. hardware limits), you end up with a horrifying chimera that is neither maintainable nor performant.
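The two update styles look like this side by side. This is only a sketch with hypothetical names (MovingPoint, stepImmutable, stepInPlace); it shows where allocations happen, and makes no claim about how large the difference is on any given runtime.

```typescript
type MovingPoint = { readonly x: number; readonly vx: number };

// Immutable style: allocates a new array and one new object per element on every call.
function stepImmutable(
  items: readonly MovingPoint[],
  dt: number,
): MovingPoint[] {
  return items.map((p) => ({ ...p, x: p.x + p.vx * dt }));
}

// In-place style: overwrites pre-allocated buffers and allocates nothing per call.
function stepInPlace(x: Float32Array, vx: Float32Array, dt: number): void {
  for (let i = 0; i < x.length; i++) {
    x[i] += vx[i] * dt;
  }
}
```

The first version is the right default for request handlers and application state; the second is what a per-frame loop over tens of thousands of elements usually wants.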
Practical Checklist
When approaching something with a data-oriented lens, write these down first.
[What]
What data does this system transform?
[Why]
Why should you look at data flow before the object model?
[Shape]
What is the shape of the data? Is it rows, a tree, a graph, or an event stream?
[Volume]
How many data items are there? Ten? A hundred thousand? Ten million?
[Frequency]
How often is it read and written?
[Hot Path]
What is the most frequently executed loop?
[Invariant]
What are the rules that must never be broken?
[Layout]
Is data that is read together kept together?
Is data that is not read together kept separate?
[Next]
If the measurement results change, which structure will you change?
Personal note: The biggest problem with checklists is that you almost never actually use them. Every new project, you forget and start over. That is the fate of a checklist.
Final Summary
Data-Oriented Programming is not a movement against objects.
Personal note: Of course, for some people, it is exactly a movement against objects.
The core is this.
Place the program's focus not on abstract objects, but on actual data and its transformation flow.
In performance-focused DOD, the following matter:
data volume
access frequency
memory layout
cache locality
batch processing
hot/cold split
In complexity-focused DOP, the following matter:
separation of code and data
generic data structures
immutable data
non-mutating functions
schema
A one-line definition of data-oriented thinking can be written as follows.
Data-Oriented Programming is
a programming approach that, rather than asking "what exists,"
first asks "what data is being transformed, in what shape, and how often."
Personal note: That said, the kind of engineer I aspire to be is not a zealot of any particular paradigm.
Faced with the harsh realities of business requirements and the runtime environment (available memory, CPU cache, and deadlines), I need to be a mercenary willing to compromise on principles in order to find the optimal trade-off.
See Also
Persistent Data Structures
Data Locality
ECS
JSON Schema
Algebraic Data Types (ADTs)
Footnotes
1. Mike Acton. Data-Oriented Design and C++. CppCon 2014 presentation. A landmark talk that popularized Data-Oriented Design (DOD) discussion in the C++ and game engine communities.
2. Unity. Introduction to the Data-Oriented Technology Stack. Unity's Data-Oriented Technology Stack (DOTS) is a representative case study applying Data-Oriented Design centered on Entity Component System (ECS).
3. Yehonathan Sharvit. Data-Oriented Programming. Manning, 2022. Manning's introduction explains Data-Oriented Programming (DOP) in terms of immutable generic data structures and non-mutating general-purpose functions.
4. Yehonathan Sharvit. "Principles of Data-Oriented Programming". An article that outlines four principles of DOP: separation of code and data, generic data structures, immutable data, and separation of schema from representation.
5. Yehonathan Sharvit. "Separate code from data". DOP Principle 1. Explains the benefits and costs of separating code and data.
6. Yehonathan Sharvit. "Represent data with generic data structures". DOP Principle 2. Explains generic data structures like maps and arrays, along with their trade-offs.
7. Yehonathan Sharvit. "Data is immutable". DOP Principle 3. Explains the approach of creating new data versions instead of mutation.
8. Yehonathan Sharvit. "Separate data schema from data representation". DOP Principle 4. Explains why separating data representation from schema matters.
9. JSON Schema. Official website. A vocabulary for expressing the shape and validation rules of JSON data.
10. Richard Fabian. Data-Oriented Design. Fabian explains that Data-Oriented Design is not merely about cache misses but rather an approach that considers data type, frequency, quantity, shape, and probability.