From 8KiB to 64B Cache Lines
From Pokémon's toxic counter to modern ECS cache lines

(Pokémon Green, Venusaur — the poster child of Toxic stalling)
Helpful Background Knowledge
This post assumes a minimal familiarity with bitwise operations and memory structure. Keeping the following four concepts in mind should be enough to follow along.
1. Bits and Bytes
A bit is the smallest unit of information, capable of holding only one of two values: 0 or 1. Eight bits make one byte, and a single byte can represent 2⁸ = 256 distinct values, ranging from 0 to 255.
| Number of bits | Range of representable values | Practical examples |
|---|---|---|
| 1 bit | 0–1 | On/off, poisoned/not poisoned |
| 2 bits | 0–3 | 4 directions |
| 3 bits | 0–7 | Sleep turn count |
| 4 bits | 0–15 | Badly poisoned accumulation count |
| 8 bits | 0–255 | One character, a small counter |
The core idea of this post is: "how do we split this 8-bit or 32-bit space to pack in multiple pieces of information?" For example, the lower 3 bits hold the sleep turn count, the next 1 bit represents poison, and the bit after that represents burn, and so on.
2. Bit Masking and Bit Shifts
Once you've packed bits in, you need to read them back out. The two fundamental operations for doing that are masking and shifting.
Masking is an operation that isolates specific bits. A value with only the desired bits set to 1 is called a mask; when you & it with the original value, only the positions where the mask has 1s survive, and everything else is zeroed out.
For example, suppose a 32-bit value called status1 stores a badly poisoned counter in bits 8 through 11. The mask to extract just those four bits is 0xF00. The hex digit F is 1111 in binary, so 0xF00 is a value with only bits 8 through 11 set to 1.
status1 = 0x12345678;
mask = 0x00000F00;
counterBits = status1 & mask; // 0x00000600
This leaves only bits 8 through 11 from status1, with everything else set to 0.
However, the result 0x600 is not yet the actual number 6. The value "6" has been shifted 8 bits to the left. In other words, it is stored internally in the form 6 << 8 = 0x600.
So to use the result as an actual number, you need to shift it 8 bits to the right.
counter = (status1 & 0xF00) >> 8; // 6
The pattern that keeps appearing throughout this post is as follows (worth committing to memory):
value = (packed & mask) >> shift;
Conversely, when writing a value in, you clear the existing value, shift the new value to the left, and then combine them with the | operator.
status1 = (status1 & ~0xF00) | (newCounter << 8);
In short, when reading: & and >>; when writing: << and |. Those are the fundamentals.
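The read and write patterns can be wrapped in two small helpers. This is a minimal sketch in C; the names get_field and set_field are hypothetical and do not come from any Pokémon source. Unlike the one-liner in the text, set_field also masks the new value, on the assumption that you want an oversized value truncated rather than spilled into neighboring bits.

```c
#include <stdint.h>

/* Read a field: keep only the masked bits, then shift them down to bit 0. */
static inline uint32_t get_field(uint32_t packed, uint32_t mask, unsigned shift)
{
    return (packed & mask) >> shift;
}

/* Write a field: clear the old bits, shift the new value into place,
 * and mask it so it cannot touch bits outside the field. */
static inline uint32_t set_field(uint32_t packed, uint32_t mask, unsigned shift,
                                 uint32_t value)
{
    return (packed & ~mask) | ((value << shift) & mask);
}
```

With the values from the text, get_field(0x12345678, 0xF00, 8) yields 6, and set_field(0x12345678, 0xF00, 8, 9) yields 0x12345978.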
3. Structure Padding
In C, when a struct lays out its members in memory, it aligns them to positions that the CPU can access efficiently. This process introduces empty gaps between members, which are called padding.
Consider the following struct as an example.
struct Example {
char c; // 1 byte
int i; // 4 bytes
bool b; // 1 byte
};
On the surface, the data appears to occupy 1 + 4 + 1 = 6 bytes. In many environments, however, int must be aligned to a 4-byte boundary, so 3 bytes of padding are inserted after the first char, and additional padding may be appended at the end to bring the struct to the proper total size.
As a result, the actual struct size can end up being 12 bytes.
Actual data: 1 + 4 + 1 = 6 bytes
Padding: 3 + 3 = 6 bytes
Total: 12 bytes
A bitfield is a technique that packs multiple small flags and counters into a single word to eliminate this kind of waste.
To put it simply, think of it as stuffing a suitcase as tightly as possible so that not a single gap is left inside.
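If you want to check the padding on your own machine, a sketch like the following may help. struct Example is the struct from the text; struct Reordered and example_padding are hypothetical additions for comparison. Exact sizes depend on the compiler and ABI, so treat any specific number as an assumption about your platform.

```c
#include <stdbool.h>
#include <stddef.h>

/* The struct from the text: the int in the middle forces padding on both sides. */
struct Example {
    char c;
    int  i;
    bool b;
};

/* Same members, largest first, so the two small members sit next to each other. */
struct Reordered {
    int  i;
    char c;
    bool b;
};

/* Bytes of struct Example that hold no member data on this platform. */
static inline size_t example_padding(void)
{
    return sizeof(struct Example) - (sizeof(char) + sizeof(int) + sizeof(bool));
}
```

On a typical x86-64 or ARM64 ABI with a 4-byte int, sizeof(struct Example) is 12 and sizeof(struct Reordered) is 8. Reordering already recovers some of the waste; bit packing goes one step further.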
4. Cache Lines
When the CPU reads data from memory, it does not fetch just the single byte it needs. It typically retrieves a 64-byte chunk all at once, and that chunk is called a cache line.
Reading data that is not in the cache requires a round trip all the way to main memory, and that latency can grow to tens or even hundreds of cycles. By contrast, data already residing in the L1/L2 cache can be accessed much more quickly.
That is why it is important to keep data that is frequently used together within the same cache line. When data is scattered, multiple cache lines must be loaded, but when it is grouped in one place, a single cache line load brings in all the values you need at once.
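A bit of arithmetic makes the effect of record size concrete. The helper below is a hypothetical sketch that assumes a 64-byte cache line and an array that starts exactly on a line boundary; it simply counts how many lines a contiguous array spans.

```c
#include <stddef.h>

#define CACHE_LINE_SIZE 64u  /* assumed line size; common, but not universal */

/* Number of cache lines spanned by `count` contiguous records of
 * `record_size` bytes each, starting on a cache-line boundary. */
static inline size_t lines_touched(size_t count, size_t record_size)
{
    return (count * record_size + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE;
}
```

Under those assumptions, 1,000 records of 12 bytes span 188 cache lines, while 1,000 records of 1 byte span 16.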
For our purposes, it is enough to think of a cache line as the fixed-size chunk, typically 64 bytes, that travels between main memory and the cache as a unit: touch one byte, and its neighbors come along with it. That understanding is all you need to follow the discussion below.
Introduction: Why shave bits when memory is abundant?
"I have 32 gigabytes of RAM, so why on earth am I cramming these boolean values into a bitfield?"
It is a question any modern software engineer has probably asked at least once. You could just use a comfortable full byte and not worry about padding leaking space, yet game engines and ECS libraries still pack small values tightly into a single word.
You might think: "In the age of AI, memory prices are going through the roof, so you have to save every byte you can, right?" Memory prices really have skyrocketed in the AI era, and we have reached the point where upgrading a personal machine is starting to strain the budget.
But one might counter: "Hardware keeps improving, so how long can a retro technique like this still be necessary? Caches are growing, RAM is getting faster. Won't it all just sort itself out eventually?"
The real answer, however, lies somewhere else entirely. The reason so many open-source projects still shave bits has nothing to do with RAM capacity or hardware going backward. It connects to a more fundamental and much older question.
To answer that question, let us travel back in time to the place where the need is most plainly visible.
1996: the Game Boy's 8KiB of RAM.
Pokémon is the clearest example. The sleep-turn counter occupied 3 bits, while poison, burn, freeze, and paralysis each got 1 bit, all crammed into a single byte. Badly poisoned (toxic) needed a separate counter to track accumulated damage that grew each turn, and since there was no room for it inside the one-byte status field, it was split off into dedicated battle memory.
Then Ruby and Sapphire arrived on the Game Boy Advance in 2002. RAM had grown 32-fold to 256KiB, and the STATUS1 field expanded to 32 bits. Even so, the toxic counter was never promoted to a standalone integer field.
That raises a question: the capacity problem was solved, so why not fix it?
Returning to the present raises an even bigger question: even now, 30 years later, game engines still split values down to individual bits.
This post chases that question. Starting from the 8KiB of 1996, let us trace the path to the 64-byte cache line of 2026 in order of space, design, and bandwidth.
1. The Constraint of Space: the Game Boy's 1 Byte (1996)
The Game Boy's working RAM, which ran the first generation of Pokémon, was 8KiB.
8,192 bytes.
Today a plain-text copy of this very article runs to tens of kilobytes; the Game Boy had to hold its entire running game state in a fraction of that.
The species values, individual values, moves, HP, and status conditions for a party of six; the full state of the two monsters currently in battle; and even the display buffer: all of it had to fit within 8KiB. A single byte was all that was allocated for one monster's status conditions. Eight bits, nothing more.
So let's look at how programmers back then actually pulled it off. The code below is taken from the following Git repository.
Looking at constants/battle_constants.asm in pret's Game Boy disassembly pokered, you can see the brilliance of those early programmers: that single byte is divided up like this.

DEF SLP_MASK EQU %111 ; bits 0-2, sleep turn count
const PSN ; bit 3, poison
const BRN ; bit 4, burn
const FRZ ; bit 5, freeze
const PAR ; bit 6, paralysis
Sharp-eyed readers will notice that bits 0 through 2 are different from the rest.
Poison, burn, freeze, and paralysis are straightforward. You only need to know whether each condition is active or not, so one bit per condition is enough. A single bit can only express 0 or 1, making it perfect for states like "poisoned or not" and "burned or not."
Sleep is different, though. Sleep isn't simply "asleep"; it needs to store "how many turns are left to sleep." In other words, it requires a number rather than just on/off. To represent up to 7 turns, you need to be able to hold 0 through 7, and 0 through 7 is expressed with exactly 3 bits.
1 bit = 0–1
2 bits = 0–3
3 bits = 0–7
So in the first generation's status condition byte, bits 0 through 2 were assigned to the sleep turn counter. The bits that follow each hold one flag for poison, burn, freeze, and paralysis.
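For readers more comfortable with C than with Game Boy assembly, the same byte layout can be sketched as follows. The bit positions follow the constants quoted from pokered; the C names and helper functions are hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions mirror the assembly constants quoted from pokered. */
enum {
    SLP_MASK = 0x07,    /* bits 0-2: sleep turn count */
    PSN_BIT  = 1 << 3,  /* poison */
    BRN_BIT  = 1 << 4,  /* burn */
    FRZ_BIT  = 1 << 5,  /* freeze */
    PAR_BIT  = 1 << 6   /* paralysis */
};

/* The sleep counter sits at bit 0, so a mask alone recovers it. */
static inline unsigned sleep_turns(uint8_t status)
{
    return status & SLP_MASK;
}

static inline bool is_poisoned(uint8_t status)
{
    return (status & PSN_BIT) != 0;
}

/* Replace the sleep counter while leaving the four flags untouched. */
static inline uint8_t set_sleep(uint8_t status, unsigned turns)
{
    return (uint8_t)((status & ~SLP_MASK) | (turns & SLP_MASK));
}
```

For example, set_sleep(PSN_BIT, 5) produces 0x0D: the poison flag in bit 3 plus a sleep counter of 5 in bits 0 through 2.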
In the end, the range of values to be represented determines the bit width: a true/false condition like poison needs just 1 bit, while a numeric range like sleep's 0 through 7 requires 3 bits. But readers who know Pokémon well will say:

"What about badly poisoned (toxic)?"
Badly poisoned is, at its core, poison, but it's a variant whose damage scales up with each passing turn.
Regular poison deals fixed damage every turn, so the game only needs to know "is the target poisoned?", and the single PSN bit suffices. Badly poisoned also has to answer "which toxic turn are we on?" That makes it more than a simple 1-bit state: it needs a flag and a counter together.
The base status condition byte in the first generation already held 3 bits for sleep and flags for poison, burn, freeze, and paralysis. There was no room to squeeze in a badly poisoned counter as well. So instead of forcing it into the persistent status byte, badly poisoned was separated out into a dedicated flag and counter used only during battle.
The actual code is structured the same way. BADLY_POISONED is not part of the base status condition byte; it is defined as a bit flag in wPlayerBattleStatus3 or wEnemyBattleStatus3.

; wPlayerBattleStatus3 or wEnemyBattleStatus3 bit flags
const_def
const BADLY_POISONED ; 0 ; Toxic
const HAS_LIGHT_SCREEN_UP ; 1
const HAS_REFLECT_UP ; 2
const TRANSFORMED ; 3
This bit indicates "is this poison currently badly poisoned?" But even that is not enough on its own. Because badly poisoned must deal increasing damage each turn, a separate counter is needed to track which badly poisoned turn we are on.
That counter appears in the battle damage handling code in pokered/engine/battle/core.asm. On the player's side it uses wPlayerToxicCounter, and on the opponent's side it uses wEnemyToxicCounter.

.nonZeroDamage
ld hl, wPlayerBattleStatus3
ld de, wPlayerToxicCounter
ldh a, [hWhoseTurn]
and a
jr z, .playersTurn
ld hl, wEnemyBattleStatus3
ld de, wEnemyToxicCounter
.playersTurn
bit BADLY_POISONED, [hl]
jr z, .noToxic
ld a, [de] ; increment toxic counter
inc a
ld [de], a
ld hl, 0
Looking at them in order makes things clearer. Let's lay it out for easy reading.
1. PSN bit - in the persistent status condition byte - marks "the target is poisoned"
2. BADLY_POISONED bit - in BattleStatus3 - marks "this poison is badly poisoned"
3. ToxicCounter - wPlayerToxicCounter / wEnemyToxicCounter - stores "which badly poisoned turn are we on"
First, the PSN bit in the base status condition byte marks the fact that the target is poisoned. Then the BADLY_POISONED bit in BattleStatus3 marks that the poison is not ordinary poison but badly poisoned. Finally, wPlayerToxicCounter or wEnemyToxicCounter counts which badly poisoned turn it currently is.
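The three pieces of state can also be sketched in C. The struct and function below are hypothetical; they only mirror the part of the quoted assembly that tests the BADLY_POISONED bit and increments the counter, and they are not a translation of the full damage routine.

```c
#include <stdint.h>

#define BADLY_POISONED 0  /* bit index within BattleStatus3, as in the listing */

/* Hypothetical C view of the poison-related state for one side of a battle. */
struct BattlerPoisonState {
    uint8_t status;          /* persistent status byte (PSN lives in bit 3) */
    uint8_t battle_status3;  /* battle-only flags, including BADLY_POISONED */
    uint8_t toxic_counter;   /* battle-only toxic turn counter */
};

/* If the BADLY_POISONED bit is set, increment the toxic counter;
 * otherwise leave it alone. Returns the counter after the update. */
static inline uint8_t tick_toxic(struct BattlerPoisonState *b)
{
    if (b->battle_status3 & (1u << BADLY_POISONED)) {
        b->toxic_counter++;
    }
    return b->toxic_counter;
}
```

Note how one logical condition is spread over three separate bytes, which is exactly the arrangement the list above describes.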
What an incredibly tedious arrangement. It genuinely makes you grateful not to have been a programmer living in that era, and deeply impressed by the developers who were.
At this stage, bit packing needed no grand justification. The Game Boy's work RAM was just 8 KiB. Even representing a single state required constantly asking "how many bits are actually enough for this value?" Bit packing back then was not a stylistic choice; it was survival.
2. The Constraint of Design: Why Bits Were Still Sliced Even After 32-Bit Arrived
Pokémon then moved from the Game Boy (GB) to the Game Boy Color (GBC). The second-generation Gold and Silver had 32 KiB of work RAM, four times the 8 KiB of the original Game Boy.
Yet the base status condition byte barely changed. Sleep still occupied the lower 3 bits, and poison, burn, freeze, and paralysis each took 1 bit. Badly poisoned still did not move into this byte either. Whether a Pokémon has regular poison is carried by the PSN bit, whether that poison is badly poisoned is carried by the SUBSTATUS_TOXIC bit in SubStatus5, and the accumulated turn count is tracked by separate counters named wPlayerToxicCount and wEnemyToxicCount.
; wPlayerSubStatus5 or wEnemySubStatus5 bit flags
const_def
const SUBSTATUS_TOXIC
const_skip
const_skip
const SUBSTATUS_TRANSFORMED
const SUBSTATUS_ENCORED
const SUBSTATUS_LOCK_ON
const SUBSTATUS_DESTINY_BOND
const SUBSTATUS_CANT_RUN
The listing above, from the actual second-generation Pokémon code, shows this directly.
The important question here is: "Why does it still work this way even though RAM quadrupled?" The Japanese Pokémon developers have never publicly explained this, so the details are unknown, but my personal guess is that the engine code was inherited directly from the first generation. On top of that, the second-generation games like Gold and Silver had to support the Time Capsule feature for trading with first-generation games like Red, Green, and Blue, so keeping data formats as similar as possible would have made that implementation much easier.
Pokémon introduced in Generation II, or those that knew Generation II moves, could not be sent back, since only data the Generation I games could understand was allowed through.
Once a data layout gets locked in like that, it effectively becomes a hardcoded API specification. The moment you change the layout, data crossing from Generation I into Generation II corrupts the system. This is the "legacy swamp" we encounter in production work today, the moment a past constraint becomes a future contract. And then, in 2002, Ruby and Sapphire were released on the GBA.
Looking at pret's pokeruby include/constants/battle.h, the status condition word is defined as follows.

// Non-volatile status conditions
// These persist remain outside of battle and after switching out
#define STATUS1_NONE 0
#define STATUS1_SLEEP (1 << 0 | 1 << 1 | 1 << 2) // Sleep turns remaining
#define STATUS1_SLEEP_TURN(num) ((num) << 0) // Just for readability (or if rearranging statuses)
#define STATUS1_POISON (1 << 3)
#define STATUS1_BURN (1 << 4)
#define STATUS1_FREEZE (1 << 5)
#define STATUS1_PARALYSIS (1 << 6)
#define STATUS1_TOXIC_POISON (1 << 7)
#define STATUS1_TOXIC_COUNTER (1 << 8 | 1 << 9 | 1 << 10 | 1 << 11)
#define STATUS1_TOXIC_TURN(num) ((num) << 8)
The GBA's STATUS1 is 32 bits wide, yet its lower bit layout is carried over almost unchanged from the Game Boy's 1-byte status byte.
Bits 0–2 hold the sleep turn counter, and bits 3–6 represent poison, burn, freeze, and paralysis. What changed is that bit 7 now holds the badly poisoned (toxic) flag, and bits 8–11 hold the toxic accumulation counter.
But this status1 field is a u32. Only bits 0–11 are actually used, meaning just 12 bits, while the upper 20 bits sit empty. The GBA's EWRAM is 256KiB, 32 times larger than the original Game Boy's 8KiB. At this point, the packing can no longer be explained simply by "we need to save every last byte." There is enough room to leave 20 bits idle, yet status conditions are still broken down at the bit level.
This is where the character of bit-packing changes. If the original Pokémon's packing was a survival technique for cramming data into 8KiB of RAM, then Ruby and Sapphire's packing is closer to a layout strategy for managing all status state within a single word. Sleep turn count, poison, burn, freeze, paralysis, toxic flag, and toxic accumulation count all live inside one word called STATUS1.
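To see the single-word layout in action, here is a small sketch that builds and decodes a STATUS1 value. The macros are copied from the listing quoted earlier; toxic_turn and STATUS1_USED_BITS are hypothetical names added for this example.

```c
#include <stdint.h>

/* Macros as quoted from the battle constants header. */
#define STATUS1_SLEEP           (1 << 0 | 1 << 1 | 1 << 2)
#define STATUS1_POISON          (1 << 3)
#define STATUS1_TOXIC_POISON    (1 << 7)
#define STATUS1_TOXIC_COUNTER   (1 << 8 | 1 << 9 | 1 << 10 | 1 << 11)
#define STATUS1_TOXIC_TURN(num) ((num) << 8)

/* Hypothetical: the 12 low bits that the layout actually uses. */
#define STATUS1_USED_BITS 0xFFFu

/* Hypothetical helper: recover the toxic turn count from bits 8-11. */
static inline uint32_t toxic_turn(uint32_t status1)
{
    return (status1 & STATUS1_TOXIC_COUNTER) >> 8;
}
```

A Pokémon on its third toxic turn would carry STATUS1_TOXIC_POISON | STATUS1_TOXIC_TURN(3), which is 0x380. Everything above bit 11 stays zero.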
Looking at the code that actually processes toxic damage makes this intention even clearer. In Ruby and Sapphire's end-of-turn logic, toxic damage is calculated like this. Stripping it down to the essentials rather than showing the full original:

case ENDTURN_BAD_POISON: // toxic poison
if ((gBattleMons[gActiveBattler].status1 & STATUS1_TOXIC_POISON)
&& gBattleMons[gActiveBattler].hp != 0)
{
gBattleMoveDamage = gBattleMons[gActiveBattler].maxHP / 16;
if (gBattleMoveDamage == 0)
gBattleMoveDamage = 1;
if ((gBattleMons[gActiveBattler].status1 & STATUS1_TOXIC_COUNTER)
!= STATUS1_TOXIC_TURN(15)) // not 16 turns
gBattleMons[gActiveBattler].status1 += STATUS1_TOXIC_TURN(1);
gBattleMoveDamage *=
(gBattleMons[gActiveBattler].status1 & STATUS1_TOXIC_COUNTER) >> 8;
BattleScriptExecute(BattleScript_PoisonTurnDmg);
effect++;
}
gBattleStruct->turnEffectsTracker++;
break;
The line gBattleMoveDamage = gBattleMons[gActiveBattler].maxHP / 16; sets the base toxic damage to 1/16 of the Pokémon's max HP. The code then checks STATUS1_TOXIC_COUNTER, a mask that isolates only bits 8–11. If the value is not STATUS1_TOXIC_TURN(15), meaning the counter has not yet reached 15, it adds STATUS1_TOXIC_TURN(1). Since STATUS1_TOXIC_TURN(n) expands to n << 8, this effectively increments the small number stored in bits 8–11 by one.
The last line is the heart of this layout.
(gBattleMons[gActiveBattler].status1 & STATUS1_TOXIC_COUNTER) >> 8 first uses & STATUS1_TOXIC_COUNTER to extract only bits 8–11. However, that value is still shifted eight positions to the left, so >> 8 pulls it back to the right to convert it into the actual turn count. That value is then multiplied by 1/16 of the max HP. The result is 1/16 on turn 1, 2/16 on turn 2, 3/16 on turn 3, and so on. This process lets you infer why exactly 4 bits were chosen.
Why 4 bits specifically? Toxic's accumulating damage cannot grow without bound, and 4 bits are sufficient because the range 0 to 15 covers everything needed. Without any HP recovery, cumulative damage reaches (1+2+3+4+5)/16 = 15/16 of max HP after five turns and 21/16 after six, so the Pokémon faints on the sixth turn. Even accounting for typical recovery moves, 15 turns is more than enough. The code also saturates the counter at 15, preventing it from ever growing large enough to overflow into the upper bits.
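The sixth-turn claim is easy to check with a short simulation. The functions below are a hypothetical sketch that follows the quoted end-of-turn logic (base damage of maxHP / 16 with a floor of 1, multiplied by a counter that saturates at 15) and assumes no healing and a start at full HP.

```c
#include <stdint.h>

/* Per-turn toxic damage: base is max_hp / 16 (at least 1), times the counter. */
static inline uint32_t toxic_damage(uint32_t max_hp, uint32_t counter)
{
    uint32_t base = max_hp / 16;
    if (base == 0) {
        base = 1;
    }
    return base * counter;
}

/* Counter update: add one unless already saturated at 15. */
static inline uint32_t next_counter(uint32_t counter)
{
    return counter < 15 ? counter + 1 : 15;
}

/* Number of end-of-turn ticks until HP reaches zero, with no healing. */
static inline unsigned turns_to_faint(uint32_t max_hp)
{
    uint32_t hp = max_hp;
    uint32_t counter = 0;
    unsigned turns = 0;
    while (hp > 0) {
        counter = next_counter(counter);
        uint32_t dmg = toxic_damage(max_hp, counter);
        hp = dmg >= hp ? 0 : hp - dmg;
        turns++;
    }
    return turns;
}
```

For a max HP of 160, the per-turn damage is 10, 20, 30, 40, 50, 60, and turns_to_faint(160) returns 6. Because of the integer division, very small HP values that are not multiples of 16 can take a turn or two longer.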
Ultimately, the width of a bit field is determined by the domain range of the values it must hold. True or false needs only 1 bit; a number from 0 to 15 needs exactly 4 bits. Four bits is precisely the range needed to poison a Pokémon and send it to the Pokémon Center.
The constraint at this stage, then, is not a spatial one. It is a layout constraint. A bit arrangement that once began out of RAM scarcity survived into the era of plentiful memory as the very way status conditions are expressed and accessed.
Once a layout gets embedded in a serialization format or communication protocol, it is not easy to change. Moving a single bit position breaks save-file compatibility and link-trade compatibility alike. A layout forced into existence by 8KiB of RAM lived on as both technical debt and a binding contract even after the original constraint was gone. In practice, we often find ourselves using inherited conventions not because we cannot improve them, but simply because they work well enough. Some would call this the constraint of convention; personally, I would call it the constraint of design.
3. The Bandwidth Constraint: Packing for the Cache Line (Modern Day)
What is the third constraint? Memory is plentiful now. Instead, the speed gap between the CPU and memory has become the bottleneck, and the cache is what bridges that gap.
The CPU reads memory in 64-byte cache line units. When the data it needs isn't in the cache (a cache miss), it has to reach all the way out to main memory, and that costs hundreds of cycles. Ulrich Drepper's What Every Programmer Should Know About Memory is the canonical reference for how cache lines work and why data layout determines performance. To put it briefly: the denser the data, the more useful information fits into a single cache line load. Conversely, when data is spread thin, a cache line fills up with wasted space, and the CPU is forced to make repeated round trips to main memory, burning hundreds of cycles each time. Bit packing is fundamentally about how much valid data you can deliver to the CPU in a single memory fetch.
It would be nice to illustrate this with Pokémon as well, but Switch-era games (Sword/Shield, Scarlet/Violet) have no official decompilation, so there is no way to tell whether status conditions are still stored as bitfield words or unpacked into ordinary fields. Instead, let's look at how bit packing has evolved through the lens of modern game engines.
Data-Oriented Design (DOD) and ECS in games and simulations are driven by exactly the same motivation.
The modern C++ ECS library EnTT doesn't split an entity identifier into two separate fields; it packs both the id and the version together as bits inside a single integer.

[[nodiscard]] static constexpr value_type construct(
const entity_type entity,
const version_type version
) noexcept
{
if constexpr(Traits::version_mask == 0u) {
return value_type{entity & entity_mask};
} else {
return value_type{
(entity & entity_mask)
| (static_cast<entity_type>(version & version_mask) << length)
};
}
}
This code uses the same technique as Pokémon's STATUS1_TOXIC_COUNTER. It isolates the id region with entity & entity_mask, then shifts version & version_mask left by length bits to place it in the upper bits. Extraction works the same way in reverse: the id is pulled out with a mask, and the version is recovered with a shift followed by a mask.
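Stripped of templates, the same idea fits in a few lines of C. This is not EnTT's API; the function names are hypothetical, and the split of 20 id bits and 12 version bits in a 32-bit word is an assumption chosen for the example.

```c
#include <stdint.h>

/* Assumed split for illustration: 20 bits of id, 12 bits of version. */
#define ENTITY_BITS  20u
#define ENTITY_MASK  0xFFFFFu
#define VERSION_MASK 0xFFFu

/* Pack: id in the low bits, version shifted into the high bits. */
static inline uint32_t make_entity(uint32_t id, uint32_t version)
{
    return (id & ENTITY_MASK) | ((version & VERSION_MASK) << ENTITY_BITS);
}

/* Unpack the id with a mask alone. */
static inline uint32_t entity_id(uint32_t e)
{
    return e & ENTITY_MASK;
}

/* Unpack the version with a shift followed by a mask. */
static inline uint32_t entity_version(uint32_t e)
{
    return (e >> ENTITY_BITS) & VERSION_MASK;
}
```

make_entity(42, 3) gives 0x0030002A, from which entity_id recovers 42 and entity_version recovers 3.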
If we scatter fields like isPoisoned and toxicTurnCount across objects in the usual object-oriented style, every access can trigger a cache miss. Packing multiple status flags and a counter into a single word lets you determine state with one cache line load and a bit masking operation, and it also opens the door to parallel processing strategies like SIMD. As Eric Raymond points out in The Lost Art of Structure Packing, gathering booleans into individual bits adds a small access cost, but that cost is swallowed up by the reduction in cache misses. Mike Acton's CppCon 2014 talk Data-Oriented Design and C++ demonstrates just how dramatically data layout can shift performance in a game engine.
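As a rough illustration of the dense traversal described here, the sketch below counts badly poisoned entries in a contiguous array of status words. The function name is hypothetical, and the flag position follows the STATUS1_TOXIC_POISON definition quoted earlier. With 4-byte words, sixteen entries share each 64-byte cache line, assuming that line size.

```c
#include <stddef.h>
#include <stdint.h>

#define STATUS1_TOXIC_POISON (1u << 7)

/* Count badly poisoned entries in a dense array of status words.
 * Each loop iteration is one mask and one add, with no pointer chasing. */
static inline size_t count_toxic(const uint32_t *status, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        count += (status[i] & STATUS1_TOXIC_POISON) != 0;
    }
    return count;
}
```

A loop of this shape is also the kind that compilers can often vectorize, though whether they do depends on the compiler and flags.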
To summarize: the 1996 Game Boy shaved bits to fit data into 8KiB of RAM. Then the 2002 Ruby/Sapphire folded that status representation into a 32-bit STATUS1 word. Modern game engines and ECS frameworks shave bits again, this time to pack more useful information into a single cache line.
The technique is the same: masks, shifts, and small fields. When you look at programming techniques, breaking things down to their smallest unit always makes them look simple. But actually decomposing them to that level and modernizing the approach is never easy.
Pokémon's status condition bitfield and EnTT's entity identifier come from entirely different eras and address entirely different problems, yet at the layer closest to machine code they share the same grammar: pack small values into a single word, then extract them with masks and shifts. Not because memory is scarce, but because the way data is laid out determines the cost of executing on it. Hearing it put that way, you might conclude that the old-school developer's approach is always right, but that isn't quite true.

(The meme about old-school developers being gods and modern developers being clueless)
That's because in modern multi-core environments, "density" can actually work against you.
The CPU cache manages data in 64-byte units called cache lines. What happens if multiple threads are running simultaneously, and the fields each core is modifying happen to share the same cache line? Even though those fields are logically independent, the moment one core writes a value, the copy of that entire cache line held by every other core is invalidated, and those cores must fetch it again before they can touch their own field. This is the well-known "False Sharing" phenomenon.
Bit packing delivers overwhelming performance in read-heavy, single-threaded traversal by increasing data density, but in a multi-threaded environment the opposite strategy is required. There, you need to unpack the data instead, isolating hot fields onto separate cache lines entirely using specifiers like alignas(64).
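The layout difference can be sketched in C11, where alignas comes from <stdalign.h>. Both structs are hypothetical, and the 64-byte line size is an assumption about the target CPU. The sketch only shows where the fields land in memory; it does not measure contention.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed cache line size */

/* Dense layout: the two counters are adjacent, so two threads updating
 * them will usually be writing to the same cache line. */
struct SharedCounters {
    uint64_t a;
    uint64_t b;
};

/* Isolated layout: each counter starts on its own 64-byte boundary,
 * at the cost of 112 bytes of padding. */
struct IsolatedCounters {
    alignas(CACHE_LINE) uint64_t a;
    alignas(CACHE_LINE) uint64_t b;
};
```

Here sizeof(struct SharedCounters) is 16, while sizeof(struct IsolatedCounters) is 128 with the two fields exactly 64 bytes apart. That is the opposite trade from bit packing: memory is spent deliberately to keep writers from colliding.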
Ultimately, the choice between "packing bits tightly to maximize cache hit rate" and "isolating data at cache-line granularity to avoid thread contention" is a nuanced architectural judgment that depends entirely on the access pattern through which your data flows.
Conclusion
It started with space. To represent sleep turn counts and status conditions within the Game Boy's 8KiB of RAM, every single bit inside a byte had to be counted. Poison needed just 1 bit, but sleep required a turn counter, so it took 3 bits. Badly poisoned status couldn't be captured by a simple poison flag alone; it needed a separate counter to track accumulated turns. Bit-packing at this stage was a fight for survival.
Then it became about layout. Moving to Gold/Silver, RAM expanded, but the core status condition byte retained almost exactly the same form as in the original generation. In Ruby/Sapphire, STATUS1 grew to 32 bits, and the badly poisoned flag along with its counter folded into that same field. Even with the upper 20 bits sitting empty, status flags were still laid out at the bit level. By this point, the packing was no longer purely about saving space. You could call it inertia, but in the sense that it reflects how code understands state, it was really a question of layout.
Today, it is about bandwidth. Memory has grown, but the distance between the CPU and memory is still costly. Modern game engines and Entity Component Systems iterate over enormous numbers of entities every frame, and how densely data fits within a cache line determines performance. EnTT's choice to pack an entity's id and version into a single integer is a judgment of exactly the same kind: place small values together in one word, then extract them with masks and shifts.
That said, shaving bits because of constraints that no longer exist is cargo-culting. If the cache is your bottleneck yet your data is spread loosely across memory, you are bleeding performance. Conversely, if multiple threads are contending over the same cache line and you pack even more tightly, you can destroy yourself with false sharing.
Ultimately, the design must vary with the constraints. I always wonder what a programmer's mindset really is, and writing this post led me to a conclusion of my own (well, I had to finish the article somehow). To me, a programmer's mindset is this: figuring out what constraints you have been given and then struggling to find the best solution within that environment.
Perhaps that is what it means to think like a programmer.
References
Eric S. Raymond, "The Lost Art of Structure Packing" (catb.org/esr/structure-packing) — the de facto standard reference on alignment, padding, and bitfield packing.
Ulrich Drepper, "What Every Programmer Should Know About Memory" (LWN, 2007; lwn.net/Articles/250967) — a classic work on how cache lines operate and why data layout determines performance. The basis for the cache-line arguments in Section 3.
Mike Acton, "Data-Oriented Design and C++", CppCon 2014 (youtube.com/watch?v=rX0ItVEVjHc) — a Data-Oriented Design (DOD) talk on structuring data to fit cache lines and extract maximum speed from game engines.
Pokémon code references: pret's GitHub