Processors use system RAM to temporarily store data for processing, but in addition to that storage there are also CPU caches, which hold tiny subsets of instructions or data much closer to the CPU, or even to a single core, to further minimize time spent waiting for data to become available. System RAM is much faster than storage disks, but CPU caches are much faster still than system memory.
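For a rough sense of scale (these are order-of-magnitude figures that vary by hardware): an L1 cache hit costs on the order of a nanosecond, a main-memory access on the order of 100ns, and a read from a fast SSD on the order of 100µs, each layer orders of magnitude slower than the one above it.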
L1 cache
The L1 cache is both the smallest and fastest form of CPU caching, private to each CPU core. If a system is advertised as having a "32KB L1 cache", that figure is per core, so the chip-wide total is much larger (although each core can only use its own share). L1 caches are typically divided into separate sections for instruction and data storage to further optimize lookup times, and respond within nanoseconds. A tiny size of only a few dozen KB (currently around 32-64KB per core) means addresses are short, physical footprint is minimal and heat output negligible, so the cache can be placed extremely close to each core, reducing the physical distance between the actual computing unit and its data storage. It runs almost as fast as the core can process instructions, making the execution of an L1-cached instruction practically free of memory-access overhead.
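The latency gap is easy to feel with a pointer-chasing microbenchmark. The sketch below is a rough illustration rather than a rigorous benchmark: it assumes POSIX clock_gettime, a typical 32KB L1 data cache, and arbitrary 16KB/64MB working-set sizes and step counts. It follows a chain of random indices, so each load must wait for the previous one to finish and the CPU cannot prefetch ahead.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Pointer-chase benchmark: each element holds the index of the next one,
   in random order, so every step pays the full latency of whichever
   cache level the array fits in. */
static double chase_ns_per_step(size_t n, size_t steps) {
    size_t *next = malloc(n * sizeof(size_t));
    if (!next) exit(1);

    /* Sattolo's algorithm: build a single random cycle over 0..n-1. */
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j in [0, i) */
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    struct timespec t0, t1;
    size_t pos = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++)
        pos = next[pos];                 /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (pos == (size_t)-1) printf("unreachable\n");  /* use pos so the loop isn't deleted */
    free(next);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return secs / (double)steps * 1e9;
}

int main(void) {
    /* Assumed sizes: 16KB fits in a typical 32KB L1d cache,
       64MB exceeds most L3 caches and spills into RAM. */
    printf("16KB working set: %.2f ns/step\n",
           chase_ns_per_step(16 * 1024 / sizeof(size_t), 100000000));
    printf("64MB working set: %.2f ns/step\n",
           chase_ns_per_step(64 * 1024 * 1024 / sizeof(size_t), 10000000));
    return 0;
}
```

On a typical machine the small working set lands around a nanosecond per step and the large one closer to a hundred, but exact numbers vary widely between CPUs.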
L2 cache
L2 caches vary between CPU designs. In some, they simply serve as a larger fallback for the L1 cache, while more sophisticated processors use the L2 as a shared cache for a group of cores (typically 2-4). This cache level usually does not separate instructions from data, simply holding whatever is accessed most frequently. A size of several hundred KB up to a few MB, combined with the larger physical distance to the cores using it, makes it quite a bit slower than L1, but still much faster than L3 (or RAM). It is a tradeoff between storage capacity and speed, and it can significantly speed up some multi-threaded applications (albeit unreliably: the benefit depends on whether the application's threads run on cores that share the same L2 cache group).
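On Linux, whether and how a cache is shared can be read out of sysfs. A minimal sketch, assuming the common layout where cpu0's index2 directory describes its L2 (the index numbering varies by machine, so real code should enumerate the index* directories and check each level file):

```c
#include <stdio.h>

/* Print the level, size and sharing set of one cache via Linux sysfs.
   The cpu0/index2 path is an assumption, not universal. */
static void show(const char *file) {
    char path[256], buf[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu0/cache/index2/%s", file);
    FILE *f = fopen(path, "r");
    if (f && fgets(buf, sizeof buf, f))
        printf("%-16s %s", file, buf);   /* sysfs values end with '\n' */
    if (f) fclose(f);
}

int main(void) {
    show("level");            /* e.g. "2" */
    show("size");             /* e.g. "1024K" */
    show("shared_cpu_list");  /* e.g. "0-1": logical CPUs sharing this cache */
    return 0;
}
```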
L3 cache
The slowest form of memory cache is used to store instructions and data accessible to all cores, which further increases its physical distance to each individual core and its address sizes. It is the largest of the three cache layers, with storage capacity exceeding 100MB on top-tier CPUs. While it is much slower than the L1 or L2 caches, it is still significantly faster than system RAM, making it a good tradeoff for storing data that is frequently accessed by multiple CPU cores and reliably boosting the speed of multithreaded applications.
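To see all three cache sizes on a concrete machine, glibc exposes them through sysconf(). Note this is a glibc extension rather than portable POSIX, and values may come back as 0 or -1 where the kernel does not report them:

```c
#include <stdio.h>
#include <unistd.h>

/* Query per-level cache sizes via sysconf(). The _SC_LEVEL* constants
   are a glibc extension and may not exist on other libcs. */
int main(void) {
    printf("L1 data cache: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1 insn cache: %ld bytes\n", sysconf(_SC_LEVEL1_ICACHE_SIZE));
    printf("L2 cache:      %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 cache:      %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}
```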
How data is cached
When it comes to caching for CPUs, there are two kinds of information one may want to cache:
- Data. Obviously, some data is accessed repeatedly, so keeping it in a quickly accessible cache layer increases performance. This is what most people think of when talking about caching.
- Instructions. Less widely known is the fact that instructions in modern CPU instruction sets like x86 can be quite complex to decode into the physical operations the core actually executes. Caching which x86 instruction decodes into which physical operations can dramatically accelerate CPU performance without needing higher clock speeds or more expensive hardware.
Caching is done with mostly proprietary caching algorithms, often based loosely on the principles of the LRU (least recently used) algorithm. This logic is then expanded with strategies that exploit spatial locality ("the data next to what I am fetching is likely needed next") and temporal locality ("data I have fetched before may be needed again soon"). Intelligent design and implementation of caching mechanisms is what drives the reliably fast operation of modern processors.
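Spatial locality is easy to demonstrate. The sketch below sums the same matrix twice: once row by row (consecutive addresses, so every byte of each fetched cache line gets used) and once column by column (each access jumps a full row ahead, touching a new cache line almost every time). The 4096x4096 size is an arbitrary choice that exceeds typical cache capacities, and it assumes POSIX clock_gettime; the row-wise pass is usually several times faster.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096  /* assumed size: 4096*4096 ints = 64MB, larger than any cache */

static double seconds(void) {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + t.tv_nsec / 1e9;
}

int main(void) {
    int *m = malloc((size_t)N * N * sizeof(int));
    if (!m) return 1;
    for (size_t i = 0; i < (size_t)N * N; i++) m[i] = 1;

    volatile long sum = 0;  /* volatile: keep the loops from being optimized away */

    /* Row-major traversal: consecutive addresses, cache lines fully used. */
    double t0 = seconds();
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            sum += m[(size_t)r * N + c];
    printf("row-major:    %.3f s\n", seconds() - t0);

    /* Column-major traversal: each access jumps N*4 bytes ahead, so
       nearly every access pulls in a new cache line. */
    t0 = seconds();
    for (int c = 0; c < N; c++)
        for (int r = 0; r < N; r++)
            sum += m[(size_t)r * N + c];
    printf("column-major: %.3f s\n", seconds() - t0);

    free(m);
    return 0;
}
```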
Note: This article provides a rough overview of what CPU caches are, what information is cached and what roles they serve. Processor caching is an extremely complex topic, and the design and manufacturing of modern CPUs is nowhere near as simple as this article's descriptions suggest.