A colleague at the IBM Almaden Research Center once remarked that half of computer science is caching. He said it lightly, but it was one of those lines that stays with you because it’s both funny and true.
Caching is the art of pretending the world is faster than it really is. At every level of computing, we cache: to keep what we’ll soon need close at hand, to hide latency, to make slow things seem fast. The details change from decade to decade, but the principle is constant.
In the mainframe and time-sharing era, computing was centralized. Users sat at dumb terminals connected to a distant machine that did everything. The system’s success depended on how cleverly it could fake responsiveness. Process schedulers kept the most active jobs in memory, disk controllers cached frequently used blocks, and even early CPUs used tiny associative memories to remember recent addresses. Caching was the lubricant that kept an overworked central system usable.
When personal computers arrived, the pendulum swung toward locality. Computing moved onto our desks. Processor cycles were suddenly cheap; communication bandwidth was not. Performance came from keeping data close: multi-level caches, virtual memory, prefetching, and all the other tricks that made local machines feel instantaneous. Even applications joined in: spreadsheets avoided recomputing entire workbooks, image editors cached intermediate results, and compilers memoized parsed code. The whole industry became an experiment in how much could be done without waiting for anyone else’s machine.
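To make that idea concrete, here is a minimal memoization sketch in Python. The `recompute_cell` function and its tiny formula language are hypothetical stand-ins for a spreadsheet engine or a compiler's parse step, not any real product's internals: once a result has been computed for a given set of inputs, asking again costs almost nothing.

```python
from functools import lru_cache

# A hypothetical "expensive" recomputation, standing in for a spreadsheet
# cell formula or a compiler's parse step. The result depends only on its
# inputs, so it is safe to memoize.
@lru_cache(maxsize=1024)
def recompute_cell(formula: str, *inputs: float) -> float:
    # Pretend this is costly: parse the formula, walk a dependency graph, etc.
    return sum(inputs) if formula == "SUM" else max(inputs, default=0.0)

# The first call pays the full cost; repeated calls with the same arguments
# are served from the in-process cache.
recompute_cell("SUM", 1.0, 2.0, 3.0)
recompute_cell("SUM", 1.0, 2.0, 3.0)  # cache hit
print(recompute_cell.cache_info())    # hits=1, misses=1
```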
Then the network changed everything again. The cloud and the web re-centralized computation, but only in theory. In practice, we built an enormous hierarchy of caches to disguise the fact that the data we wanted might be halfway around the world. Browsers cache pages, proxies cache sites, CDNs cache whole continents’ worth of content, and data centers cache the results of computation in DRAM, on SSDs, and even on specialized accelerators. Once again, caching became the means by which locality was simulated—this time on a planetary scale.
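The same move works one layer up the stack. Below is a toy sketch of the idea behind a browser, proxy, or CDN cache, assuming a simple time-to-live policy; `EdgeCache` and its stand-in origin fetch are illustrative, not any real CDN's API. Serve the local copy while it is fresh, and only pay the round trip to the origin when it isn't.

```python
import time

# A toy edge cache: serve a local copy if it is still fresh, otherwise fall
# back to the (slow) origin. The origin function and TTL are assumptions
# chosen for illustration only.
class EdgeCache:
    def __init__(self, fetch_from_origin, ttl_seconds=60.0):
        self.fetch = fetch_from_origin
        self.ttl = ttl_seconds
        self.store = {}  # url -> (expires_at, body)

    def get(self, url: str) -> bytes:
        entry = self.store.get(url)
        if entry and entry[0] > time.monotonic():
            return entry[1]                      # fresh local copy: no network trip
        body = self.fetch(url)                   # slow path: go back to the origin
        self.store[url] = (time.monotonic() + self.ttl, body)
        return body

cache = EdgeCache(lambda url: b"<html>...</html>")  # stand-in origin fetch
cache.get("https://example.com/")   # miss: pays the origin's latency
cache.get("https://example.com/")   # hit: served locally until the TTL expires
```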
Across these cycles, the boundaries of “local” and “remote” have shifted, but the need to hide latency has not. Whether the distance is measured in nanoseconds between CPU registers and main memory, or in milliseconds across the internet, the goal remains the same: make yesterday’s result available today without paying the full cost of discovery.
So yes, half of computer science really is caching, though it might be more accurate to say that much of computer science is about pretending distance doesn’t exist. Every time the industry redistributes where computing happens (centralized, decentralized, and back again), caching is what makes the transition tolerable. It is the art of anticipation, of guessing what will be needed before it’s asked for.
It may not stop there. As we move toward edge computing, federated AI, and on-device learning, the old rhythm will repeat. We’ll push computation outward, find that communication is too slow, and invent new kinds of caching to make distributed systems feel local again. In that sense, caching isn’t a subfield. It’s a mirror of computing itself: a continuing negotiation between the limits of physics and the illusions of speed.
The other half of computer science, you might say, is managing parallelism. Without it, the modern model (the cloud, the web, and especially AI) would collapse under its own weight. Caching hides latency; parallelism hides the waiting itself, by overlapping it with other work. The two are partners: caching saves what has already been done; parallelism arranges for many things to be done at once.
As clock speeds stopped climbing, progress came from multiplying. CPUs sprouted cores; GPUs turned massive parallelism into a commodity. The cloud scaled this up again, using fleets of processors to simulate a single responsive system. Frameworks like MapReduce, Spark, and modern AI training pipelines all depend on decomposing work into fragments that can be cached, shuffled, and recombined.
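Here is that decomposition in miniature, sketched in Python with a small worker pool standing in for a cluster. The documents and the word-count job are placeholders, not a real MapReduce or Spark pipeline; the point is only the shape of the computation: split, process in parallel, merge.

```python
from collections import Counter
from multiprocessing import Pool

# Map step: count words in one shard of the input.
def map_count(doc: str) -> Counter:
    return Counter(doc.split())

# Reduce step: merge the partial results into one answer.
def reduce_counts(counters) -> Counter:
    total = Counter()
    for c in counters:
        total.update(c)
    return total

if __name__ == "__main__":
    docs = ["to cache or not to cache", "parallel workers hide the wait"]
    with Pool(processes=2) as pool:
        partials = pool.map(map_count, docs)   # fragments processed concurrently
    print(reduce_counts(partials))
```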
In large language models, every layer’s output becomes the next layer’s input—a vast assembly line of matrix multiplications. The only way this works is through finely orchestrated caching and parallel scheduling. Intermediate activations are cached in GPU memory; gradient checkpoints are cached on local disks; data shards are cached near compute clusters to avoid saturating networks. The same pattern repeats at every scale, from L1 caches in silicon to exabyte object stores in the cloud.
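The pattern is easiest to see in miniature. Below is a toy sketch of the key/value caching idea behind autoregressive decoding, with plain Python floats standing in for GPU tensors and a made-up `project` step in place of a real layer: each new token's projections are computed once, cached, and reused by every later step instead of being recomputed.

```python
# A toy sketch of per-token key/value caching. Plain floats stand in for
# GPU tensors, and `project` is a placeholder, not any library's API.
class LayerKVCache:
    def __init__(self):
        self.keys: list[float] = []
        self.values: list[float] = []

    @staticmethod
    def project(token_id: int) -> tuple[float, float]:
        # Stand-in for a real key/value projection inside one layer.
        return float(token_id), float(token_id) * 0.5

    def extend_with(self, token_id: int) -> None:
        k, v = self.project(token_id)   # computed once per token...
        self.keys.append(k)             # ...then cached for every later step
        self.values.append(v)

cache = LayerKVCache()
for tok in [3, 1, 4, 1, 5]:
    cache.extend_with(tok)              # each step reuses all earlier entries
print(len(cache.keys))                  # 5 cached keys, none recomputed
```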
Parallelism, then, is what lets caching scale. A single cache can hide the delay of one computation; a coordinated army of them can make global computation appear instantaneous. It’s a delicate illusion, built on managing concurrency and consistency across distances that once seemed impossible for real-time systems.
So yes, half of computer science really is caching, but the other “half” is parallelism, and together they amount to a kind of time management for machines. Caching borrows from the past; parallelism borrows from the future. Every generation of computing has rebalanced those two debts, trading bandwidth for speed, or local cycles for global reach.
As we move into the era of edge computing and distributed AI, the rhythm continues. We push computation outward, find that communication is too slow, and invent new caches and new schedulers to make distributed systems feel local again. In that sense, caching and parallelism are not just engineering techniques; they’re the twin poles that make modern computing possible.