The memory crunch – Is AI to blame?

An in-depth look at the current shortage of DRAM and NAND

The global market for memory chips, particularly DRAM (Dynamic Random Access Memory) and NAND flash, is currently seeing significant price increases, primarily attributed to explosive demand from operators of AI applications, who are buying up these essential components in enormous quantities. But while the AI industry is undoubtedly a massive demand driver, it is only one part of a more complex set of market dynamics.

The shift from commodity products to HBM

This demand-driven pressure on the memory market is already producing shortages and price increases that affect a wide range of hardware manufacturers. Producers of PCs, conventional datacentre servers and laptops, along with the entire consumer electronics industry, face rising procurement costs. Ultimately, these additional costs are passed on to end consumers in the form of higher device prices.

An oligopoly dominates the memory market

The market for DRAM and NAND flash is effectively dominated by just three major players: Micron Technology, SK Hynix and Samsung Electronics. The strategic decisions and production priorities of these three giants largely determine global prices and availability.

A crucial structural factor is the strategic realignment of the manufacturers. HBM (High Bandwidth Memory), which is essential for the lucrative AI infrastructure, offers significantly higher profit margins than traditional commodity products such as DDR memory. As a direct consequence, the three dominant manufacturers are shifting their production capacities and resources massively towards HBM.

This shift comes at the expense of standard DDR memory production, which further exacerbates the supply shortage for conventional computers and servers. Micron even plans to give up its entire consumer division, Crucial. But what sounds like good news for the enterprise market could quickly turn out to be bad news: Crucial products are also installed in the PCs and laptops of corporate customers.

Hesitant capacity expansion after the semiconductor crisis

The current situation is also a long-term consequence of the global semiconductor crisis triggered by the Covid pandemic, which saw production capacity reduced in many places. Since then, rebuilding and expanding capacity has been slow. Investment decisions are further slowed by fundamental uncertainty about the sustainability of current AI demand - the fear of a potential AI bubble.

Technological change and production bottlenecks

The shortage is further exacerbated by the natural product life cycle of memory types. DDR4 memory production will be phased out at the three major suppliers' large factories to free up capacity for newer, higher-margin technologies. Although smaller chip manufacturers are stepping in to fill the gap, they still need time to optimise their production processes and reach full capacity. This dynamic is not new; a similar development was already evident during the transition from DDR3 to DDR4.

High investment barriers and long lead times

The construction of new production facilities (fabs) is an extremely capital-intensive and time-consuming undertaking. It typically takes several years from the laying of the foundation stone to volume production. Given the current geopolitical and economic uncertainties, this high investment risk represents a significant additional challenge that massively slows down the rapid response to increased global demand.

There is also reportedly a backlog of orders for hard disk drives (HDDs) estimated at around two years. This is a direct result of the massive increase in the need for storage capacity, triggered in particular by the boom in AI and the associated data-intensive applications.

The development and operation of large language models (LLMs), neural networks, and other AI technologies require massive amounts of training and operational data. Hyperscalers that provide the infrastructure for these AI applications are being forced to expand their storage capacities exponentially. Although solid-state drives (SSDs) are often preferred for speed and performance, HDDs remain the primary choice for storing the large amounts of cold data essential for AI training, thanks to their significantly lower price per terabyte.

This sudden and massive increase in demand has put severe pressure on the production capacities of leading HDD manufacturers such as Seagate, Western Digital and Toshiba. Supply chains for critical components, including platters and heads, are strained, preventing manufacturers from ramping up production quickly enough.

The two-year backlog of orders points to a profound shift in the memory market. It signals not only a short-term bottleneck, but also a longer-term structural challenge that could potentially impact the speed of global AI growth. Companies that rely on large HDD shipments - from datacentres to big data analytics companies to supercomputing facilities - now have to plan for significant delays in expanding their infrastructure. This shortage is also expected to drive up storage capacity prices on the spot market, further increasing the total cost of ownership for AI-centric projects.

It may seem paradoxical that AI development relies on HDD technology, which is slower than SSDs. The explanation lies in the way storage is used in AI infrastructure:

Storage space for storing training data sets: AI models, especially large language models and complex neural networks, require huge amounts of training data (terabytes, often petabytes). This data does not need to be accessed at top speed all the time; its main requirement is low-cost, dense storage, and HDDs offer the best price-per-terabyte ratio (see the sketch after these two points).

Storage space for generating (acquiring) training data sets: The process of collecting, pre-processing and storing new, potential training data - whether through web crawling, sensor data or simulations - also generates enormous volumes of data that primarily need to be archived.
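To make the price-per-terabyte argument concrete, here is a minimal back-of-the-envelope sketch in Python. The per-terabyte prices are illustrative assumptions, not quoted market figures; what matters is the ratio between the two media, not the absolute numbers.

```python
# Back-of-the-envelope comparison of cold-storage media costs.
# All prices are illustrative assumptions, not current market quotes.

ILLUSTRATIVE_PRICE_PER_TB = {
    "HDD (nearline, enterprise)": 15.0,  # assumed USD per terabyte
    "SSD (QLC, high density)":    45.0,  # assumed USD per terabyte
}

def cold_storage_cost(capacity_tb: float) -> None:
    """Print the media cost of storing `capacity_tb` terabytes on each medium."""
    for medium, price in ILLUSTRATIVE_PRICE_PER_TB.items():
        print(f"{medium}: {capacity_tb * price:,.0f} USD for {capacity_tb:,.0f} TB")

# A 10-petabyte training-data archive (10,000 TB):
cold_storage_cost(10_000)
# Even with these rough numbers, the HDD archive costs a third of the SSD
# equivalent - which is why cold training data lands on spinning disks.
```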

The role of QLC-NAND in the AI era

In parallel with HDD demand, AI development is also driving the proliferation of QLC (Quad-Level Cell) NAND flash memory. QLC stores four bits per cell, offering the highest storage density among current NAND technologies, in contrast to TLC (Triple-Level Cell, three bits) or MLC (Multi-Level Cell, two bits).

The need for large, relatively inexpensive SSDs for faster data access (e.g. for hot storage, caching or parts of the training pipelines) is massively driving up QLC shipment volumes.

Mass production is driving increased investment in research and development of controller firmware and the associated algorithms. This is crucial because QLC's higher data density tends to come with lower endurance (fewer write cycles) and more complex error correction. Constant improvement of controller technology raises the quality, durability and performance of QLC SSDs (including SATA SSDs).
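The density/endurance trade-off can be sketched in a few lines. The states-per-cell figure follows directly from the bits per cell (2^n); the program/erase cycle counts and the write-amplification factor are rough illustrative assumptions, since real endurance varies widely with controller, firmware and workload.

```python
# Sketch: why more bits per cell means higher density but lower endurance.
# P/E cycle counts and write amplification are rough illustrative assumptions;
# real figures depend heavily on the controller, firmware and workload.

NAND_TYPES = {
    # name: (bits per cell, assumed program/erase cycles)
    "MLC": (2, 10_000),
    "TLC": (3, 3_000),
    "QLC": (4, 1_000),
}

def endurance_tbw(capacity_tb: float, pe_cycles: int,
                  write_amplification: float = 2.0) -> float:
    """Approximate drive endurance in terabytes written (TBW)."""
    return capacity_tb * pe_cycles / write_amplification

for name, (bits, pe_cycles) in NAND_TYPES.items():
    states = 2 ** bits  # voltage levels the controller must distinguish per cell
    tbw = endurance_tbw(capacity_tb=4.0, pe_cycles=pe_cycles)
    print(f"{name}: {bits} bits/cell -> {states} states, "
          f"~{tbw:,.0f} TBW for a 4 TB drive (assumed figures)")
```

More states per cell means finer voltage distinctions, which is precisely why the controller's error correction has to work harder as density rises.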

The AI-driven HDD shortage could give QLC storage a further boost, were it not for the semiconductor supply problems. Yet production bottlenecks are not even the biggest problem for QLC storage.

Although QLC technology was originally positioned as a cheaper alternative, greatly increased demand, particularly from the AI sector, is leading to a paradoxical development: QLC memory is becoming more expensive, contrary to expectations. Providers can charge higher prices because storage density plays such a critical role in AI infrastructure. This points to a tightening of the high-density storage market that affects both HDDs and QLC NAND.

Don’t panic, be proactive

Given volatile markets and potential supply bottlenecks, a prudent and strategic approach to the procurement of IT components is essential. Organisations should resist the temptation to panic buy or hoard hardware that far exceeds current needs. Such behaviour distorts prices, places unnecessary strain on working capital and may result in the acquisition of inappropriate technology.

Forward planning is essential. Reactive supply management is a significant competitive disadvantage in today's business world. Organisations that accurately anticipate their current and future needs - based on medium to long-term growth goals, the introduction of new projects or the natural life cycle of existing systems - gain significant advantages. This strategic foresight enables more efficient capacity planning and the early procurement of appropriate resources. By placing orders with longer lead times or using framework agreements, companies can mitigate price fluctuations and ensure the availability of critical components.

Investing in quality is likely to pay off. Short-term savings from purchasing consumer-grade or inferior components are often offset by a higher total cost of ownership (TCO). Higher-quality models have a longer service life and significantly higher reliability. Investing in high-quality ECC (Error-Correcting Code) memory or enterprise-grade HDDs and SSDs (with higher write endurance) reduces the likelihood of system failure and lengthens the replacement cycle.
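As a rough illustration of the TCO argument, the following sketch compares a fleet of cheaper consumer drives with pricier enterprise drives over five years. All figures - purchase prices, annual failure rates and per-incident replacement cost - are invented for illustration, not vendor data.

```python
# Illustrative five-year TCO comparison of consumer vs enterprise drives.
# All figures (prices, annual failure rates, replacement cost) are invented
# assumptions for illustration, not vendor data.

def five_year_tco(unit_price: float, annual_failure_rate: float,
                  replacement_cost: float, drives: int = 100) -> float:
    """Purchase cost plus expected failure-handling cost over five years."""
    expected_failures = drives * annual_failure_rate * 5
    return drives * unit_price + expected_failures * replacement_cost

# replacement_cost covers the new drive plus labour/downtime per incident.
consumer   = five_year_tco(unit_price=100, annual_failure_rate=0.05, replacement_cost=400)
enterprise = five_year_tco(unit_price=180, annual_failure_rate=0.01, replacement_cost=400)

print(f"Consumer fleet:   {consumer:,.0f} USD")    # 10,000 + 25 * 400 = 20,000
print(f"Enterprise fleet: {enterprise:,.0f} USD")  # 18,000 +  5 * 400 = 20,000
# With these assumptions the costs roughly converge; price in longer downtime
# or data-loss risk and the enterprise drives pull ahead.
```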

Use existing and alternative technologies efficiently

For many use cases, the latest technology is not required. Older generations can still provide excellent service, especially for applications with moderate performance requirements or well-optimised workloads. Organisations that analyse and understand their applications and user behaviour precisely can often rely on models from previous generations. This relieves strain on the supply chains for the latest products and often offers significant cost advantages. Another sustainable and cost-effective alternative is the use of refurbished modules from certified providers. These components are professionally tested and often come with guarantees, striking a sensible balance between cost savings and reliability.

This article was first published in German on Computing’s sister site Computing Deutschland