AI data centres must use flash storage for efficiency

Alex McMullan, CTO International, Pure Storage

With data volumes increasing exponentially, it is more important than ever that organisations use the densest, most efficient data storage possible to limit sprawling data centre footprints and spiralling power and cooling costs, says Alex McMullan at Pure Storage.

As a technology with huge but unrealised potential, AI has been on the corporate agenda for a long time. This year it has undoubtedly gone into overdrive, due to Microsoft’s $10 billion investment in OpenAI, together with strategic initiatives by Meta, Google and others in generative AI.

Although we have seen many advances in AI over the years, and arguably just as many false dawns in terms of its widespread adoption, there can be little doubt now that it is here to stay. As such, now is the time for CTOs and IT teams to consider the wider implications of the coming AI-driven era.

In terms of its likely impact on the technology sector and society in general, AI can be likened to the introduction of the relational database, which was the spark that ignited a widespread appreciation for large data sets, resonating with both end users and software developers.

AI and ML can be viewed in the same terms as they provide a formative foundation for not only building powerful new applications, but also enhancing and improving the way we engage with groundbreaking technology alongside large and disparate datasets. We are already seeing how these developments can help us solve complex problems much faster than was previously possible.

Managing data models for AI

To understand the challenges that AI presents from a data storage perspective, we need to look at its foundations. Any machine learning capability requires a training data set. In the case of generative AI, the data sets need to be exceptionally large and complex, including different types of data.

Generative AI relies on complex models, and the algorithms on which it is based can include an exceptionally large number of parameters that the model is tasked with learning. The greater the number of features, and the greater the size and variability of the anticipated output, the larger the data batch size and the number of epochs needed in the training runs before inference can begin.

Generative AI is in essence being tasked with making an educated guess, or running an extrapolation, regression or classification, based on the data set. The more data the model has to work with, the greater the chance of an accurate outcome, that is, of minimising the error (cost) function.
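To make the terms batch size, epoch and cost function concrete, here is a minimal sketch in Python. It is an illustration only: it fits a two-parameter straight line rather than a generative model, and the function names, learning rate and toy data are all assumptions made for the example.

```python
import random

def mse(w, b, data):
    """Cost function: mean squared error of the line y = w*x + b over the data."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(data, epochs=300, batch_size=4, lr=0.2, seed=0):
    """Mini-batch gradient descent. Returns (w, b, cost recorded after each epoch)."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    history = []
    for _ in range(epochs):                      # one epoch = one full pass over the data
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]       # one batch = one parameter update
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
        history.append(mse(w, b, data))
    return w, b, history

# Hypothetical toy data drawn from y = 3x + 1 (no noise).
data = [(x / 10, 3 * (x / 10) + 1) for x in range(11)]
w, b, history = train(data)
print(f"w={w:.3f} b={b:.3f} first-epoch cost={history[0]:.4f} final cost={history[-1]:.2e}")
```

The cost falls as the epochs accumulate. In a large language model the same loop runs over billions of parameters and far larger batches, which is where the demands on memory and storage come from.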

Over the last few years, AI has steadily driven the size of these datasets upwards, but the introduction of large language models, upon which ChatGPT and the other generative AI platforms rely, has seen their size and complexity increase by an order of magnitude.

This is because the learned knowledge patterns that emerge during the AI model training process need to be stored in memory, which can become a challenge with larger models.

Managing data storage for AI

Checkpointing large and complex models also puts huge pressure on the underlying network and storage infrastructure, as the model cannot continue until the internal data has all been saved in the checkpoint. These checkpoints function as restart or recovery points if the job crashes or the error gradient is not improving.
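The blocking nature of checkpointing can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how any particular training framework does it: the state is a tiny dictionary saved as JSON, and the names `save_checkpoint`, `run` and `stall_seconds`, the crash point and the bandwidth figures are all hypothetical.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Synchronous save: the training loop waits here until the data is on disk."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename, so a crash never leaves a half-written file

def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "weights": [0.0, 0.0]}
    with open(path) as f:
        return json.load(f)

def run(path, total_steps, checkpoint_every, crash_at=None):
    """Resume from the last checkpoint and train until total_steps is reached."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated crash")
        state["weights"] = [w + 0.1 for w in state["weights"]]  # stand-in for a training step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state

def stall_seconds(checkpoint_bytes, write_bytes_per_second):
    """Rough time the job is paused while one checkpoint is written."""
    return checkpoint_bytes / write_bytes_per_second

ckpt = os.path.join(tempfile.mkdtemp(), "model.ckpt")
try:
    run(ckpt, total_steps=10, checkpoint_every=3, crash_at=7)
except RuntimeError:
    pass
resumed = load_checkpoint(ckpt)                         # work after step 6 was lost
final = run(ckpt, total_steps=10, checkpoint_every=3)   # restart from the checkpoint
print(resumed["step"], final["step"])
print(stall_seconds(1e12, 1e10), "seconds for a hypothetical 1 TB checkpoint at 10 GB/s")
```

The trade-off is visible even at this scale: checkpointing more often loses less work after a crash, but every checkpoint pauses the job for roughly its size divided by the available write bandwidth.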

Given the connection between data volumes and the accuracy of AI platforms, it follows that organisations investing in AI will want to build their own exceptionally large data sets to take advantage of the unlimited opportunities that AI affords. This is achieved through utilising neural networks to identify the patterns and structures within existing data to create new, proprietary content.

Because data volumes are increasing exponentially, it is more important than ever that organisations use the densest, most efficient data storage possible, to limit sprawling data centre footprints and the spiralling power and cooling costs that go with them. This presents another challenge that is beginning to surface as a significant issue: the implications that massively scaled-up storage requirements have for achieving net zero carbon targets by 2030-2040.

It is clear that AI will have an impact on sustainability commitments because of the extra demands it places on data centres, at a time when CO2 footprints and power consumption are already a major issue. This is only going to increase pressure on organisations, but it can be accommodated and managed by working with the right technology suppliers.

The latest GPU servers consume 6-10kW each, and most existing data centres are not designed to deliver more than 15kW per rack, so there is a large and looming challenge for data centre professionals as GPU deployments increase in scale.
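The arithmetic behind that challenge is simple, and a short Python sketch makes it explicit. The 6kW, 10kW and 15kW figures come from the paragraph above; the 64-server deployment is a hypothetical example, and the calculation ignores the power drawn by networking, storage and cooling equipment in the same rack.

```python
import math

def servers_per_rack(rack_kw, server_kw):
    """Whole GPU servers that fit within a rack's power budget."""
    return int(rack_kw // server_kw)

def racks_needed(num_servers, rack_kw, server_kw):
    """Racks required to power num_servers, limited by power alone."""
    per_rack = servers_per_rack(rack_kw, server_kw)
    if per_rack == 0:
        raise ValueError("a single server exceeds the rack power budget")
    return math.ceil(num_servers / per_rack)

RACK_KW = 15   # typical upper limit per rack, from the text
FLEET = 64     # hypothetical number of GPU servers

for server_kw in (6, 10):
    print(f"{server_kw}kW servers: {servers_per_rack(RACK_KW, server_kw)} per rack, "
          f"{racks_needed(FLEET, RACK_KW, server_kw)} racks for {FLEET} servers")
```

At 6kW a 15kW rack powers only two servers, and at 10kW just one, so racks run largely empty and floor space, rather than compute, becomes the constraint.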

Flash data centres for AI

Some technology vendors are already addressing sustainability in their product design. For example, all-flash storage solutions are considerably more efficient than their spinning-disk (HDD) counterparts.

Some vendors are even going beyond off-the-shelf SSDs and creating their own flash modules, which allow all-flash arrays to communicate directly with raw flash storage. This maximises the capabilities of flash and provides better performance, power utilisation and efficiency.

As well as being more sustainable than HDD, flash storage is much better suited to running AI projects. This is because the key to results is connecting AI models or AI-powered applications to data.

Doing this successfully requires support for large and varied data types, streaming bandwidth for training jobs, write performance for checkpointing and checkpoint restores, and random read performance for inference. Crucially, it all needs to be dependable 24×7 and easily accessible across silos and applications.

This set of characteristics is not possible with HDD-based storage underpinning your operations; all-flash is needed.

Data centres are now facing a secondary but equally important challenge that will be exacerbated by the continued rise of AI and ML. That is water consumption, which is set to become an even bigger problem, especially when you take into consideration the continued rise in global temperatures.

Many data centres use evaporative cooling, which works by spraying fine mists of water onto cloth strips; the water absorbs the ambient heat as it evaporates, cooling the air around it. It is a smart idea but a problematic one, given the added strain that climate change is placing on water resources, especially in built-up areas.

As a result, this method of cooling has fallen out of favour in the past year, resulting in a reliance on more traditional, power intensive cooling methods like air conditioning. This is yet another reason to move to all-flash data centres, which consume far less power and do not have the same intensive cooling requirements as HDD and hybrid.

Road ahead

As AI and ML continue to evolve rapidly, the focus will increase on three areas: data security, to ensure that rogue or adversarial inputs cannot change the output; model repeatability, using techniques like Shapley values to gain a better understanding of how inputs alter the model; and stronger ethics, to ensure this powerful technology is used to actually benefit humanity.
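For readers unfamiliar with Shapley values, the sketch below computes them exactly for a tiny model by averaging each input's marginal contribution over every possible ordering of the inputs. It is an illustrative Python example only: the three-input `model`, the input values and the all-zero baseline are hypothetical, and exact enumeration is practical only for a handful of inputs (real tools rely on approximations).

```python
from itertools import permutations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values: average marginal contribution of each input
    over all orderings in which inputs are switched from baseline to x."""
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        prev = model(current)
        for i in order:
            current[i] = x[i]          # reveal input i
            new = model(current)
            phi[i] += new - prev       # its marginal contribution in this ordering
            prev = new
    return [p / factorial(n) for p in phi]

def model(v):
    """Hypothetical model: one independent input and two that interact."""
    return 2 * v[0] + v[1] * v[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
print(phi)                                      # contribution attributed to each input
print(sum(phi), model(x) - model(baseline))     # the attributions sum to the total change
```

The first input is credited with its own effect, while the two interacting inputs share the credit for their joint effect equally. The attributions always add up to the difference between the model's output and the baseline output, which is what makes the technique useful for explaining how inputs alter a result.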

All these worthy goals will increasingly place new demands on data storage. Storage vendors are already factoring this into their product development roadmaps, knowing that CTOs will be looking for secure, high-performance, scalable, efficient storage solutions that help them towards these goals.

The focus should therefore not be entirely on the capabilities of data storage hardware and software; the big picture in this case is very big indeed.


Key takeaways

  • Generative AI is being tasked with making an educated guess or running an extrapolation, regression or classification based on the data set.
  • The more data the model has to work with, the greater the chance of an accurate outcome, that is, of minimising the error (cost) function.
  • Learned knowledge patterns that emerge during the AI model training process need to be stored in memory, which can become a challenge with larger models.
  • Checkpointing large and complex models also puts pressure on underlying network and storage infrastructure, as the model cannot continue until the internal data has all been saved in the checkpoint.
  • Checkpoints function as restart or recovery points if the job crashes or the error gradient is not improving.
  • Massively scaled-up storage requirements have implications for achieving net zero carbon targets by 2030-2040.
  • Given the connection between data volumes and accuracy of AI platforms, it follows that organisations investing in AI will want to build their own exceptionally large data sets, achieved through neural networks.
  • With data volumes increasing exponentially, it is important that organisations use the densest data storage possible, to limit sprawling data centre footprints and spiralling power and cooling costs.
  • The latest GPU servers consume 6-10kW each, and most existing data centres are not designed to deliver more than 15kW per rack.


Intelligent CIO Middle East
