Dark data: The elephant in the data centre

With the management of dark data being a daunting issue for CIOs, Mark Kidd, EVP and GM, Iron Mountain Data Centres and Asset Lifecycle Management, asks what is dark data and why does it matter?

Mark Kidd, EVP and GM, Iron Mountain Data Centres and Asset Lifecycle Management

‘Dark data’, a term coined by Gartner, is defined as the information assets organisations collect, process and store during regular business activities, but generally fail to use for other purposes.

Like dark matter, dark data takes up huge amounts of space in data centres and is virtually invisible. This doesn’t mean we can ignore it. It’s worth taking a moment to consider the nature of dark data, its impact, and what we might be able to do to improve things.

Personal footprint

Dark data is easiest to grasp and deal with at a personal level. For most of us it consists of unused photos and videos. In the old days, film was precious and development expensive, but now we can take 20 shots to get the one we want, and we can edit easily, creating more backup files in the process. In 2020, Google said it stored 4 trillion photos, with 28 billion new photos and videos uploaded each week. Google Photos is just one photo service, and those upload rates have no doubt grown in the last few years.

This personal dark data also creates a privacy issue. However secure our cloud service is, there is always the possibility that ID photos, personal chat screengrabs and private files can be used by cybercriminals. The answer? Think before you shoot, tidy up caches and archives regularly, and be particularly careful not to leave sensitive files lying around.

Hidden losses

For companies, the challenge is on a larger scale and affects the bottom line. Dark data consists of near-identical images or documents, IoT data sets, log files and applications. This data takes up server space, and powering these servers takes up energy and equipment, which not only costs money, but can also mean significant emissions if low-carbon or renewable power is not being used. Dark data is also unstructured and unexplored, which brings with it privacy and compliance risks.

No organisation is unaffected. Estimated levels of commercial dark data vary by sector from 40% to 90%, so it’s extremely likely that the majority of your company’s data is dark. According to the World Economic Forum, companies generate 1.3 billion gigabytes of dark data every day. Storing that data for a year using non-renewables generates as much CO2 as three million flights from London to New York. So, if we’re interested in decarbonising the data centre industry – and we should be – we should tackle this issue.

Technology lag

For many businesses, the level of dark data reflects a lack of data structuring processes. An organisation’s ability to collect data can exceed the rate at which it can analyse that data. In some cases, the organisation may not even be aware the data is being collected.

Organisations retain dark data for a multitude of reasons. Often it is stored for regulatory compliance and record keeping, but equally often the complexity of compliance, privacy and data discovery is the reason that these data lakes are allowed to build up. Some organisations believe that dark data could be useful to them in the future once they have acquired better analytic and business intelligence technology to process the information.

New tools and standards

There is good news here. The scale of the task may appear daunting for CIOs and CDOs, but AI and Machine Learning have now advanced to the point that they can help automate the data structuring process. Only a tiny percentage of dark data needs to be reviewed at the outset by humans to kickstart the process. This can then be followed up with a reinforcement learning model to assess the relevance of remaining data and prioritise it. From then on, a virtuous cycle of tagging and analysis makes the process easier to manage.
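To make the idea concrete, here is a minimal sketch of that seed-then-prioritise loop. It is not a reinforcement learning model: it swaps in a much simpler token-weighting heuristic, purely to show how a handful of human-reviewed items can be used to rank the rest. The function names and the sample records are hypothetical.

```python
from collections import Counter


def tokenize(text):
    # Lowercase, whitespace-split, keep alphabetic tokens only.
    return [t for t in text.lower().split() if t.isalpha()]


def train_token_weights(labelled):
    """Learn a weight in [-1, 1] per token from (text, is_relevant) pairs.

    Assumption: a simple frequency heuristic stands in for the model
    described in the article.
    """
    relevant, irrelevant = Counter(), Counter()
    for text, is_relevant in labelled:
        target = relevant if is_relevant else irrelevant
        target.update(set(tokenize(text)))
    weights = {}
    for token in set(relevant) | set(irrelevant):
        r, i = relevant[token], irrelevant[token]
        weights[token] = (r - i) / (r + i)
    return weights


def score(text, weights):
    # Sum the weights of the distinct tokens; unseen tokens count as 0.
    return sum(weights.get(t, 0.0) for t in set(tokenize(text)))


def prioritise(unlabelled, weights):
    # Most likely relevant first, so human reviewers see it soonest.
    return sorted(unlabelled, key=lambda text: score(text, weights), reverse=True)


# Hypothetical seed set reviewed by a person.
seed = [
    ("customer contract renewal terms", True),
    ("invoice customer payment", True),
    ("debug heartbeat ping", False),
    ("heartbeat ping ok", False),
]
backlog = ["heartbeat ping debug trace", "customer invoice overdue", "meeting notes"]

weights = train_token_weights(seed)
for item in prioritise(backlog, weights):
    print(round(score(item, weights), 2), item)
```

Each reviewed batch can be appended to the seed set and the weights retrained, which is one simple way to picture the cycle of tagging and analysis.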

Measurement would also help to benchmark progress. Considering the scale of the problem, there may be a case for setting standards for effective data use, perhaps a Data Usage Effectiveness (DUE) metric to sit alongside CUE (Carbon), WUE (Water) and PUE (Power), where 1 = 100% elimination of non-essential single-use data. This, or some similar metric, would be well worth working towards, and could also have value as a digital performance indicator. However, it may be too early to measure, while so much dark data remains invisible.
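As a rough illustration of how such a metric might be calculated, the sketch below treats DUE as the share of stored data that is not non-essential single-use data, so that a score of 1 means all of it has been eliminated. No standard DUE definition exists; the formula, the function name and the figures are assumptions made for this example.

```python
def data_usage_effectiveness(total_bytes, single_use_bytes):
    """Hypothetical DUE: fraction of stored data that is not single-use.

    Assumption: 1.0 means no non-essential single-use data remains,
    0.0 means everything stored is single-use.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    if not 0 <= single_use_bytes <= total_bytes:
        raise ValueError("single_use_bytes must be between 0 and total_bytes")
    return (total_bytes - single_use_bytes) / total_bytes


# Illustrative figures only: 100 units stored, 60 of them dark.
print(data_usage_effectiveness(100, 60))  # 0.4
```

On the sector estimates quoted earlier, where 40% to 90% of data is dark, this reading would give scores between 0.6 and 0.1.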

Let’s talk

Whatever dark data means to you or your business, it is an ‘elephant in the room’ for data centres, and the more we talk about it, the likelier we are to come up with incremental improvements. For individual data users, there are things we can do to reduce single-use data. For organisations it’s a bit more complicated, but approaches and tools are emerging. These should be discussed and shared.

As with energy efficiency, identifying and eliminating waste at source is the most obvious opportunity. According to IBM, 60% of data loses its value within milliseconds of being acquired, and any scheme to use data more effectively must first address the issue of collecting useless data. A robust approach to data gathering is the key here: assessing how data can be used, or whether it is usable at all.

The next step is structuring the data we keep. Structured data is not only more valuable, but easier to track and, if necessary, delete. Making data more visible should make it possible to reduce the environmental and financial burden of storage while using our valuable data to empower our organisations and serve our customers better.
