What is Dark Data and how can it be used?

by Philippe Zimmermann

After years of the Big Data boom, many analysts are coming to a sobering conclusion: only a fraction of the globally accumulated data volume can be used, and even less of it actually is.

Overshadowing this fraction is a larger pile of what is known as “ROT” data: Data that is redundant, obsolete and/or trivial – all indicators that it should actually be deleted.

But what really worries analysts is the third and by far the largest category: “Dark Data”, a proliferation of unused, orphaned and unstructured data that has not yet been assigned any value – but which nevertheless generates storage costs and eats up working time.

Dark Data defined

As it turns out, Germany appears to be the global leader in this discipline: According to industry estimates, more than 55 percent of all information stored in Germany is simply “dark”, meaning that it meets one or more of the following loose criteria used to define Dark Data:

  • It sits in isolated data silos, which runs counter to the Data Lake ideal and produces a “Data Swamp” instead.
  • It is neither structured nor searchable.
  • The corporation is completely unaware that it possesses the information.
  • Its evaluation is postponed indefinitely.
  • It is located on storage devices that are no longer used (backups, USB sticks, mailboxes).
  • The information is distributed on spreadsheets that are no longer consulted.
  • Its storage causes more costs than benefits.
  • It is sometimes encrypted and no one has access any more.
  • The (unknown) existence of the data can pose a threat to the business, for example in the context of GDPR compliance.
  • It is often more expensive to sort later than it would have been to classify immediately.

Companies and public administrations that do not change course in time are therefore in danger of being buried under the avalanche of information they themselves have collected. And the global volume of data is increasing rapidly: The market research firm IDC estimates that some 59 billion terabytes (or 59 zettabytes) of digital information will have been created in 2020 alone – and that more data will be collected and stored in the next three years than in the previous 30.
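To make the scale of that estimate tangible, the unit conversion can be checked in a few lines. This is only a sketch: it assumes decimal (SI) prefixes, and the helper name `terabytes_to_zettabytes` is made up for illustration.

```python
# Decimal (SI) storage units, expressed in bytes.
TERABYTE = 10**12
ZETTABYTE = 10**21


def terabytes_to_zettabytes(terabytes: int) -> float:
    """Convert a terabyte count to zettabytes using SI prefixes."""
    return terabytes * TERABYTE / ZETTABYTE


# "59 billion terabytes" as quoted in the text.
estimate_tb = 59 * 10**9
estimate_zb = terabytes_to_zettabytes(estimate_tb)
print(f"{estimate_tb:,} TB = {estimate_zb:g} ZB")  # 59,000,000,000 TB = 59 ZB
```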

CEOs don’t know what they don’t know

The term Dark Data originated in the scientific community and illustrates a variety of problems in the analysis of statistics. A famous example from World War II: British engineers recorded the location of bullet holes on returning fighter planes and bombers to draw conclusions about where armor should be improved. Only one engineer realized that armor would in fact have to be placed precisely where the returned planes had no bullet holes – since these planes could return despite being hit, armor was not required in the damaged areas. The planes that did not come back, however, were most likely hit in exactly the undamaged places. When looking at the statistics, the data on these planes was simply missing: an early example of Dark Data.
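The effect described above can be reproduced with a small simulation. The following is a rough sketch, not a historical reconstruction: the aircraft sections and the loss probabilities in `LOSS_PROBABILITY` are invented for illustration, and hits are assumed to land in every section with equal likelihood.

```python
import random
from collections import Counter

# Hypothetical probability that a single hit in a section downs the plane.
LOSS_PROBABILITY = {"engine": 0.6, "cockpit": 0.5, "fuselage": 0.05, "wings": 0.05}
SECTIONS = list(LOSS_PROBABILITY)


def simulate(num_planes: int, hits_per_plane: int, seed: int = 0):
    """Return (holes on all planes, holes on returned planes) per section."""
    rng = random.Random(seed)
    all_hits, observed_hits = Counter(), Counter()
    for _ in range(num_planes):
        # Every hit lands in a uniformly chosen section.
        hits = [rng.choice(SECTIONS) for _ in range(hits_per_plane)]
        # The plane returns only if it survives each of its hits.
        survived = all(rng.random() >= LOSS_PROBABILITY[s] for s in hits)
        all_hits.update(hits)
        if survived:
            observed_hits.update(hits)  # only these holes can be counted
    return all_hits, observed_hits


all_hits, observed_hits = simulate(num_planes=10_000, hits_per_plane=3)
for section in SECTIONS:
    print(f"{section:9s} all planes: {all_hits[section]:6d}   "
          f"returned only: {observed_hits[section]:6d}")
```

Across all planes the hits are spread roughly evenly, but among the returned planes the engine and cockpit show far fewer holes. Anyone who only sees the second column would reinforce the wrong sections.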

In the business world, Dark Data poses less risk to life and limb, but the missing insights can lead to flawed long-term decisions based on incomplete data sets. After all, even entrepreneurs don’t know what they don’t know.

The downside of Big Data

The cause of many of these data “garbage piles” can be found in the gold rush of the past decade. In many industries, data was collected and stored simply because it was possible – for example, by app providers. In other cases, data was not destroyed because people wrongly believed they had to keep it due to legal requirements. Or machines and manufacturing facilities were equipped with sensors and logs whose output was subsequently overlooked.

But there is great potential for development in all of this. Whether the digital transformation will impact the world to the same extent the industrial revolution once did will also depend on how the problem of Dark Data is solved.

Until then, a cleanup can in many cases avoid not only storage costs, but also a number of risks:

  • Untapped data assets pose a legal risk because they may contain compliance and regulatory violations.
  • If raw, unused data is released to the public due to inadequate security, there is a serious risk of reputational damage or involuntary disclosure of trade secrets.
  • Once a “business as usual” mentality is established, the problem is merely postponed. Storage costs may be decreasing, but Dark Data often grows exponentially.
  • Because the global storage space requirements and energy consumption of Dark Data are so massive, environmentalists are increasingly taking up the cause. In this regard, smart companies can stay ahead of potential regulation (and bad publicity).

Using Dark Data: How to regain control

The insight that “data is the new oil” is thus true in two respects: oil, too, must first be refined before it can unfold its value. Cleaning up legacy data may be a one-time effort, but it will pay off. Not only will decisions be made on more solid ground, but the risk of inadvertent legal violations will be reduced – regardless of whether the company in question is a small local business or a global player.

What’s more, coordinated cleanup and structuring can give rise to new business models and secondary uses that could not have been foreseen. A cottage industry of small and medium-sized businesses is emerging, focusing primarily on the secondary use of larger companies’ data swamps.

In addition to employee discipline, the right tools are essential for dealing with the problem. Document management systems and internally developed content applications serve as the foundation. The use of optical character recognition (OCR), artificial intelligence (AI), and machine learning algorithms can help to reduce the amount of time required. And well-defined workflows ensure that mountains of Dark Data cannot simply grow back.

And, as a pleasant side effect, such tools also prevent the buildup of the ROT data mentioned at the beginning: data that is redundant, obsolete and/or trivial.
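What a first, very simple cleanup step might look like can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a document management system: the function name `inventory`, the one-year threshold and the use of the modification time as a stand-in for “no longer consulted” are all assumptions made for this example. Files are flagged as stale if they have not been modified for a given number of days, and as redundant if their content is byte-for-byte identical to another file.

```python
import hashlib
import os
import tempfile
import time
from collections import defaultdict
from pathlib import Path
from typing import Optional


def inventory(root: Path, stale_after_days: int = 365, now: Optional[float] = None):
    """Return (stale_files, duplicate_groups) for all files below root."""
    now = time.time() if now is None else now
    cutoff = now - stale_after_days * 86_400  # seconds per day
    stale, by_hash = [], defaultdict(list)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Assumption: an old modification time indicates unused data.
        if path.stat().st_mtime < cutoff:
            stale.append(path)
        # Reads the whole file into memory; fine for a sketch, not for huge files.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        by_hash[digest].append(path)
    duplicates = [paths for paths in by_hash.values() if len(paths) > 1]
    return stale, duplicates


# Small demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "report.txt").write_text("quarterly figures")
    (root / "report_copy.txt").write_text("quarterly figures")
    (root / "old_log.txt").write_text("sensor output")
    two_years_ago = time.time() - 2 * 365 * 86_400
    os.utime(root / "old_log.txt", (two_years_ago, two_years_ago))

    stale, duplicates = inventory(root)
    stale_names = [p.name for p in stale]
    duplicate_names = [[p.name for p in group] for group in duplicates]
    print("stale:", stale_names)
    print("redundant:", duplicate_names)
```

A listing like this deletes nothing. It only gives the people responsible for the data a starting point for deciding what to classify, archive or remove.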