Big data is big news these days. But most organisations just end up hoarding vast reams of data, leaving them with a massive repository of unstructured – or “dark” – data that is of little use to anyone.

Given the potential benefits of big data, it’s crucial that we find better ways to gather, store and analyse data in order to make the most of it.

Stories of big data successes have triggered significant investments in big data initiatives. This has prompted many organisations to gather significant volumes of external and internal data into so-called “data lakes”. These are repositories that contain data in any format, whether structured, like databases, or unstructured, like emails or audio and video.

As a result, the growth in the amount of data being generated, collected and stored continues at an exponential rate.

But according to a recent IBM study, more than 80% of all data is inactive, unmanaged, often unstructured, lacking meaningful metadata, and even unknown to the organisation. The proportion of this dark data is expected to reach 93% by 2020.

For example, data generated from vehicle on-board devices can be expected to reach 350MB of data every second. Where does all this data go and who is using it?
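To put that rate in perspective, here is a rough back-of-the-envelope sketch that converts it into a daily volume. It assumes the 350MB per second is sustained around the clock and uses decimal units (1TB = 1,000,000MB). The article does not say whether the figure is per vehicle or aggregate, so the result is only as meaningful as that assumption. The function name `daily_volume_tb` is made up for illustration.

```python
# Back-of-the-envelope conversion of a sustained data rate into a daily volume.
MB_PER_SECOND = 350              # rate quoted in the article
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds


def daily_volume_tb(mb_per_second: float) -> float:
    """Return terabytes per day, assuming decimal units (1 TB = 1,000,000 MB)."""
    return mb_per_second * SECONDS_PER_DAY / 1_000_000


print(f"{daily_volume_tb(MB_PER_SECOND):.2f} TB per day")  # prints "30.24 TB per day"
```

At that rate, a single source running non-stop would produce roughly 30TB every day, which hints at how quickly unexamined data can pile up.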

Organisations can also generate significant internal data. For example, a recent study found that a company with 1,500 employees had around 2.5 million spreadsheets, each of which was used by only 12 people on average.

What’s more, there is evidence of a variety of unstructured data, such as document versions, project notes and emails, that is left behind from organisational processes and subsequently sits dormant in servers.

Use it or lose it

Lessons learnt from years of research in information system use have shown that the assumption that “more is better” when it comes to data is unfounded.

Even in traditional projects that follow carefully crafted analysis and design life cycles, the misalignment between perceived and actual value has been a notoriously difficult problem, often leading to poor returns on investment.

In big data projects, the data can often be externally sourced with little or no knowledge of its schemata, quality or expected utility. Thus the risk of making investments that will not deliver is greatly heightened.

The old adage of “use it or lose it” is by no means obsolete, and brings attention back to the purpose of how we use big data. Organisations may retain data for a variety of reasons, including retention regulations, but perceived future value is typically the main reason.

Although storage is relatively cheap, given the volume of data being assimilated, the maintenance and energy consumption of data centres is not trivial. Furthermore, there are costs and risks related to the security of such unmanaged data.

Thus defining the purpose is pivotal to ensure that big data investments are targeted towards meaningful problems, and that data collection and storage are well justified.

Approaches such as design thinking, which encourages people to use creative solution-focused thinking, are proving to be highly successful in genuine problem formulation for big data.

When appropriately applied, design thinking can equip data scientists to bring together desirability (customer need) and viability (business value) with technological feasibility, and thereby guide them towards developing meaningful solutions.

Garbage in, garbage out

When the gap between data creation and use becomes larger, it becomes more likely that data quality decreases. This means an organisation will have to put a lot of effort into cleaning old data if it wants to use it today.

According to the US Chief Data Scientist DJ Patil:

“Data is super messy, and data cleanup will always be literally 80% of the work. In other words, data is the problem.”

Earlier this year, a group of global thought leaders from the database research community outlined the grand challenges in getting value from big data. The key message was the need to develop the capacity to “understand how the quality of that data affects the quality of the insight we derive from it”.

The golden principle of “garbage in, garbage out” is still true in the context of big data. Without scientifically credible knowledge that provides the ability to efficiently evaluate the underlying quality characteristics of the data, there is a significant risk of organisations and governments accumulating large volumes of low value density data, or investing in low return-on-investment products.

Moreover, the lack of knowledge about the underlying data (distributions, semantics and other nuances) could result in analytical traps, where the analysis can lead to erroneous, and possibly dangerous, conclusions.

Data exploration is emerging as a promising approach to empower users with exploratory capabilities to investigate the quality of the data and gain awareness of the data’s shortcomings in terms of their intended use, and to do so before they invest in expensive data cleaning and related tasks.
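As a rough illustration of what such exploration might look like at its simplest, the sketch below profiles a single column for missing values, distinct values and mixed types before any cleaning is attempted. It is a minimal example under stated assumptions, not a method described in the article: the function names `is_missing` and `profile_column` and the sample values are all hypothetical.

```python
from collections import Counter


def is_missing(value) -> bool:
    """Treat None and blank strings as missing (an assumption for this sketch)."""
    return value is None or str(value).strip() == ""


def profile_column(values):
    """Summarise basic quality indicators for one column of raw, hashable values."""
    total = len(values)
    present = [v for v in values if not is_missing(v)]
    missing = total - len(present)
    return {
        "rows": total,
        "missing": missing,
        "missing_ratio": missing / total if total else 0.0,
        "distinct": len(set(present)),
        # Mixed types in one column often signal inconsistent data entry.
        "types": dict(Counter(type(v).__name__ for v in present)),
    }


# Hypothetical sample column, made up for illustration only.
ages = [34, None, "41", 29, "", 29, "unknown"]
report = profile_column(ages)
print(report)
```

Even a summary this crude shows that the column mixes numbers with free text and has gaps, which is the kind of shortcoming worth knowing about before committing to an expensive cleaning effort.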

The search for enlightenment from the data deluge will consume the energy and investments of the data-driven society in the foreseeable future. Whereas there is immense power in the scale of data, when left unattended it will propel organisations into the abyss of dark data.

All this underscores the growing need for well-trained data scientists who have the ability to articulate a well-justified business, scientific or other purpose and align it with the technological efforts for data collection, storage and analysis.



Shazia Sadiq, Professor, Data and Knowledge Engineering, The University of Queensland

This article was originally published on The Conversation.