Introducing data lakes

Bernard Marr
Last Updated : Jan 26 2015 | 12:10 AM IST
You have probably heard of data warehousing, but now there's a newer phrase doing the rounds, and it is one you're likely to hear more in the future if you're involved in big data: 'Data Lakes'.

So what is a data lake? Well, the best way to describe it is to compare it to data warehouses, because the difference is very much the same as between storing something in a warehouse and storing something in a lake.

In a warehouse, everything is archived and ordered in a defined way - the products are inside containers, the containers on shelves, the shelves are in rows, and so on. This is the way that data is stored in a traditional data warehouse. In a data lake, everything is just poured in, in an unstructured way. A molecule of water in the lake is equal to any other molecule and can be moved to any part of the lake where it will feel equally at home.

This means that data in a lake has a great deal of agility - another word which is becoming more frequently used these days - in that it can be configured or reconfigured as necessary, depending on the job you want to do with it.

A data lake contains data in its rawest form - fresh from capture, and unadulterated by processing or analysis.

It uses what is known as object-based storage, because each individual piece of data is treated as an object, made up of the information itself packaged together with its associated metadata, and a unique identifier.
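To make that idea concrete, here is a minimal sketch in Python of what "an object" means in this sense. It is an illustration only, not how any particular product works: the names `DataObject`, `FlatObjectStore`, `put`, `get` and `find` are hypothetical, and the store simply lives in memory.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """One item in the lake: raw payload, its metadata, and a unique ID."""
    payload: bytes
    metadata: dict
    # Hypothetical ID scheme: a random UUID rendered as 32 hex characters.
    object_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class FlatObjectStore:
    """A flat namespace: no folders or hierarchy, only ID -> object."""

    def __init__(self):
        self._objects = {}

    def put(self, payload, **metadata):
        """Store raw bytes with their metadata and return the new ID."""
        obj = DataObject(payload=payload, metadata=metadata)
        self._objects[obj.object_id] = obj
        return obj.object_id

    def get(self, object_id):
        """Fetch an object directly by its unique identifier."""
        return self._objects[object_id]

    def find(self, **criteria):
        """Return objects whose metadata matches every given key/value."""
        return [
            obj for obj in self._objects.values()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        ]


store = FlatObjectStore()
log_id = store.put(b"GET /index.html 200", source="web", kind="log")
store.put(b'{"sku": 42, "qty": 3}', source="shop", kind="order")

# Objects are located by ID or by metadata, never by their position in a tree.
web_objects = store.find(source="web")
```

The point of the sketch is that a web log line and a shop order sit side by side as equals; only the metadata tells them apart.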

No piece of information is "higher-level" than any other, because this is not a hierarchically archived system like a warehouse - it is basically a big free-for-all, like water molecules in a lake.

The term is thought to have first been used in 2011 by Pentaho CTO James Dixon, who didn't invent the concept but gave a name to the type of innovative data architecture being put to use by companies such as Google and Facebook.

It didn't take long for the term to be picked up by select companies - Hortonworks, for instance, includes it in the name of its service, Hortonworks Datalakes. It is a practice which is expected to become more popular in the future, as more organisations become aware of the increased agility afforded by storing data in data lakes rather than in strictly hierarchical databases. For example, the way that data is stored in a database (its "schema") is often defined in the early days of designing a data strategy, yet the needs and priorities of the organisation may well change as time goes on.

One way of thinking about it is that data stored without structure can be shaped into whatever form is needed more quickly than data whose previous structure first has to be disassembled before it can be reassembled.
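This approach is often described as applying the schema when the data is read rather than when it is written. A rough sketch of the idea in Python follows; the sample records, the field names and the `read_with_schema` helper are all made up for illustration, and a real lake would hold far messier data.

```python
import json

# Raw events kept exactly as captured: plain JSON text, no fixed schema.
raw_events = [
    '{"user": "ann", "amount": "19.99", "country": "UK", "ts": "2015-01-20"}',
    '{"user": "bob", "amount": "5.00", "country": "IN", "ts": "2015-01-21"}',
    '{"user": "ann", "amount": "7.50", "country": "UK", "ts": "2015-01-22"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time.

    `schema` maps each wanted field name to a converter function;
    fields not named in the schema are simply ignored.
    """
    for line in lines:
        record = json.loads(line)
        yield {name: convert(record[name]) for name, convert in schema.items()}


# Two teams shape the same raw data differently, without rewriting it.
finance_view = {"user": str, "amount": float}
marketing_view = {"country": str, "ts": str}

# Finance: total spend per user.
spend = {}
for row in read_with_schema(raw_events, finance_view):
    spend[row["user"]] = spend.get(row["user"], 0.0) + row["amount"]

# Marketing: which countries appear in the data.
countries = sorted(
    {row["country"] for row in read_with_schema(raw_events, marketing_view)}
)
```

If priorities change later, only the view definitions change; the stored records stay untouched.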

Another advantage is that the data is available to anyone in the organisation, and can be analysed and interrogated via different tools and interfaces as appropriate for each job.

It also means that all of an organisation's data is kept in one place - rather than having separate data stores for individual departments or applications, as is often the case.

This brings its own advantages and disadvantages - on the one hand, it makes auditing and compliance simpler, with only one store to manage. On the other, there are obvious security implications if you're keeping "all your eggs in one basket."

Data lakes are usually built within the Hadoop framework, as the datasets they are composed of are "big" and need the volume of storage offered by distributed systems. Much of this is still theoretical at the moment, because very few organisations are ready to make the move to keeping all of their data in a lake. Many are bogged down in a "data swamp" - a hard-to-navigate mishmash of land and water where their data has been stored in various, uncoordinated ways over the years.

And it has its critics of course - some say that the name itself is a problem (and I am inclined to agree) as it implies a lack of architectural awareness, when a more careful consideration of data architecture is what's really needed when designing new solutions.

The author is Bernard Marr, a big data expert. Reprinted with permission.
Link: https://www.linkedin.com/pulse/big-data-what-heck-lakes-bernard-marr?trk=prof-post

First Published: Jan 26 2015 | 12:10 AM IST
