SAND

Big Data: Where Hadoop fits

Posted on August 18, 2011, in Steven Green

It seems that any time anyone mentions Big Data on the web these days, the conversation inevitably turns to Hadoop. And why not? There’s a lot of [elephant poop](http://www.sand.com/hadoop-elephant-room/) out there, and while the [philosophy and approach can be argued](http://www.sand.com/hadoop-revisited/), it’s important to remember that the reason you have so many choices is that [you need the right database for the right job](http://www.sand.com/tool-job/).

In that context, let’s take a look at where Hadoop fits.

Hadoop, which at its core consists of the Hadoop Distributed File System (HDFS) and MapReduce, is designed to handle huge volumes of data across a large number of nodes. At a high level, Hadoop uses parallel processing across many commodity servers to respond to client applications. The key difference from traditional parallel computing is that it parallelizes not only the computation but also the data access: each node works on the portion of the data stored on its own disks.
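To make the MapReduce idea concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop's API; the function names and the sample `chunks` are invented for illustration, and each string simply stands in for a block of a large file that would live on a different node.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit (word, 1) pairs for one chunk, as a mapper would for its local block."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_outputs):
    """Group all emitted values by key across every mapper's output."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical data: each string stands in for a block stored on a different node.
chunks = ["big data big files", "big clusters", "data nodes hold data"]

# On a real cluster the map calls would run in parallel, one per block, on the
# node that holds that block. Here they run one after another.
mapped = [map_phase(chunk) for chunk in chunks]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

The point of the sketch is the shape of the work: the map step touches only its own chunk, so it can run wherever that chunk is stored, and only the much smaller intermediate pairs need to move between machines.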

This all sounds great, but in reality Hadoop is designed for large files, not large quantities of small files. If you have millions of 50 KB documents, you are outside Hadoop’s sweet spot.

Likewise, Hadoop stores its data on hard disks spread across its many nodes. This is the opposite of the common practice of storing data on one or a few file servers, a NAS, or a SAN. So if you already have big data, moving to a Hadoop system will require time and resources to re-architect your storage.

Even though Hadoop spreads work across many servers, each one requires a significant amount of memory (more than your typical desktop). The name node, which keeps the file system metadata in memory, is the critical case: if it runs out of memory, you are looking at a crash.
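A rough back-of-the-envelope calculation shows how the small-file problem and name node memory are connected. The sketch below rests on two assumptions that are not from this article: a commonly quoted rule of thumb of about 150 bytes of name node heap per file or block object, and a 64 MB block size. The function names are invented for illustration, and real memory use will vary by version and configuration.

```python
# Assumption: rule-of-thumb heap cost per file or block object on the name node.
BYTES_PER_OBJECT = 150
# Assumption: HDFS block size of 64 MB.
BLOCK_SIZE = 64 * 1024 * 1024

def namenode_objects(num_files, file_size_bytes):
    """Count metadata objects: one per file plus one per block of each file."""
    blocks_per_file = max(1, -(-file_size_bytes // BLOCK_SIZE))  # ceiling division
    return num_files * (1 + blocks_per_file)

def namenode_bytes(num_files, file_size_bytes):
    """Estimate name node heap consumed by the metadata for these files."""
    return namenode_objects(num_files, file_size_bytes) * BYTES_PER_OBJECT

# Ten million 50 KB documents (about 477 GiB of data in total).
small = namenode_bytes(10_000_000, 50 * 1024)

# Roughly the same volume of data packed into 477 files of 1 GiB each.
large = namenode_bytes(477, 1024 ** 3)

print(f"small files: {small / 1e9:.1f} GB of name node heap")
print(f"large files: {large / 1e6:.1f} MB of name node heap")
```

Under these assumptions the same volume of data costs the name node gigabytes of heap when stored as small documents and about a megabyte when stored as large files, which is why the number of files matters as much as the total size.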

Also, at the moment Hadoop is open source, and that means you save money at the expense of time. It is a developer tool that requires client-side development, and it is still growing and adapting, so it requires some patience.

Now let’s look at what Big Data really is.

The volume of data is growing by leaps and bounds from many sources, such as social media, location data, loyalty information, operations and supply chain, but the type of information is also an issue. It may be structured, semi-structured or unstructured. Making sense of this data and gaining knowledge from it to achieve a competitive advantage should be the driving goal.

So if you have Big Data and need to search and sort through the bulk of that data, then Hadoop may serve your purpose.

If the majority of the data is structured, or it is unstructured but you are able to add structured meta-data describing the unstructured portion, and you want to run standard reports on the structured portion or retrieve individual unstructured elements (such as a single PDF document), then standard databases may suit your needs.

If you have structured data, semi-structured data, or unstructured data with structured meta-data, and you want to run complex analyses to predict or to ask questions outside of the standard reports, questions which cannot be prepared in advance (i.e. the types of queries most valuable to real Business Intelligence), then you probably need a column-based data store.

So while Hadoop is trendy at the moment, most vendors who have been around for a while are adjusting to handle Big Data, or have been [able to manage the large volumes for years](http://www.sand.com). The market is only now noticing its need. Rather than letting people guide the data, let the data empower the people to guide the business.
