SAND

Hadoop redux

Posted on July 22, 2011, in Mike Pilcher

I recently wondered aloud about [Hadoop and its place in the modern enterprise](http://www.sand.com/hadoop-elephant-room/). I received a lot of feedback on that post, and it brought a few other questions to mind.

Hadoop is an Apache project based on some papers Google released on their MapReduce and Google File System (GFS) technologies. Google is an advertising company that attracts eyeballs through search and other services like YouTube. Apache is best known for its eponymous web server application. Neither of them is an analytic database specialist.

Hadoop was developed for sifting data for enterprises whose business was sifting data. To use a sporting analogy, most enterprise data is the equivalent of not knowing which team you will play for, or where the ball is going to go, or the result at the end of the match. Hadoop doesn’t need or want to know any of that. It doesn’t even care if you’re playing football or baseball… or swimming.

If we look at the technological underpinnings of Hadoop, it’s beautifully designed for businesses that didn’t know their direction, whose task at hand changed every day, and who were growing so fast that they were regularly the subject of case studies. This resulted in building an infrastructure that by its nature was akin to painting the Golden Gate Bridge: when you were finished, you walked back to the other end of the bridge and started again. With Hadoop, you simply kept stacking new hardware on top of old hardware without ever stopping. As the bits at the bottom were crushed by the weight of the technology at the top, you simply plugged new ones into the top. Again, a beautiful architecture if you have one thing to do, and do it better than anyone.

A consumer search experience has to be the least sticky experience out there. Switching costs are zero, and if I get a bad result or a slow search there’s a Mike-shaped hole in the wall, as I can’t get out of that search engine quick enough. Don’t believe me? When was the last time you used AltaVista?

Hadoop was developed by businesses that have a lot more in common with software companies than with traditional enterprises. They take algorithms and expose them to the market; that’s what software companies do. That means a lot of their staff are software developers. You can’t read even the most supportive Hadoop blogs without reading about the requirement to do a lot of coding to get it to work. Over the last 30 years or so, software companies have grown and prospered precisely because enterprises want to get out of the business of developing software and treat it like a commodity. If you need developers to make it work, is it a commodity?

Hadoop was designed to take workloads and set a task, and chug away at it, and then give answers when it is done with the task. There’s no concept of interactivity. It is a batch job with a single focus. That makes Hadoop perfectly suited for transforming data… data you know nothing about.
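To make the batch point concrete, here is a minimal sketch of the map, shuffle, and reduce pattern in plain Python. It is not Hadoop’s API and it runs on a single machine; the function names and the sample lines are assumptions made up purely for illustration. What it does show is the shape of the work: you hand over the whole input, the job chugs through every phase, and only then do you get an answer.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group the emitted values by key, between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def run_batch_job(lines):
    """Run the whole job end to end; nothing is returned until it completes."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical input, standing in for data you know nothing about.
    sample = ["the elephant in the room", "the room is full"]
    print(run_batch_job(sample))
```

Note that the code never asks what the data means. It simply counts, which is exactly the kind of single-focus transformation described above.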

In summary, if you have a lot of commodity hardware sitting around unused, if you have or want to have a lot of software developers on staff, if you know you don’t have any idea as to what data you are looking for and all you are looking for is “rank”, Hadoop is likely the best solution. If not, well then it may be worth looking more closely at the task at hand.

What most of SAND’s customers need is to analyse vast amounts of data: structured, semi-structured, and unstructured. They want immediate response times. They know which type of business they are. They have very complex analytics that need to be done. And those analytics need to be deployed across the enterprise to all users.

A lot of vendors are putting Hadoop’s little elephant all over their collateral and websites in an apparent attempt to look cool (like my Dad rocking a fauxhawk with his pants hanging halfway down the back of his legs). Hadoop is a useful collection of routines for crunching large amounts of unstructured (don’t get me started on the topic of whether there actually is such a thing as unstructured data – all data has some structure), semi-structured, or structured data. When you have a swathe of underused commodity hardware, when you have the kind of in-house software development skills that usually reside in a software company, and you don’t mind how long it takes to get an answer, Hadoop’s a great solution.

Otherwise, it’s just another solution looking for a problem.
