Thursday, August 13, 2009

Thoughts on HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

Authors: Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz (Yale University), Alexander Rasin (Brown University)

Venue: VLDB 2009

Paper: http://db.cs.yale.edu/hadoopdb/hadoopdb.pdf

Summary:

The paper starts by saying that DBs are becoming increasingly important and data is growing at a huge rate. Unfortunately, the usual parallel DBMSs do not scale well because they cannot handle failures very well. On the other hand, MapReduce handles failures and stragglers (heterogeneous systems) well and scales well, but does not perform as well as DBMSs.

The paper describes a system that has roughly three layers. The top layer is a SQL layer which accepts SQL queries from users. The SQL queries are translated into MapReduce jobs. The MapReduce jobs are executed in the middle layer, Hadoop. The job is divided into several tasks. Each task itself is a smaller SQL query that is executed on a single node. Each node independently runs the third layer, a single-node DBMS (PostgreSQL in this case). It uses Hadoop for communication, Hive for the translation, and PostgreSQL for the DB.

HadoopDB consists of the following main components:

1. The database connector: provides an interface to independent databases so that worker nodes (TaskTrackers) can get results of SQL queries in MapReduce format (i.e. key-value pairs).

2. Catalog: maintains metainformation about the third-layer DBs---connections parameters (e.g. DB location, driver class, credentials) and metadata (data sets, partitions, replica locations, …)

3. Data Loader: Partitions data and stores it in the local DBs.

4. SQL to MapReduce to SQL (SMS) Planner: Extends Hive which translates SQL queries to MapReduce (Hadoop) jobs by optimizing the output to push as much of the work onto the local DB instances in the worker nodes since the single node DBMSs can do much better than MapReduce.

In the eval section, they compare Vertica (a column store DBMS), DBMS-X, HadoopDB, and Hadoop. They tested the systems without failures and with failures. I didn’t quite understand all the differences in workloads, but I did notice that Vertica was much faster than any of the other systems. The authors claim that HadoopDB performs almost as well as the other systems they tested and almost always better than Hadoop. They say that the performance differences are mainly because of the lowest PostgreSQL layer. This seems plausible, but it would have been a much more compelling eval if they tested comparable systems (example use compression on PostgreSQL and/or a column store).

When faults and slow nodes were introduced, Vertica performed much worse: 50% performance hit with failures and 170% with slow nodes. HadoopDB on the other hand was only 30% slower. They then say that this slow-down is a minor bug that can be easily fixed so that their performance matches Hadoop’s (which is better than HadoopDB with slowdown).

The Good:

The paper is interesting and reads well. The system itself is also very smart, and it’s a wonder no one has built it before. I would really like to see what ends up happening with it and how it ends up performing in the real world.

The Bad:

It would have been much more insightful if the used the same DB bottom layer in the eval, or at least eval different ones. It would have been compelling to see how Hadoop can parallelize DBMS. It would have been nice too if they could have put Vertica there to show how HadoopDB can fix the issues with failures. Also, I didn’t think the queries the used were that complex, but I don’t know whether this is of any importance.

No comments:

Post a Comment