Thursday, March 4, 2010

Thoughts on Loose Source Routing as a Mechanism for Traffic Policies

Authors: Katerina Argyraki, David R. Cheriton

Venue: Proceedings of the ACM SIGCOMM workshop on Future directions in network architecture


Summary:


Motivation:

  • Senders want to choose paths so they can route around failures/congestion and use "good" paths for some definition of good. These are transmit policies. **JN: Don't they care more about paths to them though?
  • Receivers want to filter incoming packets to mitigate DDoS attacks. These are receive policies.
  • Today, senders and receivers can only control the first/last hop, which is insufficient.
Mechanism:
  • Packet has WRAP shim layer between IP and payload.
  • WRAP header has two paths: forward and reverse.
  • Sender specifies the path through trusted relays and puts their IP addresses in the forward path. Reverse path is empty.
  • The sender also sets the IP dst field to the first relay's IP address and sets the src field to its IP address.
  • At each relay, the first hop is popped from the forward path and used as the IP dst address, and the outgoing interface's IP addr is used as the src address. The old src address is put in the reverse path.
  • This continues until destination is reached.
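The per-hop rewriting above can be sketched in a few lines. This is a minimal illustration with made-up field names (the paper's actual WRAP header layout is not reproduced here), assuming paths are simple lists of relay IP addresses:

```python
# Sketch of WRAP relay processing (field names are illustrative, not from the
# paper). Each relay pops the next hop from the forward path, records the
# previous IP source on the reverse path, and rewrites the IP src/dst fields.

def relay_hop(pkt, relay_out_ip):
    """Process one WRAP packet at a relay; returns the rewritten packet."""
    next_hop = pkt["forward_path"].pop(0)      # next relay, or final dst
    pkt["reverse_path"].append(pkt["ip_src"])  # record where it came from
    pkt["ip_src"] = relay_out_ip               # outgoing interface's address
    pkt["ip_dst"] = next_hop
    return pkt

# Sender S routes to destination D via relays R1, R2:
pkt = {
    "ip_src": "10.0.0.1",                           # S
    "ip_dst": "192.0.2.1",                          # first relay R1
    "forward_path": ["192.0.2.2", "198.51.100.9"],  # R2, then D
    "reverse_path": [],
}
pkt = relay_hop(pkt, "192.0.2.1")   # at R1
pkt = relay_hop(pkt, "192.0.2.2")   # at R2
# Now ip_dst is D, and reverse_path records the recorded route back to S.
```

Note how the receiver ends up with a reverse path built hop by hop rather than a reversed copy of the sender's forward path.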
The authors claim this is better than IP's LSRR option because:
  1. They claim it is easier to do in hardware ** JN: Maybe only marginally if anything.
  2. It does not go into the IP options field, which usually causes packets to take the router's slow path. They claim the issue is that the IP options field is variable length. Conventional wire-speed filters can be used to filter WRAP packets. ** JN: I'm guessing not based on the WRAP header though.
  3. The src IP address is not that of the original source IP but that of previous relay, and the receiver uses the recorded reverse path rather than reversing the forward path as is done in traditional LSRR. They claim this makes it harder to hijack communication. **JN: I might be missing something here, but this claim seems totally bogus. A bad guy can make a packet look whichever way it wants and just insert its address in the reverse path. I don't see how WRAP helps at all.
For transmit policies, either the provider or the end-host can choose paths. Obviously. They say that they can use mechanisms like FBR (Feedback-based Routing).

For receive policies, they need an additional mechanism which is Active Internet Traffic Filtering (AITF) which becomes more accurate using WRAP.

Other approaches like MPLS and DiffServ end up becoming something like virtual circuits (Stealth Virtual Circuits) and go against the motivation for choosing datagrams over a connection-oriented protocol for IP.

(See comments below by Katerina for rebuttals and clarifications)

Friday, December 11, 2009

Thoughts on Automatic Patch-Based Exploit Generation is Possible: Techniques and Implications

Authors: David Brumley (CMU), Pongsin Poosankam (CMU), Dawn Song(UC Berkeley), Jiang Zheng (U. Pittsburgh)

Venue: 2008 IEEE Symposium on Security and Privacy

Summary:
The work describes how to generate exploits for an unpatched binary program P given a patched version P'. This is important because it highlights the fact that the patching process itself is important. Staggering updates over a long period of time can lead to exploits that were previously unknown and that can be generated in minutes.

The work, however, only targets input validation checks. So, for example, if a buffer overflow was fixed by checking the input size or the pointer, the system the authors built works. If, on the other hand, it was fixed by increasing the buffer size, the system fails.

The authors generate exploits (user inputs to a program that compromises its safety) in four main steps:
  1. "Identify the new sanitization checks added in P'." This is done using off-the-shelf binary differencing tools like EBDS. EBDS does syntactic analysis, but not semantic. So if a check in P is i > 10 while in P' it's i - 1 > 10 the latter might be reported as a new check by EBDS. The verification step later on weeds out these semantic differences.
  2. For each new check found, generate a candidate exploit which fails the new check in P' by:
    1. Calculating the boolean function F that specifies the weakest constraints on the inputs that would fail these checks. This is done by basically trying to find which instructions in the program to include in the formula. They use three approaches:
      1. Dynamic: If we know the constraints on the inputs that exercise the check, they simply specify the constraints that lead to the new check and add the constraint that the input must fail the new check. This is the case if the new check is on a common path executed for all inputs. This is the quickest method, and generates the easiest constraint functions to solve. In practice, this is done by analyzing an execution trace, lifting the executed instructions into the Vine modeling language for easier analysis, then doing the analysis.
      2. Static: (See Creating vulnerability signatures using weakest preconditions) In this approach, the whole program is lifted to the Vine modeling language first. Then, the program's control flow graph is analyzed and the program is chopped so that it only includes paths that lead to the new checks. The constraint function is then basically the whole CFG. This approach is more thorough and will generate exploits that might require more searching in the dynamic approach, but the generated functions are usually much larger and more difficult to solve.
      3. Hybrid: Use the dynamic approach to reach a junction point in the execution closer to the new check, then use the static approach between that point and the new check. This approach creates smaller constraint functions so they are easier to solve but also allows wider coverage to find exploits.
        They automate the process by iterating through junction points until an exploit is found, where the heuristic they use for these points is to place them at procedure boundaries. While the past two approaches have been done by others before, the authors claim the hybrid approach as a novel contribution.
    2. Solve the constraint function F to find the candidate exploits using an off-the-shelf solver.
  3. Check that the candidate exploit is a real exploit by checking that it compromises the program's "safety policy", where "a safety policy" is a function that maps the program's state space to a boolean value of safe or unsafe. Examples include type checking, buffer overflows, ... In the paper, the authors use policies that are enforceable by their execution monitor (dynamic taint analysis to detect information leaks and memory safety issues, dynamic type checking, return address integrity, ...)
  4. Generate additional variants x' by solving the new constraint function F'(x') = F(x') and x' != x

If the solver fails to find an input, then the path described by the constraint function does not have an exploit. If it takes too much time, then the authors set a timeout so that the system moves on to check a new path by, for example, changing the mix (junction) point in the hybrid approach.
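The candidate-generation step can be illustrated with a toy example (everything here is an assumed stand-in, not from the paper, and the real system hands the constraint function to a decision-procedure solver rather than enumerating inputs): suppose the patch P' adds the sanitization check `len(input) <= 16` on a path guarded by a magic first byte. The constraint function F says "the input reaches the new check and fails it":

```python
# Toy illustration of candidate-exploit generation (assumed example, not from
# the paper). P' adds the check `len(inp) <= 16` before a copy; F(x) encodes
# "x reaches the check and fails it". A real system solves F symbolically;
# here we just enumerate a tiny input space.

BUF_SIZE = 16

def reaches_check(inp):
    # path condition: only inputs starting with a magic byte hit the copy
    return inp[:1] == b"A"

def fails_new_check(inp):
    return len(inp) > BUF_SIZE      # the sanitization check added in P'

def F(inp):
    return reaches_check(inp) and fails_new_check(inp)

# "Solve" F by searching candidate inputs; each satisfying input is only a
# candidate exploit and must still be verified against the safety policy.
candidates = [b"A" * n for n in range(1, 32)]
exploits = [x for x in candidates if F(x)]
assert all(len(x) > BUF_SIZE for x in exploits)
```

The verification step (step 3) then runs each candidate under the execution monitor, which weeds out inputs that fail the check but do not actually violate the safety policy.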

For evaluation, the authors ran their system on 5 MS updates. Some exploits worked with dynamic analysis, others with static analysis, and others required the hybrid approach. Exploit generation took from tens of seconds to less than 3 minutes, and most exploits resulted in control hijacking.

The techniques seem to also be useful in other applications such as automatic deviation detection to check whether two programs behave the same, or, more concretely, to check if compiler optimizations affected the behavior of a program. But this doesn't seem to me to be easily attainable since the differences between two programs might be large enough that the techniques don't work.

The solutions they propose to the automatic patch-based exploit generation problem are the obvious ones: obfuscation, patch encryption so that everyone can apply the patch simultaneously, and fast patch distribution using P2P. They mention that off-line systems would stay vulnerable, but this is the same issue with the current patch distribution architecture.

The Likes:
  • The paper presents a cool attack
  • It works (at least for the cases shown)!

The Dislikes:
  • There was no real quantitative estimate of how important this is in practice or how well the system will fare in general. What is the percentage of bugs it does work on and through which a bad guy can cause real damage?
  • There was too much notation and repetition
  • I did not understand anything on dynamic analysis. The math notation looked horrible. I'm sure the ideas are probably not too difficult to understand, but I didn't want to wade through it with no helping hand (i.e. real explanation).
  • This is not quite about the paper, but there are no satisfactory solutions to the problem they raised yet. It would've been nice if they were able to describe one, but that's not the goal here.
  • I wish there was a bit more education on the technical background.

Thursday, December 3, 2009

Thoughts on Dispatcher: Enabling Active Botnet Infiltration using Automatic Protocol Reverse-Engineering

Authors: Juan Caballero, Pongsin Poosankam, Christian Kreibich, Dawn Song

Venue: CCS 2009

Summary:
The paper describes a tool that can be used to automatically reverse engineer the format of an unknown protocol. The tool can deduce the format and some of the semantics of the fields in both sent and received protocol messages. It can do this reverse engineering without requiring the binaries of both sides of a protocol, which makes it useful in situations where only the client is available (such as a bot binary) or only the server.

The ideas used are quite clever and simple. The main idea is to look at each byte in the buffer that constitutes a sent or received message, and then create a dependency chain for that byte. The dependency chain is the sequence of instructions and data sources used to write the byte. The dependency chain ends when a source for an instruction is an immediate, a constant, another memory location, or unknown (such as an arithmetic op). Then, by comparing the ultimate sources of contiguous destination bytes and seeing if they are derived from the same contiguous set of source bytes, they can tell the position and length of the buffers that form the fields in a message. They do this hierarchically in the buffer so they can also extract the hierarchy of fields.
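The field-boundary part of this idea can be sketched simply. Assume (hypothetically, as a simplification of the dependency chains) that each output byte has been resolved to a single ultimate source offset in the input; a field boundary falls wherever the source offsets stop being contiguous:

```python
# Sketch of buffer-structure inference (simplified, hypothetical
# representation): dep[i] is the input offset that output byte i was
# ultimately derived from, or None if the source is unknown. Contiguous
# output bytes derived from one contiguous source range become one field.

def infer_fields(dep):
    """Return a list of (offset, length) fields in the output buffer."""
    fields, start = [], 0
    for i in range(1, len(dep) + 1):
        # close the current field when the sources stop being contiguous
        if i == len(dep) or dep[i] is None or dep[i - 1] is None \
                or dep[i] != dep[i - 1] + 1:
            fields.append((start, i - start))
            start = i
    return fields

# Example: output bytes 0-3 copied from input offsets 10..13 (one field),
# bytes 4-5 from offsets 20..21 (a second field).
dep = [10, 11, 12, 13, 20, 21]
print(infer_fields(dep))  # [(0, 4), (4, 2)]
```

Running the same grouping at coarser granularities is what lets the tool recover a hierarchy of fields rather than a flat list.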

They can infer some semantics by looking at where these fields are used. If they are used as arguments to or if they are return values of known functions, they can immediately get the type, and some semantics such as an IP address when used in the connect function.

Furthermore, they are able to infer loops using static and dynamic analysis, and from those infer fields that indicate offsets and lengths. By looking at repeated bytes, they are able to infer delimiters. Arrays are inferred by assuming that they are usually written inside loops.

To handle encrypted messages, they use the intuition behind other work, ReFormat by Z. Wang et al. The idea is that functions with a disproportionate ratio of arithmetic instructions to other instruction types are probably cryptographic. They use this idea to infer which buffers are unencrypted and operate on them. They used a minimum of 20 instructions for a function to be considered and a ratio of 0.55, and were able to get a false positive rate of 0.002%.
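The heuristic is simple enough to sketch directly. The 20-instruction minimum and 0.55 ratio are the thresholds reported above; the instruction classification below is illustrative:

```python
# Sketch of the ReFormat-style heuristic: a function is flagged as likely
# cryptographic if it executed at least MIN_INSNS instructions and the
# fraction of arithmetic/bitwise instructions exceeds RATIO. Thresholds are
# the ones reported in the summary; the mnemonic set is illustrative.

MIN_INSNS = 20
RATIO = 0.55
ARITH = {"add", "sub", "mul", "xor", "shl", "shr", "rol", "ror", "and", "or"}

def looks_cryptographic(trace):
    """trace: list of executed mnemonics for one function invocation."""
    if len(trace) < MIN_INSNS:
        return False
    arith = sum(1 for insn in trace if insn in ARITH)
    return arith / len(trace) > RATIO

# A tight xor/rotate loop trips the detector; a mov-heavy copy loop does not.
cipher_loop = ["xor", "rol", "add", "xor"] * 6          # 24 insns, all arith
memcpy_loop = ["mov", "mov", "add", "cmp", "jne"] * 5   # mostly data movement
print(looks_cryptographic(cipher_loop))  # True
print(looks_cryptographic(memcpy_loop))  # False
```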

For evaluation, the authors studied the MegaD command-and-control (C&C) protocol. They were able to infer the message formats used by MegaD and to rewrite a message indicating SMTP sending capabilities. They also analyzed other open, known protocols and compared Dispatcher's results with those from Wireshark. They say that Dispatcher outperforms Wireshark.

In terms of resiliency, the authors claim that they target fundamental properties to avoid becoming obsolete if malware adapts. However, they do confess that if malware can detect that it is being run in a virtual environment, then their technique can be circumvented.

The Likes:
  • The paper is very well written and clear.
  • The ideas are nice and clever, and they seem to work well.
  • They do not need both sides of a protocol
  • Can be used to infer correctness of messages when a protocol is known
  • Can make writing a Wireshark dissector easier
  • Might be used to send erroneous information to botnet masters (I don't know how useful this is. If you know that a machine is already compromised, you should fix the situation. You cannot take over the botnet.)
The Dislikes:
  • Can be circumvented by detecting a virtualized environment. I don't know how easy this is, but for serious malware, this might not be a problem.
  • The authors claim that they target fundamental properties, but provide no evidence or proof. It is not clear that the properties they target are really fundamental. For example, they use some known functions to infer semantics. The bad guy might ship his own libraries so as to restrict the API used. The thresholds used to detect encryption and decryption are empirical, and seem easy to obfuscate. Adaptive malware might be able to change keyword values or restructure the program to make it more difficult to infer which functions are doing what. The program might be written in a difficult to analyze way where fields of the messages are always results of arithmetic computations. The program might use complex loops with jumps and gotos that make analysis difficult.
  • They claim they do better than Wireshark, but it's not clear why they made that conclusion. The dissectors made different assumptions about what delimiters are, but it does not mean they are incorrect (or maybe I just did not understand what they meant).

Friday, October 16, 2009

Thought on Making Information Flow Explicit in HiStar

Authors: Nickolai Zeldovich, Silas Boyd-Wickizer, Eddie Kohler, David Mazieres

Venue: SOSP 2006

Paper: PDF

Summary:

The paper describes, in painful detail, the HiStar operating system and the implementation of two applications on it. HiStar uses distributed information flow control to provide a platform to run untrusted code on sensitive data and provide guarantees for data leakage and integrity.

Motivation:
  • Hard to reason about implications of every action by untrusted
    code
  • chroot can break application assumptions
  • syscall interposition error-prone
  • Easier to specify policy in terms of where to where information can flow
  • Unlike other IFC systems, HiStar implements standard OS abstractions on top of lower level building blocks that obey IFC
  • Taint tracking mechanism does not leak info itself
The HiStar DIFC model:
There are six kernel object types in HiStar. Of these, threads are the only ones that execute code. Under Distributed Information Flow Control (DIFC), each kernel object has a label. The label describes how it is tainted. An object can be tainted in different sets of categories with different degrees. The degree of taint in a particular category places a restriction on the object. In HiStar, there are 4 levels of taint, numbered 0 through 3 (these are the degrees that appear in labels):
  0. Cannot be written/modified (lowest taint)
  1. No restriction (the default)
  2. Cannot be untainted/exported
  3. Cannot be read/observed (highest taint)
Information from one object A can only flow to another object B iff B is at least as tainted as A in every category. This is the flow to relationship. A restriction imposed by a particular category can be bypassed by objects that own that category. Ownership of a category is specified as an additional degree «. Categories are only created by threads. A thread that creates a new category is the only owner of that category and can then grant ownership to another thread or to a gate (another kernel object).

A label looks like {a0, b2, c«, 1}, which means that the object is tainted in category a with degree 0, b with degree 2, c with degree « (i.e., the object owns category c), and in all other categories with degree 1 (i.e., untainted in other categories). The labels on all objects, except for threads, are immutable and have to be specified when the object is created. Objects are created only by threads.

Threads can read restricted objects by raising their label. Threads, however, can only raise their labels to the limit imposed by their clearance label. The clearance label is an additional label on threads, that also places an upper limit on the label a thread can request for a newly created object. A created object’s label must flow to the thread’s label which in turn must flow to the thread’s clearance (bypassing owned categories).
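The flows-to check itself reduces to per-category integer comparisons. A minimal sketch, modeling labels as dicts with a default level and the ownership degree « as the string "own" (this representation is mine, not HiStar's):

```python
# Sketch of HiStar's "flows to" check. Labels are dicts mapping category ->
# level, with a "default" level for unnamed categories; ownership (« in the
# text) is modeled as the string "own". Information may flow from A to B iff,
# for every category neither side owns, B's taint level is at least A's.

def level(label, cat):
    return label.get(cat, label["default"])

def flows_to(a, b):
    cats = (set(a) | set(b)) - {"default"}
    for c in cats:
        la, lb = level(a, c), level(b, c)
        if la == "own" or lb == "own":   # ownership bypasses the category
            continue
        if lb < la:
            return False
    return True

# The label {a0, b2, c«, 1} from the text:
thread = {"a": 0, "b": 2, "c": "own", "default": 1}
secret = {"b": 3, "default": 1}   # b3: cannot be read by the thread
print(flows_to(secret, thread))   # False: thread is only tainted b2
print(flows_to(thread, secret))   # True: every level is covered
```

This is also why the degrees mapping onto plain integers is convenient: the whole check is simple integer comparisons, as the summary notes later.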

Kernel:
HiStar uses a single-level store. All objects are stored on disk and, on bootup, the entire system state is reloaded from the latest snapshot. All objects are stored in a container hierarchy with quotas.

There are six kernel object types in HiStar: segments, threads, address spaces, containers, gates, and devices.
  • Segments: segments are the simplest object type in HiStar. They are basically arrays of bytes and are analogous to other OSes’ files.
  • Threads: threads execute code. They can create categories and objects, and change their own label and clearance as long as the new label or clearance can flow to the current clearance ignoring owned categories.
  • Containers: All objects are stored in containers that are organized in a hierarchy rooted at the root container, with quotas. Objects are hard-linked into containers, and when they are disconnected from the root, they are garbage-collected. The existence of hard links can leak information, so objects are always specified as a (Container ID, Object ID) pair. To see whether a hard link exists, a thread needs to be able to read the container. Quotas on objects do not change so as not to convey information.
  • Address Spaces: Each thread runs in an address space. An address space is a mapping from virtual addresses to containers, the segments within them, and permissions. Address spaces specify the code to run and the virtual memory layout. They are created by a thread and used when launching a new thread. They are consulted on page faults to determine whether a thread has permission to read a segment and where that segment should be mapped.
  • Gates: gates “provide protected control transfer, allowing a thread to jump to a pre-defined entry point in another address space with additional privileges”. Basically, they are very similar to RPCs except that the calling thread provides the resources for execution. A thread can create a gate and give it some label (respecting DIFC). Another thread can invoke the gate and request a label and clearance. The requested label and clearance are a combination of the invoking thread’s label and clearance and the gate’s label and clearance. The thread then starts executing in the new address space (i.e. executing some other code or function) using the requested label and clearance. Return values can be stored inside the thread object itself, but there is no implicit return mechanism. The invoking thread needs to create a return gate and label appropriately (the discussion on this is quite interesting).
  • Devices: Can be mounted…
User-Level Libraries:
The authors implemented a Unix-like environment using the HiStar kernel completely in userspace, and so they all respect IFC. The filesystem is implemented using segments and containers. A directory is a container with a special directory segment that contains a map of names to object IDs and a mutex to synchronize directory operations. To get a consistent view of the directory entries without acquiring the mutex (which needs write permissions), a user atomically reads a generation number and busy flag before and after reading an entry.
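The lock-free directory read is a seqlock-style pattern, sketched below with illustrative names (the real directory segment layout is HiStar's, not this): a writer bumps a generation number and sets a busy flag around each mutation, and a reader retries until it observes a stable, non-busy generation on both sides of its read.

```python
# Sketch of the consistent lock-free directory read described above (a
# seqlock-style pattern; class and field names are illustrative). Readers
# need no write permission on the mutex to get a consistent snapshot.

class DirSegment:
    def __init__(self):
        self.gen, self.busy = 0, False
        self.entries = {}

    def write(self, name, obj_id):        # writer holds the directory mutex
        self.busy = True
        self.gen += 1
        self.entries[name] = obj_id
        self.busy = False

    def read(self, name):                 # reader: retry until stable
        while True:
            g1, b1 = self.gen, self.busy
            val = self.entries.get(name)
            g2, b2 = self.gen, self.busy
            if not b1 and not b2 and g1 == g2:
                return val                # snapshot was consistent

d = DirSegment()
d.write("passwd", 42)
print(d.read("passwd"))  # 42
```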
Processes in HiStar are containers with two process-specific categories that protect their secrecy and integrity. The process has two containers, a process container and an internal container. The process container exposes objects that are the external interface to the process. The internal container has segments used to implement the file descriptors, stack, heap, and executable code. It also has the thread that executes the code and its address space.

A user has a pair of unique categories that define her read/write privileges. HiStar does not support ACLs.

HiStar also supports IPC using gates, signals, and networking using lwIP. The network device has a label {nr3, nw0, i2, 1} to taint all data read from the network.

Unix leaks information for process exit, quota adjustment, and file creation. For these operations, HiStar uses untainting gates that explicitly allow information to leak.

Applications:
The authors used ClamAV as the motivating example and split it up so that the scanner part cannot export user data.

They also implemented an untrusted user authentication mechanism that cannot leak more than one bit of the user’s password. The discussion for this application is interesting, but too complicated to explain in a summary. The main idea is to split up the login mechanism into a directory and gates and use each gate with different privileges to get the information for authentication, check the password, and get the user’s privileges.

Another application was VPN, where another lwIP stack was run in parallel to the network with a different label so that public Internet data and VPN data do not mix (at least not directly).

Performance:
HiStar's performance is comparable to Linux and OpenBSD, sometimes faster, sometimes slower. It is slower in the fork and exec microbenchmarks. However, it does allow new consistency primitives such as group sync, where no changes are recorded until an application is done executing.

Limitations:
Some useful features are missing from the Unix environment, while the semantics of others are modified. A crash or killed process can leave mutexes locked, and there needs to be support for cleaning those up. There are no CPU quotas and HiStar does not solve timing covert channels.

The Likes:
  • The paper was thorough in the description and did not leave questions
    unanswered.
  • DIFC seems very useful, and the OS seems to implement it well.
  • It is nice to build a layered system basing the design on a few basic blocks that enforce security and be able to reason about the rest of the system from those.
  • The OS has a load of cool and smart things whose composition was carefully designed to build useful libraries.
  • It works!
  • For non-malicious code, HiStar seems to be able to protect against bugs.
  • It’s cool that the taint degrees work out to map into numbers that can be compared using simple integer comparisons.
The Dislikes:
  • The paper was very dense, and takes quite a bit of effort to read. It would’ve been nice if the authors could have focused on the important points and insights instead of just spewing out information.
  • The userspace environment is not completely compatible with Unix, but ah well.
  • The timing channels problem seems to be a major issue
  • It looks difficult to build applications using the HiStar primitives.
  • What is the killer app for HiStar? What can it be used for given all its limitations? What are all these features good for?

Tuesday, October 13, 2009

Thoughts on The Case for OneSwarm

I should've posted this entry a while ago. This is a talk that Tom Anderson gave at MSR in Mountain View on 9/29/2009.

Summary:
Cloud computing has many issues:
  • Data can be locked in (Facebook, ...)
  • Need to trust providers with data
  • Cloud is a natural monopoly:
    • The business model is to attract developers then the users. So the users have no choice.
    • Reliability and the ability to handle flash crowds is a high barrier to entry.
    • There is no plausible avenue for an open source Cloud. Somebody needs to pay for the infrastructure, so somebody needs to be making money off of it to be sustainable, and the way to do this is through data lock-in and/or data mining...
  • Since all data is stored somewhere centrally, it's easier to censor it
Can we use a P2P system instead where users can donate some of their resources to be able to use other people's resources? At first glance, it looks like this might be alright. P2P systems
  • have high availability -> But in fact this is not true because seeds disappear
  • have no bottlenecks -> But there could be network inefficiencies or ISP throttling
  • have no centralized monitoring -> But it's actually easy to track users
The alternative is OneSwarm. OneSwarm has no centralized trust. The idea is to use privacy-preserving sharing where you build a web of trust through friends. Then when a user makes a query, the query goes through friends, then friends of friends, and so on... and the data travels on the reverse path back to the user. Nodes are not supposed to reveal information about where a query they pass along came from, so the receiver of a query cannot relate a user to a query.

The system, is by Tom's admission, vulnerable to collusion, but that's okay. The system is not designed for complete security but rather as a compromise between security, flexibility, and performance. Another assumption is that there is no partitioning in the network and that eventually everyone will be connected, and everyone will have access to everything (which will probably end up happening anyway).

The network of friends is organized as a DHT. To maintain availability and solve the issues with traditional P2P systems, they invented a new type of DHT that is highly scalable and reliable: nodes are organized into tight groups that are replicated via Paxos. Each group then appears as a single node in the DHT.

As a final note, I had a chat with Tom after the talk. He has a pessimistic outlook on what will eventually happen. Mainly, that cloud providers will always end up creating a monopoly using data lock-in and be able to mine the data stored on the cloud, because they will manipulate the API so they can do so.

Likes:
I like OneSwarm. In particular, I liked the DHT implementation. They had some really cute ideas there. It is not a security protocol, it is a "casual" privacy protocol that presents a minimum barrier to snooping. However, if someone wants to get to my data and to know what requests I am making, they would be able to do so.
I also like the idea of a P2P system presented as cloud storage, on top of which users can implement "free" applications, such as an open-source Facebook.

Dislikes:
  • I'm not sure I agree with the pessimistic view. Google has started a movement called the data liberation front to basically free users from data lock-in, and I expect this movement to become more important.
  • There are more reasons than just convenience for applications to run in a data center. If we move from DCs to P2P systems for storage, we will get much higher lookup costs, and it is not clear systems will scale.
  • There are no guarantees of connectivity. The social network might get partitioned, limiting the available data.

Thursday, August 13, 2009

Thoughts on A System for Authenticated Policy-Compliant Routing (Platypus)

Authors: Barath Raghavan and Alex C. Snoeren
Venue: SIGCOMM 2004
Paper: http://www.cs.ucsd.edu/~snoeren/papers/platypus-sigcomm04.pdf

Summary:

Providers want to forward packets only for entities they can bill. Regular source-routing makes this difficult since packets can arrive from anywhere and it is not clear whom to bill. The authors propose a network protocol that allows an ISP to give capabilities to whomever it wants to bill. The capabilities themselves can then be further delegated. Their work assumes that knowledge of alternate routes or waypoints is obtained through some out-of-band means; the issue is not addressed.

Applications:

1. Efficient overlays and on-demand routing where end-hosts can decide how to route packets.

2. Preferential egress points where different peers get routed differently even if they all peer at the same router.

3. Preferential ingress points where multi-homed ASes can select which ISP traffic comes in through.

4. Virtual multi-homing where a stub AS can get multi-homing at the provider (i.e. a hop or more away) even though it only has one connection.

The Platypus header sits between the IP header and the payload, so it is compatible with IP. A packet has a list of capabilities where each capability specifies the resource principal to bill and the waypoint to go to. Packets are forwarded between waypoints by replacing the IP address in the IP-level packet.

Capabilities are authenticated using a “double MAC” trick. The key server and the router share a single key k. Secret capability key s = MAC(k, capability) is given to customer. Customer then generates b = MAC(s, payload) and puts b in the packet along with the capability. The router can then authenticate the capability.
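The double MAC can be sketched concretely, using HMAC-SHA256 as a stand-in for the MAC (the capability encoding and key strings below are illustrative, not from the paper). The point is that the router, knowing only k, can re-derive s and verify b without any per-customer state:

```python
# Sketch of the "double MAC" trick, with HMAC-SHA256 standing in for the MAC.
# Key server and router share k; the customer receives s = MAC(k, capability)
# and stamps each packet with b = MAC(s, payload).
import hashlib
import hmac

def mac(key, msg):
    return hmac.new(key, msg, hashlib.sha256).digest()

k = b"router/key-server shared key"                      # illustrative
capability = b"resource-principal:42|waypoint:192.0.2.7"  # illustrative

# Key server -> customer:
s = mac(k, capability)

# Customer stamps a packet:
payload = b"some packet payload"
b_tag = mac(s, payload)

# Router verifies using only k and what is carried in the packet:
def router_check(k, capability, payload, tag):
    s2 = mac(k, capability)               # re-derive the capability key
    return hmac.compare_digest(mac(s2, payload), tag)

print(router_check(k, capability, payload, b_tag))  # True
```

A forwarded or replayed capability without s is useless: without s the attacker cannot produce a valid b for a new payload.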

The capability key s is temporary and expires after some time. The expiration mechanism requires loose clock synchronization but allows only one key value at a time. When a key expires, customers need to get a new key. Customers are given a master key they can use: they encrypt the op code for requests and put it in the DNS lookup name. So if a key is requested from a server a.b.edu, the request is sent to E(master key, request code).a.b.edu. The response contains the encoded new key. Requests are thus cacheable.

They also have revocation lists that are pulled from key servers into the routers.

They describe two ways to do delegation on prefixes: one by XORing, the other by recursive MACing. The first is susceptible to collusion; the second might be too slow (this is what we do in ICING).

They evaluate their system using kernel modules, and it is a little slower than IP forwarding. I did not understand their discussion of clustered waypoints and their proposed solution for traffic engineering, so I won’t say anything about it. I will say that the way they choose their nonces is not very nice.

They punt on issues like accounting, saying it is possible to do some distributed per-packet counting and rate-limiting (unknown). Their scalability section is a little hand-wavy since they have no idea what it would really take to build it in hardware. Replay attacks are also hand-waved away, saying they would use Bloom filters (but they don’t mention costs or how they would do that in hardware, or what they would hash…)

The Good:

· This is good seminal work on how to verify policy compliant packets.

· Used some cute smart tricks here and there (expiration, double MAC, delegation, key lookups)

· System seems to perform well enough in software

· They present the system and crypto nicely (who has what keys and how they derive them for example)

The Bad:

· They concentrate only on recognizing whom to bill. In reality, policies are probably more complex for security reasons. For example, don’t let AT&T send packets through Sprint even if another customer is paying for it. Or only use trusted ISPs.

· Much has been hand-waved. The implementation doesn’t explicitly specify (or maybe I missed it) what has been implemented.

· Revocation lists and polling from key-servers might be a problem.

· Once a capability is in a packet, it is compliant. There is no check whether or not the packet actually follows the path.


Thoughts on HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

Authors: Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz (Yale University), Alexander Rasin (Brown University)

Venue: VLDB 2009

Paper: http://db.cs.yale.edu/hadoopdb/hadoopdb.pdf

Summary:

The paper starts by saying that DBs are becoming increasingly important and data is growing at a huge rate. Unfortunately, the usual parallel DBMSs do not scale well because they cannot handle failures very well. On the other hand, MapReduce handles failures and stragglers (heterogeneous systems) well and scales well, but does not perform as well as DBMSs.

The paper describes a system that has roughly three layers. The top layer is a SQL layer which accepts SQL queries from users. The SQL queries are translated into MapReduce jobs. The MapReduce jobs are executed in the middle layer, Hadoop. The job is divided into several tasks. Each task itself is a smaller SQL query that is executed on a single node. Each node independently runs the third layer, a single-node DBMS (PostgreSQL in this case). It uses Hadoop for communication, Hive for the translation, and PostgreSQL for the DB.

HadoopDB consists of the following main components:

1. The database connector: provides an interface to independent databases so that worker nodes (TaskTrackers) can get results of SQL queries in MapReduce format (i.e. key-value pairs).

2. Catalog: maintains metainformation about the third-layer DBs---connections parameters (e.g. DB location, driver class, credentials) and metadata (data sets, partitions, replica locations, …)

3. Data Loader: Partitions data and stores it in the local DBs.

4. SQL to MapReduce to SQL (SMS) Planner: extends Hive, which translates SQL queries into MapReduce (Hadoop) jobs, optimizing the output to push as much of the work as possible onto the local DB instances in the worker nodes, since the single-node DBMSs can do much better than MapReduce.
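The pushdown idea behind the SMS planner can be illustrated with a toy split of one aggregate query, using sqlite3 as a stand-in for the per-node PostgreSQL instances (the table, data, and split are all illustrative, not HadoopDB's actual plan): each node runs a partial aggregate locally (the pushed-down "map" side), and a small combine step merges the partials (the "reduce" side).

```python
# Toy illustration of SQL pushdown: a global SUM ... GROUP BY is split into
# per-partition SQL run inside each local DB (sqlite3 here, standing in for
# PostgreSQL) plus a combine step over the partial results.
import sqlite3

def make_node(rows):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (region TEXT, amount INT)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return db

# Data partitioned across two worker nodes:
nodes = [
    make_node([("east", 10), ("west", 5)]),
    make_node([("east", 7), ("west", 3)]),
]

# Map side: each node runs the partial aggregate inside its local DB.
partial_sql = "SELECT region, SUM(amount) FROM sales GROUP BY region"
partials = [row for db in nodes for row in db.execute(partial_sql)]

# Reduce side: combine the per-node partial sums.
totals = {}
for region, subtotal in partials:
    totals[region] = totals.get(region, 0) + subtotal
print(totals)  # {'east': 17, 'west': 8}
```

Pushing the GROUP BY into the local engines is exactly where the DBMS-side performance comes from: only small partial results cross the MapReduce layer.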

In the eval section, they compare Vertica (a column-store DBMS), DBMS-X, HadoopDB, and Hadoop. They tested the systems both without and with failures. I didn’t quite understand all the differences in workloads, but I did notice that Vertica was much faster than any of the other systems. The authors claim that HadoopDB performs almost as well as the other systems they tested and almost always better than Hadoop. They say that the performance differences are mainly because of the lowest PostgreSQL layer. This seems plausible, but it would have been a much more compelling eval if they had tested comparable systems (for example, using compression on PostgreSQL and/or a column store).

When faults and slow nodes were introduced, Vertica performed much worse: a 50% performance hit with failures and 170% with slow nodes. HadoopDB, on the other hand, was only 30% slower. They then say that this slow-down is due to a minor bug that can easily be fixed so that their performance matches Hadoop’s (which is better than HadoopDB’s with the slowdown).

The Good:

The paper is interesting and reads well. The system itself is also very smart, and it’s a wonder no one has built it before. I would really like to see what ends up happening with it and how it ends up performing in the real world.

The Bad:

It would have been much more insightful if they had used the same bottom-layer DB in the eval, or at least evaluated different ones. It would have been compelling to see how Hadoop can parallelize a DBMS. It would have been nice too if they could have put Vertica in there to show how HadoopDB can fix the issues with failures. Also, I didn’t think the queries they used were that complex, but I don’t know whether this is of any importance.