Venue: SOSP 2009
Summary:
The paper describes FAWN, a storage system built from a large number of small, low-performance (hence "wimpy") nodes, each with a moderate amount of local flash storage. The system has two parts: FAWN-DS and FAWN-KV.
FAWN-DS is the back-end datastore that runs on each of the many nodes. The motivation is simple: I/O is the current bottleneck, and conventional storage architectures are energy-inefficient.
- High-speed CPUs consume too much power, and CPU voltage/frequency scaling doesn't work well because of high leakage.
- 2 GB of DRAM uses about as much power as a 1 TB disk.
- Power costs amount to roughly 50% of a cluster's 3-year TCO.
- Each physical node can appear as multiple vnodes, where each vnode has its own data store. This helps with management, and it doesn't hurt performance because, with a small number of store files, the workload becomes semi-random writes, which flash handles well.
- Part of the key's hash is used to index into an in-RAM hash table (see the sketch below).
- Each index entry stores only a key fragment and an offset into flash.
- If the fragment matches the lookup key, the full key is read from flash and compared.
- If the full key matches, the entry has been found.
- Otherwise, hash chaining is used to probe the next candidate location.
- Writes and deletes are just appends to the log-structured store (a delete appends an invalidation record).
- Lookups proceed as described above.
- Periodically, the store is compacted and the in-RAM index is checkpointed.
- Other operations the DS node supports are merging and splitting data stores.
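To make the datastore description above concrete, here is a minimal Python sketch of how I picture FAWN-DS: an in-RAM hash index that stores only a key fragment plus a flash offset, lookups that verify the full key against the append-only log and hash-chain on a mismatch, and puts that are plain appends. The class names, bit widths, and probe limit are my own illustrative choices, not the paper's code.

```python
import hashlib
from dataclasses import dataclass

INDEX_BITS = 20          # number of in-RAM buckets = 2**INDEX_BITS (illustrative)
FRAGMENT_BITS = 15       # bits of the key hash kept in each index entry

@dataclass
class IndexEntry:
    fragment: int        # FRAGMENT_BITS of the key's hash
    offset: int          # position of the full record in the on-flash log

class FawnDS:
    def __init__(self):
        self.index = {}  # bucket number -> IndexEntry (stand-in for a fixed array)
        self.log = []    # stand-in for the append-only data log on flash

    def _hash(self, key: bytes) -> int:
        return int.from_bytes(hashlib.sha1(key).digest()[:8], "big")

    def _probes(self, key: bytes):
        """Yield (bucket, fragment) pairs: the primary bucket, then chained probes."""
        h = self._hash(key)
        fragment = (h >> INDEX_BITS) & ((1 << FRAGMENT_BITS) - 1)
        for i in range(16):                        # bounded chain length (assumption)
            yield (h + i) % (1 << INDEX_BITS), fragment

    def put(self, key: bytes, value):
        # Writes (and deletes, with value=None as a tombstone) are just appends;
        # the in-RAM index is then pointed at the newest record.
        offset = len(self.log)
        self.log.append((key, value))
        for bucket, fragment in self._probes(key):
            entry = self.index.get(bucket)
            if entry is None or self.log[entry.offset][0] == key:
                self.index[bucket] = IndexEntry(fragment, offset)
                return
        raise RuntimeError("hash chain full; a real store would split/rehash here")

    def get(self, key: bytes):
        for bucket, fragment in self._probes(key):
            entry = self.index.get(bucket)
            if entry is None:
                return None                        # empty bucket: key is absent
            if entry.fragment != fragment:
                continue                           # fragment mismatch: keep chaining
            full_key, value = self.log[entry.offset]   # one read from flash
            if full_key == key:
                return value                       # full key matches: found it
            # false fragment match: fall through and probe the next bucket
        return None
```

Compaction, in this picture, is just re-walking the live records into a fresh log, and split/merge are selective copies of log entries into new stores.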
- The key space is split into ranges that management nodes assign to front-end nodes.
- Front-end nodes handle client requests and keep track of which back-end nodes own which key ranges:
- If a front-end receives a request for a key outside its assigned range, it forwards the request to the front-end that owns it.
- If the key is within its range, it forwards the request to the correct back-end node.
- Front-ends have a small RAM cache they use to absorb repeated requests and avoid hot spots.
- Back-end nodes implement FAWN-DS:
- When a new key-value write arrives, it is forwarded along a replication chain of R nodes on the Chord-style ring (a sketch of this path appears after this list).
- The last node in the chain (the tail) acks the request back to the front-end and handles all lookup requests for that range.
- So each node participates in R chains: once as head, once as tail, and R-2 times as a middle node.
- When a node joins, the affected nodes split their key ranges:
- For each range it takes over, the joining node copies the data store from the current tail.
- The node is then linked into the replication chain.
- It logs incoming updates in a temporary log data store.
- It then receives the updates that occurred between when the copy was taken and when it started receiving new updates directly.
- These updates are merged into the permanent log data store, the node becomes fully live, and the old tail can delete its old data store.
- Joining as head additionally requires coordination with the front-end.
- Joining as tail inserts the node before the old tail, and the predecessor keeps serving lookups until the join completes.
- When a node leaves, its ranges are merged and a new node is added to the affected replication chains, much as in a regular join.
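Here is a rough sketch, in the same spirit, of the chain-replication path described above: a put enters at the head of the R-node chain for its key range, is forwarded node by node, and only the tail acks back to the front-end; gets go straight to the tail. The Chord ring, joins, splits, and temporary log stores are omitted, and all class and method names are mine, not the paper's.

```python
from typing import Optional

class BackendNode:
    def __init__(self, name: str):
        self.name = name
        self.store = {}                        # stand-in for this node's FAWN-DS
        self.successor: Optional["BackendNode"] = None

    def put(self, key, value, frontend):
        self.store[key] = value                # apply locally (an append in real FAWN-DS)
        if self.successor is not None:
            self.successor.put(key, value, frontend)   # forward down the chain
        else:
            frontend.ack(key)                  # tail: ack back to the front-end

    def get(self, key):
        return self.store.get(key)             # lookups are served by the tail only

class FrontEnd:
    def __init__(self, chain):                 # chain = [head, ..., tail] for one key range
        self.chain = chain
        self.acked = set()

    def ack(self, key):
        self.acked.add(key)

    def put(self, key, value):
        self.chain[0].put(key, value, self)    # send the write to the head

    def get(self, key):
        return self.chain[-1].get(key)         # read from the tail

# Tiny usage example with replication factor R = 3.
a, b, c = BackendNode("a"), BackendNode("b"), BackendNode("c")
a.successor, b.successor = b, c
fe = FrontEnd([a, b, c])
fe.put("k", "v")
assert "k" in fe.acked and fe.get("k") == "v"
```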
Evaluation:
- FAWN-DS has quite a bit of overhead over the raw filesystem:
- For lookups, the overhead is 38% for 1 KB queries and 34% for 256 B queries, i.e., about 62-66% of raw throughput.
- For stores, 96% of raw throughput is achieved because writes are log-structured.
- Semi-random write performance is highly dependent on the flash device used.
- They compare their performance to BDB, which is abysmal on flash, to say the least. But I don't get what the point is: BDB is not optimized for flash. I guess the point is that you need something exactly like FAWN-DS if you want to use flash. What else could they do to give a sense of how well FAWN-DS does?
- Small random writes do much better than random reads.
- FAWN-KV throughput is about 80% of single-node FAWN-DS throughput, due to network overhead and marshalling/unmarshalling.
- A cluster with no front-end does 364 queries/joule (see the back-of-the-envelope note after this list).
- Including the front-end: 330 queries/joule.
- Network overhead is currently about 20% and increases with cluster size.
- Median latency < 1 ms, 90th percentile < 2 ms, 99.9th percentile < 26.3 ms.
- During a split, the median is still < 1 ms, and the 99.9th percentile is < 611 ms under high load.
- A desktop does 52 queries/joule; a 3-node FAWN does 240, with half of its idle power consumed by the switch.
- They have a nice figure illustrating the minimal-cost solution space as a function of dataset size and query rate. Almost all of it is covered by FAWN with HD, DRAM, or SSD; only a small part uses Server+DRAM.
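The queries/joule numbers above are just sustained query rate divided by average power draw (1 W = 1 J/s). A trivial helper, with made-up inputs purely for illustration (these are not the paper's measured rate or power):

```python
def queries_per_joule(sustained_qps: float, avg_power_watts: float) -> float:
    # Watts are joules per second, so qps / watts = queries per joule.
    return sustained_qps / avg_power_watts

# Hypothetical numbers chosen only to show the arithmetic, not measurements.
print(queries_per_joule(36_400, 100.0))   # -> 364.0
```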
Interesting tidbits:
Facebook's workload has a put:get ratio of 1:6.
The good:
The paper in general is excellently executed with a thorough eval section. The related work section is also quite thorough and is a good starting point for someone interested in the space. The paper has many interesting tidbits and facts.
The FAWN system itself is also very nice and simple (at least as they explained it). They seem to have picked the right level of abstraction at which to explain it. Performance seems comparable to traditional systems at a much lower cost and higher efficiency. Their Figure 15 is beautiful. They basically convinced me to build my next DC using FAWN.
The bad:
They didn't go into detail on failures. I would have preferred a little less on the impact of maintenance operations and more on failures and reliability. That said, I don't really see why failures would be a major problem.