Venue: SOSP 2009
Summary:
The paper describes FAWN, a storage system built from a large number of small, low-performance (hence "wimpy") nodes, each with a moderate amount of local flash storage. The system has two parts: FAWN-DS and FAWN-KV.
FAWN-DS is the back-end datastore that runs on each of the many nodes. The motivation is simple: I/O is the current bottleneck, and conventional storage architectures are energy-inefficient.
- High-speed CPUs consume too much power, and CPU voltage/frequency scaling doesn't work well because of high leakage.
- 2 GB of DRAM uses about as much power as a 1 TB disk.
- Power costs amount to roughly 50% of a cluster's 3-year TCO.
- Each physical node can appear as multiple vnodes, where each vnode has its own data store. This helps with management, and it doesn't hurt performance because, with a small number of store files, the workload becomes semi-random writes, which flash handles well.
- Part of the key's hash is used to index into an in-RAM hash table (see the sketch below).
- Each index entry stores only a key fragment and an offset into flash.
- If the fragment matches the lookup key, the full key is read from flash and compared.
- If the full key matches, the entry has been found.
- Otherwise, hash chaining is used to probe the next candidate location.
- Writes and deletes are just appends to the log-structured store (a delete appends an invalidation record).
- Lookups proceed as described above.
- Periodically, the store is compacted and the in-RAM index is checkpointed.
- Other operations the DS node supports are merging and splitting data stores.
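To make the datastore description above concrete, here is a minimal Python sketch of how I picture FAWN-DS: an in-RAM hash index that stores only a key fragment plus a flash offset, lookups that verify the full key against the append-only log and hash-chain on a mismatch, and puts that are plain appends. The class names, bit widths, and probe limit are my own illustrative choices, not the paper's code.

```python
import hashlib
from dataclasses import dataclass

INDEX_BITS = 20          # number of in-RAM buckets = 2**INDEX_BITS (illustrative)
FRAGMENT_BITS = 15       # bits of the key hash kept in each index entry

@dataclass
class IndexEntry:
    fragment: int        # FRAGMENT_BITS of the key's hash
    offset: int          # position of the full record in the on-flash log

class FawnDS:
    def __init__(self):
        self.index = {}  # bucket number -> IndexEntry (stand-in for a fixed array)
        self.log = []    # stand-in for the append-only data log on flash

    def _hash(self, key: bytes) -> int:
        return int.from_bytes(hashlib.sha1(key).digest()[:8], "big")

    def _probes(self, key: bytes):
        """Yield (bucket, fragment) pairs: the primary bucket, then chained probes."""
        h = self._hash(key)
        fragment = (h >> INDEX_BITS) & ((1 << FRAGMENT_BITS) - 1)
        for i in range(16):                        # bounded chain length (assumption)
            yield (h + i) % (1 << INDEX_BITS), fragment

    def put(self, key: bytes, value):
        # Writes (and deletes, with value=None as a tombstone) are just appends;
        # the in-RAM index is then pointed at the newest record.
        offset = len(self.log)
        self.log.append((key, value))
        for bucket, fragment in self._probes(key):
            entry = self.index.get(bucket)
            if entry is None or self.log[entry.offset][0] == key:
                self.index[bucket] = IndexEntry(fragment, offset)
                return
        raise RuntimeError("hash chain full; a real store would split/rehash here")

    def get(self, key: bytes):
        for bucket, fragment in self._probes(key):
            entry = self.index.get(bucket)
            if entry is None:
                return None                        # empty bucket: key is absent
            if entry.fragment != fragment:
                continue                           # fragment mismatch: keep chaining
            full_key, value = self.log[entry.offset]   # one read from flash
            if full_key == key:
                return value                       # full key matches: found it
            # false fragment match: fall through and probe the next bucket
        return None
```

Compaction, in this picture, is just re-walking the live records into a fresh log, and split/merge are selective copies of log entries into new stores.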
- The key space is split into ranges that management nodes assign to front-end nodes.
- Front-end nodes handle client requests and keep track of which back-end nodes own which key ranges:
- If a front-end receives a request for a key outside its assigned range, it forwards the request to the front-end that owns it.
- If the key is within its range, it forwards the request to the correct back-end node.
- Front-ends have a small RAM cache they use to absorb repeated requests and avoid hot spots.
- Back-end nodes implement FAWN-DS:
- When a new key-value write arrives, it is forwarded along a replication chain of R nodes on the Chord-style ring (a sketch of this path appears after this list).
- The last node in the chain (the tail) acks the request back to the front-end and handles all lookup requests for that range.
- So each node participates in R chains: once as head, once as tail, and R-2 times as a middle node.
- When a node joins, the affected nodes split their key ranges:
- For each range it takes over, the joining node copies the data store from the current tail.
- The node is then linked into the replication chain.
- It logs incoming updates in a temporary log data store.
- It then receives the updates that occurred between when the copy was taken and when it started receiving new updates directly.
- These updates are merged into the permanent log data store, the node becomes fully live, and the old tail can delete its old data store.
- Joining as head additionally requires coordination with the front-end.
- Joining as tail inserts the node before the old tail, and the predecessor keeps serving lookups until the join completes.
- When a node leaves, its ranges are merged and a new node is added to the affected replication chains, much as in a regular join.
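Here is a rough sketch, in the same spirit, of the chain-replication path described above: a put enters at the head of the R-node chain for its key range, is forwarded node by node, and only the tail acks back to the front-end; gets go straight to the tail. The Chord ring, joins, splits, and temporary log stores are omitted, and all class and method names are mine, not the paper's.

```python
from typing import Optional

class BackendNode:
    def __init__(self, name: str):
        self.name = name
        self.store = {}                        # stand-in for this node's FAWN-DS
        self.successor: Optional["BackendNode"] = None

    def put(self, key, value, frontend):
        self.store[key] = value                # apply locally (an append in real FAWN-DS)
        if self.successor is not None:
            self.successor.put(key, value, frontend)   # forward down the chain
        else:
            frontend.ack(key)                  # tail: ack back to the front-end

    def get(self, key):
        return self.store.get(key)             # lookups are served by the tail only

class FrontEnd:
    def __init__(self, chain):                 # chain = [head, ..., tail] for one key range
        self.chain = chain
        self.acked = set()

    def ack(self, key):
        self.acked.add(key)

    def put(self, key, value):
        self.chain[0].put(key, value, self)    # send the write to the head

    def get(self, key):
        return self.chain[-1].get(key)         # read from the tail

# Tiny usage example with replication factor R = 3.
a, b, c = BackendNode("a"), BackendNode("b"), BackendNode("c")
a.successor, b.successor = b, c
fe = FrontEnd([a, b, c])
fe.put("k", "v")
assert "k" in fe.acked and fe.get("k") == "v"
```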
Evaluation:
- FAWN-DS has quite a bit of overhead over the raw filesystem:
- For lookups, the overhead is 38% for 1 KB queries and 34% for 256 B queries, i.e., about 62-66% of raw throughput.
- For stores, 96% of raw throughput is achieved because writes are log-structured.
- Semi-random write performance is highly dependent on the flash device used.
- They compare their performance to BDB, which is abysmal on flash, to say the least. But I don't get what the point is: BDB is not optimized for flash. I guess the point is that you need something exactly like FAWN-DS if you want to use flash. What else could they do to give a sense of how well FAWN-DS does?
- Small random writes do much better than random reads.
- FAWN-KV throughput is about 80% of single-node FAWN-DS throughput, due to network overhead and marshalling/unmarshalling.
- A cluster with no front-end does 364 queries/joule (see the back-of-the-envelope note after this list).
- Including the front-end: 330 queries/joule.
- Network overhead is currently about 20% and increases with cluster size.
- Median latency < 1 ms, 90th percentile < 2 ms, 99.9th percentile < 26.3 ms.
- During a split, the median is still < 1 ms, and the 99.9th percentile is < 611 ms under high load.
- A desktop does 52 queries/joule; a 3-node FAWN does 240, with half of its idle power consumed by the switch.
- They have a nice figure illustrating the minimal-cost solution space as a function of dataset size and query rate. Almost all of it is covered by FAWN with HD, DRAM, or SSD; only a small part uses Server+DRAM.
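The queries/joule numbers above are just sustained query rate divided by average power draw (1 W = 1 J/s). A trivial helper, with made-up inputs purely for illustration (these are not the paper's measured rate or power):

```python
def queries_per_joule(sustained_qps: float, avg_power_watts: float) -> float:
    # Watts are joules per second, so qps / watts = queries per joule.
    return sustained_qps / avg_power_watts

# Hypothetical numbers chosen only to show the arithmetic, not measurements.
print(queries_per_joule(36_400, 100.0))   # -> 364.0
```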
Interesting tidbits:
Facebook's workload has a put:get ratio of 1:6.
The good:
The paper in general is excellently executed with a thorough eval section. The related work section is also quite thorough and is a good starting point for someone interested in the space. The paper has many interesting tidbits and facts.
The FAWN system itself is also very nice and simple (at least as they explained it). They seem to have picked the right level of abstraction at which to explain it. Performance seems comparable to traditional systems at a much lower cost and higher efficiency. Their Figure 15 is beautiful. They basically convinced me to build my next DC using FAWN.
The bad:
They didn't go into detail on failures. I would have preferred a little less on the impact of maintenance operations and more on failures and reliability. That said, I don't really see why failures would be a major problem.