SIGCOMM 2009
Summary:
The authors make the case that large data-center networks are expensive and that cost increases non-linearly with size. Current design techniques do not fully utilize fan-out and multipath, and when they do it's either at the cost of compatibility (Myrinet != Ethernet/Infiniband non-TCP/IP /ECMP with randomization reorders packets) or complexity (ECMP with region splitting needs 600k TCAM entries for 25k net).
The authors demonstrate how to design a fat-tree data-center network using "commodity" Ethernet/IP switches and routers. They give switches IP addresses based on their location in the fat tree, making it easier to do routing and select paths. They presented three methods of packet diffusion on the upward path in the fat-tree:
- Static two-level table routing based on host ID: simple, but performs the worst, non-dynamic, needs extra work when a link dies to send updates everywhere. Host IDs may not provide sufficient entropy depending on the communication pattern (hence the bad performance). Tables (TCAMs), however, only need k<=48 entries, which is remarkably cheap, and does not in fact need a discrete component.
- Flow Classification: switches monitor flow sizes and periodically reassign a few flows to balance usage across ports. Only local optimization and does not avoid hotspots in the core. Performs better than
- Central Flow Scheduler: Edge switches monitor flow sizes and notify a central scheduler when a flow goes above a certain threshold. The scheduler then reassigns the flow's path to the minimally loaded links.
Using some back-of-the-envelope calcuations, traditional networks use almost twice the power and dissipate almost twice the heat. This is because they use switches with 10GigE uplinks and 10 GigE switches, and a 10GigE switch uses approximately double the power per Gbps and disspates 3x the heat of a 1GigE switch. The calculation was done for the largest networks they support (~27k hosts).
They implemented the design on NetFPGA but gave no performance numbers. They also incorrectly assumed that because it was easy to modify the NetFPGA, it would be easy to modify other switches.
They also implemented the three designs above in software using Click, and measured the throughput. In all cases, the hierarchical traditional design did worst, then the two-level table, then flow classification, while flow scheduling did best.
The authors also investigated a packagin solution, proposing to put all the pod switches in a central rack and have all pod hosts in other racks connect to it. Then, core switches are equally divided among the pods, and the pods are laid out in a grid. This bundles up the cables that are going between pods and the cables between racks. The switch rack itself can have built in wiring to simplify cabling inside the pod switch rack.
The Good:
This is work that has needed to be done for a while now. It is surprising no one has done this research before given that most of the algorithms used are simple and intuitive. They do not give enough credit to centralization for their work, probably so they don't upset some people. It is noteworthy that all of the work would have been much more complex if they did not use centralization (if not impossible). That is why it is a little unfair to compare the work with OSPF2-ECMP which is fully distributed.
The work is feasible, and seems not too difficult to implement or build. On the whole I really like the paper and work.
The Bad:
Even though they claim they use "commodity" switches, the switches needed are not "commodity". Depending on the algorithms used, they either need two-level table support (possible hardware change though not necessarily) or OpenFlow-support (software mostly). They call them "commodity" because they do not use aggregated uplinks (such as 10GigE ports or switches). If the hardware modification is needed, it is not clear it can happen. OpenFlow, on the other hand, is quite likely to happen and is already on the way. Given these new needs and changes from the fully-distributed hierarchical design, they should have mentioned more openly how their infrastructure differs from traditional infrastructure. They cannot really go buy a bunch of switches from Fry's and use them.
The calculations for power/heat and costs are all back-of-envelope calculations. I am also uncertain of the fairness in comparisons. The higher-end more expensive switches and more power-consuming switches are more robust and provide much more functionality and features than the lower-end switches they use. A good question is, does the network they build with cheap switches offer all the same features an admin expects from a traditional network?
While difficult to do, it would have been nice to see a concrete example that includes cabling, packaging, and operation costs. In their throughput calculations, they use artificial traffic patterns that do not take into account the bursty nature of traffic. Their algorithms also assume "Internet-like" traffic with heavy tails. It would be interesting to see how their algorithms fair with real data center traces. It is surprising they could not get any. I expect a data-center's traffic burstiness to significantly change their results.
It is also unfortunate that they have not tried to implement their solutions on any real hardware. So it is not clear how easy it is to modify currently available infrastructure. It doesn't seem that the whole DC network needs to be overhauled.
While it is a point they made in the paper, the wiring complexity seems to be quite horrible, and I am not sure the solution they propose fixes the problem. Perhaps one needs to look at the problem a higher level up and consider building chassis that house pods...
No comments:
Post a Comment