Wednesday, July 8, 2009

Thoughts on VL2: A Scalable and Flexible Data Center Network

Authors: Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David A. Maltz, Parveen Patel, and Sudipta Sengupta

BibTeX:
To Appear: SIGCOMM 2009

Summary:
The paper has two parts. The first presents results from measurements in a production DC, and the second describes the VL2 architecture.

Datacenter Results:
The authors studied a 1500-node cluster in a DC that runs data-mining workloads.
  • Traffic patterns are highly variable: there are more than 50 distinct traffic matrices, and they change very frequently (usually in under 100 seconds) with no periodicity, hence little predictability.
  • 99% of flows are smaller than 100MB
  • More than 90% of bytes are in flows between 100MB and 1GB
  • Flows over a few GB are rare.
  • Around 20% of traffic leaves/enters the DC. The rest is local.
  • Intense computation and communication does not straddle DCs
  • Demand for BW inside DC growing faster than demand for BW to external hosts
  • The network is the bottleneck for computation: ToR uplinks are frequently more than 80% utilized.
  • More than 50% of the time a machine has about 10 concurrent flows.
  • At least 5% of the time it has 80 concurrent flows, but almost never more than 100.
  • Most failures are small: 50% of network device failures involve fewer than 4 devices.
  • Downtimes can be significant: 95% of failures are resolved in under 10 minutes, but a small fraction last more than 10 days.
  • In 0.3% of failures, all of the redundant components failed together.
  • Most problems are due to network misconfigurations, firmware bugs, and faulty components.
VL2 Architecture:
VL2 is a network architecture that provides multipath forwarding, scalability, and service isolation with layer-2 semantics, and without ARP or DHCP broadcasts. It runs on commodity switches, which only need to support OSPF, ECMP, and IP-in-IP decapsulation.

The concepts are simple. VM hosts add a layer to the networking stack called the VL2 agent. When services or VMs send traffic, the VL2 agent encapsulates the packets in an IP-in-IP tunnel and uses Valiant Load Balancing (VLB) over a Clos network, on a flow-by-flow basis, to spread load across the intermediate (backbone) switches.

Hosts are assigned location-specific IP addresses (LAs), while services are assigned application-specific IP addresses (AAs) that they keep when they migrate around the datacenter (DC). When a service sends a packet, it uses AAs. The VL2 agent encapsulates the packet twice: the inner header carries the LA of the destination ToR switch, and the outer header carries the LA of an intermediate switch. The packet is routed to that intermediate switch, which strips the outer header and forwards the packet to the correct ToR switch. The ToR switch removes the remaining header and delivers the packet to the correct host.
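To make the forwarding path concrete, here is a minimal sketch of the two-layer IP-in-IP encapsulation with a randomly chosen intermediate switch. This is my own illustration, not the paper's code; all names and addresses are invented.

```python
import random
from dataclasses import dataclass

@dataclass
class Packet:
    src: str         # address in this header
    dst: str         # address in this header
    payload: object  # inner packet or application data

def vl2_encapsulate(aa_packet, dst_tor_la, intermediate_las):
    """Wrap an AA-addressed packet in two IP-in-IP headers: the inner header
    carries the LA of the destination ToR switch, and the outer header carries
    the LA of a randomly chosen intermediate switch (the VLB step)."""
    inner = Packet(src=aa_packet.src, dst=dst_tor_la, payload=aa_packet)
    outer = Packet(src=aa_packet.src, dst=random.choice(intermediate_las), payload=inner)
    return outer

# Example: a service at AA 10.0.0.5 sends to AA 10.0.1.7, which sits behind a
# ToR switch whose LA is 20.0.3.1 (all addresses are made up for illustration).
pkt = Packet(src="10.0.0.5", dst="10.0.1.7", payload=b"hello")
wire = vl2_encapsulate(pkt, "20.0.3.1", ["20.1.0.1", "20.1.0.2", "20.1.0.3"])
print(wire.dst)                  # LA of the intermediate switch picked for this flow
print(wire.payload.dst)          # LA of the destination ToR switch
print(wire.payload.payload.dst)  # the original AA
```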

While picking a random intermediate switch LA for each packet would implement VLB, a large number of VL2 agents would need to be updated whenever an intermediate switch is added or removed. Instead, they would like to use a single address for all intermediate switches and let ECMP pick one of them via anycast. But since ECMP only allows 16 distinct paths (with 256 coming in newer switches), they define several anycast addresses (they don't say how many, or how to choose that number) and assign the maximum number of switches ECMP can handle to each address. If a switch dies, the anycast addresses assigned to it are migrated to other switches.
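A rough sketch of how that assignment might look follows; the batching and the address scheme are my assumptions, not the paper's mechanism.

```python
ECMP_WAYS = 16  # number of equal-cost paths the paper's switches support

def assign_anycast(intermediate_switch_las):
    """Split the intermediate switches into groups of at most ECMP_WAYS and
    give each group one shared anycast LA; ECMP then spreads flows across the
    members of whichever group a packet is addressed to."""
    groups = {}
    for i in range(0, len(intermediate_switch_las), ECMP_WAYS):
        anycast_la = f"20.255.0.{i // ECMP_WAYS + 1}"  # invented address scheme
        groups[anycast_la] = intermediate_switch_las[i:i + ECMP_WAYS]
    return groups

switches = [f"20.1.0.{n}" for n in range(1, 41)]  # 40 switches -> 3 anycast LAs
for la, members in assign_anycast(switches).items():
    print(la, len(members))
# When a switch dies, its anycast LA is simply re-announced by the surviving
# switches, so the VL2 agents never have to learn about the change.
```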

When encapsulating, the VL2 agent puts a hash of the packet's 5-tuple into the IP source address of the encapsulation headers. This adds entropy for ECMP, and it can also be used to change a flow's placement so that large flows do not land on the same link.
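A toy version of that idea, as I understand it; the hash function and the way the result is rendered as an address are my choices, not details from the paper.

```python
import hashlib
import socket

def flow_hash_as_ip(src_ip, dst_ip, proto, sport, dport):
    """Hash the flow's 5-tuple and render the result as an IPv4 address that
    the VL2 agent could write into the source field of the outer header."""
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    return socket.inet_ntoa(hashlib.sha1(key).digest()[:4])

print(flow_hash_as_ip("10.0.0.5", "10.0.1.7", 6, 34512, 80))
# Re-hashing with a different salt would steer the flow onto a different ECMP
# path, which is how a large flow could be moved off a congested link.
```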

The VL2 agent obtains the LA from a directory and caches it. When a service running on a host sends an ARP request for an AA, the VL2 agent intercepts the request and queries the directory for the destination ToR switch's LA, which it caches and reuses for subsequent packets. They don't say how the anycast addresses of the intermediate switches are obtained, or what MAC address is returned to the VM in the ARP reply.
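The lookup path looks roughly like this; the Directory class and its API are invented for illustration.

```python
class Directory:
    """Stand-in for the VL2 directory service."""
    def __init__(self, mappings):
        self._aa_to_tor_la = dict(mappings)  # AA -> LA of the AA's ToR switch
    def lookup(self, aa):
        return self._aa_to_tor_la[aa]

class VL2Agent:
    def __init__(self, directory):
        self.directory = directory
        self.cache = {}                      # local AA -> LA cache
    def resolve(self, aa):
        """Called when the host stack issues an ARP request for an AA."""
        if aa not in self.cache:
            self.cache[aa] = self.directory.lookup(aa)  # an RPC in the real system
        return self.cache[aa]

agent = VL2Agent(Directory({"10.0.1.7": "20.0.3.1"}))
print(agent.resolve("10.0.1.7"))  # first call goes to the directory
print(agent.resolve("10.0.1.7"))  # second call is answered from the cache
```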

The VL2 agent can be used to enforce access control (they say the directory does this, but the directory really only makes the policy decisions) and hence to isolate services from each other. In addition, two services that need to communicate do not have to go through an IP gateway to bridge two VLANs, as in traditional DCs (then again, their whole network is made of IP routers anyway).

Services that talk to hosts on the Internet (such as front-end web servers) have two addresses: an LA and an AA. The LA is externally reachable. They do not say what happens when such an externally reachable service migrates.

For broadcast traffic other than ARP (intercepted by the VL2 agent) and DHCP (handled by a DHCP relay), an IP multicast address unique to each service is used.

The VL2 directory system is two-tiered. Replicated directory servers cache AA-to-LA mappings and serve them to clients. Replicated state machine servers use the Paxos consensus algorithm to implement reliable updates. Lookups are fast, while updates are slower but reliable. To fix inconsistencies in VL2 agent caches, a ToR switch that receives a packet with a stale AA-to-LA mapping forwards it to a directory server, which then corrects the VL2 agent's stale cache. Does this mean the ToR switch needs to be modified to support this?
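A highly simplified sketch of that split; the class names are mine, and the real state-machine tier runs Paxos rather than a plain dictionary.

```python
class ReplicatedStateMachine:
    """Stand-in for the Paxos-replicated tier that owns the ground truth."""
    def __init__(self):
        self.truth = {}
    def update(self, aa, la):
        self.truth[aa] = la  # in reality: a slow, reliable consensus round

class DirectoryServer:
    """One of the replicated lookup servers that cache AA-to-LA mappings."""
    def __init__(self, rsm):
        self.rsm = rsm
        self.cache = {}
    def lookup(self, aa):
        if aa not in self.cache:
            self.cache[aa] = self.rsm.truth[aa]
        return self.cache[aa]
    def correct_stale_agent(self, agent_cache, aa):
        """Reactive path: a packet carrying a stale mapping was forwarded here,
        so push the fresh mapping back into the offending agent's cache."""
        agent_cache[aa] = self.rsm.truth[aa]

rsm = ReplicatedStateMachine()
rsm.update("10.0.1.7", "20.0.3.1")           # slow, reliable write
server = DirectoryServer(rsm)
print(server.lookup("10.0.1.7"))             # fast, cached read

stale_agent_cache = {"10.0.1.7": "20.0.9.9"}  # an out-of-date VL2 agent cache
server.correct_stale_agent(stale_agent_cache, "10.0.1.7")
print(stale_agent_cache["10.0.1.7"])
```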

The eval section is good: nicely done and thorough. They get 94% of optimal network capacity, high TCP fairness, graceful degradation and recovery under failures, and fast lookups (good enough to replace ARP). VLB provides good fairness because of the large number of flows (statistical multiplexing) and because of the 10x speed gap between server links and switch uplinks.

The aggregate goodput of long-lived flows is not affected by other flows starting or ending in the network, or by bursts of short flows. This is because VLB spreads all traffic around uniformly, and because TCP reacts quickly enough to keep each flow at its fair share of throughput.

They compared VLB to other routing techniques, adaptive routing and the best oblivious routing, and found that VLB is at worst only 20% worse, which they claim is good enough for such a simple mechanism.

In terms of cost, they claim their network is cheaper for the same performance.

The Good:
The analysis of datacenter characteristics is the first of its kind. Thank you Microsoft!
The architecture is nice and simple to understand and achieves excellent performance. They do not require modification of switches and use commodity components. In exchange, they modify the networking stack on VM hosts. They say this is OK because the hosts need to run crazy hypervisors and VMM software anyway.
The evaluation is thorough, and, if you buy that the DC measurements are representative, convincing.
They answer the question "What can you do without modifying the switches and their protocols?" well. But it is not clear this was their aim.

They make some very interesting points, such as that TCP is good enough for the DC and for VLB. But if we are going to modify the networking stack anyway, will we keep using TCP?

The Bad:
  • It was not clear how representative the measured DC is of other DCs, or of other areas of the same DC. But something is better than nothing.
  • At a high-level, the architecture seems to be a little ad-hoc, trying to solve a problem by patching a solution on top of existing systems. Perhaps this is the right approach for existing systems whose networking equipment cannot be changed.
  • What are the performance characteristics of the VL2 agent? Does it add any significant overhead?
  • How are the machines configured? How are the routers configured? If misconfiguration is the number 1 cause of failures, how do they address that? Nothing is mentioned.
  • They do not mention power. While Fat-tree used 1G links everywhere and needed more powerful scheduling, it did consume much less power and was quite cheap.
  • The architecture seems to require many independent components that are not well integrated or centrally controlled. How do we configure the separate parts, how do we monitor the whole network, and so on?
  • The directory service itself seems difficult to implement and build. How difficult was it?
  • They say they will use a number of anycast addresses. How are these addresses obtained? How many are there? How do we add more?
  • For externally reachable services that have a LA, what happens when they migrate?
  • At one point, they mention that the ToR switch can help with reactive stale-cache updates. Does this mean the switch has to be modified, or did they mean that the VM host does this? And what happens when the host is gone? Or when the ToR is gone? How are VL2 agent caches updated when the reactive mechanism is broken by failures elsewhere?
  • I do not fully buy that this is better than Fat-tree, even if Fat-tree requires centralized control of the switches (which is actually a good thing!).
