Saturday, July 18, 2009

Thoughts on Open Cirrus Cloud Computing Testbed: Federated Data Centers for Open Source Systems and Services Research

Authors: Roy Campbell (UIUC), Indranil Gupta (UIUC), Michael Heath (UIUC), Steven Y. Ko (UIUC), Michael Kozuch (Intel Research), Marcel Kunze (KIT), Thomas Kwan (Yahoo!), Kevin Lai (HP Labs), Hing Yan Lee (IDA), Martha Lyons (HP Labs), Dejan Milojicic (HP Labs), David O’Hallaron (Intel Research), and Yeng Chai Soh (IDA)

Venue: HotCloud '09


Summary:
The authors present the need for systems research on datacenter infrastructure. While virtualized offerings like EC2/S3 and Azure allow applications research, it is difficult to do research on the systems that compose the infrastructure of the data center itself. Open Cirrus is an effort to provide researchers with hardware-level access to end-hosts. The project has federated several data centers around the world and is working on providing a unified API to access all of them. There will be some basic common functionality even though the DCs themselves are heterogeneous. Open Cirrus is basically a federated Emulab spread across multiple sites, where the sites are only loosely coupled (whatever that means).

The systems are scheduled and managed as Physical Resource Sets (PRS). HP's Integrated Lights-Out (iLO) technology is used to manage machines at the firmware level.

They compare options for testbeds and do a back-of-the-envelope calculation of the break-even point for renting (S3/EC2) vs. owning hardware (using a reasonable cost breakdown). Their calculation says that for computation running more than about 12 months it is better to own, and for storage kept more than about 6 months it is better to own.
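To make the rent-vs-own argument concrete, here is a minimal Python sketch of the break-even logic; the dollar figures are placeholders I made up, not the paper's numbers.

# Hedged sketch of the rent-vs-own break-even logic described above.
# The unit costs below are placeholders, NOT the figures from the paper.

def months_to_break_even(purchase_cost, monthly_ownership_cost, monthly_rental_cost):
    """Months after which owning becomes cheaper than renting."""
    monthly_savings = monthly_rental_cost - monthly_ownership_cost
    if monthly_savings <= 0:
        return float("inf")  # renting never costs more per month, so no break-even
    return purchase_cost / monthly_savings

# Illustrative only: one owned server vs. equivalent rented capacity.
print(months_to_break_even(purchase_cost=3000,
                           monthly_ownership_cost=100,   # power, admin, space
                           monthly_rental_cost=350))     # equivalent rented capacity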

The paper is interesting in that it surveys other testbeds and introduces a new one at a wider scale than Emulab.

Thursday, July 16, 2009

Thoughts on Providing Packet Obituaries

Authors: Katerina Argyraki, Petros Maniatis, David Cheriton, Scott Shenker
Venue: HotNets '04

Summary:
The paper draws a strawman design of how to provide some form of accountability about which AS on a path dropped a packet. This is done by adding Accountability Boxes (A-boxes) on links at the entry and exit points of an AS. The A-boxes calculate digests of the packets that pass through, and reports cascade from the last AS to see a packet back to the sender along the reverse AS-level path. The mechanism allows an end-host to know which AS dropped a packet. If ASes forge reports, the sender can tell at which link a report discrepancy occurred, and thus that one of the two adjacent ASes has lied. The paper also quickly studies the feasibility of the design. I did not read the paper closely enough to comment on the pros and cons. I like the function that it provides; I'm more worried about its security and whether it is practical. I am not sure that necessitating extra hardware is the right approach, but perhaps this is OK since carriers put all sorts of specialized hardware in their networks anyway.
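As a rough illustration of the A-box idea (my own toy sketch, not the paper's actual digest scheme), an A-box could keep compact digests of the packets it sees so that reports from neighboring boxes can later be compared.

# Toy sketch of the A-box idea: keep a compact digest of packets seen on a link,
# so neighboring ASes can later compare reports. The hashing scheme and names
# are illustrative assumptions, not the paper's design.
import hashlib

class ABox:
    def __init__(self):
        self.digests = set()

    def observe(self, packet_bytes: bytes):
        # A real A-box would use a space-efficient structure (e.g. a Bloom filter).
        self.digests.add(hashlib.sha1(packet_bytes).hexdigest()[:16])

    def saw(self, packet_bytes: bytes) -> bool:
        return hashlib.sha1(packet_bytes).hexdigest()[:16] in self.digests

entry, exit_ = ABox(), ABox()
pkt = b"example packet payload"
entry.observe(pkt)                       # packet entered the AS ...
print(entry.saw(pkt), exit_.saw(pkt))    # ... but never left: the drop is inside this AS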

Wednesday, July 15, 2009

Thoughts on CloudViews: Communal Data Sharing in Public Clouds

Authors: Roxana Geambasu, Steven D. Gribble, Henry M. Levy
Appeared in: HotCloud '09

Summary:
Public clouds have lots of internal bandwidth, so it is easier to share data between services in the same cloud. The paper raises the issues involved in sharing and what the security options are. They maintain the view that data is owned by the service rather than by the data creator. Services can specify which portions of their data are accessible to other services, but once data is shared by one service with another, it becomes owned by the other service, which can do whatever it likes with it. They don't provide federation of resources and assume a few large public clouds.

This was a quick read and needs to be read again.

Wednesday, July 8, 2009

Thoughts on VL2: A Scalable and Flexible Data Center Network

Authors: Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David A. Maltz, Parveen Patel, and Sudipta Sengupta

Venue: To appear in SIGCOMM 2009

Summary:
The paper has two parts. The first describes results from measurements in a DC, and the second describes an architecture.

Datacenter Results:
The authors looked at a 1500 node cluster in a DC that does data-mining.
  • Traffic patterns are highly variable: there are more than 50 distinct traffic matrices that change very frequently (usually in under 100s) with no periodicity (i.e., no predictability).
  • 99% of flows are smaller than 100MB
  • More than 90% of bytes are in flows between 100MB and 1GB
  • Flows over a few GB are rare.
  • Around 20% of traffic leaves/enters the DC. The rest is local.
  • Intense computation and communication does not straddle DCs
  • Demand for BW inside DC growing faster than demand for BW to external hosts
  • The network is a bottleneck to computation. ToR uplinks are frequently more than 80% utilized.
  • More than 50% of the time a machine has 10 flows
  • At least 5% of the time it has 80 flows, but almost never more than 100.
  • Most failures are small: 50% of network device failures involve fewer than 4 devices, and 95% fewer than 20.
  • Downtimes can be significant: 95% of failures are resolved in under 10 minutes, but a small fraction last more than 10 days.
  • In 0.3% of failures, all redundant components failed.
  • Most problems are due to network misconfigurations, firmware bugs, and faulty components.
VL2 Architecture:
VL2 proposes a network architecture that enables multipath, scalability, and isolation with layer-2 semantics and without ARP and DHCP broadcasts. It uses commodity switches that must support OSPF, ECMP, and IP-in-IP decapsulation.

The concepts are simple. VM hosts add a layer to the networking stack called the VL2 agent. When services or VMs wish to send traffic, the VL2 agent encapsulates the packets in an IP-in-IP tunnel and uses VLB over a Clos network, on a flow-by-flow basis, to split load across the intermediate (backbone) switches.

Hosts are assigned location-specific IP addresses (LAs), while services are assigned application-specific IP addresses (AAs) that they keep when they migrate around the datacenter (DC). When a service sends a packet, it uses AAs. The VL2 agent encapsulates the packet twice: in the inner layer it puts the LA of the destination ToR switch, and in the outer layer it puts the LA of an intermediate switch, then sends it out. The packet gets routed to the intermediate switch, which decapsulates the first layer and forwards it to the correct ToR switch. The ToR switch decapsulates the packet and delivers it to the correct host.

While selecting a random intermediate-switch LA for each flow would implement VLB, a large number of VL2 agents would need to be updated whenever a switch is added to or removed from the intermediate layer. Instead, they would like to use one address for all intermediate switches and let ECMP choose one of the switches via anycast. But since ECMP only allows 16 different paths (256 coming later), they choose a number of anycast addresses (they don't say what this number is or how to choose it) and assign the maximum number of switches to each address. If a switch dies, the addresses assigned to it are migrated to other switches.
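A small sketch of what this grouping could look like (the address values and helper names are my own illustration, not from the paper): with ECMP capped at 16 paths per destination, N intermediate switches need about ceil(N/16) anycast addresses.

# Sketch of the workaround described above: group the intermediate switches
# under ceil(N/16) anycast addresses. Address values are placeholders.
import math

def assign_anycast(intermediate_switches, ecmp_limit=16):
    n_addrs = math.ceil(len(intermediate_switches) / ecmp_limit)
    groups = {f"10.255.0.{i + 1}": [] for i in range(n_addrs)}
    addrs = list(groups)
    for i, sw in enumerate(intermediate_switches):
        groups[addrs[i // ecmp_limit]].append(sw)
    return groups

print(assign_anycast([f"int{i}" for i in range(40)]))  # 40 switches -> 3 anycast addresses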

When encapsulating, the VL2 agent puts a hash of the packet's 5-tuple in the IP src address of the encapsulation headers. This can be used to create additional entropy for ECMP, or to modify flow placement to prevent large flows from being placed on the same link.
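Putting the last two paragraphs together, here is a hedged Python sketch of what the agent's double encapsulation might look like. The header layout, the address format for the hash, and the use of a random choice in place of real ECMP-over-anycast are simplifications of mine, not VL2's actual code.

# Sketch of the VL2 agent's double encapsulation as I understand it from the paper.
import hashlib, random

def five_tuple_hash(src_aa, dst_aa, proto, sport, dport) -> str:
    h = hashlib.md5(f"{src_aa}{dst_aa}{proto}{sport}{dport}".encode()).hexdigest()
    return f"10.255.{int(h[:2], 16)}.{int(h[2:4], 16)}"  # hash folded into a source address

def vl2_encapsulate(payload, src_aa, dst_aa, proto, sport, dport,
                    dst_tor_la, intermediate_anycast_las):
    flow_src = five_tuple_hash(src_aa, dst_aa, proto, sport, dport)
    inner = {"src": flow_src, "dst": dst_tor_la, "payload": payload}   # to the destination ToR
    outer = {"src": flow_src,                                          # extra entropy for ECMP
             "dst": random.choice(intermediate_anycast_las),           # VLB: pick a spine switch
             "payload": inner}
    return outer

pkt = vl2_encapsulate("app data", "20.0.0.5", "20.0.0.9", 6, 12345, 80,
                      dst_tor_la="10.0.1.1",
                      intermediate_anycast_las=["10.1.1.1", "10.1.1.2"])
print(pkt)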

The VL2 agent obtains the LA from a directory and caches it. When a service running on a host sends an ARP request for an AA, the VL2 agent intercepts the request and queries the directory for the destination ToR switch's LA, which it caches and uses for subsequent packets. They don't say how the anycast addresses for the intermediate switches are obtained, or what MAC address is returned in the ARP reply to the VM.

The VL2 agent can be used to enforce access control (they say that the directory does, but it actually only makes the policy decisions) and hence isolate services from each other. In addition, two services that need to communicate don't need to go over an IP gateway to bridge two VLANs as in traditional DCs (then again their whole network is made of IP routers anyway).

Services that connect to hosts on the Internet (such as front-end webservers) have two addresses: an LA and an AA. The LA is externally reachable. They do not say what this means when an externally reachable service migrates.

For broadcast traffic other than ARP (intercepted by the VL2 agent) and DHCP (handled by a DHCP relay), an IP multicast address unique to each service is used.

The VL2 directory system is two-tiered. Replicated directory servers cache AA-to-LA mappings and serve them to clients. Replicated state machine servers use the Paxos consensus algorithm to implement reliable updates. Lookups are fast, while updates are slow but reliable. To fix inconsistencies in VL2 agent caches, if a ToR switch receives a packet with a stale AA-to-LA mapping, it sends the packet to a directory server, which updates the VL2 agent's stale cache. Does this mean the ToR switch needs to be modified to support this?
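A toy sketch of the agent-side cache with the reactive correction path (the API and names here are assumptions for illustration, not VL2's actual interfaces):

# Agent-side AA -> LA cache with fast lookups and a reactive correction path.
class DirectoryClient:
    def __init__(self, directory):
        self.directory = directory          # authoritative AA -> LA mappings
        self.cache = {}

    def resolve(self, aa):
        if aa not in self.cache:
            self.cache[aa] = self.directory[aa]   # fast read from a directory server
        return self.cache[aa]

    def correct(self, aa):
        # Reactive path: a stale mapping was detected downstream, so refresh it.
        self.cache[aa] = self.directory[aa]

directory = {"20.0.0.9": "10.0.1.1"}
agent = DirectoryClient(directory)
print(agent.resolve("20.0.0.9"))
directory["20.0.0.9"] = "10.0.2.1"   # the service migrated to another ToR
agent.correct("20.0.0.9")            # triggered by the stale-packet report
print(agent.resolve("20.0.0.9"))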

The eval section is good: it is nicely done and thorough. They get 94% of optimal network capacity, high TCP fairness, graceful degradation and recovery under failures, and fast lookups (good enough to replace ARP). VLB provides good fairness because of the high number of flows (statistical muxing) and because uplinks are 10x faster than host links, so flows mix well.

Long-lived flows' aggregate goodput is not affected by other flows starting or ending in the network, or by bursts of short flows. This is due to VLB spreading all traffic around uniformly, and because TCP reacts quickly enough to keep flows at their fair share of throughput.

They compared VLB to other routing techniques, adaptive and best oblivious, and found that VLB is at worst only 20% worse, which they claim is good enough for such a simple mechanism.

In terms of cost, they claim their network is cheaper for the same performance.

The Good:
The analysis of datacenter characteristics is the first of its kind. Thank you Microsoft!
The architecture is nice and simple to understand and achieves excellent performance. They do not require modification of switches and use commodity components. In exchange, they modify the networking stack on VM hosts. They say this is OK because the hosts need to run crazy hypervisors and VMM software anyway.
The evaluation is thorough, and, if you buy that the DC measurements are representative, convincing.
They answer the question "What can you do without modifying the switches and their protocols?" well. But it is not clear this was their aim.

They made some very interesting points such as that TCP is good enough for the DC and for VLB. But if we are going to modify the network stack, will we use TCP?

The Bad:
  • It was not clear how representative the measurement of the DC was of other DCs and other areas of the same DC. But something is better than nothing.
  • At a high-level, the architecture seems to be a little ad-hoc, trying to solve a problem by patching a solution on top of existing systems. Perhaps this is the right approach for existing systems whose networking equipment cannot be changed.
  • What are the performance characteristics of the VL2 agent? Does it add any significant overhead?
  • How are the machines configured? How are the routers configured? If misconfiguration is the number 1 cause of failures, how do they address that? Nothing is mentioned.
  • They do not mention power. While Fat-tree used 1G links everywhere and needed more powerful scheduling, it did consume much less power and was quite cheap.
  • The architecture seems to require many independent components that are not very well integrated or controlled. How do we configure the separate parts, how can we monitor the whole network, and so on?
  • The directory service itself seems difficult to implement and build. How difficult was it?
  • They say they will use a number of anycast addresses. How are these addresses obtained? How many are there? How do we add more?
  • For externally reachable services that have a LA, what happens when they migrate?
  • At one point, they mention that the ToR switch can help with reactive stale-cache updates. Does this mean the switch has to be modified, or did they mean that the VM host does this? And what happens when the host is gone? Or when the ToR is gone? How are the VL2 caches updated when the reactive mechanism is not working due to failures somewhere else?
  • I do not fully buy that this is better than fat-tree, even if Fat-tree requires centralized control of switches (actually a good thing!)

Thursday, June 25, 2009

Thoughts on Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises

Authors: Changhoon Kim, Matthew Caesar, and Jennifer Rexford

Reading Detail Level: Quick Read

BibTeX:
@inproceedings{seattle,
author = {Kim, Changhoon and Caesar, Matthew and Rexford, Jennifer},
title = {Floodless in seattle: a scalable ethernet architecture for large enterprises},
booktitle = {SIGCOMM '08: Proceedings of the ACM SIGCOMM 2008 conference on Data communication},
year = {2008},
isbn = {978-1-60558-175-0},
pages = {3--14},
location = {Seattle, WA, USA},
doi = {http://doi.acm.org/10.1145/1402958.1402961},
publisher = {ACM},
address = {New York, NY, USA},
}

Summary:

The main idea is to use the switches to form a one-hop DHT. The DHT is accessed using consistent hashing (a function F) to minimize churn when switches fail or recover and keys are re-hashed. SEATTLE switches do the following things:
  1. They use a link-state algorithm to build the network topology (uses broadcast but only for switches)
  2. They form a one-hop DHT that can be used to store (k,v) pairs.
  3. They also participate in a multi-level DHT so that a large network such as an ISP can be divided into regions and queries to other regions get encapsulated at border switches when crossing regions.
  4. For every k a switch inserts, it keeps track of whether the resolver changes (maybe because it died or a new one came up), and inserts the (k,v) at the new location.
  5. If a switch dies, each switch will scan through its list of stored (k,v) and if the dead switch is a stored value, that entry is deleted.
  6. They advertise themselves as multiple virtual switches so that more powerful switches would appear as several virtual switches and would handle a larger load in the DHT.
  7. When a host arrives:
    1. The access switch snoops to find the host's IP and MAC addrs
    2. The access switch stores MAC -> (IP, location) at switch R=F(MAC) in the DHT
    3. The access switch stores IP -> (MAC, location) at switch V=F(IP) in the DHT
  8. To resolve an ARP request, an access switch uses the DHT's IP->(MAC, location) and caches the result so packets go directly to the target location without additional lookups.
  9. If a packet is sent to a MAC address which the access switch does not know:
    1. The packet is encapsulated and sent to switch R=F(MAC).
    2. R forwards the packet to the right location.
    3. R sends the info to the access switch
    4. The access switch caches the info
  10. When a host's MAC and/or IP addresses and/or location change:
    1. The switch at which the host gets attached (or remains) updates the info in the DHT.
    2. For a location change, the old access switch is told of the change; so, if it receives packets to the old address, it informs the sending access switch of the change.
    3. For a MAC address change: 1) the switch to which the host is attached maintains a revocation list (IP, MACold, MACnew) for the host, and if another host is still using the old MAC address, the switch sends it a gratuitous ARP. 2) The switch tells the other switch of the new MAC-to-location mapping so it doesn't need to do a lookup.
  11. SEATTLE separates unicast reachability from broadcast domains. It uses groups: sets of hosts that share the same broadcast domain regardless of location. Switches somehow (it is not clear how) know what groups a host belongs to, and use a protocol similar to IP multicast to graft a branch onto an existing multicast tree.
The DHT can also store mappings from service names to locations (e.g. DHCP_SERVER and PRINTER). This is how they handle DHCP without broadcasts.
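To make the resolver mechanism concrete, here is a minimal sketch of consistent hashing over the live switches; the hash function and ring handling are my assumptions, not the paper's implementation.

# Minimal sketch of SEATTLE-style resolution: consistent hashing over the live
# switches picks the resolver switch for a key (a MAC or IP address).
import hashlib, bisect

class SwitchRing:
    def __init__(self, switch_ids):
        self.ring = sorted((self._h(s), s) for s in switch_ids)

    @staticmethod
    def _h(value):
        return int(hashlib.sha1(str(value).encode()).hexdigest(), 16)

    def resolver_for(self, key):
        """F(key): the switch whose hash is the first at or after hash(key)."""
        points = [p for p, _ in self.ring]
        i = bisect.bisect_left(points, self._h(key)) % len(self.ring)
        return self.ring[i][1]

ring = SwitchRing(["sw1", "sw2", "sw3", "sw4"])
print(ring.resolver_for("00:11:22:33:44:55"))  # which switch stores this MAC's mapping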

For evaluation, the authors implemented SEATTLE (on XORP and Click) and used traces from several locations to run simulations. They compared with Ethernet and ROFL, showing that SEATTLE is superior to both in many ways.

The Good:
The architecture is beautiful in many ways. The DHT approach seems very powerful indeed. I don't know of any other solutions to the broadcasting problem, and none that can handle arbitrary L2 topologies. I liked the notions of virtual switches and groups (with caveats below). The evaluation section is nice, though not perfect.

The Bad:
No paper is perfect, unfortunately.
  • While the authors claim to eliminate broadcast, they do use link-state updates, which are broadcast (though only among switches and far less often than ARP).
  • Virtual switches are nice, but there is no analysis of their side effects or of how difficult they are to implement.
  • The discussion on areas was completely confusing.
  • The discussion on groups was too abstract. How do switches know what hosts are in what groups? I was not able to visualize how groups would be implemented from the discussion, so I have to say they seem too difficult.
  • The evaluation section is missing a critical piece. The paper does not mention what has been implemented and what is missing. This makes the evaluation much less credible. How difficult was the implementation? How does it compare to other designs? What is the cost comparison?
  • The per-packet processing cost was not convincing because it was not clear what was implemented and what was not.
Nothing was said on multipath, but that was not the point of the paper anyway.

Sunday, June 14, 2009

Thoughts on Linux Security Modules: General Security Support for the Linux Kernel

Authors: Chris Wright, Crispin Cowan, Stephen Smalley, James Morris, and Greg Kroah-Hartman

BibTeX:

@inproceedings{DBLP:conf/uss/WrightCSMK02,
author = {Chris Wright and
Crispin Cowan and
Stephen Smalley and
James Morris and
Greg Kroah-Hartman},
title = {Linux Security Modules: General Security Support for the
Linux Kernel},
booktitle = {USENIX Security Symposium},
year = {2002},
pages = {17-31},
ee = {http://www.usenix.org/publications/library/proceedings/sec02/wright.html},
crossref = {DBLP:conf/uss/2002},
bibsource = {DBLP, http://dblp.uni-trier.de}
}
@proceedings{DBLP:conf/uss/2002,
editor = {Dan Boneh},
title = {Proceedings of the 11th USENIX Security Symposium, San Francisco,
CA, USA, August 5-9, 2002},
booktitle = {USENIX Security Symposium},
publisher = {USENIX},
year = {2002},
isbn = {1-931971-00-5},
bibsource = {DBLP, http://dblp.uni-trier.de}
}

Summary:

The paper describes the motivation, design, and some of the implementation of the Linux Security Modules (LSM) framework. The LSM framework allows many security models to be implemented as loadable Linux kernel modules. Rather than only interposing on system calls, LSM provides hooks that ask a module whether access to an internal kernel object should be granted or denied. Almost all the LSM hooks are restrictive rather than authoritative: they allow a module to deny a request that the Linux DAC mechanism would have allowed, rather than to override a denial.

To support POSIX.1e capabilities, LSM provides some minimal support for permissive hooks that do override a denial. LSM is limited to supporting the core access control functionalities needed by existing Linux security projects, rather than adding auditing and virtualization.

LSM adds opaque void * security fields to various internal kernel objects that a module can use to label them. It also allows module stacking, where the primary module has the final say on whether to allow or deny an access request.
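As a conceptual illustration (in Python rather than the kernel's C, and with invented names), a restrictive hook can only narrow what the DAC check already allows, and stacked modules are consulted after the primary one:

# Conceptual sketch of the restrictive-hook idea, not the kernel's actual code.
def dac_allows(task, obj, op):             # the ordinary Linux permission check
    return op in obj.get("mode", set())

def lsm_hook(task, obj, op, modules):
    if not dac_allows(task, obj, op):
        return False                       # restrictive hooks cannot grant this back
    for m in modules:                      # primary module first; stacked ones after
        if not m(task, obj, op):
            return False
    return True

deny_writes_to_secret = lambda task, obj, op: not (obj.get("secret") and op == "write")
obj = {"mode": {"read", "write"}, "secret": True}
print(lsm_hook("task", obj, "write", [deny_writes_to_secret]))  # DAC allows, module denies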

To implement LSM, the kernel was modified in 5 main ways:
  1. Opaque security fields were added to objects, and hooks were defined to instantiate, free, and update them. Note that special handling is required for objects that already exist before the security module is loaded.
  2. Security Function Hooks were added in key accesses in the kernel. These are implemented by the module.
  3. A security System Call was added to allow security aware userspace applications to interact with the security module. The system call implementation is up to the security module.
  4. Registering a security module requires a special API to activate the hooks. Other calls were added so that modules can register themselves with others and stack.
  5. Capabilities were modified to reduce the capable() call to a wrapper around an LSM hook. Moving the capabilities bit vector from the task_struct to the opaque security field and modifying the system call interface are the only steps left in making capabilities completely standalone.


Additional hooks were provided
  1. for working with tasks (nice, kill, setuid)
  2. for program loading and controlling inheritance of state across program executions (such as file descriptors)
  3. for IPC
  4. for file ops (read, write, sockets)
  5. for network ops (devices, syscalls, sk_buffs)
  6. for module operations (create, register, delete)
  7. for system operations (hostname, accessing I/O ports, process accounting)
They performed some evaluation (LMBench and kernel compilation) and, surprisingly, kernels patched with LSM seem to perform slightly better than unpatched ones. The authors attribute this to experimental error and conclude only that LSM's overhead is unnoticeable.

LSM provides no persistent storage of security attributes for files, since that requires extended attributes, a complex issue. They decided not to completely modularize the Linux DAC security checks, since that would be too invasive.

The Good:
This is really cool. It is now very easy to implement new security models. While it's true that not all models are possible, it is still a great leap forward. The interface seems simple and workable, and they have interposed on the major kernel functions.

The Bad:
Unfortunately, the paper makes the implementation look a little arbitrary. There does not seem to be a systematic approach to deciding what needs to be looked at, how many hooks there are, and where they should be. They mention that access control has yet to be completely modularized to provide authoritative hooks. LSM only provides access control; unfortunately, there is no interposition except on access requests. It is not clear whether there has been any interposition on context switching or on cache control. No mention of interposing on memory or CPU resource allocations either.

Thoughts on A Scalable, Commodity Data Center Network Architecture

Authors: Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat

Venue: SIGCOMM 2008

Summary:
The authors make the case that large data-center networks are expensive and that cost increases non-linearly with size. Current design techniques do not fully utilize fan-out and multipath, and when they do, it is either at the cost of compatibility (e.g., Myrinet and InfiniBand are not Ethernet/TCP-IP compatible, and ECMP with randomization reorders packets) or complexity (ECMP with region splitting needs roughly 600k TCAM entries for a 25k-host network).

The authors demonstrate how to design a fat-tree data-center network using "commodity" Ethernet/IP switches and routers. They give switches IP addresses based on their location in the fat tree, making it easier to do routing and select paths. They present three methods of packet diffusion on the upward path in the fat tree:
  1. Static two-level table routing based on host ID: simple, but performs the worst, non-dynamic, needs extra work when a link dies to send updates everywhere. Host IDs may not provide sufficient entropy depending on the communication pattern (hence the bad performance). Tables (TCAMs), however, only need k<=48 entries, which is remarkably cheap, and does not in fact need a discrete component.
  2. Flow classification: switches monitor flow sizes and periodically reassign a few flows to balance usage across ports. This is only a local optimization and does not avoid hotspots in the core, but it performs better than static two-level routing.
  3. Central Flow Scheduler: Edge switches monitor flow sizes and notify a central scheduler when a flow goes above a certain threshold. The scheduler then reassigns the flow's path to the minimally loaded links.
In all three cases, changes to the switches are needed to support the designs. First, something like OpenFlow is needed for central management and routing, because they did not come up with a distributed routing protocol, and it helps implement (3) above. Second, a change in the lookup mechanism is needed to support the two-level tables.
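A minimal sketch of the two-level lookup idea, assuming made-up addresses and a toy suffix spread over two uplink ports (not the paper's exact table format):

# Two-level lookup: a primary prefix table whose default entry falls through to a
# secondary suffix table, so "upward" traffic is spread by host ID (last octet).
import ipaddress

prefix_table = [
    (ipaddress.ip_network("10.2.0.0/24"), "port0"),   # traffic for a local subnet
    (ipaddress.ip_network("10.2.1.0/24"), "port1"),
    (ipaddress.ip_network("0.0.0.0/0"), "SUFFIX"),    # everything else: use suffix table
]
suffix_table = {2: "port2", 3: "port3"}               # host-ID suffix -> uplink port

def route(dst):
    addr = ipaddress.ip_address(dst)
    for net, action in prefix_table:
        if addr in net:
            if action != "SUFFIX":
                return action
            host_id = int(dst.split(".")[-1])
            return suffix_table[host_id % len(suffix_table) + 2]
    return None

print(route("10.2.0.7"), route("10.9.3.5"))  # local prefix match vs. suffix-spread uplink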

Using some back-of-the-envelope calculations, traditional networks use almost twice the power and dissipate almost twice the heat. This is because they use switches with 10GigE uplinks and 10GigE switches, and a 10GigE switch uses approximately double the power per Gbps and dissipates 3x the heat of a 1GigE switch. The calculation was done for the largest networks they support (~27k hosts).

They implemented the design on NetFPGA but gave no performance numbers. They also incorrectly assumed that because it was easy to modify the NetFPGA, it would be easy to modify other switches.

They also implemented the three designs above in software using Click, and measured the throughput. In all cases, the hierarchical traditional design did worst, then the two-level table, then flow classification, while flow scheduling did best.

The authors also investigated a packaging solution, proposing to put all the pod switches in a central rack and have all pod hosts in other racks connect to it. Core switches are then divided equally among the pods, and the pods are laid out in a grid. This bundles up the cables running between pods and between racks. The switch rack itself can have built-in wiring to simplify cabling inside the pod switch rack.

The Good:
This is work that has needed to be done for a while now. It is surprising no one has done this research before, given that most of the algorithms used are simple and intuitive. They do not give enough credit to centralization in their work, probably so they don't upset some people. It is noteworthy that all of the work would have been much more complex (if not impossible) without centralization. That is why it is a little unfair to compare the work with OSPF2-ECMP, which is fully distributed.

The work is feasible, and seems not too difficult to implement or build. On the whole I really like the paper and work.

The Bad:
Even though they claim to use "commodity" switches, the switches needed are not commodity. Depending on the algorithm used, they either need two-level table support (a possible hardware change, though not necessarily) or OpenFlow support (mostly software). They call the switches "commodity" because they do not use aggregated uplinks (such as 10GigE ports or switches). If the hardware modification is needed, it is not clear it can happen. OpenFlow, on the other hand, is quite likely to happen and is already on the way. Given these new requirements and the changes from the fully distributed hierarchical design, they should have mentioned more openly how their infrastructure differs from traditional infrastructure. You cannot really go buy a bunch of switches from Fry's and use them.

The calculations for power/heat and costs are all back-of-the-envelope. I am also uncertain of the fairness of the comparisons. The higher-end, more expensive, and more power-hungry switches are more robust and provide much more functionality than the lower-end switches they use. A good question is: does the network they build with cheap switches offer all the features an admin expects from a traditional network?

While difficult to do, it would have been nice to see a concrete example that includes cabling, packaging, and operation costs. In their throughput measurements, they use artificial traffic patterns that do not take into account the bursty nature of traffic. Their algorithms also assume "Internet-like" traffic with heavy tails. It would be interesting to see how their algorithms fare with real data-center traces; it is surprising they could not get any. I expect a data center's traffic burstiness to significantly change their results.

It is also unfortunate that they have not tried to implement their solutions on any real hardware, so it is not clear how easy it is to modify currently available infrastructure. It doesn't seem that the whole DC network needs to be overhauled, though.

While it is a point they made in the paper, the wiring complexity seems quite horrible, and I am not sure the solution they propose fixes the problem. Perhaps one needs to look at the problem a level higher up and consider building chassis that house pods...