Thursday, June 25, 2009

Thoughts on Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises

Authors: Changhoon Kim, Matthew Caesar, and Jennifer Rexford

Reading Detail Level: Quick Read

author = {Kim, Changhoon and Caesar, Matthew and Rexford, Jennifer},
title = {Floodless in seattle: a scalable ethernet architecture for large enterprises},
booktitle = {SIGCOMM '08: Proceedings of the ACM SIGCOMM 2008 conference on Data communication},
year = {2008},
isbn = {978-1-60558-175-0},
pages = {3--14},
location = {Seattle, WA, USA},
doi = {},
publisher = {ACM},
address = {New York, NY, USA},


The main idea is to use the switches to form a single-hop DHT. The DHT is accessed by using consistent hashing (function F) to minimize churn when switches fail or recover and keys are re-hashed. SEATTLE switches do the following things:
  1. They use a link-state algorithm to build the network topology (uses broadcast but only for switches)
  2. They form a one-hop DHT that can be used to store (k,v) pairs.
  3. They also participate in a multi-level DHT so that a large network such as an ISP can be divided into regions and queries to other regions get encapsulated at border switches when crossing regions.
  4. For every k a switch inserts, it keeps track of whether the resolver changes (maybe because it died or a new one came up), and inserts the (k,v) at the new location.
  5. If a switch dies, each switch will scan through its list of stored (k,v) and if the dead switch is a stored value, that entry is deleted.
  6. They advertise themselves as multiple virtual switches so that more powerful switches would appear as several virtual switches and would handle a larger load in the DHT.
  7. When a host arrives:
    1. The access switch snoops to find the host's IP and MAC addrs
    2. The access switch stores MAC -> (IP, location) at switch R=F(MAC) in the DHT
    3. The access switch stores IP -> (MAC, location) at switch V=F(IP) in the DHT
  8. To resolve an ARP request, an access switch uses the DHT's IP->(MAC, location) and caches the result so packets go directly to the target location without additional lookups.
  9. If a packet is sent to a MAC address which the access switch does not know:
    1. The packet is encapsulated and sent to switch R=F(MAC).
    2. R forwards the packet to the right location.
    3. R sends the info to the access switch
    4. The access switch caches the info
  10. When a host's MAC and/or IP addresses and/or location change:
    1. The switch at which the host gets attached (or remains) updates the info in the DHT.
    2. For a location change, the old access switch is told of the change; so, if it receives packets to the old address, it informs the sending access switch of the change.
    3. For a MAC address change, 1) the attached host maintains a revocation list (IP, MACold, MACnew) for the host. If a host is using the old MAC address, the switch sends a gratuitous ARP to that host. 2) The switch tells the other switch of the new MAC->location mapping so it doesn't do a lookup.
  11. SEATTLE separates unicast reachability from broadcast domains. It uses groups---a set of hosts that share the same broadcast domain regardless of location. Switches somehow (not clear now) know what groups a host belongs to, and use a protocol similar to IP multicast to graft a branch to an existing multicast tree.
The DHT can also store mappings from service names to locations (e.g. DHCP_SERVER and PRINTER). This is how they handle DHCP without broadcasts.

For evaluation, the authors implemented SEATTLE (on XoRP and Click), and used traces from several locations to run simulations. They compared with Ethernet and ROFL showing that SEATTLE is superior then both in many ways.

The Good:
The architecture is beautiful in many ways. The DHT approach seems very powerful indeed. I don't know of any other solutions to the broadcasting problem, and none that can handle arbitrary L2 topologies. I liked the notions of virtual switches and groups (with caveats below). The evaluation section is nice, though not perfect.

The Bad:
No paper is perfect, unfortunately.
  • While the authors claim to eliminate broadcast, they do use link state updates which are broadcast, even though these do not happen as often as ARP.
  • Virtual switches are nice, but no analysis of their side-effects exist or how difficult it is to implement them.
  • The discussion on areas was completely confusing.
  • The discussion on groups was too abstract. How do switches know what hosts are in what groups? I was not able to visualize how groups would be implented from the discussion, and so I have to say they seem too difficult.
  • The evaluation section is missing a critical piece. The paper does not mention what has been implemented and what is missing. This makes the evaluation much less credible. How difficult was the implementation? How does it compare to other designs? What is the cost comparison?
  • The per-packet processing cost was not convincing because it was not clear what was implemented and what not.
Nothing was said on multipath, but that was not the point of the paper anyway.

Sunday, June 14, 2009

Thoughts on Linux Secutiry Modules: General Security Support for the Linux Kernel

Authors: Chris Wright, Crispin Cowan, Stephen Smalley, James Morris, and Greg Kroah-Hartman


author = {Chris Wright and
Crispin Cowan and
Stephen Smalley and
James Morris and
Greg Kroah-Hartman},
title = {Linux Security Modules: General Security Support for the
Linux Kernel},
booktitle = {USENIX Security Symposium},
year = {2002},
pages = {17-31},
ee = {},
crossref = {DBLP:conf/uss/2002},
bibsource = {DBLP,}
editor = {Dan Boneh},
title = {Proceedings of the 11th USENIX Security Symposium, San Francisco,
CA, USA, August 5-9, 2002},
booktitle = {USENIX Security Symposium},
publisher = {USENIX},
year = {2002},
isbn = {1-931971-00-5},
bibsource = {DBLP,}


The paper describes the motivation, design, and some of the implementation of the Linux Security Modules framework. The LSM framework allows many security models to be implemented as loadable Linux kernel modules. Rather than only interposing on system calls, LSM provides hooks for modules asking whether an access to an internal kernel object should be granted or denied. Almost always, the LSM hooks are restrictive rather than authoritative. That is, they allow a module to deny a request that the Linux DAC mechanism would have allowed rather than override a denial.

To support POSIX.1e capabilities, LSM provides some minimal support for permissive hooks that do override a denial. LSM is limited to supporting the core access control functionalities needed by existing Linux security projects, rather than adding auditing and virtualization.

LSM add opaque void * security fields in various internal kernel objects that the module can use to label them. It also allows module stacking where the primary module has the final say on whether to allow or deny an access request.

To implement LSM, the kernel was modified in 5 main ways:
  1. Opaque Security Fields were added to objects, and hooks were defined to instantiate, free, and update them. Note that special handling is required for object the already exist before the security module is loaded.
  2. Security Function Hooks were added in key accesses in the kernel. These are implemented by the module.
  3. A security System Call was added to allow security aware userspace applications to interact with the security module. The system call implementation is up to the security module.
  4. Registering security modules requires a special API to activate the hooks. Other were added so that modules can register themselves with others and stack.
  5. Modify capabilities to reduce the capable call to a wrapper for a LSM hook. Moving the capabilities bit vector from the task_struct to the opaque security field and modifying the system call interface are the only steps left in making capabilities completely standalone.

Additional hooks were provided
  1. for working with tasks (nice, kill, setuid)
  2. for program loading and controlling inheritance of state across program executions (such as file descriptors)
  3. for IPC
  4. for file ops (read, write, sockets)
  5. for network ops (devices, syscalls, sk_buffs)
  6. for module operations (create, register, delete)
  7. for sytem operations (hostname, accessing I/O ports, process accounting)
They performed some evaluation (LMBench and kernel compilation) and surprisingly, kernels patched with LSM seem to perform slightly better than unpatched ones. The authors claim this is experimental error and only consider that LSM has an unnoticeable overhead.

LSM provides no persistent storage of security attributes to files since that requires extended attributes, a complex issue. They decided not to completely modularize the Linux DAC security checks since that would be too invasive.

The Good:
This is really cool. It is now very easy to implement new security models. While it's true that notall models are possible, it is still a great leap forward. The interface seems simple and workable and they have interposed on the major kernel functions.

The Bad:
Unfortunately, the paper makes the implementation look a little arbitrary. There does not seem to be a systematic approach into what needs to be looked at, how many hooks there are, and where they should be. They already mentioned that access control has yet to be completely modularized to provide authoritative hooks. LSM only provides access control, unfortunately there is no interposition except on access request. It is not clear whether there has been any interposition on context switching or on cache control. No mention of interposing on memory or CPU resource allocations either.

Thoughts on A Scalable, Commodity Center Network Architecture

Authors: Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat


The authors make the case that large data-center networks are expensive and that cost increases non-linearly with size. Current design techniques do not fully utilize fan-out and multipath, and when they do it's either at the cost of compatibility (Myrinet != Ethernet/Infiniband non-TCP/IP /ECMP with randomization reorders packets) or complexity (ECMP with region splitting needs 600k TCAM entries for 25k net).

The authors demonstrate how to design a fat-tree data-center network using "commodity" Ethernet/IP switches and routers. They give switches IP addresses based on their location in the fat tree, making it easier to do routing and select paths. They presented three methods of packet diffusion on the upward path in the fat-tree:
  1. Static two-level table routing based on host ID: simple, but performs the worst, non-dynamic, needs extra work when a link dies to send updates everywhere. Host IDs may not provide sufficient entropy depending on the communication pattern (hence the bad performance). Tables (TCAMs), however, only need k<=48 entries, which is remarkably cheap, and does not in fact need a discrete component.
  2. Flow Classification: switches monitor flow sizes and periodically reassign a few flows to balance usage across ports. Only local optimization and does not avoid hotspots in the core. Performs better than
  3. Central Flow Scheduler: Edge switches monitor flow sizes and notify a central scheduler when a flow goes above a certain threshold. The scheduler then reassigns the flow's path to the minimally loaded links.
In all three cases change is needed to the switches in order to support the designs. First, something like OpenFlow is needed to do central managing and routing because they did not come up with a distributed routing protocol, and it helps implement (3) above. Second, a change in the lookup mechanism is needed to support the two-level tables used.

Using some back-of-the-envelope calcuations, traditional networks use almost twice the power and dissipate almost twice the heat. This is because they use switches with 10GigE uplinks and 10 GigE switches, and a 10GigE switch uses approximately double the power per Gbps and disspates 3x the heat of a 1GigE switch. The calculation was done for the largest networks they support (~27k hosts).

They implemented the design on NetFPGA but gave no performance numbers. They also incorrectly assumed that because it was easy to modify the NetFPGA, it would be easy to modify other switches.

They also implemented the three designs above in software using Click, and measured the throughput. In all cases, the hierarchical traditional design did worst, then the two-level table, then flow classification, while flow scheduling did best.

The authors also investigated a packagin solution, proposing to put all the pod switches in a central rack and have all pod hosts in other racks connect to it. Then, core switches are equally divided among the pods, and the pods are laid out in a grid. This bundles up the cables that are going between pods and the cables between racks. The switch rack itself can have built in wiring to simplify cabling inside the pod switch rack.

The Good:
This is work that has needed to be done for a while now. It is surprising no one has done this research before given that most of the algorithms used are simple and intuitive. They do not give enough credit to centralization for their work, probably so they don't upset some people. It is noteworthy that all of the work would have been much more complex if they did not use centralization (if not impossible). That is why it is a little unfair to compare the work with OSPF2-ECMP which is fully distributed.

The work is feasible, and seems not too difficult to implement or build. On the whole I really like the paper and work.

The Bad:
Even though they claim they use "commodity" switches, the switches needed are not "commodity". Depending on the algorithms used, they either need two-level table support (possible hardware change though not necessarily) or OpenFlow-support (software mostly). They call them "commodity" because they do not use aggregated uplinks (such as 10GigE ports or switches). If the hardware modification is needed, it is not clear it can happen. OpenFlow, on the other hand, is quite likely to happen and is already on the way. Given these new needs and changes from the fully-distributed hierarchical design, they should have mentioned more openly how their infrastructure differs from traditional infrastructure. They cannot really go buy a bunch of switches from Fry's and use them.

The calculations for power/heat and costs are all back-of-envelope calculations. I am also uncertain of the fairness in comparisons. The higher-end more expensive switches and more power-consuming switches are more robust and provide much more functionality and features than the lower-end switches they use. A good question is, does the network they build with cheap switches offer all the same features an admin expects from a traditional network?

While difficult to do, it would have been nice to see a concrete example that includes cabling, packaging, and operation costs. In their throughput calculations, they use artificial traffic patterns that do not take into account the bursty nature of traffic. Their algorithms also assume "Internet-like" traffic with heavy tails. It would be interesting to see how their algorithms fair with real data center traces. It is surprising they could not get any. I expect a data-center's traffic burstiness to significantly change their results.

It is also unfortunate that they have not tried to implement their solutions on any real hardware. So it is not clear how easy it is to modify currently available infrastructure. It doesn't seem that the whole DC network needs to be overhauled.

While it is a point they made in the paper, the wiring complexity seems to be quite horrible, and I am not sure the solution they propose fixes the problem. Perhaps one needs to look at the problem a higher level up and consider building chassis that house pods...