Sunday, July 26, 2009

Thoughts on HAIL: A High Availability and Integrity Layer

author = {Kevin D. Bowers and Ari Juels and Alina Oprea},
title = {HAIL: A High-Availability and Integrity Layer for Cloud Storage},
howpublished = {Cryptology ePrint Archive, Report 2008/489},
year = {2008},
note = {\url{}}, }
The paper describes a system that distributes redundant blocks of a file across multiple servers and allows a client to verify that the file is not corrupted, even when an attacker can compromise servers and eventually gain access to all of them. It allows the client to efficiently obtain proofs of retrievability (PORs) from the servers.

HAIL does this by adding what the authors term IP-ECC: integrity-protected error-correcting codes. These are basically error-correcting codes with an embedded MAC. They are added to each block of the file, and a server can then compute a concise aggregate MAC to prove to the client the existence and integrity of some blocks of a file.
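As a toy illustration of the aggregate-MAC idea (not the paper's actual IP-ECC construction, which embeds the MAC inside the error-correcting code), here is a sketch of a client challenging a few block indices and checking a single XOR-aggregated tag, so the proof stays concise no matter how many blocks are challenged:

```python
import hmac, hashlib, os, secrets

BLOCK = 16  # toy block size in bytes

def mac(key, index, block):
    # per-block MAC, bound to the block's position in the file
    return hmac.new(key, index.to_bytes(8, "big") + block, hashlib.sha256).digest()

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# --- setup: client MACs each block before uploading ---
key = secrets.token_bytes(32)
blocks = [os.urandom(BLOCK) for _ in range(64)]
tags = [mac(key, i, b) for i, b in enumerate(blocks)]   # stored alongside the file

# --- challenge: client picks indices; server folds the tags into one proof ---
challenge = [3, 17, 42]
agg_tag = tags[challenge[0]]
for i in challenge[1:]:
    agg_tag = xor(agg_tag, tags[i])                     # one 32-byte aggregate

# --- verify: client recomputes the aggregate from the returned blocks ---
expected = mac(key, challenge[0], blocks[challenge[0]])
for i in challenge[1:]:
    expected = xor(expected, mac(key, i, blocks[i]))
assert hmac.compare_digest(agg_tag, expected)
```

The real scheme additionally spreads parity across servers so corrupted blocks can be repaired, not just detected.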

Lots of proofs and cryptospeak, most of which I skipped over. They use standard constructions mostly and put them together.

In terms of performance, the system is slow. In terms of fault-tolerance, the system can handle Byzantine failures where a third of the servers are faulty/compromised. In addition, the files are not lost.

The Good:
Basically secure RAID for the cloud. The servers themselves are untrusted, there is redundancy, and files are stored securely. If one storage provider dies, the files can still be accessed from other locations. The system is also robust against modifications and includes integrity checks.

The Bad:
  • Performance is really slow and they didn't compare with other systems.
  • Are storage providers dying really the worst case scenario such that all this overhead and work needs to be done? This seems like a very heavy hammer.
  • Legal recourse + MACs seem easier to do. For example, sign an SLA so that the storage provider has more to lose by corrupting your data or being unavailable than you do.

Thoughts on Building Castles out of Mud: Practical Access Pattern Privacy and Correctness on Untrusted Storage

author = {Williams, Peter and Sion, Radu and Carbunar, Bogdan},
title = {Building castles out of mud: practical access pattern privacy and correctness on untrusted storage},
booktitle = {CCS '08: Proceedings of the 15th ACM conference on Computer and communications security},
year = {2008},
isbn = {978-1-59593-810-7},
pages = {139--148},
location = {Alexandria, Virginia, USA},
doi = {},
publisher = {ACM},
address = {New York, NY, USA},

Read the intro, looked at the performance.

This work addresses the problem of hiding access patterns from a storage provider. Not much work has been done in the area, and previous work was too expensive. This work gets queries down to the order of 100s of ms. Some of the cost is due to the implementation (lack of parallelism, use of Java). On the other hand, the protocol itself requires multiple round trips, so the cost would still be too high. This seems like a problem not worth solving, at least for now. Get the low-hanging fruit first: access control, controlled sharing, ...

Thoughts on FAWN: A Fast Array of Wimpy Nodes

Authors: David G. Andersen (Carnegie Mellon University), Jason Franklin (Carnegie Mellon University), Michael Kaminsky (Intel Research Pittsburgh), Amar Phanishayee (Carnegie Mellon University), Lawrence Tan (Carnegie Mellon University), Vijay Vasudevan (Carnegie Mellon University)

Venue: SOSP 2009

The paper describes FAWN, a system for storage using a large number of small, low-performance (hence wimpy) nodes that have moderate amounts of local storage. The system has two parts: FAWN-DS and FAWN-KV.
FAWN-DS is the backend that consists of the large number of nodes. The motivation is simple: I/O is the current bottleneck and current storage is inefficient.
  • high speed CPUs consume too much power, and CPU scaling doesn't work very well b/c of high leakage.
  • 2GB DRAM uses as much power as 1 TB disk.
  • power cost = 50% 3-yr TCO of cluster
Each node in FAWN-DS has some RAM and Flash. The storage is log-structured key-value based:
  • Each node can appear as multiple vnodes, where each vnode has its own data store. This helps in management, and doesn't decrease performance b/c with a small number of data stores, writes become semi-random writes.
  • Part of the key is used to lookup in an in-RAM index.
  • The index stores a key fragment.
  • If the key fragment matches the lookup key, then check the full key in flash
  • If the flash matches, then we have found the entry
  • Otherwise, use hash chaining to get next location
The store is log structured:
  • writes and deletes are just appends to the store
  • lookups are as above
  • Every once in a while the store is compacted and check-pointed
  • Other ops the DS node can do are merge and split data stores
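The lookup path and log-structured store described in the two lists above can be sketched together as one toy class (the index sizes, the use of a Python list as stand-in flash, and the names are all illustrative, not the paper's):

```python
import hashlib

class FawnDS:
    """Toy FAWN-DS-style store: an append-only log (stand-in for flash) plus
    a small in-RAM index that keeps only a key fragment and a log offset."""

    INDEX_BITS = 12   # low bits of the key hash pick an index bucket
    FRAG_BITS = 16    # hash fragment kept in RAM to filter candidates

    def __init__(self):
        self.log = []      # "flash": list of (key, value, deleted) records
        self.index = {}    # bucket -> chain of (fragment, log offset)

    def _hash(self, key):
        return int.from_bytes(hashlib.sha1(key).digest()[:8], "big")

    def _append(self, key, value, deleted):
        # writes and deletes are both just appends to the log
        h = self._hash(key)
        off = len(self.log)
        self.log.append((key, value, deleted))
        bucket = h & ((1 << self.INDEX_BITS) - 1)
        frag = (h >> self.INDEX_BITS) & ((1 << self.FRAG_BITS) - 1)
        self.index.setdefault(bucket, []).append((frag, off))

    def put(self, key, value):
        self._append(key, value, False)

    def delete(self, key):
        self._append(key, None, True)   # tombstone, reclaimed at compaction

    def get(self, key):
        h = self._hash(key)
        bucket = h & ((1 << self.INDEX_BITS) - 1)
        frag = (h >> self.INDEX_BITS) & ((1 << self.FRAG_BITS) - 1)
        # walk the chain newest-first; on fragment match, check full key in "flash"
        for f, off in reversed(self.index.get(bucket, [])):
            if f != frag:
                continue                      # fragment mismatch: next in chain
            k, v, deleted = self.log[off]
            if k == key:                      # full-key check against the log
                return None if deleted else v
        return None
```

Compaction would rewrite the log keeping only the newest live record per key; it's omitted here.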
FAWN-KV uses a DHT similar to Chord, but with no DHT routing. There is a three level management hierarchy:
  • Key value ranges are split up and assigned to front-end nodes by management nodes.
  • Front-end nodes handle requests and keep track of which backend nodes have which keys:
    • If one node receives a request and the req is not in its range then it forwards the request to the right node
    • If the req is within the range, it forwards the request to the correct back-end node
    • Front-ends have a small RAM cache they use to buffer requests and avoid hot-spots
  • Back-end nodes implement the FAWN-DS:
    • When a new k,v write arrives, the write is forwarded in a replication chain along the chord ring.
    • The last node in the chain (the tail) acks the req back to the frontend and handles all lookup requests.
    • So each node is in R chains: once as head, once as tail and R-2 as middle node.
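The replication chains above can be sketched as follows (a toy illustration, not the paper's actual protocol; membership changes and failures are ignored):

```python
class ChainNode:
    """One node in a key range's replication chain: writes flow head -> tail,
    the tail acks back to the front-end, and only the tail serves lookups."""

    def __init__(self, name):
        self.name = name
        self.store = {}
        self.next = None            # successor in this key range's chain

    def put(self, key, value):
        self.store[key] = value     # apply locally, then forward down the chain
        if self.next is not None:
            return self.next.put(key, value)
        return ("ack", self.name)   # tail acknowledges the write

    def get(self, key):
        assert self.next is None, "lookups go to the tail"
        return self.store.get(key)

# wire up a 3-node chain for one key range: a (head) -> b -> c (tail)
a, b, c = ChainNode("a"), ChainNode("b"), ChainNode("c")
a.next, b.next = b, c
```

Reads at the tail see only fully replicated writes, which is what makes the ack-from-the-tail ordering safe.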
Joins and leaves:
  • When a node joins, all nodes split their key ranges.
  • For each range, the node gets the DS from the current tail
  • The node is linked into the replication chain
  • The node stores all updates in a temp log DS
  • The node gets all updates between when copy was done and when it started receiving new updates
  • All updates are merged into the permanent log DS, and node becomes fully on. Old tail can delete old DS.
  • For head join, need to coordinate with front-end
  • For tail join, node inserted before old tail, and predecessor serves lookups
  • When node leaves, merge range, add new node to replication chain as in regular join
Failures are assumed fail-stop. Communication failures can't be handled. More work remains to be done on detection.

  • FAWN-DS has quite a bit of overhead over raw filesystem
  • For lookup overhead is 38% for 1KB queries, 34% for 256B queries. i.e. about 62-66% throughput
  • For store, 96% throughput achieved b/c log structured
  • semi-random write performance is highly dependent on the flash device used
  • They compare their performance to BDB, which is abysmal on flash, to say the least. But I don't get the point: BDB is not optimized for flash. I guess the point is that you will need something exactly like FAWN-DS if you want to use flash. What else could they have done to get a grip on how well FAWN-DS does?
  • Small random writes do much better than random reads.
  • FAWN-KV throughput is about 80% of FAWN-DS single node throughput due to network overhead+marshalling/unmarshalling
  • Cluster with no front-end does 364 queries/joule
  • Including frontend: 330 queries/joule
  • Current network overhead = 20% and increasing as cluster size increases.
  • Median latency < 1ms, 90% < 2ms, 99.9% < 26.3ms
  • With split, median still < 1ms, 99.9% < 611ms under high load
  • Desktop does 52 queries/joule, 3-node FAWN does 240 with half of the idle power consumed by switch.
  • They have a nice figure illustrating the minimal-cost solution space depending on dataset size and query rate. Almost all of it goes to FAWN with HD, DRAM, or SSD; only a small part goes to Server+DRAM.
In general: a higher query rate pushes towards DRAM; a bigger dataset pushes towards SSD, then disk for really large datasets with low throughput. Traditional + DRAM has a very small slice for high query rates and datasets that are a little larger (the assumption is each FAWN node only has 2GB DRAM).

Interesting tidbits:

Facebook has a put:get ratio of 1:6.

The good:
The paper in general is excellently executed with a thorough eval section. The related work section is also quite thorough and is a good starting point for someone interested in the space. The paper has many interesting tidbits and facts.

The FAWN system itself is also very nice and simple (at least as they explained it). They seem to have picked the right level of abstraction at which to explain it. Performance seems comparable to traditional systems at a much lower cost and higher efficiency. Their Figure 15 is beautiful. They basically convinced me to build my next DC using FAWN.

The Bad:
They didn't go into details on failures. I would have preferred a little less on impact of maintenance ops and more on failures and reliability. That said, I don't really see why failures would be a major problem.

Thursday, July 23, 2009

Thoughts on CLAMP: Practical Prevention of Large-Scale Data Leaks

author = {Bryan Parno and Jonathan M. McCune and Dan Wendlandt and
David G. Andersen and Adrian Perrig},
title = {{CLAMP}: Practical Prevention of Large-Scale Data Leaks},
booktitle = {{Proc. IEEE Symposium on Security and Privacy}},
address = {{Oakland, CA}},
month = may,
year = {2009}
The authors describe a web platform that enforces isolation between code running on behalf of different users, in both execution and data. They do this as an extension of LAMP (Linux+Apache+MySQL+Perl/PHP): code for a user runs in a dedicated VM and can access only a subset of the DB---the rows that the user owns. This decreases the size of the Trusted Computing Base (TCB).

The system has 5 main components:
The Dispatcher: Receives SSL/TCP connections and muxes them onto VMs (called WebStacks) that run code on behalf of the user.
The WebStack: Basically a light-weight VM.
The Query Restrictor (QR): Rewrites/restricts queries and updates from the WebStack so that they only touch rows the user owns.
The User Authenticator (UA): Authenticates a user, and labels a WebStack with a user id and role. Creates a mapping from VM to UID in the QR.
The DB: Stores all the data.
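As a toy illustration of the QR's row-level restriction (CLAMP's actual mechanism is different and heavier; the `owner_id` column and the string rewriting here are made up for the sketch):

```python
def restrict_query(sql, table, uid):
    """Narrow a SELECT on a user-owned table to rows owned by `uid`.
    Toy sketch only: assumes a simple single-table SELECT; a real
    restrictor needs a proper SQL parser, not string matching."""
    sql = sql.rstrip().rstrip(";")
    if " where " in sql.lower():
        # existing predicate: AND in the ownership check
        return f"{sql} AND {table}.owner_id = {int(uid)}"
    # no predicate yet: add one
    return f"{sql} WHERE {table}.owner_id = {int(uid)}"
```

The point is that the WebStack never sees unrestricted queries succeed: even a fully compromised WebStack can only name its own user's rows, because the QR sits between it and the DB.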

They study the effect of compromising each component, as well as the isolation layer (hypervisor). They port three applications to their system, and it turns out the applications don't need to change much, since usually the only modification needed is to the login process.

- Overheads ranged from 10-20%. For operations that already take 10s of ms, the overhead is tolerable. However, MySQL views incur an additional overhead of 50%! They claim other DBs fare better, at around 7% for some.
- Requires too much memory per user: with 6GB, it can support about 500 users
- Can handle only 2 logins/sec compared to the original 85 logins/s. Huh?
- Performance of unoptimized prototype about 42% of native.

The Good:
- Isolation and easy to convert other systems to CLAMP.
- To attack the system, at least two components need to be compromised.
- Compromising one user does not automatically grant access to other users

The Bad:
- Ok, so they do isolation between users. But what happens when it is the same bug getting exploited for all users? They say that by isolating users, they will limit the effects of the compromise to that user, but if the same bug manifests in all the webstacks, then they have not solved the problem. They only solve the problem where one user's account/password gets compromised.
- I don't really see the big deal here. A very similar thing can be done with d-star, you just need some generic wrappers if you don't want fine-grained isolation. Here they used a VM instead of a wrapper. Wrappers are much smaller and can provide better security.
- Their attacks discussion did not include dom0.
- Their overheads were really high, and the system is very limited. They equate their system with SSL, saying that SSL also has overhead yet institutions are willing to pay for it. But come on: SSL has much lower overhead and much higher performance. I mean, 2 logins/second? A reduction of 40x? Even assuming the code can be optimized and performance doubles, that is still too low.

Then again, I'm not an expert. So what do I know?

Sunday, July 19, 2009

Thoughts on The Case for Enterprise-Ready Virtual Private Clouds

Authors: Timothy Wood and Prashant Shenoy, University of Massachusetts Amherst; Alexandre Gerber, K.K. Ramakrishnan, and Jacobus Van der Merwe, AT&T Labs—Research
Venue: HotCloud '09

The authors propose a way to connect private clouds to a subset of resources in a public cloud using VPNs such that the two appear as one cloud.

Thoughts on Towards Trusted Cloud Computing

Authors: Nuno Santos, Krishna P. Gummadi, Rodrigo Rodrigues (Max Planck Institute for Software Systems)
Venue: HotCloud '09

They cite a survey where executives and IT people say they don't trust the cloud because they can't control it and they fear for the confidentiality of their data. Even though cloud services always take steps to ensure the customer's data security, any admin with root access to a machine can observe data in memory. The authors cite Terra as a system where a machine can prove its integrity to a VM user. The authors extend this idea to Infrastructure as a Service (IaaS) where the whole service is a big black box.

Attack model: No physical access to machine.

- All nodes run a Trusted VMM as in "Improving Xen Security Through Disaggregation"
- There exists an external trusted entity (ETE), like Verisign, that provides a TC service which keeps track of the trusted VMs in a cluster. It has to be external so the sysadmins of the IaaS can't tamper with it.
- The TC can attest that the IaaS is providing a secure service and the TC coordinates with the TVMM during critical operations such as starting a VM and migration to ensure security.

The Good:
- Nothing like it exists yet
- Good first attempt

The Bad:
- The system has not been built, so although they say the design is "detailed" enough to start building, we can't really verify that.
- The design requires an external trusted entity that is quite involved in the process of starting a VM, migration, and probably other management tasks. It is not clear who would run this service or how the incentives would work.
- Minor: the authors seem to confuse encryption with signatures.

Saturday, July 18, 2009

Thoughts on Open Cirrus Cloud Computing Testbed: Federated Data Centers for Open Source Systems and Services Research

Authors: Roy Campbell, Indranil Gupta, Michael Heath, and Steven Y. Ko (UIUC); Michael Kozuch and David O'Hallaron (Intel Research); Marcel Kunze (KIT); Thomas Kwan (Yahoo!); Kevin Lai, Martha Lyons, and Dejan Milojicic (HP Labs); Hing Yan Lee and Yeng Chai Soh (IDA)

Venue: HotCloud '09

The authors present the need for systems research on datacenter infrastructure. While virtualized systems like EC2/S3, Azure, and such allow applications research, it is difficult to do research on the systems that compose the infrastructure of the data center. Open Cirrus is an effort to provide researchers with hardware-level access to end-hosts. They have federated several data centers around the world and are working on providing a unified API to access all of them. There will be some basic common functionality even though the DCs themselves are heterogeneous. Open Cirrus is basically a federated Emulab across multiple sites where the sites are only loosely coupled (whatever that means).

The systems are scheduled and managed as Physical Resource Sets (PRS). HP's Integrated Lights-Out technology is used to manage machines at the firmware level.

They compare options for testbeds and do a back-of-the-envelope calculation of the break-even point for renting (S3/EC2) vs. owning hardware (they use a reasonable cost breakdown). Their calculation says that for a service running more than 12 months it is better to buy computation, while for storage running more than 6 months it is better to buy.
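A toy version of such a break-even calculation (all numbers below are made-up placeholders for illustration, not the paper's cost breakdown):

```python
def break_even_months(purchase, own_monthly, rent_monthly):
    """First month at which cumulative rent exceeds buy-and-operate cost.
    Assumes rent_monthly > own_monthly, otherwise renting always wins."""
    m = 1
    while rent_monthly * m <= purchase + own_monthly * m:
        m += 1
    return m

# hypothetical numbers: $3000 server, $60/month power+admin, $300/month rented
months = break_even_months(3000.0, 60.0, 300.0)
```

With these invented numbers the crossover lands near a year, which is the shape of the paper's conclusion: long-running services favor owning.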

The paper is interesting in that it surveys other testbeds and introduces a new one of wider scale than Emulab.

Thursday, July 16, 2009

Thoughts on Providing Packet Obituaries

Authors: Katerina Argyraki, Petros Maniatis, David Cheriton, Scott Shenker
Venue: HotNets '04

The paper draws a strawman design of how to provide some form of accountability for which AS on a path dropped a packet. This is done by adding Accountability Boxes (A-boxes) on links at the entry and exit points of an AS. The A-boxes calculate digests of packets that pass through, and reports are cascaded from the last AS to see a packet back to the sender along the reverse AS-level path. The mechanism allows an end-host to know which AS has dropped a packet. If ASes forge reports, the sender can know at which link a report discrepancy occurred and can tell that one of the two ASes has lied. The paper also quickly studies the feasibility of the design. I did not read the paper carefully enough to comment on its pros and cons. I like the function it provides; I'm more worried about its security and whether it is practical. I am not sure that necessitating extra hardware is the right approach, but perhaps this is OK since carriers put all sorts of specialized hardware in their networks anyway.
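A toy sketch of the obituary idea (the digest and the per-AS report format are simplified inventions, not the paper's wire format):

```python
import hashlib

def digest(packet: bytes) -> bytes:
    # A-boxes keep concise per-packet digests rather than the packets themselves
    return hashlib.sha256(packet).digest()[:8]

def last_as_to_see(packet, path_reports):
    """Given each AS's reported set of digests (ordered sender -> destination),
    return the last AS that saw the packet; blame falls just after it."""
    d = digest(packet)
    last = None
    for asn, seen in path_reports:
        if d in seen:
            last = asn
        else:
            break          # first AS that never saw the packet
    return last
```

If an AS lies in its report, the discrepancy shows up at the link between it and its honest neighbor, which is exactly the "one of two ASes lied" property noted above.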

Wednesday, July 15, 2009

Thoughts on CloudViews: Communal Data Sharing in Public Clouds

Authors: Roxana Geambasu, Steven D. Gribble, Henry M. Levy
Appeared in: HotCloud '09

Public clouds have lots of bandwidth, so it is easier to share data between services in the same cloud. The paper raises the issues in sharing and what the security options are. They maintain the view that data is owned by the service rather than the data creator. They allow services to specify which portions of their data are accessible to other services. But once data is shared by one service with another, it becomes owned by the other service, which can do whatever it likes with it. They don't provide federation of resources and assume a few large public clouds.

This was a quick read and needs to be read again.

Wednesday, July 8, 2009

Thoughts on VL2: A Scalable and Flexible Data Center Network

Authors: Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David A. Maltz, Parveen Patel, and Sudipta Sengupta

To Appear: SIGCOMM 2009

The paper has two parts. The first describes results from measurements in a DC and the other describes an architecture.

Datacenter Results:
The authors looked at a 1500 node cluster in a DC that does data-mining.
  • Traffic patterns highly variable, and there are more than 50 traffic matrices that change very frequently (usually in under 100s) with no periodicity (predictability).
  • 99% of flows are smaller than 100MB
  • More than 90% of bytes are in flows between 100MB and 1GB
  • Flows over a few GB are rare.
  • Around 20% of traffic leaves/enters the DC. The rest is local.
  • Intense computation and communication does not straddle DCs
  • Demand for BW inside DC growing faster than demand for BW to external hosts
  • The network is the bottleneck for computation. ToR uplinks are frequently more than 80% utilized.
  • More than 50% of the time a machine has 10 flows
  • At least 5% of the time it has 80 flows, but almost never more than 100.
  • Most failures are small: 50% of net device failures involve only a few devices.
  • Downtimes can be significant: 95% are under 10 min, but some last more than 10 days.
  • In 0.3% of failures, all redundant components failed.
  • Most problems due to : net misconfigurations, firmware bugs, faulty components
VL2 Architecture:
VL2 proposes a network architecture that enables multipath, scalability, and isolation with layer-2 semantics and no ARP or DHCP broadcasts. They also use commodity switches, which must support OSPF, ECMP, and IP-in-IP decapsulation.

The concepts are simple. VM hosts add a layer to the networking stack called the VL2 agent. When services or VMs wish to send traffic, the VL2 agent encapsulates the packets in an IP-in-IP tunnel and uses VLB over a Clos network on a flow-by-flow basis to split load across intermediate (backbone) switches.

Hosts are assigned location-specific IP addresses (LA) while services are assigned application-specific IP addresses (AA) that they maintain when they migrate around the datacenter (DC). When a service sends a packet, it uses the AAs. The VL2 agent encapsulates the packet twice: In the inner layer it puts the LA of the dst ToR switch, and in the outer layer it puts the LA of an intermediate switch and sends it out. The packet then gets routed to the intermediate switch that decapsulates the first layer and forwards it to the correct ToR switch. The ToR switch decapsulates the packet and delivers it to the correct host.
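The double encapsulation and the two decapsulation hops can be sketched like this (dicts stand in for IP headers, and the field names are illustrative, not VL2's actual formats):

```python
def vl2_encap(payload, src_aa, dst_aa, directory, intermediates, pick):
    """VL2-agent side: wrap the AA-addressed packet twice.
    `directory` maps a destination AA to its ToR switch's LA;
    `pick` chooses an intermediate switch LA (e.g. at random, for VLB)."""
    tor_la = directory[dst_aa]
    inner = {"src": src_aa, "dst": dst_aa, "data": payload}   # AA header
    mid = {"dst": tor_la, "inner": inner}                     # inner LA: dst ToR
    outer = {"dst": pick(intermediates), "inner": mid}        # outer LA: intermediate
    return outer

def intermediate_forward(pkt):
    # intermediate switch strips the outer header and forwards to the ToR
    return pkt["inner"]

def tor_deliver(pkt):
    # ToR strips the remaining header and delivers to the host by AA
    return pkt["inner"]
```

Because only the two LA headers are location-specific, a service keeps its AA when it migrates; only the directory mapping changes.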

While selecting a random intermediate switch LA to forward each packet would implement VLB, a large number of VL2 agents would need to be updated whenever a switch is added to or removed from the intermediate switches. Instead, they would like to use one address for all intermediate switches and let ECMP choose one of the switches using anycast. But since ECMP only allows 16 different paths (with 256 coming later), they choose a number of anycast addresses (they don't say how many or how to choose it) and assign the maximum number of switches to each address. If a switch dies, the addresses assigned to it are migrated to other switches.

When encapsulating, the VL2 agent puts a hash of the packet's 5-tuple in the IP src address of the encapsulation headers. This can be used to create additional entropy for ECMP, or to modify flow placement to prevent large flows from being placed on the same link.
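A sketch of that 5-tuple hash (the encoding below is made up; the point is only that the value is deterministic per flow, so ECMP keeps a flow on one path while different flows spread out):

```python
import hashlib

def flow_src_field(five_tuple):
    """Hash a flow's 5-tuple (src ip, src port, dst ip, dst port, proto)
    into a 32-bit value suitable for the encapsulation src-address field."""
    blob = "|".join(map(str, five_tuple)).encode()
    return int.from_bytes(hashlib.md5(blob).digest()[:4], "big")
```

The agent can also deliberately change this value to move a large flow off a congested link, since the network routes on it but nothing else depends on it.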

The VL2 agent obtains the LA from a directory and caches it. When the service running on a host sends an ARP request for an AA, the VL2 agent intercepts the ARP request and queries the directory for the destination ToR switch LA, which it caches and uses for subsequent packets. They don't say how the anycast addresses are obtained for the intermediate switches or what MAC address is returned in the ARP reply to the VM.

The VL2 agent can be used to enforce access control (they say that the directory does, but it actually only makes the policy decisions) and hence isolate services from each other. In addition, two services that need to communicate don't need to go over an IP gateway to bridge two VLANs as in traditional DCs (then again their whole network is made of IP routers anyway).

Services that connect to hosts in the Internet (such as front-end webservers) have two addresses: LA and AA. The LA address is externally reachable. They do not say what this means when the externally reachable service migrates.

For broadcast other than ARP (intercepted by VL2 agent) and DHCP (handled by DHCP relay) an IP multicast address is used which is unique for each service.

The VL2 directory system is two-tiered. Replicated directory servers cache AA-to-LA mappings and serve them to clients. Replicated State Machine servers use the Paxos consensus algorithm to implement reliable updates. Lookups are fast, but updates are slow and reliable. To fix inconsistencies in VL2 agent caches, if a ToR switch receives a packet with a stale LA-AA mapping, it sends the packet to a directory server that updates the VL2 agent's stale cache. Does this mean the ToR switch needs to be modified to support this?

The eval section is good. It is nicely done and thorough. They get 94% of optimal network capacity, high TCP fairness, graceful degradation and recovery under failures, and fast lookups (good enough to replace ARP). VLB provides good fairness because of the high number of flows (statistical muxing), and because uplinks have a 10x gap in speed.

Long lived flows' aggregate goodput is not affected by other flows starting or ending in the network or by bursts of short flows. This is due to VLB, spreading all traffic around uniformly, and because TCP is fast enough to ensure that flows only get their fair share of throughput.

They compared VLB to other routing techniques, both adaptive and the best oblivious, and found that VLB is at worst only 20% worse, which they claim is good enough for such a simple mechanism.

In terms of cost, they claim their network is cheaper for the same performance.

The Good:
The analysis of datacenter characteristics is the first of its kind. Thank you Microsoft!
The architecture is nice and simple to understand and achieves excellent performance. They do not require modification of switches and use commodity components. In exchange, they modify the networking stack on VM hosts. They say this is OK because the hosts need to run crazy hypervisors and VMM software anyway.
The evaluation is thorough, and, if you buy that the DC measurements are representative, convincing.
They answer the question "What can you do without modifying the switches and their protocols?" well. But it is not clear this was their aim.

They made some very interesting points such as that TCP is good enough for the DC and for VLB. But if we are going to modify the network stack, will we use TCP?

The Bad:
  • It was not clear how representative the measurement of the DC was of other DCs and other areas of the same DC. But something is better than nothing.
  • At a high-level, the architecture seems to be a little ad-hoc, trying to solve a problem by patching a solution on top of existing systems. Perhaps this is the right approach for existing systems whose networking equipment cannot be changed.
  • What are the performance characteristics of the VL2 agent? Does it add any significant overhead?
  • How are the machines configured? How are the routers configured? If misconfiguration is the number 1 cause of failures, how do they address that? Nothing is mentioned.
  • They do not mention power. While Fat-tree used 1G links everywhere and needed more powerful scheduling, it did consume much less power and was quite cheap.
  • The architecture seems to require many independent components that are not very well integrated or controlled. How do we configure the separate parts? How can we monitor the whole network? And so on.
  • The directory service itself seems difficult to implement and build. How difficult was it?
  • They say they will use a number of anycast addresses. How are these addresses obtained? How many are there? How do we add more?
  • For externally reachable services that have a LA, what happens when they migrate?
  • At one point, they mention that the ToR switch can help with reactive stale-cache updates. Does this mean the switch has to be modified, or did they mean that the VM host does this? And what happens when the host is gone? Or when the ToR is gone? How are the VL2 agent caches updated when the reactive mechanism is not working due to failures somewhere else?
  • I do not fully buy that this is better than fat-tree, even if Fat-tree requires centralized control of switches (actually a good thing!)