Friday, December 11, 2009

Thoughts on Automatic Patch-Based Exploit Generation is Possible: Techniques and Implications

Authors: David Brumley (CMU), Pongsin Poosankam (CMU), Dawn Song (UC Berkeley), Jiang Zheng (U. Pittsburgh)

Venue: 2008 IEEE Symposium on Security and Privacy

The work describes how to generate exploits for an unpatched binary program P given a patched version P'. This is important because it highlights the fact that the patching process itself matters: staggering updates over a long period of time leaves a window in which previously unknown exploits can be generated from the patch in minutes and used against still-unpatched machines.

The work, however, only targets input validation checks. So, for example, if a buffer overflow was fixed by checking the input size or the pointer, the system the author built works. If, on the other hand, it was fixed by increasing the buffer size, the system fails.

The authors generate exploits (user inputs to a program that compromise its safety) in four main steps:
  1. "Identify the new sanitization checks added in P'." This is done using off-the-shelf binary differencing tools like EBDS. EBDS does syntactic analysis, but not semantic. So if a check in P is i > 10 while in P' it's i - 1 > 10 the latter might be reported as a new check by EBDS. The verification step later on weeds out these semantic differences.
  2. For each new check found, generate a candidate exploit which fails the new check in P' by:
    1. Calculating the boolean function F that specifies the weakest constraints on the inputs that would fail these checks. This is done by basically trying to find which instructions in the program to include in the formula. They use three approaches:
      1. Dynamic: If we know the constraints on the inputs that exercise the check, they simply specify the constraints that lead to the new check and add the constraint that the input must fail the new check. This is the case if the new check is on a common path executed for all inputs. This is the quickest method, and generates the easiest constraint functions to solve. In practice, this is done by analyzing an execution trace, lifting the executed instructions into the Vine modeling language for easier analysis, then doing the analysis.
      2. Static: (See Creating vulnerability signatures using weakest preconditions) In this approach, the whole program is lifted to the Vine modeling language first. Then, the program's control flow graph is analyzed and the program is chopped so that it only includes paths that lead to the new checks. The constraint function is then basically the whole CFG. This approach is more thorough and will generate exploits that might require more searching in the dynamic approach, but the generated functions are usually much larger and more difficult to solve.
      3. Hybrid: Use the dynamic approach to reach a junction point in the execution closer to the new check, then use the static approach between that point and the new check. This approach creates smaller constraint functions so they are easier to solve but also allows wider coverage to find exploits.
        They automate the process by iterating through junction points until an exploit is found, where the heuristic they use for these points is to place them at procedure boundaries. While the previous two approaches have been used by others before, the authors claim the hybrid approach as a novel contribution.
    2. Solve the constraint function F to find the candidate exploits using an off-the-shelf solver.
  3. Check that the candidate exploit is a real exploit by checking that it compromises the program's "safety policy", where "a safety policy" is a function that maps the program's state space to a boolean value of safe or unsafe. Examples include type checking, buffer overflows, ... In the paper, the authors use policies that are enforceable by their execution monitor (dynamic taint analysis to detect information leaks and memory safety issues, dynamic type checking, return address integrity, ...)
  4. Generate additional variants x' by solving the new constraint function F'(x') = F(x') and x' != x

If the solver fails to find an input, then the path described by the constraint function does not have an exploit. If it takes too much time, then the authors set a timeout so that the system moves on to check a new path by, for example, changing the mix (junction) point in the hybrid approach.
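The loop above can be sketched in miniature. This is not the authors' system (which lifts x86 into the Vine IL and calls a real constraint solver); the path condition and the new check below are invented for illustration, and a toy brute-force search stands in for the solver:

```python
# Hypothetical sketch: find inputs that reach the patched check but fail it.
# F(x) = path_condition(x) AND NOT new_check(x).

def path_condition(x):
    # Constraints that drive execution to the new check in P' (invented).
    return x > 0

def new_check(x):
    # The sanitization check added by the patch, e.g. an input-size bound.
    return x < 16

def candidate_exploits(domain, limit=3):
    """Brute-force stand-in for the solver; also generates variants
    (step 4) by continuing past the first solution."""
    found = []
    for x in domain:
        if path_condition(x) and not new_check(x):
            found.append(x)
            if len(found) == limit:
                break
    return found

print(candidate_exploits(range(256)))  # [16, 17, 18]
```

Each candidate would then be run under the execution monitor (step 3) to confirm that it actually violates the safety policy rather than merely failing the syntactic check.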

For evaluation, the authors ran their system on 5 MS updates. Some exploits worked with dynamic analysis, others with static analysis, and others required the hybrid approach. Exploit generation took from tens of seconds to under three minutes, and most exploits resulted in control hijacking.

The techniques seem to also be useful in other applications such as automatic deviation detection to check whether two programs behave the same, or, more concretely, to check if compiler optimizations affected the behavior of a program. But this doesn't seem to me to be easily attainable since the differences between two programs might be large enough that the techniques don't work.

The solutions to the automatic patch-based exploit generation problem they propose are obvious: obfuscation, patch encryption so that everyone can apply it simultaneously, and fast patch distribution using P2P. They mention that off-line systems would stay vulnerable, but this is the same issue with the current patch distribution architecture.

The Likes:
  • The paper presents a cool attack
  • It works (at least for the cases shown)!

The Dislikes:
  • There was no real quantitative estimate of how important this is in practice or how well the system will fare in general. What is the percentage of bugs it does work on and through which a bad guy can cause real damage?
  • There was too much notation and repetition
  • I did not understand anything on dynamic analysis. The math notation looked horrible. I'm sure the ideas are probably not too difficult to understand, but I didn't want to wade through it with no helping hand (i.e. real explanation).
  • This is not quite about the paper, but there are no satisfactory solutions to the problem they raised yet. It would've been nice if they were able to describe one, but that's not the goal here.
  • I wish there was a bit more education on the technical background.

Thursday, December 3, 2009

Thoughts on Dispatcher: Enabling Active Botnet Infiltration using Automatic Protocol Reverse-Engineering

Authors: Juan Caballero, Pongsin Poosankam, Christian Kreibich, Dawn Song

Venue: CCS 2009

The paper describes a tool that can be used to automatically reverse engineer the protocol format of an unknown protocol. The tool can be used to deduce the format and some of the semantics of the fields in both sent and received protocol messages. It can do this reverse engineering without requiring the binaries on both sides of a protocol, which makes it useful in situations where only the client is available (such as a bot binary) or only the server.

The ideas used are quite clever and simple. The main idea is to look at each byte in the buffer that constitutes a message sent or received, and then create a dependency chain for that byte. The dependency chain is the sequence of instructions and data sources used to write the byte. The dependency chain ends when a source for an instruction is an immediate, a constant, another memory location, or unknown (such as an arithmetic op). Then by comparing the ultimate sources for contiguous destination bytes and seeing if they are derived from the same contiguous set of source bytes, they can tell the position and length of buffers that form the fields in a message. They do this hierarchically in the buffer so they can also extract the hierarchy of fields.
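A toy rendition of that field-inference idea (not Dispatcher's actual code): assume dynamic taint tracking has already produced, for each byte of the output buffer, the offset of the source byte it was derived from. Contiguous output bytes derived from contiguous source bytes are grouped into one field.

```python
def infer_fields(source_of):
    """source_of[i] = source offset that output byte i was copied from,
    or None if the byte came from a constant/immediate.
    Returns a list of (position, length) field boundaries."""
    fields, start = [], 0
    for i in range(1, len(source_of) + 1):
        contiguous = (
            i < len(source_of)
            and source_of[i] is not None
            and source_of[i - 1] is not None
            and source_of[i] == source_of[i - 1] + 1
        )
        if not contiguous:
            fields.append((start, i - start))
            start = i
    return fields

# Bytes 0-3 copied from source offsets 10..13, byte 4 is a constant,
# bytes 5-6 from offsets 40..41 -> three fields.
print(infer_fields([10, 11, 12, 13, None, 40, 41]))
# [(0, 4), (4, 1), (5, 2)]
```

Running the same grouping at coarser granularities is what would recover the field hierarchy the paper describes.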

They can infer some semantics by looking at where these fields are used. If they are used as arguments to or if they are return values of known functions, they can immediately get the type, and some semantics such as an IP address when used in the connect function.

Furthermore, they are able to infer loops using static and dynamic analysis, which in turn lets them infer fields used to indicate offsets and lengths. By looking at repeated bytes, they are able to infer delimiters, and arrays are inferred by assuming that arrays are usually written inside loops.

To handle encrypted messages, they use the intuition behind other work, ReFormat by Z. Wang et al.: functions with a disproportionately high ratio of arithmetic instructions to other instruction types are probably cryptographic. They use this idea to infer which buffers are unencrypted and operate on those. With a minimum of 20 instructions for a function to be considered and a ratio threshold of 0.55, they achieved a false positive rate of 0.002%.
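The heuristic is simple enough to sketch. The thresholds are the paper's; the instruction categories below are invented for illustration:

```python
# Flag a function as likely cryptographic when it has at least 20
# instructions and more than 55% of them are arithmetic/bitwise.
ARITHMETIC = {"add", "sub", "mul", "xor", "shl", "shr", "rol", "ror"}

def looks_cryptographic(instructions, min_count=20, ratio=0.55):
    if len(instructions) < min_count:
        return False
    arith = sum(1 for op in instructions if op in ARITHMETIC)
    return arith / len(instructions) > ratio

# A tight xor/rotate/add loop body of 24 instructions trips the detector.
body = ["xor", "rol", "add", "mov", "xor", "add"] * 4
print(looks_cryptographic(body))  # True
```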

For evaluation, the authors studied the MegaD command and control (C&C) protocol. They were able to infer the message formats used by MegaD and to rewrite a message indicating SMTP sending capabilities. They also analyzed other open, known protocols and compared their Dispatcher results with those from Wireshark. They say that Dispatcher outperforms Wireshark.

In terms of resiliency, the authors claim that they target fundamental properties to avoid becoming obsolete if malware adapts. However, they do confess that if malware can detect that it is being run in a virtual environment, then their technique can be circumvented.

The Likes:
  • The paper is very well written and clear.
  • The ideas are nice and clever, and they seem to work well.
  • They do not need both sides of a protocol
  • Can be used to infer correctness of messages when a protocol is known
  • Can make writing a Wireshark dissector easier
  • Might be used to send erroneous information to botnet masters (I don't know how useful this is. If you know that a machine is already compromised, you should fix the situation. You cannot take over the botnet.)
The Dislikes:
  • Can be circumvented by detecting a virtualized environment. I don't know how easy this is, but for serious malware, this might not be a problem.
  • The authors claim that they target fundamental properties, but provide no evidence or proof. It is not clear that the properties they target are really fundamental. For example, they use some known functions to infer semantics. The bad guy might ship his own libraries so as to restrict the API used. The thresholds used to detect encryption and decryption are empirical, and seem easy to obfuscate. Adaptive malware might be able to change keyword values or restructure the program to make it more difficult to infer which functions are doing what. The program might be written in a difficult to analyze way where fields of the messages are always results of arithmetic computations. The program might use complex loops with jumps and gotos that make analysis difficult.
  • They claim they do better than Wireshark, but it's not clear why they made that conclusion. The dissectors made different assumptions about what delimiters are, but it does not mean they are incorrect (or maybe I just did not understand what they meant).

Friday, October 16, 2009

Thoughts on Making Information Flow Explicit in HiStar

Authors: Nickolai Zeldovich, Silas Boyd-Wickizer, Eddie Kohler, David Mazieres

Venue: SOSP 2006



The paper describes, in painful detail, the HiStar operating system and the implementation of two applications on it. HiStar uses distributed information flow control to provide a platform to run untrusted code on sensitive data and provide guarantees for data leakage and integrity.

Motivation:
  • Hard to reason about the implications of every action by untrusted code
  • chroot can break application assumptions
  • syscall interposition error-prone
  • Easier to specify policy in terms of where to where information can flow
  • Unlike other IFC systems, HiStar implements standard OS abstractions on top of lower level building blocks that obey IFC
  • Taint tracking mechanism does not leak info itself
The HiStar DIFC model:
There are six kernel object types in HiStar. Of these, threads are the only ones that execute code. Under Distributed Information Flow Control (DIFC), each kernel object has a label. The label describes how it is tainted. An object can be tainted in different sets of categories with different degrees. The degree of taint in a particular category places a restriction on the object. In HiStar, there are four degrees of taint (matching the numbers used in labels below):
  0. Cannot be written/modified (lowest taint)
  1. No restriction (the default)
  2. Cannot be untainted/exported
  3. Cannot be read/observed (highest taint)
Information from one object A can only flow to another object B iff B is at least as tainted as A in every category. This is the flow to relationship. A restriction imposed by a particular category can be bypassed by objects that own that category. Ownership of a category is specified as an additional degree «. Categories are only created by threads. A thread that creates a new category is the only owner of that category and can then grant ownership to another thread or to a gate (another kernel object).

A label looks like {a0, b2,c«, 1}, which means that the object is tainted in category a with degree 0, b with degree 2, c with degree «(i.e. the object owns category c), and in all other categories with degree 1 (i.e. untainted in other categories). The labels on all objects, except for threads, are immutable and have to be specified when the object is created. Objects are created by threads only.
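The "flow to" comparison can be sketched as a toy model (my reading of the paper, not real HiStar code). A label maps categories to taint degrees 0 through 3 with a default degree for unlisted categories, and `OWN` stands in for the « ownership degree:

```python
OWN = "*"

class Label:
    def __init__(self, default, **cats):
        self.default = default
        self.cats = cats          # e.g. Label(1, a=0, b=2, c=OWN)

    def degree(self, cat):
        return self.cats.get(cat, self.default)

def can_flow_to(a, b):
    """Information flows from a to b iff b is at least as tainted as a
    in every category; ownership (OWN) bypasses the comparison."""
    for cat in set(a.cats) | set(b.cats):
        da, db = a.degree(cat), b.degree(cat)
        if OWN in (da, db):
            continue
        if db < da:
            return False
    return b.default >= a.default

secret = Label(1, u=3)   # cannot be read without raising one's label
worker = Label(1, u=3)   # a thread that raised its label to read it
public = Label(1)        # untainted object
print(can_flow_to(secret, worker), can_flow_to(secret, public))
# True False
```

Comparing degrees with plain integer comparisons like this is exactly the cute property noted in The Likes below.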

Threads can read restricted objects by raising their label. Threads, however, can only raise their labels to the limit imposed by their clearance label. The clearance label is an additional label on threads, that also places an upper limit on the label a thread can request for a newly created object. A created object’s label must flow to the thread’s label which in turn must flow to the thread’s clearance (bypassing owned categories).

HiStar uses a single-level store. All objects are stored on disk and, on bootup, the entire system state is reloaded from the latest snapshot. All objects are stored in a container hierarchy with quotas.

There are six kernel object types in HiStar: segments, threads, address spaces, containers, gates, and devices.
  • Segments: segments are the simplest object type in HiStar. They are basically arrays of bytes and are analogous to other OSes’ files.
  • Threads: threads execute code. They can create categories and objects, and change their own label and clearance as long as the new label or clearance can flow to the current clearance ignoring owned categories.
  • Containers: All objects are stored in containers that are organized in a hierarchy rooted at the root container, with quotas. Objects are hard-linked into containers, and when they are disconnected from the root, they are garbage-collected. The existence of hard links can leak information, so objects are always specified as a (Container ID, Object ID) pair: to see whether a hard link exists, a thread needs to be able to read the container. Quotas on objects do not change so as not to convey information.
  • Address Spaces: Each thread runs in an address space. An address space is a mapping from virtual addresses to containers, the segments within them, and permissions. Address spaces are used to specify the code to run and the virtual memory layout. They are created by a thread and used when launching a new thread. They are consulted on page faults to determine whether a thread has permission to read some segment and where that segment should be mapped.
  • Gates: gates “provide protected control transfer, allowing a thread to jump to a pre-defined entry point in another address space with additional privileges”. Basically, they are very similar to RPCs except that the calling thread provides the resources for execution. A thread can create a gate and give it some label (respecting DIFC). Another thread can invoke the gate and request a label and clearance. The requested label and clearance are a combination of the invoking thread’s label and clearance and the gate’s label and clearance. The thread then starts executing in the new address space (i.e. executing some other code or function) using the requested label and clearance. Return values can be stored inside the thread object itself, but there is no implicit return mechanism. The invoking thread needs to create a return gate and label appropriately (the discussion on this is quite interesting).
  • Devices: Can be mounted…
User-Level Libraries:
The authors implemented a Unix-like environment using the HiStar kernel completely in userspace, so all of its abstractions respect IFC. The filesystem is implemented using segments and containers. A directory is a container with a special directory segment that contains a map of names to object IDs and a mutex to synchronize directory operations. To get a consistent view of the directory entries without acquiring the mutex (which requires write permission), a reader atomically reads a generation number and busy flag before and after reading an entry.
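That lock-free read is essentially a seqlock; a sketch of the idea (names invented, not HiStar's code):

```python
import threading

class Directory:
    """Writers take the mutex and bump gen around updates; readers
    retry until they observe a stable, non-busy generation."""
    def __init__(self):
        self.gen = 0
        self.busy = False
        self.entries = {}
        self._mutex = threading.Lock()   # writers only

    def update(self, name, oid):
        with self._mutex:
            self.busy = True
            self.gen += 1
            self.entries[name] = oid
            self.busy = False

    def lookup(self, name):
        while True:
            g, b = self.gen, self.busy          # read before
            value = self.entries.get(name)
            if not b and (self.gen, self.busy) == (g, False):  # read after
                return value                    # consistent snapshot

d = Directory()
d.update("etc", 42)
print(d.lookup("etc"))  # 42
```

The point is that readers never need the write permission that acquiring the mutex would require, so reads do not raise the reader's label.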
Processes in HiStar are containers with two process-specific categories to protect their secrecy and integrity. A process has two containers: a process container and an internal container. The process container exposes objects that form the external interface to the process. The internal container has segments used to implement the file descriptors, stack, heap, and executable code. It also holds the thread that executes the code and its address space.

A user has a pair of unique categories that define her read/write privileges. HiStar does not support ACLs.

HiStar also supports IPC using gates, signals, and networking using lwIP. The network device has a label {nr3, nw0, i2, 1} to taint all data read from the network.

Unix leaks information for process exit, quota adjustment, and file creation. For these operations, HiStar uses untainting gates that explicitly allow information to leak.

The authors used ClamAV as the motivating example and split it up so that the scanner part cannot export user data.

They also implemented an untrusted user authentication mechanism that cannot leak more than one bit of the user’s password. The discussion for this application is interesting, but too complicated to explain in a summary. The main idea is to split up the login mechanism into a directory and gates and use each gate with different privileges to get the information for authentication, check the password, and get the user’s privileges.

Another application was VPN, where another lwIP stack was run in parallel to the network with a different label so that public Internet data and VPN data do not mix (at least not directly).

HiStar's performance is comparable to Linux and OpenBSD, sometimes faster, sometimes slower. It is slower in the fork and exec microbenchmarks. However, it does allow new consistency primitives such as group sync, where no changes are recorded until an application finishes executing.

Some useful features are missing from the Unix environment, while the semantics of others are modified. A crash or killed process can leave mutexes locked, and there needs to be support for cleaning those up. There are no CPU quotas and HiStar does not solve timing covert channels.

The Likes:
  • The paper was thorough in the description and did not leave questions
  • DIFC seems very useful, and the OS seems to implement it well.
  • It is nice to build a layered system basing the design on a few basic blocks that enforce security and be able to reason about the rest of the system from those.
  • The OS has a load of cool and smart things whose composition was carefully designed to build useful libraries.
  • It works!
  • For non-malicious code, HiStar seems to be able to protect against bugs.
  • It’s cool that the taint degrees work out to map into numbers that can be compared using simple integer comparisons.
The Dislikes:
  • The paper was very dense, and takes quite a bit of effort to read. It would've been nice if the authors had focused on the important points and insights instead of just spewing out information.
  • The userspace environment is not completely compatible with Unix, but ah well.
  • The timing channels problem seems to be a major issue
  • It looks difficult to build applications using the HiStar primitives.
  • What is the killer app for HiStar? What can it be used for given all its limitations? What are all these features good for?

Tuesday, October 13, 2009

Thoughts on The Case for OneSwarm

I should've posted this entry a while ago. This is a talk that Tom Anderson gave at MSR in Mountain View on 9/29/2009.

Cloud computing has many issues:
  • Data can be locked in (Facebook, ...)
  • Need to trust providers with data
  • Cloud is a natural monopoly:
    • The business model is to attract developers then the users. So the users have no choice.
    • Reliability and the ability to handle flash crowds is a high barrier to entry.
    • There is no plausible avenue for an open source Cloud. Somebody needs to pay for the infrastructure, so somebody needs to be making money off of it to be sustainable, and the way to do this is through data lock-in and/or data mining...
  • Since all data is stored somewhere centrally, it's easier to censor it
Can we use a P2P system instead where users can donate some of their resources to be able to use other people's resources? At first glance, it looks like this might be alright. P2P systems
  • have high availability -> But in fact this is not true because seeds disappear
  • have no bottlenecks -> But there could be network inefficiencies or ISP throttling
  • have no centralized monitoring -> But it's actually easy to track users
The alternative is OneSwarm. OneSwarm has no centralized trust. The idea is to use privacy-preserving sharing where you build a web of trust through friends. Then when a user makes a query, the query goes through friends, then friends of friends, and so on... and the data travels on the reverse path back to the user. Nodes are not supposed to reveal information about where a query they pass along came from, so the receiver of a query cannot relate a user to a query.
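The friend-to-friend forwarding can be sketched as a toy (the topology and API here are invented, purely to illustrate reverse-path data return):

```python
# Each node forwards a query only to its direct friends and remembers
# only which neighbor it heard the query from, so data retraces that
# path without revealing the original requester to the data holder.
friends = {
    "alice": ["bob"], "bob": ["alice", "carol"],
    "carol": ["bob", "dave"], "dave": ["carol"],
}
has_data = {"dave": "song.mp3"}

def query(origin, wanted):
    frontier, came_from = [origin], {origin: None}
    while frontier:
        node = frontier.pop(0)
        if has_data.get(node) == wanted:
            path = []
            while node is not None:       # walk the reverse path
                path.append(node)
                node = came_from[node]
            return path
        for nxt in friends[node]:
            if nxt not in came_from:
                came_from[nxt] = node
                frontier.append(nxt)
    return None

print(query("alice", "song.mp3"))  # ['dave', 'carol', 'bob', 'alice']
```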

The system is, by Tom's admission, vulnerable to collusion, but that's okay. The system is not designed for complete security but rather as a compromise between security, flexibility, and performance. Another assumption is that there is no partitioning in the network and that eventually everyone will be connected, and everyone will have access to everything (which will probably end up happening anyway).

The network of friends is organized as a DHT. To maintain availability and solve the issues with traditional P2P systems, they invented a new type of DHT that is highly scalable and reliable: nodes are organized into tight groups that are replicated via Paxos. Each group then appears as a single node in the DHT.

As a final note, I had a chat with Tom after the talk. He has a pessimistic outlook on what will eventually happen. Mainly, that cloud providers will always end up creating a monopoly using data lock-in and be able to mine the data stored on the cloud, because they will manipulate the API so they can do so.

I like OneSwarm. In particular, I liked the DHT implementation. They had some really cute ideas there. It is not a security protocol, it is a "casual" privacy protocol that presents a minimum barrier to snooping. However, if someone wants to get to my data and to know what requests I am making, they would be able to do so.
The payoff would be the availability of a P2P system presented as cloud storage, on top of which users can implement "free" applications, such as an open-source Facebook.

Some reservations:
  • I'm not sure I agree with the pessimistic view. Google has started a movement called the Data Liberation Front to basically free users from data lock-in, and I expect this movement to become more important.
  • There are more reasons than just convenience for applications to run in a data center. If we move from DCs to P2P systems for storage, we will get much higher lookup costs, and it is not clear systems will scale.
  • There are no guarantees of connectivity. The social network might get partitioned, limiting the available data.

Thursday, August 13, 2009

Thoughts on A System for Authenticated Policy-Compliant Routing (Platypus)

Authors: Barath Raghavan and Alex C. Snoeren
Venue: SIGCOMM 2004


Providers want to forward packets only for entities they can bill. Regular source-routing makes this difficult since packets can arrive from anywhere and it is not clear whom to bill. The authors propose a network protocol that allows an ISP to give capabilities to whomever it wants to bill. The capabilities themselves can then be further delegated. Their work assumes that the knowledge of alternate routes or waypoints is done through some out-of-band means and the issue is not addressed.


Platypus capabilities enable four applications:

1. Efficient overlays and on-demand routing where end-hosts can decide how to route packets.

2. Preferential egress points where different peers get routed differently even if they all peer at the same router.

3. Preferential ingress points where multi-homed ASes can select which ISP traffic comes in through.

4. Virtual multi-homing where a stub AS can get multi-homing at the provider (i.e. a hop or more away) even though it only has one connection.

The Platypus header is below the IP header so it is compatible with IP. A packet has a list of capabilities where each capability specifies the resource principal to bill, and the waypoint to go to. Packets are forwarded between waypoints by replacing the IP address in the IP-level packet.

Capabilities are authenticated using a “double MAC” trick. The key server and the router share a single key k. Secret capability key s = MAC(k, capability) is given to customer. Customer then generates b = MAC(s, payload) and puts b in the packet along with the capability. The router can then authenticate the capability.
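The double-MAC trick is easy to sketch, with HMAC-SHA256 standing in for the paper's MAC (the actual primitive and encodings differ):

```python
import hmac, hashlib

def mac(key, msg):
    return hmac.new(key, msg, hashlib.sha256).digest()

k = b"shared-router-keyserver-key"
capability = b"principal=cust42;waypoint=10.0.0.7"
payload = b"packet payload"

# Key server hands the customer s = MAC(k, capability); the customer
# stamps each packet with b = MAC(s, payload) plus the capability.
s = mac(k, capability)
b = mac(s, payload)

# The router, knowing only k, recomputes s from the capability carried
# in the packet and checks the binding without contacting the key server.
assert hmac.compare_digest(b, mac(mac(k, capability), payload))
print("capability verified")
```

The nice property is that the router never stores per-customer state: everything it needs to verify is in the packet plus the one shared key k.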

The capability key s is temporary and expires after some time. The expiration mechanism requires loose clock synchronization but allows only one valid key value at a time. When a key expires, customers need to get a new key using a master key they are given. They encrypt the op code for a request and embed it in a DNS lookup name, so a key request to a server is a lookup for E(master key, request code); the response contains the encoded new key. Requests are thus cacheable.

They also have revocation lists that are pulled from key servers into the routers.

They describe two ways to do delegation on prefixes: one by XORing, and the other by recursive MACing. The first is susceptible to collusion; the second might be too slow (the latter is what we do in ICING).

They evaluate their system using kernel modules, and it is a little slower than IP forwarding. I did not understand their discussion on clustered waypoints and their proposed solution for traffic engineering, so I won't say anything about it. I will say that the way they choose their nonces is not very nice.

They punt on issues like accounting, saying it is possible to do some distributed per-packet counting and rate-limiting (unknown). Their scalability section is a little hand-wavy since they have no idea what it would really take to build it in hardware. Replay attacks are also hand-waved, saying they will use Bloom filters (but they don't mention costs or how they would do that in hardware, or what they would hash…)

The Good:

· This is good seminal work on how to verify policy compliant packets.

· Used some cute smart tricks here and there (expiration, double MAC, delegation, key lookups)

· System seems to perform well enough in software

· They present the system and crypto nicely (who has what keys and how they derive them for example)

The Bad:

· They concentrate only on recognizing whom to bill. In reality, policies are probably more complex for security reasons. For example, don’t let AT&T send packets through Sprint even if another customer is paying for it. Or only use trusted ISPs.

· Much has been hand-waved. The implementation doesn’t explicitly specify (or maybe I missed it) what has been implemented.

· Revocation lists and polling from key-servers might be a problem.

· Once a capability is in a packet, it is compliant. There is no check whether or not the packet actually follows the path.

Thoughts on HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

Authors: Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz (Yale University), Alexander Rasin (Brown University)

Venue: VLDB 2009



The paper starts by saying that DBs are becoming increasingly important and data is growing at a huge rate. Unfortunately, the usual parallel DBMSs do not scale well because they cannot handle failures very well. On the other hand, MapReduce handles failures and stragglers (heterogeneous systems) well and scales well, but does not perform as well as DBMSs.

The paper describes a system that has roughly three layers. The top layer is a SQL layer which accepts SQL queries from users. The SQL queries are translated into MapReduce jobs. The MapReduce jobs are executed in the middle layer, Hadoop. The job is divided into several tasks. Each task itself is a smaller SQL query that is executed on a single node. Each node independently runs the third layer, a single-node DBMS (PostgreSQL in this case). It uses Hadoop for communication, Hive for the translation, and PostgreSQL for the DB.

HadoopDB consists of the following main components:

1. The database connector: provides an interface to independent databases so that worker nodes (TaskTrackers) can get results of SQL queries in MapReduce format (i.e. key-value pairs).

2. Catalog: maintains metainformation about the third-layer DBs---connections parameters (e.g. DB location, driver class, credentials) and metadata (data sets, partitions, replica locations, …)

3. Data Loader: Partitions data and stores it in the local DBs.

4. SQL to MapReduce to SQL (SMS) Planner: Extends Hive, which translates SQL queries into MapReduce (Hadoop) jobs, and optimizes the output to push as much work as possible onto the local DB instances in the worker nodes, since the single-node DBMSs perform much better than MapReduce.
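The database connector's contract can be sketched as follows: run a SQL fragment on the node-local DB and expose the rows as key-value pairs for a map task. Here sqlite3 stands in for PostgreSQL and the names are invented:

```python
import sqlite3

def local_query_as_kv(conn, sql):
    """Yield (key, value) pairs a TaskTracker could feed into a map task,
    using the first column as the key and the rest as the value."""
    for row in conn.execute(sql):
        yield row[0], row[1:]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (url TEXT, hits INTEGER)")
conn.executemany("INSERT INTO visits VALUES (?, ?)",
                 [("a.com", 3), ("b.com", 5)])

# The SMS planner would push selections/aggregations into this SQL string
# so the local DB, not MapReduce, does the heavy lifting.
print(list(local_query_as_kv(conn, "SELECT url, hits FROM visits ORDER BY url")))
```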

In the eval section, they compare Vertica (a column-store DBMS), DBMS-X, HadoopDB, and Hadoop. They tested the systems with and without failures. I didn't quite understand all the differences in workloads, but I did notice that Vertica was much faster than any of the other systems. The authors claim that HadoopDB performs almost as well as the other systems they tested and almost always better than Hadoop. They say that the performance differences are mainly because of the lowest PostgreSQL layer. This seems plausible, but it would have been a much more compelling eval if they had tested comparable systems (e.g., using compression on PostgreSQL and/or a column store).

When faults and slow nodes were introduced, Vertica performed much worse: a 50% performance hit with failures and 170% with slow nodes. HadoopDB, on the other hand, was only 30% slower. They then say that this slowdown is due to a minor bug that can easily be fixed so that their performance matches Hadoop's (which is better than HadoopDB's under slowdown).

The Good:

The paper is interesting and reads well. The system itself is also very smart, and it’s a wonder no one has built it before. I would really like to see what ends up happening with it and how it ends up performing in the real world.

The Bad:

It would have been much more insightful if they had used the same bottom-layer DB in the eval, or at least evaluated different ones. It would have been compelling to see how Hadoop can parallelize a DBMS. It would have been nice, too, if they could have put Vertica there to show how HadoopDB can fix the issues with failures. Also, I didn't think the queries they used were that complex, but I don't know whether this is of any importance.

Friday, August 7, 2009

Thoughts on Fabric: A Platform for Secure Distributed Computation and Storage

Authors: Jed Liu, Michael D. George, K. Vikram, Xin Qi, Lucas Waye, Andrew C. Myers (Cornell University)
Venue: SOSP 2009

The paper describes Fabric: a language that is mainly an extension of Jif into the distributed computing world. The paper starts with motivation, citing patient-record sharing as an example of where it is necessary to share data between two domains such that each domain's security policies are enforced.

In Fabric, there are three types of nodes:
  1. Stores: store objects persistently
  2. Workers: perform computation
  3. Dissemination: CDNs for stored objects
Naming: Fabric uses an object-oriented DB model. Fabric objects store information, are mutable, and have an object id, or oid. The oid is essentially the concatenation of the location where the object is stored (as a DNS name) and a 64-bit number. The oid is permanent: if the object moves, a pointer at the old location points to the new location.
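The naming scheme can be sketched as follows; the store names are hypothetical, and the forwarding table models the "pointer at the old location" used when an object moves:

```python
# Sketch of Fabric-style oids: a store's DNS name plus a 64-bit number.

def make_oid(store_dns, num):
    assert 0 <= num < 2**64
    return f"fab://{store_dns}/{num}"

forwarding = {}  # old oid -> new oid, kept at the object's old store

def resolve(oid):
    # Follow forwarding pointers left behind by object moves.
    while oid in forwarding:
        oid = forwarding[oid]
    return oid

old = make_oid("store.cs.cornell.edu", 42)
new = make_oid("store2.cs.cornell.edu", 7)
forwarding[old] = new           # the object moved stores
assert resolve(old) == new      # the permanent oid still resolves
```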
Certificates are the roots of trust and are used to establish SSL connections.

Labels: Each object has labels describing its integrity and secrecy requirements.

Classes: Class objects create an unforgeable binding between each object and the correct code for implementing that object.

Versions: Fabric objects are mutable. Each has a version number that is incremented when a transaction updates the object. The version number is used to check whether a transaction is operating on a stale or the up-to-date copy of the data.
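The version check amounts to standard optimistic concurrency control; a toy model (not Fabric's actual commit protocol) looks like:

```python
# A transaction commits only if it read the current version of every
# object it touched; otherwise it was working on a stale copy.

store = {"obj1": {"version": 3, "value": "x"}}

def try_commit(read_set, writes):
    # read_set: {oid: version number the transaction saw when reading}.
    for oid, seen in read_set.items():
        if store[oid]["version"] != seen:
            return False  # stale copy: abort
    for oid, value in writes.items():
        store[oid]["value"] = value
        store[oid]["version"] += 1  # bump the version on update
    return True

assert try_commit({"obj1": 3}, {"obj1": "y"})      # up to date: commits
assert not try_commit({"obj1": 3}, {"obj1": "z"})  # now stale: aborts
```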

Threat Model: Timing and termination channels are ignored. It is assumed that the network will deliver messages eventually. It is not clear how strong this assumption is. Do they need the network to never fail, or else everything crumbles? Or do they only assume that the network works mostly as it does today (it may be out for a few hours, perhaps)?

Another assumption that is used in the rest of the paper is that a program knows a priori the nodes it will use for computation and dissemination, or that they will be input into Fabric somehow. Similarly for storage nodes.

Storage Nodes: Store objects. Related objects are grouped together and returned when requested for efficiency. They track delegation or as they call it "trust relationships".

Worker Nodes: Execute Fabric programs.
  • Trusted programs can incorporate other intermediate language-programs that don't have security constructs.
  • Objects modified only inside transactions
  • Worker nodes can either request remote data (data shipping) or perform remote invocations (function shipping)
  • Remote invocations don't need the object to already exist at the remote worker node
  • The entire method call is executed in a subtransaction whose effects are only visible when the top-level transaction commits. The whole transaction tree is then committed.
  • The caller side of a remote invocation checks its security statically. The callee checks it at run time, since it doesn't know who will invoke it; i.e., the language has to explicitly check trust, e.g. if a trusts b: then a.invoke(...)
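The callee-side run-time check can be sketched as a toy model (the principals and the single delegation edge are hypothetical, and the acts-for closure assumes no cycles):

```python
# The callee accepts a remote call only if the caller acts for the
# principal whose authority the method runs with.

acts_for_edges = {("alice", "bob")}  # alice acts for bob (bob trusts alice)

def does_act_for(p, q):
    # Reflexive, transitive closure of the delegation edges.
    if p == q:
        return True
    return any(x == p and does_act_for(y, q) for (x, y) in acts_for_edges)

def invoke(caller, owner, method):
    # Run-time trust check on the callee, as in "if a trusts b: ...".
    if not does_act_for(caller, owner):
        raise PermissionError(f"{caller} does not act for {owner}")
    return method()

assert invoke("alice", "bob", lambda: "ok") == "ok"
```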
Dissemination Nodes:
  • Dissemination layer is not prescribed, and currently uses FreePastry
  • Object on dissemination nodes encrypted
  • Key stored on object's store
  • Store can verify security and dole out the key when safe. **They say workers need not fetch keys often, but not clear why...p5
  • Stores and dissemination nodes track readers.
  • When object updated, updates are pushed to tracked worker nodes.
  • Dissemination nodes can learn info from request pattern, so they are labeled with some label representing "the maximum confidentiality of information that may leak on this channel". Workers only do fetch requests if this max is not violated. **I do not understand this whole paragraph. How do they quantify the amount of information leaked on this covert channel and turn it into a label?
The Fabric Language: They extend JIF to support
  1. Transactions and consistency
  2. Remote method calls
The interesting thing is that they do both of these without leaking information and in a way that keeps remote calls authorized. I think this should probably say "where information leakage is explicit", because there will always be some undesired information leakage.

Then they go on to say: "Fabric adds a new trust ordering on information flow labels". This I found confusing and intriguing at the same time. Eventually, I will be disappointed...

In Fabric, everything is in terms of principals.
  • There are no partial privileges or dynamic labels for principals. A principal always executes with its full privileges.
  • They add an acts-for relationship, which is basically just delegation of all privileges, and they call this "trust". If a acts for b, then b trusts a, i.e., b has delegated all its privileges to a.
  • There's a top principal (I'll call it T, but they use a symbol I'm too lazy to look up) that acts for all other principals.
  • There's a bottom principal (B) that all principals act for.
  • Fabric nodes are also principals, and so also follow the acts-for model.
  • Principals implement their own authentication. The description of how trust is bootstrapped in the system is too confusing. They must have cut too much here. My imagination fails me. **They need to describe bootstrapping more. How does a user delegate trust to a node? How is the authentication turned into a principal?
  • Information flow security policies are expressed in terms of principals. They claim this integrates access control and information flow, though I don't know what that means. Information flow security policies are access control policies!
  • Labels in Fabric look like {alice -> bob}, meaning alice, the owner of this security policy, allows bob to read. {alice <- bob} means alice allows bob to write. Since the owner is explicit (alice), only code running with alice's authority (a worker node that acts for alice) can declassify or endorse data with alice's labels.
  • {alice->bob} can flow to {charlie->dora} only if charlie acts for alice and dora acts for bob or dora acts for alice.
Trust Ordering: As well as being ordered by information flow using the flows-to relationship, labels can be ordered by trust (the acts-for relationship). The higher a label's trust requirement, the more a node needs to be trusted. For example, {a->b} actsfor {c->d} iff a actsfor c and b actsfor d or c; in that case, one label can be used in place of the other. This can be used to establish trust in nodes: for an object x with label L_x to be stored on a node n, formally, L_n = {T->n; T<-n} must actsfor L_x.
e.g. if L_x is {p->q} then either n actsfor p or n actsfor q. In either case, n must be trusted with the contents of x.
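The trust ordering can be sketched as a toy acts-for check on confidentiality labels (principal names and the delegation edge are hypothetical):

```python
# {a -> b} acts for {c -> d} iff a acts for c, and b acts for d or c.
# "TOP" models the top principal, which acts for everyone.

delegations = {("n", "p")}  # node n acts for principal p

def acts_for(p, q):
    if p == q or p == "TOP":
        return True
    return any(x == p and acts_for(y, q) for (x, y) in delegations)

def label_acts_for(l1, l2):
    (a, b), (c, d) = l1, l2
    return acts_for(a, c) and (acts_for(b, d) or acts_for(b, c))

# Storing object x with label {p -> q} on node n requires the node's
# label {TOP -> n} to act for {p -> q}: here it works because n acts
# for the owner p, so n must be trusted with x's contents.
assert label_acts_for(("TOP", "n"), ("p", "q"))
assert not label_acts_for(("TOP", "m"), ("p", "q"))  # untrusted node m
```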

They mention that the reason this trust ordering is needed is that "Fabric classes may be parametrized with respect to different labels or principals, and so instances of the same class may have different labels. This allows implementation of reusable classes..." However, it is not clear to me that anything other than good ol' delegation is necessary. Of course you have to delegate to a node if that node is to enforce your labels! That is why in D-Star, exporters own the categories in the labels they enforce. Hence the disappointment I alluded to earlier.

As in Jif, implicit flows are tracked with the program counter's label.

For remote invocations, they say that the callee checks at run-time that the caller has the authority to invoke something with another principal's endorsements, but they do not mention how this check is done. Are there certs as in D-Star? **Need to mention how delegation is proved across nodes.

Transactions: Fabric provides transactions that can be distributed. When distributed, they are turned into a transaction tree. The whole tree is committed when the top level worker node starts the commit. It does so by recursively calling the commits of the whole tree (note recursively not iteratively). Most of the things they describe about transactions are how you would expect them to work.
A failure of a node to commit a transaction correctly is as serious as if that node violated trust by not updating the object's fields correctly.
If a transaction coordinator fails midway through a transaction, some changes might be committed while others are not. This can break consistency, but only for objects whose integrity the failing node was entrusted with.

Writer Maps: Used to allow workers to check whether an object has been updated without revealing information to workers who shouldn't learn about the updates. The writer map is a log of writes. Every time an object is updated, an entry hash(oid, xid, key) -> E(key, w) is stored, where w is the writer. It allows nodes that know the key to get the latest update and learn who the writer for that update is. But it is not clear how nodes learn the last transaction id; it is probably stored in the clear. **Nodes also still learn that an object has been updated, since the writer map changes, even if they are not allowed to know what the exact updates are.
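The writer-map entry structure can be sketched as follows; the XOR "cipher" is a toy stand-in for real encryption, and the oids/keys are made up:

```python
# Entry: hash(oid, xid, key) -> E(key, writer). Readers who hold the
# key can find and decrypt the entry; others only see the map grow.
import hashlib

def h(*parts):
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

def xor_bytes(key, data):  # toy cipher, NOT secure
    kb = hashlib.sha256(key.encode()).digest()
    return bytes(b ^ kb[i % len(kb)] for i, b in enumerate(data))

writer_map = {}

def record_write(oid, xid, key, writer):
    writer_map[h(oid, xid, key)] = xor_bytes(key, writer.encode())

def lookup_writer(oid, xid, key):
    entry = writer_map.get(h(oid, xid, key))
    return xor_bytes(key, entry).decode() if entry else None

record_write("obj1", "tx9", "secretkey", "alice")
assert lookup_writer("obj1", "tx9", "secretkey") == "alice"
assert lookup_writer("obj1", "tx9", "wrongkey") is None
```

Note that even in this sketch, the concern above is visible: anyone can observe that `writer_map` gained an entry, even without the key.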

Implementation and Eval: The implementation was 33k lines of code on top of Jif and other libs. They use BDB for the store and FreePastry for dissemination.
A few things are left unimplemented, most notably the distributed deadlock detection and timeouts for divergent computations.

For the eval, they implemented a course management system and tested it on 55 students. Complex queries were replaced with object-oriented code, and this simplified the implementation from 3100 LOC to 740. However, they only migrated part of the full system, and they only ported it to FabIL, the intermediate language used by Fabric that has no security properties, only consistency. Performance shows that FabIL is faster than all other implementations with consistency. Transaction management and consistency account for about 50% of the performance cost. **The metric doesn't really show much, since only parts of the implementation were done. It is not clear why FabIL is faster. Maybe I didn't understand what FabIL does. Is the implementation on a set of nodes? How does BDB run faster than Oracle? And it still says nothing about security and performance with full Fabric in use.

They also implemented a calendar. They didn't get performance numbers. Is it because the calendar didn't fully work?

They used the OO7 benchmark on Fabric nodes and found that Fabric is 10 times slower than non-persistent Java. They say this is acceptable, but that is not clear.

In summary, the paper is a thorough design of a language for distributed computation and storage when DIFC is involved. But like almost every other paper on the subject, it has left me with more questions than answers.

The Likes:
  • The paper is thorough in the design description
  • The paper is probably the first to combine DIFC with transactions to provide consistency
  • They combine language constructs with run-time checks. It's cool that trust requirements are in the language.
  • They do so much stuff!
The Dislikes:
  • Ah so many questions! See the text (**). I'm too lazy to put them here.
  • Why the trust ordering on labels? Why is delegation not enough? I am not convinced that it is necessary.
  • They have said almost nothing about how runtime checks are implemented. Do they use certs as in DStar to prove delegation?
  • That bootstrapping paragraph was completely useless. It left me more confused than informed. Probably because they did not explain how runtime enforcement is done and how authentication is turned into labels or proofs of labels.
  • The evaluation leaves so much to be desired. There is no full implementation of any system that uses Fabric and no real performance evaluation. In all, that was a disappointing eval section.
  • What is the deal with the dissemination nodes? Yup, they are cool, but where is the explanation of how they help? Are they basically CDNs? How are they used? Why not just replication?
  • I don't like the fact that principals always use their full privileges to access everything. What if a principal wants to operate in an untrusted environment, or doesn't want a node to be responsible for the integrity of so many objects?
  • The paper does not say much about the "control plane": Where do the names of nodes come from? How does a user select nodes? How do programs select/find nodes? How do a user's credentials translate across multiple nodes? I guess all this is orthogonal, but it would've helped me understand Fabric's usage model, which was quite obscure (as in many other DIFC papers).
  • I think my main issue is why is the system not layered? Why aren't transactions built on top of a minimal base whose properties are easy to reason about? Why are dissemination nodes a primary object instead of being implemented on top of worker and storage nodes, i.e. on top of Fabric? Is it convenience? What happens when things need to change? What about the modularity argument?

Sunday, August 2, 2009

Thoughts on Provable Data Possession at Untrusted Stores

Authors: Giuseppe Ateniese, Randal Burns, Reza Curtmola, Joseph Herring, Zachary Peterson (Johns Hopkins University), Lea Kissner (Google), Dawn Song (UC Berkeley/Carnegie Mellon University)

Venue: CCS’07, October 29–November 2, 2007, Alexandria, Virginia, USA


The authors describe a way to probabilistically determine whether a file stored at an SSP exists and has not been modified. I don't quite understand the crypto, but the results are:
  • Fast checks. They are I/O bound, not CPU bound
  • The size of the file stored at the server increases
  • The client needs to store a constant small amount of data per file
  • They use sampling to reduce the load on the server.
  • Uploads are slow because preprocessing is expensive, on the order of 160 kB/s
This work is almost identical to work by Ari Juels (RSA Labs) and Burton Kaliski (EMC Corp), "PORs: Proofs-of-Retrievability for Large Files". However, in PORs, a client has to decide the number of checks a priori, because the system works by encoding a constant number of sentinel blocks in the file that are queried, and hence consumed, during challenges.

The main difference between PORs and PDPs: PORs prove to a client that the entire file can be retrieved unmodified. PDPs do not make this claim; they only claim that the file blocks tested are uncorrupted.
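A drastically simplified spot check in the spirit of PDP can be sketched as follows. This is not the paper's actual scheme (which uses homomorphic tags so the client stores only a constant amount of data); here the client keeps one secret key and per-block HMAC tags ride along with the file at the server:

```python
import hashlib
import hmac
import random

KEY = b"client-secret-key"          # the client's only long-term secret
blocks = [b"block-%d" % i for i in range(100)]

def tag(i, data):
    # Keyed tag binds block index and contents; server can't forge it.
    return hmac.new(KEY, b"%d|%s" % (i, data), hashlib.sha256).digest()

tags = [tag(i, b) for i, b in enumerate(blocks)]  # stored with the file

def challenge(server_blocks, sample=10):
    # Sampling keeps server I/O low; corruption is caught
    # probabilistically rather than with certainty.
    for i in random.sample(range(len(server_blocks)), sample):
        if not hmac.compare_digest(tag(i, server_blocks[i]), tags[i]):
            return False
    return True

assert challenge(blocks)                  # intact file passes
assert not challenge([b"evil"] * len(blocks))  # corruption detected
```

This also makes the POR/PDP contrast concrete: passing a challenge here only vouches for the sampled blocks, not for retrievability of the whole file.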

Thoughts on POTSHARDS: Secure Long-Term Storage Without Encryption

Authors: Mark W. Storer, Kevin M. Greenan, Ethan L. Miller (University of California, Santa Cruz), Kaladhar Voruganti (Network Appliance)
Venue: Usenix 2007

The paper describes a confidential storage system over untrusted storage service providers (SSPs) that avoids encryption by using secret splitting. The system is also made reliable using RAID techniques, and uses approximate pointers to help recover a user's data. It is also resilient to insider attacks at an SSP, where an insider might attempt to recover stored data.
For long-term archival use:
  • throughput more important than latency.
  • encryption is a pain because of:
    1- lost/compromised keys
    2- compromised cryptosystems
    3- key management difficult because user who stored data may be unavailable
System designed around three tenets:
  1. No encryption (cryptosystems break after time)
  2. Fulfilling requests needs no information external to archives because that data will be lost in time. This is the reason archives are used to begin with.
  3. Individuals are more likely to be malicious than aggregates, so the system builds on SSP cooperation.
The system is also built around an implicit tenet (later made explicit in the paper) that each SSP monitors its security and is distrustful of other SSPs.

Each piece of data is split into an ordered set of shards, a shard-tuple, that can be used to reconstruct the data when put together. Without all the shards, the data cannot be reconstructed and no information about the data is revealed. Some redundancy is added at the shard layer, so that not all shards are needed, to improve short-term reliability.

With each shard, an approximate pointer is stored. Approximate pointers point to a region of the namespace where other shards are stored. An exact pointer would allow an attacker to easily reconstruct the data, because he would know where each shard is stored. If no pointers were used, an external index would have to be maintained, or there would be no way of knowing which shards go together. With approximate pointers, the attacker has to request on average half the region to get the right shard, which is highly conspicuous.
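The approximate-pointer idea can be sketched with made-up namespace and region sizes:

```python
# Store only the start of a fixed-size region of the shard namespace,
# not the exact address of the next shard.

NAMESPACE = 2**16   # hypothetical shard namespace size
REGION = 256        # hypothetical region size

def approx_pointer(next_addr):
    # Points at the region containing the next shard, not the shard.
    return (next_addr // REGION) * REGION

def attacker_expected_requests():
    # An attacker must probe about half the region per link, far
    # noisier than a legitimate user's exact lookups.
    return REGION // 2

assert approx_pointer(1000) == 768          # 1000 falls in [768, 1024)
assert 768 <= 1000 < 768 + REGION
assert attacker_expected_requests() == 128
```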

POTSHARDS uses RAID techniques + algebraic signatures for redundancy and verification.

Storage proceeds as follows:
  1. Data preprocessed into objects.
  2. Objects secret-split into fragments for secrecy.
  3. Fragments secret-split into shards for availability. Approximate pointers added here.
  4. Shards are placed across SSPs so that no single SSP can get all shards to make data.
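The secret-splitting steps above can be sketched with a standard n-way XOR construction (one common secret-splitting scheme; the paper's exact algorithm may differ):

```python
# n-1 random shares plus the XOR of everything: any n-1 shares reveal
# nothing about the data, while all n shares reconstruct it exactly.
import secrets
from functools import reduce

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def split(data, n):
    shares = [secrets.token_bytes(len(data)) for _ in range(n - 1)]
    shares.append(reduce(xor, shares, data))  # last share closes the XOR
    return shares

def combine(shares):
    return reduce(xor, shares)

secret = b"patient record"
shards = split(secret, 4)
assert combine(shards) == secret          # all shards reconstruct
assert all(s != secret for s in shards)   # no single shard is the data
```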
SSPs coordinate between themselves into RAID groups: **Huh?
  • "To deal with the threat of data loss from these events, POTSHARDS utilizes distributed RAID techniques. This is accomplished by dividing each archive into fixed-sized blocks and requiring all archives to agree on distributed, RAID-based methods over these blocks."
For parity updates on placement, a shard stored in a particular block is sent to other archives to update the parity for that block. **What? So archives can get shards that are meant for other archives? I thought the archives were supposed to be mutually distrustful. Nope. They say "it is assumed that all of the archives are trusted". This seems to me like a contradiction.

For recovery, the SSPs again coordinate and choose one candidate to store the failed data. **Come on, this is stretching too far. How can they choose that if they are mutually distrustful?

Archives also do integrity checking of other archives on behalf of a customer using algebraic signatures. **These guys are doing too much work.

SSPs store exact pointers from users to shards, and shards store approximate pointers to each other. A user can get all her shards, then use the approximate pointers to find chains and reconstitute the fragments. An attacker would have to recover all shards in each block pointed to by approximate pointers from each archive. **If an attacker compromised an SSP, he could get the exact index of the shards! Why try to get all the shards and not the index?

The eval section was hand-wavy. I couldn't really understand what the numbers they found meant. How do they relate to the real world? Some x-axes had no labels. PlanetLab vs. local cluster had too much discrepancy. There was no discussion of overhead costs.

In summary, this paper had some good ideas. The motivation is solid and the architecture seems to be built mostly on solid principles. The design they came up with, though, is questionable. There seem to be many weaknesses. For example, they assumed an attacker would brute-force requests for data without getting metadata from an archive. They also assumed that SSPs would be able to do all this coordination among each other. It was also assumed at one point that the SSPs should "question each other's security", i.e., be mutually untrusted, but then they say the archives are trusted because they send each other shards while doing parity updates. This work needs significant rework and careful re-evaluation. The explanation of how approximate pointers are used by a user is unintelligible (to me at least).

Thoughts on SafeStore: A Durable and Practical Storage System

Authors: Ramakrishna Kotla, Lorenzo Alvisi, and Mike Dahlin (UT Austin)
Venue: Usenix 2007
The authors describe a system---SafeStore---that drastically increases the durability of data stored at storage service providers (SSPs). The system relies on the following points:
  • Use hierarchical erasure coding within and across multiple SSPs
  • SSPs should provide an interface that exposes the redundancy levels they support internally
  • Use a heuristic to decide how/where to store data and with what redundancy levels
  • Use auditing to ensure data is stored correctly and is available.
Auditing works as follows:
  • When data owner stores data, it gets a signed receipt with object ID and hash.
  • Data owner encodes and stores receipt across SSPs. **Does the receipt need a receipt?
  • Routine audit:
    - Auditor sends salt for particular ID
    - SSP returns signed message with [obj ID, time, H(salt||data)]
    - If SSP honest, and finds that data is corrupted, returns error
    - If SSP dishonest, forced to return bogus H(salt||data) and now we have a crypto proof
  • Spot Check:
    - Auditor verifies some percentage of the responses
    - It does this by retrieving data from owners' cache, SSP, or other SSPs (**Why retrieve whole data? Isn't hash sufficient? i.e. get SSP data, get receipt hash and compare. Several options...)
    - Proof of Misbehavior (POM) can be produced if hash fails.
  • Cost:
    "our audit protocol improves durability of our system by two 9’s over a system with no audit at an additional audit cost of just 20%"
  • All local state except encryption keys and the list of SSPs used is soft state.
  • Performance vs NFS: 13% worse
  • Adding snapshots makes performance ~40% worse
  • Over the WAN with large delays, moderate drop in performance: 5%
  • SSFS versioning makes replication cheaper with less space overhead.
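The routine audit described above can be sketched as follows (object names and data are hypothetical, and signatures/timestamps are omitted):

```python
# Auditor sends a fresh salt; the SSP must return H(salt || data).
# A dishonest SSP that lost the data cannot produce the right hash,
# so its signed wrong answer becomes a proof of misbehavior (POM).
import hashlib

stored = {"obj1": b"archived bytes"}   # what the SSP actually holds

def ssp_respond(obj_id, salt):
    data = stored.get(obj_id)
    if data is None:
        return ("error", None)         # honest SSP admits loss
    # (A real reply would be signed and carry [obj ID, time, hash].)
    return ("ok", hashlib.sha256(salt + data).hexdigest())

def audit(obj_id, expected_data, salt=b"fresh-random-salt"):
    status, digest = ssp_respond(obj_id, salt)
    if status != "ok":
        return False
    return digest == hashlib.sha256(salt + expected_data).hexdigest()

assert audit("obj1", b"archived bytes")      # honest SSP passes
stored["obj1"] = b"corrupted"
assert not audit("obj1", b"archived bytes")  # wrong hash = POM
```

The fresh salt is what forces the SSP to actually read the data on every challenge rather than replaying an old hash.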

Saturday, August 1, 2009

Thoughts on Towards Automated Provisioning of Secure Virtualized Networks

Authors: Serdar Cabuk, Chris I. Dalton, HariGovind Ramasamy, Matthias Schunter

Venue: CCS'07, October 29-November 2nd 2007, Alexandria, Virginia, USA.


The work describes a system in which VMs distributed across multiple hosts are isolated within their own Trusted Virtual Domain (TVD). The TVDs are isolated from each other with confidentiality and integrity guarantees. TVDs are implemented using combinations of VLANs, VPNs, and custom vSwitch software that is installed on the virtualized hosts. TVDs enforce which VMs are allowed to be members using certificates or other attributes. Inter-TVD communication is regulated by a traffic matrix whose elements can be 0 = no communication, 1 = allow any communication, or P = a firewall policy that filters communication.
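The traffic matrix can be sketched as follows; the TVD names and the sample firewall policy are hypothetical:

```python
# Matrix elements: 0 = drop, 1 = allow all, or a predicate that acts
# as the firewall policy P filtering inter-TVD traffic.

def only_https(packet):
    return packet.get("dport") == 443

matrix = {
    ("red", "red"): 1,             # intra-TVD: unrestricted
    ("red", "blue"): 0,            # no communication allowed
    ("red", "green"): only_https,  # filtered by a firewall policy
}

def permit(src_tvd, dst_tvd, packet):
    policy = matrix.get((src_tvd, dst_tvd), 0)  # default deny
    if policy == 1:
        return True
    if policy == 0:
        return False
    return policy(packet)

assert permit("red", "red", {"dport": 22})
assert not permit("red", "blue", {"dport": 443})
assert permit("red", "green", {"dport": 443})
assert not permit("red", "green", {"dport": 80})
```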

To communicate between hosts, Ethernet-in-IP encapsulation is used. The mapping is done by the vSwitch component which implements a learning switch. For routing, they use dedicated VMs. To connect a non-virtualized host, they use a gateway VM proxy that has two vNICs.

Each TVD has two VLANs. One that is used for internal communication and is unrestricted, and the other is used for external or inter-TVD communication and passes through a firewall that implements the policies.

They proceed to describe the steps for creating and establishing a TVD and for a VM to join the TVD. They describe how to deploy TVDs in a network and what capabilities of infrastructure should be looked for.

They implemented the system using Xen, and their system has a surprisingly low overhead.

Is this better than Ethane? Well it is better suited for multiple administrative domains since each TVD can be administered separately. Inter-TVD communication is partially administered locally. They don't need specialized hardware. In a way, though, having OpenFlow enabled switches would have simplified much of their work and would have probably allowed everything to be implemented without any end-host modification.

The Good:
  • Really cool work. They managed to use a bunch of ad-hoc mechanisms already available in current hardware to build their system.
  • Each TVD is centrally managed and all policies are maintained in one TVD master
  • The capabilities of the underlying infrastructure are fed to the TVD master, which then determines the missing capabilities that need to be instantiated in software. How cool is that!
  • They can connect a number of different platforms along with virtualized and unvirtualized machines
The Bad:
  • Using firewall rules is a very heavy compromise. They cannot stop malicious VMs from communicating. A more thorough approach, such as looking into the VMs' memory or using trusted computing to verify guests, is necessary.
  • The eval section was a little too hand-wavy.

Sunday, July 26, 2009

Thoughts on HAIL: A High Availability and Integrity Layer

author = {Kevin D. Bowers and Ari Juels and Alina Oprea},
title = {HAIL: A High-Availability and Integrity Layer for Cloud Storage},
howpublished = {Cryptology ePrint Archive, Report 2008/489},
year = {2008},
note = {\url{}}, }
The paper describes a system that distributes redundant blocks of a file across multiple servers and allows a client to make sure that the file is not corrupted, even when an attacker can compromise servers and eventually gain access to all of them. It allows the client to get proofs of retrievability (PORs) efficiently from servers.

HAIL does this by adding what the authors term IP-ECC: integrity-protected error-correcting codes. These are basically ECC codes with an embedded MAC. They add these to each block of the file, and a server can then compute a concise aggregate MAC to prove to the client the existence and integrity of some blocks of a file.

Lots of proofs and cryptospeak, most of which I skipped over. They use standard constructions mostly and put them together.

In terms of performance, the system is slow. In terms of fault tolerance, the system can handle Byzantine failures where up to a third of the servers are faulty/compromised. In addition, the files are not lost.

The Good:
Basically secure RAID for the cloud. The servers themselves are untrusted, there is redundancy, and files are stored securely. If one storage provider dies, the files can still be accessed from other locations. The system is also robust against modifications and includes integrity checks.

The Bad:
  • Performance is really slow and they didn't compare with other systems.
  • Are storage providers dying really the worst case scenario such that all this overhead and work needs to be done? This seems like a very heavy hammer.
  • It seems that legal recourse + MACs would be easier. For example, sign an SLA so that the storage provider has more to lose by corrupting your data or being unavailable than you do.

Thoughts on Building Castles out of Mud: Practical Access Pattern Privacy and Correctness on Untrusted Storage

author = {Williams, Peter and Sion, Radu and Carbunar, Bogdan},
title = {Building castles out of mud: practical access pattern privacy and correctness on untrusted storage},
booktitle = {CCS '08: Proceedings of the 15th ACM conference on Computer and communications security},
year = {2008},
isbn = {978-1-59593-810-7},
pages = {139--148},
location = {Alexandria, Virginia, USA},
doi = {},
publisher = {ACM},
address = {New York, NY, USA},

Read the intro, looked at the performance.

The work addresses the problem of hiding access patterns from a storage provider. Not much work has been done in the area, and previous work was too expensive. This work allows queries to take on the order of 100s of ms. Some of the cost is due to the implementation: lack of parallelism and use of Java. On the other hand, the protocol itself requires multiple round trips, so the cost would still be too high. This seems to be a problem not worth solving, at least for now. Get the low-hanging fruit first: access control, controlled sharing, ...

Thoughts on FAWN: A Fast Array of Wimpy Nodes

Authors: David G. Andersen (Carnegie Mellon University), Jason Franklin (Carnegie Mellon University), Michael Kaminsky (Intel Research Pittsburgh), Amar Phanishayee (Carnegie Mellon University), Lawrence Tan (Carnegie Mellon University), Vijay Vasudevan (Carnegie Mellon University)

Venue: SOSP 2009

The paper describes FAWN, a system for storage using a large number of small, low-performance (hence wimpy) nodes that have moderate amounts of local storage. The system has two parts: FAWN-DS and FAWN-KV.
FAWN-DS is the backend that consists of the large number of nodes. The motivation is simple: I/O is the current bottleneck and current storage is inefficient.
  • high speed CPUs consume too much power, and CPU scaling doesn't work very well b/c of high leakage.
  • 2GB DRAM uses as much power as 1 TB disk.
  • power cost = 50% of the 3-yr TCO of a cluster
Each node in FAWN-DS has some RAM and Flash. The storage is log-structured key-value based:
  • Each node can appear as multiple vnodes, where each vnode has its own data store. This helps with management and doesn't decrease performance, b/c with a small number of files, writes become semi-random.
  • Part of the key is used to lookup in an in-RAM index.
  • The index stores a key fragment.
  • If the key fragment matches the lookup key, then check the full key in flash
  • If the flash matches, then we have found the entry
  • Otherwise, use hash chaining to get next location
The store is log structured:
  • writes and deletes are just appends to the store
  • lookups are as above
  • Every once in a while the store is compacted and check-pointed
  • Other ops the DS node can do are merge and split data stores
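The lookup path and log-structured writes above can be sketched as follows (index and fragment sizes are arbitrary, and the hash chaining on fragment mismatch is omitted):

```python
# A small in-RAM hash index maps part of the key hash to a key
# fragment plus an offset into the append-only log on flash.
import hashlib

log = []     # stands in for the on-flash append-only data log
index = {}   # in-RAM: index bits -> (key fragment, log offset)

def key_hash(key):
    return hashlib.sha256(key.encode()).hexdigest()

def put(key, value):
    h = key_hash(key)
    log.append((key, value))               # writes are just appends
    index[h[:4]] = (h[4:8], len(log) - 1)  # fragment + offset

def get(key):
    h = key_hash(key)
    entry = index.get(h[:4])
    if entry is None or entry[0] != h[4:8]:
        return None                        # key fragment mismatch
    full_key, value = log[entry[1]]
    return value if full_key == key else None  # verify full key on flash

put("k1", "v1")
put("k1", "v2")          # an update appends and repoints the index
assert get("k1") == "v2"
assert get("missing") is None
```

Compaction would periodically rewrite the log, dropping the superseded `("k1", "v1")` entry.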
FAWN-KV uses a DHT similar to Chord, but with no DHT routing. There is a three-level management hierarchy:
  • Key value ranges are split up and assigned to front-end nodes by management nodes.
  • Front-end nodes handle requests and keep track of which backend nodes have which keys:
    • If one node receives a request and the req is not in its range then it forwards the request to the right node
    • If the req is within the range, it forwards the request to the correct back-end node
    • Front-ends have a small RAM cache they use to buffer requests and avoid hot-spots
  • Back-end nodes implement the FAWN-DS:
    • When a new k,v write arrives, the write is forwarded in a replication chain along the chord ring.
    • The last node in the chain (the tail) acks the req back to the frontend and handles all lookup requests.
    • So each node is in R chains: once as head, once as tail, and R-2 times as a middle node.
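The replication chain can be sketched as a minimal chain-replication model (node names are hypothetical, and the front-end/ring machinery is omitted):

```python
# Writes enter at the head and propagate down the chain; the tail acks
# and serves lookups, so reads always see fully replicated data.

class Node:
    def __init__(self, name):
        self.name, self.store, self.next = name, {}, None

    def write(self, k, v):
        self.store[k] = v
        if self.next:                    # forward along the chain
            return self.next.write(k, v)
        return f"ack from tail {self.name}"  # tail acks to the front-end

head, mid, tail = Node("A"), Node("B"), Node("C")
head.next, mid.next = mid, tail

assert head.write("k", "v") == "ack from tail C"
assert tail.store["k"] == "v"            # tail serves lookups
```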
Joins and leaves:
  • When a node joins, all nodes split their key ranges.
  • For each range, the node gets the DS from the current tail
  • The node is linked into the replication chain
  • The node stores all updates in a temp log DS
  • The node gets all updates between when copy was done and when it started receiving new updates
  • All updates are merged into the permanent log DS, and node becomes fully on. Old tail can delete old DS.
  • For head join, need to coordinate with front-end
  • For a tail join, the node is inserted before the old tail, and the predecessor serves lookups
  • When node leaves, merge range, add new node to replication chain as in regular join
Failures are assumed to be fail-stop. They can't handle communication failures. More work is needed on detection.

  • FAWN-DS has quite a bit of overhead over raw filesystem
  • For lookup overhead is 38% for 1KB queries, 34% for 256B queries. i.e. about 62-66% throughput
  • For store, 96% throughput achieved b/c log structured
  • semi-random write performance is highly dependent on the flash device used
  • They compare their performance to BDB, which is abysmal, to say the least, on flash. But I don't get the point. BDB is not optimized for flash. I guess the point is that you will need something exactly like FAWN-DS if you want to use flash. What else could they do to get a sense of how well FAWN-DS does?
  • Small random writes do much better than random reads.
  • FAWN-KV throughput is about 80% of FAWN-DS single node throughput due to network overhead+marshalling/unmarshalling
  • Cluster with no front-end does 364 queries/joule
  • Including frontend: 330 queries/joule
  • Current network overhead = 20% and increasing as cluster size increases.
  • Median latency < 1ms, 90% < 2ms, 99.9% < 26.3ms
  • With split, median still < 1ms, 99.9% < 611ms under high load
  • Desktop does 52 queries/joule, 3-node FAWN does 240 with half of the idle power consumed by switch.
  • They have a nice figure illustrating the solution space for minimal cost depending on data set size and query-rate. Almost all uses FAWN with HD, DRAM, or SSD. Only a part uses Server+DRAM.
In general: a higher query rate pushes towards DRAM, and a bigger dataset pushes towards SSD, then disk for really large datasets at low throughput. Traditional server + DRAM has only a very small slice, for high query rates and datasets that are a little larger (the assumption is that each FAWN node has only 2GB of DRAM).
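The 96% store throughput mentioned in the eval makes sense if you sketch what a log-structured store does: every put is a sequential append, with only a small in-memory index updated per write. A minimal illustration (this is not FAWN-DS's actual layout, just the general technique):

```python
# Minimal log-structured store: sequential appends to a data log plus an
# in-memory index mapping each key to the offset of its latest value.

class LogStore:
    def __init__(self):
        self.log = []      # append-only data log (sequential writes on flash)
        self.index = {}    # key -> offset of the latest value in the log

    def put(self, key, value):
        self.index[key] = len(self.log)  # point index at the new tail entry
        self.log.append((key, value))    # sequential append, never in-place

    def get(self, key):
        # one index lookup, then one (possibly random) read from the log
        return self.log[self.index[key]][1]

s = LogStore()
s.put("k", "v1")
s.put("k", "v2")   # an overwrite also appends; the old entry becomes garbage
print(s.get("k"))  # -> v2
print(len(s.log))  # -> 2 (compaction would eventually reclaim the stale entry)
```

The trade-off is visible in the last line: overwrites leave dead entries in the log, which is why log-structured stores need background compaction.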

Interesting tidbits:

Facebook sees a 1:6 ratio of puts to gets

The good:
The paper in general is excellently executed with a thorough eval section. The related work section is also quite thorough and is a good starting point for someone interested in the space. The paper has many interesting tidbits and facts.

The FAWN system itself is also very nice and simple (at least as they explained it). They seem to have picked the right level of abstraction at which to explain it. Performance seems comparable to traditional systems at a much lower cost and higher efficiency. Their Figure 15 is beautiful. They basically convinced me to build my next DC using FAWN.

The Bad:
They didn't go into details on failures. I would have preferred a little less on impact of maintenance ops and more on failures and reliability. That said, I don't really see why failures would be a major problem.

Thursday, July 23, 2009

Thoughts on CLAMP: Practical Prevention of Large-Scale Data Leaks

author = {Bryan Parno and Jonathan M. McCune and Dan Wendlandt and
David G. Andersen and Adrian Perrig},
title = {{CLAMP}: Practical Prevention of Large-Scale Data Leaks},
booktitle = {{Proc. IEEE Symposium on Security and Privacy}},
address = {{Oakland, CA}},
month = may,
year = {2009}
The authors describe a web platform that enforces isolation, in both execution and data, between code running on behalf of different users. They do this as an extension of LAMP (Linux+Apache+MySQL+Perl/PHP) where the code for a user runs in a dedicated VM and can access only a subset of the DB---the rows that the user owns. This decreases the size of the Trusted Computing Base (TCB).

The system has 5 main components:
The Dispatcher: Receives SSL/TCP connections and muxes them onto VMs (called WebStacks) that run the code on behalf of the user.
The WebStack: Basically a light-weight VM.
The Query Restrictor (QR): Rewrites/restricts queries and updates from the WebStack to only be for rows that the user owns.
The User Authenticator (UA): Authenticates a user, and labels a WebStack with a user id and role. Creates a mapping from VM to UID in the QR.
The DB: Stores all the data.
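The Query Restrictor's job can be sketched conceptually as below. This is purely illustrative: the `owner_id` column and string-level rewrite are my assumptions, not CLAMP's actual mechanism (which sits at the DB boundary and can use views instead), but the idea is the same: given the UID bound to the requesting WebStack, a query can only touch rows that user owns.

```python
# Hypothetical sketch of query restriction: append an owner predicate to
# each query based on the UID the User Authenticator bound to the WebStack.

def restrict(sql, table, uid):
    """Assumes each sensitive table has an owner_id column (my assumption,
    not CLAMP's actual schema)."""
    sql = sql.rstrip(";")
    clause = f"{table}.owner_id = {uid}"
    if " where " in sql.lower():
        return f"{sql} AND {clause};"
    return f"{sql} WHERE {clause};"

q = restrict("SELECT card_no FROM accounts", "accounts", 42)
print(q)  # SELECT card_no FROM accounts WHERE accounts.owner_id = 42;
```

The important property is that the predicate is added outside the WebStack, so a fully compromised WebStack still cannot query other users' rows.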

They study the effect of compromising each component as well as the isolation layer (hypervisor). They updated three applications to use their system, and it turns out the applications don't need much change: usually the only modifications needed are to the login process.

- Overheads ranged from 10-20%. For operations that are already in the 10s of ms, the overhead is tolerable. However, MySQL views incur an additional overhead of 50%! They claim other DBs fare better, at around 7% for some.
- Requires too much memory per user. For 6GB, can support about 500 users
- Can handle only 2 logins/sec compared to the original 85 logins/s. Huh?
- Performance of unoptimized prototype about 42% of native.

The Good:
- Isolation and easy to convert other systems to CLAMP.
- To attack the system, at least two components need to be compromised.
- Compromising one user does not automatically grant access to other users

The Bad:
- Ok, so they do isolation between users. But what happens when it is the same bug getting exploited for all users? They say that by isolating users, they will limit the effects of the compromise to that user, but if the same bug manifests in all the webstacks, then they have not solved the problem. They only solve the problem where one user's account/password gets compromised.
- I don't really see the big deal here. A very similar thing can be done with d-star, you just need some generic wrappers if you don't want fine-grained isolation. Here they used a VM instead of a wrapper. Wrappers are much smaller and can provide better security.
- Their attacks discussion did not include dom0.
- Their overheads were really high, and the system is very limited. They equate their system with SSL, saying that SSL has additional overhead yet institutions are willing to pay. But come on. SSL has a much lower overhead and much higher performance. I mean, 2 logins/second? A reduction of 40x? Even assuming the code still needs to be optimized and performance doubles, that is still too low.

Then again, I'm not an expert. So what do I know?

Sunday, July 19, 2009

Thoughts on The Case for Enterprise-Ready Virtual Private Clouds

Authors: Timothy Wood and Prashant Shenoy, University of Massachusetts Amherst; Alexandre Gerber, K.K. Ramakrishnan, and Jacobus Van der Merwe, AT&T Labs—Research
Venue:HotCloud '09

The authors propose a way to connect private clouds to a subset of resources in a public cloud using VPNs such that the two appear as one cloud.

Thoughts on Towards Trusted Cloud Computing

Authors: Nuno Santos, Krishna P. Gummadi, Rodrigo Rodrigues (Max Planck Institute for Software Systems)
Venue: HotCloud '09

They cite a survey where executives and IT people say they don't trust the cloud because they can't control it and they fear for the confidentiality of their data. Even though cloud services always take steps to ensure the customer's data security, any admin with root access to a machine can observe data in memory. The authors cite Terra as a system where a machine can prove its integrity to a VM user. The authors extend this idea to Infrastructure as a Service (IaaS) where the whole service is a big black box.

Attack model: No physical access to machine.

- All nodes run a Trusted VMM as in "Improving Xen Security Through Disaggregation"
- There exists a trusted external entity (ETE) like Verisign that provides a TC service which keeps track of the Trusted VMs in a cluster. It has to be external so the sysadmins of the IaaS don't tamper with it.
- The TC can attest that the IaaS is providing a secure service and the TC coordinates with the TVMM during critical operations such as starting a VM and migration to ensure security.

The Good:
- Nothing like it exists yet
- Good first attempt

The Bad:
- The system has not been built, so although they say the design is "detailed" enough to start building, we can't really verify that it is.
- The design requires an external trusted entity that is quite involved in the process of starting a VM and migration and probably other management tasks. It is not clear who will run this service and what/how the incentives work.
- Minor: the authors seem to confuse encryption with signatures.

Saturday, July 18, 2009

Thoughts on Open Cirrus Cloud Computing Testbed: Federated Data Centers for Open Source Systems and Services Research

Authors: Roy Campbell,5 Indranil Gupta,5 Michael Heath,5 Steven Y. Ko,5 Michael Kozuch,3 Marcel Kunze,4 Thomas Kwan,6 Kevin Lai,1 Hing Yan Lee,2 Martha Lyons,1 Dejan Milojicic,1 David O’Hallaron,3 and Yeng Chai Soh2
1HP Labs, 2IDA, 3Intel Research, 4KIT, 5UIUC, and 6Yahoo!

Venue: HotCloud '09

The authors present the need for systems research on datacenter infrastructure. While virtualized systems like EC2/S3, Azure, and such allows applications research, it is difficult to do research on systems that compose the infrastructure of the data center. Open Cirrus is an effort to provide researchers with access to hardware level end-hosts. They have fedarated several data centers around the world, and are working on providing a unified API to access all of them. There will be some basic common functionality even though the DCs themselves are heterogeneous. OpenCirrus is basically federated Emulab across multiple sites where the sites are only loosely coupled (whatever that means).

The systems are scheduled and managed as Physical Resource Sets (PRS). HP's Integrated Lights-Out technology is used to manage machines at the firmware level.

They compare options for testbeds and do a back-of-the-envelope calculation of the break-even point for renting (S3/EC2) vs. owning hardware (they use a reasonable cost breakdown). Their calculation says that for a service running more than 12 months it is better to buy computation, while for storage the break-even point is 6 months.
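The rent-vs-buy comparison reduces to a simple break-even calculation. The numbers below are made up for illustration; the paper's own cost breakdown (not reproduced here) is what yields its 12-month (compute) and 6-month (storage) break-even points.

```python
# Break-even: owning wins once the cumulative rental savings cover the
# upfront purchase cost. All dollar figures are hypothetical.

def break_even_months(monthly_rent, purchase_cost, monthly_ownership_cost):
    """Months after which owning becomes cheaper than renting."""
    saving_per_month = monthly_rent - monthly_ownership_cost
    if saving_per_month <= 0:
        return None  # renting never loses
    return purchase_cost / saving_per_month

# hypothetical: $700/mo rented vs. $6000 upfront plus $200/mo to operate
print(break_even_months(700, 6000, 200))  # -> 12.0 months
```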

The paper is interesting in that it surveys other testbeds and introduces a new, wider-scale one than Emulab.

Thursday, July 16, 2009

Thoughts on Providing Packet Obituaries

Authors: Katerina Argyraki, Petros Maniatis, David Cheriton, Scott Shenker
Venue: Hotnet '04

The paper draws a strawman design of how to provide some form of accountability for which AS on a path dropped a packet. This is done by adding Accountability Boxes (A-boxes) on links at the entry and exit points of an AS. The A-boxes calculate digests of packets that pass through, and reports are cascaded from the last AS to see a packet back to the sender along the reverse AS-level path. The mechanism allows an end-host to know which AS dropped a packet. If ASes forge reports, the sender can tell at which link a report discrepancy occurred and thus that one of the two ASes has lied. The paper also quickly studies the feasibility of the design. I did not read the paper closely enough to comment on its pros and cons. I like the function it provides; I'm more worried about its security and whether it is practical. I am not sure that necessitating extra hardware is the right approach, but perhaps this is OK since carriers put all sorts of specialized hardware in their networks anyway.
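The localization idea can be sketched as below. This is my own illustrative reduction, not the paper's mechanism: each box on the path records a cheap digest of every packet it sees, and comparing which boxes saw a digest localizes the drop to the link just after the last box that saw it. The digest choice (a SHA-256 prefix) is an assumption.

```python
import hashlib

def digest(packet: bytes) -> bytes:
    return hashlib.sha256(packet).digest()[:8]  # 8-byte packet digest

def last_seen(path_boxes, packet):
    """path_boxes: ordered list of digest sets, one per A-box on the path.
    Returns the index of the last box that saw the packet (the drop is on
    the next link), or -1 if no box saw it."""
    d = digest(packet)
    last = -1
    for i, seen in enumerate(path_boxes):
        if d in seen:
            last = i
    return last

pkt = b"some packet"
boxes = [{digest(pkt)}, {digest(pkt)}, set()]  # dropped after box 1
print(last_seen(boxes, pkt))  # -> 1
```

A real deployment would use compact probabilistic digests (e.g., Bloom-filter-like structures) rather than exact sets, trading some false positives for memory.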

Wednesday, July 15, 2009

Thoughts on CloudViews: Communal Data Sharing in Public Clouds

Authors: Roxana Geambasu, Steven D. Gribble, Henry M. Levy
Appeared in: HotCloud '09

Public clouds have lots of bandwidth, so it is easier to share data between services in the same cloud. The paper raises the issues in sharing and what the security options are. They maintain the view that data is owned by the service rather than the data creator. They allow services to specify which portions of their data are accessible to other services. But once data is shared by one service with another, the other service owns it and can do whatever it likes with it. They don't provide federation of resources and assume a few large public clouds.

This was a quick read and needs to be read again.

Wednesday, July 8, 2009

Thoughts on VL2: A Scalable and Flexible Data Center Network

Authors: Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David A. Maltz, Parveen Patel, and Sudipta Sengupta

To Appear: SIGCOMM 2009

The paper has two parts. The first describes results from measurements in a DC and the other describes an architecture.

Datacenter Results:
The authors looked at a 1500 node cluster in a DC that does data-mining.
  • Traffic patterns highly variable, and there are more than 50 traffic matrices that change very frequently (usually in under 100s) with no periodicity (predictability).
  • 99% of flows are smaller than 100MB
  • More than 90% of bytes are in flows between 100MB and 1GB
  • Flows over a few GB are rare.
  • Around 20% of traffic leaves/enters the DC. The rest is local.
  • Intense computation and communication does not straddle DCs
  • Demand for BW inside DC growing faster than demand for BW to external hosts
  • Network is the bottleneck for computation. ToR uplinks are frequently more than 80% utilized.
  • More than 50% of the time a machine has 10 flows
  • At least 5% of the time it has 80 flows, but almost never more than 100.
  • Most failures are small: 50% of net device failures involve < 4 devices
  • Downtimes can be significant: 95% of failures are resolved in < 10 min, but some last > 10 days
  • In 0.3% of failures, all redundant components failed
  • Most problems are due to: net misconfigurations, firmware bugs, faulty components
VL2 Architecture:
VL2 proposes a network architecture that enables multipath, scalability, and isolation with layer 2 semantics and no ARP and DHCP broadcasts. They also use commodity switches that must have OSPF, ECMP, and IP-in-IP decapsulation.

The concepts are simple. VM hosts add a layer to the networking stack called the VL2 agent. When services or VMs wish to send traffic, the VL2 encapsulates the packets in an IP-in-IP tunnel and uses VLB in a Clos network on a flow-by-flow basis to split load across intermediate (backbone) switches.

Hosts are assigned location-specific IP addresses (LA) while services are assigned application-specific IP addresses (AA) that they maintain when they migrate around the datacenter (DC). When a service sends a packet, it uses the AAs. The VL2 agent encapsulates the packet twice: In the inner layer it puts the LA of the dst ToR switch, and in the outer layer it puts the LA of an intermediate switch and sends it out. The packet then gets routed to the intermediate switch that decapsulates the first layer and forwards it to the correct ToR switch. The ToR switch decapsulates the packet and delivers it to the correct host.
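The double encapsulation described above can be sketched with nested records. The field names and addresses below are illustrative, not VL2's actual headers: the outer layer targets an intermediate switch LA, the inner layer the destination ToR's LA, and the original packet carries AAs end to end.

```python
# Sketch of VL2's IP-in-IP double encapsulation and the two decap hops.

def encapsulate(packet, dst_tor_la, intermediate_la):
    inner = {"dst": dst_tor_la, "payload": packet}     # inner IP-in-IP layer
    return {"dst": intermediate_la, "payload": inner}  # outer IP-in-IP layer

def intermediate_switch(frame):
    return frame["payload"]  # strip outer layer, forward toward the ToR

def tor_switch(frame):
    return frame["payload"]  # strip inner layer, deliver to the host

pkt = {"src_aa": "10.0.0.5", "dst_aa": "10.0.0.9", "data": "hello"}
frame = encapsulate(pkt, dst_tor_la="20.1.1.1", intermediate_la="20.9.9.9")
delivered = tor_switch(intermediate_switch(frame))
print(delivered["data"])  # -> hello
```

Because only the outer/inner LAs change as a service migrates, the AAs the application sees stay stable, which is the whole point of the LA/AA split.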

While selecting a random intermediate switch LA to forward a packet would implement VLB, a large number of VL2 agents would need to be updated if a switch is added to or removed from the intermediate switches. Instead, they would like to use one address for all intermediate switches and let ECMP choose one of the switches using anycast. But since ECMP only allows 16 different paths (256 coming later), they choose a number of anycast addresses (they don't say what this number is or how to choose it) and assign the maximum number of switches to each address. If a switch dies, the addresses assigned to it are migrated to other switches.

When encapsulating, the VL2 agent puts a hash of the packet's 5-tuple in the IP src address of the encapsulation headers. This can be used to create additional entropy for ECMP, or to modify flow placement to prevent large flows from being placed on the same link.
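The entropy trick above can be sketched as follows: hash the flow's 5-tuple and embed the hash in the encapsulation source address, so ECMP (which hashes header fields) spreads flows across paths even when every outer destination is the same anycast address. The hash function and address layout here are my assumptions, not VL2's actual ones.

```python
import hashlib

def flow_src_address(src_ip, dst_ip, proto, src_port, dst_port):
    """Derive a per-flow encapsulation src address from the 5-tuple."""
    five_tuple = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    h = int.from_bytes(hashlib.sha256(five_tuple).digest()[:3], "big")
    # pack the 24-bit flow hash into a hypothetical 10.x.y.z source address
    return f"10.{(h >> 16) & 0xFF}.{(h >> 8) & 0xFF}.{h & 0xFF}"

a = flow_src_address("10.0.0.5", "10.0.0.9", "tcp", 5555, 80)
b = flow_src_address("10.0.0.5", "10.0.0.9", "tcp", 5556, 80)
print(a)  # deterministic per flow; different flows almost surely differ,
print(b)  # so ECMP hashes them onto different paths
```

Since the address is a pure function of the 5-tuple, all packets of one flow take the same path (no reordering), while distinct flows get spread.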

The VL2 agent obtains the LA from a directory and caches it. When the service running on a host sends an ARP request for an AA, the VL2 intercepts the ARP request and queries the directory for the destination ToR switch LA, which it caches and uses for other packets. They don't say how the anycast addresses are obtained for the intermediate switches or what MAC address is returned in the ARP reply to the VM.

The VL2 agent can be used to enforce access control (they say that the directory does, but it actually only makes the policy decisions) and hence isolate services from each other. In addition, two services that need to communicate don't need to go over an IP gateway to bridge two VLANs as in traditional DCs (then again their whole network is made of IP routers anyway).

Services that connect to hosts in the Internet (such as front-end webservers) have two addresses: LA and AA. The LA address is externally reachable. They do not say what this means when the externally reachable service migrates.

For broadcast other than ARP (intercepted by VL2 agent) and DHCP (handled by DHCP relay) an IP multicast address is used which is unique for each service.

The VL2 directory system is two-tiered. Replicated directory servers cache AA-to-LA mappings and serve them to clients. Replicated State Machine servers use the Paxos consensus algorithm to implement reliable updates. Lookups are fast, but updates are slow and reliable. To fix inconsistencies in VL2 agent caches, if a ToR switch receives a packet with a stale LA-AA mapping, it sends the packet to a directory server that updates the VL2 agent's stale cache. Does this mean the ToR switch needs to be modified to support this?

The eval section is good. It is nicely done and thorough. They get 94% of optimal network capacity, high TCP fairness, graceful degradation and recovery under failures, and fast lookups (good enough to replace ARP). VLB provides good fairness because of the high number of flows (statistical muxing), and because uplinks have a 10x gap in speed.

Long lived flows' aggregate goodput is not affected by other flows starting or ending in the network or by bursts of short flows. This is due to VLB, spreading all traffic around uniformly, and because TCP is fast enough to ensure that flows only get their fair share of throughput.

They compared VLB to other routing techniques, adaptive and the best oblivious routing, and found that VLB is at worst only 20% worse, which they claim is good enough for such a simple mechanism.

In terms of cost, they claim their network is cheaper for the same performance.

The Good:
The analysis of datacenter characteristics is the first of its kind. Thank you Microsoft!
The architecture is nice and simple to understand and achieves excellent performance. They do not require modification of switches and use commodity components. In exchange, they modify the networking stack on VM hosts. They say this is OK because the hosts need to run crazy hypervisors and VMM software anyway.
The evaluation is thorough, and, if you buy that the DC measurements are representative, convincing.
They answer the question "What can you do without modifying the switches and their protocols?" well. But it is not clear this was their aim.

They made some very interesting points such as that TCP is good enough for the DC and for VLB. But if we are going to modify the network stack, will we use TCP?

The Bad:
  • It was not clear how representative the measurement of the DC was of other DCs and other areas of the same DC. But something is better than nothing.
  • At a high-level, the architecture seems to be a little ad-hoc, trying to solve a problem by patching a solution on top of existing systems. Perhaps this is the right approach for existing systems whose networking equipment cannot be changed.
  • What are the performance characteristics of the VL2 agent? Does it add any significant overhead?
  • How are the machines configured? How are the routers configured? If misconfiguration is the number 1 cause of failures, how do they address that? Nothing is mentioned.
  • They do not mention power. While Fat-tree used 1G links everywhere and needed more powerful scheduling, it did consume much less power and was quite cheap.
  • The architecture seems to require many independent components that are not well integrated or controlled. How do we configure the separate parts? How can we monitor the whole network? And so on.
  • The directory service itself seems difficult to implement and build. How difficult was it?
  • They say they will use a number of anycast addresses. How are these addresses obtained? How many are there? How do we add more?
  • For externally reachable services that have a LA, what happens when they migrate?
  • At one point, they mention that the ToR switch can help with reactive stale-cache updates. Does this mean the switch has to be modified, or did they mean that the VM host does this? And what happens when the host is gone? Or when the ToR is gone? How are the VL2 caches updated when the reactive mechanism is not working due to failures somewhere else?
  • I do not fully buy that this is better than fat-tree, even if Fat-tree requires centralized control of switches (actually a good thing!)

Thursday, June 25, 2009

Thoughts on Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises

Authors: Changhoon Kim, Matthew Caesar, and Jennifer Rexford

Reading Detail Level: Quick Read

author = {Kim, Changhoon and Caesar, Matthew and Rexford, Jennifer},
title = {Floodless in seattle: a scalable ethernet architecture for large enterprises},
booktitle = {SIGCOMM '08: Proceedings of the ACM SIGCOMM 2008 conference on Data communication},
year = {2008},
isbn = {978-1-60558-175-0},
pages = {3--14},
location = {Seattle, WA, USA},
doi = {},
publisher = {ACM},
address = {New York, NY, USA},


The main idea is to use the switches to form a single-hop DHT. The DHT is accessed by using consistent hashing (function F) to minimize churn when switches fail or recover and keys are re-hashed. SEATTLE switches do the following things:
  1. They use a link-state algorithm to build the network topology (uses broadcast but only for switches)
  2. They form a one-hop DHT that can be used to store (k,v) pairs.
  3. They also participate in a multi-level DHT so that a large network such as an ISP can be divided into regions and queries to other regions get encapsulated at border switches when crossing regions.
  4. For every k a switch inserts, it keeps track of whether the resolver changes (maybe because it died or a new one came up), and inserts the (k,v) at the new location.
  5. If a switch dies, each switch will scan through its list of stored (k,v) and if the dead switch is a stored value, that entry is deleted.
  6. They advertise themselves as multiple virtual switches so that more powerful switches would appear as several virtual switches and would handle a larger load in the DHT.
  7. When a host arrives:
    1. The access switch snoops to find the host's IP and MAC addrs
    2. The access switch stores MAC -> (IP, location) at switch R=F(MAC) in the DHT
    3. The access switch stores IP -> (MAC, location) at switch V=F(IP) in the DHT
  8. To resolve an ARP request, an access switch uses the DHT's IP->(MAC, location) and caches the result so packets go directly to the target location without additional lookups.
  9. If a packet is sent to a MAC address which the access switch does not know:
    1. The packet is encapsulated and sent to switch R=F(MAC).
    2. R forwards the packet to the right location.
    3. R sends the info to the access switch
    4. The access switch caches the info
  10. When a host's MAC and/or IP addresses and/or location change:
    1. The switch at which the host gets attached (or remains) updates the info in the DHT.
    2. For a location change, the old access switch is told of the change; so, if it receives packets to the old address, it informs the sending access switch of the change.
    3. For a MAC address change, 1) the switch to which the host is attached maintains a revocation list (IP, MACold, MACnew) for the host. If another host is using the old MAC address, the switch sends a gratuitous ARP to that host. 2) The switch tells the other switch of the new MAC->location mapping so it doesn't do a lookup.
  11. SEATTLE separates unicast reachability from broadcast domains. It uses groups---a set of hosts that share the same broadcast domain regardless of location. Switches somehow (not clear now) know what groups a host belongs to, and use a protocol similar to IP multicast to graft a branch to an existing multicast tree.
The DHT can also store mappings from service names to locations (e.g. DHCP_SERVER and PRINTER). This is how they handle DHCP without broadcasts.
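The consistent-hashing function F at the heart of the steps above can be sketched as follows. This is an illustrative reduction (hash choice and ring layout are my assumptions): each key resolves to the live switch whose hash is the first clockwise successor of F(key) on the ring, so when a switch dies only the keys it was resolving move.

```python
import hashlib
from bisect import bisect

def F(x: str) -> int:
    """Consistent-hash point on a 32-bit ring (illustrative choice)."""
    return int.from_bytes(hashlib.sha256(x.encode()).digest()[:4], "big")

def resolver(switches, key):
    """Return the switch responsible for key: the first switch clockwise
    of F(key) on the ring."""
    ring = sorted((F(s), s) for s in switches)
    points = [p for p, _ in ring]
    i = bisect(points, F(key)) % len(ring)
    return ring[i][1]

switches = {"sw1", "sw2", "sw3", "sw4"}
mac = "00:1a:2b:3c:4d:5e"
before = resolver(switches, mac)
after = resolver(switches - {before}, mac)  # the resolver dies: this key
print(before != after)                      # remaps; most others stay put -> True
```

Minimal churn is the property step 4 in the list relies on: when a resolver changes, publishers re-insert only the (k,v) pairs that hashed to it.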

For evaluation, the authors implemented SEATTLE (on XORP and Click) and used traces from several locations to run simulations. They compared with Ethernet and ROFL, showing that SEATTLE is superior to both in many ways.

The Good:
The architecture is beautiful in many ways. The DHT approach seems very powerful indeed. I don't know of any other solutions to the broadcasting problem, and none that can handle arbitrary L2 topologies. I liked the notions of virtual switches and groups (with caveats below). The evaluation section is nice, though not perfect.

The Bad:
No paper is perfect, unfortunately.
  • While the authors claim to eliminate broadcast, they do use link state updates which are broadcast, even though these do not happen as often as ARP.
  • Virtual switches are nice, but no analysis of their side-effects exist or how difficult it is to implement them.
  • The discussion on areas was completely confusing.
  • The discussion on groups was too abstract. How do switches know what hosts are in what groups? I was not able to visualize how groups would be implemented from the discussion, and so I have to say they seem too difficult.
  • The evaluation section is missing a critical piece. The paper does not mention what has been implemented and what is missing. This makes the evaluation much less credible. How difficult was the implementation? How does it compare to other designs? What is the cost comparison?
  • The per-packet processing cost was not convincing because it was not clear what was implemented and what not.
Nothing was said on multipath, but that was not the point of the paper anyway.