Friday, October 16, 2009

Thoughts on Making Information Flow Explicit in HiStar

Authors: Nickolai Zeldovich, Silas Boyd-Wickizer, Eddie Kohler, David Mazières

Venue: SOSP 2006

Paper: PDF

Summary:

The paper describes, in painful detail, the HiStar operating system and the implementation of two applications on it. HiStar uses decentralized information flow control to provide a platform for running untrusted code on sensitive data, with guarantees about data leakage and integrity.

Motivation:
  • Hard to reason about implications of every action by untrusted
    code
  • chroot can break application assumptions
  • Syscall interposition is error-prone
  • Easier to specify a policy in terms of where information may flow
  • Unlike other IFC systems, HiStar implements standard OS abstractions on top of lower level building blocks that obey IFC
  • Taint tracking mechanism does not leak info itself
The HiStar DIFC model:
There are six kernel object types in HiStar, of which threads are the only ones that execute code. Under Decentralized Information Flow Control (DIFC), each kernel object has a label that describes how it is tainted. An object can be tainted in different sets of categories with different degrees, and the degree of taint in a particular category places a restriction on the object. In HiStar there are four levels of taint, numbered 0 through 3:
  0. Cannot be written/modified (lowest taint)
  1. No restriction (the default)
  2. Cannot be untainted/exported
  3. Cannot be read/observed (highest taint)
Information from one object A can flow to another object B iff B is at least as tainted as A in every category; this is the "flows to" relationship. A restriction imposed by a particular category can be bypassed by objects that own that category; ownership is specified as an additional degree, ★. Categories are created only by threads. A thread that creates a new category is initially its sole owner and can then grant ownership to another thread or to a gate (another kernel object).

A label looks like {a0, b2, c★, 1}, which means that the object is tainted in category a with degree 0, in b with degree 2, and in c with degree ★ (i.e., the object owns category c), and in all other categories with degree 1 (i.e., it is untainted in them). The labels on all objects except threads are immutable and must be specified when the object is created. Only threads create objects.

Threads can read more tainted objects by raising their own labels. A thread, however, can only raise its label up to the limit imposed by its clearance, an additional label on threads that also places an upper limit on the label a thread can request for a newly created object: the thread's label must flow to the created object's label, which in turn must flow to the thread's clearance (bypassing owned categories).
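
To make the label mechanics concrete, here is a toy Python sketch of labels and the "flows to" check. All names are mine, and I treat ownership (★) on either side as bypassing the check for that category, which matches the bypass behavior described above (the paper's formal definition is more careful):

    # Toy model of HiStar labels; names and details are my own simplification.
    STAR = "*"      # ownership degree (the paper's star)

    class Label:
        def __init__(self, taints=None, default=1):
            self.taints = dict(taints or {})  # category -> degree in {STAR,0,1,2,3}
            self.default = default            # degree for unlisted categories

        def degree(self, cat):
            return self.taints.get(cat, self.default)

    def can_flow_to(src, dst):
        """src flows to dst iff dst is at least as tainted in every category."""
        for cat in set(src.taints) | set(dst.taints):
            s, d = src.degree(cat), dst.degree(cat)
            if STAR in (s, d):
                continue          # an owner of the category bypasses the check
            if s > d:
                return False
        return src.default <= dst.default     # covers all unlisted categories

    # The example label {a0, b2, c*, 1} from above:
    obj = Label({"a": 0, "b": 2, "c": STAR})
    thread = Label({"a": 0, "b": 3})
    print(can_flow_to(obj, thread))   # True: thread is at least as tainted in a and b

The same check would cover object creation: one would require can_flow_to(thread_label, new_object_label) and can_flow_to(new_object_label, clearance).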

Kernel:
HiStar uses a single-level store. All objects are stored on disk and, on bootup, the entire system state is reloaded from the latest snapshot. All objects are stored in a container hierarchy with quotas.

There are six kernel object types in HiStar: segments, threads, address spaces, containers, gates, and devices.
  • Segments: segments are the simplest object type in HiStar. They are basically arrays of bytes and are analogous to other OSes’ files.
  • Threads: threads execute code. They can create categories and objects, and change their own label and clearance as long as the new label or clearance can flow to the current clearance ignoring owned categories.
  • Containers: all objects live in containers, which are organized in a hierarchy rooted at the root container, with quotas. Objects are hard-linked into containers, and once disconnected from the root they are garbage-collected. Because the existence of a hard link can leak information, objects are always named by a (container ID, object ID) pair: to learn whether a hard link exists, a thread must be able to read the container. Quotas on objects do not change on their own, so as not to convey information.
  • Address Spaces: each thread runs in an address space, which is a mapping from virtual addresses to segments (named through containers) together with permissions. Address spaces specify the code to run and the virtual memory layout; they are created by a thread and used when launching a new thread. On a page fault, the address space is consulted to determine whether the thread has permission to read the faulting segment and where that segment should be mapped.
  • Gates: gates “provide protected control transfer, allowing a thread to jump to a pre-defined entry point in another address space with additional privileges”. They are quite similar to RPC, except that the calling thread provides the resources for execution. A thread can create a gate and give it a label (respecting DIFC). Another thread can later invoke the gate and request a label and clearance; the requested label and clearance are a combination of the invoking thread's label and clearance and the gate's label and clearance. The thread then starts executing in the new address space (i.e., running some other code) under the requested label and clearance. Return values can be stored inside the thread object itself, but there is no implicit return mechanism: the invoking thread needs to create an appropriately labeled return gate (the paper's discussion of this is quite interesting). A rough sketch of the invocation check follows this list.
  • Devices: Can be mounted…
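
The gate invocation check can be sketched by extending the label code above: invoking a gate contributes the gate's ownership privileges to the flow check on the requested label. This is my own simplification; the paper's full rule also constrains the requested clearance, which I elide here:

    ORDER = {STAR: -1, 0: 0, 1: 1, 2: 2, 3: 3}   # ownership below every taint level

    def owned(label):
        """Categories in which a label holds ownership (STAR)."""
        return {c for c, d in label.taints.items() if d == STAR}

    def may_enter_gate(thread_label, gate_label, requested):
        # Combined privileges: categories owned by the thread or by the gate.
        privs = owned(thread_label) | owned(gate_label)
        # The thread's taint must flow to the requested label in every
        # category it does not (now) own.
        for cat in set(thread_label.taints) | set(requested.taints):
            if cat in privs:
                continue
            if ORDER[thread_label.degree(cat)] > ORDER[requested.degree(cat)]:
                return False
        return True

A thread invoking a gate that owns some category c can thus request any degree in c, which is the "additional privileges" the quote above refers to.
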
User-Level Libraries:
The authors implemented a Unix-like environment entirely in userspace on top of the HiStar kernel, so all of its abstractions respect IFC. The filesystem is implemented using segments and containers: a directory is a container with a special directory segment that maps names to object IDs, plus a mutex to synchronize directory operations. To get a consistent view of directory entries without acquiring the mutex (which requires write permission), a reader atomically reads a generation number and a busy flag before and after reading an entry, as sketched below.
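
The lock-free read is essentially a seqlock. A toy version in Python (all field names here are made up for illustration):

    class DirSegment:
        def __init__(self):
            self.gen = 0          # bumped on every modification
            self.busy = False     # true while a writer is mid-update
            self.entries = {}     # name -> object ID

    def read_entry(seg, name):
        while True:
            gen_before, busy_before = seg.gen, seg.busy
            value = seg.entries.get(name)            # may race with a writer
            gen_after, busy_after = seg.gen, seg.busy
            # Valid only if no writer was active and nothing changed in between.
            if not busy_before and not busy_after and gen_before == gen_after:
                return value

    def write_entry(seg, name, oid):
        # The real writer also holds the directory mutex; elided here.
        seg.busy = True
        seg.entries[name] = oid
        seg.gen += 1
        seg.busy = False
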
A process in HiStar is built from containers plus two process-specific categories that protect its secrecy and integrity. Each process has a process container and an internal container. The process container exposes the objects that form the process's external interface. The internal container holds the segments used to implement the file descriptors, stack, heap, and executable code, as well as the thread that executes the code and its address space.

A user has a pair of unique categories that define her read/write privileges. HiStar does not support ACLs.

HiStar also supports IPC (using gates), signals, and networking (using lwIP). The network device has the label {nr3, nw0, i2, 1}, which taints all data read from the network.

Some Unix operations, namely process exit, quota adjustment, and file creation, inherently leak information. For these operations, HiStar uses untainting gates that explicitly allow the information to leak.

Applications:
The authors used ClamAV as the motivating example and split it up so that the scanner part cannot export user data.

They also implemented an untrusted user authentication mechanism that cannot leak more than one bit of the user’s password. The discussion for this application is interesting, but too complicated to explain in a summary. The main idea is to split up the login mechanism into a directory and gates and use each gate with different privileges to get the information for authentication, check the password, and get the user’s privileges.

Another application was a VPN, where a second lwIP stack runs in parallel with a different label so that public Internet data and VPN data do not mix (at least not directly).

Performance:
HiStar's performance is comparable to Linux and OpenBSD, sometimes faster and sometimes slower. It is slower on the fork and exec microbenchmarks. However, it enables new consistency primitives such as group sync, where no changes are committed until an application finishes executing.

Limitations:
Some useful features are missing from the Unix environment, while the semantics of others are modified. A crash or killed process can leave mutexes locked, and there needs to be support for cleaning those up. There are no CPU quotas and HiStar does not solve timing covert channels.

The Likes:
  • The paper was thorough in the description and did not leave questions
    unanswered.
  • DIFC seems very useful, and the OS seems to implement it well.
  • It is nice to build a layered system on a few basic building blocks that enforce security and to be able to reason about the rest of the system from those.
  • The OS is full of clever mechanisms that were carefully composed to build useful libraries.
  • It works!
  • For non-malicious code, HiStar seems to be able to protect against bugs.
  • It’s cool that the taint degrees map onto numbers that can be compared using simple integer comparisons.
The Dislikes:
  • The paper is very dense and takes quite a bit of effort to read. It would’ve been nice if the authors had focused on the important points and insights instead of just spewing out information.
  • The userspace environment is not completely compatible with Unix, but ah well.
  • The timing channels problem seems to be a major issue
  • It looks difficult to build applications using the HiStar primitives.
  • What is the killer app for HiStar? What can it be used for given all its limitations? What are all these features good for?

Tuesday, October 13, 2009

Thoughts on The Case for OneSwarm

I should've posted this entry a while ago. This is a talk that Tom Anderson gave at MSR in Mountain View on 9/29/2009.

Summary:
Cloud computing has many issues:
  • Data can be locked in (Facebook, ...)
  • Need to trust providers with data
  • Cloud is a natural monopoly:
    • The business model is to attract developers first and then users, so the users have no choice.
    • Reliability and the ability to handle flash crowds is a high barrier to entry.
    • There is no plausible avenue for an open source Cloud. Somebody needs to pay for the infrastructure, so somebody needs to be making money off of it to be sustainable, and the way to do this is through data lock-in and/or data mining...
  • Since all data is stored somewhere centrally, it's easier to censor it
Can we use a P2P system instead where users can donate some of their resources to be able to use other people's resources? At first glance, it looks like this might be alright. P2P systems
  • have high availability -> But in fact this is not true because seeds disappear
  • have no bottlenecks -> But there could be network inefficiencies or ISP throttling
  • have no centralized monitoring -> But it's actually easy to track users
The alternative is OneSwarm. OneSwarm has no centralized trust. The idea is to use privacy-preserving sharing where you build a web of trust through friends. Then when a user makes a query, the query goes through friends, then friends of friends, and so on... and the data travels on the reverse path back to the user. Nodes are not supposed to reveal information about where a query they pass along came from, so the receiver of a query cannot relate a user to a query.
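
A toy sketch of that forwarding scheme (this is illustrative structure, not OneSwarm's actual protocol):

    class Node:
        def __init__(self, name, data=None):
            self.name = name
            self.friends = []        # direct friend links only
            self.data = data or {}
            self.pending = {}        # query id -> friend to send results back to

        def search(self, qid, key, from_node=None):
            if qid in self.pending:
                return               # already saw this query
            self.pending[qid] = from_node
            if key in self.data:
                self.reply(qid, self.data[key])
            for f in self.friends:
                if f is not from_node:
                    f.search(qid, key, from_node=self)

        def reply(self, qid, value):
            back = self.pending.get(qid)
            if back is None:
                print("got result:", value)   # we are the original requester
            else:
                back.reply(qid, value)        # hop back along the reverse path

Each hop only ever sees its direct neighbor, which is the point: the node answering a query cannot tell who originated it.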

The system is, by Tom's admission, vulnerable to collusion, but that's okay. The system is not designed for complete security but rather as a compromise between security, flexibility, and performance. Another assumption is that there is no partitioning in the network, that eventually everyone will be connected, and that everyone will have access to everything (which will probably end up happening anyway).

The network of friends is organized as a DHT. To maintain availability and solve the issues with traditional P2P systems, they invented a new type of DHT that is highly scalable and reliable: nodes are organized into tight groups that replicate their state via Paxos, and each group then appears as a single node in the DHT.
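
A minimal sketch of the group-as-a-node idea (Paxos itself elided; all names here are made up):

    import hashlib

    def ring_hash(s):
        return int(hashlib.sha1(s.encode()).hexdigest(), 16)

    class ReplicaGroup:
        """A tight group of nodes replicating state via Paxos; to the DHT
        it looks like one highly available node."""
        def __init__(self, members):
            self.members = members
            self.position = ring_hash("|".join(members))

    def lookup(groups, key):
        """Return the group whose ring position follows the key's hash."""
        k = ring_hash(key)
        for g in sorted(groups, key=lambda g: g.position):
            if g.position >= k:
                return g
        return min(groups, key=lambda g: g.position)   # wrap around the ring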

As a final note, I had a chat with Tom after the talk. He has a pessimistic outlook on what will eventually happen. Mainly, that cloud providers will always end up creating a monopoly using data lock-in and be able to mine the data stored on the cloud, because they will manipulate the API so they can do so.

Likes:
I like OneSwarm. In particular, I liked the DHT implementation; they had some really cute ideas there. It is not a security protocol but a "casual" privacy protocol that presents a minimum barrier to snooping: if someone really wants to get at my data and find out what requests I am making, they would be able to do so.
I also like the idea of a P2P system presented as cloud storage, on top of which users can implement "free" applications, such as an open-source Facebook.

Dislikes:
  • I'm not sure I agree with the pessimistic view. Google has started a movement called the Data Liberation Front to free users from data lock-in, and I expect such efforts to become more important.
  • There are more reasons than just convenience for applications to run in a data center. If we move from DCs to P2P systems for storage, we will get much higher lookup costs, and it is not clear systems will scale.
  • There are no guarantees of connectivity. The social network might get partitioned, limiting the available data.