Friday, August 7, 2009

Thoughts on Fabric: A Platform for Secure Distributed Computation and Storage

Authors: Jed Liu, Michael D. George, K. Vikram, Xin Qi, Lucas Waye, Andrew C. Myers (Cornell University)
Venue: SOSP 2009

Summary:
The paper describes Fabric: a language that is mainly an extension of Jif into the distributed-computing world. The paper starts with motivation, citing patient-record sharing as an example of where it is necessary to share data between two domains such that each domain's security policies are enforced.

In Fabric, there are three types of nodes:
  1. Stores: store objects persistently
  2. Workers: perform computation
  3. Dissemination nodes: serve as CDNs for stored objects
Naming: Fabric uses an object-oriented DB model. Fabric objects store information, are mutable, and have an object id, or oid. The oid is mainly the concatenation of the location where the object is stored (as a DNS name) and a 64-bit number. The oid is permanent: if the object moves, a pointer in the old location will point to the new location.
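The naming scheme above can be modeled in a few lines. This is a minimal sketch (the class and function names are mine, not Fabric's API) showing the store-name-plus-64-bit-number structure of an oid and how a forwarding pointer keeps the old oid working after an object moves:

```python
# Minimal model of Fabric-style oids: a store's DNS name plus a 64-bit
# object number. Names here are illustrative, not Fabric's actual API.
class Oid:
    def __init__(self, store: str, onum: int):
        assert 0 <= onum < 2**64
        self.store, self.onum = store, onum

    def __repr__(self):
        return f"fab://{self.store}/{self.onum}"

# Forwarding pointers: if an object moves, the old store keeps a pointer
# to the new location, so the original oid remains a permanent name.
forwarding = {}

def resolve(oid: Oid) -> Oid:
    while (oid.store, oid.onum) in forwarding:
        oid = forwarding[(oid.store, oid.onum)]
    return oid

old = Oid("storeA.example.edu", 42)
new = Oid("storeB.example.edu", 7)
forwarding[(old.store, old.onum)] = new
print(resolve(old))  # fab://storeB.example.edu/7
```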
Certificates are the roots of trust and are used to establish SSL connections.

Labels: Each object has a label describing its integrity and secrecy requirements.

Classes: Class objects create an unforgeable binding between each object and the correct code for implementing that object.

Versions: Fabric objects are mutable. Each has a version number that is incremented when a transaction updates the object. The version number is used to check whether a transaction is operating on a stale or the up-to-date copy of the data.
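The version-number check amounts to optimistic concurrency control. A minimal sketch (my own names, not Fabric's protocol) of how a store could use versions to reject transactions that read stale copies:

```python
# Illustrative sketch of the version check described above: a transaction
# records the version of each object it read; at commit time the store
# aborts it if any of those objects has since been updated.
store = {"obj1": {"value": "old", "version": 3}}

def try_commit(read_set: dict, writes: dict) -> bool:
    # Abort if any object we read is now stale.
    for oid, seen_version in read_set.items():
        if store[oid]["version"] != seen_version:
            return False
    # Apply writes, bumping each object's version number.
    for oid, value in writes.items():
        store[oid]["value"] = value
        store[oid]["version"] += 1
    return True

assert try_commit({"obj1": 3}, {"obj1": "new"}) is True   # up to date: commits
assert try_commit({"obj1": 3}, {"obj1": "x"}) is False    # stale read: aborts
```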

Threat Model: Timing and termination channels are ignored. It is assumed that the network will deliver messages eventually. It is not clear how strong this assumption is. Do they need the network to never fail, or else everything crumbles? Or do they only assume that the network works mostly as it does today (out for a few hours at most)?

Another assumption that is used in the rest of the paper is that a program knows a priori the nodes it will use for computation and dissemination, or that they will be input into Fabric somehow. Similarly for storage nodes.

Storage Nodes: Store objects. Related objects are grouped together and returned together when requested, for efficiency. They track delegation, or as the authors call it, "trust relationships".

Worker Nodes: Execute Fabric programs.
  • Trusted programs can incorporate other intermediate-language programs that lack security constructs.
  • Objects are modified only inside transactions
  • Worker nodes can either request remote data (data shipping)
  • or perform remote invocations (function shipping)
  • Remote invocations don't need the object to already exist at the remote worker node
  • The entire method call is executed in a subtransaction whose effects are only visible when the top-level transaction commits. The whole transaction tree is then committed.
  • The caller side of the remote invocation checks its security statically. The callee checks it at run time, since it doesn't know who is going to invoke it; i.e., the language has to explicitly check trust: e.g., if a trusts b: then a.invoke(...)
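The static/dynamic asymmetry in the last bullet can be sketched as follows. This is a hedged model of the callee-side run-time check only (names like `acts_for` and `remote_invoke` are mine; the caller side is checked by the compiler and so does not appear):

```python
# Illustrative model of the callee-side dynamic trust check: the callee
# cannot know its callers statically, so at run time it verifies that the
# calling node acts for the principal whose authority the method requires.
acts_for = {("workerA", "alice")}  # workerA has been delegated alice's privileges

def delegates(superior: str, inferior: str) -> bool:
    return superior == inferior or (superior, inferior) in acts_for

def remote_invoke(method_owner: str, caller_node: str, method):
    # Callee-side run-time check: does the caller act for the owner?
    if not delegates(caller_node, method_owner):
        raise PermissionError(f"{caller_node} does not act for {method_owner}")
    return method()

assert remote_invoke("alice", "workerA", lambda: "ok") == "ok"
```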
Dissemination Nodes:
  • Dissemination layer is not prescribed, and currently uses FreePastry
  • Object on dissemination nodes encrypted
  • Key stored on object's store
  • Store can verify security and dole out the key when it is safe to do so. **They say workers need not fetch keys often, but it is not clear why...p5
  • Stores and dissemination nodes track readers.
  • When object updated, updates are pushed to tracked worker nodes.
  • Dissemination nodes can learn info from request patterns, so they are labeled with some label representing "the maximum confidentiality of information that may leak on this channel". Workers only issue fetch requests if this maximum is not violated. **I do not understand this whole paragraph. How do they quantify the amount of information leaked on this covert channel and turn it into a label?
The Fabric Language: They extend Jif to support
  1. Transactions and consistency
  2. Remote method calls
The interesting thing is that they do both of these without leaking information and so that remote calls are authorized. I think this should probably say "where information leakage is explicit", because there will always be some undesired information leakage.

Then they go on to say: "Fabric adds a new trust ordering on information flow labels". This I found confusing and intriguing at the same time. Eventually, I will be disappointed...

In Fabric, everything is in terms of principals.
  • There are no partial privileges or dynamic labels for principals. A principal always executes with its full privileges.
  • They add an acts-for relationship, which is basically just delegation of all privileges, and they call this "trust". If a acts for b, then b trusts a, or b has delegated all its privileges to a.
  • There's a top principal (I'll call it T, but they use a symbol I'm too lazy to look up) that acts for all other principals.
  • There's a bottom principal (B) that all principals act for.
  • Fabric nodes are also principals, and so also follow the acts-for model.
  • Principals implement their own authentication. The description of how trust is bootstrapped in the system is too confusing; they must have cut too much here. My imagination fails me. **They need to describe bootstrapping more. How does a user delegate trust to a node? How is authentication turned into a principal?
  • Information flow security policies are expressed in terms of principals. They claim this integrates access control and information flow, though I don't know what that means. Information flow security policies are access control policies!
  • Labels in Fabric look like {alice -> bob}, meaning alice, the owner of this security policy, allows bob to read the data. {alice <- bob} means alice allows bob to write to it. Since the owner is explicit (alice), only code running with alice's authority (a worker node that acts for alice) can declassify or endorse data with alice's labels.
  • {alice->bob} can flow to {charlie->dora} only if charlie acts for alice and dora acts for bob or dora acts for alice.
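The flow rule in the last bullet is easy to model directly. A minimal sketch (the `acts_for` set and function names are mine, not Fabric's implementation) of checking whether one confidentiality policy may flow to another:

```python
# Model of the flow rule quoted above: {alice->bob} may flow to
# {charlie->dora} only if charlie acts for alice, and dora acts for
# bob or for alice. Purely illustrative.
acts_for = {("charlie", "alice"), ("dora", "bob")}

def delegates(a: str, b: str) -> bool:
    return a == b or (a, b) in acts_for

def may_flow(src, dst) -> bool:
    (o1, r1), (o2, r2) = src, dst  # (owner, reader) pairs
    return delegates(o2, o1) and (delegates(r2, r1) or delegates(r2, o1))

assert may_flow(("alice", "bob"), ("charlie", "dora")) is True
assert may_flow(("alice", "bob"), ("eve", "dora")) is False  # eve doesn't act for alice
```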
Trust Ordering: As well as being ordered by information flow using the flows-to relationship, labels can be ordered by trust (the acts-for relationship). The higher a label's trust requirement, the more a node needs to be trusted. For example, {a->b} actsfor {c->d} iff a actsfor c and b actsfor d or c. That is, one label can be used in place of the other in that case. This can be used to establish trust in nodes: if an object x with label L_x is to be stored on a node n, then formally the node's label L_n = {T->n; T<-n} must act for L_x.
e.g. if L_x is {p->q} then either n actsfor p or n actsfor q. In either case, n must be trusted with the contents of x.
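The worked example above can be checked mechanically against the stated rule. This sketch (my own names; the rule is taken verbatim from the text, not from the paper's formal semantics) encodes "{a->b} actsfor {c->d} iff a actsfor c and b actsfor d or c" and applies it to the node-trust case:

```python
# Checking the node-trust example: for node n to store an object labeled
# {p->q}, the confidentiality part of the node label, {T->n}, must act
# for {p->q}, which reduces to n acting for q or for p. Sketch only.
acts_for = {("n", "q")}  # n has been delegated q's privileges

def delegates(a: str, b: str) -> bool:
    return a == b or a == "T" or (a, b) in acts_for  # T acts for everyone

def label_acts_for(lhs, rhs) -> bool:
    # {a->b} actsfor {c->d} iff a actsfor c, and b actsfor d or c.
    (a, b), (c, d) = lhs, rhs
    return delegates(a, c) and (delegates(b, d) or delegates(b, c))

# Node n's label dominates x's label, so n may store x; node m's does not.
assert label_acts_for(("T", "n"), ("p", "q")) is True   # n actsfor q
assert label_acts_for(("T", "m"), ("p", "q")) is False  # m trusted by no one
```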

They mention that the reason this trust ordering is needed is because "Fabric classes may be parametrized with respect to different labels or principals, and so instances of the same class may have different labels. This allows implementation of reusable classes..." However, it is not clear to me that anything other than good ol' delegation is necessary. Of course you have to delegate to a node if that node is to enforce your labels! That is why in DStar exporters own the categories in the labels they enforce. Hence the disappointment I alluded to earlier.

As in Jif, implicit flows are tracked with the program counter's label.

For remote invocations, they say that the callee checks at run time that the caller has the authority to invoke something with another principal's endorsements, but they do not mention how this check is done. Are there certificates as in DStar? **Need to mention how delegation is proved across nodes.

Transactions: Fabric provides transactions that can be distributed. When distributed, they are turned into a transaction tree. The whole tree is committed when the top-level worker node starts the commit. It does so by recursively calling commit on the whole tree (note: recursively, not iteratively). Most of what they describe about transactions works as you would expect.
A failure of a node to commit a transaction correctly is as serious as if that node violated trust by not updating the object's fields correctly.
If a transaction coordinator fails midway through a transaction, some changes might be committed while others are not. This can break consistency, but only for objects whose integrity the failing node was entrusted with.
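The hierarchical commit described above can be sketched as a recursion over the transaction tree. This is an illustrative skeleton (my own class, not Fabric's protocol; the real system uses a two-phase commit among stores, which is elided here):

```python
# Sketch of hierarchical commit: the top-level worker commits the
# transaction tree by recursively committing its subtransactions, whose
# effects become visible only once the whole tree has committed.
class Txn:
    def __init__(self, name: str, children=()):
        self.name, self.children, self.committed = name, list(children), False

    def commit(self):
        # Recursively commit the subtree rooted here, children first.
        for child in self.children:
            child.commit()
        self.committed = True

# A top-level transaction with two subtransactions from remote calls.
leaf1, leaf2 = Txn("remote-call-1"), Txn("remote-call-2")
top = Txn("top", [leaf1, leaf2])
top.commit()
assert leaf1.committed and leaf2.committed and top.committed
```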

Writer Maps: Used to allow workers to check whether an object has been updated without revealing information to workers who shouldn't learn about the updates. The writer map is a log of writes. Every time an object is updated, an entry hash(oid, xid, key) -> E(key, w) is stored, where w is the writer. It allows nodes that know the key to get the latest update and learn who the writer for that update was. But it is not clear how nodes know what the last transaction id is; it is probably stored in the clear. **Nodes also still learn that an object has been updated, since the writer map changes, even if they are not allowed to know what the exact updates are.
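The hash(oid, xid, key) -> E(key, w) structure can be modeled concretely. A minimal sketch, with a toy XOR-pad standing in for a real cipher and all names my own, showing that only holders of the object's key can find and decrypt the entry:

```python
# Model of a writer-map entry: hash(oid, xid, key) -> Enc(key, writer).
# A worker that knows the oid, the transaction id, and the key can look
# up who last wrote the object; others see only opaque hashes/ciphertexts.
import hashlib

def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"|".join(parts)).digest()

def enc(key: bytes, msg: bytes) -> bytes:
    # Toy XOR pad (self-inverse), NOT a real cipher; illustration only.
    pad = hashlib.sha256(key).digest()
    return bytes(a ^ b for a, b in zip(msg.ljust(32, b"\0"), pad))

writer_map = {}
oid, xid, key = b"fab://storeA/42", b"txn-17", b"object-secret-key"
writer_map[h(oid, xid, key)] = enc(key, b"workerA")

# Key holders recover the writer; enc is its own inverse here.
entry = writer_map[h(oid, xid, key)]
assert enc(key, entry).rstrip(b"\0") == b"workerA"
```

Note that, as the reviewer points out, even a keyless observer sees the map grow, so the fact that *some* update happened still leaks.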

Implementation and Eval: The implementation is 33k lines of code on top of Jif and other libraries. They use BDB for the store and FreePastry for dissemination.
A few things are left unimplemented, most notably the distributed deadlock detection and timeouts for divergent computations.

For the eval, they implemented a course management system and tested it on 55 students. Complex queries were replaced with object-oriented code, which simplified the implementation from 3100 lines of code to 740. However, they only migrated part of the full system, and only ported it to FabIL, the intermediate language used by Fabric that has no security properties, only consistency. Performance shows that FabIL is faster than all the other implementations with consistency; transaction management and consistency cost about 50% of performance. **The metric doesn't really show much, since only parts of the implementation were done. It is not clear why FabIL is faster. Maybe I didn't understand what FabIL does. Is the implementation on a set of nodes? How does BDB run faster than Oracle? It still says nothing about security and performance with full Fabric used.

They also implemented a calendar. They didn't get performance numbers. Is it because the calendar didn't fully work?

They used the OO7 benchmark on Fabric nodes and found that Fabric is 10 times slower than non-persistent Java. They say this is acceptable, but that is not clear.

In summary, the paper is a thorough design of a language for distributed computation and storage where DIFC is involved. But, like almost every other paper on the subject, it has left me with more questions than answers.

The Likes:
  • The paper is thorough in the design description
  • The paper is probably the first to combine DIFC with transactions to provide consistency
  • They combine language constructs with run-time checks. It's cool that trust requirements are in the language.
  • They do so much stuff!
The Dislikes:
  • Ah so many questions! See the text (**). I'm too lazy to put them here.
  • Why the trust ordering on labels? Why is delegation not enough? I am not convinced that it is necessary.
  • They have said almost nothing about how runtime checks are implemented. Do they use certs as in DStar to prove delegation?
  • That bootstrapping paragraph was completely useless. It left me more confused than informed. Probably because they did not explain how runtime enforcement is done and how authentication is turned into labels or proofs of labels.
  • The evaluation leaves so much to be desired. There is no full implementation of any system that uses Fabric and no real performance evaluation. In all, that was a disappointing eval section.
  • What is the deal with the dissemination nodes? Yup, they are cool, but where is the part about how they help? Are they basically CDNs? How are they used? Why not just replication?
  • I don't like the fact that principals always use their full privileges to access everything. What if a principal wants to operate in an untrusted environment, or doesn't want a node to be responsible for the integrity of so many objects?
  • The paper does not say much about the "control plane": Where do the names of nodes come from? How does a user select nodes? How do programs select/find nodes? How do a user's credentials translate across multiple nodes? I guess all this is orthogonal, but it would've helped me understand Fabric's usage model, which was quite obscure (as in many other DIFC papers).
  • I think my main issue is why is the system not layered? Why aren't transactions built on top of a minimal base whose properties are easy to reason about? Why are dissemination nodes a primary object instead of being implemented on top of worker and storage nodes, i.e. on top of Fabric? Is it convenience? What happens when things need to change? What about the modularity argument?
