Thursday, December 3, 2009

Thoughts on Dispatcher: Enabling Active Botnet Infiltration using Automatic Protocol Reverse-Engineering

Authors: Juan Caballero, Pongsin Poosankam, Christian Kreibich, Dawn Song

Venue: CCS 2009

Summary:
The paper describes a tool that can be used to automatically reverse engineer the protocol format of an unknown protocol. The tool can be used to deduce the format and some of the semantics of the fields in both sent and received protocol messages. It can do this reverese engineering without requiring the binaries on both sides of a protocol, which makes it useful in situations where only the client is available (such as a bot binary) or only the server.

The ideas used are quite clever and simple. The main idea is to look at each byte in the buffer that constitutes a message sent or received, and then create a dependency chain for that byte. The dependency chain is the sequence of instructions and data sources used to write the byte. The dependecy chain ends when a source for an instruction is an immediate, a constant, another memory location, or unknown (such as an arithmetic op). Then by comparing the ultimate sources for contiguous destination bytes and seeing if they are derived from the same contiguous set of source bytes, they can tell the position and length of buffers that form the fields in a message. They do this hierarchically in the buffer so they can also extract the hierarchy of fields.

They can infer some semantics by looking at where these fields are used. If they are used as arguments to or if they are return values of known functions, they can immediately get the type, and some semantics such as an IP address when used in the connect function.

Furthermore, they are able to infer loops using static and dynamic analysis and then be able to infer fields that are used to indicate offsets and lengths. And by looking at repeated bytes, they are able to infer delimiters. While arrays are inferred by assuming that arrays are usually written inside loops.

To handle encrypted messages, they use the intuition behind other work, ReFormat by Z. Wang et al. The idea is that functions that have a disproportionate amount of arithmetic instructions to other types of instructions are probably cryptographic. They use this idea to infer which buffers are unencrypted and operate on them. They have used a threshold of 20 minimum instructions for a function to be considered and a ratio of 0.55, and were able to get a false positive rate of 0.002%.

For evaluation, the authors studied the MegaD command and control C&C protocol. They were able to infer the message formats used by MegaD and were able to rewrite a message for indicating SMTP sending capabilities. They have also looked analyzed other open known protocols and compared their results using Dispatcher with those from Wireshark. They say that Dispatcher outperforms Wireshark.

In terms of resiliency, the authors claim that they target fundamental properties to avoid becoming obsolete if malware adapts. However, they do confess that if malware can detect that it is being run in a virtual environment, then their technique can be circumvented.

The Likes:
  • The paper is very well written and clear.
  • The ideas are nice and clever, and they seem to work well.
  • They do not need both sides of a protocol
  • Can be used to infer correctness of messages when a protocol is known
  • Can make writing a Wireshark dissector easier
  • Might be used to send erroneous information to botnet masters (I don't know how useful this is. If you know that a machine is already compromised, you can should fix the situation. You cannot take over the botnet.)
The Dislikes:
  • Can be circumvented by detecting a virtualized environment. I don't know how easy this is, but for serious malware, this might not be a problem.
  • The authors claim that they target fundamental properties, but provide no evidence or proof. It is not clear that the properties they target are really fundamental. For example, they use some known functions to infer semantics. The bad guy might ship his own libraries so as to restrict the API used. The thresholds used to detect encryption and decryption are empirical, and seem easy to obfuscate. Adaptive malware might be able to change keyword values or restructure the program to make it more difficult to infer which functions are doing what. The program might be written in a difficult to analyze way where fields of the messages are always results of arithmetic computations. The program might use complex loops with jumps and gotos that make analysis difficult.
  • They claim they do better than Wireshark, but it's not clear why they made that conclusion. The dissectors made different assumptions about what delimiters are, but it does not mean they are incorrect (or maybe I just did not understand what they meant).

No comments:

Post a Comment