It is totally awesome that function similarity is seeing the light of day. Google bought Zynamics way back when and their work has evolved a lot since then.
This sort of work has big implications for signature generation for malware samples: for clustering families of samples as well as for finding common functions to generate detections on. You couldn't necessarily throw this into a detection engine, because we don't have a fast (dedicated) function recovery tool for binaries, but you can absolutely use it to generate byte-based detections from a seed of a few samples.
Rather than having hash-based signatures, you could generate signatures that cover many samples (and likely new ones) in bulk. Normally a good signature like that requires manual effort from an analyst; this is a step toward machines doing it. As well, a central authoritative name database that could say "this is Petya" could force some sane naming convention on the industry (every AV wouldn't just be like "it's Zbot lol").
This stuff can even aid manual reverse engineering. You could build a function naming database that uses this. Maybe a new engine for Talos FIRST. [1] Then if you opened up a file without debug symbols this could match it to known functions and really speed up reverse engineering efforts.
I look forward to reading it in more detail tomorrow. Thanks to Halvar for putting this out.
> An efficient implementation of a hash function (based on SimHashing) which calculates a 128-bit hash from disassembled functions - and which preserves similarity (e.g. “distance” of two functions can be calculated by simply calculating the hamming distance between two hashes - which translates to two XOR and two POPCNT instructions on x64).
Huh! This obviously desirable property of a good hash function for this application is one of the classic undesirable properties of a good hash function for cryptography. I don't think that I was previously familiar with this sort of hash function, very cool.
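For anyone curious how cheap that comparison is, here's a rough sketch (plain Python, not the FunctionSimSearch API) of the distance computation the quote describes, with the 128-bit hash split into two 64-bit words:

    def hamming_distance_128(a_hi, a_lo, b_hi, b_lo):
        # Number of differing bits between two 128-bit hashes stored as two
        # 64-bit words each; on x64 this boils down to two XORs and two POPCNTs.
        # int.bit_count() needs Python 3.10+; older versions can use bin(x).count("1").
        return (a_hi ^ b_hi).bit_count() + (a_lo ^ b_lo).bit_count()

    # Similar functions give a small distance; unrelated ones land near 64.
    print(hamming_distance_128(0xDEADBEEF, 0x1234, 0xDEADBEEC, 0x1235))  # 3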
Are space-filling curves related to these hash functions?
Not all hashes are intended for cryptographic applications or want the Avalanche effect. Some are simply succinct digests of data and some attempt to retain important information like the hash described here.
Another example of this sort of hash function (where the hamming distance between the hashes of similar inputs will be small) is pHash[1]. I've been playing with it recently for deduping bit-different but perceptually-identical images, and it works pretty well and the low hamming distance property makes it dead simple to use.
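To make the dedup step concrete, here's a hypothetical sketch in Python of what I mean. It assumes the 64-bit perceptual hashes have already been computed (by pHash or similar), and the threshold of 10 is just an arbitrary pick for illustration, not anything pHash prescribes:

    def group_near_duplicates(hashes, threshold=10):
        # hashes: {filename: 64-bit perceptual hash}; greedily group files whose
        # hashes are within `threshold` bits of a group's representative.
        groups = []  # list of (representative_hash, [filenames])
        for name, h in hashes.items():
            for rep, members in groups:
                if bin(h ^ rep).count("1") <= threshold:  # perceptually close enough
                    members.append(name)
                    break
            else:
                groups.append((h, [name]))  # no close group found, start a new one
        return [members for _, members in groups]

    # "b.jpg" differs from "a.jpg" by one bit, so they end up in the same group.
    print(group_near_duplicates({"a.jpg": 0xF0F0, "b.jpg": 0xF0F1, "c.jpg": 0x0A0A}))
    # [['a.jpg', 'b.jpg'], ['c.jpg']]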
Space-filling curves would work in some situations, but wouldn't be as efficient as a purpose-designed hash. A space-filling curve, for example, would break down at 50+ dimensions, which is why you'd be using one of these hashes in the first place rather than a tree.
If you want to know more, I'd suggest researching Locality Sensitive Hashing. Flann[1] has a decent implementation.
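If it helps, here's a toy SimHash in Python to show the general LSH idea: each feature votes a pseudo-random ±1 per bit position and the majority sign becomes that bit, so inputs sharing most features agree on most bits. The word-token features here are purely an illustration; FunctionSimSearch extracts its features from disassembled functions instead.

    import hashlib

    def simhash(features, bits=64):
        counts = [0] * bits
        for feat in features:
            # Stable pseudo-random bit pattern per feature (md5 used as a cheap PRF).
            digest = int.from_bytes(hashlib.md5(feat.encode()).digest(), "big")
            for i in range(bits):
                counts[i] += 1 if (digest >> i) & 1 else -1
        # Majority vote per bit position.
        return sum(1 << i for i in range(bits) if counts[i] > 0)

    a = simhash("mov push call ret mov add".split())
    b = simhash("mov push call ret mov sub".split())
    # Distance is well below the ~32 expected for unrelated 64-bit hashes.
    print(bin(a ^ b).count("1"))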
Pardon me if I'm not well informed; that's why I ask. Is the problem similar to string search with multiple patterns? Could the Rabin-Karp algorithm be used for searching? It also uses hashing. Or could this algorithm be used instead of Rabin-Karp?
You can use bad hashing functions for a lot of applications. Take the traditional interview question of finding all anagrams in a list of words: you can sort the letters of each word as the hashing function and use a traditional hashmap that does conflict resolution with a linked list as the solution. Fun stuff
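A quick sketch of that trick, using Python's dict in place of the hand-rolled hashmap:

    from collections import defaultdict

    def group_anagrams(words):
        buckets = defaultdict(list)
        for word in words:
            # Sorted letters as the (deliberately collision-prone) hash key:
            # all anagrams of a word produce the same key and share a bucket.
            buckets["".join(sorted(word))].append(word)
        return list(buckets.values())

    print(group_anagrams(["listen", "silent", "enlist", "google", "banana"]))
    # [['listen', 'silent', 'enlist'], ['google'], ['banana']]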
Bad isn't the right word here. It's a bad cryptographic hash, sure, but a hash doesn't necessarily need to be cryptographic to be a hash. pHash for finding similar images; soundex/metaphone for similar English pronunciation; CRCs, CityHash, SipHash, &c for fast checksums and hashtables; and locality-sensitive hashes like SimHash for "close matches" in spam filtering and content aggregation all have a long history of being good and useful hashes without being cryptographically secure.
The zlib vulnerability was a huge wakeup call. Years ago I used to be in favor of static linking, to limit dependency creep. Disk space is cheap, so large binaries were never a concern.
My favorite approach was a single repository compiling optimized binaries with the specific versions of the libs I wanted (in case of weird regressions), then pushing these static binaries to the rest of the network.
After the zlib debacle, no more: I only use that approach for very specific mission-critical tools, where I do not trust Ansible or even Linux distributions.
The SQLite 0-day may have reignited the same fears in those too young to remember grepping for the various zlib signatures in binaries - not just your own (your central repository can easily push new versions to your network) but in the other tools you don't necessarily control.
Unfortunately, outside of Linux distros, static linking seems to be becoming more and more the new norm, especially if you include what is effectively "soft" static linking: bundling all dependencies, which suffers from the same problems. Rust is only suitable for static linking, .NET Core statically links and most .NET projects will have all sorts of out-of-date dependencies, and npm bundles huge dependency chains that may not be upgraded. Even in the Linux distro world there is movement toward tools like Docker, where dependencies will often not be patched.
Even the idea of stable versions of libraries with security patches seems to be a dying one.
It supports it terribly. Since there's no stable ABI you have to use the same compiler version. That or be limited by the C ABI and give up most of the safety features.
This isn’t unique to Rust, C++ has the same issues when it comes to using shared libraries. You either carefully control compiling and shared library versions or you drop down to C.
I don’t know any native language that can do this in a better way.
I feel like it's pretty easy to understand the hate. For example, the recent X.509 verification DoS patched in Go 1.11.3 required me to recompile, tag, and redeploy a bunch of services. And then also push an RPM to thousands of customer servers, which costs me CDN money :(.
I would have preferred for myself and my customers to just upgrade OpenSSL.
I am still a big fan of static binaries though, they more than make up for their downsides.
Where did I say I hated it? It just has different security characteristics, and I think static linking proponents are sometimes a bit hand-wavy about the implications of those.
I suppose my gut feeling is that it's more prone to issues of human error - why should I have to recompile my application if the security hole was found in a library I've linked to? With dynamic linking, assuming you're publishing your software for an OS with a conscientious approach to security patches (i.e. pretty much all the major ones), it's a solved problem. With static linking, if you forget to update your binary, well, it's your problem.
Very specific tools, sure. If you have the time, people, and skills to keep your own list of baked-in dependencies and monitor it for issues - great. But for a random person downloading something from the internet, it's a dangerous thing to default to.
Even FAANG companies rely on distributions and other companies to spot vulnerabilities, rebuild libraries, test and validate them.
Even if the company rebuilds everything, there's a huge benefit in knowing that you are using a well tested release instead of a less popular one or an internal fork.
Dynamic linking is also really bad for security. There exist whole classes of bugs that require it.
Library substitution is a big problem. If an attacker can get a library into a place where the executable looks for libraries, the attacker gains control.
Merely having the capability to load a dynamic library is an issue. Generally, this means it is possible to load code into the process.
Library ABI mismatches can be security bugs. There can exist two pieces of software that can be installed and used separately without trouble, but which have security bugs when both installed. This happens for example when the second piece of software to be installed brings along an updated library that isn't fully compatible (even bug-for-bug) with the one that the other software came with.
Hah, no. AV databases are just real dumb tries of whole-file MD5s, or a similar level of sophistication (i.e., not very). Usually combined with an unnecessarily privileged and unsandboxed parser for arbitrary weird file formats.
The article makes more sense if you consider that Google is notorious for statically linking (almost) all its production binaries, something that the Go toolchain supported early on for related reasons.
I’m guessing it’s also related to the recent SQLite zero-day. SQLite is public-domain, and so it’s not unusual for projects to simply include the code directly in their source distributions, which will either build it in or will provide a `./configure` option to use an already-built form (this is used by distro packagers). When the “build it in” option is used, then it’s statically linked.
I think Thomas Dullien has been working on this stuff for a very long time; his company, Zynamics, which Google acquired ages ago, was the author of BinDiff.
There are many things I disagree with Google about, but statically linking binaries is a good practice if you know what you are doing and have a central repository where you handle your own versions of everything.
You cannot always rely on a Linux distribution not to mess with a lib, especially if it's "obscure" with few users.
1. https://www.talosintelligence.com/first