How do AV files look for files for known signatures so efficiently?

Data in the form of search strings continues to grow as new versions of viruses are released, which prompts my question: how do AV files look for files for known signatures so effectively? If I upload a new file, my AV scanner quickly identifies the file as a threat or not, based on its signatures, but how can it do this so quickly? I am sure that by this moment there are hundreds of thousands of signatures.

+7
source share
3 answers

UPDATE: As the triptych pointed out, the Aho-Corasick algorithm seems very relevant for antivirus scanners. Here's what you need to read:

http://www.dais.unive.it/~calpar/AA07-08/aho-corasick.pdf

http://www.researchgate.net/publication/4276168_Generalized_Aho-Corasick_Algorithm_for_Signature_Based_Anti-Virus_Applications/file/d912f50bd440de76b0.pdf

http://jason.spashett.com/av/index.htm

Aho-Corasick-like algorithm for use in antivirus code

Below is my old answer. Its still relevant for the simple detection of malware, such as worms, that simply make copies of themselves:

I will just write some of my thoughts on how AVs can work. I do not know for sure. If someone considers incorrect information, please let me know.

There are many ways to detect AV threats of possible threats. One way is based on the discovery signature.

A signature is just a unique fingerprint of a file (it's just a sequence of bytes). As for computer science, it can be called a hash . One hash can take about 4/8/16 bytes. Assuming a size of 4 bytes (e.g. CRC32), about 67 million signatures can be stored in 256 MB .

All these hashes can be stored in the signature database. This database can be implemented using a balanced tree structure, so insert, delete and search operations can be performed in O(logn) time, which is quite fast even with large values โ€‹โ€‹of n (n is the number of records). Or, if a lot of memory is available, you can use a hash table that gives O(1) insert, delete and search. It can be faster when n grows larger and a good hashing method is used.

So, what does the antivirus do, roughly speaking, it calculates the hash of the file or only its critical sections (where malicious injections are possible) and looks for its signature database in it. As explained above, the search is very fast, which allows you to scan a huge number of files in a short period of time. If it is found, the file is classified as malicious.

Similarly, the database can be quickly updated, since insertion and deletion are also fast.

You can read these pages to get a deeper understanding.

What happens faster, hash search or binary search?

https://security.stackexchange.com/questions/379/what-are-rainbow-tables-and-how-are-they-used

+4
source

Many signatures are tied to a specific offset or specific section in the binary structure of the file. You can skip parts of the binary that contain sections of data with displayed strings, initialization data for internal structures, etc.

Many modern worms are stand-alone files, for which a whole file signature (SHA1 hash or similar) is sufficient.

In the general question of how to scan a large number of templates in a file, it is best to answer with a pointer to the Aho-Corasick algorithm .

+1
source

I do not know how practical AV works. but I think the question has some relative looking for words in a long text with this dictionary.

For the above question, data structures such as TRIE will do this very quickly. the processing of the text word Length = N of the word K takes only the time O (N).

0
source

All Articles