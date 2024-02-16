



Today, we're open sourcing Magika, Google's AI-powered file type identification system, to help others accurately detect binary and text file types. Under the hood, Magika employs highly optimized custom deep learning models that enable accurate file identification within milliseconds, even when running on a CPU.

The Magika command line tool identifying the types of a diverse set of files

You can try out the Magika web demo today, or install it as a Python library and standalone command line tool (output shown above) by running the standard command pip install magika.

Why is it difficult to identify file types?

Since the early days of computing, accurately detecting a file's type has been crucial in determining how to process the file. Linux includes libmagic and the file utility, which have served as the de facto standard for file type identification for over 50 years. Today, web browsers, code editors, and countless other programs rely on file type detection to decide how to properly render a file. For example, modern code editors use file type detection to choose which syntax coloring scheme to use when a developer starts typing in a new file.

Accurately detecting file formats is a notoriously difficult problem, because each file format has a different structure or no structure at all. It is particularly challenging for text formats and programming languages, whose constructs are very similar to one another. Historically, libmagic and most other file type identification software have relied on a collection of hand-written heuristics and custom rules to detect each file type.

This manual approach is both time-consuming and error-prone, because it is hard for humans to write generalized rules by hand. It is especially difficult in security applications, where attackers constantly craft payloads designed to confuse detection.
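To make the limits of hand-written rules concrete, here is a minimal sketch of a rule-based detector. It is not libmagic's actual rule set: the rule table, the labels, and the function name detect_by_rules are illustrative assumptions. Binary formats with a fixed signature are easy to match, while text formats force the rules to guess from keywords.

```python
# Hypothetical hand-written rules, in the spirit of magic-number detection.
# Each entry maps a fixed byte prefix to an illustrative label.
MAGIC_RULES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"PK\x03\x04", "zip"),
    (b"\x7fELF", "elf"),
]


def detect_by_rules(data: bytes) -> str:
    """Return a label using fixed signatures first, then text heuristics."""
    for prefix, label in MAGIC_RULES:
        if data.startswith(prefix):
            return label

    # Text formats carry no signature, so the rules fall back to
    # keyword guesses that are easy to fool.
    text = data[:512].decode("utf-8", errors="replace").lstrip()
    if text.startswith("#!"):
        return "script"
    if text.startswith(("{", "[")):
        return "json"
    if "def " in text or "import " in text:
        return "python"
    return "unknown"


# A binary signature is matched reliably.
pdf_label = detect_by_rules(b"%PDF-1.7\n")
# A JavaScript module is mistaken for Python because both use "import".
js_label = detect_by_rules(b"import fs from 'fs';\n")
```

In this sketch the PDF is labeled correctly, but the JavaScript snippet is labeled as Python. Every such collision needs yet another hand-written rule, which is the maintenance burden described above.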

To address this issue and provide fast and accurate file type detection, we researched and developed Magika, a new AI-powered file type detector. Under the hood, Magika uses a custom, highly optimized deep learning model, designed and trained using Keras, that weighs only about 1MB. At inference time, Magika uses ONNX as its inference engine to ensure that files are identified within milliseconds, almost as fast as non-AI tools, even on a CPU.

Magika performance: Magika's detection quality compared to other tools on the 1M file benchmark

Thanks to its AI model and large training dataset, Magika outperforms other existing tools by around 20% when evaluated on a benchmark of 1 million files covering over 100 file types. As shown in the table below, when we break the results down by file type, we see even greater gains on text files, such as code and configuration files, which other tools can struggle with.

Performance of various file type identification tools on a selection of file types included in the benchmark. 'n/a' indicates that the tool does not detect the given file type.

Magika at Google

Internally, Magika is used at scale to help improve the safety of Google users by routing Gmail, Drive, and Safe Browsing files to the appropriate security and content policy scanners. Looking at a weekly average of hundreds of billions of files, we see that Magika improves file type identification accuracy by 50% compared to our previous system, which relied on handcrafted rules. In particular, this increase in accuracy allows us to scan 11% more files with our specialized malicious document AI scanners and reduces the number of unidentified files to 3%.
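The routing idea can be illustrated with a small sketch. This is not Google's pipeline: the scanner names, the routing table, and the helper functions below are hypothetical and only show why a more accurate label sends more files to a specialized scanner instead of a generic fallback.

```python
from typing import Dict, List

# Hypothetical mapping from detected file type to specialized scanners.
ROUTES: Dict[str, List[str]] = {
    "pdf": ["document_scanner"],
    "docx": ["document_scanner"],
    "javascript": ["script_scanner"],
    "elf": ["binary_scanner"],
}

# Files with an unknown or unsupported type only get a generic scan.
FALLBACK: List[str] = ["generic_scanner"]


def route(label: str) -> List[str]:
    """Return the scanners that should process a file with this label."""
    return ROUTES.get(label, FALLBACK)


def specialized_share(labels: List[str]) -> float:
    """Fraction of files that reach at least one specialized scanner."""
    if not labels:
        return 0.0
    specialized = sum(1 for label in labels if label in ROUTES)
    return specialized / len(labels)


# The same four files, labeled by a weaker and a stronger detector.
weak_labels = ["pdf", "unknown", "unknown", "elf"]
strong_labels = ["pdf", "docx", "javascript", "elf"]
```

With these made-up labels, the weaker detector routes half of the files to a specialized scanner, while the stronger one routes all of them. The numbers are purely illustrative and are not the figures reported above.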

The upcoming integration of Magika with VirusTotal will complement the platform's existing Code Insight functionality, which leverages Google's generative AI to analyze and detect malicious code. Magika will act as a pre-filter before files are analyzed by Code Insight, improving the platform's efficiency and accuracy. Due to the collaborative nature of VirusTotal, this integration directly contributes to the global cybersecurity ecosystem and fosters a safer digital environment.

Open sourcing Magika

By open sourcing Magika, we aim to help other software improve its file identification accuracy and to offer researchers a reliable method for identifying file types at scale.

Magika's code and models are freely available on GitHub starting today under the Apache 2 license. Magika can also be installed quickly as a standalone utility and Python library via the PyPI package manager, without requiring a GPU, by simply typing pip install magika. There is also an npm package if you want to use the TFJS version.

For more information on how to use it, please visit the Magika documentation site.
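Once the package is installed, calling it from Python might look like the sketch below. The Magika class and its identify_bytes method come from the published package, but the name of the attribute holding the label has differed between releases, so the helper (identify_label, a name chosen here for illustration) tries both label and ct_label and returns None if the package is missing. Treat it as a sketch under those assumptions and check the documentation for the version you install.

```python
from typing import Optional


def identify_label(data: bytes) -> Optional[str]:
    """Return Magika's label for the given bytes, or None if unavailable."""
    try:
        from magika import Magika
    except ImportError:
        # Package not installed; run "pip install magika" first.
        return None

    result = Magika().identify_bytes(data)
    output = result.output

    # Assumption: the label attribute is "label" in newer releases
    # and "ct_label" in older ones, so try both.
    label = getattr(output, "label", None)
    if label is None:
        label = getattr(output, "ct_label", None)
    return None if label is None else str(label)


detected = identify_label(b"# Example\nThis is an example of markdown!")
print(detected)
```

Creating the Magika object loads the model, so a long-running program would normally create it once and reuse it rather than rebuilding it on every call as this helper does.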

Acknowledgements

Magika would not have been possible without the help of many people, including Ange Albertini, Rua Farah, François Galilea, Giancarlo Metitieri, Luca Invernizzi, Young Men, Alex Petit Bianco, David Tao, Kurt Thomas, Amanda Walker, and Jishun Tan.

Elie Bursztein – Cybersecurity AI Technical and Research Lead, Yanick Fratantonio – Cybersecurity Research Scientist

