How To Build Your Own: Python String Analysis for Malware Insights

When we analyze malware, understanding its inner workings is crucial. Static analysis serves as our initial reconnaissance, allowing us to dissect a file and assess its potential threats without executing it. While tools like the UNIX strings command have long been staples in this domain, they sometimes lack the flexibility and depth required for today’s complex malware strains. This is where Python comes into play: a language that not only amplifies our string analysis capabilities but also makes the tooling easy to customize.
By crafting our own tool, we embark on a learning journey, familiarizing ourselves with the nuances of automation and ensuring that our analysis is thorough and tailored to our specific needs.

The Power of String Analysis

String analysis is a cornerstone in the world of static malware analysis. At its core, it’s about extracting readable sequences of characters from binary files.
Why? Because these strings can reveal a lot about a file’s intent and functionality. They might show URLs, file paths, error messages, or even suspicious commands. As a first step in static analysis, string analysis can quickly give us insights without even executing the malware.
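
To make the idea concrete, here is a minimal sketch of what "extracting readable sequences" means at the byte level. The function name printable_runs and the sample bytes are our own illustration, not part of the tool built later; the sketch simply walks the data and keeps runs of printable ASCII bytes of a minimum length.

```python
def printable_runs(data, min_length=4):
    """Collect runs of printable ASCII bytes (0x20 through 0x7E)."""
    runs, current = [], bytearray()
    for byte in data:
        if 0x20 <= byte <= 0x7E:
            current.append(byte)
        else:
            if len(current) >= min_length:
                runs.append(current.decode("ascii"))
            current.clear()
    if len(current) >= min_length:  # flush a run that reaches the end of the data
        runs.append(current.decode("ascii"))
    return runs


# Hypothetical sample: readable strings embedded in binary noise.
sample = b"\x00\x01http://example.test/a.exe\x00\xff\x90cmd.exe /c whoami\x00\x02ok\x00"

for s in printable_runs(sample):
    print(s)
```

The two-character run "ok" is dropped because it is shorter than the minimum length, which is the same filtering idea the strings utility applies by default.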

Now, UNIX users might be familiar with the strings tool. It’s a handy command-line utility that extracts sequences of printable characters from any given file. While it’s great for a quick look, it has its limitations. It’s not very customizable, and sometimes we need more than the basics, such as filtering for specific patterns, which would otherwise mean chaining extra tools or studying the utility in depth.

That’s where our journey begins. We’re going to craft our own string analysis tool using Python. Not just as a replica of strings, but with added features tailored to our needs. Building this tool will not only enhance our malware analysis capabilities but also deepen our understanding of automation in cybersecurity.

Let’s roll up our sleeves and get started!

Crafting a Python String Analysis Tool for Malware Inspection

When we dive into malware forensics, the strings we uncover within a binary file are like the DNA of the software. They hold clues to the file’s purpose and operations. Extracting these strings is a foundational practice known as string analysis.
In this part of our guide, we’ll take you through the process of creating a Python-based string analysis tool that adds a layer of customization for the modern cybersecurity enthusiast.

Malware authors are crafty, always scheming to obstruct our analysis. While we can’t promise a silver bullet, the simplicity and effectiveness of our Python script make it a valuable arrow in our quiver. This tool is particularly useful after obfuscated malware has been unpacked, when the strings it was hiding become readable.

By the end of this walkthrough, you’ll be able to interpret binary files, extract meaningful strings, and apply regular expressions to identify patterns of interest.

Enhancing configurability is a priority. We’ve designed the script to work with a JSON configuration file, allowing you to define and manage patterns you expect to encounter. Here’s what the tool promises by the time we’re done:

With a simple terminal command:

python strings_analyzer.py <filename> <pattern1,pattern2,...>

You can analyze any binary file – regardless of whether it’s for Windows, Linux, or another system. It’s purely static analysis, without execution risk.

The <pattern1,pattern2,...> argument is a comma-separated list of pattern names, each of which is defined in a JSON file named “patterns.json”. The file looks something like this:

{
    "url": "\\b(?:http|https|ftp):\\/\\/[a-zA-Z0-9-._~:/?#[\\]@!$&'()*+,;=%]+",
    "ipv4": "\\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\b",
    "ipv6": "\\b(?:[A-Fa-f0-9]{1,4}:){7}[A-Fa-f0-9]{1,4}\\b|\\b(?:[A-Fa-f0-9]{1,4}:){1,7}:\\b|\\b:[A-Fa-f0-9]{1,4}(?::[A-Fa-f0-9]{1,4}){1,6}\\b",
    "mac": "\\b(?:[0-9A-Fa-f]{2}[:-]){5}(?:[0-9A-Fa-f]{2})\\b",
    "email": "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
}

With this configuration, our script can identify URLs, IPv4 and IPv6 addresses, MAC addresses, and email addresses.
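
Before wiring the patterns into the tool, it can help to try them on their own. The following is a small sketch that inlines three of the entries above (so it runs without a patterns.json file on disk) and applies them to a made-up line of text; the sample text and addresses are purely illustrative.

```python
import json
import re

# Three of the entries from patterns.json, inlined so the example is self-contained.
config_text = r'''
{
    "ipv4": "\\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\b",
    "mac": "\\b(?:[0-9A-Fa-f]{2}[:-]){5}(?:[0-9A-Fa-f]{2})\\b",
    "email": "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
}
'''
patterns = json.loads(config_text)

# Hypothetical text, standing in for strings pulled out of a binary.
text = "connect 192.168.1.10 from 00:1A:2B:3C:4D:5E, report to admin@example.test"

for name, regex in patterns.items():
    print(name, re.findall(regex, text))
```

Note the doubled backslashes: JSON needs `\\b` on disk so that the regex engine receives `\b` after parsing.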

Impressive, isn’t it?

Before we dive into the technicalities, all the code and a sample file to test the script are available on our GitHub repository.

Let’s embark on this scripting adventure.

Unveiling the `find_strings` Function: A Deep Dive

Peering into the heart of our string analysis tool, we encounter the find_strings function – the crux of our Python script. It’s designed with simplicity and efficiency in mind, but don’t let that fool you. It’s where the magic happens, transforming a maze of binary data into a clear text revelation.

Let’s unfold the function’s blueprint:

import re

def find_strings(filename, patterns, min_length=4):
    # Open the file in binary mode and read its content
    with open(filename, 'rb') as f:
        content = f.read().decode('ascii', 'ignore')  # Discard non-ASCII bytes during decoding

    results = []  # Prepare a list to hold our findings

    # Craft a regex pattern for ASCII strings of at least 'min_length' characters
    ascii_regex = re.compile(r'[ -~]{' + str(min_length) + r',}')

    # Iterate through each pattern we're searching for
    for pattern_name, pattern_regex in patterns.items():
        # If we want all readable ASCII strings, use the general regex
        if pattern_name == 'all':
            matches = ascii_regex.findall(content)
        else:
            # Otherwise, use the specific regex pattern from our JSON config
            matches = re.findall(pattern_regex, content)
        # Add all matches to our results list
        for match in matches:
            results.append(match)

    return results  # Finally, return the list of strings found

Step-by-Step Breakdown:

Open and Read: We open the file in read-only binary mode, then decode its content as ASCII, discarding any byte that is not ASCII.
Prepare for Results: We initiate an empty list to store the strings we discover.
Crafting the Regex: An ASCII regex is prepared, targeting sequences of printable characters that meet our minimum length requirement.
Pattern Matching: We loop through each pattern defined in our JSON configuration. If the pattern is ‘all’, we search for any ASCII string. Otherwise, we search for specific patterns like URLs or IP addresses.
Gathering Strings: Every match found is appended to our results list, building up a collection of strings.
The Reveal: The function culminates by returning the list of strings, providing a clear picture of the hidden text within the binary file.
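
One subtlety of the decoding step is worth a closer look. Because bytes dropped by 'ignore' simply vanish, two short printable runs separated only by non-ASCII bytes end up joined into one longer string. The sketch below contrasts that behaviour with matching directly on the raw bytes. The function find_strings_bytes is a hypothetical variant for comparison, not part of the script above.

```python
import re

def find_strings_bytes(data, min_length=4):
    """Hypothetical variant: match printable ASCII runs directly on raw bytes."""
    run = re.compile(rb"[ -~]{%d,}" % min_length)
    return [m.decode("ascii") for m in run.findall(data)]


# Two 3-character runs separated by a single non-ASCII byte (0xFF).
data = b"\x00abc\xffdef\x00"

# Approach used in find_strings: non-ASCII bytes are dropped before matching,
# so the two runs are glued together and pass the length check.
decoded = data.decode("ascii", "ignore")
print(re.findall(r"[ -~]{4,}", decoded))

# Byte-level matching keeps the runs separate; neither reaches min_length.
print(find_strings_bytes(data))
```

Whether the gluing matters depends on the sample. For a quick triage it is usually harmless, but it can produce strings that never existed contiguously in the file.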

The Main Function

When we talk about the entry point of a Python script, we’re referring to the place where the execution begins – it’s the grand kickoff. In our script, this is encapsulated within the if __name__ == "__main__": block. This is where we orchestrate the flow of our string analysis tool, ensuring it responds to user input and behaves as intended. Here’s how we’ve constructed this pivotal part of our program:

import json
import sys

# The sentinel of a Python script: where execution begins
if __name__ == "__main__":
    # Check for the correct number of command line arguments
    if len(sys.argv) < 3:
        print("Usage: python strings_analyzer.py <filename> <pattern1,pattern2,...>")
        print("Use 'all' for pattern to get all strings.")
        sys.exit(1)  # Exit if not enough arguments are passed

    filename = sys.argv[1]  # The binary file to analyze
    chosen_patterns = sys.argv[2].split(',')  # The patterns to look for, split into a list

    # Load the pattern definitions from a JSON file
    with open("patterns.json", "r") as f:
        all_patterns = json.load(f)

    # Prepare the pattern dictionary based on user-selected patterns
    if 'all' in chosen_patterns:
        patterns = {'all': None}
    else:
        # Only include patterns that are present in the config file and chosen by the user
        patterns = {k: all_patterns[k] for k in chosen_patterns if k in all_patterns}

    # Call the find_strings function and print each found string
    for s in find_strings(filename, patterns):
        print(s)

Step by Step Run-Through:

Command Line Validation: The script begins by ensuring that the user has provided the necessary command-line arguments. If not, it offers guidance on the correct usage and exits to prevent further missteps.
Argument Assignment: It then extracts the filename and the patterns the user wants to analyze from the command-line arguments.
Pattern Loading: The script loads the regular expression patterns from a JSON configuration file, where our search criteria are defined.
Pattern Selection: It constructs a dictionary of patterns to be used, either selecting ‘all’ for a broad search or narrowing down based on user input.
Execution and Output: Finally, it executes the find_strings function with the provided arguments and prints out each string it finds.
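
One detail of the pattern selection step: a name that does not exist in the configuration (a typo, for instance) is silently skipped, so the script may print nothing without explaining why. A possible refinement, sketched here with a hypothetical helper called select_patterns and a shortened stand-in for the loaded configuration, is to report the names that were not recognized.

```python
import sys

def select_patterns(chosen, all_patterns):
    """Hypothetical helper: return the usable patterns plus any unknown names."""
    if "all" in chosen:
        return {"all": None}, []
    known = {name: all_patterns[name] for name in chosen if name in all_patterns}
    unknown = [name for name in chosen if name not in all_patterns]
    return known, unknown


# Stand-in for the dictionary loaded from patterns.json (regexes shortened for brevity).
all_patterns = {"ipv4": r"\d+\.\d+\.\d+\.\d+", "email": r"\S+@\S+"}

# "emial" is a deliberate typo to show the warning path.
patterns, unknown = select_patterns("ipv4,emial".split(","), all_patterns)
if unknown:
    print("Unknown pattern(s) ignored:", ", ".join(unknown), file=sys.stderr)
print(sorted(patterns))
```

The warning goes to standard error so that the extracted strings on standard output can still be piped into other tools cleanly.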

With this setup, we maintain a user-friendly and efficient entry point for our string analysis tool.

Now, let’s stitch together the full code that combines both the find_strings function and the main execution logic:

import sys
import re
import json

def find_strings(filename, patterns, min_length=4):
    with open(filename, 'rb') as f:
        content = f.read().decode('ascii', 'ignore')
    results = []
    ascii_regex = re.compile(r'[ -~]{' + str(min_length) + r',}')
    for pattern_name, pattern_regex in patterns.items():
        if pattern_name == 'all':
            matches = ascii_regex.findall(content)
        else:
            matches = re.findall(pattern_regex, content)
        for match in matches:
            results.append(match)
    return results

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: python strings_analyzer.py <filename> <pattern1,pattern2,...>")
        print("Use 'all' for pattern to get all strings.")
        sys.exit(1)
    filename = sys.argv[1]
    chosen_patterns = sys.argv[2].split(',')
    with open("patterns.json", "r") as f:
        all_patterns = json.load(f)
    if 'all' in chosen_patterns:
        patterns = {'all': None}
    else:
        patterns = {k: all_patterns[k] for k in chosen_patterns if k in all_patterns}
    for s in find_strings(filename, patterns):
        print(s)

This code snippet is a complete package, ready to be deployed by any cybersecurity enthusiast or professional looking to refine their string analysis practices using Python.

Comparison with the Strings Tool

Our Python String Analyzer offers a level of control and customization that the venerable UNIX ‘strings’ tool, despite its long-standing development and efficiency, doesn’t provide out of the box. While ‘strings’ performs admirably at its intended function, our Python alternative shines when it comes to tailoring its behaviour to fit specific investigative scenarios.

Yes, the trade-off might be a dip in raw performance when compared with the highly optimized ‘strings’, but what we gain is flexibility — the ability to adjust, adapt, and evolve our tool as our needs change.

Moreover, crafting this Python tool isn’t just about building something practical; it’s a learning expedition. Every line of code you write or modify is an opportunity to deepen your understanding of malware and its indicators, as well as to enhance your coding skills.

Running the tool with the “all” pattern (python strings_analyzer.py <filename> all) produces output very similar to that of “strings”.

That’s a good start. To narrow the output to email and IPv4 addresses only, pass those pattern names instead: python strings_analyzer.py <filename> email,ipv4

By now, the power of the tool, and how easy it is to customize for your needs, should be clear.
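
As one illustration of that customization, here is a sketch that adds a pattern of our own. The "md5" entry is hypothetical (it is not in the patterns.json shown earlier) and simply matches 32 hexadecimal digits, the shape of an MD5 digest. The example carries a condensed copy of find_strings and writes a small fake binary to a temporary file so that it runs on its own.

```python
import os
import re
import tempfile

def find_strings(filename, patterns, min_length=4):
    # Condensed copy of the article's function so this example is self-contained.
    with open(filename, "rb") as f:
        content = f.read().decode("ascii", "ignore")
    ascii_regex = re.compile(r"[ -~]{" + str(min_length) + r",}")
    results = []
    for name, regex in patterns.items():
        if name == "all":
            results.extend(ascii_regex.findall(content))
        else:
            results.extend(re.findall(regex, content))
    return results


# Hypothetical extra pattern: 32 hex digits, the shape of an MD5 digest.
custom = {"md5": r"\b[a-fA-F0-9]{32}\b"}

# Write a small fake "binary" to a temporary file to scan.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00\x01key=d41d8cd98f00b204e9800998ecf8427e\x00\x02noise\x00")
path = tmp.name

try:
    hits = find_strings(path, custom)
finally:
    os.remove(path)

print(hits)
```

In the real tool, the same effect is achieved by adding the "md5" entry to patterns.json and passing md5 on the command line. A pattern this loose will also match any other 32-digit hex value, so treat hits as leads rather than confirmed hashes.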

So have fun with that!

Conclusion

As we wrap up our exploration into Python’s capabilities for string analysis, remember that this is just the beginning. Our journey through the binary wilderness has shown that with the right tools and knowledge, even the most complex malware can be decoded and understood.

We invite you to continue this journey with us. By following our blog and engaging with us on our social media accounts, you’ll stay at the forefront of cybersecurity analysis, equipped with the latest insights and tools to keep you one step ahead of the curve. Every new article, and every shared experience, enhances our collective knowledge and defence strategies.

Looking ahead, there are boundless possibilities for enhancing our Python string analysis tool:

Beyond ASCII: Broadening our horizons to include a spectrum of encodings will make our tool more versatile and powerful.
Pattern Expansion: The digital world is ever-evolving, and so should our pattern library. By incorporating more regex patterns, we can detect a wider array of threats.
Chunking for Large Binaries: Dealing with massive files? Chunking can help. By processing pieces of large binaries, we can manage memory more efficiently and maintain performance.
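
Two of these ideas can be sketched briefly. The code below is a starting point under stated assumptions, not a finished feature: find_wide_strings and iter_strings_chunked are hypothetical names, the first handles only ASCII text stored as UTF-16LE (a common encoding for strings in Windows binaries), and the second reads a stream in fixed-size chunks while carrying over any run that touches a chunk boundary.

```python
import io
import re

def find_wide_strings(data, min_length=4):
    """Find ASCII text stored as UTF-16LE (each character followed by a zero byte)."""
    wide = re.compile(rb"(?:[ -~]\x00){%d,}" % min_length)
    return [m.decode("utf-16-le") for m in wide.findall(data)]


def iter_strings_chunked(stream, min_length=4, chunk_size=1 << 20):
    """Yield printable ASCII runs from a binary stream without loading it whole.

    A run that touches the end of a chunk is carried over to the next read,
    so strings that straddle a chunk boundary are not split.
    """
    run = re.compile(rb"[ -~]+")
    carry = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = carry + chunk
        carry = b""
        for m in run.finditer(data):
            if m.end() == len(data):
                carry = m.group()  # may continue in the next chunk
            elif len(m.group()) >= min_length:
                yield m.group().decode("ascii")
    if len(carry) >= min_length:  # flush the final run at end of stream
        yield carry.decode("ascii")


# Hypothetical data: an in-memory stream stands in for a large file on disk.
blob = b"\x00hello\x00\xffworld!!\x01ab"
print(list(iter_strings_chunked(io.BytesIO(blob), chunk_size=4)))

wide_blob = b"\x01\x02" + "cmd.exe".encode("utf-16-le") + b"\xff\xff"
print(find_wide_strings(wide_blob))
```

The tiny chunk_size in the demo is only there to force boundary crossings; a real run would use the default or larger. One caveat of this carry-over design is that a file consisting almost entirely of printable text keeps growing the carried run, so memory savings apply mainly to genuinely binary inputs.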

Your feedback and suggestions fuel our progress. If you’ve got ideas for new features, insights on string analysis, or requests for future articles, don’t hesitate to reach out. Together, we’ll continue to refine our tools and techniques.

Don’t miss out on the cutting edge of cybersecurity—follow us, join the discussion, and become a part of our growing community. Your expertise could be the key to the next big breakthrough in string analysis.

Stay curious, stay informed, and let’s push the boundaries of what’s possible with Python and cybersecurity.
