Uncertainty Quantification for Android Malware Detectors

This repository contains the code for our ACSAC 2021 paper (to appear), "Can We Leverage Predictive Uncertainty to Detect Dataset Shift and Adversarial Examples in Android Malware Detection?"

Overview

Our aim is to explore uncertainty quantification as a means of hardening malware detectors in realistic environments (i.e., where natural adversaries exist). This approach has rarely been investigated in the context of malware detection, where the properties of dataset shift differ from those in other domains (e.g., images). We are therefore motivated to evaluate the quality of the predictive uncertainties inherent in malware detectors under dataset shift. Specifically, we consider 4 Android malware detectors (DeepDrebin, MultimodalNN, DeepDroid and Droidetec) and 6 calibration methods (Vanilla, Temperature scaling, Monte Carlo dropout, Variational Bayesian inference, Deep Ensemble and Weighted deep ensemble). The dataset shift is specified as out-of-source, temporal covariate shift, or adversarial evasion attacks.
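To make "predictive uncertainty" concrete, here is a minimal sketch of one common uncertainty score: the entropy of the averaged malware probability produced by several ensemble members (or several Monte Carlo dropout passes). It is an illustration only, not the code in this repository; the function names and the example probabilities are hypothetical.

```python
import math

def binary_entropy(p, eps=1e-12):
    """Entropy (in bits) of a Bernoulli prediction with malware probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def ensemble_uncertainty(member_probs):
    """Average the members' malware probabilities and score the mean's entropy."""
    mean_p = sum(member_probs) / len(member_probs)
    return mean_p, binary_entropy(mean_p)

# Hypothetical outputs of five ensemble members for two apps.
agreeing = [0.97, 0.95, 0.99, 0.96, 0.98]
disagreeing = [0.10, 0.90, 0.45, 0.70, 0.35]

for name, probs in [("agreeing", agreeing), ("disagreeing", disagreeing)]:
    mean_p, h = ensemble_uncertainty(probs)
    print(f"{name}: mean p(malware)={mean_p:.2f}, entropy={h:.3f} bits")
```

When the members agree, the entropy stays close to 0. When they disagree, it approaches 1 bit, which is the kind of signal one would hope to see on shifted or adversarial inputs.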

Dependencies:

We developed the code on Windows and ran it on Ubuntu 18.04. The code depends on Python 3.6. Other packages (e.g., TensorFlow) are listed in ./requirements.txt.

Configuration & Usage

1. Datasets

Three datasets are used: Drebin, VirusShare_Android_APK_2013 and Androzoo. Note that, for security reasons, the Android applications must be obtained from each dataset's provider in accordance with its own policy.

For Drebin, the malicious APKs can be downloaded from the official website. We provide the sha256 hashes of a portion of the Drebin benign APKs; the corresponding APKs can be downloaded from Androzoo.

For Androzoo, we use the dataset built by Pendlebury et al. All APKs can be downloaded from Androzoo.

For Virusshare, we use the file named VirusShare_Android_APK_2013.zip.

For adversarial APKs, we rely on this repository.

We additionally provide the preprocessed data files, which are available at an anonymous URL (the unzipped folder is ~213GB).

2. Configure

For convenience, we provide a conf (Windows) / conf-server (Ubuntu) file to assist with customization (please pick one and rename it to config; do not use both). Before running, change the following:

Set project_root=/absolute/path/to/malware-uncertainty/.
Set database_dir=/absolute/path/to/datasets. Optionally, this directory holds the Drebin or Androzoo malware datasets, with the following structure:

datasets
|---drebin
      |---malicious_samples  % malicious apps folder
      |---benign_samples     % benign apps folder
|---androzoo_tesseract
      |---malicious_samples
      |---benign_samples
      |   date_stamp.json    % date stamp for each app, we will provide
|---VirusShare_Android_APK_2013
      |---malicious_samples
      |---benign_samples
|---naive_data               % saving the preprocessed data files 
...
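
Before launching feature extraction, it can help to confirm that database_dir matches the layout above. The following is a minimal sketch of such a check; the helper name missing_dirs is hypothetical and not part of this repository, and the folder names are copied from the tree.

```python
import os
import tempfile

# Sub-folders taken from the directory tree above; date_stamp.json and
# naive_data are left out because they are provided or generated separately.
EXPECTED = {
    "drebin": ["malicious_samples", "benign_samples"],
    "androzoo_tesseract": ["malicious_samples", "benign_samples"],
    "VirusShare_Android_APK_2013": ["malicious_samples", "benign_samples"],
}

def missing_dirs(database_dir, expected=EXPECTED):
    """Return the expected sub-directories that do not exist under database_dir."""
    missing = []
    for dataset, subdirs in expected.items():
        for sub in subdirs:
            path = os.path.join(database_dir, dataset, sub)
            if not os.path.isdir(path):
                missing.append(os.path.join(dataset, sub))
    return missing

if __name__ == "__main__":
    # Demo on a throwaway directory that only contains the Drebin folders.
    with tempfile.TemporaryDirectory() as root:
        for sub in EXPECTED["drebin"]:
            os.makedirs(os.path.join(root, "drebin", sub))
        for rel in missing_dirs(root):
            print("missing:", rel)
```

In practice one would call missing_dirs with the same path that is set as database_dir in the config file. An empty list means every expected folder is present.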

If no real apps are used, the preprocessed data files are enough to make the project work. In that case, continue the configuration as follows:

Download the datasets folder from the anonymous URL and put it in the project root directory, namely malware-uncertainty. Please note that this datasets folder is not necessarily the same as the database_dir directory set in the second step.
Download naive_data from the anonymous URL and put it in the database_dir directory configured in the second step (it needs to be unpacked: mv naive_data.tar.gz database_dir; cd database_dir; tar -xvzf naive_data.tar.gz).
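
On platforms without mv and tar (e.g., the Windows setup), the unpacking step can be done with Python's standard tarfile module. This is a minimal sketch, assuming the archive is a gzip-compressed tar as the file name suggests; extract_archive is a hypothetical helper, and the demo builds a tiny stand-in archive rather than using the real data.

```python
import os
import tarfile
import tempfile

def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return the top-level names."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Only unpack archives obtained from a source you trust.
        tar.extractall(dest_dir)
    return sorted(os.listdir(dest_dir))

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Build a small stand-in for naive_data.tar.gz.
        src = os.path.join(tmp, "naive_data")
        os.makedirs(src)
        with open(os.path.join(src, "placeholder.txt"), "w") as f:
            f.write("demo")
        archive = os.path.join(tmp, "naive_data.tar.gz")
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(src, arcname="naive_data")
        # Stand-in for the configured database_dir.
        print(extract_archive(archive, os.path.join(tmp, "database_dir")))
```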

3. Usage

We suggest creating a conda environment to run the code. The following instructions may be helpful:

Create a new environment: conda create -n mal-uncertainty python=3.6
Activate the environment and install dependencies: conda activate mal-uncertainty and pip install -r requirements.txt
Next step:

For training, all scripts are listed in ./run.sh.
To produce the figures and table data, use the Python script ./experiments/table-figures.py (this part is not implemented for the malware detector Droidetec).

Warning

Feature extraction on Android applications is usually time-consuming.
Two detectors (DeepDroid and Droidetec) are both RAM- and computation-intensive because they use very long sequences to improve detection accuracy.

License && Issues

We will make our code publicly available under a formal license. For now, this is still ongoing work, and we plan to report more results in future work. Note that we found two issues when checking our code:

No random seed is set, so the results in the paper cannot be reproduced exactly; nevertheless, similar results can be achieved.
The training, validation, and test datasets are split randomly, so results vary from run to run.
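
For readers who want repeatable splits, both issues above can be worked around by drawing the split from a seeded generator. The sketch below is not what the repository currently does; the function name, the seed and the 60/20/20 ratios are assumptions chosen for illustration. Model training would additionally need the deep learning framework's own seed to be fixed.

```python
import random

def split_dataset(items, seed, val_ratio=0.2, test_ratio=0.2):
    """Shuffle items with a fixed seed and split into train/validation/test lists."""
    items = sorted(items)          # fix the starting order regardless of input order
    rng = random.Random(seed)      # private generator, does not touch global state
    rng.shuffle(items)
    n_test = int(len(items) * test_ratio)
    n_val = int(len(items) * val_ratio)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

# Hypothetical sample identifiers standing in for APK hashes.
samples = [f"app_{i:03d}" for i in range(10)]
train, val, test = split_dataset(samples, seed=0)
print(len(train), len(val), len(test))
```

Calling split_dataset again with the same seed returns the same three lists, so downstream numbers can be compared across runs.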

Contact

If you have any questions, please do not hesitate to contact us (Shouhuai Xu, email: sxu@uccs.edu; Deqiang Li, email: lideqiang@njust.edu.cn).
