deep learning, malware detection, predictive uncertainty, dataset shift, calibration, uncertainty quantification, android malware

deqangss/malware-uncertainty

Uncertainty Quantification for Android Malware Detectors

This code repository accompanies our ACSAC 2021 paper (to appear), entitled "Can We Leverage Predictive Uncertainty to Detect Dataset Shift and Adversarial Examples in Android Malware Detection?".

Overview

Our aim is to explore uncertainty quantification as a way to harden malware detectors in realistic environments (i.e., where natural adversaries exist). This approach has rarely been investigated in the context of malware detection, where the properties of dataset shift differ from those in other domains (e.g., images). We are therefore motivated to evaluate the quality of the predictive uncertainties inherent in malware detectors under dataset shift. Specifically, we consider 4 Android malware detectors (DeepDrebin, MultimodalNN, DeepDroid and Droidetec) and 6 calibration methods (Vanilla, Temperature scaling, Monte Carlo dropout, Variational Bayesian inference, Deep ensemble and Weighted deep ensemble). The dataset shift is specified as out of source, temporal covariate shift or adversarial evasion attacks.
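To illustrate what "predictive uncertainty" can mean for a binary malware detector, here is a minimal sketch that averages the outputs of several ensemble members and scores the result with predictive entropy. This is only an illustration under stated assumptions: the function names and the example probabilities are hypothetical, and the code is not taken from this repository.

```python
import math
from typing import Sequence


def ensemble_mean(probs: Sequence[float]) -> float:
    """Average the malware probabilities predicted by each ensemble member."""
    if not probs:
        raise ValueError("need at least one member prediction")
    return sum(probs) / len(probs)


def binary_entropy(p: float, eps: float = 1e-12) -> float:
    """Entropy (in nats) of a Bernoulli(p) prediction.

    Values near 0 mean a confident prediction; ln(2) is the maximum.
    """
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


# Hypothetical outputs of five ensemble members for two apps (made-up numbers).
agreeing = [0.97, 0.99, 0.98, 0.96, 0.99]     # members agree
disagreeing = [0.95, 0.10, 0.60, 0.30, 0.85]  # members disagree

for name, member_probs in [("agreeing", agreeing), ("disagreeing", disagreeing)]:
    p = ensemble_mean(member_probs)
    print(f"{name}: mean p(malware)={p:.3f}, entropy={binary_entropy(p):.3f} nats")
```

An app on which the members disagree gets a mean probability closer to 0.5 and therefore a higher entropy, which is the kind of signal one might use to flag samples affected by dataset shift.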

Dependencies:

We developed the code on the Windows operating system and ran it on Ubuntu 18.04. The code depends on Python 3.6. Other packages (e.g., TensorFlow) are listed in ./requirements.txt.

Configuration & Usage

1. Datasets

  • Three datasets are used: Drebin, VirusShare_Android_APK_2013 and Androzoo. For security reasons, the Android applications must be obtained by following each dataset's own access policy.

  For Drebin, the malicious APKs can be downloaded from the official website. We provide the sha256 codes of a portion of the Drebin benign APKs; the corresponding APKs can be downloaded from Androzoo.

  For Androzoo, we use the dataset built by researchers Pendlebury et al. All APKs can be downloaded from Androzoo.

  For Virusshare, we use the file named VirusShare_Android_APK_2013.zip.

  For adversarial APKs, we resort to this repository.

  • We additionally provide the preprocessed data files, which are available at an anonymous url (the unzipped folder is ~213GB).

2. Configure

For convenience, we provide a conf (Windows platform) / conf-server (Ubuntu) file to assist with customization (please pick one and rename it to config; do not use both). Before running, change the following settings:

  • Set project_root=/absolute/path/to/malware-uncertainty/.

  • Set database_dir=/absolute/path/to/datasets. Optionally, this directory holds the malware datasets (e.g., Drebin or Androzoo) with the following structure:

datasets
|---drebin
      |---malicious_samples  % malicious apps folder
      |---benign_samples     % benign apps folder
|---androzoo_tesseract
      |---malicious_samples
      |---benign_samples
      |   date_stamp.json    % date stamp for each app, we will provide
|---VirusShare_Android_APK_2013
      |---malicious_samples
      |---benign_samples
|---naive_data               % saving the preprocessed data files 
...
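Before launching feature extraction, it can help to check that database_dir matches the tree above. The following is a minimal sketch: the helper name missing_dirs is hypothetical and not part of this repository, and because the raw datasets are optional, a missing folder is not necessarily an error.

```python
from pathlib import Path
from typing import List

# Sub-folders copied from the directory tree above.
EXPECTED_DIRS = [
    "drebin/malicious_samples",
    "drebin/benign_samples",
    "androzoo_tesseract/malicious_samples",
    "androzoo_tesseract/benign_samples",
    "VirusShare_Android_APK_2013/malicious_samples",
    "VirusShare_Android_APK_2013/benign_samples",
    "naive_data",
]


def missing_dirs(database_dir: str) -> List[str]:
    """Return the expected sub-folders that do not exist under database_dir."""
    root = Path(database_dir)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]
```

For example, calling missing_dirs("/absolute/path/to/datasets") returns an empty list when every folder in the tree is present.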

If no real apps are considered, the preprocessed data files are enough to make the project work. In this case, continue the configuration as follows:

  • Download the datasets folder from the anonymous url and put it in the project root directory, namely malware-uncertainty. Please note that this datasets folder is not necessarily the same as the database_dir directory configured in the second step.
  • Download naive_data from the anonymous url and put it in the database_dir directory configured in the second step (it needs to be unzipped: mv naive_data.tar.gz database_dir; cd database_dir; tar -xvzf naive_data.tar.gz ./).
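On platforms without a tar command (for example, the Windows development setup), the unpacking step can also be done from Python. This is a minimal sketch using the standard tarfile module; the helper name unpack_archive is hypothetical, and the internal layout of naive_data.tar.gz is not assumed here.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive_path: str, dest_dir: str) -> Path:
    """Extract a .tar.gz archive into dest_dir and return dest_dir as a Path.

    Only use this on archives you trust: extractall writes whatever
    paths the archive contains.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=dest)
    return dest
```

A call such as unpack_archive("naive_data.tar.gz", "/absolute/path/to/datasets") mirrors the shell commands above, with the second argument standing in for your own database_dir.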

3. Usage

We suggest creating a conda environment to run the code. The following instructions may be helpful:

  1. Create a new environment: conda create -n mal-uncertainty python=3.6
  2. Activate the environment and install dependencies: conda activate mal-uncertainty, then pip install -r requirements.txt
  3. Next step:
  • For training, all scripts are listed in ./run.sh
  • To produce the figures and table data, run the Python script ./experiments/table-figures.py (this part is not implemented for the malware detector Droidetec)

Warning

  • Feature extraction on Android applications is usually time consuming.
  • Two detectors (DeepDroid and Droidetec) consume a lot of RAM and computation because they use very long sequences to improve detection accuracy.

License && Issues

We will make our code publicly available under a formal license. For now, this is still ongoing work, and we plan to report more results in future work. Note that we found two issues when checking our code:

  • No random seed is set, so the results cannot be reproduced exactly as in the paper; nevertheless, similar results can be achieved.
  • The training, validation, and test datasets are split randomly, so the results vary from run to run.
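Users who want repeatable runs could work around both issues by fixing a seed before splitting. The sketch below shows one way to do this with only the standard library; the function name seeded_split and the 60/20/20 ratios are hypothetical and are not the settings used in the paper. Seeds for NumPy and TensorFlow would also need to be fixed separately for fully repeatable training.

```python
import random
from typing import List, Sequence, Tuple


def seeded_split(
    items: Sequence[str],
    seed: int = 0,
    val_ratio: float = 0.2,
    test_ratio: float = 0.2,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle with a private, seeded RNG and cut into train/val/test lists."""
    if val_ratio < 0 or test_ratio < 0 or val_ratio + test_ratio >= 1.0:
        raise ValueError("ratios must be non-negative and sum to less than 1")
    shuffled = sorted(items)  # sort first so the input order does not matter
    random.Random(seed).shuffle(shuffled)  # does not touch the global RNG state
    n_test = int(len(shuffled) * test_ratio)
    n_val = int(len(shuffled) * val_ratio)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical sample identifiers standing in for APK sha256 codes.
samples = [f"apk_{i:02d}" for i in range(10)]
train, val, test = seeded_split(samples, seed=42)
print(len(train), len(val), len(test))
```

Calling seeded_split again with the same seed and the same set of samples returns the same three lists, regardless of the order in which the samples were passed in.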

Contact

If you have any questions, please do not hesitate to contact us (Shouhuai Xu, email: sxu@uccs.edu; Deqiang Li, email: lideqiang@njust.edu.cn).
