The challenge in engaging malware activities involves the correct identification and classification of different malware variants. Various malwares incorporate code obfuscation methods that alters their code signatures effectively countering antimalware detection techniques utilizing static methods and signature database. In this study, we utilized an approach of converting a malware binary into an image and use Random Forest to classify various malware families. The resulting accuracy of 0.9562 exhibits the effectivess of the method in detecting malware
Cybercrime operations through networked computer systems remains a growing threat for developed regions with a mature information and communications (ICT) infrastructure in which a considerable number of public and private services are dependent. At the core of Cybercrime operations are Malwares consisting of spywares, bots, rootkits, Trojan, and viruses designed to perform tasks such as service disruption, network hijacking, exploiting resources, and private information stealing [1].
The challenge in engaging malware activities involves the correct identification and classification of different malware variants. Malwares incorporate code obfuscation and metamorphism to change their code signatures while maintaining their behaviors and functionalities [2]. These methods effectively counters anti-malware software relying on malware signature database to identify a specific malware attacking a computer system. Lastly, this code obfuscation and morphing generates a high volume of data points for a certain malware variant alone [3]. Nataraj et al [4] proposed a method on visualizing malware binaries as image file resulting on complex visual patterns acting as a malwares signature. The obfuscation of the code also introduces various changes on the resulting image but still retain the general structure and thus show potential as an approach to classifying malware variants.
In this study, we take advantage of malware as image files as feature vectors and Random Forest to effectively classify and segregate malware families from each other.
In this study we evaluate our methods on the Malimg Dataset [4] consisting of 9,342 malware samples of 25 different malware families. Table 1 shows the malware families consisting the Malimg Dataset and the equivalent population % of each of the families within the data set. It is worth noting that the following dataset is imbalanced and developing a training set must include a stratified sampling of the populations to prevent overfitting and under generalization on specific malware variants. The labeled arrays are appended together as a row into a csv file that would comprise the training set for the machine learning algorithm.
Various researches on malware and cyber anomaly detections utilized machine learning methods such as Support Vector Machines (SVM), K-Nearest Neighbors (K-NN), and Neural Networks (NN) [6], for this study, we utilized the use of Random Forest as a feasible method for malware classification.
In terms of supervised learning and performances various studies have ranked Gradient Boosted Trees, Random Forests, Neural Networks, and Support Vector Machines to have high predictive accuracies [7] [8]. While Gradient Boosted Trees did have the highest accuracy, Random Forest was able to achieve almost the same performance with minor parameter tuning [7].
In this study, we utilized the Random Forest implementation on R with the randomForest and caret library.
Creating the training and testing set involves splitting the data into 80% training and 20% testing. The splitting of the dataset also involves taking into account the relative populations of each malware families to ensure that each family are well represented on the split dataset.
A k-fold Cross-Validation procedure is used to evaluate the model where the training data is randomly partition into different subsamples with equal k sizes. One k subsample is held out as validation data and the remaining k subsamples are used as training data. This process is then repeated k-times (referred as the number of folds) with each of the k subsample used as validation. The resulting accuracies for each fold is averaged to produce a single estimation of the models accuracy for a particular machine learning problem [9].
A 10-fold Cross-Validation procedure performed on the training set to evaluate the model, afterwards the model is tested on the held-out testing set and evaluated for its performance.
The training set for the data consist of a 1024 feature vector with a corresponding label. We first evaluate the crossvalidation results of the model with the training set. The resulting metrics as shown on Table 2 indicates a strong predictive performance from the model. The model’s overall predictive accuracy is 0.9464 within the bounds of [0.9411, 0.9514]. Another metric considered is the Kappa statistic which indicates if the proximity of the instances classified by the predictive model matched the testing data’s ground truth [10].
The measured Kappa for the cross validation result is 0.9367 and provides a strong indication with regards to the accuracy of the random forest model for the training set.
In this study we exhibited the used of malware images as a feature vector for classifying various malware families. The study used Random Forest and performed 10-fold Cross Validation to determine the predictive strength of the model The resulting accuracies have shown that Random Forest model achieved a 0.9526
This content is AI-processed based on open access ArXiv data.