Selection of Representative Samples from Datasets for Malware Detection
Výběr reprezentativních vzorků z datových sad pro detekci malwaru
Publisher
České vysoké učení technické v Praze
Czech Technical University in Prague
Abstract
This thesis addresses the selection of representative training-set instances for malware detection. Experiments were conducted on two publicly available datasets, EMBER and SOREL-20M, both of which contain metadata of Windows PE files. The theoretical part describes the structure of PE files together with the data preprocessing methods, instance selection algorithms, and classification algorithms used in the practical part. The practical part describes how the datasets were preprocessed and presents the main experiments, which compare state-of-the-art instance selection algorithms. The thesis also proposes and implements modifications to the parallel instance selection algorithm PIF; these modifications were evaluated experimentally and compared with the results of the state-of-the-art instance selection algorithms.
Keywords
instance selection, PIF, DROP3, MSS, CNN, ICF, AllKNN, RENN, ENN, KNN, machine learning, artificial intelligence, classification, malware, PE files, Windows