Výběr reprezentativních vzorků z datových sad pro detekci malwaru

Děd, Lukáš

Selection of Representative Samples from Datasets for Malware Detection

Authors

Děd, Lukáš

Supervisors

Jureček, Martin

Reviewers

Kozák, Matouš

Publisher

České vysoké učení technické v Praze
Czech Technical University in Prague

Files

Full Text (666.71 KB)

Review (46.28 KB)

Review (44.29 KB)

Abstract

This thesis addresses the selection of representative training-set instances for malware detection. Experiments were conducted on two publicly available datasets containing metadata of Windows PE files: EMBER and SOREL-20M. The theoretical part describes the data preprocessing methods, instance selection algorithms, and classification algorithms used in the practical part, as well as the structure of PE files. The practical part describes the preprocessing of the datasets and the main experiments comparing state-of-the-art instance selection algorithms. In addition, modifications to the parallel instance selection algorithm PIF were proposed and implemented; these were also experimentally evaluated and compared against the state-of-the-art instance selection algorithms.

Keywords

instance selection, PIF, DROP3, MSS, CNN, ICF, AllKNN, RENN, ENN, KNN, machine learning, artificial intelligence, classification, malware, PE files, Windows

Permanent link

http://hdl.handle.net/10467/115292

Rights/License

A university thesis is a work protected by the Copyright Act of the Czech Republic. Extracts, copies, and transcripts of the thesis may be made for personal use only and at one's own expense. Any use of the thesis must comply with the Copyright Act, as amended.

Collections

Master Theses - 18106
