Mapování Internetu — Modelování Interakcí Entit v Komplexních Heterogenních Sítích

Šimon Mandlík

Mapping the Internet — Modelling Entity Interactions in Complex Heterogeneous Networks

Type of document

master's thesis

Author

Šimon Mandlík

Supervisor

Pevný Tomáš

Opponent

Bajer Lukáš

Field of study

Artificial Intelligence

Study program

Open Informatics

Degree-granting institution

Department of Computer Science

Rights

A university thesis is a work protected by the Copyright Act. Extracts, copies and transcripts of the thesis may be made for personal use only and at one's own expense. Use of the thesis must comply with the Copyright Act (http://www.mkcr.cz/assets/autorske-pravo/01-3982006.pdf) and with citation ethics (http://knihovny.cvut.cz/vychova/vskp.html).

Abstract

Even though machine learning algorithms already play a significant role in data science, many current methods make unrealistic assumptions about input data. Applying such methods is difficult because of incompatible data formats and heterogeneous, hierarchical or entirely missing data fragments in the dataset. As a solution, we propose a versatile, unified framework called "HMill" (Hierarchical multi-instance learning library) for sample representation, model definition and training, which addresses these problems and meets the requirements for a modern general-purpose tool. We review in depth the multi-instance learning paradigm that the framework builds on and extends. To theoretically justify the design of the key components of HMill, we show an extension of the universal approximation theorem to the set of all functions realized by models implemented in the framework. The text also contains a detailed discussion of technical details and performance optimizations in our implementation, which is published for download under the MIT License. The main asset of the framework is its flexibility, which makes it possible to model diverse real-world data sources with the same tool. This requires only minor changes in the pipeline and neither further specialization nor compromises in the performance of the trained models. In addition to the standard setting, in which a set of attributes is observed for each object individually, we explain how message-passing inference in graphs that represent whole systems of objects can be implemented in the framework. To support our claims, we use the framework to solve three different problems from the cybersecurity domain. The first use case concerns IoT device identification from raw network observations. In the second problem, we study how malicious binary files can be classified using a snapshot of the operating system represented as a directed graph. The last example is the task of extending a blacklist of malicious domains by modelling interactions between entities in the network. In all three problems, the solution based on the proposed framework achieves performance comparable to specialized approaches.