Concept drift and model degradation in network traffic classification

Jančička, Lukáš

Authors

Jančička, Lukáš

Supervisors

Soukup, Dominik

Reviewers

Koumar, Josef

Publisher

České vysoké učení technické v Praze
Czech Technical University in Prague

Files

Full Text (2.68 MB)

Review (49.75 KB)

Review (51.1 KB)

Abstract

Machine learning is a highly effective and currently popular approach to network traffic classification. However, network traffic is a challenging domain, and trained models may degrade quickly after deployment. Besides biases introduced during data capture and model creation, concept drift is a major source of model degradation: as the distributions evolve, the patterns learned during training may stop being accurate. The thesis therefore focuses on laying the foundations of a framework for concept drift detection and analysis tailored to the network traffic domain. The behaviour of network traffic was examined in a variety of experiments that studied how distributions develop, simulated model deployment, and observed model degradation over time. Multiple recurring concepts were found, with weekend traffic differing from traffic during the working week. When concept drift was not addressed, test F1 scores dropped from 0.92 to around 0.7 within a few days. In some cases only a few severely drifted features caused the model degradation, so a novel approach was devised that weights the drift test results by feature importances. The resulting drift detector can be extended with modules for additional analysis of the detected drift. A novel idea of classifying drift types to better understand how traffic evolves is also introduced. In an experiment where the detector triggered model retraining upon detection, it not only prevented the model from degrading but also improved its performance over time.
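The idea of weighting per-feature drift test results by feature importances can be illustrated with a minimal sketch. This is not the thesis's detector: the choice of the two-sample Kolmogorov-Smirnov statistic as the per-feature drift test, the normalised weighted mean as the aggregation, and all function and feature names (`ks_statistic`, `weighted_drift_score`, `packets`, `duration`) are assumptions made purely for illustration.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic (assumed drift test):
    the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    points = sorted(set(ref) | set(cur))
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        for x in points
    )


def weighted_drift_score(reference, current, importances):
    """Importance-weighted mean of per-feature KS statistics, in [0, 1].

    reference, current: dicts mapping feature name -> list of values.
    importances: dict mapping feature name -> non-negative weight,
    e.g. feature importances taken from the deployed model (hypothetical).
    """
    total = sum(importances.values())
    return sum(
        weight * ks_statistic(reference[name], current[name])
        for name, weight in importances.items()
    ) / total


# Hypothetical data: "duration" has drifted completely, "packets" not at all.
reference = {"packets": [1, 2, 3, 4], "duration": [10, 20, 30, 40]}
current = {"packets": [1, 2, 3, 4], "duration": [50, 60, 70, 80]}

# The drifted feature matters little to the model, so the score stays low.
low = weighted_drift_score(reference, current, {"packets": 0.75, "duration": 0.25})
# The drifted feature dominates the model, so the score is high.
high = weighted_drift_score(reference, current, {"packets": 0.25, "duration": 0.75})
```

In this sketch an unweighted mean would report the same drift level (0.5) in both cases, whereas the weighted score separates drift in a feature the model barely uses from drift in a feature it depends on.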

Keywords

network traffic classification, concept drift, active learning, machine learning model robustness, machine learning

Permanent link

http://hdl.handle.net/10467/113756

Rights/License

A university thesis is a work protected by the Copyright Act of the Czech Republic. Extracts, copies and transcripts of the thesis are allowed for personal use only and at one's own expense. The use of the thesis must comply with the Copyright Act as amended.

Collections

Master Theses - 18105
