Klasifikace na temporálních relačních datech

Mück Petr

Temporal Relational Classification

Document type

master's thesis

Author

Mück Petr

Supervisor

Motl Jan

Reviewer

Surynek Pavel

Field of study

Knowledge Engineering

Study program

Informatics

Degree-granting institution

Department of Applied Mathematics

Rights

A university thesis is a work protected by the Copyright Act. Extracts, copies, and transcripts of the thesis may be made for personal use only and at one's own expense. Use of the thesis must comply with the Copyright Act http://www.mkcr.cz/assets/autorske-pravo/01-3982006.pdf and with citation ethics http://knihovny.cvut.cz/vychova/vskp.html

Abstract

This thesis explores options for classifying temporal data. I implement an aggregation model that works with relational data in which the records of an entity are in an n:1 relation to the class predicted at a given prediction time. Using the aggregation functions mean, minimum, and maximum, the model aggregates the attribute values into a single record per entity. The thesis also examines how to optimize the length of history used in the aggregation to improve prediction quality, since recent data may be more relevant than older data. I then evaluate the dependence between the attributes aggregated over a given history length and the target class using the Chi2 measure, mutual information, and Cohen's Kappa after applying a Gaussian Naive Bayes classifier. Where the datasets allow it, the best Kappa values obtained are compared with existing time series algorithms, namely the hidden Markov model and ARIMA. Finally, the best history lengths found are applied in a random forest classifier to determine their effect on classification performance. On six of the ten tested datasets, classification with the optimized history length achieves Kappa values that are on average 33.57% higher than classification with aggregation over the whole history; for the remaining four datasets there is no significant change. The aggregation model achieved better results than ARIMA and the hidden Markov model. These tests were not very extensive, however, because most datasets used in the thesis do not contain more than one historical classification point per entity and are therefore not well suited to standard time series algorithms. The conclusion is that, in most cases, the aggregation model achieves better results with an optimized history length than with the whole history.
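The aggregation step and the Kappa measure described in the abstract might be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the record layout `(entity_id, timestamp, value)`, the function names `aggregate_history` and `cohen_kappa`, and the sample data are all hypothetical, and only a single numeric attribute is handled.

```python
from collections import defaultdict
from statistics import mean


def aggregate_history(records, prediction_time, history_length=None):
    """Collapse n:1 records into one aggregated row per entity.

    records: iterable of (entity_id, timestamp, value) tuples (assumed layout).
    Only records with timestamp < prediction_time are used. If history_length
    is given, records older than prediction_time - history_length are dropped;
    None means the whole available history.
    """
    values_by_entity = defaultdict(list)
    for entity_id, timestamp, value in records:
        if timestamp >= prediction_time:
            continue  # never use data from the prediction time onward
        if history_length is not None and timestamp < prediction_time - history_length:
            continue  # outside the chosen history window
        values_by_entity[entity_id].append(value)
    return {
        entity_id: {"mean": mean(vals), "min": min(vals), "max": max(vals)}
        for entity_id, vals in values_by_entity.items()
    }


def cohen_kappa(y_true, y_pred):
    """Cohen's Kappa: agreement between two label lists, corrected for chance."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    expected = sum((y_true.count(l) / n) * (y_pred.count(l) / n) for l in labels)
    if expected == 1.0:
        return 0.0  # degenerate case (a single label everywhere); 0.0 chosen here
    return (observed - expected) / (1.0 - expected)


# Hypothetical sample data: two entities with a few timestamped measurements.
records = [
    ("a", 1, 10.0), ("a", 5, 20.0), ("a", 9, 40.0),
    ("b", 8, 3.0), ("b", 9, 5.0),
]
full = aggregate_history(records, prediction_time=10)
recent = aggregate_history(records, prediction_time=10, history_length=3)
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

Comparing `full` with `recent` shows how shortening the history changes the aggregated features of entity `"a"`. In an optimization loop of the kind the abstract describes, one would presumably repeat the aggregation for several candidate history lengths and keep the one whose classifier output scores the highest Kappa.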