Automatická explorační analýza dat pro binární klasifikaci pomocí knihovny pandas profiling

Jan Čáp

Automated exploratory data analysis for binary classification using pandas profiling library

Document type

Bachelor thesis

Author

Jan Čáp

Supervisor

Vašata Daniel

Reviewer

Friedjungová Magda

Field of study

Knowledge Engineering

Study program

Informatics 2009

Degree-granting institution

Department of Applied Mathematics

Rights

A university thesis is a work protected by the Copyright Act. Extracts, copies and transcripts of the thesis may be made for personal use only and at one's own expense. Any use of the thesis must comply with the Copyright Act http://www.mkcr.cz/assets/autorske-pravo/01-3982006.pdf and with citation ethics http://knihovny.cvut.cz/vychova/vskp.html

Abstract

This thesis deals with automated exploratory data analysis for binary classification tasks. It first surveys existing solutions for automated data exploration. It then examines statistical tests and methods suitable for testing the dependence between two variables, along with suitable ways of visualizing data distributions. Next, it proposes an extension to the Pandas Profiling library, which was selected in the survey. The extension specializes in binary classification and includes graphs and statistics representing the dependence of columns on the target variable, a visualization of the dependence of missing values on the target variable, suggested column transformations, and the training of a baseline model for classifying the target variable. Based on this design, an extension to the Pandas Profiling library was implemented that speeds up the exploration of data for binary classification.
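As a rough illustration of what testing "the dependence of missing values on the target variable" can look like, the following minimal sketch runs a Pearson chi-square test of independence on a 2x2 table (value missing or present versus target 0 or 1). The abstract does not say which test the thesis actually uses, so the choice of test, the function name `missingness_chi2`, and the use of `None` to mark missing values are all assumptions made for this example; it is not the thesis's implementation.

```python
import math


def missingness_chi2(values, target):
    """Pearson chi-square test (no continuity correction) of independence
    between "value is missing" and a binary 0/1 target.

    Hypothetical helper for illustration only; missing values are assumed
    to be represented by None.
    """
    # 2x2 contingency table: rows = missing / present, columns = target 0 / 1
    table = [[0, 0], [0, 0]]
    for value, label in zip(values, target):
        table[0 if value is None else 1][label] += 1

    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("test is undefined: a row or column of the table is empty")

    # Sum of (observed - expected)^2 / expected over the four cells
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    column = [None, None, None, None, 1.0, 2.0, 3.0, 4.0]
    labels = [1, 1, 1, 0, 0, 0, 0, 1]
    print(missingness_chi2(column, labels))
```

A small p-value would suggest that whether a value is missing is related to the target, which is the kind of signal such a report could surface; with very small cell counts an exact test may be more appropriate.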