Klasifikace dokumentů pomocí metod strojového učení

Artem Ustynov

Documents classification using machine learning methods

dc.contributor.advisor	Buk Zdeněk
dc.contributor.author	Artem Ustynov
dc.date.accessioned	2020-06-20T22:52:28Z
dc.date.available	2020-06-20T22:52:28Z
dc.date.issued	2020-06-20
dc.identifier	KOS-886320421305
dc.identifier.uri	http://hdl.handle.net/10467/88365
dc.description.abstract	Problém hledání v nekategorizovaných dokumentech spočívá v tom, že uživatelům jsou často prezentovány výsledky, které obsahují hledaná klíčová slova, ale nejsou pro uživatele relevantní. Cílem této práce je rozšířit dokumenty o štítky na základě obsahu dokumentů. K dosažení cíle bylo zvažováno několik přístupů: Elasticsearch, Semaphore, LSTM, BERT. Cílem práce je zjistit, která technika má největší potenciál a poskytuje nejlepší výsledky. Všechny uvedené přístupy byly testovány a vyhodnoceny. Bylo zjištěno, že modely BERT fungovaly nejlépe a splnily všechny vstupní požadavky. Zlepšení kvality klasifikace pomocí BERT bylo dosaženo použitím počátečního modelu a manuální klasifikací malé sady dokumentů s nízkým skóre spolehlivosti.	cze
dc.description.abstract	The problem with searching uncategorized documents is that users are often presented with results that contain the searched keywords but are not relevant to them. The goal of this work is to extend the documents with tags based on their content. To accomplish this, several approaches were considered: Elasticsearch, Semaphore, LSTM, and BERT. The objective of this thesis is to determine which technology has the most potential and provides the best results. All listed approaches were tested and evaluated. It was found that the BERT models performed best and satisfied all of the initial business requirements. Some improvements in the quality of classification with BERT were achieved by utilizing the initial model and manually classifying a small set of documents with a low confidence score.	eng
dc.publisher	České vysoké učení technické v Praze. Výpočetní a informační centrum.	cze
dc.publisher	Czech Technical University in Prague. Computing and Information Centre.	eng
dc.rights	A university thesis is a work protected by the Copyright Act. Extracts, copies and transcripts of the thesis are allowed for personal use only and at one's own expense. The use of thesis should be in compliance with the Copyright Act http://www.mkcr.cz/assets/autorske-pravo/01-3982006.pdf and the citation ethics http://knihovny.cvut.cz/vychova/vskp.html	eng
dc.rights	Vysokoškolská závěrečná práce je dílo chráněné autorským zákonem. Je možné pořizovat z něj na své náklady a pro svoji osobní potřebu výpisy, opisy a rozmnoženiny. Jeho využití musí být v souladu s autorským zákonem http://www.mkcr.cz/assets/autorske-pravo/01-3982006.pdf a citační etikou http://knihovny.cvut.cz/vychova/vskp.html	cze
dc.subject	Elasticsearch	cze
dc.subject	Semaphore	cze
dc.subject	LSTM	cze
dc.subject	BERT	cze
dc.subject	klasifikace textu	cze
dc.subject	Elasticsearch	eng
dc.subject	Semaphore	eng
dc.subject	LSTM	eng
dc.subject	BERT	eng
dc.subject	text classification	eng
dc.title	Klasifikace dokumentů pomocí metod strojového učení	cze
dc.title	Documents classification using machine learning methods	eng
dc.type	bakalářská práce	cze
dc.type	bachelor thesis	eng
dc.contributor.referee	Štepanovský Michal
theses.degree.discipline	Computer Science (Bachelor, in English)	cze
theses.degree.grantor	katedra teoretické informatiky	cze
theses.degree.programme	Informatics (in English)	cze

Files in this item

Name: F8-BP-2020-Ustynov-Artem-thesis.pdf
Size: 1.273Mb
Format: PDF
Description: PLNY_TEXT (full text)

Name: F8-BP-2020-posudek-Buk_Zdenek.pdf
Size: 134.6Kb
Format: PDF
Description: POSUDEK (review report)

Name: F8-BP-2020-posudek-Stepanovsky ...
Size: 135.7Kb
Format: PDF
Description: POSUDEK (review report)

This item appears in the following collections

Bachelor theses - 18101 [337]
