Automatická detekce témat v básnických textech

Martin Bendík

Automatic detection of topics in poetic texts

Type of document

Master's thesis

Author

Martin Bendík

Supervisor

Klouda Karel

Opponent

Friedjungová Magda

Field of study

Knowledge Engineering

Study program

Informatics

Institutions assigning rank

Department of Applied Mathematics

Rights

A university thesis is a work protected by the Copyright Act. Extracts, copies and transcripts of the thesis may be made for personal use only and at one's own expense. Use of the thesis must comply with the Copyright Act http://www.mkcr.cz/assets/autorske-pravo/01-3982006.pdf and citation ethics http://knihovny.cvut.cz/vychova/vskp.html

Abstract

This thesis studies topic detection in the Corpus of Czech Verse, which contains tens of thousands of poems from the 19th and early 20th centuries. It uses machine learning methods to process this large amount of data efficiently. The output of these algorithms is a set of detected topics and an assignment of individual poems to these topics. This can support further analysis of the works, their summarization, and the study of what each poem addresses. The thesis surveys current research on topic detection in poetic texts across different languages and technologies. It also develops several models that assign topics to individual poems, using unsupervised, supervised, and semi-supervised algorithms. We evaluate and visualize all of the models in detail, point out their strengths, weaknesses, and specific properties, and compare them with each other. Because the Corpus of Czech Verse contains no topic annotations, an annotated dataset consisting of a subset of poems from the original corpus was created for supervised learning.