Rozpoznávání pojmenovaných entit v básnických textech

Ondřej Černý

Named entity recognition for poetic texts

Document type

bachelor thesis

Author

Ondřej Černý

Supervisor

Klouda Karel

Reviewer

Friedjungová Magda

Field of study

Knowledge Engineering

Study programme

Informatics 2009

Degree-granting institution

Department of Applied Mathematics

Rights

A university thesis is a work protected by the Copyright Act. Extracts, copies and transcripts of the thesis may be made for personal use only and at one's own expense. Any use of the thesis must comply with the Copyright Act http://www.mkcr.cz/assets/autorske-pravo/01-3982006.pdf and with citation ethics http://knihovny.cvut.cz/vychova/vskp.html

Abstract

This work presents a program that uses natural language processing (NLP) techniques to identify named entities in the Corpus of Czech Verse (CCV). It is part of a cooperation with the Institute of Czech Literature (ICL). Because the CCV is not labeled for named entity recognition (NER), not even partially, we first create a set of rules and use them to select entities from the poems. These entities are then assigned to entity categories using data from Wikipedia. The categorized entities serve as training data for a BiLSTM-CRF neural network, which is trained and fine-tuned for NER on the CCV. The resulting model can find and distinguish entities of the types Place, Person, Mystic Person, and Other. Because the text in the CCV is not labeled for NER, the true accuracy of the final BiLSTM-CRF model cannot be determined. If the training data were assumed to be 100% accurate, the final model would achieve an accuracy of 0.99904 and an F1 score of 0.9532.
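
The gap between the reported accuracy and F1 score is typical for NER, where most tokens carry no entity label. The following is a minimal sketch of how the two metrics can diverge on the same predictions. It assumes token-level scoring with a hypothetical outside tag "O" and hypothetical label strings "PERSON" and "PLACE"; the abstract does not state how the thesis computes its metrics, so this illustrates the general effect rather than the author's evaluation.

```python
from typing import List


def accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of tokens whose predicted tag equals the reference tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def entity_f1(gold: List[str], pred: List[str], outside: str = "O") -> float:
    """Micro-averaged token-level F1 over all tags except the outside tag."""
    true_pos = sum(g == p and g != outside for g, p in zip(gold, pred))
    pred_pos = sum(p != outside for p in pred)
    gold_pos = sum(g != outside for g in gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / pred_pos
    recall = true_pos / gold_pos
    return 2 * precision * recall / (precision + recall)


# Toy data (made up): 100 tokens, 4 of them entity tokens, one entity token missed.
gold = ["O"] * 96 + ["PERSON", "PERSON", "PLACE", "PLACE"]
pred = ["O"] * 96 + ["PERSON", "PERSON", "PLACE", "O"]

acc = accuracy(gold, pred)
f1 = entity_f1(gold, pred)
print(f"accuracy={acc:.4f} f1={f1:.4f}")
```

A single missed entity token barely moves accuracy, because the many correctly tagged non-entity tokens dominate it, while F1 drops noticeably. This is why F1 is usually the more informative figure for NER.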