Použití NLI modelů pro ověřování fakticity sumarizací

Jan Dusil

NLI Models for Assessing Facticity in Summarization Methods

Document type

master's thesis

Author

Jan Dusil

Supervisor

Drchal Jan

Reviewer

Čepek Miroslav

Field of study

Cybernetics and Robotics

Study programme

Cybernetics and Robotics

Degree-granting institution

Department of Control Engineering

Rights

A university thesis is a work protected by the Copyright Act. Extracts, copies and transcripts of the thesis may be made for personal use only and at one's own expense. Use of the thesis must comply with the Copyright Act http://www.mkcr.cz/assets/autorske-pravo/01-3982006.pdf and citation ethics http://knihovny.cvut.cz/vychova/vskp.html

Abstract

In recent years, neural networks, and the Transformer architecture in particular, have dominated the field of Natural Language Processing (NLP). This approach to language modeling achieves state-of-the-art results and has accelerated progress across the field; abstractive text summarization is no exception. However, Transformer-based models also bring pitfalls and challenges, among them the need for large training datasets. The field currently advances fastest for the most widely used languages, such as English, Spanish, and Chinese, mainly because datasets are available almost exclusively for these languages. This master's thesis presents an overview of state-of-the-art NLP approaches, with a focus on text summarization, and discusses the challenges of and motivation for summarization in the Czech language. In the practical part, we create a custom annotated dataset and develop an NLI fact-checking pipeline that automatically evaluates how well selected NLI models assess the facticity of generated summaries. The thesis contributes a compact summary of the state of the art in automatic text summarization. In addition, the evaluation results indicate that, given suitable datasets, NLI models have great potential to serve as an automatic, model-based metric for verifying generated summaries.