From FastText to Transformer Models, and their Application in Retrieval-Augmented Generation

This thesis examines the evolution of text representation methods, from traditional techniques such as FastText to transformer-based models such as Bidirectional Encoder Representations from Transformers (BERT). The representations are evaluated through analogy tests and confusion matrix analysis on the UPV corpus. The second part of the work focuses on optimizing text representations for Retrieval-Augmented Generation (RAG). It aims to identify the most effective embeddings and the optimal text chunk size for Question Answering (QA) tasks, in particular for generating natural language answers from technical manuals. The evaluation aims to recommend a representation model that balances factual accuracy and computational efficiency.

Keywords

NLP, Word Embedding, Transformer, FastText, RAG, QA, STS

Permanent link

http://hdl.handle.net/10467/115109

Rights/License

A university thesis is a work protected by the Copyright Act of the Czech Republic. Extracts, copies and transcripts of the thesis are allowed for personal use only and at one's own expense. Use of the thesis must comply with the Copyright Act as amended.

Collections

Bachelor Theses - 13133
