Bioinformatický indexovací nástroj pro vyhledávání v elastických degenerativních řetězcích

Dominika Draesslerová

Bioinformatics index tool for elastic degenerate string matching

Document type

Master's thesis

Author

Dominika Draesslerová

Supervisor

Holub Jan

Reviewer

Krčál Luboš

Field of study

Theoretical Computer Science

Study programme

Informatics

Degree-granting institution

Department of Theoretical Computer Science

Defended

2023-02-15

Rights

A university thesis is a work protected by the Copyright Act. Extracts, copies and transcripts of the thesis are allowed for personal use only and at one's own expense. Use of the thesis must comply with the Copyright Act http://www.mkcr.cz/assets/autorske-pravo/01-3982006.pdf and with citation ethics http://knihovny.cvut.cz/vychova/vskp.html
Vysokoškolská závěrečná práce je dílo chráněné autorským zákonem. Je možné pořizovat z něj na své náklady a pro svoji osobní potřebu výpisy, opisy a rozmnoženiny. Jeho využití musí být v souladu s autorským zákonem http://www.mkcr.cz/assets/autorske-pravo/01-3982006.pdf a citační etikou http://knihovny.cvut.cz/vychova/vskp.html

Abstract

Schopnost efektivně prohledávat veliké množství dat je důležitou součástí nejen v bioinformatické a informatické sféře, ale v celém moderním světě. Každý organismus je originální a vytvořit komplexní strukturu, která by vhodně reprezentovala genomy a jejich varianty a uměla s nimi efektivně pracovat, je hlavním směrem pangenomického výzkumu. V této práci diskutujeme a implementujeme poměrně nový algoritmus zvaný BIO-FMI, který má potenciál k tomu efektivně komprimovat a vyhledávat nad množinou vysoce repetitivních dat, jako jsou právě sekvence DNA. V současné době je způsob ukládání nevyhovující vzhledem k postavení vůči jednomu referenčnímu genomu a hledají se alternativní řešení. Tato práce se zabývá modifikací algoritmu BIO-FMI na formát elastických degenerovaných řetězců (EDS), které jsou kandidátními reprezentanty pro ukládání variant. Práce ukazuje slibné výsledky v rychlosti sestavení indexu, variabilitě nastavení a porovnává je s dalšími algoritmy z této oblasti, kterými jsou LZ-RLBWT a r-index.

The ability to search large amounts of data efficiently is important not only in bioinformatics and computer science but throughout the modern world. Every organism is unique, and creating a complex structure that appropriately represents genomes and their variants, while still allowing efficient work with them, is the main direction of pan-genomic research. In this thesis we discuss and implement a relatively new algorithm called BIO-FMI, which has the potential to efficiently compress and search a set of highly repetitive data such as DNA sequences. The current way of storing variants is unsatisfactory because variants are expressed relative to a single reference genome, and alternative solutions are being sought. This thesis modifies the BIO-FMI algorithm for the format of elastic-degenerate strings (EDS), which are candidate representations for storing variants. The thesis shows promising results in index construction time and in configuration flexibility, and compares them with other algorithms in this area, namely LZ-RLBWT and r-index.
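
To make the notion of an elastic-degenerate string concrete, the following minimal Python sketch represents an EDS as a list of segments, each holding alternative strings (possibly the empty string), and reports the segments in which an occurrence of a pattern ends. This is a naive online scan written purely for illustration; it is not BIO-FMI and not the index built in the thesis. The function name `eds_find` and the example data are assumptions introduced here.

```python
def eds_find(segments, pattern):
    """Return indices of segments in which an occurrence of `pattern` ends.

    `segments` is a list of lists of strings: one alternative is chosen per
    segment, and the chosen strings are concatenated. Illustrative naive scan,
    NOT the BIO-FMI index described in the abstract.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    m = len(pattern)
    active = set()   # lengths j such that pattern[:j] ends at the last boundary
    hits = []
    for i, alternatives in enumerate(segments):
        next_active = set()
        found = False
        for s in alternatives:
            # Occurrence lying entirely inside one alternative.
            if pattern in s:
                found = True
            # A proper prefix of the pattern may start inside this alternative.
            for k in range(1, min(len(s), m - 1) + 1):
                if s.endswith(pattern[:k]):
                    next_active.add(k)
            # Try to extend partial matches carried over from earlier segments.
            for j in active:
                rest = pattern[j:]
                if len(s) >= len(rest):
                    if s.startswith(rest):
                        found = True
                elif rest.startswith(s):
                    # Alternative is consumed whole (an empty one keeps j as is).
                    next_active.add(j + len(s))
        if found:
            hits.append(i)
        active = next_active
    return hits


# Hypothetical example: "AC", then one of "G", "T" or nothing, then "CA".
example = [["AC"], ["G", "T", ""], ["CA"]]
print(eds_find(example, "CGC"))  # spans all three segments via "G"
print(eds_find(example, "CCA"))  # uses the empty alternative
```

The scan revisits every alternative for every active prefix, so its cost grows with both the pattern length and the total size of the EDS; an index-based approach such as the one the thesis studies aims to avoid exactly this kind of repeated scanning.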