Automatická detekce intronů v genomech hub pomocí pravděpodobnostních modelů

Marek Zvara

Automatic intron detection in fungal genomes using probabilistic models

Document type

master's thesis

Author

Marek Zvara

Supervisor

Kléma Jiří

Reviewer

Větrovský Tomáš

Field of study

Artificial Intelligence

Study programme

Open Informatics

Degree-granting institution

Department of Computer Science

Rights

A university thesis is a work protected by the Copyright Act. Extracts, copies and transcripts of the thesis are allowed for personal use only and at one's own expense. Use of the thesis must comply with the Copyright Act http://www.mkcr.cz/assets/autorske-pravo/01-3982006.pdf and with citation ethics http://knihovny.cvut.cz/vychova/vskp.html

Abstract

Fungi are a diverse group of organisms, and for many species neither the genome nor the functions of its genes are yet known. Detecting intron regions can support genome-wide comparison of fungi and help determine how they are related. This thesis therefore analyses the fungal genome, its specifics and its properties, in order to identify the bottlenecks and problems that such detection has to face. It also surveys current probabilistic approaches to gene region detection, classifies and compares the methods built on different probabilistic models, and introduces state-of-the-art tools such as Augustus+ and CodingQuarry to give an overview of the best available approaches in the field. Based on this analysis, we designed and implemented a probabilistic model built on generalized hidden Markov models (GHMMs), which uses the Viterbi algorithm for GHMMs to determine the most probable sequence of states. We describe the design and the tricks used in the implementation, such as speeding up the Viterbi algorithm and splitting the detection into two consecutive models: a transcript model, which marks the coding regions of an unknown genome, and a genome model, which uses these marks and refines the detection with a more complex GHMM. The result is a tool that takes an input sequence and, optionally, user-specified hint annotations, and detects the individual gene regions of the DNA sequence, such as promoter, exon, intron, stop codon, and UTR regions. Its output is the list of annotations produced by this detection, with an optional visualization as HTML. The thesis also deals with generalizing these models using the taxonomy of fungi: it compares the accuracy of intron detection across individual fungal taxa, so the tool can train more general models within a defined taxon.
We subject the model to a large number of experiments to find suitable hyperparameters and to examine how detection with our tool behaves at different taxonomic levels. The purpose of generalizing the models is to deploy them for detection in an unknown metagenome, a mixture of genomes of different organisms in which fungi make up the largest share. The thesis points out the weaknesses and advantages of deploying such general models on this metagenome.
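
For readers unfamiliar with the decoding step mentioned in the abstract, the following is a minimal sketch of Viterbi decoding on a plain two-state hidden Markov model. It is not the thesis implementation: it omits the explicit state-duration modelling that makes an HMM "generalized", and the state names, the function `viterbi`, and all probabilities are hypothetical values chosen only for illustration.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most probable state path for `observations` under a plain HMM.

    All model parameters are given as log-probabilities to avoid underflow.
    """
    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    # back[t][s]: predecessor of s on that best path
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Trace the best path back from the highest-scoring final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

def to_log(probabilities):
    """Convert a dict of probabilities to log-probabilities."""
    return {key: math.log(value) for key, value in probabilities.items()}

# Hypothetical toy model: GC-rich "exon" state, AT-rich "intron" state.
STATES = ("exon", "intron")
LOG_START = to_log({"exon": 0.5, "intron": 0.5})
LOG_TRANS = {
    "exon": to_log({"exon": 0.9, "intron": 0.1}),
    "intron": to_log({"exon": 0.1, "intron": 0.9}),
}
LOG_EMIT = {
    "exon": to_log({"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}),
    "intron": to_log({"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}),
}

if __name__ == "__main__":
    print(viterbi("GCGCATATAT", STATES, LOG_START, LOG_TRANS, LOG_EMIT))
```

With these toy parameters, the sequence `GCGCATATAT` decodes to four `exon` labels followed by six `intron` labels. A GHMM decoder would additionally score the length of each segment, which this sketch does not attempt.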