Randomizované indexy pro přibližné vyhledávání v multidimenzionálních polích
Randomized Indexing for Approximate Selection Queries on Multidimensional Arrays
Typ dokumentu
disertační prácedoctoral thesis
Autor
Luboš Krčál
Vedoucí práce
Holub Jan
Oponent práce
Krátký Michal
Studijní obor
InformatikaStudijní program
InformatikaInstituce přidělující hodnost
katedra teoretické informatikyPráva
A university thesis is a work protected by the Copyright Act. Extracts, copies and transcripts of the thesis are allowed for personal use only and at one?s own expense. The use of thesis should be in compliance with the Copyright Act http://www.mkcr.cz/assets/autorske-pravo/01-3982006.pdf and the citation ethics http://knihovny.cvut.cz/vychova/vskp.htmlVysokoškolská závěrečná práce je dílo chráněné autorským zákonem. Je možné pořizovat z něj na své náklady a pro svoji osobní potřebu výpisy, opisy a rozmnoženiny. Jeho využití musí být v souladu s autorským zákonem http://www.mkcr.cz/assets/autorske-pravo/01-3982006.pdf a citační etikou http://knihovny.cvut.cz/vychova/vskp.html
Metadata
Zobrazit celý záznamAbstrakt
Multidimensional data, either in the form of dense arrays, or sparse relational data are a common data structure for effective storage, access, management, querying, disseminating, analysis, and visualization of scientific datasets. Array data are being used in many scientific domains, including computational fluid dynamics, oceanography, spatiotemporal climate analysis, forecasting, medical, biomedical, astronomical and satellite data processing. Efficient processing of high dimensional data is difficult due to the arbitrary size, cardinality, and so called curse of dimensionality. Bitmap indices are widely used in commercial databases for processing complex queries, due to the efficient use of hardware accelerated bit-wise operations and their spaceefficiency. Compressed, hierarchical, multi-component bitmap indices have also been used for relational data. Inverted indexing is another commonly used technique in a variety of high dimensional data applications, such as exact search, similarity search, or machine learning. Inverted index maps multidimensional data points into lists based on some discretization of their dimension values. Similarly, a column index of a relational database may map individual column values to lists of corresponding rows. To evaluate a query, inverted lists are usually intersected to obtain a list of points satisfying all the constraints. Our interest is in a more generalized approach, where each query evaluates similarity based on the number of matched dimensions. In this work, we have designed, implemented and evaluated two methods of indexing multidimensional array data for selection queries, and an extension of data parallel inverted index as part of generic framework for similarity search on the GPU. Following is a list of individual contributions: For the purpose of efficient execution of various spatiotemporal selection queries in large distributed array databases, we have designed a multidimensional array inverted index based on grid transformations. We demonstrate the efficiency of our multidimensional array index on a complete, large-scale satellite dataset. The work was implemented and integrated as an extension of a distributed open-source array database SciDB. iii Next, we have proposed a hierarchical indexing scheme for multidimensional arrays that overcomes the dimensionality-induced inefficiencies of standard spatial and bitmap indexing techniques on dense multidimensional arrays. The index is based on novel n-dimensional sparse trees for dimension partitioning, with bound number of individual, adaptively binned indices for attribute partitioning. This indexing performs well on queries involving both dimensions and attributes constraints, as it prunes the search space early. Lastly, we have improved query performance of generic similarity search in GENIE (Generic inverted index on GPU) by incorporating compressed inverted index on GPU with data parallel decoding. Multiple decoding schemes were designed, implemented, and evaluated for a fully data parallel decoding and query execution. The implementation has sped up total query processing time in 3-4 times on real world datasets. All the components were integrated into publicly available similarity search framework GENIE in a robust and modular architecture, with configurable query compiler and index management components. The extensions of GENIE were designed for multi-GPU and multi-node distributed deployment with an implementation of the distributed functionality publicly available Multidimensional data, either in the form of dense arrays, or sparse relational data are a common data structure for effective storage, access, management, querying, disseminating, analysis, and visualization of scientific datasets. Array data are being used in many scientific domains, including computational fluid dynamics, oceanography, spatiotemporal climate analysis, forecasting, medical, biomedical, astronomical and satellite data processing. Efficient processing of high dimensional data is difficult due to the arbitrary size, cardinality, and so called curse of dimensionality. Bitmap indices are widely used in commercial databases for processing complex queries, due to the efficient use of hardware accelerated bit-wise operations and their spaceefficiency. Compressed, hierarchical, multi-component bitmap indices have also been used for relational data. Inverted indexing is another commonly used technique in a variety of high dimensional data applications, such as exact search, similarity search, or machine learning. Inverted index maps multidimensional data points into lists based on some discretization of their dimension values. Similarly, a column index of a relational database may map individual column values to lists of corresponding rows. To evaluate a query, inverted lists are usually intersected to obtain a list of points satisfying all the constraints. Our interest is in a more generalized approach, where each query evaluates similarity based on the number of matched dimensions. In this work, we have designed, implemented and evaluated two methods of indexing multidimensional array data for selection queries, and an extension of data parallel inverted index as part of generic framework for similarity search on the GPU. Following is a list of individual contributions: For the purpose of efficient execution of various spatiotemporal selection queries in large distributed array databases, we have designed a multidimensional array inverted index based on grid transformations. We demonstrate the efficiency of our multidimensional array index on a complete, large-scale satellite dataset. The work was implemented and integrated as an extension of a distributed open-source array database SciDB. iii Next, we have proposed a hierarchical indexing scheme for multidimensional arrays that overcomes the dimensionality-induced inefficiencies of standard spatial and bitmap indexing techniques on dense multidimensional arrays. The index is based on novel n-dimensional sparse trees for dimension partitioning, with bound number of individual, adaptively binned indices for attribute partitioning. This indexing performs well on queries involving both dimensions and attributes constraints, as it prunes the search space early. Lastly, we have improved query performance of generic similarity search in GENIE (Generic inverted index on GPU) by incorporating compressed inverted index on GPU with data parallel decoding. Multiple decoding schemes were designed, implemented, and evaluated for a fully data parallel decoding and query execution. The implementation has sped up total query processing time in 3-4 times on real world datasets. All the components were integrated into publicly available similarity search framework GENIE in a robust and modular architecture, with configurable query compiler and index management components. The extensions of GENIE were designed for multi-GPU and multi-node distributed deployment with an implementation of the distributed functionality publicly available
Zobrazit/ otevřít
Kolekce
Související záznamy
Zobrazují se záznamy příbuzné na základě názvu, autora a předmětu.
-
Indexy uspořádaných stromů pro podstromy a stromové vzorky a jejich prostorové složitosti
Autor: Poliak Martin; Vedoucí práce: Janoušek Jan; Oponent práce: Pokorný Jaroslav
(České vysoké učení technické v Praze. Vypočetní a informační centrum.Czech Technical University in Prague. Computing and Information Centre., 2018-04-06)This doctoral thesis deals with methods of indexing of a tree for subtrees and for tree patterns. Two types of indexes are considered. The first type is the index of a tree for subtrees, i.e. a full index that accepts all ... -
Vyhledávání CRISPR segmentů využívající self-index
Autor: Cvacho Ondřej; Vedoucí práce: Holub Jan; Oponent práce: Procházka Petr
(České vysoké učení technické v Praze. Vypočetní a informační centrum.Czech Technical University in Prague. Computing and Information Centre., 2016-05-10)Práce se zaměřuje na využití kompaktních datových struktur v hledání CRISPR segmentů za použití self-indexů. Hledání CRISPR segmentů je srovnatelné s přibližným vyhledáváním řetězce za pomoci generování a vyhledávání všech ... -
Analýza paralelních mikroelektrodových záznamů
Autor: Vošmik Jiří; Vedoucí práce: Sieger Tomáš; Oponent práce: Spilka Jiří
(České vysoké učení technické v Praze. Vypočetní a informační centrum.Czech Technical University in Prague. Computing and Information Centre., 2018-01-09)Lidský mozek je jedna z nejkomplikovanějších známých struktur a jako takový je dlouhodobě zkoumán. Většina výzkumu mozku se zatím zabývala výzkumem chování celých populací neuronů. Detailní výzkum chování jednotlivých ...