Randomizované indexy pro přibližné vyhledávání v multidimenzionálních polích
Randomized Indexing for Approximate Selection Queries on Multidimensional Arrays
dc.contributor.advisor | Holub Jan | |
dc.contributor.author | Luboš Krčál | |
dc.date.accessioned | 2022-10-14T13:19:19Z | |
dc.date.available | 2022-10-14T13:19:19Z | |
dc.date.issued | 2022-08-31 | |
dc.identifier | KOS-577253371205 | |
dc.identifier.uri | http://hdl.handle.net/10467/104474 | |
dc.description.abstract | Multidimensional data, whether in the form of dense arrays or sparse relational data, are a common structure for the effective storage, access, management, querying, dissemination, analysis, and visualization of scientific datasets. Array data are used in many scientific domains, including computational fluid dynamics, oceanography, spatiotemporal climate analysis, forecasting, and medical, biomedical, astronomical, and satellite data processing. Efficient processing of high-dimensional data is difficult due to its arbitrary size and cardinality and the so-called curse of dimensionality. Bitmap indices are widely used in commercial databases for processing complex queries, owing to their space efficiency and their efficient use of hardware-accelerated bitwise operations. Compressed, hierarchical, multi-component bitmap indices have also been used for relational data. Inverted indexing is another technique commonly used in a variety of high-dimensional data applications, such as exact search, similarity search, and machine learning. An inverted index maps multidimensional data points into lists based on some discretization of their dimension values. Similarly, a column index of a relational database may map individual column values to lists of corresponding rows. To evaluate a query, the inverted lists are usually intersected to obtain the list of points satisfying all the constraints. Our interest is in a more general approach, in which each query evaluates similarity based on the number of matched dimensions. In this work, we have designed, implemented, and evaluated two methods of indexing multidimensional array data for selection queries, and an extension of a data-parallel inverted index as part of a generic framework for similarity search on the GPU.
The individual contributions are as follows. First, for the efficient execution of various spatiotemporal selection queries in large distributed array databases, we have designed a multidimensional array inverted index based on grid transformations. We demonstrate the efficiency of this index on a complete, large-scale satellite dataset. The work was implemented and integrated as an extension of SciDB, a distributed open-source array database. Next, we have proposed a hierarchical indexing scheme for multidimensional arrays that overcomes the dimensionality-induced inefficiencies of standard spatial and bitmap indexing techniques on dense multidimensional arrays. The index is based on novel n-dimensional sparse trees for dimension partitioning, with a bounded number of individual, adaptively binned indices for attribute partitioning. This index performs well on queries involving both dimension and attribute constraints, as it prunes the search space early. Lastly, we have improved the query performance of generic similarity search in GENIE (Generic Inverted Index on GPU) by incorporating a compressed inverted index on the GPU with data-parallel decoding. Multiple decoding schemes were designed, implemented, and evaluated for fully data-parallel decoding and query execution. The implementation sped up total query processing by a factor of 3 to 4 on real-world datasets. All components were integrated into the publicly available similarity search framework GENIE in a robust and modular architecture, with configurable query compiler and index management components. The extensions of GENIE were designed for multi-GPU and multi-node distributed deployment, and an implementation of the distributed functionality is publicly available. | eng |
dc.publisher | České vysoké učení technické v Praze. Výpočetní a informační centrum. | cze |
dc.publisher | Czech Technical University in Prague. Computing and Information Centre. | eng |
dc.rights | A university thesis is a work protected by the Copyright Act. Extracts, copies and transcripts of the thesis are allowed for personal use only and at one's own expense. The use of the thesis should be in compliance with the Copyright Act http://www.mkcr.cz/assets/autorske-pravo/01-3982006.pdf and the citation ethics http://knihovny.cvut.cz/vychova/vskp.html | eng |
dc.rights | Vysokoškolská závěrečná práce je dílo chráněné autorským zákonem. Je možné pořizovat z něj na své náklady a pro svoji osobní potřebu výpisy, opisy a rozmnoženiny. Jeho využití musí být v souladu s autorským zákonem http://www.mkcr.cz/assets/autorske-pravo/01-3982006.pdf a citační etikou http://knihovny.cvut.cz/vychova/vskp.html | cze |
dc.subject | Multidimensional Arrays | eng |
dc.subject | Array Database | eng |
dc.subject | SciDB | eng |
dc.subject | Scientific Computing | eng |
dc.subject | Bitmap Index | eng |
dc.subject | Inverted Index | eng |
dc.subject | Similarity Search | eng |
dc.subject | Approximate Nearest Neighbors | eng |
dc.subject | GPU Accelerated Database | eng |
dc.subject | Index Compression | eng |
dc.subject | Data Parallel Decoding | eng |
dc.subject | GENIE | eng |
dc.title | Randomizované indexy pro přibližné vyhledávání v multidimenzionálních polích | cze |
dc.title | Randomized Indexing for Approximate Selection Queries on Multidimensional Arrays | eng |
dc.type | disertační práce | cze |
dc.type | doctoral thesis | eng |
dc.contributor.referee | Krátký Michal | |
theses.degree.discipline | Informatics (Informatika) | cze |
theses.degree.grantor | Department of Theoretical Computer Science (katedra teoretické informatiky) | cze |
theses.degree.programme | Informatics (Informatika) | cze |