
Randomized Indexing for Approximate Selection Queries on Multidimensional Arrays



dc.contributor.advisor: Holub Jan
dc.contributor.author: Luboš Krčál
dc.date.accessioned: 2022-10-14T13:19:19Z
dc.date.available: 2022-10-14T13:19:19Z
dc.date.issued: 2022-08-31
dc.identifier: KOS-577253371205
dc.identifier.uri: http://hdl.handle.net/10467/104474
dc.description.abstract: Multidimensional data, whether stored as dense arrays or as sparse relational data, are a common structure for the effective storage, access, management, querying, dissemination, analysis, and visualization of scientific datasets. Array data are used in many scientific domains, including computational fluid dynamics, oceanography, spatiotemporal climate analysis, forecasting, and medical, biomedical, astronomical, and satellite data processing. Efficient processing of high-dimensional data is difficult because of its arbitrary size and cardinality and the so-called curse of dimensionality. Bitmap indices are widely used in commercial databases for processing complex queries because they make efficient use of hardware-accelerated bitwise operations and are space-efficient. Compressed, hierarchical, multi-component bitmap indices have also been used for relational data. Inverted indexing is another technique commonly used in a variety of high-dimensional data applications, such as exact search, similarity search, and machine learning. An inverted index maps multidimensional data points into lists based on some discretization of their dimension values. Similarly, a column index of a relational database may map individual column values to lists of corresponding rows. To evaluate a query, inverted lists are usually intersected to obtain a list of points satisfying all the constraints. Our interest is in a more generalized approach, where each query evaluates similarity based on the number of matched dimensions. In this work, we have designed, implemented, and evaluated two methods of indexing multidimensional array data for selection queries, and an extension of a data-parallel inverted index as part of a generic framework for similarity search on the GPU.
The individual contributions are as follows. First, for the efficient execution of various spatiotemporal selection queries in large distributed array databases, we have designed a multidimensional array inverted index based on grid transformations. We demonstrate the efficiency of this index on a complete, large-scale satellite dataset. The work was implemented and integrated as an extension of SciDB, a distributed open-source array database. Next, we have proposed a hierarchical indexing scheme for multidimensional arrays that overcomes the dimensionality-induced inefficiencies of standard spatial and bitmap indexing techniques on dense multidimensional arrays. The index is based on novel n-dimensional sparse trees for dimension partitioning, with a bounded number of individual, adaptively binned indices for attribute partitioning. This indexing performs well on queries involving both dimension and attribute constraints, as it prunes the search space early. Lastly, we have improved the query performance of generic similarity search in GENIE (Generic inverted index on GPU) by incorporating a compressed inverted index on the GPU with data-parallel decoding. Multiple decoding schemes were designed, implemented, and evaluated for fully data-parallel decoding and query execution. The implementation has reduced total query processing time by a factor of 3 to 4 on real-world datasets. All the components were integrated into the publicly available similarity search framework GENIE in a robust and modular architecture, with configurable query compiler and index management components. The extensions of GENIE were designed for multi-GPU and multi-node distributed deployment, with an implementation of the distributed functionality publicly available. [eng]
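The match-count idea described in the abstract can be illustrated with a minimal CPU-only sketch. It is not the thesis's or GENIE's implementation: the function names `build_inverted_index` and `match_count_query`, the uniform `bin_width` discretization, and the sample points are all assumptions made for illustration. Each point is placed in one inverted list per dimension, and a query ranks points by how many of its dimension bins they share instead of intersecting the lists.

```python
from collections import Counter, defaultdict


def build_inverted_index(points, bin_width):
    """Map each (dimension, bin) key to the list of ids of points in that bin.

    Uniform binning by `bin_width` is an assumed discretization.
    """
    index = defaultdict(list)
    for point_id, point in enumerate(points):
        for dim, value in enumerate(point):
            index[(dim, int(value // bin_width))].append(point_id)
    return index


def match_count_query(index, query, bin_width, top_k):
    """Rank points by the number of dimensions whose bin matches the query."""
    counts = Counter()
    for dim, value in enumerate(query):
        key = (dim, int(value // bin_width))
        for point_id in index.get(key, []):
            counts[point_id] += 1
    return counts.most_common(top_k)


# Hypothetical three-dimensional sample data.
points = [(1.0, 5.0, 9.0), (1.2, 5.5, 2.0), (7.0, 0.5, 9.4)]
index = build_inverted_index(points, bin_width=1.0)
result = match_count_query(index, (1.1, 5.2, 9.3), bin_width=1.0, top_k=2)

# Classic intersection semantics is the special case "all dimensions matched".
exact = [pid for pid, count in result if count == len(points[0])]
```

Here point 0 matches the query in all three dimensions and point 1 in two, so a strict intersection would return only point 0, while the match-count ranking also surfaces point 1 as a near match.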
dc.publisher: České vysoké učení technické v Praze. Výpočetní a informační centrum. [cze]
dc.publisher: Czech Technical University in Prague. Computing and Information Centre. [eng]
dc.rights: A university thesis is a work protected by the Copyright Act. Extracts, copies, and transcripts of the thesis are allowed for personal use only and at one's own expense. Use of the thesis must comply with the Copyright Act (http://www.mkcr.cz/assets/autorske-pravo/01-3982006.pdf) and with citation ethics (http://knihovny.cvut.cz/vychova/vskp.html). [eng]
dc.subject: Multidimensional Arrays [eng]
dc.subject: Array Database [eng]
dc.subject: SciDB [eng]
dc.subject: Scientific Computing [eng]
dc.subject: Bitmap Index [eng]
dc.subject: Inverted Index [eng]
dc.subject: Similarity Search [eng]
dc.subject: Approximate Nearest Neighbors [eng]
dc.subject: GPU Accelerated Database [eng]
dc.subject: Index Compression [eng]
dc.subject: Data Parallel Decoding [eng]
dc.subject: GENIE [eng]
dc.title: Randomizované indexy pro přibližné vyhledávání v multidimenzionálních polích [cze]
dc.title: Randomized Indexing for Approximate Selection Queries on Multidimensional Arrays [eng]
dc.type: doctoral thesis [eng]
dc.contributor.referee: Krátký Michal
theses.degree.discipline: Informatics
theses.degree.grantor: Department of Theoretical Computer Science
theses.degree.programme: Informatics

