Randomizované indexy pro přibližné vyhledávání v multidimenzionálních polích

Luboš Krčál

Randomized Indexing for Approximate Selection Queries on Multidimensional Arrays

dc.contributor.advisor	Holub Jan
dc.contributor.author	Luboš Krčál
dc.date.accessioned	2022-10-14T13:19:19Z
dc.date.available	2022-10-14T13:19:19Z
dc.date.issued	2022-08-31
dc.identifier	KOS-577253371205
dc.identifier.uri	http://hdl.handle.net/10467/104474
dc.description.abstract	Multidimensional data, in the form of either dense arrays or sparse relational data, are a common data structure for the effective storage, access, management, querying, dissemination, analysis, and visualization of scientific datasets. Array data are used in many scientific domains, including computational fluid dynamics, oceanography, spatiotemporal climate analysis, forecasting, and medical, biomedical, astronomical and satellite data processing. Efficient processing of high-dimensional data is difficult due to the arbitrary size, cardinality, and the so-called curse of dimensionality. Bitmap indices are widely used in commercial databases for processing complex queries, due to their efficient use of hardware-accelerated bitwise operations and their space efficiency. Compressed, hierarchical, multi-component bitmap indices have also been used for relational data. Inverted indexing is another commonly used technique in a variety of high-dimensional data applications, such as exact search, similarity search, or machine learning. An inverted index maps multidimensional data points into lists based on some discretization of their dimension values. Similarly, a column index of a relational database may map individual column values to lists of corresponding rows. To evaluate a query, inverted lists are usually intersected to obtain a list of points satisfying all the constraints. Our interest is in a more generalized approach, where each query evaluates similarity based on the number of matched dimensions. In this work, we have designed, implemented and evaluated two methods of indexing multidimensional array data for selection queries, and an extension of a data-parallel inverted index as part of a generic framework for similarity search on the GPU.
The following is a list of the individual contributions. For the purpose of efficient execution of various spatiotemporal selection queries in large distributed array databases, we have designed a multidimensional array inverted index based on grid transformations. We demonstrate the efficiency of our multidimensional array index on a complete, large-scale satellite dataset. The work was implemented and integrated as an extension of the distributed open-source array database SciDB. Next, we have proposed a hierarchical indexing scheme for multidimensional arrays that overcomes the dimensionality-induced inefficiencies of standard spatial and bitmap indexing techniques on dense multidimensional arrays. The index is based on novel n-dimensional sparse trees for dimension partitioning, with a bounded number of individual, adaptively binned indices for attribute partitioning. This indexing performs well on queries involving both dimension and attribute constraints, as it prunes the search space early. Lastly, we have improved the query performance of generic similarity search in GENIE (Generic inverted index on GPU) by incorporating a compressed inverted index on the GPU with data-parallel decoding. Multiple decoding schemes were designed, implemented, and evaluated for fully data-parallel decoding and query execution. The implementation has sped up total query processing time by a factor of 3-4 on real-world datasets. All the components were integrated into the publicly available similarity search framework GENIE in a robust and modular architecture, with configurable query compiler and index management components. The extensions of GENIE were designed for multi-GPU and multi-node distributed deployment, with an implementation of the distributed functionality publicly available.	eng
dc.publisher	České vysoké učení technické v Praze. Výpočetní a informační centrum.	cze
dc.publisher	Czech Technical University in Prague. Computing and Information Centre.	eng
dc.rights	A university thesis is a work protected by the Copyright Act. Extracts, copies and transcripts of the thesis are allowed for personal use only and at one's own expense. The use of thesis should be in compliance with the Copyright Act http://www.mkcr.cz/assets/autorske-pravo/01-3982006.pdf and the citation ethics http://knihovny.cvut.cz/vychova/vskp.html	eng
dc.rights	Vysokoškolská závěrečná práce je dílo chráněné autorským zákonem. Je možné pořizovat z něj na své náklady a pro svoji osobní potřebu výpisy, opisy a rozmnoženiny. Jeho využití musí být v souladu s autorským zákonem http://www.mkcr.cz/assets/autorske-pravo/01-3982006.pdf a citační etikou http://knihovny.cvut.cz/vychova/vskp.html	cze
dc.subject	Multidimensional Arrays	cze
dc.subject	Array Database	cze
dc.subject	SciDB	cze
dc.subject	Scientific Computing	cze
dc.subject	Bitmap Index	cze
dc.subject	Inverted Index	cze
dc.subject	Similarity Search	cze
dc.subject	Approximate Nearest Neighbors	cze
dc.subject	GPU Accelerated Database	cze
dc.subject	Index Compression	cze
dc.subject	Data Parallel Decoding	cze
dc.subject	GENIE	cze
dc.subject	Multidimensional Arrays	eng
dc.subject	Array Database	eng
dc.subject	SciDB	eng
dc.subject	Scientific Computing	eng
dc.subject	Bitmap Index	eng
dc.subject	Inverted Index	eng
dc.subject	Similarity Search	eng
dc.subject	Approximate Nearest Neighbors	eng
dc.subject	GPU Accelerated Database	eng
dc.subject	Index Compression	eng
dc.subject	Data Parallel Decoding	eng
dc.subject	GENIE	eng
dc.title	Randomizované indexy pro přibližné vyhledávání v multidimenzionálních polích	cze
dc.title	Randomized Indexing for Approximate Selection Queries on Multidimensional Arrays	eng
dc.type	disertační práce	cze
dc.type	doctoral thesis	eng
dc.contributor.referee	Krátký Michal
theses.degree.discipline	Informatika	cze
theses.degree.grantor	katedra teoretické informatiky	cze
theses.degree.programme	Informatika	cze
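
Illustrative note on the abstract: the inverted-index idea it describes (points mapped to lists by a discretization of their dimension values, queries answered either by intersecting lists or by counting matched dimensions) can be sketched roughly as follows. This is only a minimal CPU sketch under stated assumptions, not the thesis's SciDB or GENIE implementation: the fixed-width binning, the function names, and the `bin_width` parameter are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def build_inverted_index(points, bin_width):
    """Map each (dimension, bin) key to the ids of points falling in that bin.

    Fixed-width binning is an assumption of this sketch; any discretization
    of the dimension values would serve the same role.
    """
    index = defaultdict(list)
    for point_id, point in enumerate(points):
        for dim, value in enumerate(point):
            index[(dim, int(value // bin_width))].append(point_id)
    return index


def conjunctive_query(index, query, bin_width):
    """Exact-style evaluation: intersect the inverted lists of all dimensions."""
    lists = [
        set(index.get((dim, int(value // bin_width)), []))
        for dim, value in enumerate(query)
    ]
    return sorted(set.intersection(*lists)) if lists else []


def match_count_query(index, query, bin_width, top_k):
    """Generalized evaluation: rank points by the number of matched dimensions."""
    counts = Counter()
    for dim, value in enumerate(query):
        for point_id in index.get((dim, int(value // bin_width)), []):
            counts[point_id] += 1
    return counts.most_common(top_k)


points = [(1.0, 5.0, 9.0), (1.5, 5.2, 3.0), (7.0, 8.0, 9.5)]
index = build_inverted_index(points, bin_width=2.0)
exact = conjunctive_query(index, (1.2, 5.1, 9.9), bin_width=2.0)
ranked = match_count_query(index, (1.2, 5.1, 9.9), bin_width=2.0, top_k=2)
```

In this toy example the intersection keeps only the point that matches in every dimension, while the match-count variant also returns the point that matches in two of the three dimensions, which is the relaxation the abstract refers to.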

Files in this item

Name: F8-D-2021-Krcal-Lubos-disserta ...
Size: 1.383Mb
Format: PDF
Description: PLNY_TEXT

This item appears in the following collections

Doctoral theses - 18000 [50]
