Out of process byte-code kompilátor pro programovací jazyk R

Adam Plodek

Out of process byte-code compiler for the R programming language

Document type

master's thesis

Author

Adam Plodek

Supervisor

Křikava Filip

Opponent

Krynski Sebastián

Field of study

System Programming

Study programme

Informatics

Degree-granting institution

Department of Theoretical Computer Science

Rights

A university thesis is a work protected by the Copyright Act. Extracts, copies and transcripts of the thesis are allowed for personal use only and at one's own expense. Use of the thesis must comply with the Copyright Act http://www.mkcr.cz/assets/autorske-pravo/01-3982006.pdf and with citation ethics http://knihovny.cvut.cz/vychova/vskp.html
Abstract

R is a dynamic programming language used mainly in statistics and data visualization. Its unusual set of features and extensive ecosystem of packages enable statisticians to write software without needing to be software engineers. GNU R is considered the primary implementation of the language. To speed up the execution of R programs, GNU R added a bytecode interpreter that is used alongside the standard AST interpreter, together with a compiler that translates the AST representation of a program into GNU R bytecode. This thesis explores one possible improvement to this compilation process, namely out-of-process compilation. This approach allows compilers to be implemented in other languages and could open up more possibilities for sharing compiled code between clients. Moreover, the compiler process can run on a different machine from the one running the R interpreter, which makes it possible to move the compilation overhead to a more powerful machine. I describe the development of an experimental implementation of such a solution in the Rust programming language, which can serve as a baseline for future work. To achieve this, I created a custom representation of R values and a serialization format for those values. These were then used to implement the compiler itself and a server that communicates with the interpreter through an R package. Finally, I evaluate the current state of my compilation server. The evaluation is split into two parts: correctness and performance. On both criteria, my implementation is compared against the implementation embedded in the GNU R interpreter. The results show that, in the best case, compilation is up to 20 times faster when only the compilation itself is measured; when loading of the data is included, the speedup is 3 times relative to the GNU R implementation.