Extraction of company descriptors from web resources

Extraktor informací o firmách z webových zdrojů

Authors

Stanovčák, Tomáš

Supervisors

Kuchař, Jaroslav

Reviewers

Kordík, Pavel

Publisher

České vysoké učení technické v Praze
Czech Technical University in Prague

Files

Full Text (2.87 MB)

Review (42.48 KB)

Review (44.95 KB)

Abstract

This thesis deals with obtaining and processing company data from the companies' own websites. After a survey of extraction approaches and of the company information that is available, a dataset will be prepared in a format suitable for experiments. Both rule-based and machine-learning extraction methods will then be applied to this dataset. The results of the experiments will be evaluated, and the implementations of the individual approaches will be published as a library under a free licence.

Keywords

company, website, extraction, content mining, web scraping, text processing, Python

Permanent link

http://hdl.handle.net/10467/101071

Rights/License

A university thesis is a work protected by the Copyright Act of the Czech Republic. Extracts, copies and transcripts of the thesis may be made for personal use only and at one's own expense. Any use of the thesis must comply with the Copyright Act, as amended.

Collections

Master Theses - 18105
