Sociopath: Automatic Local Events Extractor
Sociopath: Automatic Extraction of Information about Cultural Events
Publisher
České vysoké učení technické v Praze
Czech Technical University in Prague
Abstract
The Internet is a large data source that is mostly unstructured from a semantic point of view. Despite many attempts to unify how information is presented, there is still no general format for it. A computer program can easily read a Web page as HTML code, but it is hard for it to understand the meaning and extract the semantic structure, which makes automatic information extraction a challenging problem. Automatic extraction of information from Web pages is a common task in data mining. It is used in many modern services and depends strongly on the structure of the Web page and the properties of the content itself. This thesis focuses on extracting information about local social events from the Web. Social events include cultural events, sports events, and other activities. One of the biggest problems in the Web extraction field is collecting training data. In this thesis, we present an approach that uses Microdata semantic markup to automatically collect a labeled training dataset. We built a system that automatically collects training samples with comprehensive features, including visual, textual, spatial, and DOM-related ones. The thesis also covers techniques for data processing and cleaning, and for building a classification model for each extracted event component.
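The idea of using Microdata markup as a source of labels can be illustrated with a minimal sketch. It is not the system described in the thesis: it assumes well-formed HTML, covers only textual content (no visual, spatial, or DOM-related features), and the names `MicrodataLabelCollector` and `collect_labeled_samples`, as well as the sample page, are hypothetical. Each element carrying an `itemprop` attribute yields one training sample whose label is the property name.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and therefore no text content.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MicrodataLabelCollector(HTMLParser):
    """Collect (label, text) pairs from elements with an itemprop attribute.

    Hypothetical helper for illustration; assumes properly nested tags.
    """

    def __init__(self):
        super().__init__()
        self.stack = []    # one [itemprop or None, text parts] entry per open element
        self.samples = []  # collected (itemprop, text) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if tag in VOID_TAGS:
            # <meta itemprop="..." content="..."> carries its value in an attribute.
            if prop and attrs.get("content"):
                self.samples.append((prop, attrs["content"]))
            return
        self.stack.append([prop, []])

    def handle_data(self, data):
        # Text belongs to every enclosing element that carries a label.
        for prop, parts in self.stack:
            if prop:
                parts.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        prop, parts = self.stack.pop()
        if prop:
            text = " ".join("".join(parts).split())  # normalize whitespace
            if text:
                self.samples.append((prop, text))


def collect_labeled_samples(html):
    """Return a list of (itemprop, text) pairs found in the given HTML string."""
    parser = MicrodataLabelCollector()
    parser.feed(html)
    parser.close()
    return parser.samples


# Hypothetical event page fragment annotated with schema.org Microdata.
page = """
<div itemscope itemtype="https://schema.org/Event">
  <h1 itemprop="name">Jazz Night</h1>
  <meta itemprop="startDate" content="2017-05-01T19:00">
  <span itemprop="location">Club  Blue</span>
</div>
"""

samples = collect_labeled_samples(page)
for label, text in samples:
    print(label, "->", text)
```

In a fuller pipeline, each labeled sample would be paired with additional features of its element before training a per-component classifier; how the thesis computes those features is not specified here.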
Rights/License
A university thesis is a work protected by the Copyright Act of the Czech Republic. Extracts, copies, and transcripts of the thesis are allowed for personal use only and at one's own expense. Use of the thesis must comply with the Copyright Act.