Scene text localization and recognition in images and videos
Type of document
Dissertation
Author
Neumann, Lukáš
Supervisor
Matas, Jiří
Field of study
Artificial Intelligence and Biocybernetics
Study program
Electrical Engineering and Information Technology
Institutions assigning rank
České vysoké učení technické v Praze. Fakulta elektrotechnická. Katedra kybernetiky
Abstract
Scene Text Localization and Recognition methods find all areas in an image or a video
that would be considered as text by a human, mark the boundaries of the areas and output
a sequence of characters associated with the content of each area. They are used to process images
and videos taken by a digital camera or a mobile phone and to "read" the content of
each text area into a digital format, typically a list of Unicode character sequences, that
can be processed in further applications.
Three different methods for Scene Text Localization and Recognition were proposed
in the course of the research, each one advancing the state of the art and improving the
accuracy. The first method detects individual characters as Extremal Regions (ER).
The probability of each ER being a character is estimated using novel features
with O(1) complexity, and only ERs with locally maximal probability are selected across
several image projections for the second stage, where the classification is improved using
more computationally expensive features. The method was the first published method
to address the complete problem of scene text localization and recognition as a whole;
all previous work in the literature focused solely on different subproblems.
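The two ideas in the first stage, constant-time feature updates and selection of locally maximal probabilities, can be illustrated with a minimal Python sketch. It is not the method from the thesis: the abstract does not name the features, so `RegionStats` (area and bounding box), `select_local_maxima` and the `min_prob` cutoff are hypothetical stand-ins chosen only because they can be updated in O(1) when two regions merge.

```python
from dataclasses import dataclass


@dataclass
class RegionStats:
    """Hypothetical region descriptors that can be merged in constant time."""
    area: int
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    @classmethod
    def from_pixel(cls, x, y):
        # A single pixel is the smallest possible region.
        return cls(1, x, y, x, y)

    def merge(self, other):
        # O(1): only a fixed number of additions and comparisons,
        # independent of how many pixels the two regions contain.
        return RegionStats(
            self.area + other.area,
            min(self.x_min, other.x_min),
            min(self.y_min, other.y_min),
            max(self.x_max, other.x_max),
            max(self.y_max, other.y_max),
        )

    def aspect_ratio(self):
        width = self.x_max - self.x_min + 1
        height = self.y_max - self.y_min + 1
        return width / height


def select_local_maxima(probs, min_prob=0.5):
    """Return indices of strict local maxima of `probs` that reach `min_prob`.

    `probs[i]` is an assumed character probability of the region obtained
    at the i-th threshold along one chain of nested regions.
    """
    selected = []
    for i, p in enumerate(probs):
        left = probs[i - 1] if i > 0 else float("-inf")
        right = probs[i + 1] if i + 1 < len(probs) else float("-inf")
        if p >= min_prob and p > left and p > right:
            selected.append(i)
    return selected
```

In this sketch only the regions at the returned indices would be passed on to a second, more expensive classifier; every other threshold along the chain is discarded.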
Secondly, a novel easy-to-implement stroke detector was proposed. The detector is
significantly faster and produces significantly fewer false detections than the commonly
used ER detector. The detector efficiently produces character stroke segmentations,
which are exploited in a subsequent classification phase based on features effectively
calculated as part of the segmentation process. Additionally, an efficient text clustering
algorithm based on text direction voting is proposed, which, like the previous
stages, is scale- and rotation-invariant and supports a wide variety of scripts and fonts.
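The general idea of direction voting can be sketched as follows. This is a generic Hough-style illustration under stated assumptions, not the clustering algorithm of the thesis: the function name `dominant_direction`, the use of character centroids as input and the `bin_deg` bin width are all hypothetical. Every pair of centroids votes for the orientation of the line joining them, and the most voted orientation bin is taken as the text direction.

```python
import math
from collections import Counter
from itertools import combinations


def dominant_direction(centroids, bin_deg=10):
    """Return the lower edge (in degrees) of the most voted orientation bin.

    `centroids` is a list of (x, y) tuples; `bin_deg` is assumed to divide 180.
    Returns None when fewer than two centroids are given.
    """
    n_bins = 180 // bin_deg
    votes = Counter()
    for (x1, y1), (x2, y2) in combinations(centroids, 2):
        # Orientation is taken modulo 180 degrees, so a line and its
        # reverse vote for the same bin.
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0
        # The extra modulo guards against an angle rounding to exactly 180.
        votes[int(angle // bin_deg) % n_bins] += 1
    if not votes:
        return None
    best_bin, _ = votes.most_common(1)[0]
    return best_bin * bin_deg
```

Because only relative positions of centroids enter the vote, the result in this sketch does not change when all coordinates are multiplied by a common scale factor, and rotating the input shifts the winning bin by the rotation angle.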
The third method exploits a deep-learning model, which is trained for both text
detection and recognition in a single trainable pipeline. The method localizes and
recognizes text in an image in a single feed-forward pass. It is trained purely on synthetic
data, so it does not require obtaining expensive human annotations for training, and it
achieves state-of-the-art accuracy in end-to-end text recognition on two standard
datasets, whilst being an order of magnitude faster than the previous methods: the
whole pipeline runs at 10 frames per second.