Detekce fake news metodami zpracování přirozeného jazyka

Denis Řeháček

Detecting Fake News Using NLP Methods

Document type

Master's thesis

Author

Denis Řeháček

Supervisor

Drchal Jan

Opponent

Šír Gustav

Field of study

Artificial Intelligence

Study programme

Open Informatics

Degree-granting institution

Department of Computer Science

Rights

A university thesis is a work protected by the Copyright Act. Extracts, copies and transcripts of the thesis are allowed for personal use only and at one's own expense. Use of the thesis must comply with the Copyright Act http://www.mkcr.cz/assets/autorske-pravo/01-3982006.pdf and with citation ethics http://knihovny.cvut.cz/vychova/vskp.html

Abstract

This thesis introduces the problem of disinformation in an information-rich world. Fake News detection was addressed as a text classification problem. More than a hundred experiments were conducted to find a suitable combination of pre-processing and an efficient Neural Network architecture, revealing some specifics and limitations of the Fake News detection problem compared to other text classification tasks. An existing Fake News dataset was used, as well as several combinations of self-obtained data. The work is unique in processing news articles in numerous European languages, covering the same topics in both categories: reliable and disinformation news. The best accuracy was achieved by a convolution-based Neural Network, with up to 99.9% correct predictions on the existing dataset and over 98% in most experiments on the smaller self-obtained data, outperforming the Self-attention mechanism. Better results were achieved when using the original texts instead of human-written summaries, even though the second option was trained on a larger dataset. Considering the properties of the datasets (same topics in both classes), the results suggest that there are probably language patterns distinctive for each of the two categories that were lost in the human-written summaries.