Annotated Slovak Datasets for Toxicity, Hate Speech, and Sentiment Analysis

Authors

Zuzana Sokolová
Maroš Harahus
Daniel Hládek
Ján Staš

DOI:

https://doi.org/10.2478/jazcas-2025-0025

Keywords:

datasets, hate speech, natural language processing, sentiment analysis, Slovak language, toxic language

Abstract

The rise of social media has led to an increase in toxic language, hate speech, and offensive content. While extensive research exists for widely spoken languages like English, Slovak remains underrepresented due to the lack of high-quality datasets. This gap limits the development of effective models for toxicity detection and sentiment analysis in Slovak. To address this, we introduce three new annotated Slovak datasets focused on toxic language, offensive language, hate speech detection, and sentiment analysis. These native datasets provide a more reliable foundation for automated moderation compared to machine-translated alternatives. Our research also highlights the real-world impact of online toxicity, including social polarization and psychological distress, emphasizing the need for proactive detection systems on social media platforms. This paper reviews existing Slovak datasets, presents our newly developed resources, and provides a comparative analysis. Finally, we outline key contributions and suggest future directions for improving toxic language detection in Slovak.

Published

31-03-2025

Issue

Vol. 76 No. 1 (2025): Jazykovedný časopis

Section

Studies

License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

How to Cite

Annotated Slovak Datasets for Toxicity, Hate Speech, and Sentiment Analysis. (2025). Jazykovedný časopis, 76(1), 279-289. https://doi.org/10.2478/jazcas-2025-0025
