
Researchers boost multilingual hate speech detection by combining web-scale pre-training with LLM-generated synthetic labels across four languages.

arXiv cs.CL · April 14, 2026

AI Summary

  • Continued pre-training on unlabeled OpenWebSearch.eu data improved BERT models by ~3% average macro-F1 across 16 benchmarks, with larger gains in low-resource languages
  • Ensemble of four open-source LLMs (Mistral-7B, Llama3.1-8B, Gemma2-9B, Qwen2.5-14B) generated synthetic annotations for hate speech detection
  • LightGBM meta-learner ensemble outperformed simpler strategies like mean averaging and majority voting for combining LLM predictions
  • Study covers English, German, Spanish, and Vietnamese, demonstrating improved cross-lingual generalization in hateful-content detection
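The combination step in the last two bullets amounts to a stacking ensemble: each LLM's per-example "hate" probability becomes a feature, and a meta-learner is trained on top of those features. The minimal sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in for LightGBM; the labels, annotator reliabilities, and scores are all synthetic and illustrative, not taken from the paper.

```python
# Sketch of a stacking ensemble over LLM annotator scores.
# GradientBoostingClassifier stands in for the paper's LightGBM
# meta-learner; all data below is synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, n)  # gold hate / not-hate labels

# Simulated "hate" probabilities from four LLM annotators with
# different reliabilities (higher strength = closer to the gold label).
strengths = np.array([0.8, 0.6, 0.4, 0.1])
llm_probs = strengths * y[:, None] + (1 - strengths) * rng.random((n, 4))

split = n // 2  # first half trains the meta-learner, second half evaluates

# Baseline: mean averaging of the four scores, thresholded at 0.5.
mean_pred = llm_probs[split:].mean(axis=1) > 0.5
mean_acc = (mean_pred == y[split:]).mean()

# Meta-learner: can learn per-annotator reliability instead of
# weighting all four scores equally.
meta = GradientBoostingClassifier(n_estimators=50, random_state=0)
meta.fit(llm_probs[:split], y[:split])
meta_acc = (meta.predict(llm_probs[split:]) == y[split:]).mean()

print(f"mean averaging: {mean_acc:.3f}  meta-learner: {meta_acc:.3f}")
```

On this toy data both strategies do well; the meta-learner's reported advantage comes from learning which annotators to trust, which plain averaging and majority voting cannot do.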
