Researchers boost multilingual hate speech detection by combining web-scale pre-training with LLM-generated synthetic labels across four languages.

arXiv cs.CL · April 14, 2026

AI Summary

  • Continued pre-training on unlabeled OpenWebSearch.eu data improved BERT models by ~3% average macro-F1 across 16 benchmarks, with larger gains in low-resource languages
  • Ensemble of four open-source LLMs (Mistral-7B, Llama3.1-8B, Gemma2-9B, Qwen2.5-14B) generated synthetic annotations for hate speech detection
  • LightGBM meta-learner ensemble outperformed simpler strategies like mean averaging and majority voting for combining LLM predictions
  • Study covers English, German, Spanish, and Vietnamese, demonstrating improved cross-lingual generalization for hateful content detection
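The combination strategies above can be sketched side by side. The data below is synthetic and the names are illustrative; the paper's actual meta-learner is LightGBM, which is swapped here for a tiny hand-rolled logistic regression so the sketch stays dependency-free while showing the same stacking idea: train a small model on the per-LLM probability scores instead of averaging or voting over them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hate-speech probabilities from four LLM annotators
# (standing in for Mistral-7B, Llama3.1-8B, Gemma2-9B, Qwen2.5-14B)
# on 200 texts; values are synthetic, for illustration only.
y = rng.integers(0, 2, size=200)                       # gold labels
noise = rng.normal(0.0, 0.25, size=(200, 4))
llm_probs = np.clip(y[:, None] * 0.6 + 0.2 + noise, 0.0, 1.0)

# Strategy 1: mean averaging of the four probability scores.
mean_pred = (llm_probs.mean(axis=1) >= 0.5).astype(int)

# Strategy 2: majority voting over thresholded per-LLM decisions.
votes = (llm_probs >= 0.5).astype(int)
majority_pred = (votes.sum(axis=1) >= 2).astype(int)

# Strategy 3: a stacked meta-learner trained on the raw scores.
# (The paper uses LightGBM; plain logistic regression via gradient
# descent keeps this sketch free of external dependencies.)
w, b = np.zeros(4), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(llm_probs @ w + b)))
    grad = p - y
    w -= 0.1 * (llm_probs.T @ grad) / len(y)
    b -= 0.1 * grad.mean()
meta_pred = (1.0 / (1.0 + np.exp(-(llm_probs @ w + b))) >= 0.5).astype(int)

for name, pred in [("mean averaging", mean_pred),
                   ("majority voting", majority_pred),
                   ("meta-learner", meta_pred)]:
    print(f"{name}: accuracy {(pred == y).mean():.2f}")
```

The meta-learner can weight annotators unequally (e.g. down-weighting a systematically noisy LLM), which is what lets stacking beat uniform averaging or voting when annotator quality varies.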
