Researchers introduce C-Mining, an unsupervised method to automatically discover cultural data seeds for LLMs by measuring cross-lingual embedding misalignment.
arXiv cs.CL · April 20, 2026
AI Summary
• C-Mining addresses the 'quantification gap' in cultural seed selection by converting subjective curation into a measurable data mining problem
• The framework uses the geometric misalignment of cultural concepts across pre-trained embedding spaces as a quantifiable discovery signal
• The approach identifies regions of embedding space with pronounced linguistic exclusivity, with the aim of improving cultural alignment in Large Language Models
• It replaces manual curation and bias-prone LLM extraction with an unsupervised, scalable, automated process
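
To make the "geometric misalignment" idea concrete, here is a minimal sketch in Python. The summary does not say how C-Mining actually scores misalignment, so this is not the paper's algorithm. It substitutes a simple, well-known proxy: comparing each concept's nearest-neighbour set in two embedding spaces. The function names, the toy 2-D vectors, and the choice of `k` are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbours(vectors, i, k):
    """Indices of the k concepts most similar to concept i within one space."""
    sims = [(cosine(vectors[i], v), j) for j, v in enumerate(vectors) if j != i]
    sims.sort(key=lambda t: (-t[0], t[1]))  # highest similarity first, index breaks ties
    return {j for _, j in sims[:k]}

def misalignment_scores(space_a, space_b, k=2):
    """Hypothetical proxy score: 1 - Jaccard overlap of neighbour sets.

    space_a[i] and space_b[i] embed the same concept i in two languages.
    A score near 0 means the concept keeps the same neighbours in both
    spaces; a score near 1 means its neighbourhood differs completely.
    """
    if len(space_a) != len(space_b):
        raise ValueError("both spaces must embed the same list of concepts")
    scores = []
    for i in range(len(space_a)):
        na = nearest_neighbours(space_a, i, k)
        nb = nearest_neighbours(space_b, i, k)
        scores.append(1.0 - len(na & nb) / len(na | nb))
    return scores

def unit(deg):
    """Toy 2-D unit vector at the given angle (illustrative data only)."""
    rad = math.radians(deg)
    return (math.cos(rad), math.sin(rad))

# Toy data: space B is space A rotated by 40 degrees, except concept 6,
# which sits near a different cluster in B than it does in A.
space_a = [unit(d) for d in (0, 5, 10, 90, 95, 100, 120)]
space_b = [unit(d) for d in (40, 45, 50, 130, 135, 140, 20)]

scores = misalignment_scores(space_a, space_b, k=2)
candidate = max(range(len(scores)), key=scores.__getitem__)
print(candidate, scores)
```

In this toy setup only concept 6 changes neighbourhood between the two spaces, so it receives the highest score and would be the candidate seed. Neighbour overlap is used here because it needs no explicit mapping between the two spaces. A real pipeline would use pre-trained multilingual embeddings and a far larger vocabulary.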