Developing and Utilizing a Large-Scale Cantonese Dataset for Multi-Tasking in Large Language Models

Nov 1, 2025
Jiyue Jiang, Alfred Kar Yin Truong, Yanyu Chen, Qinghang Bao, Sheng Wang, Pengan Chen, Jiuming Wang, Lingpeng Kong, Yu Li, Chuan Wu
Abstract
This paper presents the development and utilization of a large-scale Cantonese dataset designed for multi-tasking in large language models, addressing the scarcity of high-quality Cantonese resources for training and evaluating LLMs.
Type
Publication
In The 2025 Conference on Empirical Methods in Natural Language Processing
Authors
Sheng Wang (Forence)
PhD Graduate in Computer Science
Sheng Wang is a PhD graduate from The University of Hong Kong, supervised by Prof. Chuan Wu and Prof. Lingpeng Kong. His research focuses on agents, LLM super-alignment, and data synthesis. He has published 14+ papers in top-tier conferences, including NeurIPS 2025 (Spotlight), ICLR 2025, ACL 2024/2025, and EMNLP 2025.