Developing and Utilizing a Large-Scale Cantonese Dataset for Multi-Tasking in Large Language Models
Nov 1, 2025
Jiyue Jiang
Alfred Kar Yin Truong
Yanyu Chen
Qinghang Bao
Sheng Wang
Pengan Chen
Jiuming Wang
Lingpeng Kong
Yu Li
Chuan Wu
Abstract
This paper presents the development and utilization of a large-scale Cantonese dataset designed for multi-tasking in large language models, addressing the scarcity of high-quality Cantonese resources for training and evaluating LLMs.
Type
Publication
In The 2025 Conference on Empirical Methods in Natural Language Processing
Authors
Sheng Wang
(Forence)
PhD Graduate in Computer Science
Sheng Wang is a PhD graduate from The University of Hong Kong, supervised by Prof. Chuan Wu and Prof. Lingpeng Kong.
His research focuses on agents, LLM super-alignment, and data synthesis. He has published 14+ papers at top-tier
conferences, including NeurIPS 2025 (Spotlight), ICLR 2025, ACL 2024/2025, and EMNLP 2025.