With the support of the Industrial Support Fund of the Hong Kong SAR Government, a series of large-scale Cantonese spoken language corpora, named CUCorpora, was developed at the DSP and Speech Technology Laboratory, Department of Electronic Engineering, CUHK. The corpora provide the essential infrastructure for the advancement of Cantonese spoken language technology. CUCorpora contain read-style Cantonese speech recorded with a high-quality microphone in a quiet environment. The total amount of data is about 70 hours. Transcriptions are provided in the form of Chinese characters and Cantonese phonemic symbols.
According to the intended applications, CUCorpora are divided into five parts:
More information about CUCorpora can be found in:
Tan Lee, W.K. Lo, P.C. Ching and Helen MENG, “Spoken language resources for Cantonese speech processing,” in Speech Communication, Vol. 36, No. 3-4, pp. 327–342, March 2002.
The following are some samples of CUCorpora for preview. Download them to see whether they meet your requirements. By downloading these files, you acknowledge that the copyright of the data belongs to CUHK and agree that the downloaded data will be used solely for preview purposes.
cxsearch is a tool for locating files in CUCorpora from the accompanying transcription files. It accepts a regular expression as the search pattern. For details, see the readme file.
You may download cxsearch here.
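To give a rough idea of what a regular-expression search over transcriptions looks like, here is a minimal Python sketch. It is not cxsearch itself and does not reflect its actual options or the real CUCorpora file layout: the function name `search_transcriptions`, the tab-separated "file ID, transcription" line format, and the sample entries are all assumptions made up for illustration. Please refer to the cxsearch readme file for the actual usage.

```python
import re
from typing import Iterable, List


def search_transcriptions(lines: Iterable[str], pattern: str) -> List[str]:
    """Return the file IDs whose transcription matches the regular expression.

    Assumed (hypothetical) line layout: "<file_id><TAB><transcription>".
    """
    regex = re.compile(pattern)
    matches = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        file_id, sep, text = line.partition("\t")
        # Keep the entry only if a tab was present and the text matches.
        if sep and regex.search(text):
            matches.append(file_id)
    return matches


# Made-up sample entries, not taken from CUCorpora.
sample = [
    "utt0001\tnei5 hou2",
    "utt0002\tzou2 san4",
    "utt0003\tnei5 hou2 maa3",
]

# Find utterances whose transcription starts with the syllable "nei5".
print(search_transcriptions(sample, r"^nei5\b"))
```

In this sketch the pattern is matched against the transcription text only, so the file ID never influences the result.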
CUCorpora are now available for licensing. Click here for prices of CUCorpora. There are three different types of licenses, for commercial use, industrial research, and academic research, respectively: