With the support of the Industrial Support Fund of the Hong Kong SAR Government, a series of large-scale Cantonese spoken language corpora, named CUCorpora, was developed at the DSP and Speech Technology Laboratory, Department of Electronic Engineering, CUHK. The corpora provide the essential infrastructure for the advancement of Cantonese spoken language technology. CUCorpora contain read-style Cantonese speech recorded with a high-quality microphone in a quiet environment. The total amount of data is about 70 hours. Transcriptions are provided in the form of Chinese characters and Cantonese phonemic symbols.

According to the intended applications, CUCorpora are divided into five parts:

  • CUSYL - full collection of Cantonese tonal syllables from 2 speakers, for text-to-speech applications
  • CUWORD - multi-syllable words/phrases from 28 speakers, for general speech recognition applications
  • CUSENT - Cantonese sentences from 80 speakers, phonetically rich, for LVCSR applications
  • CUDIGIT - Cantonese digit strings, for connected-digit recognition
  • CUCMD - Cantonese command words used in the navigation control scenario

More information about CUCorpora can be found in:

Tan Lee, W. K. Lo, P. C. Ching and Helen Meng, “Spoken language resources for Cantonese speech processing,” in Speech Communication, Vol. 36, No. 3-4, pp. 327-342, March 2002.
P. C. Ching, Tan Lee, W. K. Lo and Helen Meng, “Cantonese speech recognition and synthesis,” in Advances in Chinese Spoken Language Processing, C.-H. Lee et al., eds., World Scientific Publishing, Singapore, pp. 365-386, Dec. 2006.


The following are some samples of CUCorpora for review purposes. Download them to see whether they meet your requirements. By downloading these files, you acknowledge that the copyright of the data belongs to CUHK and agree that the downloaded data will be used solely for preview purposes.

cxsearch is a tool for locating files in CUCorpora from the accompanying transcription files. It accepts regular expressions as search patterns. For details, see the readme file.

You may download cxsearch here.
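
The idea of locating files by a regular-expression search over transcriptions can be illustrated with a minimal Python sketch. This is not cxsearch itself, and it does not reflect cxsearch's actual options or the real CUCorpora transcription format; consult the readme file for those. The sketch assumes a hypothetical layout in which each transcription line holds a file name followed by whitespace and the transcription. The function name, file names, and syllable labels below are made up for illustration.

```python
import re
from typing import Iterable, List


def find_matching_files(lines: Iterable[str], pattern: str) -> List[str]:
    """Return the file names whose transcription matches a regular expression.

    Assumed (hypothetical) line layout: "<file name><whitespace><transcription>".
    """
    regex = re.compile(pattern)
    matches = []
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        file_name, transcription = parts
        if regex.search(transcription):
            matches.append(file_name)
    return matches


# Made-up sample entries, for illustration only (not real corpus data)
sample = [
    "utt0001.wav jat1 ji6 saam1",
    "utt0002.wav sei3 ng5 luk6",
    "utt0003.wav saam1 sei3",
]

# Find every file whose transcription contains the syllable "saam1"
print(find_matching_files(sample, r"\bsaam1\b"))
```

Searching the transcription text rather than the audio means a pattern such as a syllable sequence can narrow a large corpus down to a handful of candidate recordings before any audio is opened.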


CUCorpora are now available for licensing. Click here for prices of CUCorpora. There are three different types of licenses, for commercial use, industrial research, and academic research, respectively. To obtain a license, please prepare the following:

  1. TWO signed copies of the appropriate type of end-user license agreement.
  2. A bank draft in Hong Kong dollars made payable to The Chinese University of Hong Kong for the license fee. Other methods of payment, e.g., bank transfer or credit card, are also acceptable. For details, contact Ms. Tracy PANG, Technology Development Team, Office of Research and Knowledge Transfer Services (ORKTS), CUHK.
  3. Your contact information, including a mailing address for the requested materials.
  4. Send these documents to:
    Prof. Tan Lee
    Department of Electronic Engineering,
    Room 404, Ho Sin Hang Engineering Building
    The Chinese University of Hong Kong,
    Shatin, New Territories


Prof. Tan Lee
Department of Electronic Engineering, The Chinese University of Hong Kong,
Tel: (852) 39438267
Fax: (852) 26035558