Other databases


CU2C and CUMIX were developed at the DSP and Speech Technology Laboratory, Department of Electronic Engineering, CUHK.


CU2C is a dual-condition Cantonese speech database for speaker recognition research. It is task-oriented: the speech content comprises Hong Kong ID numbers, Cantonese digit strings, and sentences. CU2C is distinctive in that it contains parallel data collected under two acoustic conditions, namely a public fixed-line telephone channel and a wideband desktop microphone. These data are useful for studying channel effects in speaker recognition. A total of 84 target speakers and 23 impostors were recorded. Each speaker has 18 recording sessions, collected over a period of 4 to 9 months.

More information about CU2C can be found in:

Nengheng Zheng, Chao Qin, Tan Lee, and P. C. Ching, "CU2C: A dual-condition Cantonese speech database for speaker recognition applications," in Proceedings of Oriental-COCOSDA, 2005, pp. 67-72.


CUMIX is a database developed specifically for code-mixing speech recognition. The spoken content consists mainly of everyday conversation or jargon used by university students in Hong Kong. CUMIX contains three types of utterances: (1) Cantonese-English code-mixing utterances, (2) monolingual colloquial Cantonese utterances, and (3) monolingual English words and phrases. It contains 16 hours of speech data from 74 speakers.

More information about CUMIX can be found in:

Joyce Y. C. Chan, P. C. Ching, and Tan Lee, "Development of a Cantonese-English code-mixing speech corpus," in Proceedings of Interspeech, 2005, pp. 1533-1536.


Prof. Tan Lee
Department of Electronic Engineering, The Chinese University of Hong Kong
Tel: (852) 39438267
Fax: (852) 26035558
Email: tanlee@ee.cuhk.edu.hk