Multi-CAST: Multilingual corpus of annotated spoken texts
Description
Multi-CAST, the Multilingual Corpus of Annotated Spoken Texts (Haig & Schnell 2015), is a collection of annotated spoken-language corpora from a typologically diverse set of languages. Most of the data stem from documentation projects undertaken on lesser-researched and endangered languages. The texts are overwhelmingly unscripted, non-elicited, monologic narratives.
Each corpus in the collection is an individually citable resource contributed by experts on the respective language in cooperation with the collection editors. The Multi-CAST collection as a whole was designed and compiled by Geoffrey Haig and Stefan Schnell with the assistance of Nils Schiborr, and is to date the only freely available, multilingual, spoken-language corpus that combines morphological and morphosyntactic glossing with annotation of discourse referents. Each Multi-CAST corpus includes audio recordings (as WAV and MP3 files; archived separately, see below), annotation files in several formats (including EAF files for use with the free linguistic annotation software ELAN, and TSV and XML files), metadata on the speakers and texts, and documentation on the language, speech communities, recording situations, and analytical decisions pertinent to the annotations.
The annotation files use a multi-tier structure built on a time-aligned segmentation of the text into utterance units, from which a transcription and an idiomatic English translation are derived. Utterance units are segmented further into grammatical words, which are glossed morphologically (following the Leipzig Glossing Rules) and annotated with the GRAID (Grammatical Relations and Animacy in Discourse; Haig & Schnell 2014) and RefIND (Referent Indexing in Natural Language Discourse; Schiborr et al. 2018) schemes. Further information on the contents of the collection and the structure of the annotations can be found in the Multi-CAST collection overview (Schiborr 2023), which is included in this archive. The multicastR package (Schiborr 2018) provides a simple interface for accessing the Multi-CAST annotation data directly from the statistical computing language R.
This archive contains version 2507 of the Multi-CAST collection (originally published in July 2025) and comprises data from 20 languages, encompassing around 21 hours of recordings, 31,000 clause units, and 150,000 words across 157 individual texts. The audio files accompanying these data sets have been archived separately; they can be found via the links in the list below.
- Arta [arta1239] (Kimoto 2019)
  link to audio: 10.48564/unibafd-c0jd0-7qt52
- Bora [bora1263] (Seifart & Hong 2022)
  link to audio: 10.48564/unibafd-zcyz8-x7f04
- Chirag [chir1284] (Ganenkov & Schiborr 2025) [NEW!]
  link to audio: 10.48564/unibafd-gryr8-j8p15
- Cypriot Greek [cypr1249] (Hadjidas & Vollmer 2015)
  no audio files available
- English [sout3282] (Schiborr 2015)
  link to audio: 10.48564/unibafd-4nays-jwa80
- Jinghpaw [kach1280] (Kurabe 2021)
  link to audio: 10.48564/unibafd-jav5f-paa07
- Kalamang [kara1499] (Visser 2021)
  link to audio: 10.48564/unibafd-z9wt8-jwd54
- Mandarin [mand1415] (Vollmer 2020)
  link to audio: 10.48564/unibafd-bxcvm-m9e27
- Matukar Panau [matu1261] (Barth, Davey & Matheas 2023)
  link to audio: doi.org/10.48564/unibafd-0sa31-g8r71
- Nafsan [sout2856] (Thieberger & Brickell 2019)
  link to audio: 10.48564/unibafd-jq8x6-d8p78
- Northern Kurdish [nort2641] (Haig, Vollmer & Thiele 2019)
  link to audio: 10.48564/unibafd-6sbd4-r0868
- Persian [tehr1242] (Adibifar 2016)
  link to audio: 10.48564/unibafd-37wvv-n0j98
- Sanzhi Dargwa [sanz1248] (Forker & Schiborr 2019)
  link to audio: 10.48564/unibafd-fahwc-1ha62
- Sumbawa [sumb1241] (Shiohara 2022)
  link to audio: 10.48564/unibafd-p63kx-vzd97
- Tabasaran [taba1259] (Bogomolova, Ganenkov & Schiborr 2021)
  link to audio: 10.48564/unibafd-vqjky-k8g84
- Teop [teop1238] (Mosel & Schnell 2015)
  link to audio: 10.48564/unibafd-03n2z-bm579
- Tondano [tond1251] (Brickell 2016)
  link to audio: 10.48564/unibafd-1nkkj-f9352
- Tulil [taul1251] (Meng 2019)
  link to audio: 10.48564/unibafd-h1wq5-wzh05
- Uruangnirin [urua1244] (Visser 2025) [NEW!]
  link to audio: 10.48564/unibafd-jyv7d-5hk35
- Vera'a [vera1241] (Schnell 2015)
  link to audio: 10.48564/unibafd-es22h-1j872
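The audio links above are given as DOIs, mostly bare and in one case with a doi.org/ prefix. As a minimal sketch, assuming the standard https://doi.org/ resolver, they could be normalized to clickable URLs as follows. The function name `doi_to_url` and the validation pattern are illustrative choices, not part of Multi-CAST or multicastR.

```python
import re

# Loose shape check for a DOI: "10.", a registrant code, "/", and a suffix.
# This is an illustrative pattern, not an official DOI grammar.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes that may precede the DOI in the list above or in pasted links.
KNOWN_PREFIXES = ("https://doi.org/", "http://doi.org/", "doi.org/", "doi:")


def doi_to_url(text):
    """Return an https://doi.org/ URL for a bare or prefixed DOI string."""
    doi = text.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a DOI: {text!r}")
    return "https://doi.org/" + doi


if __name__ == "__main__":
    # Two entries copied from the list above (Arta and Matukar Panau).
    for entry in ("10.48564/unibafd-c0jd0-7qt52",
                  "doi.org/10.48564/unibafd-0sa31-g8r71"):
        print(doi_to_url(entry))
```

Entries without a DOI, such as "no audio files available" for Cypriot Greek, raise a `ValueError` here and would need to be skipped by the caller.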
Citation for the entire Multi-CAST collection:
- Haig, Geoffrey & Schnell, Stefan (eds.). 2015. Multi-CAST: Multilingual corpus of annotated spoken texts. Version 2507. Bamberg: University of Bamberg. (DOI: 10.48564/unibafd-6nwae-ayk30)
Citations for individual Multi-CAST corpora:
- Adibifar, Shirin. 2016. Multi-CAST Persian. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
- Barth, Danielle & Davey, Kira & Matheas, Maria. 2023. Multi-CAST Matukar Panau. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
- Bogomolova, Natalia & Ganenkov, Dmitry & Schiborr, Nils N. 2021. Multi-CAST Tabasaran. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
- Brickell, Timothy. 2016. Multi-CAST Tondano. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
- Forker, Diana & Schiborr, Nils N. 2019. Multi-CAST Sanzhi Dargwa. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
- Ganenkov, Dmitry & Schiborr, Nils N. 2025. Multi-CAST Chirag. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
- Hadjidas, Harris & Vollmer, Maria. 2015. Multi-CAST Cypriot Greek. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
- Haig, Geoffrey & Vollmer, Maria & Thiele, Hanna. 2019. Multi-CAST Northern Kurdish. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
- Kimoto, Yukinori. 2019. Multi-CAST Arta. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
- Kurabe, Keita. 2021. Multi-CAST Jinghpaw. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
- Meng, Chenxi. 2019. Multi-CAST Tulil. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
- Mosel, Ulrike & Schnell, Stefan. 2015. Multi-CAST Teop. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
- Schiborr, Nils N. 2015. Multi-CAST English. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
- Schnell, Stefan. 2015. Multi-CAST Vera'a. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
- Seifart, Frank & Hong, Tai. 2022. Multi-CAST Bora. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
- Shiohara, Asako. 2022. Multi-CAST Sumbawa. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
- Thieberger, Nick & Brickell, Timothy. 2019. Multi-CAST Nafsan. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
- Visser, Eline. 2021. Multi-CAST Kalamang. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
- Visser, Eline. 2025. Multi-CAST Uruangnirin. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
- Vollmer, Maria. 2020. Multi-CAST Mandarin. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
Files
mc_collection-overview.pdf
Total size: 35.5 MB
Checksum | Size |
---|---|
md5:c94270b8f1c2343be29e3d1e2deca796 | 415.5 kB |
md5:8557c923d450edc23d1933a12507c475 | 35.1 MB |
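The MD5 checksums listed for the files can be used to confirm that a download is complete and uncorrupted. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_listed_checksum` are hypothetical, and the path passed in is whatever name the downloaded file has locally, since the listing above does not pair checksums with file names.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum(local_path, "md5:8557c923d450edc23d1933a12507c475")` would return `True` only if the file at `local_path` is byte-for-byte identical to the 35.1 MB file in the archive.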