Multi-CAST: Multilingual corpus of annotated spoken texts

doi:10.48564/unibafd-jqgfp-vkj64

Published July 2025 | Version 2507

Collection Open

Multi-CAST, the Multilingual Corpus of Annotated Spoken Texts (Haig & Schnell 2015), is a collection of annotated spoken-language corpora from a typologically diverse set of languages. Most of the data stem from documentation projects undertaken on lesser-researched and endangered languages. The texts are overwhelmingly unscripted, non-elicited, monologic narratives.

Each corpus in the collection is an individually citable resource contributed by experts on the respective language in cooperation with the collection editors. The Multi-CAST collection as a whole was designed and compiled by Geoffrey Haig and Stefan Schnell with the assistance of Nils Schiborr, and is to date the only freely available, multilingual, spoken-language corpus that combines morphological and morphosyntactic glossing with annotation of discourse referents. Each Multi-CAST corpus includes audio recordings (as WAV and MP3 files; archived separately, see below), annotation files in several file formats (EAF files for use with the free linguistic annotation software ELAN, as well as TSV and XML files), metadata on the speakers and texts, and documentation on the language, speech communities, recording situations, and analytical decisions pertinent to the annotations.

The annotation files use a multi-tier structure built on a time-aligned segmentation of the text into utterance units, each of which is paired with a transcription and an idiomatic English translation. Utterance units are segmented further into grammatical words, which carry morphological glosses (following the Leipzig Glossing Rules) and annotations in the GRAID (Grammatical Relations and Animacy in Discourse, Haig & Schnell 2014) and RefIND (Referent Indexing in Natural Language Discourse, Schiborr et al. 2018) schemes. Further information on the contents of the collection and the structure of the annotations can be found in the Multi-CAST collection overview (Schiborr 2023), which is included in this archive. The multicastR package (Schiborr 2018) provides a simple interface for directly accessing the Multi-CAST annotation data through the statistical computing language R.
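As a rough illustration of the word-level layout described above, the following Python sketch groups tab-separated rows into utterance units and collects the mentions of each indexed referent. The column names (text, uid, gword, gloss, graid, refind) and all sample values are assumptions made for this example and are not copied from any Multi-CAST file; the collection overview documents the actual layout.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows: the column names and all values are invented
# for illustration and are NOT taken from a Multi-CAST file.
SAMPLE_TSV = (
    "text\tuid\tgword\tgloss\tgraid\trefind\n"
    "story01\t0001\tdogs\tdog-PL\tnp:s\t0001\n"
    "story01\t0001\tran\trun.PST\tv:pred\t\n"
    "story01\t0002\tthey\t3PL\tpro:s\t0001\n"
    "story01\t0002\tbarked\tbark-PST\tv:pred\t\n"
)


def group_by_utterance(tsv_text):
    """Map (text, utterance id) to the list of word rows in that unit."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    units = defaultdict(list)
    for row in reader:
        units[(row["text"], row["uid"])].append(row)
    return dict(units)


def referent_mentions(units):
    """Map each referent index to its (utterance id, word) mentions."""
    mentions = defaultdict(list)
    for (_text, uid), rows in units.items():
        for row in rows:
            if row["refind"]:  # words without a referent index are skipped
                mentions[row["refind"]].append((uid, row["gword"]))
    return dict(mentions)


units = group_by_utterance(SAMPLE_TSV)
mentions = referent_mentions(units)
```

With the sample rows, the single referent is mentioned once in each of the two utterance units, first as a lexical noun phrase and then as a pronoun.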

This archive contains version 2507 of the Multi-CAST collection (originally published in July 2025) and comprises data from 20 languages, encompassing around 21 hours of recordings, 31,000 clause units, and 150,000 words across 157 individual texts. The audio files accompanying these data sets have been archived separately; they can be found via the links in the list below.

Arta [arta1239] (Kimoto 2019)
— link to audio: 10.48564/unibafd-c0jd0-7qt52
Bora [bora1263] (Seifart & Hong 2022)
— link to audio: 10.48564/unibafd-zcyz8-x7f04
Chirag [chir1284] (Ganenkov & Schiborr 2025) [NEW!]
— link to audio: 10.48564/unibafd-gryr8-j8p15
Cypriot Greek [cypr1249] (Hadjidas & Vollmer 2015)
— no audio files available
English [sout3282] (Schiborr 2015)
— link to audio: 10.48564/unibafd-4nays-jwa80
Jinghpaw [kach1280] (Kurabe 2021)
— link to audio: 10.48564/unibafd-jav5f-paa07
Kalamang [kara1499] (Visser 2021)
— link to audio: 10.48564/unibafd-z9wt8-jwd54
Mandarin [mand1415] (Vollmer 2020)
— link to audio: 10.48564/unibafd-bxcvm-m9e27
Matukar Panau [matu1261] (Barth, Davey & Matheas 2023)
— link to audio: 10.48564/unibafd-0sa31-g8r71
Nafsan [sout2856] (Thieberger & Brickell 2019)
— link to audio: 10.48564/unibafd-jq8x6-d8p78
Northern Kurdish [nort2641] (Haig, Vollmer & Thiele 2019)
— link to audio: 10.48564/unibafd-6sbd4-r0868
Persian [tehr1242] (Adibifar 2016)
— link to audio: 10.48564/unibafd-37wvv-n0j98
Sanzhi Dargwa [sanz1248] (Forker & Schiborr 2019)
— link to audio: 10.48564/unibafd-fahwc-1ha62
Sumbawa [sumb1241] (Shiohara 2022)
— link to audio: 10.48564/unibafd-p63kx-vzd97
Tabasaran [taba1259] (Bogomolova, Ganenkov & Schiborr 2021)
— link to audio: 10.48564/unibafd-vqjky-k8g84
Teop [teop1238] (Mosel & Schnell 2015)
— link to audio: 10.48564/unibafd-03n2z-bm579
Tondano [tond1251] (Brickell 2016)
— link to audio: 10.48564/unibafd-1nkkj-f9352
Tulil [taul1251] (Meng 2019)
— link to audio: 10.48564/unibafd-h1wq5-wzh05
Uruangnirin [urua1244] (Visser 2025) [NEW!]
— link to audio: 10.48564/unibafd-jyv7d-5hk35
Vera'a [vera1241] (Schnell 2015)
— link to audio: 10.48564/unibafd-es22h-1j872
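The entries in the list above follow a regular pattern (language name, bracketed language code, citation in parentheses), and the audio links are DOIs. A minimal Python sketch for reading one entry and turning a DOI into a resolvable address might look as follows. The function names are hypothetical, and the pattern assumes that every code consists of four lowercase letters followed by four digits, which holds for the entries listed here.

```python
import re

# Assumed entry shape: "Name [abcd1234] (Citation)"; anything after the
# citation, such as a "[NEW!]" marker, is ignored.
ENTRY = re.compile(
    r"^(?P<name>.+?) \[(?P<code>[a-z]{4}\d{4})\] \((?P<citation>[^)]*)\)"
)


def parse_entry(line):
    """Split a list entry into language name, code, and citation."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised entry: {line!r}")
    return match.groupdict()


def doi_url(doi):
    """Return an https://doi.org/ address for a DOI, tolerating common prefixes."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return "https://doi.org/" + doi


entry = parse_entry("Chirag [chir1284] (Ganenkov & Schiborr 2025) [NEW!]")
url = doi_url("10.48564/unibafd-gryr8-j8p15")
```

Prepending https://doi.org/ is the standard way to resolve a DOI in a browser or HTTP client.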

Citation for the entire Multi-CAST collection:

Haig, Geoffrey & Schnell, Stefan (eds.). 2015. Multi-CAST: Multilingual corpus of annotated spoken texts. Version 2507. Bamberg: University of Bamberg. (DOI: 10.48564/unibafd-6nwae-ayk30)

Citations for individual Multi-CAST corpora:

Adibifar, Shirin. 2016. Multi-CAST Persian. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
Barth, Danielle & Davey, Kira & Matheas, Maria. 2023. Multi-CAST Matukar Panau. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
Bogomolova, Natalia & Ganenkov, Dmitry & Schiborr, Nils N. 2021. Multi-CAST Tabasaran. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
Brickell, Timothy. 2016. Multi-CAST Tondano. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
Forker, Diana & Schiborr, Nils N. 2019. Multi-CAST Sanzhi Dargwa. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
Ganenkov, Dmitry & Schiborr, Nils N. 2025. Multi-CAST Chirag. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
Hadjidas, Harris & Vollmer, Maria. 2015. Multi-CAST Cypriot Greek. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
Haig, Geoffrey & Vollmer, Maria & Thiele, Hanna. 2019. Multi-CAST Northern Kurdish. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
Kimoto, Yukinori. 2019. Multi-CAST Arta. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
Kurabe, Keita. 2021. Multi-CAST Jinghpaw. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
Meng, Chenxi. 2019. Multi-CAST Tulil. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
Mosel, Ulrike & Schnell, Stefan. 2015. Multi-CAST Teop. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
Schiborr, Nils N. 2015. Multi-CAST English. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
Schnell, Stefan. 2015. Multi-CAST Vera'a. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
Seifart, Frank & Hong, Tai. 2022. Multi-CAST Bora. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
Shiohara, Asako. 2022. Multi-CAST Sumbawa. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
Thieberger, Nick & Brickell, Timothy. 2019. Multi-CAST Nafsan. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
Visser, Eline. 2021. Multi-CAST Kalamang. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
Visser, Eline. 2025. Multi-CAST Uruangnirin. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
Vollmer, Maria. 2020. Multi-CAST Mandarin. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.

Files

Files (35.5 MB)

Name	MD5 checksum	Size
mc_collection-overview.pdf	c94270b8f1c2343be29e3d1e2deca796	415.5 kB
multicast_2507.zip	8557c923d450edc23d1933a12507c475	35.1 MB
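The MD5 checksums listed for the files can be used to confirm that a download arrived intact. The following Python sketch shows one way to do this with the standard library; the function names are hypothetical, and the file is assumed to have been saved locally under its listed name.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given with or without an 'md5:' prefix."""
    return md5_of_file(path) == expected.strip().lower().removeprefix("md5:")
```

For example, matches_checksum("multicast_2507.zip", "8557c923d450edc23d1933a12507c475") should return True for an uncorrupted copy of the archive in the current directory. Note that str.removeprefix requires Python 3.9 or later.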

	All versions	This version
Views	3,001	1,588
Downloads	2,106	1,267
Data volume	7.9 GB	6.0 GB
