Published July 2025 | Version 2507
Collection Open

Multi-CAST: Multilingual corpus of annotated spoken texts

Description

Multi-CAST, the Multilingual Corpus of Annotated Spoken Texts (Haig & Schnell 2015), is a collection of annotated spoken-language corpora from a typologically diverse set of languages. Most of the data stem from documentation projects undertaken on lesser-researched and endangered languages. The texts are overwhelmingly unscripted, non-elicited, monologic narratives. 

Each corpus in the collection is an individually citable resource that was contributed by experts on the respective languages in cooperation with the collection editors. The Multi-CAST collection as a whole was designed and compiled by Geoffrey Haig and Stefan Schnell with the assistance of Nils Schiborr, and is to date the only freely available, multilingual, spoken-language corpus that combines morphological and morphosyntactic glossing with annotation of discourse referents. Each Multi-CAST corpus includes audio recordings (as WAV and MP3 files; archived separately, see below), annotation files in a number of file formats (including EAF files for use with the free linguistic annotation software ELAN, as well as TSV and XML files), metadata on the speakers and texts, and documentation on the language, speech communities, recording situations, and analytical decisions pertinent to the annotations.

The annotation files use a multi-tier structure built on a time-aligned segmentation of the text into utterance units, each of which is paired with a transcription and an idiomatic English translation. Utterance units are segmented further into grammatical words, which carry morphological glossing (following the Leipzig Glossing Rules) and annotations in the GRAID (Grammatical Relations and Animacy in Discourse, Haig & Schnell 2014) and RefIND (Referent Indexing in Natural Language Discourse, Schiborr et al. 2018) annotation schemes. Further information on the contents of the collection and the structure of the annotations can be found in the Multi-CAST collection overview (Schiborr 2023), which is included in this archive. The multicastR package (Schiborr 2018) provides a simple interface for directly accessing the Multi-CAST annotation data through the statistical computing language R.
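The tier structure described above can be pictured as a small nested data model. The following Python sketch is only an illustration: the class names, field names, and the invented English example are assumptions made here, and they do not reproduce the actual tier or column names used in the EAF, TSV, or XML files.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Word:
    """One grammatical word with its annotations (hypothetical field names)."""
    form: str                      # word form as transcribed
    gloss: str                     # morphological gloss, Leipzig-style
    graid: str                     # GRAID-style annotation
    refind: Optional[str] = None   # RefIND-style referent index, if any


@dataclass
class Utterance:
    """One time-aligned utterance unit (hypothetical field names)."""
    start_ms: int
    end_ms: int
    transcription: str
    translation: str
    words: List[Word] = field(default_factory=list)

    def referents(self) -> List[str]:
        """Return the referent indices mentioned in this utterance."""
        return [w.refind for w in self.words if w.refind is not None]


# Invented example for illustration only; not taken from any corpus.
example = Utterance(
    start_ms=0,
    end_ms=1200,
    transcription="she left",
    translation="She left.",
    words=[
        Word(form="she", gloss="3SG.F", graid="pro.h:s", refind="0001"),
        Word(form="left", gloss="leave.PST", graid="v:pred"),
    ],
)

print(example.referents())
```

Nesting the words inside the utterance mirrors the description above, in which the word-level tiers depend on the utterance-level segmentation.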

This archive contains version 2507 of the Multi-CAST collection (originally published in July 2025) and comprises data from 20 languages, encompassing around 21 hours of recordings, 31,000 clause units, and 150,000 words across 157 individual texts. The audio files accompanying these data sets have been archived separately; they can be found via the links in the list below.

 

Citation for the entire Multi-CAST collection:

  • Haig, Geoffrey & Schnell, Stefan (eds.). 2015. Multi-CAST: Multilingual corpus of annotated spoken texts. Version 2507. Bamberg: University of Bamberg. (DOI: 10.48564/unibafd-6nwae-ayk30)
     

Citations for individual Multi-CAST corpora:

  • Adibifar, Shirin. 2016. Multi-CAST Persian. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
  • Barth, Danielle & Davey, Kira & Matheas, Maria. 2023. Multi-CAST Matukar Panau. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
  • Bogomolova, Natalia & Ganenkov, Dmitry & Schiborr, Nils N. 2021. Multi-CAST Tabasaran. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
  • Brickell, Timothy. 2016. Multi-CAST Tondano. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
  • Forker, Diana & Schiborr, Nils N. 2019. Multi-CAST Sanzhi Dargwa. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
  • Ganenkov, Dmitry & Schiborr, Nils N. 2025. Multi-CAST Chirag. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
  • Hadjidas, Harris & Vollmer, Maria. 2015. Multi-CAST Cypriot Greek. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
  • Haig, Geoffrey & Vollmer, Maria & Thiele, Hanna. 2019. Multi-CAST Northern Kurdish. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
  • Kimoto, Yukinori. 2019. Multi-CAST Arta. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
  • Kurabe, Keita. 2021. Multi-CAST Jinghpaw. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
  • Meng, Chenxi. 2019. Multi-CAST Tulil. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
  • Mosel, Ulrike & Schnell, Stefan. 2015. Multi-CAST Teop. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
  • Schiborr, Nils N. 2015. Multi-CAST English. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
  • Schnell, Stefan. 2015. Multi-CAST Vera'a. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
  • Seifart, Frank & Hong, Tai. 2022. Multi-CAST Bora. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
  • Shiohara, Asako. 2022. Multi-CAST Sumbawa. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
  • Thieberger, Nick & Brickell, Timothy. 2019. Multi-CAST Nafsan. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
  • Visser, Eline. 2021. Multi-CAST Kalamang. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
  • Visser, Eline. 2025. Multi-CAST Uruangnirin. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.
  • Vollmer, Maria. 2020. Multi-CAST Mandarin. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST.

 

Files

mc_collection-overview.pdf

Files (35.5 MB)

  • md5:c94270b8f1c2343be29e3d1e2deca796 (415.5 kB)
  • md5:8557c923d450edc23d1933a12507c475 (35.1 MB)
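
The MD5 checksums listed above can be used to confirm that a download is complete and unaltered. The following is a minimal Python sketch using only the standard library; the function names `md5_of` and `verify_md5` are hypothetical helpers introduced here, and the local file path is whatever name the downloaded file was saved under.

```python
import hashlib
from pathlib import Path
from typing import Union


def md5_of(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path: Union[str, Path], expected: str) -> bool:
    """Compare a file's MD5 digest with a listed checksum.

    The expected value may carry the "md5:" prefix used in the file listing.
    """
    expected_hex = expected.strip().lower()
    if expected_hex.startswith("md5:"):
        expected_hex = expected_hex[len("md5:"):]
    return md5_of(path) == expected_hex
```

For example, calling `verify_md5` with the path of a downloaded file and the string `md5:c94270b8f1c2343be29e3d1e2deca796` returns `True` only if the file matches the smaller of the two entries in the listing.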