Language and corpus resources of the Dept. of General Linguistics — Hub page
Description
This record serves as an access hub for the language and corpus resources that were developed at the former Department of General Linguistics at the University of Bamberg, largely under the supervision of Geoffrey Haig. Click the DOIs below to be taken to each resource.
Please do not cite this record directly.
Multi-CAST: Multilingual Corpus of Annotated Spoken Texts (2015)
Haig, Geoffrey & Schnell, Stefan
Multi-CAST is a collection of annotated spoken-language corpora from a typologically diverse set of languages. Most of the data stem from documentation projects undertaken on lesser-researched and endangered languages. The texts are overwhelmingly unscripted, non-elicited, monologic narratives.
DOI: 10.48564/unibafd-nzvjx-4x932 (Univ. Bamberg) | TBA (Zenodo)
WOWA — The Word Order in Western Asia Corpus (2024)
Haig, Geoffrey & Stilo, Don & Doğan, Mahîr & Schiborr, N.
WOWA is a collection of transcribed and annotated spoken texts from 41 languages spoken across a region loosely referred to as Western Asia. Most texts are spontaneous (i.e. unscripted) narrative monologues such as oral history and traditional tales. The languages selected are generally under-researched, non-standardized minority languages.
DOI: 10.48564/unibafd-1824r-3xd79 (Univ. Bamberg) | TBA (Zenodo)
HamBam — The Hamedan-Bamberg Corpus of Contemporary Spoken Persian (2022)
Haig, Geoffrey & Rasekh-Mahand, Mohammad
HamBam is an online corpus of contemporary spoken Persian. The design of the corpus follows the architecture and rationale of Multi-CAST (Haig & Schnell 2015), but with certain modifications.
DOI: 10.48564/unibafd-kqx47-c8g48 (Univ. Bamberg) | TBA (Zenodo)
The Laki variety of Harsin (2021)
Belelli, Sara
This corpus contains sound files and transcriptions of texts in the Laki variety of Harsin as documented by Sara Belelli in her PhD dissertation (Belelli 2021). It contains a selection of seven texts recorded between the 10th of January 2014 (20th of Dey 1392) and the 27th of February 2014 (8th of Esfand 1392) in the city of Harsin.
DOI: 10.48564/unibafd-5ny05-56352 (Univ. Bamberg) | TBA (Zenodo)
The Corpus of Contemporary Written Kurdish (2021)
Incekan, Abdullah & Haig, Geoffrey
The CCWK comprises a selection of contemporary written, primarily literary texts in Northern Kurdish (Kurmanjî). The corpus was compiled by Abdullah Incekan as part of his PhD project (Incekan 2018) under the supervision of Geoffrey Haig.
Please note that due to copyright constraints, the corpus data are available only on request. Please contact Geoffrey Haig if you wish to access the data.
DOI: 10.48564/unibafd-96cn0-gjm62 (Univ. Bamberg) | TBA (Zenodo)
The Corpus of Contemporary Kurdish Newspaper Texts (2001)
Haig, Geoffrey
The CCKNT comprises written Northern Kurdish (Kurmanjî) journalistic texts, compiled from online newspaper texts in 1999. The corpus consists of 483 texts, totalling around 214 000 words. It contains texts from two Kurdish publications: Azadiya Welat, a weekly Kurdish newspaper, and CTV, a company that broadcasts news items in Kurdish on the internet. The texts are not tagged or translated. The corpus was compiled as part of a project on modern Kurdish syntax, conducted from 1999 to 2001 at the Seminar für Allgemeine und Vergleichende Sprachwissenschaft at the University of Kiel.
DOI: 10.48564/unibafd-ft675-b3y16 (Univ. Bamberg) | TBA (Zenodo)
Kurdish spoken texts recorded by David MacKenzie in the mid-1950s in Iraqi Kurdistan, prepared and deposited by Geoffrey Haig (2025)
Haig, Geoffrey
David Neil MacKenzie (1926‒2001) was a British-born philologist who was professor of Iranian Studies at the University of Göttingen. In 1961‒1962, he published a two-volume documentation of different dialects of Kurdish (MacKenzie, David. Kurdish dialect studies, Vol. I‒II. 1961 & 1962, Oxford University Press). It was based on extensive fieldwork in Iraqi Kurdistan in the 1950s, and includes a substantial corpus of transcribed and translated recordings. After MacKenzie’s death, a few of the original recordings (magnetic tapes) were discovered and subsequently digitized by Geoffrey Haig at the Phonetics Laboratory of the University of Kiel. They are made available here in both WAV and MP3 formats by Geoffrey Haig.
DOI: 10.48564/unibafd-vdjcw-mvt46 (Univ. Bamberg) | TBA (Zenodo)
Annotations using GRAID: Grammatical Relations and Animacy in Discourse (manual, 2014)
Haig, Geoffrey & Schnell, Stefan
GRAID is a system of symbols and conventions for glossing the grammatical relations and overt forms (noun phrases, pronouns etc.) of major clause constituents in texts. The purpose of GRAID annotations is to facilitate cross-corpus research in language typology. The GRAID system was developed on the basis of transcribed recordings from typologically diverse languages, using data that had been collected and archived in language documentation projects. Among other uses, it has been applied to the texts in the Multi-CAST collection of corpora (Haig & Schnell 2015).
DOI: 10.48564/unibafd-94kk4-ey950 (Univ. Bamberg) | TBA (Zenodo)
RefIND — Referent indexing in natural-language discourse (annotation guidelines, 2018)
Schiborr, N. & Schnell, Stefan & Thiele, Hanna
RefIND is a set of corpus annotation conventions designed to address research questions in the area of reference and discourse structure. RefIND annotations target the linguistic expressions of abstract discourse referents and consist primarily of multi-digit numerical glosses that uniquely identify each occurrence of a discourse referent in a given text. RefIND annotations are intended to complement and extend other annotation schemata such as GRAID (Haig & Schnell 2014). See the Multi-CAST collection of corpora (Haig & Schnell 2015) for an example of RefIND in use.
DOI: 10.48564/unibafd-3z69c-s5y14 (Univ. Bamberg) | TBA (Zenodo)