
D4.1.4: Analysis of multimodal properties of naturalistic speech: The YouTube Corpus of Singapore English Podcasts

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 4.1: Report on Analysis of multimodal properties of naturalistic speech
Date of reporting: 12-11-2025

Report author: Steven Coats (University of Oulu)
Contributors: Alessandro Basile (Sorbonne Nouvelle University, France), Cameron Morin (University of Paris-Cité, France), Robert Fuchs (University of Bonn, Germany)
Deliverable location: Online search interface: https://ycsep.corpora.li

Downloadable static corpus: https://doi.org/10.7910/DVN/B7JRID

Keywords: Singapore English, Corpus Linguistics, YouTube, World Englishes, Podcasts

Description

Recent advances in streaming protocols and automatic speech recognition (ASR) have enabled the creation of large-scale spoken-language corpora, yet research on Singapore English remains constrained by small or text-based datasets. The YouTube Corpus of Singapore English Podcasts (YCSEP) addresses this gap with 620 hours of transcribed, diarized speech from over 1,300 podcast episodes by Singapore-based content creators. YCSEP supports the empirical analysis of phonetics, morphosyntax, and discourse, enabling the study of low-frequency features such as discourse particles and reduplication. The dataset reflects informal, spontaneous speech from diverse speakers and facilitates investigation into nativization and endonormative stabilization processes in postcolonial English. Built using a pipeline of yt-dlp, WhisperX, and Pyannote, YCSEP offers robust empirical grounding for the study of linguistic features such as verb complementation and modality. It also contributes to broader theoretical discussions on areal norms and construction grammar in World Englishes.

The corpus is available in two versions: an online search engine, through which transcripts and audio can be accessed and downloaded (https://ycsep.corpora.li), and a static, text-only, downloadable version containing transcripts and metadata in tabular form (https://doi.org/10.7910/DVN/B7JRID).
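As an illustration of how the tabular version might be used to study low-frequency features such as discourse particles, the following minimal sketch counts particle tokens in a tab-separated transcript table. The column names (episode_id, speaker, text), the particle list, and the sample rows are all hypothetical and made up for this example; the actual file layout of the static corpus should be checked against the downloaded data before adapting the code.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical, non-exhaustive list of discourse particles to search for.
PARTICLES = ("lah", "leh", "lor", "meh")
PARTICLE_RE = re.compile(r"\b(" + "|".join(PARTICLES) + r")\b", re.IGNORECASE)

# Made-up toy rows standing in for a transcript table; the column names
# are assumptions, not the documented schema of the corpus.
SAMPLE = """episode_id\tspeaker\ttext
ep001\tSPEAKER_00\tOkay lah, we start now.
ep001\tSPEAKER_01\tCannot like that leh.
ep002\tSPEAKER_00\tSo fast meh? Okay lah.
"""


def load_rows(handle, delimiter="\t"):
    """Read a delimited table into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter=delimiter))


def count_particles(rows, text_field="text"):
    """Count whole-word, case-insensitive particle matches across all rows."""
    counts = Counter()
    for row in rows:
        for match in PARTICLE_RE.finditer(row[text_field]):
            counts[match.group(1).lower()] += 1
    return counts


rows = load_rows(io.StringIO(SAMPLE))
counts = count_particles(rows)
for particle in PARTICLES:
    print(particle, counts[particle])
```

To run the same count on a downloaded file, the io.StringIO object could be replaced by an open file handle, with the delimiter and text column adjusted to match the real table.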

Related publication:

Coats, Steven, Carmelo Alessandro Basile, Cameron Morin, and Robert Fuchs. 2025. The YouTube Corpus of Singapore English Podcasts. English World-Wide. https://doi.org/10.1075/eww.25018.coa

Related presentations:

Coats, Steven, Carmelo Alessandro Basile, Cameron Morin, and Robert Fuchs. The YouTube Corpus of Singapore English Podcasts. Presentation at the Mutations du Discours Numérique Seminar. Arras, France, April 22nd, 2025. https://calenda.org/1204680; https://adum.fr/script/formations.pl?mod=3633487&site=l

Coats, Steven, Carmelo Alessandro Basile, Cameron Morin, and Robert Fuchs. The YouTube Corpus of Singapore English Podcasts. Presentation at the 8th Conference of the International Society for the Linguistics of English. Santiago de Compostela, Spain, September 3rd, 2025. https://isle8conference.com/

The FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
