
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 4.1: Report on Analysis of multimodal properties of naturalistic speech
Date of reporting: 12-11-2025
Report author: Steven Coats (University of Oulu)
Contributors: Alessandro Basile (Sorbonne Nouvelle University, France), Cameron Morin (University of Paris-Cité, France), Robert Fuchs (University of Bonn, Germany)
Deliverable location: Online search interface: https://ycsep.corpora.li (on Zenodo).
Downloadable static corpus: https://doi.org/10.7910/DVN/B7JRID
Keywords: Singapore English, Corpus Linguistics, YouTube, World Englishes, Podcasts
Recent advances in streaming protocols and automatic speech recognition (ASR) have made it possible to build large-scale corpora of spoken language, yet research on Singapore English remains constrained by small or text-based datasets. The YouTube Corpus of Singapore English Podcasts (YCSEP) addresses this gap with 620 hours of transcribed, diarized speech from over 1,300 podcast episodes by Singapore-based content creators. YCSEP supports the empirical analysis of phonetics, morphosyntax, and discourse, and its size enables the study of low-frequency features such as discourse particles and reduplication. The dataset reflects informal, spontaneous speech from diverse speakers and facilitates investigation into nativization and endonormative stabilization processes in postcolonial English. Built using a pipeline of yt-dlp, WhisperX, and Pyannote, YCSEP offers robust empirical grounding for the study of linguistic features such as verb complementation and modality. It also contributes to broader theoretical discussions on areal norms and construction grammar in World Englishes.
The corpus is available in two versions: an online search engine, through which transcripts and audio are accessible and downloadable (https://ycsep.corpora.li), and a static, text-only, downloadable version containing transcripts and metadata in tabular form (https://doi.org/10.7910/DVN/B7JRID).
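As an illustration of how the static tabular version might be used to study low-frequency features such as discourse particles and reduplication, the following minimal Python sketch computes particle frequencies and finds adjacent repeated tokens. The column names (video_id, speaker, text), the three sample rows, and the particle list are assumptions made for this example and do not reflect the actual YCSEP schema or data; the adjacent-repeat check is only a rough heuristic for reduplication candidates.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical sample rows. The column names and the utterances are invented
# for illustration and are NOT taken from the YCSEP files.
SAMPLE_TSV = (
    "video_id\tspeaker\ttext\n"
    "v1\tSPEAKER_00\tOkay lah, we start now.\n"
    "v1\tSPEAKER_01\tCan can, no problem lor.\n"
    "v2\tSPEAKER_00\tWhy you so late leh?\n"
)

# Illustrative particle list (an assumption, not an exhaustive inventory).
PARTICLES = ("lah", "leh", "lor")
TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


def particle_frequencies(tsv_text, particles=PARTICLES):
    """Return raw particle counts, total token count, and per-million rates."""
    counts = Counter()
    total = 0
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        tokens = tokenize(row["text"])
        total += len(tokens)
        counts.update(t for t in tokens if t in particles)
    per_million = {
        p: (counts[p] / total * 1_000_000 if total else 0.0) for p in particles
    }
    return counts, total, per_million


def reduplication_candidates(tsv_text):
    """Return tokens that are immediately repeated within an utterance."""
    hits = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        tokens = tokenize(row["text"])
        hits.extend(a for a, b in zip(tokens, tokens[1:]) if a == b)
    return hits


counts, total, per_million = particle_frequencies(SAMPLE_TSV)
candidates = reduplication_candidates(SAMPLE_TSV)
print(dict(counts), total, candidates)
```

To run this on the downloaded corpus, the in-memory sample would be replaced with the contents of a transcript file, and the column name adjusted to match the actual header of that file.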
Related publication:
Coats, Steven, Carmelo Alessandro Basile, Cameron Morin, and Robert Fuchs. 2025. The YouTube Corpus of Singapore English Podcasts. English World-Wide. https://doi.org/10.1075/eww.25018.coa
Related presentations:
Coats, Steven, Carmelo Alessandro Basile, Cameron Morin, and Robert Fuchs. The YouTube Corpus of Singapore English Podcasts. Presentation at the Mutations du Discours Numérique Seminar. Arras, France, April 22nd, 2025. https://calenda.org/1204680; https://adum.fr/script/formations.pl?mod=3633487&site=l
Coats, Steven, Carmelo Alessandro Basile, Cameron Morin, and Robert Fuchs. The YouTube Corpus of Singapore English Podcasts. Presentation at the 8th Conference of the International Society for the Linguistics of English. Santiago de Compostela, Spain, September 3rd, 2025. https://isle8conference.com/
The FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
