
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 3.3: Report on Forensic Linguistics Corpus and Search Interface C.R.I.M.E
Date of reporting: 01-09-2025
Report authors: Steven Coats (University of Oulu)
Contributors: Dana Roemling (University of Birmingham)
Deliverable location: Online search interface: https://forensic.corpora.li; static version (DOI): https://doi.org/10.7910/DVN/MLMB6E
Keywords: Forensic linguistics; corpus linguistics; YouTube; investigative interviews
CRIME is the Corpus of Recorded Investigative, Media, and Evidence-based proceedings, a structured, searchable resource comprising audio and ASR-generated transcripts from investigative interviews, courtroom interactions, and related media. Collected from publicly available YouTube sources according to the provisions of the EU Data Mining Act, the corpus addresses a critical gap in current research: the lack of large-scale, real-world datasets that integrate reliable transcripts with corresponding audio.
Previous studies often rely on limited data, constraining generalizability and hindering methodological innovation. By enabling detailed analysis of linguistic, phonetic, pragmatic, and discourse-level features, CRIME supports interdisciplinary research in linguistics, law, psychology, and computational modeling. Potential applications include the identification of language patterns associated with interviewing strategies and outcomes, as well as leveraging large language models to explore affective and interactional dynamics.
This resource offers substantial potential to inform both academic inquiry and evidence-based practices in investigative interviewing and broader criminal justice contexts. The corpus is available in two versions: an online search engine, powered by BlackLab, through which transcripts and audio can be accessed and downloaded (https://forensic.corpora.li), and a static, text-only, downloadable version containing transcripts and metadata in tabular form (https://doi.org/10.7910/DVN/MLMB6E).
Coats, Steven and Dana Roemling. 2025. CRIME: The Corpus of Recorded Investigative, Media, and Evidence-based Proceedings. In: Fábián, Annamária and Igor Trost (eds.), Impulses and Approaches to Computer-Mediated Communication: Proceedings of the 12th International Conference on Computer-Mediated Communication and Social Media Corpora for the Humanities, 45-49. University of Bayreuth, Germany. https://www.cmc2025.uni-bayreuth.de/pool/dokumente/CMC-2025-Proceedings-2.pdf
Coats, Steven and Dana Roemling. CRIME: The Corpus of Recorded Investigative, Media, and Evidence-based Proceedings. Presentation at the Aston University Institute for Forensic Linguistics Research Seminar. Birmingham, UK, April 24th, 2025.
Coats, Steven and Dana Roemling. CRIME: The Corpus of Recorded Investigative, Media, and Evidence-based Proceedings. Presentation at the 12th International Conference on CMC and Social Media Corpora for the Humanities (CMC-Corpora 2025). Bayreuth, Germany, September 5th, 2025. https://www.cmc2025.uni-bayreuth.de/en/
Roemling, Dana and Steven Coats. CRIME: The Corpus of Recorded Investigative, Media, and Evidence-based Proceedings. Presentation at the 7th General ILLA Conference. Kaunas, Lithuania, September 5th, 2025. https://conferences.vdu.lt/etn/general-illa-conference/
The FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
