Corpus for the study of Language and Gender in Mexico and Spain (CoLaGe), source
Korpus kielen ja sukupuolen tutkimiseen Meksikossa ja Espanjassa (CoLaGe), lähdemateriaali

Shortname: colage-src
Metadata: https://urn.fi/urn:nbn:fi:lb-2024030603
Rightholder: Pekka Posio
Data controller, regarding personal data: University of Helsinki
License: CLARIN RES +PRIV +DEP +OTHER* v2.1

The complete license is available at http://urn.fi/urn:nbn:fi:lb-2024030605
A copy of the license is included in LICENSE.txt. The license details may be subject to change, so before downloading the resource, please refer to the latest version of the license at the above link.

There is an alternative license intended for a different version of CoLaGe that contains only the transcriptions (colage-txt). You may choose this license if you do not use any of the WAV files. A copy of this license is included in LICENSE_colage-txt.txt. The license details may be subject to change, so before downloading the resource, please refer to the latest version of the license at http://urn.fi/urn:nbn:fi:lb-2025090323.

NB. This resource contains personal data. You must comply with the data protection terms and conditions when processing the personal data. See the license for details.

Resource group page: http://urn.fi/urn:nbn:fi:lb-2024030607

Resource description:

This resource is the downloadable source version of the Corpus for the study of Language and Gender in Mexico and Spain (CoLaGe). The data were collected in Valencia, Spain (2021–2022) and Guadalajara, Mexico (2022–2023) as part of the research project "Gender, society, and language use: evidence from Mexico and Spain", funded by the Kone Foundation. The objective was to create a comparable corpus of spoken Spanish from each city to enable the study of the interconnections between speaker gender, societal gender roles and expectations, and variation in spoken language, combining sociolinguistic and social psychological methodologies.
The data consist of sociolinguistic interviews, divided into parts in which gender is or is not activated as a discourse topic, and two role plays simulating conflictive situations, with the informant playing one role and the interviewer the other. The informants represent a middle-class socioeconomic background and are divided into two age groups, 30–40 and 60–70. A thorough description of the data and the sociolinguistic variables is available with the data.

Structure of the data in download:

There are approximately 111 hours of audio data in WAV format. The corresponding transcriptions are in .xlsx (Excel) and .eaf (Elan) format, except for the phonetic material, which is provided as .TextGrid files (Praat).

The data are divided into 6 packages by subset (GDL_Diversity; Guadalajara; Valencia) and file type (audio=wav; transcripts=xlsx,eaf,TextGrid). Below is a list of the packages with their approximate unpacked sizes and the number of files they contain:

Package                                    Size   Files
colage-src-GDL_Diversity_audio.zip         6.6G   32
colage-src-GDL_Diversity_transcripts.zip   11M    64
colage-src-Guadalajara_audio.zip           35G    180
colage-src-Guadalajara_transcripts.zip     50M    300
colage-src-Valencia_audio.zip              22G    153
colage-src-Valencia_transcripts.zip        31M    255

The pseudonymized metadata is in the file CoLaGe_metadata_CSV_pseudonymized.csv, which is included in all the packages.

For further information, please contact fin-clarin@helsinki.fi.
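
Checking a downloaded package (illustrative sketch):

The package list above can be used for a quick sanity check after download. The following Python sketch counts the files in a package by extension and looks up the file count listed in this README. The function names count_by_extension and compare_with_readme are hypothetical and not part of the resource. This README does not state whether the listed counts include the metadata CSV or the license files, so a small difference between the two numbers is not necessarily an error.

```python
import os
import zipfile
from collections import Counter
from pathlib import PurePosixPath

# File counts as listed in this README (package name -> number of files).
EXPECTED_FILES = {
    "colage-src-GDL_Diversity_audio.zip": 32,
    "colage-src-GDL_Diversity_transcripts.zip": 64,
    "colage-src-Guadalajara_audio.zip": 180,
    "colage-src-Guadalajara_transcripts.zip": 300,
    "colage-src-Valencia_audio.zip": 153,
    "colage-src-Valencia_transcripts.zip": 255,
}


def count_by_extension(zip_path):
    """Count the files in a zip archive, grouped by lowercase extension."""
    counts = Counter()
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries, count only files
            counts[PurePosixPath(info.filename).suffix.lower()] += 1
    return counts


def compare_with_readme(zip_path):
    """Return (files found, files listed in the README or None if unknown)."""
    total = sum(count_by_extension(zip_path).values())
    expected = EXPECTED_FILES.get(os.path.basename(str(zip_path)))
    return total, expected
```

For example, calling compare_with_readme("colage-src-Valencia_audio.zip") on a downloaded package returns the number of files found together with the listed value 153, and count_by_extension shows how the files are distributed over .wav, .xlsx, .eaf, .TextGrid (reported in lowercase) and any other extensions.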