41 datasets found

E
GlobalPhone Vietnamese
live.european-language-grid.eu
catalogue.elra.info
audio format
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
GlobalPhone Vietnamese [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/2100
Explore at:
audio formatAvailable download formats
License
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttp://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttp://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Description
The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.

Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data could be delivered unshorten.

The Vietnamese part of GlobalPhone was collected in summer 2009. In total 160 speakers were recorded, 140 of them in the cities of Hanoi and Ho Chi Minh City in Vietnam, and an additional set of 20 speakers were recorded in Karlsruhe, Germany. All speakers are Vietnamese native speakers, covering the main dialectal variants from South and North Vietnam. Of these 160 speakers, 70 were female and 90 were male. The majority of speakers are well educated, being graduated students and engineers. The age distribution of the speakers ranges from 18 to 65 years. Each speaker read between 50 and 200 utterances from newspaper articles, corresponding to roughly 9.5 minutes of speech or 138 utterances per person, in total we recorded 22.112 utterances. The speech was recorded using a close-talking microphone Sennheiser HM420 in a push-to-talk scenario using an inhouse developed modern laptop-based data collection toolkit. All data were recorded at 16kHz and 16bit resolution in PCM format. The data collection took place in small-sized rooms with very low background noise. Information on recording place and environmental noise conditions are provided in a separate speaker session file for each speaker. The speech data was recorded in two phases. In a first phase data was collected from 140 speakers in the cities of Hanoi and Ho Chi Minh. In the second phase we selected utterances from the text corpus in order to cover rare Vietnamese phonemes. This second recording phase was carried out with 20 Vietnamese graduate students who live in Karlsruhe. In sum, 22.112 utterances were spoken, corresponding to 25.25 hours of speech. The text data used for recording mainly came from the news posted in online editions of 15 Vietnamese newspaper websites, where the first 12 were used for the training set, while the last three were used for the development and evaluation set. The text data collected from the first 12 websites cover almost 4 Million word tokens with a vocabulary of 30.000 words resulting in an Out-of-Vocabulary rate of 0% on the development set and 0.067% on the evaluation set. For the text selection we followed the standard GlobalPhone protocols and focused on national and international politics and economics news (see [SCHULTZ 2002]). The transcriptions are provided in Vietnamese-style Roman script, i.e. using several diacritics encoded in UTF-8. The Vietnamese data are organized in a training set of 140 speakers with 22.15 hours of speech, a development set of 10 speakers, 6 from North and 4 from South Vietnam with 1:40 hours of speech and an evaluation set of 10 speakers with same gender and dialect distribution as the development set with 1:30 hours of speech. More details on corpus statistics, collection scenario, and system building based on the Vietnamese part of GlobalPhone can be found under [Vu and Schultz, 2009, 2010].

[Schultz 2002] Tanja Schultz (2002): GlobalPhone: A Multilingual Speech and Text Database developed at Karlsruhe University, Proceedings of the International Conference of Spoken Language Processing, ICSLP 2002, Denver, CO, September 2002. [Vu and Schultz, 2010] Ngoc Thang Vu, Tanja Schultz (2010): Optimization On Vietnamese Large Vocabulary Speech Recognition, 2nd Workshop on Spoken Languages Technologies for Under-resourced Languages, SLTU 2010, Penang, Malaysia, May 2010. [Vu and Schultz, 2009] Ngoc Thang Vu, Tanja Schultz (2009): Vietnamese Large Vocabulary Continuous Speech Recognition, Automatic Speech Recognition and Understanding, ASRU 2009, Merano.

Data from: LaFresCat: a Catalan multi-accent speech dataset for...

zenodo.org

application/gzip, txt

Updated Feb 18, 2025

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

Zenodo (2025). LaFresCat: a Catalan multi-accent speech dataset for text-to-speech [Dataset]. http://doi.org/10.21437/iberspeech.2024-42

Explore at:

txt, application/gzipAvailable download formats

Unique identifier

https://doi.org/10.21437/iberspeech.2024-42

Dataset updated

Feb 18, 2025

Dataset provided by

Zenodo

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

LaFresCat Multiaccent

We present LaFresCat, the first Catalan multiaccented and multispeaker dataset.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Commercial use is only possible through licensing by the voice artists. For further information, contact langtech@bsc.es and lafrescaproduccions@gmail.com.

Dataset Details

Dataset Description

The audios from this dataset have been created with professional studio recordings by professional voice actors in Lafresca Creative Studio. This is the raw version of the dataset, no resampling or trimming has been applied to the audios. Audios are stored in wav format at 48khz sampling rate

In total, there are 4 different accents, with 2 speakers per accent (female and male). After trimming, accumulates a total of 3,75h (divided by speaker IDs) as follows:

Balear
- olga -> 23.5 min
- quim -> 30.93 min
Central
- elia -> 33.14 min
- grau -> 37,86 min
Occidental (North-Western)
- emma -> 28,67 min
- pere -> 25,12 min
Valencia
- gina -> 22,25 min
- lluc -> 23,58 min

Uses

The purpose of this dataset is mainly for training text-to-speech and automatic speech recognition models in Catalan accents.

Languages

The dataset is in Catalan (ca-ES).

Dataset Structure

The dataset consists of 2858 audios and transcriptions in the following structure:lafresca_multiaccent_raw ├── balear │ ├── olga │ ├── olga.txt │ ├── quim │ └── quim.txt ├── central │ ├── elia │ ├── elia.txt │ ├── grau │ └── grau.txt ├── full_filelist.txt ├── occidental │ ├── emma │ ├── emma.txt │ ├── pere │ └── pere.txt └── valencia ├── gina ├── gina.txt ├── lluc └── lluc.txt

Metadata of the dataset can be found in the file `full_filelist.txt` , each line represents an audio and follows the format:

audio_path | speaker_id | transcription

The speaker ids have the following mapping:

"quim": 0,
"olga": 1,
"grau": 2,
"elia": 3,
"pere": 4,
"emma": 5,
"lluc": 6,
"gina": 7

Dataset Creation

This dataset has been created by members of the Language Technologies unit from the Life Sciences department of the Barcelona Supercomputing Center, except the valencian sentences which were created with the support of Cenid, the Digital Intelligence Center of the University of Alicante. The voices belong to professional voice actors and they've been recorded in Lafresca Creative Studio.

Source Data

The data presented in this dataset is the source data.

Data Collection and Processing

These are the technical details of the data collection and processing:

Microphone: Austrian Audio oc818
Preamp: Focusrite ISA Two
Audio Interface: Antelope Orion 32+
DAW: ProTools 2023.6.0

Processing:

Noise Gate: C1 Gate
Compression BF-76
De-Esser Renaissance
EQ Maag EQ2
EQ FabFilter Pro-Q3
Limiter: L1 Ultramaximizer

Here's the information about the speakers:

Dialect	Gender	County
Central	male	Barcelonès
Central	female	Barcelonès
Balear	female	Pla de Mallorca
Balear	male	Llevant
Occidental	male	Baix Ebre
Occidental	female	Baix Ebre
Valencian	female	Ribera Alta
Valencian	male	La Plana Baixa

Who are the source data producers?

The Language Technologies team from the Life Sciences department at the Barcelona Supercomputing Center developed this dataset. It features recordings by professional voice actors made at Lafresca Creative Studio.

Annotations

In order to check whether or not there were any errors in the transcriptions of the audios, we created a Label Studio space. In that space, we manually listened to subset of the dataset, and compared what we heard with the transcription. If the transcription was mistaken, we corrected it.

Personal and Sensitive Information

The dataset consists of professional voice actors who have recorded their voice. You agree to not attempt to determine the identity of speakers in this dataset.

Bias, Risks, and Limitations

Training a Text-to-Speech (TTS) model by fine-tuning with a Catalan speaker who speaks a particular dialect presents significant limitations. Mostly, the challenge is in capturing the full range of variability inherent in that accent. Each dialect has its own unique phonetic, intonational, and prosodic characteristics that can vary greatly even within a single linguistic region. Consequently, a TTS model trained on a narrow dialect sample will struggle to generalize across different accents and sub-dialects, leading to reduced accuracy and naturalness. Additionally, achieving a standard representation is exceedingly difficult because linguistic features can differ markedly not only between dialects but also among individual speakers within the same dialect group. These variations encompass subtle nuances in pronunciation, rhythm, and speech patterns that are challenging to standardize in a model trained on a limited dataset.

Funding

This work has been promoted and financed by the Generalitat de Catalunya through the Aina project, in addition the Valencian sentences have been created within the framework of the NEL-VIVES project 2022/TL22/00215334.

Dataset Card Contact

langtech@bsc.es

u
Maltese crowS-pairs dataset
drum.um.edu.mt
application/csv
Updated Jun 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CLAUDIA BORG; Marthese Borg (2024). Maltese crowS-pairs dataset [Dataset]. http://doi.org/10.60809/drum.26056957.v1
Explore at:
application/csvAvailable download formats
Unique identifier
https://doi.org/10.60809/drum.26056957.v1
Dataset updated
Jun 25, 2024
Dataset provided by
University of Malta
Authors
CLAUDIA BORG; Marthese Borg
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Warning: This dataset contains explicit statements of offensive stereotypes which may be upsetting.The study of bias, fairness and social impact in Natural Language Processing (NLP) lacks resources in languages other than English. Our objective is to support the evaluation of bias in language models in a multilingual setting. We use stereotypes across nine types of biases to build a corpus containing contrasting sentence pairs, one sentence that presents a stereotype concerning an underadvantaged group and another minimally changed sentence, concerning a matching advantaged group.In total, we produced 11,139 new sentence pairs that cover stereotypes dealing with nine types of biases in seven cultural contexts. We use the final resource for the evaluation of relevant monolingual and multilingual masked language models.This file contains the sentence pairs localised to the Maltese context in the Maltese language.Other languages are available here: https://gitlab.inria.fr/corpus4ethics/multilingualcrowspairsThe paper describing this work is available here: https://www.um.edu.mt/library/oar/handle/123456789/121722https://aclanthology.org/2024.lrec-main.1545/To use this dataset, please use the following citation:Karen Fort, Laura Alonso Alemany, Luciana Benotti, Julien Bezançon, Claudia Borg, Marthese Borg, Yongjian Chen, Fanny Ducel, Yoann Dupont, Guido Ivetta, Zhijian Li, Margot Mieskes, Marco Naguib, Yuyan Qian, Matteo Radaelli, Wolfgang S. Schmeisser-Nieto, Emma Raimundo Schulz, Thiziri Saci, Sarah Saidi, et al.. 2024. Your Stereotypical Mileage May Vary: Practical Challenges of Evaluating Biases in Multiple Languages and Cultural Contexts. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 17764–17769, Torino, Italia. ELRA and ICCL.
Z
Swahili : News Classification Dataset
data.niaid.nih.gov
explore.openaire.eu
+1more
Updated Sep 18, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Davis David (2021). Swahili : News Classification Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4300293
Explore at:
Dataset updated
Sep 18, 2021
Dataset authored and provided by
Davis David
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Swahili is spoken by 100-150 million people across East Africa. In Tanzania, it is one of two national languages (the other is English) and it is the official language of instruction in all schools. News in Swahili is an important part of the media sphere in Tanzania.

News contributes to education, technology, and the economic growth of a country, and news in local languages plays an important cultural role in many Africa countries. In the modern age, African languages in news and other spheres are at risk of being lost as English becomes the dominant language in online spaces.

The Swahili news dataset was created to reduce the gap of using the Swahili language to create NLP technologies and help AI practitioners in Tanzania and across the Africa continent to practice their NLP skills to solve different problems in organizations or societies related to the Swahili language. Swahili News were collected from different websites that provide news in the Swahili language. I was able to find some websites that provide news in Swahili only and others in different languages including Swahili.

The dataset was created for a specific task of text classification, this means each news content can be categorized into six different topics (Local News, International News, Finance News, Health News, Sports News, and Entertainment news). The dataset comes with a specified train/test split. The train set contains 75% of the dataset.

Acknowledgment: This project was supported by the AI4D language dataset fellowship through K4All and Zindi Africa.
e
Hamburg Corpus of Argentinean Spanish (HaCASpa) - Dataset - B2FIND
b2find.eudat.eu
Updated Apr 30, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). Hamburg Corpus of Argentinean Spanish (HaCASpa) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/ab6502cb-48a6-530f-9d7d-172737124ff2
Explore at:
Dataset updated
Apr 30, 2023
Area covered
Hamburg
Description
Audio and video recordings of experimental/read and spontaneous speech from adult speakers of Porteño Spanish in Argentina. Speakers are 18-69 years old and from two geographic areas. For the intonational experiments, there are audio recordings only, whereas some of the free interviews and map tasks feature video recordings. The material used as stimuli in the experiments is available with references encoded in the transcriptions. The Hamburg Corpus of Argentinean Spanish (HaCASpa) was compiled in December 2008 and November/December 2009 within the context of the research project The intonation of Spanish in Argentina (H9, director: Christoph Gabriel), part of the Collaborative Research Centre "Multilingalism", funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) and hosted by the University of Hamburg. It comprises data from two varieties of Argentinean Spanish, i.e. a) the dialect spoken in the capital of Buenos Aires (also called Porteño, derived from puerto 'harbor') and b) the variety of the Neuquén/Comahue area (Northern Patagonia). The seven parts of HaCASpa correspond to the seven tasks described below in more detail: Five experiments were carried out in order to elicit specific data for research in prosody, with a main focus on (Task 1–5); in addition, several speakers took part in a free interview (Task 6) and a map task experiment (Task 7). The Task is encoded as a metadata attribute for each communication. HaCASpa comprises three different types of spoken data, depending on the Task, i.e. spontaneous, semi-spontaneous, and scripted speech. This information corresponds to the metadata attribute Speech type. The regional dimension of the corpus is represented through the attribute Area (i.e. Buenos Aires or Neuquén/Comahue), its diachronic dimension through the attribute Age group (i.e. Under 25/Over 25). The subjects are 60 native speakers of the relevant variety of Argentinean Spanish, i.e. Buenos Aires (Porteño) or Nequén/Comahue Spanish. For each speaker, the following information is available: Age, Education, Occupation, Year of school enrollment, Year of school graduation and Parents' mother tongue. The current version 0.2 contains mainly orthographic transcriptions of verbal behaviour (141,000 transcribed words) and codes that relate utterances to the materials used for the experimental tasks. Experimental design: Task (1) consists of two subparts: reading a story (1a) and retelling it (1b). For (1a), the subjects were asked to read the short story "The North Wind and the Sun", which was presented on a computer screen, two times. The fable is well known for its use of phonetic descriptions of different languages (see Handbook of the International Phonetic Association, International Phonetic Association. Cambridge: Cambridge University Press, 2005); the Latin American version we used in our data stems from the Dialectoteca del español, (coordination: C.-E. Piñeros). For (1b), the speakers were instructed to retell the story in their own words without being able to consult the text. With the help of these two parts, data of scripted (part 1a) as well as of semi-spontaneous speech (part 1b) could be collected. Task (2) was designed to collect data of semi-spontaneous speech by asking the subjects to answer questions pertaining to a given picture story. In a first step, the speakers were familiarized with the story, which was presented as two pictures displayed on a computer screen. In a second step, they were asked to answer specific questions about the story. The questions were also presented on the computer screen and varied in their design in order to elicit answers with different information-structural readings (such as broad vs. narrow focus or different focus types). In general, the speakers were free to answer as they wished. However, in order to avoid single word answers, they were asked to utter complete sentences. Task (3) consisted of reading question-answer pairs, the content of which was based on the picture stories already familiar from task (2). The answers were given together with the questions on the computer screen (i.e. one question / one answer) and the speakers simply had to read both the question and the answer. Task (4) was a reading task in which the subjects were asked to utter 10 simple subject-verb-object (SVO) sentences, presented on a computer screen. The speakers were instructed to read them at both normal and fast speech rate. Along the lines proposed in D´Imperio et al. 2005 ("Intonational Phrasing in Romance: The Role of Syntactic and Prosodic Structure", in: Prosodies: With Special Reference to Iberian Languages, ed. by Frota, S. et al., Berlin: Mouton de Gruyter, 59-97), the subject and object constituents differed in their syntactic and prosodic complexity (e.g. determiner plus noun or determiner plus noun plus adjective and one or three prosodic words, respectively). The participants were instructed to read the sentences as if they contained new information. The complete experiment design is described in Gabriel, C. et al. 2011 ("Prosodic phrasing in Porteño Spanish", in: Intonational Phrasing in Romance and Germanic: Cross-Linguistic and Bilingual Studies, ed. by Gabriel, C. & Lleó, C., Amsterdam: Benjamins, 153-182). Task (5), the so-called intonation survey, consisted of 48 situations designed to elicit various intonational contours with specific pragmatic meanings. In this inductive method, the researcher confronts the speaker with a series of hypothetical situations to which he or she is supposed to react verbally. In the Argentinean version of the questionnaire, the hypothetical situations were illustrated by appropriate pictures. The experimental design is described in more detail in Prieto, P. & Roseano, P. 2010 (eds). Transcription of Intonation of the Spanish Language. Munich: Lincom; see also the Interactive atlas of Spanish intonation (coordination: P. Prieto & P. Roseano). Task (6) was conducted to collect spontaneous speech data by conducting free interviews. In this task, the subjects were asked to tell the interviewer something about a past experience, be it a vacation or memories of Argentina as it was decades ago. Even though the interviewer was still part of the conversation, it was mainly the subjects who spoke during the recordings. Task (7) consists of Map Task dialogs. Map Task is a technique employed to collect data of spontaneous speech in which two subjects cooperate to complete a specified task. It is designed to lead the subjects to produce particular interrogative patterns. Each of the two subjects receives a map of an imaginary town marked with buildings and other specific elements. A route is marked on the map of one of the two participants, who assumes the role of the instruction-giver. The version of the same map given to the other participant, who assumes the role of the instruction-follower, differs from that of the instruction-giver in that it does not show the route to be followed. The instruction-follower therefore must ask the instruction-giver questions in order to be able to reproduce the same route on his or her own map (see also the Interactive atlas of Spanish intonation). CLARIN Metadata summary for Hamburg Corpus of Argentinean Spanish (HaCASpa) (CMDI-based) Title: Hamburg Corpus of Argentinean Spanish (HaCASpa) Description: Audio and video recordings of experimental/read and spontaneous speech from adult speakers of Porteño Spanish in Argentina. Speakers are 18-69 years old and from two geographic areas. For the intonational experiments, there are audio recordings only, whereas some of the free interviews and map tasks feature video recordings. The material used as stimuli in the experiments is available with references encoded in the transcriptions. Publication date: 2011-06-30 Data owner: Christoph Gabriel, Institut für Romanistik / Von-Melle-Park 6 / D-20146 Hamburg, christoph.gabriel@uni-hamburg.de Contributors: Christoph Gabriel, Institut für Romanistik / Von-Melle-Park 6 / D-20146 Hamburg, christoph.gabriel@uni-hamburg.de (compiler) Project: H9 "The intonation of Spanish in Argentina", German Research Foundation (DFG) Keywords: contact variety, cross-sectional data, regional variety, language contact, EXMARaLDA Language: Spanish (spa) Size: 63 speakers (39 female, 24 male), 259 communications, 261 recordings, 1119 minutes, 261 transcriptions, 141321 words Annotation types: transcription (manual): mainly orthographic, project-specific conventions, code: reference to underlying prompts Temporal Coverage: 2008-11-01/2009-12-01 Spatial Coverage: Buenos Aires, AR; Neuquén/Comahue, AR Genre: discourse Modality: spoken
d
Data from: Knowledge from non-English-language studies broadens...
search.dataone.org
data.niaid.nih.gov
+1more
Updated May 20, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Filipe Serrano; Valentina Marconi; Stefanie Deinet; Hannah Puleston; Helga Correa; Juan C. DÃaz-Ricaurte; Carolina Farhat; Ricardo Luria-Manzano; Marcio Martins; Eletra Souza; Sergio Souza; Joao Vieira-Alencar; Paula Valdujo; Robin Freeman; Louise McRae (2025). Knowledge from non-English-language studies broadens contributions to conservation policy and helps to tackle bias in biodiversity data [Dataset]. http://doi.org/10.5061/dryad.ngf1vhj68
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.ngf1vhj68
Dataset updated
May 20, 2025
Dataset provided by
Dryad Digital Repository
Authors
Filipe Serrano; Valentina Marconi; Stefanie Deinet; Hannah Puleston; Helga Correa; Juan C. DÃaz-Ricaurte; Carolina Farhat; Ricardo Luria-Manzano; Marcio Martins; Eletra Souza; Sergio Souza; Joao Vieira-Alencar; Paula Valdujo; Robin Freeman; Louise McRae
Description
Local ecological evidence is key to informing conservation. However, many global biodiversity indicators often neglect local ecological evidence published in languages other than English, potentially biassing our understanding of biodiversity trends in areas where English is not the dominant language. Brazil is a megadiverse country with a thriving national scientific publishing landscape. Here, using Brazil and a species abundance indicator as examples, we assess how well bilingual literature searches can both improve data coverage for a country where English is not the primary language and help tackle biases in biodiversity datasets. We conducted a comprehensive screening of articles containing abundance data for vertebrates published in 59 Brazilian journals (articles in Portuguese or English) and 79 international English-only journals. These were grouped into three datasets according to journal origin and article language (Brazilian-Portuguese, Brazilian-English and International). ..., Data collection We collected time-series of vertebrate population abundance suitable for entry into the LPD (livingplanetindex.org), which provides the repository for one of the indicators in the GBF, the Living Planet Index (LPI, Ledger et al., 2023). Despite the continuous addition of new data, LPI coverage remains incomplete for some regions (Living Planet Report 2024 â€“ A System in Peril, 2024). We collected data from three sets of sources: a) Portuguese-language articles from Brazilian journals (hereafter â€œBrazilian-Portugueseâ€ dataset), b) English-language articles from Brazilian journals (â€œBrazilian-Englishâ€ dataset) and c) English-language articles from non-Brazilian journals (â€œInternationalâ€ dataset). For a) and b), we first compiled a list of Brazilian biodiversity-related journals using the list of non-English-language journals in ecology and conservation published by the translatE project (www.translatesciences.com) as a starting point. The International dataset was obtained ..., # Knowledge from non-English-language studies broadens contributions to conservation policy and helps to tackle bias in biodiversity data

Dataset DOI: 10.5061/dryad.ngf1vhj68

Description of the data and file structure

We collected time-series of vertebrate population abundance suitable for entry into the LPD (livingplanetindex.org), which provides the repository for one of the indicators in the GBF, the Living Planet Index (LPI, Ledger et al., 2023).

We collected data from three sets of sources: a) Portuguese-language articles from Brazilian journals (hereafter â€œBrazilian-Portugueseâ€ dataset), b) English-language articles from Brazilian journals (â€œBrazilian-Englishâ€ dataset) and c) English-language articles from non-Brazilian journals (â€œInternationalâ€ dataset). For a) and b), we first compiled a list of Brazilian biodiversity-related journals using the list of non-English-language journals in ecology and conservat...,
D
Data from: Dataset for 'How brands highlight country of origin in magazine...
ssh.datastations.nl
pdf, tsv, txt, xml +1
Updated Jun 8, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
J.M.A. Hornikx; J. van den Heuvel; W.F.J. van Meurs; A.J.M. Janssen; J.M.A. Hornikx; J. van den Heuvel; W.F.J. van Meurs; A.J.M. Janssen (2020). Dataset for 'How brands highlight country of origin in magazine advertising: A content analysis' [Dataset]. http://doi.org/10.17026/DANS-ZTF-W83F
Explore at:
tsv(40846), zip(32664), xml(11286), txt(782), pdf(126553)Available download formats
Unique identifier
https://doi.org/10.17026/DANS-ZTF-W83F
Dataset updated
Jun 8, 2020
Dataset provided by
DANS Data Station Social Sciences and Humanities
Authors
J.M.A. Hornikx; J. van den Heuvel; W.F.J. van Meurs; A.J.M. Janssen; J.M.A. Hornikx; J. van den Heuvel; W.F.J. van Meurs; A.J.M. Janssen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset for content analysis published in "Hornikx, J., Meurs, F. van, Janssen, A., & Heuvel, J. van den (2020). How brands highlight country of origin in magazine advertising: A content analysis. Journal of Global Marketing, 33 (1), 34-45."*Abstract (taken from publication)Aichner (2014) proposes a classification of ways in which brands communicate their country of origin (COO). The current, exploratory study is the first to empirically investigate the frequency with which brands employ such COO markers in magazine advertisements. An analysis of about 750 ads from the British, Dutch, and Spanish editions of Cosmopolitan showed that the prototypical ‘made in’ marker was rarely used, and that ‘COO embedded in company name’ and ‘use of COO language’ were most frequently employed. In all, 36% of the total number of ads contained at least one COO marker, underlining the importance of the COO construct.*Methodology (taken from publication)SampleThe use of COO markers in advertising was examined in print advertisements from three different countries to increase the robustness of the findings. Given the exploratory nature of this study, two practical selection criteria guided our country choice: the three countries included both smaller and larger countries in Europe, and they represented languages that the team was familiar with in order to reliably code the advertisements on the relevant variables. The three European countries selected were the Netherlands, Spain, and the United Kingdom. The dataset for the UK was discarded for testing H1 about the use of English as a foreign language, as will be explained in more detail in the coding procedure.The magazine Cosmopolitan was chosen as the source of advertisements. The choice for one specific magazine title reduces the generalizability of the findings (i.e., limited to the corresponding products and target consumers), but this magazine was chosen intentionally because an informal analysis suggested that it carried advertising for a large number of product categories that are considered ethnic products, such as cosmetics, watches, and shoes (Usunier & Cestre, 2007). This suggestion was corroborated in the main analysis: the majority of the ads in the corpus referred to a product that Usunier and Cestre (2007) classify as ethnic products. Table 2 provides a description of the product categories and brands referred to in the advertisements. Ethnic products have a prototypical COO in the minds of consumers (e.g., cosmetics – France), which makes it likely that the COOs are highlighted through the use of COO markers.Cosmopolitan is an international magazine that has different local editions in the three countries. The magazine, which is targeted at younger women (18–35 years old), reaches more than three million young women per month through its online, social and print platforms in the Netherlands (Hearst Netherlands, 2016), has about 517,000 readers per month in Spain (PrNoticias, 2016) and about 1.18 million readers per month in the UK (Hearst Magazine U.K., 2016).The sample consisted of all advertisements from all monthly issues that appeared in 2016 in the three countries. This whole-year cluster was selected so as to prevent potential seasonal influences (Neuendorf, 2002). In total, the corpus consisted of 745 advertisements, of which 111 were from the Dutch, 367 from the British and 267 from the Spanish Cosmopolitan. Two categories of ads were excluded in the selection process: (1) advertisements for subscription to Cosmopolitan itself, and (2) advertisements that were identical to ads that had appeared in another issue in one of the three countries. As a result, each advertisement was unique.Coding procedureFor all advertisements, four variables were coded: product type, presence of types of COO markers, COO referred to, and the use of English as a COO marker. In the first place, product type was assessed by the two coders. Coders classified each product to one of the 32 product types. In order to assess the reliability of the codings, ten per cent of the ads were independently coded by a second coder. The interrater reliability of the variable product category was good (κ = .97, p < .000, 97.33% agreement between both coders). Table 2 lists the most frequent product types; the label ‘other’ covers 17 types of product, including charity, education, and furniture.In the second place, it was recorded whether one or more of the COO markers occurred in a given ad. In the third place, if a marker was identified, it was assessed to which COO the markers referred. Table 1 lists the nine possible COO markers defined by Aichner (2014) and the COOs referred to, with examples taken from the current content analysis. The interrater reliability for the type of COO marker was very good (κ = .80, p < .000, 96.30% agreement between the coders), and the interrater reliability for COO referred to was excellent (κ = 1.00, p < .000).After the independent assessments of the two...
e
International Passenger Survey, 2004 - Dataset - B2FIND
b2find.eudat.eu
Updated Oct 22, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). International Passenger Survey, 2004 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/831295ba-61af-5c9c-888e-620673accf95
Explore at:
Dataset updated
Oct 22, 2023
Description
Abstract copyright UK Data Service and data collection copyright owner.The International Passenger Survey (IPS) aims to collect data on both credits and debits for the travel account of the Balance of Payments, provide detailed visit information on overseas visitors to the United Kingdom (UK) for tourism policy, and collect data on international migration. The depositor recommends that only expert users who are familiar with the coding and weighting structures use this dataset as limited support is available. Some considerable understanding of the data is required before meaningful analyses can be made, and care must be taken when performing time series operations as codes can vary from year to year. Not all variables from one years dataset are used in other years. Weighting the IPS ONS advise that the variable 'fweight' included in the 'Qcontact' dataset should be applied to get an overall weighted profile. This weight is set consistently overtime. Other weights are provided to analyse finer detail but no information is provided in the documentation. ONS are currently reviewing the documentation but in the meantime if users have any detailed weighting questions they should contact the ONS at: socialsurveys@ons.gov.uk The data cover four subject areas, AIRMILES, ALCOHOL, QREGTOWN and QCONTACT (one file per quarter per subject area). These can be joined together using the variables YEAR, SERIAL, FLOW and QUARTER. For the fifth edition, data for all quarters included in the study were replaced with final versions. Main Topics: The main dataset for 2004 covers: AIRMILES - quarter; flow; serial; United Kingdom port or route; direct leg overseas port; final overseas port; distance from United Kingdom port to first port; from first to second port; from United Kingdom port to second port. ALCOHOL - year; quarter; month; flow; serial; money spent on spirits; wine; beer; cigarettes; hand-rolled and other tobacco. QREGTOWN - year; quarter; month; flow; serial; towns stayed in overnight; details of type of accommodation; number of nights spent in towns; expenditure in towns; regional stay weight; regional visit weight; regional expenditure weight; various validation checks. QCONTACT - year; quarter; month; flow; serial; nationality; country of visit/residence; United Kingdom counties; date visit began; purpose of visit; organised conference; intended length of stay; number of people; package tour and cost; expenditure pre, post and during visit; flight prefix and suffix; first carrier air or shipping line; direct leg overseas port; final overseas port; long or short haul; type of vehicle; number travelling in vehicle; fare type and cost; class of travel; business trip; type of flight; flight origin or destination; gender; age group; United Kingdom port or route; quality of response; date of interview; money transfer, net and total expenditure; Belgian language spoken; distance driven in UK by overseas driver; miles or kilometres; type of transport; arrivals (number of adults); departures (type of travelling group, number of adults and children); weighting variables; various validation checks. Multi-stage stratified random sample Face-to-face interview
H
Guinea: Languages
data.humdata.org
csv
Updated Sep 1, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CLEAR Global (previously Translators without Borders) (2025). Guinea: Languages [Dataset]. https://data.humdata.org/dataset/0e0c217a-7764-41fc-bba9-ddbc1a40bb8f?force_layout=desktop
Explore at:
csv(5490), csv(155270), csv(42253)Available download formats
Dataset updated
Sep 1, 2025
Dataset provided by
CLEAR Global (previously Translators without Borders)
License
http://www.opendefinition.org/licenses/cc-by-sahttp://www.opendefinition.org/licenses/cc-by-sa
Area covered
Guinea
Description
Data on languages spoken in Guinea, showing the main language spoken in the household by proportion of the population. Data is drawn from IPUMS International. For more resources on the languages of Guinea and language use in humanitarian contexts please visit: https://clearglobal.org/language-maps-and-data/
Language Named Authority List
data.europa.eu
rdf xml, xml, zip
Updated Sep 26, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Publications Office of the European Union (2024). Language Named Authority List [Dataset]. https://data.europa.eu/data/datasets/language?locale=en
Explore at:
xml, rdf xml, zipAvailable download formats
Dataset updated
Sep 26, 2024
Dataset provided by
Publications Office of the European Unionhttp://op.europa.eu/
European Union-
Authors
Publications Office of the European Union
License
http://data.europa.eu/eli/dec/2011/833/ojhttp://data.europa.eu/eli/dec/2011/833/oj
Description
Language is a controlled vocabulary that lists world languages and language varieties, including sign languages. Its main purpose is to support activities associated with the publication process. The full set of languages contains more than 8000 language varieties, each identified by a code equivalent to the ISO 639-3 code. Concepts are aligned with the ISO 639 international standard, which is issued in several parts: ISO 639-1 contains strictly two alphabetic letters (alpha-2), ISO 639-2/B (B = bibliographic) is used for bibliographic purpose (alpha-3), ISO 639-2/T (T = terminology) is used for technical purpose (alpha-3), ISO 639-3 covers all the languages and macro-languages of the world (alpha-3); the values are compliant with ISO 639-2/T. If an authority code is needed for a language without an assigned ISO code, an alphanumeric code is created to avoid confusion with the strictly alphabetic ISO codes. Labels are provided in all 24 official EU languages for the most frequently used languages. Language is under governance of the Interinstitutional Metadata and Formats Committee (IMFC). It is maintained by the Publications Office of the European Union and disseminated on the EU Vocabularies website. It is a corporate reference data asset covered by the Corporate Reference Data Management policy of the European Commission.
Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph...
zenodo.org
data.niaid.nih.gov
zip
Updated May 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nandana Mihindukulasooriya; Nandana Mihindukulasooriya; Sanju Tiwari; Sanju Tiwari; Carlos F. Enguix; Carlos F. Enguix; Kusum Lata; Kusum Lata (2023). Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text [Dataset]. http://doi.org/10.5281/zenodo.7916716
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7916716
Dataset updated
May 23, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Nandana Mihindukulasooriya; Nandana Mihindukulasooriya; Sanju Tiwari; Sanju Tiwari; Carlos F. Enguix; Carlos F. Enguix; Kusum Lata; Kusum Lata
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the repository for ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.

It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.

An example

An example test sentence:

Test Sentence: {"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King."}

An example of ontology:

Ontology: Music Ontology

Expected Output:

{ "id": "ont_k_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.", "triples": [ { "sub": "The Loco-Motion", "rel": "publication date", "obj": "01 January 1962" },{ "sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin" },{ "sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Carole King" },] }

The data is released under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY 4.0) License.

The structure of the repo is as the following.

Text2KGBench

src: the source code used for generation and evaluation, and baseline

benchmark the code used to generate the benchmark

evaluation evaluation scripts for calculating the results

baseline code for generating the baselines including prompts, sentence similarities, and LLM client.

data: the benchmark datasets and baseline data. There are two datasets: wikidata_tekgen and dbpedia_webnlg.

wikidata_tekgen Wikidata-TekGen Dataset

ontologies 10 ontologies used by this dataset

train training data

test test data

manually_verified_sentences ids of a subset of test cases manually validated

unseen_sentences new sentences that are added by the authors which are not part of Wikipedia

test unseen test unseen test sentences

ground_truth ground truth for unseen test sentences.

ground_truth ground truth for the test data

baselines data related to running the baselines.

test_train_sent_similarity for each test case, 5 most similar train sentences generated using SBERT T5-XXL model.

prompts prompts corresponding to each test file

unseen prompts unseen prompts for the unseen test cases

Alpaca-LoRA-13B data related to the Alpaca-LoRA model

llm_responses raw LLM responses and extracted triples

eval_metrics ontology-level and aggregated evaluation results

unseen results results for the unseen test cases

llm_responses raw LLM responses and extracted triples

eval_metrics ontology-level and aggregated evaluation results

Vicuna-13B data related to the Vicuna-13B model

llm_responses raw LLM responses and extracted triples

eval_metrics ontology-level and aggregated evaluation results

dbpedia_webnlg DBpedia Dataset

ontologies 19 ontologies used by this dataset

train training data

test test data

ground_truth ground truth for the test data

baselines data related to running the baselines.

test_train_sent_similarity for each test case, 5 most similar train sentences generated using SBERT T5-XXL model.

prompts prompts corresponding to each test file

Alpaca-LoRA-13B data related to the Alpaca-LoRA model

llm_responses raw LLM responses and extracted triples

eval_metrics ontology-level and aggregated evaluation results

Vicuna-13B data related to the Vicuna-13B model

llm_responses raw LLM responses and extracted triples

eval_metrics ontology-level and aggregated evaluation results

This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1] released under CC BY-SA 2.0 license and WebNLG 3.0 corpus [2] released under CC BY-NC-SA 4.0 license.

[1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.

[2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
H
Philippines: Languages
data.humdata.org
csv
Updated Aug 31, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CLEAR Global (previously Translators without Borders) (2025). Philippines: Languages [Dataset]. https://data.humdata.org/dataset/4383caa9-b4e0-4608-80f6-2b74482fe8bd?force_layout=desktop
Explore at:
csv(320616), csv(709530), csv(29859)Available download formats
Dataset updated
Aug 31, 2025
Dataset provided by
CLEAR Global (previously Translators without Borders)
License
http://www.opendefinition.org/licenses/cc-by-sahttp://www.opendefinition.org/licenses/cc-by-sa
Area covered
Philippines
Description
Data on languages spoken in Philippines, showing the main language spoken in the household by proportion of the population. Data is drawn from IPUMS International. For more resources on the languages of Philippines and language use in humanitarian contexts please visit: https://clearglobal.org/language-maps-and-data/
Multilingual publishing in the SSH in Poland and attitudes towards English
zenodo.org
data.niaid.nih.gov
Updated Jul 2, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Krystyna Warchał; Krystyna Warchał; Paweł Zakrajewski; Paweł Zakrajewski (2024). Multilingual publishing in the SSH in Poland and attitudes towards English [Dataset]. http://doi.org/10.5281/zenodo.7740484
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.7740484
Dataset updated
Jul 2, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Krystyna Warchał; Krystyna Warchał; Paweł Zakrajewski; Paweł Zakrajewski
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Poland
Description
This dataset contains responses to a self-constructed questionnaire that was designed to provide the following information: What languages are used for research dissemination in various SSH disciplines in Poland? What are the main languages of the research cited by authors in these disciplines? When languages other than Polish are used for research dissemination, are the results published in national or international venues? What are the prevailing reasons for language choices? What is the position of English in SSH disciplines in Poland? What is the attitude of Polish SSH scholars towards the dominance of English as the international language of science?

The questionnaire was written in Polish and consisted of 52 items arranged in three thematic lines: multilingual publication practices, the role of English in research dissemination, and attitudes to English as the global language of science.

The data were collected in an online survey based on Google Forms. The link to the questionnaire was distributed via email among scholars affiliated with the social sciences and humanities units of 20 Polish universities which took part in the Excellence Initiative – Research University competition, a funding programme launched in 2019 by the Ministry of Science and Higher Education (https://www.gov.pl/web/science/the-excellence-initiative---research-university-programme).

Time of data collection: 12 October 2020–16 November 2020.

Volume of data and the response rate: 12,100 emails sent; 1,575 completed forms received (response rate about 13%); 50 forms removed (contradictory or random responses; this dataset is limited to speakers of Polish as the first language).

The classification of fields and disciplines follows Polish regulations in force at the time of the study (Regulation of the Polish Minister of Science and Higher Education of 20 September 2018 on Classification of fields and disciplines of science and disciplines of the arts; Journal of Laws 2018, item 1818). Compared to the OECD classification, the main points of difference involve the status of history and archaeology, linguistics and literary studies, and economics and management as separate disciplines.

The data collection was not funded by any external source.
Z
CT-FAN: A Multilingual dataset for Fake News Detection
data.niaid.nih.gov
Updated Oct 23, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Michael Wiegand (2022). CT-FAN: A Multilingual dataset for Fake News Detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4714516
Explore at:
Dataset updated
Oct 23, 2022
Dataset provided by
Thomas Mandl
Julia Maria Struß
Gautam Kishore Shahi
Michael Wiegand
Melanie Siegel
Juliane Köhler
Description
By downloading the data, you agree with the terms & conditions mentioned below:

Data Access: The data in the research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes.

Summaries, analyses and interpretations of the linguistic properties of the information may be derived and published, provided it is impossible to reconstruct the information from these summaries. You may not try identifying the individuals whose texts are included in this dataset. You may not try to identify the original entry on the fact-checking site. You are not permitted to publish any portion of the dataset besides summary statistics or share it with anyone else.

We grant you the right to access the collection's content as described in this agreement. You may not otherwise make unauthorised commercial use of, reproduce, prepare derivative works, distribute copies, perform, or publicly display the collection or parts of it. You are responsible for keeping and storing the data in a way that others cannot access. The data is provided free of charge.

Citation

Please cite our work as

@InProceedings{clef-checkthat:2022:task3, author = {K{"o}hler, Juliane and Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Wiegand, Michael and Siegel, Melanie and Mandl, Thomas}, title = "Overview of the {CLEF}-2022 {CheckThat}! Lab Task 3 on Fake News Detection", year = {2022}, booktitle = "Working Notes of CLEF 2022---Conference and Labs of the Evaluation Forum", series = {CLEF~'2022}, address = {Bologna, Italy},}

@article{shahi2021overview, title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection}, author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas}, journal={Working Notes of CLEF}, year={2021} }

Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.

Task 3: Multi-class fake news detection of news articles (English) Sub-task A would detect fake news designed as a four-class classification problem. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. The training data will be released in batches and roughly about 1264 articles with the respective label in English language. Our definitions for the categories are as follows:

False - The main claim made in an article is untrue.

Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

True - This rating indicates that the primary elements of the main claim are demonstrably true.

Other- An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.

Cross-Lingual Task (German)

Along with the multi-class task for the English language, we have introduced a task for low-resourced language. We will provide the data for the test in the German language. The idea of the task is to use the English data and the concept of transfer to build a classification model for the German language.

Input Data

The data will be provided in the format of Id, title, text, rating, the domain; the description of the columns is as follows:

ID- Unique identifier of the news article

Title- Title of the news article

text- Text mentioned inside the news article

our rating - class of the news article as false, partially false, true, other

Output data format

public_id- Unique identifier of the news article

predicted_rating- predicted class

Sample File

public_id, predicted_rating 1, false 2, true

IMPORTANT!

We have used the data from 2010 to 2022, and the content of fake news is mixed up with several topics like elections, COVID-19 etc.

Baseline: For this task, we have created a baseline system. The baseline system can be found at https://zenodo.org/record/6362498

Related Work

Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1.https://arxiv.org/pdf/2010.00502.pdf

G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14

Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104

Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.

Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeno, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.

Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.
f
Navigating News Narratives: A Media Bias Analysis Dataset
figshare.com
txt
Updated Dec 8, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shaina Raza (2023). Navigating News Narratives: A Media Bias Analysis Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.24422122.v4
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.24422122.v4
Dataset updated
Dec 8, 2023
Dataset provided by
figshare
Authors
Shaina Raza
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The prevalence of bias in the news media has become a critical issue, affecting public perception on a range of important topics such as political views, health, insurance, resource distributions, religion, race, age, gender, occupation, and climate change. The media has a moral responsibility to ensure accurate information dissemination and to increase awareness about important issues and the potential risks associated with them. This highlights the need for a solution that can help mitigate against the spread of false or misleading information and restore public trust in the media.Data description: This is a dataset for news media bias covering different dimensions of the biases: political, hate speech, political, toxicity, sexism, ageism, gender identity, gender discrimination, race/ethnicity, climate change, occupation, spirituality, which makes it a unique contribution. The dataset used for this project does not contain any personally identifiable information (PII).The data structure is tabulated as follows:Text: The main content.Dimension: Descriptive category of the text.Biased_Words: A compilation of words regarded as biased.Aspect: Specific sub-topic within the main content.Label: Indicates the presence (True) or absence (False) of bias. The label is ternary - highly biased, slightly biased and neutralToxicity: Indicates the presence (True) or absence (False) of bias.Identity_mention: Mention of any identity based on words match.Annotation SchemeThe labels and annotations in the dataset are generated through a system of Active Learning, cycling through:Manual LabelingSemi-Supervised LearningHuman VerificationThe scheme comprises:Bias Label: Specifies the degree of bias (e.g., no bias, mild, or strong).Words/Phrases Level Biases: Pinpoints specific biased terms or phrases.Subjective Bias (Aspect): Highlights biases pertinent to content dimensions.Due to the nuances of semantic match algorithms, certain labels such as 'identity' and 'aspect' may appear distinctively different.List of datasets used : We curated different news categories like Climate crisis news summaries , occupational, spiritual/faith/ general using RSS to capture different dimensions of the news media biases. The annotation is performed using active learning to label the sentence (either neural/ slightly biased/ highly biased) and to pick biased words from the news.We also utilize publicly available data from the following links. Our Attribution to others.MBIC (media bias): Spinde, Timo, Lada Rudnitckaia, Kanishka Sinha, Felix Hamborg, Bela Gipp, and Karsten Donnay. "MBIC--A Media Bias Annotation Dataset Including Annotator Characteristics." arXiv preprint arXiv:2105.11910 (2021). https://zenodo.org/records/4474336Hyperpartisan news: Kiesel, Johannes, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. "Semeval-2019 task 4: Hyperpartisan news detection." In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 829-839. 2019. https://huggingface.co/datasets/hyperpartisan_news_detectionToxic comment classification: Adams, C.J., Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, Nithum, and Will Cukierski. 2017. "Toxic Comment Classification Challenge." Kaggle. https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge.Jigsaw Unintended Bias: Adams, C.J., Daniel Borkan, Inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum. 2019. "Jigsaw Unintended Bias in Toxicity Classification." Kaggle. https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification.Age Bias : Díaz, Mark, Isaac Johnson, Amanda Lazar, Anne Marie Piper, and Darren Gergle. "Addressing age-related bias in sentiment analysis." In Proceedings of the 2018 chi conference on human factors in computing systems, pp. 1-14. 2018. Age Bias Training and Testing Data - Age Bias and Sentiment Analysis Dataverse (harvard.edu)Multi-dimensional news Ukraine: Färber, Michael, Victoria Burkard, Adam Jatowt, and Sora Lim. "A multidimensional dataset based on crowdsourcing for analyzing and detecting news bias." In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 3007-3014. 2020. https://zenodo.org/records/3885351#.ZF0KoxHMLtVSocial biases: Sap, Maarten, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. "Social bias frames: Reasoning about social and power implications of language." arXiv preprint arXiv:1911.03891 (2019). https://maartensap.com/social-bias-frames/Goal of this dataset :We want to offer open and free access to dataset, ensuring a wide reach to researchers and AI practitioners across the world. The dataset should be user-friendly to use and uploading and accessing data should be straightforward, to facilitate usage.If you use this dataset, please cite us.Navigating News Narratives: A Media Bias Analysis Dataset © 2023 by Shaina Raza, Vector Institute is licensed under CC BY-NC 4.0
International Passenger Survey, 2001
beta.ukdataservice.ac.uk
Updated 2004
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Social Survey Division Office For National Statistics (2004). International Passenger Survey, 2001 [Dataset]. http://doi.org/10.5255/ukda-sn-4582-1
Explore at:
Unique identifier
https://doi.org/10.5255/ukda-sn-4582-1
Dataset updated
2004
Dataset provided by
DataCitehttps://www.datacite.org/
UK Data Servicehttps://ukdataservice.ac.uk/
Authors
Social Survey Division Office For National Statistics
Description
The International Passenger Survey (IPS) aims to collect data on both credits and debits for the travel account of the Balance of Payments, provide detailed visit information on overseas visitors to the United Kingdom (UK) for tourism policy, and collect data on international migration.
The depositor recommends that only expert users who are familiar with the coding and weighting structures use this dataset as limited support is available. Some considerable understanding of the data is required before meaningful analyses can be made, and care must be taken when performing time series operations as codes can vary from year to year. Not all variables from one years dataset are used in other years. The data covers five subject areas, three of which, ALCOHOL, QREGTOWN and QCONTACT are held quarterly in four files per subject area, the fourth AIRMILES is held as a complete year. These can be joined together using the variables YEAR, SERIAL, FLOW and QUARTER. The fifth subject area, SALANG, also held as a complete year, contains the main language spoken in South Africa and can be linked by the variable QUARTER.

The weighting of IPS data is complex and done in several stages. When working with the system weights, great care should be taken to read the documentation concerning weighting procedures as not all records are treated in exactly the same way.
Data from: Quotebank: A Corpus of Quotations from a Decade of News
zenodo.org
explore.openaire.eu
bz2
Updated Jun 18, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Timoté Vaucher; Andreas Spitz; Michele Catasta; Robert West; Timoté Vaucher; Andreas Spitz; Michele Catasta; Robert West (2023). Quotebank: A Corpus of Quotations from a Decade of News [Dataset]. http://doi.org/10.5281/zenodo.4277311
Explore at:
bz2Available download formats
Unique identifier
https://doi.org/10.5281/zenodo.4277311
Dataset updated
Jun 18, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Timoté Vaucher; Andreas Spitz; Michele Catasta; Robert West; Timoté Vaucher; Andreas Spitz; Michele Catasta; Robert West
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Introduction

Quotebank is a dataset of 235 million unique, speaker-attributed quotations that were extracted from 196 million English news articles (127 million containing quotations) crawled from over 377 thousand web domains (15 thousand root domains) between September 2008 and April 2020. The quotations were extracted and attributed using Quobert, a distantly and minimally supervised end-to-end, language-agnostic framework for quotation attribution.

For further details, please refer to the description below and to the original paper:

Timoté Vaucher, Andreas Spitz, Michele Catasta, and Robert West
"Quotebank: A Corpus of Quotations from a Decade of News"
Proceedings of the 14th International ACM Conference on Web Search and Data Mining (WSDM), 2021.
https://doi.org/10.1145/3437963.3441760

When using the dataset, please cite the above paper (Note that the above numbers differ from those listed in the paper, as the updated data in this repository has been computed from an expanded set of input news articles).

Dataset summary

The dataset consists of two versions:

Quotation-centric version (quotes-YYYY.json.bz2)
An aggregated set of unique quotations with the most likely speaker. Each unique quotation occurs only once in this version of the data and the probabilities of the candidate speakers to which the quotation can be attributed are aggregated over all occurrences of the quotation. This version of the data is a minimal - but complete - list of attributed quotations that is aimed at users who only require quotation-speaker attributions, but no individual contexts for these quotations from the original articles.

Article-centric version (quotebank-YYYY.json.bz2)
A complete set of all individual quotation mentions with associated speaker as well as the article context in which they are mentioned. This larger version contains one entry per article in the news data. Each entry contains all speakers that appear in the news article as well as the (attributed) quotations, alongside a context window surrounding the quotations.

Both versions are split into 13 files (one per year) for ease of downloading and handling.

Dataset details

The following formatting applies to both versions of the dataset:

All data is made available in JSON format that has been compressed using bzip2.

The data is split per year (i.e., there is one file for each calendar year).

The offsets of quotations, contexts, and speaker annotations are given in units of Penn TreeBank Tokenizer tokens.

Offsets are zero-based and are computed from the start of the article.

When pairs of offsets are provided, the end offset is non-inclusive (e.g. in Python you can call tokens[start:end] without having to do end+1).

The Spinn3r data from which Quotebank was extracted had been collected over the course of over a decade. During this time, the client-side code used for collecting the data changed several times, and various character-encoding-related issues led to different representations of the original text at different times. We thus divide the 12 years spanned by the Spinn3r corpus into five phases (Phases A through E). A detailed description is available on GitHub; the key takeaways are that (1) text was lowercased in Phases A, B, and C, whereas the original capitalization was maintained in Phases D and E, and that (2) non-ASCII characters are properly represented only in Phase E.

Version 1: Quotation-centric data

In this version of the dataset, the quotations are aggregated across all their occurrences in the news article data, and assigned a probability for each speaker candidate. We consider two quotations to be equivalent and suitable for aggregation if they are identical after lower-casing and removing punctuation.

Quotation-centric data |-- quoteID: Primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}") |-- quotation: Text of the longest encountered original form of the quotation |-- date: Earliest occurrence date of any version of the quotation |-- phase: Corresponding phase of the data in which the quotation first occurred (A-E) |-- probas: Array representing the probabilities of each speaker having uttered the quotation. The probabilities across different occurrences of the same quotation are summed for each distinct candidate speaker and then normalized |-- proba: Probability for a given speaker |-- speaker: Most frequent surface form for a given speaker in the articles where the quotation occurred |-- speaker: Selected most likely speaker. This matches the the first speaker entry in `probas` |-- qids: Wikidata IDs of all aliases that match the selected speaker |-- numOccurrences: Number of time this quotation occurs in the articles |-- urls: List of links to the original articles containing the quotation

Note that for some speakers there can be more than one Wikidata ID in the `qids` field. To access Wikidata information about those speakers it is necessary to disambiguate them, i.e., select one of the listed Wikidata IDs that most likely corresponds to the respective speaker. Speaker disambiguation can be done using scripts available in the quotebank-toolkit repository. Additionally, the repository contains useful scripts for cleaning and enriching Quotebank.

Version 2: Article-centric data

In this data set, individual quotations are not aggregated. For each article, one JSON entry contains all speakers that appear in the news article, the (attributed) quotations, and the text within a context window surrounding each of the quotations.

Article-centric data |-- articleID: Primary key |-- articleLength: Length of the article in PTB tokens |-- date: Publication date of the article |-- phase: Corresponding phase in which the article appeared (A-E) |-- title: Title of the article |-- url: Link to the original article |-- names: List of all extracted speakers that occur in the article |-- name: Surface form of the first occurrence of each speaker in the article |-- ids: List of Wikidata IDs that have `name` as a possible alias |-- offsets: List of pairs of start/end offset, signifying positions at which the speaker occurs in the article (full and partial mention of the speaker) |-- quotations: List of all the quotations that appear in the article |-- quoteID: Foreign key of the quotation (from the quotation-centric dataset) |-- quotation: Text of the quotation as it occurs in this article |-- quotationOffset: Index where the quotation starts in the article |-- leftContext: Text in the left context window of the quotation (used for the attribution) |-- rightContext: Text in the right context window (used for the attribution) |-- globalProbas: Array representing the probabilities of each speaker having uttered the quote *at the aggregated level*. Same as `probas` for a given `quoteID` |-- globalTopSpeaker: Most probable speaker *at the aggregated level*. Same as `speaker` for a given `quoteID` |-- localProbas: Array representing the probabilities of each speaker having said the quote *given this article context*. |-- proba: Probability for a given speaker |-- speaker: Name of the speaker as it first occurs in this article |-- localTopSpeaker: Selected speaker. Same name as the first entry in `localProbas` |-- numOccurrences: Number of times this quotation occurs in any article

Code repository

The code of Quobert that was used for the extraction and attribution of this data set is available and managed in a Github repository, which you can find here.
Countries and territories Named Authority List
data.europa.eu
rdf xml, xml, zip
Updated Dec 3, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Publications Office of the European Union (2021). Countries and territories Named Authority List [Dataset]. https://data.europa.eu/data/datasets/country?locale=en
Explore at:
xml, rdf xml, zipAvailable download formats
Dataset updated
Dec 3, 2021
Dataset provided by
Publications Office of the European Unionhttp://op.europa.eu/
European Union-
Authors
Publications Office of the European Union
License
http://data.europa.eu/eli/dec/2011/833/ojhttp://data.europa.eu/eli/dec/2011/833/oj
Description
Countries and territories is a controlled vocabulary that lists concepts associated with names of countries and territories. It is a corporate reference data asset covered by the Corporate Reference Data Management policy of the European Commission. It provides codes and names of geospatial and geopolitical entities in all official EU languages and is the result of a combination of multiple relevant standards, created to serve the requirements and use cases of the EU institutions services. Its main scope is to support documentary metadata activities. The codes of the concepts included are correlated with the ISO 3166 international standard. The authority code relies where possible on the ISO 3166-1 alpha-3 code. Additional user-assigned alpha-3 codes have been used to cover entities that are not included in the ISO 3166-1 standard. The corporate list contains mappings with the ISO 3166-1 two-letter codes, the Interinstitutional Style Guide codes and with other internal and external identifiers including ISO 3166-1 numeric, ISO 3166-3, UNSD M49, UNSD Geoscheme, IBAN, TIR, IANA domain. For the names of countries and territories, the corporate list synchronises with the Interinstitutional Style Guide (ISG, Section 7.1 and Annexes A5 and A6) and with the IATE terminology database. Membership and classification properties provide possibilities to group concepts, e.g., UN, EU, EEA, EFTA, Schengen area, Euro area, NATO, OECD, UCPM, ENP-EAST, ENP-SOUTH, EU candidate countries and potential candidates. Countries and territories is maintained by the Publications Office of the European Union and disseminated on the EU Vocabularies website. Regular updates are foreseen based on its stakeholders’ needs. Downloads in human-readable formats (.csv, .html) are also available.
Z
Data from: Synthetic Product Desirability Datasets for Sentiment Analysis...
data.niaid.nih.gov
researchworks.creighton.edu
+1more
Updated Nov 21, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Thompson, Warren (2024). Synthetic Product Desirability Datasets for Sentiment Analysis Testing [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14188455
Explore at:
Dataset updated
Nov 21, 2024
Dataset provided by
Weitl-Harms, Sherri
Myers, Zachary
Hastings, John
Doty, Joseph
Thompson, Warren
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Overview:This collection contains three synthetic datasets produced by gpt-4o-mini for sentiment analysis and PDT (Product Desirability Toolkit) testing. Each dataset contains 1000 hypothetical software product reviews with the aim to produce a diversity of sentiment and text. The datasets were created as part of the research described in:

Hastings, J.D., Weitl-Harms, S., Doty, J., Myers, Z. L., and Thompson, W., “Utilizing Large Language Models to Synthesize Product Desirability Datasets,” in Proceedings of the 2024 IEEE International Conferenceon Big Data (BigData-24), Workshop on Large Language and Foundation Models (WLLFM-24), Dec. 2024.https://arxiv.org/abs/2411.13485.

Briefly, each row in the datasets was produced as follows:1) Word+Review: The LLM selected a word and synthesized a review that would align with a random target sentiment.2) Review+Word: The LLM produced a review to align with the target sentiment score, and then selected a word appropriate for the review.3) Supply-Word: A word was supplied to the LLM which was then scored, and a review was produced to align with that score.

For sentiment analysis and PDT testing, the two columns of main interest across the datasets are likely 'Selected Word' and 'Hypothetical Review'.

License:This data is licensed under the CC Attribution 4.0 international license, and may be taken and used freely with credit given. Cite as:

Hastings, J., Weitl-Harms, S., Doty, J., Myers, Z., & Thompson, W. (2024). Synthetic Product Desirability Datasets for Sentiment Analysis Testing (1.0.0). Zenodo. https://doi.org/10.5281/zenodo.14188456
f
Data from: Environmental factors and their associations with...
scielo.figshare.com
datasetcatalog.nlm.nih.gov
jpeg
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Marina Garcia de Souza Borges; Adriane Mesquita de Medeiros; Stela Maris Aguiar Lemos (2023). Environmental factors and their associations with speech-language-hearing diagnostic hypotheses in children and adolescents [Dataset]. http://doi.org/10.6084/m9.figshare.20022500.v1
Explore at:
jpegAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.20022500.v1
Dataset updated
May 31, 2023
Dataset provided by
SciELO journals
Authors
Marina Garcia de Souza Borges; Adriane Mesquita de Medeiros; Stela Maris Aguiar Lemos
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ABSTRACT Purpose: to analyze the associations between speech-language-hearing diagnostic hypotheses in children and adolescents and the Environmental Factors in the International Classification of Functioning, Disability, and Health. Methods: an observational, analytical, cross-sectional study carried out between 2016 and 2019 in an outpatient center with 5- to 16-year-old children and adolescents undergoing speech-language-hearing assessment and their parents/guardians. The Brazilian Economic Classification Criteria was used, and sociodemographic data were collected, along with speech-language-hearing diagnostic hypotheses and information on the presence of categories of the Environmental Factors, qualified as either barriers or facilitators. Descriptive and association analyses were made, using Pearson’s chi-square and Fisher’s Exact tests, with the significance level set at 0.05. Results: most participants had changes in oral language acquisition/development, written language, and oral-motor function. The most prevalent facilitators were in the categories of Services, Systems, and Policies; Support and Relationships; and Products and Technology, whereas the barriers were in the categories of Attitudes; Products and Technology; and Services, Systems, and Policies. The diagnostic hypotheses of “Change in cognitive aspects of language”, “Change in speech”, and “Change in voice” had a significant association with the codes present in chapters 3 - Support and Relationships, and 4 - Attitudes. Conclusion: this association shows that patients with communication changes need a comprehensive approach encompassing the Contextual Factors.

Facebook

Twitter

Click to copy link

Link copied

Cite

GlobalPhone Vietnamese [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/2100

GlobalPhone Vietnamese

Explore at:

audio formatAvailable download formats

License

http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttp://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttp://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

Description

The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.

Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data could be delivered unshorten.

The Vietnamese part of GlobalPhone was collected in summer 2009. In total 160 speakers were recorded, 140 of them in the cities of Hanoi and Ho Chi Minh City in Vietnam, and an additional set of 20 speakers were recorded in Karlsruhe, Germany. All speakers are Vietnamese native speakers, covering the main dialectal variants from South and North Vietnam. Of these 160 speakers, 70 were female and 90 were male. The majority of speakers are well educated, being graduated students and engineers. The age distribution of the speakers ranges from 18 to 65 years. Each speaker read between 50 and 200 utterances from newspaper articles, corresponding to roughly 9.5 minutes of speech or 138 utterances per person, in total we recorded 22.112 utterances. The speech was recorded using a close-talking microphone Sennheiser HM420 in a push-to-talk scenario using an inhouse developed modern laptop-based data collection toolkit. All data were recorded at 16kHz and 16bit resolution in PCM format. The data collection took place in small-sized rooms with very low background noise. Information on recording place and environmental noise conditions are provided in a separate speaker session file for each speaker. The speech data was recorded in two phases. In a first phase data was collected from 140 speakers in the cities of Hanoi and Ho Chi Minh. In the second phase we selected utterances from the text corpus in order to cover rare Vietnamese phonemes. This second recording phase was carried out with 20 Vietnamese graduate students who live in Karlsruhe. In sum, 22.112 utterances were spoken, corresponding to 25.25 hours of speech. The text data used for recording mainly came from the news posted in online editions of 15 Vietnamese newspaper websites, where the first 12 were used for the training set, while the last three were used for the development and evaluation set. The text data collected from the first 12 websites cover almost 4 Million word tokens with a vocabulary of 30.000 words resulting in an Out-of-Vocabulary rate of 0% on the development set and 0.067% on the evaluation set. For the text selection we followed the standard GlobalPhone protocols and focused on national and international politics and economics news (see [SCHULTZ 2002]). The transcriptions are provided in Vietnamese-style Roman script, i.e. using several diacritics encoded in UTF-8. The Vietnamese data are organized in a training set of 140 speakers with 22.15 hours of speech, a development set of 10 speakers, 6 from North and 4 from South Vietnam with 1:40 hours of speech and an evaluation set of 10 speakers with same gender and dialect distribution as the development set with 1:30 hours of speech. More details on corpus statistics, collection scenario, and system building based on the Vietnamese part of GlobalPhone can be found under [Vu and Schultz, 2009, 2010].

[Schultz 2002] Tanja Schultz (2002): GlobalPhone: A Multilingual Speech and Text Database developed at Karlsruhe University, Proceedings of the International Conference of Spoken Language Processing, ICSLP 2002, Denver, CO, September 2002. [Vu and Schultz, 2010] Ngoc Thang Vu, Tanja Schultz (2010): Optimization On Vietnamese Large Vocabulary Speech Recognition, 2nd Workshop on Spoken Languages Technologies for Under-resourced Languages, SLTU 2010, Penang, Malaysia, May 2010. [Vu and Schultz, 2009] Ngoc Thang Vu, Tanja Schultz (2009): Vietnamese Large Vocabulary Continuous Speech Recognition, Automatic Speech Recognition and Understanding, ASRU 2009, Merano.

Clear search

Close search

Google apps

Main menu

GlobalPhone Vietnamese

Data from: LaFresCat: a Catalan multi-accent speech dataset for...

LaFresCat Multiaccent

Dataset Details

Dataset Description

Uses

Languages

Dataset Structure

Dataset Creation

Source Data

Data Collection and Processing

Who are the source data producers?

Annotations

Personal and Sensitive Information

Bias, Risks, and Limitations

Funding

Dataset Card Contact

Maltese crowS-pairs dataset

Swahili : News Classification Dataset

Hamburg Corpus of Argentinean Spanish (HaCASpa) - Dataset - B2FIND

Data from: Knowledge from non-English-language studies broadens...

Description of the data and file structure

Data from: Dataset for 'How brands highlight country of origin in magazine...

International Passenger Survey, 2004 - Dataset - B2FIND

Guinea: Languages

Language Named Authority List

Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph...

Philippines: Languages

Multilingual publishing in the SSH in Poland and attitudes towards English

CT-FAN: A Multilingual dataset for Fake News Detection

Navigating News Narratives: A Media Bias Analysis Dataset

International Passenger Survey, 2001

Data from: Quotebank: A Corpus of Quotations from a Decade of News

Countries and territories Named Authority List

Data from: Synthetic Product Desirability Datasets for Sentiment Analysis...

Data from: Environmental factors and their associations with...

GlobalPhone VietnameseSee More Versions

GlobalPhone Vietnamese