31 datasets found
  1. common-accent

    • huggingface.co
    Updated Mar 16, 2023
    + more versions
    Cite
    DTU DL 54 (2023). common-accent [Dataset]. https://huggingface.co/datasets/DTU54DL/common-accent
    Dataset updated
    Mar 16, 2023
    Dataset authored and provided by
    DTU DL 54
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for [Dataset Name]

      Dataset Summary

    [More Information Needed]

      Supported Tasks and Leaderboards

    [More Information Needed]

      Languages

    [More Information Needed]

      Dataset Structure

      Data Instances

    [More Information Needed]

      Data Fields

    [More Information Needed]

      Data Splits

    [More Information Needed]

      Dataset Creation

      Curation Rationale

    [More Information Needed]

      Source Data… See the full description on the dataset page: https://huggingface.co/datasets/DTU54DL/common-accent.
    
  2. Common Voice Dataset

    • paperswithcode.com
    • opendatalab.com
    • +1more
    Updated May 22, 2023
    Cite
    Rosana Ardila; Megan Branson; Kelly Davis; Michael Henretty; Michael Kohler; Josh Meyer; Reuben Morais; Lindsay Saunders; Francis M. Tyers; Gregor Weber (2023). Common Voice Dataset [Dataset]. https://paperswithcode.com/dataset/common-voice
    Dataset updated
    May 22, 2023
    Authors
    Rosana Ardila; Megan Branson; Kelly Davis; Michael Henretty; Michael Kohler; Josh Meyer; Reuben Morais; Lindsay Saunders; Francis M. Tyers; Gregor Weber
    Description

    Common Voice is an audio dataset that consists of a unique MP3 and corresponding text file. There are 9,283 recorded hours in the dataset. The dataset also includes demographic metadata like age, sex, and accent. The dataset consists of 7,335 validated hours in 60 languages.

  3. common_voice_16_1

    • huggingface.co
    Updated Jan 16, 2024
    + more versions
    Cite
    Mozilla Foundation (2024). common_voice_16_1 [Dataset]. https://huggingface.co/datasets/mozilla-foundation/common_voice_16_1
    Dataset updated
    Jan 16, 2024
    Dataset authored and provided by
    Mozilla Foundation (http://mozilla.org/)
    License

    CC0 1.0 (https://choosealicense.com/licenses/cc0-1.0/)

    Description

    Dataset Card for Common Voice Corpus 16

      Dataset Summary
    

    The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 30328 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 19673 validated hours in 120 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_16_1.
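    The demographic metadata described above is what makes accent-specific subsets possible. A minimal sketch of such filtering, using plain dicts to stand in for dataset rows (the field names "sentence" and "accent" follow the card's description but are assumptions to check against the actual schema; real rows also carry the audio):

    ```python
    # Toy rows standing in for Common Voice examples. Accent metadata is
    # self-reported and is often left empty by contributors.
    rows = [
        {"sentence": "Hello there.", "accent": "scottish", "age": "twenties"},
        {"sentence": "Good morning.", "accent": "", "age": ""},
        {"sentence": "A fine day.", "accent": "scottish", "age": "forties"},
    ]

    def with_accent(rows, accent):
        """Keep only clips whose speaker self-reported the given accent."""
        return [r for r in rows if r.get("accent") == accent]

    scottish = with_accent(rows, "scottish")
    print(len(scottish))  # 2
    ```

    The same pattern extends to the age and sex fields when balancing a training subset.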

  4. common_voice_6_0

    • huggingface.co
    Updated Aug 14, 2022
    + more versions
    Cite
    Mozilla Foundation (2022). common_voice_6_0 [Dataset]. https://huggingface.co/datasets/mozilla-foundation/common_voice_6_0
    Dataset updated
    Aug 14, 2022
    Dataset authored and provided by
    Mozilla Foundation (http://mozilla.org/)
    License

    CC0 1.0 (https://choosealicense.com/licenses/cc0-1.0/)

    Description

    Dataset Card for Common Voice Corpus 6.0

      Dataset Summary
    

    The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 9261 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 7327 validated hours in 60 languages, but more voices and languages are always added. Take a look at the Languages page to request… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_6_0.

  5. Expert annotations for the Catalan Common Voice (v13)

    • data.niaid.nih.gov
    Updated May 2, 2024
    Cite
    Language Technologies Unit (2024). Expert annotations for the Catalan Common Voice (v13) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11104387
    Dataset updated
    May 2, 2024
    Dataset authored and provided by
    Language Technologies Unit
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Dataset Description

    Dataset Summary

    These are the annotations made by a team of experts on the speakers with more than 1200 seconds recorded in the Catalan set of the Common Voice dataset (v13).

    The annotators were initially tasked with evaluating all recordings associated with the same individual. Following that, they were instructed to annotate the speaker's accent, gender, and the overall quality of the recordings.

    The accents and genders taken into account are the ones used until version 8 of the Common Voice corpus.

    See annotations for more details.

    Supported Tasks and Leaderboards

    Gender classification, Accent classification.

    Languages

    The dataset is in Catalan (ca).

    Dataset Structure

    Instances

    Two xlsx documents are published, one for each round of annotations.

    The following information is available in each of the documents:

    {
      'speaker ID': '1b7fc0c4e437188bdf1b03ed21d45b780b525fd0dc3900b9759d0755e34bc25e31d64e69c5bd547ed0eda67d104fc0d658b8ec78277810830167c53ef8ced24b',
      'idx': '31',
      'same speaker': {'AN1': 'SI', 'AN2': 'SI', 'AN3': 'SI', 'agreed': 'SI', 'percentage': '100'},
      'gender': {'AN1': 'H', 'AN2': 'H', 'AN3': 'H', 'agreed': 'H', 'percentage': '100'},
      'accent': {'AN1': 'Central', 'AN2': 'Central', 'AN3': 'Central', 'agreed': 'Central', 'percentage': '100'},
      'audio quality': {'AN1': '4.0', 'AN2': '3.0', 'AN3': '3.0', 'agreed': '3.0', 'percentage': '66', 'mean quality': '3.33', 'stdev quality': '0.58'},
      'comments': {'AN1': '', 'AN2': 'pujades i baixades de volum', 'AN3': "Deu ser d'alguna zona de transició amb el central, perquè no fa una reducció total vocàlica, però hi té molta tendència"},
    }

    We also publish the document Guia anotació parlants.pdf, containing the guidelines the annotators received.

    Data Fields

    speaker ID (string): An id for which client (voice) made the recording in the Common Voice corpus

    idx (int): Id in this corpus

    AN1 (string): Annotations from Annotator 1

    AN2 (string): Annotations from Annotator 2

    AN3 (string): Annotations from Annotator 3

    agreed (string): Annotation from the majority of the annotators

    percentage (int): Percentage of annotators that agree with the agreed annotation

    mean quality (float): Mean of the quality annotation

    stdev quality (float): Standard deviation of the mean quality
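    The agreed, percentage, and quality statistics can be reproduced from the three annotators' raw labels. A minimal sketch (not code from the dataset; note the published percentages appear to be truncated rather than rounded, e.g. 2/3 agreement is reported as 66):

    ```python
    from collections import Counter
    from statistics import mean, stdev

    def aggregate(labels):
        """Derive the 'agreed' value and 'percentage' field from the
        three annotators' labels (AN1, AN2, AN3) by majority vote."""
        value, votes = Counter(labels).most_common(1)[0]
        percentage = int(100 * votes / len(labels))  # truncated: 2/3 -> 66
        return value, percentage

    # Mirrors the 'audio quality' sub-record of the sample instance:
    quality = [4.0, 3.0, 3.0]
    print(aggregate(quality))                                 # (3.0, 66)
    print(round(mean(quality), 2), round(stdev(quality), 2))  # 3.33 0.58
    ```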

    Data Splits

    The corpus is not divided into splits, since it is not intended for training models.

    Dataset Creation

    Curation Rationale

    During 2022, a campaign was launched to promote the Common Voice corpus within the Catalan-speaking community, achieving remarkable success. However, not all participants provided their demographic details such as age, gender, and accent. Additionally, some individuals faced difficulty in self-defining their accent using the standard classifications established by specialists.

    To obtain a balanced corpus with reliable information, we saw the necessity of enlisting a group of experts from the University of Barcelona to provide accurate annotations.

    We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.

    Source Data

    The original data comes from the Catalan sentences of the Common Voice corpus.

    Initial Data Collection and Normalization

    We have selected speakers who have recorded more than 1200 seconds of speech in the Catalan set of the version 13 of the Common Voice corpus.

    Who are the source language producers?

    The original data comes from the Catalan sentences of the Common Voice corpus.

    Annotations

    Annotation process

    Starting with version 13 of the Common Voice corpus we identified the speakers (273) who have recorded more than 1200 seconds of speech.

    A team of three annotators was tasked with annotating:

    if all the recordings correspond to the same person

    the gender of the speaker

    the accent of the speaker

    the quality of the recording

    They conducted an initial round of annotation, discussed their varying opinions, and subsequently conducted a second round.

    We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.

    Who are the annotators?

    The annotation was entrusted to the CLiC (Centre de Llenguatge i Computació) team from the University of Barcelona. They selected a group of three annotators (two men and one woman), who received a scholarship to do this work.

    The annotation team was composed of:

    Annotator 1: 1 female annotator, aged 18-25, L1 Catalan, student in the Modern Languages and Literatures degree, with a focus on Catalan.

    Annotators 2 & 3: 2 male annotators, aged 18-25, L1 Catalan, students in the Catalan Philology degree.

    1 female supervisor, aged 40-50, L1 Catalan, graduate in Physics and in Linguistics, Ph.D. in Signal Theory and Communications.

    To do the annotation they used a Google Drive spreadsheet.

    Personal and Sensitive Information

    The Common Voice dataset consists of people who have donated their voices online. We do not share their voices here, only their gender and accent. You agree not to attempt to determine the identity of speakers in the Common Voice dataset.

    Considerations for Using the Data

    Social Impact of Dataset

    The IDs come from the Common Voice dataset, which consists of people who have donated their voices online.

    You agree not to attempt to determine the identity of speakers in the Common Voice dataset.

    The information from this corpus will allow us to train and evaluate well-balanced Catalan ASR models. Furthermore, we believe it holds philological value for studying dialectal and gender variants.

    Discussion of Biases

    Most of the voices in the Catalan Common Voice correspond to men aged 40 to 60 with a Central accent. This dataset aims to provide information that helps minimize the biases this imbalance could cause.

    For the gender annotation, only "H" (male) and "D" (female) were considered.

    Other Known Limitations

    [N/A]

    Additional Information

    Dataset Curators

    Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)

    This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

    Licensing Information

    This dataset is licensed under a CC BY 4.0 license.

    It can be used for any purpose, whether academic or commercial, under the terms of the license. Give appropriate credit, provide a link to the license, and indicate if changes were made.

    Citation Information

    DOI

    Contributions

    The annotation was entrusted to the STeL team from the University of Barcelona.

  6. common-accent-all-features

    • huggingface.co
    Updated Apr 26, 2025
    Cite
    Tong Zhou (2025). common-accent-all-features [Dataset]. https://huggingface.co/datasets/ZZZtong/common-accent-all-features
    Dataset updated
    Apr 26, 2025
    Authors
    Tong Zhou
    Description

    ZZZtong/common-accent-all-features dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. Prefectures with the most popular dialect in Japan 2021

    • statista.com
    Updated Jan 9, 2024
    Cite
    Statista (2024). Prefectures with the most popular dialect in Japan 2021 [Dataset]. https://www.statista.com/statistics/1087757/japan-prefectures-most-charming-dialect/
    Dataset updated
    Jan 9, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Oct 27, 2021 - Nov 5, 2021
    Area covered
    Japan
    Description

    Around 34 percent of Japanese residents in Fukuoka claimed that the prefecture can take pride in the charming dialect of its locals, according to a survey conducted in November 2021 among locals of the 47 prefectures in Japan. The dialect of Fukuoka prefecture is part of the Kyushu dialects, which also includes the dialects of the second-ranked prefecture Nagasaki.

  8. ACCENT DECOR COMMON GROUND CORKCICLE|Full export Customs Data...

    • tradeindata.com
    Updated Dec 4, 2024
    Cite
    tradeindata (2024). ACCENT DECOR COMMON GROUND CORKCICLE|Full export Customs Data Records|tradeindata [Dataset]. https://www.tradeindata.com/supplier_detail/?id=ea6031474d79b1f367f4acddd34d2569
    Dataset updated
    Dec 4, 2024
    Dataset authored and provided by
    tradeindata
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Customs records are available for ACCENT DECOR COMMON GROUND CORKCICLE. Learn about its importer, supply capabilities, and the countries to which it supplies goods.

  9. common-accent-YAMnet

    • huggingface.co
    Updated Apr 18, 2025
    Cite
    Tong Zhou (2025). common-accent-YAMnet [Dataset]. https://huggingface.co/datasets/ZZZtong/common-accent-YAMnet
    Dataset updated
    Apr 18, 2025
    Authors
    Tong Zhou
    Description

    ZZZtong/common-accent-YAMnet dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. ONE HUNDRED COMMON GROUND BLOCK DESIGN ACCENT DECOR TRUE BRANDS JR WILLIAM...

    • tradeindata.com
    Updated Oct 30, 2024
    Cite
    tradeindata (2024). ONE HUNDRED COMMON GROUND BLOCK DESIGN ACCENT DECOR TRUE BRANDS JR WILLIAM TWO S. COMPANY LAZAR DECO|Full export Customs Data Records|tradeindata [Dataset]. https://www.tradeindata.com/supplier_detail/?id=4fd20115b1b11974d635e2dd1c4f2089
    Dataset updated
    Oct 30, 2024
    Dataset authored and provided by
    tradeindata
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Customs records are available for ONE HUNDRED COMMON GROUND BLOCK DESIGN ACCENT DECOR TRUE BRANDS JR WILLIAM TWO S. COMPANY LAZAR DECO. Learn about its importer, supply capabilities, and the countries to which it supplies goods.

  11. common-accent-MelSpec-MFCC

    • huggingface.co
    Updated Apr 26, 2025
    Cite
    Tong Zhou (2025). common-accent-MelSpec-MFCC [Dataset]. https://huggingface.co/datasets/ZZZtong/common-accent-MelSpec-MFCC
    Dataset updated
    Apr 26, 2025
    Authors
    Tong Zhou
    Description

    ZZZtong/common-accent-MelSpec-MFCC dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. BRWDS: A Multipurpose Dataset For Bangla Regional Word Detection

    • data.mendeley.com
    Updated Dec 2, 2024
    + more versions
    Cite
    Umme Aiman (2024). BRWDS: A Multipurpose Dataset For Bangla Regional Word Detection [Dataset]. http://doi.org/10.17632/6pd2c48m66.3
    Dataset updated
    Dec 2, 2024
    Authors
    Umme Aiman
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The BRWDS (Bangla Regional Word Dataset) is a comprehensive collection of commonly used Bengali words that highlights the linguistic diversity across 8 distinct divisions in Bangladesh. This dataset aims to tackle the challenges posed by regional accents and variations in Bengali, which can create barriers to communication. The dataset covers words from the following divisions: Dhaka, Chittagong, Mymensingh, Sylhet, Rajshahi, Khulna, Barishal, and Rangpur. In total, it includes 347 Bengali words that are frequently used in daily conversations across these regions. While Bengali is spoken across all these divisions, each region has its own unique accent, leading to variations in pronunciation and word usage, which are captured in this dataset.

    To create this dataset, 12 native speakers from the 8 divisions, as well as one additional district, contributed by providing word samples. The data is stored in XLSX format, making it easily accessible for further research. This dataset has several potential applications, including the development of systems that can automatically detect regional variations in Bengali text, enabling better localization and understanding of regional dialects. It can also help minimize communication barriers caused by accent differences within Bangladesh by offering a more standardized understanding of regional variations. Additionally, the dataset can be used to translate regional words into standard Bengali (Chaste Bengali), making it easier for people to understand each other. The dataset also supports research into linguistic diversity and provides a foundation for future advancements in speech and text processing technologies. The dataset has been reviewed and evaluated by 9 authentic speakers from each division to ensure its accuracy and proper representation of the regional language variations.

    Looking forward, the dataset can be further enriched by adding voice data, which would support more advanced research in areas such as speech recognition, accent detection, and machine translation for regional language variants.

    The data is stored in Bangla RDS.xlsx. Sheet 1, named "Region wise", holds the data as collected; a second sheet, named "categorize data", organizes all entries according to their common chaste (standard Bengali) equivalents.

  13. common-accent-vgg-ready

    • huggingface.co
    Updated Apr 26, 2025
    Cite
    Tong Zhou (2025). common-accent-vgg-ready [Dataset]. https://huggingface.co/datasets/ZZZtong/common-accent-vgg-ready
    Dataset updated
    Apr 26, 2025
    Authors
    Tong Zhou
    Description

    ZZZtong/common-accent-vgg-ready dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. WYRED - West Yorkshire Regional English Database 2016-2019

    • datacatalogue.cessda.eu
    Updated May 27, 2025
    Cite
    Gold (2025). WYRED - West Yorkshire Regional English Database 2016-2019 [Dataset]. http://doi.org/10.5255/UKDA-SN-854354
    Dataset updated
    May 27, 2025
    Authors
    Gold
    Time period covered
    Feb 15, 2016 - Aug 31, 2019
    Area covered
    Yorkshire, West Yorkshire, United Kingdom
    Variables measured
    Individual
    Measurement technique
    WYRED consists of recordings from 180 male speakers, aged between 18 and 30 at the time of recording. All participants are British English speakers from Northern England in the county of West Yorkshire. The 180 speakers are divided between three of the five boroughs within West Yorkshire (Bradford, Kirklees, Wakefield), such that there are 60 speakers from each of the boroughs. Participants were assigned to a borough based on the postcode (zip code) where they grew up and went to primary and secondary school. All participants are native English speakers who grew up in English-only speaking households and did not speak any other languages. None of the participants reported any speech or hearing impairments. Speakers, however, were not included in the database if they were deemed to have spent a significant period (more than a few years) outside the area, or had missing/broken front teeth or facial piercings that affected their speech.

    Recruitment largely took place through email advertisements, but also via flyers, in-class presentations, Facebook ads, and referrals. All interested participants registered their interest through an online survey that allowed us to screen for eligible participants. Speakers were then invited to participate via email. All participants were compensated for their participation.

    In addition to each participant's age, WYRED also contains metadata that may be of interest to other researchers. The following metadata has been collected for each participant: relationship status and where their partner was from, where the participants' parents were from, employment status and type of work, highest level of education, smoker/non-smoker, left- or right-handed, and height and weight.
    Description

    The West Yorkshire Regional English Database (WYRED) consists of approximately 200 hours of high-quality audio recordings of 180 West Yorkshire (British English) speakers. All participants are male between the ages of 18-30, and are divided evenly (60 per region) across three boroughs within West Yorkshire (Northern England): Bradford, Kirklees, and Wakefield. Speakers participated in four spontaneous speaking tasks. The first two tasks relate to a mock crime where the participant speaks to a police officer (Research Assistant 1) followed by an accomplice (Research Assistant 2). Speakers returned a minimum of 6 days later at which point they were paired with someone from their borough and recorded having a conversation on any topics they wish. The final task is an experimental task in which speakers are asked to leave a voicemail message related to the fictitious crime from the first recording session. In total, each speaker participated in approximately 1 hour of spontaneous speech recordings. The primary motivation for the construction of the West Yorkshire Regional English Database (WYRED) was to provide a collection of regionally stratified speech recordings (by boroughs) from within a single, politically defined region (a county). The corpus aims to facilitate research on methodological issues surrounding the delimitation of the reference population when considering the typicality of a speech sample for a given forensic speaker comparison case, while also providing valuable insight into the West Yorkshire accent(s).

    Forensic speech science (FSS) - an applied sub-discipline of phonetics - has come to play a critical role in criminal cases involving voice evidence. Within FSS, Forensic speaker comparison (FSC) involves the comparison of a criminal recording (e.g. a threatening phone call), and a known suspect sample (e.g. a police interview). It is the role of an expert forensic phonetician to advise the trier of fact (e.g. judge or jury) on the likelihood of the two samples coming from the same speaker. There are two important elements involved in making such a comparison. First, the expert will carry out an assessment of the similarity of the speech characteristics in the criminal recording and the suspect sample. Second, the expert will assess the degree to which the same speech features for the criminal sample can be considered to be typical for a given speaker group. The speaker group will typically be defined by age, sex and geographical region (or accent). This second element is critical in providing context for the first; the suspect could have speech very similar to that in the criminal recording but this could be purely coincidental if they exhibit speech characteristics that are common to their speaker group. In contrast, if the criminal and suspect are observed as having speech features considered as being atypical for their speaker group then this would provide strong evidence for it being the same speaker.
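    This two-part reasoning (similarity to the suspect versus typicality in the population) is commonly formalized as a likelihood ratio. A minimal sketch with univariate Gaussians and toy numbers (purely illustrative; not WYRED data and not the method of any particular expert):

    ```python
    from math import exp, pi, sqrt

    def gauss(x, mu, sigma):
        """Univariate normal probability density."""
        return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

    # One acoustic measurement from the criminal recording
    # (e.g. a formant frequency in Hz; toy value).
    evidence = 512.0

    # Similarity: how well the evidence fits the suspect's own distribution.
    suspect_mu, suspect_sd = 510.0, 8.0

    # Typicality: how common that value is in the relevant speaker group.
    # Estimating these parameters is exactly what population data provide.
    pop_mu, pop_sd = 540.0, 30.0

    lr = gauss(evidence, suspect_mu, suspect_sd) / gauss(evidence, pop_mu, pop_sd)

    # LR > 1 supports the same-speaker hypothesis; a feature that matches
    # the suspect but is atypical for the population yields a large LR.
    print(lr > 1)  # True
    ```

    With these toy numbers the evidence is close to the suspect's mean but roughly one population standard deviation from the population mean, so the ratio comes out well above 1.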

    One complication associated with FSC is that data to estimate whether a speech feature is typical or atypical for the given speaker group, commonly known as population data, are scarcely available. Population data are typically obtained by collecting a set of recordings containing the voices of a homogeneous group of speakers similar in age, sex, and geographical region (or accent). Unfortunately, the time and expense involved in the collection of population data means that forensic phoneticians face a huge challenge in obtaining such data for casework. This problem is further complicated by the high degree of variation that exists in speech across different speaker groups. Methodological research in the field of FSS has demonstrated that identifying the correct population for a FSC is vital in accurately representing the strength of evidence. It is largely for these reasons that experts argue that the biggest problem facing the field is the limited availability of population data.

    The primary aim of this research is to explore a novel set of proposed methods that seek to remedy the aforementioned problems. The current lack of a platform on which to exchange data means that population data for a specific speaker group might have already been collected, unbeknown to experts in need of such data. This project intends to bring an end to this type of scenario by developing an international platform on which to share data, and also encouraging fellow researchers and experts to participate in data sharing. In addition, the project will explore the extent to which population data are generalizable; specifically, this will entail identifying the geographical (or regional accent) level at which speaker groups can be defined. For example, an expert might define a population group as having a Wakefield accent, when in actuality a population defined more generally as West Yorkshire would suffice. This would clearly have implications for the way in which population data would be collected.

    In...

  15. common_voice_9_0

    • huggingface.co
    Updated Jul 24, 2022
    Cite
    Mozilla Foundation (2022). common_voice_9_0 [Dataset]. https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0
    Dataset updated
    Jul 24, 2022
    Dataset authored and provided by
    Mozilla Foundation (http://mozilla.org/)
    License

    CC0 1.0 (https://choosealicense.com/licenses/cc0-1.0/)

    Description

    Dataset Card for Common Voice Corpus 9.0

      Dataset Summary
    

    The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 20217 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 14973 validated hours in 93 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0.

  16. common-accent-processed-raw

    • huggingface.co
    Updated Apr 26, 2025
    Cite
    Tong Zhou (2025). common-accent-processed-raw [Dataset]. https://huggingface.co/datasets/ZZZtong/common-accent-processed-raw
    Dataset updated
    Apr 26, 2025
    Authors
    Tong Zhou
    Description

    ZZZtong/common-accent-processed-raw dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. Dialects spoken in Italy 2018, by number of speakers

    • statista.com
    Updated Aug 30, 2024
    Cite
    Statista (2024). Dialects spoken in Italy 2018, by number of speakers [Dataset]. https://www.statista.com/statistics/1125068/dialects-spoken-in-italy/
    Dataset updated
    Aug 30, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2018
    Area covered
    Italy
    Description

    The most spoken dialect in Italy is South Italian. This macro-group includes varieties spoken in the regions of Campania, Calabria, Basilicata, Abruzzo, Apulia, and Molise as well as in some areas of Lazio, Marche, and Umbria, three regions of the Center. As of 2018, South Italian counted 7.5 million speakers. The second most spoken dialect, Sicilian, had about five million speakers.

  18. Reasons why adults use subtitles when watching TV in known language in the...

    • statista.com
    Updated Jun 4, 2024
    Cite
    Statista (2024). Reasons why adults use subtitles when watching TV in known language in the U.S. 2023 [Dataset]. https://www.statista.com/statistics/1459167/reasons-use-subtitles-watching-tv-known-language-us/
    Dataset updated
    Jun 4, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jun 29, 2023 - Jul 5, 2023
    Area covered
    United States
    Description

    Enhancement of comprehension and more profound understanding of accents were the most common reasons why American adults use subtitles while watching TV in a known language, according to a survey conducted between June and July 2023. Another 33 percent of the respondents stated that they did so because they were in a noisy environment.

  19. Twitter Hate Speech Dataset for the Saudi Dialect

    • data.mendeley.com
    Updated Nov 1, 2024
    + more versions
    Cite
    Ali Alhazmi (2024). Twitter Hate Speech Dataset for the Saudi Dialect [Dataset]. http://doi.org/10.17632/c2jpnv9yk6.4
    Dataset updated
    Nov 1, 2024
    Authors
    Ali Alhazmi
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    Saudi Arabia
    Description

    Data collection was performed using the standard Twitter API on Arabic tweets and code-mixed content, over a period of three months from April 2023 to June 2023. This was done via a combination of keyword, thread-based, and profile-based search approaches. A total of 120 terms, including variant forms, were used to identify tweets containing code-mixing related to regional hate speech. For the thread-based search, we incorporated hashtags related to contentious subjects that are deemed essential markers of hateful speech. Throughout the data-gathering phase, we monitored Twitter trends and designated ten hashtags for information retrieval. Given that hateful tweets are usually less common than regular tweets, we expanded our dataset and improved the representation of the hate class by incorporating the most impactful terms from a lexicon of religious hate terms (Albadi et al., 2018). We gathered exclusively original Arabic tweets for all queries, excluding retweets and non-Arabic tweets. In all, we obtained 200,000 tweets, of which we sampled 35k for annotation.

  20. cm.trial

    • huggingface.co
    Updated Feb 22, 2023
    + more versions
    Cite
    taqwa mohamed (2023). cm.trial [Dataset]. https://huggingface.co/datasets/taqwa92/cm.trial
    Dataset updated
    Feb 22, 2023
    Authors
    taqwa mohamed
    License

    CC0 1.0 (https://choosealicense.com/licenses/cc0-1.0/)

    Description

    Dataset Card for Common Voice Corpus 11.0

      Dataset Summary
    

    The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 24210 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 16413 validated hours in 100 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/taqwa92/cm.trial.
