https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides detailed information on the number of words in various languages. It includes a comprehensive list of word counts for multiple languages, making it a valuable resource for linguists, language learners, and anyone interested in language diversity. The dataset is presented in the form of a list of dictionaries, with each dictionary containing information on the language, the number of words, and other details.
The dataset covers a wide range of languages from around the world, including commonly spoken languages like English, Spanish, Mandarin, and Arabic, as well as lesser-known languages. The word counts are approximate and are based on the number of words in the respective dictionaries.
In addition to the word counts, the dataset also includes information on the approximate number of headwords and definitions available for each language. This information provides further insight into the depth and complexity of the vocabulary of each language.
The dataset is useful for a variety of purposes, such as language research, linguistic diversity studies, language teaching and learning, and natural language processing. The data is provided in a machine-readable format, making it easy to use and analyze.
This dataset is a valuable resource for anyone interested in the linguistic diversity of the world's languages and provides a starting point for exploring the vast vocabulary of different languages.
Language: The name of the language the dictionary pertains to.
Number of Words: The approximate number of words included in the dictionary.
Approx Headwords: The approximate number of headwords included in the dictionary.
Approx Definitions: The approximate number of definitions included in the dictionary.
Dictionary: The name or type of dictionary included in the dataset.
Notes: Any additional notes or information regarding the dictionary or language.
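For orientation, a single record in this list-of-dictionaries layout might look as follows; the field values below are illustrative placeholders, not figures from the dataset:

```python
# Illustrative record in the list-of-dictionaries layout (values are placeholders).
word_counts = [
    {
        "Language": "English",
        "Number of Words": 470000,
        "Approx Headwords": 220000,
        "Approx Definitions": 600000,
        "Dictionary": "General-purpose unabridged dictionary",
        "Notes": "Counts are approximate and dictionary-dependent.",
    },
]

# Example: rank languages by approximate word count.
for entry in sorted(word_counts, key=lambda e: e["Number of Words"], reverse=True):
    print(f"{entry['Language']}: ~{entry['Number of Words']:,} words")
```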
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset tracks annual distribution of students across grade levels in World Language High School
We provide a wide range of off-the-shelf multilingual audio datasets, featuring real-world call center dialogues and general conversational recordings from regions across Africa, Central America, South America, and Asia.
Our datasets include multiple languages, local dialects, and authentic conversational flows, designed for AI training, contact center automation, and conversational AI development. All samples are human-validated and come with complete metadata.
Each Dataset Includes:
Unique Participant ID
Gender (Male/Female)
Country & City of Origin
Speaker Age (18-60 years)
Language (English + Multiple Local Languages)
Audio Length: ~30 minutes per participant
Validation Status: 100% Human-Checked
Why Work With Us:
Large library of ready-to-use multilingual datasets
Authentic call center, customer service, and natural conversation recordings
Global coverage with diverse speaker demographics
Custom data collection service: we can source or record datasets tailored to your language, region, or domain needs
Best For:
Speech Recognition & Multilingual NLP
Voicebots & Contact Center AI Solutions
Dialect & Accent Recognition Training
Conversational AI & Multilingual Assistants
Customer Support & Quality Analytics
Whether you need off-the-shelf datasets or unique, project-specific collections, we've got you covered.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Can you tell geographical stories about the world using data science?
World countries with their corresponding continents, official English names, official French names, Dial, ITU, Languages, and so on.
This data was obtained from https://old.datahub.io/
Exploration of the world countries:
- Can we graphically visualize countries that speak a particular language?
- We can also integrate this dataset into others to enhance our exploration.
- The dataset has now been updated to include the longitudes and latitudes of countries in the world.
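As a sketch of the first exploration idea, one could filter the table by its Languages column and plot country coordinates; the file and column names below are assumptions to adapt to the actual export:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed file and column names; adjust to the actual export.
countries = pd.read_csv("world_countries.csv")

# Rows whose Languages field mentions French ("fr" as a language code).
french = countries[countries["Languages"].str.contains("fr", case=False, na=False)]

# Quick scatter of country coordinates; a map layer (e.g. geopandas) could replace this.
plt.scatter(french["Longitude"], french["Latitude"])
plt.title("Countries listing French among their languages")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.show()
```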
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This comprehensive dataset provides a wealth of information about all countries worldwide, covering a wide range of indicators and attributes. It encompasses demographic statistics, economic indicators, environmental factors, healthcare metrics, education statistics, and much more. With every country represented, this dataset offers a complete global perspective on various aspects of nations, enabling in-depth analyses and cross-country comparisons.
Key Features
Country: Name of the country.
Density (P/Km2): Population density measured in persons per square kilometer.
Abbreviation: Abbreviation or code representing the country.
Agricultural Land (%): Percentage of land area used for agricultural purposes.
Land Area (Km2): Total land area of the country in square kilometers.
Armed Forces Size: Size of the armed forces in the country.
Birth Rate: Number of births per 1,000 population per year.
Calling Code: International calling code for the country.
Capital/Major City: Name of the capital or major city.
CO2 Emissions: Carbon dioxide emissions in tons.
CPI: Consumer Price Index, a measure of inflation and purchasing power.
CPI Change (%): Percentage change in the Consumer Price Index compared to the previous year.
Currency_Code: Currency code used in the country.
Fertility Rate: Average number of children born to a woman during her lifetime.
Forested Area (%): Percentage of land area covered by forests.
Gasoline_Price: Price of gasoline per liter in local currency.
GDP: Gross Domestic Product, the total value of goods and services produced in the country.
Gross Primary Education Enrollment (%): Gross enrollment ratio for primary education.
Gross Tertiary Education Enrollment (%): Gross enrollment ratio for tertiary education.
Infant Mortality: Number of deaths per 1,000 live births before reaching one year of age.
Largest City: Name of the country's largest city.
Life Expectancy: Average number of years a newborn is expected to live.
Maternal Mortality Ratio: Number of maternal deaths per 100,000 live births.
Minimum Wage: Minimum wage level in local currency.
Official Language: Official language(s) spoken in the country.
Out of Pocket Health Expenditure (%): Percentage of total health expenditure paid out-of-pocket by individuals.
Physicians per Thousand: Number of physicians per thousand people.
Population: Total population of the country.
Population: Labor Force Participation (%): Percentage of the population that is part of the labor force.
Tax Revenue (%): Tax revenue as a percentage of GDP.
Total Tax Rate: Overall tax burden as a percentage of commercial profits.
Unemployment Rate: Percentage of the labor force that is unemployed.
Urban Population: Percentage of the population living in urban areas.
Latitude: Latitude coordinate of the country's location.
Longitude: Longitude coordinate of the country's location.
Potential Use Cases
Analyze population density and land area to study spatial distribution patterns.
Investigate the relationship between agricultural land and food security.
Examine carbon dioxide emissions and their impact on climate change.
Explore correlations between economic indicators such as GDP and various socio-economic factors.
Investigate educational enrollment rates and their implications for human capital development.
Analyze healthcare metrics such as infant mortality and life expectancy to assess overall well-being.
Study labor market dynamics through indicators such as labor force participation and unemployment rates.
Investigate the role of taxation and its impact on economic development.
Explore urbanization trends and their social and environmental consequences.
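A minimal sketch of one such analysis, assuming the data is exported as a CSV with the column names listed above; numeric columns in datasets like this often carry thousands separators or percent signs, hence the cleaning step:

```python
import pandas as pd

# Assumed file name; the column names follow the field list above.
df = pd.read_csv("world_countries_indicators.csv")

# Strip thousands separators / percent signs before converting to numbers.
cols = ["GDP", "Life Expectancy", "Infant Mortality", "Urban Population"]
for col in cols:
    df[col] = pd.to_numeric(
        df[col].astype(str).str.replace(r"[^0-9.\-]", "", regex=True),
        errors="coerce",
    )

# Pairwise correlations between the selected indicators.
print(df[cols].corr())
```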
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).
In each language, about 100 sentences were read by each of 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering and false starts, and non-verbal effects like laughing and hesitations. Speaker information such as age, gender, and occupation, as well as information about the recording setup, complements the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2,100 native adult speakers. The data is compressed by means of the shorten program written by Tony Robinson; alternatively, the data can be delivered unshortened.
The French corpus was produced using the Le Monde newspaper. It contains recordings of 100 speakers (49 males, 51 females) recorded in Grenoble, France. The following age distribution was obtained: 3 speakers are below 19, 52 speakers are between 20 and 29, 16 speakers are between 30 and 39, 13 speakers are between 40 and 49, and 14 speakers are over 50 (the age of 2 speakers is unknown).
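If the audio has been delivered unshortened and converted to standard WAV, a quick format check along these lines can confirm the stated 16-bit, 16 kHz mono quality; the file name is a placeholder, and shorten-compressed releases would need to be decoded first:

```python
import wave

# Placeholder path to one decoded WAV file from the corpus.
path = "globalphone_fr_example.wav"

with wave.open(path, "rb") as wav:
    assert wav.getsampwidth() == 2       # 16-bit samples
    assert wav.getframerate() == 16000   # 16 kHz
    assert wav.getnchannels() == 1       # mono
    duration = wav.getnframes() / wav.getframerate()
    print(f"{path}: {duration:.1f} s of 16 kHz, 16-bit mono audio")
```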
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Arabic Sign Language (ASL) 20-Words Dataset v1 was carefully designed to reflect natural conditions, aiming to capture realistic signing environments and circumstances. Recognizing that nearly everyone has access to a smartphone with a camera as of 2020, the dataset was specifically recorded using mobile phones, aligning with how people commonly record videos in daily life. This approach ensures the dataset is grounded in real-world conditions, enhancing its applicability for practical use cases.
Each video in this dataset was recorded directly on the authors' smartphones, without any form of stabilization, neither hardware nor software. As a result, the videos vary in resolution and were captured across diverse locations, places, and backgrounds. This variability introduces natural noise and conditions, supporting the development of robust deep learning models capable of generalizing across environments.
In total, the dataset comprises 8,467 videos of 20 sign language words, contributed by 72 volunteers aged between 20 and 24. Each volunteer performed each sign a minimum of five times, resulting in approximately 100 videos per participant. This repetition standardizes the data and ensures each sign is adequately represented across different performers. The dataset's mean video count per sign is 423.35, with a standard deviation of 18.58, highlighting the balance and consistency achieved across the signs.
For reference, Table 2 (in the research article) provides the count of videos for each sign, while Figure 2 (in the research article) offers a visual summary of the statistics for each word in the dataset. Additionally, sample frames from each word are displayed in Figure 3 (in the research article), giving a glimpse of the visual content captured.
For in-depth insights into the methodology and the dataset's creation, see the research paper: Balaha, M.M., El-Kady, S., Balaha, H.M., et al. (2023). "A vision-based deep learning approach for independent-users Arabic sign language interpretation". Multimedia Tools and Applications, 82, 6807-6826. https://doi.org/10.1007/s11042-022-13423-9
Please consider citing the following if you use this dataset:
@misc{balaha_asl_2024_db,
title={ASL 20-Words Dataset v1},
url={https://www.kaggle.com/dsv/9783691},
DOI={10.34740/KAGGLE/DSV/9783691},
publisher={Kaggle},
author={Mostafa Magdy Balaha and Sara El-Kady and Hossam Magdy Balaha and Mohamed Salama and Eslam Emad and Muhammed Hassan and Mahmoud M. Saafan},
year={2024}
}
@article{balaha2023vision,
title={A vision-based deep learning approach for independent-users Arabic sign language interpretation},
author={Balaha, Mostafa Magdy and El-Kady, Sara and Balaha, Hossam Magdy and Salama, Mohamed and Emad, Eslam and Hassan, Muhammed and Saafan, Mahmoud M},
journal={Multimedia Tools and Applications},
volume={82},
number={5},
pages={6807--6826},
year={2023},
publisher={Springer}
}
This dataset is available under the CC BY-NC-SA 4.0 license, which allows for sharing and adaptation under conditions of non-commercial use, proper attribution, and distribution under the same license.
For further inquiries or information: https://hossambalaha.github.io/.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
(see https://tblock.github.io/10kGNAD/ for the original dataset page)
This page introduces the 10k German News Articles Dataset (10kGNAD), a German topic classification dataset. The 10kGNAD is based on the One Million Posts Corpus and is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. You can download the dataset here.
English text classification datasets are common. Examples are the large AG News corpus, the class-rich 20 Newsgroups, and the large-scale DBpedia ontology datasets for topic classification, as well as the commonly used IMDb and Yelp datasets for sentiment analysis. Non-English datasets, especially German datasets, are less common. There is a collection of sentiment analysis datasets assembled by the Interest Group on German Sentiment Analysis. However, to my knowledge, no German topic classification dataset is available to the public.
Due to grammatical differences between the English and the German language, a classifier might be effective on an English dataset but not as effective on a German dataset. German is more highly inflected, and long compound words are much more common than in English. One would need to evaluate a classifier on multiple German datasets to get a sense of its effectiveness.
The 10kGNAD dataset is intended to solve part of this problem as the first German topic classification dataset. It consists of 10,273 German-language news articles from an Austrian online newspaper, categorized into nine topics. These articles are a previously unused part of the One Million Posts Corpus.
In the One Million Posts Corpus each article has a topic path, for example Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise. The 10kGNAD uses the second part of the topic path, here Wirtschaft, as the class label. As a result, the dataset can be used for multi-class classification.
I created and used this dataset in my thesis to train and evaluate four text classifiers on the German language. By publishing the dataset I hope to support the advancement of tools and models for the German language. Additionally, this dataset can be used as a benchmark dataset for German topic classification.
As in most real-world datasets, the class distribution of the 10kGNAD is not balanced. The biggest class, Web, consists of 1,678 articles, while the smallest class, Kultur, contains only 539. However, articles from the Web class have on average the fewest words, while articles from the Kultur class have the second-most words.
I propose a stratified split of 10% for testing and the remaining articles for training.
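As a rough sketch of both steps, deriving the class label from a topic path and producing the proposed stratified 90/10 split, here with tiny placeholder articles rather than the real extraction output from the code directory:

```python
from sklearn.model_selection import train_test_split

def label_from_path(topic_path: str) -> str:
    # "Newsroom/Wirtschaft/Wirtschaftpolitik/..." -> "Wirtschaft"
    return topic_path.split("/")[1]

# Placeholder articles and labels; the real ones come from corpus.sqlite3
# via the extraction scripts in the code directory.
articles = [f"Artikel {i} ..." for i in range(20)]
labels = ["Wirtschaft" if i % 2 == 0 else "Web" for i in range(20)]

# Stratified split with 10% of the articles held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    articles, labels, test_size=0.10, stratify=labels, random_state=42
)
print(len(X_train), "training articles,", len(X_test), "test articles")
```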
To use the dataset as a benchmark dataset, please use the train.csv and test.csv files located in the project root.
Python scripts to extract the articles and split them into a training set and a test set are available in the code directory of this project.
Make sure to install the requirements.
The original corpus.sqlite3 is required to extract the articles (download here (compressed) or here (uncompressed)).
This dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please consider citing the authors of the One Million Post Corpus if you use the dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is gathered from online repositories and academic papers. It includes sentences in 12 distinct regional dialects of Bangladesh. The dataset is imbalanced, reflecting real-world dialect distributions, since availability depends on the specific groups and populations sampled. The dataset supports research in dialect classification, machine translation, and regional language analysis.
Dialect-Wise Sentence Distribution:
Chittagong: 8,819
Kishoreganj: 8,751
Narail: 7,829
Tangail: 6,793
Rangpur: 5,909
Narsingdi: 5,862
Standard Bangla: 4,545
Barisal: 4,270
Sylhet: 3,922
Mymensingh: 3,212
Noakhali: 2,500
Rajshahi: 891
Total: 63,303
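Because of this imbalance, downstream classifiers often benefit from class weighting; a small sketch that derives inverse-frequency weights from the counts above:

```python
# Dialect counts as listed above.
counts = {
    "Chittagong": 8819, "Kishoreganj": 8751, "Narail": 7829, "Tangail": 6793,
    "Rangpur": 5909, "Narsingdi": 5862, "Standard Bangla": 4545, "Barisal": 4270,
    "Sylhet": 3922, "Mymensingh": 3212, "Noakhali": 2500, "Rajshahi": 891,
}

total = sum(counts.values())          # 63,303 sentences
n_classes = len(counts)               # 12 dialects

# Inverse-frequency class weights, one common remedy for imbalance.
weights = {d: total / (n_classes * n) for d, n in counts.items()}
for dialect, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{dialect}: {n} sentences ({n / total:.1%}), weight {weights[dialect]:.2f}")
```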
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Finnish General Conversation Speech Dataset, a rich, linguistically diverse corpus purpose-built to accelerate the development of Finnish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Finnish communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Finnish speech models that understand and respond to authentic Finnish accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Finnish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple Finnish speech and language AI applications:
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).
In each language, about 100 sentences were read by each of 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering and false starts, and non-verbal effects like laughing and hesitations. Speaker information such as age, gender, and occupation, as well as information about the recording setup, complements the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2,100 native adult speakers. The data is compressed by means of the shorten program written by Tony Robinson; alternatively, the data can be delivered unshortened.
The Portuguese (Brazilian) corpus was produced using the Folha de Sao Paulo newspaper. It contains recordings of 102 speakers (54 males, 48 females) recorded in Porto Velho and Sao Paulo, Brazil. The following age distribution was obtained: 6 speakers are below 19, 58 speakers are between 20 and 29, 27 speakers are between 30 and 39, 5 speakers are between 40 and 49, and 5 speakers are over 50 (the age of 1 speaker is unknown).
Data from Hebrew-speaking children and adults in an auditory statistical learning experiment examining the effect of distribution predictability on segmentation. While the languages of the world differ in many respects, they share certain commonalties, which can provide insight on our shared cognition. Here, we explore the learnability consequences of one of the striking commonalities between languages. Across languages, word frequencies follow a Zipfian distribution, showing a power law relation between a word's frequency and its rank. While their source in language has been studied extensively, less work has explored the learnability consequences of such distributions for language learners. We propose that the greater predictability of words in this distribution (relative to less skewed distributions) can facilitate word segmentation, a crucial aspect of early language acquisition. To explore this, we quantify word predictability using unigram entropy, assess it across languages using naturalistic corpora of child-directed speech and then ask whether similar unigram predictability facilitates word segmentation in the lab. We find similar unigram entropy in child-directed speech across 15 languages. We then use an auditory word segmentation task to show that the unigram predictability levels found in natural language are uniquely facilitative for word segmentation for both children and adults. These findings illustrate the facilitative impact of skewed input distributions on learning and raise questions about the possible role of cognitive pressures in the prevalence of Zipfian distributions in language.
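Unigram entropy, the predictability measure used here, can be estimated directly from word frequencies; a minimal sketch with a toy utterance rather than the child-directed speech corpora analysed in the study:

```python
import math
from collections import Counter

# Toy token counts; the study estimates this from child-directed speech corpora.
tokens = "the dog saw the ball and the dog chased the ball".split()
counts = Counter(tokens)
total = sum(counts.values())

# Unigram (Shannon) entropy in bits: H = -sum_w p(w) * log2 p(w).
entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
print(f"Unigram entropy: {entropy:.3f} bits over {len(counts)} word types")
```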
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.
Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI's GPT-4o using zero-shot prompting, and to evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.
Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resulting data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.
Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on the respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.
Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets which replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
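The fidelity checks described above can be reproduced with standard statistical tooling; a hedged sketch using placeholder vectors in place of the real VitalDB and GPT-4o-generated columns:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)

# Placeholder vectors standing in for one continuous parameter (e.g. age)
# from the real VitalDB cases and the GPT-4o-generated cases.
real = rng.normal(loc=58.0, scale=14.0, size=6166)
synthetic = rng.normal(loc=57.5, scale=14.5, size=6166)

# Two-sample t-test for a continuous parameter.
t_stat, p_cont = stats.ttest_ind(real, synthetic, equal_var=False)
print(f"continuous parameter: t = {t_stat:.2f}, p = {p_cont:.3f}")

# Two-sample proportion test for a binary parameter (placeholder counts).
successes = np.array([3100, 3050])   # e.g. number of cases with the attribute
n_obs = np.array([6166, 6166])
z_stat, p_bin = proportions_ztest(successes, n_obs)
print(f"binary parameter: z = {z_stat:.2f}, p = {p_bin:.3f}")
```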
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.
The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.
WikiReddit, a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.
Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Data from Reddit to Wikipedia is linked via the hyperlink and article titles appearing in Reddit posts.
Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
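A rough sketch of these two processing steps, ID hashing and Wikipedia-URL extraction, is shown below; the exact regular expression and hashing setup used for the published dataset are not specified here, so the pattern is only indicative:

```python
import hashlib
import re

def anonymize_id(reddit_id: str) -> str:
    # SHA-256 hash of a Reddit identifier, as used for anonymisation.
    return hashlib.sha256(reddit_id.encode("utf-8")).hexdigest()

# Indicative pattern for Wikipedia article links in post/comment text.
WIKI_URL = re.compile(r"https?://([a-z\-]+)\.(?:m\.)?wikipedia\.org/wiki/\S+")

text = "Background: https://en.wikipedia.org/wiki/Zipf%27s_law"
print(anonymize_id("t1_abc123"))
for m in WIKI_URL.finditer(text):
    print(m.group(0), "-> language subdomain:", m.group(1))
```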
We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.
The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942
Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.
posts
Column Name | Type | Description |
---|---|---|
subreddit_id | TEXT | The unique identifier for the subreddit. |
crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost. |
post_id | TEXT | Unique identifier for the Reddit post. |
created_at | TIMESTAMP | The timestamp when the post was created. |
updated_at | TIMESTAMP | The timestamp when the post was last updated. |
language_code | TEXT | The language code of the post. |
score | INTEGER | The score (upvotes minus downvotes) of the post. |
upvote_ratio | REAL | The ratio of upvotes to total votes. |
gildings | INTEGER | Number of awards (gildings) received by the post. |
num_comments | INTEGER | Number of comments on the post. |
comments
Column Name | Type | Description |
---|---|---|
subreddit_id | TEXT | The unique identifier for the subreddit. |
post_id | TEXT | The ID of the Reddit post the comment belongs to. |
parent_id | TEXT | The ID of the parent comment (if a reply). |
comment_id | TEXT | Unique identifier for the comment. |
created_at | TIMESTAMP | The timestamp when the comment was created. |
last_modified_at | TIMESTAMP | The timestamp when the comment was last modified. |
score | INTEGER | The score (upvotes minus downvotes) of the comment. |
upvote_ratio | REAL | The ratio of upvotes to total votes for the comment. |
gilded | INTEGER | Number of awards (gildings) received by the comment. |
postlinks
Column Name | Type | Description |
---|---|---|
post_id | TEXT | Unique identifier for the Reddit post. |
end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL. |
end_processed_url | TEXT | The extracted URL from the Reddit post. |
final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections. |
final_status | INTEGER | HTTP status code of the final URL. |
final_url | TEXT | The final URL after redirections. |
redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0). |
in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0). |
commentlinks
Column Name | Type | Description |
---|---|---|
comment_id | TEXT | Unique identifier for the Reddit comment. |
end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL. |
end_processed_url | TEXT | The extracted URL from the comment. |
final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections. |
final_status | INTEGER | HTTP status code of the final URL. |
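Given the schema above, the tables can be queried directly with SQL; a minimal sketch, assuming the released SQLite file is named wikireddit.sqlite (adjust to the actual file name):

```python
import sqlite3

# Placeholder database file name; adjust to the actual release.
con = sqlite3.connect("wikireddit.sqlite")

# Posts with the most valid Wikipedia links, joining posts and postlinks.
query = """
SELECT p.post_id, p.num_comments, COUNT(*) AS n_links
FROM posts AS p
JOIN postlinks AS l ON l.post_id = p.post_id
WHERE l.final_valid = 1
GROUP BY p.post_id, p.num_comments
ORDER BY n_links DESC
LIMIT 10;
"""
for row in con.execute(query):
    print(row)
con.close()
```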
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for MIRACL (Topics and Qrels)
Dataset Description
Homepage | Repository | Paper | ArXiv
MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset that focuses on search across 18 different languages, which collectively encompass over three billion native speakers around the world. This dataset contains the collection data of the 16 "known languages". The remaining 2 "surprise languages" will not… See the full description on the dataset page: https://huggingface.co/datasets/miracl/miracl.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Historical Dataset of World Language High School is provided by PublicSchoolReview and contains statistics on the following metrics: Total Students Trends Over Years (2007-2023), Total Classroom Teachers Trends Over Years (2008-2023), Distribution of Students By Grade Trends, Student-Teacher Ratio Comparison Over Years (2008-2023), Hispanic Student Percentage Comparison Over Years (2007-2023), Black Student Percentage Comparison Over Years (2007-2023), White Student Percentage Comparison Over Years (2006-2023), Two or More Races Student Percentage Comparison Over Years (2013-2022), Diversity Score Comparison Over Years (2007-2023), Free Lunch Eligibility Comparison Over Years (2013-2023), Reading and Language Arts Proficiency Comparison Over Years (2011-2022), Math Proficiency Comparison Over Years (2011-2023), Overall School Rank Trends Over Years (2012-2023), and Graduation Rate Comparison Over Years (2011-2023).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Rapidly acquiring three-dimensional (3D) building data, including geometric attributes like rooftop, height and orientation, as well as indicative attributes like function, quality, and age, is essential for accurate urban analysis, simulations, and policy updates. Current building datasets suffer from incomplete coverage of building multi-attributes. This paper presents the first national-scale Multi-Attribute Building dataset (CMAB) built with artificial intelligence, covering 3,667 spatial cities, 31 million buildings, and 23.6 billion m² of rooftops extracted with an F1-score of 89.93% using OCRNet, totaling 363 billion m³ of building stock. We trained bootstrap-aggregated XGBoost models with city administrative classifications, incorporating morphology, location, and function features. Using multi-source data, including billions of remote sensing images and 60 million street view images (SVIs), we generated rooftop, height, structure, function, style, age, and quality attributes for each building with machine learning and large multimodal models. Accuracy was validated through model benchmarks, existing similar products, and manual SVI validation, and is mostly above 80%. Our dataset and results are crucial for global SDGs and urban planning.
Data records: A building dataset with a total rooftop area of 23.6 billion square meters across 3,667 natural cities in China, including the attributes of building rooftop, height, structure, function, age, style and quality, as well as the code files used to calculate these data. The deep learning models used are OCRNet, XGBoost, fine-tuned CLIP and YOLOv8.
Supplementary note: The architectural structure, style, and quality attributes are affected by the temporal and spatial distribution of street views in China. Regarding the recognition of building colors, we found that the existing CLIP-series models cannot accurately judge the composition and proportion of building colors, so these are instead calculated and supplemented by semantic segmentation and image processing. Please contact zhangyec23@mails.tsinghua.edu.cn or ylong@tsinghua.edu.cn if you have any technical problems.
Reference format: Zhang, Y., Zhao, H. & Long, Y. CMAB: A Multi-Attribute Building Dataset of China. Sci Data 12, 430 (2025). https://doi.org/10.1038/s41597-025-04730-5.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset tracks annual distribution of students across grade levels in World Language Middle School
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Collection of Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home
The dataset is useful to train and evaluate models to identify and classify privacy-sensitive parts of conversations from text, especially in the context of AI assistants and LLMs.
The provided data format is .jsonl, the JSON Lines text format, also called newline-delimited JSON. An example entry looks as follows.
{ "text": "CW: Have you ever been to Italy? CR: Oh, yes... many years ago.", "taxonomy": 0, "category": 0, "affected_speaker": 1, "language": "en", "locale": "US", "data_type": 1, "uid": 16, "split": "train" }
The data fields are:
- text: a string feature. The abbreviations of the speakers refer to the care worker (CW) and the care recipient (CR).
- taxonomy: a classification label, with possible values including informational (0), invasion (1), collection (2), processing (3), dissemination (4), physical (5), personal-space (6), territoriality (7), intrusion (8), obtrusion (9), contamination (10), modesty (11), psychological (12), interrogation (13), psychological-distance (14), social (15), association (16), crowding-isolation (17), public-gaze (18), solitude (19), intimacy (20), anonymity (21), reserve (22). The taxonomy is derived from Rueben et al. (2017). The classifications were manually labeled by an expert.
- category: a classification label, with possible values including personal-information (0), family (1), health (2), thoughts (3), values (4), acquaintance (5), appointment (6). The privacy category affected in the conversation. The classifications were manually labeled by an expert.
- affected_speaker: a classification label, with possible values including care-worker (0), care-recipient (1), other (2), both (3). The speaker whose privacy is impacted during the conversation. The classifications were manually labeled by an expert.
- language: a string feature. Language code as defined by ISO 639.
- locale: a string feature. Regional code as defined by ISO 3166-1 alpha-2.
- data_type: a classification label, with possible values including real (0), synthetic (1).
- uid: an int64 feature. A unique identifier within the dataset.
- split: a string feature. Either train, validation, or test.
The dataset has 2 subsets:
- split: with a total of 95 examples split into train, validation, and test (70%-15%-15%)
- unsplit: with a total of 95 examples in a single train split
name | train | validation | test |
---|---|---|---|
split | 66 | 14 | 15 |
unsplit | 95 | n/a | n/a |
The files follow the naming convention subset-split-language.jsonl. The following files are contained in the dataset:
- split-train-en.jsonl
- split-validation-en.jsonl
- split-test-en.jsonl
- unsplit-train-en.jsonl
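Since each line in these files is a standalone JSON object (as in the example above), loading a file is straightforward; a minimal sketch, assuming split-train-en.jsonl is in the working directory:

```python
import json

# Each line of the .jsonl file is a standalone JSON object.
examples = []
with open("split-train-en.jsonl", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            examples.append(json.loads(line))

# e.g. keep only entries labelled with the "informational" taxonomy (0).
informational = [ex for ex in examples if ex["taxonomy"] == 0]
print(f"{len(examples)} examples, {len(informational)} labelled informational")
```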
Recording audio of care workers and residents during care interactions, which includes partial and full body washing, giving of medication, as well as wound care, is a highly privacy-sensitive use case. Therefore, a dataset is created, which includes privacy-sensitive parts of conversations, synthesized from real-world data. This dataset serves as a basis for fine-tuning a local LLM to highlight and classify privacy-sensitive sections of transcripts created in care interactions, to further mask them to protect privacy.
The initial data was collected in the project Caring Robots of TU Wien in cooperation with Caritas Wien. One project track aims to use Large Language Models (LLMs) to support the documentation work of care workers, with LLM-generated summaries of audio recordings of interactions between care workers and care home residents. The initial data are the transcriptions of those care interactions.
The transcriptions were thoroughly reviewed, and sections containing privacy-sensitive information were identified and marked using qualitative data analysis software by two experts. Subsequently, the sections were translated from German to U.S. English using the locally executed LLM icky/translate. In the next step, another locally run model, llama3.1:70b, was used to synthesize the conversation segments. This process involved generating similar, yet distinct and new, conversations that are not linked to the original data. The dataset was split using the train_test_split function from the scikit-learn library (https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.train_test_split.html).
The ECOLANG Multimodal Corpus of adult-child and adult-adult conversation provides audiovisual recordings and annotation of multimodal communicative behaviours by English-speaking adults and children engaged in semi-naturalistic conversation.
Corpus: The corpus provides audiovisual recordings and annotation of multimodal behaviours (speech transcription, gesture, object manipulation, and eye gaze) by British and American English-speaking adults engaged in semi-naturalistic conversation with their child (N = 38, children 3-4 years old) or a familiar adult (N = 31). Speakers were asked to talk about objects (familiar or unfamiliar) to their interlocutors both when the objects were physically present and when they were absent. Thus, the corpus characterises the use of multimodal signals in social interaction and their modulation depending on the age of the interlocutor (child or adult), whether the interlocutor is learning new concepts/words (unfamiliar or familiar objects), and whether they can see and manipulate the objects (present or absent).
Application: The corpus provides ecologically valid data about the distribution and co-occurrence of multimodal signals for cognitive scientists and neuroscientists to address questions about real-world language learning and processing, and for computer scientists to develop more human-like artificial agents.
Data access requires permission. To obtain permission to view or download the video data (either viewing in your browser or downloading to your computer), please download the user license at https://www.ucl.ac.uk/pals/sites/pals/files/eula_ecolang.pdf, fill in the form and return it to ecolang@ucl.ac.uk. User licenses are granted in batches every few weeks. To view the eaf annotation files, you will need to download and install the software ELAN, available for free for Mac, Windows and Linux.