14 datasets found

🌍📚 World Languages Dataset 🌍📚
kaggle.com
zip
Updated Jul 30, 2024
Cite
Waqar Ali (2024). 🌍📚 World Languages Dataset 🌍📚 [Dataset]. https://www.kaggle.com/datasets/waqi786/world-languages-dataset
Explore at:
Available download formats: zip (5706 bytes)
Dataset updated
Jul 30, 2024
Authors
Waqar Ali
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Area covered
World
Description
This dataset provides a comprehensive overview of 500 languages spoken around the world. It captures essential linguistic features, including language families, geographical regions, writing systems, and the estimated number of native speakers. This dataset aims to highlight the rich diversity of languages and their cultural significance, offering valuable insights for linguists, researchers, and enthusiasts interested in global language distribution.

The dataset contains real and accurate records for 500 languages across different regions and linguistic families. It covers a diverse range of languages, from widely spoken ones like English and Mandarin to less commonly known languages. The data was meticulously compiled to reflect the authentic linguistic landscape and provide a valuable resource for language studies and cultural analysis.
MCB_languages_county
kaggle.com
zip
Updated Oct 1, 2019
Cite
Marisol Brewster (2019). MCB_languages_county [Dataset]. https://www.kaggle.com/mcbrewster/mcb-languages-county
Explore at:
Available download formats: zip (414833 bytes)
Dataset updated
Oct 1, 2019
Authors
Marisol Brewster
Description
Context

This is a dataset I found online through the Google Dataset Search portal.

Content

The American Community Survey (ACS) 2009-2013 multi-year data are used to list all languages spoken in the United States that were reported during the sample period. These tables provide detailed counts of many more languages than the 39 languages and language groups that are published annually as a part of the routine ACS data release. This is the second tabulation beyond 39 languages since ACS began.

The tables include all languages that were reported in each geography during the 2009 to 2013 sampling period. For the purpose of tabulation, reported languages are classified in one of 380 possible languages or language groups. Because the data are a sample of the total population, there may be languages spoken that are not reported, either because the ACS did not sample the households where those languages are spoken, or because the person filling out the survey did not report the language or reported another language instead.

The tables also provide information about self-reported English-speaking ability. Respondents who reported speaking a language other than English were asked to indicate their ability to speak English in one of the following categories: "Very well," "Well," "Not well," or "Not at all." The data on ability to speak English represent the person’s own perception about his or her own ability or, because ACS questionnaires are usually completed by one household member, the responses may represent the perception of another household member.

These tables are also available through the Census Bureau's application programming interface (API). Please see the developers page for additional details on how to use the API to access these data.

Acknowledgements

Sources:

Google Dataset Search: https://toolbox.google.com/datasetsearch

2009-2013 American Community Survey

Original dataset: https://www.census.gov/data/tables/2013/demo/2009-2013-lang-tables.html

Downloaded From: https://data.world/kvaughn/languages-county

Banner and thumbnail photo by Farzad Mohsenvand on Unsplash
English Agent-Customer Chat Dataset for Healthcare Domain
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Cite
FutureBee AI (2022). English Agent-Customer Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/english-healthcare-domain-conversation-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The English Healthcare Chat Dataset is a rich collection of over 12,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in English-speaking regions.
Participant & Chat Overview
• Participants: 200+ native English speakers from the FutureBeeAI Crowd Community
• Conversation Length: 300–700 words per chat
• Turns per Chat: 50–150 dialogue turns across both participants
• Chat Types: Inbound and outbound
• Sentiment Coverage: Positive, neutral, and negative outcomes included

Topic Diversity
The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:
• Inbound Chats (Customer-Initiated): Appointment scheduling, new patient registration, surgery and treatment consultations, diet and lifestyle discussions, insurance claim inquiries, lab result follow-ups
• Outbound Chats (Agent-Initiated): Appointment reminders and confirmations, health and wellness program offers, test result notifications, preventive care and vaccination reminders, subscription renewals, risk assessment and eligibility follow-ups

This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.
Language Diversity & Realism
This dataset reflects the natural flow of English healthcare communication and includes:
• Authentic Naming Patterns: English personal names, clinic names, and brands
• Localized Contact Elements: Addresses, emails, phone numbers, and clinic locations in regional English formats
• Time & Currency References: Use of dates, times, numeric expressions, and currency units aligned with English-speaking regions
• Colloquial & Medical Expressions: Local slang, informal speech, and common healthcare-related terminology

These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.
Conversational Flow & Structure
Conversations range from simple inquiries to complex advisory sessions, including:
•General inquiries
•Detailed problem-solving
•Routine status updates
•Treatment recommendations
•Support and feedback interactions
Each conversation typically includes these structural components:
•Greetings and verification
•Information gathering
•Problem definition
•Solution delivery
•Closing messages
•Follow-up and feedback (where applicable)
This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.
Data Format & Structure
Available in JSON, CSV, and TXT formats, each conversation includes:
•Full message history with clear speaker labels
•Participant identifiers
•Metadata (e.g., topic tags, region, sentiment)
•Compatibility with common NLP and ML pipelines
Applications
World Countries and Continents Details
kaggle.com
zip
Updated Oct 5, 2017
Cite
folaraz (2017). World Countries and Continents Details [Dataset]. https://www.kaggle.com/folaraz/world-countries-and-continents-details
Explore at:
Available download formats: zip (24400 bytes)
Dataset updated
Oct 5, 2017
Authors
folaraz
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Area covered
World
Description
Context

Can you tell geographical stories about the world using data science?

Content

World countries with their corresponding continents, official English names, official French names, Dial, ITU, Languages, and so on.

Acknowledgements

This data was obtained from https://old.datahub.io/

Inspiration

Exploration of the world countries:
- Can we graphically visualize countries that speak a particular language?
- We can also integrate this dataset into others to enhance our exploration.
- The dataset has now been updated to include the longitudes and latitudes of countries in the world.
The ORBIT (Object Recognition for Blind Image Training)-India Dataset
zenodo.org
data.niaid.nih.gov
+1 more
Updated Apr 24, 2025
Cite
Gesu India; Martin Grayson; Daniela Massiceti; Cecily Morrison; Simon Robinson; Jennifer Pearson; Matt Jones (2025). The ORBIT (Object Recognition for Blind Image Training)-India Dataset [Dataset]. http://doi.org/10.5281/zenodo.12608444
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.12608444
Dataset updated
Apr 24, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Gesu India; Martin Grayson; Daniela Massiceti; Cecily Morrison; Simon Robinson; Jennifer Pearson; Matt Jones
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
India
Description
The ORBIT (Object Recognition for Blind Image Training)-India Dataset is a collection of 105,243 images of 76 commonly used objects, collected by 12 individuals in India who are blind or have low vision. This dataset is an "Indian subset" of the original ORBIT dataset [1, 2], which was collected in the UK and Canada. In contrast to the ORBIT dataset, which was created in a Global North, Western, and English-speaking context, the ORBIT-India dataset features images taken in a low-resource, non-English-speaking, Global South context, home to 90% of the world’s population of people with blindness. Since it is easier for blind or low-vision individuals to gather high-quality data by recording videos, this dataset, like the ORBIT dataset, contains images (each sized 224x224) derived from 587 videos. These videos were taken by our data collectors from various parts of India using the Find My Things [3] Android app. Each data collector was asked to record eight videos of at least 10 objects of their choice.

Collected between July and November 2023, this dataset represents a set of objects commonly used by people who are blind or have low vision in India, including earphones, talking watches, toothbrushes, and typical Indian household items like a belan (rolling pin), and a steel glass. These videos were taken in various settings of the data collectors' homes and workspaces using the Find My Things Android app.

The image dataset is stored in the ‘Dataset’ folder, organized by folders assigned to each data collector (P1, P2, ...P12) who collected them. Each collector's folder includes sub-folders named with the object labels as provided by our data collectors. Within each object folder, there are two subfolders: ‘clean’ for images taken on clean surfaces and ‘clutter’ for images taken in cluttered environments where the objects are typically found. The annotations are saved inside an ‘Annotations’ folder containing a JSON file per video (e.g., P1--coffee mug--clean--231220_084852_coffee mug_224.json) that contains keys corresponding to all frames/images in that video (e.g., "P1--coffee mug--clean--231220_084852_coffee mug_224--000001.jpeg": {"object_not_present_issue": false, "pii_present_issue": false}, "P1--coffee mug--clean--231220_084852_coffee mug_224--000002.jpeg": {"object_not_present_issue": false, "pii_present_issue": false}, ...). The ‘object_not_present_issue’ key is true if the object is not present in the image, and the ‘pii_present_issue’ key is true if there is personally identifiable information (PII) present in the image. Note that all PII present in the images has been blurred to protect the identity and privacy of our data collectors. This dataset version was created by cropping images originally sized at 1080 × 1920; an unscaled version of the dataset will follow soon.
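The annotation layout described above can be read with a few lines of Python. This is a minimal sketch, not the authors' tooling: it assumes that frame keys always split into exactly five parts on `--` (collector, object label, setting, video name, frame file), which holds for the examples quoted above but is not guaranteed by the description. The function names are hypothetical, and the third sample entry is made up to show a flagged frame.

```python
import json

# The first two entries are copied from the description above; the third is a
# made-up frame added only to illustrate a flagged case.
SAMPLE = json.loads("""
{
  "P1--coffee mug--clean--231220_084852_coffee mug_224--000001.jpeg":
    {"object_not_present_issue": false, "pii_present_issue": false},
  "P1--coffee mug--clean--231220_084852_coffee mug_224--000002.jpeg":
    {"object_not_present_issue": false, "pii_present_issue": false},
  "P1--coffee mug--clean--231220_084852_coffee mug_224--000003.jpeg":
    {"object_not_present_issue": true, "pii_present_issue": false}
}
""")


def parse_frame_key(key: str) -> dict:
    """Split a frame key into its parts (assumes five '--'-separated fields)."""
    collector, label, setting, video, frame_file = key.split("--")
    return {
        "collector": collector,
        "object": label,
        "setting": setting,  # 'clean' or 'clutter'
        "video": video,
        "frame": int(frame_file.rsplit(".", 1)[0]),
    }


def usable_frames(annotations: dict) -> list:
    """Return keys of frames where the object is present and no PII is flagged."""
    return [
        key
        for key, flags in annotations.items()
        if not flags["object_not_present_issue"] and not flags["pii_present_issue"]
    ]


for key in usable_frames(SAMPLE):
    print(parse_frame_key(key))
```

For a real annotation file, `SAMPLE` would be replaced by the result of `json.load` on one of the per-video JSON files in the ‘Annotations’ folder.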

This project was funded by the Engineering and Physical Sciences Research Council (EPSRC) Industrial ICASE Award with Microsoft Research UK Ltd. as the Industrial Project Partner. We would like to acknowledge and express our gratitude to our data collectors for their efforts and time invested in carefully collecting videos to build this dataset for their community. The dataset is designed for developing few-shot learning algorithms, aiming to support researchers and developers in advancing object-recognition systems. We are excited to share this dataset and would love to hear from you if and how you use this dataset. Please feel free to reach out if you have any questions, comments or suggestions.

REFERENCES:

[1] Daniela Massiceti, Lida Theodorou, Luisa Zintgraf, Matthew Tobias Harris, Simone Stumpf, Cecily Morrison, Edward Cutrell, and Katja Hofmann. 2021. ORBIT: A real-world few-shot dataset for teachable object recognition collected from people who are blind or low vision. DOI: https://doi.org/10.25383/city.14294597

[2] microsoft/ORBIT-Dataset. https://github.com/microsoft/ORBIT-Dataset

[3] Linda Yilin Wen, Cecily Morrison, Martin Grayson, Rita Faia Marques, Daniela Massiceti, Camilla Longden, and Edward Cutrell. 2024. Find My Things: Personalized Accessibility through Teachable AI for People who are Blind or Low Vision. In Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems (CHI EA '24). Association for Computing Machinery, New York, NY, USA, Article 403, 1–6. https://doi.org/10.1145/3613905.3648641
Democracy and English Indicators
scidb.cn
Updated Apr 12, 2024
Cite
Abdullah AlKhuraibet (2024). Democracy and English Indicators [Dataset]. http://doi.org/10.57760/sciencedb.16236
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.57760/sciencedb.16236
Dataset updated
Apr 12, 2024
Dataset provided by
Science Data Bank
Authors
Abdullah AlKhuraibet
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
The data were collected to test whether English proficiency levels in a country are positively associated with higher democratic values in that country. English proficiency is sourced from Education First’s "EF English Proficiency Index", which covers countries' scores for the calendar years 2021 and 2022. The EF English Proficiency Index ranks 111 countries in five categories based on English proficiency scores calculated from the test results of 2.1 million adults. Democratic values are operationalized through the liberal democracy index from the V-Dem Institute annual reports for 2021 and 2022. Additionally, the data are used to test whether English-language media consumption acts as a mediating variable between English proficiency and democracy levels in a country, while also considering other possible regression variables. The linear regression analyses were conducted in Microsoft Excel.
The raw data set covers 90 nation states over two years, 2021 and 2022. The raw data feed two separate data sets. The first, democracy indicators, has the regression variables EPI, HDI, and GDP, for a total of 360 data entries. HDI is a statistical summary measure developed by the United Nations Development Programme (UNDP) that measures levels of human development in 190 countries. The data for nominal gross domestic product (GDP) are sourced from the World Bank. Including strong regression variables that have been shown to have a positive link with democracy, such as GDP and HDI, allows the regression analysis to identify whether there is a true relationship between English proficiency and democracy levels in a country.
The second data set has a total of 720 data entries and aims to identify English proficiency indicators. It has seven regression variables: LDI scores, Years of Mandatory English Education, Heads of States Publicly speaking English, GDP PPP (2021USD), Common Wealth, BBC web traffic, and CNN web traffic. The data for years of mandatory English education are sourced from research at the University of Winnipeg and are coded based on the number of years a country has English as a mandatory subject, ranging from 0 to 13 years. It is important to note that this data only concerns public schools and does not extend to the private school systems in each country. The data for heads of state publicly speaking English were gathered through a video analysis of all heads of state. Data were only used for heads of state who had been in their position for at least a year, to ensure the accuracy of the data collected; for heads of state who had not been in their position for a year, data were taken from the previous head of state. This data only takes into account speeches and interviews conducted during their incumbency. The data for each country’s GDP PPP scores are sourced from the World Bank, which was last updated for a majority of the countries in 2021 and is tied to the US dollar. Data for the Commonwealth only include members of the Commonwealth that were historically colonized by the United Kingdom. Any country that falls under that category is coded as 1 and any country that does not is coded as 0. BBC and CNN web traffic data are sourced using tools in Semrush, which provide a rough estimate of how much web traffic each news site generates in each country. These estimates are used to identify the average web traffic for BBC News and CNN World News for both the 2021 and 2022 calendar years. The traffic for each country is also measured per capita, per 10 thousand people, to ensure that the population density of a country does not influence the results. The population of each country for 2021 and 2022 is sourced from the United Nations World Population Prospects revisions for 2021 and 2022, respectively.
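The per-capita step described above amounts to a single division. The sketch below illustrates that arithmetic only; the function name and the example figures are hypothetical and are not taken from the dataset.

```python
def traffic_per_10k(visits: float, population: int) -> float:
    """Web traffic per 10,000 inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return visits * 10_000 / population


# Hypothetical country: 50,000 estimated visits, 5,000,000 inhabitants.
print(traffic_per_10k(50_000, 5_000_000))  # 100.0
```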
Total population worldwide 1950-2100
thinkdemo.it
feherkonyveloiroda.hu
+2 more
Updated Nov 19, 2025
Cite
Statista (2025). Total population worldwide 1950-2100 [Dataset]. https://thinkdemo.it/?p=2400399
Explore at:
Dataset updated
Nov 19, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Area covered
World
Description
The world population surpassed eight billion people in 2022, having doubled from its figure less than 50 years previously. Looking forward, it is projected that the world population will reach nine billion in 2038 and 10 billion in 2060, but it will peak at around 10.3 billion in the 2080s before going into decline.

Regional variations

The global population has seen rapid growth since the early 1800s due to advances in areas such as food production, healthcare, water safety, education, and infrastructure; however, these changes did not occur at a uniform time or pace across the world. Broadly speaking, the first regions to undergo their demographic transitions were Europe, North America, and Oceania, followed by Latin America and Asia (although Asia's development saw the greatest variation due to its size), while Africa was the last continent to undergo this transformation. Because of these differences, many so-called "advanced" countries are now experiencing population decline, particularly in Europe and East Asia, while the fastest population growth rates are found in Sub-Saharan Africa. In fact, the roughly two billion difference in population between now and the 2080s' peak will be found in Sub-Saharan Africa, which will rise from 1.2 billion to 3.2 billion in this time (although populations in other continents will also fluctuate).

Changing projections

The United Nations releases its World Population Prospects report every 1-2 years, and this is widely considered the foremost demographic dataset in the world. However, recent years have seen a notable downward revision in projections of when the global population will peak, and at what number. Previous reports in the 2010s had suggested a peak of over 11 billion people, and that population growth would continue into the 2100s; however, an earlier and lower peak is now projected. Reasons for this include a more rapid population decline in East Asia and Europe, particularly China, as well as a prolonged development arc in Sub-Saharan Africa.
Liberia Language Areas
ebola-nga.opendata.arcgis.com
Updated Dec 5, 2014
Cite
National Geospatial-Intelligence Agency (2014). Liberia Language Areas [Dataset]. https://ebola-nga.opendata.arcgis.com/content/24b6f49f6b8841c09588b397674a9dd9
Explore at:
Dataset updated
Dec 5, 2014
Dataset authored and provided by
National Geospatial-Intelligence Agency (http://www.nga.mil/)
Area covered

Description
(UNCLASSIFIED) English is the official language in Liberia and is used in government, business, and education to some extent. The majority of Liberians do not know English, and those who do speak what is commonly referred to as Liberian English. This often involves leaving off the end of words and/or adding the letter “o” to the end. English words can also have different meanings in the country. Prior to the civil wars, the government created the National Language Program. This program was designed to introduce local languages to primary students prior to the instruction of English. Due to two civil wars and poor infrastructure, language policy has not received much attention. While English is the official language, it does not properly represent the diverse population, leaving few fluent and causing most to rely on their local languages.

Attribute Table Field Descriptions

ISO3 - International Organization for Standardization 3-digit country code
ADM0_NAME - Administration level zero identification / name
LANG_FAM - Language family
LANG_SUBGR - Language subgroup
ALT_NAMES - Alternate names
COMMENTS - Comments or notes regarding language
SOURCE_DT - Source one creation date
SOURCE - Source one
SOURCE2_DT - Source two creation date
SOURCE2 - Source two

Collection

This HGIS was created through linguistic information provided through The World Language Mapping System (WLMS). This data was then processed through DigitalGlobe’s AnthroMapper program to generate more accurate linguistic coverage boundaries. The metadata was supplemented with anthropological and linguistic information from peer-reviewed journals and published books. It should be noted that this shape file only depicts the majority first level languages spoken in a given area; there might be significant populations of other minority language speakers not shown in this dataset. The data included herein have not been derived from a registered survey and should be considered approximate unless otherwise defined. While rigorous steps have been taken to ensure the quality of each dataset, DigitalGlobe is not responsible for the accuracy and completeness of data compiled from outside sources.

Sources (HGIS)

Anthromapper. DigitalGlobe, September 2014.
World Language Mapping System (WLMS) Version 16. World GeoDatasets, October 2013.

Sources (Metadata)

Albaugh, Ericka. "Language Policies in African Education." Working paper, Department of Government Legal Studies at Bowdoin College, 2005. http://www.bowdoin.edu/.
Central Intelligence Agency. The World FactBook, “Liberia”. Last updated June 2014. Accessed September 2014. https://www.cia.gov/index.html.
The Reeds in Liberia, “Liberian English.” October 2007. Accessed September 2014. http://reedsinliberia.blogspot.com/.
FLORES-101
kaggle.com
zip
Updated Jun 7, 2021
Cite
Mathurin Aché (2021). FLORES-101 [Dataset]. https://www.kaggle.com/mathurinache/flores101
Explore at:
Available download formats: zip (13628027 bytes)
Dataset updated
Jun 7, 2021
Authors
Mathurin Aché
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Machine translation helps bridge the language barriers between people and information — but historically, research has focused on creating and evaluating translation systems for only a handful of languages, usually the few most spoken languages in the world. This excludes the billions of people worldwide who don’t happen to be fluent in languages such as English, Spanish, Russian, and Mandarin.

We’ve recently made progress with machine translation systems like M2M-100, our open source model that can translate a hundred different languages. Further advances necessitate tools with which to test and compare these translation systems with one another, though.

Today, we are open-sourcing FLORES-101, a first-of-its-kind, many-to-many evaluation data set covering 101 languages from all over the world. FLORES-101 is the missing piece, the tool that enables researchers to rapidly test and improve upon multilingual translation models like M2M-100.

We’re making FLORES-101 publicly available because we believe in breaking down language barriers, and that means helping empower researchers to create more diverse (and locally relevant) translation tools — ones that may make it as easy to translate from, say, Bengali to Marathi as it is to translate from English to Spanish today. We’re making the full FLORES-101 data set, an accompanying tech report, and several models publicly available for the entire research community to use, to accelerate progress on many-to-many translation systems worldwide.

Why evaluation matters

Imagine trying to bake a cake — but not being able to taste it. It’s near-impossible to know whether it’s any good, and even harder to know how to improve the recipe for future attempts.

Evaluating how well translation systems perform has been a major challenge for AI researchers — and that knowledge gap has impeded progress. If researchers cannot measure or compare their results, they can’t develop better translation systems. The AI research community needed an open and easily accessible way to perform high-quality, reliable measurement of many-to-many translation model performance and then compare results with others.

Previous work on this problem relied heavily on translating in and out of English, often using proprietary data sets. But while this benefited English speakers, it was and is insufficient for many parts of the world where people need fast and accurate translation between regional languages — for instance, in India, where the constitution recognizes over 20 official languages.

FLORES-101 focuses on what are known as low-resource languages, such as Amharic, Mongolian, and Urdu, which do not currently have extensive data sets for natural language processing research. For the first time, researchers will be able to reliably measure the quality of translations through 10,100 different translation directions — for example, directly from Hindi to Thai or Swahili. For context, evaluating in and out of English would provide merely 200 translation directions.
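The two direction counts quoted above follow from simple counting: with n languages, every ordered pair of distinct languages is one translation direction, whereas routing everything through English leaves only the directions into and out of English. A minimal sketch (function names are illustrative):

```python
def many_to_many_directions(n_languages: int) -> int:
    """Ordered pairs of distinct languages: n * (n - 1)."""
    return n_languages * (n_languages - 1)


def english_centric_directions(n_languages: int) -> int:
    """Directions into and out of one pivot language: 2 * (n - 1)."""
    return 2 * (n_languages - 1)


print(many_to_many_directions(101))     # 10100
print(english_centric_directions(101))  # 200
```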

The flexibility exhibited by FLORES is possible because we designed around many-to-many translation from the start. The data set contains the same set of sentences across all languages, enabling researchers to evaluate the performance of any and all translation directions.

“Efforts like FLORES are of immense value, because they not only draw attention to under-served languages, but they immediately invite and actively facilitate research on all these languages,” said Antonios Anastasopoulos, assistant professor at George Mason University’s Department of Computer Science.

Building a benchmark

Good benchmarks are difficult to construct. They need to be able to accurately reflect meaningful differences between models so they can be used by researchers to make decisions. Translation benchmarks can be particularly difficult because the same quality standard must be met across all languages, not just a select few for which translators are more readily available.

To that end, we created the FLORES-101 data set in a multistep workflow. Each document was first translated by a professional translator, and then verified by a human editor. Next, it proceeded to the quality-control phase, including checks for spelling, grammar, punctuation, and formatting, and comparison with translations from commercial engines. After that, a different set of translators performed human evaluation, identifying errors across numerous categories including unnatural translation, register, and grammar. Based on the number and severity of the identified errors, the translations were either sent back for retranslation or — if they met quality standards — the translations were considered complete.

Translation quality is not enough on its own, though. Th...
E-Commerce Analysis : Global Skincare E-Store
kaggle.com
zip
Updated Dec 16, 2024
Cite
Shandeep Raula (2024). E-Commerce Analysis : Global Skincare E-Store [Dataset]. https://www.kaggle.com/datasets/shandeep777/e-commerce-analysis-global-skincare-e-store
Explore at:
Available download formats: zip (6210832 bytes)
Dataset updated
Dec 16, 2024
Authors
Shandeep Raula
Description
Dataset Summary

This dataset provides comprehensive insights into the global skincare and beauty e-commerce market. It contains detailed transaction data, customer behavior patterns, and sales metrics, offering valuable information for analyzing the performance of online beauty stores. The dataset is tailored for English-speaking users.

Key Features

- Transaction Data: Includes details such as order IDs, product categories, sales revenue, and transaction dates.
- Customer Insights: Information about customer demographics, preferences, and purchase history.
- Product Details: Comprehensive data on product categories, subcategories, pricing, and stock levels.
- Geographic Analysis: Regional data to understand the market's reach across different countries and demographics.

Potential Use Cases

- Market Analysis: Identify trends in the skincare and beauty industry.
- Customer Behavior Modeling: Analyze purchasing habits to improve marketing strategies.
- E-Commerce Performance Evaluation: Evaluate sales trends and revenue streams.
- Price Optimization: Use data-driven insights to optimize product pricing.

File Structure

The dataset is provided in an Excel file with multiple sheets (if applicable). Each sheet contains organized data for easier navigation and analysis. Specific sheets might cover:

- Orders: Transaction details including order ID, product name, and sales data.
- Customers: Demographics and behavior.
- Products: Detailed product inventory and categories.
- Revenue Analysis: Key metrics including total revenue, average order value, and profit margins.
COVID-19 and Mental Health Search Terms
kaggle.com
zip
Updated Jun 15, 2020
Cite
Yunge Hao (2020). COVID-19 and Mental Health Search Terms [Dataset]. https://www.kaggle.com/luckybro/mental-health-search-term
Explore at:
zip(104868 bytes)Available download formats
Dataset updated
Jun 15, 2020
Authors
Yunge Hao
Description
This dataset was created for a task in the UNCOVER COVID-19 Challenge: Mental health impact and support services.

Search interest in mental-health-related terms on Google before and after the outbreak of the COVID-19 pandemic reveals how the public's concerns were affected by the pandemic, and its impact on the mental health of people around the world. I selected worldwide, Canada, the US, Italy, Iran, Japan, South Korea and the UK as the populations. The dataset also includes data for Canada for the past 4 years, from 2016 to 2019.

The mental health related search terms are "mental health", "depression", "anxiety", "ocd", "obsessive compulsive disorder", "insomnia", "panic attack", "counseling", "psychiatrist".

Search interest is indicated by a number between 0 and 100, where 100 marks the most popular point in time (by week), 1 the least popular, and 0 means there was not enough data.
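
As a minimal sketch of how the 0 to 100 scale described above might be handled in analysis, the snippet below treats 0 as "not enough data" and leaves it out of the average instead of counting it as a true zero. The function name and the sample values are hypothetical and are not taken from the dataset.

```python
from typing import Optional


def mean_interest(weekly_values: list[int]) -> Optional[float]:
    """Average weekly search interest, skipping weeks marked 0 (not enough data)."""
    valid = [v for v in weekly_values if 1 <= v <= 100]
    if not valid:
        return None  # every week lacked data
    return sum(valid) / len(valid)


# Hypothetical weekly values for one search term in one region (not real data).
sample_weeks = [12, 0, 25, 100, 63]
average = mean_interest(sample_weeks)
print(average)
```

Counting the 0 weeks as real values would pull the average down, so whether to drop or keep them is a choice worth stating in any analysis built on these numbers.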

All data was collected from Google Trends. I assumed that users in non-English-speaking countries searched for these terms in their own language and typed them correctly.
British English Scripted Monologue Speech Data for Telecom
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Cite
FutureBee AI (2022). British English Scripted Monologue Speech Data for Telecom [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/telecom-scripted-speech-monologues-english-uk
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
United Kingdom
Dataset funded by
FutureBeeAI
Description
Introduction
Presenting the UK English Scripted Monologue Speech Dataset for the Telecom Domain, a purpose-built dataset created to accelerate the development of English speech recognition and voice AI models specifically tailored for the telecommunications industry.
Speech Data
This dataset includes over 6,000 high-quality scripted prompt recordings in UK English, representing real-world telecom customer service scenarios. It’s designed to support the training of speech-based AI systems used in call centers, virtual agents, and voice-powered support tools.
•Participant Diversity
  •Speakers: 60 native UK English speakers
  •Geographic Distribution: Carefully selected from multiple regions across the United Kingdom to capture a wide spectrum of dialects and speaking styles
  •Demographics: Balanced representation of males and females (60:40 ratio), aged 18 to 70 years
•Recording Specifications
  •Type: Scripted monologue prompts focused on telecom industry use cases
  •Duration: Each audio clip ranges from 5 to 30 seconds
  •Format: WAV files in mono, 16-bit depth, with sample rates of 8 kHz and 16 kHz
  •Environment: Clean, echo-free, and noise-controlled settings to ensure optimal audio clarity
Topic Coverage
The dataset reflects a wide variety of common telecom customer interactions, including:
•Customer onboarding and service inquiries
•Billing and payment questions
•Data plans and product information
•Technical support requests
•Network coverage discussions
•Regulatory compliance and policy information
•Upgrades, renewals, and service plan changes
•Domain-specific scripted interactions tailored to real-world telecom use cases
Contextual Depth
To maximize contextual richness, prompts include:
•Localized Names: Common United Kingdom names in various formats
•Addresses: Region-specific address structures for realism
•Dates & Times: Spoken date and time references in typical telecom scenarios (e.g., billing cycles, service activation times)
•Telecom Terminology: Keywords related to mobile data, network, SIM, devices, plans, etc.
•Numbers & Rates: Usage statistics, pricing info, recharge values, and billing figures
•Service Providers: References to telecom companies and third-party service entities
Transcription
Each audio file is paired with an accurate, verbatim transcription for precise model training:
•Content: Transcriptions are direct representations of each recorded prompt
•Format: Plain text (.TXT), with filenames matching their corresponding audio files
•Verification: Every transcription is manually verified by native UK English linguists to ensure consistency and accuracy
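The recording format and the filename-matched transcriptions described above could be checked with a short script such as the sketch below. It uses only Python's standard `wave` and `pathlib` modules. The file name `prompt_0001.wav`, the lowercase `.txt` extension, and the sample sentence are assumptions made for illustration; the demo runs on a synthetic silent clip, not on dataset audio.

```python
import tempfile
import wave
from pathlib import Path


def check_clip(wav_path: Path) -> dict:
    """Read the WAV header and look for a transcription with the same file stem."""
    with wave.open(str(wav_path), "rb") as wav:
        info = {
            "channels": wav.getnchannels(),        # listing states mono (1)
            "bit_depth": wav.getsampwidth() * 8,   # listing states 16-bit
            "sample_rate": wav.getframerate(),     # listing states 8 kHz or 16 kHz
            "seconds": wav.getnframes() / wav.getframerate(),
        }
    txt_path = wav_path.with_suffix(".txt")  # assumed extension casing
    info["transcript"] = (
        txt_path.read_text(encoding="utf-8").strip() if txt_path.exists() else None
    )
    info["matches_spec"] = (
        info["channels"] == 1
        and info["bit_depth"] == 16
        and info["sample_rate"] in (8000, 16000)
    )
    return info


# Demo on a synthetic one-second silent clip written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    clip = Path(tmp) / "prompt_0001.wav"  # hypothetical file name
    with wave.open(str(clip), "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)       # 2 bytes per sample = 16-bit
        out.setframerate(16000)
        out.writeframes(b"\x00\x00" * 16000)
    clip.with_suffix(".txt").write_text(
        "I would like to check my data balance.", encoding="utf-8"
    )
    result = check_clip(clip)

print(result)
```

A check like this is a cheap way to catch clips with a missing transcription or an unexpected sample rate before training starts.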
Metadata
Detailed metadata is included to
Tanzania Tourism Classification Challenge
kaggle.com
zip
Updated Jun 1, 2022
Cite
Tevin Temu (2022). Tanzania Tourism Classification Challenge [Dataset]. https://www.kaggle.com/datasets/tevintemu/tanzania-tourism-classification-challenge
Explore at:
zip(527132 bytes)Available download formats
Dataset updated
Jun 1, 2022
Authors
Tevin Temu
License
https://creativecommons.org/publicdomain/zero/1.0/
Area covered
Tanzania
Description
This challenge is open to users from English-speaking African countries.

The Tanzanian tourism sector plays a significant role in the Tanzanian economy, contributing about 17% to the country’s GDP and 25% of all foreign exchange revenues. The sector, which provides direct employment for more than 600,000 people and up to 2 million people indirectly, generated approximately $2.4 billion in 2018 according to government statistics. Tanzania received a record 1.1 million international visitor arrivals in 2014, mostly from Europe, the US and Africa.

Tanzania is the only country in the world which has allocated more than 25% of its total area for wildlife, national parks, and protected areas. There are 16 national parks in Tanzania, 28 game reserves, 44 game-controlled areas, two marine parks and one conservation area.

Tanzania’s tourist attractions include the Serengeti plains, which host the largest terrestrial mammal migration in the world; the Ngorongoro Crater, the world’s largest intact volcanic caldera and home to the highest density of big game in Africa; Kilimanjaro, Africa’s highest mountain; and the Mafia Island marine park, among many others. The scenery, topography, rich culture and very friendly people provide for excellent cultural tourism, beach holidays, honeymooning, game hunting, historical and archaeological ventures – and certainly the best wildlife photography safaris in the world.

The objective of this hackathon is to develop a machine learning model that can classify the range of a tourist's expenditure in Tanzania. The model can be used by tour operators and the Tanzania Tourism Board to automatically help tourists across the world estimate their expenditure before visiting Tanzania.
Spanish Agent-Customer Chat Dataset for Healthcare Domain
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Cite
FutureBee AI (2022). Spanish Agent-Customer Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/spanish-healthcare-domain-conversation-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Spanish Healthcare Chat Dataset is a rich collection of over 10,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in Spanish-speaking regions.
Participant & Chat Overview
•Participants: 150+ native Spanish speakers from the FutureBeeAI Crowd Community
•Conversation Length: 300–700 words per chat
•Turns per Chat: 50–150 dialogue turns across both participants
•Chat Types: Inbound and outbound
•Sentiment Coverage: Positive, neutral, and negative outcomes included
Topic Diversity
The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:
•Inbound Chats (Customer-Initiated): Appointment scheduling, new patient registration, surgery and treatment consultations, diet and lifestyle discussions, insurance claim inquiries, lab result follow-ups
•Outbound Chats (Agent-Initiated): Appointment reminders and confirmations, health and wellness program offers, test result notifications, preventive care and vaccination reminders, subscription renewals, risk assessment and eligibility follow-ups
This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.
Language Diversity & Realism
This dataset reflects the natural flow of Spanish healthcare communication and includes:
•Authentic Naming Patterns: Spanish personal names, clinic names, and brands
•Localized Contact Elements: Addresses, emails, phone numbers, and clinic locations in regional Spanish formats
•Time & Currency References: Use of dates, times, numeric expressions, and currency units aligned with Spanish-speaking regions
•Colloquial & Medical Expressions: Local slang, informal speech, and common healthcare-related terminology
These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.
Conversational Flow & Structure
Conversations range from simple inquiries to complex advisory sessions, including:
•General inquiries
•Detailed problem-solving
•Routine status updates
•Treatment recommendations
•Support and feedback interactions
Each conversation typically includes these structural components:
•Greetings and verification
•Information gathering
•Problem definition
•Solution delivery
•Closing messages
•Follow-up and feedback (where applicable)
This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.
Data Format & Structure
Available in JSON, CSV, and TXT formats, each conversation includes:
•Full message history with clear speaker labels
•Participant identifiers
•Metadata (e.g., topic tags, region, sentiment)
•Compatibility with common NLP and ML pipelines
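To make the listed components concrete, here is a rough sketch of how one JSON conversation might be loaded and summarized in Python. The listing does not publish a schema, so every field name below (`conversation_id`, `metadata`, `messages`, `speaker`, `text`) and the three sample messages are assumptions made for illustration only.

```python
import json

# Hypothetical record: the field names are assumptions, not the published schema.
raw_record = """
{
  "conversation_id": "demo-001",
  "metadata": {"topic": "appointment scheduling", "region": "ES", "sentiment": "positive"},
  "messages": [
    {"speaker": "customer", "text": "Hola, quisiera pedir una cita."},
    {"speaker": "agent", "text": "Claro, ¿para qué día la necesita?"},
    {"speaker": "customer", "text": "Para el lunes por la mañana."}
  ]
}
"""


def summarize(record: dict) -> dict:
    """Count turns and whitespace-separated words, and collect the speaker labels."""
    messages = record["messages"]
    return {
        "turns": len(messages),
        "words": sum(len(m["text"].split()) for m in messages),
        "speakers": sorted({m["speaker"] for m in messages}),
        "sentiment": record["metadata"].get("sentiment"),
    }


summary = summarize(json.loads(raw_record))
print(summary)
```

Run over a full file, a summary like this could be compared against the stated ranges of 300–700 words and 50–150 turns per chat.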
Applications

🌍📚 World Languages Dataset 🌍📚

An Insight into the World's Most Spoken Languages 🌍📚

2 scholarly articles cite this dataset (View in Google Scholar)

