14 datasets found
  1. 🌍📚 World Languages Dataset 🌍📚

    • kaggle.com
    zip
    Updated Jul 30, 2024
    Cite
    Waqar Ali (2024). 🌍📚 World Languages Dataset 🌍📚 [Dataset]. https://www.kaggle.com/datasets/waqi786/world-languages-dataset
    Explore at:
    Available download formats: zip (5706 bytes)
    Dataset updated
    Jul 30, 2024
    Authors
    Waqar Ali
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Area covered
    World
    Description

    This dataset provides a comprehensive overview of 500 languages spoken around the world. It captures essential linguistic features, including language families, geographical regions, writing systems, and the estimated number of native speakers. This dataset aims to highlight the rich diversity of languages and their cultural significance, offering valuable insights for linguists, researchers, and enthusiasts interested in global language distribution.

    The dataset contains real and accurate records for 500 languages across different regions and linguistic families. It covers a diverse range of languages, from widely spoken ones like English and Mandarin to less commonly known languages. The data was meticulously compiled to reflect the authentic linguistic landscape and provide a valuable resource for language studies and cultural analysis.

  2. MCB_languages_county

    • kaggle.com
    zip
    Updated Oct 1, 2019
    Cite
    Marisol Brewster (2019). MCB_languages_county [Dataset]. https://www.kaggle.com/mcbrewster/mcb-languages-county
    Explore at:
    Available download formats: zip (414833 bytes)
    Dataset updated
    Oct 1, 2019
    Authors
    Marisol Brewster
    Description

    Context

    This is a dataset I found online through the Google Dataset Search portal.

    Content

    The American Community Survey (ACS) 2009-2013 multi-year data are used to list all languages spoken in the United States that were reported during the sample period. These tables provide detailed counts of many more languages than the 39 languages and language groups that are published annually as a part of the routine ACS data release. This is the second tabulation beyond 39 languages since ACS began.

    The tables include all languages that were reported in each geography during the 2009 to 2013 sampling period. For the purpose of tabulation, reported languages are classified in one of 380 possible languages or language groups. Because the data are a sample of the total population, there may be languages spoken that are not reported, either because the ACS did not sample the households where those languages are spoken, or because the person filling out the survey did not report the language or reported another language instead.

    The tables also provide information about self-reported English-speaking ability. Respondents who reported speaking a language other than English were asked to indicate their ability to speak English in one of the following categories: "Very well," "Well," "Not well," or "Not at all." The data on ability to speak English represent the person’s own perception about his or her own ability or, because ACS questionnaires are usually completed by one household member, the responses may represent the perception of another household member.

    These tables are also available through the Census Bureau's application programming interface (API). Please see the developers page for additional details on how to use the API to access these data.

    Acknowledgements

    Sources:

    Google Dataset Search: https://toolbox.google.com/datasetsearch

    2009-2013 American Community Survey

    Original dataset: https://www.census.gov/data/tables/2013/demo/2009-2013-lang-tables.html

    Downloaded From: https://data.world/kvaughn/languages-county

    Banner and thumbnail photo by Farzad Mohsenvand on Unsplash

  3. English Agent-Customer Chat Dataset for Healthcare Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). English Agent-Customer Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/english-healthcare-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The English Healthcare Chat Dataset is a rich collection of over 12,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in English-speaking regions.

    Participant & Chat Overview

    Participants: 200+ native English speakers from the FutureBeeAI Crowd Community
    Conversation Length: 300–700 words per chat
    Turns per Chat: 50–150 dialogue turns across both participants
    Chat Types: Inbound and outbound
    Sentiment Coverage: Positive, neutral, and negative outcomes included

    Topic Diversity

    The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:

    Inbound Chats (Customer-Initiated): Appointment scheduling, new patient registration, surgery and treatment consultations, diet and lifestyle discussions, insurance claim inquiries, lab result follow-ups
    Outbound Chats (Agent-Initiated): Appointment reminders and confirmations, health and wellness program offers, test result notifications, preventive care and vaccination reminders, subscription renewals, risk assessment and eligibility follow-ups

    This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.

    Language Diversity & Realism

    This dataset reflects the natural flow of English healthcare communication and includes:

    Authentic Naming Patterns: English personal names, clinic names, and brands
    Localized Contact Elements: Addresses, emails, phone numbers, and clinic locations in regional English formats
    Time & Currency References: Use of dates, times, numeric expressions, and currency units aligned with English-speaking regions
    Colloquial & Medical Expressions: Local slang, informal speech, and common healthcare-related terminology

    These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.

    Conversational Flow & Structure

    Conversations range from simple inquiries to complex advisory sessions, including:

    General inquiries
    Detailed problem-solving
    Routine status updates
    Treatment recommendations
    Support and feedback interactions

    Each conversation typically includes these structural components:

    Greetings and verification
    Information gathering
    Problem definition
    Solution delivery
    Closing messages
    Follow-up and feedback (where applicable)

    This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.

    Data Format & Structure

    Available in JSON, CSV, and TXT formats, each conversation includes:

    Full message history with clear speaker labels
    Participant identifiers
    Metadata (e.g., topic tags, region, sentiment)
    Compatibility with common NLP and ML pipelines

    Applications

  4. World Countries and Continents Details

    • kaggle.com
    zip
    Updated Oct 5, 2017
    Cite
    folaraz (2017). World Countries and Continents Details [Dataset]. https://www.kaggle.com/folaraz/world-countries-and-continents-details
    Explore at:
    Available download formats: zip (24400 bytes)
    Dataset updated
    Oct 5, 2017
    Authors
    folaraz
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    World
    Description

    Context

    Can you tell geographical stories about the world using data science?

    Content

    World countries with their corresponding continents, official English names, official French names, Dial, ITU, Languages, and so on.

    Acknowledgements

    This data was obtained from https://old.datahub.io/

    Inspiration

    Exploration of the world countries:

    - Can we graphically visualize countries that speak a particular language?
    - We can also integrate this dataset into others to enhance our exploration.
    - The dataset has now been updated to include longitude and latitudes of countries in the world.

  5. The ORBIT (Object Recognition for Blind Image Training)-India Dataset

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    Updated Apr 24, 2025
    Cite
    Gesu India; Martin Grayson; Daniela Massiceti; Cecily Morrison; Simon Robinson; Jennifer Pearson; Matt Jones (2025). The ORBIT (Object Recognition for Blind Image Training)-India Dataset [Dataset]. http://doi.org/10.5281/zenodo.12608444
    Explore at:
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gesu India; Martin Grayson; Daniela Massiceti; Cecily Morrison; Simon Robinson; Jennifer Pearson; Matt Jones
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    The ORBIT (Object Recognition for Blind Image Training)-India Dataset is a collection of 105,243 images of 76 commonly used objects, collected by 12 individuals in India who are blind or have low vision. This dataset is an "Indian subset" of the original ORBIT dataset [1, 2], which was collected in the UK and Canada. In contrast to the ORBIT dataset, which was created in a Global North, Western, and English-speaking context, the ORBIT-India dataset features images taken in a low-resource, non-English-speaking, Global South context, home to 90% of the world’s population of people with blindness. Since it is easier for blind or low-vision individuals to gather high-quality data by recording videos, this dataset, like the ORBIT dataset, contains images (each sized 224x224) derived from 587 videos. These videos were taken by our data collectors from various parts of India using the Find My Things [3] Android app. Each data collector was asked to record eight videos of at least 10 objects of their choice.

    Collected between July and November 2023, this dataset represents a set of objects commonly used by people who are blind or have low vision in India, including earphones, talking watches, toothbrushes, and typical Indian household items like a belan (rolling pin), and a steel glass. These videos were taken in various settings of the data collectors' homes and workspaces using the Find My Things Android app.

    The image dataset is stored in the ‘Dataset’ folder, organized into one folder per data collector (P1, P2, ...P12). Each collector's folder includes sub-folders named with the object labels as provided by our data collectors. Within each object folder, there are two subfolders: ‘clean’ for images taken on clean surfaces and ‘clutter’ for images taken in cluttered environments where the objects are typically found. The annotations are saved inside an ‘Annotations’ folder containing one JSON file per video (e.g., P1--coffee mug--clean--231220_084852_coffee mug_224.json) whose keys correspond to all frames/images in that video (e.g., "P1--coffee mug--clean--231220_084852_coffee mug_224--000001.jpeg": {"object_not_present_issue": false, "pii_present_issue": false}, "P1--coffee mug--clean--231220_084852_coffee mug_224--000002.jpeg": {"object_not_present_issue": false, "pii_present_issue": false}, ...). The ‘object_not_present_issue’ key is true if the object is not present in the image, and the ‘pii_present_issue’ key is true if personally identifiable information (PII) is present in the image. Note that all PII present in the images has been blurred to protect the identity and privacy of our data collectors. This dataset version was created by cropping images originally sized at 1080 × 1920; therefore, an unscaled version of the dataset will follow soon.
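    The annotation layout described above can be explored with a short Python sketch. It is illustrative only: the sample dictionary is copied from the example keys in the description, the helper names `parse_frame_key` and `usable_frames` are hypothetical, and it assumes that object labels never contain the `--` separator.

```python
import json

# Sample annotation content copied from the description above. A real file
# from the 'Annotations' folder would be read with json.load() instead.
sample = json.loads("""
{
  "P1--coffee mug--clean--231220_084852_coffee mug_224--000001.jpeg":
    {"object_not_present_issue": false, "pii_present_issue": false},
  "P1--coffee mug--clean--231220_084852_coffee mug_224--000002.jpeg":
    {"object_not_present_issue": false, "pii_present_issue": false}
}
""")


def parse_frame_key(key):
    """Split a frame key on '--' (assumed naming scheme, hypothetical helper)."""
    collector, label, setting, video, frame = key.split("--")
    return {
        "collector": collector,             # e.g. "P1"
        "object": label,                    # label given by the data collector
        "setting": setting,                 # "clean" or "clutter"
        "video": video,                     # video identifier
        "frame": frame.rsplit(".", 1)[0],   # frame number without extension
    }


def usable_frames(annotations):
    """Return the keys of frames in which the object is present."""
    return [
        key
        for key, flags in annotations.items()
        if not flags["object_not_present_issue"]
    ]


for key in usable_frames(sample):
    print(parse_frame_key(key))
```

    Filtering on `object_not_present_issue` is one plausible preprocessing step for few-shot training; whether to also drop frames flagged with `pii_present_issue` is left to the user, since the description states that PII has already been blurred.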

    This project was funded by the Engineering and Physical Sciences Research Council (EPSRC) Industrial ICASE Award with Microsoft Research UK Ltd. as the Industrial Project Partner. We would like to acknowledge and express our gratitude to our data collectors for their efforts and time invested in carefully collecting videos to build this dataset for their community. The dataset is designed for developing few-shot learning algorithms, aiming to support researchers and developers in advancing object-recognition systems. We are excited to share this dataset and would love to hear from you if and how you use this dataset. Please feel free to reach out if you have any questions, comments or suggestions.

    REFERENCES:

    1. Daniela Massiceti, Lida Theodorou, Luisa Zintgraf, Matthew Tobias Harris, Simone Stumpf, Cecily Morrison, Edward Cutrell, and Katja Hofmann. 2021. ORBIT: A real-world few-shot dataset for teachable object recognition collected from people who are blind or low vision. DOI: https://doi.org/10.25383/city.14294597

    2. microsoft/ORBIT-Dataset. https://github.com/microsoft/ORBIT-Dataset

    3. Linda Yilin Wen, Cecily Morrison, Martin Grayson, Rita Faia Marques, Daniela Massiceti, Camilla Longden, and Edward Cutrell. 2024. Find My Things: Personalized Accessibility through Teachable AI for People who are Blind or Low Vision. In Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems (CHI EA '24). Association for Computing Machinery, New York, NY, USA, Article 403, 1–6. https://doi.org/10.1145/3613905.3648641

  6. Democracy and English Indicators

    • scidb.cn
    Updated Apr 12, 2024
    Cite
    Abdullah AlKhuraibet (2024). Democracy and English Indicators [Dataset]. http://doi.org/10.57760/sciencedb.16236
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 12, 2024
    Dataset provided by
    Science Data Bank
    Authors
    Abdullah AlKhuraibet
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The data were collected to test whether English proficiency levels in a country are positively associated with higher democratic values in that country. English proficiency is sourced from Education First’s "EF English Proficiency Index" (EPI), which covers countries' scores for the calendar years 2021 and 2022. The index ranks 111 countries in five categories based on English proficiency scores calculated from the test results of 2.1 million adults. Democratic values are operationalized through the liberal democracy index (LDI) from the V-Dem Institute annual reports for 2021 and 2022. The data are also used to test whether English-language media consumption acts as a mediating variable between English proficiency and democracy levels in a country, while also looking at other possible regression variables. The linear regression analyses were conducted in Microsoft Excel.

    The raw data set covers 90 nation states over two years, 2021 and 2022, and is used for two separate data sets. The first, democracy indicators, has the regression variables EPI, HDI, and GDP and contains a total of 360 data entries. HDI is a statistical summary measure developed by the United Nations Development Programme (UNDP) that measures levels of human development in 190 countries. Nominal gross domestic product (GDP) figures are sourced from the World Bank. Including regression variables such as GDP and HDI, which have been shown to have a positive link with democracy, allows the regression analysis to identify whether there is a true relationship between English proficiency and democracy levels in a country.

    The second data set, English proficiency indicators, contains a total of 720 data entries and has seven regression variables: LDI scores, Years of Mandatory English Education, Heads of States Publicly speaking English, GDP PPP (2021USD), Common Wealth, BBC web traffic, and CNN web traffic.

    The data for years of mandatory English education are sourced from research at the University of Winnipeg and are coded as the number of years a country has English as a mandatory subject, ranging from 0 to 13 years. These data concern public schools only and do not extend to the private school systems in each country.

    The data for heads of state publicly speaking English were collected through a video data analysis of all heads of state. To ensure accuracy, only heads of state who had been in their position for at least a year were used; where the head of state had been in power for less than a year, data were taken from the previous head of state. Only speeches and interviews conducted during their incumbency are taken into account.

    Each country’s GDP PPP figures are sourced from the World Bank; they were last updated for a majority of the countries in 2021 and are tied to the US dollar. The Commonwealth variable includes only members of the Commonwealth that were historically colonized by the United Kingdom: any country in that category is coded as 1, and any country that is not is coded as 0.

    BBC and CNN web traffic is sourced using tools in Semrush, which provide a rough estimate of how much web traffic each news site generates in each country. These estimates are used to identify the average web traffic for BBC News and CNN World News for the 2021 and 2022 calendar years. The traffic for each country is also measured per capita, per 10 thousand people, so that differences in population do not influence the results. The population of each country for 2021 and 2022 is sourced from the United Nations World Population Prospects revisions of 2021 and 2022 respectively.
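    The per-capita normalization described above amounts to simple arithmetic. The following Python sketch shows one way to compute it; the function name `traffic_per_10k` and the example figures are hypothetical and are not taken from the dataset.

```python
def traffic_per_10k(visits, population):
    """Return web traffic per 10,000 inhabitants.

    visits:     estimated visits to a news site from one country
    population: that country's population in the same year
    """
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply first so that whole-number inputs stay exact for as long as possible.
    return visits * 10_000 / population


# Hypothetical figures, for illustration only.
example = traffic_per_10k(visits=500_000, population=5_000_000)
print(example)
```

    With these made-up inputs, 500,000 visits in a country of 5 million people corresponds to 1,000 visits per 10 thousand inhabitants.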

  7. Total population worldwide 1950-2100

    • thinkdemo.it
    • feherkonyveloiroda.hu
    • +2more
    Updated Nov 19, 2025
    Cite
    Statista (2025). Total population worldwide 1950-2100 [Dataset]. https://thinkdemo.it/?p=2400399
    Explore at:
    Dataset updated
    Nov 19, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    World
    Description

    The world population surpassed eight billion people in 2022, having doubled from its figure less than 50 years previously. Looking forward, it is projected that the world population will reach nine billion in 2038, and 10 billion in 2060, but it will peak around 10.3 billion in the 2080s before it then goes into decline.

    Regional variations

    The global population has seen rapid growth since the early 1800s, due to advances in areas such as food production, healthcare, water safety, education, and infrastructure. However, these changes did not occur at a uniform time or pace across the world. Broadly speaking, the first regions to undergo their demographic transitions were Europe, North America, and Oceania, followed by Latin America and Asia (although Asia's development saw the greatest variation due to its size), while Africa was the last continent to undergo this transformation. Because of these differences, many so-called "advanced" countries are now experiencing population decline, particularly in Europe and East Asia, while the fastest population growth rates are found in Sub-Saharan Africa. In fact, the roughly two billion difference in population between now and the 2080s' peak will be found in Sub-Saharan Africa, which will rise from 1.2 billion to 3.2 billion in this time (although populations in other continents will also fluctuate).

    Changing projections

    The United Nations releases its World Population Prospects report every 1-2 years, and this is widely considered the foremost demographic dataset in the world. However, recent years have seen a notable decline in projections of when the global population will peak, and at what number. Previous reports in the 2010s had suggested a peak of over 11 billion people, and that population growth would continue into the 2100s; however, an earlier and lower peak is now projected. Reasons for this include a more rapid population decline in East Asia and Europe, particularly China, as well as a prolonged development arc in Sub-Saharan Africa.

  8. Liberia Language Areas

    • ebola-nga.opendata.arcgis.com
    Updated Dec 5, 2014
    Cite
    National Geospatial-Intelligence Agency (2014). Liberia Language Areas [Dataset]. https://ebola-nga.opendata.arcgis.com/content/24b6f49f6b8841c09588b397674a9dd9
    Explore at:
    Dataset updated
    Dec 5, 2014
    Dataset authored and provided by
    National Geospatial-Intelligence Agency (http://www.nga.mil/)
    Area covered
    Description

    (UNCLASSIFIED) English is the official language in Liberia and is used in government, business, and education to some extent. The majority of Liberians do not know English, and those who do speak what is commonly referred to as Liberian English. This often involves leaving off the ends of words and/or adding the letter “o” to the end. English words may also have different meanings in the country. Prior to the civil wars, the government created the National Language Program, which was designed to introduce local languages to primary students before instruction in English. Due to two civil wars and poor infrastructure, language policy has not received much attention. While English is the official language, it does not properly represent the diverse population: few people are fluent, and most rely on their local languages.

    Attribute Table Field Descriptions

    ISO3 - International Organization for Standardization 3-digit country code
    ADM0_NAME - Administration level zero identification / name
    LANG_FAM - Language family
    LANG_SUBGR - Language subgroup
    ALT_NAMES - Alternate names
    COMMENTS - Comments or notes regarding language
    SOURCE_DT - Source one creation date
    SOURCE - Source one
    SOURCE2_DT - Source two creation date
    SOURCE2 - Source two

    Collection

    This HGIS was created through linguistic information provided through The World Language Mapping System (WLMS). This data was then processed through DigitalGlobe’s AnthroMapper program to generate more accurate linguistic coverage boundaries. The metadata was supplemented with anthropological and linguistic information from peer-reviewed journals and published books. It should be noted that this shape file only depicts the majority first level languages spoken in a given area; there might be significant populations of other minority language speakers not shown in this dataset. The data included herein have not been derived from a registered survey and should be considered approximate unless otherwise defined. While rigorous steps have been taken to ensure the quality of each dataset, DigitalGlobe is not responsible for the accuracy and completeness of data compiled from outside sources.

    Sources (HGIS)

    Anthromapper. DigitalGlobe, September 2014.
    World Language Mapping System (WLMS) Version 16. World GeoDatasets, October 2013.

    Sources (Metadata)

    Albaugh, Ericka. "Language Policies in African Education." Working paper, Department of Government Legal Studies at Bowdoin College, 2005. http://www.bowdoin.edu/.
    Central Intelligence Agency. The World FactBook, “Liberia”. Last updated June 2014. Accessed September 2014. https://www.cia.gov/index.html.
    The Reeds in Liberia, “Liberian English.” October 2007. Accessed September 2014. http://reedsinliberia.blogspot.com/.

  9. FLORES-101

    • kaggle.com
    zip
    Updated Jun 7, 2021
    Cite
    Mathurin Aché (2021). FLORES-101 [Dataset]. https://www.kaggle.com/mathurinache/flores101
    Explore at:
    Available download formats: zip (13628027 bytes)
    Dataset updated
    Jun 7, 2021
    Authors
    Mathurin Aché
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Machine translation helps bridge the language barriers between people and information — but historically, research has focused on creating and evaluating translation systems for only a handful of languages, usually the few most spoken languages in the world. This excludes the billions of people worldwide who don’t happen to be fluent in languages such as English, Spanish, Russian, and Mandarin.

    We’ve recently made progress with machine translation systems like M2M-100, our open source model that can translate a hundred different languages. Further advances necessitate tools with which to test and compare these translation systems with one another, though.

    Today, we are open-sourcing FLORES-101, a first-of-its-kind, many-to-many evaluation data set covering 101 languages from all over the world. FLORES-101 is the missing piece, the tool that enables researchers to rapidly test and improve upon multilingual translation models like M2M-100.

    We’re making FLORES-101 publicly available because we believe in breaking down language barriers, and that means helping empower researchers to create more diverse (and locally relevant) translation tools, ones that may make it as easy to translate from, say, Bengali to Marathi as it is to translate from English to Spanish today. We’re making the full FLORES-101 data set, an accompanying tech report, and several models publicly available for the entire research community to use, to accelerate progress on many-to-many translation systems worldwide.

    Why evaluation matters

    Imagine trying to bake a cake but not being able to taste it. It’s near-impossible to know whether it’s any good, and even harder to know how to improve the recipe for future attempts.

    Evaluating how well translation systems perform has been a major challenge for AI researchers — and that knowledge gap has impeded progress. If researchers cannot measure or compare their results, they can’t develop better translation systems. The AI research community needed an open and easily accessible way to perform high-quality, reliable measurement of many-to-many translation model performance and then compare results with others.

    Previous work on this problem relied heavily on translating in and out of English, often using proprietary data sets. But while this benefited English speakers, it was and is insufficient for many parts of the world where people need fast and accurate translation between regional languages — for instance, in India, where the constitution recognizes over 20 official languages.

    FLORES-101 focuses on what are known as low-resource languages, such as Amharic, Mongolian, and Urdu, which do not currently have extensive data sets for natural language processing research. For the first time, researchers will be able to reliably measure the quality of translations through 10,100 different translation directions — for example, directly from Hindi to Thai or Swahili. For context, evaluating in and out of English would provide merely 200 translation directions.
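    The direction counts quoted above follow from simple combinatorics: every ordered pair of distinct languages is one translation direction. The Python sketch below reproduces the arithmetic; the function names are hypothetical and are not part of any FLORES tooling.

```python
def many_to_many_directions(n_languages):
    """Ordered pairs of distinct languages: n * (n - 1)."""
    return n_languages * (n_languages - 1)


def english_centric_directions(n_languages):
    """Directions into and out of one pivot language: 2 * (n - 1)."""
    return 2 * (n_languages - 1)


n = 101  # number of languages covered by FLORES-101
print(many_to_many_directions(n))     # 101 * 100 = 10100
print(english_centric_directions(n))  # 2 * 100 = 200
```

    The gap between the two numbers is why a benchmark that shares the same sentences across all languages allows far more evaluation pairs than one that pivots through English.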

    The flexibility exhibited by FLORES is possible because we designed around many-to-many translation from the start. The data set contains the same set of sentences across all languages, enabling researchers to evaluate the performance of any and all translation directions.

    “Efforts like FLORES are of immense value, because they not only draw attention to under-served languages, but they immediately invite and actively facilitate research on all these languages,” said Antonios Anastasopoulos, assistant professor at George Mason University’s Department of Computer Science.

    Building a benchmark

    Good benchmarks are difficult to construct. They need to be able to accurately reflect meaningful differences between models so they can be used by researchers to make decisions. Translation benchmarks can be particularly difficult because the same quality standard must be met across all languages, not just a select few for which translators are more readily available.

    To that end, we created the FLORES-101 data set in a multistep workflow. Each document was first translated by a professional translator, and then verified by a human editor. Next, it proceeded to the quality-control phase, including checks for spelling, grammar, punctuation, and formatting, and comparison with translations from commercial engines. After that, a different set of translators performed human evaluation, identifying errors across numerous categories including unnatural translation, register, and grammar. Based on the number and severity of the identified errors, the translations were either sent back for retranslation or, if they met quality standards, considered complete.

    Translation quality is not enough on its own, though. Th...

  10. E-Commerce Analysis : Global Skincare E-Store

    • kaggle.com
    zip
    Updated Dec 16, 2024
    Cite
    Shandeep Raula (2024). E-Commerce Analysis : Global Skincare E-Store [Dataset]. https://www.kaggle.com/datasets/shandeep777/e-commerce-analysis-global-skincare-e-store
    Explore at:
    Available download formats: zip (6210832 bytes)
    Dataset updated
    Dec 16, 2024
    Authors
    Shandeep Raula
    Description

    Dataset Summary

    This dataset provides comprehensive insights into the global skincare and beauty e-commerce market. It contains detailed transaction data, customer behavior patterns, and sales metrics, offering valuable information for analyzing the performance of online beauty stores. The dataset is tailored for English-speaking users.

    Key Features

    Transaction Data: Includes details such as order IDs, product categories, sales revenue, and transaction dates.
    Customer Insights: Information about customer demographics, preferences, and purchase history.
    Product Details: Comprehensive data on product categories, subcategories, pricing, and stock levels.
    Geographic Analysis: Regional data to understand the market's reach across different countries and demographics.

    Potential Use Cases

    - Market Analysis: Identify trends in the skincare and beauty industry.
    - Customer Behavior Modeling: Analyze purchasing habits to improve marketing strategies.
    - E-Commerce Performance Evaluation: Evaluate sales trends and revenue streams.
    - Price Optimization: Use data-driven insights to optimize product pricing.

    File Structure

    The dataset is provided in an Excel file with multiple sheets (if applicable). Each sheet contains organized data for easier navigation and analysis. Specific sheets might cover:

    Orders: Transaction details including order ID, product name, and sales data.
    Customers: Demographics and behavior.
    Products: Detailed product inventory and categories.
    Revenue Analysis: Key metrics include total revenue, average order value, and profit margins.

  11. COVID-19 and Mental Health Search Terms

    • kaggle.com
    zip
    Updated Jun 15, 2020
    Cite
    Yunge Hao (2020). COVID-19 and Mental Health Search Terms [Dataset]. https://www.kaggle.com/luckybro/mental-health-search-term
    Available download formats: zip (104868 bytes)
    Dataset updated
    Jun 15, 2020
    Authors
    Yunge Hao
    Description

    This dataset was created for a task of the UNCOVER COVID-19 Challenge: Mental health impact and support services.

    Google search interest in mental-health-related terms before and after the outbreak of the COVID-19 pandemic reveals how public concern was affected by the pandemic, and the pandemic's impact on the mental health of people around the world. I picked worldwide, Canada, US, Italy, Iran, Japan, South Korea and UK as the populations. The dataset also includes data for Canada for the past 4 years, from 2016 to 2019.

    The mental health related search terms are "mental health", "depression", "anxiety", "ocd", "obsessive compulsive disorder", "insomnia", "panic attack", "counseling", "psychiatrist".

    Search interest is indicated by a number between 0 and 100, where 100 marks the most popular point in time (by week), 1 the least popular, and 0 means there was not enough data.
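
As a rough illustration of the 0 to 100 scale described above, the sketch below rescales a series of weekly values relative to its peak week. It is not Google Trends' actual computation: the function name `to_relative_interest`, the `min_count` threshold, and the raw counts are all hypothetical and exist only to show how such an index behaves.

```python
def to_relative_interest(weekly_counts, min_count=1):
    """Rescale raw weekly counts to a 0-100 index relative to the peak week.

    Hypothetical helper: weeks below `min_count` are reported as 0
    ("not enough data"); every other week gets a value from 1 to 100.
    """
    peak = max(weekly_counts, default=0)
    if peak < min_count:
        # No week has enough data, so the whole series is 0.
        return [0 for _ in weekly_counts]

    scaled = []
    for count in weekly_counts:
        if count < min_count:
            scaled.append(0)  # not enough data for this week
        else:
            # The peak week maps to 100; clamp small values up to 1.
            scaled.append(max(1, round(100 * count / peak)))
    return scaled


if __name__ == "__main__":
    # Made-up weekly counts, for illustration only.
    print(to_relative_interest([50, 200, 100, 0]))
```

Because each series is scaled to its own peak, values from different terms or regions are not directly comparable unless they were retrieved in the same query.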

    All data was collected from Google Trends. I assumed that users in non-English-speaking countries searched for the terms in their own language and spelled them correctly.

  12. British English Scripted Monologue Speech Data for Telecom

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). British English Scripted Monologue Speech Data for Telecom [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/telecom-scripted-speech-monologues-english-uk
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    United Kingdom
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Presenting the UK English Scripted Monologue Speech Dataset for the Telecom Domain, a purpose-built dataset created to accelerate the development of English speech recognition and voice AI models specifically tailored for the telecommunications industry.

    Speech Data

    This dataset includes over 6,000 high-quality scripted prompt recordings in UK English, representing real-world telecom customer service scenarios. It’s designed to support the training of speech-based AI systems used in call centers, virtual agents, and voice-powered support tools.

    Participant Diversity

    Speakers: 60 native UK English speakers
    Geographic Distribution: Carefully selected from multiple regions across the United Kingdom to capture a wide spectrum of dialects and speaking styles
    Demographics: Representation of males and females (60:40 ratio), aged between 18 and 70 years

    Recording Specifications

    Type: Scripted monologue prompts focused on telecom industry use cases
    Duration: Each audio clip ranges from 5 to 30 seconds
    Format: WAV files in mono, 16-bit depth, with sample rates of 8 kHz and 16 kHz
    Environment: Clean, echo-free, and noise-controlled settings to ensure optimal audio clarity
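
The recording specifications above can be checked programmatically once the files are downloaded. The following is a minimal sketch using Python's standard `wave` module; the function name `check_wav_spec` and the idea of returning a list of problems are assumptions for illustration, not part of the dataset's tooling.

```python
import wave

# Values taken from the listing: mono, 16-bit, 8 kHz or 16 kHz, 5 to 30 seconds.
ALLOWED_SAMPLE_RATES = {8000, 16000}
MIN_SECONDS, MAX_SECONDS = 5, 30


def check_wav_spec(path):
    """Return a list of problems; an empty list means the clip matches the spec.

    Hypothetical helper for illustration only.
    """
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
        rate = wav.getframerate()
        frames = wav.getnframes()

    duration = frames / rate if rate else 0.0
    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if sample_width != 2:
        problems.append(f"expected 16-bit samples, found {8 * sample_width}-bit")
    if rate not in ALLOWED_SAMPLE_RATES:
        problems.append(f"unexpected sample rate {rate} Hz")
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"duration {duration:.1f} s is outside 5 to 30 s")
    return problems
```

Running such a check over every file before training helps catch clips that were resampled or converted after download.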

    Topic Coverage

    The dataset reflects a wide variety of common telecom customer interactions, including:

    Customer onboarding and service inquiries
    Billing and payment questions
    Data plans and product information
    Technical support requests
    Network coverage discussions
    Regulatory compliance and policy information
    Upgrades, renewals, and service plan changes
    Domain-specific scripted interactions tailored to real-world telecom use cases

    Contextual Depth

    To maximize contextual richness, prompts include:

    Localized Names: Common United Kingdom names in various formats
    Addresses: Region-specific address structures for realism
    Dates & Times: Spoken date and time references in typical telecom scenarios (e.g., billing cycles, service activation times)
    Telecom Terminology: Keywords related to mobile data, network, SIM, devices, plans, etc.
    Numbers & Rates: Usage statistics, pricing info, recharge values, and billing figures
    Service Providers: References to telecom companies and third-party service entities

    Transcription

    Each audio file is paired with an accurate, verbatim transcription for precise model training:

    Content: Transcriptions are direct representations of each recorded prompt
    Format: Plain text (.TXT), with filenames matching their corresponding audio files
    Verification: Every transcription is manually verified by native UK English linguists to ensure consistency and accuracy

    Metadata

    Detailed metadata is included to

  13. Tanzania Tourism Classification Challenge

    • kaggle.com
    zip
    Updated Jun 1, 2022
    Cite
    Tevin Temu (2022). Tanzania Tourism Classification Challenge [Dataset]. https://www.kaggle.com/datasets/tevintemu/tanzania-tourism-classification-challenge
    Available download formats: zip (527132 bytes)
    Dataset updated
    Jun 1, 2022
    Authors
    Tevin Temu
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Tanzania
    Description

    This challenge is open to users from English-speaking African countries.

    The Tanzanian tourism sector plays a significant role in the Tanzanian economy, contributing about 17% to the country’s GDP and 25% of all foreign exchange revenues. The sector, which provides direct employment for more than 600,000 people and up to 2 million people indirectly, generated approximately $2.4 billion in 2018 according to government statistics. Tanzania received a record 1.1 million international visitor arrivals in 2014, mostly from Europe, the US and Africa.

    Tanzania is the only country in the world which has allocated more than 25% of its total area for wildlife, national parks, and protected areas. There are 16 national parks in Tanzania, 28 game reserves, 44 game-controlled areas, two marine parks and one conservation area.

    Tanzania’s tourist attractions include the Serengeti plains, which host the largest terrestrial mammal migration in the world; the Ngorongoro Crater, the world’s largest intact volcanic caldera and home to the highest density of big game in Africa; Kilimanjaro, Africa’s highest mountain; and the Mafia Island marine park; among many others. The scenery, topography, rich culture and very friendly people provide for excellent cultural tourism, beach holidays, honeymooning, game hunting, historical and archaeological ventures – and certainly the best wildlife photography safaris in the world.

    The objective of this hackathon is to develop a machine learning model that can classify the range of expenditures a tourist spends in Tanzania. The model can be used by different tour operators and the Tanzania Tourism Board to automatically help tourists across the world estimate their expenditure before visiting Tanzania.

  14. Spanish Agent-Customer Chat Dataset for Healthcare Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Spanish Agent-Customer Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/spanish-healthcare-domain-conversation-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Spanish Healthcare Chat Dataset is a rich collection of over 10,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in Spanish-speaking regions.

    Participant & Chat Overview

    Participants: 150+ native Spanish speakers from the FutureBeeAI Crowd Community
    Conversation Length: 300–700 words per chat
    Turns per Chat: 50–150 dialogue turns across both participants
    Chat Types: Inbound and outbound
    Sentiment Coverage: Positive, neutral, and negative outcomes included

    Topic Diversity

    The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:

    Inbound Chats (Customer-Initiated): Appointment scheduling, new patient registration, surgery and treatment consultations, diet and lifestyle discussions, insurance claim inquiries, lab result follow-ups
    Outbound Chats (Agent-Initiated): Appointment reminders and confirmations, health and wellness program offers, test result notifications, preventive care and vaccination reminders, subscription renewals, risk assessment and eligibility follow-ups

    This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.

    Language Diversity & Realism

    This dataset reflects the natural flow of Spanish healthcare communication and includes:

    Authentic Naming Patterns: Spanish personal names, clinic names, and brands
    Localized Contact Elements: Addresses, emails, phone numbers, and clinic locations in regional Spanish formats
    Time & Currency References: Use of dates, times, numeric expressions, and currency units aligned with Spanish-speaking regions
    Colloquial & Medical Expressions: Local slang, informal speech, and common healthcare-related terminology

    These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.

    Conversational Flow & Structure

    Conversations range from simple inquiries to complex advisory sessions, including:

    General inquiries
    Detailed problem-solving
    Routine status updates
    Treatment recommendations
    Support and feedback interactions

    Each conversation typically includes these structural components:

    Greetings and verification
    Information gathering
    Problem definition
    Solution delivery
    Closing messages
    Follow-up and feedback (where applicable)

    This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.

    Data Format & Structure

    Available in JSON, CSV, and TXT formats, each conversation includes:

    Full message history with clear speaker labels
    Participant identifiers
    Metadata (e.g., topic tags, region, sentiment)
    Compatibility with common NLP and ML pipelines

    Applications



Cite
Waqar Ali (2024). 🌍📚 World Languages Dataset 🌍📚 [Dataset]. https://www.kaggle.com/datasets/waqi786/world-languages-dataset

🌍📚 World Languages Dataset 🌍📚

An Insight into the World's Most Spoken Languages 🌍📚

2 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: zip (5706 bytes)
Dataset updated
Jul 30, 2024
Authors
Waqar Ali
License

Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically

Area covered
World
