Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset provides a comprehensive overview of 500 languages spoken around the world. It captures essential linguistic features, including language families, geographical regions, writing systems, and the estimated number of native speakers. This dataset aims to highlight the rich diversity of languages and their cultural significance, offering valuable insights for linguists, researchers, and enthusiasts interested in global language distribution.
The dataset contains real and accurate records for 500 languages across different regions and linguistic families. It covers a diverse range of languages, from widely spoken ones like English and Mandarin to less commonly known languages. The data was meticulously compiled to reflect the authentic linguistic landscape and provide a valuable resource for language studies and cultural analysis.
Facebook
TwitterThis is a dataset I found online through the Google Dataset Search portal.
The American Community Survey (ACS) 2009-2013 multi-year data are used to list all languages spoken in the United States that were reported during the sample period. These tables provide detailed counts of many more languages than the 39 languages and language groups that are published annually as a part of the routine ACS data release. This is the second tabulation beyond 39 languages since ACS began.
The tables include all languages that were reported in each geography during the 2009 to 2013 sampling period. For the purpose of tabulation, reported languages are classified in one of 380 possible languages or language groups. Because the data are a sample of the total population, there may be languages spoken that are not reported, either because the ACS did not sample the households where those languages are spoken, or because the person filling out the survey did not report the language or reported another language instead.
The tables also provide information about self-reported English-speaking ability. Respondents who reported speaking a language other than English were asked to indicate their ability to speak English in one of the following categories: "Very well," "Well," "Not well," or "Not at all." The data on ability to speak English represent the person’s own perception about his or her own ability or, because ACS questionnaires are usually completed by one household member, the responses may represent the perception of another household member.
These tables are also available through the Census Bureau's application programming interface (API). Please see the developers page for additional details on how to use the API to access these data.
Sources:
Google Dataset Search: https://toolbox.google.com/datasetsearch
2009-2013 American Community Survey
Original dataset: https://www.census.gov/data/tables/2013/demo/2009-2013-lang-tables.html
Downloaded From: https://data.world/kvaughn/languages-county
Banner and thumbnail photo by Farzad Mohsenvand on Unsplash
Facebook
Twitterhttps://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
The English Healthcare Chat Dataset is a rich collection of over 12,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in English-speaking regions.
The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:
This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.
This dataset reflects the natural flow of English healthcare communication and includes:
These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.
Conversations range from simple inquiries to complex advisory sessions, including:
Each conversation typically includes these structural components:
This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.
Available in JSON, CSV, and TXT formats, each conversation includes:
Facebook
TwitterAttribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Can you tell geographical stories about the world using data science?
World countries with their corresponding continents , official english names, official french names, Dial,ITU,Languages and so on.
This data was gotten from https://old.datahub.io/
Exploration of the world countries: - Can we graphically visualize countries that speak a particular language? - We can also integrate this dataset into others to enhance our exploration. - The dataset has now been updated to include longitude and latitudes of countries in the world.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ORBIT (Object Recognition for Blind Image Training) -India Dataset is a collection of 105,243 images of 76 commonly used objects, collected by 12 individuals in India who are blind or have low vision. This dataset is an "Indian subset" of the original ORBIT dataset [1, 2], which was collected in the UK and Canada. In contrast to the ORBIT dataset, which was created in a Global North, Western, and English-speaking context, the ORBIT-India dataset features images taken in a low-resource, non-English-speaking, Global South context, a home to 90% of the world’s population of people with blindness. Since it is easier for blind or low-vision individuals to gather high-quality data by recording videos, this dataset, like the ORBIT dataset, contains images (each sized 224x224) derived from 587 videos. These videos were taken by our data collectors from various parts of India using the Find My Things [3] Android app. Each data collector was asked to record eight videos of at least 10 objects of their choice.
Collected between July and November 2023, this dataset represents a set of objects commonly used by people who are blind or have low vision in India, including earphones, talking watches, toothbrushes, and typical Indian household items like a belan (rolling pin), and a steel glass. These videos were taken in various settings of the data collectors' homes and workspaces using the Find My Things Android app.
The image dataset is stored in the ‘Dataset’ folder, organized by folders assigned to each data collector (P1, P2, ...P12) who collected them. Each collector's folder includes sub-folders named with the object labels as provided by our data collectors. Within each object folder, there are two subfolders: ‘clean’ for images taken on clean surfaces and ‘clutter’ for images taken in cluttered environments where the objects are typically found. The annotations are saved inside a ‘Annotations’ folder containing a JSON file per video (e.g., P1--coffee mug--clean--231220_084852_coffee mug_224.json) that contains keys corresponding to all frames/images in that video (e.g., "P1--coffee mug--clean--231220_084852_coffee mug_224--000001.jpeg": {"object_not_present_issue": false, "pii_present_issue": false}, "P1--coffee mug--clean--231220_084852_coffee mug_224--000002.jpeg": {"object_not_present_issue": false, "pii_present_issue": false}, ...). The ‘object_not_present_issue’ key is True if the object is not present in the image, and the ‘pii_present_issue’ key is True, if there is a personally identifiable information (PII) present in the image. Note, all PII present in the images has been blurred to protect the identity and privacy of our data collectors. This dataset version was created by cropping images originally sized at 1080 × 1920; therefore, an unscaled version of the dataset will follow soon.
This project was funded by the Engineering and Physical Sciences Research Council (EPSRC) Industrial ICASE Award with Microsoft Research UK Ltd. as the Industrial Project Partner. We would like to acknowledge and express our gratitude to our data collectors for their efforts and time invested in carefully collecting videos to build this dataset for their community. The dataset is designed for developing few-shot learning algorithms, aiming to support researchers and developers in advancing object-recognition systems. We are excited to share this dataset and would love to hear from you if and how you use this dataset. Please feel free to reach out if you have any questions, comments or suggestions.
REFERENCES:
Daniela Massiceti, Lida Theodorou, Luisa Zintgraf, Matthew Tobias Harris, Simone Stumpf, Cecily Morrison, Edward Cutrell, and Katja Hofmann. 2021. ORBIT: A real-world few-shot dataset for teachable object recognition collected from people who are blind or low vision. DOI: https://doi.org/10.25383/city.14294597
microsoft/ORBIT-Dataset. https://github.com/microsoft/ORBIT-Dataset
Linda Yilin Wen, Cecily Morrison, Martin Grayson, Rita Faia Marques, Daniela Massiceti, Camilla Longden, and Edward Cutrell. 2024. Find My Things: Personalized Accessibility through Teachable AI for People who are Blind or Low Vision. In Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems (CHI EA '24). Association for Computing Machinery, New York, NY, USA, Article 403, 1–6. https://doi.org/10.1145/3613905.3648641
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The data collected aim to test whether English proficiency levels in a country are positively associated with higher democratic values in that country. English proficiency is sourced from statistics by Education First’s "EF English Proficiency Index" which covers countries' scores for the calendar year 2022 and 2021. The EF English Proficiency Index ranks 111 countries in five different categories based on their English proficiency scores that were calculated from the test results of 2.1 million adults. While democratic values are operationalized through the liberal democracy index from the V-Dem Institute annual report for 2022 and 2021. Additionally, the data is utilized to test whether English language media consumption acts as a mediating variable between English proficiency and democracy levels in a country, while also looking at other possible regression variables. In order to conduct the linear regression analyses for the dats, the software that was utilized for this research was Microsoft Excel.The raw data set consists of 90 nation states in two years from 2022 and 2021. The raw data is utilized for two separate data sets the first of which is democracy indicators which has the regression variables of EPI, HDI, and GDP. For this table set there is a total of 360 data entries. HDI scores are a statistical summary measure that is developed by the United Nations Development Programme (UNDP) which measures the levels of human development in 190 countries. The data for nominal gross domestic product scores (GDP) are sourced from the World Bank. Having strong regression variables that have been proven to have a positive link with democracy in the data analysis such as GDP and HDI, would allow the regression analysis to identify whether there is a true relationship between English proficiency and democracy levels in a country. While the second data set has a total of 720 data entries and aims to identify English proficiency indicators the data set has 7 various regression variables which include, LDI scores, Years of Mandatory English Education, Heads of States Publicly speaking English, GDP PPP (2021USD), Common Wealth, BBC web traffic and CNN web traffic. The data for years of mandatory English education is sourced from research at the University of Winnipeg and is coded in the data set based on the number of years a country has English as a mandatory subject. The range of this data is from 0 to 13 years of English being mandatory. It is important to note that this data only concerns public schools and does not extend to the private school systems in each country. The data for heads of state publicly speaking English was done through a video data analysis of all heads of state. The data was only used for heads of state who had been in their position for at least a year to ensure the accuracy of the data collected; with a year in power, for heads of state that had not been in their position for a year, data was taken from the previous head of state. This data only takes into account speeches and interviews that were conducted during their incumbency. The data for each country’s GDP PPP scores are sourced from the World Bank, which was last updated for a majority of the countries in 2021 and is tied to the US dollar. Data for the commonwealth will only include members of the commonwealth that have been historically colonized by the United Kingdom. Any country that falls under that category will be coded as 1 and any country that does not will be coded as 0. For BBC and CNN web traffic that data is sourced by using tools in Semrush which provide a rough estimate of how much web traffic each news site generates in each country. Which will be utilized to identify the average number of web traffic for BBC News and CNN World News for both the 2021 and 2022 calendar. The traffic for each country will also be measured per capita, per 10 thousand people to ensure that the population density of a country does not influence the results. The population of each country for both 2021 and 2022 is sourced from the United Nations revision of World Population Prospects of both 2021 and 2022 respectively.
Facebook
TwitterThe world population surpassed eight billion people in 2022, having doubled from its figure less than 50 years previously. Looking forward, it is projected that the world population will reach nine billion in 2038, and 10 billion in 2060, but it will peak around 10.3 billion in the 2080s before it then goes into decline. Regional variations The global population has seen rapid growth since the early 1800s, due to advances in areas such as food production, healthcare, water safety, education, and infrastructure, however, these changes did not occur at a uniform time or pace across the world. Broadly speaking, the first regions to undergo their demographic transitions were Europe, North America, and Oceania, followed by Latin America and Asia (although Asia's development saw the greatest variation due to its size), while Africa was the last continent to undergo this transformation. Because of these differences, many so-called "advanced" countries are now experiencing population decline, particularly in Europe and East Asia, while the fastest population growth rates are found in Sub-Saharan Africa. In fact, the roughly two billion difference in population between now and the 2080s' peak will be found in Sub-Saharan Africa, which will rise from 1.2 billion to 3.2 billion in this time (although populations in other continents will also fluctuate). Changing projections The United Nations releases their World Population Prospects report every 1-2 years, and this is widely considered the foremost demographic dataset in the world. However, recent years have seen a notable decline in projections when the global population will peak, and at what number. Previous reports in the 2010s had suggested a peak of over 11 billion people, and that population growth would continue into the 2100s, however a sooner and shorter peak is now projected. Reasons for this include a more rapid population decline in East Asia and Europe, particularly China, as well as a prolonged development arc in Sub-Saharan Africa.
Facebook
Twitter(UNCLASSIFIED) English is the official language in Liberia and is used in government, business, and education to some extent. The majority of Liberians do not know English and those who do speak what is commonly referred to as Liberian English. This involves often leaving off the end of words and/or adding the letter “o” to the end. Also English words will have different meaning in the country. Prior to the civil wars the government created the National Language Program. This program was designed to introduce local languages to primary students prior to the instruction of English. Due to two civil wars and poor infrastructure langue policy has not received much attention. While English is the official language it does not properly represent the diverse population leaving few to be fluent and causing them to rely on their local languages. Attribute Table Field DescriptionsISO3 - International Organization for Standardization 3-digit country code ADM0_NAME - Administration level zero identification / name LANG_FAM - Language family LANG_SUBGR - Language subgroup ALT_NAMES - Alternate names COMMENTS - Comments or notes regarding language SOURCE_DT - Source one creation date SOURCE - Source one SOURCE2_DT - Source two creation date SOURCE2 - Source two CollectionThis HGIS was created through linguistic information provided through The World Language Mapping System (WMLS). This data was then processed through DigitalGlobe’s AnthroMapper program to generate more accurate linguistic coverage boundaries. The metadata was supplemented with anthropological and linguistic information from peer-reviewed journals and published books. It should be noted that this shape file only depicts the majority first level languages spoken in a given area; there might be significant populations of other minority language speakers not shown in this dataset. The data included herein have not been derived from a registered survey and should be considered approximate unless otherwise defined. While rigorous steps have been taken to ensure the quality of each dataset, DigitalGlobe is not responsible for the accuracy and completeness of data compiled from outside sources.Sources (HGIS)Anthromapper. DigitalGlobe, September 2014.World Language Mapping System (WLMS) Version 16. World GeoDatasets, October 2013.Sources (Metadata)Albaugh, Ericka. "Language Policies in African Education." working paper., Department of Government Legal Studies at Bowdoin College, 2005. http://www.bowdoin.edu/.Central Intelligence Agency. The World FactBook, “Liberia”. Last updated June 2014. Accessed September 2014. https://www.cia.gov/index.html.The Reeds in Liberia, “Liberian English.” October 2007. Accessed September 2014. http://reedsinliberia.blogspot.com/.
Facebook
TwitterAttribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Machine translation helps bridge the language barriers between people and information — but historically, research has focused on creating and evaluating translation systems for only a handful of languages, usually the few most spoken languages in the world. This excludes the billions of people worldwide who don’t happen to be fluent in languages such as English, Spanish, Russian, and Mandarin.
We’ve recently made progress with machine translation systems like M2M-100 , our open source model that can translate a hundred different languages. Further advances necessitate tools with which to test and compare these translation systems with one another, though.
Today, we are open-sourcing FLORES-101 , a first-of-its-kind, many-to-many evaluation data set covering 101 languages from all over the world. FLORES-101 is the missing piece, the tool that enables researchers to rapidly test and improve upon multilingual translation models like M2M-100.
We’re making FLORES-101 publicly available because we believe in breaking down language barriers, and that means helping empower researchers to create more diverse (and locally relevant) translation tools — ones that may make it as easy to translate from, say, Bengali to Marathi as it is to translate from English to Spanish today. We’re making the full FLORES-101 data set , an accompanying tech report, and several models publicly available for the entire research community to use, to accelerate progress on many-to-many translation systems worldwide.
Why evaluation matters Imagine trying to bake a cake — but not being able to taste it. It’s near-impossible to know whether it’s any good, and even harder to know how to improve the recipe for future attempts.
Evaluating how well translation systems perform has been a major challenge for AI researchers — and that knowledge gap has impeded progress. If researchers cannot measure or compare their results, they can’t develop better translation systems. The AI research community needed an open and easily accessible way to perform high-quality, reliable measurement of many-to-many translation model performance and then compare results with others.
Previous work on this problem relied heavily on translating in and out of English, often using proprietary data sets. But while this benefited English speakers, it was and is insufficient for many parts of the world where people need fast and accurate translation between regional languages — for instance, in India, where the constitution recognizes over 20 official languages.
FLORES-101 focuses on what are known as low-resource languages, such as Amharic, Mongolian, and Urdu, which do not currently have extensive data sets for natural language processing research. For the first time, researchers will be able to reliably measure the quality of translations through 10,100 different translation directions — for example, directly from Hindi to Thai or Swahili. For context, evaluating in and out of English would provide merely 200 translation directions.
The flexibility exhibited by FLORES is possible because we designed around many-to-many translation from the start. The data set contains the same set of sentences across all languages, enabling researchers to evaluate the performance of any and all translation directions.
“Efforts like FLORES are of immense value, because they not only draw attention to under-served languages, but they immediately invite and actively facilitate research on all these languages,” said Antonios Anastasopoulos, assistant professor at George Mason University’s Department of Computer Science.
Building a benchmark Good benchmarks are difficult to construct. They need to be able to accurately reflect meaningful differences between models so they can be used by researchers to make decisions. Translation benchmarks can be particularly difficult because the same quality standard must be met across all languages, not just a select few for which translators are more readily available.
Lire
-0:16
Paramètres visuels supplémentairesHD
Diffuser sur Chrome CastAfficher en plein écran Remettre le son
To that end, we created the FLORES-101 data set in a multistep workflow. Each document was first translated by a professional translator, and then verified by a human editor. Next, it proceeded to the quality-control phase, including checks for spelling, grammar, punctuation, and formatting, and comparison with translations from commercial engines. After that, a different set of translators performed human evaluation, identifying errors across numerous categories including unnatural translation, register, and grammar. Based on the number and severity of the identified errors, the translations were either sent back for retranslation or — if they met quality standards — the translations were considered complete.
Translation quality is not enough on its own, though. Th...
Facebook
TwitterDataset Summary This dataset provides comprehensive insights into the global skincare and beauty e-commerce market. It contains detailed transaction data, customer behavior patterns, and sales metrics, offering valuable information for analyzing the performance of online beauty stores. The dataset is tailored for English-speaking users.
Key Features Transaction Data: Includes details such as order IDs, product categories, sales revenue, and transaction dates. Customer Insights: Information about customer demographics, preferences, and purchase history. Product Details: Comprehensive data on product categories, subcategories, pricing, and stock levels. Geographic Analysis: Regional data to understand the market's reach across different countries and demographics.
Potential Use Cases - Market Analysis: Identify trends in the skincare and beauty industry. - Customer Behavior Modeling: Analyze purchasing habits to improve marketing strategies. - E-Commerce Performance Evaluation: Evaluate sales trends and revenue streams. - Price Optimization: Use data-driven insights to optimize product pricing.
File Structure The dataset is provided in an Excel file with multiple sheets (if applicable). Each sheet contains organized data for easier navigation and analysis. Specific sheets might cover:
Orders: Transaction details including order ID, product name, and sales data. Customers: Demographics and behavior. Products: Detailed product inventory and categories. Revenue Analysis: Key metrics include total revenue, average order value, and profit margins.
Facebook
TwitterThis dataset is created for a task of UNCOVER COVID-19 Challenge, Mental health impact and support services.
The search interest of mental health related terms on Google before and after the outbreak of COVID-19 pandemic reveals how public's concern is affected by the pandemic, and its impact to mental health of people around the world. I picked worldwide, Canada, US, Italy, Iran, Japan, South Korea and UK as the population. The dataset also includes data of Canada for the past 4 years, from 2016 to 2019.
The mental health related search terms are "mental health", "depression", "anxiety", "ocd", "obsessive compulsive disorder", "insomnia", "panic attack", "counseling", "psychiatrist".
Search interest is indicated by a number between 0 and 100, where 100 means the most popular point of time(by week), 1 means the least, and 0 no enough data.
All data is collected from Google Trends. I assumed, when searching the terms, users from countries other than English-speaking performed the search in their own language, and they typed the word correctly.
Facebook
Twitterhttps://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Presenting the UK English Scripted Monologue Speech Dataset for the Telecom Domain, a purpose-built dataset created to accelerate the development of English speech recognition and voice AI models specifically tailored for the telecommunications industry.
This dataset includes over 6,000 high-quality scripted prompt recordings in UK English, representing real-world telecom customer service scenarios. It’s designed to support the training of speech-based AI systems used in call centers, virtual agents, and voice-powered support tools.
The dataset reflects a wide variety of common telecom customer interactions, including:
To maximize contextual richness, prompts include:
Each audio file is paired with an accurate, verbatim transcription for precise model training:
Detailed metadata is included to
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This challenge is open to users from English speaking African countries.
The Tanzanian tourism sector plays a significant role in the Tanzanian economy, contributing about 17% to the country’s GDP and 25% of all foreign exchange revenues. The sector, which provides direct employment for more than 600,000 people and up to 2 million people indirectly, generated approximately $2.4 billion in 2018 according to government statistics. Tanzania received a record 1.1 million international visitor arrivals in 2014, mostly from Europe, the US and Africa.
Tanzania is the only country in the world which has allocated more than 25% of its total area for wildlife, national parks, and protected areas.There are 16 national parks in Tanzania, 28 game reserves, 44 game-controlled areas, two marine parks and one conservation area.
Tanzania’s tourist attractions include the Serengeti plains, which hosts the largest terrestrial mammal migration in the world; the Ngorongoro Crater, the world’s largest intact volcanic caldera and home to the highest density of big game in Africa; Kilimanjaro, Africa’s highest mountain; and the Mafia Island marine park; among many others. The scenery, topography, rich culture and very friendly people provide for excellent cultural tourism, beach holidays, honeymooning, game hunting, historical and archaeological ventures – and certainly the best wildlife photography safaris in the world.
The objective of this hackathon is to develop a machine learning model that can classify the range of expenditures a tourist spends in Tanzania. The model can be used by different tour operators and the Tanzania Tourism Board to automatically help tourists across the world estimate their expenditure before visiting Tanzania.
Facebook
Twitterhttps://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
The Spanish Healthcare Chat Dataset is a rich collection of over 10,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in Spanish-speaking regions.
The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:
This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.
This dataset reflects the natural flow of Spanish healthcare communication and includes:
These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.
Conversations range from simple inquiries to complex advisory sessions, including:
Each conversation typically includes these structural components:
This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.
Available in JSON, CSV, and TXT formats, each conversation includes:
Not seeing a result you expected?
Learn how you can add new datasets to our index.
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset provides a comprehensive overview of 500 languages spoken around the world. It captures essential linguistic features, including language families, geographical regions, writing systems, and the estimated number of native speakers. This dataset aims to highlight the rich diversity of languages and their cultural significance, offering valuable insights for linguists, researchers, and enthusiasts interested in global language distribution.
The dataset contains real and accurate records for 500 languages across different regions and linguistic families. It covers a diverse range of languages, from widely spoken ones like English and Mandarin to less commonly known languages. The data was meticulously compiled to reflect the authentic linguistic landscape and provide a valuable resource for language studies and cultural analysis.