100+ datasets found

100-richest-people-in-world
huggingface.co
Updated Aug 2, 2023
Cite
Nate Raw (2023). 100-richest-people-in-world [Dataset]. https://huggingface.co/datasets/nateraw/100-richest-people-in-world
Dataset updated
Aug 2, 2023
Authors
Nate Raw
License
https://choosealicense.com/licenses/cc0-1.0/
Area covered
World
Description
Dataset Card for 100 Richest People In World

Dataset Summary

This dataset contains the list of the top 100 richest people in the world.

Column information:

Name - Person's name
NetWorth - His/her net worth
Age - Person's age
Country - The country the person belongs to
Source - Information source
Industry - Expertise domain

Supported Tasks and Leaderboards

[More Information Needed]

Languages

[More Information Needed]… See the full description on the dataset page: https://huggingface.co/datasets/nateraw/100-richest-people-in-world.
1000 Richest People in the World
kaggle.com
zip
Updated Jul 28, 2024
Cite
Waqar Ali (2024). 1000 Richest People in the World [Dataset]. https://www.kaggle.com/datasets/waqi786/1000-richest-people-in-the-world
Available download formats: zip (8652 bytes)
Dataset updated
Jul 28, 2024
Authors
Waqar Ali
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
This dataset provides a synthetic overview of the 1,000 wealthiest individuals in the world, offering insights into the distribution of wealth across industries and regions. It is designed to help analysts, researchers, and data enthusiasts explore global wealth trends, industry dominance, and regional wealth concentration.

Whether you're conducting market research, financial analysis, or data modeling, this dataset serves as a valuable resource for understanding the characteristics of the world's top billionaires.

📊 Key Features:
Name 👤: The name of the billionaire.
Country 🌍: Country of residence or primary business operation.
Industry 🏭: Industry in which the individual has built their wealth.
Net Worth (in billions) 💵: Estimated net worth in billions of USD.
Company 🏢: The primary company or business associated with the billionaire.

⚠️ Important Note: This dataset is 100% synthetic and does not contain real financial or personal data. It is artificially generated for educational, analytical, and research purposes.
Forbes World's Billionaires List 2024
kaggle.com
Updated Aug 9, 2025
Cite
Vincent Campanaro (2025). Forbes World's Billionaires List 2024 [Dataset]. http://doi.org/10.34740/kaggle/dsv/12717950
Unique identifier
https://doi.org/10.34740/kaggle/dsv/12717950
Dataset updated
Aug 9, 2025
Dataset provided by
Kaggle
Authors
Vincent Campanaro
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This comprehensive dataset encapsulates a detailed snapshot of the wealthiest individuals globally, as listed by Forbes in 2024. Compiled through meticulous web scraping and data aggregation, the dataset includes a wide range of attributes for each billionaire. Fields encompass basic personal information such as name, age, and gender, alongside financial details including net worth and sources of wealth. The dataset further delves into aspects like industry involvement, organizational affiliations, philanthropic endeavors, and educational backgrounds.

Key attributes in this dataset include:

Name: Full legal name of the billionaire.
Age: Age of the individual.
2024 Net Worth: Estimated net worth in USD for the year 2024.
Industry: Primary industry or sector of operation.
Source of Wealth: Origin of the billionaire’s wealth.
Title: Professional title or position.
Organization: Name of the associated organization.
Self-Made: Indicator if the wealth is self-made.
Self-Made Score: A quantitative score assessing how self-made their wealth is.
Philanthropy Score: A score reflecting the extent of their philanthropic activities.
Residence: Main residence of the individual.
Citizenship: Legal citizenship.
Gender: Gender identity.
Marital Status: Current marital status.
Children: Number of children.
Education: Highest level of education attained.

This dataset is ideal for analysis, offering insights into the distribution of wealth, the influence of education on wealth accumulation, and trends across different industries. It also provides a foundation for exploring the impact of socioeconomic factors on personal wealth. The data were collected and formatted with careful consideration to ensure accuracy, making it a valuable resource for researchers, economists, and anyone interested in the dynamics of wealth and success.

Please note that some data is missing in this dataset, primarily due to the unavailability of information from Forbes. This issue becomes more prevalent beyond the top 400 entries. Many individuals lack a self-made score, a philanthropy score, or specific details regarding their title or organization as per Forbes' listings. I am currently working to update the dataset with this missing information. However, this update process is quite tedious and time-consuming since it is mostly manual. I appreciate your patience and understanding as I work through these details.
Billionaires dataset cleaned
kaggle.com
zip
Updated Feb 24, 2024
Cite
Javier_SAB (2024). Billionaires dataset cleaned [Dataset]. https://www.kaggle.com/datasets/javiersab/billionaires-dataset-cleaned
Available download formats: zip (128906 bytes)
Dataset updated
Feb 24, 2024
Authors
Javier_SAB
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Cleaned dataset from the Billionaires Statistic Dataset (2023) that can be found here.

The code I used to clean and re-structure the data is also here.

First things first: a big shout-out to Nidula Elgiriyewithana for providing the original data.

Like the original, this dataset contains various information about the world's wealthiest persons in columns that can be grouped into three types:

Business-related information. These columns contain data about the industry in which the billionaires operate, their source of wealth, their total wealth and the position they occupy in the ranking.

Personal information. Such as name, age, nationality, country and city of residence.

Economic activity information. These columns are related to the country in which the billionaire resides and provide different economic indicators like GDP, education enrollment or Consumer Price Index (CPI).

Column names

position. Ranking of the billionaire measured by their wealth.

wealth. The wealth of the billionaire measured in $.

industry. Industry in which the billionaire operates their businesses.

full_name. Complete name of the billionaire.

age. The age of the billionaire.

country_of_residence. Country in which the billionaire resides.

city_of_residence. City in which the billionaire resides.

source. The source of the billionaire's wealth.

citizenship. The country of citizenship of the billionaire.

gender. The gender of the billionaire.

birth_date. The birth date of the billionaire.

last_name. The last name of the billionaire.

first_name. The first name of the billionaire.

residence_state. State in which the billionaire resides (only for billionaires who reside in the U.S.).

residence_region. Region in which the billionaire resides (only for billionaires who reside in the U.S.).

birth_year. The birth year of the billionaire.

birth_month. The birth month of the billionaire.

birth_day. The birth day of the billionaire.

cpi_country. Consumer Price Index (CPI) for the billionaire's country.

cpi_change_country. CPI change for the billionaire's country.

gdp_country. Gross Domestic Product (GDP) in $ for the billionaire's country.

g_tertiary_ed_enroll. Enrollment in tertiary education in the billionaire's country.

g_primary_ed_enroll. Enrollment in primary education in the billionaire's country.

life_expectancy. Life expectancy in the billionaire's country.

tax_revenue. Tax revenue in the billionaire's country.

tax_rate. Total tax rate in the billionaire's country.

country_pop. Population of the billionaire's country.

country_lat. Latitude coordinate of the billionaire's country.

country_long. Longitude coordinate of the billionaire's country.

continent. Continent in which the country of the billionaire's residence is located.

Potential analyses

Analyze which industries contain the biggest groups of billionaires overall and in different countries.

Explore the number of billionaires and total wealth across countries and continents and display the result on a map.

Focus on personal information columns such as age or gender to explore the distribution of billionaires from this perspective.

Discover whether countries' economic indicators have any impact on the presence of billionaires.

The U.S. is the country with the most billionaires in the dataset and also the only one with values in the residence_state and residence_region columns. This makes American billionaires a good focus for a dedicated analysis.
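
The first two analyses listed above could be sketched roughly as follows. This is a minimal sketch using only the Python standard library; it assumes the rows have been loaded as dicts keyed by the column names listed earlier (industry, country_of_residence, wealth). The three sample rows and both helper function names are made up for illustration and are not taken from the dataset.

```python
from collections import Counter, defaultdict

# Hypothetical rows for illustration only; real rows would come from
# csv.DictReader over the downloaded CSV file.
rows = [
    {"full_name": "Person A", "industry": "Technology",
     "country_of_residence": "United States", "wealth": "1000000000"},
    {"full_name": "Person B", "industry": "Technology",
     "country_of_residence": "France", "wealth": "2000000000"},
    {"full_name": "Person C", "industry": "Retail",
     "country_of_residence": "United States", "wealth": "3000000000"},
]


def billionaires_per_industry(rows):
    """Count how many rows fall into each industry."""
    return Counter(row["industry"] for row in rows)


def wealth_per_country(rows):
    """Sum the wealth column (in $) per country of residence."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country_of_residence"]] += float(row["wealth"])
    return dict(totals)


industry_counts = billionaires_per_industry(rows)
country_totals = wealth_per_country(rows)
```

The same grouping logic carries over to a dataframe library if one is preferred; the standard-library version is used here only to keep the sketch dependency-free.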

Bonus

If you want a challenge, you can create a dashboard using tools such as Plotly to dynamically visualize the data by one or several attributes (such as industry, age or country). I did it and leave the link below in case you want to take a look:

Dashboard notebook here

If you find this dataset informative or inspirational, a vote is appreciated for others to easily discover value in it 💎💰
Leading billionaires worldwide 2025
statista.com
Updated Mar 18, 2025
Cite
Statista (2025). Leading billionaires worldwide 2025 [Dataset]. https://www.statista.com/statistics/272047/top-25-global-billionaires/
Dataset updated
Mar 18, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Mar 2025
Area covered
World
Description
As of March 2025, Elon Musk had a net worth valued at 328.5 billion U.S. dollars, making him the richest man in the world. Amazon founder Jeff Bezos followed in second, with Mark Zuckerberg, the founder of Facebook, in third. The list is dominated by Americans, and Alice Walton and Francoise Bettencourt Meyers are the only women among the 20 richest people worldwide.
Top 100 Richest People in the World
kaggle.com
zip
Updated Sep 18, 2022
Cite
Ayessa (2022). Top 100 Richest People in the World [Dataset]. https://www.kaggle.com/datasets/ayessa/top-100-richest-people-in-the-world
Available download formats: zip (3573 bytes)
Dataset updated
Sep 18, 2022
Authors
Ayessa
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Introduction

This dataset contains the top 100 richest people in the world based on their net worth. The dataset includes their rank, name, net worth, birthday, age, and nationality.

Methodology

This dataset was collected using web scraping (Beautiful Soup) on this website and on Wikipedia (https://en.wikipedia.org/wiki/List_of_countries_by_number_of_billionaires).

Thumbnail Photo
World_billion_2024
kaggle.com
zip
Updated Jun 25, 2024
Cite
willian oliveira (2024). World_billion_2024 [Dataset]. https://www.kaggle.com/willianoliveiragibin/world-billion-2024
Available download formats: zip (55504 bytes)
Dataset updated
Jun 25, 2024
Authors
willian oliveira
License
https://creativecommons.org/publicdomain/zero/1.0/
Area covered
World
Description
This graph was retrieved from the internet:

The "Richest People in the World - 2024" dataset provides a detailed overview of the wealthiest individuals globally for the year 2024. This dataset includes crucial information about the top executives, their net worth, and the countries they are based in, offering valuable insights for economic analysis, market research, and financial studies.
Billionaries dataset
kaggle.com
zip
Updated Apr 29, 2021
Cite
TEJA KUMAR (2021). Billionaries dataset [Dataset]. https://www.kaggle.com/ravillatejakumar/billionaries-dataset
Available download formats: zip (101897 bytes)
Dataset updated
Apr 29, 2021
Authors
TEJA KUMAR
Description
Content

This dataset lists the world's top billionaires along with details such as their names, the kind of company behind their wealth (for example, a finance or a software company), and how much money they have.

Researchers have compiled a multi-decade database of the super-rich. Building off the Forbes World’s Billionaires lists from 1996-2014, scholars at Peterson Institute for International Economics have added a couple dozen more variables about each billionaire - including whether they were self-made or inherited their wealth. (Roughly half of European billionaires and one-third of U.S. billionaires got a significant financial boost from family, the authors estimate.)

Acknowledgements

Reference : https://corgis-edu.github.io/corgis/csv/billionaires/

Inspiration

Some of the datasets I have seen on Kaggle and elsewhere are limited to a small number of columns, so Kagglers cannot get many insights from such a small amount of data. I therefore decided to share this dataset so that more insights can be drawn from it.
Globalization and Income Distribution Dataset 1975-2002 - Aruba,...
microdata.worldbank.org
catalog.ihsn.org
Updated Oct 26, 2023
Cite
Branko L. Milanovic (2023). Globalization and Income Distribution Dataset 1975-2002 - Aruba, Afghanistan, Angola...and 188 more [Dataset]. https://microdata.worldbank.org/index.php/catalog/1786
Dataset updated
Oct 26, 2023
Dataset authored and provided by
Branko L. Milanovic
Time period covered
1975 - 2002
Area covered
Angola
Description
Abstract

Dataset used in World Bank Policy Research Working Paper #2876, published in World Bank Economic Review, No. 1, 2005, pp. 21-44.

The effects of globalization on income distribution in rich and poor countries are a matter of controversy. While international trade theory in its most abstract formulation implies that increased trade and foreign investment should make income distribution more equal in poor countries and less equal in rich countries, finding these effects has proved elusive. The author presents another attempt to discern the effects of globalization by using data from household budget surveys and looking at the impact of openness and foreign direct investment on relative income shares of low and high deciles. The author finds some evidence that at very low average income levels, it is the rich who benefit from openness. As income levels rise to those of countries such as Chile, Colombia, or Czech Republic, for example, the situation changes, and it is the relative income of the poor and the middle class that rises compared with the rich. It seems that openness makes income distribution worse before making it better, or, put differently, that the effect of openness on a country's income distribution depends on the country's initial income level.

Kind of data

Aggregate data [agg]
Education Attainment and Enrollment around the World
datacatalog.worldbank.org
excel, html, pdf, zip
Updated Nov 4, 2018
Cite
Ryan Douglas Hahn (2018). Education Attainment and Enrollment around the World [Dataset]. https://datacatalog.worldbank.org/search/dataset/0038973/education-attainment-and-enrollment-around-the-world
Available download formats: pdf, excel, html, zip
Dataset updated
Nov 4, 2018
Dataset provided by
Ryan Douglas Hahn
License
https://datacatalog.worldbank.org/public-licenses?fragment=cc
Area covered
World
Description
Patterns of educational attainment vary greatly across countries, and across population groups within countries. In some countries, virtually all children complete basic education whereas in others large groups fall short. The primary purpose of this database, and the associated research program, is to document and analyze these differences using a compilation of a variety of household-based data sets: Demographic and Health Surveys (DHS); Multiple Indicator Cluster Surveys (MICS); Living Standards Measurement Study Surveys (LSMS); as well as country-specific Integrated Household Surveys (IHS) such as Socio-Economic Surveys.

As shown at the website associated with this database, there are dramatic differences in attainment by wealth. When households are ranked according to their wealth status (or more precisely, a proxy based on the assets owned by members of the household) there are striking differences in the attainment patterns of children from the richest 20 percent compared to the poorest 20 percent.

In Mali in 2012, only 34 percent of 15 to 19 year olds in the poorest quintile had completed grade 1, whereas 80 percent of the richest quintile had done so. In many countries, for example Pakistan, Peru and Indonesia, almost all the children from the wealthiest households have completed at least one year of schooling. In some countries, like Mali and Pakistan, wealth gaps are evident from grade 1 on; in other countries, like Peru and Indonesia, wealth gaps emerge later in the school system.

The EdAttain website allows a visual exploration of gaps in attainment and enrollment within and across countries, based on the international database which spans multiple years from over 120 countries and includes indicators disaggregated by wealth, gender and urban/rural location. The database underlying that site can be downloaded from here.
Are Students Ready for a Technology-Rich World? What PISA Studies Tell Us
catalog.data.gov
Updated Mar 30, 2021
Cite
U.S. Department of State (2021). Are Students Ready for a Technology-Rich World? What PISA Studies Tell Us [Dataset]. https://catalog.data.gov/dataset/are-students-ready-for-a-technology-rich-world-what-pisa-studies-tell-us
Dataset updated
Mar 30, 2021
Dataset provided by
U.S. Department of State
Description
ICT has profound implications for education, both because ICT can facilitate new forms of learning and because it has become important for young people to master ICT in preparation for adult life. But how extensive is access to ICT in schools and informal settings and how is it used by students? Drawing on data from the OECD’s Programme for International Student Assessment (PISA), Are Students Ready for a Technology-Rich World? What PISA Studies Tell Us, examines whether access to computers for students is equitable across countries and student groups; how students use ICT and what their attitudes are towards ICT; the relationship between students’ access to and use of ICT and their performance in PISA 2003; and the implications for educational policy.
Data from: REASSEMBLE: A Multimodal Dataset for Contact-rich Robotic...
researchdata.tuwien.ac.at
txt, zip
Updated Jul 15, 2025
Cite
Daniel Jan Sliwowski; Shail Jadav; Sergej Stanovcic; Jędrzej Orbik; Johannes Heidersberger; Dongheui Lee (2025). REASSEMBLE: A Multimodal Dataset for Contact-rich Robotic Assembly and Disassembly [Dataset]. http://doi.org/10.48436/0ewrv-8cb44
Available download formats: zip, txt
Unique identifier
https://doi.org/10.48436/0ewrv-8cb44
Dataset updated
Jul 15, 2025
Dataset provided by
TU Wien
Authors
Daniel Jan Sliwowski; Shail Jadav; Sergej Stanovcic; Jędrzej Orbik; Johannes Heidersberger; Dongheui Lee
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Time period covered
Jan 9, 2025 - Jan 14, 2025
Description
REASSEMBLE: A Multimodal Dataset for Contact-rich Robotic Assembly and Disassembly

📋 Introduction

Robotic manipulation remains a core challenge in robotics, particularly for contact-rich tasks such as industrial assembly and disassembly. Existing datasets have significantly advanced learning in manipulation but are primarily focused on simpler tasks like object rearrangement, falling short of capturing the complexity and physical dynamics involved in assembly and disassembly. To bridge this gap, we present REASSEMBLE (Robotic assEmbly disASSEMBLy datasEt), a new dataset designed specifically for contact-rich manipulation tasks. Built around the NIST Assembly Task Board 1 benchmark, REASSEMBLE includes four actions (pick, insert, remove, and place) involving 17 objects. The dataset contains 4,551 demonstrations, of which 4,035 were successful, spanning a total of 781 minutes. Our dataset features multi-modal sensor data including event cameras, force-torque sensors, microphones, and multi-view RGB cameras. This diverse dataset supports research in areas such as learning contact-rich manipulation, task condition identification, action segmentation, and more. We believe REASSEMBLE will be a valuable resource for advancing robotic manipulation in complex, real-world scenarios.

✨ Key Features

Multimodality: REASSEMBLE contains data from robot proprioception, RGB cameras, Force&Torque sensors, microphones, and event cameras

Multitask labels: REASSEMBLE contains labeling which enables research in Temporal Action Segmentation, Motion Policy Learning, Anomaly detection, and Task Inversion.

Long horizon: Demonstrations in the REASSEMBLE dataset cover long horizon tasks and actions which usually span multiple steps.

Hierarchical labels: REASSEMBLE contains action segmentation labels at two hierarchical levels.

🔴 Dataset Collection

Each demonstration starts by randomizing the board and object poses, after which an operator teleoperates the robot to assemble and disassemble the board while narrating their actions and marking task segment boundaries with key presses. The narrated descriptions are transcribed using Whisper [1], and the board and camera poses are measured at the beginning using a motion capture system, though continuous tracking is avoided due to interference with the event camera. Sensory data is recorded with rosbag and later post-processed into HDF5 files without downsampling or synchronization, preserving raw data and timestamps for future flexibility. To reduce memory usage, video and audio are stored as encoded MP4 and MP3 files, respectively. Transcription errors are corrected automatically or manually, and a custom visualization tool is used to validate the synchronization and correctness of all data and annotations. Missing or incorrect entries are identified and corrected, ensuring the dataset’s completeness. Low-level Skill annotations were added manually after data collection, and all labels were carefully reviewed to ensure accuracy.

📑 Dataset Structure

The dataset consists of several HDF5 (.h5) and JSON (.json) files, organized into two directories. The poses directory contains the JSON files, which store the poses of the cameras and the board in the world coordinate frame. The data directory contains the HDF5 files, which store the sensory readings and annotations collected as part of the REASSEMBLE dataset. Each JSON file can be matched with its corresponding HDF5 file based on their filenames, which include the timestamp when the data was recorded. For example, 2025-01-09-13-59-54_poses.json corresponds to 2025-01-09-13-59-54.h5.
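
The filename matching described above can be sketched in a few lines of Python. This is only an illustration: the helper name poses_path_for is hypothetical, and the directory names are taken from the description (poses and data).

```python
from pathlib import Path


def poses_path_for(h5_path, poses_dir="poses"):
    """Return the pose JSON path that matches a recording file.

    Example: data/2025-01-09-13-59-54.h5 -> poses/2025-01-09-13-59-54_poses.json
    """
    stem = Path(h5_path).stem  # the timestamp part of the filename
    return Path(poses_dir) / f"{stem}_poses.json"


match = poses_path_for("data/2025-01-09-13-59-54.h5")
```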

The structure of the JSON files is as follows:

{
  "Hama1": [ [x, y, z], [qx, qy, qz, qw] ],
  "Hama2": [ [x, y, z], [qx, qy, qz, qw] ],
  "DAVIS346": [ [x, y, z], [qx, qy, qz, qw] ],
  "NIST_Board1": [ [x, y, z], [qx, qy, qz, qw] ]
}

[x, y, z] represent the position of the object, and [qx, qy, qz, qw] represent its orientation as a quaternion.
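
A pose file in this layout could be read with Python's standard json module, for instance as sketched below. The numeric values are made up for illustration, and parse_poses is a hypothetical helper, not part of the REASSEMBLE code.

```python
import json

# Made-up numbers in the documented layout: [[x, y, z], [qx, qy, qz, qw]].
example = '{"NIST_Board1": [[0.5, 0.1, 0.0], [0.0, 0.0, 0.0, 1.0]]}'


def parse_poses(text):
    """Map each object name to a (position, quaternion) pair of tuples."""
    poses = {}
    for name, (position, quaternion) in json.loads(text).items():
        if len(position) != 3 or len(quaternion) != 4:
            raise ValueError(f"unexpected pose shape for {name}")
        poses[name] = (tuple(position), tuple(quaternion))
    return poses


poses = parse_poses(example)
position, quaternion = poses["NIST_Board1"]
```

Note the quaternion order given in the description is [qx, qy, qz, qw], so the scalar part is the last element.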

The HDF5 (.h5) format organizes data into two main types of structures: datasets, which hold the actual data, and groups, which act like folders that can contain datasets or other groups. In the diagram below, groups are shown as folder icons, and datasets as file icons. The main group of the file directly contains the video, audio, and event data. To save memory, video and audio are stored as encoded byte strings, while event data is stored as arrays. The robot’s proprioceptive information is kept in the robot_state group as arrays. Because different sensors record data at different rates, the arrays vary in length (signified by the N_xxx variable in the data shapes). To align the sensory data, each sensor’s timestamps are stored separately in the timestamps group. Information about action segments is stored in the segments_info group. Each segment is saved as a subgroup, named according to its order in the demonstration, and includes a start timestamp, end timestamp, a success indicator, and a natural language description of the action. Within each segment, low-level skills are organized under a low_level subgroup, following the same structure as the high-level annotations.
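
Because each sensor keeps its own timestamp array, one simple way to align two streams is a nearest-timestamp lookup. The sketch below illustrates that idea with the standard bisect module; it is not the project's own alignment code, the timestamps (integer milliseconds) are made up, and nearest_index is a hypothetical helper.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Index of the entry in a sorted timestamp list that is closest to t."""
    if not timestamps:
        raise ValueError("timestamps must not be empty")
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer to t (ties go to the earlier sample).
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


# Made-up timestamps (milliseconds) for two sensors recorded at different rates.
camera_ts = [0, 100, 200, 300]
force_ts = [0, 30, 60, 90, 120, 150, 180, 210, 240, 270, 300]

# For every camera frame, find the index of the closest force-torque sample.
aligned = [nearest_index(force_ts, t) for t in camera_ts]
```

With real data, the timestamp lists would come from the timestamps group of an HDF5 file, and the returned indices would select rows from the corresponding sensor array.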

[Diagram: HDF5 file structure, with groups shown as folder icons and datasets as file icons]

The splits folder contains two text files which list the h5 files used for the training and validation splits.

📌 Important Resources

The project website contains more details about the REASSEMBLE dataset. The code for loading and visualizing the data is available in our GitHub repository.

📄 Project website: https://tuwien-asl.github.io/REASSEMBLE_page/
💻 Code: https://github.com/TUWIEN-ASL/REASSEMBLE

⚠️ File comments

Below is a table listing the recordings that have issues. Issues typically correspond to missing data from one of the sensors.

Recording                 Issue
2025-01-10-15-28-50.h5    hand cam missing at beginning
2025-01-10-16-17-40.h5    missing hand cam
2025-01-10-17-10-38.h5    hand cam missing at beginning
2025-01-10-17-54-09.h5    no empty action at
English Agent-Customer Chat Dataset for Healthcare Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). English Agent-Customer Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/english-healthcare-domain-conversation-text-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The English Healthcare Chat Dataset is a rich collection of over 12,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in English-speaking regions.
Participant & Chat Overview
• Participants: 200+ native English speakers from the FutureBeeAI Crowd Community
• Conversation Length: 300–700 words per chat
• Turns per Chat: 50–150 dialogue turns across both participants
• Chat Types: Inbound and outbound
• Sentiment Coverage: Positive, neutral, and negative outcomes included

Topic Diversity
The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:
• Inbound Chats (Customer-Initiated): Appointment scheduling, new patient registration, surgery and treatment consultations, diet and lifestyle discussions, insurance claim inquiries, lab result follow-ups
• Outbound Chats (Agent-Initiated): Appointment reminders and confirmations, health and wellness program offers, test result notifications, preventive care and vaccination reminders, subscription renewals, risk assessment and eligibility follow-ups

This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.
Language Diversity & Realism
This dataset reflects the natural flow of English healthcare communication and includes:
• Authentic Naming Patterns: English personal names, clinic names, and brands
• Localized Contact Elements: Addresses, emails, phone numbers, and clinic locations in regional English formats
• Time & Currency References: Use of dates, times, numeric expressions, and currency units aligned with English-speaking regions
• Colloquial & Medical Expressions: Local slang, informal speech, and common healthcare-related terminology

These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.
Conversational Flow & Structure
Conversations range from simple inquiries to complex advisory sessions, including:
•General inquiries
•Detailed problem-solving
•Routine status updates
•Treatment recommendations
•Support and feedback interactions
Each conversation typically includes these structural components:
•Greetings and verification
•Information gathering
•Problem definition
•Solution delivery
•Closing messages
•Follow-up and feedback (where applicable)
This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.
Data Format & Structure
Available in JSON, CSV, and TXT formats, each conversation includes:
•Full message history with clear speaker labels
•Participant identifiers
•Metadata (e.g., topic tags, region, sentiment)
•Compatibility with common NLP and ML pipelines
Applications
World's Billionaires
kaggle.com
zip
Updated May 19, 2021
Cite
Sadialiou Diallo (2021). World's Billionaires [Dataset]. https://www.kaggle.com/seriadiallo1/world-billionaires
Available download formats: zip (2962 bytes)
Dataset updated
May 19, 2021
Authors
Sadialiou Diallo
Description
The richest people in the world, yearly rank from 2002 to 2021

This dataset contains 200 rows and 7 columns.

The World's Billionaires is an annual ranking by documented net worth of the world's wealthiest billionaires compiled and published in March annually by the American business magazine Forbes. The list was first published in March 1987. The total net worth of each individual on the list is estimated and is cited in United States dollars, based on their documented assets and accounting for debt. Royalty and dictators whose wealth comes from their positions are excluded from these lists. This ranking is an index of the wealthiest documented individuals, excluding and ranking against those with wealth that is not able to be completely ascertained. (wikipedia)
Data from: InterHub: A Naturalistic Trajectory Dataset with Dense...
figshare.com
csv
Updated May 24, 2025
Cite
Xiyan Jiang; Xiaocong Zhao; Yiru Liu; Zirui Li; Peng Hang; Lu Xiong; Jian Sun (2025). InterHub: A Naturalistic Trajectory Dataset with Dense Interaction for Autonomous Driving [Dataset]. http://doi.org/10.6084/m9.figshare.27899754.v6
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.6084/m9.figshare.27899754.v6
Dataset updated
May 24, 2025
Dataset provided by
figshare
Figshare (http://figshare.com/)
Authors
Xiyan Jiang; Xiaocong Zhao; Yiru Liu; Zirui Li; Peng Hang; Lu Xiong; Jian Sun
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We provide a dense interaction dataset, InterHub, derived from extensive naturalistic driving records to address the scarcity of real-world datasets capturing rich interaction events.
The dataset provided on this page includes:
•A CSV file (Interactive_Segments_Index.csv) containing the indexed list of the extracted interaction events. In addition to indexing and tracing information about interaction scenarios, it provides labels to facilitate more targeted retrieval and use of interaction scenarios. (For detailed information, please refer to https://github.com/zxc-tju/InterHub.)
•Relevant unified data cache files (InterHub_cache_files.zip, which includes cache files of lyft_train_full and nuplan_train).
The Python code used to process and analyze the dataset can be found at https://github.com/zxc-tju/InterHub. The tools for navigating InterHub comprise the following three parts:
•0_data_unify.py converts various data resources into a unified format for seamless interaction event extraction.
•1_interaction_extract.py extracts interactive segments from unified driving records.
•2_case_visualize.py showcases interaction scenarios in InterHub.
You can refer to the data structure of cache files presented in dataset.md; after extracting the InterHub_cache_files.zip file, put it in the corresponding folder.
Statement: All third-party data redistributions included in the interhub_cache_files.zip repository are carried out in full compliance with the original licensing terms of the respective source datasets, as required by their mandatory licensing conditions. This portion of the data remains subject to its original licenses, and users of the data are required to comply with these original licensing terms in any subsequent use or redistribution.
Golden Dataset Curation for LLMs Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Oct 4, 2025
Cite
Growth Market Reports (2025). Golden Dataset Curation for LLMs Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/golden-dataset-curation-for-llms-market
Explore at:
Available download formats: csv, pdf, pptx
Dataset updated
Oct 4, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Golden Dataset Curation for LLMs Market Outlook

According to our latest research, the global Golden Dataset Curation for LLMs market size stood at USD 1.42 billion in 2024, reflecting the surging demand for high-quality, bias-mitigated datasets in large language model (LLM) development. The market is projected to grow at a robust CAGR of 27.8% from 2025 to 2033, reaching an estimated USD 13.9 billion by 2033. This remarkable growth is fueled by the increasing sophistication of AI models, the critical need for reliable training data, and the expanding adoption of LLMs across diverse sectors.
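The growth figures quoted above can be sanity-checked with the standard compound annual growth rate formula. The sketch below is illustrative only: it assumes 2024 as the base year and nine compounding periods through 2033, which the report does not state explicitly, and the function names are hypothetical.

```python
def project(base: float, rate: float, years: int) -> float:
    """Compound `base` at annual `rate` for `years` periods."""
    return base * (1.0 + rate) ** years


def implied_cagr(start: float, end: float, years: int) -> float:
    """Annual growth rate that takes `start` to `end` in `years` periods."""
    return (end / start) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    base_2024 = 1.42    # USD billion, as quoted in the report
    target_2033 = 13.9  # USD billion, as quoted in the report
    periods = 9         # assumption: compounding from 2024 to 2033

    projected = project(base_2024, 0.278, periods)
    rate = implied_cagr(base_2024, target_2033, periods)
    print(f"Projected 2033 size at 27.8%: USD {projected:.2f} billion")
    print(f"CAGR implied by the two endpoints: {rate:.1%}")
```

Any small gap between the quoted rate and the rate implied by the endpoints may come from rounding or from a different base year than the one assumed here.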

Several key factors are driving the rapid expansion of the Golden Dataset Curation for LLMs market. First and foremost is the exponential growth in the deployment of large language models across industries such as healthcare, finance, legal, and customer service. As organizations seek to leverage LLMs for complex natural language processing tasks, the demand for meticulously curated, high-quality datasets has become paramount. This is because the performance, reliability, and ethical alignment of LLMs are intrinsically linked to the quality of their training data. Companies are increasingly investing in the curation of "golden datasets"—datasets that are not only comprehensive and representative but also rigorously annotated and validated to minimize bias and ensure regulatory compliance. This trend is expected to intensify as AI regulations tighten and as organizations strive for greater transparency and accountability in AI deployments.

Another significant growth driver for the Golden Dataset Curation for LLMs market is the advancement in data curation technologies and methodologies. The integration of automation, machine learning, and human-in-the-loop systems has revolutionized the way datasets are curated and validated. These advancements enable the efficient handling of vast and complex data sources, including text, image, audio, and multimodal datasets. The rise of specialized data curation platforms and services has further accelerated the adoption of golden dataset practices, allowing organizations to scale their AI initiatives while maintaining data integrity. Moreover, as LLMs become more multilingual and domain-specific, the need for curated datasets that reflect diverse languages, cultures, and industry-specific knowledge is growing rapidly, further boosting market demand.

The expanding ecosystem of AI applications is also propelling the Golden Dataset Curation for LLMs market forward. As LLMs are increasingly utilized for tasks such as model training, evaluation, benchmarking, and fine-tuning, the scope and complexity of required datasets have grown exponentially. Organizations are now seeking datasets that not only support model development but also facilitate continuous evaluation and improvement of AI models in real-world scenarios. This has led to a surge in demand for datasets that are regularly updated, contextually rich, and tailored to specific use cases. Additionally, the proliferation of open-source and third-party data sources, coupled with the need for proprietary datasets, has created a dynamic and competitive market landscape where data quality and curation expertise are key differentiators.

From a regional perspective, North America currently dominates the Golden Dataset Curation for LLMs market, accounting for the largest share in 2024. This leadership is attributed to the presence of major technology companies, a robust research ecosystem, and significant investments in AI and machine learning infrastructure. Europe and Asia Pacific are also emerging as key markets, driven by increasing regulatory focus on AI ethics and the rapid digital transformation of enterprises. The Asia Pacific region, in particular, is expected to witness the highest CAGR during the forecast period, fueled by rising AI adoption in countries such as China, Japan, and India. Meanwhile, Latin America and the Middle East & Africa are gradually catching up, supported by growing awareness of AI's potential and investments in digital infrastructure.

Dataset Type
Spanish Agent-Customer Chat Dataset for Healthcare Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Spanish Agent-Customer Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/spanish-healthcare-domain-conversation-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Spanish Healthcare Chat Dataset is a rich collection of over 10,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in Spanish-speaking regions.
Participant & Chat Overview
•Participants: 150+ native Spanish speakers from the FutureBeeAI Crowd Community
•Conversation Length: 300–700 words per chat
•Turns per Chat: 50–150 dialogue turns across both participants
•Chat Types: Inbound and outbound
•Sentiment Coverage: Positive, neutral, and negative outcomes included
Topic Diversity
The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:
•Inbound Chats (Customer-Initiated): Appointment scheduling, new patient registration, surgery and treatment consultations, diet and lifestyle discussions, insurance claim inquiries, lab result follow-ups
•Outbound Chats (Agent-Initiated): Appointment reminders and confirmations, health and wellness program offers, test result notifications, preventive care and vaccination reminders, subscription renewals, risk assessment and eligibility follow-ups
This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.
Language Diversity & Realism
This dataset reflects the natural flow of Spanish healthcare communication and includes:
•Authentic Naming Patterns: Spanish personal names, clinic names, and brands
•Localized Contact Elements: Addresses, emails, phone numbers, and clinic locations in regional Spanish formats
•Time & Currency References: Use of dates, times, numeric expressions, and currency units aligned with Spanish-speaking regions
•Colloquial & Medical Expressions: Local slang, informal speech, and common healthcare-related terminology
These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.
Conversational Flow & Structure
Conversations range from simple inquiries to complex advisory sessions, including:
•General inquiries
•Detailed problem-solving
•Routine status updates
•Treatment recommendations
•Support and feedback interactions
Each conversation typically includes these structural components:
•Greetings and verification
•Information gathering
•Problem definition
•Solution delivery
•Closing messages
•Follow-up and feedback (where applicable)
This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.
Data Format & Structure
Available in JSON, CSV, and TXT formats, each conversation includes:
•Full message history with clear speaker labels
•Participant identifiers
•Metadata (e.g., topic tags, region, sentiment)
•Compatibility with common NLP and ML pipelines
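As a rough illustration of how a record with these components might be consumed, the sketch below parses one JSON conversation and derives per-speaker turn counts and customer-to-agent reply pairs. Every field name (`conversation_id`, `participants`, `metadata`, `messages`, `speaker`, `text`) and the sample record itself are assumptions made for this example; the vendor's actual schema is not shown on this page.

```python
import json
from collections import Counter

# Hypothetical record: all field names and values are illustrative
# assumptions, not the vendor's published schema or data.
SAMPLE = """
{
  "conversation_id": "example-0001",
  "participants": [
    {"id": "P1", "role": "customer"},
    {"id": "P2", "role": "agent"}
  ],
  "metadata": {"topic": "appointment scheduling", "region": "ES", "sentiment": "positive"},
  "messages": [
    {"speaker": "P1", "text": "Hola, quisiera pedir una cita."},
    {"speaker": "P2", "text": "Claro, ¿para qué día?"},
    {"speaker": "P1", "text": "Para el lunes, por favor."}
  ]
}
"""


def turns_per_speaker(conversation: dict) -> Counter:
    """Count messages per speaker label in one conversation record."""
    return Counter(msg["speaker"] for msg in conversation["messages"])


def to_training_pairs(conversation: dict) -> list[tuple[str, str]]:
    """Pair each customer message with the agent reply that directly follows it."""
    roles = {p["id"]: p["role"] for p in conversation["participants"]}
    msgs = conversation["messages"]
    return [
        (first["text"], second["text"])
        for first, second in zip(msgs, msgs[1:])
        if roles[first["speaker"]] == "customer" and roles[second["speaker"]] == "agent"
    ]


if __name__ == "__main__":
    convo = json.loads(SAMPLE)
    print(turns_per_speaker(convo))
    print(to_training_pairs(convo))
```

Keeping speaker labels separate from participant roles, as assumed here, lets the same helper work for both inbound and outbound chats.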
Applications
Mapping Ocean Wealth Explorer
rmi-data.sprep.org
pacificdata.org
+14 more
pdf
Updated Feb 20, 2025
Cite
Secretariat of the Pacific Regional Environment Programme (2025). Mapping Ocean Wealth Explorer [Dataset]. https://rmi-data.sprep.org/dataset/mapping-ocean-wealth-explorer
Explore at:
Available download formats: pdf (12573434)
Dataset updated
Feb 20, 2025
Dataset provided by
Pacific Regional Environment Programme (https://www.sprep.org/)
License
Public Domain Mark 1.0, https://creativecommons.org/publicdomain/mark/1.0/
License information was derived automatically
Area covered
Pacific Region
Description
The Mapping Ocean Wealth data viewer is a live online resource for sharing understanding of the value of marine and coastal ecosystems to people. It includes global maps, regionally specific studies, reference data, and a number of “apps” providing key data analytics. Maps and apps can be opened according to key themes or geographies. The navigator to the left of the maps enables you to add or remove any additional map layers as you explore. Information keys explain how the maps were made and provide additional links. Further information and resources can be found on Oceanwealth.org

Recreation and Tourism App - Explore the value of healthy ecosystems to the tourism industry

Natural Coastal Protection App - Discover the coastal protection benefits of coral reefs around the world

Blue Carbon App - View Mangrove Carbon Storage

Coral Reef Fisheries App - Learn about the status of coral reef fisheries

Regional Planning

Mangrove Restoration
Friedl presentation at CIDU - Dataset - NASA Open Data Portal
data.nasa.gov
Updated Mar 31, 2025
Cite
nasa.gov (2025). Friedl presentation at CIDU - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/friedl-presentation-at-cidu
Explore at:
Dataset updated
Mar 31, 2025
Dataset provided by
NASA (http://nasa.gov/)
Description
The land remote sensing community has a long history of using supervised and unsupervised methods to help interpret and analyze remote sensing data sets. Until relatively recently, most remote sensing studies have used fairly conventional image processing and pattern recognition methodologies. In the past decade, NASA has launched a series of remote sensing missions known as the Earth Observing System (EOS). The data sets acquired by EOS instruments provide an extremely rich source of information related to the properties and dynamics of the Earth’s terrestrial ecosystems. However, these data are also characterized by large volumes and complex spectral, spatial and temporal attributes. Because of the volume and complexity of EOS data sets, efficient and effective analysis of them presents significant challenges that are difficult to address using conventional remote sensing approaches. In this paper we discuss results from applying a variety of different data mining approaches to global remote sensing data sets. Specifically, we describe three main problem domains and sets of analyses: (1) supervised classification of global land cover using data from NASA’s Moderate Resolution Imaging Spectroradiometer; (2) the use of linear and non-linear cluster and dimensionality reduction methods to examine coupled climate-vegetation dynamics using a twenty-year time series of data from the Advanced Very High Resolution Radiometer; and (3) the use of functional models, non-parametric clustering, and mixture models to help interpret and understand the feature space and class structure of high-dimensional remote sensing data sets. The paper will not focus on specific details of algorithms. Instead we describe key results, successes, and lessons learned from ten years of research focusing on the use of data mining and machine learning methods for remote sensing and Earth science problems.
Vietnamese Agent-Customer Chat Dataset for Healthcare Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Vietnamese Agent-Customer Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/vietnamese-healthcare-domain-conversation-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Vietnamese Healthcare Chat Dataset is a rich collection of over 10,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in Vietnamese-speaking regions.
Participant & Chat Overview
•Participants: 150+ native Vietnamese speakers from the FutureBeeAI Crowd Community
•Conversation Length: 300–700 words per chat
•Turns per Chat: 50–150 dialogue turns across both participants
•Chat Types: Inbound and outbound
•Sentiment Coverage: Positive, neutral, and negative outcomes included
Topic Diversity
The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:
•Inbound Chats (Customer-Initiated): Appointment scheduling, new patient registration, surgery and treatment consultations, diet and lifestyle discussions, insurance claim inquiries, lab result follow-ups
•Outbound Chats (Agent-Initiated): Appointment reminders and confirmations, health and wellness program offers, test result notifications, preventive care and vaccination reminders, subscription renewals, risk assessment and eligibility follow-ups
This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.
Language Diversity & Realism
This dataset reflects the natural flow of Vietnamese healthcare communication and includes:
•Authentic Naming Patterns: Vietnamese personal names, clinic names, and brands
•Localized Contact Elements: Addresses, emails, phone numbers, and clinic locations in regional Vietnamese formats
•Time & Currency References: Use of dates, times, numeric expressions, and currency units aligned with Vietnamese-speaking regions
•Colloquial & Medical Expressions: Local slang, informal speech, and common healthcare-related terminology
These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.
Conversational Flow & Structure
Conversations range from simple inquiries to complex advisory sessions, including:
•General inquiries
•Detailed problem-solving
•Routine status updates
•Treatment recommendations
•Support and feedback interactions
Each conversation typically includes these structural components:
•Greetings and verification
•Information gathering
•Problem definition
•Solution delivery
•Closing messages
•Follow-up and feedback (where applicable)
This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.
Data Format & Structure
Available in JSON, CSV, and TXT formats, each conversation includes:
•Full message history with clear speaker labels
•Participant identifiers
•Metadata (e.g., topic tags, region, sentiment)
•Compatibility with common NLP and ML pipelines
Recording	Issue
2025-01-10-15-28-50.h5	hand cam missing at beginning
2025-01-10-16-17-40.h5	missing hand cam
2025-01-10-17-10-38.h5	hand cam missing at beginning
2025-01-10-17-54-09.h5	no empty action at
