100+ datasets found

Research Papers Dataset
kaggle.com
zip
Updated May 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
NECHBA MOHAMMED (2023). Research Papers Dataset [Dataset]. https://www.kaggle.com/datasets/nechbamohammed/research-papers-dataset
Explore at:
zip(619131172 bytes)Available download formats
Dataset updated
May 8, 2023
Authors
NECHBA MOHAMMED
Description
Description: This dataset (Version 10) contains a collection of research papers along with various attributes and metadata. It is a comprehensive and diverse dataset that can be used for a wide range of research and analysis tasks. The dataset encompasses papers from different fields of study, including computer science, mathematics, physics, and more.

Fields in the Dataset: - id: A unique identifier for each paper. - title: The title of the research paper. - authors: The list of authors involved in the paper. - venue: The journal or venue where the paper was published. - year: The year when the paper was published. - n_citation: The number of citations received by the paper. - references: A list of paper IDs that are cited by the current paper. - abstract: The abstract of the paper.

Example: - "id": "013ea675-bb58-42f8-a423-f5534546b2b1", - "title": "Prediction of consensus binding mode geometries for related chemical series of positive allosteric modulators of adenosine and muscarinic acetylcholine receptors", - "authors": ["Leon A. Sakkal", "Kyle Z. Rajkowski", "Roger S. Armen"], - "venue": "Journal of Computational Chemistry", - "year": 2017, - "n_citation": 0, - "references": ["4f4f200c-0764-4fef-9718-b8bccf303dba", "aa699fbf-fabe-40e4-bd68-46eaf333f7b1"], - "abstract": "This paper studies ..."

Cite: https://www.aminer.cn/citation
Dirty Dataset to practice Data Cleaning
kaggle.com
zip
Updated Nov 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amrutha yenikonda (2023). Dirty Dataset to practice Data Cleaning [Dataset]. https://www.kaggle.com/datasets/amruthayenikonda/dirty-dataset-to-practice-data-cleaning
Explore at:
zip(1241 bytes)Available download formats
Dataset updated
Nov 3, 2023
Authors
Amrutha yenikonda
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
The dataset has been obtained by web scraping a Wikipedia page for which code is linked below: https://www.kaggle.com/amruthayenikonda/simple-web-scraping-using-pandas

This dataset can be used to practice data cleaning and manipulation for example dropping of unwanted columns, null vales, removing symbols etc
c
Walmart Products Dataset – Free Product Data CSV
crawlfeeds.com
csv, zip
Updated Dec 2, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Crawl Feeds (2025). Walmart Products Dataset – Free Product Data CSV [Dataset]. https://crawlfeeds.com/datasets/walmart-products-free-dataset
Explore at:
zip, csvAvailable download formats
Dataset updated
Dec 2, 2025
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy
Description
Looking for a free Walmart product dataset? The Walmart Products Free Dataset delivers a ready-to-use ecommerce product data CSV containing ~2,100 verified product records from Walmart.com. It includes vital details like product titles, prices, categories, brand info, availability, and descriptions — perfect for data analysis, price comparison, market research, or building machine-learning models.

Key Features

Complete Product Metadata: Each entry includes URL, title, brand, SKU, price, currency, description, availability, delivery method, average rating, total ratings, image links, unique ID, and timestamp.

CSV Format, Ready to Use: Download instantly - no need for scraping, cleaning or formatting.

Good for E-commerce Research & ML: Ideal for product cataloging, price tracking, demand forecasting, recommendation systems, or data-driven projects.

Free & Easy Access: Priced at USD $0.0, making it a great starting point for developers, data analysts or students.

Who Benefits?

Data analysts & researchers exploring e-commerce trends or product catalog data.

Developers & data scientists building price-comparison tools, recommendation engines or ML models.

E-commerce strategists/marketers need product metadata for competitive analysis or market research.

Students/hobbyists needing a free dataset for learning or demo projects.

Why Use This Dataset Instead of Manual Scraping?

Time-saving: No need to write scrapers or deal with rate limits.

Clean, structured data: All records are verified and already formatted in CSV, saving hours of cleaning.

Risk-free: Avoid Terms-of-Service issues or IP blocks that come with manual scraping.
Instant access: Free and immediately downloadable.
u
Behance Community Art Data
cseweb.ucsd.edu
json
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
UCSD CSE Research Project, Behance Community Art Data [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
Explore at:
jsonAvailable download formats
Dataset authored and provided by
UCSD CSE Research Project
Description
Likes and image data from the community art website Behance. This is a small, anonymized, version of a larger proprietary dataset.

Metadata includes

appreciates (likes)

timestamps

extracted image features

Basic Statistics:

Users: 63,497

Items: 178,788

Appreciates (likes): 1,000,000
Z
Dataset: A Systematic Literature Review on the topic of High-value datasets
data.niaid.nih.gov
Updated Jun 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anastasija Nikiforova; Nina Rizun; Magdalena Ciesielska; Charalampos Alexopoulos; Andrea Miletič (2023). Dataset: A Systematic Literature Review on the topic of High-value datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7944424
Explore at:
Dataset updated
Jun 23, 2023
Dataset provided by
Gdańsk University of Technology
University of Zagreb
University of the Aegean
University of Tartu
Authors
Anastasija Nikiforova; Nina Rizun; Magdalena Ciesielska; Charalampos Alexopoulos; Andrea Miletič
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains data collected during a study ("Towards High-Value Datasets determination for data-driven development: a systematic literature review") conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun, Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean) and Andrea Miletič (University of Zagreb) It being made public both to act as supplementary data for "Towards High-Value Datasets determination for data-driven development: a systematic literature review" paper (pre-print is available in Open Access here -> https://arxiv.org/abs/2305.10234) and in order for other researchers to use these data in their own work.

The protocol is intended for the Systematic Literature review on the topic of High-value Datasets with the aim to gather information on how the topic of High-value datasets (HVD) and their determination has been reflected in the literature over the years and what has been found by these studies to date, incl. the indicators used in them, involved stakeholders, data-related aspects, and frameworks. The data in this dataset were collected in the result of the SLR over Scopus, Web of Science, and Digital Government Research library (DGRL) in 2023.

Methodology

To understand how HVD determination has been reflected in the literature over the years and what has been found by these studies to date, all relevant literature covering this topic has been studied. To this end, the SLR was carried out to by searching digital libraries covered by Scopus, Web of Science (WoS), Digital Government Research library (DGRL).

These databases were queried for keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), which were applied to the article title, keywords, and abstract to limit the number of papers to those, where these objects were primary research objects rather than mentioned in the body, e.g., as a future work. After deduplication, 11 articles were found unique and were further checked for relevance. As a result, a total of 9 articles were further examined. Each study was independently examined by at least two authors.

To attain the objective of our study, we developed the protocol, where the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information.

Test procedure Each study was independently examined by at least two authors, where after the in-depth examination of the full-text of the article, the structured protocol has been filled for each study. The structure of the survey is available in the supplementary file available (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx) The data collected for each study by two researchers were then synthesized in one final version by the third researcher.

Description of the data in this data set

Protocol_HVD_SLR provides the structure of the protocol Spreadsheets #1 provides the filled protocol for relevant studies. Spreadsheet#2 provides the list of results after the search over three indexing databases, i.e. before filtering out irrelevant studies

The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information

Descriptive information
1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet 2) Complete reference - the complete source information to refer to the study 3) Year of publication - the year in which the study was published 4) Journal article / conference paper / book chapter - the type of the paper -{journal article, conference paper, book chapter} 5) DOI / Website- a link to the website where the study can be found 6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science 7) Availability in OA - availability of an article in the Open Access 8) Keywords - keywords of the paper as indicated by the authors 9) Relevance for this study - what is the relevance level of the article for this study? {high / medium / low}

Approach- and research design-related information 10) Objective / RQ - the research objective / aim, established research questions 11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analy-sis (country, organisation, specific unit that has been ana-lysed, e.g., the number of use-cases, scope of the SLR etc.) 12) Contributions - the contributions of the study 13) Method - whether the study uses a qualitative, quantitative, or mixed methods approach? 14) Availability of the underlying research data- whether there is a reference to the publicly available underly-ing research data e.g., transcriptions of interviews, collected data, or explanation why these data are not shared? 15) Period under investigation - period (or moment) in which the study was conducted 16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is theory used in the study?

Quality- and relevance- related information
17) Quality concerns - whether there are any quality concerns (e.g., limited infor-mation about the research methods used)? 18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused around the HVD determination, sec-ondary - mentioned but not studied (e.g., as part of discus-sion, future work etc.))

HVD determination-related information
19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term? 20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, “input -> output") 21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the rela-tionships between these components? (detailed description) 22) Stakeholders and their roles - what stakeholders or actors does HVD determination in-volve? What are their roles? 23) Data - what data do HVD cover? 24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)

Format of the file .xls, .csv (for the first spreadsheet only), .odt, .docx

Licenses or restrictions CC-BY

For more info, see README.txt
u
Steam Video Game and Bundle Data
cseweb.ucsd.edu
json
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
UCSD CSE Research Project, Steam Video Game and Bundle Data [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
Explore at:
jsonAvailable download formats
Dataset authored and provided by
UCSD CSE Research Project
Description
These datasets contain reviews from the Steam video game platform, and information about which games were bundled together.

Metadata includes

reviews

purchases, plays, recommends (likes)

product bundles

pricing information

Basic Statistics:

Reviews: 7,793,069

Users: 2,567,538

Items: 15,474

Bundles: 615
Top 1000 Kaggle Datasets
kaggle.com
zip
Updated Jan 3, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Trrishan (2022). Top 1000 Kaggle Datasets [Dataset]. https://www.kaggle.com/datasets/notkrishna/top-1000-kaggle-datasets
Explore at:
zip(34269 bytes)Available download formats
Dataset updated
Jan 3, 2022
Authors
Trrishan
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
From wiki

Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

Kaggle got its start in 2010 by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and Artificial Intelligence education. Its key personnel were Anthony Goldbloom and Jeremy Howard. Nicholas Gruen was founding chair succeeded by Max Levchin. Equity was raised in 2011 valuing the company at $25 million. On 8 March 2017, Google announced that they were acquiring Kaggle.[1][2]

Source: Kaggle
u
Google Restaurants dataset
cseweb.ucsd.edu
csv
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
UCSD CSE Research Project, Google Restaurants dataset [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
Explore at:
csvAvailable download formats
Dataset authored and provided by
UCSD CSE Research Project
Description
This is a mutli-modal dataset for restaurants from Google Local (Google Maps). Data includes images and reviews posted by users, as well as metadata for each restaurant.
Multivariate Time Series Search - Dataset - NASA Open Data Portal
data.nasa.gov
Updated Mar 31, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
nasa.gov (2025). Multivariate Time Series Search - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/multivariate-time-series-search
Explore at:
Dataset updated
Mar 31, 2025
Dataset provided by
NASAhttp://nasa.gov/
Description
Multivariate Time-Series (MTS) are ubiquitous, and are generated in areas as disparate as sensor recordings in aerospace systems, music and video streams, medical monitoring, and financial systems. Domain experts are often interested in searching for interesting multivariate patterns from these MTS databases which can contain up to several gigabytes of data. Surprisingly, research on MTS search is very limited. Most existing work only supports queries with the same length of data, or queries on a fixed set of variables. In this paper, we propose an efficient and flexible subsequence search framework for massive MTS databases, that, for the first time, enables querying on any subset of variables with arbitrary time delays between them. We propose two provably correct algorithms to solve this problem — (1) an R-tree Based Search (RBS) which uses Minimum Bounding Rectangles (MBR) to organize the subsequences, and (2) a List Based Search (LBS) algorithm which uses sorted lists for indexing. We demonstrate the performance of these algorithms using two large MTS databases from the aviation domain, each containing several millions of observations. Both these tests show that our algorithms have very high prune rates (>95%) thus needing actual disk access for only less than 5% of the observations. To the best of our knowledge, this is the first flexible MTS search algorithm capable of subsequence search on any subset of variables. Moreover, MTS subsequence search has never been attempted on datasets of the size we have used in this paper.
G
Open Data Portal Catalogue
open.canada.ca
datasets.ai
+1more
csv, json, jsonl, png +2
Updated Jan 1, 2026
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Treasury Board of Canada Secretariat (2026). Open Data Portal Catalogue [Dataset]. https://open.canada.ca/data/en/dataset/c4c5c7f1-bfa6-4ff6-b4a0-c164cb2060f7
Explore at:
csv, sqlite, json, png, jsonl, xlsxAvailable download formats
Dataset updated
Jan 1, 2026
Dataset provided by
Treasury Board of Canada Secretariat
License
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Description
The open data portal catalogue is a downloadable dataset containing some key metadata for the general datasets available on the Government of Canada's Open Data portal. Resource 1 is generated using the ckanapi tool (external link) Resources 2 - 8 are generated using the Flatterer (external link) utility. ###Description of resources: 1. Dataset is a JSON Lines (external link) file where the metadata of each Dataset/Open Information Record is one line of JSON. The file is compressed with GZip. The file is heavily nested and recommended for users familiar with working with nested JSON. 2. Catalogue is a XLSX workbook where the nested metadata of each Dataset/Open Information Record is flattened into worksheets for each type of metadata. 3. datasets metadata contains metadata at the dataset level. This is also referred to as the package in some CKAN documentation. This is the main table/worksheet in the SQLite database and XLSX output. 4. Resources Metadata contains the metadata for the resources contained within each dataset. 5. resource views metadata contains the metadata for the views applied to each resource, if a resource has a view configured. 6. datastore fields metadata contains the DataStore information for CSV datasets that have been loaded into the DataStore. This information is displayed in the Data Dictionary for DataStore enabled CSVs. 7. Data Package Fields contains a description of the fields available in each of the tables within the Catalogue, as well as the count of the number of records each table contains. 8. data package entity relation diagram Displays the title and format for column, in each table in the Data Package in the form of a ERD Diagram. The Data Package resource offers a text based version. 9. SQLite Database is a .db database, similar in structure to Catalogue. This can be queried with database or analytical software tools for doing analysis.
Z
Dataset of knee joint contact force peaks and corresponding subject...
data.niaid.nih.gov
explore.openaire.eu
+2more
Updated Oct 9, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lavikainen, Jere Joonatan; Stenroth, Lauri (2023). Dataset of knee joint contact force peaks and corresponding subject characteristics from 4 open datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7253457
Explore at:
Dataset updated
Oct 9, 2023
Dataset provided by
University of Eastern Finland
Authors
Lavikainen, Jere Joonatan; Stenroth, Lauri
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains data from overground walking trials of 166 subjects with several trials per subject (approximately 2900 trials total).

DATA ORIGINS & LICENSE INFORMATION

The data comes from four existing open datasets collected by others:

Schreiber & Moissenet, A multimodal dataset of human gait at different walking speeds established on injury-free adult participants

article: https://www.nature.com/articles/s41597-019-0124-4

dataset: https://figshare.com/articles/dataset/A_multimodal_dataset_of_human_gait_at_different_walking_speeds/7734767

Fukuchi et al., A public dataset of overground and treadmill walking kinematics and kinetics in healthy individuals

article: https://peerj.com/articles/4640/

dataset: https://figshare.com/articles/dataset/A_public_data_set_of_overground_and_treadmill_walking_kinematics_and_kinetics_of_healthy_individuals/5722711

Horst et al., A public dataset of overground walking kinetics and full-body kinematics in healthy adult individuals

article: https://www.nature.com/articles/s41598-019-38748-8

dataset: https://data.mendeley.com/datasets/svx74xcrjr/3

Camargo et al., A comprehensive, open-source dataset of lower limb biomechanics in multiple conditions of stairs, ramps, and level-ground ambulation and transitions

article: https://www.sciencedirect.com/science/article/pii/S0021929021001007

dataset (3 links): https://data.mendeley.com/datasets/fcgm3chfff/1 https://data.mendeley.com/datasets/k9kvm5tn3f/1 https://data.mendeley.com/datasets/jj3r5f9pnf/1

In this dataset, those datasets are referred to as the Schreiber, Fukuchi, Horst, and Camargo datasets, respectively. The Schreiber, Fukuchi, Horst, and Camargo datasets are licensed under the CC BY 4.0 license (https://creativecommons.org/licenses/by/4.0/).

We have modified the datasets by analyzing the data with musculoskeletal simulations & analysis software (OpenSim). In this dataset, we publish modified data as well as some of the original data.

STRUCTURE OF THE DATASET The dataset contains two kinds of text files: those starting with "predictors_" and those starting with "response_".

Predictors comprise 12 text files, each describing the input (predictor) variables we used to train artifical neural networks to predict knee joint loading peaks. Responses similarly comprise 12 text files, each describing the response (outcome) variables that we trained and evaluated the network on. The file names are of the form "predictors_X" for predictors and "response_X" for responses, where X describes which response (outcome) variable is predicted with them. X can be: - loading_response_both: the maximum of the first peak of stance for the sum of the loading of the medial and lateral compartments - loading_response_lateral: the maximum of the first peak of stance for the loading of the lateral compartment - loading_response_medial: the maximum of the first peak of stance for the loading of the medial compartment - terminal_extension_both: the maximum of the second peak of stance for the sum of the loading of the medial and lateral compartments - terminal_extension_lateral: the maximum of the second peak of stance for the loading of the lateral compartment - terminal_extension_medial: the maximum of the second peak of stance for the loading of the medial compartment - max_peak_both: the maximum of the entire stance phase for the sum of the loading of the medial and lateral compartments - max_peak_lateral: the maximum of the entire stance phase for the loading of the lateral compartment - max_peak_medial: the maximum of the entire stance phase for the loading of the medial compartment - MFR_common: the medial force ratio for the entire stance phase - MFR_LR: the medial force ratio for the first peak of stance - MFR_TE: the medial force ratio for the second peak of stance

The predictor text files are organized as comma-separated values. Each row corresponds to one walking trial. A single subject typically has several trials. The column labels are DATASET_INDEX,SUBJECT_INDEX,KNEE_ADDUCTION,MASS,HEIGHT,BMI,WALKING_SPEED,HEEL_STRIKE_VELOCITY,AGE,GENDER.

DATASET_INDEX describes which original dataset the trial is from, where {1=Schreiber, 2=Fukuchi, 3=Horst, 4=Camargo}

SUBJECT_INDEX is the index of the subject in the original dataset. If you use this column, you will have to rewrite these to avoid duplicates (e.g., several datasets probably have subject "3").

KNEE_ADDUCTION is the knee adduction-abduction angle (positive for adduction, negative for abduction) of the subject in static pose, estimated from motion capture markers.

MASS is the mass of the subject in kilograms

HEIGHT is the height of the subject in millimeters

BMI is the body mass index of the subject

WALKING_SPEED is the mean walking speed of the subject during the trial

HEEL_STRIKE_VELOCITY is the mean of the velocities of the subject's pelvis markers at the instant of heel strike

AGE is the age of the subject in years

GENDER is an integer/boolean where {1=male, 0=female}

The response text files contain one floating-point value per row, describing the knee joint contact force peak for the trial in newtons (or the medial force ratio). Each row corresponds to one walking trial. The rows in predictor and response text files match each other (e.g., row 7 describes the same trial in both predictors_max_peak_medial.txt and response_max_peak_medial.txt).

See our journal article "Prediction of Knee Joint Compartmental Loading Maxima Utilizing Simple Subject Characteristics and Neural Networks" (https://doi.org/10.1007/s10439-023-03278-y) for more information.

Questions & other contacts: jere.lavikainen@uef.fi
Healthcare Dataset
kaggle.com
zip
Updated May 8, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Prasad Patil (2024). Healthcare Dataset [Dataset]. https://www.kaggle.com/datasets/prasad22/healthcare-dataset
Explore at:
zip(3054550 bytes)Available download formats
Dataset updated
May 8, 2024
Authors
Prasad Patil
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context:

This synthetic healthcare dataset has been created to serve as a valuable resource for data science, machine learning, and data analysis enthusiasts. It is designed to mimic real-world healthcare data, enabling users to practice, develop, and showcase their data manipulation and analysis skills in the context of the healthcare industry.

Inspiration:

The inspiration behind this dataset is rooted in the need for practical and diverse healthcare data for educational and research purposes. Healthcare data is often sensitive and subject to privacy regulations, making it challenging to access for learning and experimentation. To address this gap, I have leveraged Python's Faker library to generate a dataset that mirrors the structure and attributes commonly found in healthcare records. By providing this synthetic data, I hope to foster innovation, learning, and knowledge sharing in the healthcare analytics domain.

Dataset Information:

Each column provides specific information about the patient, their admission, and the healthcare services provided, making this dataset suitable for various data analysis and modeling tasks in the healthcare domain. Here's a brief explanation of each column in the dataset - - Name: This column represents the name of the patient associated with the healthcare record. - Age: The age of the patient at the time of admission, expressed in years. - Gender: Indicates the gender of the patient, either "Male" or "Female." - Blood Type: The patient's blood type, which can be one of the common blood types (e.g., "A+", "O-", etc.). - Medical Condition: This column specifies the primary medical condition or diagnosis associated with the patient, such as "Diabetes," "Hypertension," "Asthma," and more. - Date of Admission: The date on which the patient was admitted to the healthcare facility. - Doctor: The name of the doctor responsible for the patient's care during their admission. - Hospital: Identifies the healthcare facility or hospital where the patient was admitted. - Insurance Provider: This column indicates the patient's insurance provider, which can be one of several options, including "Aetna," "Blue Cross," "Cigna," "UnitedHealthcare," and "Medicare." - Billing Amount: The amount of money billed for the patient's healthcare services during their admission. This is expressed as a floating-point number. - Room Number: The room number where the patient was accommodated during their admission. - Admission Type: Specifies the type of admission, which can be "Emergency," "Elective," or "Urgent," reflecting the circumstances of the admission. - Discharge Date: The date on which the patient was discharged from the healthcare facility, based on the admission date and a random number of days within a realistic range. - Medication: Identifies a medication prescribed or administered to the patient during their admission. Examples include "Aspirin," "Ibuprofen," "Penicillin," "Paracetamol," and "Lipitor." - Test Results: Describes the results of a medical test conducted during the patient's admission. Possible values include "Normal," "Abnormal," or "Inconclusive," indicating the outcome of the test.

Usage Scenarios:

This dataset can be utilized for a wide range of purposes, including: - Developing and testing healthcare predictive models. - Practicing data cleaning, transformation, and analysis techniques. - Creating data visualizations to gain insights into healthcare trends. - Learning and teaching data science and machine learning concepts in a healthcare context. - You can treat it as a Multi-Class Classification Problem and solve it for Test Results which contains 3 categories(Normal, Abnormal, and Inconclusive).

Acknowledgments:

I acknowledge the importance of healthcare data privacy and security and emphasize that this dataset is entirely synthetic. It does not contain any real patient information or violate any privacy regulations.

I hope that this dataset contributes to the advancement of data science and healthcare analytics and inspires new ideas. Feel free to explore, analyze, and share your findings with the Kaggle community.

Image Credit:

Image by BC Y from Pixabay
c
Booking hotel reviews large dataset
crawlfeeds.com
csv, zip
Updated Oct 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Crawl Feeds (2025). Booking hotel reviews large dataset [Dataset]. https://crawlfeeds.com/datasets/booking-hotel-reviews-large-dataset
Explore at:
zip, csvAvailable download formats
Dataset updated
Oct 6, 2025
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy
Description
Explore our extensive Booking Hotel Reviews Large Dataset, featuring over 20.8 million records of detailed customer feedback from hotels worldwide. Whether you're conducting sentiment analysis, market research, or competitive benchmarking, this dataset provides invaluable insights into customer experiences and preferences.

The dataset includes crucial information such as reviews, ratings, comments, and more, all sourced from travellers who booked through Booking.com. It's an ideal resource for businesses aiming to understand guest sentiments, improve service quality, or refine marketing strategies within the hospitality sector.

With this hotel reviews dataset, you can dive deep into trends and patterns that reveal what customers truly value during their stays. Whether you're analyzing reviews for sentiment analysis or studying traveller feedback from specific regions, this dataset delivers the insights you need.

Ready to get started? Download the complete hotel review dataset or connect with the Crawl Feeds team to request records tailored to specific countries or regions. Unlock the power of data and take your hospitality analysis to the next level!

Access 3 million+ US hotel reviews — submit your request today.
LinkedIn Datasets
brightdata.com
.json, .csv, .xlsx
Updated Dec 17, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bright Data (2021). LinkedIn Datasets [Dataset]. https://brightdata.com/products/datasets/linkedin
Explore at:
.json, .csv, .xlsxAvailable download formats
Dataset updated
Dec 17, 2021
Dataset authored and provided by
Bright Datahttps://brightdata.com/
License
https://brightdata.com/licensehttps://brightdata.com/license
Area covered
Worldwide
Description
Unlock the full potential of LinkedIn data with our extensive dataset that combines profiles, company information, and job listings into one powerful resource for business decision-making, strategic hiring, competitive analysis, and market trend insights. This all-encompassing dataset is ideal for professionals, recruiters, analysts, and marketers aiming to enhance their strategies and operations across various business functions. Dataset Features

Profiles: Dive into detailed public profiles featuring names, titles, positions, experience, education, skills, and more. Utilize this data for talent sourcing, lead generation, and investment signaling, with a refresh rate ensuring up to 30 million records per month. Companies: Access comprehensive company data including ID, country, industry, size, number of followers, website details, subsidiaries, and posts. Tailored subsets by industry or region provide invaluable insights for CRM enrichment, competitive intelligence, and understanding the startup ecosystem, updated monthly with up to 40 million records. Job Listings: Explore current job opportunities detailed with job titles, company names, locations, and employment specifics such as seniority levels and employment functions. This dataset includes direct application links and real-time application numbers, serving as a crucial tool for job seekers and analysts looking to understand industry trends and the job market dynamics.

Customizable Subsets for Specific Needs Our LinkedIn dataset offers the flexibility to tailor the dataset according to your specific business requirements. Whether you need comprehensive insights across all data points or are focused on specific segments like job listings, company profiles, or individual professional details, we can customize the dataset to match your needs. This modular approach ensures that you get only the data that is most relevant to your objectives, maximizing efficiency and relevance in your strategic applications. Popular Use Cases

Strategic Hiring and Recruiting: Track talent movement, identify growth opportunities, and enhance your recruiting efforts with targeted data. Market Analysis and Competitive Intelligence: Gain a competitive edge by analyzing company growth, industry trends, and strategic opportunities. Lead Generation and CRM Enrichment: Enrich your database with up-to-date company and professional data for targeted marketing and sales strategies. Job Market Insights and Trends: Leverage detailed job listings for a nuanced understanding of employment trends and opportunities, facilitating effective job matching and market analysis. AI-Driven Predictive Analytics: Utilize AI algorithms to analyze large datasets for predicting industry shifts, optimizing business operations, and enhancing decision-making processes based on actionable data insights.

Whether you are mapping out competitive landscapes, sourcing new talent, or analyzing job market trends, our LinkedIn dataset provides the tools you need to succeed. Customize your access to fit specific needs, ensuring that you have the most relevant and timely data at your fingertips.
YouTube Datasets
brightdata.com
.json, .csv, .xlsx
Updated Jan 9, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bright Data (2023). YouTube Datasets [Dataset]. https://brightdata.com/products/datasets/youtube
Explore at:
.json, .csv, .xlsxAvailable download formats
Dataset updated
Jan 9, 2023
Dataset authored and provided by
Bright Datahttps://brightdata.com/
License
https://brightdata.com/licensehttps://brightdata.com/license
Area covered
YouTube, Worldwide
Description
Use our YouTube profiles dataset to extract both business and non-business information from public channels and filter by channel name, views, creation date, or subscribers. Datapoints include URL, handle, banner image, profile image, name, subscribers, description, video count, create date, views, details, and more. You may purchase the entire dataset or a customized subset, depending on your needs. Popular use cases for this dataset include sentiment analysis, brand monitoring, influencer marketing, and more.
d
Data from: Business Owners
catalog.data.gov
data.cityofchicago.org
+3more
Updated Feb 8, 2026
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
data.cityofchicago.org (2026). Business Owners [Dataset]. https://catalog.data.gov/dataset/business-owners
Explore at:
Dataset updated
Feb 8, 2026
Dataset provided by
data.cityofchicago.org
Description
This dataset contains the owner information for all the accounts listed in the Business License Dataset, and is sorted by Account Number. To identify the owner of a business, you will need the account number or legal name, which may be obtained from theBusiness Licenses dataset: https://data.cityofchicago.org/dataset/Business-Licenses/r5kz-chrr. Data Owner: Business Affairs & Consumer Protection. Time Period: 2002 to present. Frequency: Data is updated daily.
b
Google Maps Dataset
brightdata.com
.json, .csv, .xlsx
Updated Jan 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bright Data (2023). Google Maps Dataset [Dataset]. https://brightdata.com/products/datasets/google-maps
Explore at:
.json, .csv, .xlsxAvailable download formats
Dataset updated
Jan 8, 2023
Dataset authored and provided by
Bright Data
License
https://brightdata.com/licensehttps://brightdata.com/license
Area covered
Worldwide
Description
The Google Maps dataset is ideal for getting extensive information on businesses anywhere in the world. Easily filter by location, business type, and other factors to get the exact data you need. The Google Maps dataset includes all major data points: timestamp, name, category, address, description, open website, phone number, open_hours, open_hours_updated, reviews_count, rating, main_image, reviews, url, lat, lon, place_id, country, and more.
Fake Dataset for Practice
kaggle.com
zip
Updated Aug 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shuvo Kumar Basak-4004 (2023). Fake Dataset for Practice [Dataset]. https://www.kaggle.com/datasets/shuvokumarbasak4004/fake-dataset-for-practice
Explore at:
zip(1515599 bytes)Available download formats
Dataset updated
Aug 21, 2023
Authors
Shuvo Kumar Basak-4004
Description
Description: This dataset is created solely for the purpose of practice and learning. It contains entirely fake and fabricated information, including names, phone numbers, emails, cities, ages, and other attributes. None of the information in this dataset corresponds to real individuals or entities. It serves as a resource for those who are learning data manipulation, analysis, and machine learning techniques. Please note that the data is completely fictional and should not be treated as representing any real-world scenarios or individuals.

Attributes: - phone_number: Fake phone numbers in various formats. - name: Fictitious names generated for practice purposes. - email: Imaginary email addresses created for the dataset. - city: Made-up city names to simulate geographical diversity. - age: Randomly generated ages for practice analysis. - sex: Simulated gender values (Male, Female). - married_status: Synthetic marital status information. - job: Fictional job titles for practicing data analysis. - income: Fake income values for learning data manipulation. - religion: Pretend religious affiliations for practice. - nationality: Simulated nationalities for practice purposes.

Please be aware that this dataset is not based on real data and should be used exclusively for educational purposes.
Immigration system statistics data tables
gov.uk
Updated Nov 27, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Home Office (2025). Immigration system statistics data tables [Dataset]. https://www.gov.uk/government/statistical-data-sets/immigration-system-statistics-data-tables
Explore at:
Dataset updated
Nov 27, 2025
Dataset provided by
GOV.UKhttp://gov.uk/
Authors
Home Office
Description
List of the data tables as part of the Immigration system statistics Home Office release. Summary and detailed data tables covering the immigration system, including out-of-country and in-country visas, asylum, detention, and returns.

If you have any feedback, please email MigrationStatsEnquiries@homeoffice.gov.uk.

Accessible file formats

The Microsoft Excel .xlsx files may not be suitable for users of assistive technology.
If you use assistive technology (such as a screen reader) and need a version of these documents in a more accessible format, please email MigrationStatsEnquiries@homeoffice.gov.uk
Please tell us what format you need. It will help us if you say what assistive technology you use.

Related content

Immigration system statistics, year ending September 2025
Immigration system statistics quarterly release
Immigration system statistics user guide
Publishing detailed data tables in migration statistics
Policy and legislative changes affecting migration to the UK: timeline
Immigration statistics data archives

Passenger arrivals

https://assets.publishing.service.gov.uk/media/691afc82e39a085bda43edd8/passenger-arrivals-summary-sep-2025-tables.ods">Passenger arrivals summary tables, year ending September 2025 (ODS, 31.5 KB)

‘Passengers refused entry at the border summary tables’ and ‘Passengers refused entry at the border detailed datasets’ have been discontinued. The latest published versions of these tables are from February 2025 and are available in the ‘Passenger refusals – release discontinued’ section. A similar data series, ‘Refused entry at port and subsequently departed’, is available within the Returns detailed and summary tables.

Electronic travel authorisation

https://assets.publishing.service.gov.uk/media/691b03595a253e2c40d705b9/electronic-travel-authorisation-datasets-sep-2025.xlsx">Electronic travel authorisation detailed datasets, year ending September 2025 (MS Excel Spreadsheet, 58.6 KB)
ETA_D01: Applications for electronic travel authorisations, by nationality ETA_D02: Outcomes of applications for electronic travel authorisations, by nationality

Entry clearance visas granted outside the UK

https://assets.publishing.service.gov.uk/media/6924812a367485ea116a56bd/visas-summary-sep-2025-tables.ods">Entry clearance visas summary tables, year ending September 2025 (ODS, 53.3 KB)

https://assets.publishing.service.gov.uk/media/691aebbf5a253e2c40d70598/entry-clearance-visa-outcomes-datasets-sep-2025.xlsx">Entry clearance visa applications and outcomes detailed datasets, year ending September 2025 (MS Excel Spreadsheet, 30.2 MB)
Vis_D01: Entry clearance visa applications, by nationality and visa type
Vis_D02: Outcomes of entry clearance visa applications, by nationality, visa type, and outcome

Additional data relating to in country and overse
h
ML-ArXiv-Papers
huggingface.co
opendatalab.com
Updated Jun 29, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Connor Shorten (2022). ML-ArXiv-Papers [Dataset]. https://huggingface.co/datasets/CShorten/ML-ArXiv-Papers
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 29, 2022
Authors
Connor Shorten
License
https://choosealicense.com/licenses/afl-3.0/https://choosealicense.com/licenses/afl-3.0/
Description
This dataset contains the subset of ArXiv papers with the "cs.LG" tag to indicate the paper is about Machine Learning. The core dataset is filtered from the full ArXiv dataset hosted on Kaggle: https://www.kaggle.com/datasets/Cornell-University/arxiv. The original dataset contains roughly 2 million papers. This dataset contains roughly 100,000 papers following the category filtering. The dataset is maintained by with requests to the ArXiv API. The current iteration of the dataset only contains… See the full description on the dataset page: https://huggingface.co/datasets/CShorten/ML-ArXiv-Papers.

Facebook

Twitter

Click to copy link

Link copied

Cite

NECHBA MOHAMMED (2023). Research Papers Dataset [Dataset]. https://www.kaggle.com/datasets/nechbamohammed/research-papers-dataset

Research Papers Dataset

Comprehensive Research Papers Dataset- A Goldmine of Scholarly Knowledge

Explore at:

zip(619131172 bytes)Available download formats

Dataset updated

May 8, 2023

Authors

NECHBA MOHAMMED

Description

Description: This dataset (Version 10) contains a collection of research papers along with various attributes and metadata. It is a comprehensive and diverse dataset that can be used for a wide range of research and analysis tasks. The dataset encompasses papers from different fields of study, including computer science, mathematics, physics, and more.

Fields in the Dataset: - id: A unique identifier for each paper. - title: The title of the research paper. - authors: The list of authors involved in the paper. - venue: The journal or venue where the paper was published. - year: The year when the paper was published. - n_citation: The number of citations received by the paper. - references: A list of paper IDs that are cited by the current paper. - abstract: The abstract of the paper.

Example: - "id": "013ea675-bb58-42f8-a423-f5534546b2b1", - "title": "Prediction of consensus binding mode geometries for related chemical series of positive allosteric modulators of adenosine and muscarinic acetylcholine receptors", - "authors": ["Leon A. Sakkal", "Kyle Z. Rajkowski", "Roger S. Armen"], - "venue": "Journal of Computational Chemistry", - "year": 2017, - "n_citation": 0, - "references": ["4f4f200c-0764-4fef-9718-b8bccf303dba", "aa699fbf-fabe-40e4-bd68-46eaf333f7b1"], - "abstract": "This paper studies ..."

Cite: https://www.aminer.cn/citation

Clear search

Close search

Google apps

Main menu

Research Papers Dataset

Dirty Dataset to practice Data Cleaning

Walmart Products Dataset – Free Product Data CSV

Key Features

Who Benefits?

Why Use This Dataset Instead of Manual Scraping?

Behance Community Art Data

Dataset: A Systematic Literature Review on the topic of High-value datasets

Steam Video Game and Bundle Data

Top 1000 Kaggle Datasets

From wiki

Google Restaurants dataset

Multivariate Time Series Search - Dataset - NASA Open Data Portal

Open Data Portal Catalogue

Dataset of knee joint contact force peaks and corresponding subject...

Healthcare Dataset

`Context:`

`Inspiration:`

`Dataset Information:`

`Usage Scenarios:`

`Acknowledgments:`

`Image Credit:`

Booking hotel reviews large dataset

LinkedIn Datasets

YouTube Datasets

Data from: Business Owners

Google Maps Dataset

Fake Dataset for Practice

Immigration system statistics data tables

Accessible file formats

Related content

Passenger arrivals

Electronic travel authorisation

Entry clearance visas granted outside the UK

ML-ArXiv-Papers

Research Papers Dataset

Comprehensive Research Papers Dataset- A Goldmine of Scholarly Knowledge