Terms: https://crawlfeeds.com/privacy_policy
Explore our extensive Booking Hotel Reviews Large Dataset, featuring over 20.8 million records of detailed customer feedback from hotels worldwide. Whether you're conducting sentiment analysis, market research, or competitive benchmarking, this dataset provides invaluable insights into customer experiences and preferences.
The dataset includes crucial information such as reviews, ratings, comments, and more, all sourced from travellers who booked through Booking.com. It's an ideal resource for businesses aiming to understand guest sentiments, improve service quality, or refine marketing strategies within the hospitality sector.
With this hotel reviews dataset, you can dive deep into trends and patterns that reveal what customers truly value during their stays. Whether you're analyzing reviews for sentiment analysis or studying traveller feedback from specific regions, this dataset delivers the insights you need.
Ready to get started? Download the complete hotel review dataset or connect with the Crawl Feeds team to request records tailored to specific countries or regions. Unlock the power of data and take your hospitality analysis to the next level!
Access 3 million+ US hotel reviews — submit your request today.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
⚠️ Note! This is the MIDI-only archive. If you need the WAV alternatives for your work, please download the full dataset from their website: https://magenta.tensorflow.org/datasets/e-gmd
Quoted from the original website:
The Expanded Groove MIDI Dataset (E-GMD) is a large dataset of human drum performances, with audio recordings annotated in MIDI. E-GMD contains 444 hours of audio from 43 drum kits and is an order of magnitude larger than similar datasets. It is also the first human-performed drum transcription dataset with annotations of velocity. It is based on our previously released Groove MIDI Dataset.
This dataset is an expansion of the Groove MIDI Dataset (GMD). GMD is a dataset of human drum performances recorded in MIDI format on a Roland TD-11 electronic drum kit. To make the dataset applicable to ADT, we expanded it by re-recording the GMD sequences on 43 drumkits using a Roland TD-17. The kits range from electronic (e.g., 808, 909) to acoustic sounds. Recording was done at 44.1kHz and 24 bits and aligned within 2ms of the original MIDI files.
We maintained the same train, test and validation splits across sequences that GMD had. Because each kit was recorded for every sequence, we see all 43 kits in the train, test and validation splits.
| Split | Unique Sequences | Total Sequences | Duration (hours) |
|---|---|---|---|
| Train | 819 | 35,217 | 341.4 |
| Test | 123 | 5,289 | 50.9 |
| Validation | 117 | 5,031 | 52.2 |
| Total | 1,059 | 45,537 | 444.5 |
Given the semi-manual nature of the pipeline, there were some errors in the recording process that resulted in unusable tracks. If your application requires only symbolic drum data, we recommend using the original data from the Groove MIDI Dataset.
For more information about how the dataset was created and several applications of it, please see the paper where it was introduced: Improving Perceptual Quality of Drum Transcription with the Expanded Groove MIDI Dataset.
Lee Callender, Curtis Hawthorne, and Jesse Engel. "Improving Perceptual Quality of Drum Transcription with the Expanded Groove MIDI Dataset." 2020. arXiv:2004.00188.
For citations, please use:
@misc{callender2020improving,
title={Improving Perceptual Quality of Drum Transcription with the Expanded Groove MIDI Dataset},
author={Lee Callender and Curtis Hawthorne and Jesse Engel},
year={2020},
eprint={2004.00188},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
I have no contribution to or affiliation with this work; I just uploaded it and made it available on Kaggle.
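If you just need to iterate over the MIDI files by split, a minimal loading sketch is below; it assumes the archive ships a metadata CSV with split and midi_filename columns (the file and column names are assumptions, so adjust them to the actual archive layout):

```python
# Minimal sketch: select training-split MIDI files and count drum notes.
# The metadata CSV name and its "split"/"midi_filename" columns are assumptions.
import pandas as pd
import pretty_midi

meta = pd.read_csv("e-gmd-v1.0.0.csv")              # hypothetical file name
train = meta[meta["split"] == "train"]

for midi_path in train["midi_filename"].head(5):
    pm = pretty_midi.PrettyMIDI(midi_path)
    drum_notes = sum(len(inst.notes) for inst in pm.instruments if inst.is_drum)
    print(midi_path, f"{pm.get_end_time():.1f} s", drum_notes, "drum notes")
```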
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Collection of signals from high explosives recorded on a smartphone sensor network, available as a pandas DataFrame. The dataset is accompanied by two machine learning models (LFM and D-YAMNet) that were trained for explosion detection using SHAReD and the ESC-50 dataset. There are 326 sets of signals from 70 high-explosive events. The sensors included are the following: microphone, accelerometer, barometer, and Global Navigation Satellite Systems (GNSS).
*The extended dataset only includes microphone data and information about the explosion.
*D-YAMNet takes 0.96 seconds of audio at a 16 kHz sample rate with an input shape of (15360,).
*LFM takes 0.96 seconds of audio at an 800 Hz sample rate with an input shape of (1, 768).
For ease of use of the machine learning models (LFM and D-YAMNet), the dataset (SHAReD + ESC-50) used for training and testing is included, along with a simple Python script to produce the ensemble model's confusion matrix seen in the publication.
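As a rough illustration of the stated input shapes (not the authors' preprocessing pipeline), the sketch below slices a microphone signal into 0.96 s windows for the two models; the placeholder signal and its original sample rate are assumptions:

```python
# Sketch: cut a 1-D microphone signal into 0.96 s windows matching the stated
# model inputs: D-YAMNet expects (15360,) at 16 kHz, LFM expects (1, 768) at 800 Hz.
import numpy as np
from scipy.signal import resample_poly

fs_orig = 8000                                               # assumed recording rate
audio = np.random.randn(fs_orig * 10).astype(np.float32)     # placeholder signal

def windows(x, fs, win_seconds=0.96):
    n = int(round(fs * win_seconds))
    n_win = len(x) // n
    return x[: n_win * n].reshape(n_win, n)

# D-YAMNet input: 0.96 s at 16 kHz -> 15360 samples per window
audio_16k = resample_poly(audio, 16000, fs_orig)
dyamnet_in = windows(audio_16k, 16000)                       # (n_windows, 15360)

# LFM input: 0.96 s at 800 Hz -> 768 samples, reshaped to (1, 768)
audio_800 = resample_poly(audio, 800, fs_orig)
lfm_in = windows(audio_800, 800)[:, np.newaxis, :]           # (n_windows, 1, 768)

print(dyamnet_in.shape, lfm_in.shape)
```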
License: Open Data Commons Attribution License (ODC-By) v1.0, https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
We provide an open access dataset of High densitY Surface Electromyogram (HD-sEMG) Recordings (named "Hyser"). We acquired data from 20 subjects with each subject participating in our experiment twice on separate days following the same experiment paradigm. Our Hyser dataset contains five sub-datasets: (1) pattern recognition (PR) dataset acquired during 34 hand gestures, (2) maximal voluntary muscle contraction (MVC) dataset while subjects contracted each individual finger, (3) one-degree of freedom (DoF) dataset acquired during force-varying contraction of each individual finger, (4) N-DoF dataset acquired during prescribed contractions of combinations of multiple fingers, and (5) random task dataset acquired during random contraction of combinations of fingers without any prescribed force trajectory. Sub-dataset 1 can be used for gesture recognition studies. Sub-datasets 2-5 also recorded individual finger forces, thus can be used for studies on proportional control of neuroprostheses.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
EEG: silent and perceived speech on 30 Spanish sentences (Large Spanish Speech EEG dataset)
Authors
Resources:
Abstract: Decoding speech from brain activity can enable communication for individuals with speech disorders. Deep neural networks have shown great potential for speech decoding applications, but the large data sets required for these models are usually not available for neural recordings of speech impaired subjects. Harnessing data from other participants would thus be ideal to create speech neuroprostheses without the need of patient-specific training data. In this study, we recorded 60 sessions from 56 healthy participants using 64 EEG channels and developed a neural network capable of subject-independent classification of perceived sentences. We found that sentence identity can be decoded from subjects without prior training achieving higher accuracy than mixed-subject models. The development of subject-independent models eliminates the need to collect data from a target subject, reducing time and data collection costs during deployment. These results open new avenues for creating speech neuroprostheses when subjects cannot provide training data.
Please contact us at this e-mail address if you have any questions: cgvalle@uc.cl
License: Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The anti-spoofing dataset includes live-recorded anti-spoofing videos from around the world, captured via high-quality webcams at Full HD resolution and above. The videos were gathered by capturing the faces of genuine individuals as well as spoof facial presentations. The dataset supports approaches that learn to detect spoofing techniques by extracting features from genuine facial images, so that such information cannot be captured and reused by fraudulent users.
The dataset contains images and videos of real humans with various views, and colors, making it a comprehensive resource for researchers working on anti-spoofing technologies.
The dataset provides data for combining and applying different techniques, approaches, and models to the challenging task of distinguishing between genuine and spoofed inputs, enabling effective anti-spoofing solutions in active authentication systems. Such solutions are crucial as newer devices, such as phones, have become vulnerable to spoofing attacks due to the availability of technologies that can create replays, reflections, and depth effects.
Our dataset also explores the use of neural architectures, such as deep neural networks, to facilitate the identification of distinguishing patterns and textures in different regions of the face, increasing the accuracy and generalizability of the anti-spoofing models.
The collection covers video resolutions from Full HD (1080p) up to 4K (2160p), including several intermediate resolutions such as QHD (1440p).
Each attack instance is accompanied by the following details:
Additionally, the model of the webcam is also specified.
Metadata is provided in file_info.csv.
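A minimal way to inspect that metadata (only the file name file_info.csv is taken from this description; the columns are whatever the CSV actually contains):

```python
# Minimal sketch: inspect the metadata shipped with the dataset.
import pandas as pd

info = pd.read_csv("file_info.csv")
print(info.columns.tolist())   # e.g. resolution, webcam model, attack details
print(info.head())
```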
🚀 You can learn more about our high-quality unique datasets here
keywords: liveness detection systems, liveness detection dataset, biometric dataset, biometric data dataset, biometric system attacks, anti-spoofing dataset, face liveness detection, deep learning dataset, face spoofing database, face anti-spoofing, ibeta dataset, human video dataset, video dataset, high quality video dataset, hd video dataset, phone attack dataset, face anti spoofing, large-scale face anti spoofing, rich annotations anti spoofing dataset
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Metadata of a Large Sonar and Stereo Camera Dataset Suitable for Sonar-to-RGB Image Translation
Introduction
This is a set of metadata describing a large dataset of synchronized sonar and stereo camera recordings that were captured between August 2021 and September 2023 during the project DeeperSense (https://robotik.dfki-bremen.de/en/research/projects/deepersense/), as training data for Sonar-to-RGB image translation. Parts of the sensor data have been published (https://zenodo.org/records/7728089, https://zenodo.org/records/10220989). Due to the size of the sensor data corpus, it is currently impractical to make the entire corpus accessible online. Instead, this metadatabase serves as a relatively compact representation, allowing interested researchers to inspect the data and select relevant portions for their particular use case, which will be made available on demand. This is an effort to comply with the FAIR principle A2 (https://www.go-fair.org/fair-principles/) that metadata shall be accessible even when the base data is not immediately accessible.
Locations and sensors
The sensor data was captured at four different locations, including one laboratory (Maritime Exploration Hall at DFKI RIC Bremen) and three field locations (Chalk Lake Hemmoor, Tank Wash Basin Neu-Ulm, Lake Starnberg). At all locations, a ZED camera and a Blueprint Oculus M1200d sonar were used. Additionally, a SeaVision camera was used at the Maritime Exploration Hall at DFKI RIC Bremen and at the Chalk Lake Hemmoor. The examples/ directory holds a typical output image for each sensor at each available location.
Data volume per session
Six data collection sessions were conducted. The table below presents an overview of the amount of data captured in each session:
| Session dates | Location | Number of datasets | Total duration of datasets [h] | Total logfile size [GB] | Number of images | Total image size [GB] |
|---|---|---|---|---|---|---|
| 2021-08-09 - 2021-08-12 | Maritime Exploration Hall at DFKI RIC Bremen | 52 | 10.8 | 28.8 | 389’047 | 88.1 |
| 2022-02-07 - 2022-02-08 | Maritime Exploration Hall at DFKI RIC Bremen | 35 | 4.4 | 54.1 | 629’626 | 62.3 |
| 2022-04-26 - 2022-04-28 | Chalk Lake Hemmoor | 52 | 8.1 | 133.6 | 1’114’281 | 97.8 |
| 2022-06-28 - 2022-06-29 | Tank Wash Basin Neu-Ulm | 42 | 6.7 | 144.2 | 824’969 | 26.9 |
| 2023-04-26 - 2023-04-27 | Maritime Exploration Hall at DFKI RIC Bremen | 55 | 7.4 | 141.9 | 739’613 | 9.6 |
| 2023-09-01 - 2023-09-02 | Lake Starnberg | 19 | 2.9 | 40.1 | 217’385 | 2.3 |
| Total | | 255 | 40.3 | 542.7 | 3’914’921 | 287.0 |
Data and metadata structure
Sensor data corpus
The sensor data corpus comprises two processing stages:
raw data streams stored in ROS bagfiles (aka logfiles),
camera and sonar images (aka datafiles) extracted from the logfiles.
The files are stored in a file tree hierarchy which groups them by session, dataset, and modality:
${session_key}/
    ${dataset_key}/
        ${logfile_name}
        ${modality_key}/
            ${datafile_name}
A typical logfile path has this form:
2023-09_starnberg_lake/2023-09-02-15-06_hydraulic_drill/stereo_camera-zed-2023-09-02-15-06-07.bag
A typical datafile path has this form:
2023-09_starnberg_lake/2023-09-02-15-06_hydraulic_drill/zed_right/1693660038_368077993.jpg
All directory and file names, and their particles, are designed to serve as identifiers in the metadatabase. Their formatting, as well as the definitions of all terms, are documented in the file entities.json.
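Because the path particles double as identifiers, a datafile path can be split back into its parts; a minimal sketch is below (the seconds_nanoseconds reading of the datafile name is an assumption, and entities.json remains the authoritative reference):

```python
# Sketch: recover session/dataset/modality identifiers from a datafile path,
# following the ${session_key}/${dataset_key}/${modality_key}/${datafile_name}
# layout described above. See entities.json for the authoritative definitions.
from pathlib import PurePosixPath

def parse_datafile_path(path: str) -> dict:
    session_key, dataset_key, modality_key, datafile_name = PurePosixPath(path).parts
    # Assumption: datafile names encode <seconds>_<nanoseconds> of the capture time.
    seconds, nanoseconds = PurePosixPath(datafile_name).stem.split("_")
    return {
        "session_key": session_key,
        "dataset_key": dataset_key,
        "modality_key": modality_key,
        "datafile_name": datafile_name,
        "timestamp": int(seconds) + int(nanoseconds) * 1e-9,
    }

print(parse_datafile_path(
    "2023-09_starnberg_lake/2023-09-02-15-06_hydraulic_drill/zed_right/1693660038_368077993.jpg"))
```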
Metadatabase
The metadatabase is provided in two equivalent forms:
as a standalone SQLite (https://www.sqlite.org/index.html) database file metadata.sqlite for users familiar with SQLite,
as a collection of CSV files in the csv/ directory for users who prefer other tools.
The database file has been generated from the CSV files, so each database table holds the same information as the corresponding CSV file. In addition, the metadatabase contains a series of convenience views that facilitate access to certain aggregate information.
An entity relationship diagram of the metadatabase tables is stored in the file entity_relationship_diagram.png. Each entity, its attributes, and relations are documented in detail in the file entities.json.
Some general design remarks:
For convenience, timestamps are always given in both a human-readable form (ISO 8601 formatted datetime strings with explicit local time zone), and as seconds since the UNIX epoch.
In practice, each logfile always contains a single stream, and each stream is stored always in a single logfile. Per database schema however, the entities stream and logfile are modeled separately, with a “many-streams-to-one-logfile” relationship. This design was chosen to be compatible with, and open for, data collections where a single logfile contains multiple streams.
A modality is not an attribute of a sensor alone, but of a datafile: Because a sensor is an attribute of a stream, and a single stream may be the source of multiple modalities (e.g. RGB vs. grayscale images from the same camera, or cartesian vs. polar projection of the same sonar output). Conversely, the same modality may originate from different sensors.
As a usage example, the data volume per session which is tabulated at the top of this document, can be extracted from the metadatabase with the following SQL query:
SELECT
    PRINTF('%s - %s', SUBSTR(session_start, 1, 10), SUBSTR(session_end, 1, 10)) AS 'Session dates',
    location_name_english AS Location,
    number_of_datasets AS 'Number of datasets',
    total_duration_of_datasets_h AS 'Total duration of datasets [h]',
    total_logfile_size_gb AS 'Total logfile size [GB]',
    number_of_images AS 'Number of images',
    total_image_size_gb AS 'Total image size [GB]'
FROM location
JOIN session USING (location_id)
JOIN (
    SELECT
        session_id,
        COUNT(dataset_id) AS number_of_datasets,
        ROUND(SUM(dataset_duration) / 3600, 1) AS total_duration_of_datasets_h,
        ROUND(SUM(total_logfile_size) / 10e9, 1) AS total_logfile_size_gb
    FROM location
    JOIN session USING (location_id)
    JOIN dataset USING (session_id)
    JOIN view_dataset_total_logfile_size USING (dataset_id)
    GROUP BY session_id
) USING (session_id)
JOIN (
    SELECT
        session_id,
        COUNT(datafile_id) AS number_of_images,
        ROUND(SUM(datafile_size) / 10e9, 1) AS total_image_size_gb
    FROM session
    JOIN dataset USING (session_id)
    JOIN stream USING (dataset_id)
    JOIN datafile USING (stream_id)
    GROUP BY session_id
) USING (session_id)
ORDER BY session_id;
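For exploration from Python, the metadatabase can be opened with the standard sqlite3 module; a minimal sketch (only the file name metadata.sqlite and the table names used in the query above are taken from this description):

```python
# Sketch: open the metadatabase, list its tables and convenience views,
# then run a small aggregate query against the documented tables.
import sqlite3

con = sqlite3.connect("metadata.sqlite")
cur = con.cursor()

# Tables and views present in the metadatabase
for name, kind in cur.execute(
        "SELECT name, type FROM sqlite_master WHERE type IN ('table', 'view') ORDER BY type, name"):
    print(kind, name)

# Example aggregate: number of datasets per session
for row in cur.execute(
        "SELECT session_id, COUNT(dataset_id) FROM dataset GROUP BY session_id"):
    print(row)

con.close()
```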
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
CityTrek-14K is a distinctive, extensive dataset that includes 14,000 trajectories from 280 drivers, each contributing 50 trajectories, in three major U.S. cities: Philadelphia (PA), Atlanta (GA), and Memphis (TN). It features time series data capturing details like timestamps, vehicle speeds, and GPS coordinates, with a collection frequency of 1 Hz. Although the dataset includes location data, strict anonymization practices were adhered to, ensuring personal information like home or work addresses remains confidential. The CityTrek-14K dataset offers a comprehensive view of driving patterns, encompassing over 4,800 hours of driving data and spanning more than 189,000 miles, collected between July 2017 and March 2019. The dataset comprises two distinct files: the first is a summary of the trips, and the second is a trajectory data file that includes detailed records captured every second.
If you use this dataset, please kindly cite the following paper: - Moosavi, Sobhan, and Rajiv Ramnath. "Context-aware driver risk prediction with telematics data." Accident Analysis & Prevention 192 (2023): 107269.
The CityTrek-14K dataset was collected using specially designed devices installed in vehicles. These devices were configured to record and transmit data frequently. Further details about this data collection process are elaborated in the paper mentioned above.
The CityTrek-14K dataset is versatile, suitable for numerous applications such as:
- Traffic Modeling and ETA Prediction: The dataset contains detailed route information and travel times, making it an excellent resource for large-scale traffic modeling and ETA modeling techniques.
- Route Optimization: With its detailed trajectory data, the dataset is ideal for developing and testing route optimization techniques, providing insights into efficient pathfinding methods.
- Modeling and Analyzing Driver Behavior: As each driver in the dataset has exactly 50 trajectories recorded, this allows for a comprehensive analysis of driver behavior, offering a unique opportunity to study and model driving patterns and habits.
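As a starting point for trajectory work, a minimal sketch is below; the file name and the trip_id/lat/lon/speed column names are hypothetical and should be checked against the actual trajectory file schema:

```python
# Sketch: per-trip distance and average speed from 1 Hz GPS points.
# Column names (trip_id, lat, lon, speed) and the file name are hypothetical.
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

traj = pd.read_csv("citytrek_trajectories.csv")   # hypothetical file name

def trip_summary(g):
    dist = haversine_km(g["lat"].values[:-1], g["lon"].values[:-1],
                        g["lat"].values[1:],  g["lon"].values[1:]).sum()
    return pd.Series({"distance_km": dist, "mean_speed": g["speed"].mean()})

print(traj.groupby("trip_id").apply(trip_summary).head())
```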
This dataset is being distributed solely for research purposes under the Creative Commons Attribution-Noncommercial-ShareAlike license (CC BY-NC-SA 4.0). By downloading the dataset, you agree to use it only for non-commercial, research, or academic applications. If you use this dataset, it is necessary to cite the paper mentioned above.
For any inquiries or assistance, please contact Sobhan Moosavi at sobhan.mehr84@gmail.com
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
The Thai Wake Word & Voice Command Dataset is expertly curated to support the training and development of voice-activated systems. This dataset includes a large collection of wake words and command phrases, essential for enabling seamless user interaction with voice assistants and other speech-enabled technologies. It’s designed to ensure accurate wake word detection and voice command recognition, enhancing overall system performance and user experience.
This dataset includes 20,000+ audio recordings of wake words and command phrases. Each participant contributed 400 recordings, captured under varied environmental conditions and speaking speeds. The data covers:
This diversity ensures robust training for real-world voice assistant applications.
Each audio file is accompanied by detailed metadata to support advanced filtering and training needs.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
Cough audio signal classification has been successfully used to diagnose a variety of respiratory conditions, and there has been significant interest in leveraging Machine Learning (ML) to provide widespread COVID-19 screening. The COUGHVID dataset provides over 20,000 crowdsourced cough recordings representing a wide range of subject ages, genders, geographic locations, and COVID-19 statuses. Furthermore, experienced pulmonologists labeled more than 2,000 recordings to diagnose medical abnormalities present in the coughs, thereby contributing one of the largest expert-labeled cough datasets in existence that can be used for a plethora of cough audio classification tasks. As a result, the COUGHVID dataset contributes a wealth of cough recordings for training ML models to address the world's most urgent health crises.
Private Set and Testing Protocol
Researchers interested in testing their models on the private test dataset should contact us at coughvid@epfl.ch, briefly explaining the type of validation they want to make and the results they obtained through cross-validation with the public data. Then, access to the unlabeled recordings will be provided, and the researchers should send the predictions of their models on these recordings. Finally, the performance metrics of the predictions will be sent to the researchers.
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
The US English Wake Word & Voice Command Dataset is expertly curated to support the training and development of voice-activated systems. This dataset includes a large collection of wake words and command phrases, essential for enabling seamless user interaction with voice assistants and other speech-enabled technologies. It’s designed to ensure accurate wake word detection and voice command recognition, enhancing overall system performance and user experience.
This dataset includes 20,000+ audio recordings of wake words and command phrases. Each participant contributed 400 recordings, captured under varied environmental conditions and speaking speeds. The data covers:
This diversity ensures robust training for real-world voice assistant applications.
Each audio file is accompanied by detailed metadata to support advanced filtering and training needs.
Terms: https://crawlfeeds.com/privacy_policy
Bulk Bookstore is an online book store. The Crawl Feeds team extracted a few sample records for analysis purposes. Last crawled on 27 Nov 2021.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
According to Wikipedia, an ultramarathon, also called ultra distance or ultra running, is any footrace longer than the traditional marathon length of 42.195 kilometres (26 mi 385 yd). Various distances are raced competitively, from the shortest common ultramarathon of 31 miles (50 km) to over 200 miles (320 km). 50 km and 100 km are both World Athletics record distances, but some 100-mile (160 km) races are among the oldest and most prestigious events, especially in North America.
The data in this file is a large collection of ultra-marathon race records registered between 1798 and 2022 (a period of well over two centuries) being therefore a formidable long term sample. All data was obtained from public websites.
Despite the original data being in the public domain, the race records, which originally contained the athletes' names, have been anonymized to comply with data protection laws and to preserve the athletes' privacy. However, a column Athlete ID has been created with a numerical ID representing each unique runner (so if Antonio Fernández participated in 5 races over different years, the corresponding race records now hold his unique Athlete ID instead of his name). This way I have preserved valuable information.
The dataset contains 7,461,226 ultra-marathon race records from 1,641,168 unique athletes.
The following columns (with data types) are included:
The Event name column includes country location information that can be derived into a new column; similarly, seasonal information can be found in the Event dates column beyond the Year of event (both can be extracted with a bit of processing, as sketched below).
The Event distance/length column describes the type of race, covering the most popular UM race distances and lengths, and some other specific modalities (multi-day, etc.):
Additionally, there is information on age, gender, and speed (in km/h) in other columns.
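A minimal extraction sketch, assuming the country appears as a trailing parenthesized code in Event name and that Event dates is parseable by pandas (both formatting assumptions to verify on the actual file):

```python
# Sketch: derive a country column from "Event name" and a month column from
# "Event dates". The parsing rules below are assumptions about the formatting.
import pandas as pd

races = pd.read_csv("ultramarathon_races.csv")   # hypothetical file name

# e.g. "Some Ultra Trail (ESP)" -> "ESP"  (assumed trailing parenthesized code)
races["Country"] = races["Event name"].str.extract(r"\(([^)]+)\)\s*$", expand=False)

# Month of the event as a rough season indicator (unparseable dates become NaT)
races["Event month"] = pd.to_datetime(
    races["Event dates"], errors="coerce", dayfirst=True).dt.month

print(races[["Event name", "Country", "Event dates", "Event month"]].head())
```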
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Electrocardiography (ECG) is a key diagnostic tool to assess the cardiac condition of a patient. Automatic ECG interpretation algorithms as diagnosis support systems promise large reliefs for the medical personnel - only on the basis of the number of ECGs that are routinely taken. However, the development of such algorithms requires large training datasets and clear benchmark procedures. In our opinion, both aspects are not covered satisfactorily by existing freely accessible ECG datasets.
The PTB-XL ECG dataset is a large dataset of 21,799 clinical 12-lead ECGs of 10 seconds length from 18,869 patients. The raw waveform data was annotated by up to two cardiologists, who assigned potentially multiple ECG statements to each record. The 71 different ECG statements in total conform to the SCP-ECG standard and cover diagnostic, form, and rhythm statements. To ensure comparability of machine learning algorithms trained on the dataset, we provide recommended splits into training and test sets. In combination with the extensive annotation, this turns the dataset into a rich resource for the training and the evaluation of automatic ECG interpretation algorithms. The dataset is complemented by extensive metadata on demographics, infarction characteristics, likelihoods for diagnostic ECG statements, as well as annotated signal properties.
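For orientation, a minimal loading sketch using the recommended splits is below; the file and column names (ptbxl_database.csv, strat_fold, filename_lr) follow the PhysioNet release as commonly documented and should be treated as assumptions here:

```python
# Sketch: load the PTB-XL metadata and apply the recommended train/test split.
# File and column names are assumptions; verify against the actual release.
import pandas as pd
import wfdb

meta = pd.read_csv("ptbxl_database.csv", index_col="ecg_id")

train = meta[meta["strat_fold"] <= 8]    # recommended training folds
test = meta[meta["strat_fold"] == 10]    # recommended test fold

# Read one 100 Hz record (10 s x 12 leads)
signal, fields = wfdb.rdsamp(train["filename_lr"].iloc[0])
print(signal.shape, fields["sig_name"])
```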
Three datasets are available, each consisting of 15 CSV files. Each file contains the voxelised shower information obtained from single particles produced at the front of the calorimeter in the |η| range (0.2-0.25), simulated in the ATLAS detector. Two datasets contain photon events with different statistics; the larger sample has about 10 times the number of events as the other. The other dataset contains pions. The pion dataset and the photon dataset with the lower statistics were used to train the corresponding two GANs presented in the AtlFast3 paper SIMU-2018-04.
The information in each file is a table; the rows correspond to the events and the columns to the voxels. The voxelisation procedure is described in the AtlFast3 paper linked above and in the dedicated PUB note ATL-SOFT-PUB-2020-006. In summary, the detailed energy deposits produced by ATLAS were converted from x,y,z coordinates to local cylindrical coordinates defined around the particle 3-momentum at the entrance of the calorimeter. The energy deposits in each layer were then grouped in voxels and for each voxel the energy was stored in the csv file. For each particle, there are 15 files corresponding to the 15 energy points used to train the GAN. The name of the csv file defines both the particle and the energy of the sample used to create the file.
The size of the voxels is described in the binning.xml file. Software tools to read the XML file and manipulate the spatial information of voxels are provided in the FastCaloGAN repository.
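A minimal sketch for reading one of the CSV files (rows are events, columns are voxels); the file name is illustrative, since the actual names encode the particle type and energy point:

```python
# Sketch: load one voxelised-shower CSV and compute the total deposited
# energy per event. The file name below is a placeholder.
import pandas as pd

showers = pd.read_csv("photons_sample.csv")     # hypothetical file name
print(showers.shape)                            # (n_events, n_voxels)

total_energy = showers.sum(axis=1)              # per-event sum over voxels
print(total_energy.describe())
```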
Updated on February 10th 2022. A new dataset photons_samples_highStat.tgz was added to this record and the binning.xml file was updated accordingly.
Updated on April 18th 2023. A new dataset pions_samples_highStat.tgz was added to this record.
This is a dataset downloaded from excelbianalytics.com, created using random VBA logic. I recently performed an extensive exploratory data analysis on it and added new columns, namely: Unit margin, Order year, Order month, Order weekday and Order_Ship_Days, which I think can help with analysis of the data. I shared it because I thought it was a great dataset for newbies like myself to practice analytical processes on.
License: CC0-1.0, https://spdx.org/licenses/CC0-1.0.html
Passive Acoustic Monitoring (PAM) is emerging as a solution for monitoring species and environmental change over large spatial and temporal scales. However, drawing rigorous conclusions based on acoustic recordings is challenging, as there is no consensus over which approaches, and indices are best suited for characterizing marine and terrestrial acoustic environments.
Here, we describe the application of multiple machine-learning techniques to the analysis of a large PAM dataset. We combine pre-trained acoustic classification models (VGGish, NOAA & Google Humpback Whale Detector), dimensionality reduction (UMAP), and balanced random forest algorithms to demonstrate how machine-learned acoustic features capture different aspects of the marine environment.
The UMAP dimensions derived from VGGish acoustic features exhibited good performance in separating marine mammal vocalizations according to species and locations. RF models trained on the acoustic features performed well for labelled sounds in the 8 kHz range; however, low- and high-frequency sounds could not be classified using this approach.
The workflow presented here shows how acoustic feature extraction, visualization, and analysis allow for establishing a link between ecologically relevant information and PAM recordings at multiple scales.
The datasets and scripts provided in this repository allow replicating the results presented in the publication.
Methods
Data acquisition and preparation
We collected all records available in the Watkins Marine Mammal Database website listed under the “all cuts” page. For each audio file in the WMD, the associated metadata included a label for the sound sources present in the recording (biological, anthropogenic, and environmental), as well as information related to the location and date of recording. To minimize the presence of unwanted sounds in the samples, we only retained audio files with a single source listed in the metadata. We then labelled the selected audio clips according to taxonomic group (Odontocetae, Mysticetae) and species.
We limited the analysis to 12 marine mammal species by discarding data when a species: had less than 60 s of audio available, had a vocal repertoire extending beyond the resolution of the acoustic classification model (VGGish), or was recorded in a single country. To determine if a species was suited for analysis using VGGish, we inspected the Mel-spectrograms of 3-s audio samples and only retained species with vocalizations that could be captured in the Mel-spectrogram (Appendix S1). The vocalizations of species that produce very low-frequency or very high-frequency sounds were not captured by the Mel-spectrogram, thus we removed them from the analysis. To ensure that records included the vocalizations of multiple individuals for each species, we only considered species with records from two or more different countries. Lastly, to avoid overrepresentation of sperm whale vocalizations, we excluded 30,000 sperm whale recordings collected in the Dominican Republic. The resulting dataset consisted of 19,682 audio clips with a duration of 960 milliseconds each (0.96 s) (Table 1).
The Placentia Bay Database (PBD) includes recordings collected by Fisheries and Oceans Canada in Placentia Bay (Newfoundland, Canada) in 2019. The dataset consisted of two months of continuous recordings (1230 hours), starting on July 1st, 2019, and ending on August 31st, 2019. The data was collected using an AMAR G4 hydrophone (sensitivity: -165.02 dB re 1V/µPa at 250 Hz) deployed at 64 m of depth. The hydrophone was set to operate following 15 min cycles, with the first 60 s sampled at 512 kHz, and the remaining 14 min sampled at 64 kHz. For the purpose of this study, we limited the analysis to the 64 kHz recordings.
Acoustic feature extraction
The audio files from the WMD and PBD databases were used as input for VGGish (Abu-El-Haija et al., 2016; Chung et al., 2018), a CNN developed and trained to perform general acoustic classification. VGGish was trained on the Youtube8M dataset, containing more than two million user-labelled audio-video files. Rather than focusing on the final output of the model (i.e., the assigned labels), here the model was used as a feature extractor (Sethi et al., 2020). VGGish converts audio input into a semantically meaningful vector consisting of 128 features. The model returns features at multiple resolutions: ~1 s (960 ms); ~5 s (4800 ms); ~1 min (59’520 ms); ~5 min (299’520 ms). All of the visualizations and results pertaining to the WMD were prepared using the finest feature resolution of ~1 s. The visualizations and results pertaining to the PBD were prepared using the ~5 s features for the humpback whale detection example, and were then averaged to an interval of 30 min in order to match the temporal resolution of the environmental measures available for the area.
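A minimal sketch of the feature-extraction step using the publicly released VGGish model on TF Hub (the exact preprocessing used in the study may differ):

```python
# Sketch: 128-dimensional VGGish features at ~0.96 s resolution for one clip.
# The input file name is a placeholder; VGGish expects mono 16 kHz audio in [-1, 1].
import librosa
import tensorflow_hub as hub

vggish = hub.load("https://tfhub.dev/google/vggish/1")

waveform, sr = librosa.load("example_clip.wav", sr=16000)   # mono, 16 kHz
embeddings = vggish(waveform)                               # shape: (n_frames, 128)
print(embeddings.shape)
```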
UMAP ordination and visualization
UMAP is a non-linear dimensionality reduction algorithm based on the concept of topological data analysis which, unlike other dimensionality reduction techniques (e.g., tSNE), preserves both the local and global structure of multivariate datasets (McInnes et al., 2018). To allow for data visualization and to reduce the 128 features to two dimensions for further analysis, we applied Uniform Manifold Approximation and Projection (UMAP) to both datasets and inspected the resulting plots.
The UMAP algorithm generates a low-dimensional representation of a multivariate dataset while maintaining the relationships between points in the global dataset structure (i.e., the 128 features extracted from VGGish). Each point in a UMAP plot in this paper represents an audio sample with a duration of ~1 second (WMD dataset), ~5 seconds (PBD dataset, humpback whale detections), or 30 minutes (PBD dataset, environmental variables). Each point in the two-dimensional UMAP space also represents a vector of 128 VGGish features. The nearer two points are in the plot space, the nearer they are in the 128-dimensional space, and thus the distance between two points in UMAP reflects the degree of similarity between two audio samples in our datasets. Areas with a high density of samples in UMAP space should, therefore, contain sounds with similar characteristics, and such similarity should decrease with increasing point distance. Previous studies illustrated how VGGish and UMAP can be applied to the analysis of terrestrial acoustic datasets (Heath et al., 2021; Sethi et al., 2020). The visualizations and classification trials presented here illustrate how the two techniques (VGGish and UMAP) can be used together for marine ecoacoustics analysis. UMAP visualizations were prepared using the umap-learn package for the Python programming language (version 3.10). All UMAP visualizations presented in this study were generated using the algorithm's default parameters.
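A minimal sketch of the UMAP step with default parameters, applied to a saved feature matrix (the file name is a placeholder):

```python
# Sketch: project an (n_samples, 128) VGGish feature matrix to 2-D with UMAP
# default parameters and plot the embedding.
import matplotlib.pyplot as plt
import numpy as np
import umap

features = np.load("vggish_features.npy")      # placeholder: (n_samples, 128)

reducer = umap.UMAP()                          # default parameters
embedding = reducer.fit_transform(features)    # (n_samples, 2)

plt.scatter(embedding[:, 0], embedding[:, 1], s=2)
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.show()
```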
Labelling sound sources
The labels for the WMD records (i.e., taxonomic group, species, location) were obtained from the database metadata.
For the PBD recordings, we obtained measures of wind speed, surface temperature, and current speed (Fig 1) from an oceanographic buoy located in proximity to the recorder. We chose these three variables for their different contributions to background noise in marine environments. Wind speed contributes to underwater background noise at multiple frequencies, ranging from 500 Hz to 20 kHz (Hildebrand et al., 2021). Sea surface temperature contributes to background noise at frequencies between 63 Hz and 125 Hz (Ainslie et al., 2021), while ocean currents contribute to ambient noise at frequencies below 50 Hz (Han et al., 2021). Prior to analysis, we categorized the environmental variables and assigned the categories as labels to the acoustic features (Table 2). Humpback whale vocalizations in the PBD recordings were processed using the humpback whale acoustic detector created by NOAA and Google (Allen et al., 2021), providing a model score for every ~5 s sample. This model was trained on a large dataset (14 years and 13 locations) using humpback whale recordings annotated by experts (Allen et al., 2021). The model returns scores ranging from 0 to 1, indicating the confidence in the predicted humpback whale presence. We used the results of this detection model to label the PBD samples according to the presence of humpback whale vocalizations. To verify the model results, we inspected all audio files that contained a 5 s sample with a model score higher than 0.9 for the month of July. If the presence of a humpback whale was confirmed, we labelled the segment as a model detection. We labelled any additional humpback whale vocalization present in the inspected audio files as a visual detection, while we labelled other sources and background noise samples as absences. In total, we labelled 4.6 hours of recordings. We reserved the recordings collected in August to test the precision of the final predictive model.
Label prediction performance
We used Balanced Random Forest models (BRF) provided in the imbalanced-learn Python package (Lemaître et al., 2017) to predict humpback whale presence and environmental conditions from the acoustic features generated by VGGish. We chose BRF as the algorithm because it is suited for datasets characterized by class imbalance. The BRF algorithm performs undersampling of the majority class prior to prediction, allowing it to overcome class imbalance (Lemaître et al., 2017). For each model run, the PBD dataset was split into training (80%) and testing (20%) sets.
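A minimal sketch of the BRF step with an 80/20 split (feature and label files are placeholders; the hyperparameter search described below is omitted):

```python
# Sketch: Balanced Random Forest on the 128-d acoustic features with an 80/20 split.
import numpy as np
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X = np.load("vggish_features.npy")    # placeholder: (n_samples, 128)
y = np.load("labels.npy")             # placeholder: e.g. humpback presence/absence

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
brf.fit(X_train, y_train)
print("balanced accuracy:", balanced_accuracy_score(y_test, brf.predict(X_test)))
```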
The training datasets were used to fine-tune the models through a nested k-fold cross-validation approach with ten folds in the outer loop and five folds in the inner loop. We selected nested cross-validation as it allows optimizing model hyperparameters and performing model evaluation in a single step. We used the default parameters of the BRF algorithm, except for the ‘n_estimators’ hyperparameter, for which we tested
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides a structured and machine-readable register of all companies recorded by the Companies Registration Office (CRO) in Ireland. It includes a daily snapshot of company records, covering both currently registered companies and historical records of dissolved or closed entities. The dataset aligns with the European Union’s Open Data Directive (Directive (EU) 2019/1024) and the Implementing Regulation (EU) 2023/138, which designates company and company ownership data as a high-value dataset. Updated daily, it ensures timely access to corporate information and is available for bulk download and API access under the Creative Commons Attribution 4.0 (CC BY 4.0) licence, allowing unrestricted reuse with appropriate attribution. By increasing transparency, accountability, and economic innovation, this dataset supports public sector initiatives, research, and digital services development.
Terms: https://crawlfeeds.com/privacy_policy
Unlock the power of Flipkart's extensive product catalog with our meticulously curated e-commerce dataset. This dataset provides detailed information on a wide range of products available on Flipkart, including product names, descriptions, prices, customer reviews, ratings, and images. Whether you're working on data analysis, machine learning models, or conducting in-depth market research, this dataset is an invaluable resource.
With our Flipkart e-commerce dataset, you can easily analyze trends, compare products, and gain insights into consumer behavior. The dataset is structured and high-quality, ensuring that you have the best foundation for your projects.
Flipkart is one of the largest e-commerce websites, based out of India. This pre-crawled dataset has more than 5.7 million records.
Where to use this dataset
License: Apache License 2.0, http://www.apache.org/licenses/LICENSE-2.0
This dataset comprises synthetic UML diagrams, explicitly focusing on activity and sequence diagrams generated using PlantUML—a text-based tool for creating visual diagrams. By leveraging randomized text strings based on PlantUML syntax, we produced a diverse and scalable collection that emulates standard UML diagrams. Each diagram is accompanied by its corresponding PlantUML code, facilitating a clear understanding of the visual representation's textual foundation. Data from smaller datasets is reused in the larger datasets, as each model was trained on the data separately, as described in the original paper. It's recommended just to use the Extra Large dataset when interested in using the data in its entirety.
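To illustrate the idea of randomized PlantUML text (this is a toy generator, not the authors' pipeline):

```python
# Sketch: generate a small, randomized activity diagram in PlantUML syntax,
# illustrating the kind of text-to-diagram pairs described above.
import random

def random_activity_diagram(n_steps=5, seed=None):
    rng = random.Random(seed)
    words = ["load", "parse", "check", "store", "notify", "retry", "merge"]
    lines = ["@startuml", "start"]
    for _ in range(n_steps):
        lines.append(f":{rng.choice(words)} {rng.randint(1, 99)};")
        if rng.random() < 0.3:
            lines.append(f"if ({rng.choice(words)}?) then (yes)")
            lines.append(f"  :{rng.choice(words)};")
            lines.append("endif")
    lines += ["stop", "@enduml"]
    return "\n".join(lines)

print(random_activity_diagram(seed=42))
```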
Each category is divided into four subsets based on size (approximately):
Small: 6,000 training diagrams and 1,500 testing diagrams.
Medium: 12,000 training diagrams and 3,000 testing diagrams.
Large: 24,000 training diagrams and 6,000 testing diagrams.
Extra Large: 120,000 training diagrams and 30,000 testing diagrams.