100+ datasets found
  1. Booking hotel reviews large dataset

    • crawlfeeds.com
    csv, zip
    Updated Oct 6, 2025
    Cite
    Crawl Feeds (2025). Booking hotel reviews large dataset [Dataset]. https://crawlfeeds.com/datasets/booking-hotel-reviews-large-dataset
    Explore at:
    Available download formats: zip, csv
    Dataset updated
    Oct 6, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Explore our extensive Booking Hotel Reviews Large Dataset, featuring over 20.8 million records of detailed customer feedback from hotels worldwide. Whether you're conducting sentiment analysis, market research, or competitive benchmarking, this dataset provides invaluable insights into customer experiences and preferences.

    The dataset includes crucial information such as reviews, ratings, comments, and more, all sourced from travellers who booked through Booking.com. It's an ideal resource for businesses aiming to understand guest sentiments, improve service quality, or refine marketing strategies within the hospitality sector.

    With this hotel reviews dataset, you can dive deep into trends and patterns that reveal what customers truly value during their stays. Whether you're analyzing reviews for sentiment analysis or studying traveller feedback from specific regions, this dataset delivers the insights you need.

    Ready to get started? Download the complete hotel review dataset or connect with the Crawl Feeds team to request records tailored to specific countries or regions. Unlock the power of data and take your hospitality analysis to the next level!

    Access 3 million+ US hotel reviews — submit your request today.

  2. The Expanded Groove MIDI Dataset (E-GMD)

    • kaggle.com
    zip
    Updated Dec 13, 2023
    Cite
    Alex Ignatov (2023). The Expanded Groove MIDI Dataset (E-GMD) [Dataset]. https://www.kaggle.com/datasets/alexignatov/the-expanded-groove-midi-dataset
    Explore at:
    Available download formats: zip (107045765 bytes)
    Dataset updated
    Dec 13, 2023
    Authors
    Alex Ignatov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ⚠️ Note! This is the MIDI-only archive. If you need the WAV alternatives for your work, please download the full dataset from their website: https://magenta.tensorflow.org/datasets/e-gmd

    Cited from the original website:

    Overview

    The Expanded Groove MIDI Dataset (E-GMD) is a large dataset of human drum performances, with audio recordings annotated in MIDI. E-GMD contains 444 hours of audio from 43 drum kits and is an order of magnitude larger than similar datasets. It is also the first human-performed drum transcription dataset with annotations of velocity. It is based on our previously released Groove MIDI Dataset.

    Dataset

    This dataset is an expansion of the Groove MIDI Dataset (GMD). GMD is a dataset of human drum performances recorded in MIDI format on a Roland TD-11 electronic drum kit. To make the dataset applicable to ADT, we expanded it by re-recording the GMD sequences on 43 drumkits using a Roland TD-17. The kits range from electronic (e.g., 808, 909) to acoustic sounds. Recording was done at 44.1kHz and 24 bits and aligned within 2ms of the original MIDI files.

    We maintained the same train, test and validation splits across sequences that GMD had. Because each kit was recorded for every sequence, we see all 43 kits in the train, test and validation splits.

    Split      | Unique Sequences | Total Sequences | Duration (hours)
    Train      | 819              | 35,217          | 341.4
    Test       | 123              | 5,289           | 50.9
    Validation | 117              | 5,031           | 52.2
    Total      | 1,059            | 45,537          | 444.5

    Given the semi-manual nature of the pipeline, there were some errors in the recording process that resulted in unusable tracks. If your application requires only symbolic drum data, we recommend using the original data from the Groove MIDI Dataset.
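
    As a quick-start sketch (not part of the original documentation), the MIDI files can be inspected with the pretty_midi library; the file path below is a placeholder and should point at one of the extracted .mid files:

    # Minimal sketch: read one E-GMD MIDI file and list its first drum hits with velocity.
    import pretty_midi

    pm = pretty_midi.PrettyMIDI("path/to/e-gmd/example_beat.mid")  # placeholder path

    for instrument in pm.instruments:
        if instrument.is_drum:
            for note in instrument.notes[:10]:
                # pitch follows the General MIDI drum map; velocity is the annotation E-GMD adds
                print(f"t={note.start:.3f}s pitch={note.pitch} velocity={note.velocity}")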

    For more information about how the dataset was created and several applications of it, please see the paper where it was introduced: Improving Perceptual Quality of Drum Transcription with the Expanded Groove MIDI Dataset.

    Lee Callender, Curtis Hawthorne, and Jesse Engel. "Improving Perceptual Quality of Drum Transcription with the Expanded Groove MIDI Dataset." 2020. arXiv:2004.00188.

    For citations, please use:

    @misc{callender2020improving,
          title={Improving Perceptual Quality of Drum Transcription with the Expanded Groove MIDI Dataset},
          author={Lee Callender and Curtis Hawthorne and Jesse Engel},
          year={2020},
          eprint={2004.00188},
          archivePrefix={arXiv},
          primaryClass={cs.SD}
    }

    I have no contribution to or affiliation with this work; I just uploaded it and made it available on Kaggle.

  3. Smartphone High-explosive Audio Recordings Dataset (SHAReD)

    • dataverse.harvard.edu
    Updated Jun 1, 2025
    Cite
    Samuel Kei Takazawa (2025). Smartphone High-explosive Audio Recordings Dataset (SHAReD) [Dataset]. http://doi.org/10.7910/DVN/ROWODP
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 1, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Samuel Kei Takazawa
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A collection of signals from high explosives recorded on a smartphone sensor network, available as a pandas DataFrame. The dataset is accompanied by two machine learning models (LFM and D-YAMNet) that were trained for explosion detection using SHAReD and the ESC-50 dataset. There are 326 sets of signals from 70 high-explosive events. The sensors included are:

    • Microphone
    • Accelerometer
    • Barometer
    • Global Navigation Satellite Systems

    Notes:

    • The extended dataset only includes microphone data and information about the explosion.
    • D-YAMNet takes 0.96 seconds of audio at a 16 kHz sample rate, with input shape (15360,).
    • LFM takes 0.96 seconds of audio at an 800 Hz sample rate, with input shape (1, 768).

    For ease of use of the machine learning models (LFM and D-YAMNet), the dataset (SHAReD + ESC-50) used for training and testing is included, along with a simple Python script to reproduce the ensemble model's confusion matrix seen in the publication.
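
    The stated input shapes follow directly from the 0.96 s window length and each model's sample rate; a minimal illustrative sketch with placeholder arrays (not the released models):

    # 0.96 s x 16000 Hz = 15360 samples -> D-YAMNet input shape (15360,)
    # 0.96 s x   800 Hz =   768 samples -> LFM input shape (1, 768) after adding a batch axis
    import numpy as np

    def window_for_model(waveform, sample_rate, window_s=0.96):
        """Cut the first 0.96 s window from a 1-D waveform array."""
        n = int(round(window_s * sample_rate))
        return waveform[:n]

    dyamnet_in = window_for_model(np.zeros(32000), 16000)
    lfm_in = window_for_model(np.zeros(1600), 800)[np.newaxis, :]
    print(dyamnet_in.shape, lfm_in.shape)  # (15360,) (1, 768)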

  4. Data from: Open Access Dataset and Toolbox of High-Density Surface...

    • physionet.org
    Updated Dec 28, 2023
    + more versions
    Cite
    Xinyu Jiang; Chenyun Dai; Xiangyu Liu; Jiahao Fan (2023). Open Access Dataset and Toolbox of High-Density Surface Electromyogram Recordings [Dataset]. http://doi.org/10.13026/hxan-pe94
    Explore at:
    Dataset updated
    Dec 28, 2023
    Authors
    Xinyu Jiang; Chenyun Dai; Xiangyu Liu; Jiahao Fan
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    We provide an open access dataset of High densitY Surface Electromyogram (HD-sEMG) Recordings (named "Hyser"). We acquired data from 20 subjects with each subject participating in our experiment twice on separate days following the same experiment paradigm. Our Hyser dataset contains five sub-datasets: (1) pattern recognition (PR) dataset acquired during 34 hand gestures, (2) maximal voluntary muscle contraction (MVC) dataset while subjects contracted each individual finger, (3) one-degree of freedom (DoF) dataset acquired during force-varying contraction of each individual finger, (4) N-DoF dataset acquired during prescribed contractions of combinations of multiple fingers, and (5) random task dataset acquired during random contraction of combinations of fingers without any prescribed force trajectory. Sub-dataset 1 can be used for gesture recognition studies. Sub-datasets 2-5 also recorded individual finger forces, thus can be used for studies on proportional control of neuroprostheses.

  5. EEG: silent and perceive speech on 30 spanish sentences

    • openneuro.org
    Updated Sep 27, 2022
    + more versions
    Cite
    Carlos Valle Araya; Carolina Mendez-Orellana; Maria Rodriguez-Fernandez (2022). EEG: silent and perceive speech on 30 spanish sentences [Dataset]. http://doi.org/10.18112/openneuro.ds004279.v1.0.0
    Explore at:
    Dataset updated
    Sep 27, 2022
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    Carlos Valle Araya; Carolina Mendez-Orellana; Maria Rodriguez-Fernandez
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    EEG: silent and perceive speech on 30 spanish sentences

    Large Spanish Speech EEG dataset

    Authors

    • Carlos Valle
    • Carolina Mendez-Orellana
    • María Rodríguez-Fernández

    Resources:

    Abstract: Decoding speech from brain activity can enable communication for individuals with speech disorders. Deep neural networks have shown great potential for speech decoding applications, but the large data sets required for these models are usually not available for neural recordings of speech impaired subjects. Harnessing data from other participants would thus be ideal to create speech neuroprostheses without the need of patient-specific training data. In this study, we recorded 60 sessions from 56 healthy participants using 64 EEG channels and developed a neural network capable of subject-independent classification of perceived sentences. We found that sentence identity can be decoded from subjects without prior training achieving higher accuracy than mixed-subject models. The development of subject-independent models eliminates the need to collect data from a target subject, reducing time and data collection costs during deployment. These results open new avenues for creating speech neuroprostheses when subjects cannot provide training data.

    Please contact us at this e-mail address if you have any questions: cgvalle@uc.cl

  6. Full HD Videos - Liveness Detection Dataset

    • kaggle.com
    zip
    Updated Aug 1, 2023
    + more versions
    Cite
    Unique Data (2023). Full HD Videos - Liveness Detection Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/full-hd-webcam-live-attacks
    Explore at:
    Available download formats: zip (623365842 bytes)
    Dataset updated
    Aug 1, 2023
    Authors
    Unique Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    High Resolution Live Attacks - Biometric Attack dataset

    The anti-spoofing dataset includes live-recorded anti-spoofing videos from around the world, captured with high-quality webcams at Full HD resolution and above. The videos were gathered by capturing the faces of genuine individuals presenting spoofs via facial presentations. Our dataset proposes a novel approach that learns and detects spoofing techniques, extracting features from the genuine facial images to prevent such information from being captured by fake users.

    The dataset is created on the basis of Phone and Webcam Video Dataset

    The dataset contains images and videos of real humans with various views, and colors, making it a comprehensive resource for researchers working on anti-spoofing technologies.

    👉 Legally sourced datasets and carefully structured for AI training and model development. Explore samples from our dataset of 95,000+ human images & videos - Full dataset


    The dataset provides data to combine and apply different techniques, approaches, and models to address the challenging task of distinguishing between genuine and spoofed inputs, providing effective anti-spoofing solutions in active authentication systems. These solutions are crucial as newer devices, such as phones, have become vulnerable to spoofing attacks due to the availability of technologies that can create replays, reflections, and depths, making them susceptible to spoofing and generalization.

    Our dataset also explores the use of neural architectures, such as deep neural networks, to facilitate the identification of distinguishing patterns and textures in different regions of the face, increasing the accuracy and generalizability of the anti-spoofing models.

    Webcam Resolution

    Videos are provided at a range of resolutions, from Full HD (1080p) up to 4K (2160p), including several intermediate resolutions such as QHD (1440p).


    Metadata

    Each attack instance is accompanied by the following details:

    • Unique attack identifier
    • Identifier of the user recording the attack
    • User's age
    • User's gender
    • User's country of origin
    • Attack resolution

    The model of the webcam is also specified.

    Metadata is represented in the file_info.csv.
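
    For orientation only, a pandas sketch for filtering the metadata; the column names are assumptions and should be checked against the actual header of file_info.csv:

    # Hypothetical sketch: load the metadata table and filter attacks by resolution.
    import pandas as pd

    meta = pd.read_csv("file_info.csv")
    print(meta.columns.tolist())                    # inspect the real column names first
    uhd = meta[meta["resolution"] == "2160p"]       # assumed column name and value
    print(uhd.head())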

    🧩 This is just an example of the data. Leave a request here to learn more

    🚀 You can learn more about our high-quality unique datasets here

    keywords: liveness detection systems, liveness detection dataset, biometric dataset, biometric data dataset, biometric system attacks, anti-spoofing dataset, face liveness detection, deep learning dataset, face spoofing database, face anti-spoofing, ibeta dataset, human video dataset, video dataset, high quality video dataset, hd video dataset, phone attack dataset, face anti spoofing, large-scale face anti spoofing, rich annotations anti spoofing dataset

  7. Metadata of a Large Sonar and Stereo Camera Dataset Suitable for...

    • data.niaid.nih.gov
    Updated Jul 8, 2024
    Cite
    Backe, Christian; Wehbe, Bilal; Bande, Miguel; Shah, Nimish; Cesar, Diego; Pribbernow, Max (2024). Metadata of a Large Sonar and Stereo Camera Dataset Suitable for Sonar-to-RGB Image Translation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10373153
    Explore at:
    Dataset updated
    Jul 8, 2024
    Dataset provided by
    German Research Center for Artificial Intelligence (DFKI)
    Kraken Robotik GmbH
    Authors
    Backe, Christian; Wehbe, Bilal; Bande, Miguel; Shah, Nimish; Cesar, Diego; Pribbernow, Max
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Metadata of a Large Sonar and Stereo Camera Dataset Suitable for Sonar-to-RGB Image Translation

    Introduction

    This is a set of metadata describing a large dataset of synchronized sonar and stereo camera recordings that were captured between August 2021 and September 2023 during the project DeeperSense (https://robotik.dfki-bremen.de/en/research/projects/deepersense/), as training data for Sonar-to-RGB image translation. Parts of the sensor data have been published (https://zenodo.org/records/7728089, https://zenodo.org/records/10220989). Due to the size of the sensor data corpus, it is currently impractical to make the entire corpus accessible online. Instead, this metadatabase serves as a relatively compact representation, allowing interested researchers to inspect the data and select relevant portions for their particular use case, which will be made available on demand. This is an effort to comply with the FAIR principle A2 (https://www.go-fair.org/fair-principles/) that metadata shall be accessible even when the base data is not immediately available.

    Locations and sensors

    The sensor data was captured at four different locations, including one laboratory (Maritime Exploration Hall at DFKI RIC Bremen) and three field locations (Chalk Lake Hemmoor, Tank Wash Basin Neu-Ulm, Lake Starnberg). At all locations, a ZED camera and a Blueprint Oculus M1200d sonar were used. Additionally, a SeaVision camera was used at the Maritime Exploration Hall at DFKI RIC Bremen and at the Chalk Lake Hemmoor. The examples/ directory holds a typical output image for each sensor at each available location.

    Data volume per session

    Six data collection sessions were conducted. The table below presents an overview of the amount of data captured in each session:

    Session dates           | Location                                     | Number of datasets | Total duration of datasets [h] | Total logfile size [GB] | Number of images | Total image size [GB]
    2021-08-09 - 2021-08-12 | Maritime Exploration Hall at DFKI RIC Bremen | 52                 | 10.8                           | 28.8                    | 389’047          | 88.1
    2022-02-07 - 2022-02-08 | Maritime Exploration Hall at DFKI RIC Bremen | 35                 | 4.4                            | 54.1                    | 629’626          | 62.3
    2022-04-26 - 2022-04-28 | Chalk Lake Hemmoor                           | 52                 | 8.1                            | 133.6                   | 1’114’281        | 97.8
    2022-06-28 - 2022-06-29 | Tank Wash Basin Neu-Ulm                      | 42                 | 6.7                            | 144.2                   | 824’969          | 26.9
    2023-04-26 - 2023-04-27 | Maritime Exploration Hall at DFKI RIC Bremen | 55                 | 7.4                            | 141.9                   | 739’613          | 9.6
    2023-09-01 - 2023-09-02 | Lake Starnberg                               | 19                 | 2.9                            | 40.1                    | 217’385          | 2.3
    Total                   |                                              | 255                | 40.3                           | 542.7                   | 3’914’921        | 287.0

    Data and metadata structure

    Sensor data corpus

    The sensor data corpus comprises two processing stages:

    raw data streams stored in ROS bagfiles (aka logfiles),

    camera and sonar images (aka datafiles) extracted from the logfiles.

    The files are stored in a file tree hierarchy which groups them by session, dataset, and modality:

    ${session_key}/
        ${dataset_key}/
            ${logfile_name}
            ${modality_key}/
                ${datafile_name}

    A typical logfile path has this form:

    2023-09_starnberg_lake/2023-09-02-15-06_hydraulic_drill/stereo_camera-zed-2023-09-02-15-06-07.bag

    A typical datafile path has this form:

    2023-09_starnberg_lake/2023-09-02-15-06_hydraulic_drill/zed_right/1693660038_368077993.jpg

    All directory and file names, and their particles, are designed to serve as identifiers in the metadatabase. Their formatting, as well as the definitions of all terms, are documented in the file entities.json.

    Metadatabase

    The metadatabase is provided in two equivalent forms:

    as a standalone SQLite (https://www.sqlite.org/index.html) database file metadata.sqlite for users familiar with SQLite,

    as a collection of CSV files in the csv/ directory for users who prefer other tools.

    The database file has been generated from the CSV files, so each database table holds the same information as the corresponding CSV file. In addition, the metadatabase contains a series of convenience views that facilitate access to certain aggregate information.

    An entity relationship diagram of the metadatabase tables is stored in the file entity_relationship_diagram.png. Each entity, its attributes, and relations are documented in detail in the file entities.json

    Some general design remarks:

    For convenience, timestamps are always given in both a human-readable form (ISO 8601 formatted datetime strings with explicit local time zone), and as seconds since the UNIX epoch.

    In practice, each logfile always contains a single stream, and each stream is always stored in a single logfile. Per the database schema, however, the entities stream and logfile are modeled separately, with a “many-streams-to-one-logfile” relationship. This design was chosen to be compatible with, and open for, data collections where a single logfile contains multiple streams.

    A modality is not an attribute of a sensor alone, but of a datafile: Because a sensor is an attribute of a stream, and a single stream may be the source of multiple modalities (e.g. RGB vs. grayscale images from the same camera, or cartesian vs. polar projection of the same sonar output). Conversely, the same modality may originate from different sensors.

    As a usage example, the data volume per session which is tabulated at the top of this document, can be extracted from the metadatabase with the following SQL query:

    SELECT
        PRINTF('%s - %s', SUBSTR(session_start, 1, 10), SUBSTR(session_end, 1, 10)) AS 'Session dates',
        location_name_english AS Location,
        number_of_datasets AS 'Number of datasets',
        total_duration_of_datasets_h AS 'Total duration of datasets [h]',
        total_logfile_size_gb AS 'Total logfile size [GB]',
        number_of_images AS 'Number of images',
        total_image_size_gb AS 'Total image size [GB]'
    FROM location
    JOIN session USING (location_id)
    JOIN (
        SELECT
            session_id,
            COUNT(dataset_id) AS number_of_datasets,
            ROUND(SUM(dataset_duration) / 3600, 1) AS total_duration_of_datasets_h,
            ROUND(SUM(total_logfile_size) / 10e9, 1) AS total_logfile_size_gb
        FROM location
        JOIN session USING (location_id)
        JOIN dataset USING (session_id)
        JOIN view_dataset_total_logfile_size USING (dataset_id)
        GROUP BY session_id
    ) USING (session_id)
    JOIN (
        SELECT
            session_id,
            COUNT(datafile_id) AS number_of_images,
            ROUND(SUM(datafile_size) / 10e9, 1) AS total_image_size_gb
        FROM session
        JOIN dataset USING (session_id)
        JOIN stream USING (dataset_id)
        JOIN datafile USING (stream_id)
        GROUP BY session_id
    ) USING (session_id)
    ORDER BY session_id;
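
    Equivalently, the metadatabase can be queried from Python with the standard-library sqlite3 module; a small sketch using table and column names that appear in entities.json and in the query above:

    # Sketch: count sessions per location in metadata.sqlite.
    import sqlite3

    con = sqlite3.connect("metadata.sqlite")
    rows = con.execute("""
        SELECT location_name_english, COUNT(session_id)
        FROM location JOIN session USING (location_id)
        GROUP BY location_id
    """).fetchall()
    for name, n_sessions in rows:
        print(f"{name}: {n_sessions} sessions")
    con.close()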

  8. CityTrek-14K

    • kaggle.com
    zip
    Updated Jan 13, 2024
    Cite
    Sobhan Moosavi (2024). CityTrek-14K [Dataset]. https://www.kaggle.com/datasets/sobhanmoosavi/citytrek-14k
    Explore at:
    Available download formats: zip (182314065 bytes)
    Dataset updated
    Jan 13, 2024
    Authors
    Sobhan Moosavi
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Description

    CityTrek-14K is a distinctive, extensive dataset that includes 14,000 trajectories from 280 drivers, each contributing 50 trajectories, in three major U.S. cities: Philadelphia (PA), Atlanta (GA), and Memphis (TN). It features a time series data set capturing details like timestamps, vehicle speeds, and GPS coordinates, with a collection frequency of 1Hz. Although the dataset includes location data, strict anonymization practices were adhered to, ensuring personal information like home or work addresses remain confidential. The CityTrek-14K dataset offers a comprehensive view of driving patterns, encompassing over 4,800 hours of driving data and spanning more than 189,000 miles, collected between July 2017 and March 2019. The dataset comprises two distinct files: the first is a summary of the trips, and the second is a trajectory data file that includes detailed records captured every second.
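
    A minimal pandas sketch for working with the trajectory file; the file and column names are assumptions and should be adapted to the actual CSV headers:

    # Hypothetical sketch: per-trip duration and mean speed from 1 Hz trajectory records.
    import pandas as pd

    traj = pd.read_csv("trajectories.csv", parse_dates=["timestamp"])   # assumed file/columns
    per_trip = traj.groupby("trip_id").agg(
        duration_s=("timestamp", lambda t: (t.max() - t.min()).total_seconds()),
        mean_speed=("speed", "mean"),
    )
    print(per_trip.head())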

    Acknowledgements

    If you use this dataset, please kindly cite the following paper:

    Moosavi, Sobhan, and Rajiv Ramnath. "Context-aware driver risk prediction with telematics data." Accident Analysis & Prevention 192 (2023): 107269.

    Data Collection Methodology

    The CityTrek-14K dataset was collected using specially designed devices installed in vehicles. These devices were configured to record and transmit data frequently. Further details about this data collection process are elaborated in the paper mentioned above.

    Potential Applications

    The CityTrek-14K dataset is versatile, suitable for numerous applications such as:

    • Traffic Modeling and ETA Prediction: The dataset contains detailed route information and travel times, making it an excellent resource for large-scale traffic modeling and ETA modeling techniques.
    • Route Optimization: With its detailed trajectory data, the dataset is ideal for developing and testing route optimization techniques, providing insights into efficient pathfinding methods.
    • Modeling and Analyzing Driver Behavior: As each driver in the dataset has exactly 50 trajectories recorded, this allows for a comprehensive analysis of driver behavior, offering a unique opportunity to study and model driving patterns and habits.

    Usage Policy and Legal Disclaimer

    This dataset is being distributed solely for research purposes under the Creative Commons Attribution-Noncommercial-ShareAlike license (CC BY-NC-SA 4.0). By downloading the dataset, you agree to use it only for non-commercial, research, or academic applications. If you use this dataset, it is necessary to cite the paper mentioned above.

    Inquiries or need help?

    For any inquiries or assistance, please contact Sobhan Moosavi at sobhan.mehr84@gmail.com

  9. Thai Wake Words & Voice Commands Speech Data

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Thai Wake Words & Voice Commands Speech Data [Dataset]. https://www.futurebeeai.com/dataset/wake-words-and-commands-dataset/wake-words-and-commands-thai-thailand
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Thai Wake Word & Voice Command Dataset is expertly curated to support the training and development of voice-activated systems. This dataset includes a large collection of wake words and command phrases, essential for enabling seamless user interaction with voice assistants and other speech-enabled technologies. It’s designed to ensure accurate wake word detection and voice command recognition, enhancing overall system performance and user experience.

    Speech Data

    This dataset includes 20,000+ audio recordings of wake words and command phrases. Each participant contributed 400 recordings, captured under varied environmental conditions and speaking speeds. The data covers:

    Wake words alone
    Wake words followed by command phrases

    Participant Diversity

    Speakers: 50 native Thai speakers from the FutureBeeAI community
    Regions: Participants from various Thailand provinces, ensuring broad coverage of accents and dialects
    Demographics: Ages 18–70; 60% male and 40% female participants

    Recording Details

    Type: Scripted wake words and command phrases
    Duration: 1 to 15 seconds per clip
    Format: WAV, stereo, 16-bit, with sample rates ranging from 16 kHz to 48 kHz

    Dataset Diversity

    Wake Word Types
    Automobile Wake Words: Hey Mercedes, Hey BMW, Hey Porsche, Hey Volvo, Hey Audi, Hi Genesis, Ok Ford, etc.
    Voice Assistant Wake Words: Hey Siri, Ok Google, Alexa, Hey Cortana, Hi Bixby, Hey Celia, etc.
    Home Appliance Wake Words: Hi LG, Ok LG, Hello Lloyd, and more
    Command Types by Use Case
    Automobile: Play music, check directions, voice search, provide feedback, and more
    Voice Assistant: Ask general questions, make calls, control devices, shopping, manage calendars, and more
    Home Appliances: Control appliances, check status, set reminders/alarms, manage shopping lists, etc.
    Recording Environments
    No background noise
    Background traffic noise
    People talking in the background
    Speaking Pace
    Normal speed
    Fast speed

    This diversity ensures robust training for real-world voice assistant applications.

    Metadata

    Each audio file is accompanied by detailed metadata to support advanced filtering and training needs.

    Participant Metadata: Unique ID, age, gender, region, accent, dialect
    Recording Metadata: Transcript, environment, pace, device used, sample rate, bit depth, file format
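
    A small sketch for checking a clip's recording properties with the soundfile library (the file name is a placeholder):

    # Verify sample rate, channel count, and bit depth of one recording.
    import soundfile as sf

    info = sf.info("wake_word_sample.wav")               # placeholder file name
    print(info.samplerate, info.channels, info.subtype)  # e.g. 48000, 2, 'PCM_16'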

    Use Cases & Applications

    Voice Assistant Activation: Train models to accurately detect and trigger based on wake words
    Smart Home Devices: Enable responsive voice control in smart appliances

  10. Data from: The COUGHVID crowdsourcing dataset: A corpus for the study of...

    • live.european-language-grid.eu
    webm
    Updated May 1, 2024
    + more versions
    Cite
    (2024). The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7851
    Explore at:
    Available download formats: webm
    Dataset updated
    May 1, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    Cough audio signal classification has been successfully used to diagnose a variety of respiratory conditions, and there has been significant interest in leveraging Machine Learning (ML) to provide widespread COVID-19 screening. The COUGHVID dataset provides over 20,000 crowdsourced cough recordings representing a wide range of subject ages, genders, geographic locations, and COVID-19 statuses. Furthermore, experienced pulmonologists labeled more than 2,000 recordings to diagnose medical abnormalities present in the coughs, thereby contributing one of the largest expert-labeled cough datasets in existence that can be used for a plethora of cough audio classification tasks. As a result, the COUGHVID dataset contributes a wealth of cough recordings for training ML models to address the world’s most urgent health crises.

    Private Set and Testing Protocol

    Researchers interested in testing their models on the private test dataset should contact us at coughvid@epfl.ch, briefly explaining the type of validation they want to make and the results they obtained through cross-validation with the public data. Then, access to the unlabeled recordings will be provided, and the researchers should send the predictions of their models on these recordings. Finally, the performance metrics of the predictions will be sent to the researchers.

  11. US English Wake Words & Voice Commands Speech Data

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). US English Wake Words & Voice Commands Speech Data [Dataset]. https://www.futurebeeai.com/dataset/wake-words-and-commands-dataset/wake-words-and-commands-english-us
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The US English Wake Word & Voice Command Dataset is expertly curated to support the training and development of voice-activated systems. This dataset includes a large collection of wake words and command phrases, essential for enabling seamless user interaction with voice assistants and other speech-enabled technologies. It’s designed to ensure accurate wake word detection and voice command recognition, enhancing overall system performance and user experience.

    Speech Data

    This dataset includes 20,000+ audio recordings of wake words and command phrases. Each participant contributed 400 recordings, captured under varied environmental conditions and speaking speeds. The data covers:

    Wake words alone
    Wake words followed by command phrases

    Participant Diversity

    Speakers: 50 native US English speakers from the FutureBeeAI community
    Regions: Participants from various US states, ensuring broad coverage of accents and dialects
    Demographics: Ages 18–70; 60% male and 40% female participants

    Recording Details

    Type: Scripted wake words and command phrases
    Duration: 1 to 15 seconds per clip
    Format: WAV, stereo, 16-bit, with sample rates ranging from 16 kHz to 48 kHz

    Dataset Diversity

    Wake Word Types
    Automobile Wake Words: Hey Mercedes, Hey BMW, Hey Porsche, Hey Volvo, Hey Audi, Hi Genesis, Ok Ford, etc.
    Voice Assistant Wake Words: Hey Siri, Ok Google, Alexa, Hey Cortana, Hi Bixby, Hey Celia, etc.
    Home Appliance Wake Words: Hi LG, Ok LG, Hello Lloyd, and more
    Command Types by Use Case
    Automobile: Play music, check directions, voice search, provide feedback, and more
    Voice Assistant: Ask general questions, make calls, control devices, shopping, manage calendars, and more
    Home Appliances: Control appliances, check status, set reminders/alarms, manage shopping lists, etc.
    Recording Environments
    No background noise
    Background traffic noise
    People talking in the background
    Speaking Pace
    Normal speed
    Fast speed

    This diversity ensures robust training for real-world voice assistant applications.

    Metadata

    Each audio file is accompanied by detailed metadata to support advanced filtering and training needs.

    Participant Metadata: Unique ID, age, gender, region, accent, dialect
    Recording Metadata: Transcript, environment, pace, device used, sample rate, bit depth, file format

    Use Cases & Applications

    Voice Assistant Activation: Train models to accurately detect and trigger based on wake words
    Smart Home Devices: Enable responsive voice control in smart appliances

  12. Bulk Bookstore dataset

    • crawlfeeds.com
    csv, zip
    Updated Apr 27, 2025
    Cite
    Crawl Feeds (2025). Bulk Bookstore dataset [Dataset]. https://crawlfeeds.com/datasets/bulk-bookstore-dataset
    Explore at:
    Available download formats: zip, csv
    Dataset updated
    Apr 27, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Bulk Bookstore is an online bookstore. The Crawl Feeds team extracted a few sample records for analysis purposes. Last crawled on 27 Nov 2021.

  13. The big dataset of ultra-marathon running

    • kaggle.com
    zip
    Updated Jul 12, 2023
    Cite
    David (2023). The big dataset of ultra-marathon running [Dataset]. https://www.kaggle.com/datasets/aiaiaidavid/the-big-dataset-of-ultra-marathon-running
    Explore at:
    Available download formats: zip (258022817 bytes)
    Dataset updated
    Jul 12, 2023
    Authors
    David
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    According to Wikipedia, an ultramarathon, also called ultra distance or ultra running, is any footrace longer than the traditional marathon length of 42.195 kilometres (26 mi 385 yd). Various distances are raced competitively, from the shortest common ultramarathon of 31 miles (50 km) to over 200 miles (320 km). 50 km and 100 km are both World Athletics record distances, but some 100 miles (160 km) races are among the oldest and most prestigious events, especially in North America.

    The data in this file is a large collection of ultra-marathon race records registered between 1798 and 2022, a period of well over two centuries, making it a formidable long-term sample. All data was obtained from public websites.

    Although the original data is in the public domain, the race records, which originally contained the athletes' names, have been anonymized to comply with data protection laws and to preserve the athletes' privacy. However, a column Athlete ID has been created with a numerical ID representing each unique runner (so if Antonio Fernández participated in 5 races over different years, the corresponding race records now hold his unique Athlete ID instead of his name). This way I have preserved valuable information.

    The dataset contains 7,461,226 ultra-marathon race records from 1,641,168 unique athletes.

    The following columns (with data types) are included:

    • Year of event (int64)
    • Event dates (object)
    • Event name (object)
    • Event distance/length (object)
    • Event number of finishers (int64)
    • Athlete performance (object)
    • Athlete club (object)
    • Athlete country (object)
    • Athlete year of birth (float64)
    • Athlete gender (object)
    • Athlete age category (object)
    • Athlete average speed (object)
    • Athlete ID (int64)

    The Event name column includes country/location information that can be extracted into a new column; similarly, seasonal information beyond the Year of event can be found in the Event dates column (both can be derived with a bit of processing).

    The Event distance/length column describes the type of race, covering the most popular UM race distances and lengths, and some other specific modalities (multi-day, etc.):

    • Distances: 50km, 100km, 50mi, 100mi
    • Lengths: 6h, 12h, 24h, 48h, 72h, 6d, 10d

    Additionally, other columns provide information on age, gender, and speed (in km/h).
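
    A hedged pandas sketch of the derivations mentioned above; the file name and the exact formats of Event name and Event dates are assumptions, so inspect a few rows and adjust the patterns:

    # Extract a country code from "Event name" and a month from "Event dates".
    import pandas as pd

    df = pd.read_csv("ultra_marathon_races.csv")   # placeholder file name
    # assumes event names like "Some Ultra Trail (FRA)" and dates like "02.01.2018"
    df["Event country"] = df["Event name"].str.extract(r"\((\w+)\)\s*$", expand=False)
    df["Event month"] = df["Event dates"].str.extract(r"\.(\d{2})\.", expand=False)
    print(df[["Event name", "Event country", "Event dates", "Event month"]].head())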

  14. PTB-XL, a large publicly available electrocardiography dataset

    • physionet.org
    • maplerate.net
    Updated Nov 9, 2022
    + more versions
    Cite
    Patrick Wagner; Nils Strodthoff; Ralf-Dieter Bousseljot; Wojciech Samek; Tobias Schaeffter (2022). PTB-XL, a large publicly available electrocardiography dataset [Dataset]. http://doi.org/10.13026/kfzx-aw45
    Explore at:
    Dataset updated
    Nov 9, 2022
    Authors
    Patrick Wagner; Nils Strodthoff; Ralf-Dieter Bousseljot; Wojciech Samek; Tobias Schaeffter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Electrocardiography (ECG) is a key diagnostic tool to assess the cardiac condition of a patient. Automatic ECG interpretation algorithms as diagnosis support systems promise large reliefs for the medical personnel - only on the basis of the number of ECGs that are routinely taken. However, the development of such algorithms requires large training datasets and clear benchmark procedures. In our opinion, both aspects are not covered satisfactorily by existing freely accessible ECG datasets.

    The PTB-XL ECG dataset is a large dataset of 21799 clinical 12-lead ECGs from 18869 patients of 10 second length. The raw waveform data was annotated by up to two cardiologists, who assigned potentially multiple ECG statements to each record. The in total 71 different ECG statements conform to the SCP-ECG standard and cover diagnostic, form, and rhythm statements. To ensure comparability of machine learning algorithms trained on the dataset, we provide recommended splits into training and test sets. In combination with the extensive annotation, this turns the dataset into a rich resource for the training and the evaluation of automatic ECG interpretation algorithms. The dataset is complemented by extensive metadata on demographics, infarction characteristics, likelihoods for diagnostic ECG statements as well as annotated signal properties.
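
    A minimal sketch for loading a record with the wfdb package, assuming the usual PhysioNet layout (a ptbxl_database.csv metadata table plus WFDB signal files; verify the file names against the actual release):

    # Read the metadata table and one 100 Hz 12-lead record.
    import pandas as pd
    import wfdb

    meta = pd.read_csv("ptbxl_database.csv", index_col="ecg_id")   # assumed file name
    record_path = meta["filename_lr"].iloc[0]      # low-rate (100 Hz) version of the first record
    signal, fields = wfdb.rdsamp(record_path)      # run from the dataset root so the path resolves
    print(signal.shape, fields["sig_name"])        # e.g. (1000, 12) and the 12 lead names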

  15. Data from: Datasets used to train the Generative Adversarial Networks used...

    • opendata.cern.ch
    Updated 2021
    Cite
    ATLAS collaboration (2021). Datasets used to train the Generative Adversarial Networks used in ATLFast3 [Dataset]. http://doi.org/10.7483/OPENDATA.ATLAS.UXKX.TXBN
    Explore at:
    Dataset updated
    2021
    Dataset provided by
    CERN Open Data Portal
    Authors
    ATLAS collaboration
    Description

    Three datasets are available, each consisting of 15 CSV files. Each file contains the voxelised shower information obtained from single particles produced at the front of the calorimeter in the |η| range 0.2-0.25, simulated in the ATLAS detector. Two datasets contain photon events with different statistics; the larger sample has about 10 times the number of events of the other. The third dataset contains pions. The pion dataset and the photon dataset with the lower statistics were used to train the corresponding two GANs presented in the AtlFast3 paper SIMU-2018-04.

    The information in each file is a table; the rows correspond to the events and the columns to the voxels. The voxelisation procedure is described in the AtlFast3 paper linked above and in the dedicated PUB note ATL-SOFT-PUB-2020-006. In summary, the detailed energy deposits produced by ATLAS were converted from x,y,z coordinates to local cylindrical coordinates defined around the particle 3-momentum at the entrance of the calorimeter. The energy deposits in each layer were then grouped in voxels and for each voxel the energy was stored in the csv file. For each particle, there are 15 files corresponding to the 15 energy points used to train the GAN. The name of the csv file defines both the particle and the energy of the sample used to create the file.

    The size of the voxels is described in the binning.xml file. Software tools to read the XML file and manipulate the spatial information of voxels are provided in the FastCaloGAN repository.
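
    A small pandas sketch for reading one of the CSV files (rows are events, columns are voxels); the file name is a placeholder, since the actual names encode the particle type and energy point:

    # Total deposited energy per event = sum over voxel columns.
    import pandas as pd

    showers = pd.read_csv("photons_sample_energy_point.csv")   # placeholder file name
    total_energy = showers.sum(axis=1)
    print(showers.shape, total_energy.describe())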

    Updated on February 10th 2022. A new dataset photons_samples_highStat.tgz was added to this record and the binning.xml file was updated accordingly.

    Updated on April 18th 2023. A new dataset pions_samples_highStat.tgz was added to this record.

  16. New 1000 Sales Records Data 2

    • kaggle.com
    zip
    Updated Jan 12, 2023
    Cite
    Calvin Oko Mensah (2023). New 1000 Sales Records Data 2 [Dataset]. https://www.kaggle.com/datasets/calvinokomensah/new-1000-sales-records-data-2
    Explore at:
    Available download formats: zip (49305 bytes)
    Dataset updated
    Jan 12, 2023
    Authors
    Calvin Oko Mensah
    Description

    This is a dataset downloaded from excelbianalytics.com, created using random VBA logic. I recently performed an extensive exploratory data analysis on it and added new columns, namely Unit margin, Order year, Order month, Order weekday, and Order_Ship_Days, which I think can help with analysis of the data. I shared it because I thought it was a great dataset for newbies like myself to practice analytical processes on.

  17. Acoustic features as a tool to visualize and explore marine soundscapes:...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Feb 15, 2024
    Cite
    Simone Cominelli; Nicolo' Bellin; Carissa D. Brown; Jack Lawson (2024). Acoustic features as a tool to visualize and explore marine soundscapes: Applications illustrated using marine mammal Passive Acoustic Monitoring datasets [Dataset]. http://doi.org/10.5061/dryad.3bk3j9kn8
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 15, 2024
    Dataset provided by
    University of Parma
    Fisheries and Oceans Canada
    Memorial University of Newfoundland
    Authors
    Simone Cominelli; Nicolo' Bellin; Carissa D. Brown; Jack Lawson
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Passive Acoustic Monitoring (PAM) is emerging as a solution for monitoring species and environmental change over large spatial and temporal scales. However, drawing rigorous conclusions based on acoustic recordings is challenging, as there is no consensus over which approaches and indices are best suited for characterizing marine and terrestrial acoustic environments. Here, we describe the application of multiple machine-learning techniques to the analysis of a large PAM dataset. We combine pre-trained acoustic classification models (VGGish, NOAA & Google Humpback Whale Detector), dimensionality reduction (UMAP), and balanced random forest algorithms to demonstrate how machine-learned acoustic features capture different aspects of the marine environment. The UMAP dimensions derived from VGGish acoustic features exhibited good performance in separating marine mammal vocalizations according to species and locations. RF models trained on the acoustic features performed well for labelled sounds in the 8 kHz range; however, low- and high-frequency sounds could not be classified using this approach. The workflow presented here shows how acoustic feature extraction, visualization, and analysis allow for establishing a link between ecologically relevant information and PAM recordings at multiple scales. The datasets and scripts provided in this repository allow replicating the results presented in the publication.

    Methods

    Data acquisition and preparation

    We collected all records available in the Watkins Marine Mammal Database website listed under the “all cuts” page. For each audio file in the WMD, the associated metadata included a label for the sound sources present in the recording (biological, anthropogenic, and environmental), as well as information related to the location and date of recording. To minimize the presence of unwanted sounds in the samples, we only retained audio files with a single source listed in the metadata. We then labelled the selected audio clips according to taxonomic group (Odontocetae, Mysticetae) and species. We limited the analysis to 12 marine mammal species by discarding data when a species: had less than 60 s of audio available, had a vocal repertoire extending beyond the resolution of the acoustic classification model (VGGish), or was recorded in a single country. To determine if a species was suited for analysis using VGGish, we inspected the Mel-spectrograms of 3-s audio samples and only retained species with vocalizations that could be captured in the Mel-spectrogram (Appendix S1). The vocalizations of species that produce very low or very high frequencies were not captured by the Mel-spectrogram, thus we removed them from the analysis. To ensure that records included the vocalizations of multiple individuals for each species, we only considered species with records from two or more different countries. Lastly, to avoid overrepresentation of sperm whale vocalizations, we excluded 30,000 sperm whale recordings collected in the Dominican Republic. The resulting dataset consisted of 19,682 audio clips with a duration of 960 milliseconds each (0.96 s) (Table 1).

    The Placentia Bay Database (PBD) includes recordings collected by Fisheries and Oceans Canada in Placentia Bay (Newfoundland, Canada) in 2019. The dataset consisted of two months of continuous recordings (1230 hours), starting on July 1st, 2019, and ending on August 31st, 2019.
    The data was collected using an AMAR G4 hydrophone (sensitivity: -165.02 dB re 1V/µPa at 250 Hz) deployed at 64 m of depth. The hydrophone was set to operate following 15 min cycles, with the first 60 s sampled at 512 kHz and the remaining 14 min sampled at 64 kHz. For the purpose of this study, we limited the analysis to the 64 kHz recordings.

    Acoustic feature extraction

    The audio files from the WMD and PBD databases were used as input for VGGish (Abu-El-Haija et al., 2016; Chung et al., 2018), a CNN developed and trained to perform general acoustic classification. VGGish was trained on the Youtube8M dataset, containing more than two million user-labelled audio-video files. Rather than focusing on the final output of the model (i.e., the assigned labels), here the model was used as a feature extractor (Sethi et al., 2020). VGGish converts audio input into a semantically meaningful vector consisting of 128 features. The model returns features at multiple resolutions: ~1 s (960 ms); ~5 s (4800 ms); ~1 min (59’520 ms); ~5 min (299’520 ms). All of the visualizations and results pertaining to the WMD were prepared using the finest feature resolution of ~1 s. The visualizations and results pertaining to the PBD were prepared using the ~5 s features for the humpback whale detection example, and were then averaged to an interval of 30 min in order to match the temporal resolution of the environmental measures available for the area.

    UMAP ordination and visualization

    UMAP is a non-linear dimensionality reduction algorithm based on the concept of topological data analysis which, unlike other dimensionality reduction techniques (e.g., tSNE), preserves both the local and global structure of multivariate datasets (McInnes et al., 2018). To allow for data visualization and to reduce the 128 features to two dimensions for further analysis, we applied Uniform Manifold Approximation and Projection (UMAP) to both datasets and inspected the resulting plots. The UMAP algorithm generates a low-dimensional representation of a multivariate dataset while maintaining the relationships between points in the global dataset structure (i.e., the 128 features extracted from VGGish). Each point in a UMAP plot in this paper represents an audio sample with a duration of ~1 second (WMD dataset), ~5 seconds (PBD dataset, humpback whale detections), or 30 minutes (PBD dataset, environmental variables). Each point in the two-dimensional UMAP space also represents a vector of 128 VGGish features. The nearer two points are in the plot space, the nearer they are in the 128-dimensional space, and thus the distance between two points in UMAP reflects the degree of similarity between two audio samples in our datasets. Areas with a high density of samples in UMAP space should, therefore, contain sounds with similar characteristics, and such similarity should decrease with increasing point distance. Previous studies illustrated how VGGish and UMAP can be applied to the analysis of terrestrial acoustic datasets (Heath et al., 2021; Sethi et al., 2020). The visualizations and classification trials presented here illustrate how the two techniques (VGGish and UMAP) can be used together for marine ecoacoustics analysis. UMAP visualizations were prepared using the umap-learn package for the Python programming language (version 3.10). All UMAP visualizations presented in this study were generated using the algorithm’s default parameters.
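
    A minimal sketch of this projection step using umap-learn with default parameters; the feature matrix below is random placeholder data standing in for the 128-dimensional VGGish features:

    # Project n_samples x 128 features to two dimensions.
    import numpy as np
    import umap

    features = np.random.rand(1000, 128)               # stand-in for VGGish embeddings
    embedding = umap.UMAP().fit_transform(features)    # default parameters, as in the study
    print(embedding.shape)                             # (1000, 2)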

    Labelling sound sources

    The labels for the WMD records (i.e., taxonomic group, species, location) were obtained from the database metadata. For the PBD recordings, we obtained measures of wind speed, surface temperature, and current speed (Fig 1) from an oceanographic buoy located in proximity to the recorder. We chose these three variables for their different contributions to background noise in marine environments. Wind speed contributes to underwater background noise at multiple frequencies, ranging from 500 Hz to 20 kHz (Hildebrand et al., 2021). Sea surface temperature contributes to background noise at frequencies between 63 Hz and 125 Hz (Ainslie et al., 2021), while ocean currents contribute to ambient noise at frequencies below 50 Hz (Han et al., 2021). Prior to analysis, we categorized the environmental variables and assigned the categories as labels to the acoustic features (Table 2). Humpback whale vocalizations in the PBD recordings were processed using the humpback whale acoustic detector created by NOAA and Google (Allen et al., 2021), providing a model score for every ~5 s sample. This model was trained on a large dataset (14 years and 13 locations) using humpback whale recordings annotated by experts (Allen et al., 2021). The model returns scores ranging from 0 to 1 indicating the confidence in the predicted humpback whale presence. We used the results of this detection model to label the PBD samples according to the presence of humpback whale vocalizations. To verify the model results, we inspected all audio files that contained a 5 s sample with a model score higher than 0.9 for the month of July. If the presence of a humpback whale was confirmed, we labelled the segment as a model detection. We labelled any additional humpback whale vocalization present in the inspected audio files as a visual detection, while we labelled other sources and background noise samples as absences. In total, we labelled 4.6 hours of recordings. We reserved the recordings collected in August to test the precision of the final predictive model.

    Label prediction performance

    We used Balanced Random Forest models (BRF) provided in the imbalanced-learn python package (Lemaître et al., 2017) to predict humpback whale presence and environmental conditions from the acoustic features generated by VGGish. We chose BRF as the algorithm as it is suited for datasets characterized by class imbalance. The BRF algorithm performs under-sampling of the majority class prior to prediction, allowing it to overcome class imbalance (Lemaître et al., 2017). For each model run, the PBD dataset was split into training (80%) and testing (20%) sets. The training datasets were used to fine-tune the models through a nested k-fold cross validation approach with ten folds in the outer loop and five folds in the inner loop. We selected nested cross validation as it allows optimizing model hyperparameters and performing model evaluation in a single step. We used the default parameters of the BRF algorithm, except for the ‘n_estimators’ hyperparameter, for which we tested

  18. Company Records - Dataset - CRO

    • opendata.cro.ie
    Updated Dec 1, 2024
    + more versions
    Cite
    cro.ie (2024). Company Records - Dataset - CRO [Dataset]. https://opendata.cro.ie/dataset/companies
    Explore at:
    Dataset updated
    Dec 1, 2024
    Dataset provided by
    Companies Registration Office
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides a structured and machine-readable register of all companies recorded by the Companies Registration Office (CRO) in Ireland. It includes a daily snapshot of company records, covering both currently registered companies and historical records of dissolved or closed entities. The dataset aligns with the European Union’s Open Data Directive (Directive (EU) 2019/1024) and the Implementing Regulation (EU) 2023/138, which designates company and company ownership data as a high-value dataset. Updated daily, it ensures timely access to corporate information and is available for bulk download and API access under the Creative Commons Attribution 4.0 (CC BY 4.0) licence, allowing unrestricted reuse with appropriate attribution. By increasing transparency, accountability, and economic innovation, this dataset supports public sector initiatives, research, and digital services development.

  19. Download Flipkart E-commerce dataset

    • crawlfeeds.com
    json, zip
    Updated Dec 9, 2024
    Cite
    Crawl Feeds (2024). Download Flipkart E-commerce dataset [Dataset]. https://crawlfeeds.com/datasets/download-flipkart-e-commerce-dataset
    Explore at:
    Available download formats: json, zip
    Dataset updated
    Dec 9, 2024
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Unlock the power of Flipkart's extensive product catalog with our meticulously curated e-commerce dataset. This dataset provides detailed information on a wide range of products available on Flipkart, including product names, descriptions, prices, customer reviews, ratings, and images. Whether you're working on data analysis, machine learning models, or conducting in-depth market research, this dataset is an invaluable resource.

    With our Flipkart e-commerce dataset, you can easily analyze trends, compare products, and gain insights into consumer behavior. The dataset is structured and high-quality, ensuring that you have the best foundation for your projects.

    Flipkart is the largest e-commerce website based out of India. This pre-crawled dataset contains more than 5.7 million records.

    Where to use dataset

    • Train machine learning algorithms
    • Check discounts across various category fields
    • Find out which brands and categories have the best discounts

  20. Synthetic UML Diagram Dataset (PlantUML)

    • zenodo.org
    zip
    Updated May 28, 2025
    Cite
    Averi Bates; Averi Bates; Chongle Pan; Chongle Pan (2025). Synthetic UML Diagram Dataset (PlantUML) [Dataset]. http://doi.org/10.5281/zenodo.15103682
    Explore at:
    Available download formats: zip
    Dataset updated
    May 28, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Averi Bates; Averi Bates; Chongle Pan; Chongle Pan
    License

    http://www.apache.org/licenses/LICENSE-2.0

    Description

    This dataset comprises synthetic UML diagrams, focusing explicitly on activity and sequence diagrams generated with PlantUML, a text-based tool for creating visual diagrams. By leveraging randomized text strings based on PlantUML syntax, we produced a diverse and scalable collection that emulates standard UML diagrams. Each diagram is accompanied by its corresponding PlantUML code, facilitating a clear understanding of the visual representation's textual foundation. Data from the smaller datasets is reused in the larger datasets, as each model was trained on the data separately, as described in the original paper. It is recommended to use just the Extra Large dataset when you are interested in the data in its entirety.

    Each category is divided into four subsets based on size (approximately):

    • Small: 6,000 training diagrams and 1,500 testing diagrams.

    • Medium: 12,000 training diagrams and 3,000 testing diagrams.

    • Large: 24,000 training diagrams and 6,000 testing diagrams.

    • Extra Large: 120,000 training diagrams and 30,000 testing diagrams.
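
    For illustration only (this is not the authors' generator), a sketch of producing a randomized PlantUML sequence-diagram string of the kind paired with each rendered diagram:

    # Build a random PlantUML sequence diagram from random participant and message names.
    import random
    import string

    def random_name(k=6):
        return "".join(random.choices(string.ascii_lowercase, k=k))

    def random_sequence_diagram(n_messages=5):
        actors = [random_name() for _ in range(3)]
        lines = ["@startuml"]
        for _ in range(n_messages):
            src, dst = random.sample(actors, 2)
            lines.append(f"{src} -> {dst}: {random_name(8)}")
        lines.append("@enduml")
        return "\n".join(lines)

    print(random_sequence_diagram())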
