Terms: https://crawlfeeds.com/privacy_policy
Explore our extensive Booking Hotel Reviews Large Dataset, featuring over 20.8 million records of detailed customer feedback from hotels worldwide. Whether you're conducting sentiment analysis, market research, or competitive benchmarking, this dataset provides invaluable insights into customer experiences and preferences.
The dataset includes crucial information such as reviews, ratings, comments, and more, all sourced from travellers who booked through Booking.com. It's an ideal resource for businesses aiming to understand guest sentiments, improve service quality, or refine marketing strategies within the hospitality sector.
With this hotel reviews dataset, you can dive deep into trends and patterns that reveal what customers truly value during their stays. Whether you're analyzing reviews for sentiment analysis or studying traveller feedback from specific regions, this dataset delivers the insights you need.
Ready to get started? Download the complete hotel review dataset or connect with the Crawl Feeds team to request records tailored to specific countries or regions. Unlock the power of data and take your hospitality analysis to the next level!
Access 3 million+ US hotel reviews — submit your request today.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
⚠️ Note! This is the MIDI-only archive. If you need the WAV alternatives for your work, please download the full dataset from their website: https://magenta.tensorflow.org/datasets/e-gmd
Quoted from the original website:
The Expanded Groove MIDI Dataset (E-GMD) is a large dataset of human drum performances, with audio recordings annotated in MIDI. E-GMD contains 444 hours of audio from 43 drum kits and is an order of magnitude larger than similar datasets. It is also the first human-performed drum transcription dataset with annotations of velocity. It is based on our previously released Groove MIDI Dataset.
This dataset is an expansion of the Groove MIDI Dataset (GMD). GMD is a dataset of human drum performances recorded in MIDI format on a Roland TD-11 electronic drum kit. To make the dataset applicable to ADT, we expanded it by re-recording the GMD sequences on 43 drumkits using a Roland TD-17. The kits range from electronic (e.g., 808, 909) to acoustic sounds. Recording was done at 44.1kHz and 24 bits and aligned within 2ms of the original MIDI files.
We maintained the same train, test and validation splits across sequences that GMD had. Because each kit was recorded for every sequence, we see all 43 kits in the train, test and validation splits.
| Split | Unique Sequences | Total Sequences | Duration (hours) |
|---|---|---|---|
| Train | 819 | 35,217 | 341.4 |
| Test | 123 | 5,289 | 50.9 |
| Validation | 117 | 5,031 | 52.2 |
| Total | 1,059 | 45,537 | 444.5 |
Given the semi-manual nature of the pipeline, there were some errors in the recording process that resulted in unusable tracks. If your application requires only symbolic drum data, we recommend using the original data from the Groove MIDI Dataset.
For more information about how the dataset was created and several applications of it, please see the paper where it was introduced: Improving Perceptual Quality of Drum Transcription with the Expanded Groove MIDI Dataset.
Lee Callender, Curtis Hawthorne, and Jesse Engel. "Improving Perceptual Quality of Drum Transcription with the Expanded Groove MIDI Dataset." 2020. arXiv:2004.00188.
For citations, please use:
@misc{callender2020improving,
title={Improving Perceptual Quality of Drum Transcription with the Expanded Groove MIDI Dataset},
author={Lee Callender and Curtis Hawthorne and Jesse Engel},
year={2020},
eprint={2004.00188},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
I have no contribution to or affiliation with this work; I just uploaded it and made it available on Kaggle.
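If you just need to iterate over the MIDI files by split, a minimal loading sketch is below; it assumes the archive ships a metadata CSV with split and midi_filename columns (the file and column names are assumptions, so adjust them to the actual archive layout):

```python
# Minimal sketch: select training-split MIDI files and count drum notes.
# The metadata CSV name and its "split"/"midi_filename" columns are assumptions.
import pandas as pd
import pretty_midi

meta = pd.read_csv("e-gmd-v1.0.0.csv")              # hypothetical file name
train = meta[meta["split"] == "train"]

for midi_path in train["midi_filename"].head(5):
    pm = pretty_midi.PrettyMIDI(midi_path)
    drum_notes = sum(len(inst.notes) for inst in pm.instruments if inst.is_drum)
    print(midi_path, f"{pm.get_end_time():.1f} s", drum_notes, "drum notes")
```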
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Collection of signals from high explosives recorded on a smartphone sensor network, available as a pandas DataFrame. The dataset is accompanied by two machine learning models (LFM and D-YAMNet) that were trained for explosion detection using SHAReD and the ESC-50 dataset. There are 326 sets of signals from 70 high-explosive events. The sensors included are the following: microphone, accelerometer, barometer, and Global Navigation Satellite Systems (GNSS).
*The extended dataset only includes microphone data and information about the explosion.
*D-YAMNet takes 0.96 seconds of audio at a 16 kHz sample rate with an input shape of (15360,).
*LFM takes 0.96 seconds of audio at an 800 Hz sample rate with an input shape of (1, 768).
For ease of use of the machine learning models (LFM and D-YAMNet), the dataset (SHAReD + ESC-50) used for training and testing is included, along with a simple Python script to produce the ensemble model's confusion matrix seen in the publication.
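As a rough illustration of the stated input shapes (not the authors' preprocessing pipeline), the sketch below slices a microphone signal into 0.96 s windows for the two models; the placeholder signal and its original sample rate are assumptions:

```python
# Sketch: cut a 1-D microphone signal into 0.96 s windows matching the stated
# model inputs: D-YAMNet expects (15360,) at 16 kHz, LFM expects (1, 768) at 800 Hz.
import numpy as np
from scipy.signal import resample_poly

fs_orig = 8000                                               # assumed recording rate
audio = np.random.randn(fs_orig * 10).astype(np.float32)     # placeholder signal

def windows(x, fs, win_seconds=0.96):
    n = int(round(fs * win_seconds))
    n_win = len(x) // n
    return x[: n_win * n].reshape(n_win, n)

# D-YAMNet input: 0.96 s at 16 kHz -> 15360 samples per window
audio_16k = resample_poly(audio, 16000, fs_orig)
dyamnet_in = windows(audio_16k, 16000)                       # (n_windows, 15360)

# LFM input: 0.96 s at 800 Hz -> 768 samples, reshaped to (1, 768)
audio_800 = resample_poly(audio, 800, fs_orig)
lfm_in = windows(audio_800, 800)[:, np.newaxis, :]           # (n_windows, 1, 768)

print(dyamnet_in.shape, lfm_in.shape)
```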
License: Open Data Commons Attribution License (ODC-By) v1.0, https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
We provide an open access dataset of High densitY Surface Electromyogram (HD-sEMG) Recordings (named "Hyser"). We acquired data from 20 subjects with each subject participating in our experiment twice on separate days following the same experiment paradigm. Our Hyser dataset contains five sub-datasets: (1) pattern recognition (PR) dataset acquired during 34 hand gestures, (2) maximal voluntary muscle contraction (MVC) dataset while subjects contracted each individual finger, (3) one-degree of freedom (DoF) dataset acquired during force-varying contraction of each individual finger, (4) N-DoF dataset acquired during prescribed contractions of combinations of multiple fingers, and (5) random task dataset acquired during random contraction of combinations of fingers without any prescribed force trajectory. Sub-dataset 1 can be used for gesture recognition studies. Sub-datasets 2-5 also recorded individual finger forces, thus can be used for studies on proportional control of neuroprostheses.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
EEG: silent and perceived speech on 30 Spanish sentences (Large Spanish Speech EEG dataset)
Authors
Resources:
Abstract: Decoding speech from brain activity can enable communication for individuals with speech disorders. Deep neural networks have shown great potential for speech decoding applications, but the large data sets required for these models are usually not available for neural recordings of speech impaired subjects. Harnessing data from other participants would thus be ideal to create speech neuroprostheses without the need of patient-specific training data. In this study, we recorded 60 sessions from 56 healthy participants using 64 EEG channels and developed a neural network capable of subject-independent classification of perceived sentences. We found that sentence identity can be decoded from subjects without prior training achieving higher accuracy than mixed-subject models. The development of subject-independent models eliminates the need to collect data from a target subject, reducing time and data collection costs during deployment. These results open new avenues for creating speech neuroprostheses when subjects cannot provide training data.
Please contact us at this e-mail address if you have any questions: cgvalle@uc.cl
License: Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The anti-spoofing dataset includes live-recorded anti-spoofing videos from around the world, captured via high-quality webcams at Full HD resolution and above. The videos were gathered by capturing the faces of genuine individuals as well as spoof facial presentations. The dataset supports approaches that learn to detect spoofing techniques by extracting features from genuine facial images, so that such information cannot be captured and reused by fraudulent users.
The dataset contains images and videos of real humans with various views, and colors, making it a comprehensive resource for researchers working on anti-spoofing technologies.
The dataset provides data for combining and applying different techniques, approaches, and models to the challenging task of distinguishing between genuine and spoofed inputs, enabling effective anti-spoofing solutions in active authentication systems. Such solutions are crucial as newer devices, such as phones, have become vulnerable to spoofing attacks due to the availability of technologies that can create replays, reflections, and depth effects.
Our dataset also explores the use of neural architectures, such as deep neural networks, to facilitate the identification of distinguishing patterns and textures in different regions of the face, increasing the accuracy and generalizability of the anti-spoofing models.
The collection covers video resolutions from Full HD (1080p) up to 4K (2160p), including several intermediate resolutions such as QHD (1440p).
Each attack instance is accompanied by the following details:
Additionally, the model of the webcam is also specified.
Metadata is provided in file_info.csv.
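A minimal way to inspect that metadata (only the file name file_info.csv is taken from this description; the columns are whatever the CSV actually contains):

```python
# Minimal sketch: inspect the metadata shipped with the dataset.
import pandas as pd

info = pd.read_csv("file_info.csv")
print(info.columns.tolist())   # e.g. resolution, webcam model, attack details
print(info.head())
```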
🚀 You can learn more about our high-quality unique datasets here
keywords: liveness detection systems, liveness detection dataset, biometric dataset, biometric data dataset, biometric system attacks, anti-spoofing dataset, face liveness detection, deep learning dataset, face spoofing database, face anti-spoofing, ibeta dataset, human video dataset, video dataset, high quality video dataset, hd video dataset, phone attack dataset, face anti spoofing, large-scale face anti spoofing, rich annotations anti spoofing dataset
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Metadata of a Large Sonar and Stereo Camera Dataset Suitable for Sonar-to-RGB Image Translation
Introduction
This is a set of metadata describing a large dataset of synchronized sonar and stereo camera recordings that were captured between August 2021 and September 2023 during the project DeeperSense (https://robotik.dfki-bremen.de/en/research/projects/deepersense/), as training data for Sonar-to-RGB image translation. Parts of the sensor data have been published (https://zenodo.org/records/7728089, https://zenodo.org/records/10220989). Due to the size of the sensor data corpus, it is currently impractical to make the entire corpus accessible online. Instead, this metadatabase serves as a relatively compact representation, allowing interested researchers to inspect the data and select relevant portions for their particular use case, which will be made available on demand. This is an effort to comply with the FAIR principle A2 (https://www.go-fair.org/fair-principles/) that metadata shall be accessible even when the base data is not immediately accessible.
Locations and sensors
The sensor data was captured at four different locations, including one laboratory (Maritime Exploration Hall at DFKI RIC Bremen) and three field locations (Chalk Lake Hemmoor, Tank Wash Basin Neu-Ulm, Lake Starnberg). At all locations, a ZED camera and a Blueprint Oculus M1200d sonar were used. Additionally, a SeaVision camera was used at the Maritime Exploration Hall at DFKI RIC Bremen and at the Chalk Lake Hemmoor. The examples/ directory holds a typical output image for each sensor at each available location.
Data volume per session
Six data collection sessions were conducted. The table below presents an overview of the amount of data captured in each session:
| Session dates | Location | Number of datasets | Total duration of datasets [h] | Total logfile size [GB] | Number of images | Total image size [GB] |
|---|---|---|---|---|---|---|
| 2021-08-09 - 2021-08-12 | Maritime Exploration Hall at DFKI RIC Bremen | 52 | 10.8 | 28.8 | 389’047 | 88.1 |
| 2022-02-07 - 2022-02-08 | Maritime Exploration Hall at DFKI RIC Bremen | 35 | 4.4 | 54.1 | 629’626 | 62.3 |
| 2022-04-26 - 2022-04-28 | Chalk Lake Hemmoor | 52 | 8.1 | 133.6 | 1’114’281 | 97.8 |
| 2022-06-28 - 2022-06-29 | Tank Wash Basin Neu-Ulm | 42 | 6.7 | 144.2 | 824’969 | 26.9 |
| 2023-04-26 - 2023-04-27 | Maritime Exploration Hall at DFKI RIC Bremen | 55 | 7.4 | 141.9 | 739’613 | 9.6 |
| 2023-09-01 - 2023-09-02 | Lake Starnberg | 19 | 2.9 | 40.1 | 217’385 | 2.3 |
| Total | | 255 | 40.3 | 542.7 | 3’914’921 | 287.0 |
Data and metadata structure
Sensor data corpus
The sensor data corpus comprises two processing stages:
raw data streams stored in ROS bagfiles (aka logfiles),
camera and sonar images (aka datafiles) extracted from the logfiles.
The files are stored in a file tree hierarchy which groups them by session, dataset, and modality:
${session_key}/
    ${dataset_key}/
        ${logfile_name}
        ${modality_key}/
            ${datafile_name}
A typical logfile path has this form:
2023-09_starnberg_lake/2023-09-02-15-06_hydraulic_drill/stereo_camera-zed-2023-09-02-15-06-07.bag
A typical datafile path has this form:
2023-09_starnberg_lake/2023-09-02-15-06_hydraulic_drill/zed_right/1693660038_368077993.jpg
All directory and file names, and their particles, are designed to serve as identifiers in the metadatabase. Their formatting, as well as the definitions of all terms, are documented in the file entities.json.
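Because the path particles double as identifiers, a datafile path can be split back into its parts; a minimal sketch is below (the seconds_nanoseconds reading of the datafile name is an assumption, and entities.json remains the authoritative reference):

```python
# Sketch: recover session/dataset/modality identifiers from a datafile path,
# following the ${session_key}/${dataset_key}/${modality_key}/${datafile_name}
# layout described above. See entities.json for the authoritative definitions.
from pathlib import PurePosixPath

def parse_datafile_path(path: str) -> dict:
    session_key, dataset_key, modality_key, datafile_name = PurePosixPath(path).parts
    # Assumption: datafile names encode <seconds>_<nanoseconds> of the capture time.
    seconds, nanoseconds = PurePosixPath(datafile_name).stem.split("_")
    return {
        "session_key": session_key,
        "dataset_key": dataset_key,
        "modality_key": modality_key,
        "datafile_name": datafile_name,
        "timestamp": int(seconds) + int(nanoseconds) * 1e-9,
    }

print(parse_datafile_path(
    "2023-09_starnberg_lake/2023-09-02-15-06_hydraulic_drill/zed_right/1693660038_368077993.jpg"))
```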
Metadatabase
The metadatabase is provided in two equivalent forms:
as a standalone SQLite (https://www.sqlite.org/index.html) database file metadata.sqlite for users familiar with SQLite,
as a collection of CSV files in the csv/ directory for users who prefer other tools.
The database file has been generated from the CSV files, so each database table holds the same information as the corresponding CSV file. In addition, the metadatabase contains a series of convenience views that facilitate access to certain aggregate information.
An entity relationship diagram of the metadatabase tables is stored in the file entity_relationship_diagram.png. Each entity, its attributes, and relations are documented in detail in the file entities.json.
Some general design remarks:
For convenience, timestamps are always given in both a human-readable form (ISO 8601 formatted datetime strings with explicit local time zone), and as seconds since the UNIX epoch.
In practice, each logfile always contains a single stream, and each stream is stored always in a single logfile. Per database schema however, the entities stream and logfile are modeled separately, with a “many-streams-to-one-logfile” relationship. This design was chosen to be compatible with, and open for, data collections where a single logfile contains multiple streams.
A modality is not an attribute of a sensor alone, but of a datafile: Because a sensor is an attribute of a stream, and a single stream may be the source of multiple modalities (e.g. RGB vs. grayscale images from the same camera, or cartesian vs. polar projection of the same sonar output). Conversely, the same modality may originate from different sensors.
As a usage example, the data volume per session which is tabulated at the top of this document, can be extracted from the metadatabase with the following SQL query:
SELECT
    PRINTF('%s - %s', SUBSTR(session_start, 1, 10), SUBSTR(session_end, 1, 10)) AS 'Session dates',
    location_name_english AS Location,
    number_of_datasets AS 'Number of datasets',
    total_duration_of_datasets_h AS 'Total duration of datasets [h]',
    total_logfile_size_gb AS 'Total logfile size [GB]',
    number_of_images AS 'Number of images',
    total_image_size_gb AS 'Total image size [GB]'
FROM location
JOIN session USING (location_id)
JOIN (
    SELECT
        session_id,
        COUNT(dataset_id) AS number_of_datasets,
        ROUND(SUM(dataset_duration) / 3600, 1) AS total_duration_of_datasets_h,
        ROUND(SUM(total_logfile_size) / 10e9, 1) AS total_logfile_size_gb
    FROM location
    JOIN session USING (location_id)
    JOIN dataset USING (session_id)
    JOIN view_dataset_total_logfile_size USING (dataset_id)
    GROUP BY session_id
) USING (session_id)
JOIN (
    SELECT
        session_id,
        COUNT(datafile_id) AS number_of_images,
        ROUND(SUM(datafile_size) / 10e9, 1) AS total_image_size_gb
    FROM session
    JOIN dataset USING (session_id)
    JOIN stream USING (dataset_id)
    JOIN datafile USING (stream_id)
    GROUP BY session_id
) USING (session_id)
ORDER BY session_id;
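For exploration from Python, the metadatabase can be opened with the standard sqlite3 module; a minimal sketch (only the file name metadata.sqlite and the table names used in the query above are taken from this description):

```python
# Sketch: open the metadatabase, list its tables and convenience views,
# then run a small aggregate query against the documented tables.
import sqlite3

con = sqlite3.connect("metadata.sqlite")
cur = con.cursor()

# Tables and views present in the metadatabase
for name, kind in cur.execute(
        "SELECT name, type FROM sqlite_master WHERE type IN ('table', 'view') ORDER BY type, name"):
    print(kind, name)

# Example aggregate: number of datasets per session
for row in cur.execute(
        "SELECT session_id, COUNT(dataset_id) FROM dataset GROUP BY session_id"):
    print(row)

con.close()
```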
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
CityTrek-14K is a distinctive, extensive dataset that includes 14,000 trajectories from 280 drivers, each contributing 50 trajectories, in three major U.S. cities: Philadelphia (PA), Atlanta (GA), and Memphis (TN). It features time series data capturing details like timestamps, vehicle speeds, and GPS coordinates, with a collection frequency of 1 Hz. Although the dataset includes location data, strict anonymization practices were adhered to, ensuring personal information like home or work addresses remains confidential. The CityTrek-14K dataset offers a comprehensive view of driving patterns, encompassing over 4,800 hours of driving data and spanning more than 189,000 miles, collected between July 2017 and March 2019. The dataset comprises two distinct files: the first is a summary of the trips, and the second is a trajectory data file that includes detailed records captured every second.
If you use this dataset, please kindly cite the following paper: - Moosavi, Sobhan, and Rajiv Ramnath. "Context-aware driver risk prediction with telematics data." Accident Analysis & Prevention 192 (2023): 107269.
The CityTrek-14K dataset was collected using specially designed devices installed in vehicles. These devices were configured to record and transmit data frequently. Further details about this data collection process are elaborated in the paper mentioned above.
The CityTrek-14K dataset is versatile, suitable for numerous applications such as:
- Traffic Modeling and ETA Prediction: The dataset contains detailed route information and travel times, making it an excellent resource for large-scale traffic modeling and ETA modeling techniques.
- Route Optimization: With its detailed trajectory data, the dataset is ideal for developing and testing route optimization techniques, providing insights into efficient pathfinding methods.
- Modeling and Analyzing Driver Behavior: As each driver in the dataset has exactly 50 trajectories recorded, this allows for a comprehensive analysis of driver behavior, offering a unique opportunity to study and model driving patterns and habits.
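As a starting point for trajectory work, a minimal sketch is below; the file name and the trip_id/lat/lon/speed column names are hypothetical and should be checked against the actual trajectory file schema:

```python
# Sketch: per-trip distance and average speed from 1 Hz GPS points.
# Column names (trip_id, lat, lon, speed) and the file name are hypothetical.
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

traj = pd.read_csv("citytrek_trajectories.csv")   # hypothetical file name

def trip_summary(g):
    dist = haversine_km(g["lat"].values[:-1], g["lon"].values[:-1],
                        g["lat"].values[1:],  g["lon"].values[1:]).sum()
    return pd.Series({"distance_km": dist, "mean_speed": g["speed"].mean()})

print(traj.groupby("trip_id").apply(trip_summary).head())
```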
This dataset is being distributed solely for research purposes under the Creative Commons Attribution-Noncommercial-ShareAlike license (CC BY-NC-SA 4.0). By downloading the dataset, you agree to use it only for non-commercial, research, or academic applications. If you use this dataset, it is necessary to cite the paper mentioned above.
For any inquiries or assistance, please contact Sobhan Moosavi at sobhan.mehr84@gmail.com
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
The Thai Wake Word & Voice Command Dataset is expertly curated to support the training and development of voice-activated systems. This dataset includes a large collection of wake words and command phrases, essential for enabling seamless user interaction with voice assistants and other speech-enabled technologies. It’s designed to ensure accurate wake word detection and voice command recognition, enhancing overall system performance and user experience.
This dataset includes 20,000+ audio recordings of wake words and command phrases. Each participant contributed 400 recordings, captured under varied environmental conditions and speaking speeds. The data covers:
This diversity ensures robust training for real-world voice assistant applications.
Each audio file is accompanied by detailed metadata to support advanced filtering and training needs.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
Cough audio signal classification has been successfully used to diagnose a variety of respiratory conditions, and there has been significant interest in leveraging Machine Learning (ML) to provide widespread COVID-19 screening. The COUGHVID dataset provides over 20,000 crowdsourced cough recordings representing a wide range of subject ages, genders, geographic locations, and COVID-19 statuses. Furthermore, experienced pulmonologists labeled more than 2,000 recordings to diagnose medical abnormalities present in the coughs, thereby contributing one of the largest expert-labeled cough datasets in existence that can be used for a plethora of cough audio classification tasks. As a result, the COUGHVID dataset contributes a wealth of cough recordings for training ML models to address the world's most urgent health crises.
Private Set and Testing Protocol
Researchers interested in testing their models on the private test dataset should contact us at coughvid@epfl.ch, briefly explaining the type of validation they want to make and the results they obtained through cross-validation with the public data. Then, access to the unlabeled recordings will be provided, and the researchers should send the predictions of their models on these recordings. Finally, the performance metrics of the predictions will be sent to the researchers.
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
The US English Wake Word & Voice Command Dataset is expertly curated to support the training and development of voice-activated systems. This dataset includes a large collection of wake words and command phrases, essential for enabling seamless user interaction with voice assistants and other speech-enabled technologies. It’s designed to ensure accurate wake word detection and voice command recognition, enhancing overall system performance and user experience.
This dataset includes 20,000+ audio recordings of wake words and command phrases. Each participant contributed 400 recordings, captured under varied environmental conditions and speaking speeds. The data covers:
This diversity ensures robust training for real-world voice assistant applications.
Each audio file is accompanied by detailed metadata to support advanced filtering and training needs.
Terms: https://crawlfeeds.com/privacy_policy
Bulk Bookstore is an online book store. The Crawl Feeds team extracted a few sample records for analysis purposes. Last crawled on 27 Nov 2021.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
According to Wikipedia, an ultramarathon, also called ultra distance or ultra running, is any footrace longer than the traditional marathon length of 42.195 kilometres (26 mi 385 yd). Various distances are raced competitively, from the shortest common ultramarathon of 31 miles (50 km) to over 200 miles (320 km). 50 km and 100 km are both World Athletics record distances, but some 100-mile (160 km) races are among the oldest and most prestigious events, especially in North America.
The data in this file is a large collection of ultra-marathon race records registered between 1798 and 2022 (a period of well over two centuries) being therefore a formidable long term sample. All data was obtained from public websites.
Despite the original data being in the public domain, the race records, which originally contained the athletes' names, have been anonymized to comply with data protection laws and to preserve the athletes' privacy. However, a column Athlete ID has been created with a numerical ID representing each unique runner (so if Antonio Fernández participated in 5 races over different years, the corresponding race records now hold his unique Athlete ID instead of his name). This way I have preserved valuable information.
The dataset contains 7,461,226 ultra-marathon race records from 1,641,168 unique athletes.
The following columns (with data types) are included:
The Event name column includes country location information that can be derived into a new column; similarly, seasonal information can be found in the Event dates column beyond the Year of event (both can be extracted with a bit of processing, as sketched below).
The Event distance/length column describes the type of race, covering the most popular UM race distances and lengths, and some other specific modalities (multi-day, etc.):
Additionally, there is information on age, gender, and speed (in km/h) in other columns.
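A minimal extraction sketch, assuming the country appears as a trailing parenthesized code in Event name and that Event dates is parseable by pandas (both formatting assumptions to verify on the actual file):

```python
# Sketch: derive a country column from "Event name" and a month column from
# "Event dates". The parsing rules below are assumptions about the formatting.
import pandas as pd

races = pd.read_csv("ultramarathon_races.csv")   # hypothetical file name

# e.g. "Some Ultra Trail (ESP)" -> "ESP"  (assumed trailing parenthesized code)
races["Country"] = races["Event name"].str.extract(r"\(([^)]+)\)\s*$", expand=False)

# Month of the event as a rough season indicator (unparseable dates become NaT)
races["Event month"] = pd.to_datetime(
    races["Event dates"], errors="coerce", dayfirst=True).dt.month

print(races[["Event name", "Country", "Event dates", "Event month"]].head())
```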
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Electrocardiography (ECG) is a key diagnostic tool to assess the cardiac condition of a patient. Automatic ECG interpretation algorithms as diagnosis support systems promise large reliefs for the medical personnel - only on the basis of the number of ECGs that are routinely taken. However, the development of such algorithms requires large training datasets and clear benchmark procedures. In our opinion, both aspects are not covered satisfactorily by existing freely accessible ECG datasets.
The PTB-XL ECG dataset is a large dataset of 21,799 clinical 12-lead ECGs of 10 seconds length from 18,869 patients. The raw waveform data was annotated by up to two cardiologists, who assigned potentially multiple ECG statements to each record. The 71 different ECG statements in total conform to the SCP-ECG standard and cover diagnostic, form, and rhythm statements. To ensure comparability of machine learning algorithms trained on the dataset, we provide recommended splits into training and test sets. In combination with the extensive annotation, this turns the dataset into a rich resource for the training and the evaluation of automatic ECG interpretation algorithms. The dataset is complemented by extensive metadata on demographics, infarction characteristics, likelihoods for diagnostic ECG statements, as well as annotated signal properties.
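For orientation, a minimal loading sketch using the recommended splits is below; the file and column names (ptbxl_database.csv, strat_fold, filename_lr) follow the PhysioNet release as commonly documented and should be treated as assumptions here:

```python
# Sketch: load the PTB-XL metadata and apply the recommended train/test split.
# File and column names are assumptions; verify against the actual release.
import pandas as pd
import wfdb

meta = pd.read_csv("ptbxl_database.csv", index_col="ecg_id")

train = meta[meta["strat_fold"] <= 8]    # recommended training folds
test = meta[meta["strat_fold"] == 10]    # recommended test fold

# Read one 100 Hz record (10 s x 12 leads)
signal, fields = wfdb.rdsamp(train["filename_lr"].iloc[0])
print(signal.shape, fields["sig_name"])
```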
Three datasets are available, each consisting of 15 CSV files. Each file contains the voxelised shower information obtained from single particles produced at the front of the calorimeter in the |η| range (0.2-0.25), simulated in the ATLAS detector. Two datasets contain photon events with different statistics; the larger sample has about 10 times the number of events as the other. The other dataset contains pions. The pion dataset and the photon dataset with the lower statistics were used to train the corresponding two GANs presented in the AtlFast3 paper SIMU-2018-04.
The information in each file is a table; the rows correspond to the events and the columns to the voxels. The voxelisation procedure is described in the AtlFast3 paper linked above and in the dedicated PUB note ATL-SOFT-PUB-2020-006. In summary, the detailed energy deposits produced by ATLAS were converted from x,y,z coordinates to local cylindrical coordinates defined around the particle 3-momentum at the entrance of the calorimeter. The energy deposits in each layer were then grouped in voxels and for each voxel the energy was stored in the csv file. For each particle, there are 15 files corresponding to the 15 energy points used to train the GAN. The name of the csv file defines both the particle and the energy of the sample used to create the file.
The size of the voxels is described in the binning.xml file. Software tools to read the XML file and manipulate the spatial information of voxels are provided in the FastCaloGAN repository.
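A minimal sketch for reading one of the CSV files (rows are events, columns are voxels); the file name is illustrative, since the actual names encode the particle type and energy point:

```python
# Sketch: load one voxelised-shower CSV and compute the total deposited
# energy per event. The file name below is a placeholder.
import pandas as pd

showers = pd.read_csv("photons_sample.csv")     # hypothetical file name
print(showers.shape)                            # (n_events, n_voxels)

total_energy = showers.sum(axis=1)              # per-event sum over voxels
print(total_energy.describe())
```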
Updated on February 10th 2022. A new dataset photons_samples_highStat.tgz was added to this record and the binning.xml file was updated accordingly.
Updated on April 18th 2023. A new dataset pions_samples_highStat.tgz was added to this record.
This is a dataset downloaded from excelbianalytics.com, created using random VBA logic. I recently performed an extensive exploratory data analysis on it and added new columns, namely: Unit margin, Order year, Order month, Order weekday and Order_Ship_Days, which I think can help with analysis of the data. I shared it because I thought it was a great dataset for newbies like myself to practice analytical processes on.
License: CC0-1.0, https://spdx.org/licenses/CC0-1.0.html
Passive Acoustic Monitoring (PAM) is emerging as a solution for monitoring species and environmental change over large spatial and temporal scales. However, drawing rigorous conclusions based on acoustic recordings is challenging, as there is no consensus over which approaches, and indices are best suited for characterizing marine and terrestrial acoustic environments.
Here, we describe the application of multiple machine-learning techniques to the analysis of a large PAM dataset. We combine pre-trained acoustic classification models (VGGish, NOAA & Google Humpback Whale Detector), dimensionality reduction (UMAP), and balanced random forest algorithms to demonstrate how machine-learned acoustic features capture different aspects of the marine environment.
The UMAP dimensions derived from VGGish acoustic features exhibited good performance in separating marine mammal vocalizations according to species and locations. RF models trained on the acoustic features performed well for labelled sounds in the 8 kHz range; however, low- and high-frequency sounds could not be classified using this approach.
The workflow presented here shows how acoustic feature extraction, visualization, and analysis allow for establishing a link between ecologically relevant information and PAM recordings at multiple scales.
The datasets and scripts provided in this repository allow replicating the results presented in the publication.
Methods
Data acquisition and preparation
We collected all records available in the Watkins Marine Mammal Database website listed under the “all cuts” page. For each audio file in the WMD, the associated metadata included a label for the sound sources present in the recording (biological, anthropogenic, and environmental), as well as information related to the location and date of recording. To minimize the presence of unwanted sounds in the samples, we only retained audio files with a single source listed in the metadata. We then labelled the selected audio clips according to taxonomic group (Odontocetae, Mysticetae) and species.
We limited the analysis to 12 marine mammal species by discarding data when a species: had less than 60 s of audio available, had a vocal repertoire extending beyond the resolution of the acoustic classification model (VGGish), or was recorded in a single country. To determine if a species was suited for analysis using VGGish, we inspected the Mel-spectrograms of 3-s audio samples and only retained species with vocalizations that could be captured in the Mel-spectrogram (Appendix S1). The vocalizations of species that produce very low-frequency or very high-frequency sounds were not captured by the Mel-spectrogram, thus we removed them from the analysis. To ensure that records included the vocalizations of multiple individuals for each species, we only considered species with records from two or more different countries. Lastly, to avoid overrepresentation of sperm whale vocalizations, we excluded 30,000 sperm whale recordings collected in the Dominican Republic. The resulting dataset consisted of 19,682 audio clips with a duration of 960 milliseconds each (0.96 s) (Table 1).
The Placentia Bay Database (PBD) includes recordings collected by Fisheries and Oceans Canada in Placentia Bay (Newfoundland, Canada) in 2019. The dataset consisted of two months of continuous recordings (1230 hours), starting on July 1st, 2019, and ending on August 31st, 2019. The data was collected using an AMAR G4 hydrophone (sensitivity: -165.02 dB re 1V/µPa at 250 Hz) deployed at 64 m of depth. The hydrophone was set to operate following 15 min cycles, with the first 60 s sampled at 512 kHz, and the remaining 14 min sampled at 64 kHz. For the purpose of this study, we limited the analysis to the 64 kHz recordings.
Acoustic feature extraction
The audio files from the WMD and PBD databases were used as input for VGGish (Abu-El-Haija et al., 2016; Chung et al., 2018), a CNN developed and trained to perform general acoustic classification. VGGish was trained on the Youtube8M dataset, containing more than two million user-labelled audio-video files. Rather than focusing on the final output of the model (i.e., the assigned labels), here the model was used as a feature extractor (Sethi et al., 2020). VGGish converts audio input into a semantically meaningful vector consisting of 128 features. The model returns features at multiple resolutions: ~1 s (960 ms); ~5 s (4800 ms); ~1 min (59’520 ms); ~5 min (299’520 ms). All of the visualizations and results pertaining to the WMD were prepared using the finest feature resolution of ~1 s. The visualizations and results pertaining to the PBD were prepared using the ~5 s features for the humpback whale detection example, and were then averaged to an interval of 30 min in order to match the temporal resolution of the environmental measures available for the area.
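A minimal sketch of the feature-extraction step using the publicly released VGGish model on TF Hub (the exact preprocessing used in the study may differ):

```python
# Sketch: 128-dimensional VGGish features at ~0.96 s resolution for one clip.
# The input file name is a placeholder; VGGish expects mono 16 kHz audio in [-1, 1].
import librosa
import tensorflow_hub as hub

vggish = hub.load("https://tfhub.dev/google/vggish/1")

waveform, sr = librosa.load("example_clip.wav", sr=16000)   # mono, 16 kHz
embeddings = vggish(waveform)                               # shape: (n_frames, 128)
print(embeddings.shape)
```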
UMAP ordination and visualization
UMAP is a non-linear dimensionality reduction algorithm based on the concept of topological data analysis which, unlike other dimensionality reduction techniques (e.g., tSNE), preserves both the local and global structure of multivariate datasets (McInnes et al., 2018). To allow for data visualization and to reduce the 128 features to two dimensions for further analysis, we applied Uniform Manifold Approximation and Projection (UMAP) to both datasets and inspected the resulting plots.
The UMAP algorithm generates a low-dimensional representation of a multivariate dataset while maintaining the relationships between points in the global dataset structure (i.e., the 128 features extracted from VGGish). Each point in a UMAP plot in this paper represents an audio sample with a duration of ~1 second (WMD dataset), ~5 seconds (PBD dataset, humpback whale detections), or 30 minutes (PBD dataset, environmental variables). Each point in the two-dimensional UMAP space also represents a vector of 128 VGGish features. The nearer two points are in the plot space, the nearer they are in the 128-dimensional space, and thus the distance between two points in UMAP reflects the degree of similarity between two audio samples in our datasets. Areas with a high density of samples in UMAP space should, therefore, contain sounds with similar characteristics, and such similarity should decrease with increasing point distance. Previous studies illustrated how VGGish and UMAP can be applied to the analysis of terrestrial acoustic datasets (Heath et al., 2021; Sethi et al., 2020). The visualizations and classification trials presented here illustrate how the two techniques (VGGish and UMAP) can be used together for marine ecoacoustics analysis. UMAP visualizations were prepared using the umap-learn package for the Python programming language (version 3.10). All UMAP visualizations presented in this study were generated using the algorithm's default parameters.
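A minimal sketch of the UMAP step with default parameters, applied to a saved feature matrix (the file name is a placeholder):

```python
# Sketch: project an (n_samples, 128) VGGish feature matrix to 2-D with UMAP
# default parameters and plot the embedding.
import matplotlib.pyplot as plt
import numpy as np
import umap

features = np.load("vggish_features.npy")      # placeholder: (n_samples, 128)

reducer = umap.UMAP()                          # default parameters
embedding = reducer.fit_transform(features)    # (n_samples, 2)

plt.scatter(embedding[:, 0], embedding[:, 1], s=2)
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.show()
```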
Labelling sound sources
The labels for the WMD records (i.e., taxonomic group, species, location) were obtained from the database metadata.
For the PBD recordings, we obtained measures of wind speed, surface temperature, and current speed (Fig 1) from an oceanographic buoy located in proximity to the recorder. We chose these three variables for their different contributions to background noise in marine environments. Wind speed contributes to underwater background noise at multiple frequencies, ranging from 500 Hz to 20 kHz (Hildebrand et al., 2021). Sea surface temperature contributes to background noise at frequencies between 63 Hz and 125 Hz (Ainslie et al., 2021), while ocean currents contribute to ambient noise at frequencies below 50 Hz (Han et al., 2021). Prior to analysis, we categorized the environmental variables and assigned the categories as labels to the acoustic features (Table 2). Humpback whale vocalizations in the PBD recordings were processed using the humpback whale acoustic detector created by NOAA and Google (Allen et al., 2021), providing a model score for every ~5 s sample. This model was trained on a large dataset (14 years and 13 locations) using humpback whale recordings annotated by experts (Allen et al., 2021). The model returns scores ranging from 0 to 1, indicating the confidence in the predicted humpback whale presence. We used the results of this detection model to label the PBD samples according to the presence of humpback whale vocalizations. To verify the model results, we inspected all audio files that contained a 5 s sample with a model score higher than 0.9 for the month of July. If the presence of a humpback whale was confirmed, we labelled the segment as a model detection. We labelled any additional humpback whale vocalization present in the inspected audio files as a visual detection, while we labelled other sources and background noise samples as absences. In total, we labelled 4.6 hours of recordings. We reserved the recordings collected in August to test the precision of the final predictive model.
Label prediction performance
We used Balanced Random Forest models (BRF) provided in the imbalanced-learn Python package (Lemaître et al., 2017) to predict humpback whale presence and environmental conditions from the acoustic features generated by VGGish. We chose BRF as the algorithm because it is suited for datasets characterized by class imbalance. The BRF algorithm performs undersampling of the majority class prior to prediction, allowing it to overcome class imbalance (Lemaître et al., 2017). For each model run, the PBD dataset was split into training (80%) and testing (20%) sets.
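A minimal sketch of the BRF step with an 80/20 split (feature and label files are placeholders; the hyperparameter search described below is omitted):

```python
# Sketch: Balanced Random Forest on the 128-d acoustic features with an 80/20 split.
import numpy as np
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X = np.load("vggish_features.npy")    # placeholder: (n_samples, 128)
y = np.load("labels.npy")             # placeholder: e.g. humpback presence/absence

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
brf.fit(X_train, y_train)
print("balanced accuracy:", balanced_accuracy_score(y_test, brf.predict(X_test)))
```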
The training datasets were used to fine-tune the models through a nested k-fold cross-validation approach with ten folds in the outer loop and five folds in the inner loop. We selected nested cross-validation as it allows optimizing model hyperparameters and performing model evaluation in a single step. We used the default parameters of the BRF algorithm, except for the ‘n_estimators’ hyperparameter, for which we tested
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides a structured and machine-readable register of all companies recorded by the Companies Registration Office (CRO) in Ireland. It includes a daily snapshot of company records, covering both currently registered companies and historical records of dissolved or closed entities. The dataset aligns with the European Union’s Open Data Directive (Directive (EU) 2019/1024) and the Implementing Regulation (EU) 2023/138, which designates company and company ownership data as a high-value dataset. Updated daily, it ensures timely access to corporate information and is available for bulk download and API access under the Creative Commons Attribution 4.0 (CC BY 4.0) licence, allowing unrestricted reuse with appropriate attribution. By increasing transparency, accountability, and economic innovation, this dataset supports public sector initiatives, research, and digital services development.
Terms: https://crawlfeeds.com/privacy_policy
Unlock the power of Flipkart's extensive product catalog with our meticulously curated e-commerce dataset. This dataset provides detailed information on a wide range of products available on Flipkart, including product names, descriptions, prices, customer reviews, ratings, and images. Whether you're working on data analysis, machine learning models, or conducting in-depth market research, this dataset is an invaluable resource.
With our Flipkart e-commerce dataset, you can easily analyze trends, compare products, and gain insights into consumer behavior. The dataset is structured and high-quality, ensuring that you have the best foundation for your projects.
Flipkart is one of the largest e-commerce websites, based out of India. This pre-crawled dataset has more than 5.7 million records.
Where to use this dataset
License: Apache License 2.0, http://www.apache.org/licenses/LICENSE-2.0
This dataset comprises synthetic UML diagrams, explicitly focusing on activity and sequence diagrams generated using PlantUML—a text-based tool for creating visual diagrams. By leveraging randomized text strings based on PlantUML syntax, we produced a diverse and scalable collection that emulates standard UML diagrams. Each diagram is accompanied by its corresponding PlantUML code, facilitating a clear understanding of the visual representation's textual foundation. Data from smaller datasets is reused in the larger datasets, as each model was trained on the data separately, as described in the original paper. It's recommended just to use the Extra Large dataset when interested in using the data in its entirety.
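To illustrate the idea of randomized PlantUML text (this is a toy generator, not the authors' pipeline):

```python
# Sketch: generate a small, randomized activity diagram in PlantUML syntax,
# illustrating the kind of text-to-diagram pairs described above.
import random

def random_activity_diagram(n_steps=5, seed=None):
    rng = random.Random(seed)
    words = ["load", "parse", "check", "store", "notify", "retry", "merge"]
    lines = ["@startuml", "start"]
    for _ in range(n_steps):
        lines.append(f":{rng.choice(words)} {rng.randint(1, 99)};")
        if rng.random() < 0.3:
            lines.append(f"if ({rng.choice(words)}?) then (yes)")
            lines.append(f"  :{rng.choice(words)};")
            lines.append("endif")
    lines += ["stop", "@enduml"]
    return "\n".join(lines)

print(random_activity_diagram(seed=42))
```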
Each category is divided into four subsets based on size (approximately):
Small: 6,000 training diagrams and 1,500 testing diagrams.
Medium: 12,000 training diagrams and 3,000 testing diagrams.
Large: 24,000 training diagrams and 6,000 testing diagrams.
Extra Large: 120,000 training diagrams and 30,000 testing diagrams.