MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Code and additional data for solution #4 in the Predicting Molecular Properties competition, described in #4 Solution [Hyperspatial Engineers].
This dataset was created by Vikusya1808
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Andrew Gross
Released under MIT
The VaggP/Eedi-competition-kaggle-prompt-formats-mpnet dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
This dataset was created by Alessandro P.
FSDKaggle2019 is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology. FSDKaggle2019 has been used for the DCASE Challenge 2019 Task 2, which was run as a Kaggle competition titled Freesound Audio Tagging 2019.
Citation
If you use the FSDKaggle2019 dataset or part of it, please cite our DCASE 2019 paper:
Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Serra. "Audio tagging with noisy labels and minimal supervision". Proceedings of the DCASE 2019 Workshop, NYC, US (2019)
You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2019.
Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017
Data curators
Eduardo Fonseca, Manoj Plakal, Xavier Favory, Jordi Pons
Contact
You are welcome to contact Eduardo Fonseca at eduardo.fonseca@upf.edu should you have any questions.
ABOUT FSDKaggle2019
Freesound Dataset Kaggle 2019 (or FSDKaggle2019 for short) is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology [1]. FSDKaggle2019 has been used for Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2019. Please visit the DCASE2019 Challenge Task 2 website for more information. This task was hosted on the Kaggle platform as a competition titled Freesound Audio Tagging 2019. It was organized by researchers from the Music Technology Group (MTG) of Universitat Pompeu Fabra (UPF) and from the Sound Understanding team at Google AI Perception. The competition aimed to provide insight into the development of broadly applicable sound event classifiers able to cope with label noise and minimal supervision.
FSDKaggle2019 employs audio clips from the following sources:
Freesound Dataset (FSD): a dataset being collected at the MTG-UPF based on Freesound content organized with the AudioSet Ontology
The soundtracks of a pool of Flickr videos taken from the Yahoo Flickr Creative Commons 100M dataset (YFCC)
The audio data is labeled using a vocabulary of 80 labels from Google's AudioSet Ontology [1], covering diverse topics: Guitar and other Musical Instruments, Percussion, Water, Digestive, Respiratory sounds, Human voice, Human locomotion, Hands, Human group actions, Insect, Domestic animals, Glass, Liquid, Motor vehicle (road), Mechanisms, Doors, and a variety of Domestic sounds. The full list of categories can be inspected in vocabulary.csv (see Files & Download below). The goal of the task was to build a multi-label audio tagging system that can predict appropriate label(s) for each audio clip in a test set.
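For a quick look at the label vocabulary, something like the following sketch works; note that the header-less, three-column layout assumed for vocabulary.csv (index, label, AudioSet mid) is an assumption, not documented above.

    import pandas as pd

    # Inspect the 80-class vocabulary. The column layout
    # (index, label, AudioSet mid) is an assumption about vocabulary.csv.
    vocab = pd.read_csv("vocabulary.csv", header=None, names=["index", "label", "mid"])
    print(len(vocab))             # expected: 80 classes
    print(vocab["label"].head())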
What follows is a summary of some of the most relevant characteristics of FSDKaggle2019. Nevertheless, it is highly recommended to read our DCASE 2019 paper for a more in-depth description of the dataset and how it was built.
Ground Truth Labels
The ground truth labels are provided at the clip-level, and express the presence of a sound category in the audio clip, hence can be considered weak labels or tags. Audio clips have variable lengths (roughly from 0.3 to 30s).
The audio content from FSD has been manually labeled by humans following a data labeling process using the Freesound Annotator platform. Most labels have inter-annotator agreement but not all of them. More details about the data labeling process and the Freesound Annotator can be found in [2].
The YFCC soundtracks were labeled using automated heuristics applied to the audio content and metadata of the original Flickr clips. Hence, a substantial amount of label noise can be expected. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises. More information about some of the types of label noise that can be encountered is available in [3].
Specifically, FSDKaggle2019 features three types of label quality, one for each set in the dataset:
curated train set: correct (but potentially incomplete) labels
noisy train set: noisy labels
test set: correct and complete labels
Further details can be found below in the sections for each set.
Format
All audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.
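As a minimal sketch, clips in this format can be loaded with the soundfile library (the filename below is a placeholder, not an actual FSDKaggle2019 file):

    import soundfile as sf

    # Load one uncompressed PCM 16-bit, 44.1 kHz, mono clip.
    audio, sr = sf.read("FSDKaggle2019.audio_train_curated/example.wav")
    assert sr == 44100          # all clips share this sample rate
    print(audio.shape)          # 1-D array: mono audio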
DATA SPLIT
FSDKaggle2019 consists of two train sets and one test set. The idea is to limit the supervision provided for training (i.e., the manually-labeled, hence reliable, data), thus promoting approaches to deal with label noise.
Curated train set
The curated train set consists of manually-labeled data from FSD.
Number of clips/class: 75, except in a few cases (where there are fewer)
Total number of clips: 4,970
Avg number of labels/clip: 1.2
Total duration: 10.5 hours
The duration of the audio clips ranges from 0.3 to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording/uploading sounds. Labels are correct but potentially incomplete. It can happen that a few of these audio clips present additional acoustic material beyond the provided ground truth label(s).
Noisy train set
The noisy train set is a larger set of noisy web audio data from Flickr videos taken from the YFCC dataset [5].
Number of clips/class: 300
Total number of clips: 19,815
Avg number of labels/clip: 1.2
Total duration: ~80 hours
The duration of the audio clips ranges from 1s to 15s, with the vast majority lasting 15s. Labels are automatically generated and purposefully noisy. No human validation is involved. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises.
Considering the numbers above, the per-class data available for training is, for most of the classes, 300 clips from the noisy train set and 75 clips from the curated train set. This means 80% noisy / 20% curated at the clip level; at the duration level the proportion is even more skewed, given the variable-length clips.
Test set
The test set is used for system evaluation and consists of manually-labeled data from FSD.
Number of clips/class: between 50 and 150
Total number of clips: 4,481
Avg number of labels/clip: 1.4
Total duration: 12.9 hours
The acoustic material present in the test set clips is labeled exhaustively using the aforementioned vocabulary of 80 classes. Most labels have inter-annotator agreement, but not all of them. Human error aside, the labels are correct and complete with respect to the target vocabulary; nonetheless, a few clips could still present additional (unlabeled) acoustic content outside the vocabulary.
During the DCASE2019 Challenge Task 2, the test set was split into two subsets, for the public and private leaderboards, and only the data corresponding to the public leaderboard was provided. The current package contains the full test set with all the test labels. To allow comparison with previous work, the file test_post_competition.csv includes a flag indicating the corresponding leaderboard (public or private) for each test clip (see more info in Files & Download below).
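A hedged sketch of using that flag with pandas follows; the actual name of the flag column is not stated above, so "usage" is a placeholder to replace with whatever the CSV header shows.

    import pandas as pd

    test = pd.read_csv("test_post_competition.csv")
    # "usage" is a placeholder name for the leaderboard flag column;
    # check the CSV header for the real one.
    public = test[test["usage"] == "Public"]
    private = test[test["usage"] == "Private"]
    print(len(public), len(private))    # should sum to 4,481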
Acoustic mismatch
As mentioned before, FSDKaggle2019 uses audio clips from two sources:
FSD: curated train set and test set, and
YFCC: noisy train set.
While the sources of audio (Freesound and Flickr) are collaboratively contributed and quite diverse in themselves, a certain acoustic mismatch can be expected between FSD and YFCC. We conjecture this mismatch arises for a variety of reasons. For example, through acoustic inspection of a small sample of both data sources, we find a higher percentage of high-quality recordings in FSD. In addition, audio clips in Freesound are typically recorded with the purpose of capturing audio, which is not necessarily the case in YFCC.
This mismatch can have an impact on the evaluation, considering that most of the training data come from YFCC while all test data are drawn from FSD. This constraint (i.e., noisy training data coming from a different web audio source than the test set) is sometimes a real-world condition.
LICENSE
All clips in FSDKaggle2019 are released under Creative Commons (CC) licenses. For attribution purposes and to facilitate attribution of these files to third parties, we include a mapping from the audio clips to their corresponding licenses.
Curated train set and test set. All clips in Freesound are released under different modalities of Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. The licenses are specified in the files train_curated_post_competition.csv and test_post_competition.csv. These licenses can be CC0, CC-BY, CC-BY-NC and CC Sampling+.
Noisy train set. Similarly, the licenses of the soundtracks from Flickr used in FSDKaggle2019 are specified in the file train_noisy_post_competition.csv. These licenses can be CC-BY and CC-BY-SA.
In addition, FSDKaggle2019 as a whole is the result of a curation process and it has an additional license. FSDKaggle2019 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2019.doc zip file.
FILES & DOWNLOAD
FSDKaggle2019 can be downloaded as a series of zip files with the following directory structure:
root
│
├───FSDKaggle2019.audio_train_curated/    Audio clips in the curated train set
│
└───FSDKaggle2019.audio_train_noisy/      Audio clips in the noisy train set
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was used in the Kaggle Wikipedia Web Traffic forecasting competition. It contains 145,063 daily time series representing the number of hits, or web traffic, for a set of Wikipedia pages from 2015-07-01 to 2017-09-10.
The original dataset contains missing values; these have simply been replaced by zeros.
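The stated preprocessing amounts to a one-liner in pandas; the filename and the wide page-by-date layout are assumptions about the released files.

    import pandas as pd

    # "train.csv" is an assumed filename; one row per page, one column per day.
    df = pd.read_csv("train.csv")
    df = df.fillna(0)           # missing daily hit counts become zeros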
Data for a Kaggle competition
Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.
Credit scoring algorithms, which estimate the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years.
The goal of this competition is to build a model that borrowers can use to help make the best financial decisions.
Historical data are provided on 250,000 borrowers and the prize pool is $5,000 ($3,000 for first, $1,500 for second and $500 for third).
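As an illustration of the task (not any particular entrant's method), a minimal probability-of-default baseline might look like this; the file and target-column names are assumptions about the competition data.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # "cs-training.csv" and the target column "SeriousDlqin2yrs" are assumptions.
    df = pd.read_csv("cs-training.csv")
    y = df["SeriousDlqin2yrs"]
    X = df.drop(columns=["SeriousDlqin2yrs"]).fillna(0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    p_default = model.predict_proba(X_te)[:, 1]   # probability of distress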
This dataset was created by Edifon Jimmy
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset of our MobiCom 2023 paper titled "The Wisdom of 1,170 Teams: Lessons and Experiences from a Large Indoor Localization Competition". We organized an indoor localization competition in 2021; 1,446 contestants from more than 60 countries, making up 1,170 teams, participated in this unique global event. In this competition, a first-of-its-kind large-scale indoor location benchmark dataset (60 GB) was released. The dataset consists of dense indoor signatures of WiFi, geomagnetic field, iBeacons, etc., as well as ground truth locations collected from hundreds of buildings in Chinese cities. Here we upload a sample of the data to Zenodo; the whole dataset can be found at https://www.kaggle.com/c/indoor-location-navigation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 427 quarterly time series used in the Kaggle Tourism forecasting competition.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the trained model weights for the baseline model and the winning solutions in the Kaggle competition "HuBMAP+HPA - Hacking the Human Body", and is part of the paper "Segmenting functional tissue units across human organs using community-driven development of generalizable machine learning algorithms".
The directory contains:
trained_model_1_weights.zip: Trained model weights for first place solution (Team 1).
trained_model_2_weights.zip: Trained model weights for second place solution (Team 2).
trained_model_3_weights.zip: Trained model weights for third place solution (Team 3).
trained_model_weights_baseline.zip: Trained model weights for the baseline model.
This dataset was created by nelson mandela
A large set of images of cats and dogs. There are 1,738 corrupted images, which are dropped. This dataset is part of a now-closed Kaggle competition and represents a subset of the so-called Asirra dataset.
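A hedged sketch of how such corrupted files can be found with Pillow (the directory name is a placeholder for wherever the images are extracted):

    from pathlib import Path
    from PIL import Image

    bad = []
    for path in Path("PetImages").rglob("*.jpg"):   # placeholder directory
        try:
            with Image.open(path) as img:
                img.verify()    # raises on truncated or corrupt files
        except Exception:
            bad.append(path)
    print(f"{len(bad)} corrupted images to drop")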
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In recent years, with the continuous improvement of the financial system and the rapid development of the banking industry, competition within the banking industry has intensified. At the same time, with the rapid development of information technology and Internet technology, customers' choice of financial products is becoming more and more diversified, customers' dependence on and loyalty to banking institutions are declining, and the problem of customer churn in commercial banks is becoming more and more prominent. How to predict customer behavior and retain existing customers has become a major challenge for banks. Therefore, this study takes a bank's business data on the Kaggle platform as the research object, uses multiple sampling methods to balance the data, constructs a bank customer churn prediction model for churn identification with GA-XGBoost, and conducts an interpretability analysis of the GA-XGBoost model to provide decision support and suggestions for the banking industry to prevent customer churn. The results show that: (1) the applied SMOTEENN is more effective than SMOTE and ADASYN in dealing with the imbalance of banking data; (2) the F1 and AUC values of the XGBoost model improved and optimized with a genetic algorithm reach 90% and 99%, respectively, which is optimal compared to six other machine learning models, and the GA-XGBoost classifier was identified as the best solution for the customer churn problem; (3) using Shapley values, we explain how each feature affects the model results and analyze the features that have a high impact on the model prediction, such as the total number of transactions in the past year, the amount of transactions in the past year, the number of products owned by customers, and the total sales balance. The contribution of this paper is mainly in two aspects: (1) this study can provide useful information from the black-box model based on the accurate identification of churned customers, which can serve as a reference for commercial banks to improve their service quality and retain customers; (2) it can serve as a reference for customer churn early-warning models in other related industries, helping the banking industry to maintain customer stability, maintain market position, and reduce corporate losses.
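Not the paper's exact pipeline, but a minimal sketch of its two main ingredients, SMOTEENN resampling followed by an XGBoost classifier; synthetic data stands in for the bank records, and the hyperparameters are placeholders rather than the GA-tuned values.

    from imblearn.combine import SMOTEENN
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    # Imbalanced synthetic data standing in for the churn records.
    X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # Rebalance the training split only, then fit XGBoost
    # (placeholder hyperparameters, not the GA-optimized ones).
    X_res, y_res = SMOTEENN(random_state=0).fit_resample(X_tr, y_tr)
    clf = XGBClassifier(n_estimators=300, learning_rate=0.1)
    clf.fit(X_res, y_res)
    print(clf.score(X_te, y_te))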
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The BDG2 open data set consists of 3,053 energy meters from 1,636 non-residential buildings with a range of two full years (2016 and 2017) at an hourly frequency (17,544 measurements per meter resulting in approximately 53.6 million measurements). These meters are collected from 19 sites across North America and Europe, and they measure electrical, heating and cooling water, steam, and solar energy as well as water and irrigation meters. Part of these data was used in the Great Energy Predictor III (GEPIII) competition hosted by the ASHRAE organization in October-December 2019. This subset includes data from 2,380 meters from 1,448 buildings that were used in the GEPIII, a machine learning competition for long-term prediction with an application to measurement and verification. This paper describes the process of data collection, cleaning, and convergence of time-series meter data, the meta-data about the buildings, and complementary weather data. This data set can be used for further prediction benchmarking and prototyping as well as anomaly detection, energy analysis, and building type classification.
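The per-meter count follows from two full years at hourly resolution, with 2016 a leap year; a quick check of the stated numbers:

    # 2016 (leap year) + 2017, hourly readings per meter
    per_meter = 366 * 24 + 365 * 24           # 8,784 + 8,760 = 17,544
    total = per_meter * 3053                  # across all meters
    print(per_meter, round(total / 1e6, 1))   # 17544, ~53.6 million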
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle. The corpus consists of ≈2.5 million snippets of ML code collected from ≈100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose.
The data is organized as a set of tables in CSV format. It includes several central entities: raw code blocks collected from Kaggle (code_blocks.csv), kernels (kernels_meta.csv), and competition meta-information (competitions_meta.csv). Manually annotated code blocks are presented in a separate table (murkup_data.csv). As this table contains the numeric id of each code block's semantic type, we also provide a mapping from the id to semantic class and subclass (vertices.csv).
Snippet information (code_blocks.csv) can be mapped to kernel metadata via kernel_id. Kernel metadata is linked to Kaggle competition information through comp_name. To ensure data quality, kernels_meta.csv includes only notebooks with an available Kaggle score.
Automatic classifications of code blocks are stored in data_with_preds.csv. This table can be mapped to code_blocks.csv through the code_blocks_index column, which corresponds to code_blocks indices.
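A sketch of these joins in pandas (the key column names come from the description above; everything else is as stated):

    import pandas as pd

    code_blocks = pd.read_csv("code_blocks.csv")
    kernels = pd.read_csv("kernels_meta.csv")
    competitions = pd.read_csv("competitions_meta.csv")
    preds = pd.read_csv("data_with_preds.csv")

    # snippets -> kernels -> competitions
    snippets = code_blocks.merge(kernels, on="kernel_id")
    snippets = snippets.merge(competitions, on="comp_name")

    # attach automatic classifications via the row-index mapping
    labeled = preds.merge(code_blocks, left_on="code_blocks_index", right_index=True)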
The corpus can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
This dataset was created by Samawel JABALLI
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This collection contains 100 soundscape recordings of 10 minutes duration, which have been annotated with 10,296 bounding box labels for 21 different bird species from the Western United States. The data were recorded in 2015 in the southern end of the Sierra Nevada mountain range in California, USA. This collection has been featured as test data in the 2020 BirdCLEF and Kaggle Birdcall Identification competition and can primarily be used for training and evaluation of machine learning algorithms.
Data collection
The recordings were made in Sequoia and Kings Canyon National Parks, two contiguous national parks in the southern Sierra Nevada mountain range in California, USA. The focus of the acoustic study was the high-elevation region of the parks; specifically, the headwater lake basins above 3,000 m in elevation. The original intent of the study was to monitor seasonal activity of birds and bats at lakes containing trout and lakes without trout, because the cascading impacts of trout on the adjacent terrestrial zone remain poorly understood. Soundscapes were recorded for 24 h continuously at 10 lakes (5 fishless, 5 fish-containing) throughout Sequoia and Kings Canyon National Parks during June-September 2015. Song Meter SM2+ units (Wildlife Acoustics, USA) powered by custom-made solar panels were used to obviate the need to swap batteries, due to the recording locations being extremely difficult to access. Song Meters continuously recorded mono-channel, 16-bit uncompressed WAVE files at a 48 kHz sampling rate. For this collection, recordings were resampled to 32 kHz and converted to FLAC.
Sampling and annotation protocol
A total of 100 10-minute segments of audio recorded between July 9 and 12, 2015, during morning hours (06:10-09:10 PDT), were selected at random from all 10 sites. Annotators were asked to box every bird call they could recognize, ignoring those that were too faint or unidentifiable. Every sound that could not be confidently assigned an identity was reviewed with 1-2 other experts in bird identification. To minimize observer bias, all identifying information about the location, date, and time of the recordings was hidden from the annotators. Raven Pro software was used to annotate the data. The provided labels contain full bird calls that are boxed in time and frequency. In this collection, we use eBird species codes as labels, following the 2021 eBird taxonomy (Clements list). Unidentifiable calls have been marked with "????" and were added as bounding box labels to the ground truth annotations. Parts of this dataset have previously been used in the 2020 BirdCLEF and Kaggle Birdcall Identification competition.
Files in this collection
Audio recordings can be accessed by downloading and extracting the "soundscape_data.zip" file. Soundscape recording filenames contain a sequential file ID, recording date, and timestamp in PDT (UTC-7). As an example, the file "HSN_001_20150708_061805.flac" has sequential ID 001 and was recorded on July 8th 2015 at 06:18:05 PDT. Ground truth annotations are listed in "annotations.csv", where each line specifies the corresponding filename, start and end time in seconds, low and high frequency in Hertz, and an eBird species code. These species codes can be mapped to the scientific and common name of a species with the "species.csv" file. The approximate recording location with longitude and latitude can be found in the "recording_location.txt" file.
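Following the naming scheme above, the metadata embedded in a filename can be recovered like this:

    from datetime import datetime

    name = "HSN_001_20150708_061805.flac"
    site, seq_id, date, time = name.rsplit(".", 1)[0].split("_")
    recorded = datetime.strptime(date + time, "%Y%m%d%H%M%S")
    print(site, seq_id, recorded)   # HSN 001 2015-07-08 06:18:05 (local PDT)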
Acknowledgements
Compiling this extensive dataset was a major undertaking, and we are very thankful to the domain experts who helped to collect and manually annotate the data for this collection (individual contributors in alphabetical order): Anna Calderón, Thomas Hahn, Ruoshi Huang, Angelly Tovar
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A processed and subsampled version of the data provided in the Multimodal Single-Cell Integration NeurIPS 2022 challenge (https://www.kaggle.com/competitions/open-problems-multimodal/data).
The data was filtered on donor "31800" and non-hidden cell types. Subsequently, 2,000 data points were randomly subsampled. The 2,000 most highly variable genes were selected for the RNA data, and peaks appearing in less than 5% of the cells were filtered out, resulting in 11,607 peaks.
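A hedged sketch of this preprocessing with scanpy follows; the description above does not name a tool, and the file names and .obs column names ("donor") are placeholders.

    import scanpy as sc

    # Placeholder files; the donor column name is an assumption.
    rna = sc.read_h5ad("multiome_rna.h5ad")
    rna = rna[rna.obs["donor"] == "31800"].copy()
    sc.pp.subsample(rna, n_obs=2000, random_state=0)       # 2,000 cells
    sc.pp.highly_variable_genes(rna, n_top_genes=2000)     # top 2,000 HVGs
    rna = rna[:, rna.var["highly_variable"]].copy()

    # ATAC modality: drop peaks present in fewer than 5% of cells.
    atac = sc.read_h5ad("multiome_atac.h5ad")
    sc.pp.filter_genes(atac, min_cells=int(0.05 * atac.n_obs))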