100+ datasets found
  1. Solution #4 for Predicting Molecular Properties Kaggle Competition

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Tijanic, Nebojsa (2020). Solution #4 for Predicting Molecular Properties Kaggle Competition [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3406153
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Popovic, Milos
    Rakocevic, Goran
    Stojanovic, Luka
    Tijanic, Nebojsa
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Code and additional data for solution #4 in Predicting Molecular Properties competition, described in #4 Solution [Hyperspatial Engineers].

  2. For competition

    • kaggle.com
    zip
    Updated Aug 6, 2022
    + more versions
    Cite
    Vikusya1808 (2022). For competition [Dataset]. https://www.kaggle.com/datasets/vikusya1808/for-competition/discussion
    Explore at:
    zip (131070359 bytes)
    Dataset updated
    Aug 6, 2022
    Authors
    Vikusya1808
    Description

    Dataset

    This dataset was created by Vikusya1808

    Contents

  3. Kaggle PII Competition Mixtral Dataset 2

    • kaggle.com
    zip
    Updated Apr 2, 2024
    + more versions
    Cite
    Andrew Gross (2024). Kaggle PII Competition Mixtral Dataset 2 [Dataset]. https://www.kaggle.com/datasets/awgross/kaggle-pii-competition-mixtral-dataset-2/data
    Explore at:
    zip (22503350 bytes)
    Dataset updated
    Apr 2, 2024
    Authors
    Andrew Gross
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Andrew Gross

    Released under MIT

    Contents

  4. Eedi-competition-kaggle-prompt-formats-mpnet

    • huggingface.co
    Updated Oct 15, 2024
    + more versions
    Cite
    Eedi-competition-kaggle-prompt-formats-mpnet [Dataset]. https://huggingface.co/datasets/VaggP/Eedi-competition-kaggle-prompt-formats-mpnet
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 15, 2024
    Authors
    EVANGELOS PAPAMITSOS
    Description

    VaggP/Eedi-competition-kaggle-prompt-formats-mpnet dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. Housing Prices Competition for Kaggle Learn Users

    • kaggle.com
    zip
    Updated Aug 30, 2019
    Cite
    Alessandro P. (2019). Housing Prices Competition for Kaggle Learn Users [Dataset]. https://www.kaggle.com/paretogp/housing-prices-competition-for-kaggle-learn-users
    Explore at:
    zip (183401 bytes)
    Dataset updated
    Aug 30, 2019
    Authors
    Alessandro P.
    Description

    Dataset

    This dataset was created by Alessandro P.

    Contents

  6. FSDKaggle2019

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Jan 24, 2020
    + more versions
    Cite
    Eduardo Fonseca (2020). FSDKaggle2019 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3612636
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Daniel P. W. Ellis
    Eduardo Fonseca
    Manoj Plakal
    Xavier Serra
    Frederic Font
    Description

    FSDKaggle2019 is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology. FSDKaggle2019 has been used for the DCASE Challenge 2019 Task 2, which was run as a Kaggle competition titled Freesound Audio Tagging 2019.

    Citation

    If you use the FSDKaggle2019 dataset or part of it, please cite our DCASE 2019 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Serra. "Audio tagging with noisy labels and minimal supervision". Proceedings of the DCASE 2019 Workshop, NYC, US (2019)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2019.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Data curators

    Eduardo Fonseca, Manoj Plakal, Xavier Favory, Jordi Pons

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    ABOUT FSDKaggle2019

    Freesound Dataset Kaggle 2019 (or FSDKaggle2019 for short) is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology [1]. FSDKaggle2019 has been used for Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2019. Please visit the DCASE2019 Challenge Task 2 website for more information. This task was hosted on the Kaggle platform as a competition titled Freesound Audio Tagging 2019. It was organized by researchers from the Music Technology Group (MTG) of Universitat Pompeu Fabra (UPF) and from the Sound Understanding team at Google AI Perception. The competition aimed to provide insight into the development of broadly applicable sound event classifiers able to cope with label noise and minimal supervision.

    FSDKaggle2019 employs audio clips from the following sources:

    Freesound Dataset (FSD): a dataset being collected at the MTG-UPF based on Freesound content organized with the AudioSet Ontology

    The soundtracks of a pool of Flickr videos taken from the Yahoo Flickr Creative Commons 100M dataset (YFCC)

    The audio data is labeled using a vocabulary of 80 labels from Google’s AudioSet Ontology [1], covering diverse topics: Guitar and other Musical Instruments, Percussion, Water, Digestive, Respiratory sounds, Human voice, Human locomotion, Hands, Human group actions, Insect, Domestic animals, Glass, Liquid, Motor vehicle (road), Mechanisms, Doors, and a variety of Domestic sounds. The full list of categories can be inspected in vocabulary.csv (see Files & Download below). The goal of the task was to build a multi-label audio tagging system that can predict appropriate label(s) for each audio clip in a test set.

    What follows is a summary of some of the most relevant characteristics of FSDKaggle2019. Nevertheless, it is highly recommended to read our DCASE 2019 paper for a more in-depth description of the dataset and how it was built.

    Ground Truth Labels

    The ground truth labels are provided at the clip-level, and express the presence of a sound category in the audio clip, hence can be considered weak labels or tags. Audio clips have variable lengths (roughly from 0.3 to 30s).

    The audio content from FSD has been manually labeled by humans following a data labeling process using the Freesound Annotator platform. Most labels have inter-annotator agreement but not all of them. More details about the data labeling process and the Freesound Annotator can be found in [2].

    The YFCC soundtracks were labeled using automated heuristics applied to the audio content and metadata of the original Flickr clips. Hence, a substantial amount of label noise can be expected. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises. More information about some of the types of label noise that can be encountered is available in [3].

    Specifically, FSDKaggle2019 features three types of label quality, one for each set in the dataset:

    curated train set: correct (but potentially incomplete) labels

    noisy train set: noisy labels

    test set: correct and complete labels

    Further details can be found below in the sections for each set.

    Format

    All audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.
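
    The stated format (uncompressed PCM, 16-bit, 44.1 kHz, mono) can be checked with Python's standard-library wave module. This is a minimal sketch: the file written here is a placeholder, not a real clip from the dataset.

```python
import wave
import struct

# Write a 1-second dummy clip in the format documented for FSDKaggle2019:
# PCM 16-bit, 44.1 kHz, mono. The filename is a placeholder.
with wave.open("dummy_clip.wav", "wb") as w:
    w.setnchannels(1)       # mono
    w.setsampwidth(2)       # 2 bytes = 16-bit PCM
    w.setframerate(44100)   # 44.1 kHz
    w.writeframes(struct.pack("<h", 0) * 44100)  # 1 s of silence

# Sanity-check that a clip matches the documented format.
with wave.open("dummy_clip.wav", "rb") as r:
    assert r.getnchannels() == 1
    assert r.getsampwidth() == 2
    assert r.getframerate() == 44100
```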

    DATA SPLIT

    FSDKaggle2019 consists of two train sets and one test set. The idea is to limit the supervision provided for training (i.e., the manually-labeled, hence reliable, data), thus promoting approaches to deal with label noise.

    Curated train set

    The curated train set consists of manually-labeled data from FSD.

    Number of clips/class: 75, except for a few classes with fewer

    Total number of clips: 4970

    Avg number of labels/clip: 1.2

    Total duration: 10.5 hours

    The duration of the audio clips ranges from 0.3 to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording/uploading sounds. Labels are correct but potentially incomplete. It can happen that a few of these audio clips present additional acoustic material beyond the provided ground truth label(s).

    Noisy train set

    The noisy train set is a larger set of noisy web audio data from Flickr videos taken from the YFCC dataset [5].

    Number of clips/class: 300

    Total number of clips: 19,815

    Avg number of labels/clip: 1.2

    Total duration: ~80 hours

    The duration of the audio clips ranges from 1s to 15s, with the vast majority lasting 15s. Labels are automatically generated and purposefully noisy. No human validation is involved. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises.

    Considering the numbers above, the per-class data distribution available for training is, for most of the classes, 300 clips from the noisy train set and 75 clips from the curated train set. This means 80% noisy / 20% curated at the clip level, while at the duration level the proportion is more extreme considering the variable-length clips.
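
    The 80% / 20% clip-level proportion follows directly from the per-class counts given above:

```python
# Per-class training data for a typical FSDKaggle2019 class:
noisy_clips = 300    # noisy train set
curated_clips = 75   # curated train set

noisy_fraction = noisy_clips / (noisy_clips + curated_clips)
assert noisy_fraction == 0.8  # 80% noisy / 20% curated at the clip level
```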

    Test set

    The test set is used for system evaluation and consists of manually-labeled data from FSD.

    Number of clips/class: between 50 and 150

    Total number of clips: 4481

    Avg number of labels/clip: 1.4

    Total duration: 12.9 hours

    The acoustic material present in the test set clips is labeled exhaustively using the aforementioned vocabulary of 80 classes. Most labels have inter-annotator agreement, but not all of them. Barring human error, the labels are correct and complete with respect to the target vocabulary; nonetheless, a few clips could still contain additional (unlabeled) acoustic content outside the vocabulary.

    During the DCASE2019 Challenge Task 2, the test set was split into two subsets, for the public and private leaderboards, and only the data corresponding to the public leaderboard was provided. In this current package you will find the full test set with all the test labels. To allow comparison with previous work, the file test_post_competition.csv includes a flag to determine the corresponding leaderboard (public or private) for each test clip (see more info in Files & Download below).
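
    Selecting the clips for one leaderboard from that flag could look like the sketch below. The column names ("fname", "usage") and values are assumptions for illustration, not the actual schema of test_post_competition.csv.

```python
import csv
import io

# Hypothetical excerpt of test_post_competition.csv; column names and
# values are illustrative assumptions, not the real schema.
sample = io.StringIO(
    "fname,labels,usage\n"
    "clip1.wav,Bark,Public\n"
    "clip2.wav,Meow,Private\n"
)

rows = list(csv.DictReader(sample))
# Keep only the clips that were scored on the public leaderboard.
public = [r["fname"] for r in rows if r["usage"] == "Public"]
print(public)
```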

    Acoustic mismatch

    As mentioned before, FSDKaggle2019 uses audio clips from two sources:

    FSD: curated train set and test set, and

    YFCC: noisy train set.

    While the sources of audio (Freesound and Flickr) are collaboratively contributed and pretty diverse themselves, a certain acoustic mismatch can be expected between FSD and YFCC. We conjecture this mismatch comes from a variety of reasons. For example, through acoustic inspection of a small sample of both data sources, we find a higher percentage of high quality recordings in FSD. In addition, audio clips in Freesound are typically recorded with the purpose of capturing audio, which is not necessarily the case in YFCC.

    This mismatch can have an impact on the evaluation, considering that most of the train data come from YFCC, while all test data are drawn from FSD. This constraint (i.e., noisy training data coming from a different web audio source than the test set) is sometimes a real-world condition.

    LICENSE

    All clips in FSDKaggle2019 are released under Creative Commons (CC) licenses. For attribution purposes and to facilitate attribution of these files to third parties, we include a mapping from the audio clips to their corresponding licenses.

    Curated train set and test set. All clips in Freesound are released under different modalities of Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. The licenses are specified in the files train_curated_post_competition.csv and test_post_competition.csv. These licenses can be CC0, CC-BY, CC-BY-NC and CC Sampling+.

    Noisy train set. Similarly, the licenses of the soundtracks from Flickr used in FSDKaggle2019 are specified in the file train_noisy_post_competition.csv. These licenses can be CC-BY and CC BY-SA.

    In addition, FSDKaggle2019 as a whole is the result of a curation process and it has an additional license. FSDKaggle2019 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2019.doc zip file.

    FILES & DOWNLOAD

    FSDKaggle2019 can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDKaggle2019.audio_train_curated/   Audio clips in the curated train set
    │
    └───FSDKaggle2019.audio_train_noisy/     Audio clips in the noisy

  7. Kaggle Wikipedia Web Traffic Daily Dataset (without Missing Values)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 1, 2021
    + more versions
    Cite
    Webb, Geoff (2021). Kaggle Wikipedia Web Traffic Daily Dataset (without Missing Values) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3892918
    Explore at:
    Dataset updated
    Apr 1, 2021
    Dataset provided by
    Bergmeir, Christoph
    Montero-Manso, Pablo
    Hyndman, Rob
    Godahewa, Rakshitha
    Webb, Geoff
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was used in the Kaggle Wikipedia Web Traffic forecasting competition. It contains 145,063 daily time series representing the number of hits (web traffic) for a set of Wikipedia pages from 2015-07-01 to 2017-09-10.

    The original dataset contains missing values; in this version they have simply been replaced with zeros.
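
    The zero-filling described above can be sketched in a few lines; the sample series here is made up for illustration, with None standing in for a missing daily count.

```python
# Replace missing daily hit counts (None) with zeros, as done for this
# version of the Wikipedia Web Traffic dataset. Sample values are made up.
series_with_gaps = [12, None, 7, None, None, 3]
filled = [0 if v is None else v for v in series_with_gaps]
print(filled)  # [12, 0, 7, 0, 0, 3]
```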

  8. GMSC Dataset

    • paperswithcode.com
    Updated Sep 19, 2011
    Cite
    (2011). GMSC Dataset [Dataset]. https://paperswithcode.com/dataset/gmsc
    Explore at:
    Dataset updated
    Sep 19, 2011
    Description

    Data for a Kaggle competition

    Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.

    Credit scoring algorithms, which estimate the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition asked participants to improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years.

    The goal of this competition is to build a model that borrowers can use to help make the best financial decisions.

    Historical data are provided on 250,000 borrowers and the prize pool is $5,000 ($3,000 for first, $1,500 for second and $500 for third).

  9. competition dataset

    • kaggle.com
    zip
    Updated Jul 22, 2024
    Cite
    Edifon Jimmy (2024). competition dataset [Dataset]. https://www.kaggle.com/datasets/edifonjimmy/competition-dataset/discussion
    Explore at:
    zip (1956867 bytes)
    Dataset updated
    Jul 22, 2024
    Authors
    Edifon Jimmy
    Description

    Dataset

    This dataset was created by Edifon Jimmy

    Contents

  10. Indoor Location Competition 2.0 Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 20, 2023
    Cite
    Han, Yeqiang (2023). Indoor Location Competition 2.0 Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8265878
    Explore at:
    Dataset updated
    Aug 20, 2023
    Dataset provided by
    Bahl, Paramvir
    Yin, Zhimeng
    Fan, Xiubin
    Hu, Yuming
    Shu, Yuanchao
    Qian, Feng
    Liu, Jie
    Ji, Zhe
    Han, Yeqiang
    Xu, Qiang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset of our MobiCom 2023 paper titled "The Wisdom of 1,170 Teams: Lessons and Experiences from a Large Indoor Localization Competition". We organized an indoor location competition in 2021; 1,446 contestants from more than 60 countries, forming 1,170 teams, participated in this unique global event. In this competition, a first-of-its-kind large-scale indoor location benchmark dataset (60 GB) was released. The dataset consists of dense indoor signatures of WiFi, geomagnetic field, iBeacons, etc., as well as ground-truth locations collected from hundreds of buildings in Chinese cities. Here we upload sample data to Zenodo; the whole dataset can be found at https://www.kaggle.com/c/indoor-location-navigation.

  11. Tourism Quarterly Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 1, 2021
    Cite
    Tourism Quarterly Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3889386
    Explore at:
    Dataset updated
    Apr 1, 2021
    Dataset provided by
    Bergmeir, Christoph
    Godahewa, Rakshitha
    Webb, Geoff
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains 427 quarterly time series used in the Kaggle Tourism forecasting competition.

  12. Trained Models for "Segmenting functional tissue units across human organs...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 18, 2023
    + more versions
    Cite
    Yashvardhan Jain; Leah Godwin L.; Sripad Joshi; Shriya Mandarapu; Trang Le; Cecilia Lindskog; Emma Lundberg; Katy Börner (2023). Trained Models for "Segmenting functional tissue units across human organs using community-driven development of generalizable machine learning algorithms" [Dataset]. http://doi.org/10.5281/zenodo.7545793
    Explore at:
    zip
    Dataset updated
    Jan 18, 2023
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Yashvardhan Jain; Leah Godwin L.; Sripad Joshi; Shriya Mandarapu; Trang Le; Cecilia Lindskog; Emma Lundberg; Katy Börner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the trained model weights for the baseline model and the winning solutions in the Kaggle competition "HuBMAP+HPA - Hacking the Human Body", and is part of the paper "Segmenting functional tissue units across human organs using community-driven development of generalizable machine learning algorithms".

    The directory contains:

    trained_model_1_weights.zip: Trained model weights for first place solution (Team 1).

    trained_model_2_weights.zip: Trained model weights for second place solution (Team 2).

    trained_model_3_weights.zip: Trained model weights for third place solution (Team 3).

    trained_model_weights_baseline.zip: Trained model weights for the baseline model.

  13. wsdm-competition

    • kaggle.com
    zip
    Updated Nov 21, 2024
    Cite
    nelson mandela (2024). wsdm-competition [Dataset]. https://www.kaggle.com/datasets/nelsonmandela18/wsdm-competition/code
    Explore at:
    zip (113139292 bytes)
    Dataset updated
    Nov 21, 2024
    Authors
    nelson mandela
    Description

    Dataset

    This dataset was created by nelson mandela

    Contents

  14. cats-vs-dogs

    • opendatalab.com
    zip
    Updated Jan 2, 2024
    + more versions
    Cite
    Microsoft Research (2024). cats-vs-dogs [Dataset]. https://opendatalab.com/OpenDataLab/cats-vs-dogs
    Explore at:
    zip
    Dataset updated
    Jan 2, 2024
    Dataset provided by
    Microsoft Research
    Description

    A large set of images of cats and dogs; 1,738 corrupted images were dropped. This dataset is part of a now-closed Kaggle competition and represents a subset of the so-called Asirra dataset.

  15. Comparison results of different model.

    • plos.figshare.com
    xls
    Updated Dec 8, 2023
    Cite
    Ke Peng; Yan Peng; Wenguang Li (2023). Comparison results of different model. [Dataset]. http://doi.org/10.1371/journal.pone.0289724.t006
    Explore at:
    xls
    Dataset updated
    Dec 8, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Ke Peng; Yan Peng; Wenguang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years, with the continuous improvement of the financial system and the rapid development of the banking industry, competition within the banking sector has intensified. At the same time, with the rapid development of information and Internet technology, customers' choices of financial products have diversified, their dependence on and loyalty to banking institutions have declined, and customer churn has become an increasingly prominent problem for commercial banks. How to predict customer behavior and retain existing customers has become a major challenge for banks. This study therefore takes a bank's business data from the Kaggle platform as its research object, compares multiple sampling methods for balancing the data, constructs a GA-XGBoost model for bank customer churn prediction, and conducts an interpretability analysis of the GA-XGBoost model to provide decision support and suggestions for the banking industry in preventing customer churn. The results show that: (1) the applied SMOTEENN is more effective than SMOTE and ADASYN in dealing with the imbalance of the banking data; (2) the F1 and AUC values of the XGBoost model improved and optimized with a genetic algorithm reach 90% and 99%, respectively, the best among the six other machine learning models compared, so the GA-XGBoost classifier was identified as the best solution to the customer churn problem; (3) using Shapley values, the study explains how each feature affects the model's results and analyzes the features with the highest impact on the predictions, such as the total number of transactions in the past year, the transaction amount in the past year, the number of products owned by customers, and the total sales balance.
    The contribution of this paper is twofold: (1) the study extracts useful information from the black-box model on top of accurately identifying churned customers, offering commercial banks a reference for improving service quality and retaining customers; (2) it provides a reference for customer churn early-warning models in related industries, helping banks maintain customer stability, preserve market position, and reduce corporate losses.

  16. buds-lab/building-data-genome-project-2: v1.0

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Sep 2, 2020
    Cite
    Clayton Miller; Anjukan Kathirgamanathan; Bianca Picchetti; Pandarasamy Arjunan; June Young Park; Zoltan Nagy; Paul Raftery; Brodie W. Hobson; Zixiao Shi; Forrest Meggers (2020). buds-lab/building-data-genome-project-2: v1.0 [Dataset]. http://doi.org/10.5281/zenodo.3887306
    Explore at:
    zip
    Dataset updated
    Sep 2, 2020
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Clayton Miller; Anjukan Kathirgamanathan; Bianca Picchetti; Pandarasamy Arjunan; June Young Park; Zoltan Nagy; Paul Raftery; Brodie W. Hobson; Zixiao Shi; Forrest Meggers
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The BDG2 open data set consists of 3,053 energy meters from 1,636 non-residential buildings with a range of two full years (2016 and 2017) at an hourly frequency (17,544 measurements per meter resulting in approximately 53.6 million measurements). These meters are collected from 19 sites across North America and Europe, and they measure electrical, heating and cooling water, steam, and solar energy as well as water and irrigation meters. Part of these data was used in the Great Energy Predictor III (GEPIII) competition hosted by the ASHRAE organization in October-December 2019. This subset includes data from 2,380 meters from 1,448 buildings that were used in the GEPIII, a machine learning competition for long-term prediction with an application to measurement and verification. This paper describes the process of data collection, cleaning, and convergence of time-series meter data, the meta-data about the buildings, and complementary weather data. This data set can be used for further prediction benchmarking and prototyping as well as anomaly detection, energy analysis, and building type classification.
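
    The per-meter and total measurement counts stated above are consistent with two full years of hourly readings, which a few lines of arithmetic confirm:

```python
# Two full years of hourly readings (2016 is a leap year).
days = 366 + 365          # 2016 + 2017
hours_per_meter = days * 24
assert hours_per_meter == 17544  # measurements per meter, as stated

# Approximate total measurements across all 3,053 meters.
total = 3053 * hours_per_meter
print(round(total / 1e6, 1))  # ≈ 53.6 million, as stated
```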

  17. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

    • zenodo.org
    csv, txt
    Updated Sep 15, 2023
    + more versions
    Cite
    Anastasia Drozdova; Polina Guseva; Ekaterina Trofimova; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.7733823
    Explore at:
    csv, txt
    Dataset updated
    Sep 15, 2023
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Anastasia Drozdova; Polina Guseva; Ekaterina Trofimova; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle. The corpus consists of ≈ 2.5 million snippets of ML code collected from ≈ 100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose.

    The data is organized as a set of tables in CSV format. It includes several central entities: raw code blocks collected from Kaggle (code_blocks.csv), kernels (kernels_meta.csv) and competitions meta information (competitions_meta.csv). Manually annotated code blocks are presented as a separate table (murkup_data.csv). As this table contains the numeric id of the code block semantic type, we also provide a mapping from the id to semantic class and subclass (vertices.csv).

    Snippets information (code_blocks.csv) can be mapped with kernels meta-data via kernel_id. Kernels metadata is linked to Kaggle competitions information through comp_name. To ensure the quality of the data kernels_meta.csv includes only notebooks with an available Kaggle score.

    Automatic classifications of code blocks are stored in data_with_preds.csv. This table can be mapped to code_blocks.csv through the code_blocks_index column, which corresponds to code_blocks indices.

    The corpus can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
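
    The table linkage described above (code_blocks.csv → kernels_meta.csv via kernel_id, kernels_meta.csv → competitions_meta.csv via comp_name) can be sketched with miniature in-memory tables. Only the linking columns come from the text; the other columns and values are made up for illustration.

```python
import csv
import io

# Miniature stand-ins for the Code4ML tables; the "code" column and all
# values are illustrative assumptions, while kernel_id and comp_name are
# the linking columns described in the dataset documentation.
kernels_meta = list(csv.DictReader(io.StringIO(
    "kernel_id,comp_name\n"
    "k1,titanic\n"
    "k2,house-prices\n"
)))
code_blocks = list(csv.DictReader(io.StringIO(
    "kernel_id,code\n"
    "k1,import pandas as pd\n"
    "k2,model.fit(X)\n"
)))

# Join snippets to their competition via the kernels_meta table.
comp_by_kernel = {k["kernel_id"]: k["comp_name"] for k in kernels_meta}
joined = [(b["code"], comp_by_kernel[b["kernel_id"]]) for b in code_blocks]
print(joined)
```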

  18. competition

    • kaggle.com
    zip
    Updated Dec 26, 2020
    + more versions
    Cite
    Samawel JABALLI (2020). competition [Dataset]. https://www.kaggle.com/samawel97/competition
    Explore at:
    zip (2577830 bytes)
    Dataset updated
    Dec 26, 2020
    Authors
    Samawel JABALLI
    Description

    Dataset

    This dataset was created by Samawel JABALLI

    Contents

  19. A collection of fully-annotated soundscape recordings from the southern...

    • zenodo.org
    • data.niaid.nih.gov
    csv, pdf, txt, zip
    Updated Jul 12, 2024
    Cite
    Mary Clapp; Stefan Kahl; Erik Meyer; Megan McKenna; Holger Klinck; Gail Patricelli (2024). A collection of fully-annotated soundscape recordings from the southern Sierra Nevada mountain range [Dataset]. http://doi.org/10.5281/zenodo.7525805
    Explore at:
    Available download formats: csv, txt, zip, pdf
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mary Clapp; Stefan Kahl; Erik Meyer; Megan McKenna; Holger Klinck; Gail Patricelli
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Sierra Nevada, California
    Description

    This collection contains 100 soundscape recordings of 10 minutes duration, which have been annotated with 10,296 bounding box labels for 21 different bird species from the Western United States. The data were recorded in 2015 in the southern end of the Sierra Nevada mountain range in California, USA. This collection has been featured as test data in the 2020 BirdCLEF and Kaggle Birdcall Identification competition and can primarily be used for training and evaluation of machine learning algorithms.

    Data collection

    The recordings were made in Sequoia and Kings Canyon National Parks, two contiguous national parks in the southern Sierra Nevada mountain range in California, USA. The focus of the acoustic study was the high-elevation region of the Parks; specifically, the headwater lake basins above 3,000 m in elevation. The original intent of the study was to monitor seasonal activity of birds and bats at lakes containing trout and at lakes without trout, because the cascading impacts of trout on the adjacent terrestrial zone remain poorly understood. Soundscapes were recorded for 24 h continuously at 10 lakes (5 fishless, 5 fish-containing) throughout Sequoia and Kings Canyon National Parks during June-September 2015. Song Meter SM2+ units (Wildlife Acoustics, USA) powered by custom-made solar panels were used to obviate the need to swap batteries, as the recording locations were extremely difficult to access. Song Meters continuously recorded mono-channel, 16-bit uncompressed WAVE files at a 48 kHz sampling rate. For this collection, recordings were resampled to 32 kHz and converted to FLAC.

    Sampling and annotation protocol

    A total of 100 10-minute segments of audio recorded between July 9 and 12, 2015 during morning hours (06:10-09:10 PDT) were selected at random across all 10 sites. Annotators were asked to box every bird call they could recognize, ignoring those that were too faint or unidentifiable. Every sound that could not be confidently assigned an identity was reviewed with 1-2 other experts in bird identification. To minimize observer bias, all identifying information about the location, date, and time of the recordings was hidden from the annotators. Raven Pro software was used to annotate the data. The provided labels contain full bird calls boxed in time and frequency. In this collection, we use eBird species codes as labels, following the 2021 eBird taxonomy (Clements list). Unidentifiable calls were marked with “????” and added as bounding box labels to the ground truth annotations. Parts of this dataset have previously been used in the 2020 BirdCLEF and Kaggle Birdcall Identification competition.

    Files in this collection

    Audio recordings can be accessed by downloading and extracting the “soundscape_data.zip” file. Soundscape recording filenames contain a sequential file ID and the recording date and timestamp in PDT (UTC-7). As an example, the file “HSN_001_20150708_061805.flac” has sequential ID 001 and was recorded on July 8th 2015 at 06:18:05 PDT. Ground truth annotations are listed in “annotations.csv”, where each line specifies the corresponding filename, the start and end time in seconds, the low and high frequency in Hertz, and an eBird species code. These species codes can be mapped to the scientific and common names of each species using the “species.csv” file. The approximate recording location (longitude and latitude) can be found in the “recording_location.txt” file.
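    The filename convention above can be parsed mechanically. A minimal sketch in Python, assuming every filename follows the <site>_<id>_<YYYYMMDD>_<HHMMSS>.flac pattern shown in the example (the helper name is ours, not part of the dataset):

```python
from datetime import datetime

# Hypothetical helper (not part of the dataset) that parses a
# soundscape filename of the form <site>_<id>_<YYYYMMDD>_<HHMMSS>.flac.
def parse_recording_name(filename):
    stem = filename.rsplit(".", 1)[0]
    site, seq_id, date_str, time_str = stem.split("_")
    # Timestamps are local PDT (UTC-7), as stated in the description.
    recorded_at = datetime.strptime(date_str + time_str, "%Y%m%d%H%M%S")
    return site, seq_id, recorded_at

site, seq_id, recorded_at = parse_recording_name("HSN_001_20150708_061805.flac")
print(site, seq_id, recorded_at.isoformat())
# → HSN 001 2015-07-08T06:18:05
```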

    Acknowledgements

    Compiling this extensive dataset was a major undertaking, and we are very thankful to the domain experts who helped to collect and manually annotate the data for this collection (individual contributors in alphabetic order): Anna CalderĂłn, Thomas Hahn, Ruoshi Huang, Angelly Tovar

  20. neurips2022_multiome

    • figshare.com
    hdf
    Updated Jun 1, 2023
    + more versions
    Cite
    Dominik Klein (2023). neurips2022_multiome [Dataset]. http://doi.org/10.6084/m9.figshare.20503227.v1
    Explore at:
    Available download formats: hdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Dominik Klein
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Processed and subsampled version of the data provided in the Multimodal Single-Cell Integration NeurIPS 2022 challenge (https://www.kaggle.com/competitions/open-problems-multimodal/data).

    The data were filtered to donor "31800" and to non-hidden cell types. Subsequently, 2000 data points were randomly subsampled. The 2000 most highly variable genes were selected for the RNA data, and peaks that appeared in less than 5% of the cells were filtered out, resulting in 11607 peaks.
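    The subsampling and peak-filtering steps above can be sketched with NumPy on synthetic data. The 2000-point subsample and the 5% peak-detection threshold come from the description; the matrix sizes and random accessibility rates are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a cells x peaks binary accessibility matrix;
# the sizes and per-peak detection rates are illustrative, not from
# the actual dataset.
n_cells, n_peaks = 5000, 300
peak_rate = rng.uniform(0.0, 0.15, size=n_peaks)
atac = (rng.random((n_cells, n_peaks)) < peak_rate).astype(np.int8)

# Randomly subsample 2000 data points, as in the processed dataset.
keep_cells = rng.choice(n_cells, size=2000, replace=False)
atac_sub = atac[keep_cells]

# Drop peaks detected in fewer than 5% of the subsampled cells.
peak_frac = (atac_sub > 0).mean(axis=0)
atac_filtered = atac_sub[:, peak_frac >= 0.05]
```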

