100+ datasets found
  1. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Nov 27, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Explore at:
    zip (167219625372 bytes). Available download formats.
    Dataset updated
    Nov 27, 2025
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle, which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is that Meta Kaggle enriches Meta Kaggle Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.

    File organization

    The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files, e.g., folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub-folder contains up to 1 thousand files, e.g., 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
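    This scheme gives a deterministic mapping from a KernelVersions id to a path. Below is a minimal sketch of that mapping; the lack of zero-padding in folder names and the file extension are assumptions to verify against a local copy:

    def kernel_version_path(kernel_version_id: int, ext: str = "py") -> str:
        # Top-level folder groups ids by millions, sub-folder by thousands.
        top = kernel_version_id // 1_000_000        # e.g. 123 for id 123,456,789
        sub = (kernel_version_id // 1_000) % 1_000  # e.g. 456
        return f"{top}/{sub}/{kernel_version_id}.{ext}"

    print(kernel_version_path(123_456_789))  # -> 123/456/123456789.py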

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
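    As a hedged sketch of such a requester-pays download with the google-cloud-storage Python client (the object name below is illustrative, and YOUR_BILLING_PROJECT stands in for a GCP project with billing enabled):

    from google.cloud import storage

    client = storage.Client(project="YOUR_BILLING_PROJECT")
    # user_project tells GCS which project to bill for the download.
    bucket = client.bucket("kaggle-meta-kaggle-code-downloads",
                           user_project="YOUR_BILLING_PROJECT")
    blob = bucket.blob("123/456/123456789.ipynb")  # illustrative object name
    blob.download_to_filename("123456789.ipynb")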

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  2. kbp37

    • huggingface.co
    Cite
    Speech and Language Technology, DFKI, kbp37 [Dataset]. https://huggingface.co/datasets/DFKI-SLT/kbp37
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset authored and provided by
    Speech and Language Technology, DFKI
    License

    https://choosealicense.com/licenses/other/

    Description

    KBP37 is a revision of the MIML-RE annotation dataset provided by Gabor Angeli et al. (2014). It uses both the 2010 and 2013 KBP official document collections, as well as a July 2013 dump of Wikipedia, as the text corpus for annotation; 33,811 sentences have been annotated. Zhang and Wang made several refinements: 1. They add direction to the relation names, e.g., 'per:employee_of' is split into 'per:employee_of(e1,e2)' and 'per:employee_of(e2,e1)'. They also replace 'org:parents' with 'org:subsidiaries' and 'org:member_of' with 'org:members' (by their reverse directions). 2. They discard low-frequency relations, such that both directions of each relation occur more than 100 times in the dataset.

    KBP37 contains 18 directional relations and an additional 'no_relation' relation, resulting in 37 relation classes.
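    A minimal loading sketch with the Hugging Face datasets library (split and field names should be checked against the dataset card):

    from datasets import load_dataset

    kbp37 = load_dataset("DFKI-SLT/kbp37")
    print(kbp37)              # available splits and their sizes
    print(kbp37["train"][0])  # one annotated sentence with its relation label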

  3. Data from: A Neural Approach for Text Extraction from Scholarly Figures

    • data.uni-hannover.de
    zip
    Updated Jan 20, 2022
    Cite
    TIB (2022). A Neural Approach for Text Extraction from Scholarly Figures [Dataset]. https://data.uni-hannover.de/dataset/a-neural-approach-for-text-extraction-from-scholarly-figures
    Explore at:
    zip. Available download formats.
    Dataset updated
    Jan 20, 2022
    Dataset authored and provided by
    TIB
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    A Neural Approach for Text Extraction from Scholarly Figures

    This is the readme for the supplemental data for our ICDAR 2019 paper.

    You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202

    If you found this dataset useful, please consider citing our paper:

    @inproceedings{DBLP:conf/icdar/MorrisTE19,
     author  = {David Morris and
            Peichen Tang and
            Ralph Ewerth},
     title   = {A Neural Approach for Text Extraction from Scholarly Figures},
     booktitle = {2019 International Conference on Document Analysis and Recognition,
            {ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
     pages   = {1438--1443},
     publisher = {{IEEE}},
     year   = {2019},
     url    = {https://doi.org/10.1109/ICDAR.2019.00231},
     doi    = {10.1109/ICDAR.2019.00231},
     timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
     biburl  = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
     bibsource = {dblp computer science bibliography, https://dblp.org}
    }
    

    This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).

    Datasets

    We used different sources of data for testing, validation, and training. Our testing set was assembled from the work by Böschen et al. that we cited. We excluded the DeGruyter dataset from it and used that as our validation dataset.

    Testing

    These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2

    Validation

    The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.

    Training

    We used label_generator's generated dataset, which the author made available in a requester-pays Amazon S3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.

    Code

    We have made our code available in code.zip. We will upload code, announce further news, and field questions via the GitHub repo.

    Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.

    We used a Tesseract script to run text extraction on detected text rows. It is included in our code archive (code.zip) as text_recognition_multipro.py.

    We used a Java evaluation tool provided by Falk Böschen, adapted to our file structure. We included it as evaluator.jar.

    Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.

  4. Can I Play It? (CIPI) Dataset

    • zenodo.org
    Updated Jun 27, 2024
    Cite
    Pedro Ramoneda; Dasaem Jeong; Vsevolod Eremenko; Nazif Can Tamer; Marius Miron; Xavier Serra (2024). Can I Play It? (CIPI) Dataset [Dataset]. http://doi.org/10.5281/zenodo.8037327
    Explore at:
    Dataset updated
    Jun 27, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Pedro Ramoneda; Dasaem Jeong; Vsevolod Eremenko; Nazif Can Tamer; Marius Miron; Xavier Serra
    Description

    Can I Play It? (CIPI) dataset from Combining piano performance dimensions for score difficulty classification

    Description

    Overview

    Predicting the difficulty of playing a musical score plays a pivotal role in structuring and exploring score collections, with significant implications for music education. The automatic difficulty classification of piano scores, however, remains an unsolved challenge. This is largely due to the scarcity of annotated data and the inherent subjectiveness in the annotation process. The "Can I Play It?" (CIPI) dataset represents a substantial step forward in this domain, providing a machine-readable collection of piano scores paired with difficulty annotations from the esteemed Henle Verlag.

    Dataset Creation

    The CIPI dataset is meticulously assembled by aligning public domain scores with their corresponding difficulty labels sourced from Henle Verlag. This initial pairing was subsequently reviewed and refined by an expert pianist to ensure accuracy and reliability. The dataset is structured to facilitate easy access and interpretation, making it a valuable resource for researchers and educators alike.

    Contributions and Findings

    Our work makes two primary contributions to the field of score difficulty classification. Firstly, we address the critical issue of data scarcity, introducing the CIPI dataset to the academic community. Secondly, we delve into various input representations derived from score information, utilizing pre-trained machine learning models tailored for piano fingering and expressiveness. These models draw inspiration from musicological definitions of performance, offering nuanced insights into score difficulty.

    Through extensive experimentation, we demonstrate that an ensemble approach—combining outputs from multiple classifiers—yields superior results compared to individual classifiers. This highlights the diverse facets of difficulty captured by different representations. Our comprehensive experiments lay a robust foundation for future endeavors in score difficulty classification, and our best-performing model reports a balanced accuracy of 39.5% and a median square error of 1.1 across the nine difficulty levels introduced in this study.

    Access and Usage

    The CIPI dataset, along with the associated code and models, is made publicly available to ensure reproducibility and to encourage further research in this domain. Users are encouraged to reference this resource in their work and to contribute to its ongoing development.

    Citation

    Ramoneda, P., Jeong, D., Eremenko, V., Tamer, N. C., Miron, M., & Serra, X. (2024). Combining Piano Performance Dimensions for Score Difficulty Classification. Expert Systems with Applications, 238, 121776. DOI: 10.1016/j.eswa.2023.121776

    @article{Ramoneda2024,
      author  = {Pedro Ramoneda and Dasaem Jeong and Vsevolod Eremenko and Nazif Can Tamer and Marius Miron and Xavier Serra},
      title   = {Combining Piano Performance Dimensions for Score Difficulty Classification},
      journal = {Expert Systems with Applications},
      volume  = {238},
      pages   = {121776},
      year    = {2024},
      doi     = {10.1016/j.eswa.2023.121776},
      url     = {https://doi.org/10.1016/j.eswa.2023.121776}
    }

    Contact

    pedro.ramoneda@upf.edu

    xavier.serra@upf.edu

  5. Make Data Count Dataset - MinerU Extraction

    • kaggle.com
    zip
    Updated Aug 26, 2025
    Cite
    Omid Erfanmanesh (2025). Make Data Count Dataset - MinerU Extraction [Dataset]. https://www.kaggle.com/datasets/omiderfanmanesh/make-data-count-dataset-mineru-extraction
    Explore at:
    zip (4272989320 bytes). Available download formats.
    Dataset updated
    Aug 26, 2025
    Authors
    Omid Erfanmanesh
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Description

    This dataset contains PDF-to-text conversions of scientific research articles, prepared for the task of data citation mining. The goal is to identify references to research datasets within full-text scientific papers and classify them as Primary (data generated in the study) or Secondary (data reused from external sources).

    The PDF articles were processed using MinerU, which converts scientific PDFs into structured machine-readable formats (JSON, Markdown, images). This ensures participants can access both the raw text and layout information needed for fine-grained information extraction.

    Files and Structure

    Each paper directory contains the following files:

    • *_origin.pdf: The original PDF file of the scientific article.

    • *_content_list.json: Structured extraction of the PDF content, where each object represents a text or figure element with metadata. Example entry:

      {
       "type": "text",
       "text": "10.1002/2017JC013030",
       "text_level": 1,
       "page_idx": 0
      }
      
    • full.md: The complete article content in Markdown format (linearized for easier reading).

    • images/: Folder containing figures and extracted images from the article.

    • layout.json: Page layout metadata, including positions of text blocks and images.

    Data Mining Task

    The aim is to detect dataset references in the article text and classify them:

    Each dataset mention must be labeled as:

    • Primary: Data generated by the paper (new experiments, field observations, sequencing runs, etc.).
    • Secondary: Data reused from external repositories or prior studies.

    Training and Test Splits

    • train/ → Articles with gold-standard labels (train_labels.csv).
    • test/ → Articles without labels, used for evaluation.
    • train_labels.csv → Ground truth with:

      • article_id: Research paper DOI.
      • dataset_id: Extracted dataset identifier.
      • type: Citation type (Primary / Secondary).
    • sample_submission.csv → Example submission format.

    Example

    Paper: https://doi.org/10.1098/rspb.2016.1151
    Data: https://doi.org/10.5061/dryad.6m3n9
    In-text span: "The data we used in this publication can be accessed from Dryad at doi:10.5061/dryad.6m3n9."
    Citation type: Primary

    This dataset enables participants to develop and test NLP systems for the following (a minimal extraction sketch follows the list):

    • Information extraction (locating dataset mentions).
    • Identifier normalization (mapping mentions to persistent IDs).
    • Citation classification (distinguishing Primary vs Secondary data usage).
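    A minimal sketch of the extraction step (not a competition baseline): scan a paper's full.md for DOI-shaped identifiers like the Dryad DOI in the example above. The directory name below is illustrative:

    import re
    from pathlib import Path

    # Matches DOI-shaped identifiers, e.g. 10.5061/dryad.6m3n9.
    DOI_RE = re.compile(r"10\.\d{4,9}/[^\s\"'<>]+")

    def find_doi_mentions(paper_dir: str) -> list[str]:
        text = Path(paper_dir, "full.md").read_text(encoding="utf-8")
        return sorted({m.group(0).rstrip(".,;)") for m in DOI_RE.finditer(text)})

    print(find_doi_mentions("train/10.1098_rspb.2016.1151"))  # illustrative path
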
  6. Zenodo Code Images

    • kaggle.com
    zip
    Updated Jun 18, 2018
    Cite
    Stanford Research Computing Center (2018). Zenodo Code Images [Dataset]. https://www.kaggle.com/datasets/stanfordcompute/code-images
    Explore at:
    zip (0 bytes). Available download formats.
    Dataset updated
    Jun 18, 2018
    Dataset authored and provided by
    Stanford Research Computing Center
    Description

    Code Images


    Context

    This is a subset of the Zenodo-ML Dinosaur Dataset [Github] that has been converted to small png files and organized in folders by language, so you can jump right into using machine learning methods that assume image input.

    Content

    Included are .tar.gz files, each named after a file extension; when extracted, each produces a folder of the same name.

     tree -L 1
    .
    ├── c
    ├── cc
    ├── cpp
    ├── cs
    ├── css
    ├── csv
    ├── cxx
    ├── data
    ├── f90
    ├── go
    ├── html
    ├── java
    ├── js
    ├── json
    ├── m
    ├── map
    ├── md
    ├── txt
    └── xml
    

    And we can peep inside a (somewhat smaller) member of the set to see that the subfolders are zenodo identifiers. A zenodo identifier corresponds to a single Github repository, so the png files produced are chunks of code of that extension type from a particular repository.

    $ tree map -L 1
    map
    ├── 1001104
    ├── 1001659
    ├── 1001793
    ├── 1008839
    ├── 1009700
    ├── 1033697
    ├── 1034342
    ...
    ├── 836482
    ├── 838329
    ├── 838961
    ├── 840877
    ├── 840881
    ├── 844050
    ├── 845960
    ├── 848163
    ├── 888395
    ├── 891478
    └── 893858
    
    154 directories, 0 files
    

    Within each folder (zenodo id) the files are prefixed by the zenodo id, followed by the index into the original image set array that is provided with the full dinosaur dataset archive.

    $ tree m/891531/ -L 1
    m/891531/
    ├── 891531_0.png
    ├── 891531_10.png
    ├── 891531_11.png
    ├── 891531_12.png
    ├── 891531_13.png
    ├── 891531_14.png
    ├── 891531_15.png
    ├── 891531_16.png
    ├── 891531_17.png
    ├── 891531_18.png
    ├── 891531_19.png
    ├── 891531_1.png
    ├── 891531_20.png
    ├── 891531_21.png
    ├── 891531_22.png
    ├── 891531_23.png
    ├── 891531_24.png
    ├── 891531_25.png
    ├── 891531_26.png
    ├── 891531_27.png
    ├── 891531_28.png
    ├── 891531_29.png
    ├── 891531_2.png
    ├── 891531_30.png
    ├── 891531_3.png
    ├── 891531_4.png
    ├── 891531_5.png
    ├── 891531_6.png
    ├── 891531_7.png
    ├── 891531_8.png
    └── 891531_9.png
    
    0 directories, 31 files
    

    So what's the difference?

    The difference is that these files are organized by extension type, and provided as actual png images. The original data is provided as numpy arrays, and is organized by zenodo ID. Both are useful for different things - this particular version is cool because we can actually see what a code image looks like.

    How many images total?

    We can count the number of total images:

    find . -type f -name "*.png" | wc -l
    3,026,993
    

    Dataset Curation

    The script to create the dataset is provided here. Essentially, we start with the top extensions as identified by this work (excluding actual image files) and then write each 80x80 image to an actual png image, organizing by extension then zenodo id (as shown above).

    Saving the Image

    I tested a few methods to write the single channel 80x80 data frames as png images, and wound up liking cv2's imwrite function because it would save and then load the exact same content.

    import cv2
    # Write the single-channel 80x80 array to disk; imwrite round-trips the exact values.
    cv2.imwrite(image_path, image)
    

    Loading the Image

    Given the above, it's pretty easy to load an image! Here is an example using imageio, and then the older scipy approach (which now emits a deprecation message).

    image_path = '/tmp/data1/data/csv/1009185/1009185_0.png'
    from imageio import imread
    
    image = imread(image_path)
    array([[116, 105, 109, ..., 32, 32, 32],
        [ 48, 44, 48, ..., 32, 32, 32],
        [ 48, 46, 49, ..., 32, 32, 32],
        ..., 
        [ 32, 32, 32, ..., 32, 32, 32],
        [ 32, 32, 32, ..., 32, 32, 32],
        [ 32, 32, 32, ..., 32, 32, 32]], dtype=uint8)
    
    
    image.shape
    (80,80)
    
    
    # Deprecated
    from scipy import misc
    misc.imread(image_path)
    
    Image([[116, 105, 109, ..., 32, 32, 32],
        [ 48, 44, 48, ..., 32, 32, 32],
        [ 48, 46, 49, ..., 32, 32, 32],
        ..., 
        [ 32, 32, 32, ..., 32, 32, 32],
        [ 32, 32, 32, ..., 32, 32, 32],
        [ 32, 32, 32, ..., 32, 32, 32]], dtype=uint8)
    

    Remember that the values in the data are characters that have been converted to ordinal. Can you guess what 32 is?

    ord(' ')
    32
    
    # And thus if you wanted to convert it back...
    chr(32)
    ' '
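    Going one step further, a whole image can be converted back to its text the same way. A minimal sketch using the example path above:

    from imageio import imread

    image = imread('/tmp/data1/data/csv/1009185/1009185_0.png')
    # Each pixel value is the ordinal of a character; each row is one line of code.
    lines = [''.join(chr(v) for v in row).rstrip() for row in image]
    print('\n'.join(lines))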
    

    So how t...

  7. CODE dataset

    • figshare.scilifelab.se
    • researchdata.se
    • +1 more
    Updated Feb 27, 2025
    Cite
    Antonio H. Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Derick M. Oliveira; Paulo R. Gomes; Jéssica A. Canazart; Milton P. Ferreira; Carl R. Andersson; Peter W. Macfarlane; Wagner Meira Jr.; Thomas B. Schön; Antonio Luiz P. Ribeiro (2025). CODE dataset [Dataset]. http://doi.org/10.17044/scilifelab.15169716.v1
    Explore at:
    Dataset updated
    Feb 27, 2025
    Dataset provided by
    Uppsala University & UFMG
    Authors
    Antonio H. Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Derick M. Oliveira; Paulo R. Gomes; Jéssica A. Canazart; Milton P. Ferreira; Carl R. Andersson; Peter W. Macfarlane; Wagner Meira Jr.; Thomas B. Schön; Antonio Luiz P. Ribeiro
    License

    https://www.scilifelab.se/data/restricted-access/

    Description

    Dataset with annotated 12-lead ECG records. The exams were taken in 811 counties in the state of Minas Gerais/Brazil by the Telehealth Network of Minas Gerais (TNMG) between 2010 and 2016, and organized by the CODE (Clinical Outcomes in Digital Electrocardiography) group.

    Requesting access

    Researchers affiliated with educational or research institutions may request access to this dataset. Requests will be analyzed on an individual basis and should contain: name of PI and host organisation; contact details (including your name and email); and the scientific purpose of the data access request. If approved, a data user agreement will be forwarded to the researcher who made the request (through the email that was provided). After the agreement has been signed (by the researcher or by the research institution), access to the dataset will be granted.

    Openly available subset

    A subset of this dataset (with 15% of the patients) is openly available. See: "CODE-15%: a large scale annotated dataset of 12-lead ECGs", https://doi.org/10.5281/zenodo.4916206.

    Content

    The folder contains: a column-separated file with basic patient attributes, and the ECG waveforms in the wfdb format.

    Additional references

    The dataset is described in the paper "Automatic diagnosis of the 12-lead ECG using a deep neural network", https://www.nature.com/articles/s41467-020-15432-4. Related publications also using this dataset are:

    - [1] G. Paixao et al., "Validation of a Deep Neural Network Electrocardiographic-Age as a Mortality Predictor: The CODE Study," Circulation, vol. 142, no. Suppl_3, pp. A16883–A16883, Nov. 2020, doi: 10.1161/circ.142.suppl_3.16883.
    - [2] A. L. P. Ribeiro et al., "Tele-electrocardiography and big data: The CODE (Clinical Outcomes in Digital Electrocardiography) study," Journal of Electrocardiology, Sep. 2019, doi: 10/gf7pwg.
    - [3] D. M. Oliveira, A. H. Ribeiro, J. A. O. Pedrosa, G. M. M. Paixao, A. L. P. Ribeiro, and W. Meira Jr, "Explaining end-to-end ECG automated diagnosis using contextual features," in Machine Learning and Knowledge Discovery in Databases (ECML-PKDD), Ghent, Belgium, Sep. 2020, vol. 12461, pp. 204–219, doi: 10.1007/978-3-030-67670-4_13.
    - [4] D. M. Oliveira, A. H. Ribeiro, J. A. O. Pedrosa, G. M. M. Paixao, A. L. Ribeiro, and W. Meira Jr, "Explaining black-box automated electrocardiogram classification to cardiologists," in 2020 Computing in Cardiology (CinC), 2020, vol. 47, doi: 10.22489/CinC.2020.452.
    - [5] G. M. M. Paixão et al., "Evaluation of mortality in bundle branch block patients from an electronic cohort: Clinical Outcomes in Digital Electrocardiography (CODE) study," Journal of Electrocardiology, Sep. 2019, doi: 10/dcgk.
    - [6] G. M. M. Paixão et al., "Evaluation of Mortality in Atrial Fibrillation: Clinical Outcomes in Digital Electrocardiography (CODE) Study," Global Heart, vol. 15, no. 1, p. 48, Jul. 2020, doi: 10.5334/gh.772.
    - [7] G. M. M. Paixão et al., "Electrocardiographic Predictors of Mortality: Data from a Primary Care Tele-Electrocardiography Cohort of Brazilian Patients," Hearts, vol. 2, no. 4, Art. no. 4, Dec. 2021, doi: 10.3390/hearts2040035.
    - [8] G. M. Paixão et al., "ECG-Age From Artificial Intelligence: A New Predictor for Mortality? The CODE (Clinical Outcomes in Digital Electrocardiography) Study," Journal of the American College of Cardiology, vol. 75, no. 11 Supplement 1, p. 3672, 2020, doi: 10.1016/S0735-1097(20)34299-6.
    - [9] E. M. Lima et al., "Deep neural network estimated electrocardiographic-age as a mortality predictor," Nature Communications, vol. 12, 2021, doi: 10.1038/s41467-021-25351-7.
    - [10] W. Meira Jr, A. L. P. Ribeiro, D. M. Oliveira, and A. H. Ribeiro, "Contextualized Interpretable Machine Learning for Medical Diagnosis," Communications of the ACM, 2020, doi: 10.1145/3416965.
    - [11] A. H. Ribeiro et al., "Automatic diagnosis of the 12-lead ECG using a deep neural network," Nature Communications, vol. 11, no. 1, p. 1760, 2020, doi: 10/drkd.
    - [12] A. H. Ribeiro et al., "Automatic Diagnosis of Short-Duration 12-Lead ECG using a Deep Convolutional Network," Machine Learning for Health (ML4H) Workshop at NeurIPS, 2018.
    - [13] A. H. Ribeiro et al., "Automatic 12-lead ECG classification using a convolutional network ensemble," 2020, doi: 10.22489/CinC.2020.130.
    - [14] V. Sangha et al., "Automated Multilabel Diagnosis on Electrocardiographic Images and Signals," medRxiv, Sep. 2021, doi: 10.1101/2021.09.22.21263926.
    - [15] S. Biton et al., "Atrial fibrillation risk prediction from the 12-lead ECG using digital biomarkers and deep representation learning," European Heart Journal - Digital Health, 2021, doi: 10.1093/ehjdh/ztab071.

    Code

    The following github repositories perform analyses that use this dataset:

    - https://github.com/antonior92/automatic-ecg-diagnosis
    - https://github.com/antonior92/ecg-age-prediction

    Related datasets

    - CODE-test: An annotated 12-lead ECG dataset (https://doi.org/10.5281/zenodo.3765780)
    - CODE-15%: a large scale annotated dataset of 12-lead ECGs (https://doi.org/10.5281/zenodo.4916206)
    - Sami-Trop: 12-lead ECG traces with age and mortality annotations (https://doi.org/10.5281/zenodo.4905618)

    Ethics declarations

    The CODE Study was approved by the Research Ethics Committee of the Universidade Federal de Minas Gerais, protocol 49368496317.7.0000.5149.
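    As a practical note on the wfdb content above, a hedged sketch of reading one record with the wfdb Python package (the record path is a placeholder, not a real file name from this dataset):

    import wfdb

    record = wfdb.rdrecord("records/some_exam_id")  # placeholder record name
    print(record.fs, record.sig_name)  # sampling frequency and the 12 lead names
    signal = record.p_signal           # (n_samples, 12) array in physical units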

  8. CODEBRIM: COncrete DEfect BRidge IMage Dataset

    • zenodo.org
    • data-staging.niaid.nih.gov
    • +2 more
    bin, zip
    Updated Jan 24, 2020
    Cite
    Martin Mundt; Sagnik Majumder; Sreenivas Murali; Panagiotis Panetsos; Visvanathan Ramesh (2020). CODEBRIM: COncrete DEfect BRidge IMage Dataset [Dataset]. http://doi.org/10.5281/zenodo.2620293
    Explore at:
    zip, bin. Available download formats.
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Martin Mundt; Sagnik Majumder; Sreenivas Murali; Panagiotis Panetsos; Visvanathan Ramesh
    Description

    CODEBRIM: COncrete DEfect BRidge IMage Dataset for multi-target multi-class concrete defect classification in computer vision and machine learning.

    Dataset as presented and detailed in our CVPR 2019 publication: http://openaccess.thecvf.com/content_CVPR_2019/html/Mundt_Meta-Learning_Convolutional_Neural_Architectures_for_Multi-Target_Concrete_Defect_Classification_With_CVPR_2019_paper.html or https://arxiv.org/abs/1904.08486. If you make use of the dataset please cite it as follows:

    "Martin Mundt, Sagnik Majumder, Sreenivas Murali, Panagiotis Panetsos, Visvanathan Ramesh. Meta-learning Convolutional Neural Architectures for Multi-target Concrete Defect Classification with the COncrete DEfect BRidge IMage Dataset. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019"

    We offer a supplementary GitHub repository with code to reproduce the paper and data loaders: https://github.com/ccc-frankfurt/meta-learning-CODEBRIM

    For ease of use we provide the dataset in multiple different versions.

    Files contained:
    * CODEBRIM_original_images: contains the original full-resolution images and bounding box annotations
    * CODEBRIM_cropped_dataset: contains the extracted crops/patches with corresponding class labels from the bounding boxes
    * CODEBRIM_classification_dataset: contains the cropped patches with corresponding class labels split into training, validation and test sets for machine learning
    * CODEBRIM_classification_balanced_dataset: similar to "CODEBRIM_classification_dataset" but with the exact replication of training images to balance the dataset in order to reproduce results obtained in the paper.

  9. Data from: OPTIMAP: A Dataset for Open Public Transport Infrastructure and Mobility Accessibility Profiles

    • zenodo.org
    • data.niaid.nih.gov
    pdf
    Updated Feb 1, 2025
    Cite
    Maximilian T. Fischer; Daniel Fürst; Yannick Metz; Manuel Schmidt; Julius Rauscher; Daniel A. Keim (2025). OPTIMAP: A Dataset for Open Public Transport Infrastructure and Mobility Accessibility Profiles [Dataset]. http://doi.org/10.5281/zenodo.14772647
    Explore at:
    pdf. Available download formats.
    Dataset updated
    Feb 1, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maximilian T. Fischer; Daniel Fürst; Yannick Metz; Manuel Schmidt; Julius Rauscher; Daniel A. Keim
    License

    Data licence Germany – Attribution – Version 2.0: https://www.govdata.de/dl-de/by-2-0
    License information was derived automatically

    Description

    Introduction

    This dataset provides a comprehensive assessment of public transport connectivity across Germany by analyzing both walking distances to the nearest public transport stops and the quality of public transport connections for daily usage scenarios, at housing-level granularity on a country-wide scale. The data was generated through a novel approach that integrates multiple open data sources, simulation models, and visual analytics techniques, enabling researchers, policymakers, and urban planners to identify gaps and opportunities for transit network improvements.

    Why does it matter?

    Efficient and accessible public transportation is a critical component of sustainable urban development. However, many transit networks struggle to adequately serve diverse populations due to infrastructural, financial, and urban planning limitations. Traditional transit planning often relies on aggregated statistics, expert opinions, or limited surveys, making it difficult to assess transport accessibility at an individual household level. This dataset provides a data-driven and reproducible methodology for unbiased country-wide comparisons.

    Find more information at https://mobility.dbvis.de.

    Key Facts, Download, Citation

    Title: OPTIMAP: A Dataset for Open Public Transport Infrastructure and Mobility Accessibility Profiles
    Acronym: OPTIMAP
    Download: https://mobility.dbvis.de/data-results/OPTIMAP_v2025-02-01.parquet (478MB, parquet)
    License: Datenlizenz Deutschland - Namensnennung - Version 2.0 (dl-de-by/2.0)

    Please cite the dataset as:

    Maximilian T. Fischer, Daniel Fürst, Yannick Metz, Manuel Schmidt, Julius Rauscher, and Daniel A. Keim. OPTIMAP: A Dataset for Open Public Transport Infrastructure and Mobility Accessibility Profiles. Zenodo, 2025. doi: 10.5281/zenodo.14772646.

    or, when using Bibtex

    @dataset{MobilityProfiles.DatasetGermany.2025,
      author    = {Fischer, Maximilian T. and Fürst, Daniel and Metz, Yannick and Schmidt, Manuel and Rauscher, Julius and Keim, Daniel A.},
      title     = {OPTIMAP: A Dataset for Open Public Transport Infrastructure and Mobility Accessibility Profiles},
      year      = 2025,
      publisher = {Zenodo},
      doi       = {10.5281/zenodo.14772646}
    }

    Dataset Description

    The dataset, in the Parquet format, includes detailed accessibility measures for public transport at a fine-grained, housing-level resolution. It consists of four columns (a minimal loading sketch follows the list):

    • lat, lng (float32): GPS coordinates (EPSG:4326) of each house in Germany, expensively compiled from the house coordinates (HK-DE) data provided by the 16 federal states under the EU INSPIRE regulations.
    • MinDistanceWalking (int32): An approximate walking distance (in meters) to the nearest public transport stop from each registered building in Germany.
    • scores_OVERALL (float32): A simulated, demographic- and scenario-weighted measure of public transport quality for daily usage, considering travel times, frequency, and coverage across various daily scenarios (e.g., commuting, shopping, medical visits). The results are represented in an artificial time unit to allow comparative analysis across locations.
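    A minimal loading sketch, using the column names documented above:

    import pandas as pd

    df = pd.read_parquet("OPTIMAP_v2025-02-01.parquet")
    print(df.dtypes)  # lat/lng (float32), MinDistanceWalking (int32), scores_OVERALL (float32)
    # Share of houses within 300 m walking distance of a stop (threshold is just an example):
    print((df["MinDistanceWalking"] <= 300).mean())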

    Methodology

    The dataset was generated using a combination of open geospatial data and advanced transport simulation techniques.

    • Data Sources: Public transit information from the German national access point (DELFI NeTEx), housing geolocation data from various state authorities, and routing information from OpenStreetMap.
    • Walking Distance Calculation: The shortest path to the nearest transit stop was computed using the Dijkstra algorithm on a graph network of publicly available pathways sourced from OSM, considering the ten aerial-nearest public transport stops.
    • Public Transport Quality Estimation: The dataset incorporates a scenario-based simulation model, analyzing weight-averaged travel times and connection frequency to typical daily POIs such as the individually nearest train stations, kindergartens, schools, institutions of higher education, fitness centers, cinemas, places of worship, supermarkets, shopping malls, restaurants, doctors, parks, and cultural institutions. It includes walking distances to the start and from the destination public transport stops as well as the averaged travel and waiting times on the shortest route calculated via a modified Dijkstra algorithm. The results are aggregated using a demographically- and scenario-weighted metric to ensure comparability. The value is in the unit of time, although it should not be interpreted directly as real minutes.
    • Visualization and Validation: A WebGL-based interactive tool and static precomputed maps were developed to allow users to interactively explore transport accessibility metrics dynamically, available at https://mobility.dbvis.de.

    Potential Applications

    The dataset enables multiple use cases across research, policy, and urban planning:

    • Public Accessibility Studies: Provides insights into transport equity by evaluating mobility gaps affecting different demographic groups, different regional areas, and comparing county and state efforts in improving public transport quality.
    • Urban Planning and Transport Policy: Supports data-driven decision-making for optimizing transit networks, adjusting service schedules, or identifying underserved areas.
    • Smart City Development: Assists in integrating mobility analytics into broader smart city initiatives for efficient resource allocation and sustainability planning.
    • Academic Research: Facilitates studies in transportation engineering, urban geography, and mobility behavior analysis.

    Conclusion

    By offering high-resolution public transport accessibility data at housing-level granularity, this dataset contributes to a more transparent and objective understanding of urban mobility challenges. The integration of simulation models, demographic considerations, and scalable analytics provides a novel approach to evaluating and improving public transit systems. Researchers, city officials, and policymakers are encouraged to leverage this dataset to enhance transport infrastructure planning and accessibility.

    This dataset contains both the approximate walking distances in meters and a weighted overall quality score in an artificial time unit for each individual house in Germany. More advanced versions are currently not publicly available. This base dataset is publicly available and adheres to open data licensing principles, enabling its reuse for scientific and policy-oriented studies.

    Source Data Licenses

    While not part of this dataset, the scientific simulation used to create the results leverages public transit information via the National Access Point (NAP) DELFI as NeTEx, provided via GTFS feeds of Germany (CC BY 4.0).

    Also, routing information used during the processing was based on Open Street Map contributors (CC BY 4.0).

    Primarily, this dataset contains original and slightly processed housing locations (lat, lng) that were made available as part of the EU INSPIRE regulations, based on Directive (EU) 2019/1024 (of the European Parliament and of the Council of 20 June 2019 on open data and the re-use of public sector information (recast)).

    In Germany, the respective data is provided individually by the 16 federal states, with the following required attributions and license indications:

    • BB: EU INSPIRE / © GeoBasis-DE/LGB, dl-de-by/2.0 (data modified)
    • BE: EU INSPIRE / © Geoportal Berlin / Hauskoordinaten, dl-de-by/2.0 (data modified)
    • BW: EU INSPIRE / © LGL, www.lgl-bw.de,

  10. Programming Language Ecosystem Project TU Wien

    • test.researchdata.tuwien.at
    csv, text/markdown
    Updated Jun 25, 2024
    Cite
    Valentin Futterer (2024). Programming Language Ecosystem Project TU Wien [Dataset]. http://doi.org/10.70124/gnbse-ts649
    Explore at:
    text/markdown, csv. Available download formats.
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    TU Wien
    Authors
    Valentin Futterer
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Time period covered
    Dec 12, 2023
    Area covered
    Vienna
    Description

    About Dataset

    This dataset was created during the Programming Language Ecosystem project from TU Wien using the code inside the repository https://github.com/ValentinFutterer/UsageOfProgramminglanguages2011-2023?tab=readme-ov-file.

    The centerpiece of this repository is usage_of_programming_languages_2011-2023.csv. This csv file shows the popularity of programming languages over the last 12 years in yearly increments. The repository also contains graphs created with the dataset. To get an accurate estimate of the popularity of programming languages, this dataset was created using 3 vastly different sources.

    About Data collection methodology

    The dataset was created using the github repository above. As input data, three public datasets were used.

    github_metadata

    Taken from https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars/ by Peter Elmers. It is licensed under CC BY 4.0 https://creativecommons.org/licenses/by/4.0/. It shows metadata information (no code) of all github repositories with more than 5 stars.

    PYPL_survey_2004-2023

    Taken from https://github.com/pypl/pypl.github.io/tree/master, put online by the user pcarbonn. It is licensed under CC BY 3.0 https://creativecommons.org/licenses/by/3.0/. It shows from 2004 to 2023 for each month the share of programming related google searches per language.

    stack_overflow_developer_survey

    Taken from https://insights.stackoverflow.com/survey. It is licensed under Open Data Commons Open Database License (ODbL) v1.0 https://opendatacommons.org/licenses/odbl/1-0/. It shows from 2011 to 2023 the results of the yearly stackoverflow developer survey.

    All these datasets were downloaded on 12.12.2023. The datasets are all in the github repository above.

    Description of the data

    The dataset contains a column for the year and then many columns for the different languages, denoting their usage in percent. Additionally, vertical barcharts and piecharts for each year plus a line graph for each language over the whole timespan as png's are provided.
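    A minimal reading sketch (the filename comes from the description above; the exact column names, e.g. "year" and "Python", are assumptions to check against the CSV header):

    import pandas as pd

    df = pd.read_csv("usage_of_programming_languages_2011-2023.csv")
    # Yearly share of Python in percent, assuming these column names.
    print(df.set_index("year")["Python"])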

    The languages that are going to be considered for the project can be seen here:

    - Python
    - C
    - C++
    - Java
    - C#
    - JavaScript
    - PHP
    - SQL
    - Assembly
    - Scratch
    - Fortran
    - Go
    - Kotlin
    - Delphi
    - Swift
    - Rust
    - Ruby
    - R
    - COBOL
    - F#
    - Perl
    - TypeScript
    - Haskell
    - Scala

    License

    This project is licensed under the Open Data Commons Open Database License (ODbL) v1.0: https://opendatacommons.org/licenses/odbl/1-0/.

    TLDR: You are free to share, adapt, and create derivative works from this dataset as long as you attribute me, keep the database open (if you redistribute it), and continue to share alike any adapted database under the ODbL.

    Acknowledgments

    Thanks go out to

    - stackoverflow https://insights.stackoverflow.com/survey for providing the data from the yearly stackoverflow developer survey.

    - the PYPL survey, https://github.com/pypl/pypl.github.io/tree/master for providing google search data.

    - Peter Elmers, for crawling metadata on github repositories and providing the data https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars/.

  11. civil_comments

    • tensorflow.org
    • huggingface.co
    Updated Feb 28, 2023
    Cite
    (2023). civil_comments [Dataset]. https://www.tensorflow.org/datasets/catalog/civil_comments
    Explore at:
    Dataset updated
    Feb 28, 2023
    Description

    This version of the CivilComments Dataset provides access to the primary seven labels that were annotated by crowd workers; the toxicity and other tags are values between 0 and 1 indicating the fraction of annotators that assigned these attributes to the comment text.

    The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but only consists of the subset of the data with them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.

    The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 - 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text, some associated metadata such as article IDs, publication IDs, timestamps and commenter-generated "civility" labels, but does not include user ids. Jigsaw extended this dataset by adding additional labels for toxicity, identity mentions, as well as covert offensiveness. This data set is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.

    For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('civil_comments', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  12. Best Books Ever Dataset

    • zenodo.org
    csv
    Updated Nov 10, 2020
    Cite
    Lorena Casanova Lozano; Sergio Costa Planells (2020). Best Books Ever Dataset [Dataset]. http://doi.org/10.5281/zenodo.4265096
    Explore at:
    csv. Available download formats.
    Dataset updated
    Nov 10, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Lorena Casanova Lozano; Sergio Costa Planells
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The dataset was collected as part of Prac1 of the subject Typology and Data Life Cycle of the Master's Degree in Data Science of the Universitat Oberta de Catalunya (UOC).

    The dataset contains 25 variables and 52478 records corresponding to books on the GoodReads Best Books Ever list (the largest list on the site).

    The original code used to retrieve the dataset can be found in the github repository: github.com/scostap/goodreads_bbe_dataset

    The data was retrieved in two sets, the first 30000 books and then the remaining 22478. Dates were not parsed and reformatted on the second chunk, so publishDate and firstPublishDate are represented in a mm/dd/yyyy format for the first 30000 records and as Month Day Year for the rest.
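    A hedged parsing sketch that handles both date formats (the CSV filename below is a placeholder):

    import pandas as pd

    df = pd.read_csv("best_books_ever.csv")  # placeholder filename

    def parse_date(value):
        # First 30000 records: mm/dd/yyyy; the rest: e.g. "September 01 2001".
        for fmt in ("%m/%d/%Y", "%m/%d/%y", "%B %d %Y"):
            try:
                return pd.to_datetime(value, format=fmt)
            except (ValueError, TypeError):
                continue
        return pd.NaT

    df["publishDate"] = df["publishDate"].map(parse_date)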

    Book cover images can be optionally downloaded from the url in the 'coverImg' field. Python code for doing so and an example can be found on the github repo.

    The 25 fields of the dataset are:

    | Attributes | Definition | Completeness (%) |
    | ------------- | ------------- | ------------- | 
    | bookId | Book Identifier as in goodreads.com | 100 |
    | title | Book title | 100 |
    | series | Series Name | 45 |
    | author | Book's Author | 100 |
    | rating | Global goodreads rating | 100 |
    | description | Book's description | 97 |
    | language | Book's language | 93 |
    | isbn | Book's ISBN | 92 |
    | genres | Book's genres | 91 |
    | characters | Main characters | 26 |
    | bookFormat | Type of binding | 97 |
    | edition | Type of edition (ex. Anniversary Edition) | 9 |
    | pages | Number of pages | 96 |
    | publisher | Editorial | 93 |
    | publishDate | publication date | 98 |
    | firstPublishDate | Publication date of first edition | 59 |
    | awards | List of awards | 20 |
    | numRatings | Number of total ratings | 100 |
    | ratingsByStars | Number of ratings by stars | 97 |
    | likedPercent | Derived field, percent of ratings over 2 stars (as in GoodReads) | 99 |
    | setting | Story setting | 22 |
    | coverImg | URL to cover image | 99 |
    | bbeScore | Score in Best Books Ever list | 100 |
    | bbeVotes | Number of votes in Best Books Ever list | 100 |
    | price | Book's price (extracted from Iberlibro) | 73 |

  13. MSP430FR5969 Basic Block Worst Case Energy Consumption (WCEC) and Worst Case Execution Time (WCET) dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, xz
    Updated Dec 20, 2024
    Cite
    Hugo Reymond; Hector Chabot; Abderaouf Nassim Amalou; Isabelle Puaut (2024). MSP430FR5969 Basic Block Worst Case Energy Consumption (WCEC) and Worst Case Execution Time (WCET) dataset [Dataset]. http://doi.org/10.5281/zenodo.11066623
    Explore at:
    csv, bin, xz. Available download formats.
    Dataset updated
    Dec 20, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hugo Reymond; Hector Chabot; Abderaouf Nassim Amalou; Isabelle Puaut
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains around 30,000 basic blocks whose energy consumption and execution time have been measured in isolation on the MSP430FR5969 microcontroller, at 1 MHz. Basic blocks were executed in a worst-case scenario regarding the MSP430 FRAM cache and CPU pipeline. The dataset creation process is described thoroughly in [1].

    Folder structure

    This dataset is composed of the following files:

    • basic_blocks.tar.xz contains all basic blocks (BB) used in the dataset, in a custom JSON format,
    • data.csv/data.xlsx contains the measured energy consumption and execution time for each basic block

    We first detail how the basic_blocks.tar.xz archive is organized, and then present the CSV/XLSX spreadsheet format.

    Basic Blocks

    We extracted the basic blocks from a subset of programs of the AnghaBench benchmark suite [2]. The basic_blocks.tar.xz archive consists of the extracted basic blocks organized as JSON files. Each JSON file corresponds to a C source file from AnghaBench and is given a unique identifier. An example JSON (137.json) is shown here:

    {
      "extr_pfctl_altq.c_pfctl_altq_init": [
         # Basic block 1
        [
          # Instruction 1 of BB1
          [
            "MOV.W",
            "#queue_map",
            "R13"
          ],
          # Instruction 2 of BB1
          [
            "MOV.B",
            "#0",
            "R14"
          ],
          # Instruction 3 of BB1
          [
            "CALL",
            "#hcreate_r",
            null
          ]
        ],
        # Basic block 2
        [
          ....
        ]
      ]
    }

    The JSON contains a dict with only one key, pointing to an array of basic blocks. This key is the name of the original C source file in AnghaBench from which the basic blocks were extracted (here extr_pfctl_altq.c_pfctl_altq_init.c). The array contains several basic blocks, which are represented as arrays of instructions, which are themselves represented as arrays [OPCODE, OPERAND1, OPERAND2].

    Then, each basic block can be identified uniquely using two ids: its file id and its offset in the file. In our example, basic block 1 can be identified by the JSON file id (137) and its offset in the file (0). Its ID is 137_0. This ID is used to map a basic block to its energy consumption/execution time in the data.csv/data.xlsx spreadsheet.

    Energy Consumption and Execution Time

    Energy consumption and execution time data are stored in the data.csv file. Here is an extract of the csv file corresponding to basic block 137_0. The spreadsheet format is described below.

    bb_id;nb_inst;max_energy;max_time;avg_time;avg_energy;energy_per_inst;nb_samples;unroll_factor
    137_0;3;8.77;7.08;7.04;8.21;2.92;40;50

    Spreadsheet format:

    • bb_id: the unique identifier of a basic block (cf. Basic Blocks)
    • nb_inst: the number of instructions in the basic block
    • max_energy: the maximum energy consumption (in nJ) measured during the experiment
    • max_time: the maximum execution time (in us) measured during the experiment
    • avg_time: the average execution time (in us) measured during the experiment
    • avg_energy: the average energy consumption (in nJ) measured during the experiment
    • energy_per_inst: the average energy consumption per instruction (corresponds to avg_energy/nb_inst)
    • nb_samples: how many times the basic block's energy consumption/execution time has been measured
    • unroll_factor: how many times the basic block was unrolled (cf. Basic Block Unrolling)

    Basic Block Unrolling

    To measure the energy consumption and execution time on the MSP430, we need to handle the scale difference between the measurement tool and the basic block execution time. This is achieved by duplicating the basic block multiple times while making sure to keep the worst-case memory layout, as explained in the paper. The number of times the basic block has been duplicated is called the unroll_factor.

    Values of energy and time are always given per basic block, so they have already been divided by the unroll factor.
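    A hedged sketch that joins the two parts of the dataset via these ids (the extraction path is an assumption, and the real JSON files are assumed to be comment-free, unlike the annotated example above):

    import json
    import pandas as pd

    file_id = 137
    with open(f"basic_blocks/{file_id}.json") as f:    # path inside the archive is assumed
        (source_name, blocks), = json.load(f).items()  # single key: original C file name

    data = pd.read_csv("data.csv", sep=";")
    for offset, block in enumerate(blocks):
        bb_id = f"{file_id}_{offset}"  # <file id>_<offset>, e.g. 137_0
        row = data[data["bb_id"] == bb_id]
        if not row.empty:
            print(bb_id, len(block), "instructions,",
                  row["max_energy"].iloc[0], "nJ worst-case energy")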

    Dataset description

    Features

    The selected features after PCA analysis for both energy and time model are listed here: MOV.W_Rn_Rn, MOV.W_X(Rn)_X(Rn), CALL, MOV.B_#N_Rn, ADD.W_Rn_Rn, MOV.W_@Rn_Rn, MOV.W_X(Rn)_Rn, ADD.W_#N_Rn, PUSHM.W_#N_Rn, MOV.W_X(Rn)_ADDR, CMP.W_#N_Rn, MOV.W_&ADDR_X(Rn), MOV.W_Rn_X(Rn), BIS.W_Rn_Rn, RLAM.W_#N_Rn, SUB.W_#N_Rn, MOV.W_&ADDR_Rn, MOV.W_#N_X(Rn), CMP.W_Rn_Rn, BIT.W_ADDR_Rn, MOV.W_@Rn_X(Rn), ADD.W_#N_X(Rn), MOV.W_#N_Rn, AND.W_Rn_Rn, MOV.W_Rn_ADDR, SUB.W_Rn_Rn, MOV.W_ADDR_Rn, MOV.W_X(Rn)_&ADDR, MOV.W_ADDR_ADDR, JMP, ADD_#N_Rn, BIS.W_Rn_X(Rn), SUB_Rn_Rn, MOV.W_ADDR_X(Rn), ADDC_#N_X(Rn), MOV.B_Rn_Rn, CMP.W_X(Rn)_X(Rn), ADD_Rn_Rn, nb_inst, INV.W_Rn_, NOP_, ADD.W_X(Rn)_X(Rn), ADD.W_Rn_X(Rn), MOV.B_@Rn_Rn, BIS.W_X(Rn)_X(Rn), MOV.B_#N_X(Rn), MOV.W_#N_ADDR, AND.W_#N_ADDR, SUBC_X(Rn)_X(Rn), BIS.W_#N_X(Rn), SUB.W_X(Rn)_X(Rn), AND.B_#N_Rn, ADD_X(Rn)_X(Rn), MOV.W_@Rn_ADDR, MOV.W_&ADDR_ADDR, ADDC_Rn_Rn, AND.W_#N_X(Rn), SUB_#N_Rn, RRUM.W_#N_Rn, AND_ADDR_Rn, CMP.W_X(Rn)_ADDR, MOV.B_#N_ADDR, ADD.W_#N_ADDR, CMP.B_#N_Rn, SXT_Rn_, XOR.W_Rn_Rn, CMP.W_@Rn_Rn, ADD.W_@Rn_Rn, ADD.W_X(Rn)_Rn, AND.W_Rn_X(Rn), CMP.B_Rn_Rn, AND.W_X(Rn)_X(Rn), BIC.W_#N_Rn, BIS.W_#N_Rn, AND.B_#N_X(Rn), MOV.B_X(Rn)_X(Rn), AND.W_@Rn_Rn, MOV.W_#N_&ADDR, BIS.W_Rn_ADDR, SUB.W_X(Rn)_Rn, SUB.W_Rn_X(Rn), SUB_X(Rn)_X(Rn), MOV.B_@Rn_X(Rn), CMP.W_@Rn_X(Rn), ADD.W_X(Rn)_ADDR, CMP.W_Rn_X(Rn), BIS.W_@Rn_X(Rn), CMP.B_X(Rn)_X(Rn), RRC.W_Rn_, MOV.W_@Rn_&ADDR, CMP.W_#N_X(Rn), ADDC_X(Rn)_Rn, CMP.W_X(Rn)_Rn, BIS.W_X(Rn)_Rn, SUB_X(Rn)_Rn, MOV.B_X(Rn)_Rn, MOV.W_ADDR_&ADDR, AND.W_#N_Rn, RLA.W_Rn_, INV.W_X(Rn)_, XOR.W_#N_Rn, SUB.W_Rn_ADDR, BIC.W_#N_X(Rn), MOV.B_X(Rn)_ADDR, ADD_#N_X(Rn), SUB_Rn_X(Rn), MOV.B_&ADDR_Rn, MOV.W_Rn_&ADDR, ADD_X(Rn)_Rn, AND.W_X(Rn)_Rn, PUSHM.A_#N_Rn, RRAM.W_#N_Rn, AND.W_@Rn_X(Rn), BIS.B_Rn_X(Rn), SUB.W_@Rn_Rn, CLRC_, CMP.W_#N_ADDR, XOR.W_Rn_X(Rn), MOV.B_Rn_ADDR, CMP.B_X(Rn)_Rn, BIS.B_Rn_Rn, BIS.W_X(Rn)_ADDR, CMP.B_#N_X(Rn), CMP.W_Rn_ADDR, XOR.W_X(Rn)_Rn, MOV.B_Rn_X(Rn), ADD.B_#N_Rn

    Code

    The trained machine learning model, tests, and local explanation code can be generated and found here: WORTEX Machine learning code

    Acknowledgment

    This work has received French government support granted to the Labex CominLabs excellence laboratory and managed by the National Research Agency in the "Investing for the Future" program under reference ANR-10-LABX-07-01.

    Licensing

    Copyright 2024 Hector Chabot
    Copyright 2024 Abderaouf Nassim Amalou
    Copyright 2024 Hugo Reymond
    Copyright 2024 Isabelle Puaut

    Licensed under the Creative Commons Attribution 4.0 International License

    References

    [1] Reymond, H., Amalou, A. N., Puaut, I. “WORTEX: Worst-Case Execution Time and Energy Estimation in Low-Power Microprocessors using Explainable ML” in 22nd International Workshop on Worst-Case Execution Time Analysis (WCET 2024).

    [2] Da Silva, Anderson Faustino, et al. “Anghabench: A suite with one million compilable C benchmarks for code-size reduction.” 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 2021.

  14. Racconnall Dataset

    • universe.roboflow.com
    zip
    Updated Mar 11, 2024
    Cite
    coding code (2024). Racconnall Dataset [Dataset]. https://universe.roboflow.com/coding-code-bzji8/racconnall
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 11, 2024
    Dataset authored and provided by
    coding code
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    F Bounding Boxes
    Description

    Racconnall

    Overview

    Racconnall is a dataset for object detection tasks - it contains F annotations for 1,726 images.

    Getting Started

    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

    License

    This dataset is available under the CC BY 4.0 license (https://creativecommons.org/licenses/by/4.0/).
    
  15. Results files for Land-free Bioenergy From Circular Agroecology -- A Diverse...

    • data.europa.eu
    unknown
    Updated Mar 16, 2024
    Cite
    Zenodo (2024). Results files for Land-free Bioenergy From Circular Agroecology -- A Diverse Option Space and Trade-offs [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-8246394?locale=hr
    Explore at:
    unknown(8362)Available download formats
    Dataset updated
    Mar 16, 2024
    Dataset authored and provided by
    Zenodohttp://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the open data repository to support and reproduce results in the paper "Land-free Bioenergy From Circular Agroecology -- A Diverse Option Space and Trade-offs." There are three types of files here:

    1. Ready-to-use final results files of all strategies and scenarios referred to in the paper. They can be downloaded and used directly without running any code. They all follow the same naming format for strategies/scenarios: Org = organic share, ConcRed = concentrate feeding reduction share, WasteRed = waste reduction share, and the numbers refer to the share. E.g., Org0_ConcRed50_WasteRed75 is a strategy with 0% organic share, 50% concentrate feeding reduction, and 75% waste reduction.
    • NationalAncillaryBioenergyPotential_EJ.csv: the national potential of ancillary bioenergy in 2050 from all scenarios (units: EJ). Same in both pathways.
    • GlobalPotentialEnvironmentalImpacts_NutrientFirst.csv: environmental impacts of all scenarios from the NutrientFirst pathway. The first three rows refer to the combination of agroecological practices in place, which allows you to explore environmental impacts grouped by, e.g., different organic shares.
    • GlobalPotentialEnvironmentalImpacts_NegFirst.csv: same structure as the file above, but from the other pathway, NegativeFirst.
    2. SOLmOutputs contains all original output files from our model SOLmV6.
    3. DataCleaningKit has the Python code and an additional dataset of heat values to process 2. SOLmOutputs and produce 1. (Tip: adjust the input_path and output_path before running DataCleaning.py.)

    Fei Wu (fei.wu@usys.ethz.ch), Delft, August 2023
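
    As an aside, the strategy names can be decoded programmatically; here is a minimal Python sketch (illustrative only, not part of the DataCleaningKit), assuming the naming format described above:

    import re

    def parse_strategy(name):
        # Split a name like "Org0_ConcRed50_WasteRed75" into shares (%).
        return {key: int(val) for key, val in re.findall(r"([A-Za-z]+)(\d+)", name)}

    print(parse_strategy("Org0_ConcRed50_WasteRed75"))
    # {'Org': 0, 'ConcRed': 50, 'WasteRed': 75}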

  16. Social Reward and Nonsocial Reward Processing Across the Adult Lifespan: An...

    • openneuro.org
    Updated Aug 1, 2024
    Cite
    David V. Smith; Cooper J. Sharp; Abraham Dachs; James Wyngaarden; Daniel Sazhin; Jen Yang; Melanie Kos; Tia Tropea; Ishika Kohli; John A. Clithero; Ingrid Olson; Tania Giovannetti; Dominic Fareri; Johanna M. Jarcho (2024). Social Reward and Nonsocial Reward Processing Across the Adult Lifespan: An Interim Multi-echo fMRI and Diffusion Dataset [Dataset]. http://doi.org/10.18112/openneuro.ds005123.v1.1.1
    Explore at:
    Dataset updated
    Aug 1, 2024
    Dataset provided by
    OpenNeurohttps://openneuro.org/
    Authors
    David V. Smith; Cooper J. Sharp; Abraham Dachs; James Wyngaarden; Daniel Sazhin; Jen Yang; Melanie Kos; Tia Tropea; Ishika Kohli; John A. Clithero; Ingrid Olson; Tania Giovannetti; Dominic Fareri; Johanna M. Jarcho
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Summary

    This is a preliminary release of a dataset supported by the National Institute on Aging and the National Institutes of Health. The full dataset is described in a submission to Data in Brief.

    Abstract

    Social relationships change across the lifespan as social networks narrow and motivational priorities shift. These changes may affect, or reflect, differences in how older adults make decisions related to processing social and non-social rewards. While we have shown initial evidence that older adults have a blunted response to some features of social reward, further work in larger samples is needed to probe the extent to which age-related differences translate to real-world consequences, such as financial exploitation. To address this gap, we are conducting a 5-year study funded by the National Institute on Aging (NIH R01-AG067011). Over the course of the funding period (2021-2026), this study seeks to: 1) characterize neural responses to social rewards across adulthood; 2) relate those responses to risk for financial exploitation and sociodemographic factors tied to risk; and 3) examine changes in risk for financial exploitation over time in healthy and vulnerable groups of older adults. This paper describes the preliminary release of data for the larger study. Adults (N=114; 40 male / 70 female / 4 other or non-binary; 21-80 years of age, M = 42.78, SD = 17.13) were recruited from the community to undergo multi-echo fMRI while completing tasks that measure brain function during social reward and decision-making. Tasks probe neural response to social reward (e.g., peer vs. monetary feedback) and social context and closeness (e.g., sharing a monetary reward with a friend compared to a stranger). Neural response to social decision-making is probed via economic trust and ultimatum games. Functional data are complemented by a T1-weighted anatomical scan and diffusion-weighted imaging (DWI) to enable tractography. This dataset has extensive potential for re-use, including leveraging multimodal neuroimaging data and within-subject measures of fMRI data from different tasks – data features that are rarely seen in an adult lifespan dataset.

    Expanded Task Names

    1. doors and socialdoors: a task in which participants received well-matched social and monetary rewards and punishments
    2. ugdg: a strategic reward-based decision-making task with Ultimatum and Dictator Game conditions
    3. trust: a task where participants choose an amount to invest in their partner (friend, stranger, or computer) and see whether or not that partner shared the tripled amount back
    4. sharedreward: a task where participants shared rewards or losses with peers, strangers, or non-human partners

    Additional Usage Notes

    We note that participants 10584, 10951, and 11005 are missing dwi. This is due to chiller malfunctions during the sequence that halted data collection. We also note that not all participants have two runs of each task. This was due to time constraints during the scan visits.
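
    A quick way to confirm which participants lack dwi data, sketched in Python under the assumption of the standard BIDS layout used on OpenNeuro (sub-<ID>/dwi/); the local dataset path is hypothetical:

    from pathlib import Path

    root = Path("ds005123")  # hypothetical local path to the dataset
    missing_dwi = [sub.name for sub in sorted(root.glob("sub-*"))
                   if not (sub / "dwi").is_dir()]
    print(missing_dwi)  # expected to include sub-10584, sub-10951, sub-11005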

    Code related to this dataset can be found on GitHub (https://github.com/DVS-Lab/SRPAL-DataInBrief/code/).

    Original sourcedata for behavioral data is included in the sourcedata folder. Due to privacy restrictions, we cannot release original sourcedata for the imaging data (i.e., DICOM files).

  17. python-code-dataset-500k

    • huggingface.co
    Updated Jan 22, 2024
    Cite
    James (2024). python-code-dataset-500k [Dataset]. https://huggingface.co/datasets/jtatman/python-code-dataset-500k
    Explore at:
    Croissant (a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Jan 22, 2024
    Authors
    James
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Attention: This dataset is a summary and reformat pulled from GitHub code.

    You should make your own assumptions based on this. In fact, there is another dataset I formed through parsing that addresses several points:

    • out of 500k Python-related items, most of them are python-ish, not pythonic
    • the majority of the items here contain excessive licensing inclusion of original code
    • the items here are sometimes not even Python but have references
    • there's a whole lot of GPL summaries… See the full description on the dataset page: https://huggingface.co/datasets/jtatman/python-code-dataset-500k.
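
    A minimal loading sketch using the Hugging Face datasets library; the split name "train" is an assumption:

    from datasets import load_dataset

    ds = load_dataset("jtatman/python-code-dataset-500k", split="train")
    print(ds[0])  # inspect a record before relying on the contents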

  18. Data from: Dataset and code for Parent–Child Adaptive Responses for Digital...

    • openicpsr.org
    Updated May 22, 2025
    Cite
    John Ziker; Jerry Fails; Kendall House; Hollie Abele (2025). Dataset and code for Parent–Child Adaptive Responses for Digital Resilience [Dataset]. http://doi.org/10.3886/E230863V1
    Explore at:
    Dataset updated
    May 22, 2025
    Dataset provided by
    Boise State University
    Authors
    John Ziker; Jerry Fails; Kendall House; Hollie Abele
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 2023 - Jul 2024
    Description

    This is the dataset and code needed to run the analyses for Study 3 highlighted in the article: Ziker, John P., Jerry Alan Fails, Kendall House, Jessi Boyer, Michael Wendell, Hollie Abele, Letizia Maukar, and Kayla Ramirez. 2025. “Parent–Child Adaptive Responses for Digital Resilience.” Social Sciences 14 (4): 1–24. https://doi.org/10.3390/socsci14040197. The dataset and code were originally made available here: https://github.com/johnziker/digitalResilienceofYouth

  19. OpenGlue

    • kaggle.com
    zip
    Updated Apr 26, 2022
    Cite
    k_s (2022). OpenGlue [Dataset]. https://www.kaggle.com/datasets/ksork6s4/openglue
    Explore at:
    zip(340106 bytes)Available download formats
    Dataset updated
    Apr 26, 2022
    Authors
    k_s
    Description

    This dataset is a clone of this repo.

    The following is the README of the original repository.

    =======================================

    OpenGlue - Open Source Pipeline for Image Matching

    StandWithUkraine License: MIT

    This is an implementation of the training, inference and evaluation scripts for OpenGlue under an open-source license, described in our paper OpenGlue: Open Source Graph Neural Net Based Pipeline for Image Matching.

    Overview

    SuperGlue is a method for learning feature matching using a graph neural network, proposed by a team (Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, Andrew Rabinovich) from Magic Leap. Official full paper: SuperGlue: Learning Feature Matching with Graph Neural Networks.

    We present OpenGlue: a free open-source framework for image matching that uses a Graph Neural Network-based matcher inspired by SuperGlue. We show that including additional geometrical information, such as local feature scale, orientation, and affine geometry, when available (e.g. for SIFT features), significantly improves the performance of the OpenGlue matcher. We study the influence of the various attention mechanisms on accuracy and speed. We also present a simple architectural improvement by combining local descriptors with context-aware descriptors.

    This repo is based on the PyTorch Lightning framework and enables the user to train, predict and evaluate the model.

    For local feature extraction, our interface supports Kornia detectors and descriptors along with our version of SuperPoint.

    We provide instructions on how to launch training on the MegaDepth dataset and test the trained models on the Image Matching Challenge.

    License

    This code is licensed under the MIT License. Modifications, distribution, commercial and academic uses are permitted. More information is in the LICENSE file.

    Data

    Steps to prepare MegaDepth dataset for training

    1) Create a folder MegaDepth, where your dataset will be stored:
    mkdir MegaDepth && cd MegaDepth

    2) Download and unzip MegaDepth_v1.tar.gz from the official link. You should now be able to see the MegaDepth/phoenix directory.

    3) We provide the lists of pairs for training and validation (link to download). Each line corresponds to one pair and has the following structure:
    path_image_A path_image_B exif_rotationA exif_rotationB [KA_0 ... KA_8] [KB_0 ... KB_8] [T_AB_0 ... T_AB_15] overlap_AB
    overlap_AB is a value of overlap between two images of the same scene; it shows how close (in position transformation) the two images are.
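
    To make the pair-line structure concrete, here is a minimal parsing sketch in Python, assuming whitespace-separated fields in the order shown above:

    import numpy as np

    def parse_pair_line(line):
        # Fields: 2 image paths, 2 EXIF rotations, two 3x3 intrinsics (9
        # values each), a 4x4 relative transform (16 values), and the
        # overlap value: 39 whitespace-separated tokens in total.
        t = line.split()
        path_a, path_b = t[0], t[1]
        rot_a, rot_b = int(t[2]), int(t[3])
        K_a = np.array(t[4:13], dtype=float).reshape(3, 3)
        K_b = np.array(t[13:22], dtype=float).reshape(3, 3)
        T_ab = np.array(t[22:38], dtype=float).reshape(4, 4)
        overlap = float(t[38])
        return path_a, path_b, rot_a, rot_b, K_a, K_b, T_ab, overlap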

    The resulting directory structure should be as follows:

    MegaDepth/
    - pairs/
    | - 0000/
    | | - sparse-txt/
    | | | pairs.txt
    | ...
    - phoenix/S6/zl548/MegaDepth_v1/
    | - 0000/
    | | - dense0/
    | | | - depths/
    | | | | id.h5 ...
    | | | - images/
    | | | | id.jpg ...
    | | - dense1/
    | ...
    ...

    Steps to prepare Oxford-Paris dataset for pre-training

    We also release the open-source weights for a pretrained OpenGlue on this dataset.

    Usage

    This repository is divided into several modules:
    • config - configuration files with training hyperparameters
    • data - preprocessing and dataset for MegaDepth
    • examples - code and notebooks with examples of applications
    • models - module with OpenGlue architecture and detector/descriptor methods
    • utils - losses, metrics and additional training utils

    Dependencies

    For all necessary modules, refer to requirements.txt:
    pip3 install -r requirements.txt

    This code is compatible with:
    • Python >= 3.6.9
    • PyTorch >= 1.10.0
    • PyTorch Lightning >= 1.4.9
    • Kornia >= 0.6.1
    • OpenCV >= 4.5.4

    Training

    Extracting features

    There are two options for feature extraction:

    1) Extract features during training. No additional steps are required before launching training.

    2) Extract and save features before training. We suggest using this approach, since training time is decreased immensely with pre-extracted features...

  20. Dataset for An Empirical Study of Hackathon Code Creation and Reuse

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 30, 2022
    Cite
    Ahmed Samir Imam Mahmoud; Tapajit Dey; Alexander Nolte; Audris Mockus; James D. Herbsleb (2022). Dataset for An Empirical Study of Hackathon Code Creation and Reuse [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6578707
    Explore at:
    Dataset updated
    May 30, 2022
    Dataset provided by
    Carnegie Mellon University, Pittsburgh, PA, USA
    University of Tartu, Estonia
    Lero---the Irish Software Research Centre, University of Limerick, Ireland
    University of Tennessee, Knoxville, TN, USA
    University of Tartu, Estonia - Carnegie Mellon University, Pittsburgh, PA, USA
    Authors
    Ahmed Samir Imam Mahmoud; Tapajit Dey; Alexander Nolte; Audris Mockus; James D. Herbsleb
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset corresponds to our extended analysis done for “The Secret Life of Hackathon Code Where does it come from and where does it go?” (https://doi.org/10.1109/MSR52588.2021.00020, pre-print at: https://arxiv.org/abs/2103.01145) and “Tracking Hackathon Code Creation and Reuse” (https://doi.org/10.1109/MSR52588.2021.00085, pre-print at: https://arxiv.org/pdf/2103.10167). The replication package, including the scripts used for generating this dataset from the “World of Code” (https://worldofcode.org/) dataset, is available on GitHub: https://github.com/woc-hack/track_hack.

    The dataset contains the blob hashes used in the scope of the analysis and the analysis outcome.

    The columns are as follows:

    DevpostID: Devpost identifier for the hackathon project; it can be used to build the URL on the devpost.com website. For example, DevpostID -q9nd5 translates to https://devpost.com/software/-q9nd5
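
    A one-line helper makes this translation explicit (illustrative only):

    def devpost_url(devpost_id):
        # Build the devpost.com project URL from a DevpostID.
        return "https://devpost.com/software/" + devpost_id

    print(devpost_url("-q9nd5"))  # https://devpost.com/software/-q9nd5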

    ProjectID: The Github project name

    HackathonEndDate: Hackathon event end date

    BlobHash: The blob hash used in the analysis

    BeforeHackathon-DuringHackathon-AfterHackathon: This column represents whether the blob was first introduced before/during/after the hackathon (1: before, 2: during, 3: after)

    SameAuthor-Contributor-OtherAuthor: This column represents whether the blob was first created by someone in the hackathon team, by someone who also contributed to a project that a hackathon team member contributed to (contributor), or by someone else outside of the hackathon team (1: author is a hackathon team member, 2: author contributed before with a hackathon team member, 3: author is not related to the hackathon team).

    UsedBySmallProject-UsedByMediumProject-UsedByLargeProject: This column represents whether the hackathon blob was reused after the hackathon event and the size of the project that reused the code (1: not reused, 3: reused in small project, 4: reused in medium project, 5: reused in large project)
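
    When analyzing the coded columns, mapping the codes to labels can help; here is a sketch based solely on the descriptions above (note that the source defines no code 2 for the reuse column):

    CREATION_TIME = {1: "before hackathon", 2: "during hackathon", 3: "after hackathon"}
    AUTHOR = {1: "hackathon team member", 2: "prior contributor", 3: "unrelated author"}
    REUSE = {1: "not reused", 3: "small project", 4: "medium project", 5: "large project"}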
