Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
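As an illustration (not an official utility; the exact zero-padding and file extensions should be checked against the files themselves), the id-to-folder mapping described above can be computed as:

def kernel_version_dir(kernel_version_id: int) -> str:
    # Top-level folder groups ids by millions, the sub folder by thousands,
    # e.g. id 123456789 -> "123/456"
    top = kernel_version_id // 1_000_000
    sub = (kernel_version_id // 1_000) % 1_000
    return f"{top}/{sub}"

print(kernel_version_dir(123456789))  # -> 123/456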
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
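For the GCS bucket, a minimal sketch using the google-cloud-storage Python client; the billing project id and the object name below are placeholders:

from google.cloud import storage

# Requester pays: the project given here is billed for the download
client = storage.Client()
bucket = client.bucket("kaggle-meta-kaggle-code-downloads", user_project="your-gcp-project-id")
blob = bucket.blob("some/kernel/version.ipynb")  # placeholder object path
blob.download_to_filename("version.ipynb")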
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
License: https://choosealicense.com/licenses/other/
KBP37 is a revision of the MIML-RE annotation dataset provided by Gabor Angeli et al. (2014). They use both the 2010 and 2013 KBP official document collections, as well as a July 2013 dump of Wikipedia, as the text corpus for annotation. In total, 33,811 sentences were annotated. Zhang and Wang made several refinements:
1. They add direction to the relation names, e.g. 'per:employee_of' is split into 'per:employee_of(e1,e2)' and 'per:employee_of(e2,e1)'. They also replace 'org:parents' with 'org:subsidiaries' and 'org:member_of' with 'org:member' (by their reverse directions).
2. They discard low-frequency relations, keeping only relations for which both directions occur more than 100 times in the dataset.
KBP37 contains 18 directional relations and an additional 'no_relation' relation, resulting in 37 relation classes.
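As an illustration of the resulting label space (the three base relation names below are just the examples mentioned above, not the full list of 18):

# 18 base relations, each with two directions, plus 'no_relation' -> 37 classes
base_relations = ["per:employee_of", "org:subsidiaries", "org:member"]  # ...plus 15 more
labels = [f"{r}(e1,e2)" for r in base_relations]
labels += [f"{r}(e2,e1)" for r in base_relations]
labels.append("no_relation")
# With all 18 base relations this yields len(labels) == 37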
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This is the readme for the supplemental data for our ICDAR 2019 paper.
You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202
If you found this dataset useful, please consider citing our paper:
@inproceedings{DBLP:conf/icdar/MorrisTE19,
author = {David Morris and
Peichen Tang and
Ralph Ewerth},
title = {A Neural Approach for Text Extraction from Scholarly Figures},
booktitle = {2019 International Conference on Document Analysis and Recognition,
{ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
pages = {1438--1443},
publisher = {{IEEE}},
year = {2019},
url = {https://doi.org/10.1109/ICDAR.2019.00231},
doi = {10.1109/ICDAR.2019.00231},
timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
biburl = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).
We used different sources of data for testing, validation, and training. Our testing set was assembled from the work by Böschen et al. that we cited. We excluded the DeGruyter dataset from that set and used it as our validation dataset.
These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2
The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.
We used label_generator's generated dataset, which the author made available on a requester-pays amazon s3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.
We have made our code available in code.zip. We will upload code, announce further news, and field questions via the github repo.
Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.
We used a Tesseract script to run text extraction on detected text rows. This is included in our code archive (code.tar) as text_recognition_multipro.py.
We used a Java program provided by Falk Böschen and adapted to our file structure. We included this as evaluator.jar.
Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.
Predicting the difficulty of playing a musical score plays a pivotal role in structuring and exploring score collections, with significant implications for music education. The automatic difficulty classification of piano scores, however, remains an unsolved challenge. This is largely due to the scarcity of annotated data and the inherent subjectiveness in the annotation process. The "Can I Play It?" (CIPI) dataset represents a substantial step forward in this domain, providing a machine-readable collection of piano scores paired with difficulty annotations from the esteemed Henle Verlag.
The CIPI dataset is meticulously assembled by aligning public domain scores with their corresponding difficulty labels sourced from Henle Verlag. This initial pairing was subsequently reviewed and refined by an expert pianist to ensure accuracy and reliability. The dataset is structured to facilitate easy access and interpretation, making it a valuable resource for researchers and educators alike.
Our work makes two primary contributions to the field of score difficulty classification. Firstly, we address the critical issue of data scarcity, introducing the CIPI dataset to the academic community. Secondly, we delve into various input representations derived from score information, utilizing pre-trained machine learning models tailored for piano fingering and expressiveness. These models draw inspiration from musicological definitions of performance, offering nuanced insights into score difficulty.
Through extensive experimentation, we demonstrate that an ensemble approach—combining outputs from multiple classifiers—yields superior results compared to individual classifiers. This highlights the diverse facets of difficulty captured by different representations. Our comprehensive experiments lay a robust foundation for future endeavors in score difficulty classification, and our best-performing model reports a balanced accuracy of 39.5% and a median square error of 1.1 across the nine difficulty levels introduced in this study.
The CIPI dataset, along with the associated code and models, is made publicly available to ensure reproducibility and to encourage further research in this domain. Users are encouraged to reference this resource in their work and to contribute to its ongoing development.
Ramoneda, P., Jeong, D., Eremenko, V., Tamer, N. C., Miron, M., & Serra, X. (2024). Combining Piano Performance Dimensions for Score Difficulty Classification. Expert Systems with Applications, 238, 121776. DOI: 10.1016/j.eswa.2023.121776
@article{Ramoneda2024,
  author  = {Pedro Ramoneda and Dasaem Jeong and Vsevolod Eremenko and Nazif Can Tamer and Marius Miron and Xavier Serra},
  title   = {Combining Piano Performance Dimensions for Score Difficulty Classification},
  journal = {Expert Systems with Applications},
  volume  = {238},
  pages   = {121776},
  year    = {2024},
  doi     = {10.1016/j.eswa.2023.121776},
  url     = {https://doi.org/10.1016/j.eswa.2023.121776}
}
pedro.ramoneda@upf.edu
xavier.serra@upf.edu
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains PDF-to-text conversions of scientific research articles, prepared for the task of data citation mining. The goal is to identify references to research datasets within full-text scientific papers and classify them as Primary (data generated in the study) or Secondary (data reused from external sources).
The PDF articles were processed using MinerU, which converts scientific PDFs into structured machine-readable formats (JSON, Markdown, images). This ensures participants can access both the raw text and layout information needed for fine-grained information extraction.
Each paper directory contains the following files:
- *_origin.pdf: the original PDF file of the scientific article.
- *_content_list.json: structured extraction of the PDF content, where each object represents a text or figure element with metadata. Example entry:
  {
    "type": "text",
    "text": "10.1002/2017JC013030",
    "text_level": 1,
    "page_idx": 0
  }
- full.md: the complete article content in Markdown format (linearized for easier reading).
- images/: folder containing figures and extracted images from the article.
- layout.json: page layout metadata, including positions of text blocks and images.
The aim is to detect dataset references in the article text and classify them:
- DOIs (Digital Object Identifiers): https://doi.org/[prefix]/[suffix], e.g. https://doi.org/10.5061/dryad.r6nq870
- Accession IDs: used by data repositories; the format varies by repository. Examples: GSE12345 (NCBI GEO), PDB 1Y2T (Protein Data Bank), E-MEXP-568 (ArrayExpress).

Each dataset mention must be labeled as Primary or Secondary, matching the ground truth in train_labels.csv.

train_labels.csv: ground truth with:
- article_id: research paper DOI
- dataset_id: extracted dataset identifier
- type: citation type (Primary / Secondary)

sample_submission.csv: example submission format.
Example:
Paper: https://doi.org/10.1098/rspb.2016.1151
Data: https://doi.org/10.5061/dryad.6m3n9
In-text span: "The data we used in this publication can be accessed from Dryad at doi:10.5061/dryad.6m3n9."
Citation type: Primary
This dataset enables participants to develop and test NLP systems for detecting dataset references in full-text scientific articles and classifying each citation as Primary or Secondary.
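A minimal, illustrative sketch (not an official baseline) of spotting such identifiers in the extracted text, assuming a *_content_list.json is a JSON array of objects like the example above; the regular expressions cover only the example formats shown:

import json
import re

DOI_RE = re.compile(r"10\.\d{4,9}/[^\s\"'<>]+")
ACCESSION_RE = re.compile(r"\b(?:GSE\d+|PDB\s?[0-9][A-Za-z0-9]{3}|E-MEXP-\d+)\b")

with open("example_content_list.json") as f:   # placeholder path
    elements = json.load(f)

for el in elements:
    if el.get("type") != "text":
        continue
    text = el.get("text", "")
    for match in DOI_RE.findall(text) + ACCESSION_RE.findall(text):
        print(el.get("page_idx"), match)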
This is a subset of the Zenodo-ML Dinosaur Dataset [Github] that has been converted to small png files and organized in folders by the language so you can jump right in to using machine learning methods that assume image input.
Included are .tar.gz files, each named after a file extension; when extracted, each produces a folder of the same name.
tree -L 1
.
├── c
├── cc
├── cpp
├── cs
├── css
├── csv
├── cxx
├── data
├── f90
├── go
├── html
├── java
├── js
├── json
├── m
├── map
├── md
├── txt
└── xml
And we can peep inside one of the (somewhat smaller) folders of the set to see that the subfolders are zenodo identifiers. A zenodo identifier corresponds to a single Github repository, which means the png files produced are chunks of code of that extension type from a particular repository.
$ tree map -L 1
map
├── 1001104
├── 1001659
├── 1001793
├── 1008839
├── 1009700
├── 1033697
├── 1034342
...
├── 836482
├── 838329
├── 838961
├── 840877
├── 840881
├── 844050
├── 845960
├── 848163
├── 888395
├── 891478
└── 893858
154 directories, 0 files
Within each folder (zenodo id) the files are prefixed by the zenodo id, followed by the index into the original image set array that is provided with the full dinosaur dataset archive.
$ tree m/891531/ -L 1
m/891531/
├── 891531_0.png
├── 891531_10.png
├── 891531_11.png
├── 891531_12.png
├── 891531_13.png
├── 891531_14.png
├── 891531_15.png
├── 891531_16.png
├── 891531_17.png
├── 891531_18.png
├── 891531_19.png
├── 891531_1.png
├── 891531_20.png
├── 891531_21.png
├── 891531_22.png
├── 891531_23.png
├── 891531_24.png
├── 891531_25.png
├── 891531_26.png
├── 891531_27.png
├── 891531_28.png
├── 891531_29.png
├── 891531_2.png
├── 891531_30.png
├── 891531_3.png
├── 891531_4.png
├── 891531_5.png
├── 891531_6.png
├── 891531_7.png
├── 891531_8.png
└── 891531_9.png
0 directories, 31 files
So what's the difference?
The difference is that these files are organized by extension type, and provided as actual png images. The original data is provided as numpy data frames, and is organized by zenodo ID. Both are useful for different things - this particular version is cool because we can actually see what a code image looks like.
How many images total?
We can count the number of total images:
find . -type f -name "*.png" | wc -l
3,026,993
The script to create the dataset is provided here. Essentially, we start with the top extensions as identified by this work (excluding actual image files) and then write each 80x80 image to an actual png image, organizing by extension then zenodo id (as shown above).
I tested a few methods to write the single channel 80x80 data frames as png images, and wound up liking cv2's imwrite function because it would save and then load the exact same content.
import cv2
cv2.imwrite(image_path, image)
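For instance, a quick round-trip check of that property (illustrative only) could be:

import cv2
import numpy as np

image = np.random.randint(0, 256, size=(80, 80), dtype=np.uint8)  # stand-in for a code image
cv2.imwrite("/tmp/check.png", image)
loaded = cv2.imread("/tmp/check.png", cv2.IMREAD_GRAYSCALE)
assert np.array_equal(image, loaded)   # what was saved is exactly what is loaded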
Given the above, it's pretty easy to load an image! Here is an example using scipy, and then for newer Python (if you get a deprecation message) using imageio.
image_path = '/tmp/data1/data/csv/1009185/1009185_0.png'
from imageio import imread
image = imread(image_path)
array([[116, 105, 109, ..., 32, 32, 32],
[ 48, 44, 48, ..., 32, 32, 32],
[ 48, 46, 49, ..., 32, 32, 32],
...,
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32]], dtype=uint8)
image.shape
(80,80)
# Deprecated
from scipy import misc
misc.imread(image_path)
Image([[116, 105, 109, ..., 32, 32, 32],
[ 48, 44, 48, ..., 32, 32, 32],
[ 48, 46, 49, ..., 32, 32, 32],
...,
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32]], dtype=uint8)
Remember that the values in the data are characters that have been converted to ordinal. Can you guess what 32 is?
ord(' ')
32
# And thus if you wanted to convert it back...
chr(32)
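And, continuing the example above, a row of ordinals can be turned back into an 80-character line of code:

# Decode one 80-value row of the image loaded above back into text
line = "".join(chr(v) for v in image[0])
print(repr(line))

# Or decode the whole 80x80 chunk at once
print("\n".join("".join(chr(v) for v in row) for row in image))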
So how t...
Restricted access: https://www.scilifelab.se/data/restricted-access/
Dataset with annotated 12-lead ECG records. The exams were taken in 811 counties in the state of Minas Gerais/Brazil by the Telehealth Network of Minas Gerais (TNMG) between 2010 and 2016, and organized by the CODE (Clinical Outcomes in Digital Electrocardiography) group.

Requesting access: Researchers affiliated with educational or research institutions may request access to this dataset. Requests will be analyzed on an individual basis and should contain: name of PI and host organisation; contact details (including your name and email); and the scientific purpose of the data access request. If approved, a data user agreement will be forwarded to the researcher who made the request (through the email that was provided). After the agreement has been signed (by the researcher or by the research institution), access to the dataset will be granted.

Openly available subset: A subset of this dataset (with 15% of the patients) is openly available. See: "CODE-15%: a large scale annotated dataset of 12-lead ECGs", https://doi.org/10.5281/zenodo.4916206.

Content: The folder contains a column-separated file containing basic patient attributes, and the ECG waveforms in the wfdb format.

Additional references: The dataset is described in the paper "Automatic diagnosis of the 12-lead ECG using a deep neural network", https://www.nature.com/articles/s41467-020-15432-4. Related publications also using this dataset are:
- [1] G. Paixao et al., "Validation of a Deep Neural Network Electrocardiographic-Age as a Mortality Predictor: The CODE Study," Circulation, vol. 142, no. Suppl_3, pp. A16883–A16883, Nov. 2020, doi: 10.1161/circ.142.suppl_3.16883.
- [2] A. L. P. Ribeiro et al., "Tele-electrocardiography and big data: The CODE (Clinical Outcomes in Digital Electrocardiography) study," Journal of Electrocardiology, Sep. 2019, doi: 10/gf7pwg.
- [3] D. M. Oliveira, A. H. Ribeiro, J. A. O. Pedrosa, G. M. M. Paixao, A. L. P. Ribeiro, and W. Meira Jr, "Explaining end-to-end ECG automated diagnosis using contextual features," in Machine Learning and Knowledge Discovery in Databases, European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), Ghent, Belgium, Sep. 2020, vol. 12461, pp. 204–219, doi: 10.1007/978-3-030-67670-4_13.
- [4] D. M. Oliveira, A. H. Ribeiro, J. A. O. Pedrosa, G. M. M. Paixao, A. L. Ribeiro, and W. Meira Jr, "Explaining black-box automated electrocardiogram classification to cardiologists," in 2020 Computing in Cardiology (CinC), 2020, vol. 47, doi: 10.22489/CinC.2020.452.
- [5] G. M. M. Paixão et al., "Evaluation of mortality in bundle branch block patients from an electronic cohort: Clinical Outcomes in Digital Electrocardiography (CODE) study," Journal of Electrocardiology, Sep. 2019, doi: 10/dcgk.
- [6] G. M. M. Paixão et al., "Evaluation of Mortality in Atrial Fibrillation: Clinical Outcomes in Digital Electrocardiography (CODE) Study," Global Heart, vol. 15, no. 1, p. 48, Jul. 2020, doi: 10.5334/gh.772.
- [7] G. M. M. Paixão et al., "Electrocardiographic Predictors of Mortality: Data from a Primary Care Tele-Electrocardiography Cohort of Brazilian Patients," Hearts, vol. 2, no. 4, Dec. 2021, doi: 10.3390/hearts2040035.
- [8] G. M. Paixão et al., "ECG-Age from Artificial Intelligence: A New Predictor for Mortality? The CODE (Clinical Outcomes in Digital Electrocardiography) Study," Journal of the American College of Cardiology, vol. 75, no. 11, Supplement 1, p. 3672, 2020, doi: 10.1016/S0735-1097(20)34299-6.
- [9] E. M. Lima et al., "Deep neural network estimated electrocardiographic-age as a mortality predictor," Nature Communications, vol. 12, 2021, doi: 10.1038/s41467-021-25351-7.
- [10] W. Meira Jr, A. L. P. Ribeiro, D. M. Oliveira, and A. H. Ribeiro, "Contextualized Interpretable Machine Learning for Medical Diagnosis," Communications of the ACM, 2020, doi: 10.1145/3416965.
- [11] A. H. Ribeiro et al., "Automatic diagnosis of the 12-lead ECG using a deep neural network," Nature Communications, vol. 11, no. 1, p. 1760, 2020, doi: 10/drkd.
- [12] A. H. Ribeiro et al., "Automatic Diagnosis of Short-Duration 12-Lead ECG using a Deep Convolutional Network," Machine Learning for Health (ML4H) Workshop at NeurIPS, 2018.
- [13] A. H. Ribeiro et al., "Automatic 12-lead ECG classification using a convolutional network ensemble," 2020, doi: 10.22489/CinC.2020.130.
- [14] V. Sangha et al., "Automated Multilabel Diagnosis on Electrocardiographic Images and Signals," medRxiv, Sep. 2021, doi: 10.1101/2021.09.22.21263926.
- [15] S. Biton et al., "Atrial fibrillation risk prediction from the 12-lead ECG using digital biomarkers and deep representation learning," European Heart Journal - Digital Health, 2021, doi: 10.1093/ehjdh/ztab071.

Code: The following github repositories perform analysis that uses this dataset:
- https://github.com/antonior92/automatic-ecg-diagnosis
- https://github.com/antonior92/ecg-age-prediction

Related datasets:
- CODE-test: An annotated 12-lead ECG dataset (https://doi.org/10.5281/zenodo.3765780)
- CODE-15%: a large scale annotated dataset of 12-lead ECGs (https://doi.org/10.5281/zenodo.4916206)
- Sami-Trop: 12-lead ECG traces with age and mortality annotations (https://doi.org/10.5281/zenodo.4905618)

Ethics declarations: The CODE Study was approved by the Research Ethics Committee of the Universidade Federal de Minas Gerais, protocol 49368496317.7.0000.5149.
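As a minimal sketch for reading one record (assuming the wfdb Python package is installed; the record path is a placeholder):

import wfdb

# Pass the record path without a file extension
record = wfdb.rdrecord("path/to/some_exam")   # placeholder record name
print(record.fs, record.sig_name)             # sampling frequency and lead names
signals = record.p_signal                     # numpy array of shape (n_samples, n_leads)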
CODEBRIM: COncrete DEfect BRidge IMage Dataset for multi-target multi-class concrete defect classification in computer vision and machine learning.
Dataset as presented and detailed in our CVPR 2019 publication: http://openaccess.thecvf.com/content_CVPR_2019/html/Mundt_Meta-Learning_Convolutional_Neural_Architectures_for_Multi-Target_Concrete_Defect_Classification_With_CVPR_2019_paper.html or https://arxiv.org/abs/1904.08486. If you make use of the dataset please cite it as follows:
"Martin Mundt, Sagnik Majumder, Sreenivas Murali, Panagiotis Panetsos, Visvanathan Ramesh. Meta-learning Convolutional Neural Architectures for Multi-target Concrete Defect Classification with the COncrete DEfect BRidge IMage Dataset. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019"
We offer a supplementary GitHub repository with code to reproduce the paper and data loaders: https://github.com/ccc-frankfurt/meta-learning-CODEBRIM
For ease of use we provide the dataset in multiple different versions.
Files contained:
* CODEBRIM_original_images: contains the original full-resolution images and bounding box annotations
* CODEBRIM_cropped_dataset: contains the extracted crops/patches with corresponding class labels from the bounding boxes
* CODEBRIM_classification_dataset: contains the cropped patches with corresponding class labels split into training, validation and test sets for machine learning
* CODEBRIM_classification_balanced_dataset: similar to "CODEBRIM_classification_dataset" but with the exact replication of training images to balance the dataset in order to reproduce results obtained in the paper.
Data licence Germany – Attribution – Version 2.0: https://www.govdata.de/dl-de/by-2-0
License information was derived automatically
This dataset provides a comprehensive assessment of public transport connectivity across Germany by analyzing both walking distances to the nearest public transport stops and the quality of public transport connections for daily usage scenarios, with housing-level granularity on a country-wide scale. The data was generated through a novel approach that integrates multiple open data sources, simulation models, and visual analytics techniques, enabling researchers, policymakers, and urban planners to identify gaps and opportunities for transit network improvements.
Efficient and accessible public transportation is a critical component of sustainable urban development. However, many transit networks struggle to adequately serve diverse populations due to infrastructural, financial, and urban planning limitations. Traditional transit planning often relies on aggregated statistics, expert opinions, or limited surveys, making it difficult to assess transport accessibility at an individual household level. This dataset provides a data-driven and reproducible methodology for unbiased country-wide comparisons.
Find more information at https://mobility.dbvis.de.
| Title | OPTIMAP: A Dataset for Open Public Transport Infrastructure and Mobility Accessibility Profiles |
| Acronym | OPTIMAP |
| Download | https://mobility.dbvis.de/data-results/OPTIMAP_v2025-02-01.parquet (478MB, parquet) |
| License | Datenlizenz Deutschland - Namensnennung - Version 2.0 (dl-de-by/2.0) |
Please cite the dataset as: Maximilian T. Fischer, Daniel Fürst, Yannick Metz, Manuel Schmidt, Julius Rauscher, and Daniel A. Keim. OPTIMAP: A Dataset for Open Public Transport Infrastructure and Mobility Accessibility Profiles. Zenodo, 2025. doi: 10.5281/zenodo.14772646.
or, when using Bibtex
@dataset{MobilityProfiles.DatasetGermany.2025,
  author    = {Fischer, Maximilian T. and Fürst, Daniel and Metz, Yannick and Schmidt, Manuel and Rauscher, Julius and Keim, Daniel A.},
  title     = {OPTIMAP: A Dataset for Open Public Transport Infrastructure and Mobility Accessibility Profiles},
  year      = {2025},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.14772646}
}
The dataset in the PARQUET format includes detailed accessibility measures for public transport at a fine-grained, housing-level resolution. It consists of four columns:
- lat, lng (float32): GPS coordinates (EPSG:4326) of each house in Germany, expensively compiled from the house coordinates (HK-DE) data provided by the 16 federal states under the EU INSPIRE regulations.
- MinDistanceWalking (int32): an approximate walking distance (in meters) to the nearest public transport stop from each registered building in Germany.
- scores_OVERALL (float32): a simulated, demographic- and scenario-weighted measure of public transport quality for daily usage, considering travel times, frequency, and coverage across various daily scenarios (e.g., commuting, shopping, medical visits). The results are represented in an artificial time unit to allow comparative analysis across locations.

The dataset was generated using a combination of open geospatial data and advanced transport simulation techniques.
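A minimal sketch for loading the file with pandas (the file name comes from the download table above; the 1 km threshold is just an example):

import pandas as pd

df = pd.read_parquet("OPTIMAP_v2025-02-01.parquet",
                     columns=["lat", "lng", "MinDistanceWalking", "scores_OVERALL"])

# Example: houses more than 1 km from the nearest public transport stop
far_from_stop = df[df["MinDistanceWalking"] > 1000]
print(len(far_from_stop), "houses are farther than 1 km from the nearest stop")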
The dataset enables multiple use cases across research, policy, and urban planning.
By offering high-resolution public transport accessibility data at housing-level granularity, this dataset contributes to a more transparent and objective understanding of urban mobility challenges. The integration of simulation models, demographic considerations, and scalable analytics provides a novel approach to evaluating and improving public transit systems. Researchers, city officials, and policymakers are encouraged to leverage this dataset to enhance transport infrastructure planning and accessibility.
This dataset contains both the approximate walking distances in meters and a weighted overall quality score in an artificial time unit for each individual house in Germany. More advanced versions are currently not publicly available. This base dataset is publicly available and adheres to open data licensing principles, enabling its reuse for scientific and policy-oriented studies.
While not part of this dataset, the scientific simulation used to create the results leverages public transit information via the National Access Point (NAP) DELFI as NeTEx, provided via GTFS feeds of Germany (CC BY 4.0).
Also, routing information used during the processing was based on Open Street Map contributors (CC BY 4.0).
Primarily, this dataset contains original and slightly processed housing locations (lat, lng) that were made available as part of the EU INSPIRE regulations, based on Directive (EU) 2019/1024 (of the European Parliament and of the Council of 20 June 2019 on open data and the re-use of public sector information (recast)).
In Germany, the respective data is provided individually by the 16 federal states, with the following required attributions and license indications:
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This dataset was created during the Programming Language Ecosystem project at TU Wien using the code in the repository https://github.com/ValentinFutterer/UsageOfProgramminglanguages2011-2023?tab=readme-ov-file.
The centerpiece of this repository is the usage_of_programming_languages_2011-2023.csv. This csv file shows the popularity of programming languages over the last 12 years in yearly increments. The repository also contains graphs created with the dataset. To get an accurate estimate of the popularity of programming languages, this dataset was created using three vastly different sources.
The dataset was created using the github repository above. As input data, three public datasets were used.
Taken from https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars/ by Peter Elmers. It is licensed under CC BY 4.0 https://creativecommons.org/licenses/by/4.0/. It shows metadata information (no code) of all github repositories with more than 5 stars.
Taken from https://github.com/pypl/pypl.github.io/tree/master, put online by the user pcarbonn. It is licensed under CC BY 3.0 https://creativecommons.org/licenses/by/3.0/. It shows from 2004 to 2023 for each month the share of programming related google searches per language.
Taken from https://insights.stackoverflow.com/survey. It is licensed under Open Data Commons Open Database License (ODbL) v1.0 https://opendatacommons.org/licenses/odbl/1-0/. It shows from 2011 to 2023 the results of the yearly stackoverflow developer survey.
All these datasets were downloaded on 12.12.2023. They are all included in the github repository above.
The dataset contains a column for the year and then many columns for the different languages, denoting their usage in percent. Additionally, vertical barcharts and piecharts for each year plus a line graph for each language over the whole timespan as png's are provided.
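A small illustrative sketch of working with the CSV (the year column is assumed to be the first column, and the exact language column names may differ):

import pandas as pd

df = pd.read_csv("usage_of_programming_languages_2011-2023.csv")

year_col = df.columns[0]        # assumed to be the year column
df = df.set_index(year_col)
print(df["Python"])             # yearly share (in percent) for one language, if that column exists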
The languages that are going to be considered for the project can be seen here:
- Python
- C
- C++
- Java
- C#
- JavaScript
- PHP
- SQL
- Assembly
- Scratch
- Fortran
- Go
- Kotlin
- Delphi
- Swift
- Rust
- Ruby
- R
- COBOL
- F#
- Perl
- TypeScript
- Haskell
- Scala
This project is licensed under the Open Data Commons Open Database License (ODbL) v1.0: https://opendatacommons.org/licenses/odbl/1-0/.
TLDR: You are free to share, adapt, and create derivative works from this dataset as long as you attribute me, keep the database open (if you redistribute it), and continue to share-alike any adapted database under the ODbL.
Thanks go out to
- stackoverflow https://insights.stackoverflow.com/survey for providing the data from the yearly stackoverflow developer survey.
- the PYPL survey, https://github.com/pypl/pypl.github.io/tree/master for providing google search data.
- Peter Elmers, for crawling metadata on github repositories and providing the data https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars/.
This version of the CivilComments Dataset provides access to the primary seven labels that were annotated by crowd workers; the toxicity and other tags are values between 0 and 1 indicating the fraction of annotators that assigned these attributes to the comment text.
The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but only consists of the subset of the data with them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.
The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 - 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text, some associated metadata such as article IDs, publication IDs, timestamps and commenter-generated "civility" labels, but does not include user ids. Jigsaw extended this dataset by adding additional labels for toxicity, identity mentions, as well as covert offensiveness. This data set is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.
For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('civil_comments', split='train')
for ex in ds.take(4):
  print(ex)
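Building on the snippet above, and since the labels are annotator fractions between 0 and 1, one common illustrative follow-up is to threshold them into a binary target (assuming the 'text' and 'toxicity' feature names):

import tensorflow as tf

# Treat a comment as toxic when at least half of the annotators tagged it as such
binary_ds = ds.map(lambda ex: {'text': ex['text'],
                               'label': tf.cast(ex['toxicity'] >= 0.5, tf.int32)})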
See the guide for more information on tensorflow_datasets.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The dataset was collected as part of Prac1 of the subject Typology and Data Life Cycle of the Master's Degree in Data Science at the Universitat Oberta de Catalunya (UOC).
The dataset contains 25 variables and 52478 records corresponding to books on the GoodReads Best Books Ever list (the largest list on the site).
The original code used to retrieve the dataset can be found in the GitHub repository: github.com/scostap/goodreads_bbe_dataset
The data was retrieved in two sets, the first 30000 books and then the remaining 22478. Dates were not parsed and reformatted for the second chunk, so publishDate and firstPublishDate are represented in a mm/dd/yyyy format for the first 30000 records and as Month Day Year for the rest.
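An illustrative sketch of normalizing the two date styles with pandas (the file name is a placeholder and the format strings are an interpretation of the description above; values matching neither pattern become NaT):

import pandas as pd

df = pd.read_csv("goodreads_books.csv")   # placeholder file name

def parse_mixed_date(value):
    # First chunk: mm/dd/yyyy; second chunk: "Month Day Year"
    for fmt in ("%m/%d/%Y", "%B %d %Y"):
        try:
            return pd.to_datetime(value, format=fmt)
        except (ValueError, TypeError):
            continue
    return pd.NaT

for col in ("publishDate", "firstPublishDate"):
    df[col] = df[col].apply(parse_mixed_date)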
Book cover images can be optionally downloaded from the url in the 'coverImg' field. Python code for doing so and an example can be found on the github repo.
The 25 fields of the dataset are:
| Attributes | Definition | Completeness (%) |
| ------------- | ------------- | ------------- |
| bookId | Book Identifier as in goodreads.com | 100 |
| title | Book title | 100 |
| series | Series Name | 45 |
| author | Book's Author | 100 |
| rating | Global goodreads rating | 100 |
| description | Book's description | 97 |
| language | Book's language | 93 |
| isbn | Book's ISBN | 92 |
| genres | Book's genres | 91 |
| characters | Main characters | 26 |
| bookFormat | Type of binding | 97 |
| edition | Type of edition (ex. Anniversary Edition) | 9 |
| pages | Number of pages | 96 |
| publisher | Editorial | 93 |
| publishDate | publication date | 98 |
| firstPublishDate | Publication date of first edition | 59 |
| awards | List of awards | 20 |
| numRatings | Number of total ratings | 100 |
| ratingsByStars | Number of ratings by stars | 97 |
| likedPercent | Derived field, percent of ratings over 2 stars (as in GoodReads) | 99 |
| setting | Story setting | 22 |
| coverImg | URL to cover image | 99 |
| bbeScore | Score in Best Books Ever list | 100 |
| bbeVotes | Number of votes in Best Books Ever list | 100 |
| price | Book's price (extracted from Iberlibro) | 73 |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains around 30 000 basic blocks whose energy consumption and execution time have been measured in isolation on the MSP430FR5969 microcontroller, at 1MHz. Basic blocks were executed in a worst case scenario regarding the MSP430 FRAM cache and CPU pipeline. The dataset creation process is described thoroughly in [1].
This dataset is composed of the following files:
- basic_blocks.tar.xz contains all basic blocks (BB) used in the dataset, in a custom JSON format
- data.csv / data.xlsx contain the measured energy consumption and execution time for each basic block

We first detail how the basic_blocks.tar.gz archive is organized, and then present the CSV/XLSX spreadsheet format.
We extracted the basic blocks from a subset of programs of the AnghaBench benchmark suite [2]. The basic_blocks.tar.gz archive consists of the extracted basic blocks organized as json files. Each json file corresponds to a C source file from AnghaBench and is given a unique identifier. An example json (137.json) is available here:
{
  "extr_pfctl_altq.c_pfctl_altq_init": [
    # Basic block 1
    [
      # Instruction 1 of BB1
      ["MOV.W", "#queue_map", "R13"],
      # Instruction 2 of BB1
      ["MOV.B", "#0", "R14"],
      # Instruction 3 of BB1
      ["CALL", "#hcreate_r", null]
    ],
    # Basic block 2
    [
      ....
    ]
  ]
}
The json contains a dict with only one key pointing to an array of basic blocks. This key is the name of the original C source file in AnghaBench from which the basic blocks were extracted (here extr_pfctl_altq.c_pfctl_altq_init.c). The array contains severals basic blocks, which are represented as an array of instructions, which are themselves represented as an array [OPCODE, OPERAND1, OPERAND2].
Then, each basic block can be identified uniquely using two ids: its file id and its offset in the file. In our example, basic block 1 can be identified by the json file id (137) and its offset in the file (0). Its ID is 137_0. This ID is used to make the mapping between a basic block and its energy consumption/execution time, with the data.csv/data.xlsx spreadsheet.
Energy consumption and execution time data are stored in the data.csv file. Here is the extract of the csv file corresponding to the basic block 137_0. The spreadsheet format is described below.
bb_id;nb_inst;max_energy;max_time;avg_time;avg_energy;energy_per_inst;nb_samples;unroll_factor
137_0;3;8.77;7.08;7.04;8.21;2.92;40;50
Spreadsheet format:
- bb_id: the unique identifier of a basic block (cf. Basic Blocks)
- nb_inst: the number of instructions in the basic block
- max_energy: the maximum energy consumption (in nJ) measured during the experiment
- max_time: the maximum execution time (in us) measured during the experiment
- avg_time: the average execution time (in us) measured during the experiment
- avg_energy: the average energy consumption (in nJ) measured during the experiment
- energy_per_inst: the average energy consumption per instruction (corresponds to avg_energy/nb_inst)
- nb_samples: how many times the basic block's energy consumption/execution time was measured
- unroll_factor: how many times the basic block was unrolled (cf. Basic Block Unrolling)

To measure the energy consumption and execution time on the MSP430, we need to be able to handle the scale difference between the measurement tool and the basic block execution time. This is achieved by duplicating the basic block multiple times while making sure to keep the worst-case memory layout, as explained in the paper. The number of times the basic block has been duplicated is called the unroll_factor.
Values of energy and time are always given per basic block, so they have already been divided by the unroll factor.
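As a minimal illustrative sketch (directory and file names are placeholders for wherever the archives were extracted), the two files can be joined for basic block 137_0:

import csv
import json

# Index the measurements by basic-block id (the CSV is semicolon-separated)
with open("data.csv", newline="") as f:
    rows = {row["bb_id"]: row for row in csv.DictReader(f, delimiter=";")}

bb_id = "137_0"
file_id, offset = bb_id.split("_")

# Each json has a single key mapping to the list of basic blocks of that source file
with open(f"basic_blocks/{file_id}.json") as f:   # placeholder directory name
    basic_blocks = next(iter(json.load(f).values()))
block = basic_blocks[int(offset)]

row = rows[bb_id]
print(len(block), "instructions,", row["avg_energy"], "nJ,", row["avg_time"], "us on average")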
The selected features after PCA analysis for both energy and time model are listed here: MOV.W_Rn_Rn, MOV.W_X(Rn)_X(Rn), CALL, MOV.B_#N_Rn, ADD.W_Rn_Rn, MOV.W_@Rn_Rn, MOV.W_X(Rn)_Rn, ADD.W_#N_Rn, PUSHM.W_#N_Rn, MOV.W_X(Rn)_ADDR, CMP.W_#N_Rn, MOV.W_&ADDR_X(Rn), MOV.W_Rn_X(Rn), BIS.W_Rn_Rn, RLAM.W_#N_Rn, SUB.W_#N_Rn, MOV.W_&ADDR_Rn, MOV.W_#N_X(Rn), CMP.W_Rn_Rn, BIT.W_ADDR_Rn, MOV.W_@Rn_X(Rn), ADD.W_#N_X(Rn), MOV.W_#N_Rn, AND.W_Rn_Rn, MOV.W_Rn_ADDR, SUB.W_Rn_Rn, MOV.W_ADDR_Rn, MOV.W_X(Rn)_&ADDR, MOV.W_ADDR_ADDR, JMP, ADD_#N_Rn, BIS.W_Rn_X(Rn), SUB_Rn_Rn, MOV.W_ADDR_X(Rn), ADDC_#N_X(Rn), MOV.B_Rn_Rn, CMP.W_X(Rn)_X(Rn), ADD_Rn_Rn, nb_inst, INV.W_Rn_, NOP_, ADD.W_X(Rn)_X(Rn), ADD.W_Rn_X(Rn), MOV.B_@Rn_Rn, BIS.W_X(Rn)_X(Rn), MOV.B_#N_X(Rn), MOV.W_#N_ADDR, AND.W_#N_ADDR, SUBC_X(Rn)_X(Rn), BIS.W_#N_X(Rn), SUB.W_X(Rn)_X(Rn), AND.B_#N_Rn, ADD_X(Rn)_X(Rn), MOV.W_@Rn_ADDR, MOV.W_&ADDR_ADDR, ADDC_Rn_Rn, AND.W_#N_X(Rn), SUB_#N_Rn, RRUM.W_#N_Rn, AND_ADDR_Rn, CMP.W_X(Rn)_ADDR, MOV.B_#N_ADDR, ADD.W_#N_ADDR, CMP.B_#N_Rn, SXT_Rn_, XOR.W_Rn_Rn, CMP.W_@Rn_Rn, ADD.W_@Rn_Rn, ADD.W_X(Rn)_Rn, AND.W_Rn_X(Rn), CMP.B_Rn_Rn, AND.W_X(Rn)_X(Rn), BIC.W_#N_Rn, BIS.W_#N_Rn, AND.B_#N_X(Rn), MOV.B_X(Rn)_X(Rn), AND.W_@Rn_Rn, MOV.W_#N_&ADDR, BIS.W_Rn_ADDR, SUB.W_X(Rn)_Rn, SUB.W_Rn_X(Rn), SUB_X(Rn)_X(Rn), MOV.B_@Rn_X(Rn), CMP.W_@Rn_X(Rn), ADD.W_X(Rn)_ADDR, CMP.W_Rn_X(Rn), BIS.W_@Rn_X(Rn), CMP.B_X(Rn)_X(Rn), RRC.W_Rn_, MOV.W_@Rn_&ADDR, CMP.W_#N_X(Rn), ADDC_X(Rn)_Rn, CMP.W_X(Rn)_Rn, BIS.W_X(Rn)_Rn, SUB_X(Rn)_Rn, MOV.B_X(Rn)_Rn, MOV.W_ADDR_&ADDR, AND.W_#N_Rn, RLA.W_Rn_, INV.W_X(Rn)_, XOR.W_#N_Rn, SUB.W_Rn_ADDR, BIC.W_#N_X(Rn), MOV.B_X(Rn)_ADDR, ADD_#N_X(Rn), SUB_Rn_X(Rn), MOV.B_&ADDR_Rn, MOV.W_Rn_&ADDR, ADD_X(Rn)_Rn, AND.W_X(Rn)_Rn, PUSHM.A_#N_Rn, RRAM.W_#N_Rn, AND.W_@Rn_X(Rn), BIS.B_Rn_X(Rn), SUB.W_@Rn_Rn, CLRC_, CMP.W_#N_ADDR, XOR.W_Rn_X(Rn), MOV.B_Rn_ADDR, CMP.B_X(Rn)_Rn, BIS.B_Rn_Rn, BIS.W_X(Rn)_ADDR, CMP.B_#N_X(Rn), CMP.W_Rn_ADDR, XOR.W_X(Rn)_Rn, MOV.B_Rn_X(Rn), ADD.B_#N_Rn
The trained machine learning model, tests, and local explanation code can be generated and found here: WORTEX Machine learning code
This work has received a French government support granted to the Labex CominLabs excellence laboratory and managed by the National Research Agency in the “Investing for the Future” program under reference ANR-10-LABX-07-01
Copyright 2024 Hector Chabot Copyright 2024 Abderaouf Nassim Amalou Copyright 2024 Hugo Reymond Copyright 2024 Isabelle Puaut
Licensed under the Creative Commons Attribution 4.0 International License
[1] Reymond, H., Amalou, A. N., Puaut, I. “WORTEX: Worst-Case Execution Time and Energy Estimation in Low-Power Microprocessors using Explainable ML” in 22nd International Workshop on Worst-Case Execution Time Analysis (WCET 2024)
[2] Da Silva, Anderson Faustino, et al. “Anghabench: A suite with one million compilable C benchmarks for code-size reduction.” 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 2021.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Racconnall is a dataset for object detection tasks - it contains F annotations for 1,726 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the open data repository to support and reproduce results in the paper "Land-free Bioenergy From Circular Agroecology -- A Diverse Option Space and Trade-offs." There are three types of files here:
1. Ready-to-use final results files of all strategies and scenarios referred to in the paper. They can be downloaded and used directly without running any code. They all have the same naming format for strategies/scenarios: Org = organic share, ConcRed = concentrate feeding reduction share, WasteRed = waste reduction share, and numbers refer to the share. E.g., Org0_ConcRed50_WasteRed75 is a strategy with 0% organic share, 50% concentrate feeding reduction, and 75% waste reduction.
   - NationalAncillaryBioenergyPotential_EJ.csv: the national potential of ancillary bioenergy in 2050 from all scenarios (units: EJ). Same in both pathways.
   - GlobalPotentialEnvironmentalImpacts_NutrientFirst.csv: environmental impacts of all scenarios from the pathway NutrientFirst. The first three rows refer to the combination of agroecological practices in place, which allows you to explore environmental impacts grouped by, e.g., different organic shares.
   - GlobalPotentialEnvironmentalImpacts_NegFirst.csv: same structure as the file above, but from the other pathway, NegativeFirst.
2. SOLmOutputs contains all original output files from our model SOLmV6.
3. DataCleaningKit has the Python code and an additional dataset of heat values to process the SOLmOutputs (2) and produce the final results files (1). (Tip: adjust input_path and output_path before running DataCleaning.py.)

Fei Wu (fei.wu@usys.ethz.ch), Delft, August 2023
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This is a preliminary release of a dataset supported by the National Institute on Aging and the National Institutes of Health. The full dataset is described in a submission to Data in Brief.
Social relationships change across the lifespan as social networks narrow and motivational priorities shift. These changes may affect, or reflect, differences in how older adults make decisions related to processing social and non-social rewards. While we have shown initial evidence that older adults have a blunted response to some features of social reward, further work in larger samples is needed to probe the extent to which age-related differences translate to real world consequences, such as financial exploitation. To address this gap, we are conducting a 5-year study funded by the National Institute on Aging (NIH R01-AG067011). Over the course of the funding period (2021-2026), this study seeks to: 1) characterize neural responses to social rewards across adulthood; 2) relate those responses to risk for financial exploitation and sociodemographic factors tied to risk; and 3) examine changes in risk for financial exploitation over time in healthy and vulnerable groups of older adults. This paper describes the preliminary release of data for the larger study. Adults (N=114; 40 male / 70 female / 4 other or non-binary; 21-80 years of age, M = 42.78, SD = 17.13) were recruited from the community to undergo multi-echo fMRI while completing tasks that measure brain function during social reward and decision-making. Tasks probe neural response to social reward (e.g., peer vs. monetary feedback) and social context and closeness (e.g., sharing a monetary reward with a friend compared to a stranger). Neural response to social decision-making is probed via economic trust and ultimatum games. Functional data are complemented by a T1-weighted anatomical scan and diffusion-weighted imaging (DWI) to enable tractography. This dataset has extensive potential for re-use, including leveraging multimodal neuroimaging data and within-subject measures of fMRI data from different tasks – data features that are rarely seen in an adult lifespan dataset.
We note that participants 10584, 10951, and 11005 are missing dwi. This is due to chiller malfunctions during the sequence that halted data collection. We also note that not all participants have two runs of each task. This was due to time constraints during the scan visits.
Code related to this dataset can be found on GitHub (https://github.com/DVS-Lab/SRPAL-DataInBrief/code/).
Original sourcedata for behavioral data is included in the sourcedata folder. Due to privacy restrictions, we cannot release original sourcedata for the imaging data (i.e., DICOM files).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Attention: This dataset is a summary and reformat pulled from github code.
You should make your own assumptions based on this. In fact, there is another dataset I formed through parsing that addresses several points:
- out of 500k python related items, most of them are python-ish, not pythonic
- the majority of the items here contain excessive licensing inclusion of original code
- the items here are sometimes not even python but have references
- there's a whole lot of gpl summaries…

See the full description on the dataset page: https://huggingface.co/datasets/jtatman/python-code-dataset-500k.
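If you still want to inspect it before deciding, a minimal sketch with the Hugging Face datasets library (the split name is assumed to be 'train'; column names are not verified here):

from datasets import load_dataset

ds = load_dataset("jtatman/python-code-dataset-500k", split="train")
print(ds)      # column names and number of rows
print(ds[0])   # inspect a first item before trusting the content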
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset and code needed to run the analyses for Study 3 highlighted in the article: Ziker, John P., Jerry Alan Fails, Kendall House, Jessi Boyer, Michael Wendell, Hollie Abele, Letizia Maukar, and Kayla Ramirez. 2025. “Parent–Child Adaptive Responses for Digital Resilience.” Social Sciences 14 (4): 1–24. https://doi.org/10.3390/socsci14040197. The dataset and code were originally made available here: https://github.com/johnziker/digitalResilienceofYouth
The following is the README of the original repository.
OpenGlue
=======================================
This is an implementation of the training, inference and evaluation scripts for OpenGlue under an open source license; our paper: OpenGlue: Open Source Graph Neural Net Based Pipeline for Image Matching.
SuperGlue is a method for learning feature matching using a graph neural network, proposed by a team (Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, Andrew Rabinovich) from Magic Leap. Official full paper: SuperGlue: Learning Feature Matching with Graph Neural Networks.
We present OpenGlue: a free open-source framework for image matching, that uses a Graph Neural Network-based matcher inspired by SuperGlue. We show that including additional geometrical information, such as local feature scale, orientation, and affine geometry, when available (e.g. for SIFT features), significantly improves the performance of the OpenGlue matcher. We study the influence of the various attention mechanisms on accuracy and speed. We also present a simple architectural improvement by combining local descriptors with context-aware descriptors.
This repo is based on PyTorch Lightning framework and enables user to train, predict and evaluate the model.
For local feature extraction, our interface supports Kornia detectors and descriptors along with our version of SuperPoint.
We provide an instruction on how to launch training on MegaDepth dataset and test the trained models on Image Matching Challenge.
This code is licensed under the MIT License. Modifications, distribution, commercial and academic uses are permitted. More information in LICENSE file.
1) Create folder MegaDepth, where your dataset will be stored.
mkdir MegaDepth && cd MegaDepth
2) Download and unzip MegaDepth_v1.tar.gz from official link.
You should now be able to see MegaDepth/phoenix directory.
3) We provide the lists of pairs for training and validation, link to download. Each line corresponds to one pair and has the following structure:
path_image_A path_image_B exif_rotationA exif_rotationB [KA_0 ... KA_8] [KB_0 ... KB_8] [T_AB_0 ... T_AB_15] overlap_AB
overlap_AB is the value of overlap between two images of the same scene; it shows how close (in position transformation) the two images are.
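For reference, a small illustrative parser for one such line (not part of the official repository), following the field layout above:

import numpy as np

def parse_pair_line(line: str):
    # path_A path_B rotA rotB KA(9) KB(9) T_AB(16) overlap -> 39 whitespace-separated tokens
    tokens = line.split()
    path_a, path_b = tokens[0], tokens[1]
    rot_a, rot_b = int(tokens[2]), int(tokens[3])
    K_a = np.array(tokens[4:13], dtype=np.float64).reshape(3, 3)
    K_b = np.array(tokens[13:22], dtype=np.float64).reshape(3, 3)
    T_ab = np.array(tokens[22:38], dtype=np.float64).reshape(4, 4)
    overlap = float(tokens[38])
    return path_a, path_b, rot_a, rot_b, K_a, K_b, T_ab, overlap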
The resulting directory structure should be as follows:
MegaDepth/
- pairs/
| - 0000/
| | - sparse-txt/
| | | pairs.txt
...
- phoenix/S6/zl548/MegaDepth_v1/
| -0000/
| | - dense0/
| | | - depths/
| | | | id.h5
...
| | | - images/
| | | | id.jpg
...
| | - dense1/
...
...
We also release the open-source weights for a pretrained OpenGlue on this dataset.
This repository is divided into several modules:
* config - configuration files with training hyperparameters
* data - preprocessing and dataset for MegaDepth
* examples - code and notebooks with examples of applications
* models - module with OpenGlue architecture and detector/descriptors methods
* utils - losses, metrics and additional training utils
For all necessary modules refer to requirements.txt
pip3 install -r requirements.txt
This code is compatible with:
* Python >= 3.6.9
* PyTorch >= 1.10.0
* PyTorch Lightning >= 1.4.9
* Kornia >= 0.6.1
* OpenCV >= 4.5.4
There are two options for feature extraction:
1) Extract features during training. No additional steps are required before launching training.
2) Extract and save features before training. We suggest using this approach, since training time is decreased immensely with pre-extracted features...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset corresponds to our extended analysis done for “The Secret Life of Hackathon Code Where does it come from and where does it go?” (https://doi.org/10.1109/MSR52588.2021.00020, pre-print at: https://arxiv.org/abs/2103.01145) and “Tracking Hackathon Code Creation and Reuse” (https://doi.org/10.1109/MSR52588.2021.00085, pre-print at: https://arxiv.org/pdf/2103.10167). The replication package, including the scripts used for generating this dataset from the “World of Code” (https://worldofcode.org/) dataset, is available on GitHub: https://github.com/woc-hack/track_hack.
The dataset contains the blob hashes used in the scope of the analysis and the analysis outcome.
The columns are as follows:
DevpostID: Devpost identification for the hackathon project and it can be used to get the URL for the devpost.com website. Example DevpostID -q9nd5 can be translated to https://devpost.com/software/-q9nd5
ProjectID: The Github project name
HackathonEndDate: Hackathon event end date
BlobHash: The blob hash used in the analysis
BeforeHackathon-DuringHackathon-AfterHackathon: This column represents whether the blob was first introduced before/during/after the hackathon (1: before, 2: during, 3: after)
SameAuthor-Contributor-OtherAuthor: This column represents whether the blob was first created by someone on the hackathon team, by someone who had previously contributed to a project that a hackathon team member also contributed to (contributor), or by someone else outside of the hackathon team (1: author is a hackathon team member, 2: author contributed before with a hackathon team member, 3: author is not related to the hackathon team).
UsedBySmallProject-UsedByMediumProject-UsedByLargeProject: This column represents whether the hackathon blob was reused after the hackathon event and the size of the project that reused the code (1: not reused, 3: reused in small project, 4: reused in medium project, 5: reused in large project)
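For convenience, an illustrative decoding of these integer codes into readable labels (the CSV file name is a placeholder):

import pandas as pd

df = pd.read_csv("hackathon_blobs.csv")   # placeholder file name

when_map = {1: "before hackathon", 2: "during hackathon", 3: "after hackathon"}
author_map = {1: "team member", 2: "prior contributor of a team member", 3: "outside the team"}
reuse_map = {1: "not reused", 3: "small project", 4: "medium project", 5: "large project"}

df["introduced"] = df["BeforeHackathon-DuringHackathon-AfterHackathon"].map(when_map)
df["author_relation"] = df["SameAuthor-Contributor-OtherAuthor"].map(author_map)
df["reuse"] = df["UsedBySmallProject-UsedByMediumProject-UsedByLargeProject"].map(reuse_map)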