100+ datasets found
  1. Large scale API Usage dataset

    • figshare.com
    bin
    Updated Jun 6, 2023
    Cite
    Anand Sawant (2023). Large scale API Usage dataset [Dataset]. http://doi.org/10.4121/uuid:cb751e3e-3034-44a1-b0c1-b23128927dd8
    Available download formats: bin
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Anand Sawant
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This is data collected from 50 APIs on their usage among over 200,000 GitHub consumers.

  2. Data from: ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Jan 27, 2022
    + more versions
    Cite
    Keshavarz, Hossein (2022). ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5907001
    Dataset updated
    Jan 27, 2022
    Dataset provided by
    Keshavarz, Hossein
    Nagappan, Meiyappan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper was submitted to the MSR 2022 Data Showcase Track.

    The datasets are available under directory dataset. There are 4 datasets in this directory.

    1. apachejit_total.csv: This file contains the entire dataset. Commits are specified by their identifier and a set of commit metrics that are explained in the paper are provided as features. Column buggy specifies whether or not the commit introduced any bug into the system.
    2. apachejit_train.csv: This file is a subset of the entire dataset. It provides a balanced set that we recommend for models that are sensitive to class imbalance. This set is obtained from the first 14 years of data (2003 to 2016).
    3. apachejit_test_large.csv: This file is a subset of the entire dataset. The commits in this file are the commits from the last 3 years of data. This set is not balanced to represent a real-life scenario in a JIT model evaluation where the model is trained on historical data to be applied on future data without any modification.
    4. apachejit_test_small.csv: This file is a subset of the test file explained above. Since the test file has more than 30,000 commits, we also provide a smaller test set which is still unbalanced and from the last 3 years of data.
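
    As a quick orientation, the splits above can be loaded directly with pandas. Below is a minimal sketch, assuming pandas and scikit-learn are installed; it treats every numeric column other than buggy as a commit-metric feature, and only the buggy column name is confirmed by this description.

    # Minimal sketch: load the ApacheJIT splits and fit a baseline JIT model.
    # Assumption: all numeric columns except 'buggy' are commit metrics.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    train = pd.read_csv("dataset/apachejit_train.csv")      # balanced, 2003-2016
    test = pd.read_csv("dataset/apachejit_test_large.csv")  # unbalanced, last 3 years

    feature_cols = [c for c in train.columns
                    if c != "buggy" and pd.api.types.is_numeric_dtype(train[c])]

    model = LogisticRegression(max_iter=1000)
    model.fit(train[feature_cols], train["buggy"])

    probs = model.predict_proba(test[feature_cols])[:, 1]
    print("ROC-AUC on the large test split:", roc_auc_score(test["buggy"], probs))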

    In addition to the dataset, we also provide the scripts using which we built the dataset. These scripts are written in Python 3.8. Therefore, Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11. For other languages, external tools are needed. Installation guide and more details can be found here.

    The scripts are comprised of Python scripts under directory src and Python notebooks under directory notebooks. The Python scripts are mainly responsible for conducting GitHub search via GitHub search API and collecting commits through PyDriller Package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates then are filtered again using gumtree.py script that utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.

    More specifically, git_token.py handles GitHub API token that is necessary for requests to GitHub API. Script collector.py performs GitHub search. Tracing changed lines and git annotate is done in gitminer.py using PyDriller. Finally, gumtree.py applies 4 filtering steps (number of lines, number of files, language, and change significance).
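
    For orientation, here is a minimal PyDriller sketch of the kind of commit traversal those scripts perform; the repository URL and the line counts computed here are illustrative, not the exact metrics produced by collector.py or gitminer.py.

    # Illustrative only: walk commits with PyDriller and count changed lines,
    # roughly the kind of per-commit metric extraction described above.
    from pydriller import Repository

    for commit in Repository("https://github.com/apache/commons-lang").traverse_commits():
        added = sum(m.added_lines for m in commit.modified_files)
        deleted = sum(m.deleted_lines for m in commit.modified_files)
        print(commit.hash[:10], len(commit.modified_files), added, deleted)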

    References:

    1. GumTree

    Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE '14, Västerås, Sweden, September 15-19, 2014. 313-324.

    2. PyDriller
    • https://pydriller.readthedocs.io/en/latest/

    • Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 908-911.

  3. CDC WONDER API for Data Query Web Service

    • catalog.data.gov
    • data.virginia.gov
    • +4 more
    Updated Jul 26, 2023
    + more versions
    Cite
    Centers for Disease Control and Prevention, Department of Health & Human Services (2023). CDC WONDER API for Data Query Web Service [Dataset]. https://catalog.data.gov/dataset/wide-ranging-online-data-for-epidemiologic-research-wonder
    Dataset updated
    Jul 26, 2023
    Description

    WONDER online databases include county-level Compressed Mortality (death certificates) since 1979; county-level Multiple Cause of Death (death certificates) since 1999; county-level Natality (birth certificates) since 1995; county-level Linked Birth / Death records (linked birth-death certificates) since 1995; state & large metro-level United States Cancer Statistics mortality (death certificates) since 1999; state & large metro-level United States Cancer Statistics incidence (cancer registry cases) since 1999; state and metro-level Online Tuberculosis Information System (TB case reports) since 1993; state-level Sexually Transmitted Disease Morbidity (case reports) since 1984; state-level Vaccine Adverse Event Reporting system (adverse reaction case reports) since 1990; county-level population estimates since 1970. The WONDER web server also hosts the Data2010 system with state-level data for compliance with Healthy People 2010 goals since 1998; the National Notifiable Disease Surveillance System weekly provisional case reports since 1996; the 122 Cities Mortality Reporting System weekly death reports since 1996; the Prevention Guidelines database (book in electronic format) published 1998; the Scientific Data Archives (public use data sets and documentation); and links to other online data sources on the "Topics" page.

  4. High value dataset via API

    • data.europa.eu
    json
    Updated Feb 3, 2025
    + more versions
    Cite
    Bolagsverket (2025). High value dataset via API [Dataset]. https://data.europa.eu/data/datasets/https-bolagsverket-se-vardefulla-datamangder-api-dataset?locale=en
    Available download formats: json
    Dataset updated
    Feb 3, 2025
    Dataset authored and provided by
    Bolagsverket (http://www.bolagsverket.se/)
    Description

    High value datasets are data that, according to the Council of the European Union, provide important benefits for society, the environment and the economy. A part of the datasets registered at Bolagsverket are also part of the High value datasets. The API for High-value datasets is free of charge and does not require any agreement. The API is for companies that want to retrieve information about companies and entrepreneurship directly into their systems. The information is retrieved from Bolagsverket and SCB, and you can use it in your own business or for developing services for your customers or other end-users. If you need more detailed datasets, you can use our API for business information.

  5. Dataset of Empirical Evidence of Large-Scale Diversity in API Usage of...

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Jan 24, 2020
    Cite
    Diego Mendez; Benoit Baudry; Martin Monperrus (2020). Dataset of Empirical Evidence of Large-Scale Diversity in API Usage of Object-Oriented Software [Dataset]. http://doi.org/10.5281/zenodo.1239616
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Diego Mendez; Benoit Baudry; Martin Monperrus
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset of Empirical Evidence of Large-Scale Diversity in API Usage of Object-Oriented Software, SCAM'13.

    https://www.monperrus.net/martin/companion-diversity-api-usages

  6. Coresignal | Company Data | Company API | Global / Largest Professional...

    • datarade.ai
    .json
    Updated Feb 21, 2024
    Cite
    Coresignal, Coresignal | Company Data | Company API | Global / Largest Professional Network / Filter & Retrieve / 86M+ Records [Dataset]. https://datarade.ai/data-products/database-api-coresignal
    Available download formats: .json
    Dataset updated
    Feb 21, 2024
    Dataset authored and provided by
    Coresignal
    Area covered
    Mali, Belize, Papua New Guinea, Bhutan, Bouvet Island, Dominican Republic, Ecuador, Mozambique, Isle of Man, Maldives
    Description

    Use Coresignal's Company API to explore and filter our extensive, regularly updated Companies dataset directly. Easily integrate this API into your workflow or use it to look up specific company records on demand. This tool is perfect for enhancing investing and lead generation efforts.

    Two ways to use Company API

    1. Search. Use specific parametric filters, such as location, industry, size, or specific keywords to narrow down your search and pull URL lists.

    2. Enrichment. Enrich your data using specific URLs or IDs to pull full records thanks to the 1:1 type matching.

  7. Api.Data.Gov Metrics API

    • catalog.data.gov
    • datasets.ai
    • +2 more
    Updated May 6, 2025
    Cite
    General Services Administration (2025). Api.Data.Gov Metrics API [Dataset]. https://catalog.data.gov/dataset/api-data-gov-metrics-api
    Dataset updated
    May 6, 2025
    Dataset provided by
    General Services Administration (http://www.gsa.gov/)
    Description

    api.data.gov is a free API management service for federal agencies. This API offers access to high-level metrics for the APIs that use the shared service, and it powers the api.data.gov metrics page.

  8. datasetfinetune

    • huggingface.co
    Updated Mar 17, 2022
    Cite
    Harshpreet Singh (2022). datasetfinetune [Dataset]. https://huggingface.co/datasets/Harshpreet-singh1/datasetfinetune
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Mar 17, 2022
    Authors
    Harshpreet Singh
    License

    https://choosealicense.com/licenses/other/

    Description

    GitHub Code Dataset

      Dataset Description
    

    The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling 1 TB of data. The dataset was created from the public GitHub dataset on Google BigQuery.

      How to use it
    

    The GitHub Code dataset is a very large dataset, so for most use cases it is recommended to make use of the streaming API of datasets. You can load and iterate through the dataset with the following… See the full description on the dataset page: https://huggingface.co/datasets/Harshpreet-singh1/datasetfinetune.
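
    For example, a minimal streaming load along the lines the description suggests (the repository id is taken from the citation above; the train split name is an assumption):

    # Stream the dataset instead of downloading ~1 TB up front.
    from datasets import load_dataset

    ds = load_dataset("Harshpreet-singh1/datasetfinetune", split="train", streaming=True)
    for i, example in enumerate(ds):
        print(sorted(example.keys()))  # inspect the available fields
        if i == 2:
            break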

  9. A large-scale COVID-19 Twitter chatter dataset

    • kaggle.com
    Updated Nov 16, 2023
    + more versions
    Cite
    TavoGLC (2023). A large-scale COVID-19 Twitter chatter dataset [Dataset]. https://www.kaggle.com/datasets/tavoglc/a-large-scale-covid-19-twitter-chatter-dataset
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 16, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    TavoGLC
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    A large-scale COVID-19 Twitter chatter dataset for open scientific research - an international collaboration

    Version 162 of the dataset. NOTES: Data for 3/15 - 3/18 was not extracted due to unexpected and unannounced downtime of our university infrastructure. We will try to backfill those days by the next release. FUTURE CHANGES: Due to the imminent paywalling of Twitter's API access, this might be the last full update of this dataset. If API access is not blocked, we will stop updating this dataset with release 165, a bit more than 3 years after our initial release. It's been a joy seeing all the work that uses this resource, and we are glad that so many found it useful.

    The dataset files full_dataset.tsv.gz and full_dataset_clean.tsv.gz have been split into 1 GB parts using the Linux utility split, so make sure to join the parts before unzipping. We had to make this change because we had huge issues uploading files larger than 2 GB (hence the delay in the dataset releases). The peer-reviewed publication for this dataset has now been published in Epidemiologia, an MDPI journal, and can be accessed here: https://doi.org/10.3390/epidemiologia2030024. Please cite this when using the dataset.

    Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started on March 11th, yielding over 4 million tweets a day. We have added additional data provided by our new collaborators from January 27th to March 27th to provide extra longitudinal coverage. Version 10 added ~1.5 million tweets in the Russian language collected between January 1st and May 8th, graciously provided to us by Katya Artemova (NRU HSE) and Elena Tutubalina (KFU). From version 12 we have included daily hashtags, mentions and emojis and their frequencies in the respective zip files. From version 14 we have included the tweet identifiers and their respective language for the clean version of the dataset. Since version 20 we have included language and place location for all tweets.

    The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (1,395,222,801 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (361,748,721 unique tweets). There are several practical reasons for us to leave the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the full_dataset-statistics.tsv and full_dataset-clean-statistics.tsv files. For more statistics and some visualizations visit: http://www.panacealab.org/covid19/

    More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter, along with our pre-print about the dataset (https://arxiv.org/abs/2004.03688).

    As always, the tweets distributed here are only tweet identifiers (with date and time added) because Twitter's terms and conditions allow re-distribution of Twitter data ONLY for research purposes. They need to be hydrated before they can be used.

  10. 311 Service and Information Requests

    • catalog.data.gov
    Updated Jun 23, 2025
    Cite
    City of Philadelphia (2025). 311 Service and Information Requests [Dataset]. https://catalog.data.gov/dataset/311-service-and-information-requests
    Dataset updated
    Jun 23, 2025
    Dataset provided by
    City of Philadelphia
    Description

    This represents all service and information requests since December 8th, 2014 submitted to Philly311 via the 311 mobile application, calls, walk-ins, emails, the 311 website or social media. Please note that this is a very large dataset. Unless you are comfortable working with APIs, we recommend using the visualization to explore the data. If you are comfortable with APIs, you can also use the API links to access this data. You can learn more about how to use the API at Carto’s SQL API site and in the Carto guide in the section on making calls to the API.
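
    A minimal sketch of pulling a few rows through the CARTO SQL API with Python; the phl.carto.com account and the public_cases_fc table name are assumptions, so substitute the endpoint and table given in the dataset's API links:

    # Query a small sample of 311 requests through the CARTO SQL API.
    # Account ("phl") and table ("public_cases_fc") are assumptions; adjust to the
    # values given on the dataset's API links.
    import requests

    SQL_ENDPOINT = "https://phl.carto.com/api/v2/sql"
    query = "SELECT * FROM public_cases_fc LIMIT 5"

    resp = requests.get(SQL_ENDPOINT, params={"q": query}, timeout=30)
    resp.raise_for_status()
    for row in resp.json().get("rows", []):
        print(row)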

  11. Employee Data | The Largest Dataset Of Active Profiles | Global / 1B Records...

    • datarade.ai
    .json
    Updated Apr 19, 2025
    Cite
    Avanteer (2025). Employee Data | The Largest Dataset Of Active Profiles | Global / 1B Records / Updated Daily [Dataset]. https://datarade.ai/data-products/employee-data-the-largest-dataset-of-active-profiles-glob-avanteer
    Available download formats: .json
    Dataset updated
    Apr 19, 2025
    Dataset authored and provided by
    Avanteer
    Area covered
    Fiji, Maldives, State of, Anguilla, Pitcairn, United Arab Emirates, Gambia, Nicaragua, Tunisia, Bulgaria
    Description

    //// 🌍 Avanteer Employee Data ////

    The Largest Dataset of Active Global Profiles 1B+ Records | Updated Daily | Built for Scale & Accuracy

    Avanteer’s Employee Data offers unparalleled access to the world’s most comprehensive dataset of active professional profiles. Designed for companies building data-driven products or workflows, this resource supports recruitment, lead generation, enrichment, and investment intelligence — with unmatched scale and update frequency.

    //// 🔧 What You Get ////

    1B+ active profiles across industries, roles, and geographies

    Work history, education history, languages, skills and multiple additional datapoints.

    AI-enriched datapoints include: gender, age, normalized seniority, normalized department, normalized skillset, MBTI assessment.

    Daily updates, with change-tracking fields to capture job changes, promotions, and new entries.

    Flexible delivery via API, S3, or flat file.

    Choice of formats: raw, cleaned, or AI-enriched.

    Built-in compliance aligned with GDPR and CCPA.

    //// 💡 Key Use Cases ////

    ✅ Smarter Talent Acquisition: Identify, enrich, and engage high-potential candidates using up-to-date global profiles.

    ✅ B2B Lead Generation at Scale: Build prospecting lists with confidence using job-related and firmographic filters to target decision-makers across verticals.

    ✅ Data Enrichment for SaaS & Platforms: Supercharge ATS, CRMs, or HR tech products by syncing enriched, structured employee data through real-time or batch delivery.

    ✅ Investor & Market Intelligence: Analyze team structures, hiring trends, and senior leadership signals to discover early-stage investment opportunities or evaluate portfolio companies.

    //// 🧰 Built for Top-Tier Teams Who Move Fast ////

    Zero duplicates, by design

    <300ms API response time

    99.99% guaranteed API uptime

    Onboarding support including data samples, test credits, and consultations

    Advanced data quality checks

    //// ✅ Why Companies Choose Avanteer ////

    ➡ The largest daily-updated dataset of global professional profiles

    ➡ Trusted by sales, HR, and data teams building at enterprise scale

    ➡ Transparent, compliant data collection with opt-out infrastructure baked in

    ➡ Dedicated support with fast onboarding and hands-on implementation help

    ////////////////////////////////

    Empower your team with reliable, current, and scalable employee data — all from a single source.

  12. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

    • zenodo.org
    csv
    Updated Sep 15, 2023
    + more versions
    Cite
    Anonymous authors; Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
    Available download formats: csv
    Dataset updated
    Sep 15, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors; Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

    The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is stored in .csv format.

    Each competition has a text description and metadata reflecting the competition, the dataset used, and the evaluation metrics (competitions.csv). The corresponding datasets can be loaded using the Kaggle API and data sources.

    The code blocks themselves and their metadata are collected into data frames according to the publishing year of the initial kernels. The current version of the corpus includes two code block files: snippets from kernels up to 2020 (code_blocks_upto_20.csv) and those from 2021 (code_blocks_21.csv), with corresponding metadata. The corpus consists of 2,743,615 ML code blocks collected from 107,524 Jupyter notebooks.

    Marked-up code blocks have the following metadata: anonymized id, the format of the used data (for example, table or audio), the id of the semantic type, a flag for code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12,000 labeled snippets (markup_data_20220415.csv).

    Since the marked-up code block data contains the numeric id of each code block's semantic type, we also provide a mapping from this number to the semantic type and subclass (actual_graph_2022-06-01.csv).
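
    A minimal sketch of attaching that mapping to the labeled snippets with pandas; the join-column name is an assumption, so check the actual CSV headers:

    # Attach human-readable semantic classes to the labeled snippets.
    # The join column name ("graph_vertex_id") is an assumption; adjust to the
    # actual headers in markup_data_20220415.csv and actual_graph_2022-06-01.csv.
    import pandas as pd

    markup = pd.read_csv("markup_data_20220415.csv")
    mapping = pd.read_csv("actual_graph_2022-06-01.csv")

    labeled = markup.merge(mapping, on="graph_vertex_id", how="left")
    print(labeled.head())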

    The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.

  13. codeparrot-java-all

    • huggingface.co
    Updated Mar 17, 2022
    + more versions
    Cite
    Aditya Goswami (2022). codeparrot-java-all [Dataset]. https://huggingface.co/datasets/Aditya78b/codeparrot-java-all
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Mar 17, 2022
    Authors
    Aditya Goswami
    License

    https://choosealicense.com/licenses/other/

    Description

    GitHub Code Dataset

      Dataset Description
    

    The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling 1 TB of data. The dataset was created from the public GitHub dataset on Google BigQuery.

      How to use it
    

    The GitHub Code dataset is a very large dataset, so for most use cases it is recommended to make use of the streaming API of datasets. You can load and iterate through the dataset with the… See the full description on the dataset page: https://huggingface.co/datasets/Aditya78b/codeparrot-java-all.

  14. ‘Former large stores indices: overall and by groups. ICM (API identifier:...

    • analyst-2.ai
    Updated Aug 5, 2020
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘Former large stores indices: overall and by groups. ICM (API identifier: 2240)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/data-europa-eu-former-large-stores-indices-overall-and-by-groups-icm-api-identifier-2240-15f8/28482afe/?iid=001-327&v=presentation
    Dataset updated
    Aug 5, 2020
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Former large stores indices: overall and by groups. ICM (API identifier: 2240)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from http://data.europa.eu/88u/dataset/urn-ine-es-tabla-t3-5-2240 on 07 January 2022.

    --- Dataset description provided by original source is as follows ---

    Table of INEBase Former large stores indices: overall and by groups. Monthly. National. Retail Trade Indices

    --- Original source retains full ownership of the source dataset ---

  15. LAS&T: Large Shape And Texture Dataset

    • zenodo.org
    jpeg, zip
    Updated May 26, 2025
    Cite
    Sagi Eppel (2025). LAS&T: Large Shape And Texture Dataset [Dataset]. http://doi.org/10.5281/zenodo.15453634
    Available download formats: jpeg, zip
    Dataset updated
    May 26, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sagi Eppel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Large Shape And Texture Dataset (LAS&T)

    LAS&T is the largest and most diverse dataset for shape, texture and material recognition and retrieval in 2D and 3D with 650,000 images, based on real world shapes and textures.

    Overview

    The LAS&T Dataset aims to test the most basic aspect of vision in the most general way: the ability to identify any shape, texture, and material in any setting and environment, without being limited to specific types or classes of objects, materials, and environments. For shapes, this means identifying and retrieving any shape in 2D or 3D with every element of the shape changed between images, including the shape's material and texture, orientation, size, and environment. For textures and materials, the goal is to recognize the same texture or material when it appears on different objects, in different environments, and under different light conditions. The dataset relies on shapes, textures, and materials extracted from real-world images, leading to an almost unlimited quantity and diversity of real-world natural patterns. Each section of the dataset (shapes and textures) contains a 3D part that relies on physics-based scenes with realistic light, material, and object simulation, and an abstract 2D part. In addition, there is a real-world benchmark for 3D shapes.

    Main Dataset webpage

    The dataset contains four parts:

    3D shape recognition and retrieval.

    2D shape recognition and retrieval.

    3D Materials recognition and retrieval.

    2D Texture recognition and retrieval.

    Each can be used independently for training and testing.

    Additional assets are a set of 350,000 natural 2D shapes extracted from real-world images (SHAPES_COLLECTION_350k.zip)

    3D shape recognition real-world images benchmark

    The scripts used to generate and test the dataset are supplied in the SCRIPTS* files.

    Shapes Recognition and Retrieval:

    For shape recognition, the goal is to identify the same shape in different images, where the material/texture/color of the shape is changed, the shape is rotated, and the background is replaced. Hence, only the shape remains the same in both images. All files with 3D shapes contain samples of the 3D shape dataset; this is tested for 3D shapes/objects with realistic light simulation. All files with 2D shapes contain samples of the 2D shape dataset. Example files contain images with examples for each set.

    Main files:

    Real_Images_3D_shape_matching_Benchmarks.zip contains real-world image benchmarks for 3D shapes.

    3D_Shape_Recognition_Synthethic_GENERAL_LARGE_SET_76k.zip: a large number of synthetic examples of 3D shapes with maximal variability; can be used for training/testing 3D shape/object recognition/retrieval.

    2D_Shapes_Recognition_Textured_Synthetic_Resize2_GENERAL_LARGE_SET_61k.zip: a large number of synthetic examples of 2D shapes with maximal variability; can be used for training/testing 2D shape recognition/retrieval.

    SHAPES_2D_365k.zip: 365,000 2D shapes extracted from real-world images, saved as black-and-white .png image files.

    File structure:

    All jpg images that are in the exact same subfolder contain the exact same shape (but with different texture/color/background/orientation).
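
    Given that convention, a minimal sketch of grouping images by subfolder to build positive retrieval pairs; the unzipped directory name is taken from the archive above, and any layout detail beyond "same subfolder = same shape" is an assumption:

    # Build positive pairs for shape retrieval: every pair of images sharing a
    # subfolder depicts the same shape under different texture/background/orientation.
    import itertools
    from pathlib import Path

    # Assumed name of the unzipped archive directory.
    root = Path("3D_Shape_Recognition_Synthethic_GENERAL_LARGE_SET_76k")

    positive_pairs = []
    for subfolder in sorted(p for p in root.rglob("*") if p.is_dir()):
        images = sorted(subfolder.glob("*.jpg"))
        positive_pairs.extend(itertools.combinations(images, 2))

    print(len(positive_pairs), "same-shape image pairs")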

    Textures and Materials Recognition and Retrieval

    For textures and materials, the goal is to identify and match images containing the same material or texture; however, the shape/object on which the material or texture is applied is different, and so are the background and light.

    This is done for physics-based materials in 3D and abstract 2D textures.

    3D_Materials_PBR_Synthetic_GENERAL_LARGE_SET_80K.zip: a large number of examples of 3D materials in physics-grounded scenes; can be used for training or testing of material recognition/retrieval.

    2D_Textures_Recogition_GENERAL_LARGE_SET_Synthetic_53K.zip: a large number of images of 2D textures in maximally varied settings; can be used for training/testing 2D texture recognition/retrieval.

    File structure:

    All jpg images that are in the exact same subfolder contain the exact same texture/material (but overlaid on different objects with different background/illumination/orientation).

    Data Generation:

    The images in the synthetic part of the dataset were created by automatically extracting shapes and textures from natural images and combining them into synthetic images. This produced synthetic images that rely completely on real-world patterns, yielding extremely diverse and complex shapes and textures. As far as we know, this is the largest and most diverse shape and texture recognition/retrieval dataset. The 3D data was generated using physics-based materials and rendering (Blender), making the images physically grounded and enabling use of the data to train for real-world examples. The scripts for generating the data are supplied in files with the word SCRIPTS* in them.

    Real-world image data:

    For 3D shape recognition and retrieval, we also supply a real-world natural image benchmark. With a variety of natural images containing the exact same 3D shape but made/coated with different materials and in different environments and orientations. The goal is again to identify the same shape in different images. The benchmark is available at: Real_Images_3D_shape_matching_Benchmarks.zip

    File structure:

    Files containing the word 'GENERAL_LARGE_SET' contain synthetic images that can be used for training or testing; the type of data (2D shapes, 3D shapes, 2D textures, 3D materials) appears in the file name, as well as the number of images. Files containing 'MultiTests' contain a number of different tests in which only a single aspect of the instance is changed (for example, only the background). Files containing 'SCRIPTS' contain data generation and testing scripts. Images containing 'examples' are examples for each test.

    Shapes Collections

    The file SHAPES_COLLECTION_350k.zip contains 350,000 2D shapes extracted from natural images and used for the dataset generation.

    Evaluating and Testing

    For evaluating and testing, see SCRIPTS_Testing_LVLM_ON_LAST_VQA.zip.
    This can be used to test leading LVLMs via API, create human tests, and in general turn the dataset into multiple-choice question images similar to the ones in the paper.

  16. Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter...

    • zenodo.org
    application/gzip
    Updated Mar 16, 2021
    + more versions
    Cite
    João Felipe; Leonardo; Vanessa; Juliana (2021). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks / Understanding and Improving the Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.3519618
    Available download formats: application/gzip
    Dataset updated
    Mar 16, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    João Felipe; Leonardo; Vanessa; Juliana
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub. Based on the results, we proposed and evaluated Julynter, a linting tool for Jupyter Notebooks.

    Papers:

    This repository contains three files:

    Reproducing the Notebook Study

    The db2020-09-22.dump.gz file contains a PostgreSQL dump of the database, with all the data we extracted from notebooks. For loading it, run:

    gunzip -c db2020-09-22.dump.gz | psql jupyter

    Note that this file contains only the database with the extracted data. The actual repositories are available in a Google Drive folder, which also contains the Docker images we used in the reproducibility study. The repositories are stored as content/{hash_dir1}/{hash_dir2}.tar.bz2, where hash_dir1 and hash_dir2 are columns of repositories in the database.
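
    A minimal sketch of reconstructing those tarball paths from the database, assuming the dump has been loaded into a local jupyter database as above and that psycopg2 is installed; the connection parameters here are assumptions:

    # Map rows of the 'repositories' table to their tarball paths in the Google
    # Drive 'content' folder, i.e. content/{hash_dir1}/{hash_dir2}.tar.bz2.
    # Connection parameters are assumptions; adjust to your local PostgreSQL setup.
    import psycopg2

    conn = psycopg2.connect(dbname="jupyter", user="postgres", host="localhost")
    with conn, conn.cursor() as cur:
        cur.execute("SELECT hash_dir1, hash_dir2 FROM repositories LIMIT 10;")
        for hash_dir1, hash_dir2 in cur.fetchall():
            print(f"content/{hash_dir1}/{hash_dir2}.tar.bz2")
    conn.close()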

    For scripts, notebooks, and detailed instructions on how to analyze or reproduce the data collection, please check the instructions on the Jupyter Archaeology repository (tag 1.0.0)

    The sample.tar.gz file contains the repositories obtained during the manual sampling.

    Reproducing the Julynter Experiment

    The julynter_reproducibility.tar.gz file contains all the data collected in the Julynter experiment and the analysis notebooks. Reproducing the analysis is straightforward:

    • Uncompress the file: $ tar zxvf julynter_reproducibility.tar.gz
    • Install the dependencies: $ pip install -r julynter/requirements.txt
    • Run the notebooks in order: J1.Data.Collection.ipynb; J2.Recommendations.ipynb; J3.Usability.ipynb.

    The collected data is stored in the julynter/data folder.

    Changelog

    2019/01/14 - Version 1 - Initial version
    2019/01/22 - Version 2 - Update N8.Execution.ipynb to calculate the rate of failure for each reason
    2019/03/13 - Version 3 - Update package for camera ready. Add columns to db to detect duplicates, change notebooks to consider them, and add N1.Skip.Notebook.ipynb and N11.Repository.With.Notebook.Restriction.ipynb.
    2021/03/15 - Version 4 - Add Julynter experiment; Update database dump to include new data collected for the second paper; remove scripts and analysis notebooks from this package (moved to GitHub), add a link to Google Drive with collected repository files

  17. Licenses and Inspections: Case Investigations

    • catalog.data.gov
    Updated Jun 23, 2025
    Cite
    City of Philadelphia (2025). Licenses and Inspections: Case Investigations [Dataset]. https://catalog.data.gov/dataset/licenses-and-inspections-case-investigations
    Dataset updated
    Jun 23, 2025
    Dataset provided by
    City of Philadelphia
    Description

    All investigations completed on a property with property maintenance violations by an inspector of the Department of Licenses & Inspections. Please note that this is a very large dataset. To see all investigations, download all datasets for all years. If you are comfortable with APIs, you can also use the API links to access this data. You can learn more about how to use the API at Carto’s SQL API site and in the Carto guide in the section on making calls to the API.

  18. Data journals and data papers in the humanities

    • kcl.figshare.com
    txt
    Updated Jul 21, 2022
    Cite
    Barbara McGillivray; Marongiu, Paola; Nilo Pedrazzini; Marton Ribary; Eleonora Zordan (2022). Data journals and data papers in the humanities [Dataset]. http://doi.org/10.18742/19935014.v1
    Available download formats: txt
    Dataset updated
    Jul 21, 2022
    Dataset provided by
    King's College London
    Authors
    Barbara McGillivray; Marongiu, Paola; Nilo Pedrazzini; Marton Ribary; Eleonora Zordan
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This collection contains five sets of datasets: 1) Publication counts from two multidisciplinary humanities data journals: the Journal of Open Humanities Data and Research Data in the Humanities and Social Sciences (RDJ_JOHD_Publications.csv); 2) A large dataset about the performance of research articles in HSS exported from dimensions.ai (allhumss_dims_res_papers_PUB_ID.csv); 3) A large dataset about the performance of datasets in HSS harvested from the Zenodo REST API (Zenodo.zip); 4) Impact and usage metrics from the papers published in the two journals above (final_outputs.zip); 5) Data from Twitter analytics on tweets from the @up_johd account, with paper DOI and engagement rate (twitter-data.zip).

    Please note that, as requested by the Dimensions team, for 2 and 4, we only included the Publication IDs from Dimensions rather than the full data. Interested parties only need the Dimensions publications IDs to retrieve the data; even if they have no Dimensions subscription, they can easily get a no-cost agreement with Dimensions, for research purposes, in order to retrieve the data.
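
    For context, a minimal sketch of the kind of Zenodo REST API harvest behind item 3; the query string and page size here are illustrative, not the criteria used to build Zenodo.zip:

    # Fetch one page of Zenodo records matching a query and print basic metadata.
    # The search query is illustrative; the harvested dataset used its own criteria.
    import requests

    resp = requests.get(
        "https://zenodo.org/api/records",
        params={"q": "humanities", "size": 5, "page": 1},
        timeout=30,
    )
    resp.raise_for_status()
    for hit in resp.json()["hits"]["hits"]:
        meta = hit.get("metadata", {})
        print(hit.get("doi"), "-", meta.get("title"))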

  19. 88.6 Million Developer Comments from GitHub

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Jan 4, 2024
    Cite
    Benjamin S. Meyers; Andrew Meneely (2024). 88.6 Million Developer Comments from GitHub [Dataset]. http://doi.org/10.5281/zenodo.5603093
    Available download formats: zip, bin
    Dataset updated
    Jan 4, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Benjamin S. Meyers; Andrew Meneely
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description

    This is a collection of developer comments from GitHub issues, commits, and pull requests. We collected 88,640,237 developer comments from 17,378 repositories. In total, this dataset includes:

    • 54,252,380 issue comments (from 13,458,208 issues)
    • 979,642 commit comments (from 49,710,108 commits)
    • 33,408,215 pull request comments (from 12,680,373 pull requests)

    Warning: The uploaded dataset is compressed from 185GB down to 25.1GB.

    Purpose

    The purpose of this dataset (corpus) is to provide a large dataset of software developer comments (natural language) for research. We intend to use this data in our own research, but we hope it will be helpful for other researchers.

    Collection Process

    Full implementation details can be found in the following publication:

    Data was downloaded using GitHub's GraphQL API via requests made with Python's requests library. We targeted 17,491 repositories with the following criteria:

    • At least 850 stars.
    • Primary language in the Top 50 from the TIOBE Index and/or listed as "popular" in GitHub's advanced search. Note that we collected the list of languages on August 31, 2021.

    Due to design decisions made by GitHub, we could only get a list of at most 1,000 repositories for each target language. Comments from 113 repositories could not be downloaded for various reasons (failing API queries, JSONDecoderErrors, etc.). Eight target languages had no repositories matching the above criteria.

    After collection using the GraphQL API, data was written to CSV using Python's csv.writer class. We highly recommend using Python's csv.reader to parse these CSV files as no newlines have been removed from developer comments.
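
    A minimal sketch of reading one of these CSVs as recommended; the example file name follows the naming scheme described below and is hypothetical, and the column layout comes from the header row and the README.md rather than being assumed here:

    # Parse a comment CSV with csv.reader so embedded newlines inside comment
    # bodies are handled correctly. "Python_is.csv" is a hypothetical file name
    # (after extracting 88_million_developer_comments.zip); see the naming scheme below.
    import csv

    with open("Python_is.csv", newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        print(header)
        for i, row in enumerate(reader):
            print(row[:2])  # peek at the first couple of fields
            if i == 4:
                break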

    88_million_developer_comments.zip

    This zip file contains 135 CSV files; 3 per language. CSV names are formatted <language>_<type>, with <language> being the name of the primary language and <type> being one of co (commits), is (issues), or pr (pull requests).

    Languages included are: ABAP, Assembly, C, C# (C-Sharp), C++ (C-PlusPlus), Clojure, COBOL, CoffeeScript, CSS, Dart, D, DM, Elixir, Fortran, F# (F-Sharp), Go, Groovy, HTML, Java, JavaScript, Julia, Kotlin, Lisp, Lua, MATLAB, Nim, Objective-C, Pascal, Perl, PHP, PowerShell, Prolog, Python, R, Ruby, Rust, Scala, Scheme, Scratch, Shell, Swift, TSQL, TypeScript, VBScript, and VHDL.

    Details on the columns in each CSV file are described in the provided README.md.

    Detailed_Breakdown.ods

    This spreadsheet contains specific details on how many repositories, commits, issues, pull requests, and comments are included in 88_million_developer_comments.zip.

    Note On Completeness

    We make no guarantee that every commit, issue, and/or pull request for each repository is included in this dataset. Due to the nature of the GraphQL API and data decoding difficulties, sometimes a query failed and that data is not included here.

    Versioning

    • v1.1: The original corpus had duplicate header rows in the CSV files. This has been fixed.
    • v1.0: Original corpus.

    Contact

    Please contact Benjamin S. Meyers (email) with questions about this data and its collection.

    Acknowledgments

    • Collection of this data has been sponsored in part by the National Science Foundation grant 1922169, and by a Department of Defense DARPA SBIR program (grant 140D63-19-C-0018).
    • This data was collected using the compute resources from the Research Computing department at the Rochester Institute of Technology. doi:10.34788/0S3G-QD15

  20. Licenses and Inspections Code Violations

    • catalog.data.gov
    Updated Jun 23, 2025
    Cite
    City of Philadelphia (2025). Licenses and Inspections Code Violations [Dataset]. https://catalog.data.gov/dataset/licenses-and-inspections-code-violations
    Dataset updated
    Jun 23, 2025
    Dataset provided by
    City of Philadelphia
    Description

    Violations issued by the Department of Licenses and Inspections in reference to the Philadelphia Building Construction and Occupancy Code. Please note that L&I Violations is a very large dataset. To see all violations, download all datasets for all years. If you are comfortable with APIs, you can also use the API links to access this data. You can learn more about how to use the API at Carto’s SQL API site and in the Carto guide in the section on making calls to the API.
