10 datasets found
  1. GitHub Repos

    • kaggle.com
    zip
    Updated Mar 20, 2019
    + more versions
    Cite
    Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    GitHub (https://github.com/)
    Authors
    Github
    Description

    GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

    This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that the methods available in Kernels are limited to querying data. Tables are located at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and learn how to safely analyze large BigQuery datasets.
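    As a concrete sketch, the query itself is plain SQL. The helper below builds a query against the licenses table (one of the dataset's tables), and the commented lines show how it would be run with the google-cloud-bigquery client in a Kernel; the helper name and the exact result handling are illustrative, not part of the dataset.

```python
# Minimal sketch: build a SQL query against one of the dataset's tables.
# The licenses table exists in bigquery-public-data.github_repos; the helper
# function below is illustrative, not part of the dataset or Kaggle's API.

def license_count_query(limit: int = 10) -> str:
    """Return SQL counting repositories per license, most common first."""
    return f"""
        SELECT license, COUNT(*) AS repo_count
        FROM `bigquery-public-data.github_repos.licenses`
        GROUP BY license
        ORDER BY repo_count DESC
        LIMIT {limit}
    """

# In a Kernel with credentials configured, this would run as:
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   for row in client.query(license_count_query()).result():
#       print(row.license, row.repo_count)
```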

    Acknowledgements

    This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

    Inspiration

    • This is the perfect dataset for fighting language wars.
    • Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?
  2. Medical Expenditure Panel Survey (MEPS) Query Tool - k85c-yfyp - Archive...

    • healthdata.gov
    application/rdfxml +5
    Updated Jun 2, 2022
    Cite
    (2022). Medical Expenditure Panel Survey (MEPS) Query Tool - k85c-yfyp - Archive Repository [Dataset]. https://healthdata.gov/dataset/Medical-Expenditure-Panel-Survey-MEPS-Query-Tool-k/vzpv-jw8g
    Explore at:
    Available download formats: tsv, csv, application/rdfxml, xml, application/rssxml, json
    Dataset updated
    Jun 2, 2022
    Description

    This dataset tracks the updates made on the dataset "Medical Expenditure Panel Survey (MEPS) Query Tool" as a repository for previous versions of the data and metadata.

  3. Data from: TerraDS: A Dataset for Terraform HCL Programs

    • zenodo.org
    application/gzip, bin
    Updated Nov 27, 2024
    Cite
    Christoph Bühler; Christoph Bühler; David Spielmann; David Spielmann; Roland Meier; Roland Meier; Guido Salvaneschi; Guido Salvaneschi (2024). TerraDS: A Dataset for Terraform HCL Programs [Dataset]. http://doi.org/10.5281/zenodo.14217386
    Explore at:
    Available download formats: application/gzip, bin
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christoph Bühler; Christoph Bühler; David Spielmann; David Spielmann; Roland Meier; Roland Meier; Guido Salvaneschi; Guido Salvaneschi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    TerraDS

    The TerraDS dataset provides a comprehensive collection of Terraform programs written in the HashiCorp Configuration Language (HCL). As Infrastructure as Code (IaC) gains popularity for managing cloud infrastructure, Terraform has become one of the leading tools due to its declarative nature and widespread adoption. However, a lack of publicly available, large-scale datasets has hindered systematic research on Terraform practices. TerraDS addresses this gap by compiling metadata and source code from 62,406 open-source repositories with valid licenses. This dataset aims to foster research on best practices, vulnerabilities, and improvements in IaC methodologies.

    Structure of the Database

    The TerraDS dataset is organized into two main components: a SQLite database containing metadata and an archive of source code (~335 MB). The metadata, captured in a structured format, includes information about repositories, modules, and resources:

    1. Repository Data:

    • Contains 62,406 repositories with fields such as repository name, creation date, star count, and permissive license details.
    • Provides cloneable URLs for access and analysis.
    • Tracks additional metrics like repository size and the latest commit details.

    2. Module Data:

    • Consists of 279,344 modules identified within the repositories.
    • Each module includes its relative path, referenced providers, and external module calls stored as JSON objects.

    3. Resource Data:

    • Encompasses 1,773,991 resources, split into managed (1,484,185) and data (289,806) resources.
    • Each resource entry details its type, provider, and whether it is managed or read-only.
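    As a sketch of working with the metadata, the SQLite database can be queried directly with Python's built-in sqlite3 module. The table and column names below (repository, name, star_count) are assumptions for illustration only; check the actual schema (e.g. with .schema in the sqlite3 shell) before querying.

```python
# Hypothetical sketch of querying the TerraDS metadata database; the table
# and column names (repository, name, star_count) are assumed for
# illustration and should be checked against the real schema.
import sqlite3

def top_starred_repos(db_path: str, n: int = 5) -> list:
    """Return the n repositories with the highest star count."""
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            "SELECT name, star_count FROM repository "
            "ORDER BY star_count DESC LIMIT ?",
            (n,),
        ).fetchall()
    finally:
        con.close()
```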

    Structure of the Archive

    The provided archive contains the source code of the 62,406 repositories, allowing further analysis based on the actual source rather than the metadata alone. Researchers can thus access the permissively licensed repositories and conduct studies on the executable HCL code.

    Tools

    The "HCL Dataset Tools" file contains a snapshot of the https://github.com/prg-grp/hcl-dataset-tools repository, kept for long-term archival. The tools in this repository can be used to reproduce this dataset.

    One of the tools, "RepositorySearcher", can be used to fetch metadata for various other GitHub API queries, not only Terraform code. While the RepositorySearcher supports other kinds of repository search, the remaining tools are focused on Terraform repositories.

  4. Human Cell and Tissue Establishment Registration Public Query - pj5w-zcqt -...

    • healthdata.gov
    application/rdfxml +5
    Updated Jul 16, 2025
    Cite
    (2025). Human Cell and Tissue Establishment Registration Public Query - pj5w-zcqt - Archive Repository [Dataset]. https://healthdata.gov/dataset/Human-Cell-and-Tissue-Establishment-Registration-P/4qd7-mv3m
    Explore at:
    Available download formats: csv, xml, application/rssxml, tsv, application/rdfxml, json
    Dataset updated
    Jul 16, 2025
    Description

    This dataset tracks the updates made on the dataset "Human Cell and Tissue Establishment Registration Public Query" as a repository for previous versions of the data and metadata.

  5. Energy System Time Series Suite (ESTSS) - Data Archive

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv
    Updated Jan 22, 2024
    Cite
    Sebastian Günther; Sebastian Günther (2024). Energy System Time Series Suite (ESTSS) - Data Archive [Dataset]. http://doi.org/10.5281/zenodo.10213145
    Explore at:
    Available download formats: csv, bin
    Dataset updated
    Jan 22, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sebastian Günther; Sebastian Günther
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Energy System Time Series Suite - Data Archive

    This archive contains variously sized sets of declustered time series within the context of energy systems. These series demonstrate low discrepancy and high heterogeneity in feature space, resulting in a roughly uniform distribution within this space.

    For detailed information, please refer to the corresponding GitHub project:
    https://github.com/s-guenther/estss/

    For associated research, see
    https://doi.org/10.1186/s42162-024-00304-8

    Data is provided in .csv format. The GitHub project includes a Python function to load this data as a dictionary of pandas data frames.

    Should you utilize this data, kindly also cite the associated research paper. For any queries, please feel free to reach out to us through GitHub or the contact details provided at the end of this readme file.

    Folder Content

    • `ts_*.csv`: Contains declustered load profile time series in tabular format.
      • Size: `(n+1) x (m+1)`, with `n` representing time steps (1000 per series) and `m` the number of series.
      • Includes a header row and index column. Headers indicate series id, and the index column numbers each time step, starting from `0`.
      • The first half of the series `(m/2)` have a constant (negative) sign. They are numbered sequentially from `0`.
      • The second half `(m/2)` have varying signs. Numbering starts from `1,000,000`.
    • `features_*.csv`: Tabulates features corresponding to the time series.
      • Size: `(m+1) x (f+1)`, where `m` is the number of time series and `f` is the number of features
      • Includes a header row and index column. Indexes represent time series id (matching `ts_*.csv` headers), and headers name the features.
    • `norm_space_*.csv`: Shows feature vectors in normalized feature space where time series are declustered. Provided for completeness; typically not needed by users.
      • Size: `(m+1) x (g+1)`, where `m` is the number of time series and `g` is the number of selected feature-space features (a subset of `f` from `features_*.csv`).
      • Format matches `features_*.csv`.
    • `info_*.csv`: Maps declustered datasets to the manifolded dataset. Provided for completeness; typically not needed by users.
      • Size: `(m+1) x 2`, with `m` as series count. Columns contain manifolded set time series ids.
      • Includes an index column and a header. The index holds the remapped id of declustered series. Header `0` is non-significant.

    Each `ts_*.csv`, `features_*.csv`, `norm_space_*.csv`, and `info_*.csv` file comes in four versions to accommodate various set sizes:

    • `*_4096.csv`
    • `*_1024.csv`
    • `*_256.csv`
    • `*_64.csv`

    These represent sets with 4096, 1024, 256, and 64 time series, respectively, offering different densities in feature-space population. The objective is to balance computational load and resolution for individual research needs.
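    Given the layout described above (a header row of series ids and an index column of time steps), one of these files can be loaded with pandas roughly as follows. The estss GitHub project ships its own loader, so this is only a minimal sketch.

```python
# Minimal sketch of loading a ts_*.csv file: the first column is the
# time-step index and the header row holds the series ids. The estss
# project provides its own loader; this is only an illustration.
import pandas as pd

def load_ts(path: str) -> pd.DataFrame:
    """Load a time-series CSV with series ids as columns, time steps as index."""
    return pd.read_csv(path, index_col=0)
```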

    Contact

    ESTSS - Energy System Time Series Suite
    Copyright (C) 2023
    Sebastian Günther
    sebastian.guenther@ifes.uni-hannover.de

    Leibniz Universität Hannover
    Institut für Elektrische Energiesysteme
    Fachgebiet für Elektrische Energiespeichersysteme

    Leibniz University Hannover
    Institute of Electric Power Systems
    Electric Energy Storage Systems Section

    https://www.ifes.uni-hannover.de/ees.html

  6. Web-based Injury Statistics Query and Reporting System (WISQARS) - 82ty-fydp...

    • healthdata.gov
    application/rdfxml +5
    Updated Jul 26, 2023
    Cite
    (2023). Web-based Injury Statistics Query and Reporting System (WISQARS) - 82ty-fydp - Archive Repository [Dataset]. https://healthdata.gov/dataset/Web-based-Injury-Statistics-Query-and-Reporting-Sy/d2ht-timm
    Explore at:
    Available download formats: csv, xml, application/rdfxml, json, tsv, application/rssxml
    Dataset updated
    Jul 26, 2023
    Description

    This dataset tracks the updates made on the dataset "Web-based Injury Statistics Query and Reporting System (WISQARS)" as a repository for previous versions of the data and metadata.

  7. SQLite Sakila Sample Database

    • kaggle.com
    Updated Mar 14, 2021
    Cite
    Atanas Kanev (2021). SQLite Sakila Sample Database [Dataset]. https://www.kaggle.com/atanaskanev/sqlite-sakila-sample-database/tasks
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more about it at mlcommons.org/croissant.
    Dataset updated
    Mar 14, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Atanas Kanev
    Description

    SQLite Sakila Sample Database

    Database Description

    The Sakila sample database is a fictitious database designed to represent a DVD rental store. The tables of the database include film, film_category, actor, customer, rental, payment and inventory among others. The Sakila sample database is intended to provide a standard schema that can be used for examples in books, tutorials, articles, samples, and so forth. Detailed information about the database can be found on the MySQL website: https://dev.mysql.com/doc/sakila/en/

    Sakila for SQLite is a part of the sakila-sample-database-ports project intended to provide ported versions of the original MySQL database for other database systems, including:

    • Oracle
    • SQL Server
    • SQLite
    • Interbase/Firebird
    • Microsoft Access

    Sakila for SQLite is a port of the Sakila example database available for MySQL, originally developed by Mike Hillyer of the MySQL AB documentation team. This project is designed to help database administrators decide which database to use for development of new products. Users can run the same SQL against different kinds of databases and compare their performance.

    License: BSD. Copyright DB Software Laboratory, http://www.etl-tools.com

    Note: Part of the insert scripts were generated by Advanced ETL Processor http://www.etl-tools.com/etl-tools/advanced-etl-processor-enterprise/overview.html

    Information about the project and the downloadable files can be found at: https://code.google.com/archive/p/sakila-sample-database-ports/

    Other versions and developments of the project can be found at: https://github.com/ivanceras/sakila/tree/master/sqlite-sakila-db

    https://github.com/jOOQ/jOOQ/tree/main/jOOQ-examples/Sakila

    Direct access to the MySQL Sakila database, which does not require installation of MySQL (queries can be typed directly in the browser), is provided on the phpMyAdmin demo version website: https://demo.phpmyadmin.net/master-config/

    Files Description

    The files in the sqlite-sakila-db folder are the script files which can be used to generate the SQLite version of the database. For convenience, the script files have already been run in cmd to generate the sqlite-sakila.db file, as follows:

        sqlite> .open sqlite-sakila.db               # creates the .db file
        sqlite> .read sqlite-sakila-schema.sql       # creates the database schema
        sqlite> .read sqlite-sakila-insert-data.sql  # inserts the data

    Therefore, the sqlite-sakila.db file can be loaded directly into SQLite3 and queries can be executed against it. You can refer to my notebook for an overview of the database and a demonstration of SQL queries. Note: data for the film_text table is not provided in the script files, so the film_text table is empty; instead, the film_id, title and description fields are included in the film table. Moreover, the Sakila Sample Database has many versions, so an Entity Relationship Diagram (ERD) is provided to describe this specific version. You are advised to refer to the ERD to familiarise yourself with the structure of the database.
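    For example, once the sqlite-sakila.db file is available, Python's built-in sqlite3 module can query it directly. The film table and its title field are described above; the helper name and the particular query are illustrative.

```python
# Sketch: querying the generated sqlite-sakila.db with Python's sqlite3.
# The film table (film_id, title, description) is described above; the
# helper below is illustrative.
import sqlite3

def longest_titles(db_path: str, n: int = 3) -> list:
    """Return the n film titles with the greatest length, longest first."""
    con = sqlite3.connect(db_path)
    try:
        return [row[0] for row in con.execute(
            "SELECT title FROM film ORDER BY LENGTH(title) DESC LIMIT ?",
            (n,),
        )]
    finally:
        con.close()
```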

  8. CDC WONDER API for Data Query Web Service - t3mh-uddy - Archive Repository

    • healthdata.gov
    application/rdfxml +5
    Updated Jul 26, 2023
    Cite
    (2023). CDC WONDER API for Data Query Web Service - t3mh-uddy - Archive Repository [Dataset]. https://healthdata.gov/dataset/CDC-WONDER-API-for-Data-Query-Web-Service-t3mh-udd/gjrs-j84b
    Explore at:
    Available download formats: xml, application/rdfxml, csv, application/rssxml, tsv, json
    Dataset updated
    Jul 26, 2023
    Description

    This dataset tracks the updates made on the dataset "CDC WONDER API for Data Query Web Service" as a repository for previous versions of the data and metadata.

  9. Resources of IncRML: Incremental Knowledge Graph Construction from...

    • zenodo.org
    • explore.openaire.eu
    bin, text/x-python +1
    Updated Dec 13, 2024
    Cite
    Dylan Van Assche; Dylan Van Assche; Julian Andres Rojas Melendez; Julian Andres Rojas Melendez; Ben De Meester; Ben De Meester; Pieter Colpaert; Pieter Colpaert (2024). Resources of IncRML: Incremental Knowledge Graph Construction from Heterogeneous Data Sources [Dataset]. http://doi.org/10.5281/zenodo.14038823
    Explore at:
    Available download formats: xz, text/x-python, bin
    Dataset updated
    Dec 13, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dylan Van Assche; Dylan Van Assche; Julian Andres Rojas Melendez; Julian Andres Rojas Melendez; Ben De Meester; Ben De Meester; Pieter Colpaert; Pieter Colpaert
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jul 8, 2023
    Description

    IncRML resources

    This Zenodo dataset contains all the resources of the paper 'IncRML: Incremental Knowledge Graph Construction from Heterogeneous Data Sources', submitted to the Semantic Web Journal's Special Issue on Knowledge Graph Construction. This resource aims to make the paper's experiments fully reproducible through our experiment tool, written in Python, which was previously used in the Knowledge Graph Construction Challenge by the ESWC 2023 Workshop on Knowledge Graph Construction. The exact Java JAR file of the RMLMapper (rmlmapper.jar) used to execute the experiments is also provided in this dataset. This JAR file was executed with Java OpenJDK 11.0.20.1 on Ubuntu 22.04.1 LTS (Linux 5.15.0-53-generic). Each experiment was executed 5 times, and the median values are reported together with the standard deviation of the measurements.

    Datasets

    We provide both dataset dumps of the GTFS-Madrid-Benchmark and of real-life use cases from Open Data in Belgium.
    GTFS-Madrid-Benchmark dumps are used to analyze the impact on execution time and resources, while the real-life use cases verify the approach on different types of datasets, since the GTFS-Madrid-Benchmark is a single type of dataset that does not advertise changes at all.

    Benchmarks

    • GTFS-Madrid-Benchmark: change types with fixed data size and amount of changes: additions-only, modifications-only, deletions-only (11 versions)
    • GTFS-Madrid-Benchmark: amount of changes with fixed data size: 0%, 25%, 50%, 75%, and 100% changes (11 versions)
    • GTFS-Madrid-Benchmark: data size with fixed amount of changes: scales 1, 10, 100 (11 versions)

    Real-world datasets

    • Traffic control center Vlaams Verkeerscentrum (Belgium): traffic board messages data (1 day, 28760 versions)
    • Meteorological institute KMI (Belgium): weather sensor data (1 day, 144 versions)
    • Public transport agency NMBS (Belgium): train schedule data (1 week, 7 versions)
    • Public transport agency De Lijn (Belgium): bus schedule data (1 week, 7 versions)
    • Bike-sharing company BlueBike (Belgium): bike-sharing availability data (1 day, 1440 versions)
    • Bike-sharing company JCDecaux (EU): bike-sharing availability data (1 day, 1440 versions)
    • OpenStreetMap (World): geographical map data (1 day, 1440 versions)

    Ingestion

    The LDES output of the real-world datasets was converted into SPARQL UPDATE queries and executed against Virtuoso, to estimate for non-LDES clients how incremental generation impacts ingestion into triplestores.

    Remarks

    1. The first version of each dataset is always used as a baseline. Each subsequent version is applied as an update on the existing version. The reported results focus only on the updates, since these constitute the actual incremental generation.
    2. GTFS-Change-50_percent-{ALL, CHANGE}.tar.xz datasets are not uploaded as GTFS-Madrid-Benchmark scale 100 because both share the same parameters (50% changes, scale 100). Please use GTFS-Scale-100-{ALL, CHANGE}.tar.xz for GTFS-Change-50_percent-{ALL, CHANGE}.tar.xz
    3. All datasets are compressed with XZ and provided as TAR archives; be aware that you need sufficient space to decompress them! 2 TB of free space is advised to decompress all benchmarks and use cases. The expected output is provided as a ZIP file in each TAR archive; decompressing these requires even more space (4 TB).

    Reproducing

    By using our experiment tool, you can easily reproduce the experiments as follows:

    1. Download one of the TAR.XZ archives and unpack them.
    2. Clone the GitHub repository of our experiment tool and install the Python dependencies with 'pip install -r requirements.txt'.
    3. Download the rmlmapper.jar JAR file from this Zenodo dataset and place it inside the experiment tool root folder.
    4. Execute the tool by running: './exectool --root=/path/to/the/root/of/the/tarxz/archive --runs=5 run'. The argument '--runs=5' is used to perform the experiment 5 times.
    5. Once executed, you can generate the statistics by running: './exectool --root=/path/to/the/root/of/the/tarxz/archive stats'.

    Testcases

    Testcases to verify the integration of RML and LDES with IncRML, see https://doi.org/10.5281/zenodo.10171394

  10. Health Service Research (HSR) PubMed Queries - dm44-vu3a - Archive...

    • healthdata.gov
    application/rdfxml +5
    Updated Jul 16, 2025
    Cite
    (2025). Health Service Research (HSR) PubMed Queries - dm44-vu3a - Archive Repository [Dataset]. https://healthdata.gov/dataset/Health-Service-Research-HSR-PubMed-Queries-dm44-vu/hha6-5rpm
    Explore at:
    Available download formats: tsv, application/rdfxml, xml, csv, application/rssxml, json
    Dataset updated
    Jul 16, 2025
    Description

    This dataset tracks the updates made on the dataset "Health Service Research (HSR) PubMed Queries" as a repository for previous versions of the data and metadata.

