10 datasets found
  1. GitHub Repos

    • kaggle.com
    zip
    Updated Mar 20, 2019
    + more versions
    Cite
    Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    GitHub (https://github.com/)
    Authors
    Github
    Description

    GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

    This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that the methods available in Kernels are limited to querying data. Tables are located at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and learn how to safely analyze large BigQuery datasets.
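    As a concrete sketch, the query itself is plain SQL. The helper below builds a query against the licenses table (one of the dataset's tables), and the commented lines show how it would be run with the google-cloud-bigquery client in a Kernel; the helper name and the exact result handling are illustrative, not part of the dataset.

```python
# Minimal sketch: build a SQL query against one of the dataset's tables.
# The licenses table exists in bigquery-public-data.github_repos; the helper
# function below is illustrative, not part of the dataset or Kaggle's API.

def license_count_query(limit: int = 10) -> str:
    """Return SQL counting repositories per license, most common first."""
    return f"""
        SELECT license, COUNT(*) AS repo_count
        FROM `bigquery-public-data.github_repos.licenses`
        GROUP BY license
        ORDER BY repo_count DESC
        LIMIT {limit}
    """

# In a Kernel with credentials configured, this would run as:
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   for row in client.query(license_count_query()).result():
#       print(row.license, row.repo_count)
```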

    Acknowledgements

    This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

    Inspiration

    • This is the perfect dataset for fighting language wars.
    • Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?
  2. Medical Expenditure Panel Survey (MEPS) Query Tool - k85c-yfyp - Archive...

    • healthdata.gov
    application/rdfxml +5
    Updated Jun 2, 2022
    Cite
    (2022). Medical Expenditure Panel Survey (MEPS) Query Tool - k85c-yfyp - Archive Repository [Dataset]. https://healthdata.gov/dataset/Medical-Expenditure-Panel-Survey-MEPS-Query-Tool-k/vzpv-jw8g
    Explore at:
    Available download formats: tsv, csv, application/rdfxml, xml, application/rssxml, json
    Dataset updated
    Jun 2, 2022
    Description

    This dataset tracks the updates made on the dataset "Medical Expenditure Panel Survey (MEPS) Query Tool" as a repository for previous versions of the data and metadata.

  3. Data from: TerraDS: A Dataset for Terraform HCL Programs

    • zenodo.org
    application/gzip, bin
    Updated Nov 27, 2024
    Cite
    Christoph Bühler; Christoph Bühler; David Spielmann; David Spielmann; Roland Meier; Roland Meier; Guido Salvaneschi; Guido Salvaneschi (2024). TerraDS: A Dataset for Terraform HCL Programs [Dataset]. http://doi.org/10.5281/zenodo.14217386
    Explore at:
    Available download formats: application/gzip, bin
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christoph Bühler; Christoph Bühler; David Spielmann; David Spielmann; Roland Meier; Roland Meier; Guido Salvaneschi; Guido Salvaneschi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    TerraDS

    The TerraDS dataset provides a comprehensive collection of Terraform programs written in the HashiCorp Configuration Language (HCL). As Infrastructure as Code (IaC) gains popularity for managing cloud infrastructure, Terraform has become one of the leading tools due to its declarative nature and widespread adoption. However, a lack of publicly available, large-scale datasets has hindered systematic research on Terraform practices. TerraDS addresses this gap by compiling metadata and source code from 62,406 open-source repositories with valid licenses. This dataset aims to foster research on best practices, vulnerabilities, and improvements in IaC methodologies.

    Structure of the Database

    The TerraDS dataset is organized into two main components: a SQLite database containing metadata and an archive of source code (~335 MB). The metadata, captured in a structured format, includes information about repositories, modules, and resources:

    1. Repository Data:

    • Contains 62,406 repositories with fields such as repository name, creation date, star count, and permissive license details.
    • Provides cloneable URLs for access and analysis.
    • Tracks additional metrics like repository size and the latest commit details.

    2. Module Data:

    • Consists of 279,344 modules identified within the repositories.
    • Each module includes its relative path, referenced providers, and external module calls stored as JSON objects.

    3. Resource Data:

    • Encompasses 1,773,991 resources, split into managed (1,484,185) and data (289,806) resources.
    • Each resource entry details its type, provider, and whether it is managed or read-only.
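    As a sketch of working with the metadata, the SQLite database can be queried directly with Python's built-in sqlite3 module. The table and column names below (repository, name, star_count) are assumptions for illustration only; check the actual schema (e.g. with .schema in the sqlite3 shell) before querying.

```python
# Hypothetical sketch of querying the TerraDS metadata database; the table
# and column names (repository, name, star_count) are assumed for
# illustration and should be checked against the real schema.
import sqlite3

def top_starred_repos(db_path: str, n: int = 5) -> list:
    """Return the n repositories with the highest star count."""
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            "SELECT name, star_count FROM repository "
            "ORDER BY star_count DESC LIMIT ?",
            (n,),
        ).fetchall()
    finally:
        con.close()
```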

    Structure of the Archive

    The provided archive contains the source code of the 62,406 repositories, allowing further analysis based on the actual source rather than the metadata alone. Researchers can thus access the permissively licensed repositories and conduct studies on the executable HCL code.

    Tools

    The "HCL Dataset Tools" file contains a snapshot of the https://github.com/prg-grp/hcl-dataset-tools repository, kept for long-term archival. The tools in this repository can be used to reproduce this dataset.

    One of the tools, "RepositorySearcher", can be used to fetch metadata for various other GitHub API queries, not only Terraform code. While the RepositorySearcher supports other kinds of repository search, the remaining tools are focused on Terraform repositories.

  4. Human Cell and Tissue Establishment Registration Public Query - pj5w-zcqt -...

    • healthdata.gov
    application/rdfxml +5
    Updated Jul 16, 2025
    Cite
    (2025). Human Cell and Tissue Establishment Registration Public Query - pj5w-zcqt - Archive Repository [Dataset]. https://healthdata.gov/dataset/Human-Cell-and-Tissue-Establishment-Registration-P/4qd7-mv3m
    Explore at:
    Available download formats: csv, xml, application/rssxml, tsv, application/rdfxml, json
    Dataset updated
    Jul 16, 2025
    Description

    This dataset tracks the updates made on the dataset "Human Cell and Tissue Establishment Registration Public Query" as a repository for previous versions of the data and metadata.

  5. Energy System Time Series Suite (ESTSS) - Data Archive

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv
    Updated Jan 22, 2024
    Cite
    Sebastian Günther; Sebastian Günther (2024). Energy System Time Series Suite (ESTSS) - Data Archive [Dataset]. http://doi.org/10.5281/zenodo.10213145
    Explore at:
    Available download formats: csv, bin
    Dataset updated
    Jan 22, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sebastian Günther; Sebastian Günther
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Energy System Time Series Suite - Data Archive

    This archive contains variously sized sets of declustered time series within the context of energy systems. These series demonstrate low discrepancy and high heterogeneity in feature space, resulting in a roughly uniform distribution within this space.

    For detailed information, please refer to the corresponding GitHub project:
    https://github.com/s-guenther/estss/

    For associated research, see
    https://doi.org/10.1186/s42162-024-00304-8

    Data is provided in .csv format. The GitHub project includes a Python function to load this data as a dictionary of pandas data frames.

    Should you utilize this data, kindly also cite the associated research paper. For any queries, please feel free to reach out to us through GitHub or the contact details provided at the end of this readme file.

    Folder Content

    • `ts_*.csv`: Contains declustered load profile time series in tabular format.
      • Size: `(n+1) x (m+1)`, with `n` representing time steps (1000 per series) and `m` the number of series.
      • Includes a header row and index column. Headers indicate series id, and the index column numbers each time step, starting from `0`.
      • The first half of the series `(m/2)` have a constant (negative) sign. They are numbered sequentially from `0`.
      • The second half `(m/2)` have varying signs. Numbering starts from `1,000,000`.
    • `features_*.csv`: Tabulates features corresponding to the time series.
      • Size: `(m+1) x (f+1)`, where `m` is the number of time series and `f` is the number of features
      • Includes a header row and index column. Indexes represent time series id (matching `ts_*.csv` headers), and headers name the features.
    • `norm_space_*.csv`: Shows feature vectors in normalized feature space where time series are declustered. Provided for completeness; typically not needed by users.
      • Size: `(m+1) x (g+1)`, where `m` is the number of time series and `g` is the number of selected feature-space features (a subset of `f` from `features_*.csv`).
      • Format matches `features_*.csv`.
    • `info_*.csv`: Maps declustered datasets to the manifolded dataset. Provided for completeness; typically not needed by users.
      • Size: `(m+1) x 2`, with `m` as series count. Columns contain manifolded set time series ids.
      • Includes an index column and a header. The index holds the remapped id of declustered series. Header `0` is non-significant.

    Each `ts_*.csv`, `features_*.csv`, `norm_space_*.csv`, and `info_*.csv` file comes in four versions to accommodate various set sizes:

    • `*_4096.csv`
    • `*_1024.csv`
    • `*_256.csv`
    • `*_64.csv`

    These represent sets with 4096, 1024, 256, and 64 time series, respectively, offering different densities in feature-space population. The objective is to balance computational load and resolution for individual research needs.
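    Given the layout described above (a header row of series ids and an index column of time steps), one of these files can be loaded with pandas roughly as follows. The estss GitHub project ships its own loader, so this is only a minimal sketch.

```python
# Minimal sketch of loading a ts_*.csv file: the first column is the
# time-step index and the header row holds the series ids. The estss
# project provides its own loader; this is only an illustration.
import pandas as pd

def load_ts(path: str) -> pd.DataFrame:
    """Load a time-series CSV with series ids as columns, time steps as index."""
    return pd.read_csv(path, index_col=0)
```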

    Contact

    ESTSS - Energy System Time Series Suite
    Copyright (C) 2023
    Sebastian Günther
    sebastian.guenther@ifes.uni-hannover.de

    Leibniz Universität Hannover
    Institut für Elektrische Energiesysteme
    Fachgebiet für Elektrische Energiespeichersysteme

    Leibniz University Hannover
    Institute of Electric Power Systems
    Electric Energy Storage Systems Section

    https://www.ifes.uni-hannover.de/ees.html

  6. Web-based Injury Statistics Query and Reporting System (WISQARS) - 82ty-fydp...

    • healthdata.gov
    application/rdfxml +5
    Updated Jul 26, 2023
    Cite
    (2023). Web-based Injury Statistics Query and Reporting System (WISQARS) - 82ty-fydp - Archive Repository [Dataset]. https://healthdata.gov/dataset/Web-based-Injury-Statistics-Query-and-Reporting-Sy/d2ht-timm
    Explore at:
    Available download formats: csv, xml, application/rdfxml, json, tsv, application/rssxml
    Dataset updated
    Jul 26, 2023
    Description

    This dataset tracks the updates made on the dataset "Web-based Injury Statistics Query and Reporting System (WISQARS)" as a repository for previous versions of the data and metadata.

  7. SQLite Sakila Sample Database

    • kaggle.com
    Updated Mar 14, 2021
    Cite
    Atanas Kanev (2021). SQLite Sakila Sample Database [Dataset]. https://www.kaggle.com/atanaskanev/sqlite-sakila-sample-database/tasks
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more about it at mlcommons.org/croissant.
    Dataset updated
    Mar 14, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Atanas Kanev
    Description

    SQLite Sakila Sample Database

    Database Description

    The Sakila sample database is a fictitious database designed to represent a DVD rental store. The tables of the database include film, film_category, actor, customer, rental, payment and inventory among others. The Sakila sample database is intended to provide a standard schema that can be used for examples in books, tutorials, articles, samples, and so forth. Detailed information about the database can be found on the MySQL website: https://dev.mysql.com/doc/sakila/en/

    Sakila for SQLite is a part of the sakila-sample-database-ports project intended to provide ported versions of the original MySQL database for other database systems, including:

    • Oracle
    • SQL Server
    • SQLite
    • Interbase/Firebird
    • Microsoft Access

    Sakila for SQLite is a port of the Sakila example database available for MySQL, originally developed by Mike Hillyer of the MySQL AB documentation team. This project is designed to help database administrators decide which database to use for development of new products. Users can run the same SQL against different kinds of databases and compare their performance.

    License: BSD. Copyright DB Software Laboratory, http://www.etl-tools.com

    Note: Part of the insert scripts were generated by Advanced ETL Processor http://www.etl-tools.com/etl-tools/advanced-etl-processor-enterprise/overview.html

    Information about the project and the downloadable files can be found at: https://code.google.com/archive/p/sakila-sample-database-ports/

    Other versions and developments of the project can be found at: https://github.com/ivanceras/sakila/tree/master/sqlite-sakila-db

    https://github.com/jOOQ/jOOQ/tree/main/jOOQ-examples/Sakila

    Direct access to the MySQL Sakila database, which does not require installation of MySQL (queries can be typed directly in the browser), is provided on the phpMyAdmin demo version website: https://demo.phpmyadmin.net/master-config/

    Files Description

    The files in the sqlite-sakila-db folder are the script files which can be used to generate the SQLite version of the database. For convenience, the script files have already been run in cmd to generate the sqlite-sakila.db file, as follows:

        sqlite> .open sqlite-sakila.db               # creates the .db file
        sqlite> .read sqlite-sakila-schema.sql       # creates the database schema
        sqlite> .read sqlite-sakila-insert-data.sql  # inserts the data

    Therefore, the sqlite-sakila.db file can be loaded directly into SQLite3 and queries can be executed against it. You can refer to my notebook for an overview of the database and a demonstration of SQL queries. Note: data for the film_text table is not provided in the script files, so the film_text table is empty; instead, the film_id, title and description fields are included in the film table. Moreover, the Sakila Sample Database has many versions, so an Entity Relationship Diagram (ERD) is provided to describe this specific version. You are advised to refer to the ERD to familiarise yourself with the structure of the database.
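    For example, once the sqlite-sakila.db file is available, Python's built-in sqlite3 module can query it directly. The film table and its title field are described above; the helper name and the particular query are illustrative.

```python
# Sketch: querying the generated sqlite-sakila.db with Python's sqlite3.
# The film table (film_id, title, description) is described above; the
# helper below is illustrative.
import sqlite3

def longest_titles(db_path: str, n: int = 3) -> list:
    """Return the n film titles with the greatest length, longest first."""
    con = sqlite3.connect(db_path)
    try:
        return [row[0] for row in con.execute(
            "SELECT title FROM film ORDER BY LENGTH(title) DESC LIMIT ?",
            (n,),
        )]
    finally:
        con.close()
```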

  8. CDC WONDER API for Data Query Web Service - t3mh-uddy - Archive Repository

    • healthdata.gov
    application/rdfxml +5
    Updated Jul 26, 2023
    Cite
    (2023). CDC WONDER API for Data Query Web Service - t3mh-uddy - Archive Repository [Dataset]. https://healthdata.gov/dataset/CDC-WONDER-API-for-Data-Query-Web-Service-t3mh-udd/gjrs-j84b
    Explore at:
    Available download formats: xml, application/rdfxml, csv, application/rssxml, tsv, json
    Dataset updated
    Jul 26, 2023
    Description

    This dataset tracks the updates made on the dataset "CDC WONDER API for Data Query Web Service" as a repository for previous versions of the data and metadata.

  9. Resources of IncRML: Incremental Knowledge Graph Construction from...

    • zenodo.org
    • explore.openaire.eu
    bin, text/x-python +1
    Updated Dec 13, 2024
    Cite
    Dylan Van Assche; Dylan Van Assche; Julian Andres Rojas Melendez; Julian Andres Rojas Melendez; Ben De Meester; Ben De Meester; Pieter Colpaert; Pieter Colpaert (2024). Resources of IncRML: Incremental Knowledge Graph Construction from Heterogeneous Data Sources [Dataset]. http://doi.org/10.5281/zenodo.14038823
    Explore at:
    Available download formats: xz, text/x-python, bin
    Dataset updated
    Dec 13, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dylan Van Assche; Dylan Van Assche; Julian Andres Rojas Melendez; Julian Andres Rojas Melendez; Ben De Meester; Ben De Meester; Pieter Colpaert; Pieter Colpaert
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jul 8, 2023
    Description

    IncRML resources

    This Zenodo dataset contains all the resources of the paper 'IncRML: Incremental Knowledge Graph Construction from Heterogeneous Data Sources', submitted to the Semantic Web Journal's Special Issue on Knowledge Graph Construction. This resource aims to make the paper's experiments fully reproducible through our experiment tool, written in Python, which was previously used in the Knowledge Graph Construction Challenge by the ESWC 2023 Workshop on Knowledge Graph Construction. The exact Java JAR file of the RMLMapper (rmlmapper.jar) used to execute the experiments is also provided in this dataset. This JAR file was executed with Java OpenJDK 11.0.20.1 on Ubuntu 22.04.1 LTS (Linux 5.15.0-53-generic). Each experiment was executed 5 times, and the median values are reported together with the standard deviation of the measurements.

    Datasets

    We provide both dataset dumps of the GTFS-Madrid-Benchmark and of real-life use cases from Open Data in Belgium.
    GTFS-Madrid-Benchmark dumps are used to analyze the impact on execution time and resources, while the real-life use cases verify the approach on different types of datasets, since the GTFS-Madrid-Benchmark is a single type of dataset that does not advertise changes at all.

    Benchmarks

    • GTFS-Madrid-Benchmark: change types with fixed data size and amount of changes: additions-only, modifications-only, deletions-only (11 versions)
    • GTFS-Madrid-Benchmark: amount of changes with fixed data size: 0%, 25%, 50%, 75%, and 100% changes (11 versions)
    • GTFS-Madrid-Benchmark: data size with fixed amount of changes: scales 1, 10, 100 (11 versions)

    Real-world datasets

    • Traffic control center Vlaams Verkeerscentrum (Belgium): traffic board messages data (1 day, 28760 versions)
    • Meteorological institute KMI (Belgium): weather sensor data (1 day, 144 versions)
    • Public transport agency NMBS (Belgium): train schedule data (1 week, 7 versions)
    • Public transport agency De Lijn (Belgium): bus schedule data (1 week, 7 versions)
    • Bike-sharing company BlueBike (Belgium): bike-sharing availability data (1 day, 1440 versions)
    • Bike-sharing company JCDecaux (EU): bike-sharing availability data (1 day, 1440 versions)
    • OpenStreetMap (World): geographical map data (1 day, 1440 versions)

    Ingestion

    The LDES output of the real-world datasets was converted into SPARQL UPDATE queries and executed against Virtuoso, to estimate for non-LDES clients how incremental generation impacts ingestion into triplestores.

    Remarks

    1. The first version of each dataset is always used as a baseline. Each subsequent version is applied as an update on the existing version. The reported results focus only on the updates, since these constitute the actual incremental generation.
    2. GTFS-Change-50_percent-{ALL, CHANGE}.tar.xz datasets are not uploaded as GTFS-Madrid-Benchmark scale 100 because both share the same parameters (50% changes, scale 100). Please use GTFS-Scale-100-{ALL, CHANGE}.tar.xz for GTFS-Change-50_percent-{ALL, CHANGE}.tar.xz
    3. All datasets are compressed with XZ and provided as TAR archives; be aware that you need sufficient space to decompress them! 2 TB of free space is advised to decompress all benchmarks and use cases. The expected output is provided as a ZIP file in each TAR archive; decompressing these requires even more space (4 TB).

    Reproducing

    By using our experiment tool, you can easily reproduce the experiments as follows:

    1. Download one of the TAR.XZ archives and unpack them.
    2. Clone the GitHub repository of our experiment tool and install the Python dependencies with 'pip install -r requirements.txt'.
    3. Download the rmlmapper.jar JAR file from this Zenodo dataset and place it inside the experiment tool root folder.
    4. Execute the tool by running: './exectool --root=/path/to/the/root/of/the/tarxz/archive --runs=5 run'. The argument '--runs=5' is used to perform the experiment 5 times.
    5. Once executed, you can generate the statistics by running: './exectool --root=/path/to/the/root/of/the/tarxz/archive stats'.

    Testcases

    Testcases to verify the integration of RML and LDES with IncRML, see https://doi.org/10.5281/zenodo.10171394

  10. Health Service Research (HSR) PubMed Queries - dm44-vu3a - Archive...

    • healthdata.gov
    application/rdfxml +5
    Updated Jul 16, 2025
    Cite
    (2025). Health Service Research (HSR) PubMed Queries - dm44-vu3a - Archive Repository [Dataset]. https://healthdata.gov/dataset/Health-Service-Research-HSR-PubMed-Queries-dm44-vu/hha6-5rpm
    Explore at:
    Available download formats: tsv, application/rdfxml, xml, csv, application/rssxml, json
    Dataset updated
    Jul 16, 2025
    Description

    This dataset tracks the updates made on the dataset "Health Service Research (HSR) PubMed Queries" as a repository for previous versions of the data and metadata.

