GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.
This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at `bigquery-public-data.github_repos.[TABLENAME]`. Fork this kernel to learn how to safely manage analyzing large BigQuery datasets.
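As a minimal sketch of such a query, assuming a Kernels environment where the BigQuery client authenticates automatically (the `licenses` table is one of the smaller tables in the dataset, which keeps the scanned bytes low):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Count repositories per license. REGEXP_CONTAINS can be used the same way
# to search file contents with regular expressions, but the contents table
# is far larger, so watch your query quota.
query = """
    SELECT license, COUNT(*) AS repo_count
    FROM `bigquery-public-data.github_repos.licenses`
    GROUP BY license
    ORDER BY repo_count DESC
"""
for row in client.query(query).result():
    print(row.license, row.repo_count)
```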
This dataset was made available per GitHub's terms of service and is also available on the Google Cloud Platform Marketplace as GitHub Activity Data, part of GCP Public Datasets.
This dataset tracks the updates made on the dataset "Medical Expenditure Panel Survey (MEPS) Query Tool" as a repository for previous versions of the data and metadata.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The TerraDS dataset provides a comprehensive collection of Terraform programs written in the HashiCorp Configuration Language (HCL). As Infrastructure as Code (IaC) gains popularity for managing cloud infrastructure, Terraform has become one of the leading tools due to its declarative nature and widespread adoption. However, a lack of publicly available, large-scale datasets has hindered systematic research on Terraform practices. TerraDS addresses this gap by compiling metadata and source code from 62,406 open-source repositories with valid licenses. This dataset aims to foster research on best practices, vulnerabilities, and improvements in IaC methodologies.
The TerraDS dataset is organized into two main components: a SQLite database containing metadata and an archive of source code (~335 MB). The metadata, captured in a structured format, includes information about repositories, modules, and resources:
1. Repository Data:
2. Module Data:
3. Resource Data:
The provided archive contains the source code of the 62,406 repositories to allow further analysis based on the actual source instead of the metadata only. As such, researchers can access the permissively licensed repositories and conduct studies on the executable HCL code.
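As a starting point, the metadata database can be inspected with Python's built-in sqlite3 module. A minimal sketch, assuming the database file is named terrads.db (a hypothetical filename; list the actual tables via sqlite_master before querying):

```python
import sqlite3

# Hypothetical filename -- substitute the SQLite file shipped with TerraDS.
con = sqlite3.connect("terrads.db")

# Discover the schema first: the dataset documents repository, module,
# and resource metadata, but the exact table names may differ.
tables = con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print([name for (name,) in tables])

con.close()
```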
The "HCL Dataset Tools" file contains a snapshot of the https://github.com/prg-grp/hcl-dataset-tools repository - for long term archival reasons. The tools in this repository can be used to reproduce this dataset.
One of the tools - "RepositorySearcher" - can be used to fetch metadata for various other GitHub API queries, not only Terraform code. While the RepositorySearcher allows usage for other types of repository search, the other tools provided are focused on Terraform repositories.
This dataset tracks the updates made on the dataset "Human Cell and Tissue Establishment Registration Public Query" as a repository for previous versions of the data and metadata.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This archive contains variously sized sets of declustered time series within the context of energy systems. These series demonstrate low discrepancy and high heterogeneity in feature space, resulting in a roughly uniform distribution within this space.
For detailed information, please refer to the corresponding GitHub project:
https://github.com/s-guenther/estss/
For associated research, see
https://doi.org/10.1186/s42162-024-00304-8
Data is provided in .csv format. The GitHub project includes a Python function to load this data as a dictionary of pandas data frames.
Should you utilize this data, kindly also cite the associated research paper. For any queries, please feel free to reach out to us through GitHub or the contact details provided at the end of this readme file.
Each `ts_*.csv`, `features_*.csv`, `norm_space_*.csv`, and `info_*.csv` file comes in four versions, representing sets of 4096, 1024, 256, and 64 time series, respectively, and offering different densities in feature-space population. The objective is to balance computational load and resolution for individual research needs.
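The loader function in the GitHub project is the intended route, but as a minimal stand-alone sketch (assuming the set size appears as the filename suffix, e.g. ts_4096.csv; check the archive for the exact names):

```python
import pandas as pd

# Hypothetical filenames -- the archive ships four set sizes
# (4096, 1024, 256, 64); adjust the suffix to the set you need.
ts = pd.read_csv("ts_4096.csv", index_col=0)
features = pd.read_csv("features_4096.csv", index_col=0)

# Shapes of the loaded tables; see the GitHub project for the exact layout.
print(ts.shape, features.shape)
```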
ESTSS - Energy System Time Series Suite
Copyright (C) 2023
Sebastian Günther
sebastian.guenther@ifes.uni-hannover.de
Leibniz Universität Hannover
Institut für Elektrische Energiesysteme
Fachgebiet für Elektrische Energiespeichersysteme
Leibniz University Hannover
Institute of Electric Power Systems
Electric Energy Storage Systems Section
This dataset tracks the updates made on the dataset "Web-based Injury Statistics Query and Reporting System (WISQARS)" as a repository for previous versions of the data and metadata.
The Sakila sample database is a fictitious database designed to represent a DVD rental store. The tables of the database include film, film_category, actor, customer, rental, payment and inventory among others. The Sakila sample database is intended to provide a standard schema that can be used for examples in books, tutorials, articles, samples, and so forth. Detailed information about the database can be found on the MySQL website: https://dev.mysql.com/doc/sakila/en/
Sakila for SQLite is part of the sakila-sample-database-ports project, which is intended to provide ported versions of the original MySQL database for a number of other database systems.
Sakila for SQLite is a port of the Sakila example database available for MySQL, originally developed by Mike Hillyer of the MySQL AB documentation team. The project is designed to help database administrators decide which database to use for the development of new products: the user can run the same SQL against different kinds of databases and compare performance.
License: BSD. Copyright DB Software Laboratory, http://www.etl-tools.com
Note: Parts of the insert scripts were generated by Advanced ETL Processor: http://www.etl-tools.com/etl-tools/advanced-etl-processor-enterprise/overview.html
Information about the project and the downloadable files can be found at: https://code.google.com/archive/p/sakila-sample-database-ports/
Other versions and developments of the project can be found at: https://github.com/ivanceras/sakila/tree/master/sqlite-sakila-db
https://github.com/jOOQ/jOOQ/tree/main/jOOQ-examples/Sakila
Direct access to the MySQL Sakila database, which does not require installation of MySQL (queries can be typed directly in the browser), is provided on the phpMyAdmin demo version website: https://demo.phpmyadmin.net/master-config/
The files in the sqlite-sakila-db folder are the script files which can be used to generate the SQLite version of the database. For convenience, the script files have already been run in cmd to generate the sqlite-sakila.db file, as follows:
```
sqlite> .open sqlite-sakila.db              # creates the .db file
sqlite> .read sqlite-sakila-schema.sql      # creates the database schema
sqlite> .read sqlite-sakila-insert-data.sql # inserts the data
```
Therefore, the sqlite-sakila.db file can be directly loaded into SQLite3 and queries can be executed against it right away. You can refer to my notebook for an overview of the database and a demonstration of SQL queries.

Note: Data for the film_text table is not provided in the script files, so the film_text table is empty; instead, the film_id, title, and description fields are included in the film table. Moreover, the Sakila Sample Database has many versions, so an Entity Relationship Diagram (ERD) is provided to describe this specific version. You are advised to refer to the ERD to familiarise yourself with the structure of the database.
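For instance, a minimal Python sketch querying the generated file (the film table and its film_id, title, and description columns are described above; other table names can be listed via sqlite_master):

```python
import sqlite3

con = sqlite3.connect("sqlite-sakila.db")

# film_text is empty in this port, so query the film table instead.
for film_id, title in con.execute(
    "SELECT film_id, title FROM film ORDER BY title LIMIT 5"
):
    print(film_id, title)

con.close()
```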
This dataset tracks the updates made on the dataset "CDC WONDER API for Data Query Web Service" as a repository for previous versions of the data and metadata.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This Zenodo dataset contains all the resources of the paper 'IncRML: Incremental Knowledge Graph Construction from Heterogeneous Data Sources', submitted to the Semantic Web Journal's Special Issue on Knowledge Graph Construction. This resource aims to make the paper's experiments fully reproducible through our experiment tool, written in Python, which was previously used in the Knowledge Graph Construction Challenge by the ESWC 2023 Workshop on Knowledge Graph Construction. The exact Java JAR file of the RMLMapper (rmlmapper.jar) used to execute the experiments is also provided in this dataset; it was executed with Java OpenJDK 11.0.20.1 on Ubuntu 22.04.1 LTS (Linux 5.15.0-53-generic). Each experiment was executed 5 times, and the median values are reported together with the standard deviation of the measurements.
We provide both dataset dumps of the GTFS-Madrid-Benchmark and of real-life use cases from Open Data in Belgium.
GTFS-Madrid-Benchmark dumps are used to analyze the impact on execution time and resources, while the real-life use cases aim to verify the approach on different types of datasets, since the GTFS-Madrid-Benchmark is a single type of dataset that does not advertise changes at all.
The LDES output of the real-world datasets was converted into SPARQL UPDATE queries and executed against Virtuoso to estimate, for non-LDES clients, how incremental generation impacts ingestion into triplestores.
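As a rough illustration of that last step (a hedged sketch, not the paper's actual pipeline: the endpoint URL, graph IRI, and triple below are hypothetical placeholders):

```python
from SPARQLWrapper import SPARQLWrapper, POST

# Hypothetical Virtuoso endpoint -- adjust host, port, and credentials.
sparql = SPARQLWrapper("http://localhost:8890/sparql")
sparql.setMethod(POST)

# One LDES member change, rewritten as a SPARQL UPDATE.
sparql.setQuery("""
    INSERT DATA {
        GRAPH <http://example.org/graph> {
            <http://example.org/stop/1>
                <http://www.w3.org/2000/01/rdf-schema#label> "Stop 1" .
        }
    }
""")
sparql.query()
```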
By using our experiment tool, you can easily reproduce the experiments as follows:
Test cases verifying the integration of RML and LDES with IncRML are available at https://doi.org/10.5281/zenodo.10171394
This dataset tracks the updates made on the dataset "Health Service Research (HSR) PubMed Queries" as a repository for previous versions of the data and metadata.