License: Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
License information was derived automatically
Geo Open is a per-country IP address geolocation database in MMDB format, and it can be used as a drop-in replacement for software that consumes the MMDB format. Information about the MMDB format: https://maxmind.github.io/MaxMind-DB/. Open source server using Geo Open: https://github.com/adulau/mmdb-server. Open source library to read MMDB files: https://github.com/maxmind/MaxMind-DB-Reader-python. Historical dataset: https://cra.circl.lu/opendata/geo-open/. The database is automatically generated from public BGP AS announcements matched to country codes; precision is therefore at the country level.
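As a minimal lookup sketch using the maxmind/MaxMind-DB-Reader-python library referenced above; the database file name is an assumption and should be replaced with the file you downloaded:

```python
# Minimal sketch: look up the country for an IP address in a Geo Open MMDB file.
# The file name "GeoOpen-Country.mmdb" is an assumption.
import maxminddb  # pip install maxminddb

with maxminddb.open_database("GeoOpen-Country.mmdb") as reader:
    record = reader.get("8.8.8.8")  # returns the stored record for this IP, or None
    print(record)
```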
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Open Context (https://opencontext.org) publishes free and open access research data for archaeology and related disciplines. An open source (but bespoke) Django (Python) application supports these data publishing services. The software repository is here: https://github.com/ekansa/open-context-py
The Open Context team runs ETL (extract, transform, load) workflows to import data contributed by researchers from various source relational databases and spreadsheets. Open Context uses a PostgreSQL (https://www.postgresql.org) relational database to manage these imported data in a graph-style schema. The Open Context Python application interacts with the PostgreSQL database via the Django object-relational mapper (ORM).
In 2023, the Open Context team finished migrating from a legacy database schema to a revised and refactored schema with stricter referential integrity and better consistency across tables. During this process, the Open Context team de-duplicated records, cleaned some metadata, and redacted attribute data left over from records that had been incompletely deleted in the legacy schema.
This database dump includes all Open Context data organized with the legacy schema (table names starting with the 'oc_' or 'link_' prefixes) along with all Open Context data after cleanup and migration to the new database schema (table names starting with 'oc_all_'). The binary media files referenced by these structured data records are stored elsewhere. Binary media files for some projects, still in preparation, are not yet archived with long-term digital repositories.
These data comprehensively reflect the structured data currently published and publicly available on Open Context. Other data (such as user and group information) used to run the website are not included.
IMPORTANT
This database dump contains data from roughly 180 different projects. Each project dataset has its own metadata and citation expectations. If you use these data, you must cite each data contributor appropriately, not just this Zenodo archived database dump.
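As a hedged sketch of working with the dump once it has been restored into a local PostgreSQL instance, the query below lists the migrated tables by the 'oc_all_' prefix documented above; the connection parameters are illustrative assumptions.

```python
# Minimal sketch: list Open Context tables in the new schema after restoring the dump.
# Connection parameters are assumptions for illustration only.
import psycopg2

conn = psycopg2.connect(dbname="opencontext", user="postgres", host="localhost")
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'public' AND table_name LIKE 'oc_all_%' "
        "ORDER BY table_name"
    )
    for (name,) in cur.fetchall():
        print(name)
conn.close()
```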
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
All advisories acknowledged by GitHub are stored as individual files in this repository. They are formatted in the Open Source Vulnerability (OSV) format.
You can submit a pull request to this database (see Contributions) to change or update the information in each advisory.
Pull requests will be reviewed and either merged or closed by our internal security advisory curation team. If the advisory originated from a GitHub repository, we will also @mention the original publisher for optional commentary.
We add advisories to the GitHub Advisory Database from the following sources:
- Security advisories reported on GitHub
- The National Vulnerability Database
- The npm Security Advisories Database
- The FriendsOfPHP Database
- The Go Vulnerability Database
- The Python Packaging Advisory Database
- The Ruby Advisory Database
- The RustSec Advisory Database
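As an illustration of consuming these advisory files, the sketch below reads a single OSV-formatted JSON file and prints a few commonly present fields; the file path is a placeholder, and fields are accessed defensively because their presence varies between advisories.

```python
# Minimal sketch: read one OSV-formatted advisory file and print common fields.
# "advisory.json" is a placeholder path.
import json

with open("advisory.json", encoding="utf-8") as fh:
    advisory = json.load(fh)

print(advisory.get("id"))       # advisory identifier (e.g. a GHSA ID)
print(advisory.get("summary"))  # short human-readable summary
for affected in advisory.get("affected", []):
    package = affected.get("package", {})
    print(package.get("ecosystem"), package.get("name"))
```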
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Open Context (https://opencontext.org) publishes free and open access research data for archaeology and related disciplines. An open source (but bespoke) Django (Python) application supports these data publishing services. The software repository is here: https://github.com/ekansa/open-context-py
The Open Context team runs ETL (extract, transform, load) workflows to import data contributed by researchers from various source relational databases and spreadsheets. Open Context uses a PostgreSQL (https://www.postgresql.org) relational database to manage these imported data in a graph-style schema. The Open Context Python application interacts with the PostgreSQL database via the Django object-relational mapper (ORM).
This database dump includes all published structured data used by Open Context (table names starting with 'oc_all_'). The binary media files referenced by these structured data records are stored elsewhere. Binary media files for some projects, still in preparation, are not yet archived with long-term digital repositories.
These data comprehensively reflect the structured data currently published and publicly available on Open Context. Other data (such as user and group information) used to run the website are not included.
IMPORTANT
This database dump contains data from more than 190 different projects. Each project dataset has its own metadata and citation expectations. If you use these data, you must cite each data contributor appropriately, not just this Zenodo archived database dump.
License: GNU General Public License v3.0 (https://www.gnu.org/licenses/gpl-3.0-standalone.html)
This record is a global open-source passenger air traffic dataset primarily dedicated to the research community.
It gives the seating capacity available on each origin-destination route for a given year, 2019, along with the associated aircraft and airline when this information is available.
Context on the original work is given in the related article (https://journals.open.tudelft.nl/joas/article/download/7201/5683) and on the associated GitHub page (https://github.com/AeroMAPS/AeroSCOPE/).
A simple data exploration interface will be available at www.aeromaps.eu/aeroscope.
The dataset was created by aggregating various available open-source databases with limited geographical coverage. It was then completed using a route database created by parsing Wikipedia and Wikidata, on which the traffic volume was estimated using a machine learning algorithm (XGBoost) trained on traffic and socio-economic data.
The dataset was gathered to allow highly aggregated analyses of air traffic at the continental or country level. At the route level, accuracy is limited, as discussed in the associated article, and improper usage could lead to erroneous analyses.
Each data entry represents an (Origin-Destination-Operator-Aircraft type) tuple.
Please refer to the support article for more details (see above).
The dataset contains the following columns:
Please cite the support paper instead of the dataset itself.
Salgas, A., Sun, J., Delbecq, S., Planès, T., & Lafforgue, G. (2023). Compilation of an open-source traffic and CO2 emissions dataset for commercial aviation. Journal of Open Aviation Science. https://doi.org/10.59490/joas.2023.7201
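As a hedged illustration of the per-route structure described above, the sketch below aggregates seat capacity by origin-destination pair with pandas; the file name and column names are assumptions, so check the support article for the actual schema.

```python
# Minimal sketch: total 2019 seat capacity per origin-destination route.
# File name and column names ("origin", "destination", "seats") are assumptions.
import pandas as pd

df = pd.read_csv("aeroscope_2019.csv")
seats_per_route = (
    df.groupby(["origin", "destination"])["seats"]
    .sum()
    .sort_values(ascending=False)
)
print(seats_per_route.head())
```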
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
WikiDBs is an open-source corpus of 100,000 relational databases. We aim to support research on tabular representation learning on multi-table data. The corpus is based on Wikidata and is designed to reflect characteristics of real-world databases.
WikiDBs was published as a spotlight paper in the Datasets & Benchmarks track at NeurIPS 2024.
WikiDBs contains the database schemas, as well as table contents. The database tables are provided as CSV files, and each database schema as JSON. The 100,000 databases are available in five splits, containing 20k databases each. In total, around 165 GB of disk space are needed for the full corpus. We also provide a script to convert the databases into SQLite.
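As a hedged sketch of loading one database from the corpus, the snippet below reads a schema JSON and one table CSV with pandas; the directory layout and file names are assumptions, so adapt them to the actual structure of the split you downloaded.

```python
# Minimal sketch: load one WikiDBs database's schema (JSON) and one table (CSV).
# Paths below are assumptions about the corpus layout.
import json
import pandas as pd

with open("wikidbs/split_1/database_00001/schema.json", encoding="utf-8") as fh:
    schema = json.load(fh)
print(type(schema))

table = pd.read_csv("wikidbs/split_1/database_00001/tables/example_table.csv")
print(table.head())
```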
This dataverse hosts the data repository of the article entitled "Open Source Software as Digital Platforms to Innovate". It contains databases and R code that replicate the main results of the article. The article contains a detailed description of how these databases were constructed and how they are organized.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This repository introduces a dataset of obverse and reverse images of 319 unique Schlage SC1 keys, labeled with each key's bitting code. The data are provided in HDF5 format as aligned arrays, where the Nth index of each array represents the Nth key, with keys sorted in ascending order by bitting code:
- /bittings: each key's 1-9 bitting code, recorded from the shoulder through the tip of the key; uint8 of shape (319, 5)
- /obverse: obverse image of each key; uint8 of shape (319, 512, 512, 3)
- /reverse: reverse image of each key; uint8 of shape (319, 512, 512, 3)
Full dataset details are available on GitHub: https://github.com/alexxke/keynet
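A minimal reading sketch with h5py, using the array paths and shapes documented above; the HDF5 file name is an assumption.

```python
# Minimal sketch: read the aligned key arrays from the HDF5 file.
# "keys.hdf5" is an assumed file name.
import h5py

with h5py.File("keys.hdf5", "r") as f:
    bittings = f["/bittings"][:]  # (319, 5) uint8, shoulder through tip
    obverse = f["/obverse"][:]    # (319, 512, 512, 3) uint8 images
    reverse = f["/reverse"][:]    # (319, 512, 512, 3) uint8 images

# The Nth index of each array refers to the same physical key.
print(bittings[0], obverse.shape, reverse.shape)
```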
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Data sources for Badger, an open source budget execution and data analysis tool for federal budget analysts at the Environmental Protection Agency. Badger is based on WPF and .NET 6 and is written in C#.
Databases play a critical role in environmental data analysis: they provide a structured system to store, organize, and efficiently retrieve large amounts of data, acting as the central repository that analysts query to extract meaningful insights. Badger provides the following providers to store and analyze data locally.
bin - Binaries are included in the bin folder due to the complex Baby setup required. Don't empty this folder.
bin/storage - HTML and JS required for the downloads manager and custom error pages
_Environmental...
License: https://www.nist.gov/open/license
Data here contain and describe an open-source structured query language (SQLite) portable database containing high resolution mass spectrometry data (MS1 and MS2) for per- and polyfluorinated alkyl substances (PFAS) and associated metadata regarding their measurement techniques, quality assurance metrics, and the samples from which they were produced. These data are stored in a format adhering to the Database Infrastructure for Mass Spectrometry (DIMSpec) project. That project produces and uses databases like this one, providing a complete toolkit for non-targeted analysis. See more information about the full DIMSpec code base, as well as these data for demonstration purposes, at GitHub (https://github.com/usnistgov/dimspec) or view the full User Guide for DIMSpec (https://pages.nist.gov/dimspec/docs). Files of most interest contained here include the database file itself (dimspec_nist_pfas.sqlite) as well as an entity relationship diagram (ERD.png) and data dictionary (DIMSpec for PFAS_1.0.1.20230615_data_dictionary.json) to elucidate the database structure and assist in interpretation and use.
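A minimal sketch of opening the provided SQLite file and listing its tables; consult the ERD and data dictionary mentioned above for what each table contains.

```python
# Minimal sketch: inspect the tables in the DIMSpec PFAS database.
import sqlite3

conn = sqlite3.connect("dimspec_nist_pfas.sqlite")
rows = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
for (name,) in rows:
    print(name)
conn.close()
```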
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The TerraDS dataset provides a comprehensive collection of Terraform programs written in the HashiCorp Configuration Language (HCL). As Infrastructure as Code (IaC) gains popularity for managing cloud infrastructure, Terraform has become one of the leading tools due to its declarative nature and widespread adoption. However, a lack of publicly available, large-scale datasets has hindered systematic research on Terraform practices. TerraDS addresses this gap by compiling metadata and source code from 62,406 open-source repositories with valid licenses. This dataset aims to foster research on best practices, vulnerabilities, and improvements in IaC methodologies.
The TerraDS dataset is organized into two main components: a SQLite database containing metadata and an archive of source code (~335 MB). The metadata, captured in a structured format, includes information about repositories, modules, and resources:
1. Repository Data:
2. Module Data:
3. Resource Data:
The provided archive contains the source code of the 62,406 repositories to allow further analysis based on the actual source rather than the metadata alone. Researchers can thus access the permissively licensed repositories and conduct studies on the executable HCL code.
The "HCL Dataset Tools" file contains a snapshot of the https://github.com/prg-grp/hcl-dataset-tools repository, kept for long-term archival reasons. The tools in this repository can be used to reproduce this dataset.
One of the tools, "RepositorySearcher", can be used to fetch metadata for various other GitHub API queries, not only Terraform code. While the RepositorySearcher supports other types of repository search, the remaining tools are focused on Terraform repositories.
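As a hedged sketch of exploring the metadata component, the snippet below opens the SQLite file and reports a row count per table; the database file name is an assumption, and table names are discovered from sqlite_master rather than assumed.

```python
# Minimal sketch: row counts for every table in the TerraDS metadata database.
# "terrads.sqlite" is an assumed file name.
import sqlite3

conn = sqlite3.connect("terrads.sqlite")
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
)]
for table in tables:
    (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    print(table, count)
conn.close()
```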
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Please explore the provided notebook to learn about the dataset:
🔗 IPinfo IP to Country ASN Demo Notebook for Kaggle
Detailed documentation for the IP to Country ASN database can be found on IPinfo's documentation page. Database samples are also available on IPinfo's GitHub repo.
🔗 Documentation: https://ipinfo.io/developers/ip-to-country-asn-database
| Field Name | Example | Description |
|---|---|---|
| start_ip | 194.87.139.0 | The starting IP address of an IP address range |
| end_ip | 194.87.139.255 | The ending IP address of an IP address range |
| country | NL | The ISO 3166 country code of the location |
| country_name | Netherlands | The name of the country |
| continent | EU | The continent code of the country |
| continent_name | Europe | The name of the continent |
| asn | AS1239 | The Autonomous System Number |
| as_name | Sprint | The name of the AS (Autonomous System) organization |
| as_domain | sprint.net | The official domain or website of the AS organization |
The IPinfo IP to Country ASN database is a subset of IPinfo's IP to Geolocation database and the ASN database.
The database provides daily updates, complete IPv4 and IPv6 coverage, and full accuracy, just like its parent databases. The database is crucial for:
Whether you are running a web service or a server connected to the internet, this enterprise-ready database should be part of your tech stack.
In this dataset, we include 3 files:
- country_asn.csv → For reverse IP look-ups and running IP-based analytics
- country_asn.mmdb → For IP address information look-ups
- ips.txt → Sample IP addresses

Using the CSV dataset
As the CSV dataset has a relatively small size (~120 MB), any dataframe library or database should be adequate. However, we recommend that users not use the CSV file for IP address lookups. For everything else, feel free to explore the CSV file format.
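A minimal pandas sketch using the columns from the table above, for example counting how many IP ranges fall in each country; the CSV path assumes the file sits in the working directory.

```python
# Minimal sketch: count IP ranges per country in country_asn.csv.
import pandas as pd

df = pd.read_csv("country_asn.csv")
ranges_per_country = (
    df.groupby("country_name")["start_ip"].count().sort_values(ascending=False)
)
print(ranges_per_country.head(10))
```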
Using the MMDB dataset
The MMDB dataset requires a third-party MMDB reader library, which enables you to look up IP addresses at the most efficient speed possible. As this is a third-party library, you should install it via pip install in your notebook, which requires an internet connection to be enabled in your notebook settings.
Please see our attached demo notebook for usage examples.
IP to Country ASN supports many diverse solutions, and we encourage you to share your ideas with the Kaggle community!
The geolocation data is produced by IPinfo's ProbeNet, a globe-spanning probe network infrastructure with 400+ servers. The ASN data is collected from public datasets such as WHOIS and geofeeds, and is then parsed and structured to make it more data-friendly.
See the Data Provenance section below to learn more.
Please note that this Kaggle Dataset is not updated daily. We recommend users download our free IP to Country ASN database from IPinfo's website directly for daily updates.
AS Organization - An AS (Autonomous System) organization is an organization that owns a block or range of IP addresses. These IP addresses are allocated to them by the Regional Internet Registries (RIRs). Even though an AS organization may own IP addresses, it sometimes does not operate them directly and may rent them out to other organizations. You can check out our IP to Company data or ASN database to learn more about them.
ASN - ASN or Autonomous System Number is the unique identifying number assigned to an AS organization.
IP to ASN - Get ASN and AS organizat...
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Glycoinformatics is a critical resource for the study of glycobiology, and glycobiology is a necessary component for understanding the complex interface between intra- and extracellular spaces. Despite this, there is limited software available to scientists studying these topics, requiring each to create fundamental data structures and representations anew for each of their applications. This leads to poor uptake of standardization and loss of focus on the real problems. We present glypy, a library written in Python for reading, writing, manipulating, and transforming glycans at several levels of precision. In addition to understanding several common formats for textual representation of glycans, the library also provides application programming interfaces (APIs) for major community databases, including GlyTouCan and UnicarbKB. The library is freely available under the Apache 2 license with source code available at https://github.com/mobiusklein/ and documentation at https://glypy.readthedocs.io/.
2019 Novel Coronavirus COVID-19 (2019-nCoV) Visual Dashboard and Map:
https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6
Downloadable data:
https://github.com/CSSEGISandData/COVID-19
Additional Information about the Visual Dashboard:
https://systems.jhu.edu/research/public-health/ncov
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
MetaSBT database with viral metagenome-assembled genomes (MAGs) from the MGV database.
It comprises 26,285 reference genomes and 190,756 MAGs organised into 40,729 species, 13,916 genera, 8,862 families, 6,551 orders, 3,014 classes, and 17 phyla.
MetaSBT public databases are indexed in the MetaSBT-DBs repository on GitHub at https://github.com/cumbof/MetaSBT-DBs and they are produced with the open-source MetaSBT framework available at https://github.com/cumbof/MetaSBT.
Databases can be installed locally with the unpack command of MetaSBT as documented in the official wiki at https://github.com/cumbof/MetaSBT/wiki.
Other commands to interact with the database are available through the MetaSBT framework and are documented in the same wiki.
License: GNU Lesser General Public License v3.0 (http://www.gnu.org/licenses/lgpl-3.0.html)
Dataset Contents:
This dataset compiles translations and commentaries of the Bhagavad Gita, an ancient Indian scripture, provided by various authors. The Bhagavad Gita is a 700-verse Hindu scripture that is part of the Indian epic Mahabharata. It is revered for its philosophical and spiritual teachings.
The dataset includes translations and commentaries in different languages, such as Sanskrit, Hindi, English, and more. It features the insights and interpretations of renowned authors and scholars who have contributed to the understanding of the Bhagavad Gita's teachings. The dataset encompasses multiple dimensions of the scripture, including translations, transliterations, commentaries, and explanations.
Featured Authors:
Bhagavad Gita API:
In addition to the dataset, an API named the Bhagavad Gita API has been developed to provide easy access to the Bhagavad Gita's verses, translations, and commentaries. This API allows developers and enthusiasts to access the teachings of the Bhagavad Gita programmatically. The API can be accessed at https://bhagavadgitaapi.in/.
API Source Code:
The source code for the Bhagavad Gita API is available on GitHub at https://github.com/vedicscriptures/bhagavad-gita-api. It provides an open-source resource for those interested in contributing or understanding how the API works.
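As a purely illustrative sketch of calling the API over HTTP, the request below assumes a verse endpoint of the form /slok/{chapter}/{verse}/; this route is a guess, so check the API documentation in the GitHub repository for the actual paths.

```python
# Minimal sketch: fetch one verse from the Bhagavad Gita API.
# The "/slok/1/1/" route is an assumption; verify it against the API docs.
import requests

response = requests.get("https://bhagavadgitaapi.in/slok/1/1/", timeout=10)
response.raise_for_status()
print(response.json())
```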
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
The /derivatives folder contains the pre-split training/validation/testing datasets, each containing unique subjects with the following:
descoteaux07 basis: https://dipy.org/documentation/1.3.0./theory/sh_basis/
All tractograms contain compressed streamlines to reduce disk space, which means that the step size is variable. If your algorithm requires a fixed step size, you have to manually resample the streamlines, which can be done using SCILPY (https://github.com/scilus/scilpy) and the scil_resample_streamlines.py script: https://github.com/scilus/scilpy/blob/master/scripts/scil_resample_streamlines.py
To evaluate a candidate tractogram, refer to: https://github.com/scil-vital/TractoInferno/
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The files provided here are the supporting data and code files for the analyses presented in "An open source cyberinfrastructure for collecting, processing, storing and accessing high temporal resolution residential water use data," an article in Environmental Modelling and Software (https://doi.org/10.1016/j.envsoft.2021.105137). The data included in this resource were processed using the Cyberinfrastructure for Intelligent Water Supply (CIWS) (https://github.com/UCHIC/CIWS-Server), and collected using the CIWS-Node (https://github.com/UCHIC/CIWS-WM-Node) data logging device. CIWS is an open-source, modular, generalized architecture designed to automate the process from data collection to analysis and presentation of high temporal resolution residential water use data. The CIWS-Node is a low-cost device capable of collecting this type of data on magnetically driven water meters. The code included allows replication of the analyses presented in the journal paper, and the raw data included allow for extension of the analyses conducted. The journal paper presents the architecture design and a prototype implementation for CIWS that was built using existing open-source technologies, including smart meters, databases, and services. Two case studies were selected to test functionalities of CIWS, including push and pull data models within single family and multi-unit residential contexts, respectively. CIWS was tested for scalability and performance within our design constraints and proved to be effective within both case studies. All CIWS elements and the case study data described are freely available for re-use.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
CVEfixes is a comprehensive vulnerability dataset that is automatically collected and curated from Common Vulnerabilities and Exposures (CVE) records in the public U.S. National Vulnerability Database (NVD). The goal is to support data-driven security research based on source code and source code metrics related to fixes for CVEs in the NVD by providing detailed information at different interlinked levels of abstraction, such as the commit-, file-, and method level, as well as the repository- and CVE level.
At the initial release, the dataset covers all published CVEs up to 9 June 2021. All open-source projects that were reported in CVE records in the NVD in this time frame and had publicly available git repositories were fetched and considered for the construction of this vulnerability dataset. The dataset is organized as a relational database and covers 5,495 vulnerability-fixing commits in 1,754 open source projects for a total of 5,365 CVEs across 180 different Common Weakness Enumeration (CWE) types. The dataset includes the source code before and after the fix for 18,249 files and 50,322 functions.
This repository includes the SQL dump of the dataset, as well as the JSON for the CVEs and XML of the CWEs at the time of collection. The complete process has been documented in the paper "CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software", which is published in the Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE '21). You will find a copy of the paper in the Doc folder.
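As a hedged sketch of getting started with the SQL dump, the snippet below imports it into a local SQLite database and lists the resulting tables; whether the dump targets SQLite and the exact file names are assumptions, so check the documentation in the Doc folder.

```python
# Minimal sketch: import the CVEfixes SQL dump into SQLite and list its tables.
# File names and the target engine (SQLite) are assumptions.
import sqlite3

conn = sqlite3.connect("CVEfixes.db")
with open("CVEfixes.sql", encoding="utf-8") as fh:
    conn.executescript(fh.read())

for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
    print(name)
conn.close()
```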
Citation and Zenodo links
Please cite this work by referring to the published paper:
@inproceedings{bhandari2021:cvefixes,
title = {{CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software}},
booktitle = {{Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE '21)}},
author = {Bhandari, Guru and Naseer, Amara and Moonen, Leon},
year = {2021},
pages = {10},
publisher = {{ACM}},
doi = {10.1145/3475960.3475985},
copyright = {Open Access},
isbn = {978-1-4503-8680-7},
language = {en}
}
The dataset has been released on Zenodo with DOI:10.5281/zenodo.4476563. The GitHub repository containing the code to automatically collect the dataset can be found at https://github.com/secureIT-project/CVEfixes, released with DOI:10.5281/zenodo.5111494.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The Global Power Plant Database is a comprehensive, open source database of power plants around the world. It centralizes power plant data to make it easier to navigate, compare and draw insights for one’s own analysis. The database covers approximately 35,000 power plants from 167 countries and includes thermal plants (e.g. coal, gas, oil, nuclear, biomass, waste, geothermal) and renewables (e.g. hydro, wind, solar). Each power plant is geolocated and entries contain information on plant capacity, generation, ownership, and fuel type. It will be continuously updated as data becomes available. The methodology for the dataset creation is given in the World Resources Institute publication "A Global Database of Power Plants". Data updates may occur without associated updates to this manuscript. The database can be visualized on Resource Watch together with hundreds of other datasets. The database is available for immediate download and use through the WRI Open Data Portal. Associated code for the creation of the dataset can be found on GitHub. The bleeding-edge version of the database (which may contain substantial differences from the release you are viewing) is available on GitHub as well. To be informed of important database releases in the future, please sign up for our newsletter.