Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Training dataset used in my Master's thesis (TFM).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive—the largest publicly available archive of FOSS source code with accompanying development history—all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license blobs, plus several portable CSV files for metadata, referencing blobs via cryptographic checksums.
For more details see the included README file and companion paper:
Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In Proceedings of the 2022 Mining Software Repositories Conference (MSR 2022), 23-24 May 2022, Pittsburgh, Pennsylvania, United States. ACM, 2022.
If you use this dataset for research purposes, please acknowledge its use by citing the above paper.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains responses to a survey on open data and open access amongst members of the Weizenbaum Institute for the Networked Society which ran from 30 August to 21 September 2021. The survey elicited 39 valid responses out of 181 potential respondents working at the institute.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains Zenodo's published open access records and communities metadata, including entries marked by the Zenodo staff as spam and deleted.
The datasets are gzip-compressed JSON-lines files, where each line is a JSON object representing a Zenodo record or community.
Records dataset
Filename: zenodo_open_metadata_{ date of export }.jsonl.gz
Each object contains the terms: part_of, thesis, description, doi, meeting, imprint, references, recid, alternate_identifiers, resource_type, journal, related_identifiers, title, subjects, notes, creators, communities, access_right, keywords, contributors, publication_date
which correspond to the fields with the same name available in Zenodo's record JSON Schema at https://zenodo.org/schemas/records/record-v1.0.0.json.
In addition, some terms have been altered:
The term files contains a list of dictionaries containing filetype, size, and filename only.
The term license contains a short Zenodo ID of the license (e.g. "cc-by").
Communities dataset
Filename: zenodo_community_metadata_{ date of export }.jsonl.gz
Each object contains the terms: id, title, description, curation_policy, page
which correspond to the fields with the same name available in Zenodo's community creation form.
Notes for all datasets
For each object the term spam contains a boolean value indicating whether the given record/community was marked as spam content by Zenodo staff.
Top-level terms whose values were missing from the metadata may contain a null value.
A smaller uncompressed random sample of 200 JSON lines is also included for each dataset to test and get familiar with the format without having to download the entire dataset.
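As a quick illustration, the export can be read record by record with the Python standard library alone. This is a minimal sketch; the date suffix in the filename is a placeholder for an actual export date:

import gzip
import json

# Hypothetical export filename; substitute the actual date of export
PATH = "zenodo_open_metadata_2021-01-01.jsonl.gz"

kept = 0
with gzip.open(PATH, mode="rt", encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line)  # one JSON object per line
        if record.get("spam"):     # entries flagged by Zenodo staff
            continue
        kept += 1
print(f"{kept} non-spam records")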
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset supports the SPARC Europe study that investigates copyright retention and self-archiving policies amongst publishers and records publisher policies on open licensing, also in relation to the Plan S requirements on rights and licensing. The report can be found here: 10.5281/zenodo.4046624
Data reuse information
These data were compiled by Chris Morrison and Jane Secker as part of the above-mentioned research project. For more information on how the data were produced and verified, see section 3.3 of the report (10.5281/zenodo.4046624). There are two sets of data with supporting information that are made available under different reuse terms.
1. DOAJ dataset
The file "DOAJ data May2020" was extracted from the Directory of Open Access Journals on 10 May 2020 and represents an analysis of the data (which is (c) 2020 DOAJ) by Chris Morrison and Jane Secker. The dataset is licensed under CC BY-SA following the DOAJ data licence.
2. 10 large publishers dataset
The file "Academic Publishers Copyright Policies and Practices Table" was created by Chris Morrison and Jane Secker and is provided under CC0 dedication.
The remaining files provide information from publisher websites and directly from publisher representatives and remain (c) of each respective organisation.
Community Data License Agreement – Sharing 1.0 (CDLA-Sharing-1.0) https://cdla.io/sharing-1-0
Overview
The files in this repository compose the Charge-dependent, Reproducible, Accessible, Forcefield-dependent, and Temperature-dependent Exploratory Database (CRAFTED) of adsorption isotherms. This dataset contains the simulation of CO2 and N2 adsorption isotherms on 690 metal-organic frameworks taken from the CoRE-MOF-2014 database and 667 covalent organic frameworks taken from the CURATED-COFs database. The simulations were performed with two force fields (UFF and DREIDING), six partial charge schemes (no charges, Qeq, EQeq, DDEC, MPNN, and PACMOF), and three temperatures (273, 298, and 323 K).
Contents
CIF_FILES/ contains 6 folders (NEUTRAL, DDEC, EQeq, Qeq, MPNN, and PACMOF), each one with 1357 CIF files
FORCEFIELDS/ contains 2 folders (UFF and DREIDING) with the definition of the forcefields
INPUT_FILES/ contains 97,704 input files for the GCMC simulations
ISOTHERM_FILES/ contains 97,704 adsorption isotherms resulting from the GCMC simulations
ENTHALPY_FILES/ contains 97,704 enthalpies of adsorption from the isotherms
RAC_DBSCAN/ contains the RAC and geometrical descriptors to perform the t-SNE + DBSCAN analysis
Licenses
The 690 MOF-related CIF files in the DDEC folder were downloaded from CoRE-MOF-2014 and are licensed under the terms of the Creative Commons Attribution 4.0 International license (CC-BY-4.0). The 667 COF-related CIF files in the NEUTRAL folder were downloaded from CURATED-COFs and are licensed under the terms of the MIT license (MIT).
Dalar Nazarian, Jeffrey S. Camp, & David S. Sholl. (2016). Computation-Ready Experimental Metal-Organic Framework (CoRE MOF) 2014 DDEC Database [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3986573
Ongari, Daniele, et al. "Building a consistent and reproducible database for adsorption evaluation in covalent–organic frameworks." ACS Central Science 5.10 (2019): 1663-1675. https://doi.org/10.1021/acscentsci.9b00619
Ongari, Daniele, Leopold Talirz, and Berend Smit. "Too many materials and too many applications: An experimental problem waiting for a computational solution." ACS Central Science 6.11 (2020): 1890-1900. https://doi.org/10.1021/acscentsci.0c00988
The CO2.def and N2.def forcefield files were downloaded from RASPA and are licensed under the terms of the MIT license.
Dubbeldam, David, et al. "RASPA: molecular simulation software for adsorption and diffusion in flexible nanoporous materials." Molecular Simulation 42.2 (2016): 81-101. https://doi.org/10.1080/08927022.2015.1010082
The remaining MOF-related CIF files in the PACMOF, MPNN, Qeq, EQeq and NEUTRAL folders were derived from those in the DDEC folder and are licensed under the terms of the Creative Commons Attribution 4.0 International license (CC-BY-4.0) from the CoRE-MOF-2014 subset. The remaining COF-related CIF files in the PACMOF, MPNN, Qeq, EQeq and DDEC folders were derived from those in the NEUTRAL folder and are licensed under the terms of the MIT license (MIT) from the CURATED-COFs subset. All remaining files were created by us and are licensed under the terms of the CDLA-Sharing-1.0 license.
Software requirements
In order to create a Python environment capable of running the Jupyter notebooks, please install conda and execute: conda env create --file environment.yml
Usage instructions
Execute the command below to run JupyterLab in the appropriate Python environment: conda run --name crafted jupyter-lab
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all “license files” extracted from a snapshot of the Software Heritage archive taken on 2022-04-25. (Other, possibly more recent, versions of the datasets can be found at https://annex.softwareheritage.org/public/dataset/license-blobs/).
In this context, a license file is a unique file content (or “blob”) that appeared in a software origin archived by Software Heritage under a file name commonly used to ship licenses in software projects. Some name examples are: COPYING, LICENSE, NOTICE, COPYRIGHT, etc. The exact file name pattern used to select the blobs contained in the dataset can be found in the SQL query file 01-select-blobs.sql. Note that the file was not required to be at the project root, because project subdirectories can contain different licenses than the top-level one, and we wanted to include those too.
Format
The dataset is organized as follows:
blobs.tar.zst: a Zst-compressed tarball containing deduplicated license blobs, one per file. The tarball contains 6’859’189 blobs, for a total uncompressed size on disk of 66 GiB.
The blobs are organized in a sharded directory structure that contains files named like blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02, where:
blobs/ is the root directory containing all license blobs
8624bcdae55baeef00cd11d5dfcfa60f68710a02 is the SHA1 checksum of a specific license blob, a copy of the GPL3 license in this case. Each license blob is ultimately named with its SHA1:
$ head -n 3 blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007
$ sha1sum blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
8624bcdae55baeef00cd11d5dfcfa60f68710a02  blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
86 and 24 are, respectively, the first and second group of two hex digits in the blob SHA1
One blob is missing because its size (313 MB) prevented its inclusion (it was originally a tarball containing source code):
swh:1:cnt:61bf63793c2ee178733b39f8456a796b72dc8bde,1340d4e2da173c92d432026ecdc54b4859fe9911,"AUTHORS"
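After extracting the tarball, the sharding scheme above can be reproduced and checked with a few lines of Python; this is a minimal sketch using only the standard library:

import hashlib
from pathlib import Path

def blob_path(sha1_hex, root="blobs"):
    # blobs/86/24/8624bc... : first two hex-digit groups, then full SHA1
    return Path(root) / sha1_hex[0:2] / sha1_hex[2:4] / sha1_hex

def verify_blob(sha1_hex, root="blobs"):
    # Re-hash the file content and compare with its name
    data = blob_path(sha1_hex, root).read_bytes()
    return hashlib.sha1(data).hexdigest() == sha1_hex

print(blob_path("8624bcdae55baeef00cd11d5dfcfa60f68710a02"))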
blobs-sample20k.tar.zst: analogous to blobs.tar.zst, but containing “only” 20’000 randomly selected license blobs
license-blobs.csv.zst a Zst-compressed CSV index of all the blobs in the dataset. Each line in the index (except the first one, which contains column headers) describes a license blob and is in the format SWHID,SHA1,NAME, for example:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING"
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GPL3"
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GLP-3"
where:
SWHID: the Software Heritage persistent identifier of the blob. It can be used to retrieve and cross-reference the license blob via the Software Heritage archive, e.g., at: https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
SHA1: the blob SHA1, that can be used to cross-reference blobs in the blobs/ directory
NAME: a file name given to the license blob in a given software origin. As the same license blob can have different names in different contexts, the index contains multiple entries for the same blob with different names, as is the case in the example above (yes, one of those has a typo in it, but it’s an original typo from some repository!).
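The index can be streamed without decompressing it to disk. A minimal sketch, assuming the third-party zstandard Python package and that the CSV header row matches the column names above:

import csv
import io
import zstandard  # third-party package, assumed installed

with open("license-blobs.csv.zst", "rb") as fh:
    reader = zstandard.ZstdDecompressor().stream_reader(fh)
    text = io.TextIOWrapper(reader, encoding="utf-8")
    for row in csv.DictReader(text):
        # Each row links one (blob, file name) pair; SHA1s can recur
        print(row["SWHID"], row["SHA1"], row["NAME"])
        break  # remove this to scan the whole index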
blobs-fileinfo.csv.zst a Zst-compressed CSV mapping from blobs to basic file information in the format: SHA1,MIME_TYPE,ENCODING,LINE_COUNT,WORD_COUNT,SIZE, where:
SHA1: blob SHA1
MIME_TYPE: blob MIME type, as detected by libmagic
ENCODING: blob character encoding, as detected by libmagic
LINE_COUNT: number of lines in the blob (only for textual blobs with UTF8 encoding)
WORD_COUNT: number of words in the blob (only for textual blobs with UTF8 encoding)
SIZE: blob size in bytes
blobs-scancode.csv.zst a Zst-compressed CSV mapping from blobs to software license detected in them by ScanCode, in the format: SHA1,LICENSE,SCORE, where:
SHA1: blob SHA1
LICENSE: license detected in the blob, as an SPDX identifier (or ScanCode identifier for non-SPDX-indexed licenses)
SCORE: confidence score in the result, as a decimal number between 0 and 100
There may be zero or arbitrarily many lines for each blob.
blobs-scancode.ndjson.zst a Zst-compressed line-delimited JSON, containing a superset of the information in blobs-scancode.csv.zst. Each line is a JSON dictionary with three keys:
sha1: blob SHA1
licenses: output of scancode.api.get_licenses(..., min_score=0)
copyrights: output of scancode.api.get_copyrights(...)
There is exactly one line for each blob. licenses and copyrights keys are omitted for files not detected as plain text.
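As above, this file can be processed as a stream. A minimal sketch, again assuming the zstandard package, that counts blobs with at least one license detection:

import io
import json
import zstandard  # third-party package, assumed installed

detected = total = 0
with open("blobs-scancode.ndjson.zst", "rb") as fh:
    text = io.TextIOWrapper(
        zstandard.ZstdDecompressor().stream_reader(fh), encoding="utf-8")
    for line in text:
        entry = json.loads(line)  # keys: sha1, licenses, copyrights
        total += 1
        if entry.get("licenses"):  # omitted for non-plain-text blobs
            detected += 1
print(detected, "of", total, "blobs have ScanCode license results")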
blobs-origins.csv.zst a Zst-compressed CSV mapping of where license blobs come from. Each line in the index associates a license blob with one of its origins, in the format SWHID URL, for example:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 https://github.com/pombreda/Artemis
Note that a license blob can come from many different places; only an arbitrary (and somewhat random) one is listed in this mapping.
If no origin URL is found in the Software Heritage archive, then a blank is used instead. This happens when the origin was still being loaded when the dataset was generated, or when the loader process crashed before completing the ingestion of the blob’s origin.
blobs-nb-origins.csv.zst a Zst-compressed CSV mapping of how many origins of each blob are known to Software Heritage. Each line in the index associates a license blob with this count, in the format SWHID NUMBER, for example:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 2822260
Two blobs are missing because the computation crashed:
swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
swh:1:cnt:8b137891791fe96927ad78e64b0aad7bded08bdc
This issue will be fixed in a future version of the dataset.
blobs-earliest.csv.zst a Zst-compressed CSV mapping from blobs to information about their (earliest) known occurrence(s) in the archive. Format: SWHID EARLIEST_SWHID EARLIEST_TS OCCURRENCES, where:
SWHID: blob SWHID
EARLIEST_SWHID: SWHID of the earliest known commit containing the blob
EARLIEST_TS: timestamp of the earliest known commit containing the blob, as a Unix time integer
OCCURRENCES: number of known commits containing the blob
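The Unix timestamps can be turned into dates while scanning. A minimal sketch for one parsed line, under the assumption that fields are space-separated as in the format above (the example line is hypothetical):

from datetime import datetime, timezone

line = ("swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 "
        "swh:1:rev:0000000000000000000000000000000000000000 "
        "1183068981 42")  # hypothetical example entry
swhid, earliest_swhid, ts, occurrences = line.split()
# Convert the Unix time integer into a UTC date
print(swhid, datetime.fromtimestamp(int(ts), tz=timezone.utc), occurrences)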
replication-package.tar.gz: code and scripts used to produce the dataset
licenses-annotated-sample.tar.gz: ground truth, i.e., manually annotated random sample of license blobs, with details about the kind of information they contain.
Changes since the 2021-03-23 dataset
More input data, due to the SWH archive growing: more origins in supported forges and package managers; and support for more forges and package managers. See the SWH Archive Changelog for details.
Values in the NAME column of license-blobs.csv.zst are quoted, as some file names now contain commas.
Replication package now contains all the steps needed to reproduce all artefacts including the licenseblobs/fetch.py script.
blobs-nb-origins.csv.zst is added.
blobs-origins.csv.zst is now generated using the first origin returned by swh-graph’s leaves endpoint, instead of its randomwalk endpoint. This should have no impact on the result, other than a different distribution of “random” origins being picked.
blobs-origins.csv.zst was missing ~10% of its results in previous versions of the dataset, due to errors and/or timeouts in its generation; this is now down to 0.02% (1254 of the 6859445 unique blobs). Blobs with no known origins are now present, with a blank instead of a URL.
blobs-earliest.csv.zst was missing ~10% of its results in previous versions of the dataset. It is complete now.
blobs-scancode.csv.zst is generated with a newer scancode-toolkit version (31.2.1)
blobs-scancode.ndjson.zst is added.
Errata
A file name .tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 was present in the initial version of the dataset (published on 2022-11-07). It was removed on 2022-11-09 using these two commands:
pv blobs-fileinfo.csv.zst | zstdcat | grep -v ".tmp" | zstd -19
pv blobs.tar.zst | zstdcat | tar --delete blobs/13/40/.tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 | zstd -19 -T12
The total uncompressed size was announced as 84 GiB based on the physical size on ext4, but it is actually 66 GiB.
Citation
If you use this dataset for research purposes, please acknowledge its use by citing one or both of the following papers:
Jesús M. González-Barahona, Sergio Raúl Montes León, Gregorio Robles, Stefano Zacchiroli. The Software Heritage License Dataset (2022 Edition). Empirical Software Engineering, Volume 28, Number 6, Article number 147 (2023).
Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In Proceedings of the 2022 Mining Software Repositories Conference (MSR 2022), 23-24 May 2022, Pittsburgh, Pennsylvania, United States. ACM, 2022.
References
The dataset has been built using primarily the data sources described in the following papers:
Roberto Di Cosmo, Stefano Zacchiroli. Software Heritage: Why and How to Preserve Software Source Code. In Proceedings of iPRES 2017: 14th International Conference on Digital Preservation, Kyoto, Japan, 25-29 September 2017.
Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli. The Software Heritage Graph Dataset: Public software development under one roof. In Proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019.
Errata (v2, 2024-01-09)
licenses-annotated-sample.tar.gz: some comments not intended for publication were removed, and 4
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview:
This collection contains three synthetic datasets produced by gpt-4o-mini for sentiment analysis and PDT (Product Desirability Toolkit) testing. Each dataset contains 1000 hypothetical software product reviews, with the aim of producing a diversity of sentiment and text. The datasets were created as part of the research described in:
Hastings, J.D., Weitl-Harms, S., Doty, J., Myers, Z. L., and Thompson, W., “Utilizing Large Language Models to Synthesize Product Desirability Datasets,” in Proceedings of the 2024 IEEE International Conference on Big Data (BigData-24), Workshop on Large Language and Foundation Models (WLLFM-24), Dec. 2024. https://arxiv.org/abs/2411.13485
Briefly, each row in the datasets was produced as follows:
1) Word+Review: The LLM selected a word and synthesized a review that would align with a random target sentiment.
2) Review+Word: The LLM produced a review to align with the target sentiment score, and then selected a word appropriate for the review.
3) Supply-Word: A word was supplied to the LLM, which scored it, and a review was produced to align with that score.
For sentiment analysis and PDT testing, the two columns of main interest across the datasets are likely 'Selected Word' and 'Hypothetical Review'.
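For instance, a minimal pandas sketch for a first look at those two columns (the CSV filename is a placeholder for one of the three datasets):

import pandas as pd

df = pd.read_csv("word_review.csv")  # hypothetical filename

words = df["Selected Word"]
reviews = df["Hypothetical Review"]

print(words.value_counts().head())   # most frequently selected words
print(reviews.str.len().describe())  # distribution of review lengths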
License:
This data is licensed under the CC Attribution 4.0 International license and may be used freely with credit given. Cite as:
Hastings, J., Weitl-Harms, S., Doty, J., Myers, Z., & Thompson, W. (2024). Synthetic Product Desirability Datasets for Sentiment Analysis Testing (1.0.0). Zenodo. https://doi.org/10.5281/zenodo.14188456
This preliminary dataset contains the application/vnd.zenodo.v1+json JSON records of Zenodo deposits as retrieved on 2019-09-16.
Files
zenodo-records-json-2019-09-16.tar.xz (Zenodo JSON records): XZ-compressed tar archive of individual JSON records as retrieved from Zenodo. Filenames reflect the record, e.g. 1310621.json was retrieved from https://zenodo.org/api/records/1310621 using content negotiation for application/vnd.zenodo.v1+json.
zenodo-records-json-2019-09-16-filtered.jsonseq.xz (concatenated Zenodo JSON records): XZ-compressed RFC 7464 JSON sequence stream, readable by jq. Concatenation of Zenodo JSON records; order not significant.
zenodo-records.sh (retrieve Zenodo JSON records): a retrospectively created Bash shell script that shows the commands used to retrieve the JSON files and concatenate them to jsonseq.
ro-crate-metadata.jsonld: RO-Crate 0.2 structured metadata.
ro-crate-preview.html: browser rendering of the RO-Crate structured metadata.
README.md: this dataset description.
License
This dataset is provided under the Apache License, version 2.0:
Copyright 2019 The University of Manchester
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
CC0 for Zenodo metadata
The Zenodo metadata in zenodo-records-json-2019-09-16.tar.xz is reused under the terms of https://creativecommons.org/publicdomain/zero/1.0/
Reproducibility
To retrieve the Zenodo JSON it was deemed necessary to use undocumented parts of the Zenodo API. From the Zenodo source code it was identified that the REST template https://zenodo.org/api/records/{pid_value} could be used with pid_value as the numeric part of the OAI-PMH identifier, e.g. for oai:zenodo.org:1310621 the Zenodo JSON can be retrieved at https://zenodo.org/api/records/1310621.
The JSON API supports content negotiation; the content types supported as of 2019-09-20 include:
application/vnd.zenodo.v1+json giving the Zenodo record in Zenodo's internal JSON schema (v1)
application/ld+json giving JSON-LD linked data using the http://schema.org/ vocabulary
application/x-datacite-v41+xml giving DataCite v4 XML
application/marcxml+xml giving MARC 21 XML
Using these (currently) undocumented parts of the Zenodo API thus avoids the need for HTML scraping while also giving individual complete records that are suitable to redistribute as records in a filtered dataset.
This preliminary exploration will be adapted into a reproducible CWL workflow, for now included as the Bash script zenodo-records.sh. Execution time was about 3 days from a server on the University of Manchester network with a single 1 Gbps network link. The script does the following:
Retrieve each of the first 3.5 million Zenodo records as Zenodo JSON by iterating over possible numeric IDs (the maximum ID 3450000 was estimated from "Recent uploads")
Filter the list to exclude records that are not found, moved or deleted; the presence of the key conceptrecid is used as the marker
Use jq to ensure each JSON record is on a single line
Join the JSON files using the ASCII Record Separator (RS, 0x1e) to make an application/json-seq JSON text sequence stream
Save the JSON stream as a single XZ-compressed file
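For a single record, the content negotiation described above can be reproduced with a short Python sketch (assuming the third-party requests package; note the API endpoints were undocumented as of 2019):

import requests

# Retrieve one Zenodo record in the internal v1 JSON schema via
# content negotiation (record ID taken from the example above)
resp = requests.get(
    "https://zenodo.org/api/records/1310621",
    headers={"Accept": "application/vnd.zenodo.v1+json"})
resp.raise_for_status()
record = resp.json()
print(record.get("conceptrecid"))  # marker key used for filtering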
This is the dataset used in the respective research work. The abstract is available below.
If you want to cite this work, please use:
Georgia M. Kapitsaki, Maria Papoutsoglou, Daniel German and Lefteris Angelis, What do developers talk about open source software licensing?, to appear in the Proceedings of the Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2020.
Free and open source software has gained a lot of momentum in the industry and the research community. Open source licenses determine the rules under which the open source software can be further used and distributed. Previous works have examined the usage of open source licenses in the framework of specific projects or online social coding platforms, examining developers' specific licensing views for specific software. However, the questions practitioners ask about licenses and licensing as captured in Question and Answer websites also constitute an important aspect toward understanding practitioners' general license and licensing concerns. In this paper, we investigate open source license discussions using data from the Software Engineering, Open Source and Law Stack Exchange sites that contain relevant data. We describe the process used for the data collection and analysis, and discuss the main results. Our results indicate that clarifications about specific licenses and specific license terms are required. The results can be useful for developers, educators and license authors.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Within the ESA funded WorldCereal project we have built an open harmonized reference data repository at global extent for model training or product validation in support of land cover and crop type mapping. Data from 2017 onwards were collected from many different sources and then harmonized, annotated and evaluated. These steps are explained in the harmonization protocol (10.5281/zenodo.7584463). This protocol also clarifies the naming convention of the shape files and the WorldCereal attributes (LC, CT, IRR, valtime and sampleID) that were added to the original data sets.
This publication includes those harmonized data sets of which the original data set was published under the CC-BY-SA license or a license similar to CC-BY-SA. See document "_In-situ-data-World-Cereal - license - CC-BY-SA.pdf" for an overview of the original data sets.
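As an illustration, a minimal sketch for inspecting the harmonized attributes in one of the shape files (assuming the third-party geopandas package; the filename is a placeholder, see the harmonization protocol for the actual naming convention):

import geopandas as gpd  # third-party package, assumed installed

gdf = gpd.read_file("harmonized_example.shp")  # hypothetical filename

# The WorldCereal attributes added during harmonization
print(gdf[["LC", "CT", "IRR", "valtime", "sampleID"]].head())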
TChard is a dataset for TCR-peptide/-pMHC binding prediction.
It includes more than 500,000 samples, derived from heterogeneous sources, such as IEDB, VDJdb, McPAS-TCR and the NetTCR-2.0 repository. Since the samples included in the TChard dataset derive from different datasets, we include a license column in the CSV file. The license column specifies from which original dataset a sample comes, and which license applies.
Experiments on this dataset can be found in the GitHub repository: https://github.com/nec-research/tc-hard
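For example, a minimal pandas sketch that tallies samples by originating dataset and license (the CSV filename is a placeholder):

import pandas as pd

df = pd.read_csv("tchard.csv")  # hypothetical filename

# The license column records the original dataset each sample comes
# from and the license that applies to it
print(df["license"].value_counts())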
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please see the original paper at https://doi.org/10.1038/s41597-021-00833-x for more information about this dataset.
This package contains a dataset described by Donchev et al. [1]: DES370K. It is presented as a CSV (DES370K.csv) and .mol files (geometries//DES370K_.mol). Also included is a metadata file DES370K_meta.csv, which contains a set of long-form column descriptions replicating those in [1], as well as data types and units (when applicable) for each column.
DES370K.csv : Full dataset, containing interaction energies calculated using CCSD(T), MP2, HF, and SAPT0, as well as dimer geometries.
DES370K_meta.csv : Long-form descriptions of the columns in DES370K, as well as datatypes and units (when applicable) for each column
LICENSE.txt : License for using and redistributing the datasets provided.
README.md : This file.
The datasets are presented as CSVs as a compromise between human-readability, format uniformity, and parsing speed. While an almost uncountable number of packages exist to read CSV files, we recommend using the Python data analysis library pandas.
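For example, a minimal sketch for loading the dataset and its column metadata (assuming both files are in the working directory):

import pandas as pd

df = pd.read_csv("DES370K.csv")         # full dataset
meta = pd.read_csv("DES370K_meta.csv")  # column descriptions/units

print(df.shape)     # rows are dimers with their interaction energies
print(meta.head())  # long-form description, dtype, unit per column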
[1] A. G. Donchev, A. G. Taube, E. Decolvenaere, C. Hargus, R. T. McGibbon, K.-H. Law, B. A. Gregersen, J.-L. Li, K. Palmo, K. Siva, M. Bergdorf, J. L. Klepeis, and D. E. Shaw. "Quantum chemical benchmark database of dimer interaction energies at a “gold standard” level of accuracy"
[2] R. T. McGibbon, A. G. Taube, A. G. Donchev, K. Siva, F. Fernandez, C. Hargus, K.-H. Law, J.L. Klepeis, and D. E. Shaw. "Improving the accuracy of Moller-Plesset perturbation theory with neural networks"
[3] M. K. Kesharwani, A. Karton, N. Sylvetsky, J. M. L. Nitai. "The S66 non-covalent interactions benchmark reconsidered using explicitly correlated methods near the basis set limit."
DESRES DATA SETS LICENSE AGREEMENT
Copyright 2020, D. E. Shaw Research. All rights reserved.
Redistribution and use of electronic structure data released in the DESRES
Data Sets (DES370K, DES15K, DES5M, DESS66, and DESS66x8) with or without
modification, is permitted provided that the following conditions are met:
* Redistributions of the data must retain the above copyright notice,
this list of conditions, and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions, and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of D. E. Shaw Research nor the names of its contributors may
be used to endorse or promote products derived from this software without
specific prior written permission.
THIS SOFTWARE AND DATA ARE PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE AND/OR DATA, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains Zenodo's published open access records and communities metadata, including entries marked by the Zenodo staff as spam and deleted.
The datasets are gzip-compressed JSON-lines files, where each line is a JSON object representing a Zenodo record or community.
Records dataset
Filename: zenodo_open_metadata_{ date of export }.jsonl.gz
Each object contains the terms: part_of, thesis, description, doi, meeting, imprint, references, recid, alternate_identifiers, resource_type, journal, related_identifiers, title, subjects, notes, creators, communities, access_right, keywords, contributors, publication_date
which correspond to the fields with the same name available in Zenodo's record JSON Schema at https://zenodo.org/schemas/records/record-v1.0.0.json.
In addition, some terms have been altered:
The term files contains a list of dictionaries containing filetype, size, and filename only.
The term license contains a short Zenodo ID of the license (e.g. "cc-by").
Communities dataset
Filename: zenodo_community_metadata_{ date of export }.jsonl.gz
Each object contains the terms: id, title, description, curation_policy, page
which correspond to the fields with the same name available in Zenodo's community creation form.
Notes for all datasets
For each object the term spam contains a boolean value indicating whether the given record/community was marked as spam content by Zenodo staff.
Top-level terms whose values were missing from the metadata may contain a null value.
A smaller uncompressed random sample of 200 JSON lines is also included for each dataset to test and get familiar with the format without having to download the entire dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises data from the GEMStat database that are available under an open data license (CC BY 4.0 or equivalent). It is made available on the Zenodo repository. GEMStat provides access to freshwater quality data. The data are voluntarily provided by countries and organizations worldwide within the framework of the GEMS/Water Programme of the United Nations Environment Programme (UNEP). The dataset includes more than 20 million measurements from over 13,000 stations, covers more than 600 different parameters, and spans the time period from 1906 to 2023. This represents over 70% of all GEMStat data; further data are only available under more restrictive data licenses. GEMStat is operated by the GEMS/Water programme of UNEP and hosted at the International Centre for Water Resources and Global Change (ICWRGC) and the German Federal Institute of Hydrology (BfG). The data in GEMStat are provided by the National Hydrological Services of UN member states.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
What is in this release?
In this release you will find data about software distributed and/or crafted publicly on the Internet. You will find information about its development, its distribution and its relationship with other software included as a dependency. You will not find any information about the individuals who create and maintain these projects.
Further information and documentation on this data set can be found at https://libraries.io/data
For enquiries please contact data@libraries.io
This dataset contains seven CSV files:
Projects
A project is a piece of software available on any one of the 34 package managers supported by Libraries.io.
Versions
A Libraries.io version is an immutable published version of a Project from a package manager. Not all package managers have a concept of publishing versions, often relying directly on tags/branches from a revision control tool.
Tags
A tag is equivalent to a tag in a revision control system. Tags are sometimes used instead of Versions where a package manager does not use the concept of versions. Tags are often semantic version numbers.
Dependencies
Dependencies describe the relationship between a project and the software it builds upon. Dependencies belong to a Version; each Version can have different sets of dependencies. Dependencies point at a specific Version or range of versions of other projects.
Repositories
A Libraries.io repository represents a publicly accessible source code repository from either github.com, gitlab.com or bitbucket.org. Repositories are distinct from Projects; they are not distributed via a package manager and are typically an application for end users rather than a component to build upon.
Repository dependencies
A repository dependency is a dependency upon a Version from a package manager that has been specified in a manifest file, either as a manually added dependency committed by a user or as a generated dependency listed in a lockfile that has been automatically created by a package manager and committed.
Projects with related Repository fields
This is an alternative projects export that denormalizes a project's related source code repository inline to reduce the need to join between two data sets.
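As an illustration of working across the exports, a minimal pandas sketch joining versions onto projects (filenames and join-key column names are hypothetical; check the actual CSV headers after download):

import pandas as pd

projects = pd.read_csv("projects.csv")  # hypothetical filename
versions = pd.read_csv("versions.csv")  # hypothetical filename

# Attach each version to its parent project; "ID" and "Project ID"
# are assumed column names, not confirmed against the export
merged = versions.merge(projects, left_on="Project ID", right_on="ID",
                        suffixes=("_version", "_project"))
print(merged.head())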
Licence
This dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International Licence.
This licence provides the user with the freedom to use, adapt and redistribute this data. In return the user must publish any derivative work under a similarly open licence, attributing Libraries.io as a data source. The full text of the licence is included in the data.
Access, Attribution and Citation
The dataset is available to download from Zenodo at https://zenodo.org/record/2536573.
Please attribute Libraries.io as a data source by including the words ‘Includes data from Libraries.io, a project from Tidelift’ and reference the Digital Object identifier: 10.5281/zenodo.3626071
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ROOTS Subset: roots_ar_openiti_proc
OpenITI
Dataset uid: openiti_proc
Description
A corpus of Arabic texts collected from Islamic books on different websites.
Homepage
https://zenodo.org/record/4075046
Licensing
non-commercial use cc-by-nc-sa-4.0: Creative Commons Attribution Non Commercial Share Alike 4.0 International
By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this… See the full description on the dataset page: https://huggingface.co/datasets/bigscience-data/roots_ar_openiti_proc.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains part of the imaging data for the Universal Lesion Segmentation Challenge (ULS23). It contains lesion volumes of interest (VOIs) for previously released data. It consists of 333 kidney lesions from the KiTS21 dataset, 2,246 lung lesions from LIDC-IDRI, and 888 liver lesions from the LiTS challenge. The annotations are made available through the challenge repository on GitHub. The Universal Lesion Segmentation 2023 (ULS23) data is licensed under CC BY-NC-SA 4.0.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains:
A kmz file for Penmanshiel wind farm in the UK (for opening in e.g. Google Earth)
Static data including turbine coordinates and turbine details (rated power, rotor diameter, hub height, etc.)
10-minute SCADA and events data from the 14 Senvion MM82s at Penmanshiel wind farm, grouped by year from 2016 to mid-2021, which was extracted from our secondary SCADA system (Greenbyte). Note that not all signals are available for the entire period, and there is no turbine WT03
Data mappings from primary SCADA to csv signal names
Site substation/PMU meter data where available for the same period
Site fiscal/grid meter data where available for the same period
The dataset has been released by Cubico Sustainable Investments Ltd under a CC-BY-4.0 open data license and is provided as is. However, please provide any feedback you might have on the dataset and format of the data. I'll try and add or link to additional file formats that might be easier to work with (e.g. for use with specific analysis software), and update this dataset periodically (e.g. twice a year), but please prompt me as required.
Feel free to use the data according to the license, however, it would be helpful to me if you could let me know where, how and why you are using the data, so that I can highlight this to the business (and renewables industry) and hopefully promote similar data sharing initiatives. I am particularly interested in performance analysis/improvement opportunities, how the dataset can be augmented with other (open) datasets, and sharing more generally within the renewables industry.
If you would like to get access to other datasets we may hold (e.g. more recent data, data from our other sites, ~30s resolution data, etc.), please let me know, and, if you have any questions or want to discuss open data and this or other initiatives, please contact me and I will endeavour to help.
I would like to thank Cubico's Senior Legal Advisor & Compliance Officer, IT Director, UK Asset Management Team, Executive Committee and my manager for supporting this initiative, as well as our partners GLIL for agreeing to release this data under an open license. I would also like to thank those I have talked to during the process of releasing this data under an open license and the encouragement and advice I have had on the way.
For contact my email address is charlie.plumley@cubicoinvest.com.
You can also access data from Kelmarsh wind farm here.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset for the study of Potential Code Borrowing and License Violations in Java Projects on GitHub. The dataset is based on the Public Git Archive and consists of projects on GitHub that have at least 50 stars and have at least one line in Java. A total of 23,378 projects are listed here that we downloaded for analysis on June 1st, 2019.