Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Training dataset used in my Master's thesis (TFM).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive—the largest publicly available archive of FOSS source code with accompanying development history—all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license blobs, plus several portable CSV files for metadata, referencing blobs via cryptographic checksums.
For more details see the included README file and companion paper:
Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In Proceedings of the 2022 Mining Software Repositories Conference (MSR 2022), 23-24 May 2022, Pittsburgh, Pennsylvania, United States. ACM, 2022.
If you use this dataset for research purposes, please acknowledge its use by citing the above paper.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains responses to a survey on open data and open access amongst members of the Weizenbaum Institute for the Networked Society which ran from 30 August to 21 September 2021. The survey elicited 39 valid responses out of 181 potential respondents working at the institute.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains Zenodo's published open access records and communities metadata, including entries marked by the Zenodo staff as spam and deleted.
The datasets are gzip-compressed JSON-lines files, where each line is a JSON object representing a Zenodo record or community.
Records dataset
Filename: zenodo_open_metadata_{ date of export }.jsonl.gz
Each object contains the terms: part_of, thesis, description, doi, meeting, imprint, references, recid, alternate_identifiers, resource_type, journal, related_identifiers, title, subjects, notes, creators, communities, access_right, keywords, contributors, publication_date
which correspond to the fields with the same name available in Zenodo's record JSON Schema at https://zenodo.org/schemas/records/record-v1.0.0.json.
In addition, some terms have been altered:
The term files contains a list of dictionaries containing filetype, size, and filename only.
The term license contains a short Zenodo ID of the license (e.g. "cc-by").
Communities dataset
Filename: zenodo_community_metadata_{ date of export }.jsonl.gz
Each object contains the terms: id, title, description, curation_policy, page
which correspond to the fields with the same name available in Zenodo's community creation form.
Notes for all datasets
For each object the term spam contains a boolean value indicating whether the given record/community was marked as spam content by Zenodo staff.
Top-level terms whose values were missing from the metadata may contain a null value.
A smaller uncompressed random sample of 200 JSON lines is also included for each dataset to test and get familiar with the format without having to download the entire dataset.
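As a quick illustration, the export can be read record by record with the Python standard library alone. This is a minimal sketch; the date suffix in the filename is a placeholder for an actual export date:

import gzip
import json

# Hypothetical export filename; substitute the actual date of export
PATH = "zenodo_open_metadata_2021-01-01.jsonl.gz"

kept = 0
with gzip.open(PATH, mode="rt", encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line)  # one JSON object per line
        if record.get("spam"):     # entries flagged by Zenodo staff
            continue
        kept += 1
print(f"{kept} non-spam records")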
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset supports the SPARC Europe study that investigates copyright retention and self-archiving policies amongst publishers and records publisher policies on open licensing, also in relation to the Plan S requirements on rights and licensing. The report can be found here: 10.5281/zenodo.4046624
Data reuse information
These data were compiled by Chris Morrison and Jane Secker as part of the above-mentioned research project. For more information on how the data were produced and verified, see section 3.3 of the report (10.5281/zenodo.4046624). There are two sets of data with supporting information that are made available under different reuse terms.
1. DOAJ dataset
The file "DOAJ data May2020" was extracted from the Directory of Open Access Journals on 10 May 2020 and represents an analysis of the data (which is (c) 2020 DOAJ) by Chris Morrison and Jane Secker. The dataset is licensed under CC BY-SA following the DOAJ data licence.
2. 10 large publishers dataset
The file "Academic Publishers Copyright Policies and Practices Table" was created by Chris Morrison and Jane Secker and is provided under CC0 dedication.
The remaining files provide information from publisher websites and directly from publisher representatives and remain (c) of each respective organisation.
Community Data License Agreement – Sharing 1.0 (CDLA-Sharing-1.0) https://cdla.io/sharing-1-0
Overview
The files in this repository compose the Charge-dependent, Reproducible, Accessible, Forcefield-dependent, and Temperature-dependent Exploratory Database (CRAFTED) of adsorption isotherms. This dataset contains the simulation of CO2 and N2 adsorption isotherms on 690 metal-organic frameworks taken from the CoRE-MOF-2014 database and 667 covalent organic frameworks taken from the CURATED-COFs database. The simulations were performed with two force fields (UFF and DREIDING), six partial charge schemes (no charges, Qeq, EQeq, DDEC, MPNN, and PACMOF), and three temperatures (273, 298, and 323 K).
Contents
CIF_FILES/ contains 6 folders (NEUTRAL, DDEC, EQeq, Qeq, MPNN, and PACMOF), each one with 1357 CIF files
FORCEFIELDS/ contains 2 folders (UFF and DREIDING) with the definition of the forcefields
INPUT_FILES/ contains 97,704 input files for the GCMC simulations
ISOTHERM_FILES/ contains 97,704 adsorption isotherms resulting from the GCMC simulations
ENTHALPY_FILES/ contains 97,704 enthalpies of adsorption from the isotherms
RAC_DBSCAN/ contains the RAC and geometrical descriptors to perform the t-SNE + DBSCAN analysis
Licenses
The 690 MOF-related CIF files in the DDEC folder were downloaded from CoRE-MOF-2014 and are licensed under the terms of the Creative Commons Attribution 4.0 International license (CC-BY-4.0). The 667 COF-related CIF files in the NEUTRAL folder were downloaded from CURATED-COFs and are licensed under the terms of the MIT license (MIT).
Dalar Nazarian, Jeffrey S. Camp, & David S. Sholl. (2016). Computation-Ready Experimental Metal-Organic Framework (CoRE MOF) 2014 DDEC Database [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3986573
Ongari, Daniele, et al. "Building a consistent and reproducible database for adsorption evaluation in covalent–organic frameworks." ACS Central Science 5.10 (2019): 1663-1675. https://doi.org/10.1021/acscentsci.9b00619
Ongari, Daniele, Leopold Talirz, and Berend Smit. "Too many materials and too many applications: An experimental problem waiting for a computational solution." ACS Central Science 6.11 (2020): 1890-1900. https://doi.org/10.1021/acscentsci.0c00988
The CO2.def and N2.def forcefield files were downloaded from RASPA and are licensed under the terms of the MIT license.
Dubbeldam, David, et al. "RASPA: molecular simulation software for adsorption and diffusion in flexible nanoporous materials." Molecular Simulation 42.2 (2016): 81-101. https://doi.org/10.1080/08927022.2015.1010082
The remaining MOF-related CIF files in the PACMOF, MPNN, Qeq, EQeq and NEUTRAL folders were derived from those in the DDEC folder and are licensed under the terms of the Creative Commons Attribution 4.0 International license (CC-BY-4.0) from the CoRE-MOF-2014 subset. The remaining COF-related CIF files in the PACMOF, MPNN, Qeq, EQeq and DDEC folders were derived from those in the NEUTRAL folder and are licensed under the terms of the MIT license (MIT) from the CURATED-COFs subset. All remaining files were created by us and are licensed under the terms of the CDLA-Sharing-1.0 license.
Software requirements
In order to create a Python environment capable of running the Jupyter notebooks, please install conda and execute: conda env create --file environment.yml
Usage instructions
Execute the command below to run JupyterLab in the appropriate Python environment: conda run --name crafted jupyter-lab
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all “license files” extracted from a snapshot of the Software Heritage archive taken on 2022-04-25. (Other, possibly more recent, versions of the datasets can be found at https://annex.softwareheritage.org/public/dataset/license-blobs/).
In this context, a license file is a unique file content (or “blob”) that appeared in a software origin archived by Software Heritage under a file name commonly used to ship licenses in software projects. Some name examples are: COPYING, LICENSE, NOTICE, COPYRIGHT, etc. The exact file name pattern used to select the blobs contained in the dataset can be found in the SQL query file 01-select-blobs.sql. Note that the file was not required to be at the project root, because project subdirectories can contain different licenses than the top-level one, and we wanted to include those too.
Format
The dataset is organized as follows:
blobs.tar.zst: a Zst-compressed tarball containing deduplicated license blobs, one per file. The tarball contains 6’859’189 blobs, for a total uncompressed size on disk of 66 GiB.
The blobs are organized in a sharded directory structure that contains files named like blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02, where:
blobs/ is the root directory containing all license blobs
8624bcdae55baeef00cd11d5dfcfa60f68710a02 is the SHA1 checksum of a specific license blob, a copy of the GPL3 license in this case. Each license blob is ultimately named with its SHA1:
$ head -n 3 blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007
$ sha1sum blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
8624bcdae55baeef00cd11d5dfcfa60f68710a02  blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
86 and 24 are, respectively, the first and second group of two hex digits in the blob SHA1
One blob is missing because its size (313 MB) prevented its inclusion (it was originally a tarball containing source code):
swh:1:cnt:61bf63793c2ee178733b39f8456a796b72dc8bde,1340d4e2da173c92d432026ecdc54b4859fe9911,"AUTHORS"
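After extracting the tarball, the sharding scheme above can be reproduced and checked with a few lines of Python; this is a minimal sketch using only the standard library:

import hashlib
from pathlib import Path

def blob_path(sha1_hex, root="blobs"):
    # blobs/86/24/8624bc... : first two hex-digit groups, then full SHA1
    return Path(root) / sha1_hex[0:2] / sha1_hex[2:4] / sha1_hex

def verify_blob(sha1_hex, root="blobs"):
    # Re-hash the file content and compare with its name
    data = blob_path(sha1_hex, root).read_bytes()
    return hashlib.sha1(data).hexdigest() == sha1_hex

print(blob_path("8624bcdae55baeef00cd11d5dfcfa60f68710a02"))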
blobs-sample20k.tar.zst: analogous to blobs.tar.zst, but containing “only” 20’000 randomly selected license blobs
license-blobs.csv.zst a Zst-compressed CSV index of all the blobs in the dataset. Each line in the index (except the first one, which contains column headers) describes a license blob and is in the format SWHID,SHA1,NAME, for example:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING"
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GPL3"
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GLP-3"
where:
SWHID: the Software Heritage persistent identifier of the blob. It can be used to retrieve and cross-reference the license blob via the Software Heritage archive, e.g., at: https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
SHA1: the blob SHA1, that can be used to cross-reference blobs in the blobs/ directory
NAME: a file name given to the license blob in a given software origin. As the same license blob can have different names in different contexts, the index contains multiple entries for the same blob with different names, as is the case in the example above (yes, one of those has a typo in it, but it’s an original typo from some repository!).
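The index can be streamed without decompressing it to disk. A minimal sketch, assuming the third-party zstandard Python package and that the CSV header row matches the column names above:

import csv
import io
import zstandard  # third-party package, assumed installed

with open("license-blobs.csv.zst", "rb") as fh:
    reader = zstandard.ZstdDecompressor().stream_reader(fh)
    text = io.TextIOWrapper(reader, encoding="utf-8")
    for row in csv.DictReader(text):
        # Each row links one (blob, file name) pair; SHA1s can recur
        print(row["SWHID"], row["SHA1"], row["NAME"])
        break  # remove this to scan the whole index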
blobs-fileinfo.csv.zst a Zst-compressed CSV mapping from blobs to basic file information in the format: SHA1,MIME_TYPE,ENCODING,LINE_COUNT,WORD_COUNT,SIZE, where:
SHA1: blob SHA1
MIME_TYPE: blob MIME type, as detected by libmagic
ENCODING: blob character encoding, as detected by libmagic
LINE_COUNT: number of lines in the blob (only for textual blobs with UTF8 encoding)
WORD_COUNT: number of words in the blob (only for textual blobs with UTF8 encoding)
SIZE: blob size in bytes
blobs-scancode.csv.zst a Zst-compressed CSV mapping from blobs to software license detected in them by ScanCode, in the format: SHA1,LICENSE,SCORE, where:
SHA1: blob SHA1
LICENSE: license detected in the blob, as an SPDX identifier (or ScanCode identifier for non-SPDX-indexed licenses)
SCORE: confidence score in the result, as a decimal number between 0 and 100
There may be zero or arbitrarily many lines for each blob.
blobs-scancode.ndjson.zst a Zst-compressed line-delimited JSON, containing a superset of the information in blobs-scancode.csv.zst. Each line is a JSON dictionary with three keys:
sha1: blob SHA1
licenses: output of scancode.api.get_licenses(..., min_score=0)
copyrights: output of scancode.api.get_copyrights(...)
There is exactly one line for each blob. licenses and copyrights keys are omitted for files not detected as plain text.
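As above, this file can be processed as a stream. A minimal sketch, again assuming the zstandard package, that counts blobs with at least one license detection:

import io
import json
import zstandard  # third-party package, assumed installed

detected = total = 0
with open("blobs-scancode.ndjson.zst", "rb") as fh:
    text = io.TextIOWrapper(
        zstandard.ZstdDecompressor().stream_reader(fh), encoding="utf-8")
    for line in text:
        entry = json.loads(line)  # keys: sha1, licenses, copyrights
        total += 1
        if entry.get("licenses"):  # omitted for non-plain-text blobs
            detected += 1
print(detected, "of", total, "blobs have ScanCode license results")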
blobs-origins.csv.zst a Zst-compressed CSV mapping of where license blobs come from. Each line in the index associates a license blob with one of its origins, in the format SWHID URL, for example:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 https://github.com/pombreda/Artemis
Note that a license blob can come from many different places; only an arbitrary (and somewhat random) one is listed in this mapping.
If no origin URL is found in the Software Heritage archive, then a blank is used instead. This happens when the origin was still being loaded when the dataset was generated, or when the loader process crashed before completing the ingestion of the blob’s origin.
blobs-nb-origins.csv.zst a Zst-compressed CSV mapping of how many origins of each blob are known to Software Heritage. Each line in the index associates a license blob with this count, in the format SWHID NUMBER, for example:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 2822260
Two blobs are missing because the computation crashed:
swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
swh:1:cnt:8b137891791fe96927ad78e64b0aad7bded08bdc
This issue will be fixed in a future version of the dataset.
blobs-earliest.csv.zst a Zst-compressed CSV mapping from blobs to information about their (earliest) known occurrence(s) in the archive. Format: SWHID EARLIEST_SWHID EARLIEST_TS OCCURRENCES, where:
SWHID: blob SWHID
EARLIEST_SWHID: SWHID of the earliest known commit containing the blob
EARLIEST_TS: timestamp of the earliest known commit containing the blob, as a Unix time integer
OCCURRENCES: number of known commits containing the blob
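The Unix timestamps can be turned into dates while scanning. A minimal sketch for one parsed line, under the assumption that fields are space-separated as in the format above (the example line is hypothetical):

from datetime import datetime, timezone

line = ("swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 "
        "swh:1:rev:0000000000000000000000000000000000000000 "
        "1183068981 42")  # hypothetical example entry
swhid, earliest_swhid, ts, occurrences = line.split()
# Convert the Unix time integer into a UTC date
print(swhid, datetime.fromtimestamp(int(ts), tz=timezone.utc), occurrences)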
replication-package.tar.gz: code and scripts used to produce the dataset
licenses-annotated-sample.tar.gz: ground truth, i.e., manually annotated random sample of license blobs, with details about the kind of information they contain.
Changes since the 2021-03-23 dataset
More input data, due to the SWH archive growing: more origins in supported forges and package managers; and support for more forges and package managers. See the SWH Archive Changelog for details.
Values in the NAME column of license-blobs.csv.zst are quoted, as some file names now contain commas.
Replication package now contains all the steps needed to reproduce all artefacts including the licenseblobs/fetch.py script.
blobs-nb-origins.csv.zst is added.
blobs-origins.csv.zst is now generated using the first origin returned by swh-graph’s leaves endpoint, instead of its randomwalk endpoint. This should have no impact on the result, other than a different distribution of “random” origins being picked.
blobs-origins.csv.zst was missing ~10% of its results in previous versions of the dataset, due to errors and/or timeouts in its generation; this is now down to 0.02% (1254 of the 6859445 unique blobs). Blobs with no known origins are now present, with a blank instead of a URL.
blobs-earliest.csv.zst was missing ~10% of its results in previous versions of the dataset. It is complete now.
blobs-scancode.csv.zst is generated with a newer scancode-toolkit version (31.2.1)
blobs-scancode.ndjson.zst is added.
Errata
A file name .tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 was present in the initial version of the dataset (published on 2022-11-07). It was removed on 2022-11-09 using these two commands:
pv blobs-fileinfo.csv.zst | zstdcat | grep -v ".tmp" | zstd -19
pv blobs.tar.zst | zstdcat | tar --delete blobs/13/40/.tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 | zstd -19 -T12
The total uncompressed size was announced as 84 GiB based on the physical size on ext4, but it is actually 66 GiB.
Citation
If you use this dataset for research purposes, please acknowledge its use by citing one or both of the following papers:
Jesús M. González-Barahona, Sergio Raúl Montes León, Gregorio Robles, Stefano Zacchiroli. The Software Heritage License Dataset (2022 Edition). Empirical Software Engineering, Volume 28, Number 6, Article number 147 (2023).
Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In Proceedings of the 2022 Mining Software Repositories Conference (MSR 2022), 23-24 May 2022, Pittsburgh, Pennsylvania, United States. ACM, 2022.
References
The dataset has been built using primarily the data sources described in the following papers:
Roberto Di Cosmo, Stefano Zacchiroli. Software Heritage: Why and How to Preserve Software Source Code. In Proceedings of iPRES 2017: 14th International Conference on Digital Preservation, Kyoto, Japan, 25-29 September 2017.
Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli. The Software Heritage Graph Dataset: Public software development under one roof. In Proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019.
Errata (v2, 2024-01-09)
licenses-annotated-sample.tar.gz: some comments not intended for publication were removed, and 4
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview:
This collection contains three synthetic datasets produced by gpt-4o-mini for sentiment analysis and PDT (Product Desirability Toolkit) testing. Each dataset contains 1000 hypothetical software product reviews, with the aim of producing a diversity of sentiment and text. The datasets were created as part of the research described in:
Hastings, J.D., Weitl-Harms, S., Doty, J., Myers, Z. L., and Thompson, W., “Utilizing Large Language Models to Synthesize Product Desirability Datasets,” in Proceedings of the 2024 IEEE International Conference on Big Data (BigData-24), Workshop on Large Language and Foundation Models (WLLFM-24), Dec. 2024. https://arxiv.org/abs/2411.13485
Briefly, each row in the datasets was produced as follows:
1) Word+Review: The LLM selected a word and synthesized a review that would align with a random target sentiment.
2) Review+Word: The LLM produced a review to align with the target sentiment score, and then selected a word appropriate for the review.
3) Supply-Word: A word was supplied to the LLM, which scored it, and a review was produced to align with that score.
For sentiment analysis and PDT testing, the two columns of main interest across the datasets are likely 'Selected Word' and 'Hypothetical Review'.
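For instance, a minimal pandas sketch for a first look at those two columns (the CSV filename is a placeholder for one of the three datasets):

import pandas as pd

df = pd.read_csv("word_review.csv")  # hypothetical filename

words = df["Selected Word"]
reviews = df["Hypothetical Review"]

print(words.value_counts().head())   # most frequently selected words
print(reviews.str.len().describe())  # distribution of review lengths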
License:
This data is licensed under the CC Attribution 4.0 International license and may be used freely with credit given. Cite as:
Hastings, J., Weitl-Harms, S., Doty, J., Myers, Z., & Thompson, W. (2024). Synthetic Product Desirability Datasets for Sentiment Analysis Testing (1.0.0). Zenodo. https://doi.org/10.5281/zenodo.14188456
This preliminary dataset contains the application/vnd.zenodo.v1+json JSON records of Zenodo deposits as retrieved on 2019-09-16.
Files
zenodo-records-json-2019-09-16.tar.xz (Zenodo JSON records): XZ-compressed tar archive of individual JSON records as retrieved from Zenodo. Filenames reflect the record, e.g. 1310621.json was retrieved from https://zenodo.org/api/records/1310621 using content negotiation for application/vnd.zenodo.v1+json.
zenodo-records-json-2019-09-16-filtered.jsonseq.xz (concatenated Zenodo JSON records): XZ-compressed RFC 7464 JSON sequence stream, readable by jq. Concatenation of Zenodo JSON records; order not significant.
zenodo-records.sh (retrieve Zenodo JSON records): a retrospectively created Bash shell script that shows the commands used to retrieve the JSON files and concatenate them to jsonseq.
ro-crate-metadata.jsonld: RO-Crate 0.2 structured metadata.
ro-crate-preview.html: browser rendering of the RO-Crate structured metadata.
README.md: this dataset description.
License
This dataset is provided under the Apache License, version 2.0:
Copyright 2019 The University of Manchester
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
CC0 for Zenodo metadata
The Zenodo metadata in zenodo-records-json-2019-09-16.tar.xz is reused under the terms of https://creativecommons.org/publicdomain/zero/1.0/
Reproducibility
To retrieve the Zenodo JSON it was deemed necessary to use undocumented parts of the Zenodo API. From the Zenodo source code it was identified that the REST template https://zenodo.org/api/records/{pid_value} could be used with pid_value as the numeric part of the OAI-PMH identifier, e.g. for oai:zenodo.org:1310621 the Zenodo JSON can be retrieved at https://zenodo.org/api/records/1310621.
The JSON API supports content negotiation; the content types supported as of 2019-09-20 include:
application/vnd.zenodo.v1+json giving the Zenodo record in Zenodo's internal JSON schema (v1)
application/ld+json giving JSON-LD linked data using the http://schema.org/ vocabulary
application/x-datacite-v41+xml giving DataCite v4 XML
application/marcxml+xml giving MARC 21 XML
Using these (currently) undocumented parts of the Zenodo API thus avoids the need for HTML scraping while also giving individual complete records that are suitable to redistribute as records in a filtered dataset.
This preliminary exploration will be adapted into a reproducible CWL workflow, for now included as the Bash script zenodo-records.sh. Execution time was about 3 days from a server on the University of Manchester network with a single 1 Gbps network link. The script does the following:
Retrieve each of the first 3.5 million Zenodo records as Zenodo JSON by iterating over possible numeric IDs (the maximum ID 3450000 was estimated from "Recent uploads")
Filter the list to exclude records that are not found, moved or deleted; the presence of the key conceptrecid is used as the marker
Use jq to ensure each JSON record is on a single line
Join the JSON files using the ASCII Record Separator (RS, 0x1e) to make an application/json-seq JSON text sequence stream
Save the JSON stream as a single XZ-compressed file
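For a single record, the content negotiation described above can be reproduced with a short Python sketch (assuming the third-party requests package; note the API endpoints were undocumented as of 2019):

import requests

# Retrieve one Zenodo record in the internal v1 JSON schema via
# content negotiation (record ID taken from the example above)
resp = requests.get(
    "https://zenodo.org/api/records/1310621",
    headers={"Accept": "application/vnd.zenodo.v1+json"})
resp.raise_for_status()
record = resp.json()
print(record.get("conceptrecid"))  # marker key used for filtering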
This is the dataset used in the respective research work. The abstract is available below.
If you want to cite this work, please use:
Georgia M. Kapitsaki, Maria Papoutsoglou, Daniel German and Lefteris Angelis, What do developers talk about open source software licensing?, to appear in the Proceedings of the Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2020.
Free and open source software has gained a lot of momentum in the industry and the research community. Open source licenses determine the rules under which the open source software can be further used and distributed. Previous works have examined the usage of open source licenses in the framework of specific projects or online social coding platforms, examining developers' specific licensing views for specific software. However, the questions practitioners ask about licenses and licensing as captured in Question and Answer websites also constitute an important aspect toward understanding practitioners' general license and licensing concerns. In this paper, we investigate open source license discussions using data from the Software Engineering, Open Source and Law Stack Exchange sites that contain relevant data. We describe the process used for the data collection and analysis, and discuss the main results. Our results indicate that clarifications about specific licenses and specific license terms are required. The results can be useful for developers, educators and license authors.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Within the ESA funded WorldCereal project we have built an open harmonized reference data repository at global extent for model training or product validation in support of land cover and crop type mapping. Data from 2017 onwards were collected from many different sources and then harmonized, annotated and evaluated. These steps are explained in the harmonization protocol (10.5281/zenodo.7584463). This protocol also clarifies the naming convention of the shape files and the WorldCereal attributes (LC, CT, IRR, valtime and sampleID) that were added to the original data sets.
This publication includes those harmonized data sets of which the original data set was published under the CC-BY-SA license or a license similar to CC-BY-SA. See document "_In-situ-data-World-Cereal - license - CC-BY-SA.pdf" for an overview of the original data sets.
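As an illustration, a minimal sketch for inspecting the harmonized attributes in one of the shape files (assuming the third-party geopandas package; the filename is a placeholder, see the harmonization protocol for the actual naming convention):

import geopandas as gpd  # third-party package, assumed installed

gdf = gpd.read_file("harmonized_example.shp")  # hypothetical filename

# The WorldCereal attributes added during harmonization
print(gdf[["LC", "CT", "IRR", "valtime", "sampleID"]].head())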
TChard is a dataset for TCR-peptide/-pMHC binding prediction.
It includes more than 500,000 samples, derived from heterogeneous sources, such as IEDB, VDJdb, McPAS-TCR and the NetTCR-2.0 repository. Since the samples included in the TChard dataset derive from different datasets, we include a license column in the CSV file. The license column specifies from which original dataset a sample comes, and which license applies.
Experiments on this dataset can be found in the GitHub repository: https://github.com/nec-research/tc-hard
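For example, a minimal pandas sketch that tallies samples by originating dataset and license (the CSV filename is a placeholder):

import pandas as pd

df = pd.read_csv("tchard.csv")  # hypothetical filename

# The license column records the original dataset each sample comes
# from and the license that applies to it
print(df["license"].value_counts())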
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please see the original paper at https://doi.org/10.1038/s41597-021-00833-x for more information about this dataset.
This package contains a dataset described by Donchev et al. [1]: DES370K. It is presented as a CSV (DES370K.csv) and .mol files (geometries//DES370K_.mol). Also included is a metadata file DES370K_meta.csv, which contains a set of long-form column descriptions replicating those in [1], as well as data types and units (when applicable) for each column.
DES370K.csv : Full dataset, containing interaction energies calculated using CCSD(T), MP2, HF, and SAPT0, as well as dimer geometries.
DES370K_meta.csv : Long-form descriptions of the columns in DES370K, as well as datatypes and units (when applicable) for each column
LICENSE.txt : License for using and redistributing the datasets provided.
README.md : This file.
The datasets are presented as CSVs as a compromise between human-readability, format uniformity, and parsing speed. While an almost uncountable number of packages exist to read CSV files, we recommend using the Python data analysis library pandas.
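For example, a minimal sketch for loading the dataset and its column metadata (assuming both files are in the working directory):

import pandas as pd

df = pd.read_csv("DES370K.csv")         # full dataset
meta = pd.read_csv("DES370K_meta.csv")  # column descriptions/units

print(df.shape)     # rows are dimers with their interaction energies
print(meta.head())  # long-form description, dtype, unit per column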
[1] A. G. Donchev, A. G. Taube, E. Decolvenaere, C. Hargus, R. T. McGibbon, K.-H. Law, B. A. Gregersen, J.-L. Li, K. Palmo, K. Siva, M. Bergdorf, J. L. Klepeis, and D. E. Shaw. "Quantum chemical benchmark database of dimer interaction energies at a “gold standard” level of accuracy"
[2] R. T. McGibbon, A. G. Taube, A. G. Donchev, K. Siva, F. Fernandez, C. Hargus, K.-H. Law, J.L. Klepeis, and D. E. Shaw. "Improving the accuracy of Moller-Plesset perturbation theory with neural networks"
[3] M. K. Kesharwani, A. Karton, N. Sylvetsky, J. M. L. Nitai. "The S66 non-covalent interactions benchmark reconsidered using explicitly correlated methods near the basis set limit."
DESRES DATA SETS LICENSE AGREEMENT
Copyright 2020, D. E. Shaw Research. All rights reserved.
Redistribution and use of electronic structure data released in the DESRES
Data Sets (DES370K, DES15K, DES5M, DESS66, and DESS66x8) with or without
modification, is permitted provided that the following conditions are met:
* Redistributions of the data must retain the above copyright notice,
this list of conditions, and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions, and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of D. E. Shaw Research nor the names of its contributors may
be used to endorse or promote products derived from this software without
specific prior written permission.
THIS SOFTWARE AND DATA ARE PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE AND/OR DATA, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains Zenodo's published open access records and communities metadata, including entries marked by the Zenodo staff as spam and deleted.
The datasets are gzip-compressed JSON-lines files, where each line is a JSON object representing a Zenodo record or community.
Records dataset
Filename: zenodo_open_metadata_{ date of export }.jsonl.gz
Each object contains the terms: part_of, thesis, description, doi, meeting, imprint, references, recid, alternate_identifiers, resource_type, journal, related_identifiers, title, subjects, notes, creators, communities, access_right, keywords, contributors, publication_date
which correspond to the fields with the same name available in Zenodo's record JSON Schema at https://zenodo.org/schemas/records/record-v1.0.0.json.
In addition, some terms have been altered:
The term files contains a list of dictionaries containing filetype, size, and filename only.
The term license contains a short Zenodo ID of the license (e.g. "cc-by").
Communities dataset
Filename: zenodo_community_metadata_{ date of export }.jsonl.gz
Each object contains the terms: id, title, description, curation_policy, page
which correspond to the fields with the same name available in Zenodo's community creation form.
Notes for all datasets
For each object the term spam contains a boolean value indicating whether the given record/community was marked as spam content by Zenodo staff.
Top-level terms whose values were missing from the metadata may contain a null value.
A smaller uncompressed random sample of 200 JSON lines is also included for each dataset to test and get familiar with the format without having to download the entire dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises data from the GEMStat database that are available under an open data license (CC BY 4.0 or equivalent). It is made available on the Zenodo repository. GEMStat provides access to freshwater quality data. The data are voluntarily provided by countries and organizations worldwide within the framework of the GEMS/Water Programme of the United Nations Environment Programme (UNEP). The dataset includes more than 20 million measurements from over 13,000 stations, covers more than 600 different parameters, and spans the time period from 1906 to 2023. This represents over 70% of all GEMStat data; further data are only available under more restrictive data licenses. GEMStat is operated by the GEMS/Water programme of UNEP and hosted at the International Centre for Water Resources and Global Change (ICWRGC) and the German Federal Institute of Hydrology (BfG). The data in GEMStat are provided by the National Hydrological Services of UN member states.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
What is in this release?
In this release you will find data about software distributed and/or crafted publicly on the Internet. You will find information about its development, its distribution and its relationship with other software included as a dependency. You will not find any information about the individuals who create and maintain these projects.
Further information and documentation on this data set can be found at https://libraries.io/data
For enquiries please contact data@libraries.io
This dataset contains seven CSV files:
Projects
A project is a piece of software available on any one of the 34 package managers supported by Libraries.io.
Versions
A Libraries.io version is an immutable published version of a Project from a package manager. Not all package managers have a concept of publishing versions, often relying directly on tags/branches from a revision control tool.
Tags
A tag is equivalent to a tag in a revision control system. Tags are sometimes used instead of Versions where a package manager does not use the concept of versions. Tags are often semantic version numbers.
Dependencies
Dependencies describe the relationship between a project and the software it builds upon. Dependencies belong to a Version; each Version can have different sets of dependencies. Dependencies point at a specific Version or range of versions of other projects.
Repositories
A Libraries.io repository represents a publicly accessible source code repository from either github.com, gitlab.com or bitbucket.org. Repositories are distinct from Projects; they are not distributed via a package manager and are typically an application for end users rather than a component to build upon.
Repository dependencies
A repository dependency is a dependency upon a Version from a package manager that has been specified in a manifest file, either as a manually added dependency committed by a user or as a generated dependency listed in a lockfile that has been automatically created by a package manager and committed.
Projects with related Repository fields
This is an alternative projects export that denormalizes a project's related source code repository inline to reduce the need to join between two data sets.
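As an illustration of working across the exports, a minimal pandas sketch joining versions onto projects (filenames and join-key column names are hypothetical; check the actual CSV headers after download):

import pandas as pd

projects = pd.read_csv("projects.csv")  # hypothetical filename
versions = pd.read_csv("versions.csv")  # hypothetical filename

# Attach each version to its parent project; "ID" and "Project ID"
# are assumed column names, not confirmed against the export
merged = versions.merge(projects, left_on="Project ID", right_on="ID",
                        suffixes=("_version", "_project"))
print(merged.head())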
Licence
This dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International Licence.
This licence provides the user with the freedom to use, adapt and redistribute this data. In return the user must publish any derivative work under a similarly open licence, attributing Libraries.io as a data source. The full text of the licence is included in the data.
Access, Attribution and Citation
The dataset is available to download from Zenodo at https://zenodo.org/record/2536573.
Please attribute Libraries.io as a data source by including the words ‘Includes data from Libraries.io, a project from Tidelift’ and reference the Digital Object identifier: 10.5281/zenodo.3626071
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ROOTS Subset: roots_ar_openiti_proc
OpenITI
Dataset uid: openiti_proc
Description
A corpus of Arabic texts collected from Islamic books on different websites.
Homepage
https://zenodo.org/record/4075046
Licensing
non-commercial use cc-by-nc-sa-4.0: Creative Commons Attribution Non Commercial Share Alike 4.0 International
By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this… See the full description on the dataset page: https://huggingface.co/datasets/bigscience-data/roots_ar_openiti_proc.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains part of the imaging data for the Universal Lesion Segmentation Challenge (ULS23). It contains lesion volumes of interest (VOIs) for previously released data. It consists of 333 kidney lesions from the KiTS21 dataset, 2,246 lung lesions from LIDC-IDRI, and 888 liver lesions from the LiTS challenge. The annotations are made available through the challenge repository on GitHub. The Universal Lesion Segmentation 2023 (ULS23) data is licensed under CC BY-NC-SA 4.0.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains:
A kmz file for Penmanshiel wind farm in the UK (for opening in e.g. Google Earth)
Static data including turbine coordinates and turbine details (rated power, rotor diameter, hub height, etc.)
10-minute SCADA and events data from the 14 Senvion MM82s at Penmanshiel wind farm, grouped by year from 2016 to mid-2021, which was extracted from our secondary SCADA system (Greenbyte). Note that not all signals are available for the entire period, and there is no turbine WT03
Data mappings from primary SCADA to csv signal names
Site substation/PMU meter data where available for the same period
Site fiscal/grid meter data where available for the same period
The dataset has been released by Cubico Sustainable Investments Ltd under a CC-BY-4.0 open data license and is provided as is. However, please provide any feedback you might have on the dataset and format of the data. I'll try and add or link to additional file formats that might be easier to work with (e.g. for use with specific analysis software), and update this dataset periodically (e.g. twice a year), but please prompt me as required.
Feel free to use the data according to the license, however, it would be helpful to me if you could let me know where, how and why you are using the data, so that I can highlight this to the business (and renewables industry) and hopefully promote similar data sharing initiatives. I am particularly interested in performance analysis/improvement opportunities, how the dataset can be augmented with other (open) datasets, and sharing more generally within the renewables industry.
If you would like to get access to other datasets we may hold (e.g. more recent data, data from our other sites, ~30s resolution data, etc.), please let me know, and, if you have any questions or want to discuss open data and this or other initiatives, please contact me and I will endeavour to help.
I would like to thank Cubico's Senior Legal Advisor & Compliance Officer, IT Director, UK Asset Management Team, Executive Committee and my manager for supporting this initiative, as well as our partners GLIL for agreeing to release this data under an open license. I would also like to thank those I have talked to during the process of releasing this data under an open license and the encouragement and advice I have had on the way.
For contact my email address is charlie.plumley@cubicoinvest.com.
You can also access data from Kelmarsh wind farm here.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset for the study of Potential Code Borrowing and License Violations in Java Projects on GitHub. The dataset is based on the Public Git Archive and consists of projects on GitHub that have at least 50 stars and have at least one line in Java. A total of 23,378 projects are listed here that we downloaded for analysis on June 1st, 2019.