100+ datasets found
  1. Training Dataset

    • zenodo.org
    • data.niaid.nih.gov
    txt, xml
    Updated Jan 24, 2020
    Cite
    Sanchez Romero (2020). Training Dataset [Dataset]. http://doi.org/10.5281/zenodo.3309364
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sanchez Romero
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Training dataset used in my TFM (Trabajo Fin de Máster, i.e. master's thesis).

  2. Data from: A Large-scale Dataset of (Open Source) License Text Variants

    • data.niaid.nih.gov
    Updated Mar 31, 2022
    + more versions
    Cite
    Stefano Zacchiroli (2022). A Large-scale Dataset of (Open Source) License Text Variants [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6379163
    Dataset updated
    Mar 31, 2022
    Dataset authored and provided by
    Stefano Zacchiroli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive—the largest publicly available archive of FOSS source code with accompanying development history—all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license blobs, plus several portable CSV files for metadata, referencing blobs via cryptographic checksums.

    For more details see the included README file and companion paper:

    Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In proceedings of the 2022 Mining Software Repositories Conference (MSR 2022). 23-24 May 2022 Pittsburgh, Pennsylvania, United States. ACM 2022.

    If you use this dataset for research purposes, please acknowledge its use by citing the above paper.

  3. Umfrage zu Forschungsdatenmanagement am Weizenbaum-Institut (2021)

    • zenodo.org
    • data.niaid.nih.gov
    csv, html, pdf, xls
    Updated Jul 17, 2024
    Cite
    Weizenbaum-Institut e.V. (2024). Umfrage zu Forschungsdatenmanagement am Weizenbaum-Institut (2021) [Dataset]. http://doi.org/10.34669/wi.wd/1
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Weizenbaum-Institut e.V.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains responses to a survey on open data and open access amongst members of the Weizenbaum Institute for the Networked Society which ran from 30 August to 21 September 2021. The survey elicited 39 valid responses out of 181 potential respondents working at the institute.

  4. Zenodo Open Metadata snapshot - Training dataset for records and communities...

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Dec 15, 2022
    Cite
    Zenodo team (2022). Zenodo Open Metadata snapshot - Training dataset for records and communities classifier building [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_787062
    Dataset updated
    Dec 15, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Zenodo team
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains Zenodo's published open access records and communities metadata, including entries marked by the Zenodo staff as spam and deleted.

    The datasets are gzip-compressed JSON Lines files, where each line is a JSON object representing a Zenodo record or community.

    Records dataset

    Filename: zenodo_open_metadata_{ date of export }.jsonl.gz

    Each object contains the terms: part_of, thesis, description, doi, meeting, imprint, references, recid, alternate_identifiers, resource_type, journal, related_identifiers, title, subjects, notes, creators, communities, access_right, keywords, contributors, publication_date

    which correspond to the fields with the same name available in Zenodo's record JSON Schema at https://zenodo.org/schemas/records/record-v1.0.0.json.

    In addition, some terms have been altered:

    The term files contains a list of dictionaries containing filetype, size, and filename only.

    The term license contains a short Zenodo ID of the license (e.g. "cc-by").

    Communities dataset

    Filename: zenodo_community_metadata_{ date of export }.jsonl.gz

    Each object contains the terms: id, title, description, curation_policy, page

    which correspond to the fields with the same name available in Zenodo's community creation form.

    Notes for all datasets

    For each object the term spam contains a boolean value, determining whether a given record/community was marked as spam content by Zenodo staff.

    Top-level terms that were missing from the metadata may contain a null value.

    A smaller uncompressed random sample of 200 JSON lines is also included for each dataset to test and get familiar with the format without having to download the entire dataset.
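
    As a minimal sketch of reading this format, the Python snippet below streams a gzip-compressed JSON Lines file and tallies the boolean spam term. The function name and the two-record in-memory sample are illustrative assumptions; only the spam and title terms come from the description above.

```python
import gzip
import io
import json
from collections import Counter


def count_spam(fileobj):
    """Count records by their boolean `spam` term in a gzipped JSON Lines stream."""
    counts = Counter()
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # A missing or null `spam` value is treated as "not spam" here (assumption).
            counts[bool(record.get("spam"))] += 1
    return counts


# Hypothetical two-record sample standing in for a real export file.
sample = io.BytesIO(gzip.compress(
    b'{"title": "A dataset", "spam": false}\n'
    b'{"title": "Buy now", "spam": true}\n'
))
spam_counts = count_spam(sample)
print(spam_counts[True], spam_counts[False])
```

    For a real export, a path such as the zenodo_open_metadata_{ date of export }.jsonl.gz file can be passed to gzip.open in place of the in-memory sample.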

  5. Dataset: Open Access: An Analysis of Publisher Copyright and Licensing...

    • zenodo.org
    • explore.openaire.eu
    Updated Feb 25, 2021
    Cite
    Jane Secker (2021). Dataset: Open Access: An Analysis of Publisher Copyright and Licensing Policies in Europe, 2020 [Dataset]. http://doi.org/10.5281/zenodo.4047001
    Dataset updated
    Feb 25, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jane Secker
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset supports the SPARC Europe study that investigates publishers' copyright retention and self-archiving policies and records publisher policies on open licensing, including how they relate to the Plan S requirements on rights and licensing. The report can be found here: 10.5281/zenodo.4046624

    Data reuse information

    These data were compiled by Chris Morrison and Jane Secker as part of the above-mentioned research project. For more information on how the data were produced and verified, see Section 3.3 of the report (10.5281/zenodo.4046624). There are two sets of data with supporting information, made available under different reuse terms.

    1. DOAJ dataset

    The file "DOAJ data May2020" was extracted from the Directory of Open Access Journals on 10 May 2020 and represents an analysis of the data (which is (c) 2020 DOAJ) by Chris Morrison and Jane Secker. The dataset is licensed under CC BY-SA following the DOAJ data licence.

    2. 10 large publishers dataset

    The file "Academic Publishers Copyright Policies and Practices Table" was created by Chris Morrison and Jane Secker and is provided under CC0 dedication.

    The remaining files provide information from publisher websites and directly from publisher representatives and remain (c) of each respective organisation.

  6. CRAFTED: An exploratory database of simulated adsorption isotherms of...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 13, 2023
    Cite
    Steiner, Mathias (2023). CRAFTED: An exploratory database of simulated adsorption isotherms of nanoporous materials [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7106173
    Dataset updated
    Nov 13, 2023
    Dataset provided by
    Farmahini, Amir H.
    Sarkisov, Lev
    Cleeton, Conor
    Lopes Oliveira, Felipe
    Steiner, Mathias
    Neumann Barros Ferreira, Rodrigo
    Luan, Binquan
    License

    https://cdla.io/sharing-1-0

    Description

    Overview

    The files in this repository compose the Charge-dependent, Reproducible, Accessible, Forcefield-dependent, and Temperature-dependent Exploratory Database (CRAFTED) of adsorption isotherms. This dataset contains the simulation of CO2 and N2 adsorption isotherms on 690 metal-organic frameworks taken from the CoRE-MOF-2014 database and 667 covalent organic frameworks taken from the CURATED-COFs database. The simulations were performed with two force fields (UFF and DREIDING), six partial charge schemes (no charges, Qeq, EQeq, DDEC, MPNN, and PACMOF), and three temperatures (273, 298, 323 K).

    Contents

    • CIF_FILES/ contains 6 folders (NEUTRAL, DDEC, EQeq, Qeq, MPNN, and PACMOF), each one with 1357 CIF files;
    • FORCEFIELDS/ contains 2 folders (UFF and DREIDING) with the definition of the forcefields;
    • INPUT_FILES/ contains 97,704 input files for the GCMC simulations;
    • ISOTHERM_FILES/ contains 97,704 adsorption isotherms resulting from the GCMC simulation;
    • ENTHALPY_FILES/ contains 97,704 enthalpies of adsorption from the isotherms;
    • RAC_DBSCAN/ contains the RAC and geometrical descriptors to perform the t-SNE + DBSCAN analysis.

    Licenses

    The 690 MOF-related CIF files in the DDEC folder were downloaded from CoRE-MOF-2014 and are licensed under the terms of the Creative Commons Attribution 4.0 International license (CC-BY-4.0). The 667 COF-related CIF files in the NEUTRAL folder were downloaded from CURATED-COFs and are licensed under the terms of the MIT license (MIT).

    • Dalar Nazarian, Jeffrey S. Camp, & David S. Sholl. (2016). Computation-Ready Experimental Metal-Organic Framework (CoRE MOF) 2014 DDEC Database [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3986573
    • Ongari, Daniele, et al. "Building a consistent and reproducible database for adsorption evaluation in covalent–organic frameworks." ACS Central Science 5.10 (2019): 1663-1675. https://doi.org/10.1021/acscentsci.9b00619
    • Ongari, Daniele, Leopold Talirz, and Berend Smit. "Too many materials and too many applications: An experimental problem waiting for a computational solution." ACS Central Science 6.11 (2020): 1890-1900. https://doi.org/10.1021/acscentsci.0c00988

    The CO2.def and N2.def forcefield files were downloaded from RASPA and are licensed under the terms of the MIT license.

    • Dubbeldam, David, et al. "RASPA: molecular simulation software for adsorption and diffusion in flexible nanoporous materials." Molecular Simulation 42.2 (2016): 81-101. https://doi.org/10.1080/08927022.2015.1010082

    The remaining MOF-related CIF files in the PACMOF, MPNN, Qeq, EQeq and NEUTRAL folders were derived from those in the DDEC folder and are licensed under the terms of the Creative Commons Attribution 4.0 International license (CC-BY-4.0) from the CoRE-MOF-2014 subset. The remaining COF-related CIF files in the PACMOF, MPNN, Qeq, EQeq and DDEC folders were derived from those in the NEUTRAL folder and are licensed under the terms of the MIT license (MIT) from the CURATED-COFs subset. All remaining files were created by us, and are licensed under the terms of the CDLA-Sharing-1.0 license.

    Software requirements

    In order to create a Python environment capable of running the Jupyter notebooks, please install conda and execute:

    conda env create --file environment.yml

    Usage instructions

    Execute the command below to run JupyterLab in the appropriate Python environment:

    conda run --name crafted jupyter-lab
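
    The file counts quoted above can be cross-checked with a short calculation. The sketch below assumes that every framework was simulated for every combination of gas, force field, charge scheme and temperature; the description does not state this explicitly, but the product matches the 97,704 files reported for each of the three folders.

```python
from itertools import product

# Values taken from the dataset description.
n_frameworks = 690 + 667  # MOFs + COFs = 1357 CIF files per charge folder
gases = ["CO2", "N2"]
force_fields = ["UFF", "DREIDING"]
charge_schemes = ["NEUTRAL", "Qeq", "EQeq", "DDEC", "MPNN", "PACMOF"]
temperatures_k = [273, 298, 323]

# Assumption: a full Cartesian product of simulation settings per framework.
settings_per_framework = list(
    product(gases, force_fields, charge_schemes, temperatures_k)
)
total_simulations = n_frameworks * len(settings_per_framework)
print(total_simulations)  # 97704
```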

  7. Data from: The Software Heritage License Dataset (2022 Edition)

    • data.niaid.nih.gov
    Updated Jan 10, 2024
    Cite
    Sergio Montes-Leon (2024). The Software Heritage License Dataset (2022 Edition) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8200351
    Dataset updated
    Jan 10, 2024
    Dataset provided by
    Stefano Zacchiroli
    Jesus M. Gonzalez-Barahona
    Gregorio Robles
    Sergio Montes-Leon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains all “license files” extracted from a snapshot of the Software Heritage archive taken on 2022-04-25. (Other, possibly more recent, versions of the datasets can be found at https://annex.softwareheritage.org/public/dataset/license-blobs/).

    In this context, a license file is a unique file content (or “blob”) that appeared in a software origin archived by Software Heritage as a file whose name is often used to ship licenses in software projects. Some name examples are: COPYING, LICENSE, NOTICE, COPYRIGHT, etc. The exact file name pattern used to select the blobs contained in the dataset can be found in the SQL query file 01-select-blobs.sql. Note that the file name was not expected to be at the project root, because project subdirectories can contain different licenses than the top-level one, and we wanted to include those too.

    Format

    The dataset is organized as follows:

    blobs.tar.zst: a Zst-compressed tarball containing deduplicated license blobs, one per file. The tarball contains 6’859’189 blobs, for a total uncompressed size on disk of 66 GiB.

    The blobs are organized in a sharded directory structure that contains files named like blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02, where:

    blobs/ is the root directory containing all license blobs

    8624bcdae55baeef00cd11d5dfcfa60f68710a02 is the SHA1 checksum of a specific license blob, a copy of the GPL3 license in this case. Each license blob is ultimately named with its SHA1:

    $ head -n 3 blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
    GNU GENERAL PUBLIC LICENSE
    Version 3, 29 June 2007

    $ sha1sum blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
    8624bcdae55baeef00cd11d5dfcfa60f68710a02 blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02

    86 and 24 are, respectively, the first and second group of two hex digits in the blob SHA1

    One blob is missing because its size (313MB) prevented its inclusion (it was originally a tarball containing source code):

    swh:1:cnt:61bf63793c2ee178733b39f8456a796b72dc8bde,1340d4e2da173c92d432026ecdc54b4859fe9911,"AUTHORS"
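
    The sharded layout described above can be sketched in a few lines of Python. The helper names blob_path and matches_name are hypothetical and not part of the dataset's replication package; the sketch only illustrates how a SHA1 maps to a path and how a blob's content can be checked against its file name.

```python
import hashlib
from pathlib import PurePosixPath

HEX_DIGITS = set("0123456789abcdef")


def blob_path(sha1_hex, root="blobs"):
    """Return the sharded path root/<first 2 hex>/<next 2 hex>/<sha1>."""
    sha1_hex = sha1_hex.lower()
    if len(sha1_hex) != 40 or not set(sha1_hex) <= HEX_DIGITS:
        raise ValueError(f"not a SHA1 hex digest: {sha1_hex!r}")
    return PurePosixPath(root, sha1_hex[0:2], sha1_hex[2:4], sha1_hex)


def matches_name(path, content):
    """Check that the SHA1 of `content` equals the blob's file name."""
    return hashlib.sha1(content).hexdigest() == PurePosixPath(path).name


example = blob_path("8624bcdae55baeef00cd11d5dfcfa60f68710a02")
print(example)  # blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
```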

    blobs-sample20k.tar.zst: analogous to blobs.tar.zst, but containing “only” 20’000 randomly selected license blobs

    license-blobs.csv.zst: a Zst-compressed CSV index of all the blobs in the dataset. Each line in the index (except the first one, which contains column headers) describes a license blob and is in the format SWHID,SHA1,NAME, for example:

    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING"
    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GPL3"
    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GLP-3"

    where:

    SWHID: the Software Heritage persistent identifier of the blob. It can be used to retrieve and cross-reference the license blob via the Software Heritage archive, e.g., at: https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2

    SHA1: the blob SHA1, that can be used to cross-reference blobs in the blobs/ directory

    NAME: a file name given to the license blob in a given software origin. As the same license blob can have different names in different contexts, the index contains multiple entries for the same blob with different names, as is the case in the example above (yes, one of those has a typo in it, but it’s an original typo from some repository!).
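
    As an illustration of the one-blob-to-many-names relationship, the following Python sketch groups file names by SHA1. It assumes the index has already been decompressed (for instance with zstdcat) and relies on column positions rather than header names, since the exact header text is not given here; the header row in the sample is a placeholder.

```python
import csv
import io
from collections import defaultdict


def names_by_sha1(rows):
    """Map each blob SHA1 to the set of file names it was seen under."""
    reader = csv.reader(rows)
    next(reader, None)  # skip the header row
    names = defaultdict(set)
    for _swhid, sha1, name in reader:
        names[sha1].add(name)
    return names


# The three data rows are the example from the description;
# the header line is an assumed placeholder.
sample = io.StringIO(
    "SWHID,SHA1,NAME\n"
    'swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING"\n'
    'swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GPL3"\n'
    'swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GLP-3"\n'
)
grouped = names_by_sha1(sample)
print(sorted(grouped["8624bcdae55baeef00cd11d5dfcfa60f68710a02"]))
```

    Using the csv module rather than splitting on commas matters here because quoted NAME values may themselves contain commas.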

    blobs-fileinfo.csv.zst: a Zst-compressed CSV mapping from blobs to basic file information in the format: SHA1,MIME_TYPE,ENCODING,LINE_COUNT,WORD_COUNT,SIZE, where:

    SHA1: blob SHA1

    MIME_TYPE: blob MIME type, as detected by libmagic

    ENCODING: blob character encoding, as detected by libmagic

    LINE_COUNT: number of lines in the blob (only for textual blobs with UTF8 encoding)

    WORD_COUNT: number of words in the blob (only for textual blobs with UTF8 encoding)

    SIZE: blob size in bytes

    blobs-scancode.csv.zst: a Zst-compressed CSV mapping from blobs to the software licenses detected in them by ScanCode, in the format: SHA1,LICENSE,SCORE, where:

    SHA1: blob SHA1

    LICENSE: license detected in the blob, as an SPDX identifier (or ScanCode identifier for non-SPDX-indexed licenses)

    SCORE: confidence score in the result, as a decimal number between 0 and 100

    There may be zero or arbitrarily many lines for each blob.
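
    Because a blob can have zero or many detection rows, a common first step is to keep only the highest-scoring license per blob. The sketch below assumes a decompressed SHA1,LICENSE,SCORE file; whether the real file starts with a header row is an assumption exposed as a parameter, and the sample rows (license identifiers and scores) are invented for illustration.

```python
import csv
import io


def best_license_per_blob(rows, has_header=True):
    """Return {sha1: (license, score)} keeping the highest score per blob."""
    reader = csv.reader(rows)
    if has_header:
        next(reader, None)
    best = {}
    for sha1, license_id, score in reader:
        score = float(score)
        if sha1 not in best or score > best[sha1][1]:
            best[sha1] = (license_id, score)
    return best


# Hypothetical rows: the SHA1 is the example blob, the detections are made up.
sample = io.StringIO(
    "SHA1,LICENSE,SCORE\n"
    "8624bcdae55baeef00cd11d5dfcfa60f68710a02,LGPL-3.0-only,45.5\n"
    "8624bcdae55baeef00cd11d5dfcfa60f68710a02,GPL-3.0-only,100.0\n"
)
best = best_license_per_blob(sample)
print(best["8624bcdae55baeef00cd11d5dfcfa60f68710a02"])
```

    Blobs with no detection rows simply do not appear in the result, which mirrors the "zero lines" case mentioned above.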

    blobs-scancode.ndjson.zst: a Zst-compressed line-delimited JSON file, containing a superset of the information in blobs-scancode.csv.zst. Each line is a JSON dictionary with three keys:

    sha1: blob SHA1

    licenses: output of scancode.api.get_licenses(..., min_score=0)

    copyrights: output of scancode.api.get_copyrights(...)

    There is exactly one line for each blob. licenses and copyrights keys are omitted for files not detected as plain text.

    blobs-origins.csv.zst: a Zst-compressed CSV mapping of where license blobs come from. Each line in the index associates a license blob with one of its origins in the format SWHID URL, for example:

    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 https://github.com/pombreda/Artemis

    Note that a license blob can come from many different places; only an arbitrary (and somewhat random) one is listed in this mapping.

    If no origin URL is found in the Software Heritage archive, then a blank is used instead. This happens when the blob's origins were either still being loaded when the dataset was generated, or the loader process crashed before completing the ingestion of the blob’s origin.

    blobs-nb-origins.csv.zst: a Zst-compressed CSV mapping of how many origins of each blob are known to Software Heritage. Each line in the index associates a license blob with this count in the format SWHID NUMBER, for example:

    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 2822260

    Two blobs are missing because the computation crashed:

    swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
    swh:1:cnt:8b137891791fe96927ad78e64b0aad7bded08bdc

    This issue will be fixed in a future version of the dataset.

    blobs-earliest.csv.zst: a Zst-compressed CSV mapping from blobs to information about their (earliest) known occurrence(s) in the archive. Format: SWHID EARLIEST_SWHID EARLIEST_TS OCCURRENCES, where:

    SWHID: blob SWHID

    EARLIEST_SWHID: SWHID of the earliest known commit containing the blob

    EARLIEST_TS: timestamp of the earliest known commit containing the blob, as a Unix time integer

    OCCURRENCES: number of known commits containing the blob
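
    Since EARLIEST_TS is a Unix time integer, it usually needs converting before it can be compared with calendar dates. The sketch below parses one row and converts the timestamp to a UTC datetime. It assumes whitespace-separated fields, and the sample row (revision identifier, timestamp and count) is hypothetical.

```python
from datetime import datetime, timezone


def parse_earliest(line):
    """Parse one blobs-earliest row into a dict with a UTC datetime."""
    # Assumption: the four fields are separated by whitespace.
    swhid, earliest_swhid, earliest_ts, occurrences = line.split()
    return {
        "swhid": swhid,
        "earliest_swhid": earliest_swhid,
        "earliest": datetime.fromtimestamp(int(earliest_ts), tz=timezone.utc),
        "occurrences": int(occurrences),
    }


# Hypothetical row for illustration only.
row = (
    "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 "
    "swh:1:rev:0000000000000000000000000000000000000001 "
    "1183075200 42"
)
parsed = parse_earliest(row)
print(parsed["earliest"].isoformat())  # 2007-06-29T00:00:00+00:00
```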

    replication-package.tar.gz: code and scripts used to produce the dataset

    licenses-annotated-sample.tar.gz: ground truth, i.e., manually annotated random sample of license blobs, with details about the kind of information they contain.

    Changes since the 2021-03-23 dataset

    More input data, due to the SWH archive growing: more origins in supported forges and package managers; and support for more forges and package managers. See the SWH Archive Changelog for details.

    Values in the NAME column of license-blobs.csv.zst are quoted, as some file names now contain commas.

    Replication package now contains all the steps needed to reproduce all artefacts including the licenseblobs/fetch.py script.

    blobs-nb-origins.csv.zst is added.

    blobs-origins.csv.zst is now generated using the first origin returned by swh-graph’s leaves endpoint, instead of its randomwalk endpoint. This should have no impact on the result, other than a different distribution of “random” origins being picked.

    blobs-origins.csv.zst was missing ~10% of its results in previous versions of the dataset, due to errors and/or timeouts in its generation; this is now down to 0.02% (1254 of the 6859445 unique blobs). Blobs with no known origins are now present, with a blank instead of a URL.

    blobs-earliest.csv.zst was missing ~10% of its results in previous versions of the dataset. It is complete now.

    blobs-scancode.csv.zst is generated with a newer scancode-toolkit version (31.2.1)

    blobs-scancode.ndjson.zst is added.

    Errata

    A file named .tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 was present in the initial version of the dataset (published on 2022-11-07). It was removed on 2022-11-09 using these two commands:

    pv blobs-fileinfo.csv.zst | zstdcat | grep -v ".tmp" | zstd -19
    pv blobs.tar.zst| zstdcat | tar --delete blobs/13/40/.tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 | zstd -19 -T12

    The total uncompressed size was announced as 84 GiB based on the physical size on ext4, but it is actually 66 GiB.

    Citation

    If you use this dataset for research purposes, please acknowledge its use by citing one or both of the following papers:

    Jesús M. González-Barahona, Sergio Raúl Montes León, Gregorio Robles, Stefano Zacchiroli. The software heritage license dataset (2022 edition). Empirical Software Engineering, Volume 28, Number 6, Article number 147 (2023).

    Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In proceedings of the 2022 Mining Software Repositories Conference (MSR 2022). 23-24 May 2022 Pittsburgh, Pennsylvania, United States. ACM 2022.

    References

    The dataset has been built using primarily the data sources described in the following papers:

    Roberto Di Cosmo, Stefano Zacchiroli. Software Heritage: Why and How to Preserve Software Source Code. In Proceedings of iPRES 2017: 14th International Conference on Digital Preservation, Kyoto, Japan, 25-29 September 2017.

    Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli. The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019.

    Errata (v2, 2024-01-09)

    licenses-annotated-sample.tar.gz: some comments not intended for publication were removed, and 4

  8. Data from: Synthetic Product Desirability Datasets for Sentiment Analysis...

    • zenodo.org
    • paperswithcode.com
    • +2more
    Updated Nov 21, 2024
    Cite
    John Hastings; Sherri Weitl-Harms; Joseph Doty; Zachary Myers; Warren Thompson (2024). Synthetic Product Desirability Datasets for Sentiment Analysis Testing [Dataset]. http://doi.org/10.5281/zenodo.14188456
    Dataset updated
    Nov 21, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    John Hastings; Sherri Weitl-Harms; Joseph Doty; Zachary Myers; Warren Thompson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview:
    This collection contains three synthetic datasets produced by gpt-4o-mini for sentiment analysis and PDT (Product Desirability Toolkit) testing. Each dataset contains 1000 hypothetical software product reviews with the aim to produce a diversity of sentiment and text. The datasets were created as part of the research described in:

    Hastings, J.D., Weitl-Harms, S., Doty, J., Myers, Z. L., and Thompson, W., “Utilizing Large Language Models to Synthesize Product Desirability Datasets,” in Proceedings of the 2024 IEEE International Conference on Big Data (BigData-24), Workshop on Large Language and Foundation Models (WLLFM-24), Dec. 2024. https://arxiv.org/abs/2411.13485

    Briefly, each row in the datasets was produced as follows:
    1) Word+Review: The LLM selected a word and synthesized a review that would align with a random target sentiment.
    2) Review+Word: The LLM produced a review to align with the target sentiment score, and then selected a word appropriate for the review.
    3) Supply-Word: A word was supplied to the LLM which was then scored, and a review was produced to align with that score.

    For sentiment analysis and PDT testing, the two columns of main interest across the datasets are likely 'Selected Word' and 'Hypothetical Review'.

    License:
    This data is licensed under the CC Attribution 4.0 international license, and may be taken and used freely with credit given. Cite as:

    Hastings, J., Weitl-Harms, S., Doty, J., Myers, Z., & Thompson, W. (2024). Synthetic Product Desirability Datasets for Sentiment Analysis Testing (1.0.0). Zenodo. https://doi.org/10.5281/zenodo.14188456

  9. Zenodo metadata JSON records as of 2019-09-16

    • explore.openaire.eu
    • zenodo.org
    Updated Nov 7, 2019
    Cite
    Stian Soiland-Reyes; Paul Groth; Information Management; University of Amsterdam (2019). Zenodo metadata JSON records as of 2019-09-16 [Dataset]. http://doi.org/10.5281/zenodo.3531504
    Dataset updated
    Nov 7, 2019
    Authors
    Stian Soiland-Reyes; Paul Groth; Information Management; University of Amsterdam
    Description

    This preliminary dataset contains the application/vnd.zenodo.v1+json JSON records of Zenodo deposits as retrieved on 2019-09-16.

    Files

    • zenodo-records-json-2019-09-16.tar.xz: Zenodo JSON records. XZ-compressed tar archive of individual JSON records as retrieved from Zenodo. Filenames reflect the record, e.g. 1310621.json was retrieved from https://zenodo.org/api/records/1310621 using content negotiation for application/vnd.zenodo.v1+json
    • zenodo-records-json-2019-09-16-filtered.jsonseq.xz: Concatenated Zenodo JSON records. XZ-compressed RFC 7464 JSON Sequence stream, readable by jq. Concatenation of Zenodo JSON records. Order not significant.
    • zenodo-records.sh: Retrieve Zenodo JSON records. A retrospectively created Bash shell script that shows the commands used to retrieve the JSON files and concatenate them to jsonseq.
    • ro-crate-metadata.jsonld: RO-Crate 0.2 structured metadata
    • ro-crate-preview.html: Browser rendering of RO-Crate structured metadata
    • README.md: This dataset description

    License

    This dataset is provided under the license Apache License, version 2.0:

    Copyright 2019 The University of Manchester

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

    CC0 for Zenodo metadata

    The Zenodo metadata in zenodo-records-json-2019-09-16.tar.xz is reused under the terms of https://creativecommons.org/publicdomain/zero/1.0/

    Reproducibility

    To retrieve the Zenodo JSON it was deemed necessary to use the undocumented parts of the Zenodo API.

    From the Zenodo source code it was identified that the REST template https://zenodo.org/api/records/{pid_value} could be used with pid_value as the numeric part from the OAI-PMH identifier, e.g. for oai:zenodo.org:1310621 the Zenodo JSON can be retrieved at https://zenodo.org/api/records/1310621.

    The JSON API supports content negotiation. The content types supported as of 2019-09-20 include:

    • application/vnd.zenodo.v1+json, giving the Zenodo record in Zenodo's internal JSON schema (v1)
    • application/ld+json, giving JSON-LD Linked Data using the http://schema.org/ vocabulary
    • application/x-datacite-v41+xml, giving DataCite v4 XML
    • application/marcxml+xml, giving MARC 21 XML

    Using these (currently) undocumented parts of the Zenodo API thus avoids the need for HTML scraping while also giving individual complete records that are suitable to redistribute as records in a filtered dataset.

    This preliminary exploration will be adapted into the reproducible CWL workflow; for now it is included as a Bash script, zenodo-records.sh.

    Execution time was about 3 days from a server at the University of Manchester network on a single 1 GBps network link.

    The script does the following:
    1) Retrieve each of the first 3.5 million Zenodo records as Zenodo JSON by iterating over possible numeric IDs (the maximum ID 3450000 was estimated from "Recent uploads").
    2) Filter the list to exclude records that are not found, moved or deleted. The presence of the key conceptrecid is used as the marker.
    3) Use jq to ensure the JSON is on a single line.
    4) Join the JSON files using the ASCII Record Separator (RS, 0x1e) to make an application/json-seq JSON text sequence stream.
    5) Save the JSON stream as a single compressed file using xz.
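
    The filter-and-join steps of the script can be illustrated in Python. This is a sketch under stated assumptions, not the author's Bash script: the helper names are hypothetical, the sample records are invented, and the stream is assumed to be already decompressed. It keeps only records carrying the conceptrecid key, writes each as a single-line JSON text prefixed by the Record Separator (0x1e), and reads the stream back.

```python
import json

RS = "\x1e"  # ASCII Record Separator used by JSON text sequences (RFC 7464)


def to_json_seq(records):
    """Serialize records as RS + single-line JSON + newline, one per record."""
    return "".join(
        RS + json.dumps(record, separators=(",", ":")) + "\n"
        for record in records
    )


def iter_json_seq(text):
    """Yield the JSON values contained in a JSON text sequence."""
    for chunk in text.split(RS):
        chunk = chunk.strip()
        if chunk:  # the text before the first RS is empty
            yield json.loads(chunk)


# Hypothetical responses: only the first one has the `conceptrecid` marker.
responses = [
    {"id": 1310621, "conceptrecid": "1310620"},
    {"status": 404, "message": "PID does not exist."},
]
kept = [r for r in responses if "conceptrecid" in r]
stream = to_json_seq(kept)
roundtrip = list(iter_json_seq(stream))
print(roundtrip)
```

    For the published file, the text could come from lzma.open(path, "rt") in the standard library, which reads XZ-compressed data.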

  10. Dataset from "What do developers talk about open source software licensing?...

    • data.niaid.nih.gov
    Updated Jun 1, 2020
    Cite
    Daniel German (2020). Dataset from "What do developers talk about open source software licensing?" - SEAA2020 [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_3871564
    Dataset updated
    Jun 1, 2020
    Dataset provided by
    Daniel German
    Lefteris Angelis
    Georgia M. Kapitsaki
    Maria Papoutsoglou
    Description

    This is the dataset used in the respective research work. The abstract is available below.

    If you want to cite this work, please use:

    Georgia M. Kapitsaki, Maria Papoutsoglou, Daniel German and Lefteris Angelis, What do developers talk about open source software licensing?, to appear in the Proceedings of the Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2020.

    Free and open source software has gained a lot of momentum in the industry and the research community. Open source licenses determine the rules under which the open source software can be further used and distributed. Previous works have examined the usage of open source licenses in the framework of specific projects or online social coding platforms, examining developers' specific licensing views for specific software. However, the questions practitioners ask about licenses and licensing as captured in Question and Answer websites also constitute an important aspect toward understanding practitioners' general licenses and licensing concerns. In this paper, we investigate open source license discussions using data from the Software Engineering, Open Source and Law Stack Exchange sites that contain relevant data. We describe the process used for the data collection and analysis, and discuss the main results. Our results indicate that clarifications about specific licenses and specific license terms are required. The results can be useful for developers, educators and license authors.

  11. WorldCereal open global harmonized reference data repository (CC-BY-SA...

    • zenodo.org
    • data.niaid.nih.gov
    Updated Jul 12, 2024
    + more versions
    Cite
    Hendrik Boogaard; Arun Pratihast; Juan Carlos Laso Bayas; Santosh Karanam; Steffen Fritz; Kristof Van Tricht; Jeroen Degerickx; Sven Gilliams (2024). WorldCereal open global harmonized reference data repository (CC-BY-SA licensed data sets) [Dataset]. http://doi.org/10.5281/zenodo.7609546
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hendrik Boogaard; Arun Pratihast; Juan Carlos Laso Bayas; Santosh Karanam; Steffen Fritz; Kristof Van Tricht; Jeroen Degerickx; Sven Gilliams
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Within the ESA funded WorldCereal project we have built an open harmonized reference data repository at global extent for model training or product validation in support of land cover and crop type mapping. Data from 2017 onwards were collected from many different sources and then harmonized, annotated and evaluated. These steps are explained in the harmonization protocol (10.5281/zenodo.7584463). This protocol also clarifies the naming convention of the shape files and the WorldCereal attributes (LC, CT, IRR, valtime and sampleID) that were added to the original data sets.

    This publication includes those harmonized data sets of which the original data set was published under the CC-BY-SA license or a license similar to CC-BY-SA. See document "_In-situ-data-World-Cereal - license - CC-BY-SA.pdf" for an overview of the original data sets.

  12. Data from: TChard

    • data.niaid.nih.gov
    Updated Aug 17, 2022
    Cite
    Timothy O'Donnell (2022). TChard [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_6962042
    Explore at:
    Dataset updated
    Aug 17, 2022
    Dataset provided by
    Martin Renqiang Min
    Anja Moesch
    Kai Li
    Timothy O'Donnell
    Israa Alqassem
    Filippo Grazioli
    Pierre Machart
    Description

    TChard is a dataset for TCR-peptide/-pMHC binding prediction. It includes more than 500,000 samples, derived from heterogeneous sources, such as IEDB, VDJdb, McPAS-TCR and the NetTCR-2.0 repository. Since the samples included in the TChard dataset derive from different datasets, we include a license column in the CSV file. The license column specifies from which original dataset a sample comes, and which license applies. Experiments on this dataset can be found in the GitHub repository: https://github.com/nec-research/tc-hard
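
    Because each sample carries its own license, a first step before reuse is often to count samples per license value. The following is a minimal sketch using only the Python standard library. The column header `license`, the other column names and the sample values are assumptions made for illustration; check the header row of the actual CSV file.

```python
import csv
import io
from collections import Counter


def count_by_license(csv_file, column="license"):
    """Count rows per value of the license column in an open CSV file.

    The column name is an assumption; pass the real header if it differs.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Hypothetical rows, not taken from the real dataset.
sample = io.StringIO(
    "cdr3,peptide,license\n"
    "SEQ_A,PEP_1,license-a\n"
    "SEQ_B,PEP_2,license-b\n"
    "SEQ_C,PEP_1,license-a\n"
)
counts = count_by_license(sample)
print(counts)
```

    For the real file, the same function can be called on `open(path, newline="", encoding="utf-8")` in place of the in-memory sample.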

  13. DES370K

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 12, 2021
    Cite
    Gregersen, Brent A (2021). DES370K [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5676265
    Explore at:
    Dataset updated
    Nov 12, 2021
    Dataset provided by
    Gregersen, Brent A
    Li, Je-Leun
    McGibbon, Robert T
    Hargus, Cory
    Shaw, David E
    Klepeis, John L
    Siva, Karthik
    Palmo, Kim
    Taube, Andrew G
    Decolvenaere, Elizabeth
    Donchev, Alexander G
    Law, Ka-Hei
    Bergdorf, Michael
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    DESRES Data Sets (DES370K)

    Please see the original paper at https://doi.org/10.1038/s41597-021-00833-x for more information about this dataset.

    This package contains a dataset described by Donchev et al. [1]: DES370K. It is presented as a CSV file (DES370K.csv) and .mol files (geometries//DES370K_.mol). Also included is a metadata file, DES370K_meta.csv, which contains a set of long-form column descriptions replicating those in [1], as well as data types and units (when applicable) for each column.

    Manifest

    • DES370K.csv : Full dataset, containing interaction energies calculated using CCSD(T), MP2, HF, and SAPT0, as well as dimer geometries.

    • DES370K_meta.csv : Long-form descriptions of the columns in DES370K, as well as datatypes and units (when applicable) for each column

    • LICENSE.txt : License for using and redistributing the datasets provided.

    • README.md : This file.

    Loading the Dataset

    The datasets are presented as CSVs as a compromise between human-readability, format uniformity, and parsing speed. While an almost uncountable number of packages exist to read CSV files, we recommend using the python data analysis
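
    The recommendation above is cut off in this listing, so the sketch below does not guess which package was meant. It only illustrates reading such a CSV with the Python standard library and converting numeric-looking cells to floats. The column names and values in the sample are hypothetical; the real column descriptions, data types and units are in DES370K_meta.csv.

```python
import csv
import io


def load_rows(csv_file):
    """Read a CSV into a list of dicts, converting numeric-looking cells to float."""
    rows = []
    for raw in csv.DictReader(csv_file):
        row = {}
        for key, value in raw.items():
            try:
                row[key] = float(value)
            except (TypeError, ValueError):
                # Keep non-numeric or missing cells unchanged.
                row[key] = value
        rows.append(row)
    return rows


# Hypothetical header and values, for illustration only.
sample = io.StringIO(
    "system_name,energy_a,energy_b\n"
    "dimer_1,-1.25,-1.5\n"
    "dimer_2,0.5,0.75\n"
)
rows = load_rows(sample)
print(rows[0])
```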

    References

    [1] A. G. Donchev, A. G. Taube, E. Decolvenaere, C. Hargus, R. T. McGibbon, K.-H. Law, B. A. Gregersen, J.-L. Li, K. Palmo, K. Siva, M. Bergdorf, J. L. Klepeis, and D. E. Shaw. "Quantum chemical benchmark database of dimer interaction energies at a “gold standard” level of accuracy"

    [2] R. T. McGibbon, A. G. Taube, A. G. Donchev, K. Siva, F. Fernandez, C. Hargus, K.-H. Law, J.L. Klepeis, and D. E. Shaw. "Improving the accuracy of Moller-Plesset perturbation theory with neural networks"

    [3] M. K. Kesharwani, A. Karton, N. Sylvetsky, and J. M. L. Martin. "The S66 non-covalent interactions benchmark reconsidered using explicitly correlated methods near the basis set limit."

    License

            DESRES DATA SETS LICENSE AGREEMENT
    
    
    Copyright 2020, D. E. Shaw Research. All rights reserved.
    
    
    Redistribution and use of electronic structure data released in the DESRES
    Data Sets (DES370K, DES15K, DES5M, DESS66, and DESS66x8) with or without
    modification, is permitted provided that the following conditions are met:
    
    
      * Redistributions of the data must retain the above copyright notice,
      this list of conditions, and the following disclaimer.
    
    
      * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions, and the following disclaimer in the
      documentation and/or other materials provided with the distribution.
    
    
    Neither the name of D. E. Shaw Research nor the names of its contributors may
    be used to endorse or promote products derived from this software without
    specific prior written permission.
    
    
    THIS SOFTWARE AND DATA ARE PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
    THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
    ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE
    FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
    DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
    SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
    CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
    OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE AND/OR DATA, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
    
  14. Zenodo Open Metadata snapshot - Training dataset for records and communities...

    • zenodo.org
    application/gzip, bin
    Updated Dec 15, 2022
    Cite
    Zenodo team (2022). Zenodo Open Metadata snapshot - Training dataset for records and communities classifier building [Dataset]. http://doi.org/10.5281/zenodo.7438358
    Explore at:
    bin, application/gzip (available download formats)
    Dataset updated
    Dec 15, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Zenodo team
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains Zenodo's published open access records and communities metadata, including entries marked by the Zenodo staff as spam and deleted.

    The datasets are gzip-compressed JSON Lines files, where each line is a JSON object representation of a Zenodo record or community.

    Records dataset

    Filename: zenodo_open_metadata_{ date of export }.jsonl.gz

    Each object contains the terms: part_of, thesis, description, doi, meeting, imprint, references, recid, alternate_identifiers, resource_type, journal, related_identifiers, title, subjects, notes, creators, communities, access_right, keywords, contributors, publication_date

    which correspond to the fields with the same name available in Zenodo's record JSON Schema at https://zenodo.org/schemas/records/record-v1.0.0.json.

    In addition, some terms have been altered:

    • The term files contains a list of dictionaries containing filetype, size, and filename only.
    • The term license contains a short Zenodo ID of the license (e.g. "cc-by").

    Communities dataset

    Filename: zenodo_community_metadata_{ date of export }.jsonl.gz

    Each object contains the terms: id, title, description, curation_policy, page

    which correspond to the fields with the same name available in Zenodo's community creation form.

    Notes for all datasets

    For each object the term spam contains a boolean value, determining whether a given record/community was marked as spam content by Zenodo staff.

    Top-level terms that were missing in the metadata may contain a null value.

    A smaller uncompressed random sample of 200 JSON lines is also included for each dataset to test and get familiar with the format without having to download the entire dataset.
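
    The format described above (one JSON object per line, gzip-compressed, with a boolean spam term) can be read line by line without loading the whole file into memory. The following is a minimal sketch using only the Python standard library. The file name and the example records are made up for the demonstration; only the spam term and the possibility of null values come from the description.

```python
import gzip
import json
import os
import tempfile


def iter_jsonl_gz(path):
    """Yield one parsed JSON object per non-empty line of a gzip-compressed file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def split_by_spam(records):
    """Separate records into (spam, not_spam) lists using the boolean 'spam' term."""
    spam, ham = [], []
    for rec in records:
        (spam if rec.get("spam") else ham).append(rec)
    return spam, ham


# Hypothetical records written to a temporary file to exercise the reader.
records = [
    {"recid": 1, "title": "Example A", "spam": False},
    {"recid": 2, "title": "Example B", "spam": True},
    {"recid": 3, "title": None, "spam": False},  # null values may occur
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl.gz")
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")
    spam, ham = split_by_spam(iter_jsonl_gz(path))

print(len(spam), len(ham))
```

    For the uncompressed 200-line samples, replacing `gzip.open` with the built-in `open` is enough.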

  15. GEMS/Water Open Global Water Quality Dataset

    • gimi9.com
    • data.europa.eu
    Cite
    GEMS/Water Open Global Water Quality Dataset [Dataset]. https://gimi9.com/dataset/eu_fb38ac32-1459-48d9-8390-f8dd2e43e0cc
    Explore at:
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset comprises data from the GEMStat database that are available under an open data license (CC BY 4.0 or equivalent). It is made available on the Zenodo repository. GEMStat provides access to freshwater quality data. The data are voluntarily provided by countries and organizations worldwide within the framework of the GEMS/Water Programme of the United Nations Environment Programme (UNEP). The dataset includes more than 20 million measurements from over 13,000 stations, covers more than 600 different parameters, and spans the period from 1906 to 2023. This represents over 70% of all GEMStat data; further data is only available under more restrictive data licenses. GEMStat is operated by the GEMS/Water programme of the United Nations Environment Programme (UNEP) and hosted at the International Centre for Water Resources and Global Change (ICWRGC) and the German Federal Institute of Hydrology (BfG). The data in GEMStat is provided by National Hydrological Services of UN member states.

  16. Data from: Libraries.io Open Source Repository and Dependency Metadata

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Feb 13, 2020
    Cite
    Jeremy Katz (2020). Libraries.io Open Source Repository and Dependency Metadata [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_808272
    Explore at:
    Dataset updated
    Feb 13, 2020
    Dataset authored and provided by
    Jeremy Katz
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    What is in this release?

    In this release you will find data about software distributed and/or crafted publicly on the Internet. You will find information about its development, its distribution and its relationship with other software included as a dependency. You will not find any information about the individuals who create and maintain these projects.

    Further information and documentation on this data set can be found at https://libraries.io/data

    For enquiries please contact data@libraries.io

    This dataset contains seven csv files:

    Projects

    A project is a piece of software available on any one of the 34 package managers supported by Libraries.io.

    Versions

    A Libraries.io version is an immutable published version of a Project from a package manager. Not all package managers have a concept of publishing versions, often relying directly on tags/branches from a revision control tool.

    Tags

    A tag is equivalent to a tag in a revision control system. Tags are sometimes used instead of Versions where a package manager does not use the concept of versions. Tags are often semantic version numbers.

    Dependencies

    Dependencies describe the relationship between a project and the software it builds upon. Dependencies belong to Version. Each Version can have different sets of dependencies. Dependencies point at a specific Version or range of versions of other projects.

    Repositories

    A Libraries.io repository represents a publicly accessible source code repository from either github.com, gitlab.com or bitbucket.org. Repositories are distinct from Projects: they are not distributed via a package manager and are typically an application for end users rather than a component to build upon.

    Repository dependencies

    A repository dependency is a dependency upon a Version from a package manager that has been specified in a manifest file, either as a manually added dependency committed by a user or as a generated dependency listed in a lockfile that has been automatically generated by a package manager and committed.

    Projects with related Repository fields

    This is an alternative projects export that denormalizes a project's related source code repository inline to reduce the need to join between two data sets.

    Licence

    This dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International Licence.

    This licence provides the user with the freedom to use, adapt and redistribute this data. In return the user must publish any derivative work under a similarly open licence, attributing Libraries.io as a data source. The full text of the licence is included in the data.

    Access, Attribution and Citation

    The dataset is available to download from Zenodo at https://zenodo.org/record/2536573.

    Please attribute Libraries.io as a data source by including the words ‘Includes data from Libraries.io, a project from Tidelift’ and reference the Digital Object identifier: 10.5281/zenodo.3626071

  17. roots_ar_openiti_proc

    • huggingface.co
    Updated Apr 22, 2024
    Cite
    BigScience Data (2024). roots_ar_openiti_proc [Dataset]. https://huggingface.co/datasets/bigscience-data/roots_ar_openiti_proc
    Explore at:
    Dataset updated
    Apr 22, 2024
    Dataset authored and provided by
    BigScience Data
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ROOTS Subset: roots_ar_openiti_proc

      OpenITI
    

    Dataset uid: openiti_proc

      Description
    

    A corpus of Arabic texts collected from Islamic books on different websites.

      Homepage
    

    https://zenodo.org/record/4075046

      Licensing
    

    non-commercial use cc-by-nc-sa-4.0: Creative Commons Attribution Non Commercial Share Alike 4.0 International

    By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this… See the full description on the dataset page: https://huggingface.co/datasets/bigscience-data/roots_ar_openiti_proc.

  18. The ULS23 Challenge Public Training Dataset Part 2

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Oct 30, 2023
    Cite
    de Grauw, Max (2023). The ULS23 Challenge Public Training Dataset Part 2 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10050959
    Explore at:
    Dataset updated
    Oct 30, 2023
    Dataset provided by
    van Ginneken, Bram
    Hering, Alessa
    de Grauw, Max
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains part of the imaging data for the Universal Lesion Segmentation Challenge (ULS23). It contains lesion volumes-of-interest (VOIs) for previously released data. It consists of 333 kidney lesions from the KiTS21 dataset, 2,246 lung lesions from LIDC-IDRI and 888 liver lesions from the LiTS challenge. The annotations are made available through the Challenge repository on GitHub. The Universal Lesion Segmentation 2023 (ULS23) data is licensed under CC BY-NC-SA 4.0.

  19. Penmanshiel Wind Farm Data

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 17, 2023
    Cite
    Plumley, Charlie (2023). Penmanshiel Wind Farm Data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5946807
    Explore at:
    Dataset updated
    Aug 17, 2023
    Dataset authored and provided by
    Plumley, Charlie
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains:

    • A kmz file for Penmanshiel wind farm in the UK (for opening in e.g. Google Earth)

    • Static data including turbine coordinates and turbine details (rated power, rotor diameter, hub height, etc.)

    • 10-minute SCADA and events data from the 14 Senvion MM82s at Penmanshiel wind farm, grouped by year from 2016 to mid-2021, extracted from our secondary SCADA system (Greenbyte). Note that not all signals are available for the entire period, and there is no turbine WT03.

    • Data mappings from primary SCADA to csv signal names

    • Site substation/PMU meter data where available for the same period

    • Site fiscal/grid meter data where available for the same period

    The dataset has been released by Cubico Sustainable Investments Ltd under a CC-BY-4.0 open data license and is provided as is. However, please provide any feedback you might have on the dataset and format of the data. I'll try and add or link to additional file formats that might be easier to work with (e.g. for use with specific analysis software), and update this dataset periodically (e.g. twice a year), but please prompt me as required.

    Feel free to use the data according to the license, however, it would be helpful to me if you could let me know where, how and why you are using the data, so that I can highlight this to the business (and renewables industry) and hopefully promote similar data sharing initiatives. I am particularly interested in performance analysis/improvement opportunities, how the dataset can be augmented with other (open) datasets, and sharing more generally within the renewables industry.

    If you would like to get access to other datasets we may hold (e.g. more recent data, data from our other sites, ~30s resolution data, etc.), please let me know, and, if you have any questions or want to discuss open data and this or other initiatives, please contact me and I will endeavour to help.

    I would like to thank Cubico's Senior Legal Advisor & Compliance Officer, IT Director, UK Asset Management Team, Executive Committee and my manager for supporting this initiative, as well as our partners GLIL for agreeing to release this data under an open license. I would also like to thank those I have talked to during the process of releasing this data under an open license and the encouragement and advice I have had on the way.

    For contact my email address is charlie.plumley@cubicoinvest.com.

    You can also access data from Kelmarsh wind farm here.

  20. Dataset for the study of Potential Code Borrowing and License Violations in...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 14, 2020
    Cite
    Golubev, Yaroslav (2020). Dataset for the study of Potential Code Borrowing and License Violations in Java Projects on GitHub [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3608211
    Explore at:
    Dataset updated
    Mar 14, 2020
    Dataset provided by
    Eliseeva, Maria
    Povarov, Nikita
    Bryksin, Timofey
    Golubev, Yaroslav
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset for the study of Potential Code Borrowing and License Violations in Java Projects on GitHub. The dataset is based on the Public Git Archive and consists of projects on GitHub that have at least 50 stars and have at least one line in Java. A total of 23,378 projects are listed here that we downloaded for analysis on June 1st, 2019.
