14 datasets found
  1. Zenodo Open Metadata snapshot - Training dataset for records and communities...

    • zenodo.org
    application/gzip, bin
    Updated Dec 15, 2022
    Cite
    Zenodo team (2022). Zenodo Open Metadata snapshot - Training dataset for records and communities classifier building [Dataset]. http://doi.org/10.5281/zenodo.7438358
    Explore at:
    Available download formats: bin, application/gzip
    Dataset updated
    Dec 15, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Zenodo team
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains Zenodo's published open access records and communities metadata, including entries marked by the Zenodo staff as spam and deleted.

    The datasets are gzip-compressed JSON Lines files, where each line is a JSON object representing a Zenodo record or community.

    Records dataset

    Filename: zenodo_open_metadata_{ date of export }.jsonl.gz

    Each object contains the terms: part_of, thesis, description, doi, meeting, imprint, references, recid, alternate_identifiers, resource_type, journal, related_identifiers, title, subjects, notes, creators, communities, access_right, keywords, contributors, publication_date

    which correspond to the fields with the same name available in Zenodo's record JSON Schema at https://zenodo.org/schemas/records/record-v1.0.0.json.

    In addition, some terms have been altered:

    • The term files contains a list of dictionaries containing filetype, size, and filename only.
    • The term license contains a short Zenodo ID of the license (e.g. "cc-by").

    Communities dataset

    Filename: zenodo_community_metadata_{ date of export }.jsonl.gz

    Each object contains the terms: id, title, description, curation_policy, page

    which correspond to the fields with the same name available in Zenodo's community creation form.

    Notes for all datasets

    For each object, the term spam contains a boolean value indicating whether the given record/community was marked as spam content by Zenodo staff.

    Top-level terms that were missing from the original metadata may contain a null value.

    A smaller uncompressed random sample of 200 JSON lines is also included for each dataset to test and get familiar with the format without having to download the entire dataset.
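
    The records file can be streamed line by line without unpacking it first. A minimal sketch in Python; the filename below is illustrative, so substitute the actual export date:

      import gzip
      import json

      # Stream records from the gzipped JSON Lines dump and count spam entries
      path = "zenodo_open_metadata_2022-12-15.jsonl.gz"  # example filename

      total = spam = 0
      with gzip.open(path, "rt", encoding="utf-8") as fh:
          for line in fh:
              record = json.loads(line)
              total += 1
              if record.get("spam"):
                  spam += 1

      print(f"{spam} of {total} records are marked as spam")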

  2. Data Citation Corpus Data File

    • zenodo.org
    zip
    Updated Oct 14, 2024
    Cite
    DataCite (2024). Data Citation Corpus Data File [Dataset]. http://doi.org/10.5281/zenodo.13376773
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 14, 2024
    Dataset provided by
    DataCite
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Data file for the second release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.

    The data file includes 5,256,114 data citation records in JSON and CSV formats. The JSON file is the version of record.

    For convenience, the data is provided in batches of approximately 1 million records each. The publication date and batch number are included in the file name, ex: 2024-08-23-data-citation-corpus-01-v2.0.json.

    The data citations in the file originate from DataCite Event Data and a project by Chan Zuckerberg Initiative (CZI) to identify mentions to datasets in the full text of articles.

    Each data citation record comprises:

    • A pair of identifiers: An identifier for the dataset (a DOI or an accession number) and the DOI of the publication (journal article or preprint) in which the dataset is cited

    • Metadata for the cited dataset and for the citing publication

    The data file includes the following fields:

    Field | Description | Required?
    id | Internal identifier for the citation | Yes
    created | Date of item's incorporation into the corpus | Yes
    updated | Date of item's most recent update in corpus | Yes
    repository | Repository where cited data is stored | No
    publisher | Publisher for the article citing the data | No
    journal | Journal for the article citing the data | No
    title | Title of cited data | No
    publication | DOI of article where data is cited | Yes
    dataset | DOI or accession number of cited data | Yes
    publishedDate | Date when citing article was published | No
    source | Source where citation was harvested | Yes
    subjects | Subject information for cited data | No
    affiliations | Affiliation information for creator of cited data | No
    funders | Funding information for cited data | No

    Additional documentation about the citations and metadata in the file is available on the Make Data Count website.
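
    As a quick orientation, a batch file can be loaded and inspected directly. A minimal sketch in Python, assuming the top level of each JSON batch is a list of citation records (adjust the filename to the batch you downloaded):

      import json
      from collections import Counter

      path = "2024-08-23-data-citation-corpus-01-v2.0.json"  # example batch

      with open(path, encoding="utf-8") as fh:
          records = json.load(fh)

      # Tally citations by harvesting source and print one example identifier pair
      print(Counter(r.get("source") for r in records))
      first = records[0]
      print(first["dataset"], "cited in", first["publication"])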

    The second release of the Data Citation Corpus data file reflects several changes made to add new citations, remove some records deemed out of scope for the corpus, update and enhance citation metadata, and improve the overall usability of the file. These changes are as follows:

    Add and update Event Data citations:

    • Add 179,885 new data citations created in DataCite Event Data from 01 June 2023 through 30 June 2024

    Remove citation records deemed out of scope for the corpus:

    • 273,567 records from DataCite Event Data with non-citation relationship types

    • 28,334 citations to items in non-data repositories (clinical trials registries, stem cells, samples, and other non-data materials)

    • 44,117 invalid citations where the subj_id value was the same as the obj_id value, or where subj_id and obj_id were inverted, indicating a citation from a dataset to a publication

    • 473,792 citations to invalid accession numbers from CZI data present in v1.1 as a result of false positives in the algorithm used to identify mentions

    • 4,110,019 duplicate records from CZI data present in v1.1 where metadata is the same for obj_id, subj_id, repository_id, publisher_id, journal_id, accession_number, and source_id (the record with the most recent updated date was retained in all of these cases)

    Metadata enhancements:

    • Apply Field of Science subject terms to citation records originating from CZI, based on disciplinary area of data repository

    • Initial cleanup of affiliation and funder organization names to remove personal email addresses and social media handles (additional cleanup and standardization in progress and will be included in future releases)

    Data structure updates to improve usability and eliminate redundancies:

    • Rename subj_id and obj_id fields to “dataset” and “publication” for clarity

    • Remove accessionNumber and doi elements to eliminate redundancy with subj_id

    • Remove relationTypeId fields as these are specific to Event Data only

    Full details of the above changes, including the scripts used to perform the above tasks, are available on GitHub.

    While v2 addresses a number of cleanup and enhancement tasks, additional data issues may remain, and additional enhancements are being explored. These will be addressed in the course of subsequent data file releases.


    Feedback on the data file can be submitted via GitHub. For general questions, email info@makedatacount.org.

  3. Data from: Worm Perturb-Seq: massively parallel whole-animal RNAi and...

    • zenodo.org
    bin, zip
    Updated Apr 17, 2025
    Cite
    Xuhang Li (2025). Worm Perturb-Seq: massively parallel whole-animal RNAi and RNA-seq [Dataset]. http://doi.org/10.5281/zenodo.15223779
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Apr 17, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xuhang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description:

    This repository contains all source code, necessary input and output data, and raw figures and tables for reproducing most figures and results published in the following study:

    Hefei Zhang#, Xuhang Li#, Dongyuan Song, Onur Yukselen, Shivani Nanda, Alper Kucukural, Jingyi Jessica Li, Manuel Garber, Albertha J.M. Walhout. Worm Perturb-Seq: massively parallel whole-animal RNAi and RNA-seq. (2025) Nature Communications, in press (# equal contribution, *: corresponding author)

    These include results related to method benchmarking and NHR data processing. Source data for figures that are not reproducible here have been provided with the publication.

    Files:

    This repository contains a few directories related to this publication. To deposit into Zenodo, we have individually zipped each subfolder of the root directory.

    There are three directories included:

    • MetabolicLibrary

      • contains files related to the benchmarking analyses using the metabolic gene WPS data. This folder partially overlaps with the working directory of the sister paper deposited at 10.5281/zenodo.14198997
    • method_simulation

      • contains files related to the simulation benchmarking
    • NHRLibrary

      • contains files related to the analyses of NHR gene WPS data

    Note: the parameter optimization output is deposited in a separate Zenodo repository (10.5281/zenodo.15236858) for better organization and ease of use. If you would like to reproduce results related to the "MetabolicLibrary" folder, please download and integrate the omitted subfolder "MetabolicLibrary/2_DE/output/" from this separate repository.

    Please be advised that this repository contains raw code and data that are not directly related to a figure in our paper. However, they may be useful for generating input used in the analysis of a figure, or for reproducing tables in our manuscript. The repository may also contain unpublished analyses and figures, which we have kept for the record rather than deleting.

    Usage:

    Please refer to the table below to locate a specific file for reproducing a figure of interest (also available in METHOD_FIGURE_LOOKUP.xlsx under the root directory).

    Figure | File | Lines | Notes
    Fig. 2c | MetabolicLibrary/1_QC_dataCleaning/2_badSampleExclusion_manual.R | 65-235 | output figure is selected from figures/met10_lib6_badSamplePCA.pdf
    Fig. 2d | NHRLibrary/example_bams/* | - | load the bam files in IGV to make the figure
    Fig. 3a | MetabolicLibrary/2_DE/2_5_vectorlike_re_analysis_with_curated_44_NTP_conditions.R | 348-463 |
    Fig. 3b,c | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 106-376 |
    Fig. 3d | MetabolicLibrary/2_DE/SUPP_extra_figures_for_rewiring.R | 10-139 |
    Fig. 3e | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 379-522 |
    Fig. 3f,g | MetabolicLibrary/2_DE/SUPP_extra_figures_for_rewiring.R | 1-8 |
    Fig. 3h | method_simulation/Supp_systematic_mean_variation_example.R | 1-138 |
    Fig. 3i | method_simulation/3_benchmark_DE_result_w_rep.R and 1_benchmark_DE_result_std_NB_w_rep.R | 1-518 | the example figure was from figures/GLM_NB_deltaMiu_k08_10rep/seed_12345_empircial_null_fit_simulated_data_NB_GLM.pdf and figures/GLM_NB_10rep/seed_1_empircial_null_fit_simulated_data_NB_GLM.pdf
    Fig. 3j | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 2104-2106 | load dependencies starting from line 1837
    Fig. 3k | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 2053-2078 | load dependencies starting from line 1837; the GSEA was performed using SUPP_supplementary_figures_for_method_noiseness_GSEA.R
    Fig. 4a,b | method_simulation/3_benchmark_DE_result_w_rep.R | 1-523 |
    Fig. 4c | method_simulation/3_benchmark_WPS_parameters.R | 1-237 |
    Fig. 4d | MetabolicLibrary/2_DE/2_5_vectorlike_re_analysis_with_curated_44_NTP_conditions.R | 1-346 | output figure is selected from figures/0_DE_QA/vectorlike_analysis/2d_cutoff_titration_71NTP_rawDE_log2FoldChange_log2FoldChange_raw.pdf and 2d_cutoff_titration_71NTP_p0.005_log2FoldChange_raw.pdf. The "p0.005" in the second file name indicates the p_outlier cutoff used in the final parameter set for EmpirDE.
    Fig. 4e | MetabolicLibrary/2_DE/SUPP_plot_N_DE_repeated_RNAi.R | entire file |
    Fig. 4f | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 1020-1407 |
    Fig. 4g,h | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 529-851 |
    Fig. 5d | NHRLibrary/FinalAnalysis/2_DE_new/3_DE_network_analysis.R | 51-69; 94-112 | load dependencies starting from line 1
    Fig. 5e | NHRLibrary/FinalAnalysis/2_DE_new/5_GSA_bubble_plot.R | 1-306 |
    Fig. 5f | NHRLibrary/FinalAnalysis/2_DE_new/5_GSA.R | 1-1492 |
    Fig. 6a | NHRLibrary/FinalAnalysis/2_DE_new/4_DE_similarity_analysis.R | 1-175 |
    Fig. 6b | NHRLibrary/FinalAnalysis/6_case_study.R | 506-534 | load dependencies starting from line 1
    Fig. 6c | NHRLibrary/FinalAnalysis/6_case_study.R | 668-888 | load dependencies starting from line 1
    Supplementary Fig. 1e | NHRLibrary/FinalAnalysis/5_revision/REVISION_gene_detection_sensitivity_benchmark.R | 1-143 |
    Supplementary Fig. 1f | MetabolicLibrary/1_QC_dataCleaning/2_badSampleExclusion_manual.R | 65-235 | output figure is selected from figures/met10_lib6_badSampleCorr.pdf
    Supplementary Fig. 1g | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 2191-2342 |
    Supplementary Fig. 2a | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 1409-1822 |
    Supplementary Fig. 2b | method_simulation/Supp_systematic_mean_variation_example.R | 1-138 |
    Supplementary Fig. 2c | method_simulation/Supp_systematic_mean_variation_example.R; 2_fit_logFC_distribution.R | 141-231; 1-201 | the middle panel was generated from Supp_systematic_mean_variation_example.R (lines 141-231) and right panel was from 2_fit_logFC_distribution.R (lines 1-201)
    Supplementary Fig. 2d | method_simulation/1_benchmark_DE_result_std_NB_w_rep.R | 1-518 | the example figure was from figures/GLM_NB_10rep/seed_1_empircial_null_fit_simulated_data_NB_GLM.pdf
    Supplementary Fig. 2e | method_simulation/3_benchmark_DE_result_w_rep.R | 1-518 | the example figure was from figures/GLM_NB_deltaMiu_k08_10rep/seed_12345_empircial_null_fit_simulated_data_NB_GLM.pdf
    Supplementary Fig. 2f | method_simulation/3_benchmark_DE_result_w_rep.R | 528-573 | may need to run the code from line 1 to load other variables needed
    Supplementary Fig. 3a,b | method_simulation/1_benchmark_DE_result_std_NB_w_rep.R | 1-523 |
    Supplementary Fig. 3c | method_simulation/3_benchmark_WPS_parameters.R | 1-237 |
    Supplementary Fig. 3d | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 3190-3300 |
    Supplementary Fig. 3e | 2_3_power_error_tradeoff_optimization.R | entire file | the figure was in figures/0_DE_QA/cleaning_strength_titration/benchmark_vectorlikes_titrate_cleaning_cutoff_DE_log2FoldChange_FDR0.2_FC1.pdf (produced in line 398); this script produced the titration plots for a series of thresholds, where we picked FDR0.2_FC1 for presentation in the paper
    Supplementary Fig. 3f | 2_3_power_error_tradeoff_optimization.R | entire file | the figure was in figures/0_DE_QA/cleaning_strength_titration/benchmark_independent_repeats_titrate_cleaning_cutoff_FP_log2FoldChange_raw.pdf (produced in line 195). The top line plot was from figures/0_DE_QA/cleaning_strength_titration/benchmark_independent_repeats_titrate_cleaning_cutoff_FP_log2FoldChange_raw_summary_stat.pdf (produced in line 218).
    Supplementary Fig. 4 | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 853-898 | please run from line 529 to load dependencies
    Supplementary Fig. 5a,b | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 2590-3185 |
    Supplementary Fig.

  4. Data from: A dataset of GitHub Actions workflow histories

    • data.niaid.nih.gov
    Updated Oct 25, 2024
    + more versions
    Cite
    Cardoen, Guillaume (2024). A dataset of GitHub Actions workflow histories [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10259013
    Explore at:
    Dataset updated
    Oct 25, 2024
    Dataset provided by
    University of Mons
    Authors
    Cardoen, Guillaume
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This replication package accompanies the dataset and exploratory empirical analysis reported in the paper "A dataset of GitHub Actions workflow histories", published at the IEEE MSR 2024 conference. (The Jupyter notebook can be found in a previous version of this dataset.)

    Important notice: Zenodo appears to compress gzipped files a second time without notice, so they are "double compressed". When you download them, they should be named x.gz.gz instead of x.gz. Note that the provided MD5 refers to the original file.

    2024-10-25 update : updated repositories list and observation period. The filters relying on date were also updated.

    2024-07-09 update : fix sometimes invalid valid_yaml flag.

    The dataset was created as follows:

    First, we used GitHub SEART (on October 7th, 2024) to get a list of every non-fork repository created before January 1st, 2024, having at least 300 commits and at least 100 stars, where at least one commit was made after January 1st, 2024. (The goal of these filters is to exclude experimental and personal repositories.)

    We checked if a folder .github/workflows existed. We filtered out repositories that did not contain this folder and pulled the others (between the 9th and 10th of October 2024).

    We applied the tool gigawork (version 1.4.2) to extract every file from this folder. The exact command used is python batch.py -d /ourDataFolder/repositories -e /ourDataFolder/errors -o /ourDataFolder/output -r /ourDataFolder/repositories_everything.csv.gz -- -w /ourDataFolder/workflows_auxiliaries. (The script batch.py can be found on GitHub.)

    We concatenated every file in /ourDataFolder/output into a CSV (using cat headers.csv output/*.csv > workflows_auxiliaries.csv in /ourDataFolder) and compressed it.

    We added the column uid via a script available on GitHub.

    Finally, we archived the folder /ourDataFolder/workflows with pigz (tar -c --use-compress-program=pigz -f workflows_auxiliaries.tar.gz /ourDataFolder/workflows).

    Using the extracted data, the following files were created:

    workflows.tar.gz contains the dataset of GitHub Actions workflow file histories.

    workflows_auxiliaries.tar.gz is a similar file containing also auxiliary files.

    workflows.csv.gz contains the metadata for the extracted workflow files.

    workflows_auxiliaries.csv.gz is a similar file containing also metadata for auxiliary files.

    repositories.csv.gz contains metadata about the GitHub repositories containing the workflow files. These metadata were extracted using the SEART Search tool.

    The metadata is separated into different columns:

    repository: The repository (author and repository name) from which the workflow was extracted. The separator "/" distinguishes the author from the repository name

    commit_hash: The commit hash returned by git

    author_name: The name of the author that changed this file

    author_email: The email of the author that changed this file

    committer_name: The name of the committer

    committer_email: The email of the committer

    committed_date: The committed date of the commit

    authored_date: The authored date of the commit

    file_path: The path to this file in the repository

    previous_file_path: The path to this file before it has been touched

    file_hash: The name of the related workflow file in the dataset

    previous_file_hash: The name of the related workflow file in the dataset, before it has been touched

    git_change_type: A single letter (A, D, M, or R) representing the type of change made to the workflow (Added, Deleted, Modified, or Renamed). This letter is given by gitpython and provided as is.

    valid_yaml: A boolean indicating if the file is a valid YAML file.

    probably_workflow: A boolean indicating if the file contains the YAML keys on and jobs. (Note that it can still be an invalid YAML file.)

    valid_workflow: A boolean indicating if the file respects the syntax of GitHub Actions workflows. A freely available JSON Schema (used by gigawork) was used for this purpose.

    uid: Unique identifier for a given file surviving modifications and renames. It is generated when the file is added and stays the same until the file is deleted. Renaming does not change the identifier.

    Both workflows.csv.gz and workflows_auxiliaries.csv.gz follow this format.
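
    A minimal sketch of loading the metadata with pandas, assuming the columns listed above (gunzip the file once first if Zenodo delivered it double compressed):

      import pandas as pd

      # Load the workflow metadata described above
      meta = pd.read_csv("workflows.csv.gz", compression="gzip")

      # Distribution of change types (A, D, M, R) across all workflow histories
      print(meta["git_change_type"].value_counts())

      # Full history of one workflow file, followed across renames via its uid
      one_uid = meta["uid"].iloc[0]
      history = meta[meta["uid"] == one_uid].sort_values("committed_date")
      print(history[["committed_date", "git_change_type", "file_path"]])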

  5. Complete Rxivist dataset of scraped biology preprint data

    • data-staging.niaid.nih.gov
    Updated Mar 2, 2023
    + more versions
    Cite
    Abdill, Richard J.; Blekhman, Ran (2023). Complete Rxivist dataset of scraped biology preprint data [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_2529922
    Explore at:
    Dataset updated
    Mar 2, 2023
    Dataset provided by
    University of Minnesota
    Authors
    Abdill, Richard J.; Blekhman, Ran
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    rxivist.org allowed readers to sort and filter the tens of thousands of preprints posted to bioRxiv and medRxiv. Rxivist used a custom web crawler to index all papers posted to those two websites; this is a snapshot of the Rxivist production database. The version number indicates the date on which the snapshot was taken. See the included "README.md" file for instructions on how to use the "rxivist.backup" file to import data into a PostgreSQL database server.

    Please note this is a different repository than the one used for the Rxivist manuscript—that is in a separate Zenodo repository. You're welcome (and encouraged!) to use this data in your research, but please cite our paper, now published in eLife.

    Previous versions are also available pre-loaded into Docker images, available at blekhmanlab/rxivist_data.
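
    Once the backup has been restored (see the README), the tables described in the version notes below can be queried directly. A minimal sketch in Python, assuming a local database named "rxivist"; the connection details are placeholders, and depending on the snapshot you may need to qualify the table name with its schema:

      import psycopg2

      conn = psycopg2.connect(dbname="rxivist", user="postgres", host="localhost")
      with conn, conn.cursor() as cur:
          # "repo" records whether a preprint came from bioRxiv or medRxiv
          cur.execute("SELECT repo, COUNT(*) FROM articles GROUP BY repo")
          for repo, n in cur.fetchall():
              print(repo, n)
      conn.close()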

    Version notes:

    2023-03-01

    The final Rxivist data upload, more than four years after the first and encompassing 223,541 preprints posted to bioRxiv and medRxiv through the end of February 2023.

    2020-12-07

    In addition to bioRxiv preprints, the database now includes all medRxiv preprints as well.

    The website where a preprint was posted is now recorded in a new field in the "articles" table, called "repo".

    We've significantly refactored the web crawler to take advantage of developments with the bioRxiv API.

    The main difference is that preprints flagged as "published" by bioRxiv are no longer recorded on the same schedule that download metrics are updated: The Rxivist database should now record published DOI entries the same day bioRxiv detects them.

    Twitter metrics have returned, for the most part. Improvements with the Crossref Event Data API mean we can once again tally daily Twitter counts for all bioRxiv DOIs.

    The "crossref_daily" table remains where these are recorded, and daily numbers are now up to date.

    Historical daily counts have also been re-crawled to fill in the empty space that started in October 2019.

    There are still several gaps that are more than a week long due to missing data from Crossref.

    We have recorded available Crossref Twitter data for all papers with DOI numbers starting with "10.1101," which includes all medRxiv preprints. However, there appears to be almost no Twitter data available for medRxiv preprints.

    The download metrics for article id 72514 (DOI 10.1101/2020.01.30.927871) were found to be out of date for February 2020 and are now correct. This is notable because article 72514 is the most downloaded preprint of all time; we're still looking into why this wasn't updated after the month ended.

    2020-11-18

    Publication checks should be back on schedule.

    2020-10-26

    This snapshot fixes most of the data issues found in the previous version. Indexed papers are now up to date, and download metrics are back on schedule. The check for publication status remains behind schedule, however, and the database may not include published DOIs for papers that have been flagged on bioRxiv as "published" over the last two months. Another snapshot will be posted in the next few weeks with updated publication information.

    2020-09-15

    A crawler error caused this snapshot to exclude all papers posted after about August 29, with some papers having download metrics that were more out of date than usual. The "last_crawled" field is accurate.

    2020-09-08

    This snapshot is misconfigured and will not work without modification; it has been replaced with version 2020-09-15.

    2019-12-27

    Several dozen papers did not have dates associated with them; that has been fixed.

    Some authors have had two entries in the "authors" table for portions of 2019, one profile that was linked to their ORCID and one that was not, occasionally with almost identical "name" strings. This happened after bioRxiv began changing author names to reflect the names in the PDFs, rather than the ones manually entered into their system. These database records are mostly consolidated now, but some may remain.

    2019-11-29

    The Crossref Event Data API remains down; Twitter data is unavailable for dates after early October.

    2019-10-31

    The Crossref Event Data API is still experiencing problems; the Twitter data for October is incomplete in this snapshot.

    The README file has been modified to reflect changes in the process for creating your own DB snapshots if using the newly released PostgreSQL 12.

    2019-10-01

    The Crossref API is back online, and the "crossref_daily" table should now include up-to-date tweet information for July through September.

    About 40,000 authors were removed from the author table because the name had been removed from all preprints they had previously been associated with, likely because their name changed slightly on the bioRxiv website ("John Smith" to "J Smith" or "John M Smith"). The "author_emails" table was also modified to remove entries referring to the deleted authors. The web crawler is being updated to clean these orphaned entries more frequently.

    2019-08-30

    The Crossref Event Data API, which provides the data used to populate the table of tweet counts, has not been fully functional since early July. While we are optimistic that accurate tweet counts will be available at some point, the sparse values currently in the "crossref_daily" table for July and August should not be considered reliable.

    2019-07-01

    A new "institution" field has been added to the "article_authors" table that stores each author's institutional affiliation as listed on that paper. The "authors" table still has each author's most recently observed institution.

    We began collecting this data in the middle of May, but it has not been applied to older papers yet.

    2019-05-11

    The README was updated to correct a link to the Docker repository used for the pre-built images.

    2019-03-21

    The license for this dataset has been changed to CC-BY, which allows use for any purpose and requires only attribution.

    A new table, "publication_dates," has been added and will be continually updated. This table will include an entry for each preprint that has been published externally for which we can determine a date of publication, based on data from Crossref. (This table was previously included in the "paper" schema but was not updated after early December 2018.)

    Foreign key constraints have been added to almost every table in the database. This should not impact any read behavior, but anyone writing to these tables will encounter constraints on existing fields that refer to other tables. Most frequently, this means the "article" field in a table will need to refer to an ID that actually exists in the "articles" table.

    The "author_translations" table has been removed. This was used to redirect incoming requests for outdated author profile pages and was likely not of any functional use to others.

    The "README.md" file has been renamed "1README.md" because Zenodo only displays a preview for the file that appears first in the list alphabetically.

    The "article_ranks" and "article_ranks_working" tables have been removed as well; they were unused.

    2019-02-13.1

    After consultation with bioRxiv, the "fulltext" table will not be included in further snapshots until (and if) concerns about licensing and copyright can be resolved.

    The "docker-compose.yml" file was added, with corresponding instructions in the README to streamline deployment of a local copy of this database.

    2019-02-13

    The redundant "paper" schema has been removed.

    BioRxiv has begun making the full text of preprints available online. Beginning with this version, a new table ("fulltext") is available that contains the text of preprints that have been processed already. The format in which this information is stored may change in the future; any changes will be noted here.

    This is the first version that has a corresponding Docker image.

  6. Reddit Comments Dataset for Text Style Transfer Tasks

    • zenodo.org
    csv, json
    Updated Jun 17, 2023
    Cite
    Fabian Kopf (2023). Reddit Comments Dataset for Text Style Transfer Tasks [Dataset]. http://doi.org/10.5281/zenodo.8023142
    Explore at:
    Available download formats: csv, json
    Dataset updated
    Jun 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Fabian Kopf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reddit Comments Dataset for Text Style Transfer Tasks

    A dataset of Reddit comments prepared for Text Style Transfer Tasks.

    The dataset contains Reddit comments translated into a formal language. For this translation, text-davinci-003 was used. To make text-davinci-003 translate the comments into a more formal version, the following prompt was used:
    "Here is some text: {original_comment} Here is a rewrite of the text, which is more neutral: {"
    This prompting technique was taken from A Recipe For Arbitrary Text Style Transfer with Large Language Models.

    The dataset contains comments from the following Subreddits: antiwork, atheism, Conservative, conspiracy, dankmemes, gaybros, leagueoflegends, lgbt, libertarian, linguistics, MensRights, news, offbeat, PoliticalCompassMemes, politics, teenagers, TrueReddit, TwoXChromosomes, wallstreetbets, worldnews.

    The quality of formal translations was assessed with BERTScore and chrF++:

    • BERTScore: F1-Score: 0.89, Precision: 0.90, Recall: 0.88
    • chrF++: 37.16

    The average perplexity of the generated formal texts, calculated using GPT-2, is 123.77.


    The dataset consists of 3 components.

    reddit_commments.csv

    This file contains a collection of randomly selected comments from 20 Subreddits. For each comment, the following information was collected:
    - subreddit (name of the subreddit in which the comment was posted)
    - id (ID of the comment)
    - submission_id (ID of the submission to which the comment was posted)
    - body (the comment itself)
    - created_utc (timestamp in seconds)
    - parent_id (The ID of the comment or submission to which the comment is a reply)
    - permalink (The URL to the original comment)
    - token_size (How many tokens the comment will be split into by the standard GPT-2 tokenizer)
    - perplexity (The perplexity GPT-2 assigns to the comment)

    The comments were filtered. This file contains only comments that:
    - have been split by GPT-2 Tokenizer into more than 10 tokens but less than 512 tokens.
    - are not [removed] or [deleted]
    - do not contain URLs

    This file was used as a source for the other two file types.

    Labeled Files (training_labeled.csv and eval_labeled.csv)

    These files contain the formal translations of the Reddit comments.

    The 150 comments with the highest calculated perplexity of GPT-2 from each Subreddit were translated into a formal version. This filter was used to translate as many comments as possible that have large stylistic salience.

    They are structured as follows:
    - Subreddit (name of the subreddit where the comment was posted).
    - Original Comment
    - Formal Comment

    Labeled Files with Style Examples (training_labeled_with_style_samples.json and eval_labeled_with_style_samples.json)

    These files contain an original Reddit comment, three sample comments from the same subreddit, and the formal translation of the original Reddit comment.

    These files can be used to train models to perform style transfers based on given examples.
    The task is to transform the formal translation of the Reddit comment, using the three given examples, into the style of the examples.

    An entry in this file is structured as follows:

    "data":[
    {
    "input_sentence":"The original Reddit comment",
    "style_samples":[
    "sample1",
    "sample2",
    "sample3"
    ],
    "results_sentence":"The formal translated input_sentence",
    "subreddit":"The subreddit from which the comments originated"
    },
    "..."
    ]
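
    A minimal sketch of loading one of these files in Python and assembling (style examples, formal input, informal target) triples for training:

      import json

      with open("training_labeled_with_style_samples.json", encoding="utf-8") as fh:
          entries = json.load(fh)["data"]

      for entry in entries[:3]:
          formal = entry["results_sentence"]   # formal translation (model input)
          samples = entry["style_samples"]     # three comments from the same subreddit
          target = entry["input_sentence"]     # original Reddit comment (model target)
          print(entry["subreddit"], "|", formal[:60], "->", target[:60])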

  7. Ukraine_SL: A checklist of vascular plants, bryobionts, and lichens of...

    • zenodo.org
    bin, txt, zip
    Updated Apr 10, 2025
    Cite
    Denys Vynokurov; Dariia Borovyk; Valerii Darmostuk; Denys Davydov; Svitlana Iemelianova; Jiří Danihelka (2025). Ukraine_SL: A checklist of vascular plants, bryobionts, and lichens of Ukraine for storing vegetation plots in TURBOVEG [Dataset]. http://doi.org/10.5281/zenodo.15192162
    Explore at:
    Available download formats: zip, bin, txt
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Denys Vynokurov; Dariia Borovyk; Valerii Darmostuk; Denys Davydov; Svitlana Iemelianova; Jiří Danihelka
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Ukraine
    Description

    Given the critical need for the unification and coordinated use of floristic checklists within the TURBOVEG software environment (Hennekens & Schaminée 2001), we propose a new species list, Ukraine_SL, for Ukrainian flora.

    The taxonomic basis for Ukraine_SL (for vascular plants) is the UkrTrait taxonomy (Vynokurov et al. 2024), which is based on the Checklist of vascular plants of Ukraine (Mosyakin & Fedoronchuk 1999) and supplemented with taxa newly recorded or described in Ukraine in the years following its publication. Additionally, corrections have been made to the spelling of some taxon names (see details in Vynokurov 2024). Bryobionts follow the Second checklist of bryobionts in Ukraine (Boiko 2014).

    For the vast majority of vascular plants, corresponding names from the Euro+Med database are provided, enabling efficient conversion of phytosociological relevés between different taxonomic systems and facilitating integration with the European Vegetation Archive (EVA) (Chytrý et al. 2016).

    Moreover, most vascular plants are linked to the Ukrainian Plant Trait Database (UkrTrait v. 1.0) (Vynokurov et al. 2024), allowing rapid extraction of available traits for vegetation studies (e.g. plant height, life forms, flowering period, etc.).

    Ukraine_SL will be regularly updated and published on the Zenodo platform. In addition to the species list for TURBOVEG itself (Ukraine_SL.zip), an Excel file with a taxonomic crosswalk (ukraine_sl_taxonomy.xlsx) is also provided. It includes matches between the UkrTrait taxonomy, the original taxon concepts from Mosyakin & Fedoronchuk (1999), and names from the Euro+Med database (europlusmed.org).
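
    A minimal sketch of opening the crosswalk in Python to inspect it; the column names are not documented here, so the snippet only previews the file:

      import pandas as pd

      # Preview the mapping between UkrTrait names, the Mosyakin & Fedoronchuk (1999)
      # concepts, and Euro+Med names
      crosswalk = pd.read_excel("ukraine_sl_taxonomy.xlsx")
      print(crosswalk.columns.tolist())
      print(crosswalk.head())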

    An expert system file (expert_ukraine_sl_euromed.txt) is also available for download, enabling translation of vegetation plots to the Euro+Med floristic list within the JUICE software (Tichý 2002).

    Installation Instructions

    To install the species list in TURBOVEG (Ukraine_SL.zip), download and unzip the archive into the Turbowin/Species/ directory of your TURBOVEG installation. After unzipping, a folder named Ukraine_SL should appear, containing the file SPECIES.DBF. The list will then be available for use in TURBOVEG.

    When working with this list, it is critically important to use only the species already included and not to add new taxa manually, as this would prevent synchronization with future updates and may cause errors during database merging.

    If taxa not present in the list are needed, users should contact the authors. The list will then be updated, and a new version made available for download. The full update history, including a list of changes, will be accessible on the Zenodo website. Any newly added taxa will be assigned unique, non-overlapping IDs.

    Updating the List

    To update the list in TURBOVEG, download the latest version from Zenodo and replace the old version in the Turbowin/Species/ directory by deleting it and unzipping the new archive (Ukraine_SL.zip).

    Using the Expert System in JUICE for translation to the Euro+Med taxonomy

    To use the expert system file (expert_ukraine_sl_euromed.txt) in JUICE:

    1. Go to Analysis → Expert System Classificator.

    2. Upload the .txt file.

    3. In the window that appears, click "Modify Species Names", followed by "Merge Same Spec. Names".

  8. Data from: Hybrid Approaches to Detect Comments Violating Macro Norms on...

    • zenodo.org
    csv
    Updated Jan 24, 2020
    Cite
    Eshwar Chandrasekharan; Mattia Samory; Eric Gilbert (2020). Hybrid Approaches to Detect Comments Violating Macro Norms on Reddit [Dataset]. http://doi.org/10.5281/zenodo.3338698
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Eshwar Chandrasekharan; Mattia Samory; Eric Gilbert
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    [Content warning: Files may contain instances of highly inflammatory and offensive content.]


    This dataset was generated as an extension of our CSCW 2018 paper:

    Eshwar Chandrasekharan, Mattia Samory, Shagun Jhaver, Hunter Charvat, Amy Bruckman, Cliff Lampe, Jacob Eisenstein, and Eric Gilbert. 2018. The Internet’s Hidden Rules: An Empirical Study of Reddit Norm Violations at Micro, Meso, and Macro Scales. Proceedings of the ACM on Human-Computer Interaction 2, CSCW (2018), 32.

    Description:

    Working with over 2M removed comments collected from 100 different communities on Reddit (subreddit names listed in data/study-subreddits.csv), we identified 8 macro norms, i.e., norms that are widely enforced on most parts of Reddit. We extracted these macro norms by employing a hybrid approach—classification, topic modeling, and open-coding—on comments identified to be norm violations within at least 85 out of the 100 study subreddits. Finally, we labelled over 40K Reddit comments removed by moderators according to the specific type of macro norm being violated, and make this dataset publicly available (also available on Github).

    For each of the labeled topics, we identified the top 5000 removed comments that were best fit by the LDA topic model. In this way, we identified over 5000 removed comments that are examples of each type of macro norm violation described in the paper. The removed comments were sorted by their topic fit, stored into respective files based on the type of norm violation they represent, and are made available on this repo.

    Here we make the following datasets publicly available:

    * 1 file containing the log of over 2M removed comments obtained from the top 100 subreddits between May 2016 and March 2017, after filtering out the following comments: 1) comments by u/AutoModerator, 2) replies to removed comments (i.e., children of the poisoned tree - refer to the paper for more information), and 3) non-readable comments (not utf-8 encoded).

    * 8 files, each containing 5000+ removed comments obtained from Reddit, are stored in data/macro-norm-violations/, and they are split into different files based on the macro norm they violated. Each new line in the files represents a comment that was posted on Reddit between May 2016 and March 2017, and subsequently removed by subreddit moderators for violating community norms. All comments were preprocessed using the script in code/preprocessing-reddit-comments.py, in order to do the following: 1. remove new lines, 2. convert text to lowercase, and 3. strip numbers and punctuation from comments.

    Description of 1 file containing over 2M removed comments from 100 subreddits.

    • "reddit-removal-log.csv" - all comments that were removed from the 100 study subreddits during the study period described above (post-filtering).

    Descriptions of each file containing 5059 comments (that were removed from Reddit, and preprocessed) violating macro norms present in data/macro-norm-violations/:

    • "macro-norm-violations-n10-t0-misogynistic-slurs.csv" - Comments that use misogynistic slurs.
    • "macro-norm-violations-n15-t2-hatespeech-racist-homophobic.csv" - Comments containing hate speech that is racist or homophobic.
    • "macro-norm-violations-n10-t3-opposing-political-views-trump.csv", "macro-norm-violations-n15-t10-opposing-political-views-trump.csv" - Comments with opposing political views around Trump (depends on originating sub).
    • "macro-norm-violations-n10-t4-verbal-attacks-on-Reddit.csv" - Comments containing verbal attacks on Reddit or specific subreddits.
    • "macro-norm-violations-n10-t5-porno-links.csv" - Comments with pornographic links.
    • "macro-norm-violations-n10-t8-personal-attacks.csv", "macro-norm-violations-n10-t9-personal-attacks.csv"- Comments containing personal attacks.
    • "macro-norm-violations-n15-t3-abusing-and-criticisizing-mods.csv" - Comments abusing and criticisizng moderators.
    • "macro-norm-violations-n15-t9-namecalling-claiming-other-too-sensitive.csv" - Comments with name-calling, or claiming that the other person is too sensitive.

    More details about the dataset can be found on arXiv: https://arxiv.org/abs/1904.03596
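
    A minimal sketch of loading the preprocessed norm-violation files in Python (each file holds one removed comment per line, as described above):

      from pathlib import Path

      data_dir = Path("data/macro-norm-violations")

      # Map each macro norm (encoded in the filename) to its list of removed comments
      violations = {}
      for path in sorted(data_dir.glob("macro-norm-violations-*.csv")):
          with path.open(encoding="utf-8") as fh:
              violations[path.stem] = [line.strip() for line in fh if line.strip()]

      for norm, comments in violations.items():
          print(norm, len(comments))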

  9. The Phantom EEG Dataset

    • zenodo.org
    bin
    Updated Oct 14, 2024
    Cite
    Anonymous (2024). The Phantom EEG Dataset [Dataset]. http://doi.org/10.5281/zenodo.11238929
    Explore at:
    Available download formats: bin
    Dataset updated
    Oct 14, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    New version on https://zenodo.org/records/13341214.

    When you use this dataset, please cite the paper below. More information about this dataset can also be found in that paper.

    Xu, X., Wang, B., Xiao, B., Niu, Y., Wang, Y., Wu, X., & Chen, J. (2024). Beware of Overestimated Decoding Performance Arising from Temporal Autocorrelations in Electroencephalogram Signals. arXiv preprint arXiv:2405.17024.

    1 Metadata

    Brief introduction

    The present work aims to demonstrate that temporal autocorrelations (TA) significantly impact various BCI tasks even in conditions without neural activity. We used watermelons as phantom heads and found that the pitfall of overestimated decoding performance arises if continuous EEG data with the same class label are split into training and test sets. More details can be found in Motivation.

    As watermelons cannot perform any experimental tasks, we can reorganize the data into the format of various actual EEG datasets without the need to collect EEG data as previous work did (examples in Domain Studied).

    Measurement devices

    Manufacturers: NeuroScan SynAmps2 system (Compumedics Limited, Victoria, Australia)

    Configuration: 64-channel Ag/AgCl electrode cap with a 10/20 layout

    Species

    Watermelons. Ten watermelons served as phantom heads.

    Domain Studied

    Overestimated Decoding Performance in EEG decoding.

    The following BCI datasets for various BCI tasks have been reorganized using the Phantom EEG Dataset. The pitfall was found in four of the five tasks.

    - CVPR dataset [1] for image decoding task.

    - DEAP dataset [2] for emotion recognition task.

    - KUL dataset [3] for auditory spatial attention decoding task.

    - BCIIV2a dataset [4] for motor imagery task (the pitfalls were absent due to the use of rapid-design paradigm during EEG recording).

    - SIENA dataset [5] for epilepsy detection task.

    Tasks Completed

    Resting state, but the data can be reorganized for any BCI task.

    Dataset Name

    The Phantom EEG Dataset

    Dataset license

    Creative Commons Attribution 4.0 International

    Code

    The code to read the data files (.cnt) is provided in "Other". We could not add the file in this version because Zenodo demands that "you must create a new version to add, modify or delete files". We will add the file after organizing the datasets to comply with the FAIR principles in version v2.

    Data information

    The data will be published in the following formats in version v2:

    - CNT: the raw data.

    - BIDS: an extension to the brain imaging data structure for electroencephalography. BIDS primarily addresses the heterogeneity of data organization by following the FAIR principles [6].

    An additional electrode was placed on the lower part of the watermelon as the physiological reference, and the forehead served as the ground site. The inter-electrode impedances were maintained under 20 kOhm. Data were recorded at a sampling rate of 1000 Hz. EEG recordings for each watermelon lasted for more than 1 hour to ensure sufficient data for the decoding task.

    Each Subject (S*.cnt) contains the following information:

    EEG.data: EEG data (samples X channels)

    EEG.srate: Sampling frequency of the saved data

    EEG.chanlocs: channel numbers (1 to 68; 'EKG', 'EMG', 'VEO', and 'HEO' were not recorded)
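
    Until the authors' reader script is added, the raw .cnt recordings can also be opened with standard tooling. A minimal sketch in Python using MNE; the filename is an example subject, and MNE is an assumption rather than the authors' provided code:

      import mne

      raw = mne.io.read_raw_cnt("S1.cnt", preload=True)

      print(raw.info["sfreq"])   # sampling rate (1000 Hz per the description)
      print(len(raw.ch_names))   # number of recorded channels
      data = raw.get_data()      # array of shape (channels, samples)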

    Citation and more information

    Citation will be updated after the review period is completed.

    We will provide more information about this dataset (e.g. the units of the captured data) once our work is accepted. This is because our work is currently under review, and we are not allowed to disclose more information according to the relevant requirements.

    All metadata will be provided as a backup on Github and will be available after the review period is completed.

    2 Motivation

    Researchers have reported high decoding accuracy (>95%) using non-invasive Electroencephalogram (EEG) signals for brain-computer interface (BCI) decoding tasks like image decoding, emotion recognition, auditory spatial attention detection, epilepsy detection, etc. Since these EEG data were usually collected with well-designed paradigms in labs, the reliability and robustness of the corresponding decoding methods were doubted by some researchers, and they proposed that such decoding accuracy was overestimated due to the inherent temporal autocorrelations (TA) of EEG signals [7]–[9].

    However, the coupling between the stimulus-driven neural responses and the EEG temporal autocorrelations makes it difficult to confirm whether this overestimation exists in truth. Some researchers also argue that the effect of TA in EEG data on decoding is negligible and that it becomes a significant problem only under specific experimental designs in which subjects do not have enough resting time [10], [11].

    Due to a lack of problem formulation, previous studies [7]–[9] only proposed that block-design should not be used to avoid the pitfall. However, the impact of TA can be avoided only when the trial of EEG is not further segmented into several samples; otherwise, the overfitting or pitfall would still occur. In contrast, when the correct data splitting strategy is used (e.g. separating training and test data in time), the pitfall can be avoided even when block-design is used.

    In our framework, we proposed the concept of "domain" to represent the EEG patterns resulting from TA and then used phantom EEG to remove stimulus-driven neural responses for verification. The results confirmed that the TA, always existing in the EEG data, added unique domain features to a continuous segment of EEG. The specific finding is that when the segment of EEG data with the same class label is split into multiple samples, the classifier will associate the sample's class label with the domain features, interfering with the learning of class-related features. This leads to an overestimation of decoding performance for test samples from the domains seen during training, and results in poor accuracy for test samples from unseen domains (as in real-world applications).

    Importantly, our work suggests that the key to reducing the impact of EEG TA on BCI decoding is to decouple class-related features from domain features in the actual EEG dataset. Our proposed unified framework serves as a reminder to BCI researchers of the impact of TA on their specific BCI tasks and is intended to guide them in selecting the appropriate experimental design, splitting strategy and model construction.

    3 The rationality for using watermelon as the phantom head

    We must point out that the "phantom EEG" does not actually contain any "EEG" but records only noise: a watermelon is not a brain and does not generate any electrical signals. Therefore, the recorded electrical noise, even when amplified using equipment typically used for EEG, does not constitute EEG data under the definition of EEG. This is why previous researchers called it "phantom EEG". Some researchers may therefore think that it is questionable to use a watermelon to obtain the phantom EEG.

    However, the usage of the phantom head allows researchers to evaluate the performance of neural-recording equipment and proposed algorithms without the effects of neural activity variability, artifacts, and potential ethical issues. Phantom heads used in previous studies include digital models [12]–[14], real human skulls [15]–[17], artificial physical phantoms [18]–[24] and watermelons [25]–[40]. Due to their similar conductivity to human tissue, similar size and shape to the human head, and ease of acquisition, watermelons are widely used as "phantom heads".

    Most works used a watermelon as a phantom head and found that the results obtained from the neural signals of human subjects could not be reproduced with the phantom head, thus proving that the achieved results were indeed caused by neural signals. For example, Mutanen et al. [35] proposed that "the fact that the phantom head stimulation did not evoke similar biphasic artifacts excludes the possibility that residual induced artifacts, with the current TMS-compatible EEG system, could explain these components".

    Our work differs significantly from most previous works. It is the first to find that the phantom EEG exhibits the effect of TA on BCI decoding even though only noise was recorded, indicating the inherent existence of TA in EEG data. The conclusion we hope to draw is that some current works may not truly be using stimulus-driven neural responses when they obtain overestimated decoding performance. Similar logic may be found in a neuroscience review article [41]: the authors proposed that EEG recordings from a phantom head (watermelon) remind us that background noise may appear as positive results without proper statistical precautions.

    Reference

    [1] C. Spampinato, S. Palazzo, I. Kavasidis, D. Giordano, N. Souly, and M. Shah, “Deep Learning Human Mind for Automated Visual Classification,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 4503–4511.

    [2] S. Koelstra et al., “DEAP: A Database for Emotion Analysis Using Physiological Signals,” IEEE Transactions on Affective Computing, vol. 3, no. 1, pp. 18–31, 2012.

    [3] N. Das, T. Francart, and A. Bertrand, “Auditory Attention Detection Dataset KULeuven.” Zenodo, Aug. 27, 2020.

    [4] M. Tangermann et al., “Review of the BCI Competition IV,” Front.

  10. Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    + more versions
    Cite
    Marat Valiev; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788
    Explore at:
    Available download formats: bin, application/gzip, zip, text/x-python
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marat Valiev; Bogdan Vasilescu; James Herbsleb
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** link to data artifacts is already included in the paper. 
    Link to the code will be included in the Camera Ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on the level of detail (see Step 2 for more details):
    - up to 2 TB of disk space (see Step 2 detail levels)
    - at least 16 GB of RAM (64 GB preferable)
    - a few hours to a few months of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       git checkout 0.1.0
     
     `cd` into the extracted folder. 
     All commands below assume it is the current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file (see the sketch after this list):
      * set `DATASET_PATH` to the path of a newly created folder
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt` . 
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/init.py`, comment out everything except GitHub support
     in `PROVIDERS`.
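
    For orientation, a minimal sketch of what the edited `settings.py` might look like, assuming
    `DATASET_PATH` is a plain string path and `SCRAPER_GITHUB_API_TOKENS` is a list of token
    strings (both variable names come from the guide above; all values are placeholders):

      # settings.py -- hypothetical example values; the shipped template may contain more options
      DATASET_PATH = "/home/user/ghd_dataset"  # newly created folder for the dataset files
      SCRAPER_GITHUB_API_TOKENS = [
          "ghp_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",  # at least one GitHub API token
      ]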
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speedup
    the process:
    
    #### Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    #### Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15 to 30 minutes.
    
    - create a folder `
  11. Science Education Research Topic Modeling Dataset

    • zenodo.org
    bin, html +2
    Updated Oct 9, 2024
    Cite
    Tor Ole B. Odden; Tor Ole B. Odden; Alessandro Marin; Alessandro Marin; John L. Rudolph; John L. Rudolph (2024). Science Education Research Topic Modeling Dataset [Dataset]. http://doi.org/10.5281/zenodo.4094974
    Explore at:
    bin, txt, html, text/x-pythonAvailable download formats
    Dataset updated
    Oct 9, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Tor Ole B. Odden; Tor Ole B. Odden; Alessandro Marin; Alessandro Marin; John L. Rudolph; John L. Rudolph
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset contains scraped and processed text from roughly 100 years of articles published in the Wiley journal Science Education (formerly General Science Quarterly). This text has been cleaned and filtered in preparation for analysis using natural language processing techniques, particularly topic modeling with latent Dirichlet allocation (LDA). We also include a Jupyter Notebook illustrating how one can use LDA to analyze this dataset and extract latent topics from it, as well as analyze the rise and fall of those topics over the history of the journal.

    The articles were downloaded and scraped in December of 2019. Only non-duplicate articles with a listed author (according to the CrossRef metadata database) were included, and due to missing data and text recognition issues we excluded all articles published prior to 1922. This resulted in 5577 articles in total being included in the dataset. The text of these articles was then cleaned in the following way:

    • We removed duplicated text from each article: prior to 1969, articles in the journal were published in a magazine format in which the end of one article and the beginning of the next would share the same page, so we developed an automated detection of article beginnings and endings that was able to remove any duplicate text.
    • We removed the reference sections of the articles, as well as headings (in all caps) such as “ABSTRACT”.
    • We reunited any partial words that were separated due to line breaks, text recognition issues, or British vs. American spellings (for example, converting “per cent” to “percent”).
    • We removed all numbers, symbols, special characters, and punctuation, and lowercased all words.
    • We removed all stop words, which are words without any semantic meaning on their own—“the”, “in,” “if”, “and”, “but”, etc.—and all single-letter words.
    • We lemmatized all words, with the added step of including a part-of-speech tagger so our algorithm would only aggregate and lemmatize words from the same part of speech (e.g., nouns vs. verbs).
    • We detected and created bi-grams, sets of words that frequently co-occur and carry additional meaning together. These words were combined with an underscore: for example, “problem_solving” and “high_school”.

    After filtering, each document was then turned into a list of individual words (or tokens) which were then collected and saved (using the python pickle format) into the file scied_words_bigrams_V5.pkl.
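
    As a rough sketch of how this file might be loaded and a topic model fitted (assuming the pickle deserializes to a list of token lists, and using gensim as one common choice of LDA implementation; the included notebook documents the authors' actual workflow):

      import pickle
      from gensim import corpora, models

      # Load the tokenized articles: a list of documents, each a list of word/bigram tokens.
      with open("scied_words_bigrams_V5.pkl", "rb") as f:
          documents = pickle.load(f)

      # Build a dictionary and bag-of-words corpus, then fit an LDA model.
      dictionary = corpora.Dictionary(documents)
      corpus = [dictionary.doc2bow(doc) for doc in documents]
      lda = models.LdaModel(corpus, id2word=dictionary, num_topics=10, passes=5, random_state=0)

      # Print the top words of each topic.
      for topic_id, words in lda.show_topics(num_topics=10, num_words=8):
          print(topic_id, words)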

    In addition to this file, we have also included the following files:

    1. SciEd_paper_names_weights.pkl: A file containing limited metadata (title, author, year published, and DOI) for each of the papers, in the same order as they appear within the main datafile. This file also includes the weights assigned by an LDA model used to analyze the data.
    2. Science Education LDA Notebook.ipynb: A notebook file that replicates our LDA analysis, with a written explanation of all of the steps and suggestions on how to explore the results.
    3. Supporting files for the notebook. These include the requirements, the README, a helper script with functions for plotting that were too long to include in the notebook, and two HTML graphs that are embedded into the notebook.

    This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to share text and data mining output for non-commercial research purposes. Any questions or comments can be directed to Tor Ole Odden, t.o.odden@fys.uio.no.

  12. Redistribution of the map of the water surfaces of the Flemish Region...

    • zenodo.org
    bin
    Updated Nov 22, 2024
    Cite
    An Leyssen; An Leyssen; Kevin Scheers; Kevin Scheers; Jo Packet; Jo Packet; Florian Van Hecke; Florian Van Hecke; Carine Wils; Carine Wils (2024). Redistribution of the map of the water surfaces of the Flemish Region (status 2024) [Dataset]. http://doi.org/10.5281/zenodo.14203168
    Explore at:
    binAvailable download formats
    Dataset updated
    Nov 22, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    An Leyssen; An Leyssen; Kevin Scheers; Kevin Scheers; Jo Packet; Jo Packet; Florian Van Hecke; Florian Van Hecke; Carine Wils; Carine Wils
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Flanders, Flemish Region
    Description

    This is a redistribution of the dataset 'Watervlakken - versie 2024’ (Water surfaces - edition 2024), originally published by the Research Institute for Nature and Forest (INBO) and distributed by 'Informatie Vlaanderen' under a CC-BY compatible license. More specifically, this Zenodo record redistributes the GeoPackage file from the original data source, in order to support reproducible, analytical workflows on Flemish Natura 2000 habitats and regionally important biotopes.
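
    A minimal sketch of reading the redistributed GeoPackage in Python, assuming geopandas and fiona are available (the file name below is a placeholder; the actual file and layer names can be inspected with fiona.listlayers):

      import fiona
      import geopandas as gpd

      path = "watervlakken_2024.gpkg"        # placeholder name for the redistributed GeoPackage
      print(fiona.listlayers(path))          # list the layers shipped in the file

      lakes = gpd.read_file(path)            # reads the first layer; pass layer="..." to select another
      print(len(lakes), lakes.crs)           # number of polygons and coordinate reference system
      print(lakes.geometry.area.describe())  # area statistics in CRS units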

    The digital map of standing water surfaces (edition 2024) is a georeferenced digital file of standing surface waters in Flanders (northern Belgium). The file contains 93 201 polygons with an area between 1.45 m² and 2.47 km² and can be considered as the most complete and accurate representation of lentic water bodies presently available for the Flemish territory. The map is based on topographic map layers, orthophoto images, the Digital Terrain Model of Flanders version II, results of a water prediction model and, to a lesser extent, field observations. It can be used for a wide range of applications in research, policy preparation and policy implementation, management planning and evaluation that consider the distribution and characteristics of stagnant water bodies. The map is also relevant internationally, including updates for the National Wetland Inventories (Ramsar). Furthermore, its unique reference to each object will considerably facilitate related data management.

    For this new edition of Watervlakken (2024), the orthophoto images of 2021, 2022 and 2023 and the digital terrain model of Flanders have been used. This edition also uses the results of an AI prediction model for water developed by VITO. Data from various Regional Landscapes, ad hoc user reports and field observations have been used to digitise additional polygons, make shape corrections or remove filled ponds from the map layer. For a number of water surfaces, new data on the Flemish type according to the European Water Framework Directive (WFD type), water depth and connectivity have been added to the attribute table.

    The data source is produced, owned and administered by the Research Institute for Nature and Forest (INBO, Department of Environment of the Flemish government).

  13. Thunderstorm outflows in the Mediterranean Sea area

    • zenodo.org
    txt, zip
    Updated Apr 30, 2024
    Cite
    Federico Canepa; Federico Canepa; Massimiliano Burlando; Massimiliano Burlando; Maria Pia Repetto; Maria Pia Repetto (2024). Thunderstorm outflows in the Mediterranean Sea area [Dataset]. http://doi.org/10.5281/zenodo.10688746
    Explore at:
    txt, zipAvailable download formats
    Dataset updated
    Apr 30, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Federico Canepa; Federico Canepa; Massimiliano Burlando; Massimiliano Burlando; Maria Pia Repetto; Maria Pia Repetto
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Mediterranean Sea
    Description

    In the context of the European projects “Wind and Ports” (grant No. B87E09000000007) and “Wind, Ports and Sea” (grant No. B82F13000100005), an extensive in-situ wind monitoring network was installed in the main ports of the Northern Mediterranean Sea. An unprecedented number of wind records has been acquired and systematically analyzed. Among these, a considerable number of records presented non-stationary and non-Gaussian characteristics that are completely different from those of the synoptic extra-tropical cyclones widely known in the atmospheric science and wind engineering communities. Cross-checking with meteorological information made it possible to identify which of these events can be defined as thunderstorm winds, i.e., downbursts and gust fronts.

    The scientific literature of the last few decades has demonstrated that downbursts, and especially micro-bursts, are extremely dangerous for the natural and built environment. Furthermore, recent trends in climate change point to drastic future scenarios in terms of intensification and increased frequency of this type of extreme event. However, the limited spatial and temporal extent of thunderstorm outflows still makes them difficult to measure in nature and, consequently, to describe with physically reliable and easily applicable models such as those available for extra-tropical cyclones. For these reasons, the collection and publication of events of this type represents a unique opportunity for the scientific community.

    The dataset presented here was built in the context of the activities of the project THUNDERR “Detection, simulation, modelling and loading of thunderstorm outflows to design wind-safer and cost-efficient structures”, financed by the European Research Council (ERC), Advanced Grant 2016 (grant No. 741273, P.I. Prof. Giovanni Solari, University of Genoa). It collects 29 thunderstorm downbursts that occurred between 2010 and 2015 in the Italian ports of Genoa (GE) (4), Livorno (LI) (14), and La Spezia (SP) (11), and were recorded by means of ultrasonic anemometers (Gill WindObserver II in Genoa and La Spezia, Gill WindMaster Pro in Livorno). All thunderstorm events included in the database were verified by means of meteorological information, such as radar (CIMA Research Foundation is gratefully acknowledged for providing most of the radar images), satellite, and lightning data. In fact, (i) high and localized clouds typical of thunderstorm cumulonimbus, (ii) precipitation, and (iii) lightning represent reliable indicators of the occurrence of a thunderstorm event.

    Some events were recorded by multiple anemometers in the same port area – the total number of signals included in the database is 99. Despite the limited number of points (anemometers), this allows the user to perform cross-correlation analyses in time and space, for example to estimate the size, position, and trajectory of the storm.

    The ASCII tab-delimited file ‘Anemometers_location.txt’ reports the specifications of the anemometers used in this monitoring study: port code (Port code – Genoa-GE, Livorno-LI, La Spezia-SP); anemometer code (Anemometer code); latitude (Lat.) and longitude (Lon.) in decimal degrees, WGS84; height above ground level (h a.g.l.) in meters; instrument type (Instrument type). Bi-axial anemometers were used in the ports of Genoa and La Spezia, recording the two horizontal wind speed components (u, v). Three-axial ultrasonic anemometers were used in the port of Livorno, also providing the vertical wind speed component w (except the bi-axial anemometers LI06 and LI07). All anemometers acquired velocity data at a sampling frequency of 10 Hz and a sensitivity of 0.01 m s⁻¹ (except anemometers LI06 and LI07, with a sensitivity of 0.1 m s⁻¹), and were installed at heights ranging from 13.0 to 75.0 m, as reported in the file ‘Anemometers_location.txt’.

    The ASCII tab-delimited file ‘List_DBevents.txt’ lists all downburst records included in the database, in terms of: event and record number (Event | record no.); port code (Port code); date of event occurrence (Date) in the format yyyy-mm-dd; approximate time of occurrence of the velocity peak (Time [UTC]) in the format HH:MM; anemometer code (Anemometer code).

    The database is presented as a zip file (‘DB-records.zip’). The events are divided based on the port of occurrence (three folders GE, LI, and SP). Within each folder, the downburst events recorded in that specific port are reported as subfolders (name format ‘[port code]_yyyy-mm-dd’) and contain the single-anemometer signals as TAB-delimited text files (name format ‘[port and anemometer code]_yyyy-mm-dd.txt’). Each sub-dataset (file) contains 3(4) columns and 360,000 rows. The first column shows the 10-h time vector (t, ISO format) in UTC, while the remaining 2(3) columns report the 10-h time series of 10-Hz instantaneous horizontal (zonal west-to-east u, meridional south-to-north v) and, where available, vertical (positive upward w) wind speed components, centred around the time of maximum horizontal wind speed (vectorial sum of u and v). Representing the wind speed over a large time interval (10 hours) allows the user to perform a more comprehensive and detailed analysis of the event by also taking into account the wind conditions before and after the onset of the downburst phenomenon. ‘Not-a-Number’ (‘NaN’) values are reported in the wind velocity signals where the instrument did not record valid data. Some wind speed records show noise in discrete intervals of the signal, which is reflected in an increased wind speed standard deviation. A modified Hampel filter was employed to remove measurement outliers. For each wind speed signal, every data sample was considered in ascending order, along with its adjacent ten samples (five on each side). This technique calculated the median and the standard deviation within the sampling window using the median absolute deviation. Elements deviating from the median by more than six standard deviations were identified and replaced with ‘NaN’. Tuning the filter parameters involved finding a balance between overly aggressive and insufficient removal of outliers. Residual outliers were subsequently removed manually through meticulous qualitative inspection. The complexity and subjectivity of this operation give users the opportunity to explore alternative approaches. Consequently, the published dataset includes two versions: an initial version (v1) comprising the original raw data with no filtering applied, and a second, “cleaned” version (v2).
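
    As an illustration, a minimal pandas/NumPy sketch of loading one tri-axial record and applying a Hampel-style filter along the lines described above (11-sample window, MAD-based standard deviation, six-sigma threshold); the file name is a placeholder, and the exact parameters of the published filter may differ:

      import numpy as np
      import pandas as pd

      # Placeholder file name; actual files follow '[port and anemometer code]_yyyy-mm-dd.txt'.
      # A Livorno (LI) record is assumed here, i.e. four columns: t, u, v, w.
      rec = pd.read_csv("LI01_2012-10-01.txt", sep="\t", header=None,
                        names=["t", "u", "v", "w"], parse_dates=["t"])

      def hampel_to_nan(x, half_window=5, n_sigmas=6.0):
          # Centred rolling median and MAD-based standard deviation over 11 samples.
          win = 2 * half_window + 1
          med = x.rolling(win, center=True, min_periods=1).median()
          mad = (x - med).abs().rolling(win, center=True, min_periods=1).median()
          sigma = 1.4826 * mad  # usual MAD-to-sigma scale factor (assumed)
          return x.mask((x - med).abs() > n_sigmas * sigma)  # outliers become NaN

      for col in ["u", "v", "w"]:
          rec[col] = hampel_to_nan(rec[col])

      speed = np.hypot(rec["u"], rec["v"])  # horizontal wind speed (vector sum of u and v)
      print(rec.loc[speed.idxmax(), "t"], speed.max())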

    The presented database can be further used by researchers to validate and calibrate experimental and numerical simulations, as well as analytical models, of downburst winds. It will also be an important resource for the scientific community working in wind engineering, in meteorology and atmospheric sciences, and in the management and reduction of risks and losses related to thunderstorm events (e.g., insurance companies).

  14. MEG Attention Dataset Using Musicians and Non-Musicians - Part 3

    • zenodo.org
    pdf, zip
    Updated Aug 2, 2024
    Cite
    Jasmin Riegel; Alina Schüller; Alina Schüller; Tobias Reichenbach; Tobias Reichenbach; Jasmin Riegel (2024). MEG Attention Dataset Using Musicians and Non-Musicians - Part 3 [Dataset]. http://doi.org/10.5281/zenodo.12794808
    Explore at:
    zip, pdfAvailable download formats
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Jasmin Riegel; Alina Schüller; Alina Schüller; Tobias Reichenbach; Tobias Reichenbach; Jasmin Riegel
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data location

    The data is split across 3 Zenodo records because it is too large for a single one. In total, the data set contains MEG data of 58 participants. An overview of the participants and the amount of musical training they have completed is also available. Each of the 3 Zenodo uploads contains the participant overview file + Set#.zip.

    Part/Set 1 (blue) contains: MEG data of participants 1-19 + audio folder (can be found here)

    Part/Set 2 (pink) contains: MEG data of participants 20-38 (can be found here)

    Part/Set 3 (yellow) contains: MEG data of participants 39-58

    Experimental design

    We used four German audiobooks (all published by Hörbuch Hamburg Verlag and available online).

    1. „Frau Ella“ (narrated by lower pitched (LP) speaker and attended by participants)

    2. „Darum“ (narrated by LP speaker and ignored by participants)

    3. „Den Hund überleben“ (narrated by higher pitched (HP) speaker and attended by participants)

    4. „Looking for Hope“ (narrated by HP speaker and ignored by participants)

    The participants listened to 10 audiobook chapters. There were always 2 audiobooks presented at the same time (one narrated by an HP speaker and one by an LP speaker), and the participants attended one speaker and ignored the other. The structure of the chapters was as follows:

    Chapter 1 of audiobook 1 + random part of audiobook 4

    3 comprehension questions

    Chapter 1 of audiobook 3 + random part of audiobook 2

    3 comprehension questions

    Chapter 2 of audiobook 1 + random part of audiobook 4

    3 comprehension questions

    Chapter 2 of audiobook 3 + random part of audiobook 2

    3 comprehension questions

    Chapter 3 of audiobook 1 + random part of audiobook 4

    3 comprehension questions

    Chapter 3 of audiobook 3 + random part of audiobook 2

    3 comprehension questions

    Chapter 4 of audiobook 1 + random part of audiobook 4

    3 comprehension questions

    Chapter 4 of audiobook 3 + random part of audiobook 2

    3 comprehension questions

    Chapter 5 of audiobook 1 + random part of audiobook 4

    3 comprehension questions

    Chapter 5 of audiobook 3 + random part of audiobook 2

    3 comprehension questions

    MEG Data structure

    MEG data of 58 participants is contained in this data set.

    Each participant has a folder with its participant number as folder name (1,2,3,…).

    Each participant folder contains two subfolders: LP_speaker_attended, with the MEG data recorded while the participant attended the LP speaker (ignoring the HP speaker), and HP_speaker_attended, with the MEG data recorded while the participant attended the HP speaker (ignoring the LP speaker). Note that after each chapter the participants switched attention from the LP to the HP speaker and vice versa, but for evaluation we concatenated the data of the LP-speaker-attended/HP-speaker-ignored mode and of the HP-speaker-attended/LP-speaker-ignored mode.

    The data recorded while attending the HP speaker has shape (248, 959416) (ca. 16 minutes); that for the LP speaker has shape (248, 1247854) (ca. 21 minutes).

    # The MEG data can be loaded with the MNE-Python library:

    import mne

    meg = mne.io.read_raw_fif("…/data_meg.fif")  # path to a participant's data_meg.fif file

    # The data array (channels × samples) can then be accessed:

    meg_data = meg.get_data()

    Example code for performing source reconstruction and TRF evaluation can be found in our Git repository.

    Audio Data structure

    The original audio chapters of the audiobooks are stored in the folder "Audio" in Part 1.

    There are two subfolders. One (attended_speech) contains the ten audiobook chapters which were attended by the participant (audiobook1_#, audiobook3_#). The other subfolder (ignored_speech) contains the ten audiobook chapters which were ignored by the participant (audiobook2_#, audiobook4_#).

    We recommend the librosa library for audio loading and processing.

    Audio data is provided at a sampling frequency of 44.1 kHz.

    Each audiobook is provided in 5 chapters, as they were presented to the participants. The corresponding MEG file described above already contains the concatenated data of all five chapters.

    If you resample the audio data to 1000 Hz and concatenate the chapters, the audio shape (n_times) will equal the corresponding n_times of the MEG data.
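
    For example, a minimal sketch along these lines, assuming the chapters are stored as audio files named audiobook1_1 … audiobook1_5 (the actual file names and extension should be checked in the Audio folder):

      import numpy as np
      import librosa

      chapters = []
      for i in range(1, 6):
          # Placeholder file names/extension; see the attended_speech folder for the actual ones.
          y, sr = librosa.load(f"Audio/attended_speech/audiobook1_{i}.wav", sr=None)
          chapters.append(librosa.resample(y, orig_sr=sr, target_sr=1000))

      audio_1khz = np.concatenate(chapters)
      print(audio_1khz.shape)  # should match meg.get_data().shape[1] for the corresponding condition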

    Processing of MEG data

    The MEG data was analog-filtered with a 1.0-200 Hz filter and preprocessed offline using a notch filter (Firwin, 0.5 Hz bandwidth) to remove power-line interference at 50, 100, 150, and 200 Hz.

    The data was then resampled from 1017.25 Hz to 1000 Hz.

    Technical details

    The MEG system with which the data was recorded was a 248-magnetometer system (4D Neuroimaging, San Diego, CA, USA).

    The audio signal was presented through loudspeakers outside the magnetic chamber and conveyed to the participant via tubes of 2 m length and 2 cm diameter, leading to a 6 ms delay of the acoustic signal. The audio was presented diotically (both the attended and the ignored audio stream were presented to both ears) at a sound pressure level of 67 dB(A).

    The measurement setup was provided by a former study by Schilling et al. (https://doi.org/10.1080/23273798.2020.1803375).

    Papers to cite when using this data

    • Riegel et al., "No Influence of Musical Training on the Cortical Contribution to the Speech-FFR and its Modulation Through Selective Attention", eNeuro, in print (https://doi.org/10.1101/2024.07.25.605057).
    • Schüller, Mücke et al. "Assessing the Impact of Selective Attention on the Cortical Tracking of the Speech Envelope in the Delta and Theta Frequency Bands and How Musical Training Does (Not) Affect it", under review (https://doi.org/10.1101/2024.08.01.606154).
    • Schüller et al., "Attentional Modulation of the Cortical Contribution to the Frequency-Following Response Evoked by Continuous Speech“ (https://doi.org/10.1523/JNEUROSCI.1247-23.2023).