License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset contains Zenodo's published open access records and communities metadata, including entries marked by the Zenodo staff as spam and deleted.
The datasets are gzipped compressed JSON-lines files, where each line is a JSON object representation of a Zenodo record or community.
Records dataset
Filename: zenodo_open_metadata_{ date of export }.jsonl.gz
Each object contains the terms: part_of, thesis, description, doi, meeting, imprint, references, recid, alternate_identifiers, resource_type, journal, related_identifiers, title, subjects, notes, creators, communities, access_right, keywords, contributors, publication_date
which correspond to the fields with the same name available in Zenodo's record JSON Schema at https://zenodo.org/schemas/records/record-v1.0.0.json.
In addition, some terms have been altered:
Communities dataset
Filename: zenodo_community_metadata_{ date of export }.jsonl.gz
Each object contains the terms: id, title, description, curation_policy, page
which correspond to the fields with the same name available in Zenodo's community creation form.
Notes for all datasets
For each object, the term spam contains a boolean value indicating whether the record/community was marked as spam content by Zenodo staff.
Top-level terms that were missing in the metadata may contain a null value.
A smaller uncompressed random sample of 200 JSON lines is also included for each dataset to test and get familiar with the format without having to download the entire dataset.
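For example, a records file (or the included sample) can be streamed line by line without unpacking it first; a minimal Python sketch, with the export date in the file name as a placeholder:

```python
import gzip
import json

# Hypothetical file name following the pattern above; substitute the actual export date.
path = "zenodo_open_metadata_2024-01-01.jsonl.gz"

spam_count = 0
with gzip.open(path, "rt", encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line)   # one Zenodo record per line
        if record.get("spam"):      # boolean spam flag set by Zenodo staff
            spam_count += 1

print(f"records marked as spam: {spam_count}")
```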
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
Data file for the second release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.
The data file includes 5,256,114 data citation records in JSON and CSV formats. The JSON file is the version of record.
For convenience, the data is provided in batches of approximately 1 million records each. The publication date and batch number are included in the file name, e.g., 2024-08-23-data-citation-corpus-01-v2.0.json.
The data citations in the file originate from DataCite Event Data and from a project by the Chan Zuckerberg Initiative (CZI) to identify mentions of datasets in the full text of articles.
Each data citation record consists of:
A pair of identifiers: An identifier for the dataset (a DOI or an accession number) and the DOI of the publication (journal article or preprint) in which the dataset is cited
Metadata for the cited dataset and for the citing publication
The data file includes the following fields:
| Field | Description | Required? |
| id | Internal identifier for the citation | Yes |
| created | Date of item's incorporation into the corpus | Yes |
| updated | Date of item's most recent update in corpus | Yes |
| repository | Repository where cited data is stored | No |
| publisher | Publisher for the article citing the data | No |
| journal | Journal for the article citing the data | No |
| title | Title of cited data | No |
| publication | DOI of article where data is cited | Yes |
| dataset | DOI or accession number of cited data | Yes |
| publishedDate | Date when citing article was published | No |
| source | Source where citation was harvested | Yes |
| subjects | Subject information for cited data | No |
| affiliations | Affiliation information for creator of cited data | No |
| funders | Funding information for cited data | No |
Additional documentation about the citations and metadata in the file is available on the Make Data Count website.
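As an illustration, a batch file can be loaded and summarized as follows. This sketch assumes each batch is a top-level JSON array of citation objects carrying the fields listed above (the file name is the example given earlier):

```python
import json
from collections import Counter

# One of the ~1M-record batch files; assumed here to be a JSON array of citation
# objects with the fields documented in the table above.
with open("2024-08-23-data-citation-corpus-01-v2.0.json", encoding="utf-8") as fh:
    citations = json.load(fh)

# Count citations per harvesting source (required field "source").
by_source = Counter(c["source"] for c in citations)

# Collect dataset/publication identifier pairs (both required fields).
pairs = [(c["dataset"], c["publication"]) for c in citations]

print(by_source.most_common(5))
print(pairs[:3])
```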
The second release of the Data Citation Corpus data file reflects several changes made to add new citations, remove some records deemed out of scope for the corpus, update and enhance citation metadata, and improve the overall usability of the file. These changes are as follows:
Add and update Event Data citations:
Add 179,885 new data citations created in DataCite Event Data from 01 June 2023 through 30 June 2024
Remove citation records deemed out of scope for the corpus:
273,567 records from DataCite Event Data with non-citation relationship types
28,334 citations to items in non-data repositories (clinical trials registries, stem cells, samples, and other non-data materials)
44,117 invalid citations where the subj_id value was the same as the obj_id value, or where subj_id and obj_id were inverted, indicating a citation from a dataset to a publication
473,792 citations to invalid accession numbers from CZI data present in v1.1 as a result of false positives in the algorithm used to identify mentions
4,110,019 duplicate records from CZI data present in v1.1 where metadata is the same for obj_id, subj_id, repository_id, publisher_id, journal_id, accession_number, and source_id (the record with the most recent updated date was retained in all of these cases)
Metadata enhancements:
Apply Field of Science subject terms to citation records originating from CZI, based on disciplinary area of data repository
Initial cleanup of affiliation and funder organization names to remove personal email addresses and social media handles (additional cleanup and standardization in progress and will be included in future releases)
Data structure updates to improve usability and eliminate redundancies:
Rename subj_id and obj_id fields to “dataset” and “publication” for clarity
Remove accessionNumber and doi elements to eliminate redundancy with subj_id
Remove relationTypeId fields as these are specific to Event Data only
Full details of the above changes, including the scripts used to perform them, are available on GitHub.
While v2 addresses a number of cleanup and enhancement tasks, additional data issues may remain, and additional enhancements are being explored. These will be addressed in the course of subsequent data file releases.
Feedback on the data file can be submitted via GitHub. For general questions, email info@makedatacount.org.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Description:
This repository contains all source codes, necessary input and output data, and raw figures and tables for reproducing most figures and results published in the following study:
Hefei Zhang#, Xuhang Li#, Dongyuan Song, Onur Yukselen, Shivani Nanda, Alper Kucukural, Jingyi Jessica Li, Manuel Garber, Albertha J.M. Walhout. Worm Perturb-Seq: massively parallel whole-animal RNAi and RNA-seq. (2025) Nature Communications, in press (#: equal contribution, *: corresponding author)
These include results related to method benchmarking and NHR data processing. Source data for figures that are not reproducible here have been provided with the publication.
Files:
This repository contains a few directories related to this publication. To deposit into Zenodo, we have individually zipped each subfolder of the root directory.
There are three directories included:
MetabolicLibrary
method_simulation
NHRLibrary
Note: the parameter optimization output is deposited in a separate Zenodo repository (10.5281/zenodo.15236858) for better organization and ease of use. If you would like to reproduce results related to the "MetabolicLibrary" folder, please download and integrate the omitted subfolder "MetabolicLibrary/2_DE/output/" from this separate repository.
Please be advised that this repository contains raw code and data that are not directly related to a figure in our paper. However, they may be useful for generating input used in the analysis of a figure, or for reproducing tables in our manuscript. The repository may also contain unpublished analyses and figures, which we kept for the record rather than deleting.
Usage:
Please refer to the table below to locate a specific file for reproducing a figure of interest (also available in METHOD_FIGURE_LOOKUP.xlsx under the root directory).
| Figure | File | Lines | Notes |
| Fig. 2c | MetabolicLibrary/1_QC_dataCleaning/2_badSampleExclusion_manual.R | 65-235 | output figure is selected from figures/met10_lib6_badSamplePCA.pdf |
| Fig. 2d | NHRLibrary/example_bams/* | - | load the bam files in IGV to make the figure |
| Fig. 3a | MetabolicLibrary/2_DE/2_5_vectorlike_re_analysis_with_curated_44_NTP_conditions.R | 348-463 | |
| Fig. 3b,c | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 106-376 | |
| Fig. 3d | MetabolicLibrary/2_DE/SUPP_extra_figures_for_rewiring.R | 10-139 | |
| Fig. 3e | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 379-522 | |
| Fig. 3f,g | MetabolicLibrary/2_DE/SUPP_extra_figures_for_rewiring.R | 1-8 | |
| Fig. 3h | method_simulation/Supp_systematic_mean_variation_example.R | 1-138 | |
| Fig. 3i | method_simulation/3_benchmark_DE_result_w_rep.R and 1_benchmark_DE_result_std_NB_w_rep.R | 1-518 | the example figure was from figures/GLM_NB_deltaMiu_k08_10rep/seed_12345_empircial_null_fit_simulated_data_NB_GLM.pdf and figures/GLM_NB_10rep/seed_1_empircial_null_fit_simulated_data_NB_GLM.pdf; |
| Fig. 3j | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 2104-2106 | load dependencies starting from line 1837 |
| Fig. 3k | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 2053-2078 | load dependencies starting from line 1837; the GSEA was performed using SUPP_supplementary_figures_for_method_noiseness_GSEA.R |
| Fig. 4a,b | method_simulation/3_benchmark_DE_result_w_rep.R | 1-523 | |
| Fig. 4c | method_simulation/3_benchmark_WPS_parameters.R | 1-237 | |
| Fig. 4d | MetabolicLibrary/2_DE/2_5_vectorlike_re_analysis_with_curated_44_NTP_conditions.R | 1-346 | output figure is selected from figures/0_DE_QA/vectorlike_analysis/2d_cutoff_titration_71NTP_rawDE_log2FoldChange_log2FoldChange_raw.pdf and 2d_cutoff_titration_71NTP_p0.005_log2FoldChange_raw.pdf. The "p0.005" in the second file name indicates the p_outlier cutoff used in the final parameter set for EmpirDE. |
| Fig. 4e | MetabolicLibrary/2_DE/SUPP_plot_N_DE_repeated_RNAi.R | entire file | |
| Fig. 4f | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 1020-1407 | |
| Fig. 4g,h | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 529-851 | |
| Fig. 5d | NHRLibrary/FinalAnalysis/2_DE_new/3_DE_network_analysis.R | 51-69; 94-112 | load dependencies starting from line 1 |
| Fig. 5e | NHRLibrary/FinalAnalysis/2_DE_new/5_GSA_bubble_plot.R | 1-306 | |
| Fig. 5f | NHRLibrary/FinalAnalysis/2_DE_new/5_GSA.R | 1-1492 | |
| Fig. 6a | NHRLibrary/FinalAnalysis/2_DE_new/4_DE_similarity_analysis.R | 1-175 | |
| Fig. 6b | NHRLibrary/FinalAnalysis/6_case_study.R | 506-534 | load dependencies starting from line 1 |
| Fig. 6c | NHRLibrary/FinalAnalysis/6_case_study.R | 668-888 | load dependencies starting from line 1 |
| Supplementary Fig. 1e | NHRLibrary/FinalAnalysis/5_revision/REVISION_gene_detection_sensitivity_benchmark.R | 1-143 | |
| Supplementary Fig. 1f | MetabolicLibrary/1_QC_dataCleaning/2_badSampleExclusion_manual.R | 65-235 | output figure is selected from figures/met10_lib6_badSampleCorr.pdf |
| Supplementary Fig. 1g | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 2191-2342 | |
| Supplementary Fig. 2a | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 1409-1822 | |
| Supplementary Fig. 2b | method_simulation/Supp_systematic_mean_variation_example.R | 1-138 | |
| Supplementary Fig. 2c | method_simulation/Supp_systematic_mean_variation_example.R; 2_fit_logFC_distribution.R | 141-231; 1-201 | the middle panel was generated from Supp_systematic_mean_variation_example.R (lines 141-231) and right panel was from 2_fit_logFC_distribution.R (lines 1-201) |
| Supplementary Fig. 2d | method_simulation/1_benchmark_DE_result_std_NB_w_rep.R | 1-518 | the example figure was from figures/GLM_NB_10rep/seed_1_empircial_null_fit_simulated_data_NB_GLM.pdf; |
| Supplementary Fig. 2e | method_simulation/3_benchmark_DE_result_w_rep.R | 1-518 | the example figure was from figures/GLM_NB_deltaMiu_k08_10rep/seed_12345_empircial_null_fit_simulated_data_NB_GLM.pdf |
| Supplementary Fig. 2f | method_simulation/3_benchmark_DE_result_w_rep.R | 528-573 | may need to run the code from line 1 to load other variables needed |
| Supplementary Fig. 3a,b | method_simulation/1_benchmark_DE_result_std_NB_w_rep.R | 1-523 | |
| Supplementary Fig. 3c | method_simulation/3_benchmark_WPS_parameters.R | 1-237 | |
| Supplementary Fig. 3d | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 3190-3300 | |
| Supplementary Fig. 3e | 2_3_power_error_tradeoff_optimization.R | entire file | the figure was in figures/0_DE_QA/cleaning_strength_titration/benchmark_vectorlikes_titrate_cleaning_cutoff_DE_log2FoldChange_FDR0.2_FC1.pdf (produced in line 398); this script produced the titration plots for a series of thresholds, where we picked FDR0.2_FC1 for presentation in the paper |
| Supplementary Fig. 3f | 2_3_power_error_tradeoff_optimization.R | entire file | the figure was in figures/0_DE_QA/cleaning_strength_titration/benchmark_independent_repeats_titrate_cleaning_cutoff_FP_log2FoldChange_raw.pdf (produced in line 195). The top line plot was from figures/0_DE_QA/cleaning_strength_titration/benchmark_independent_repeats_titrate_cleaning_cutoff_FP_log2FoldChange_raw_summary_stat.pdf (produced in line 218). |
| Supplementary Fig. 4 | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 853-898 | please run from line 529 to load dependencies |
| Supplementary Fig. 5a,b | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 2590-3185 | |
| Supplementary Fig. |
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This replication package accompanies the dataset and the exploratory empirical analysis reported in the paper "A dataset of GitHub Actions workflow histories", published at the IEEE MSR 2024 conference. (The Jupyter notebook can be found in a previous version of this dataset.)
Important notice: Zenodo appears to compress gzipped files a second time without notice, so they are "double compressed". When downloaded, the files are therefore named x.gz.gz instead of x.gz. Note that the provided MD5 refers to the original file.
2024-10-25 update: updated the repositories list and observation period. The filters relying on dates were also updated.
2024-07-09 update: fixed an occasionally invalid valid_yaml flag.
The dataset was created as follows:
First, we used GitHub SEART (on October 7th, 2024) to get a list of every non-fork repository created before January 1st, 2024, having at least 300 commits and at least 100 stars, and in which at least one commit was made after January 1st, 2024. (The goal of these filters is to exclude experimental and personal repositories.)
We checked whether a .github/workflows folder existed. We filtered out repositories that did not contain this folder and pulled the others (between the 9th and 10th of October 2024).
We applied the tool gigawork (version 1.4.2) to extract every file from this folder. The exact command used is python batch.py -d /ourDataFolder/repositories -e /ourDataFolder/errors -o /ourDataFolder/output -r /ourDataFolder/repositories_everything.csv.gz -- -w /ourDataFolder/workflows_auxiliaries. (The script batch.py can be found on GitHub.)
We concatenated every file in /ourDataFolder/output into a CSV (using cat headers.csv output/*.csv > workflows_auxiliaries.csv in /ourDataFolder) and compressed it.
We added the column uid via a script available on GitHub.
Finally, we archived the folder /ourDataFolder/workflows with pigz (tar -c --use-compress-program=pigz -f workflows_auxiliaries.tar.gz /ourDataFolder/workflows).
Using the extracted data, the following files were created:
workflows.tar.gz contains the dataset of GitHub Actions workflow file histories.
workflows_auxiliaries.tar.gz is a similar file that also contains auxiliary files.
workflows.csv.gz contains the metadata for the extracted workflow files.
workflows_auxiliaries.csv.gz is a similar file that also contains metadata for auxiliary files.
repositories.csv.gz contains metadata about the GitHub repositories containing the workflow files. These metadata were extracted using the SEART Search tool.
The metadata is separated into different columns:
repository: The repository (author and repository name) from which the workflow was extracted. The separator "/" distinguishes the author from the repository name
commit_hash: The commit hash returned by git
author_name: The name of the author that changed this file
author_email: The email of the author that changed this file
committer_name: The name of the committer
committer_email: The email of the committer
committed_date: The committed date of the commit
authored_date: The authored date of the commit
file_path: The path to this file in the repository
previous_file_path: The path to this file before it has been touched
file_hash: The name of the related workflow file in the dataset
previous_file_hash: The name of the related workflow file in the dataset, before it has been touched
git_change_type: A single letter (A, D, M or R) representing the type of change made to the workflow (Added, Deleted, Modified or Renamed). This letter is given by gitpython and provided as is.
valid_yaml: A boolean indicating if the file is a valid YAML file.
probably_workflow: A boolean indicating whether the file contains the YAML keys on and jobs. (Note that it can still be an invalid YAML file.)
valid_workflow: A boolean indicating whether the file respects the syntax of GitHub Actions workflows. A freely available JSON Schema (used by gigawork) was used for this purpose.
uid: Unique identifier for a given file surviving modifications and renames. It is generated when the file is added and stays the same until the file is deleted. Renames do not change the identifier.
Both workflows.csv.gz and workflows_auxiliaries.csv.gz follow this format.
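A minimal sketch for loading the metadata with pandas, taking the double-compression notice above into account (the decompression loop is a defensive assumption, not part of the official tooling; the column names are those documented above):

```python
import gzip
import io

import pandas as pd

def read_possibly_double_gzipped_csv(path: str) -> pd.DataFrame:
    # workflows.csv.gz may arrive double-compressed from Zenodo (x.gz.gz); peel gzip
    # layers until the payload no longer starts with the gzip magic bytes (0x1f 0x8b).
    with open(path, "rb") as fh:
        payload = fh.read()
    while payload[:2] == b"\x1f\x8b":
        payload = gzip.decompress(payload)
    return pd.read_csv(io.BytesIO(payload))

meta = read_possibly_double_gzipped_csv("workflows.csv.gz")

# Example use of the documented columns: keep valid workflow files that were added or modified.
is_valid = meta["valid_workflow"].astype(str).str.lower().eq("true")
changed = meta["git_change_type"].isin(["A", "M"])
print(meta[is_valid & changed][["repository", "file_path", "committed_date"]].head())
```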
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
rxivist.org allowed readers to sort and filter the tens of thousands of preprints posted to bioRxiv and medRxiv. Rxivist used a custom web crawler to index all papers posted to those two websites; this is a snapshot of the Rxivist production database. The version number indicates the date on which the snapshot was taken. See the included "README.md" file for instructions on how to use the "rxivist.backup" file to import the data into a PostgreSQL database server.
Please note this is a different repository than the one used for the Rxivist manuscript—that is in a separate Zenodo repository. You're welcome (and encouraged!) to use this data in your research, but please cite our paper, now published in eLife.
Previous versions are also available pre-loaded into Docker images, available at blekhmanlab/rxivist_data.
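Once "rxivist.backup" has been restored into PostgreSQL (see the included README), the snapshot can be queried like any other database. A minimal sketch with psycopg2; the connection parameters are placeholders, and the "articles" table with its "repo" column is taken from the version notes below:

```python
import psycopg2

# Connection parameters are placeholders; adjust them to wherever you restored rxivist.backup.
conn = psycopg2.connect(
    host="localhost", dbname="rxivist", user="postgres", password="postgres"
)

with conn, conn.cursor() as cur:
    # Count indexed preprints per source site using the "repo" column described
    # in the 2020-12-07 version notes; the query is illustrative only.
    cur.execute("SELECT repo, COUNT(*) FROM articles GROUP BY repo;")
    for repo, n in cur.fetchall():
        print(repo, n)

conn.close()
```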
Version notes:
2023-03-01
The final Rxivist data upload, more than four years after the first and encompassing 223,541 preprints posted to bioRxiv and medRxiv through the end of February 2023.
2020-12-07
In addition to bioRxiv preprints, the database now includes all medRxiv preprints as well.
The website where a preprint was posted is now recorded in a new field in the "articles" table, called "repo".
We've significantly refactored the web crawler to take advantage of developments with the bioRxiv API.
The main difference is that preprints flagged as "published" by bioRxiv are no longer recorded on the same schedule that download metrics are updated: The Rxivist database should now record published DOI entries the same day bioRxiv detects them.
Twitter metrics have returned, for the most part. Improvements with the Crossref Event Data API mean we can once again tally daily Twitter counts for all bioRxiv DOIs.
The "crossref_daily" table remains where these are recorded, and daily numbers are now up to date.
Historical daily counts have also been re-crawled to fill in the empty space that started in October 2019.
There are still several gaps that are more than a week long due to missing data from Crossref.
We have recorded available Crossref Twitter data for all papers with DOI numbers starting with "10.1101," which includes all medRxiv preprints. However, there appears to be almost no Twitter data available for medRxiv preprints.
The download metrics for article id 72514 (DOI 10.1101/2020.01.30.927871) were found to be out of date for February 2020 and are now correct. This is notable because article 72514 is the most downloaded preprint of all time; we're still looking into why this wasn't updated after the month ended.
2020-11-18
Publication checks should be back on schedule.
2020-10-26
This snapshot fixes most of the data issues found in the previous version. Indexed papers are now up to date, and download metrics are back on schedule. The check for publication status remains behind schedule, however, and the database may not include published DOIs for papers that have been flagged on bioRxiv as "published" over the last two months. Another snapshot will be posted in the next few weeks with updated publication information.
2020-09-15
A crawler error caused this snapshot to exclude all papers posted after about August 29, with some papers having download metrics that were more out of date than usual. The "last_crawled" field is accurate.
2020-09-08
This snapshot is misconfigured and will not work without modification; it has been replaced with version 2020-09-15.
2019-12-27
Several dozen papers did not have dates associated with them; that has been fixed.
Some authors have had two entries in the "authors" table for portions of 2019, one profile that was linked to their ORCID and one that was not, occasionally with almost identical "name" strings. This happened after bioRxiv began changing author names to reflect the names in the PDFs, rather than the ones manually entered into their system. These database records are mostly consolidated now, but some may remain.
2019-11-29
The Crossref Event Data API remains down; Twitter data is unavailable for dates after early October.
2019-10-31
The Crossref Event Data API is still experiencing problems; the Twitter data for October is incomplete in this snapshot.
The README file has been modified to reflect changes in the process for creating your own DB snapshots if using the newly released PostgreSQL 12.
2019-10-01
The Crossref API is back online, and the "crossref_daily" table should now include up-to-date tweet information for July through September.
About 40,000 authors were removed from the author table because the name had been removed from all preprints they had previously been associated with, likely because their name changed slightly on the bioRxiv website ("John Smith" to "J Smith" or "John M Smith"). The "author_emails" table was also modified to remove entries referring to the deleted authors. The web crawler is being updated to clean these orphaned entries more frequently.
2019-08-30
The Crossref Event Data API, which provides the data used to populate the table of tweet counts, has not been fully functional since early July. While we are optimistic that accurate tweet counts will be available at some point, the sparse values currently in the "crossref_daily" table for July and August should not be considered reliable.
2019-07-01
A new "institution" field has been added to the "article_authors" table that stores each author's institutional affiliation as listed on that paper. The "authors" table still has each author's most recently observed institution.
We began collecting this data in the middle of May, but it has not been applied to older papers yet.
2019-05-11
The README was updated to correct a link to the Docker repository used for the pre-built images.
2019-03-21
The license for this dataset has been changed to CC-BY, which allows use for any purpose and requires only attribution.
A new table, "publication_dates," has been added and will be continually updated. This table will include an entry for each preprint that has been published externally for which we can determine a date of publication, based on data from Crossref. (This table was previously included in the "paper" schema but was not updated after early December 2018.)
Foreign key constraints have been added to almost every table in the database. This should not impact any read behavior, but anyone writing to these tables will encounter constraints on existing fields that refer to other tables. Most frequently, this means the "article" field in a table will need to refer to an ID that actually exists in the "articles" table.
The "author_translations" table has been removed. This was used to redirect incoming requests for outdated author profile pages and was likely not of any functional use to others.
The "README.md" file has been renamed "1README.md" because Zenodo only displays a preview for the file that appears first in the list alphabetically.
The "article_ranks" and "article_ranks_working" tables have been removed as well; they were unused.
2019-02-13.1
After consultation with bioRxiv, the "fulltext" table will not be included in further snapshots until (and if) concerns about licensing and copyright can be resolved.
The "docker-compose.yml" file was added, with corresponding instructions in the README to streamline deployment of a local copy of this database.
2019-02-13
The redundant "paper" schema has been removed.
BioRxiv has begun making the full text of preprints available online. Beginning with this version, a new table ("fulltext") is available that contains the text of preprints that have been processed already. The format in which this information is stored may change in the future; any digression will be noted here.
This is the first version that has a corresponding Docker image.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Reddit Comments Dataset for Text Style Transfer Tasks
A dataset of Reddit comments prepared for Text Style Transfer Tasks.
The dataset contains Reddit comments translated into a formal language using text-davinci-003. To make text-davinci-003 translate the comments into a more formal version, the following prompt was used:
"Here is some text: {original_comment} Here is a rewrite of the text, which is more neutral: {"
This prompting technique was taken from A Recipe For Arbitrary Text Style Transfer with Large Language Models.
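For reference, the template above can be filled in per comment before being sent to the completion model; this sketch shows only the string construction (the API call and stop-sequence handling are not reproduced here):

```python
def build_formal_prompt(original_comment: str) -> str:
    # Prompt template quoted above; the model's completion, up to the closing brace,
    # is taken as the more neutral rewrite.
    return (
        f"Here is some text: {original_comment} "
        "Here is a rewrite of the text, which is more neutral: {"
    )

print(build_formal_prompt("ngl this take is awful"))
```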
The dataset contains comments from the following Subreddits: antiwork, atheism, Conservative, conspiracy, dankmemes, gaybros, leagueoflegends, lgbt, libertarian, linguistics, MensRights, news, offbeat, PoliticalCompassMemes, politics, teenagers, TrueReddit, TwoXChromosomes, wallstreetbets, worldnews.
The quality of formal translations was assessed with BERTScore and chrF++:
The average perplexity of the generated formal texts was calculated using GPT-2 and is 123.77
The dataset consists of 3 components.
reddit_commments.csv
This file contains a collection of randomly selected comments from 20 Subreddits. For each comment, the following information was collected:
- subreddit (name of the subreddit in which the comment was posted)
- id (ID of the comment)
- submission_id (ID of the submission to which the comment was posted)
- body (the comment itself)
- created_utc (timestamp in seconds)
- parent_id (The ID of the comment or submission to which the comment is a reply)
- permalink (the URL to the original comment)
- token_size (how many tokens the comment is split into by the standard GPT-2 tokenizer)
- perplexity (the perplexity GPT-2 assigns to the comment)
The comments were filtered. This file contains only comments that:
- have been split by GPT-2 Tokenizer into more than 10 tokens but less than 512 tokens.
- are not [removed] or [deleted]
- do not contain URLs
This file was used as a source for the other two file types.
Labeled Files (training_labeled.csv and eval_labeled.csv)
These files contain the formal translations of the Reddit comments.
The 150 comments with the highest GPT-2 perplexity from each Subreddit were translated into a formal version. This filter was used to translate as many comments as possible that show large stylistic salience.
They are structured as follows:
- Subreddit (name of the subreddit where the comment was posted).
- Original Comment
- Formal Comment
Labeled Files with Style Examples (training_labeled_with_style_samples.json and eval_labeled_with_style_samples.json)
These files contain an original Reddit comment, three sample comments from the same subreddit, and the formal translation of the original Reddit comment.
These files can be used to train models to perform style transfers based on given examples.
The task is to transform the formal translation of the Reddit comment, using the three given examples, into the style of the examples.
An entry in this file is structured as follows:
"data":[
{
"input_sentence":"The original Reddit comment",
"style_samples":[
"sample1",
"sample2",
"sample3"
],
"results_sentence":"The formal translated input_sentence",
"subreddit":"The subreddit from which the comments originated"
},
"..."
]
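A minimal sketch for loading one of these files, assuming the top-level object carries a "data" list exactly as in the sample above:

```python
import json

# File name taken from the list above; eval_labeled_with_style_samples.json has the same structure.
with open("training_labeled_with_style_samples.json", encoding="utf-8") as fh:
    dataset = json.load(fh)

for entry in dataset["data"][:3]:
    original = entry["input_sentence"]   # the original Reddit comment
    samples = entry["style_samples"]     # three comments from the same subreddit
    formal = entry["results_sentence"]   # the formal translation of the original
    print(entry["subreddit"], len(samples), formal[:60])
```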
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Given the critical need for the unification and coordinated use of floristic checklists within the TURBOVEG software environment (Hennekens & Schaminée 2001), we propose a new species list, Ukraine_SL, for Ukrainian flora.
The taxonomic basis for Ukraine_SL (for vascular plants) is the UkrTrait taxonomy (Vynokurov et al. 2024), which is based on the Checklist of vascular plants of Ukraine (Mosyakin & Fedoronchuk 1999) and supplemented with taxa newly recorded or described in Ukraine in the years following its publication. Additionally, corrections have been made to the spelling of some taxon names (see details in Vynokurov 2024). Bryobionts follow the Second checklist of bryobionts in Ukraine (Boiko 2014).
For the vast majority of vascular plants, corresponding names from the Euro+Med database are provided, enabling efficient conversion of phytosociological relevés between different taxonomic systems and facilitating integration with the European Vegetation Archive (EVA) (Chytrý et al. 2016).
Moreover, most vascular plants are linked to the Ukrainian Plant Trait Database (UkrTrait v. 1.0) (Vynokurov et al. 2024), allowing rapid extraction of available traits for vegetation studies (e.g. plant height, life forms, flowering period, etc.).
Ukraine_SL will be regularly updated and published on the Zenodo platform. In addition to the species list for TURBOVEG itself (Ukraine_SL.zip), an Excel file with a taxonomic crosswalk (ukraine_sl_taxonomy.xlsx) is also provided. It includes matches between the UkrTrait taxonomy, the original taxon concepts from Mosyakin & Fedoronchuk (1999), and names from the Euro+Med database (europlusmed.org).
An expert system file (expert_ukraine_sl_euromed.txt) is also available for download, enabling translation of vegetation plots to the Euro+Med floristic list within the JUICE software (Tichý 2002).
To install the species list in TURBOVEG (Ukraine_SL.zip), download and unzip the archive into the Turbowin/Species/ directory of your TURBOVEG installation. After unzipping, a folder named Ukraine_SL should appear, containing the file SPECIES.DBF. The list will then be available for use in TURBOVEG.
When working with this list, it is critically important to use only the species already included and not to add new taxa manually, as this would prevent synchronization with future updates and may cause errors during database merging.
If taxa not present in the list are needed, users should contact the authors. The list will then be updated, and a new version made available for download. The full update history, including a list of changes, will be accessible on the Zenodo website. Any newly added taxa will be assigned unique, non-overlapping IDs.
To update the list in TURBOVEG, download the latest version from Zenodo and replace the old version in the Turbowin/Species/ directory by deleting it and unzipping the new archive (Ukraine_SL.zip).
To use the expert system file (expert_ukraine_sl_euromed.txt) in JUICE:
Go to Analysis → Expert System Classificator.
Upload the .txt file.
In the window that appears, click "Modify Species Names", followed by "Merge Same Spec. Names".
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
[Content warning: Files may contain instances of highly inflammatory and offensive content.]
This dataset was generated as an extension of our CSCW 2018 paper:
Eshwar Chandrasekharan, Mattia Samory, Shagun Jhaver, Hunter Charvat, Amy Bruckman, Cliff Lampe, Jacob Eisenstein, and Eric Gilbert. 2018. The Internet’s Hidden Rules: An Empirical Study of Reddit Norm Violations at Micro, Meso, and Macro Scales. Proceedings of the ACM on Human-Computer Interaction 2, CSCW (2018), 32.
Description:
Working with over 2M removed comments collected from 100 different communities on Reddit (subreddit names listed in data/study-subreddits.csv), we identified 8 macro norms, i.e., norms that are widely enforced on most parts of Reddit. We extracted these macro norms by employing a hybrid approach—classification, topic modeling, and open-coding—on comments identified to be norm violations within at least 85 out of the 100 study subreddits. Finally, we labelled over 40K Reddit comments removed by moderators according to the specific type of macro norm being violated, and make this dataset publicly available (also available on Github).
For each of the labeled topics, we identified the top 5000 removed comments that were best fit by the LDA topic model. In this way, we identified over 5000 removed comments that are examples of each type of macro norm violation described in the paper. The removed comments were sorted by their topic fit, stored into respective files based on the type of norm violation they represent, and are made available on this repo.
Here we make the following datasets publicly available:
* 1 file containing the log of over 2M removed comments obtained from the top 100 subreddits between May 2016 and March 2017, after filtering out the following comments: 1) comments by u/AutoModerator, 2) replies to removed comments (i.e., children of the poisoned tree; refer to the paper for more information), and 3) non-readable comments (not utf-8 encoded).
* 8 files, each containing 5000+ removed comments obtained from Reddit, stored in data/macro-norm-violations/ and split into different files based on the macro norm they violated. Each new line in these files represents a comment that was posted on Reddit between May 2016 and March 2017 and subsequently removed by subreddit moderators for violating community norms. All comments were preprocessed using the script in code/preprocessing-reddit-comments.py, in order to: 1. remove new lines, 2. convert text to lowercase, and 3. strip numbers and punctuation from comments. A minimal re-implementation of these preprocessing steps is sketched below.
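A rough re-implementation of those preprocessing steps in Python (the canonical version is the script code/preprocessing-reddit-comments.py shipped with the dataset; this sketch only illustrates the three operations):

```python
import string

def preprocess_comment(text: str) -> str:
    # 1. remove new lines
    text = text.replace("\r", " ").replace("\n", " ")
    # 2. convert text to lowercase
    text = text.lower()
    # 3. strip numbers and punctuation
    return text.translate(str.maketrans("", "", string.digits + string.punctuation))

print(preprocess_comment("This is RUDE!!! 100% ban-worthy.\nSecond line."))
```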
Description of 1 file containing over 2M removed comments from 100 subreddits.
Descriptions of each file containing 5059 comments (that were removed from Reddit, and preprocessed) violating macro norms present in data/macro-norm-violations/:
More details about the dataset can be found on arXiv: https://arxiv.org/abs/1904.03596
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
New version on https://zenodo.org/records/13341214.
When you use this dataset, please cite the paper below. More information about this dataset can also be found in that paper.
Xu, X., Wang, B., Xiao, B., Niu, Y., Wang, Y., Wu, X., & Chen, J. (2024). Beware of Overestimated Decoding Performance Arising from Temporal Autocorrelations in Electroencephalogram Signals. arXiv preprint arXiv:2405.17024.
The present work aims to demonstrate that temporal autocorrelations (TA) significantly impact various BCI tasks even in conditions without neural activity. We used watermelons as phantom heads and found that the pitfall of overestimated decoding performance arises if continuous EEG data with the same class label are split into training and test sets. More details can be found in Motivation.
As watermelons cannot perform any experimental tasks, we can reorganize the recordings into the format of various actual EEG datasets without the need to collect EEG data as previous work did (examples in Domain Studied).
Manufacturers: NeuroScan SynAmps2 system (Compumedics Limited, Victoria, Australia)
Configuration: 64-channel Ag/AgCl electrode cap with a 10/20 layout
Watermelons. Ten watermelons served as phantom heads.
Overestimated Decoding Performance in EEG decoding.
The following BCI datasets from various BCI tasks have been reorganized using the Phantom EEG Dataset. The pitfall was found in four of the five tasks.
- CVPR dataset [1] for image decoding task.
- DEAP dataset [2] for emotion recognition task.
- KUL dataset [3] for auditory spatial attention decoding task.
- BCIIV2a dataset [4] for motor imagery task (the pitfall was absent due to the use of a rapid-design paradigm during EEG recording).
- SIENA dataset [5] for epilepsy detection task.
Resting state, but it can be reorganized into any BCI task.
The Phantom EEG Dataset
Creative Commons Attribution 4.0 International
The code to read the data files (.cnt) is provided in "Other". We could not add the file in this version because Zenodo demands that "you must create a new version to add, modify or delete files". We will add the file in version v2, after organizing the datasets to comply with the FAIR principles.
The data will be published in the following formats in version v2:
- CNT: the raw data.
- BIDS: an extension to the brain imaging data structure for electroencephalography. BIDS primarily addresses the heterogeneity of data organization by following the FAIR principles [6].
An additional electrode was placed on the lower part of the watermelon as the physiological reference, and the forehead served as the ground site. The inter-electrode impedances were maintained under 20 kOhm. Data were recorded at a sampling rate of 1000 Hz. EEG recordings for each watermelon lasted for more than 1 hour to ensure sufficient data for the decoding task.
Each Subject (S*.cnt) contains the following information:
EEG.data: EEG data (samples X channels)
EEG.srate: Sampling frequency of the saved data
EEG.chanlocs: channel numbers (1 to 68; 'EKG', 'EMG', 'VEO', and 'HEO' were not recorded)
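One way to inspect a recording is MNE-Python's Neuroscan CNT reader; this is an alternative to the reader provided under "Other", and the file name below is a placeholder:

```python
import mne

# Load one phantom recording (placeholder file name); MNE returns data as
# (channels, samples), i.e. the transpose of the EEG.data layout described above.
raw = mne.io.read_raw_cnt("S1.cnt", preload=True)

print(raw.info["sfreq"])   # expected sampling rate: 1000 Hz
print(len(raw.ch_names))   # recorded channels ('EKG', 'EMG', 'VEO', 'HEO' were not recorded)
data = raw.get_data()      # numpy array of shape (n_channels, n_samples)
print(data.shape)
```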
Citation will be updated after the review period is completed.
We will provide more information about this dataset (e.g. the units of the captured data) once our work is accepted. This is because our work is currently under review, and we are not allowed to disclose more information according to the relevant requirements.
All metadata will be provided as a backup on Github and will be available after the review period is completed.
Researchers have reported high decoding accuracy (>95%) using non-invasive Electroencephalogram (EEG) signals for brain-computer interface (BCI) decoding tasks like image decoding, emotion recognition, auditory spatial attention detection, epilepsy detection, etc. Since these EEG data were usually collected with well-designed paradigms in labs, the reliability and robustness of the corresponding decoding methods were doubted by some researchers, and they proposed that such decoding accuracy was overestimated due to the inherent temporal autocorrelations (TA) of EEG signals [7]–[9].
However, the coupling between the stimulus-driven neural responses and the EEG temporal autocorrelations makes it difficult to confirm whether this overestimation exists in truth. Some researchers also argue that the effect of TA in EEG data on decoding is negligible and that it becomes a significant problem only under specific experimental designs in which subjects do not have enough resting time [10], [11].
Due to a lack of problem formulation, previous studies [7]–[9] only proposed that block-design should not be used in order to avoid the pitfall. However, the impact of TA could be avoided only when the trial of EEG was not further segmented into several samples; otherwise, the overfitting or pitfall would still occur. In contrast, when the correct data splitting strategy was used (e.g. separating training and test data in time), the pitfall could also be avoided even when block-design was used.
In our framework, we proposed the concept of "domain" to represent the EEG patterns resulting from TA and then used phantom EEG to remove stimulus-driven neural responses for verification. The results confirmed that the TA, always existing in the EEG data, added unique domain features to a continuous segment of EEG. The specific finding is that when the segment of EEG data with the same class label is split into multiple samples, the classifier will associate the sample's class label with the domain features, interfering with the learning of class-related features. This leads to an overestimation of decoding performance for test samples from the domains seen during training, and results in poor accuracy for test samples from unseen domains (as in real-world applications).
Importantly, our work suggests that the key to reducing the impact of EEG TA on BCI decoding is to decouple class-related features from domain features in the actual EEG dataset. Our proposed unified framework serves as a reminder to BCI researchers of the impact of TA on their specific BCI tasks and is intended to guide them in selecting the appropriate experimental design, splitting strategy and model construction.
We must point out that the "phantom EEG" indeed does not contain any "EEG" but records only noise: a watermelon is not a brain and does not generate any electrical signals. Therefore, the recorded electrical noise, even when amplified using equipment typically used for EEG, does not constitute EEG data under the definition of EEG. This is why previous researchers called it "phantom EEG". Some researchers may therefore think that it is questionable to use a watermelon to obtain phantom EEG.
However, the usage of the phantom head allows researchers to evaluate the performance of neural-recording equipment and proposed algorithms without the effects of neural activity variability, artifacts, and potential ethical issues. Phantom heads used in previous studies include digital models [12]–[14], real human skulls [15]–[17], artificial physical phantoms [18]–[24] and watermelons [25]–[40]. Due to their similar conductivity to human tissue, similar size and shape to the human head, and ease of acquisition, watermelons are widely used as "phantom heads".
Most previous works used watermelons as phantom heads and found that results obtained from the neural signals of human subjects could not be reproduced with the phantom head, thus proving that the achieved results were indeed caused by neural signals. For example, Mutanen et al. [35] proposed that "the fact that the phantom head stimulation did not evoke similar biphasic artifacts excludes the possibility that residual induced artifacts, with the current TMS-compatible EEG system, could explain these components".
Our work differs significantly from most previous works. It is the first to show that phantom EEG exhibits the effect of TA on BCI decoding even when only noise was recorded, indicating the inherent existence of TA in EEG data. The conclusion we hope to draw is that some current works may not truly be using stimulus-driven neural responses when they obtain overestimated decoding performance. Similar logic may be found in a neuroscience review article [41], which proposed that EEG recordings from a phantom head (watermelon) remind us that background noise may appear as positive results without proper statistical precautions.
[1] C. Spampinato, S. Palazzo, I. Kavasidis, D. Giordano, N. Souly, and M. Shah, “Deep Learning Human Mind for Automated Visual Classification,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 4503–4511.
[2] S. Koelstra et al., “DEAP: A Database for Emotion Analysis ;Using Physiological Signals,” IEEE Transactions on Affective Computing, vol. 3, no. 1, pp. 18–31, 2012.
[3] N. Das, T. Francart, and A. Bertrand, “Auditory Attention Detection Dataset KULeuven.” Zenodo, Aug. 27, 2020.
[4] M. Tangermann et al., “Review of the BCI Competition IV,” Front.
License: GNU GPL 2.0, https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164:
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on detalization level (see Step 2 for more details):

- up to 2Tb of disk space (see Step 2 detalization levels)
- at least 16Gb of RAM (64 preferable)
- a few hours to a few months of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

      git clone https://gitlab.com/user2589/ghd.git
      git checkout 0.1.0

  `cd` into the extracted folder. All commands below assume it as a current directory.
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
  Without this dependency, you might get an error on the next step, but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`
- disable all APIs except GitHub (Bitbucket and Gitlab support were not yet implemented when this study was in progress): edit `scraper/init.py`, comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get output of the Python function `common.utils.survival_data()` and save it into a CSV file:

    # copy and paste into a Python console
    from common import utils
    survival_data = utils.survival_data('pypi', '2008', smoothing=6)
    survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15..30 minutes.

- create a folder `
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
This dataset contains scraped and processed text from roughly 100 years of articles published in the Wiley journal Science Education (formerly General Science Quarterly). This text has been cleaned and filtered in preparation for analysis using natural language processing techniques, particularly topic modeling with latent Dirichlet allocation (LDA). We also include a Jupyter Notebook illustrating how one can use LDA to analyze this dataset and extract latent topics from it, as well as analyze the rise and fall of those topics over the history of the journal.
The articles were downloaded and scraped in December of 2019. Only non-duplicate articles with a listed author (according to the CrossRef metadata database) were included, and due to missing data and text recognition issues we excluded all articles published prior to 1922. This resulted in 5577 articles in total being included in the dataset. The text of these articles was then cleaned in the following way:
After filtering, each document was turned into a list of individual words (or tokens), which were then collected and saved (using the Python pickle format) into the file scied_words_bigrams_V5.pkl.
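As a minimal illustration of how the tokenized documents can be fed into an LDA model (the included Jupyter Notebook is the reference analysis; the use of gensim and the topic count below are assumptions for this sketch):

```python
import pickle

from gensim import corpora, models

# Load the tokenized documents: a list of token lists, as described above.
with open("scied_words_bigrams_V5.pkl", "rb") as fh:
    docs = pickle.load(fh)

# Build a bag-of-words corpus and fit a small LDA model (topic count is illustrative).
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20,
                      passes=5, random_state=0)

for topic_id, words in lda.print_topics(num_topics=5):
    print(topic_id, words)
```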
In addition to this file, we have also included the following files:
This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to share text and data mining output for non-commercial research purposes. Any questions or comments can be directed to Tor Ole Odden, t.o.odden@fys.uio.no.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This is a redistribution of the dataset 'Watervlakken - versie 2024' (Water surfaces - edition 2024), originally published by the Research Institute for Nature and Forest (INBO) and distributed by 'Informatie Vlaanderen' under a CC-BY compatible license. More specifically, this Zenodo record redistributes the GeoPackage file from the original data source, in order to support reproducible, analytical workflows on Flemish Natura 2000 habitats and regionally important biotopes.
The digital map of standing water surfaces (edition 2024) is a georeferenced digital file of standing surface waters in Flanders (northern Belgium). The file contains 93 201 polygons with an area between 1.45 m² and 2.47 km² and can be considered as the most complete and accurate representation of lentic water bodies presently available for the Flemish territory. The map is based on topographic map layers, orthophoto images, the Digital Terrain Model of Flanders version II, results of a water prediction model and, to a lesser extent, field observations. It can be used for a wide range of applications in research, policy preparation and policy implementation, management planning and evaluation that consider the distribution and characteristics of stagnant water bodies. The map is also relevant internationally, including updates for the National Wetland Inventories (Ramsar). Furthermore, its unique reference to each object will considerably facilitate related data management.
For this new edition of Watervlakken (2024), the orthophoto images of 2021, 2022 and 2023 and the digital terrain model of Flanders have been used. This edition also uses the results of an AI prediction model for water developed by VITO. Data from various Regional Landscapes, ad hoc user reports and field observations have been used to digitise additional polygons, make shape corrections or remove filled ponds from the map layer. For a number of water surfaces, new data on the Flemish type according to the European Water Framework Directive (WFD type), water depth and connectivity have been added to the attribute table.
The data source is produced, owned and administered by the Research Institute for Nature and Forest (INBO, Department of Environment of the Flemish government).
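A minimal sketch for loading the redistributed GeoPackage with geopandas (the file name is a placeholder and the default layer is assumed):

```python
import geopandas as gpd

# Read the standing-water polygons (placeholder file name for the redistributed GeoPackage).
water = gpd.read_file("Watervlakken_2024.gpkg")

print(len(water))                       # expected ~93 201 polygons
print(water.crs)                        # coordinate reference system of the layer
print(water.geometry.area.describe())   # polygon areas in the units of the CRS
```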
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
In the context of the European projects “Wind and Ports” (grant No. B87E09000000007) and “Wind, Ports and Sea” (grant No. B82F13000100005), an extensive in-situ wind monitoring network was installed in the main ports of the Northern Mediterranean Sea. An unprecedented number of wind records has been acquired and systematically analyzed. Among these, a considerable number of records presented non-stationary and non-Gaussian characteristics that are completely different from those of synoptic extra-tropical cyclones, widely known in the atmospheric science and wind engineering communities. Cross-checking with meteorological information allows one to identify which of these events can be defined as thunderstorm winds, i.e., downbursts and gust fronts.
The scientific literature of the last few decades has demonstrated that downbursts, and especially micro-bursts, are extremely dangerous for the natural and built environment. Furthermore, recent trends in climate change seem to point to drastic future scenarios in terms of intensification and increased frequency of this type of extreme event. However, the limited spatial and temporal extent of thunderstorm outflows still makes them difficult to measure in nature and, consequently, to describe with physically reliable and easily applicable models such as those available for extra-tropical cyclones. For these reasons, the collection and publication of events of this type represents a unique opportunity for the scientific community.
The dataset presented here was built in the context of the activities of the project THUNDERR “Detection, simulation, modelling and loading of thunderstorm outflows to design wind-safer and cost-efficient structures”, financed by the European Research Council (ERC), Advanced Grant 2016 (grant No. 741273, P.I. Prof. Giovanni Solari, University of Genoa). It collects 29 thunderstorm downbursts that occurred between 2010 and 2015 in the Italian ports of Genoa (GE) (4), Livorno (LI) (14), and La Spezia (SP) (11), and were recorded by means of ultrasonic anemometers (Gill WindObserver II in Genoa and La Spezia, Gill WindMaster Pro in Livorno). All thunderstorm events included in the database were verified by means of meteorological information, such as radar (the CIMA Research Foundation is gratefully acknowledged for providing most of the radar images), satellite, and lightning data. In fact, (i) high and localized clouds typical of thunderstorm cumulonimbus, (ii) precipitation, and (iii) lightning represent reliable indicators of the occurrence of a thunderstorm event.
Some events were recorded by multiple anemometers in the same port area – the total number of signals included in the database is 99. Despite the limited number of points (anemometers), this will allow the user to perform cross-correlation analysis in time and space to eventually retrieve size, position, trajectory of the storm, etc.
The ASCII tab-delimited file ‘Anemometers_location.txt’ reports specifications of the anemometers used in this monitoring study: port code (Port code – Genoa-GE, Livorno-LI, La Spezia-SP); anemometer code (Anemometer code); latitude (Lat.) and longitude (Lon.) in decimal degrees WGS84; height above the ground level (h a.g.l.) in meters; Instrument type. Bi-axial anemometers were used in the ports of Genoa and La Spezia, recording the two horizontal wind speed components (u, v). Three-axial ultrasonic anemometers were used in the port of Livorno, also providing the vertical wind speed component w (except bi-axial anemometers LI06 and LI07). All anemometers acquired velocity data at sampling frequency 10 Hz, sensitivity 0.01 m s-1 (except anemometers LI06 and LI07 with sensitivity 0.1 m s-1) and were installed at various heights ranging from 13.0 to 75.0 m, as reported in the file ‘Anemometers_location.txt’.
The ASCII tab-delimited file ‘List_DBevents.txt’ lists all downburst records included in the database, in terms of: event and record number (Event | record no.); port code (Port code); date of event occurrence (Date) in the format yyyy-mm-dd; approximate time of occurrence of the velocity peak (Time [UTC]) in the format HH:MM; anemometer code (Anemometer code).
The database is provided as a zip file (‘DB-records.zip’). The events are divided by port of occurrence (three folders: GE, LI, and SP). Within each folder, the downburst events recorded in that port appear as subfolders (name format ‘[port code]_yyyy-mm-dd’) containing the individual anemometer signals as tab-delimited text files (name format ‘[port and anemometer code]_yyyy-mm-dd.txt’). Each file contains 3 (or 4) columns and 360,000 rows. The first column is the 10-h time vector (t, ISO format) in UTC, while the remaining 2 (or 3) columns report the 10-h time series of the 10-Hz instantaneous horizontal (zonal west-to-east u, meridional south-to-north v) and, where available, vertical (positive upward w) wind speed components, centred around the time of the maximum horizontal wind speed (vectorial sum of u and v). Representing the wind speed over a large time interval (10 hours) allows a more comprehensive and detailed analysis of the event, taking into account the wind conditions before and after the onset of the downburst. ‘Not-a-Number’ (‘NaN’) values appear in the wind velocity signals wherever the instrument did not record valid data.
Some wind speed records show noise in discrete intervals of the signal, which appears as an increase in the wind speed standard deviation. A modified Hampel filter was employed to remove measurement outliers: each data sample was considered in ascending order, together with its ten adjacent samples (five on each side); the median of the window and a robust standard deviation based on the median absolute deviation were computed; and samples deviating from the median by more than six standard deviations were replaced with ‘NaN’. Tuning the filter parameters required balancing overly aggressive and insufficient removal of outliers, and residual outliers were subsequently removed manually through careful qualitative inspection. Because this operation is partly subjective, users may wish to explore alternative approaches; consequently, the published dataset includes two versions: an initial version (v1) containing the original raw data with no filtering applied, and a second, cleaned version (v2).
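To illustrate this cleaning step, the following Python sketch loads one record file and applies a modified Hampel filter of the kind described above. The file path is hypothetical, the window includes the sample plus five neighbours on each side, and the 1.4826*MAD scaling is an assumption for illustration only; the published v2 files already contain the cleaned signals.
# Sketch: load one downburst record (hypothetical path) and apply a modified Hampel filter
import numpy as np
import pandas as pd
rec = pd.read_csv('GE/GE_2011-01-01/GE01_2011-01-01.txt', sep='\t')  # hypothetical example path
t = pd.to_datetime(rec.iloc[:, 0])          # 10-h time vector (UTC, ISO format)
u = rec.iloc[:, 1].to_numpy(dtype=float)    # west-to-east component (m/s)
v = rec.iloc[:, 2].to_numpy(dtype=float)    # south-to-north component (m/s)
def hampel_nan(x, half_window=5, n_sigma=6.0):
    # Replace samples deviating from the local median by more than
    # n_sigma robust standard deviations (1.4826 * MAD) with NaN.
    y = x.copy()
    n = len(x)
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        window = x[lo:hi]
        window = window[~np.isnan(window)]
        if window.size == 0:
            continue
        med = np.median(window)
        sigma = 1.4826 * np.median(np.abs(window - med))
        if sigma > 0 and abs(x[i] - med) > n_sigma * sigma:
            y[i] = np.nan
    return y
u_clean = hampel_nan(u)
v_clean = hampel_nan(v)
# Horizontal wind speed (vectorial sum of u and v) and its peak
speed = np.sqrt(u_clean**2 + v_clean**2)
i_peak = np.nanargmax(speed)
print(t.iloc[i_peak], speed[i_peak])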
The presented database can be used by researchers to validate and calibrate experimental and numerical simulations, as well as analytical models, of downburst winds. It will also be an important resource for the scientific community working in wind engineering, meteorology, and atmospheric sciences, as well as for risk management and the reduction of losses related to thunderstorm events (e.g., insurance companies).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data is split across three Zenodo locations because it is too large for a single upload. In total, the dataset contains MEG data from 58 participants. An overview of the participants and the amount of musical training they have completed is also available. Each of the three Zenodo uploads contains the participant overview file plus one Set#.zip archive.
Part/Set 1 (blue) contains: MEG data of participants 1–19 + audio folder (can be found here)
Part/Set 2 (pink) contains: MEG data of participants 20–38 (can be found here)
Part/Set 3 (yellow) contains: MEG data of participants 39–58
We used four German audiobooks (all published by Hörbuch Hamburg Verlag and available online):
1. „Frau Ella“ (narrated by lower pitched (LP) speaker and attended by participants)
2. „Darum“ (narrated by LP speaker and ignored by participants)
3. „Den Hund überleben“ (narrated by higher pitched (HP) speaker and attended by participants)
4. „Looking for Hope“ (narrated by HP speaker and ignored by participants)
The participants listened to 10 audiobook chapters. Two audiobooks were always presented at the same time (one narrated by the HP speaker and one by the LP speaker), and the participants attended to one speaker while ignoring the other. The structure of the chapters was as follows:
Chapter 1 of audiobook 1 + random part of audiobook 4
3 comprehension questions
Chapter 1 of audiobook 3 + random part of audiobook 2
3 comprehension questions
Chapter 2 of audiobook 1 + random part of audiobook 4
3 comprehension questions
Chapter 2 of audiobook 3 + random part of audiobook 2
3 comprehension questions
Chapter 3 of audiobook 1 + random part of audiobook 4
3 comprehension questions
Chapter 3 of audiobook 3 + random part of audiobook 2
3 comprehension questions
Chapter 4 of audiobook 1 + random part of audiobook 4
3 comprehension questions
Chapter 4 of audiobook 3 + random part of audiobook 2
3 comprehension questions
Chapter 5 of audiobook 1 + random part of audiobook 4
3 comprehension questions
Chapter 5 of audiobook 3 + random part of audiobook 2
3 comprehension questions
MEG data of 58 participants is contained in this data set.
Each participant has a folder with its participant number as folder name (1,2,3,…).
Each participant folder contains two subfolders: one (LP_speaker_attended) containing the MEG data recorded while the participant attended the LP speaker (ignoring the HP speaker), and one (HP_speaker_attended) containing the MEG data recorded while the participant attended the HP speaker (ignoring the LP speaker). Note that after each chapter the participants switched their attention from the LP to the HP speaker and vice versa; for evaluation, however, we concatenated the data of the LP-speaker-attended/HP-speaker-ignored condition and of the HP-speaker-attended/LP-speaker-ignored condition.
The data for the HP-speaker-attended condition has shape (248, 959416) (ca. 16 minutes); the data for the LP-speaker-attended condition has shape (248, 1247854) (ca. 21 minutes).
# The MEG data can be loaded with the MNE-Python library
import mne
meg = mne.io.read_raw_fif("…/data_meg.fif")  # path to a participant's data_meg.fif file
# The data can be accessed as a (n_channels, n_times) NumPy array:
meg_data = meg.get_data()
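As a quick sanity check on a loaded recording (the expected values follow the description above; this is only a sketch):
# Sanity check on the loaded recording
print(meg.info['sfreq'])   # expected: 1000.0 Hz after resampling
print(meg_data.shape)      # e.g. (248, 959416) for the HP-speaker-attended condition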
Exemplary code for performing source reconstruction and TRF evaluation can be found in our Git repository.
The original audio chapters of the audio books are stored in the folder „Audio“ in Part 1.
There are two subfolders. One (attended_speech) contains the ten audiobook chapters which were attended by the participant (audiobook1_#, audiobook3_#). The other subfolder (ignored_speech) contains the ten audiobook chapters which were ignored by the participant (audiobook2_#, audiobook4_#).
We recommend the librosa library for audio loading and processing.
Audio data is provided with a sampling frequency of 44.1 kHz.
Each audiobook is provided in 5 chapters as they were presented to the participants. The corresponding MEG file, as described above, already contains the concatenated measured data of all five chapters.
If you resample the audio data to 1000 Hz and concatenate the chapters, the audio length (n_times) will equal the corresponding n_times of the MEG data.
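A minimal sketch of that alignment step follows; the chapter file names and the .wav extension are assumptions based on the folder description above.
# Sketch: resample the attended chapters to 1000 Hz and concatenate them
# so that the resulting length matches n_times of the MEG data
import numpy as np
import librosa
chapters = []
for i in range(1, 6):
    audio, sr = librosa.load(f'Audio/attended_speech/audiobook1_{i}.wav', sr=1000)  # assumed file names
    chapters.append(audio)
attended = np.concatenate(chapters)
print(attended.shape)   # should match meg_data.shape[1] for the corresponding condition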
The MEG data was filtered with an analog 1.0–200 Hz filter and preprocessed offline using a notch filter (FIR, firwin design, 0.5 Hz bandwidth) to remove power-line interference at 50, 100, 150, and 200 Hz.
The data was then resampled from 1017.25 Hz to 1000 Hz.
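The published files already include these preprocessing steps; purely as an illustration, an equivalent offline pipeline in MNE-Python could look like the following sketch (not the original analysis script).
# Illustrative only: the distributed data is already notch-filtered and resampled
import mne
raw = mne.io.read_raw_fif("…/data_meg.fif", preload=True)
raw.notch_filter(freqs=[50, 100, 150, 200], notch_widths=0.5, fir_design='firwin')
raw.resample(1000)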
The MEG system used for the recordings was a 248-magnetometer system (4D Neuroimaging, San Diego, CA, USA).
The audio signal was presented through loudspeakers outside the magnetically shielded chamber and delivered to the participant via tubes 2 m long and 2 cm in diameter, resulting in an acoustic delay of 6 ms. The audio was presented diotically (both the attended and the ignored audio stream were presented to both ears) at a sound pressure level of 67 dB(A).
The measurement setup was adopted from a previous study by Schilling et al. (https://doi.org/10.1080/23273798.2020.1803375).