Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description:
This repository contains all source code, necessary input and output data, and raw figures and tables for reproducing most figures and results published in the following study:
Hefei Zhang#, Xuhang Li#, Dongyuan Song, Onur Yukselen, Shivani Nanda, Alper Kucukural, Jingyi Jessica Li, Manuel Garber, Albertha J.M. Walhout. Worm Perturb-Seq: massively parallel whole-animal RNAi and RNA-seq. (2025) Nature Communications, in press (#: equal contribution, *: corresponding author)
These include results related to method benchmarking and NHR data processing. Source data for figures that are not reproducible here have been provided with the publication.
Files:
This repository contains a few directories related to this publication. To deposit into Zenodo, we have individually zipped each subfolder of the root directory.
There are three directories included:
MetabolicLibrary
method_simulation
NHRLibrary
Note: the parameter optimization output is deposited in a separate Zenodo repository (10.5281/zenodo.15236858) for better organization and ease of use. If you would like to reproduce results related to the "MetabolicLibrary" folder, please download the omitted subfolder "MetabolicLibrary/2_DE/output/" from that separate repository and integrate it.
Please be advised that this repository contains raw code and data that are not directly related to a figure in our paper. However, they may be useful for generating input used in the analysis of a figure, or for reproducing tables in our manuscript. The repository may also contain unpublished analyses and figures, which we kept for the record rather than deleting.
Usage:
Please refer to the table below to locate the specific file for reproducing a figure of interest (also available in METHOD_FIGURE_LOOKUP.xlsx under the root directory).
Figure | File | Lines | Notes |
Fig. 2c | MetabolicLibrary/1_QC_dataCleaning/2_badSampleExclusion_manual.R | 65-235 | output figure is selected from figures/met10_lib6_badSamplePCA.pdf |
Fig. 2d | NHRLibrary/example_bams/* | - | load the bam files in IGV to make the figure |
Fig. 3a | MetabolicLibrary/2_DE/2_5_vectorlike_re_analysis_with_curated_44_NTP_conditions.R | 348-463 | |
Fig. 3b,c | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 106-376 | |
Fig. 3d | MetabolicLibrary/2_DE/SUPP_extra_figures_for_rewiring.R | 10-139 | |
Fig. 3e | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 379-522 | |
Fig. 3f,g | MetabolicLibrary/2_DE/SUPP_extra_figures_for_rewiring.R | 1-8 | |
Fig. 3h | method_simulation/Supp_systematic_mean_variation_example.R | 1-138 | |
Fig. 3i | method_simulation/3_benchmark_DE_result_w_rep.R and 1_benchmark_DE_result_std_NB_w_rep.R | 1-518 | the example figure was from figures/GLM_NB_deltaMiu_k08_10rep/seed_12345_empircial_null_fit_simulated_data_NB_GLM.pdf and figures/GLM_NB_10rep/seed_1_empircial_null_fit_simulated_data_NB_GLM.pdf; |
Fig. 3j | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 2104-2106 | load dependencies starting from line 1837 |
Fig. 3k | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 2053-2078 | load dependencies starting from line 1837; the GSEA was performed using SUPP_supplementary_figures_for_method_noiseness_GSEA.R |
Fig. 4a,b | method_simulation/3_benchmark_DE_result_w_rep.R | 1-523 | |
Fig. 4c | method_simulation/3_benchmark_WPS_parameters.R | 1-237 | |
Fig. 4d | MetabolicLibrary/2_DE/2_5_vectorlike_re_analysis_with_curated_44_NTP_conditions.R | 1-346 | output figure is selected from figures/0_DE_QA/vectorlike_analysis/2d_cutoff_titration_71NTP_rawDE_log2FoldChange_log2FoldChange_raw.pdf and 2d_cutoff_titration_71NTP_p0.005_log2FoldChange_raw.pdf. The "p0.005" in the second file name indicates the p_outlier cutoff used in the final parameter set for EmpirDE. |
Fig. 4e | MetabolicLibrary/2_DE/SUPP_plot_N_DE_repeated_RNAi.R | entire file | |
Fig. 4f | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 1020-1407 | |
Fig. 4g,h | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 529-851 | |
Fig. 5d | NHRLibrary/FinalAnalysis/2_DE_new/3_DE_network_analysis.R | 51-69; 94-112 | load dependencies starting from line 1 |
Fig. 5e | NHRLibrary/FinalAnalysis/2_DE_new/5_GSA_bubble_plot.R | 1-306 | |
Fig. 5f | NHRLibrary/FinalAnalysis/2_DE_new/5_GSA.R | 1-1492 | |
Fig. 6a | NHRLibrary/FinalAnalysis/2_DE_new/4_DE_similarity_analysis.R | 1-175 | |
Fig. 6b | NHRLibrary/FinalAnalysis/6_case_study.R | 506-534 | load dependencies starting from line 1 |
Fig. 6c | NHRLibrary/FinalAnalysis/6_case_study.R | 668-888 | load dependencies starting from line 1 |
Supplementary Fig. 1e | NHRLibrary/FinalAnalysis/5_revision/REVISION_gene_detection_sensitivity_benchmark.R | 1-143 | |
Supplementary Fig. 1f | MetabolicLibrary/1_QC_dataCleaning/2_badSampleExclusion_manual.R | 65-235 | output figure is selected from figures/met10_lib6_badSampleCorr.pdf |
Supplementary Fig. 1g | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 2191-2342 | |
Supplementary Fig. 2a | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 1409-1822 | |
Supplementary Fig. 2b | method_simulation/Supp_systematic_mean_variation_example.R | 1-138 | |
Supplementary Fig. 2c | method_simulation/Supp_systematic_mean_variation_example.R; 2_fit_logFC_distribution.R | 141-231; 1-201 | the middle panel was generated from Supp_systematic_mean_variation_example.R (lines 141-231) and right panel was from 2_fit_logFC_distribution.R (lines 1-201) |
Supplementary Fig. 2d | method_simulation/1_benchmark_DE_result_std_NB_w_rep.R | 1-518 | the example figure was from figures/GLM_NB_10rep/seed_1_empircial_null_fit_simulated_data_NB_GLM.pdf; |
Supplementary Fig. 2e | method_simulation/3_benchmark_DE_result_w_rep.R | 1-518 | the example figure was from figures/GLM_NB_deltaMiu_k08_10rep/seed_12345_empircial_null_fit_simulated_data_NB_GLM.pdf |
Supplementary Fig. 2f | method_simulation/3_benchmark_DE_result_w_rep.R | 528-573 | may need to run the code from line 1 to load other variables needed |
Supplementary Fig. 3a,b | method_simulation/1_benchmark_DE_result_std_NB_w_rep.R | 1-523 | |
Supplementary Fig. 3c | method_simulation/3_benchmark_WPS_parameters.R | 1-237 | |
Supplementary Fig. 3d | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 3190-3300 | |
Supplementary Fig. 3e | 2_3_power_error_tradeoff_optimization.R | entire file | the figure was in figures/0_DE_QA/cleaning_strength_titration/benchmark_vectorlikes_titrate_cleaning_cutoff_DE_log2FoldChange_FDR0.2_FC1.pdf (produced in line 398); this script produced the titration plots for a series of thresholds, where we picked FDR0.2_FC1 for presentation in the paper |
Supplementary Fig. 3f | 2_3_power_error_tradeoff_optimization.R | entire file | the figure was in figures/0_DE_QA/cleaning_strength_titration/benchmark_independent_repeats_titrate_cleaning_cutoff_FP_log2FoldChange_raw.pdf (produced in line 195). The top line plot was from figures/0_DE_QA/cleaning_strength_titration/benchmark_independent_repeats_titrate_cleaning_cutoff_FP_log2FoldChange_raw_summary_stat.pdf (produced in line 218). |
Supplementary Fig. 4 | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 853-898 | please run from line 529 to load dependencies |
Supplementary Fig. 5a,b | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 2590-3185 | |
Supplementary Fig. |
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
KITAB Text Reuse Data
KITAB is funded by the European Research Council under the European Union’s Horizon 2020 research and innovation programme, awarded to the KITAB project (Grant Agreement No. 772989, PI Sarah Bowen Savant), hosted at Aga Khan University, London. In addition, it has received funding from the Qatar National Library to aid in the adaptation of the passim algorithm for Arabic.
KITAB’s text reuse data is generated by running passim on the OpenITI corpus (DOI: 10.5281/zenodo.3082463). Each version is the output of a separate run and the version number corresponds to the corpus releases.
To prepare the corpus for a passim run, we normalize the texts, remove most of the non-Arabic characters, and then chunk the texts into passages 300 words in length (word boundaries are determined by the non-Arabic characters, including white space). The chunks, called milestones, are identified by unique IDs. This dataset represents the reuse cases that have been identified among milestones.
The text reuse dataset consists of folders for each book. Each folder includes CSV files of the text reuse cases (alignments) between the corresponding book and all other books with which passim has found instances of reuses. The files have the below naming convention, using the book ids:
The CSV files are not the immediate output of passim, but rather the result of a post-processing step. The folder structure is as below (for an example set of four books).
bookVersionID1
|- bookVersionID1_bookVersionID4.csv
|- bookVersionID1_bookVersionID3.csv
bookVersionID4
|- bookVersionID4_bookVersionID3.csv
Where a book has no folder or no CSV files, the passim algorithm did not find any text reuse cases for that specific book. In the above example, there is no folder or CSV file for bookVersionID2, which means that no reuse cases were detected between book 2 and any of the other three books.
To save computational resources, we generate text reuse data uni-directionally, which means a pair of documents is compared only once (document1 to document2, not document2 to document1).
The alignments in the CSV files are a list of records. Each record shows a pair of matched passages between two books, together with statistics, such as the algorithm score, and contextual information, such as the start and end positions of the aligned passages, so that one can find those passages in the books. A description of the alignment fields is given in the release notes.
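For orientation, the snippet below is a minimal sketch (in Python, using pandas) of how the per-book folders and pairwise CSV files described above could be traversed; the root directory name is a placeholder, and the exact alignment columns are those documented in the release notes.

```python
import os
import pandas as pd  # any CSV reader works; pandas is only an example dependency

# Minimal sketch: walk the per-book folders and load every pairwise alignment
# file, relying only on the bookVersionID1/bookVersionID1_bookVersionID2.csv
# naming convention described above.
root = "kitab_text_reuse"  # placeholder path to the unpacked dataset

for book_dir in sorted(os.listdir(root)):
    book_path = os.path.join(root, book_dir)
    if not os.path.isdir(book_path):
        continue
    for csv_name in sorted(os.listdir(book_path)):
        if not csv_name.endswith(".csv"):
            continue
        # The file name encodes the two compared book versions.
        book1, book2 = csv_name[:-4].split("_", 1)
        alignments = pd.read_csv(os.path.join(book_path, csv_name))
        print(book1, book2, len(alignments), "alignment records")
```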
For each dataset, we also generate statistical data on the alignments between the book pairs. The data is published in an application that facilitates search, filtering, and visualizations. The link to the corresponding application is given in the release notes.
Note on Release Numbering: in version 2020.1.1, "2020" is the year of the release, the first dotted number (.1) is the ordinal release number within 2020, and the second dotted number (.1) is the overall release number. The first dotted number resets every year, while the second one keeps increasing.
Note: The very first release of the KITAB text reuse data (2019.1.1) is published here as it was too big to publish on Zenodo. To receive more information on the complete datasets please contact us via kitab-project@outlook.com (or other team members).
Future releases may include only part of the generated data if the whole dataset is too big to publish on Zenodo. However, the data is open access for anyone to use. We provide detailed information on the datasets in the corresponding release notes.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Overview
The Corpus of Decisions: Permanent Court of International Justice (CD-PCIJ) collects and presents for the first time in human- and machine-readable formats all documents of PCIJ Series A, B and A/B of the Permanent Court of International Justice (PCIJ). Among these are judgments, advisory opinions, orders, appended minority opinions, annexes, applications instituting proceedings and requests for an advisory opinion. The International Court of Justice, the successor of the PCIJ, has kindly made available these documents on its website.
The Permanent Court of International Justice (PCIJ) was the primary judicial organ of the League of Nations, the ill-fated predecessor of the United Nations, which existed from 1920 to 1946. Nonetheless, as the first international court with general thematic jurisdiction, the PCIJ influenced international law in profound ways that are still felt today. Every lawyer who sets out on the path of international law encounters epoch-defining opinions such as the Lotus and Factory at Chorzów decisions, but the Court's lesser-known jurisprudence and the appended minority opinions offer many more ideas and legal principles which are seldom appreciated today.
This data set is designed to be complementary to and fully compatible with the Corpus of Decisions: International Court of Justice (CD-ICJ), which is also available open access.
Citation
A peer-reviewed academic paper describing the construction and relevance of the data set entitled 'Introducing Twin Corpora of Decisions for the International Court of Justice (ICJ) and the Permanent Court of International Justice (PCIJ)' was published open access in the Journal of Empirical Legal Studies (JELS). It is also available in print at JELS 2022, Vol. 19, No. 2, pp. 491-524.
If you use the data set for academic work, please cite both the JELS paper and the precise version of the data set you used for your analysis.
NEW in Version 1.1.0
Full recompilation of data set
CHANGELOG and README converted to external markdown files
Display of version number on Codebook and Compilation Report title pages fixed; correctly display semantic versioning
The ZIP archive of source files includes the TEX files
Config file converted to TOML format
All R packages are version-controlled with {renv}
Data set creation process cleans up all files from previous runs before a new data set is created
Remove redundant color from violin plots
Updates
The CD-PCIJ will only be updated if errors are discovered, enhancements are developed or in the unlikely event that the Court publishes additional documents within the collection ambit of the data set (PCIJ Series A, B and A/B).
Notifications regarding new and updated data sets will be published on my academic website at www.seanfobbe.com or via Mastodon at @seanfobbe@fediscience.org
Recommended Variants
Target Audience | Recommended Variant |
Practitioners | PDF_ENHANCED_MajorityOpinions |
Traditional Scholars | PDF_ENHANCED_FULL |
Quantitative Analysts | CSV_TESSERACT_FULL |
Please refer to the Codebook regarding the relative merits of each variant. All variants are available in either English or French. Unless you have very specific needs you should only use the variants denoted 'ENHANCED' or 'TESSERACT' for serious work.
Features
Fully compatible with the Corpus of Decisions: International Court of Justice (CD-ICJ)
29 variables
Public Domain (CC-Zero 1.0)
Open and platform independent file formats (PDF, TXT, CSV)
Extensive Codebook
Compilation Report explains construction and validation of the data set in detail
Large number of diagrams for all purposes (see the 'ANALYSIS' archive)
Diagrams are available as PDF (for printing) and PNG (for web display), tables are available as CSV for easy readability by humans and machines
Secure cryptographic signatures
Publication of full source code (Open Source)
Key Metrics
Version: 1.1.0
Temporal Coverage: 22 May 1922 – 26 February 1940
Documents: 259 (English) / 261 (French)
Tokens: 1,296,536 (English) / 1,262,184 (French)
Formats: PDF, TXT, CSV
Source Code and Compilation Report
With every compilation of the full data set an extensive Compilation Report is created as a professionally laid-out PDF (comparable to the Codebook). The Compilation Report includes the Source Code, comments and explanations of design decisions, relevant computational results, exact timestamps and a table of contents with clickable internal hyperlinks to each section. The Compilation Report and Source Code are published under the same DOI: https://doi.org/10.5281/zenodo.7051937
For details of the construction and validation of the data set please refer to the Compilation Report.
Disclaimer
This data set has been created by Mr Seán Fobbe using documents available on the website of the International Court of Justice (https://www.icj-cij.org). It is a personal academic initiative and is not associated with or endorsed by the International Court of Justice or the United Nations.
The Court accepts no responsibility or liability arising out of my use, or that of third parties, of the documents and information produced, used or published on the Zenodo website. Neither the Court nor its staff members nor its contractors may be held responsible or liable for the consequences, financial or otherwise, resulting from the use of these documents and information.
Academic Publications (Fobbe)
Website — www.seanfobbe.com
Open Data — zenodo.org/communities/sean-fobbe-data
Code Repository — zenodo.org/communities/sean-fobbe-code
Regular Publications — zenodo.org/communities/sean-fobbe-publications
Contact
Did you discover any errors? Do you have suggestions on how to improve the data set? You can either post these to the Issue Tracker on GitHub or write me an e-mail at fobbe-data@posteo.de
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BIOCOM-PIPE: a new user-friendly metabarcoding pipeline for the characterization of microbial diversity from 16S, 18S and 23S rRNA gene amplicons
Summary:
This Zenodo repository encompasses the demo input files and expected output files of the pipeline, so that users can verify that they have implemented the pipeline successfully before applying it to large-scale datasets. More precisely, we chose a recently published dataset (Sadet et al., 2018 – Applied Soil Ecology – DOI: 10.1016/j.apsoil.2018.02.006), with raw data available in the EBI database system under project accession number PRJEB14258. We hope that these files will help users efficiently test and check the BIOCOM-PIPE pipeline. The deposited archive includes:
- the Input.txt file with chosen parameters,
- the project.csv file describing the composition of the library,
- .fastq files from the EBI database system under project accession number PRJEB14258,
- the expected result files and summary files after BIOCOM-PIPE analysis.
Background:
The ability to compare samples or studies easily using metabarcoding so as to better interpret microbial ecology results is an upcoming challenge. There exists a growing number of metabarcoding pipelines, each with its own benefits and limitations. However, very few have been developed to offer the opportunity to characterize various microbial communities (e.g., archaea, bacteria, fungi, photosynthetic microeukaryotes) with the same tool.
Results:
BIOCOM-PIPE is a flexible and independent suite of tools for processing data from high-throughput sequencing technologies, Roche 454 and Illumina platforms, and focused on the diversity of archaeal, bacterial, fungal, and photosynthetic microeukaryote amplicons. Various original methods were implemented in BIOCOM-PIPE to (i) remove chimeras based on read abundance, (ii) align sequences with structure-based alignments of RNA homologs using covariance models or a post-clustering tool (ReClustOR), and (iii) re-assign OTUs based on a reference OTU database. The comparison with two other pipelines (FROGS and mothur) highlighted that BIOCOM-PIPE was better at discriminating land use groups.
Conclusions:
The BIOCOM-PIPE pipeline makes it possible to analyze 16S/18S and 23S rRNA genes in the same package tool. This innovative approach defines a biological database from previously analyzed samples and performs post-clustering of reads with this reference database by using open-reference clustering. This makes it easier to compare projects from various sequencing runs. For advanced users, the pipeline was developed to allow for adding or modifying the components, the databases and the bioinformatics tools easily.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
rxivist.org allowed readers to sort and filter the tens of thousands of preprints posted to bioRxiv and medRxiv. Rxivist used a custom web crawler to index all papers posted to those two websites; this is a snapshot of the Rxivist production database. The version number indicates the date on which the snapshot was taken. See the included "README.md" file for instructions on how to use the "rxivist.backup" file to import the data into a PostgreSQL database server.
Please note this is a different repository than the one used for the Rxivist manuscript—that is in a separate Zenodo repository. You're welcome (and encouraged!) to use this data in your research, but please cite our paper, now published in eLife.
Previous versions are also available pre-loaded into Docker images, available at blekhmanlab/rxivist_data.
Version notes:
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains an open and curated scholarly graph we built as a training and test set for data discovery, data connection, author disambiguation, and link prediction tasks. The graph represents the European Marine Science (MES) community included in the OpenAIRE Graph. The nodes of the graph we release represent publications, datasets, software, and authors; edges interconnecting research products always have the publication as source and the dataset/software as target. In addition, edges are labeled with semantics that outline whether the publication is referencing, citing, documenting, or supplementing the related outcome. To curate and enrich node metadata and edge semantics, we relied on the information extracted from the PDFs of the publications and the datasets/software webpages, respectively. We curated the authors so as to remove duplicated nodes representing the same person. The resource we release counts 4,047 publications, 5,488 datasets, 22 software, 21,561 authors, and 9,692 edges connecting publications to datasets/software. This graph is in the curated_MES folder. We provide this resource as:
- a property graph: a dump that can be imported into neo4j
- 5 jsonl files containing publications, datasets, software, authors, and relationships respectively; each line of a jsonl file contains a JSON object representing a node (or a relationship) and its metadata
We provide two additional scholarly graphs:
- The curated MES graph with the removed edges. During the curation we removed some edges because they were labeled with an inconsistent or imprecise semantics. This graph includes the same nodes and edges as the previous one and, in addition, contains the edges removed during the curation pipeline; these edges are marked as Removed. This graph is in the curated_MES_with_removed_semantics folder.
- The original MES community of OpenAIRE. It represents the MES community extracted from the OpenAIRE Research Graph. This graph has not been curated; its metadata and semantics are those of the OpenAIRE Research Graph. This graph is in the original_MES_community folder.
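As a small illustration (not part of the deposit), the following Python sketch streams one of the jsonl files; the file name is a placeholder for whichever of the publication, dataset, software, author, or relationship files you are inspecting.

```python
import json

# Minimal sketch: read a jsonl file line by line; each non-empty line is one
# JSON object describing a node (or a relationship) and its metadata.
def read_jsonl(path):
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)

count = 0
example = None
for record in read_jsonl("curated_MES/publications.jsonl"):  # placeholder file name
    count += 1
    if example is None:
        example = record  # keep one record to inspect the available fields

print(count, "records; example keys:", sorted(example) if example else None)
```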
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes all code and data required to reproduce the results of:
Greg Schuette, Zhuohan Lao, and Bin Zhang. ChromoGen: Diffusion model predicts single-cell chromatin conformations, 16 July 2024, PREPRINT (Version 1) available at Research Square [https://doi.org/10.21203/rs.3.rs-4630850/v1]
File descriptions:
- chromogen_code.tar.gz contains all code and, as of its upload date, is identical to the corresponding GitHub repo. Note that:
  - The code in chromogen_code.tar.gz/ChromoGen/recreate_results/train/EPCOT/, chromogen_code.tar.gz/ChromoGen/recreate_results/generate_data/EPCOT, and chromogen_code.tar.gz/ChromoGen/src/model/Embedder was adapted from that provided in the original EPCOT paper, Zhang et al. (2023).
  - chromogen_code.tar.gz/ChromoGen/recreate_results/create_figures/Figure_4/domain_boundary_support/PostAnalysisTools.py was adopted from Bintu et al. (2018); our only change was translating the code from Python 2 to Python 3.
  - The scripts in chromogen_code.tar.gz/ChromoGen/recreate_results/create_figures/ visualize Hi-C and DNase-seq data from Rao et al. (2014) and The ENCODE Project Consortium (2012), respectively, though this dataset excludes the experimental data itself. See chromogen_code.tar.gz/README.md for instructions on obtaining the data.
  - The scripts in chromogen_code.tar.gz/ChromoGen/recreate_results/generate_data/conformations/MDHomopolymer were originally used for Schuette et al. (2023), though we first make those scripts available here (the first author of both works created these files).
- epcot_final.pt contains the fine-tuned EPCOT parameters. Note that the pre-trained parameters -- not included in this dataset -- came from Zhang et al. (2023) and were used as the starting point for our fine-tuning optimization of these parameters.
- chromogen.pt contains the complete set of ChromoGen model parameters, including both the relevant fine-tuned EPCOT parameters and all diffusion model parameters; note that the fine-tuned EPCOT parameters are therefore also contained here.
- conformations.tar.gz contains all conformations analyzed in the manuscript, including the Dip-C conformations formatted in an HDF5 file, all ChromoGen-inferred conformations, and the MD-generated homopolymer conformations. Descriptively named subdirectories organize the data. Note that:
  - conformations.tar.gz/conformations/MDHomopolymer/DUMP_FILE.dcd is from Schuette et al. (2023), though it is first made available here.
  - conformations.tar.gz/conformations/DipC/processed_data.h5 represents our post-processed version of the 3D genome structures predicted by Dip-C in Tan et al. (2018).
- outside_data.tar.gz contains two subdirectories:
  - inputs contains our post-processed genome assembly file. Its sole content, hg19.h5, is a post-processed version of the FASTA-formatted hg19 human genome alignment created by Church et al. (2011), which we downloaded from the UCSC genome browser (Kent et al. (2002) and Nassar et al. (2023)). This dataset does NOT include the FASTA file itself.
  - training_data contains the Dip-C conformations post-processed by our pipeline. This is a duplicated version of the processed_data.h5 file described above.
- embeddings.tar.gz contains the sequence embeddings created by our fine-tuned EPCOT model for each region included in the diffusion model's training set. This is really only needed during training.
- chromogen_code.tar.gz/ChromoGen/README.md and the README.md file on our GitHub repo (identical at the time of this dataset's publication) explain the content of each file in greater detail. They also explain how to use the code to reproduce our results or to make your own structure predictions.
You can download and organize all the files in this dataset as intended by running the following in bash:
# Download the code and expand the tarball whose contents define the
# larger file structure of the repository this dataset is archiving.
wget https://zenodo.org/records/14218666/files/chromogen_code.tar.gz
tar -xvzf chromogen_code.tar.gz
rm chromogen_code.tar.gz
# Enter the top-level directory of the repo, create the subdirectories
# that'll contain the data, and cd to it
cd ChromoGen
mkdir -p recreate_results/downloaded_data/models
cd recreate_results/downloaded_data
# Download all the data in the proper locations
wget https://zenodo.org/records/14218666/files/conformations.tar.gz &
wget https://zenodo.org/records/14218666/files/embeddings.tar.gz &
wget https://zenodo.org/records/14218666/files/outside_data.tar.gz &
cd models
wget https://zenodo.org/records/14218666/files/chromogen.pt &
wget https://zenodo.org/records/14218666/files/epcot_final.pt &
cd ..
wait
# Untar the three tarballs
tar -xvzf conformations.tar.gz &
tar -xvzf embeddings.tar.gz &
tar -xvzf outside_data.tar.gz &
wait
# Remove the now-unneeded tarballs
rm conformations.tar.gz embeddings.tar.gz outside_data.tar.gz
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Count of family names (surnames, last names) in Peru, from an approximately 7% sample of the adult population.
In Peru, many people are registered as supporters of political parties, and their names are published by the Registro de Organizaciones Políticas. The lists include a DNI (national identity number) for each person to avoid duplicates. The 1,572,002 people on these lists (excluding the regional movements) represent around 7% of the adult population of Peru.
Their maternal and paternal family names have been sorted and counted. Nearly all of the people have entries for both paternal and maternal names.
These 3,142,561 family names represent 85,395 different names, most of which are infrequent. The file has been limited to names that occur ten or more times in the sample, which is 12,139 unique names (3,021,655 names, more than 96% of the total).
Each row in the file contains the rank, a percentage of that name in the entire set of 3,142,561 names, a count of the times the name occurs in the sample, and the name.
There are some names (around 800) in this file that contain a space. In most cases, these are names like "GARCIA DE RUIZ", where RUIZ is the name of the woman's husband. There are also cases where the name is like "DE LA CRUZ", which is a complete family name. No attempt has been made to remove the parts of names that refer to the husband's name; this could be considered for a later version.
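The snippet below is a minimal, illustrative Python sketch for loading the file; it assumes a comma-separated layout with no header and the four fields in the order listed above (rank, percentage, count, name), which may not match the actual file exactly, so inspect the file before relying on it.

```python
import csv

# Minimal sketch under assumed formatting: rank, percentage, count, name per row,
# comma-separated, no header row. Adjust the delimiter/column order to the real file.
surnames = {}
with open("peru_family_names.csv", encoding="utf-8", newline="") as handle:  # placeholder file name
    for rank, percentage, count, name in csv.reader(handle):
        # Names such as "GARCIA DE RUIZ" contain spaces but remain a single field.
        surnames[name] = {
            "rank": int(rank),
            "percentage": float(percentage.rstrip("%")),  # tolerate a trailing percent sign
            "count": int(count),
        }

print(len(surnames), "distinct family names loaded")
```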
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset contains scraped and processed text from roughly 100 years of articles published in the Wiley journal Science Education (formerly General Science Quarterly). This text has been cleaned and filtered in preparation for analysis using natural language processing techniques, particularly topic modeling with latent Dirichlet allocation (LDA). We also include a Jupyter Notebook illustrating how one can use LDA to analyze this dataset and extract latent topics from it, as well as analyze the rise and fall of those topics over the history of the journal.
The articles were downloaded and scraped in December of 2019. Only non-duplicate articles with a listed author (according to the CrossRef metadata database) were included, and due to missing data and text recognition issues we excluded all articles published prior to 1922. This resulted in 5577 articles in total being included in the dataset. The text of these articles was then cleaned in the following way:
After filtering, each document was then turned into a list of individual words (or tokens) which were then collected and saved (using the python pickle format) into the file scied_words_bigrams_V5.pkl.
In addition to this file, we have also included the following files:
This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to share text and data mining output for non-commercial research purposes. Any questions or comments can be directed to Tor Ole Odden, t.o.odden@fys.uio.no.
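As a pointer to how the tokenized text can be used, here is a minimal sketch (assuming the gensim package is installed; the included Jupyter Notebook is the authoritative example) that loads scied_words_bigrams_V5.pkl and fits a small LDA model.

```python
import pickle
from gensim import corpora, models  # assumed dependency; the bundled notebook is the reference

# Load the tokenized documents: a list of token lists, one per article.
with open("scied_words_bigrams_V5.pkl", "rb") as handle:
    documents = pickle.load(handle)

# Build a bag-of-words corpus and fit a small LDA topic model.
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(tokens) for tokens in documents]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=20, passes=5, random_state=0)

# Print a few example topics with their top words.
for topic_id, words in lda.show_topics(num_topics=5, num_words=8, formatted=False):
    print(topic_id, [word for word, _ in words])
```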
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The first-party ORCID data dump uses a data structure that is overly complex for most use cases. This Zenodo record contains a derived version that is much more straightforward, accessible, and smaller. So far, this includes employers, education, external identifiers, and publications linked to PubMed. It adds additional processing to ground employers and educational institutions using the Research Organization Registry (ROR). It also does some minor string processing, such as standardization of education types (e.g., Bachelor of Science, Master of Science).
It includes a pre-built Gilda index for named entity recognition (NER) and named entity normalization (NEN).
The records_hq.json.gz file is a subset of the full records file that only contains records that have at least one ROR-grounded employer, at least one ROR-grounded education, at least one standardized external identifier, or at least one publication indexed in PubMed. The point of this subset is to remove ORCID records that are generally not possible to match up to any external information.
It is automatically generated with code in https://github.com/cthoyt/orcid_downloader.
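For readers who want to peek at the contents, here is a minimal Python sketch; whether records_hq.json.gz holds a single JSON document or one JSON object per line is not stated here, so the sketch tries both.

```python
import gzip
import json

# Minimal sketch: open the gzipped file and parse it either as one JSON
# document or as JSON lines, since the exact serialization is an assumption.
path = "records_hq.json.gz"

with gzip.open(path, "rt", encoding="utf-8") as handle:
    try:
        records = json.load(handle)
    except json.JSONDecodeError:
        handle.seek(0)
        records = [json.loads(line) for line in handle if line.strip()]

print(f"loaded {len(records)} records")
```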
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This publication offers the necessary data and scripts to replicate the findings of the article titled "Hidden resonances in non-Hermitian systems with scattering thresholds". Additionally, convergence studies are provided. The article aims to offer a new perspective on resonances in the vicinity of scattering thresholds and to provide access to hidden modes on different Riemann sheets.
All Matlab files can be run without solving scattering problems, as the required data is stored in .mat files in the data directory. In order to run the simulations with JCMsuite, you must delete the data directory and replace the corresponding placeholders with a path to your installation of JCMsuite. Free trial licenses are available; please refer to the homepage of JCMwave.
We acquire the snapshots with the finite element method (FEM) solver JCMsuite. To estimate the error, the specular reflection has been collected at 24 equidistantly sampled points within the range of interest and at two additional sampling points on either side of the branch points (for further details we refer to the file convergence.m). The error is defined as \(\mathrm{min}\,\mathrm{abs}\left( R_0^n(\omega)-R_0^8(\omega)\right)\), where the superscript denotes the polynomial order of the FEM basis functions. Furthermore, the energy conservation (incoming energy minus reflection plus absorption) has been investigated.
All the data for the paper have been generated using \(n=5\). The error at the data points can therefore be expected to be below \(3\times10^{-7}\).
The AAA algorithm adaptively increases the degree \(m\) of the rational approximation until the error of the model with respect to all sample points falls below a given threshold \(t\), as long as \(m\) is smaller than half the number of sample points \(N\). We use \(t = 10^{-6}\) and \(t = 5\times 10^{-7}\) to make sure that it is larger than the error introduced through the FEM discretization. In the file AAAconvergence.m, error and model size are compared for different values of \(t\) and different numbers of support points. It can be observed that the error with respect to more than 500 reference points is smaller by orders of magnitude, while at the same time the size of the model is reduced and saturates quickly if the transformed variable \(\tilde{k}\) is used instead of \(k\). Here, 80 support points suffice for errors below \(10^{-6}\) for a spectrum containing three branch points and more than eight resonances (if hidden resonances are included).
We adopt a sampling scheme with additional samples in the vicinity of the branch points. This is achieved with equidistant samplings in the transformed space. For details we refer to the matlab scripts.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The EOSC Providers and Resources dataset contains the metadata descriptions (EOSC Profiles) of the EOSC Providers and the Resources (e.g. catalogues, services, data sources, training material and interoperability guidelines) they onboarded to the EOSC Catalogue and Marketplace during the EOSC Future project.
The dataset is based on a data extraction, provided by the ATHENA Research Center, using the public API of the EOSC Service Registry, which is part of the EOSC Resource Catalogue. The information provided here is a snapshot: a historical record of (a subset of) the information about various providers and resources recorded within the EOSC Catalogue and Marketplace at the end of the EOSC Future project in April 2024.
A curation process, using a publicly accessible data curation workflow, designed and implemented at DESY, was used to remove all known personal or sensitive data in line with GDPR and the EOSC Future Privacy Policy. This process is described in the methodology.md file within this dataset.
From April 2024, the EOSC Portal was phased out. On 24 April 2024, the European Commission announced the next phase of EOSC with the launch of the initial web presence of the EOSC EU Node. The goal of publishing this dataset is to enable other EOSC-related projects, such as OSCARS and EOSC Beyond, as well as researchers more broadly, to both reuse and build further on this work.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains motion capture (3D marker trajectories, ground reaction forces and moments), inertial measurement unit (wearable Movella Xsens MTw Awinda sensors on the pelvis, both thighs, both shanks, and both feet), and sagittal-plane video (anatomical keypoints identified with the OpenPose human pose estimation algorithm) data.
The data are from 51 willing participants and were collected in the HUMEA laboratory at the University of Eastern Finland, Kuopio, Finland, between 2022 and 2023. All trials were conducted barefoot.
The file structure contains an Excel file with information about the participants, data folders for each subject (numbered 01 to 51), and a MATLAB script.
The Excel file has the following data for the participants:
The folders under each subject (folders numbered 01 to 51) are as follows:
The folders under each subject are divided into three ZIP archives with 17 subjects each.
The script "OpenPose_to_keypoint_table.m" is a MATLAB script for extracting keypoint trajectories and confidences from JSON files into tables in MATLAB.
Publication in Data in Brief: https://doi.org/10.1016/j.dib.2024.110841
Contact: Jere Lavikainen, jere.lavikainen@uef.fi
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset of videos and comments related to the invasion of Ukraine, published on TikTok by a number of users over the year of 2022. It was compiled by Benjamin Steel, Sara Parker and Derek Ruths at the Network Dynamics Lab, McGill University. We created this dataset to facilitate the study of TikTok, and the nature of social interaction on the platform relevant to a major political event.
The dataset has been released here on Zenodo: https://doi.org/10.5281/zenodo.7534952 as well as on Github: https://github.com/networkdynamics/data-and-code/tree/master/ukraine_tiktok
To create the dataset, we identified hashtags and keywords explicitly related to the conflict to collect a core set of videos (or "TikToks"). We then compiled comments associated with these videos. All of the data captured is publicly available information and contains personally identifiable information. In total we collected approximately 16 thousand videos and 12 million comments from approximately 6 million users. There are approximately 1.9 comments on average per user captured, and 1.5 videos per user who posted a video. The author personally collected this data using the PyTok web-scraping library, developed by the author: https://github.com/networkdynamics/pytok.
Due to the scraping duration, this is just a sample of the publicly available discourse concerning the invasion of Ukraine on TikTok. Due to the fuzzy search functionality of TikTok, the dataset contains videos with a range of relatedness to the invasion.
We release here the unique video IDs of the dataset in a CSV format. The data was collected without the specific consent of the content creators, so we have released only the data required to re-create it, to allow users to delete content from TikTok and be removed from the dataset if they wish. Contained in this repository are scripts that will automatically pull the full dataset, which will take the form of JSON files organised into a folder for each video. The JSON files are the entirety of the data returned by the TikTok API. We include a script to parse the JSON files into CSV files with the most commonly used data. We plan to further expand this dataset as collection processes progress and the war continues. We will version the dataset to ensure reproducibility.
To build this dataset from the IDs here:
1. Run pip install -e . in the pytok directory
2. Run pip install pandas tqdm to install these libraries if not already installed
3. Run get_videos.py to get the video data
4. Run video_comments.py to get the comment data
5. Run user_tiktoks.py to get the video history of the users
6. Run hashtag_tiktoks.py or search_tiktoks.py to get more videos from other hashtags and search terms
7. Run load_json_to_csv.py to compile the JSON files into two CSV files, comments.csv and videos.csv
If you get an error about the wrong chrome version, use the command line argument get_videos.py --chrome-version YOUR_CHROME_VERSION
Please note pulling data from TikTok takes a while! We recommend leaving the scripts running on a server for a while for them to finish downloading everything. Feel free to play around with the delay constants to either speed up the process or avoid TikTok rate limiting.
Please do not hesitate to make an issue in this repo to get our help with this!
The videos.csv file will contain the following columns:
- video_id: Unique video ID
- createtime: UTC datetime of video creation time in YYYY-MM-DD HH:MM:SS format
- author_name: Unique author name
- author_id: Unique author ID
- desc: The full video description from the author
- hashtags: A list of hashtags used in the video description
- share_video_id: If the video is sharing another video, this is the video ID of that original video, else empty
- share_video_user_id: If the video is sharing another video, this is the user ID of the author of that video, else empty
- share_video_user_name: If the video is sharing another video, this is the user name of the author of that video, else empty
- share_type: If the video is sharing another video, this is the type of the share (stitch, duet, etc.)
- mentions: A list of users mentioned in the video description, if any
The comments.csv file will contain the following columns:
- comment_id: Unique comment ID
- createtime: UTC datetime of comment creation time in YYYY-MM-DD HH:MM:SS format
- author_name: Unique author name
- author_id: Unique author ID
- text: Text of the comment
- mentions: A list of users that are tagged in the comment
- video_id: The ID of the video the comment is on
- comment_language: The language of the comment, as predicted by the TikTok API
- reply_comment_id: If the comment is replying to another comment, this is the ID of that comment
The data can be compiled into a user interaction network to facilitate the study of interaction dynamics. There is code to help with that here: https://github.com/networkdynamics/polar-seeds. Additional scripts for further preprocessing of this data can be found there too.
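As one simple way to start on that, the sketch below (assuming pandas and networkx are installed, and using the column names listed above) links each comment author to the author of the video they commented on; the polar-seeds repository contains the project's own, more complete tooling.

```python
import pandas as pd
import networkx as nx  # assumed dependency

# Minimal sketch: build a directed commenter -> video-author network from the
# two CSV files produced by load_json_to_csv.py.
videos = pd.read_csv("videos.csv", dtype=str)
comments = pd.read_csv("comments.csv", dtype=str)

# Map each video to its author, then count comments per (commenter, video author) pair.
video_author = videos.drop_duplicates("video_id").set_index("video_id")["author_id"]
edges = (
    comments.assign(video_author_id=comments["video_id"].map(video_author))
    .dropna(subset=["video_author_id"])
    .groupby(["author_id", "video_author_id"])
    .size()
    .reset_index(name="n_comments")
)

graph = nx.DiGraph()
for row in edges.itertuples(index=False):
    graph.add_edge(row.author_id, row.video_author_id, weight=row.n_comments)

print(graph.number_of_nodes(), "users;", graph.number_of_edges(), "interaction edges")
```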
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The EU Horizon 2020 project DiverIMPACTS aims to promote the realisation of the full potential of crop diversification through rotation, multicropping and intercropping by demonstrating technical, economic and environmental benefits for farmers, along the value chain and for society at large, and by providing innovations that can remove existing barriers and lock-ins of practical diffusion.
DiverIMPACTS does so by combining findings from several participatory case studies with a set of field experiments across Europe, and translating these into strategies, recommendations and fit-for-purpose tools developed with and for farmers, advisors and other actors along the value chain.
To first gain a good overview of the current situation, i.e. the existing success stories and challenges of crop diversification in Europe, Work Package 1 (WP 1) identified and analysed factors of success and failure associated with a variety of crop diversification experiences (CDEs) outside those already represented in the consortium (see Deliverable 1.1). WP 1 thus makes sure that the rich experience with crop diversification initiatives across Europe (e.g. from other Horizon 2020 projects) is taken into account for developing strategies, recommendations and tools.
Deliverable 1.1 provided i) a list of key drivers (ex ante occurrence of market opportunities, environmental constraints, availability of enabling advisory services, land and workforce availability etc.) to be further considered in WP3 and WP5; and ii) a comprehensive and exhaustive description of the links between key factors and CDE types. This analysis is the basis for consolidating or updating the tentative typology of crop diversification situations used for setting up DiverIMPACTS (case studies), and was used for selecting experiences for more detailed investigation in T1.2. It also complements the identification and characterisation of lock-ins and barriers to crop diversification, and helps to overcome them. During the process of collecting, cleaning and analysing the survey data, a database of European diversification experiences was created.
Altogether, 128 valid responses from 15 European countries were received in T1.1 and included in the database – mainly from the project countries Belgium, France, Germany, Hungary, Italy, the Netherlands, Poland, Romania, Sweden, Switzerland and the UK, but also from Denmark, Finland, Luxembourg and Spain.
The database is stored in original and back-up form in a tabular '.csv' format that can be opened in Excel, on the SharePoint system of the project and now on Zenodo, under the restricted WP1 area. A further '.csv' file was created to store the metadata of the database. This file helps to give a better overview of the questions and sub-questions that were asked in the survey and the type of answer that could be provided to each of them (e.g. factor, Yes-No selection or character).
Using the metadata and the database, a selection of personal data fields (e.g. email addresses and names of people) was made that cannot be published with open access and needs special attention and data handling. These variables were removed from the original database, and a public version of the database was created that can be shared with third parties. Links to the data files will be shared hereafter.
Developing a Shiny application in R was chosen as the solution to visualize the public data and to make it possible for partners and all interested parties to interactively view the survey results. The Shiny application is shared as an R package and is freely accessible on the internet. Users can download the application and the public data in order to visualize them on their own computer. A remote solution, facilitating consultation of the data, will be installed at CRA-W, where the open data analysis module will be hosted. A short user guide and tutorial is part of this deliverable to help interested parties use the Shiny interface.
The chosen approach, linking R scripts, R packages and data files, will be useful in the future for continuously completing the database and updating the application (new graphs, new functions according to the demands of the main users). The release of the application will be announced using modern information and communication technologies: project website, newsletter, blogs, Twitter and other social networks.
The main deliverable (D1.2) which is public, is available here : 10.5281/zenodo.3966852
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Summary:
The files contained herein represent green roof footprints in NYC visible in 2016 high-resolution orthoimagery of NYC (described at https://github.com/CityOfNewYork/nyc-geo-metadata/blob/master/Metadata/Metadata_AerialImagery.md). Previously documented green roofs were aggregated in 2016 from multiple data sources, including the NYC Department of Parks and Recreation, the NYC Department of Environmental Protection, greenroofs.com, and greenhomenyc.org. Footprints of the green roof surfaces were manually digitized based on the 2016 imagery, and a sample of other roof types was digitized to create a set of training data for classification of the imagery. A Mahalanobis distance classifier was employed in Google Earth Engine, and results were manually corrected, removing non-green roofs that were classified and adjusting shapes/outlines of the classified green roofs to remove significant errors based on visual inspection with imagery across multiple time points. Ultimately, these initial data represent an estimate of where green roofs existed as of the imagery used, in 2016.
These data are associated with an existing GitHub Repository, https://github.com/tnc-ny-science/NYC_GreenRoofMapping, and as needed and appropriate pending future work, versioned updates will be released here.
Terms of Use:
The Nature Conservancy and co-authors of this work shall not be held liable for improper or incorrect use of the data described and/or contained herein. Any sale, distribution, loan, or offering for use of these digital data, in whole or in part, is prohibited without the approval of The Nature Conservancy and co-authors. The use of these data to produce other GIS products and services with the intent to sell for a profit is prohibited without the written consent of The Nature Conservancy and co-authors. All parties receiving these data must be informed of these restrictions. Authors of this work shall be acknowledged as data contributors to any reports or other products derived from these data.
Associated Files:
As of this release, the specific files included here are:
Column Information for the datasets:
Some, but not all fields were joined to the green roof footprint data based on building footprint and tax lot data; those datasets are embedded as hyperlinks below.
For GreenRoofData2016_20180917.csv there are two additional columns, representing the coordinates of centroids in geographic coordinates (Lat/Long, WGS84; EPSG 4326):
Acknowledgements:
This work was primarily supported through funding from the J.M. Kaplan Fund, awarded to the New York City Program of The Nature Conservancy, with additional support from the New York Community Trust, through New York City Audubon and the Green Roof Researchers Alliance.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset used in the NPOmix validation (described in the publication) that includes antiSMASH results from 1,040 PoDP paired samples. The input for antiSMASH was FASTA genomes from NCBI that are listed in the PoDP database (https://pairedomicsdata.bioinformatics.nl). We removed all files from the antiSMASH output folders except the GenBank (.gbk) files for the BGCs.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
### Training and test data for humanness evaluation
This data was collected in conjunction with and used for
training and testing for Parkinson / Wang et al 2024. The
data is organized as follows:
- Heavy chain training and multispecies test data (under the heavy chain folder)
- The consolidated cAb rep file contains training human sequences
- The test sample sequences folder contains fasta files with test sequences for each species
- Light chain training and multispecies test data (under the light chain folder)
- The consolidated cAb rep file contains training human sequences
- The test sample sequences folder contains fasta files with test sequences for each species
- Abybank data (under the abybank compiled data folder)
- This folder contains separate folders for heavy and light chain
- Each subfolder contains test data for a more diverse species set under fasta files for each species
- Humanization test data (under the humanization test data folder)
- The sequences in the parental.fa file were originally humanized as part of drug discovery programs
- The experimental.fa file contains the humanization results
- IMGT and ADA data (under the imgt test data folder)
- The imgt mab db fa and tsv files contain sequences and species assignments for IMGT mAb DB
- The thera ada fa file contains sequences evaluated in the clinic
- The Therapeutic ADA txt file contains anti drug antibody results for those antibodies
The data was retrieved from the following sources.
1. All heavy and light chain training data is from the cAb-Rep database from [Guo et al.](https://pubmed.ncbi.nlm.nih.gov/31649674/)
2. All testing data is from the Observed Antibody Space [(OAS) database](https://opig.stats.ox.ac.uk/webapps/oas/)
The training and test data shown are the data after filtering for quality. The test data was additionally randomly sampled to yield a set of 50,000 sequences for each species, then filtered to remove duplicates. The human test data was checked to ensure no overlap with the human training set.
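To sanity-check the per-species test files, a dependency-free Python sketch like the one below can tally sequences per fasta file; the folder path and glob pattern are assumptions based on the layout described above.

```python
from pathlib import Path

def count_fasta_sequences(path):
    """Count sequences in a fasta file by counting '>' header lines."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.startswith(">"))

# Placeholder path following the folder layout described above.
test_dir = Path("heavy_chain/test_sample_sequences")
for fasta in sorted(test_dir.glob("*.fa*")):
    print(fasta.stem, count_fasta_sequences(fasta))
```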
The IMGT, ADA and humanization test data was retrieved from Prihoda et al. and
the associated [Github repo](https://github.com/Merck/BioPhi-2021-publication).
See Parkinson et al. 2024 and the associated github repos for more details on how models other than
SAM / AntPack were evaluated on this data.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data from the project "Cellulose nitrate lacquer on silver objects" (RAÄ-2021-3131). Each zip file contains an Instrument report file which gives sample names and descriptions, instrument parameters, spectra figures and, in some reports, interpretation notes. Data from the following analytical techniques is included: 1. Scanning electron microscopy with energy dispersive x-ray spectroscopy (SEM-EDS); 2. Polarized light microscopy (PLM); 3. Fourier transform infrared spectroscopy (FTIR).
Data was collected as a continuation of (but after the publication of) the bachelor theses "A greener solution: Investigating the potential use of Green Solvents to remove cellulose nitrate lacquer from silver objects" by Evelina Borén (https://hdl.handle.net/2077/72648) and "The use of gels for the removal of cellulose nitrate lacquer on silver" by Katayon Miri (https://hdl.handle.net/2077/72650), both at the Department for Conservation, University of Gothenburg. The project is also part of a larger investigation on removing cellulose nitrate lacquer from silver objects, supported by the European Union funded IPERION HS project (2020-INFRAIA-2019-1).
Instrument reports including technical specifications of each instrument can also be found under DOI 10.5281/zenodo.10869595.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the simulated length of 71 Alpine glaciers over the last millennium using the version 1.0 of the Open Global Glacier Model (OGGM) forced by global climate models (GCM) simulation outputs. For a description of the experimental design, see the associated publication:
Goosse, H., Barriat, P.-Y., Dalaiden, Q., Klein, F., Marzeion, B., Maussion, F., Pelucchi, P. and Vlug, A.: Testing the consistency between changes in simulated climate and Alpine glacier length over the past millennium, Climate of the Past, 2018.
Each NetCDF file corresponds to OGGM driven by one climate model over the period 1000-2004 CE. The file names are based on the acronyms given in the Table 1 of the associated publication. The variables included in the NetCDF files are:
- g_length: Annual mean length of the glaciers, in meters
- ID_glacier: an identifier for each glacier, allowing one to link to the glacier names given in glacier_names.txt.
- time_year: the time in years CE
In order to remove the high-frequency variability associated with snow that may remain in summer at altitudes lower than the glacier front, a filter with a 5-year window has been applied to the OGGM outputs to obtain the results stored in g_length.
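As an illustration, the Python sketch below reads one of the per-model NetCDF files with the netCDF4 package (an assumed dependency; the file name is a placeholder) and reports each glacier's simulated length change between the first and last year; the (time, glacier) dimension order is an assumption, so transpose if needed.

```python
import numpy as np
from netCDF4 import Dataset  # assumed dependency

# Minimal sketch: read one per-model file and report the simulated length
# change of each glacier between the first and last year.
with Dataset("one_model_file.nc") as nc:  # placeholder; use one of the per-model NetCDF files
    length = np.asarray(nc.variables["g_length"][:])      # annual mean glacier length (m)
    glacier_id = np.asarray(nc.variables["ID_glacier"][:])
    years = np.asarray(nc.variables["time_year"][:])       # years CE

# Assumes g_length is dimensioned (time, glacier); transpose if it is the other way round.
change = length[-1, :] - length[0, :]
for gid, delta in zip(glacier_id, change):
    print(gid, f"{int(years[0])}-{int(years[-1])}", f"{delta:+.0f} m")
```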
Please contact Hugues Goosse for more information.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description:
This repository contains all source codes, necessary input and output data, and raw figures and tables for reproducing most figures and results published in the following study:
Hefei Zhang#, Xuhang Li#, Dongyuan Song, Onur Yukselen, Shivani Nanda, Alper Kucukural, Jingyi Jessica Li, Manuel Garber, Albertha J.M. Walhout. Worm Perturb-Seq: massively parallel whole-animal RNAi and RNA-seq. (2025) Nature Communications, in press (# equal contribution, *: corresponding author)
These include results related to method benchmarking and NHR data processing. Source data for figures that are not reproducible here have been provided with the publication.
Files:
This repository contains a few directories related to this publication. To deposit into Zenodo, we have individually zipped each subfolder of the root directory.
There are three directories included:
MetabolicLibrary
method_simulation
NHRLibrary
Note: the parameter optimization output is deposited in a separate Zenodo repository (10.5281/zenodo.15236858) for better organization and easier usage. If you would like to reproduce results related to the "MetabolicLibrary" folder, please download the omitted subfolder "MetabolicLibrary/2_DE/output/" from that separate repository and integrate it into this one.
Please be advised that this repository contains raw code and data that are not directly related to a figure in our paper. However, they may be useful for generating inputs used in the analysis of a figure, or for reproducing tables in our manuscript. The repository may also contain unpublished analyses and figures, which we kept for the record rather than deleting.
Usage:
Please refer to the table below to locate the specific file for reproducing a figure of interest (also available in METHOD_FIGURE_LOOKUP.xlsx under the root directory).
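Because most entries in the table point to a specific line range within an R script, one practical way to reproduce a panel is to evaluate only those lines (after loading any dependencies noted for that figure). The helper below is an illustrative sketch and is not part of the repository; the commented example corresponds to the Fig. 3a entry in the table.

```r
# Illustrative helper (not included in the repository): evaluate a specific
# line range of an R script, mirroring the "Lines" column of the table below.
run_lines <- function(script, from, to) {
  code <- readLines(script)[from:to]
  eval(parse(text = code), envir = globalenv())
}

# Example corresponding to the Fig. 3a entry:
# run_lines("MetabolicLibrary/2_DE/2_5_vectorlike_re_analysis_with_curated_44_NTP_conditions.R", 348, 463)
```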
Figure | File | Lines | Notes |
Fig. 2c | MetabolicLibrary/1_QC_dataCleaning/2_badSampleExclusion_manual.R | 65-235 | output figure is selected from figures/met10_lib6_badSamplePCA.pdf |
Fig. 2d | NHRLibrary/example_bams/* | - | load the bam files in IGV to make the figure |
Fig. 3a | MetabolicLibrary/2_DE/2_5_vectorlike_re_analysis_with_curated_44_NTP_conditions.R | 348-463 | |
Fig. 3b,c | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 106-376 | |
Fig. 3d | MetabolicLibrary/2_DE/SUPP_extra_figures_for_rewiring.R | 10-139 | |
Fig. 3e | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 379-522 | |
Fig. 3f,g | MetabolicLibrary/2_DE/SUPP_extra_figures_for_rewiring.R | 1-8 | |
Fig. 3h | method_simulation/Supp_systematic_mean_variation_example.R | 1-138 | |
Fig. 3i | method_simulation/3_benchmark_DE_result_w_rep.R and 1_benchmark_DE_result_std_NB_w_rep.R | 1-518 | the example figure was from figures/GLM_NB_deltaMiu_k08_10rep/seed_12345_empircial_null_fit_simulated_data_NB_GLM.pdf and figures/GLM_NB_10rep/seed_1_empircial_null_fit_simulated_data_NB_GLM.pdf; |
Fig. 3j | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 2104-2106 | load dependencies starting from line 1837 |
Fig. 3k | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 2053-2078 | load dependencies starting from line 1837; the GSEA was performed using SUPP_supplementary_figures_for_method_noiseness_GSEA.R |
Fig. 4a,b | method_simulation/3_benchmark_DE_result_w_rep.R | 1-523 | |
Fig. 4c | method_simulation/3_benchmark_WPS_parameters.R | 1-237 | |
Fig. 4d | MetabolicLibrary/2_DE/2_5_vectorlike_re_analysis_with_curated_44_NTP_conditions.R | 1-346 | output figure is selected from figures/0_DE_QA/vectorlike_analysis/2d_cutoff_titration_71NTP_rawDE_log2FoldChange_log2FoldChange_raw.pdf and 2d_cutoff_titration_71NTP_p0.005_log2FoldChange_raw.pdf. The "p0.005" in the second file name indicates the p_outlier cutoff used in the final parameter set for EmpirDE. |
Fig. 4e | MetabolicLibrary/2_DE/SUPP_plot_N_DE_repeated_RNAi.R | entire file | |
Fig. 4f | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 1020-1407 | |
Fig. 4g,h | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 529-851 | |
Fig. 5d | NHRLibrary/FinalAnalysis/2_DE_new/3_DE_network_analysis.R | 51-69; 94-112 | load dependencies starting from line 1 |
Fig. 5e | NHRLibrary/FinalAnalysis/2_DE_new/5_GSA_bubble_plot.R | 1-306 | |
Fig. 5f | NHRLibrary/FinalAnalysis/2_DE_new/5_GSA.R | 1-1492 | |
Fig. 6a | NHRLibrary/FinalAnalysis/2_DE_new/4_DE_similarity_analysis.R | 1-175 | |
Fig. 6b | NHRLibrary/FinalAnalysis/6_case_study.R | 506-534 | load dependencies starting from line 1 |
Fig. 6c | NHRLibrary/FinalAnalysis/6_case_study.R | 668-888 | load dependencies starting from line 1 |
Supplementary Fig. 1e | NHRLibrary/FinalAnalysis/5_revision/REVISION_gene_detection_sensitivity_benchmark.R | 1-143 | |
Supplementary Fig. 1f | MetabolicLibrary/1_QC_dataCleaning/2_badSampleExclusion_manual.R | 65-235 | output figure is selected from figures/met10_lib6_badSampleCorr.pdf |
Supplementary Fig. 1g | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 2191-2342 | |
Supplementary Fig. 2a | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 1409-1822 | |
Supplementary Fig. 2b | method_simulation/Supp_systematic_mean_variation_example.R | 1-138 | |
Supplementary Fig. 2c | method_simulation/Supp_systematic_mean_variation_example.R; 2_fit_logFC_distribution.R | 141-231; 1-201 | the middle panel was generated from Supp_systematic_mean_variation_example.R (lines 141-231) and the right panel from 2_fit_logFC_distribution.R (lines 1-201) |
Supplementary Fig. 2d | method_simulation/1_benchmark_DE_result_std_NB_w_rep.R | 1-518 | the example figure was from figures/GLM_NB_10rep/seed_1_empircial_null_fit_simulated_data_NB_GLM.pdf; |
Supplementary Fig. 2e | method_simulation/3_benchmark_DE_result_w_rep.R | 1-518 | the example figure was from figures/GLM_NB_deltaMiu_k08_10rep/seed_12345_empircial_null_fit_simulated_data_NB_GLM.pdf |
Supplementary Fig. 2f | method_simulation/3_benchmark_DE_result_w_rep.R | 528-573 | may need to run the code from line 1 to load other variables needed |
Supplementary Fig. 3a,b | method_simulation/1_benchmark_DE_result_std_NB_w_rep.R | 1-523 | |
Supplementary Fig. 3c | method_simulation/3_benchmark_WPS_parameters.R | 1-237 | |
Supplementary Fig. 3d | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 3190-3300 | |
Supplementary Fig. 3e | 2_3_power_error_tradeoff_optimization.R | entire file | the figure is figures/0_DE_QA/cleaning_strength_titration/benchmark_vectorlikes_titrate_cleaning_cutoff_DE_log2FoldChange_FDR0.2_FC1.pdf (produced at line 398); this script produces titration plots for a series of thresholds, from which we picked FDR0.2_FC1 for presentation in the paper |
Supplementary Fig. 3f | 2_3_power_error_tradeoff_optimization.R | entire file | the figure is figures/0_DE_QA/cleaning_strength_titration/benchmark_independent_repeats_titrate_cleaning_cutoff_FP_log2FoldChange_raw.pdf (produced at line 195). The top line plot is from figures/0_DE_QA/cleaning_strength_titration/benchmark_independent_repeats_titrate_cleaning_cutoff_FP_log2FoldChange_raw_summary_stat.pdf (produced at line 218). |
Supplementary Fig. 4 | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 853-898 | please run from line 529 to load dependencies |
Supplementary Fig. 5a,b | MetabolicLibrary/2_DE/SUPP_supplementary_figures_for_methods.R | 2590-3185 | |
Supplementary Fig. |