Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This spreadsheet presents the structured mapping of repository capabilities and characteristics conducted in Task 5.2 of the FIDELIS project. It includes metadata and annotations for over 80 resources (standards, best practices, and landscape analyses) aligned with the 30 Activities and Functions defined in the FIDELIS Transparent Trustworthy Repository Attributes Matrix (TTRAM). The mapping covers both domain-agnostic and domain-specific resources across five scientific communities and serves as a foundational dataset for the FIDELIS landscape analysis.
This is a subset of the Zenodo-ML Dinosaur Dataset [Github] that has been converted to small png files and organized in folders by language, so you can jump right into using machine learning methods that assume image input.
Included are .tar.gz files, each named after a file extension; when extracted, each produces a folder of the same name.
$ tree -L 1
.
├── c
├── cc
├── cpp
├── cs
├── css
├── csv
├── cxx
├── data
├── f90
├── go
├── html
├── java
├── js
├── json
├── m
├── map
├── md
├── txt
└── xml
We can peek inside one of the (somewhat smaller) folders of the set to see that the subfolders are Zenodo identifiers. A Zenodo identifier corresponds to a single GitHub repository, so the png files it contains are chunks of code of that extension type from a particular repository.
$ tree map -L 1
map
├── 1001104
├── 1001659
├── 1001793
├── 1008839
├── 1009700
├── 1033697
├── 1034342
...
├── 836482
├── 838329
├── 838961
├── 840877
├── 840881
├── 844050
├── 845960
├── 848163
├── 888395
├── 891478
└── 893858
154 directories, 0 files
Within each folder (a Zenodo id), the files are prefixed with the Zenodo id, followed by the index into the original image set array that is provided with the full dinosaur dataset archive.
$ tree m/891531/ -L 1
m/891531/
├── 891531_0.png
├── 891531_10.png
├── 891531_11.png
├── 891531_12.png
├── 891531_13.png
├── 891531_14.png
├── 891531_15.png
├── 891531_16.png
├── 891531_17.png
├── 891531_18.png
├── 891531_19.png
├── 891531_1.png
├── 891531_20.png
├── 891531_21.png
├── 891531_22.png
├── 891531_23.png
├── 891531_24.png
├── 891531_25.png
├── 891531_26.png
├── 891531_27.png
├── 891531_28.png
├── 891531_29.png
├── 891531_2.png
├── 891531_30.png
├── 891531_3.png
├── 891531_4.png
├── 891531_5.png
├── 891531_6.png
├── 891531_7.png
├── 891531_8.png
└── 891531_9.png
0 directories, 31 files
So what's the difference?
The difference is that these files are organized by extension type and provided as actual png images. The original data is provided as numpy arrays and is organized by Zenodo ID. Both are useful for different things; this particular version is cool because we can actually see what a code image looks like.
How many images total?
We can count the number of total images:
find "." -type f -name *.png | wc -l
3,026,993
The script to create the dataset is provided here. Essentially, we start with the top extensions as identified by this work (excluding actual image files) and then write each 80x80 image to an actual png image, organized by extension and then Zenodo id (as shown above).
I tested a few methods to write the single-channel 80x80 arrays as png images, and wound up liking cv2's imwrite function because it would save and then load back the exact same content.
import cv2
# image is an 80x80 single-channel uint8 array; imwrite round-trips it exactly.
cv2.imwrite(image_path, image)
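Putting those pieces together, here is a minimal sketch of the conversion loop (the helper name and arguments are hypothetical; the actual script linked above is authoritative):

import os
import cv2

# Hypothetical helper: writes each 80x80 single-channel uint8 array as
# <root>/<extension>/<zenodo_id>/<zenodo_id>_<index>.png, matching the
# layout shown above.
def write_repo_images(images, zenodo_id, extension, root="."):
    out_dir = os.path.join(root, extension, str(zenodo_id))
    os.makedirs(out_dir, exist_ok=True)
    for idx, image in enumerate(images):
        cv2.imwrite(os.path.join(out_dir, f"{zenodo_id}_{idx}.png"), image)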
Given the above, it's pretty easy to load an image! Here is an example using imageio (the current approach), followed by the older scipy route, which now emits a deprecation message.
image_path = '/tmp/data1/data/csv/1009185/1009185_0.png'
from imageio import imread
image = imread(image_path)
array([[116, 105, 109, ..., 32, 32, 32],
[ 48, 44, 48, ..., 32, 32, 32],
[ 48, 46, 49, ..., 32, 32, 32],
...,
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32]], dtype=uint8)
image.shape
(80,80)
# Deprecated
from scipy import misc
misc.imread(image_path)
Image([[116, 105, 109, ..., 32, 32, 32],
[ 48, 44, 48, ..., 32, 32, 32],
[ 48, 46, 49, ..., 32, 32, 32],
...,
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32]], dtype=uint8)
Remember that the values in the data are characters that have been converted to ordinal. Can you guess what 32 is?
ord(' ')
32
# And thus if you wanted to convert it back...
chr(32)
' '
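Since every pixel is just an ordinal character code, an image can be decoded back into (up to 80) lines of source text. A minimal sketch, reusing the example path from above:

from imageio import imread

# Map each pixel's ordinal value back to its character, row by row.
image = imread('/tmp/data1/data/csv/1009185/1009185_0.png')
lines = [''.join(chr(v) for v in row).rstrip() for row in image]
print('\n'.join(lines))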
Creative Commons Zero 1.0 (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
Zenodo.org is a popular data repository hosted by CERN. There are tens of thousands of datasets in the repository, but not all of them are used to the same extent.
This dataset includes names and links to the top 500 most downloaded datasets on Zenodo.
This dataset can be used to find datasets deposited on Zenodo that would benefit from additional exposure to the DS/ML community by uploading them to Kaggle.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This file collection is part of the ORD Landscape and Cost Analysis Project (DOI: 10.5281/zenodo.2643460), a study jointly commissioned by the SNSF and swissuniversities in 2018.
Please cite this data collection as:
von der Heyde, M. (2019). Data from the International Open Data Repository Survey. Retrieved from https://doi.org/10.5281/zenodo.2643493
Further information is given in the corresponding data paper:
von der Heyde, M. (2019). International Open Data Repository Survey: Description of collection, collected data, and analysis methods [Data paper]. Retrieved from https://doi.org/10.5281/zenodo.2643450
Contact
Swiss National Science Foundation (SNSF)
Open Research Data Group
E-mail: ord@snf.ch
swissuniversities
Program "Scientific Information"
Gabi Schneider
E-mail: isci@swissuniversities.ch
Table of Contents
Main Description
File Descriptions
Linked Files
Installation and Instructions
This is the Zenodo repository for the manuscript titled "A TCR β chain-directed antibody-fusion molecule that activates and expands subsets of T cells and promotes antitumor activity." The code included in the file titled marengo_code_for_paper_jan_2023.R was used to generate the figures from the single-cell RNA sequencing data.
The following libraries are required for script execution:
Seurat, scRepertoire, ggplot2, stringr, dplyr, ggridges, ggrepel, ComplexHeatmap
The code can be downloaded and opened in RStudio. The "marengo_code_for_paper_jan_2023.R" file contains all the code needed to reproduce the figures in the paper. The "Marengo_newID_March242023.rds" file is available at the following address: https://zenodo.org/badge/DOI/10.5281/zenodo.7566113.svg (Zenodo DOI: 10.5281/zenodo.7566113). The "all_res_deg_for_heat_updated_march2023.txt" file contains the unfiltered results from the DGE analysis, also used to create the heatmap with DGE and volcano plots. The "genes_for_heatmap_fig5F.xlsx" file contains the genes included in the heatmap in figure 5F.
This repository contains code for the analysis of a single-cell RNA-seq dataset. The dataset contains raw FASTQ files, as well as the aligned files that were deposited in GEO. The "Rdata" or "Rds" file was deposited in Zenodo. Provided below are descriptions of the linked datasets:
Gene Expression Omnibus (GEO) ID: GSE223311 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE223311)
Title: Gene expression profile at single cell level of CD4+ and CD8+ tumor infiltrating lymphocytes (TIL) originating from the EMT6 tumor model from mSTAR1302 treatment. Description: This submission contains the "matrix.mtx", "barcodes.tsv", and "genes.tsv" files for each replicate and condition, corresponding to the aligned files for single cell sequencing data. Submission type: Private. In order to gain access to the repository, you must use a reviewer token (https://www.ncbi.nlm.nih.gov/geo/info/reviewer.html).
Sequence read archive (SRA) repository ID: SRX19088718 and SRX19088719
Title: Gene expression profile at single cell level of CD4+ and CD8+ tumor infiltrating lymphocytes (TIL) originating from the EMT6 tumor model from mSTAR1302 treatment.
Description: This submission contains the raw sequencing or .fastq.gz files, which are tab delimited text files.
Submission type: Private. In order to gain access to the repository, you must use a reviewer token (https://www.ncbi.nlm.nih.gov/geo/info/reviewer.html).
Zenodo DOI: 10.5281/zenodo.7566113 (https://zenodo.org/record/7566113#.ZCcmvC2cbrJ)
Title: A TCR β chain-directed antibody-fusion molecule that activates and expands subsets of T cells and promotes antitumor activity. Description: This submission contains the "Rdata" or ".Rds" file, which is an R object file. This file is necessary to use the code. Submission type: Restricted Access. In order to gain access to the repository, you must contact the author.
The code included in this submission requires several essential packages, as listed above. Please follow these instructions for installation:
Ensure you have R version 4.1.2 or higher for compatibility.
Although it is not essential, you can use RStudio (version 2022.12.0+353) for accessing and executing the code.
marengo_code_for_paper_jan_2023.R
Install_Packages.R
Marengo_newID_March242023.rds
genes_for_heatmap_fig5F.xlsx
all_res_deg_for_heat_updated_march2023.txt
You can use the following code to set the working directory in R:
setwd("path/to/downloaded/files")  # replace with the folder containing the downloaded files
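Before running the script, the required packages need to be installed. Install_Packages.R is the authoritative installer; the sketch below is only a minimal alternative, and the CRAN/Bioconductor split is an assumption:

# Minimal sketch; see Install_Packages.R for the actual installation code.
install.packages(c("Seurat", "ggplot2", "stringr", "dplyr", "ggridges", "ggrepel"))
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("scRepertoire", "ComplexHeatmap"))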
Overview
Precision Liming Soil Datasets (LimeSoDa) is a collection of 31 datasets from a field- and farm-scale soil mapping context. These datasets are 'ready-to-use' for modeling purposes, as they include target soil properties and features in a tidy tabular format. Three target soil properties are present in every dataset: (1) soil organic matter (SOM) or soil organic carbon (SOC), (2) pH, and (3) clay content, while the features for modeling are dataset-specific. The primary goal of LimeSoDa is to enable more reliable benchmarking of machine learning methods in digital soil mapping and pedometrics. All the associated materials and data from LimeSoDa can be downloaded in this data repository. However, for a more in-depth analysis, we refer to the published paper 'LimeSoDa: A Dataset Collection for Benchmarking of Machine Learning Regressors in Digital Soil Mapping' by Schmidinger et al. (2025). You may also use our R and Python packages, likewise called LimeSoDa.
Citation
Upon usage of datasets from LimeSoDa, please cite our associated paper: Schmidinger, J., Vogel, S., Barkov, V., Pham, A.-D., Gebbers, R., Tavakoli, H., Correa, J., Tavares, T.R., Filippi, P., Jones, E. J., Lukas, V., Boenecke, E., Ruehlmann, J., Schroeter, I., Kramer, E., Paetzold, S., Kodaira, M., Wadoux, A.M.J.-C., Bragazza, L., Metzger, K., Huang, J., Valente, D.S.M., Safanelli, J.L., Bottega, E.L., Dalmolin, R.S.D., Farkas, C., Steiger, A., Horst, T. Z., Ramirez-Lopez, L., Scholten, T., Stumpf, F., Rosso, P., Costa, M.M., Zandonadi, R.S., Wetterlind, J. & Atzmueller, M. (2025). LimeSoDa: A Dataset Collection for Benchmarking of Machine Learning Regressors in Digital Soil Mapping.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Files and datasets in Parquet format related to molecular dynamics, retrieved from the Zenodo, Figshare, and OSF data repositories. The file 'data_model_parquet.md' is a codebook that contains the data models for the Parquet files.
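As a minimal sketch (the file name is hypothetical, and pandas needs pyarrow or fastparquet installed), a Parquet file from the collection can be loaded with:

import pandas as pd

# Hypothetical file name; 'data_model_parquet.md' documents the actual columns.
df = pd.read_parquet("mdverse_datasets.parquet")
print(df.head())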
Delete (repository outdated).
New repository: https://zenodo.org/record/3713179#.XoxaDWDgq71
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Ab-initio Data Repository for Physics-Informed Data-Driven Model
This repository stores precise Density Functional Theory (DFT) calculations and Vienna Ab initio Simulation Package (VASP) codes to provide a comprehensive dataset for physics-informed models. It specifically considers the steelmaking process by focusing on different types of non-metallic inclusions (NMIs) within the steel melt.
Data Sets Included:
The datasets include the following types of NMIs, with detailed characteristics, in the size range of 1-10 µm:
Purpose and Application:
This repository is designed to support advanced physics-informed modeling approaches, such as those using machine learning algorithms to predict clogging and inclusion behaviors in steelmaking processes.
Zenodo is an open repository that allows researchers to deposit research papers, data sets, research software, reports, and any other research-related digital artefacts.
Other license: https://choosealicense.com/licenses/other/
Datasets utilized to train NMIRacle
This dataset repository contains derived data used for the development and evaluation of the NMIRacle framework. The data is not original; it is constructed from the following publicly available Zenodo datasets:
Multimodal spectroscopic dataset (License: CDLA-Sharing 1.0): https://zenodo.org/records/14770232
NMR2Struct training data (License: CC-BY-4.0): https://zenodo.org/records/13892026
Please refer to the original Zenodo repositories for the… See the full description on the dataset page: https://huggingface.co/datasets/fedeotto/nmiracle-datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This repository contains the dataset referenced in the Scientific Data journal article titled "Aerial Imagery-Derived Dataset of Manufactured Housing Communities in the North Central United States" by Armin Yeganeh, Maria Marshall, and Noah Durst. The associated code scripts are available at https://github.com/arminyeganeh/mhc
MIT License: https://opensource.org/licenses/MIT
WiFi measurements database for UJI's library and supporting material.
The measurements were collected by one person using one Android smartphone over 15 months on two floors of the library building of Universitat Jaume I, in Spain. The database contains 63,504 WiFi fingerprints, which are organized into datasets. Each dataset is the result of a collection campaign.
The supporting material includes Matlab® scripts to load and filter the desired data, and provides examples of possible studies that the database may enable. The supporting material also includes the bookshelves' local coordinates.
Citation request:
G.M. Mendoza-Silva, P. Richter, J. Torres-Sospedra, E.S. Lohan, J. Huerta, "Long-Term Wi-Fi fingerprinting dataset and supporting material", Zenodo repository, DOI 10.5281/zenodo.1066041.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This data continues the development of the NPEGC Trinity de novo metatranscriptome assemblies from the protein data repository of The North Pacific Eukaryotic Gene Catalog. The nucleotide sequences corresponding to the NPEGC cluster representatives are collected together in these repository files:
NPac.G1PA.bf100.id99.nt.fasta.gz
NPac.G2PA.bf100.id99.nt.fasta.gz
NPac.G3PA.bf100.id99.nt.fasta.gz
NPac.G3PA_diel.bf100.id99.nt.fasta.gz
NPac.D1PA.bf100.id99.nt.fasta.gz
A full description of this data is published in Scientific Data, available here: The North Pacific Eukaryotic Gene Catalog of metatranscriptome assemblies and annotations. Please cite this publication if your research uses this data:
Groussman, R. D., Coesel, S. N., Durham, B. P., Schatz, M. J., & Armbrust, E. V. (2024). The North Pacific Eukaryotic Gene Catalog of metatranscriptome assemblies and annotations. Scientific Data, 11(1), 1161.
These nucleotide sequences have been sourced from the Zenodo repository for raw assemblies: The North Pacific Eukaryotic Gene Catalog: Raw assemblies from Gradients 1, 2 and 3
Key processing steps are sampled below with links to the detailed code on the main github code repository: https://github.com/armbrustlab/NPac_euk_gene_catalog
The code used to build the kallisto indices and map the short reads against them with kallisto is online in the code repository here: NPEGC.nt_kallisto_counts.sh
There are two main steps:
1. Generate the kallisto index on the sets of clustered nucleotide metatranscripts
2. Map the short reads from environmental samples back to the assembly index
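A minimal sketch of those two steps (the index and sample file names are hypothetical; NPEGC.nt_kallisto_counts.sh has the actual commands):

$ kallisto index -i NPac.G1PA.idx NPac.G1PA.bf100.id99.nt.fasta.gz
$ kallisto quant -i NPac.G1PA.idx -o kallisto_out/sample1 sample1_R1.fastq.gz sample1_R2.fastq.gz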
As generated above, kallisto produces separate results files for each of the sample files. Even after compression, the total size of the tarballed kallisto output directories is prohibitively large (>50 GB). We use the code in a template R script to join together the 'est_count' estimated count values for the tens of millions of protein sequences in each project metatranscriptome, along with sequence length.
The code in this template script was used for each project: aggregate_kallisto_counts.R
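For illustration only, here is a rough Python equivalent of what that R template does, under the assumption that each kallisto output directory holds a standard abundance.tsv (paths are hypothetical; aggregate_kallisto_counts.R is authoritative):

import glob
import os
import pandas as pd

# Each kallisto output directory contains an abundance.tsv with
# 'target_id', 'length', and 'est_counts' columns; join the per-sample
# estimated counts into one wide table keyed by sequence id and length.
counts = None
for path in sorted(glob.glob("kallisto_out/*/abundance.tsv")):
    sample = os.path.basename(os.path.dirname(path))
    df = pd.read_csv(path, sep="\t", usecols=["target_id", "length", "est_counts"])
    df = df.rename(columns={"est_counts": sample})
    counts = df if counts is None else counts.merge(df, on=["target_id", "length"])
counts.to_csv("raw.est_counts.csv", index=False)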
The output count files for each project are Gzip-compressed and uploaded to the NPEGC nucleotide data repository here:
G1PA.raw.est_counts.csv.gz
G2PA.raw.est_counts.csv.gz
G3PA.raw.est_counts.csv.gz
G3PA_diel.raw.est_counts.csv.gz
D1PA.raw.est_counts.csv.gz
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
The format is JSON: a list of lists. Each inner list is a group of very similar repositories (weighted Jaccard similarity threshold 0.8~0.9).
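For reference, a minimal sketch of weighted Jaccard similarity and of loading the groups (the file name is hypothetical; the grouping pipeline itself is not part of this dataset):

import json

# Weighted Jaccard similarity of two nonnegative vectors: the sum of
# element-wise minima divided by the sum of element-wise maxima.
def weighted_jaccard(x, y):
    return sum(min(a, b) for a, b in zip(x, y)) / sum(max(a, b) for a, b in zip(x, y))

# Each inner list is one group of near-duplicate repositories.
with open("similar_repo_groups.json") as f:
    groups = json.load(f)
print(groups[0])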
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
Within the ESA-funded WorldCereal project we have built an open, harmonized reference data repository at global extent for model training or product validation in support of land cover and crop type mapping. Data from 2017 onwards were collected from many different sources and then harmonized, annotated, and evaluated. These steps are explained in the harmonization protocol (10.5281/zenodo.7584463). This protocol also clarifies the naming convention of the shape files and the WorldCereal attributes (LC, CT, IRR, valtime and sampleID) that were added to the original data sets.
This publication includes those harmonized data sets of which the original data set was published under the CC-BY-SA license or a license similar to CC-BY-SA. See document "_In-situ-data-World-Cereal - license - CC-BY-SA.pdf" for an overview of the original data sets.
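As a minimal sketch (the shapefile name is hypothetical), the harmonized WorldCereal attributes can be inspected with geopandas:

import geopandas as gpd

# Hypothetical file name; each harmonized shapefile carries the attributes
# described in the harmonization protocol.
gdf = gpd.read_file("worldcereal_refdata.shp")
print(gdf[["LC", "CT", "IRR", "valtime", "sampleID"]].head())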
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
This dataset contains the parameterization of a no-policy baseline scenario of the global 11-regional MESSAGEix-GLOBIOM integrated assessment model. Regions, time periods, commodities, technologies and relations included in this model are described in a separate repository. The dataset relies on the MESSAGEix modeling framework (Huppmann et al. 2019) and can be imported into MESSAGEix via the read_excel() functionality, for which a tutorial is available, or via snapshot.load() as described here. After the import the scenario can be solved and modified to create new scenarios. Note that the published scenario as included in the ENGAGE global scenarios dataset has been run with a release candidate of version 3.4.0 of MESSAGEix.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This Zenodo repository contains the data and code for the article entitled "A natural disaster exacerbates and redistributes disease risk across free-ranging macaques by altering social structure".
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
In this work, based on the GitHub Archive project and repository-mining tools, we process all available data into a concise, structured format to generate a dataset of GitHub developer behavior and repository evolution. Together with the self-configurable interactive analysis tool we provide, it gives a macroscopic view of the evolution of the open-source ecosystem.