Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Correspondence to: Joost de Vries, joost.devries@bristol.ac.uk
This repository contains spatial transcriptomics data related to the Wu et al. 2021 study "A single-cell and spatially resolved atlas of human breast cancers". It includes processed count matrices, brightfield H&E images (plain and annotated) and metadata (containing clinical information and spot-level pathological details) for six primary breast cancers profiled using the Visium assay (10x Genomics). If you use this dataset in your research, please consider citing the above study.
The contents of the files are as follows:
raw_count_matrices.tar.gz - spaceranger processed raw count matrices.
spatial.tar.gz - spaceranger processed spatial files (images, scalefactors, aligned fiducials, position lists)
filtered_count_matrices.tar.gz - filtered count matrices.
metadata.tar.gz - metadata for tissues and spots of filtered count matrices, including clinical subtype and pathological annotation of each spot.
images.pdf - pdf detailing the H&E and annotation images.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset builds on the unprocessed NPEGC Trinity de novo metatranscriptome assemblies, which were uploaded to the following Zenodo repository for raw assemblies: The North Pacific Eukaryotic Gene Catalog: Raw assemblies from Gradients 1, 2 and 3.
A full description of this data is published in Scientific Data, available here: The North Pacific Eukaryotic Gene Catalog of metatranscriptome assemblies and annotations. Please cite this publication if your research uses this data:
Groussman, R. D., Coesel, S. N., Durham, B. P., Schatz, M. J., & Armbrust, E. V. (2024). The North Pacific Eukaryotic Gene Catalog of metatranscriptome assemblies and annotations. Scientific Data, 11(1), 1161.
Excerpts of key processing steps are sampled below with links to the detailed code on the main github code repository: https://github.com/armbrustlab/NPac_euk_gene_catalog
Processing and annotation of protein-level NPEGC metatranscripts is done in 6 primary steps:
1. Six-frame translation into protein sequences
2. Frame-selection of protein-coding translation frames
3. Clustering of protein sequences at 99% sequence identity
4. Taxonomic annotation against MarFERReT v1.1 + MARMICRODB v1.0 multi-kingdom marine reference protein sequence library with DIAMOND
5. Functional annotation against Pfam 35.0 protein family HMM profiles using HMMER3
6. Functional annotation against KOfam HMM profiles (KEGG release 104.0) using KofamScan v1.3.0

# Define local NPEGC base directory here:
NPEGC_DIR="/mnt/nfs/projects/armbrust-metat"
# Raw assemblies are located in the /assemblies/raw/ directory
# for each of the metatranscriptome projects
PROJECT_LIST="D1PA G1PA G2PA G3PA G3PA_diel"
# raw Trinity assemblies:
RAW_ASSEMBLY_DIR="${NPEGC_DIR}/${PROJECT}/assemblies/raw"
Translation
We began processing the raw metatranscriptome assemblies with six-frame translation of the nucleotide transcripts into three forward and three reverse reading-frame translations, using the transeq function in the EMBOSS package. We add a cruise and sample prefix to the sequence IDs to ensure unique identification downstream (e.g., `>TRINITY_DN2064353_c0_g1_i1_1` becomes `>G1PA_S09C1_3um_TRINITY_DN2064353_c0_g1_i1_1` for the S09C1_3um sample in the G1PA assemblies). See NPEGC.6tr_frame_selection_clustering.sh for the full code description.
Example of six-frame translation using transeq:
transeq -auto -sformat pearson -frame 6 -sequence 6tr/${PREFIX}.Trinity.fasta -outseq 6tr/${PREFIX}.Trinity.6tr.fasta
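The header-prefixing step described above can be sketched in a few lines of Python; this is an illustrative sketch rather than the pipeline's actual code, and the input/output file names and prefix value are hypothetical examples:

# Illustrative sketch (not the pipeline's code): prepend a cruise/sample
# prefix to every FASTA header so transcript IDs remain unique downstream.
prefix = "G1PA_S09C1_3um_"  # hypothetical cruise + sample prefix
with open("Trinity.6tr.fasta") as fin, open("Trinity.6tr.prefixed.fasta", "w") as fout:
    for line in fin:
        if line.startswith(">"):
            # ">TRINITY_DN2064353_c0_g1_i1_1" -> ">G1PA_S09C1_3um_TRINITY_DN2064353_c0_g1_i1_1"
            fout.write(">" + prefix + line[1:])
        else:
            fout.write(line)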
Frame selection
We use a custom frame-selection Python script, keep_longest_frame.py, to determine the longest coding length among the translated reading frames and retain that sequence (or multiple sequences in case of a tie) for downstream analyses. See NPEGC.6tr_frame_selection_clustering.sh for the full code description.
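As an illustration of the idea (this is not keep_longest_frame.py itself), a minimal Python sketch that keeps the frame(s) whose longest stop-free stretch is maximal could look like this:

# Hypothetical sketch of frame selection: transeq marks stop codons with '*',
# so the "coding length" of a frame is taken here as its longest stop-free run.
def longest_coding_length(protein_seq):
    return max(len(chunk) for chunk in protein_seq.split("*"))

def select_frames(frames):
    # frames: dict mapping frame ID -> translated protein sequence
    best = max(longest_coding_length(seq) for seq in frames.values())
    return {name: seq for name, seq in frames.items()
            if longest_coding_length(seq) == best}

# Two hypothetical frames of the same transcript; frame _2 is retained
# because its longest stop-free stretch (7 residues) is the maximum.
frames = {"TRINITY_DN1_c0_g1_i1_1": "MKT*LLVAA",
          "TRINITY_DN1_c0_g1_i1_2": "MKTPLLV*A"}
print(select_frames(frames))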
Clustering by sequence identity
To reduce sequence redundancy and near-identical sequences, we cluster protein sequences at the 99% sequence identity level and retain the sequence cluster representative in a reduced-size FASTA output file. See NPEGC.6tr_frame_selection_clustering.sh for full code description of linclust/mmseqs clustering.
Sample of the linclust clustering script (core mmseqs function):
function NPEGC_linclust {
# make an index of the fasta file:
$MMSEQS_DIR/mmseqs createdb $FASTA_PATH/$FASTA_FILE NPac.$STUDY.bf100.db
# cluster sequences at $MIN_SEQ_ID
$MMSEQS_DIR/mmseqs linclust NPac.${STUDY}.bf100.db NPac.${STUDY}.clusters.db NPac_tmp --min-seq-id ${MIN_SEQ_ID}
# retrieve cluster representatives:
$MMSEQS_DIR/mmseqs result2repseq NPac.${STUDY}.bf100.db NPac.${STUDY}.clusters.db NPac.${STUDY}.clusters.rep
# generate flat FASTA output with cluster reps
$MMSEQS_DIR/mmseqs result2flat NPac.${STUDY}.bf100.db NPac.${STUDY}.bf100.db NPac.${STUDY}.clusters.rep NPac.${STUDY}.bf100.id99.fasta --use-fasta-header
}
Corresponding files uploaded to this repository: Gzip-compressed FASTA files after translation, frame-selection, and clustering at 99% sequence identity (.bf100.id99.aa.fasta.gz)
NPac.G1PA.bf100.id99.aa.fasta.gz
NPac.G2PA.bf100.id99.aa.fasta.gz
NPac.G3PA.bf100.id99.aa.fasta.gz
NPac.G3PA_diel.bf100.id99.aa.fasta.gz
NPac.D1PA.bf100.id99.aa.fasta.gz
MarFERReT + MARMICRODB taxonomic annotation with DIAMOND
Taxonomy was inferred for the NPEGC metatranscripts with the DIAMOND fast read alignment software against the MarFERReT v1.1 + MARMICRODB v1.0 multi-kingdom marine reference protein sequence library, a combined database of the MarFERReT v1.1 marine microbial eukaryote sequence library and the MARMICRODB v1.0 prokaryote-focused marine genome database. See NPEGC.diamond_taxonomy.log.sh for a full description of the DIAMOND annotation.
Excerpt of the core DIAMOND function:
function NPEGC_diamond {
# FASTA filename for $STUDY
FASTER_FASTA="NPac.${STUDY}.bf100.id99.aa.fasta"
# Output filename for LCA results in lca.tab file:
LCA_TAB="NPac.${STUDY}.MarFERReT_v1.1_MMDB.lca.tab"
echo "Beginning ${STUDY}"
singularity exec --no-home --bind ${DATA_DIR} \
"${CONTAINER_DIR}/diamond.sif" diamond blastp \
-c 4 --threads $N_THREADS \
--db $MFT_MMDB_DMND_DB -e $EVALUE --top 10 -f 102 \
--memory-limit 110 \
--query ${FASTER_FASTA} -o ${LCA_TAB} >> "${STUDY}.MarFERReT_v1.1_MMDB.log" 2>&1
}
Corresponding files uploaded to this repository: Gzip-compressed DIAMOND lowest common ancestor predictions with NCBI Taxonomy against the combined MarFERReT + MARMICRODB taxonomic library (*.MarFERReT_v1.1_MMDB.lca.tab.gz)
NPac.G1PA.MarFERReT_v1.1_MMDB.lca.tab.gz
NPac.G2PA.MarFERReT_v1.1_MMDB.lca.tab.gz
NPac.G3PA.MarFERReT_v1.1_MMDB.lca.tab.gz
NPac.G3PA_diel.MarFERReT_v1.1_MMDB.lca.tab.gz
NPac.D1PA.MarFERReT_v1.1_MMDB.lca.tab.gz
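The lca.tab files can be loaded directly; the sketch below assumes DIAMOND's taxonomic output layout for `-f 102` (query ID, NCBI taxid, e-value of the best alignment) and uses one of the uploaded files as an example path:

import pandas as pd

# Read one gzip-compressed LCA table (tab-separated, no header line).
lca = pd.read_csv(
    "NPac.G1PA.MarFERReT_v1.1_MMDB.lca.tab.gz",
    sep="\t",
    names=["sequence_id", "ncbi_taxid", "evalue"],
)
# A taxid of 0 means no lowest common ancestor could be assigned.
print((lca["ncbi_taxid"] == 0).mean(), "fraction unclassified")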
Pfam 35.0 functional annotation using HMMER3
Clustered protein sequences were annotated against the Pfam 35.0 collection of 19,179 protein family hidden Markov models (HMMs) using HMMER 3.3. Pfam annotation code is documented here: NPEGC.hmmer_function.sh
Excerpt of the core hmmsearch function:
function NPEGC_hmmer {
# Define input FASTA
INPUT_FASTA="NPac.${STUDY}.bf100.id99.aa.fasta"
# hmmsearch call:
hmmsearch --cut_tc --cpu $NCORES --domtblout $ANNOTATION_DIR/${STUDY}.Pfam35.domtblout.tab $HMM_PROFILE ${INPUT_FASTA}
# compress output file:
gzip $ANNOTATION_DIR/${STUDY}.Pfam35.domtblout.tab
}
Corresponding files uploaded to this repository: Gzip-compressed hmmsearch domain table files for Pfam35 queries (*.Pfam35.domtblout.tab.gz)
G1PA.Pfam35.domtblout.tab.gz
G2PA.Pfam35.domtblout.tab.gz
G3PA.Pfam35.domtblout.tab.gz
G3PA_diel.Pfam35.domtblout.tab.gz
D1PA.Pfam35.domtblout.tab.gz
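A short Python sketch for reading these tables is given below; it assumes the standard HMMER --domtblout layout, in which comment lines start with '#', fields are whitespace-separated, and columns 1, 4 and 7 hold the protein ID, Pfam model name and full-sequence e-value:

import gzip

hits = []
with gzip.open("G1PA.Pfam35.domtblout.tab.gz", "rt") as fh:  # example file from the list above
    for line in fh:
        if line.startswith("#"):
            continue  # skip header and footer comment lines
        fields = line.split()
        # (protein ID, Pfam model name, full-sequence e-value)
        hits.append((fields[0], fields[3], float(fields[6])))
print(hits[:3])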
KEGG functional annotation using KofamScan v1.3.0
Clustered protein sequences were annotated against the KEGG collection (release 104.0) of 20,819 protein family Hidden Markov Models (HMMs) using KofamScan and KofamKOALA. Kofam annotation code is documented here: NPEGC.kofamscan_function.sh
Excerpt of core NPEGC_kofam function:
# Core function to perform KofamScan annotation
function NPEGC_kofam {
# Define input FASTA
local INPUT_FASTA="NPac.${STUDY}.bf100.id99.aa.fasta"
# KofamScan call
${KOFAM_DIR}/kofam_scan-1.3.0/exec_annotation -f detail-tsv -E ${EVALUE} -o ${ANNOTATION_DIR}/NPac.${STUDY}.bf100.id99.aa.tsv ${FASTA_DIR}/${INPUT_FASTA}
# Keep best hit
MIT License: https://opensource.org/licenses/MIT
This is a dataset of approximately 200,000 vacuum field stellarators along with the electromagnetic coils that generate them. The devices in the database are available in two formats (SIMSOPT and VMEC) useful to the stellarator community.
Typo in uploads v1, v2: the 'total_coil_length' and 'coil_length_per_hp' keys in the dataframe should read 'total_coil_length_threshold' and 'coil_length_threshold_per_hp', i.e., the maximum allowable coil length and the maximum allowable coil length per half period, respectively.
v2 (January 29, 2024): additional QA devices added.
v1 (October 29, 2023): initial upload.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
MIT License: https://opensource.org/licenses/MIT
WiFi measurements database for UJI's library and supporting material.
The measurements were collected by one person using one Android smartphone over 15 months on two floors of the library building at Universitat Jaume I, Spain. It contains 63,504 WiFi fingerprints, which are organized into datasets; each dataset is the result of a collection campaign.
The supporting material includes Matlab® scripts to load and filter the desired data, and provides examples of possible studies that the database may enable. The supporting material also includes the bookshelves' local coordinates.
Citation request:
G.M. Mendoza-Silva, P. Richter, J. Torres-Sospedra, E.S. Lohan, J. Huerta, "Long-Term Wi-Fi fingerprinting dataset and supporting material", Zenodo repository, DOI 10.5281/zenodo.1066041.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
# Replication code and data for: Tracking green space along streets of world cities
Falchetta, G., & Hammad, A. T. (2025). Tracking green space along streets of world cities. Environmental Research: Infrastructure and Sustainability. https://doi.org/10.1088/2634-4505/add9c4
To replicate the analysis, the results, and the figures of the paper:
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
Code and Raw Data for Obesity Particulate Treatment study
This repository contains raw data for studies done by the Bridges Lab and our collaborators on the metabolic effects of in utero exposure to particulates containing environmentally persistent free radicals on obese adult mice. The data for the manuscripts detailed below are included; the Tag column indicates the state of the dataset at the indicated time:
Publication: E. J. Stephenson, A. Ragauskas, S. Jaligama, J. R. Redd, J. Parvathareddy, M. J. Peloquin, J. Saravia, J. Han, S. A. Cormier, D. Bridges, Exposure to environmentally persistent free radicals during gestation lowers energy expenditure and impairs skeletal muscle mitochondrial function in adult mice (2016). American Journal of Physiology - Endocrinology and Metabolism. doi:10.1152/ajpendo.00521.2015. Dataset Tag: ObesityParticulateTreatment-v1.0.0
Licence
This ObesityParticulateTreatment data is made available under the Open Data Commons Attribution License: http://opendatacommons.org/licenses/by/1.0.
Data Files
Data files are located in the data directory. The raw data in this analysis are located in data/raw and comprise the following files:
Script Files
Script files are saved in the scripts folder and were run in the following order:
Manuscript
The manuscript files, including the manuscript, the figures, tables and supplementary data are in the manuscript directory.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Processed count matrices, brightfield H&E images (plain and annotated), spot selection files and metadata associated with the manuscript "Spatial Deconvolution of HER2-positive Breast Tumors Reveals Novel Intercellular Relationships".
The contents of the files are as follows:
count-matrices.zip - processed count matrices formatted as [n_spots]x[n_genes] and named as [PATIENT][SECTION].tsv.gz.
images.zip - contains two folders, HE and annotation; the former holds the H&E image for each section, named [PATIENT][SECTION].jpg, and the latter holds the images annotated by the pathologist, named by patient (only one section from each patient was annotated).
spot-selection.zip - contains .tsv files that map array coordinates to pixel coordinates, allowing the spots and their associated expression values to be visualized jointly (see the loading sketch after this file list). Files are named [PATIENT][SECTION]_selection.tsv.gz.
meta.zip - for all annotated sections, these files are similar to the spot-selection files but also include the label of each spot (e.g., breast glands, connective tissue, etc.).
All files are password protected (encrypted); use the password zNLXkYk3Q9znUseS to decrypt the data.
code.zip - a clone of the GitHub repository (created 2021-05-12).
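As a minimal loading sketch (the column names in the spot-selection file are assumptions, not taken from the dataset), the count matrix and spot-selection table for one section might be read like this:

import pandas as pd

# [n_spots] x [n_genes] count matrix for a hypothetical patient/section "A1"
counts = pd.read_csv("A1.tsv.gz", sep="\t", index_col=0)
# spot-selection table mapping array coordinates to pixel coordinates
spots = pd.read_csv("A1_selection.tsv.gz", sep="\t")
print(counts.shape, spots.columns.tolist())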
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
borderscape_webgis_data_v6.0.zip
README.md: a formatted text document (Markdown syntax) describing the contents of this repository.
sites.geojson: a GeoJSON file with information on each archaeological site included in the webGIS.
borderscape_sites.csv: the list of archaeological sites and their attributes from which the sites.geojson file was built for the webGIS, in the open CSV (comma separated values) format.
borderscape_archaeological_sites.xlsx: the list of archaeological sites and their attributes. It contains the same information as borderscape_sites.csv as an Excel Workbook (Office Open XML).
flooding_nile.geojson: a GeoJSON polygon file with information on Nile flood levels at 86m and 94.5m ASL.
borderscape_bibliography.bib: a bibliography with all of the sources abbreviated in the sites.csv file.
merged_coronas_freegr.tif: a GeoTIFF of the georeferenced CORONA imagery showing the Lower Nubian landscape prior to the construction of the Aswan High Dam.
borderscape_data.zip contains the following ZIP archives with the spatial (shapefile) data:
borderscape_archaeological_sites.zip: a ZIP archive of a shapefile showing all of the archaeological sites and their attributes used in the webGIS.
sites_phase1.zip: a ZIP archive of a shapefile showing archaeological sites used in the webGIS from Phase 1.
sites_phase2.zip: a ZIP archive of a shapefile showing archaeological sites used in the webGIS from Phase 2.
sites_phase3.zip: a ZIP archive of a shapefile showing archaeological sites used in the webGIS from Phase 3.
sites_phase4.zip: a ZIP archive of a shapefile showing archaeological sites used in the webGIS from Phase 4.
sites_phase5.zip: a ZIP archive of a shapefile showing archaeological sites used in the webGIS from Phase 5.
sites_phase6.zip: a ZIP archive of a shapefile showing archaeological sites used in the webGIS from Phase 6.
86m_flooding_contour.zip: a ZIP archive of a shapefile showing flooded areas at 86m ASL.
94.5m_flooding_contour.zip: a ZIP archive of a shapefile showing flooded areas at 94.5m ASL.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Data and code supporting the manuscript "Tracking and classifying Amazon fire events in near-real time" accepted in Science Advances.
This is a subset of the Zenodo-ML Dinosaur Dataset [Github] that has been converted to small png files and organized in folders by the language so you can jump right in to using machine learning methods that assume image input.
Included are .tar.gz files, each named after a file extension; when extracted, each produces a folder of the same name.
tree -L 1
.
├── c
├── cc
├── cpp
├── cs
├── css
├── csv
├── cxx
├── data
├── f90
├── go
├── html
├── java
├── js
├── json
├── m
├── map
├── md
├── txt
└── xml
And we can peep inside one of the (somewhat smaller) folders of the set to see that the subfolders are Zenodo identifiers. A Zenodo identifier corresponds to a single GitHub repository, so the png files produced are chunks of code of the given extension type from a particular repository.
$ tree map -L 1
map
├── 1001104
├── 1001659
├── 1001793
├── 1008839
├── 1009700
├── 1033697
├── 1034342
...
├── 836482
├── 838329
├── 838961
├── 840877
├── 840881
├── 844050
├── 845960
├── 848163
├── 888395
├── 891478
└── 893858
154 directories, 0 files
Within each folder (zenodo id) the files are prefixed by the zenodo id, followed by the index into the original image set array that is provided with the full dinosaur dataset archive.
$ tree m/891531/ -L 1
m/891531/
├── 891531_0.png
├── 891531_10.png
├── 891531_11.png
├── 891531_12.png
├── 891531_13.png
├── 891531_14.png
├── 891531_15.png
├── 891531_16.png
├── 891531_17.png
├── 891531_18.png
├── 891531_19.png
├── 891531_1.png
├── 891531_20.png
├── 891531_21.png
├── 891531_22.png
├── 891531_23.png
├── 891531_24.png
├── 891531_25.png
├── 891531_26.png
├── 891531_27.png
├── 891531_28.png
├── 891531_29.png
├── 891531_2.png
├── 891531_30.png
├── 891531_3.png
├── 891531_4.png
├── 891531_5.png
├── 891531_6.png
├── 891531_7.png
├── 891531_8.png
└── 891531_9.png
0 directories, 31 files
So what's the difference?
The difference is that these files are organized by extension type, and provided as actual png images. The original data is provided as numpy data frames, and is organized by zenodo ID. Both are useful for different things - this particular version is cool because we can actually see what a code image looks like.
How many images total?
We can count the number of total images:
find "." -type f -name *.png | wc -l
3,026,993
The script to create the dataset is provided here. Essentially, we start with the top extensions as identified by this work (excluding actual images files) and then write each 80x80 image to an actual png image, organizing by extension then zenodo id (as shown above).
I tested a few methods to write the single channel 80x80 data frames as png images, and wound up liking cv2's imwrite function because it would save and then load the exact same content.
import cv2
cv2.imwrite(image_path, image)
Given the above, it's pretty easy to load an image! Here is an example using imageio (for newer Python), followed by the older scipy approach (which now gives a deprecation message).
image_path = '/tmp/data1/data/csv/1009185/1009185_0.png'
from imageio import imread
image = imread(image_path)
array([[116, 105, 109, ..., 32, 32, 32],
[ 48, 44, 48, ..., 32, 32, 32],
[ 48, 46, 49, ..., 32, 32, 32],
...,
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32]], dtype=uint8)
image.shape
(80,80)
# Deprecated
from scipy import misc
misc.imread(image_path)
Image([[116, 105, 109, ..., 32, 32, 32],
[ 48, 44, 48, ..., 32, 32, 32],
[ 48, 46, 49, ..., 32, 32, 32],
...,
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32]], dtype=uint8)
Remember that the values in the data are characters that have been converted to ordinal. Can you guess what 32 is?
ord(' ')
32
# And thus if you wanted to convert it back...
chr(32)
' '
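Following that idea, a small sketch (using the example path shown earlier) that turns one 80x80 code image back into text looks like this:

from imageio import imread

image = imread('/tmp/data1/data/csv/1009185/1009185_0.png')  # example path from above
# Each pixel value is the ordinal of a character, so each row decodes to one line.
text_rows = ["".join(chr(v) for v in row) for row in image]
print("\n".join(text_rows[:5]))  # first five 80-character lines of the code chunk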
So how t...
This repository contains data for the main text figures plus some supplementary figures in the article:
Kikstra et al 2021 Nat. Energy. DOI: 10.1038/s41560-021-00904-8
This dataset should be cited as: Kikstra et al. (2021). Data for climate mitigation scenarios with persistent COVID-19 related energy demand changes. DOI: 10.5281/zenodo.5211169
In order to reproduce the figures, one needs to use the script that is available on GitHub at:
https://github.com/iiasa/covid-energy-demand-scenarios
The most accessible way of exploring the scenario data behind this article would be to go to https://data.ece.iiasa.ac.at/engage/#/workspaces/60.
This goes to a web tool hosted by the International Institute for Applied Systems Analysis (IIASA), which provides access to a database of these and more variables of interest, defined for each scenario at the level of MESSAGE regions, with a few example workspaces available within the ENGAGE Scenario Explorer.
The Scenario Explorer is a versatile open access tool to browse, visualize and download data and results. Users can freely create a private workspace where customized plots can be saved and shared.
For tutorials on how to use the Scenario Explorer, please visit https://software.ece.iiasa.ac.at/ixmp-server/tutorials.html.
The scenarios that were used for the IPCC Special Report on 1.5C warming (SR1.5) have been made available at https://data.ece.iiasa.ac.at/iamc-1.5c-explorer/.
The data is available for download at the ENGAGE Scenario Explorer. The license permits use of the scenario ensemble for scientific research and science communication, but restricts redistribution of substantial parts of the data. Please refer to the FAQ and legal code for more information.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset accompanies the following article:
Nearing, Grey, et al. "Global prediction of extreme floods in ungauged watersheds." Nature (2024).
The code repository associated with this data is available here: https://github.com/google-research-datasets/global_streamflow_model_paper/. It is highly recommended to use the associated code repository to process this data.
The `model_data.tgz` archive includes reforecasts from the Google model and reanalyses from the GloFAS model. Google model outputs are in units of [mm/day] and GloFAS outputs are in units of [m3/s]. Model outputs are daily and timestamps are right-labeled, meaning that model outputs labeled, e.g., 01/01/2020 correspond to streamflow predictions for the day of 12/31/2019.
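A small pandas sketch of handling the right-labeled timestamps is given below; the file name and column names are assumptions for illustration, not files from the archive:

import pandas as pd

# Hypothetical daily reforecast table with a right-labeled "date" column.
df = pd.read_csv("google_reforecast_example.csv", parse_dates=["date"])
# A row labeled 2020-01-01 is the prediction for 2019-12-31,
# so shift the label back one day to index by the day being predicted.
df["prediction_date"] = df["date"] - pd.Timedelta(days=1)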
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
ESA WorldCereal 2021 products v100
The European Space Agency (ESA) WorldCereal 10m 2021 product suite consists of global-scale annual and seasonal crop maps and (where applicable) their related confidence. Every file in this repository contains up to 106 agro-ecological zone (AEZ) products, which were all processed with respect to their own regional seasonality and should be considered as independent products.
Naming convention of the ZIP files is as follows:
WorldCereal_{year}_{season}_{product}_{classification|confidence}.zip
The actual AEZ-based GeoTIFF files inside each ZIP are named according to following convention:
{AEZ_id}_{season}_{product}_{startdate}_{enddate}_{classification|confidence}.tif
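A short Python sketch parsing a file name according to this convention is shown below; the example file name is illustrative, not an actual file from the archive:

# Split an AEZ GeoTIFF name of the form
# {AEZ_id}_{season}_{product}_{startdate}_{enddate}_{classification|confidence}.tif
fname = "46172_tc-maize-main_maize_2021-04-01_2021-10-31_classification.tif"  # illustrative
aez_id, season, product, startdate, enddate, layer = fname.removesuffix(".tif").split("_")
print(aez_id, season, product, startdate, enddate, layer)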
The seasons are defined in Table 1. Note that cereals as described by WorldCereal include wheat, barley and rye, which belong to the Triticeae tribe. In addition to the actual WorldCereal products, this repository contains the file "WorldCereal_AEZ.geojson", which contains the AEZ description and outline, as well as "QGIS_stylefiles.zip", which contains QGIS style files (.qml) for product visualization purposes.
Season | Description |
---|---|
tc-annual | A one-year cycle being defined in a region by the end of the last considered growing season |
tc-wintercereals | The main cereals season defined in a region |
tc-springcereals | Optional springcereals season, only defined in certain AEZ |
tc-maize-main | The main maize season defined in a region |
tc-maize-second | Optional second maize season, only defined in certain AEZ. |
Note: AEZs for which no irrigation product is available were not processed because of the unavailability of thermal Landsat data.
A scientific paper describing the WorldCereal products and the methodology behind them is available through the link below:
This work was supported by the European Space Agency under contract N°4000130569/20/I-NB.
GNU Affero General Public License: https://www.gnu.org/licenses/agpl.txt
This repository contains all the data files for a simulated exome-sequencing study of 150 families, ascertained to contain at least four members affected with lymphoid cancer. Please note that previous versions of this repository omitted a key file linking the genotypes of individuals to their family and individual IDs; this file, geno_key.txt, is now included. All other files remain the same as in previous versions.
The simulated data can be found in the files section below. The files are:
All the scripts used to generate these data can be found in the GitHub repository archived at https://zenodo.org/records/12694914
We have also uploaded one intermediate .Rdata file, Chromwide.Rdata, to save the user substantial time when running the associated RMarkdown script for the simulation. We recommend loading Chromwide.Rdata into your R workspace rather than generating it from scratch.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This is a repository containing processed data for MethylBoostER, an XGBoost model that classifies kidney cancer subtypes. The open-source code can be found here: https://github.com/ss-lab-cancerunit/MethylBoostER.
Globally, interest in understanding the life cycle related greenhouse gas (GHG) emissions of buildings is increasing. Robust data is required for benchmarking and analysis of parameters driving resource use and whole life carbon (WLC) emissions. However, open datasets combining information on energy and material use as well as whole life carbon emissions remain largely unavailable – until now.
We present a global database on whole life carbon, energy use, and material intensity of buildings. It contains data on more than 1,200 building case studies and includes over 300 attributes addressing context and site, building design, assessment methods, energy and material use, as well as WLC emissions across different life cycle stages. The data was collected through various meta-studies, using a dedicated data collection template (DCT) and processing scripts (Python Jupyter Notebooks), all of which are shared alongside this data descriptor.
This dataset is valuable for industrial ecology and sustainable construction research and will help inform decision-making in the building industry as well as the climate policy context.
The need for reducing greenhouse gas (GHG) emissions across Europe requires defining and implementing a performance system for both operational and embodied carbon at the building level that provides relevant guidance for policymakers and the building industry. So-called whole life carbon (WLC) of buildings is gaining increasing attention among decision-makers concerned with climate and industrial policy, as well as building procurement, design, and operation. However, most open building datasets published thus far have focused on buildings’ operational energy consumption and related parameters 1,2,2–4. Recent years furthermore brought large-scale datasets on building geometry (footprint, height) 5,6 as well as the publication of some datasets on building construction systems and material intensity 7,8. Heeren and Fishman’s database seed on material intensity (MI) of buildings 7, an essential reference for this work, was a first step towards an open data repository on material-related environmental impacts of buildings. In their 2019 descriptor, the authors present data on the material coefficients of more than 300 building cases intended for use in studies applying material flow analysis (MFA), input-output (IO) or life cycle assessment (LCA) methods. Guven et al. 8 elaborated on this effort by publishing a construction classification system database for understanding resource use in building construction. However, thus far, there has been a lack of publicly available data that combines material composition and energy use and also considers life cycle-related environmental impacts, such as life cycle-related GHG emissions, also referred to as buildings’ whole life carbon.
The Global Database on Whole Life Carbon, Energy Use, and Material Intensity of Buildings (CarbEnMats-Buildings) published alongside this descriptor provides information on more than 1,200 buildings worldwide. The dataset includes attributes on geographical context and site, main building design characteristics, LCA-based assessment methods, as well as information on energy and material use, and related life cycle greenhouse gas (GHG) emissions, commonly referred to as whole life carbon (WLC), with a focus on embodied carbon (EC) emissions. The dataset compiles data obtained through a systematic review of the scientific literature as well as systematic data collection from both literature sources and industry partners. By applying a uniform data collection template (DCT) and related automated procedures for systematic data collection and compilation, we facilitate the processing, analysis and visualization along predefined categories and attributes, and support the consistency of data types and units. The descriptor includes specifications related to the DCT spreadsheet form used for obtaining these data as well as explanations of the data processing and feature engineering steps undertaken to clean and harmonise the data records. The validation focuses on describing the composition of the dataset and values observed for attributes related to whole life carbon, energy and material intensity.
The data published with this descriptor offers the largest open compilation of data on whole life carbon emissions, energy use and material intensity of buildings published to date. This open dataset is expected to be valuable for research applications in the context of MFA, I/O and LCA modelling. It also offers a unique data source for benchmarking whole life carbon, energy use and material intensity of buildings to inform policy and decision-making in the context of the decarbonization of building construction and operation as well as commercial real estate in Europe and beyond.
All files related to this descriptor are available on a public GitHub repository and related release via Zenodo (https://doi.org/10.5281/zenodo.8363895). The repository contains the following files:
Please consult the related data descriptor article (linked at the top) for further information, e.g.:
The dataset, the data collection template as well as the code used for processing, harmonization and visualization are published under a GNU General Public License v3.0. The GNU General Public License is a free, copyleft license for software and other kinds of works. We encourage you to review, reuse, and refine the data and scripts and eventually share-alike.
The CarbEnMats-Buildings database is the result of a highly collaborative effort and needs your active contributions to further improve and grow the open building data landscape. Reach out to the lead author (email, LinkedIn) if you are interested in contributing your data or time.
When referring to this work, please cite both the descriptor and the dataset:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset was used in and is supplementary to the paper "In vitro genotoxicity testing using γH2AX biomarker, microscopy and automatic image analysis in ImageJ - a pilot study with valinomycin". It contains both raw single-channel images and numerical results obtained with the bioimage analysis and evaluation scripts available through the GitHub repository: https://github.com/martinschatz-cz/genotoxicity-bia.
Naming Convention for images
ChannelName_YYYYMMD_Well_PossitionInWell_AcqRun.tiff
Naming Convention for results
Well_AllResults_YYYY-MM-DD_Results.csv
The CSV files are comma-separated and are named automatically by the analysis script.
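A brief Python sketch for loading one results table under this convention follows; the well name and date below are placeholder values:

import pandas as pd

well, date = "B2", "2023-01-15"  # placeholders, not actual wells/dates
results = pd.read_csv(f"{well}_AllResults_{date}_Results.csv")  # comma-separated, per above
print(results.head())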
Folder Structure
Images (1069 files, as Images.zip)
4H - all images for all wells
24H - all images for all wells
Results (36 files)
Results_4H (as Results_4h.zip)
Results_24H (as Results_24h.zip)
Measurement Settings
Manufacturer and model of microscope: Olympus IX83 P2ZF
Objective lens magnification, NA: 10x Olympus IX3 Nosepiece, LensNA=0.3
Excitation filters (mounted in the light source)
Violet: 395/25nm LED module 1, DAPI
Green: 555/28nm LED module 5, Cy3
Quad band filter set for DAPI/FITC/Cy3/Cy5
Quad band polychroic mirror (mounted in the filter turret):
BP 411-454nm,
BP 495-536nm,
BP 577-617nm
BP 655-810nm
Emission filters (mounted in the fast emission filter wheel, in front of the camera):
DAPI: BP 421-445nm
Cy3: BP 581-619nm
Illumination light source: Lumencor Spectra X Lamp
Pixel size: 650nm x 650nm
Camera manufacturer and model: Hamamatsu ORCA-Flash4.0
Software program(s) and version: OLYMPUS cellSens Dimension 3.2 (Build 23706)
Image acquisition settings
exposure: 500 ms
gain: 0
binning: 4 x 4
Experiment manager: ZDC + autofocus, two channels: DAPI and Cy3
Image Processing and Analysis
The data analysis workflow consists of several stages, each of which was executed by a specific script. Firstly, the raw data were manually cleaned and automatically sorted and organized using sort_wells.ijm in FIJI. Secondly, image analysis was performed using Process_WFolder_macro_v1.ijm in FIJI, which processed the image data and extracted the relevant features. Finally, the results were further processed using SF_dataVis_and_statistics_mean_XYh.ipynb in Python (Jupyter Notebook), which generated the final output in the form of a CSV file.
In this repository, you can access the resulting CSV file, which contains the final results of our analysis. Additionally, we have provided the scripts used to process the data, which are available on our GitHub repository (LINK). You will find instructions on how to create a local Jupyter Hub for the Python scripts. These scripts are accompanied by a short manual that provides an overview of the data analysis workflow and helps users navigate through the code. By making our scripts available, we hope to facilitate transparency and reproducibility of our research. If you encounter any issues, please report them through the GitHub repository: https://github.com/martinschatz-cz/genotoxicity-bia.
We believe that our work can be useful for other researchers and analysts who are interested in studying similar datasets. We invite you to explore the contents of this repository and use the data and scripts provided here to further your research.
Cell lines and culture conditions
Human cervical adenocarcinoma (HeLa) and Chinese hamster ovary (CHO-K1) cell lines were obtained from the American Type Culture Collection (ATCC). The HeLa cells were grown in MEM supplemented with 10% FBS and NEAA. CHO-K1 cells were cultivated in DMEM supplemented with L-proline (final concentration 35 mg/l). The cells were incubated in a humidified atmosphere of 5% CO2 at 37 °C.
Direct measurement of DNA DSBs
The cells were seeded at a concentration of 0.5 × 10^5 cells/ml into a 96-well plate (VWR, 10062-900). After 24 h incubation the cells were rinsed with phosphate buffered saline (PBS), and medium with reduced FBS content (5%) was added. Valinomycin was dissolved in DMSO and added to cells at two final concentrations (30 and 15 µM). After 4 h/24 h incubation, visualization was performed following the protocol of the HCS DNA Damage Kit. The cells were fixed with 4% paraformaldehyde solution for 15 min at room temperature. The cells were rinsed once with PBS and permeabilization was performed using Triton® X-100 solution by incubation for 15 min at room temperature. The wells were rinsed with PBS once and the plate was blocked with 1% BSA blocking solution. After 1 hour of incubation at room temperature the blocking solution was removed and 100 µl of pH2AX mouse monoclonal antibody solution (1:1000 in BSA) was pipetted into each well and incubated for 1 hour at room temperature. After rinsing three times with PBS, 100 µl of Alexa Fluor® 555 goat anti-mouse IgG (H+L; 1:2000) and Hoechst 33342 (1:6000) solution was added and incubated for 1 hour at room temperature protected from light. After the incubation the wells were rinsed three times with PBS. The plate was stored with 100 µl in the refrigerator (4 °C) until the image analysis was performed.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This repository contains MD simulations and associated analyses of a series of short peptides containing phosphorylated residues. It is one of five repositories associated with the following research article:
Bickel,D., and Vranken,W. (2024) Effects of Phosphorylation on Protein Backbone Dynamics and Conformational Preferences. J. Chem. Theory Comput. https://doi.org/10.1021/acs.jctc.4c00206.
The full list of the related repositories is given here:
10.5281/zenodo.10517328
10.5281/zenodo.10518872
10.5281/zenodo.10518971
10.5281/zenodo.10518993
10.5281/zenodo.10519033