The NIH Common Data Elements (CDE) Repository has been designed to provide access to structured human and machine-readable definitions of data elements that have been recommended or required by NIH Institutes and Centers and other organizations for use in research and for other purposes. Visit the NIH CDE Resource Portal for contextual information about the repository.
Therapeutics Data Commons (TDC) is an open-science initiative started at Harvard with AI/ML-ready datasets and ML tasks for therapeutics. It provides an ecosystem of tools, leaderboards, and community resources, including data functions, model benchmarking and comparison strategies, meaningful data splits, data processors, public leaderboards, and molecule generation oracles. All resources are integrated and accessible via an open Python library. TDC is available at https://tdcommons.ai.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Historical NCI Genomic Data Commons data (v09-14-2017). Clinical ('phenotype') and gene expression (HTSeq FPKM-UQ).
dataset: phenotype - Phenotype
cohort: GDC TCGA Colon Cancer (COAD)
dataset ID: TCGA-COAD/Xena_Matrices/TCGA-COAD.GDC_phenotype.tsv
download: https://gdc.xenahubs.net/download/TCGA-COAD/Xena_Matrices/TCGA-COAD.GDC_phenotype.tsv.gz
samples: 570
version: 11-27-2017
hub: https://gdc.xenahubs.net
type of data: phenotype
author: Genomic Data Commons
raw data: https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/#data-release-90
raw data: https://api.gdc.cancer.gov/data/
input data format: ROWs (samples) x COLUMNs (identifiers) (i.e. clinicalMatrix)
570 samples x 151 identifiers
dataset: gene expression RNAseq - HTSeq - FPKM-UQ
cohort: GDC TCGA Colon Cancer (COAD)
dataset ID: TCGA-COAD/Xena_Matrices/TCGA-COAD.htseq_fpkm-uq.tsv
download: https://gdc.xenahubs.net/download/TCGA-COAD/Xena_Matrices/TCGA-COAD.htseq_fpkm-uq.tsv.gz
samples: 512
version: 09-14-2017
hub: https://gdc.xenahubs.net
type of data: gene expression RNAseq
unit: log2(fpkm-uq+1)
platform: Illumina
ID/Gene Mapping: https://gdc.xenahubs.net/download/probeMaps/gencode.v22.annotation.gene.probeMap.gz
author: Genomic Data Commons
raw data: https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/#data-release-80
raw data: https://api.gdc.cancer.gov/data/
wrangling: Data from the same sample but from different vials/portions/analytes/aliquots is averaged; data from different samples is combined into genomicMatrix; all data is then log2(x+1) transformed.
input data format: ROWs (identifiers) x COLUMNs (samples) (i.e. genomicMatrix)
60,484 identifiers x 512 samples
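The wrangling described for the expression matrix (average per sample, combine, then log2(x+1)) can be sketched in a few lines of Python. This is a minimal illustration and not the hub's actual pipeline: the function names `build_genomic_matrix` and `to_linear`, the input shape (one `(sample_id, values)` pair per aliquot) and the nested-dict output are all assumptions made for the example.

```python
import math
from collections import defaultdict


def build_genomic_matrix(records):
    """Average aliquot-level values per sample, then apply log2(x + 1).

    `records` is an iterable of (sample_id, {identifier: fpkm_uq}) pairs,
    one pair per vial/portion/analyte/aliquot (hypothetical input shape).
    Returns {identifier: {sample_id: log2(mean + 1)}}, i.e. rows are
    identifiers and columns are samples, as in a genomicMatrix.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for sample, values in records:
        for ident, value in values.items():
            sums[ident][sample] += value
            counts[ident][sample] += 1
    return {
        ident: {
            sample: math.log2(total / counts[ident][sample] + 1)
            for sample, total in per_sample.items()
        }
        for ident, per_sample in sums.items()
    }


def to_linear(log_value):
    """Invert the log2(x + 1) transform to recover the averaged value."""
    return 2 ** log_value - 1


if __name__ == "__main__":
    demo = [("S1", {"G1": 1.0}), ("S1", {"G1": 5.0}), ("S2", {"G1": 0.0})]
    print(build_genomic_matrix(demo))
```

Averaging happens before the log transform here, following the order in which the wrangling note lists the steps.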
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset outlines a proposed set of core, minimal metadata elements that can be used to describe biomedical datasets, such as those resulting from research funded by the National Institutes of Health. It can inform efforts to better catalog or index such data to improve discoverability. The proposed metadata elements are based on an analysis of the metadata schemas used in a set of NIH-supported data sharing repositories. Common elements from these data repositories were identified, mapped to existing data-specific metadata standards and to existing multidisciplinary data repositories, DataCite and Dryad, and compared with metadata used in MEDLINE records to establish a sustainable and integrated metadata schema. From the mappings, we developed a preliminary set of minimal metadata elements that can be used to describe NIH-funded datasets. Please see the readme file for more details about the individual sheets within the spreadsheet.
This dataset tracks the updates made on the dataset "NIH Common Data Elements Repository" as a repository for previous versions of the data and metadata.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As policies, good practices and funder mandates on research data management evolve, more emphasis has been put on the licencing of data. Licencing information allows potential re-users to quickly identify what they can do with the data in question and is therefore an important component to ensure the reusability of research.
In my research I analyse a pre-existing collection of 840 Horizon 2020 public data management plans (DMPs) available on Phaidra, the repository of the University of Vienna, to determine which ones mention Creative Commons licences and, among those that do, which licences are being used.
This Excel file contains the data underlying the publication "Uncommon Commons? Creative Commons licencing in Horizon 2020 Data Management Plans".
Sheet 1 contains the data collected in the previous "Data Re-Use" project: 840 DMPs downloaded from CORDIS and vetted to ensure they are public documents and not copyrighted
Sheet 2 contains the same data as sheet 1, with columns D to Q hidden (for better reading) and an added column R, which contains the CC licencing information (where available)
Sheet 3 is filtered so that only the projects containing CC BY relevant licencing are shown
Sheet 4 is filtered so that only the projects containing CC-BY-SA relevant licencing are shown
Sheet 5 is filtered so that only the projects containing CC-BY-NC relevant licencing are shown
Sheet 6 is filtered so that only the projects containing CC-BY-ND relevant licencing are shown
Sheet 7 is filtered so that only the projects containing CC-BY-NC-ND relevant licencing are shown
Sheet 8 is filtered so that only the projects containing CC-BY-NC-SA relevant licencing are shown
Sheet 9 is filtered so that only the projects containing CC0 relevant information are shown
Sheet 10 provides an overview table of the relevant licences (manual entry)
Sheets 11 and 12 contain graphic visualisations of the data as used in the article
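A rough way to reproduce the per-licence grouping used in sheets 3 to 9 is to scan DMP text for Creative Commons licence names. The sketch below is only an illustration: the spreadsheet's licence column was filled in by the author, and nothing here is claimed to be the method used. The regular expression, the function names `find_cc_licences` and `count_licences`, and the canonical labels are assumptions for the example.

```python
import re
from collections import Counter

# Hypothetical pattern: "CC0", or "CC BY" followed by optional NC/ND/SA
# modifiers, with spaces or hyphens as separators, case-insensitive.
CC_PATTERN = re.compile(
    r"\bCC(?:0|[\s-]?BY(?:[\s-](?:NC|ND|SA))*)\b", re.IGNORECASE
)


def normalise(match_text):
    """Return a canonical label such as 'CC-BY-NC-SA' or 'CC0'."""
    tokens = re.findall(r"CC0|CC|BY|NC|ND|SA", match_text.upper())
    return "-".join(tokens)


def find_cc_licences(text):
    """Return the sorted, de-duplicated CC licence labels found in one text."""
    return sorted({normalise(m.group(0)) for m in CC_PATTERN.finditer(text)})


def count_licences(texts):
    """Count, per licence label, how many texts mention it at least once."""
    counter = Counter()
    for text in texts:
        counter.update(find_cc_licences(text))
    return counter


if __name__ == "__main__":
    sample = "Data are released under CC BY 4.0; software under cc-by-nc-sa."
    print(find_cc_licences(sample))
```

A pattern like this will miss licences that are spelled out in words ("Creative Commons Attribution"), so manual checking of each DMP remains necessary.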
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Microplitis demolitor (Hymenoptera: Braconidae) is a parasitoid used as a biological control agent to control larval-stage Lepidoptera and serves as a model for studying the function and evolution of symbiotic viruses in the genus Bracovirus. This dataset presents the Microplitis demolitor Official Gene Set (OGS) v1.0. The OGS integrates automatic gene predictions from NCBI-RefSeq's gene set, NCBI Microplitis demolitor Annotation Release 101 (https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Microplitis_demolitor/101/), with manual annotations by the research community, performed via the Apollo manual curation software (https://zenodo.org/record/1295754#.YDgLyJNKivg). Manual annotations were QC'd via the GFF3toolkit (https://github.com/NAL-i5K/gff3toolkit) and NCBI's table2asn_GFF software (https://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/table2asn_GFF/), and merged with NCBI Microplitis demolitor Annotation Release 101 via the GFF3toolkit (https://github.com/NAL-i5K/gff3toolkit).
Resources in this dataset:
Resource Title: Microplitis demolitor Official Gene Set micdem_OGSv1.0.
File Name: micdem_OGSv1.0.tar.gz
Resource Description: This directory contains files for the Official Gene Set 1.0 for Microplitis demolitor (micdem_OGSv1.0). The general procedure for generating this OGS is outlined here: https://github.com/NAL-i5K/GFF3toolkit/. QC of community-curated models from the Apollo software was performed by NAL staff using the GFF3toolkit function gff3_QC, and errors were fixed using gff3_fix. OGSv1.0 was generated by merging NCBI-RefSeq's gene set NCBI Microplitis demolitor Annotation Release 101 (https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Microplitis_demolitor/101/) with the QC'd and error-corrected community-curated models, and generating i5k Workspace IDs for all manually annotated features.
1) Fasta files
- Protein Sequences: micdem_OGSv1.0_pep.fa
- Coding Sequences (CDS): micdem_OGSv1.0_CDS.fa
- Transcript Sequences (includes non-coding sequence): micdem_OGSv1.0_trans.fa
2) Gff3 file: micdem_OGSv1.0.gff
3) Mapping file between Gene set NCBI Microplitis demolitor Annotation Release 101 and OGSv1.0: ID_map_report.txt
This blog post was published on November 13, 2015 and was written by George Komatsoulis. It is a cross-post from the NIH's Data Science blog - https://datascience.nih.gov/blog.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Research Software Alliance's (ReSA) mission is to bring research software communities together to collaborate on the advancement of research software. Given the ReSA mission, it is important to understand the landscape of communities involved with research software. In 2020, ReSA completed an initial exercise to scope the international research software community landscape. This work was reported by ReSA's Software Landscape Analysis task force via a blog post. The majority of the communities in the previous analysis represented the global north. To improve the extent of this landscape analysis, ReSA announced a paid opportunity for short-term contractors located in the global south to collect data on communities and funders in their region in early 2022. This document describes how the work was undertaken, a summary of findings, the gaps and opportunities perceived by the data collectors and some highlights. This work identified 126 organisations and communities and 62 funder bodies that support research software in the global south. Their main activities are connecting people, training, and networking, and support through research grants.
To add to this communities list, please fill in the following form: https://forms.gle/KJE9vkBnM6vhh7cEA
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
House flies (Musca domestica L.) are vectors of human and animal pathogens at livestock operations. Microbial communities in flies are acquired from, and correlate with, their local environment. However, variation among microbial communities carried by flies from farms in different geographical areas is not well understood. We characterized bacterial communities of female house flies collected from beef and dairy farms in Oklahoma, Kansas, and Nebraska and further evaluated the prevalence of antibiotic resistance genes in bacteria within flies. We evaluated the influence of farm type and farm location on bacterial communities, diversity, pathogenic bacteria strains and prevalence of antibiotic resistance genes. These data can be used to better understand the abundance and prevalence of bacterial communities in house flies associated with livestock operations. These data were collected in September 2019. Abbreviations used include Operational Taxonomic Units (OTUs), Canonical Correspondence Analysis (CCA), Infectious Bovine Keratoconjunctivitis (IBK), Antimicrobial Resistance (AMR), and Antibiotic Resistance Genes (ARGs).
The raw Illumina MiSeq sequence data for this project can be found here:
https://www.ncbi.nlm.nih.gov/bioproject/PRJNA863664
Resources in this dataset:
Resource title: Metadata for Microbiome of House Fly Associated with Cattle Farms
File name: Metadata for Microbiome of House Fly Associated with Cattle Farms.xlsx
Resource description: This spreadsheet links the raw sequence reads on NCBI with data on farm type, farm location and sample type.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This paper aims to get a better understanding of the motivational and transaction cost features of building global scientific research commons, with a view to contributing to the debate on the design of appropriate policy measures under the recently adopted Nagoya Protocol. For this purpose, the paper analyses the results of a world-wide survey of managers and users of microbial culture collections, which focused on the role of social and internalized motivations, organizational networks and external incentives in promoting the public availability of upstream research assets. Overall, the study confirms the hypotheses of the social production model of information and shareable goods, but it also shows the need to complete this model. For the sharing of materials, the underlying collaborative economy in excess capacity plays a key role in addition to the social production, while for data, competitive pressures amongst scientists tend to play a bigger role.
Common data operations expressed as MLMs.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Industrial Ecology Data Commons (iedc) is a database that contains more than 200 IE-related datasets from the literature, including stocks, flows, process descriptions, IO tables, material composition of products, and many more. Launched in 2018, the iedc is continuously improved and expanded.
The homepage of the project is https://www.database.industrialecology.uni-freiburg.de/
This Zenodo backup contains a .zip file with 156 parameter templates (xlsx), which were all uploaded to the iedc (SQL database) and are available online.
This backup is for archiving the intermediate step between raw data and uploaded data.
It contains all data that were gathered up to and including November 2024, except for data that were uploaded directly via Python scripts from other sources (like .csv) and not via the xlsx templates.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here: GTEx. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.
The Genotype-Tissue Expression (GTEx) Project established a data resource and tissue bank to study the relationship between genetic variants and gene expression in multiple human tissues and across individuals. The project included contributions from numerous groups with diverse expertise in biospecimen collection and processing, pathology review, molecular analysis, and data management. The contributors are collectively called the GTEx Consortium.
GTEx collected a total of 26,468 unique tissue samples from 50+ different tissue types, from 956 healthy postmortem donors. The standardized biospecimen collection and analysis practices applied during the study served to minimize preanalytical variability associated with specimen-related factors and their potential impact on analytic endpoints. Each GTEx tissue was divided into two tissue blocks, one for histology and one for molecular analysis; both tissue blocks were preserved in PAXgene Tissue Fixative (Qiagen) solution for 6 to 24 hours, followed by PAXgene Tissue Stabilizer (Qiagen) as specified in the project-specific standard operating procedures. Tissue blocks were processed and embedded in paraffin at the GTEx central repository at the Van Andel Institute (MI) and hematoxylin and eosin–stained slides were generated from all GTEx donors. Digitally scanned whole slide images of PAXgene-fixed/stabilized, paraffin-embedded tissue sections were created using Aperio Scanscope software (Leica Biosystems). The digital images were then reviewed and annotated by one of four board-certified pathologists assigned to the GTEx study. There are a total of 25,503 digital histology images in the GTEx collection.
GTEx was supported by the NIH Common Fund (2010 – 2019). Additional resources include the GTEx Biobank, the GTEx Portal, and the full dataset at dbGaP (accession number phs000424).
Please refer to the listed GTEx publications below for more details [2-7].
A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, collection_id-idc_v8-aws.s5cmd corresponds to the contents of the collection_id collection introduced in IDC data release v8. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.
gtex-idc_v19-aws.s5cmd: manifest of files available for download from public IDC Amazon Web Services buckets
gtex-idc_v19-gcs.s5cmd: manifest of files available for download from public IDC Google Cloud Storage buckets
gtex-idc_v19-dcf.dcf: Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)
Note that manifest files that end in -aws.s5cmd reference files stored in Amazon Web Services (AWS) buckets, while -gcs.s5cmd reference files in Google Cloud Storage. The actual files are identical and are mirrored between AWS and GCP.
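The manifest naming convention described above (collection id, IDC release, storage suffix) can be unpacked programmatically when handling several manifests at once. The following is a minimal sketch: the `ManifestName` type, the `parse_manifest_name` function and the regular expression are assumptions made for illustration and are not part of any IDC tool.

```python
import re
from typing import NamedTuple


class ManifestName(NamedTuple):
    collection_id: str
    idc_release: int
    storage: str  # "aws", "gcs" or "dcf", as used in the file names above


# Assumed pattern: <collection_id>-idc_v<release>-<storage>.<extension>
_MANIFEST_RE = re.compile(
    r"^(?P<collection>.+)-idc_v(?P<release>\d+)-(?P<storage>aws|gcs|dcf)"
    r"\.(?:s5cmd|dcf)$"
)


def parse_manifest_name(filename):
    """Split a manifest file name into collection id, release and storage."""
    match = _MANIFEST_RE.match(filename)
    if match is None:
        raise ValueError(f"not a recognised manifest name: {filename!r}")
    return ManifestName(
        collection_id=match["collection"],
        idc_release=int(match["release"]),
        storage=match["storage"],
    )


if __name__ == "__main__":
    print(parse_manifest_name("gtex-idc_v19-aws.s5cmd"))
```

The release number identifies when that version of the collection was first introduced, so it should not be read as the latest IDC release.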
Each of the manifests includes instructions in the header on how to download the included files.
To download the files using .s5cmd manifests:
Install the idc-index package: pip install --upgrade idc-index
Download the files by passing the .s5cmd manifest file: idc download manifest.s5cmd
To download the files using .dcf manifest, see manifest header.
The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health (commonfund.nih.gov/GTEx). Additional funds were provided by the NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. Donors were enrolled at Biospecimen Source Sites funded by NCI/Leidos Biomedical Research, Inc. subcontracts to the National Disease Research Interchange (10XS170), Roswell Park Cancer Institute (10XS171), and Science Care, Inc. (X10S172). The Laboratory, Data Analysis, and Coordinating Center (LDACC) was funded through a contract (HHSN268201000029C) to the Broad Institute of MIT and Harvard. Biorepository operations were funded through a Leidos Biomedical Research, Inc. subcontract to Van Andel Research Institute (10ST1035). Additional data repository and project management were provided by Leidos Biomedical Research, Inc. (HHSN261200800001E). The Brain Bank was supported with supplements to University of Miami grant DA006227. Statistical Methods development grants were made to the University of Geneva (MH090941 & MH101814), the University of Chicago (MH090951, MH090937, MH101825, & MH101820), the University of North Carolina - Chapel Hill (MH090936), North Carolina State University (MH101819), Harvard University (MH090948), Stanford University (MH101782), Washington University (MH101810), and to the University of Pennsylvania (MH101822).
Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.
[1] Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W. L., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National cancer institute imaging data commons: Toward transparency, reproducibility, and scalability in imaging artificial intelligence. Radiographics 43, (2023).
[2] Sobin, L., Barcus, M., Branton, P. A., Engel, K. B., Keen, J., Tabor, D., Ardlie, K. G., Greytak, S. R., Roche, N., Luke, B., Vaught, J., Guan, P. & Moore, H. M. Histologic and quality assessment of Genotype-Tissue Expression (GTEx) research samples: A large postmortem tissue collection. Arch. Pathol. Lab. Med. (2024). doi:10.5858/arpa.2023-0467-OA
[3] GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).
[4] GTEx Consortium. Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348, 648–660 (2015).
[5] GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
[6] Carithers, L. J., Ardlie, K., Barcus, M., Branton, P. A., Britton, A., Buia, S. A., Compton, C. C., DeLuca, D. S., Peter-Demchok, J., Gelfand, E. T., Guan, P., Korzeniewski, G. E., Lockhart, N. C., Rabiner, C. A., Rao, A. K., Robinson, K. L., Roche, N. V., Sawyer, S. J., Segrè, A. V., Shive, C. E., Smith, A. M., Sobin, L. H., Undale, A. H., Valentino, K. M., Vaught, J., Young, T. R., Moore, H. M. & GTEx Consortium. A novel approach to high-quality postmortem tissue procurement: The GTEx project. Biopreserv. Biobank. 13, 311–319 (2015).
[7] Branton, P. A., Sobin, L., Barcus, M., Engel, K. B., Greytak, S. R., Guan, P., Vaught, J. & Moore, H. M. Notable histologic findings in a ‘normal’ cohort: The National Institutes of Health Genotype-Tissue Expression (GTEx) project. Arch. Pathol. Lab. Med. (2024). doi:10.5858/arpa.2023-0468-OA
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Culicoides biting midges are important vectors of diverse microbes such as viruses, protozoa, and nematodes that cause diseases in wild and domestic animals. However, little is known about the role of microbial communities in midge larval habitat utilization in the wild. In this study, we characterized microbial communities (bacterial, protistan, fungal and metazoan) in soils from disturbed (bison and cattle grazed) and undisturbed (non-grazed) pond and spring potential midge larval habitats. We evaluated the influence of habitat and grazing disturbance and their interaction on microbial communities, diversity, presence of midges, and soil properties. These data can be used to better understand environmental microbial communities in tallgrass prairie ecosystems associated with grazed versus ungrazed pond and spring habitats and to draw inferences on the interactions of these communities and soil properties with the presence of biting midge larvae. These data should not be used to make inferences for ecosystems other than tallgrass prairie, for animal management methods other than open cow-calf or bison grazing (such as feedlots, dairies, or stockyards), or for other grazing mammals (such as sheep or goats). These data were collected between the months of September and December and therefore are not representative of microbial communities present from January through August. Abbreviations used include Total Carbon (TC), Total Nitrogen (TN), Organic Matter (OM), Konza Prairie Biological Station (KPBS), Operational Taxonomic Unit (OTU), Principal Coordinates Analysis (PCoA), ribosomal RNA (rRNA), and vesicular stomatitis virus (VSV). The raw Illumina MiSeq sequence data for this project can be found here: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA862140 Resources in this dataset:
Resource Title: Metadata for Midge Larval Habitat Soil Microbiome
File Name: Metadata for NCBI Accession PRJNA862140.xlsx
Resource Description: This spreadsheet links the raw sequence reads on NCBI with data on the presence/absence of Culicoides midges and soil chemistry data (% total soil nitrogen, % total soil carbon, and % organic matter).
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here: TCGA-READ. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.
The Cancer Genome Atlas-Rectum Adenocarcinoma (TCGA-READ) data collection is part of a larger effort to enhance the TCGA http://cancergenome.nih.gov/ data set with characterized radiological images. The Cancer Imaging Program (CIP), with the cooperation of several TCGA tissue-contributing institutions, has archived a large portion of the radiological images of the genetically-analyzed READ cases.
Please see the TCGA-READ wiki page to learn more about the images and to obtain any supporting metadata for this collection.
A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, collection_id-idc_v8-aws.s5cmd corresponds to the contents of the collection_id collection introduced in IDC data release v8. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.
tcga_read-idc_v8-aws.s5cmd: manifest of files available for download from public IDC Amazon Web Services buckets
tcga_read-idc_v8-gcs.s5cmd: manifest of files available for download from public IDC Google Cloud Storage buckets
tcga_read-idc_v8-dcf.dcf: Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)
Note that manifest files that end in -aws.s5cmd reference files stored in Amazon Web Services (AWS) buckets, while -gcs.s5cmd reference files in Google Cloud Storage. The actual files are identical and are mirrored between AWS and GCP.
Each of the manifests includes instructions in the header on how to download the included files.
To download the files using .s5cmd manifests:
Install the idc-index package: pip install --upgrade idc-index
Download the files by passing the .s5cmd manifest file: idc download manifest.s5cmd
To download the files using .dcf manifest, see manifest header.
Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.
[1] Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence. RadioGraphics (2023). https://doi.org/10.1148/rg.230180
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CWL alignment workflow included in this case study was designed by Data Biosphere. It adapts the alignment pipeline originally developed at the Abecasis Lab, University of Michigan. This workflow is part of the NIH Data Commons initiative and comprises four stages. The first step, "Pre-align", accepts a Compressed Alignment Map (CRAM) file (a compressed format for BAM files developed by the European Bioinformatics Institute (EBI)) and the human genome reference sequence as input and, using SAMtools utilities such as view, sort and fixmate, returns a list of fastq files that can be used as input for the next step. The next step, "Align", also accepts the human reference genome as input along with the output files from "Pre-align" and uses BWA-mem to generate aligned reads as BAM files. SAMBLASTER is used to mark duplicate reads and SAMtools view to convert read files from SAM to BAM format. The BAM files generated after "Align" are sorted with SAMtools sort. Finally, these sorted alignment files are merged to produce a single sorted BAM file using SAMtools merge in the "Post-align" step.
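The stage order described above can be written down as a small dataflow model, which makes it easy to check that every stage's inputs are produced earlier in the chain. This sketch only models the description; it does not run SAMtools, BWA or SAMBLASTER and is not the CWL definition itself. The stage name "Sort" and the artifact labels ("cram", "fastq", "bam" and so on) are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    tools: tuple          # tool names as mentioned in the description
    inputs: frozenset     # hypothetical artifact labels consumed
    outputs: frozenset    # hypothetical artifact labels produced


STAGES = (
    Stage("Pre-align", ("SAMtools view", "SAMtools sort", "SAMtools fixmate"),
          frozenset({"cram", "reference"}), frozenset({"fastq"})),
    Stage("Align", ("BWA-mem", "SAMBLASTER", "SAMtools view"),
          frozenset({"fastq", "reference"}), frozenset({"bam"})),
    Stage("Sort", ("SAMtools sort",),
          frozenset({"bam"}), frozenset({"sorted_bam"})),
    Stage("Post-align", ("SAMtools merge",),
          frozenset({"sorted_bam"}), frozenset({"merged_bam"})),
)


def check_dataflow(stages, initial_inputs):
    """Walk the stages in order and fail if a stage lacks one of its inputs.

    Returns the set of all artifact labels available after the last stage.
    """
    available = set(initial_inputs)
    for stage in stages:
        missing = stage.inputs - available
        if missing:
            raise ValueError(f"{stage.name} is missing inputs: {sorted(missing)}")
        available |= stage.outputs
    return available


if __name__ == "__main__":
    print(sorted(check_dataflow(STAGES, {"cram", "reference"})))
```

Supplying only the CRAM file without the reference makes the check fail at "Pre-align", which mirrors the requirement that both inputs are provided to the workflow.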
This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance; see https://w3id.org/cwl/prov/0.6.0 or use https://pypi.org/project/cwlprov/ to explore it.
The COVID Information Commons (CIC) is an open website portal and community to facilitate knowledge-sharing and collaboration across various COVID research efforts, funded by the NSF Convergence Accelerator and the NSF Technology, Innovation and Partnerships Directorate. The CIC serves as an open resource for researchers, students, and decision-makers from academia, government, not-for-profits and industry to identify collaboration opportunities, to leverage each other's research findings, and to accelerate the most promising research to mitigate the broad societal impacts of the COVID-19 pandemic. The CIC was developed as a collaborative proposal led by the Northeast Big Data Innovation Hub, hosted by Columbia University, in collaboration with the Midwest Big Data Innovation Hub, South Big Data Innovation Hub, and West Big Data Innovation Hub. It was funded by the NSF Convergence Accelerator (NSF #2028999) in May 2020 and launched in July 2020. The initial focus of the CIC website ...
The NSF and NIH funded COVID related awards corpus in the CIC was collected primarily from NSF and NIH via APIs. Further information has been collected directly from researchers, who filled out an online form to enhance the descriptions. The dataset has been cleaned and enhanced by automated processing, using custom scripts to remove invalid characters and standardize names of funding agency divisions.
# COVID Information Commons Archive
This archive is a snapshot of the COVID Information Commons (CIC). The CIC is a live database that records information about COVID-19 researchers and their projects.
The snapshot of the CIC contains the following files, each listed with a description of the fields it contains:
cic_people_export.json -- Researchers who have studied aspects of COVID-19. The file includes all information known about the researchers in CIC except email addresses, which have been filtered out for privacy purposes. Some researchers have minimal information, as CIC may only know their name via a reference in a grant description. Other people have more complete records, if they have provided additional information to the CIC.
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The Cancer Genome Atlas Rectum Adenocarcinoma (TCGA-READ) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the CIP TCGA Radiology Initiative.
This dataset contains 2D image slices extracted from the publicly available Pancreas-CT-SEG dataset, which provides manually segmented pancreas annotations for contrast-enhanced 3D abdominal CT scans. The original dataset was curated by the National Institutes of Health Clinical Center (NIH) and was made available through the NCI Imaging Data Commons (IDC). The dataset consists of 82 CT scans from 53 male and 27 female subjects, converted into 2D slices for segmentation tasks.
Dataset Details:
Modality: Contrast-enhanced CT (portal-venous phase, ~70s post-injection)
Number of Scans: 82 (from 80 subjects: 53 male, 27 female)
Age Range: 18 to 76 years (Mean: 46.8 ± 16.7 years)
Scan Resolution: 512 × 512 pixels per slice
Slice Thickness: Varies between 1.5 mm and 2.5 mm
Scanners Used: Philips and Siemens MDCT scanners (120 kVp tube voltage)
Segmentation: Manually performed by a medical student and verified by an expert radiologist
Data Format: Converted from 3D DICOM/NIfTI to 2D PNG/JPEG slices for segmentation tasks
Total Dataset Size: ~1.85 GB
Category: Non-cancerous healthy controls (No pancreatic cancer lesions or major abdominal pathologies)
Preprocessing and Conversion:
The original 3D CT scans and corresponding pancreas segmentation masks (available in NIfTI format) were converted into 2D slices to facilitate 2D medical image segmentation tasks. The conversion steps include:
Extracting axial slices from each 3D CT scan.
Normalizing pixel intensities for consistency.
Saving images in PNG/JPEG format for compatibility with deep learning frameworks.
Generating corresponding binary segmentation masks where the pancreas region is labeled.
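The conversion steps listed above can be sketched with plain Python lists standing in for a loaded volume. This is an illustration under stated assumptions, not the script used to build the dataset: the intensity window (`low=-100`, `high=240`), the `[z][y][x]` indexing, and the function names are all hypothetical, and reading NIfTI or writing PNG/JPEG files is left out because it needs third-party libraries.

```python
def window_to_uint8(value, low=-100.0, high=240.0):
    """Clip an intensity to [low, high] and rescale it to the range 0..255.

    The window bounds are assumed values chosen for the example only.
    """
    clipped = min(max(value, low), high)
    return round((clipped - low) / (high - low) * 255)


def axial_slices(volume):
    """Yield (index, 2D slice) pairs from a volume indexed as [z][y][x]."""
    for z, plane in enumerate(volume):
        yield z, plane


def convert(volume, labels, low=-100.0, high=240.0):
    """Return one (image, mask) pair of 2D lists per axial slice.

    Images are normalised to 0..255; masks are binary, with 255 wherever
    the label volume is non-zero (the pancreas region) and 0 elsewhere.
    """
    pairs = []
    for z, plane in axial_slices(volume):
        image = [[window_to_uint8(v, low, high) for v in row] for row in plane]
        mask = [[255 if v > 0 else 0 for v in row] for row in labels[z]]
        pairs.append((image, mask))
    return pairs


if __name__ == "__main__":
    ct = [[[-1000, -100], [70, 240]]]
    seg = [[[0, 1], [1, 0]]]
    print(convert(ct, seg))
```

Keeping each image and its mask together in one pair avoids the two drifting out of step when slices are later filtered or shuffled.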
Dataset Structure:
Applications
This dataset is ideal for medical image segmentation tasks such as:
Deep learning-based pancreas segmentation (e.g., using U-Net, DeepLabV3+)
Automated organ detection and localization
AI-assisted diagnosis and analysis of abdominal CT scans
Acknowledgments & References
This dataset is derived from:
National Cancer Institute Imaging Data Commons (IDC) [1]
The Cancer Imaging Archive (TCIA) [2]
Original dataset DOI: https://doi.org/10.7937/K9/TCIA.2016.tNB1kqBU
Citations: If you use this dataset, please cite the following:
Roth, H., Farag, A., Turkbey, E. B., Lu, L., Liu, J., & Summers, R. M. (2016). Data From Pancreas-CT (Version 2). The Cancer Imaging Archive. DOI: 10.7937/K9/TCIA.2016.tNB1kqBU
Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., et al. (2023). National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence. Radiographics 43.
License: This dataset is provided under the Creative Commons Attribution 4.0 International (CC-BY-4.0) license. Users must abide by the TCIA Data Usage Policy and Restrictions.
Additional Resources: Imaging Data Commons (IDC) Portal: https://portal.imaging.datacommons.cancer.gov/explore/
OHIF DICOM Viewer: https://viewer.ohif.org/
This dataset provides a high-quality, well-annotated resource for researchers and developers working on medical image analysis, segmentation, and AI-based pancreas detection.