84 datasets found
  1. Transcriptome Analysis of Beta macrocarpa and Identification of...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Jun 1, 2023
    Cite
    Huiyan Fan; Yongliang Zhang; Haiwen Sun; Junying Liu; Ying Wang; Xianbing Wang; Dawei Li; Jialin Yu; Chenggui Han (2023). Transcriptome Analysis of Beta macrocarpa and Identification of Differentially Expressed Transcripts in Response to Beet Necrotic Yellow Vein Virus Infection [Dataset]. http://doi.org/10.1371/journal.pone.0132277
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Huiyan Fan; Yongliang Zhang; Haiwen Sun; Junying Liu; Ying Wang; Xianbing Wang; Dawei Li; Jialin Yu; Chenggui Han
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Rhizomania is one of the most devastating diseases of sugar beet. It is caused by Beet necrotic yellow vein virus (BNYVV), transmitted by the obligate root-infecting parasite Polymyxa betae. Beta macrocarpa, a wild beet species widely used as a systemic host in the laboratory, can be rub-inoculated with BNYVV to avoid variation associated with the presence of the vector P. betae. To better understand disease and resistance between beets and BNYVV, we characterized the transcriptome of B. macrocarpa and analyzed global gene expression of B. macrocarpa in response to BNYVV infection using the Illumina sequencing platform.

    Results: The overall de novo assembly of cDNA sequence data generated 75,917 unigenes, with an average length of 1054 bp. Based on a BLASTX search (E-value ≤ 10^-5) against the non-redundant (NR, NCBI) protein, Swiss-Prot, the Gene Ontology (GO), Clusters of Orthologous Groups of proteins (COG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases, there were 39,372 unigenes annotated. In addition, 4,834 simple sequence repeats (SSRs) were also predicted, which could serve as a foundation for various applications in beet breeding. Furthermore, comparative analysis of the two transcriptomes revealed that 261 genes were differentially expressed in infected compared to control plants, including 128 up- and 133 down-regulated genes. GO analysis showed that the differentially expressed genes were mainly enriched in response to biotic stimulus and primary metabolic process.

    Conclusion: Our results not only provide a rich genomic resource for beets, but also benefit research into the molecular mechanisms of beet-BNYVV interaction.

  2. ncbi_disease

    • huggingface.co
    Updated May 23, 2024
    + more versions
    Cite
    NLM/DIR BioNLP Group (2024). ncbi_disease [Dataset]. https://huggingface.co/datasets/ncbi/ncbi_disease
    Explore at:
    Dataset updated
    May 23, 2024
    Dataset authored and provided by
    NLM/DIR BioNLP Group
    License

    https://choosealicense.com/licenses/unknown/

    Description

    This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which allowed the use of pre-annotations as a pre-step to manual annotations. Fourteen annotators were randomly paired and differing annotations were discussed for reaching a consensus in two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to assure corpus-wide consistency.

    For more details, see: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3951655/

    The original dataset can be downloaded from: https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/NCBI_corpus.zip This dataset has been converted to CoNLL format for NER using the following tool: https://github.com/spyysalo/standoff2conll Note: there is a duplicate document (PMID 8528200) in the original data, and the duplicate is recreated in the converted data.
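
    As a rough illustration of working with the converted data, the sketch below reads CoNLL-style text and collects disease mentions. It assumes one token per line with the tag in the last whitespace-separated column, blank lines between sentences, and BIO tags such as B-Disease and I-Disease; the sample text and the function names are made up for illustration and are not taken from the corpus or from standoff2conll.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]

# Hypothetical two-sentence sample; the tag names are an assumption.
SAMPLE = (
    "Patients\tO\nwith\tO\ncolorectal\tB-Disease\ncancer\tI-Disease\n.\tO\n"
    "\n"
    "No\tO\nmention\tO\nhere\tO\n"
)


def parse_conll(text: str) -> List[Sentence]:
    """Split CoNLL-style text into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # First column is the token, last column is the tag.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def extract_mentions(sentence: Sentence) -> List[str]:
    """Join consecutive B-/I- tagged tokens into mention strings."""
    mentions: List[str] = []
    span: List[str] = []
    for token, tag in sentence:
        if tag.startswith("B-"):
            if span:
                mentions.append(" ".join(span))
            span = [token]
        elif tag.startswith("I-") and span:
            span.append(token)
        else:
            if span:
                mentions.append(" ".join(span))
            span = []
    if span:
        mentions.append(" ".join(span))
    return mentions


for sent in parse_conll(SAMPLE):
    print(extract_mentions(sent))
```

    Because the note above mentions a duplicated document, a real loader would probably also want to deduplicate by document identifier before training.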

  3. Results of Data analysis of RNA-Seq

    • figshare.com
    xlsx
    Updated Jan 11, 2018
    + more versions
    Cite
    Kiichi Hirota (2018). Results of Data analysis of RNA-Seq [Dataset]. http://doi.org/10.6084/m9.figshare.5353462.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jan 11, 2018
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Kiichi Hirota
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data analysis of RNA-Seq

    FASTQ files for RCC4-EV cells (DRR100656) and RCC4-VHL cells (DRR100657) were obtained from the Sequence Read Archive (https://trace.ddbj.nig.ac.jp/dra/index_e.html). The quality of sequence data was evaluated by FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) after the trimming process by fastx_toolkit v 0.0.14 (http://hannonlab.cshl.edu/fastx_toolkit/). The human reference sequence file (hs37d5.fa) was downloaded from the 1000 genome ftp site (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/), and the annotated general feature format (gff) file was downloaded from the Illumina iGenome ftp site (ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Homo_sapiens/NCBI/build37.2/). The human genome index was constructed with bowtie-build in Bowtie v.2.2.9. The fastq files were aligned to the reference genomic sequence by TopHat v.2.1.1 with default parameters. Bowtie2 v2.2.9 and Samtools v.1.3.1 were used with the TopHat program [47]. Transcript abundance was estimated, and the count values were normalized to the upper quartile of the fragments per kilobase of transcript per million mapped fragments (FPKM) using Cufflinks (cuffdiff) v2.1.1. The cuffdiff output (gene_exp.diff) is presented as gene_exp.diff.txt.
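
    For readers unfamiliar with the FPKM unit mentioned above, the following is a minimal sketch of the plain FPKM formula. It does not reproduce the upper-quartile normalization that cuffdiff applies, and the function name and the example numbers are illustrative assumptions rather than values from this dataset.

```python
def fpkm(fragments: int, length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments.

    Plain definition only: fragments / (length in kb) / (library size in millions).
    """
    if length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


# Hypothetical transcript: 500 fragments on a 2,000 bp transcript
# in a library of 10 million mapped fragments.
value = fpkm(fragments=500, length_bp=2000, total_mapped_fragments=10_000_000)
print(value)
```

    In the upper-quartile variant, the library-size term is replaced by a quantity derived from the upper quartile of per-gene counts, so values from the two schemes are not directly comparable.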

  4. The WU-Minn Human Connectome Project: An overview: tfMRI LANGUAGE STORY...

    • neurovault.org
    nifti
    Updated Jun 30, 2018
    + more versions
    Cite
    (2018). The WU-Minn Human Connectome Project: An overview: tfMRI LANGUAGE STORY minus MATH zstat1 [Dataset]. http://identifiers.org/neurovault.image:3142
    Explore at:
    Available download formats: nifti
    Dataset updated
    Jun 30, 2018
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    FSL5.0

    glassbrain

    Collection description

    IMPORTANT:

    This is open access data. You must agree to Terms and conditions of using this data before using it, available at:

    http://humanconnectome.org/data/data-use-terms/open-access.html

    Open Access Data (all imaging data and most of the behavioral data) is available to those who register an account at ConnectomeDB and agree to the Open Access Data Use Terms. This includes agreement to comply with institutional rules and regulations.

    This means you may need the approval of your IRB or Ethics Committee to use the data. The released HCP data are not considered de-identified, since certain combinations of HCP Restricted Data (available through a separate process) might allow identification of individuals. Different national, state and local laws may apply and be interpreted differently, so it is important that you consult with your IRB or Ethics Committee before beginning your research. If needed and upon request, the HCP will provide a certificate stating that you have accepted the HCP Open Access Data Use Terms.

    Please note that everyone who works with HCP open access data must review and agree to these terms, including those who are accessing shared copies of this data. If you are sharing HCP Open Access data, please advise your co-researchers that they must register with ConnectomeDB and agree to these terms.

    Register and sign the Open Access Data Use Terms at ConnectomeDB:

    https://db.humanconnectome.org/

    Preprocessing Details:
    http://www.ncbi.nlm.nih.gov/pubmed/23668970

    T-stat maps were generated with FSL's randomise:

    randomise -i 4D -o OneSampT -1 -T

    and the package TtoZ was used to generate the Z-stat maps:

    https://github.com/vsoch/TtoZ

    Tasks:
    http://humanconnectome.org/documentation/S500/HCP_S500+MEG2_Release_Appendix_VI.pdf

    Subject species

    homo sapiens

    Modality

    fMRI-BOLD

    Analysis level

    group

    Cognitive paradigm (task)

    language processing fMRI task paradigm

    Map type

    Z

  5. PubMed Article Summarization Dataset

    • kaggle.com
    zip
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). PubMed Article Summarization Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/pubmed-article-summarization-dataset
    Explore at:
    Available download formats: zip (686033678 bytes)
    Dataset updated
    Dec 5, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    PubMed Article Summarization Dataset

    PubMed Summarization Dataset

    By ccdv (From Huggingface) [source]

    About this dataset

    The dataset consists of multiple files, including validation.csv, train.csv, and test.csv. Each file contains a combination of articles and their respective abstracts. The articles are sourced directly from PubMed, ensuring they represent a wide range of topics across various scientific disciplines.

    In order to provide reliable datasets for different purposes, the files have been carefully curated to serve specific functions. validation.csv contains a subset of articles with their corresponding abstracts that can be used for validating the performance of summarization models during development. train.csv features a larger set of article-abstract pairs specifically intended for training such models.

    Finally, test.csv serves as an independent evaluation set that allows developers to measure the effectiveness and generalizability of their summarization models against unseen data points. By using this test set, researchers can assess how well their algorithms perform in generating concise summaries that accurately capture the main findings and conclusions within scientific articles.

    Researchers in natural language processing (NLP), machine learning (ML), or any related field can utilize this dataset to advance automatic text summarization techniques focused on scientific literature. Whether it's building extractive or abstractive methods or exploring novel approaches like neural networks or transformer-based architectures, this rich dataset provides ample opportunities for experimentation and progress in the field.

    How to use the dataset

    Introduction:

    Dataset Structure:

    • article: The full text of a scientific article from the PubMed database (Text).
    • abstract: A summary of the main findings and conclusions of the article (Text).

    Using the Dataset: To maximize the utility of this dataset, it is important to understand its purpose and how it can be utilized:

    • Training Models: The train.csv file contains articles and their corresponding abstracts that can be used for training summarization models or developing algorithms that generate concise summaries automatically.

    • Validation Purposes: The validation.csv file serves as a held-out set for tuning your models or comparing different approaches during development.

    • Evaluating Model Performance: The test.csv file offers a separate set of articles along with their corresponding abstracts specifically designed for evaluating the performance of various summarization models.

    Tips for Utilizing the Dataset Effectively:

    • Preprocessing: Before using this dataset, consider preprocessing steps such as removing irrelevant sections (e.g., acknowledgments, references), cleaning up invalid characters or formatting issues if any exist.

    • Feature Engineering: Explore additional features like article length, sentence structure complexity, or domain-specific details that may assist in improving summarization model performance.

    • Model Selection & Evaluation: Experiment with different summarization algorithms, ranging from traditional extractive approaches to more advanced abstractive methods. Evaluate model performance using established metrics such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation).

    • Data Augmentation: Depending on the size of your dataset, you may consider augmenting it further by applying techniques like data synthesis or employing external resources (e.g., pre-trained language models) to enhance model performance.
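
    To make the ROUGE suggestion above concrete, here is a minimal sketch of ROUGE-1 (unigram overlap) between a reference abstract and a generated summary. It uses plain lowercased whitespace tokenization, which is a simplifying assumption; established ROUGE implementations typically add stemming and other normalization, so scores from this sketch will not match them exactly. The example strings are invented.

```python
from collections import Counter
from typing import Dict


def rouge1(reference: str, candidate: str) -> Dict[str, float]:
    """Return ROUGE-1 recall, precision and F1 from clipped unigram overlap."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((ref_counts & cand_counts).values())
    ref_total = sum(ref_counts.values())
    cand_total = sum(cand_counts.values())
    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


# Hypothetical reference abstract and model output.
scores = rouge1(
    reference="the virus infects sugar beet roots",
    candidate="the virus infects beet",
)
print(scores)
```

    In this setup the abstract column would play the role of the reference and the model output the role of the candidate.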

    Conclusion:

    Research Ideas

    • Textual analysis and information retrieval: Researchers can use this dataset to analyze patterns in scientific literature or conduct information retrieval tasks. By examining the relationship between article content and its abstract, researchers can gain insights into how different sections of a scientific paper contribute to its overall summary.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: validation.csv | Column name | Description ...

  6. Data from: SkewDB: A comprehensive database of GC and 10 other skews for...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Oct 4, 2021
    Cite
    Bert Hubert (2021). SkewDB: A comprehensive database of GC and 10 other skews for over 28,000 chromosomes and plasmids [Dataset]. http://doi.org/10.5061/dryad.g4f4qrfr6
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 4, 2021
    Dataset provided by
    Independent researcher
    Authors
    Bert Hubert
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    GC skew denotes the relative excess of G nucleotides over C nucleotides on the leading versus the lagging replication strand of eubacteria. While the effect is small, typically around 2.5%, it is robust and pervasive. GC skew and the analogous TA skew are a localized deviation from Chargaff’s second parity rule, which states that G and C, and T and A occur with (mostly) equal frequency even within a strand.

    Most bacteria also show the analogous TA skew. Different phyla show different kinds of skew and differing relations between TA and GC skew. This article introduces an open access database (https://skewdb.org) of GC and 10 other skews for over 28,000 chromosomes and plasmids. Further details like codon bias, strand bias, strand lengths and taxonomic data are also included.

    The SkewDB database can be used to generate or verify hypotheses. Since the origins of both the second parity rule, as well as GC skew itself, are not yet satisfactorily explained, such a database may enhance our understanding of microbial DNA.

    Methods

    The SkewDB analysis relies exclusively on the tens of thousands of FASTA and GFF3 files available through the NCBI download service, which covers both GenBank and RefSeq. The database includes bacteria, archaea and their plasmids. Furthermore, to ease analysis, the NCBI Taxonomy database is sourced and merged so output data can quickly be related to (super)phyla or specific species. No other data is used, which greatly simplifies processing. Data is read directly in the compressed format provided by NCBI.

    All results are emitted as standard CSV files. In the first step of the analysis, for each organism the FASTA sequence and the GFF3 annotation file are parsed. Every chromosome in the FASTA file is traversed from beginning to end, while a running total is kept for cumulative GC and TA skew. In addition, within protein coding genes, such totals are also kept separately for these skews on the first, second and third codon position. Furthermore, separate totals are kept for regions which do not code for proteins. In addition, to enable strand bias measurements, a cumulative count is maintained of nucleotides that are part of a positive or negative sense gene. The counter is increased for positive sense nucleotides, decreased for negative sense nucleotides, and left alone for non-genic regions.

    A separate counter is kept for non-genic nucleotides. Finally, G and C nucleotides are counted, regardless of whether they are part of a gene or not. These running totals are emitted at 4096-nucleotide intervals, a resolution suitable for determining skews and shifts. In addition, one-line summaries are stored for each chromosome. Each summary line includes the RefSeq identifier of the chromosome, the full name mentioned in the FASTA file, plus counts of A, C, G and T nucleotides. Finally, five levels of taxonomic data are stored.
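
    The running-total idea can be sketched as follows. This is an illustrative reading of the description, not the SkewDB code: it adds 1 for each G and subtracts 1 for each C, and records the total every `interval` nucleotides. The function name and the tiny test sequence are made up; the default interval of 4096 follows the text.

```python
from typing import List, Tuple


def cumulative_gc_skew(sequence: str, interval: int = 4096) -> List[Tuple[int, int]]:
    """Return (position, running G-minus-C total) sampled every `interval` bases."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    total = 0
    samples: List[Tuple[int, int]] = []
    for pos, base in enumerate(sequence.upper(), start=1):
        if base == "G":
            total += 1
        elif base == "C":
            total -= 1
        # Other bases (A, T, N, ...) leave the running total unchanged.
        if pos % interval == 0:
            samples.append((pos, total))
    return samples


# Toy sequence with a small interval so the samples are easy to follow.
print(cumulative_gc_skew("GGGCATGCCCAT", interval=6))
```

    The same loop could keep parallel totals for TA skew or for individual codon positions, as the description suggests, at the cost of tracking gene annotations alongside the sequence.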

    Chromosomes and plasmids of fewer than 100 thousand nucleotides are ignored, as these are too noisy to model faithfully. Plasmids are clearly marked in the database, enabling researchers to focus on chromosomes if so desired.

    Fitting

    Once the genomes have been summarised at 4096-nucleotide resolution, the skews are fitted to a simple model. The fits are based on four parameters. Alpha1 and alpha2 denote the relative excess of G over C on the leading and lagging strands. If alpha1 is 0.046, this means that for every 1000 nucleotides on the leading strand, the cumulative count of G excess increases by 46. The third parameter is div and it describes how the chromosome is divided over leading and lagging strands. If this number is 0.557, the leading replication strand is modeled to make up 55.7% of the chromosome. The final parameter is shift (the dotted vertical line), and denotes the offset of the origin of replication compared to the DNA FASTA file. This parameter has no biological meaning in itself, and is an artifact of the DNA assembly process.
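
    One plausible reading of the four parameters is a piecewise-linear curve for the cumulative skew, sketched below. This is an assumption made for illustration and may differ from the actual SkewDB model: positions are rotated by shift on a circular chromosome, the first div fraction changes at alpha1 per nucleotide, and the remainder changes at alpha2 per nucleotide (with alpha2 treated as a signed value). The function name and the example chromosome length are hypothetical.

```python
def predicted_cumulative_skew(
    pos: int,
    length: int,
    alpha1: float,
    alpha2: float,
    div: float,
    shift: int = 0,
) -> float:
    """Piecewise-linear cumulative skew at `pos` (illustrative reading only)."""
    if length <= 0:
        raise ValueError("length must be positive")
    if not 0.0 <= div <= 1.0:
        raise ValueError("div must be between 0 and 1")
    # Rotate coordinates so that x = 0 sits at the modeled origin.
    x = (pos - shift) % length
    leading_len = div * length
    if x <= leading_len:
        return alpha1 * x
    # Past the leading part, continue from its end point with slope alpha2.
    return alpha1 * leading_len + alpha2 * (x - leading_len)


# With alpha1 = 0.046, 1000 leading-strand nucleotides add about 46,
# matching the worked number in the text.
print(predicted_cumulative_skew(1000, 1_000_000, 0.046, -0.046, 0.557))
```

    A fit would then search for the parameter values that minimize the difference between this curve and the observed running totals.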

    The goodness-of-fit number consists of the root mean squared error of the fit, divided by the absolute mean skew. This latter correction is made to not penalize good fits for bacteria showing significant skew. GC skew tends to be defined very strongly, and it is therefore used to pick the div and shift parameters of the DNA sequence, which are then kept as a fixed constraint for all the other skews, which might not be present as clearly. The fitting process itself is a downhill simplex method optimization over the three dimensions, seeded with the average observed skew over the whole genome, and assuming there is no shift, and that the leading and lagging strands are evenly distributed. The simplex optimization is tuned so that it takes sufficiently large steps so it can reach the optimum even if some initial assumptions are off.

  7. utility: Collection of Tumor-Infiltrating Lymphocyte Single-Cell Experiments...

    • zenodo.org
    zip
    Updated Apr 6, 2022
    Cite
    Nicholas Borcherding (2022). utility: Collection of Tumor-Infiltrating Lymphocyte Single-Cell Experiments with TCR [Dataset]. http://doi.org/10.5281/zenodo.5524577
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 6, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nicholas Borcherding
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    The original intent of assembling a data set of publicly available tumor-infiltrating T cells (TILs) with paired TCR sequencing was to expand and improve the scRepertoire R package. However, after some discussion, we decided to release the data set for everyone. A complete summary of the sequencing runs and the sample information can be found in the meta data of the Seurat object. This repository contains the code for the initial processing and annotating of the data set (we are calling this version 0.0.1). This involves several steps: 1) loading the respective GE data, 2) harmonizing the data by sample and cohort information, 3) iterating through automatic annotation, 4) unifying annotation via manual inspection and enrichment analysis, and 5) adding the TCR information.

    Methods

    Single-Cell Data Processing

    The filtered gene matrices output from the Cell Ranger align function for individual sequencing runs (10x Genomics, Pleasanton, CA) were loaded into the R global environment. For each sequencing run, cell barcodes were given a unique prefix to prevent issues with duplicate barcodes. The results were then ported into individual Seurat objects (citation), where the cells with > 10% mitochondrial genes and/or 2.5x natural log distribution of counts were excluded for quality control purposes. At the individual sequencing run level, doublets were estimated using the scDblFinder (v1.4.0) R package. All the sequencing runs across experiments were merged into a single Seurat object using the merge() function. All the data was then normalized using the default settings, and 2,000 variable genes were identified using the "vst" method. Next, the data was scaled with the default settings and principal components were calculated for 40 components. Data was integrated using the harmony (v1.0.0) R package (citation), using both cohort and sample information to correct for batch effect with up to 20 iterations. The UMAP was created using the RunUMAP() function in Seurat, using 20 dimensions of the harmony calculations.

    Annotation of Cells

    Automatic annotation was performed using the SingleR (v1.4.1) R package (citation) with the HPCA (citation) and DICE (citation) data sets as references and the fine label discriminators. Individual sequencing runs were subsetted to run through the SingleR algorithm in order to reduce memory demands. The outputs of all the SingleR analyses were collated and appended to the meta data of the Seurat object. Likewise, the ProjecTILs (v0.4.1) R package (citation) was used for automatic annotation as a partially orthogonal approach. Consensus annotation was derived from all 3 databases (HPCA, DICE, ProjecTILs) using a majority approach. No annotation designation was assigned to cells that returned NA for both SingleR and ProjecTILs. Mixed annotations were designated where SingleR identified non-T cells and ProjecTILs identified T cells. Cell type designations with fewer than 100 cells in the entire cohort were reduced to "other". Automated annotations were checked manually using canonical marker genes and gene enrichment analysis performed using the UCell (v1.0.0) R package (citation).
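
    The majority approach and the "other" cutoff can be sketched as below. This is a simplified illustration in Python rather than the repository's R code: it assumes one label per reference (HPCA, DICE, ProjecTILs) per cell, treats None as NA, and labels any cell without at least two agreeing votes as "mixed", which is a coarser rule than the T cell versus non-T cell logic described above. All function names and labels are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def consensus_label(
    hpca: Optional[str], dice: Optional[str], projectils: Optional[str]
) -> str:
    """Majority vote over three per-cell labels; None stands for NA."""
    votes = Counter(v for v in (hpca, dice, projectils) if v is not None)
    if not votes:
        return "no annotation"
    label, count = votes.most_common(1)[0]
    # Assumption: fewer than two agreeing votes is reported as "mixed".
    return label if count >= 2 else "mixed"


def collapse_rare(labels: List[str], min_cells: int = 100) -> List[str]:
    """Replace labels seen in fewer than `min_cells` cells with "other"."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_cells else "other" for lab in labels]


# Toy example with a small cutoff so the effect is visible.
cells = [
    consensus_label("CD8 T", "CD8 T", "CD4 T"),
    consensus_label("CD8 T", "CD8 T", "CD8 T"),
    consensus_label("B cell", "NK", "CD8 T"),
    consensus_label(None, None, None),
]
print(collapse_rare(cells, min_cells=2))
```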

    Addition of TCR data

    The filtered contig annotation T cell receptor (TCR) data for available sequencing runs were loaded into the R global environment. Individual contigs were combined using the combineTCR() function of the scRepertoire (v1.3.2) R package (citation). Clonotypes were assigned to barcodes, and multiple duplicate chains for individual cells were filtered to select the top-expressing contig by read count. The clonotype data was then added to the Seurat object, with proportion across individual patients being used to calculate frequency.

    Citations

    As of right now, there is no citation associated with the assembled data set. However if using the data, please find the corresponding manuscript for each data set in the meta.data of the single-cell object. In addition, if using the processed data, feel free to modify the language in the methods section (above) and please cite the appropriate manuscripts of the software or references that were used.

    Itemized List of the Software Used

    Itemized List of Reference Data Used

    • Human Primary Cell Atlas (HPCA) - citation
    • Database Immune Cell Expression (DICE) - citation
    • Immune-related Gene Sets - citation

    Future Directions

    • Data Hosting for Interactive Analysis
    • Easy Submission Portal for Researchers to Add Data
    • Using the Data to Build a Reference Atlas

    These are areas we are actively hoping to develop to make the data set more useful; if you have other suggestions, please reach out using the contact information below.

    Contact

    Questions, comments, suggestions, please feel free to contact Nick Borcherding via this repository, email, or using twitter.

  8. Data from: ChromDB- the chromatin database

    • neuinfo.org
    • dknet.org
    • +2more
    Updated Apr 28, 2007
    Cite
    (2007). ChromDB- the chromatin database [Dataset]. http://identifiers.org/RRID:SCR_007597
    Explore at:
    Dataset updated
    Apr 28, 2007
    Description

    ChromDB is a chromatin database. Three types of sequences are included in the database: genomic-based (predominantly plant sequences); transcript-based (EST contigs or cDNAs for plants lacking a sequenced genome); and NCBI RefSeq sequences for a variety of model animal organisms. The Gene Record Page for any sequence indicates the type of sequence. The broad mission of ChromDB is to display, annotate, and curate sequences of two broad functional classes of biologically important proteins: chromatin-associated proteins (CAPs) and RNA interference-associated proteins. Plant proteins are the major focus of the work supported by The Plant Genome Research Program (PGRP) of the National Science Foundation. Our intent is to produce intensively curated sequence information and make it available to the research and teaching community in support of comparative analyses toward understanding the chromatin proteome in plants, especially in important crop species. In order to do a comparative analysis, it is necessary to include non-plant proteins in the database. Non-plant genes are not curated to the degree carried out for plants; to automate the process of data import, our non-plant genes are taken from the RefSeq database of NCBI. We reason that the inclusion of non-plant, model organisms will broaden the relevance and usefulness of ChromDB to the entire chromatin community and will provide a more complete data set for phylogenetic analyses in support of the evolution of the plant chromatin proteome. ChromDB is funded by a grant from the National Science Foundation Plant Genome Research Project (#DBI-0421679).

  9. GEO_Processing_Exploratory_DGE_Analysis

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). GEO_Processing_Exploratory_DGE_Analysis [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/geo-processing-exploratory-dge-analysis
    Explore at:
    Available download formats: zip (13026816 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset provides a comprehensive workflow for differential gene expression (DGE) analysis.

    It focuses on processing and analyzing GEO (Gene Expression Omnibus) datasets.

    The dataset includes code for retrieving GEO datasets directly from NCBI GEO.

    It provides data cleaning, normalization, and pre-processing steps for gene expression data.

    The workflow demonstrates exploratory data analysis (EDA) on gene expression datasets.

    Differential expression analysis is performed to identify significantly differentially expressed genes.

    Includes visualizations such as heatmaps, volcano plots, and PCA for insights.

    Designed for researchers and bioinformaticians interested in gene expression analysis.

    Supports reproducibility and can be adapted to different GEO datasets.

    Uses Python programming language and popular bioinformatics libraries like pandas, numpy, and matplotlib.

    Encourages integration with downstream functional enrichment and pathway analysis.

  10. Length distribution of unigenes in the assembled transcriptomes.

    • figshare.com
    xls
    Updated Jun 3, 2023
    + more versions
    Cite
    Huiyan Fan; Yongliang Zhang; Haiwen Sun; Junying Liu; Ying Wang; Xianbing Wang; Dawei Li; Jialin Yu; Chenggui Han (2023). Length distribution of unigenes in the assembled transcriptomes. [Dataset]. http://doi.org/10.1371/journal.pone.0132277.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Huiyan Fan; Yongliang Zhang; Haiwen Sun; Junying Liu; Ying Wang; Xianbing Wang; Dawei Li; Jialin Yu; Chenggui Han
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Length distribution of unigenes in the assembled transcriptomes.

  11. Data from: Transcriptomes of bovine ovarian follicular and luteal cells

    • catalog.data.gov
    • s.cnmilf.com
    • +2more
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data from: Transcriptomes of bovine ovarian follicular and luteal cells [Dataset]. https://catalog.data.gov/dataset/data-from-transcriptomes-of-bovine-ovarian-follicular-and-luteal-cells-f9bea
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    Affymetrix Bovine GeneChip® Gene 1.0 ST Array RNA expression analysis was performed on four somatic ovarian cell types: the granulosa cells (GCs) and theca cells (TCs) of the dominant follicle and the large luteal cells (LLCs) and small luteal cells (SLCs) of the corpus luteum. The normalized linear microarray data were deposited in the NCBI GEO repository (GSE83524). Subsequent ANOVA identified genes that were enriched (≥2-fold) or decreased (≤−2-fold) in one cell type compared to all three other cell types, and these analyzed and filtered datasets are presented as tables. Genes with enriched expression shared by both follicular cell types (GCs and TCs) or by both luteal cell types (LLCs and SLCs) are also reported in tables. The standard deviation of the analyzed array data in relation to the log of the expression values is shown as a figure. These data have been further analyzed and interpreted in the companion article "Gene expression profiling of ovarian follicular and luteal cells provides insight into cellular identities and functions", Romereim et al. (2017), Mol. Cell. Endocrinol. 439:379-394. https://doi.org/10.1016/j.mce.2016.09.029

    Resources in this dataset:

    Resource Title: RNA Expression Data from Four Isolated Bovine Ovarian Somatic Cell Types. File Name: Web Page, url: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE83524 NCBI Gene Expression Omnibus (GEO) Accession Display. Analysis of the RNA present in each bovine cell type using Affymetrix microarrays yielded new cell-specific genetic markers, functional insight into the behavior of each cell type via Gene Ontology Annotations and Ingenuity Pathway Analysis, and evidence of small and large luteal cell lineages using Principal Component Analysis. Enriched expression of select genes for each cell type was validated by qPCR. This expression analysis offers insight into the lineage and differentiation process that transforms somatic follicular cells into luteal cells. The original Affymetrix .CEL files and the normalized linear expression data are included in this submission.
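
    The fold-change filter described above can be illustrated with a small sketch. It assumes that "enriched" means at least 2-fold higher mean linear expression than in each of the other three cell types; the ANOVA step is not modeled, and the function name and toy values are hypothetical, not taken from the dataset.

```python
def enriched_cell_types(mean_expr, fold=2.0):
    """Return the cell types whose mean expression is at least `fold` times
    the mean expression in every other cell type (assumed reading of the filter)."""
    hits = []
    for cell, value in mean_expr.items():
        others = [v for name, v in mean_expr.items() if name != cell]
        if all(value >= fold * v for v in others):
            hits.append(cell)
    return hits


# Invented linear-scale means for one hypothetical gene (not real data).
toy_gene = {"GC": 820.0, "TC": 95.0, "LLC": 110.0, "SLC": 102.0}
print(enriched_cell_types(toy_gene))
```

    A gene that passes for two cell types at once cannot occur under this rule, which is why genes shared by both follicular or both luteal cell types would need a separate comparison.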

  12. Genome assemblies and respective cgMLST profiles of a diverse dataset...

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Jul 24, 2023
    + more versions
    Cite
    Verónica Mixão; Holger Brendebach; Miguel Pinto; Daniel Sobral; João Paulo Gomes; Carlus Deneke; Simon Tausch; Vítor Borges (2023). Genome assemblies and respective cgMLST profiles of a diverse dataset comprising 1,874 Listeria monocytogenes isolates [Dataset]. http://doi.org/10.5281/zenodo.7116879
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 24, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Verónica Mixão; Holger Brendebach; Miguel Pinto; Daniel Sobral; João Paulo Gomes; Carlus Deneke; Simon Tausch; Vítor Borges
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset comprises the genome assemblies and respective 1,748-loci core-genome (cg) Multiple Locus Sequence Type (MLST) profiles [Pasteur schema (Moura et al. 2016) available in chewie-NS (Mamede et al. 2022)] of a final set of 1,874 Listeria monocytogenes samples selected among the Whole-Genome Sequencing (WGS) data publicly available in the European Nucleotide Archive (ENA) or in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) at the beginning of the analysis (November 2021). This set of samples was carefully selected to cover a wide genetic diversity (assessed in terms of Sequence Type [ST]). In total, 204 different STs are represented in this dataset, with ST121, ST6, ST9, ST1 and ST155 being in the top 5 and, together, corresponding to 37.9% of the dataset.

    File “Lm_metadata.xlsx” contains metadata information for each isolate, including ENA/SRA accession number, BioProject and in-silico MLST ST.

    The directory “assemblies/” contains all the genome assemblies (.fasta format) of each isolate presented in the metadata file.

    The file “profiles/Lm_profile.tsv” corresponds to a tab separated file with the 1,748-loci cgMLST profile of each isolate presented in the metadata file. These profiles were determined as explained below.

    Dataset selection and curation

    With the objective of creating a diverse dataset of L. monocytogenes genome assemblies, we collected information about the genetic diversity (STs) of the isolates available in the BIGSdb-Lm database at the beginning of this analysis (November 2021) and in other previous works. Based on this information, we selected an initial dataset comprising 1,957 samples associated with three previous studies (Moura et al. 2016; Maury et al. 2017; Painset et al. 2019). Their WGS data were downloaded from ENA/SRA with fastq-dl v1.0.6. Read quality control, trimming and assembly were performed with Aquamis v1.3.9 (Deneke et al. 2021) using default parameters. Assembly quality control (QC), including contamination assessment, and MLST ST determination were performed with the same pipeline. All genome assemblies passing the QC were included in the final dataset. Among the remaining assemblies, a considerable proportion was flagged as “QC fail” exclusively due to the “NumContamSNVs” parameter, suggesting that this setting might have been too strict. After manual inspection of a random subset, those assemblies for which the percentage of reads corresponding to the correct species was >98% were recovered and integrated into the final dataset (those samples are labeled in the metadata file). In total, 1,874 isolates passed the dataset curation step and were included in the final dataset. cgMLST profiles of each of these isolates were determined with chewBBACA v2.8.5 (Silva et al. 2018), using the 1,748-loci Pasteur schema (Moura et al. 2016) available in chewie-NS (Mamede et al. 2022) and downloaded on June 23rd, 2022.
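
    The curation rule described above can be written as a short decision function. This is only a sketch: the argument names and record layout are hypothetical, only the “NumContamSNVs” label and the >98% threshold come from the description, and the manual inspection step is not modeled.

```python
def keep_assembly(qc_status, failed_params, pct_correct_species):
    """Decide whether an assembly enters the final dataset.

    qc_status: "pass" or "fail" (hypothetical encoding of the pipeline verdict)
    failed_params: names of the QC parameters that failed
    pct_correct_species: percentage of reads assigned to the expected species
    """
    if qc_status == "pass":
        return True
    # Recover assemblies that failed only because of the contamination SNV check
    # and whose reads are almost entirely from the correct species.
    only_contam = set(failed_params) == {"NumContamSNVs"}
    return only_contam and pct_correct_species > 98.0


print(keep_assembly("fail", ["NumContamSNVs"], 99.2))
```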

  13. Data from: Manduca sexta Official Gene Set v2.2

    • agdatacommons.nal.usda.gov
    application/x-gzip
    Updated Nov 21, 2025
    + more versions
    Cite
    Michael Kanost; Gary Blissard; Stephen Richards; Susan Brown; Alexie Papanicolaou; Nicolae Herndon; Haobo Jiang (2025). Manduca sexta Official Gene Set v2.2 [Dataset]. http://doi.org/10.15482/USDA.ADC/1522540
    Explore at:
    Available download formats: application/x-gzip
    Dataset updated
    Nov 21, 2025
    Dataset provided by
    Ag Data Commons
    Authors
    Michael Kanost; Gary Blissard; Stephen Richards; Susan Brown; Alexie Papanicolaou; Nicolae Herndon; Haobo Jiang
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Manduca sexta, known as the tobacco hornworm or Carolina sphinx moth, is a lepidopteran insect that is used extensively as a model system for research in insect biochemistry, physiology, neurobiology, development, and immunity. One important benefit of this species as an experimental model is its extremely large size, reaching more than 10 g in the larval stage. M. sexta larvae feed on solanaceous plants and thus must tolerate a substantial challenge from plant allelochemicals, including nicotine. This dataset presents the Manduca sexta Official Gene Set (OGS) v2.2. It is derived from Official Gene Sets OGSv2.0 and OGSv2.1. OGSv2.0 was modified to meet NCBI-GenBank quality review, resulting in OGSv2.1, which was deposited in NCBI GenBank (accession number AIXA00000000). OGSv2.1 was lifted over to assembly JHU_Msex_v1.0/GCF_014839805.1 using the LiftOff (https://doi.org/10.1093/bioinformatics/btaa1016) and Genometools (http://genometools.org/) software, resulting in OGSv2.2. OGSv2.2 has quality issues due to the liftover; the i5k Workspace@NAL recommends NCBI Annotation Release 102 for general analysis instead.

    Resources in this dataset:

    Resource Title: Manduca sexta Official Gene Set v2.2. File Name: OGSv2.2.tar.gz. Resource Description: The attached tar.gz archive (OGSv2.2.tar.gz) contains the following files: mansex_OGSv2.2.gff, a GFF3 file of all gene predictions of the Manduca sexta genome annotation OGSv2.2; and readme.txt, a readme describing the OGS generation process.
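
    As a quick sanity check on a GFF3 file such as the one in this archive, the feature types can be tallied from the third tab-separated column. The sketch below relies only on the general GFF3 layout (nine tab-separated columns, `#` for comment and directive lines); the toy records and identifiers are invented and are not taken from OGSv2.2.

```python
from collections import Counter


def count_feature_types(gff3_lines):
    """Count GFF3 records per feature type (column 3), skipping comments."""
    counts = Counter()
    for line in gff3_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:  # ignore malformed records in this sketch
            continue
        counts[cols[2]] += 1
    return counts


# Invented toy records, for illustration only.
toy_gff3 = [
    "##gff-version 3",
    "scaffold1\tOGS\tgene\t100\t900\t.\t+\t.\tID=toy_gene1",
    "scaffold1\tOGS\tmRNA\t100\t900\t.\t+\t.\tID=toy_mrna1;Parent=toy_gene1",
    "scaffold1\tOGS\texon\t100\t400\t.\t+\t.\tParent=toy_mrna1",
    "scaffold1\tOGS\texon\t600\t900\t.\t+\t.\tParent=toy_mrna1",
]
print(count_feature_types(toy_gff3))
```

    For a real file, the same function can be given an open file handle, since iterating over a text file yields its lines.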

  14. DrugComb - an integrative cancer drug combination data portal (v1.4)

    • zenodo.org
    csv
    Updated May 2, 2024
    Cite
    Jing Tang (2024). DrugComb - an integrative cancer drug combination data portal (v1.4) [Dataset]. http://doi.org/10.5281/zenodo.11102665
    Explore at:
    Available download formats: csv
    Dataset updated
    May 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jing Tang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Drug combination therapy has the potential to enhance efficacy, reduce dose-dependent toxicity and prevent the emergence of drug resistance. However, discovery of synergistic and effective drug combinations has been a laborious and often serendipitous process. In recent years, identification of combination therapies has been accelerated by advances in high-throughput drug screening, but informatics approaches for systems-level data management and analysis are needed. To contribute toward this goal, we created an open-access data portal called DrugComb (https://drugcomb.org) where the results of drug combination screening studies are accumulated, standardized and harmonized. Through the data portal, we provide a web server to analyze and visualize users' own drug combination screening data. Users can also participate in a crowdsourced data curation effort by depositing their data at DrugComb. To initiate the data repository, we collected 437,932 drug combinations tested on a variety of cancer cell lines. We showed that linear regression approaches, when considering chemical fingerprints as predictors, have the potential to achieve high accuracy in predicting the sensitivity of drug combinations. All the data and informatics tools are freely available in DrugComb to enable a more efficient utilization of data resources for future drug combination discovery.

    Citations:

    [1] Nucleic Acids Res. 2021, 49(W1):W174-W184. doi: 10.1093/nar/gkab438

    https://pubmed.ncbi.nlm.nih.gov/34060634/

    [2] Nucleic Acids Res. 2019, 47(W1):W43-W51. doi: 10.1093/nar/gkz337
    https://pubmed.ncbi.nlm.nih.gov/31066443/

  15. Data from: Bioinformatic and statistical scripts to analyze acorn mycobiota...

    • entrepot.recherche.data.gouv.fr
    bin, csv, tsv, txt +1
    Updated Jun 28, 2019
    Cite
    Tania Fort; Charlie Pauvert; Amy E. Zanne; Otso Ovaskainen; Thomas Caignard; Matthieu Barret; Stéphane Compant; Arndt Hampe; Sylvain Delzon; Corinne Vacher (2019). Bioinformatic and statistical scripts to analyze acorn mycobiota structure and composition [Dataset]. http://doi.org/10.15454/SM6OCR
    Explore at:
    Available download formats: tsv(4748), csv(23478), csv(2948623), tsv(70820), type/x-r-syntax(12014), bin(6617), bin(49109), bin(28090), csv(3686), txt(26261), bin(15355)
    Dataset updated
    Jun 28, 2019
    Dataset provided by
    Recherche Data Gouv
    Authors
    Tania Fort; Charlie Pauvert; Amy E. Zanne; Otso Ovaskainen; Thomas Caignard; Matthieu Barret; Stéphane Compant; Arndt Hampe; Sylvain Delzon; Corinne Vacher
    License

    https://spdx.org/licenses/etalab-2.0.html

    Description

    This dataset contains R scripts to perform bioinformatic and statistical analysis on acorn-associated fungal communities (Fort et al., Maternal effects and environmental filtering shape seed fungal communities in oak trees. Submitted). The bioinformatic script was applied to raw sequences after paired-end sequences were joined using PEAR v0.9.10 (Zhang et al., 2014). Raw sequences are available at https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA551388. Metadata and ASV tables are also provided so that the statistical analysis scripts can be used without running the bioinformatic steps.

  16. Data from: Analysis of expressed sequence tags from a significant livestock...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Apr 21, 2025
    + more versions
    Cite
    Agricultural Research Service (2025). Data from: Analysis of expressed sequence tags from a significant livestock pest, the stable fly (Stomoxys calcitrans), identifies transcripts with a putative role in chemosensation and sex determination. [Dataset]. https://catalog.data.gov/dataset/data-from-analysis-of-expressed-sequence-tags-from-a-significant-livestock-pest-the-stable-4e875
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    The stable fly, Stomoxys calcitrans L. (Diptera: Muscidae), is one of the most significant pests of livestock in the United States. The identification of targets for the development of novel control for this pest species, focusing on those molecules that play a role in successful feeding and reproduction, is critical to mitigating its impact on confined and rangeland livestock. This data set was obtained from pyrosequencing of stable fly immature and adult specimens, comprising genes expressed at these stages. Stable fly specimens were obtained from an in vitro colony that is maintained at the Knipling-Bushland U.S. Livestock Insects Research Laboratory (Kerrville, TX) at 27°C, 60% RH, and a photoperiod of 12:12 [L:D] h. The stages included 1 g each of newly laid (t0) and 24 h post-oviposition (t24) embryos, 5 g of pooled 2nd–3rd instar larvae (late larvae), 2.5 g newly pupariated pupae (early pupae), 5 g pharate adults (pupae 2 d prior to eclosion; late pupae), and 1.5 g each of heads from unfed adult females and males (adult). Total RNA was isolated from various stages using the ToTALLY RNA Isolation Kit (Ambion, Foster City, CA) following the manufacturer’s protocol. Five micrograms of normalized cDNA was prepared for sequencing on the 454/ Roche GS-FLX. Double-stranded cDNA was nebulized to generate nominal 500-kb fragments and a shotgun library prepared for GS-FLX sequencing as per the manufacturer’s instructions (Roche, Indianapolis, IN). The sequencing library was run on a full picotitre plate and resulting data submitted to NCBI Short Read Archive (SRX018014). The resulting sequence data was assembled using newbler (RocheN) and the assembly optimized using a beta version of NGEN (DNAstar, Madison, WI) and Seqman (DNAstar, Oxford, UK). BLASTx was utilized based upon W.ND-BLAST (Dowd et al., 2005) against an embl-derived database for Drosophila (2008). Functional annotations were derived using DAVID (Dennis et al., 2003). 
    Raw reads were submitted to the Sequence Read Archive (SRA) Database at NCBI.

    Resources in this dataset:

    Resource Title: Analysis of expressed sequence tags from a significant livestock pest, the stable fly (Stomoxys calcitrans), identifies transcripts with a putative role in chemosensation and sex-determination. File Name: Web Page, url: https://www.ncbi.nlm.nih.gov/bioproject/79611 Sample description: The stable fly, Stomoxys calcitrans L. (Diptera: Muscidae), is one of the most significant pests of livestock in the United States. The identification of targets for the development of novel control for this pest species, focusing on those molecules that play a role in successful feeding and reproduction, is critical to mitigating its impact on confined and rangeland livestock. A database was developed representing genes expressed at the immature and adult lifestages of the stable fly, comprising data obtained from pyrosequencing both immature and adult stages and from small scale sequencing of an antennal/maxillary palp expressed sequence tag library. The full-length sequence and expression of 21 transcripts that may have a role in chemosensation is presented, including 13 odorant binding proteins, 6 chemosensory proteins, and 2 odorant receptors. Transcripts with potential roles in sex determination and reproductive behaviors are identified, including evidence for the sex-specific expression of stable fly doublesex- and transformer-like transcripts. The current database will be a valuable tool for target identification and for comparative studies with other Diptera. Resource Software Recommended: SRA Toolkit, url: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

  17. GapFinder Demo

    • researchdata.edu.au
    Updated Aug 4, 2023
    + more versions
    Cite
    Annette McGrath; David Lovell; Stephen Bent; Louise Ord; Xin-Yi Chua (2023). GapFinder Demo [Dataset]. https://researchdata.edu.au/gapfinder-demo/2296716
    Explore at:
    Dataset updated
    Aug 4, 2023
    Dataset provided by
    CSIRO (http://www.csiro.au/)
    Authors
    Annette McGrath; David Lovell; Stephen Bent; Louise Ord; Xin-Yi Chua
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    GapFinder is a CSIRO-developed web-based visualisation tool that enables the exploration of large distance matrices that have a hierarchical structure to their observations. The tool aims to help us find gaps between different groups of observations or, at least, understand when these gaps do not exist.

    GapFinder Demo demonstrates the application of the GapFinder visualisation to fish metabarcode data. The aim is to help environmental DNA (eDNA) researchers choose PCR primers that allow them to clearly distinguish species of interest in their experiments by their DNA “barcode”. Since these primers yield DNA barcodes from multiple species, this approach is known as metabarcoding. eDNA metabarcoding is a broadly used practical method to determine which species are present in an environment, and measure biodiversity or detect pests, among many other applications of importance in biodiversity and biosecurity.

    The tool allows users to interactively view and interrogate in-silico generated DNA metabarcoding data that allows researchers to determine the best primer pairs to use to discriminate their species of interest from other similar species.

    eDNA researchers face the challenge of determining which genetic targets will allow them to distinguish closely related species relevant to the study system. In practice, this involves selection of PCR primers to produce particular DNA sequence barcodes, entailing intensive in silico and in vitro validation. This tool tackles that challenge in a way that saves computational and experimental resources by centralising the analysis of primer and barcode sequence data.

    Lineage: Public DNA sequence data from NCBI (nt) was downloaded (Release 231) and processed using a newly developed method to determine and visualise the similarity between predicted amplicon sequences from different species. Briefly, the steps used to process data downloaded from the NCBI nt database were:

    1. From a BLAST-formatted database generated from Release 231 of the nt database, sequences were extracted to FASTA files for input into in-silico PCR prediction tools.
    2. Given a primer pair, in-silico PCR was run to find target sequences (amplicons). The resulting amplicons were quality controlled.
    3. Amplicons were dereplicated into unique groups.
    4. Unique amplicons were then pairwise aligned to generate the pairwise alignments table that becomes the input to the data transformation for the GapFinder metabarcode visualisation tool.

    Please note that if the public DNA sequence entries do not contain both forward and reverse primers, then these sequences are not available for in-silico PCR and will not appear in the dataset.
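
    The dereplication step in the lineage above can be sketched as grouping predicted amplicons by identical sequence. This is not the GapFinder implementation; the function names, species labels and sequences are hypothetical, and the in-silico PCR and pairwise alignment steps are not shown.

```python
from collections import defaultdict


def dereplicate(amplicons):
    """Group (species, sequence) pairs into unique sequences.

    Returns a dict mapping each unique sequence to the sorted list of
    species that produced it.
    """
    groups = defaultdict(set)
    for species, seq in amplicons:
        groups[seq.upper()].add(species)
    return {seq: sorted(names) for seq, names in groups.items()}


def shared_barcodes(groups):
    """Unique sequences produced by more than one species, i.e. barcodes
    that cannot separate those species (no gap between them)."""
    return {seq: names for seq, names in groups.items() if len(names) > 1}


# Invented toy amplicons, for illustration only.
toy_amplicons = [
    ("Species A", "ACGTACGT"),
    ("Species B", "acgtacgt"),
    ("Species C", "ACGTTCGT"),
    ("Species A", "ACGTACGT"),
]
unique = dereplicate(toy_amplicons)
print(len(unique), shared_barcodes(unique))
```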

  18. RNA binding strengths of centromeres and the flanking sequences on 19 and X...

    • scidb.cn
    Updated Jun 26, 2021
    Cite
    Ning Ji; Xiaocui Duan; Ning Jiao; Baixue Lv; Zhixue Song; Xiaodie Wang; Xin Liu; Peiyuan Wu; Suleman Shah; Murad Khan; Xiufang Wang; Zhanjun Lv (2021). RNA binding strengths of centromeres and the flanking sequences on 19 and X chromosomes [Dataset]. http://doi.org/10.11922/sciencedb.00890
    Explore at:
    Dataset updated
    Jun 26, 2021
    Dataset provided by
    Science Data Bank
    Authors
    Ning Ji; Xiaocui Duan; Ning Jiao; Baixue Lv; Zhixue Song; Xiaodie Wang; Xin Liu; Peiyuan Wu; Suleman Shah; Murad Khan; Xiufang Wang; Zhanjun Lv
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The centromere regions of human chromosomes are constitutive heterochromatin that is almost non-transcribed. In human cells, interactions between RNA and DNA can activate gene expression in a process known as RNA activation. To describe the possibility of all RNAs binding to DNA sequences, we developed a metric we call RNA binding strength. Here we analyzed the RNA binding strengths of the centromere regions of chromosomes 19 and X and their upstream and downstream flanking sequences using bioinformatics. We found that the RNA binding strengths of the centromere regions were significantly lower than those of the corresponding flanking sequences. We concluded that the low RNA binding strength of the human centromere regions may contribute to the centromere's characteristic lack of transcription.

    Sequence data and software

    The nucleotide sequences of the centromere regions and the flanking sequences of human chromosomes X and 19 were obtained from NCBI (GRCh38 Primary Assembly HSCHR6 CTG1, http://www.ncbi.nlm.nih.gov/projects/genome/guide/human). A total of 1,000 genes highly expressed in human tonsil germinal center B cells were selected for analysis based on the results of the Digital Differential Display (NCBI UniGene Lib.289 - NCI_CGAP_GCB1). Gene-Analyser 2.0 software (see Gene-Analyser 2.0), which was written by our team, was used to count the 7-nucleotide (7-nt) strings (4^7 = 16,384 possible strings). Each 50-kb DNA sequence was represented by a long column of counts whose sum was 49,994. Microsoft Excel 2003 was used for statistical analysis.

    The algorithm for the RNA binding strength

    The RNA binding strength algorithm is based on the principle that more complementarity between RNA and DNA results in more binding between RNAs and DNAs. For example, when there is one 5'-TTTTTTT DNA molecule and ten 5'-AAAAAAA RNA molecules in a certain volume of solution, the likelihood of DNA binding with RNAs is 10 (10×1=10). If there are ten 5'-TTTTTTT DNA molecules, the likelihood of DNA binding with RNAs is 100 (10×10=100). The binding of single-strand RNA and double-strand DNA accounts for competition between RNA and DNA for binding.

    The centromere of chromosome 19 (chr19) is located at 24.50Mb-27.19Mb. The DNA sequence from 19Mb to 32Mb of chromosome 19 was divided into 50-kb fragments. Gene-Analyser 2.0 software was used to analyze the 7-nt strings contained in each 50-kb fragment. The results of the analysis are shown in the folder “Chr19 sequence and 7 nt strings”. The folder named “19” contains the DNA base sequences and their 7-nt strings from 19000001 bp to 20000000 bp of chr19 and contains 20 text files and 20 Excel files. Folders 20-32 have the same meaning as folder “19”.

    The centromere of chromosome X (chrX) is located at 58.61Mb-62.41Mb. The DNA sequence from 52Mb to 67Mb of chrX was divided into 50-kb fragments. Gene-Analyser 2.0 software was used to analyze the 7-nt strings contained in each 50-kb fragment. The results of the analysis are shown in the folder “ChrX sequence and 7 nt strings”. Folder 52 contains the DNA base sequences and their 7-nt strings from 52000001 bp to 53000000 bp of chrX and contains 20 text files and 20 Excel files. Folders 53-67 have the same meaning as folder 52.

    The total RNA of 1,000 genes highly expressed in tonsil germinal center B cells is shown in the Excel file named “7nt strings of RNA expressed by 1000 genes”. The 7-nt string data of each 50-kb fragment is multiplied by the total RNA of the same strings, and the sum of the 16,384 products is the RNA binding strength of the 50-kb fragment. The 50-kb fragment size was chosen because a transcription unit contains 10-50 kb of sequence (based on DNase I digestion of the hemoglobin and ovalbumin genes).

    One thousand genes highly expressed in human tonsil germinal center B cells were selected (as described above), and the 7-nt string numbers for these genes were calculated from the sense strand (including introns and exons). The 7-nt string numbers for each gene are multiplied by the expression frequency of the gene (Lib.5601; http://www.ncbi.nlm.nih.gov/UniGene/), which gives the calculated numbers of the 7-nt strings for the gene. The Excel file named “ChrX52 RNA binding strength” shows how to calculate the RNA binding strength value of one 50-kb DNA fragment (from 52000001 bp to 53000000 bp of chrX). The Excel file named “Chr19” shows the RNA binding strengths of the centromere and flanking sequences of chr19. The Excel file named “ChrX” shows the RNA binding strengths of the centromere and flanking sequences of chrX.
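
    The calculation described above can be sketched in a few lines. This is an illustration under stated assumptions, not the Gene-Analyser 2.0 code: it counts overlapping 7-nt strings, weights each gene's counts by its expression frequency, and multiplies DNA and RNA counts of the same string, as the description states; how strand complementarity is handled in the original software is not specified here. All function names and toy sequences are hypothetical.

```python
from collections import Counter

K = 7  # string length used in the description; 4**7 = 16,384 possible strings


def kmer_counts(seq, k=K):
    """Count overlapping k-nt strings; a 50,000-nt fragment has 49,994 windows."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def rna_kmer_totals(genes):
    """Sum 7-nt string counts over genes, weighted by expression frequency.

    genes: iterable of (sense_strand_sequence, expression_frequency) pairs.
    """
    totals = Counter()
    for seq, freq in genes:
        for kmer, n in kmer_counts(seq).items():
            totals[kmer] += n * freq
    return totals


def binding_strength(dna_fragment, rna_totals):
    """Sum over all strings of (count in DNA fragment) x (weighted RNA total)."""
    dna = kmer_counts(dna_fragment)
    return sum(n * rna_totals.get(kmer, 0) for kmer, n in dna.items())


# Toy input: one invented "gene" with expression frequency 10.
toy_rna = rna_kmer_totals([("TTTTTTT", 10)])
print(binding_strength("TTTTTTTT", toy_rna))
```

    Applied to real data, `binding_strength` would be evaluated once per 50-kb fragment, giving one value per window along the 19Mb-32Mb and 52Mb-67Mb regions.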

  19. DGE GO Enrichment Analysis Microarray Data GDS2778

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). DGE GO Enrichment Analysis Microarray Data GDS2778 [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/dge-go-enrichment-analysis-microarray-data-gds2778
    Explore at:
    Available download formats: zip (6820264 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset is based on National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) DataSet accession GDS2778.

    The dataset originates from a microarray experiment measuring global gene expression under specific experimental conditions.

    Raw and processed expression data (for all probes/genes) are included, enabling downstream analysis such as normalization, differential expression, and clustering.

    The dataset has been used to perform differential gene expression (DGE) analysis to identify genes that are up- or down-regulated under the experimental condition compared to control.

    Data processing steps typically include normalization (e.g., log-transformation), quality control, probe-to-gene mapping, and statistical testing for significance (e.g., using packages such as limma or other DGE tools).

    Resulting differentially expressed genes (DEGs) include statistics such as log fold change (logFC), adjusted p‑values (adj.P.Val), and possibly other metrics (e.g., B-statistic), allowing assessment of both magnitude and significance of changes.

    The dataset also includes a visualization file (heatmap image) that displays expression patterns of DEGs (or top variable genes) across samples — enabling clustering and pattern recognition across samples and genes.

    The heatmap helps illustrate sample-wise and gene-wise expression variation: clustering groups together samples (e.g. control vs treatment) and genes with similar expression dynamics.

    This dataset is suitable for further bioinformatics analysis: e.g. functional enrichment (GO/Pathway), co‑expression analysis, gene signature identification, or integration with other datasets.

    Users who download this dataset can reproduce or extend analyses, such as re-normalization, alternative clustering, custom DEG thresholds, or downstream biological interpretation (pathway, network analysis).
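
    As an example of the custom DEG thresholds mentioned above, the sketch below splits a result table into up- and down-regulated genes using the logFC and adj.P.Val columns. The cutoffs (|logFC| ≥ 1, adj.P.Val < 0.05) are common conventions assumed here, not values stated for this dataset, and the gene names and numbers are invented.

```python
def select_degs(rows, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Split result rows into up- and down-regulated gene lists.

    rows: iterable of dicts with "gene", "logFC" and "adj.P.Val" keys.
    """
    up, down = [], []
    for row in rows:
        if row["adj.P.Val"] >= padj_cutoff:
            continue  # not significant after multiple-testing adjustment
        if row["logFC"] >= lfc_cutoff:
            up.append(row["gene"])
        elif row["logFC"] <= -lfc_cutoff:
            down.append(row["gene"])
    return up, down


# Invented toy results, for illustration only.
toy_rows = [
    {"gene": "toyA", "logFC": 2.3, "adj.P.Val": 0.001},
    {"gene": "toyB", "logFC": -1.8, "adj.P.Val": 0.02},
    {"gene": "toyC", "logFC": 3.0, "adj.P.Val": 0.40},
    {"gene": "toyD", "logFC": 0.4, "adj.P.Val": 0.001},
]
print(select_degs(toy_rows))
```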

  20. HIV-1 and HIV-2 RNA Sequences

    • kaggle.com
    Updated Mar 14, 2024
    Cite
    Proto Bioengineering (2024). HIV-1 and HIV-2 RNA Sequences [Dataset]. http://doi.org/10.34740/kaggle/dsv/7846544
    Explore at:
    Dataset updated
    Mar 14, 2024
    Dataset provided by
    Kaggle
    Authors
    Proto Bioengineering
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is the raw RNA data (the genome) for HIV-1 and HIV-2 from the NCBI. Data is available here for both viruses in FASTA and GenBank formats.

    What is HIV?

    Human Immunodeficiency Virus (HIV) is a virus that infects a person's immune system, which can potentially destroy said immune system and lead to immunocompromise or AIDS.

    What is RNA?

    The RNA of a virus is literally its blueprint and is what gets replicated when a virus infects a cell. A virus's goal is to make millions of copies of itself by hijacking the machinery of living cells. You can think of all viruses as floating blueprints that trick a cell into making more of that blueprint, which then infect other cells to make more copies, and so on.

    Human Immunodeficiency Virus (HIV) is a type of retrovirus, which not only infects cells to make copies of itself, but also inserts a copy of itself into that cell's DNA, which makes it harder to eradicate.

    HIV-1 vs. HIV-2

    Like most viruses, HIV has more than one type (HIV-1 and HIV-2). There are also different strains and subtypes.

    HIV-1 is more prevalent worldwide than HIV-2 and is also more deadly. HIV-2 is mostly found in West Africa, and it is less likely to progress to immune system failure and AIDS (Nyamweya et al., 2013). The two viruses share a 55% similarity in their RNA sequences (Motomura, Chen, & Hu, 2007). Both genomes are included in this dataset.

    Read more about the differences between HIV-1 and HIV-2 here.

    What HIV looks like

    You can see how HIV-1's RNA sequence leads to its ultimate physical shape here on NCBI.

    GenBank vs. FASTA files

    GenBank and FASTA are two of the most popular file formats in Bioinformatics. They both have DNA or RNA, as well as an accession number (or ID), and the virus's name.

    However, GenBank is a more detailed bioinformatics-type file than FASTA. While FASTA only has a name, accession number/ID, and the RNA itself, GenBank also has named gene or protein sequences, which is crucial to understanding what the RNA actually makes and thus how the virus actually works.

    FASTA files technically have all of this info, since we can deduce the genes and/or proteins from the RNA, but GenBank files already contain the work of other scientists who have done this gene/protein-identification for us.
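
    To make the FASTA side of this comparison concrete, here is a minimal parser sketch. It assumes the usual FASTA layout (a `>` header line with the accession as its first token, followed by one or more sequence lines); the accessions and sequences are invented placeholders, not the records in this dataset.

```python
def parse_fasta(text):
    """Parse FASTA text into {accession: (description, sequence)}."""
    records = {}
    header = None
    chunks = []

    def store():
        # Split "ACCESSION free-text description" on the first whitespace.
        parts = header.split(None, 1)
        accession = parts[0] if parts else ""
        description = parts[1] if len(parts) > 1 else ""
        records[accession] = (description, "".join(chunks))

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                store()
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        store()
    return records


# Invented toy records; real files in this dataset will differ.
toy_fasta = ">TOY0001.1 hypothetical toy virus\nACGU\nGGCC\n>TOY0002.1 second toy record\nAUAU\n"
for accession, (description, sequence) in parse_fasta(toy_fasta).items():
    print(accession, description, len(sequence))
```

    Everything a FASTA record carries fits in this small structure, whereas a GenBank record additionally holds annotated features such as genes and proteins.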

    Data Source

    This data was downloaded from the National Center for Biotechnology Information (NCBI) by the bio command line tool.

    License

    This data is in the public domain per the NCBI. Their statement on data licensing and copyright is as follows:

    Databases of molecular data on the NCBI Web site include such examples as nucleotide sequences (GenBank), protein sequences, macromolecular structures, molecular variation, gene expression, and mapping data. They are designed to provide and encourage access within the scientific community to sources of current and comprehensive information. Therefore, NCBI itself places no restrictions on the use or distribution of the data contained therein. Nor do we accept data when the submitter has requested restrictions on reuse or redistribution. However, some submitters of the original data (or the country of origin of such data) may claim patent, copyright, or other intellectual property rights in all or a portion of the data (that has been submitted). NCBI is not in a position to assess the validity of such claims and since there is no transfer of rights from submitters to NCBI, NCBI has no rights to transfer to a third party. Therefore, NCBI cannot provide comment or unrestricted permission concerning the use, copying, or distribution of the information contained in the molecular databases.

    Thank you to the National Cancer Institute on Unsplash for the banner image.

Huiyan Fan; Yongliang Zhang; Haiwen Sun; Junying Liu; Ying Wang; Xianbing Wang; Dawei Li; Jialin Yu; Chenggui Han (2023). Transcriptome Analysis of Beta macrocarpa and Identification of Differentially Expressed Transcripts in Response to Beet Necrotic Yellow Vein Virus Infection [Dataset]. http://doi.org/10.1371/journal.pone.0132277

Transcriptome Analysis of Beta macrocarpa and Identification of Differentially Expressed Transcripts in Response to Beet Necrotic Yellow Vein Virus Infection

10 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: xlsx
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Huiyan Fan; Yongliang Zhang; Haiwen Sun; Junying Liu; Ying Wang; Xianbing Wang; Dawei Li; Jialin Yu; Chenggui Han
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Background: Rhizomania is one of the most devastating diseases of sugar beet. It is caused by Beet necrotic yellow vein virus (BNYVV) transmitted by the obligate root-infecting parasite Polymyxa betae. Beta macrocarpa, a wild beet species widely used as a systemic host in the laboratory, can be rub-inoculated with BNYVV to avoid variation associated with the presence of the vector P. betae. To better understand disease and resistance between beets and BNYVV, we characterized the transcriptome of B. macrocarpa and analyzed global gene expression of B. macrocarpa in response to BNYVV infection using the Illumina sequencing platform.

Results: The overall de novo assembly of cDNA sequence data generated 75,917 unigenes, with an average length of 1054 bp. Based on a BLASTX search (E-value ≤ 10^-5) against the non-redundant (NR, NCBI) protein, Swiss-Prot, the Gene Ontology (GO), Clusters of Orthologous Groups of proteins (COG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases, there were 39,372 unigenes annotated. In addition, 4,834 simple sequence repeats (SSRs) were also predicted, which could serve as a foundation for various applications in beet breeding. Furthermore, comparative analysis of the two transcriptomes revealed that 261 genes were differentially expressed in infected compared to control plants, including 128 up- and 133 down-regulated genes. GO analysis showed that the differentially expressed genes were mainly enriched in response to biotic stimulus and primary metabolic process.

Conclusion: Our results not only provide a rich genomic resource for beets, but also benefit research into the molecular mechanisms of beet-BNYVV interaction.
