The dataset was collected through whole-transcriptome RNA-Sequencing technologies. The processing method was described in the manuscript.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
VCF files containing filtered mutated sites in SARS-CoV-2 genomes obtained from GISAID EpiCoV and submitted from the UK and the US, separated by individual mutation. The columns correspond to the viral genome accession ID, nucleotide position in the genome, mutation ID (left blank in all rows), reference nucleotide, identified mutation, quality, filter, and information columns (all left blank), format (GT in all rows), a column for the reference genome (all 0, referring to the reference nucleotide column), and one column per isolate genome. Each row identifies the nucleotide at the position given in the POS column and records whether each isolate is non-mutant (0) or carries the mutant indicated in the identified mutation column (1). The files are tab-delimited: the UK file has 12696 rows including the names and 18135 columns, and the US file has 15588 rows including the names and 16277 columns.
The files were generated to test whether the different SARS-CoV-2 genes or protein-coding regions are positively or negatively selected differently between 14408C>T / 23403A>G double mutants and double wild-type isolates, using mutation rate models, and whether regional distribution affects the mutation rates. Our findings show that the RdRp coding region and the S gene show the strongest selection across viral generations, and that the synonymous and nonsynonymous mutation rates for individual genes can differ between countries.
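The column layout described for these files can be read with a few lines of Python. The following is a minimal sketch, not the authors' processing code: the function name `mutation_frequencies`, the header names, and the toy rows are hypothetical, and it assumes the first nine columns are the fixed fields listed above, the tenth is the reference genome column, and every later column holds a 0/1 call for one isolate.

```python
import csv
import io


def mutation_frequencies(vcf_text):
    """Map (position, ref, alt) to the fraction of isolates carrying the mutation."""
    freqs = {}
    reader = csv.reader(io.StringIO(vcf_text), delimiter="\t")
    # Assumed layout: accession ID, POS, ID, REF, ALT, QUAL, FILTER, INFO,
    # FORMAT, reference genome column, then one column per isolate.
    first_isolate = 10
    for row in reader:
        # Skip empty lines and header/meta lines (those starting with '#').
        if not row or row[0].startswith("#"):
            continue
        pos, ref, alt = int(row[1]), row[3], row[4]
        calls = row[first_isolate:]
        carriers = sum(1 for call in calls if call == "1")
        freqs[(pos, ref, alt)] = carriers / len(calls)
    return freqs


# Toy input with four hypothetical isolates (not taken from the dataset).
toy = "\n".join([
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
               "FORMAT", "REFERENCE", "iso1", "iso2", "iso3", "iso4"]),
    "\t".join(["ACC_PLACEHOLDER", "14408", "", "C", "T", "", "", "",
               "GT", "0", "1", "1", "0", "1"]),
    "\t".join(["ACC_PLACEHOLDER", "23403", "", "A", "G", "", "", "",
               "GT", "0", "1", "1", "1", "1"]),
])

for key, value in mutation_frequencies(toy).items():
    print(key, value)
```

For files with more than ten thousand isolate columns, reading row by row as above keeps memory use bounded by one row at a time.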
Multidimensional scaling (MDS) is a dimensionality reduction technique for microbial ecology data analysis that represents the multivariate structure while preserving pairwise distances between samples. While its improvements have enhanced the ability to reveal data patterns by sample groups, these MDS-based methods require prior assumptions for inference, limiting their application in general microbiome analysis. In this study, we introduce a new MDS-based ordination, “F-informed MDS,” which configures the data distribution based on the F-statistic, the ratio of dispersion between groups sharing common and different characteristics. Using simulated compositional datasets, we demonstrate that the proposed method is robust to hyperparameter selection while maintaining statistical significance throughout the ordination process. Various quality metrics for evaluating dimensionality reduction confirm that F-informed MDS is comparable to state-of-the-art methods in preserving both local and ...
Multidimensional scaling informed by F-statistic: Visualizing grouped microbiome data with inference
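As an illustration of a group-dispersion F-statistic computed from pairwise distances, the sketch below implements the PERMANOVA-style pseudo-F. This is an assumption about the kind of statistic meant in this entry, not the F-informed MDS method itself, and the function name `pseudo_f` is hypothetical.

```python
from itertools import combinations


def pseudo_f(dist, labels):
    """PERMANOVA-style pseudo-F from a symmetric distance matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    a = len(groups)
    if a < 2 or n <= a:
        raise ValueError("need at least two groups and more samples than groups")
    # Total sum of squares: squared distances over all pairs, divided by n.
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    # Within-group sum of squares: the same quantity computed inside each group.
    ss_within = 0.0
    for g in groups:
        members = [i for i, lab in enumerate(labels) if lab == g]
        ss_within += sum(
            dist[i][j] ** 2 for i, j in combinations(members, 2)
        ) / len(members)
    ss_between = ss_total - ss_within
    # Ratio of between-group to within-group mean squares.
    return (ss_between / (a - 1)) / (ss_within / (n - a))


# Four points on a line forming two well-separated groups.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
print(pseudo_f(dist, ["A", "A", "B", "B"]))
```

The function divides by the within-group sum of squares, so it fails when every group has zero internal dispersion; a real analysis would also need a permutation test to attach a p-value.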
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Portuguese National Registry on low-weight newborns between 2013 and 2018, made available for research purposes. The dataset comprises 3823 unique entries recording birthweight, biological sex of the infant (1 = male; 2 = female), CRIB score (0-21), and survival (0 = survival; 1 = death).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Bioenergetics data (DEL2G) for the interactions of D614G mutants versus N501Y mutants, statistically analysed to assess the correlation between them.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This package contains synthetic datasets used to evaluate the SNVPhyl (http://snvphyl.readthedocs.io/) pipeline. It is divided into two separate datasets. Additional details on how these datasets were constructed are available at https://github.com/apetkau/snvphyl-validations.
1. e-coli-simulated-dataset: Simulated reads for evaluating SNVPhyl's SNV detection accuracy. Reads are based on an E. coli reference genome (NC_002695) plus two plasmids (NC_002128, NC_002127), which were concatenated into a single FASTA file, reference-genome/e_coli_sakai_w_plasmids.fasta. Random mutations were introduced to produce the variant genomes under variant-genomes/. Reads were simulated using ART Illumina (http://www.niehs.nih.gov/research/resources/software/biostatistics/art/) to generate the fastq files in this directory.
2. salmonella-heidelberg-contamination: Simulated reads for evaluating SNVPhyl's performance in the presence of contamination from another genomic sample. Reads for the sample SH13-001 (BioSample: SAMN04334637) were downsampled and contaminated with SH12-001 (BioSample: SAMN04334627) at percentages of 5%, 10%, 20%, and 30%.
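The contamination step in the second dataset can be pictured as mixing two read sets at a fixed proportion. The sketch below is only an illustration under stated assumptions: it treats the percentage as a fraction of the total read count, ignores read pairing, and uses a hypothetical function `contaminate` with made-up read names. The actual construction is documented in the snvphyl-validations repository.

```python
import random


def contaminate(target_reads, contaminant_reads, total, fraction, seed=0):
    """Return `total` reads, of which `fraction` are drawn from the contaminant."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    n_contam = round(total * fraction)
    n_target = total - n_contam
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    # Downsample each source without replacement, then shuffle the mixture.
    mixed = rng.sample(target_reads, n_target) + rng.sample(contaminant_reads, n_contam)
    rng.shuffle(mixed)
    return mixed


# Hypothetical read identifiers standing in for real FASTQ records.
target = [f"SH13-001_read{i}" for i in range(1000)]
contaminant = [f"SH12-001_read{i}" for i in range(1000)]

for fraction in (0.05, 0.10, 0.20, 0.30):
    mixed = contaminate(target, contaminant, total=200, fraction=fraction)
    n_contam = sum(read.startswith("SH12-001") for read in mixed)
    print(f"{fraction:.0%}: {n_contam} contaminant reads out of {len(mixed)}")
```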
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of 1763 observations, each representing a unique patient, and 12 different attributes associated with heart disease. This dataset is a critical resource for researchers focusing on predictive analytics in cardiovascular diseases.
Variables Overview:
1. Age: A continuous variable indicating the age of the patient.
2. Sex: A categorical variable with two levels ('Male', 'Female'), indicating the gender of the patient.
3. CP (Chest Pain type): A categorical variable describing the type of chest pain experienced by the patient, with categories such as 'Asymptomatic', 'Atypical Angina', 'Typical Angina', and 'Non-Angina'.
4. TRTBPS (Resting Blood Pressure): A continuous variable indicating the resting blood pressure (in mm Hg) on admission to the hospital.
5. Chol (Serum Cholesterol): A continuous variable measuring the serum cholesterol in mg/dl.
6. FBS (Fasting Blood Sugar): A binary variable where 1 represents fasting blood sugar > 120 mg/dl, and 0 otherwise.
7. Rest ECG (Resting Electrocardiographic Results): Categorizes the resting electrocardiographic results of the patient into 'Normal', 'ST Elevation', and other categories.
8. Thalachh (Maximum Heart Rate Achieved): A continuous variable indicating the maximum heart rate achieved by the patient.
9. Exng (Exercise Induced Angina): A binary variable where 1 indicates the presence of exercise-induced angina, and 0 otherwise.
10. Oldpeak (ST Depression Induced by Exercise Relative to Rest): A continuous variable indicating the ST depression induced by exercise relative to rest.
11. Slope (Slope of the Peak Exercise ST Segment): A categorical variable with levels such as 'Flat', 'Up Sloping', representing the slope of the peak exercise ST segment.
12. Target: A binary target variable indicating the presence (1) or absence (0) of heart disease.
Descriptive Statistics: The patients' ages range from 29 to 77 years, with a mean age of approximately 54 years. The resting blood pressure spans from 94 to 200 mm Hg, and the average cholesterol level is about 246 mg/dl. The maximum heart rate achieved varies widely among patients, from 71 to 202 beats per minute.
Importance for Research: This dataset covers a range of factors that could be linked to heart disease, making it a useful resource for developing predictive models. By analyzing relationships and patterns within these variables, researchers can identify key predictors of heart disease and improve the accuracy of diagnostic tools. This could lead to better preventive measures and treatment strategies, ultimately improving patient outcomes in cardiovascular health.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This archive is a database generated using the novel Virus Pop pipeline, which simulates realistic protein sequences and adds new branches to a protein phylogenetic tree. An article describing the pipeline is currently under review.
The database contains simulations of 995 different proteins from 93 virus genera, providing a total of 24,138,277 sequences in both amino acid and nucleotide form.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Assisted reproductive technologies, including in vitro fertilization (IVF), are now frequently used, and increasing evidence indicates that IVF causes gene expression changes in children and adolescents that increase the risk of metabolic diseases. Although such gene expression changes are thought to be due to IVF-induced epigenetic changes, the mechanism remains elusive. We tested whether the transcription factor ATF7, which mediates stress-induced changes in histone H3K9 tri- and di-methylation (typical marks of epigenetic silencing), is involved in the IVF-induced gene expression changes. IVF up- and down-regulated the expression of 688 and 204 genes, respectively, in the liver of 3-week-old wild-type (WT) mice, whereas 87% and 68% of these, respectively, were not changed by IVF in ATF7-deficient (Atf7−/−) mice. Genes involved in metabolism, such as pyrimidine and purine metabolism, were up-regulated in WT mice but not in Atf7−/− mice. Of the genes whose expression was up-regulated by IVF in WT mice, 37% were also up-regulated by a loss of ATF7. These results indicate that ATF7 is a key factor in establishing the memory of IVF effects on metabolic pathways.
Building prediction models based on complex omics datasets such as transcriptomics, proteomics, and metabolomics remains a challenge in bioinformatics and biostatistics. Regularized regression techniques are typically used to deal with the high dimensionality of these datasets. However, due to the presence of correlation in the datasets, it is difficult to select the best model, and application of these methods yields unstable results. We propose a novel strategy for model selection where the obtained models also perform well in terms of overall predictability. Several three-step approaches are considered, where the steps are 1) network construction, 2) clustering to empirically derive modules or pathways, and 3) building a prediction model incorporating the information on the modules. For the first step, we use weighted correlation networks and Gaussian graphical modelling. Identification of groups of features is performed by hierarchical clustering. The grouping information is included in the prediction model by using group-based variable selection or group-specific penalization. We compare the performance of our new approaches with standard regularized regression via simulations. Based on these results, we provide recommendations for selecting a strategy for building a prediction model given the specific goal of the analysis and the sizes of the datasets. Finally, we illustrate the advantages of our approach by applying the methodology to two problems, namely prediction of body mass index in the DIetary, Lifestyle, and Genetic determinants of Obesity and Metabolic syndrome study (DILGOM) and prediction of the response of each breast cancer cell line to treatment with specific drugs using a breast cancer cell line pharmacogenomics dataset.
Introduction: Deepening our understanding of the genetic mechanisms underlying Normal Hearing Function (NHF) has proven challenging, despite extensive efforts through Genome-Wide Association Studies (GWAS).
Methods: NHF was described as a set of nine quantitative traits (i.e., hearing thresholds at 0.25, 0.5, 1, 2, 4, and 8 kHz, and three pure-tone averages of thresholds at low, medium, and high frequencies). For each trait, GWAS analyses were performed on the Moli-sani cohort (n = 1,209); then, replication analyses were conducted on the Carlantino (CAR, n = 261) and Val Borbera (VBI, n = 425) cohorts. Expression levels of the most significantly associated genes were assessed using single-nucleus RNA sequencing (snRNA-seq) data on human fetal and adult inner ear tissues. Finally, for all nine NHF traits, Transcriptome-Wide Association Studies (TWAS) were performed, combining GWAS summary statistics and pre-computed gene expression weights in 12 brain tissues.
Results: GWAS on the Discovery cohort allowed the detection of 667 SNPs spanning 327 protein-coding genes at p < 10⁻⁵ across the nine NHF traits. Two loci with p < 5 × 10⁻⁸ were replicated:
1. rs112501869 within the SLC1A6 gene, encoding a brain high-affinity glutamate transporter, reached p = 6.21 × 10⁻⁹ in the 0.25 kHz trait.
2. rs73519456 within the ASTN2 gene, encoding the Astrotactin protein 2, reached genome-wide significance in three NHF traits: 0.5 kHz (p = 1.86 × 10⁻⁸), PTAL (p = 9.40 × 10⁻⁹), and PTAM (p = 3.64 × 10⁻⁸).
SnRNA-seq data analyses revealed a peculiar expression of the ASTN2 gene in the neuronal and dark cell populations, while for SLC1A6 no significant expression was detected. TWAS analyses detected that the ARF4-AS1 gene (eQTL: rs1584327) was statistically significant (p = 4.49 × 10⁻⁶) in the hippocampal tissue for the 0.25 kHz trait.
Conclusion: This study took advantage of three Italian cohorts, deeply characterized from a genetic and audiological point of view. Bioinformatics and biostatistics analyses allowed the identification of three novel candidate genes, namely SLC1A6, ASTN2, and ARF4-AS1. Functional studies and replication in larger and independent cohorts will be essential to confirm the biological role of these genes in regulating hearing function; however, these results confirm GWAS and TWAS as powerful methods for novel gene discovery, thus paving the way for a deeper understanding of the entangled genetic landscape underlying the auditory system.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Chorismate mutase is a well-known model enzyme, catalyzing the Claisen rearrangement of chorismate to prephenate. Recent high-resolution crystal structures along the reaction coordinate of this enzyme enable computational analyses at unprecedented detail. Using quantum chemical simulations, we have investigated how the catalytic reaction mechanism is affected by electrostatic and hydrogen bond interactions. Our calculations showed that the transition state was mainly stabilized electrostatically, with Arg90 playing the leading role. The effect was augmented by selective hydrogen bond formation to the transition state in the wild-type enzyme, facilitated by a small-scale local induced fit. We further identified a previously underappreciated water molecule, which separates the negative charges during the reaction. The analysis includes the wild-type enzyme and a non-natural enzyme variant, where the catalytic arginine was replaced with an isosteric citrulline residue.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The conserved RNA-binding protein Hfq has multiple regulatory roles within the prokaryotic cell, including promoting stable duplex formation between small RNAs and mRNAs, and thus hfq deletion mutants have pleiotropic phenotypes. Previous proteome and transcriptome studies of Neisseria meningitidis have generated limited insight into differential gene expression due to Hfq loss. In this study, reversed-phase liquid chromatography combined with data-independent alternate scanning mass spectrometry (LC-MSE) was utilized for rapid high-resolution quantitative proteomic analysis to further elucidate the differentially expressed proteome of a meningococcal hfq deletion mutant. Whole cell lysates of N. meningitidis serogroup B H44/76 wild type (wt) and H44/76Δhfq (Δhfq) grown in liquid growth medium were subjected to tryptic digestion. The resulting peptide mixtures were separated by LC prior to analysis by MSE. Differential expression was analyzed by Student’s t-test with control for false discovery rate (FDR). Reliable quantification of relative expression comparing wt and Δhfq was achieved with 506 proteins (20%). Upon FDR control at q ≤ 0.05, 48 up- and 59 downregulated proteins were identified. From these, 81 were identified as novel Hfq-regulated candidates, while 15 proteins were previously found by SDS-PAGE/MS and 24 with microarray analyses. Thus, using LC-MSE we have expanded the repertoire of Hfq-regulated proteins. In conjunction with previous studies, a comprehensive network of Hfq-regulated proteins was constructed, and differentially expressed proteins were found to be involved in a large variety of cellular processes. The results and comparisons with other Gram-negative model systems suggest still unidentified sRNA analogues in N. meningitidis.
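The entry does not say which FDR procedure was used. As one common choice, the Benjamini-Hochberg adjustment can be sketched as follows; the function name `benjamini_hochberg` and the p-values are hypothetical, and the study may well have used a different q-value method.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-protein p-values from t-tests (not from the study).
pvals = [0.001, 0.2, 0.03, 0.04]
qvals = benjamini_hochberg(pvals)
significant = [i for i, q in enumerate(qvals) if q <= 0.05]
print(qvals, significant)
```

In this toy case only the first protein passes q ≤ 0.05, even though three raw p-values fall below 0.05, which is the kind of filtering the abstract describes.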
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reference Viral Databases (RVDB-prot and RVDB-prot-HMM) were developed by Thomas Bigot in Marc Eloit’s Pathogen Discovery group in collaboration with the Center of Bioinformatics, Biostatistics and Integrative Biology (C3BI) at Institut Pasteur, for enhancing virus detection using next-generation sequencing (NGS) technologies. They are based on the Reference Viral DataBase, courtesy of Arifa Khan’s group at CBER, FDA: https://hive.biochemistry.gwu.edu/rvdb/. They are updated after each new release of the nucleotide database. The version number of the protein databases follows that of the original nucleic database.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Cell refractive index (RI) has been proposed as a putative cancer biomarker of great potential, being correlated with cell content and morphology, cell division rate, and membrane permeability. We used Digital Holographic Microscopy (DHM) to compare the RI and dry mass density of two B16 murine melanoma sublines of different metastatic potential. The distribution of phase shifts within the reconstructed quantitative phase images (QPIs) was analyzed statistically using bimodality coefficients. The observed correlation of RI and bimodality profile with the cells' metastatic potential was validated by a real-time impedance-based assay and clonogenic tests. We suggest that RI and bimodality analysis of QPI histograms be developed as optical biomarkers useful in label-free detection and quantitative evaluation of cell metastatic potential.
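One widely used bimodality coefficient is Sarle's, computed from skewness and kurtosis. The sketch below uses the simple population-moment form, which is an assumption: the abstract does not state which formula or small-sample correction the authors applied, and the function name and sample values are hypothetical.

```python
def bimodality_coefficient(values):
    """Sarle's bimodality coefficient from population moments (no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("values have zero variance")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # non-excess kurtosis (3 for a normal distribution)
    return (skewness ** 2 + 1) / kurtosis


# Hypothetical phase-shift samples: two separated clusters vs. a single peak.
two_peaks = [0.0, 0.0, 1.0, 1.0]
one_peak = [0, 1, 1, 2, 2, 2, 3, 3, 4]
print(bimodality_coefficient(two_peaks))
print(bimodality_coefficient(one_peak))
```

Values above 5/9 (about 0.555, the value for a uniform distribution) are conventionally read as a hint of bimodality, although heavy skew can also raise the coefficient.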
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Type III secretion systems (T3SSs) are central to pathogenesis and specifically deliver their secreted substrates (type III secreted proteins, T3SPs) into host cells. Since T3SPs play a crucial role in pathogen-host interactions, identifying them is crucial to our understanding of the pathogenic mechanisms of T3SSs. This study reports a novel and effective method for identifying the distinctive residues that are conserved differently from those of other SPs, for use in T3SP prediction. Moreover, the importance of several sequence features was evaluated, and a promising prediction model was constructed.
Results: Based on the conservation profiles constructed by a position-specific scoring matrix (PSSM), 52 distinctive residues were identified. To our knowledge, this is the first attempt to identify the distinctive residues of T3SPs. The 52 distinctive residues include all of the first 30 amino acid residues, which is consistent with previous studies reporting that the secretion signal generally occurs within the first 30 residue positions. However, the remaining 22 positions, spanning residues 30–100, were also shown by our method to contain important signal information for T3SP secretion, because the translocation of many effectors also depends on the chaperone-binding residues that follow the secretion signal. For further feature optimisation and compression, permutation importance analysis was conducted to select 62 optimal sequence features. A prediction model across 16 species was developed using random forest to classify T3SPs and non-T3 SPs, with a high area under the receiver operating characteristic curve of 0.93 in 10-fold cross-validation and an accuracy of 94.29% on the test set. Moreover, when evaluated on a common independent dataset, our method outperforms all the others published to date. Finally, novel, experimentally confirmed T3 effectors were used to further demonstrate the model’s correct application.
The model and all data used in this paper are freely available at http://cic.scu.edu.cn/bioinformatics/T3SPs.zip.
Amyotrophic lateral sclerosis (ALS) is a progressive and fatal neurodegenerative disease. While genetics and other factors contribute to ALS pathogenesis, critical knowledge is still missing, and validated biomarkers for monitoring disease activity have not yet been identified. To address these aspects, we carried out this study with the primary aim of identifying possible miRNA/mRNA dysregulation associated with the sporadic form of the disease (sALS). Additionally, we explored miRNAs as modulating factors of the observed clinical features. The study included 56 sALS patients and 20 healthy controls (HCs). We analyzed the peripheral blood samples of sALS patients and HCs with high-throughput next-generation sequencing followed by an integrated bioinformatics/biostatistics analysis. Results showed that 38 miRNAs (let-7a-5p, let-7d-5p, let-7f-5p, let-7g-5p, let-7i-5p, miR-103a-3p, miR-106b-3p, miR-128-3p, miR-130a-3p, miR-130b-3p, miR-144-5p, miR-148a-3p, miR-148b-3p, miR-15a-5p, miR-15b-5p, miR-151a-5p, miR-151b, miR-16-5p, miR-182-5p, miR-183-5p, miR-186-5p, miR-22-3p, miR-221-3p, miR-223-3p, miR-23a-3p, miR-26a-5p, miR-26b-5p, miR-27b-3p, miR-28-3p, miR-30b-5p, miR-30c-5p, miR-342-3p, miR-425-5p, miR-451a, miR-532-5p, miR-550a-3p, miR-584-5p, miR-93-5p) were significantly downregulated in sALS. We also found that different miRNA profiles characterized the bulbar/spinal onset and the progression rate. This observation supports the hypothesis that miRNAs may impact the phenotypic expression of the disease. Genes known to be associated with ALS (e.g., PARK7, C9orf72, ALS2, MATR3, SPG11, ATXN2) were confirmed to be dysregulated in our study. We also identified other potential candidate genes such as LGALS3 (implicated in neuroinflammation) and PRKCD (activated in mitochondrial-induced apoptosis). Some of the downregulated genes are involved in molecular binding to ions (i.e., metals, zinc, magnesium) and in ion-related functions. The genes that we found upregulated were involved in the immune response, oxidation–reduction, and apoptosis. These findings may have important implications, e.g., for monitoring sALS progression, and therefore represent a significant advance in the elucidation of the disease’s underlying molecular mechanisms. The extensive multidisciplinary approach we applied in this study was critically important for its success, especially in complex disorders such as sALS, wherein access to genetic background is a major limitation.