Archive of aggregated information about sequence variation and its relationship to human health. Provides reports of relationships among human variations and phenotypes along with supporting evidence. Submissions from clinical testing labs, research labs, locus-specific databases, expert panels and professional societies are welcome. Collects reports of variants found in patient samples, assertions made regarding their clinical significance, information about the submitter, and other supporting data. Alleles described in submissions are mapped to reference sequences and reported according to the HGVS standard.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
ClinVar variants
For more information check out our paper and repository.
Usage
Pandas
import pandas as pd
df = pd.read_parquet("hf://datasets/songlab/clinvar/test.parquet")
Polars
import polars as pl
df = pl.read_parquet("https://huggingface.co/datasets/songlab/clinvar/resolve/main/test.parquet")
Datasets
from datasets import load_dataset
dataset = load_dataset("songlab/clinvar", split="test")
ClinVar aggregates information about genomic variation and its relationship to human health.
This analysis was performed with Jupyter notebooks, so all code is in ipynb files. We recommend running these files using Jupyter, which can easily be installed using conda. The notebooks should function in a Python 3.8 environment. Note that the visualizations in the three Floweaver*.ipynb files will work only in a Jupyter Notebook environment and not in a JupyterLab environment. If you have any questions about running these files, please contact asharo@ucsc.edu and brenner@compbio.berkeley.edu. The following Python packages are required to run these notebooks:
pandas, cyvcf2, numpy, matplotlib, pickle, joblib, floweaver, ipysankeywidget
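Before opening the notebooks, it can help to confirm that the listed packages are importable in the active environment. The following is a minimal sketch, not part of the original repository; it assumes each package's import name matches the name in the list above, and the helper name missing_packages is hypothetical.

```python
import importlib.util

# Import names are assumed to match the package names listed above.
REQUIRED = [
    "pandas", "cyvcf2", "numpy", "matplotlib",
    "pickle", "joblib", "floweaver", "ipysankeywidget",
]

def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    # find_spec looks a top-level module up without importing it.
    return [name for name in names if importlib.util.find_spec(name) is None]

print("Missing packages:", missing_packages(REQUIRED))
```

An empty list suggests the environment is ready; any names printed would need to be installed first, for example with conda or pip.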
To reproduce the analysis in full, and to understand the logical flow, you must run the notebooks in the order below. However, if you are interested in a specific analysis, all intermediate files have also been provided, so in practice you may run notebooks out of order. Due to restrictions on HGMD data sharing, primary and intermediate HGMD files are not ...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variant effect predictions from up to 85 variant effect prediction tools, conservation metrics and substitution matrices for 896,930 variants from the ClinVar or gnomAD databases. ClinVar pathogenic and likely pathogenic variants with a rating of 1* or higher are indicated with a label of '1' in the 'label' column. gnomAD variants not present in ClinVar are indicated with a label of '0' in the 'label' column.
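One plausible use of the binary 'label' column is to score how well a single predictor separates label-1 from label-0 variants. The sketch below computes an area under the ROC curve by direct pairwise comparison on toy data; the scores are invented for illustration, the function name auroc is hypothetical, and the file's actual score column names are not assumed here.

```python
def auroc(labels, scores):
    """Probability that a random label-1 variant outscores a random label-0 variant.

    Ties count as half a win. This pairwise form is quadratic in the number of
    variants, so it is only suitable for small illustrative inputs.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one variant of each label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy rows of (label, made-up predictor score); not taken from the dataset.
rows = [(1, 0.9), (1, 0.6), (0, 0.7), (0, 0.2), (0, 0.1)]
labels = [label for label, _ in rows]
scores = [score for _, score in rows]
print(round(auroc(labels, scores), 3))  # 0.833
```

For the full table of 896,930 variants, a rank-based implementation from an established statistics library would be the more practical choice.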
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variants in the cystic fibrosis transmembrane conductance regulator gene (CFTR) result in cystic fibrosis, a lethal autosomal recessive disorder. Missense variants that alter a single amino acid in the CFTR protein are among the most common cystic fibrosis variants, yet tools for accurately predicting molecular consequences of missense variants have been limited to date. AlphaMissense (AM) is a new technology that predicts the pathogenicity of missense variants based on dual learned protein structure and evolutionary features. Here, we evaluated the ability of AM to predict the pathogenicity of CFTR missense variants. AM predicted a high pathogenicity for CFTR residues overall, resulting in a high false positive rate and fair classification performance on CF variants from the CFTR2.org database. AM pathogenicity score correlated modestly with pathogenicity metrics from persons with CF including sweat chloride level, pancreatic insufficiency rate, and Pseudomonas aeruginosa infection rate. Correlation was also modest with CFTR trafficking and folding competency in vitro. By contrast, the AM score correlated well with CFTR channel function in vitro, demonstrating that the dual structure and evolutionary training approach learns important functional information despite lacking such data during training. Different performance across metrics indicated AM may determine if polymorphisms in CFTR are recessive CF variants yet cannot differentiate mechanistic effects or the nature of pathophysiology. Finally, AM predictions offered limited utility to inform on the pharmacological response of CF variants (i.e., theratype). Development of new approaches to differentiate the biochemical and pharmacological properties of CFTR variants is therefore still needed to refine the targeting of emerging precision CF therapeutics.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variant effect predictions and DMS data for ClinVar and gnomAD variants in BRCA1, P53, CALM1, PTEN, HRAS and TPK1.
Database of non-identifiable summary data on all variants identified at the Children's Mercy Genomic Medicine Center, including project data of Genomic Answers for Kids. The database can be searched and viewed with genomic annotations, population database cross-references such as ClinVar, gnomAD and dbSNP, ACMG curations and local allele frequency. Variant data are available for bulk download as annotated VCF.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For coding variant effect predictions (VEPs), our approach centers on clinical variant sets specifically related to inherited cardiomyopathies (CM) and arrhythmias (ARM). We utilize a pre-compiled dataset composed of rare missense pathogenic and benign variants, categorized using a cohort-based approach for diseases such as cardiomyopathy and arrhythmias, as detailed in the previous report by Zhang et al. The ClinVar CM and ARM datasets include all missense variants in CM and ARM, respectively, extracted from ClinVar (Landrum et al.). For non-coding VEPs, our focus shifts to splicing-related variants, utilizing a dataset from the multiplexed assay for exon recognition by Chong et al., which highlights the significant impact of rare genetic variants on splicing disruptions. Similarly, the ClinVar Splicing dataset, compiled from ClinVar, encompasses all benign sequences and pathogenic variants pertinent to splicing.
For the ClinVar CM and ARM datasets, we translated the DNA sequences into protein sequences using the human genome assembly hg38 from https://www.ncbi.nlm.nih.gov/grc/human. We employed the GFF file MANE.GRCh38.v1.1.ensembl_genomic.gff.gz from https://www.ncbi.nlm.nih.gov/refseq/MANE to annotate coding versus non-coding regions for each gene, as only coding DNA sequences are translated into proteins. Additionally, protein domains, cataloged in the Pfam database (Finn et al.), are essential for the functional characterization of proteins. These domains are identified by aligning the translated sequences to known domain structures, thereby facilitating deeper insights into protein function.
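The translation step described above can be illustrated with a minimal sketch. It assumes the coding sequence (CDS) has already been assembled from the annotated coding regions and is given 5' to 3' on the coding strand; the helper names translate and apply_snv and the toy sequence are hypothetical and are not the authors' pipeline. Strand handling, exon stitching and GFF parsing are deliberately left out.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(cds):
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an ambiguous codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

def apply_snv(cds, pos, alt):
    """Substitute one base; pos is 1-based within the CDS."""
    return cds[:pos - 1] + alt + cds[pos:]

ref_cds = "ATGGCCTAA"                 # toy CDS: Met-Ala-stop
alt_cds = apply_snv(ref_cds, 4, "A")  # GCC -> ACC, an Ala-to-Thr missense change
print(translate(ref_cds), translate(alt_cds))  # MA MT
```

Comparing the reference and alternate protein strings position by position then gives the amino acid substitution for a missense variant.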
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All JPH2 variants described in literature and rare JPH2 variants submitted to ClinVar Database with phenotype information.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ENCODES a protein that exhibits cupric reductase (NADH) activity (ortholog); ferric-chelate reductase (NADPH) activity (ortholog); heme binding (ortholog); INVOLVED IN copper ion import (ortholog); ASSOCIATED WITH pleomorphic xanthoastrocytoma (ortholog); FOUND IN endosome (ortholog); plasma membrane (ortholog); INTERACTS WITH (+)-schisandrin B; 1-naphthyl isothiocyanate; 2,3,7,8-tetrachlorodibenzodioxine
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ENCODES a protein that exhibits identical protein binding; G protein-coupled receptor activity involved in regulation of postsynaptic membrane potential (ortholog); GABA receptor activity (ortholog); INVOLVED IN negative regulation of synaptic plasticity (ortholog); phospholipase C-activating G protein-coupled receptor signaling pathway (ortholog); regulation of postsynapse organization (ortholog); ASSOCIATED WITH Colorectal Neoplasms (ortholog); Neurodevelopmental Disorders (ortholog); prostate cancer (ortholog); FOUND IN postsynaptic membrane; GABA-ergic synapse (ortholog); glutamatergic synapse (ortholog); INTERACTS WITH (S)-nicotine; 2,3,7,8-tetrachlorodibenzodioxine; 2,3,7,8-Tetrachlorodibenzofuran
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ENCODES a protein that exhibits beta-catenin binding (inferred); transcription coregulator activity (inferred); INVOLVED IN regulation of transcription by RNA polymerase II (inferred); PARTICIPATES IN RNA polymerase II transcription pathway; ASSOCIATED WITH Developmental Disease (ortholog); genetic disease (ortholog); glycogen storage disease XV (ortholog); FOUND IN mediator complex (inferred); nucleus (inferred); INTERACTS WITH 17alpha-ethynylestradiol; 2,3,7,8-tetrachlorodibenzodioxine; 4,4'-sulfonyldiphenol
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Characteristics of combined asthma clusters with high, low, and average PCD-GRS values.
MedGen is NCBI's portal to information about conditions and phenotypes related to Medical Genetics. Terms from the NIH Genetic Testing Registry (GTR), UMLS, HPO, Orphanet, ClinVar and other sources are aggregated into concepts, each of which is assigned a unique identifier and a preferred name and symbol. The core content of the record may include names, identifiers used by other databases, mode of inheritance, clinical features, and map location of the loci affecting the disorder. The concept identifier (CUI) is used to aggregate information about that concept, similar to the way NCBI Gene serves as a gateway to gene-related information.
MedGen provides links to such resources as: Genetic tests registered in the NIH Genetic Testing Registry (GTR), GeneReviews, ClinVar, OMIM, Related genes, Disorders with similar clinical features, Medical and research literature, Practice guidelines, Consumer resources, Ontologies such as HPO and ORDO.
Links to the GTR, GeneReviews, and Practice Guidelines are based on curation by NCBI staff. Other data feeds are automated, but reviewed by NCBI staff and informed by feedback from the community.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ENCODES a protein that exhibits protein homodimerization activity; identical protein binding (ortholog); INVOLVED IN regulation of receptor internalization; actin cytoskeleton organization (ortholog); cell-matrix adhesion (ortholog); ASSOCIATED WITH acrodermatitis (ortholog); pleomorphic xanthoastrocytoma (ortholog); FOUND IN cytosol; cell cortex (ortholog); cytoplasm (ortholog); INTERACTS WITH 1,2,4-trimethylbenzene; 2,3,7,8-tetrachlorodibenzodioxine; 2,4-dinitrotoluene
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As part of a broader collaborative network of exome sequencing studies, we developed a jointly called data set of 5,685 Ashkenazi Jewish exomes. We make publicly available a resource of site and allele frequencies, which should serve as a reference for medical genetics in the Ashkenazim (hosted in part at https://ibd.broadinstitute.org, also available in gnomAD at http://gnomad.broadinstitute.org). We estimate that 34% of protein-coding alleles present in the Ashkenazi Jewish population at frequencies greater than 0.2% are significantly more frequent (mean 15-fold) than their maximum frequency observed in other reference populations. Arising via a well-described founder effect approximately 30 generations ago, this catalog of enriched alleles can contribute to differences in genetic risk and overall prevalence of diseases between populations. As validation we document 148 AJ enriched protein-altering alleles that overlap with "pathogenic" ClinVar alleles (table available at https://github.com/macarthur-lab/clinvar/blob/master/output/clinvar.tsv), including those that account for 10–100 fold differences in prevalence between AJ and non-AJ populations of some rare diseases, especially recessive conditions, including Gaucher disease (GBA, p.Asn409Ser, 8-fold enrichment); Canavan disease (ASPA, p.Glu285Ala, 12-fold enrichment); and Tay-Sachs disease (HEXA, c.1421+1G>C, 27-fold enrichment; p.Tyr427IlefsTer5, 12-fold enrichment). We next sought to use this catalog, of well-established relevance to Mendelian disease, to explore Crohn's disease, a common disease with an estimated two to four-fold excess prevalence in AJ. 
We specifically attempt to evaluate whether strong acting rare alleles, particularly protein-truncating or otherwise large effect-size alleles, enriched by the same founder effect, contribute excess genetic risk to Crohn's disease in AJ, and find that ten rare genetic risk factors in NOD2 and LRRK2 that are enriched in AJ (p < 0.005), including several novel contributing alleles, show evidence of association to CD. Independently, we find that genomewide common variant risk defined by GWAS shows a strong difference between AJ and non-AJ European control population samples (0.97 s.d. higher, p
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Identifying clinically relevant genetic variants is crucial for a fast and reliable genetic diagnosis. With exome sequencing now standard, diagnostic labs need a list of genes associated with rare diseases that is, in principle, accurate to the day. Manual curation efforts are slow and often disease specific, while efforts relying on single sources are too inaccurate and result in too many false-positive genes.
Methods: We established the MorbidGenes panel based on a list of publicly available databases: OMIM, PanelApp, SysNDD, ClinVar, HGMD and GenCC. A simple voting algorithm includes genes for which the sources provide sufficient evidence. By providing an API endpoint, users can directly access the list and metadata for all relevant information on their genes of interest.
Results: The panel currently includes 4,677 genes (v.2022-02.1, as of February 2022) with minimally sufficient evidence on disease causality to classify them as diagnostically relevant. Reproducible filtering and versioning allow the integration into diagnostic pipelines. In-house implementation successfully removed false-positive genes and reduced time requirements in routine exome diagnostics. The panel is updated monthly, and we will integrate novel sources on a regular basis.
Conclusion: The MorbidGenes panel is a comprehensive and open overview of clinically relevant genes based on a growing list of sources. It supports genetic diagnostics labs by providing diagnostically relevant genes in a QM-conformant format on a monthly basis, with more frequent updates planned. Once genomes are standard, diagnostically relevant non-coding regions will also be included.
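The voting logic mentioned in the Methods can be sketched as follows. This is an illustrative assumption rather than the published MorbidGenes rule set: the function name vote_panel, the equal weighting of sources and the min_votes threshold of 2 are all hypothetical, and the gene names are placeholders.

```python
from collections import Counter

# Source databases named in the abstract above.
SOURCES = ("OMIM", "PanelApp", "SysNDD", "ClinVar", "HGMD", "GenCC")

def vote_panel(source_genes, min_votes=2):
    """Return genes listed by at least min_votes sources (hypothetical threshold)."""
    votes = Counter()
    for source in SOURCES:
        # Each source casts at most one vote per gene.
        for gene in set(source_genes.get(source, ())):
            votes[gene] += 1
    return sorted(gene for gene, n in votes.items() if n >= min_votes)

# Placeholder gene lists per source; not real database content.
example = {
    "OMIM": {"GENE_A", "GENE_B"},
    "PanelApp": {"GENE_A"},
    "ClinVar": {"GENE_B", "GENE_C"},
    "GenCC": {"GENE_A"},
}
print(vote_panel(example))  # ['GENE_A', 'GENE_B']
```

A real implementation would likely apply source-specific evidence cutoffs before a gene counts as a vote; the sketch treats every listing as equally strong.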
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
INVOLVED IN axonogenesis (ortholog); gephyrin clustering involved in postsynaptic density assembly (ortholog); neurotransmitter-gated ion channel clustering (ortholog); ASSOCIATED WITH intellectual disability (ortholog); FOUND IN postsynaptic density membrane; cell surface (ortholog); GABA-ergic synapse (ortholog); INTERACTS WITH 2,3,7,8-tetrachlorodibenzodioxine; 2,3,7,8-Tetrachlorodibenzofuran; 3,4-methylenedioxymethamphetamine
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ENCODES a protein that exhibits serine C-palmitoyltransferase activity (ortholog); INVOLVED IN response to immobilization stress; adipose tissue development (ortholog); ceramide biosynthetic process (ortholog); PARTICIPATES IN Fabry disease pathway; Gaucher's disease pathway; Krabbe disease pathway; ASSOCIATED WITH Charcot-Marie-Tooth disease (ortholog); COVID-19 (ortholog); genetic disease (ortholog); FOUND IN serine palmitoyltransferase complex (ortholog); INTERACTS WITH (+)-schisandrin B; 1-benzylpiperazine; 3-chloropropane-1,2-diol