Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This file contains the human gene and disease names used for text mining in the DISEASES database v2.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset of our EMNLP 2021 paper:
Graphine: A Dataset for Graph-aware Terminology Definition Generation.
Please read the "readme.md" in it for the format of the dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reduced dictionary used by PiTagger for participation in the BioCreative V.5 BeCalm TIPS task.
This is the HQSNP DB (high-quality SNP database) developed by the CHG bioinformatics group. A high-quality SNP is defined as a SNP that has allele frequency or genotyping data. The majority of the HQSNPs come from HapMap; others come from JSNP (Japanese SNP database), TSC (The SNP Consortium), Affymetrix 120K SNP, and Perlegen SNP. Four kinds of SNP search are available:
* Get SNPs by dbSNP rs#: Choose this search if you have already selected a list of SNPs and just want to get the SNP information. The program generates an Excel file containing the SNP flanking sequence, variation, quality, function, etc. The Excel file has 10 highlighted fields; you can send only those highlighted fields to Illumina to get a SNP pre-score. (The same fields are presented in the other types of searches as well.)
* Get gene SNPs by gene names: Choose this search if you have a list of gene names and want to get the SNP information in these genes. The gene name can be an official gene symbol, Ensembl gene ID, RefSeq accession ID, LocusLink number, etc.
* Get gene SNPs by genome regions: Choose this search if you have a list of genome regions and want to get all gene SNP information in these regions. The software finds all the Ensembl genes in the regions and the SNPs associated with each Ensembl gene.
* Get genome scan SNPs by genome regions: Choose this search if you have a list of genome regions and want to get evenly spaced SNPs in these regions.
A SNP selection tool (SNPselector) was built upon HQSNP. It took a SNP ID list, gene name list, or genome region list as input and searched SNPs for genome scans or gene association studies. It could take an optional ABI SNP file (exported from the ABI SNP search web page) as input to check whether a candidate SNP is available from ABI. It could also take an optional Illumina SNP pre-score file as input to select SNPs for the Illumina SNP assay. It generated results sorted by tag SNP in LD block, SNP quality, SNP function, SNP regulatory potential, and SNP mutation risk. SNPselector is now retired from public use (as of September 30, 2010).
http://opendatacommons.org/licenses/dbcl/1.0/
This dataset was collected from the web-based 'genome.jp' resource via its FTP site: https://www.genome.jp/ftp/kegg/ It includes bioinformatics and medical databases in the pathway, medical, genome, medicus, drug, and other categories.
This dataset includes 5 .txt files:
dgroup: 'Entry_ID', 'name', 'type' and 'member' information about drugs
disease: 'Entry_ID', 'name', 'subgroup', 'supergroup', 'description', 'genes' and 'category' information about drugs and related diseases
drug: molecular information of drugs
network: networks of gene interactions, with their 'class' and 'gene' information
variant: variants of the genes, with the 'gene variant id', 'gene name', 'gene definition' and 'variation type' categories
Important definitions
1. Signaling pathways: Describes a series of chemical reactions in which a group of molecules in a cell work together to control a cell function, such as cell division or cell death. A cell receives signals from its environment when a molecule, such as a hormone or growth factor, binds to a specific protein receptor on or in the cell. After the first molecule in the pathway receives a signal, it activates another molecule. This process is repeated through the entire signaling pathway until the last molecule is activated and the cell function is carried out. Abnormal activation of signaling pathways may lead to diseases, such as cancer. Drugs are being developed to target specific molecules involved in these pathways. These drugs may help keep cancer cells from growing. (https://www.cancer.gov/publications/dictionaries/cancer-terms/def/signaling-pathway)
2. Variants of gene: An alteration in the most common DNA nucleotide sequence. The term variant can be used to describe an alteration that may be benign, pathogenic, or of unknown significance. The term variant is increasingly being used in place of the term mutation. (https://www.cancer.gov/publications/dictionaries/genetics-dictionary/def/variant)
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Currently, no consensus exists regarding criteria required to designate a protein within a proteomic data set as a cell surface protein. Most published proteomic studies rely on varied ontology annotations or computational predictions instead of experimental evidence when attributing protein localization. Consequently, standardized approaches for analyzing and reporting cell surface proteome data sets would increase confidence in localization claims and promote data use by other researchers. Recently, we developed Veneer, a web-based bioinformatic tool that analyzes results from cell surface N-glycocapture workflows, the most popular cell surface proteomics method used to date that generates experimental evidence of subcellular location. Veneer assigns protein localization based on defined experimental and bioinformatic evidence. In this study, we updated the criteria and process for assigning protein localization and added new functionality to Veneer. Results of Veneer analysis of 587 cell surface N-glycocapture data sets from 32 published studies demonstrate the importance of applying defined criteria when analyzing cell surface proteomics data sets and exemplify how Veneer can be used to assess experimental quality and facilitate data extraction for informing future biological studies and annotating public repositories.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
The somatic variant calling workflow included in this case study is designed by Blue Collar Bioinformatics (bcbio), a community-driven initiative to develop best-practice pipelines for variant calling, RNA-seq and small RNA analysis workflows. According to the documentation, the goal of this project is to facilitate the automated analysis of high throughput data by making the resources quantifiable, analyzable, scalable, accessible and reproducible.
All the underlying tools are containerized, facilitating software use in the workflow. The somatic variant calling workflow defined in CWL is available on GitHub and equipped with a well-defined test dataset.
This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance; see https://w3id.org/cwl/prov/0.5.0 or use https://pypi.org/project/cwlprov/ to explore it.
Apicomplexa tick-borne hemoparasites, including Babesia bovis, Babesia microti, and Theileria equi, are responsible for bovine and human babesiosis and equine theileriosis, respectively. These parasites of vast medical, epidemiological, and economic impact have complex life cycles in their vertebrate and tick hosts. Large gaps in knowledge concerning the mechanisms used by these parasites for gene regulation remain. Regulatory genes coding for DNA binding proteins such as members of the Api-AP2, HMG, and Myb families are known to play crucial roles as transcription factors. Although the repertoire of Api-AP2 has been defined and a HMG gene was previously identified in the B. bovis genome, these regulatory genes have not been described in detail in B. microti and T. equi. In this study, comparative bioinformatics was used to: (i) identify and map genes encoding these transcription factors among three parasites’ genomes; (ii) identify a previously unreported HMG gene in B. microti; (iii) define a repertoire of eight conserved Myb genes; and (iv) identify AP2 correlates among B. bovis and the better-studied Plasmodium parasites. Searching the available transcriptome of B. bovis defined patterns of transcription of these three gene families in B. bovis erythrocyte stage parasites. Sequence comparisons show conservation of functional domains and general architecture in the AP2, Myb, and HMG proteins, which may be significant for the regulation of common critical parasite life cycle transitions in B. bovis, B. microti, and T. equi. A detailed understanding of the role of gene families encoding DNA binding proteins will provide new tools for unraveling regulatory mechanisms involved in B. bovis, B. microti, and T. equi life cycles and environmental adaptive responses and potentially contribute to the development of novel convergent strategies for improved control of babesiosis and equine piroplasmosis.
Collection of gene expression and similar datasets related to brain tumors, in particular medulloblastoma. Medulloblastoma is the most common malignant brain tumor in childhood. Files are typically CSV, genes x samples.
GSE124814 WOW! Integration of many (all?) medulloblastoma datasets(!): 1641 samples, of which 1350 samples represent primary medulloblastomas and 291 samples represent normal brain
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE124814 Weishaupt H, Johansson P, Sundström A, Lubovac-Pilav Z et al. Batch-normalization of cerebellar and medulloblastoma gene expression datasets utilizing empirically defined negative control genes. Bioinformatics 2019 Sep 15;35(18):3357-3364. PMID: 30715209 https://doi.org/10.1093/bioinformatics/btz066 We downloaded a total of 1796 CEL files from previously published GEO or ArrayExpress records: GSE85217(n=763), GSE25219(n=154), GSE60862(n=130), GSE12992(n=40), GSE67850(n=22), GSE10327(n=62), GSE30074(n=30), E-MTAB-292(n=19), GSE74195(n=30), GSE37418(n=76), GSE4036(n=14), GSE62803(n=52), GSE21140(n=103), GSE37382(n=50), GSE22569(n=24), GSE35974(n=50), GSE73038(n=46), GSE50161(n=24), GSE3526(n=9), GSE50765(n=12), GSE49243(n=58), GSE41842(n=19), GSE44971(n=9). After preprocessing of all CEL files, we averaged the expression profiles of samples that mapped to the same patient in a single dataset, producing a final expression array comprising 1641 samples, of which 1350 samples represent primary medulloblastomas and 291 samples represent normal brain (cerebellum/upper rhombic lip). Also discussed in paper: A transcriptome-based classifier to determine molecular subtypes in medulloblastoma https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008263
GSE85217 (Cavalli ... Taylor) 763 samples, 2016 (Affymetrix Human Gene 1.1 ST Array) Cavalli FMG, Remke M, Rampasek L, Peacock J et al. Intertumoral Heterogeneity within Medulloblastoma Subgroups. Cancer Cell 2017 Jun 12;31(6):737-754.e6. PMID: 28609654 Ramaswamy V, Taylor MD. Bioinformatic Strategies for the Genomic and Epigenomic Characterization of Brain Tumors. Methods Mol Biol 2019;1869:37-56. PMID: 30324512 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85217
GSE202043 (Pomeroy) 214 samples, 2011 (Expression profiling by array) Cho YJ, Tsherniak A, Tamayo P, Santagata S et al. Integrative genomic analysis of medulloblastoma identifies a molecular subgroup that drives poor clinical outcome. J Clin Oncol 2011 Apr 10;29(11):1424-30. PMID: 21098324 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE202043
GSE12992 (Fattet ... Delattre) 72 samples, 2009 (Expression profiling by array) Fattet S, Haberler C, Legoix P, Varlet P et al. Beta-catenin status in paediatric medulloblastomas: correlation of immunohistochemical expression with mutational status, genetic profiles, and clinical characteristics. J Pathol 2009 May;218(1):86-94. PMID: 19197950 A series of 72 pediatric medulloblastoma tumors has been studied at the genomic level (array-CGH), screened for CTNNB1 mutations and beta-catenin expression (immunohistochemistry). A subset of 40 tumor samples has been analyzed at the RNA expression level (Affymetrix HG U133 Plus 2.0). https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE12992
GSE37382 (Northcott ... Taylor) 2012 (Expression profiling by array, Affymetrix Human Gene 1.1 ST Array profiling of 285 primary medulloblastoma samples.) Northcott PA, Shih DJ, Peacock J, Garzia L et al. Subgroup-specific structural variation across 1,000 medulloblastoma genomes. Nature 2012 Aug 2;488(7409):49-56. PMID: 22832581 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37382
GSE10327 (M. Kool) 62 samples, 2008 (Expression profiling by array) (beware: it is sometimes referred to as GSE10237 in the original paper and several references; that accession is an error). Kool M, Koster J, Bunt J, Hasselt NE et al. Integrated genomics identifies five medulloblastoma subtypes with distinct genetic profiles, pathway signatures and clinicopathological features. PLoS One 2008 Aug 28;3(8):e3088. PMID: 18769486 Rack PG, Ni J, Payumo AY, Nguyen V et al. Arhgap36-dependent activation of Gli transcription factors. Proc Natl Acad Sci U S A 2014 Jul 29;111(30):11061-6. PMID: 25024229 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE10327
Other datasets (not yet loaded):
GSE37385 (47.1 Gb, 2012) (Expression profiling by array, Genome variation profiling by SNP array, SNP genotyping by SNP array) Northcott PA, Shih DJ, Peacock J, Garzia L et al. Subgroup-specific structural variation across 1,000 medulloblastoma genomes. Nature 2012 Aug 2;488(7409):49-56. PMID: 22832581 Here we report somatic copy number aberrations (SCNAs) in 1087 unique medulloblastomas. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37385
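The sample counts quoted for GSE124814 in this entry can be cross-checked with a few lines of Python. This is only an arithmetic sanity check of the per-accession numbers as they are listed in the description; the variable names are illustrative choices and not part of the dataset.

```python
# Per-accession CEL file counts, copied from the GSE124814 description in this entry.
cel_counts = {
    "GSE85217": 763, "GSE25219": 154, "GSE60862": 130, "GSE12992": 40,
    "GSE67850": 22, "GSE10327": 62, "GSE30074": 30, "E-MTAB-292": 19,
    "GSE74195": 30, "GSE37418": 76, "GSE4036": 14, "GSE62803": 52,
    "GSE21140": 103, "GSE37382": 50, "GSE22569": 24, "GSE35974": 50,
    "GSE73038": 46, "GSE50161": 24, "GSE3526": 9, "GSE50765": 12,
    "GSE49243": 58, "GSE41842": 19, "GSE44971": 9,
}

# Total number of downloaded CEL files (before per-patient averaging).
total_cel = sum(cel_counts.values())

# Final array: primary medulloblastomas plus normal brain samples.
final_samples = 1350 + 291

print(total_cel, final_samples)
```

The difference between the two totals reflects the averaging of samples that mapped to the same patient, as stated in the description.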
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the development of technology, an enormous amount of sequencing data is being generated rapidly. However, transforming this data into patient care is a critical challenge. There are two difficulties: how to integrate functional information into mutation interpretation and how to make the integration easy to apply. One solution is to visualize amino acid changes with protein structure and function on a web app platform. There are multiple existing tools for plotting mutations, but the majority of them require programming skills that are not a common background for clinicians or researchers. Furthermore, recurrent mutations are the focus and the recurrence cutoff varies, yet none of the current software offers a user-defined cutoff. Thus, we developed this user-friendly web-based tool, Mutplot (https://bioinformaticstools.shinyapps.io/lollipop/). Mutplot retrieves up-to-date domain information from the protein resource UniProt (https://www.uniprot.org/), integrates the submitted mutation information and produces lollipop diagrams with annotations and highlighted candidates. It offers flexible output options. For data that follows security standards, the app can also be hosted on web servers inside a firewall or on computers without internet access that store the UniProt database locally. Altogether, Mutplot is an excellent tool for visualizing protein mutations, especially for clinicians or researchers without any bioinformatics background.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
spliz_*:
Separate tables with the SpliZ and SpliZVD score for each cell and gene for each dataset. The cell, gene, cell type, SpliZ, and SpliZVD are given by the cell, geneR1A_uniq, ontology, scZ, and svd_z0 columns respectively.
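The column mapping described for the spliz_* tables could be applied when reading a table, roughly as sketched below. This is a minimal sketch that assumes the tables are tab-delimited text with a header row, which the description does not state; the function name `read_spliz` and the demo values are made up for illustration.

```python
import csv
import io

# Raw column names in the spliz_* tables mapped to the quantity each one holds
# (mapping taken from the description; the readable names are our own choice).
COLUMN_MEANING = {
    "cell": "cell",
    "geneR1A_uniq": "gene",
    "ontology": "cell_type",
    "scZ": "SpliZ",
    "svd_z0": "SpliZVD",
}

def read_spliz(handle, delimiter="\t"):
    """Yield one dict per row, keyed by readable names instead of raw column names.

    The tab delimiter is an assumption; pass another delimiter if the files differ.
    """
    for row in csv.DictReader(handle, delimiter=delimiter):
        yield {new: row[old] for old, new in COLUMN_MEANING.items()}

# Tiny invented table, used only to show the call; these are not real scores.
demo = io.StringIO(
    "cell\tgeneR1A_uniq\tontology\tscZ\tsvd_z0\n"
    "cell_1\tGENE_A\tT cell\t0.5\t-0.2\n"
)
rows = list(read_spliz(demo))
print(rows[0])
```

Values are returned as strings; converting SpliZ and SpliZVD to floats is left to the caller because the handling of missing values in the real files is not documented here.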
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Detailed results for Blekhman’s RNA-seq count data. (a) Silhouette indices (s_i) for each sample i and the average (AS). The sample names (A1, A2, A3, B1, B2, or B3) for i correspond to those shown in Fig. 1b. (b) PDEG values at various FDR thresholds (1%, 5%, 10%, 20%, 30%, and 40% FDR). The values at 10% FDR were the same as those shown in Fig. 1b. (c) Percentages of true DEGs (PtrueDEG), defined as PDEG × (1 − FDR threshold), at corresponding FDR thresholds shown in (b). (XLSX 19 kb)
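The definition of PtrueDEG in (c) is a one-line calculation, sketched here in Python. The function name and the example PDEG value of 20% are hypothetical and are not taken from the spreadsheet; only the formula and the list of FDR thresholds come from the description.

```python
def p_true_deg(p_deg, fdr_threshold):
    """Return PtrueDEG = PDEG x (1 - FDR threshold).

    p_deg is a percentage (0-100); fdr_threshold is a fraction (e.g. 0.10 for 10% FDR).
    """
    return p_deg * (1.0 - fdr_threshold)

# FDR thresholds listed in the description, expressed as fractions.
fdr_thresholds = [0.01, 0.05, 0.10, 0.20, 0.30, 0.40]

# Hypothetical PDEG value used only to illustrate the formula.
example_p_deg = 20.0
table = {fdr: p_true_deg(example_p_deg, fdr) for fdr in fdr_thresholds}
for fdr, value in table.items():
    print(f"FDR {fdr:.0%}: PtrueDEG = {value:.2f}%")
```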
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A summary of the criteria that would define a general genomics workbench environment, and suggested implications on technical requirements.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R-codes for analyses. This zipped file includes a total of 23 R-code files. Results can be obtained by executing scripts in the order of the serial numbers XX in the filename “rcode_XX_...” Note that two files (“rcode_08_Add6_pre.R” and “rcode_10_Add7_pre.R”) must be executed using R ver. 3.1.3 (affy ver. 1.44.0) instead of R ver. 3.3.2 (affy ver. 1.52.0). (ZIP 33 kb)
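The execution order and the R version requirement described above could be turned into a run plan along these lines. This is a sketch under assumptions: the filename pattern is inferred from the "rcode_XX_..." convention, and "rcode_02_example.R" is a made-up filename used only for the demo; the two files needing R 3.1.3 are the ones named in the description.

```python
import re

# The two scripts that, per the description, need R 3.1.3 (affy 1.44.0).
OLD_R_SCRIPTS = {"rcode_08_Add6_pre.R", "rcode_10_Add7_pre.R"}

# Assumed filename shape: rcode_<serial number>_<anything>.R
PATTERN = re.compile(r"^rcode_(\d+)_.*\.R$")

def execution_plan(filenames):
    """Return (filename, R version) pairs sorted by the serial number XX."""
    numbered = []
    for name in filenames:
        match = PATTERN.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    return [
        (name, "3.1.3" if name in OLD_R_SCRIPTS else "3.3.2")
        for _, name in sorted(numbered)
    ]

# Demo input; "rcode_02_example.R" is hypothetical.
demo_files = ["rcode_10_Add7_pre.R", "rcode_02_example.R", "rcode_08_Add6_pre.R"]
plan = execution_plan(demo_files)
for name, r_version in plan:
    print(f"{name}: run with R {r_version}")
```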
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison between Mutplot and other popular tools for mutation plots.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The folder "Homo_sapiens" contains 8 subfolders. Each of them is an individual input BioProject downloaded from SRA and reanalyzed with the same transcriptomic pipeline in order to compute:
1) Differentially Alternative Spliced genes (DAS) in .tsv format;
2) Differentially expressed genes (DEGs) in .csv format;
3) Gene Ontology analysis over the DEG results. When the Gene Ontology enrichment analysis produced statistically significant results, three .csv files have been added to each folder: one for BP (Biological Process) results, one for CC (Cellular Component) results and one for MF (Molecular Function) results. When a BioProject appears more than once, it means that DEGs have been computed over different variables (e.g. RTT vs. WT) or that it has been treated as independent studies when multiple source materials are present. To distinguish the studies, a progressive number ("_0", "_1", "_2") has been added to the folder names (e.g. in PRJNA509687_0 the samples under study were iPSC-derived neural cortical neurons, RTT vs. WT, while in PRJNA509687_1 the samples were iPSC-derived neural progenitors, RTT vs. WT). To facilitate folder navigation, a file named "parameters" has been added to each folder.
4) DESeq2 inputs, divided into gene count matrices in .csv format and the associated phenodata.csv.
The same logic is applied to the main folder "Mus_musculus", which contains 13 subfolders with DAS and DEG results.
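The folder naming convention described above (a BioProject accession with an optional "_0", "_1", "_2" suffix) could be parsed as in the following sketch. The regular expression assumes accessions of the form PRJ + letters + digits, and "PRJNA000000" is a made-up accession used only to show the no-suffix case.

```python
import re
from collections import defaultdict

# Assumed folder-name shape: <BioProject accession>[_<study index>]
FOLDER_PATTERN = re.compile(r"^(PRJ[A-Z]+\d+)(?:_(\d+))?$")

def group_studies(folder_names):
    """Group folder names by BioProject and list their study indices.

    A folder without a numeric suffix is recorded with index None.
    """
    groups = defaultdict(list)
    for name in folder_names:
        match = FOLDER_PATTERN.match(name)
        if match is None:
            continue  # not a BioProject folder (e.g. a stray file)
        accession, suffix = match.group(1), match.group(2)
        groups[accession].append(int(suffix) if suffix is not None else None)
    # Sort indices, placing None (no suffix) first.
    return {
        accession: sorted(indices, key=lambda i: -1 if i is None else i)
        for accession, indices in groups.items()
    }

# Demo input; PRJNA509687_0 and _1 come from the description, PRJNA000000 is hypothetical.
demo_folders = ["PRJNA509687_1", "PRJNA509687_0", "PRJNA000000", "parameters"]
studies = group_studies(demo_folders)
print(studies)
```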
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
β-galactosidases are biotechnologically interesting enzymes that catalyze the hydrolysis or transgalactosylation of β-galactosides. Among them, the Aspergillus niger β-galactosidase (AnβGal) belongs to the glycoside hydrolase family 35 (GH35) and is widely used in the industry due to its high hydrolytic activity degrading lactose. We present here its three-dimensional structure in complex with different oligosaccharides, to illustrate the structural determinants of the broad specificity of the enzyme against different glycoside linkages. Remarkably, the residues Phe264, Tyr304 and Trp806 make a dynamic hydrophobic platform that accommodates the sugar at subsite +1 suggesting a main role on the recognition of structurally different substrates. Moreover, complexes with the trisaccharides show two potential subsites +2 depending on the substrate type. This feature and the peculiar shape of its wide cavity suggest that AnβGal might accommodate branched substrates from the complex net of polysaccharides composing the plant material in its natural environment. Relevant residues were selected and mutagenesis analyses were performed to evaluate their role in the catalytic performance and the hydrolase/transferase ratio of AnβGal. Thus, we generated mutants with improved transgalactosylation activity. In particular, the variant Y304F/Y355H/N357G/W806F displays a higher level of galacto-oligosaccharides (GOS) production than the Aspergillus oryzae β-galactosidase, which is the preferred enzyme in the industry owing to its high transferase activity. Our results provide new knowledge on the determinants modulating specificity and the catalytic performance of fungal GH35 β-galactosidases. In turn, this fundamental background gives novel tools for the future improvement of these enzymes, which represent an interesting target for rational design.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Esophageal Squamous Cell Cancer (ESCC) is an aggressive disease associated with a poor prognosis. As a newly defined form of regulated cell death, ferroptosis plays a crucial role in cancer development and treatment and might be a promising therapeutic target. However, the expression patterns of ferroptosis-related genes (FRGs) in ESCC remain to be systematically analyzed. Methods: First, we retrieved the transcriptional profile of ESCC from TCGA and GEO datasets (GSE47404, GSE23400, and GSE53625) and performed unsupervised clustering to identify different ferroptosis patterns. Then, we used the ssGSEA algorithm to estimate the immune cell infiltration of these patterns and explored the differences in immune cell abundance. Common genes among patterns were finally identified as signature genes of ferroptosis patterns. Results: Herein, we depicted the multi-omics landscape of FRGs through integrated bioinformatics analysis and identified three ESCC subtypes with distinct immune characteristics: clusters A-C. Cluster C was abundant in CD8+ T cells and other immune cell infiltration, while cluster A was immune-barren. By comparing the differently expressed genes between clusters of diverse datasets, we defined a gene signature for each cluster and successfully validated it in the TCGA-ESCC dataset. Conclusion: We provided a comprehensive insight into the expression pattern of ferroptosis genes and their interaction with immune cell infiltration. Additionally, we established a gene signature to define the ferroptosis patterns, which might be used to predict the response to immunotherapy.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a list of Arabidopsis thaliana CNSs merged from the following CNS lists: 1) Haudry et al. (2013) An atlas of over 90,000 conserved noncoding sequences provides insight into crucifer regulatory regions. Nat. Genet. 45:891-898. 2) PL3.0 (TAIR 10 version): Turco et al. (2013) Automated conserved noncoding sequence (CNS) discovery reveals differences in gene content and promoter evolution among grasses. Frontiers in Plant Genetics and Genomics 4:170-180. 3) Van de Velde et al (2014) Inferences of transcriptional networks in Arabidopsis through conserved noncoding sequence analysis. Plant Cell 26:2729-2745. CNSs from the individual lists were concatenated and then merged using merge from the BEDTools suite (Quinlan AR and Hall IM, 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 6, pp. 841–842). CNSs from the merged list were assigned to an Arabidopsis thaliana gene based on their PL3.0 component. PL3.0 CNSs are defined as syntenic conserved noncoding regions between Arabidopsis thaliana and the early branching Brassicaceae Aethionema arabicum. Orthologous Arabidopsis thaliana-Aethionema arabicum genes were identified using a combination of CoGe: Synfind (Tang et al. (2011) BMC Bioinformatics 12:102) and the PL3.0 CNS pipeline (Turco et al. 2013). closestBed (Bedtools) was then used to map PL3.0 CNSs to the closest Arabidopsis thaliana gene which had an Aethionema arabicum ortholog. Distance to the nearest gene is included in the closestBed output. Proximal regions were defined as being 1000 bp upstream from the transcription start site (5' proximal) or 1000 bp downstream from the gene (3' proximal). For intragenic CNSs, a custom perlscript was used to identify the position of the CNS in introns vs UTRs. Overlap with UTRs and CDS regions was calculated using intersectBed (BEDTools) using bedfiles created from GFF "UTR" and "CDS" features. 
CNS sequences overlapping CDSs by 50% or more were given "CDS" designations. CNSs overlapping UTRs by 50% or more were given 5' or 3' UTR designations. CNSs without a PL3.0 component were then assigned to an Arabidopsis thaliana gene if they were present in the genespace of an arabidopsis gene, with the genespace being defined as the region between and encompassing the 5'-most PL3.0 CNS and the 3'-most PL3.0 CNS. Once assigned to an arabidopsis gene, the distance to that gene was calculated using closestBed (BEDTools) and intersectBed was used, as above, to identify the position of intragenic CNSs. An Arabidopsis thaliana genome has been made available on CoGe, dsgid 25725, decorated with 2 sets of CNSs: 1) PL3.0 and 2) the merged set from this datasheet. To see the CNSs, in Results Visualization Options, set "Show preannotated CNSs?" to "Yes". Note: CNS assignments to Arabidopsis thaliana genes are best-guess computational assignments; individual PL3.0 CNSs may in actuality function in regulating genes that are not the closest Arabidopsis thaliana gene with an Aethionema arabicum ortholog. This is particularly true for genes with complex regulation. In the GEvo links included in this spreadsheet these can often be seen as clusters of CNSs extending beyond the midpoint between two Arabidopsis thaliana genes. By adding additional orthologous genes to GEvo panels, it is often possible to assign a CNS to an Arabidopsis thaliana gene with greater confidence if only one of the two Arabidopsis thaliana genes is retained in all genomes along with the CNS.
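Two of the steps described above, concatenating and merging CNS lists and applying the 50% overlap rule, can be illustrated in plain Python. This is a conceptual sketch only: the authors used BEDTools (merge, intersectBed), not this code. It assumes BED-style half-open coordinates, and the interval values are invented for the demo.

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended (chrom, start, end) intervals.

    Mirrors the idea behind merging a concatenated, sorted BED file.
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous interval instead of starting a new one.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(item) for item in merged]

def overlap_fraction(cns, feature):
    """Fraction of the CNS length that is covered by a single feature."""
    if cns[0] != feature[0]:
        return 0.0
    overlap = min(cns[2], feature[2]) - max(cns[1], feature[1])
    return max(0, overlap) / (cns[2] - cns[1])

# Invented intervals standing in for two concatenated CNS lists.
list_a = [("Chr1", 100, 200), ("Chr1", 150, 260)]
list_b = [("Chr1", 260, 300), ("Chr2", 10, 50)]
merged_cns = merge_intervals(list_a + list_b)

# A CNS gets a "CDS" designation when at least 50% of it overlaps a CDS.
cds_fraction = overlap_fraction(("Chr1", 100, 200), ("Chr1", 150, 400))
designation = "CDS" if cds_fraction >= 0.5 else "non-CDS"
print(merged_cns, cds_fraction, designation)
```

For real data the BEDTools commands named in the description remain the reference; the sketch only makes the interval logic explicit.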
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a list of Arabidopsis thaliana CNS sequences present in at least two of the three following CNS lists: 1) Haudry et al. (2013) An atlas of over 90,000 conserved noncoding sequences provides insight into crucifer regulatory regions. Nat. Genet. 45:891-898. 2) PL3.0 (TAIR 10 version): Turco et al. (2013) Automated conserved noncoding sequence (CNS) discovery reveals differences in gene content and promoter evolution among grasses. Frontiers in Plant Genetics and Genomics 4:170-180. 3) Van de Velde et al (2014) Inferences of transcriptional networks in Arabidopsis through conserved noncoding sequence analysis. Plant Cell 26:2729-2745. CNS sequences found in at least 2 of the 3 CNS lists were identified using multiIntersectBed from the BEDTools suite (Quinlan AR and Hall IM, 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 6, pp. 841–842). CNSs from the verified2 list were assigned to an Arabidopsis thaliana gene based on their PL3.0 component. PL3.0 CNSs are defined as syntenic conserved noncoding regions between Arabidopsis thaliana and the early branching Brassicaceae Aethionema arabicum. Orthologous Arabidopsis thaliana-Aethionema arabicum genes were identified using a combination of CoGe: Synfind (Tang et al. (2011) BMC Bioinformatics 12:102) and the PL3.0 CNS pipeline (Turco et al. 2013). closestBed (Bedtools) was then used to map PL3.0 CNSs to the closest Arabidopsis thaliana gene with an Aethionema arabicum ortholog. Distance to the nearest gene is included in the closestBed output. Proximal regions were defined as being 1000 bp upstream from the transcription start site (5' proximal) or 1000 bp downstream from the gene (3' proximal).
CNSs without a PL3.0 component were also assigned to an Arabidopsis thaliana gene if they were intragenic or if they were in the genespace of an arabidopsis gene, with the genespace being defined as the region between and encompassing the 5'-most PL3.0 CNS and the 3'-most PL3.0 CNS. For intragenic CNSs, a custom perlscript was used to identify the position of the CNS in introns vs UTRs. Overlap with UTRs and CDS regions was calculated using intersectBed (BEDTools) using bedfiles created from GFF "UTR", "gene", and "CDS" features. CNS sequences overlapping CDSs by 50% or more were given "CDS" designations. CNSs overlapping UTRs by 50% or more were given 5' or 3' UTR designations. Note: CNS assignments to Arabidopsis thaliana genes are best-guess computational assignments; individual PL3.0 CNSs may in actuality function in regulating genes that are not the closest Arabidopsis thaliana gene with an Aethionema arabicum ortholog. This is particularly true for genes with complex regulation. In the GEvo links included in this spreadsheet these can often be seen as clusters of CNSs extending beyond the midpoint between two Arabidopsis thaliana genes. By adding additional orthologous genes to GEvo panels, it is often possible to assign a CNS to an Arabidopsis thaliana gene with greater confidence if only one of the two Arabidopsis thaliana genes is retained in all genomes along with the CNS.
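The "present in at least two of the three lists" criterion can be illustrated with a small sweep over interval boundaries. This is a conceptual sketch, not the multiIntersectBed implementation the authors used: it assumes BED-style half-open coordinates and that intervals within each list do not overlap one another, and the three demo lists contain invented coordinates.

```python
def regions_in_at_least(lists, min_lists=2):
    """Return (chrom, start, end) regions covered by at least `min_lists` lists.

    Assumes intervals inside each individual list do not overlap, so the
    coverage depth at a position equals the number of lists covering it.
    """
    events = []
    for intervals in lists:
        for chrom, start, end in intervals:
            events.append((chrom, start, 1))   # interval opens
            events.append((chrom, end, -1))    # interval closes
    # At equal positions, closes (-1) sort before opens (+1): half-open intervals.
    events.sort()

    regions = []
    depth = 0
    open_start = None
    current_chrom = None
    for chrom, pos, delta in events:
        if chrom != current_chrom:
            depth, open_start, current_chrom = 0, None, chrom
        new_depth = depth + delta
        if depth < min_lists <= new_depth:
            open_start = pos                    # coverage just reached the threshold
        elif new_depth < min_lists <= depth:
            if pos > open_start:
                regions.append((chrom, open_start, pos))
            open_start = None                   # coverage dropped below the threshold
        depth = new_depth
    return regions

# Invented coordinates standing in for the three CNS lists.
list_1 = [("Chr1", 100, 200)]
list_2 = [("Chr1", 150, 250)]
list_3 = [("Chr1", 240, 300), ("Chr2", 5, 20)]
shared = regions_in_at_least([list_1, list_2, list_3], min_lists=2)
print(shared)
```

Only the sub-regions actually shared by two or more lists are reported, which is why the Chr2 interval, present in a single list, is dropped.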