The data in the Porcine Translational Research Database is supported by >5800 references and contains 65 data fields for each entry, including >9700 full-length (5′ and 3′) unambiguous pig sequences, >2400 real-time PCR assays, and reactivity information on >1700 antibodies. It also contains gene and/or protein expression data for >2200 genes and identifies and corrects errors (gene duplication artifacts, mis-assemblies, mis-annotations, and incorrect species assignments) for >2000 porcine genes. This database is the largest manually curated database for any single veterinary species and is unique among porcine gene databases in linking gene expression to gene function, identifying related gene pathways, and connecting data with other porcine gene databases. Resources in this dataset: Resource Title: The Porcine Translational Research Database. File Name: Web Page, url: https://www.ars.usda.gov/northeast-area/beltsville-md/beltsville-human-nutrition-research-center/diet-genomics-and-immunology-laboratory/docs/dgil-porcine-translational-research-database/
GNU Lesser General Public License v3.0 (LGPL-3.0) http://www.gnu.org/licenses/lgpl-3.0.html
Our groundbreaking translation dataset represents a monumental advancement in the field of natural language processing and machine translation. Comprising a staggering 785 million records, this corpus bridges language barriers by offering translations from English to an astonishing 548 languages. The dataset promises to be a cornerstone resource for researchers, engineers, and developers seeking to enhance their machine translation models, cross-lingual analysis, and linguistic investigations.
Size of the dataset: 41 GB uncompressed, 20 GB compressed.
Key Features:
Scope and Scale: With a comprehensive collection of 785 million records, this dataset provides an unparalleled wealth of translated text. Each record consists of an English sentence paired with its translation in one of the 548 target languages, enabling multi-directional translation applications.
Language Diversity: Encompassing translations into 548 languages, this dataset represents a diverse array of linguistic families, dialects, and scripts. From widely spoken languages to those with limited digital representation, the dataset bridges communication gaps on a global scale.
Quality and Authenticity: The translations have been meticulously curated, verified, and cross-referenced to ensure high quality and authenticity. This attention to detail guarantees that the dataset is not only extensive but also reliable, serving as a solid foundation for machine learning applications. The data was collected from various open datasets for my personal ML projects and is now being shared with the team.
Use Case Versatility: Researchers and practitioners across a spectrum of domains can harness this dataset for a myriad of applications. It facilitates the training and evaluation of machine translation models, empowers cross-lingual sentiment analysis, aids in linguistic typology studies, and supports cultural and sociolinguistic investigations.
Machine Learning Advancement: Machine translation models, especially neural machine translation (NMT) systems, can leverage this dataset to enhance their training. The large-scale nature of the dataset allows for more robust and contextually accurate translation outputs.
Fine-tuning and Customization: Developers can fine-tune translation models using specific language pairs, offering a powerful tool for specialized translation tasks. This customization capability ensures that the dataset is adaptable to various industries and use cases.
Data Format: The dataset is provided in a structured JSON format, facilitating easy integration into existing machine learning pipelines. This structured approach expedites research and experimentation. Each record contains an English word or sentence and its equivalent in the target language. The data was exported from a MongoDB database to ensure record uniqueness; each record is unique, and the records are sorted.
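As a minimal sketch of how such records might be read (assuming one JSON object per line and hypothetical field names such as `en`, `lang` and `translation`; the actual schema may differ):

```python
import json

def load_translation_records(path, target_lang=None):
    """Stream translation records from a JSON-lines export.

    Assumes each line holds one record with hypothetical fields
    'en' (source text), 'lang' (target language code) and
    'translation' (translated text); adjust to the real schema.
    """
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            if target_lang is None or record.get("lang") == target_lang:
                yield record["en"], record.get("lang"), record["translation"]

# Example: collect English-French pairs for fine-tuning a translation model.
pairs = [(src, tgt) for src, _, tgt in
         load_translation_records("translations.jsonl", target_lang="fr")]
```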
Access: The dataset is available for academic and research purposes, enabling the global AI community to contribute to and benefit from its usage. A well-documented API and sample code are provided to expedite exploration and integration.
The English-to-548-languages translation dataset represents an incredible leap forward in advancing multilingual communication, breaking down barriers to understanding, and fostering collaboration on a global scale. It holds the potential to reshape how we approach cross-lingual communication, linguistic studies, and the development of cutting-edge translation technologies.
Dataset Composition: The dataset is a culmination of translations from English, a widely spoken and understood language, into 548 distinct languages. Each language represents a unique linguistic and cultural background, providing a rich array of translation contexts. This diverse range of languages spans across various language families, regions, and linguistic complexities, making the dataset a comprehensive repository for linguistic research.
Data Volume and Scale: With a staggering 785 million records, the dataset boasts an immense scale that captures a vast array of translations and linguistic nuances. Each translation entry consists of an English source text paired with its corresponding translation in one of the 548 target languages. This vast corpus allows researchers and practitioners to explore patterns, trends, and variations across languages, enabling the development of robust and adaptable translation models.
Linguistic Coverage: The dataset covers an extensive set of languages, including but not limited to Indo-European, Afroasiatic, Sino-Tibetan, Austronesian, Niger-Congo, and many more. This broad linguistic coverage ensures that languages with varying levels of grammatical complexity, vocabulary richness, and syntactic structures are included, enhancing the applicability of translation models across diverse linguistic landscapes.
Dataset Preparation: The translation ...
This dataset includes protein post-translational modifications as well as associated annotation data obtained from the Biological General Repository for Interaction Datasets (BioGRID) for major model organism species, including the type of modification, the protein sequence, and the specific amino acid involved.
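A rough sketch of how such a table could be summarized with pandas (the file name and column headers below are assumptions, not the actual BioGRID PTM export schema):

```python
import pandas as pd

# Load the PTM export; BioGRID distributes tab-delimited files,
# but the file name and column names used here are assumptions.
ptm = pd.read_csv("BIOGRID-PTM.tab.txt", sep="\t", low_memory=False)

# Count modification types per organism (hypothetical column names).
summary = (ptm.groupby(["Organism Name", "Post Translational Modification"])
              .size()
              .sort_values(ascending=False))
print(summary.head(20))
```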
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
5′, ORF and 3′ end comparison of porcine and human mRNAs (XLSX 66 kb)
Porcine or artiodactyl-specific paralogs. Gene names, Ensembl and NCBI loci numbers and Build 10.2 NCBI gene coordinates of porcine or artiodactyl-specific paralogs (XLSX 58 kb)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Artifactually duplicated genes in Ensembl build 10.2. Gene names, Ensembl and NCBI loci numbers and NCBI genome build 10.2 coordinates of artifactually duplicated genes (XLSX 282 kb)
A database of publicly available genes, alternative translational isoforms and their detailed annotation. Alternative translational initiation is one of the mechanisms that increase the complexity of an organism through alternative gene expression pathways. The use of alternative translation initiation codons in a single mRNA contributes to the generation of protein diversity. Such genes produce two or more versions of the encoded protein, and the shorter version, initiated from a downstream in-frame start codon, lacks the N-terminal amino acid fragment of the full-length isoform. Since the first discovery of alternative translation initiation, a small yet growing number of mRNAs initiating translation from alternative start codons have been reported. Various studies have begun to emerge focusing on this new field of gene expression and have revealed the biological significance of alternative initiation. In response to the need for systematic studies of genes involving alternative translational initiation, the Alternative Translational Initiation Database (ATID) was established to provide data on publicly available genes, alternative translational isoforms and their detailed annotation.
Porcine genes missing in Ensembl build 10.2 of the porcine genome. Gene names and evidence/source for RNA sequence of genes that are missing from Ensembl build 10.2. (XLSX 112 kb)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
STARR is a data resource that is designed to improve access to healthcare data by researchers. STARR contains data from Stanford Health Care and the Stanford Children’s Hospital in the United States and supports diverse use cases and research applications. STARR has raw data, analysis-ready data, linked data across different data modalities, support for different data models, multiple clinical data warehouses, data search and access tools, data de-identification pipelines, concierge services, training, and documentation. The database contains clinical information on over 1.6 million pediatric and adult patients cared for at Stanford University Medical Center since 1995.
A compilation of programmed translational recoding events taken from the scientific literature and personal communications. The database deals with programmed ribosomal frameshifting, codon redefinition and translational bypass occurring in a variety of organisms. The entries for each event include the sequences of the corresponding genes, the proteins encoded by both the normal and the alternate decoding, the type of recoding event involved, and the trans-factors and cis-elements that influence recoding.
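A hypothetical in-memory representation of one such entry might look like the following sketch (the field names are illustrative and are not the database's actual schema):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RecodingEvent:
    """Illustrative record for a programmed translational recoding event."""
    gene: str
    organism: str
    event_type: str            # e.g. "-1 frameshift", "codon redefinition", "bypass"
    gene_sequence: str         # nucleotide sequence of the gene
    normal_protein: str        # product of standard decoding
    alternate_protein: str     # product of the recoding event
    cis_elements: List[str] = field(default_factory=list)   # e.g. slippery site, pseudoknot
    trans_factors: List[str] = field(default_factory=list)  # proteins/RNAs that influence recoding

# Example entry (values are placeholders, not real data):
event = RecodingEvent(
    gene="exampleGene", organism="Escherichia coli",
    event_type="-1 frameshift", gene_sequence="ATG...",
    normal_protein="M...", alternate_protein="M...",
    cis_elements=["slippery heptamer", "3' pseudoknot"],
)
```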
A database release of PhenCards to coincide with the release of the paper.
This is a citable repository with a zip file of everything used to make the Elasticsearch Lucene index database for PhenCards v1.0.0.
This includes all data, such as HPO, ICD, UMLS (without restricted sources), IRS data, and Open990 data. Because of Open990, we cannot make it commercially available, but it is still fully open source for academics.
However, we also provide the code for preprocessing the data for Lucene indexing.
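As a hedged sketch of what indexing preprocessed records into Elasticsearch could look like (using the Python `elasticsearch` client; the index name, file name and document structure are placeholders, not the actual PhenCards pipeline):

```python
import json
from elasticsearch import Elasticsearch, helpers

# Local node for illustration; adjust the URL and credentials as needed.
es = Elasticsearch("http://localhost:9200")

def actions(path, index="phencards"):
    """Yield bulk-index actions from a JSON-lines file of preprocessed records."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            doc = json.loads(line)
            yield {"_index": index, "_source": doc}

# Bulk-index the documents so Lucene builds the inverted index.
helpers.bulk(es, actions("phencards_records.jsonl"))
```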
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Proteins perform essential cellular functions, ranging from cell division and metabolism to DNA replication. Decoding the mechanism of action of cells therefore requires understanding the functioning and physicochemical properties of proteins [1]. While the genetic code encodes the primary structure of proteins, proteins undergo various modifications as part of their normal functioning, including the addition of modifying groups such as acetyl, phosphoryl, glycosyl, and methyl groups to one or more amino acids after translation, which is known as post-translational modification (PTM) [2, 3]. PTMs play an essential role in regulating protein functions by altering their physicochemical properties, and understanding these reactions provides valuable insights into cell function. Advances in proteomics research have significantly deepened our understanding of PTMs and their impact on cellular functions and disease mechanisms, and the study of PTMs is now at the forefront of research in molecular biology and biochemistry. Many databases, software packages, and tools have been developed to enhance our understanding of the various PTMs that affect human plasma proteins and to simplify the analysis of complex PTM data [4]. These PTM databases and tools contain significant information and are a valuable resource for the research community; key databases include dbPTM, UniProt, and PubChem. Utilising these databases, protein-related information such as substrate peptides, amino acid sequence numbers, and experimentally validated PTM sites can be identified and curated.

This dataset presents curated information on PTM-related changes in the physicochemical properties of the 16 most abundant plasma proteins [5]: Serum Albumin, Serotransferrin, Antithrombin-III, Apolipoprotein A-I, Apolipoprotein A-IV, Apolipoprotein B-100, Apolipoprotein C-II, Apolipoprotein C-III, Apolipoprotein E, Clusterin, Complement C3, Haptoglobin, Histidine-rich glycoprotein, Mannose-binding protein C, Hemoglobin, and Fibrinogen alpha chain. The physicochemical properties studied, together with the impact of different PTMs on each property, are molecular weight, isoelectric point, surface hydrophobicity, and solubility. The PTMs explored include phosphorylation, acetylation, glycosylation, methylation, ubiquitination, SUMOylation, lipidation, glutathionylation, nitrosylation, sulfoxidation, succinylation, neddylation, malonylation, hydroxylation, oxidation, and palmitoylation.

References
Alberts B, Johnson A, Lewis J, et al. Molecular Biology of the Cell. 4th edition. New York: Garland Science; 2002. Analyzing Protein Structure and Function.
Chen H, Venkat S, McGuire P, Gan Q, Fan C. Recent Development of Genetic Code Expansion for Posttranslational Modification Studies. Molecules 2018, 23, 1662.
Oeller M, Kang R, Bolt H, Gomes dos Santos A, Langborg Weinmann A, Nikitidis A, Zlatoidsky P, Su W, Czechtizky W, De Maria L, Sormanni P, Vendruscolo M. Sequence-based prediction of the solubility of peptides containing non-natural amino acids [bioRxiv preprint].
Ramazi S, Zahiri J. Posttranslational modifications in proteins: resources, tools and prediction methods. Database (Oxford). 2021 Apr 7;2021:baab012.
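As an illustration of how baseline, unmodified values of some of these properties can be computed from a protein sequence, here is a minimal sketch using Biopython's ProtParam module (this is not the method used to build the dataset, only a reference point; the sequence is a toy stand-in):

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

# Toy sequence standing in for a plasma protein; a real analysis would use
# the full UniProt sequence (e.g. P02768 for Serum Albumin).
seq = "MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIAFAQYLQQCPF"

pa = ProteinAnalysis(seq)
print(f"Molecular weight  : {pa.molecular_weight():.1f} Da")
print(f"Isoelectric point : {pa.isoelectric_point():.2f}")
print(f"GRAVY (hydropathy): {pa.gravy():.3f}")  # rough proxy for surface hydrophobicity
```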
Our dataset consists of transcripts and codebooks for a focus group study. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. EPA cannot release CBI, or data protected by copyright, patent, or otherwise subject to trade secret restrictions. Requests for access to CBI data may be directed to the dataset owner by an authorized person by contacting the party listed. It can be accessed through the following means: Contact Katie Williams, williams.kathleen@epa.gov. Format: The data are transcripts and are protected by IRB approvals. This dataset is associated with the following publication: Eisenhauer, E., K. Williams, K. Margeson, S. Paczuski, K. Mulvaney, and M.C. Hano. Advancing translational research in environmental science: The role and impact of social science. Environmental Science & Policy. Elsevier Science Ltd, New York, NY, USA, 120: 165-172, (2021).
Comprehensive analysis of post-translational modifications (PTMs) is an important mission of proteomics. However, considering PTMs increases the search space and may therefore impair the efficiency of protein identification. Using thousands of proteomic searches, we investigated the practical aspects of considering multiple PTMs in Byonic searches to maximize protein and peptide hits. Including all PTMs that occur with at least 2% frequency in the sample has an advantageous effect on protein and peptide identification. A linear relationship was established between the number of considered PTMs and the number of reliably identified peptides and proteins. Although they handle multiple modifications less efficiently, MASCOT (using the Percolator function) and Andromeda (the search engine included in MaxQuant) produced results comparable to those of Byonic when only a few PTMs were considered.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
[PlanTL/medicine/neural machine translation/translation models] The files needed to use the Neural Machine Translation system for the Biomedical Domain. The available language directions for translation are: English to Spanish, Spanish to English, English to Portuguese, Portuguese to English, Spanish to Portuguese and Portuguese to Spanish.
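If the released checkpoints are compatible with the Hugging Face transformers translation pipeline, usage could look like the sketch below; the generic Helsinki-NLP model is only a stand-in and should be replaced with the actual PlanTL biomedical checkpoint, which may instead target another toolkit such as fairseq:

```python
from transformers import pipeline

# Stand-in: a generic public English->Spanish model. Substitute the PlanTL
# biomedical English->Spanish checkpoint (or another of the six directions).
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")

result = translator("The patient was administered 5 mg of the study drug daily.")
print(result[0]["translation_text"])
```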
MGVB is a collection of tools for proteomics data analysis. It covers data processing from in silico digestion of protein sequences to comprehensive identification of post-translational modifications and solving the protein inference problem. The toolset is developed with efficiency in mind and enables analysis at a fraction of the resource cost typically required by existing commercial and free tools. As a native application, MGVB is much faster than existing proteomics tools such as MaxQuant and MSFragger and, at the same time, finds a very similar, in some cases even larger, number of peptides at a chosen level of statistical significance. It implements a probabilistic scoring function to match spectra to sequences, a novel combinatorial search strategy for finding post-translational modifications, and a Bayesian approach to locate modification sites. This report describes the algorithms behind the tools, presents benchmark dataset analyses comparing MGVB's performance to that of MaxQuant/Andromeda, and provides step-by-step instructions for using it in typical analytical scenarios. The toolset is provided free to download and use for academic research and in software projects, but it is not open source at present. It is the intention of the author that it will be made open source in the near future, following rigorous evaluations and feedback from the proteomics research community.
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
The primary mission of the Alliance of Genome Resources (the Alliance) is to develop and maintain sustainable genome information resources that facilitate the use of diverse model organisms in understanding the genetic and genomic basis of human biology, health and disease. This understanding is fundamental for advancing genome biology research and for translating human genome data into clinical utility. The unified Alliance information system will represent the union of the data and information represented in the current individual MODs rather than the intersection, and thus provide the best of each in one place while maintaining community integrity and preserving the unique aspects of each model organism. By working together we can be more comprehensive and efficient, and hence more sustainable.

Through the implementation of a shared, modular information system architecture, the Alliance seeks to serve diverse user communities including (i) human geneticists who want access to all model organism data for orthologous human genes; (ii) basic science researchers who use specific model organisms to understand fundamental biology; (iii) computational biologists and data scientists who need access to standardized, well-structured data, both big and small; and (iv) educators and students.

Community genome resources such as the Model Organism Databases and the Gene Ontology Consortium have developed high-quality resources enabling cost- and time-effective information retrieval and aggregation that would otherwise require countless hours to achieve. Regardless of their success and utility, there remain challenges to using and sustaining MODs. Searching across multiple model organism database resources remains a barrier to realizing the full impact of these resources in advancing genome biology and genomic medicine. In addition, despite a growing need for MODs by the biomedical research community as well as the increasing volumes of data and publications, the financial resources available to sustain MODs and related information resources are being reduced. We believe that one contribution to solving these challenges while continuing to serve our diverse user communities is to unify our efforts. To this end, six MODs (Saccharomyces Genome Database, WormBase, FlyBase, Zebrafish Information Network, Mouse Genome Database, Rat Genome Database) and the Gene Ontology (GO) project joined together in the fall of 2016 to form the Alliance of Genome Resources (the Alliance) consortium.

Resources in this dataset: Resource Title: Alliance of Genome Resources. File Name: Web Page, url: https://www.alliancegenome.org/
Database that represents a centralized platform to visually depict and integrate information pertaining to domain architecture, post-translational modifications, interaction networks and disease association for each protein in the human proteome. All the information in HPRD has been manually extracted from the literature by expert biologists who read, interpret and analyze the published data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CRITICAL dataset is the first cross-Clinical and Translational Science Award (CTSA) initiative to create a multi-site, multi-modal, de-identified clinical dataset. It combines deep-data depth with broad-data width, addressing a major unmet need in healthcare research. The dataset encompasses comprehensive longitudinal inpatient and outpatient data, including pre-, during- and post-ICU admissions, for approximately 400,000 distinct critical-care patients. This diverse dataset supports the exploration of urgent clinical problems and facilitates the development of fair and generalizable AI tools for advanced patient monitoring and decision support.
The dataset has been curated to serve the research community, fostering innovations in AI/machine learning (ML), outcomes research, and other translational science domains. Its unique combination of size, diversity, and comprehensiveness makes it a valuable resource for tackling long-standing clinical challenges.
The metadata provided here for this dataset is licensed under Attribution 4.0 International (CC BY 4.0). Please note that access to the data itself is restricted and requires compliance with the applicable Data Use Agreement (DUA). More information about the DUA and how to request access can be found at https://critical.fsm.northwestern.edu/, or by contacting critical@northwestern.edu.
Database of human histone variants, sites of their post-translational modifications and various histone-modifying enzymes. The database covers 5 types of histones, 8 types of their post-translational modifications and 13 classes of modifying enzymes. Many data fields are hyperlinked to other databases (e.g. UniProtKB/Swiss-Prot, HGNC, OMIM, UniGene, etc.). Additionally, this database provides sequences of promoter regions (−700 to +300 relative to the TSS) for all gene entries; these sequences were extracted from the UCSC genome browser. Sites of post-translational modifications of histones were manually searched from PubMed-listed literature. The current version contains information on ~50 histone proteins and ~150 histone-modifying enzymes. HIstome is a combined effort of researchers from two institutions, the Advanced Center for Treatment, Research and Education in Cancer (ACTREC), Navi Mumbai, and the Center of Excellence in Epigenetics (CoEE), Indian Institute of Science Education and Research (IISER), Pune.