Data for sequence comparison of commamox genomes and genes identified. This dataset is associated with the following publication: Camejo, P., J. Santodomingo, K. McMahon, and D. Noguera. Genome-enabled insights into the ecophysiology of the comammox bacterium Ca. Nitrospira nitrosa. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 2(5): 1-16, (2017).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Paper Data For "Identification of ACHE as the Hub Gene targeting Solasonine Associated with NSCLC Using Integrated Bioinformatics Analysis"
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Familiarity with genome-scale data and the bioinformatic skills to analyze it have become essential for understanding and advancing modern biology and human health, yet many undergraduate biology majors are never exposed to hands-on bioinformatics. This paper presents a module that introduces students to applied bioinformatic analysis within the context of a research-based microbiology lab course. One of the most commonly used genomic analyses in biology is resequencing: determining the sequence of DNA bases in a derived strain of some organism, and comparing it to the known ancestral genome of that organism to better understand the phenotypic differences between them. Many existing CUREs — Course Based Undergraduate Research Experiences — evolve or select new strains of bacteria and compare them phenotypically to ancestral strains. This paper covers standardized strategies and procedures, accessible to undergraduates, for preparing and analyzing microbial whole-genome resequencing data to examine the genotypic differences between such strains. Wet-lab protocols and computational tutorials are provided, along with additional guidelines for educators, providing instructors without a next-generation sequencing or bioinformatics background the necessary information to incorporate whole-genome sequencing and command-line analysis into their class. This module introduces novice students to running software at the command-line, giving them exposure and familiarity with the types of tools that make up the vast majority of open-source scientific software used in contemporary biology. Completion of the module improves student attitudes toward computing, which may make them more likely to pursue further bioinformatics study.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
WB raw data
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Subjective data models dataset
This dataset is comprised of data collected from study participants, for a study into how people working with biological data perceive data, and whether or not this perception of data aligns with a person's experiential and educational background. We call the concept of what data looks like to an individual a "subjective data model".
Todo: link paper/preprint once published.
Computational python analysis code: https://doi.org/10.5281/zenodo.7022789 and https://github.com/yochannah/subjective-data-models-analysis
Files
Transcripts of the recorded sessions are attached and have been verified by a second researcher. These files are all in plain text .txt format. Note that participant 3 did not agree to sharing the transcript of their interview.
Interview paper files This folder has digital and photographed versions of the files shown to the participants for the file mapping task. Note that the original files are from the NCBI and from FlyBase.
Videos and stills from the recordings have been deleted in line with the Data Management Plan and Ethical Review.
anonymous_participant_list.csv
shows which files have transcripts associated (not all participants agreed to share transcripts), what the order of Tasks A and B were, the date of interview, and what entities participants added to the set provided (if any). See the paper methods for more info about why entities were added to the set.
cards.txt
is a full list of the cards presented in the tasks.
background survey
and background manual annotations
are the select survey data about participant background and manual additions to this where necessary, e.g. to interpret free text.
codes.csv
shows the qualitative codes used within the transcripts.
entry_point.csv
is a record of participants' identified entry points into the data.
file_mapping_responses
shows a record of responses to the file mapping task.
Computational and Structural Biotechnology Journal Impact Factor 2024-2025 - ResearchHelpDesk - Computational and Structural Biotechnology Journal (CSBJ) is an online gold open access journal publishing research articles and reviews after full peer review. All articles are published, without barriers to access, immediately upon acceptance. The journal places a strong emphasis on functional and mechanistic understanding of how molecular components in a biological process work together through the application of computational methods. Structural data may provide such insights, but they are not a pre-requisite for publication in the journal. Specific areas of interest include, but are not limited to: Structure and function of proteins, nucleic acids and other macromolecules Structure and function of multi-component complexes Protein folding, processing and degradation Enzymology Computational and structural studies of plant systems Microbial Informatics Genomics Proteomics Metabolomics Algorithms and Hypothesis in Bioinformatics Mathematical and Theoretical Biology Computational Chemistry and Drug Discovery Microscopy and Molecular Imaging Nanotechnology Systems and Synthetic Biology The journal welcomes the submission of manuscripts that meet the general criteria of significance and scientific excellence, and enables the rapid publication of papers under the following categories: Research articles Review articles Mini Reviews Highlights Communications Software/Web server articles Methods articles Database articles Book Reviews Meeting Reviews
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This record contains the data (references, reads, assemblies) used in the analyses for the Trycycler paper.
http://rightsstatements.org/vocab/InC/1.0/http://rightsstatements.org/vocab/InC/1.0/
Although new and emerging information technologies (IT) can enable the analysis of rapidly expanding bioinformatics data, no standards exists. Standards validate a technology or process against a compilation of consolidated best practice specifications. Standards development represents an effective way to retrieve textual evidence, work collaboratively, and integrate bioinformatics with global e-health initiatives. Thus, standards barriers can impede otherwise productive research efforts. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Gene expression matrix of different developmental stages of tissues or individuals in lotus for the Scientific Data paper
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This record contains the data (references, reads, assemblies) used in the analyses for the Polypolish paper.
RNA expression analysis was performed on the corpus luteum tissue at five time points after prostaglandin F2 alpha treatment of midcycle cows using an Affymetrix Bovine Gene v1 Array. The normalized linear microarray data was uploaded to the NCBI GEO repository (GSE94069). Subsequent statistical analysis determined differentially expressed transcripts ± 1.5-fold change from saline control with P ≤ 0.05. Gene ontology of differentially expressed transcripts was annotated by DAVID and Panther. Physiological characteristics of the study animals are presented in a figure. Bioinformatic analysis by Ingenuity Pathway Analysis was curated, compiled, and presented in tables. A dataset comparison with similar microarray analyses was performed and bioinformatics analysis by Ingenuity Pathway Analysis, DAVID, Panther, and String of differentially expressed genes from each dataset as well as the differentially expressed genes common to all three datasets were curated, compiled, and presented in tables. Finally, a table comparing four bioinformatics tools' predictions of functions associated with genes common to all three datasets is presented. These data have been further analyzed and interpreted in the companion article "Early transcriptome responses of the bovine mid-cycle corpus luteum to prostaglandin F2 alpha includes cytokine signaling". Resources in this dataset:Resource Title: Supporting information as Excel spreadsheets and tables. File Name: Web Page, url: http://www.sciencedirect.com/science/article/pii/S2352340917304031?via=ihub#s0070
Gene expression analysis is one of the most important tasks for genomic medicine, using these it is possible to classify tumors, which are directly related with the development of cancer. This paper presents a clustering method for tumor classification, vector quantization, using gene expression profiles from microarrays of mRNA with samples of cervical cancer and normal cervix. Vector quantization is used to divide the space into regions, and the centroids of the regions represent patients with tumors or healthy ones. Also the regions found by the vector quantizer are used as the base for classifying other tumors, that could help in the prognostics of the illness or for finding new groups of tumors. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.
Different from significant gene expression analysis which looks for all genes that are differentially regulated, feature selection in prognostic gene expression analysis aims at finding a subset of informative marker genes that are discriminative for prediction. Unfortunately feature selection in the literature of microarray study is predominated by the simple heuristic univariate gene filter paradigm that selects differentially expressed genes according to their statistical significance. Since the univariate approach does not take into account the correlated or interactive structure among the genes, classifiers built on genes so selected can be less accurate. More advanced approaches based on multivariate models have to be considered. Here, we introduce a feature ranking method through forward orthogonal search to assist prognostic gene selection. Application to published gene-lists selected by univariate models shows that the feature space can be largely reduced while achieving improved testing performances. Our results indicate that "significant" features selected using the gene-wised approaches can contain irrelevant genes that only serve to complicate model building. Multivariate feature ranking can help to reduce feature redundancy and to select highly informative prognostic marker genes. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Background: Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the "citation benefit". Furthermore, little is known about patterns in data reuse over time and across datasets. Method and Results: Here, we look at citation rates while controlling for many known citation predictors, and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties. Conclusion: After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered.We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.
http://rightsstatements.org/vocab/InC/1.0/http://rightsstatements.org/vocab/InC/1.0/
In this paper we implement and test the recently described nearest subspace classifier on a range of microarray cancer datasets. Its classification accuracy is tested against nearest neighbor and nearest centroid algorithms, and is shown to give a significant improvement. This classification system uses class-dependent PCA to construct a subspace for each class. Test vectors are assigned the class label of the nearest subspace, which is defined as the minimum reconstruction error across all subspaces. Furthermore, we demonstrate this distance measure is equivalent to the null-space component of the vector being analyzed. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.
This work presents a new consensus clustering method for gene expression microarray data based on a genetic algorithm. Using two datasets - DA and DB - as input, the genetic algorithm examines putative partitions for the samples in DA, selecting biomarkers that support such partitions. The biomarkers are then used to build a classifier which is used in DB to determine its samples classes. The genetic algorithm is guided by an objective function that takes into account the accuracy of classification in both datasets, the number of biomarkers that support the partition, and the distribution of the samples across the classes for each dataset. To illustrate the method, two whole-genome breast cancer instances from dfferent sources were used. In this application, the results indicate that the method could be used to find unknown subtypes of diseases supported by biomarkers presenting similar gene expression profiles across platforms. Moreover, even though this initial study was restricted to two datasets and two classes, the method can be easily extended to consider both more datasets and classes. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.
Xylan is the main component of hemicellulose which is present in nature in large amounts and can be degraded by either acid or enzymic catalysis with the advantages of a highly efficient conversion rate and non-corrosive and environmentally friendly conditions. Although, the complete breakdown of xylan requires the action of several different enzymes, the depolymerizing endo-1,4,β-xylanase (EC 3.2.1.8) is the key enzyme with possible applications in waste treatment, fuel and chemical production and paper manufacture. In consequence, the importance of finding or making thermostable xylanases has been highlighted. Therefore, it is inevitable to understand the features involving in xylanase thermostability. Here, we looked at more than seventy attributes of 30 xylanase proteins (active in different temperatures) by applying a feature selection algorithm which assigns a p value to each attribute based on the asymptotic distribution of a transformation on the Pearson correlation coefficient, and then, sorts them according to their p values in order to find the most contributing ones regarding the xylanase proteins thermostability. The results showed that the count of oxygen, nitrogen, Glu, Lys, Cys, Phe, Trp, the count of positively and negatively charged residues as well as the count of other residues were the most important features with respect to xylanase thermostability, and 12 more properties were recognized to have a marginal effect on this aspect, while the rest were revealed to be unimportant. The importance of "important" and "marginal" features in xylanase thermostability has been discussed in this paper. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reproducibility Project: Cancer Biology (https://osf.io/e81xl/wiki/home/) aims to reproduce the key experiments from 50 landmark papers in cancer research. As a follow up to the previously published study, which showed a lack of indentifiability of research resources in the published biomedical literature (Vasilevsky, et al. 2014, PeerJ 1:e148), we analyzed 6 resource types reported in these papers to determine the identifiability of these resources. The resource types included antibodies, cell lines, constructs, knockdown reagents, model organisms and software. The results showed an average 85% of the resources were identifiable, and the ability to identify the resources varied amongst the resource types.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data stem fromI. Cantone, L. Marucci, F. Iorio, M.A. Ricci, V. Belcastro, M. Bansal, S. Santini, M. di Bernardo, D. di Bernardo and M.P. Cosma (2009): A Yeast Synthetic Network for In Vivo Assessment of Reverse-Engineering and Modeling Approaches, Cell, 137, 172-181.Grzegorczyk, M. and Husmeier, D. (2012): A Non-Homogeneous Dynamic Bayesian Network with Sequentially Coupled Interaction Parameters for Applications in Systems and Synthetic Biology, Statistical Applications in Genetics and Molecular Biology (SAGMB), 11(4), Article 7.Please see those publications for details.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets, conda environments and Softwares for the course "Population Genomics" of Prof Kasper Munch. This course material is maintained by the health data science sandbox. This webpage shows the latest version of the course material.
The data is connected to the following repository: https://github.com/hds-sandbox/Popgen_course_aarhus. The original course material from Prof Kasper Munch is at https://github.com/kaspermunch/PopulationGenomicsCourse.
Description
The participants will after the course have detailed knowledge of the methods and applications required to perform a typical population genomic study.
The participants must at the end of the course be able to:
The course introduces key concepts in population genomics from generation of population genetic data sets to the most common population genetic analyses and association studies. The first part of the course focuses on generation of population genetic data sets. The second part introduces the most common population genetic analyses and their theoretical background. Here topics include analysis of demography, population structure, recombination and selection. The last part of the course focus on applications of population genetic data sets for association studies in relation to human health.
Curriculum
The curriculum for each week is listed below. "Coop" refers to a set of lecture notes by Graham Coop that we will use throughout the course.
Course plan
Data for sequence comparison of commamox genomes and genes identified. This dataset is associated with the following publication: Camejo, P., J. Santodomingo, K. McMahon, and D. Noguera. Genome-enabled insights into the ecophysiology of the comammox bacterium Ca. Nitrospira nitrosa. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 2(5): 1-16, (2017).