100+ datasets found

Introductions to Bioinformatics
figshare.com
pdf
Updated Jan 18, 2016
Cite
Aidan Budd (2016). Introductions to Bioinformatics [Dataset]. http://doi.org/10.6084/m9.figshare.830401.v1
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.830401.v1
Dataset updated
Jan 18, 2016
Dataset provided by
Figshare (http://figshare.com/)
Authors
Aidan Budd
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A collection of similar but different presentations I've made aimed at introducing bioinformatics to bench biologists.
Data from: “Bioinformatics: Introduction and Methods,” a Bilingual Massive...
datasetcatalog.nlm.nih.gov
Updated Dec 11, 2014
Cite
Meng, Yuqi; Wei, Liping; Gao, Ge; Yang, Xiaoxu; He, Yao; Ding, Yang; Liu, Fenglin; Ye, Adam Yongxin; Wang, Meng (2014). “Bioinformatics: Introduction and Methods,” a Bilingual Massive Open Online Course (MOOC) as a New Example for Global Bioinformatics Education [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001209841
Dataset updated
Dec 11, 2014
Authors
Meng, Yuqi; Wei, Liping; Gao, Ge; Yang, Xiaoxu; He, Yao; Ding, Yang; Liu, Fenglin; Ye, Adam Yongxin; Wang, Meng
Description
“Bioinformatics: Introduction and Methods,” a Bilingual Massive Open Online Course (MOOC) as a New Example for Global Bioinformatics Education
Bioinformatics Protein Dataset - Simulated
kaggle.com
zip
Updated Dec 27, 2024
+ more versions
Cite
Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
Available download formats: zip (12928905 bytes)
Dataset updated
Dec 27, 2024
Authors
Rafael Gallo
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Subtitle

"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

Description

Introduction

This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

Columns Included

ID_Protein: Unique identifier for each protein.

Sequence: String of amino acids.

Molecular_Weight: Molecular weight calculated from the sequence.

Isoelectric_Point: Estimated isoelectric point based on the sequence composition.

Hydrophobicity: Average hydrophobicity calculated from the sequence.

Total_Charge: Sum of the charges of the amino acids in the sequence.

Polar_Proportion: Percentage of polar amino acids in the sequence.

Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.

Sequence_Length: Total number of amino acids in the sequence.

Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

Inspiration and Sources

While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:
- UniProt: A comprehensive database of protein sequences and annotations.
- Kyte-Doolittle Scale: Calculations of hydrophobicity.
- Biopython: A tool for analyzing biological sequences.

Proposed Uses

This dataset is ideal for:
- Training classification models for proteins.
- Exploratory analysis of physicochemical properties of proteins.
- Building machine learning pipelines in bioinformatics.

How This Dataset Was Created

Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.

Property Calculation: Physicochemical properties were calculated using the Biopython library.

Class Assignment: Classes were randomly assigned for classification purposes.
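
The three creation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the author's script: the listing says the properties were calculated with Biopython, whereas this sketch uses only the standard library and computes a single property (mean Kyte-Doolittle hydropathy) by hand. The function names, the integer IDs, the seed, and the uniform sampling of residues are all assumptions.

```python
import random

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(KYTE_DOOLITTLE)
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]


def random_protein(rng, min_len=50, max_len=300):
    """Step 1: draw a random chain of 50 to 300 residues (uniform sampling is an assumption)."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mean_hydrophobicity(sequence):
    """Step 2 (one property only): average Kyte-Doolittle value over the sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def make_record(protein_id, rng):
    """Step 3: attach a class label drawn at random, independent of the sequence."""
    sequence = random_protein(rng)
    return {
        "ID_Protein": protein_id,  # hypothetical ID format
        "Sequence": sequence,
        "Hydrophobicity": mean_hydrophobicity(sequence),
        "Sequence_Length": len(sequence),
        "Class": rng.choice(CLASSES),
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
records = [make_record(i, rng) for i in range(5)]
for record in records:
    print(record["ID_Protein"], record["Sequence_Length"],
          round(record["Hydrophobicity"], 3), record["Class"])
```

Because the class label is drawn independently of the sequence, a classifier trained on data generated this way should not be expected to beat chance (about 20% accuracy for five balanced classes).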

Limitations

The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.

The functional classes are simulated and do not correspond to actual biological characteristics.

Data Split

The dataset is divided into two subsets:
- Training: 16,000 samples (proteinas_train.csv).
- Testing: 4,000 samples (proteinas_test.csv).

Acknowledgment

This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
Bioinformatics is a BLAST: Engaging First-Year Biology Students on Campus...
qubeshub.org
Updated Oct 4, 2022
Cite
Shem Unger*; Mark Rollins (2022). Bioinformatics is a BLAST: Engaging First-Year Biology Students on Campus Biodiversity Using DNA Barcoding [Dataset]. https://qubeshub.org/community/groups/coursesource/publications?id=3520
Dataset updated
Oct 4, 2022
Dataset provided by
QUBES
Authors
Shem Unger*; Mark Rollins
Description
To introduce students to the concept of molecular diversity, we developed a short, engaging online lesson using basic bioinformatics techniques. Students were introduced to basic bioinformatics while learning about local on-campus species diversity by 1) identifying species based on a given sequence (performing Basic Local Alignment Search Tool [BLAST] analysis) and 2) researching and documenting the natural history of each species identified in a concise write-up. To assess students’ perceptions of this lesson, we surveyed them using a Likert scale and asked them to elaborate on the activity in a written reflection. When combined, student responses indicated that 94% of students agreed this lesson helped them understand DNA barcoding and how it is used to identify species. The majority of students, 89.5%, reported they enjoyed the lesson and mainly provided positive feedback, including “It really opened my eyes to different species on campus by looking at DNA sequences”, “I loved searching information and discovering all this new information from a DNA sequence”, and finally, “the database was fun to navigate and identifying species felt like a cool puzzle.” Our results indicate this lesson both engaged and informed students on the use of DNA barcoding as a tool to identify local species biodiversity.

Primary Image: DNA Barcoded Specimens. Crane fly, dragonfly, ant, and spider identified using DNA barcoding.
Dataset for practice session 1 in bioinformatics
figshare.com
txt
Updated Jul 17, 2016
Cite
Elena Sugis (2016). Dataset for practice session 1 in bioinformatics [Dataset]. http://doi.org/10.6084/m9.figshare.3490211.v3
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.3490211.v3
Dataset updated
Jul 17, 2016
Dataset provided by
Figshare (http://figshare.com/)
Authors
Elena Sugis
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset for practising data preprocessing and unsupervised learning in the Introduction to Bioinformatics course.
Introduction to Biodiversity Informatics
figshare.com
pptx
Updated Feb 5, 2016
Cite
Dimitrios Koureas (2016). Introduction to Biodiversity Informatics [Dataset]. http://doi.org/10.6084/m9.figshare.1295382.v3
Available download formats: pptx
Unique identifier
https://doi.org/10.6084/m9.figshare.1295382.v3
Dataset updated
Feb 5, 2016
Dataset provided by
Figshare (http://figshare.com/)
Authors
Dimitrios Koureas
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A brief introduction to the concept, vision and challenges associated with Biodiversity Informatics.
Data_Sheet_2_Bioinformatics-Based Activities in High School: Fostering...
frontiersin.figshare.com
pdf
Updated Jun 3, 2023
Cite
Ana Martins; Maria João Fonseca; Marina Lemos; Leonor Lencastre; Fernando Tavares (2023). Data_Sheet_2_Bioinformatics-Based Activities in High School: Fostering Students’ Literacy, Interest, and Attitudes on Gene Regulation, Genomics, and Evolution.pdf [Dataset]. http://doi.org/10.3389/fmicb.2020.578099.s002
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fmicb.2020.578099.s002
Dataset updated
Jun 3, 2023
Dataset provided by
Frontiers Media (http://www.frontiersin.org/)
Authors
Ana Martins; Maria João Fonseca; Marina Lemos; Leonor Lencastre; Fernando Tavares
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The key role of bioinformatics in explaining biological phenomena calls for the need to rethink didactic approaches at high school aligned with a new scientific reality. Despite several initiatives to introduce bioinformatics in the classroom, there is still a lack of knowledge on their impact on students’ learning gains, engagement, and motivation. In this study, we detail the effects of four bioinformatics laboratories tailored for high school biology classes named “Mining the Genome: Using Bioinformatics Tools in the Classroom to Support Student Discovery of Genes” on literacy, interest, and attitudes on 387 high school students. By exploring these laboratories, students get acquainted with bioinformatics and acknowledge that many bioinformatics tools can be intuitive for beginners. Furthermore, introducing comparative genomics in their learning practices contributed for a better understanding of curricular contents regarding the identification of genes, their regulation, and how to make evolutionary assumptions. Following the intervention, students were able to pinpoint bioinformatics tools required to identify genes in a genomics sequence, and most importantly, they were able to solve genomics-related misconceptions. Overall, students revealed a positive attitude regarding the integration of bioinformatics-based approaches in their learning practices, reinforcing their added value in educational approaches.
Comparison of the multiple-delivery-mode training model employed by...
datasetcatalog.nlm.nih.gov
Updated Feb 25, 2021
Cite
Lennard, Katie; Aron, Shaun; Panji, Sumir; Kennedy, Dane; Mulder, Nicola; Allali, Imane; Fields, Christopher J; Ras, Verena; Mwaikono, Kilaza Samson; Rendon, Gloria; Claassen-Weitz, Shantelle; Holmes, Jessica R.; Botha, Gerrit (2021). Comparison of the multiple-delivery-mode training model employed by H3ABioNet’s Introduction to Bioinformatics (IBT) course and the 16s rRNA Microbiome Intermediate Bioinformatics Training course (16S). [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000897705
Dataset updated
Feb 25, 2021
Authors
Lennard, Katie; Aron, Shaun; Panji, Sumir; Kennedy, Dane; Mulder, Nicola; Allali, Imane; Fields, Christopher J; Ras, Verena; Mwaikono, Kilaza Samson; Rendon, Gloria; Claassen-Weitz, Shantelle; Holmes, Jessica R.; Botha, Gerrit
Description
The table provides a short description of the major components of the model employed by each course, highlighting any differences between the two (deviations are indicated by an asterisk (*)).
Data from: Bioinformatics calls the school: Use of smartphones to introduce...
datasetcatalog.nlm.nih.gov
Updated Feb 14, 2019
Cite
Rueda, Ana Julia Velez; Benítez, Guillermo I.; Parisi, Gustavo; Fornasari, María Silvina; Hasenahuer, Marcia Anahí; Marchetti, Julia; Palopoli, Nicolas (2019). Bioinformatics calls the school: Use of smartphones to introduce Python for bioinformatics in high schools [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000159463
Dataset updated
Feb 14, 2019
Authors
Rueda, Ana Julia Velez; Benítez, Guillermo I.; Parisi, Gustavo; Fornasari, María Silvina; Hasenahuer, Marcia Anahí; Marchetti, Julia; Palopoli, Nicolas
Description
The dynamic nature of technological developments invites us to rethink the learning spaces. In this context, science education can be enriched by the contribution of new computational resources, making the educational process more up-to-date, challenging, and attractive. Bioinformatics is a key interdisciplinary field, contributing to the understanding of biological processes that is often underrated in secondary schools. As a useful resource in learning activities, bioinformatics could help in engaging students to integrate multiple fields of knowledge (logical-mathematical, biological, computational, etc.) and generate an enriched and long-lasting learning environment. Here, we report our recent project in which high school students learned basic concepts of programming applied to solving biological problems. The students were taught the Python syntax, and they coded simple tools to answer biological questions using resources at hand. Notably, these were built mostly on the students’ own smartphones, which proved to be capable, readily available, and relevant complementary tools for teaching. This project resulted in an empowering and inclusive experience that challenged differences in social background and technological accessibility.
Introduction to Ancient Metagenomics Textbook (Edition 2024): Introduction...
zenodo.org
data.niaid.nih.gov
application/gzip
Updated Sep 13, 2024
+ more versions
Cite
Thiseas C. Lamnidis; Aida Andrades Valtueña; James A. Fellows Yates (2024). Introduction to Ancient Metagenomics Textbook (Edition 2024): Introduction to the Command Line [Dataset]. http://doi.org/10.5281/zenodo.13759270
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.13759270
Dataset updated
Sep 13, 2024
Dataset provided by
SPAAM Community
Authors
Thiseas C. Lamnidis; Aida Andrades Valtueña; James A. Fellows Yates
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data and conda software environment file for the chapter 'Introduction to the Command Line' of the SPAAM Community's textbook: Introduction to Ancient Metagenomics (https://www.spaam-community.org/intro-to-ancient-metagenomics-book).
Sequence Similarity: An inquiry based and "under the hood" approach for...
qubeshub.org
Updated Aug 28, 2021
Cite
Adam Kleinschmit*; Benita Brink; Steven Roof; Carlos Goller; Sabrina Robertson (2021). Sequence Similarity: An inquiry based and "under the hood" approach for incorporating molecular sequence alignment in introductory undergraduate biology courses [Dataset]. http://doi.org/10.24918/cs.2019.5
Unique identifier
https://doi.org/10.24918/cs.2019.5
Dataset updated
Aug 28, 2021
Dataset provided by
QUBES
Authors
Adam Kleinschmit*; Benita Brink; Steven Roof; Carlos Goller; Sabrina Robertson
Description
Introductory bioinformatics exercises often walk students through the use of computational tools, but often provide little understanding of what a computational tool does "under the hood." A solid understanding of how a bioinformatics computational algorithm functions, including its limitations, is key for interpreting the output in a biologically relevant context. This introductory bioinformatics exercise integrates an introduction to web-based sequence alignment algorithms with models to facilitate student reflection and appreciation for how computational tools provide similarity output data. The exercise concludes with a set of inquiry-based questions in which students may apply computational tools to solve a real biological problem.

In the module, students first define sequence similarity and then investigate how similarity can be quantitatively compared between two similar length proteins using a Blocks Substitution Matrix (BLOSUM) scoring matrix. Students then look for local regions of similarity between a sequence query and subjects within a large database using Basic Local Alignment Search Tool (BLAST). Lastly, students access text-based FASTA-formatted sequence information via National Center for Biotechnology Information (NCBI) databases as they collect sequences for a multiple sequence alignment using Clustal Omega to generate a phylogram and evaluate evolutionary relationships. The combination of diverse, inquiry-based questions, paper models, and web-based computational resources provides students with a solid basis for more advanced bioinformatics topics and an appreciation for the importance of bioinformatics tools across the discipline of biology.
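
The BLOSUM comparison described in the module can be illustrated with a short sketch: score two equal-length peptides position by position and sum the substitution scores. The matrix below is only a small excerpt of BLOSUM62 covering the pairs used in the example, and the peptides and function names are hypothetical; a complete matrix would come from NCBI or a library such as Biopython.

```python
# Excerpt of the BLOSUM62 substitution matrix (symmetric).
# Only the residue pairs needed for the example are included.
BLOSUM62_EXCERPT = {
    ("A", "A"): 4,
    ("L", "L"): 4,
    ("I", "V"): 3,
    ("D", "E"): 2,
    ("K", "R"): 2,
}


def pair_score(a, b, matrix):
    """Look up a substitution score, trying both orders because the matrix is symmetric."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]  # raises KeyError for pairs missing from the excerpt


def ungapped_score(seq1, seq2, matrix):
    """Sum the substitution scores of an ungapped, position-by-position alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length for an ungapped comparison")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


# Hypothetical peptides chosen so that every aligned pair is in the excerpt.
score = ungapped_score("AILDK", "AVLER", BLOSUM62_EXCERPT)
print(score)  # 4 + 3 + 4 + 2 + 2 = 15
```

Conservative substitutions (I/V, D/E, K/R) still contribute positive scores, which is why two sequences can score as similar without being identical.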
Introduction to single cell RNAseq analysis: supplementary material
explore.openaire.eu
Updated Apr 14, 2023
Cite
Jose Alejandro Romero Herrera; Samuele Soraggi (2023). Introduction to single cell RNAseq analysis: supplementary material [Dataset]. http://doi.org/10.5281/zenodo.7920686
Unique identifier
https://doi.org/10.5281/zenodo.7920686
Dataset updated
Apr 14, 2023
Authors
Jose Alejandro Romero Herrera; Samuele Soraggi
Description
This archive contains supplementary material used in the workshop "Introduction to single cell RNAseq analysis" taught by the Danish National Sandbox for Health Data Science. The course repo can be found on GitHub.
- Data.zip contains six 10x runs on spermatogonia development: three from healthy individuals and three from azoospermic individuals. The data have already been preprocessed using cellranger and can be loaded using Seurat (R) or scanpy (Python).
- Slides.zip contains slides explaining the theory of single cell RNAseq data analysis.
- Notebooks.zip contains R Markdown files for following the course using R in RStudio (updated version of the notebooks).
Data used in exercises in course Introduction to Data Management Practices
figshare.scilifelab.se
zip
Updated Jan 15, 2025
Cite
Yvonne Kallberg; Elin Kronander; Niclas Jareborg; Markus Englund; Wolmar Nyberg Åkerström (2025). Data used in exercises in course Introduction to Data Management Practices [Dataset]. http://doi.org/10.17044/scilifelab.14301317.v3
Available download formats: zip
Unique identifier
https://doi.org/10.17044/scilifelab.14301317.v3
Dataset updated
Jan 15, 2025
Dataset provided by
Uppsala University
Authors
Yvonne Kallberg; Elin Kronander; Niclas Jareborg; Markus Englund; Wolmar Nyberg Åkerström
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This record contains the data files used in exercises in the NBIS course "Introduction to Data Management Practices".
Introduction to Ancient Metagenomics Textbook (Edition 2024): Introduction...
zenodo.org
application/gzip
Updated Sep 13, 2024
Cite
Kevin Nota; Robin Warner; Maxime Borry (2024). Introduction to Ancient Metagenomics Textbook (Edition 2024): Introduction to Python and Pandas [Dataset]. http://doi.org/10.5281/zenodo.11394586
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.11394586
Dataset updated
Sep 13, 2024
Dataset provided by
SPAAM Community
Authors
Kevin Nota; Robin Warner; Maxime Borry
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data and conda software environment file for the chapter 'Introduction to Python and Pandas' of the SPAAM Community's textbook: Introduction to Ancient Metagenomics (https://www.spaam-community.org/intro-to-ancient-metagenomics-book).
Bioinformatics Services Market to Hit US$ 10.7 Billion in Next Decade
media.market.us
Updated Nov 5, 2024
Cite
Market.us Media (2024). Bioinformatics Services Market to Hit US$ 10.7 Billion in Next Decade [Dataset]. https://media.market.us/bioinformatics-services-market-news/
Dataset updated
Nov 5, 2024
Dataset authored and provided by
Market.us Media
License
https://media.market.us/privacy-policy
Time period covered
2022 - 2032
Area covered
United States
Description
Introduction

The Global Bioinformatics Services Market is poised for substantial growth, projected to increase from USD 2.9 billion in 2023 to USD 10.7 billion by 2033, achieving a compound annual growth rate (CAGR) of 13.9%. This market expansion is fueled by several key factors including technological advancements in genomics and the increasing complexity of biological datasets, which necessitate advanced computational technologies for efficient data management, analysis, and interpretation. These technologies are crucial for advancing medical research and improving patient care, particularly through personalized treatment plans and precision medicine.
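
The quoted growth rate can be checked against the two market figures. A minimal sketch, assuming the standard definition CAGR = (end / start)^(1 / years) - 1 and a 10-year span from 2023 to 2033; the function name is illustrative.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values over a number of years."""
    return (end_value / start_value) ** (1 / years) - 1


# USD 2.9 billion (2023) to USD 10.7 billion (2033), i.e. 10 years.
rate = cagr(2.9, 10.7, 10)
print(f"{rate:.1%}")  # 13.9%
```

The result agrees with the 13.9% stated in the description.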

Institutions like the Mayo Clinic are significantly contributing to this growth by expanding their bioinformatics services to support translational research and enhance patient care through the integration of large multi-omics data sets. Additionally, prominent educational institutions such as Stanford and Georgetown University are advancing their bioinformatics programs to equip the next generation of professionals with the necessary skills to address complex biomedical challenges using computational and quantitative methods.

The sector is also witnessing a surge in demand within the healthcare and pharmaceutical industries, where bioinformatics tools are integral to drug discovery and disease diagnosis. This demand drives the development of therapeutic strategies and deepens the understanding of disease mechanisms, further boosting the market growth. Research initiatives and collaborations, such as those at Harvard Medical School's Department of Biomedical Informatics and Stanford's Biomedical Informatics Research division, are key in transforming biomedical data into actionable insights for precision medicine.

In terms of recent industry developments, in January 2024, Qiagen announced a significant expansion of investments into its Qiagen Digital Insights (QDI) business. This expansion, fueled by robust sales of approximately $100 million in 2023, is set to enhance QDI's bioinformatics capabilities, including launching at least five new products and broadening the applications of Artificial Intelligence and Natural Language Processing within the sector.

Furthermore, in January 2023, Agilent Technologies unveiled a major investment of $725 million to double its manufacturing capacity for nucleic acid-based therapeutics, in response to the rapid growth in the therapeutic oligonucleotides market, projected to reach $2.4 billion by 2027. This expansion will introduce two new manufacturing lines to meet the escalating demand for siRNA, antisense, and CRISPR guide RNA molecules, reinforcing Agilent's market presence and capacity in this fast-evolving field.
WORKSHOP: Introduction to Machine Learning in R - from data to knowledge
explore.openaire.eu
Updated Dec 9, 2024
Cite
Fotis Psomopoulos; Eden Zhang; Erin Graham; Giorgia Mori; Uwe Winter (2024). WORKSHOP: Introduction to Machine Learning in R - from data to knowledge [Dataset]. http://doi.org/10.5281/zenodo.14545611
Unique identifier
https://doi.org/10.5281/zenodo.14545611
Dataset updated
Dec 9, 2024
Authors
Fotis Psomopoulos; Eden Zhang; Erin Graham; Giorgia Mori; Uwe Winter
Description
This record includes training materials associated with the Australian BioCommons workshop ‘Introduction to Machine Learning in R - from data to knowledge’. This workshop took place as one 4-hour session on 09 December 2024.

Event description

With the rise in high-throughput sequencing technologies, the volume of omics data has grown exponentially. A major issue is to mine useful knowledge from these heterogeneous collections of data. The analysis of complex high-volume data is not trivial and classical tools cannot be used to explore their full potential. Machine Learning (ML), a discipline in which computers perform automated learning without being programmed explicitly and assist humans to make sense of large and complex data sets, can thus be very useful in mining large omics datasets to uncover new insights that can advance the field of bioinformatics.

This hands-on workshop will introduce participants to the ML taxonomy and the applications of common ML algorithms to health data. The workshop will cover the foundational concepts and common methods being used to analyse omics data sets by providing a practical context through the use of basic but widely used R libraries. Participants will acquire an understanding of the standard ML processes, as well as the practical skills in applying them on familiar problems and publicly available real-world data sets.

Materials are shared under a Creative Commons Attribution 4.0 International agreement unless otherwise specified and were current at the time of the event.

Lead trainers: Dr Fotis Psomopoulos, Senior Researcher, Institute of Applied Biosciences (INAB), Center for Research and Technology Hellas (CERTH)
Facilitators: Dr Giorgia Mori, Australian BioCommons; Dr Eden Zhang, Sydney Informatics Hub; Dr Erin Graham, Queensland Cyber Infrastructure Foundation (QCIF)
Infrastructure provision: Uwe Winter, Australian BioCommons
Host: Dr Giorgia Mori, Australian BioCommons

Training materials

Files and materials included in this record:
- Event metadata (PDF): Information about the event, including description, event URL, learning objectives, prerequisites, technical requirements, etc.
- Training materials webpage
- Data and documentation
Transcriptomics in yeast
kaggle.com
zip
Updated Jan 24, 2017
Cite
CostalAether (2017). Transcriptomics in yeast [Dataset]. https://www.kaggle.com/costalaether/yeast-transcriptomics
Available download formats: zip (4901525 bytes)
Dataset updated
Jan 24, 2017
Authors
CostalAether
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Disclaimer
This is a data set of mine that I thought might be enjoyable to the community. It concerns next-generation sequencing and transcriptomics. I used several public raw datasets, but the processing needed to get to this dataset is extensive. This is my first contribution to Kaggle, so be nice, and let me know how I can improve the experience. NGS machines combined are the biggest data producer worldwide, so why not add some (more?) to Kaggle.
A look into Yeast transcriptomics

Background

Yeasts (in this case Saccharomyces cerevisiae) are used in the production of beer, wine, and bread, and in a whole lot of biotech applications such as creating complex pharmaceuticals. They are living eukaryotic organisms (meaning quite complex). All living organisms store information in their DNA, but action within a cell is carried out by specific proteins. The path from DNA to protein (from data to action) is simple: a specific region of the DNA gets transcribed to mRNA, which gets translated to protein. A common assumption is that the translation step is linear: more mRNA means more protein. Cells actively regulate the amount of each protein through the amount of mRNA they create. The expression of each gene depends on the condition the cell is in (starving, stressed, etc.). Modern methods in biology show us all the mRNA that is currently inside a cell. Assuming the process is linear, the more of a specific mRNA is available to a cell, the more of the corresponding protein it can make, which makes mRNA an excellent marker for what is actually happening inside a cell. It is important to consider that mRNA is fragile: it is actively replenished only when it is needed. Both mRNA and proteins are expensive for a cell to produce.

Yeasts are good model organisms for this, since they only have about 6000 genes. They are also single cells, which makes samples more homogeneous, and they contain few advanced features (splice junctions etc.).

(All of this is heavily simplified; let me know if I should go into more detail.)

The data

files
The following files are provided:
- SC_expression.csv: expression values for each gene over the available conditions.
- labels_CC.csv: labels for the individual genes, their status and, where known, intracellular localization (see below).

Maybe this would be nice as a little competition; I'll see how this one goes before I upload the other label files. Please provide some feedback on the presentation, and whatever else you would want me to share.
background
I used 92 samples from various openly available raw datasets and ran them through a modern RNAseq pipeline. They span a range of different conditions (I hid the raw names): stress conditions, temperature and heavy metals, as well as growth media changes and the deletion of specific genes. Originally I had 150 sets; 92 are of good enough quality. Evaluation was done at gene level. Each gene got its own row; samples are columns (some are replicates spread over several columns). Expression levels were normalized by TPM (transcripts per million), a default normalization procedure. Raw counts would have been integers; normalized, they are floats.
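
For readers unfamiliar with TPM, here is a minimal sketch of the usual calculation for one sample: divide each gene's raw count by its length, then rescale so the values sum to one million. The listing does not say how the pipeline implemented this, and the gene lengths and counts below are made-up illustrative numbers.

```python
def tpm(counts, lengths_kb):
    """Convert raw read counts for one sample into transcripts per million (TPM).

    counts     -- raw read counts, one per gene
    lengths_kb -- gene (transcript) lengths in kilobases, same order as counts
    """
    if len(counts) != len(lengths_kb):
        raise ValueError("counts and lengths_kb must have the same length")
    # Length-normalized rates: reads per kilobase for each gene.
    rates = [count / length for count, length in zip(counts, lengths_kb)]
    total = sum(rates)
    if total == 0:
        raise ValueError("sample has no reads")
    # Rescale so that all genes in the sample sum to one million.
    return [rate / total * 1_000_000 for rate in rates]


# Hypothetical sample: three genes whose counts scale exactly with their length.
counts = [100, 200, 300]
lengths_kb = [1.0, 2.0, 3.0]
values = tpm(counts, lengths_kb)
print([round(v, 2) for v in values])
```

Since the counts in this toy sample are proportional to gene length, all three genes end up with the same TPM, which shows why the integer counts become floats and why TPM values are comparable across genes of different length.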
Analysis and labels

Genes

The function of individual genes is a matter of dispute. Clearly, living cells are complex, and their inner workings are not visible. Gene function is commonly inferred indirectly by removing a gene and testing the cell's behavior. This is time-consuming and not very precise. As you can see in the dataset, there is still much to be done to fully understand even single-celled yeasts.

The provided dataset allows for a different approach to the functional classification of genes. The label files contained in the set map each gene to a specific label. The classification is based on the official Gene Ontology (GO) annotations. I simplified the nomenclature. Gene function is usually given in a hierarchical structure [inside cell --> cytoplasm --> associated with complex A ...]. I'm only keeping high-level associations, and using readable terms instead of GO terms. I'll extend this if people are interested.

Labels

CC labels concern Cellular Component: where the gene is found within a cell. The labels go into the details of the associations found. The label 'cellular_component' should be read as synonymous with 'unknown location'. CC is the easiest label to attach to a gene, as it is the one that can be studied most easily. Still, there are many genes missing.

MF labels concern Molecular Function: what the gene is doing. [upcoming]
BP labels concern Biological Process: what the gene is involved in. [upcoming]

The core interest here is whether it is possible to improve the genes classification by modeling the data. A common assu...
Table1_Bioinformatics on the Road: Taking Training to Students and...
frontiersin.figshare.com
docx
Updated Jun 8, 2023
Cite
Marcus Braga; Fabrício Araujo; Edian Franco; Kenny Pinheiro; Jakelyne Silva; Denner Maués; Sebastiao Neto; Lucas Pompeu; Luis Guimaraes; Adriana Carneiro; Igor Hamoy; Rommel Ramos (2023). Table1_Bioinformatics on the Road: Taking Training to Students and Researchers Beyond State Capitals.DOCX [Dataset]. http://doi.org/10.3389/feduc.2021.726930.s001
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.3389/feduc.2021.726930.s001
Dataset updated
Jun 8, 2023
Dataset provided by
Frontiers
Authors
Marcus Braga; Fabrício Araujo; Edian Franco; Kenny Pinheiro; Jakelyne Silva; Denner Maués; Sebastiao Neto; Lucas Pompeu; Luis Guimaraes; Adriana Carneiro; Igor Hamoy; Rommel Ramos
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In Brazil, capable bioinformaticians are trained mostly in graduate programs, sometimes with experiences during the undergraduate period. However, this training tends to be inefficient in attracting students to the area and, mainly, in attracting professionals to support research projects in research groups. To address these issues, participation in short courses is important for training students and professionals in the usage of tools for specific areas that use bioinformatics, as well as in ways to develop solutions tailored to the local needs of academic institutions or research groups. To this end, the project “Bioinformática na Estrada” (Bioinformatics on the Road) proposed improving bioinformaticians’ skills in undergraduate and graduate courses, primarily in the countryside of the State of Pará, in the Amazon region of Brazil. The project's scope is practical courses focused on the areas of interest of the place where the courses take place, to train and encourage students and researchers to work in this field, reducing the existing gap due to the lack of qualified bioinformatics professionals. Theoretical and practical workshops took place, such as Introduction to Bioinformatics, Computer Science Basics, Applications of Computational Intelligence applied to Bioinformatics and Biotechnology, Computational Tools for Bioinformatics, Soil Genomics and Research Perspectives and Horizons in the Amazon Region. In the end, 444 undergraduate and graduate students from higher education institutions in the state of Pará and other Brazilian states attended the events of the Bioinformatics on the Road project.
A Fun Introductory Command Line Lesson: Next Generation Sequencing Quality...
qubeshub.org
Updated Aug 30, 2021
Cite
Rachael †; William †; Sabrina Robertson; Andrew Lonsdale; Caylin Murray; Jason Williams; Ray Enke (2021). A Fun Introductory Command Line Lesson: Next Generation Sequencing Quality Analysis with Emoji! [Dataset]. http://doi.org/10.24918/cs.2021.17
Explore at:
Unique identifier
https://doi.org/10.24918/cs.2021.17
Dataset updated
Aug 30, 2021
Dataset provided by
QUBES
Authors
Rachael †; William †; Sabrina Robertson; Andrew Lonsdale; Caylin Murray; Jason Williams; Ray Enke
Description
Radical innovations in DNA sequencing technology over the past decade have created an increased need for computational bioinformatics analyses in the 21st century STEM workforce. Recent evidence however demonstrates that there are significant barriers to teaching these skills at the undergraduate level including lack of faculty training, lack of student interest in bioinformatics, lack of vetted teaching materials, and overly full curricula. To this end, the James Madison University, Center for Genome & Metagenome Studies (JMU CGEMS) and other PUI collaborators are devoted to developing and disseminating engaging bioinformatics teaching materials specifically designed for streamlined integration into general undergraduate biology curriculum. Here, we have developed and integrated a fun introductory level lesson to command line next generation sequencing (NGS) analysis into a large enrollment core biology course. This one-off activity takes a crucial but mundane aspect of NGS quality control (QC) analysis and incorporates the use of Emoji data outputs using the software FASTQE to pique student interest. This amusing command line analysis is subsequently paired with a more rigorous research-grade software package called FASTP in which students complete sequence QC and filtering using a few simple commands. Collectively, this short lesson provides novice-level faculty and students an engaging entry point to learning basic genomics command line programming skills as a gateway to more complex and elaborated applications of computational bioinformatics analyses.

Primary image: Undergraduate students learn the basics of command line NGS quality analysis using the FASTQE and FASTP programs.
Making toast: Using analogies to explore concepts in bioinformatics
qubeshub.org
Updated Aug 26, 2021
Cite
Kate Hertweck (2021). Making toast: Using analogies to explore concepts in bioinformatics [Dataset]. http://doi.org/10.24918/cs.2016.11
Explore at:
Unique identifier
https://doi.org/10.24918/cs.2016.11
Dataset updated
Aug 26, 2021
Dataset provided by
QUBES
Authors
Kate Hertweck
Description
Contemporary biology is moving towards heavy reliance on computational methods to manage, find patterns, and derive meaning from large-scale data, such as genomic sequences. Biology teachers are increasingly compelled to prepare students with skills to meet these challenges. However, introducing biology students to more abstract concepts associated with computational thinking remains a major challenge. Analogies have long been used in science classrooms to help students comprehend complex concepts by relating them to familiar processes. Here I present a multi-step procedure for introducing students to large-scale data analysis (bioinformatics workflows) by asking them to describe a common daily task: making toast. First, students describe the main steps associated with this procedure. Next, students are presented with alternative scenarios for materials and equipment and are asked to extend the analogy to accommodate them. Finally, students are led through examples of how the analogy breaks down, or fails to accurately represent, a bioinformatics analysis. This structured approach to student exploration of analogies related to computational biology capitalizes on diverse student experiences to both clarify concepts and ameliorate possible misconceptions. Similar methods can be used to introduce many abstract concepts in both biology and computer science.

Aidan Budd (2016). Introductions to Bioinformatics [Dataset]. http://doi.org/10.6084/m9.figshare.830401.v1

9 scholarly articles cite this dataset (View in Google Scholar)
