In the last decade, High-Throughput Sequencing (HTS) has revolutionized biology and medicine. This technology allows the sequencing of huge amounts of DNA and RNA fragments at a very low price. In medicine, HTS tests for disease diagnostics have already been brought into routine practice. However, adoption in plant health diagnostics is still limited. One of the main bottlenecks is the lack of expertise and consensus on the standardization of data analysis. The Plant Health Bioinformatic Network (PHBN) is a Euphresco project aiming to build a community network of bioinformaticians/computational biologists working in plant health. One of the main goals of the project is to develop reference datasets that can be used for the validation of bioinformatics pipelines and for standardization purposes.
Semi-artificial datasets have been created for this purpose (Datasets 1 to 10). They are composed of a "real" HTS dataset spiked with artificial viral reads. They will allow researchers to adjust ...
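The spiking idea can be sketched as follows. This is a purely illustrative Python fragment, not the PHBN generation procedure: the function name, the read length, and the random-substring model for "artificial viral reads" are all assumptions.

```python
import random

# Illustrative sketch only: mix artificial "viral" reads (random
# substrings of a viral reference sequence) into a set of real reads.
# This does NOT reproduce how the PHBN datasets were actually built.
def spike_reads(real_reads, viral_genome, n_spiked, read_len=150, seed=0):
    rng = random.Random(seed)
    spiked = []
    for _ in range(n_spiked):
        start = rng.randrange(0, len(viral_genome) - read_len + 1)
        spiked.append(viral_genome[start:start + read_len])
    mixed = real_reads + spiked
    rng.shuffle(mixed)  # interleave spiked and real reads
    return mixed
```

Because the spiked reads are known in advance, a pipeline's sensitivity and specificity for detecting them can be measured directly.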
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Subjective data models dataset
This dataset comprises data collected from study participants, for a study into how people working with biological data perceive data, and whether or not this perception of data aligns with a person's experiential and educational background. We call the concept of what data looks like to an individual a "subjective data model".
Todo: link paper/preprint once published.
Computational python analysis code: https://doi.org/10.5281/zenodo.7022789 and https://github.com/yochannah/subjective-data-models-analysis
Files
Transcripts of the recorded sessions are attached and have been verified by a second researcher. These files are all in plain text .txt format. Note that participant 3 did not agree to sharing the transcript of their interview.
Interview paper files
This folder has digital and photographed versions of the files shown to the participants for the file mapping task. Note that the original files are from the NCBI and from FlyBase.
Videos and stills from the recordings have been deleted in line with the Data Management Plan and Ethical Review.
anonymous_participant_list.csv
shows which files have transcripts associated (not all participants agreed to share transcripts), what the order of Tasks A and B were, the date of interview, and what entities participants added to the set provided (if any). See the paper methods for more info about why entities were added to the set.
cards.txt
is a full list of the cards presented in the tasks.
background survey
and background manual annotations
are the selected survey data about participant backgrounds, plus manual additions to these data where necessary, e.g. to interpret free text.
codes.csv
shows the qualitative codes used within the transcripts.
entry_point.csv
is a record of participants' identified entry points into the data.
file_mapping_responses
shows a record of responses to the file mapping task.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Biological data is increasing at high speed, creating a vast amount of knowledge, while the updating of knowledge in teaching is limited and classroom time remains unchanged. Therefore, integrating bioinformatics into teaching will be effective in teaching biology today. However, a big challenge is that pedagogical university students have yet to learn the basic knowledge and skills of bioinformatics, so they experience difficulty and confusion when using it in biology teaching. This dataset includes survey results on high school teachers, teacher training curricula and pedagogical students in Vietnam. The highlights of this dataset are six basic principles and four steps for integrating bioinformatics into biology teaching at high schools, with illustrative examples. These principles and approaches improve the quality of biology teaching and promote STEM education in Vietnam and other developing countries.
https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy
The global bioinformatics software market size was valued at approximately USD 10 billion in 2023, and it is projected to reach around USD 25 billion by 2032, growing at a robust CAGR of 11% during the forecast period. This remarkable growth is fueled by the increased application of bioinformatics in drug discovery and development, the rising demand for personalized medicine, and the ongoing advancements in sequencing technologies. The convergence of biology and information technology has led to the optimization of biological data management, propelling the market's expansion as it transforms the landscape of biotechnology and pharmaceutical research. The rapid integration of artificial intelligence and machine learning techniques to process complex biological data further accentuates the growth trajectory of this market.
An essential growth factor for the bioinformatics software market is the burgeoning demand for sequencing technologies. The decreasing cost of sequencing has led to a massive increase in the volume of genomic data generated, necessitating advanced software solutions to manage and interpret this data efficiently. This demand is particularly evident in genomics and proteomics, where bioinformatics software plays a critical role in analyzing and visualizing large datasets. Additionally, the adoption of cloud computing in bioinformatics offers scalable resources and cost-effective solutions for data storage and processing, further fueling market growth. The increasing collaboration between research institutions and software companies to develop innovative bioinformatics tools is also contributing positively to market expansion.
Another significant driver is the growth of personalized medicine, which relies heavily on bioinformatics for the analysis of individual genetic information to tailor therapeutic strategies. As healthcare systems worldwide move towards precision medicine, the demand for bioinformatics software that can integrate genetic, phenotypic, and environmental data becomes more pronounced. This trend is not only transforming patient care but also significantly impacting drug development processes, as pharmaceutical companies aim to create more effective and targeted therapies. The strategic partnerships and collaborations between biotech firms and bioinformatics software providers are critical in advancing personalized medicine and enhancing patient outcomes.
The increasing prevalence of complex diseases such as cancer and neurological disorders necessitates comprehensive research efforts, driving the need for robust bioinformatics software. These diseases require multi-omics approaches for better understanding, diagnosis, and treatment, where bioinformatics tools are indispensable. The ongoing research and development activities in this area, supported by government funding and private investments, are fostering innovation in bioinformatics solutions. Furthermore, the development of user-friendly and intuitive software interfaces is expanding the market beyond specialized research labs to include clinical settings and hospitals, broadening the potential user base and enhancing market penetration.
From a regional perspective, North America currently leads the bioinformatics software market, thanks to its advanced technological infrastructure, significant investment in healthcare R&D, and the presence of numerous key market players. The region accounted for the largest market share in 2023 and is expected to maintain its dominance throughout the forecast period. Meanwhile, the Asia Pacific region is anticipated to exhibit the highest CAGR, driven by increasing investments in biotechnology and pharmaceutical research, expanding healthcare infrastructure, and the rising adoption of bioinformatics in emerging economies like China and India. Europe's market growth is also significant, supported by substantial funding for genomic research and a strong focus on precision medicine initiatives.
Lifesciences Data Mining and Visualization are becoming increasingly vital in the bioinformatics software market. As the volume of biological data continues to grow exponentially, the need for sophisticated tools to mine and visualize this data is paramount. These tools enable researchers to uncover hidden patterns and insights from complex datasets, facilitating breakthroughs in genomics, proteomics, and other life sciences fields. The integration of advanced data mining techniques with visualization capabilities allows for a more intuitive
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We used the human genome reference sequence, version GRCh38.p13, as a reliable data source on which to carry out our experiments. We chose this version because it was the most recent one available in Ensembl at the time. However, the DNA sequence by itself is not enough; the specific TSS position of each transcript is also needed. In this section, we explain the steps followed to generate the final dataset: raw data gathering, positive instance processing, negative instance generation and data splitting by chromosomes.
First, we need an interface to download the raw data, which is composed of every transcript sequence in the human genome. We used Ensembl release 104 (Howe et al., 2020) and its utility BioMart (Smedley et al., 2009), which allows us to retrieve large amounts of data easily. It also enables us to select a wide variety of fields of interest, including the transcription start and end sites. After filtering out instances with null values in any relevant field, the combination of each sequence and its flanks forms our raw dataset. Once the sequences are available, we locate the TSS position (given by Ensembl) and the 2 following bases, treating the three together as a codon. After that, the 700 bases before this codon and the 300 bases after it are concatenated, yielding the final sequence of 1,003 nucleotides that is used in our models. These specific window values were used in (Bhandari et al., 2021) and we have kept them for comparison purposes. One of the most sensitive parts of this dataset is the generation of negative instances. We cannot obtain this kind of data in a straightforward manner, so we generate it synthetically. To obtain negative instances, i.e. sequences that do not represent a transcription start site, we select random DNA positions inside the transcripts that do not correspond to a TSS. Once a position is selected, we take the 700 bases before it and the 300 bases after it, as we did for the positive instances.
Regarding the positive to negative ratio, in a similar problem studying TIS instead of TSS (Zhang et al., 2017), a ratio of 10 negative instances to each positive one was found optimal. Following this idea, we select 10 random positions from the transcript sequence of each positive codon and label them
as negative instances. After this process, we end up with 1,122,113 instances: 102,488 positive and 1,019,625 negative sequences. To validate and test our models, we split this dataset into three parts: train, validation and test. We decided to make this split by chromosomes, as is done in (Perez-Rodriguez et al., 2020). Thus, we use chromosome 16 as the validation set because it is a good example of a chromosome with average characteristics. We then selected samples from chromosomes 1, 3, 13, 19 and 21 for the test set and used the rest to train our models. Every step of this process can be replicated using the scripts available at https://github.com/JoseBarbero/EnsemblTSSPrediction.
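The windowing and chromosome split described above can be sketched in Python. This is a minimal illustration of the stated parameters (700 upstream bases, the 3-base TSS "codon", 300 downstream bases, and the chromosome assignments); the actual implementation is in the EnsemblTSSPrediction repository, and the function names here are assumptions.

```python
# Sketch of the 1,003-nt window extraction and the by-chromosome split
# described in the text (hypothetical helper names).
WINDOW_UP = 700    # bases upstream of the TSS codon
WINDOW_DOWN = 300  # bases downstream of the TSS codon

def extract_window(chrom_seq, tss_pos):
    """Return the 1,003-nt window: 700 bases upstream, the 3-base
    TSS 'codon', and 300 bases downstream (0-based tss_pos)."""
    start = tss_pos - WINDOW_UP
    end = tss_pos + 3 + WINDOW_DOWN
    if start < 0 or end > len(chrom_seq):
        return None  # too close to the sequence boundary
    return chrom_seq[start:end]

VALIDATION = {"16"}
TEST = {"1", "3", "13", "19", "21"}

def split_of(chromosome):
    """Assign a chromosome to the train/validation/test partition."""
    if chromosome in VALIDATION:
        return "validation"
    if chromosome in TEST:
        return "test"
    return "train"
```

The same extraction applies to negative instances, with a randomly chosen non-TSS position in place of the true TSS.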
https://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy
The Bioinformatics Data Analysis Services market is experiencing robust growth, driven by the exponential increase in biological data generated through next-generation sequencing (NGS) and other high-throughput technologies. The market's expansion is fueled by the rising demand for personalized medicine, precision oncology, drug discovery, and agricultural biotechnology. Advancements in cloud computing and artificial intelligence (AI) are further accelerating the adoption of these services, enabling faster and more efficient analysis of complex datasets. Key players like Illumina, Thermo Fisher Scientific, and QIAGEN are strategically investing in R&D and acquisitions to strengthen their market positions and offer comprehensive solutions. The market is segmented based on service type (e.g., genomics, transcriptomics, proteomics), application (e.g., drug discovery, diagnostics), and deployment mode (cloud-based, on-premise). The competitive landscape is characterized by both large established players and smaller specialized companies focusing on niche applications. While the market faces challenges such as data security concerns and the need for skilled bioinformaticians, the overall growth trajectory remains positive. Looking ahead to 2033, the market is projected to maintain a significant Compound Annual Growth Rate (CAGR), fueled by continuous technological innovation and expanding applications. The increasing accessibility of bioinformatics tools and services, coupled with government initiatives promoting genomic research, will further propel market expansion. The integration of big data analytics and AI will play a critical role in unlocking valuable insights from complex biological datasets, leading to breakthroughs in various healthcare and research domains. Furthermore, strategic partnerships and collaborations between bioinformatics companies and research institutions will contribute to the market's continued growth.
Despite potential restraints like regulatory hurdles and the high cost of advanced analytical tools, the long-term outlook for the Bioinformatics Data Analysis Services market remains highly promising.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Health sciences research is increasingly focusing on big data applications, such as genomic technologies and precision medicine, to address key issues in human health. These approaches rely on biological data repositories and bioinformatic analyses, both of which are growing rapidly in size and scope. Libraries play a key role in supporting researchers in navigating these and other information resources. Methods: With the goal of supporting bioinformatics research in the health sciences, the University of Arizona Health Sciences Library established a Bioinformation program. To shape the support provided by the library, I developed and administered a needs assessment survey to the University of Arizona Health Sciences campus in Tucson, Arizona. The survey was designed to identify the training topics of interest to health sciences researchers and the preferred modes of training. Results: Survey respondents expressed an interest in a broad array of potential training topics, including "traditional" information seeking as well as interest in analytical training. Of particular interest were training in transcriptomic tools and the use of databases linking genotypes and phenotypes. Staff were most interested in bioinformatics training topics, while faculty were the least interested. Hands-on workshops were significantly preferred over any other mode of training. The University of Arizona Health Sciences Library is meeting those needs through internal programming and external partnerships. Conclusion: The results of the survey demonstrate a keen interest in a variety of bioinformatic resources; the challenge to the library is how to address those training needs. The mode of support depends largely on library staff expertise in the numerous subject-specific databases and tools. Librarian-led bioinformatic training sessions provide opportunities for engagement with researchers at multiple points of the research life cycle.
When training needs exceed library capacity, partnering with intramural and extramural units will be crucial in library support of health sciences bioinformatic research.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets, conda environments and software for the course "Population Genomics" by Prof. Kasper Munch. This course material is maintained by the Health Data Science Sandbox. This webpage shows the latest version of the course material.
The data is connected to the following repository: https://github.com/hds-sandbox/Popgen_course_aarhus. The original course material from Prof Kasper Munch is at https://github.com/kaspermunch/PopulationGenomicsCourse.
Description
The participants will after the course have detailed knowledge of the methods and applications required to perform a typical population genomic study.
By the end of the course, the participants must be able to:
The course introduces key concepts in population genomics, from the generation of population genetic data sets to the most common population genetic analyses and association studies. The first part of the course focuses on the generation of population genetic data sets. The second part introduces the most common population genetic analyses and their theoretical background; topics include analysis of demography, population structure, recombination and selection. The last part of the course focuses on applications of population genetic data sets for association studies in relation to human health.
Curriculum
The curriculum for each week is listed below. "Coop" refers to a set of lecture notes by Graham Coop that we will use throughout the course.
Course plan
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset was developed to create a census of sufficiently documented molecular biology databases in order to answer several preliminary research questions. Articles published in the annual Nucleic Acids Research (NAR) “Database Issues” were used to identify a population of databases for study. The questions addressed herein include: 1) what is the historical rate of database proliferation versus the rate of database attrition?, 2) to what extent do citations indicate persistence?, and 3) are databases under active maintenance, and does evidence of maintenance correlate with citation? An overarching goal of this study is to provide the ability to identify subsets of databases for further analysis, both as presented within this study and through subsequent use of this openly released dataset.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The following data sets are provided to run the SKAT_EMMAX test: Dataset.fam, Dataset.covars, Dataset.kinship, gene.raw, gene.gene
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
This collection contains an example MINUTE-ChIP dataset on which to run the minute pipeline, provided as supporting material to help users understand the results of a MINUTE-ChIP experiment, from raw data to a primary analysis that yields the relevant files for downstream analysis, along with summarized QC indicators. The example primary non-demultiplexed FASTQ files provided here were used to generate GSM5493452-GSM5493463 (H3K27m3) and GSM5823907-GSM5823918 (Input), deposited on GEO together under series GSE181241. For more information about MINUTE-ChIP, see the publication relevant to this dataset: Kumar, Banushree, et al. "Polycomb repressive complex 2 shields naïve human pluripotent cells from trophectoderm differentiation." Nature Cell Biology 24.6 (2022): 845-857. For more information about the minute pipeline, there is a public bioRxiv preprint, a GitHub repository and official documentation.
Data resource catalog that collates metadata on bioinformatics Web-based data resources including databases, ontologies, taxonomies and catalogues. An entry includes information such as resource identifier(s), name, description and URL. "Query" lines are defined for each resource that describe what type(s) of data are available, in what format, how (by what identifier) the data can be retrieved and from where (URL). DRCAT was developed to provide more extensive data integration for EMBOSS, but it has many applications beyond EMBOSS. DRCAT entries (including "Query" lines) are annotated with terms from the EDAM ontology of common bioinformatics concepts.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Untargeted metabolomics is a powerful tool for measuring and understanding complex biological chemistries. However, the use, bioinformatics and downstream analysis of mass spectrometry (MS) data can be daunting for inexperienced users. Numerous open-source and free-to-use data processing and analysis tools exist for various untargeted MS approaches, but choosing the 'correct' pipeline isn't straightforward. This data set can be used in conjunction with a user-friendly online guide which presents a workflow for connecting these tools to process, analyse and annotate various untargeted MS datasets. The workflow is intended to guide exploratory analysis in order to inform decision-making regarding costly and time-consuming downstream targeted MS approaches. The workflow provides practical advice concerning experimental design, organisation of data and downstream analysis, and offers details on sharing and storing valuable MS data for posterity. The workflow is editable and modular, allowing flexibility for updated/changing methodologies and increased clarity and detail as user participation becomes more common, allowing contributions and improvements to the workflow via the online repository.
Cumulative number of data packages in the Knowledge Network for Biocomplexity until 2007-06-21. This data set records the cumulative number of data packages in the Knowledge Network for Biocomplexity (KNB) data repository through 2007-06-21. A data package represents a set of data files and metadata files that together make a coherent, citable unit for some particular scientific activity. Each data package in the KNB is described by a scientific metadata document and can be composed of one or more data files that contain various segments of the data in question. File: cumdatasets-20070622.csv
This work presents a new consensus clustering method for gene expression microarray data based on a genetic algorithm. Using two datasets, DA and DB, as input, the genetic algorithm examines putative partitions for the samples in DA, selecting biomarkers that support such partitions. The biomarkers are then used to build a classifier which is used in DB to determine its sample classes. The genetic algorithm is guided by an objective function that takes into account the accuracy of classification in both datasets, the number of biomarkers that support the partition, and the distribution of the samples across the classes for each dataset. To illustrate the method, two whole-genome breast cancer instances from different sources were used. In this application, the results indicate that the method could be used to find unknown subtypes of diseases supported by biomarkers presenting similar gene expression profiles across platforms. Moreover, even though this initial study was restricted to two datasets and two classes, the method can be easily extended to consider both more datasets and classes. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
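The multi-term objective the abstract describes can be sketched as a fitness function. This is a hypothetical illustration of combining the three stated criteria; the weights and exact functional form are assumptions, not the authors' actual objective.

```python
# Hypothetical sketch of a GA fitness combining the three criteria
# named in the abstract: accuracy on both datasets, a penalty on the
# number of selected biomarkers, and a reward for balanced classes.
def fitness(acc_da, acc_db, n_biomarkers, class_balance,
            w_acc=1.0, w_size=0.01, w_balance=0.5):
    """class_balance is in [0, 1], with 1 = perfectly balanced classes.
    Higher fitness is better."""
    return (w_acc * (acc_da + acc_db) / 2
            - w_size * n_biomarkers
            + w_balance * class_balance)
```

A genetic algorithm would then evolve candidate partitions and biomarker subsets so as to maximize this score.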
Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains INSDC sequences associated with environmental sample identifiers. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) by querying data with the search parameters: `environmental_sample=True & host=""`
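A query of this kind might be issued against the ENA Portal API roughly as follows. This is a sketch only: the parameter names and query grammar should be verified against the ENA Portal API documentation before use, and the request itself is only constructed here, not sent.

```python
from urllib.parse import urlencode

# Sketch: build a search URL for the public ENA Portal API using the
# parameters quoted above. Parameter names are assumptions to check
# against the API docs.
BASE = "https://www.ebi.ac.uk/ena/portal/api/search"

def build_ena_query(result="sequence", fmt="tsv", limit=0):
    params = {
        "result": result,
        "query": 'environmental_sample=true AND host=""',
        "format": fmt,
        "limit": limit,  # 0 is commonly used for "no limit"
    }
    return BASE + "?" + urlencode(params)
```

The resulting URL can then be fetched with any HTTP client to retrieve the matching records.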
EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).
The data was then processed as follows:
1. Human sequences were excluded.
2. For non-CONTIG records, the sample accession number (when available) along with the scientific name were used to identify sequence records corresponding to the same individuals (or groups of organisms of the same species in the same sample). Only one record was kept for each scientific name/sample accession number.
3. Contigs and whole genome shotgun (WGS) records were added individually.
4. The records that were missing some information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept.
5. The records associated with the same vouchers were aggregated together.
6. Many of the remaining records corresponded to individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that weren't filtered out in step 2 because the sample accession number was missing. To identify these potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). Then we excluded the groups that contained more than 50 records. The rationale behind this choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978
7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip
More information available here: https://github.com/gbif/embl-adapter#readme
You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
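The duplicate-filtering logic of step 6 can be sketched in Python. The record field names follow the grouping keys listed above, but the function itself is an illustration; the real implementation lives in the gbif/embl-adapter repository.

```python
from collections import defaultdict

# Sketch of step 6: group records on the listed key fields and drop
# any group larger than the threshold, since such groups likely hold
# many sequences from the same organisms.
GROUP_KEYS = ("scientific_name", "collection_date", "location",
              "country", "identified_by", "collected_by",
              "sample_accession")
MAX_GROUP_SIZE = 50

def filter_probable_duplicates(records):
    """records: iterable of dicts. Returns the records whose group,
    keyed on GROUP_KEYS, has at most MAX_GROUP_SIZE members."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(k) for k in GROUP_KEYS)
        groups[key].append(rec)
    kept = []
    for members in groups.values():
        if len(members) <= MAX_GROUP_SIZE:
            kept.extend(members)
    return kept
```

Groups at or below the threshold pass through unchanged, so legitimately distinct records sharing collection metadata are retained.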
THIS RESOURCE IS NO LONGER IN SERVICE, documented on 8/12/13. An expanded version of the Alternative Splicing Annotation Project (ASAP) database with a new interface and integration of comparative features using UCSC BLASTZ multiple alignments. It supports 9 vertebrate species, 4 insects, and nematodes, and provides extensive alternative splicing analysis and splicing variants. For the human alternative splicing data, newly added EST libraries were classified and included in the previous tissue and cancer classification, and lists of tissue- and cancer-specific (versus normal) alternatively spliced genes were re-calculated and updated. The authors created novel orthologous exon and intron databases, together with their splice variants, based on multiple alignments among several species. These orthologous exon and intron databases can give more comprehensive homologous gene information than protein-similarity-based methods. Furthermore, splice junction and exon identity among species can be valuable resources for elucidating species-specific genes. The ASAP II database can be easily integrated with pygr (unpublished, the Python Graph Database Framework for Bioinformatics) and its powerful features, such as graph queries and multi-genome alignment queries. ASAP II can be searched by several different criteria, such as gene symbol, gene name and ID (UniGene, GenBank, etc.). The web interface provides 7 different kinds of views: (I) user query, UniGene annotation, orthologous genes and genome browsers; (II) genome alignment; (III) exons and orthologous exons; (IV) introns and orthologous introns; (V) alternative splicing; (VI) isoform and protein sequences; (VII) tissue and cancer vs. normal specificity. ASAP II shows genome alignments of isoforms, exons, and introns in a UCSC-like genome browser. All alternative splicing relationships, with supporting evidence, types of alternative splicing patterns, and inclusion rates for skipped exons, are listed in separate tables.
Users can also search the human data for tissue- and cancer-specific splice forms at the bottom of the gene summary page. P-values for tissue specificity are reported as log-odds (LOD) scores, and results with LOD >= 3 and at least 3 EST sequences are highlighted.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Projects in chemo- and bioinformatics often consist of scattered data of various types that are difficult to access in a meaningful way for efficient data analysis. The data is usually too diverse even to be manipulated effectively. Sdfconf is data manipulation and analysis software that addresses this problem in a logical and robust manner. Other software commonly used for such tasks is either not designed with molecular and/or conformational data in mind or provides only a narrow set of tasks to be accomplished. Furthermore, many tools are only available within commercial software packages. Sdfconf is a flexible, robust, and free-of-charge tool for linking data from various sources for meaningful and efficient manipulation and analysis of molecular data sets. Sdfconf packages molecular structures and metadata into a complete ensemble, from which one can access both the whole data set and individual molecules and/or conformations. In this software note, we offer some practical examples of the utilization of sdfconf.