Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CHIIMP output from samples when replicates were bioinformatically combined (via bash scripting 'cat') prior to having genotypes called. This was done to evaluate if minor alleles recovered would vary with coverage.
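The combining step described above used the shell `cat` command. As a minimal sketch of the same idea in Python, the following concatenates replicate read files byte for byte before genotype calling. The function name, file names and read contents are hypothetical and are not taken from the dataset.

```python
import shutil
import tempfile
from pathlib import Path


def combine_replicates(replicate_paths, combined_path):
    """Concatenate replicate read files byte for byte, like `cat rep1 rep2 > combined`."""
    combined_path = Path(combined_path)
    with combined_path.open("wb") as out:
        for path in replicate_paths:
            with Path(path).open("rb") as src:
                shutil.copyfileobj(src, out)
    return combined_path


# Demo with two tiny, made-up replicate files (hypothetical names and reads).
workdir = Path(tempfile.mkdtemp())
rep1 = workdir / "sampleA_rep1.fastq"
rep2 = workdir / "sampleA_rep2.fastq"
rep1.write_text("@read1\nACGT\n+\nIIII\n")
rep2.write_text("@read2\nTTGA\n+\nIIII\n")

combined = combine_replicates([rep1, rep2], workdir / "sampleA_combined.fastq")
print(combined.read_text())
```

Because the files are copied in binary mode, the combined file holds the reads of every replicate in the order given, which raises per-sample coverage without altering any read.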
https://spdx.org/licenses/CC0-1.0.html
Restriction site-associated DNA sequencing (RAD-seq) provides high-resolution population genomic data at low cost, and has become an important component in ecological and evolutionary studies. As with all high-throughput technologies, analytic strategies require critical validation to ensure accurate and unbiased interpretation. To test for the impact of bioinformatic data processing on downstream population genetic inferences, we analysed mammalian RAD-seq data (>100 individuals) with 312 combinations of methodology (de novo vs. mapping to references of increasing divergence) and filtering criteria (missing data, HWE, FIS, coverage, mapping, genotype quality). In an effort to identify commonalities and biases in all pipelines, we computed summary statistics (nr. loci, nr. SNP, π, Hetobs, FIS, FST, Ne, m) and compared the results to independent null expectations (isolation-by-distance correlation, expected transition-to-transversion ratio Ts/Tv, Mendelian mismatch rates of known parent-offspring trios). We observed large differences between reference-based and de novo approaches, the former generally calling more SNPs and reducing FIS and Ts/Tv. Data completion levels showed little impact on most summary statistics, and FST estimates were robust across all pipelines. The site-frequency spectrum (SFS) was highly sensitive to the chosen approach as reflected in large variance of parameter estimates across demographic scenarios (single-population bottlenecks and isolation-with-migration model). Null-expectations were best met by reference-based approaches, though contingent on the specific criteria. We recommend RAD-seq studies employ reference-based approaches to a closely related genome, and due to the high stochasticity associated with the pipeline advocate the use of multiple pipelines to ensure robust population genetic and demographic inferences.
Shiga toxin-producing Escherichia coli (STEC) and Listeria monocytogenes are responsible for severe foodborne illnesses in the United States. Current identification methods require at least four days to identify STEC and six days for L. monocytogenes. Adoption of long-read, whole genome sequencing for testing could significantly reduce the time needed for identification, but method development costs are high. Therefore, the goal of this project was to use NanoSim-H software to simulate Oxford Nanopore sequencing reads to assess the feasibility of sequencing-based foodborne pathogen detection and guide experimental design. Sequencing reads were simulated for STEC, L. monocytogenes, and a 1:1 combination of STEC and Bos taurus genomes using NanoSim-H. This dataset includes all of the simulated reads generated by the project in fasta format. This dataset can be analyzed bioinformatically or used to test bioinformatic pipelines.
https://spdx.org/licenses/etalab-2.0.html
RMQS: The French Soil Quality Monitoring Network (RMQS) is a national program for the assessment and long-term monitoring of the quality of French soils. This network is based on the monitoring of 2240 sites representative of French soils and their land use. These sites are spread over the whole French territory (metropolitan and overseas) along a systematic square grid of 16 km x 16 km cells. The network covers a broad spectrum of climatic, soil and land-use conditions (croplands, permanent grasslands, woodlands, orchards and vineyards, natural or scarcely anthropogenic land and urban parkland). The first sampling campaign in metropolitan France took place from 2000 to 2009.
Dataset: This dataset contains the config files used to run the bioinformatic pipeline and the control sample data that were not published before. Reference environmental DNA samples named “G4” in internal laboratory processes were added for each molecular analysis. They were used for technical validation, but not necessarily published alongside the datasets. The taxonomy and OTU abundance files for these control samples were built like the taxonomy and abundance files of the main dataset. As these internal control samples were clustered against the RMQS dataset in an open-reference fashion, they contained new OTUs (noted as “OUT”) that corresponded to sequences that did not match any of the 188,030 RMQS reference sequences. The sample bank association file links each sample to its sequencing library. The G4 metadata file links each G4 to its library, molecular tag and sequence repository information.
File structure:
Taxonomy files rmqs1_control_taxonomy_: Taxonomy is split across five files, with one line per site and one column per taxon. Each line sums to 10k (rarefaction threshold). Three supplementary columns are present:
  Unknown: not matching any reference.
  Unclassified: missing taxa between genus and phylum.
  Environmental: matched to sample from environmental study, generally with only a phylum name.
rmqs1_16S_otu_abundance.tsv: OTU abundance per site (one column per OTU; “DB” + number for OTUs from the RMQS reference set, “OUT” for OTUs not matching any “DB” ones). Each line sums to 10k (rarefaction threshold).
rmqs1_16S_bank_association.tsv: two-column file giving the bank name for each sample.
rmqs1_16S_bank_metadata.tsv:
  library_name: library name used in labs
  study_accession, sample_accession, experiment_accession, run_accession: SRA EBI identifier
  library_name_genoscope: library name used in the Genoscope sequence center
  MID: multiplex identifier sequence
  run_alias: Genoscope internal alias
  ftp_link: FTP link to download library
Input_G4.txt: Tabulated file containing the parameters and the bioinformatic steps done by the BIOCOM-PIPE pipeline to extract, treat and analyze controls from the raw libraries detailed in rmqs1_16S_bank_metadata.tsv.
project_G4.tab: Comma separated file containing the information needed to generate the Input.txt file with the BIOCOM-PIPE pipeline for controls only:
  PROJECT: Project name chosen by the user
  LIBRARY_NAME: Library name chosen by the user
  LIBRARY_NAME_RECEIVED: Library name chosen by the sequencing partner and used by BIOCOM-PIPE
  SAMPLE_NAME: Sample name chosen by the user
  MID_F: MID name or MID sequence associated to the Forward primer
  MID_R: MID name or MID sequence associated to the Reverse primer
  TARGET: Target gene (16S, 18S, or 23S)
  PRIMER_F: Forward primer name used for amplification
  PRIMER_R: Reverse primer name used for amplification
  SEQUENCE_PRIMER_F: Forward primer sequence used for amplification
  SEQUENCE_PRIMER_R: Reverse primer sequence used for amplification
Input_GLOBAL.txt: Tabulated file containing the parameters and the bioinformatic steps done by the BIOCOM-PIPE pipeline to extract, treat and analyze controls and samples from the raw libraries detailed in rmqs1_16S_bank_metadata.tsv.
project_GLOBAL.tab: Comma separated file containing the information needed to generate the Input.txt file for controls and samples with the BIOCOM-PIPE pipeline:
  PROJECT: Project name chosen by the user
  LIBRARY_NAME: Library name chosen by the user
  LIBRARY_NAME_RECEIVED: Library name chosen by the sequencing partner and used by BIOCOM-PIPE
  SAMPLE_NAME: Sample name chosen by the user
  MID_F: MID name or MID sequence associated to the Forward primer
  MID_R: MID name or MID sequence associated to the Reverse primer
  TARGET: Target gene (16S, 18S, or 23S)
  PRIMER_F: Forward primer name used for amplification
  PRIMER_R: Reverse primer name used for amplification
  SEQUENCE_PRIMER_F: Forward primer sequence used for amplification
  SEQUENCE_PRIMER_R: Reverse primer sequence used for amplification
Details: Data from three libraries (58, 59 and 69) were re-sequenced and are not detailed in the files. Some samples can be present in several libraries. We kept only the one with the highest number of sequences.
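The RMQS description states that each line of the abundance and taxonomy tables sums to 10k (the rarefaction threshold). The following is a minimal sketch of how that property could be checked. It assumes a tab-separated file with one header row, a site identifier in the first column and integer counts in the remaining columns; those layout details are assumptions, and the demo table is made up.

```python
import csv
import io

RAREFACTION_DEPTH = 10_000  # "Each line sums to 10k" in the dataset description


def rows_off_depth(handle, depth=RAREFACTION_DEPTH):
    """Return (row_id, total) for every row whose counts do not sum to `depth`.

    Assumes (not confirmed by the source): tab-separated values, one header
    row, row identifier in the first column, integer counts afterwards.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the assumed header row
    bad = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        total = sum(int(value) for value in row[1:])
        if total != depth:
            bad.append((row[0], total))
    return bad


# Made-up two-site table: S1 is correctly rarefied, S2 is 50 counts short.
demo = "site\tDB1\tDB2\tOUT1\nS1\t6000\t3500\t500\nS2\t9000\t900\t50\n"
print(rows_off_depth(io.StringIO(demo)))  # [('S2', 9950)]
```

For a real file, the same function can be called on an open file handle, for example `open("rmqs1_16S_otu_abundance.tsv", newline="")`, provided the layout assumptions hold.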
Example dataset one: British otter diet
Faecal samples were collected during otter post-mortems by the Cardiff University Otter Project. Extracted faecal DNA was amplified using two metabarcoding primer pairs designed to amplify regions of the 16S rRNA and cytochrome c oxidase subunit I (COI) genes, each primer having ten-base-pair molecular identifier tags (MID tags) to facilitate post-bioinformatic sample identification. Extraction and PCR negative controls, unused MID tag combinations, repeat samples and mock communities were included alongside the focal eDNA samples. Mock communities comprised standardised mixtures of DNA of marine species not previously detected in the diet of Eurasian otters. The resultant DNA libraries for each marker were sequenced on separate MiSeq V2 chips with 2x250bp paired-end reads.
Example dataset two: cereal crop spider diet
Money spiders (Bathyphantes, Erigone, Microlinyphia and Tenuiphantes; Araneae: Linyphiidae) and wolf spiders (Pardosa; Ar...
The ABYSS project aims to describe deep-sea benthic biodiversity spanning several branches of the tree of life with eDNA metabarcoding tools. To accommodate both micro- and macrobiologists, we designed a bioinformatic pipeline based on Illumina read correction with Dada2 that allows metabarcodes from prokaryotic and eukaryotic life compartments to be analysed.
https://spdx.org/licenses/CC0-1.0.html
SSR_pipeline is a flexible set of programs designed to efficiently identify simple sequence repeats (e.g., microsatellites) from paired-end high-throughput Illumina DNA sequencing data. The program suite contains 3 analysis modules along with a fourth control module that can automate analyses of large volumes of data. The modules are used to 1) identify the subset of paired-end sequences that pass Illumina quality standards, 2) align paired-end reads into a single composite DNA sequence, and 3) identify sequences that possess microsatellites (both simple and compound) conforming to user-specified parameters. The microsatellite search algorithm is extremely efficient, and we have used it to identify repeats with motifs from 2 to 25bp in length. Each of the 3 analysis modules can also be used independently to provide greater flexibility or to work with FASTQ or FASTA files generated from other sequencing platforms (Roche 454, Ion Torrent, etc.). We demonstrate use of the program with data from the brine fly Ephydra packardi (Diptera: Ephydridae) and provide empirical timing benchmarks to illustrate program performance on a common desktop computer environment. We further show that the Illumina platform is capable of identifying large numbers of microsatellites, even when using unenriched sample libraries and a very small percentage of the sequencing capacity from a single DNA sequencing run. All modules from SSR_pipeline are implemented in the Python programming language and can therefore be used from nearly any computer operating system (Linux, Macintosh, and Windows).
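To illustrate the kind of search described above, here is a minimal regex-based sketch that finds simple (not compound) microsatellites with user-specified motif length and repeat count. It is not SSR_pipeline's algorithm, which the description does not detail; the function name, parameter names and default values are all hypothetical.

```python
import re


def find_microsatellites(seq, min_motif=2, max_motif=6, min_repeats=4):
    """Return (start, motif, repeat_count) for simple tandem repeats in `seq`.

    Illustrative only: a back-reference regex, with hypothetical defaults.
    Compound microsatellites are not handled.
    """
    # Lazy motif group so the shortest repeating unit is tried first,
    # followed by at least (min_repeats - 1) further copies of that unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_motif, max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAA
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits


# Made-up sequence: six copies of AC flanked by non-repetitive bases.
print(find_microsatellites("GGCT" + "AC" * 6 + "TTGA"))  # [(4, 'AC', 6)]
```

Raising `max_motif` to 25 would cover the motif range mentioned in the description, at the cost of more backtracking per start position.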
https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.3/customlicense?persistentId=doi:10.11588/DATA/10056
Kinetic Operating Microarray Analyzer (KOMA) enables calibration and high-throughput analysis of quantitative microarray data collected using a kinetic detection protocol. This tool can also be helpful for analyzing data from any other analytical assay employing enzymatic signal amplification, in which a broader range of quantification is reached by the time-resolved recording of readouts.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the data repository for the Data in Brief article: Genomic and bioinformatic analysis of Vicilin dataset, the 7S globulin from cowpea (Vigna unguiculata) seeds
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ameloblastoma is a highly aggressive odontogenic tumor, and its pathogenesis is associated with multiple participating genes. Objective: Our aim was to identify and validate new critical genes of conventional ameloblastoma using microarray and bioinformatics analysis. Methods: Gene expression microarray and bioinformatic analyses were performed using CHIP H10KA and DAVID software for enrichment. Protein-protein interactions (PPI) were visualized using STRING-Cytoscape with the MCODE plugin, followed by Kaplan-Meier and GEPIA analyses, which were used to postulate candidate genes. RT-qPCR and IHC assays were performed to validate the bioinformatic approach. Results: 376 upregulated genes were identified. PPI analysis revealed 14 genes that were validated by Kaplan-Meier and GEPIA, resulting in PDGFA and IL2RA as candidate genes. The RT-qPCR analysis confirmed their intense expression. Immunohistochemistry analysis showed that PDGFA expression is located in the parenchyma. Conclusion: With bioinformatics methods, we can identify upregulated genes in conventional ameloblastoma, and with RT-qPCR and immunoexpression analysis validate that PDGFA could be a more specific and localized therapeutic target.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset tracks annual science proficiency from 2021 to 2022 for Research Laboratory High School-bioinformatic vs. New York and Buffalo City School District
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset tracks annual diversity score from 2021 to 2023 for Research Laboratory High School-bioinformatic vs. New York and Buffalo City School District
Community composition data are essential for conservation management, facilitating identification of rare native and invasive species, along with abundant ones. However, traditional capture-based morphological surveys require considerable taxonomic expertise, are time consuming and expensive, can kill rare taxa and damage habitats, and often are prone to false negatives. Alternatively, metabarcode assays can be used to assess the genetic identity and compositions of entire communities from environmental samples, comprising a more sensitive, less damaging, and relatively time- and cost-efficient approach. However, there is a trade-off between the stringency of bioinformatic filtering needed to remove false positives and the potential for false negatives. The present investigation thus evaluated use of four mitochondrial (mt) DNA metabarcode assays and a customized bioinformatic pipeline to increase confidence in species identifications by removing false positives, while achieving high de...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Features of bioinformatically-defined Mycobacteriophage endolysin domains.
https://spdx.org/licenses/etalab-2.0.html
This archive provides the bioinformatic scripts used to analyze a metabarcoding dataset (available at https://www.ncbi.nlm.nih.gov/bioproject/678415) representing foliar fungal communities associated with two grapevine varieties (Vitis vinifera ‘Regent’ and ‘Cabernet-Sauvignon’) inoculated with powdery mildew (Erysiphe necator) under various drought conditions. The raw and filtered ASV tables are provided in QIIME2 format and R phyloseq format. Summary statistics and interactive taxonomic plots (.qzv files) can be viewed in the QIIME2view interface.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Table of contents
This repository contains the data that support the findings of the manuscript
"TOA: a software package for automated functional annotation in non-model plant species".
Directories:
BenchmarkTranscriptomes: Fasta files with the sequences corresponding to the benchmark transcriptomes tested in the manuscript:
Cnuc: Cocos nucifera embryos (Huang et al., 2014).
Fsyl: Fagus sylvatica leaves (Müller, Seifert, Lübbe, Leuschner, & Finkeldey, 2017).
Pcan: Pinus canariensis immature xylem (Chano, Collada, & Soto, 2017).
GoldStandardTranscripts: Fasta files with the transcripts of two well-known gene families in Arabidopsis thaliana, fetched from TAIR: the kinesin gene family (GS1) (Reddy & Day, 2001) and the monolignol biosynthesis gene family (GS2) (Raes et al., 2003).
EnTAP: output of the six simulation tests performed with EnTAP.
TOA: output of the six simulation tests performed with TOA.
TRAPID: output of the three simulation tests performed with TRAPID.
Trinotate: output of the three simulation tests performed with Trinotate.
References:
Chano, V., Collada, C., & Soto, A. (2017). Transcriptomic analysis of wound xylem formation in Pinus canariensis. BMC Plant Biology, 17(234). doi:10.1186/s12870-017-1183-3
Huang, Y. Y., Lee, C. P., Fu, J. L., Chang, B. C. H., Matzke, A. J. M., & Matzke, M. (2014). De novo transcriptome sequence assembly from coconut leaves and seeds with a focus on factors involved in RNA-directed DNA methylation. G3: Genes, Genomes, Genetics, 4(11), 2147–2157. doi:10.1534/g3.114.013409
Müller, M., Seifert, S., Lübbe, T., Leuschner, C., & Finkeldey, R. (2017). De novo transcriptome assembly and analysis of differential gene expression in response to drought in European beech. PLoS ONE, 12(9), 1–20. doi:10.1371/journal.pone.0184167
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset tracks annual distribution of students across grade levels in Research Laboratory High School-bioinformatic
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Health sciences research is increasingly focusing on big data applications, such as genomic technologies and precision medicine, to address key issues in human health. These approaches rely on biological data repositories and bioinformatic analyses, both of which are growing rapidly in size and scope. Libraries play a key role in supporting researchers in navigating these and other information resources.
Methods: With the goal of supporting bioinformatics research in the health sciences, the University of Arizona Health Sciences Library established a Bioinformation program. To shape the support provided by the library, I developed and administered a needs assessment survey to the University of Arizona Health Sciences campus in Tucson, Arizona. The survey was designed to identify the training topics of interest to health sciences researchers and the preferred modes of training.
Results: Survey respondents expressed an interest in a broad array of potential training topics, including "traditional" information seeking as well as interest in analytical training. Of particular interest were training in transcriptomic tools and the use of databases linking genotypes and phenotypes. Staff were most interested in bioinformatics training topics, while faculty were the least interested. Hands-on workshops were significantly preferred over any other mode of training. The University of Arizona Health Sciences Library is meeting those needs through internal programming and external partnerships.
Conclusion: The results of the survey demonstrate a keen interest in a variety of bioinformatic resources; the challenge to the library is how to address those training needs. The mode of support depends largely on library staff expertise in the numerous subject-specific databases and tools. Librarian-led bioinformatic training sessions provide opportunities for engagement with researchers at multiple points of the research life cycle. When training needs exceed library capacity, partnering with intramural and extramural units will be crucial in library support of health sciences bioinformatic research.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Historical Dataset of Research Laboratory High School-bioinformatic is provided by PublicSchoolReview and contains statistics on the following metrics: Total Students Trends Over Years (2021-2023), Total Classroom Teachers Trends Over Years (2021-2023), Distribution of Students By Grade Trends, Student-Teacher Ratio Comparison Over Years (2021-2023), Asian Student Percentage Comparison Over Years (2021-2023), Hispanic Student Percentage Comparison Over Years (2021-2023), Black Student Percentage Comparison Over Years (2021-2023), White Student Percentage Comparison Over Years (2021-2023), Two or More Races Student Percentage Comparison Over Years (2021-2023), Diversity Score Comparison Over Years (2021-2023), Free Lunch Eligibility Comparison Over Years (2021-2023), Reading and Language Arts Proficiency Comparison Over Years (2021-2022), Math Proficiency Comparison Over Years (2021-2022), Science Proficiency Comparison Over Years (2021-2022), Overall School Rank Trends Over Years (2021-2022), Graduation Rate Comparison Over Years (2021-2022)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset tracks annual Hispanic student percentage from 2021 to 2023 for Research Laboratory High School-bioinformatic vs. New York and Buffalo City School District