Biology students’ understanding of statistics is incomplete due to poor integration of these two disciplines. In some cases, students fail to learn statistics at the undergraduate level due to poor student interest and cursory teaching of concepts, highlighting a need for new approaches to the teaching of statistics in the undergraduate biology curriculum. The most effective method of teaching statistics is to provide opportunities for students to apply concepts, not just learn facts. Opportunities to learn statistics also need to be prevalent throughout a student’s education to reinforce learning. We developed and implemented a curriculum that integrates a topic in biology with an emphasis on statistical analysis, with the aim of improving students’ quantitative thinking skills. Our lesson focuses on the change in the richness of native species for a specified area, using iNaturalist as the data source and Google Sheets for analysis. We emphasized the skills of data entry, storage, organization, curation and analysis. Students then had to report their findings, as well as discuss biases and other confounding factors. Pre- and post-lesson assessment revealed that students’ quantitative thinking skills, as measured by a paired-samples t test, improved. At the end of the lesson, students had an increased understanding of basic statistical concepts, such as bias in research and making data-based claims, within the framework of biology.
Primary Image: Website screenshot of an iNaturalist observation (Clasping Milkweed – Asclepias amplexicaulis). This image is an example of a data entry on iNaturalist. The data students export from iNaturalist is made up of hundreds, or even thousands, of observations like this one. This image is licensed under the Creative Commons Attribution - Share Alike 4.0 International license. Source: Observation by cassi saari, 2014.
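The paired-samples t test mentioned in the abstract can be sketched in a few lines of Python. This is only an illustration of the test statistic: the scores below are hypothetical and are not data from the lesson, and the function name `paired_t_statistic` is made up for this example.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(pre, post):
    """Return (t, degrees_of_freedom) for a paired-samples t test."""
    if len(pre) != len(post):
        raise ValueError("pre and post must have the same length")
    # Work on per-student differences (post minus pre).
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical assessment scores for five students (illustration only).
pre_scores = [4, 5, 6, 5, 7]
post_scores = [6, 6, 8, 7, 8]
t_stat, dof = paired_t_statistic(pre_scores, post_scores)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The statistic would then be compared against a t distribution with `dof` degrees of freedom to obtain a p-value, which is the step a spreadsheet or statistics package normally handles.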
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
See Tables S1, S2 and S3 for more details. a: restricted to proteins with prediction of localization; b: (%) of the localized proteins.
yosubshin/walton-hard-exclude-geometry-biology-statistics-1k-1 dataset hosted on Hugging Face and contributed by the HF Datasets community
yosubshin/oumi-walton-exclude-geometry-biology-statistics dataset hosted on Hugging Face and contributed by the HF Datasets community
Species trees provide insight into basic biology, including the mechanisms of evolution and how it modifies biomolecular function and structure, biodiversity and co-evolution between genes and species. Yet, gene trees often differ from species trees, creating challenges to species tree estimation. One of the most frequent causes for conflicting topologies between gene trees and species trees is incomplete lineage sorting (ILS), which is modelled by the multi-species coalescent. While many methods have been developed to estimate species trees from multiple genes, some of which have statistical guarantees under the multi-species coalescent model, existing methods are too computationally intensive for use with genome-scale analyses or have been shown to have poor accuracy under some realistic conditions.
Results: We present ASTRAL, a fast method for estimating species trees from multiple genes. ASTRAL is statistically consistent, can run on datasets with thousands of genes and has outstanding...
Availability and implementation: ASTRAL is available in open source form at https://github.com/smirarab/ASTRAL/. Datasets studied in this article are available at http://www.cs.utexas.edu/users/phylo/datasets/astral.
Contact: warnow@illinois.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
# ASTRAL: genome-scale coalescent-based species tree estimation
This repository includes both simulated and biological datasets.
The following datasets are used in the ASTRAL paper shown above. All these archive files include README files that describe their content.
This file includes: 1. our estimated gene trees on alignments provided to us by authors of Song et al, 2012, PNAS, 2. our estimated species trees on the same dataset.
Our paper includes re-analyses of two biological datasets.
We obtained gene alignments from Song et al. and re-estimated gene trees and species trees.
The following files are included in mammals.zip:
mammals-alignments.zip contains all the alignments that we obtained from Song et al.
mammals-genetreess.zip contains gene trees that we estimated. For each gene, we include 3 files
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supporting tables and figures. Table S1. The impact of different effect sizes on gene selection strategies when the sample size is fixed and relatively small. Mean (STD) of true positives computed from SIMU1 with 20 repetitions are reported. Sample size: . Total number of genes: 1000. Number of differentially expressed genes: 100. Number of permutations for Nstat: 10000. The significance threshold: 0.05. Table S2. The impact of different effect sizes on gene selection strategies when the sample size is fixed and relatively small. Mean (STD) of false positives computed from SIMU1 with 20 repetitions are reported. Sample size: . Total number of genes: 1000. Number of differentially expressed genes: 100. Number of permutations for Nstat: 10000. The significance threshold: 0.05. Table S3. The impact of different sample sizes on gene selection strategies when the effect size is fixed and relatively small. Mean (STD) of true positives computed from SIMU2 with 20 repetitions are reported. Effect size: . Total number of genes: 1000. Number of differentially expressed genes: 100. Number of permutations for Nstat: 10000. The significance threshold: 0.05. Table S4. The impact of different sample sizes on gene selection strategies when the effect size is fixed and relatively small. Mean (STD) of false positives computed from SIMU2 with 20 repetitions are reported. Effect size: . Total number of genes: 1000. Number of differentially expressed genes: 100. Number of permutations for Nstat: 10000. The significance threshold: 0.05. Table S5. The impact of different sample sizes on gene selection strategies when the effect size is fixed and relatively large. Mean (STD) of true positives computed from SIMU2 with 20 repetitions are reported. Effect size: . Total number of genes: 1000. Number of differentially expressed genes: 100. Number of permutations for Nstat: 10000. The significance threshold: 0.05. Table S6. 
The impact of different sample sizes on gene selection strategies when the effect size is fixed and relatively large. Mean (STD) of false positives computed from SIMU2 with 20 repetitions are reported. Effect size: . Total number of genes: 1000. Number of differentially expressed genes: 100. Number of permutations for Nstat: 10000. The significance threshold: 0.05. Table S7. The impact of different sample sizes on gene selection strategies with simulation based on biological data. Mean (STD) of true positives computed from SIMU-BIO with 20 repetitions are reported. Total number of genes: 9005. Number of permutations for Nstat: 100000. The significance threshold: 0.05. Table S8. The impact of different sample sizes on gene selection strategies with simulation based on biological data. Mean (STD) of false positives computed from SIMU-BIO with 20 repetitions are reported. Total number of genes: 9005. Number of permutations for Nstat: 100000. The significance threshold: 0.05. Table S9. The numbers of differentially expressed genes detected by different selection strategies. Total number of genes: 9005. Number of permutations for Nstat: 100000. The significance threshold: 0.05. Figure S1. Histogram of pairwise Pearson correlation coefficients between genes computed from HYPERDIP without normalization. Number of genes: 9005. Number of arrays: 88. (PDF)
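As a rough illustration of how the "Mean (STD) of true positives" and "false positives" entries in these tables could be tallied, the sketch below counts both for each simulated repetition and then summarizes them. The gene identifiers and selections are hypothetical, the function names are invented for this example, and the use of the sample standard deviation is an assumption, since the captions do not say which form of STD was reported.

```python
from statistics import mean, stdev


def count_tp_fp(selected, truly_de):
    """Count true and false positives for one repetition.

    selected: genes a selection strategy declared differentially expressed.
    truly_de: genes that were simulated as differentially expressed.
    """
    selected, truly_de = set(selected), set(truly_de)
    true_positives = len(selected & truly_de)
    false_positives = len(selected - truly_de)
    return true_positives, false_positives


def summarize(counts):
    """Return (mean, sample standard deviation) over repetitions."""
    return mean(counts), stdev(counts)


# Hypothetical setup: genes 0..99 are truly differentially expressed.
truly_de = range(100)
repetitions = [
    [0, 1, 2, 150],    # 3 true positives, 1 false positive
    [0, 1, 200, 201],  # 2 true positives, 2 false positives
    [0, 1, 2, 3],      # 4 true positives, 0 false positives
]
results = [count_tp_fp(sel, truly_de) for sel in repetitions]
tp_mean, tp_std = summarize([tp for tp, _ in results])
fp_mean, fp_std = summarize([fp for _, fp in results])
print(f"TP: {tp_mean} ({tp_std}), FP: {fp_mean} ({fp_std})")
```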
Meta analysis
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset is a comprehensive collection of over 3 million research paper titles and abstracts, curated and consolidated from multiple high-quality academic sources. The dataset provides a unified, clean, and standardized format for researchers, data scientists, and machine learning practitioners working on natural language processing, academic research analysis, and knowledge discovery tasks.
title and abstract columns

| Metric | Value |
|---|---|
| Total Records | ~3,000,000+ |
| Columns | 2 (title, abstract) |
| File Size | 4.15 GB |
| Format | CSV |
| Duplicates | Removed |
| Missing Values | Removed |
cleaned_papers.csv
├── title (string): Scientific paper title
└── abstract (string): Scientific paper abstract
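Given the two-column layout above, a file of this size is probably best read as a stream rather than loaded whole. The following is a minimal sketch using only the standard library; it assumes the CSV has a header row with exactly the column names `title` and `abstract`, and the helper name `iter_papers` is invented for this example.

```python
import csv
import io


def iter_papers(handle):
    """Yield (title, abstract) pairs from an open CSV file, one row at a time."""
    reader = csv.DictReader(handle)
    for row in reader:
        title = (row.get("title") or "").strip()
        abstract = (row.get("abstract") or "").strip()
        # Skip rows with an empty field, in case any slipped through cleaning.
        if title and abstract:
            yield title, abstract


# Small in-memory stand-in for the real file (illustration only).
sample = io.StringIO(
    "title,abstract\n"
    '"A study of X","We examine X, in detail."\n'
    '"Untitled",""\n'
)
papers = list(iter_papers(sample))
print(papers)
```

For the real file, the same generator could be fed from `open("cleaned_papers.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption, as the description does not state it.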
The dataset underwent a rigorous cleaning and standardization process:
title and abstract format
This dataset is ideal for:
This dataset consolidates academic papers from the following sources:
This dataset represents a point-in-time consolidation. Future versions may include:
- Additional academic sources
- Extended fields (authors, publication dates, venues)
- Domain-specific subsets
- Enhanced metadata
Please respect the individual licenses of the source datasets. This consolidated version is provided for research and educational purposes. When using this dataset:
🙏 Acknowledgments
Special thanks to all the original dataset creators and the academic communities that make their research data publicly available. This work builds upon their valuable contributions to open science and knowledge sharing.
Keywords: academic papers, research abstracts, NLP, machine learning, text mining, scientific literature, ArXiv, PubMed, natural language processing, research dataset
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A link in this analysis is defined by a difference between sequences in less than 10% of available sites.
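The link definition above can be made concrete with a short sketch. It assumes the two sequences are already aligned to the same length and that "available sites" means positions where neither sequence has a gap or missing-data character; both assumptions, and the function names, are illustrative rather than taken from the source.

```python
# Characters treated as "not available" (an assumption for this sketch).
UNAVAILABLE = set("-?")


def difference_fraction(seq_a, seq_b):
    """Fraction of available sites at which two aligned sequences differ.

    Returns None when no site is available in both sequences.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a in UNAVAILABLE or b in UNAVAILABLE:
            continue  # site is missing in at least one sequence
        compared += 1
        if a.upper() != b.upper():
            differing += 1
    if compared == 0:
        return None
    return differing / compared


def is_link(seq_a, seq_b, threshold=0.10):
    """True when the sequences differ in less than `threshold` of available sites."""
    fraction = difference_fraction(seq_a, seq_b)
    return fraction is not None and fraction < threshold


# 1 difference in 12 available sites (about 8.3%): counts as a link.
print(is_link("ACGTACGTACGT", "ACGTACGTACGA"))
# 1 difference in 9 available sites (about 11.1%): not a link.
print(is_link("ACGT-CGTAC", "ACGAACGTAC"))
```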
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ROC areas and the PR areas of different methods on SynTReN datasets with noise 0.1, 0.2, 0.3, respectively.
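For readers unfamiliar with the two metrics, here is a minimal sketch of how an ROC area and a PR area can be computed from predicted edge scores and true edge labels. The PR area is estimated as average precision, which is one common estimator and not necessarily the one used to produce this table; the scores and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """ROC area: probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def average_precision(scores, labels):
    """PR area estimated as the mean precision at the rank of each positive.

    Tied scores are ranked in whatever order the stable sort leaves them.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(y for _, y in ranked)
    if total_positives == 0:
        raise ValueError("need at least one positive")
    hits = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_positives


# Hypothetical edge scores and ground-truth labels (1 = true edge).
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
ap = average_precision(scores, labels)
print(f"ROC area = {auc:.4f}, PR area (average precision) = {ap:.4f}")
```

The pairwise loop in `roc_auc` is quadratic in the number of examples, which is fine for a small illustration but would be replaced by a rank-based computation on large networks.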
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CORE Database Statistics.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Rates given are per 1000 pairs. A link in this analysis is defined by a difference between sequences in less than 10% of available sites.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All: proteins in which all cysteines are bonded. None: proteins with no disulfide bridges. Mix: proteins with both bonded cysteines and non-bonded cysteines. Positive: number of bonded cysteines. Negative: number of non-bonded cysteines.
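The categories in this legend can be expressed as a small classification rule. The sketch below assumes each protein is represented as a list of booleans, one per cysteine, where `True` means the cysteine is bonded; this representation and the function names are hypothetical and chosen only for illustration.

```python
def classify_protein(bonded_flags):
    """Return "All", "None", or "Mix" for one protein's cysteines."""
    flags = list(bonded_flags)
    if not flags:
        # Proteins without cysteines are outside this sketch (an assumption).
        raise ValueError("protein has no cysteines to classify")
    if all(flags):
        return "All"
    if not any(flags):
        return "None"
    return "Mix"


def count_cysteines(proteins):
    """Return (positive, negative): bonded and non-bonded cysteine counts."""
    positive = sum(1 for flags in proteins for bonded in flags if bonded)
    total = sum(len(flags) for flags in proteins)
    return positive, total - positive


# Hypothetical proteins (illustration only).
proteins = [
    [True, True, True, True],  # All
    [False, False],            # None
    [True, True, False],       # Mix
]
categories = [classify_protein(p) for p in proteins]
positive, negative = count_cysteines(proteins)
print(categories, positive, negative)
```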
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
*Some studies used more than one classifier.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
a: Coefficients represent standard deviation (SD) change in latent factor per 5-year increase in age.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ROC areas and the PR areas of different methods on DREAM3 challenge Yeast dataset with size 10, 50, 100 and Syndata, respectively.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of the discriminatory power resulting from the gradient boosting approach when applying different values of the smoothing parameter . Numbers refer to the median value and interquartile range (in parentheses) of the final on 100 simulation runs. The amount of pre-selected genes is denoted as , is the size of the training samples and cens. refers to the censoring rate. We recommend using the value , which is also the default value of the new Cindex family for the R add-on package mboost.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A link in this analysis is defined by a difference between sequences in less than 10% of available sites.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
N/A indicates that the test is not applicable (cf. Analysis, section 6). Results for loci J3, J6, and U6 are not shown because the tests are also not applicable (as for locus L4).