License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Public health decision-making on policies aimed at controlling the COVID-19 pandemic depends on complex epidemiological models, which must be robust and use all relevant available data. This data article provides a new combined worldwide COVID-19 dataset, built from official data sources with corrections for systematic measurement errors, together with a dedicated dashboard for online data visualization and summary. The dataset adds new measures and attributes, such as daily mortality and fatality rates, to the standard attributes of the official sources. We used comparative statistical analysis to evaluate the measurement errors of COVID-19 official data collections from the Chinese Center for Disease Control and Prevention (Chinese CDC), the World Health Organization (WHO), and the European Centre for Disease Prevention and Control (ECDC). The data were collected using text-mining techniques and by reviewing PDF reports, metadata, and reference data. The combined dataset includes complete spatial data, such as country area, international numeric country codes, Alpha-2 and Alpha-3 codes, latitude, and longitude, plus additional attributes such as population. The improved dataset benefits from major corrections to the referenced datasets and official reports: adjustments to reporting dates, which suffered from a one- to two-day lag; removal of negative values; detection of unreasonable changes to historical data in new reports; and corrections to systematic measurement errors, which have grown as the outbreak spread and more countries contributed data to the official repositories. Additionally, the root mean square error of attributes in pairwise comparisons of the datasets was used to identify the main data problems. The data for China are presented separately and in more detail, extracted from the reports attached to the main page of the Chinese CDC website. This dataset is a comprehensive and reliable source of worldwide COVID-19 data that can be used in epidemiological models assessing the magnitude and timeline of confirmed cases, long-term predictions of deaths or hospital utilization, the effects of quarantine, stay-at-home orders, and other social distancing measures, and the pandemic's turning point, or in economic and social impact analysis, helping national and local authorities implement an adaptive response to re-opening the economy and schools, alleviating business and social distancing restrictions, designing economic programs, or allowing sports events to resume.
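As a rough illustration of the paired-comparison step described in this abstract, the sketch below computes the root mean square error (RMSE) of one shared attribute between two sources. The file names, column names, and schema are assumptions for illustration only, not part of the published dataset.

```python
# Sketch of the paired-comparison idea: RMSE of a shared attribute
# (e.g., daily confirmed cases) between two official sources.
import numpy as np
import pandas as pd

who = pd.read_csv("who_daily.csv")    # hypothetical file: country, date, cases
ecdc = pd.read_csv("ecdc_daily.csv")  # hypothetical file: same schema

# Align the two sources on country and date before comparing.
merged = who.merge(ecdc, on=["country", "date"], suffixes=("_who", "_ecdc"))
rmse = np.sqrt(((merged["cases_who"] - merged["cases_ecdc"]) ** 2).mean())
print(f"RMSE between WHO and ECDC daily case counts: {rmse:.2f}")
```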
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset that is useful for training in applied biostatistics, in the context of biomedical research.
License: GNU General Public License v3.0, https://www.gnu.org/licenses/gpl-3.0.html
These are the tab-delimited and CSV files from the book Methods in Biostatistics with R.
Biology students' understanding of statistics is incomplete due to poor integration of the two disciplines. In some cases, students fail to learn statistics at the undergraduate level due to low student interest and cursory teaching of concepts, highlighting a need for new and unique approaches to teaching statistics in the undergraduate biology curriculum. The most effective way to teach statistics is to provide opportunities for students to apply concepts, not just learn facts. Opportunities to learn statistics also need to be prevalent throughout a student's education to reinforce learning. The purpose of developing and implementing a curriculum that integrates a biology topic with an emphasis on statistical analysis was to improve students' quantitative thinking skills. Our lesson focuses on the change in the richness of native species for a specified area, with the aid of iNaturalist and the analysis capacity afforded by Google Sheets. We emphasized the skills of data entry, storage, organization, curation, and analysis. Students then had to report their findings and discuss biases and other confounding factors. Pre- and post-lesson assessment revealed that students' quantitative thinking skills improved, as measured by a paired-samples t test. At the end of the lesson, students had an increased understanding of basic statistical concepts, such as bias in research and making data-based claims, within the framework of biology.
Primary Image: Website screenshot of an iNaturalist observation (Clasping Milkweed, Asclepias amplexicaulis). This image is an example of a data entry on iNaturalist. The data students export from iNaturalist is made up of hundreds, or even thousands, of observations like this one. This image is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license. Source: Observation by cassi saari, 2014.
This dataset was created by Johar M. Ashfaque.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Three example datasets for performing biostatistics analyses.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: In medical practice, clinically unexpected measurements might be quite properly handled by remeasurement, removal, or reclassification of patients. If these habits are not prevented during clinical research, how much of each is needed to sway an entire study?
Methods and Results: Believing there is a difference between groups, a well-intentioned clinician researcher addresses unexpected values. We tested how much removal, remeasurement, or reclassification of patients would be needed in most cases to turn an otherwise-neutral study positive. Remeasurement of 19 patients out of 200 per group was required to make most studies positive. Removal was more powerful: just 9 out of 200 was enough. Reclassification was the most powerful, with 5 out of 200 enough. The larger the study, the smaller the proportion of patients that needed to be manipulated to make the study positive: the percentages needing to be remeasured, removed, or reclassified fell from 45%, 20%, and 10%, respectively, for a 20-patient-per-group study to 4%, 2%, and 1% for an 800-patient-per-group study. Dot plots, but not bar charts, make the perhaps-inadvertent manipulations visible. Detection is possible using statistical methods such as the Tadpole test.
Conclusions: Behaviours necessary for clinical practice are destructive to clinical research. Even small amounts of selective remeasurement, removal, or reclassification can produce false positive results. Size matters: larger studies are proportionately more vulnerable. If observational studies permit selective unblinded enrolment, malleable classification, or selective remeasurement, then results are not credible. Clinical research is very vulnerable to "remeasurement, removal, and reclassification", the 3 evil R's.
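To make the central claim concrete, here is a hedged simulation of the "removal" behaviour: an otherwise-neutral two-arm comparison from which the least favourable observations in one arm are selectively dropped. This is not the authors' code, their exact removal procedure may differ, and whether a given draw crosses p < 0.05 depends on the random seed.

```python
# Crude illustration of selective removal in a neutral two-group study.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.normal(0, 1, 200)  # control arm, no true difference
b = rng.normal(0, 1, 200)  # "treatment" arm, no true difference

print("p before any manipulation:", stats.ttest_ind(a, b).pvalue)

# Drop the 9 largest values from one arm -- the kind of selective
# removal the study quantifies. The paper reports 9 of 200 sufficed in
# most simulated studies; a single draw may or may not cross 0.05.
b_removed = np.sort(b)[:-9]
print("p after removing 9 of 200:", stats.ttest_ind(a, b_removed).pvalue)
```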
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset covers master's programs in Data Science, Data Analytics, and related fields in the United States. It is useful for prospective students comparing programs and for analyses of tuition, duration, and location across universities.
About the .csv file
- Subject Name: The name or field of study of the master's program, such as Data Science, Data Analytics, or Applied Biostatistics.
- University Name: The name of the university offering the master's program.
- Per Year Fees: The program's tuition fees, usually given in euros per year. For some programs, the fees may be listed as "full" or "full-time," indicating a lump sum for the entire program or full-time enrollment, respectively.
- About Program: A brief description or overview of the master's program, providing insights into its curriculum, focus areas, and any unique features.
- Program Duration: The duration of the master's program, typically expressed in years or months.
- University Location: The location of the university where the program is offered, including the city and state.
- Program Name: The official name of the master's program, often indicating its degree type (e.g., M.Sc. for Master of Science) and format (e.g., full-time, part-time, online).
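A minimal sketch of loading and filtering a file with the columns described above; the file name masters_programs.csv and the exact header strings are assumptions.

```python
# Load the program listing and filter to one subject.
import pandas as pd

df = pd.read_csv("masters_programs.csv")  # hypothetical file name

# Example: list Data Science programs with university and duration.
ds = df[df["Subject Name"].str.contains("Data Science", case=False, na=False)]
print(ds[["University Name", "Program Name", "Program Duration"]]
      .sort_values("Program Duration"))
```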
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains supplementary materials related to the study "Survey on critical results management in Brazilian clinical laboratories: Profiling practices through multivariate analysis and a 'New Statistics' approach". The dataset, figures, exported results, and analysis scripts are included to ensure full transparency and reproducibility of the research findings.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains raw Excel files and R scripts used for statistical and multivariate analyses of studied attributes of tomato plants subjected to cadmium (Cd) stress and treated with melatonin-loaded silica nanoparticles. It includes data for bar, line, and box plots; Duncan's multiple range test (DMRT) results; correlation analyses; principal component analysis (PCA); and heatmap visualizations. The dataset provides a comprehensive resource for reproducing these analyses.
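The repository ships R scripts; as a rough Python analogue of just the PCA step, assuming a plain table of numeric plant attributes, one might write:

```python
# Standardize the numeric attributes and project onto two components.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("tomato_attributes.csv")  # hypothetical file name
X = StandardScaler().fit_transform(df.select_dtypes("number"))

pca = PCA(n_components=2)
scores = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)
```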
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Electroencephalography (EEG) is used to monitor a child's brain during coma by recording the electrical neural activity of the brain. Signals are captured by multiple electrodes, called channels, located over the scalp. Statistical analyses of EEG data include classification and prediction using arrays of EEG features, but few models for the underlying stochastic processes have been proposed. For this purpose, a new strictly stationary, strong-mixing diffusion model with a marginal multimodal (three-peak) distribution (MixGGDiff) and an exponentially decaying autocorrelation function was proposed for modeling increments of EEG data. The increments were treated as discrete-time observations, and a diffusion process whose stationary distribution is a mixture of three non-central generalized Gaussian distributions (MixGGD) was constructed.

The probability density function of a MixGGD consists of three components and is described using a total of 12 parameters:
- μ_k, the location parameter of each component;
- s_k, the shape parameter of each component;
- σ²_k, the parameter related to the scale of each component;
- w_k, the weight of each component;
where k ∈ {1, 2, 3} indexes the components of the MixGGD. The parameters of this distribution were estimated using the expectation-maximization algorithm, where the shape parameter is estimated using a higher-order-statistics approach based on an analytical relationship between the shape parameter and kurtosis.

To illustrate an application of the MixGGDiff to real data, EEG data collected in Uganda between 2008 and 2015 from 78 children aged 18 months to 12 years who were in coma due to cerebral malaria were analyzed. EEG was recorded using the International 10-20 system with a sampling rate of 500 Hz and an average record duration of 30 minutes. The EEG signal for every child was recorded from 19 channels. A MixGGD was fitted to each channel of every child's recording separately, so for each channel a total of 12 parameter estimates were obtained.

The data are presented as a matrix (dimension 79 x 228) in .csv format: the first row is a header containing the variable names, and the subsequent 78 rows each hold the parameter estimates for one instance (i.e., one child, without identifiers that could be related back to a specific child). There are 228 columns (19 channels x 12 parameter estimates), where each column represents one parameter estimate of one component of the MixGGD, in channel order: columns 1 to 12 refer to parameter estimates on the first channel, columns 13 to 24 to the second channel, and so on. Each variable name starts with "chi", where "ch" abbreviates "channel" and i is the channel number in the EEG recording. The remaining characters in the variable names refer to the parameter estimates of the MixGGD components; for example, "ch3sigmasq1" refers to the estimate of σ² for the first component of the MixGGD obtained from EEG increments on the third channel. The parameter estimates in the .csv file are all real numbers between -671.11 and 259326.96.

Research results based upon these data are published at https://doi.org/10.1007/s00477-023-02524-y
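A minimal sketch of reading the matrix described above and pulling one parameter estimate per channel; only the column-naming scheme ("ch3sigmasq1") comes from the description, while the file name is an assumption.

```python
# Summarize the sigma^2 estimates of the first mixture component
# across all 19 channels.
import pandas as pd

df = pd.read_csv("eeg_mixggd_estimates.csv")  # hypothetical file name

sigmasq1_cols = [f"ch{i}sigmasq1" for i in range(1, 20)]
print(df[sigmasq1_cols].describe())
```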
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is the result of a study on the quality of official datasets available for COVID-19. The study uses comparative statistical analysis to evaluate the measurement errors and improve the accuracy of COVID-19 official data collections, namely the Chinese Center for Disease Control and Prevention (Chinese CDC) reports, the PDF report files of the World Health Organization (WHO), and the European Centre for Disease Prevention and Control (ECDC). The result is an improved dataset for COVID-19 studies. More information can be found in the manuscript "A study on the quality of Novel Coronavirus (COVID-19) official datasets", published in the Statistical Journal of the IAOS (the journal of the International Association for Official Statistics).
Privacy policy: https://dataintelo.com/privacy-and-policy
According to our latest research, the global Biostatistics and Programming Services market size reached USD 2.94 billion in 2024, reflecting robust demand from the pharmaceutical and biotechnology sectors. The market is poised for significant expansion, with a projected CAGR of 8.2% during the forecast period. By 2033, the market is forecasted to reach USD 5.85 billion, driven by increasing clinical trial complexity, stringent regulatory requirements, and the rising adoption of advanced analytics in drug development. The ongoing digital transformation in life sciences and the growing outsourcing trend among pharmaceutical and biotechnology companies are key factors propelling this growth.
One of the primary growth drivers for the Biostatistics and Programming Services market is the escalating demand for efficient data management and statistical analysis in clinical trials. As clinical trials become more complex, encompassing larger and more diverse patient populations, the volume and complexity of data generated have surged. This has necessitated the adoption of advanced biostatistical methods and programming services to ensure data integrity, regulatory compliance, and accelerated timelines. The industry’s focus on reducing time-to-market for new therapeutics has further intensified the need for specialized biostatistics and programming expertise, as these services are pivotal in streamlining data collection, cleaning, and analysis processes. As a result, pharmaceutical and biotechnology companies are increasingly outsourcing these functions to specialized service providers, leading to market expansion.
Technological advancements are also significantly shaping the growth trajectory of the Biostatistics and Programming Services market. The integration of artificial intelligence, machine learning, and big data analytics into clinical data management and statistical programming is enhancing the quality and efficiency of data analysis. These technologies enable the handling of large-scale, multi-source data sets, facilitate predictive modeling, and improve the accuracy of statistical outputs. Moreover, the adoption of cloud-based platforms for clinical data management is enabling real-time access, collaboration, and scalability, which are critical for multi-center and global clinical trials. The ongoing evolution of regulatory standards, such as the implementation of CDISC (Clinical Data Interchange Standards Consortium) standards, is further driving the demand for sophisticated programming and biostatistical services capable of meeting stringent data submission requirements.
Another notable growth factor is the increasing trend of outsourcing biostatistics and programming services to contract research organizations (CROs) and specialized service providers. Pharmaceutical and biotechnology companies are under constant pressure to optimize costs, accelerate timelines, and access specialized expertise. Outsourcing enables these companies to focus on their core competencies while leveraging the technical capabilities and regulatory expertise of external partners. This trend is particularly pronounced among small and mid-sized enterprises, which may lack in-house resources for comprehensive biostatistical and programming support. Additionally, the globalization of clinical trials and the expansion of research activities into emerging markets are fueling the need for region-specific expertise and scalable service delivery models.
From a regional perspective, North America continues to dominate the Biostatistics and Programming Services market, accounting for the largest share in 2024. This is attributed to the presence of a robust pharmaceutical and biotechnology industry, advanced healthcare infrastructure, and a high concentration of clinical trials. The region’s regulatory landscape, characterized by stringent data integrity and compliance requirements, further drives the adoption of specialized biostatistics and programming services. Europe follows closely, supported by strong government initiatives and increasing R&D investments. Meanwhile, the Asia Pacific region is emerging as a high-growth market, propelled by the rising number of clinical trials, expanding pharmaceutical manufacturing capabilities, and favorable regulatory reforms. These regional dynamics are expected to shape the competitive landscape and growth opportunities in the coming years.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the R code for the data generation and analysis for the paper:
Zhou, Z., Li, D., Huh, D., Xie, M., & Mun, E. Y. (2024). A Simulation Study of the Performance of Statistical Models for Count Outcomes with Excessive Zeros. Statistics in Medicine. https://doi.org/10.1002/sim.10198
Abstract
Background: Outcome measures that are count variables with excessive zeros are common in health behaviors research. Examples include the number of standard drinks consumed or alcohol-related problems experienced over time. There is a lack of empirical data about the relative performance of prevailing statistical models for assessing the efficacy of interventions when outcomes are zero-inflated, particularly compared with recently developed marginalized count regression approaches for such data. Methods: The current simulation study examined five commonly used approaches for analyzing count outcomes, including two linear models (with outcomes on raw and log-transformed scales, respectively) and three prevailing count-distribution-based models (i.e., Poisson, negative binomial, and zero-inflated Poisson (ZIP) models). We also considered the marginalized zero-inflated Poisson (MZIP) model, a novel alternative that estimates the overall effects on the population mean while adjusting for zero-inflation. Motivated by alcohol misuse prevention trials, extensive simulations were conducted to evaluate and compare the statistical power and Type I error rate of candidate statistical models and approaches across data conditions that varied in sample size (N = 100 to 500), zero rate (0.2 to 0.8), and intervention effect size. Results: Under zero-inflation, the Poisson model failed to control the Type I error rate, resulting in more false positive results than expected. When the intervention effects on the zero (vs. non-zero) and count parts were in the same direction, the MZIP model had the highest statistical power, followed by the linear model with outcomes on the raw scale, the negative binomial model, and the ZIP model. The performance of the linear model with a log-transformed outcome variable was unsatisfactory. When only one of the effects on the zero (vs. non-zero) part and the count part existed, the ZIP model had the highest statistical power. Conclusions: The MZIP model demonstrated better statistical properties for detecting true intervention effects and controlling false positive results for zero-inflated count outcomes. The MZIP model may serve as an appealing analytical approach for evaluating overall intervention effects in studies with count outcomes marked by excessive zeros.
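A hedged re-creation of one corner of this simulation design (not the authors' code): generate zero-inflated counts for two arms and compare a plain Poisson fit with a ZIP fit using statsmodels.

```python
# Simulate zero-inflated Poisson outcomes with a group effect, then fit
# a plain Poisson model and a ZIP model to the same data.
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

rng = np.random.default_rng(1)
n = 300
group = rng.integers(0, 2, n)        # 0 = control, 1 = intervention
lam = np.exp(0.8 - 0.3 * group)      # count part: intervention lowers the mean
zero = rng.random(n) < 0.5           # 50% structural zeros (illustrative rate)
y = np.where(zero, 0, rng.poisson(lam))

X = sm.add_constant(group.astype(float))
print(sm.Poisson(y, X).fit(disp=0).summary().tables[1])
print(ZeroInflatedPoisson(y, X, exog_infl=X).fit(disp=0).summary().tables[1])
```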
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This synthetic dataset simulates a Phase III randomized controlled clinical trial evaluating CardioX (Drug A) versus an active comparator (Drug B) and a placebo for treating hypertension. It is designed for clinical data analysis, anomaly detection, and risk-based monitoring (RBM) applications.
The dataset includes 1,000 patients across 50 trial sites, with realistic patient demographics, blood pressure readings, cholesterol levels, dropout rates, and adverse event reporting. Several anomalies have been embedded to simulate real-world data quality issues commonly encountered in clinical trials.
This dataset is ideal for data quality assessments, statistical anomaly detection (Z-scores, IQR, clustering), and risk-based monitoring (RBM) in clinical research.
🔹 Clinical Trial Data Analysis – Investigate treatment efficacy and safety trends.
🔹 Anomaly Detection – Apply Z-scores, IQR, and ML-based clustering methods to identify outliers (see the sketch after the column table below).
🔹 Risk-Based Monitoring (RBM) – Detect potential site-level risks and data inconsistencies.
🔹 Machine Learning Applications – Train models for adverse event prediction or dropout risk estimation.
| Column Name | Description |
|---|---|
| Patient_ID | Unique identifier for each trial participant. |
| Site_ID | Site where the patient was enrolled (1-50). |
| Age | Patient age (in years). |
| Gender | Male or Female. |
| Enrollment_Date | Date when the patient was enrolled in the study. |
| Treatment_Group | Assigned treatment: Placebo, Drug A (CardioX), or Drug B (Active Comparator). |
| Adverse_Events | Number of adverse events (AEs) reported by the patient. |
| Dropout | Whether the patient dropped out of the study (1 = Yes, 0 = No). |
| Systolic_BP | Systolic Blood Pressure (mmHg). |
| Diastolic_BP | Diastolic Blood Pressure (mmHg). |
| Cholesterol_Level | Total cholesterol level (mg/dL). |
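The sketch below applies the Z-score and IQR checks mentioned in the use cases to the Systolic_BP column; the file name is an assumption.

```python
# Flag Systolic_BP outliers two ways: |Z| > 3 and the 1.5*IQR rule.
import pandas as pd

df = pd.read_csv("cardiox_trial.csv")  # hypothetical file name
bp = df["Systolic_BP"]

z = (bp - bp.mean()) / bp.std()
z_outliers = df[z.abs() > 3]

q1, q3 = bp.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = df[(bp < q1 - 1.5 * iqr) | (bp > q3 + 1.5 * iqr)]

print(len(z_outliers), "Z-score outliers;", len(iqr_outliers), "IQR outliers")
```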
This dataset is fully synthetic and does not contain real patient data. It is created for educational, analytical, and research purposes in clinical data science and biostatistics.
🔗 If you use this dataset, tag me! Let’s discuss insights & findings! 🚀
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Simulated datasets used in the supplement, with a small group size of 10, varying log-scale noise levels, and extra zero proportions. The settings are reflected in the title of each dataset:
(1) small: group size = 10;
(2) 0.5, 1, 1.5: the log-scale noise level;
(3) 0.3, 0.4, 0.5: the extra zero proportion;
(4) the last number is the index of the replicate. There are 50 replicates for each setting.
This dataset contains the quiz data and survey data for "Ethics in Clinical Research, E-Module Versus Traditional Online Lecture, a Randomized Study".
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset accompanies the Methodological Appraisal and Credibility Assessment checklist (MACA), developed to evaluate the methodological quality of studies applying computational and statistical methods for composite indicator construction. The repository includes three reproducible components: MACA_ICC-KAPPA: Scripts and results for inter-rater reliability analysis (Cohen’s Kappa, Gwet’s AC1, PABAK, and ICC) of the MACA checklist. MACA_FinalScore: Script and output for calculating the averaged and total scores of the final 17-item fused version of MACA, based on two independent evaluators. MACA_heatmap: Python script and visualization for the heatmap summarizing methodological quality across studies. All scripts are written in Python and include example input and output files for transparency and reproducibility. The dataset supports the umbrella review on methodological quality assessment in computational and statistical methods applied to public health and composite indicators.
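As a minimal illustration of the inter-rater agreement component, the snippet below computes Cohen's Kappa on hypothetical item-level ratings from two evaluators; the repository's own scripts additionally compute Gwet's AC1, PABAK, and ICC.

```python
# Cohen's Kappa between two raters' binary item-level judgments.
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # hypothetical ratings
rater_b = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]
print("Cohen's kappa:", cohen_kappa_score(rater_a, rater_b))
```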
Multidimensional scaling (MDS) is a dimensionality reduction technique for microbial ecology data analysis that represents the multivariate structure while preserving pairwise distances between samples. While its improvements have enhanced the ability to reveal data patterns by sample groups, these MDS-based methods require prior assumptions for inference, limiting their application in general microbiome analysis. In this study, we introduce a new MDS-based ordination, "F-informed MDS," which configures the data distribution based on the F-statistic, the ratio of dispersion between groups sharing common and different characteristics. Using simulated compositional datasets, we demonstrate that the proposed method is robust to hyperparameter selection while maintaining statistical significance throughout the ordination process. Various quality metrics for evaluating dimensionality reduction confirm that F-informed MDS is comparable to state-of-the-art methods in preserving both local and ...
Multidimensional scaling informed by F-statistic: Visualizing grouped microbiome data with inference
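For context, here is plain metric MDS on a toy compositional dataset with a Bray-Curtis distance; the paper's F-informed variant, which reshapes the ordination using the F-statistic, is not reproduced here.

```python
# Ordinary metric MDS on a precomputed ecological distance matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
abund = rng.dirichlet(np.ones(30), size=20)           # toy compositional data
dist = squareform(pdist(abund, metric="braycurtis"))  # common ecology distance

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dist)
print(coords[:3])
```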
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset comes as a supplementary resource for my book on Biostatistics and SPSS. Readers are free to download this file and practice using SPSS as they go along reading the book.