54 datasets found

Summary statistics (mean, standard deviation, median, interquartile range,...
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Apr 10, 2014
Cite
Pavanello, Sofia; Simeone, Claudio; Porru, Stefano; Mastrangelo, Giuseppe; Carta, Angela; Arici, Cecilia; Izzotti, Alberto (2014). Summary statistics (mean, standard deviation, median, interquartile range, number of subjects) for “ln_adducts” in cases, controls, and total population. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001172049
Dataset updated
Apr 10, 2014
Authors
Pavanello, Sofia; Simeone, Claudio; Porru, Stefano; Mastrangelo, Giuseppe; Carta, Angela; Arici, Cecilia; Izzotti, Alberto
Description
Summary statistics (mean, standard deviation, median, interquartile range, number of subjects) for “ln_adducts” in cases, controls, and total population.
Time and summary statistics in days (median and interquartile 25 to 75%...
plos.figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated Jun 3, 2023
Cite
Amy Gadoud; Eleanor Kane; Una Macleod; Pat Ansell; Steven Oliver; Miriam Johnson (2023). Time and summary statistics in days (median and interquartile 25 to 75% range) from first time coded as on a palliative care register to date of death for each disease group. [Dataset]. http://doi.org/10.1371/journal.pone.0113188.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0113188.t003
Dataset updated
Jun 3, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Amy Gadoud; Eleanor Kane; Una Macleod; Pat Ansell; Steven Oliver; Miriam Johnson
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Percentages may not total 100 due to rounding. Time and summary statistics in days (median and interquartile 25 to 75% range) from first time coded as on a palliative care register to date of death for each disease group.
Descriptive statistics, mean ± SD, range, median and interquartile range...
plos.figshare.com
xls
Updated May 31, 2023
Cite
Hélène Follet; Delphine Farlay; Yohann Bala; Stéphanie Viguet-Carrin; Evelyne Gineyts; Brigitte Burt-Pichat; Julien Wegrzyn; Pierre Delmas; Georges Boivin; Roland Chapurlat (2023). Descriptive statistics, mean ± SD, range, median and interquartile range (IQR). [Dataset]. http://doi.org/10.1371/journal.pone.0055232.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0055232.t001
Dataset updated
May 31, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Hélène Follet; Delphine Farlay; Yohann Bala; Stéphanie Viguet-Carrin; Evelyne Gineyts; Brigitte Burt-Pichat; Julien Wegrzyn; Pierre Delmas; Georges Boivin; Roland Chapurlat
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Descriptive statistics, mean ± SD, range, median and interquartile range (IQR).
Simulation Data Set
catalog.data.gov
s.cnmilf.com
Updated Nov 12, 2020
Cite
U.S. EPA Office of Research and Development (ORD) (2020). Simulation Data Set [Dataset]. https://catalog.data.gov/dataset/simulation-data-set
Dataset updated
Nov 12, 2020
Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
Description
These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures for each week by subtracting the median exposure amount for that week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects against identification of the spatial locations used in the analysis.

This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed.

It can be accessed through the following means: File format: R workspace file; “Simulated_Dataset.RData”.

Metadata (including data dictionary)
• y: Vector of binary responses (1: adverse outcome, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

Code
Abstract: We provide R statistical software code (“CWVS_LMC.txt”) to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code (“Results_Summary.txt”) to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.

Description
“CWVS_LMC.txt”: This code is delivered to the user in the form of a .txt file that contains R statistical software code. Once the “Simulated_Dataset.RData” workspace has been loaded into R, the text in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities.
“Results_Summary.txt”: This code is also delivered to the user in the form of a .txt file that contains R statistical software code. Once the “CWVS_LMC.txt” code is applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

Optional Information (complete as necessary)
Required R packages:
• For running “CWVS_LMC.txt”:
  • msm: Sampling from the truncated normal distribution
  • mnormt: Sampling from the multivariate normal distribution
  • BayesLogit: Sampling from the Polya-Gamma distribution
• For running “Results_Summary.txt”:
  • plotrix: Plotting the posterior means and credible intervals

Instructions for Use
Reproducibility (Mandatory)
What can be reproduced: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 from the presented simulation study.
How to use the information:
• Load the “Simulated_Dataset.RData” workspace
• Run the code contained in “CWVS_LMC.txt”
• Once the “CWVS_LMC.txt” code is complete, run “Results_Summary.txt”.

Format: Below is the replication procedure for the attached data set for the portion of the analyses using a simulated data set.

Data
The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

Availability
Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics, and requires an appropriate data use agreement.

This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).
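
The weekly median/IQR standardization described in this entry can be sketched as follows. This is a minimal illustration in Python rather than the authors' R code; the function names and the tiny input table are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since the description does not say how the IQR was computed.

```python
from statistics import median, quantiles

def standardize_week(values):
    """Subtract the week's median and divide by the week's IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; cannot standardize this week")
    med = median(values)
    return [(v - med) / iqr for v in values]

def standardize_exposures(rows):
    """Standardize an (individuals x weeks) table column by column."""
    weeks = list(zip(*rows))                        # one tuple per week
    scaled = [standardize_week(w) for w in weeks]
    return [list(r) for r in zip(*scaled)]          # back to one row per individual

# Hypothetical raw exposures: 5 individuals, 2 weeks (not from the dataset).
raw = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 40.0],
    [5.0, 50.0],
]
z = standardize_exposures(raw)
```

After this transformation every week has median 0 and IQR 1, which is why the original medians and IQRs cannot be recovered from the released values alone.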
Meta data and supporting documentation
catalog.data.gov
s.cnmilf.com
Updated Nov 12, 2020
Cite
U.S. EPA Office of Research and Development (ORD) (2020). Meta data and supporting documentation [Dataset]. https://catalog.data.gov/dataset/meta-data-and-supporting-documentation
Dataset updated
Nov 12, 2020
Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
Description
We include a description of the data sets in the meta-data as well as sample code and results from a simulated data set.

This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed.

It can be accessed through the following means: The R code is available online here: https://github.com/warrenjl/SpGPCW.

Format:

Abstract
The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

Availability
Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

Description
Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures for each week by subtracting the median exposure amount for that week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects against identification of the spatial locations used in the analysis.

File format: R workspace file.

Metadata (including data dictionary)
• y: Vector of binary responses (1: preterm birth, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).
Walmart Stocks Data 2025
kaggle.com
zip
Updated Feb 23, 2025
Cite
Mehar Shan Ali (2025). Walmart Stocks Data 2025 [Dataset]. https://www.kaggle.com/meharshanali/walmart-stocks-data-2025
Available download formats: zip (467062 bytes)
Dataset updated
Feb 23, 2025
Authors
Mehar Shan Ali
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
📊 Walmart Stock Price Dataset & Exploratory Data Analysis (EDA)

🏢 About Walmart

Walmart Inc. is a multinational retail corporation that operates a chain of hypermarkets, discount department stores, and grocery stores. It is one of the world's largest companies by revenue and a key player in the retail sector. Walmart's stock is actively traded on major stock exchanges, making it an interesting subject for financial analysis.

📌 Dataset Overview

This dataset contains historical stock price data for Walmart, sourced directly from Yahoo Finance using the yfinance Python API. The data covers daily stock prices and includes multiple key financial indicators.

📊 Features Included in the Dataset

Date 📅 – The trading day recorded.

Open Price 🟢 – Price at market open.

High Price 🔼 – Highest price of the day.

Low Price 🔽 – Lowest price of the day.

Close Price 🔴 – Price at market close.

Adjusted Close Price 📉 – Closing price adjusted for splits & dividends.

Trading Volume 📈 – Total shares traded.

Dividends 💰 – Cash payments to shareholders.

Stock Splits 🔄 – Records stock split events.

🔍 Exploratory Data Analysis (EDA) Steps

This notebook performs an extensive EDA to uncover insights into Walmart's stock price trends, volatility, and overall behavior in the stock market. The following analysis steps are included:

1️⃣ Data Preprocessing & Cleaning

Load data using Pandas

Handle missing values (if any)

Check data types and format them properly

Convert date column into a datetime format

2️⃣ Descriptive Statistics & Summary

Calculate key statistical measures like mean, median, standard deviation, and interquartile range (IQR)

Identify stock price trends over time

Check data distribution and skewness
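
The summary measures listed in this step could be computed roughly as below. This is only a sketch using Python's standard library instead of Pandas; the `describe` helper and the price list are hypothetical and do not come from the dataset. The quartiles use linear interpolation (`method="inclusive"`) and the standard deviation is the sample version, both chosen here as assumptions.

```python
from statistics import mean, median, stdev, quantiles

def describe(prices):
    """Return mean, median, sample standard deviation, and IQR."""
    q1, _, q3 = quantiles(prices, n=4, method="inclusive")
    return {
        "mean": mean(prices),
        "median": median(prices),
        "std": stdev(prices),      # sample standard deviation (n - 1)
        "iqr": q3 - q1,            # interquartile range
    }

# Hypothetical closing prices, not taken from the dataset.
closes = [90.0, 92.0, 94.0, 96.0, 98.0]
summary = describe(closes)
```

A mean well above or below the median is a quick hint of skewness in the price distribution.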

3️⃣ Data Visualizations

📉 Line Plot – Analyze trends in closing prices over time.

📦 Box Plot – Detect potential outliers in stock prices.

📊 Histogram – Understand the distribution of closing prices.

📈 Moving Averages – Use short-term and long-term moving averages to observe stock trends.

🔥 Correlation Heatmap – Find relationships between stock market indicators.

4️⃣ Time Series Analysis

Identify trends and seasonality in the stock price data.

Calculate daily, weekly, and monthly returns.

Use rolling windows to analyze moving averages and volatility.
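
As an illustration of the returns and rolling-window steps above, here is a small standard-library sketch. The helper names, the window length of 3 and the prices are all hypothetical; the notebook itself may well use Pandas for this instead.

```python
from statistics import mean, stdev

def daily_returns(prices):
    """Simple returns: (p_t - p_{t-1}) / p_{t-1}."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]

def rolling(values, window, func):
    """Apply func to each full trailing window of the given length."""
    return [func(values[i - window + 1 : i + 1])
            for i in range(window - 1, len(values))]

# Hypothetical closing prices, not taken from the dataset.
closes = [100.0, 110.0, 99.0, 108.9, 108.9]

returns = daily_returns(closes)          # one fewer value than closes
sma_3 = rolling(closes, 3, mean)         # 3-day moving average of prices
vol_3 = rolling(returns, 3, stdev)       # 3-day rolling volatility of returns
```

Weekly or monthly returns follow the same formula applied to prices sampled at the end of each week or month.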

5️⃣ Insights & Conclusions

How volatile is Walmart’s stock over the given period?

Does the stock exhibit strong uptrends or downtrends?

Are there any strong correlations between features?

What insights can be drawn for investors and traders?

🚀 Use Cases & Applications

This dataset and analysis can be useful for:

📡 Stock Market Analysis – Evaluating Walmart’s stock price trends and volatility.

🏦 Investment Research – Assisting traders and investors in making informed decisions.

🎓 Educational Purposes – Teaching data science and financial analysis using real-world stock data.

📊 Algorithmic Trading – Developing trading strategies based on historical stock price trends.

📥 Download the dataset and explore Walmart’s stock performance today! 🚀
Descriptive statistics for time (mm:ss ± SD) to cessation of movement (COM)...
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Mar 25, 2025
Cite
Kieffer, Justin D.; Miller, Jesse; Park, Janice Y.; Canturri, Albert; Torrisi, Dawn; Arruda, Andréia G.; Williams, Todd E.; Cheng, Ting-Yu; Bowman, Andrew S.; Culhane, Marie R.; Hougentogler, Daniel P.; Youngblood, Brad L.; Cressman, Michael D.; Campler, Magnus R.; Flory, Gary A.; Hunt, Lucia; Hill, Jeff (2025). Descriptive statistics for time (mm:ss ± SD) to cessation of movement (COM) and external activity (EA, milli-gravity (mg) [g = acceleration of gravity or 9.8 m s⁻²]) for quartiles Q1 and Q3 as well as the interquartile range (IQR [Q3-Q1]) for finisher pigs (N = 79) depopulated using water-based foam (WBF), nitrogen-foam (N2F), and carbon dioxide (CO2). [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002055241
Dataset updated
Mar 25, 2025
Authors
Kieffer, Justin D.; Miller, Jesse; Park, Janice Y.; Canturri, Albert; Torrisi, Dawn; Arruda, Andréia G.; Williams, Todd E.; Cheng, Ting-Yu; Bowman, Andrew S.; Culhane, Marie R.; Hougentogler, Daniel P.; Youngblood, Brad L.; Cressman, Michael D.; Campler, Magnus R.; Flory, Gary A.; Hunt, Lucia; Hill, Jeff
Description
Descriptive statistics for time (mm:ss ± SD) to cessation of movement (COM) and external activity (EA, milli-gravity (mg) [g = acceleration of gravity or 9.8 m s⁻²]) for quartiles Q1 and Q3 as well as the interquartile range (IQR [Q3-Q1]) for finisher pigs (N = 79) depopulated using water-based foam (WBF), nitrogen-foam (N2F), and carbon dioxide (CO2).
Summary Statistics for Temperature and other Meteorological Variables with...
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Apr 25, 2014
Cite
Lee, Eunil; Lee, Suji; Kwon, Bo Yeon; Kim, Hana; Rha, Seung-Woon; Jung, Dea Ho; Jeong, Myung Ho; Jo, Kyung Hee; Park, Man Sik (2014). Summary Statistics for Temperature and other Meteorological Variables with the Level of Air pollutants in Study Areas. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001206001
Dataset updated
Apr 25, 2014
Authors
Lee, Eunil; Lee, Suji; Kwon, Bo Yeon; Kim, Hana; Rha, Seung-Woon; Jung, Dea Ho; Jeong, Myung Ho; Jo, Kyung Hee; Park, Man Sik
Description
SD: Standard deviation. IQR: Interquartile range.
Numpy , pandas and matplot lib practice
kaggle.com
zip
Updated Jul 16, 2023
Cite
pratham saraf (2023). Numpy , pandas and matplot lib practice [Dataset]. https://www.kaggle.com/datasets/prathamsaraf1389/numpy-pandas-and-matplot-lib-practise/suggestions
Available download formats: zip (385020 bytes)
Dataset updated
Jul 16, 2023
Authors
pratham saraf
License
https://cdla.io/permissive-1-0/
Description
The dataset has been created specifically for practicing Python, NumPy, Pandas, and Matplotlib. It is designed to provide a hands-on learning experience in data manipulation, analysis, and visualization using these libraries.

Specifics of the Dataset:

The dataset consists of 5000 rows and 20 columns, representing various features with different data types and distributions. The features include numerical variables with continuous and discrete distributions, categorical variables with multiple categories, binary variables, and ordinal variables. Each feature has been generated using different probability distributions and parameters to introduce variations and simulate real-world data scenarios. The dataset is synthetic and does not represent any real-world data. It has been created solely for educational purposes.

One of the defining characteristics of this dataset is the intentional incorporation of various real-world data challenges:

- Certain columns are randomly selected to be populated with NaN values, simulating the common challenge of missing data.
- The proportion of these missing values in each column varies randomly between 1% and 70%.
- Statistical noise has been introduced in the dataset. For numerical values in some features, this noise follows a distribution with mean 0 and standard deviation 0.1.
- Categorical noise is introduced in some features, with categories randomly altered in about 1% of the rows.
- Outliers have also been embedded in the dataset, consistent with the Interquartile Range (IQR) rule.
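
For practice, the IQR rule mentioned above can be sketched like this. The multiplier k = 1.5 is the conventional choice and is an assumption here, since the description does not state which multiplier was used to embed the outliers; the function name and sample column are hypothetical. NaN values are skipped first because the dataset deliberately contains missing data.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], skipping NaN."""
    clean = [v for v in values if v == v]           # NaN is never equal to itself
    q1, _, q3 = quantiles(clean, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in clean if v < low or v > high]

# Hypothetical numeric column with one missing value and one extreme value.
sample = [10.0, 12.0, 11.0, float("nan"), 13.0, 12.0, 95.0]
flagged = iqr_outliers(sample)
```

Here the quartiles are 11.25 and 12.75, so the fences are 9.0 and 15.0 and only 95.0 is flagged.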

Context of the Dataset:

The dataset aims to provide a comprehensive playground for practicing Python, NumPy, Pandas, and Matplotlib. It allows learners to explore data manipulation techniques, perform statistical analysis, and create visualizations using the provided features. By working with this dataset, learners can gain hands-on experience in data cleaning, preprocessing, feature engineering, and visualization.

Sources of the Dataset:

The dataset has been generated programmatically using Python's random number generation functions and probability distributions. No external sources or real-world data have been used in creating this dataset.
Life Expectancy WHO
kaggle.com
zip
Updated Jun 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
vikram amin (2023). Life Expectancy WHO [Dataset]. https://www.kaggle.com/datasets/vikramamin/life-expectancy-who
Available download formats: zip (121472 bytes)
Dataset updated
Jun 19, 2023
Authors
vikram amin
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
The objective behind working with this dataset was to understand the predictors that contribute to life expectancy around the world. I used Linear Regression, Decision Tree and Random Forest for this purpose.

Steps involved:

1) Read the csv file.

2) Data cleaning:
- The variables Country and Status had character data types and were converted to factors.
- 2563 missing values were encountered; the Population variable had the most, i.e. 652.
- Rows with missing values were dropped before running the analysis.

3) Run Linear Regression:
- Before running linear regression, 3 variables (Country, Year and Status) were dropped because they were not found to have much effect on the dependent variable, Life Expectancy. This leaves 19 variables (1 dependent and 18 independent).
- We run the linear regression. Multiple R squared is 83%, which means that the independent variables explain 83% of the variance in the dependent variable.
- OUTLIER DETECTION. We check for outliers using IQR and find 54 outliers. These outliers are removed before we run the regression analysis once again. Multiple R squared increases from 83% to 86%.
- MULTICOLLINEARITY. We check for multicollinearity using the VIF (Variance Inflation Factor), which flags cases where two or more independent variables are highly correlated. The rule of thumb is that variables with VIF values above 5 should be removed. We find 6 variables with a VIF value higher than 5, namely Infant.deaths, percentage.expenditure, Under.five.deaths, GDP, thinness1.19 and thinness5.9. Infant deaths and Under Five deaths are strongly collinear, so we drop Infant deaths (which has the higher VIF value).
- When we run the linear regression model again, the VIF value of Under.Five.Deaths goes down from 211.46 to 2.74, while the VIF values of the other variables decrease only slightly. Variable thinness1.19 is now dropped and we run the regression once more.
- The VIF value of thinness5.9 drops from 7.61 to 1.95. GDP and Population still have VIF values above 5, but I decided against dropping them as I consider them to be important independent variables.
- SET THE SEED AND SPLIT THE DATA INTO TRAIN AND TEST DATA. We fit the model on the train data and get a multiple R squared of 86% and a p-value below alpha, which indicates that the model is statistically significant. We use the model fitted on the train data to predict the test data and compute the RMSE and MAPE, using library(Metrics) for this purpose.
- In Linear Regression, RMSE (Root Mean Squared Error) is 3.2. This indicates that on average, the predicted values have an error of 3.2 years as compared to the actual life expectancy values.
- MAPE (Mean Absolute Percentage Error) is 0.037. This indicates an accuracy prediction of 96.20% (1-0.037).
- MAE (Mean Absolute Error) is 2.55. This indicates that on average, the predicted values deviate by approximately 2.83 years from the actual values.
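
The three error measures reported for each model can be written out as follows. This is a plain Python sketch of the usual definitions, not the R Metrics package used in the analysis; the function names mirror the metric names, and the actual/predicted values are hypothetical. MAPE is returned as a fraction, matching the way values such as 0.037 are reported here.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error, as a fraction (actual values must be nonzero)."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical life expectancy values in years, not taken from the dataset.
actual = [50.0, 60.0, 80.0]
predicted = [55.0, 57.0, 76.0]

scores = {
    "rmse": rmse(actual, predicted),
    "mae": mae(actual, predicted),
    "mape": mape(actual, predicted),
}
```

RMSE is never smaller than MAE on the same predictions, because squaring weights large errors more heavily.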

We use DECISION TREE MODEL for the analysis.

Run the required libraries (rpart, rpart.plot, RColorBrewer, rattle).

We run the decision tree analysis using rpart and plot the tree. We use fancyRpartPlot.

We use 5 fold cross validation method with CP (complexity parameter) being 0.01.

In Decision Tree, RMSE (Root Mean Squared Error) is 3.06. This indicates that on average, the predicted values have an error of 3.06 years as compared to the actual life expectancy values.

MAPE (Mean Absolute Percentage Error) is 0.035. This indicates an accuracy prediction of 96.45% (1-0.035).

MAE (Mean Absolute Error) is 2.35. This indicates that on an average, the predicted values deviate by approximately 2.35 years from the actual values.

We use RANDOM FOREST for the analysis.

Run library(randomForest)

We use varImpPlot to find out which variables are most significant and least significant. Income composition is the most important followed by adult mortality and the least relevant independent variable is Population.

Predict Life expectancy through random forest model.

In Random Forest, RMSE (Root Mean Squared Error) is 1.73. This indicates that on average, the predicted values have an error of 1.73 years as compared to the actual life expectancy values.

MAPE (Mean Absolute Percentage Error) is 0.01. This indicates an accuracy prediction of 98.27% (1-0.01).

MAE (Mean Absolute Error) is 1.14. This indicates that on an average, the predicted values deviate by approximately 1.14 years from the actual values.

Conclusion: Random Forest is the best model for predicting the life expectancy values as it has the lowest RMSE, MAPE and MAE.
Data from: Public supply, self-supplied domestic, irrigation, and...
catalog.data.gov
data.usgs.gov
Updated Nov 21, 2025
Cite
U.S. Geological Survey (2025). Public supply, self-supplied domestic, irrigation, and thermoelectric water-use data from 5-year compilation datasets from 1985 to 2015 used to assess data variability and uncertainty [Dataset]. https://catalog.data.gov/dataset/public-supply-self-supplied-domestic-irrigation-and-thermoelectric-water-use-data-from-5-y
Dataset updated
Nov 21, 2025
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Description
The U.S. Geological Survey (USGS) National Water Use Program is responsible for compiling and disseminating the Nation's water-use data. Working in cooperation with local, State, and Federal agencies, the USGS has published an estimate of water use in the United States every 5 years, beginning in 1950. These 5-year compilations contain water-use estimates that are aggregated to the county level in the United States. This USGS data release contains summaries of method codes used in the 2015 national compilation of public supply, self-supplied domestic, thermoelectric, and irrigation water-use data. This data release also contains the county-level water-use estimates that support the evaluations in Luukkonen and others (2021). Finally, this data release contains summaries of regional medians and interquartile ranges from 1985 to 2015 that were used to highlight areas of unexpected variability, consistency and/or potential values that warrant further investigation. This data release supports the following publication: Luukkonen, C.L., Belitz, K., Sullivan, S.L., and Sargent, P., 2021, Factors affecting uncertainty of public supply, self-supplied domestic, irrigation, and thermoelectric water-use data, 1985-2015-evaluation of information sources, estimation methods, and data variability: U.S. Geological Survey Scientific Investigations Report 2021-5082, 78 p., https://doi.org/10.3133/sir20215082.

Hypertension Treatment Clinical Trial Dataset

kaggle.com

zip

Updated Mar 10, 2025

Cite

Isabella D (2025). Hypertension Treatment Clinical Trial Dataset [Dataset]. https://www.kaggle.com/datasets/isabelladil/phase-iii-clinical-trial-dataset

Available download formats: zip (14424 bytes)

Dataset updated

Mar 10, 2025

Authors

Isabella D

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Synthetic Clinical Trial Dataset – Hypertension Drug Trial (CardioX vs. Active Comparator vs. Placebo)

📝 About This Dataset

This synthetic dataset simulates a Phase III randomized controlled clinical trial evaluating CardioX (Drug A) versus an active comparator (Drug B) and a placebo for treating hypertension. It is designed for clinical data analysis, anomaly detection, and risk-based monitoring (RBM) applications.

The dataset includes 1,000 patients across 50 trial sites, with realistic patient demographics, blood pressure readings, cholesterol levels, dropout rates, and adverse event reporting. Several anomalies have been embedded to simulate real-world data quality issues commonly encountered in clinical trials.

This dataset is ideal for data quality assessments, statistical anomaly detection (Z-scores, IQR, clustering), and risk-based monitoring (RBM) in clinical research.
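
As one example of the statistical anomaly detection mentioned above, a Z-score check could look like the sketch below. The function name, the cutoff and the readings are hypothetical and not taken from the dataset. A cutoff of 2.5 is used instead of the more common 3 because a single extreme value inflates the standard deviation, which caps how large a Z-score can get in a small sample.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Return (index, value) pairs whose absolute Z-score exceeds threshold."""
    mu = mean(values)
    sigma = stdev(values)                 # sample standard deviation
    return [(i, v) for i, v in enumerate(values)
            if abs((v - mu) / sigma) > threshold]

# Hypothetical systolic blood pressure readings (mmHg) with one implausible entry.
systolic = [120.0, 122.0, 118.0, 121.0, 119.0, 123.0,
            117.0, 120.0, 121.0, 119.0, 200.0]
flagged = zscore_outliers(systolic, threshold=2.5)
```

The same check can be run per site to look for sites whose readings drift away from the rest of the trial.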

🚀 Potential Use Cases

🔹 Clinical Trial Data Analysis – Investigate treatment efficacy and safety trends.

🔹 Anomaly Detection – Apply Z-scores, IQR, and ML-based clustering methods to identify outliers.

🔹 Risk-Based Monitoring (RBM) – Detect potential site-level risks and data inconsistencies.

🔹 Machine Learning Applications – Train models for adverse event prediction or dropout risk estimation.

📊 Dataset Features

Column Name	Description
Patient_ID	Unique identifier for each trial participant.
Site_ID	Site where the patient was enrolled (1-50).
Age	Patient age (in years).
Gender	Male or Female.
Enrollment_Date	Date when the patient was enrolled in the study.
Treatment_Group	Assigned treatment: Placebo, Drug A (CardioX), or Drug B (Active Comparator).
Adverse_Events	Number of adverse events (AEs) reported by the patient.
Dropout	Whether the patient dropped out of the study (1 = Yes, 0 = No).
Systolic_BP	Systolic Blood Pressure (mmHg).
Diastolic_BP	Diastolic Blood Pressure (mmHg).
Cholesterol_Level	Total cholesterol level (mg/dL).

📢 Acknowledgment & Licensing

This dataset is fully synthetic and does not contain real patient data. It is created for educational, analytical, and research purposes in clinical data science and biostatistics.

🔗 If you use this dataset, tag me! Let’s discuss insights & findings! 🚀

Data used to support a meta-analysis investigating ecological effects of...
dataverse.scholarsportal.info
borealisdata.ca
+1more
csv, txt
Updated Nov 20, 2019
Cite
Scholars Portal Dataverse (2019). Data used to support a meta-analysis investigating ecological effects of urban lawn management [Dataset]. http://doi.org/10.5683/SP2/RRJTEN
Explore at:
Available download formats: txt (7468), csv (8029)
Unique identifier
https://doi.org/10.5683/SP2/RRJTEN
Dataset updated
Nov 20, 2019
Dataset provided by
Scholars Portal Dataverse
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Time period covered
Jan 1, 2002 - Dec 31, 2018
Area covered
Trois-Rivières, Quebec, Springfield, United States, MA, Finland, Helsinki, Reading, United Kingdom, Tubingen, Germany, United Kingdom, Bracknell, Rennes, France
Description
These data support a meta-analysis investigating the ecological impacts of intense lawn management (mowing). Raw data on invertebrate abundance and temperature were collected by Léonie Carignan-Guillemette (2018) and Caroline Turcotte (2017) under the supervision of Raphaël Proulx and Vincent Maire (refer to Appendix S1 within the related publication for more information). Other data were gathered and processed as follows. We searched the Scopus database on 8 February, 2019 with the following combination of keywords: (lawn OR turf) AND mowing AND (urban OR city). Generally, studies were ineligible when: the full text of the article was not available even after contacting the authors; mowing was incidental to the study and not an experimental factor; response variables were not ecologically relevant; confounding factors (e.g. fertilisation) could not be isolated; a non-urban context was used; or simulated data were presented. We extracted the mean and statistical variation (standard deviation or standard error) for each response variable in control (less-intensively mown) and treatment (intensively mown) groups. Reported data were used when available. Otherwise, data were extracted from published figures using the Web Plot Digitizer tool. Where only the median and interquartile range were presented, the mean and standard deviation were estimated. Variables with multi-temporal data (e.g. soil moisture) were summarised using the mean and pooled standard deviation to provide an aggregated value per site per year. Where seasonal trends were evident in raw multi-temporal data (e.g. soil temperature), data were detrended using a polynomial function and the analysis was applied to the residuals.
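
The description does not say which estimator was used to convert a median and interquartile range into a mean and standard deviation, or how the pooled standard deviation was weighted. The sketch below therefore shows one common approach as an assumption: the quartile-based approximation that takes the mean as the average of the three quartiles and the SD as IQR / 1.35 (which holds for roughly normal data), and the usual degrees-of-freedom-weighted pooled SD. Function names and input values are illustrative only.

```python
import math

def mean_sd_from_quartiles(q1, median, q3):
    """Approximate mean and SD from quartiles, assuming near-normal data."""
    mean = (q1 + median + q3) / 3
    sd = (q3 - q1) / 1.35  # the IQR of a normal distribution is about 1.35 SD
    return mean, sd

def pooled_sd(sds, ns):
    """Pool standard deviations, weighting each by its degrees of freedom."""
    numerator = sum((n - 1) * s ** 2 for s, n in zip(sds, ns))
    denominator = sum(n - 1 for n in ns)
    return math.sqrt(numerator / denominator)

# Hypothetical study reporting median 12 with IQR 10 to 17.
est_mean, est_sd = mean_sd_from_quartiles(10, 12, 17)

# Hypothetical repeated measurements at one site in one year.
site_sd = pooled_sd(sds=[2.0, 3.0, 2.5], ns=[10, 12, 8])
```

Estimates of this kind are sensitive to skew, so a meta-analysis would normally record which effect sizes were derived this way.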
Data from: Acceptability of short message service reminders as the support...
search.dataone.org
datadryad.org
Updated Oct 28, 2025
Cite
Laban Muteebwa; Edith Nakku Joloba; Joan Nangendo; Dan Muramuzi; Faith Akello; Sabrina Kitaka Bakeera; Fred Collins Semitala; Aggrey S. Semeere; Charles Karamagi (2025). Acceptability of short message service reminders as the support tool for PrEP adherence among young women in Mukono district, Uganda [Dataset]. http://doi.org/10.5061/dryad.cvdncjt8h
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.cvdncjt8h
Dataset updated
Oct 28, 2025
Dataset provided by
Dryad Digital Repository
Authors
Laban Muteebwa; Edith Nakku Joloba; Joan Nangendo; Dan Muramuzi; Faith Akello; Sabrina Kitaka Bakeera; Fred Collins Semitala; Aggrey S. Semeere; Charles Karamagi
Area covered
Mukono, Uganda
Description
Adolescent girls and young women (AGYW) have a disproportionately high incidence of HIV compared to males of the same age in Uganda. AGYW are a priority sub-group for daily oral Pre-Exposure Prophylaxis (PrEP), but their adherence has consistently remained low. Short Message Service (SMS) reminders could improve adherence to PrEP in AGYW. However, there is a paucity of literature about the acceptability of SMS reminders among AGYW using PrEP. We assessed the level of acceptability of SMS reminders as a PrEP adherence support tool, and the associated factors, among AGYW in Mukono district, Central Uganda. We consecutively enrolled AGYW using PrEP in Mukono district in a cross-sectional study. A structured pre-tested questionnaire was administered to participants by three trained research assistants. Data were analyzed in STATA 17.0; continuous variables were summarized using median and interquartile range (IQR) while categorical variables were summarized using frequencies and percentages....

The data set was collected through a researcher-administered questionnaire. The main dependent variable was acceptability of SMS reminders. This was measured using the seven constructs derived from the Theoretical Framework of Acceptability (TFA)(1): affective attitude, burden, perceived effectiveness, ethicality, intervention coherence, opportunity costs, and self-efficacy. One 5-point Likert item was used per construct, and each level of the Likert scale was given a weight ranging from one to five. The weights assigned to each response were summed. The resulting summated acceptability score was then dichotomized at the 50th percentile of the possible summated scores, which range from 7 to 35 (the 50th percentile is 21). Therefore “Acceptability of SMS reminders” was defined as a value greater than 21.
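
The scoring rule described above (seven items weighted 1 to 5, summed, then dichotomized at 21) can be sketched as follows. The construct keys and the function name are hypothetical and do not come from the data dictionary; only the cut-off and score range are taken from the description.

```python
# Hypothetical keys for the seven TFA constructs named in the description.
CONSTRUCTS = [
    "affective_attitude", "burden", "perceived_effectiveness", "ethicality",
    "intervention_coherence", "opportunity_costs", "self_efficacy",
]

def acceptability(responses, cutoff=21):
    """Return (summated score, acceptable?) for one participant.

    `responses` maps each construct to a Likert weight from 1 to 5.
    A participant counts as accepting when the score exceeds the cutoff.
    """
    if set(responses) != set(CONSTRUCTS):
        raise ValueError("expected exactly one response per construct")
    if any(not 1 <= weight <= 5 for weight in responses.values()):
        raise ValueError("Likert weights must be between 1 and 5")
    score = sum(responses.values())  # possible range: 7 to 35
    return score, score > cutoff

neutral = {name: 3 for name in CONSTRUCTS}    # all mid-scale answers
positive = {name: 4 for name in CONSTRUCTS}   # all "agree"-level answers

neutral_result = acceptability(neutral)
positive_result = acceptability(positive)
```

Note that a participant answering at the scale midpoint on every item scores exactly 21 and is therefore classed as not accepting, because the definition uses a strict "greater than".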
The independent variables were captured as described in the attached data dictionary. Data analysis was performed in STATA versi...

The participants gave written informed consent to publish de-identified data in accordance with the Uganda National Council for Science and Technology (UNCST), a local human participant research regulator. Identifying characteristics such as numerical age and physical address were redacted.

# Acceptability of short message service reminders as the support tool for PrEP adherence among young women in Mukono district, Uganda

Dataset DOI: 10.5061/dryad.cvdncjt8h

Description of the data and file structure

In this dataset, we aimed to assess the acceptability of short message service (SMS) reminders among Adolescent Girls and Young Women (AGYW) prescribed Pre-Exposure Prophylaxis (PrEP). We also measured demographic and other individual factors.

Files and variables

File: Manuscript_dataset.dta

Description: This section describes the variables included in the dataset (data dictionary)

| Variable Name | Variable type | Variable Label | Value Labels | | :------------------- | :------------ | :---------------------------------...
Data from: Sharing of clinical trial data and results reporting practices...
zenodo.org
data.niaid.nih.gov
+1more
Updated Jun 1, 2022
Cite
Jennifer Miller; Joseph S. Ross; Marc Wilenzick; Michelle M. Mello (2022). Data from: Sharing of clinical trial data and results reporting practices among large pharmaceutical companies: cross sectional descriptive study and pilot of a tool to improve company practices [Dataset]. http://doi.org/10.5061/dryad.k81584t
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.k81584t
Dataset updated
Jun 1, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Jennifer Miller; Joseph S. Ross; Marc Wilenzick; Michelle M. Mello
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Objectives: To develop and pilot a tool to measure and improve pharmaceutical companies' clinical trial data sharing policies and practices. Design: Cross sectional descriptive analysis. Setting: Large pharmaceutical companies with novel drugs approved by the US Food and Drug Administration in 2015. Data sources: Data sharing measures were adapted from 10 prominent data sharing guidelines from expert bodies and refined through a multi-stakeholder deliberative process engaging patients, industry, academics, regulators, and others. Data sharing practices and policies were assessed using data from ClinicalTrials.gov, Drugs@FDA, corporate websites, data sharing platforms and registries (eg, the Yale Open Data Access (YODA) Project and Clinical Study Data Request (CSDR)), and personal communication with drug companies. Main outcome measures: Company level, multicomponent measure of accessibility of participant level clinical trial data (eg, analysis ready dataset and metadata); drug and trial level measures of registration, results reporting, and publication; company level overall transparency rankings; and feasibility of the measures and ranking tool to improve company data sharing policies and practices. Results: Only 25% of large pharmaceutical companies fully met the data sharing measure. The median company data sharing score was 63% (interquartile range 58-85%). Given feedback and a chance to improve their policies to meet this measure, three companies made amendments, raising the percentage of companies in full compliance to 33% and the median company data sharing score to 80% (73-100%). The most common reasons companies did not initially satisfy the data sharing measure were failure to share data by the specified deadline (75%) and failure to report the number and outcome of their data requests. 
Across new drug applications, a median of 100% (interquartile range 91-100%) of trials in patients were registered, 65% (36-96%) reported results, 45% (30-84%) were published, and 95% (69-100%) were publicly available in some form by six months after FDA drug approval. When examining results on the drug level, less than half (42%) of reviewed drugs had results for all their new drug applications trials in patients publicly available in some form by six months after FDA approval. Conclusions: It was feasible to develop a tool to measure data sharing policies and practices among large companies and have an impact in improving company practices. Among large companies, 25% made participant level trial data accessible to external investigators for new drug approvals in accordance with the current study's measures; this proportion improved to 33% after applying the ranking tool. Other measures of trial transparency were higher. Some companies, however, have substantial room for improvement on transparency and data sharing of clinical trials.
Descriptive statistics of biomarker measurements.
datasetcatalog.nlm.nih.gov
figshare.com
Updated Mar 10, 2014
+ more versions
Cite
Wiley, Laura; Kingston, Andrew; Chinnery, Patrick F.; Martin-Ruiz, Carmen; Catt, Michael; Collerton, Joanna; von Zglinicki, Thomas; Ashok, Deepthi; Davies, Karen; Talbot, Duncan C. S.; Jagger, Carol; Kirkwood, Thomas B. L. (2014). Descriptive statistics of biomarker measurements. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001195388
Explore at:
Dataset updated
Mar 10, 2014
Authors
Wiley, Laura; Kingston, Andrew; Chinnery, Patrick F.; Martin-Ruiz, Carmen; Catt, Michael; Collerton, Joanna; von Zglinicki, Thomas; Ashok, Deepthi; Davies, Karen; Talbot, Duncan C. S.; Jagger, Carol; Kirkwood, Thomas B. L.
Description
(SE: Standard error, IQR: Interquartile range, n: Number of participants, P1: Baseline (phase 1)).
Summary of Participant and Data Characteristics.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated May 1, 2014
Cite
Hernandez-Andrade, Edgar; Dassanayake, Maya T.; Yeo, Lami; Marusak, Hilary A.; Berman, Susan; Shastri, Rupal; Hassan, Sonia S.; Brown, Jesse A.; Mody, Swati; Romero, Roberto; Thomason, Moriah E. (2014). Summary of Participant and Data Characteristics. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001212236
Explore at:
Dataset updated
May 1, 2014
Authors
Hernandez-Andrade, Edgar; Dassanayake, Maya T.; Yeo, Lami; Marusak, Hilary A.; Berman, Susan; Shastri, Rupal; Hassan, Sonia S.; Brown, Jesse A.; Mody, Swati; Romero, Roberto; Thomason, Moriah E.
Description
Younger fetuses are defined as GA <31 weeks, older fetuses are defined as GA ≥31 weeks. * denotes significant p-values. Abbreviations: GA, gestational age; MRI, magnetic resonance imaging; M, male; F, female; SD, standard deviation; IQR, interquartile range.

Student Academic Performance (Synthetic Dataset)

kaggle.com

zip

Updated Oct 10, 2025

Cite

Mamun Hasan (2025). Student Academic Performance (Synthetic Dataset) [Dataset]. https://www.kaggle.com/datasets/mamunhasan2cs/student-academic-performance-synthetic-dataset

Explore at:

Available download formats: zip (9287 bytes)

Dataset updated

Oct 10, 2025

Authors

Mamun Hasan

License

https://creativecommons.org/publicdomain/zero/1.0/

Description

This dataset is a synthetic collection of student performance data created for data preprocessing, cleaning, and analysis practice in Data Mining and Machine Learning courses. It contains information about 1,020 students, including their study habits, attendance, and test performance, with intentionally introduced missing values, duplicates, and outliers to simulate real-world data issues.

The dataset is suitable for laboratory exercises, assignments, and demonstration of key preprocessing techniques such as:

Handling missing values
Removing duplicates
Detecting and treating outliers
Data normalization and transformation
Encoding categorical variables
Exploratory data analysis (EDA)
Regression Analysis

📊 Columns Description

Column Name	Description
Student_ID	Unique identifier for each student (e.g., S0001, S0002, …)
Age	Age of the student (between 18 and 25 years)
Gender	Gender of the student (Male/Female)
Study_Hours	Average number of study hours per day (contains missing values and outliers)
Attendance(%)	Percentage of class attendance (contains missing values)
Test_Score	Final exam score (0–100 scale)
Grade	Letter grade derived from test scores (`F`, `C`, `B`, `A`, `A+`)

🧠 Example Lab Tasks Using This Dataset:

Identify and impute missing values using mean/median.
Detect and remove duplicate records.
Use IQR or Z-score methods to handle outliers.
Normalize Study_Hours and Test_Score using Min-Max scaling.
Encode categorical variables (Gender, Grade) for model input.
Prepare a clean dataset ready for classification/regression analysis.
The cleaned dataset can also be used for limited regression exercises.
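
The lab tasks above can be sketched end to end on a handful of made-up records. This is an illustrative sketch using only the Python standard library; the rows, the `Gender_Code` and `Study_Hours_Scaled` fields, the 0/1 gender mapping, and the choice to drop (rather than cap) outliers are all assumptions made for the example.

```python
import statistics

# Hypothetical records using a subset of the dataset's column names.
rows = [
    {"Student_ID": "S0001", "Gender": "Male", "Study_Hours": 2.0, "Test_Score": 55},
    {"Student_ID": "S0002", "Gender": "Female", "Study_Hours": None, "Test_Score": 70},
    {"Student_ID": "S0002", "Gender": "Female", "Study_Hours": None, "Test_Score": 70},
    {"Student_ID": "S0003", "Gender": "Female", "Study_Hours": 4.0, "Test_Score": 82},
    {"Student_ID": "S0004", "Gender": "Male", "Study_Hours": 3.0, "Test_Score": 64},
    {"Student_ID": "S0005", "Gender": "Male", "Study_Hours": 40.0, "Test_Score": 90},
]

# 1. Remove duplicate records, keeping the first row per Student_ID.
seen, unique = set(), []
for row in rows:
    if row["Student_ID"] not in seen:
        seen.add(row["Student_ID"])
        unique.append(dict(row))

# 2. Impute missing Study_Hours with the median of the observed values.
observed = [r["Study_Hours"] for r in unique if r["Study_Hours"] is not None]
median_hours = statistics.median(observed)
for r in unique:
    if r["Study_Hours"] is None:
        r["Study_Hours"] = median_hours

# 3. Drop rows whose Study_Hours fall outside the 1.5 * IQR fences.
hours = [r["Study_Hours"] for r in unique]
q1, _, q3 = statistics.quantiles(hours, n=4, method="inclusive")
low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
clean = [r for r in unique if low <= r["Study_Hours"] <= high]

# 4. Min-Max scale Study_Hours to [0, 1] and encode Gender as 0/1.
h_min = min(r["Study_Hours"] for r in clean)
h_max = max(r["Study_Hours"] for r in clean)
for r in clean:
    r["Study_Hours_Scaled"] = (r["Study_Hours"] - h_min) / (h_max - h_min)
    r["Gender_Code"] = {"Male": 0, "Female": 1}[r["Gender"]]
```

The median is used for imputation here because it is not pulled upward by the 40-hour outlier, which is still present at that step.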

🎯 Possible Regression Targets

Test_Score → Predict test score based on study hours, attendance, age, and gender.

🧩 Example Regression Problem

Predict the student’s test score using their study hours, attendance percentage, and age.

🧠 Sample Features:
X = ['Age', 'Gender', 'Study_Hours', 'Attendance(%)']
y = ['Test_Score']

You can use:

Linear Regression (for simplicity)
Polynomial Regression (to explore nonlinear patterns)
Decision Tree Regressor or Random Forest Regressor

And analyze feature influence using correlation or SHAP/LIME explainability.

Demographic Data in the moxibustion and usual care Groups.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Jul 25, 2014
Cite
Choi, Jin-Bong; Jung, Hee-Jung; Kang, Kyung-Won; Kim, Tae-Hun; Kim, Hyeong Jun; Shin, Mi-Suk; Kim, Joo-Hee; Kim, Kun Hyung; Kang, Jung Won; Kim, Ae-Ran; Song, Ho Sueb; Kim, Jung Eun; Lee, MinHee; Hong, Kwon Eui; Lee, Seunghoon; Park, Hyo-Ju; Jung, So-Young; Choi, Sun-Mi (2014). Demographic Data in the moxibustion and usual care Groups. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001187308
Explore at:
Dataset updated
Jul 25, 2014
Authors
Choi, Jin-Bong; Jung, Hee-Jung; Kang, Kyung-Won; Kim, Tae-Hun; Kim, Hyeong Jun; Shin, Mi-Suk; Kim, Joo-Hee; Kim, Kun Hyung; Kang, Jung Won; Kim, Ae-Ran; Song, Ho Sueb; Kim, Jung Eun; Lee, MinHee; Hong, Kwon Eui; Lee, Seunghoon; Park, Hyo-Ju; Jung, So-Young; Choi, Sun-Mi
Description
* The Wilcoxon rank sum test was used for statistical analysis. † The Chi-squared test was used for statistical analysis. ‡ The t-test was used for statistical analysis. § Fisher's exact test was used for statistical analysis. IQR: interquartile range.
A decade-long analysis of trends in antimicrobial resistance at a...
datadryad.org
search.dataone.org
zip
Updated Aug 27, 2024
Cite
Ajaya Basnet (2024). A decade-long analysis of trends in antimicrobial resistance at a neurosurgical hospital in Kathmandu, Nepal [Dataset]. http://doi.org/10.5061/dryad.zpc866thj
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.zpc866thj
Dataset updated
Aug 27, 2024
Dataset provided by
Dryad
Authors
Ajaya Basnet
Time period covered
Aug 14, 2024
Area covered
Nepal, Kathmandu
Description
In the patient information sheet, outcome variables [bacterial pathogens and viral-bacterial coinfections (simultaneous occurrences)] and predictor variables (patient demographics, time frame, specimen type, type of bacterial isolate(s), and antimicrobial susceptibility patterns) were collected from the hospital records. The data were anonymized to ensure patient confidentiality. Data were entered and managed using Microsoft Excel, version 13.0, and analyzed using Statistical Package for Social Sciences (SPSS), version 17.0. Descriptive data were analyzed in terms of frequency and percentage. Quantitative data were reported as mean, median, and interquartile range (IQR). Qualitative variables were analyzed using the Chi-square test, while quantitative variables were analyzed using the independent Student's t-test, with statistical significance determined at a p-value of <0.05 within a 95% confidence interval (CI).

