54 datasets found
  1. Summary statistics (mean, standard deviation, median, interquartile range,...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Apr 10, 2014
    Cite
    Pavanello, Sofia; Simeone, Claudio; Porru, Stefano; Mastrangelo, Giuseppe; Carta, Angela; Arici, Cecilia; Izzotti, Alberto (2014). Summary statistics (mean, standard deviation, median, interquartile range, number of subjects) for “ln_adducts” in cases, controls, and total population. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001172049
    Explore at:
    Dataset updated
    Apr 10, 2014
    Authors
    Pavanello, Sofia; Simeone, Claudio; Porru, Stefano; Mastrangelo, Giuseppe; Carta, Angela; Arici, Cecilia; Izzotti, Alberto
    Description

    Summary statistics (mean, standard deviation, median, interquartile range, number of subjects) for “ln_adducts” in cases, controls, and total population.

  2. Time and summary statistics in days (median and interquartile 25 to 75%...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 3, 2023
    Cite
    Amy Gadoud; Eleanor Kane; Una Macleod; Pat Ansell; Steven Oliver; Miriam Johnson (2023). Time and summary statistics in days (median and interquartile 25 to 75% range) from first time coded as on a palliative care register to date of death for each disease group. [Dataset]. http://doi.org/10.1371/journal.pone.0113188.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Amy Gadoud; Eleanor Kane; Una Macleod; Pat Ansell; Steven Oliver; Miriam Johnson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Percentages may not total 100 due to rounding. Time and summary statistics in days (median and interquartile 25 to 75% range) from first time coded as on a palliative care register to date of death for each disease group.

  3. Descriptive statistics, mean ± SD, range, median and interquartile range...

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Hélène Follet; Delphine Farlay; Yohann Bala; Stéphanie Viguet-Carrin; Evelyne Gineyts; Brigitte Burt-Pichat; Julien Wegrzyn; Pierre Delmas; Georges Boivin; Roland Chapurlat (2023). Descriptive statistics, mean ± SD, range, median and interquartile range (IQR). [Dataset]. http://doi.org/10.1371/journal.pone.0055232.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Hélène Follet; Delphine Farlay; Yohann Bala; Stéphanie Viguet-Carrin; Evelyne Gineyts; Brigitte Burt-Pichat; Julien Wegrzyn; Pierre Delmas; Georges Boivin; Roland Chapurlat
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Descriptive statistics, mean ± SD, range, median and interquartile range (IQR).

  4. Simulation Data Set

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Simulation Data Set [Dataset]. https://catalog.data.gov/dataset/simulation-data-set
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

    This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed.

    It can be accessed through the following means: File format: R workspace file; “Simulated_Dataset.RData”.

    Metadata (including data dictionary)

    • y: Vector of binary responses (1: adverse outcome, 0: control)
    • x: Matrix of covariates; one row for each simulated individual
    • z: Matrix of standardized pollution exposures
    • n: Number of simulated individuals
    • m: Number of exposure time periods (e.g., weeks of pregnancy)
    • p: Number of columns in the covariate design matrix
    • alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

    Code

    Abstract

    We provide R statistical software code (“CWVS_LMC.txt”) to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code (“Results_Summary.txt”) to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.

    Description

    “CWVS_LMC.txt”: This code is delivered to the user in the form of a .txt file that contains R statistical software code. Once the “Simulated_Dataset.RData” workspace has been loaded into R, the text in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities.

    “Results_Summary.txt”: This code is also delivered to the user in the form of a .txt file that contains R statistical software code. Once the “CWVS_LMC.txt” code is applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

    Optional Information (complete as necessary)

    Required R packages:

    • For running “CWVS_LMC.txt”:
      • msm: Sampling from the truncated normal distribution
      • mnormt: Sampling from the multivariate normal distribution
      • BayesLogit: Sampling from the Polya-Gamma distribution
    • For running “Results_Summary.txt”:
      • plotrix: Plotting the posterior means and credible intervals

    Instructions for Use

    Reproducibility (Mandatory)

    What can be reproduced: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 from the presented simulation study.

    How to use the information:

    • Load the “Simulated_Dataset.RData” workspace
    • Run the code contained in “CWVS_LMC.txt”
    • Once the “CWVS_LMC.txt” code is complete, run “Results_Summary.txt”.

    Format: Below is the replication procedure for the attached data set for the portion of the analyses using a simulated data set:

    Data

    The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

    Availability

    Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics, and requires an appropriate data use agreement.

    Description

    Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

    This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).
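The median/IQR standardization described above can be illustrated with a minimal Python sketch. It is not the authors' R code: the exposure values, the function name `standardize_week`, and the choice of the "inclusive" quartile convention are all assumptions made for illustration, since the description does not say how quartiles were computed.

```python
from statistics import median, quantiles


def standardize_week(exposures):
    """Subtract the week's median and divide by the week's IQR."""
    # Quartile convention is an assumption; the source does not specify one.
    q1, _, q3 = quantiles(exposures, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; cannot standardize this week")
    med = median(exposures)
    return [(value - med) / iqr for value in exposures]


# Hypothetical raw exposures: one row per individual, one column per week.
raw = [
    [10.0, 3.0],
    [12.0, 5.0],
    [14.0, 7.0],
    [16.0, 9.0],
    [18.0, 11.0],
]

# Standardize column by column (week by week), then restore row layout.
weeks = [list(week) for week in zip(*raw)]
z_columns = [standardize_week(week) for week in weeks]
z = [list(row) for row in zip(*z_columns)]
```

After this step each week has median 0 and IQR 1, so the original medians and IQRs cannot be read back from `z` alone, which matches the confidentiality rationale given in the description.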

  5. Meta data and supporting documentation

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Meta data and supporting documentation [Dataset]. https://catalog.data.gov/dataset/meta-data-and-supporting-documentation
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    We include a description of the data sets in the meta-data as well as sample code and results from a simulated data set.

    This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed.

    It can be accessed through the following means: The R code is available online here: https://github.com/warrenjl/SpGPCW.

    Format:

    Abstract

    The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

    Availability

    Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

    Description

    Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

    File format: R workspace file.

    Metadata (including data dictionary)

    • y: Vector of binary responses (1: preterm birth, 0: control)
    • x: Matrix of covariates; one row for each simulated individual
    • z: Matrix of standardized pollution exposures
    • n: Number of simulated individuals
    • m: Number of exposure time periods (e.g., weeks of pregnancy)
    • p: Number of columns in the covariate design matrix
    • alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate).

    This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).
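The data dictionary above implies some consistency rules between the objects, which a short Python sketch can make concrete. The function `check_dimensions` is hypothetical, and two of its checks are assumptions rather than facts from the description: that z has one row per individual and one column per exposure period, and that y holds only 0 and 1.

```python
def check_dimensions(y, x, z, n, m, p):
    """Return a list of problems; an empty list means the shapes agree."""
    problems = []
    if len(y) != n:
        problems.append(f"y has {len(y)} entries, expected n={n}")
    if any(value not in (0, 1) for value in y):
        problems.append("y must contain only 0 and 1")
    # x: one row per individual, p covariate columns (from the dictionary).
    if len(x) != n or any(len(row) != p for row in x):
        problems.append(f"x must be {n} x {p}")
    # z: assumed to be n rows by m exposure periods.
    if len(z) != n or any(len(row) != m for row in z):
        problems.append(f"z must be {n} x {m}")
    return problems


# Hypothetical toy data: 2 individuals, 3 exposure weeks, 2 covariates.
y = [1, 0]
x = [[1.0, 0.5], [1.0, 0.2]]
z = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
issues = check_dimensions(y, x, z, n=2, m=3, p=2)
```

A check like this is only a sanity test to run before model code; it says nothing about whether the values themselves are plausible.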

  6. Walmart Stocks Data 2025

    • kaggle.com
    zip
    Updated Feb 23, 2025
    Cite
    Mehar Shan Ali (2025). Walmart Stocks Data 2025 [Dataset]. https://www.kaggle.com/meharshanali/walmart-stocks-data-2025
    Explore at:
    Available download formats: zip (467062 bytes)
    Dataset updated
    Feb 23, 2025
    Authors
    Mehar Shan Ali
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📊 Walmart Stock Price Dataset & Exploratory Data Analysis (EDA)

    🏢 About Walmart

    Walmart Inc. is a multinational retail corporation that operates a chain of hypermarkets, discount department stores, and grocery stores. It is one of the world's largest companies by revenue and a key player in the retail sector. Walmart's stock is actively traded on major stock exchanges, making it an interesting subject for financial analysis.

    📌 Dataset Overview

    This dataset contains historical stock price data for Walmart, sourced directly from Yahoo Finance using the yfinance Python API. The data covers daily stock prices and includes multiple key financial indicators.

    📊 Features Included in the Dataset

    • Date 📅 – The trading day recorded.
    • Open Price 🟢 – Price at market open.
    • High Price 🔼 – Highest price of the day.
    • Low Price 🔽 – Lowest price of the day.
    • Close Price 🔴 – Price at market close.
    • Adjusted Close Price 📉 – Closing price adjusted for splits & dividends.
    • Trading Volume 📈 – Total shares traded.
    • Dividends 💰 – Cash payments to shareholders.
    • Stock Splits 🔄 – Records stock split events.

    🔍 Exploratory Data Analysis (EDA) Steps

    This notebook performs an extensive EDA to uncover insights into Walmart's stock price trends, volatility, and overall behavior in the stock market. The following analysis steps are included:

    1️⃣ Data Preprocessing & Cleaning

    • Load data using Pandas
    • Handle missing values (if any)
    • Check data types and format them properly
    • Convert date column into a datetime format

    2️⃣ Descriptive Statistics & Summary

    • Calculate key statistical measures like mean, median, standard deviation, and interquartile range (IQR)
    • Identify stock price trends over time
    • Check data distribution and skewness
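The summary measures listed in this step can be sketched with the Python standard library alone. The closing prices below are made up, the helper name `describe` is hypothetical, and the "inclusive" quartile convention is an assumption; other tools may use a different convention and report slightly different quartiles.

```python
from statistics import mean, median, quantiles, stdev


def describe(values):
    """Return mean, median, sample standard deviation, and IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean(values),
        "median": median(values),
        "std": stdev(values),  # sample standard deviation (n - 1)
        "iqr": q3 - q1,
    }


# Hypothetical closing prices, not taken from the dataset.
closes = [50.0, 52.0, 54.0, 56.0, 58.0]
summary = describe(closes)
```

Comparing the mean with the median gives a quick, informal hint about skewness: a mean well above the median suggests a right-skewed distribution.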

    3️⃣ Data Visualizations

    • 📉 Line Plot – Analyze trends in closing prices over time.
    • 📦 Box Plot – Detect potential outliers in stock prices.
    • 📊 Histogram – Understand the distribution of closing prices.
    • 📈 Moving Averages – Use short-term and long-term moving averages to observe stock trends.
    • 🔥 Correlation Heatmap – Find relationships between stock market indicators.

    4️⃣ Time Series Analysis

    • Identify trends and seasonality in the stock price data.
    • Calculate daily, weekly, and monthly returns.
    • Use rolling windows to analyze moving averages and volatility.
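The return and rolling-window calculations in this step can be sketched as follows. The prices are invented, the function names are hypothetical, and the sketch uses simple (not log) returns with a 3-observation window purely as an assumption for illustration.

```python
from statistics import mean, stdev


def simple_returns(prices):
    """Period-over-period simple returns: p_t / p_(t-1) - 1."""
    return [curr / prev - 1.0 for prev, curr in zip(prices, prices[1:])]


def rolling(values, window, func):
    """Apply func to each trailing window of the given length."""
    return [
        func(values[i - window + 1 : i + 1])
        for i in range(window - 1, len(values))
    ]


# Hypothetical daily closing prices.
closes = [100.0, 102.0, 101.0, 105.0, 107.0]
daily = simple_returns(closes)
moving_avg = rolling(closes, 3, mean)   # 3-day moving average of prices
volatility = rolling(daily, 3, stdev)   # 3-day rolling std of returns
```

Weekly or monthly returns follow the same pattern once prices are resampled to the last close of each week or month.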

    5️⃣ Insights & Conclusions

    • How volatile is Walmart’s stock over the given period?
    • Does the stock exhibit strong uptrends or downtrends?
    • Are there any strong correlations between features?
    • What insights can be drawn for investors and traders?

    🚀 Use Cases & Applications

    This dataset and analysis can be useful for:

    • 📡 Stock Market Analysis – Evaluating Walmart’s stock price trends and volatility.
    • 🏦 Investment Research – Assisting traders and investors in making informed decisions.
    • 🎓 Educational Purposes – Teaching data science and financial analysis using real-world stock data.
    • 📊 Algorithmic Trading – Developing trading strategies based on historical stock price trends.

    📥 Download the dataset and explore Walmart’s stock performance today! 🚀

  7. Descriptive statistics for time (mm:ss ± SD) to cessation of movement (COM)...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Mar 25, 2025
    Cite
    Kieffer, Justin D.; Miller, Jesse; Park, Janice Y.; Canturri, Albert; Torrisi, Dawn; Arruda, Andréia G.; Williams, Todd E.; Cheng, Ting-Yu; Bowman, Andrew S.; Culhane, Marie R.; Hougentogler, Daniel P.; Youngblood, Brad L.; Cressman, Michael D.; Campler, Magnus R.; Flory, Gary A.; Hunt, Lucia; Hill, Jeff (2025). Descriptive statistics for time (mm:ss ± SD) to cessation of movement (COM) and external activity (EA, milli-gravity (mg) [g = acceleration of gravity or 9.8 m s⁻²]) for quartiles Q1 and Q3 as well as the interquartile range (IQR [Q3-Q1]) for finisher pigs (N = 79) depopulated using water-based foam (WBF), nitrogen-foam (N2F), and carbon dioxide (CO2). [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002055241
    Explore at:
    Dataset updated
    Mar 25, 2025
    Authors
    Kieffer, Justin D.; Miller, Jesse; Park, Janice Y.; Canturri, Albert; Torrisi, Dawn; Arruda, Andréia G.; Williams, Todd E.; Cheng, Ting-Yu; Bowman, Andrew S.; Culhane, Marie R.; Hougentogler, Daniel P.; Youngblood, Brad L.; Cressman, Michael D.; Campler, Magnus R.; Flory, Gary A.; Hunt, Lucia; Hill, Jeff
    Description

    Descriptive statistics for time (mm:ss ± SD) to cessation of movement (COM) and external activity (EA, milli-gravity (mg) [g = acceleration of gravity or 9.8 m s⁻²]) for quartiles Q1 and Q3 as well as the interquartile range (IQR [Q3-Q1]) for finisher pigs (N = 79) depopulated using water-based foam (WBF), nitrogen-foam (N2F), and carbon dioxide (CO2).

  8. Summary Statistics for Temperature and other Meteorological Variables with...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Apr 25, 2014
    Cite
    Lee, Eunil; Lee, Suji; Kwon, Bo Yeon; Kim, Hana; Rha, Seung-Woon; Jung, Dea Ho; Jeong, Myung Ho; Jo, Kyung Hee; Park, Man Sik (2014). Summary Statistics for Temperature and other Meteorological Variables with the Level of Air pollutants in Study Areas. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001206001
    Explore at:
    Dataset updated
    Apr 25, 2014
    Authors
    Lee, Eunil; Lee, Suji; Kwon, Bo Yeon; Kim, Hana; Rha, Seung-Woon; Jung, Dea Ho; Jeong, Myung Ho; Jo, Kyung Hee; Park, Man Sik
    Description

    SD: Standard deviation. IQR: Interquartile range.

  9. Numpy , pandas and matplot lib practice

    • kaggle.com
    zip
    Updated Jul 16, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    pratham saraf (2023). Numpy , pandas and matplot lib practice [Dataset]. https://www.kaggle.com/datasets/prathamsaraf1389/numpy-pandas-and-matplot-lib-practise/suggestions
    Explore at:
    Available download formats: zip (385020 bytes)
    Dataset updated
    Jul 16, 2023
    Authors
    pratham saraf
    License

    https://cdla.io/permissive-1-0/

    Description

    The dataset has been created specifically for practicing Python, NumPy, Pandas, and Matplotlib. It is designed to provide a hands-on learning experience in data manipulation, analysis, and visualization using these libraries.

    Specifics of the Dataset:

    The dataset consists of 5000 rows and 20 columns, representing various features with different data types and distributions. The features include numerical variables with continuous and discrete distributions, categorical variables with multiple categories, binary variables, and ordinal variables. Each feature has been generated using different probability distributions and parameters to introduce variations and simulate real-world data scenarios. The dataset is synthetic and does not represent any real-world data. It has been created solely for educational purposes.

    One of the defining characteristics of this dataset is the intentional incorporation of various real-world data challenges:

    • Certain columns are randomly selected to be populated with NaN values, effectively simulating the common challenge of missing data.
    • The proportion of these missing values in each column varies randomly between 1% and 70%.
    • Statistical noise has been introduced in the dataset. For numerical values in some features, this noise adheres to a distribution with mean 0 and standard deviation 0.1.
    • Categorical noise is introduced in some features, with categories randomly altered in about 1% of the rows.
    • Outliers have also been embedded in the dataset, in line with the Interquartile Range (IQR) rule.
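A learner working with these challenges might start by measuring the missing share of a column and then applying the IQR rule to the observed values. The sketch below assumes the common 1.5 × IQR fences and the "inclusive" quartile convention; neither is stated in the description, and the column values and the name `iqr_outliers` are hypothetical.

```python
import math
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return observed values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    observed = [v for v in values if not math.isnan(v)]  # drop NaN first
    q1, _, q3 = quantiles(observed, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in observed if v < low or v > high]


# Hypothetical numeric column with missing values and one extreme value.
nan = float("nan")
column = [10.0, 12.0, nan, 11.0, 13.0, 12.0, 95.0, nan, 11.0]

missing_share = sum(math.isnan(v) for v in column) / len(column)
outliers = iqr_outliers(column)
```

Dropping NaN before computing quartiles matters here: the standard-library quantile routine sorts its input, and NaN values do not sort meaningfully.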

    Context of the Dataset:

    The dataset aims to provide a comprehensive playground for practicing Python, NumPy, Pandas, and Matplotlib. It allows learners to explore data manipulation techniques, perform statistical analysis, and create visualizations using the provided features. By working with this dataset, learners can gain hands-on experience in data cleaning, preprocessing, feature engineering, and visualization.

    Sources of the Dataset:

    The dataset has been generated programmatically using Python's random number generation functions and probability distributions. No external sources or real-world data have been used in creating this dataset.

  10. Life Expectancy WHO

    • kaggle.com
    zip
    Updated Jun 19, 2023
    Cite
    vikram amin (2023). Life Expectancy WHO [Dataset]. https://www.kaggle.com/datasets/vikramamin/life-expectancy-who
    Explore at:
    Available download formats: zip (121472 bytes)
    Dataset updated
    Jun 19, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The objective behind attempting this dataset was to understand the predictors that contribute to life expectancy around the world. I have used Linear Regression, Decision Tree and Random Forest for this purpose.

    Steps involved:

    • Read the csv file.
    • Data cleaning:
      • Variables Country and Status were showing as having character data types. These had to be converted to factor.
      • 2563 missing values were encountered, with the Population variable having the most missing values, i.e. 652.
      • Missing rows were dropped before we could run the analysis.
    • Run Linear Regression:
      • Before running linear regression, 3 variables were dropped as they were not found to have much of an effect on the dependent variable, i.e. Life Expectancy. These 3 variables were Country, Year & Status. This meant we are now working with 19 variables (1 dependent and 18 independent variables).
      • We run the linear regression. Multiple R squared is 83%, which means that the independent variables can explain 83% of the variance in the dependent variable.
    • OUTLIER DETECTION. We check for outliers using IQR and find 54 outliers. These outliers are then removed before we run the regression analysis once again. Multiple R squared increased from 83% to 86%.
    • MULTICOLLINEARITY. We check for multicollinearity using the VIF (Variance Inflation Factor). This is done in case two or more independent variables show high correlation. The rule of thumb is that variables with absolute VIF values above 5 should be removed. We find 6 variables that have a VIF value higher than 5, namely Infant.deaths, percentage.expenditure, Under.five.deaths, GDP, thinness1.19, thinness5.9. Infant deaths and Under Five deaths have strong collinearity, so we drop infant deaths (which has the higher VIF value).
    • When we run the linear regression model again, the VIF value of Under.Five.Deaths goes down from 211.46 to 2.74, while the other variables' VIF values reduce very little. Variable thinness1.19 is now dropped and we run the regression once more.
    • Variable thinness5.9, whose absolute VIF value was 7.61, has now dropped to 1.95. GDP and Population still have VIF values above 5, but I decided against dropping these as I consider them to be important independent variables.
    • SET THE SEED AND SPLIT THE DATA INTO TRAIN AND TEST DATA. We run the train data and get multiple R squared of 86% and a p value less than alpha, which indicates that it is statistically significant. We use the train data to predict the test data to find out the RMSE and MAPE. We run the library(Metrics) for this purpose.
    • In Linear Regression, RMSE (Root Mean Squared Error) is 3.2. This indicates that on average, the predicted values have an error of 3.2 years as compared to the actual life expectancy values.
    • MAPE (Mean Absolute Percentage Error) is 0.037. This indicates an accuracy prediction of 96.20% (1-0.037).
    • MAE (Mean Absolute Error) is 2.55. This indicates that on average, the predicted values deviate by approximately 2.83 years from the actual values.
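The three error measures reported above have simple definitions, sketched here in plain Python rather than with the R Metrics package the author used. The actual and predicted values are invented for illustration, and MAPE is returned as a fraction (for example 0.037), as in the figures quoted in the description.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction; actual must be nonzero."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical life expectancy values in years.
actual = [60.0, 70.0, 80.0]
predicted = [63.0, 68.0, 80.0]

errors = {
    "rmse": rmse(actual, predicted),
    "mae": mae(actual, predicted),
    "mape": mape(actual, predicted),
}
```

RMSE is never smaller than MAE on the same predictions, because squaring weights large errors more heavily; a wide gap between the two points to a few large misses.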

    We use DECISION TREE MODEL for the analysis.

    • Run the required libraries (rpart, rpart.plot, RColorBrewer, rattle).
    • We run the decision tree analysis using rpart and plot the tree. We use fancyRpartPlot.
    • We use 5 fold cross validation method with CP (complexity parameter) being 0.01.
    • In Decision Tree, RMSE (Root Mean Squared Error) is 3.06. This indicates that on average, the predicted values have an error of 3.06 years as compared to the actual life expectancy values.
    • MAPE (Mean Absolute Percentage Error) is 0.035. This indicates an accuracy prediction of 96.45% (1-0.035).
    • MAE (Mean Absolute Error) is 2.35. This indicates that on average, the predicted values deviate by approximately 2.35 years from the actual values.
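The 5-fold cross-validation mentioned above partitions the rows into five test folds, each used once while the remaining rows train the model. The sketch below shows only the index bookkeeping; it is not what the author's R tooling does internally, the function name is hypothetical, and it uses contiguous folds without shuffling as a simplifying assumption.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train, test) index lists for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_rows) if i < start or i >= stop]
        yield train, test
        start = stop


# Hypothetical dataset of 12 rows split into 5 folds.
folds = list(k_fold_indices(12, k=5))
fold_lengths = [len(test) for _, test in folds]
```

In practice the row order is usually shuffled with a fixed seed before splitting, so that folds do not inherit any ordering present in the file.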

    We use RANDOM FOREST for the analysis.

    • Run library(randomForest)
    • We use varImpPlot to find out which variables are most significant and least significant. Income composition is the most important followed by adult mortality and the least relevant independent variable is Population.
    • Predict Life expectancy through random forest model.
    • In Random Forest, RMSE (Root Mean Squared Error) is 1.73. This indicates that on average, the predicted values have an error of 1.73 years as compared to the actual life expectancy values.
    • MAPE (Mean Absolute Percentage Error) is 0.01. This indicates an accuracy prediction of 98.27% (1-0.01).
    • MAE (Mean Absolute Error) is 1.14. This indicates that on average, the predicted values deviate by approximately 1.14 years from the actual values.

    Conclusion: Random Forest is the best model for predicting the life expectancy values as it has the lowest RMSE, MAPE and MAE.

  11. Data from: Public supply, self-supplied domestic, irrigation, and...

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Nov 21, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Public supply, self-supplied domestic, irrigation, and thermoelectric water-use data from 5-year compilation datasets from 1985 to 2015 used to assess data variability and uncertainty [Dataset]. https://catalog.data.gov/dataset/public-supply-self-supplied-domestic-irrigation-and-thermoelectric-water-use-data-from-5-y
    Explore at:
    Dataset updated
    Nov 21, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    The U.S. Geological Survey (USGS) National Water Use Program is responsible for compiling and disseminating the Nation's water-use data. Working in cooperation with local, State, and Federal agencies, the USGS has published an estimate of water use in the United States every 5 years, beginning in 1950. These 5-year compilations contain water-use estimates that are aggregated to the county level in the United States. This USGS data release contains summaries of method codes used in the 2015 national compilation of public supply, self-supplied domestic, thermoelectric, and irrigation water-use data. This data release also contains the county-level water-use estimates that support the evaluations in Luukkonen and others (2021). Finally, this data release contains summaries of regional medians and interquartile ranges from 1985 to 2015 that were used to highlight areas of unexpected variability, consistency and/or potential values that warrant further investigation. This data release supports the following publication: Luukkonen, C.L., Belitz, K., Sullivan, S.L., and Sargent, P., 2021, Factors affecting uncertainty of public supply, self-supplied domestic, irrigation, and thermoelectric water-use data, 1985-2015-evaluation of information sources, estimation methods, and data variability: U.S. Geological Survey Scientific Investigations Report 2021-5082, 78 p., https://doi.org/10.3133/sir20215082.

  12. Hypertension Treatment Clinical Trial Dataset

    • kaggle.com
    zip
    Updated Mar 10, 2025
    Cite
    Isabella D (2025). Hypertension Treatment Clinical Trial Dataset [Dataset]. https://www.kaggle.com/datasets/isabelladil/phase-iii-clinical-trial-dataset
    Explore at:
    Available download formats: zip (14424 bytes)
    Dataset updated
    Mar 10, 2025
    Authors
    Isabella D
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    Synthetic Clinical Trial Dataset – Hypertension Drug Trial (CardioX vs. Active Comparator vs. Placebo)
    📝 About This Dataset

    This synthetic dataset simulates a Phase III randomized controlled clinical trial evaluating CardioX (Drug A) versus an active comparator (Drug B) and a placebo for treating hypertension. It is designed for clinical data analysis, anomaly detection, and risk-based monitoring (RBM) applications.

    The dataset includes 1,000 patients across 50 trial sites, with realistic patient demographics, blood pressure readings, cholesterol levels, dropout rates, and adverse event reporting. Several anomalies have been embedded to simulate real-world data quality issues commonly encountered in clinical trials.

    This dataset is ideal for data quality assessments, statistical anomaly detection (Z-scores, IQR, clustering), and risk-based monitoring (RBM) in clinical research.

    🚀 Potential Use Cases

    🔹 Clinical Trial Data Analysis – Investigate treatment efficacy and safety trends.

    🔹 Anomaly Detection – Apply Z-scores, IQR, and ML-based clustering methods to identify outliers.

    🔹 Risk-Based Monitoring (RBM) – Detect potential site-level risks and data inconsistencies.

    🔹 Machine Learning Applications – Train models for adverse event prediction or dropout risk estimation.

    📊 Dataset Features
    | Column Name | Description |
    | :--- | :--- |
    | Patient_ID | Unique identifier for each trial participant. |
    | Site_ID | Site where the patient was enrolled (1-50). |
    | Age | Patient age (in years). |
    | Gender | Male or Female. |
    | Enrollment_Date | Date when the patient was enrolled in the study. |
    | Treatment_Group | Assigned treatment: Placebo, Drug A (CardioX), or Drug B (Active Comparator). |
    | Adverse_Events | Number of adverse events (AEs) reported by the patient. |
    | Dropout | Whether the patient dropped out of the study (1 = Yes, 0 = No). |
    | Systolic_BP | Systolic Blood Pressure (mmHg). |
    | Diastolic_BP | Diastolic Blood Pressure (mmHg). |
    | Cholesterol_Level | Total cholesterol level (mg/dL). |

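
The IQR and Z-score outlier checks named in the use cases can be sketched as follows. This is a minimal illustration, not the dataset author's method: the sample readings are invented, the function names are hypothetical, and the cut-offs (1.5 × IQR, |z| > 3) are common conventions assumed here rather than values stated in the dataset description.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) / sd > threshold]


# Invented Systolic_BP readings (mmHg) with one implausible value.
systolic_bp = [118, 122, 125, 130, 128, 135, 140, 132, 127, 260]

print(iqr_outliers(systolic_bp))                    # flags 260
print(zscore_outliers(systolic_bp))                 # flags nothing at |z| > 3
print(zscore_outliers(systolic_bp, threshold=2.5))  # flags 260
```

In a sample this small the extreme value inflates the standard deviation, so the Z-score rule at the usual threshold of 3 misses a reading that the IQR rule catches. That is one reason the two checks are often run side by side.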
    📢 Acknowledgment & Licensing

    This dataset is fully synthetic and does not contain real patient data. It is created for educational, analytical, and research purposes in clinical data science and biostatistics.

    🔗 If you use this dataset, tag me! Let’s discuss insights & findings! 🚀

  13. Data used to support a meta-analysis investigating ecological effects of...

    • dataverse.scholarsportal.info
    • borealisdata.ca
    • +1 more
    csv, txt
    Updated Nov 20, 2019
    Scholars Portal Dataverse (2019). Data used to support a meta-analysis investigating ecological effects of urban lawn management [Dataset]. http://doi.org/10.5683/SP2/RRJTEN
    Available download formats: txt (7468), csv (8029)
    Dataset updated
    Nov 20, 2019
    Dataset provided by
    Scholars Portal Dataverse
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2002 - Dec 31, 2018
    Area covered
    Trois-Rivières, Quebec, Springfield, United States, MA, Finland, Helsinki, Reading, United Kingdom, Tubingen, Germany, United Kingdom, Bracknell, Rennes, France
    Description

    These data support a meta-analysis investigating ecological impacts of intense lawn management (mowing). Raw data on invertebrate abundance and temperature were collected by Léonie Carignan-Guillemette (2018) and Caroline Turcotte (2017) under the supervision of Raphaël Proulx and Vincent Maire (refer to Appendix S1 within the related publication for more information). Other data were gathered and processed as follows: We searched the Scopus database on 8 February, 2019 with the following combinations of keywords: (lawn OR turf) AND mowing AND (urban OR city). Generally, studies were ineligible when: the full text of the article was not available even after contacting the authors; mowing was incidental to the study and not an experimental factor; response variables were not ecologically relevant; confounding factors (e.g. fertilisation) could not be isolated; a non-urban context was used; or simulated data were presented. We extracted the mean and statistical variation (standard deviation or standard error) for each response variable in control (less-intensively mown) and treatment (intensively mown) groups. Reported data were used when available. Otherwise, data were extracted from published figures using the Web Plot Digitizer tool. Where summary data were presented as median and interquartile range, the mean and standard deviation were estimated. Variables with multi-temporal data (e.g. soil moisture) were summarised using the mean and pooled standard deviation to provide an aggregated value per site per year. Where seasonal trends were evident in raw multi-temporal data (e.g. soil temperature), data were detrended using a polynomial function and the analysis was applied to the residuals.
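
The description does not say which formulas were used to estimate a mean and standard deviation from a median and interquartile range, or to pool standard deviations. The sketch below shows one common choice for each, under the assumption of roughly normal data: the mean is approximated by the average of the three quartiles, the SD by IQR / 1.35, and the pooled SD by the degrees-of-freedom-weighted average of the variances. Function names are hypothetical.

```python
import math


def mean_sd_from_quartiles(q1, median, q3):
    """Approximate mean and SD from quartiles, assuming near-normal data.

    For a normal distribution the IQR spans about 1.35 standard deviations.
    """
    mean = (q1 + median + q3) / 3
    sd = (q3 - q1) / 1.35
    return mean, sd


def pooled_sd(sds, ns):
    """Pool group SDs, weighting each variance by its degrees of freedom."""
    numerator = sum((n - 1) * s ** 2 for s, n in zip(sds, ns))
    denominator = sum(n - 1 for n in ns)
    return math.sqrt(numerator / denominator)


# Illustrative values only (not taken from the dataset).
mean, sd = mean_sd_from_quartiles(q1=10.0, median=13.0, q3=16.0)
print(mean, sd)
print(pooled_sd(sds=[2.0, 3.0], ns=[12, 8]))
```

These approximations degrade for skewed distributions and very small samples, so they should be read as a rough stand-in for whatever estimator the authors actually applied.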

  14. Data from: Acceptability of short message service reminders as the support...

    • search.dataone.org
    • datadryad.org
    Updated Oct 28, 2025
    Laban Muteebwa; Edith Nakku Joloba; Joan Nangendo; Dan Muramuzi; Faith Akello; Sabrina Kitaka Bakeera; Fred Collins Semitala; Aggrey S. Semeere; Charles Karamagi (2025). Acceptability of short message service reminders as the support tool for PrEP adherence among young women in Mukono district, Uganda [Dataset]. http://doi.org/10.5061/dryad.cvdncjt8h
    Dataset updated
    Oct 28, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Laban Muteebwa; Edith Nakku Joloba; Joan Nangendo; Dan Muramuzi; Faith Akello; Sabrina Kitaka Bakeera; Fred Collins Semitala; Aggrey S. Semeere; Charles Karamagi
    Area covered
    Mukono, Uganda
    Description

    Adolescent girls and young women (AGYW) have a disproportionately high incidence of HIV compared to males of the same age in Uganda. AGYW are a priority sub-group for daily oral Pre-Exposure Prophylaxis (PrEP), but their adherence has consistently remained low. Short Message Service (SMS) reminders could improve adherence to PrEP in AGYW. However, there is a paucity of literature about the acceptability of SMS reminders among AGYW using PrEP. We assessed the level of acceptability of SMS reminders as a PrEP adherence support tool and the associated factors, among AGYW in Mukono district, Central Uganda. We consecutively enrolled AGYW using PrEP in Mukono district in a cross-sectional study. A structured pre-tested questionnaire was administered to participants by three trained research assistants. Data were analyzed in STATA 17.0; continuous variables were summarized using median and interquartile range (IQR) while categorical variables were summarized using frequencies and percentages....

    The data set was collected through a researcher-administered questionnaire. The main dependent variable was acceptability of SMS reminders. This was measured using the seven constructs derived from the Theoretical Framework of Acceptability (TFA)(1). These include affective attitude, burden, perceived effectiveness, ethicality, intervention coherence, opportunity costs, and self-efficacy. A 5-point Likert item question per construct was used and each level of a Likert scale was given a weight ranging from one to five. The summated scores from the weights assigned to each response were computed. The obtained summated acceptability score was then dichotomized using the 50th percentile of the possible summated scores, which range from 7 to 35 (the 50th percentile is 21). Therefore “Acceptability of SMS reminders” was defined as a value greater than 21. The independent variables were captured as described in the attached data dictionary. Data analysis was performed in STATA versi...

    The participants gave written informed consent to publish de-identified data in accordance with the Uganda National Council for Science and Technology (UNCST), a local human participant research regulator. Identifying characteristics such as numerical age and physical address were redacted.

    # Acceptability of short message service reminders as the support tool for PrEP adherence among young women in Mukono district, Uganda

    Dataset DOI: 10.5061/dryad.cvdncjt8h

    Description of the data and file structure

    In this dataset, we aimed to assess the acceptability of short message service (SMS) reminders among Adolescent Girls and Young Women (AGYW) prescribed Pre-Exposure Prophylaxis (PrEP). We also measured demographic and other individual factors.

    Files and variables

    File: Manuscript_dataset.dta

    Description: This section describes the variables included in the dataset (data dictionary)

    | Variable Name | Variable type | Variable Label | Value Labels | | :------------------- | :------------ | :---------------------------------...
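
The acceptability scoring rule described in this entry's methods (seven TFA constructs, each weighted 1 to 5, summed to a score from 7 to 35, and classed as acceptable when the sum exceeds 21) can be sketched as below. The construct keys and the function name are hypothetical; the real variable names live in the dataset's data dictionary, which is truncated here.

```python
# Hypothetical keys for the seven TFA constructs named in the methods.
CONSTRUCTS = (
    "affective_attitude",
    "burden",
    "perceived_effectiveness",
    "ethicality",
    "intervention_coherence",
    "opportunity_costs",
    "self_efficacy",
)


def acceptability(responses, cutoff=21):
    """Return (summated score, accepted?) for one participant.

    `responses` maps each construct to a Likert weight from 1 to 5.
    A participant counts as accepting when the score is greater than `cutoff`.
    """
    if set(responses) != set(CONSTRUCTS):
        raise ValueError("responses must cover exactly the seven constructs")
    if any(not 1 <= weight <= 5 for weight in responses.values()):
        raise ValueError("each Likert weight must be between 1 and 5")
    score = sum(responses.values())
    return score, score > cutoff


neutral = {name: 3 for name in CONSTRUCTS}        # score 21, not above cutoff
print(acceptability(neutral))
print(acceptability({**neutral, "burden": 4}))    # score 22, above cutoff
```

Because the rule uses a strict inequality, a participant who answers with the midpoint on every item lands exactly on 21 and is classed as not accepting.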

  15. Data from: Sharing of clinical trial data and results reporting practices...

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    Updated Jun 1, 2022
    Jennifer Miller; Joseph S. Ross; Marc Wilenzick; Michelle M. Mello (2022). Data from: Sharing of clinical trial data and results reporting practices among large pharmaceutical companies: cross sectional descriptive study and pilot of a tool to improve company practices [Dataset]. http://doi.org/10.5061/dryad.k81584t
    Dataset updated
    Jun 1, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jennifer Miller; Joseph S. Ross; Marc Wilenzick; Michelle M. Mello
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Objectives: To develop and pilot a tool to measure and improve pharmaceutical companies' clinical trial data sharing policies and practices.

    Design: Cross sectional descriptive analysis.

    Setting: Large pharmaceutical companies with novel drugs approved by the US Food and Drug Administration in 2015.

    Data sources: Data sharing measures were adapted from 10 prominent data sharing guidelines from expert bodies and refined through a multi-stakeholder deliberative process engaging patients, industry, academics, regulators, and others. Data sharing practices and policies were assessed using data from ClinicalTrials.gov, Drugs@FDA, corporate websites, data sharing platforms and registries (eg, the Yale Open Data Access (YODA) Project and Clinical Study Data Request (CSDR)), and personal communication with drug companies.

    Main outcome measures: Company level, multicomponent measure of accessibility of participant level clinical trial data (eg, analysis ready dataset and metadata); drug and trial level measures of registration, results reporting, and publication; company level overall transparency rankings; and feasibility of the measures and ranking tool to improve company data sharing policies and practices.

    Results: Only 25% of large pharmaceutical companies fully met the data sharing measure. The median company data sharing score was 63% (interquartile range 58-85%). Given feedback and a chance to improve their policies to meet this measure, three companies made amendments, raising the percentage of companies in full compliance to 33% and the median company data sharing score to 80% (73-100%). The most common reasons companies did not initially satisfy the data sharing measure were failure to share data by the specified deadline (75%) and failure to report the number and outcome of their data requests. Across new drug applications, a median of 100% (interquartile range 91-100%) of trials in patients were registered, 65% (36-96%) reported results, 45% (30-84%) were published, and 95% (69-100%) were publicly available in some form by six months after FDA drug approval. When examining results on the drug level, less than half (42%) of reviewed drugs had results for all their new drug applications trials in patients publicly available in some form by six months after FDA approval.

    Conclusions: It was feasible to develop a tool to measure data sharing policies and practices among large companies and have an impact in improving company practices. Among large companies, 25% made participant level trial data accessible to external investigators for new drug approvals in accordance with the current study's measures; this proportion improved to 33% after applying the ranking tool. Other measures of trial transparency were higher. Some companies, however, have substantial room for improvement on transparency and data sharing of clinical trials.

  16. Descriptive statistics of biomarker measurements.

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Mar 10, 2014
    + more versions
    Wiley, Laura; Kingston, Andrew; Chinnery, Patrick F.; Martin-Ruiz, Carmen; Catt, Michael; Collerton, Joanna; von Zglinicki, Thomas; Ashok, Deepthi; Davies, Karen; Talbot, Duncan C. S.; Jagger, Carol; Kirkwood, Thomas B. L. (2014). Descriptive statistics of biomarker measurements. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001195388
    Dataset updated
    Mar 10, 2014
    Authors
    Wiley, Laura; Kingston, Andrew; Chinnery, Patrick F.; Martin-Ruiz, Carmen; Catt, Michael; Collerton, Joanna; von Zglinicki, Thomas; Ashok, Deepthi; Davies, Karen; Talbot, Duncan C. S.; Jagger, Carol; Kirkwood, Thomas B. L.
    Description

    (SE: Standard error, IQR: Interquartile range, n: Number of participants, P1: Baseline (phase 1)).

  17. Summary of Participant and Data Characteristics.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated May 1, 2014
    Hernandez-Andrade, Edgar; Dassanayake, Maya T.; Yeo, Lami; Marusak, Hilary A.; Berman, Susan; Shastri, Rupal; Hassan, Sonia S.; Brown, Jesse A.; Mody, Swati; Romero, Roberto; Thomason, Moriah E. (2014). Summary of Participant and Data Characteristics. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001212236
    Dataset updated
    May 1, 2014
    Authors
    Hernandez-Andrade, Edgar; Dassanayake, Maya T.; Yeo, Lami; Marusak, Hilary A.; Berman, Susan; Shastri, Rupal; Hassan, Sonia S.; Brown, Jesse A.; Mody, Swati; Romero, Roberto; Thomason, Moriah E.
    Description

    Younger fetuses are defined as GA <31 weeks; older fetuses are defined as GA ≥31 weeks. * denotes significant p-values. Abbreviations: GA, gestational age; MRI, magnetic resonance imaging; M, male; F, female; SD, standard deviation; IQR, interquartile range.

  18. Student Academic Performance (Synthetic Dataset)

    • kaggle.com
    zip
    Updated Oct 10, 2025
    Mamun Hasan (2025). Student Academic Performance (Synthetic Dataset) [Dataset]. https://www.kaggle.com/datasets/mamunhasan2cs/student-academic-performance-synthetic-dataset
    Available download formats: zip (9287 bytes)
    Dataset updated
    Oct 10, 2025
    Authors
    Mamun Hasan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is a synthetic collection of student performance data created for data preprocessing, cleaning, and analysis practice in Data Mining and Machine Learning courses. It contains information about 1,020 students, including their study habits, attendance, and test performance, with intentionally introduced missing values, duplicates, and outliers to simulate real-world data issues.

    The dataset is suitable for laboratory exercises, assignments, and demonstration of key preprocessing techniques such as:

    • Handling missing values
    • Removing duplicates
    • Detecting and treating outliers
    • Data normalization and transformation
    • Encoding categorical variables
    • Exploratory data analysis (EDA)
    • Regression Analysis

    📊 Columns Description

    | Column Name | Description |
    | :--- | :--- |
    | Student_ID | Unique identifier for each student (e.g., S0001, S0002, …) |
    | Age | Age of the student (between 18 and 25 years) |
    | Gender | Gender of the student (Male/Female) |
    | Study_Hours | Average number of study hours per day (contains missing values and outliers) |
    | Attendance(%) | Percentage of class attendance (contains missing values) |
    | Test_Score | Final exam score (0–100 scale) |
    | Grade | Letter grade derived from test scores (F, C, B, A, A+) |

    🧠 Example Lab Tasks Using This Dataset:

    • Identify and impute missing values using mean/median.
    • Detect and remove duplicate records.
    • Use IQR or Z-score methods to handle outliers.
    • Normalize Study_Hours and Test_Score using Min-Max scaling.
    • Encode categorical variables (Gender, Grade) for model input.
    • Prepare a clean dataset ready for classification/regression analysis.
    • Can be used for Limited Regression
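
A few of the lab tasks above (median imputation, duplicate removal, Min-Max scaling) can be sketched with the Python standard library alone. This is an illustration under assumptions: rows are represented as plain dictionaries, missing values as `None`, and the sample records are invented rather than read from the actual file.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def drop_duplicates(rows):
    """Keep the first occurrence of each fully identical row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Invented records using the column names from the table above.
rows = [
    {"Student_ID": "S0001", "Study_Hours": 2.0, "Test_Score": 55},
    {"Student_ID": "S0001", "Study_Hours": 2.0, "Test_Score": 55},  # duplicate
    {"Student_ID": "S0002", "Study_Hours": None, "Test_Score": 70},
    {"Student_ID": "S0003", "Study_Hours": 6.0, "Test_Score": 85},
]

clean = drop_duplicates(rows)
hours = impute_median([r["Study_Hours"] for r in clean])
print(hours)
print(min_max_scale(hours))
```

Removing duplicates before imputing keeps repeated rows from pulling the median toward their own values.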

    🎯 Possible Regression Targets

    Test_Score → Predict test score based on study hours, attendance, age, and gender.

    🧩 Example Regression Problem

    Predict the student’s test score using their study hours, attendance percentage, and age.

    🧠 Sample Features: X = ['Age', 'Gender', 'Study_Hours', 'Attendance(%)'], y = ['Test_Score']

    You can use:

    • Linear Regression (for simplicity)
    • Polynomial Regression (to explore nonlinear patterns)
    • Decision Tree Regressor or Random Forest Regressor

    And analyze feature influence using correlation or SHAP/LIME explainability.
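
As a minimal sketch of the linear regression option, the snippet below fits ordinary least squares with a single predictor (Study_Hours) by hand. It is a simplification of the multi-feature problem described above, the data points are invented, and the function name is hypothetical. A full model with several features would normally use a numerical library instead.

```python
def fit_simple_linear_regression(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Invented example: study hours per day and final test scores.
study_hours = [1.0, 2.0, 3.0, 4.0, 5.0]
test_score = [52.0, 58.0, 65.0, 70.0, 77.0]

intercept, slope = fit_simple_linear_regression(study_hours, test_score)
predicted = intercept + slope * 3.5  # predicted score for 3.5 study hours
print(intercept, slope, predicted)
```

The slope reads as the expected change in test score per additional study hour, which only makes sense after the outliers and missing values in Study_Hours have been handled.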

  19. Demographic Data in the moxibustion and usual care Groups.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jul 25, 2014
    Choi, Jin-Bong; Jung, Hee-Jung; Kang, Kyung-Won; Kim, Tae-Hun; Kim, Hyeong Jun; Shin, Mi-Suk; Kim, Joo-Hee; Kim, Kun Hyung; Kang, Jung Won; Kim, Ae-Ran; Song, Ho Sueb; Kim, Jung Eun; Lee, MinHee; Hong, Kwon Eui; Lee, Seunghoon; Park, Hyo-Ju; Jung, So-Young; Choi, Sun-Mi (2014). Demographic Data in the moxibustion and usual care Groups. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001187308
    Dataset updated
    Jul 25, 2014
    Authors
    Choi, Jin-Bong; Jung, Hee-Jung; Kang, Kyung-Won; Kim, Tae-Hun; Kim, Hyeong Jun; Shin, Mi-Suk; Kim, Joo-Hee; Kim, Kun Hyung; Kang, Jung Won; Kim, Ae-Ran; Song, Ho Sueb; Kim, Jung Eun; Lee, MinHee; Hong, Kwon Eui; Lee, Seunghoon; Park, Hyo-Ju; Jung, So-Young; Choi, Sun-Mi
    Description

    *The Wilcoxon rank sum test was used for statistical analysis. †The Chi-squared test was used for statistical analysis. ‡The t-test was used for statistical analysis. §Fisher's exact test was used for statistical analysis. IQR: interquartile range.

  20. A decade-long analysis of trends in antimicrobial resistance at a...

    • datadryad.org
    • search.dataone.org
    zip
    Updated Aug 27, 2024
    Ajaya Basnet (2024). A decade-long analysis of trends in antimicrobial resistance at a neurosurgical hospital in Kathmandu, Nepal [Dataset]. http://doi.org/10.5061/dryad.zpc866thj
    Available download formats: zip
    Dataset updated
    Aug 27, 2024
    Dataset provided by
    Dryad
    Authors
    Ajaya Basnet
    Time period covered
    Aug 14, 2024
    Area covered
    Nepal, Kathmandu
    Description

    In the patient information sheet, outcome variables [bacterial pathogens and viral-bacterial coinfections (simultaneous occurrences)] and predictor variables (patient demographics, time frame, specimen type, type of bacterial isolate(s), and antimicrobial susceptibility patterns) were collected from the hospital records. The data were anonymized to ensure patient confidentiality. Data were entered and managed using Microsoft Excel, version 13.0, and analyzed using Statistical Package for Social Sciences (SPSS), version 17.0. Descriptive data were analyzed in terms of frequency and percentage. Quantitative data were reported as mean, median, and interquartile range (IQR). Qualitative variables were analyzed using the Chi-square test, while quantitative variables were analyzed using the independent Student's t-test, with statistical significance determined at a p-value of <0.05 within a 95% confidence interval (CI).
