100+ datasets found
  1. An example data set for exploration of Multiple Linear Regression

    • data.usgs.gov
    • catalog.data.gov
    Updated Feb 24, 2024
    Cite
    William Farmer (2024). An example data set for exploration of Multiple Linear Regression [Dataset]. http://doi.org/10.5066/P9T5ZEXV
    Explore at:
    Dataset updated
    Feb 24, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    William Farmer
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Time period covered
    1956 - 2016
    Description

    This data set contains example data for exploring the theory of regression-based regionalization. The 90th percentile of annual maximum streamflow is provided as an example response variable for 293 streamgages in the conterminous United States. Several explanatory variables are drawn from the GAGES-II database to demonstrate how multiple linear regression is applied. Example scripts demonstrate how to collect the original streamflow data provided and how to recreate the figures from the associated Techniques and Methods chapter.

  2. Data from: Data for multiple linear regression models for predicting...

    • catalog.data.gov
    • data.usgs.gov
    • +2 more
    Updated Nov 19, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Data for multiple linear regression models for predicting microcystin concentration action-level exceedances in selected lakes in Ohio [Dataset]. https://catalog.data.gov/dataset/data-for-multiple-linear-regression-models-for-predicting-microcystin-concentration-action
    Explore at:
    Dataset updated
    Nov 19, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Ohio
    Description

    Site-specific multiple linear regression models were developed for eight sites in Ohio (six in the Western Lake Erie Basin and two in northeast Ohio on inland reservoirs) to quickly predict action-level exceedances for a cyanotoxin, microcystin, in recreational and drinking waters used by the public. Real-time models include easily or continuously measured factors that do not require that a sample be collected. Real-time models are presented in two categories: (1) six models with continuous monitor data, and (2) three models with on-site measurements. Real-time models commonly included variables such as phycocyanin, pH, specific conductance, and streamflow or gage height. Many of the real-time factors were averages over time periods antecedent to the time the microcystin sample was collected, including water-quality data compiled from continuous monitors. Comprehensive models use a combination of discrete sample-based measurements and real-time factors. Comprehensive models were useful at some sites with lagged variables (< 2 weeks) for cyanobacterial toxin genes, dissolved nutrients, and (or) N to P ratios. Comprehensive models are presented in three categories: (1) three models with continuous monitor data and lagged comprehensive variables, (2) five models with no continuous monitor data and lagged comprehensive variables, and (3) one model with continuous monitor data and same-day comprehensive variables. Funding for this work was provided by the Ohio Water Development Authority and the U.S. Geological Survey Cooperative Water Program.

  3. polynomial regression

    • kaggle.com
    Updated Jul 5, 2023
    + more versions
    Cite
    Miraj Deep Bhandari (2023). polynomial regression [Dataset]. http://doi.org/10.34740/kaggle/ds/3482232
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 5, 2023
    Dataset provided by
    Kaggle
    Authors
    Miraj Deep Bhandari
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Ice Cream Selling dataset is a simple and well-suited dataset for beginners in machine learning who are looking to practice polynomial regression. It consists of two columns: temperature and the corresponding number of units of ice cream sold.

    The dataset captures the relationship between temperature and ice cream sales. It serves as a practical example for understanding and implementing polynomial regression, a powerful technique for modeling nonlinear relationships in data.

    The dataset is designed to be straightforward and easy to work with, making it ideal for beginners. The simplicity of the data allows beginners to focus on the fundamental concepts and steps involved in polynomial regression without overwhelming complexity.

    By using this dataset, beginners can gain hands-on experience in preprocessing the data, splitting it into training and testing sets, selecting an appropriate degree for the polynomial regression model, training the model, and evaluating its performance. They can also explore techniques to address potential challenges such as overfitting.

    With this dataset, beginners can practice making predictions of ice cream sales based on temperature inputs and visualize the polynomial regression curve that represents the relationship between temperature and ice cream sales.

    Overall, the Ice Cream Selling dataset provides an accessible and practical learning resource for beginners to grasp the concepts and techniques of polynomial regression in the context of analyzing ice cream sales data.
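
    The workflow described above (split the data, try several polynomial degrees, compare them on held-out data) can be sketched as follows. This is a minimal illustration, not the dataset author's code: the data are a synthetic noisy quadratic rather than the Kaggle file, and the names `fit_and_score`, `temperature`, and `units_sold` are hypothetical.

```python
import numpy as np


def fit_and_score(x_train, y_train, x_test, y_test, degree):
    """Fit a polynomial of the given degree; return (coefficients, test RMSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    pred = np.polyval(coeffs, x_test)
    rmse = float(np.sqrt(np.mean((y_test - pred) ** 2)))
    return coeffs, rmse


# Synthetic stand-in data (an assumption, NOT the real dataset): noisy quadratic.
rng = np.random.default_rng(0)
temperature = np.linspace(-5.0, 5.0, 60)
units_sold = 3.0 * temperature**2 + 2.0 + rng.normal(0.0, 1.0, temperature.size)

# Shuffle the indices and hold out 25% of the points for testing.
order = rng.permutation(temperature.size)
test_idx, train_idx = order[:15], order[15:]

# Compare candidate degrees by their error on the held-out points.
scores = {}
for degree in (1, 2, 3):
    _, scores[degree] = fit_and_score(
        temperature[train_idx], units_sold[train_idx],
        temperature[test_idx], units_sold[test_idx],
        degree,
    )

best_degree = min(scores, key=scores.get)
print(scores, best_degree)
```

    Choosing the degree by held-out error rather than training error is one simple guard against the overfitting mentioned in the description.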

  4. Panel dataset on Brazilian fuel demand

    • data.mendeley.com
    Updated Oct 7, 2024
    Cite
    Sergio Prolo (2024). Panel dataset on Brazilian fuel demand [Dataset]. http://doi.org/10.17632/hzpwbp7j22.1
    Explore at:
    Dataset updated
    Oct 7, 2024
    Authors
    Sergio Prolo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Brazil
    Description

    Summary: Fuel demand is shown to be influenced by fuel prices, income, and motorization rates. We explore the effect of electric vehicle motorization rates on gasoline demand using this panel dataset.

    Files: dataset.csv - Panel dimensions are the Brazilian state (i) and year (t). The other columns are: gasoline sales per capita (ln_Sg_pc), prices of gasoline (ln_Pg) and ethanol (ln_Pe) and their lags, motorization rates of combustion vehicles (ln_Mi_c) and electric vehicles (ln_Mi_e), and GDP per capita (ln_gdp_pc). All variables are in natural logs, since we use them to estimate demand elasticities in a regression model.

    adjacency.csv - The adjacency matrix used in interaction with electric vehicles' motorization rates to calculate spatial effects. At first, it follows a binary adjacency formula: for each pair of states i and j, the cell (i, j) is 0 if the states are not adjacent and 1 if they are. Then, each row is normalized to have sum equal to one.
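
    The two-step construction described for adjacency.csv (binary adjacency, then row normalization) can be sketched as below. This is an illustrative example on a made-up three-state adjacency pattern, not the contents of the actual file; the function name `row_normalize` and the handling of rows with no neighbours are assumptions.

```python
import numpy as np


def row_normalize(adjacency):
    """Scale each row of a binary adjacency matrix so that it sums to one."""
    adjacency = np.asarray(adjacency, dtype=float)
    row_sums = adjacency.sum(axis=1, keepdims=True)
    # Assumption: a row with no neighbours stays all zeros (avoids dividing by zero).
    return np.divide(
        adjacency, row_sums,
        out=np.zeros_like(adjacency),
        where=row_sums > 0,
    )


# Hypothetical example: state 0 borders states 1 and 2; states 1 and 2 border only state 0.
binary = np.array([
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
])
weights = row_normalize(binary)
print(weights)
```

    With row-normalized weights, multiplying the matrix by a vector of neighbouring states' values gives each state the average over its neighbours.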

    regression.do - Series of Stata commands used to estimate the regression models of our study. dataset.csv must be imported first; see the comment section.

    dataset_predictions.xlsx - Based on the Stata estimates, we use this Excel file to make average predictions by year and by state. By including years beyond the last panel sample, we also forecast the model into the future and evaluate the effects of different policies that influence gasoline prices (taxation) and EV motorization rates (electrification). This file is primarily used to create images, but can also help in understanding how the forecasting scenarios are set up.

    Sources: Fuel prices and sales: ANP (https://www.gov.br/anp/en/access-information/what-is-anp/what-is-anp); State population, GDP and vehicle fleet: IBGE (https://www.ibge.gov.br/en/home-eng.html?lang=en-GB); State EV fleet: Anfavea (https://anfavea.com.br/en/site/anuarios/)

  5. Regression analysis in Galaxy with car purchase price prediction dataset

    • data.niaid.nih.gov
    Updated Aug 4, 2022
    Cite
    Kaivan Kamali (2022). Regression analysis in Galaxy with car purchase price prediction dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4660496
    Explore at:
    Dataset updated
    Aug 4, 2022
    Dataset provided by
    Penn State University
    Authors
    Kaivan Kamali
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Source/Credit: Michael Grogan https://github.com/MGCodesandStats https://github.com/MGCodesandStats/datasets/blob/master/cars.csv

    Sample dataset for regression analysis. Given 5 attributes (age, gender, miles driven per day, debt, and income), predict how much someone will spend on purchasing a car. All 5 input attributes have been scaled to the 0 to 1 range. The training set has 723 examples and the test set has 242 examples.

    This dataset will be used in an upcoming Galaxy Training Network tutorial (https://training.galaxyproject.org/training-material/topics/statistics/) on use of feedforward neural networks for regression analysis.

  6. Data from: S1 Dataset -

    • plos.figshare.com
    xlsx
    Updated Jul 10, 2024
    Cite
    Tianyi Deng; Chengqi Xue; Gengpei Zhang (2024). S1 Dataset - [Dataset]. http://doi.org/10.1371/journal.pone.0305038.s001
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jul 10, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Tianyi Deng; Chengqi Xue; Gengpei Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The meta-learning method proposed in this paper addresses the issue of small-sample regression in the application of engineering data analysis, which is a highly promising direction for research. By integrating traditional regression models with optimization-based data augmentation from meta-learning, the proposed deep neural network demonstrates excellent performance in optimizing glass fiber reinforced plastic (GFRP) for wrapping concrete short columns. When compared with traditional regression models, such as Support Vector Regression (SVR), Gaussian Process Regression (GPR), and Radial Basis Function Neural Networks (RBFNN), the meta-learning method proposed here performs better in modeling small data samples. The success of this approach illustrates the potential of deep learning in dealing with limited amounts of data, offering new opportunities in the field of material data analysis.

  7. Simulation Studies as Designed Experiments: The Comparison of Penalized...

    • figshare.com
    ai
    Updated May 31, 2023
    Cite
    Elias Chaibub Neto; J. Christopher Bare; Adam A. Margolin (2023). Simulation Studies as Designed Experiments: The Comparison of Penalized Regression Models in the “Large p, Small n” Setting [Dataset]. http://doi.org/10.1371/journal.pone.0107957
    Explore at:
    Available download formats: ai
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Elias Chaibub Neto; J. Christopher Bare; Adam A. Margolin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    New algorithms are continuously proposed in computational biology. Performance evaluation of novel methods is important in practice. Nonetheless, the field experiences a lack of rigorous methodology aimed to systematically and objectively evaluate competing approaches. Simulation studies are frequently used to show that a particular method outperforms another. Often times, however, simulation studies are not well designed, and it is hard to characterize the particular conditions under which different methods perform better. In this paper we propose the adoption of well established techniques in the design of computer and physical experiments for developing effective simulation studies. By following best practices in planning of experiments we are better able to understand the strengths and weaknesses of competing algorithms leading to more informed decisions about which method to use for a particular task. We illustrate the application of our proposed simulation framework with a detailed comparison of the ridge-regression, lasso and elastic-net algorithms in a large scale study investigating the effects on predictive performance of sample size, number of features, true model sparsity, signal-to-noise ratio, and feature correlation, in situations where the number of covariates is usually much larger than sample size. Analysis of data sets containing tens of thousands of features but only a few hundred samples is nowadays routine in computational biology, where “omics” features such as gene expression, copy number variation and sequence data are frequently used in the predictive modeling of complex phenotypes such as anticancer drug response. The penalized regression approaches investigated in this study are popular choices in this setting and our simulations corroborate well established results concerning the conditions under which each one of these methods is expected to perform best while providing several novel insights.

  8. Logistic Regression

    • kaggle.com
    zip
    Updated Dec 24, 2017
    Cite
    Ananya Nayan (2017). Logistic Regression [Dataset]. https://www.kaggle.com/datasets/dragonheir/logistic-regression
    Explore at:
    Available download formats: zip (3349 bytes)
    Dataset updated
    Dec 24, 2017
    Authors
    Ananya Nayan
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Ananya Nayan

    Released under Database: Open Database, Contents: © Original Authors

    Contents

  9. A Comparison of Variance Estimation Methods for Regression Analyses with the...

    • catalog.data.gov
    • data.virginia.gov
    • +1 more
    Updated Sep 7, 2025
    Cite
    Substance Abuse and Mental Health Services Administration (2025). A Comparison of Variance Estimation Methods for Regression Analyses with the Mental Health Surveillance Study Clinical Sample [Dataset]. https://catalog.data.gov/dataset/a-comparison-of-variance-estimation-methods-for-regression-analyses-with-the-mental-health
    Explore at:
    Dataset updated
    Sep 7, 2025
    Dataset provided by
    Substance Abuse and Mental Health Services Administration (https://www.samhsa.gov/)
    Description

    The purpose of this report is to compare alternative methods for producing measures of SEs for regression models for the MHSS clinical sample with the goal of producing more accurate and potentially smaller SEs.

  10. Data from: Regression with Empirical Variable Selection: Description of a...

    • plos.figshare.com
    txt
    Updated Jun 8, 2023
    Cite
    Anne E. Goodenough; Adam G. Hart; Richard Stafford (2023). Regression with Empirical Variable Selection: Description of a New Method and Application to Ecological Datasets [Dataset]. http://doi.org/10.1371/journal.pone.0034338
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Anne E. Goodenough; Adam G. Hart; Richard Stafford
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Despite recent papers on problems associated with full-model and stepwise regression, their use is still common throughout ecological and environmental disciplines. Alternative approaches, including generating multiple models and comparing them post-hoc using techniques such as Akaike's Information Criterion (AIC), are becoming more popular. However, these are problematic when there are numerous independent variables and interpretation is often difficult when competing models contain many different variables and combinations of variables. Here, we detail a new approach, REVS (Regression with Empirical Variable Selection), which uses all-subsets regression to quantify empirical support for every independent variable. A series of models is created; the first containing the variable with most empirical support, the second containing the first variable and the next most-supported, and so on. The comparatively small number of resultant models (n = the number of predictor variables) means that post-hoc comparison is comparatively quick and easy. When tested on a real dataset – habitat and offspring quality in the great tit (Parus major) – the optimal REVS model explained more variance (higher R2), was more parsimonious (lower AIC), and had greater significance (lower P values), than full, stepwise or all-subsets models; it also had higher predictive accuracy based on split-sample validation. Testing REVS on ten further datasets suggested that this is typical, with R2 values being higher than full or stepwise models (mean improvement = 31% and 7%, respectively). Results are ecologically intuitive as even when there are several competing models, they share a set of “core” variables and differ only in presence/absence of one or two additional variables. We conclude that REVS is useful for analysing complex datasets, including those in ecology and environmental disciplines.

  11. Dataset for: Power analysis for multivariable Cox regression models

    • wiley.figshare.com
    • search.datacite.org
    txt
    Updated May 31, 2023
    Cite
    Emil Scosyrev; Ekkehard Glimm (2023). Dataset for: Power analysis for multivariable Cox regression models [Dataset]. http://doi.org/10.6084/m9.figshare.7010483.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Wiley (https://www.wiley.com/)
    Authors
    Emil Scosyrev; Ekkehard Glimm
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    In power analysis for multivariable Cox regression models, variance of the estimated log-hazard ratio for the treatment effect is usually approximated by inverting the expected null information matrix. Because in many typical power analysis settings assumed true values of the hazard ratios are not necessarily close to unity, the accuracy of this approximation is not theoretically guaranteed. To address this problem, the null variance expression in power calculations can be replaced with one of alternative expressions derived under the assumed true value of the hazard ratio for the treatment effect. This approach is explored analytically and by simulations in the present paper. We consider several alternative variance expressions, and compare their performance to that of the traditional null variance expression. Theoretical analysis and simulations demonstrate that while the null variance expression performs well in many non-null settings, it can also be very inaccurate, substantially underestimating or overestimating the true variance in a wide range of realistic scenarios, particularly those where the numbers of treated and control subjects are very different and the true hazard ratio is not close to one. The alternative variance expressions have much better theoretical properties, confirmed in simulations. The most accurate of these expressions has a relatively simple form - it is the sum of inverse expected event counts under treatment and under control scaled up by a variance inflation factor.
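
    The final sentence of the description can be written compactly as below. The symbols are notation introduced here for illustration and are not taken from the paper: E_T and E_C stand for the expected event counts under treatment and control, and VIF for the variance inflation factor, whose exact definition the description does not give.

```latex
\widehat{\operatorname{Var}}\bigl(\hat{\beta}\bigr)
  \;\approx\;
  \mathrm{VIF} \times \left( \frac{1}{E_T} + \frac{1}{E_C} \right)
```

    Read this way, the expression shows why strongly unbalanced arms matter: the smaller of the two expected event counts dominates the sum.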

  12. Example Stata syntax and data construction for negative binomial time series...

    • data.mendeley.com
    Updated Nov 2, 2022
    + more versions
    Cite
    Sarah Price (2022). Example Stata syntax and data construction for negative binomial time series regression [Dataset]. http://doi.org/10.17632/3mj526hgzx.2
    Explore at:
    Dataset updated
    Nov 2, 2022
    Authors
    Sarah Price
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We include Stata syntax (dummy_dataset_create.do) that creates a panel dataset for negative binomial time series regression analyses, as described in our paper "Examining methodology to identify patterns of consulting in primary care for different groups of patients before a diagnosis of cancer: an exemplar applied to oesophagogastric cancer". We also include a sample dataset for clarity (dummy_dataset.dta), and a sample of that data in a spreadsheet (Appendix 2).

    The variables contained therein are defined as follows:

    case: binary variable for case or control status (takes a value of 0 for controls and 1 for cases).

    patid: a unique patient identifier.

    time_period: A count variable denoting the time period. In this example, 0 denotes 10 months before diagnosis with cancer, and 9 denotes the month of diagnosis with cancer.

    ncons: number of consultations per month.

    period0 to period9: 10 unique inflection point variables (one for each month before diagnosis). These are used to test which aggregation period includes the inflection point.

    burden: binary variable denoting membership of one of two multimorbidity burden groups.

    We also include two Stata do-files for analysing the consultation rate, stratified by burden group, using the maximum likelihood method (1_menbregpaper.do and 2_menbregpaper_bs.do).

    Note: In this example, for demonstration purposes we create a dataset for 10 months leading up to diagnosis. In the paper, we analyse 24 months before diagnosis. Here, we study consultation rates over time, but the method could be used to study any countable event, such as number of prescriptions.

  13. Subset for multiple regression analysis: socio-demographic data, social...

    • figshare.com
    txt
    Updated Jan 19, 2021
    + more versions
    Cite
    Andrés Aparicio (2021). Subset for multiple regression analysis: socio-demographic data, social distance and the identification of mental health causes [Dataset]. http://doi.org/10.6084/m9.figshare.13607087.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 19, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Andrés Aparicio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data collected following the methodology and procedures described in (1,2). The sample consisted of Chilean adults (18 years of age or older) and was stratified by age, gender, and educational level. Five hundred and eighty-three participants began the process to answer the questionnaires either in person or online. Before the analysis, we excluded incomplete records, questionnaires answered by Chilean people living outside of Chile, and foreign people living in Chile for less than 10 years. This article reports the results obtained from 395 participants (68%). The final sample included adults from 18 to 78 years of age with low, middle and high educational levels.

    1. Scior K, Potts HW, Furnham AF. Awareness of schizophrenia and intellectual disability and stigma across ethnic groups in the UK. Psychiatry Res [Internet]. 2013 Jul 30 [cited 2019 Jan 5];208(2):125–30. Available from: https://www.sciencedirect.com/science/article/pii/S0165178112005604?via=ihub

    2. Scior K, Furnham A. Development and validation of the Intellectual Disability Literacy Scale for assessment of knowledge, beliefs and attitudes to intellectual disability. Res Dev Disabil [Internet]. 2011 Sep [cited 2017 Dec 31];32(5):1530–41. Available from: http://www.ncbi.nlm.nih.gov/pubmed/21377320

  14. Multiple linear regression model.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Sep 24, 2024
    Cite
    Dias, Sara Simões; Pedro, Ana Rita; Rosário, Jorge; Dias, Sónia (2024). Multiple linear regression model. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001409589
    Explore at:
    Dataset updated
    Sep 24, 2024
    Authors
    Dias, Sara Simões; Pedro, Ana Rita; Rosário, Jorge; Dias, Sónia
    Description

    Introduction

    The capacity of higher education students to comprehend and act on health information is a pivotal factor in attaining favourable health outcomes and well-being. Assessing the health literacy of these students is essential in order to develop targeted interventions and provide informed health support. The aim of this study was to identify the level of health literacy and to analyse its relationship with determinants such as socio-demographic variables, chronic disease, perceived health status, and perceived availability of money for expenses among higher education students in the Alentejo region of southern Portugal.

    Methodology

    An observational, descriptive and cross-sectional study was conducted between 22 June and 12 September 2023. An online structured questionnaire was used, consisting of the Portuguese version of the European Health Literacy Survey Questionnaire, 16 items (HLS-EU-PT-Q16), and including socio-demographic data, presence of chronic diseases, perceived health status, and availability of money for expenses. Data were analysed using independent samples t-test, one-way ANOVA, post-hoc Gabriel’s test, and multivariate logistic regression analyses at a significance level of 0.05. Regression models were used to investigate the relationship between health literacy and various determinants. The study protocol was approved by the Ethics Committee of the University of Évora, and all participants gave written informed consent.

    Results

    Analysis of the HLS-EU-PT-Q16 showed that 82.3% of the 1228 students sampled had limited health literacy. The mean health literacy score was 19.3 ± 12.8 on a scale of 0 to 50, with subscores of 19.4 ± 13.9 for health care, 19.1 ± 13.1 for disease prevention, and 19.0 ± 13.7 for health promotion. Significant associations were found between health literacy and several determinants. Higher health literacy was associated with the absence of chronic diseases. Regression analysis showed that lower health literacy was associated with not attending health-related courses, not living with a health professional, perceiving limited availability of money for expenses, and having an unsatisfactory health status.

    Conclusion

    This study improves the understanding of health literacy levels among higher education students in Alentejo, Portugal, and identifies key determinants. Higher education students in this region had relatively low levels of health literacy, which may have a negative impact on their health outcomes. These findings highlight the need for interventions to improve health literacy among higher education students and to address the specific needs of high-risk subgroups in the Alentejo.

  15. Walmart Dataset

    • kaggle.com
    zip
    Updated Dec 26, 2021
    Cite
    M Yasser H (2021). Walmart Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/walmart-dataset
    Explore at:
    Available download formats: zip (125095 bytes)
    Dataset updated
    Dec 26, 2021
    Authors
    M Yasser H
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Description:

    One of the leading retail stores in the US, Walmart, would like to predict sales and demand accurately. Certain events and holidays affect daily sales. Sales data are available for 45 Walmart stores. The business faces a challenge from unforeseen demand and sometimes runs out of stock because its machine learning algorithm is inadequate. An ideal ML algorithm would predict demand accurately and account for economic factors such as CPI and the Unemployment Index.

    Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labour Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data. Historical sales data for 45 Walmart stores located in different regions are available.

    Acknowledgements

    The dataset is taken from Kaggle.

    Objective:

    • Understand the Dataset & cleanup (if required).
    • Build Regression models to predict the sales w.r.t single & multiple features.
    • Also evaluate the models & compare their respective scores like R2, RMSE, etc.
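
    The two scores named in the last objective can be computed directly from predictions, as in the sketch below. The function names `rmse` and `r2_score` and the four sample values are made up for illustration; they are not taken from the Walmart data.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error, in the target's units."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 minus residual sum of squares over total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up values, for illustration only.
y_true = np.array([10.0, 12.0, 15.0, 19.0])
y_pred = np.array([11.0, 12.0, 14.0, 20.0])
print(rmse(y_true, y_pred), r2_score(y_true, y_pred))
```

    RMSE is reported in the units of the target, while R2 is unitless, so the two are usually read together when comparing models.
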
  16. Taiwan height and weight sampling data, 2017~2020

    • kaggle.com
    zip
    Updated Sep 16, 2024
    Cite
    Ta-wei Lo (2024). Taiwan height and weight sampling data, 2017~2020 [Dataset]. https://www.kaggle.com/datasets/taweilo/taiwan-wright-and-weight-sampling-data
    Explore at:
    Available download formats: zip (48516 bytes)
    Dataset updated
    Sep 16, 2024
    Authors
    Ta-wei Lo
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    Taiwan
    Description

    1. File Information

    This dataset is a synthetic dataset created based on sampling statistics from the Taiwan Ministry of Health and Welfare. It includes data on height, weight, BMI, and age of individuals, making it suitable for various health-related analyses.

    2. Meta Data

    Column   Description                                      Data Type   Example
    yr       Age of the individual                            Integer     15
    height   Height of the individual in centimeters          Float       160.5
    weight   Weight of the individual in kilograms            Float       60.0
    bmi      Body Mass Index (BMI)                            Float       22.5
    gender   Categorical gender value (0: Female, 1: Male)    Integer     0

    3. Potential Analyses

    Exploratory Data Analysis (EDA):

    • Distribution analysis for height, weight, and BMI.
    • Age and gender-based trends.

    Regression Analysis:

    • Linear Regression: Predict weight based on height and BMI.
    • Logistic Regression: Classify individuals by BMI categories.
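
One thing worth keeping in mind for the linear regression idea: under the standard definition, BMI is weight in kilograms divided by the square of height in metres, so weight is an exact function of height and BMI. The sketch below assumes the dataset's `bmi` column follows that standard definition; the function names are hypothetical.

```python
def bmi_from(height_cm, weight_kg):
    """Standard BMI: weight (kg) divided by height (m) squared."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def weight_from(height_cm, bmi):
    """Invert the BMI definition to recover weight in kilograms."""
    height_m = height_cm / 100.0
    return bmi * height_m ** 2


# Illustrative values, not taken from the dataset.
w = weight_from(170.0, 22.0)
print(w, bmi_from(170.0, w))
```

If that assumption holds, a model given both height and BMI should fit weight almost perfectly, so a near-perfect score reflects the definition rather than predictive skill.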

    Clustering and Classification:

    • Group individuals into categories (e.g., underweight, healthy, overweight) based on BMI.
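
A threshold-based grouping along these lines might look like the following. The cut-offs used here (18.5 and 25.0) are the commonly cited WHO values and are an assumption; the dataset description does not state which cut-offs to use, and local health authorities may define the categories differently, so they are left as parameters.

```python
def bmi_category(bmi, underweight_below=18.5, overweight_from=25.0):
    """Map a BMI value to a coarse category using assumed cut-offs."""
    if bmi < underweight_below:
        return "underweight"
    if bmi < overweight_from:
        return "healthy"
    return "overweight"


# Illustrative values only.
for value in (17.0, 22.5, 27.3):
    print(value, bmi_category(value))
```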

    Time-Series/Trend Analysis:

    • Investigate how health metrics (BMI) evolve over age groups.

    Feel free to leave comments on the discussion. I'd appreciate your upvote if you find my dataset useful! 😀

  17. Adjusted multiple regression analysis models showing independently...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Defaru Desalegn; Shimelis Girma; Tilahun Abdeta (2023). Adjusted multiple regression analysis models showing independently associated factors with domains of quality of life and overall quality of life among people with schizophrenia (n = 351). [Dataset]. http://doi.org/10.1371/journal.pone.0229514.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Defaru Desalegn; Shimelis Girma; Tilahun Abdeta
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Adjusted multiple regression analysis models showing independently associated factors with domains of quality of life and overall quality of life among people with schizophrenia (n = 351).

  18. Housing Prices Dataset

    • kaggle.com
    zip
    Updated Jan 12, 2022
    Cite
    M Yasser H (2022). Housing Prices Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/housing-prices-dataset
    Explore at:
    Available download formats: zip (4740 bytes)
    Dataset updated
    Jan 12, 2022
    Authors
    M Yasser H
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Description:

    A simple yet challenging project: predict the housing price from factors such as house area, number of bedrooms, furnishing status, and proximity to a main road. The dataset is small, yet its complexity arises from strong multicollinearity. Can you overcome these obstacles and build a decent predictive model?
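
One common way to check for the multicollinearity mentioned above is the variance inflation factor (VIF). For the special case of two predictors, VIF equals 1 / (1 - r²), where r is their Pearson correlation. The sketch below uses made-up area and bedroom values; the description does not say which columns are collinear, so the pairing is purely illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x, y):
    """VIF for a model with exactly two predictors (undefined when |r| = 1)."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


# Made-up values, for illustration only.
area = [50.0, 70.0, 90.0, 110.0, 130.0]
bedrooms = [1.0, 2.0, 2.0, 3.0, 4.0]
print(vif_two_predictors(area, bedrooms))
```

A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a sign that coefficients will be unstable; with more than two predictors, each VIF comes from regressing one predictor on all the others.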

    Acknowledgement:

    Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102. Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.

    Objective:

    • Understand the Dataset & cleanup (if required).
    • Build regression models to predict the housing price from single and multiple features.
    • Evaluate the models and compare their respective scores, such as R2 and RMSE.
  19. 10 Million Number Dataset

    • kaggle.com
    zip
    Updated Apr 28, 2025
    Cite
    Mehedi Hasand1497 (2025). 10 Million Number Dataset [Dataset]. https://www.kaggle.com/datasets/mehedihasand1497/10-million-random-number-dataset-for-ml/data
    Explore at:
    Available download formats: zip (2285635720 bytes)
    Dataset updated
    Apr 28, 2025
    Authors
    Mehedi Hasand1497
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    About the Dataset: Random Data with Hidden Structure

    This dataset consists of 10,000,000 samples with 50 numerical features. Each feature has been randomly generated using a uniform distribution between 0 and 1. To add complexity, a hidden structure has been introduced in some of the features. Specifically, Feature 2 is related to Feature 1, making it a good candidate for regression analysis tasks. The other features remain purely random, allowing for the exploration of feature engineering and random data generation techniques.

    Key Features and Structure

    • Feature 1: A random number drawn from a uniform distribution between 0 and 1.
    • Feature 2: A function of Feature 1, specifically Feature 2 ≈ 2 × Feature 1 + small Gaussian noise (N(0, 0.05)). This introduces a hidden linear relationship with a small amount of noise for added realism.
    • Features 3 to 50: Independent random numbers generated between 0 and 1, with no relationship to each other or any other features.

    This hidden structure allows you to test models on data where a simple pattern (between Feature 1 and Feature 2) exists, but with noise that can challenge more advanced models in finding the relationship.
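
The structure described above can be reproduced on a small scale and the hidden slope recovered with ordinary least squares. This is a minimal sketch, not the author's generation script: it builds only the first two features, uses 10,000 rows instead of 10,000,000, and assumes that 0.05 in N(0, 0.05) is the standard deviation (it could equally be the variance). The function names are hypothetical.

```python
import random


def make_rows(n, seed=0, noise_sd=0.05):
    """Generate (feature_1, feature_2) pairs with feature_2 ~ 2 * feature_1 + noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        f1 = rng.random()  # uniform on [0, 1)
        f2 = 2.0 * f1 + rng.gauss(0.0, noise_sd)  # assumption: 0.05 is the std dev
        rows.append((f1, f2))
    return rows


def fit_line(rows):
    """Closed-form simple linear regression; returns (slope, intercept)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(make_rows(10_000))
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

The fitted slope should land close to 2 and the intercept close to 0, which is the pattern a regression model is expected to find between Feature 1 and Feature 2.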

    Dataset Overview

    Feature Name | Description
    feature_1 | Random number (0–1, uniform)
    feature_2 | 2 × feature_1 + small noise (N(0, 0.05))
    feature_3–50 | Independent random numbers (0–1)
    • Rows: 10,000,000
    • Columns: 50
    • Format: CSV
    • File Size: 5.32 GB

    Intended Uses

    This dataset is versatile and can be used for various machine learning tasks, including:

    • Testing and benchmarking machine learning models: Evaluate model performance on large, randomly generated datasets.
    • Regression analysis practice: The relationship between Feature 1 and Feature 2 makes it ideal for testing regression models.
    • Feature engineering experiments: Explore techniques for selecting, transforming, or creating new features.
    • Random data generation research: Investigate methods for generating synthetic data and its applications.
    • Large-scale data processing testing: Test frameworks such as Pandas, Dask, and Spark for processing large datasets.

    Licensing

    This dataset is made available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You are free to share and adapt the material for any purpose, even commercially, as long as proper attribution is given.

  20. Survey Data of the socio-demographic, economic and water source types that...

    • zenodo.org
    • datadryad.org
    bin, csv
    Updated Jun 4, 2022
    Cite
    Shewayiref Geremew Gebremichael (2022). Survey Data of the socio-demographic, economic and water source types that influences HHs drinking water supply [Dataset]. http://doi.org/10.5061/dryad.mw6m905w8
    Explore at:
    Available download formats: bin, csv
    Dataset updated
    Jun 4, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shewayiref Geremew Gebremichael
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Background: Clean water is essential to healthy human life and wellbeing. More recently, rapid population growth, high illiteracy rates, a lack of sustainable development, and climate change have posed a global challenge, particularly in developing countries. The discontinuity of drinking water supply forces households either to use unsafe water storage materials or to use water from unsafe sources. The present study aimed to identify the determinants of water source types, use, quality of water, and sanitation perception of physical parameters among urban households in North-West Ethiopia.

    Methods: A community-based cross-sectional study was conducted among households from February to March 2019. A pretested, structured, interview-based questionnaire was used to collect the data. Data collection samples were selected randomly and in proportion to the number of households in each kebele. MS Excel and R Version 3.6.2 were used to enter and analyze the data, respectively. Descriptive statistics using frequencies and percentages were used to explain the sample data concerning the predictor variable. Both bivariate and multivariate logistic regressions were used to assess the association between independent and response variables.

    Results: Four hundred eighteen (418) households participated. Of these, 78.95% used improved and 21.05% used unimproved drinking water sources. Household drinking water sources were significantly associated with the age of the participant (χ² = 20.392, df = 3), educational status (χ² = 19.358, df = 4), source of income (χ² = 21.777, df = 3), monthly income (χ² = 13.322, df = 3), availability of additional facilities (χ² = 98.144, df = 7), cleanness status (χ² = 42.979, df = 4), scarcity of water (χ² = 5.1388, df = 1), and family size (χ² = 9.934, df = 2). The logistic regression analysis also indicated that these factors significantly determine the water source types used by the households. Factors such as availability of a toilet facility, household member type, and sex of the head of the household were not significantly associated with drinking water sources.
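
The results above report chi-square statistics and degrees of freedom but no p-values. As a rough check, the upper-tail probability can be computed from the reported numbers. The sketch below is not the study's R code; it is a plain-Python illustration that builds the chi-square survival function from `math.erfc`, `math.exp`, and `math.gamma` using the recurrence Q(a + 1, z) = Q(a, z) + z^a e^(-z) / Γ(a + 1). The function name `chi2_sf` and the dictionary keys are assumptions made for the example.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable, integer df >= 1."""
    if df < 1 or x < 0:
        raise ValueError("requires df >= 1 and x >= 0")
    z = x / 2.0
    # Starting values: Q(1/2, z) = erfc(sqrt(z)) for odd df, Q(1, z) = exp(-z) for even df.
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))
    else:
        a, q = 1.0, math.exp(-z)
    # Step the shape parameter up by 1 until it reaches df / 2.
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


# (statistic, df) pairs as reported in the results paragraph.
reported = {
    "age": (20.392, 3),
    "educational status": (19.358, 4),
    "source of income": (21.777, 3),
    "monthly income": (13.322, 3),
    "additional facilities": (98.144, 7),
    "cleanness status": (42.979, 4),
    "scarcity of water": (5.1388, 1),
    "family size": (9.934, 2),
}
for name, (stat, df) in reported.items():
    print(f"{name}: p = {chi2_sf(stat, df):.4g}")
```

Each reported statistic exceeds the 5% critical value for its degrees of freedom, which is consistent with the associations being described as significant.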

    Conclusion: The use of drinking water from improved sources was determined by different demographic, socio-economic, sanitation, and hygiene-related factors. Therefore, the local, regional, and national governments and other supporting organizations should improve the accessibility and adequacy of drinking water from improved sources in the area.

