99 datasets found

Data from: An example data set for exploration of Multiple Linear Regression...
catalog.data.gov
data.usgs.gov
Updated Nov 20, 2025
Cite
U.S. Geological Survey (2025). An example data set for exploration of Multiple Linear Regression [Dataset]. https://catalog.data.gov/dataset/an-example-data-set-for-exploration-of-multiple-linear-regression
Dataset updated
Nov 20, 2025
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Description
This data set contains example data for exploring the theory of regression-based regionalization. The 90th percentile of annual maximum streamflow is provided as an example response variable for 293 streamgages in the conterminous United States. Several explanatory variables are drawn from the GAGES-II database to demonstrate how multiple linear regression is applied. Example scripts demonstrate how to collect the original streamflow data provided and how to recreate the figures from the associated Techniques and Methods chapter.
Logistic Regression
kaggle.com
zip
Updated Dec 24, 2017
Cite
Ananya Nayan (2017). Logistic Regression [Dataset]. https://www.kaggle.com/datasets/dragonheir/logistic-regression
Explore at:
Available download formats: zip (3349 bytes)
Dataset updated
Dec 24, 2017
Authors
Ananya Nayan
License
Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
License information was derived automatically
Description
Dataset

This dataset was created by Ananya Nayan

Released under Database: Open Database, Contents: © Original Authors
polynomial regression
kaggle.com
Updated Jul 5, 2023
+ more versions
Cite
Miraj Deep Bhandari (2023). polynomial regression [Dataset]. http://doi.org/10.34740/kaggle/ds/3482232
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.34740/kaggle/ds/3482232
Dataset updated
Jul 5, 2023
Dataset provided by
Kaggle
Authors
Miraj Deep Bhandari
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
The Ice Cream Selling dataset is a simple and well-suited dataset for beginners in machine learning who are looking to practice polynomial regression. It consists of two columns: temperature and the corresponding number of units of ice cream sold.

The dataset captures the relationship between temperature and ice cream sales. It serves as a practical example for understanding and implementing polynomial regression, a powerful technique for modeling nonlinear relationships in data.

The dataset is designed to be straightforward and easy to work with, making it ideal for beginners. The simplicity of the data allows beginners to focus on the fundamental concepts and steps involved in polynomial regression without overwhelming complexity.

By using this dataset, beginners can gain hands-on experience in preprocessing the data, splitting it into training and testing sets, selecting an appropriate degree for the polynomial regression model, training the model, and evaluating its performance. They can also explore techniques to address potential challenges such as overfitting.

With this dataset, beginners can practice making predictions of ice cream sales based on temperature inputs and visualize the polynomial regression curve that represents the relationship between temperature and ice cream sales.

Overall, the Ice Cream Selling dataset provides an accessible and practical learning resource for beginners to grasp the concepts and techniques of polynomial regression in the context of analyzing ice cream sales data.
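
The polynomial regression workflow described above can be sketched as follows. This is a minimal illustration, not the dataset author's code: the temperature and sales values are invented, the function names are hypothetical, and the fit uses the normal equations in pure Python rather than any particular library.

```python
# Sketch: least-squares polynomial fit via the normal equations (pure Python).
# The (temperature, units sold) pairs are invented for illustration only;
# they are not rows from the Ice Cream Selling dataset.

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so that row operations update both sides together.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_polynomial(xs, ys, degree):
    """Return coefficients [c0, c1, ..., c_degree] minimizing squared error."""
    k = degree + 1
    # Normal equations: (X^T X) c = X^T y, where X[i][j] = xs[i] ** j.
    xtx = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


temperatures = [-4.0, -2.0, 0.0, 1.0, 3.0, 5.0]
# Hypothetical sales generated from an exact quadratic, so the fit is checkable.
units_sold = [2.0 + 0.5 * t + 1.5 * t * t for t in temperatures]

coeffs = fit_polynomial(temperatures, units_sold, degree=2)
print("coefficients:", [round(c, 6) for c in coeffs])
print("prediction at 2.0:", round(predict(coeffs, 2.0), 6))
```

Choosing the degree is the main modeling decision: a degree that is too high will follow noise in the training split, which is the overfitting risk the description mentions. Comparing errors on a held-out split for several degrees is one common way to pick it.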
Marketing Linear Multiple Regression
kaggle.com
zip
Updated Apr 24, 2020
Cite
FayeJavad (2020). Marketing Linear Multiple Regression [Dataset]. https://www.kaggle.com/datasets/fayejavad/marketing-linear-multiple-regression
Explore at:
Available download formats: zip (1907 bytes)
Dataset updated
Apr 24, 2020
Authors
FayeJavad
Description
Dataset

This dataset was created by FayeJavad
SPSS Data Set S1 Logistic Regression Model Data
figshare.com
bin
Updated Jan 19, 2016
Cite
Michelle Klailova; Phyllis Lee (2016). SPSS Data Set S1 Logistic Regression Model Data [Dataset]. http://doi.org/10.6084/m9.figshare.1051748.v2
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.1051748.v2
Dataset updated
Jan 19, 2016
Dataset provided by
Figshare (http://figshare.com/)
Authors
Michelle Klailova; Phyllis Lee
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Data set from the PLOS ONE article "Western Lowland Gorillas Signal Selectively Using Odor".
Pearson's Height Data 📏 Simple linear regression
kaggle.com
zip
Updated Aug 17, 2024
+ more versions
Cite
MaDiha 🌷 (2024). Pearson's Height Data 📏 Simple linear regression [Dataset]. https://www.kaggle.com/datasets/fundal/pearsons-height-data-simple-linear-regression
Explore at:
Available download formats: zip (3544 bytes)
Dataset updated
Aug 17, 2024
Authors
MaDiha 🌷
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
The table below gives the heights of fathers and their sons, based on a famous experiment by Karl Pearson around 1903. The number of cases is 1078. Random noise was added to the original data to produce heights to the nearest 0.1 inch.

Objective: Use this dataset to practice simple linear regression.

Columns: Father height, Son height

Source: Department of Statistics, University of California, Berkeley

Download TSV source file: Pearson.tsv
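
As a rough illustration of the stated objective, the sketch below fits a simple linear regression with the closed-form least-squares formulas. The heights are invented placeholders rather than rows from Pearson.tsv, and the function and variable names are assumptions made for this example.

```python
# Sketch: closed-form simple linear regression (ordinary least squares).
# The heights below are invented placeholders, not rows from Pearson.tsv.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


father_heights = [65.0, 66.5, 68.0, 69.5, 71.0]  # inches, hypothetical
son_heights = [66.8, 67.4, 68.9, 69.3, 70.6]     # inches, hypothetical

slope, intercept = fit_line(father_heights, son_heights)
predicted = slope * 70.0 + intercept
print(f"slope={slope:.3f}, intercept={intercept:.3f}, prediction at 70 in = {predicted:.2f}")
```

With the real file, the two lists would be replaced by the Father height and Son height columns; everything else stays the same.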
multi-output regression datasets
ieee-dataport.org
Updated Nov 20, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Chunyu Wang (2025). multi-output regression datasets [Dataset]. https://ieee-dataport.org/documents/multi-output-regression-datasets
Dataset updated
Nov 20, 2025
Authors
Chunyu Wang
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
1 ) and there are 16 continuous input variables.
20ml Regression Dataset
universe.roboflow.com
zip
Updated Jun 26, 2024
+ more versions
Cite
20ml (2024). 20ml Regression Dataset [Dataset]. https://universe.roboflow.com/20ml-tkw8a/20ml-regression/model/4
Explore at:
Available download formats: zip
Dataset updated
Jun 26, 2024
Dataset authored and provided by
20ml
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Variables measured
Volume
Description
20ml Regression

Overview: 20ml Regression is a dataset for classification tasks. It contains Volume annotations for 5,683 images.

Getting Started: You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

License: This dataset is available under the CC BY 4.0 license (https://creativecommons.org/licenses/by/4.0/).
regression_data
figshare.com
datasetcatalog.nlm.nih.gov
bin
Updated Mar 2, 2021
Cite
Björn Thor Arnarson (2021). regression_data [Dataset]. http://doi.org/10.6084/m9.figshare.14139731.v1
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.14139731.v1
Dataset updated
Mar 2, 2021
Dataset provided by
Figshare (http://figshare.com/)
Authors
Björn Thor Arnarson
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Regression data
Housing Prices Dataset
kaggle.com
zip
Updated Jan 12, 2022
Cite
M Yasser H (2022). Housing Prices Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/housing-prices-dataset
Explore at:
Available download formats: zip (4740 bytes)
Dataset updated
Jan 12, 2022
Authors
M Yasser H
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Description:

A simple yet challenging project: predict the housing price from factors such as house area, number of bedrooms, furnishing status, and proximity to the main road. The dataset is small, yet its complexity arises from strong multicollinearity. Can you overcome these obstacles & build a decent predictive model?

Acknowledgement:

Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102. Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.

Objective:

Understand the Dataset & cleanup (if required).

Build Regression models to predict the sales w.r.t. single & multiple features.

Also evaluate the models & compare their respective scores like R2, RMSE, etc.
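
The scores named in the objective can be computed by hand, which makes clear what is being compared. The sketch below is an assumption-laden illustration: the actual values and the two sets of model predictions are invented, and the function names are hypothetical.

```python
import math

# Sketch: comparing two models by RMSE and R-squared.
# All numbers are invented; they do not come from the Housing Prices Dataset.

def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


actual = [3.0, 5.0, 7.0, 9.0]
model_a = [2.5, 5.5, 6.5, 9.5]  # hypothetical predictions, every error is 0.5
model_b = [4.0, 4.0, 8.0, 8.0]  # hypothetical predictions, every error is 1.0

for name, preds in (("model_a", model_a), ("model_b", model_b)):
    print(name, "RMSE:", round(rmse(actual, preds), 4), "R2:", round(r_squared(actual, preds), 4))
```

Lower RMSE and higher R2 both favor the first hypothetical model here. With strongly collinear features, the scores may look fine even when individual coefficients are unstable, so they are best computed on held-out data.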
House Price Regression Dataset
kaggle.com
zip
Updated Sep 6, 2024
Cite
Prokshitha Polemoni (2024). House Price Regression Dataset [Dataset]. https://www.kaggle.com/datasets/prokshitha/home-value-insights
Explore at:
Available download formats: zip (27045 bytes)
Dataset updated
Sep 6, 2024
Authors
Prokshitha Polemoni
Description
Home Value Insights: A Beginner's Regression Dataset

This dataset is designed for beginners to practice regression problems, particularly in the context of predicting house prices. It contains 1000 rows, with each row representing a house and various attributes that influence its price. The dataset is well-suited for learning basic to intermediate-level regression modeling techniques.

Features:

Square_Footage: The size of the house in square feet. Larger homes typically have higher prices.

Num_Bedrooms: The number of bedrooms in the house. More bedrooms generally increase the value of a home.

Num_Bathrooms: The number of bathrooms in the house. Houses with more bathrooms are typically priced higher.

Year_Built: The year the house was built. Older houses may be priced lower due to wear and tear.

Lot_Size: The size of the lot the house is built on, measured in acres. Larger lots tend to add value to a property.

Garage_Size: The number of cars that can fit in the garage. Houses with larger garages are usually more expensive.

Neighborhood_Quality: A rating of the neighborhood’s quality on a scale of 1-10, where 10 indicates a high-quality neighborhood. Better neighborhoods usually command higher prices.

House_Price (Target Variable): The price of the house, which is the dependent variable you aim to predict.

Potential Uses:

Beginner Regression Projects: This dataset can be used to practice building regression models such as Linear Regression, Decision Trees, or Random Forests. The target variable (house price) is continuous, making this an ideal problem for supervised learning techniques.

Feature Engineering Practice: Learners can create new features by combining existing ones, such as the price per square foot or age of the house, providing an opportunity to experiment with feature transformations.

Exploratory Data Analysis (EDA): You can explore how different features (e.g., square footage, number of bedrooms) correlate with the target variable, making it a great dataset for learning about data visualization and summary statistics.

Model Evaluation: The dataset allows for various model evaluation techniques such as cross-validation, R-squared, and Mean Absolute Error (MAE). These metrics can be used to compare the effectiveness of different models.
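
The feature engineering idea mentioned among the potential uses (price per square foot, age of the house) might look like the sketch below. The column names Square_Footage, Year_Built, and House_Price come from the feature list above; the derived column names, the sample rows, and the reference year are assumptions for illustration.

```python
# Sketch: deriving new features from existing columns.
# Price_Per_SqFt, House_Age, the sample rows, and reference_year are
# hypothetical; only the three source column names come from the description.

def add_derived_features(row, reference_year):
    """Return a copy of the row with two derived features added."""
    enriched = dict(row)
    enriched["Price_Per_SqFt"] = row["House_Price"] / row["Square_Footage"]
    enriched["House_Age"] = reference_year - row["Year_Built"]
    return enriched


houses = [
    {"Square_Footage": 2000, "Year_Built": 1990, "House_Price": 400000.0},
    {"Square_Footage": 1500, "Year_Built": 2010, "House_Price": 330000.0},
]

enriched_houses = [add_derived_features(h, reference_year=2024) for h in houses]
for house in enriched_houses:
    print(house["Price_Per_SqFt"], house["House_Age"])
```

Because Price_Per_SqFt is computed from the target variable, it is suited to exploratory analysis rather than use as a predictor; House_Age has no such problem.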

Versatility:

The dataset is highly versatile for a range of machine learning tasks. You can apply simple linear models to predict house prices based on one or two features, or use more complex models like Random Forest or Gradient Boosting Machines to understand interactions between variables.

It can also be used for dimensionality reduction techniques like PCA or to practice handling categorical variables (e.g., neighborhood quality) through encoding techniques like one-hot encoding.

This dataset is ideal for anyone wanting to gain practical experience in building regression models while working with real-world features.
Multi variant linear regression analysis of article citations vs. twitter...
datasetcatalog.nlm.nih.gov
Updated Feb 19, 2013
Cite
Bollen, Johan; Pepe, Alberto; Shuai, Xin (2013). Multi variant linear regression analysis of article citations vs. twitter mentions, article arXiv downloads, and time in days elapsed between beginning of our test period and submission of article. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001640994
Dataset updated
Feb 19, 2013
Authors
Bollen, Johan; Pepe, Alberto; Shuai, Xin
Description
: p<0.1,: p<0.05,: p<0.01,**: p<0.001.
Input data for regression and deep learning approaches
figshare.com
zip
Updated May 31, 2023
Cite
Michael Vlah; Spencer Rhea (2023). Input data for regression and deep learning approaches [Dataset]. http://doi.org/10.6084/m9.figshare.22349377.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.22349377.v1
Dataset updated
May 31, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Michael Vlah; Spencer Rhea
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This zip file contains all input time-series data needed to run the code for "Virtual gauges: the surprising potential to reconstruct continuous streamflow from strategic measurements" by Vlah et al. 2023. You may also want to download our complete zip of model objects, rather than re-running hundreds of models. Follow the README files in our code on GitHub: https://github.com/vlahm/neon_q_sim.
LocBench
figshare.com
application/x-gzip
Updated Aug 10, 2025
Cite
Nemin Wu; Qian Cao; Zhangyu Wang; Zeping Liu; Yanlin Qi; Jielu Zhang; Joshua Ni; Xiaobai Yao; Hongxu Ma; Lan Mu; Stefano Ermon; Tanuja Ganu; Akshay Nambi; Ni Lao; Gengchen Mai (2025). LocBench [Dataset]. http://doi.org/10.6084/m9.figshare.26026798.v3
Explore at:
Available download formats: application/x-gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.26026798.v3
Dataset updated
Aug 10, 2025
Dataset provided by
Figshare (http://figshare.com/)
Authors
Nemin Wu; Qian Cao; Zhangyu Wang; Zeping Liu; Yanlin Qi; Jielu Zhang; Joshua Ni; Xiaobai Yao; Hongxu Ma; Lan Mu; Stefano Ermon; Tanuja Ganu; Akshay Nambi; Ni Lao; Gengchen Mai
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
LocBench contains 7 geo-aware image classification datasets and 4 geo-aware image regression datasets, aiming to systematically compare the location encoders' performance and their impact on the model's overall geographic bias.

Geo-aware image classification datasets: BirdSnap, BirdSnap†, NABirds†, iNat2017, iNat2018, YFCC, and fMoW. Please note that BirdSnap and BirdSnap† are in the same file birdsnap.tar.gz. iNat2017 and iNat2018 are not stored here due to storage limitations but can be accessed via Dropbox.

Geo-aware image regression datasets: MOSAIKS (Population Density, Forest Cover, Nightlight Luminosity, and Elevation), and SustainBench (Asset Index, Women BMI, Water Index, Child Mortality Rate, Sanitation Index, and Women Edu). The SustainBench datasets can be downloaded from Google Drive. You can find dhs_test_labels.csv and dhs_trainval_labels.csv in sustainbench_dhs_labels.zip here if they were not in the Google Drive.

All data products created through our work that are not covered under upstream licensing agreements are available via a CC BY-NC 4.0 license. All upstream data use restrictions take precedence over this license.
Cleaned NHANES 1988-2018
figshare.com
txt
Updated Feb 18, 2025
Cite
Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). Cleaned NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v9
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.21743372.v9
Dataset updated
Feb 18, 2025
Dataset provided by
Figshare (http://figshare.com/)
Authors
Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The National Health and Nutrition Examination Survey (NHANES) provides data that have considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey:

demographics (281 variables),
dietary consumption (324 variables),
physiological functions (1,040 variables),
occupation (61 variables),
questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood),
medications (29 variables),
mortality information linked from the National Death Index (15 variables),
survey weights (857 variables),
environmental exposure biomarker measurements (598 variables), and
chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).

.csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file. The curated NHANES datasets comprise 20 .csv files, two for each module: one uncleaned version and one cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments.

"dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES.
"dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.
"dictionary_drug_codes.csv" contains the dictionary for descriptors on the drug codes.
"nhanes_inconsistencies_documentation.xlsx" is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.

R Data Record: For researchers who want to conduct their analysis in the R programming language, only the cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file, which includes an .RData file and an .R file.

"w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data.
"m - nhanes_1988_2018.R" shows how we used the customized functions (i.e. our pipeline) to curate the original NHANES data.

Example starter codes: The set of starter code to help users conduct exposome analysis consists of four R markdown files (.Rmd). We recommend going through the tutorials in order.

"example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together.
"example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.
"example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design.
"example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
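
To give a sense of what "with and without accounting for the sampling design" changes, here is a minimal sketch of an unweighted versus a weighted mean. It is written in Python rather than the R used by the starter files, the numbers and names are invented, and it covers only the point estimate; a full design-based analysis would also use strata and cluster information for variance estimation.

```python
# Sketch: summary statistic with and without survey weights.
# Values and weights are invented; this is not the authors' R pipeline.

def unweighted_mean(values):
    """Every respondent counts equally."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Each respondent counts in proportion to the population they represent."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


measurements = [10.0, 20.0, 30.0]  # hypothetical biomarker values
survey_weights = [1.0, 1.0, 2.0]   # hypothetical sampling weights

print("unweighted:", unweighted_mean(measurements))
print("weighted:  ", weighted_mean(measurements, survey_weights))
```

The third respondent stands in for twice as many people as the others, so the weighted estimate is pulled toward that respondent's value.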
Walmart Dataset
kaggle.com
zip
Updated Dec 26, 2021
Cite
M Yasser H (2021). Walmart Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/walmart-dataset
Explore at:
Available download formats: zip (125095 bytes)
Dataset updated
Dec 26, 2021
Authors
M Yasser H
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Description:

One of the leading retail stores in the US, Walmart, would like to predict sales and demand accurately. Certain events and holidays affect sales on each day. Sales data are available for 45 Walmart stores. The business faces a challenge from unforeseen demand and sometimes runs out of stock because its current machine learning algorithm is inadequate. An ideal ML algorithm will predict demand accurately and ingest factors like economic conditions including CPI, Unemployment Index, etc.

Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labour Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data. Historical sales data for 45 Walmart stores located in different regions are available.
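
The five-fold weighting of holiday weeks can be illustrated with a weighted error metric. The description states only the weighting, so the use of mean absolute error here is an assumption, as are the function name, the parameter names, and all of the numbers.

```python
# Sketch: an evaluation metric that counts holiday weeks five times as much.
# Mean absolute error is an assumed choice of metric; the sales figures and
# holiday flags are invented.

def weighted_mae(actual, predicted, is_holiday, holiday_weight=5.0):
    """Weighted mean absolute error with heavier holiday weeks."""
    weights = [holiday_weight if flag else 1.0 for flag in is_holiday]
    total = sum(w * abs(a - p) for w, a, p in zip(weights, actual, predicted))
    return total / sum(weights)


actual_sales = [100.0, 120.0, 90.0, 200.0]
predicted_sales = [110.0, 115.0, 95.0, 180.0]
holiday_flags = [False, False, False, True]

score = weighted_mae(actual_sales, predicted_sales, holiday_flags)
plain = weighted_mae(actual_sales, predicted_sales, holiday_flags, holiday_weight=1.0)
print("weighted MAE:", score)
print("plain MAE:   ", plain)
```

In this made-up case the largest miss falls on the holiday week, so the weighted score is noticeably worse than the plain one. A model tuned against such a metric is pushed to get holiday weeks right first.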

Acknowledgements

The dataset is taken from Kaggle.

Objective:

Understand the Dataset & cleanup (if required).

Build Regression models to predict the sales w.r.t single & multiple features.

Also evaluate the models & compare their respective scores like R2, RMSE, etc.
Health and Retirement Study (HRS)
search.dataone.org
dataverse.harvard.edu
Updated Nov 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Damico, Anthony (2023). Health and Retirement Study (HRS) [Dataset]. http://doi.org/10.7910/DVN/ELEKOY
Unique identifier
https://doi.org/10.7910/DVN/ELEKOY
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Damico, Anthony
Description
analyze the health and retirement study (hrs) with r

the hrs is the one and only longitudinal survey of american seniors. with a panel starting its third decade, the current pool of respondents includes older folks who have been interviewed every two years as far back as 1992. unlike cross-sectional or shorter panel surveys, respondents keep responding until, well, death do us part. paid for by the national institute on aging and administered by the university of michigan's institute for social research, if you apply for an interviewer job with them, i hope you like werther's original.

figuring out how to analyze this data set might trigger your fight-or-flight synapses if you just start clicking around on michigan's website. instead, read pages numbered 10-17 (pdf pages 12-19) of this introduction pdf and don't touch the data until you understand figure a-3 on that last page. if you start enjoying yourself, here's the whole book. after that, it's time to register for access to the (free) data. keep your username and password handy, you'll need it for the top of the download automation r script. next, look at this data flowchart to get an idea of why the data download page is such a righteous jungle.

but wait, good news: umich recently farmed out its data management to the rand corporation, who promptly constructed a giant consolidated file with one record per respondent across the whole panel. oh so beautiful. the rand hrs files make much of the older data and syntax examples obsolete, so when you come across stuff like instructions on how to merge years, you can happily ignore them - rand has done it for you.

the health and retirement study only includes noninstitutionalized adults when new respondents get added to the panel (as they were in 1992, 1993, 1998, 2004, and 2010) but once they're in, they're in - respondents have a weight of zero for interview waves when they were nursing home residents; but they're still responding and will continue to contribute to your statistics so long as you're generalizing about a population from a previous wave (for example: it's possible to compute "among all americans who were 50+ years old in 1998, x% lived in nursing homes by 2010"). my source for that 411? page 13 of the design doc. wicked.

this new github repository contains five scripts:

1992 - 2010 download HRS microdata.R
loop through every year and every file, download, then unzip everything in one big party

import longitudinal RAND contributed files.R
create a SQLite database (.db) on the local disk
load the rand, rand-cams, and both rand-family files into the database (.db) in chunks (to prevent overloading ram)

longitudinal RAND - analysis examples.R
connect to the sql database created by the 'import longitudinal RAND contributed files' program
create two database-backed complex sample survey objects, using a taylor-series linearization design
perform a mountain of analysis examples with wave weights from two different points in the panel

import example HRS file.R
load a fixed-width file using only the sas importation script directly into ram with SAScii (http://blog.revolutionanalytics.com/2012/07/importing-public-data-with-sas-instructions-into-r.html)
parse through the IF block at the bottom of the sas importation script, blank out a number of variables
save the file as an R data file (.rda) for fast loading later

replicate 2002 regression.R
connect to the sql database created by the 'import longitudinal RAND contributed files' program
create a database-backed complex sample survey object, using a taylor-series linearization design
exactly match the final regression shown in this document provided by analysts at RAND as an update of the regression on pdf page B76 of this document.

click here to view these five scripts

for more detail about the health and retirement study (hrs), visit:
michigan's hrs homepage
rand's hrs homepage
the hrs wikipedia page
a running list of publications using hrs

notes:

exemplary work making it this far. as a reward, here's the detailed codebook for the main rand hrs file. note that rand also creates 'flat files' for every survey wave, but really, most every analysis you can think of is possible using just the four files imported with the rand importation script above. if you must work with the non-rand files, there's an example of how to import a single hrs (umich-created) file, but if you wish to import more than one, you'll have to write some for loops yourself.

confidential to sas, spss, stata, and sudaan users: a tidal wave is coming. you can get water up your nose and be dragged out to sea, or you can grab a surf board. time to transition to r. :D
Data for: Nonintrusive heat flux quantification using acoustic emissions...
datadryad.org
search.dataone.org
+1 more
zip
Updated Jun 13, 2025
Cite
Christy Dunlap; Hari Pandey; Ethan Weems; Han Hu (2025). Data for: Nonintrusive heat flux quantification using acoustic emissions during pool boiling [Dataset]. http://doi.org/10.5061/dryad.q573n5tvq
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.q573n5tvq
Dataset updated
Jun 13, 2025
Dataset provided by
Dryad
Authors
Christy Dunlap; Hari Pandey; Ethan Weems; Han Hu
Time period covered
May 16, 2025
Description
Data for: Nonintrusive heat flux quantification using acoustic emissions during pool boiling

Dataset DOI: 10.5061/dryad.q573n5tvq

Description of the data and file structure

This dataset includes temperature profiles, hydrophone acoustic data, and bubble images from a transient pool boiling test of deionized water on a flat copper surface. This dataset is used to train SeqReg, a machine learning framework for sequence regression.

Files and variables

File: ned3-004_BoilingAcousticsHydrophone.zip

Description: This dataset includes two .lvm files, one .txt file, one spreadsheet, and a folder of images.

The Boiling-32_Hydrophone.lvm file includes the continuously sampled acoustic data using two HTI-96-MIN hydrophones with an NI-9230 module.

The Boiling-32_Temperature.lvm includes the temperature measurements using probe thermocouples (Omega Engineering Tj36-CPSS-032U-6) in the copper block a...
Data from: Supplementary Material for "Kernel Regression Mapping for Vocal...
pub-dev.ub.uni-bielefeld.de
pub.uni-bielefeld.de
Updated Oct 23, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Thomas Hermann; Gerold Baier; Ulrich Stephani; Helge Ritter (2024). Supplementary Material for "Kernel Regression Mapping for Vocal EEG Sonification" [Dataset]. https://pub-dev.ub.uni-bielefeld.de/record/2698580
Dataset updated
Oct 23, 2024
Authors
Thomas Hermann; Gerold Baier; Ulrich Stephani; Helge Ritter
License
Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
License information was derived automatically
Description
This paper introduces kernel regression mapping sonification (KRMS) for optimized mappings between data features and the parameter space of Parameter Mapping Sonification. Kernel regression makes it possible to map data spaces to high-dimensional parameter spaces such that specific locations in data space with pre-determined extent are represented by selected acoustic parameter vectors. Thereby, specifically chosen correlated settings of parameters may be selected to create perceptual fingerprints, such as a particular timbre or vowel. With KRMS, the perceptual fingerprints become clearly audible and separable. Furthermore, kernel regression defines meaningful interpolations for any point in between. We present and discuss the basic approach exemplified by our previously introduced vocal EEG sonification, report new sonifications and generalize the approach towards automatic parameter mapping generators using unsupervised learning approaches.
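
The interpolation idea can be illustrated with a generic Gaussian-kernel (Nadaraya-Watson) regression. This is a sketch of the general technique, not the paper's KRMS implementation, which may differ in its details; the anchor positions, the formant-like target vectors, and all names are hypothetical.

```python
import math

# Sketch: Nadaraya-Watson kernel regression with a Gaussian kernel.
# A generic illustration, not the KRMS code from the paper. The anchors and
# the formant-like target vectors are hypothetical values.

def kernel_regression(x, anchors, targets, sigma):
    """Return the kernel-weighted average of the target vectors at x."""
    # Each anchor's influence decays with its squared distance from x.
    weights = [math.exp(-((x - a) ** 2) / (2.0 * sigma ** 2)) for a in anchors]
    total = sum(weights)
    dim = len(targets[0])
    return [sum(w * t[d] for w, t in zip(weights, targets)) / total for d in range(dim)]


anchors = [0.0, 1.0]                          # locations in data space
targets = [[300.0, 2300.0], [700.0, 1200.0]]  # parameter vectors at those locations

midpoint = kernel_regression(0.5, anchors, targets, sigma=0.25)
near_first_wide = kernel_regression(0.25, anchors, targets, sigma=0.50)
near_first_narrow = kernel_regression(0.25, anchors, targets, sigma=0.05)

print("midpoint:", [round(v, 2) for v in midpoint])
print("x=0.25, sigma=0.50:", [round(v, 2) for v in near_first_wide])
print("x=0.25, sigma=0.05:", [round(v, 2) for v in near_first_narrow])
```

A wide bandwidth blends the two target vectors over a broad region, while a narrow one snaps to the nearest anchor, which matches the sharper transitions reported for smaller sigma in the example series. One limitation of this simple form: with a very small sigma and a query far from every anchor, all weights can underflow to zero, so a production version would need a guard for that case.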

Sonification examples

Sonification Example S1 (mp3, 356k):

formant transitions during absence EEG using a mapping of dipole x/y to the first two formants of a subtractive synthesizer.

Sonification Example S2 (mp3, 248k):

formant transitions during absence EEG using a mapping of delay-embedding feature described in the paper to the first two formants.

Sonification Example Series 3:

In the series from S3.1 to S3.5, it can be heard that the transitions between formants become successively sharper with decreasing bandwidth sigma:

S3.1 (mp3, 248k): sigma = 0.50

S3.2 (mp3, 248k): sigma = 0.35

S3.3 (mp3, 248k): sigma = 0.25

S3.4 (mp3, 248k): sigma = 0.15

S3.5 (mp3, 248k): sigma = 0.05
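Why a smaller sigma sharpens the transitions can be seen in a one-dimensional Python sketch. It assumes a Gaussian kernel and two hypothetical anchors at positions 0 and 1. The helper `weight_on_first_anchor` is an illustration, not code from the supplementary material; only the sigma values are taken from the series above.

```python
import numpy as np

def weight_on_first_anchor(x, sigma, a=0.0, b=1.0):
    """Normalized Gaussian kernel weight of anchor a (versus anchor b) at position x."""
    wa = np.exp(-((x - a) ** 2) / (2.0 * sigma ** 2))
    wb = np.exp(-((x - b) ** 2) / (2.0 * sigma ** 2))
    return wa / (wa + wb)

sigmas = [0.50, 0.35, 0.25, 0.15, 0.05]   # bandwidths of S3.1 to S3.5
# Weight of the first anchor slightly on its side of the midpoint (x = 0.4).
weights = [weight_on_first_anchor(0.4, s) for s in sigmas]
for s, w in zip(sigmas, weights):
    print(f"sigma = {s:.2f}: weight = {w:.4f}")
```

At x = 0.4 the weight of the nearer anchor grows from about 0.60 at sigma = 0.50 to essentially 1.0 at sigma = 0.05. The blend therefore switches from one parameter vector to the other over an ever narrower region around the midpoint.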

Sonification Example S4 (mp3, 496k):

Same as S2, here rendered at 1/4 of real-time for better discernibility of formant transitions.

Sonification Example Series 5:

The series from S5.1 to S5.5 is, exactly as in S3.1-S3.5, a series with decreasing bandwidth sigma, here rendered at 1/4 of real-time for better discernibility of formant transitions:

S5.1 (mp3, 496k): sigma = 0.50

S5.2 (mp3, 496k): sigma = 0.35

S5.3 (mp3, 496k): sigma = 0.25

S5.4 (mp3, 496k): sigma = 0.15

S5.5 (mp3, 496k): sigma = 0.05
Replication Data for: "A Topic-based Segmentation Model for Identifying...
search.dataone.org
Updated Sep 25, 2024
Cite
Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert (2024). Replication Data for: "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews" [Dataset]. http://doi.org/10.7910/DVN/EE3DE2
Unique identifier
https://doi.org/10.7910/DVN/EE3DE2
Dataset updated
Sep 25, 2024
Dataset provided by
Harvard Dataverse
Authors
Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert
Description
We provide instructions, code, and datasets for replicating the article by Kim, Lee and McCulloch (2024), "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews." This repository provides a user-friendly R package that lets researchers and practitioners apply A Topic-based Segmentation Model with Unstructured Texts (latent class regression with group variable selection) to their own datasets.

First, we provide R code to replicate the illustrative simulation study: see file 1. Second, we provide the user-friendly R package with a very simple example code to help apply the model to real-world datasets: see file 2, Package_MixtureRegression_GroupVariableSelection.R and Dendrogram.R. Third, we provide a set of codes and instructions to replicate the empirical studies of customer-level segmentation and restaurant-level segmentation with Yelp reviews data: see files 3-a, 3-b, 4-a, 4-b. Note that, due to Yelp's dataset terms of use and the restriction on data size, we provide a link to download the same Yelp datasets (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6). Fourth, we provide a set of codes and datasets to replicate the empirical study with professor ratings reviews data: see file 5. Please see more details in the description text and comments of each file.

[A guide on how to use the code to reproduce each study in the paper]

1. Full codes for replicating Illustrative simulation study.txt -- [see Table 2 and Figure 2 in main text]: R source code to replicate the illustrative simulation study. Please run it from beginning to end in R. In addition to estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships, you will get dendrograms of selected groups of variables in Figure 2. Computing time is approximately 20 to 30 minutes.

3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt: Code for preprocessing the downloaded unstructured Yelp review data and preparing the DV and IVs matrix for the customer-level segmentation study.

3-b. Instruction for replicating Customer-level Segmentation analysis.txt -- [see Table 10 in main text; Tables F-1, F-2, and F-3 and Figure F-1 in Web Appendix]: Code for replicating the customer-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 3 to 4 hours.

4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt: R code for preprocessing the downloaded unstructured Yelp data and preparing the DV and IVs matrix for the restaurant-level segmentation study.

4-b. Instructions for replicating restaurant-level segmentation analysis.txt -- [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]: Code for replicating the restaurant-level segmentation study with Yelp. You will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 10 to 12 hours.

[Guidelines for running Benchmark models in Table 6]

Unsupervised Topic model: 'topicmodels' package in R -- after determining the number of topics (e.g., with the 'ldatuning' R package), run the 'LDA' function in the 'topicmodels' package. Then, compute topic probabilities per restaurant (with the 'posterior' function in the package), which can be used as predictors. Then, conduct prediction with regression.

Hierarchical topic model (HDP): 'gensimr' R package -- 'model_hdp' function for identifying topics in the package (see https://radimrehurek.com/gensim/models/hdpmodel.html or https://gensimr.news-r.org/).

Supervised topic model: 'lda' R package -- 'slda.em' function for training and 'slda.predict' for prediction.

Aggregate regression: 'lm' default function in R.

Latent class regression without variable selection: 'flexmix' function in the 'flexmix' R package. Run flexmix with a certain number of segments (e.g., 3 segments in this study). Then, with the estimated coefficients and memberships, predict the dependent variable for each segment.

Latent class regression with variable selection: 'Unconstraind_Bayes_Mixture' function in the package of Kim, Fong and DeSarbo (2012). Run the Kim et al. (2012) model with a certain number of segments (e.g., 3 segments in this study). Then, with the estimated coefficients and memberships, predict the dependent variable for each segment. The same R package ('KimFongDeSarbo2012.zip') can be downloaded at: https://sites.google.com/scarletmail.rutgers.edu/r-code-packages/home

5. Instructions for replicating Professor ratings review study.txt -- [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]: Code to replicate the Professor ratings reviews study. Computing time is approximately 10 hours.

[A list of the versions of R, packages, and computer...
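The per-segment prediction step mentioned for the latent class benchmarks can be sketched in Python with NumPy. The replication files themselves are in R, and this is not the authors' code. The function `predict_by_segment`, the toy data, the coefficients, and the memberships are all hypothetical. The sketch assumes hard segment assignments and already-estimated coefficients, and compares the result with a single aggregate least-squares fit.

```python
import numpy as np

def predict_by_segment(X, memberships, coefs):
    """Predict y for each row with the coefficient vector of its assigned segment."""
    X1 = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    # coefs[memberships] picks one coefficient row per observation.
    return np.einsum("ij,ij->i", X1, coefs[memberships])

# Toy data: one predictor, four observations, two segments (all assumed values).
X = np.array([[1.0], [2.0], [3.0], [4.0]])
memberships = np.array([0, 0, 1, 1])             # hard segment assignment per row
coefs = np.array([[0.0, 1.0],                    # segment 0: y = x
                  [10.0, -1.0]])                 # segment 1: y = 10 - x

y_seg = predict_by_segment(X, memberships, coefs)

# Aggregate regression: one least-squares fit that ignores the segments.
X1 = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X1, y_seg, rcond=None)
sse_aggregate = float(((X1 @ beta - y_seg) ** 2).sum())
print(y_seg, beta, sse_aggregate)
```

Because the two segments have opposite slopes, the single aggregate line cannot reproduce the segment-level predictions and leaves a nonzero squared error. This is the motivation for fitting segment-specific coefficients.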

