100+ datasets found

An example data set for exploration of Multiple Linear Regression
catalog.data.gov
data.usgs.gov
Updated Jul 6, 2024
Cite
U.S. Geological Survey (2024). An example data set for exploration of Multiple Linear Regression [Dataset]. https://catalog.data.gov/dataset/an-example-data-set-for-exploration-of-multiple-linear-regression
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Description
This data set contains example data for exploring the theory of regression-based regionalization. The 90th percentile of annual maximum streamflow is provided as an example response variable for 293 streamgages in the conterminous United States. Several explanatory variables are drawn from the GAGES-II database to demonstrate how multiple linear regression is applied. Example scripts demonstrate how to collect the original streamflow data provided and how to recreate the figures from the associated Techniques and Methods chapter.
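As a rough illustration of the response variable described above, the sketch below computes the 90th percentile of annual maximum flows for a single hypothetical streamgage. The function name, the toy flow values, and the use of NumPy's default (linear) percentile interpolation are all assumptions; the example scripts in the data release define how the statistic was actually derived.

```python
import numpy as np

def annual_max_p90(years, flows):
    """Return the 90th percentile of the annual maximum flows.

    years and flows are parallel sequences: one flow observation per entry.
    """
    years = np.asarray(years)
    flows = np.asarray(flows, dtype=float)
    # One maximum per calendar year present in the record.
    annual_max = np.array([flows[years == y].max() for y in np.unique(years)])
    # NumPy's default linear interpolation is an assumption here.
    return np.percentile(annual_max, 90)

# Hypothetical record: two observations in each of three years.
years = [2000, 2000, 2001, 2001, 2002, 2002]
flows = [10.0, 30.0, 20.0, 50.0, 40.0, 15.0]
p90 = annual_max_p90(years, flows)
print(p90)
```

In a regionalization setting, a value like this would be computed per streamgage and then regressed on basin characteristics.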
Data for multiple linear regression models for predicting microcystin...
catalog.data.gov
data.usgs.gov
Updated Jul 6, 2024
Cite
U.S. Geological Survey (2024). Data for multiple linear regression models for predicting microcystin concentration action-level exceedances in selected lakes in Ohio [Dataset]. https://catalog.data.gov/dataset/data-for-multiple-linear-regression-models-for-predicting-microcystin-concentration-action
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Area covered
Ohio
Description
Site-specific multiple linear regression models were developed for eight sites in Ohio (six in the Western Lake Erie Basin and two in northeast Ohio on inland reservoirs) to quickly predict action-level exceedances for a cyanotoxin, microcystin, in recreational and drinking waters used by the public. Real-time models include easily or continuously measured factors that do not require that a sample be collected. Real-time models are presented in two categories: (1) six models with continuous monitor data, and (2) three models with on-site measurements. Real-time models commonly included variables such as phycocyanin, pH, specific conductance, and streamflow or gage height. Many of the real-time factors were averages over time periods antecedent to the time the microcystin sample was collected, including water-quality data compiled from continuous monitors. Comprehensive models use a combination of discrete sample-based measurements and real-time factors. Comprehensive models were useful at some sites with lagged variables (< 2 weeks) for cyanobacterial toxin genes, dissolved nutrients, and (or) N to P ratios. Comprehensive models are presented in three categories: (1) three models with continuous monitor data and lagged comprehensive variables, (2) five models with no continuous monitor data and lagged comprehensive variables, and (3) one model with continuous monitor data and same-day comprehensive variables. Funding for this work was provided by the Ohio Water Development Authority and the U.S. Geological Survey Cooperative Water Program.
Panel dataset on Brazilian fuel demand
data.mendeley.com
Updated Oct 7, 2024
Cite
Sergio Prolo (2024). Panel dataset on Brazilian fuel demand [Dataset]. http://doi.org/10.17632/hzpwbp7j22.1
Explore at:
Unique identifier
https://doi.org/10.17632/hzpwbp7j22.1
Dataset updated
Oct 7, 2024
Authors
Sergio Prolo
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Summary: Fuel demand is shown to be influenced by fuel prices, people's income, and motorization rates. We explore the effects of electric vehicle motorization rates on gasoline demand using this panel dataset.

Files: dataset.csv - Panel dimensions are the Brazilian state ( i ) and year ( t ). The other columns are: gasoline sales per capita (ln_Sg_pc), prices of gasoline (ln_Pg) and ethanol (ln_Pe) and their lags, motorization rates of combustion vehicles (ln_Mi_c) and electric vehicles (ln_Mi_e), and GDP per capita (ln_gdp_pc). All variables are in natural logs, since we use them to calculate demand elasticities in a regression model.

adjacency.csv - The adjacency matrix used in interaction with electric vehicles' motorization rates to calculate spatial effects. At first, it follows a binary adjacency formula: for each pair of states i and j, the cell (i, j) is 0 if the states are not adjacent and 1 if they are. Then, each row is normalized to have sum equal to one.

regression.do - Series of Stata commands used to estimate the regression models of our study. dataset.csv must be imported to work, see comment section.

dataset_predictions.xlsx - Based on the estimations from Stata, we use this Excel file to make average predictions by year and by state. By including years beyond the last panel sample, we also forecast the model into the future and evaluate the effects of different policies that influence gasoline prices (taxation) and EV motorization rates (electrification). This file is primarily used to create images, but can also be used to further understand how the forecasting scenarios are set up.
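The construction of adjacency.csv described in the file list (binary adjacency, then each row normalized to sum to one) can be sketched as follows. The three-state layout, the sample values, and the final matrix product used as a "spatial lag" of the EV motorization rate are illustrative assumptions, not the authors' Stata procedure.

```python
import numpy as np

def row_normalize(adj):
    """Scale each row of a binary adjacency matrix so that it sums to one."""
    adj = np.asarray(adj, dtype=float)
    row_sums = adj.sum(axis=1, keepdims=True)
    # Assumption: a state with no neighbours keeps an all-zero row,
    # which also avoids dividing by zero.
    return np.divide(adj, row_sums, out=np.zeros_like(adj), where=row_sums != 0)

# Hypothetical layout: state 0 borders states 1 and 2; states 1 and 2 border only state 0.
binary = np.array([
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
])
W = row_normalize(binary)

# Hypothetical ln_Mi_e values for the three states in one year.
ln_mi_e = np.array([0.2, 0.4, 0.6])
# Each entry is the average EV motorization rate among that state's neighbours.
spatial_lag = W @ ln_mi_e
print(W)
print(spatial_lag)
```

Row normalization means the product with a state-level variable is a neighbour average, so the resulting term stays on the same scale as the variable itself.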

Sources:
Fuel prices and sales: ANP (https://www.gov.br/anp/en/access-information/what-is-anp/what-is-anp)
State population, GDP and vehicle fleet: IBGE (https://www.ibge.gov.br/en/home-eng.html?lang=en-GB)
State EV fleet: Anfavea (https://anfavea.com.br/en/site/anuarios/)
Linear Regression example Dataset
kaggle.com
zip
Updated May 5, 2021
Cite
Çağrı Karadeniz (2021). Linear Regression example Dataset [Dataset]. https://www.kaggle.com/arkaradeniz/linear-regression-example-dataset
Explore at:
Available download formats: zip (8811856 bytes)
Dataset updated
May 5, 2021
Authors
Çağrı Karadeniz
Description
Dataset

This dataset was created by Çağrı Karadeniz

Contents

It contains the following files:
Student Performance (Multiple Linear Regression) Dataset
cubig.ai
Updated May 29, 2025
Cite
CUBIG (2025). Student Performance (Multiple Linear Regression) Dataset [Dataset]. https://cubig.ai/store/products/392/student-performance-multiple-linear-regression-dataset
Explore at:
Dataset updated
May 29, 2025
Dataset authored and provided by
CUBIG
License
https://cubig.ai/store/terms-of-service
Measurement technique
Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
Description
1) Data Introduction
• The Student Performance (Multiple Linear Regression) Dataset is designed to analyze the relationship between students’ learning habits and academic performance. Each sample includes key indicators related to learning, such as study hours, sleep duration, previous test scores, and the number of practice exams completed.

2) Data Utilization
(1) Characteristics of the Student Performance (Multiple Linear Regression) Dataset:
• The target variable, Hours Studied, quantitatively represents the amount of time a student has invested in studying. The dataset is structured to allow modeling and inference of learning behaviors based on correlations with other variables.

(2) Applications of the Student Performance (Multiple Linear Regression) Dataset:
• AI-Based Study Time Prediction Models: The dataset can be used to develop regression models that estimate a student’s expected study time based on inputs like academic performance, sleep habits, and engagement patterns.
• Behavioral Analysis and Personalized Learning Strategies: It can be applied to identify students with insufficient study time and design personalized study interventions based on academic and lifestyle patterns.
The two‐sample linear regression model with interval‐censored covariates...
journaldata.zbw.eu
jda-test.zbw.eu
txt
Updated Dec 7, 2022
Cite
David Pacini (2022). The two‐sample linear regression model with interval‐censored covariates (replication data) [Dataset]. http://doi.org/10.15456/jae.2022327.0707557005
Explore at:
Available download formats: txt (4434)
Unique identifier
https://doi.org/10.15456/jae.2022327.0707557005
Dataset updated
Dec 7, 2022
Dataset provided by
ZBW - Leibniz Informationszentrum Wirtschaft
Authors
David Pacini
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
There are surveys that gather precise information on an outcome of interest, but measure continuous covariates by a discrete number of intervals, in which case the covariates are interval censored. For applications with a second independent dataset precisely measuring the covariates, but not the outcome, this paper introduces a semiparametrically efficient estimator for the coefficients in a linear regression model. The second sample serves to establish point identification. An empirical application investigating the relationship between income and body mass index illustrates the use of the estimator.
SPSS Data Set S1 Logistic Regression Model Data
figshare.com
bin
Updated Jan 19, 2016
Cite
Michelle Klailova; Phyllis Lee (2016). SPSS Data Set S1 Logistic Regression Model Data [Dataset]. http://doi.org/10.6084/m9.figshare.1051748.v2
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.1051748.v2
Dataset updated
Jan 19, 2016
Dataset provided by
Figshare (http://figshare.com/)
Authors
Michelle Klailova; Phyllis Lee
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data set from the published PLOS ONE article entitled "Western Lowland Gorillas Signal Selectively Using Odor".
Employee Data
kaggle.com
Updated Mar 8, 2025
Cite
Zahid Feroze (2025). Employee Data [Dataset]. https://www.kaggle.com/datasets/zahidmughal2343/employee-data
Explore at:
Croissant: Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 8, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Zahid Feroze
Description
The 10,000 Worlds Employee Dataset is a comprehensive dataset designed for analyzing workforce trends, employee performance, and organizational dynamics within a large-scale company setting. This dataset contains information on 10,000 employees, spanning various departments, roles, and experience levels. It is ideal for research in human resource analytics, machine learning applications in employee retention, performance prediction, and diversity analysis.

Key Features of the Dataset:

Employee Demographics:
• Age, gender, ethnicity
• Education level, degree specialization
• Years of experience

Employment Details:
• Department (e.g., HR, Engineering, Marketing)
• Job title and seniority level
• Employment type (full-time, part-time, contract)

Performance & Productivity Metrics:
• Annual performance ratings
• Work hours, overtime details
• Training programs attended

Compensation & Benefits:
• Salary, bonuses, stock options
• Benefits (healthcare, pension plans, remote work options)

Employee Engagement & Retention:
• Job satisfaction scores
• Attrition and turnover rates
• Promotion history and career growth

Workplace Environment Factors:
• Team collaboration metrics
• Employee feedback and survey results
• Work-life balance indicators

Use Cases:
• HR Analytics: Identifying patterns in employee satisfaction, retention, and performance.
• Predictive Modeling: Forecasting attrition risks and promotion likelihoods.
• Diversity & Inclusion Analysis: Understanding representation across departments.
• Compensation Benchmarking: Comparing salaries and benefits within and across industries.

This dataset is highly valuable for data scientists, HR professionals, and business analysts looking to gain insights into workforce dynamics and improve organizational strategies.

Simulation Data Set
catalog.data.gov
s.cnmilf.com
Updated Nov 12, 2020
Cite
U.S. EPA Office of Research and Development (ORD) (2020). Simulation Data Set [Dataset]. https://catalog.data.gov/dataset/simulation-data-set
Explore at:
Dataset updated
Nov 12, 2020
Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
Description
These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed.

It can be accessed through the following means:
File format: R workspace file; “Simulated_Dataset.RData”.

Metadata (including data dictionary)
• y: Vector of binary responses (1: adverse outcome, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

Code Abstract
We provide R statistical software code (“CWVS_LMC.txt”) to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code (“Results_Summary.txt”) to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.

Description
“CWVS_LMC.txt”: This code is delivered to the user in the form of a .txt file that contains R statistical software code. Once the “Simulated_Dataset.RData” workspace has been loaded into R, the text in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities.
“Results_Summary.txt”: This code is also delivered to the user in the form of a .txt file that contains R statistical software code. Once the “CWVS_LMC.txt” code is applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

Optional Information (complete as necessary)
Required R packages:
• For running “CWVS_LMC.txt”:
  • msm: Sampling from the truncated normal distribution
  • mnormt: Sampling from the multivariate normal distribution
  • BayesLogit: Sampling from the Polya-Gamma distribution
• For running “Results_Summary.txt”:
  • plotrix: Plotting the posterior means and credible intervals

Instructions for Use
Reproducibility (Mandatory)
What can be reproduced: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 from the presented simulation study.
How to use the information:
• Load the “Simulated_Dataset.RData” workspace
• Run the code contained in “CWVS_LMC.txt”
• Once the “CWVS_LMC.txt” code is complete, run “Results_Summary.txt”.

Format: Below is the replication procedure for the attached data set for the portion of the analyses using a simulated data set:

Data
The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

Availability
Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics, and requires an appropriate data use agreement.

Description
Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).
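The week-by-week standardization described in this entry (subtract the weekly median, divide by the weekly IQR) can be sketched as below. This is a Python illustration rather than the authors' R code; the function name and the toy exposure matrix are hypothetical, and the released z matrix is already standardized, so this step would not need to be repeated on it.

```python
import numpy as np

def standardize_by_week(z):
    """Standardize each column (week) by its median and interquartile range.

    z has one row per individual and one column per exposure week.
    """
    z = np.asarray(z, dtype=float)
    median = np.median(z, axis=0)
    q75, q25 = np.percentile(z, [75, 25], axis=0)
    return (z - median) / (q75 - q25)

# Hypothetical raw exposures: five individuals, two weeks.
raw = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 40.0],
    [5.0, 50.0],
])
z_std = standardize_by_week(raw)
print(z_std)
```

After this transformation each week has a median of zero and an IQR of one, which is why withholding the original medians and IQRs keeps the raw exposure levels from being recovered.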
Example Groundwater-Level Datasets and Benchmarking Results for the...
catalog.data.gov
data.usgs.gov
Updated Oct 13, 2024
Cite
U.S. Geological Survey (2024). Example Groundwater-Level Datasets and Benchmarking Results for the Automated Regional Correlation Analysis for Hydrologic Record Imputation (ARCHI) Software Package [Dataset]. https://catalog.data.gov/dataset/example-groundwater-level-datasets-and-benchmarking-results-for-the-automated-regional-cor
Explore at:
Dataset updated
Oct 13, 2024
Dataset provided by
U.S. Geological Survey
Description
This data release provides two example groundwater-level datasets used to benchmark the Automated Regional Correlation Analysis for Hydrologic Record Imputation (ARCHI) software package (Levy and others, 2024). The first dataset contains groundwater-level records and site metadata for wells located on Long Island, New York (NY) and some surrounding mainland sites in New York and Connecticut. The second dataset contains groundwater-level records and site metadata for wells located in the southeastern San Joaquin Valley of the Central Valley, California (CA). For ease of exposition these are referred to as NY and CA datasets, respectively. Both datasets are formatted with column headers that can be read by the ARCHI software package within the R computing environment. These datasets were used to benchmark the imputation accuracy of three ARCHI model settings (OLS, ridge, and MOVE.1) against the widely used imputation program missForest (Stekhoven and Bühlmann, 2012). The ARCHI program was used to process the NY and CA datasets on monthly and annual timesteps, respectively, filter out sites with insufficient data for imputation, and create 200 test datasets from each of the example datasets with 5 percent of observations removed at random (herein, referred to as "holdouts"). Imputation accuracy for test datasets was assessed using normalized root mean square error (NRMSE), which is the root mean square error divided by the standard deviation of the observed holdout values. ARCHI produces prediction intervals (PIs) using a non-parametric bootstrapping routine, which were assessed by computing a coverage rate (CR) defined as the proportion of holdout observations falling within the estimated PI. 
The multiple regression models included with the ARCHI package (OLS and ridge) were further tested on all test datasets at eleven different levels of the p_per_n input parameter, which limits the maximum ratio of regression model predictors (p) per observations (n) as a decimal fraction greater than zero and less than or equal to one.

This data release contains ten tables formatted as tab-delimited text files:
• The “CA_data.txt” and “NY_data.txt” tables contain 243,094 and 89,997 depth-to-groundwater measurement values (value, in feet below land surface) indexed by site identifier (site_no) and measurement date (date) for CA and NY datasets, respectively.
• The “CA_sites.txt” and “NY_sites.txt” tables contain site metadata for the 4,380 and 476 unique sites included in the CA and NY datasets, respectively.
• The “CA_NRMSE.txt” and “NY_NRMSE.txt” tables contain NRMSE values computed by imputing 200 test datasets with 5 percent random holdouts to assess imputation accuracy for three different ARCHI model settings and missForest using CA and NY datasets, respectively.
• The “CA_CR.txt” and “NY_CR.txt” tables contain CR values used to evaluate non-parametric PIs generated by bootstrapping regressions with three different ARCHI model settings using the CA and NY test datasets, respectively.
• The “CA_p_per_n.txt” and “NY_p_per_n.txt” tables contain mean NRMSE values computed for 200 test datasets with 5 percent random holdouts at 11 different levels of p_per_n for OLS and ridge models compared to training error for the same models on the entire CA and NY datasets, respectively.

References Cited
Levy, Z.F., Stagnitta, T.J., and Glas, R.L., 2024, ARCHI: Automated Regional Correlation Analysis for Hydrologic Record Imputation, v1.0.0: U.S. Geological Survey software release, https://doi.org/10.5066/P1VVHWKE.
Stekhoven, D.J., and Bühlmann, P., 2012, MissForest - non-parametric missing value imputation for mixed-type data: Bioinformatics 28(1), 112-118. https://doi.org/10.1093/bioinformatics/btr597.
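The two accuracy measures defined in this entry, NRMSE (root mean square error divided by the standard deviation of the observed holdout values) and coverage rate (the proportion of holdouts falling within the prediction interval), can be sketched as follows. This is an independent Python illustration, not code from the ARCHI R package; the function names, the sample values, and the use of the sample standard deviation (ddof=1) are assumptions.

```python
import numpy as np

def nrmse(observed, imputed):
    """Root mean square error divided by the standard deviation of the observed holdouts."""
    observed = np.asarray(observed, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    rmse = np.sqrt(np.mean((imputed - observed) ** 2))
    # Assumption: sample standard deviation (ddof=1).
    return rmse / np.std(observed, ddof=1)

def coverage_rate(observed, lower, upper):
    """Proportion of holdout observations inside their prediction intervals."""
    observed = np.asarray(observed, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    return float(np.mean((observed >= lower) & (observed <= upper)))

# Hypothetical holdout depths to groundwater (feet) and their imputed values.
observed = [10.0, 12.0, 14.0, 16.0]
imputed = np.array([11.0, 12.0, 13.0, 16.0])
score = nrmse(observed, imputed)
cr = coverage_rate(observed, imputed - 0.5, imputed + 0.5)
print(score, cr)
```

An NRMSE below one means the imputation error is smaller than the natural spread of the held-out values, and a coverage rate close to the nominal interval level suggests the prediction intervals are well calibrated.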
Predict Purity and Price of Honey
kaggle.com
Updated Feb 18, 2024
Cite
Umair Zia (2024). Predict Purity and Price of Honey [Dataset]. https://www.kaggle.com/datasets/stealthtechnologies/predict-purity-and-price-of-honey
Explore at:
Croissant: Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 18, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Umair Zia
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
About Dataset

CS (Color Score): Represents the color score of the honey sample, ranging from 1.0 to 10.0. Lower values indicate a lighter color, while higher values indicate a darker color.

Density: Represents the density of the honey sample in grams per cubic centimeter at 25°C, ranging from 1.21 to 1.86.

WC (Water Content): Represents the water content in the honey sample, ranging from 12.0% to 25.0%.

pH: Represents the pH level of the honey sample, ranging from 2.50 to 7.50.

EC (Electrical Conductivity): Represents the electrical conductivity of the honey sample in milliSiemens per centimeter.

F (Fructose Level): Represents the fructose level of the honey sample, ranging from 20 to 50.

G (Glucose Level): Represents the glucose level of the honey sample, ranging from 20 to 45.

Pollen_analysis: Represents the floral source of the honey sample. Possible values include Clover, Wildflower, Orange Blossom, Alfalfa, Acacia, Lavender, Eucalyptus, Buckwheat, Manuka, Sage, Sunflower, Borage, Rosemary, Thyme, Heather, Tupelo, Blueberry, Chestnut, and Avocado.

Viscosity: Represents the viscosity of the honey sample in centipoise, ranging from 1500 to 10000. Viscosity values between 2500 and 9500 are considered optimal for purity.

Purity: The target variable represents the purity of the honey sample, ranging from 0.01 to 1.00.

Price: The calculated price of the honey.
TELEVISION DATASET 2022
opendatabay.com
.csv
Updated Jun 11, 2025
Cite
Datasimple (2025). TELEVISION DATASET 2022 [Dataset]. https://www.opendatabay.com/data/ai-ml/463caf4d-f082-4605-aa0c-3f004a6386fd
Explore at:
Available download formats: .csv
Dataset updated
Jun 11, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Data Science and Analytics
Description
This dataset contains 886 samples with 12 attributes. The dataset needs cleaning before use: it contains some missing values, and you can delete columns that you think contribute very little.

Columns in this dataset:

1. Product_Name: The manufacturer (brand) of the television.
2. Stars: Average customer rating on a scale of 5, e.g. 4/5.
3. Ratings: The number of people who gave a star rating. For example, Stars = 4 and Ratings = 5000 means that 5000 people gave 4 stars to a particular product.
4. Reviews: The number of people who left a comment after buying the product.
5. Current Price: The selling price or discounted price of the product.
6. MRP: The original price of the product from the manufacturer.
7. channel: The channels supported by the product, e.g. Netflix|Prime Video|Disney+Hotstar|Youtube|HD Ready.
8. Operating_system: A categorical variable giving the type of OS, such as Android or Linux.
9. Picture_qualtiy(resolution): Has multiple categories and indicates the type of display, e.g. LED, HD LED.
10. Speaker: The type of speaker used by the company, e.g. 20 W Speaker Output, 2 x HDMI | 2 x USB.
11. Frequency: The television broadcast frequency, e.g. 60 Hz Refresh Rate.
12. Image_url: A link to the product image, e.g. https://rukminim1.flixcart.com/image/312/312/ku1k4280/television/p/f/6/crel7369- croma-original-imag7969pxhrwp2k.jpeg?q=70

Inspiration: This dataset could be used to explore the current market scenario for televisions. There are various types of screens with different operating systems offered by several manufacturers at competitive prices. Some questions this dataset could be used to answer are:

• What is the demand for different types of televisions, and how many players are in the market?
• Which are the top 5 brands for television?
• Which brand has the highest number of products (televisions)?
• Are televisions with higher ratings more expensive?
• What is the average selling price by Product_Name (brand)?

Original Data Source: TELEVISION DATASET 2022
Calibration datasets and model archive summaries for regression models...
catalog.data.gov
data.usgs.gov
Updated Jul 6, 2024
Cite
U.S. Geological Survey (2024). Calibration datasets and model archive summaries for regression models developed to estimate metal concentrations at nine sites on the Animas and San Juan Rivers, Colorado, New Mexico, and Utah: U.S. Geological Survey data release, https://doi.org/10.5066/P9THSFE0 [Dataset]. https://catalog.data.gov/dataset/calibration-datasets-and-model-archive-summaries-for-regression-models-developed-to-estima
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Area covered
Utah, Colorado, New Mexico
Description
This data release supports the following publication: Mast, M. A., 2018, Estimating metal concentrations with regression analysis and water-quality surrogates at nine sites on the Animas and San Juan Rivers, Colorado, New Mexico, and Utah: U.S. Geological Survey Scientific Investigations Report 2018-5116.

The U.S. Geological Survey (USGS), in cooperation with the U.S. Environmental Protection Agency (EPA), developed site-specific regression models to estimate concentrations of selected metals at nine USGS streamflow-gaging stations along the Animas and San Juan Rivers. Multiple linear-regression models were developed by relating metal concentrations in discrete water-quality samples to continuously monitored streamflow and surrogate parameters including specific conductance, pH, turbidity, and water temperature. Models were developed for dissolved and total concentrations of aluminum, arsenic, cadmium, iron, lead, manganese, and zinc using water-quality samples collected during 2005–17 by several agencies, using different collection methods and analytical laboratories.

Calibration datasets in comma-separated format (CSV) include the variables of sampling date and time, metal concentrations (in micrograms per liter), stream discharge (in cubic feet per second), specific conductance (in microsiemens per centimeter at 25 degrees Celsius), pH, water temperature (in degrees Celsius), turbidity (in nephelometric turbidity units), and calculated seasonal terms based on Julian day.

Surrogate parameters and discrete water-quality samples were used from nine sites:
• Cement Creek at Silverton, Colo. (USGS station 09358550)
• Animas River below Silverton, Colo. (USGS station 09359020)
• Animas River at Durango, Colo. (USGS station 09361500)
• Animas River near Cedar Hill, N. Mex. (USGS station 09363500)
• Animas River below Aztec, N. Mex. (USGS station 09364010)
• San Juan River at Farmington, N. Mex. (USGS station 09365000)
• San Juan River at Shiprock, N. Mex. (USGS station 09368000)
• San Juan River at Four Corners, Colo. (USGS station 09371010)
• San Juan River near Bluff, Utah (USGS station 09379500)

Model archive summaries in PDF format include model statistics, data, and plots and were generated using an R script developed by the USGS Kansas Water Science Center, available at https://patrickeslick.github.io/ModelArchiveSummary/. A description of each USGS streamflow-gaging station, along with information about the calibration datasets, is also provided.
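A minimal sketch of a surrogate regression of the kind described in this entry is shown below: a response is regressed on continuously monitored surrogates plus seasonal terms derived from the day of year. Everything specific here is an assumption made for illustration, including the choice of predictors, the sine and cosine form of the seasonal terms, the log-transformed variables, and the synthetic noise-free data. The cited report and model archive summaries define the actual models.

```python
import numpy as np

def design_matrix(spec_cond, log_flow, day_of_year):
    """Build a design matrix with an intercept, two surrogates, and seasonal terms."""
    # Assumed seasonal form: one annual sine/cosine pair based on day of year.
    t = 2.0 * np.pi * np.asarray(day_of_year, dtype=float) / 365.25
    return np.column_stack([
        np.ones_like(t),   # intercept
        spec_cond,         # specific conductance
        log_flow,          # log of stream discharge
        np.sin(t),
        np.cos(t),
    ])

# Synthetic calibration data (hypothetical values, not from the data release).
rng = np.random.default_rng(0)
n = 200
spec_cond = rng.uniform(100.0, 800.0, n)
log_flow = rng.normal(5.0, 1.0, n)
day = rng.integers(1, 366, n)

true_beta = np.array([1.0, 0.004, -0.3, 0.5, 0.2])
X = design_matrix(spec_cond, log_flow, day)
log_conc = X @ true_beta  # noise-free synthetic response, so the fit is exact

# Ordinary least squares fit of the multiple linear regression.
beta_hat, *_ = np.linalg.lstsq(X, log_conc, rcond=None)
print(beta_hat)
```

Once fitted on discrete samples, a model like this can be evaluated on the continuous surrogate record to estimate concentrations between sampling visits.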
AirQualityCOVID-dataset
zenodo.org
data.niaid.nih.gov
zip
Updated Apr 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jaime González-Pardo; Rodrigo Manzanas; Sandra Ceballos-Santos (2023). AirQualityCOVID-dataset [Dataset]. http://doi.org/10.5281/zenodo.5642868
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.5642868
Dataset updated
Apr 23, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Jaime González-Pardo; Rodrigo Manzanas; Sandra Ceballos-Santos
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains all the data used for the article "Estimating changes in air pollutant levels due to COVID-19 lockdown measures based on a business-as-usual prediction scenario using data mining models: A case-study for urban traffic sites in Spain", submitted to Environmental Software & Modelling by J. González-Pardo et al. (2022) published in Science of the Total Environment (STOTEN). For the sake of reproducibility, it includes Jupyter notebooks with worked examples that allow the results shown in that paper to be reproduced.

Contact: jaime.diez.gp@gmail.com

During the course of this research, the pyaemet Python library was developed to download daily meteorological observations from the Spanish Met Service (AEMET) via its OpenData REST API. The library is needed to perform the data curation process.

This research was developed in the framework of the project “Contaminación atmosférica y COVID-19: ¿Qué podemos aprender de esta pandemia?”, selected in the Extraordinary BBVA Foundation grant call for SARS-CoV-2 and COVID-19 research proposals, within the area of ecology and veterinary science.
UCI and OpenML Data Sets for Ordinal Quantification
zenodo.org
data.niaid.nih.gov
zip
Updated Jul 25, 2023
Cite
Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.8177302
Dataset updated
Jul 25, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
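
To illustrate how ordinal class labels can stem from binning a continuous regression label, here is a minimal Python sketch. The bin edges, the 1-based class numbering, and the function name are assumptions for illustration; the binning actually used by the provided scripts is not described here.

```python
from bisect import bisect_right

def bin_labels(values, edges):
    """Map each continuous value to a 1-based ordinal class index.

    A value equal to an edge falls into the higher class.
    """
    return [bisect_right(edges, v) + 1 for v in values]

edges = [10.0, 20.0, 30.0]              # hypothetical bin edges -> 4 classes
targets = [3.2, 10.0, 19.9, 25.0, 41.7]  # hypothetical regression labels
classes = bin_labels(targets, edges)     # [1, 2, 2, 3, 4]
```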

We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
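
Replicating a sample from such an index file could look like the following Python sketch. It assumes that each row holds comma-separated integer indices, that there is no header row, and that indices are 1-based; none of these details are confirmed by the description above, so check the files before relying on them. The in-memory text and data rows are hypothetical stand-ins for the real files.

```python
import csv
import io

def read_samples(index_file):
    """Parse an index file in which each row lists the item indices of one sample."""
    return [[int(i) for i in row] for row in csv.reader(index_file) if row]

def draw_sample(data_rows, indices, offset=1):
    """Select the data rows named by one sample; offset=1 assumes 1-based indices."""
    return [data_rows[i - offset] for i in indices]

# Hypothetical stand-ins for an index file and an extracted data set
index_text = "1,3,3\n2,4,1\n"
data_rows = [("1", 0.1), ("2", 0.5), ("2", 0.7), ("3", 0.9)]

samples = read_samples(io.StringIO(index_text))
first = draw_sample(data_rows, samples[0])
```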

Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
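
The two protocols can be sketched in Python as follows. Drawing label distributions with equal probability is done here by taking the gaps between sorted uniform random numbers, which is a standard way to sample uniformly from the probability simplex. The roughness measure (sum of squared second differences) is an assumption chosen for illustration and may differ from the smoothness criterion used by the authors; the function names are hypothetical.

```python
import random

def sample_prevalences(n_classes, rng):
    """Draw a label distribution uniformly at random from the probability simplex."""
    cuts = sorted(rng.random() for _ in range(n_classes - 1))
    bounds = [0.0] + cuts + [1.0]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]

def roughness(p):
    """Sum of squared second differences; an assumed smoothness measure."""
    return sum((p[i - 1] - 2 * p[i] + p[i + 1]) ** 2 for i in range(1, len(p) - 1))

rng = random.Random(42)
app = [sample_prevalences(5, rng) for _ in range(1000)]

# APP-OQ-style selection: keep the 20% of samples with the lowest roughness
app_oq = sorted(app, key=roughness)[: len(app) // 5]
```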

Usage

You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

Data Extraction: In your terminal, you can call either

make

(recommended), or

julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl

Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

Further Reading

Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
Dataset: The Impact of Altitude Training on NCAA Division I Female Swimmers’...
figshare.com
xlsx
Updated May 31, 2023
Cite
Katherine Manzione (2023). Dataset: The Impact of Altitude Training on NCAA Division I Female Swimmers’ Performance [Dataset]. http://doi.org/10.6084/m9.figshare.22736030.v1
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.22736030.v1
Dataset updated
May 31, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Katherine Manzione
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains the data used in the paper: "The Impact of Altitude Training on NCAA Division I Female Swimmers’ Performance" being submitted to the International Journal of Performance Analysis in Sport.
srsd-feynman_easy
huggingface.co
Updated Dec 5, 2024
+ more versions
Cite
Yoshitomo Matsubara (2024). srsd-feynman_easy [Dataset]. http://doi.org/10.57967/hf/0763
Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.57967/hf/0763
Dataset updated
Dec 5, 2024
Authors
Yoshitomo Matsubara
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Card for SRSD-Feynman (Easy set)

Dataset Summary

Our SRSD (Feynman) datasets are designed to discuss the performance of Symbolic Regression for Scientific Discovery. We carefully reviewed the properties of each formula and its variables in the Feynman Symbolic Regression Database to design a reasonably realistic sampling range of values so that our SRSD datasets can be used for evaluating the potential of SRSD, such as whether or not an SR method can (re)discover… See the full description on the dataset page: https://huggingface.co/datasets/yoshitomo-matsubara/srsd-feynman_easy.
Data from: Sample simulation files underlying the publication: "Computing...
data.4tu.nl
investiga.upo.es
zip
Updated Mar 20, 2025
Cite
Shrinjay Sharma; Richard Baur; Marcello S. Rigutto; Erik Zuidema; Umang Agarwal; Sofia Calero; David Dubbeldam; Thijs Vlugt (2025). Sample simulation files underlying the publication: "Computing Entropy for Long-Chain Alkanes Using Linear Regression: Application to Hydroisomerization" [Dataset]. http://doi.org/10.4121/e855547e-384c-4802-9faa-00981a99419b.v1
Available download formats: zip
Unique identifier
https://doi.org/10.4121/e855547e-384c-4802-9faa-00981a99419b.v1
Dataset updated
Mar 20, 2025
Dataset provided by
4TU.ResearchData
Authors
Shrinjay Sharma; Richard Baur; Marcello S. Rigutto; Erik Zuidema; Umang Agarwal; Sofia Calero; David Dubbeldam; Thijs Vlugt
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Sample simulation files for "Computing Entropy for Long-Chain Alkanes Using Linear Regression: Application to Hydroisomerization". Please read the README file for more information, and refer to the main manuscript for more details.

The file Entropy_alkanes.xlsx contains the computed entropy values derived from the enthalpies and Gibbs free energies predicted by our linear regression model for isomers ranging from C1 to C14. This file also contains the comparison between the absolute entropies computed using our model and those predicted using group contribution methods by Benson et al. and Constantinou and Gani.
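
The description above says the entropies are derived from predicted enthalpies and Gibbs free energies. One standard route, assumed here rather than taken from the manuscript, is the relation G = H - T*S, rearranged to S = (H - G) / T for quantities referenced to the same state. If formation properties are used, the entropies of the elements would also enter, so the manuscript should be consulted for the exact procedure. The numbers below are hypothetical.

```python
def entropy_from_h_and_g(h, g, temperature):
    """Return S = (H - G) / T, which follows from G = H - T*S.

    With h and g in J/mol and temperature in K, the result is in J/(mol K).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive (in kelvin)")
    return (h - g) / temperature

# Hypothetical values, not taken from Entropy_alkanes.xlsx
s = entropy_from_h_and_g(h=-100_000.0, g=-150_000.0, temperature=500.0)
```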
Data from: Sample simulation files underlying the publication: "Prediction...
data.4tu.nl
investiga.upo.es
zip
Updated Mar 20, 2025
Cite
Shrinjay Sharma; Josh J. Sleijfer; Stach van der Zeeuw; Daniil Zorzos; Silvia Lasala; Marcello S. Rigutto; Erik Zuidema; Umang Agarwal; Richard Baur; Sofia Calero; David Dubbeldam; Thijs Vlugt (2025). Sample simulation files underlying the publication: "Prediction of Thermochemical Properties of Long-Chain Alkanes Using Linear Regression: Application to Hydroisomerization" [Dataset]. http://doi.org/10.4121/f905ae7e-2950-4a9b-91b4-b592e4c788e9.v1
Available download formats: zip
Unique identifier
https://doi.org/10.4121/f905ae7e-2950-4a9b-91b4-b592e4c788e9.v1
Dataset updated
Mar 20, 2025
Dataset provided by
4TU.ResearchData
Authors
Shrinjay Sharma; Josh J. Sleijfer; Stach van der Zeeuw; Daniil Zorzos; Silvia Lasala; Marcello S. Rigutto; Erik Zuidema; Umang Agarwal; Richard Baur; Sofia Calero; David Dubbeldam; Thijs Vlugt
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Sample simulation files for "Prediction of Thermochemical Properties of Long-Chain Alkanes Using Linear Regression: Application to Hydroisomerization". Please read the README file for more information, and refer to the main manuscript for more details.

The training data set for the thermochemical properties of all isomers ranging from C1 to C10, obtained from Scott's tables, is listed in Thermochem_properties_C1-C10_isomers_Scotts_tables.xlsx. The Python script for the linear regression model is provided in Python_script_linear_regression.py. Thermochem_properties_C1-C14_linear_regression.xlsx contains the predicted thermochemical properties of all isomers up to C14 molecules at temperatures ranging from 0 to 1000 K. The coefficients of the occurrences of the second-order groups obtained using linear regression and the corresponding temperature-dependent quadratic polynomial fits are also listed in this file. The reaction equilibrium distribution data for the hydroisomerization of C10 and C14 molecules in the gas phase and in MTW-type zeolite at 500 K are tabulated in this Excel file.
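
The two-step idea described above, regressing a property on group occurrence counts at each temperature and then fitting each group's coefficient with a quadratic polynomial in temperature, can be sketched in Python as follows. This is not the content of Python_script_linear_regression.py. The count matrix, the three groups, and the synthetic contributions are hypothetical, and NumPy's least-squares and polynomial-fit routines are stand-ins for whatever the authors used.

```python
import numpy as np

# Rows: molecules; columns: occurrence counts of three hypothetical groups
counts = np.array([
    [2, 0, 0],
    [2, 1, 0],
    [2, 2, 0],
    [3, 0, 1],
    [3, 1, 1],
    [4, 0, 2],
], dtype=float)

def true_contrib(T):
    """Synthetic group contributions, quadratic in temperature by construction."""
    return np.array([
        -40.0 + 0.010 * T + 1e-5 * T**2,
        -20.0 + 0.020 * T - 2e-5 * T**2,
        -10.0 - 0.005 * T + 3e-5 * T**2,
    ])

temperatures = np.array([200.0, 400.0, 600.0, 800.0, 1000.0])

# Step 1: linear regression of the property on group counts at each temperature
fitted = []
for T in temperatures:
    y = counts @ true_contrib(T)  # synthetic property values at T
    coef, *_ = np.linalg.lstsq(counts, y, rcond=None)
    fitted.append(coef)
fitted = np.array(fitted)         # shape: (n_temperatures, n_groups)

# Step 2: quadratic fit of each group's coefficient against temperature
poly = np.polyfit(temperatures, fitted, deg=2)  # rows: T^2, T, constant
```

Each column of `poly` holds the quadratic, linear, and constant terms for one group, so a coefficient at an intermediate temperature can be evaluated with `np.polyval(poly[:, j], T)`.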
Global Burden of Disease analysis dataset of noncommunicable disease...
data.mendeley.com
Updated Apr 6, 2023
+ more versions
Cite
David Cundiff (2023). Global Burden of Disease analysis dataset of noncommunicable disease outcomes, risk factors, and SAS codes [Dataset]. http://doi.org/10.17632/g6b39zxck4.10
Unique identifier
https://doi.org/10.17632/g6b39zxck4.10
Dataset updated
Apr 6, 2023
Authors
David Cundiff
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This formatted dataset (AnalysisDatabaseGBD) originates from raw data files from the Institute of Health Metrics and Evaluation (IHME) Global Burden of Disease Study (GBD2017) affiliated with the University of Washington. We are volunteer collaborators with IHME and not employed by IHME or the University of Washington.

The population weighted GBD2017 data are on male and female cohorts ages 15-69 years including noncommunicable diseases (NCDs), body mass index (BMI), cardiovascular disease (CVD), and other health outcomes and associated dietary, metabolic, and other risk factors. The purpose of creating this population-weighted, formatted database is to explore the univariate and multiple regression correlations of health outcomes with risk factors. Our research hypothesis is that we can successfully model NCDs, BMI, CVD, and other health outcomes with their attributable risks.

These Global Burden of Disease data relate to the preprint: The EAT-Lancet Commission Planetary Health Diet compared with Institute of Health Metrics and Evaluation Global Burden of Disease Ecological Data Analysis. The data include the following:
1. Analysis database of population weighted GBD2017 data that includes over 40 health risk factors, noncommunicable disease deaths/100k/year of male and female cohorts ages 15-69 years from 195 countries (the primary outcome variable that includes over 100 types of noncommunicable diseases) and over 20 individual noncommunicable diseases (e.g., ischemic heart disease, colon cancer, etc.).
2. A text file to import the analysis database into SAS.
3. The SAS code to format the analysis database to be used for analytics.
4. SAS code for deriving Tables 1, 2, 3 and Supplementary Tables 5 and 6.
5. SAS code for deriving the multiple regression formula in Table 4.
6. SAS code for deriving the multiple regression formula in Table 5.
7. SAS code for deriving the multiple regression formula in Supplementary Table 7.
8. SAS code for deriving the multiple regression formula in Supplementary Table 8.
9. The Excel files that accompanied the above SAS code to produce the tables.

For questions, please email davidkcundiff@gmail.com. Thanks.
