100+ datasets found
  1. An example data set for exploration of Multiple Linear Regression

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). An example data set for exploration of Multiple Linear Regression [Dataset]. https://catalog.data.gov/dataset/an-example-data-set-for-exploration-of-multiple-linear-regression
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    This data set contains example data for exploration of the theory of regression based regionalization. The 90th percentile of annual maximum streamflow is provided as an example response variable for 293 streamgages in the conterminous United States. Several explanatory variables are drawn from the GAGES-II data base in order to demonstrate how multiple linear regression is applied. Example scripts demonstrate how to collect the original streamflow data provided and how to recreate the figures from the associated Techniques and Methods chapter.
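    The regression-based regionalization described above can be sketched in a few lines. The code below is a hedged illustration, not the dataset's own scripts: the explanatory variables (drainage area, mean precipitation) and all numbers are invented stand-ins for GAGES-II fields.

```python
import numpy as np

# Hedged sketch: synthetic stand-in for the setup described above. The
# response mimics a 90th-percentile annual-maximum streamflow; the two
# explanatory variables are invented basin characteristics.
rng = np.random.default_rng(0)
n = 293  # number of streamgages mentioned in the description
drainage_area = rng.uniform(10.0, 1000.0, n)
mean_precip = rng.uniform(500.0, 2000.0, n)
q90 = 1.5 * drainage_area + 0.8 * mean_precip + rng.normal(0.0, 10.0, n)

# ordinary least-squares fit with an intercept and two slopes
X = np.column_stack([np.ones(n), drainage_area, mean_precip])
beta, *_ = np.linalg.lstsq(X, q90, rcond=None)
print(np.round(beta, 3))  # estimates should land near [0, 1.5, 0.8]
```

    With 293 gages and modest noise, the fitted slopes recover the (made-up) generating coefficients closely, which is the point such an example dataset is meant to demonstrate.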

  2. Data for multiple linear regression models for predicting microcystin...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Data for multiple linear regression models for predicting microcystin concentration action-level exceedances in selected lakes in Ohio [Dataset]. https://catalog.data.gov/dataset/data-for-multiple-linear-regression-models-for-predicting-microcystin-concentration-action
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Ohio
    Description

    Site-specific multiple linear regression models were developed for eight sites in Ohio—six in the Western Lake Erie Basin and two in northeast Ohio on inland reservoirs--to quickly predict action-level exceedances for a cyanotoxin, microcystin, in recreational and drinking waters used by the public. Real-time models include easily- or continuously-measured factors that do not require that a sample be collected. Real-time models are presented in two categories: (1) six models with continuous monitor data, and (2) three models with on-site measurements. Real-time models commonly included variables such as phycocyanin, pH, specific conductance, and streamflow or gage height. Many of the real-time factors were averages over time periods antecedent to the time the microcystin sample was collected, including water-quality data compiled from continuous monitors. Comprehensive models use a combination of discrete sample-based measurements and real-time factors. Comprehensive models were useful at some sites with lagged variables (< 2 weeks) for cyanobacterial toxin genes, dissolved nutrients, and (or) N to P ratios. Comprehensive models are presented in three categories: (1) three models with continuous monitor data and lagged comprehensive variables, (2) five models with no continuous monitor data and lagged comprehensive variables, and (3) one model with continuous monitor data and same-day comprehensive variables. Funding for this work was provided by the Ohio Water Development Authority and the U.S. Geological Survey Cooperative Water Program.
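    The "antecedent average" predictors mentioned above are simple to construct. A minimal sketch follows; the window length and the sensor readings are invented, not values from the report.

```python
import numpy as np

# Hedged sketch of an antecedent-average predictor: average a continuously
# monitored variable (made-up phycocyanin readings here) over a window
# ending at the sample time, then feed the average into a regression model.
hourly_phycocyanin = np.array([3.1, 3.4, 4.0, 4.8, 5.5, 6.1])
window = 4  # number of readings before the microcystin sample (assumption)
antecedent_mean = hourly_phycocyanin[-window:].mean()
print(antecedent_mean)  # close to 5.1
```

    Because such averages need only sensor data, not a lab-analyzed sample, they are what makes the "real-time" model category possible.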

  3. Student Performance (Multiple Linear Regression) Dataset

    • cubig.ai
    Updated May 29, 2025
    Cite
    CUBIG (2025). Student Performance (Multiple Linear Regression) Dataset [Dataset]. https://cubig.ai/store/products/392/student-performance-multiple-linear-regression-dataset
    Explore at:
    Dataset updated
    May 29, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
    Description

    1) Data Introduction • The Student Performance (Multiple Linear Regression) Dataset is designed to analyze the relationship between students’ learning habits and academic performance. Each sample includes key indicators related to learning, such as study hours, sleep duration, previous test scores, and the number of practice exams completed.

    2) Data Utilization (1) Characteristics of the Student Performance (Multiple Linear Regression) Dataset: • The target variable, Hours Studied, quantitatively represents the amount of time a student has invested in studying. The dataset is structured to allow modeling and inference of learning behaviors based on correlations with other variables.

    (2) Applications of the Student Performance (Multiple Linear Regression) Dataset: • AI-Based Study Time Prediction Models: The dataset can be used to develop regression models that estimate a student’s expected study time based on inputs like academic performance, sleep habits, and engagement patterns. • Behavioral Analysis and Personalized Learning Strategies: It can be applied to identify students with insufficient study time and design personalized study interventions based on academic and lifestyle patterns.

  4. Subset for multiple regression analysis: socio-demographic data, social...

    • figshare.com
    txt
    Updated Jan 19, 2021
    Cite
    Andrés Aparicio (2021). Subset for multiple regression analysis: socio-demographic data, social distance and the identification of mental health causes [Dataset]. http://doi.org/10.6084/m9.figshare.13607087.v2
    Explore at:
    txt (available download formats)
    Dataset updated
    Jan 19, 2021
    Dataset provided by
    figshare
    Authors
    Andrés Aparicio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data collected following the methodology and procedures described in (1,2). The sample consisted of Chilean adults (18 years of age or older) and was stratified by age, gender, and educational level. Five hundred and eighty-three participants began the process to answer the questionnaires either in person or online. Before the analysis, we excluded incomplete records, questionnaires answered by Chilean people living outside of Chile, and foreign people living in Chile for less than 10 years. This article reports the results obtained from 395 participants (68%). The final sample included adults from 18 to 78 years of age with low, middle and high educational levels.
    1. Scior K, Potts HW, Furnham AF. Awareness of schizophrenia and intellectual disability and stigma across ethnic groups in the UK. Psychiatry Res. 2013 Jul 30;208(2):125-30. Available from: https://www.sciencedirect.com/science/article/pii/S0165178112005604
    2. Scior K, Furnham A. Development and validation of the Intellectual Disability Literacy Scale for assessment of knowledge, beliefs and attitudes to intellectual disability. Res Dev Disabil. 2011 Sep;32(5):1530-41. Available from: http://www.ncbi.nlm.nih.gov/pubmed/21377320

  5. testingdatasetcards

    • huggingface.co
    Updated Feb 2, 2024
    Cite
    Maria Murphy (2024). testingdatasetcards [Dataset]. https://huggingface.co/datasets/mariakmurphy55/testingdatasetcards
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 2, 2024
    Authors
    Maria Murphy
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card for Testingdatasetcards

    Very Simple Multiple Linear Regression Dataset

      Dataset Details

      Dataset Description

    Curated by: HUSSAIN NASIR KHAN (Kaggle)
    Shared by: Maria Murphy
    Language(s) (NLP): English
    License: CC0: Public Domain

      Uses
    

    Intended for practice with linear regression.

      Dataset Structure
    

    Contains three columns (age, experience, income) and twenty observations.
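    With only three columns, the card's structure maps directly onto a two-predictor regression. The rows below are invented for illustration, not the dataset's actual twenty observations.

```python
import numpy as np

# Hedged sketch: invented (age, experience, income) rows with an exact
# linear relationship, so an ordinary least-squares fit recovers it.
age = np.array([25.0, 30.0, 35.0, 40.0, 45.0])
experience = np.array([1.0, 4.0, 8.0, 12.0, 18.0])
income = 20000 + 500 * age + 1500 * experience

X = np.column_stack([np.ones_like(age), age, experience])
coef, *_ = np.linalg.lstsq(X, income, rcond=None)
print(coef)  # recovers roughly [20000, 500, 1500]
```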

  6. Data from: Data and model archive for multiple linear regression models for...

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Data and model archive for multiple linear regression models for prediction of weighted cyanotoxin mixture concentrations and microcystin concentrations at three recurring bloom sites in Kabetogama Lake in Minnesota [Dataset]. https://catalog.data.gov/dataset/data-and-model-archive-for-multiple-linear-regression-models-for-prediction-of-weighted-cy
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Kabetogama lake, Minnesota
    Description

    Multiple linear regression models were developed using data collected in 2016 and 2017 from three recurring bloom sites in Kabetogama Lake in northern Minnesota. These models were developed to predict concentrations of cyanotoxins (anatoxin-a, microcystin, and saxitoxin) that occur within the blooms. Virtual Beach software (version 3.0.6) was used to develop four models: two cyanotoxin mixture (MIX) models and two microcystin (MC) models. Models include those using readily available environmental variables (for example, wind speed and specific conductance) and those using additional comprehensive variables (based on laboratory analyses). Many of the independent variables were averages over a certain time period prior to a sample date, whereas other independent variables were lagged between 4 and 8 days. Funding for this work was provided by the U.S Geological Survey – National Park Service Partnership and the U.S. Geological Survey Environmental Health Program (Toxic Substance Hydrology and Contaminant Biology). The resulting model equations and final datasets are included in this data release while an associated child item model archive includes all the files needed to run and develop these VB models.

  7. Predicting End Semester Performance

    • kaggle.com
    Updated Oct 10, 2020
    Cite
    Arvind Kiwelekar (2020). Predicting End Semester Performance [Dataset]. https://www.kaggle.com/akiwelekar/predictingese/notebooks
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 10, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Arvind Kiwelekar
    Description

    Context

    I use this dataset to explain the concepts of simple and multiple linear regression while teaching a course on Machine Learning. I found that existing datasets, such as height-weight and house-price prediction, have limited appeal, because they fail to relate to the academic life of engineering graduates, which more or less revolves around classroom attendance and performance in examinations.

    Content

    The dataset is small, recording the classroom attendance and the marks scored in the mid-semester examination (MSE) and end-semester examination (ESE) by 73 students. The data were collected in 2014, when I was teaching a course on Software Architecture to seventh-semester students of B. Tech. (Computer Engineering) at Dr Babasaheb Ambedkar Technological University, Lonere. Attendance is given in percent, so an entry of 90 in the 'Attendance' column indicates 90% attendance. The mid-semester (MSE) marks are out of 30 and the end-semester (ESE) marks are out of 70.
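    For a classroom demonstration, the contrast between simple and multiple regression on such data can be shown by comparing R-squared values. The student records below are fabricated, kept on the scales described (ESE out of 70, MSE out of 30, attendance in percent).

```python
import numpy as np

# Hedged sketch: fabricated student records on the described scales.
attendance = np.array([95.0, 80.0, 60.0, 88.0, 70.0, 92.0])  # percent
mse = np.array([28.0, 22.0, 12.0, 25.0, 15.0, 27.0])         # out of 30
ese = np.array([65.0, 50.0, 30.0, 58.0, 35.0, 62.0])         # out of 70

def r_squared(predictors, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

r2_simple = r_squared(mse, ese)                                # ESE ~ MSE
r2_multi = r_squared(np.column_stack([mse, attendance]), ese)  # + attendance
print(r2_simple, r2_multi)
```

    Because the simple model is nested in the multiple one, the multiple model's R-squared is always at least as large; the interesting classroom question is how much attendance adds beyond MSE marks.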

  8. Data from: Solving linear regression without skewness of the residuals’...

    • tandf.figshare.com
    txt
    Updated Jun 5, 2023
    Cite
    Martin Ricker (2023). Solving linear regression without skewness of the residuals’ distribution [Dataset]. http://doi.org/10.6084/m9.figshare.8152901.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Martin Ricker
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Linear ordinary least squares (OLS) regression assumes an unskewed distribution of the residuals for correct inference and prediction. A proof is given that for Manly’s exponential transformation of the dependent variable, there is always at least one solution for λ, such that the skewness of the standardized residuals’ distribution is zero. A computer code in Mathematica, together with an illustrative example, are provided. Generalized linear models are discussed briefly in comparison.
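    The authors supply Mathematica code; as an illustration only, the same search can be sketched in Python. The synthetic data and the crude grid search below are my assumptions, not the paper's solution method.

```python
import numpy as np

# Manly's exponential transformation: y(lam) = (exp(lam*y) - 1)/lam, y(0) = y.
# Hedged sketch: find a lambda that brings the skewness of the standardized
# OLS residuals close to zero, using an invented right-skewed response.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 200)
y = np.exp(1.0 + 2.0 * x + rng.normal(0.0, 0.3, 200))
X = np.column_stack([np.ones_like(x), x])

def resid_skew(lam):
    yt = y if lam == 0 else (np.exp(lam * y) - 1.0) / lam
    beta, *_ = np.linalg.lstsq(X, yt, rcond=None)
    z = yt - X @ beta
    z = (z - z.mean()) / z.std()
    return np.mean(z ** 3)  # sample skewness of standardized residuals

# crude grid search for the lambda whose residual skewness is nearest zero
lams = np.linspace(-1.0, 1.0, 401)
lam_star = min(lams, key=lambda l: abs(resid_skew(l)))
print(lam_star, resid_skew(lam_star))
```

    The paper's result guarantees at least one exact zero of this skewness function exists, so a root-finder could replace the grid scan once a sign-changing bracket is located.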

  9. Ideals Dataset

    • zenodo.org
    csv
    Updated Sep 19, 2022
    Cite
    Ana Cruz; Ana Cruz (2022). Ideals Dataset [Dataset]. http://doi.org/10.5281/zenodo.6866935
    Explore at:
    csv (available download formats)
    Dataset updated
    Sep 19, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ana Cruz; Ana Cruz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Generated datasets of distributions of multiple ideals, used in research on linear regression and machine-learning algorithms for the thesis 'Predicting the performance of Buchberger's algorithm'. Concatenated and concatenated_stats are the datasets with the ideals' exponents and corresponding polynomial additions; these datasets were created specifically for RNNs. features_dataset contains statistics regarding the ideals, and polynomial_additions_dataset contains information regarding their polynomial additions, created for multiple linear regression models and simple neural networks.

  10. Nestle India -Historical Stock Price Data

    • kaggle.com
    Updated Apr 25, 2022
    Cite
    Mansi Gaikwad (2022). Nestle India -Historical Stock Price Data [Dataset]. https://www.kaggle.com/datasets/mansigaikwad/nestle-india-historical-stock-price-data/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 25, 2022
    Dataset provided by
    Kaggle
    Authors
    Mansi Gaikwad
    Description

    This data was downloaded from the official Bombay Stock Exchange (BSE) website. The file contains the last 10 years of historical stock prices (by security and period): Security Name - Nestle India Ltd.; Period - Daily; Start Date - 2nd January 2012; End Date - 21st April 2022. This is a good dataset for supervised machine learning with regression: you can perform simple as well as multiple linear regression on it.

  11. USA Optimal Product Price Prediction Dataset

    • kaggle.com
    Updated Nov 7, 2023
    Cite
    asaniczka (2023). USA Optimal Product Price Prediction Dataset [Dataset]. http://doi.org/10.34740/kaggle/ds/3893031
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 7, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    asaniczka
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Area covered
    United States
    Description

    This dataset contains product prices from Amazon USA, with a focus on price prediction. With a good amount of data on what price points sell the most, you can train machine learning models to predict the optimal price for a product based on its features and product name.

    If you find this dataset useful, make sure to show your appreciation by upvoting! ❤️✨

    Inspirations

    This dataset is a superset of my Amazon USA product price dataset. Another inspiration is this competition, which awarded 100K in prize money.

    What To Do?

    • Your objective is to create a prediction model that will assist sellers in pricing their products within the optimal price range to generate the most sales.
    • The dataset includes various data points, such as the number of reviews, rating, best seller status, and items sold last month.
    • You can select specific factors (e.g., over 100 reviews = optimal price for the product) and then divide the dataset into products priced optimally vs. products priced suboptimally.
    • By utilizing techniques like vectorizing product names and features, you can train a model to provide the optimal price for a product, which sellers or businesses might find valuable.

    How to know if a product sells?

    • I would prefer to use the number of reviews as a metric to determine if a product sells. More reviews = more sales, right?
    • According to one source, only 1-2% of buyers leave a review.
    • So if we multiply the reviews for a product by 50x, we get a rough estimate of how many units have been sold.
    • If we then multiply the product price by the number of units sold, we get the total revenue generated by the product.
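    The arithmetic of the heuristic above fits in a few lines; the product numbers are invented for illustration.

```python
# Hedged sketch of the back-of-the-envelope heuristic above: reviews come
# from roughly 2% of buyers, so reviews * 50 approximates units sold, and
# units * price approximates revenue. All numbers are invented.
reviews = 120
price = 19.99  # USD
units_estimate = reviews * 50
revenue_estimate = round(units_estimate * price, 2)
print(units_estimate, revenue_estimate)  # -> 6000 119940.0
```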

    How is this useful?

    • Sellers and businesses can leverage your model to determine the optimal price for their products, thereby maximizing sales.
    • Businesses can assess the profitability of a product and plan their supply chain accordingly.
  12. Suspended sediment and bedload data, simple linear regression models, loads,...

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Suspended sediment and bedload data, simple linear regression models, loads, elevation data, and FaSTMECH models for Rice Creek, Minnesota, 2010-2019 [Dataset]. https://catalog.data.gov/dataset/suspended-sediment-and-bedload-data-simple-linear-regression-models-loads-elevation-d-2010
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Rice Creek, Minnesota
    Description

    A series of simple linear regression models were developed for the U.S. Geological Survey (USGS) streamgage at Rice Creek below Highway 8 in Mounds View, Minnesota (USGS station number 05288580). The simple linear regression models were calibrated using streamflow data to estimate suspended sediment (total, fines, and sands) and bedload. Data were collected during water years 2010, 2011, 2014, 2018, and 2019. The estimates from the simple linear regressions were used to calculate loads for water years 2010 through 2019. The calibrated simple linear regression models were used to improve understanding of sediment-transport processes and to increase the accuracy of sediment and load estimates for Rice Creek. Two multidimensional flow models were developed with the International River Interface Cooperative (iRIC) software and the Flow and Sediment Transport with Morphological Evolution of Channels (FaSTMECH) solver. These models were developed with elevation data from terrestrial laser scanning, airborne light detection and ranging, an acoustic Doppler current profiler, a total station, and a real-time kinematic global navigation satellite system. The models were calibrated, validated, and run for multiple streamflow scenarios to compare an original section to a restored section of Rice Creek. All contents of this data release are part of the associated report, U.S. Geological Survey Scientific Investigations Report 2022–5004 (https://doi.org/10.3133/sir20225004).
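    Streamflow-to-sediment rating regressions of this general kind are often fit in log-log space; the sketch below illustrates that common technique as an assumption, not the actual models in this data release, and all numbers are invented.

```python
import numpy as np

# Hedged sketch of a streamflow-to-sediment rating regression in log space.
streamflow = np.array([2.0, 5.0, 12.0, 30.0, 80.0])  # invented discharges
sediment = np.array([1.1, 4.0, 15.0, 60.0, 300.0])   # invented loads

# fit log(sediment) = log(a) + b * log(streamflow) by ordinary least squares
X = np.column_stack([np.ones(len(streamflow)), np.log(streamflow)])
beta, *_ = np.linalg.lstsq(X, np.log(sediment), rcond=None)
a, b = np.exp(beta[0]), beta[1]

# back-transformed prediction for an unsampled streamflow
print(a * 50.0 ** b)
```

    Note that naively back-transforming from log space biases load estimates low; production rating curves usually apply a bias-correction factor.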

  13. Global Burden of Disease analysis dataset of noncommunicable disease...

    • data.mendeley.com
    Updated Apr 6, 2023
    + more versions
    Cite
    David Cundiff (2023). Global Burden of Disease analysis dataset of noncommunicable disease outcomes, risk factors, and SAS codes [Dataset]. http://doi.org/10.17632/g6b39zxck4.10
    Explore at:
    Dataset updated
    Apr 6, 2023
    Authors
    David Cundiff
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This formatted dataset (AnalysisDatabaseGBD) originates from raw data files from the Institute for Health Metrics and Evaluation (IHME) Global Burden of Disease Study (GBD2017), affiliated with the University of Washington. We are volunteer collaborators with IHME and are not employed by IHME or the University of Washington.

    The population weighted GBD2017 data are on male and female cohorts ages 15-69 years including noncommunicable diseases (NCDs), body mass index (BMI), cardiovascular disease (CVD), and other health outcomes and associated dietary, metabolic, and other risk factors. The purpose of creating this population-weighted, formatted database is to explore the univariate and multiple regression correlations of health outcomes with risk factors. Our research hypothesis is that we can successfully model NCDs, BMI, CVD, and other health outcomes with their attributable risks.

    These Global Burden of Disease data relate to the preprint: The EAT-Lancet Commission Planetary Health Diet compared with Institute of Health Metrics and Evaluation Global Burden of Disease Ecological Data Analysis. The data include the following:
    1. Analysis database of population-weighted GBD2017 data that includes over 40 health risk factors and noncommunicable disease deaths/100k/year of male and female cohorts ages 15-69 years from 195 countries (the primary outcome variable, covering over 100 types of noncommunicable diseases), plus over 20 individual noncommunicable diseases (e.g., ischemic heart disease, colon cancer, etc.)
    2. A text file to import the analysis database into SAS
    3. The SAS code to format the analysis database to be used for analytics
    4. SAS code for deriving Tables 1, 2, 3 and Supplementary Tables 5 and 6
    5. SAS code for deriving the multiple regression formula in Table 4
    6. SAS code for deriving the multiple regression formula in Table 5
    7. SAS code for deriving the multiple regression formula in Supplementary Table 7
    8. SAS code for deriving the multiple regression formula in Supplementary Table 8
    9. The Excel files that accompanied the above SAS code to produce the tables

    For questions, please email davidkcundiff@gmail.com. Thanks.

  14. The Human Know-How Dataset

    • dtechtive.com
    • find.data.gov.scot
    pdf, zip
    Updated Apr 29, 2016
    Cite
    (2016). The Human Know-How Dataset [Dataset]. http://doi.org/10.7488/ds/1394
    Explore at:
    pdf(0.0582 MB), zip(19.67 MB), zip(0.0298 MB), zip(9.433 MB), zip(13.06 MB), zip(0.2837 MB), zip(5.372 MB), zip(69.8 MB), zip(20.43 MB), zip(5.769 MB), zip(14.86 MB), zip(19.78 MB), zip(43.28 MB), zip(62.92 MB), zip(92.88 MB), zip(90.08 MB) (available download formats)
    Dataset updated
    Apr 29, 2016
    Description

    The Human Know-How Dataset describes 211,696 human activities from many different domains. These activities are decomposed into 2,609,236 entities (each with an English textual label). These entities represent over two million actions and half a million pre-requisites. Actions are interconnected both according to their dependencies (temporal/logical orders between actions) and decompositions (decomposition of complex actions into simpler ones). This dataset has been integrated with DBpedia (259,568 links). For more information see:
    • The project website: http://homepages.inf.ed.ac.uk/s1054760/prohow/index.htm
    • The data is also available on datahub: https://datahub.io/dataset/human-activities-and-instructions
    • Quickstart: if you want to experiment with the most high-quality data before downloading all the datasets, download the file '9of11_knowhow_wikihow', and optionally files 'Process - Inputs', 'Process - Outputs', 'Process - Step Links' and 'wikiHow categories hierarchy'.
    • Data representation based on the PROHOW vocabulary: http://w3id.org/prohow# Data extracted from existing web resources is linked to the original resources using the Open Annotation specification.
    • Data Model: an example of how the data is represented within the datasets is available in the attached Data Model PDF file. The attached example represents a simple set of instructions, but instructions in the dataset can have more complex structures. For example, instructions could have multiple methods, steps could have further sub-steps, and complex requirements could be decomposed into sub-requirements.
    Statistics:
    • 211,696: number of instructions. From wikiHow: 167,232 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 44,464 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
    • 2,609,236: number of RDF nodes within the instructions. From wikiHow: 1,871,468 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 737,768 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
    • 255,101: number of process inputs linked to 8,453 distinct DBpedia concepts (dataset Process - Inputs)
    • 4,467: number of process outputs linked to 3,439 distinct DBpedia concepts (dataset Process - Outputs)
    • 376,795: number of step links between 114,166 different sets of instructions (dataset Process - Step Links)

  15. Time-Series Matrix (TSMx): A visualization tool for plotting multiscale...

    • dataverse.harvard.edu
    Updated Jul 8, 2024
    Cite
    Georgios Boumis; Brad Peter (2024). Time-Series Matrix (TSMx): A visualization tool for plotting multiscale temporal trends [Dataset]. http://doi.org/10.7910/DVN/ZZDYM9
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 8, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Georgios Boumis; Brad Peter
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Time-Series Matrix (TSMx): A visualization tool for plotting multiscale temporal trends TSMx is an R script that was developed to facilitate multi-temporal-scale visualizations of time-series data. The script requires only a two-column CSV of years and values to plot the slope of the linear regression line for all possible year combinations from the supplied temporal range. The outputs include a time-series matrix showing slope direction based on the linear regression, slope values plotted with colors indicating magnitude, and results of a Mann-Kendall test. The start year is indicated on the y-axis and the end year is indicated on the x-axis. In the example below, the cell in the top-right corner is the direction of the slope for the temporal range 2001–2019. The red line corresponds with the temporal range 2010–2019 and an arrow is drawn from the cell that represents that range. One cell is highlighted with a black border to demonstrate how to read the chart—that cell represents the slope for the temporal range 2004–2014. This publication entry also includes an excel template that produces the same visualizations without a need to interact with any code, though minor modifications will need to be made to accommodate year ranges other than what is provided. TSMx for R was developed by Georgios Boumis; TSMx was originally conceptualized and created by Brad G. Peter in Microsoft Excel. Please refer to the associated publication: Peter, B.G., Messina, J.P., Breeze, V., Fung, C.Y., Kapoor, A. and Fan, P., 2024. Perspectives on modifiable spatiotemporal unit problems in remote sensing of agriculture: evaluating rice production in Vietnam and tools for analysis. Frontiers in Remote Sensing, 5, p.1042624. https://www.frontiersin.org/journals/remote-sensing/articles/10.3389/frsen.2024.1042624 TSMx sample chart from the supplied Excel template. 
    Data represent the productivity of rice agriculture in Vietnam as measured via EVI (enhanced vegetation index) from the NASA MODIS data product (MOD13Q1.V006). TSMx R script:

    # import packages
    library(dplyr)
    library(readr)
    library(ggplot2)
    library(tibble)
    library(tidyr)
    library(forcats)
    library(Kendall)
    options(warn = -1) # disable warnings

    # read data (.csv file with "Year" and "Value" columns)
    data <- read_csv("EVI.csv")

    # prepare row/column names for output matrices
    years <- data %>% pull("Year")
    r.names <- years[-length(years)]
    c.names <- years[-1]
    years <- years[-length(years)]

    # initialize output matrices
    sign.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))
    pval.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))
    slope.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))

    # function to return remaining years given a start year
    getRemain <- function(start.year) {
      years <- data %>% pull("Year")
      start.ind <- which(data[["Year"]] == start.year) + 1
      remain <- years[start.ind:length(years)]
      return(remain)
    }

    # function to subset data for a start/end year combination
    splitData <- function(end.year, start.year) {
      keep <- which(data[['Year']] >= start.year & data[['Year']] <= end.year)
      batch <- data[keep,]
      return(batch)
    }

    # function to fit linear regression and return slope direction
    fitReg <- function(batch) {
      trend <- lm(Value ~ Year, data = batch)
      slope <- coefficients(trend)[[2]]
      return(sign(slope))
    }

    # function to fit linear regression and return slope magnitude
    fitRegv2 <- function(batch) {
      trend <- lm(Value ~ Year, data = batch)
      slope <- coefficients(trend)[[2]]
      return(slope)
    }

    # function to implement Mann-Kendall (MK) trend test and return significance
    # the test is implemented only for n >= 8
    getMann <- function(batch) {
      if (nrow(batch) >= 8) {
        mk <- MannKendall(batch[['Value']])
        pval <- mk[['sl']]
      } else {
        pval <- NA
      }
      return(pval)
    }

    # function to return slope direction for all combinations given a start year
    getSign <- function(start.year) {
      remaining <- getRemain(start.year)
      combs <- lapply(remaining, splitData, start.year = start.year)
      signs <- lapply(combs, fitReg)
      return(signs)
    }

    # function to return MK significance for all combinations given a start year
    getPval <- function(start.year) {
      remaining <- getRemain(start.year)
      combs <- lapply(remaining, splitData, start.year = start.year)
      pvals <- lapply(combs, getMann)
      return(pvals)
    }

    # function to return slope magnitude for all combinations given a start year
    getMagn <- function(start.year) {
      remaining <- getRemain(start.year)
      combs <- lapply(remaining, splitData, start.year = start.year)
      magns <- lapply(combs, fitRegv2)
      return(magns)
    }

    # retrieve slope direction, MK significance, and slope magnitude
    signs <- lapply(years, getSign)
    pvals <- lapply(years, getPval)
    magns <- lapply(years, getMagn)

    # fill-in output matrices
    dimension <- nrow(sign.matrix)
    for (i in 1:dimension) {
      sign.matrix[i, i:dimension] <- unlist(signs[i])
      pval.matrix[i, i:dimension] <- unlist(pvals[i])
      slope.matrix[i, i:dimension] <- unlist(magns[i])
    }
    sign.matrix <-...

  16. Improving Time, Cost and Quality of Hire

    • kaggle.com
    Updated Nov 8, 2021
    Cite
    Soham Ganguly (2021). Improving Time, Cost and Quality of Hire [Dataset]. https://www.kaggle.com/sohamganguly90/improving-time-cost-and-quality-of-hire/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 8, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Soham Ganguly
    Description

    Content

    Improve time, cost, and quality of hire on a random recruitment dataset. The objective is to minimize the time and cost of hire and maximize the quality-of-hire metrics. This sample People Analytics project mainly uses ANOVA, correlation, and multiple linear regression to perform predictive and prescriptive analytics on the dataset. Dashboards are made in Excel.
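A hedged sketch of the multiple-linear-regression step this project describes; the column names and coefficients below are invented for illustration and are not fields of the Kaggle file:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# hypothetical predictors: sourcing cost, number of interview rounds, recruiter workload
cost = rng.uniform(500, 5000, n)
rounds = rng.integers(1, 6, n).astype(float)
workload = rng.uniform(5, 30, n)
# hypothetical response: time to hire in days
time_to_hire = 10 + 0.002 * cost + 4.0 * rounds + 0.5 * workload + rng.normal(0, 3, n)

X = np.column_stack([np.ones(n), cost, rounds, workload])  # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, time_to_hire, rcond=None)    # ordinary least squares
# beta approximates [10, 0.002, 4.0, 0.5]; the large coefficients flag which
# drivers of time-to-hire a prescriptive analysis would target first
```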

    Acknowledgements

    Kaggle, for the sample dataset (I made modifications to the original dataset); XLRI, for giving me the opportunity to create this project.

    Inspiration

    Inspired by the desire to step into the venture of learning People Analytics.

  17. Dataset for: Some Remarks on the R2 for Clustering

    • wiley.figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Nicola Loperfido; Thaddeus Tarpey (2023). Dataset for: Some Remarks on the R2 for Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.6124508.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Wiley
    Authors
    Nicola Loperfido; Thaddeus Tarpey
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    A common descriptive statistic in cluster analysis is the R^2 that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the R^2 for clustering. In particular, we show that the R^2 can generally be artificially inflated by linearly transforming the data by "stretching" and by projecting. Also, the R^2 for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the R^2 for clustering for misspecified models. Several simulation illustrations are provided highlighting weaknesses of the clustering R^2, especially in high-dimensional settings. A functional data example is given showing how the R^2 for clustering can vary dramatically depending on how the curves are estimated.
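The clustering R^2 the abstract discusses is the between-cluster sum of squares divided by the total sum of squares. A small sketch (not the authors' code) computing it on synthetic data and demonstrating the inflation-by-stretching effect:

```python
import numpy as np

def clustering_r2(X, labels):
    """Proportion of total variance explained by the cluster means."""
    grand = X.mean(axis=0)
    total = ((X - grand) ** 2).sum()
    between = sum(
        np.sum(labels == k) * ((X[labels == k].mean(axis=0) - grand) ** 2).sum()
        for k in np.unique(labels)
    )
    return between / total

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
X[:50, 0] += 3.0                       # two clusters separated along the first axis
labels = np.repeat([0, 1], 50)

r2 = clustering_r2(X, labels)
stretched = X * np.array([10.0, 1.0])  # linear "stretching" of the separating axis
r2_stretched = clustering_r2(stretched, labels)
# stretching the separating coordinate inflates the R^2 without changing the clustering
```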

  18. Multi-turn Prompts Dataset

    • kaggle.com
    Updated Oct 25, 2024
    + more versions
    Cite
    SoftAge.AI (2024). Multi-turn Prompts Dataset [Dataset]. https://www.kaggle.com/datasets/softageai/multi-turn-prompts-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 25, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    SoftAge.AI
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset consists of 400 text-only, fine-tuned multi-turn conversations in English, based on 10 categories and 19 use cases. It has been generated with ethically sourced human-in-the-loop data methods and aligned with supervised fine-tuning, direct preference optimization, and reinforcement learning through human feedback.

    The human-annotated data is focused on data quality and precision to enhance the generative response of models used for AI chatbots, thereby improving their recall memory and recognition ability for continued assistance.

    Key Features

    • Prompts focused on user intent, devised using natural language processing techniques.
    • Multi-turn prompts with up to 5 turns to enhance the responsive memory of large language models for pretraining.
    • Conversational interactions for queries related to varied aspects of writing, coding, knowledge assistance, data manipulation, reasoning, and classification.

    Dataset Source

    Subject-matter expert annotators @SoftAgeAI have annotated the data at simple and complex levels, focusing on quality factors such as content accuracy, clarity, coherence, grammar, depth of information, and overall usefulness.

    Structure & Fields

    The dataset is organized into different columns, which are detailed below:

    • P1, R1, P2, R2, P3, R3, P4, R4, P5 (object): These columns represent the sequence of prompts (P) and responses (R) within a single interaction. Each interaction can have up to 5 prompts and 5 corresponding responses, capturing the flow of a conversation. The prompts are user inputs, and the responses are the model's outputs.
    • Use Case (object): Specifies the primary application or scenario for which the interaction is designed, such as "Q&A helper" or "Writing assistant." This classification helps in identifying the purpose of the dialogue.
    • Type (object): Indicates the complexity of the interaction, with entries labeled as "Complex" in this dataset. This denotes that the dialogues involve more intricate and multi-layered exchanges.
    • Category (object): Broadly categorizes the interaction type, such as "Open-ended QA" or "Writing." This provides context on the nature of the conversation, whether it is for generating creative content, providing detailed answers, or engaging in complex problem-solving.

    Intended Use Cases

    The dataset can enhance query-assistance model functioning related to shopping, coding, creative writing, travel assistance, marketing, citation, academic writing, language assistance, research topics, specialized knowledge, reasoning, and STEM-based queries. The dataset is intended to aid generative models for e-commerce, customer assistance, marketing, education, suggestive user queries, and generic chatbots. It can pre-train large language models with supervision-based fine-tuned annotated data and support retrieval-augmented generative models. The dataset is free of violence-based interactions that could lead to harm, conflict, discrimination, brutality, or misinformation.

    Potential Limitations & Biases

    This is a static dataset, so the information is dated May 2024.

    Note If you have any questions related to our data annotation and human review services for large language model training and fine-tuning, please contact us at SoftAge Information Technology Limited at info@softage.ai.

  19. Supporting dataset for "Multi-model assessment of the atmospheric and...

    • data.4tu.nl
    zip
    Updated Sep 30, 2024
    + more versions
    Cite
    Jurriaan van 't Hoff; Didier Hauglustaine; Johannes Pletzer; Agnieszka Skowron; Volker Grewe; Sigrun Matthes; Maximilian M. Meuser; Robin N. Thor; Irene C Dedoussi (2024). Supporting dataset for "Multi-model assessment of the atmospheric and radiative effects of supersonic transport aircraft" [Dataset]. http://doi.org/10.4121/dd38833d-6c5d-47d8-bb10-7535ce1eecf1.v2
    Explore at:
    zip (available download formats)
    Dataset updated
    Sep 30, 2024
    Dataset provided by
    4TU.ResearchData
    Authors
    Jurriaan van 't Hoff; Didier Hauglustaine; Johannes Pletzer; Agnieszka Skowron; Volker Grewe; Sigrun Matthes; Maximilian M. Meuser; Robin N. Thor; Irene C Dedoussi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    European Union
    European Union Horizon 2020 research and innovation programme
    Description

    This dataset is a supporting dataset for "Intermodel comparison of the atmospheric composition changes due to emissions from a potential future supersonic aircraft fleet" (https://doi.org/10.5194/acp-25-2515-2025). This dataset contains all data necessary to reproduce the results of the publication.


    In this work four state-of-the-science atmospheric chemistry transport models (EMAC, GEOS-Chem, LMDZ-INCA, MOZART-3) are used to evaluate the effects of three supersonic emission scenarios on a 2050 atmosphere. The future atmosphere and emissions are based on the SSP 3.7 scenario.


    For each of the models this dataset includes the volume mixing ratio (vmr) and mass distribution of several key species through the atmosphere during the last years of model integration. The LMDZ-INCA and GEOS-Chem data span a 3-year interval, whereas 6 years of data are provided for EMAC, and the MOZART-3 data are already annually averaged. We refer to the associated publication for more information regarding these timescales.


    The dataset contains atmospheric histories for 4 different emission scenarios, denoted as A0 to A3. The A0 scenario is a baseline scenario with no supersonic aviation (only subsonic). In the A1 scenario, part of the subsonic civil aviation is replaced by supersonic aircraft operating at Mach 2.0. Scenario A2 is a variant of A1 with triple the emission of nitrogen oxides from the supersonic aircraft, and scenario A3 considers the partial replacement of subsonic civil aviation with Mach 1.6 supersonic aircraft instead, at a lower cruise altitude. For detailed descriptions of the scenarios we refer to the associated publication.


    The files are named using the convention (MODEL)_(SCENARIO)_(VARIABLE).nc4; for example, the file GEOS-Chem_A1_H2O_mass.nc4 contains H2O mass distributions for the A1 scenario from the GEOS-Chem model. The following variables are included in this dataset:


    1. halogen_vmr: Volume mixing ratios of halogen species (time, lat, lon, pressure) [mol mol-1]
    2. H2O_mass: Mass distribution of H2O (time, lat, lon, pressure) [kg]
    3. H2O_vmr: Volume mixing ratio of H2O (time, lat, lon, pressure) [mol mol-1]
    4. NOx_mass: Mass distribution of NOx (time, lat, lon, pressure) [kg (NO2)]
    5. NOx_vmr: Volume mixing ratio of NOx (time, lat, lon, pressure) [mol mol-1]
    6. NOy_vmr: Volume mixing ratio of NOy (time, lat, lon, pressure) [mol mol-1]
    7. O3_columns: Ozone columns in dobson units (time, lat, lon) [DU]
    8. O3_mass: Mass distribution of O3 (time, lat, lon, pressure) [kg]
    9. O3_vmr: Volume mixing ratio of O3 (time, lat, lon, pressure) [mol mol-1]
    10. OddOx_ClOxBrOx_loss: Odd Oxygen loss rate to ClOx and BrOx families (time, lat, lon, pressure) [molec cm-3 s-1] (For EMAC split into ClOx and BrOx files separately)
    11. OddOx_HOx_loss: Odd Oxygen loss rate to HOx family (time, lat, lon, pressure) [molec cm-3 s-1]
    12. OddOx_NOx_loss: Odd Oxygen loss rate to NOx family (time, lat, lon, pressure) [molec cm-3 s-1]
    13. OddOx_Ox_loss: Odd Oxygen loss rate to Ox family (time, lat, lon, pressure) [molec cm-3 s-1]
    14. BC_mmr: Mass mixing ratio of black carbon (time, lat, lon, pressure) [kg m-3]
    15. SO4_mmr: Mass mixing ratio of sulfate (time, lat, lon, pressure) [kg m-3]
    16. tropopause_pressure: Model tropopause pressure (time,lat,lon) [hPa]
    17. temp: Atmospheric temperature fields; only for EMAC. (time,lat,lon,pressure) [Kelvin]


    For the MOZART-3 model, some data are provided as differences instead: a SCENARIO entry of (A1-A0) indicates a file containing the difference between the respective fields of the A1 and A0 scenarios.
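A sketch of parsing the stated naming convention when organising downloads. The only filename taken from the dataset description is GEOS-Chem_A1_H2O_mass.nc4; the exact spelling of the MOZART-3 difference files' scenario entry is an assumption, so the pattern accepts both A1 and (A1-A0) forms:

```python
import re

# (MODEL)_(SCENARIO)_(VARIABLE).nc4; the scenario may be A0..A3 or a
# difference such as (A1-A0) -- the parenthesised form is assumed here
PATTERN = re.compile(
    r"^(?P<model>[^_]+)_(?P<scenario>\(?A[0-3](?:-A[0-3])?\)?)_(?P<variable>.+)\.nc4$"
)

def parse_name(fname):
    """Split a dataset filename into (model, scenario, variable)."""
    m = PATTERN.match(fname)
    if m is None:
        raise ValueError(f"not in (MODEL)_(SCENARIO)_(VARIABLE).nc4 form: {fname}")
    return m.group("model"), m.group("scenario"), m.group("variable")

print(parse_name("GEOS-Chem_A1_H2O_mass.nc4"))  # ('GEOS-Chem', 'A1', 'H2O_mass')
```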


  20. Dataset for QTL detection in a Tomato MAGIC population analysed in a...

    • entrepot.recherche.data.gouv.fr
    csv, tsv, txt
    Updated Jun 25, 2021
    Cite
    Mathilde Causse; Mathilde Causse (2021). Dataset for QTL detection in a Tomato MAGIC population analysed in a multi-environment experiment [Dataset]. http://doi.org/10.15454/UVZTAV
    Explore at:
    tsv(26735), tsv(16682), txt(85881091), tsv(40979), txt(24915), txt(30490), txt(13798), tsv(13152), csv(1298), tsv(119060), tsv(5530) (available download formats)
    Dataset updated
    Jun 25, 2021
    Dataset provided by
    Recherche Data Gouv
    Authors
    Mathilde Causse; Mathilde Causse
    License

    etalab-2.0 (https://spdx.org/licenses/etalab-2.0.html)

    Area covered
    Israel, Morocco, France
    Dataset funded by
    ANR
    Description

    Description of the data

    The data described here were produced from the ANR projects ADAPTOM (ANR-13-ADAP-0013) and TomEpiSet (ANR-16-CE20-0014). An 8-way tomato MAGIC population was phenotyped over 12 environments covering three geographical locations (France, Israel and Morocco) and four conditions (control, and water-deficit, high-temperature and salinity stress). A set of 397 MAGIC lines was genotyped for 1345 markers, used together with the phenotypic traits for linkage mapping analysis. Genotype-by-environment interaction (GxE) was evaluated and phenotypic plasticity computed through different statistical models. Each file in the dataset has its own description below.

    • Phenotype files: The Phenotypes files contain the 10 phenotypic traits that were evaluated. Phenotypic data averaged per genotype and environment are in the file "Phenotype_per_Environment". The input phenotypes for the linkage mapping analysis are in the file "Pheno_Input_QTL_detection". For each trait, they represent, respectively, the estimated average performance, the slope and variance from the Finlay & Wilkinson regression model, and the sensitivity to environmental covariates from the factorial regression model.
    • MAGIC genotyping information: This file presents the genetic map with 1345 SNP markers used in linkage mapping analyses. The genotypic information of the eight founders and 397 MAGIC lines is also presented.
    • Daily recorded climatic parameters: This file presents the daily climatic parameters recorded within the greenhouses. The different parameters were computed over 24 hours.
    • Custom R script for the two-stage analysis of GxE: The file "Two-stage-analysis_magicMET.txt" contains the custom R script used for the analysis of the factorial regression and Finlay-Wilkinson regression models. Average performance and plasticity parameters were derived from these analyses. An example is given for the fruit-weight phenotype averaged per genotype and environment. The input file "Var_environment_P2P3" presents the average climatic parameters used in particular for the factorial regression model.
    • Custom R script for QEI modelling: The files "QEI_Glbal_marker_effect_model5.txt" and "QEI_main_plus_interactive_effect_model6.txt" describe the custom R script used for the detection of interactive QTLs (QEI). An example for the fruit-weight phenotype has been developed. The input files for the script are "FW_pheno_GxE.csv", the average phenotypic data per genotype and environment for the fruit-weight example, and the parental haplotype probabilities "Proba_parents.txt", which were computed with the calc_genoprob function of the R/qtl2 package. The "Geno_ID.csv" file gives the correspondence between genotype name and ID.
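As a hedged illustration of the Finlay & Wilkinson regression the description refers to (the dataset's own implementation is in the .txt script; the data and dimensions below are synthetic): each genotype's trait values are regressed on the environmental mean, and the fitted slope is the plasticity/sensitivity parameter.

```python
import numpy as np

rng = np.random.default_rng(3)
n_geno, n_env = 5, 12
true_slope = np.linspace(0.5, 1.5, n_geno)   # hypothetical sensitivities
env_effect = rng.normal(0, 2, n_env)
# trait value = genotype intercept + sensitivity * environment effect + noise
pheno = 10 + true_slope[:, None] * env_effect[None, :] + rng.normal(0, 0.2, (n_geno, n_env))

env_index = pheno.mean(axis=0)               # environmental mean over genotypes
fw = {}                                      # genotype -> (intercept, slope)
for g in range(n_geno):
    b, a = np.polyfit(env_index, pheno[g], 1)  # slope b is the plasticity parameter
    fw[g] = (a, b)

slopes = np.array([fw[g][1] for g in range(n_geno)])
# slopes near 1 indicate average sensitivity; above 1, high responsiveness to environment
```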
