100+ datasets found

Insurance Dataset - Simple Linear Regression
kaggle.com
zip
Updated Sep 14, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Taseer Mehboob (2023). Insurance Dataset - Simple Linear Regression [Dataset]. https://www.kaggle.com/datasets/taseermehboob9/insurance-dataset-simple-linear-regression
Explore at:
zip(254 bytes)Available download formats
Dataset updated
Sep 14, 2023
Authors
Taseer Mehboob
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Here in This Dataset we have only 2 columns the first one is Age and the second one is Premium You can use this dataset in machine learning for Simple linear Regression and for Prediction Practices.
LINEAR REGRESSION DATA CSV
kaggle.com
zip
Updated Feb 28, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alper Aktepe (2021). LINEAR REGRESSION DATA CSV [Dataset]. https://www.kaggle.com/datasets/alperaktepe/linear-regression-data-csv
Explore at:
zip(283 bytes)Available download formats
Dataset updated
Feb 28, 2021
Authors
Alper Aktepe
Description
Dataset

This dataset was created by Alper Aktepe

Contents
m
Panel dataset on Brazilian fuel demand
data.mendeley.com
Updated Oct 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sergio Prolo (2024). Panel dataset on Brazilian fuel demand [Dataset]. http://doi.org/10.17632/hzpwbp7j22.1
Explore at:
Unique identifier
https://doi.org/10.17632/hzpwbp7j22.1
Dataset updated
Oct 7, 2024
Authors
Sergio Prolo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Brazil
Description
Summary : Fuel demand is shown to be influenced by fuel prices, people's income and motorization rates. We explore the effects of electric vehicle's rates in gasoline demand using this panel dataset.

Files : dataset.csv - Panel dimensions are the Brazilian state ( i ) and year ( t ). The other columns are: gasoline sales per capita (ln_Sg_pc), prices of gasoline (ln_Pg) and ethanol (ln_Pe) and their lags, motorization rates of combustion vehicles (ln_Mi_c) and electric vehicles (ln_Mi_e) and GDP per capita (ln_gdp_pc). All variables are all under the natural log function, since we use this to calculate demand elasticities in a regression model.

adjacency.csv - The adjacency matrix used in interaction with electric vehicles' motorization rates to calculate spatial effects. At first, it follows a binary adjacency formula: for each pair of states i and j, the cell (i, j) is 0 if the states are not adjacent and 1 if they are. Then, each row is normalized to have sum equal to one.

regression.do - Series of Stata commands used to estimate the regression models of our study. dataset.csv must be imported to work, see comment section.

dataset_predictions.xlsx - Based on the estimations from Stata, we use this excel file to make average predictions by year and by state. Also, by including years beyond the last panel sample, we also forecast the model into the future and evaluate the effects of different policies that influence gasoline prices (taxation) and EV motorization rates (electrification). This file is primarily used to create images, but can be used to further understand how the forecasting scenarios are set up.

Sources: Fuel prices and sales: ANP (https://www.gov.br/anp/en/access-information/what-is-anp/what-is-anp) State population, GDP and vehicle fleet: IBGE (https://www.ibge.gov.br/en/home-eng.html?lang=en-GB) State EV fleet: Anfavea (https://anfavea.com.br/en/site/anuarios/)
Cancer Regression
kaggle.com
Updated Apr 14, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Varun Raskar (2024). Cancer Regression [Dataset]. https://www.kaggle.com/datasets/varunraskar/cancer-regression
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 14, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Varun Raskar
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
The dataset contains 2 .csv files

This file contains various demographic and health-related data for different regions. Here's a brief description of each column:

File 1st

avganncount: Average number of cancer cases diagnosed annually.

avgdeathsperyear: Average number of deaths due to cancer per year.

target_deathrate: Target death rate due to cancer.

incidencerate: Incidence rate of cancer.

medincome: Median income in the region.

popest2015: Estimated population in 2015.

povertypercent: Percentage of population below the poverty line.

studypercap: Per capita number of cancer-related clinical trials conducted.

binnedinc: Binned median income.

medianage: Median age in the region.

pctprivatecoveragealone: Percentage of population covered by private health insurance alone.

pctempprivcoverage: Percentage of population covered by employee-provided private health insurance.

pctpubliccoverage: Percentage of population covered by public health insurance.

pctpubliccoveragealone: Percentage of population covered by public health insurance only.

pctwhite: Percentage of White population.

pctblack: Percentage of Black population.

pctasian: Percentage of Asian population.

pctotherrace: Percentage of population belonging to other races.

pctmarriedhouseholds: Percentage of married households. birthrate: Birth rate in the region.

File 2nd

This file contains demographic information about different regions, including details about household size and geographical location. Here's a description of each column:

statefips: The FIPS code representing the state.

countyfips: The FIPS code representing the county or census area within the state.

avghouseholdsize: The average household size in the region.

geography: The geographical location, typically represented as the county or census area name followed by the state name.

Each row in the file represents a specific region, providing details about household size and geographical location. This information can be used for various demographic analyses and studies.
Marketing Linear Multiple Regression
kaggle.com
zip
Updated Apr 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FayeJavad (2020). Marketing Linear Multiple Regression [Dataset]. https://www.kaggle.com/datasets/fayejavad/marketing-linear-multiple-regression
Explore at:
zip(1907 bytes)Available download formats
Dataset updated
Apr 24, 2020
Authors
FayeJavad
Description
Dataset

This dataset was created by FayeJavad

Contents
d
Data from: Data for Regression Models to Estimate Water Use in Providence,...
catalog.data.gov
data.usgs.gov
+2more
Updated Nov 19, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). Data for Regression Models to Estimate Water Use in Providence, Rhode Island, 2014-2021 [Dataset]. https://catalog.data.gov/dataset/data-for-regression-models-to-estimate-water-use-in-providence-rhode-island-2014-2021
Explore at:
Dataset updated
Nov 19, 2025
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Area covered
Providence, Rhode Island
Description
This data release contains input data and programs (scripts) used to estimate monthly water demand for retail customers of Providence Water, located in Providence, Rhode Island. Explanatory data and model outputs are from July 2014 through June 2021. Models of per capita (for single-family residential customers) or per connection (for multi-family residential, commercial, and industrial customers) water use were developed using multiple linear regression. The dependent variables, provided by Providence Water, are the monthly number of connections and gallons of water delivered to single- and multi-family residential, commercial, and industrial connections. Potential independent variables (from online sources) are climate variables (temperature and precipitation), economic statistics, and a drought statistic. Not all independent variables were used in all of the models. The data are provided in data tables and model files. The data table RIWaterUseVariableExplanation.csv describes the explanatory variables and their data sources. The data table ProvModelInputData.csv provides the monthly water-use data that are the independent variables and the monthly climatic and economic data that are the dependent variables. The data table DroughtInputData.csv provides the weekly U.S. drought monitor index values that were processed to formulate a potential independent variable. The R script model_water_use.R runs the models that predict water use. The other two R scripts (load_preprocess_input_data.R and model_water_use_functions.R) are not run explicitly but are called from the primary script model_water_use.R. Regression equations produced by the models can be used to predict water demand throughout Rhode Island.
Pearson's Height Data 📏 Simple linear regression
kaggle.com
zip
Updated Aug 17, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
MaDiha 🌷 (2024). Pearson's Height Data 📏 Simple linear regression [Dataset]. https://www.kaggle.com/datasets/fundal/pearsons-height-data-simple-linear-regression
Explore at:
zip(3544 bytes)Available download formats
Dataset updated
Aug 17, 2024
Authors
MaDiha 🌷
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Description The table below gives the heights of fathers and their sons, based on a famous experiment by Karl Pearson around 1903. The number of cases is 1078. Random noise was added to the original data, to produce heights to the nearest 0.1 inch.

Objective: Use this dataset to practice simple linear regression.

Columns - Father height - Son height

Source: Department of Statistics, University of California, Berkeley

Download TSV source file: Pearson.tsv
UCI and OpenML Data Sets for Ordinal Quantification
zenodo.org
data.niaid.nih.gov
+1more
zip
Updated Jul 25, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8177302
Dataset updated
Jul 25, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.

Usage

You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

Data Extraction: In your terminal, you can call either

make

(recommended), or

julia --project="." --eval "using Pkg; Pkg.instantiate()" julia --project="." extract-oq.jl

Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

Further Reading

Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
d
Data from: Data and regression models used to estimate chloride...
catalog.data.gov
data.usgs.gov
Updated Oct 8, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). Data and regression models used to estimate chloride concentration and road salt loading for an urban rain garden, Gary, Indiana [Dataset]. https://catalog.data.gov/dataset/data-and-regression-models-used-to-estimate-chloride-concentration-and-road-salt-loading-f
Explore at:
Dataset updated
Oct 8, 2025
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Area covered
Gary, Indiana
Description
This data release contains data used to develop linear regression models to calculate continuous chloride concentrations in groundwater and runoff at an urban rain garden in Gary, Indiana. A python script was developed to perform the regression analysis and estimate sodium chloride (NaCl) loading to 5 rain-garden flumes as the product of: (1) the estimated continuous (one-minute) chloride concentration, (2) the mass ratio of sodium chloride to chloride, and (3) continuous flume discharge. The python source code used to execute the regression model and loading calculations is included in this data release in the zipped folder named Model_Archive.zip. The regression input (discrete specific conductance and chloride concentration data), chloride concentration model input (continuous specific conductance data) and NaCl loading model input (continuous flume discharge and specific conductance data) that were used to estimate chloride concentrations at 3 USGS monitoring wells (413610087201001, 413612087201301, 413611087201004) and NaCl loading at 5 USGS rain garden flume monitoring stations (413611087201101, 413611087201001, 413612087200901, 413611087200901, 413611087201002) are also included as part of the model archive (Model_Archive.zip). The model output consists of 3 .csv files for the USGS monitoring wells with estimated continuous (hourly) chloride concentrations (mg/L) and 5 .csv files for the USGS flume monitoring sites with estimated continuous (1-minute) chloride concentrations (mg/L) and NaCl loading (grams) that are presented in the zipped folder Model_Output_Data.zip.
Salary Dataset - Simple linear regression
kaggle.com
zip
Updated Jan 10, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Allena Venkata Sai Abhishek (2023). Salary Dataset - Simple linear regression [Dataset]. https://www.kaggle.com/datasets/abhishek14398/salary-dataset-simple-linear-regression/code
Explore at:
zip(457 bytes)Available download formats
Dataset updated
Jan 10, 2023
Authors
Allena Venkata Sai Abhishek
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset Description

Salary Dataset in CSV for Simple linear regression. It has also been used in Machine Learning A to Z course of my series.

Columns

#

YearsExperience

Salary
u
Data from: Predicting spatial-temporal patterns of diet quality and large...
agdatacommons.nal.usda.gov
docx
Updated Nov 21, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sean Kearney; Lauren M. Porensky; David J. Augustine; Justin D. Derner; Feng Gao (2025). Data from: Predicting spatial-temporal patterns of diet quality and large herbivore performance using satellite time series [Dataset]. http://doi.org/10.15482/USDA.ADC/1522609
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.15482/USDA.ADC/1522609
Dataset updated
Nov 21, 2025
Dataset provided by
Ag Data Commons
Authors
Sean Kearney; Lauren M. Porensky; David J. Augustine; Justin D. Derner; Feng Gao
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Analysis-ready tabular data from "Predicting spatial-temporal patterns of diet quality and large herbivore performance using satellite time series" in Ecological Applications, Kearney et al., 2021. Data is tabular data only, summarized to the pasture scale. Weight gain data for individual cattle and the STARFM-derived Landsat-MODIS fusion imagery can be made available upon request. Resources in this dataset:Resource Title: Metadata - CSV column names, units and descriptions. File Name: Kearney_et_al_ECOLAPPL_Patterns of herbivore - metada.docxResource Description: Column names, units and descriptions for all CSV files in this datasetResource Title: Fecal quality data. File Name: Kearney_etal2021_Patterns_of_herbivore_Data_FQ_cln.csvResource Description: Field-sampled fecal quality (CP = crude protein; DOM = digestible organic matter) data and phenology-related APAR metrics derived from 30 m daily Landsat-MODIS fusion satellite imagery. All data are paddock-scale averages and the paddock is the spatial scale of replication and week is the temporal scale of replication. Fecal samples were collected by USDA-ARS staff from 3-5 animals per paddock (10% - 25% of animals in each herd) weekly during each grazing season from 2014 to 2019 across 10 different paddocks at the Central Plains Experimental Range (CPER) near Nunn, CO. Samples were analyzed at the Grazingland Animal Nutrition Lab (GANlab, https://cnrit.tamu.edu/index.php/ganlab/) using near infrared spectroscopy (see Lyons & Stuth, 1992; Lyons, Stuth, & Angerer, 1995). Not every herd was sampled every week or every year, resulting in a total of 199 samples. Samples represent all available data at the CPER during the study period and were collected for different research and adaptive management objectives, but following the basic protocol described above. APAR metrics were derived from the paddock-scale APAR daily time series (all paddock pixels averaged daily to create a single paddock-scale time series). All APAR metrics are calculated for the week that corresponds to the week that fecal quality samples were collected in the field. See Section 2.2.4 of the corresponding manuscript for a complete description of the APAR metrics. Resource Title: Monthly ADG. File Name: Kearney_etal2021_Patterns_of_herbivore_Data_ADG_monthly_cln.csvResource Description: Monthly average daily gain (ADG) of cattle weights at the paddock scale and the three satellite-derived metrics used to build regression model to predict AD: crude protein (CP), digestible organic matter (DOM) and aboveground net herbaceous production (ANHP). Data table also includes stocking rate (animal units per hectare) used as an interaction term in the ADG regression model and all associated data to derive each of these variables (e.g., sampling start and end dates, 30 m daily Landsat-MODIS fusion satellite imagery-derived APAR metrics, cattle weights, etc.). We calculated paddock-scale average daily gain (ADG, kg hd-1 day-1) from 2000-2019 for yearlings weighed approximately every 28-days during the grazing season across 6 different paddocks with stocking densities of 0.08 – 0.27 animal units (AU) ha-1, where one AU is equivalent to a 454 kg animal. It is worth noting that AU’s change as a function of both the number of cattle within a paddock and the size of individual animals, the latter of which changes within a single grazing season. This becomes important to consider when using sub-seasonal weight data for fast-growing yearlings. For paddock-scale ADG, we first calculated ADG for each individual yearling as the difference between the weights obtained at the end and beginning of each period, divided by the number of days in each period, and then averaged for all individuals in the paddock. We excluded data from 2013 due to data collection inconsistencies. We note that most of the monthly weight data (97%) is from 3 paddocks where cattle were weighed every year, whereas in the other 3 paddocks, monthly weights were only measured during 2017-2019. Apart from the 2013 data, which were not comparable to data from other years, the data represents all available weight gain data for CPER to maximize spatial-temporal coverage and avoid potential bias from subjective decisions to subset the data. Data may have been collected for different projects at different times, but was collected in a consistent way. This resulted in 269 paddock-scale estimates of monthly ADG, with robust temporal, but limited spatial, coverage. CP and DOM were estimated from a random forest model trained from the five APAR metrics: rAPAR, dAPAR, tPeak, iAPAR and iAPAR-dry (see manuscript Section 2.3 for description). APAR metrics were derived from the paddock-scale APAR daily time series (all paddock pixels averaged daily to create a single paddock-scale time series). All APAR metrics are calculated as the average of the approximately 28-day period that corresponds to the ADG calculation. See Section 2.2.4 of the manuscript for a complete description of the APAR metrics. ANHP was estimated from a linear regression model developed by Gaffney et al. (2018) to calculate net aboveground herbaceous productivity (ANHP; kg ha-1) from iAPAR. We averaged the coefficients of 4 spatial models (2013-2016) developed by Gaffney et al. (2018), resulting in the following equation: ANHP = -26.47 + 2.07(iAPAR) We first calculated ANHP for each day of the grazing season at the paddock scale, and then took the average ANHP for the 28-day period. REFERENCES: Gaffney, R., Porensky, L. M., Gao, F., Irisarri, J. G., Durante, M., Derner, J. D., & Augustine, D. J. (2018). Using APAR to predict aboveground plant productivity in semi-aid rangelands: Spatial and temporal relationships differ. Remote Sensing, 10(9). doi: 10.3390/rs10091474 Resource Title: Season-long ADG. File Name: Kearney_etal2021_Patterns_of_herbivore_Data_ADG_seasonal_cln.csvResource Description: Season-long observed and model-predicted average daily gain (ADG) of cattle weights at the paddock scale. Also includes two variables used to analyze patterns in model residuals: percent sand content and season-long aboveground net herbaceous production (ANHP). We calculated observed paddock-scale ADG for the entire grazing season from 2010-2019 (excluding 2013 due to data collection inconsistencies) by averaging seasonal ADG of each yearling, determined as the difference between the end and starting weights divided by the number of days in the grazing season. This dataset was available for 40 paddocks spanning a range of soil types, plant communities, and topographic positions. Data may have been collected for different projects at different times, but was collected in a consistent way. We note that there was spatial overlap among a small number paddock boundaries across different years since some fence lines were moved in 2012 and 2014. Model-predicted paddock-scale ADG was derived using the monthly ADG regression model described in Sections 2.3.3 and 2.3.4. of the associated manuscript. In short, we predicted season-long cattle weight gains by first predicting daily weight gain for each day of the grazing season from the monthly regression model using a 28-day moving average of model inputs (CP, DOM and ANHP ). We calculated the final ADG for the entire grazing season as the average predicted ADG, starting 28-days into the growing season. Percent sand content was obtained as the paddock-scale average of POLARIS sand content in the upper 0-30 cm. ANHP was calculated on the last day of the grazing season fusing a linear regression model developed by Gaffney et al. (2018) to calculate net aboveground herbaceous productivity (ANHP; kg ha-1) from satellite-derived integrated absorbed photosynthetically active radiation (iAPAR) (see Section 3.1.2 of the associated manuscript). We averaged the coefficients of 4 spatial models (2013-2016) developed by Gaffney et al. (2018), resulting in the following equation: ANHP = -26.47 + 2.07(iAPAR) REFERENCES: Gaffney, R., Porensky, L. M., Gao, F., Irisarri, J. G., Durante, M., Derner, J. D., & Augustine, D. J. (2018). Using APAR to predict aboveground plant productivity in semi-aid rangelands: Spatial and temporal relationships differ. Remote Sensing, 10(9). doi: 10.3390/rs10091474
Study Hours ,Student Scores for Linear Regression
kaggle.com
Updated Sep 23, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
douaa bennoune (2024). Study Hours ,Student Scores for Linear Regression [Dataset]. https://www.kaggle.com/datasets/douaabennoune/study-hours-student-scores-for-linear-regression
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 23, 2024
Dataset provided by
Kaggle
Authors
douaa bennoune
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset contains a collection of 100 randomly generated data points representing the relationship between the number of hours a student spends studying and their corresponding performance, measured as a score. The data has been generated to simulate a real-world scenario where study hours are assumed to influence academic outcomes, making it an excellent resource for linear regression analysis and other machine learning tasks.

Each row in the dataset consists of:

Hours: The number of hours a student dedicates to studying, ranging between 0 and 10 hours. Scores: The student's performance score, represented as a percentage, ranging from 0 to 100. Use Cases: This dataset is particularly useful for:

Linear Regression: Exploring how study hours influence student performance, fitting a regression line to predict scores based on study time. Data Science & Machine Learning: Practicing regression analysis, training models, and applying other predictive algorithms. Educational Research: Simulating data-driven insights into student behavior and performance metrics. Features: 100 rows of data. Continuous numerical variables suitable for regression tasks. Generated for educational purposes, making it ideal for students, teachers, and beginners in machine learning and data science. Potential Applications: Build a linear regression model to predict student scores. Investigate the correlation between study time and performance. Apply data visualization techniques to better understand the data. Use the dataset to experiment with model evaluation metrics like Mean Squared Error (MSE) and R-squared.
p
Cultural Clusters Data.csv
psycharchives.org
Updated Dec 17, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2020). Cultural Clusters Data.csv [Dataset]. https://www.psycharchives.org/en/item/dacf9eae-a8fb-466c-b222-eee51fb24a27
Explore at:
Dataset updated
Dec 17, 2020
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
The present study updates and extends the meta-analysis by Haus et al. (2013) who applied the theory of planned behavior (TPB) to analyze gender differences in the motivation to start a business. We extend this meta-analysis by investigating the moderating role of the societal context in which the motivation to start a business emerges and proceeds. The results, based on 119 studies analyzing 129 samples with 266,958 individuals from 36 countries, show smaller gender differences than the original study and reveal little differences across cultural regions in the effects of the tested model. A meta-regression analyzing the role of specific cultural dimensions and economic factors on gender-related correlations reveals significant effects only of gender egalitarianism and in the opposite direction as expected. In summary, the study contributes to the discussion on gender differences, the importance of study replications and updates of meta-analyses, and the generalizability of theories across cultural contexts. Dataset for: Steinmetz, H., Isidor, R., & Bauer, C. (2021). Gender Differences in the Intention to Start a Business. Zeitschrift Für Psychologie, 229(1), 70–84. https://doi.org/10.1027/2151-2604/a000435: Electronic supplementary material D - Data file

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

zenodo.org

application/gzip, bin +2

Updated Aug 2, 2024

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788

Explore at:

bin, application/gzip, zip, text/x-pythonAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.1419788

Dataset updated

Aug 2, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb

License

https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.htmlhttps://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

Description

Replication pack, FSE2018 submission #164:
------------------------------------------

**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. 
Link to the code will be included in the Camera Ready version as well.


Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
 described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
 This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
 statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
 themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data 
  (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
  `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
  **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)

Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on detalization level (see Step 2 for more details):
- up to 2Tb of disk space (see Step 2 detalization levels)
- at least 16Gb of RAM (64 preferable)
- few hours to few month of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

   git clone https://gitlab.com/user2589/ghd.git
   git checkout 0.1.0
 
 `cd` into the extracted folder. 
 All commands below assume it as a current directory.
  
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
- install docker. For Ubuntu Linux, the command is 
  `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
 Without this dependency, you might get an error on the next step, 
 but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt` . 
- disable all APIs except GitHub (Bitbucket and Gitlab support were
 not yet implemented when this study was in progress): edit
 `scraper/init.py`, comment out everything except GitHub support
 in `PROVIDERS`.

Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get output of the Python function 
`common.utils.survival_data()` and save it into a CSV file:

  # copy and paste into a Python console
  from common import utils
  survival_data = utils.survival_data('pypi', '2008', smoothing=6)
  survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speedup
the process:

####Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

####Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table.
The whole process will take 15..30 minutes.

- create a folder `

Machine learning pipeline to train toxicity prediction model of...
zenodo.org
data.niaid.nih.gov
zip
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jan Ewald; Jan Ewald (2020). Machine learning pipeline to train toxicity prediction model of FunTox-Networks [Dataset]. http://doi.org/10.5281/zenodo.3529162
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3529162
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Jan Ewald; Jan Ewald
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Machine Learning pipeline used to provide toxicity prediction in FunTox-Networks

01_DATA # preprocessing and filtering of raw activity data from ChEMBL
- Chembl_v25 # latest activity assay data set from ChEMBL (retrieved Nov 2019)
- filt_stats.R # Filtering and preparation of raw data
- Filtered # output data sets from filt_stats.R
- toxicity_direction.csv # table of toxicity measurements and their proportionality to toxicity

02_MolDesc # Calculation of molecular descriptors for all compounds within the filtered ChEMBL data set
- datastore # files with all compounds and their calculated molecular descriptors based on SMILES
- scripts
- calc_molDesc.py # calculates for all compounds based on their smiles the molecular descriptors
- chemopy-1.1 # used python package for descriptor calculation as decsribed in: https://doi.org/10.1093/bioinformatics/btt105

03_Averages # Calculation of moving averages for levels and organisms as required for calculation of Z-scores
- datastore # output files with statistics calculated by make_Z.R
- scripts
-make_Z.R # script to calculate statistics to calculate Z-scores as used by the regression models

04_ZScores # Calculation of Z-scores and preparation of table to fit regression models
- datastore # Z-normalized activity data and molecular descriptors in the form as used for fitting regression models
- scripts
-calc_Ztable.py # based on activity data, molecular descriptors and Z-statistics, the learning data is calculated

05_Regression # Performing regression. Preparation of data by removing of outliers based on a linear regression model. Learning of random forest regression models. Validation of learning process by cross validation and tuning of hyperparameters.

- datastore # storage of all random forest regression models and average level of Z output value per level and organism (zexp_*.tsv)
- scripts
- data_preperation.R # set up of regression data set, removal of outliers and optional removal of fields and descriptors
- Rforest_CV.R # analysis of machine learning by cross validation, importance of regression variables and tuning of hyperparameters (number of trees, split of variables)
- Rforest.R # based on analysis of Rforest_CV.R learning of final models

rregrs_output
# early analysis of regression model performance with the package RRegrs as described in: https://doi.org/10.1186/s13321-015-0094-2
m
Datasets used to train and test prediction model to predict scores in terms...
data.mendeley.com
Updated Mar 5, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jarosław Wątróbski (2025). Datasets used to train and test prediction model to predict scores in terms of SDG 7 realization [Dataset]. http://doi.org/10.17632/6c8fm7s4y2.1
Explore at:
Unique identifier
https://doi.org/10.17632/6c8fm7s4y2.1
Dataset updated
Mar 5, 2025
Authors
Jarosław Wątróbski
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The datasets used in this research work refer to the aims of Sustainable Development Goal 7. These datasets were used to train and test machine learning model based on artificial neural network and other machine learning regression models for solving the problem of prediction scores in terms of SDG 7 aims realization. Train dataset was created based on data from 2013 to 2021 and includes 261 samples. Test dataset includes 29 samples. Sources data from 2013 to 2022 are available in 10 XLSX and CSV files. Train and test datasets are available in XLSX and CSV files. Detailed description of data is available in PDF file.
d
Remotely sensed and in-situ chlorophyll a and temperature data from Blue...
catalog.data.gov
data.usgs.gov
+1more
Updated Nov 26, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). Remotely sensed and in-situ chlorophyll a and temperature data from Blue Mesa Reservoir, Colorado [Dataset]. https://catalog.data.gov/dataset/remotely-sensed-and-in-situ-chlorophyll-a-and-temperature-data-from-blue-mesa-reservoir-co
Explore at:
Dataset updated
Nov 26, 2025
Dataset provided by
U.S. Geological Survey
Area covered
Blue Mesa Reservoir, Colorado
Description
This data set includes estimates of aquatic chlorophyll a concentration and reservoir temperature for Blue Mesa Reservoir, CO. A Random Forest modeling approach was trained to model near-surface aquatic chlorophyll a using near-coincident Sentinel-2 satellite imagery and water samples analyzed for chlorophyll a concentration. The trained chlorophyll a model was applied to Sentinel-2 imagery to produce maps of modeled chlorophyll a concentrations at 10 m spatial resolution for May through October for 2016 through 2023. Chlorophyll a concentrations for three sections (basins) of Blue Mesa Reservoir were extracted from the raster data to produce time-series of modeled chlorophyll a concentration summary statistics (e.g., median, standard deviation, 90th percentile, etc). Water temperatures were approximated using the provisional Landsat surface temperature (PST) product collected with sensors on board Landsat 5, 7, 8, and 9 for May through October between 2000 and 2023. PST values for Landsat 8 and Landsat 9 were scaled to match in-situ water temperature observations in the top 1 m of the water column using a multivariate linear regression model. A harmonized water temperature record was produced by adjusting Landsat 7 PST values to align with the adjusted Landsat 8 values for near-coincident image dates. Similarly, Landsat 5 PST values were adjusted to match the adjusted Landsat 7 values. The modeled chlorophyll a and temperatures had root mean square errors of 1.9 micrograms per liter and 0.7 degrees Celsius, respectively. This data release includes three components with tabular and raster data: 1) Tabular .csv format in-situ and remotely sensed chlorophyll a data from Blue Mesa Reservoir, Colorado May through October 2016 - 2023: Data used to train the chlorophyll a model (chl_model_training.csv), ) and modeled chlorophyll a time series (chl_rs_values.csv) 2) Raster format remotely sensed aquatic chlorophyll a for Blue Mesa Reservoir, Colorado May through October 2016 - 2023: Raster data include 167 geotiffs of modeled chlorophyll a concentrations in zipped directory (chl_conc_ug_L.zip) 3) Tabular in-situ and remotely sensed temperature data from Blue Mesa Reservoir, Colorado May through October 2000 - 2023: Data used to train the temperature model (temp_model_training.csv) and modeled temperature timeseries (temp_rs_values.csv)
4
Dataset for 'Identifying Key Drivers of Product Formation in Microbial...
data.4tu.nl
zip
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Marika Zegers; Moumita Roy; Ludovic Jourdin, Dataset for 'Identifying Key Drivers of Product Formation in Microbial Electrosynthesis with a Mixed Linear Regression Analysis' [Dataset]. http://doi.org/10.4121/5e840d08-55f6-4daa-a639-048cebcd8266.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.4121/5e840d08-55f6-4daa-a639-048cebcd8266.v1
Dataset provided by
4TU.ResearchData
Authors
Marika Zegers; Moumita Roy; Ludovic Jourdin
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Feb 1, 2024 - Dec 1, 2024
Dataset funded by
Delft University of Technology
NWO
Description
The analysed data and complete scripts for the permutation tests and mixed linear regression models (MLRMs) used in the paper 'Identifying Key Drivers of Product Formation in Microbial Electrosynthesis with a Mixed Linear Regression Analysis'.
Python version 3.10.13 with packages numpy, pandas, os, scipy.optimize, scipy.stats, sklearn.metrics, matplotlib.pyplot, statsmodels.formula.api, seaborn are required to run the .py files. Ensure all packages are installed before running the scripts. Data files required to run the code (.xlsx and .csv format) are included in the relevant folders.
e
CALY-SWE: Discrete choice experiment and time trade-off data for a...
data.europa.eu
researchdata.se
unknown
Updated Sep 19, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Umeå universitet (2023). CALY-SWE: Discrete choice experiment and time trade-off data for a representative Swedish value set [Dataset]. https://data.europa.eu/data/datasets/https-doi-org-10-5878-asxy-3p37?locale=en
Explore at:
unknownAvailable download formats
Dataset updated
Sep 19, 2023
Dataset authored and provided by
Umeå universitet
Description
The data consist of two parts: Time trade-off (TTO) data with one row per TTO question (5 questions), and discrete choice experiment (DCE) data with one row per question (6 questions). The purpose of the data is the calculation of a Swedish value set for the capability-adjusted life years (CALY-SWE) instrument. To protect the privacy of the study participants and to comply with GDPR, access to the data is given upon request.

The data is provided in 4 .csv files with the names:

tto.csv (252 kB)

dce.csv (282 kB)

weights_final_model.csv (30 kB)

coefs_final_model.csv (1 kB)

The first two files (tto.csv, dce.csv) contain the time trade-off (TTO) answers and discrete choice experiment (DCE) answers of participants. The latter two files (weight_final_model.csv, coefs_final_model.csv) contain the generated value set of CALY-SWE weights, and the pertaining coefficients of the main effects additive model.

Background:

CALY-SWE is a capability-based instrument for studying Quality of Life (QoL). It consists of 6 attributes (health, social relations, financial situation & housing, occupation, security, political & civil rights) and provides the option to gives for attribute answers on 3 levels (Agree, Agree partially, Do not agree). A configuration or state is one of the 3^6 = 729 possible situations that the instrument describes. Here, a config is denoted in the form of xxxxxx, one x for each attribute in order above. X is a digit corresponding to the level of the respective attribute, with 3 being the highest (Agree), and 1 being the lowest (Do not agree). For example, 222222 encodes a configuration with all attributes on level 2 (Partially agree). The purpose of this dataset is to support the publication of the CALY-SWE value set and to enable reproduction of the calculations (due to privacy concerns we abstain from publishing individual level characteristics). A value set consists of values on the 0 to 1 scale for all 729, each of represents a quality weighting where 1 is the highest capability-related QoL, and 0 the lowest capability-related QoL.

The data contains answers to two types of questions: TTO and DCE.

In TTO questions, participants iteratively chose a number of years between 1 to 10. A choice of 10 years is equivalent to living 10 years with full capability (state configuration 333333) in the capability state that the TTO question describes. The answer on the 0 to 1 scale is then calculated as x/10. In the DCE questions, participants were given two states and they chose a state that they found to be better. We used a hybrid model with a linear regression and a logit model component, where the coefficients were linked through a multiplicative factor, to obtain the weights (weights_final_model.csv). Each weight is calculated as constant + the coefficients for the respective configuration. Coefficients for level 3 encode the difference to level 2, and coefficients for level 2 the difference to the constant. For example, for the weight for 123112 is calculated as constant + socrel2 + finhou2 + finhou3 + polciv2 (No coefficients for health, occupation, and security involved as they are on level 1 that is captured in the constant/intercept).

To assess the quality of TTO answers, we calculated a score per participant that takes into account inconsistencies in answering the TTO question. We then excluded 20% of participants with the worst score to improve the TTO data quality and signal strength for the model (this is indicated by the 'included' variable in the TTO dataset). Details of the entire survey are described in the preprint “CALY-SWE value set: An integrated approach for a valuation study based on an online-administered TTO and DCE survey” by Meili et al. (2023). Please check this document for updated versions.

Ids have been randomized with preserved linkage between the DCE and TTO dataset.

Data files and variables:

Below is a description of the variables in each CSV file. - tto.csv:

config: 6 numbers representing the attribute levels. position: The number of the asked TTO question. tto_block: The design block of the TTO question. answer: The equivalence value indicated by the participant, ranging from 0.1 to 1 in steps of 0.1. included: If the answer was included in the data for the model to generate the value set. id: Randomized id of the participant.

dce.csv:

config1: Configuration of the first state in the question. config2: Configuration of the second state in the question. position: The number of the asked TTO question. answer: Whether state 1 or 2 was preferred. id: Randomized id of the participant.

weights_final_model.csv

config: 6 numbers representing the attribute levels. weight: The weight calculated with the final model. ciu: The upper 95% credible interval. cil: The lower 95% credible interval.

coefs_final_model.csv:

name: Name of the coefficient, composed of an abbreviation for the attribute and a level number (abbreviations in the same order as above:
F
Datasets for classification and regression in human-robot interaction with...
data.uni-hannover.de
zip, csv, py
Updated Nov 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Institut für Mechatronische Systeme (2025). Datasets for classification and regression in human-robot interaction with parallel robots [Dataset]. https://data.uni-hannover.de/dataset/datasets-for-classification-and-regression-in-human-robot-interaction-with-parallel-robots
Explore at:
zip, csv, pyAvailable download formats
Dataset updated
Nov 14, 2025
Dataset authored and provided by
Institut für Mechatronische Systeme
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This repository contains three datasets for the analysis of contact events in robots. The data was recorded using a parallel robot and is suitable for regression and classification tasks in the field of machine learning.

The datasets are divided into the following three tasks:

Regression of Contact Location and Force: The goal is to predict the location and force of a contact on the robot.

Classification of the Collided Body: This task aims to identify the robot's body that is in contact with the environment (one of the six links or the end-effector platform).

Classification of the Contact Type: This task deals with distinguishing between different types of contact, such as collisions and clamping.

More details on the first two datasets can be found in the publication: https://doi.org/10.1109/IROS55552.2023.10342345

Information on the third dataset can be found here: https://doi.org/10.1109/IROS55552.2023.10341581

Data Description

The data is available in .csv format and contains time-series data from the robot's sensors and a force-torque sensor. More information on the datasets can be found in their readme-files. The following variables are included in the datasets:

t_s: Time (s)

q_des_deg_[1-9]: Target joint angle for joints 1-9 (deg)

q_deg_[1-9]: Actual joint angle for joints 1-9 (deg)

x_des_m_rad_[1-3]: Target end-effector pose (x, y, orientation) (m, deg)

x_m_rad_[1-3]: Actual end-effector pose (x, y, orientation) (m, deg)

xd_ms_rads_[1-3]: Actual end-effector velocity (x, y, orientation) (m/s, deg/s)

tau_qa_Nm_[1-3]: Actual motor torque for motors 1-3 (Nm)

tau_ext_fts_Nm_[1-3]: External torque projected from force-torque sensor (Nm)

tau_ext_est_Nm_[1-3]: Estimated external torque (Nm)

F_ext_fts_N_Nm_[1-6]: External forces (1-3) and moments (4-6) from force-torque sensor (N, Nm)

F_ext_est_mobPlat_CS0_N_Nm_[1-3]: Estimated external force and moment on the mobile end-effector platform (N, Nm)

F_ext_fts_proj_mobPlat_CS0_N_Nm_[1-3]: Measured and projected external force and moment on the mobile end-effector platform (N, Nm)

distances_m_[1-3]: Distances for classification (m)

angles_deg_[1-3]: Angles for classification (deg)

collided_body: Identifier for the body in contact (1-6=links, 7=platform) (-)

chain: Collided chain (-)

link: Collided link of the chain (-)

location: Normalized location of the contact point on the link (-)

clamping_collision: Identifier for the type of contact (0=collision, 1=clamping) (-)

Target Variables

Regression: location and the third component of F_ext_fts_N_Nm_[1-6] (link-orthogonal force).

Classification of the Collided Body: collided_body.

Classification of the Contact Type: clamping_collision.

Usage

The datasets are available as train.csv and test.csv (for the classification tasks) and data.csv (for the regression task). They can be loaded using common libraries like Pandas in Python to train and evaluate machine learning models.

Citation

If you use these datasets in your research, please cite the corresponding publications.

Facebook

Twitter

Click to copy link

Link copied

Cite

Taseer Mehboob (2023). Insurance Dataset - Simple Linear Regression [Dataset]. https://www.kaggle.com/datasets/taseermehboob9/insurance-dataset-simple-linear-regression

Insurance Dataset - Simple Linear Regression

Insurance Dataset in csv for linear regression.It can be used in MachineLearning

Explore at:

zip(254 bytes)Available download formats

Dataset updated

Sep 14, 2023

Authors

Taseer Mehboob

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

Here in This Dataset we have only 2 columns the first one is Age and the second one is Premium You can use this dataset in machine learning for Simple linear Regression and for Prediction Practices.

Clear search

Close search

Google apps

Main menu

Insurance Dataset - Simple Linear Regression

LINEAR REGRESSION DATA CSV

Dataset

Contents

Panel dataset on Brazilian fuel demand

Cancer Regression

Marketing Linear Multiple Regression

Dataset

Contents

Data from: Data for Regression Models to Estimate Water Use in Providence,...

Pearson's Height Data 📏 Simple linear regression

UCI and OpenML Data Sets for Ordinal Quantification

Data from: Data and regression models used to estimate chloride...

Salary Dataset - Simple linear regression

Dataset Description

Columns

Data from: Predicting spatial-temporal patterns of diet quality and large...

Study Hours ,Student Scores for Linear Regression

Cultural Clusters Data.csv

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

Machine learning pipeline to train toxicity prediction model of...

Datasets used to train and test prediction model to predict scores in terms...

Remotely sensed and in-situ chlorophyll a and temperature data from Blue...

Dataset for 'Identifying Key Drivers of Product Formation in Microbial...

CALY-SWE: Discrete choice experiment and time trade-off data for a...

Datasets for classification and regression in human-robot interaction with...

Data Description

Target Variables

Usage

Citation

Insurance Dataset - Simple Linear Regression

Insurance Dataset in csv for linear regression.It can be used in MachineLearning