This dataset was created by APriyanka
https://creativecommons.org/publicdomain/zero/1.0/
📌 Description
The data are based on the famous 1885 study by Francis Galton exploring the relationship between the heights of adult children and the heights of their parents. Each case is an adult child.
| Column | Description |
|---|---|
| Father_height | The father's height, in inches. |
| Mother_height | The mother's height, in inches. |
| Child_height | The height of the child, in inches. |
| Gender | The gender of the child: male (1) or female (0). |
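A minimal sketch of how these four columns might be read and used, assuming the data come as a CSV whose header matches the column names in the table. The rows below are made-up placeholders, not values from the dataset, and averaging the two parents' heights is an illustrative choice, not something the dataset prescribes.

```python
import csv
import io

# Placeholder rows for illustration only; NOT taken from the real dataset.
SAMPLE_CSV = """Father_height,Mother_height,Child_height,Gender
70.0,64.0,69.0,1
68.0,66.0,64.5,0
72.0,65.0,71.0,1
66.0,62.0,62.0,0
"""

def load_rows(text):
    """Parse CSV text into dicts with numeric heights and an integer gender code."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({
            "father": float(rec["Father_height"]),
            "mother": float(rec["Mother_height"]),
            "child": float(rec["Child_height"]),
            "gender": int(rec["Gender"]),  # 1 = male, 0 = female
        })
    return rows

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

rows = load_rows(SAMPLE_CSV)
# Hypothetical predictor: the plain average of the two parents' heights.
mid_parent = [(r["father"] + r["mother"]) / 2 for r in rows]
intercept, slope = fit_line(mid_parent, [r["child"] for r in rows])
print(f"child ~ {intercept:.2f} + {slope:.2f} * mid_parent")
```

With the real file, the same `load_rows` function could be fed the file's text instead of `SAMPLE_CSV`.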
This data release contains input data and programs (scripts) used to estimate monthly water demand for retail customers of Providence Water, located in Providence, Rhode Island. Explanatory data and model outputs are from July 2014 through June 2021. Models of per capita (for single-family residential customers) or per connection (for multi-family residential, commercial, and industrial customers) water use were developed using multiple linear regression. The dependent variables, provided by Providence Water, are the monthly number of connections and gallons of water delivered to single- and multi-family residential, commercial, and industrial connections. Potential independent variables (from online sources) are climate variables (temperature and precipitation), economic statistics, and a drought statistic. Not all independent variables were used in all of the models. The data are provided in data tables and model files. The data table RIWaterUseVariableExplanation.csv describes the explanatory variables and their data sources. The data table ProvModelInputData.csv provides the monthly water-use data that are the dependent variables and the monthly climatic and economic data that are the independent variables. The data table DroughtInputData.csv provides the weekly U.S. Drought Monitor index values that were processed to formulate a potential independent variable. The R script model_water_use.R runs the models that predict water use. The other two R scripts (load_preprocess_input_data.R and model_water_use_functions.R) are not run explicitly but are called from the primary script model_water_use.R. Regression equations produced by the models can be used to predict water demand throughout Rhode Island.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This research data file contains the software and the dataset needed to estimate the missing prices of house units. The approach combines several machine learning techniques (linear regression, support vector regression, k-nearest neighbors, and a multi-layer perceptron neural network) with several dimensionality reduction techniques (non-negative factorization, recursive feature elimination, and feature selection with a variance threshold). It includes the input dataset formed from the house prices available on the Idealista website on November 13, 2017 for two neighborhoods of the city of Teruel (Spain): the city center and “Ensanche”.
This dataset supports the authors' research on improving the setup of agent-based simulations of the real-estate market. The work on this dataset has been submitted to a scientific journal for consideration for publication.
The open-source Python code consists of all the files with the “.py” extension. The main program can be executed from the “main.py” file. The “boxplotErrors.eps” file is a chart generated by running the code; it compares the results of the different combinations of machine learning techniques and dimensionality reduction methods.
The dataset is in the “data” folder. The raw input data of the house prices are in the “dataRaw.csv” file. These were shuffled into the “dataShuffled.csv” file. We used cross-validation to obtain the estimates of house prices. The estimates, alongside the real values, are stored in separate files in the “data” folder; each filename is composed of the abbreviation of the machine learning technique and the abbreviation of the dimensionality reduction method.
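The cross-validated estimation described above can be illustrated with a small, self-contained sketch. It uses only one of the listed techniques (k-nearest neighbors) and no dimensionality reduction, the toy features and prices are invented, and the function names are hypothetical; the repository's own “main.py” may be organized quite differently.

```python
import random

def knn_predict(train_x, train_y, query, k=3):
    """Predict as the mean target of the k nearest training points (Euclidean)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_x, train_y)
    )
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)

def cross_val_predict(xs, ys, n_folds=5, k=3, seed=0):
    """Return one out-of-fold estimate per sample, in the original order."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)          # shuffle before splitting
    folds = [idx[i::n_folds] for i in range(n_folds)]
    preds = [None] * len(xs)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        for i in fold:
            # Each sample is estimated by a model that never saw it.
            preds[i] = knn_predict(tx, ty, xs[i], k)
    return preds

# Invented toy data: (area, rooms) -> price; not from dataRaw.csv.
xs = [(50 + 5 * i, 1 + i % 4) for i in range(20)]
ys = [1000.0 * area + 5000.0 * rooms for area, rooms in xs]
estimates = cross_val_predict(xs, ys)
mae = sum(abs(p - y) for p, y in zip(estimates, ys)) / len(ys)
print(f"mean absolute error: {mae:.1f}")
```

Storing `estimates` next to `ys` would give the kind of estimate-versus-real-value table the output files are described as holding.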
This data release supports the following publication: Mast, M. A., 2018, Estimating metal concentrations with regression analysis and water-quality surrogates at nine sites on the Animas and San Juan Rivers, Colorado, New Mexico, and Utah: U.S. Geological Survey Scientific Investigations Report 2018-5116.
The U.S. Geological Survey (USGS), in cooperation with the U.S. Environmental Protection Agency (EPA), developed site-specific regression models to estimate concentrations of selected metals at nine USGS streamflow-gaging stations along the Animas and San Juan Rivers. Multiple linear-regression models were developed by relating metal concentrations in discrete water-quality samples to continuously monitored streamflow and surrogate parameters including specific conductance, pH, turbidity, and water temperature. Models were developed for dissolved and total concentrations of aluminum, arsenic, cadmium, iron, lead, manganese, and zinc using water-quality samples collected during 2005–17 by several agencies, using different collection methods and analytical laboratories.
Calibration datasets in comma-separated values (CSV) format include the variables of sampling date and time, metal concentrations (in micrograms per liter), stream discharge (in cubic feet per second), specific conductance (in microsiemens per centimeter at 25 degrees Celsius), pH, water temperature (in degrees Celsius), turbidity (in nephelometric turbidity units), and calculated seasonal terms based on Julian day.
Surrogate parameters and discrete water-quality samples were used from nine sites: Cement Creek at Silverton, Colo. (USGS station 09358550); Animas River below Silverton, Colo. (USGS station 09359020); Animas River at Durango, Colo. (USGS station 09361500); Animas River near Cedar Hill, N. Mex. (USGS station 09363500); Animas River below Aztec, N. Mex. (USGS station 09364010); San Juan River at Farmington, N. Mex. (USGS station 09365000); San Juan River at Shiprock, N. Mex. (USGS station 09368000); San Juan River at Four Corners, Colo. (USGS station 09371010); and San Juan River near Bluff, Utah (USGS station 09379500).
Model archive summaries in PDF format include model statistics, data, and plots and were generated using an R script developed by the USGS Kansas Water Science Center, available at https://patrickeslick.github.io/ModelArchiveSummary/. A description of each USGS streamflow-gaging station, along with information about the calibration datasets, is also provided.
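The description does not spell out how the "seasonal terms based on Julian day" are calculated. A common choice in water-quality regression is a sine and cosine pair of the fraction of the year elapsed; the sketch below assumes that form, and both the function name and the example date are hypothetical.

```python
import math
from datetime import date

def seasonal_terms(d):
    """Return (sin, cos) of 2*pi*T, where T is the fraction of the year elapsed.

    Assumption: the seasonal terms are a first-order sine/cosine pair;
    the report may define them differently.
    """
    julian_day = d.timetuple().tm_yday                       # 1..365 (366 in leap years)
    days_in_year = date(d.year, 12, 31).timetuple().tm_yday  # 365 or 366
    t = julian_day / days_in_year
    return math.sin(2 * math.pi * t), math.cos(2 * math.pi * t)

s, c = seasonal_terms(date(2015, 8, 5))
print(f"sin term = {s:.3f}, cos term = {c:.3f}")
```

Including both terms as regressors lets a linear model fit a seasonal cycle with an arbitrary phase.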
This data was downloaded from the official Bombay Stock Exchange (BSE) website. The file contains the last 10 years of historical stock prices (by security and period). Security name: Nestle India Ltd. Period: daily. Start date: 2 January 2012. End date: 21 April 2022. The dataset is well suited to supervised regression; both simple and multiple linear regression can be performed on it.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, because NHANES data are plagued with multiple inconsistencies, the data must be processed before new insights can be derived through large-scale analyses. We therefore developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous NHANES (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey demographics (281 variables), dietary consumption (324 variables), physiological functions (1,040 variables), occupation (61 variables), questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood), medications (29 variables), mortality information linked from the National Death Index (15 variables), survey weights (857 variables), environmental exposure biomarker measurements (598 variables), and chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).
csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file. The curated NHANES datasets comprise 20 .csv files, two for each module: one uncleaned version and one cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments.
"dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES.
"dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.
“dictionary_drug_codes.csv” contains the dictionary of descriptors for the drug codes.
“nhanes_inconsistencies_documentation.xlsx” is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.
R Data Record: For researchers who want to conduct their analysis in the R programming language, the cleaned NHANES modules and the data dictionaries (but not the uncleaned modules) can be downloaded as a .zip file that includes an .RData file and an .R file.
“w - nhanes_1988_2018.RData” contains all the aforementioned datasets as R data objects. We make available all R scripts for the customized functions that were written to curate the data.
“m - nhanes_1988_2018.R” shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.
Example starter codes: The set of starter code to help users conduct exposome analysis consists of four R markdown files (.Rmd). We recommend going through the tutorials in order.
“example_0 - merge_datasets_together.Rmd” demonstrates how to merge the curated NHANES datasets together.
“example_1 - account_for_nhanes_design.Rmd” demonstrates how to fit a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.
“example_2 - calculate_summary_statistics.Rmd” demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design.
“example_3 - run_multiple_regressions.Rmd” demonstrates how to run multiple regression models with and without adjusting for the sampling design.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
This data is used to train the Multiple Linear Regression Models. (CSV)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
The "Dataset_HIR" folder contains the data to reproduce the results of the data mining approach proposed in the manuscript titled "Identification of hindered internal rotational mode for complex chemical species: A data mining approach with multivariate logistic regression model".
More specifically, the folder contains the raw electronic structure calculation input data provided by the domain experts as well as the training and testing dataset with the extracted features.
The "Dataset_HIR" folder contains the following subfolders:
1. Electronic structure calculation input data: contains the electronic structure calculation input generated by the Gaussian program.
1.1. Training data: contains the raw data of all training species (each stored in a separate folder) used for extracting the dataset for the training and validation phase.
1.2. Testing data: contains the raw data of all testing species (each stored in a separate folder) used for extracting data for the testing phase.
2. Dataset
2.1. Training dataset: used to produce the results in Tables 3 and 4 in the manuscript.
+ datasetTrain_raw.csv: contains the features for all vibrational modes associated with the corresponding labeled species, so that the chemists can easily select the Hindered Internal Rotor from the list for the training and validation steps.
+ datasetTrain.csv: a refinement of datasetTrain_raw.csv in which the names of the species are all removed, putting the dataset into an appropriate form for the modeling and validation steps.
2.2. Testing dataset: used to produce the results of the data mining approach in Table 5 in the manuscript.
+ datasetTest_raw.csv: contains the features for all vibrational modes of each labeled species, so that the chemists can select the Hindered Internal Rotor from the list for the testing step.
+ datasetTest.csv: a refinement of datasetTest_raw.csv in which the names of the species are all removed, putting the dataset into an appropriate form for the testing step.
Note on the Result feature in the dataset: 1 marks a mode that needs to be treated as a Hindered Internal Rotor, and 0 marks any other mode.
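The step from a "_raw" file to its refined counterpart, removing the species names, could look roughly like the sketch below. The header names "Species" and "Frequency" are hypothetical placeholders (only "Result" is named in the description), and the two data rows are made up.

```python
import csv
import io

def strip_species(raw_text, name_column="Species"):
    """Drop the species-name column; 'Species' is a hypothetical header."""
    reader = csv.DictReader(io.StringIO(raw_text))
    kept = [f for f in reader.fieldnames if f != name_column]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=kept, lineterminator="\n")
    writer.writeheader()
    for rec in reader:
        writer.writerow({f: rec[f] for f in kept})
    return out.getvalue()

# Made-up rows; feature names other than Result are placeholders.
raw = "Species,Frequency,Result\nC2H6,300.5,1\nC2H6,2950.0,0\n"
clean = strip_species(raw)
print(clean)
```

The remaining columns, with Result as the 0/1 label, are then in a form a logistic regression routine can consume.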
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Analysis of ‘Insurance Premium Prediction’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/noordeen/insurance-premium-prediction on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The insurance.csv dataset contains 1338 observations (rows) and 7 features (columns). The dataset contains 4 numerical features (age, bmi, children and expenses) and 3 nominal features (sex, smoker and region) that were converted into factors with a numerical value designated for each level.
The insurance.csv file was obtained from the Machine Learning course website (Spring 2017) of Professor Eric Suess at http://www.sci.csueastbay.edu/~esuess/stat6620/#week-6.
The purpose of this exercise is to examine the different features and their relationships, and to fit a multiple linear regression of existing medical expenses on several features of an individual, such as age, physical and family condition, and location. The fitted model can then be used to predict individuals' future medical expenses, which helps medical insurers decide what premium to charge.
--- Original source retains full ownership of the source dataset ---
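The conversion of nominal features into factors with a numerical value per level, as described for sex, smoker and region, can be sketched as follows. The example values are illustrative rather than rows from insurance.csv, and the alphabetical, zero-based coding is an assumption; the description does not state which codes were actually assigned.

```python
def encode_factor(values):
    """Map each distinct level to an integer code, ordered alphabetically.

    This mirrors how R's factor() orders levels by default, but codes here
    start at 0; the source does not state which codes were actually used.
    """
    levels = sorted(set(values))
    code_of = {level: i for i, level in enumerate(levels)}
    return [code_of[v] for v in values], levels

# Illustrative values only; not rows from insurance.csv.
smoker = ["yes", "no", "no", "yes"]
region = ["southwest", "southeast", "northwest", "southeast"]

smoker_codes, smoker_levels = encode_factor(smoker)
region_codes, region_levels = encode_factor(region)
print(smoker_levels, smoker_codes)
print(region_levels, region_codes)
```

For a regression, a multi-level feature such as region is usually expanded into indicator columns rather than used as a single integer, since the integer codes carry no meaningful order.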
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Background:
Hearing loss and tinnitus have been linked to mild cognitive impairment (MCI); however, the evidence is limited by ethical and temporal constraints, and few prospective studies have definitively established causation. This study aims to use Mendelian randomization (MR) and a cross-sectional study to validate and analyze this association.
Methods:
This study employs a two-step approach. First, genetic data of the European population from the genome-wide association studies (GWAS) database are used to establish the causal relationship between hearing loss and cognitive impairment through Mendelian randomization using the inverse variance weighted (IVW) method. This is achieved by identifying strongly correlated single nucleotide polymorphisms (SNPs), eliminating linkage disequilibrium, and excluding weak instrumental variables. In the second step, 363 elderly individuals from 10 communities in Qingdao, China are assessed using a questionnaire survey and pure tone audiometry (PTA). Logistic regression and multiple linear regression were used to analyze the risk factors of MCI in the elderly and to calculate the cutoff values.
Results:
The Mendelian randomization analysis showed that hearing loss is a risk factor for MCI in European populations, with a risk ratio of hearing loss to MCI of 1.23. In the cross-sectional study, age, tinnitus, and hearing loss emerged as significant risk factors for MCI in univariate logistic regression analysis. Furthermore, multivariate logistic regression analysis identified hearing loss and tinnitus as potential risk factors for MCI. Consistent results were observed in multiple linear regression analysis, which showed that hearing loss and age significantly influenced the development of MCI. Additionally, a notable finding was that the likelihood of MCI occurrence increased by 9% when the hearing threshold exceeded 20 decibels.
Conclusion:
This study provides evidence from genomic and epidemiological investigations indicating that hearing loss may serve as a risk factor for cognitive impairment. While our epidemiological study has found both hearing loss and tinnitus to be potential risk factors for cognitive decline, additional research is required to establish a causal relationship, particularly given that tinnitus can manifest as a symptom of various underlying medical conditions.
Insurance companies collect multiple features of a house to decide which houses can be insured and what premium to charge. Here I have collected data from multiple insurance companies in the USA in which house features are given together with house prices.
This dataset contains many property details, from the address to location coordinates and many other features; use them to predict the house price.
Many multiple regression datasets have been published, each unique in its own way; what is new here is the use of location coordinates and some other coordinates.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
Time-Series Matrix (TSMx): A visualization tool for plotting multiscale temporal trends
TSMx is an R script that was developed to facilitate multi-temporal-scale visualizations of time-series data. The script requires only a two-column CSV of years and values to plot the slope of the linear regression line for all possible year combinations from the supplied temporal range. The outputs include a time-series matrix showing slope direction based on the linear regression, slope values plotted with colors indicating magnitude, and results of a Mann-Kendall test. The start year is indicated on the y-axis and the end year is indicated on the x-axis. In the example below, the cell in the top-right corner is the direction of the slope for the temporal range 2001–2019. The red line corresponds with the temporal range 2010–2019 and an arrow is drawn from the cell that represents that range. One cell is highlighted with a black border to demonstrate how to read the chart: that cell represents the slope for the temporal range 2004–2014. This publication entry also includes an Excel template that produces the same visualizations without a need to interact with any code, though minor modifications will need to be made to accommodate year ranges other than what is provided.
TSMx for R was developed by Georgios Boumis; TSMx was originally conceptualized and created by Brad G. Peter in Microsoft Excel.
Please refer to the associated publication: Peter, B.G., Messina, J.P., Breeze, V., Fung, C.Y., Kapoor, A. and Fan, P., 2024. Perspectives on modifiable spatiotemporal unit problems in remote sensing of agriculture: evaluating rice production in Vietnam and tools for analysis. Frontiers in Remote Sensing, 5, p.1042624. https://www.frontiersin.org/journals/remote-sensing/articles/10.3389/frsen.2024.1042624
TSMx sample chart from the supplied Excel template. Data represent the productivity of rice agriculture in Vietnam as measured via EVI (enhanced vegetation index) from the NASA MODIS data product (MOD13Q1.V006).
TSMx R script:
    # import packages
    library(dplyr)
    library(readr)
    library(ggplot2)
    library(tibble)
    library(tidyr)
    library(forcats)
    library(Kendall)

    options(warn = -1) # disable warnings

    # read data (.csv file with "Year" and "Value" columns)
    data <- read_csv("EVI.csv")

    # prepare row/column names for output matrices
    years <- data %>% pull("Year")
    r.names <- years[-length(years)]
    c.names <- years[-1]
    years <- years[-length(years)]

    # initialize output matrices
    sign.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))
    pval.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))
    slope.matrix <- matrix(data = NA, nrow = length(years), ncol = length(years))

    # function to return remaining years given a start year
    getRemain <- function(start.year) {
      years <- data %>% pull("Year")
      start.ind <- which(data[["Year"]] == start.year) + 1
      remain <- years[start.ind:length(years)]
      return(remain)
    }

    # function to subset data for a start/end year combination
    splitData <- function(end.year, start.year) {
      keep <- which(data[['Year']] >= start.year & data[['Year']] <= end.year)
      batch <- data[keep,]
      return(batch)
    }

    # function to fit linear regression and return slope direction
    fitReg <- function(batch) {
      trend <- lm(Value ~ Year, data = batch)
      slope <- coefficients(trend)[[2]]
      return(sign(slope))
    }

    # function to fit linear regression and return slope magnitude
    fitRegv2 <- function(batch) {
      trend <- lm(Value ~ Year, data = batch)
      slope <- coefficients(trend)[[2]]
      return(slope)
    }

    # function to implement Mann-Kendall (MK) trend test and return significance
    # the test is implemented only for n>=8
    getMann <- function(batch) {
      if (nrow(batch) >= 8) {
        mk <- MannKendall(batch[['Value']])
        pval <- mk[['sl']]
      } else {
        pval <- NA
      }
      return(pval)
    }

    # function to return slope direction for all combinations given a start year
    getSign <- function(start.year) {
      remaining <- getRemain(start.year)
      combs <- lapply(remaining, splitData, start.year = start.year)
      signs <- lapply(combs, fitReg)
      return(signs)
    }

    # function to return MK significance for all combinations given a start year
    getPval <- function(start.year) {
      remaining <- getRemain(start.year)
      combs <- lapply(remaining, splitData, start.year = start.year)
      pvals <- lapply(combs, getMann)
      return(pvals)
    }

    # function to return slope magnitude for all combinations given a start year
    getMagn <- function(start.year) {
      remaining <- getRemain(start.year)
      combs <- lapply(remaining, splitData, start.year = start.year)
      magns <- lapply(combs, fitRegv2)
      return(magns)
    }

    # retrieve slope direction, MK significance, and slope magnitude
    signs <- lapply(years, getSign)
    pvals <- lapply(years, getPval)
    magns <- lapply(years, getMagn)

    # fill-in output matrices
    dimension <- nrow(sign.matrix)
    for (i in 1:dimension) {
      sign.matrix[i, i:dimension] <- unlist(signs[i])
      pval.matrix[i, i:dimension] <- unlist(pvals[i])
      slope.matrix[i, i:dimension] <- unlist(magns[i])
    }
    sign.matrix <-...
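For readers who prefer Python, the central idea of the script, a regression slope for every start/end year pair arranged as an upper-triangular matrix, can be sketched as below. This is an illustrative re-expression and not a port of TSMx: it omits the Mann-Kendall test and all plotting, assumes the years are sorted in ascending order, and uses invented values in place of EVI.csv.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)

def slope_matrix(years, values):
    """Rows are start years (all but the last), columns are end years (all but the first).

    Cell [i][j] holds the slope fitted to years[i] .. years[j + 1];
    cells below the diagonal stay None.
    """
    n = len(years)
    matrix = [[None] * (n - 1) for _ in range(n - 1)]
    for i in range(n - 1):
        for j in range(i + 1, n):
            matrix[i][j - 1] = ols_slope(years[i:j + 1], values[i:j + 1])
    return matrix

# Invented values for illustration; not the EVI series.
years = [2001, 2002, 2003, 2004, 2005]
values = [0.40, 0.42, 0.41, 0.45, 0.44]
slopes = slope_matrix(years, values)
signs = [[None if s is None else sign(s) for s in row] for row in slopes]
for start, row in zip(years[:-1], signs):
    print(start, row)
```

The top-right cell of `slopes` covers the full range of years, matching the reading of the chart described above.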
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Most North American species of terrestrial isopod (Isopoda) have been introduced from Europe. Sandplain grassland is a globally rare habitat that is abundant on Nantucket Island, Massachusetts, and the abundance of terrestrial isopods in the habitat has never been studied. The objective of this project was to develop a model to explain isopod abundance based on vegetation characteristics within sandplain grassland and use this model to test for land management effects (prescribed burning and mowing) on isopod abundance. I counted terrestrial isopods from 175 pitfall traps set for one week and used multiple linear regression with several selection algorithms to select the best model. The vegetation characteristics I used as regressors do not appear to explain terrestrial isopod abundance well, and the final model contains only the percent grass coverage as a regressor. The model suggests that terrestrial isopods decrease in abundance with increasing grass coverage, and it explains 29 percent of the variation in the data. When management effects are incorporated, the model suggests that mowing significantly increases isopod abundance.
Funding for this project came from the Nantucket Islands Land Bank, Nantucket Land Council, and the Nantucket Biodiversity Initiative.
Associated vegetation data is in the published "Effects of Sandplain Grassland Management on Spider Richness and Abundance on Nantucket Island" dataset. Sampling methods are in the thesis linked from that dataset.
allisopodData.csv - isopod counts by trap
dataDictionary.csv - descriptions of variables
mckenna-foster_2009.pdf - a report submitted to NBI and used as part of a statistics class at the University of Wisconsin-Green Bay
MIT License https://opensource.org/licenses/MIT
This is a classic and very widely used dataset in machine learning and statistics, often serving as a first dataset for classification problems. Introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems," it is a foundational resource for learning classification algorithms.
Overview:
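The dataset contains measurements for 150 samples of iris flowers. Each sample belongs to one of three species of iris: Iris setosa, Iris versicolor, and Iris virginica.
For each flower, four features were measured: sepal length, sepal width, petal length, and petal width.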
The goal is typically to build a model that can classify iris flowers into their correct species based on these four features.
File Structure:
The dataset is usually provided as a single CSV (Comma Separated Values) file, often named iris.csv or similar. This file typically contains one column for each of the four measurements and one column for the species label.
Content of the Data:
The dataset contains an equal number of samples (50) for each of the three iris species. The measurements of the sepal and petal dimensions vary between the species, allowing for their differentiation using machine learning models.
How to Use This Dataset:
iris.csv file.
Citation:
When using the Iris dataset, it is common to cite Ronald Fisher's original work:
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179-188.
Data Contribution:
Thank you for providing this classic and fundamental dataset to the Kaggle community. The Iris dataset remains an invaluable resource for both beginners learning the basics of classification and experienced practitioners testing new algorithms. Its simplicity and clear class separation make it an ideal starting point for many data science projects.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Research Domain:
The dataset is part of a project focused on retail sales forecasting. Specifically, it is designed to predict daily sales for Rossmann, a chain of over 3,000 drug stores operating across seven European countries. The project falls under the broader domain of time series analysis and machine learning applications for business optimization. The goal is to apply machine learning techniques to forecast future sales based on historical data, which includes factors like promotions, competition, holidays, and seasonal trends.
Purpose:
The primary purpose of this dataset is to help Rossmann store managers predict daily sales for up to six weeks in advance. By making accurate sales predictions, Rossmann can improve inventory management, staffing decisions, and promotional strategies. This dataset serves as a training set for machine learning models aimed at reducing forecasting errors and supporting decision-making processes across the company’s large network of stores.
How the Dataset Was Created:
The dataset was compiled from several sources, including historical sales data from Rossmann stores, promotional calendars, holiday schedules, and external factors such as competition. The data is split into multiple features, such as the store's location, promotion details, whether the store was open or closed, and weather information. The dataset is publicly available on platforms like Kaggle and was initially created for the Kaggle Rossmann Store Sales competition. The data is made accessible via an API for further analysis and modeling, and it is structured to help machine learning models predict future sales based on various input variables.
Dataset Structure:
The dataset consists of three main files, each with its specific role:
Train:
This file contains the historical sales data, which is used to train machine learning models. It includes daily sales information for each store, as well as various features that could influence the sales (e.g., promotions, holidays, store type, etc.).
https://handle.test.datacite.org/10.82556/yb6j-jw41
PID: b1c59499-9c6e-42c2-af8f-840181e809db
Test2:
The test dataset mirrors the structure of train.csv but does not include the actual sales values (i.e., the target variable). This file is used for making predictions using the trained machine learning models. It is used to evaluate the accuracy of predictions when the true sales data is unknown.
https://handle.test.datacite.org/10.82556/jerg-4b84
PID: 7cbb845c-21dd-4b60-b990-afa8754a0dd9
Store:
This file provides metadata about each store, including information such as the store’s location, type, and assortment level. This data is essential for understanding the context in which the sales data is gathered.
https://handle.test.datacite.org/10.82556/nqeg-gy34
PID: 9627ec46-4ee6-4969-b14a-bda555fe34db
Id: A unique identifier for each (Store, Date) combination within the test set.
Store: A unique identifier for each store.
Sales: The daily turnover (target variable) for each store on a specific day (this is what you are predicting).
Customers: The number of customers visiting the store on a given day.
Open: An indicator of whether the store was open (1 = open, 0 = closed).
StateHoliday: Indicates if the day is a state holiday, with values like:
'a' = public holiday,
'b' = Easter holiday,
'c' = Christmas,
'0' = no holiday.
SchoolHoliday: Indicates whether the store is affected by school closures (1 = yes, 0 = no).
StoreType: Differentiates between four types of stores: 'a', 'b', 'c', 'd'.
Assortment: Describes the level of product assortment in the store:
'a' = basic,
'b' = extra,
'c' = extended.
CompetitionDistance: Distance (in meters) to the nearest competitor store.
CompetitionOpenSince[Month/Year]: The month and year when the nearest competitor store opened.
Promo: Indicates whether the store is running a promotion on a particular day (1 = yes, 0 = no).
Promo2: Indicates whether the store is participating in Promo2, a continuing promotion for some stores (1 = participating, 0 = not participating).
Promo2Since[Year/Week]: The year and calendar week when the store started participating in Promo2.
PromoInterval: Describes the months when Promo2 is active, e.g., "Feb,May,Aug,Nov" means the promotion starts in February, May, August, and November.
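As a small illustration of how the Promo2 and PromoInterval fields fit together, the sketch below turns an interval string into month numbers and checks whether a given date falls in a Promo2 start month. The function names are hypothetical, and matching on the first three letters of each token is a defensive assumption in case a file spells a month as "Sept" rather than "Sep".

```python
from datetime import date

MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def promo_start_months(promo_interval):
    """Turn 'Feb,May,Aug,Nov' into [2, 5, 8, 11]; empty input gives []."""
    if not promo_interval:
        return []
    months = []
    for token in promo_interval.split(","):
        key = token.strip()[:3].title()   # 'Sept' -> 'Sep', 'feb' -> 'Feb'
        months.append(MONTH_ABBR.index(key) + 1)
    return months

def is_promo2_start_month(promo2, promo_interval, day):
    """True if the store participates in Promo2 and day is in a start month."""
    return promo2 == 1 and day.month in promo_start_months(promo_interval)

print(promo_start_months("Feb,May,Aug,Nov"))
print(is_promo2_start_month(1, "Feb,May,Aug,Nov", date(2015, 5, 10)))
```

A flag like this is one possible engineered feature; Promo2Since[Year/Week] would additionally be needed to check that the store had already joined Promo2 by that date.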
To work with this dataset, you will need to have specific software installed, including:
DBRepo Authorization: This is required to access the datasets via the DBRepo API. You may need to authenticate with an API key or login credentials to retrieve the datasets.
Python Libraries: Key libraries for working with the dataset include:
pandas for data manipulation,
numpy for numerical operations,
matplotlib and seaborn for data visualization,
scikit-learn for machine learning algorithms.
Several additional resources are available for working with the dataset:
Presentation:
A presentation summarizing the exploratory data analysis (EDA), feature engineering process, and key insights from the analysis is provided. This presentation also includes visualizations that help in understanding the dataset’s trends and relationships.
Jupyter Notebook:
A Jupyter notebook, titled Retail_Sales_Prediction_Capstone_Project.ipynb, is provided, which details the entire machine learning pipeline, from data loading and cleaning to model training and evaluation.
Model Evaluation Results:
The project includes a detailed evaluation of various machine learning models, including their performance metrics like training and testing scores, Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). This allows for a comparison of model effectiveness in forecasting sales.
Trained Models (.pkl files):
The models trained during the project are saved as .pkl files. These files contain the trained machine learning models (e.g., Random Forest, Linear Regression, etc.) that can be loaded and used to make predictions without retraining the models from scratch.
sample_submission.csv:
This file is a sample submission file that demonstrates the format of predictions expected when using the trained model. The sample_submission.csv file contains predictions made on the test dataset using the trained Random Forest model. It provides an example of how the output should be structured for submission.
These resources provide a comprehensive guide to implementing and analyzing the sales forecasting model, helping you understand the data, methods, and results in greater detail.
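The two error metrics named in the model evaluation results, MAPE and RMSE, can be sketched in plain Python as follows. The function names are hypothetical, and skipping zero-sales rows in MAPE is an assumption made here to avoid division by zero; the notebook may handle such rows differently.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent.

    Rows where the actual value is zero are skipped (an assumption of this
    sketch), because the percentage error is undefined there.
    """
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("no non-zero actual values to evaluate")
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


# Made-up numbers, for illustration only.
sales = [100.0, 200.0]
forecast = [110.0, 190.0]
print(rmse(sales, forecast), mape(sales, forecast))
```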
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
This dataset contains all the scripts used to carry out the uncertainty analysis for the maximum drawdown and time to maximum drawdown at the groundwater receptors in the Hunter bioregion and all the resulting posterior predictions. This is described in product 2.6.2 Groundwater numerical modelling (Herron et al. 2016). See History for a detailed explanation of the dataset contents.
References:
Herron N, Crosbie R, Peeters L, Marvanek S, Ramage A and Wilkins A (2016) Groundwater numerical modelling for the Hunter subregion. Product 2.6.2 for the Hunter subregion from the Northern Sydney Basin Bioregional Assessment. Department of the Environment, Bureau of Meteorology, CSIRO and Geoscience Australia, Australia.
This dataset uses the results of the design of experiment runs of the groundwater model of the Hunter subregion to train emulators to (a) constrain the prior parameter ensembles into the posterior parameter ensembles and to (b) generate the predictive posterior ensembles of maximum drawdown and time to maximum drawdown. This is described in product 2.6.2 Groundwater numerical modelling (Herron et al. 2016).
A flow chart of the way the various files and scripts interact is provided in HUN_GW_UA_Flowchart.png (editable version in HUN_GW_UA_Flowchart.gliffy).
R-script HUN_DoE_Parameters.R creates the set of parameters for the design of experiment in HUN_DoE_Parameters.csv. Each of these parameter combinations is evaluated with the groundwater model (dataset HUN GW Model v01). Associated with this spreadsheet is file HUN_GW_Parameters.csv. This file specifies, for each parameter, whether it is included in the sensitivity analysis, whether it is tied to another parameter, its initial value and range, its transformation, and the type of prior distribution with its mean and covariance structure.
The results of the design of experiment model runs are summarised in files HUN_GW_dmax_DoE_Predictions.csv, HUN_GW_tmax_DoE_Predictions.csv, HUN_GW_DoE_Observations.csv, HUN_GW_DoE_mean_BL_BF_hist.csv which have the maximum additional drawdown, the time to maximum additional drawdown for each receptor and the simulated equivalents to observed groundwater levels and SW-GW fluxes respectively. These are generated with post-processing scripts in dataset HUN GW Model v01 from the output (as exemplified in dataset HUN GW Model simulate ua999 pawsey v01).
Spreadsheets HUN_GW_dmax_Predictions.csv and HUN_GW_tmax_Predictions.csv capture additional information on each prediction; the name of the prediction, transformation, min, max and median of design of experiment, a boolean to indicate the prediction is to be included in the uncertainty analysis, the layer it is assigned to and which objective function to use to constrain the prediction.
Spreadsheet HUN_GW_Observations.csv has additional information on each observation: the name of the observation, a boolean to indicate whether to use the observation, the min and max of the design of experiment, a metadata statement describing the observation, the spatial coordinates, the observed value and the number of observations at this location (from dataset HUN bores v01). It also has the distance of each bore to the nearest blue line network and the distance to each prediction (both in km). Spreadsheet HUN_GW_mean_BL_BF_hist.csv has similar information, but on the SW-GW flux. The observed values are from dataset HUN Groundwater Flowrate Time Series v01.
These files are used in script HUN_GW_SI.py to generate sensitivity indices (based on the method of Plischke et al. (2013)) for each group of observations and predictions. These indices are saved in spreadsheets HUN_GW_dmax_SI.csv, HUN_GW_tmax_SI.csv, HUN_GW_hobs_SI.py, HUN_GW_mean_BF_hist_SI.csv.
Script HUN_GW_dmax_ObjFun.py calculates the objective function values for the design of experiment runs. Each prediction has a tailored objective function which is a weighted sum of the residuals between observations and predictions with weights based on the distance between observation and prediction. In addition to that there is an objective function for the baseflow rates. The results are stored in HUN_GW_DoE_ObjFun.csv and HUN_GW_ObjFun.csv.
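The idea of a prediction-specific, distance-weighted objective function can be illustrated with a small sketch. This is not the code in HUN_GW_dmax_ObjFun.py: the exponential weighting, the 10 km length scale, the use of squared residuals, and all function names are assumptions chosen only to make the concept concrete.

```python
import math


def distance_weight(distance_km, length_scale_km=10.0):
    """Weight that decays with distance between observation and prediction.

    The exponential form and the default length scale are assumptions.
    """
    return math.exp(-distance_km / length_scale_km)


def objective_function(observed, simulated, distances_km, length_scale_km=10.0):
    """Weighted sum of squared residuals for one prediction location.

    observed, simulated: values at each observation bore.
    distances_km: distance from each bore to the prediction location.
    """
    total = 0.0
    for obs, sim, dist in zip(observed, simulated, distances_km):
        residual = obs - sim
        total += distance_weight(dist, length_scale_km) * residual ** 2
    return total


# Two bores with the same residual: the nearer bore contributes more.
near = objective_function([10.0], [9.0], [1.0])
far = objective_function([10.0], [9.0], [50.0])
print(near, far)
```

Because each prediction has its own set of distances, the same model run receives a different objective function value for each prediction, which is what makes the function "tailored".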
The latter files are used in scripts HUN_GW_dmax_CreatePosteriorParameters.R to carry out the Monte Carlo sampling of the prior parameter distributions with the Approximate Bayesian Computation methodology as described in Herron et al. (2016) by generating and applying emulators for each objective function. The scripts use the scripts in dataset R-scripts for uncertainty analysis v01. These scripts are run on the high performance computation cluster machines with batch file HUN_GW_dmax_CreatePosterior.slurm. They result in posterior parameter combinations for each objective function, stored in directory PosteriorParameters, with filename convention HUN_GW_dmax_Posterior_Parameters_OO_$OFName$.csv where $OFName$ is the name of the objective function. Python script HUN_GW_PosteriorParameters_Percentiles.py summarizes these posterior parameter combinations and stores the results in HUN_GW_PosteriorParameters_Percentiles.csv.
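The general shape of Approximate Bayesian Computation by rejection sampling can be sketched as below. This is a toy illustration in Python, not a translation of the R scripts: the uniform prior, the one-parameter quadratic "emulator", the threshold, and the convention that smaller objective function values are better are all assumptions.

```python
import random


def abc_rejection(sample_prior, emulator, threshold, n_samples, seed=0):
    """Keep prior samples whose emulated objective function passes a threshold.

    sample_prior: callable taking a random.Random and returning a parameter.
    emulator: cheap stand-in for the full model, returning an objective value.
    Returns the accepted (posterior) samples and the acceptance rate.
    """
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_samples):
        theta = sample_prior(rng)
        if emulator(theta) <= threshold:  # assumed: lower value = better fit
            accepted.append(theta)
    return accepted, len(accepted) / n_samples


def toy_prior(rng):
    """Hypothetical uniform prior on a single parameter."""
    return rng.uniform(-2.0, 2.0)


def toy_emulator(theta):
    """Hypothetical emulator: misfit is smallest at theta = 0.5."""
    return (theta - 0.5) ** 2


posterior, rate = abc_rejection(toy_prior, toy_emulator, 0.25, 1000)
print(len(posterior), rate)
```

The emulator is what makes this affordable: each sample costs a function evaluation rather than a full groundwater model run.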
The same set of spreadsheets is used to test convergence of the emulator performance with script HUN_GW_emulator_convergence.R and batch file HUN_GW_emulator_convergence.slurm to produce spreadsheet HUN_GW_convergence_objfun_BF.csv.
The posterior parameter distributions are sampled with scripts HUN_GW_dmax_tmax_MCsampler.R and associated .slurm batch file. The scripts create and apply an emulator for each prediction. The emulators and results are stored in directory Emulators. This directory is not part of this dataset but can be regenerated by running the scripts on the high performance computation clusters. A single emulator and associated output is included for illustrative purposes.
Script HUN_GW_collate_predictions.csv collates all posterior predictive distributions in spreadsheets HUN_GW_dmax_PosteriorPredictions.csv and HUN_GW_tmax_PosteriorPredictions.csv. These files are further summarised in spreadsheet HUN_GW_dmax_tmax_excprob.csv with script HUN_GW_exc_prob. This spreadsheet contains for all predictions the coordinates, layer, number of samples in the posterior parameter distribution and the 5th, 50th and 95th percentile of dmax and tmax, the probability of exceeding 1 cm and 20 cm drawdown, the maximum dmax value from the design of experiment and the threshold of the objective function and the acceptance rate.
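The summary statistics listed for HUN_GW_dmax_tmax_excprob.csv (percentiles and exceedance probabilities) can be computed from a posterior sample along the following lines. This is a sketch only: the function names are hypothetical, drawdown is assumed to be expressed in metres, and the linear-interpolation percentile is one of several common definitions.

```python
def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def exceedance_probability(values, threshold):
    """Fraction of posterior samples strictly greater than the threshold."""
    return sum(1 for v in values if v > threshold) / len(values)


def summarise_dmax(dmax_m):
    """Summarise posterior dmax samples (assumed to be in metres)."""
    return {
        "p05": percentile(dmax_m, 5),
        "p50": percentile(dmax_m, 50),
        "p95": percentile(dmax_m, 95),
        "p_exceed_1cm": exceedance_probability(dmax_m, 0.01),
        "p_exceed_20cm": exceedance_probability(dmax_m, 0.20),
    }


# Made-up posterior sample for a single prediction, for illustration only.
summary = summarise_dmax([0.0, 0.005, 0.02, 0.1, 0.3])
print(summary)
```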
The script HUN_GW_dmax_tmax_MCsampler.R is also used to evaluate parameter distributions HUN_GW_dmax_Posterior_Parameters_HUN_OF_probe439.csv and HUN_GW_dmax_Posterior_Parameters_Mackie_OF_probe439.csv. These are two different parameter distributions for a single prediction, of which the latter represents local information. The corresponding dmax values are stored in HUN_GW_dmax_probe439_HUN.csv and HUN_GW_dmax_probe439_Mackie.csv.
Bioregional Assessment Programme (XXXX) HUN GW Uncertainty Analysis v01. Bioregional Assessment Derived Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/c25db039-5082-4dd6-bb9d-de7c37f6949a.
Derived From HUN GW Model code v01
Derived From Hydstra Groundwater Measurement Update - NSW Office of Water, Nov2013
Derived From Groundwater Economic Elements Hunter NSW 20150520 PersRem v02
Derived From NSW Office of Water - National Groundwater Information System 20140701
Derived From Travelling Stock Route Conservation Values
Derived From HUN GW Model v01
Derived From NSW Wetlands
Derived From Climate Change Corridors Coastal North East NSW
Derived From Communities of National Environmental Significance Database - RESTRICTED - Metadata only
Derived From Climate Change Corridors for Nandewar and New England Tablelands
Derived From National Groundwater Dependent Ecosystems (GDE) Atlas
Derived From Fauna Corridors for North East NSW
Derived From R-scripts for uncertainty analysis v01
Derived From Asset database for the Hunter subregion on 27 August 2015
Derived From Hunter CMA GDEs (DRAFT DPI pre-release)
Derived From Estuarine Macrophytes of Hunter Subregion NSW DPI Hunter 2004
Derived From Birds Australia - Important Bird Areas (IBA) 2009
Derived From [Camerons Gorge
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Companion DATA of the paper "Using social media and personality traits to assess software developers’ emotions" submitted to the IEEE Access journal, 2022.
The folders contain:
/analysis
  analyzed_tweets_by_psychologists.csv: file containing the manual analysis done by psychologists
  analyzed_tweets_by_participants.csv: file containing the manual analysis done by participants
  analyzed_tweets_by_psychologists_solved_divergencies.csv: file containing the manual analysis done by psychologists over 51 divergent tweets' classifications
/dataset
  alldata.json: contains the dataset used in the paper
/notebooks
  General - Charts.ipynb: notebook file containing all charts produced in the study, including those in the paper
  Statistics - Lexicons and Ensembles.ipynb: notebook file with the statistics for the five lexicons and ensembles used in the study
  Statistics - Linear Regression.ipynb: notebook file with the multiple linear regression results
  Statistics - Polynomial Regression: notebook file with the polynomial regression results
  Statistics - Psychologists versus Participants.ipynb: notebook file with the statistics comparing the psychologists' and participants' manual analyses
  Statistics - Working x Non-working.ipynb: notebook file containing the statistical analysis of the tweets posted during working hours and those posted outside working hours
/surveys
  Demographic_Survey.pdf: survey inviting participants to enroll in the study. We collected demographic data and participants' authorization to access their public Tweet posts
  Demographic_Survey_answers.xlsx: participants' demographic survey answers
  ibf_pt_br.doc: the Portuguese version of the Big Five Inventory (BFI) instrument to infer participants' Big Five polarity traits
  ibf_answers.xlsx: participants' and psychologists' answers for BFI
We have removed any sensitive data from the dataset and from the demographic survey answers to protect participants' privacy and anonymity.
This item contains data and code used in experiments that produced the results for Sadler et al. (2022) (see below for full reference). We ran five experiments for the analysis: Experiment A, Experiment B, Experiment C, Experiment D, and Experiment AuxIn. Experiment A tested multi-task learning for predicting streamflow with 25 years of training data and using a different model for each of 101 sites. Experiment B tested multi-task learning for predicting streamflow with 25 years of training data and using a single model for all 101 sites. Experiment C tested multi-task learning for predicting streamflow with just 2 years of training data. Experiment D tested multi-task learning for predicting water temperature with over 25 years of training data. Experiment AuxIn used water temperature as an input variable for predicting streamflow. These experiments and their results are described in detail in the WRR paper. Data from a total of 101 sites across the US were used for the experiments. The model input data and streamflow data were from the Catchment Attributes and Meteorology for Large-sample Studies (CAMELS) dataset (Newman et al. 2014, Addor et al. 2017). The water temperature data were gathered from the National Water Information System (NWIS) (U.S. Geological Survey, 2016). The contents of this item are broken into 13 files or groups of files aggregated into zip files:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
=====================================================================
=====================================================================
Authors: Trung-Nghia Le (1), Khanh-Duy Nguyen (2), Huy H. Nguyen (1), Junichi Yamagishi (1), Isao Echizen (1)
Affiliations: (1)National Institute of Informatics, Japan (2)University of Information Technology-VNUHCM, Vietnam
National Institute of Informatics Copyright (c) 2021
Emails: {ltnghia, nhhuy, jyamagis, iechizen}@nii.ac.jp, {khanhd}@uit.edu.vn
Arxiv: https://arxiv.org/abs/2111.12888 NII Face Mask Dataset v1.0: https://zenodo.org/record/5761725
=============================== INTRODUCTION ===============================
The NII Face Mask Dataset is the first large-scale dataset targeting mask-wearing ratio estimation in street cameras. This dataset contains 581,108 face annotations extracted from 18,088 video frames (1920x1080 pixels) in 17 street-view videos obtained from Rambalac's YouTube channel.
The videos were taken in multiple places, at various times, before and during the COVID-19 pandemic. The total length of the videos is approximately 56 hours.
=============================== REFERENCES ===============================
If you publish using any of the data in this dataset, please cite the following papers:
@article{Nguyen202112888,
  title={Effectiveness of Detection-based and Regression-based Approaches for Estimating Mask-Wearing Ratio},
  author={Nguyen, Khanh-Duy and Nguyen, Huy H and Le, Trung-Nghia and Yamagishi, Junichi and Echizen, Isao},
  archivePrefix={arXiv},
  arxivId={2111.12888},
  url={https://arxiv.org/abs/2111.12888},
  year={2021}
}
@INPROCEEDINGS{Nguyen2021EstMaskWearing,
  author={Nguyen, Khanh-Duy and Nguyen, Huy H. and Le, Trung-Nghia and Yamagishi, Junichi and Echizen, Isao},
  booktitle={2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021)},
  title={Effectiveness of Detection-based and Regression-based Approaches for Estimating Mask-Wearing Ratio},
  year={2021},
  pages={1-8},
  url={https://ieeexplore.ieee.org/document/9667046},
  doi={10.1109/FG52635.2021.9667046}
}
======================== DATA STRUCTURE ==================================
./NFM
├── dataset
│   ├── train.csv: annotations for the train set.
│   └── test.csv: annotations for the test set.
└── README_v1.0.md
The two CSV files (train.csv and test.csv) share the same structure and have the same columns:
<1st column>: video_id (a source video can be found by following the link: https://www.youtube.com/watch?v=)
<2nd column>: frame_id (the index of a frame extracted from the source video)
<3rd column>: timestamp in milliseconds (the timestamp of a frame extracted from the source video)
<4th column>: label (for each annotated face, one of three labels was attached with a bounding box: 'Mask'/'No-Mask'/'Unknown')
<5th column>: left
<6th column>: top
<7th column>: right
<8th column>: bottom
The four coordinates (left, top, right, bottom) denote a face's bounding box.
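Given the eight columns described above, a per-frame mask-wearing ratio could be derived roughly as follows. This is a minimal sketch under stated assumptions: the ratio is taken as Mask / (Mask + No-Mask) with 'Unknown' faces excluded, the column names are assigned by position, and the sample rows are made up for illustration rather than taken from the dataset.

```python
import csv
import io
from collections import defaultdict

# Column names follow the order given in the README; using them as
# positional field names is an assumption of this sketch.
COLUMNS = ["video_id", "frame_id", "timestamp", "label",
           "left", "top", "right", "bottom"]


def mask_ratio_per_frame(csv_file):
    """Return {(video_id, frame_id): Mask / (Mask + No-Mask)}.

    Faces labelled 'Unknown' are ignored (an assumed definition of the ratio).
    Any row whose label is not 'Mask' or 'No-Mask', such as a header row,
    is skipped as well.
    """
    counts = defaultdict(lambda: {"Mask": 0, "No-Mask": 0})
    for row in csv.DictReader(csv_file, fieldnames=COLUMNS):
        if row["label"] in ("Mask", "No-Mask"):
            counts[(row["video_id"], row["frame_id"])][row["label"]] += 1
    return {
        key: c["Mask"] / (c["Mask"] + c["No-Mask"])
        for key, c in counts.items()
    }


# Made-up rows in the documented column order, for illustration only.
sample = io.StringIO(
    "vid1,0,0,Mask,10,10,50,50\n"
    "vid1,0,0,Mask,60,10,100,50\n"
    "vid1,0,0,No-Mask,110,10,150,50\n"
    "vid1,0,0,Unknown,160,10,200,50\n"
    "vid1,30,1000,No-Mask,10,10,50,50\n"
)
ratios = mask_ratio_per_frame(sample)
print(ratios)
```

To run this on the real annotations, pass an open file handle for train.csv or test.csv instead of the in-memory sample.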
============================== COPYING ================================
This repository is made available under Creative Commons Attribution License (CC-BY).
Regarding Creative Commons License: Attribution 4.0 International (CC BY 4.0), please see https://creativecommons.org/licenses/by/4.0/
THIS DATABASE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS DATABASE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE
====================== ACKNOWLEDGEMENTS ================================
This research was partly supported by JSPS KAKENHI Grants (JP16H06302, JP18H04120, JP21H04907, JP20K23355, JP21K18023), and JST CREST Grants (JPMJCR20D3, JPMJCR18A6), Japan.
This dataset is based on Rambalac's YouTube channel: https://www.youtube.com/c/Rambalac
This dataset was created by APriyanka