The variable selection problem in the context of Linear Regression for large databases is analysed. The problem consists in selecting a small subset of independent variables that can perform the prediction task optimally. This problem has a wide range of applications. One important type of application is the design of composite indicators in various areas (sociology and economics, for example). Other important applications of variable selection in linear regression can be found in fields such as chemometrics, genetics, and climate prediction, among many others. For this problem, we propose a Branch & Bound method. This is an exact method and therefore guarantees optimal solutions. We also provide strategies that enable this method to be applied in very large databases (with hundreds of thousands of cases) in a moderate computation time. A series of computational experiments shows that our method performs well compared with well-known methods in the literature and with commercial software.
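As a concrete illustration of the idea (not the authors' implementation), the following minimal Python sketch performs exact best-subset selection by branch and bound, pruning with the standard fact that adding regressors can never increase the residual sum of squares; all data and sizes in the example are synthetic, and an intercept is omitted for brevity.

```python
# Minimal, illustrative branch-and-bound best-subset selection for OLS.
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit on the given columns (no intercept)."""
    if not cols:
        return float(np.sum((y - y.mean()) ** 2))
    beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    resid = y - X[:, cols] @ beta
    return float(resid @ resid)

def best_subset(X, y, k):
    """Return (best_rss, best_cols) over all subsets of exactly k columns."""
    p = X.shape[1]
    best = {"rss": np.inf, "cols": None}

    def search(chosen, available):
        if len(chosen) == k:
            val = rss(X, y, chosen)
            if val < best["rss"]:
                best["rss"], best["cols"] = val, list(chosen)
            return
        if len(chosen) + len(available) < k:
            return  # not enough variables left to reach size k
        # Lower bound: RSS with every still-available variable included.
        if rss(X, y, chosen + available) >= best["rss"]:
            return  # no subset below this node can beat the incumbent
        v, rest = available[0], available[1:]
        search(chosen + [v], rest)   # branch: include v
        search(chosen, rest)         # branch: exclude v

    search([], list(range(p)))
    return best["rss"], best["cols"]

# Toy usage with synthetic data: the true model uses columns 1 and 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 2 * X[:, 1] - 3 * X[:, 4] + rng.normal(size=200)
print(best_subset(X, y, k=2))   # should recover columns [1, 4]
```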
This dataset was extracted from the academic sources below:
An Econometric Model of the Watermelon Market (Suits, 1955): https://www.jstor.org/stable/1233923?seq=1
Suits' watermelon model (Stewart, 2018): https://www.uvic.ca/socialsciences/economics/assets/docs/seminars/KenStewartBrownBagFeb28.pdf
Variable descriptions:
Year: Year (1930-1951)
log q (Q): Total number of watermelons available for harvest (millions)
log h (X): Watermelons harvested (millions)
log p (P): Average farm price of watermelons ($ per 1,000)
log pc (C): Average annual net farm receipts per pound of cotton (dollars)
log pv (T): Average farm price of vegetables (index)
log w (W): Farm wage rates in the South Atlantic States (index)
log n (N): US population (millions)
log(y/n) (Y/N): Per capita disposable income ($)
log pf (F): Railway freight costs for watermelons (index)
This data release contains input data and programs (scripts) used to estimate monthly water demand for retail customers of Providence Water, located in Providence, Rhode Island. Explanatory data and model outputs are from July 2014 through June 2021. Models of per capita (for single-family residential customers) or per connection (for multi-family residential, commercial, and industrial customers) water use were developed using multiple linear regression. The dependent variables, provided by Providence Water, are the monthly number of connections and gallons of water delivered to single- and multi-family residential, commercial, and industrial connections. Potential independent variables (from online sources) are climate variables (temperature and precipitation), economic statistics, and a drought statistic. Not all independent variables were used in all of the models. The data are provided in data tables and model files. The data table RIWaterUseVariableExplanation.csv describes the explanatory variables and their data sources. The data table ProvModelInputData.csv provides the monthly water-use data that are the dependent variables and the monthly climatic and economic data that are the independent variables. The data table DroughtInputData.csv provides the weekly U.S. drought monitor index values that were processed to formulate a potential independent variable. The R script model_water_use.R runs the models that predict water use. The other two R scripts (load_preprocess_input_data.R and model_water_use_functions.R) are not run explicitly but are called from the primary script model_water_use.R. Regression equations produced by the models can be used to predict water demand throughout Rhode Island.
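For readers who do not use R, here is a hedged Python sketch of the kind of multiple linear regression described above; the column names (gallons_per_capita, mean_temp, precip, drought_index) are hypothetical stand-ins rather than the actual variable names, which are documented in RIWaterUseVariableExplanation.csv.

```python
# Hedged sketch: monthly water use regressed on climate and drought predictors.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ProvModelInputData.csv")   # monthly observations
# Column names below are assumptions for illustration only.
model = smf.ols(
    "gallons_per_capita ~ mean_temp + precip + drought_index", data=df
).fit()
print(model.summary())                       # coefficients, R^2, p-values

# The fitted equation can then be applied to new monthly climate/economic
# data to predict water demand, as the data release describes.
```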
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 1 row and is filtered to the book The economics of low pay in Britain: a logistic regression approach. It features 10 columns, including number of authors, number of books, earliest publication date, and latest publication date.
https://creativecommons.org/publicdomain/zero/1.0/
Canada Per Capita Income Dataset: Contextualizing Economic Growth and Trends. This comprehensive dataset features per capita income data for Canada spanning multiple years, providing valuable insights into the country's economic progression. Sourced from reputable economic databases and governmental records, this dataset serves as a valuable resource for analysts, researchers, and policymakers. Inspired by the need for accessible and reliable economic data on Kaggle, this dataset aims to facilitate informed decision-making and foster a deeper understanding of Canada's income dynamics over time.
Computers are now involved in many economic transactions and can capture data associated with these transactions, which can then be manipulated and analyzed. Conventional statistical and econometric techniques such as regression often work well, but there are issues unique to big datasets that may require different tools. First, the sheer size of the data involved may require more powerful data manipulation tools. Second, we may have more potential predictors than appropriate for estimation, so we need to do some kind of variable selection. Third, large datasets may allow for more flexible relationships than simple linear models. Machine learning techniques such as decision trees, support vector machines, neural nets, deep learning, and so on may allow for more effective ways to model complex relationships. In this essay, I will describe a few of these tools for manipulating and analyzing big data. I believe that these methods have a lot to offer and should be more widely known and used by economists.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This material presents the results of a univariate gamma regression model for direct costs, which was the first stage of inferential analysis using the linear regression model, carried out to identify which variables could be interacting with total direct cost per capita. The first table shows these data and precedes the multivariate analysis described in the article. The second table shows a more detailed descriptive analysis of per capita direct costs according to the current drug use pattern (evaluated by ASSIST for alcohol, cannabis and cocaine/crack), including mean, standard deviation, minimum, maximum, first quartile, median, third quartile and the p value according to the Kruskal-Wallis test. These data refer to the article by Dr. Paula Becker and Dr. Denise Razzouk entitled "Relationships between age of onset of drug use, use pattern, and direct health costs in a sample of adults' drug dependents in treatment at a Brazilian community mental health service".
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Explore the intricacies of medical costs and healthcare expenses with our meticulously curated Medical Cost Dataset. This dataset offers valuable insights into the factors influencing medical charges, enabling researchers, analysts, and healthcare professionals to gain a deeper understanding of the dynamics within the healthcare industry.
Columns:
1. ID: A unique identifier assigned to each individual record, facilitating efficient data management and analysis.
2. Age: The age of the patient, providing a crucial demographic factor that often correlates with medical expenses.
3. Sex: The gender of the patient, offering insights into potential cost variations based on biological differences.
4. BMI: The Body Mass Index (BMI) of the patient, indicating the relative weight status and its potential impact on healthcare costs.
5. Children: The number of children or dependents covered under the medical insurance, influencing family-related medical expenses.
6. Smoker: A binary indicator of whether the patient is a smoker or not, as smoking habits can significantly impact healthcare costs.
7. Region: The geographic region of the patient, helping to understand regional disparities in healthcare expenditure.
8. Charges: The medical charges incurred by the patient, serving as the target variable for analysis and predictions.
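As one possible starting point (not part of the dataset itself), the following Python sketch fits an ordinary least squares model to the Charges column; the file name medical_cost.csv is an assumption, and the exact column spellings may differ from those listed above.

```python
# Illustrative sketch: linear regression of charges on the remaining columns,
# with one-hot encoding for the categorical fields.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("medical_cost.csv")          # assumed local file name
X = df[["Age", "Sex", "BMI", "Children", "Smoker", "Region"]]
y = df["Charges"]

pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Smoker", "Region"])],
    remainder="passthrough",                  # keep the numeric columns as-is
)
model = make_pipeline(pre, LinearRegression())

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```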
Whether you're aiming to uncover patterns in medical billing, predict future healthcare costs, or explore the relationships between different variables and charges, our Medical Cost Dataset provides a robust foundation for your research. Researchers can utilize this dataset to develop data-driven models that enhance the efficiency of healthcare resource allocation, insurers can refine pricing strategies, and policymakers can make informed decisions to improve the overall healthcare system.
Unlock the potential of healthcare data with our comprehensive Medical Cost Dataset. Gain insights, make informed decisions, and contribute to the advancement of healthcare economics and policy. Start your analysis today and pave the way for a healthier future.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Urban economic competitiveness is a fundamental indicator for assessing the level of urban development and serves as an effective approach for understanding regional disparities. Traditional economic competitiveness research that relies solely on traditional regression models and assumed relationships between features tends to fall short in fully exploring the intricate interrelationships and nonlinear associations among features. As a result, the study of urban economic disparities remains limited to a narrow range of urban features, which is insufficient for comprehending cities as complex systems. The ability of deep learning neural networks to automatically construct models of nonlinear relationships among complex features provides a new approach to research on this issue. In this study, a complex urban feature dataset comprising 1008 features was constructed based on statistical data from 283 prefecture-level cities in China. Employing a machine learning approach based on convolutional neural networks (CNN), a novel analytical model is constructed to capture the interrelationships among urban features, which is applied to achieve accurate classification of urban economic competitiveness. In addition, considering the limited number of samples in the dataset owing to the fixed number of cities, this study developed a data augmentation approach based on a deep convolutional generative adversarial network (DCGAN) to further enhance the accuracy and generalization ability of the model. The performance of the CNN classification model was effectively improved by adding the generated samples to the original sample dataset. This study provides a precise and stable analytical model for investigating disparities in regional development. At the same time, it offers a feasible solution to the limited sample size issue in the application of deep learning in urban research.
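The published model is not reproduced here, but a minimal PyTorch sketch of the general idea (a 1D convolutional classifier over a long urban feature vector) is shown below; the input width of 1008 features comes from the abstract, while the number of classes, layer sizes, and all other details are assumptions.

```python
# Sketch only: a small 1D CNN that classifies cities from a 1008-long feature vector.
import torch
import torch.nn as nn

class CityCNN(nn.Module):
    def __init__(self, n_features=1008, n_classes=4):   # n_classes is assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),      # collapse the length dimension
            nn.Flatten(),
            nn.Linear(32, n_classes),
        )

    def forward(self, x):                 # x: (batch, 1008)
        return self.net(x.unsqueeze(1))   # add a channel dimension

model = CityCNN()
x = torch.randn(8, 1008)                  # a dummy batch of 8 "cities"
logits = model(x)
print(logits.shape)                       # torch.Size([8, 4])
```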
Attribution-NonCommercial 3.0 (CC BY-NC 3.0) https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Description: This dataset contains historical economic data spanning from 1871 to 2024, used in Jaouad Karfali’s research on Economic Cycle Analysis with Numerical Time Cycles. The study aims to improve economic forecasting accuracy through the 9-year cycle model, which demonstrates superior predictive capabilities compared to traditional economic indicators.
Dataset Contents: The dataset includes a comprehensive range of economic indicators used in the research, such as:
USGDP_1871-2024.csv – U.S. Gross Domestic Product (GDP) data.
USCPI_cleaned.csv – U.S. Consumer Price Index (CPI), cleaned and processed.
USWAGE_1871-2024.csv – U.S. average wages data.
EXCHANGEGLOBAL_cleaned.csv – Global exchange rates for the U.S. dollar.
EXCHANGEPOUND_cleaned.csv – U.S. dollar to British pound exchange rates.
INTERESTRATE_1871-2024.csv – U.S. interest rate data.
UNRATE.csv – U.S. unemployment rate statistics.
POPTOTUSA647NWDB.csv – U.S. total population data.

Significance of the Data: This dataset serves as a foundation for a robust economic analysis of the U.S. economy over multiple decades. It was instrumental in testing the 9-year economic cycle model, which demonstrated an 85% accuracy rate in economic forecasting when compared to traditional models such as ARIMA and VAR.
Applications:
Economic Forecasting: Predicts a 1.5% decline in GDP in 2025, followed by a gradual recovery between 2026 and 2034.
Economic Stability Analysis: Used for comparing forecasts with estimates from institutions like the IMF and World Bank.
Academic and Institutional Research: Supports studies in economic cycles and long-term forecasting.

Source & Further Information: For more details on the methodology and research findings, refer to the full paper published on SSRN:
https://ssrn.com/author=7429208 https://orcid.org/0009-0002-9626-7289
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Over 44.7 million Americans carry student loan debt, with the total amount valued at approximately $1.31 trillion (Quarterly Report, 2019). As a result, consumer spending, a component of GDP, is stifled, which negatively impacts the economy (Frizell, 2014, p. 22). This study examined the relationship between student loan debt and the probability of a recession in the near future, as well as the effects of proposed student loan forgiveness policies, through the use of a created model. The Federal Reserve Bank of St. Louis's website (FRED) was used to extract data regarding total GDP per quarter and student loan debt per quarter ("Federal Reserve Economic Data," 2019). Through the combination of the student loan debt per quarter and total GDP per quarter datasets, the percentage of total GDP composed of student loan debt per quarter was calculated and fitted to a logistic curve. Future quarterly values for total GDP and the percentage of total GDP composed of student loan debt per quarter were found through Long Short-Term Memory (LSTM) models and Euler's method, respectively. Through the creation of a probability of recession index, the probability of recession per quarter was compared to the percentage of total GDP composed of student loan debt per quarter to construct an exponential regression model. Utilizing a primarily quantitative method of analysis, the percentage of total GDP composed of student loan debt per quarter was found to be strongly associated (p < 1.26696 × 10^-8) with the probability of recession per quarter (p(R)), with p(R) tending to peak as the percentage of total GDP composed of student loan debt per quarter strayed away from the carrying capacity of the logistic curve. Inputting the student loan forgiveness policies of potential congressional bills proposed by lawmakers showed that eliminating 49.7% and 36.7% of student loan debt would reduce the recession probabilities to 1.73545 × 10^-29% and 9.74474 × 10^-25%, respectively.
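The study's data and code are not reproduced here, but the following Python sketch with synthetic numbers illustrates the two numerical steps described above: fitting a logistic curve to the share of GDP composed of student loan debt, and projecting it forward with Euler's method applied to the logistic differential equation.

```python
# Illustrative sketch with synthetic data, not the study's actual series.
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, K, r, t0):
    """Standard logistic curve with carrying capacity K."""
    return K / (1.0 + np.exp(-r * (t - t0)))

# Hypothetical quarterly observations of debt as a share of GDP (fractions).
t_obs = np.arange(0, 40)
share_obs = logistic(t_obs, 0.08, 0.25, 20) + np.random.normal(0, 0.002, 40)

(K, r, t0), _ = curve_fit(logistic, t_obs, share_obs, p0=[0.1, 0.1, 20])

# Euler's method on the logistic ODE  ds/dt = r * s * (1 - s / K)
# to project the share forward quarter by quarter.
s, dt, future = share_obs[-1], 1.0, []
for _ in range(12):                      # project 12 quarters ahead
    s = s + dt * r * s * (1.0 - s / K)
    future.append(s)
print(np.round(future, 4))
```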
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Rice is a crucial crop in Sri Lanka, influencing both its agricultural and economic landscapes. This study delves into the complex interplay between economic indicators and rice production, aiming to uncover correlations and build prediction models using machine learning techniques. The dataset, spanning from 1960 to 2020, includes key economic variables such as GDP, inflation rate, manufacturing output, population, population growth rate, imports, arable land area, military expenditure, and rice production. The study’s findings reveal the significant influence of economic factors on rice production in Sri Lanka. Machine learning models, including Linear Regression, Support Vector Machines, Ensemble methods, and Gaussian Process Regression, demonstrate strong predictive accuracy in forecasting rice production based on economic indicators. These results underscore the importance of economic indicators in shaping rice production outcomes and highlight the potential of machine learning in predicting agricultural trends. The study suggests avenues for future research, such as exploring regional variations and refining models based on ongoing data collection.
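A hedged Python sketch of such a model comparison, using scikit-learn counterparts of the named methods, is shown below; the file name and column names are assumptions, not the study's actual data.

```python
# Sketch: comparing several regressors for predicting rice production
# from economic indicators via cross-validated R^2.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("sri_lanka_rice.csv")            # assumed file and columns
X = df[["gdp", "inflation", "manufacturing", "population", "imports"]]
y = df["rice_production"]

models = {
    "Linear Regression": LinearRegression(),
    "SVR": SVR(),
    "Random Forest (ensemble)": RandomForestRegressor(random_state=0),
    "Gaussian Process": GaussianProcessRegressor(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean CV R^2 = {scores.mean():.3f}")
```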
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book series. It has 1 row and is filtered to the book Behind the model: a constructive critique of economic modeling. It features 2 columns, including publication dates.
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Personal and economic well-being regression models based on effects of taxes and benefits data for the UK for the period April 2016 to March 2017
The data consist of two parts: Time trade-off (TTO) data with one row per TTO question (5 questions), and discrete choice experiment (DCE) data with one row per question (6 questions). The purpose of the data is the calculation of a Swedish value set for the capability-adjusted life years (CALY-SWE) instrument. To protect the privacy of the study participants and to comply with GDPR, access to the data is given upon request.
The data is provided in 4 .csv files with the names:
The first two files (tto.csv, dce.csv) contain the time trade-off (TTO) answers and discrete choice experiment (DCE) answers of participants. The latter two files (weight_final_model.csv, coefs_final_model.csv) contain the generated value set of CALY-SWE weights, and the pertaining coefficients of the main effects additive model.
Background:
CALY-SWE is a capability-based instrument for studying Quality of Life (QoL). It consists of 6 attributes (health, social relations, financial situation & housing, occupation, security, political & civil rights) and allows answers for each attribute on 3 levels (Agree, Agree partially, Do not agree). A configuration or state is one of the 3^6 = 729 possible situations that the instrument describes. Here, a config is denoted in the form xxxxxx, one x for each attribute in the order above. Each x is a digit corresponding to the level of the respective attribute, with 3 being the highest (Agree) and 1 being the lowest (Do not agree). For example, 222222 encodes a configuration with all attributes on level 2 (Partially agree). The purpose of this dataset is to support the publication of the CALY-SWE value set and to enable reproduction of the calculations (due to privacy concerns we abstain from publishing individual-level characteristics). A value set consists of values on the 0 to 1 scale for all 729 configurations, each of which represents a quality weighting where 1 is the highest capability-related QoL and 0 the lowest capability-related QoL.
The data contains answers to two types of questions: TTO and DCE.
In TTO questions, participants iteratively chose a number of years x between 1 and 10. A choice of x years indicates that living x years with full capability (state configuration 333333) is considered equivalent to living 10 years in the capability state that the TTO question describes. The answer on the 0 to 1 scale is then calculated as x/10. In the DCE questions, participants were given two states and chose the one they found to be better. We used a hybrid model with a linear regression component and a logit model component, with the coefficients linked through a multiplicative factor, to obtain the weights (weight_final_model.csv). Each weight is calculated as the constant plus the coefficients for the respective configuration. Coefficients for level 3 encode the difference to level 2, and coefficients for level 2 the difference to the constant. For example, the weight for 123112 is calculated as constant + socrel2 + finhou2 + finhou3 + polciv2 (no coefficients for health, occupation, and security are involved, as they are on level 1, which is captured in the constant/intercept).
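For illustration, the weight calculation described above can be reproduced from the published coefficients roughly as follows; the sketch assumes the intercept is stored under the name "constant" in coefs_final_model.csv, which should be checked against the actual file.

```python
# Sketch: recompute a CALY-SWE weight for a configuration string such as "123112".
import pandas as pd

ATTRS = ["health", "socrel", "finhou", "occu", "secu", "polciv"]

def weight(config, coefs):
    """Constant + level-2 coefficient (for levels 2 and 3) + level-3 coefficient (for level 3)."""
    total = coefs["constant"]          # intercept name is an assumption
    for attr, level in zip(ATTRS, config):
        if level in ("2", "3"):
            total += coefs[f"{attr}2"]   # difference of level 2 to the constant
        if level == "3":
            total += coefs[f"{attr}3"]   # difference of level 3 to level 2
    return total

coefs = pd.read_csv("coefs_final_model.csv").set_index("name")["value"].to_dict()
print(weight("123112", coefs))  # constant + socrel2 + finhou2 + finhou3 + polciv2
```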
To assess the quality of TTO answers, we calculated a score per participant that takes into account inconsistencies in answering the TTO question. We then excluded 20% of participants with the worst score to improve the TTO data quality and signal strength for the model (this is indicated by the 'included' variable in the TTO dataset). Details of the entire survey are described in the preprint “CALY-SWE value set: An integrated approach for a valuation study based on an online-administered TTO and DCE survey” by Meili et al. (2023). Please check this document for updated versions.
Ids have been randomized with preserved linkage between the DCE and TTO dataset.
Data files and variables:
Below is a description of the variables in each CSV file.

- tto.csv:
  config: 6 numbers representing the attribute levels.
  position: The number of the asked TTO question.
  tto_block: The design block of the TTO question.
  answer: The equivalence value indicated by the participant, ranging from 0.1 to 1 in steps of 0.1.
  included: Whether the answer was included in the data for the model used to generate the value set.
  id: Randomized id of the participant.

- dce.csv:
  config1: Configuration of the first state in the question.
  config2: Configuration of the second state in the question.
  position: The number of the asked DCE question.
  answer: Whether state 1 or 2 was preferred.
  id: Randomized id of the participant.

- weight_final_model.csv:
  config: 6 numbers representing the attribute levels.
  weight: The weight calculated with the final model.
  ciu: The upper 95% credible interval.
  cil: The lower 95% credible interval.

- coefs_final_model.csv:
  name: Name of the coefficient, composed of an abbreviation for the attribute and a level number (abbreviations in the same order as above: health, socrel, finhou, occu, secu, polciv).
  value: Continuous, weight on the 0 to 1 scale.
  ciu: The upper 95% credible interval.
  cil: The lower 95% credible interval.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the code, input sheets, set-up guide and documentation for the EVOLVE research project (https://evolveenergy.eu/) economic dispatch model of Great Britain. Within this research project, a novel modelling framework has been developed to quantify the potential benefit of including higher proportions of ocean energy within large-scale electricity systems. Economic dispatch modelling is utilised to model hourly supply-demand matching for a range of sensitivity runs, adjusting the proportion of ocean energy within the generation mix. The framework is applied to a 2030 case study of the power system of Great Britain, testing installed wave or tidal stream capacities ranging from 100 MW to 10 GW. This dataset contains all of the data, code and documentation required to run this economic dispatch model. The project results found that for all sensitivity runs, ocean energy increases renewable dispatch, reduces dispatch costs, reduces generation required from fossil fuels, reduces system carbon emissions, reduces price volatility, and captures higher market prices. The development of this model, and analysis of the model results, is described in detail in a journal paper (currently in press). A preprint of this paper is included within the folder. It can be referenced as: S. Pennock, D.R. Noble, Y. Verdanyan, T. Delahaye and H. Jeffrey (2023). 'A modelling framework to quantify the power system benefits from ocean energy deployments'. Applied Energy, Volume 347, 1 October 2023, 121413 ( https://doi.org/10.1016/j.apenergy.2023.121413 ).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the output dataset for the research publication "Socio-economic development drives solid waste management performance in cities: A global analysis using machine learning". It features:

- Metadata info used by the R codes
- Summary of results for two modelling approaches (machine learning: conditional random forest; and non-linear regression)
The independent-variables dataset analysed here refers to specific indicators of the WABI methodology (https://www.sciencedirect.com/science/article/pii/S0956053X14004905), which generates solid waste management and resource recovery profiles for cities. It was applied here to 40 cities around the world. The input data are available here: 10.5281/zenodo.7570174
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Boston House Prices-Advanced Regression Techniques’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/fedesoriano/the-boston-houseprice-data on 13 February 2022.
--- Dataset description provided by original source is as follows ---
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.
Input features in order:
1) CRIM: per capita crime rate by town
2) ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3) INDUS: proportion of non-retail business acres per town
4) CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise)
5) NOX: nitric oxides concentration (parts per 10 million) [parts/10M]
6) RM: average number of rooms per dwelling
7) AGE: proportion of owner-occupied units built prior to 1940
8) DIS: weighted distances to five Boston employment centres
9) RAD: index of accessibility to radial highways
10) TAX: full-value property-tax rate per $10,000 [$/10k]
11) PTRATIO: pupil-teacher ratio by town
12) B: the result of the equation B = 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
13) LSTAT: % lower status of the population

Output variable:
1) MEDV: Median value of owner-occupied homes in $1000's [k$]
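For illustration only, a minimal Python sketch of fitting an ordinary least squares model for MEDV from the features listed above is shown below; the local file name boston.csv is an assumption.

```python
# Sketch: baseline linear regression on the Boston house-price data.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("boston.csv")           # assumed local copy of the Kaggle CSV
X = df.drop(columns=["MEDV"])
y = df["MEDV"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", reg.score(X_test, y_test))
```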
StatLib - Carnegie Mellon University
Harrison, David & Rubinfeld, Daniel. (1978). Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5, 81-102. https://doi.org/10.1016/0095-0696(78)90006-2
Belsley, David A., Kuh, Edwin & Welsch, Roy E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. New York: Wiley.
--- Original source retains full ownership of the source dataset ---
These are the results obtained by conducting the experiment "Average Height of 19-year-old Males and Females and GDP per Capita in 2019 for 164 Countries". The CSV file contains the raw data produced by processing, filtering and merging the input datasets. There are two rows for each of the 164 countries. In both rows, the country name, country code and GDP per capita are given. However, one row contains the average height of 19-year-old males (indicated by the value 'Boys' in the 'Sex' column) whereas the other displays the average height of 19-year-old females (indicated by the value 'Girls'). Furthermore, there are two PNG files which display the regression plots for the average height of 19-year-old males and females, respectively. Note that the x-scale (for the GDP per capita) is logarithmic.

References:
- The World Bank, GDP per capita (current US$), Washington, DC: The World Bank, 2021. Accessed on: Apr. 13, 2021. [Online] Available: https://data.worldbank.org/indicator/NY.GDP.PCAP.CD
- NCD Risk Factor Collaboration, Height – Evolution of adult height over time, NCD Risk Factor Collaboration, 2021. Accessed on: Apr. 18, 2021. [Online] Available: https://ncdrisc.org/data-downloads-height.html (under "Country-specific data for all countries")
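The following Python sketch (not the released scripts) shows how a regression plot of this kind can be reproduced, with height regressed on log10(GDP per capita) for one sex; the file and column names are assumptions about the released CSV.

```python
# Sketch: height of 19-year-old males vs. GDP per capita on a logarithmic x-axis.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("height_gdp_2019.csv")           # assumed file name
boys = df[df["Sex"] == "Boys"]

x = np.log10(boys["GDP per capita"].to_numpy())   # assumed column name
y = boys["Height"].to_numpy()                     # assumed column name
slope, intercept = np.polyfit(x, y, 1)            # simple linear fit on the log scale

xs = np.linspace(x.min(), x.max(), 100)
plt.scatter(boys["GDP per capita"], y, s=10)
plt.plot(10 ** xs, slope * xs + intercept, color="red")
plt.xscale("log")
plt.xlabel("GDP per capita (current US$, log scale)")
plt.ylabel("Average height of 19-year-old males (cm)")
plt.show()
```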
In this project, we added transportation modes and representation of alternative fuel technologies to a computable general equilibrium (CGE) model (ADAGE), and illustrated the impact of these transportation sector details using AEO oil price scenarios. This dataset includes the model results supporting the publication, "Insights from adding transportation sector detail into an economy-wide model: The case of the ADAGE CGE model." The dataset includes 3 files. "Adage_oilprice_main_results.xlsx" shows the data presented in the results section of the paper. "Adage_oilprice_fixed_factor.xlsx" shows data from sensitivity scenarios presented in Appendix B of the paper. "Adage_oilprice_alternative_nesting.xlsx" shows data from sensitivity scenarios presented in Appendix C of the paper. Citation information for this dataset can be found in the EDG's Metadata Reference Information section and Data.gov's References section.