The dataset is US Census data, an extraction of the 1994 census donated to the UC Irvine Machine Learning Repository. It contains approximately 32,000 observations with over 15 variables and was downloaded from http://archive.ics.uci.edu/ml/datasets/Adult. The dependent variable in our analysis is income level, i.e., whether a person earns above $50,000 a year. We use SQL queries, proportion analysis with bar charts, and a simple decision tree to identify the important variables and their influence on the prediction.
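As a hedged illustration of the decision-tree step (assuming pandas and scikit-learn are available and the Adult file has been downloaded locally as adult.data; the column names follow the UCI documentation):

```python
# Sketch: fit a shallow decision tree on the Adult data and inspect which
# variables drive the >$50K prediction. Path and parameters are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cols = ["age", "workclass", "fnlwgt", "education", "education_num",
        "marital_status", "occupation", "relationship", "race", "sex",
        "capital_gain", "capital_loss", "hours_per_week", "native_country",
        "income"]
df = pd.read_csv("adult.data", names=cols, skipinitialspace=True, na_values="?")

X = pd.get_dummies(df.drop(columns="income"))   # one-hot encode categoricals
y = (df["income"] == ">50K").astype(int)        # 1 if income above $50,000

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
importances = pd.Series(tree.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```

Keeping the tree shallow keeps the splits interpretable, which matches the goal of understanding which variables drive the above-$50,000 prediction.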
This dataset presents a rich collection of physicochemical parameters from 147 reservoirs distributed across the conterminous U.S. One hundred and eight of the reservoirs were selected using a statistical survey design and can provide unbiased inferences about the condition of all U.S. reservoirs. These data could be of interest to local water management specialists or to those assessing the ecological condition of reservoirs at the national scale. These data have been reviewed in accordance with U.S. Environmental Protection Agency policy and approved for publication. This dataset is not publicly accessible through this repository because it is too large. It can be accessed at: https://portal-s.edirepository.org/nis/mapbrowse?scope=edi&identifier=2033&revision=1. Format: This dataset presents water quality and related variables for 147 reservoirs distributed across the U.S. Water quality parameters were measured during the summers of 2016, 2018, and 2020–2023. Measurements include nutrient concentrations, algae abundance, dissolved oxygen concentration, and water temperature, among many others. The dataset includes links to other national- and global-scale datasets that provide additional variables.
This dataset is designed for beginners to practice regression problems, particularly in the context of predicting house prices. It contains 1,000 rows, with each row representing a house and various attributes that influence its price. The dataset is well-suited for learning basic to intermediate-level regression modeling techniques.
Beginner Regression Projects: This dataset can be used to practice building regression models such as Linear Regression, Decision Trees, or Random Forests. The target variable (house price) is continuous, making this an ideal problem for supervised learning techniques.
Feature Engineering Practice: Learners can create new features by combining existing ones, such as the price per square foot or age of the house, providing an opportunity to experiment with feature transformations.
Exploratory Data Analysis (EDA): You can explore how different features (e.g., square footage, number of bedrooms) correlate with the target variable, making it a great dataset for learning about data visualization and summary statistics.
Model Evaluation: The dataset allows for various model evaluation techniques such as cross-validation, R-squared, and Mean Absolute Error (MAE). These metrics can be used to compare the effectiveness of different models.
The dataset is highly versatile for a range of machine learning tasks. You can apply simple linear models to predict house prices based on one or two features, or use more complex models like Random Forest or Gradient Boosting Machines to understand interactions between variables.
It can also be used for dimensionality reduction techniques like PCA or to practice handling categorical variables (e.g., neighborhood quality) through encoding techniques like one-hot encoding.
This dataset is ideal for anyone wanting to gain practical experience in building regression models while working with real-world features.
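A minimal sketch of the workflow described above, assuming scikit-learn is available; the file name house_prices.csv and the column names (square_footage, bedrooms, year_built, price) are placeholders, not the dataset's confirmed schema:

```python
# Sketch: engineer a feature, fit two regressors, and compare them with
# cross-validated MAE and R-squared. Column names are assumed, not confirmed.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("house_prices.csv")
df["price_per_sqft"] = df["price"] / df["square_footage"]   # EDA helper only
df["house_age"] = 2024 - df["year_built"]                   # engineered feature

X = df[["square_footage", "bedrooms", "house_age"]]
y = df["price"]

for name, model in [("linear", LinearRegression()),
                    ("random forest", RandomForestRegressor(random_state=0))]:
    mae = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_absolute_error").mean()
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: MAE={mae:,.0f}  R2={r2:.3f}")
```

Note that price_per_sqft is derived from the target, so it is kept for exploratory plots only and deliberately excluded from the feature matrix to avoid leakage.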
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This paper explores a unique dataset of all the SET ratings provided by students of one university in Poland at the end of the winter semester of the 2020/2021 academic year. The SET questionnaire used by this university is presented in Appendix 1. The dataset is unique for several reasons. It covers all SET surveys filled in by students in all fields and levels of study offered by the university. In the period analysed, the university operated entirely online amid the Covid-19 pandemic. While the expected learning outcomes formally have not been changed, the online mode of study could have affected the grading policy and could have implications for some of the studied SET biases. This Covid-19 effect is captured by the econometric models and discussed in the paper. The average SET scores were matched with the characteristics of the teacher (degree, seniority, gender, and SET scores in the past six semesters); the course characteristics (time of day, day of the week, course type, course breadth, class duration, and class size); the attributes of the SET survey responses (the percentage of students providing SET feedback); and the grades of the course (mean, standard deviation, and percentage failed). Data on course grades are also available for the previous six semesters. This rich dataset allows many of the biases reported in the literature to be tested for and new hypotheses to be formulated, as presented in the introduction section. The unit of observation, i.e. a single row in the data set, is identified by three parameters: teacher unique id (j), course unique id (k), and the question number in the SET questionnaire (n ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9}). This means that for each pair (j, k) we have nine rows, one for each SET survey question, or sometimes fewer when students did not answer one of the SET questions at all. For example, the dependent variable SET_score_avg(j,k,n) for the triplet (j = John Smith, k = Calculus, n = 2) is calculated as the average of all Likert-scale answers to question no. 2 in the SET survey distributed to all students who took the Calculus course taught by John Smith. The data set has 8,015 such observations (rows). The full list of variables (columns) included in the analysis is presented in the attached files section. Their description refers to the triplet (teacher id = j, course id = k, question number = n). When the last element of the triplet (n) is dropped, the variable takes the same value for all n ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9}.
Two attachments:
- Word file with the variable descriptions
- Rdata file with the data set (for the R language)
Appendix 1. The SET questionnaire used for this paper.
Evaluation survey of the teaching staff of [university name]. Please complete the following evaluation form, which aims to assess the lecturer's performance. Only one answer should be indicated for each question. The answers are coded in the following way: 5 - I strongly agree; 4 - I agree; 3 - Neutral; 2 - I don't agree; 1 - I strongly don't agree.
1. I learnt a lot during the course.
2. I think that the knowledge acquired during the course is very useful.
3. The professor used activities to make the class more engaging.
4. If it was possible, I would enroll for the course conducted by this lecturer again.
5. The classes started on time.
6. The lecturer always used time efficiently.
7. The lecturer delivered the class content in an understandable and efficient way.
8. The lecturer was available when we had doubts.
9. The lecturer treated all students equally regardless of their race, background and ethnicity.
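For readers working from raw survey answers rather than the distributed averages, a hedged sketch of how SET_score_avg(j, k, n) could be reproduced; the long-format file raw_set_answers.csv and its column names are hypothetical:

```python
# Sketch: average Likert answers per (teacher, course, question) triplet,
# mirroring the definition of SET_score_avg(j, k, n) in the description above.
import pandas as pd

raw = pd.read_csv("raw_set_answers.csv")   # columns: teacher_id, course_id,
                                           #          question_no, answer (1-5)
set_score_avg = (raw
                 .groupby(["teacher_id", "course_id", "question_no"],
                          as_index=False)["answer"]
                 .mean()
                 .rename(columns={"answer": "SET_score_avg"}))

print(set_score_avg.head())   # one row per (j, k, n), as in the data set
```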
The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only include ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.
The full-population dataset (with about 10 million individuals) is also distributed as open data.
The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.
Household, Individual
The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.
ssd
The sample size was set to 8,000 households, with a fixed number of 25 households selected from each enumeration area. In the first stage, the number of enumeration areas to be selected in each stratum was calculated, proportionally to the size of each stratum (stratification by geo_1 and by urban/rural). In the second stage, 25 households were randomly selected within each selected enumeration area. The R script used to draw the sample is provided as an external resource.
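The distributed R script is the authoritative implementation; purely as an illustration of the same two-stage design, a Python sketch (the frame file name and column names are assumptions):

```python
# Sketch: allocate enumeration areas (EAs) to strata proportionally to stratum
# size, then draw 25 households from each selected EA. Illustrative only.
import pandas as pd

frame = pd.read_csv("household_frame.csv")   # assumed columns: hh_id, ea_id,
                                             #                  geo_1, urban_rural
frame["stratum"] = frame["geo_1"].astype(str) + "_" + frame["urban_rural"].astype(str)

n_ea_total = 8000 // 25                      # 320 EAs needed for 8,000 households
stratum_hh = frame.groupby("stratum")["hh_id"].count()
ea_alloc = (stratum_hh / stratum_hh.sum() * n_ea_total).round().astype(int)

parts = []
for stratum, n_ea in ea_alloc.items():
    # stage 1: sample EAs within the stratum
    eas = frame.loc[frame["stratum"] == stratum, "ea_id"].drop_duplicates()
    chosen = eas.sample(n=min(n_ea, len(eas)), random_state=1)
    hh_in_chosen = frame[frame["ea_id"].isin(chosen)]
    # stage 2: 25 households per selected EA (frame assumed to have >= 25 per EA)
    parts.append(hh_in_chosen.groupby("ea_id", group_keys=False)
                             .sample(n=25, random_state=1))

sample = pd.concat(parts)
print(len(sample))                           # approximately 8,000 households
```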
other
The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.
The synthetic data generation process included a set of "validators" (consistency checks based on which synthetic observations were assessed and rejected or replaced when needed). Some post-processing was then applied to the data to produce the distributed data files.
This is a synthetic dataset; the "response rate" is 100%.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset for linear regression with two independent variables and one dependent variable, focused on testing, visualization, and statistical analysis. The dataset is synthetic and contains 100 instances.
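A brief hedged sketch of one way to use the dataset (the file name regression_data.csv and the column names x1, x2, y are assumptions):

```python
# Sketch: fit y ~ x1 + x2 by ordinary least squares and inspect coefficients,
# p-values, and R-squared, matching the dataset's stated purpose.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("regression_data.csv")   # 100 rows: x1, x2, y (assumed names)
X = sm.add_constant(df[["x1", "x2"]])     # intercept plus two predictors
model = sm.OLS(df["y"], X).fit()

print(model.summary())                    # coefficients, t-tests, R-squared
```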
Cross-reference of dataset variables that have a denominator.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This layer includes the variables (by 2010 census block) used in the Neighborhood Change Index created by Data Driven Detroit in October 2018 for the Turning the Corner project. The final neighborhood change index was created using the average scores of five factors, which were made up of various combinations of these variables.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Dear Researcher,
Thank you for using this code and these datasets. This note explains how the CFTS code related to my paper "A clustering based forecasting algorithm for multivariable fuzzy time series using linear combinations of independent variables", published in Applied Soft Computing, works. All datasets mentioned in the paper are included along with the CFTS code. If you have any questions, feel free to contact me at: bas_salaraskari@yahoo.com or s_askari@aut.ac.ir
Regards,
S. Askari
Guidelines for the CFTS algorithm:
1. Open the file "CFTS Code" in MATLAB.
2. Enter or paste the name of the dataset you wish to simulate in line 5 after "load". This loads the dataset into the workspace.
3. Lines 6 and 7: "r" is the number of independent variables and "N" is the number of data vectors used for training.
4. Line 9: "C" is the number of clusters. You can use the optimal number of clusters given in Table 6 of the paper or your own preferred value.
5. If line 28 is commented, the covariance norm (Mahalanobis distance) is used; if it is uncommented, the identity norm (Euclidean distance) is used.
6. Press Ctrl+Enter to run the code.
7. For your own dataset, please arrange the data as described for the datasets in the MS Word file "Read Me".
Variables and data sources.
The U.S. Geological Survey (USGS) has developed and implemented an algorithm that identifies burned areas in temporally dense time series of Landsat image stacks to produce the Landsat Burned Area Essential Climate Variable (BAECV) products. The algorithm makes use of predictors derived from individual Landsat scenes, lagged reference conditions, and change metrics between the scene and reference conditions. Outputs of the BAECV algorithm consist of pixel-level burn probabilities for each Landsat scene, and annual burn probability, burn classification, and burn date composites. These products were generated for the conterminous United States for 1984 through 2015. These data are also available for download at https://gsc.cr.usgs.gov/outgoing/baecv/BAECV_CONUS_v1.1_2017/. Additional details about the algorithm used to generate these products are described in Hawbaker, T.J., Vanderhoof, M.K., Beal, Y.G., Takacs, J.D., Schmidt, G.L., Falgout, J.T., Williams, B., Brunner, N.M., Caldwell, M.K., Picotte, J.J., Howard, S.M., Stitt, S., and Dwyer, J.L., 2017. Mapping burned areas using dense time-series of Landsat data. Remote Sensing of Environment 198, 504–522. doi:10.1016/j.rse.2017.06.027. First release: 2017. Revised: September 2017 (ver. 1.1).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Description
This dataset contains a simulated collection of 10,000 patient records designed to explore hypertension management in resource-constrained settings. It provides comprehensive data for analyzing blood pressure control rates, associated risk factors, and complications. The dataset is ideal for predictive modeling, risk analysis, and treatment optimization, offering insights into demographic, clinical, and treatment-related variables.
Dataset Structure
Dataset Volume
• Size: 10,000 records.
• Features: 19 variables, categorized into Sociodemographic, Clinical, Complications, and Treatment/Control groups.
Variables and Categories
A. Sociodemographic Variables
1. Age:
• Continuous variable in years.
• Range: 18–80 years.
• Mean ± SD: 49.37 ± 12.81.
2. Sex:
• Categorical variable.
• Values: Male, Female.
3. Education:
• Categorical variable.
• Values: No Education, Primary, Secondary, Higher Secondary, Graduate, Post-Graduate, Madrasa.
4. Occupation:
• Categorical variable.
• Values: Service, Business, Agriculture, Retired, Unemployed, Housewife.
5. Monthly Income:
• Categorical variable in Bangladeshi Taka.
• Values: <5000, 5001–10000, 10001–15000, >15000.
6. Residence:
• Categorical variable.
• Values: Urban, Sub-urban, Rural.
B. Clinical Variables
7. Systolic BP:
• Continuous variable in mmHg.
• Range: 100–200 mmHg.
• Mean ± SD: 140 ± 15 mmHg.
8. Diastolic BP:
• Continuous variable in mmHg.
• Range: 60–120 mmHg.
• Mean ± SD: 90 ± 10 mmHg.
9. Elevated Creatinine:
• Binary variable (≥ 1.4 mg/dL).
• Values: Yes, No.
10. Diabetes Mellitus:
• Binary variable.
• Values: Yes, No.
11. Family History of CVD:
• Binary variable.
• Values: Yes, No.
12. Elevated Cholesterol:
• Binary variable (≥ 200 mg/dL).
• Values: Yes, No.
13. Smoking:
• Binary variable.
• Values: Yes, No.
C. Complications
14. LVH (Left Ventricular Hypertrophy):
• Binary variable (ECG diagnosis).
• Values: Yes, No.
15. IHD (Ischemic Heart Disease):
• Binary variable.
• Values: Yes, No.
16. CVD (Cerebrovascular Disease):
• Binary variable.
• Values: Yes, No.
17. Retinopathy:
• Binary variable.
• Values: Yes, No.
D. Treatment and Control
18. Treatment:
• Categorical variable indicating therapy type.
• Values: Single Drug, Combination Drugs.
19. Control Status:
• Binary variable.
• Values: Controlled, Uncontrolled.
Dataset Applications
1. Predictive Modeling:
• Develop models to predict blood pressure control status using demographic and clinical data.
2. Risk Analysis:
• Identify significant factors influencing hypertension control and complications.
3. Severity Scoring:
• Quantify hypertension severity for patient risk stratification.
4. Complications Prediction:
• Forecast complications like IHD, LVH, and CVD for early intervention.
5. Treatment Guidance:
• Analyze therapy efficacy to recommend optimal treatment strategies.
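As an illustration of the predictive-modeling application above, a hedged sketch; the file name hypertension.csv and the exact column spellings are assumptions based on the variable list:

```python
# Sketch: predict Control Status (Controlled vs Uncontrolled) from the
# sociodemographic and clinical variables listed above. Illustrative only.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("hypertension.csv")
y = (df["Control Status"] == "Controlled").astype(int)
X = pd.get_dummies(df.drop(columns="Control Status"), drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

# largest absolute coefficients suggest the most influential predictors
coefs = pd.Series(clf.coef_[0], index=X.columns)
print(coefs.abs().sort_values(ascending=False).head(10))
```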
https://spdx.org/licenses/CC0-1.0.html
This dataset contains simulated datasets, empirical data, and R scripts described in the paper: “Li, Q. and Kou, X. (2021) WiBB: An integrated method for quantifying the relative importance of predictive variables. Ecography (DOI: 10.1111/ecog.05651)”.
A fundamental goal of scientific research is to identify the underlying variables that govern crucial processes of a system. Here we proposed a new index, WiBB, which integrates the merits of several existing methods: a model-weighting method from information theory (Wi), a standardized regression coefficient method measured by β* (B), and the bootstrap resampling technique (B). We applied WiBB to simulated datasets with known correlation structures, for both linear models (LM) and generalized linear models (GLM), to evaluate its performance. We also applied two other methods, the relative sum of weights (SWi) and the standardized beta (β*), to evaluate their performance in ranking predictor importance under various scenarios, in comparison with the WiBB method. We also applied it to an empirical dataset in the plant genus Mimulus to select bioclimatic predictors of species' presence across the landscape. Results on the simulated datasets showed that the WiBB method outperformed the β* and SWi methods in scenarios with small and large sample sizes, respectively, and that the bootstrap resampling technique significantly improved the discriminant ability. When testing WiBB on the empirical dataset with GLM, it sensibly identified four important predictors with high credibility out of six candidates in modeling the geographical distributions of 71 Mimulus species. This integrated index has great advantages in evaluating predictor importance and hence reducing the dimensionality of data without losing interpretive power. The simplicity of calculating the new metric, compared with more sophisticated statistical procedures, makes it a handy addition to the statistical toolbox.
Methods: To simulate independent datasets (size = 1000), we adopted Galipaud et al.'s (2014) approach with custom modifications of the data.simulation function, which uses the multivariate normal distribution function rmvnorm in the R package mvtnorm (v1.0-5, Genz et al. 2016). Each dataset was simulated with a preset correlation structure between a response variable (y) and four predictors (x1, x2, x3, x4). The first three (genuine) predictors were set to be strongly, moderately, and weakly correlated with the response variable, respectively (denoted by large, medium, and small Pearson correlation coefficients, r), while the correlation between the response and the last (spurious) predictor was set to zero. We simulated datasets with three levels of differences between the correlation coefficients of consecutive predictors, ∆r = 0.1, 0.2, and 0.3. These three levels of ∆r resulted in three correlation structures between the response and the four predictors: (0.3, 0.2, 0.1, 0.0), (0.6, 0.4, 0.2, 0.0), and (0.8, 0.6, 0.3, 0.0). We repeated the simulation procedure 200 times for each of the three preset correlation structures (600 datasets in total) for later LM fitting. For GLM fitting, we modified the simulation procedure with additional steps, converting the continuous response into binary data O (e.g., occurrence data with 0 for absence and 1 for presence). We tested the WiBB method, along with two other methods, the relative sum of weights (SWi) and the standardized beta (β*), to evaluate their ability to correctly rank predictor importance under various scenarios. The empirical dataset of 71 Mimulus species was assembled from their occurrence coordinates and the corresponding values extracted from climatic layers of the WorldClim dataset (www.worldclim.org), and we applied the WiBB method to infer important predictors of their geographical distributions.
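As a hedged sketch (not the authors' data.simulation function, and simplified by leaving the predictors mutually uncorrelated), one way to generate data with a preset correlation structure between the response and four predictors:

```python
# Sketch: draw 1,000 observations of (y, x1..x4) from a multivariate normal
# whose correlation structure matches one scenario, r = (0.6, 0.4, 0.2, 0.0).
import numpy as np

rng = np.random.default_rng(42)
r = np.array([0.6, 0.4, 0.2, 0.0])    # cor(y, x1..x4); x4 is the spurious one

corr = np.eye(5)                      # first variable is the response y
corr[0, 1:] = r
corr[1:, 0] = r                       # valid correlation matrix: sum(r**2) < 1

data = rng.multivariate_normal(mean=np.zeros(5), cov=corr, size=1000)
y, X = data[:, 0], data[:, 1:]

# empirical correlations should be close to the preset structure
print(np.round(np.corrcoef(data, rowvar=False)[0, 1:], 2))
```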
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This collection consists of 5 structure learning datasets from the Bayesian Network Repository (Scutari, 2010).
Task: The dataset collection can be used to study causal discovery algorithms; a minimal independence-test sketch is given after the collection listing below.
Summary:
Missingness Statement: There are no missing values.
Collection:
The alarm dataset contains the following 37 variables:
The binary synthetic asia dataset:
The binary coronary dataset:
The hailfinder dataset contains the following 56 variables:
The lizards dataset contains the following 3 variables:
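As a minimal, hedged illustration of the kind of test that constraint-based causal discovery algorithms are built on (the file name asia.csv and the column names smoke, lung, bronc are placeholders for one of the binary datasets):

```python
# Sketch: chi-square tests of (conditional) independence between binary
# variables, the basic building block of constraint-based causal discovery.
import pandas as pd
from scipy.stats import chi2, chi2_contingency

df = pd.read_csv("asia.csv")

def indep_p(data, x, y):
    """P-value of a marginal chi-square independence test of x and y."""
    return chi2_contingency(pd.crosstab(data[x], data[y]))[1]

def cond_indep_p(data, x, y, z):
    """P-value of a conditional test of x and y given z (summed over strata)."""
    stat = dof = 0
    for _, part in data.groupby(z):
        table = pd.crosstab(part[x], part[y])
        if table.shape == (2, 2):
            s, _, d, _ = chi2_contingency(table)
            stat, dof = stat + s, dof + d
    return chi2.sf(stat, dof)

print("smoke indep. of lung?            p =", indep_p(df, "smoke", "lung"))
print("smoke indep. of bronc given lung? p =", cond_indep_p(df, "smoke", "bronc", "lung"))
```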
This file contains all of the cases and variables that are in the original 2018 General Social Survey, but it has been prepared for easier use in the classroom. Changes have been made in two areas. First, to avoid confusion when constructing tables or interpreting basic analyses, all missing data codes have been set to system missing. Second, many of the continuous variables have been categorized into fewer categories and added as additional variables to the file. The General Social Surveys (GSS) have been conducted by the National Opinion Research Center (NORC) annually since 1972, except for the years 1979, 1981, and 1992 (a supplement was added in 1992), and biennially beginning in 1994. The GSS are designed to be part of a program of social indicator research, replicating questionnaire items and wording in order to facilitate time-trend studies. To download syntax files for the GSS that reproduce well-known religious group recodes, including RELTRAD, please visit the ARDA's Syntax Repository.
The 2018 General Social Survey - Instructional Dataset has been updated as of June 2024. This release includes additional interview-specific variables and survey weights.
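For instructors preparing similar classroom files from other GSS releases, a hedged sketch of the two kinds of changes described above; the variable name, missing-data codes, and category cut points are hypothetical and differ by GSS variable:

```python
# Sketch: (1) set numeric missing-data codes to system missing, and
# (2) add a collapsed categorical version of a continuous variable.
import numpy as np
import pandas as pd

gss = pd.read_csv("gss2018.csv")                 # file name assumed

# 1. Treat documented missing codes as system missing (NaN); codes are illustrative.
gss["age"] = gss["age"].replace({98: np.nan, 99: np.nan})

# 2. Keep the original variable and add a categorized version alongside it.
gss["age_cat"] = pd.cut(gss["age"],
                        bins=[17, 29, 44, 64, 120],
                        labels=["18-29", "30-44", "45-64", "65+"])

print(gss[["age", "age_cat"]].head())
```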
Variable Message Signs (VMS) in York.
For further information about traffic management please visit the City of York Council website.
*Please note that the data published within this dataset is provided via a live API link to CYC's GIS server. Any changes made to the master copy of the data will be immediately reflected in the resources of this dataset. The date shown in the "Last Updated" field of each GIS resource reflects when the data was first published.
In principle, experiments offer a straightforward method for social scientists to accurately estimate causal effects. However, scholars often unwittingly distort treatment effect estimates by conditioning on variables that could be affected by their experimental manipulation. Typical examples include controlling for post-treatment variables in statistical models, eliminating observations based on post-treatment criteria, or subsetting the data based on post-treatment variables. Though these modeling choices are intended to address common problems encountered when conducting experiments, they can bias estimates of causal effects. Moreover, problems associated with conditioning on post-treatment variables remain largely unrecognized in the field, which we show frequently publishes experimental studies using these practices in our discipline's most prestigious journals. We demonstrate the severity of experimental post-treatment bias analytically and document the magnitude of the potential distortions it induces using visualizations and reanalyses of real-world data. We conclude by providing applied researchers with recommendations for best practice.
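A small simulation, not taken from the article's reanalyses, that reproduces the basic point: even with randomized treatment, subsetting on a variable affected by treatment distorts the estimated effect.

```python
# Illustration of post-treatment bias: treatment is randomized, m is a
# post-treatment variable, and y depends on treatment and an unobserved u.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

treat = rng.integers(0, 2, n)                       # randomized treatment
u = rng.normal(size=n)                              # unobserved driver of m and y
m = (treat + u + rng.normal(size=n) > 1)            # post-treatment variable
y = 1.0 * treat + 2.0 * u + rng.normal(size=n)      # true treatment effect = 1.0

# Unadjusted difference in means recovers roughly the true effect (~1.0).
naive = y[treat == 1].mean() - y[treat == 0].mean()

# "Controlling" for m by subsetting on it distorts the estimate.
sel = m == 1
biased = y[sel & (treat == 1)].mean() - y[sel & (treat == 0)].mean()

print(f"unadjusted estimate: {naive:.2f}")          # close to 1.0
print(f"within m == 1:       {biased:.2f}")         # attenuated / distorted
```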
The QoG Institute is an independent research institute within the Department of Political Science at the University of Gothenburg. In total, 30 researchers conduct and promote research on the causes, consequences and nature of Good Governance and the Quality of Government, that is, trustworthy, reliable, impartial, uncorrupted and competent government institutions.
The main objective of our research is to address the theoretical and empirical problem of how political institutions of high quality can be created and maintained. A second objective is to study the effects of Quality of Government on a number of policy areas, such as health, the environment, social policy, and poverty.
The dataset was created as part of a research project titled “Quality of Government and the Conditions for Sustainable Social Policy”. The aim of the dataset is to promote cross-national comparative research on social policy output and its correlates, with a special focus on the connection between social policy and Quality of Government (QoG).
The data comes in three versions: one cross-sectional dataset, and two cross-sectional time-series datasets for a selection of countries. The two combined datasets are called “long” (year 1946-2009) and “wide” (year 1970-2005).
The data contains six types of variables, each provided under its own heading in the codebook: Social policy variables, Tax system variables, Social Conditions, Public opinion data, Political indicators, Quality of government variables.
The QoG Social Policy Dataset can be downloaded from the Data Archive of the QoG Institute at http://qog.pol.gu.se/data/datadownloads/data-archive. Its variables are now included in QoG Standard.
Purpose:
The primary aim of QoG is to conduct and promote research on corruption. One aim of the QoG Institute is to make publicly available cross-national comparative data on QoG and its correlates. The aim of the QoG Social Policy Dataset is to promote cross-national comparative research on social policy output and its correlates, with a special focus on the connection between social policy and Quality of Government (QoG).
A cross-sectional dataset based on data from in and around 2002 from the QoG Social Policy dataset. If no data for 2002 were available for a variable, data from the closest available year were used, though not further back in time than 1995.
Samanni, Marcus, Jan Teorell, Staffan Kumlin, Stefan Dahlberg, Bo Rothstein, Sören Holmberg & Richard Svensson. 2012. The QoG Social Policy Dataset, version 4Apr12. University of Gothenburg: The Quality of Government Institute. http://www.qog.pol.gu.se
These are the variable codes for the datasets released as part of the 2020 decennial census redistricting data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset presented is part of the one used in the article "A 2-dimensional guillotine cutting stock problem with variable-sized stock for the honeycomb cardboard industry" by P. Terán-Viadero, A. Alonso-Ayuso and F. Javier Martín-Campo, published in International Journal of Production Research (2023), doi: 10.1080/00207543.2023.2279129. In the paper, two mathematical optimisation models are proposed for the cutting stock problem in the honeycomb cardboard sector. The problem arises at a Spanish company, and the proposed models have been tested on real orders received by the company, achieving a reduction of up to 50% in the leftover material generated. The dataset presented here includes six of the twenty cases used in the paper (the rest cannot be shared for confidentiality reasons). For each case, the characteristics of the order and the solutions obtained by the two models are provided for the different scenarios analysed in the paper.
*Version 1.1 contains the same data, renamed according to the instance names in the final version of the article.
*Version 1.2 adds the PDF with the accepted version of the article published in International Journal of Production Research (2023), doi: 10.1080/00207543.2023.2279129.