100+ datasets found
  1. Adult Data Set ( Census Income dataset)

    • kaggle.com
    zip
    Updated Mar 7, 2021
    Cite
    KritiDoneria (2021). Adult Data Set ( Census Income dataset) [Dataset]. https://www.kaggle.com/datasets/kritidoneria/adultdatasetxai
    Explore at:
    Available download formats: zip (481687 bytes)
    Dataset updated
    Mar 7, 2021
    Authors
    KritiDoneria
    Description

    Task: predict whether income exceeds $50K/yr based on census data.

    The dataset is US Census data, an extraction of the 1994 census donated to the UC Irvine Machine Learning Repository. It contains approximately 32,000 observations with over 15 variables, and was downloaded from http://archive.ics.uci.edu/ml/datasets/Adult. The dependent variable in this analysis is income level: whether a person earns above $50,000 a year. The analysis uses SQL queries, proportion analysis with bar charts, and a simple decision tree to understand the important variables and their influence on prediction.
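
    To make the decision-tree step concrete, here is a minimal sketch in Python. It assumes the raw UCI file and the standard UCI Adult column names; the Kaggle copy may ship with different headers.

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Standard UCI Adult schema; verify against the actual file before use.
    cols = ["age", "workclass", "fnlwgt", "education", "education_num",
            "marital_status", "occupation", "relationship", "race", "sex",
            "capital_gain", "capital_loss", "hours_per_week", "native_country",
            "income"]
    df = pd.read_csv("adult.data", names=cols, skipinitialspace=True)

    X = pd.get_dummies(df.drop(columns="income"))   # one-hot encode categoricals
    y = (df["income"] == ">50K").astype(int)        # 1 if income exceeds $50K/yr

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
    print("accuracy:", tree.score(X_te, y_te))

    # Feature importances highlight the most influential variables.
    print(pd.Series(tree.feature_importances_, index=X.columns).nlargest(5))
    ```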

  2. A Dataset of Water Quality and Related Variables in U.S. Reservoirs

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Jun 13, 2025
    Cite
    U.S. EPA Office of Research and Development (ORD) (2025). A Dataset of Water Quality and Related Variables in U.S. Reservoirs [Dataset]. https://catalog.data.gov/dataset/a-dataset-of-water-quality-and-related-variables-in-u-s-reservoirs
    Explore at:
    Dataset updated
    Jun 13, 2025
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Area covered
    United States
    Description

    This dataset presents a rich collection of physicochemical parameters from 147 reservoirs distributed across the conterminous U.S. One hundred and eight of the reservoirs were selected using a statistical survey design and can provide unbiased inferences about the condition of all U.S. reservoirs. These data could be of interest to local water management specialists or to those assessing the ecological condition of reservoirs at the national scale. These data have been reviewed in accordance with U.S. Environmental Protection Agency policy and approved for publication. This dataset is not publicly accessible from this catalog because it is too large; it can be accessed through the following means: https://portal-s.edirepository.org/nis/mapbrowse?scope=edi&identifier=2033&revision=1

    Format: This dataset presents water quality and related variables for 147 reservoirs distributed across the U.S. Water quality parameters were measured during the summers of 2016, 2018, and 2020–2023. Measurements include nutrient concentration, algae abundance, dissolved oxygen concentration, and water temperature, among many others. The dataset includes links to other national- and global-scale datasets that provide additional variables.

  3. House Price Regression Dataset

    • kaggle.com
    zip
    Updated Sep 6, 2024
    Cite
    Prokshitha Polemoni (2024). House Price Regression Dataset [Dataset]. https://www.kaggle.com/datasets/prokshitha/home-value-insights
    Explore at:
    Available download formats: zip (27045 bytes)
    Dataset updated
    Sep 6, 2024
    Authors
    Prokshitha Polemoni
    Description

    Home Value Insights: A Beginner's Regression Dataset

    This dataset is designed for beginners to practice regression problems, particularly in the context of predicting house prices. It contains 1000 rows, with each row representing a house and various attributes that influence its price. The dataset is well-suited for learning basic to intermediate-level regression modeling techniques.

    Features:

    1. Square_Footage: The size of the house in square feet. Larger homes typically have higher prices.
    2. Num_Bedrooms: The number of bedrooms in the house. More bedrooms generally increase the value of a home.
    3. Num_Bathrooms: The number of bathrooms in the house. Houses with more bathrooms are typically priced higher.
    4. Year_Built: The year the house was built. Older houses may be priced lower due to wear and tear.
    5. Lot_Size: The size of the lot the house is built on, measured in acres. Larger lots tend to add value to a property.
    6. Garage_Size: The number of cars that can fit in the garage. Houses with larger garages are usually more expensive.
    7. Neighborhood_Quality: A rating of the neighborhood’s quality on a scale of 1-10, where 10 indicates a high-quality neighborhood. Better neighborhoods usually command higher prices.
    8. House_Price (Target Variable): The price of the house, which is the dependent variable you aim to predict.

    Potential Uses:

    1. Beginner Regression Projects: This dataset can be used to practice building regression models such as Linear Regression, Decision Trees, or Random Forests. The target variable (house price) is continuous, making this an ideal problem for supervised learning techniques.

    2. Feature Engineering Practice: Learners can create new features by combining existing ones, such as the price per square foot or age of the house, providing an opportunity to experiment with feature transformations.

    3. Exploratory Data Analysis (EDA): You can explore how different features (e.g., square footage, number of bedrooms) correlate with the target variable, making it a great dataset for learning about data visualization and summary statistics.

    4. Model Evaluation: The dataset allows for various model evaluation techniques such as cross-validation, R-squared, and Mean Absolute Error (MAE). These metrics can be used to compare the effectiveness of different models (a minimal sketch appears at the end of this entry).

    Versatility:

    • The dataset is highly versatile for a range of machine learning tasks. You can apply simple linear models to predict house prices based on one or two features, or use more complex models like Random Forest or Gradient Boosting Machines to understand interactions between variables.

    • It can also be used for dimensionality reduction techniques like PCA or to practice handling categorical variables (e.g., neighborhood quality) through encoding techniques like one-hot encoding.

    • This dataset is ideal for anyone wanting to gain practical experience in building regression models while working with real-world features.
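
    As a minimal sketch of the workflow described above, assuming the feature names listed and a hypothetical CSV filename:

    ```python
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("home_value_insights.csv")  # filename is an assumption

    features = ["Square_Footage", "Num_Bedrooms", "Num_Bathrooms", "Year_Built",
                "Lot_Size", "Garage_Size", "Neighborhood_Quality"]
    X, y = df[features], df["House_Price"]

    # 5-fold cross-validated MAE and R-squared for a linear regression baseline.
    mae = -cross_val_score(LinearRegression(), X, y, cv=5,
                           scoring="neg_mean_absolute_error").mean()
    r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean()
    print(f"MAE: {mae:,.0f}  R^2: {r2:.3f}")

    # Feature-engineering examples from the description: house age and
    # price per square foot as derived columns.
    df["House_Age"] = 2024 - df["Year_Built"]
    df["Price_Per_SqFt"] = df["House_Price"] / df["Square_Footage"]
    ```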

  4. University SET data, with faculty and courses characteristics

    • openicpsr.org
    Updated Sep 12, 2021
    + more versions
    Cite
    Under blind review in refereed journal (2021). University SET data, with faculty and courses characteristics [Dataset]. http://doi.org/10.3886/E149801V1
    Explore at:
    Dataset updated
    Sep 12, 2021
    Authors
    Under blind review in refereed journal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This paper explores a unique dataset of all the SET ratings provided by students of one university in Poland at the end of the winter semester of the 2020/2021 academic year. The SET questionnaire used by this university is presented in Appendix 1. The dataset is unique for several reasons. It covers all SET surveys filled in by students in all fields and levels of study offered by the university. In the period analysed, the university operated entirely online amid the Covid-19 pandemic. While the expected learning outcomes formally were not changed, the online mode of study could have affected the grading policy and could have implications for some of the studied SET biases. This Covid-19 effect is captured by econometric models and discussed in the paper.

    The average SET scores were matched with teacher characteristics (degree, seniority, gender, and SET scores in the past six semesters); course characteristics (time of day, day of the week, course type, course breadth, class duration, and class size); attributes of the SET survey responses (the percentage of students providing SET feedback); and course grades (mean, standard deviation, and percentage failed). Data on course grades are also available for the previous six semesters. This rich dataset allows many of the biases reported in the literature to be tested for and new hypotheses to be formulated, as presented in the introduction section.

    The unit of observation, i.e. a single row in the data set, is identified by three parameters: teacher unique id (j), course unique id (k), and the question number in the SET questionnaire (n ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9}). This means that for each pair (j, k) we have nine rows, one for each SET survey question, or sometimes fewer when no student answered a given SET question at all. For example, the dependent variable SET_score_avg(j, k, n) for the triplet (j = John Smith, k = Calculus, n = 2) is calculated as the average of all Likert-scale answers to question no. 2 in the SET survey distributed to all students who took the Calculus course taught by John Smith. The data set has 8,015 such observations or rows. The full list of variables or columns in the data set included in the analysis is presented in the attached file section. Their descriptions refer to the triplet (teacher id = j, course id = k, question number = n). When the last value of the triplet (n) is dropped, the variable takes the same value for all n ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9}.

    Two attachments:
    - Word file with variable descriptions
    - Rdata file with the data set (for the R language)

    Appendix 1. The SET questionnaire used for this paper: evaluation survey of the teaching staff of [university name]. Please complete the following evaluation form, which aims to assess the lecturer's performance. Only one answer should be indicated for each question. The answers are coded in the following way: 5 - I strongly agree; 4 - I agree; 3 - Neutral; 2 - I don't agree; 1 - I strongly don't agree.

    1. I learnt a lot during the course.
    2. I think that the knowledge acquired during the course is very useful.
    3. The professor used activities to make the class more engaging.
    4. If it was possible, I would enroll for the course conducted by this lecturer again.
    5. The classes started on time.
    6. The lecturer always used time efficiently.
    7. The lecturer delivered the class content in an understandable and efficient way.
    8. The lecturer was available when we had doubts.
    9. The lecturer treated all students equally regardless of their race, background and ethnicity.
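
    To illustrate the layout, a minimal pandas sketch (column names are hypothetical) that reconstructs SET_score_avg(j, k, n) by averaging the Likert answers per (teacher, course, question):

    ```python
    import pandas as pd

    # Hypothetical long table of individual Likert answers (1-5): one row per
    # student response to one SET question.
    answers = pd.DataFrame({
        "teacher_id": ["John Smith"] * 4,
        "course_id":  ["Calculus"] * 4,
        "question_n": [2, 2, 2, 2],
        "likert":     [5, 4, 4, 3],
    })

    # SET_score_avg(j, k, n): mean Likert answer per (teacher, course, question).
    set_score_avg = (
        answers.groupby(["teacher_id", "course_id", "question_n"])["likert"]
               .mean()
               .rename("SET_score_avg")
               .reset_index()
    )
    print(set_score_avg)
    ```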

  5. Synthetic Data for an Imaginary Country, Sample, 2023 - World

    • microdata.worldbank.org
    • nada-demo.ihsn.org
    Updated Jul 7, 2023
    + more versions
    Cite
    Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
    Explore at:
    Dataset updated
    Jul 7, 2023
    Dataset authored and provided by
    Development Data Group, Data Analytics Unit
    Time period covered
    2023
    Area covered
    World
    Description

    Abstract

    The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only include ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.

    The full-population dataset (with about 10 million individuals) is also distributed as open data.

    Geographic coverage

    The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

    Analysis unit

    Household, Individual

    Universe

    The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

    Kind of data

    Sample survey data (ssd)

    Sampling procedure

    The sample size was set to 8,000 households, with a fixed 25 households selected from each enumeration area. In the first stage, the number of enumeration areas to be selected in each stratum was calculated proportionally to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
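
    A minimal sketch of this two-stage design in Python (the original uses an R script; frame sizes and column names here are assumptions):

    ```python
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)

    # Hypothetical frames; strata combine geo_1 (province) and urban/rural.
    eas = pd.DataFrame({
        "ea_id": np.arange(400),
        "stratum": rng.choice(["p1-urban", "p1-rural", "p2-urban", "p2-rural"], 400),
    })
    households = pd.DataFrame({
        "hh_id": np.arange(40_000),
        "ea_id": rng.integers(0, 400, size=40_000),
    })

    HH_PER_EA = 25
    n_eas = 8_000 // HH_PER_EA  # 320 enumeration areas in total

    # Stage 1: EAs per stratum, proportional to stratum size (rounding may
    # shift the total slightly; real designs control for this).
    alloc = (eas["stratum"].value_counts(normalize=True) * n_eas).round().astype(int)
    chosen = pd.concat(
        eas[eas["stratum"] == s].sample(n=k, random_state=1)
        for s, k in alloc.items()
    )

    # Stage 2: 25 households drawn at random within each selected EA.
    sample = (
        households[households["ea_id"].isin(chosen["ea_id"])]
        .groupby("ea_id", group_keys=False)
        .apply(lambda g: g.sample(n=min(HH_PER_EA, len(g)), random_state=1))
    )
    print(len(chosen), "EAs,", len(sample), "households")
    ```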

    Mode of data collection

    other

    Research instrument

    The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.

    Cleaning operations

    The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observations were assessed and rejected/replaced when needed). Some post-processing was also applied to the data to produce the distributed data files.

    Response rate

    This is a synthetic dataset; the "response rate" is 100%.

  6. Dataset for Linear Regression with 2 IV and 1 DV

    • kaggle.com
    zip
    Updated Mar 25, 2025
    Cite
    Stable Space (2025). Dataset for Linear Regression with 2 IV and 1 DV [Dataset]. https://www.kaggle.com/datasets/sharmajicoder/dataset-for-linear-regression-with-2-iv-and-1-dv
    Explore at:
    Available download formats: zip (9351 bytes)
    Dataset updated
    Mar 25, 2025
    Authors
    Stable Space
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset for Linear Regression with two Independent variables and one Dependent variable. Focused on Testing, Visualization and Statistical Analysis. The dataset is synthetic and contains 100 instances.
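
    A minimal sketch for fitting and inspecting such a model (the column names x1, x2, and y are assumptions about the CSV):

    ```python
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("linear_regression_2iv_1dv.csv")  # filename assumed

    X = sm.add_constant(df[["x1", "x2"]])  # two independent variables + intercept
    model = sm.OLS(df["y"], X).fit()
    print(model.summary())                 # coefficients, R-squared, p-values
    ```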

  7. Dataset Variables (CV)

    • usc.data.socrata.com
    • splitgraph.com
    csv, xlsx, xml
    Updated Nov 26, 2018
    + more versions
    Cite
    (2018). Dataset Variables (CV) [Dataset]. https://usc.data.socrata.com/Riverside-Coachella-Valley/Dataset-Variables-for-Riverside-Coachella-Valley/ime8-mqha
    Explore at:
    Available download formats: xml, xlsx, csv
    Dataset updated
    Nov 26, 2018
    Description

    Cross-reference of dataset variables that have a denominator.

  8. Neighborhood Change Index Variables 20181010

    • data.ferndalemi.gov
    • detroitdata.org
    • +5more
    Updated Oct 10, 2018
    + more versions
    Cite
    Data Driven Detroit (2018). Neighborhood Change Index Variables 20181010 [Dataset]. https://data.ferndalemi.gov/maps/D3::neighborhood-change-index-variables-20181010
    Explore at:
    Dataset updated
    Oct 10, 2018
    Dataset authored and provided by
    Data Driven Detroit
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This layer includes the variables (by 2010 census block) used in the Neighborhood Change Index created by Data Driven Detroit in October 2018 for the Turning the Corner project. The final neighborhood change index was created using the average scores of five factors, which were made up of various combinations of these variables.

  9. Data from: A clustering based forecasting algorithm for multivariable fuzzy time series using linear combinations of independent variables

    • data.mendeley.com
    Updated Oct 31, 2016
    + more versions
    Cite
    Salar Askari Lasaki (2016). A clustering based forecasting algorithm for multivariable fuzzy time series using linear combinations of independent variables [Dataset]. http://doi.org/10.17632/35fw8pb6s9.1
    Explore at:
    Dataset updated
    Oct 31, 2016
    Authors
    Salar Askari Lasaki
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Dear Researcher,

    Thank you for using this code and these datasets. This note explains how the CFTS code accompanying my paper "A clustering based forecasting algorithm for multivariable fuzzy time series using linear combinations of independent variables", published in Applied Soft Computing, works. All datasets mentioned in the paper are included along with the CFTS code. If there are any questions, feel free to contact me at: bas_salaraskari@yahoo.com or s_askari@aut.ac.ir

    Regards,

    S. Askari

    Guidelines for the CFTS algorithm:
    1. Open the file "CFTS Code" in MATLAB.
    2. Enter or paste the name of the dataset you wish to simulate in line 5 after "load". This loads the dataset into the workspace.
    3. Lines 6 and 7: "r" is the number of independent variables and "N" is the number of data vectors used for training.
    4. Line 9: "C" is the number of clusters. You can use the optimal number of clusters given in Table 6 of the paper or your own preferred value.
    5. If line 28 is commented out, the covariance norm (Mahalanobis distance) is used; if it is uncommented, the identity norm (Euclidean distance) is used.
    6. Press Ctrl+Enter to run the code.
    7. For your own dataset, please arrange the data as described for the datasets in the MS Word file "Read Me".

  10. Variables and data sources.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Mar 4, 2022
    + more versions
    Cite
    Lu, Yue; Li, Jian; Yang, Siying (2022). Variables and data sources. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000444597
    Explore at:
    Dataset updated
    Mar 4, 2022
    Authors
    Lu, Yue; Li, Jian; Yang, Siying
    Description

    Variables and data sources.

  11. Data from: Landsat Burned Area Essential Climate Variable products for the conterminous United States (1984 - 2015)

    • catalog.data.gov
    • datasets.ai
    • +1more
    Updated Nov 26, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Landsat Burned Area Essential Climate Variable products for the conterminous United States (1984 - 2015) [Dataset]. https://catalog.data.gov/dataset/landsat-burned-area-essential-climate-variable-products-for-the-conterminous-united-s-1984
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    United States, Contiguous United States
    Description

    The U.S. Geological Survey (USGS) has developed and implemented an algorithm that identifies burned areas in temporally-dense time series of Landsat image stacks to produce the Landsat Burned Area Essential Climate Variable (BAECV) products. The algorithm makes use of predictors derived from individual Landsat scenes, lagged reference conditions, and change metrics between the scene and reference conditions. Outputs of the BAECV algorithm consist of pixel-level burn probabilities for each Landsat scene, and annual burn probability, burn classification, and burn date composites. These products were generated for the conterminous United States for 1984 through 2015. These data are also available for download at https://gsc.cr.usgs.gov/outgoing/baecv/BAECV_CONUS_v1.1_2017/

    Additional details about the algorithm used to generate these products are described in: Hawbaker, T.J., Vanderhoof, M.K., Beal, Y.G., Takacs, J.D., Schmidt, G.L., Falgout, J.T., Williams, B., Brunner, N.M., Caldwell, M.K., Picotte, J.J., Howard, S.M., Stitt, S., and Dwyer, J.L., 2017. Mapping burned areas using dense time-series of Landsat data. Remote Sensing of Environment 198, 504–522. doi:10.1016/j.rse.2017.06.027

    First release: 2017. Revised: September 2017 (ver. 1.1)
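
    To illustrate how per-scene burn probabilities roll up into annual composites, a minimal numpy sketch on synthetic data (the 0.5 threshold is a placeholder, not the BAECV operational value):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic stack of per-scene, pixel-level burn probabilities for one year:
    # (n_scenes, height, width). Real BAECV probabilities come per Landsat scene.
    probs = rng.random((12, 100, 100)).astype(np.float32)

    annual_prob = probs.max(axis=0)        # annual burn probability composite
    annual_burned = annual_prob >= 0.5     # burn classification (placeholder cut)
    burn_scene = probs.argmax(axis=0)      # scene index standing in for burn date

    print(f"{annual_burned.mean():.1%} of pixels classified as burned")
    ```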

  12. Bridging the Gap in Hypertension Management: Evaluating Blood Pressure Control and Associated Risk Factors in a Resource-Constrained Setting

    • data.mendeley.com
    Updated Jan 15, 2025
    + more versions
    Cite
    abu sufian (2025). Bridging the Gap in Hypertension Management: Evaluating Blood Pressure Control and Associated Risk Factors in a Resource-Constrained Setting [Dataset]. http://doi.org/10.17632/56jyjndvcr.1
    Explore at:
    Dataset updated
    Jan 15, 2025
    Authors
    abu sufian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    This dataset contains a simulated collection of 10,000 patient records designed to explore hypertension management in resource-constrained settings. It provides comprehensive data for analyzing blood pressure control rates, associated risk factors, and complications. The dataset is ideal for predictive modelling, risk analysis, and treatment optimization, offering insights into demographic, clinical, and treatment-related variables.

    Dataset Structure

    1. Dataset Volume

      • Size: 10,000 records.
      • Features: 19 variables, categorized into Sociodemographic, Clinical, Complications, and Treatment/Control groups.

    2. Variables and Categories

    A. Sociodemographic Variables

    1. Age:
    •  Continuous variable in years.
    •  Range: 18–80 years.
    •  Mean ± SD: 49.37 ± 12.81.
    2. Sex:
    •  Categorical variable.
    •  Values: Male, Female.
    3. Education:
    •  Categorical variable.
    •  Values: No Education, Primary, Secondary, Higher Secondary, Graduate, Post-Graduate, Madrasa.
    4. Occupation:
    •  Categorical variable.
    •  Values: Service, Business, Agriculture, Retired, Unemployed, Housewife.
    5. Monthly Income:
    •  Categorical variable in Bangladeshi Taka.
    •  Values: <5000, 5001–10000, 10001–15000, >15000.
    6. Residence:
    •  Categorical variable.
    •  Values: Urban, Sub-urban, Rural.
    

    B. Clinical Variables

    7. Systolic BP:
    •  Continuous variable in mmHg.
    •  Range: 100–200 mmHg.
    •  Mean ± SD: 140 ± 15 mmHg.
    8. Diastolic BP:
    •  Continuous variable in mmHg.
    •  Range: 60–120 mmHg.
    •  Mean ± SD: 90 ± 10 mmHg.
    9. Elevated Creatinine:
    •  Binary variable (≥ 1.4 mg/dL).
    •  Values: Yes, No.
    10. Diabetes Mellitus:
    •  Binary variable.
    •  Values: Yes, No.
    11. Family History of CVD:
    •  Binary variable.
    •  Values: Yes, No.
    12. Elevated Cholesterol:
    •  Binary variable (≥ 200 mg/dL).
    •  Values: Yes, No.
    13. Smoking:
    •  Binary variable.
    •  Values: Yes, No.
    

    C. Complications

    14. LVH (Left Ventricular Hypertrophy):
    •  Binary variable (ECG diagnosis).
    •  Values: Yes, No.
    15. IHD (Ischemic Heart Disease):
    •  Binary variable.
    •  Values: Yes, No.
    16. CVD (Cerebrovascular Disease):
    •  Binary variable.
    •  Values: Yes, No.
    17. Retinopathy:
    •  Binary variable.
    •  Values: Yes, No.
    

    D. Treatment and Control

    18. Treatment:
    •  Categorical variable indicating therapy type.
    •  Values: Single Drug, Combination Drugs.
    19. Control Status:
    •  Binary variable.
    •  Values: Controlled, Uncontrolled.
    

    Dataset Applications

    1. Predictive Modeling:
    •  Develop models to predict blood pressure control status using demographic and clinical data (a minimal sketch follows this list).
    2. Risk Analysis:
    •  Identify significant factors influencing hypertension control and complications.
    3. Severity Scoring:
    •  Quantify hypertension severity for patient risk stratification.
    4. Complications Prediction:
    •  Forecast complications like IHD, LVH, and CVD for early intervention.
    5. Treatment Guidance:
    •  Analyze therapy efficacy to recommend optimal treatment strategies.
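
    A minimal sketch for the predictive-modelling application (column names mirror the variable list above but are assumptions about the file):

    ```python
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("hypertension.csv")  # filename assumed

    X = pd.get_dummies(df[["Age", "Sex", "Systolic_BP", "Diastolic_BP",
                           "Diabetes_Mellitus", "Smoking", "Treatment"]],
                       drop_first=True)
    y = (df["Control_Status"] == "Controlled").astype(int)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(classification_report(y_te, clf.predict(X_te)))
    ```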
    
  13. Data from: WiBB: An integrated method for quantifying the relative importance of predictive variables

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Aug 20, 2021
    Cite
    Qin Li; Xiaojun Kou (2021). WiBB: An integrated method for quantifying the relative importance of predictive variables [Dataset]. http://doi.org/10.5061/dryad.xsj3tx9g1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Aug 20, 2021
    Dataset provided by
    Beijing Normal University
    Field Museum of Natural History
    Authors
    Qin Li; Xiaojun Kou
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    This dataset contains simulated datasets, empirical data, and R scripts described in the paper: “Li, Q. and Kou, X. (2021) WiBB: An integrated method for quantifying the relative importance of predictive variables. Ecography (DOI: 10.1111/ecog.05651)”.

    A fundamental goal of scientific research is to identify the underlying variables that govern crucial processes of a system. Here we proposed a new index, WiBB, which integrates the merits of several existing methods: a model-weighting method from information theory (Wi), a standardized regression coefficient method measured by ß* (B), and bootstrap resampling technique (B). We applied the WiBB in simulated datasets with known correlation structures, for both linear models (LM) and generalized linear models (GLM), to evaluate its performance. We also applied two other methods, relative sum of wight (SWi), and standardized beta (ß*), to evaluate their performance in comparison with the WiBB method on ranking predictor importances under various scenarios. We also applied it to an empirical dataset in a plant genus Mimulus to select bioclimatic predictors of species’ presence across the landscape. Results in the simulated datasets showed that the WiBB method outperformed the ß* and SWi methods in scenarios with small and large sample sizes, respectively, and that the bootstrap resampling technique significantly improved the discriminant ability. When testing WiBB in the empirical dataset with GLM, it sensibly identified four important predictors with high credibility out of six candidates in modeling geographical distributions of 71 Mimulus species. This integrated index has great advantages in evaluating predictor importance and hence reducing the dimensionality of data, without losing interpretive power. The simplicity of calculation of the new metric over more sophisticated statistical procedures, makes it a handy method in the statistical toolbox.

    Methods: To simulate independent datasets (size = 1000), we adopted Galipaud et al.'s (2014) approach with custom modifications of the data.simulation function, which used the multivariate normal distribution function rmvnorm in the R package mvtnorm (v1.0-5, Genz et al. 2016). Each dataset was simulated with a preset correlation structure between a response variable (y) and four predictors (x1, x2, x3, x4). The first three (genuine) predictors were set to be strongly, moderately, and weakly correlated with the response variable, respectively (denoted by large, medium, and small Pearson correlation coefficients, r), while the correlation between the response and the last (spurious) predictor was set to zero. We simulated datasets with three levels of differences between the correlation coefficients of consecutive predictors, ∆r = 0.1, 0.2, 0.3. These three levels of ∆r resulted in three correlation structures between the response and the four predictors: (0.3, 0.2, 0.1, 0.0), (0.6, 0.4, 0.2, 0.0), and (0.8, 0.6, 0.3, 0.0). We repeated the simulation procedure 200 times for each of the three preset correlation structures (600 datasets in total) for later LM fitting. For GLM fitting, we modified the simulation procedure with additional steps, in which we converted the continuous response into binary data O (e.g., occurrence data having 0 for absence and 1 for presence). We tested the WiBB method, along with two other methods, the relative sum of weights (SWi) and the standardized beta (β*), to evaluate the ability to correctly rank predictor importance under various scenarios. The empirical dataset of 71 Mimulus species was assembled from occurrence coordinates and corresponding values extracted from climatic layers of the WorldClim dataset (www.worldclim.org), and we applied the WiBB method to infer important predictors of their geographical distributions.
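
    A minimal Python analogue of the simulation step (the paper uses rmvnorm in R; mutually uncorrelated predictors are a simplifying assumption here):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Preset correlation structure from the description: r(y, x1..x4) with
    # delta-r = 0.2, i.e. (0.6, 0.4, 0.2, 0.0).
    r = np.array([0.6, 0.4, 0.2, 0.0])

    # Joint covariance of (y, x1..x4): unit variances; predictors mutually
    # uncorrelated (a simplification; the paper presets the full structure).
    cov = np.eye(5)
    cov[0, 1:] = cov[1:, 0] = r

    data = rng.multivariate_normal(np.zeros(5), cov, size=1000)
    y, X = data[:, 0], data[:, 1:]

    # Realized correlations should be close to the preset (0.6, 0.4, 0.2, 0.0).
    print([round(float(np.corrcoef(y, X[:, j])[0, 1]), 2) for j in range(4)])
    ```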

  14. bnlearn datasets

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 29, 2025
    Cite
    Zenodo (2025). bnlearn datasets [Dataset]. http://doi.org/10.5281/zenodo.7676616
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 29, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This collection consists of 5 structure learning datasets from the Bayesian Network Repository (Scutari, 2010).

    Task: The dataset collection can be used to study causal discovery algorithms (a minimal structure-learning sketch appears at the end of this entry).

    Summary:

    • Size of collection: 5 datasets with 3 - 56 columns of various sizes
    • Task: Causal Discovery
    • Data Type: Discrete
    • Dataset Scope: Collection
    • Ground Truth: Known / Estimated
    • Temporal Structure: No
    • License: TBD
    • Missing Values: No

    Missingness Statement: There are no missing values.

    Collection:

    The alarm dataset contains the following 37 variables:

    • CVP (central venous pressure): a three-level factor with levels LOW, NORMAL and HIGH.
    • PCWP (pulmonary capillary wedge pressure): a three-level factor with levels LOW, NORMAL and HIGH.
    • HIST (history): a two-level factor with levels TRUE and FALSE.
    • TPR (total peripheral resistance): a three-level factor with levels LOW, NORMAL and HIGH.
    • ... (33 more variables, see the corresponding .html file)

    The binary synthetic asia dataset:

    • D (dyspnoea), a two-level factor with levels yes and no.
    • T (tuberculosis), a two-level factor with levels yes and no.
    • L (lung cancer), a two-level factor with levels yes and no.
    • B (bronchitis), a two-level factor with levels yes and no.
    • A (visit to Asia), a two-level factor with levels yes and no.
    • S (smoking), a two-level factor with levels yes and no.
    • X (chest X-ray), a two-level factor with levels yes and no.
    • E (tuberculosis versus lung cancer/bronchitis), a two-level factor with levels yes and no.

    The binary coronary dataset:

    • Smoking (smoking): a two-level factor with levels no and yes.
    • M. Work (strenuous mental work): a two-level factor with levels no and yes.
    • P. Work (strenuous physical work): a two-level factor with levels no and yes.
    • Pressure (systolic blood pressure): a two-level factor with levels <140 and >140.
    • Proteins (ratio of beta and alpha lipoproteins): a two-level factor with levels <3 and >3.
    • Family (family anamnesis of coronary heart disease): a two-level factor with levels neg and pos.

    The hailfinder dataset contains the following 56 variables:

    • N07muVerMo (10.7mu vertical motion): a four-level factor with levels StrongUp, WeakUp, Neutral and Down.
    • SubjVertMo (subjective judgment of vertical motion): a four-level factor with levels StrongUp, WeakUp, Neutral and Down.
    • QGVertMotion (quasigeostrophic vertical motion): a four-level factor with levels StrongUp, WeakUp, Neutral and Down.
    • CombVerMo (combined vertical motion): a four-level factor with levels StrongUp, WeakUp, Neutral and Down.
    • AreaMesoALS (area of meso-alpha): a four-level factor with levels StrongUp, WeakUp, Neutral and Down.
    • SatContMoist (satellite contribution to moisture): a four-level factor with levels VeryWet, Wet, Neutral and Dry.
    • ... (49 more variables, see the corresponding .html file)

    The lizards dataset contains the following 3 variables:

    • Species (the species of the lizard): a two-level factor with levels Sagrei and Distichus.
    • Height (perch height): a two-level factor with levels high (greater than 4.75 feet) and low (lesser or equal to 4.75 feet).
    • Diameter (perch diameter): a two-level factor with levels narrow (greater than 4 inches) and wide (lesser or equal to 4 inches).
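
    Since the collection targets causal discovery, here is a minimal structure-learning sketch with the Python library pgmpy (one option among many; the API shown matches pgmpy 0.1.x, and the CSV export of the asia data is an assumption):

    ```python
    import pandas as pd
    from pgmpy.estimators import HillClimbSearch, BicScore  # pgmpy 0.1.x API

    # Discrete yes/no columns D, T, L, B, A, S, X, E, as listed for asia above.
    data = pd.read_csv("asia.csv")

    # Score-based structure search: hill climbing over DAGs using the BIC score.
    dag = HillClimbSearch(data).estimate(scoring_method=BicScore(data))
    print(sorted(dag.edges()))  # learned edges to compare with the known graph
    ```
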
  15. General Social Survey, 2018 - Instructional Dataset

    • thearda.com
    Updated 2018
    Cite
    Tom W. Smith (2018). General Social Survey, 2018 - Instructional Dataset [Dataset]. http://doi.org/10.17605/OSF.IO/7FVZG
    Explore at:
    Dataset updated
    2018
    Dataset provided by
    Association of Religion Data Archives
    Authors
    Tom W. Smith
    Dataset funded by
    National Science Foundation
    Description

    This file contains all of the cases and variables that are in the original 2018 General Social Survey, but is prepared for easier use in the classroom. Changes have been made in two areas. First, to avoid confusion when constructing tables or interpreting basic analysis, all missing data codes have been set to system missing. Second, many of the continuous variables have been categorized into fewer categories, and added as additional variables to the file. The General Social Surveys (GSS) have been conducted by the National Opinion Research Center (NORC) annually since 1972, except for the years 1979, 1981, and 1992 (a supplement was added in 1992), and biennially beginning in 1994. The GSS are designed to be part of a program of social indicator research, replicating questionnaire items and wording in order to facilitate time-trend studies. To download syntax files for the GSS that reproduce well-known religious group recodes, including RELTRAD, please visit the ARDA's Syntax Repository.

    The 2018 General Social Survey - Instructional Dataset has been updated as of June 2024. This release includes additional interview-specific variables and survey weights.
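
    The two classroom preparations described above are easy to replicate in pandas; a minimal sketch (missing-data codes, column names, and cut points are illustrative stand-ins, not the official GSS codebook values):

    ```python
    import numpy as np
    import pandas as pd

    df = pd.read_csv("gss2018.csv")  # filename assumed

    # 1. Set missing-data codes to system missing (NaN); the column
    #    "hrs_worked" and the codes {98, 99} are hypothetical.
    df["hrs_worked"] = df["hrs_worked"].where(~df["hrs_worked"].isin([98, 99]), np.nan)

    # 2. Categorize a continuous variable into fewer categories, added as a
    #    new variable alongside the original.
    df["age4"] = pd.cut(df["age"], bins=[17, 34, 49, 64, 120],
                        labels=["18-34", "35-49", "50-64", "65+"])
    ```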

  16. Data from: Variable Message Signs

    • data.europa.eu
    • data.wu.ac.at
    csv, geojson, kml
    Updated Oct 11, 2021
    + more versions
    Cite
    City of York Council (2021). Variable Message Signs [Dataset]. https://data.europa.eu/data/datasets/variable-message-signs
    Explore at:
    Available download formats: geojson, csv, kml
    Dataset updated
    Oct 11, 2021
    Dataset authored and provided by
    City of York Council
    Description

    Variable Message Signs (VMS) in York.

    For further information about traffic management please visit the City of York Council website.

    Please note that the data published within this dataset is a live API link to CYC's GIS server. Any changes made to the master copy of the data will be immediately reflected in the resources of this dataset. The date shown in the "Last Updated" field of each GIS resource reflects when the data was first published.

  17. Replication Data for: How Conditioning on Posttreatment Variables Can Ruin Your Experiment and What to Do about It

    • datasetcatalog.nlm.nih.gov
    • dataverse.harvard.edu
    • +1more
    Updated Feb 20, 2018
    Cite
    Nyhan, Brendan; Montgomery, Jacob M.; Torres, Michelle (2018). Replication Data for: How Conditioning on Posttreatment Variables Can Ruin Your Experiment and What to Do about It [Dataset]. http://doi.org/10.7910/DVN/EZSJ1S
    Explore at:
    Dataset updated
    Feb 20, 2018
    Authors
    Nyhan, Brendan; Montgomery, Jacob M.; Torres, Michelle
    Description

    In principle, experiments offer a straightforward method for social scientists to accurately estimate causal effects. However, scholars often unwittingly distort treatment effect estimates by conditioning on variables that could be affected by their experimental manipulation. Typical examples include controlling for post-treatment variables in statistical models, eliminating observations based on post-treatment criteria, or subsetting the data based on post-treatment variables. Though these modeling choices are intended to address common problems encountered when conducting experiments, they can bias estimates of causal effects. Moreover, problems associated with conditioning on post-treatment variables remain largely unrecognized in the field, which we show frequently publishes experimental studies using these practices in our discipline's most prestigious journals. We demonstrate the severity of experimental post-treatment bias analytically and document the magnitude of the potential distortions it induces using visualizations and reanalyses of real-world data. We conclude by providing applied researchers with recommendations for best practice.
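
    The core point is easy to demonstrate by simulation; a minimal sketch (the data-generating process is hypothetical) in which subsetting on a post-treatment variable distorts a randomized comparison:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    t = rng.integers(0, 2, n)                   # randomized treatment
    u = rng.normal(size=n)                      # unobserved factor driving m and y
    m = 0.8 * t + u + rng.normal(size=n)        # post-treatment variable
    y = 1.0 * t + 2.0 * u + rng.normal(size=n)  # true treatment effect = 1.0

    # Unconditioned difference in means recovers ~1.0.
    print(y[t == 1].mean() - y[t == 0].mean())

    # Subsetting on the post-treatment variable (dropping observations where
    # m <= 0) biases the estimate downward, even though t was randomized.
    keep = m > 0
    print(y[(t == 1) & keep].mean() - y[(t == 0) & keep].mean())
    ```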

  18. QoG Social Policy Dataset - The QoG Social Policy Cross-Section Data

    • demo.researchdata.se
    Updated Feb 13, 2020
    + more versions
    Cite
    Jan Teorell; Richard Svensson; Marcus Samanni; Staffan Kumlin; Stefan Dahlberg; Bo Rothstein; Sören Holmberg (2020). QoG Social Policy Dataset - The QoG Social Policy Cross-Section Data [Dataset]. https://demo.researchdata.se/en/catalogue/dataset/ext0004-1
    Explore at:
    Dataset updated
    Feb 13, 2020
    Dataset provided by
    University of Gothenburg
    Authors
    Jan Teorell; Richard Svensson; Marcus Samanni; Staffan Kumlin; Stefan Dahlberg; Bo Rothstein; Sören Holmberg
    Time period covered
    2002
    Area covered
    United Kingdom, Spain, Mexico, Estonia, New Caledonia, Slovakia, Romania, Bulgaria, Iceland, Malta
    Description

    The QoG Institute is an independent research institute within the Department of Political Science at the University of Gothenburg. Overall 30 researchers conduct and promote research on the causes, consequences and nature of Good Governance and the Quality of Government - that is, trustworthy, reliable, impartial, uncorrupted and competent government institutions.

    The main objective of our research is to address the theoretical and empirical problem of how political institutions of high quality can be created and maintained. A second objective is to study the effects of Quality of Government on a number of policy areas, such as health, the environment, social policy, and poverty.

    The dataset was created as part of a research project titled “Quality of Government and the Conditions for Sustainable Social Policy”. The aim of the dataset is to promote cross-national comparative research on social policy output and its correlates, with a special focus on the connection between social policy and Quality of Government (QoG).

    The data comes in three versions: one cross-sectional dataset, and two cross-sectional time-series datasets for a selection of countries. The two combined datasets are called “long” (year 1946-2009) and “wide” (year 1970-2005).

    The data contains six types of variables, each provided under its own heading in the codebook: Social policy variables, Tax system variables, Social Conditions, Public opinion data, Political indicators, Quality of government variables.

    QoG Social Policy Dataset can be downloaded from the Data Archive of the QoG Institute at http://qog.pol.gu.se/data/datadownloads/data-archive Its variables are now included in QoG Standard.

    Purpose:

    The primary aim of QoG is to conduct and promote research on corruption. One aim of the QoG Institute is to make publicly available cross-national comparative data on QoG and its correlates. The aim of the QoG Social Policy Dataset is to promote cross-national comparative research on social policy output and its correlates, with a special focus on the connection between social policy and Quality of Government (QoG).

    This is a cross-section dataset based on data from and around 2002 of the QoG Social Policy dataset. If no data for 2002 were available for a variable, data from the closest available year were used, though not further back in time than 1995.

    Samanni, Marcus. Jan Teorell, Staffan Kumlin, Stefan Dahlberg, Bo Rothstein, Sören Holmberg & Richard Svensson. 2012. The QoG Social Policy Dataset, version 4Apr12. University of Gothenburg:The Quality of Government Institute. http://www.qog.pol.gu.se

  19. 2020 Census Redistricting Data - Variable Names and Codes

    • gimi9.com
    + more versions
    Cite
    2020 Census Redistricting Data - Variable Names and Codes | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_2020-census-redistricting-data-variable-names-and-codes/
    Explore at:
    Description

    These are the variable codes for the datasets released as part of the 2020 decennial census redistricting data.

  20. Data from: Dataset used in article "A 2-dimensional guillotine cutting stock problem with variable-sized stock for the honeycomb cardboard industry"

    • data-staging.niaid.nih.gov
    • produccioncientifica.ucm.es
    Updated Jul 10, 2024
    Cite
    Terán-Viadero, Paula; Alonso-Ayuso, Antonio; Martín-Campo, F. Javier (2024). Dataset used in article "A 2-dimensional guillotine cutting stock problem with variable-sized stock for the honeycomb cardboard industry" [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_8033003
    Explore at:
    Dataset updated
    Jul 10, 2024
    Dataset provided by
    Complutense University of Madrid
    Rey Juan Carlos University
    Authors
    Terán-Viadero, Paula; Alonso-Ayuso, Antonio; Martín-Campo, F. Javier
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset presented is part of the one used in the article "A 2-dimensional guillotine cutting stock problem with variable-sized stock for the honeycomb cardboard industry" by P. Terán-Viadero, A. Alonso-Ayuso and F. Javier Martín-Campo, published in International Journal of Production Research (2023), doi: 10.1080/00207543.2023.2279129. In the paper mentioned above, two mathematical optimisation models are proposed for the Cutting Stock Problem in the honeycomb cardboard sector. This problem appears in a Spanish company and the models proposed have been tested with real orders received by the company, achieving a reduction of up to 50% in the leftover generated. The dataset presented here includes six of the twenty cases used in the paper (the rest cannot be presented for confidentiality reasons). For each case, the characteristics of the order and the solution obtained by the two models are provided for the different scenarios analysed in the paper.

    Version 1.1 contains the same data but renamed according to the instance names in the final version of the article. Version 1.2 adds the PDF with the accepted version of the article published in International Journal of Production Research (2023), doi: 10.1080/00207543.2023.2279129.
