100+ datasets found
  1. Adult Data Set ( Census Income dataset)

    • kaggle.com
    zip
    Updated Mar 7, 2021
    Cite
    KritiDoneria (2021). Adult Data Set ( Census Income dataset) [Dataset]. https://www.kaggle.com/datasets/kritidoneria/adultdatasetxai
    Explore at:
    Available download formats: zip (481687 bytes)
    Dataset updated
    Mar 7, 2021
    Authors
    KritiDoneria
    Description

    Task: predict whether income exceeds $50K/yr based on census data.

    The dataset is US Census data, an extraction of the 1994 census donated to the UC Irvine Machine Learning Repository. It contains approximately 32,000 observations with over 15 variables, and was downloaded from http://archive.ics.uci.edu/ml/datasets/Adult. The dependent variable in this analysis is income level: whether a person earns above $50,000 a year. The analysis uses SQL queries, proportion analysis with bar charts, and a simple decision tree to understand the important variables and their influence on prediction.
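
    To make the decision-tree step concrete, here is a minimal sketch in Python. It assumes the raw UCI file and the standard UCI Adult column names; the Kaggle copy may ship with different headers.

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Standard UCI Adult schema; verify against the actual file before use.
    cols = ["age", "workclass", "fnlwgt", "education", "education_num",
            "marital_status", "occupation", "relationship", "race", "sex",
            "capital_gain", "capital_loss", "hours_per_week", "native_country",
            "income"]
    df = pd.read_csv("adult.data", names=cols, skipinitialspace=True)

    X = pd.get_dummies(df.drop(columns="income"))   # one-hot encode categoricals
    y = (df["income"] == ">50K").astype(int)        # 1 if income exceeds $50K/yr

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
    print("accuracy:", tree.score(X_te, y_te))

    # Feature importances highlight the most influential variables.
    print(pd.Series(tree.feature_importances_, index=X.columns).nlargest(5))
    ```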

  2. A Dataset of Water Quality and Related Variables in U.S. Reservoirs

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Jun 13, 2025
    Cite
    U.S. EPA Office of Research and Development (ORD) (2025). A Dataset of Water Quality and Related Variables in U.S. Reservoirs [Dataset]. https://catalog.data.gov/dataset/a-dataset-of-water-quality-and-related-variables-in-u-s-reservoirs
    Explore at:
    Dataset updated
    Jun 13, 2025
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Area covered
    United States
    Description

    This dataset presents a rich collection of physicochemical parameters from 147 reservoirs distributed across the conterminous U.S. One hundred and eight of the reservoirs were selected using a statistical survey design and can provide unbiased inferences about the condition of all U.S. reservoirs. These data could be of interest to local water management specialists or to those assessing the ecological condition of reservoirs at the national scale. These data have been reviewed in accordance with U.S. Environmental Protection Agency policy and approved for publication. This dataset is not publicly accessible from this catalog because it is too large; it can be accessed through the following means: https://portal-s.edirepository.org/nis/mapbrowse?scope=edi&identifier=2033&revision=1

    Format: This dataset presents water quality and related variables for 147 reservoirs distributed across the U.S. Water quality parameters were measured during the summers of 2016, 2018, and 2020–2023. Measurements include nutrient concentration, algae abundance, dissolved oxygen concentration, and water temperature, among many others. The dataset includes links to other national- and global-scale datasets that provide additional variables.

  3. House Price Regression Dataset

    • kaggle.com
    zip
    Updated Sep 6, 2024
    Cite
    Prokshitha Polemoni (2024). House Price Regression Dataset [Dataset]. https://www.kaggle.com/datasets/prokshitha/home-value-insights
    Explore at:
    Available download formats: zip (27045 bytes)
    Dataset updated
    Sep 6, 2024
    Authors
    Prokshitha Polemoni
    Description

    Home Value Insights: A Beginner's Regression Dataset

    This dataset is designed for beginners to practice regression problems, particularly in the context of predicting house prices. It contains 1000 rows, with each row representing a house and various attributes that influence its price. The dataset is well-suited for learning basic to intermediate-level regression modeling techniques.

    Features:

    1. Square_Footage: The size of the house in square feet. Larger homes typically have higher prices.
    2. Num_Bedrooms: The number of bedrooms in the house. More bedrooms generally increase the value of a home.
    3. Num_Bathrooms: The number of bathrooms in the house. Houses with more bathrooms are typically priced higher.
    4. Year_Built: The year the house was built. Older houses may be priced lower due to wear and tear.
    5. Lot_Size: The size of the lot the house is built on, measured in acres. Larger lots tend to add value to a property.
    6. Garage_Size: The number of cars that can fit in the garage. Houses with larger garages are usually more expensive.
    7. Neighborhood_Quality: A rating of the neighborhood’s quality on a scale of 1-10, where 10 indicates a high-quality neighborhood. Better neighborhoods usually command higher prices.
    8. House_Price (Target Variable): The price of the house, which is the dependent variable you aim to predict.

    Potential Uses:

    1. Beginner Regression Projects: This dataset can be used to practice building regression models such as Linear Regression, Decision Trees, or Random Forests. The target variable (house price) is continuous, making this an ideal problem for supervised learning techniques.

    2. Feature Engineering Practice: Learners can create new features by combining existing ones, such as the price per square foot or age of the house, providing an opportunity to experiment with feature transformations.

    3. Exploratory Data Analysis (EDA): You can explore how different features (e.g., square footage, number of bedrooms) correlate with the target variable, making it a great dataset for learning about data visualization and summary statistics.

    4. Model Evaluation: The dataset allows for various model evaluation techniques such as cross-validation, R-squared, and Mean Absolute Error (MAE). These metrics can be used to compare the effectiveness of different models (a minimal sketch appears at the end of this entry).

    Versatility:

    • The dataset is highly versatile for a range of machine learning tasks. You can apply simple linear models to predict house prices based on one or two features, or use more complex models like Random Forest or Gradient Boosting Machines to understand interactions between variables.

    • It can also be used for dimensionality reduction techniques like PCA or to practice handling categorical variables (e.g., neighborhood quality) through encoding techniques like one-hot encoding.

    • This dataset is ideal for anyone wanting to gain practical experience in building regression models while working with real-world features.
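
    As a minimal sketch of the workflow described above, assuming the feature names listed and a hypothetical CSV filename:

    ```python
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("home_value_insights.csv")  # filename is an assumption

    features = ["Square_Footage", "Num_Bedrooms", "Num_Bathrooms", "Year_Built",
                "Lot_Size", "Garage_Size", "Neighborhood_Quality"]
    X, y = df[features], df["House_Price"]

    # 5-fold cross-validated MAE and R-squared for a linear regression baseline.
    mae = -cross_val_score(LinearRegression(), X, y, cv=5,
                           scoring="neg_mean_absolute_error").mean()
    r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean()
    print(f"MAE: {mae:,.0f}  R^2: {r2:.3f}")

    # Feature-engineering examples from the description: house age and
    # price per square foot as derived columns.
    df["House_Age"] = 2024 - df["Year_Built"]
    df["Price_Per_SqFt"] = df["House_Price"] / df["Square_Footage"]
    ```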

  4. University SET data, with faculty and courses characteristics

    • openicpsr.org
    Updated Sep 12, 2021
    + more versions
    Cite
    Under blind review in refereed journal (2021). University SET data, with faculty and courses characteristics [Dataset]. http://doi.org/10.3886/E149801V1
    Explore at:
    Dataset updated
    Sep 12, 2021
    Authors
    Under blind review in refereed journal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This paper explores a unique dataset of all the SET ratings provided by students of one university in Poland at the end of the winter semester of the 2020/2021 academic year. The SET questionnaire used by this university is presented in Appendix 1. The dataset is unique for several reasons. It covers all SET surveys filled in by students in all fields and levels of study offered by the university. In the period analysed, the university operated entirely online amid the Covid-19 pandemic. While the expected learning outcomes formally were not changed, the online mode of study could have affected the grading policy and could have implications for some of the studied SET biases. This Covid-19 effect is captured by econometric models and discussed in the paper.

    The average SET scores were matched with teacher characteristics (degree, seniority, gender, and SET scores in the past six semesters); course characteristics (time of day, day of the week, course type, course breadth, class duration, and class size); attributes of the SET survey responses (the percentage of students providing SET feedback); and course grades (mean, standard deviation, and percentage failed). Data on course grades are also available for the previous six semesters. This rich dataset allows many of the biases reported in the literature to be tested for and new hypotheses to be formulated, as presented in the introduction section.

    The unit of observation, i.e. a single row in the data set, is identified by three parameters: teacher unique id (j), course unique id (k), and the question number in the SET questionnaire (n ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9}). This means that for each pair (j, k) we have nine rows, one for each SET survey question, or sometimes fewer when no student answered a given SET question at all. For example, the dependent variable SET_score_avg(j, k, n) for the triplet (j = John Smith, k = Calculus, n = 2) is calculated as the average of all Likert-scale answers to question no. 2 in the SET survey distributed to all students who took the Calculus course taught by John Smith. The data set has 8,015 such observations or rows. The full list of variables or columns in the data set included in the analysis is presented in the attached file section. Their descriptions refer to the triplet (teacher id = j, course id = k, question number = n). When the last value of the triplet (n) is dropped, the variable takes the same value for all n ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9}.

    Two attachments:
    - Word file with variable descriptions
    - Rdata file with the data set (for the R language)

    Appendix 1. The SET questionnaire used for this paper: evaluation survey of the teaching staff of [university name]. Please complete the following evaluation form, which aims to assess the lecturer's performance. Only one answer should be indicated for each question. The answers are coded in the following way: 5 - I strongly agree; 4 - I agree; 3 - Neutral; 2 - I don't agree; 1 - I strongly don't agree.

    1. I learnt a lot during the course.
    2. I think that the knowledge acquired during the course is very useful.
    3. The professor used activities to make the class more engaging.
    4. If it was possible, I would enroll for the course conducted by this lecturer again.
    5. The classes started on time.
    6. The lecturer always used time efficiently.
    7. The lecturer delivered the class content in an understandable and efficient way.
    8. The lecturer was available when we had doubts.
    9. The lecturer treated all students equally regardless of their race, background and ethnicity.
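
    To illustrate the layout, a minimal pandas sketch (column names are hypothetical) that reconstructs SET_score_avg(j, k, n) by averaging the Likert answers per (teacher, course, question):

    ```python
    import pandas as pd

    # Hypothetical long table of individual Likert answers (1-5): one row per
    # student response to one SET question.
    answers = pd.DataFrame({
        "teacher_id": ["John Smith"] * 4,
        "course_id":  ["Calculus"] * 4,
        "question_n": [2, 2, 2, 2],
        "likert":     [5, 4, 4, 3],
    })

    # SET_score_avg(j, k, n): mean Likert answer per (teacher, course, question).
    set_score_avg = (
        answers.groupby(["teacher_id", "course_id", "question_n"])["likert"]
               .mean()
               .rename("SET_score_avg")
               .reset_index()
    )
    print(set_score_avg)
    ```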

  5. Synthetic Data for an Imaginary Country, Sample, 2023 - World

    • microdata.worldbank.org
    • nada-demo.ihsn.org
    Updated Jul 7, 2023
    + more versions
    Cite
    Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
    Explore at:
    Dataset updated
    Jul 7, 2023
    Dataset authored and provided by
    Development Data Group, Data Analytics Unit
    Time period covered
    2023
    Area covered
    World
    Description

    Abstract

    The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only include ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.

    The full-population dataset (with about 10 million individuals) is also distributed as open data.

    Geographic coverage

    The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

    Analysis unit

    Household, Individual

    Universe

    The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

    Kind of data

    Sample survey data (ssd)

    Sampling procedure

    The sample size was set to 8,000 households, with a fixed 25 households selected from each enumeration area. In the first stage, the number of enumeration areas to be selected in each stratum was calculated proportionally to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
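
    A minimal sketch of this two-stage design in Python (the original uses an R script; frame sizes and column names here are assumptions):

    ```python
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)

    # Hypothetical frames; strata combine geo_1 (province) and urban/rural.
    eas = pd.DataFrame({
        "ea_id": np.arange(400),
        "stratum": rng.choice(["p1-urban", "p1-rural", "p2-urban", "p2-rural"], 400),
    })
    households = pd.DataFrame({
        "hh_id": np.arange(40_000),
        "ea_id": rng.integers(0, 400, size=40_000),
    })

    HH_PER_EA = 25
    n_eas = 8_000 // HH_PER_EA  # 320 enumeration areas in total

    # Stage 1: EAs per stratum, proportional to stratum size (rounding may
    # shift the total slightly; real designs control for this).
    alloc = (eas["stratum"].value_counts(normalize=True) * n_eas).round().astype(int)
    chosen = pd.concat(
        eas[eas["stratum"] == s].sample(n=k, random_state=1)
        for s, k in alloc.items()
    )

    # Stage 2: 25 households drawn at random within each selected EA.
    sample = (
        households[households["ea_id"].isin(chosen["ea_id"])]
        .groupby("ea_id", group_keys=False)
        .apply(lambda g: g.sample(n=min(HH_PER_EA, len(g)), random_state=1))
    )
    print(len(chosen), "EAs,", len(sample), "households")
    ```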

    Mode of data collection

    other

    Research instrument

    The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.

    Cleaning operations

    The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observations were assessed and rejected/replaced when needed). Some post-processing was also applied to the data to produce the distributed data files.

    Response rate

    This is a synthetic dataset; the "response rate" is 100%.

  6. Dataset for Linear Regression with 2 IV and 1 DV

    • kaggle.com
    zip
    Updated Mar 25, 2025
    Cite
    Stable Space (2025). Dataset for Linear Regression with 2 IV and 1 DV [Dataset]. https://www.kaggle.com/datasets/sharmajicoder/dataset-for-linear-regression-with-2-iv-and-1-dv
    Explore at:
    Available download formats: zip (9351 bytes)
    Dataset updated
    Mar 25, 2025
    Authors
    Stable Space
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset for Linear Regression with two Independent variables and one Dependent variable. Focused on Testing, Visualization and Statistical Analysis. The dataset is synthetic and contains 100 instances.
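
    A minimal sketch for fitting and inspecting such a model (the column names x1, x2, and y are assumptions about the CSV):

    ```python
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("linear_regression_2iv_1dv.csv")  # filename assumed

    X = sm.add_constant(df[["x1", "x2"]])  # two independent variables + intercept
    model = sm.OLS(df["y"], X).fit()
    print(model.summary())                 # coefficients, R-squared, p-values
    ```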

  7. Dataset Variables (CV)

    • usc.data.socrata.com
    • splitgraph.com
    csv, xlsx, xml
    Updated Nov 26, 2018
    + more versions
    Cite
    (2018). Dataset Variables (CV) [Dataset]. https://usc.data.socrata.com/Riverside-Coachella-Valley/Dataset-Variables-for-Riverside-Coachella-Valley/ime8-mqha
    Explore at:
    Available download formats: xml, xlsx, csv
    Dataset updated
    Nov 26, 2018
    Description

    Cross-reference of dataset variables that have a denominator.

  8. Neighborhood Change Index Variables 20181010

    • data.ferndalemi.gov
    • detroitdata.org
    • +5more
    Updated Oct 10, 2018
    + more versions
    Cite
    Data Driven Detroit (2018). Neighborhood Change Index Variables 20181010 [Dataset]. https://data.ferndalemi.gov/maps/D3::neighborhood-change-index-variables-20181010
    Explore at:
    Dataset updated
    Oct 10, 2018
    Dataset authored and provided by
    Data Driven Detroit
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This layer includes the variables (by 2010 census block) used in the Neighborhood Change Index created by Data Driven Detroit in October 2018 for the Turning the Corner project. The final neighborhood change index was created using the average scores of five factors, which were made up of various combinations of these variables.

  9. Data from: A clustering based forecasting algorithm for multivariable fuzzy time series using linear combinations of independent variables

    • data.mendeley.com
    Updated Oct 31, 2016
    + more versions
    Cite
    Salar Askari Lasaki (2016). A clustering based forecasting algorithm for multivariable fuzzy time series using linear combinations of independent variables [Dataset]. http://doi.org/10.17632/35fw8pb6s9.1
    Explore at:
    Dataset updated
    Oct 31, 2016
    Authors
    Salar Askari Lasaki
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Dear Researcher,

    Thank you for using this code and these datasets. This note explains how the CFTS code accompanying my paper "A clustering based forecasting algorithm for multivariable fuzzy time series using linear combinations of independent variables", published in Applied Soft Computing, works. All datasets mentioned in the paper are included along with the CFTS code. If there are any questions, feel free to contact me at: bas_salaraskari@yahoo.com or s_askari@aut.ac.ir

    Regards,

    S. Askari

    Guidelines for the CFTS algorithm:
    1. Open the file "CFTS Code" in MATLAB.
    2. Enter or paste the name of the dataset you wish to simulate in line 5 after "load". This loads the dataset into the workspace.
    3. Lines 6 and 7: "r" is the number of independent variables and "N" is the number of data vectors used for training.
    4. Line 9: "C" is the number of clusters. You can use the optimal number of clusters given in Table 6 of the paper or your own preferred value.
    5. If line 28 is commented out, the covariance norm (Mahalanobis distance) is used; if it is uncommented, the identity norm (Euclidean distance) is used.
    6. Press Ctrl+Enter to run the code.
    7. For your own dataset, please arrange the data as described for the datasets in the MS Word file "Read Me".

  10. Variables and data sources.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Mar 4, 2022
    + more versions
    Cite
    Lu, Yue; Li, Jian; Yang, Siying (2022). Variables and data sources. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000444597
    Explore at:
    Dataset updated
    Mar 4, 2022
    Authors
    Lu, Yue; Li, Jian; Yang, Siying
    Description

    Variables and data sources.

  11. Data from: Landsat Burned Area Essential Climate Variable products for the conterminous United States (1984 - 2015)

    • catalog.data.gov
    • datasets.ai
    • +1more
    Updated Nov 26, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Landsat Burned Area Essential Climate Variable products for the conterminous United States (1984 - 2015) [Dataset]. https://catalog.data.gov/dataset/landsat-burned-area-essential-climate-variable-products-for-the-conterminous-united-s-1984
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    United States, Contiguous United States
    Description

    The U.S. Geological Survey (USGS) has developed and implemented an algorithm that identifies burned areas in temporally-dense time series of Landsat image stacks to produce the Landsat Burned Area Essential Climate Variable (BAECV) products. The algorithm makes use of predictors derived from individual Landsat scenes, lagged reference conditions, and change metrics between the scene and reference conditions. Outputs of the BAECV algorithm consist of pixel-level burn probabilities for each Landsat scene, and annual burn probability, burn classification, and burn date composites. These products were generated for the conterminous United States for 1984 through 2015. These data are also available for download at https://gsc.cr.usgs.gov/outgoing/baecv/BAECV_CONUS_v1.1_2017/

    Additional details about the algorithm used to generate these products are described in: Hawbaker, T.J., Vanderhoof, M.K., Beal, Y.G., Takacs, J.D., Schmidt, G.L., Falgout, J.T., Williams, B., Brunner, N.M., Caldwell, M.K., Picotte, J.J., Howard, S.M., Stitt, S., and Dwyer, J.L., 2017. Mapping burned areas using dense time-series of Landsat data. Remote Sensing of Environment 198, 504–522. doi:10.1016/j.rse.2017.06.027

    First release: 2017. Revised: September 2017 (ver. 1.1)
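
    To illustrate how per-scene burn probabilities roll up into annual composites, a minimal numpy sketch on synthetic data (the 0.5 threshold is a placeholder, not the BAECV operational value):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic stack of per-scene, pixel-level burn probabilities for one year:
    # (n_scenes, height, width). Real BAECV probabilities come per Landsat scene.
    probs = rng.random((12, 100, 100)).astype(np.float32)

    annual_prob = probs.max(axis=0)        # annual burn probability composite
    annual_burned = annual_prob >= 0.5     # burn classification (placeholder cut)
    burn_scene = probs.argmax(axis=0)      # scene index standing in for burn date

    print(f"{annual_burned.mean():.1%} of pixels classified as burned")
    ```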

  12. Bridging the Gap in Hypertension Management: Evaluating Blood Pressure Control and Associated Risk Factors in a Resource-Constrained Setting

    • data.mendeley.com
    Updated Jan 15, 2025
    + more versions
    Cite
    abu sufian (2025). Bridging the Gap in Hypertension Management: Evaluating Blood Pressure Control and Associated Risk Factors in a Resource-Constrained Setting [Dataset]. http://doi.org/10.17632/56jyjndvcr.1
    Explore at:
    Dataset updated
    Jan 15, 2025
    Authors
    abu sufian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    This dataset contains a simulated collection of 10,000 patient records designed to explore hypertension management in resource-constrained settings. It provides comprehensive data for analyzing blood pressure control rates, associated risk factors, and complications. The dataset is ideal for predictive modelling, risk analysis, and treatment optimization, offering insights into demographic, clinical, and treatment-related variables.

    Dataset Structure

    1. Dataset Volume

      • Size: 10,000 records.
      • Features: 19 variables, categorized into Sociodemographic, Clinical, Complications, and Treatment/Control groups.

    2. Variables and Categories

    A. Sociodemographic Variables

    1. Age:
    •  Continuous variable in years.
    •  Range: 18–80 years.
    •  Mean ± SD: 49.37 ± 12.81.
    2. Sex:
    •  Categorical variable.
    •  Values: Male, Female.
    3. Education:
    •  Categorical variable.
    •  Values: No Education, Primary, Secondary, Higher Secondary, Graduate, Post-Graduate, Madrasa.
    4. Occupation:
    •  Categorical variable.
    •  Values: Service, Business, Agriculture, Retired, Unemployed, Housewife.
    5. Monthly Income:
    •  Categorical variable in Bangladeshi Taka.
    •  Values: <5000, 5001–10000, 10001–15000, >15000.
    6. Residence:
    •  Categorical variable.
    •  Values: Urban, Sub-urban, Rural.
    

    B. Clinical Variables

    7. Systolic BP:
    •  Continuous variable in mmHg.
    •  Range: 100–200 mmHg.
    •  Mean ± SD: 140 ± 15 mmHg.
    8. Diastolic BP:
    •  Continuous variable in mmHg.
    •  Range: 60–120 mmHg.
    •  Mean ± SD: 90 ± 10 mmHg.
    9. Elevated Creatinine:
    •  Binary variable (≥ 1.4 mg/dL).
    •  Values: Yes, No.
    10. Diabetes Mellitus:
    •  Binary variable.
    •  Values: Yes, No.
    11. Family History of CVD:
    •  Binary variable.
    •  Values: Yes, No.
    12. Elevated Cholesterol:
    •  Binary variable (≥ 200 mg/dL).
    •  Values: Yes, No.
    13. Smoking:
    •  Binary variable.
    •  Values: Yes, No.
    

    C. Complications

    14. LVH (Left Ventricular Hypertrophy):
    •  Binary variable (ECG diagnosis).
    •  Values: Yes, No.
    15. IHD (Ischemic Heart Disease):
    •  Binary variable.
    •  Values: Yes, No.
    16. CVD (Cerebrovascular Disease):
    •  Binary variable.
    •  Values: Yes, No.
    17. Retinopathy:
    •  Binary variable.
    •  Values: Yes, No.
    

    D. Treatment and Control

    18. Treatment:
    •  Categorical variable indicating therapy type.
    •  Values: Single Drug, Combination Drugs.
    19. Control Status:
    •  Binary variable.
    •  Values: Controlled, Uncontrolled.
    

    Dataset Applications

    1. Predictive Modeling:
    •  Develop models to predict blood pressure control status using demographic and clinical data (a minimal sketch follows this list).
    2. Risk Analysis:
    •  Identify significant factors influencing hypertension control and complications.
    3. Severity Scoring:
    •  Quantify hypertension severity for patient risk stratification.
    4. Complications Prediction:
    •  Forecast complications like IHD, LVH, and CVD for early intervention.
    5. Treatment Guidance:
    •  Analyze therapy efficacy to recommend optimal treatment strategies.
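
    A minimal sketch for the predictive-modelling application (column names mirror the variable list above but are assumptions about the file):

    ```python
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("hypertension.csv")  # filename assumed

    X = pd.get_dummies(df[["Age", "Sex", "Systolic_BP", "Diastolic_BP",
                           "Diabetes_Mellitus", "Smoking", "Treatment"]],
                       drop_first=True)
    y = (df["Control_Status"] == "Controlled").astype(int)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(classification_report(y_te, clf.predict(X_te)))
    ```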
    
  13. Data from: WiBB: An integrated method for quantifying the relative importance of predictive variables

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Aug 20, 2021
    Cite
    Qin Li; Xiaojun Kou (2021). WiBB: An integrated method for quantifying the relative importance of predictive variables [Dataset]. http://doi.org/10.5061/dryad.xsj3tx9g1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Aug 20, 2021
    Dataset provided by
    Beijing Normal University
    Field Museum of Natural History
    Authors
    Qin Li; Xiaojun Kou
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    This dataset contains simulated datasets, empirical data, and R scripts described in the paper: “Li, Q. and Kou, X. (2021) WiBB: An integrated method for quantifying the relative importance of predictive variables. Ecography (DOI: 10.1111/ecog.05651)”.

    A fundamental goal of scientific research is to identify the underlying variables that govern crucial processes of a system. Here we proposed a new index, WiBB, which integrates the merits of several existing methods: a model-weighting method from information theory (Wi), a standardized regression coefficient method measured by ß* (B), and bootstrap resampling technique (B). We applied the WiBB in simulated datasets with known correlation structures, for both linear models (LM) and generalized linear models (GLM), to evaluate its performance. We also applied two other methods, relative sum of wight (SWi), and standardized beta (ß*), to evaluate their performance in comparison with the WiBB method on ranking predictor importances under various scenarios. We also applied it to an empirical dataset in a plant genus Mimulus to select bioclimatic predictors of species’ presence across the landscape. Results in the simulated datasets showed that the WiBB method outperformed the ß* and SWi methods in scenarios with small and large sample sizes, respectively, and that the bootstrap resampling technique significantly improved the discriminant ability. When testing WiBB in the empirical dataset with GLM, it sensibly identified four important predictors with high credibility out of six candidates in modeling geographical distributions of 71 Mimulus species. This integrated index has great advantages in evaluating predictor importance and hence reducing the dimensionality of data, without losing interpretive power. The simplicity of calculation of the new metric over more sophisticated statistical procedures, makes it a handy method in the statistical toolbox.

    Methods: To simulate independent datasets (size = 1000), we adopted Galipaud et al.'s (2014) approach with custom modifications of the data.simulation function, which used the multivariate normal distribution function rmvnorm in the R package mvtnorm (v1.0-5, Genz et al. 2016). Each dataset was simulated with a preset correlation structure between a response variable (y) and four predictors (x1, x2, x3, x4). The first three (genuine) predictors were set to be strongly, moderately, and weakly correlated with the response variable, respectively (denoted by large, medium, and small Pearson correlation coefficients, r), while the correlation between the response and the last (spurious) predictor was set to zero. We simulated datasets with three levels of differences between the correlation coefficients of consecutive predictors, ∆r = 0.1, 0.2, 0.3. These three levels of ∆r resulted in three correlation structures between the response and the four predictors: (0.3, 0.2, 0.1, 0.0), (0.6, 0.4, 0.2, 0.0), and (0.8, 0.6, 0.3, 0.0). We repeated the simulation procedure 200 times for each of the three preset correlation structures (600 datasets in total) for later LM fitting. For GLM fitting, we modified the simulation procedure with additional steps, in which we converted the continuous response into binary data O (e.g., occurrence data having 0 for absence and 1 for presence). We tested the WiBB method, along with two other methods, the relative sum of weights (SWi) and the standardized beta (β*), to evaluate the ability to correctly rank predictor importance under various scenarios. The empirical dataset of 71 Mimulus species was assembled from occurrence coordinates and corresponding values extracted from climatic layers of the WorldClim dataset (www.worldclim.org), and we applied the WiBB method to infer important predictors of their geographical distributions.
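
    A minimal Python analogue of the simulation step (the paper uses rmvnorm in R; mutually uncorrelated predictors are a simplifying assumption here):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Preset correlation structure from the description: r(y, x1..x4) with
    # delta-r = 0.2, i.e. (0.6, 0.4, 0.2, 0.0).
    r = np.array([0.6, 0.4, 0.2, 0.0])

    # Joint covariance of (y, x1..x4): unit variances; predictors mutually
    # uncorrelated (a simplification; the paper presets the full structure).
    cov = np.eye(5)
    cov[0, 1:] = cov[1:, 0] = r

    data = rng.multivariate_normal(np.zeros(5), cov, size=1000)
    y, X = data[:, 0], data[:, 1:]

    # Realized correlations should be close to the preset (0.6, 0.4, 0.2, 0.0).
    print([round(float(np.corrcoef(y, X[:, j])[0, 1]), 2) for j in range(4)])
    ```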

  14. bnlearn datasets

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 29, 2025
    Cite
    Zenodo (2025). bnlearn datasets [Dataset]. http://doi.org/10.5281/zenodo.7676616
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 29, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This collection consists of 5 structure learning datasets from the Bayesian Network Repository (Scutari, 2010).

    Task: The dataset collection can be used to study causal discovery algorithms (a minimal structure-learning sketch appears at the end of this entry).

    Summary:

    • Size of collection: 5 datasets with 3 - 56 columns of various sizes
    • Task: Causal Discovery
    • Data Type: Discrete
    • Dataset Scope: Collection
    • Ground Truth: Known / Estimated
    • Temporal Structure: No
    • License: TBD
    • Missing Values: No

    Missingness Statement: There are no missing values.

    Collection:

    The alarm dataset contains the following 37 variables:

    • CVP (central venous pressure): a three-level factor with levels LOW, NORMAL and HIGH.
    • PCWP (pulmonary capillary wedge pressure): a three-level factor with levels LOW, NORMAL and HIGH.
    • HIST (history): a two-level factor with levels TRUE and FALSE.
    • TPR (total peripheral resistance): a three-level factor with levels LOW, NORMAL and HIGH.
    • ... (33 more variables, see the corresponding .html file)

    The binary synthetic asia dataset:

    • D (dyspnoea), a two-level factor with levels yes and no.
    • T (tuberculosis), a two-level factor with levels yes and no.
    • L (lung cancer), a two-level factor with levels yes and no.
    • B (bronchitis), a two-level factor with levels yes and no.
    • A (visit to Asia), a two-level factor with levels yes and no.
    • S (smoking), a two-level factor with levels yes and no.
    • X (chest X-ray), a two-level factor with levels yes and no.
    • E (tuberculosis versus lung cancer/bronchitis), a two-level factor with levels yes and no.

    The binary coronary dataset:

    • Smoking (smoking): a two-level factor with levels no and yes.
    • M. Work (strenuous mental work): a two-level factor with levels no and yes.
    • P. Work (strenuous physical work): a two-level factor with levels no and yes.
    • Pressure (systolic blood pressure): a two-level factor with levels <140 and >140.
    • Proteins (ratio of beta and alpha lipoproteins): a two-level factor with levels <3 and >3.
    • Family (family anamnesis of coronary heart disease): a two-level factor with levels neg and pos.

    The hailfinder dataset contains the following 56 variables:

    • N07muVerMo (10.7mu vertical motion): a four-level factor with levels StrongUp, WeakUp, Neutral and Down.
    • SubjVertMo (subjective judgment of vertical motion): a four-level factor with levels StrongUp, WeakUp, Neutral and Down.
    • QGVertMotion (quasigeostrophic vertical motion): a four-level factor with levels StrongUp, WeakUp, Neutral and Down.
    • CombVerMo (combined vertical motion): a four-level factor with levels StrongUp, WeakUp, Neutral and Down.
    • AreaMesoALS (area of meso-alpha): a four-level factor with levels StrongUp, WeakUp, Neutral and Down.
    • SatContMoist (satellite contribution to moisture): a four-level factor with levels VeryWet, Wet, Neutral and Dry.
    • ... (49 more variables, see the corresponding .html file)

    The lizards dataset contains the following 3 variables:

    • Species (the species of the lizard): a two-level factor with levels Sagrei and Distichus.
    • Height (perch height): a two-level factor with levels high (greater than 4.75 feet) and low (lesser or equal to 4.75 feet).
    • Diameter (perch diameter): a two-level factor with levels narrow (greater than 4 inches) and wide (lesser or equal to 4 inches).
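
    Since the collection targets causal discovery, here is a minimal structure-learning sketch with the Python library pgmpy (one option among many; the API shown matches pgmpy 0.1.x, and the CSV export of the asia data is an assumption):

    ```python
    import pandas as pd
    from pgmpy.estimators import HillClimbSearch, BicScore  # pgmpy 0.1.x API

    # Discrete yes/no columns D, T, L, B, A, S, X, E, as listed for asia above.
    data = pd.read_csv("asia.csv")

    # Score-based structure search: hill climbing over DAGs using the BIC score.
    dag = HillClimbSearch(data).estimate(scoring_method=BicScore(data))
    print(sorted(dag.edges()))  # learned edges to compare with the known graph
    ```
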
  15. General Social Survey, 2018 - Instructional Dataset

    • thearda.com
    Updated 2018
    Cite
    Tom W. Smith (2018). General Social Survey, 2018 - Instructional Dataset [Dataset]. http://doi.org/10.17605/OSF.IO/7FVZG
    Explore at:
    Dataset updated
    2018
    Dataset provided by
    Association of Religion Data Archives
    Authors
    Tom W. Smith
    Dataset funded by
    National Science Foundation
    Description

    This file contains all of the cases and variables that are in the original 2018 General Social Survey, but is prepared for easier use in the classroom. Changes have been made in two areas. First, to avoid confusion when constructing tables or interpreting basic analysis, all missing data codes have been set to system missing. Second, many of the continuous variables have been categorized into fewer categories, and added as additional variables to the file. The General Social Surveys (GSS) have been conducted by the National Opinion Research Center (NORC) annually since 1972, except for the years 1979, 1981, and 1992 (a supplement was added in 1992), and biennially beginning in 1994. The GSS are designed to be part of a program of social indicator research, replicating questionnaire items and wording in order to facilitate time-trend studies. To download syntax files for the GSS that reproduce well-known religious group recodes, including RELTRAD, please visit the ARDA's Syntax Repository.

    The 2018 General Social Survey - Instructional Dataset has been updated as of June 2024. This release includes additional interview-specific variables and survey weights.
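
    The two classroom preparations described above are easy to replicate in pandas; a minimal sketch (missing-data codes, column names, and cut points are illustrative stand-ins, not the official GSS codebook values):

    ```python
    import numpy as np
    import pandas as pd

    df = pd.read_csv("gss2018.csv")  # filename assumed

    # 1. Set missing-data codes to system missing (NaN); the column
    #    "hrs_worked" and the codes {98, 99} are hypothetical.
    df["hrs_worked"] = df["hrs_worked"].where(~df["hrs_worked"].isin([98, 99]), np.nan)

    # 2. Categorize a continuous variable into fewer categories, added as a
    #    new variable alongside the original.
    df["age4"] = pd.cut(df["age"], bins=[17, 34, 49, 64, 120],
                        labels=["18-34", "35-49", "50-64", "65+"])
    ```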

  16. Data from: Variable Message Signs

    • data.europa.eu
    • data.wu.ac.at
    csv, geojson, kml
    Updated Oct 11, 2021
    + more versions
    Cite
    City of York Council (2021). Variable Message Signs [Dataset]. https://data.europa.eu/data/datasets/variable-message-signs
    Explore at:
    Available download formats: geojson, csv, kml
    Dataset updated
    Oct 11, 2021
    Dataset authored and provided by
    City of York Council
    Description

    Variable Message Signs (VMS) in York.

    For further information about traffic management please visit the City of York Council website.

    Please note that the data published within this dataset is a live API link to CYC's GIS server. Any changes made to the master copy of the data will be immediately reflected in the resources of this dataset. The date shown in the "Last Updated" field of each GIS resource reflects when the data was first published.

  17. Replication Data for: How Conditioning on Posttreatment Variables Can Ruin Your Experiment and What to Do about It

    • datasetcatalog.nlm.nih.gov
    • dataverse.harvard.edu
    • +1more
    Updated Feb 20, 2018
    Cite
    Nyhan, Brendan; Montgomery, Jacob M.; Torres, Michelle (2018). Replication Data for: How Conditioning on Posttreatment Variables Can Ruin Your Experiment and What to Do about It [Dataset]. http://doi.org/10.7910/DVN/EZSJ1S
    Explore at:
    Dataset updated
    Feb 20, 2018
    Authors
    Nyhan, Brendan; Montgomery, Jacob M.; Torres, Michelle
    Description

    In principle, experiments offer a straightforward method for social scientists to accurately estimate causal effects. However, scholars often unwittingly distort treatment effect estimates by conditioning on variables that could be affected by their experimental manipulation. Typical examples include controlling for post-treatment variables in statistical models, eliminating observations based on post-treatment criteria, or subsetting the data based on post-treatment variables. Though these modeling choices are intended to address common problems encountered when conducting experiments, they can bias estimates of causal effects. Moreover, problems associated with conditioning on post-treatment variables remain largely unrecognized in the field, which we show frequently publishes experimental studies using these practices in our discipline's most prestigious journals. We demonstrate the severity of experimental post-treatment bias analytically and document the magnitude of the potential distortions it induces using visualizations and reanalyses of real-world data. We conclude by providing applied researchers with recommendations for best practice.
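
    The core point is easy to demonstrate by simulation; a minimal sketch (the data-generating process is hypothetical) in which subsetting on a post-treatment variable distorts a randomized comparison:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    t = rng.integers(0, 2, n)                   # randomized treatment
    u = rng.normal(size=n)                      # unobserved factor driving m and y
    m = 0.8 * t + u + rng.normal(size=n)        # post-treatment variable
    y = 1.0 * t + 2.0 * u + rng.normal(size=n)  # true treatment effect = 1.0

    # Unconditioned difference in means recovers ~1.0.
    print(y[t == 1].mean() - y[t == 0].mean())

    # Subsetting on the post-treatment variable (dropping observations where
    # m <= 0) biases the estimate downward, even though t was randomized.
    keep = m > 0
    print(y[(t == 1) & keep].mean() - y[(t == 0) & keep].mean())
    ```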

  18. QoG Social Policy Dataset - The QoG Social Policy Cross-Section Data

    • demo.researchdata.se
    Updated Feb 13, 2020
    + more versions
    Cite
    Jan Teorell; Richard Svensson; Marcus Samanni; Staffan Kumlin; Stefan Dahlberg; Bo Rothstein; Sören Holmberg (2020). QoG Social Policy Dataset - The QoG Social Policy Cross-Section Data [Dataset]. https://demo.researchdata.se/en/catalogue/dataset/ext0004-1
    Explore at:
    Dataset updated
    Feb 13, 2020
    Dataset provided by
    University of Gothenburg
    Authors
    Jan Teorell; Richard Svensson; Marcus Samanni; Staffan Kumlin; Stefan Dahlberg; Bo Rothstein; Sören Holmberg
    Time period covered
    2002
    Area covered
    United Kingdom, Spain, Mexico, Estonia, New Caledonia, Slovakia, Romania, Bulgaria, Iceland, Malta
    Description

    The QoG Institute is an independent research institute within the Department of Political Science at the University of Gothenburg. Overall 30 researchers conduct and promote research on the causes, consequences and nature of Good Governance and the Quality of Government - that is, trustworthy, reliable, impartial, uncorrupted and competent government institutions.

    The main objective of our research is to address the theoretical and empirical problem of how political institutions of high quality can be created and maintained. A second objective is to study the effects of Quality of Government on a number of policy areas, such as health, the environment, social policy, and poverty.

    The dataset was created as part of a research project titled “Quality of Government and the Conditions for Sustainable Social Policy”. The aim of the dataset is to promote cross-national comparative research on social policy output and its correlates, with a special focus on the connection between social policy and Quality of Government (QoG).

    The data comes in three versions: one cross-sectional dataset, and two cross-sectional time-series datasets for a selection of countries. The two combined datasets are called “long” (year 1946-2009) and “wide” (year 1970-2005).

    The data contains six types of variables, each provided under its own heading in the codebook: Social policy variables, Tax system variables, Social Conditions, Public opinion data, Political indicators, Quality of government variables.

    QoG Social Policy Dataset can be downloaded from the Data Archive of the QoG Institute at http://qog.pol.gu.se/data/datadownloads/data-archive Its variables are now included in QoG Standard.

    Purpose:

    The primary aim of QoG is to conduct and promote research on corruption. One aim of the QoG Institute is to make publicly available cross-national comparative data on QoG and its correlates. The aim of the QoG Social Policy Dataset is to promote cross-national comparative research on social policy output and its correlates, with a special focus on the connection between social policy and Quality of Government (QoG).

    This is a cross-section dataset based on data from and around 2002 of the QoG Social Policy dataset. If no data for 2002 were available for a variable, data from the closest available year were used, though not further back in time than 1995.

    Samanni, Marcus. Jan Teorell, Staffan Kumlin, Stefan Dahlberg, Bo Rothstein, Sören Holmberg & Richard Svensson. 2012. The QoG Social Policy Dataset, version 4Apr12. University of Gothenburg:The Quality of Government Institute. http://www.qog.pol.gu.se

  19. 2020 Census Redistricting Data - Variable Names and Codes

    • gimi9.com
    + more versions
    Cite
    2020 Census Redistricting Data - Variable Names and Codes | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_2020-census-redistricting-data-variable-names-and-codes/
    Explore at:
    Description

    These are the variable codes for the datasets released as part of the 2020 decennial census redistricting data.

  20. Data from: Dataset used in article "A 2-dimensional guillotine cutting stock problem with variable-sized stock for the honeycomb cardboard industry"

    • data-staging.niaid.nih.gov
    • produccioncientifica.ucm.es
    Updated Jul 10, 2024
    Cite
    Terán-Viadero, Paula; Alonso-Ayuso, Antonio; Martín-Campo, F. Javier (2024). Dataset used in article "A 2-dimensional guillotine cutting stock problem with variable-sized stock for the honeycomb cardboard industry" [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_8033003
    Explore at:
    Dataset updated
    Jul 10, 2024
    Dataset provided by
    Complutense University of Madrid
    Rey Juan Carlos University
    Authors
    Terán-Viadero, Paula; Alonso-Ayuso, Antonio; Martín-Campo, F. Javier
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset presented is part of the one used in the article "A 2-dimensional guillotine cutting stock problem with variable-sized stock for the honeycomb cardboard industry" by P. Terán-Viadero, A. Alonso-Ayuso and F. Javier Martín-Campo, published in International Journal of Production Research (2023), doi: 10.1080/00207543.2023.2279129. In the paper mentioned above, two mathematical optimisation models are proposed for the Cutting Stock Problem in the honeycomb cardboard sector. This problem appears in a Spanish company and the models proposed have been tested with real orders received by the company, achieving a reduction of up to 50% in the leftover generated. The dataset presented here includes six of the twenty cases used in the paper (the rest cannot be presented for confidentiality reasons). For each case, the characteristics of the order and the solution obtained by the two models are provided for the different scenarios analysed in the paper.

    Version 1.1 contains the same data but renamed according to the instance names in the final version of the article. Version 1.2 adds the PDF with the accepted version of the article published in International Journal of Production Research (2023), doi: 10.1080/00207543.2023.2279129.
