63 datasets found
  1. Current Population Survey (CPS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the current population survey (cps) annual social and economic supplement (asec) with r

    the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no.

    despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.

    the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

    this new github repository contains three scripts:

    2005-2012 asec - download all microdata.R
    • download the fixed-width file containing household, family, and person records
    • import by separating this file into three tables, then merge 'em together at the person-level
    • download the fixed-width file containing the person-level replicate weights
    • merge the rectangular person-level file with the replicate weights, then store it in a sql database
    • create a new variable - one - in the data table

    2012 asec - analysis examples.R
    • connect to the sql database created by the 'download all microdata' program
    • create the complex sample survey object, using the replicate weights
    • perform a boatload of analysis examples

    replicate census estimates - 2011.R
    • connect to the sql database created by the 'download all microdata' program
    • create the complex sample survey object, using the replicate weights
    • match the sas output shown in the png file below

    2011 asec replicate weight sas output.png: statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document.

    click here to view these three scripts

    for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
    • the census bureau's current population survey page
    • the bureau of labor statistics' current population survey page
    • the current population survey's wikipedia article

    notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.

    confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
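    A minimal R sketch of the "create the complex sample survey object" step, using the survey package. The column names (marsupwt for the main person weight, pwwgt1-pwwgt160 for the replicate weights) follow Census Bureau documentation but are assumptions here, as is the name of the person-level table pulled from the SQL database:

      library(survey)
      # successive-difference replication, as the census usage instructions describe
      cps_design <-
        svrepdesign(
          weights = ~marsupwt,              # final person-level weight (assumed name)
          repweights = "pwwgt[1-9]",        # regex matching the replicate-weight columns
          type = "successive-difference",   # census SDR variance formula
          data = cps_asec_2012,             # person table read out of the sql database
          combined.weights = TRUE,
          mse = TRUE
        )
      svymean(~htotval, cps_design)         # htotval (assumed name): estimate + replicate SE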

  2. Dataset for Linear Regression with 2 IV and 1 DV

    • kaggle.com
    zip
    Updated Mar 25, 2025
    Cite
    Stable Space (2025). Dataset for Linear Regression with 2 IV and 1 DV [Dataset]. https://www.kaggle.com/datasets/sharmajicoder/dataset-for-linear-regression-with-2-iv-and-1-dv
    Explore at:
    zip (9351 bytes)
    Dataset updated
    Mar 25, 2025
    Authors
    Stable Space
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset for Linear Regression with two Independent variables and one Dependent variable. Focused on Testing, Visualization and Statistical Analysis. The dataset is synthetic and contains 100 instances.
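    A minimal R sketch of the intended use (the file name and the column names X1, X2, Y are assumptions; check the CSV header after unzipping):

      df <- read.csv("dataset.csv")      # hypothetical file name inside the zip
      fit <- lm(Y ~ X1 + X2, data = df)  # two independent variables, one dependent variable
      summary(fit)                       # coefficients, t-tests, R-squared
      plot(fit, which = 1)               # residuals vs fitted, for visual diagnostics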

  3. The probability of obtaining 40 gallons of water per minute from a 400...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 21, 2025
    Cite
    U.S. Geological Survey (2025). The probability of obtaining 40 gallons of water per minute from a 400 foot-deep well in New Hampshire [Dataset]. https://catalog.data.gov/dataset/the-probability-of-obtaining-40-gallons-of-water-per-minute-from-a-400-foot-deep-well-in-n
    Explore at:
    Dataset updated
    Nov 21, 2025
    Dataset provided by
    United States Geological Survey: http://www.usgs.gov/
    Area covered
    New Hampshire
    Description

    This geographic raster dataset presents statistical model predictions for the probability that a 400-foot-deep well will yield more than 40 gallons per minute, based on the available data. The model was developed as part of the New Hampshire Bedrock Aquifer Assessment, which was designed to provide information (including this raster) that can be used by communities, industry, professional consultants, and other interested parties to evaluate the ground-water development potential of the fractured-bedrock aquifer in the State. The assessment was done at statewide, regional, and well-field scales to identify relations that could increase the success of locating high-yield water supplies in the fractured-bedrock aquifer.

    Statewide, data were collected on well construction and yield, bedrock lithology, surficial geology, lineaments, topography, and various derivatives of these basic data sets. Regionally, geologic, fracture, and lineament data were collected for the Pinardville and Windham quadrangles in New Hampshire. The regional scale of the study examined the degree to which predictive well-yield relations, developed as part of the statewide reconnaissance investigation, could be improved by quadrangle-scale geologic mapping.

    Beginning in 1984, water-well contractors in the State were required to report detailed information on newly constructed wells to the New Hampshire Department of Environmental Services (NHDES). The reports contain basic data on well construction, including the six characteristics used in this study: well yield, well depth, well use, method of construction, date drilled, and depth to bedrock (or length of casing). The NHDES has determined accurate georeferenced locations for more than 20,000 wells reported since 1984. The availability of this large data set provided an opportunity for a statistical analysis of bedrock-well yields. Well yields in the database ranged from zero to greater than 500 gallons per minute (gal/min).

    Multivariate regression was used as the primary statistical method of analysis because it is the most efficient tool for predicting a single variable from many potential independent variables. The dependent variable explored in this study was the natural logarithm (ln) of the reported well yield. One complication with using well yield as a dependent variable is that yield is also a function of demand. An innovative statistical technique involving instrumental variables was implemented to compensate for the effect of demand on well yield. Results of the multivariate-regression model show that a variety of factors are either positively or negatively related to well yields (see USGS Professional Paper 1660). Model results are presented statewide as the probability of exceeding a 40-gallon-per-minute well yield from a 400-foot-deep well. Probability values in this raster (ranging from near 0 to near 100) have been multiplied by 100 in order to store the data more efficiently as a raster.
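    A hedged sketch of the instrumental-variables idea in R (AER::ivreg runs two-stage least squares; every variable name below is hypothetical, since the actual covariates are documented in USGS Professional Paper 1660):

      library(AER)
      # ln(yield) on geologic covariates plus demand, with demand instrumented
      # (here by well use) because reported yield is itself a function of demand
      iv_fit <- ivreg(log(yield_gpm) ~ lithology + lineament_dist + demand |
                        lithology + lineament_dist + well_use,
                      data = wells)
      summary(iv_fit)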

  4. UC_vs_US Statistic Analysis.xlsx

    • figshare.com
    xlsx
    Updated Jul 9, 2020
    Cite
    F. (Fabiano) Dalpiaz (2020). UC_vs_US Statistic Analysis.xlsx [Dataset]. http://doi.org/10.23644/uu.12631628.v1
    Explore at:
    xlsx
    Dataset updated
    Jul 9, 2020
    Dataset provided by
    Utrecht University
    Authors
    F. (Fabiano) Dalpiaz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sheet 1 (Raw-Data): The raw data of the study is provided, presenting the tagging results for the measures described in the paper. For each subject, it includes multiple columns:
    A. a sequential student ID
    B. an ID that defines a random group label and the notation
    C. the notation used: user stories or use cases
    D. the case they were assigned to: IFA, Sim, or Hos
    E. the subject's exam grade (total points out of 100); empty cells mean that the subject did not take the first exam
    F. a categorical representation of the grade (L/M/H), where H is greater than or equal to 80, M is between 65 (included) and 80 (excluded), and L otherwise
    G. the total number of classes in the student's conceptual model
    H. the total number of relationships in the student's conceptual model
    I. the total number of classes in the expert's conceptual model
    J. the total number of relationships in the expert's conceptual model
    K-O. the total number of encountered situations of alignment, wrong representation, system-oriented, omitted, and missing (see the tagging scheme below)
    P. the researchers' judgement of how well the derivation process was explained by the student: well explained (a systematic mapping that can be easily reproduced), partially explained (vague indication of the mapping), or not present

    Tagging scheme:
    Aligned (AL) - A concept is represented as a class in both models, either with the same name or using synonyms or clearly linkable names;
    Wrongly represented (WR) - A class in the domain expert model is incorrectly represented in the student model, either (i) via an attribute, method, or relationship rather than a class, or (ii) using a generic term (e.g., "user" instead of "urban planner");
    System-oriented (SO) - A class in CM-Stud that denotes a technical implementation aspect, e.g., access control. Classes that represent a legacy system or the system under design (portal, simulator) are legitimate;
    Omitted (OM) - A class in CM-Expert that does not appear in any way in CM-Stud;
    Missing (MI) - A class in CM-Stud that does not appear in any way in CM-Expert.

    All the calculations and information provided in the following sheets originate from that raw data.

    Sheet 2 (Descriptive-Stats): Shows a summary of statistics from the data collection, including the number of subjects per case, per notation, per process derivation rigor category, and per exam grade category.

    Sheet 3 (Size-Ratio): The number of classes in the student model divided by the number of classes in the expert model (the size ratio). We provide box plots to allow a visual comparison of the shape of the distribution, its central value, and its variability for each group (by case, notation, process, and exam grade). The primary focus in this study is on the number of classes; however, we also provide the size ratio for the number of relationships between student and expert model.

    Sheet 4 (Overall): Provides an overview of all subjects regarding the encountered situations, completeness, and correctness. Correctness is defined as the ratio of classes in a student model that are fully aligned with the classes in the corresponding expert model. It is calculated by dividing the number of aligned concepts (AL) by the sum of the number of aligned concepts (AL), omitted concepts (OM), system-oriented concepts (SO), and wrong representations (WR). Completeness, on the other hand, is defined as the ratio of classes in a student model that are correctly or incorrectly represented over the number of classes in the expert model. It is calculated by dividing the sum of aligned concepts (AL) and wrong representations (WR) by the sum of the number of aligned concepts (AL), wrong representations (WR), and omitted concepts (OM). The overview is complemented with diverging stacked bar charts that illustrate correctness and completeness.
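    The two definitions translate directly into R (AL, WR, SO, OM are the per-subject counts from columns K-O of Sheet 1):

      # correctness and completeness exactly as defined above
      correctness  <- function(AL, WR, SO, OM) AL / (AL + OM + SO + WR)
      completeness <- function(AL, WR, OM) (AL + WR) / (AL + WR + OM)
      correctness(10, 3, 2, 5)   # 0.50: half of the classes are fully aligned
      completeness(10, 3, 5)     # 0.72: 13 of 18 expert classes appear in some form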

    For Sheet 4, as well as for the following four sheets, diverging stacked bar charts are provided to visualize the effect of each of the independent and mediated variables. The charts are based on the relative numbers of encountered situations for each student. In addition, a "Buffer" is calculated which solely serves the purpose of constructing the diverging stacked bar charts in Excel. Finally, at the bottom of each sheet, the significance (t-test) and effect size (Hedges' g) for both completeness and correctness are provided; Hedges' g was calculated with an online tool (https://www.psychometrica.de/effect_size.html), and a sketch of the computation follows Sheet 8 below. The independent and moderating variables can be found as follows:

    Sheet 5 (By-Notation): Model correctness and model completeness are compared by notation - UC, US.

    Sheet 6 (By-Case): Model correctness and model completeness are compared by case - SIM, HOS, IFA.

    Sheet 7 (By-Process): Model correctness and model completeness are compared by how well the derivation process is explained - well explained, partially explained, not present.

    Sheet 8 (By-Grade): Model correctness and model completeness are compared by exam grade, converted to the categorical values High, Medium, and Low.
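    For the significance and effect-size rows at the bottom of Sheets 5-8, a sketch of Hedges' g as commonly defined (pooled standard deviation with a small-sample correction; whether the online tool applies exactly this correction factor is an assumption):

      hedges_g <- function(x, y) {
        nx <- length(x); ny <- length(y)
        s_pooled <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
        d <- (mean(x) - mean(y)) / s_pooled   # cohen's d
        d * (1 - 3 / (4 * (nx + ny) - 9))     # small-sample bias correction
      }
      # e.g. hedges_g(correctness_uc, correctness_us) for the By-Notation comparison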

  5. Youtube video statistics for 1 million videos

    • kaggle.com
    zip
    Updated Jun 29, 2020
    Cite
    Mattia Zeni (2020). Youtube video statistics for 1 million videos [Dataset]. https://www.kaggle.com/mattiazeni/youtube-video-statistics-1million-videos
    Explore at:
    zip (6696303511 bytes)
    Dataset updated
    Jun 29, 2020
    Authors
    Mattia Zeni
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Area covered
    YouTube
    Description

    Motivation

    Studying how YouTube videos become viral or, more generally, how they evolve in terms of views, likes, and subscriptions is a topic of interest in many disciplines. With this dataset you can study such phenomena using statistics on 1 million YouTube videos. The information was collected in 2013, when YouTube still exposed these data publicly; the functionality has since been removed, and such statistics are now available only to a video's owner. This makes the dataset unique.

    Context

    This dataset was generated with YOUStatAnalyzer, a tool I (Mattia Zeni) developed while working for CREATE-NET (www.create-net.org) within the framework of the CONGAS FP7 project (http://www.congas-project.eu). For the project we needed to collect and analyse the dynamics of YouTube video popularity. The dataset contains statistics for more than 1 million YouTube videos, chosen according to random keywords extracted from the WordNet library (http://wordnet.princeton.edu).

    The motivation that led us to develop the YOUStatAnalyzer collection tool and create this dataset is that an active research community works on the interplay among individual user preferences, social dynamics, and advertising mechanisms, and a common problem is the lack of open large-scale datasets; at the time, no suitable collection tool existed either. Today, YouTube has removed the possibility to view these data on each video's page, making this dataset unique.

    When using our dataset for research purposes, please cite it as:

    @INPROCEEDINGS{YOUStatAnalyzer, author={Mattia Zeni and Daniele Miorandi and Francesco {De Pellegrini}}, title = {{YOUStatAnalyzer}: a Tool for Analysing the Dynamics of {YouTube} Content Popularity}, booktitle = {Proc.\ 7th International Conference on Performance Evaluation Methodologies and Tools (Valuetools, Torino, Italy, December 2013)}, address = {Torino, Italy}, year = {2013} }

    Content

    The dataset contains statistics and metadata for 1 million YouTube videos, collected in 2013. The videos were chosen according to random keywords extracted from the WordNet library (http://wordnet.princeton.edu).

    Dataset structure

    The structure of a dataset entry is the following:

    {
      u'_id': u'9eToPjUnwmU',
      u'title': u'Traitor Compilation # 1 (Trouble ...',
      u'description': u'A traitor compilation by one are ...',
      u'category': u'Games',
      u'commentsNumber': u'6',
      u'publishedDate': u'2012-10-09T23:42:12.000Z',
      u'author': u'ServilityGaming',
      u'duration': u'208',
      u'type': u'video/3gpp',
      u'relatedVideos': [u'acjHy7oPmls', u'EhW2LbCjm7c', u'UUKigFAQLMA', ...],
      u'accessControl': {
        u'comment': {u'permission': u'allowed'},
        u'list': {u'permission': u'allowed'},
        u'videoRespond': {u'permission': u'moderated'},
        u'rate': {u'permission': u'allowed'},
        u'syndicate': {u'permission': u'allowed'},
        u'embed': {u'permission': u'allowed'},
        u'commentVote': {u'permission': u'allowed'},
        u'autoPlay': {u'permission': u'allowed'}
      },
      u'views': {
        u'cumulative': {u'data': [15.0, 25.0, 26.0, 26.0, ...]},
        u'daily': {u'data': [15.0, 10.0, 1.0, 0.0, ...]}
      },
      u'shares': {
        u'cumulative': {u'data': [0.0, 0.0, 0.0, 0.0, ...]},
        u'daily': {u'data': [0.0, 0.0, 0.0, 0.0, ...]}
      },
      u'watchtime': {
        u'cumulative': {u'data': [22.5666666667, 36.5166666667, 36.7, 36.7, ...]},
        u'daily': {u'data': [22.5666666667, 13.95, 0.166666666667, 0.0, ...]}
      },
      u'subscribers': {
        u'cumulative': {u'data': [0.0, 0.0, 0.0, 0.0, ...]},
        u'daily': {u'data': [-1.0, 0.0, 0.0, 0.0, ...]}
      },
      u'day': {
        u'data': [1349740800000.0, 1349827200000.0, 1349913600000.0, 1350000000000.0, ...]
      }
    }

    From the structure above it is possible to see which fields an entry in the dataset has. They can be divided into two sections:

    1) Video Information.

    _id -> The video ID; the unique identifier of an entry in the database.
    title -> The video's title.
    description -> The video's description.
    category -> The YouTube category the video belongs to.
    commentsNumber -> The number of comments posted by users.
    publishedDate -> The date the video was published.
    author -> The author of the video.
    duration -> The video duration in seconds.
    type -> The encoding type of the video.
    relatedVideos -> A list of related videos.
    accessControl -> A list of access policies for different aspects of the video.

    2) Video Statistics.

    Each video can have 4 different statistics variables: views, shares, subscribers, and watchtime. Recent videos have all of them, while older videos may have only the 'views' variable. Each variable has 2 dimensions, daily and cumulative.

    `views -> number of views collected by the vi...
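    A sketch for pulling one record into R, assuming the archive stores one JSON document per line (the actual layout inside the 6.7 GB zip may differ):

      library(jsonlite)
      con <- file("youtube_videos.json", "r")    # hypothetical file name
      video <- fromJSON(readLines(con, n = 1))   # parse one record into a nested list
      close(con)
      video$title
      plot(video$views$daily$data, type = "l",   # daily views over the video's lifetime
           xlab = "day index", ylab = "daily views")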

  6. Variable definition and statistics.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Nov 30, 2023
    Cite
    He, Weiming; Wang, Jiaxue; Guo, Linan (2023). Variable definition and statistics. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001021701
    Explore at:
    Dataset updated
    Nov 30, 2023
    Authors
    He, Weiming; Wang, Jiaxue; Guo, Linan
    Description

    China is one of the countries hardest hit by disasters. Disaster shocks not only cause a large number of casualties and extensive property damage but also affect the risk preferences of those who experience them. Current research has not reached a consensus on how disaster experience affects risk preferences. This paper empirically analyzes the effects of natural and man-made disasters on residents' risk preferences based on data from the 2019 China Household Financial Survey (CHFS). The results indicate that: (1) both natural and man-made disasters significantly increase residents' risk aversion, and man-made disasters have the greater impact; (2) educational background plays a negative moderating role in the impact of man-made disasters on residents' risk preferences; (3) natural disaster experiences have a greater impact on the risk preferences of rural residents, while man-made disaster experiences have a greater impact on the risk preferences of urban residents: natural disaster experiences make rural residents more risk-averse, while man-made disaster experiences make urban residents more risk-averse. The results provide new evidence and a new perspective on the negative impact of disaster shocks on residents' social lives.

  7. House Price Regression Dataset

    • kaggle.com
    zip
    Updated Sep 6, 2024
    Cite
    Prokshitha Polemoni (2024). House Price Regression Dataset [Dataset]. https://www.kaggle.com/datasets/prokshitha/home-value-insights
    Explore at:
    zip (27045 bytes)
    Dataset updated
    Sep 6, 2024
    Authors
    Prokshitha Polemoni
    Description

    Home Value Insights: A Beginner's Regression Dataset

    This dataset is designed for beginners to practice regression problems, particularly in the context of predicting house prices. It contains 1000 rows, with each row representing a house and various attributes that influence its price. The dataset is well-suited for learning basic to intermediate-level regression modeling techniques.

    Features:

    1. Square_Footage: The size of the house in square feet. Larger homes typically have higher prices.
    2. Num_Bedrooms: The number of bedrooms in the house. More bedrooms generally increase the value of a home.
    3. Num_Bathrooms: The number of bathrooms in the house. Houses with more bathrooms are typically priced higher.
    4. Year_Built: The year the house was built. Older houses may be priced lower due to wear and tear.
    5. Lot_Size: The size of the lot the house is built on, measured in acres. Larger lots tend to add value to a property.
    6. Garage_Size: The number of cars that can fit in the garage. Houses with larger garages are usually more expensive.
    7. Neighborhood_Quality: A rating of the neighborhood’s quality on a scale of 1-10, where 10 indicates a high-quality neighborhood. Better neighborhoods usually command higher prices.
    8. House_Price (Target Variable): The price of the house, which is the dependent variable you aim to predict.

    Potential Uses:

    1. Beginner Regression Projects: This dataset can be used to practice building regression models such as Linear Regression, Decision Trees, or Random Forests. The target variable (house price) is continuous, making this an ideal problem for supervised learning techniques.

    2. Feature Engineering Practice: Learners can create new features by combining existing ones, such as the price per square foot or age of the house, providing an opportunity to experiment with feature transformations.

    3. Exploratory Data Analysis (EDA): You can explore how different features (e.g., square footage, number of bedrooms) correlate with the target variable, making it a great dataset for learning about data visualization and summary statistics.

    4. Model Evaluation: The dataset allows for various model evaluation techniques such as cross-validation, R-squared, and Mean Absolute Error (MAE). These metrics can be used to compare the effectiveness of different models.
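    A minimal baseline sketch for the uses above (the feature names come from the list earlier in this description; the CSV file name is an assumption):

      homes <- read.csv("house_price_regression_dataset.csv")
      fit <- lm(House_Price ~ Square_Footage + Num_Bedrooms + Num_Bathrooms +
                  Year_Built + Lot_Size + Garage_Size + Neighborhood_Quality,
                data = homes)
      summary(fit)$r.squared                       # variance explained
      mean(abs(fitted(fit) - homes$House_Price))   # in-sample mean absolute error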

    Versatility:

    • The dataset is highly versatile for a range of machine learning tasks. You can apply simple linear models to predict house prices based on one or two features, or use more complex models like Random Forest or Gradient Boosting Machines to understand interactions between variables.

    • It can also be used for dimensionality reduction techniques like PCA or to practice handling categorical variables (e.g., neighborhood quality) through encoding techniques like one-hot encoding.

    • This dataset is ideal for anyone wanting to gain practical experience in building regression models while working with real-world features.

  8. Simulation Data Set

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Simulation Data Set [Dataset]. https://catalog.data.gov/dataset/simulation-data-set
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency: http://www.epa.gov/
    Description

    These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR), as in the actual application to the true NC birth records data. The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects the identifiability of the spatial locations used in the analysis.

    This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects, and because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. File format: R workspace file, "Simulated_Dataset.RData".

    Metadata (data dictionary):
    • y: Vector of binary responses (1: adverse outcome, 0: control)
    • x: Matrix of covariates; one row for each simulated individual
    • z: Matrix of standardized pollution exposures
    • n: Number of simulated individuals
    • m: Number of exposure time periods (e.g., weeks of pregnancy)
    • p: Number of columns in the covariate design matrix
    • alpha_true: Vector of "true" critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

    Code abstract: We provide R statistical software code ("CWVS_LMC.txt") to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code ("Results_Summary.txt") to summarize and plot the estimated critical windows and posterior marginal inclusion probabilities. "CWVS_LMC.txt" is delivered as a .txt file containing R code; once the "Simulated_Dataset.RData" workspace has been loaded into R, it can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities. "Results_Summary.txt" is likewise a .txt file of R code; once "CWVS_LMC.txt" has been applied to the simulated dataset and the program has completed, it can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

    Required R packages:
    • For running "CWVS_LMC.txt": msm (sampling from the truncated normal distribution), mnormt (sampling from the multivariate normal distribution), BayesLogit (sampling from the Polya-Gamma distribution)
    • For running "Results_Summary.txt": plotrix (plotting the posterior means and credible intervals)

    Reproducibility: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 of the presented simulation study. How to use the information:
    • Load the "Simulated_Dataset.RData" workspace
    • Run the code contained in "CWVS_LMC.txt"
    • Once the "CWVS_LMC.txt" code is complete, run "Results_Summary.txt"

    Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining the confidentiality of any actual pregnant women.

    Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This also allows the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

    This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
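    A sketch of the three-step replication procedure in R (file names as given above; since the scripts are plain-text R code, source() should work once they sit in the working directory):

      # reproduce the simulated-data analysis described above
      # required packages: msm, mnormt, BayesLogit, plotrix
      load("Simulated_Dataset.RData")   # provides y, x, z, n, m, p, alpha_true
      source("CWVS_LMC.txt")            # fit the CWVS-LMC model (MCMC; may run long)
      source("Results_Summary.txt")     # plot critical windows and inclusion probabilities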

  9. Public Health Indicators in Chicago

    • kaggle.com
    zip
    Updated Jan 24, 2023
    Cite
    The Devastator (2023). Public Health Indicators in Chicago [Dataset]. https://www.kaggle.com/datasets/thedevastator/public-health-indicators-in-chicago
    Explore at:
    zip (5864 bytes)
    Dataset updated
    Jan 24, 2023
    Authors
    The Devastator
    Area covered
    Chicago
    Description

    Public Health Indicators in Chicago

    Natality, Mortality, Infectious Disease, Lead Poisoning and Economic Status

    By City of Chicago [source]

    About this dataset

    This public health dataset contains a comprehensive selection of indicators related to natality, mortality, infectious disease, lead poisoning, and economic status from Chicago community areas. It is an invaluable resource for those interested in understanding the current state of public health within each area in order to identify any deficiencies or areas of improvement needed.

    The data includes 27 indicators such as birth and death rates, prenatal care beginning in first trimester percentages, preterm birth rates, breast cancer incidences per hundred thousand female population, all-sites cancer rates per hundred thousand population and more. For each indicator provided it details the geographical region so that analyses can be made regarding trends on a local level. Furthermore this dataset allows various stakeholders to measure performance along these indicators or even compare different community areas side-by-side.

    This dataset provides a valuable tool for those striving toward better public health outcomes for the citizens of Chicago's communities: it allows greater insight into trends specific to geographic regions, which could lead to further research and implementation practices based on the empirical evidence gathered from this comprehensive yet digestible selection of indicators.


    How to use the dataset

    In order to use this dataset effectively to assess the public health of a given area or areas in the city:
    • Understand which data is available: The list of data included in this dataset can be found above. It is important to know everything that is included, as well as the definitions, so that accurate conclusions can be drawn when using the data for research or analysis.
    • Identify areas of interest: Once you are familiar with the available data, identify which community areas you would like to study more closely or compare with one another.
    • Choose your variables: Once you have identified your areas, decide which variables are most relevant and formulate specific research questions around them, based on what you are trying to learn from this data set.
    • Analyze the data: With your variables selected and clarified, dive right into analyzing the corresponding values across different community areas using statistical tests such as t-tests or correlations. This will help answer questions like "Are there significant differences between two outputs?", letting you compare how different Chicago community areas stack up against each other on the public health statistics tracked by this dataset (a short sketch follows below).
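    A sketch of that t-test step (the file name comes from the Columns section below; the split into two groups of community areas is arbitrary, purely to illustrate the comparison):

      chi <- read.csv("public-health-statistics-selected-public-health-indicators-by-chicago-community-area-1.csv")
      grp_a <- chi$Birth.Rate[chi$Community.Area <= 38]   # community areas 1-38
      grp_b <- chi$Birth.Rate[chi$Community.Area > 38]    # community areas 39-77
      t.test(grp_a, grp_b)                                # significant difference in birth rates?
      cor.test(chi$Birth.Rate, chi$General.Fertility.Rate)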

    Research Ideas

    • Creating interactive maps that show data on public health indicators by Chicago community area to allow users to explore the data more easily.
    • Designing a machine learning model to predict future variations in public health indicators by Chicago community area such as birth rate, preterm births, and childhood lead poisoning levels.
    • Developing an app that enables users to search for public health information in their own community areas and compare with other areas within the city or across different cities in the US

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    See the dataset description for more information.

    Columns

    File: public-health-statistics-selected-public-health-indicators-by-chicago-community-area-1.csv

    | Column name | Description |
    |:---|:---|
    | Community Area | Unique identifier for each community area in Chicago. (Integer) |
    | Community Area Name | Name of the community area in Chicago. (String) |
    | Birth Rate | Number of live births per 1,000 population. (Float) |
    | General Fertility Rate | Number of live births per 1,000 women aged 15-44. (Float) |
    ...

  10. atp1d

    • openml.org
    Updated Mar 14, 2019
    Cite
    (2019). atp1d [Dataset]. https://www.openml.org/search?type=data&id=41475
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 14, 2019
    Description

    Multivariate regression data set from https://link.springer.com/article/10.1007%2Fs10994-016-5546-z: The Airline Ticket Price dataset concerns the prediction of airline ticket prices. The rows are a sequence of time-ordered observations over several days. Each sample in this dataset represents a set of observations from a specific observation date and departure date pair. The input variables for each sample are values that may be useful for predicting the airline ticket prices for a specific departure date. The target variables in these datasets are the next-day (ATP1D) price or the minimum price observed over the next 7 days (ATP7D) for 6 target flight preferences: (1) any airline with any number of stops, (2) any airline non-stop only, (3) Delta Airlines, (4) Continental Airlines, (5) Airtran Airlines, and (6) United Airlines. The input variables include the following types: the number of days between the observation date and the departure date (1 feature), boolean variables for the day-of-the-week of the observation date (7 features), and the complete enumeration of the following 4 values: (1) the minimum price, mean price, and number of quotes, (2) from all airlines and from each airline quoting more than 50% of the observation days, (3) for non-stop, one-stop, and two-stop flights, (4) for the current day, the previous day, and two days previous. The result is a feature set of 411 variables. For specific details on how these datasets are constructed, please consult Groves and Gini (2015). The nature of these datasets is heterogeneous, with a mixture of several types of variables including boolean variables, prices, and counts.
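    A sketch for loading the data by its OpenML id (41475, taken from the citation URL above), assuming the OpenML R package:

      library(OpenML)
      atp1d <- getOMLDataSet(data.id = 41475)   # fetches data and metadata from openml.org
      dim(atp1d$data)                           # expect 411 input variables plus the targets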

  11. Mortality Statistics in US Cities

    • kaggle.com
    zip
    Updated Jan 23, 2023
    Cite
    The Devastator (2023). Mortality Statistics in US Cities [Dataset]. https://www.kaggle.com/datasets/thedevastator/mortality-statistics-in-us-cities
    Explore at:
    zip (96624 bytes)
    Dataset updated
    Jan 23, 2023
    Authors
    The Devastator
    Area covered
    United States
    Description

    Mortality Statistics in US Cities

    Deaths by Age and Cause of Death in 2016

    By Health [source]

    About this dataset

    This dataset contains mortality statistics for 122 U.S. cities in 2016, providing detailed information about all deaths that occurred due to any cause, including pneumonia and influenza. The data is voluntarily reported by cities with populations of 100,000 or more, and it includes the place of death and the week during which the death certificate was filed. Data is broken down by age group and includes a flag indicating the reliability of each data set to help inform analysis. Each row also provides longitude and latitude information for each reporting area in order to make further analysis easier. These comprehensive mortality statistics are an invaluable resource for tracking disease trends, as well as for making comparisons between different areas across the country in order to identify public health risks quickly and effectively.


    How to use the dataset

    This dataset contains mortality rates for 122 U.S. cities in 2016, including deaths by age group and cause of death. The data can be used to study various trends in mortality and contribute to the understanding of how different diseases impact different age groups across the country.

    In order to use the data, firstly one has to identify which variables they would like to use from this dataset. These include: reporting area; MMWR week; All causes by age greater than 65 years; All causes by age 45-64 years; All causes by age 25-44 years; All causes by age 1-24 years; All causes less than 1 year old; Pneumonia and Influenza total fatalities; Location (1 & 2); flag indicating reliability of data.

    Once you have identified the variables that you are interested in, you will need to filter the dataset so that it only includes information relevant to your analysis or research purposes. For example, if you are looking at trends between different ages, all you need is information on those three specific cause groups (greater than 65, 45-64, and 25-44). You can do this using a selection tool that lets you pick only certain columns from your data set, or an Excel filter tool if your data is stored as a csv file.

    The next step is preparing your data. This matters for efficient analysis and is especially helpful when there are too many variables or columns, which can muddle the analysis: eliminate unnecessary columns, rename column labels where needed, and so on. In addition, make sure to clean up any missing values, outliers, or incorrect entries before further investigation. Remember, outliers or corrupt entries may lead to incorrect conclusions when analyzing the data. Once the cleaning steps are complete, it is safe to move on to drawing insights.

    The last step involves using statistical methods, such as linear regression with multiple predictors, or descriptive measures, such as the mean and median, to draw key insights from the analysis done so far and generate actionable points; a short sketch follows below.
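    A sketch of those filter-and-regress steps (the column names are hypothetical stand-ins; check names(mort) against the real headers in rows.csv first):

      mort <- read.csv("rows.csv")
      names(mort)                       # verify the actual column names before filtering
      sub <- na.omit(mort[, c("mmwr_week", "deaths_65_plus",
                              "deaths_45_64", "deaths_25_44")])
      fit <- lm(deaths_65_plus ~ mmwr_week, data = sub)   # weekly trend, ages 65+
      summary(fit)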

    With these steps taken care of, it becomes easier for anyone to dive into another project involving this dataset, building on the work done in previous investigations.

    Research Ideas

    • Creating population health profiles for cities in the U.S.
    • Tracking public health trends across different age groups
    • Analyzing correlations between mortality and geographical locations

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: Dataset copyright by authors. You are free to:
    • Share - copy and redistribute the material in any medium or format for any purpose, even commercially.
    • Adapt - remix, transform, and build upon the material for any purpose, even commercially.
    You must:
    • Give appropriate credit - provide a link to the license, and indicate if changes were made.
    • ShareAlike - you must distribute your contributions under the same license as the original.
    • Keep intact - all notices that refer to this license, including copyright notices.

    Columns

    File: rows.csv

    | Column name | Description |
    |:--------------------------------------------|:-----------------------------------...

  12. Background data for: Latent-variable modeling of ordinal outcomes in...

    • dataverse.azure.uit.no
    • dataone.org
    • +1more
    pdf, text/tsv, txt
    Updated Jul 17, 2025
    Cite
    Manfred Krug; Fabian Vetter; Lukas Sönning (2025). Background data for: Latent-variable modeling of ordinal outcomes in language data analysis [Dataset]. http://doi.org/10.18710/WI9TEH
    Explore at:
    txt (8660), pdf (287207), pdf (160867), text/tsv (1079156), text/tsv (4475)
    Dataset updated
    Jul 17, 2025
    Dataset provided by
    DataverseNO
    Authors
    Manfred Krug; Fabian Vetter; Lukas Sönning
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2008 - Dec 31, 2018
    Area covered
    Malta
    Dataset funded by
    German Humboldt Foundation
    Spanish Ministry of Education and Science with European Regional Development Fund
    Bavarian Ministry for Science, Research and the Arts
    Description

    This dataset contains tabular files with information about the usage preferences of speakers of Maltese English with regard to 63 pairs of lexical expressions. These pairs (e.g. truck-lorry or realization-realisation) are known to differ in usage between BrE and AmE (cf. Algeo 2006). The data were elicited with a questionnaire that asks informants to indicate whether they always use one of the two variants, prefer one over the other, have no preference, or do not use either expression (see Krug and Sell 2013 for methodological details). Usage preferences were therefore measured on a symmetric 5-point ordinal scale. Data were collected from 2008 to 2018 as part of a larger research project on lexical and grammatical variation in settings where English is spoken as a native, second, or foreign language. The current dataset, which we use for our methodological study on ordinal data modeling strategies, consists of a subset of 500 speakers that is roughly balanced on year of birth.

    Abstract (related publication): In empirical work, ordinal variables are typically analyzed using means based on numeric scores assigned to categories. While this strategy has met with justified criticism in the methodological literature, it also generates simple and informative data summaries, a standard often not met by statistically more adequate procedures. Motivated by a survey of how ordered variables are dealt with in language research, we draw attention to an un(der)used latent-variable approach to ordinal data modeling, which constitutes an alternative perspective on the most widely used form of ordered regression, the cumulative model. Since the latent-variable approach does not feature in any of the studies in our survey, we believe it is worthwhile to promote its benefits. To this end, we draw on questionnaire-based preference ratings by speakers of Maltese English, who indicated on a 5-point scale which of two synonymous expressions (e.g. package-parcel) they (tend to) use. We demonstrate that a latent-variable formulation of the cumulative model affords nuanced and interpretable data summaries that can be visualized effectively, while at the same time avoiding limitations inherent in mean response models (e.g. distortions induced by floor and ceiling effects). The online supplementary materials include a tutorial for its implementation in R.
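    In that spirit, a hedged sketch of the cumulative model in its latent-variable (probit) formulation, using the ordinal package; the data frame and column names are hypothetical, and the paper's supplementary tutorial is the authoritative implementation:

      library(ordinal)
      # rating: the symmetric 5-point preference scale (must be an ordered factor);
      # yob: year of birth, the balancing variable mentioned above
      fit <- clm(ordered(rating) ~ yob, data = maltese_english, link = "probit")
      summary(fit)   # thresholds cut a latent continuous preference into 5 categories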

  13. Data from: WiBB: An integrated method for quantifying the relative...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1more
    zip
    Updated Aug 20, 2021
    Cite
    Qin Li; Xiaojun Kou (2021). WiBB: An integrated method for quantifying the relative importance of predictive variables [Dataset]. http://doi.org/10.5061/dryad.xsj3tx9g1
    Explore at:
    zip
    Dataset updated
    Aug 20, 2021
    Dataset provided by
    Field Museum of Natural History
    Beijing Normal University
    Authors
    Qin Li; Xiaojun Kou
    License

    CC0 1.0 Universal: https://spdx.org/licenses/CC0-1.0.html

    Description

    This dataset contains simulated datasets, empirical data, and R scripts described in the paper: “Li, Q. and Kou, X. (2021) WiBB: An integrated method for quantifying the relative importance of predictive variables. Ecography (DOI: 10.1111/ecog.05651)”.

    A fundamental goal of scientific research is to identify the underlying variables that govern crucial processes of a system. Here we proposed a new index, WiBB, which integrates the merits of several existing methods: a model-weighting method from information theory (Wi), a standardized regression coefficient method measured by β* (B), and the bootstrap resampling technique (B). We applied WiBB to simulated datasets with known correlation structures, for both linear models (LM) and generalized linear models (GLM), to evaluate its performance. We also applied two other methods, the relative sum of weights (SWi) and the standardized beta (β*), to compare their performance with that of WiBB in ranking predictor importance under various scenarios. We further applied it to an empirical dataset of the plant genus Mimulus to select bioclimatic predictors of species' presence across the landscape. Results on the simulated datasets showed that the WiBB method outperformed the β* and SWi methods in scenarios with small and large sample sizes, respectively, and that the bootstrap resampling technique significantly improved the discriminant ability. When testing WiBB on the empirical dataset with GLM, it sensibly identified four important predictors with high credibility out of six candidates in modeling the geographical distributions of 71 Mimulus species. This integrated index has great advantages in evaluating predictor importance and hence reducing the dimensionality of data, without losing interpretive power. The simplicity of calculation of the new metric, compared with more sophisticated statistical procedures, makes it a handy addition to the statistical toolbox.

    Methods: To simulate independent datasets (size = 1000), we adopted Galipaud et al.'s (2014) approach with custom modifications of the data.simulation function, which used the multivariate normal distribution function rmvnorm in the R package mvtnorm (v1.0-5, Genz et al. 2016). Each dataset was simulated with a preset correlation structure between a response variable (y) and four predictors (x1, x2, x3, x4). The first three (genuine) predictors were set to be strongly, moderately, and weakly correlated with the response variable, respectively (denoted by large, medium, and small Pearson correlation coefficients, r), while the correlation between the response and the last (spurious) predictor was set to zero. We simulated datasets with three levels of differences between the correlation coefficients of consecutive predictors, ∆r = 0.1, 0.2, 0.3. These three levels of ∆r resulted in three correlation structures between the response and the four predictors: (0.3, 0.2, 0.1, 0.0), (0.6, 0.4, 0.2, 0.0), and (0.8, 0.6, 0.3, 0.0). We repeated the simulation procedure 200 times for each of the three preset correlation structures (600 datasets in total) for later LM fitting. For GLM fitting, we modified the simulation procedure with additional steps, converting the continuous response into binary data O (e.g., occurrence data with 0 for absence and 1 for presence). We tested the WiBB method, along with the relative sum of weights (SWi) and the standardized beta (β*), to evaluate the ability to correctly rank predictor importance under various scenarios. The empirical dataset of 71 Mimulus species was compiled from occurrence coordinates and corresponding values extracted from climatic layers of the WorldClim dataset (www.worldclim.org), and we applied the WiBB method to infer important predictors of the species' geographical distributions.
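    A condensed sketch of one such simulated dataset (the ∆r = 0.2 structure) using rmvnorm, as named above; this is a simplification of the modified data.simulation function, not a copy of it:

      library(mvtnorm)
      r <- c(0.6, 0.4, 0.2, 0.0)             # preset correlations of x1..x4 with y
      sigma <- diag(5)                       # predictors mutually uncorrelated here
      sigma[1, 2:5] <- r
      sigma[2:5, 1] <- r
      dat <- as.data.frame(rmvnorm(1000, mean = rep(0, 5), sigma = sigma))
      names(dat) <- c("y", "x1", "x2", "x3", "x4")
      round(cor(dat)[1, ], 2)                # should come out near 0.6, 0.4, 0.2, 0.0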

  14. Data from: Macaques preferentially attend to intermediately surprising...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Apr 26, 2022
    Cite
    Shengyi Wu; Tommy Blanchard; Emily Meschke; Richard Aslin; Ben Hayden; Celeste Kidd (2022). Macaques preferentially attend to intermediately surprising information [Dataset]. http://doi.org/10.6078/D15Q7Q
    Explore at:
    zip
    Dataset updated
    Apr 26, 2022
    Dataset provided by
    University of Minnesota
    University of California, Berkeley
    Yale University
    Klaviyo
    Authors
    Shengyi Wu; Tommy Blanchard; Emily Meschke; Richard Aslin; Ben Hayden; Celeste Kidd
    License

    CC0 1.0 Universal: https://spdx.org/licenses/CC0-1.0.html

    Description

    Normative learning theories dictate that we should preferentially attend to informative sources, but only up to the point that our limited learning systems can process their content. Humans, including infants, show this predicted strategic deployment of attention. Here we demonstrate that rhesus monkeys, much like humans, attend to events of moderate surprisingness over both more and less surprising events. They do this in the absence of any specific goal or contingent reward, indicating that the behavioral pattern is spontaneous. We suggest this U-shaped attentional preference represents an evolutionarily preserved strategy for guiding intelligent organisms toward material that is maximally useful for learning.

    Methods: In this project, we collected gaze data from 5 macaques while they watched sequential visual displays designed to elicit probabilistic expectations. Gaze was recorded with the Eyelink Toolbox and sampled at 1000 Hz by an infrared eye-monitoring camera system.

    Dataset:

    "csv-combined.csv" is an aggregated dataset that includes one pop-up event per row for all original datasets for each trial. Here are descriptions of each column in the dataset:

    subj: subject_ID = {"B":104, "C":102, "H":101, "J":103, "K":203}
    trialtime: start time of the current trial in seconds
    trial: current trial number (each trial featured one of 80 possible visual-event sequences, in order)
    seq: current sequence number (one of 80 sequences)
    seq_item: current item number within a sequence (in order)
    active_item: pop-up item (active box)
    pre_active: prior pop-up item (active box) {-1: the first active object in the sequence, i.e., no active object before the currently active one}
    next_active: next pop-up item (active box) {-1: the last active object in the sequence, i.e., no active object after the currently active one}
    firstappear: {0: not first, 1: first appearance in the sequence}
    looks_blank: csv: total time spent looking at blank space for the current event (ms); csv_timestamp: {1: looking at blank space at this timestamp, 0: not}
    looks_offscreen: csv: total time spent looking offscreen for the current event (ms); csv_timestamp: {1: looking offscreen at this timestamp, 0: not}
    time till target: time until first looking at the target object (ms) {-1: never looked at the target}
    looks target: csv: time spent looking at the target object (ms); csv_timestamp: looking at the target or not at the current timestamp (1 or 0)
    look1, look2, look3: time spent looking at each object (ms)
    location 123X, 123Y: location of each box (the locations of the three boxes for a given sequence were chosen randomly but remained static throughout the sequence)
    item123id: pop-up item ID (remained static throughout a sequence)
    event time: total time for the whole event (pop-up and return) (ms)
    eyeposX, eyeposY: eye position at the current timestamp

    "csv-surprisal-prob.csv" is an output file from Monkilock_Data_Processing.ipynb. Surprisal values for each event were calculated and added to the "csv-combined.csv". Here are descriptions of each additional column:

    rt: time till target {-1: never looked at the target}. In data analysis, we included data with rt > 0.
    already_there: {NA: never looked at the target object}. In data analysis, we included events that are not the first event in a sequence, are not repeats of the previous event, and whose already_there is not NA.
    looks_away: {TRUE: the subject was looking away from the currently active object at this time point, FALSE: the subject was not looking away at this time point}
    prob: the probability of the occurrence of the object
    surprisal: unigram surprisal value
    bisurprisal: transitional surprisal value
    std_surprisal: standardized unigram surprisal value
    std_bisurprisal: standardized transitional surprisal value
    binned_surprisal_means: means of unigram surprisal values, binned into three groups of evenly spaced intervals according to surprisal values
    binned_bisurprisal_means: means of transitional surprisal values, binned into three groups of evenly spaced intervals according to surprisal values

    "csv-surprisal-prob_updated.csv" is a ready-for-analysis dataset generated by Analysis_Code_final.Rmd after standardizing controlled variables, changing data types for categorical variables for analysts, etc. "AllSeq.csv" includes event information of all 80 sequences

    Empty Values in Datasets:

    There are no missing values in the original dataset "csv-combined.csv". Missing values (marked as NA) occur in the columns "prev_active", "next_active", "already_there", "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" in "csv-surprisal-prob.csv" and "csv-surprisal-prob_updated.csv". NAs in "prev_active" and "next_active" indicate that the currently active object is the first or last in the sequence, so there is no active object before or after it. NAs in "already_there" mean that the subject never looked at the target object during the current event. When analyzing "already_there", we excluded rows whose "prev_active" or "already_there" value is NA. "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" are NA for the first event in a sequence, where the transitional probability cannot be computed because no event precedes it; when fitting models of transitional statistics, we excluded rows with NAs in these columns.
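
    These exclusion rules translate into simple filters; a minimal sketch in R (illustrative only, with column names as described above):

    dat <- read.csv("csv-surprisal-prob_updated.csv")

    # Reaction-time analyses: drop events where the target was never fixated.
    rt_data <- subset(dat, rt > 0)

    # "already_there" analyses: drop first events in a sequence (prev_active is NA)
    # and events where the subject never looked at the target (already_there is NA).
    at_data <- subset(dat, !is.na(prev_active) & !is.na(already_there))

    # Transitional-statistics models: drop first events in a sequence, whose
    # transitional surprisal cannot be computed.
    trans_data <- subset(dat, !is.na(bisurprisal) & !is.na(std_bisurprisal) &
                              !is.na(sq_std_bisurprisal))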

    Codes:

    In "Monkilock_Data_Processing.ipynb", we processed raw fixation data of 5 macaques and explored the relationship between their fixation patterns and the "surprisal" of events in each sequence. We computed the following variables which are necessary for further analysis, modeling, and visualizations in this notebook (see above for details): active_item, pre_active, next_active, firstappear ,looks_blank, looks_offscreen, time till target, looks target, look1,2,3, prob, surprisal, bisurprisal, std_surprisal, std_bisurprisal, binned_surprisal_means, binned_bisurprisal_means. "Analysis_Code_final.Rmd" is the main scripts that we further processed the data, built models, and created visualizations for data. We evaluated the statistical significance of variables using mixed effect linear and logistic regressions with random intercepts. The raw regression models include standardized linear and quadratic surprisal terms as predictors. The controlled regression models include covariate factors, such as whether an object is a repeat, the distance between the current and previous pop up object, trial number. A generalized additive model (GAM) was used to visualize the relationship between the surprisal estimate from the computational model and the behavioral data. "helper-lib.R" includes helper functions used in Analysis_Code_final.Rmd

  15. 2023 Census totals by topic for individuals by statistical area 1 – part 1

    • datafinder.stats.govt.nz
    csv, dwg, geodatabase +6
    Updated Nov 14, 2024
    + more versions
    Cite
    Stats NZ (2024). 2023 Census totals by topic for individuals by statistical area 1 – part 1 [Dataset]. https://datafinder.stats.govt.nz/layer/120766-2023-census-totals-by-topic-for-individuals-by-statistical-area-1-part-1/
    Explore at:
    geodatabase, dwg, mapinfo mif, shapefile, csv, kml, geopackage / sqlite, pdf, mapinfo tab (available download formats)
    Dataset updated
    Nov 14, 2024
    Dataset provided by
    Statistics New Zealand (http://www.stats.govt.nz/)
    Authors
    Stats NZ
    License

    https://datafinder.stats.govt.nz/license/attribution-4-0-international/

    Description

    Dataset contains counts and measures for individuals from the 2013, 2018, and 2023 Censuses. Data is available by statistical area 1.

    The variables included in this dataset are for the census usually resident population count (unless otherwise stated). All data is for level 1 of the classification (unless otherwise stated).

    The variables for part 1 of the dataset are:

    • Census usually resident population count
    • Census night population count
    • Age (5-year groups)
    • Age (life cycle groups)
    • Median age
    • Birthplace (NZ born/overseas born)
    • Birthplace (broad geographic areas)
    • Ethnicity (total responses) for level 1 and ‘Other Ethnicity’ grouped by ‘New Zealander’ and ‘Other Ethnicity nec’
    • Māori descent indicator
    • Languages spoken (total responses)
    • Official language indicator
    • Gender
    • Sex at birth
    • Rainbow/LGBTIQ+ indicator for the census usually resident population count aged 15 years and over
    • Sexual identity for the census usually resident population count aged 15 years and over
    • Legally registered relationship status for the census usually resident population count aged 15 years and over
    • Partnership status in current relationship for the census usually resident population count aged 15 years and over
    • Number of children born for the sex at birth female census usually resident population count aged 15 years and over
    • Average number of children born for the sex at birth female census usually resident population count aged 15 years and over
    • Religious affiliation (total responses)
    • Cigarette smoking behaviour for the census usually resident population count aged 15 years and over
    • Disability indicator for the census usually resident population count aged 5 years and over
    • Difficulty communicating for the census usually resident population count aged 5 years and over
    • Difficulty hearing for the census usually resident population count aged 5 years and over
    • Difficulty remembering or concentrating for the census usually resident population count aged 5 years and over
    • Difficulty seeing for the census usually resident population count aged 5 years and over
    • Difficulty walking for the census usually resident population count aged 5 years and over
    • Difficulty washing for the census usually resident population count aged 5 years and over.

    Download lookup file for part 1 from Stats NZ ArcGIS Online or embedded attachment in Stats NZ geographic data service. Download data table (excluding the geometry column for CSV files) using the instructions in the Koordinates help guide.

    Footnotes

    Te Whata

    Under the Mana Ōrite Relationship Agreement, Te Kāhui Raraunga (TKR) will be publishing Māori descent and iwi affiliation data from the 2023 Census in partnership with Stats NZ. This will be available on Te Whata, a TKR platform.

    Geographical boundaries

    Statistical standard for geographic areas 2023 (updated December 2023) has information about geographic boundaries as of 1 January 2023. Address data from 2013 and 2018 Censuses was updated to be consistent with the 2023 areas. Due to the changes in area boundaries and coding methodologies, 2013 and 2018 counts published in 2023 may be slightly different to those published in 2013 or 2018.

    Subnational census usually resident population

    The census usually resident population count of an area (subnational count) is a count of all people who usually live in that area and were present in New Zealand on census night. It excludes visitors from overseas, visitors from elsewhere in New Zealand, and residents temporarily overseas on census night. For example, a person who usually lives in Christchurch city and is visiting Wellington city on census night will be included in the census usually resident population count of Christchurch city.

    Population counts

    Stats NZ publishes a number of different population counts, each using a different definition and methodology. Population statistics – user guide has more information about different counts.

    Caution using time series

    Time series data should be interpreted with care due to changes in census methodology and differences in response rates between censuses. The 2023 and 2018 Censuses used a combined census methodology (using census responses and administrative data), while the 2013 Census used a full-field enumeration methodology (with no use of administrative data).

    Study participation time series

    In the 2013 Census study participation was only collected for the census usually resident population count aged 15 years and over.

    About the 2023 Census dataset

    For information on the 2023 dataset see Using a combined census model for the 2023 Census. We combined data from the census forms with administrative data to create the 2023 Census dataset, which meets Stats NZ's quality criteria for population structure information. We added real data about real people to the dataset where we were confident that people who had not completed a census form should be counted; this is known as admin enumeration. We also used data from the 2018 and 2013 Censuses, administrative data sources, and statistical imputation methods to fill in some missing characteristics of people and dwellings.

    Data quality

    The quality of data in the 2023 Census is assessed using the quality rating scale and the quality assurance framework to determine whether data is fit for purpose and suitable for release. Data quality assurance in the 2023 Census has more information.

    Concept descriptions and quality ratings

    Data quality ratings for 2023 Census variables has additional details about variables found within totals by topic, for example, definitions and data quality.

    Disability indicator

    This data should not be used as an official measure of disability prevalence. Disability prevalence estimates are only available from the 2023 Household Disability Survey. Household Disability Survey 2023: Final content has more information about the survey.

    Activity limitations are measured using the Washington Group Short Set (WGSS). The WGSS asks about six basic activities that a person might have difficulty with: seeing, hearing, walking or climbing stairs, remembering or concentrating, washing all over or dressing, and communicating. A person was classified as disabled in the 2023 Census if they had a lot of difficulty with, or could not do at all, at least one of these activities.

    Using data for good

    Stats NZ expects that, when working with census data, it is done so with a positive purpose, as outlined in the Māori Data Governance Model (Data Iwi Leaders Group, 2023). This model states that "data should support transformative outcomes and should uplift and strengthen our relationships with each other and with our environments. The avoidance of harm is the minimum expectation for data use. Māori data should also contribute to iwi and hapū tino rangatiratanga”.

    Confidentiality

    The 2023 Census confidentiality rules have been applied to 2013, 2018, and 2023 data. These rules protect the confidentiality of individuals, families, households, dwellings, and undertakings in 2023 Census data. Counts are calculated using fixed random rounding to base 3 (FRR3) and suppression of ‘sensitive’ counts less than six, where tables report multiple geographic variables and/or small populations. Individual figures may not always sum to stated totals. Applying confidentiality rules to 2023 Census data and summary of changes since 2018 and 2013 Censuses has more information about 2023 Census confidentiality rules.
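
    For intuition, here is an illustrative sketch of graduated random rounding to base 3; the actual fixed random rounding (FRR3) procedure also keeps each cell's rounded value consistent across all outputs, which this sketch only approximates with a fixed seed:

    # Illustrative RR3 sketch, not Stats NZ's implementation: multiples of 3
    # are unchanged; other counts move to the nearer multiple of 3 with
    # probability 2/3 and to the further one with probability 1/3.
    rr3 <- function(counts, seed = 1) {
      set.seed(seed)                      # fixed seed -> repeatable ("fixed") rounding
      r <- counts %% 3
      down <- counts - r                  # nearest multiple of 3 at or below
      up   <- down + 3                    # next multiple of 3 above
      nearer  <- ifelse(r == 1, down, up)
      further <- ifelse(r == 1, up, down)
      rounded <- ifelse(runif(length(counts)) < 2/3, nearer, further)
      ifelse(r == 0, counts, rounded)
    }

    rr3(c(0, 4, 5, 11, 12))   # e.g. 4 -> 3 (p = 2/3) or 6 (p = 1/3); 12 stays 12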

    Measures

    Measures like averages, medians, and other quantiles are calculated from unrounded counts, with input noise added to or subtracted from each contributing value during measures calculations. Averages and medians based on fewer than six units (e.g. individuals, dwellings, households, families, or extended families) are suppressed. This suppression threshold changes for other quantiles. Where cells have been suppressed, a placeholder value has been used.

    Percentages

    To calculate percentages, divide the figure for the category of interest by the figure for 'Total stated' where this applies.

    Symbol

    -997 Not available

    -999 Confidential

    Inconsistencies in definitions

    Please note that there may be differences in definitions between census classifications and those used for other data collections.

  16. Effect of suicide rates on life expectancy dataset

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Apr 16, 2021
    Cite
    Filip Zoubek; Filip Zoubek (2021). Effect of suicide rates on life expectancy dataset [Dataset]. http://doi.org/10.5281/zenodo.4694270
    Explore at:
    csv (available download formats)
    Dataset updated
    Apr 16, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Filip Zoubek; Filip Zoubek
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0), https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    Effect of suicide rates on life expectancy dataset

    Abstract
    In 2015, approximately 55 million people died worldwide, of which 8 million committed suicide. In the USA, suicide is one of the leading causes of death, so this experiment deals with the question of how strongly suicide rates affect average life expectancy statistics.
    The experiment takes two datasets, one with the number of suicides and one with life expectancy, and combines them into a single dataset. Subsequently, I try to find patterns and correlations among the variables and perform a statistical test using simple regression to confirm my assumptions.

    Data

    The experiment uses two datasets - WHO Suicide Statistics [1] and WHO Life Expectancy [2] - which were first appropriately preprocessed. The final merged dataset has 13 variables, with country and year used as the index:

    Country

    Year

    Suicides number

    Life expectancy

    Adult Mortality = probability of dying between 15 and 60 years per 1000 population

    Infant deaths = number of infant deaths per 1000 population

    Alcohol = recorded per capita (15+) alcohol consumption

    Under-five deaths = number of under-five deaths per 1000 population

    HIV/AIDS = deaths per 1,000 live births from HIV/AIDS

    GDP = Gross Domestic Product per capita

    Population

    Income composition of resources = Human Development Index in terms of income composition of resources

    Schooling = number of years of schooling
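
    A minimal sketch of this merge-and-regress workflow in R, assuming hypothetical file and column names (the exact Kaggle headers may differ):

    # Illustrative only: file and column names are assumptions.
    suicides <- read.csv("who_suicide_statistics.csv")
    life     <- read.csv("who_life_expectancy.csv")

    # Combine the two datasets on the shared country/year index.
    merged <- merge(suicides, life, by = c("country", "year"))

    # Simple regression: how does the number of suicides relate to life expectancy?
    fit <- lm(life_expectancy ~ suicides_number, data = merged)
    summary(fit)

    plot(merged$suicides_number, merged$life_expectancy,
         xlab = "Suicides number", ylab = "Life expectancy")
    abline(fit)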

    LICENSE

    THE EXPERIMENT USES TWO DATASETS - WHO SUICIDE STATISTICS AND WHO LIFE EXPECTANCY - WHICH WERE COLLECTED FROM THE WHO AND UNITED NATIONS WEBSITES. THEREFORE, ALL DATASETS ARE UNDER THE LICENSE ATTRIBUTION-NONCOMMERCIAL-SHAREALIKE 3.0 IGO (https://creativecommons.org/licenses/by-nc-sa/3.0/igo/).

    [1] https://www.kaggle.com/szamil/who-suicide-statistics

    [2] https://www.kaggle.com/kumarajarshi/life-expectancy-who

  17. Popularity Dataset for Online Stats Training

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Aug 25, 2020
    Cite
    Rens van de Schoot; Rens van de Schoot (2020). Popularity Dataset for Online Stats Training [Dataset]. http://doi.org/10.5281/zenodo.3962123
    Explore at:
    bin (available download formats)
    Dataset updated
    Aug 25, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rens van de Schoot; Rens van de Schoot
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dataset used for the online stats training website (https://www.rensvandeschoot.com/tutorials/) and is based on the data used by van de Schoot, van der Velden, Boom, and Brugman (2010).

    The dataset is based on a study investigating the association between popularity status and antisocial behavior in at-risk adolescents (n = 1491), with gender and ethnic background as moderators of this association. The study distinguished subgroups within the popular status group in terms of overt and covert antisocial behavior. For more information on the sample, instruments, methodology, and research context, we refer interested readers to van de Schoot, van der Velden, Boom, and Brugman (2010).

    Variable name Description

    Respnr = Respondents’ number

    Dutch = Respondents’ ethnic background (0 = Dutch origin, 1 = non-Dutch origin)

    gender = Respondents’ gender (0 = boys, 1 = girls)

    sd = Adolescents’ socially desirable answering patterns

    covert = Covert antisocial behavior

    overt = Overt antisocial behavior

  18. Descriptive statistics for the study variables in Studies 1 and 2.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Dec 19, 2024
    Cite
    Strickland-Hughes, Carla M.; Ferguson, Leah; Davis, Elizabeth L.; Kürüm, Esra; Wu, Rachel; Sheffler, Pamela; Kyeong, Yena (2024). Descriptive statistics for the study variables in Studies 1 and 2. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001434414
    Explore at:
    Dataset updated
    Dec 19, 2024
    Authors
    Strickland-Hughes, Carla M.; Ferguson, Leah; Davis, Elizabeth L.; Kürüm, Esra; Wu, Rachel; Sheffler, Pamela; Kyeong, Yena
    Description

    Descriptive statistics for the study variables in Studies 1 and 2.

  19. Pre and Post-Exercise Heart Rate Analysis

    • kaggle.com
    zip
    Updated Sep 29, 2024
    Cite
    Abdullah M Almutairi (2024). Pre and Post-Exercise Heart Rate Analysis [Dataset]. https://www.kaggle.com/datasets/abdullahmalmutairi/pre-and-post-exercise-heart-rate-analysis
    Explore at:
    zip, 3857 bytes (available download formats)
    Dataset updated
    Sep 29, 2024
    Authors
    Abdullah M Almutairi
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Overview:

    This dataset contains simulated (hypothetical) but broadly realistic, AI-generated data on the sleep, heart rate, and exercise habits of 500 individuals. It includes both pre-exercise and post-exercise resting heart rates, allowing for analyses such as a dependent (paired-sample) t-test to observe changes in heart rate after an exercise program. The dataset also includes additional health-related variables, such as age, hours of sleep per night, and exercise frequency.

    The data is designed for tasks involving hypothesis testing, health analytics, or even machine learning applications that predict changes in heart rate based on personal attributes and exercise behavior. It can be used to understand the relationships between exercise frequency, sleep, and changes in heart rate.

    File: heart_rate_data.csv. File format: CSV.

    - Features (Columns):

    Age: Description: The age of the individual. Type: Integer Range: 18-60 years Relevance: Age is an important factor in determining heart rate and the effects of exercise.

    Sleep Hours: Description: The average number of hours the individual sleeps per night. Type: Float Range: 3.0 - 10.0 hours Relevance: Sleep is a crucial health metric that can impact heart rate and exercise recovery.

    Exercise Frequency (Days/Week): Description: The number of days per week the individual engages in physical exercise. Type: Integer Range: 1-7 days/week Relevance: More frequent exercise may lead to greater heart rate improvements and better cardiovascular health.

    Resting Heart Rate Before: Description: The individual’s resting heart rate measured before beginning a 6-week exercise program. Type: Integer Range: 50 - 100 bpm (beats per minute) Relevance: This is a key health indicator, providing a baseline measurement for the individual’s heart rate.

    Resting Heart Rate After: Description: The individual’s resting heart rate measured after completing the 6-week exercise program. Type: Integer Range: 45 - 95 bpm (lower than the "Resting Heart Rate Before" due to the effects of exercise). Relevance: This variable is essential for understanding how exercise affects heart rate over time, and it can be used to perform a dependent t-test analysis.

    Max Heart Rate During Exercise: Description: The maximum heart rate the individual reached during exercise sessions. Type: Integer Range: 120 - 190 bpm Relevance: This metric helps in understanding cardiovascular strain during exercise and can be linked to exercise frequency or fitness levels.

    Potential Uses:

    Dependent t-test analysis: The dataset is particularly suited for a dependent (paired) t-test comparing each individual's resting heart rate before and after the exercise program (see the sketch after this list).

    Exploratory Data Analysis (EDA): Investigate relationships between sleep, exercise frequency, and changes in heart rate. Potential analyses include correlations between sleep hours and resting heart rate improvement, or regression analyses to predict heart rate after exercise.

    Machine Learning: Use the dataset for predictive modeling, and build a beginner regression model to predict post-exercise heart rate using age, sleep, and exercise frequency as features.

    Health and Fitness Insights: This dataset can be useful for studying how different factors like sleep and age influence heart rate changes and overall cardiovascular health.
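
    A minimal R sketch of the paired t-test and the beginner regression described above; the snake_case column names are assumptions, since the exact CSV headers are not listed:

    # Illustrative only: column names are assumed (the real headers may be
    # e.g. "Resting Heart Rate Before").
    hr <- read.csv("heart_rate_data.csv")

    # Dependent (paired) t-test: resting heart rate before vs. after the
    # 6-week exercise program, one pair of measurements per individual.
    t.test(hr$resting_hr_before, hr$resting_hr_after, paired = TRUE)

    # Beginner regression: predict post-exercise resting heart rate.
    fit <- lm(resting_hr_after ~ age + sleep_hours + exercise_frequency, data = hr)
    summary(fit)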

    License: CC BY 4.0 (Attribution 4.0 International).

    Inspiration for Kaggle Users: How does exercise frequency influence the reduction in resting heart rate? Is there a relationship between sleep and heart rate improvements post-exercise? Can we predict the post-exercise heart rate using other health variables? How do age and exercise frequency interact to affect heart rate?

    Acknowledgments: This is a simulated dataset for educational purposes, generated to demonstrate statistical and machine learning applications in the field of health analytics.

  20. Tabular statistical summay of data analysis - Calawah River Riverscape Study...

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated May 24, 2025
    + more versions
    Cite
    (Point of Contact, Custodian) (2025). Tabular statistical summay of data analysis - Calawah River Riverscape Study [Dataset]. https://catalog.data.gov/dataset/tabular-statistical-summay-of-data-analysis-calawah-river-riverscape-study3
    Explore at:
    Dataset updated
    May 24, 2025
    Dataset provided by
    (Point of Contact, Custodian)
    Area covered
    Calawah River
    Description

    The objective of this study was to identify patterns of juvenile salmonid distribution and relative abundance in relation to habitat correlates. It is the first dataset of its kind because the entire river was snorkeled by one person in multiple years. During two consecutive summers, we completed a census of juvenile salmonids and stream habitat across a stream network. We used the data to test the ability of habitat models to explain the distribution of juvenile coho salmon (Oncorhynchus kisutch), young-of-the-year (age 0) steelhead (Oncorhynchus mykiss), and steelhead parr (≥ age 1) for a network consisting of several different sized streams. Our network-scale models, which included five stream habitat variables, explained 27%, 11%, and 19% of the variation in the density of juvenile coho salmon, age 0 steelhead, and steelhead parr, respectively. We found weak to strong levels of spatial autocorrelation in the model residuals (Moran's I values ranging from 0.25 to 0.71). Explanatory power of the base habitat models increased substantially, and the level of spatial autocorrelation decreased, with the sequential inclusion of variables accounting for stream size, year, stream, and reach location. The models for specific streams underscored the variability that was implied in the network-scale models. Associations between juvenile salmonids and individual habitat variables were rarely linear and ranged from negative to positive, and the variable accounting for location of the habitat within a stream was often more important than any individual habitat variable. The limited success in predicting the summer distribution and density of juvenile coho salmon and steelhead with our network-scale models was apparently related to variation in the strength and shape of fish-habitat associations across and within streams and years. Summary of statistical analysis of the Calawah Riverscape data. NOAA was not involved in and did not pay for the collection of this data. The data represent the statistical analysis carried out by Martin Liermann as a NOAA employee.
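
    As a rough illustration of the kind of analysis summarized above, a minimal R sketch of a habitat model plus Moran's I on its residuals; the habitat variable names, site coordinates, and file name are all assumptions (the actual five habitat variables are not listed here):

    library(ape)   # for Moran.I

    fish <- read.csv("calawah_riverscape.csv")   # hypothetical file name

    # Network-scale habitat model for juvenile coho salmon density
    # (predictor names are placeholders).
    fit <- lm(coho_density ~ depth + width + gradient + wood + substrate,
              data = fish)

    # Spatial autocorrelation of residuals, using inverse-distance weights
    # between survey sites (assumes distinct x/y coordinates per site).
    d <- as.matrix(dist(fish[, c("x", "y")]))
    w <- 1 / d
    diag(w) <- 0
    Moran.I(residuals(fit), w)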

Cite
Damico, Anthony (2023). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD

Current Population Survey (CPS)

Explore at:
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Damico, Anthony
Description

when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research. confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
