These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: File format: R workspace file; “Simulated_Dataset.RData”. Metadata (including data dictionary) • y: Vector of binary responses (1: adverse outcome, 0: control) • x: Matrix of covariates; one row for each simulated individual • z: Matrix of standardized pollution exposures • n: Number of simulated individuals • m: Number of exposure time periods (e.g., weeks of pregnancy) • p: Number of columns in the covariate design matrix • alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate) Code Abstract We provide R statistical software code (“CWVS_LMC.txt”) to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code (“Results_Summary.txt”) to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities. Description “CWVS_LMC.txt”: This code is delivered to the user in the form of a .txt file that contains R statistical software code. Once the “Simulated_Dataset.RData” workspace has been loaded into R, the text in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities. “Results_Summary.txt”: This code is also delivered to the user in the form of a .txt file that contains R statistical software code. Once the “CWVS_LMC.txt” code is applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript). Optional Information (complete as necessary) Required R packages: • For running “CWVS_LMC.txt”: • msm: Sampling from the truncated normal distribution • mnormt: Sampling from the multivariate normal distribution • BayesLogit: Sampling from the Polya-Gamma distribution • For running “Results_Summary.txt”: • plotrix: Plotting the posterior means and credible intervals Instructions for Use Reproducibility (Mandatory) What can be reproduced: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 from the presented simulation study. How to use the information: • Load the “Simulated_Dataset.RData” workspace • Run the code contained in “CWVS_LMC.txt” • Once the “CWVS_LMC.txt” code is complete, run “Results_Summary.txt”. Format: Below is the replication procedure for the attached data set for the portion of the analyses using a simulated data set: Data The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women. Availability Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publically available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics, and requires an appropriate data use agreement. Description Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis. This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
IntroductionThere is a need to develop harmonized procedures and a Minimum Data Set (MDS) for cross-border Multi Casualty Incidents (MCI) in medical emergency scenarios to ensure appropriate management of such incidents, regardless of place, language and internal processes of the institutions involved. That information should be capable of real-time communication to the command-and-control chain. It is crucial that the models adopted are interoperable between countries so that the rights of patients to cross-border healthcare are fully respected.ObjectiveTo optimize management of cross-border Multi Casualty Incidents through a Minimum Data Set collected and communicated in real time to the chain of command and control for each incident. To determine the degree of agreement among experts.MethodWe used the modified Delphi method supplemented with the Utstein technique to reach consensus among experts. In the first phase, the minimum requirements of the project, the profile of the experts who were to participate, the basic requirements of each variable chosen and the way of collecting the data were defined by providing bibliography on the subject. In the second phase, the preliminary variables were grouped into 6 clusters, the objectives, the characteristics of the variables and the logistics of the work were approved. Several meetings were held to reach a consensus to choose the MDS variables using a Modified Delphi technique. Each expert had to score each variable from 1 to 10. Non-voting variables were eliminated, and the round of voting ended. In the third phase, the Utstein Style was applied to discuss each group of variables and choose the ones with the highest consensus. After several rounds of discussion, it was agreed to eliminate the variables with a score of less than 5 points. In phase four, the researchers submitted the variables to the external experts for final assessment and validation before their use in the simulations. Data were analysed with SPSS Statistics (IBM, version 2) software.ResultsSix data entities with 31 sub-entities were defined, generating 127 items representing the final MDS regarded as essential for incident management. The level of consensus for the choice of items was very high and was highest for the category ‘Incident’ with an overall kappa of 0.7401 (95% CI 0.1265–0.5812, p 0.000), a good level of consensus in the Landis and Koch model. The items with the greatest degree of consensus at ten were those relating to location, type of incident, date, time and identification of the incident. All items met the criteria set, such as digital collection and real-time transmission to the chain of command and control.ConclusionsThis study documents the development of a MDS through consensus with a high degree of agreement among a group of experts of different nationalities working in different fields. All items in the MDS were digitally collected and forwarded in real time to the chain of command and control. This tool has demonstrated its validity in four large cross-border simulations involving more than eight countries and their emergency services.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We include Stata syntax (dummy_dataset_create.do) that creates a panel dataset for negative binomial time series regression analyses, as described in our paper "Examining methodology to identify patterns of consulting in primary care for different groups of patients before a diagnosis of cancer: an exemplar applied to oesophagogastric cancer". We also include a sample dataset for clarity (dummy_dataset.dta), and a sample of that data in a spreadsheet (Appendix 2).
The variables contained therein are defined as follows:
case: binary variable for case or control status (takes a value of 0 for controls and 1 for cases).
patid: a unique patient identifier.
time_period: A count variable denoting the time period. In this example, 0 denotes 10 months before diagnosis with cancer, and 9 denotes the month of diagnosis with cancer,
ncons: number of consultations per month.
period0 to period9: 10 unique inflection point variables (one for each month before diagnosis). These are used to test which aggregation period includes the inflection point.
burden: binary variable denoting membership of one of two multimorbidity burden groups.
We also include two Stata do-files for analysing the consultation rate, stratified by burden group, using the Maximum likelihood method (1_menbregpaper.do and 2_menbregpaper_bs.do).
Note: In this example, for demonstration purposes we create a dataset for 10 months leading up to diagnosis. In the paper, we analyse 24 months before diagnosis. Here, we study consultation rates over time, but the method could be used to study any countable event, such as number of prescriptions.
The harmonized data set on health, created and published by the ERF, is a subset of Iraq Household Socio Economic Survey (IHSES) 2012. It was derived from the household, individual and health modules, collected in the context of the above mentioned survey. The sample was then used to create a harmonized health survey, comparable with the Iraq Household Socio Economic Survey (IHSES) 2007 micro data set.
----> Overview of the Iraq Household Socio Economic Survey (IHSES) 2012:
Iraq is considered a leader in household expenditure and income surveys where the first was conducted in 1946 followed by surveys in 1954 and 1961. After the establishment of Central Statistical Organization, household expenditure and income surveys were carried out every 3-5 years in (1971/ 1972, 1976, 1979, 1984/ 1985, 1988, 1993, 2002 / 2007). Implementing the cooperation between CSO and WB, Central Statistical Organization (CSO) and Kurdistan Region Statistics Office (KRSO) launched fieldwork on IHSES on 1/1/2012. The survey was carried out over a full year covering all governorates including those in Kurdistan Region.
The survey has six main objectives. These objectives are:
The raw survey data provided by the Statistical Office were then harmonized by the Economic Research Forum, to create a comparable version with the 2006/2007 Household Socio Economic Survey in Iraq. Harmonization at this stage only included unifying variables' names, labels and some definitions. See: Iraq 2007 & 2012- Variables Mapping & Availability Matrix.pdf provided in the external resources for further information on the mapping of the original variables on the harmonized ones, in addition to more indications on the variables' availability in both survey years and relevant comments.
National coverage: Covering a sample of urban, rural and metropolitan areas in all the governorates including those in Kurdistan Region.
1- Household/family. 2- Individual/person.
The survey was carried out over a full year covering all governorates including those in Kurdistan Region.
Sample survey data [ssd]
----> Design:
Sample size was (25488) household for the whole Iraq, 216 households for each district of 118 districts, 2832 clusters each of which includes 9 households distributed on districts and governorates for rural and urban.
----> Sample frame:
Listing and numbering results of 2009-2010 Population and Housing Survey were adopted in all the governorates including Kurdistan Region as a frame to select households, the sample was selected in two stages: Stage 1: Primary sampling unit (blocks) within each stratum (district) for urban and rural were systematically selected with probability proportional to size to reach 2832 units (cluster). Stage two: 9 households from each primary sampling unit were selected to create a cluster, thus the sample size of total survey clusters was 25488 households distributed on the governorates, 216 households in each district.
----> Sampling Stages:
In each district, the sample was selected in two stages: Stage 1: based on 2010 listing and numbering frame 24 sample points were selected within each stratum through systematic sampling with probability proportional to size, in addition to the implicit breakdown urban and rural and geographic breakdown (sub-district, quarter, street, county, village and block). Stage 2: Using households as secondary sampling units, 9 households were selected from each sample point using systematic equal probability sampling. Sampling frames of each stages can be developed based on 2010 building listing and numbering without updating household lists. In some small districts, random selection processes of primary sampling may lead to select less than 24 units therefore a sampling unit is selected more than once , the selection may reach two cluster or more from the same enumeration unit when it is necessary.
Face-to-face [f2f]
----> Preparation:
The questionnaire of 2006 survey was adopted in designing the questionnaire of 2012 survey on which many revisions were made. Two rounds of pre-test were carried out. Revision were made based on the feedback of field work team, World Bank consultants and others, other revisions were made before final version was implemented in a pilot survey in September 2011. After the pilot survey implemented, other revisions were made in based on the challenges and feedbacks emerged during the implementation to implement the final version in the actual survey.
----> Questionnaire Parts:
The questionnaire consists of four parts each with several sections: Part 1: Socio – Economic Data: - Section 1: Household Roster - Section 2: Emigration - Section 3: Food Rations - Section 4: housing - Section 5: education - Section 6: health - Section 7: Physical measurements - Section 8: job seeking and previous job
Part 2: Monthly, Quarterly and Annual Expenditures: - Section 9: Expenditures on Non – Food Commodities and Services (past 30 days). - Section 10 : Expenditures on Non – Food Commodities and Services (past 90 days). - Section 11: Expenditures on Non – Food Commodities and Services (past 12 months). - Section 12: Expenditures on Non-food Frequent Food Stuff and Commodities (7 days). - Section 12, Table 1: Meals Had Within the Residential Unit. - Section 12, table 2: Number of Persons Participate in the Meals within Household Expenditure Other Than its Members.
Part 3: Income and Other Data: - Section 13: Job - Section 14: paid jobs - Section 15: Agriculture, forestry and fishing - Section 16: Household non – agricultural projects - Section 17: Income from ownership and transfers - Section 18: Durable goods - Section 19: Loans, advances and subsidies - Section 20: Shocks and strategy of dealing in the households - Section 21: Time use - Section 22: Justice - Section 23: Satisfaction in life - Section 24: Food consumption during past 7 days
Part 4: Diary of Daily Expenditures: Diary of expenditure is an essential component of this survey. It is left at the household to record all the daily purchases such as expenditures on food and frequent non-food items such as gasoline, newspapers…etc. during 7 days. Two pages were allocated for recording the expenditures of each day, thus the roster will be consists of 14 pages.
----> Raw Data:
Data Editing and Processing: To ensure accuracy and consistency, the data were edited at the following stages: 1. Interviewer: Checks all answers on the household questionnaire, confirming that they are clear and correct. 2. Local Supervisor: Checks to make sure that questions has been correctly completed. 3. Statistical analysis: After exporting data files from excel to SPSS, the Statistical Analysis Unit uses program commands to identify irregular or non-logical values in addition to auditing some variables. 4. World Bank consultants in coordination with the CSO data management team: the World Bank technical consultants use additional programs in SPSS and STAT to examine and correct remaining inconsistencies within the data files. The software detects errors by analyzing questionnaire items according to the expected parameter for each variable.
----> Harmonized Data:
Iraq Household Socio Economic Survey (IHSES) reached a total of 25488 households. Number of households refused to response was 305, response rate was 98.6%. The highest interview rates were in Ninevah and Muthanna (100%) while the lowest rates were in Sulaimaniya (92%).
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains tabular files with information about the usage preferences of speakers of Maltese English with regard to 63 pairs of lexical expressions. These pairs (e.g. truck-lorry or realization-realisation) are known to differ in usage between BrE and AmE (cf. Algeo 2006). The data were elicited with a questionnaire that asks informants to indicate whether they always use one of the two variants, prefer one over the other, have no preference, or do not use either expression (see Krug and Sell 2013 for methodological details). Usage preferences were therefore measured on a symmetric 5-point ordinal scale. Data were collected between 2008 to 2018, as part of a larger research project on lexical and grammatical variation in settings where English is spoken as a native, second, or foreign language. The current dataset, which we use for our methodological study on ordinal data modeling strategies, consists of a subset of 500 speakers that is roughly balanced on year of birth. Abstract: Related publication In empirical work, ordinal variables are typically analyzed using means based on numeric scores assigned to categories. While this strategy has met with justified criticism in the methodological literature, it also generates simple and informative data summaries, a standard often not met by statistically more adequate procedures. Motivated by a survey of how ordered variables are dealt with in language research, we draw attention to an un(der)used latent-variable approach to ordinal data modeling, which constitutes an alternative perspective on the most widely used form of ordered regression, the cumulative model. Since the latent-variable approach does not feature in any of the studies in our survey, we believe it is worthwhile to promote its benefits. To this end, we draw on questionnaire-based preference ratings by speakers of Maltese English, who indicated on a 5-point scale which of two synonymous expressions (e.g. package-parcel) they (tend to) use. We demonstrate that a latent-variable formulation of the cumulative model affords nuanced and interpretable data summaries that can be visualized effectively, while at the same time avoiding limitations inherent in mean response models (e.g. distortions induced by floor and ceiling effects). The online supplementary materials include a tutorial for its implementation in R.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
Various climate variables summary for all 15 subregions based on Bureau of Meteorology Australian Water Availability Project (BAWAP) climate grids. Including
Time series mean annual BAWAP rainfall from 1900 - 2012.
Long term average BAWAP rainfall and Penman Potentail Evapotranspiration (PET) from Jan 1981 - Dec 2012 for each month
Values calculated over the years 1981 - 2012 (inclusive), for 17 time periods (i.e., annual, 4 seasons and 12 months) for the following 8 meteorological variables: (i) BAWAP_P (precipitation); (ii) Penman ETp; (iii) Tavg (average temperature); (iv) Tmax (maximum temperature); (v) Tmin (minimum temperature); (vi) VPD (Vapour Pressure Deficit); (vii) Rn (net radiation); and (viii) Wind speed. For each of the 17 time periods for each of the 8 meteorological variables have calculated the: (a) average; (b) maximum; (c) minimum; (d) average plus standard deviation (stddev); (e) average minus stddev; (f) stddev; and (g) trend.
Correlation coefficients (-1 to 1) between rainfall and 4 remote rainfall drivers between 1957-2006 for the four seasons. The data and methodology are described in Risbey et al. (2009).
As described in the Risbey et al. (2009) paper, the rainfall was from 0.05 degree gridded data described in Jeffrey et al. (2001 - known as the SILO datasets); sea surface temperature was from the Hadley Centre Sea Ice and Sea Surface Temperature dataset (HadISST) on a 1 degree grid. BLK=Blocking; DMI=Dipole Mode Index; SAM=Southern Annular Mode; SOI=Southern Oscillation Index; DJF=December, January, February; MAM=March, April, May; JJA=June, July, August; SON=September, October, November. The analysis is a summary of Fig. 15 of Risbey et al. (2009).
There are 4 csv files here:
BAWAP_P_annual_BA_SYB_GLO.csv
Desc: Time series mean annual BAWAP rainfall from 1900 - 2012.
Source data: annual BILO rainfall
P_PET_monthly_BA_SYB_GLO.csv
long term average BAWAP rainfall and Penman PET from 198101 - 201212 for each month
Climatology_Trend_BA_SYB_GLO.csv
Values calculated over the years 1981 - 2012 (inclusive), for 17 time periods (i.e., annual, 4 seasons and 12 months) for the following 8 meteorological variables: (i) BAWAP_P; (ii) Penman ETp; (iii) Tavg; (iv) Tmax; (v) Tmin; (vi) VPD; (vii) Rn; and (viii) Wind speed. For each of the 17 time periods for each of the 8 meteorological variables have calculated the: (a) average; (b) maximum; (c) minimum; (d) average plus standard deviation (stddev); (e) average minus stddev; (f) stddev; and (g) trend
Risbey_Remote_Rainfall_Drivers_Corr_Coeffs_BA_NSB_GLO.csv
Correlation coefficients (-1 to 1) between rainfall and 4 remote rainfall drivers between 1957-2006 for the four seasons. The data and methodology are described in Risbey et al. (2009). As described in the Risbey et al. (2009) paper, the rainfall was from 0.05 degree gridded data described in Jeffrey et al. (2001 - known as the SILO datasets); sea surface temperature was from the Hadley Centre Sea Ice and Sea Surface Temperature dataset (HadISST) on a 1 degree grid. BLK=Blocking; DMI=Dipole Mode Index; SAM=Southern Annular Mode; SOI=Southern Oscillation Index; DJF=December, January, February; MAM=March, April, May; JJA=June, July, August; SON=September, October, November. The analysis is a summary of Fig. 15 of Risbey et al. (2009).
Dataset was created from various BAWAP source data, including Monthly BAWAP rainfall, Tmax, Tmin, VPD, etc, and other source data including monthly Penman PET, Correlation coefficient data. Data were extracted from national datasets for the GLO subregion.
BAWAP_P_annual_BA_SYB_GLO.csv
Desc: Time series mean annual BAWAP rainfall from 1900 - 2012.
Source data: annual BILO rainfall
P_PET_monthly_BA_SYB_GLO.csv
long term average BAWAP rainfall and Penman PET from 198101 - 201212 for each month
Climatology_Trend_BA_SYB_GLO.csv
Values calculated over the years 1981 - 2012 (inclusive), for 17 time periods (i.e., annual, 4 seasons and 12 months) for the following 8 meteorological variables: (i) BAWAP_P; (ii) Penman ETp; (iii) Tavg; (iv) Tmax; (v) Tmin; (vi) VPD; (vii) Rn; and (viii) Wind speed. For each of the 17 time periods for each of the 8 meteorological variables have calculated the: (a) average; (b) maximum; (c) minimum; (d) average plus standard deviation (stddev); (e) average minus stddev; (f) stddev; and (g) trend
Risbey_Remote_Rainfall_Drivers_Corr_Coeffs_BA_NSB_GLO.csv
Correlation coefficients (-1 to 1) between rainfall and 4 remote rainfall drivers between 1957-2006 for the four seasons. The data and methodology are described in Risbey et al. (2009). As described in the Risbey et al. (2009) paper, the rainfall was from 0.05 degree gridded data described in Jeffrey et al. (2001 - known as the SILO datasets); sea surface temperature was from the Hadley Centre Sea Ice and Sea Surface Temperature dataset (HadISST) on a 1 degree grid. BLK=Blocking; DMI=Dipole Mode Index; SAM=Southern Annular Mode; SOI=Southern Oscillation Index; DJF=December, January, February; MAM=March, April, May; JJA=June, July, August; SON=September, October, November. The analysis is a summary of Fig. 15 of Risbey et al. (2009).
Bioregional Assessment Programme (2014) GLO climate data stats summary. Bioregional Assessment Derived Dataset. Viewed 18 July 2018, http://data.bioregionalassessments.gov.au/dataset/afed85e0-7819-493d-a847-ec00a318e657.
Derived From Natural Resource Management (NRM) Regions 2010
Derived From Bioregional Assessment areas v03
Derived From BILO Gridded Climate Data: Daily Climate Data for each year from 1900 to 2012
Derived From Bioregional Assessment areas v01
Derived From Bioregional Assessment areas v02
Derived From GEODATA TOPO 250K Series 3
Derived From NSW Catchment Management Authority Boundaries 20130917
Derived From Geological Provinces - Full Extent
Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)
https://datafinder.stats.govt.nz/license/attribution-4-0-international/https://datafinder.stats.govt.nz/license/attribution-4-0-international/
Dataset contains counts and measures for households from the 2013, 2018, and 2023 Censuses. Data is available by statistical area 2.
The variables included in this dataset are for households in occupied private dwellings (unless otherwise stated). All data is for level 1 of the classification (unless otherwise stated):
Download lookup file from Stats NZ ArcGIS Online or embedded attachment in Stats NZ geographic data service. Download data table (excluding the geometry column for CSV files) using the instructions in the Koordinates help guide.
Footnotes
Geographical boundaries
Statistical standard for geographic areas 2023 (updated December 2023) has information about geographic boundaries as of 1 January 2023. Address data from 2013 and 2018 Censuses was updated to be consistent with the 2023 areas. Due to the changes in area boundaries and coding methodologies, 2013 and 2018 counts published in 2023 may be slightly different to those published in 2013 or 2018.
Caution using time series
Time series data should be interpreted with care due to changes in census methodology and differences in response rates between censuses. The 2023 and 2018 Censuses used a combined census methodology (using census responses and administrative data), while the 2013 Census used a full-field enumeration methodology (with no use of administrative data).
About the 2023 Census dataset
For information on the 2023 dataset see Using a combined census model for the 2023 Census. We combined data from the census forms with administrative data to create the 2023 Census dataset, which meets Stats NZ's quality criteria for population structure information. We added real data about real people to the dataset where we were confident the people who hadn’t completed a census form (which is known as admin enumeration) will be counted. We also used data from the 2018 and 2013 Censuses, administrative data sources, and statistical imputation methods to fill in some missing characteristics of people and dwellings.
Data quality
The quality of data in the 2023 Census is assessed using the quality rating scale and the quality assurance framework to determine whether data is fit for purpose and suitable for release. Data quality assurance in the 2023 Census has more information.
Concept descriptions and quality ratings
Data quality ratings for 2023 Census variables has additional details about variables found within totals by topic, for example, definitions and data quality.
Household crowding
Household crowding is based on the Canadian National Occupancy Standard (CNOS). It calculates the number of bedrooms needed based on the demographic composition of the household. The household crowding index methodology for 2023 Census has been updated to use gender instead of sex. Household crowding should be used with caution for small geographical areas due to high volatility between census years as a result of population change and urban development. There may be additional volatility in areas affected by the cyclone, particularly in Gisborne and Hawke's Bay. Household crowding index – 2023 Census has details on how the methodology has changed, differences from 2018 Census, and more.
Using data for good
Stats NZ expects that, when working with census data, it is done so with a positive purpose, as outlined in the Māori Data Governance Model (Data Iwi Leaders Group, 2023). This model states that "data should support transformative outcomes and should uplift and strengthen our relationships with each other and with our environments. The avoidance of harm is the minimum expectation for data use. Māori data should also contribute to iwi and hapū tino rangatiratanga”.
Confidentiality
The 2023 Census confidentiality rules have been applied to 2013, 2018, and 2023 data. These rules protect the confidentiality of individuals, families, households, dwellings, and undertakings in 2023 Census data. Counts are calculated using fixed random rounding to base 3 (FRR3) and suppression of ‘sensitive’ counts less than six, where tables report multiple geographic variables and/or small populations. Individual figures may not always sum to stated totals. Applying confidentiality rules to 2023 Census data and summary of changes since 2018 and 2013 Censuses has more information about 2023 Census confidentiality rules.
Measures
Measures like averages, medians, and other quantiles are calculated from unrounded counts, with input noise added to or subtracted from each contributing value during measures calculations. Averages and medians based on less than six units (e.g. individuals, dwellings, households, families, or extended families) are suppressed. This suppression threshold changes for other quantiles. Where the cells have been suppressed, a placeholder value has been used.
Percentages
To calculate percentages, divide the figure for the category of interest by the figure for 'Total stated' where this applies.
Symbol
-997 Not available
-999 Confidential
Inconsistencies in definitions
Please note that there may be differences in definitions between census classifications and those used for other data collections.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This paper explores a unique dataset of all the SET ratings provided by students of one university in Poland at the end of the winter semester of the 2020/2021 academic year. The SET questionnaire used by this university is presented in Appendix 1. The dataset is unique for several reasons. It covers all SET surveys filled by students in all fields and levels of study offered by the university. In the period analysed, the university was entirely in the online regime amid the Covid-19 pandemic. While the expected learning outcomes formally have not been changed, the online mode of study could have affected the grading policy and could have implications for some of the studied SET biases. This Covid-19 effect is captured by econometric models and discussed in the paper. The average SET scores were matched with the characteristics of the teacher for degree, seniority, gender, and SET scores in the past six semesters; the course characteristics for time of day, day of the week, course type, course breadth, class duration, and class size; the attributes of the SET survey responses as the percentage of students providing SET feedback; and the grades of the course for the mean, standard deviation, and percentage failed. Data on course grades are also available for the previous six semesters. This rich dataset allows many of the biases reported in the literature to be tested for and new hypotheses to be formulated, as presented in the introduction section. The unit of observation or the single row in the data set is identified by three parameters: teacher unique id (j), course unique id (k) and the question number in the SET questionnaire (n ϵ {1, 2, 3, 4, 5, 6, 7, 8, 9} ). It means that for each pair (j,k), we have nine rows, one for each SET survey question, or sometimes less when students did not answer one of the SET questions at all. For example, the dependent variable SET_score_avg(j,k,n) for the triplet (j=Calculus, k=John Smith, n=2) is calculated as the average of all Likert-scale answers to question nr 2 in the SET survey distributed to all students that took the Calculus course taught by John Smith. The data set has 8,015 such observations or rows. The full list of variables or columns in the data set included in the analysis is presented in the attached filesection. Their description refers to the triplet (teacher id = j, course id = k, question number = n). When the last value of the triplet (n) is dropped, it means that the variable takes the same values for all n ϵ {1, 2, 3, 4, 5, 6, 7, 8, 9}.Two attachments:- word file with variables description- Rdata file with the data set (for R language).Appendix 1. Appendix 1. The SET questionnaire was used for this paper. Evaluation survey of the teaching staff of [university name] Please, complete the following evaluation form, which aims to assess the lecturer’s performance. Only one answer should be indicated for each question. The answers are coded in the following way: 5- I strongly agree; 4- I agree; 3- Neutral; 2- I don’t agree; 1- I strongly don’t agree. Questions 1 2 3 4 5 I learnt a lot during the course. ○ ○ ○ ○ ○ I think that the knowledge acquired during the course is very useful. ○ ○ ○ ○ ○ The professor used activities to make the class more engaging. ○ ○ ○ ○ ○ If it was possible, I would enroll for the course conducted by this lecturer again. ○ ○ ○ ○ ○ The classes started on time. ○ ○ ○ ○ ○ The lecturer always used time efficiently. ○ ○ ○ ○ ○ The lecturer delivered the class content in an understandable and efficient way. ○ ○ ○ ○ ○ The lecturer was available when we had doubts. ○ ○ ○ ○ ○ The lecturer treated all students equally regardless of their race, background and ethnicity. ○ ○
Studies utilizing Global Positioning System (GPS) telemetry rarely result in 100% fix success rates (FSR). Many assessments of wildlife resource use do not account for missing data, either assuming data loss is random or because a lack of practical treatment for systematic data loss. Several studies have explored how the environment, technological features, and animal behavior influence rates of missing data in GPS telemetry, but previous spatially explicit models developed to correct for sampling bias have been specified to small study areas, on a small range of data loss, or to be species-specific, limiting their general utility. Here we explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use. We also evaluate patterns in missing data that relate to potential animal activities that change the orientation of the antennae and characterize home-range probability of GPS detection for 4 focal species; cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni) and mule deer (Odocoileus hemionus). Part 1, Positive Openness Raster (raster dataset): Openness is an angular measure of the relationship between surface relief and horizontal distance. For angles less than 90 degrees it is equivalent to the internal angle of a cone with its apex at a DEM location, and is constrained by neighboring elevations within a specified radial distance. 480 meter search radius was used for this calculation of positive openness. Openness incorporates the terrain line-of-sight or viewshed concept and is calculated from multiple zenith and nadir angles-here along eight azimuths. Positive openness measures openness above the surface, with high values for convex forms and low values for concave forms (Yokoyama et al. 2002). We calculated positive openness using a custom python script, following the methods of Yokoyama et. al (2002) using a USGS National Elevation Dataset as input. Part 2, Northern Arizona GPS Test Collar (csv): Bias correction in GPS telemetry data-sets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall over-story vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent data-sets from stationary test collars of different make/model, fix interval programming, and placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs, suggest changes in technological factors have minor influence on the models ability to predict FSR in new study areas in the southwestern US. The model training data are provided here for fix attempts by hour. This table can be linked with the site location shapefile using the site field. Part 3, Probability Raster (raster dataset): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix aquistion. We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The models predictive ability was evaluated using two independent datasets from stationary test collars of different make/model, fix interval programing, and placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs, suggest changes in technological factors have minor influence on the models ability to predict FSR in new study areas in the southwestern US. We evaluated GPS telemetry datasets by comparing the mean probability of a successful GPS fix across study animals home-ranges, to the actual observed FSR of GPS downloaded deployed collars on cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni) and mule deer (Odocoileus hemionus). Comparing the mean probability of acquisition within study animals home-ranges and observed FSRs of GPS downloaded collars resulted in a approximatly 1:1 linear relationship with an r-sq= 0.68. Part 4, GPS Test Collar Sites (shapefile): Bias correction in GPS telemetry data-sets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall over-story vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent data-sets from stationary test collars of different make/model, fix interval programming, and placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs, suggest changes in technological factors have minor influence on the models ability to predict FSR in new study areas in the southwestern US. Part 5, Cougar Home Ranges (shapefile): Cougar home-ranges were calculated to compare the mean probability of a GPS fix acquisition across the home-range to the actual fix success rate (FSR) of the collar as a means for evaluating if characteristics of an animal’s home-range have an effect on observed FSR. We estimated home-ranges using the Local Convex Hull (LoCoH) method using the 90th isopleth. Data obtained from GPS download of retrieved units were only used. Satellite delivered data was omitted from the analysis for animals where the collar was lost or damaged because satellite delivery tends to lose as additional 10% of data. Comparisons with home-range mean probability of fix were also used as a reference for assessing if the frequency animals use areas of low GPS acquisition rates may play a role in observed FSRs. Part 6, Cougar Fix Success Rate by Hour (csv): Cougar GPS collar fix success varied by hour-of-day suggesting circadian rhythms with bouts of rest during daylight hours may change the orientation of the GPS receiver affecting the ability to acquire fixes. Raw data of overall fix success rates (FSR) and FSR by hour were used to predict relative reductions in FSR. Data only includes direct GPS download datasets. Satellite delivered data was omitted from the analysis for animals where the collar was lost or damaged because satellite delivery tends to lose approximately an additional 10% of data. Part 7, Openness Python Script version 2.0: This python script was used to calculate positive openness using a 30 meter digital elevation model for a large geographic area in Arizona, California, Nevada and Utah. A scientific research project used the script to explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use.
https://object-store.os-api.cci2.ecmwf.int:443/cci2-prod-catalogue/licences/cc-by/cc-by_f24dc630aa52ab8c52a0ac85c03bc35e0abc850b4d7453bdc083535b41d5a5c3.pdfhttps://object-store.os-api.cci2.ecmwf.int:443/cci2-prod-catalogue/licences/cc-by/cc-by_f24dc630aa52ab8c52a0ac85c03bc35e0abc850b4d7453bdc083535b41d5a5c3.pdf
ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis. Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations, and when going further back in time, to allow for the ingestion of improved versions of the original observations, which all benefit the quality of the reanalysis product. This catalogue entry provides post-processed ERA5 hourly single-level data aggregated to daily time steps. In addition to the data selection options found on the hourly page, the following options can be selected for the daily statistic calculation:
The daily aggregation statistic (daily mean, daily max, daily min, daily sum*) The sub-daily frequency sampling of the original data (1 hour, 3 hours, 6 hours) The option to shift to any local time zone in UTC (no shift means the statistic is computed from UTC+00:00)
*The daily sum is only available for the accumulated variables (see ERA5 documentation for more details). Users should be aware that the daily aggregation is calculated during the retrieval process and is not part of a permanently archived dataset. For more details on how the daily statistics are calculated, including demonstrative code, please see the documentation. For more details on the hourly data used to calculate the daily statistics, please refer to the ERA5 hourly single-level data catalogue entry and the documentation found therein.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains information about high school students and their actual and predicted performance on an exam. Most of the information, including some general information about high school students and their grade for an exam, was based on an already existing dataset, while the predicted exam performance was based on a human experiment. In this experiment, participants were shown short descriptions of the students (based on the information in the original data) and had to rank and grade according to their expected performance. Prior to this task some participants were exposed to some "Stereotype Activation", suggesting that boys perform less well in school than girls.
Based on this dataset (which is also available on kaggle), we extracted a number of student profiles that participants had to make grade predictions for. For more information about this dataset we refer to the corresponding kaggle page: https://www.kaggle.com/datasets/uciml/student-alcohol-consumption
Note that we performed some preprocessing on the original data:
The original data consisted of two parts: the information about students following a Maths course and the information about students following a Portuguese course. Since in both datasets the same type of information was recorded, we merged both datasets and added a column "subject", to show which course each student belongs to
We excluded all data where G3 = 0 (i.e. the grade for the last exam = 0)
From original_data.csv we randomly sampled 856 students that participants in our study had to make grade predictions for.
index - this column corresponds to the indeces in the file "original_data.csv". Through these indices, it is possible to add columns from the original data to the dataset with the grade prediction
ParticipantID - the ID of the participant who made the performance predictions for the corresponding student. Predictions needed to be made for 856 students, and each participant made 8 predictions total. Thus there are 107 different participant IDs
name - to make the prediction task more engaging for participants, each of the 8 student profiles, that participants had to grade & rank was randomly matched to one of four boy/girl's names (depending on the sex of the student)
sex - the sex of each student, either female (F) or male (M). For benchmarking fair ML algorithms, this can be used as the sensitive attribute. We assume that in the fair version of the decision variable ("Pass"), no sex discrimination occurs. The biased versions of the variable ("Predicted Pass") are mostly discriminatory towards male students.
studytime - this variable is taken from the original dataset and denotes how long a student studied for their exam. In the original data this variable consisted of four levels (less than 2 hours vs. 2-5 hours vs. 5-10 hours vs. more than 10 hours). We binned the latter two levels together and encoded this column numerically from 1-3.
freetime - Originally, this variable ranged from 1 (very low) to 5 (very high). We binned this variable into three categories, where level 1 and 2 are binned, as well as level 4 and 5.
romantic - Binary variable, denoting whether the student is in a romantic relationship or not.
Walc - This variable shows how much alcohol each student consumes in the weekend. Originally it ranged from 1 to 5 (5 corresponding to the highest alcohol consumption), but we binned the last two levels together.
goout - This variable shows how often a student goes out in a week. Originally it ranged from 1 to 5 (5 corresponding to going out very often), but we binned the last two levels together.
Parents_edu - This variable was not present in the original dataset. Instead, the original dataset consisted of two variables "mum_edu" and "dad_edu". We obtained "Parents_edu" by taking the higher one of both. The variable consist of 4 levels, whereas 4 = highest level of education.
absences - This variable shows the number of absences per student. Originally it ranged from 0 - 93, but because large number of absences were infrequent we binned all absences of >=7 into one level.
reason - The reason for why a student chose to go to the school in question. The levels are close to home, school's reputation, school's curricular and other
G3 - The actual grade each student received for the final exam of the course, ranging from 0-20.
Pass - A binary variable showing whether G3 is a passing grade (i.e. >=10) or not.
Predicted Grade - The grade the student was predicted to receive in our experiment
Predicted Rank - In our ex...
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Descriptive statistics of sexual violence victim-survivors in the Crime Survey for England and Wales (CSEW) and Rape Crisis England & Wales (RCEW) datasets.
https://datafinder.stats.govt.nz/license/attribution-4-0-international/https://datafinder.stats.govt.nz/license/attribution-4-0-international/
Dataset contains counts and measures for households from the 2013, 2018, and 2023 Censuses. Data is available by statistical area 1.
The variables included in this dataset are for households in occupied private dwellings (unless otherwise stated). All data is for level 1 of the classification (unless otherwise stated):
Download lookup file from Stats NZ ArcGIS Online or embedded attachment in Stats NZ geographic data service. Download data table (excluding the geometry column for CSV files) using the instructions in the Koordinates help guide.
Footnotes
Geographical boundaries
Statistical standard for geographic areas 2023 (updated December 2023) has information about geographic boundaries as of 1 January 2023. Address data from 2013 and 2018 Censuses was updated to be consistent with the 2023 areas. Due to the changes in area boundaries and coding methodologies, 2013 and 2018 counts published in 2023 may be slightly different to those published in 2013 or 2018.
Caution using time series
Time series data should be interpreted with care due to changes in census methodology and differences in response rates between censuses. The 2023 and 2018 Censuses used a combined census methodology (using census responses and administrative data), while the 2013 Census used a full-field enumeration methodology (with no use of administrative data).
About the 2023 Census dataset
For information on the 2023 dataset see Using a combined census model for the 2023 Census. We combined data from the census forms with administrative data to create the 2023 Census dataset, which meets Stats NZ's quality criteria for population structure information. We added real data about real people to the dataset where we were confident the people who hadn’t completed a census form (which is known as admin enumeration) will be counted. We also used data from the 2018 and 2013 Censuses, administrative data sources, and statistical imputation methods to fill in some missing characteristics of people and dwellings.
Data quality
The quality of data in the 2023 Census is assessed using the quality rating scale and the quality assurance framework to determine whether data is fit for purpose and suitable for release. Data quality assurance in the 2023 Census has more information.
Concept descriptions and quality ratings
Data quality ratings for 2023 Census variables has additional details about variables found within totals by topic, for example, definitions and data quality.
Household crowding
Household crowding is based on the Canadian National Occupancy Standard (CNOS). It calculates the number of bedrooms needed based on the demographic composition of the household. The household crowding index methodology for 2023 Census has been updated to use gender instead of sex. Household crowding should be used with caution for small geographical areas due to high volatility between census years as a result of population change and urban development. There may be additional volatility in areas affected by the cyclone, particularly in Gisborne and Hawke's Bay. Household crowding index – 2023 Census has details on how the methodology has changed, differences from 2018 Census, and more.
Using data for good
Stats NZ expects that, when working with census data, it is done so with a positive purpose, as outlined in the Māori Data Governance Model (Data Iwi Leaders Group, 2023). This model states that "data should support transformative outcomes and should uplift and strengthen our relationships with each other and with our environments. The avoidance of harm is the minimum expectation for data use. Māori data should also contribute to iwi and hapū tino rangatiratanga”.
Confidentiality
The 2023 Census confidentiality rules have been applied to 2013, 2018, and 2023 data. These rules protect the confidentiality of individuals, families, households, dwellings, and undertakings in 2023 Census data. Counts are calculated using fixed random rounding to base 3 (FRR3) and suppression of ‘sensitive’ counts less than six, where tables report multiple geographic variables and/or small populations. Individual figures may not always sum to stated totals. Applying confidentiality rules to 2023 Census data and summary of changes since 2018 and 2013 Censuses has more information about 2023 Census confidentiality rules.
Measures
Measures like averages, medians, and other quantiles are calculated from unrounded counts, with input noise added to or subtracted from each contributing value during measures calculations. Averages and medians based on less than six units (e.g. individuals, dwellings, households, families, or extended families) are suppressed. This suppression threshold changes for other quantiles. Where the cells have been suppressed, a placeholder value has been used.
Percentages
To calculate percentages, divide the figure for the category of interest by the figure for 'Total stated' where this applies.
Symbol
-997 Not available
-999 Confidential
Inconsistencies in definitions
Please note that there may be differences in definitions between census classifications and those used for other data collections.
The U.S. Population Grids (Summary File 1), 2000: Houston Metropolitan Statistical Area, Alpha Version data set contains an ARC/INFO Workspace with grids of demographic data from the 2000 census. The grids have a resolution of 7.5 arc-seconds (0.002075 decimal degrees), or approximately 250 square meters. The gridded variables are based on census block geography from Census 2000 TIGER/Line Files and census variables (population, households, and housing variables) from Summary File 1. This data set is produced by the Columbia University Center for International Earth Science Information Network (CIESIN).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is raw and analysis data for empirical study, entitled entitled “Integrating Multidimensional Dependability with the Technology Acceptance Model: A Study of Adoption of Cloud Computing at the Organizational Level”. This study investigated how perceived dependability affects user acceptance by integrating perceived dependability with the technology acceptance model in the context of cloud computing. In this study, perceived dependability was treated as a multi-dimensional variable and conceptualized as a second-order construct. A total of 216 samples (organizational managers) were analyzed using the structural equation modeling. IBM SPSS AMOS 23 tool was used for data analysis.
This data is valuable because no other research have considered the multidimensional approach to dependability. These empirical data can provide academic researchers and businesses with insights on organizational level adoption of cloud computing.
CourseKata is a platform that creates and publishes a series of e-books for introductory statistics and data science classes that utilize demonstrated learning strategies to help students learn statistics and data science. The developers of CourseKata, Jim Stigler (UCLA) and Ji Son (Cal State Los Angeles) and their team, are cognitive psychologists interested in improving statistics learning by examining students' interactions with online interactive textbooks. Traditionally, much of the research in how students learn is done in a 1-hour lab or through small-scale interviews with students. CourseKata offers the opportunity to peek into the actions, responses, and choices of thousands of students as they are engaged in learning the interrelated concepts and skills of statistics and coding in R over many weeks or months in real classes.
Questions are grouped into items (item_id). An item can be one of three item_type 's: code, learnosity or learnosity-activity (the distinction between learnosity and learnosity-activity is not important). Code items are a single question and ask for R code as a response. (Responses can be seen in responses.csv.) Learnosity-activities and learnosity items are collections of one or more questions that can be of a variety of lrn_type's: ● association ● choicematrix ● clozeassociation ● formulaV2 ● imageclozeassociation ● mcq ● plaintext ● shorttext ● sortlist
Examples of these question types are provided at the end of this document.
The level of detail made available to you in the responses file depends on the lrn_type. For example, for multiple choice questions (mcq), you can find the options in the responses file in the columns labeled lrn_option_0 through lrn_option_11, and you can see the chosen option in the results variable.
Assessment Types In general, assessments, such as the items and questions included in CourseKata, can be used for two purposes. Formative assessments are meant to provide feedback to the student (and instructor), or to serve as a learning aid to help prompt students improve memory and deepen their understanding. Summative assessments are meant to provide a summary of a student's understanding, often for use in assigning a grade. For example, most midterms and final exams that you've taken are summative assessments.
The vast majority of items in CourseKata should be treated as formative assessments. The exceptions are the end-of-chapter Review questions, which can be thought of as summative. The mean number of correct answers for end-of-chapter review questions is provided within the checkpoints file. You might see that some pages have the word "Quiz" or "Exam" or "Midterm" in them. Results from these items and responses to them are not provided to us in this data set.
This dataset presents a rich collection of physicochemical parameters from 147 reservoirs distributed across the conterminous U.S. One hundred and eight of the reservoirs were selected using a statistical survey design and can provide unbiased inferences to the condition of all U.S. reservoirs. These data could be of interest to local water management specialists or those assessing the ecological condition of reservoirs at the national scale. These data have been reviewed in accordance with U.S. Environmental Protection Agency policy and approved for publication. This dataset is not publicly accessible because: It is too large. It can be accessed through the following means: https://portal-s.edirepository.org/nis/mapbrowse?scope=edi&identifier=2033&revision=1. Format: This dataset presents water quality and related variables for 147 reservoirs distributed across the U.S. Water quality parameters were measured during the summers of 2016, 2018, and 2020 – 2023. Measurements include nutrient concentration, algae abundance, dissolved oxygen concentration, and water temperature, among many others. Dataset includes links to other national and global scale data sets that provide additional variables.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Description
This dataset contains a simulated collection of 1,00000 patient records designed to explore hypertension management in resource-constrained settings. It provides comprehensive data for analyzing blood pressure control rates, associated risk factors, and complications. The dataset is ideal for predictive modelling, risk analysis, and treatment optimization, offering insights into demographic, clinical, and treatment-related variables.
Dataset Structure
Dataset Volume
• Size: 10,000 records. • Features: 19 variables, categorized into Sociodemographic, Clinical, Complications, and Treatment/Control groups.
Variables and Categories
A. Sociodemographic Variables
1. Age:
• Continuous variable in years.
• Range: 18–80 years.
• Mean ± SD: 49.37 ± 12.81.
2. Sex:
• Categorical variable.
• Values: Male, Female.
3. Education:
• Categorical variable.
• Values: No Education, Primary, Secondary, Higher Secondary, Graduate, Post-Graduate, Madrasa.
4. Occupation:
• Categorical variable.
• Values: Service, Business, Agriculture, Retired, Unemployed, Housewife.
5. Monthly Income:
• Categorical variable in Bangladeshi Taka.
• Values: <5000, 5001–10000, 10001–15000, >15000.
6. Residence:
• Categorical variable.
• Values: Urban, Sub-urban, Rural.
B. Clinical Variables
7. Systolic BP:
• Continuous variable in mmHg.
• Range: 100–200 mmHg.
• Mean ± SD: 140 ± 15 mmHg.
8. Diastolic BP:
• Continuous variable in mmHg.
• Range: 60–120 mmHg.
• Mean ± SD: 90 ± 10 mmHg.
9. Elevated Creatinine:
• Binary variable (\geq 1.4 \, \text{mg/dL}).
• Values: Yes, No.
10. Diabetes Mellitus:
• Binary variable.
• Values: Yes, No.
11. Family History of CVD:
• Binary variable.
• Values: Yes, No.
12. Elevated Cholesterol:
• Binary variable (\geq 200 \, \text{mg/dL}).
• Values: Yes, No.
13. Smoking:
• Binary variable.
• Values: Yes, No.
C. Complications
14. LVH (Left Ventricular Hypertrophy):
• Binary variable (ECG diagnosis).
• Values: Yes, No.
15. IHD (Ischemic Heart Disease):
• Binary variable.
• Values: Yes, No.
16. CVD (Cerebrovascular Disease):
• Binary variable.
• Values: Yes, No.
17. Retinopathy:
• Binary variable.
• Values: Yes, No.
D. Treatment and Control
18. Treatment:
• Categorical variable indicating therapy type.
• Values: Single Drug, Combination Drugs.
19. Control Status:
• Binary variable.
• Values: Controlled, Uncontrolled.
Dataset Applications
1. Predictive Modeling:
• Develop models to predict blood pressure control status using demographic and clinical data.
2. Risk Analysis:
• Identify significant factors influencing hypertension control and complications.
3. Severity Scoring:
• Quantify hypertension severity for patient risk stratification.
4. Complications Prediction:
• Forecast complications like IHD, LVH, and CVD for early intervention.
5. Treatment Guidance:
• Analyze therapy efficacy to recommend optimal treatment strategies.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The KNMI'23 climate scenarios are based on 240 years (8 ensembles of 30 years each) of RACMO (Regional Atmospheric Climate Model) v2.3 data for every horizon/scenario. The data in this set is a user-friendly version of the RACMO putput that was used to calculate the scenario tables. ‘User-friendly’ means that the data is mapped to a regular lat/lon grid, and that the time coordinate corresponds to the nominal period it is used for. Version 1.0 (https://dataplatform.knmi.nl/dataset/knmi23-user-friendly-racmo-1-0) of the dataset only includes values within the Dutch borders. Version 2.0 also includes values outside the borders. In addition, this version also includes a mask to mask the points outside the Netherlands. This user-friendly dataset is also provided to the public via the data portal. The current version (3.0) is the same as version 2.0 but has four extra variables: psl (sea level pressure), tsfcwindmax (daily maximum wind speed), uas (daily mean zonal wind speed), vas (daily mean meridional wind speed). These variables are not bias-corrected because the necessary observational data are not available. Users should be aware of this difference.
https://datafinder.stats.govt.nz/license/attribution-4-0-international/https://datafinder.stats.govt.nz/license/attribution-4-0-international/
Dataset contains counts and measures for individuals from the 2013, 2018, and 2023 Censuses. Data is available by statistical area 2.
The variables included in this dataset are for the census usually resident population count (unless otherwise stated). All data is for level 1 of the classification (unless otherwise stated).
The variables for part 1 of the dataset are:
Download lookup file for part 1 from Stats NZ ArcGIS Online or embedded attachment in Stats NZ geographic data service. Download data table (excluding the geometry column for CSV files) using the instructions in the Koordinates help guide.
Footnotes
Te Whata
Under the Mana Ōrite Relationship Agreement, Te Kāhui Raraunga (TKR) will be publishing Māori descent and iwi affiliation data from the 2023 Census in partnership with Stats NZ. This will be available on Te Whata, a TKR platform.
Geographical boundaries
Statistical standard for geographic areas 2023 (updated December 2023) has information about geographic boundaries as of 1 January 2023. Address data from 2013 and 2018 Censuses was updated to be consistent with the 2023 areas. Due to the changes in area boundaries and coding methodologies, 2013 and 2018 counts published in 2023 may be slightly different to those published in 2013 or 2018.
Subnational census usually resident population
The census usually resident population count of an area (subnational count) is a count of all people who usually live in that area and were present in New Zealand on census night. It excludes visitors from overseas, visitors from elsewhere in New Zealand, and residents temporarily overseas on census night. For example, a person who usually lives in Christchurch city and is visiting Wellington city on census night will be included in the census usually resident population count of Christchurch city.
Population counts
Stats NZ publishes a number of different population counts, each using a different definition and methodology. Population statistics – user guide has more information about different counts.
Caution using time series
Time series data should be interpreted with care due to changes in census methodology and differences in response rates between censuses. The 2023 and 2018 Censuses used a combined census methodology (using census responses and administrative data), while the 2013 Census used a full-field enumeration methodology (with no use of administrative data).
Study participation time series
In the 2013 Census study participation was only collected for the census usually resident population count aged 15 years and over.
About the 2023 Census dataset
For information on the 2023 dataset see Using a combined census model for the 2023 Census. We combined data from the census forms with administrative data to create the 2023 Census dataset, which meets Stats NZ's quality criteria for population structure information. We added real data about real people to the dataset where we were confident the people who hadn’t completed a census form (which is known as admin enumeration) will be counted. We also used data from the 2018 and 2013 Censuses, administrative data sources, and statistical imputation methods to fill in some missing characteristics of people and dwellings.
Data quality
The quality of data in the 2023 Census is assessed using the quality rating scale and the quality assurance framework to determine whether data is fit for purpose and suitable for release. Data quality assurance in the 2023 Census has more information.
Concept descriptions and quality ratings
Data quality ratings for 2023 Census variables has additional details about variables found within totals by topic, for example, definitions and data quality.
Disability indicator
This data should not be used as an official measure of disability prevalence. Disability prevalence estimates are only available from the 2023 Household Disability Survey. Household Disability Survey 2023: Final content has more information about the survey.
Activity limitations are measured using the Washington Group Short Set (WGSS). The WGSS asks about six basic activities that a person might have difficulty with: seeing, hearing, walking or climbing stairs, remembering or concentrating, washing all over or dressing, and communicating. A person was classified as disabled in the 2023 Census if there was at least one of these activities that they had a lot of difficulty with or could not do at all.
Using data for good
Stats NZ expects that, when working with census data, it is done so with a positive purpose, as outlined in the Māori Data Governance Model (Data Iwi Leaders Group, 2023). This model states that "data should support transformative outcomes and should uplift and strengthen our relationships with each other and with our environments. The avoidance of harm is the minimum expectation for data use. Māori data should also contribute to iwi and hapū tino rangatiratanga”.
Confidentiality
The 2023 Census confidentiality rules have been applied to 2013, 2018, and 2023 data. These rules protect the confidentiality of individuals, families, households, dwellings, and undertakings in 2023 Census data. Counts are calculated using fixed random rounding to base 3 (FRR3) and suppression of ‘sensitive’ counts less than six, where tables report multiple geographic variables and/or small populations. Individual figures may not always sum to stated totals. Applying confidentiality rules to 2023 Census data and summary of changes since 2018 and 2013 Censuses has more information about 2023 Census confidentiality rules.
Measures
Measures like averages, medians, and other quantiles are calculated from unrounded counts, with input noise added to or subtracted from each contributing value during measures calculations. Averages and medians based on less than six units (e.g. individuals, dwellings, households, families, or extended families) are suppressed. This suppression threshold changes for other quantiles. Where the cells have been suppressed, a placeholder value has been used.
Percentages
To calculate percentages, divide the figure for the category of interest by the figure for 'Total stated' where this applies.
Symbol
-997 Not available
-999 Confidential
Inconsistencies in definitions
Please note that there may be differences in definitions between census classifications and those used for other data collections.
These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: File format: R workspace file; “Simulated_Dataset.RData”. Metadata (including data dictionary) • y: Vector of binary responses (1: adverse outcome, 0: control) • x: Matrix of covariates; one row for each simulated individual • z: Matrix of standardized pollution exposures • n: Number of simulated individuals • m: Number of exposure time periods (e.g., weeks of pregnancy) • p: Number of columns in the covariate design matrix • alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate) Code Abstract We provide R statistical software code (“CWVS_LMC.txt”) to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code (“Results_Summary.txt”) to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities. Description “CWVS_LMC.txt”: This code is delivered to the user in the form of a .txt file that contains R statistical software code. Once the “Simulated_Dataset.RData” workspace has been loaded into R, the text in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities. “Results_Summary.txt”: This code is also delivered to the user in the form of a .txt file that contains R statistical software code. Once the “CWVS_LMC.txt” code is applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript). Optional Information (complete as necessary) Required R packages: • For running “CWVS_LMC.txt”: • msm: Sampling from the truncated normal distribution • mnormt: Sampling from the multivariate normal distribution • BayesLogit: Sampling from the Polya-Gamma distribution • For running “Results_Summary.txt”: • plotrix: Plotting the posterior means and credible intervals Instructions for Use Reproducibility (Mandatory) What can be reproduced: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 from the presented simulation study. How to use the information: • Load the “Simulated_Dataset.RData” workspace • Run the code contained in “CWVS_LMC.txt” • Once the “CWVS_LMC.txt” code is complete, run “Results_Summary.txt”. Format: Below is the replication procedure for the attached data set for the portion of the analyses using a simulated data set: Data The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women. Availability Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publically available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics, and requires an appropriate data use agreement. Description Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis. This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).