Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In genomic studies, log transformation is a common preprocessing step to adjust for skewness in the data. This standard approach often assumes that the log-transformed data are normally distributed, and a two-sample t-test (or one of its modifications) is used to detect differences between two experimental conditions. However, it was recently shown that the two-sample t-test can lead to exaggerated false positives, and the Wilcoxon-Mann-Whitney (WMW) test was proposed as an alternative for studies with larger sample sizes. In addition, studies have demonstrated that the specific distribution used in modeling genomic data has a profound impact on the interpretation and validity of results. The aim of this paper is three-fold: 1) to present the Exp-gamma (exponential-gamma) distribution, the distribution of log-transformed gamma-distributed data, as a proper biological and statistical model for the analysis of log-transformed protein abundance data from single-cell experiments; 2) to demonstrate the inappropriateness of the two-sample t-test and the WMW test in analyzing log-transformed protein abundance data; 3) to propose and evaluate statistical inference methods for hypothesis testing and confidence interval estimation when comparing two independent samples under Exp-gamma distributions. The proposed methods are applied to analyze protein abundance data from a single-cell dataset.
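For intuition, a minimal sketch (illustrative parameters only, not the paper's proposed method): if X follows a gamma distribution, then log(X) follows the Exp-gamma distribution described above, so Exp-gamma samples can be generated directly and fed to the two tests the abstract cautions against.

```python
# Minimal sketch: generate Exp-gamma (log-transformed gamma) samples under two
# conditions and apply the two tests discussed above. Shape/scale values are
# illustrative assumptions, not taken from the paper.
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(1)
y1 = np.log(rng.gamma(shape=2.0, scale=1.0, size=100))  # condition 1, Exp-gamma
y2 = np.log(rng.gamma(shape=2.0, scale=1.5, size=100))  # condition 2, Exp-gamma

print(ttest_ind(y1, y2))     # two-sample t-test on the log scale
print(mannwhitneyu(y1, y2))  # Wilcoxon-Mann-Whitney test
```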
The documentation covers Enterprise Survey panel datasets that were collected in Slovenia in 2009, 2013 and 2019.
The Slovenia ES 2009 was conducted between 2008 and 2009. The Slovenia ES 2013 was conducted between March 2013 and September 2013. Finally, the Slovenia ES 2019 was conducted between December 2018 and November 2019. The objective of the Enterprise Survey is to gain an understanding of what firms experience in the private sector.
As part of its strategic goal of building a climate for investment, job creation, and sustainable growth, the World Bank has promoted improving the business environment as a key strategy for development, which has led to a systematic effort in collecting enterprise data across countries. The Enterprise Surveys (ES) are an ongoing World Bank project in collecting both objective data based on firms' experiences and enterprises' perception of the environment in which they operate.
National
The primary sampling unit of the study is the establishment. An establishment is a physical location where business is carried out and where industrial operations take place or services are provided. A firm may be composed of one or more establishments. For example, a brewery may have several bottling plants and several establishments for distribution. For the purposes of this survey an establishment must take its own financial decisions and have its own financial statements separate from those of the firm. An establishment must also have its own management and control over its payroll.
As it is standard for the ES, the Slovenia ES was based on the following size stratification: small (5 to 19 employees), medium (20 to 99 employees), and large (100 or more employees).
Sample survey data [ssd]
The samples for the Slovenia ES 2009, 2013, and 2019 were selected using stratified random sampling, following the methodology explained in the Sampling Manual for the Slovenia 2009 ES and the Slovenia 2013 ES, and in the Sampling Note for the 2019 Slovenia ES.
Three levels of stratification were used in this country: industry, establishment size, and oblast (region). The original sample designs with specific information on the industries and regions chosen are included in the attached Excel file (Sampling Report.xls) for the Slovenia 2009 ES. For the Slovenia 2013 and 2019 ES, specific information on the industries and regions chosen is described in "The Slovenia 2013 Enterprise Surveys Data Set" and "The Slovenia 2019 Enterprise Surveys Data Set" reports, respectively, in Appendix E.
For the Slovenia 2009 ES, industry stratification was designed as follows: the universe was stratified into manufacturing industries, services industries, and one residual (core) sector as defined in the sampling manual. Each industry had a target of 90 interviews. For the manufacturing industries, sample sizes were inflated by about 17% to account for potential non-response when requesting sensitive financial data and because of likely attrition in future surveys that would affect the construction of a panel. For the other industries (residuals), sample sizes were inflated by about 12% to account for undersampling of firms in service industries.
For the Slovenia 2013 ES, industry stratification was designed as follows: the universe was stratified into one manufacturing industry and two service industries (retail and other services).
Finally, for Slovenia 2019 ES, three levels of stratification were used in this country: industry, establishment size, and region. The original sample design with specific information of the industries and regions chosen is described in "The Slovenia 2019 Enterprise Surveys Data Set" report, Appendix C. Industry stratification was done as follows: Manufacturing – combining all the relevant activities (ISIC Rev. 4.0 codes 10-33), Retail (ISIC 47), and Other Services (ISIC 41-43, 45, 46, 49-53, 55, 56, 58, 61, 62, 79, 95).
For Slovenia 2009 and 2013 ES, size stratification was defined following the standardized definition for the rollout: small (5 to 19 employees), medium (20 to 99 employees), and large (more than 99 employees). For stratification purposes, the number of employees was defined on the basis of reported permanent full-time workers. This seems to be an appropriate definition of the labor force since seasonal/casual/part-time employment is not a common practice, except in the sectors of construction and agriculture.
For Slovenia 2009 ES, regional stratification was defined in 2 regions. These regions are Vzhodna Slovenija and Zahodna Slovenija. The Slovenia sample contains panel data. The wave 1 panel “Investment Climate Private Enterprise Survey implemented in Slovenia” consisted of 223 establishments interviewed in 2005. A total of 57 establishments have been re-interviewed in the 2008 Business Environment and Enterprise Performance Survey.
For Slovenia 2013 ES, regional stratification was defined in 2 regions (city and the surrounding business area) throughout Slovenia.
Finally, for Slovenia 2019 ES, regional stratification was done across two regions: Eastern Slovenia (NUTS code SI03) and Western Slovenia (SI04).
Computer Assisted Personal Interview [capi]
The questionnaires share common questions (the core module) plus additional manufacturing- and services-specific questions. The eligible manufacturing industries were surveyed using the Manufacturing questionnaire (which includes the core module plus manufacturing-specific questions). Retail firms were interviewed using the Services questionnaire (which includes the core module plus retail-specific questions), and the residual eligible services were covered using the Services questionnaire (which includes the core module). Each variation of the questionnaire is identified by the index variable, a0.
Survey non-response must be differentiated from item non-response. The former refers to refusals to participate in the survey altogether, whereas the latter refers to refusals to answer specific questions. Enterprise Surveys suffer from both problems, and different strategies were used to address these issues.
Item non-response was addressed by two strategies: (a) for sensitive questions that may generate negative reactions from the respondent, such as corruption or tax evasion, enumerators were instructed to record a refusal to respond as (-8); (b) establishments with incomplete information were re-contacted in order to complete this information whenever necessary. However, there were clear cases of low response.
For 2009 and 2013 Slovenia ES, the survey non-response was addressed by maximizing efforts to contact establishments that were initially selected for interview. Up to 4 attempts were made to contact the establishment for interview at different times/days of the week before a replacement establishment (with similar strata characteristics) was suggested for interview. Survey non-response did occur but substitutions were made in order to potentially achieve strata-specific goals. Further research is needed on survey non-response in the Enterprise Surveys regarding potential introduction of bias.
For 2009, the number of contacted establishments per realized interview was 6.18. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey), and the quality of the sample frame, as represented by the presence of ineligible units. The relatively low ratio of contacted establishments per realized interview (6.18) suggests that the main source of error in estimates in Slovenia may be selection bias and not frame inaccuracy.
For 2013, the rate of realized interviews per contacted establishment was 25%. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey), and the quality of the sample frame, as represented by the presence of ineligible units. The share of rejections per contact was 44%.
Finally, for 2019, the rate of interviews per contacted establishment was 9.7%. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey), and the quality of the sample frame, as represented by the presence of ineligible units. The share of rejections per contact was 75.2%.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data in social and behavioral sciences are routinely collected using questionnaires, and each domain of interest is tapped by multiple indicators. Structural equation modeling (SEM) is one of the most widely used methods to analyze such data. However, conventional methods for SEM face difficulty when the number of variables (p) is large even when the sample size (N) is also rather large. This article addresses the issue of model inference with the likelihood ratio statistic T_ml. Using the method of empirical modeling, mean-and-variance corrected statistics for SEM with many variables are developed. Results show that the new statistics not only perform much better than T_ml but also are substantial improvements over other corrections to T_ml. When combined with a robust transformation, the new statistics also perform well with non-normally distributed data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
###Preamble###
This upload contains initialization files and data for simulations reported in:
https://arxiv.org/abs/1506.09008: Coarse-grained modelling of strong DNA bending II: Cyclization
The initialization files allow a user to repeat the reported simulations using the oxDNA model. oxDNA is available for download from:
https://dna.physics.ox.ac.uk/index.php/Main_Page.
The use and meaning of the input and output files are documented extensively on this wiki.
This is only a practice upload, and so only a small part of the total material is included.
###Organisation###
Simulations are organised by system type. Folder DXXCYY corresponds to simulation of a cyclization system with Nd = XX and Nbp=YY, with the meaning of these symbols given in the text referenced above. "_seq" indicates simulation of a specific sequence, as listed in Table S1 of the reference.
For each system, a "closed" and an "open" folder are present. These correspond to the two windows of umbrella sampling that were performed separately.
###Content###
Within each folder are the necessary initialization files to run the simulations exactly as reported in the reference above, simply by calling oxDNA from within the folder, using "inputVMMC" as the input file.
Also included are output files for a single realisation of the simulation.
Note that the results in the reference above were all obtained from 5 independent replicas, using different initial conditions and different seeds. These can be (statistically) recreated simply by drawing random starting configurations from the single available traj_hist file.
Water-quality replicate sample data and field blank data were collected at the Colorado River above Imperial Dam, Colorado River below Cooper Wasteway, Yuma Main Drain, and 242 Lateral during 2017 and 2018. Instantaneous discharge data were collected at the Cooper Wasteway, Yuma Main Drain, and 242 Lateral from January 2017 to March 2019; readings were recorded at a fixed interval of 5 minutes. Mean daily discharge data were collected at the Colorado River above Imperial Dam, Cooper Wasteway, Yuma Main Drain, and 242 Lateral from January 2017 to March 2019. Instantaneous discharge and mean daily discharge data were provided to the USGS by the International Boundary and Water Commission (IBWC). Discrete water-quality samples were collected at the Colorado River above Imperial Dam, Colorado River below Cooper Wasteway, Yuma Main Drain, and 242 Lateral from 2017 through March 2019, and the values were used to compute dissolved solids concentrations using BOR's method.
analyze the current population survey (cps) annual social and economic supplement (asec) with r

the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no.

despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.

the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

this new github repository contains three scripts:

2005-2012 asec - download all microdata.R
- download the fixed-width file containing household, family, and person records
- import by separating this file into three tables, then merge 'em together at the person-level
- download the fixed-width file containing the person-level replicate weights
- merge the rectangular person-level file with the replicate weights, then store it in a sql database
- create a new variable - one - in the data table

2012 asec - analysis examples.R
- connect to the sql database created by the 'download all microdata' program
- create the complex sample survey object, using the replicate weights
- perform a boatload of analysis examples

replicate census estimates - 2011.R
- connect to the sql database created by the 'download all microdata' program
- create the complex sample survey object, using the replicate weights
- match the sas output shown in the png file below

2011 asec replicate weight sas output.png: statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document.

click here to view these three scripts

for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
- the census bureau's current population survey page
- the bureau of labor statistics' current population survey page
- the current population survey's wikipedia article

notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.

confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
This statistic illustrates the results of a survey regarding the opinion on the meaning of the term fake news in Turkey in 2018. According to data published by IPSOS, ** percent of Turkish adults stated that they personally thought of politicians and the media using the term to discredit news they did not agree with.
The 1961 Census Microdata Individual File for Great Britain: 5% Sample dataset was created from existing digital records from the 1961 Census under a project known as Enhancing and Enriching Historic Census Microdata Samples (EEHCM), which was funded by the Economic and Social Research Council with input from the Office for National Statistics and National Records of Scotland. The project ran from 2012-2014 and was led from the UK Data Archive, University of Essex, in collaboration with the Cathie Marsh Institute for Social Research (CMIST) at the University of Manchester and the Census Offices. In addition to the 1961 data, the team worked on files from the 1971 Census and 1981 Census.
The original 1961 records preceded current data archival standards and were created before microdata sets for secondary use were anticipated. A process of data recovery and quality checking was necessary to maximise their utility for current researchers, though some imperfections remain (see the User Guide for details). Three other 1961 Census datasets have been created:
This statistic illustrates the results of a survey regarding the opinion on the meaning of the term fake news in Serbia in 2018. According to data published by IPSOS, 66 percent of Serbian adults stated that they personally thought of politicians using the term to support their side of the argument.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Whilst the outcome variables on the Bells test (e.g., total omissions) are easily quantified, process data (e.g., search strategy) have been more difficult to quantify objectively. CancellationTools is a software package that calculates cancellation task-based process variables following input of cancellation target identification order. The primary aim of the current study was to examine the psychometric properties of several CancellationTools process variables for the Bells test.
Method: The CancellationTools process variables (mean distance, standardized mean distance, standardized angle, best r, and intersections rate) were calculated for the Bells Test in a diverse neurological sample (n=101) and a healthy Australian sample (n=57). Ratings of cancellation path organization using an ordinal categorical variable (systematic, disorganized, indeterminate) were completed by two experienced clinicians. Construct validity, criterion validity, known-groups validity, test-retest reliability, and test operating characteristics were examined.
Results: Mean distance, standardized angle, best r and intersections rate, but not standardized mean distance, showed good construct (convergent) validity with clinician ratings. Standardized angle, best r and intersections rate showed good divergent validity when compared with outcome variables. Criterion validity was established for best r and intersections rate. The CancellationTools measures did not show known groups validity, although clinician ratings did. Good test-retest reliability was demonstrated for best r (ICC = .79) and clinician-rated search strategy (ICC = .75). Best r explained 88% of the area under the curve when classifying disorganized vs other search strategies based on clinician ratings.
Conclusion: Best r emerged as the most psychometrically robust of the CancellationTools measures.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database includes simulated data showing the accuracy of estimated probability distributions of project durations when limited data are available for the project activities. The base project networks are taken from PSPLIB. Then, various stochastic project networks are synthesized by changing the variability and skewness of project activity durations.
Number of variables: 20
Number of cases/rows: 114240
Variable List:
• Experiment ID: The ID of the experiment
• Experiment for network: The ID of the experiment for each of the synthesized networks
• Network ID: ID of the synthesized network
• #Activities: Number of activities in the network, including start and finish activities
• Variability: Variance of the activities in the network (this value can be high, low, medium, or rand, where rand denotes a random combination of low, medium, and high variance across the network activities)
• Skewness: Skewness of the activities in the network (this value can be right, left, none, or rand, where rand denotes a random combination of right-skewed, left-skewed, and non-skewed activities in the network)
• Fitted distribution type: Distribution type fitted to the sampled data
• Sample size: Number of sampled data points used for the experiment, resembling the limited-data condition
• Benchmark 10th percentile: 10th percentile of project duration in the benchmark stochastic project network
• Benchmark 50th percentile: 50th percentile of project duration in the benchmark stochastic project network
• Benchmark 90th percentile: 90th percentile of project duration in the benchmark stochastic project network
• Benchmark mean: Mean of project duration in the benchmark stochastic project network
• Benchmark variance: Variance of project duration in the benchmark stochastic project network
• Experiment 10th percentile: 10th percentile of project duration distribution for the experiment
• Experiment 50th percentile: 50th percentile of project duration distribution for the experiment
• Experiment 90th percentile: 90th percentile of project duration distribution for the experiment
• Experiment mean: Mean of project duration distribution for the experiment
• Experiment variance: Variance of project duration distribution for the experiment
• K-S: Kolmogorov–Smirnov test statistic comparing the benchmark distribution and the project duration distribution of the experiment (a computation sketch follows this list)
• P_value: The p-value based on the distance calculated in the K-S test
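As a hypothetical illustration (stand-in gamma distributions, not data from this database), a K-S distance and p-value like the two fields above can be computed with scipy's two-sample K-S test:

```python
# Hypothetical sketch: K-S comparison of a benchmark duration distribution
# against an experiment's sample, analogous to the K-S and P_value columns.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
benchmark = rng.gamma(shape=4.0, scale=10.0, size=5000)  # stand-in benchmark durations
experiment = rng.gamma(shape=4.2, scale=9.5, size=200)   # stand-in experiment sample

result = ks_2samp(benchmark, experiment)
print(result.statistic, result.pvalue)  # K-S distance and its p-value
```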
A random sample of households was invited to participate in this survey. In the dataset, you will find the respondent-level data in each row, with the questions in each column. The numbers represent a scale option from the survey, such as 1=Excellent, 2=Good, 3=Fair, 4=Poor. The question stem, response options, and scale information for each field can be found in the "variable labels" and "value labels" sheets. VERY IMPORTANT NOTE: The scientific survey data were weighted, meaning that the demographic profile of respondents was compared to the demographic profile of adults in Bloomington from US Census data. Statistical adjustments were made to bring the respondent profile into balance with the population profile. This means that some records were given more "weight" and some records were given less weight. The weights that were applied are found in the field "wt". If you do not apply these weights, you will not obtain the same results as can be found in the report delivered to Bloomington. The easiest way to replicate these results is likely to create pivot tables and use the sum of the "wt" field rather than a count of responses, as in the sketch below.
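A minimal pandas sketch of that weighted tabulation (the file name and the question column "q1" are hypothetical; only the "wt" field is named in the text):

```python
# Hypothetical sketch: weighted response percentages via the "wt" field.
import pandas as pd

df = pd.read_csv("bloomington_survey.csv")  # hypothetical file name

# Weighted distribution for one question, e.g. a column "q1" coded 1=Excellent..4=Poor:
weighted = df.groupby("q1")["wt"].sum()
print(100 * weighted / weighted.sum())  # weighted percentages, as in the report

# Unweighted percentages, for comparison:
print(100 * df["q1"].value_counts(normalize=True))
```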
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This paper evaluates the claim that Welch’s t-test (WT) should replace the independent-samples t-test (IT) as the default approach for comparing sample means. Simulations involving unequal and equal variances, skewed distributions, and different sample sizes were performed. For normal distributions, we confirm that the WT maintains the false positive rate close to the nominal level of 0.05 when sample sizes and standard deviations are unequal. However, the WT was found to yield inflated false positive rates under skewed distributions with unequal sample sizes. A complementary empirical study based on gender differences in two psychological scales corroborates these findings. Finally, we contend that the null hypothesis of unequal variances together with equal means lacks plausibility, and that empirically, a difference in means typically coincides with differences in variance and skewness. An additional analysis using the Kolmogorov-Smirnov and Anderson-Darling tests demonstrates that examining entire distributions, rather than just their means, can provide a more suitable alternative when facing unequal variances or skewed distributions. Given these results, researchers should remain cautious with software defaults, such as R favoring Welch’s test.
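A minimal simulation sketch in the spirit of the study (normal data, unequal sample sizes and standard deviations; the parameters are illustrative assumptions, not the paper's):

```python
# Sketch: false positive rates of Student's t vs. Welch's t under a true null
# of equal means, with the smaller group having the larger standard deviation.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_sims, alpha = 10_000, 0.05
n1, sd1 = 20, 3.0  # small sample, large SD
n2, sd2 = 80, 1.0  # large sample, small SD

student_fp = welch_fp = 0
for _ in range(n_sims):
    x = rng.normal(0.0, sd1, n1)
    y = rng.normal(0.0, sd2, n2)
    student_fp += ttest_ind(x, y, equal_var=True).pvalue < alpha   # pooled-variance t
    welch_fp += ttest_ind(x, y, equal_var=False).pvalue < alpha    # Welch's t

print("Student's t false positive rate:", student_fp / n_sims)  # inflated above 0.05
print("Welch's t false positive rate:  ", welch_fp / n_sims)    # near 0.05
```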
This statistic illustrates the results of a survey regarding the opinion on the meaning of the term fake news in Great Britain in 2018. According to data published by IPSOS, 42 percent of British adults stated that they personally thought of politicians and the media using the term to discredit news they did not agree with.
Your Client WOMart is a leading nutrition and supplement retail chain that offers a comprehensive range of products for all your wellness and fitness needs.
WOMart follows a multi-channel distribution strategy with 350+ retail stores spread across 100+ cities.
Effective forecasting of store sales gives essential insight into upcoming cash flow, meaning WOMart can plan cash flow at the store level more accurately.
Sales data for 18 months from 365 WOMart stores are available, along with information on the store type, location type, and region code of each store, the discount offered by the store on each day, the number of orders received each day, etc.
Your task is to predict the store sales for each store in the test set for the next two months.
Train Data

|Variable |Definition |
|---------|-----------|
|ID |Unique identifier for a row |
|Store_id |Unique ID for each store |
|Store_Type |Type of the store |
|Location_Type |Type of the location where the store is located |
|Region_Code |Code of the region where the store is located |
|Date |Information about the date |
|Holiday |If there is a holiday on the given date, 1: Yes, 0: No |
|Discount |If a discount is offered by the store on the given date, Yes/No |
|#Orders |Number of orders received by the store on the given day |
|Sales |Total sales for the store on the given day |

Test Data

|Variable |Definition |
|---------|-----------|
|ID |Unique identifier for a row |
|Store_id |Unique ID for each store |
|Store_Type |Type of the store |
|Location_Type |Type of the location where the store is located |
|Region_Code |Code of the region where the store is located |
|Date |Information about the date |
|Holiday |If there is a holiday on the given date, 1: Yes, 0: No |
|Discount |If a discount is offered by the store on the given date, Yes/No |

Sample_Submission

|Variable |Definition |
|---------|-----------|
|ID |Unique identifier for a row |
|Sales |Total sales for the store on the given day |
Public and Private Split
The Sales column you submit is compared to the actual values as in the following example, except over 22,266 items instead of 8; the score is 1000 times the mean squared log error (the metric function is available in sklearn).

Sample Input:

actual = [27.5, 55.9, 25.8, 17.7, 27.6, 55.9, 25.7, 17.8]
predicted = [24.0, 49.1, 21.0, 16.2, 23.3, 47.0, 12.1, 15.2]

Sample Output:

82.9949678377161
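The sample output above equals 1000 times sklearn's mean squared log error on these eight values, so, assuming that is indeed the competition metric, the score can be reproduced as follows:

```python
# Reproducing the sample score as 1000 * mean squared log error (MSLE).
from sklearn.metrics import mean_squared_log_error

actual = [27.5, 55.9, 25.8, 17.7, 27.6, 55.9, 25.7, 17.8]
predicted = [24.0, 49.1, 21.0, 16.2, 23.3, 47.0, 12.1, 15.2]

score = 1000 * mean_squared_log_error(actual, predicted)
print(score)  # 82.9949678377161
```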
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this article, we propose a new projection test for linear hypotheses on regression coefficient matrices in linear models with high-dimensional responses. We systematically study the theoretical properties of the proposed test. We first derive the optimal projection matrix for any given projection dimension to achieve the best power and provide an upper bound for the optimal dimension of the projection matrix. We further provide insights into how to construct the optimal projection matrix. One- and two-sample mean problems can be formulated as special cases of the linear hypotheses studied in this article. We demonstrate both theoretically and empirically that the proposed test can outperform existing ones for one- and two-sample mean problems. We conduct Monte Carlo simulations to examine the finite-sample performance and illustrate the proposed test with a real data example.
The 1997 Jordan Population and Family Health Survey (JPFHS) is a national sample survey carried out by the Department of Statistics (DOS) as part of its National Household Surveys Program (NHSP). The JPFHS was specifically aimed at providing information on fertility, family planning, and infant and child mortality. Information was also gathered on breastfeeding, on maternal and child health care and nutritional status, and on the characteristics of households and household members. The survey will provide policymakers and planners with important information for use in formulating informed programs and policies on reproductive behavior and health.
National
Sample survey data
SAMPLE DESIGN AND IMPLEMENTATION
The 1997 JPFHS sample was designed to produce reliable estimates of major survey variables for the country as a whole, for urban and rural areas, for the three regions (each composed of a group of governorates), and for the three major governorates, Amman, Irbid, and Zarqa.
The 1997 JPFHS sample is a subsample of the master sample that was designed using the frame obtained from the 1994 Population and Housing Census. A two-stage sampling procedure was employed. First, primary sampling units (PSUs) were selected with probability proportional to the number of housing units in the PSU. A total of 300 PSUs were selected at this stage. In the second stage, in each selected PSU, occupied housing units were selected with probability inversely proportional to the number of housing units in the PSU. This design maintains a self-weighted sampling fraction within each governorate.
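A toy sketch (hypothetical PSU sizes, not the JPFHS frame) of why this two-stage design is self-weighting: PPS selection at stage one combined with a within-PSU selection probability inversely proportional to size at stage two gives every housing unit the same overall selection probability.

```python
# Toy illustration of self-weighting two-stage sampling. Stage one selects
# PSUs with probability proportional to size (PPS); stage two selects a fixed
# take of housing units per PSU, i.e. with probability inversely proportional
# to PSU size. PSU sizes below are hypothetical.
psu_sizes = [120, 80, 200, 50, 150]  # housing units per PSU
total_units = sum(psu_sizes)
n_psus = 2         # PSUs drawn at stage one
take_per_psu = 10  # housing units drawn within each selected PSU

for i, size in enumerate(psu_sizes):
    p1 = n_psus * size / total_units  # stage-one PPS inclusion probability
    p2 = take_per_psu / size          # stage-two probability, proportional to 1/size
    print(f"PSU {i}: overall selection probability = {p1 * p2:.4f}")
# Every line prints the same value: n_psus * take_per_psu / total_units
```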
UPDATING OF SAMPLING FRAME
Prior to the main fieldwork, mapping operations were carried out and the sample units/blocks were selected and then identified and located in the field. The selected blocks were delineated and the outer boundaries were demarcated with special signs. During this process, the numbers on buildings and housing units were updated, listed and documented, along with the name of the owner/tenant of the unit or household and the name of the household head. These activities took place between January 7 and February 28, 1997.
Note: See detailed description of sample design in APPENDIX A of the survey report.
Face-to-face
The 1997 JPFHS used two questionnaires, one for the household interview and the other for eligible women. Both questionnaires were developed in English and then translated into Arabic. The household questionnaire was used to list all members of the sampled households, including usual residents as well as visitors. For each member of the household, basic demographic and social characteristics were recorded and women eligible for the individual interview were identified. The individual questionnaire was developed utilizing the experience gained from previous surveys, in particular the 1983 and 1990 Jordan Fertility and Family Health Surveys (JFFHS).
The 1997 JPFHS individual questionnaire consists of 10 sections:
- Respondent’s background
- Marriage
- Reproduction (birth history)
- Contraception
- Pregnancy, breastfeeding, health and immunization
- Fertility preferences
- Husband’s background, woman’s work and residence
- Knowledge of AIDS
- Maternal mortality
- Height and weight of children and mothers
Fieldwork and data processing activities overlapped. After a week of data collection, and after field editing of questionnaires for completeness and consistency, the questionnaires for each cluster were packaged together and sent to the central office in Amman where they were registered and stored. Special teams were formed to carry out office editing and coding.
Data entry started after a week of office data processing. The process of data entry, editing, and cleaning was done by means of the ISSA (Integrated System for Survey Analysis) program DHS has developed especially for such surveys. The ISSA program allows data to be edited while being entered. Data entry was completed on November 14, 1997. A data processing specialist from Macro made a trip to Jordan in November and December 1997 to identify problems in data entry, editing, and cleaning, and to work on tabulations for both the preliminary and final report.
A total of 7,924 occupied housing units were selected for the survey; from among those, 7,592 households were found. Of the occupied households, 7,335 (97 percent) were successfully interviewed. In those households, 5,765 eligible women were identified, and complete interviews were obtained with 5,548 of them (96 percent of all eligible women). Thus, the overall response rate of the 1997 JPFHS was 93 percent (the product of the household and individual response rates: 0.97 × 0.96 ≈ 0.93). The principal reason for nonresponse among the women was the failure of interviewers to find them at home despite repeated callbacks.
Note: See summarized response rates by place of residence in Table 1.1 of the survey report.
The estimates from a sample survey are subject to two types of errors: nonsampling errors and sampling errors. Nonsampling errors are the result of mistakes made in implementing data collection and data processing (such as failure to locate and interview the correct household, misunderstanding questions either by the interviewer or the respondent, and data entry errors). Although during the implementation of the 1997 JPFHS numerous efforts were made to minimize this type of error, nonsampling errors are not only impossible to avoid but also difficult to evaluate statistically.
Sampling errors, on the other hand, can be evaluated statistically. The respondents selected in the 1997 JPFHS constitute only one of many samples that could have been selected from the same population, given the same design and expected size. Each of those samples would have yielded results differing somewhat from the results of the sample actually selected. Sampling errors are a measure of the variability among all possible samples. Although the degree of variability is not known exactly, it can be estimated from the survey results.
A sampling error is usually measured in terms of the standard error for a particular statistic (mean, percentage, etc.), which is the square root of the variance. The standard error can be used to calculate confidence intervals within which the true value for the population can reasonably be assumed to fall. For example, for any given statistic calculated from a sample survey, the value of that statistic will fall within a range of plus or minus two times the standard error of that statistic in 95 percent of all possible samples of identical size and design.
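As a hypothetical worked example (values invented for illustration, not survey estimates): for an estimated proportion of 0.30 with a standard error of 0.01, the interval described above is

$$\hat{p} \pm 2\,\mathrm{SE}(\hat{p}) = 0.30 \pm 2(0.01) = (0.28,\ 0.32),$$

so in roughly 95 percent of samples of identical size and design, an interval computed this way would contain the true population value.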
If the sample of respondents had been selected as a simple random sample, it would have been possible to use straightforward formulas for calculating sampling errors. However, since the 1997 JPFHS sample resulted from a multistage stratified design, formulae of higher complexity had to be used. The computer software used to calculate sampling errors for the 1997 JPFHS was the ISSA Sampling Error Module, which uses the Taylor linearization method of variance estimation for survey estimates that are means or proportions. The Jackknife repeated replication method is used for variance estimation of more complex statistics, such as fertility and mortality rates.
Note: See detailed estimate of sampling error calculation in APPENDIX B of the survey report.
Data Quality Tables:
- Household age distribution
- Age distribution of eligible and interviewed women
- Completeness of reporting
- Births by calendar years
- Reporting of age at death in days
- Reporting of age at death in months
Note: See detailed tables in APPENDIX C of the survey report.
The 1971 Census Microdata for Great Britain: 9% Sample: Secure Access dataset was created from existing digital records from the 1971 Census. It comprises a larger population sample than the other files available from the 1971 Census (see below) and so contains sufficient information to constitute personal data, meaning that it is only available to Accredited Researchers, under restrictive Secure Access conditions. See Access section for further details.
The file was created under a project known as Enhancing and Enriching Historic Census Microdata Samples (EEHCM), which was funded by the Economic and Social Research Council with input from the Office for National Statistics and National Records of Scotland. The project ran from 2012-2014 and was led from the UK Data Archive, University of Essex, in collaboration with the Cathie Marsh Institute for Social Research (CMIST) at the University of Manchester and the Census Offices. In addition to the 1971 data, the team worked on files from the 1961 Census and 1981 Census.
The original 1971 records preceded current data archival standards and were created before microdata sets for secondary use were anticipated. A process of data recovery and quality checking was necessary to maximise their utility for current researchers, though some imperfections remain (see the User Guide for details).
Three other 1971 Census datasets have been created; users should obtain the other datasets in the series first to see whether they are sufficient for their research needs before considering making an application for this study (SN 8271), the Secure Access version:
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Supporting documentation on code lists, subject definitions, data accuracy, and statistical testing can be found on the American Community Survey website in the Data and Documentation section. Sample size and data quality measures (including coverage rates, allocation rates, and response rates) can be found on the American Community Survey website in the Methodology section.

Although the American Community Survey (ACS) produces population, demographic and housing unit estimates, for 2010 the 2010 Census provides the official counts of the population and housing units for the nation, states, counties, cities and towns.

Explanation of Symbols:
- A "**" entry in the margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.
- A "-" entry in the estimate column indicates that either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
- A "-" following a median estimate means the median falls in the lowest interval of an open-ended distribution.
- A "+" following a median estimate means the median falls in the upper interval of an open-ended distribution.
- A "***" entry in the margin of error column indicates that the median falls in the lowest interval or upper interval of an open-ended distribution. A statistical test is not appropriate.
- A "*****" entry in the margin of error column indicates that the estimate is controlled. A statistical test for sampling variability is not appropriate.
- An "N" entry in the estimate and margin of error columns indicates that data for this geographic area cannot be displayed because the number of sample cases is too small.
- An "(X)" means that the estimate is not applicable or not available.

Estimates of urban and rural population, housing units, and characteristics reflect boundaries of urban areas defined based on Census 2000 data. Boundaries for urban areas have not been updated since Census 2000. As a result, data for urban and rural areas from the ACS do not necessarily reflect the results of ongoing urbanization.

While the 2010 American Community Survey (ACS) data generally reflect the December 2009 Office of Management and Budget (OMB) definitions of metropolitan and micropolitan statistical areas, in certain instances the names, codes, and boundaries of the principal cities shown in ACS tables may differ from the OMB definitions due to differences in the effective dates of the geographic entities.

The Census Bureau introduced a new set of disability questions in the 2008 ACS questionnaire. Accordingly, comparisons of disability data from 2008 or later with data from prior years are not recommended. For more information on these questions and their evaluation in the 2006 ACS Content Test, see the Evaluation Report Covering Disability.

Data for year of entry of the native population reflect the year of entry into the U.S. by people who were born in Puerto Rico or U.S. Island Areas, or born outside the U.S. to a U.S. citizen parent, and who subsequently moved to the U.S.

Ancestry listed in this table refers to the total number of people who responded with a particular ancestry; for example, the estimate given for Russian represents the number of people who listed Russian as either their first or second ancestry. This table lists only the largest ancestry groups; see the Detailed Tables for more categories. Race and Hispanic origin groups are not included in this table because official data for those groups come from the Race and Hispanic origin questions rather than the ancestry question (see Demographic Table).

Starting in 2008, the Scotch-Irish category does not include Irish-Scotch. People who reported Irish-Scotch ancestry are classified under "Other groups," whereas in 2007 and earlier they were classified as Scotch-Irish.

Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted roughly as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.

Source: U.S. Census Bureau, 2010 American Community Survey
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset contains a list of COVID fake news items and claims that have been shared across the internet.
Content
Headlines: string attribute containing the headline/claim that was shared.
Outcome: binary label where 0 means the headline is fake and 1 means it is true.
Inspiration
A common question across many research portals was whether a combined fake news dataset is available; that question led to the publication of this dataset.