http://opendatacommons.org/licenses/dbcl/1.0/
The complete COVID-19 dataset is a collection of COVID-19 data maintained and provided by Our World in Data. The Our World in Data team will update it daily for the duration of the COVID-19 pandemic.
The dataset includes the following information:

| Metrics | Source | Updated | Countries |
| --- | --- | --- | --- |
| Vaccinations | Official data collated by the Our World in Data team | Daily | 218 |
| Tests & positivity | Official data collated by the Our World in Data team | Weekly | 139 |
| Hospital & ICU | Official data collated by the Our World in Data team | Weekly | 39 |
| Confirmed cases | JHU CSSE COVID-19 Data | Daily | 196 |
| Confirmed deaths | JHU CSSE COVID-19 Data | Daily | 196 |
| Reproduction rate | Arroyo-Marioli F, Bullano F, Kucinskas S, Rondón-Moreno C | Daily | 185 |
| Policy responses | Oxford COVID-19 Government Response Tracker | Daily | 186 |
| Other variables of interest | International organizations (UN, World Bank, OECD, IHME…) | Fixed | |
Data dictionary is available below ⤵
I'd like to clarify that I'm only making the vaccine data collected by Our World in Data available to the Kaggle community. A new version of this dataset is gathered, integrated, and posted daily, following the data that Our World in Data maintains in their GitHub repository.
📷 Images by Fusion Medical Animation.
The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly. While it was estimated at ***** zettabytes in 2025, the forecast for 2029 stands at ***** zettabytes. Thus, global data generation will triple between 2025 and 2029. Data creation has been expanding continuously over the past decade. In 2020, the growth was higher than previously expected, caused by the increased demand due to the coronavirus (COVID-19) pandemic, as more people worked and learned from home and used home entertainment options more often.
The rapid and massive dissemination of mobile phones in the developing world is creating new opportunities for the discipline of survey research. The World Bank is interested in leveraging mobile phone technology as a means of direct communication with poor households in the developing world in order to gather rapid feedback on the impact of economic crises and other events on the economy of such households.
The World Bank commissioned Gallup to conduct the Listening to LAC (L2L) pilot program, a research project aimed at testing the feasibility of mobile phone technology as a way of data collection for conducting quick turnaround, self-administered, longitudinal surveys among households in Peru and Honduras.
The project used face-to-face interviews as its benchmark, and included Short Message Service (SMS), Interactive Voice Response (IVR) and Computer Assisted Telephone Interviews (CATI) as test methods of data collection.
The pilot was designed in a way that allowed testing the response rates and the quality of data, while also providing information on the cost of collecting data using mobile phones. Researchers also evaluated if providing incentives affected panel attrition rates. The Honduras design was a test-retest design, which is closely related to the difference-in-difference methodology of experimental evaluation.
A random stratified multistage sampling technique was used to select a nationally representative sample of 1,500 households. During the initial face-to-face interviews, researchers gathered information on the socio-economic characteristics of households and recruited participants for follow-up research. Question wording was the same in all modes of data collection.
In Honduras, after the initial face-to-face interviews, respondents were exposed to the remaining three methodologies according to a randomized scheme (three rotations, one methodology per week). Panelists in Honduras were surveyed for four and a half months, starting in February 2012.
Includes the entire national territory, with the exception of neighborhoods where interviewer access is extremely difficult, due to lack of transportation infrastructure or to situations that threaten the physical integrity of the interviewers and supervisors (e.g. extremely high crime rate, warfare, etc.)
All the households that exist in the neighborhoods of Honduras, as reported by the 2001 Census. Institutions such as military, religious or educational living quarters are not included in the universe.
Sample survey data [ssd]
Honduras did not have an income oversample because its poverty rate is 60 percent, so oversampling 20 percent above the poverty rate would include a large portion of the middle class, who are not the most vulnerable in times of crisis.
The Honduras panel was built on a nationally representative sample of 1,500 households. The sample was drawn by means of a random, stratified, multistage design. The pilot used the Gallup World Poll sampling frame.
Census-defined municipalities were classified into five strata according to population size:
- Stratum I: municipalities with 500,000 to 999,000 inhabitants
- Stratum II: municipalities with 100,000 to 499,000 inhabitants
- Stratum III: municipalities with 50,000 to 99,000 inhabitants
- Stratum IV: municipalities with 10,000 to 49,000 inhabitants
- Stratum V: municipalities with fewer than 10,000 inhabitants
Interviews were then proportionally allocated to these five strata according to their share among the country's population.
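As a rough illustration of proportional allocation, the sketch below splits 1,500 interviews across five strata according to population share. The population totals are hypothetical placeholders, not figures from the study, and the largest-remainder rounding rule is an assumption, since the source does not say how fractional interviews were handled.

```python
def allocate_proportionally(stratum_populations, total_interviews):
    """Split total_interviews across strata in proportion to population."""
    total_pop = sum(stratum_populations.values())
    # Exact (possibly fractional) share of interviews for each stratum.
    exact = {s: total_interviews * p / total_pop
             for s, p in stratum_populations.items()}
    # Start from the integer part, then hand out what is left over
    # to the strata with the largest fractional remainders.
    alloc = {s: int(q) for s, q in exact.items()}
    leftover = total_interviews - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda s: exact[s] - alloc[s], reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1
    return alloc


# Hypothetical population totals per stratum (illustrative only).
populations = {
    "I": 1_900_000,
    "II": 1_400_000,
    "III": 1_100_000,
    "IV": 2_800_000,
    "V": 800_000,
}
allocation = allocate_proportionally(populations, 1500)
```

The allocations always sum to the requested total, which a plain `round()` on each share would not guarantee.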
The first stage of the design consisted of a random selection of Primary Sampling Units (PSU's) within each of the five strata previously defined.
In the second stage, in each PSU, one or more Secondary Sampling Units (SSU's) were then selected.
Once SSU's were selected, interviewers were sent to the field to proceed with the third stage of the sample design, which consisted of selecting households using a systematic "random route" procedure. Interviewers started from the previously selected "random origin" and walked around the block in clockwise direction, selecting every third household on their right hand side. They were also trained to handle vacant, nonresponsive, non-cooperative households, as well as other failed attempts, in a systematic manner.
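The "every third household" rule can be sketched as a simple systematic selection. This is a minimal sketch only: the household labels are made up, the block is modeled as a closed loop walked once, and counting from the household at the random origin is an assumption about how interviewers counted.

```python
def random_route(households, start_index, step=3):
    """Walk once around the block from start_index and pick every step-th household."""
    n = len(households)
    # Reorder the block so the walk begins at the random origin.
    ordered = [households[(start_index + i) % n] for i in range(n)]
    # Positions step, 2*step, ... along the walk (1-based) are selected.
    return ordered[step - 1::step]


# Hypothetical block of ten households listed in clockwise order.
block = [f"house_{i}" for i in range(1, 11)]
selected = random_route(block, start_index=0)
```

The rules for vacant or non-cooperative households mentioned above are not modeled here.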
Other [oth]
The following survey instruments were used in the project:
1) Initial face-to-face questionnaire
In Peru, the starting point was the ENAHO (National Household Survey) questionnaire. Step-wise regressions were done to select the set of questions that best predicted consumption. For the purposes of robustness, the regressions were also done with questions that best predicted income, which yielded the same results. A similar procedure was done in Honduras, using the latest household survey deployed by the Honduran Statistics Institute, except that only best predictors of income were chosen, because Honduras did not have a recent consumption aggregate.
The survey gathered information on households' demographics, household infrastructure, employment, remittances, income, accidents, food security, self-perceptions of poverty, Internet access and cellphone use.
2) Monthly questionnaires (SMS, IVR, CATI)
The questionnaires were worded exactly the same way, regardless of the mode, which meant short questions, since SMS is limited to 160 characters. A maximum of 10 questions had to be chosen for the monthly questionnaire. In addition, two questions sought to ensure the validity of the responses by testing if the respondent was a member of the household. Most questions were time-variant and each questionnaire was repeated to observe if answers changed over time. All questions related to variables that strongly affect household welfare and are likely to change in times of crisis.
3) Final face-to-face questionnaire
Gallup conducted face-to-face closing surveys among 700 panelists. The researchers asked about issues the respondents had with mobile phones and coverage during the test. Panelists were also asked what would motivate them to keep on participating in a project like this in the future.
The questionnaires were worded exactly the same way, regardless of the mode, which meant short questions, since SMS is limited to 160 characters, unlike IVR and CATI.
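A question bank can be screened against the 160-character SMS limit before fielding. The sketch below is an illustration with invented question text; it counts Python characters only and ignores message-encoding details, which is a simplifying assumption.

```python
SMS_LIMIT = 160  # character limit for a single SMS, as stated in the text


def too_long_for_sms(questions, limit=SMS_LIMIT):
    """Return the questions that would not fit in one SMS."""
    return [q for q in questions if len(q) > limit]


# Hypothetical question wordings (not taken from the L2L questionnaires).
questions = [
    "In the last month, did anyone in your household lose a job? Reply 1 for yes, 2 for no.",
    "x" * 161,  # deliberately one character over the limit
]
flagged = too_long_for_sms(questions)
```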
In Honduras, 41% of recruited households failed to answer the first round of follow-up surveys. The attrition rate from the initial face-to-face interview to the end of panel study was 50%.
As part of the survey administration process, Gallup implemented a number of mechanisms to maximize the response rate and panelist retention. The following strategies were applied to respondents who did not reply the first time:
Also, in order to minimize non-response, three types of incentives were given. First, households that did not own a mobile phone were provided one for free. Approximately 127 phones were donated in Honduras. Second, all communications between the interviewers and the households were free to the respondents. Finally, households were randomly assigned to one of three incentive levels: one-third of households received US$1 in free airtime for each questionnaire they answered, one-third received US$5 in free airtime, and one-third received no financial incentive (the control group).
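Random assignment of households to three equally sized incentive groups could be done along these lines. This is a sketch under assumptions: the household identifiers, the fixed seed, and the shuffle-then-deal procedure are illustrative and not taken from the project documentation.

```python
import random
from collections import Counter


def assign_incentives(household_ids, arms=("US$1", "US$5", "none"), seed=0):
    """Shuffle households, then deal them out to the arms in turn."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    shuffled = list(household_ids)
    rng.shuffle(shuffled)
    return {hh: arms[i % len(arms)] for i, hh in enumerate(shuffled)}


# 1,500 hypothetical household IDs, matching the panel size in the text.
assignment = assign_incentives(range(1, 1501))
group_sizes = Counter(assignment.values())
```

Dealing from a shuffled list guarantees that each arm receives exactly one-third of the households, whereas drawing an arm independently per household would only balance the groups on average.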
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset provides a comprehensive collection of time series data sourced from the World Bank Open Data Platform, covering a wide range of global indicators from 1960 to the most recently published year. It includes economic, social, environmental, and demographic metrics, making it an ideal resource for researchers, data scientists, and policymakers interested in global development trends, economic forecasting, or socio-economic analysis.
A tutorial on how to combine the topic datasets into one large dataset is also available.
My motivation for this project was to curate a high-quality collection of datasets for World Bank indicators organized by topics and structured in time-series, making them more accessible for data science projects. Since the World Bank’s Kaggle datasets have not been updated since 2019 https://www.kaggle.com/organizations/theworldbank, I saw an opportunity to provide more current data for the data analysis community.
This collection brings together more than 800 World Bank indicators organized into 18 topic‑specific CSV files. Each file is structured as a country‑year panel: every row represents a unique combination of year (1960‑present) and ISO‑3 country code, while the columns hold the topic’s indicators.
The collection includes datasets with a variety of indicators, such as:
- Economic Metrics: GDP growth (%), GDP per capita, consumer price inflation, merchandise trade, gross capital formation, and more.
- Social Metrics: School enrollment (primary, secondary, tertiary), infant mortality rate, maternal mortality rate, poverty headcount, and more.
- Environmental Metrics: Forest area, renewable energy consumption, food production indices, and more.
- Demographic Metrics: Urban population, life expectancy, net migration, and more.
This dataset is ideal for a variety of applications, including:
- Economic forecasting and trend analysis (e.g., GDP growth, inflation).
- Socio-economic studies (e.g., education, health, poverty).
- Environmental impact analysis (e.g., renewable energy adoption).
- Demographic research (e.g., population trends, migration).
Topic datasets can be merged with each other using year and country code. This tutorial with notebook code can help you get started quickly.
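A minimal sketch of such a merge, using only the standard library, is shown below. The column names `country_code` and `year`, the indicator names, and all values are placeholders chosen for illustration; check the actual CSV headers before adapting it.

```python
def merge_topics(*tables):
    """Outer-join lists of row dicts on the (country_code, year) key."""
    merged = {}
    for table in tables:
        for row in table:
            key = (row["country_code"], row["year"])
            # Create the combined row on first sight, then add this topic's columns.
            merged.setdefault(key, {}).update(row)
    return [merged[key] for key in sorted(merged)]


# Placeholder rows standing in for two topic files (values are made up).
economy = [
    {"country_code": "HND", "year": 2012, "gdp_growth": 1.0},
    {"country_code": "PER", "year": 2012, "gdp_growth": 2.0},
]
health = [
    {"country_code": "PER", "year": 2012, "life_expectancy": 70.0},
    {"country_code": "PER", "year": 2013, "life_expectancy": 71.0},
]
rows = merge_topics(economy, health)
```

With pandas, the equivalent join is `pandas.merge(left, right, on=["country_code", "year"], how="outer")`, again assuming those key column names.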
The data is collected via a custom software application that discovers and groups high-quality indicators with rules-based logic & artificial intelligence, generates metadata, and performs ETL for the data from the World Bank API. The result is a clean, up‑to‑date collection of World Bank indicators in time-series format that is ready for analysis—no manual downloads or data wrangling required.
The original World Bank data has been aggregated and transformed for ease of use. Missing values have been preserved as provided by the World Bank, and no significant transformations have been applied beyond formatting and aggregation into a single file.
The World Bank: World Development Indicators
This dataset is publicly available and sourced from the World Bank Open Data Platform and is made available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. When using this data, please attribute the World Bank as follows: "Data sourced from the World Bank, licensed under CC BY 4.0." For more details on the World Bank’s terms of use, visit: https://www.worldbank.org/en/about/legal/terms-of-use-for-datasets.
This dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Feel free to use this data in Kaggle notebooks, academic research, or policy analysis. If you create a derived dataset or analysis, I encourage you to share it with the Kaggle community.
https://www.icpsr.umich.edu/web/ICPSR/studies/8376/terms
The boundaries of five different geographic areas -- North America, South America, Europe, Africa, and Asia -- are digitally represented in this collection of data files that can be used in the production of computer maps. Each of the five areas is encoded in three distinct files: (1) coastline, islands, and lakes, (2) rivers, and (3) international boundaries. There is an additional file for North America (Part 4: North America: Internal Boundaries) delineating state lines in the United States and provincial boundaries in Canada. The data in each of the files is hierarchically structured into subordinate geographic features and ranks, which may be used for output plotting symbol definition. The mapping scale used to encode the data ranged from 1:1 million to 1:4 million.
The rapid and massive dissemination of mobile phones in the developing world is creating new opportunities for the discipline of survey research. The World Bank is interested in leveraging mobile phone technology as a means of direct communication with poor households in the developing world in order to gather rapid feedback on the impact of economic crises and other events on the economy of such households.
The World Bank commissioned Gallup to conduct the Listening to LAC (L2L) pilot program, a research project aimed at testing the feasibility of mobile phone technology as a way of data collection for conducting quick turnaround, self-administered, longitudinal surveys among households in Peru and Honduras.
The project used face-to-face interviews as its benchmark, and included Short Message Service (SMS), Interactive Voice Response (IVR) and Computer Assisted Telephone Interviews (CATI) as test methods of data collection.
The pilot was designed in a way that allowed testing the response rates and the quality of data, while also providing information on the cost of collecting data using mobile phones. Researchers also evaluated if providing incentives affected panel attrition rates.
A random stratified multistage sampling technique was used to select a nationally representative sample of 1,500 households. During the initial face-to-face interviews, researchers gathered information on the socio-economic characteristics of households and recruited participants for follow-up research. Question wording was the same in all modes of data collection.
In Peru, households were randomly assigned to a communication mode (SMS, IVR, CATI), which stayed constant for all rounds (waves) of the survey.
Includes the entire national territory, with the exception of neighborhoods where interviewer access is extremely difficult, due to lack of transportation infrastructure or to situations that threaten the physical integrity of the interviewers and supervisors (e.g. extremely high crime rate, warfare, etc.)
Sample survey data [ssd]
The Peru panel was built on a nationally representative sample of 1,500 households. The sample was based on the sampling frame for the National Household Survey (ENAHO) conducted by the Peruvian National Statistics Office (INEI) every three months.
In Peru, the sample selection was guided by the following criteria: (i) the sample should be representative nationally and in urban and rural areas, and (ii) households close to the poverty line should be oversampled, because policy decisions in times of crisis need to be especially mindful of the poor and vulnerable. For the purposes of this project, "close to the poverty line" was defined as the 40 percent of the consumption distribution that symmetrically bands the national poverty line: 20 percent above and 20 percent below. In 2010, monthly per capita consumption was below the moderate poverty line in 27 percent of Peruvian households (ENAHO). Households whose monthly per capita consumption falls between the 7th and 47th percentiles of the national distribution were therefore oversampled.
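The oversampling band follows directly from the 27 percent poverty rate and the 20-point margin on each side. The short sketch below reproduces that arithmetic; treating both band edges as inclusive is an assumption.

```python
POVERTY_RATE = 27      # percent of households below the moderate poverty line (ENAHO 2010)
BAND_HALF_WIDTH = 20   # percentage points above and below the poverty line

lower = POVERTY_RATE - BAND_HALF_WIDTH  # 7
upper = POVERTY_RATE + BAND_HALF_WIDTH  # 47


def in_oversample_band(consumption_percentile):
    """True if a household's consumption percentile falls inside the band."""
    return lower <= consumption_percentile <= upper
```

A household at the 30th percentile of consumption would be in the oversampled group, while one at the 50th percentile would not.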
The L2L sample frame comprises all the panel conglomerados from the fourth trimester of ENAHO 2010, or 281 conglomerados.
Detailed information about the sampling procedure is available in "Listening to LAC: Using Mobile Phones for High Frequency Data Collection, Final Report" (p. 65-69) and "The World Bank Listening to LAC (L2L) Pilot Project Sample Design for Peru."
A number of restive communities in Peru did not allow Gallup's interviewers to enter the area. Where possible, these were replaced following INEI's standard methodology. When confronted with a problem in a particular location, INEI moves to the next "Centro Poblado" in the same "Conglomerado."
Other [oth]
The following survey instruments were used in the project:
1) Initial face-to-face questionnaire
In Peru, the starting point was the ENAHO (National Household Survey) questionnaire. Step-wise regressions were done to select the set of questions that best predicted consumption. For the purposes of robustness, the regressions were also done with questions that best predicted income, which yielded the same results.
The survey gathered information on households' demographics, household infrastructure, employment, remittances, income, accidents, food security, self-perceptions of poverty, Internet access and cellphone use.
2) Monthly questionnaires (SMS, IVR, CATI)
The questionnaires were worded exactly the same way, regardless of the mode, which meant short questions, since SMS is limited to 160 characters. A maximum of 10 questions was chosen for the monthly questionnaire. In addition, two questions sought to ensure the validity of the responses by testing whether the respondent was a member of the household: the first two questions in each monthly questionnaire asked the respondent for their gender and year of birth, and the answers were compared to the household roster obtained during the face-to-face interview. Most questions were time-variant and each questionnaire was repeated to observe whether answers changed over time. All questions related to variables that strongly affect household welfare and are likely to change in times of crisis.
3) Final face-to-face questionnaire
Gallup conducted face-to-face closing surveys among 700 panelists. The researchers asked about issues the respondents had with mobile phones and coverage during the test. Panelists were also asked what would motivate them to keep on participating in a project like this in the future.
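The respondent check used in the monthly questionnaires, where reported gender and year of birth are compared with the household roster, could look roughly like the sketch below. The field names and roster entries are hypothetical, and requiring an exact match on both fields is an assumption about how strictly the comparison was applied.

```python
def is_household_member(answer, roster):
    """True if the reported gender and birth year match someone on the roster."""
    return any(
        member["gender"] == answer["gender"]
        and member["birth_year"] == answer["birth_year"]
        for member in roster
    )


# Hypothetical roster recorded during the initial face-to-face interview.
roster = [
    {"gender": "F", "birth_year": 1980},
    {"gender": "M", "birth_year": 1975},
]
valid = is_household_member({"gender": "F", "birth_year": 1980}, roster)
invalid = is_household_member({"gender": "F", "birth_year": 1975}, roster)
```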
In Peru, 67 percent of recruited households failed to answer the first round of follow-up surveys. Attrition slightly increased with each wave of the survey (between 1 and 3 percentage points per wave), reaching 75 percent in wave 6.
As part of the survey administration process, Gallup implemented a number of mechanisms to maximize the response rate and panelist retention. The following strategies were applied to respondents who did not reply the first time:
Also, in order to minimize non-response, three types of incentives were given. First, households that did not own a mobile phone were provided one for free. Approximately 200 phones were donated in Peru. Second, all communications between the interviewers and the households were free to the respondents. Finally, households were randomly assigned to one of three incentive levels: one-third of households received US$1 in free airtime for each questionnaire they answered, one-third received US$5 in free airtime, and one-third received no financial incentive (the control group).
In order to develop various methods of comparable data collection on health and health system responsiveness, WHO started a scientific survey study in 2000-2001. This study has used a common survey instrument in nationally representative populations, with a modular structure for assessing the health of individuals in various domains, health system responsiveness, household health care expenditures, and additional modules in other areas such as adult mortality and health state valuations.
The health module of the survey instrument was based on selected domains of the International Classification of Functioning, Disability and Health (ICF) and was developed after a rigorous scientific review of various existing assessment instruments. The responsiveness module has been the result of ongoing work over the last 2 years that has involved international consultations with experts and key informants and has been informed by the scientific literature and pilot studies.
Questions on household expenditure and proportionate expenditure on health have been borrowed from existing surveys. The survey instrument has been developed in multiple languages using cognitive interviews and cultural applicability tests, stringent psychometric tests for reliability (i.e. test-retest reliability to demonstrate the stability of application) and most importantly, utilizing novel psychometric techniques for cross-population comparability.
The study was carried out in 61 countries and completed 71 surveys, because two different modes were intentionally used for comparison purposes in 10 countries. The survey modes were 90-minute in-person household interviews in 14 countries, brief face-to-face interviews in 27 countries, computerized telephone interviews in 2 countries, and postal surveys in 28 countries. All samples were selected from nationally representative sampling frames with a known probability so as to make estimates based on general population parameters.
The survey study tested novel techniques to control the reporting bias between different groups of people in different cultures or demographic groups (i.e. differential item functioning) so as to produce comparable estimates across cultures and groups. To achieve comparability, individuals' self-reports of their own health were calibrated against well-known performance tests (e.g. self-reported vision was measured against the standard Snellen visual acuity test) or against short descriptions in vignettes that marked known anchor points of difficulty (e.g. people with different levels of mobility, such as a paraplegic person or an athlete who runs 4 km each day) so as to adjust the responses for comparability. The same method was also used for self-reports of individuals assessing the responsiveness of their health systems, where vignettes on different responsiveness domains describing different levels of responsiveness were used to calibrate the individual responses.
These data are useful in their own right to standardize indicators for different domains of health (such as cognition, mobility, self care, affect, usual activities, pain, social participation, etc.) but also provide a better measurement basis for assessing the health of populations in a comparable manner. The data from the surveys can be fed into composite measures such as "Healthy Life Expectancy" and improve the empirical data input for health information systems in different regions of the world. Data from the surveys were also useful to improve the measurement of the responsiveness of different health systems to the legitimate expectations of the population.
Sample survey data [ssd]
The metropolitan, urban and rural population and all "administrative regional units" as defined in official European Union statistics (NUTS 2) covered proportionately the respective population aged 18 and above. The country was divided into an appropriate number of areas, grouping NUTS regions at whatever level was appropriate. The NUTS regions covered in Sweden were the following: Stockholm/Södertälje A-Region, Gothenburgs A-Region, Malmö/Lund/Trelleborgs A-region, Semi urban area, Rural area.
The basic sample design was a multi-stage, random probability sample. 100 sampling points were drawn with probability proportional to population size, for a total coverage of the country. The sampling points were drawn after stratification by NUTS 2 region and by degree of urbanisation. They represented the whole territory of the country surveyed and were selected proportionally to the distribution of the population in terms of metropolitan, urban and rural areas. In each of the selected sampling points, one address was drawn at random. This starting address formed the first address of a cluster of a maximum of 20 addresses. The remainder of the cluster was selected as every Nth address by standard random route procedure from the initial address. In theory, there is no maximum number of addresses issued per country. Procedures for random household selection and random respondent selection are independent of the interviewer's decision and controlled by the institute responsible. They should be as identical as possible from country to country, full functional equivalence being a must.
At every address up to 4 recalls were made to attempt to achieve an interview with the selected respondent. There was only one interview per household. The final sample size is 1,000 completed interviews.
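Drawing sampling points with probability proportional to population size (PPS) can be sketched with systematic PPS selection. The source does not say which PPS variant was used, so the systematic method here is an assumption, and the area populations are hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def pps_systematic(sizes, n_points, rng):
    """Pick n_points area indices with probability proportional to size."""
    total = sum(sizes)
    step = total / n_points            # sampling interval on the cumulative scale
    start = rng.random() * step        # random start in [0, step)
    cumulative = list(accumulate(sizes))
    picks = []
    for i in range(n_points):
        point = start + i * step
        # Index of the area whose cumulative range contains this point.
        idx = min(bisect_right(cumulative, point), len(sizes) - 1)
        picks.append(idx)
    return picks


# Hypothetical populations of four areas; the first is larger than the
# sampling interval, so it receives more than one sampling point.
area_sizes = [500, 300, 150, 50]
points = pps_systematic(area_sizes, n_points=4, rng=random.Random(1))
```

An area whose population exceeds the sampling interval is selected more than once, which corresponds to placing several sampling points in the same large area.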
Face-to-face [f2f]
Data Coding: At each site the data was coded by investigators to indicate the respondent status and the selection of the modules for each respondent within the survey design. After the interview was edited by the supervisor and considered adequate it was entered locally.
Data Entry Program: A data entry program was developed at WHO specifically for the survey study and provided to the sites. It was developed using a database program called the I-Shell (short for Interview Shell), a tool designed for easy development of computerized questionnaires and data entry (34). This program allows for easy data cleaning and processing.
The data entry program checked for inconsistencies and validated the entries in each field by checking for valid response categories and range checks. For example, the program didn’t accept an age greater than 120. For almost all of the variables there existed a range or a list of possible values that the program checked for.
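A field-level validation of this kind might be sketched as follows. Only the age ceiling of 120 comes from the text; the lower bound, the `sex` code list, and the function name are hypothetical, and this is not the I-Shell implementation.

```python
# Allowed numeric ranges and allowed category codes per field.
VALID_RANGES = {"age": (0, 120)}   # upper bound from the text; lower bound assumed
VALID_CODES = {"sex": {1, 2}}      # hypothetical response categories


def validate_field(name, value):
    """Return True if the value passes the range or category check for its field."""
    if name in VALID_RANGES:
        low, high = VALID_RANGES[name]
        return low <= value <= high
    if name in VALID_CODES:
        return value in VALID_CODES[name]
    return True  # fields without a rule are accepted as entered
```

For example, `validate_field("age", 121)` is rejected while `validate_field("age", 120)` is accepted.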
In addition, the data was entered twice to capture other data entry errors. The data entry program was able to warn the user whenever a value that did not match the first entry was entered at the second data entry. In this case the program asked the user to resolve the conflict by choosing either the 1st or the 2nd data entry value to be able to continue. After the second data entry was completed successfully, the data entry program placed a mark in the database in order to enable the checking of whether this process had been completed for each and every case.
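The comparison behind double data entry can be illustrated with a short sketch that lists the fields on which two keying passes disagree. The record layout and field names are invented for illustration; the actual program resolved each conflict interactively.

```python
def find_conflicts(first_entry, second_entry):
    """Return {field: (first_value, second_value)} for fields that differ."""
    fields = first_entry.keys() | second_entry.keys()
    return {
        field: (first_entry.get(field), second_entry.get(field))
        for field in sorted(fields)
        if first_entry.get(field) != second_entry.get(field)
    }


# Hypothetical record keyed twice; the second pass mistyped the age.
first_pass = {"id": 101, "age": 34, "sex": 1}
second_pass = {"id": 101, "age": 43, "sex": 1}
conflicts = find_conflicts(first_pass, second_pass)
```

A record would be marked as verified only once this comparison returns no conflicts.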
Data Transfer: The data entry program was capable of exporting the data that was entered into one compressed database file which could be easily sent to WHO using email attachments or a file transfer program onto a secure server no matter how many cases were in the file. The sites were allowed the use of as many computers and as many data entry personnel as they wanted. Each computer used for this purpose produced one file and they were merged once they were delivered to WHO with the help of other programs that were built for automating the process. The sites sent the data periodically as they collected it, enabling the checking procedures and preliminary analyses in the early stages of the data collection.
Data Quality Checks: Once the data was received it was analyzed for missing information, invalid responses and representativeness. Inconsistencies were also noted and reported back to sites.
Data Cleaning and Feedback: After receipt of cleaned data from sites, another program was run to check for missing information, incorrect information (e.g. wrong use of center codes), duplicated data, etc. The output of this program was fed back to sites regularly. Mainly, this consisted of cases with duplicate IDs, duplicate cases (where the data for two respondents with different IDs were identical), wrong country codes, missing age, sex, education and some other important variables.
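The two kinds of duplication mentioned above, repeated IDs and identical records filed under different IDs, can be detected along the following lines. This is a sketch with made-up records and field names, not the program WHO used.

```python
from collections import defaultdict


def find_duplicates(records, id_field="id"):
    """Return (duplicate IDs, groups of different IDs whose data are identical)."""
    by_id = defaultdict(list)
    by_content = defaultdict(list)
    for rec in records:
        by_id[rec[id_field]].append(rec)
        # Everything except the ID, in a hashable and order-independent form.
        content = tuple(sorted((k, v) for k, v in rec.items() if k != id_field))
        by_content[content].append(rec[id_field])
    duplicate_ids = sorted(i for i, recs in by_id.items() if len(recs) > 1)
    duplicate_cases = sorted(
        sorted(set(ids)) for ids in by_content.values() if len(set(ids)) > 1
    )
    return duplicate_ids, duplicate_cases


# Hypothetical cases: ID 2 appears twice; IDs 1 and 3 carry identical data.
cases = [
    {"id": 1, "age": 30, "sex": 1},
    {"id": 2, "age": 45, "sex": 2},
    {"id": 2, "age": 50, "sex": 1},
    {"id": 3, "age": 30, "sex": 1},
]
duplicate_ids, duplicate_cases = find_duplicates(cases)
```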
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This data set is a compilation of heat flow data of uncertain origin. References as cited in the Global Heat Flow Database were incomplete and thus could not be verified. This data compilation contains data of unknown origin, unpublished data, data with no full reference information, and data extracted from other databases. The remaining short citation and its related problem are listed in columns 18 and 19.
In order to develop various methods of comparable data collection on health and health system responsiveness, WHO started a scientific survey study in 2000-2001. This study has used a common survey instrument in nationally representative populations, with a modular structure for assessing the health of individuals in various domains, health system responsiveness, household health care expenditures, and additional modules in other areas such as adult mortality and health state valuations.
The health module of the survey instrument was based on selected domains of the International Classification of Functioning, Disability and Health (ICF) and was developed after a rigorous scientific review of various existing assessment instruments. The responsiveness module has been the result of ongoing work over the last 2 years that has involved international consultations with experts and key informants and has been informed by the scientific literature and pilot studies.
Questions on household expenditure and proportionate expenditure on health have been borrowed from existing surveys. The survey instrument has been developed in multiple languages using cognitive interviews and cultural applicability tests, stringent psychometric tests for reliability (i.e. test-retest reliability to demonstrate the stability of application) and most importantly, utilizing novel psychometric techniques for cross-population comparability.
The study was carried out in 61 countries and completed 71 surveys, because two different modes were intentionally used for comparison purposes in 10 countries. The survey modes were 90-minute in-person household interviews in 14 countries, brief face-to-face interviews in 27 countries, computerized telephone interviews in 2 countries, and postal surveys in 28 countries. All samples were selected from nationally representative sampling frames with a known probability so as to make estimates based on general population parameters.
The survey study tested novel techniques to control the reporting bias between different groups of people in different cultures or demographic groups (i.e. differential item functioning) so as to produce comparable estimates across cultures and groups. To achieve comparability, individuals' self-reports of their own health were calibrated against well-known performance tests (e.g. self-reported vision was measured against the standard Snellen visual acuity test) or against short descriptions in vignettes that marked known anchor points of difficulty (e.g. people with different levels of mobility, such as a paraplegic person or an athlete who runs 4 km each day) so as to adjust the responses for comparability. The same method was also used for self-reports of individuals assessing the responsiveness of their health systems, where vignettes on different responsiveness domains describing different levels of responsiveness were used to calibrate the individual responses.
This data are useful in their own right to standardize indicators for different domains of health (such as cognition, mobility, self care, affect, usual activities, pain, social participation, etc.) but also provide a better measurement basis for assessing health of the populations in a comparable manner. The data from the surveys can be fed into composite measures such as "Healthy Life Expectancy" and improve the empirical data input for health information systems in different regions of the world. Data from the surveys were also useful to improve the measurement of the responsiveness of different health systems to the legitimate expectations of the population.
Sample survey data [ssd]
The metropolitan, urban and rural populations and all "administrative regional units" as defined in official European Union statistics (NUTS 2) were covered in proportion to the respective population aged 18 and above. The country was divided into an appropriate number of areas, grouping NUTS regions at whatever level was appropriate. The NUTS regions covered in Iceland were the following: Reykjavik, Near Reykjavik and Sudurnes, West-Iceland, North-Iceland, East-Iceland, South-Iceland.
The basic sample design was a multi-stage, random probability sample. 50 sampling points were drawn with probability proportional to population size, for total coverage of the country. The sampling points were drawn after stratification by NUTS 2 region and by degree of urbanisation. They represented the whole territory of the country surveyed and were selected proportionally to the distribution of the population in terms of metropolitan, urban and rural areas. In each of the selected sampling points, one address was drawn at random. This starting address formed the first address of a cluster of a maximum of 20 addresses. The remainder of the cluster was selected as every Nth address by standard random route procedure from the initial address. In theory, there is no maximum number of addresses issued per country. Procedures for random household selection and random respondent selection are independent of the interviewer's decision and controlled by the institute responsible. They should be as identical as possible from country to country, full functional equivalence being a must.
At every address up to 4 recalls were made to attempt to achieve an interview with the selected respondent. There was only one interview per household. The final sample size is 489 completed interviews.
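The two selection steps described above (sampling points drawn with probability proportional to population size, then a cluster built from every Nth address after a random start) can be illustrated with a minimal Python sketch. It is not the survey institute's actual procedure: the function names, the area names and sizes, and the reduction of the "random route" to stepping through an ordered address list are all assumptions made for illustration.

```python
import random


def pps_systematic(sizes, n_points, rng):
    """Draw n_points sampling points with probability proportional to size.

    Uses systematic PPS: a random start followed by equally spaced targets
    along the cumulative population. An area larger than the spacing can be
    drawn more than once.
    """
    total = sum(sizes.values())
    step = total / n_points
    start = rng.uniform(0, step)
    targets = [start + i * step for i in range(n_points)]
    chosen, cumulative, t = [], 0.0, 0
    for area, size in sizes.items():
        cumulative += size
        while t < n_points and targets[t] < cumulative:
            chosen.append(area)
            t += 1
    return chosen


def random_route_cluster(addresses, start_index, step, max_size=20):
    """Starting address first, then every step-th address, capped at max_size."""
    return addresses[start_index::step][:max_size]


# Hypothetical areas and population sizes, for illustration only.
areas = {"A": 500, "B": 300, "C": 200}
points = pps_systematic(areas, n_points=5, rng=random.Random(0))
cluster = random_route_cluster(list(range(200)), start_index=3, step=5)
```

In this sketch the largest area receives two or three of the five points whatever the random start, which is the proportionality the design aims for.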
Face-to-face [f2f]
Data Coding: At each site the data was coded by investigators to indicate the respondent status and the selection of the modules for each respondent within the survey design. After the interview was edited by the supervisor and considered adequate it was entered locally.
Data Entry Program: A data entry program was developed at WHO specifically for the survey study and provided to the sites. It was developed using a database program called the I-Shell (short for Interview Shell), a tool designed for easy development of computerized questionnaires and data entry (34). This program allows for easy data cleaning and processing.
The data entry program checked for inconsistencies and validated the entries in each field by checking for valid response categories and range checks. For example, the program didn’t accept an age greater than 120. For almost all of the variables there existed a range or a list of possible values that the program checked for.
In addition, the data was entered twice to capture other data entry errors. The data entry program was able to warn the user whenever a value that did not match the first entry was entered at the second data entry. In this case the program asked the user to resolve the conflict by choosing either the 1st or the 2nd data entry value to be able to continue. After the second data entry was completed successfully, the data entry program placed a mark in the database in order to enable the checking of whether this process had been completed for each and every case.
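The range checks and the double-entry comparison described above can be sketched in a few lines of Python. This is an illustrative assumption, not the I-Shell program itself: the field names, the sex codes and the function names are hypothetical, and only the upper age limit of 120 comes from the text.

```python
# Hypothetical per-field rules; only the age limit of 120 is from the text.
VALID_RULES = {
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    "sex": lambda v: v in {"M", "F"},
}


def validate(record):
    """Return the names of fields whose values fail their rule."""
    return [
        field
        for field, is_valid in VALID_RULES.items()
        if field in record and not is_valid(record[field])
    ]


def double_entry_conflicts(first, second):
    """Return {field: (first_value, second_value)} where the two passes differ."""
    fields = first.keys() | second.keys()
    return {
        field: (first.get(field), second.get(field))
        for field in sorted(fields)
        if first.get(field) != second.get(field)
    }


invalid = validate({"age": 130, "sex": "F"})
conflicts = double_entry_conflicts(
    {"age": 34, "sex": "F"},
    {"age": 43, "sex": "F"},
)
```

Each entry in `conflicts` corresponds to a case where the operator would be asked to choose between the first and the second value before continuing.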
Data Transfer: The data entry program was capable of exporting the data that was entered into one compressed database file which could be easily sent to WHO using email attachments or a file transfer program onto a secure server no matter how many cases were in the file. The sites were allowed the use of as many computers and as many data entry personnel as they wanted. Each computer used for this purpose produced one file and they were merged once they were delivered to WHO with the help of other programs that were built for automating the process. The sites sent the data periodically as they collected it, enabling the checking procedures and preliminary analyses in the early stages of the data collection.
Data Quality Checks: Once the data was received it was analyzed for missing information, invalid responses and representativeness. Inconsistencies were also noted and reported back to sites.
Data Cleaning and Feedback: After receipt of cleaned data from sites, another program was run to check for missing information, incorrect information (e.g. wrong use of center codes), duplicated data, etc. The output of this program was fed back to sites regularly. Mainly, this consisted of cases with duplicate IDs, duplicate cases (where the data for two respondents with different IDs were identical), wrong country codes, missing age, sex, education and some other important variables.
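As a rough illustration of the checks listed above, the following Python sketch flags duplicate IDs, duplicate cases (identical data under different IDs) and missing key variables. The record layout and function names are assumptions made for this example; the actual WHO checking program is not described in enough detail to reproduce.

```python
from collections import defaultdict


def find_duplicate_ids(records):
    """IDs that appear on more than one record."""
    counts = defaultdict(int)
    for rec in records:
        counts[rec["id"]] += 1
    return sorted(i for i, n in counts.items() if n > 1)


def find_duplicate_cases(records):
    """Groups of different IDs whose remaining data are identical."""
    groups = defaultdict(set)
    for rec in records:
        key = tuple(sorted((k, v) for k, v in rec.items() if k != "id"))
        groups[key].add(rec["id"])
    return sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)


def find_missing(records, required=("age", "sex", "education")):
    """(id, missing_fields) pairs for records lacking a required value."""
    report = []
    for rec in records:
        missing = [f for f in required if rec.get(f) in (None, "")]
        if missing:
            report.append((rec["id"], missing))
    return report


# Hypothetical records, for illustration only.
records = [
    {"id": 1, "age": 30, "sex": "F", "education": 2},
    {"id": 2, "age": 30, "sex": "F", "education": 2},
    {"id": 3, "age": None, "sex": "M", "education": 1},
    {"id": 3, "age": 51, "sex": "M", "education": 1},
]
```

A report built from these three functions is the kind of output that could be fed back to a site for correction.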
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Coupled Model Intercomparison Project Phase 6 (CMIP6) datasets. These data include all datasets published for 'CMIP6.CMIP.EC-Earth-Consortium.EC-Earth3' with the full Data Reference Syntax following the template 'mip_era.activity_id.institution_id.source_id.experiment_id.member_id.table_id.variable_id.grid_label.version'.
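A dataset identifier that follows the Data Reference Syntax template above can be split into named facets. The sketch below is a minimal illustration in Python; the function name `parse_drs` and the example identifier (in particular its experiment, member, table, variable, grid and version values) are assumptions chosen for illustration, not entries taken from this collection.

```python
DRS_TEMPLATE = (
    "mip_era.activity_id.institution_id.source_id.experiment_id."
    "member_id.table_id.variable_id.grid_label.version"
)
FACETS = DRS_TEMPLATE.split(".")


def parse_drs(dataset_id):
    """Map each dot-separated part of a dataset ID to its DRS facet name."""
    parts = dataset_id.split(".")
    if len(parts) != len(FACETS):
        raise ValueError(
            f"expected {len(FACETS)} facets, got {len(parts)}: {dataset_id!r}"
        )
    return dict(zip(FACETS, parts))


# Illustrative identifier only; the trailing facets are made up.
example_id = (
    "CMIP6.CMIP.EC-Earth-Consortium.EC-Earth3."
    "historical.r1i1p1f1.Amon.tas.gr.v20200310"
)
facets = parse_drs(example_id)
```

Splitting on "." works here because none of the ten facets in the template contains a dot; the hyphens inside `EC-Earth-Consortium` and `EC-Earth3` are preserved.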
The EC Earth 3.3 climate model, released in 2019, includes the following components: atmos: IFS cy36r4 (TL255, linearly reduced Gaussian grid equivalent to 512 x 256 longitude/latitude; 91 levels; top level 0.01 hPa), land: HTESSEL (land surface scheme built in IFS), ocean: NEMO3.6 (ORCA1 tripolar primarily 1 deg with meridional refinement down to 1/3 degree in the tropics; 362 x 292 longitude/latitude; 75 levels; top grid cell 0-1 m), seaIce: LIM3. The model was run by the AEMET, Spain; BSC, Spain; CNR-ISAC, Italy; DMI, Denmark; ENEA, Italy; FMI, Finland; Geomar, Germany; ICHEC, Ireland; ICTP, Italy; IDL, Portugal; IMAU, The Netherlands; IPMA, Portugal; KIT, Karlsruhe, Germany; KNMI, The Netherlands; Lund University, Sweden; Met Eireann, Ireland; NLeSC, The Netherlands; NTNU, Norway; Oxford University, UK; surfSARA, The Netherlands; SMHI, Sweden; Stockholm University, Sweden; Unite ASTR, Belgium; University College Dublin, Ireland; University of Bergen, Norway; University of Copenhagen, Denmark; University of Helsinki, Finland; University of Santiago de Compostela, Spain; Uppsala University, Sweden; Utrecht University, The Netherlands; Vrije Universiteit Amsterdam, the Netherlands; Wageningen University, The Netherlands. Mailing address: EC-Earth consortium, Rossby Center, Swedish Meteorological and Hydrological Institute/SMHI, SE-601 76 Norrkoping, Sweden (EC-Earth-Consortium) in native nominal resolutions: atmos: 100 km, land: 100 km, ocean: 100 km, seaIce: 100 km.
Project: These data have been generated as part of the internationally-coordinated Coupled Model Intercomparison Project Phase 6 (CMIP6; see also GMD Special Issue: http://www.geosci-model-dev.net/special_issue590.html). The simulation data provide a basis for climate research designed to answer fundamental science questions and serve as a resource for authors of the Sixth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC-AR6).
CMIP6 is a project coordinated by the Working Group on Coupled Modelling (WGCM) as part of the World Climate Research Programme (WCRP). Phase 6 builds on previous phases executed under the leadership of the Program for Climate Model Diagnosis and Intercomparison (PCMDI) and relies on the Earth System Grid Federation (ESGF) and the Centre for Environmental Data Analysis (CEDA) along with numerous related activities for implementation. The original data is hosted and partially replicated on a federated collection of data nodes, and most of the data relied on by the IPCC is being archived for long-term preservation at the IPCC Data Distribution Centre (IPCC DDC) hosted by the German Climate Computing Center (DKRZ).
The project includes simulations from about 120 global climate models, contributed by around 45 institutions and organizations worldwide. Project website: https://pcmdi.llnl.gov/CMIP6.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset presents a dual-version representation of employment-related data from India, crafted to highlight the importance of data cleaning and transformation in any real-world data science or analytics project.
It includes two parallel datasets:
1. Messy Dataset (Raw): Represents a typical unprocessed dataset often encountered in data collection from surveys, databases, or manual entries.
2. Cleaned Dataset: This version demonstrates how proper data preprocessing can significantly enhance the quality and usability of data for analytical and visualization purposes.
Each record captures multiple attributes related to individuals in the Indian job market, including:
- Age Group
- Employment Status (Employed/Unemployed)
- Monthly Salary (INR)
- Education Level
- Industry Sector
- Years of Experience
- Location
- Perceived AI Risk
- Date of Data Recording
The raw dataset underwent comprehensive transformations to convert it into its clean, analysis-ready form:
- Missing Values: Identified and handled using either row elimination (where critical data was missing) or imputation techniques.
- Duplicate Records: Identified using row comparison and removed to prevent analytical skew.
- Inconsistent Formatting: Unified inconsistent naming in columns (like 'monthly_salary_(inr)' → 'Monthly Salary (INR)'), capitalization, and string spacing.
- Incorrect Data Types: Converted columns like salary from string/object to float for numerical analysis.
- Outliers: Detected and handled based on domain logic and distribution analysis.
- Categorization: Converted numeric ages into grouped age categories for comparative analysis.
- Standardization: Uniform labels for employment status, industry names, education, and AI risk levels were applied for visualization clarity.
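Several of the transformations listed above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the dataset author's pipeline: the raw column names other than 'monthly_salary_(inr)', the age bins, and the choice to drop rows with an unparseable salary are all hypothetical.

```python
import re


def clean_column(name):
    """Normalize a raw column name, e.g. 'monthly_salary_(inr)' -> 'Monthly Salary (INR)'."""
    words = name.strip().replace("_", " ").split()
    return " ".join(w.upper() if w.startswith("(") else w.capitalize() for w in words)


def parse_salary(value):
    """Convert a string such as ' 45,000 ' to float; None when it cannot be parsed."""
    text = re.sub(r"[^\d.]", "", str(value))
    try:
        return float(text)
    except ValueError:
        return None


def age_group(age):
    """Map a numeric age to a category (hypothetical bins)."""
    for upper, label in ((25, "18-25"), (35, "26-35"), (50, "36-50")):
        if age <= upper:
            return label
    return "51+"


def clean(rows):
    cleaned, seen = [], set()
    for row in rows:
        rec = {clean_column(k): v.strip() if isinstance(v, str) else v for k, v in row.items()}
        rec["Monthly Salary (INR)"] = parse_salary(rec.get("Monthly Salary (INR)"))
        if rec["Monthly Salary (INR)"] is None:
            continue  # row elimination: a critical value is missing
        rec["Employment Status"] = rec["Employment Status"].title()  # uniform labels
        rec["Age Group"] = age_group(int(rec.pop("Age")))  # categorization
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue  # duplicate record after normalization
        seen.add(key)
        cleaned.append(rec)
    return cleaned


raw = [
    {"age": "23", "employment_status": " employed ", "monthly_salary_(inr)": "45,000"},
    {"age": "23", "employment_status": "Employed", "monthly_salary_(inr)": "45000"},
    {"age": "41", "employment_status": "UNEMPLOYED", "monthly_salary_(inr)": "n/a"},
]
result = clean(raw)
```

Deduplication runs after normalization on purpose: the first two raw rows differ only in spacing, capitalization and number formatting, so they become identical once cleaned.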
This dataset is ideal for learners and professionals who want to understand:
- The impact of messy data on visualization and insights
- How transformation steps can dramatically improve data interpretation
- Practical examples of preprocessing techniques before feeding into ML models or BI tools
It's also useful for:
- Training ML models with clean inputs
- Data storytelling with visual clarity
- Demonstrating reproducibility in data cleaning pipelines
By examining both the messy and clean datasets, users gain a deeper appreciation for why “garbage in, garbage out” rings true in the world of data science.
https://www.icpsr.umich.edu/web/ICPSR/studies/8320/terms
This aggregate data collection is an extract of the International Data Base (IDB), a computerized central repository of demographic, economic, and social data for all countries of the world. Data available in this collection include total midyear population estimates and projections (1950-1985), percent urban population, estimates and projections of crude birth rate, crude death rate, net migration rate, rate of natural increase, and annual growth rate, infant mortality rate and life expectancy at birth by sex, percent literate by sex, and percent of the labor force in agriculture.
By Correlates of War Project [source]
The World Religion Project (WRP) is an ambitious endeavor to conduct a comprehensive analysis of religious adherence throughout the world from 1945 to 2010. This cutting-edge project offers unparalleled insight into the religious behavior of people in different countries, regions, and continents during this time period. Its datasets provide important information about the numbers and percentages of adherents across a multitude of different religions, religion families, and non-religious affiliations.
The WRP consists of three distinct datasets: the national religion dataset, the regional religion dataset, and the global religion dataset. Each targets a different level of analysis, from individual states to the global system. The national dataset provides the number of adherents by state, as well as the percentage of the population practicing a given religion, in five-year increments, showing how these figures evolve from nation to nation over time. The regional dataset likewise provides data at five-year intervals by region, with one modification to the region designations: Pacific Ocean states (country code 900 or above) have been reclassified into their own Oceania category. Finally, the global dataset aggregates all states, giving a snapshot at any five-year interval between 1945 and 2010 of the relationships between religions or religious families, within one location or transnationally.
This project was developed in three stages: first, forming a religion tree (a systematic classification of religions); second, collecting data according to that classification structure; and third, cleaning the data so that discrepancies could be reconciled and gaps handled where unknown values were encountered during the collection process. For further details on these datasets and their possible applications, please contact Zeev Maoz (University of California, Davis) and Errol A. Henderson (Pennsylvania State University).
The World Religions Project (WRP) dataset offers a comprehensive look at religious adherence around the world within a single dataset. With this dataset, you can track global religious trends over a period of 65 years and explore how they’ve changed during that time. By exploring the WRP data set, you’ll gain insight into cross-regional and cross-time patterns in religious affiliation around the world.
- Analyzing historical patterns of religious growth and decline across different regions
- Creating visualizations to compare religious adherence in various states, countries, or globally
- Studying the impact of governmental policies on religious participation over time
If you use this dataset in your research, please credit the original authors.
License: Dataset copyright by authors
- You are free to:
  - Share - copy and redistribute the material in any medium or format for any purpose, even commercially.
  - Adapt - remix, transform, and build upon the material for any purpose, even commercially.
- You must:
  - Give appropriate credit - Provide a link to the license, and indicate if changes were made.
  - ShareAlike - You must distribute your contributions under the same license as the original.
  - Keep intact - all notices that refer to this license, including copyright notices.
File: WRP regional data.csv

| Column name | Description |
|:------------|:------------|
| Year | Reference year for data collection. (Integer) |
| Region | World region according to Correlates Of War (COW) Regional Systemizations with one modification (Oceania category for COW country code ... |
By 2025, forecasts suggest that there will be more than ** billion Internet of Things (IoT) connected devices in use. This would be a nearly threefold increase from the IoT installed base in 2019.
What is the Internet of Things?
The IoT refers to a network of devices that are connected to the internet and can "communicate" with each other. Such devices include everyday tech gadgets such as smartphones and wearables, smart home devices such as smart meters, and industrial devices like smart machines. These smart connected devices are able to gather, share, and analyze information and create actions accordingly. By 2023, global spending on IoT will reach *** trillion U.S. dollars.
How does the Internet of Things work?
IoT devices make use of sensors and processors to collect and analyze data acquired from their environments. The data collected from the sensors is shared by being sent to a gateway or to other IoT devices. It is then either sent to and analyzed in the cloud or analyzed locally. By 2025, the data volume created by IoT connections is projected to reach a massive total of **** zettabytes.
Privacy and security concerns
Given the amount of data generated by IoT devices, it is no wonder that data privacy and security are among the major concerns with regard to IoT adoption. Once devices are connected to the Internet, they become vulnerable to possible security breaches in the form of hacking, phishing, etc. Frequent data leaks from social media raise serious concerns about information security standards in today's world; were the IoT to become the next new reality, serious efforts to create strict security standards need to be prioritized.
This dataset, a product of the Trade Team - Development Research Group, is part of a larger effort in the group to measure the extent of the brain drain as part of the International Migration and Development Program. It measures international skilled migration for the years 1975-2000.
The methodology is explained in: "Tendance de long terme des migrations internationals. Analyse à partir des 6 principaux pays recerveurs", Cécily Defoort.
This data set uses the same methodology as used in the Docquier-Marfouk data set on international migration by educational attainment. The authors use data from 6 key receiving countries in the OECD: Australia, Canada, France, Germany, the UK and the US.
It is estimated that the data represent approximately 77 percent of the world’s migrant population.
Bilateral brain drain rates are estimated based on observations at five-year intervals during the period 1975-2000.
Australia, Canada, France, Germany, UK and US
Aggregate data [agg]
Other [oth]
In order to develop various methods of comparable data collection on health and health system responsiveness, WHO started a scientific survey study in 2000-2001. This study used a common survey instrument with a modular structure in nationally representative populations to assess the health of individuals in various domains, health system responsiveness, household health care expenditures, and additional modules in other areas such as adult mortality and health state valuations.
Sample survey data [ssd]
The telephone directory was used as the sampling frame since it is considered the most reliable registry available.
Each region was divided into provinces. The provinces are composed of "comunas" or municipalities, from within which individuals were randomly selected. However, with this design the sample may be biased, because the population without a telephone is not covered by the sampling frame.
Final Sample Size=2,078
Mail Questionnaire [mail]
No description is available. Visit https://dataone.org/datasets/5d041e4bfaf4ea361dd3135126134720 for complete metadata about this dataset.
https://creativecommons.org/publicdomain/zero/1.0/
The New 7 Wonders of the World was a campaign started in 2000 to choose Wonders of the World from a selection of 200 existing monuments. The popularity poll via free Web-based voting and small amounts of telephone voting was led by Canadian-Swiss Bernard Weber and organized by the New 7 Wonders Foundation (N7W) based in Zurich, Switzerland, with winners announced on 7 July 2007 in Lisbon, at Estádio da Luz. The poll was considered unscientific partly because it was possible for people to cast multiple votes.
If we ever plan a world tour, there will surely be a bucket list of wonders and places around the world that we wish to visit. Here we have a set of "Wonders of the World" images scraped from Google Images. Let us use our deep learning skills to build a multiclass classifier that identifies the place shown in each image.
This dataset contains a total of 3846 images placed in folders, with each folder representing one of the top new wonders of the world. Below is the list of wonders with images extracted from Google Images.
No description is available. Visit https://dataone.org/datasets/c9d6507f203308063a16ce22ba032540 for complete metadata about this dataset.
This dataset is about: Component parts of the World Heat Flow Data Collection.