https://creativecommons.org/publicdomain/zero/1.0/
This is a dataset of the world population covering each and every country. It consists of 234 rows and 17 columns. I have analyzed this data to draw out the pieces of information described below.
As of March 2025, *** data centers were listed as being located in Germany, the most of any European nation. Data centers are facilities housing critical IT infrastructure designed to store, process, and manage vast volumes of data. The United States is home to the largest share of data centers worldwide, with over ***** facilities.
In 2022, China topped the list of surveilled countries worldwide. Nearly the whole population in the country was affected by the internet usage restrictions set by the government. India and Pakistan followed as the governmental authorities in these countries also put limitations on internet usage for their citizens.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a compilation of processed data on citations and references for research papers, including their author, institution and open access information, for a selected sample of academics analysed using Microsoft Academic Graph (MAG) data and CORE. The data for this dataset was collected from December 2019 to January 2020. Six countries (Austria, Brazil, Germany, India, Portugal, United Kingdom and United States) were the focus of the six questions which make up this dataset. There is one csv file per country and per question (36 files in total). More details about the creation of this dataset are available in the public ON-MERRIT D3.1 deliverable report.

The dataset combines two data sources: one part was created by analysing promotion policies across the target countries, while the second part is a set of data points for understanding publishing behaviour. To facilitate analysis, the dataset is organised in the following seven folders:

PRT
- The file "PRT_policies.csv" contains the information extracted from promotion, review and tenure (PRT) policies.

Q1: What % of papers coming from a university are Open Access?
- Dataset name format: oa_status_countryname_papers.csv
- Contents: Open Access (OA) status of all papers of all the universities listed in the Times Higher Education World University Rankings (THEWUR) for the given country. A paper is marked OA if at least one OA link is available. OA links are collected using the CORE Discovery API.
- Important considerations: papers with multiple authorship are preserved only once towards each of the distinct institutions their authors belong to. CORE Discovery, the service used to recognise whether a paper is OA, does not contain entries for all paper ids in MAG, so some records have neither a true nor a false value for the is_OA field. Only records marked true for the is_OA field can be said to be OA; those with a false or missing value have unknown status (i.e. not necessarily closed access).

Q2: How are papers published by the selected universities distributed across the three scientific disciplines of our choice?
- Dataset name format: fsid_countryname_papers.csv
- Contents: for the given country, all papers for all the universities listed in THEWUR, with the field of study they belong to.
- Important considerations: MAG can associate a paper with multiple fieldofstudyid values; if a paper belongs to more than one of our fields of study, a separate record was created for each fieldofstudyid. MAG assigns each fieldofstudyid a score, and only records with a score above 0.5 are preserved. Papers with multiple authorship are preserved only once towards each of the distinct institutions their authors belong to; papers with authorship from multiple universities are counted once towards each of the universities concerned.

Q3: What is the gender distribution in authorship of papers published by the universities?
- Dataset name format: author_gender_countryname_papers.csv
- Contents: all papers with their author names for all the universities listed in THEWUR.
- Important considerations: when a paper has multiple collaborators (authors), only the records for collaborators from within the selected universities are preserved. An external script was executed to determine the gender of the authors; the script is available here.

Q4: Distribution of staff seniority (= number of years from first publication to last publication) in the given university.
- Dataset name format: author_ids_countryname_papers.csv
- Contents: for a given country, all papers for authors, with their publication year, for all the universities listed in THEWUR.
- Important considerations: when a paper has multiple collaborators (authors), only the records for collaborators from within the selected universities are preserved. Staff seniority can be calculated in various ways; the most straightforward option is academic_age = MAX(year) - MIN(year) for each authorid.

Q5: Citation counts (incoming) for OA vs non-OA papers published by the university.
- Dataset name format: cc_oa_countryname_papers.csv
- Contents: OA status and OA links for all papers of all the universities listed in THEWUR and, for each paper, the count of incoming citations available in MAG.
- Important considerations: CORE Discovery was used to establish the OA status of papers. Papers with multiple authorship are preserved only once towards each of the distinct institutions their authors belong to. Only records marked true for the is_OA field can be said to be OA; those with a false or missing value have unknown status (i.e. not necessarily closed access).

Q6: Count of OA vs non-OA references (outgoing) for all papers published by universities.
- Dataset name format: rc_oa_countryname_-papers.csv
- Contents: counts of all OA and unknown papers referenced by all papers published by all the universities listed in THEWUR.
- Important considerations: CORE Discovery was used to establish the OA status of the referenced papers. Papers with multiple authorship are preserved only once towards each of the distinct institutions their authors belong to; papers with authorship from multiple universities are counted once towards each of the universities concerned.

Additional files:
- fieldsofstudy_mag.csv: a dump of the fieldsofstudy table of MAG, mapping each id to its actual field of study name.
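The Q4 seniority measure lends itself to a simple aggregation. Below is a minimal sketch of the academic_age = MAX(year) - MIN(year) calculation; the record layout (authorid, year) follows the dataset description, and the sample values are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (authorid, year) records mirroring the Q4 dataset description.
rows = [
    {"authorid": 1, "year": 2005},
    {"authorid": 1, "year": 2019},
    {"authorid": 2, "year": 2018},
    {"authorid": 2, "year": 2020},
]

# Group publication years per author, then take MAX(year) - MIN(year).
years = defaultdict(list)
for row in rows:
    years[row["authorid"]].append(row["year"])

academic_age = {author: max(y) - min(y) for author, y in years.items()}
# {1: 14, 2: 2}
```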
Cost of Living - Country Rankings Dataset
The "Cost of Living - Country Rankings Dataset" provides comprehensive information on the cost of living in various countries around the world. Understanding the cost of living is crucial for individuals, businesses, and policymakers alike, as it impacts decisions related to travel, relocation, investment, and economic analysis. This dataset is intended to serve as a valuable resource for researchers, data analysts, and anyone interested in exploring and comparing the cost of living across different nations.
This dataset comprises four primary columns:
1. Countries: This column contains the names of various countries included in the dataset. Each country is identified by its official name.
2. Cost of Living: The "Cost of Living" column represents the cost of living index or score for each country. This index is typically calculated by considering various factors, such as housing, food, transportation, healthcare, and other essential expenses. A higher index value indicates a higher cost of living in that particular country, while a lower value suggests a more affordable cost of living.
3. 2017 Global Rank: This column provides the global ranking of each country's cost of living in the year 2017. The ranking is based on the cost of living index mentioned earlier. A lower rank indicates a lower cost of living relative to other countries, while a higher rank suggests a higher cost of living position.
4. Available Data: The "Available Data" column indicates whether or not data for a specific country and year is available.
This dataset is designed to support various data analysis and visualization tasks. Users can explore trends in the cost of living, identify countries with high or low cost of living, and analyze how rankings have changed over time. Researchers can use this dataset to conduct in-depth studies on the factors influencing the cost of living in different regions and the economic implications of such variations.
Please note that the dataset includes information for the year 2017, and users are encouraged to consider this when interpreting the data, as economic conditions and the cost of living may have changed since then. Additionally, this dataset aims to provide a snapshot of cost of living rankings for countries in 2017 and may not cover every country in the world.
Link: https://www.theglobaleconomy.com/rankings/cost_of_living_wb/
Disclaimer: The accuracy and completeness of the data provided in this dataset are subject to the source from which it was obtained. Users are advised to cross-reference this data with authoritative sources and exercise discretion when making decisions based on it. The dataset creator and Kaggle assume no responsibility for any actions taken based on the information provided herein.
Based on a comparison of coronavirus deaths in 210 countries relative to their population, Peru had the most losses to COVID-19 up until July 13, 2022. As of the same date, the virus had infected over 557.8 million people worldwide, and the number of deaths had totaled more than 6.3 million. Note, however, that COVID-19 testing rates vary by country, and big differences show up between countries when comparing the number of deaths against confirmed COVID-19 cases. The source seemingly does not differentiate between the original Wuhan strain (2019-nCoV) of COVID-19, the Kent mutation (B.1.1.7) that appeared in the UK in late 2020, the 2021 Delta variant (B.1.617.2) from India, or the Omicron variant (B.1.1.529) from South Africa.
The difficulties of death figures
This table aims to provide a complete picture on the topic, but it relies on data that has become more difficult to compare. As the coronavirus pandemic developed across the world, countries used different methods to count fatalities, and sometimes changed them during the course of the pandemic. On April 16, for example, the Chinese city of Wuhan revised its death figures upward by 50 percent to account for community deaths, i.e. deaths that occurred outside of hospitals and had so far gone unrecorded. The state of New York did something similar two days earlier, adding 3,700 deaths as it started to include “assumed” coronavirus victims. The United Kingdom started counting deaths in care homes and private households on April 29, adding about 5,000 deaths (a figure that was corrected downward again by roughly the same amount on August 18). This makes an already difficult comparison even harder. Belgium, for example, counts suspected coronavirus deaths in its figures, whereas other countries have not done so (yet). This means two things. First, counting methods can have a big impact on both current and future figures. Already on April 16, UK health experts stated that if their numbers were corrected for community deaths as in Wuhan, the UK figure would rise from 205 to “above 300”; this is exactly what happened two weeks later. Second, it is difficult to pinpoint exactly which countries already have “revised” numbers (like Belgium, Wuhan or New York) and which do not. One workaround is to look at (freely accessible) timelines that track the reported daily increase of deaths in certain countries; several of these are available on our platform, such as for Belgium, Italy and Sweden. A sudden large increase can be an indicator that the domestic sources changed their methodology.
Where are these numbers coming from?
The numbers shown here were collected by Johns Hopkins University, a source that manually checks the data with domestic health authorities. For the majority of countries, this is from national authorities. In some cases, like China, the United States, Canada or Australia, city reports or other various state authorities were consulted. In this statistic, these separately reported numbers were put together. For more information or other freely accessible content, please visit our dedicated Facts and Figures page.
The Global Data Regulation Diagnostic provides a comprehensive assessment of the quality of the data governance environment. Diagnostic results show that countries have put greater effort into adopting enabler regulatory practices than safeguard regulatory practices. However, the regulatory development of both enablers and safeguards remains at an intermediate stage: 47 percent of enabler good practices and 41 percent of safeguard good practices are adopted across countries. Under the enabler and safeguard pillars, the diagnostic covers the dimensions of e-commerce/e-transactions, enablers for public intent data, enablers for private intent data, safeguards for personal and nonpersonal data, cybersecurity and cybercrime, and cross-border data flows. Across all these dimensions, no income group demonstrates advanced regulatory frameworks, indicating significant room for further improvement of the data governance environment.
The Global Data Regulation Diagnostic is the first comprehensive assessment of laws and regulations on data governance. It covers enabler and safeguard regulatory practices in 80 countries, providing indicators to assess and compare their performance. The diagnostic develops objective and standardized indicators to measure the regulatory environment for the data economy across countries. The indicators aim to serve as a diagnostic tool so countries can assess and compare their performance vis-à-vis other countries. Understanding the gap with global regulatory good practices is a necessary first step for governments when identifying and prioritizing reforms.
80 countries
Country
Observation data/ratings [obs]
The diagnostic is based on a detailed assessment of domestic laws, regulations, and administrative requirements in 80 countries selected to ensure a balanced coverage across income groups, regions, and different levels of digital technology development. Data are further verified through a detailed desk research of legal texts, reflecting the regulatory status of each country as of June 1, 2020.
Mail Questionnaire [mail]
The questionnaire comprises 37 questions designed to determine if a country has adopted good regulatory practice on data governance. The responses are then scored and assigned a normative interpretation. Related questions fall into seven clusters so that when the scores are averaged, each cluster provides an overall sense of how it performs in its corresponding regulatory and legal dimensions. These seven dimensions are: (1) E-commerce/e-transaction; (2) Enablers for public intent data; (3) Enablers for private intent data; (4) Safeguards for personal data; (5) Safeguards for nonpersonal data; (6) Cybersecurity and cybercrime; (7) Cross-border data transfers.
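The cluster averaging described above can be sketched in a few lines. The question-to-dimension mapping and the scores below are illustrative only, not the diagnostic's actual coding scheme.

```python
from statistics import mean

# Hypothetical scored responses: question id -> (dimension, normalized score in [0, 1]).
responses = {
    "q1": ("E-commerce/e-transactions", 1.0),
    "q2": ("E-commerce/e-transactions", 0.5),
    "q3": ("Safeguards for personal data", 0.0),
    "q4": ("Safeguards for personal data", 1.0),
}

# Collect the question scores belonging to each cluster.
clusters = {}
for dimension, score in responses.values():
    clusters.setdefault(dimension, []).append(score)

# Average within each cluster to get one score per regulatory dimension.
dimension_scores = {d: mean(scores) for d, scores in clusters.items()}
# {"E-commerce/e-transactions": 0.75, "Safeguards for personal data": 0.5}
```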
100%
Round 1 of the Afrobarometer survey was conducted from July 1999 through June 2001 in 12 African countries to solicit public opinion on democracy, governance, markets, and national identity. The full 12-country dataset was pieced together from different projects: Round 1 of the Afrobarometer survey, the old Southern African Democracy Barometer, and similar surveys done in West and East Africa.
The 7-country dataset is a subset of the Round 1 survey dataset, consisting of a combined dataset for the 7 Southern African countries surveyed, along with other African countries, in Round 1, 1999-2000 (Botswana, Lesotho, Malawi, Namibia, South Africa, Zambia and Zimbabwe). It is a useful dataset because, in contrast to the full 12-country Round 1 dataset, all countries in it were surveyed with an identical questionnaire.
Botswana Lesotho Malawi Namibia South Africa Zambia Zimbabwe
Basic units of analysis that the study investigates include: individuals and groups
Sample survey data [ssd]
A new sample has to be drawn for each round of Afrobarometer surveys. Whereas the standard sample size for Round 3 surveys will be 1200 cases, a larger sample size will be required in societies that are extremely heterogeneous (such as South Africa and Nigeria), where the sample size will be increased to 2400. Other adaptations may be necessary within some countries to account for the varying quality of the census data or the availability of census maps.
The sample is designed as a representative cross-section of all citizens of voting age in a given country. The goal is to give every adult citizen an equal and known chance of selection for interview. We strive to reach this objective by (a) strictly applying random selection methods at every stage of sampling and by (b) applying sampling with probability proportionate to population size wherever possible. A randomly selected sample of 1200 cases allows inferences to national adult populations with a margin of sampling error of no more than plus or minus 2.5 percent with a confidence level of 95 percent. If the sample size is increased to 2400, the confidence interval shrinks to plus or minus 2 percent.
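The quoted margins follow the standard confidence-interval formula for a proportion under simple random sampling. As a minimal sketch: at the most conservative p = 0.5, the unadjusted formula gives roughly ±2.8 percentage points for n = 1200 and ±2.0 for n = 2400, so the ±2.5 percent figure quoted above presumably reflects rounding or survey-specific adjustments.

```python
import math

def margin_of_error(n, z=1.96, p=0.5):
    """Half-width of a 95% confidence interval for a proportion,
    assuming simple random sampling (no design-effect adjustment)."""
    return z * math.sqrt(p * (1 - p) / n)

# margin_of_error(1200) -> ~0.028 (about 2.8 percentage points)
# margin_of_error(2400) -> ~0.020 (about 2.0 percentage points)
```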
Sample Universe
The sample universe for Afrobarometer surveys includes all citizens of voting age within the country. In other words, we exclude anyone who is not a citizen and anyone who has not attained this age (usually 18 years) on the day of the survey. Also excluded are areas determined to be either inaccessible or not relevant to the study, such as those experiencing armed conflict or natural disasters, as well as national parks and game reserves. As a matter of practice, we have also excluded people living in institutionalized settings, such as students in dormitories and persons in prisons or nursing homes.
What to do about areas experiencing political unrest? On the one hand we want to include them because they are politically important. On the other hand, we want to avoid stretching out the fieldwork over many months while we wait for the situation to settle down. It was agreed at the 2002 Cape Town Planning Workshop that it is difficult to come up with a general rule that will fit all imaginable circumstances. We will therefore make judgments on a case-by-case basis on whether or not to proceed with fieldwork or to exclude or substitute areas of conflict. National Partners are requested to consult Core Partners on any major delays, exclusions or substitutions of this sort.
Sample Design
The sample design is a clustered, stratified, multi-stage, area probability sample.
To repeat the main sampling principle, the objective of the design is to give every sample element (i.e. adult citizen) an equal and known chance of being chosen for inclusion in the sample. We strive to reach this objective by (a) strictly applying random selection methods at every stage of sampling and by (b) applying sampling with probability proportionate to population size wherever possible.
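Sampling with probability proportionate to population size is typically implemented as a systematic pass over cumulative population totals. The sketch below illustrates that idea; the unit names and sizes are made up, and real Afrobarometer sampling involves additional stratification steps.

```python
import random

def pps_systematic(units, k):
    """Select k units with probability proportionate to size,
    using systematic sampling over cumulative size totals.
    `units` is a list of (name, population_size) pairs."""
    total = sum(size for _, size in units)
    interval = total / k
    # One random start, then equally spaced selection points.
    start = random.uniform(0, interval)
    points = [start + i * interval for i in range(k)]
    chosen, cum = [], 0.0
    it = iter(units)
    name, size = next(it)
    for p in points:
        # Advance until the unit whose cumulative range contains point p.
        while cum + size < p:
            cum += size
            name, size = next(it)
        chosen.append(name)
    return chosen

units = [("EA-1", 100), ("EA-2", 300), ("EA-3", 600)]
chosen = pps_systematic(units, 2)  # two EA names; larger EAs are more likely to appear
```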
In a series of stages, geographically defined sampling units of decreasing size are selected. To ensure that the sample is representative, the probability of selection at various stages is adjusted as follows:
The sample is stratified by key social characteristics in the population such as sub-national area (e.g. region/province) and residential locality (urban or rural). The area stratification reduces the likelihood that distinctive ethnic or language groups are left out of the sample. And the urban/rural stratification is a means to make sure that these localities are represented in their correct proportions. Wherever possible, and always in the first stage of sampling, random sampling is conducted with probability proportionate to population size (PPPS). The purpose is to guarantee that larger (i.e., more populated) geographical units have a proportionally greater probability of being chosen into the sample. The sampling design has four stages
A first-stage to stratify and randomly select primary sampling units;
A second-stage to randomly select sampling start-points;
A third stage to randomly choose households;
A final-stage involving the random selection of individual respondents
We shall deal with each of these stages in turn.
STAGE ONE: Selection of Primary Sampling Units (PSUs)
The primary sampling units (PSU's) are the smallest, well-defined geographic units for which reliable population data are available. In most countries, these will be Census Enumeration Areas (or EAs). Most national census data and maps are broken down to the EA level. In the text that follows we will use the acronyms PSU and EA interchangeably because, when census data are employed, they refer to the same unit.
We strongly recommend that NIs use official national census data as the sampling frame for Afrobarometer surveys. Where recent or reliable census data are not available, NIs are asked to inform the relevant Core Partner before they substitute any other demographic data. Where the census is out of date, NIs should consult a demographer to obtain the best possible estimates of population growth rates. These should be applied to the outdated census data in order to make projections of population figures for the year of the survey. It is important to bear in mind that population growth rates vary by area (region) and (especially) between rural and urban localities. Therefore, any projected census data should include adjustments to take such variations into account.
Indeed, we urge NIs to establish collegial working relationships with professionals in the national census bureau, not only to obtain the most recent census data, projections, and maps, but also to gain access to sampling expertise. NIs may even commission a census statistician to draw the sample to Afrobarometer specifications, provided that provision for this service has been made in the survey budget.
Regardless of who draws the sample, the NIs should thoroughly acquaint themselves with the strengths and weaknesses of the available census data and the availability and quality of EA maps. The country and methodology reports should cite the exact census data used, its known shortcomings, if any, and any projections made from the data. At minimum, the NI must know the size of the population and the urban/rural population divide in each region in order to specify how to distribute population and PSU's in the first stage of sampling. National investigators should obtain this written data before they attempt to stratify the sample.
Once this data is obtained, the sample population (either 1200 or 2400) should be stratified, first by area (region/province) and then by residential locality (urban or rural). In each case, the proportion of the sample in each locality in each region should be the same as its proportion in the national population as indicated by the updated census figures.
Having stratified the sample, it is then possible to determine how many PSU's should be selected for the country as a whole, for each region, and for each urban or rural locality.
The total number of PSU's to be selected for the whole country is determined by calculating the maximum degree of clustering of interviews one can accept in any PSU. Because PSUs (which are usually geographically small EAs) tend to be socially homogenous we do not want to select too many people in any one place. Thus, the Afrobarometer has established a standard of no more than 8 interviews per PSU. For a sample size of 1200, the sample must therefore contain 150 PSUs/EAs (1200 divided by 8). For a sample size of 2400, there must be 300 PSUs/EAs.
These PSUs should then be allocated proportionally to the urban and rural localities within each regional stratum of the sample. Let's take a couple of examples from a country with a sample size of 1200. If the urban locality of Region X in this country constitutes 10 percent of the current national population, then the sample for this stratum should be 15 PSUs (calculated as 10 percent of 150 PSUs). If the rural population of Region Y constitutes 4 percent of the current national population, then the sample for this stratum should be 6 PSU's.
The next step is to select particular PSUs/EAs using random methods. Using the above example of the rural localities in Region Y, let us say that you need to pick 6 sample EAs out of a census list that contains a total of 240 rural EAs in Region Y. But which 6? If the EAs created by the national census bureau are of equal or roughly equal population size, then selection is relatively straightforward: number all EAs consecutively, then make six selections using a table of random numbers. This procedure, known as simple random sampling (SRS), will give each EA an equal and known chance of being selected.
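The stage-one arithmetic above (the PSU count, the proportional allocation, and the SRS draw) can be sketched as follows; the 240 rural EAs come from the Region Y example.

```python
import random

sample_size = 1200
interviews_per_psu = 8                          # Afrobarometer standard: at most 8 per PSU
total_psus = sample_size // interviews_per_psu  # 150 PSUs/EAs nationwide

# Proportional allocation: a stratum holding 4% of the national population
# (the rural Region Y example) receives 4% of the PSUs.
stratum_psus = round(total_psus * 0.04)         # 6 PSUs

# Simple random sampling: number the 240 rural EAs consecutively and draw 6.
rural_eas = list(range(1, 241))
selected = random.sample(rural_eas, stratum_psus)
```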
By Harish Kumar Garg [source]
This dataset covers the number of Indian students studying abroad in different countries, with details on the nations where Indian students are present. The data was compiled from the Ministry of External Affairs' answer to a question from a Member of Parliament about how many students from India are studying in foreign countries, and in which countries. The dataset includes two fields, Country Name and Number of Indians Studying Abroad as of March 2017, offering a unique opportunity to track student mobility across nations around the world. With this data on student mobility, we can gain insight into how educational opportunities for Indian students have grown over time and examine trends in international education across regions. From comparisons among countries with similar academic opportunities to tracking the regional popularity of study destinations, this dataset provides important context for studying student migration patterns. We invite everyone to explore this data further and use it to draw meaningful conclusions!
How to use this dataset?
The data has two columns – Country Name and Number of Indians studying there as of March 2017. It also includes a third column, Percentage, which gives an indication about the proportion of Indian students enrolled in each country relative to total number enrolled abroad globally.
To get started with your exploration, you can visualize the data against parameters such as geographical region or language spoken, which may shed light on the motives behind students' choices. You can also group countries by available research opportunities, cost considerations, and so on, to better understand what motivates Indians to pursue further studies outside India.
Additionally, you can use this dataset for benchmarking against regional or international peer groups, or to aggregate regional and global reports, with the aim of making better decisions or policies for educational promotion activities that target foreign universities and colleges and seek to attract more Indian students aspiring to an international higher education.
- Using this dataset, educational institutions in India can set up international exchange programs with universities in other countries to facilitate and support Indian students studying abroad.
Higher education institutions can also study the current trend of Indian students seeking opportunities to study abroad and use this data to build specialized short-term courses, in collaboration with universities from different countries, that cater to students interested in moving abroad temporarily or permanently for higher studies.
Policymakers could use this data to assess current trends and develop policies that incentivize international exposure among young professionals, for example by commissioning fellowships or scholarships that expose them to different problem sets around the world, thereby making their profiles more attractive as they look for better job opportunities globally.
If you use this dataset in your research, please credit the original authors. Data Source
Unknown License - Please check the dataset description for more information.
File: final_data.csv

| Column name | Description |
|:----------------------|:-------------|
| Country | Name of the country where Indian students are studying. (String) |
| No of Indian Students | Number of Indian students studying in the country. (Integer) |
| Percentage | Percentage of Indian students studying in the country compared to the total number of Indian students studying abroad. (Float) |
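The Percentage column can be reproduced from the count column alone: each country's share of the total number of Indian students abroad. A minimal sketch with made-up rows in the shape of final_data.csv (the real file's values will differ):

```python
from csv import DictReader
from io import StringIO

# Two made-up rows mimicking the layout of final_data.csv.
sample = StringIO(
    "Country,No of Indian Students\n"
    "Country A,300\n"
    "Country B,100\n"
)

rows = list(DictReader(sample))
total = sum(int(r["No of Indian Students"]) for r in rows)

# Each country's percentage share of all Indian students studying abroad.
for r in rows:
    r["Percentage"] = round(100 * int(r["No of Indian Students"]) / total, 2)
# Country A -> 75.0, Country B -> 25.0
```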
Between 2013 and 2022, Russia had the highest number of content removal requests sent to Google by the government, with more than 215 thousand requests. South Korea ranked second, with over 267 thousand requests for content removal. India's government sent over 195 thousand requests to Google for online content removal in the measured period.
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.7910/DVN/27557
The Global Hunger Index (GHI) is a tool designed to comprehensively measure and track hunger globally and by region and country. Calculated each year by the International Food Policy Research Institute (IFPRI), the GHI highlights successes and failures in hunger reduction and provide insights into the drivers of hunger, and food and nutrition security. The 2014 GHI has been calculated for 120 countries for which data on the three component indicators are available and for which measuring hung er is considered most relevant. The GHI calculation excludes some higher income countries because the prevalence of hunger there is very low. The GHI is only as current as the data for its three component indicators. This year's GHI reflects the most recent available country level data for the three component indicators spanning the period 2009 to 2013. Besides the most recent GHI scores, this dataset also contains the GHI scores for four other reference periods- 1990, 1995, 2000, and 2005. A country's GHI score is calculated by averaging the percentage of the population that is undernourished, the percentage of children youn ger than five years old who are underweight, and the percentage of children dying before the age of five. This calculation results in a 100 point scale on which zero is the best score (no hunger) and 100 the worst, although neither of these extremes is reached in practice. The three component indicators used to calculate the GHI scores draw upon data from the following sources: 1. Undernourishment: Updated data from the Food and Agriculture Organization of the United Nations (FAO) were used for the 1990, 1995, 2000, 2005, and 2014GHI scores. Undernourishment data for the 2014 GHI are for 2011-2013. 2. 
Child underweight: The "child underweight" component indicator of the GHI scores includes the latest additions to the World Health Organization's (WHO) Global Database on Child Growth and Malnutrition, and additional data from the joint database of the United Nations Children's Fund (UNICEF), WHO, and the World Bank; the most recent Demographic and Health Survey (DHS) and Multiple Indicator Cluster Survey reports; and statistical tables from UNICEF. For the 2014 GHI, data on child underweight are for the latest year for which data are available in the period 2009-2014. 3. Child mortality: Updated data from the UN Inter-agency Group for Child Mortality Estimation were used for the 1990, 1995, 2000, 2005, and 2014 GHI scores. For the 2014 GHI, data on child mortality are for 2012. Resources related to 2014 Global Hunger Index
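The score construction described above is a simple equal-weight average of the three component percentages. A minimal sketch (the function name and sample values are hypothetical, not taken from the dataset):

```python
def ghi_score(undernourished_pct, child_underweight_pct, child_mortality_pct):
    """2014-style GHI: equal-weight average of three component percentages,
    on a 0-100 scale where 0 means no hunger and 100 is the worst."""
    return (undernourished_pct + child_underweight_pct + child_mortality_pct) / 3

# Illustrative (not real) component values:
print(ghi_score(30.0, 20.0, 10.0))  # → 20.0
```

Because each component is itself a percentage, the averaged score stays within 0-100 by construction.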
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data is from:
https://simplemaps.com/data/world-cities
We're proud to offer a simple, accurate and up-to-date database of the world's cities and towns. We've built it from the ground up using authoritative sources such as the NGIA, US Geological Survey, US Census Bureau, and NASA.
Our database is:
https://www.gnu.org/licenses/gpl-3.0.html
This dataset comprises 204 entries and 38 attributes, providing a comprehensive analysis of key economic and social indicators across various countries. It includes a diverse range of metrics, allowing for in-depth exploration of global trends related to GDP, education, health, and environmental factors.
Key Features:
Applications and Uses:
Research and Analysis: Ideal for researchers studying the correlation between economic performance and social indicators. This dataset can help identify trends and patterns relevant to global development.
Policy Development: Policymakers can utilize this data to inform decisions on education, healthcare, and environmental policies, aiming to improve national outcomes.
Machine Learning and Data Science: Data scientists can apply machine learning techniques to predict economic trends, analyze social impacts, or classify countries based on various indicators.
Educational Purposes: Suitable for students and educators in fields like economics, sociology, and environmental science for practical data analysis exercises.
Visualization Projects: Perfect for creating compelling visualizations that illustrate relationships between different metrics, aiding in public understanding and engagement.
By leveraging this dataset, users can uncover insights into how different factors influence a country's development, making it a valuable resource for diverse applications across various fields.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The highest number of lifetime Gmail downloads across iOS and Android.
The Afrobarometer is a comparative series of public attitude surveys that assess African citizens' attitudes toward democracy and governance, markets, and civil society, among other topics.
The 12-country dataset is a combined dataset for the 12 African countries surveyed during Round 1 of the survey, conducted between 1999 and 2000 (Botswana, Ghana, Lesotho, Mali, Malawi, Namibia, Nigeria, South Africa, Tanzania, Uganda, Zambia, and Zimbabwe), plus data from the old Southern African Democracy Barometer and similar surveys done in West and East Africa.
The Round 1 Afrobarometer surveys have national coverage for the following countries: Botswana, Ghana, Lesotho, Malawi, Mali, Namibia, Nigeria, South Africa, Tanzania, Uganda, Zambia, Zimbabwe.
Individuals
The sample universe for Afrobarometer surveys includes all citizens of voting age within the country. In other words, we exclude anyone who is not a citizen and anyone who has not attained this age (usually 18 years) on the day of the survey. Also excluded are areas determined to be either inaccessible or not relevant to the study, such as those experiencing armed conflict or natural disasters, as well as national parks and game reserves. As a matter of practice, we have also excluded people living in institutionalized settings, such as students in dormitories and persons in prisons or nursing homes.
What to do about areas experiencing political unrest? On the one hand we want to include them because they are politically important. On the other hand, we want to avoid stretching out the fieldwork over many months while we wait for the situation to settle down. It was agreed at the 2002 Cape Town Planning Workshop that it is difficult to come up with a general rule that will fit all imaginable circumstances. We will therefore make judgments on a case-by-case basis on whether or not to proceed with fieldwork or to exclude or substitute areas of conflict. National Partners are requested to consult Core Partners on any major delays, exclusions or substitutions of this sort.
Sample survey data [ssd]
Afrobarometer uses national probability samples designed to meet the following criteria. Samples are designed to be a representative cross-section of all citizens of voting age in a given country. The goal is to give every adult citizen an equal and known chance of being selected for an interview. This is achieved by:
• using random selection methods at every stage of sampling; • sampling at all stages with probability proportionate to population size wherever possible to ensure that larger (i.e., more populated) geographic units have a proportionally greater probability of being chosen into the sample.
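Selection with probability proportionate to population size can be sketched with the standard library (the unit names and population figures below are invented for illustration):

```python
import random

# Hypothetical geographic units with their population sizes.
units = {"District A": 50_000, "District B": 30_000, "District C": 20_000}

random.seed(0)  # fixed seed so the example is reproducible

# Each draw picks a unit with probability proportional to its population,
# so more populated units are proportionally more likely to enter the sample.
sample = random.choices(list(units), weights=list(units.values()), k=4)
print(sample)
```

Note that `random.choices` samples with replacement; a populous unit can be drawn more than once, which is consistent with clustering several interviews per selected unit.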
The sampling universe normally includes all citizens age 18 and older. As a standard practice, we exclude people living in institutionalized settings, such as students in dormitories, patients in hospitals, and persons in prisons or nursing homes. Occasionally, we must also exclude people living in areas determined to be inaccessible due to conflict or insecurity. Any such exclusion is noted in the technical information report (TIR) that accompanies each data set.
Sample size and design Samples usually include either 1,200 or 2,400 cases. A randomly selected sample of n=1200 cases allows inferences to national adult populations with a margin of sampling error of no more than +/-2.8% with a confidence level of 95 percent. With a sample size of n=2400, the margin of error decreases to +/-2.0% at 95 percent confidence level.
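The quoted margins match the standard worst-case (p = 0.5) formula for a simple random sample at 95 percent confidence, 1.96 × sqrt(p(1 − p)/n):

```python
import math

def margin_of_error(n, z=1.96, p=0.5):
    """Worst-case sampling margin of error at ~95% confidence
    for a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

print(round(margin_of_error(1200) * 100, 1))  # → 2.8 (percent)
print(round(margin_of_error(2400) * 100, 1))  # → 2.0 (percent)
```

In practice a clustered, stratified design has a design effect that widens these margins somewhat; the figures above are the simple-random-sample baseline.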
The sample design is a clustered, stratified, multi-stage, area probability sample. Specifically, we first stratify the sample according to the main sub-national unit of government (state, province, region, etc.) and by urban or rural location.
Area stratification reduces the likelihood that distinctive ethnic or language groups are left out of the sample. Afrobarometer occasionally purposely oversamples certain populations that are politically significant within a country to ensure that the size of the sub-sample is large enough to be analysed. Any oversampling is noted in the TIR.
Sample stages Samples are drawn in either four or five stages:
Stage 1: In rural areas only, the first stage is to draw secondary sampling units (SSUs). SSUs are not used in urban areas, and in some countries they are not used in rural areas. See the TIR that accompanies each data set for specific details on the sample in any given country.
Stage 2: We randomly select primary sampling units (PSUs).
Stage 3: We then randomly select sampling start points.
Stage 4: Interviewers then randomly select households.
Stage 5: Within the household, the interviewer randomly selects an individual respondent. Each interviewer alternates in each household between interviewing a man and interviewing a woman to ensure gender balance in the sample.
To keep the costs and logistics of fieldwork within manageable limits, eight interviews are clustered within each selected PSU.
Data weights For some national surveys, data are weighted to correct for over- or under-sampling or for household size. "Withinwt" should be turned on for all national-level descriptive statistics in countries that contain this weighting variable. It is included as the last variable in the data set, with details described in the codebook. For merged data sets, "Combinwt" should be turned on for cross-national comparisons of descriptive statistics. Note: this weighting variable standardizes each national sample as if it were equal in size.
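Turning on a weighting variable such as "Withinwt" amounts to computing weighted rather than unweighted statistics. A minimal sketch with invented respondent values and weights:

```python
# Hypothetical respondent values (e.g., a 0/1 attitude item) and weights,
# standing in for a variable like "Withinwt" from the codebook.
values = [1, 0, 1, 1, 0]
weights = [0.8, 1.2, 1.0, 0.5, 1.5]

# Weighted mean: sum of weight * value divided by the sum of weights.
weighted_mean = sum(w * v for w, v in zip(weights, values)) / sum(weights)
print(round(weighted_mean, 3))  # → 0.46
```

The unweighted mean of the same values would be 0.6; the weights here down-weight respondents from oversampled groups, which is exactly what the survey weights are for.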
Further information on sampling protocols, including full details of the methodologies used for each stage of sample selection, can be found at https://afrobarometer.org/surveys-and-methods/sampling-principles
Face-to-face [f2f]
Because Afrobarometer Round 1 emerged out of several different survey research efforts, survey instruments were not standardized across all countries. There are a number of features of the questionnaires that should be noted, as follows:
• In most cases, the data set only includes those questions/variables that were asked in nine or more countries. Complete Round 1 data sets for each individual country have already been released, and are available from ICPSR or from the Afrobarometer website at www.afrobarometer.org.
• In the seven countries that originally formed the Southern Africa Barometer (SAB) - Botswana, Lesotho, Malawi, Namibia, South Africa, Zambia and Zimbabwe - a standardized questionnaire was used, so question wording and response categories are generally the same for all of these countries. The questionnaires in Mali and Tanzania were also essentially identical (in the original English version). Ghana, Uganda and Nigeria each had distinct questionnaires.
• This merged dataset combines, into a single variable, responses from across these different countries where either identical or very similar questions were used, or where conceptually equivalent questions can be found in at least nine of the different countries. For each variable, the exact question text from each of the countries or groups of countries ("SAB" refers to the Southern Africa Barometer countries) is listed.
• Response options also varied on some questions, and where applicable, these differences are also noted.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides values for GDP reported in several countries. The data includes current values, previous releases, historical highs and record lows, release frequency, reported unit and currency.
https://datacatalog.worldbank.org/public-licenses?fragment=cc
This dataset contains metadata (title, abstract, date of publication, field, etc) for around 1 million academic articles. Each record contains additional information on the country of study and whether the article makes use of data. Machine learning tools were used to classify the country of study and data use.
Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, and crawled from the internet.
We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and a parsed PDF or LaTeX file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data. The parsed PDF or LaTeX file is needed to extract key information such as the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus. Around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles from the years 2000 to 2020 were considered. This restriction eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries' national statistical systems: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. Fields that are included are: Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles.
Due to the intensive computer resources required, a set of 1,037,748 articles were randomly selected from the 10 million articles in our restricted corpus as a convenience sample.
The empirical approach employed in this project utilizes text mining with Natural Language Processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We will discuss each of these in turn.
To determine the country or countries of study in each academic article, two approaches are employed based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches based on the presence of ISO3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, there is a potential for exclusion errors if a country’s name is spelled non-standardly.
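The first approach can be sketched as a regular-expression scan with word boundaries. The country list below is an illustrative subset; the actual pipeline compiled the full set of ISO 3166 country names:

```python
import re

# Illustrative subset of the ISO 3166 country-name list used by the pipeline.
COUNTRIES = ["Austria", "Brazil", "Germany", "India", "Portugal",
             "United Kingdom", "United States"]

# Word-boundary anchors so "India" does not match inside "Indiana".
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, COUNTRIES)) + r")\b")

def countries_of_study(text):
    """Return the distinct country names found in a title/abstract string."""
    return sorted(set(PATTERN.findall(text)))

print(countries_of_study("Survey evidence from India and the United Kingdom"))
# → ['India', 'United Kingdom']
```

The word boundaries guard against substring matches, but as the text notes, the approach still misses non-standard spellings, which is what the NER pass is meant to catch.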
The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify objects from text, utilizing the spaCy Python library. The Named Entity Recognition algorithm splits text into named entities, and NER is used in this project to identify countries of study in the academic articles. SpaCy supports multiple languages and has been trained on multiple spellings of countries, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.
The second task is to classify whether the paper uses data. A supervised machine learning approach is employed, where 3,500 publications were first randomly selected and manually labeled by human raters using the Mechanical Turk service (Paszke et al. 2019).[1] To make sure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:
Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came.
There are two classification tasks in this exercise:
1. Identifying whether an academic article is using data from any country
2. Identifying from which country that data came.
For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated or created to produce research findings. As an example, a study that reports findings or analysis using survey data uses data. Some clues that a study does use data include whether a survey or census is described, a statistical model is estimated, or a table of means or summary statistics is reported.
After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please click multiple options.[2]
For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable. For instance, if the research is theoretical and has no specific country application. In some cases, the research article may involve multiple countries. In these cases, select all countries that are discussed in the paper.
We expect between 10 and 35 percent of all articles to use data.
The median amount of time that a worker spent on an article, measured as the time between when the article was accepted for classification by the worker and when the classification was submitted, was 25.4 minutes. If human raters were used exclusively rather than machine learning tools, then the corpus of 1,037,748 articles examined in this study would take around 50 years of human work time to review, at a cost of $3,113,244 (assuming a cost of $3 per article, as was paid to the MTurk workers).
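The time and cost figures follow directly from the per-article numbers; a quick check (the "years" figure here assumes continuous clock time rather than working hours):

```python
n_articles = 1_037_748
minutes_per_article = 25.4   # median MTurk classification time
cost_per_article = 3         # USD paid per article

total_cost = n_articles * cost_per_article
total_hours = n_articles * minutes_per_article / 60
years_continuous = total_hours / (24 * 365)  # continuous clock time

print(total_cost)                   # → 3113244
print(round(years_continuous, 1))  # → 50.1
```

So the quoted "around 50 years" corresponds to roughly 439,000 hours of rating time run back-to-back.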
A model is next trained on the 3,500 labelled articles. We use a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) model to encode raw text into a numeric format suitable for predictions (Devlin et al. 2018). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% the size of BERT, retains 97% of its language understanding capabilities, and is 60% faster (Sanh, Debut, Chaumond, and Wolf 2019). We use PyTorch to produce a model to classify articles based on the labeled data. Of the 3,500 articles that were hand-coded by the MTurk workers, 900 are fed to the machine learning model; 900 articles were selected because of computational limitations in training the NLP model. A classification of "uses data" was assigned if the model predicted an article used data with at least 90% confidence.
The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We consider the human raters as giving us the ground truth. This may underestimate the model performance if the raters at times got the allocation wrong in a way that would not apply to the model. For instance, a human rater could mistake the Republic of Korea for the Democratic People's Republic of Korea. If both humans and the model make the same kinds of errors, then the performance reported here will be overestimated.
The model was able to predict whether an article made use of data with 87% accuracy evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in the figure below. The number of articles represents an aggregate total of
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The average for 2023 based on 193 countries was -0.03 points. The highest value was in Singapore: 2.31 points and the lowest value was in North Korea: -2.39 points. The indicator is available from 1996 to 2023. Below is a chart for all countries where data are available.
In 2021, the United States was the leading country in the big data and business analytics (BDA) market, with ** percent market share. The following four leading countries all hover around * percent market share. Global BDA spending is forecast to reach almost *** billion U.S. dollars in 2021, with the majority to be spent on IT services and software.
The Global Roads Open Access Data Set, Version 1 (gROADSv1) was developed under the auspices of the CODATA Global Roads Data Development Task Group. The data set combines the best available roads data by country into a global roads coverage, using the UN Spatial Data Infrastructure Transport (UNSDI-T) version 2 as a common data model. All country road networks have been joined topologically at the borders, and many countries have been edited for internal topology. Source data for each country are provided in the documentation, and users are encouraged to refer to the readme file for use constraints that apply to a small number of countries. Because the data are compiled from multiple sources, the date range for road network representations ranges from the 1980s to 2010 depending on the country (most countries have no confirmed date), and spatial accuracy varies. The baseline global data set was compiled by the Information Technology Outreach Services (ITOS) of the University of Georgia. Updated data for 27 countries and 6 smaller geographic entities were assembled by Columbia University's Center for International Earth Science Information Network (CIESIN), with a focus largely on developing countries with the poorest data coverage.