46 datasets found

Number of missing persons files in the U.S. 2022, by race
statista.com
Updated Jul 5, 2024
Cite
Statista (2024). Number of missing persons files in the U.S. 2022, by race [Dataset]. https://www.statista.com/statistics/240396/number-of-missing-persons-files-in-the-us-by-race/
Dataset updated
Jul 5, 2024
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2022
Area covered
United States
Description
In 2022, there were 313,017 cases filed by the NCIC where the race of the reported missing was White. In the same year, 18,928 people were missing whose race was unknown.

What is the NCIC?

The National Crime Information Center (NCIC) is a digital database that stores crime data for the United States, so criminal justice agencies can access it. As a part of the FBI, it helps criminal justice professionals find criminals, missing people, stolen property, and terrorists. The NCIC database is broken down into 21 files. Seven files belong to stolen property and items, and 14 belong to persons, including the National Sex Offender Registry, Missing Person, and Identity Theft. It works alongside federal, tribal, state, and local agencies. The NCIC’s goal is to maintain a centralized information system between local branches and offices, so information is easily accessible nationwide.

Missing people in the United States

A person is considered missing when they have disappeared and their location is unknown. A person who is considered missing might have left voluntarily, but that is not always the case. The number of NCIC unidentified person files in the United States has fluctuated since 1990, and in 2022, there were slightly more NCIC missing person files for males than for females. Fortunately, the number of NCIC missing person files has been mostly decreasing since 1998.
Indicator 11.5.1: Number of missing persons due to disaster (number)
unstats-undesa.opendata.arcgis.com
sdgs.amerigeoss.org
Updated Sep 9, 2021
+ more versions
Cite
UN DESA Statistics Division (2021). Indicator 11.5.1: Number of missing persons due to disaster (number) [Dataset]. https://unstats-undesa.opendata.arcgis.com/datasets/indicator-11-5-1-number-of-missing-persons-due-to-disaster-number
Dataset updated
Sep 9, 2021
Dataset provided by
United Nations Department of Economic and Social Affairs (https://www.un.org/en/desa)
Authors
UN DESA Statistics Division
Area covered
Description
Series Name: Number of missing persons due to disaster (number)
Series Code: VC_DSR_MISS
Release Version: 2021.Q2.G.03

This dataset is part of the Global SDG Indicator Database compiled through the UN System in preparation for the Secretary-General's annual report on Progress towards the Sustainable Development Goals.

Indicator 11.5.1: Number of deaths, missing persons and directly affected persons attributed to disasters per 100,000 population

Target 11.5: By 2030, significantly reduce the number of deaths and the number of people affected and substantially decrease the direct economic losses relative to global gross domestic product caused by disasters, including water-related disasters, with a focus on protecting the poor and people in vulnerable situations

Goal 11: Make cities and human settlements inclusive, safe, resilient and sustainable

For more information on the compilation methodology of this dataset, see https://unstats.un.org/sdgs/metadata/
Geonames - All Cities with a population > 1000
public.opendatasoft.com
data.smartidf.services
+2more
csv, excel, geojson +1
Updated Mar 10, 2024
+ more versions
Cite
(2024). Geonames - All Cities with a population > 1000 [Dataset]. https://public.opendatasoft.com/explore/dataset/geonames-all-cities-with-a-population-1000/
Available download formats: csv, json, geojson, excel
Dataset updated
Mar 10, 2024
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
All cities with a population > 1000 or seats of adm div (ca 80.000)

Sources and Contributions

Sources: GeoNames aggregates over a hundred different data sources.
Ambassadors: GeoNames Ambassadors help in many countries.
Wiki: A wiki allows users to view the data, quickly fix errors, and add missing places.
Donations and Sponsoring: Costs for running GeoNames are covered by donations and sponsoring.

Enrichment: add country name
‘Missing Migrants Dataset’ analyzed by Analyst-2
analyst-2.ai
Updated Apr 23, 2019
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2019). ‘Missing Migrants Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-missing-migrants-dataset-c736/2e62d69f/?v=grid
Dataset updated
Apr 23, 2019
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Missing Migrants Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/jmataya/missingmigrants on 14 February 2022.

--- Dataset description provided by original source is as follows ---

About the Missing Migrants Data

These data are sourced from the International Organization for Migration. The data are part of a specific project called the Missing Migrants Project, which tracks deaths of migrants, including refugees, who have gone missing along mixed migration routes worldwide. The research behind this project began with the October 2013 tragedies, when at least 368 individuals died in two shipwrecks near the Italian island of Lampedusa. Since then, Missing Migrants Project has developed into an important hub and advocacy source of information that media, researchers, and the general public access for the latest information.

Where is the data from?

Missing Migrants Project data are compiled from a variety of sources. Sources vary depending on the region and broadly include data from national authorities, such as Coast Guards and Medical Examiners; media reports; NGOs; and interviews with survivors of shipwrecks. In the Mediterranean region, data are relayed from relevant national authorities to IOM field missions, who then share it with the Missing Migrants Project team. Data are also obtained by IOM and other organizations that receive survivors at landing points in Italy and Greece. In other cases, media reports are used. IOM and UNHCR also regularly coordinate on such data to ensure consistency. Data on the U.S./Mexico border are compiled based on data from U.S. county medical examiners and sheriff’s offices, as well as media reports for deaths occurring on the Mexico side of the border. Estimates within Mexico and Central America are based primarily on media and year-end government reports. Data on the Bay of Bengal are drawn from reports by UNHCR and NGOs. In the Horn of Africa, data are obtained from media and NGOs. Data for other regions are drawn from a combination of sources, including media and grassroots organizations. In all regions, Missing Migrants Project data represent minimum estimates and are potentially lower than the actual figures.

Updated data and visuals can be found here: https://missingmigrants.iom.int/

Who is included in Missing Migrants Project data?

IOM defines a migrant as any person who is moving or has moved across an international border or within a State away from his/her habitual place of residence, regardless of (1) the person’s legal status; (2) whether the movement is voluntary or involuntary; (3) what the causes for the movement are; or (4) what the length of the stay is.[1]

Missing Migrants Project counts migrants who have died or gone missing at the external borders of states, or in the process of migration towards an international destination. The count excludes deaths that occur in immigration detention facilities, during deportation, or after forced return to a migrant’s homeland, as well as deaths more loosely connected with migrants’ irregular status, such as those resulting from labour exploitation. Migrants who die or go missing after they are established in a new home are also not included in the data, so deaths in refugee camps or housing are excluded. This approach is chosen because deaths that occur at physical borders and while en route represent a more clearly definable category, and indicate which migration routes are most dangerous. Data and knowledge of the risks and vulnerabilities faced by migrants in destination countries, including death, should not be neglected, but rather tracked as a distinct category.

How complete is the data on dead and missing migrants?

Data on fatalities during the migration process are challenging to collect for a number of reasons, most stemming from the irregular nature of migratory journeys on which deaths tend to occur. For one, deaths often occur in remote areas on routes chosen with the explicit aim of evading detection. Countless bodies are never found, and rarely do these deaths come to the attention of authorities or the media. Furthermore, when deaths occur at sea, frequently not all bodies are recovered - sometimes with hundreds missing from one shipwreck - and the precise number of missing is often unknown. In 2015, over 50 per cent of deaths recorded by the Missing Migrants Project refer to migrants who are presumed dead and whose bodies have not been found, mainly at sea.

Data are also challenging to collect because reporting on deaths is poor, and the data that do exist are highly scattered. Few official sources are collecting data systematically. Many counts of death rely on media as a source. Coverage can be spotty and incomplete. In addition, the involvement of criminal actors in incidents means survivors may be afraid to report deaths, and some deaths may be actively covered up. The irregular immigration status of many migrants, and at times their families as well, also impedes reporting of missing persons or deaths.

The varying quality and comprehensiveness of data by region in attempting to estimate deaths globally may exaggerate the share of deaths that occur in some regions, while under-representing the share occurring in others.

What can be understood through this data?

The available data can give an indication of changing conditions and trends related to migration routes and the people travelling on them, which can be relevant for policy making and protection plans. Data can be useful to determine the relative risks of irregular migration routes. For example, Missing Migrants Project data show that despite the increase in migrant flows through the eastern Mediterranean in 2015, the central Mediterranean remained the more deadly route. In 2015, nearly two people died out of every 100 travellers (1.85%) crossing the Central route, as opposed to one out of every 1,000 that crossed from Turkey to Greece (0.095%). From the data, we can also get a sense of whether groups like women and children face additional vulnerabilities on migration routes.
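The two route risks quoted above can be put on a common footing with a little arithmetic. The following is a minimal sketch that uses only the two percentages from the text; the helper name `one_in_n` is hypothetical and is not part of the Missing Migrants Project data.

```python
def one_in_n(percent: float) -> float:
    """Convert a percentage risk into the 'one death per N travellers' form."""
    if percent <= 0:
        raise ValueError("percent must be positive")
    return 100.0 / percent


# Percentages as quoted in the description for 2015.
central_route_pct = 1.85    # Central Mediterranean route
eastern_route_pct = 0.095   # Turkey to Greece

central_one_in = one_in_n(central_route_pct)    # roughly one in 54
eastern_one_in = one_in_n(eastern_route_pct)    # roughly one in 1,053
risk_ratio = central_route_pct / eastern_route_pct  # roughly 19.5

print(f"Central route: about 1 in {central_one_in:.0f}")
print(f"Eastern route: about 1 in {eastern_one_in:.0f}")
print(f"Central route risk is about {risk_ratio:.1f} times higher")
```

On these figures the Central route was roughly twenty times as deadly per traveller, which is consistent with the "nearly two out of every 100" versus "one out of every 1,000" wording.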

However, it is important to note that because of the challenges in data collection for the missing and dead, basic demographic information on the deceased is rarely known. Often migrants in mixed migration flows do not carry appropriate identification. When bodies are found, it may not be possible to identify them or to determine basic demographic information. In the data compiled by Missing Migrants Project, sex of the deceased is unknown in over 80% of cases. Region of origin has been determined for the majority of the deceased. Even this information is at times extrapolated based on available information: for instance, if all survivors of a shipwreck are of one origin, it is assumed those missing also came from the same region.

The Missing Migrants Project dataset includes coordinates for where incidents of death took place, which indicates where the risks to migrants may be highest. However, it should be noted that all coordinates are estimates.

Why collect data on missing and dead migrants?

By counting lives lost during migration, even if the result is only an informed estimate, we at least acknowledge the fact of these deaths. What before was vague and ill-defined is now a quantified tragedy that must be addressed. Politically, the availability of official data is important. The lack of political commitment at national and international levels to record and account for migrant deaths reflects and contributes to a lack of concern more broadly for the safety and well-being of migrants, including asylum-seekers. Further, it drives public apathy, ignorance, and the dehumanization of these groups.

Data are crucial to better understand the profiles of those who are most at risk and to tailor policies to better assist migrants and prevent loss of life. Ultimately, improved data should contribute to efforts to better understand the causes, both direct and indirect, of fatalities and their potential links to broader migration control policies and practices.

Counting and recording the dead can also be an initial step to encourage improved systems of identification of those who die. Identifying the dead is a moral imperative that respects and acknowledges those who have died. This process can also provide some sense of closure for families who may otherwise be left without ever knowing the fate of missing loved ones.

Identification and tracing of the dead and missing

As mentioned above, the challenge remains to count the numbers of dead and also identify those counted. Globally, the majority of those who die during migration remain unidentified. Even in cases in which a body is found, identification rates are low. Families may search for years or a lifetime to find conclusive news of their loved one. In the meantime, they may face psychological, practical, financial, and legal problems.

Ultimately, Missing Migrants Project would like to see that every unidentified body that can be recovered is adequately “managed”, analysed and tracked to ensure proper documentation, traceability and dignity. Common forensic protocols and standards should be agreed upon, and used within and between States. Furthermore, data relating to the dead and missing should be held in searchable and open databases at local, national and international levels to facilitate identification.

For more in-depth analysis and discussion of the numbers of missing and dead migrants around the world, and the challenges involved in identification and tracing, read our two reports on the issue, Fatal Journeys: Tracking Lives Lost during Migration (2014) and Fatal Journeys Volume 2, Identification and Tracing of Dead and Missing Migrants

Content

The data set records
Global Missing Children Research, Country-Specific Findings: Lao People’s...
data.opendevelopmentmekong.net
Updated Jun 17, 2018
Cite
(2018). Global Missing Children Research, Country-Specific Findings: Lao People’s Democratic Republic [Dataset]. https://data.opendevelopmentmekong.net/dataset/lao-people-s-democratic-republic
Dataset updated
Jun 17, 2018
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Laos
Description
The document provides a summary of human trafficking in Lao PDR, missing child cases, child protection, and international instruments that ensure protection for children who are victims.
Indicator 11.5.1: Number of deaths and missing persons attributed to...
sdgs.amerigeoss.org
sdg.org
Updated Aug 18, 2020
Cite
UN DESA Statistics Division (2020). Indicator 11.5.1: Number of deaths and missing persons attributed to disasters (number) [Dataset]. https://sdgs.amerigeoss.org/datasets/undesa::indicator-11-5-1-number-of-deaths-and-missing-persons-attributed-to-disasters-number-4/about
Dataset updated
Aug 18, 2020
Dataset provided by
United Nations Department of Economic and Social Affairs (https://www.un.org/en/desa)
Authors
UN DESA Statistics Division
Area covered
Description
Series Name: Number of deaths and missing persons attributed to disasters (number)
Series Code: VC_DSR_MTMN
Release Version: 2020.Q2.G.03

This dataset is part of the Global SDG Indicator Database compiled through the UN System in preparation for the Secretary-General's annual report on Progress towards the Sustainable Development Goals.

Indicator 11.5.1: Number of deaths, missing persons and directly affected persons attributed to disasters per 100,000 population

Target 11.5: By 2030, significantly reduce the number of deaths and the number of people affected and substantially decrease the direct economic losses relative to global gross domestic product caused by disasters, including water-related disasters, with a focus on protecting the poor and people in vulnerable situations

Goal 11: Make cities and human settlements inclusive, safe, resilient and sustainable

For more information on the compilation methodology of this dataset, see https://unstats.un.org/sdgs/metadata/
World Religion Project - Global Religion Dataset
thearda.com
+ more versions
Cite
The Association of Religion Data Archives, World Religion Project - Global Religion Dataset [Dataset]. http://doi.org/10.17605/OSF.IO/J7BCM
Unique identifier
https://doi.org/10.17605/OSF.IO/J7BCM
Dataset provided by
Association of Religion Data Archives
Dataset funded by
The University of California, Davis
The John Templeton Foundation
Description
The World Religion Project (WRP) aims to provide detailed information about religious adherence worldwide since 1945. It contains data about the number of adherents by religion in each of the states in the international system. These numbers are given for every half-decade period (1945, 1950, etc., through 2010). Percentages of the states' populations that practice a given religion are also provided. (Note: These percentages are expressed as decimals, ranging from 0 to 1, where 0 indicates that 0 percent of the population practices a given religion and 1 indicates that 100 percent of the population practices that religion.) Some of the religions (as detailed below) are divided into religious families. To the extent data are available, the breakdown of adherents within a given religion into religious families is also provided.
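Because the shares are stored as decimals rather than percentages, a conversion step is usually needed before reporting them. The following is a minimal sketch of that conversion; the function name `fraction_to_percent` is hypothetical, and the sample value is illustrative rather than taken from the dataset.

```python
def fraction_to_percent(value: float) -> float:
    """Convert a 0-to-1 share, as described above, into a 0-to-100 percentage."""
    if not 0.0 <= value <= 1.0:
        # Anything outside [0, 1] does not match the documented encoding.
        raise ValueError(f"expected a share between 0 and 1, got {value}")
    return value * 100.0


# Illustrative value only: a share of 0.25 means 25 percent of the population.
print(fraction_to_percent(0.25))
```

Rejecting values outside the 0-to-1 range is a design choice here: it surfaces rows that were already converted to percentages, or mis-entered, instead of silently scaling them.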

The project was developed in three stages. The first stage consisted of the formation of a religion tree. A religion tree is a systematic classification of major religions and of religious families within those major religions. To develop the religion tree we prepared a comprehensive literature review, the aim of which was (i) to define a religion, (ii) to find tangible indicators of a given religion or of religious families within a major religion, and (iii) to identify existing efforts at classifying world religions. (Please see the original survey instrument to view the structure of the religion tree.) The second stage consisted of the identification of major data sources of religious adherence and the collection of data from these sources according to the religion tree classification. This created a dataset that included multiple records for some states for a given point in time. It also contained multiple missing data for specific states, specific time periods and specific religions. The third stage consisted of cleaning the data, reconciling discrepancies of information from different sources and imputing data for the missing cases.

The Global Religion Dataset: This dataset uses a religion-by-five-year unit. It aggregates the number of adherents of a given religion and religious group globally by five-year periods.
Mental Health in Tech Survey
kaggle.com
Updated Jan 20, 2023
Cite
The Devastator (2023). Mental Health in Tech Survey [Dataset]. https://www.kaggle.com/datasets/thedevastator/mental-health-in-tech-survey
Dataset updated
Jan 20, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
The Devastator
Description
Mental Health in Tech Survey

Understanding Employee Mental Health Needs in the Tech Industry

By Stephen Myers [source]

About this dataset

This dataset contains survey responses from individuals in the tech industry about their mental health, including questions about treatment, workplace resources, and attitudes towards discussing mental health in the workplace. Mental health is an issue that affects people of all ages, genders and walks of life. The tech industry, which places hard demands on those who work in it, is no exception. By analyzing this dataset, we can better understand how prevalent mental health issues are among those who work in the tech sector, and what kinds of resources they rely upon to find help, so that more can be done to create a healthier working environment for all.

This dataset tracks key measures such as age, gender and country to determine overall prevalence, along with responses surrounding employee access to care options; whether mental health and physical illness are taken equally seriously by employers; whether anonymity is protected when seeking help; and how coworkers may perceive those struggling with mental illness issues such as depression or anxiety. With technology advancing faster than ever before, these statistics have never been more important to analyze if we hope to remain true promoters of a healthy world inside and outside our office walls.

How to use the dataset

In this dataset you will find data on age, gender, country, and state of survey respondents, in addition to numerous questions that assess an individual's mental state, including: self-employment status; family history of mental illness; treatment status and access or lack thereof; how their mental health condition affects their work; number of employees at the company they work for; remote work status; tech company status; benefit information from employers such as mental health benefits and wellness program availability; anonymity protection if seeking treatment resources for substance abuse or mental health issues; ease (or difficulty) of taking medical leave for a mental health condition; and whether discussing physical or medical matters with employers has negative consequences. You will also find comments from survey participants.

To use this dataset effectively:
- Clean the data by removing invalid responses, duplicates, and missing values. You can do this with basic Pandas commands like .dropna(), .drop_duplicates(), and .replace().
- Use descriptive statistics such as mean and median to draw general conclusions about patterns of responses. You can do this with Pandas tools such as .groupby() and .describe().
- Run various types of analyses, such as mean comparisons across different kinds of variables (age vs gender) or correlations between different features, using appropriate statistical methods. For example, fit statsmodels OLS models (smf.ols), calculate z-scores, or run hypothesis tests, depending on what analysis is needed. Make sure you are aware of any underlying assumptions your analysis requires beforehand.
- Visualize your results with plotting libraries like Matplotlib or Seaborn to make the findings easy to interpret. Use boxplots, histograms, or heatmaps where appropriate, depending on your question.
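The first two steps above might look roughly like the following. This is a minimal sketch, not the dataset author's workflow: the column names (`Age`, `Gender`, `treatment`) and the sample rows are assumptions made for illustration, so check them against the actual CSV before reusing the code.

```python
import pandas as pd

# Hypothetical sample rows; column names and values are assumptions,
# not the real survey schema.
raw = pd.DataFrame(
    {
        "Age": [29, 29, 41, None, 35],
        "Gender": ["M", "M", "female", "Female", "male"],
        "treatment": ["Yes", "Yes", "No", "Yes", "No"],
    }
)


def clean_survey(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing age, drop exact duplicates, normalize gender labels."""
    out = df.dropna(subset=["Age"])   # remove rows with a missing age
    out = out.drop_duplicates()       # remove verbatim duplicate responses
    gender = (
        out["Gender"]
        .str.strip()
        .str.lower()
        .replace({"m": "male", "f": "female"})  # map short codes to full labels
    )
    return out.assign(Gender=gender).reset_index(drop=True)


cleaned = clean_survey(raw)

# Descriptive statistics of age per gender group.
summary = cleaned.groupby("Gender")["Age"].describe()
print(summary)
```

Free-text gender fields in real survey exports tend to need a much longer mapping than the two entries shown here, so the dictionary passed to `.replace()` is only a starting point.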

Research Ideas

Using the results of this survey, you could develop targeted outreach campaigns directed at underrepresented groups that answer “No” to questions about their employers providing resources for mental health or discussing it as part of wellness programs.

Analyzing the employee characteristics (e.g., age and gender) of those who reported negative consequences from discussing their mental health in the workplace could inform employer policies to support individuals with mental health conditions and reduce stigma and discrimination in the workplace.

Correlating responses to questions about remote work, leave policies, and anonymity with whether or not individuals have sought treatment for a mental health condition may provide insight into which types of workplace resources are most beneficial for supporting employees dealing with these issues

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: Dataset copyright by authors - You are free to: - Share - copy and redi...
UCI Communities and Crime Unnormalized Data Set
kaggle.com
Updated Feb 21, 2018
Cite
Kavitha (2018). UCI Communities and Crime Unnormalized Data Set [Dataset]. https://www.kaggle.com/kkanda/communities%20and%20crime%20unnormalized%20data%20set/notebooks
Dataset updated
Feb 21, 2018
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Kavitha
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Context

Introduction: The dataset used for this experiment is real and authentic. It was acquired from the UCI Machine Learning Repository website [13]. The title of the dataset is ‘Crime and Communities’. It combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR [13]. The dataset contains a total of 147 attributes and 2216 instances.

The per capita crimes variables were calculated using population values included in the 1995 FBI data (which differ from the 1990 Census values).

Content

The variables included in the dataset involve the community, such as the percent of the population considered urban and the median family income, and law enforcement, such as the per capita number of police officers and the percent of officers assigned to drug units. The crime attributes (N=18) that could be predicted are the 8 crimes considered 'Index Crimes' by the FBI (Murders, Rape, Robbery, ....), per capita (actually per 100,000 population) versions of each, and Per Capita Violent Crimes and Per Capita Nonviolent Crimes.

predictive variables: 125
non-predictive variables: 4
potential goal/response variables: 18
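The "per capita (actually per 100,000 population)" convention and the attribute breakdown can be checked with a few lines. This is a minimal sketch under stated assumptions: the function name `per_100k` and the sample counts are hypothetical, while the three attribute counts are the ones listed in the description.

```python
def per_100k(count: int, population: int) -> float:
    """Express a raw crime count as a rate per 100,000 population."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply first so simple cases stay exact in floating point.
    return count * 100_000 / population


# Attribute breakdown as listed in the description.
attribute_counts = {
    "predictive": 125,
    "non_predictive": 4,
    "goal_or_response": 18,
}
total_attributes = sum(attribute_counts.values())

# Illustrative numbers only: 50 incidents in a community of 200,000 people.
example_rate = per_100k(50, 200_000)

print(total_attributes)  # matches the 147 attributes stated in the description
print(example_rate)
```

Note that, per the description, the dataset's own rates use the 1995 FBI population values rather than the 1990 Census values, so recomputed rates will only match if the same population column is used.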

Acknowledgements

http://archive.ics.uci.edu/ml/datasets/Communities%20and%20Crime%20Unnormalized

U. S. Department of Commerce, Bureau of the Census, Census Of Population And Housing 1990 United States: Summary Tape File 1a & 3a (Computer Files),

U.S. Department Of Commerce, Bureau Of The Census Producer, Washington, DC and Inter-university Consortium for Political and Social Research Ann Arbor, Michigan. (1992)

U.S. Department of Justice, Bureau of Justice Statistics, Law Enforcement Management And Administrative Statistics (Computer File) U.S. Department Of Commerce, Bureau Of The Census Producer, Washington, DC and Inter-university Consortium for Political and Social Research Ann Arbor, Michigan. (1992)

U.S. Department of Justice, Federal Bureau of Investigation, Crime in the United States (Computer File) (1995)

Inspiration

Data available in the dataset may not act as a complete source of information for identifying factors that contribute to more violent and non-violent crimes as many relevant factors may still be missing.

However, I would like to try to answer the following questions.

Did the number of vacant and occupied houses, and the length of time houses were vacant, contribute to any significant change in violent and non-violent crime rates in communities?

How has unemployment changed the crime rate (violent and non-violent) in the communities?

Were people from a particular age group more vulnerable to crime?

Does ethnicity play a role in crime rate?

Has education played a role in bringing down the crime rate?
‘MISSING MIGRANTS (2014-2021)’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 28, 2022
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘MISSING MIGRANTS (2014-2021)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-missing-migrants-2014-2021-19da/1a9479e3/?iid=039-565&v=presentation
Dataset updated
Jan 28, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘MISSING MIGRANTS (2014-2021)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/methoomirza/missing-migrants-20142021 on 28 January 2022.

--- Dataset description provided by original source is as follows ---

Context

Missing Migrants Project tracks deaths of migrants, including refugees and asylum-seekers, who have died or gone missing in the process of migration towards an international destination. Please note that these data represent minimum estimates, as many deaths during migration go unrecorded.

What is included in Missing Migrants Project data?

Missing Migrants Project counts migrants who have died at the external borders of states, or in the process of migration towards an international destination, regardless of their legal status. The Project records only those migrants who die during their journey to a country different from their country of residence. Missing Migrants Project data include the deaths of migrants who die in transportation accidents, shipwrecks, violent attacks, or due to medical complications during their journeys. It also includes the number of corpses found at border crossings that are categorized as the bodies of migrants, on the basis of belongings and/or the characteristics of the death. For instance, a death of an unidentified person might be included if the decedent is found without any identifying documentation in an area known to be on a migration route. Deaths during migration may also be identified based on the cause of death, especially if it is related to trafficking, smuggling, or means of travel such as on top of a train, in the back of a cargo truck, as a stowaway on a plane, in unseaworthy boats, or crossing a border fence. While the location and cause of death can provide strong evidence that an unidentified decedent should be included in Missing Migrants Project data, this should always be evaluated in conjunction with migration history and trends.

What is excluded?

The count excludes deaths that occur in immigration detention facilities or after deportation to a migrant’s homeland, as well as deaths more loosely connected with migrants’ irregular status, such as those resulting from labour exploitation. Migrants who die or go missing after they are established in a new home are also not included in the data, so deaths in refugee camps or housing are excluded. The deaths of internally displaced persons who die within their country of origin are also excluded. There remains a significant gap in knowledge and data on such deaths. Data and knowledge of the risks and vulnerabilities faced by migrants in destination countries, including death, should not be neglected, but rather tracked as a distinct category.

What sources of information are used in the Missing Migrants Project database?

The Missing Migrants Project currently gathers information from diverse sources such as official records – including from coast guards and medical examiners – and other sources such as media reports, NGOs, and surveys and interviews of migrants. In the Mediterranean region, data are relayed from relevant national authorities to IOM field missions, who then share it with the Missing Migrants Project team. Data are also obtained by IOM and other organizations that receive survivors at landing points in Italy and Greece. IOM and UNHCR also regularly coordinate to validate data on missing migrants in the Mediterranean. Data on the United States/Mexico border are compiled based on data from U.S. county medical examiners, coroners, and sheriff’s offices, as well as media reports for deaths occurring on the Mexican side of the border. In Africa, data are obtained from media and NGOs, including the Regional Mixed Migration Secretariat and the International Red Cross/Red Crescent. The quality of the data source(s) for each incident is assessed through the ‘Source quality’ variable, which can be viewed in the data. Across the world, the Missing Migrants Project uses social and traditional media reports to find data, which are then verified by local IOM staff whenever possible. In all cases, new entries are checked against existing records to ensure that no deaths are double-counted. In all regions, Missing Migrants Project data represent a minimum estimate of the number of migrant deaths. To learn more about data sources, visit the thematic page on migrant deaths and disappearances in the Global Migration Data Portal.

Content

What are the variables used in the Missing Migrants Project database?

This section presents the list of variables that constitute the Missing Migrants Project database. While ideally, all incidents recorded would include entries for each of these variables, the challenges described above mean that this is not always possible. The minimum information necessary to register an incident is the date of the incident, the number of dead and/or the number of missing, and the location of death. If the information is unavailable, the cell is left blank or “unknown” is recorded, as indicated below.

1. Web ID - An automatically generated number used to identify each unique entry in the dataset.

2. Region - Region in which an incident took place. For more about regional classifications used in the dataset, click here.

3. Incident Date - Estimated date of death. In cases where the exact date of death is not known, this variable indicates the date on which the body or bodies were found. In cases where data are drawn from surviving migrants, witnesses or other interviews, this variable is entered as the date of the death as reported by the interviewee. At a minimum, the month and the year of death are recorded. In some cases, official statistics are not disaggregated by the incident, meaning that data are reported as a total number of deaths occurring during a certain time period. In such cases the entry is marked as a “cumulative total,” and the latest date of the range is recorded, with the full dates recorded in the comments.

4. Year - The year in which the incident occurred.

5. Reported month - The month in which the incident occurred.

6. Number dead - The total number of people confirmed dead in one incident, i.e. the number of bodies recovered. If migrants are missing and presumed dead, such as in cases of shipwrecks, this field is left blank.

7. Number missing - The total number of those who are missing and are thus assumed to be dead. This variable is generally recorded in incidents involving shipwrecks. The number of missing is calculated by subtracting the number of bodies recovered from a shipwreck and the number of survivors from the total number of migrants reported to have been on the boat. This number may be reported by surviving migrants or witnesses. If no missing persons are reported, it is left blank.

8. Total dead & missing - The sum of the ‘number dead’ and ‘number missing’ variables.

9. Number of survivors - The number of migrants that survived the incident, if known. The age, gender, and country of origin of survivors are recorded in the ‘Comments’ variable if known. If unknown, it is left blank.

10. Number of females - Indicates the number of females found dead or missing. If unknown, it is left blank. This gender identification is based on a third-party interpretation of the victim's gender from information available in official documents, autopsy reports, witness testimonies, and/or media reports.

11. Number of males - Indicates the number of males found dead or missing. If unknown, it is left blank. This gender identification is based on a third-party interpretation of the victim's gender from information available in official documents, autopsy reports, witness testimonies, and/or media reports.

12. Number of children - Indicates the number of individuals under the age of 18 found dead or missing. If unknown, it is left blank.

13. Age - The age of the decedent(s). Occasionally, an estimated age range is recorded. If unknown, it is left blank.

14. Country of origin - Country of birth of the decedent. If unknown, the entry will be marked “unknown”.

15. Region of origin - Region of origin of the decedent(s). In some incidents, region of origin may be marked as “Presumed” or “(P)” if migrants travelling through that location are known to hail from a certain region. If unknown, the entry will be marked “unknown”.

16. Cause of death - The determination of conditions resulting in the migrant's death i.e. the circumstances of the event that produced the fatal injury. If unknown, the reason why is included where possible. For example, “Unknown – skeletal remains only”, is used in cases in which only the skeleton of the decedent was found.

17. Location description - Place where the death(s) occurred or where the body or bodies were found. Nearby towns or cities or borders are included where possible. When incidents are reported in an unspecified location, this will be noted.

18. Location coordinates - Place where the death(s) occurred or where the body or bodies were found. In many regions, most notably the Mediterranean, geographic coordinates are estimated as precise locations are not often known. The location description should always be checked against the location coordinates.

19. Migration route - Name of the migrant route on which incident occurred, if known. If unknown, it is left blank.

20. UNSD geographical grouping - Geographical region in which the incident took place, as designated by the United Nations Statistics Division (UNSD) geoscheme. For more about regional classifications used in the dataset, click here.

21. Information source - Name of source of information for each incident. Multiple sources may be listed.

22. Link - Links to original reports of migrant deaths /
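
The arithmetic behind variables 6 to 8 above can be sketched as follows. This is a minimal illustration, not the Project's own code: the function and parameter names are hypothetical, the figures are invented, and blank cells are assumed to be represented as None.

```python
from typing import Optional


def number_missing(on_board: int, bodies_recovered: int, survivors: int) -> int:
    """Missing = people reported on board minus bodies recovered minus survivors."""
    return on_board - bodies_recovered - survivors


def total_dead_and_missing(dead: Optional[int], missing: Optional[int]) -> int:
    """Sum of 'Number dead' and 'Number missing'; a blank (None) cell counts as 0."""
    return (dead or 0) + (missing or 0)


# Hypothetical shipwreck: 120 people reported on board, 15 bodies recovered, 80 survivors.
missing = number_missing(120, 15, 80)
total = total_dead_and_missing(15, missing)
print(missing, total)
```

Treating a blank cell as zero in the sum is an assumption made here for convenience; the source only states that unknown values are left blank.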
IMAGIC-500: A Benchmark Dataset for Missing Data Imputation in Hierarchical...
dataverse.harvard.edu
Updated May 10, 2025
Cite
Siyi Sun (2025). IMAGIC-500: A Benchmark Dataset for Missing Data Imputation in Hierarchical Socio-Economic Surveys [Dataset]. http://doi.org/10.7910/DVN/7GMPBH
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/7GMPBH
Dataset updated
May 10, 2025
Dataset provided by
Harvard Dataverse
Authors
Siyi Sun
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
IMAGIC-500 is a large-scale, fully synthetic benchmark dataset designed to evaluate missing data imputation methods in hierarchical, real-world-like socio-economic survey data. It is derived from the World Bank’s Synthetic Data for an Imaginary Country (SDIC, 2023), an openly available synthetic census-like dataset simulating a fictional middle-income country. This dataset combines the individual-level and household-level components of SDIC by joining them on household ID, preserving the nested structure of real survey data (individual → household → district → province). From this joined population, we sample 500,000 individuals across approximately 136,476 households, ensuring broad geographic and demographic diversity. For downstream tasks, we select 19 mixed-type variables from the SDIC attributes, covering both household-level and individual-level variables. Specifically, household-level features include geographic and socioeconomic variables, while individual-level features include demographics and socioeconomics. Moreover, we also select the individual’s highest educational attainment ("cat_educ_attain") as a target variable for downstream tasks.
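
The join on household ID described above can be illustrated with a small sketch. The records and field names below are hypothetical and are not SDIC column names; the point is only that each individual row inherits its household's fields, which preserves the nesting.

```python
# Hypothetical toy records; the field names are assumptions, not SDIC column names.
households = {
    "H1": {"province": "P1", "district": "D1"},
    "H2": {"province": "P1", "district": "D2"},
}
individuals = [
    {"person_id": 1, "household_id": "H1", "age": 34},
    {"person_id": 2, "household_id": "H1", "age": 7},
    {"person_id": 3, "household_id": "H2", "age": 51},
]

# Attach each individual's household-level fields by household ID, so every row
# keeps its place in the individual -> household -> district -> province hierarchy.
joined = [{**person, **households[person["household_id"]]} for person in individuals]
print(joined[0])
```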
Deaths Involving COVID-19 by Vaccination Status
ouvert.canada.ca
datasets.ai
csv, docx, html, xlsx
Updated Jun 25, 2025
Cite
Government of Ontario (2025). Deaths Involving COVID-19 by Vaccination Status [Dataset]. https://ouvert.canada.ca/data/dataset/1375bb00-6454-4d3e-a723-4ae9e849d655
Explore at:
xlsx, html, docx, csv (available download formats)
Dataset updated
Jun 25, 2025
Dataset provided by
Government of Ontario (https://www.ontario.ca/)
License
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Time period covered
Mar 1, 2021 - Nov 12, 2024
Description
This dataset reports the daily 7-day moving average rates of deaths involving COVID-19 by vaccination status and by age group. Learn how the Government of Ontario is helping to keep Ontarians safe during the 2019 Novel Coronavirus outbreak. Effective November 14, 2024 this page will no longer be updated. Information about COVID-19 and other respiratory viruses is available on Public Health Ontario’s interactive respiratory virus tool: https://www.publichealthontario.ca/en/Data-and-Analysis/Infectious-Disease/Respiratory-Virus-Tool

Data includes:
* Date on which the death occurred
* Age group
* 7-day moving average of the death rate per 100,000 for those not fully vaccinated
* 7-day moving average of the death rate per 100,000 for those fully vaccinated
* 7-day moving average of the death rate per 100,000 for those vaccinated with at least one booster

Additional notes

As of June 16, all COVID-19 datasets will be updated weekly on Thursdays by 2pm. As of January 12, 2024, data from the date of January 1, 2024 onwards reflect updated population estimates. This update specifically impacts data for the 'not fully vaccinated' category. On November 30, 2023 the count of COVID-19 deaths was updated to include missing historical deaths from January 15, 2020 to March 31, 2023.

CCM is a dynamic disease reporting system which allows ongoing update to data previously entered. As a result, data extracted from CCM represents a snapshot at the time of extraction and may differ from previous or subsequent results. Public Health Units continually clean up COVID-19 data, correcting for missing or overcounted cases and deaths. These corrections can result in data spikes and current totals being different from previously reported cases and deaths. Observed trends over time should be interpreted with caution for the most recent period due to reporting and/or data entry lags.

The data does not include vaccination data for people who did not provide consent for vaccination records to be entered into the provincial COVaxON system. This includes individual records as well as records from some Indigenous communities where those communities have not consented to including vaccination information in COVaxON. The “Not fully vaccinated” category includes people with no vaccine and people with one dose of a double-dose vaccine. The “People with one dose of double-dose vaccine” category has a small and constantly changing number. The combination will stabilize the results.

Spikes, negative numbers and other data anomalies: Due to ongoing data entry and data quality assurance activities in the Case and Contact Management system (CCM) file, Public Health Units continually clean up COVID-19 data, correcting for missing or overcounted cases and deaths. These corrections can result in data spikes, negative numbers and current totals being different from previously reported case and death counts. Public Health Units report cause of death in the CCM based on information available to them at the time of reporting and in accordance with definitions provided by Public Health Ontario. The medical certificate of death is the official record and the cause of death could be different. Deaths are defined per the outcome field in CCM marked as “Fatal”. Deaths in COVID-19 cases identified as unrelated to COVID-19 are not included in the Deaths involving COVID-19 reported. Rates for the most recent days are subject to reporting lags. All data reflects totals from 8 p.m. the previous day. This dataset is subject to change.
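
As a rough illustration of the measure reported here, the sketch below computes a death rate per 100,000 and a trailing 7-day moving average. It is not the Government of Ontario's method: the function names, the counts, the population figure, and the handling of the first six days (averaging over however many days are available) are all assumptions.

```python
from typing import List


def rate_per_100k(deaths: int, population: int) -> float:
    """Deaths expressed per 100,000 people."""
    return deaths / population * 100_000


def trailing_average(values: List[float], window: int = 7) -> List[float]:
    """Mean of each value and up to (window - 1) values before it."""
    averages = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Hypothetical daily death counts for one age group and vaccination status.
daily_deaths = [2, 0, 1, 3, 1, 0, 0, 7]
population = 1_000_000  # assumed population of that group
daily_rates = [rate_per_100k(d, population) for d in daily_deaths]
smoothed = trailing_average(daily_rates)
print([round(x, 3) for x in smoothed])
```

Smoothing over seven days dampens the day-to-day spikes that the notes above attribute to data corrections, at the cost of lagging behind sudden changes.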
World Bank Subnational Poverty Data
kaggle.com
Updated Feb 28, 2018
Cite
Brooke Watson (2018). World Bank Subnational Poverty Data [Dataset]. https://www.kaggle.com/brookewatson/worldbank-subnational-poverty/activity
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 28, 2018
Dataset provided by
Kaggle
Authors
Brooke Watson
Description
Context

This dataset was uploaded to support the Data Science For Good Kiva crowdfunding challenge. In particular, in uploading this dataset, I intend to assist with mapping subnational locations in the Kiva dataset to more accurate geocodes.

Content

This dataset contains poverty data at the administrative unit level 1, based on national poverty line(s). Administrative unit level 1 refers to the highest subnational unit level (examples include ‘state’, ‘governorate’, ‘province’). This dataset also provides data and methodology for distinguishing between poverty rates in urban and rural regions.

This dataset includes one main .csv file: Subnational-PovertyData.csv, which includes a set of poverty indicators at the national and subnational level between the years 1996-2013. Many countries are missing data for multiple years, and no country has data for the years 1997-1999.

It also includes three metadata .csv files:

1. Subnational-PovertyCountry.csv, which describes the country codes and subregions.

2. Subnational-PovertySeries.csv, which describes the three series indicators for national, urban, and rural poverty headcount ratios. This metadata file also includes limitations, statistical methodologies, and development relevance for these metrics.

3. Subnational-Povertyfootnote.csv, which describes the years and sources for all of the country-series combinations.

Acknowledgements

This dataset is provided openly by the World Bank. Individual sources for the different data series are available in Subnational-Povertyfootnote.csv.

This dataset is classified as Public under the Access to Information Classification Policy. Users inside and outside the World Bank can access this dataset. It is licensed under CC-BY 4.0.

Metadata

Type: Time Series
Topics: Economic Growth Poverty Economy
Coverage: IBRD
Languages Supported: English
Number of Economies: 60
Geographical Coverage: World
Access Options: Download, Query Tool
Temporal Coverage: 1996 - 2013
Last Updated: April 27, 2015
Film Circulation dataset
zenodo.org
data.niaid.nih.gov
bin, csv, png
Updated Jul 12, 2024
Cite
Skadi Loist; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
Explore at:
csv, png, bin (available download formats)
Unique identifier
https://doi.org/10.5281/zenodo.7887672
Dataset updated
Jul 12, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Skadi Loist; Evgenia (Zhenya) Samoilova
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

Please cite this when using the dataset.

Detailed description of the dataset:

1 Film Dataset: Festival Programs

The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
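
The relationship between the long and wide tables can be sketched as below. This is an illustration only, not the authors' procedure: the column names and dates are hypothetical rather than taken from the codebook, and the sketch assumes a screening date is available to decide which sample festival came first.

```python
# Hypothetical long-format rows: one row per film per sample festival.
# Column names and dates are illustrative, not taken from the codebook.
long_rows = [
    {"film_id": 1, "fest": "Frameline", "date": "2019-06-20"},
    {"film_id": 1, "fest": "Berlinale", "date": "2019-02-08"},
    {"film_id": 2, "fest": "Frameline", "date": "2019-06-21"},
]

# Wide format: keep one row per film, taken from its earliest festival appearance.
wide = {}
for row in sorted(long_rows, key=lambda r: r["date"]):  # ISO dates sort chronologically
    wide.setdefault(row["film_id"], row)

print(wide[1]["fest"])
```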

2 Survey Dataset

The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

3 IMDb & Scripts

The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

The R script “r_1_unite_data” demonstrates the structure of the dataset, which we use in the following steps to identify, scrape, and match the film data.

The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
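
The two title-matching measures named above can be sketched in a few lines. The authors' scripts are written in R and are not reproduced here; this Python version is a standalone illustration, and the choice of character bigrams for the cosine measure is an assumption rather than the setting used in the original scripts.

```python
from collections import Counter
from math import sqrt


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: edits plus adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Allow swapping two adjacent characters as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def cosine_similarity(a: str, b: str, q: int = 2) -> float:
    """Cosine similarity between character q-gram count vectors."""
    ga = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    gb = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    dot = sum(ga[g] * gb[g] for g in ga)
    norm = sqrt(sum(v * v for v in ga.values())) * sqrt(sum(v * v for v in gb.values()))
    return dot / norm if norm else 0.0


# A transposition typo costs one OSA edit; identical titles have cosine similarity 1.
print(osa_distance("amelie", "ameile"), cosine_similarity("parasite", "parasite"))
```

A low OSA distance flags titles that differ by a typo, while a high cosine score flags titles that share most of their character sequences even when word order or length differs.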

The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

The script “r_5a_extracting_info_sample” uses the function defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does that for the first 100 films, to check whether everything works. Scraping for the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.

4 Festival Library Dataset

The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,
Global hotspots of climate-related disasters (dataset related to the paper...
dataverse.harvard.edu
Updated May 17, 2024
Cite
Camila Donatti (2024). Global hotspots of climate-related disasters (dataset related to the paper Donatti etal.2024) [Dataset]. http://doi.org/10.7910/DVN/TFBAOH
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/TFBAOH
Dataset updated
May 17, 2024
Dataset provided by
Harvard Dataverse
Authors
Camila Donatti
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Time period covered
Jan 1, 2000 - Dec 31, 2020
Description
This dataset "Global hotspots of climate related disasters" shows the number of people impacted by climate-related disasters recorded in the EM-DAT database between 2000 and 2020. This dataset was used to prepare the maps and the analysis of the paper Donatti C.I., Nicholas K., Fedele G., Delforge D., Speybroeck N., Moraga P., Blatter J., Below R., Zvoleff A. 2024. Global hotspots of climate-related disasters. International Journal of Disaster Risk Reduction. https://doi.org/10.1016/j.ijdrr.2024.104488. This dataset includes information on people impacted by drought, tropical cyclones, flash flood, riverine flood, forest fire, land fire, heat wave, landslide and mudslide. Data on coastal flood were not included because the database only had recordings until 2013. Data on disaster sub-types “landslides” and “mudslides” as presented in the EM-DAT were further combined as one single climate-related disaster (“land and mudslides”) for the analyses. Likewise, data on disaster sub-types “forest fire” and “land fire” were further combined as one climate-related disaster (“wildfire”). The data were accessed directly from the EM-DAT database and then summarized as shown in the dataset. We used this database, downloaded on June 2nd 2021, to access data on “total affected” people and the “total deaths” per disaster event impacting a country (i.e., an entry in the EM-DAT), which were combined in this study to create the variable “total people impacted”. In the EM-DAT database, “total affected” represents the sum of people “injured,” “affected,” and “homeless” resulting from a particular event. 
“Injured” were considered those that have suffered from physical injuries, trauma, or an illness requiring immediate medical assistance, including people hospitalized, as a direct result of a disaster, “affected” were considered people requiring immediate assistance during an emergency and “homeless” were considered those whose homes were destroyed or heavily damaged and therefore needed shelter after an event. “Total deaths” include people that have died or were considered missing, those whose whereabouts since the disaster were unknown and presumed dead based on official figures. More details can be found under “documentation, data structure and content description” at emdat.be. In the dataset, "ADM-CODE" refers to the code used to identify each administrative area, which refers to the code of FAO's Global Administrative Unit Layer, GAUL.
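
The way these quantities combine can be written out as a short sketch. The function names and the figures are hypothetical and serve only to illustrate the sums described above; they are not drawn from EM-DAT.

```python
def total_affected(injured: int, affected: int, homeless: int) -> int:
    """EM-DAT 'total affected': injured + affected + homeless for one event."""
    return injured + affected + homeless


def total_people_impacted(total_affected_count: int, total_deaths: int) -> int:
    """Study variable 'total people impacted': total affected + total deaths."""
    return total_affected_count + total_deaths


# Hypothetical disaster event (invented numbers).
affected_sum = total_affected(injured=120, affected=4500, homeless=300)
impacted = total_people_impacted(affected_sum, total_deaths=18)
print(affected_sum, impacted)
```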
Public Health Official Departures
data.world
csv, zip
Updated Jun 7, 2022
Cite
The Associated Press (2022). Public Health Official Departures [Dataset]. https://data.world/associatedpress/public-health-official-departures
Explore at:
csv, zip (available download formats)
Dataset updated
Jun 7, 2022
Authors
The Associated Press
Description
Changelog:

Update September 20, 2021: Data and overview updated to reflect data used in the September 15 story Over Half of States Have Rolled Back Public Health Powers in Pandemic. It includes 303 state or local public health leaders who resigned, retired or were fired between April 1, 2020 and Sept. 12, 2021. Previous versions of this dataset reflected data used in the Dec. 2020 and April 2021 stories.

Overview

Across the U.S., state and local public health officials have found themselves at the center of a political storm as they combat the worst pandemic in a century. Amid a fractured federal response, the usually invisible army of workers charged with preventing the spread of infectious disease has become a public punching bag.

In the midst of the coronavirus pandemic, at least 303 state or local public health leaders in 41 states have resigned, retired or been fired since April 1, 2020, according to an ongoing investigation by The Associated Press and KHN.

According to experts, that is the largest exodus of public health leaders in American history.

Many left due to political blowback or pandemic pressure, as they became the target of groups that have coalesced around a common goal — fighting and even threatening officials over mask orders and well-established public health activities like quarantines and contact tracing. Some left to take higher profile positions, or due to health concerns. Others were fired for poor performance. Dozens retired. An untold number of lower level staffers have also left.

The result is a further erosion of the nation’s already fragile public health infrastructure, which KHN and the AP documented beginning in 2020 in the Underfunded and Under Threat project.

Findings

The AP and KHN found that:

One in five Americans live in a community that has lost its local public health department leader during the pandemic

Top public health officials in 28 states have left state-level departments

Using this data

To filter for data specific to your state, use this query

To get total numbers of exits by state, broken down by state and local departments, use this query

Methodology

KHN and AP counted how many state and local public health leaders have left their jobs between April 1, 2020 and Sept. 12, 2021.

The government tasks public health workers with improving the health of the general population, through their work to encourage healthy living and prevent infectious disease. To that end, public health officials do everything from inspecting water and food safety to testing the nation’s babies for metabolic diseases and contact tracing cases of syphilis.

Many parts of the country have a health officer and a health director/administrator by statute. The analysis counted both of those positions if they existed. For state-level departments, the count tracks people in the top and second-highest-ranking job.

The analysis includes exits of top department officials regardless of reason, because no matter the reason, each left a vacancy at the top of a health agency during the pandemic. Reasons for departures include political pressure, health concerns and poor performance. Others left to take higher profile positions or to retire. Some departments had multiple top officials exit over the course of the pandemic; each is included in the analysis.

Reporters compiled the exit list by reaching out to public health associations and experts in every state and interviewing hundreds of public health employees. They also received information from the National Association of County and City Health Officials, and combed news reports and records.

Public health departments can be found at multiple levels of government. Each state has a department that handles these tasks, but most states also have local departments that operate under either local or state control. The population served by each local health department is calculated using the U.S. Census Bureau 2019 Population Estimates based on each department’s jurisdiction.

KHN and the AP have worked since the spring on a series of stories documenting the funding, staffing and problems around public health. A previous data distribution detailed a decade's worth of cuts to state and local spending and staffing on public health. That data can be found here.

Attribution

Findings and the data should be cited as: "According to a KHN and Associated Press report."

Is Data Missing?

If you know of a public health official in your state or area who has left that position between April 1, 2020 and Sept. 12, 2021 and isn't currently in our dataset, please contact authors Anna Maria Barry-Jester annab@kff.org, Hannah Recht hrecht@kff.org, Michelle Smith mrsmith@ap.org and Lauren Weber laurenw@kff.org.
Employment Of India CLeaned and Messy Data
kaggle.com
Updated Apr 7, 2025
Cite
SONIA SHINDE (2025). Employment Of India CLeaned and Messy Data [Dataset]. https://www.kaggle.com/datasets/soniaaaaaaaa/employment-of-india-cleaned-and-messy-data
Explore at:
Dataset updated
Apr 7, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
SONIA SHINDE
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Area covered
India
Description
This dataset presents a dual-version representation of employment-related data from India, crafted to highlight the importance of data cleaning and transformation in any real-world data science or analytics project.

🔹 Dataset Composition:

It includes two parallel datasets:
1. Messy Dataset (Raw) – Represents a typical unprocessed dataset often encountered in data collection from surveys, databases, or manual entries.
2. Cleaned Dataset – This version demonstrates how proper data preprocessing can significantly enhance the quality and usability of data for analytical and visualization purposes.

Each record captures multiple attributes related to individuals in the Indian job market, including:
- Age Group
- Employment Status (Employed/Unemployed)
- Monthly Salary (INR)
- Education Level
- Industry Sector
- Years of Experience
- Location
- Perceived AI Risk
- Date of Data Recording

Transformations & Cleaning Applied:

The raw dataset underwent comprehensive transformations to convert it into its clean, analysis-ready form:
- Missing Values: Identified and handled using either row elimination (where critical data was missing) or imputation techniques.
- Duplicate Records: Identified using row comparison and removed to prevent analytical skew.
- Inconsistent Formatting: Unified inconsistent naming in columns (like 'monthly_salary_(inr)' → 'Monthly Salary (INR)'), capitalization, and string spacing.
- Incorrect Data Types: Converted columns like salary from string/object to float for numerical analysis.
- Outliers: Detected and handled based on domain logic and distribution analysis.
- Categorization: Converted numeric ages into grouped age categories for comparative analysis.
- Standardization: Uniform labels for employment status, industry names, education, and AI risk levels were applied for visualization clarity.
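
Several of the steps above can be sketched in a few lines. This is a minimal illustration rather than the author's actual pipeline: the sample rows, the raw column names other than 'monthly_salary_(inr)', and the age-group boundaries are all assumptions, and blank salaries are left as `None` instead of being imputed.

```python
# Hypothetical messy rows; values and most column names are illustrative only.
raw_rows = [
    {"age": "23", "employment_status": " employed ", "monthly_salary_(inr)": "45,000"},
    {"age": "23", "employment_status": " employed ", "monthly_salary_(inr)": "45,000"},
    {"age": "41", "employment_status": "UNEMPLOYED", "monthly_salary_(inr)": ""},
    {"age": "", "employment_status": "Employed", "monthly_salary_(inr)": "30000"},
]


def clean_header(name):
    """Turn 'monthly_salary_(inr)' into 'Monthly Salary (INR)'."""
    words = name.replace("_", " ").split()
    return " ".join(w.upper() if w.startswith("(") else w.capitalize() for w in words)


def parse_salary(text):
    """Convert a salary string such as '45,000' to float; None if blank."""
    digits = text.replace(",", "").strip()
    return float(digits) if digits else None


def age_group(age):
    """Bucket a numeric age into assumed groups (boundaries are hypothetical)."""
    if age < 25:
        return "<25"
    if age < 45:
        return "25-44"
    return "45+"


def clean(rows):
    seen, cleaned = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:  # drop exact duplicate records
            continue
        seen.add(key)
        if not row["age"].strip():  # drop rows missing a critical field
            continue
        cleaned.append({
            "Age Group": age_group(int(row["age"])),
            "Employment Status": row["employment_status"].strip().capitalize(),
            clean_header("monthly_salary_(inr)"): parse_salary(row["monthly_salary_(inr)"]),
        })
    return cleaned


cleaned_rows = clean(raw_rows)
for row in cleaned_rows:
    print(row)
```

Outlier handling is omitted because the description does not say which domain rules were applied.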

Purpose & Utility:

This dataset is ideal for learners and professionals who want to understand:
- The impact of messy data on visualization and insights
- How transformation steps can dramatically improve data interpretation
- Practical examples of preprocessing techniques before feeding into ML models or BI tools

It's also useful for:
- Training ML models with clean inputs
- Data storytelling with visual clarity
- Demonstrating reproducibility in data cleaning pipelines

By examining both the messy and clean datasets, users gain a deeper appreciation for why “garbage in, garbage out” rings true in the world of data science.
New global dataset on historical water-related conflict and cooperation...
data.niaid.nih.gov
Updated Jul 15, 2024
Cite
Zahra Kalantari (2024). New global dataset on historical water-related conflict and cooperation events [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7465152
Explore at:
Dataset updated
Jul 15, 2024
Dataset provided by
Zahra Kalantari
Haozhi Pan
Elisie Kåresdotter
Gustav Skoog
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The water-related conflict and cooperation events database was created as part of a larger project, "The missing link: how does the climate affect human conflicts and collaborations through water?", whose goal is to increase understanding of how people and the climate affect water flows and how, in turn, these changes affect cooperation and conflicts over water. Formas, project 2017-00,608, supports this project.

The database includes a collection of cooperation and conflict events between 1951 and 2019. The Transboundary Freshwater Dispute Database (2010) and WCC (Pacific Institute, Oakland, CA, 2022) were used as data on water-related acts of cooperation and conflict over time. As TFDD cooperation data ends in 2008, cooperation events were extended following a similar methodology as was used in the creation of TFDD. Further, geographic locations and regional classifications were added to all events, which can be used to create visualizations and extract subsets of the database for different parts of the world. The database methodology flow chart included in the files gives a brief overview of steps taken to prepare and process the data into the database.

The openly available scientific article in Science of the Total Environment (STOTEN) highlights findings using this dataset and gives further explanations relating to the database. The article can be found here: https://doi.org/10.1016/j.scitotenv.2023.161555.
Global Health and Development (2012-2021)
kaggle.com
Updated Nov 30, 2024
Cite
Martina Galasso (2024). Global Health and Development (2012-2021) [Dataset]. https://www.kaggle.com/datasets/martinagalasso/global-health-and-development-2012-2021/code
Explore at:
Dataset updated
Nov 30, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Martina Galasso
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Description
This dataset provides a curated and comprehensive overview of global health, demographic, economic, and environmental metrics for 188 recognized countries over a period of 10 years (2012-2021). It was created by combining reliable data from the World Bank and the World Health Organization (WHO). Due to the absence of a single source containing all necessary indicators, over 60 datasets were analyzed, cleaned, and merged, prioritizing completeness and significance.

The dataset includes 29 key indicators, ranging from life expectancy, population metrics, and economic factors to environmental conditions and health-related behaviors. Missing values were carefully handled, and only the most relevant data with substantial coverage were retained.

This dataset is ideal for researchers, analysts, and policymakers interested in exploring relationships between economic development, health outcomes, and environmental factors at a global scale.
A global database of long-term changes in insect assemblages
knb.ecoinformatics.org
dataone.org
Updated Oct 1, 2020
Cite
Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann (2020). A global database of long-term changes in insect assemblages [Dataset]. http://doi.org/10.5063/F11V5C9V
Explore at:
Unique identifier
https://doi.org/10.5063/F11V5C9V
Dataset updated
Oct 1, 2020
Dataset provided by
Knowledge Network for Biocomplexity
Authors
Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann
Time period covered
Jan 1, 1925 - Jan 1, 2018
Area covered
Variables measured
End, Link, Year, Realm, Start, CRUmnC, CRUmnK, Metric, Number, Period, and 62 more
Description
This data set, released under a CC-BY license, contains time series of total abundance and/or biomass of insect, arachnid and Entognatha assemblages (grouped at the family level or higher taxonomic resolution), monitored by standardized means for ten or more years. The data were derived from 166 data sources, representing a total of 1676 sites from 41 countries. The time series for abundance and biomass represent the aggregated number of all individuals of all taxa monitored at each site. The data set consists of four linked tables, representing information on the study level, the plot level, about sampling, and the measured assemblage sizes. All references to the original data sources can be found in the pdf with references, and a Google Earth (kml) file presents the locations (including metadata) of all datasets. When using (parts of) this data set, please respect the original open access licenses.

This data set underlies all analyses performed in the paper 'Meta-analysis reveals declines in terrestrial, but increases in freshwater insect abundances', a meta-analysis of changes in insect assemblage sizes, and is accompanied by a data paper entitled 'InsectChange – a global database of temporal changes in insect and arachnid assemblages'. Consulting the data paper before use is recommended. Tables that can be used to calculate trends of specific taxa and for species richness will be added as they become available.

The data set consists of four tables that are linked by the columns 'DataSource_ID' and 'Plot_ID', and a table with references to original research.

In the table 'DataSources', descriptive data is provided at the dataset level: links are provided to online repositories where the original data can be found, and the table describes whether the dataset provides data on biomass, abundance or both, the invertebrate group under study, the realm, and the location of sampling at different geographic scales (continent to state). This table also contains a reference column. The full reference to the original data is found in the file 'References_to_original_data_sources.pdf'.

In the table 'PlotData', more details on each site within each dataset are provided: there is data on the exact location of each plot, whether the plots were experimentally manipulated, and if there was any spatial grouping of sites (column 'Location'). Additionally, this table contains all explanatory variables used for analysis, e.g. climate change variables, land-use variables, protection status.

The table 'SampleData' describes the exact source of the data (table X, figure X, etc.), the extraction methods, as well as the sampling methods (derived from the original publications). This includes the sampling method, sampling area, sample size, and how the aggregation of samples was done, if reported. Also, any calculations we did on the original data (e.g. reverse log transformations) are detailed here, but more details are provided in the data paper. This table links to the table 'DataSources' by the column 'DataSource_ID'. Note that each datasource may contain multiple entries in the 'SampleData' table if the data were presented in different figures or tables, or if there was any other necessity to split information on sampling details.

The table 'InsectAbundanceBiomassData' provides the insect abundance or biomass numbers as analysed in the paper. It contains columns matching the tables 'DataSources' and 'PlotData', as well as year of sampling, a descriptor of the period within the year of sampling (this was used as a random effect), the unit in which the number is reported (abundance or biomass), and the estimated abundance or biomass. In the column for Number, missing data are included (NA). The years with missing data were added because this was essential for the analysis performed, and retained here because they are easier to remove than to add.

Linking the table 'InsectAbundanceBiomassData.csv' with 'PlotData.csv' by column 'Plot_ID', and with 'DataSources.csv' by column 'DataSource_ID', will provide the full dataframe used for all analyses. Detailed explanations of all column headers and terms are available in the ReadMe file, and more details will be available in the forthcoming data paper.

WARNING: Because of the disparate sampling methods and various spatial and temporal scales used to collect the original data, this dataset should never be used to test for differences in insect abundance/biomass among locations (i.e. differences in intercept). The data can only be used to study temporal trends, by testing for differences in slopes. The data are standardized within plots to allow the temporal comparison, but not necessarily among plots (even within one dataset).
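
The linking step and the slope-only comparison can be illustrated with a small sketch. The rows below are invented stand-ins for the three CSV files, carrying only a few of the documented columns ('Plot_ID', 'DataSource_ID', 'Year', 'Number', 'Realm', 'Location'). The per-plot least-squares slope is a deliberate simplification and is not the statistical model used in the paper.

```python
from collections import defaultdict

# Tiny hypothetical extracts of the three tables; real files have many more columns.
data_sources = [{"DataSource_ID": 1, "Realm": "Terrestrial"}]
plot_data = [
    {"Plot_ID": 10, "DataSource_ID": 1, "Location": "A"},
    {"Plot_ID": 11, "DataSource_ID": 1, "Location": "B"},
]
abundance = [
    {"Plot_ID": 10, "DataSource_ID": 1, "Year": 2000, "Number": 120.0},
    {"Plot_ID": 10, "DataSource_ID": 1, "Year": 2001, "Number": None},  # NA year
    {"Plot_ID": 10, "DataSource_ID": 1, "Year": 2002, "Number": 100.0},
    {"Plot_ID": 11, "DataSource_ID": 1, "Year": 2000, "Number": 5.0},
    {"Plot_ID": 11, "DataSource_ID": 1, "Year": 2002, "Number": 9.0},
]

plots = {p["Plot_ID"]: p for p in plot_data}
sources = {s["DataSource_ID"]: s for s in data_sources}

# Join each measurement to its plot (Plot_ID) and its data source (DataSource_ID).
full = [{**sources[r["DataSource_ID"]], **plots[r["Plot_ID"]], **r} for r in abundance]


def slope(points):
    """Ordinary least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Group by plot so that trends are only ever estimated within a plot.
series = defaultdict(list)
for row in full:
    if row["Number"] is not None:  # skip the NA placeholder years
        series[row["Plot_ID"]].append((row["Year"], row["Number"]))

slopes = {plot: slope(points) for plot, points in series.items()}
print(slopes)
```

Only the signs and sizes of the slopes are compared here. The raw 'Number' values of plots 10 and 11 differ by an order of magnitude, which is exactly the kind of between-plot difference the warning says should not be interpreted.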

