In 2024, there were 301,623 missing person cases filed with the National Crime Information Center (NCIC) in which the race of the reported missing person was white. In the same year, 17,097 people whose race was unknown were also reported missing in the United States.

What is the NCIC?

The National Crime Information Center (NCIC) is a digital database that stores crime data for the United States so that criminal justice agencies can access it. As part of the FBI, it helps criminal justice professionals find criminals, missing people, stolen property, and terrorists. The NCIC database is broken down into 21 files: seven cover stolen property and items, and 14 cover persons, including the National Sex Offender Registry, Missing Person, and Identity Theft files. It works alongside federal, tribal, state, and local agencies. The NCIC's goal is to maintain a centralized information system between local branches and offices so that information is easily accessible nationwide.

Missing people in the United States

A person is considered missing when they have disappeared and their location is unknown. A person who is considered missing might have left voluntarily, but that is not always the case. The number of NCIC unidentified person files in the United States has fluctuated since 1990, and in 2022 there were slightly more NCIC missing person files for males than for females. Fortunately, the number of NCIC missing person files has been mostly decreasing since 1998.
https://dataful.in/terms-and-conditions
The dataset contains the state-wise number of persons reported missing in a particular year, the total number of persons missing including those from previous years, the number of persons recovered/traced, and the number unrecovered/untraced. The dataset also contains the percentage recovery of missing persons, calculated as the percentage share of persons traced out of the total number of persons missing. NCRB started providing detailed data on missing and traced persons, including children, from 2016 onwards, following the Supreme Court's direction in a Writ Petition. It should also be noted that the data published by NCRB is restricted to those cases where FIRs have been registered by the police in the respective States/UTs.
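As a worked illustration of the recovery metric described above (the figures below are made up, not NCRB data):

```python
def percentage_recovery(persons_traced: int, total_missing: int) -> float:
    """Percentage recovery = persons traced as a share of all persons missing."""
    return 100.0 * persons_traced / total_missing

# e.g., 45,000 traced out of 60,000 total missing -> 75.0
print(percentage_recovery(45_000, 60_000))
```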
Note: Figures for projected_mid_year_population are sourced from the Report of the Technical Group on Population Projections for India and States 2011-2036
https://dataful.in/terms-and-conditions
The Ministry of Home Affairs, Government of India has defined a missing child as "a person below eighteen years of age, whose whereabouts are not known to the parents, legal guardians and any other persons who may be legally entrusted with the custody of the child, whatever may be the circumstances/causes of disappearance". The dataset contains the state-wise and gender-wise number of children reported missing in a particular year, the total number of persons missing including those from previous years, the number of persons recovered/traced, and the number unrecovered/untraced. The dataset also contains the percentage recovery of missing persons, calculated as the percentage share of persons traced out of the total number of persons missing. NCRB started providing detailed data on missing and traced persons, including children, from 2016 onwards, following the Supreme Court's direction in a Writ Petition. It should also be noted that the data published by NCRB is restricted to those cases where FIRs have been registered by the police in the respective States/UTs.
NamUs is the only national repository for missing, unidentified, and unclaimed persons cases. The program provides a singular resource hub for law enforcement, medical examiners, coroners, and investigating professionals. It is the only national database for missing, unidentified, and unclaimed persons that allows limited access to the public, empowering family members to take a more proactive role in the search for their missing loved ones.
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Under Section 8 of the Missing Persons Act, 2018, police services are required to report annually on their use of urgent demands for records under the Act, and the Ministry of the Solicitor General is required to make the OPP's annual report data publicly available. The data includes:
* year in which the urgent demands were reported
* category of records
* description of records accessed under each category
* total number of times each category of records was demanded
* total number of missing persons investigations which had urgent demands for records
* total number of urgent demands for records made by the OPP in a year
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This project provides a comprehensive dataset of over 125,000 missing and unaccounted-for people in Mexico from the 1960s to 2025. The dataset is sourced from the publicly available records on the RNPDO website and represents individuals who were actively missing as of the date of collection (July 1, 2025). To protect individual identities, personal identifiers, such as names, have been removed.

Dataset Features:
The data has been cleaned and translated to facilitate analysis by a global audience. Fields include:
- Sex
- Date of birth
- Date of incidence
- State and municipality of the incident
Data spans over six decades, offering insights into trends and regional disparities.

Additional Materials:
- Python Script: A Python script to generate customizable visualizations based on the dataset. Users can specify the state to generate tailored charts.
- Sample Chart: An example chart showcasing the evolution of missing persons per 100,000 inhabitants in Mexico between 2006 and 2025.
- Requirements File: A requirements.txt file listing the necessary Python libraries to run the script seamlessly.

This dataset and accompanying tools aim to support researchers, policymakers, and journalists in analyzing and addressing the issue of missing persons in Mexico.
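A rough sketch of the kind of state-level chart the bundled Python script is described as producing. The file name and column names below are assumptions based on the field list, not the actual headers, and this version plots raw yearly counts rather than per-100,000 rates (which would require external population figures not included in the dataset):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Column names are assumptions, not the dataset's real headers.
df = pd.read_csv("rnpdo_missing_persons.csv", parse_dates=["date_of_incidence"])

state = "Jalisco"                      # any state of interest
subset = df[df["state"] == state]
per_year = subset.groupby(subset["date_of_incidence"].dt.year).size()

per_year.plot(kind="bar", title=f"Missing persons reported in {state} by year")
plt.xlabel("Year of incidence")
plt.ylabel("Number of people")
plt.tight_layout()
plt.show()
```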
This data collection represents the empirical materials collected for the ESRC project 'Geographies of Missing People'. It comprises interviews with 45 people previously reported as missing, 9 charity workers, 23 police officers of various ranks and 25 families of missing people. We request that other researchers who wish to reuse our data get in touch with the research team to discuss how and why they want to reuse it. The data is accessible with direct permission from the PI of the original ESRC award: Hester.parr@glasgow.ac.uk

This project seeks to understand the realities involved in 'going missing', and does so from multiple perspectives, using the voices and opinions of the police, families and returned missing people themselves. Qualitative data has been collected to shed light on this significant social (and spatial) problem and to help us understand more about the nature of missing experiences for different groups. The purpose of the research project has been to understand more about how people go missing and how the police and families respond to such events (the geographies of searching). Such a focus holds value for both the police and families (the 'left behind') in that it updates and checks current knowledge about the likely spatial experiences of missing people. The project recruited 45 people formally reported as missing, 9 charity workers in the field of missing persons, 23 police officers of various ranks and 25 family members; these materials are held by the data archive service, with permission to access from Hester.parr@glasgow.ac.uk. Data were collected through interviews and focus groups. Sampling methods are profiled in the main reports lodged on www.geographiesofmissingpeople.org.uk
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the population of Lost Nation by gender across 18 age groups. It lists the male and female population in each age group along with the gender ratio for Lost Nation. The dataset can be utilized to understand the population distribution of Lost Nation by gender and age. For example, using this dataset, we can identify the largest age group for both men and women in Lost Nation. Additionally, it can be used to see how the gender ratio (the male-to-female ratio in each age group) changes from birth to the most senior age group.
Key observations
Largest age group (population): Male: 50-54 years (27); Female: 10-14 years (25). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
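A minimal sketch of how such observations and a males-per-100-females gender ratio could be recomputed from the table (the file and column names are assumptions, not the actual headers):

```python
import pandas as pd

# Illustrative only: file and column names are assumed.
df = pd.read_csv("lost-nation-population-by-age-and-gender.csv")

# A common convention: males per 100 females in each age group.
df["gender_ratio"] = 100 * df["male_population"] / df["female_population"]

largest_male = df.loc[df["male_population"].idxmax(), "age_group"]
largest_female = df.loc[df["female_population"].idxmax(), "age_group"]
print(largest_male, largest_female)
```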
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Age groups:
Scope of gender:
Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are expected to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis.
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for any of your research projects, reports or presentations, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Lost Nation Population by Gender. You can refer to the same here.
https://creativecommons.org/publicdomain/zero/1.0/
I am developing my data science skills in areas outside of my previous work. An interesting problem for me was to identify which factors influence life expectancy on a national level. There is an existing Kaggle data set that explored this, but that information was corrupted. Part of the problem solving process is to step back periodically and ask "does this make sense?" Without reasonable data, it is harder to notice mistakes in my analysis code (as opposed to unusual behavior due to the data itself). I wanted to make a similar data set, but with reliable information.
This is my first time exploring life expectancy, so I had to guess which features might be of interest when making the data set. Some were included for comparison with the other Kaggle data set. A number of potentially interesting features (like air pollution) were left off due to limited year or country coverage. Since the data was collected from more than one server, some features are present more than once, to explore the differences.
A goal of the World Health Organization (WHO) is to ensure that a billion more people are protected from health emergencies and enjoy better health and well-being. They provide public data collected from many sources to identify and monitor factors that are important to reaching this goal. This set was primarily made using GHO (Global Health Observatory) and UNESCO (United Nations Educational, Scientific and Cultural Organization) information. The set covers the years 2000-2016 for 183 countries, in a single CSV file. Missing data is left in place, for the user to decide how to deal with it.
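A minimal sketch of inspecting the missing values and, optionally, filling them; the file name and the "country"/"year" columns are assumptions about the layout:

```python
import pandas as pd

# File and column names are assumptions for illustration.
df = pd.read_csv("life_expectancy.csv")

# Share of missing values per column, largest first.
print(df.isna().mean().sort_values(ascending=False).head(10))

# One simple option: interpolate each numeric indicator within a country
# across years. Whether this is appropriate depends on the indicator.
df = df.sort_values(["country", "year"])
num_cols = df.select_dtypes("number").columns
df[num_cols] = (df.groupby("country")[num_cols]
                  .transform(lambda s: s.interpolate(limit_direction="both")))
```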
Three notebooks are provided: my cursory analysis, a comparison with the other Kaggle set, and a template for creating this data set.
There is a lot to explore, if the user is interested. The GHO server alone has over 2000 "indicators".
- How are the GHO and UNESCO life expectancies calculated, and what is causing the difference? That could also be asked for the Gross National Income (GNI) and mortality features.
- How does the life expectancy after age 60 compare to the life expectancy at birth? Is the relationship with the features in this data set different for those two targets?
- What other indicators on the servers might be interesting to use? Some of the GHO indicators are different studies with different coverage. Can they be combined to make a more useful and robust data feature?
- Unraveling the correlations between the features would take significant work.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Heart failure (HF) affects at least 26 million people worldwide, so predicting adverse events in HF patients represents a major target of clinical data science. However, achieving large sample sizes sometimes represents a challenge due to difficulties in patient recruiting and long follow-up times, which increases the problem of missing data. To overcome the issue of narrow dataset cardinality (in a clinical dataset, the cardinality is the number of patients in that dataset), population-enhancing algorithms are crucial. The aim of this study was to design a random shuffle method to enhance the cardinality of an HF dataset while remaining statistically legitimate, without the need for specific hypotheses and regression models. The cardinality enhancement was validated against an established random repeated-measures method with regard to correctness in predicting clinical conditions and endpoints. In particular, machine learning and regression models were employed to highlight the benefits of the enhanced datasets. The proposed random shuffle method was able to enhance the HF dataset cardinality (711 patients before dataset preprocessing) roughly 10 times, and roughly 21 times when followed by a random repeated-measures approach. We believe that the random shuffle method could be used in the cardiovascular field and in other data science problems where missing data and narrow dataset cardinality represent an issue.
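The abstract above does not spell out the shuffle procedure. The sketch below shows one generic within-class permutation augmentation in that spirit, and should not be read as the authors' exact method:

```python
import numpy as np
import pandas as pd

def shuffle_augment(df: pd.DataFrame, label_col: str,
                    n_copies: int = 9, seed: int = 0) -> pd.DataFrame:
    """Generate synthetic rows by independently permuting each feature column
    within each outcome class. A generic sketch, not the cited HF procedure."""
    rng = np.random.default_rng(seed)
    augmented = [df]
    for _ in range(n_copies):
        pieces = []
        for _, group in df.groupby(label_col):
            shuffled = group.copy()
            for col in group.columns:
                if col != label_col:
                    shuffled[col] = rng.permutation(group[col].to_numpy())
            pieces.append(shuffled)
        augmented.append(pd.concat(pieces))
    return pd.concat(augmented, ignore_index=True)
```

With n_copies=9 the returned table has roughly 10 times the original cardinality, mirroring the order of enhancement reported above.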
Note: DPH is updating and streamlining the COVID-19 cases, deaths, and testing data. As of 6/27/2022, the data will be published in four tables instead of twelve.

The COVID-19 Cases, Deaths, and Tests by Day dataset contains case and test data by date of sample submission. The death data are by date of death. This dataset is updated daily and contains information back to the beginning of the pandemic. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-Cases-Deaths-and-Tests-by-Day/g9vi-2ahj. The COVID-19 State Metrics dataset contains over 93 columns of data. This dataset is updated daily and currently contains information starting June 21, 2022 to the present. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-State-Level-Data/qmgw-5kp6. The COVID-19 County Metrics dataset contains 25 columns of data. This dataset is updated daily and currently contains information starting June 16, 2022 to the present. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-County-Level-Data/ujiq-dy22. The COVID-19 Town Metrics dataset contains 16 columns of data. This dataset is updated daily and currently contains information starting June 16, 2022 to the present. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-Town-Level-Data/icxw-cada. To protect confidentiality, if a town has fewer than 5 cases or positive NAAT tests over the past 7 days, those data will be suppressed.

COVID-19 cases and associated deaths that have been reported among Connecticut residents are broken down by race and ethnicity. All data in this report are preliminary; data for previous dates will be updated as new reports are received and data errors are corrected. Deaths reported to either the Office of the Chief Medical Examiner (OCME) or the Department of Public Health (DPH) are included in the COVID-19 update.

The following data show the number of COVID-19 cases and associated deaths per 100,000 population by race and ethnicity. Crude rates represent the total cases or deaths per 100,000 people. Age-adjusted rates consider the age of the person at diagnosis or death when estimating the rate and use a standardized population to provide a fair comparison between population groups with different age distributions. Age adjustment is important in Connecticut because the median age among the non-Hispanic white population is 47 years, whereas it is 34 years among non-Hispanic Black residents and 29 years among Hispanic residents. Because most non-Hispanic white residents who died were over 75 years of age, the age-adjusted rates are lower than the unadjusted rates. In contrast, Hispanic residents who died tend to be younger than 75 years of age, which results in higher age-adjusted rates.

The population data used to calculate rates are based on the CT DPH population statistics for 2019, which are available online here: https://portal.ct.gov/DPH/Health-Information-Systems--Reporting/Population/Population-Statistics. Prior to 5/10/2021, the population estimates from 2018 were used. Rates are standardized to the 2000 US Millions Standard population (data available here: https://seer.cancer.gov/stdpopulations/). Standardization was done using 19 age groups (0, 1-4, 5-9, 10-14, ..., 80-84, 85 years and older). More information about direct standardization for age adjustment is available here: https://www.cdc.gov/nchs/data/statnt/statnt06rv.pdf

Categories are mutually exclusive. The category “multiracial” includes people who answered ‘yes’ to more than one race category. Counts may not add up to total case counts as data on race and ethnicity may be missing. Age-adjusted rates are calculated only for groups with more than 20 deaths. Abbreviation: NH = Non-Hispanic. Data on Connecticut deaths were obtained from the Connecticut Deaths Registry maintained by the DPH Office of Vital Records. Cause of death was determined by a death certifier (e.g., physician, APRN, medical examiner).
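For readers unfamiliar with the direct age standardization described above, a minimal sketch with made-up counts and only three illustrative age bands (not the 19 groups or the real 2000 US standard weights used by DPH):

```python
# Direct age standardization, illustrative numbers only.
age_groups = ["0-39", "40-64", "65+"]
cases      = [120,     450,     900]       # hypothetical case counts
population = [200_000, 150_000, 50_000]    # hypothetical population counts
std_weight = [0.55,    0.33,    0.12]      # hypothetical share of the standard population

crude_rate = 100_000 * sum(cases) / sum(population)
adjusted_rate = 100_000 * sum(
    w * c / p for c, p, w in zip(cases, population, std_weight))
print(round(crude_rate, 1), round(adjusted_rate, 1))
```

The adjusted rate is simply the sum of age-specific rates weighted by the standard population's age distribution, which is what allows fair comparison across groups with different age structures.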
A current-year-only universe of Cook County parcels with attached geographic, governmental, and spatial data. When working with Parcel Index Numbers (PINs), make sure to zero-pad them to 14 digits. Some datasets may lose leading zeros for PINs when downloaded.

Additional notes:
- Non-taxing district data is attached via spatial join (st_contains) to each parcel's centroid. Tax district data (school district, park district, municipality, etc.) are attached by a parcel's assigned tax code.
- Centroids are based on Cook County parcel shapefiles. Older properties may be missing coordinates and thus also missing attached spatial data (usually they are missing a parcel boundary in the shapefile). Newer properties may be missing a mailing or property address, as they need to be assigned one by the postal service.
- This dataset contains data for the current tax year, which may not yet be complete or final. Assessed values for any given year are subject to change until review and certification of values by the Cook County Board of Review, though there are a few rare circumstances where values may change for the current or past years after that. Rowcount for a given year is final once the Assessor has certified the assessment roll for all townships.
- Data will be updated monthly. Depending on the time of year, some third-party and internal data will be missing for the most recent year. Assessments mailed this year represent values from last year, so this isn't an issue. By the time the Data Department models values for this year, those data will have populated.
- Current property class codes, their levels of assessment, and descriptions can be found on the Assessor's website. Note that class code details can change across time.
- Due to discrepancies between the systems used by the Assessor and Clerk's offices, tax_district_code is not currently up-to-date in this table.
- There are currently two different sources of parcel-level municipality available in this data set, and they will not always agree: tax and spatial records. Tax records from the Cook County Clerk indicate the municipality to which a parcel owner pays taxes, while spatial records, also from the Cook County Clerk, indicate the municipal boundaries within which a parcel lies.
- For more information on the sourcing of attached data and the preparation of this dataset, see the Assessor's Standard Operating Procedures for Open Data on GitHub. Read about the Assessor's 2025 Open Data Refresh.
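A minimal sketch of the zero-padding step recommended above, assuming a hypothetical file name and a "pin" column:

```python
import pandas as pd

# Restore leading zeros lost in a CSV download; file and column names are assumptions.
df = pd.read_csv("parcels.csv", dtype={"pin": str})
df["pin"] = df["pin"].str.zfill(14)   # pad each PIN to the full 14 digits
```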
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.
This work focuses on surface-water-quality data from the Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raise significant challenges.
To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)).
IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.
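The NSE cited above is the standard Nash-Sutcliffe Efficiency; a small reference implementation of the usual formula (not code from the paper):

```python
import numpy as np

def nse(observed, simulated) -> float:
    """Nash-Sutcliffe Efficiency: 1 is a perfect fit; values above ~0.8
    are usually considered very good."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    return 1 - np.sum((observed - simulated) ** 2) / np.sum(
        (observed - observed.mean()) ** 2)
```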
In this dataset, we include the original and imputed values for the following variables:
Water temperature (Tw)
Dissolved oxygen (DO)
Electrical conductivity (EC)
pH
Turbidity (Turb)
Nitrite (NO2-)
Nitrate (NO3-)
Total Nitrogen (TN)
Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].
More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.
If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318
This data set includes annual counts and percentages of Medicaid and Children’s Health Insurance Program (CHIP) enrollees who received a well-child visit paid for by Medicaid or CHIP, overall and by five subpopulation topics: age group, race and ethnicity, urban or rural residence, program type, and primary language. These results were generated using Transformed Medicaid Statistical Information System (T-MSIS) Analytic Files (TAF) Release 1 data and the Race/Ethnicity Imputation Companion File. This data set includes Medicaid and CHIP enrollees in all 50 states, the District of Columbia, Puerto Rico, and the U.S. Virgin Islands, except where otherwise noted. Enrollees in Guam, American Samoa, and the Northern Mariana Islands are not included. Results include enrollees with comprehensive Medicaid or CHIP benefits for all 12 months of the year and who were younger than age 19 at the end of the calendar year. Results shown for the race and ethnicity subpopulation topic exclude enrollees in the U.S. Virgin Islands. Results shown for the primary language subpopulation topic exclude select states with data quality issues with the primary language variable in TAF. Some rows in the data set have a value of "DS," which indicates that data were suppressed according to the Centers for Medicare & Medicaid Services’ Cell Suppression Policy for values between 1 and 10. This data set is based on the brief: "Medicaid and CHIP enrollees who received a well-child visit in 2020." Enrollees are identified as receiving a well-child visit in the year according to the Line 6 criteria in the Form CMS-416 reporting instructions. Enrollees are assigned to an age group subpopulation using age as of December 31st of the calendar year. Enrollees are assigned to a race and ethnicity subpopulation using the state-reported race and ethnicity information in TAF when it is available and of good quality; if it is missing or unreliable, race and ethnicity is indirectly estimated using an enhanced version of Bayesian Improved Surname Geocoding (BISG) (Race and ethnicity of the national Medicaid and CHIP population in 2020). Enrollees are assigned to an urban or rural subpopulation based on the 2010 Rural-Urban Commuting Area (RUCA) code associated with their home or mailing address ZIP code in TAF (Rural Medicaid and CHIP enrollees in 2020). Enrollees are assigned to a program type subpopulation based on the CHIP code and eligibility group code that applies to the majority of their enrolled-months during the year (Medicaid-Only Enrollment; M-CHIP and S-CHIP Enrollment). Enrollees are assigned to a primary language subpopulation based on their reported ISO language code in TAF (English/missing, Spanish, and all other language codes) (Primary Language). Please refer to the full brief for additional context about the methodology and detailed findings. Future updates to this data set will include more recent data years as the TAF data become available.
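When loading this data set, the suppressed "DS" cells can be treated as missing so the count columns stay numeric; a minimal sketch with an assumed file name:

```python
import pandas as pd

# "DS" marks cells suppressed under the CMS Cell Suppression Policy
# (values between 1 and 10). The file name is an assumption.
df = pd.read_csv("well_child_visits_2020.csv", na_values=["DS"])
print(df.isna().sum())   # how many suppressed or missing cells per column
```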
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org.
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
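A rough sketch of how a one-row-per-film table could be derived from the long table; the column names ("film_id", "year") are assumptions, and the published wide file should be preferred since it already encodes the first sample festival per film:

```python
import pandas as pd

# Sketch only: sorting by year alone does not reproduce the festival-calendar
# ordering used to pick the *first* sample festival in the published wide file.
long_df = pd.read_csv("1_film-dataset_festival-program_long.csv")

wide_df = (long_df
           .sort_values(["film_id", "year"])
           .drop_duplicates(subset="film_id", keep="first"))
print(wide_df.shape[0])   # should equal the number of unique films (9,348)
```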
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the scripts for web scraping. They were written using R version 3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
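For readers who prefer Python, a comparable (and much simpler) title-similarity check using only the standard library; this is an illustration of fuzzy title matching, not the project's R matching code:

```python
from difflib import SequenceMatcher

def title_similarity(a: str, b: str) -> float:
    """Rough title similarity in [0, 1]. The project itself used cosine and
    OSA string distances in R; this is only a comparable illustration."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

print(title_similarity("The Watermelon Woman", "Watermelon Woman, The"))
```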
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible duplicates in the dataset and identifies them for a manual check.
The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does so for the first 100 films, to check that everything works. Scraping the entire dataset took a few hours; a test with a subsample of 100 films is therefore advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the number of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location, festival name and festival categories, along with units of measurement, data sources, coding and information on missing data.
The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset is in wide format, i.e. all information for each festival is listed in one row.
This data set includes annual counts and percentages of Medicaid and Children’s Health Insurance Program (CHIP) enrollees who received mental health (MH) or substance use disorder (SUD) services, overall and by six subpopulation topics: age group, sex or gender identity, race and ethnicity, urban or rural residence, eligibility category, and primary language. These results were generated using Transformed Medicaid Statistical Information System (T-MSIS) Analytic Files (TAF) Release 1 data and the Race/Ethnicity Imputation Companion File. This data set includes Medicaid and CHIP enrollees in all 50 states, the District of Columbia, Puerto Rico, and the U.S. Virgin Islands, ages 12 to 64 at the end of the calendar year, who were not dually eligible for Medicare and were continuously enrolled with comprehensive benefits for 12 months, with no more than one gap in enrollment exceeding 45 days. Enrollees who received services for both an MH condition and SUD in the year are counted toward both condition categories. Enrollees in Guam, American Samoa, the Northern Mariana Islands, and select states with TAF data quality issues are not included. Results shown for the race and ethnicity subpopulation topic exclude enrollees in the U.S. Virgin Islands. Results shown for the primary language subpopulation topic exclude select states with data quality issues with the primary language variable in TAF. Some rows in the data set have a value of "DS," which indicates that data were suppressed according to the Centers for Medicare & Medicaid Services’ Cell Suppression Policy for values between 1 and 10. This data set is based on the brief: "Medicaid and CHIP enrollees who received mental health or SUD services in 2020." Enrollees are assigned to an age group subpopulation using age as of December 31st of the calendar year. Enrollees are assigned to a sex or gender identity subpopulation using their latest reported sex in the calendar year. Enrollees are assigned to a race and ethnicity subpopulation using the state-reported race and ethnicity information in TAF when it is available and of good quality; if it is missing or unreliable, race and ethnicity is indirectly estimated using an enhanced version of Bayesian Improved Surname Geocoding (BISG) (Race and ethnicity of the national Medicaid and CHIP population in 2020). Enrollees are assigned to an urban or rural subpopulation based on the 2010 Rural-Urban Commuting Area (RUCA) code associated with their home or mailing address ZIP code in TAF (Rural Medicaid and CHIP enrollees in 2020). Enrollees are assigned to an eligibility category subpopulation using their latest reported eligibility group code, CHIP code, and age in the calendar year. Enrollees are assigned to a primary language subpopulation based on their reported ISO language code in TAF (English/missing, Spanish, and all other language codes) (Primary Language). Please refer to the full brief for additional context about the methodology and detailed findings. Future updates to this data set will include more recent data years as the TAF data become available.
https://creativecommons.org/publicdomain/zero/1.0/
Max Foundation is a Netherlands-based NGO that works towards a healthy start for every child in the most effective and long-lasting way. Over the past 15 years, our teams in Bangladesh and Ethiopia have reached almost 3 million people, supporting communities in reducing stunting and undernutrition by gaining better access to clean water, sanitation and hygiene, as well as healthy diets and care for mother and child.
Maximising our impact and cost efficiency are at the core of our work, which makes quantifying and analysing our programmes crucial. We therefore collect a lot of information on the communities we work with, to understand them better and see where and how we can improve as an organisation.
This data set is one of many we are making publicly available because we believe that data in the development sector should be open: not as a goal in itself, but as a way to help the sector be more effective and create more impact.
These data were collected between Q2 and Q3 in 2019 (with a few observations earlier and later) in the areas in Bangladesh where Max Foundation is active. The data were collected on a representative sample of the households in the area that include at least one child between the ages of 2 and 5. The data provide a very detailed picture of the nutritional status of households as well as their knowledge, attitudes and practices in nutrition, and especially child nutrition. As this information was collected by a third-party partner, some information is missing. We cleaned the data to the best of our ability, and feel very confident about the district, upazila and union information. Village numbers are often missing and ward numbers were inferred for much of the data, and may therefore not always be accurate. We regret this lapse in quality.
All datasets we publish can be linked together at the village-level, and we encourage everyone to not look at these data in isolation, but link it to our other datasets to create richer analyses.
All of Max Foundation's data are collected and processed according to GDPR standards and explicit informed consent is given by all respondents. They are also clearly informed that choosing not to participate in data collection will in no way affect their eligibility for, or receiving of, products or services from Max Foundation.
Furthermore, we enforce strong privacy protections on our open data to minimise the risk of these data being used to cause harm or re-identify individuals. Concretely this means:
- Administrative units up to the Union can be directly identified with the BD_ loc_xx data (which can be found in our Max Foundation Bangladesh 2018 WASH Census dataset). Villages are masked by random numbers. However, to ensure it is still possible to compare our data sets, these random numbers are consistent across all datasets. This means that village '1' in this data is the same as village '1' in all of our other Bangladesh datasets, unless stated otherwise;
- Sensitive variables are omitted, censored or bucketed.
The column descriptions specify any transformations done to the data.
These data could have not been collected without the generous support from the Embassy of the Kingdom of the Netherlands in Dhaka and numerous other donors who have supported us over the years. Special thanks to our Bangladesh team for their excellent work in guiding the data collection process.
We invite you to share any interesting insights you have derived from the data with us. From visualising our impact, to uncovering which parts of our programmes are most strongly related with reducing stunting, to making new connections we may have not even considered; we are eager to hear how we can be more effective in what we do and how we do it.
More detailed data insights are available from our internal data, such as the linking of households between datasets. Please note that we would be happy to share more detailed data with researchers, students and many others once proper agreements are in place.
As we value impact above all else, we are happy to work with anyone who can help us to improve our impact. We are constantly adapting our approach based on internal and external findings, and invite you to join us on this journey. Together we can ensure that every child has a healthy start.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The dataset comprises cleaned records of yellow taxi rides in a specified time frame, covering essential details such as pickup and drop-off dates, number of passengers, distance traveled, fare, tips, tolls, total payment, taxi color, and payment method. Detailed statistics on ride durations, distances, fares, and payments are included.
Column Description:
- pickup: Pickup date and time.
- dropoff: Drop-off date and time.
- passengers: Number of passengers.
- distance: Distance traveled in miles.
- fare: Fare amount.
- tip: Extra tip amount.
- tolls: Toll tax amount.
- total: Total payment including fare, tip, and tolls.
- color: Color of the taxi.
- payment: Payment method (e.g., credit card, cash).
- 03/28/2019 - 03/31/2019: Ride counts aggregated by date range.
- 2019-03-01 - 2019-04-01: Ride counts aggregated by date.
- Label: Categorized ranges for distances, fares, tips, tolls, and totals.
- yellow: Percentage breakdown of yellow taxi rides.
- green: Percentage breakdown of green taxi rides.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The database for this study (Briganti et al. 2018; the same as for the Braun study analysis) was composed of 1973 French-speaking students in several universities or schools for higher education in the following fields: engineering (31%), medicine (18%), nursing school (16%), economic sciences (15%), physiotherapy (4%), psychology (11%), law school (4%) and dietetics (1%). The subjects were 17 to 25 years old (M = 19.6 years, SD = 1.6 years); 57% were females and 43% were males. Even though the full dataset was composed of 1973 participants, only 1270 answered the full questionnaire: missing data are handled using pairwise complete observations in estimating a Gaussian Graphical Model, meaning that all available information from every subject is used.
The feature set is composed of 28 items meant to assess the four following components: fantasy, perspective taking, empathic concern and personal distress. In the questionnaire, the items are mixed; reversed items (items 3, 4, 7, 12, 13, 14, 15, 18, 19) are present. Items are scored from 0 to 4, where “0” means “Doesn’t describe me very well” and “4” means “Describes me very well”; reverse-scoring is calculated afterwards. The questionnaires were anonymized. The reanalysis of the database in this retrospective study was approved by the ethical committee of the Erasmus Hospital.
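A minimal reverse-scoring sketch consistent with the description above (items scored 0-4, so a reversed item becomes 4 minus the raw score); the file name and "item_N" column names are assumptions about the layout:

```python
import pandas as pd

# Reversed items listed in the description above.
REVERSED = [3, 4, 7, 12, 13, 14, 15, 18, 19]

df = pd.read_csv("empathy_items.csv")   # assumed file name
for i in REVERSED:
    df[f"item_{i}"] = 4 - df[f"item_{i}"]   # 0-4 scale flipped
```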
Size: A dataset of size 1973*28
Number of features: 28
Ground truth: No
Type of Graph: Mixed graph
The following gives the description of the variables:
Feature | FeatureLabel | Domain | Item meaning from Davis 1980 |
---|---|---|---|
001 | 1FS | Green | I daydream and fantasize, with some regularity, about things that might happen to me. |
002 | 2EC | Purple | I often have tender, concerned feelings for people less fortunate than me. |
003 | 3PT_R | Yellow | I sometimes find it difficult to see things from the “other guy’s” point of view. |
004 | 4EC_R | Purple | Sometimes I don’t feel very sorry for other people when they are having problems. |
005 | 5FS | Green | I really get involved with the feelings of the characters in a novel. |
006 | 6PD | Red | In emergency situations, I feel apprehensive and ill-at-ease. |
007 | 7FS_R | Green | I am usually objective when I watch a movie or play, and I don’t often get completely caught up in it. (Reversed) |
008 | 8PT | Yellow | I try to look at everybody’s side of a disagreement before I make a decision. |
009 | 9EC | Purple | When I see someone being taken advantage of, I feel kind of protective towards them. |
010 | 10PD | Red | I sometimes feel helpless when I am in the middle of a very emotional situation. |
011 | 11PT | Yellow | I sometimes try to understand my friends better by imagining how things look from their perspective. |
012 | 12FS_R | Green | Becoming extremely involved in a good book or movie is somewhat rare for me. (Reversed) |
013 | 13PD_R | Red | When I see someone get hurt, I tend to remain calm. (Reversed) |
014 | 14EC_R | Purple | Other people’s misfortunes do not usually disturb me a great deal. (Reversed) |
015 | 15PT_R | Yellow | If I’m sure I’m right about something, I don’t waste much time listening to other people’s arguments. (Reversed) |
016 | 16FS | Green | After seeing a play or movie, I have felt as though I were one of the characters. |
017 | 17PD | Red | Being in a tense emotional situation scares me. |
018 | 18EC_R | Purple | When I see someone being treated unfairly, I sometimes don’t feel very much pity for them. (Reversed) |
019 | 19PD_R | Red | I am usually pretty effective in dealing with emergencies. (Reversed) |
020 | 20FS | Green | I am often quite touched by things that I see happen. |
021 | 21PT | Yellow | I believe that there are two sides to every question and try to look at them both. |
022 | 22EC | Purple | I would describe myself as a pretty soft-hearted person. |
023 | 23FS | Green | When I watch a good movie, I can very easily put myself in the place of a leading character. |
024 | 24PD | Red | I tend to lose control during emergencies. |
025 | 25PT | Yellow | When I’m upset at someone, I usually try to “put myself in his shoes” for a while. |
026 | 26FS | Green | When I am reading an interesting story or novel, I imagine how I would feel if the events in the story were happening to me. |
027 | 27PD | Red | When I see someone who badly needs help in an emergency, I go to pieces. |
028 | 28PT | Yellow | Before criticizing somebody, I try to imagine how I would feel if I were in their place. |
More information about the dataset is contained in empathy_description.html file.
This dataset is a snapshot from October 2022 of all 48 homes in a section of a neighborhood nearby a large university in Central Florida. All of the homes are single family homes featuring a garage, a driveway, and a fenced-in backyard. Data was gathered by hand (keyboard) via a collection of sites, including Zillow, Realtor, Redfin, Trulia, and Orange County Property Appraiser. All homes were built in the same year in the early 2000s and feature central air and all other utilities typical of contemporary suburban homes in the United States. The area is close to a university and a large portion of renters are college students and young professionals, as well as families and older adults.
There are 30 columns:
Note that while the dataset is exhaustive in that it includes all of the houses, some homes are missing some columns, typically because a home did not have an estimate on a site or was the one home not found on the property appraiser's site. This is therefore not a randomized dataset, so the only population of homes it can be used to make inferences about is those within this specific portion of the neighborhood. Personally, I am going to use the dataset to practice a couple of aspects of real-world data: Cleaning, Imputing, and Exploratory Data Analysis. Mainly, I want to compare different approaches to filling in the missing values of the dataset, then do some Model Building with some additional Dimensionality Reduction.
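Two of the imputation approaches that could be compared, sketched with scikit-learn under an assumed file name (the real comparison would also need a held-out evaluation of each approach):

```python
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# The file name is an assumption; only numeric columns are imputed here.
df = pd.read_csv("neighborhood_homes.csv")
numeric = df.select_dtypes("number")

mean_filled = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(numeric),
                           columns=numeric.columns)
knn_filled = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(numeric),
                          columns=numeric.columns)
```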