Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SPIDER (v2), the Synthetic Person Information Dataset for Entity Resolution, provides researchers with ready-to-use data for benchmarking duplicate-detection and entity resolution algorithms. The dataset focuses on person-level fields typical in customer or citizen records. Real-world person-level data is restricted by Personally Identifiable Information (PII) constraints, and the publicly available synthetic datasets are limited in scope, volume, or realism.
SPIDER addresses these limitations by providing a large-scale, realistic dataset containing first name, last name, email, phone, address, and date of birth (DOB) attributes. Using the Python Faker library, 40,000 unique synthetic person records were generated, followed by 10,000 controlled duplicate records derived using seven real-world transformation rules. Each duplicate record is linked to its original base record and to the rule that produced it through the fields is_duplicate_of and duplication_rule.
Version 2 introduces major realism and structural improvements, enhancing both the dataset and the generation framework.
Enhancements in Version 2
- New cluster_id column to group base and duplicate records for improved entity-level benchmarking.
- Improved data realism with consistent field relationships: state and ZIP codes now match correctly; phone numbers are generated based on state codes; email addresses are logically related to name components.
- Refined duplication logic: Rule 4 updated for realistic address variation; Rule 7 enhanced to simulate shared accounts among different individuals (with distinct DOBs).
- Improved data validation and formatting for address, email, and date fields.
- Updated Python generation script for modular configuration, reproducibility, and extensibility.
Duplicate Rules (with real-world use cases)
1) Duplicate record with a variation in email address. Use case: same person using multiple email accounts.
2) Duplicate record with a variation in phone numbers. Use case: same person using multiple contact numbers.
3) Duplicate record with last-name variation. Use case: name changes or data entry inconsistencies.
4) Duplicate record with address variation. Use case: same person maintaining multiple addresses or moving residences.
5) Duplicate record with a nickname. Use case: same person using formal and informal names (Robert → Bob, Elizabeth → Liz).
6) Duplicate record with minor spelling variations in the first name. Use case: legitimate entry or migration errors (Sara → Sarah).
7) Duplicate record with multiple individuals sharing the same email and last name but different DOBs. Use case: realistic shared accounts among family members or households (benefits, tax, or insurance portals).
Output Format
The dataset is available in both CSV and JSON formats for direct use in data-processing, machine-learning, and record-linkage frameworks.
Data Regeneration
The included Python script can be used to fully regenerate the dataset and supports:
- Addition of new duplication rules
- Regional, linguistic, or domain-specific variations
- Volume scaling for large-scale testing scenarios
Files Included
- spider_dataset_v2_6_20251027_022215.csv
- spider_dataset_v2_6_20251027_022215.json
- spider_readme_v2.md
- SPIDER_generation_script_v2.py
- SupportingDocuments/ folder containing:
  - benchmark_comparison_script.py: script used to derive the F1 score.
  - Public_census_data_surname.csv: sample U.S. Census name and demographic data used for comparison.
  - ssa_firstnames.csv: Social Security Administration names dataset.
  - simplemaps_uszips.csv: ZIP-to-state mapping data used for phone and address validation.
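The cluster_id column described above lends itself to pairwise evaluation of a resolver. The following is a minimal sketch of that idea, not the logic of the bundled benchmark_comparison_script.py: the record_id key and the sample rows are invented for illustration, and only cluster_id, is_duplicate_of, and duplication_rule are field names taken from the description.

```python
from collections import defaultdict
from itertools import combinations


def true_pairs(records):
    """Build the set of ground-truth duplicate pairs from cluster_id."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec["cluster_id"]].append(rec["record_id"])
    pairs = set()
    for ids in clusters.values():
        # Every unordered pair inside one cluster counts as a true match.
        pairs.update(frozenset(p) for p in combinations(ids, 2))
    return pairs


def pairwise_f1(predicted, truth):
    """Return precision, recall and F1 over unordered record pairs."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Tiny in-memory stand-in for the CSV; record_id is a hypothetical key
# and all values are invented.
records = [
    {"record_id": 1, "cluster_id": "c1", "is_duplicate_of": None, "duplication_rule": None},
    {"record_id": 2, "cluster_id": "c1", "is_duplicate_of": 1, "duplication_rule": 5},
    {"record_id": 3, "cluster_id": "c2", "is_duplicate_of": None, "duplication_rule": None},
    {"record_id": 4, "cluster_id": "c2", "is_duplicate_of": 3, "duplication_rule": 1},
    {"record_id": 5, "cluster_id": "c3", "is_duplicate_of": None, "duplication_rule": None},
]

truth = true_pairs(records)
# Pretend output of a resolver: one correct pair, one false match.
predicted = {frozenset({1, 2}), frozenset({3, 5})}
precision, recall, f1 = pairwise_f1(predicted, truth)
print(precision, recall, f1)
```

Scoring unordered pairs keeps the metric independent of which record in a cluster is treated as the base record.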
The Multiple Indicator Cluster Survey (MICS) is a household survey programme developed by UNICEF to assist countries in filling data gaps for monitoring human development in general and the situation of children and women in particular. MICS is capable of producing statistically sound, internationally comparable estimates of social indicators. The current round of MICS is focused on providing a monitoring tool for the Millennium Development Goals (MDGs), the World Fit for Children (WFFC), as well as for other major international commitments, such as the United Nations General Assembly Special Session (UNGASS) on HIV/AIDS and the Abuja targets for malaria.
Survey Objectives
The 2006 Palestinian Refugee Camps, Lebanon Multiple Indicator Cluster Survey has as its primary objectives:
- To provide up-to-date information for assessing the situation of children and women in the Palestinian refugee camps and gatherings in Lebanon;
- To furnish data needed for monitoring progress toward goals established in the Millennium Declaration, the goals of A World Fit For Children (WFFC), and other internationally agreed upon goals, as a basis for future action;
- To contribute to the improvement of data and monitoring systems in the Palestinian refugee camps and gatherings in Lebanon and to strengthen technical expertise in the design, implementation, and analysis of such systems.
Survey Content
MICS questionnaires are designed in a modular fashion that can be easily customized to the needs of a country. They consist of a household questionnaire, a questionnaire for women aged 15-49 and a questionnaire for children under the age of five (to be administered to the mother or caretaker). Other than a set of core modules, countries can select which modules they want to include in each questionnaire.
Survey Implementation
The surveys are typically carried out by government organizations, with the support and assistance of UNICEF and other partners. Technical assistance and training for the surveys is provided through a series of regional workshops, covering questionnaire content, sampling and survey implementation; data processing; data quality and data analysis; report writing and dissemination.
Survey Results
Results from the surveys, including national reports, standard sets of tabulations and micro level datasets will all be made widely available after completion of the surveys. Results from the surveys will also be made available in DevInfo format. DevInfo v5.0 is a powerful database system which has been adapted from UNICEF's ChildInfo technology to specifically monitor progress towards the Millennium Development Goals. MICS Results will also be available through UNICEF's web site dedicated to monitoring the situation of children and women at www.childinfo.org. Results of the prior round of MICS can already be found at this site.
The survey is representative and covers the whole of Palestinian refugee camps and gatherings in Lebanon.
Households (defined as a group of persons who usually live and eat together)
De jure household members (defined as members of the household who usually live in the household, which may include people who did not sleep in the household the previous night, but does not include visitors who slept in the household the previous night but do not usually live in the household)
Women aged 15-49
Children aged 0-4
The survey covered all de jure household members (usual residents), all women aged 15-49 years resident in the household, and all children aged 0-4 years (under age 5) resident in the household.
Sample survey data [ssd]
The sample for the Multiple Indicator Cluster Survey (MICS) in Palestinian Refugee Camps and Gatherings in Lebanon was designed to provide estimates on a large number of indicators on the situation of children and women at the geographical area and camp/gathering level, for urban and rural areas, and for 12 camps and 12 gatherings in 5 geographical areas. With this design we could monitor a large number of women and children indicators at the geographical area and camp level for urban and rural areas.
The sample population (based on the Palestinian Refugee Camps and Gatherings in Lebanon Census of 1999) was divided into equal clusters, each containing 20 households (1,300 clusters in total). Sample clusters (310 clusters, i.e. 6,200 households) were drawn uniformly with a random start and a sampling fraction of 0.25.
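The draw described above can be illustrated with a short sketch. It assumes systematic selection with a random start, which the source describes only loosely, and it derives the selection interval from the two cluster counts (1,300 and 310) rather than from the stated sampling fraction. The function name and the seed are illustrative.

```python
import random


def systematic_sample(n_units, n_sample, rng):
    """Pick n_sample unit indices at a fixed interval after a random start."""
    interval = n_units / n_sample      # fractional interval, about 4.19 here
    start = rng.random() * interval    # random start inside the first interval
    return [min(int(start + i * interval), n_units - 1) for i in range(n_sample)]


HOUSEHOLDS_PER_CLUSTER = 20
rng = random.Random(2006)              # fixed seed so the draw is repeatable

clusters = systematic_sample(1300, 310, rng)
households = len(clusters) * HOUSEHOLDS_PER_CLUSTER
print(len(clusters), households)
```

Because every cluster holds the same number of households, selecting clusters at a constant interval gives each household the same chance of selection.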
No major deviations from the original sample design were made. All sample enumeration areas were accessed and successfully interviewed with good response rates.
Face-to-face [f2f]
Three sets of questionnaires were used in the survey: 1) a household questionnaire was used to collect information on all household members, the household, and the dwelling; 2) a women’s questionnaire administered in each household to all women aged 15-49 years; 3) an under-5 questionnaire, administered to mothers or caretakers of all children under 5 living in the household.
The questionnaires included the following modules:
Household Questionnaire: Household Listing, Education, Water and Sanitation Facilities, Household Background Characteristics, Child Labour, and Salt Iodization.
Questionnaire for Individual Women: Child Mortality, Tetanus Toxoid, Maternal and Newborn Health, Contraception, and HIV/AIDS.
Questionnaire for Children Under Five: Birth Registration and Early Learning, Vitamin A, Breastfeeding, Care of Illness, Immunization, and Anthropometry.
The questionnaires are based on the MICS3 model questionnaire. Changes in format were made to the UNICEF MICS3 model Arabic version questionnaires that were pre-tested during March 2006.
Data were processed in clusters, with each cluster being processed as a complete unit through each stage of data processing. Each cluster goes through the following steps:
1) Questionnaire reception
2) Office editing and coding
3) Data entry
4) Structure and completeness checking
5) Verification entry
6) Comparison of verification data
7) Back up of raw data
8) Secondary editing
9) Edited data back up
After all clusters are processed, all data is concatenated together and then the following steps are completed for all data files:
10) Export to SPSS in 4 files (hh - household, hl - household members, wm - women, ch - children under 5)
11) Recoding of variables needed for analysis
12) Adding of sample weights
13) Calculation of wealth quintiles and merging into data
14) Structural checking of SPSS files
15) Data quality tabulations
16) Production of analysis tabulations
Details of each of these steps can be found in the data processing documentation, data editing guidelines, data processing programs in CSPro and SPSS, and tabulation guidelines.
Data entry was conducted by 12 data entry operators in two shifts, supervised by 2 data entry supervisors, using a total of 7 computers (6 data entry computers plus one supervisor's computer). All data entry was conducted at the GenCenStat head office using manual data entry. For data entry, CSPro version 2.6.007 was used with a highly structured data entry program, using a system-controlled approach that controlled entry of each variable. All range checks and skips were controlled by the program, and operators could not override these. A limited set of consistency checks was also included in the data entry program. In addition, the calculation of anthropometric Z-scores was also included in the data entry programs for use during analysis. Open-ended responses ("Other" answers) were not entered or coded, except in rare circumstances where the response matched an existing code in the questionnaire.
Structure and completeness checking ensured that all questionnaires for the cluster had been entered, were structurally sound, and that women's and children's questionnaires existed for each eligible woman and child.
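The completeness part of this check can be pictured as a set comparison between the people listed as eligible and the questionnaires actually entered. The sketch below is an illustration in Python, not the CSPro program used in the survey; the identifier layout (cluster, household, line number) and the sample values are assumptions.

```python
def completeness_report(eligible, entered):
    """Compare eligible person IDs with entered questionnaire IDs.

    Returns (missing, unexpected): eligible IDs with no questionnaire,
    and entered questionnaires that match no eligible person.
    """
    eligible, entered = set(eligible), set(entered)
    return sorted(eligible - entered), sorted(entered - eligible)


# Hypothetical IDs: (cluster, household, line number in the household listing).
eligible_women = [(1, 1, 2), (1, 2, 2), (1, 3, 4)]
entered_women = [(1, 1, 2), (1, 3, 4), (1, 3, 5)]

missing, unexpected = completeness_report(eligible_women, entered_women)
print("missing:", missing)
print("unexpected:", unexpected)
```

A cluster would pass this check only when both lists come back empty, for women's and children's questionnaires alike.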
100% verification of all variables was performed using independent verification, i.e. double entry of data, with separate comparison of data followed by modification of one or both datasets to correct keying errors by original operators who first keyed the files.
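The comparison step of double entry amounts to a field-by-field diff of two independently keyed files. The following sketch shows that idea in Python under stated assumptions: the survey used CSPro for this, and the record keys, variable names, and values here are invented for illustration.

```python
def compare_entries(first, second):
    """List every (record_id, variable, first_value, second_value) mismatch."""
    mismatches = []
    for rec_id in sorted(set(first) | set(second)):
        a = first.get(rec_id, {})
        b = second.get(rec_id, {})
        for var in sorted(set(a) | set(b)):
            if a.get(var) != b.get(var):
                mismatches.append((rec_id, var, a.get(var), b.get(var)))
    return mismatches


# Two hypothetical keyings of the same cluster; "age" was transposed once.
first_entry = {
    "001-01": {"age": 27, "sex": 2},
    "001-02": {"age": 34, "sex": 2},
}
second_entry = {
    "001-01": {"age": 27, "sex": 2},
    "001-02": {"age": 43, "sex": 2},
}

for mismatch in compare_entries(first_entry, second_entry):
    print(mismatch)
```

Each reported mismatch would then be resolved against the paper questionnaire before one or both files are corrected.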
After completion of all processing in CSPro, all individual cluster files were backed up before concatenating data together using the CSPro file concatenate utility.
Data editing took place at a number of stages throughout the processing (see Other processing), including: a) Office editing and coding b) During data entry c) Structure checking and completeness d) Secondary editing e) Structural checking of SPSS data files
Detailed documentation of the editing of data can be found in the data processing guidelines in the MICS Manual (http://www.childinfo.org/mics/mics3/manual.php)
The response rates for households, women and children were high. Of the 6200 households selected for the sample, only 33 could not be interviewed, making the household response rate 99.5 percent.
In the interviewed households, 4001 ever married women (age 15-49) were identified. Of these, 3955 were successfully interviewed, yielding a response rate of 98.9 percent. In addition, 2431 children under age five were listed in the household questionnaire. Questionnaires were completed for 2381 of these children, which corresponds to a response rate of 97.9 percent.
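The three response rates quoted above follow directly from the counts given in the text. A short check, using an illustrative helper name:

```python
def response_rate(completed, eligible):
    """Completed interviews as a percentage of eligible units, to one decimal."""
    return round(100 * completed / eligible, 1)


household_rate = response_rate(6200 - 33, 6200)   # selected minus not interviewed
women_rate = response_rate(3955, 4001)
children_rate = response_rate(2381, 2431)

print(household_rate, women_rate, children_rate)
```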
Estimates from a sample survey are affected by two types of errors: 1) non-sampling errors and 2) sampling