Data sets referred to in our article that examines the problem of false positives in automated census linking.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Average predicted ancestry and variance in predicted ancestry for candidate reference breed individuals when filtered on minimum predicted ancestry.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The recent digitization of complete count census data is an extraordinary opportunity for social scientists to create large longitudinal datasets by linking individuals from one census to another or from other sources to the census. We evaluate different automated methods for record linkage, performing a series of comparisons across methods and against hand linking. We have three main findings that lead us to conclude that automated methods perform well. First, a number of automated methods generate very low (less than 5%) false positive rates. The automated methods trace out a frontier illustrating the tradeoff between the false positive rate and the (true) match rate. Relative to more conservative automated algorithms, humans tend to link more observations but at a cost of higher rates of false positives. Second, when human linkers and algorithms use the same linking variables, there is relatively little disagreement between them. Third, across a number of plausible analyses, coefficient estimates and parameters of interest are very similar when using linked samples based on each of the different automated methods. We provide code and Stata commands to implement the various automated methods.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The number of variants queried by each assay and the number of individuals from the 20 reference breeds genotyped using each assay.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Academic Family Tree is a live, crowdsourced project that documents academic mentoring relationships across many fields. Data are updated continually. This snapshot was taken on 2024-10-18.
We welcome inquiries about this dataset and are interested in learning about any new results you uncover. Contact: davids@ohsu.edu.
This dataset contains key tables from the Academic Family Tree, including information on names/institutions of academic mentoring relationships and semi-automated links of authors to publications and US grants (NSF, NIH only). A subset of publication and grant links has been validated by human users. To save space, only unique identifiers are included for publications (PMID, DOI) and grants (federal project number), without other metadata (author, title, journal, principal investigator, etc.). These identifiers should be adequate to link to other databases. Note also that author-publication links are split across several separate files. These files should be concatenated into a single table to generate a complete dataset. Some additional information is available here: https://academictree.org/export.php.
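The concatenation step described above can be sketched as follows. This is a minimal illustration, not the project's own tooling: it assumes the part files are CSV files that share an identical header row, which the description does not state, and the file names in the usage comment are hypothetical.

```python
import csv


def concatenate_csv_parts(part_paths, output_path):
    """Append several CSV files that share one header into a single CSV file.

    Assumption: every part file starts with the same header row.
    Returns the number of data rows written (header excluded).
    """
    rows_written = 0
    header = None
    with open(output_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for part in part_paths:
            with open(part, newline="", encoding="utf-8") as in_file:
                reader = csv.reader(in_file)
                part_header = next(reader, None)
                if part_header is None:
                    continue  # empty file, nothing to copy
                if header is None:
                    # First non-empty part: write its header once.
                    header = part_header
                    writer.writerow(header)
                elif part_header != header:
                    raise ValueError(f"{part} has a different header")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Hypothetical usage; substitute the actual part files from the download:
# concatenate_csv_parts(["authorPub_1.csv", "authorPub_2.csv"], "authorPub_all.csv")
```

Checking the header of each part before appending guards against silently mixing tables with different column layouts.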
These data are associated with Liénard, J.F., Achakulvisut, T., Acuna, D.E. et al. Intellectual synthesis in mentorship determines success in academic careers. Nature Communications 9, 4840 (2018) (https://doi.org/10.1038/s41467-018-07034-y). Please cite this publication in work that uses this dataset.
Funded by NSF Award 1933675.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ancestry proportion statistics for the self-assignment of reference panel members from samples of ≤50 or ≤100 individuals from the candidate reference breed individuals.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Eastern Canada (ECA) Flocks data set consists of manually annotated images from the Common Eider (COEI, Somateria mollissima) Winter Survey and the Greater Snow Geese (GSGO, Anser caerulescens atlanticus) Spring Survey. The images were taken in Eastern Canada using fixed-wing aircraft and manually annotated with ImageJ's Cell Counter plugin. We selected and annotated the ECA Flocks images in order to test the precision of the CountEm flock size estimation method. ECA Flocks includes 179 COEI and 99 GSGO single-flock images. We cut each image manually to a rectangle that excluded large parts of the image with no birds. Both versions (original and cut) of each image are available in the data set. We manually annotated 637,555 (124,309 COEI and 514,235 GSGO) bird positions in the cut images from both surveys. Each bird has an associated "Type", which refers to species and/or sex. Sex identification was only possible for adult common eiders, since females and immature males are brown birds whereas adult males have mainly white plumage. In the COEI images, 64,484 males and 58,029 females were identified, as well as 1,796 birds of other species. In the GSGO images, 504,891 Snow Geese and 9,344 birds of other species were labeled. A .csv file including all annotated bird positions and types is available for each image. The COEI and GSGO photos of the ECA Flocks data set were taken in 2006 and 2018, and in 2016-2018, respectively. We selected these photos in order to include images of different quality and resolution. COEI and GSGO flock sizes range from 6 to 4,154 and from 43 to 36,241, respectively. There is high variability in light conditions, backgrounds, and the number and spatial arrangement of birds across the images. The data set is therefore potentially useful for testing the precision of methods that analyze imagery to estimate the abundance of animals by directly detecting, identifying and counting individuals.
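As a rough illustration of how the per-image annotation files might be summarized, the sketch below tallies birds by their "Type" value. The exact layout of the .csv files is not given in the description, so the column header `Type` is an assumption taken from the wording above and may need to be adjusted to the real header.

```python
import csv
from collections import Counter


def count_birds_by_type(csv_file, type_column="Type"):
    """Tally annotated bird positions by type for one image's annotation file.

    csv_file is an open text file (or any iterable of CSV lines).
    Assumption: the file has a header row containing `type_column`.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[type_column] for row in reader)


# Hypothetical usage with one annotation file from the data set:
# with open("some_image_annotations.csv", newline="") as f:
#     print(count_birds_by_type(f))
```

The total of the returned counts gives the annotated flock size for that image, which is the quantity a counting method would be compared against.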
The goal is to develop methods for estimating the size of breeding populations for different waterbird species and to provide protocols for monitoring colonial species at Chase Lake NWR and other remote nesting colonies. The specific objectives of this study are: 1) to evaluate and improve the accuracy of an automated, pixel-based mapping method (using Geographic Information System applications) used to estimate populations of American White Pelicans, and to apply these methods to counting historic and recent records of Double-crested Cormorants and gulls at Chase Lake NWR, and 2) to develop, assess, and evaluate the accuracy of a perimeter-count method to estimate populations of egrets and herons nesting in shrubs at Chase Lake NWR. Development of survey techniques will enhance the capability of the USFWS for detecting and responding to disease events and monitoring colonial waterbirds at Chase Lake and other National Wildlife Refuges.
Also included are proposal reviewer summaries, uploaded as a separate digital holding.
Assessing underwater biodiversity is labour-intensive and costly, but it is crucial for measuring the extent of the decline in local fish stocks. In most cases, Underwater Visual Census (UVC) is the preferred method; however, it can be costly in terms of human effort and is limited by meteorological and logistical factors. Advances in technology allow the use of more autonomous video recording methods (i.e. Remotely Operated Vehicles (ROVs)), which address these limitations. This study used a transect-wise UVC coupled with diver-operated videos (DOV). For the video analysis, a comprehensive, fully automated pipeline was developed to extract frames from DOV and perform colour correction. This pipeline integrates a YOLO-based model to detect 20 Mediterranean fish species and validate the presence or absence of each species within individual transects. This study was conducted to evaluate the feasibility of using video-based methods for UVC with minimal human input. The result of automa...
1. Study area and data collection
The training dataset (DATA_T) was gathered in eight different locations in the Mediterranean Sea along the French Riviera, following the same UVC protocol on each site (Harmelin-Vivien et al., 1985). Depths ranged from 1 to 37 m, and data collection was carried out throughout 2022 (cold and warm seasons) to cover the full range of conditions and possibilities of fish occurrences. The experimental dataset (DATA_E) was recorded in October 2023 in and around two protected areas, one no-take zone (Cap Roux) and one Natura2000 site (Corniche Varoise), which both have elevated biodiversity. The specific coordinates and metadata can be found in the supplementary material (Table S1). A total of 64 videos, each corresponding to a transect, from 14 sites (8 on seagrass meadows and 6 on rocky substrates) were evaluated and compared. Each site consists of 3 to 6 transects, depending on the availability of video recordings and UVC data from the divers.
The videos were ob...
# Data from: Towards a fully automated underwater census for fish assemblages in the Mediterranean Sea
https://doi.org/10.5061/dryad.f7m0cfz6f
The training dataset (DATA_T) was gathered in eight different locations in the Mediterranean Sea along the French Riviera, following the same UVC protocol on each site. Depths ranged from 1 to 37 m, and data collection was carried out throughout 2022 (cold and warm seasons) to cover the full range of conditions and possibilities of fish occurrences.
The experimental dataset (DATA_E) was recorded in October 2023 in and around two protected areas, one no-take zone (Cap Roux) and one Natura2000 site (Corniche Varoise), which both have elevated biodiversity. A total of 64 videos, each corresponding to a transect, from 14 sites (8 on seagrass meadows and 6 on rocky substrates) were evaluated and compared. Each site consists of...
https://www.icpsr.umich.edu/web/ICPSR/studies/8882/terms
In preparation for the 1990 Census of Population and Housing, test censuses were conducted in central Los Angeles County and in east central Mississippi in order to test census procedures in representative urban and rural settings. Several goals were identified for the 1986 test censuses including (1) the examination of new techniques for automating questionnaire processing, (2) the production of maps from an automated geographic database, (3) the testing of a new questionnaire design, and (4) a review of adjustment methodology. This data collection offers complete-count data for central Los Angeles County on race and Spanish origin for total persons and for those 18 years and over.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Number of individuals for each reference breed assigned to their breed of registration by minimum ancestry threshold.
2020 Census tracts are small, relatively permanent statistical subdivisions of a county or equivalent entity, and are reviewed and updated by local participants prior to each decennial census as part of the Census Bureau’s Participant Statistical Areas Program (PSAP). The primary purpose of census tracts is to provide a stable set of geographic units for the presentation of decennial census data. Census tracts generally have a total population size between 1,200 and 8,000 people with an optimum size of 4,000 people. The spatial size of census tracts varies widely depending on the density of settlement. Ideally, census tract boundaries remain stable over time to facilitate statistical comparisons from census to census. However, physical changes in street patterns caused by highway construction, new development, and so forth, may require boundary revisions. In addition, significant changes in population may result in splitting or combining census tracts. State and county boundaries always are census tract boundaries in the standard census geographic hierarchy, but tracts can cross the same kinds of boundaries that block groups can. Census tract numbers have up to a 4-character basic number and may have an optional 2-character suffix. The census tract numbers (used as names) eliminate any leading zeroes and append a suffix only if required. The 6-digit census tract codes, however, include leading zeroes and have an implied decimal point for the suffix. Census tract codes (000100 to 998999) are unique within a county or equivalent area.
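The relationship between the 6-digit tract code and the tract number used as a name can be sketched in a few lines. This is an illustrative helper, not Census Bureau code; it reads "append a suffix only if required" as "omit the suffix when it is 00", which is an interpretation of the description above.

```python
def tract_code_to_name(code: str) -> str:
    """Convert a 6-digit census tract code to its tract number (name).

    The first four digits are the basic number and the last two are the
    suffix, with an implied decimal point between them.
    """
    if len(code) != 6 or not (code.isascii() and code.isdigit()):
        raise ValueError("tract code must be exactly 6 digits")
    basic, suffix = code[:4], code[4:]
    # Names drop leading zeroes from the basic number.
    basic_number = basic.lstrip("0") or "0"
    # Assumption: a suffix of "00" means no suffix is required.
    return basic_number if suffix == "00" else f"{basic_number}.{suffix}"
```

For example, under this reading the code `000100` corresponds to tract number `1`, and `998999` corresponds to `9989.99`.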
https://fred.stlouisfed.org/legal/#copyright-public-domain
Graph and download economic data for Advance Retail Sales: Auto and Other Motor Vehicle Dealers (RSAOMVN) from Jan 1992 to Jul 2025 about retail trade, vehicles, sales, retail, and USA.
https://fred.stlouisfed.org/legal/#copyright-public-domain
Graph and download economic data for Advance Retail Sales: Auto and Other Motor Vehicle Dealers (MARTSMPCSM441XUSS) from Feb 1992 to Jun 2025 about vehicles, retail trade, percent, sales, retail, and USA.
The Accessibility Observatory at the University of Minnesota produces census block level access to jobs data by four modes (auto, bike, walk and transit). https://www.cts.umn.edu/programs/ao
Accessibility is the ease and feasibility of reaching valued destinations. It can be measured for a wide array of transportation modes, to different types of destinations, and at different times of day. There are a variety of ways to define accessibility, but the number of destinations reachable within a given travel time is the most comprehensible and transparent as well as the most directly comparable across different geographies.
Reports published by mode have detailed information on how these data are produced.
Auto: https://hdl.handle.net/11299/266465
Bike: https://hdl.handle.net/11299/266466
Walk: https://hdl.handle.net/11299/266468
Transit: https://hdl.handle.net/11299/266467
These data are provided in two formats. The first format is an ESRI Map Package that includes feature classes for each mode. The second format is a zipped shapefile of census blocks and four CSV files, one for each mode. For every census block, for each mode, there is a reported value of the number of jobs that can be reached within a specified time threshold. The bicycle mode uses Level of Traffic Stress (LTS) 3 (medium stress) to define usable routes.
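The accessibility measure described above, the number of jobs reachable within a travel time threshold, can be illustrated with a small sketch. It shows only the definition of the measure, not the Observatory's actual pipeline (see the mode reports for that). The block identifiers, travel times and job counts are hypothetical, and treating the threshold as inclusive is an assumption.

```python
def jobs_reachable(travel_minutes, jobs_by_block, threshold_minutes):
    """Count jobs in destination blocks reachable within the time threshold.

    travel_minutes: travel time from one origin block to each destination block.
    jobs_by_block: number of jobs located in each destination block.
    Assumption: a destination exactly at the threshold counts as reachable.
    """
    return sum(
        jobs_by_block.get(block, 0)
        for block, minutes in travel_minutes.items()
        if minutes <= threshold_minutes
    )


# Hypothetical example for a single origin block.
travel = {"A": 10, "B": 25, "C": 45}
jobs = {"A": 120, "B": 300, "C": 80}
print(jobs_reachable(travel, jobs, threshold_minutes=30))  # 420
```

Repeating the calculation per mode, with that mode's travel times, yields one value per census block and mode, matching the layout of the four CSV files.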
Each of the State of Maryland’s 1,406 2010 census tracts was analyzed to determine whether it represented a typical census tract as defined by the U.S. Bureau of the Census. Nationally, census tracts optimally contain 4,000 inhabitants and generally range from 1,200 to 8,000 persons. In Maryland the average census tract contains 4,106 persons. Nationally, the number of housing units in a census tract generally ranges from 480 to 3,200, with an optimum size of 1,600 housing units. In Maryland the average census tract contains 1,692 housing units. The Emergency Management Planning Database and the Emergency Planning Vulnerable Population Index are intended to help State agency emergency officials plan tactics, develop strategies, allocate resources, and prioritize responses for emergencies, and to identify potentially vulnerable population areas for special attention.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
This layer represents the boundaries of 2020 Census Blocks in Hamilton County.
Census blocks are:
The smallest geographic area for which the Bureau of the Census collects and tabulates decennial census data.
Statistical areas bounded by visible features such as roads, streams, and railroad tracks, and by nonvisible boundaries such as property lines, city, township, school district, county limits and short line-of-sight extensions of roads.
The building blocks for all geographic boundaries the Census Bureau tabulates data for, such as tracts, places, and American Indian Reservations.
Generally small in area. In a city, a census block looks like a city block bounded on all sides by streets. Census blocks in suburban and rural areas may be large, irregular, and bounded by a variety of features, such as roads, streams, and transmission lines. In remote areas, census blocks may encompass hundreds of square miles.
Numbered uniquely with a four-digit census block number ranging from 0000 to 9999 nesting within each census tract, which nest within state and county. The first digit of the census block number identifies the block group. Block numbers beginning with a zero (in Block Group 0) are associated with water-only areas.
Delineated by the U.S. Census Bureau once every ten years. An automated computer process looks for all visible and nonvisible features in our geographic database that should be a block boundary and creates a block each time those features create a polygon.
The smallest level of geography you can get basic demographic data for, such as total population by age, sex, and race.
Census blocks are not:
Delineated based on population. In fact, many census blocks do not have any population.
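The nesting of block numbers within tract, county and state can be sketched by splitting a full block identifier into its parts. The 15-digit layout used here (2-digit state, 3-digit county, 6-digit tract, 4-digit block) is the standard Census GEOID convention rather than something stated in this layer's description, and the sample identifier in the test is illustrative only, not a specific block from this layer.

```python
from typing import NamedTuple


class BlockId(NamedTuple):
    state: str
    county: str
    tract: str
    block: str
    block_group: str
    water_only: bool


def parse_block_geoid(geoid: str) -> BlockId:
    """Split a 15-digit census block GEOID into its nested components.

    Assumed layout: state (2) + county (3) + tract (6) + block (4).
    """
    if len(geoid) != 15 or not (geoid.isascii() and geoid.isdigit()):
        raise ValueError("block GEOID must be exactly 15 digits")
    block = geoid[11:]
    return BlockId(
        state=geoid[:2],
        county=geoid[2:5],
        tract=geoid[5:11],
        block=block,
        # The first digit of the block number identifies the block group.
        block_group=block[0],
        # Block numbers beginning with zero are associated with water-only areas.
        water_only=block[0] == "0",
    )
```

Keeping every component as a string preserves the leading zeroes that the codes rely on.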
The Street Network Files (SNFs), formerly known as the Area Master Files (AMFs), were first created in the early 1970s as the basis for retrieval of Census data for user-defined geographic areas. More recently, the Street Network Files have also been used in Census data collection, specifically in the delineation of enumeration areas and the automated production of census collection and reference maps. The Street Network Files contain information on visible features such as streets, hydrography, railroad tracks and power lines, and information on invisible (or abstract) features such as municipal and park boundaries. The files also contain attribute information on the features (for example, street and hydrographic names and address ranges for streets with assigned addresses). The 1996 Street Network Files (SNFs) are available as standard products for 50 urban centres: 25 census metropolitan areas (CMAs), 18 census agglomerations (CAs), and seven census subdivisions (CSDs) outside CMA/CA areas. Street Network Files provide full digital coverage for 23 urban centres and partial coverage for 27. In total, 344 census subdivisions are covered by a Street Network File. These CSDs represent a population of approximately 18 million people or 62% of the total population of Canada. A comprehensive list of CMA/CAs and their component census subdivisions covered by a Street Network File is presented in Appendix D. The 1996 Street Network Files contain street information and address ranges current to Census Day, May 14, 1996.
The Central Statistical Organization (CSO) conducted the fifth Economic Census in 2005 in all the States/UTs in collaboration with the State Directorates of Economics and Statistics. The first Economic Census was conducted in 1977, covering only non-agricultural establishments, and the three Economic Censuses subsequently carried out in 1980, 1990 and 1998 covered all agricultural and non-agricultural enterprises except those engaged in crop production and plantation. There was no change in the coverage of the fifth Economic Census as compared to the fourth Economic Census. The Economic Census not only provides an updated frame for detailed follow-up surveys but also gives basic entrepreneurial data for planning and development, especially for the unorganized sector of the economy.
There are certain new features in the fifth Economic Census. Addresses of the enterprises employing 10 workers or more were collected for the first time in the fifth Economic Census through the Address Slip. At present the country does not maintain a Business Register. The directory of enterprises to be generated from the Address Slip would be the basic input for preparation of a Business Register. For the first time, data collected in the fifth Economic Census were processed through Intelligent Character Recognition (ICR) technology.
The EC-2005 "ALL INDIA REPORT" contains the all-India figures on the number of enterprises and their employment, cross-classified according to location, major activity group, type of establishment, size class of employment, etc. The disaggregated data for States/UTs are also included in the report.
All the States/UTs in the country
Establishment
Economic Census (EC) is the complete count of all entrepreneurial units located within the geographical boundaries of the country. All units engaged in the production or distribution of goods or services other than for the sole purpose of own consumption are counted. While all units engaged in nonagricultural activities are covered, in the agricultural sector units in crop production and plantation activities are excluded.
Census/enumeration data [cen]
Face-to-face [f2f]
All questionnaires are provided as external resources
Intelligent Character Recognition (ICR) technology, which is also known as Automated Forms Processing, was used to process the EC-2005 data. Automated Forms Processing technology enables the user to process documents from their images or directly from paper and convert them to computer readable data.
The schedules of the Fifth EC were scanned/digitized at the fifteen regional Data Processing Centres of the Registrar General of India (RGI). After the edit programme was run, the error list files were handed over to the State Governments for correction. The DES officials of the State Governments corrected the error files in two or three cycles and then sent the data files to RGI Headquarters for final touches before they were sent to the Computer Centre, MOSPI. At the Computer Centre, remaining errors in the data files were reduced further by applying automatic corrections.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Genotype data for 48,776 registered individuals from 20 breeds were used to establish the reference population.