This child item describes Python code used to query census data from the TigerWeb Representational State Transfer (REST) services and the U.S. Census Bureau Application Programming Interface (API). These data were needed as input feature variables for a machine learning model to predict public supply water use for the conterminous United States. Census data were retrieved for public-supply water service areas, but the census data collector could be used to retrieve data for other areas of interest. This dataset is part of a larger data release using machine learning to predict public supply water use for 12-digit hydrologic units from 2000-2020. Data retrieved by the census data collector code were used as input features in the public supply delivery and water use machine learning models. This page includes the following file: census_data_collector.zip - a zip file containing the census data collector Python code used to retrieve data from the U.S. Census Bureau and a README file.
Selected variables from the most recent 5-year American Community Survey (ACS, released 2023), aggregated by Community Area. Additional years will be added as they become available. The underlying algorithm used to create the dataset calculates the percentage of a census tract that falls within the boundaries of a given community area. Because census tracts and community area boundaries are not aligned, these figures should be considered estimates. Total population in this dataset: 2,647,621. Total Chicago population per ACS 2023: 2,664,452. % Difference: -0.632%. There are different approaches in common use for displaying Hispanic or Latino population counts. In this dataset, following the approach taken by the Census Bureau, a person who identifies as Hispanic or Latino will also be counted in the race category with which they identify. However, again following the Census Bureau data, there is also a column for White Not Hispanic or Latino. Code can be found here: https://github.com/Chicago/5-Year-ACS-Survey-Data Community Area Shapefile: https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6 Census Area Python Package Documentation: https://census-area.readthedocs.io/en/latest/index.html
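The tract-to-community-area apportionment described above can be sketched in plain Python. This is a hypothetical illustration of the areal-weighting idea, not the City of Chicago's actual code (which is linked above); the tract IDs, populations, and overlap fractions are invented.

```python
# Hypothetical areal-weighting sketch: apportion each census tract's
# population to community areas by the fraction of the tract's area
# that falls inside each community area. All numbers are illustrative.
tract_population = {"17031010100": 5000, "17031010201": 3200}

# Fraction of each tract's area inside each community area (assumed values)
overlap = {
    ("17031010100", "ROGERS PARK"): 0.75,
    ("17031010100", "WEST RIDGE"): 0.25,
    ("17031010201", "ROGERS PARK"): 1.00,
}

community_population = {}
for (tract, area), frac in overlap.items():
    community_population[area] = (
        community_population.get(area, 0.0) + tract_population[tract] * frac
    )

print(community_population)
# Rogers Park gets 5000*0.75 + 3200*1.0; West Ridge gets 5000*0.25
```

In practice the overlap fractions would come from intersecting tract and community-area geometries (for example with the census-area package documented above), but the apportionment arithmetic is the same.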
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This workflow provides the prototype components of open dataset tools in the KNIME Python-based Geospatial Extension. Users can acquire data by simply defining the variable and geographic level. It contains 4 nodes: US2020 TIGER: for US basemaps (census block, block group, tract, and county); US2020 Census: for Decennial Census P.L. 94-171 redistricting data; US ACS-5: for American Community Survey (ACS) 5-year data; GeoView: for geodata visualization. Requirements: US Census API key: https://api.census.gov/data/key_signup.html; KNIME extension: KNIME Python Integration; Python packages: geopandas, requests, matplotlib.
The US Census Bureau conducts the American Community Survey (ACS) 1-year and 5-year surveys, which record various demographics and provide public access through APIs. I called the APIs from a Python environment using the requests library, then cleaned and organized the data into a usable format.
ACS Subject data [2011-2019] was accessed using Python by following the below API Link:
https://api.census.gov/data/2011/acs/acs1?get=group(B08301)&for=county:*
The data was obtained in JSON format by calling the above API, then imported as a Python Pandas Dataframe. The 84 variables returned comprise 21 estimate values for various metrics, the 21 respective margins of error, and annotation values for each estimate and margin of error. This data then underwent various cleaning steps in Python: excess variables were removed and the columns were renamed. Web scraping was carried out to extract the variables' names and replace the codes in the column names of the raw data.
The above step was carried out for multiple ACS/ACS-1 datasets spanning 2011-2019 and then merged into a single Python Pandas Dataframe. The columns were rearranged, and the "NAME" column was split into two columns, namely 'StateName' and 'CountyName.' The counties for which no data was available were also removed from the Dataframe. Once the Dataframe was ready, it was separated into two new dataframes for separating State and County Data and exported into '.csv' format
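The retrieval-and-cleaning flow described above can be sketched as follows. The Census API returns JSON as a list of rows whose first row holds the column names; the payload below is a tiny fabricated sample in that shape (a real call would fetch it with requests.get on the URL above), and the renamed column labels are invented for illustration.

```python
import pandas as pd

# In practice:
#   resp = requests.get(
#       "https://api.census.gov/data/2011/acs/acs1?get=group(B08301)&for=county:*")
#   payload = resp.json()
# Fabricated sample payload in the same shape (header row + data rows):
payload = [
    ["B08301_001E", "B08301_001M", "NAME", "state", "county"],
    ["41424", "352", "Autauga County, Alabama", "01", "001"],
    ["74102", "519", "Baldwin County, Alabama", "01", "003"],
]

# First row becomes the column index, remaining rows become the data
df = pd.DataFrame(payload[1:], columns=payload[0])

# Replace variable codes with readable names (hypothetical labels)
df = df.rename(columns={"B08301_001E": "Total_Workers_Est",
                        "B08301_001M": "Total_Workers_MOE"})

# Split NAME into CountyName and StateName, as described above
df[["CountyName", "StateName"]] = df["NAME"].str.split(", ", expand=True)
df = df.drop(columns=["NAME"])
print(df.head())
```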
More information about the source of Data can be found at the URL below:
US Census Bureau. (n.d.). About: Census Bureau API. Retrieved from Census.gov
https://www.census.gov/data/developers/about.html
I hope this data helps you to create something beautiful and awesome. I will be posting a lot more databases shortly if I get time between assignments, submissions, and Semester Projects 🧙🏼‍♂️. Good Luck.
Selected variables from the most recent 5-year ACS Community Survey (released 2023), aggregated by Ward. Additional years will be added as they become available. The underlying algorithm used to create the dataset calculates the percentage of a census tract that falls within the boundaries of a given ward. Because census tracts and ward boundaries are not aligned, these figures should be considered estimates. Total population in this dataset: 2,649,803. Total population of Chicago reported by ACS 2023: 2,664,452. % Difference: -0.55%. There are different approaches in common use for displaying Hispanic or Latino population counts. In this dataset, following the approach taken by the Census Bureau, a person who identifies as Hispanic or Latino will also be counted in the race category with which they identify. However, again following the Census Bureau data, there is also a column for White Not Hispanic or Latino. The City of Chicago is actively soliciting community input on how best to represent race, ethnicity, and related concepts in its data and policy. Every dataset, including this one, has a "Contact dataset owner" link in the Actions menu. You can use it to offer any input you wish to share or to indicate if you would be interested in participating in live discussions the City may host. Code can be found here: https://github.com/Chicago/5-Year-ACS-Survey-Data Ward Shapefile: https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Wards-2023-Map/cdf7-bgn3 Census Area Python Package Documentation: https://census-area.readthedocs.io/en/latest/index.html
The United States Census Bureau’s international dataset provides estimates of country populations since 1950 and projections through 2050. Specifically, the dataset includes midyear population figures broken down by age and gender assignment at birth. Additionally, time-series data is provided for attributes including fertility rates, birth rates, death rates, and migration rates.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.census_bureau_international.
What countries have the longest life expectancy? In this query, 2016 census information is retrieved by joining the mortality_life_expectancy and country_names_area tables for countries larger than 25,000 km2. Without the size constraint, Monaco is the top result with an average life expectancy of over 89 years!
SELECT
age.country_name,
age.life_expectancy,
size.country_area
FROM (
SELECT
country_name,
life_expectancy
FROM
bigquery-public-data.census_bureau_international.mortality_life_expectancy
WHERE
year = 2016) age
INNER JOIN (
SELECT
country_name,
country_area
FROM
bigquery-public-data.census_bureau_international.country_names_area
WHERE
country_area > 25000) size
ON
age.country_name = size.country_name
ORDER BY
2 DESC
/* Limit removed for Data Studio Visualization */
LIMIT
10
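As noted above, these tables can be queried with the BigQuery Python client library. A minimal sketch is shown below, assuming the google-cloud-bigquery package is installed and GCP credentials are configured; the client call is not executed here because it requires authentication.

```python
# The life-expectancy query from above, as a Python string
QUERY = """
SELECT age.country_name, age.life_expectancy, size.country_area
FROM (
  SELECT country_name, life_expectancy
  FROM `bigquery-public-data.census_bureau_international.mortality_life_expectancy`
  WHERE year = 2016) age
INNER JOIN (
  SELECT country_name, country_area
  FROM `bigquery-public-data.census_bureau_international.country_names_area`
  WHERE country_area > 25000) size
ON age.country_name = size.country_name
ORDER BY 2 DESC
LIMIT 10
"""

def run_query(sql):
    # Requires: pip install google-cloud-bigquery, plus GCP credentials
    from google.cloud import bigquery
    return bigquery.Client().query(sql).to_dataframe()

# df = run_query(QUERY)  # not executed here: needs an authenticated client
```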
Which countries have the largest proportion of their population under 25? Over 40% of the world’s population is under 25 and greater than 50% of the world’s population is under 30! This query retrieves the countries with the largest proportion of young people by joining the age-specific population table with the midyear (total) population table.
SELECT
age.country_name,
SUM(age.population) AS under_25,
pop.midyear_population AS total,
ROUND((SUM(age.population) / pop.midyear_population) * 100,2) AS pct_under_25
FROM (
SELECT
country_name,
population,
country_code
FROM
bigquery-public-data.census_bureau_international.midyear_population_agespecific
WHERE
year = 2017
AND age < 25) age
INNER JOIN (
SELECT
midyear_population,
country_code
FROM
bigquery-public-data.census_bureau_international.midyear_population
WHERE
year = 2017) pop
ON
age.country_code = pop.country_code
GROUP BY
1,
3
ORDER BY
4 DESC /* Remove limit for visualization */
LIMIT
10
The International Census dataset contains growth information in the form of birth rates, death rates, and migration rates. Net migration is the net number of migrants per 1,000 population, an important component of total population and one that often drives the work of the United Nations Refugee Agency. This query joins the growth rate table with the area table to retrieve 2017 data for countries greater than 500 km2.
SELECT
growth.country_name,
growth.net_migration,
CAST(area.country_area AS INT64) AS country_area
FROM (
SELECT
country_name,
net_migration,
country_code
FROM
bigquery-public-data.census_bureau_international.birth_death_growth_rates
WHERE
year = 2017) growth
INNER JOIN (
SELECT
country_area,
country_code
FROM
bigquery-public-data.census_bureau_international.country_names_area
WHERE
country_area > 500) area
ON
growth.country_code = area.country_code
ORDER BY
2 DESC
LIMIT
10
United States Census Bureau
Terms of use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
See the GCP Marketplace listing for more details and sample queries: https://console.cloud.google.com/marketplace/details/united-states-census-bureau/international-census-data
https://creativecommons.org/publicdomain/zero/1.0/
Executive Summary Farmers' markets are an important part of building community, ethically sourcing food, and creating a culture around sustainable habits. In this project, I worked to source data for farmers' markets in North Carolina. Due to their impact on the community, I also joined this data with census data to obtain a better understanding of how they are distributed and what insights they can provide us socially and economically. This dataset can also be used with other census data as it has digestible location data and further research in social science fields.
Data The data include farmers' market data web-scraped from the North Carolina Department of Agriculture and Consumer Services, joined with census data from 2019, the most recent year I could find. The web scraping gathered each farmers' market's name, address, and contact info, while the census data provide total population, median income, and the number of people aged 18-30 by zip code. This data is unique in the field due to its recency: similar data exist through the Department of Agriculture, but they are often outdated and can contain mistakes at a more granular level. The script I've constructed pulls the most recent data for North Carolina.
Power Analysis I conducted a power analysis with the intention of determining whether the populations of zip codes with farmers' markets differ significantly from the average zip code population of North Carolina, using a significance level of .05 and a power of .8, resulting in a required sample of 127.52.
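The mechanics of a sample-size calculation like the one above can be sketched with the standard library, using the normal-approximation formula for a two-sided test. The effect size is not stated in the original write-up, so the 0.25 used here is an assumed value for illustration; the reported 127.52 would correspond to a slightly smaller effect size.

```python
from statistics import NormalDist
from math import ceil

# Normal-approximation sample size: n = ((z_alpha + z_beta) / d)^2
# alpha and power match the write-up; the effect size d = 0.25 is ASSUMED.
alpha, power, effect_size = 0.05, 0.80, 0.25

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value, ~1.96
z_beta = NormalDist().inv_cdf(power)            # power quantile, ~0.8416
n = ((z_alpha + z_beta) / effect_size) ** 2

print(round(n, 2), "->", ceil(n))  # required sample, rounded up
```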
Exploratory Data Analysis You can find exploratory data analysis in the eda.py file to better acclimate yourself with the data. There were 247 farmers' markets collected, and three census variables were attached. Other distribution metrics are included with visualizations as well as general information on the data.
Link to Github https://github.com/tejasj02/Farmers-Market-Data-Curation
Ethics statement This dataset was curated from publicly available sources with the intention of furthering research in this social science field. All scraping and data gathering was done ethically, without breaching any rules. Farmers' market data were obtained from the North Carolina Department of Agriculture and Consumer Services, while the census data were imported from the censusdata Python library. Data are public and up to date as of 11/25/2024; the collection code can be re-run with minor adjustments to refresh them. The dataset is open source and should adhere to normal ethical boundaries.
Dataset Card for Census Income (Adult)
This dataset is a precise version of the UCI Adult (Census Income) dataset. The UCI repository happens to host it under two links, but we checked and confirmed that they are identical. We used the following Python script to create this Hugging Face dataset:
import pandas as pd
from datasets import Dataset, DatasetDict, Features, Value, ClassLabel
url1 = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" url2 =… See the full description on the dataset page: https://huggingface.co/datasets/cestwc/census-income.
Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains many files. It provides harmonized geographic boundary shapefiles and crosswalks sourced from the U.S. Census Bureau and accessed via the pygris Python library. It includes:
ZCTA (ZIP Code Tabulation Area) shapefiles
County shapefiles
ZCTA-to-county crosswalk files
Unique lists of ZCTAs and counties by year
Shapefiles: the column names and column types are harmonized for consistency across years, and the cartographic boundaries are selected across years, enabling longitudinal spatial analysis and integration with external datasets such as demographic or health data. Crosswalks are fetched directly from U.S. Census sources and processed to ensure compatibility and ease of use. All files are structured to support reproducible, year-over-year spatial analyses.
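Column harmonization across shapefile vintages, as described above, might look like the following pandas sketch. The rename maps reflect the real pattern of Census vintage-suffixed column names (e.g. ZCTA5CE10 vs ZCTA5CE20), but the specific maps and sample rows here are invented; the actual dataset's schema may differ.

```python
import pandas as pd

# Hypothetical per-year rename maps: Census shapefiles change column
# names across vintages (e.g. ZCTA5CE10 in 2010-based files, ZCTA5CE20
# in 2020-based files).
RENAMES = {
    2015: {"ZCTA5CE10": "zcta", "GEOID10": "geoid"},
    2022: {"ZCTA5CE20": "zcta", "GEOID20": "geoid"},
}

def harmonize(df, year):
    """Map a year's raw columns onto one shared schema and tag the year."""
    out = df.rename(columns=RENAMES[year])
    out["year"] = year
    return out[["zcta", "geoid", "year"]].astype({"zcta": str, "geoid": str})

df15 = pd.DataFrame({"ZCTA5CE10": ["60601"], "GEOID10": ["60601"]})
df22 = pd.DataFrame({"ZCTA5CE20": ["60601"], "GEOID20": ["60601"]})
combined = pd.concat([harmonize(df15, 2015), harmonize(df22, 2022)],
                     ignore_index=True)
print(combined)
```

Keeping identifiers as strings (rather than integers) also preserves any leading zeros, which matters for longitudinal joins on geographic codes.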
This project combines data extraction, predictive modeling, and geospatial mapping to analyze housing trends in Mercer County, New Jersey. It consists of three core components: Census Data Extraction: Gathers U.S. Census data (2012–2022) on median house value, household income, and racial demographics for all census tracts in the county. It accounts for changes in census tract boundaries between 2010 and 2020 by approximating values for newly defined tracts. House Value Prediction: Uses an LSTM model with k-fold cross-validation to forecast median house values through 2025. Multiple feature combinations and sequence lengths are tested to optimize prediction accuracy, with the final model selected based on MSE and MAE scores. Data Mapping: Visualizes historical and predicted housing data using GeoJSON files from the TIGERWeb API. It generates interactive maps showing raw values, changes over time, and percent differences, with customization options to handle outliers and improve interpretability. This modular workflow can be adapted to other regions by changing the input FIPS codes and feature selections.
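Sequence preparation for an LSTM like the one described might be sketched as below. This is a generic sliding-window helper in plain Python; the window length and house values are illustrative, not the project's actual configuration.

```python
def make_sequences(series, seq_len):
    """Build (input_window, next_value) pairs for sequence models."""
    pairs = []
    for i in range(len(series) - seq_len):
        pairs.append((series[i:i + seq_len], series[i + seq_len]))
    return pairs

# Illustrative yearly median house values for one tract (made-up numbers)
values = [210_000, 215_000, 223_000, 230_000, 241_000, 250_000]
windows = make_sequences(values, seq_len=3)
print(windows[0])  # ([210000, 215000, 223000], 230000)
```

Testing multiple sequence lengths, as the project does, would simply mean regenerating these pairs for several values of seq_len and comparing validation MSE/MAE for each.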
This data was collected and created for a project in a data science course I took in college in the Spring of 2020. I have updated the data to include more dates into the summer and decided to share it and the code so others can explore it.
Available here: https://hifld-geoplatform.opendata.arcgis.com/datasets/hospitals
Information on hospitals in the United States.
Available here: https://github.com/nytimes/covid-19-data
Daily COVID-19 case and death data for US counties.
Available here: https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/
Data sheet available here: https://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2019/co-est2019-alldata.pdf
2019 county level census estimates.
Available here: https://covidtracking.com/api/v1/states/daily.csv
Daily state-level COVID-19 testing data.
Uploaded with Git LFS
Interim data views created by me to hold cleaned data, used to create the final dataset.
Final combined dataset: a days × 3142 (number of US counties + DC) time series with variables stored as a proportion of population.
Uploaded with Git LFS
The python scripts have comments to explain which datasets they're responsible for generating.
Feel free to use and edit them to tailor the datasets generated to your liking.
There is also a helper function library in the main directory.
Scripts can be run by calling python followed by the script name.
The provided Python code extracts data from the Federal Reserve Economic Data (FRED) on Bachelor's-or-higher educational attainment in the United States, at the state and county levels. The code retrieves data as of the current date; data are available through 2021.
This code is useful for research purposes, particularly for conducting comparative analyses involving educational and economic indicators. There are two distinct CSV files associated with this code. One file contains information on the percentage of Bachelor's or Higher degree holders among residents of all USA states, while the other file provides data on states, counties, and municipalities throughout the entire USA.
The extraction process involves applying different criteria, including content filtering (such as title, frequency, seasonal adjustment, and unit) and collaborative filtering based on item similarity. For the first CSV file, the algorithm extracts data for each state in the USA and assigns corresponding state names to the respective FRED codes using a loop. Similarly, for the second CSV file, data is extracted based on a given query, encompassing USA states, counties, and municipalities.
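A request to the FRED API for one of these series might be constructed as below. The series ID and API-key placeholder are hypothetical examples, and no network call is made here; in practice the fredapi package or plain requests would perform the retrieval against this URL.

```python
from urllib.parse import urlencode

# FRED observations endpoint (real); the series ID below is a
# hypothetical example of a state-level educational-attainment series.
BASE = "https://api.stlouisfed.org/fred/series/observations"

def fred_url(series_id, api_key, start="2010-01-01"):
    """Build a FRED observations request URL for one series."""
    params = {
        "series_id": series_id,
        "api_key": api_key,
        "file_type": "json",
        "observation_start": start,
    }
    return f"{BASE}?{urlencode(params)}"

url = fred_url("GCT1502AL", "YOUR_API_KEY")
print(url)
# A real call would then be: requests.get(url).json()["observations"]
```

Looping such calls over per-state series IDs, then mapping FRED codes back to state names, mirrors the loop-based extraction described above.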
Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The first three non-intercept terms represent indicator variables for the different seasons, with fall being the baseline. The Vaccine Available term represents a binary variable for whether the COVID-19 initial vaccination series was publicly available or not. Weekend is a binary variable for whether the data was collected on Saturday or Sunday. The four Income Bracket terms are indicator variables for the median income level of the census tract where the data was collected. The income brackets are defined in our methods. Lastly, the More than 55.5% White term is an indicator variable for whether the census tract in question had a populace that is more than 55.5% White. Full documentation for the Python package used to make this output is available from the developers [38].
Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To conduct this study, I sourced demographic data from 2010 to 2023 from the California Elections Data Archive (CEDA) for city council members and school board members. The CEDA data provide a full list of candidate names and the number of votes a given candidate received for every city council and school board election. I assigned the gender to each candidate based on the lists of popular male and female names provided by the Social Security Administration. Since the average age of city council members is 46 years old according to the Bureau of Labor Statistics, I compiled a list of popular male and female given names for babies born in the 1960s, 1970s, and 1980s. Then, I automated the gender classification as follows: for example, as “Lisa” is identified as a popular female given name by the Social Security Administration, every candidate whose first name is “Lisa” was assigned “female” in our dataset. For a gender-neutral name that appeared on the lists for both male and female given names, which included “Alex” and “Casey,” I used the following keywords “[first name] [last name] [office type (either “city council” or “school board”)] [name of the city or the school district]” to search for more information about the official’s gender online. My search returned either a picture to help clearly identify the official’s gender and/or an article that refers to the official with gendered pronouns. To identify the ethnicity of each elected official, I used the 2010 Census data and the 23AndMe Surname Discovery Tool. The 2010 Census lists surnames occurring at least 100 times, and it includes self-reported ethnicity data for individuals with a given surname. Similarly, the 23AndMe Surname Discovery Tool gives the percentage of individuals with the given surname who identify as each of four different ethnicity groups: Hispanic, White, Asian/Pacific Islander, and Black based on the 2010 US Census data. 
For surnames that did not appear on either the 2010 Census data or the 23AndMe Surname Discovery Tool, I used Python’s Ethnicolr library, which bases its prediction of ethnicity using either both first and last name or just the last name on the US census data (2000 and 2010), the Florida voting registration data, and the Wikipedia data.
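The name-based gender assignment step described above might look like the following sketch. The name sets here are tiny invented stand-ins for the Social Security Administration lists, and the "review" outcome mirrors the manual web-search fallback for gender-neutral names.

```python
# Toy stand-ins for SSA popular-name lists (1960s-1980s birth cohorts);
# the real lists contain thousands of names per decade.
FEMALE_NAMES = {"lisa", "jennifer", "casey", "alex"}
MALE_NAMES = {"michael", "david", "casey", "alex"}

def classify_gender(first_name):
    """Assign gender from name lists; flag ambiguous names for review."""
    name = first_name.lower()
    female, male = name in FEMALE_NAMES, name in MALE_NAMES
    if female and male:
        return "review"   # gender-neutral: fall back to a manual web search
    if female:
        return "female"
    if male:
        return "male"
    return "unknown"

print([classify_gender(n) for n in ["Lisa", "Casey", "Pat"]])
# ['female', 'review', 'unknown']
```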
The U.S. Geological Survey is developing national water-use models to support water resources management in the United States. Model benefits include a nationally consistent estimation approach, greater temporal and spatial resolution of estimates, efficient and automated updates of results, and capabilities to forecast water use into the future and assess model uncertainty. The term “reanalysis” refers to the process of reevaluating and recalculating water-use data using updated or refined methods, data sources, models, or assumptions. In this data release, water use refers to water that is withdrawn by public and private water suppliers and includes water provided for domestic, commercial, industrial, thermoelectric power, and public water uses, as well as water that is consumed or lost within the public supply system. Consumptive use refers to water withdrawn by the public supply system that is evaporated, transpired, incorporated into products or crops, or consumed by humans or livestock. This data release contains data used in a machine learning model (child item 2) to estimate monthly water use for communities that are supplied by public-supply water systems in the conterminous United States for 2000-2020. This data release also contains associated scripts used to produce input features (child items 4 - 8) as well as model water use estimates by 12-digit hydrologic unit code (HUC12) and public supply water service area (WSA). HUC12 boundaries are in child item 3. Public supply delivery and consumptive use estimates are in child items 1 and 9, respectively.
First posted: November 1, 2023 Revised: August 8, 2024 This version replaces the previous version of the data release: Luukkonen, C.L., Alzraiee, A.H., Larsen, J.D., Martin, D.J., Herbert, D.M., Buchwald, C.A., Houston, N.A., Valseth, K.J., Paulinski, S., Miller, L.D., Niswonger, R.G., Stewart, J.S., and Dieter, C.A., 2023, Public supply water use reanalysis for the 2000-2020 period by HUC12, month, and year for the conterminous United States: U.S. Geological Survey data release, https://doi.org/10.5066/P9FUL880 Version 2.0 This data release has been updated as of 8/8/2024. The previous version has been replaced because some fractions used for downscaling WSA estimates to HUC12 did not sum to one for some WSAs in Virginia. Updated model water use estimates by HUC12 are included in this version. A change was made in two scripts to check for this condition. Output files have also been updated to preserve the leading zero in the HUC12 codes. Additional files are also included to provide information about mapping the WSAs and groundwater and surface water fractions to HUC12 and to provide public supply water-use estimates by WSA. The 'Machine learning model that estimates total monthly and annual per capita public supply water use' child item has been updated with these corrections and additional files. A new child item 'R code used to estimate public supply consumptive water use' has been added to provide estimates of public supply consumptive use.
This page includes the following files:
PS_HUC12_Tot_2000_2020.csv - a csv file with estimated monthly public supply total water use from 2000-2020 by HUC12, in million gallons per day
PS_HUC12_GW_2000_2020.csv - a csv file with estimated monthly public supply groundwater use for 2000-2020 by HUC12, in million gallons per day
PS_HUC12_SW_2000_2020.csv - a csv file with estimated monthly public supply surface water use for 2000-2020 by HUC12, in million gallons per day
PS_WSA_Tot_2000_2020.csv - a csv file with estimated monthly public supply total water use from 2000-2020 by WSA, in million gallons per day
PS_WSA_GW_2000_2020.csv - a csv file with estimated monthly public supply groundwater use for 2000-2020 by WSA, in million gallons per day
PS_WSA_SW_2000_2020.csv - a csv file with estimated monthly public supply surface water use for 2000-2020 by WSA, in million gallons per day
Note: 1) Groundwater and surface water fractions were determined using source counts as described in the 'R code that determines groundwater and surface water source fractions for public-supply water service areas, counties, and 12-digit hydrologic units' child item. 2) Some HUC12s have estimated water use of zero because no public-supply water service areas were modeled within the HUC.
change_files_format.py - a Python script used to change the water use estimates by WSA and HUC12 files from wide format to the thin and long format
version_history.txt - a txt file describing changes in this version
The data release is organized into these items:
1. Machine learning model that estimates public supply deliveries for domestic and other use types - The public supply delivery model estimates total delivery of domestic, commercial, industrial, institutional, and irrigation (CII) water use for public supply water service areas within the conterminous United States. This item contains model input datasets, code used to build the delivery machine learning model, and output predictions.
2. Machine learning model that estimates total monthly and annual per capita public supply water use - The public supply water use model estimates total monthly water use for 12-digit hydrologic units within the conterminous United States. This item contains model input datasets, code used to build the water use machine learning model, and output predictions.
3. National watershed boundary (HUC12) dataset for the conterminous United States, retrieved 10/26/2020 - Spatial data consisting of a shapefile with 12-digit hydrologic units for the conterminous United States retrieved 10/26/2020.
4. Python code used to determine average yearly and monthly tourism per 1000 residents for public-supply water service areas - This code was used to create a feature for the public supply model that provides information for areas affected by population increases due to tourism.
5. Python code used to download gridMET climate data for public-supply water service areas - The climate data collector is a tool used to query climate data which are used as input features in the public supply models.
6. Python code used to download U.S. Census Bureau data for public-supply water service areas - The census data collector is a geographic based tool to query census data which are used as input features in the public supply models.
7. R code that determines buying and selling of water by public-supply water service areas - This code was used to create a feature for the public supply model that indicates whether public-supply systems buy water, sell water, or neither buy nor sell water.
8. R code that determines groundwater and surface water source fractions for public-supply water service areas, counties, and 12-digit hydrologic units - This code was used to determine source water fractions (groundwater and/or surface water) for public supply systems and HUC12s.
9. R code used to estimate public supply consumptive water use - This code was used to estimate public supply consumptive water use using an assumed fraction of deliveries for outdoor irrigation and estimates of evaporative demand. This item contains estimated monthly public supply consumptive use datasets by HUC12 and WSA.
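The wide-to-long conversion performed by change_files_format.py might resemble this pandas sketch. The column layout is a guess at the file structure (one row per HUC12, one column per month) and the values are fabricated; note that keeping HUC12 as a string preserves the leading zero mentioned in the version history.

```python
import pandas as pd

# Fabricated wide-format sample: one row per HUC12, one column per month
wide = pd.DataFrame({
    "HUC12": ["010100020101", "010100020102"],  # strings keep leading zeros
    "2000-01": [1.2, 0.8],
    "2000-02": [1.3, 0.7],
})

# melt turns the month columns into (month, value) rows: thin-and-long format
long = wide.melt(id_vars="HUC12", var_name="month",
                 value_name="withdrawal_mgd")
print(long)
```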
Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are the underlying data sets needed to perform the peak analysis and create the land-use regression (LUR) models described in the paper. There are four datasets:
"RAMP_Location.csv": Locations and IDs of the low-cost sensors used in this work. [NOTE: latitudes and longitudes for the sensor deployments have been intentionally rounded to protect the location of volunteer sensor hosts.]
"RAMP_data.zip": This contains the csv files of calibrated PM2.5, NO, NO2, CO and O3 measurements for the entire study period across all low-cost sensor sites.
"Vancouver_Population_Density_2016.zip": Shapefile of the population within each Dissemination Area from the 2016 Canadian Census. This information was originally extracted from the Canadian Census Analyser supported by the University of Toronto.
"smell_van_data.csv": Contains the locations, date, and description of odor reports during the monitoring period from the Smell Vancouver website (https://smell-vancouver.ca)
There is also sample code in Python to perform the peak analysis and create the LURs. [NOTE: we have intentionally excluded uploading the exact data sets imported by this code; our original data contains exact locations of sensor host volunteers and thus cannot be shared.]
"Peak_analysis_geohealth.py": A Python script to perform the peak analysis from the paper.
"LUR_strathcona_geohealth.py": A Python script to create the LURs and maps of LUR results from the paper.
Links to other data used in the code from publicly available sources:
Railway locations - https://opendata.vancouver.ca/explore/dataset/railways/information/
Public streets - https://opendata.vancouver.ca/explore/dataset/public-streets/information/
Bus stops - https://abacus.library.ubc.ca/dataset.xhtml?persistentId=hdl:11272.1/AB2/QQLSCJ
Block outlines - https://opendata.vancouver.ca/explore/dataset/block-outlines/information/
Shapefiles for mapping and understanding overlaps
sf package in R. geopandas in Python.
https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html
By Noah Rippner [source]
This dataset offers a unique opportunity to examine patterns and trends in cancer rates in the United States at the individual county level. Using data from cancer.gov and the US Census American Community Survey, it provides insight into how the age-adjusted death rate, average deaths per year, and recent trends vary between counties, along with other key metrics such as average annual count, whether the objective of 45.5 (1) was met, and the recent trend (2) in death rates. Linear regression models built on these data can reveal correlations between variables that help explain cancer prevalence across counties over time, making it easier to target health initiatives and resources where needed.
This Kaggle dataset provides county-level data from the US Census American Community Survey and cancer.gov for exploring correlations between county-level cancer rates, trends, and mortality statistics. It contains records for all U.S. counties covering the age-adjusted death rate, average deaths per year, recent trend (2) in death rates, average annual count of cases detected within 5 years, and whether or not the objective of 45.5 (1) was met in the county associated with each row.
To use this dataset to its fullest potential, you should be comfortable with basic descriptive analytics: calculating summary statistics such as the mean and median, summarizing categorical variables with frequency tables, and creating visualizations such as charts and histograms. Familiarity with linear regression and other machine learning techniques (support vector machines, random forests, neural networks), the distinction between supervised and unsupervised learning, model diagnostics, and interpreting and communicating your findings will let you apply these methods accurately and effectively.
Once these concepts are understood, start by importing the data into your tool of choice: Tableau, QlikView, SAS, or Python notebooks, loading packages such as scikit-learn as needed. A description of the table's column structure is provided above. With basic SQL you can compute summary statistics, select subsets of columns under specific conditions, and sort by attributes of interest; in Python you can group and aggregate categories, join tables where necessary, and build predictive models. From there, explore the features by computing correlation and covariance matrices and examining scatter plots to reveal distributions, relationships, and trends in the metrics of interest.
- Building a predictive cancer incidence model based on county-level demographic data to identify high-risk areas and target public health interventions.
- Analyzing correlations between age-adjusted death rate, average annual count, and recent trends in order to develop more effective policy initiatives for cancer prevention and healthcare access.
- Utilizing the dataset to construct a machine learning algorithm that can predict county-level mortality rates based on socio-economic factors such as poverty levels and educational attainment rates.
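The regression workflow described above can be sketched as follows. The column and predictor names here are assumptions for illustration (check the actual CSV header before adapting this), and the data are synthetic so the example is self-contained:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the county-level table; real column names may differ
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "poverty_rate": rng.uniform(5, 30, 200),
    "median_income": rng.uniform(30_000, 90_000, 200),
})
# Fabricated relationship, purely so the model has something to recover
df["age_adjusted_death_rate"] = (
    150 + 2.0 * df["poverty_rate"] - 0.0005 * df["median_income"]
    + rng.normal(0, 5, 200)
)

X = df[["poverty_rate", "median_income"]]
y = df["age_adjusted_death_rate"]
model = LinearRegression().fit(X, y)
print(model.coef_, model.score(X, y))
```

On the real data you would also inspect residual plots and hold out a test set before trusting any coefficient as a genuine county-level effect.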
If you use this dataset i...
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This database consists of a high-resolution village-level drought dataset for major Indian states for the past 43 years (1981-2022) for each month. It was created by utilising the CHIRPS precipitation and GLEAM evapotranspiration datasets. The GLEAM dataset is based on the well-recognised Priestley-Taylor equation, which estimates potential evapotranspiration (PET) from observations of surface net radiation and near-surface air temperature. The SPEI was calculated on 5x5 km spatial grids at the 3-month time scale, which is suitable for agricultural drought monitoring. This high-resolution SPEI dataset was integrated with Indian village boundaries and the associated census attribute dataset, allowing researchers to perform multi-disciplinary investigations, e.g., climate migration modelling, drought hazards, and exposure assessment. The dataset was developed with potential users in mind: it can be integrated into a GIS system for visualization (using .mid/.mif format) and into Python programming for modelling and analysis (using .csv). For advanced analysis, it is also provided in netCDF format, which can be read in Python using xarray or the netcdf4 library. More details are in the README.pdf file. Date Submitted: 2023-11-07 Issued: 2023-11-07
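A minimal sketch of working with such a netCDF file in xarray. The variable name "spei", the dimension names, and the file name are assumptions; inspect the real file's metadata (ds.data_vars, ds.dims) after opening it with xr.open_dataset("spei_3month.nc"). A small in-memory dataset stands in for the file here:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy stand-in for the SPEI netCDF: 12 monthly steps on a 4x4 grid
time = pd.date_range("1981-01-01", periods=12, freq="MS")
rng = np.random.default_rng(1)
ds = xr.Dataset(
    {"spei": (("time", "lat", "lon"),
              rng.normal(0, 1, (12, 4, 4)))},
    coords={"time": time,
            "lat": np.linspace(20, 21, 4),
            "lon": np.linspace(75, 76, 4)},
)

# Count months in moderate-or-worse drought (SPEI < -1) at each grid cell
drought_months = (ds["spei"] < -1).sum("time")
print(drought_months.shape)
```

The same pattern (boolean mask, reduce over "time") extends to trend analysis or masking by village boundaries once the real coordinates are known.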
By Homeland Infrastructure Foundation [source]
Each urban area is uniquely identified by a 5-character numeric census code that may contain leading zeroes as necessary. The dataset comprises several key attributes such as the name of the urban area (represented by multiple columns), legal/statistical area description, MAF/TIGER feature class code for classification purposes (MTFCC10), urban area type code (UATYP10), functional status indicating its operational characteristics (FUNCSTAT10), and geographic coordinates specifying the latitude and longitude of the interior point of each urban area.
Additional information includes the land area in square meters (ALAND10), which denotes the extent of developed territory within an urban zone, and the water area associated with each urban area, also measured in square meters (AWATER10). Shape length describes the total length of an urban area's outline, while shape area signifies its overall spatial extent.
Here is a step-by-step guide on how to effectively use this dataset:
Import the Data: Load the dataset into your preferred tool or programming language for data analysis. Popular options include Python with libraries like pandas or R with packages like tidyr.
Explore the Columns: Familiarize yourself with the available columns in the dataset. Here are some important ones:
- NAME10: The name of each urban area.
- NAMELSAD10: The name and legal/statistical area description of each urban area.
- UACE10: A 5-character numeric census code that uniquely identifies each urban area.
- ALAND10: The land area of each urban area in square meters.
- AWATER10: The water area of each urban area in square meters.
- FUNCSTAT10: The functional status of each urban area.
- INTPTLAT10 and INTPTLON10: The latitude and longitude coordinates of the interior point of each urban area.
Understand Urban Area Types: The dataset distinguishes between two types of urban areas:
a) Urbanized Areas (UAs): These areas contain 50,000 or more people.
b) Urban Clusters (UCs): These areas contain at least 2,500 people but fewer than 50,000 people. (Except in the U.S. Virgin Islands and Guam, which may have urban clusters with populations greater than 50,000).
The column UATYP10 provides the urban area type code for each entry.
Analyze Functional Status: Explore the FUNCSTAT10 column to understand the functional status of each urban area. This information indicates whether an area is deemed functional for residential, commercial, or other non-residential purposes.
Visualize Geographic Data: Util...
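The exploration steps above can be sketched in geopandas. The GeoDataFrame below is a toy stand-in for the real shapefile (which you would open with gpd.read_file), and the UATYP10 codes "U" (Urbanized Area) and "C" (Urban Cluster) follow the usual TIGER convention; verify them against the file itself before relying on them:

```python
import geopandas as gpd
from shapely.geometry import box

# Toy stand-in for the urban areas layer, with a subset of its columns
ua = gpd.GeoDataFrame(
    {
        "NAME10": ["Springfield", "Smallville"],
        "UATYP10": ["U", "C"],          # "U" = Urbanized Area, "C" = Urban Cluster
        "ALAND10": [120_000_000, 8_000_000],  # land area, square meters
    },
    geometry=[box(0, 0, 1, 1), box(2, 2, 2.5, 2.5)],
    crs="EPSG:4326",
)

# Step 3: separate Urbanized Areas (50,000+ people) from Urban Clusters
urbanized = ua[ua["UATYP10"] == "U"]
print(urbanized["NAME10"].tolist())
```

From here, .plot() on either subset gives a quick map, and ALAND10/AWATER10 support the density and land-use comparisons described below.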
- Urban Planning Analysis: This dataset can be used to analyze and compare different urban areas based on their land area, water area, population density, and functional status. It can provide valuable insights for urban planners in terms of designing infrastructure, allocating resources, and making informed decisions to ensure sustainable development.
- Demographic Research: Researchers studying population trends and demographics can utilize this dataset to understand the growth, distribution, and characteristics of urban areas over time. By analyzing the population size and density of different urban areas, they can identify patterns of urbanization and assess the impact of policies or events on urban populations.
- Environmental Impact Assessment: The land area and water area information in this dataset can be used to assess the environmental impact of urban areas. Researchers or environmentalists can analyze the proportion of green spaces versus built-up areas within each urban area to evaluate levels of air pollution, biodiversity loss, or potential for implementing sustainable practices like rooftop gardens or rainwater harvesting systems.
If you use this dataset in your research, please credit the original authors. Data Source
License: Dataset copyright by authors - You are free to: - Share - copy and redistribute the material in any medium or format for any purpose, even commercially. - Adapt - remix, transform, and build upon the material for any purpose, even commercially. - You must: - Give appropriate credit - Provide a link to the license, and indicate i...