Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Gross Domestic Product (GDP) in the United States was worth 29184.89 billion US dollars in 2024, according to official data from the World Bank. The GDP value of the United States represents 27.49 percent of the world economy. This dataset provides - United States GDP - actual values, historical data, forecast, chart, statistics, economic calendar and news.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Government spending in the United States was last recorded at 39.7 percent of GDP in 2024. This dataset provides - United States Government Spending to GDP - actual values, historical data, forecast, chart, statistics, economic calendar and news.
On October 20, 2022, CDC began retrieving aggregate case and death data from jurisdictional and state partners weekly instead of daily. This dataset contains archived community transmission and related data elements by county as originally displayed on the COVID Data Tracker. Although these data will continue to be publicly available, this dataset has not been updated since October 20, 2022. An archived dataset containing weekly community transmission data by county as originally posted can also be found here: Weekly COVID-19 County Level of Community Transmission as Originally Posted | Data | Centers for Disease Control and Prevention (cdc.gov).
Related data CDC has been providing the public with two versions of COVID-19 county-level community transmission level data: this dataset with the daily values as originally posted on the COVID Data Tracker, and a historical dataset with daily data as well as the updates and corrections from state and local health departments. Similar to this dataset, the original historical dataset was archived on 10/20/2022. It will continue to be publicly available but will no longer be updated. A new dataset containing historical community transmission data by county is now published weekly and can be found at: Weekly COVID-19 County Level of Community Transmission Historical Changes | Data | Centers for Disease Control and Prevention (cdc.gov).
This public use dataset has 7 data elements reflecting community transmission levels for all available counties and jurisdictions. It contains reported daily transmission levels at the county level with the same values used to display transmission maps on the COVID Data Tracker. Each day, the dataset is appended to contain the most recent day's data. Transmission level is set to low, moderate, substantial, or high using the calculation rules below.
Methods for calculating county level of community transmission indicator The County Level of Community Transmission indicator uses two metrics: (1) total new COVID-19 cases per 100,000 persons in the last 7 days and (2) percentage of positive SARS-CoV-2 diagnostic nucleic acid amplification tests (NAAT) in the last 7 days. For each of these metrics, CDC classifies transmission values as low, moderate, substantial, or high (below and here). If the values for each of these two metrics differ (e.g., one indicates moderate and the other low), then the higher of the two should be used for decision-making.
CDC core metrics of and thresholds for community transmission levels of SARS-CoV-2
Total New Case Rate Metric: "New cases per 100,000 persons in the past 7 days" is calculated by dividing the number of new cases in the county (or other administrative level) in the last 7 days by the population in the county (or other administrative level) and multiplying by 100,000. "New cases per 100,000 persons in the past 7 days" is considered to have a transmission level of Low (0-9.99); Moderate (10.00-49.99); Substantial (50.00-99.99); and High (greater than or equal to 100.00).
Test Percent Positivity Metric: "Percentage of positive NAAT in the past 7 days" is calculated by dividing the number of positive tests in the county (or other administrative level) during the last 7 days by the total number of tests conducted over the last 7 days. "Percentage of positive NAAT in the past 7 days" is considered to have a transmission level of Low (less than 5.00); Moderate (5.00-7.99); Substantial (8.00-9.99); and High (greater than or equal to 10.00).
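The two threshold tables above can be sketched as a small function. This is a minimal illustration; the `transmission_level` name and signature are ours, but the cut points are taken directly from the text:

```python
def transmission_level(new_cases_7d, population, pct_positive_naat_7d):
    """Classify community transmission using the two CDC metrics above."""
    levels = ["low", "moderate", "substantial", "high"]

    # Metric 1: new cases per 100,000 persons in the past 7 days.
    case_rate = new_cases_7d / population * 100_000
    if case_rate < 10:
        case_level = "low"
    elif case_rate < 50:
        case_level = "moderate"
    elif case_rate < 100:
        case_level = "substantial"
    else:
        case_level = "high"

    # Metric 2: percentage of positive NAAT in the past 7 days.
    if pct_positive_naat_7d < 5:
        pos_level = "low"
    elif pct_positive_naat_7d < 8:
        pos_level = "moderate"
    elif pct_positive_naat_7d < 10:
        pos_level = "substantial"
    else:
        pos_level = "high"

    # When the two metrics differ, the higher of the two is used.
    return max(case_level, pos_level, key=levels.index)
```

For example, a county with 30 new cases per 100,000 but 4 percent positivity lands at "moderate", the higher of the two metric levels.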
NOTE: A more current version of the Protected Areas Database of the United States (PAD-US) is available: PAD-US 3.0 https://doi.org/10.5066/P9Q9LQ4B. The USGS Protected Areas Database of the United States (PAD-US) is the nation's inventory of protected areas, including public land and voluntarily provided private protected areas, identified as an A-16 National Geospatial Data Asset in the Cadastre Theme (https://communities.geoplatform.gov/ngda-cadastre/). The PAD-US is an ongoing project with several published versions of a spatial database including areas dedicated to the preservation of biological diversity, and other natural (including extraction), recreational, or cultural uses, managed for these purposes through legal or other effective means. The database was originally designed to support biodiversity assessments; however, its scope expanded in recent years to include all public and nonprofit lands and waters. Most are public lands owned in fee (the owner of the property has full and irrevocable ownership of the land); however, long-term easements, leases, agreements, Congressional (e.g. 'Wilderness Area'), Executive (e.g. 'National Monument'), and administrative designations (e.g. 'Area of Critical Environmental Concern') documented in agency management plans are also included. The PAD-US strives to be a complete inventory of public land and other protected areas, compiling “best available” data provided by managing agencies and organizations. The PAD-US geodatabase maps and describes areas using over twenty-five attributes and five feature classes representing the U.S. protected areas network in separate feature classes: Fee (ownership parcels), Designation, Easement, Marine, Proclamation and Other Planning Boundaries. Five additional feature classes include various combinations of the primary layers (for example, Combined_Fee_Easement) to support data management, queries, web mapping services, and analyses. 
This PAD-US Version 2.1 dataset includes a variety of updates and new data from the previous Version 2.0 dataset (USGS, 2018 https://doi.org/10.5066/P955KPLE ), achieving the primary goal to "Complete the PAD-US Inventory by 2020" (https://www.usgs.gov/core-science-systems/science-analytics-and-synthesis/gap/science/pad-us-vision) by addressing known data gaps with newly available data. The following list summarizes the integration of "best available" spatial data to ensure public lands and other protected areas from all jurisdictions are represented in PAD-US, along with continued improvements and regular maintenance of the federal theme. Completing the PAD-US Inventory: 1) Integration of over 75,000 city parks in all 50 States (and the District of Columbia) from The Trust for Public Land's (TPL) ParkServe data development initiative (https://parkserve.tpl.org/) added nearly 2.7 million acres of protected area and significantly reduced the primary known data gap in previous PAD-US versions (local government lands). 2) First-time integration of the Census American Indian/Alaskan Native Areas (AIA) dataset (https://www2.census.gov/geo/tiger/TIGER2019/AIANNH) representing the boundaries for federally recognized American Indian reservations and off-reservation trust lands across the nation (as of January 1, 2020, as reported by the federally recognized tribal governments through the Census Bureau's Boundary and Annexation Survey) addressed another major PAD-US data gap. 3) Aggregation of nearly 5,000 protected areas owned by local land trusts in 13 states, aggregated by Ducks Unlimited through data calls for easements to update the National Conservation Easement Database (https://www.conservationeasement.us/), increased PAD-US protected areas by over 350,000 acres. 
Maintaining regular Federal updates: 1) Major update of the Federal estate (fee ownership parcels, easement interest, and management designations), including authoritative data from 8 agencies: Bureau of Land Management (BLM), U.S. Census Bureau (Census), Department of Defense (DOD), U.S. Fish and Wildlife Service (FWS), National Park Service (NPS), Natural Resources Conservation Service (NRCS), U.S. Forest Service (USFS), National Oceanic and Atmospheric Administration (NOAA). The federal theme in PAD-US is developed in close collaboration with the Federal Geographic Data Committee (FGDC) Federal Lands Working Group (FLWG, https://communities.geoplatform.gov/ngda-govunits/federal-lands-workgroup/); 2) Complete National Marine Protected Areas (MPA) update from the National Oceanic and Atmospheric Administration (NOAA) MPA Inventory, including conservation measure ('GAP Status Code', 'IUCN Category') review by NOAA; Other changes: 1) PAD-US field name change - The "Public Access" field name changed from 'Access' to 'Pub_Access' to avoid unintended scripting errors associated with the script command 'access'. 2) Additional field - The "Feature Class" (FeatClass) field was added to all layers within PAD-US 2.1 (only included in the "Combined" layers of PAD-US 2.0 to describe which feature class data originated from). 3) Categorical GAP Status Code default changes - National Monuments are categorically assigned GAP Status Code = 2 (previously GAP 3), in the absence of other information, to better represent biodiversity protection restrictions associated with the designation. The Bureau of Land Management Areas of Critical Environmental Concern (ACECs) are categorically assigned GAP Status Code = 3 (previously GAP 2) as the areas are administratively protected, not permanent. More information is available upon request. 4) Agency Name (FWS) geodatabase domain description changed to U.S. Fish and Wildlife Service (previously U.S. Fish & Wildlife Service).
5) Select areas in the provisional PAD-US 2.1 Proclamation feature class were removed following a consultation with the data-steward (Census Bureau). Tribal designated statistical areas are purely a geographic area for providing Census statistics with no land base. Most affected areas are relatively small; however, 4,341,120 acres and 37 records were removed in total. Contact Mason Croft (masoncroft@boisestate) for more information about how to identify these records. For more information regarding the PAD-US dataset please visit, https://usgs.gov/gapanalysis/PAD-US/. For more information about data aggregation please review the Online PAD-US Data Manual available at https://www.usgs.gov/core-science-systems/science-analytics-and-synthesis/gap/pad-us-data-manual .
The Counties dataset was updated on October 31, 2023 from the United States Census Bureau (USCB) and is part of the U.S. Department of Transportation (USDOT)/Bureau of Transportation Statistics (BTS) National Transportation Atlas Database (NTAD). This resource is a member of a series. The TIGER/Line shapefiles and related database files (.dbf) are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB). The MTDB represents a seamless national file with no overlaps or gaps between parts; however, each TIGER/Line shapefile is designed to stand alone as an independent data set, or they can be combined to cover the entire nation. The primary legal divisions of most states are termed counties. In Louisiana, these divisions are known as parishes. In Alaska, which has no counties, the equivalent entities are the organized boroughs, city and boroughs, municipalities, and for the unorganized area, census areas. The latter are delineated cooperatively for statistical purposes by the State of Alaska and the Census Bureau. In four states (Maryland, Missouri, Nevada, and Virginia), there are one or more incorporated places that are independent of any county organization and thus constitute primary divisions of their states. These incorporated places are known as independent cities and are treated as equivalent entities for purposes of data presentation. The District of Columbia and Guam have no primary divisions, and each area is considered an equivalent entity for purposes of data presentation. The Census Bureau treats the following entities as equivalents of counties for purposes of data presentation: Municipios in Puerto Rico, Districts and Islands in American Samoa, Municipalities in the Commonwealth of the Northern Mariana Islands, and Islands in the U.S. Virgin Islands.
The entire area of the United States, Puerto Rico, and the Island Areas is covered by counties or equivalent entities. The boundaries for counties and equivalent entities are mostly as of January 1, 2023, as reported through the Census Bureau's Boundary and Annexation Survey (BAS). A data dictionary, or other source of attribute information, is accessible at https://doi.org/10.21949/1529015
https://github.com/nytimes/covid-19-data/blob/master/LICENSE
The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.
Since the first reported coronavirus case in Washington State on Jan. 21, 2020, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.
We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.
The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
https://www.usa.gov/government-works
Reporting of new Aggregate Case and Death Count data was discontinued May 11, 2023, with the expiration of the COVID-19 public health emergency declaration. This dataset will receive a final update on June 1, 2023, to reconcile historical data through May 10, 2023, and will remain publicly available.
Aggregate Data Collection Process Since the start of the COVID-19 pandemic, data have been gathered through a robust process with the following steps:
Methodology Changes Several differences exist between the current, weekly-updated dataset and the archived version:
Confirmed and Probable Counts In this dataset, counts by jurisdiction are not displayed by confirmed or probable status. Instead, confirmed and probable cases and deaths are included in the Total Cases and Total Deaths columns, when available. Not all jurisdictions report probable cases and deaths to CDC.* Confirmed and probable case definition criteria are described here:
Council of State and Territorial Epidemiologists (ymaws.com).
Deaths CDC reports death data on other sections of the website: CDC COVID Data Tracker: Home, CDC COVID Data Tracker: Cases, Deaths, and Testing, and NCHS Provisional Death Counts. Information presented on the COVID Data Tracker pages is based on the same source (total case counts) as the present dataset; however, NCHS Death Counts are based on death certificates that use information reported by physicians, medical examiners, or coroners in the cause-of-death section of each certificate. Data from each of these pages are considered provisional (not complete and pending verification) and are therefore subject to change. Counts from previous weeks are continually revised as more records are received and processed.
Number of Jurisdictions Reporting There are currently 60 public health jurisdictions reporting cases of COVID-19. This includes the 50 states, the District of Columbia, New York City, the U.S. territories of American Samoa, Guam, the Commonwealth of the Northern Mariana Islands, Puerto Rico, and the U.S. Virgin Islands, as well as three independent countries in compacts of free association with the United States: the Federated States of Micronesia, the Republic of the Marshall Islands, and the Republic of Palau. New York State's reported case and death counts do not include New York City's counts, as New York City separately reports nationally notifiable conditions to CDC.
CDC COVID-19 data are available to the public as summary or aggregate count files, including total counts of cases and deaths, available by state and by county. These and other data on COVID-19 are available from multiple public locations, such as:
https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html
https://www.cdc.gov/covid-data-tracker/index.html
https://www.cdc.gov/coronavirus/2019-ncov/covid-data/covidview/index.html
https://www.cdc.gov/coronavirus/2019-ncov/php/open-america/surveillance-data-analytics.html
Additional COVID-19 public use datasets, including line-level (patient-level) data, are available at: https://data.cdc.gov/browse?tags=covid-19.
Archived Data Notes:
November 3, 2022: Due to a reporting cadence issue, case rates for Missouri counties are calculated based on 11 days’ worth of case count data in the Weekly United States COVID-19 Cases and Deaths by State data released on November 3, 2022, instead of the customary 7 days’ worth of data.
November 10, 2022: Due to a reporting cadence change, case rates for Alabama counties are calculated based on 13 days’ worth of case count data in the Weekly United States COVID-19 Cases and Deaths by State data released on November 10, 2022, instead of the customary 7 days’ worth of data.
November 10, 2022: Per the request of the jurisdiction, cases and deaths among non-residents have been removed from all Hawaii county totals throughout the entire time series. Cumulative case and death counts reported by CDC will no longer match Hawaii’s COVID-19 Dashboard, which still includes non-resident cases and deaths.
November 17, 2022: Two new columns, weekly historic cases and weekly historic deaths, were added to this dataset on November 17, 2022. These columns reflect case and death counts that were reported that week but were historical in nature and not reflective of the current burden within the jurisdiction. These historical cases and deaths are not included in the new weekly case and new weekly death columns; however, they are reflected in the cumulative totals provided for each jurisdiction. These data are used to account for artificial increases in case and death totals due to batched reporting of historical data.
December 1, 2022: Due to cadence changes over the Thanksgiving holiday, case rates for all Ohio counties are reported as 0 in the data released on December 1, 2022.
January 5, 2023: Due to North Carolina’s holiday reporting cadence, aggregate case and death data will contain 14 days’ worth of data instead of the customary 7 days. As a result, case and death metrics will appear higher than expected in the January 5, 2023, weekly release.
January 12, 2023: Due to data processing delays, Mississippi’s aggregate case and death data will be reported as 0. As a result, case and death metrics will appear lower than expected in the January 12, 2023, weekly release.
January 19, 2023: Due to a reporting cadence issue, Mississippi’s aggregate case and death data will be calculated based on 14 days’ worth of data instead of the customary 7 days in the January 19, 2023, weekly release.
January 26, 2023: Due to a reporting backlog of historic COVID-19 cases, case rates for two Michigan counties (Livingston and Washtenaw) were higher than expected in the January 19, 2023 weekly release.
January 26, 2023: Due to a backlog of historic COVID-19 cases being reported this week, aggregate case and death counts in Charlotte County and Sarasota County, Florida, will appear higher than expected in the January 26, 2023 weekly release.
January 26, 2023: Due to data processing delays, Mississippi’s aggregate case and death data will be reported as 0 in the weekly release posted on January 26, 2023.
February 2, 2023: As of the data collection deadline, CDC observed an abnormally large increase in aggregate COVID-19 cases and deaths reported for Washington State. In response, totals for new cases and new deaths released on February 2, 2023, are displayed as zero at the state level while CDC works with state officials to address the issue.
February 2, 2023: Due to a decrease reported in cumulative case counts by Wyoming, case rates will be reported as 0 in the February 2, 2023, weekly release. CDC is working with state officials to verify the data submitted.
February 16, 2023: Due to data processing delays, Utah’s aggregate case and death data will be reported as 0 in the weekly release posted on February 16, 2023. As a result, case and death metrics will appear lower than expected and should be interpreted with caution.
February 16, 2023: Due to a reporting cadence change, Maine’s
https://creativecommons.org/publicdomain/zero/1.0/
Cultural diversity in the U.S. has led to great variations in names and naming traditions, and names have been used to express creativity, personality, cultural identity, and values. Source: https://en.wikipedia.org/wiki/Naming_in_the_United_States
This public dataset was created by the Social Security Administration and contains all names from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in this data. For others who did apply, records may not show the place of birth, and again their names are not included in the data.
All data are from a 100% sample of records on Social Security card applications as of the end of February 2015. To safeguard privacy, the Social Security Administration restricts names to those with at least 5 occurrences.
Fork this kernel to get started with this dataset.
https://bigquery.cloud.google.com/dataset/bigquery-public-data:usa_names
https://cloud.google.com/bigquery/public-data/usa-names
Dataset Source: Data.gov. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Banner Photo by @dcp from Unsplash.
What are the most common names?
What are the most common female names?
Are there more female or male names?
Do female names outnumber male names by a wide margin?
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
This list ranks the 50 states in the United States by Non-Hispanic Some Other Race (SOR) population, as estimated by the United States Census Bureau. It also highlights population changes in each state over the past five years.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 5-Year Estimates, including:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability, and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
Neilsberg Research Team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
First, we would like to thank the wildland fire advisory group. Their wisdom and guidance helped us build the dataset as it currently exists. Currently, there are multiple, freely available fire datasets that identify wildfire and prescribed fire burned areas across the United States. However, these datasets are all limited in some way. Their time periods may cover only a couple of decades, or they may have stopped collecting data many years ago. Their spatial footprints may be limited to a specific geographic area or agency. Their attribute data may be limited to nothing more than a polygon and a year. None of the existing datasets provides a comprehensive picture of fires that have burned throughout the last few centuries. Our dataset uses these existing layers and utilizes a series of both manual processes and ArcGIS Python (arcpy) scripts to merge these existing datasets into a single dataset that encompasses the known wildfires and prescribed fires within the United States and certain territories. Forty different fire layers were utilized in this dataset. First, these datasets were ranked by order of observed quality (Tiers). The datasets were given a common set of attribute fields, and as many of these fields as possible were populated within each dataset. All fire layers were then merged together (the merged dataset) by their common attributes to create a merged dataset containing all fire polygons. Polygons were then processed in order of Tier (1-8) so that overlapping polygons in the same year and Tier were dissolved together. Overlapping polygons in subsequent Tiers were removed from the dataset. Attributes from the original datasets of all intersecting polygons in the same year across all Tiers were also merged so that all attributes from all Tiers were included, but only the polygons from the highest-ranking Tier were dissolved to form the fire polygon.
The resulting product (the combined dataset) has only one fire per year in a given area with one set of attributes. While it combines wildfire data from 40 wildfire layers and therefore has more complete information on wildfires than the datasets that went into it, this dataset also has its own set of limitations. Please see the Data Quality attributes within the metadata record for additional information on this dataset's limitations. Overall, we believe this dataset is designed to be a comprehensive collection of fire boundaries within the United States and provides a more thorough and complete picture of fires across the United States when compared to the datasets that went into it.
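The tier-based merge described above can be sketched in simplified form. This is not the actual arcpy workflow: polygon footprints are stood in for by sets of grid cells, the `merge_fires` name is ours, and a lower-tier polygon that overlaps a higher-tier fire is dropped entirely rather than clipped.

```python
def merge_fires(records):
    """One fire per year per area: dissolve overlaps within a Tier,
    drop overlapping geometry from lower Tiers, merge attributes from all."""
    kept = []  # fires retained so far, best (lowest-numbered) Tiers first
    for rec in sorted(records, key=lambda r: r["tier"]):
        cells, attrs = set(rec["cells"]), dict(rec["attrs"])
        merged_into = None
        for fire in kept:
            if fire["year"] == rec["year"] and fire["cells"] & cells:
                # Same year and overlapping area: attributes from all Tiers
                # are merged, but geometry dissolves only within one Tier.
                fire["attrs"].update(attrs)
                if fire["tier"] == rec["tier"]:
                    fire["cells"] |= cells
                merged_into = fire
                break
        if merged_into is None:
            kept.append({"tier": rec["tier"], "year": rec["year"],
                         "cells": cells, "attrs": attrs})
    return kept
```

In the real workflow the geometry operations are arcpy Dissolve/Intersect steps on polygon feature classes; the control flow above only mirrors the Tier ordering and attribute-merge rules stated in the text.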
https://creativecommons.org/publicdomain/zero/1.0/
The United States Census is a decennial census mandated by Article I, Section 2 of the United States Constitution, which states: "Representatives and direct Taxes shall be apportioned among the several States ... according to their respective Numbers."
Source: https://en.wikipedia.org/wiki/United_States_Census
The United States census count (also known as the Decennial Census of Population and Housing) is a count of every resident of the US. The census occurs every 10 years and is conducted by the United States Census Bureau. Census data is publicly available through the census website, but much of it is published only as summary tables and graphs. The raw data is often difficult to obtain, is typically divided by region, and must be processed and combined to provide information about the nation as a whole.
The United States census dataset includes nationwide population counts from the 2000 and 2010 censuses. Data is broken out by gender, age, and location using ZIP Code Tabulation Areas (ZCTAs) and GEOIDs. ZCTAs are generalized representations of ZIP codes and often, though not always, are the same as the ZIP code for an area. GEOIDs are numeric codes that uniquely identify all administrative, legal, and statistical geographic areas for which the Census Bureau tabulates data. GEOIDs are useful for correlating census data with other censuses and surveys.
Fork this kernel to get started.
https://bigquery.cloud.google.com/dataset/bigquery-public-data:census_bureau_usa
https://cloud.google.com/bigquery/public-data/us-census
Dataset Source: United States Census Bureau
Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Banner Photo by Steve Richey from Unsplash.
What are the ten most populous zip codes in the US in the 2010 census?
What are the top 10 zip codes that experienced the greatest change in population between the 2000 and 2010 censuses?
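A toy sketch of the second question. The rows below are shaped like the dataset's ZCTA population columns, but the figures are invented:

```python
# Hypothetical (zipcode -> population) figures for two census years.
pop_2000 = {"10001": 20000, "60601": 5000, "77449": 40000}
pop_2010 = {"10001": 21000, "60601": 9000, "77449": 65000}

# Population change per ZCTA present in both censuses.
change = {z: pop_2010[z] - pop_2000[z] for z in pop_2000 if z in pop_2010}

# ZCTAs ranked by greatest population increase.
top = sorted(change, key=change.get, reverse=True)
```

Against the real BigQuery tables this would be a join of the 2000 and 2010 tables on `zipcode` with an `ORDER BY` on the population difference.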
https://cloud.google.com/bigquery/images/census-population-map.png
Brackish groundwater (BGW), defined for this assessment as having a dissolved-solids concentration between 1,000 and 10,000 milligrams per liter, is an unconventional source of water that may offer a partial solution to current (2016) and future water challenges. In support of the National Water Census, the U.S. Geological Survey has completed a BGW assessment to gain a better understanding of the occurrence and character of BGW resources of the United States as an alternative source of water. Analyses completed as part of this assessment relied on previously collected data from multiple sources, and no new data were collected. One of the most important contributions of this assessment is the creation of a database containing chemical data and aquifer information for the known quantities of BGW in the United States. Data were compiled from single publications to large datasets and from local studies to national assessments, and include chemical data on the concentrations of dissolved solids, major ions, trace elements, nutrients, radionuclides, and physical properties of the resource (pH, temperature, specific conductance). This dataset represents major-ions data from a compilation of water-quality samples from 33 sources for almost 384,000 groundwater wells across the continental U.S., Alaska, Hawaii, Puerto Rico, the U.S. Virgin Islands, Guam, and American Samoa. The data are published here as an ESRI geodatabase with a point feature class and associated attribute table, and also as a non-proprietary comma-separated values table. Dissolved-solids data include information for assessing the distribution of dissolved-solids concentrations and other chemical constituents that may limit the usability of brackish groundwater. It was not possible to compile all data available for the Nation, and data selected for this investigation were mostly limited to larger datasets that were available in a digital format.
As a result, some data on a more local-scale may not be included.
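The assessment's dissolved-solids band can be expressed as a tiny classifier. The 1,000-10,000 mg/L brackish band follows the definition above; the "fresh" and "saline" labels outside that band are common usage, not part of the quoted definition:

```python
def water_class(tds_mg_per_l):
    """Classify a sample by total dissolved-solids concentration (mg/L)."""
    if tds_mg_per_l < 1_000:
        return "fresh"       # below the assessment's brackish band
    if tds_mg_per_l <= 10_000:
        return "brackish"    # the 1,000-10,000 mg/L band defined above
    return "saline"          # above the brackish band
```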
https://www.usa.gov/government-works
Reporting of Aggregate Case and Death Count data was discontinued May 11, 2023, with the expiration of the COVID-19 public health emergency declaration. Although these data will continue to be publicly available, this dataset will no longer be updated.
This archived public use dataset has 11 data elements reflecting United States COVID-19 community levels for all available counties.
The COVID-19 community levels were developed using a combination of three metrics — new COVID-19 admissions per 100,000 population in the past 7 days, the percent of staffed inpatient beds occupied by COVID-19 patients, and total new COVID-19 cases per 100,000 population in the past 7 days. The COVID-19 community level was determined by the higher of the new admissions and inpatient beds metrics, based on the current level of new cases per 100,000 population in the past 7 days. New COVID-19 admissions and the percent of staffed inpatient beds occupied represent the current potential for strain on the health system. Data on new cases acts as an early warning indicator of potential increases in health system strain in the event of a COVID-19 surge.
Using these data, the COVID-19 community level was classified as low, medium, or high.
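The two-indicator rule described above can be sketched in code. The numeric cut points below are the ones CDC published for COVID-19 Community Levels, reproduced here from memory; verify them against the CDC definition before any real use.

```python
def community_level(cases_per_100k: float,
                    admissions_per_100k: float,
                    inpatient_bed_pct: float) -> str:
    """Return 'low', 'medium', or 'high': the higher of the admissions and
    inpatient-beds indicator levels, with thresholds chosen by the case rate."""
    if cases_per_100k < 200:
        adm = ("low" if admissions_per_100k < 10
               else "medium" if admissions_per_100k < 20 else "high")
        bed = ("low" if inpatient_bed_pct < 10
               else "medium" if inpatient_bed_pct < 15 else "high")
    else:
        adm = "medium" if admissions_per_100k < 10 else "high"
        bed = "medium" if inpatient_bed_pct < 10 else "high"
    order = {"low": 0, "medium": 1, "high": 2}
    # The community level is the higher of the two indicator levels.
    return max(adm, bed, key=order.get)
```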
COVID-19 Community Levels were used to help communities and individuals make decisions based on their local context and their unique needs. Community vaccination coverage and other local information, like early alerts from surveillance, such as through wastewater or the number of emergency department visits for COVID-19, when available, can also inform decision making for health officials and individuals.
For the most accurate and up-to-date data for any county or state, visit the relevant health department website. COVID Data Tracker may display data that differ from state and local websites. This can be due to differences in how data were collected, how metrics were calculated, or the timing of web updates.
Archived Data Notes:
This dataset was renamed from "United States COVID-19 Community Levels by County as Originally Posted" to "United States COVID-19 Community Levels by County" on March 31, 2022.
March 31, 2022: Column name for county population was changed to “county_population”. No change was made to the data points previously released.
March 31, 2022: New column, “health_service_area_population”, was added to the dataset to denote the total population in the designated Health Service Area based on the 2019 Census estimate.
March 31, 2022: FIPS codes for territories American Samoa, Guam, Commonwealth of the Northern Mariana Islands, and United States Virgin Islands were re-formatted to 5-digit numeric for records released on 3/3/2022 to be consistent with other records in the dataset.
March 31, 2022: Changes were made to the text fields in variables “county”, “state”, and “health_service_area” so the formats are consistent across releases.
March 31, 2022: The “%” sign was removed from the text field in column “covid_inpatient_bed_utilization”. No change was made to the data. As indicated in the column description, values in this column represent the percentage of staffed inpatient beds occupied by COVID-19 patients (7-day average).
March 31, 2022: Data values for columns “county_population”, “health_service_area_number”, and “health_service_area” were backfilled for records released on 2/24/2022. These columns were added in the week of 3/3/2022; thus the values were previously missing for records released the week prior.
April 7, 2022: Updates made to data released on 3/24/2022 for Guam, Commonwealth of the Northern Mariana Islands, and United States Virgin Islands to correct a data mapping error.
April 21, 2022: COVID-19 Community Level (CCL) data released for counties in Nebraska for the week of April 21, 2022 have 3 counties identified in the high category and 37 in the medium category. CDC has been working with state officials to verify the data submitted, as other data systems are not providing alerts for substantial increases in disease transmission or severity in the state.
May 26, 2022: COVID-19 Community Level (CCL) data released for McCracken County, KY for the week of May 5, 2022 have been updated to correct a data processing error. McCracken County, KY should have appeared in the low community level category during the week of May 5, 2022. This correction is reflected in this update.
May 26, 2022: COVID-19 Community Level (CCL) data released for several Florida counties for the week of May 19th, 2022, have been corrected for a data processing error. Of note, Broward, Miami-Dade, Palm Beach Counties should have appeared in the high CCL category, and Osceola County should have appeared in the medium CCL category. These corrections are reflected in this update.
May 26, 2022: COVID-19 Community Level (CCL) data released for Orange County, New York for the week of May 26, 2022 displayed an erroneous case rate of zero and a CCL category of low due to a data source error. This county should have appeared in the medium CCL category.
June 2, 2022: COVID-19 Community Level (CCL) data released for Tolland County, CT for the week of May 26, 2022 have been updated to correct a data processing error. Tolland County, CT should have appeared in the medium community level category during the week of May 26, 2022. This correction is reflected in this update.
June 9, 2022: COVID-19 Community Level (CCL) data released for Tolland County, CT for the week of May 26, 2022 have been updated to correct a misspelling. The medium community level category for Tolland County, CT on the week of May 26, 2022 was misspelled as “meduim” in the data set. This correction is reflected in this update.
June 9, 2022: COVID-19 Community Level (CCL) data released for Mississippi counties for the week of June 9, 2022 should be interpreted with caution due to a reporting cadence change over the Memorial Day holiday that resulted in artificially inflated case rates in the state.
July 7, 2022: COVID-19 Community Level (CCL) data released for Rock County, Minnesota for the week of July 7, 2022 displayed an artificially low case rate and CCL category due to a data source error. This county should have appeared in the high CCL category.
July 14, 2022: COVID-19 Community Level (CCL) data released for Massachusetts counties for the week of July 14, 2022 should be interpreted with caution due to a reporting cadence change that resulted in lower than expected case rates and CCL categories in the state.
July 28, 2022: COVID-19 Community Level (CCL) data released for all Montana counties for the week of July 21, 2022 had case rates of 0 due to a reporting issue. The case rates have been corrected in this update.
July 28, 2022: COVID-19 Community Level (CCL) data released for Alaska for all weeks prior to July 21, 2022 included non-resident cases. The case rates for the time series have been corrected in this update.
July 28, 2022: A laboratory in Nevada reported a backlog of historic COVID-19 cases. As a result, the 7-day case count and rate will be inflated in Clark County, NV for the week of July 28, 2022.
August 4, 2022: COVID-19 Community Level (CCL) data was updated on August 2, 2022 in error during performance testing. Data for the week of July 28, 2022 was changed during this update due to additional case and hospital data as a result of late reporting between July 28, 2022 and August 2, 2022. Since the purpose of this data set is to provide point-in-time views of COVID-19 Community Levels on Thursdays, any changes made to the data set during the August 2, 2022 update have been reverted in this update.
August 4, 2022: In the COVID-19 Community Level (CCL) data for the week of July 28, 2022, case data was missing for 8 counties in Utah (Beaver County, Daggett County, Duchesne County, Garfield County, Iron County, Kane County, Uintah County, and Washington County) due to data collection issues. CDC and its partners have resolved the issue and the correction is reflected in this update.
August 4, 2022: Due to a reporting cadence change, case rates for all Alabama counties will be lower than expected. As a result, the CCL levels published on August 4, 2022 should be interpreted with caution.
August 11, 2022: COVID-19 Community Level (CCL) data for the week of August 4, 2022 for South Carolina have been updated to correct a data collection error that resulted in incorrect case data. CDC and its partners have resolved the issue and the correction is reflected in this update.
August 18, 2022: COVID-19 Community Level (CCL) data for the week of August 11, 2022 for Connecticut have been updated to correct a data ingestion error that inflated the CT case rates. CDC, in collaboration with CT, has resolved the issue and the correction is reflected in this update.
August 25, 2022: A laboratory in Tennessee reported a backlog of historic COVID-19 cases. As a result, the 7-day case count and rate may be inflated in many counties and the CCLs published on August 25, 2022 should be interpreted with caution.
August 25, 2022: Due to a data source error, the 7-day case rate for St. Louis County, Missouri, is reported as zero in the COVID-19 Community Level data released on August 25, 2022. Therefore, the COVID-19 Community Level for this county should be interpreted with caution.
September 1, 2022: Due to a reporting issue, case rates for all Nebraska counties will include 6 days of data instead of 7 days in the COVID-19 Community Level (CCL) data released on September 1, 2022. Therefore, the CCLs for all Nebraska counties should be interpreted with caution.
September 8, 2022: Due to a data processing error, the case rate for Philadelphia County, Pennsylvania,
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Gross Domestic Product (GDP) in the United States expanded 2 percent in the second quarter of 2025 over the same quarter of the previous year. This dataset provides the latest reported value for - United States GDP Annual Growth Rate - plus previous releases, historical high and low, short-term forecast and long-term prediction, economic calendar, survey consensus and news.
This product is part of the Landscape Change Monitoring System (LCMS) data suite. It shows LCMS modeled change classes for each year. See additional information about change in the Entity_and_Attribute_Information section below. LCMS is a remote sensing-based system for mapping and monitoring landscape change across the United States. Its objective is to develop a consistent approach using the latest technology and advancements in change detection to produce a "best available" map of landscape change. Because no algorithm performs best in all situations, LCMS uses an ensemble of models as predictors, which improves map accuracy across a range of ecosystems and change processes (Healey et al., 2018). The resulting suite of LCMS change, land cover, and land use maps offers a holistic depiction of landscape change across the United States over the past four decades. Predictor layers for the LCMS model include annual Landsat and Sentinel 2 composites, outputs from the LandTrendr and CCDC change detection algorithms, and terrain information. These components are all accessed and processed using Google Earth Engine (Gorelick et al., 2017). To produce annual composites, the cFmask (Zhu and Woodcock 2012), cloudScore, and TDOM (Chastain et al., 2019) cloud and cloud shadow masking methods are applied to Landsat Tier 1 and Sentinel 2a and 2b Level-1C top of atmosphere reflectance data. The annual medoid is then computed to summarize each year into a single composite. The composite time series is temporally segmented using LandTrendr (Kennedy et al., 2010; Kennedy et al., 2018; Cohen et al., 2018). All cloud and cloud shadow free values are also temporally segmented using the CCDC algorithm (Zhu and Woodcock, 2014).
The raw composite values, LandTrendr fitted values, pair-wise differences, segment duration, change magnitude, and slope, and CCDC September 1 sine and cosine coefficients (first 3 harmonics), fitted values, and pairwise differences, along with elevation, slope, sine of aspect, cosine of aspect, and topographic position indices (Weiss, 2001) from the National Elevation Dataset (NED), are used as independent predictor variables in a Random Forest (Breiman, 2001) model. Reference data are collected using TimeSync, a web-based tool that helps analysts visualize and interpret the Landsat data record from 1984-present (Cohen et al., 2010). Outputs fall into three categories: change, land cover, and land use. Change relates specifically to vegetation cover and includes slow loss, fast loss (which also includes hydrologic changes such as inundation or desiccation), and gain. These values are predicted for each year of the Landsat time series and serve as the foundational products for LCMS.
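The annual medoid compositing step mentioned above can be illustrated with a small standalone sketch (an assumption of this write-up, not the LCMS Earth Engine implementation): for each pixel-year, choose the observation whose band values are closest to the per-band median, one common medoid definition used in compositing.

```python
import numpy as np

def medoid_composite(obs: np.ndarray) -> np.ndarray:
    """obs: array of shape (n_observations, n_bands) for one pixel-year.
    Returns the observation closest (squared Euclidean distance) to the
    per-band median."""
    median = np.median(obs, axis=0)
    dists = np.sum((obs - median) ** 2, axis=1)
    return obs[np.argmin(dists)]
```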
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
This list ranks the 50 states in the United States by Hispanic American Indian and Alaska Native (AIAN) population, as estimated by the United States Census Bureau. It also highlights population changes in each state over the past five years.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 5-Year Estimates, including:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
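ACS margins of error combine in a documented way when estimates are summed: the Census Bureau's approximation for the margin of error of a sum is the square root of the sum of the squared individual MOEs. A minimal sketch:

```python
import math

def moe_of_sum(moes) -> float:
    """Approximate margin of error for a sum of ACS estimates
    (root sum of squares, per Census Bureau guidance)."""
    return math.sqrt(sum(m * m for m in moes))
```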
Custom data
If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
Contains: World Hillshade; World Street Map (with Relief) - Base Layer; Large Scale International Boundaries (v11.3); World Street Map (with Relief) - Labels; DoS Country Labels.
DoS Country Labels: Country (admin 0) labels that have been vetted for compliance with foreign policy and legal requirements. These labels are part of the US Federal Government Basemap, which contains the borders and place names that have been vetted for compliance with foreign policy and legal requirements. Source: DoS Country Labels - Overview (arcgis.com)
Large Scale International Boundaries, Version 11.3. Release Date: December 19, 2023.
For more information on the LSIB click here: https://geodata.state.gov/ A direct link to the data is available here: https://data.geodata.state.gov/LSIB.zip An ISO-compliant version of the LSIB metadata (in ISO 19139 format) is here: https://geodata.state.gov/geonetwork/srv/eng/catalog.search#/metadata/3bdb81a0-c1b9-439a-a0b1-85dac30c59b2 Direct inquiries to internationalboundaries@state.gov
Overview: The Office of the Geographer and Global Issues at the U.S. Department of State produces the Large Scale International Boundaries (LSIB) dataset. The current edition is version 11.3 (published 19 December 2023). The 11.3 release contains updates to boundary lines and data refinements enabling reuse of the dataset. These data and generalized derivatives are the only international boundary lines approved for U.S. Government use. The contents of this dataset reflect U.S. Government policy on international boundary alignment, political recognition, and dispute status. They do not necessarily reflect de facto limits of control.
National Geospatial Data Asset: This dataset is a National Geospatial Data Asset managed by the Department of State on behalf of the Federal Geographic Data Committee's International Boundaries Theme.
Details: Sources for these data include treaties, relevant maps, and data from boundary commissions and national mapping agencies.
Where available and applicable, the dataset incorporates information from courts, tribunals, and international arbitrations. The research and recovery process involves analysis of satellite imagery and elevation data. Due to the limitations of source materials and processing techniques, most lines are within 100 meters of their true position on the ground.
Attribute Structure: The dataset uses the following attributes: CC1, COUNTRY1, CC2, COUNTRY2, RANK, STATUS, LABEL, NOTES.
These attributes are logically linked: CC1 with COUNTRY1, CC2 with COUNTRY2, and RANK with STATUS.
These attributes have external sources: CC1 (GENC), COUNTRY1 (DoS Lists), CC2 (GENC), COUNTRY2 (DoS Lists).
The eight attributes listed above describe the boundary lines contained within the LSIB dataset in both a human- and machine-readable fashion. Other attributes in the release, including "FID", "Shape", and "Shape_Leng", are components of the shapefile format and do not form an intrinsic part of the LSIB.
The "CC1" and "CC2" fields are machine-readable fields which contain political entity codes. These codes are derived from the Geopolitical Entities, Names, and Codes Standard (GENC) Edition 3 Update 18. The dataset uses the GENC two-character codes. The code ‘Q2’, which is not in GENC, denotes a line in the LSIB representing a boundary associated with an area not contained within the GENC standard.
The "COUNTRY1" and "COUNTRY2" fields contain human-readable text corresponding to the name of the political entity. These names are approved by the U.S. Board on Geographic Names (BGN) as incorporated in the list of Independent States in the World and the list of Dependencies and Areas of Special Sovereignty maintained by the Department of State. To ensure the greatest compatibility, names are presented without diacritics and certain names are rendered using commonly accepted cartographic abbreviations. Names for lines associated with the code ‘Q2’ are descriptive and are not necessarily BGN-approved.
Names rendered in all CAPITAL LETTERS are names of independent states. Other names are those associated with dependencies, areas of special sovereignty, or are otherwise presented for the convenience of the user.
The following fields are an intrinsic part of the LSIB dataset and do not rely on external sources: RANK (mandatory, no nulls), STATUS (mandatory, no nulls), LABEL (optional, may be null), NOTES (optional, may be null). Neither the "RANK" nor "STATUS" field contains null values; the "LABEL" and "NOTES" fields do.
The "RANK" field is a numeric, machine-readable expression of the "STATUS" field. Collectively, these fields encode the views of the United States Government on the political status of the boundary line. A value of "1" in the "RANK" field corresponds to an "International Boundary" value in the "STATUS" field. Values of "2" and "3" correspond to "Other Line of International Separation" and "Special Line", respectively.
The "LABEL" field contains the text necessary to describe the line segment. The "LABEL" field is used when the line segment is displayed on maps or other forms of cartographic visualization. This includes most interactive products. The requirement to incorporate the contents of the "LABEL" field on these products is scale dependent. If a label is legible at the scale of a given static product, proper use of this dataset would encourage the application of that label. Using the contents of the "COUNTRY1" and "COUNTRY2" fields in the generation of a line segment label is not required. The "STATUS" field is not a line-labeling field but does contain the preferred description for the three LSIB line types when lines are incorporated into a map legend. Using the "CC1", "CC2", or "RANK" fields for labeling purposes is prohibited.
The "NOTES" field contains an explanation of any applicable special circumstances modifying the lines.
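The RANK-to-STATUS correspondence described above is a fixed lookup; as a machine-readable sketch:

```python
# Mapping of the numeric RANK field to its STATUS text, per the LSIB
# attribute description above.
RANK_STATUS = {
    1: "International Boundary",
    2: "Other Line of International Separation",
    3: "Special Line",
}
```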
This information can pertain to the origins of the boundary lines, any limitations regarding the purpose of the lines, or the original source of the line. Use of the "NOTES" field for labeling purposes is prohibited.External Data SourcesGeopolitical Entities, Names, and Codes Registry: https://nsgreg.nga.mil/GENC-overview.jspU.S. Department of State List of Independent States in the World: https://www.state.gov/independent-states-in-the-world/U.S. Department of State List of Dependencies and Areas of Special Sovereignty: https://www.state.gov/dependencies-and-areas-of-special-sovereignty/The source for the U.S.—Canada international boundary (NGDAID97) is the International Boundary Commission: https://www.internationalboundarycommission.org/en/maps-coordinates/coordinates.phpThe source for the “International Boundary between the United States of America and the United States of Mexico” (NGDAID82) is the International Boundary and Water Commission: https://catalog.data.gov/dataset?q=usibwcCartographic UsageCartographic usage of the LSIB requires a visual differentiation between the three categories of boundaries. Specifically, this differentiation must be between:- International Boundaries (Rank 1);- Other Lines of International Separation (Rank 2); and- Special Lines (Rank 3).Rank 1 lines must be the most visually prominent. Rank 2 lines must be less visually prominent than Rank 1 lines. Rank 3 lines must be shown in a manner visually subordinate to Ranks 1 and 2. Where scale permits, Rank 2 and 3 lines must be labeled in accordance with the “Label” field. 
Data marked with a Rank 2 or 3 designation does not necessarily correspond to a disputed boundary. Additional cartographic information can be found in Guidance Bulletins (https://hiu.state.gov/data/cartographic_guidance_bulletins/) published by the Office of the Geographer and Global Issues.
Contact: Direct inquiries to internationalboundaries@state.gov.
Credits: The lines in the LSIB dataset are the product of decades of collaboration between geographers at the Department of State and the National Geospatial-Intelligence Agency, with contributions from the Central Intelligence Agency and the UK Defence Geographic Centre. Attribution is welcome: U.S. Department of State, Office of the Geographer and Global Issues.
Changes from Prior Release: The 11.3 release is the third update in the version 11 series. This version of the LSIB contains changes and accuracy refinements for the following line segments. These changes reflect improvements in spatial accuracy derived from newly available source materials, an ongoing review process, or the publication of new treaties or agreements. Notable changes to lines include: AFGHANISTAN / IRAN; ALBANIA / GREECE; ALBANIA / KOSOVO; ALBANIA / MONTENEGRO; ALBANIA / NORTH MACEDONIA; ALGERIA / MOROCCO; ARGENTINA / BOLIVIA; ARGENTINA / CHILE; BELARUS / POLAND; BOLIVIA / PARAGUAY; BRAZIL / GUYANA; BRAZIL / VENEZUELA; BRAZIL / French Guiana (FR.); BRAZIL / SURINAME; CAMBODIA / LAOS; CAMBODIA / VIETNAM; CAMEROON / CHAD; CAMEROON / NIGERIA; CHINA / INDIA; CHINA / NORTH KOREA; CHINA / Aksai Chin; COLOMBIA / VENEZUELA; CONGO, DEM. REP. OF THE / UGANDA; CZECHIA / GERMANY; EGYPT / LIBYA; ESTONIA / RUSSIA; French Guiana (FR.) / SURINAME; GREECE / NORTH MACEDONIA; GUYANA / VENEZUELA; INDIA / Aksai Chin; KAZAKHSTAN / RUSSIA; KOSOVO / MONTENEGRO; KOSOVO / SERBIA; LAOS / VIETNAM; LATVIA / LITHUANIA; MEXICO / UNITED STATES; MONTENEGRO / SERBIA; MOROCCO / SPAIN; POLAND / RUSSIA; ROMANIA / UKRAINE.
Versions 11.0 and 11.1 were updates to boundary lines.
Like this version, they also contained topology fixes, land boundary terminus refinements, and tripoint adjustments. Version 11.2 corrected a few errors in the attribute data and ensured that CC1 and CC2 attributes are in alignment with an updated version of the Geopolitical Entities, Names, and Codes (GENC) Standard, specifically Edition 3 Update 17.
Layers: Large_Scale_International_Boundaries
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
There are several works based on Natural Language Processing on newspaper reports. Mining opinions from headlines [ 1 ] using Stanford NLP and SVM, Rameshbhai et al. compared several algorithms on a small and a large dataset. Rubin et al., in their paper [ 2 ], created a mechanism to differentiate fake news from real news by building a set of characteristics of news according to their types. The purpose was to contribute to the low-resource data available for training machine learning algorithms. Doumit et al. in [ 3 ] implemented LDA, a topic modeling approach, to study bias present in online news media.
However, not much NLP research has been invested in studying COVID-19. Most applications include classification of chest X-rays and CT scans to detect the presence of pneumonia in lungs [ 4 ], a consequence of the virus. Other research areas include studying the genome sequence of the virus [ 5 ][ 6 ][ 7 ] and replicating its structure to fight it and find a vaccine. This research is crucial in battling the pandemic. The few NLP-based research publications include sentiment classification of online tweets by Samuel et al. [ 8 ] to understand the fear persisting in people due to the virus. Similar work has been done using an LSTM network to classify sentiments from online discussion forums by Jelodar et al. [ 9 ]. To the best of our knowledge, the NKK dataset is the first study on a comparatively larger dataset of newspaper reports on COVID-19, contributing to awareness of the virus.
2 Data-set Introduction
2.1 Data Collection
We accumulated 1,000 online newspaper reports from the United States of America (USA) on COVID-19. The newspapers include The Washington Post (USA) and StarTribune (USA). We have named this dataset “Covid-News-USA-NNK”. We also accumulated 50 online newspaper reports from Bangladesh on the issue and named it “Covid-News-BD-NNK”. The newspapers include The Daily Star (BD) and Prothom Alo (BD). All these newspapers are among the top providers and most read in their respective countries. The collection was done manually by 10 human data collectors of age group 23- with university degrees. This approach was preferable to automation to ensure the news was highly relevant to the subject. The newspaper websites had dynamic content with advertisements in no particular order, so there was a high chance of online scrapers collecting inaccurate news reports. One of the challenges while collecting the data was the requirement of a subscription; each newspaper required $1 per subscription. Some criteria for collecting the news reports, provided as guidelines to the human data collectors, were as follows:
The headline must have one or more words directly or indirectly related to COVID-19.
The content of each news must have 5 or more keywords directly or indirectly related to COVID-19.
The genre of the news can be anything as long as it is relevant to the topic. Political, social, and economic genres are to be prioritized.
Avoid taking duplicate reports.
Maintain a time frame for the above mentioned newspapers.
To collect these data we used a Google Form for the USA and BD. Two human editors went through each entry to check for any spam or troll entries.
2.2 Data Pre-processing and Statistics
Some pre-processing steps performed on the newspaper report dataset are as follows:
Remove hyperlinks.
Remove non-English alphanumeric characters.
Remove stop words.
Lemmatize text.
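The four steps above can be sketched as a small pipeline. The stopword list and lemmatizer here are illustrative stand-ins (a real pipeline would use NLTK's WordNetLemmatizer or spaCy), not the authors' actual script:

```python
import re

# Tiny illustrative stopword list; the real pipeline's word lists are assumptions.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in"}

def lemmatize(word: str) -> str:
    # Placeholder lemmatizer: strips a trailing "s"; a real pipeline would
    # use a dictionary-based lemmatizer.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def preprocess(text: str) -> list:
    text = re.sub(r"https?://\S+", " ", text)        # remove hyperlinks
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)      # keep English alphanumerics
    tokens = [t.lower() for t in text.split()]
    tokens = [t for t in tokens if t not in STOPWORDS]  # remove stop words
    return [lemmatize(t) for t in tokens]            # lemmatize
```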
While more pre-processing could have been applied, we tried to keep the data as unchanged as possible, since changing sentence structures could result in the loss of valuable information. While this was done with the help of a script, we also assigned the same human collectors to cross-check the output against the above-mentioned criteria.
The primary data statistics of the two datasets are shown in Tables 1 and 2.
Table 1: Covid-News-USA-NNK data statistics
No. of words per headline: 7 to 20
No. of words per body content: 150 to 2100
Table 2: Covid-News-BD-NNK data statistics
No. of words per headline: 10 to 20
No. of words per body content: 100 to 1500
2.3 Dataset Repository
We used GitHub as our primary data repository, under the account name NKK^1. Here, we created two repositories, USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON formats. We regularly update the CSV files and regenerate the JSON using a Python script. We provide a Python script file for essential operations. We welcome outside collaboration to enrich the dataset.
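The CSV-to-JSON regeneration mentioned above can be sketched with the standard library; the column names used here are hypothetical, as the repositories' actual layout is not described in this section:

```python
import csv
import io
import json

def csv_to_json(csv_text: str) -> str:
    """Read CSV text and emit the same records as a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)
```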
3 Literature Review
Natural Language Processing (NLP) deals with text (also known as categorical) data in computer science, utilizing numerous diverse methods like one-hot encoding, word embedding, etc., that transform text into numeric representations which can be fed to machine learning and deep learning algorithms.
Some well-known applications of NLP include fraud detection on online media sites [ 10 ], authorship attribution in fallback authentication systems [ 11 ], intelligent conversational agents or chatbots [ 12 ], and the machine translation used by Google Translate [ 13 ]. While these are all downstream tasks, several exciting developments have been made in algorithms solely for Natural Language Processing. The two most trending are BERT [ 14 ], which uses a bidirectional Transformer encoder architecture that can do near-perfect classification and next-word prediction, and the GPT-3 models released by OpenAI [ 15 ], which can generate almost human-like text. However, these are all pre-trained models, since they carry a huge computational cost. Information Extraction is a generalized concept of retrieving information from a dataset. Information extraction from an image could mean retrieving vital feature spaces or targeted portions of an image; information extraction from speech could mean retrieving information about names, places, etc. [ 16 ]. Information extraction from text could mean identifying named entities, locations, or other essential data. Topic modeling is a sub-task of NLP and also a process of information extraction. It clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives us a brief idea about a set of texts. One commonly used topic model is Latent Dirichlet Allocation, or LDA [17].
Keyword extraction is an information extraction process and a sub-task of NLP that extracts essential words and phrases from a text. TextRank[18] is an efficient keyword extraction technique that uses a graph to calculate the weight of each word and selects the words with the highest weights.
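TextRank's core idea, running a PageRank-style iteration over a word co-occurrence graph, can be sketched in pure Python. The window size, damping factor, and iteration count below are illustrative defaults, not the original paper's exact settings:

```python
from collections import defaultdict

def textrank_keywords(words, window=2, damping=0.85, iters=50):
    """Rank words by a PageRank iteration over a co-occurrence graph."""
    # Build an undirected graph: words within `window` positions
    # of each other are connected.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    # Iterate the PageRank update until scores stabilize.
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        score = {
            w: (1 - damping) + damping * sum(
                score[u] / len(neighbors[u]) for u in neighbors[w])
            for w in neighbors
        }
    # Highest-scoring words first.
    return sorted(score, key=score.get, reverse=True)

tokens = "virus spread fast virus hit economy economy slowed".split()
print(textrank_keywords(tokens)[:3])
```

Production implementations add part-of-speech filtering and phrase merging on top of this ranking step.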
Word clouds are a great visualization technique for understanding the overall 'talk of the topic'. The clustered words give us a quick understanding of the content.
4 Our Experiments and Result Analysis
We used the wordcloud library^4 to create the word clouds. Figures 1 and 3 present the word clouds of the Covid-News-USA-NNK dataset by month from February to May. From Figures 1, 2, and 3, we can make the following observations:
In February, both newspapers talked about China and the source of the outbreak.
StarTribune emphasized Minnesota as the state of greatest concern; this concern appears even stronger in April.
Both newspapers talked about the virus impacting the economy, e.g., banks, elections, administrations, and markets.
The Washington Post discussed global issues more than StarTribune.
In February, StarTribune mentioned the first precautionary measure, wearing masks, and the uncontrollable spread of the virus throughout the nation.
While both newspapers mentioned the outbreak in China in February, the spread within the United States is more heavily highlighted throughout March to May, displaying the critical impact caused by the virus.
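The word clouds behind these observations are driven by word frequencies. A minimal sketch of that counting step with the standard library (the actual figures were generated with the wordcloud library, and the stop-word list here is illustrative):

```python
import re
from collections import Counter

# Illustrative stop-word list; real pipelines use a fuller one.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "on"}

def word_frequencies(text):
    """Count non-stopword tokens; a word cloud scales each word by its count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

sample = "The virus spread in China. The virus hit the economy."
print(word_frequencies(sample).most_common(3))
```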
We used a script to extract all numbers related to certain keywords such as 'Deaths', 'Infected', 'Died', 'Infections', 'Quarantined', 'Lock-down', and 'Diagnosed' from the news reports and compiled case counts for both newspapers. Figure 4 shows the statistics of this series. From this extraction we can observe that April was the peak month for COVID-19 cases, after a gradual rise from February. Both newspapers clearly show that the rise in cases from February to March was slower than the rise from March to April. This is an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows a positive response against the outbreak. We used VADER sentiment analysis to extract the sentiment of the headlines and bodies. On average, the sentiment scores ranged from -0.5 to -0.9; the VADER sentiment scale ranges from -1 (highly negative) to 1 (highly positive). There were some cases where the sentiment scores of the headline and body contradicted each other, i.e., the sentiment of the headline was negative but the sentiment of the body was slightly positive. Overall, sentiment analysis can help us sort the most concerning (most negative) news from the positive news, from which we can learn more about the indicators related to COVID-19 and the serious impact it caused. Moreover, sentiment analysis can also tell us how a state or country is reacting to the pandemic. We used the PageRank algorithm to extract keywords from the headlines as well as the body content. PageRank efficiently highlights important, relevant keywords in the text. Some frequently occurring important keywords extracted from both datasets are: 'China', 'Government', 'Masks', 'Economy', 'Crisis', 'Theft', 'Stock market', 'Jobs', 'Election', 'Missteps', 'Health', and 'Response'. Keyword extraction acts as a filter, allowing quick searches for indicators such as the state of the economy.
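The number-extraction step described above can be sketched as a regular-expression pass that captures numbers appearing next to case-related keywords. The keyword pattern and matching rules below are assumptions based on the description, not the paper's exact script:

```python
import re

# Case-related keywords drawn from the ones listed in the text.
KEYWORDS = r"deaths?|died|infect(?:ed|ions)|quarantined|lock-?down|diagnosed"

# Capture a number immediately before or after a keyword,
# e.g. "1,200 deaths" or "infections: 35,000".
PATTERN = re.compile(
    r"(\d[\d,]*)\s+(?:people\s+)?(?:%s)|(?:%s)[:\s]+(\d[\d,]*)"
    % (KEYWORDS, KEYWORDS),
    re.IGNORECASE,
)

def extract_counts(text):
    """Return all numbers found next to case-related keywords."""
    counts = []
    for match in PATTERN.finditer(text):
        number = match.group(1) or match.group(2)
        counts.append(int(number.replace(",", "")))
    return counts

print(extract_counts("Officials reported 1,200 deaths; infections: 35,000 statewide."))
```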
The United States census count (also known as the Decennial Census of Population and Housing) is a count of every resident of the US. The census occurs every 10 years and is conducted by the United States Census Bureau. Census data is publicly available through the census website, but much of it is provided only as summaries and graphs. The raw data is often difficult to obtain, is typically divided by region, and must be processed and combined to provide information about the nation as a whole. Update frequency: Historic (none)
United States Census Bureau
SELECT
  zipcode,
  population
FROM
  `bigquery-public-data.census_bureau_usa.population_by_zip_2010`
WHERE
  gender = ''
ORDER BY
  population DESC
LIMIT
  10
This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
See the GCP Marketplace listing for more details and sample queries: https://console.cloud.google.com/marketplace/details/united-states-census-bureau/us-census-data