Introduction: I have chosen to complete a data analysis project for the second course option, Bellabeat, Inc., using a locally installed spreadsheet program, Excel, for both my data analysis and visualizations. I made this choice primarily because I live in a remote area with limited bandwidth and inconsistent internet access, so completing a capstone project using web-based programs such as RStudio, SQL Workbench, or Google Sheets was not feasible. My choice of project was further constrained because the datasets for the ride-share option were larger than my version of Excel could open. In the scenario provided, I act as a junior data analyst supporting the Bellabeat, Inc. executive team and data analytics team. This combined team has decided to use an existing public dataset in the hope that findings from that dataset will reveal insights to guide Bellabeat's marketing strategies for future growth. My task is to provide data-driven insights in response to the business tasks set by the Bellabeat executive and data analysis teams. To accomplish this, I will complete all parts of the data analysis process (Ask, Prepare, Process, Analyze, Share, Act), and I will break each part down into three sections to provide clarity and accountability: Guiding Questions, Key Tasks, and Deliverables. To save space and avoid repetition, I record the deliverables for each Key Task directly under the numbered task, marked with an asterisk (*).
Section 1 - Ask: A. Guiding Questions: Who are the key stakeholders and what are their goals for the data analysis project? What is the business task that this data analysis project is attempting to solve?
B. Key Tasks: 1. Identify key stakeholders and their goals for the data analysis project. *The key stakeholders for this project are: Urška Sršen and Sando Mur, co-founders of Bellabeat, Inc.; and the Bellabeat marketing analytics team, of which I am a member. 2. Identify the business task. *As provided by co-founder Urška Sršen, the business task is to gain insight into how consumers use their non-Bellabeat smart devices in order to guide upcoming marketing strategies for the company and help drive future growth. Specifically, I was tasked with applying insights from the data analysis process to one Bellabeat product and presenting those insights to Bellabeat stakeholders.
Section 2 - Prepare: A. Guiding Questions: Where is the data stored and organized? Are there any problems with the data? How does the data help answer the business question?
B. Key Tasks: 1. Research and communicate to stakeholders the source of the data and how it is stored and organized. *The data source used for this case study is the FitBit Fitness Tracker Data. The dataset is hosted on Kaggle and was made available by the user Mobius in an open-source format; the data are therefore public and may be copied, modified, and distributed without asking the user for permission. The datasets were generated by respondents to a survey distributed via Amazon Mechanical Turk, reportedly (see the credibility notes directly below) between 03/12/2016 and 05/12/2016. *Reportedly (see the credibility notes directly below), thirty eligible Fitbit users consented to the submission of personal tracker data, including output related to steps taken, calories burned, time spent sleeping, heart rate, and distance traveled, broken down into minute-, hour-, and day-level totals and stored in 18 CSV files. I downloaded all 18 files to my laptop and chose to use 2 of them for this project, as they merge the activity and sleep data from the other files; all unused files were permanently deleted from the laptop. The 2 files used were: -sleepDay_merged.csv -dailyActivity_merged.csv 2. Identify and communicate to stakeholders any problems found with the data related to credibility and bias. *As will be presented more specifically in the Process section, the data appear to have a credibility issue related to the reported time frame: the metadata indicate that the collected data cover roughly 2 months of FitBit tracking, but my initial processing found only 1 month of reported data. *The data also have a credibility issue related to the number of individuals who reported FitBit data.
Specifically, the metadata communicate that 30 individual users agreed to report their tracking data, yet my initial data processing uncovered 33 individual IDs in the dailyActivity_merged dataset. *Due to the small number of participants (...
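Although my processing was done in Excel, the unique-ID check described above can be sketched in Python for reproducibility. The "Id" column name is an assumption about the CSV header; adjust it if the actual file differs:

```python
# Count unique participant IDs in a Fitbit tracker CSV (stdlib-only sketch;
# the "Id" column name is assumed, not confirmed from the file itself).
import csv

def count_participants(csv_path, id_column="Id"):
    """Return the number of unique participant IDs in a tracker CSV."""
    with open(csv_path, newline="") as f:
        return len({row[id_column] for row in csv.DictReader(f)})

# Usage (path assumes the file sits next to the script):
# count_participants("dailyActivity_merged.csv")
```

Running this on dailyActivity_merged.csv is how a reader could verify the 33-ID count reported above.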
Polygon shapefile showing the footprint boundaries, source agency origins, and resolutions of compiled bathymetric digital elevation models (DEMs) used to construct a continuous, high-resolution DEM of the southern portion of San Francisco Bay.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This page contains the source data for the manuscript describing the Data Citation Explorer, currently in review for publication. The preprint version can be found on this page.
Files:
DCE_manual_eval_sample.xlsx:
This file was used to manually evaluate hits generated by the Data Citation Explorer. There are two separate sheets: one with publications returned by searches in PubMed and PubMed Central, and another with publications returned by searches in Dimensions. Column descriptions can be found in the file itself. Each row in each evaluation sheet represents a pairing of a JAMO record and a linked publication.
DCE_citation_report.csv
Contains JAMO record IDs and PubMed IDs from the initial 2020 DCE trial run. There are 238,994 unique JAMO IDs and 30,641 unique PubMed IDs. 78,104 JAMO records are linked with publications.
Columns:
DCE_source_files.zip:
This folder contains three files for each JAMO record listed in DCE_citation_report.csv:
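As a hypothetical sketch (the real column names are documented in the file itself and are assumptions here), the summary counts reported for DCE_citation_report.csv could be reproduced like this:

```python
# Summarize a DCE citation report: unique JAMO IDs, unique PubMed IDs, and
# JAMO records linked to at least one publication. Column names "jamo_id"
# and "pubmed_id" are placeholders for the file's actual headers.
import csv

def summarize(csv_path, jamo_col="jamo_id", pubmed_col="pubmed_id"):
    jamo, pubmed, linked = set(), set(), set()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            jamo.add(row[jamo_col])
            if row.get(pubmed_col):       # non-empty PubMed ID => linked record
                pubmed.add(row[pubmed_col])
                linked.add(row[jamo_col])
    return len(jamo), len(pubmed), len(linked)
```

On the real file, this should return counts matching the figures given above (238,994 unique JAMO IDs, 30,641 unique PubMed IDs, 78,104 linked records), assuming the column names are corrected to the actual headers.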
Jurisdictional Unit, 2022-05-21. For use with WFDSS, IFTDSS, IRWIN, and InFORM. This is a feature service which provides Identify and Copy Feature capabilities. If fast drawing at coarse zoom levels is a requirement, consider using the tile (map) service layer located at https://nifc.maps.arcgis.com/home/item.html?id=3b2c5daad00742cd9f9b676c09d03d13.
Overview: The Jurisdictional Agencies dataset is developed as a national land management geospatial layer, focused on representing wildland fire jurisdictional responsibility, for interagency wildland fire applications, including WFDSS (Wildland Fire Decision Support System), IFTDSS (Interagency Fuels Treatment Decision Support System), IRWIN (Interagency Reporting of Wildland Fire Information), and InFORM (Interagency Fire Occurrence Reporting Modules). It is intended to provide federal wildland fire jurisdictional boundaries on a national scale. The agency and unit names indicate the primary manager name and unit name, respectively, recognizing that: there may be multiple owner names; jurisdiction may be held jointly by agencies at different levels of government (i.e., State and Local), especially on private lands; some owner names may be blocked for security reasons; and some jurisdictions may not allow the distribution of owner names. Private ownerships are shown in this layer with JurisdictionalUnitIdentifier=null, JurisdictionalUnitAgency=null, JurisdictionalUnitKind=null, LandownerKind="Private", and LandownerCategory="Private". All land inside the US country boundary is covered by a polygon. Jurisdiction for privately owned land varies widely depending on state, county, or local laws and ordinances, fire workload, and other factors, and is not available in a national dataset in most cases. For publicly held lands the agency name is the surface managing agency, such as the Bureau of Land Management or the United States Forest Service. The unit name refers to the descriptive name of the polygon (e.g.,
Northern California District, Boise National Forest, etc.). These data are used to automatically populate fields on the WFDSS Incident Information page. This data layer implements the NWCG Jurisdictional Unit Polygon Geospatial Data Layer Standard.
Relevant NWCG Definitions and Standards
Unit - A generic term that represents an organizational entity that only has meaning when it is contextualized by a descriptor, e.g. jurisdictional. Definition extension: when referring to an organizational entity, a unit refers to the smallest area or lowest level. Higher levels of an organization (region, agency, department, etc.) can be derived from a unit based on organization hierarchy.
Unit, Jurisdictional - The governmental entity having overall land and resource management responsibility for a specific geographical area as provided by law. Definition extension: 1) ultimately responsible for the fire report to account for statistical fire occurrence; 2) responsible for setting fire management objectives; 3) jurisdiction cannot be re-assigned by agreement; 4) the nature and extent of the incident determines jurisdiction (for example, Wildfire vs. All Hazard); 5) responsible for signing a Delegation of Authority to the Incident Commander. See also: Unit, Protecting; Landowner.
Unit Identifier - This data standard specifies the standard format and rules for Unit Identifier, a code used within the wildland fire community to uniquely identify a particular government organizational unit.
Landowner Kind & Category - This data standard provides a two-tier classification (kind and category) of landownership.
Attribute Fields
JurisdictionalAgencyKind - Describes the type of unit jurisdiction using the NWCG Landowner Kind data standard. There are two valid values: Federal and Other. A value may not be populated for all polygons.
JurisdictionalAgencyCategory - Describes the type of unit jurisdiction using the NWCG Landowner Category data standard. Valid values include: ANCSA, BIA, BLM, BOR, DOD, DOE, NPS, USFS, USFWS, Foreign, Tribal, City, County, OtherLoc (other local, not in the standard), and State. A value may not be populated for all polygons.
JurisdictionalUnitName - The name of the jurisdictional unit. Where an NWCG Unit ID exists for a polygon, this is the name used in the Name field from the NWCG Unit ID database. Where no NWCG Unit ID exists, this is the "Unit Name" or other specific, descriptive unit name field from the source dataset. A value is populated for all polygons.
JurisdictionalUnitID - Where it could be determined, this is the NWCG Standard Unit Identifier (Unit ID). Where it is unknown, the value is Null. Null Unit IDs can occur because a unit may not have a Unit ID, or because one could not be reliably determined from the source data. Not every land ownership has an NWCG Unit ID. Unit ID assignment rules are available from the Unit ID standard, linked above.
LandownerKind - The landowner kind value associated with the polygon. May be inferred from the jurisdictional agency, or by the lack of one. A value is populated for all polygons. There are three valid values: Federal, Private, or Other.
LandownerCategory - The landowner category value associated with the polygon. May be inferred from the jurisdictional agency, or by the lack of one. A value is populated for all polygons. Valid values include: ANCSA, BIA, BLM, BOR, DOD, DOE, NPS, USFS, USFWS, Foreign, Tribal, City, County, OtherLoc (other local, not in the standard), State, and Private.
DataSource - The database from which the polygon originated. Be as specific as possible; identify the geodatabase name and feature class in which the polygon originated.
SecondaryDataSource - If the data source is an aggregation from other sources, this field specifies the source that supplied data to the aggregation. For example, if DataSource is "PAD-US 2.1", then for a USDA Forest Service polygon the SecondaryDataSource would be "USDA FS Automated Lands Program (ALP)", and for a BLM polygon in the same dataset it would be "Surface Management Agency (SMA)".
SourceUniqueID - Identifier (GUID or ObjectID) in the data source. Used to trace the polygon back to its authoritative source.
MapMethod - Controlled vocabulary to define how the geospatial feature was derived. Map method may help define data quality. MapMethod will be Mixed Methods by default for this layer, as the data are from mixed sources. Valid values include: GPS-Driven; GPS-Flight; GPS-Walked; GPS-Walked/Driven; GPS-Unknown Travel Method; Hand Sketch; Digitized-Image; Digitized-Topo; Digitized-Other; Image Interpretation; Infrared Image; Modeled; Mixed Methods; Remote Sensing Derived; Survey/GCDB/Cadastral; Vector; Phone/Tablet; Other.
DateCurrent - The last edit or update of this GIS record. The date should follow the assigned NWCG Date Time data standard, using a 24-hour clock, YYYY-MM-DDhh.mm.ssZ, ISO 8601 standard.
Comments - Additional information describing the feature.
GeometryID - Primary key for linking geospatial objects with other database systems. Required for every feature. This field may be renamed for each standard to fit the feature.
JurisdictionalUnitID_sansUS - NWCG Unit ID with the "US" characters removed from the beginning. Provided for backwards compatibility.
JoinMethod - Additional information on how the polygon was matched to information in the NWCG Unit ID database.
LocalName - Local name for the polygon, provided from PAD-US or another source.
LegendJurisdictionalAgency - Jurisdictional agency, but with smaller landholding agencies or agencies of indeterminate status grouped for more intuitive use in a map legend or summary table.
LegendLandownerAgency - Landowner agency, grouped in the same way as above.
DataSourceYear - Year that the source data for the polygon were acquired.
Data Input
This dataset is based on an aggregation of 4 spatial data sources: the Protected Areas Database of the United States (PAD-US 2.1), data from Bureau of Indian Affairs regional offices, the BLM Alaska Fire Service/State of Alaska, and Census block-group geometry. NWCG Unit ID and agency Kind/Category data are tabular and sourced from UnitIDActive.txt in the WFMI Unit ID application (https://wfmi.nifc.gov/unit_id/Publish.html). Areas with unknown Landowner Kind/Category and Jurisdictional Agency Kind/Category are assigned LandownerKind and LandownerCategory values of "Private" by use of the non-water polygons from the Census block-group geometry.
PAD-US 2.1: This dataset is based in large part on the USGS Protected Areas Database of the United States (PAD-US 2.1). PAD-US is a compilation of authoritative protected areas data between agencies and organizations that ultimately results in a comprehensive and accurate inventory of protected areas for the United States to meet a variety of needs (e.g. conservation, recreation, public health, transportation, energy siting, ecological, or watershed assessments and planning). Extensive documentation on PAD-US processes and data sources is available.
How these data were aggregated: Boundaries, and their descriptors, available in spatial databases (i.e.
shapefiles or geodatabase feature classes) from land management agencies are the desired and primary data sources in PAD-US. If these authoritative sources are unavailable, or the agency recommends another source, data may be incorporated by other aggregators such as non-governmental organizations. Data sources are tracked for each record in the PAD-US geodatabase (see below).BIA and Tribal Data:BIA and Tribal land management data are not available in PAD-US. As such, data were aggregated from BIA regional offices. These data date from 2012 and were substantially updated in 2022. Indian Trust Land affiliated with Tribes, Reservations, or BIA Agencies: These data are not considered the system of record and are not intended to be used as such. The Bureau of Indian Affairs (BIA), Branch of Wildland Fire Management (BWFM) is not the originator of these data. The
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
This is the source data used to produce the visualisations in a Nature Energy manuscript authored by Yuwan Malakar and Rosie Day. The manuscript compares women's perspectives on the relationships between their wellbeing and the cooking fuels they use. The study was conducted in rural India. Qualitative data generated from focus group discussions were used for the analysis. The data were collected from November 2016 to February 2017. Lineage: this data was produced via R code. The source data are in *.csv format.
This text file "Solar radiation.txt" contains hourly data in Langleys and an associated data-source flag from January 1, 1948, to September 30, 2016. The primary source of the data is the Argonne National Laboratory, Illinois. The data-source flag consists of a three-digit sequence of the form "xyz" that describes the origin and transformations of the data values. The flags indicate whether the data are original or missing, the method that was used to fill missing periods, and any other transformations of the data. Bera (2014) describes in detail the addition of a new data-source flag based on the regression analysis of the backup data series at St. Charles (STC) for water years (WY) 2008-10. Users of the data should consult Over and others (2010) and Bera (2014) for detailed documentation of the data-source flag. References cited: Over, T.M., Price, T.H., and Ishii, A.L., 2010, Development and analysis of a meteorological database, Argonne National Laboratory, Illinois: U.S. Geological Survey Open-File Report 2010-1220, 67 p., http://pubs.usgs.gov/of/2010/1220/. Bera, M., 2014, Watershed Data Management (WDM) database for Salt Creek streamflow simulation, DuPage County, Illinois, water years 2005-11: U.S. Geological Survey Data Series 870, 18 p., http://dx.doi.org/10.3133/ds870.
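As a sketch, the three-digit flag can be split into its digit positions. The field names below are placeholders; the authoritative digit meanings are documented in Over and others (2010) and Bera (2014):

```python
# Illustrative parse of a three-digit "xyz" data-source flag. The digit
# labels (origin / fill method / transform) are placeholders standing in
# for the code tables defined in the cited USGS reports.
def parse_flag(flag):
    """Split a three-digit flag 'xyz' into its digit components."""
    if len(flag) != 3 or not flag.isdigit():
        raise ValueError(f"expected a three-digit flag, got {flag!r}")
    x, y, z = flag
    return {"origin": int(x), "fill_method": int(y), "transform": int(z)}

print(parse_flag("102"))  # {'origin': 1, 'fill_method': 0, 'transform': 2}
```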
U.S. Government Works: https://www.usa.gov/government-works
Compressed file sedPbZn_SourceData_gdb.zip contains the GIS datasets and Python scripts used to calculate the estimated potential and certainty that sediment-hosted Pb-Zn (lead-zinc) deposits might be present in an area of Alaska. The statewide datasets include: the Alaska Geochemical Database (AGDB3), the Alaska Resource Data File (ARDF), lithology layers created from the Alaska Geologic Map (SIM3340), and 12-digit HUCs (subwatersheds) from the National Watershed Boundary Dataset. FGDC metadata for all datasets are included. In addition, files are included for the user to modify the parameters of the analysis. These include two Python scripts: 1) to score ARDF sites for sediment-hosted Pb-Zn potential, and 2) to evaluate each 12-digit HUC for sediment-hosted Pb-Zn potential and certainty based on queries on AGDB3, ARDF, and lithology. An .mxd file and cartography layers are included for viewing the data selections in ArcGIS. Other supporting documents are included.
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
A. SUMMARY This dataset includes data on a variety of substance use services funded by the San Francisco Department of Public Health (SFDPH). It covers only Drug Medi-Cal-certified residential treatment, withdrawal management, and methadone treatment; other private, non-Drug Medi-Cal treatment providers may operate in the city. Withdrawal management discharges include anyone who left withdrawal management after admission, including those who left before completing it.
This dataset also includes naloxone distribution from the SFDPH Behavioral Health Services Naloxone Clearinghouse and the SFDPH-funded Drug Overdose Prevention and Education program. Both programs distribute naloxone to various community-based organizations, which then distribute it to their program participants. Programs may also receive naloxone from other sources; data from those other sources are not included in this dataset.
Finally, this dataset includes the number of clients on medications for opioid use disorder (MOUD).
The number of people who were treated with methadone at a Drug Medi-Cal-certified Opioid Treatment Program (OTP) by year is populated by the San Francisco Department of Public Health (SFDPH) Behavioral Health Services Quality Management (BHSQM) program. OTPs in San Francisco are required to submit patient billing data in an electronic medical record system called Avatar. BHSQM calculates the number of people who received methadone annually based on Avatar data. Only data from Drug Medi-Cal-certified OTPs were included in this dataset.
The number of people who receive buprenorphine by year is populated from the Controlled Substance Utilization Review and Evaluation System (CURES), administered by the California Department of Justice. All licensed prescribers in California are required to document controlled substance prescriptions in CURES. The Center on Substance Use and Health calculates the total number of people who received a buprenorphine prescription annually based on CURES data. Formulations of buprenorphine that are prescribed only for pain management are excluded.
People may receive buprenorphine and methadone in the same year, so you cannot add the Buprenorphine Clients by Year and Methadone Clients by Year data together to get the total number of unique people receiving medications for opioid use disorder.
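The caveat above can be seen with a toy example using set arithmetic; the client identifiers are made up for illustration:

```python
# Toy illustration: a person treated with both methadone and buprenorphine
# in the same year is counted twice by a simple sum of the two client
# counts, but only once in the union of unique people.
methadone_clients = {"person_a", "person_b", "person_c"}
buprenorphine_clients = {"person_c", "person_d"}

print(len(methadone_clients) + len(buprenorphine_clients))  # 5 (person_c counted twice)
print(len(methadone_clients | buprenorphine_clients))       # 4 unique people
```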
For more information on where to find treatment in San Francisco, visit findtreatment-sf.org.
B. HOW THE DATASET IS CREATED This dataset is created by copying the data into this dataset from the SFDPH Behavioral Health Services Quality Management Program, the California Controlled Substance Utilization Review and Evaluation System (CURES), and the Office of Overdose Prevention.
C. UPDATE PROCESS Residential Substance Use Treatment, Withdrawal Management, Methadone, and Naloxone data are updated quarterly with a 45-day delay. Buprenorphine data are updated quarterly and when the state makes this data available, usually at a 5-month delay.
D. HOW TO USE THIS DATASET Throughout the year this dataset may include partial year data for methadone and buprenorphine treatment. As both methadone and buprenorphine are used as long-term treatments for opioid use disorder, many people on treatment at the end of one calendar year will continue into the next. For this reason, doubling (methadone), or quadrupling (buprenorphine) partial year data will not accurately project year-end totals.
E. RELATED DATASETS Overdose-Related 911 Responses by Emergency Medical Services Unintentional Overdose Death Rates by Race/Ethnicity Preliminary Unintentional Drug Overdose Deaths
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Background: Clean water is an essential part of healthy human life and wellbeing. Rapid population growth, high illiteracy rates, a lack of sustainable development, and climate change pose global challenges in developing countries. Discontinuity of the drinking water supply forces households either to use unsafe water storage materials or to use water from unsafe sources. The present study aimed to identify the determinants of water source types, use, water quality, and perceptions of physical sanitation parameters among urban households in North-West Ethiopia.
Methods: A community-based cross-sectional study was conducted among households from February to March 2019. A pretested, structured, interview-based questionnaire was used to collect the data. Samples were selected randomly, in proportion to each kebele's households. MS Excel and R version 3.6.2 were used to enter and analyze the data, respectively. Descriptive statistics (frequencies and percentages) were used to describe the sample data with respect to the predictor variables. Both bivariate and multivariate logistic regressions were used to assess the association between the independent and response variables.
Results: Four hundred eighteen (418) households participated. Based on the study undertaken, 78.95% of households used improved and 21.05% used unimproved drinking water sources. Households' drinking water sources were significantly associated with the age of the participant (χ² = 20.392, df = 3), educational status (χ² = 19.358, df = 4), source of income (χ² = 21.777, df = 3), monthly income (χ² = 13.322, df = 3), availability of additional facilities (χ² = 98.144, df = 7), cleanliness status (χ² = 42.979, df = 4), scarcity of water (χ² = 5.1388, df = 1), and family size (χ² = 9.934, df = 2). The logistic regression analysis also indicated that these factors significantly determine the water source types used by the households. Factors such as availability of a toilet facility, household member type, and sex of the head of the household were not significantly associated with drinking water sources.
Conclusion: The use of drinking water from improved sources was determined by different demographic, socio-economic, sanitation, and hygiene-related factors. Therefore, the local, regional, and national governments and other supporting organizations should improve the accessibility and adequacy of drinking water from improved sources in the area.
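The χ² tests of association reported above can be sketched with a small, stdlib-only implementation (the study itself used R 3.6.2; the example tables below are illustrative, not the study's data):

```python
# Pearson chi-square statistic for a 2D contingency table, the statistic
# behind the bivariate association tests reported above. Stdlib-only
# sketch; the example tables are made up for illustration.
def chi_square(table):
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n  # expected count under independence
            stat += (obs - exp) ** 2 / exp
    return stat

# Perfectly balanced table -> no association:
print(chi_square([[10, 10], [10, 10]]))  # 0.0
# Perfectly separated table -> large statistic:
print(chi_square([[20, 0], [0, 20]]))    # 40.0
```

A p-value then comes from the χ² distribution with (rows − 1)(cols − 1) degrees of freedom; in practice a library routine such as scipy.stats.chi2_contingency handles both steps.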
U.S. Government Works: https://www.usa.gov/government-works
Data from various sources, including 2018 and 2019 multibeam bathymetry data collected by the National Oceanic and Atmospheric Administration (NOAA) and the U.S. Geological Survey (USGS), were combined to create a composite 30-m resolution multibeam bathymetry surface of the central Cascadia Margin offshore of Oregon. These metadata describe the polygon shapefile that outlines and identifies each publicly available bathymetric dataset. The data are available as a polygon shapefile.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This upload contains slices 2,001 – 3,000 from the data collection described in
Maximilian B. Kiss, Sophia B. Coban, K. Joost Batenburg, Tristan van Leeuwen, and Felix Lucka, "2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography dataset for machine learning", Sci Data 10, 576 (2023); also available as arXiv:2306.05907 (2023).
Abstract:
"Recent research in computational imaging largely focuses on developing machine learning (ML) techniques for image reconstruction, which requires large-scale training datasets consisting of measurement data and ground-truth images. However, suitable experimental datasets for X-ray Computed Tomography (CT) are scarce, and methods are often developed and evaluated only on simulated data. We fill this gap by providing the community with a versatile, open 2D fan-beam CT dataset suitable for developing ML techniques for a range of image reconstruction tasks. To acquire it, we designed a sophisticated, semi-automatic scan procedure that utilizes a highly-flexible laboratory X-ray CT setup. A diverse mix of samples with high natural variability in shape and density was scanned slice-by-slice (5000 slices in total) with high angular and spatial resolution and three different beam characteristics: A high-fidelity, a low-dose and a beam-hardening-inflicted mode. In addition, 750 out-of-distribution slices were scanned with sample and beam variations to accommodate robustness and segmentation tasks. We provide raw projection data, reference reconstructions and segmentations based on an open-source data processing pipeline."
The data collection was acquired using a highly flexible, programmable, custom-built X-ray CT scanner, the FleX-ray scanner, developed by TESCAN-XRE NV and located in the FleX-ray Lab at the Centrum Wiskunde & Informatica (CWI) in Amsterdam, Netherlands. It consists of a cone-beam microfocus X-ray point source (limited to 90 kV and 90 W) that projects polychromatic X-rays onto a 14-bit CMOS (complementary metal-oxide semiconductor) flat panel detector with a CsI(Tl) scintillator (Dexela 1512NDT) and 1536-by-1944 pixels, 74.8 μm² each. To create a 2D dataset, a fan-beam geometry was mimicked by reading out only the central row of the detector. Between the source and detector there is a rotation stage upon which samples can be mounted. The machine components (i.e., the source, the detector panel, and the rotation stage) are mounted on translation belts that allow the components to move independently of one another.
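The central-row readout described above can be sketched as follows (a pure-Python stand-in with a toy frame; the real detector frames are 1536-by-1944 pixel arrays):

```python
# Mimicking a fan-beam geometry from a cone-beam detector frame by keeping
# only the central detector row, as described for the 2DeteCT acquisition.
# The frame here is a toy 3x4 list of rows, not real scanner data.
def central_row(frame):
    """Return the central row of a 2D detector frame given as a list of rows."""
    return frame[len(frame) // 2]

frame = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
print(central_row(frame))  # [4, 5, 6, 7]
```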
Please refer to the paper for all further technical details.
The complete dataset can be found via the following links: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, OOD.
The reference reconstructions and segmentations can be found via the following links: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, OOD.
The corresponding Python scripts for loading, pre-processing, reconstructing, and segmenting the projection data in the way described in the paper can be found on GitHub. A machine-readable file with the scanning parameters and instrument data for each acquisition mode, as well as a script for loading it, can be found in the same GitHub repository.
Note: It is advisable to use a graphical user interface when decompressing the .zip archives. If you experience a zipbomb error when unzipping a file on a Linux system, rerun the command with the UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE environment variable set, e.g. by adding "export UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE" to your .bashrc.
For more information or guidance in using the data collection, please get in touch with
Maximilian.Kiss [at] cwi.nl
Felix.Lucka [at] cwi.nl
Interagency Wildland Fire Perimeter History (IFPH)
Overview: This national fire history perimeter data layer of conglomerated agency perimeters was developed in support of the WFDSS application and wildfire decision support. The layer encompasses the fire perimeter datasets of the USDA Forest Service, the US Department of the Interior Bureau of Land Management, Bureau of Indian Affairs, Fish and Wildlife Service, and National Park Service, the Alaska Interagency Fire Center, CalFire, and WFIGS History. Perimeters are included through the 2024 fire season. Requirements for fire perimeter inclusion, such as minimum acreage requirements, are set by the contributing agencies. WFIGS, NPS, and CalFire data now include prescribed burns.
Data Input: Several data sources were used in the development of this layer; links are provided where possible below. In addition, many agencies now use WFIGS as their authoritative source, beginning in mid-2020.
Alaska fire history (WFIGS pull for updates began 2022)
USDA FS Regional Fire History Data (WFIGS pull for updates began 2024)
BLM Fire Planning and Fuels (WFIGS pull for updates began 2020)
National Park Service - includes prescribed burns (WFIGS pull for updates began 2020)
Fish and Wildlife Service (WFIGS pull for updates began 2024)
Bureau of Indian Affairs (incomplete; 2017-2018 from BIA; WFIGS pull for updates began 2020)
CalFire FRAS - includes prescribed burns (CalFire only source, non-federal fires)
WFIGS - updates included since mid-2020, unless otherwise noted
Data Limitations: Fire perimeter data are often collected at the local level, and fire management agencies have differing guidelines for submitting fire perimeter data. Often data are collected by agencies only once annually. If you do not see your fire perimeters in this layer, they were not present in the sources used to create the layer at the time the data were submitted.
A companion service for perimeters entered into the WFDSS application is also available. If a perimeter is found in the WFDSS service that is missing in this Agency Authoritative service, or a perimeter is missing in both services, please contact the appropriate agency Fire GIS Contact listed in the table below.

Attributes
This dataset implements the NWCG Wildland Fire Perimeters (polygon) data standard: https://www.nwcg.gov/sites/default/files/stds/WildlandFirePerimeters_definition.pdf
- IRWINID - Primary key for linking to the IRWIN Incident dataset. The origin of this GUID is the wildland fire locations point data layer maintained by IRWIN. (This unique identifier may NOT replace the GeometryID core attribute.)
- FORID - Unique identifier assigned to each incident record in the Fire Occurrence Data Records system. (This unique identifier may NOT replace the GeometryID core attribute.)
- INCIDENT - The name assigned to an incident; assigned by the responsible land management unit. (IRWIN required.) Officially recorded name.
- FIRE_YEAR (Alias) - Calendar year in which the fire started. Example: 2013. Value is of type integer (FIRE_YEAR_INT).
- AGENCY - Agency assigned for this fire; should be based on jurisdiction at origin.
- SOURCE - System/agency source of record from which the perimeter came.
- DATE_CUR - The last edit, update, or other valid date of this GIS record. Example: mm/dd/yyyy.
- MAP_METHOD - Controlled vocabulary defining how the geospatial feature was derived; map method may help define data quality. Values: GPS-Driven; GPS-Flight; GPS-Walked; GPS-Walked/Driven; GPS-Unknown Travel Method; Hand Sketch; Digitized-Image; Digitized-Topo; Digitized-Other; Image Interpretation; Infrared Image; Modeled; Mixed Methods; Remote Sensing Derived; Survey/GCDB/Cadastral; Vector; Other
- GIS_ACRES - GIS-calculated acres within the fire perimeter. Not adjusted for unburned areas within the fire perimeter. Total should include 1 decimal place. (ArcGIS: Precision=10; Scale=1.) Example: 23.9
- UNQE_FIRE_ - Unique fire identifier: Year-Unit Identifier-Local Incident Identifier (yyyy-SSXXX-xxxxxx), where SS = state code or international code, XXX or XXXX = a code assigned to an organizational unit, and xxxxxx = alphanumeric with hyphens or periods. The unit identifier portion corresponds to the POINT OF ORIGIN RESPONSIBLE AGENCY UNIT IDENTIFIER (POOResponsibleUnit) from the responsible unit's corresponding fire report. Example: 2013-CORMP-000001
- LOCAL_NUM - Local incident identifier (dispatch number). A number or code that uniquely identifies an incident for a particular local fire management organization within a particular calendar year. Field is string to allow for leading zeros when the local incident identifier is less than 6 characters. (IRWIN required.) Example: 123456
- UNIT_ID - NWCG Unit Identifier of the landowner/jurisdictional agency unit at the point of origin of a fire. (NFIRS ID should be used only when no NWCG Unit Identifier exists.) Example: CORMP
- COMMENTS - Additional information describing the feature. Free text.
- FEATURE_CA - Type of wildland fire polygon: Wildfire (represents the final fire perimeter or last daily fire perimeter available), Prescribed Fire, or Unknown
- GEO_ID - Primary key for linking geospatial objects with other database systems. Required for every feature. This field may be renamed for each standard to fit the feature. Globally Unique Identifier (GUID).

Cross-walk from sources (GeoID) and other processing notes
- AK: GEOID = OBJECTID of provided file geodatabase (4,781 records through 2021); other federal sources for AK data removed. No RX data included.
- CA: GEOID = OBJECTID of downloaded file geodatabase (8,480 records, federal fires removed; includes RX. Significant cleanup occurred between the 2023 and 2024 data pulls, resulting in fewer perimeters.)
- FWS: GEOID = OBJECTID of service download, combined history 2005-2021 (2,959 records); includes RX.
- BIA: GEOID = "FireID", 2017/2018 data (382 records). No RX data included.
- NPS: GEOID = EVENT ID (15,237 records); includes RX. In 2024/2023 the dataset was reduced by combining singlepart to multipart features based on valid IRWIN, FORID, or Unique Fire IDs.
- BLM: GEOID = GUID from BLM FPER (23,730 features). No RX data included.
- USFS: GEOID = GLOBALID from EDW records (48,569 features); includes RX.
- WFIGS: GEOID = polySourceGlobalID (9,724 records added or replaced agency records since mid-2020).
Attempts to repair Unique Fire IDs were not made. Attempts to repair dates were not made. Verified all IRWIN IDs and FORIDs present via joins and cross-checks to the respective datasets. Stripped leading and trailing spaces, fixed empty values to
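As a minimal illustration of the UNQE_FIRE_ identifier layout described above (yyyy-SSXXX-xxxxxx), a small parser might look like the following sketch; the helper name is hypothetical and not part of the NWCG standard:

```python
# Hypothetical helper illustrating the UNQE_FIRE_ layout described above
# (yyyy-SSXXX-xxxxxx); not part of the NWCG data standard itself.
def parse_unique_fire_id(uid: str) -> dict:
    """Split a unique fire identifier into year, unit, and local incident parts."""
    # maxsplit=2 keeps any hyphens inside the local incident identifier intact
    year, unit, local = uid.split("-", 2)
    return {
        "fire_year": int(year),    # calendar year in which the fire started
        "unit_id": unit,           # NWCG unit identifier (state + org code)
        "local_incident": local,   # local incident number, may have leading zeros
    }

parts = parse_unique_fire_id("2013-CORMP-000001")
```

Keeping the local incident part as a string preserves leading zeros, matching the note on LOCAL_NUM above.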
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains the source data for the images we used to train the neural network. The tiff files are microscopy images, and each tiff file has two xls or csv files indicating the locations of osteoclasts and non-osteoclasts in that tiff file. We performed data collection 3 times, and the data from each collection are stored separately in 3 folders: "Dataset_1", "Dataset_2", and "Dataset_3".
The "Dataset_1" folder contains a total of 170 tiff images. They are from cultures of wild-type cells stimulated with 50 ng/ml RANKL with or without 100 ng/ml TNF-alpha or 10 ng/ml IL-1beta. The locations of cells are recorded in xls files: "IMAGE_NAME osteoclasts.xls" holds the coordinates of osteoclasts, and "IMAGE_NAME non-osteoclasts.xls" the coordinates of non-osteoclasts. The x-coordinate is recorded in the "x" column and the y-coordinate in the "y" column, with x=0, y=0 at the upper left corner of the image. Please ignore the columns other than x and y.
The "Dataset_2" folder contains a total of 288 images. They are from cultures of wild-type cells or cells with a gain-of-function mutation of SH3BP2 (KI) stimulated with 25 or 50 ng/ml of RANKL. Except for images whose names start with "Image_", the image names encode the culture conditions: the sex of the cell source (female or male), the genotype of the cell source (wt: wild-type; KI: knock-in mutation in Sh3bp2 resulting in increased osteoclastogenesis), the concentration of RANKL in the culture media (R25: 25 ng/ml of RANKL; R50: 50 ng/ml of RANKL), and the culture period with RANKL stimulation (day 3: cultured with RANKL for 3 days). Images named Image_NUMBER.tif are from the following culture condition: female KI R25. The locations of cells are recorded in xls files: "IMAGE_NAME posi.xls" holds the coordinates of osteoclasts, and "IMAGE_NAME nega.xls" the coordinates of non-osteoclasts. The x-coordinate is recorded in the "x" column and the y-coordinate in the "y" column, with x=0, y=0 at the upper left corner of the image. Please ignore the columns other than x and y.
The "additional data" folder contains a total of 288 images, identical to the images in the "Dataset_2" folder. Using the same images, we collected additional coordinates to increase the number of samples. The locations of cells are recorded in csv files: "IMAGE_NAME new posi.xls" holds the coordinates of osteoclasts, and "IMAGE_NAME new nega.xls" the coordinates of non-osteoclasts. The x-coordinate is recorded in the "x" column and the y-coordinate in the "y" column, with x=0, y=0 at the upper left corner of the image. Please ignore the columns other than x and y.
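Since all three collections store cell locations as "x" and "y" columns alongside columns to be ignored, loading one coordinate file might look like the following sketch, assuming pandas; the helper name and example values are illustrative only:

```python
import pandas as pd

# Minimal sketch of reading one coordinate file with the layout described
# above: an "x" and a "y" column plus extra columns that should be ignored.
# The helper name is hypothetical; use pd.read_csv for the csv files.
def load_coordinates(path: str) -> pd.DataFrame:
    df = pd.read_excel(path)
    return df[["x", "y"]]  # keep only the pixel coordinates

# In-memory example with the same column layout (values invented):
raw = pd.DataFrame({"x": [12, 340], "y": [7, 255], "extra": ["a", "b"]})
coords = raw[["x", "y"]]
```

The origin (x=0, y=0) is the upper left corner of the image, so y increases downward when plotting.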
Please contact Mizuho Kittaka for inquiries about the dataset.
Data provided are the scale of polygonal data sources used to generate the polygon-derived surfaces for the intensive agricultural areas of Australia, modelled from area-based observations made by State soil agencies.

The final ASRIS polygon-attributed surfaces are a mosaic of all of the data obtained from various state and federal agencies. The surfaces have been constructed with the best soil survey information available at the time. The surfaces also rely on a number of assumptions. One is that an area-weighted mean is a good estimate of the soil attributes for a polygon or mapunit. Another is that the lookup tables provided by McKenzie et al. (2000) and the states and territories accurately depict the soil attribute values for each soil type.

The accuracy of the maps is most dependent on the scale of the original polygon datasets and the level of soil survey that has taken place in each state. The scale of the various soil maps used in deriving this map is available by accessing the datasource grid; the scale is used as an assessment of the likely accuracy of the modelling. The Atlas of Australian Soils is considered to be the least accurate dataset and has therefore only been used where there is no state-based data. Of the state datasets, Western Australian sub-systems, South Australian land systems, and NSW soil landscapes and reconnaissance mapping would be the most reliable based on scale. NSW soil landscapes and reconnaissance mapping, however, may be less accurate than South Australia and Western Australia, as only one dominant soil type per polygon was used in the estimation of attributes, compared with several soil types per polygon or mapunit in South Australia and Western Australia. NSW soil landscapes and reconnaissance mapping, as the name suggests, is reconnaissance level only, with no laboratory data. The digital map data are provided in geographical coordinates based on the World Geodetic System 1984 (WGS84) datum.
See the further metadata for more detail.
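The area-weighted mean assumption mentioned above can be sketched directly; the attribute values and area fractions below are invented for illustration:

```python
# Sketch of the area-weighted mean assumption described above: a polygon
# (mapunit) aggregates its component soil types' attribute values, weighted
# by the area fraction each soil type occupies. All values are illustrative.
def area_weighted_mean(values, area_fractions):
    total = sum(area_fractions)
    return sum(v * a for v, a in zip(values, area_fractions)) / total

# e.g. three soil types in one mapunit with clay contents of 20%, 35%, 50%
# occupying 50%, 30%, and 20% of the polygon's area
clay = area_weighted_mean([20.0, 35.0, 50.0], [0.5, 0.3, 0.2])  # -> 30.5
```

Dividing by the summed fractions keeps the estimate correct even when the fractions do not sum exactly to 1.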
The fourth edition of the Global Findex offers a lens into how people accessed and used financial services during the COVID-19 pandemic, when mobility restrictions and health policies drove increased demand for digital services of all kinds.
The Global Findex is the world's most comprehensive database on financial inclusion. It is also the only global demand-side data source allowing for global and regional cross-country analysis to provide a rigorous and multidimensional picture of how adults save, borrow, make payments, and manage financial risks. Global Findex 2021 data were collected from nationally representative surveys of about 128,000 adults in more than 120 economies. The latest edition follows the 2011, 2014, and 2017 editions, and it includes a number of new series measuring financial health and resilience and contains more granular data on digital payment adoption, including merchant and government payments.
The Global Findex is an indispensable resource for financial service practitioners, policy makers, researchers, and development professionals.
South Ossetia and Abkhazia were not included for the safety of the interviewers. In addition, very remote mountainous villages or those with less than 100 inhabitants were also excluded. The excluded areas represent approximately 8 percent of the total population.
Individual
Observation data/ratings [obs]
In most developing economies, Global Findex data have traditionally been collected through face-to-face interviews. Surveys are conducted face-to-face in economies where telephone coverage represents less than 80 percent of the population or where in-person surveying is the customary methodology. However, because of ongoing COVID-19 related mobility restrictions, face-to-face interviewing was not possible in some of these economies in 2021. Phone-based surveys were therefore conducted in 67 economies that had been surveyed face-to-face in 2017. These 67 economies were selected for inclusion based on population size, phone penetration rate, COVID-19 infection rates, and the feasibility of executing phone-based methods where Gallup would otherwise conduct face-to-face data collection, while complying with all government-issued guidance throughout the interviewing process. Gallup takes both mobile phone and landline ownership into consideration. According to Gallup World Poll 2019 data, when face-to-face surveys were last carried out in these economies, at least 80 percent of adults in almost all of them reported mobile phone ownership. All samples are probability-based and nationally representative of the resident adult population. Phone surveys were not a viable option in 17 economies that had been part of previous Global Findex surveys, however, because of low mobile phone ownership and surveying restrictions. Data for these economies will be collected in 2022 and released in 2023.
In economies where face-to-face surveys are conducted, the first stage of sampling is the identification of primary sampling units. These units are stratified by population size, geography, or both, and clustering is achieved through one or more stages of sampling. Where population information is available, sample selection is based on probabilities proportional to population size; otherwise, simple random sampling is used. Random route procedures are used to select sampled households. Unless an outright refusal occurs, interviewers make up to three attempts to survey the sampled household. To increase the probability of contact and completion, attempts are made at different times of the day and, where possible, on different days. If an interview cannot be obtained at the initial sampled household, a simple substitution method is used. Respondents are randomly selected within the selected households. Each eligible household member is listed, and the hand-held survey device randomly selects the household member to be interviewed. For paper surveys, the Kish grid method is used to select the respondent. In economies where cultural restrictions dictate gender matching, respondents are randomly selected from among all eligible adults of the interviewer's gender.
In traditionally phone-based economies, respondent selection follows the same procedure as in previous years, using random digit dialing or a nationally representative list of phone numbers. In most economies where mobile phone and landline penetration is high, a dual sampling frame is used.
The same respondent selection procedure is applied to the new phone-based economies. Dual-frame (landline and mobile phone) random digit dialing is used where landline presence and use are 20 percent or higher based on historical Gallup estimates. Mobile phone random digit dialing is used in economies with limited to no landline presence (less than 20 percent).
For landline respondents in economies where mobile phone or landline penetration is 80 percent or higher, random selection of respondents is achieved by using either the latest birthday or household enumeration method. For mobile phone respondents in these economies or in economies where mobile phone or landline penetration is less than 80 percent, no further selection is performed. At least three attempts are made to reach a person in each household, spread over different days and times of day.
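The within-household respondent selection described above, where eligible members are listed and one is chosen at random (the role played by the hand-held survey device, or the Kish grid on paper), can be sketched as a minimal simulation; the member names are invented:

```python
import random

# Illustrative sketch of the within-household respondent selection described
# above: enumerate the eligible adults, then pick one uniformly at random.
# This stands in for the hand-held device / Kish grid step in the surveys.
def select_respondent(eligible_members, rng=None):
    rng = rng or random.Random()
    return rng.choice(eligible_members)

household = ["member_A", "member_B", "member_C"]  # invented roster
respondent = select_respondent(household, random.Random(0))
```

A seeded generator is used here only to make the sketch reproducible; fieldwork selection is, of course, unseeded.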
Sample size for Georgia is 1000.
Face-to-face [f2f]
Questionnaires are available on the website.
Estimates of standard errors (which account for sampling error) vary by country and indicator. For country-specific margins of error, please refer to the Methodology section and corresponding table in Demirgüç-Kunt, Asli, Leora Klapper, Dorothe Singer, Saniya Ansar. 2022. The Global Findex Database 2021: Financial Inclusion, Digital Payments, and Resilience in the Age of COVID-19. Washington, DC: World Bank.
Access to up-to-date socio-economic data is a widespread challenge in Papua New Guinea and other Pacific Island Countries. To increase data availability and promote evidence-based policymaking, the Pacific Observatory provides innovative solutions and data sources to complement existing survey data and analysis. One of these data sources is a series of High Frequency Phone Surveys (HFPS), which began in 2020 as a way to monitor the socio-economic impacts of the COVID-19 Pandemic, and since 2023 has grown into a series of continuous surveys for socio-economic monitoring. See https://www.worldbank.org/en/country/pacificislands/brief/the-pacific-observatory for further details.
For PNG, after five rounds of data collection from 2020-2022, a monthly HFPS data collection commenced in April 2023 and continued for 18 months (ending September 2024), covering topics including employment, income, food security, health, food prices, assets, and well-being. This followed an initial pilot of the data collection from January 2023 to March 2023. Data for April 2023-September 2023 were a repeated cross-section, while October 2023 established the first month of a panel, which is ongoing as of March 2025. For each month, approximately 550-1,000 households were interviewed. The sample is representative of urban and rural areas but is not representative at the province level. This dataset contains combined monthly survey data for all months of the continuous HFPS in PNG. There is one data file of household-level data with a unique household ID, and separate files for individual-level data and household food price data, which can be matched to the household file using the household ID. A unique individual ID within the individual-level data can be used to track individuals over time within households.
Urban and rural areas of Papua New Guinea
Household, Individual
Sample survey data [ssd]
The initial sample was drawn through Random Digit Dialing (RDD) with geographic stratification from a large random sample of Digicel’s subscribers. As an objective of the survey was to measure changes in household economic wellbeing over time, the HFPS sought to contact a consistent number of households across each province month to month. This was initially a repeated cross section from April 2023-Dec 2023. The resulting overall sample has a probability-based weighted design, with a proportionate stratification to achieve a proper geographical representation. More information on sampling for the cross-sectional monthly sample can be found in previous documentation for the PNG HFPS data.
A monthly panel was established in October 2023 and is ongoing as of March 2025. In each subsequent round of data collection after October 2023, the survey firm would first attempt to contact all households from the previous month, and then attempt to contact households from earlier months that had dropped out. After previous numbers were exhausted, RDD with geographic stratification was used for replacement households.
Computer Assisted Telephone Interview [cati]
The questionnaire, which can be found in the External Resources of this documentation, is in English with a Pidgin translation.
The survey instrument for Q1 2025 consists of the following modules:
-1. Basic Household Information
-2. Household Roster
-3. Labor
-4a. Food Security
-4b. Food Prices
-5. Household Income
-6. Agriculture
-8. Access to Services
-9. Assets
-10. Wellbeing and Shocks
-10a. WASH
The raw data were cleaned by the World Bank team using Stata. This included formatting and correcting errors identified through the survey's monitoring and quality control process. The data are presented in two datasets: a household dataset and an individual dataset. The individual dataset contains information on individual demographics and labor market outcomes of all household members aged 15 and above, and the household dataset contains information about household demographics, education, food security, food prices, household income, agriculture activities, social protection, access to services, and durable asset ownership. The household identifier (hhid) is available in both the household dataset and the individual dataset. The individual identifier (id_member) can be found in the individual dataset.
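A minimal sketch of linking the two datasets through the household identifier, assuming pandas; every column other than hhid and id_member is invented for illustration:

```python
import pandas as pd

# Toy stand-ins for the household and individual datasets described above.
# Only hhid and id_member come from the documentation; other columns and
# all values are invented.
households = pd.DataFrame({"hhid": [1, 2], "region": ["urban", "rural"]})
individuals = pd.DataFrame({
    "hhid": [1, 1, 2],
    "id_member": [1, 2, 1],
    "age": [34, 15, 51],
})

# Attach household-level fields to each individual record via hhid.
merged = individuals.merge(households, on="hhid", how="left")
```

A left merge keeps every individual record even if a household-level match were missing, which is usually the safer default when joining survey files.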
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning can be used to predict fault properties such as shear stress, friction, and time to failure using continuous records of fault zone acoustic emissions. The files are extracted features and labels from lab data (experiment p4679). The features are extracted with a non-overlapping window from the original acoustic data. The first column is the time of the window. The second and third columns are the mean and the variance of the acoustic data in the window, respectively. The 4th-11th columns are the power spectral density, ordered from low to high frequency. The last column is the corresponding label (shear stress level). The file name indicates the driving velocity from which the sequence was generated. Data were generated from laboratory friction experiments conducted with a biaxial shear apparatus. Experiments were conducted in the double direct shear configuration, in which two fault zones are sheared between three rigid forcing blocks. Our samples consisted of two 5-mm-thick layers of simulated fault gouge with a nominal contact area of 10 by 10 cm^2. Gouge material consisted of soda-lime glass beads with an initial particle size between 105 and 149 micrometers. Prior to shearing, we impose a constant fault normal stress of 2 MPa using a servo-controlled load-feedback mechanism and allow the sample to compact. Once the sample has reached a constant layer thickness, the central block is driven down at a constant rate of 10 micrometers per second. In tandem, we collect an AE signal continuously at 4 MHz from a piezoceramic sensor embedded in a steel forcing block about 22 mm from the gouge layer. The data from this experiment can be used with the deep learning algorithm to train it for future fault property prediction.
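A hedged sketch of the feature extraction described above: non-overlapping windows over the acoustic record, each yielding a time stamp, mean, variance, and an 8-bin power spectral density from low to high frequency. The window length and the use of a raw periodogram are assumptions for illustration, not the experiment's actual processing:

```python
import numpy as np

# Sketch of the windowed feature extraction described above. Assumptions:
# the window length is illustrative, and the PSD is a raw periodogram
# binned into 8 coarse frequency bands (low to high).
def window_features(signal, fs, window_len):
    rows = []
    for start in range(0, len(signal) - window_len + 1, window_len):
        w = signal[start:start + window_len]
        psd = np.abs(np.fft.rfft(w)) ** 2       # raw periodogram
        bands = np.array_split(psd[1:], 8)      # drop DC, 8 coarse bands
        rows.append([start / fs, w.mean(), w.var()]
                    + [b.mean() for b in bands])
    return np.array(rows)  # one row per window: time, mean, var, 8 PSD bins

rng = np.random.default_rng(0)
feats = window_features(rng.standard_normal(4000), fs=4e6, window_len=1000)
```

The shear stress label column from the dataset would be appended per window during training; it is omitted here since the labels come from the load cell, not the acoustic signal.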
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The OSDG Community Dataset (OSDG-CD) is a public dataset of thousands of text excerpts, which were validated by over 1,400 OSDG Community Platform (OSDG-CP) citizen scientists from over 140 countries, with respect to the Sustainable Development Goals (SDGs).
Dataset Information
In support of the global effort to achieve the Sustainable Development Goals (SDGs), OSDG is realising a series of SDG-labelled text datasets. The OSDG Community Dataset (OSDG-CD) is the direct result of the work of more than 1,400 volunteers from over 130 countries who have contributed to our understanding of SDGs via the OSDG Community Platform (OSDG-CP). The dataset contains tens of thousands of text excerpts (henceforth: texts) which were validated by the Community volunteers with respect to SDGs. The data can be used to derive insights into the nature of SDGs using either ontology-based or machine learning approaches.
📘 The file contains 43,021 (+390) text excerpts and a total of 310,328 (+3,733) assigned labels.
To learn more about the project, please visit the OSDG website and the official GitHub page. Explore a detailed overview of the OSDG methodology in our recent paper "OSDG 2.0: a multilingual tool for classifying text data by UN Sustainable Development Goals (SDGs)".
Source Data
The dataset consists of paragraph-length text excerpts derived from publicly available documents, including reports, policy documents, and publication abstracts. A significant number of documents (more than 3,000) originate from UN-related sources such as SDG-Pathfinder and SDG Library. These sources often contain documents that already have SDG labels associated with them. Each text consists of 3 to 6 sentences and is about 90 words long on average.
Methodology
All the texts are evaluated by volunteers on the OSDG-CP. The platform is an ambitious attempt to bring together researchers, subject-matter experts and SDG advocates from all around the world to create a large and accurate source of textual information on the SDGs. The Community volunteers use the platform to participate in labelling exercises where they validate each text's relevance to SDGs based on their background knowledge.
In each exercise, the volunteer is shown a text together with an SDG label associated with it – this usually comes from the source – and asked to either accept or reject the suggested label.
There are 3 types of exercises:
- Introductory exercise: all volunteers start with this mandatory exercise, which consists of 10 pre-selected texts. Each volunteer must complete it before they can access the 2 other exercise types. Upon completion, the volunteer reviews the exercise by comparing their answers with those of the rest of the Community using the aggregated statistics we provide, i.e., the share of those who accepted and rejected the suggested SDG label for each of the 10 texts. This helps the volunteer get a feel for the platform.
- SDG-specific exercises, where the volunteer validates texts with respect to a single SDG, e.g., SDG 1 No Poverty.
- All SDGs exercise, where the volunteer validates a random sequence of texts, each of which can have any SDG as its associated label.
After finishing the introductory exercise, the volunteer is free to select either SDG-specific or All SDGs exercises. Each exercise, regardless of its type, consists of 100 texts. Once the exercise is finished, the volunteer can either label more texts or exit the platform. The volunteer can also finish an exercise early; all progress is still saved and recorded.
To ensure quality, each text is validated by up to 9 different volunteers and all texts included in the public release of the data have been validated by at least 3 different volunteers.
It is worth keeping in mind that all exercises present the volunteers with a binary decision problem, i.e., either accept or reject a suggested label. The volunteers are never asked to select one or more SDGs that a certain text might relate to. The rationale behind this set-up is that asking a volunteer to select from 17 SDGs is extremely inefficient. Currently, all texts are validated against only one associated SDG label.
Column Description
doi - Digital Object Identifier of the original document
text_id - unique text identifier
text - text excerpt from the document
sdg - the SDG the text is validated against
labels_negative - the number of volunteers who rejected the suggested SDG label
labels_positive - the number of volunteers who accepted the suggested SDG label
agreement - agreement score based on the formula: agreement = |labels_positive - labels_negative| / (labels_positive + labels_negative)
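The agreement formula above translates directly into code:

```python
# Direct computation of the agreement score defined above: the absolute
# difference between accepting and rejecting votes, normalized by the
# total number of votes. 1.0 means unanimity, 0.0 means an even split.
def agreement(labels_positive: int, labels_negative: int) -> float:
    total = labels_positive + labels_negative
    return abs(labels_positive - labels_negative) / total

score = agreement(labels_positive=7, labels_negative=2)  # -> 5/9 ≈ 0.556
```

Since every public text has at least 3 validations, the denominator is never zero for released rows.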
Further Information
Do not hesitate to share with us your outputs, be it a research paper, a machine learning model, a blog post, or just an interesting observation. All queries can be directed to community@osdg.ai.
Terms of Use

Data Limitations and Disclaimer
The user's use of and/or reliance on the information contained in the Document shall be at the user's own risk and expense. MassDEP disclaims any responsibility for any loss or harm that may result to the user of this data or to any other person due to the user's use of the Document.

This is an ongoing data development project. Attempts have been made to contact all PWS systems, but not all have responded with information on their service area. MassDEP will continue to collect and verify this information. Some PWS service areas included in this datalayer have not been verified by the PWS or the municipality involved, but since many of those areas are based on information published online by the municipality or the PWS, or in a publicly available report, they are included in the estimated PWS service area datalayer.

Please note: All PWS service area delineations are estimates for broad planning purposes and should only be used as a guide. The data are not appropriate for site-specific or parcel-specific analysis. Not all properties within a PWS service area are necessarily served by the system, and some properties outside the mapped service areas could be served by the PWS; please contact the relevant PWS. Not all service areas have been confirmed by the systems.

Please use the following citation to reference these data: MassDEP, Water Utility Resilience Program. 2025. Community and Non-Transient Non-Community Public Water System Service Area (PubV2025_3).

IMPORTANT NOTICE: This MassDEP Estimated Water Service datalayer may not be complete, may contain errors, omissions, and other inaccuracies, and the data are subject to change. This version is published through MassGIS. We want to learn about the data uses: if you use this dataset, please notify staff in the Water Utility Resilience Program (WURP@mass.gov).

This GIS datalayer represents approximate service areas for Public Water Systems (PWS) in Massachusetts.
In 2017, as part of its "Enhancing Resilience and Emergency Preparedness of Water Utilities through Improved Mapping" effort (the Critical Infrastructure Mapping Project), the MassDEP Water Utility Resilience Program (WURP) began to uniformly map drinking water service areas throughout Massachusetts using information collected from various sources. Along with confirming existing public water system (PWS) service area information, the project collected and verified estimated service area delineations for PWSs not previously delineated and will continue to update the information contained in the datalayers. As of the date of publication, WURP has delineated Community (COM) and Non-Transient Non-Community (NTNC) service areas. Transient non-community systems (TNCs) are not part of this mapping project.

Layers and Tables
The MassDEP Estimated Public Water System Service Area data comprise two polygon feature classes and a supporting table. Some data fields are populated from the MassDEP Drinking Water Program's Water Quality Testing System (WQTS) and Annual Statistical Reports (ASR).
- The Community Water Service Areas feature class (PWS_WATER_SERVICE_AREA_COMM_POLY) includes polygon features that represent the approximate service areas for PWS classified as Community systems.
- The NTNC Water Service Areas feature class (PWS_WATER_SERVICE_AREA_NTNC_POLY) includes polygon features that represent the approximate service areas for PWS classified as Non-Transient Non-Community systems.
- The Unlocated Sites List table (PWS_WATER_SERVICE_AREA_USL) contains a list of known, unmapped, active Community and NTNC PWS service areas at the time of publication.

Production

Data Universe
Public Water Systems in Massachusetts are permitted and regulated through the MassDEP Drinking Water Program. The WURP has mapped service areas for all active and inactive municipal and non-municipal Community PWSs in MassDEP's Water Quality Testing Database (WQTS).
A Community PWS refers to a public water system that serves at least 15 service connections used by year-round residents or regularly serves at least 25 year-round residents. All active and inactive NTNC PWSs were also mapped using information contained in WQTS. An NTNC, or Non-Transient Non-Community Water System, refers to a public water system that is not a community water system and that has at least 15 service connections or regularly serves at least 25 of the same persons approximately four or more hours per day, four or more days per week, more than six months or 180 days per year; an example is a workplace providing water to its employees.

These data may include declassified PWSs. Staff will work to rectify the status/water services of properties previously served by declassified PWSs and remove or incorporate these service areas as needed. Maps of service areas for these systems were collected from various online and MassDEP sources to create service areas digitally in GIS.

Every PWS is assigned a unique PWSID by MassDEP that incorporates the municipal ID of the municipality it serves (or the largest municipality it serves, if it serves multiple municipalities). Some municipalities contain more than one PWS, but each PWS has a unique PWSID. The Estimated PWS Service Area datalayer therefore contains polygons with a unique PWSID for each PWS service area. A service area for a community PWS may serve all of one municipality (e.g. Watertown Water Department), multiple municipalities (e.g. Abington-Rockland Joint Water Works), all or portions of two or more municipalities (e.g. Provincetown Water Dept, which serves all of Provincetown and a portion of Truro), or a portion of a municipality (e.g. Hyannis Water System, which is one of four PWSs in the town of Barnstable). Some service areas have not been mapped, but their general location is represented by a small circle which serves as a placeholder.
The location of these circles are estimates based on the general location of the source wells or the general estimated location of the service area - these do not represent the actual service area.Service areas were mapped initially from 2017 to 2022 and reflect varying years for which service is implemented for that service area boundary. WURP maintains the dataset quarterly with annual data updates; however, the dataset may not include all current active PWSs. A list of unmapped PWS systems is included in the USL table PWS_WATER_SERVICE_AREA_USL available for download with the dataset. Some PWSs that are not mapped may have come online after this iteration of the mapping project; these will be reconciled and mapped during the next phase of the WURP project. PWS IDs that represent regional or joint boards with (e.g. Tri Town Water Board, Randolph/Holbrook Water Board, Upper Cape Regional Water Cooperative) will not be mapped because their individual municipal service areas are included in this datalayer.PWSs that do not have corresponding sources, may be part of consecutive systems, may have been incorporated into another PWSs, reclassified as a different type of PWS, or otherwise taken offline. PWSs that have been incorporated, reclassified, or taken offline will be reconciled during the next data update.Methodologies and Data SourcesSeveral methodologies were used to create service area boundaries using various sources, including data received from the systems in response to requests for information from the MassDEP WURP project, information on file at MassDEP, and service area maps found online at municipal and PWS websites. When provided with water line data rather than generalized areas, 300-foot buffers were created around the water lines to denote service areas and then edited to incorporate generalizations. 
Some municipalities submitted parcel data or address information to be used in delineating service areas.Verification ProcessSmall-scale PDF file maps with roads and other infrastructure were sent to every PWS for corrections or verifications. For small systems, such as a condominium complex or residential school, the relevant parcels were often used as the basis for the delineated service area. In towns where 97% or more of their population is served by the PWS and no other service area delineation was available, the town boundary was used as the service area boundary. Some towns responded to the request for information or verification of service areas by stating that the town boundary should be used since all or nearly all of the municipality is served by the PWS.Sources of information for estimated drinking water service areasThe following information was used to develop estimated drinking water service areas:EOEEA Water Assets Project (2005) water lines (these were buffered to create service areas)Horsely Witten Report 2008Municipal Master Plans, Open Space Plans, Facilities Plans, Water Supply System Webpages, reports and online interactive mapsGIS data received from PWSDetailed infrastructure mapping completed through the MassDEP WURP Critical Infrastructure InitiativeIn the absence of other service area information, for municipalities served by a town-wide water system serving at least 97% of the population, the municipality’s boundary was used. Determinations of which municipalities are 97% or more served by the PWS were made based on the Percent Water Service Map created in 2018 by MassDEP based on various sources of information including but not limited to:The Winter population served submitted by the PWS in the ASR submittalThe number of services from WQTS as a percent of developed parcelsTaken directly from a Master Plan, Water Department Website, Open Space Plan, etc. 
found onlineCalculated using information from the town on the population servedMassDEP staff estimateHorsely Witten Report 2008Calculation based on Water System Areas Mapped through MassDEP WURP Critical Infrastructure Initiative, 2017-2022Information found in publicly available PWS planning documents submitted to MassDEP or as part of infrastructure planningMaintenanceThe
Section 2 - Prepare: A. Guiding Questions: Where is the data stored and organized? Are there any problems with the data? How does the data help answer the business question?
B. Key Tasks: Research and communicate to stakeholders the source of the data and how it is stored and organized. *The data source used for this case study is the FitBit Fitness Tracker Data. This dataset is hosted on Kaggle and was made available by the user Mobius under an open license; the data is public and may be copied, modified, and distributed without asking the owner for permission. The data was reportedly (see the credibility notes directly below) generated by respondents to a survey distributed via Amazon Mechanical Turk between 03/12/2016 and 05/12/2016. *Reportedly (see the credibility notes directly below), thirty eligible Fitbit users consented to the submission of personal tracker data, including output related to steps taken, calories burned, time spent sleeping, heart rate, and distance traveled, broken down into minute-, hour-, and day-level totals. The data is stored in 18 CSV files. I downloaded all 18 files onto my laptop and chose 2 for this project, as they already merged the activity and sleep data from the other files; all unused files were permanently deleted from the laptop. The 2 files used were: -sleepDay_merged.csv -dailyActivity_merged.csv Identify and communicate to stakeholders any problems found with the data related to credibility and bias. *As will be presented more specifically in the Process section, the data has a credibility issue related to the reported time frame: the metadata indicates that roughly 2 months of FitBit tracking were collected, but my initial processing found only 1 month of data. *As will be presented more specifically in the Process section, the data also has a credibility issue related to the number of individuals who reported FitBit data.
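My checks of the participant count and the actual date coverage were done in Excel, but the same idea can be sketched in Python as an illustration only. The Id and ActivityDate column names follow the FitBit CSVs, and the inline sample below is a made-up stand-in for the real dailyActivity_merged.csv:

```python
import csv
import io
from datetime import datetime

# Inline stand-in for dailyActivity_merged.csv (the real file is much larger).
sample = """Id,ActivityDate,TotalSteps
1503960366,4/12/2016,13162
1503960366,4/13/2016,10735
1624580081,4/12/2016,8163
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Check 1: how many distinct participants actually appear in the data?
unique_ids = {row["Id"] for row in rows}

# Check 2: what date range does the data actually cover?
dates = [datetime.strptime(row["ActivityDate"], "%m/%d/%Y") for row in rows]
span_days = (max(dates) - min(dates)).days
```

Run against the full file (an `open('dailyActivity_merged.csv', newline='')` handle in place of the inline sample), the resulting ID count and date span can be compared with the 30 users and roughly 2 months claimed in the metadata.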
Specifically, the metadata states that 30 individual users agreed to report their tracking data, but my initial processing uncovered 33 individual IDs in the dailyActivity_merged dataset. *Due to the small number of participants (...