Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This resource contains Jupyter Notebooks with examples for conducting quality control post processing for in situ aquatic sensor data. The code uses the Python pyhydroqc package. The resource is part of a set of materials for hydroinformatics and water data science instruction. Complete learning module materials are found in HydroLearn: Jones, A.S., Horsburgh, J.S., Bastidas Pacheco, C.J. (2022). Hydroinformatics and Water Data Science. HydroLearn. https://edx.hydrolearn.org/courses/course-v1:USU+CEE6110+2022/about.
This resource consists of 3 example notebooks and associated data files.
Notebooks:
1. Example 1: Import and plot data
2. Example 2: Perform rules-based quality control
3. Example 3: Perform model-based quality control (ARIMA)
Data files: Data files are available for 6 aquatic sites in the Logan River Observatory. Each file contains data for one site for a single year. The files are named according to monitoring site (FranklinBasin, TonyGrove, WaterLab, MainStreet, Mendon, BlackSmithFork) and year. The files were sourced by querying the Logan River Observatory relational database, and equivalent data could be obtained from the LRO website or on HydroShare. Additional information on sites, variables, and methods can be found on the LRO website (http://lrodata.usu.edu/tsa/) or HydroShare (https://www.hydroshare.org/search/?q=logan%20river%20observatory). Each file has the same structure: a datetime index column (mountain standard time) and three columns for each variable. Variable abbreviations and units are:
- temp: water temperature, degrees C
- cond: specific conductance, μS/cm
- ph: pH, standard units
- do: dissolved oxygen, mg/L
- turb: turbidity, NTU
- stage: stage height, cm
For each variable, there are 3 columns:
- Raw data value measured by the sensor (column header is the variable abbreviation).
- Technician quality controlled (corrected) value (column header is the variable abbreviation appended with '_cor').
- Technician labels/qualifiers (column header is the variable abbreviation appended with '_qual').
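A brief pandas sketch of reading one of these files; the file name below is a hypothetical example of the site/year naming described above, and pandas is only one way to inspect the data (the notebooks document the intended workflow).

~~~python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical file name following the site/year naming convention; adjust to the actual file.
df = pd.read_csv("MainStreet2017.csv", index_col=0, parse_dates=True)

# Each variable appears as raw, corrected ('_cor'), and qualifier ('_qual') columns.
print(df[["temp", "temp_cor", "temp_qual"]].head())

# Quick visual check of raw vs. corrected water temperature.
df[["temp", "temp_cor"]].plot(title="Water temperature (deg C)")
plt.show()
~~~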
Preprocessing data in a reproducible and robust way is one of the current challenges in untargeted metabolomics workflows. Data curation in liquid chromatography-mass spectrometry (LC-MS) involves the removal of unwanted features (retention time, m/z pairs) to retain only high-quality data for subsequent analysis and interpretation. The present work introduces a package for the Python programming language for preprocessing LC-MS data for quality control procedures in untargeted metabolomics workflows. It is a versatile strategy that can be customized or fit for purpose according to the specific metabolomics application. It allows performing quality control procedures to ensure accuracy and reliability in LC-MS measurements, and it allows preprocessing metabolomics data to obtain cleaned matrices for subsequent statistical analysis. The capabilities of the package are showcased with pipelines for an LC-MS system suitability check, system conditioning, signal drift evaluation, and data curation. These applications were implemented to preprocess data corresponding to a new suite of candidate plasma reference materials developed by the National Institute of Standards and Technology (NIST; hypertriglyceridemic, diabetic, and African-American plasma pools) to be used in untargeted metabolomics studies, in addition to NIST SRM 1950 (Metabolites in Frozen Human Plasma). The package offers a rapid and reproducible workflow that can be used in an automated or semi-automated fashion, and it is an open and free tool available to all users.
Python is a free programming language that prioritizes human readability and general-purpose application. It is one of the easier languages to learn and start with, especially with no prior programming knowledge. I have been using Python for Excel spreadsheet automation, data analysis, and data visualization, which has allowed me to better focus on automating my data analysis workload. I am currently examining the North Carolina Department of Environmental Quality (NCDEQ) database of water quality sampling for the Town of Nags Head, NC. It spans more than 26 years (1997-2023) and currently lists a total of 41 different testing site locations. As shown at the bottom of image 2 below, there are 148,204 testing data points for the entirety of NCDEQ testing for the state. Of this large dataset, 34,759 data points are from Dare County (Nags Head) specifically, subdivided further into testing sites.
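A small pandas sketch of the kind of subsetting described, with hypothetical file and column names (county, site_id, sample_date); the actual NCDEQ export will differ.

~~~python
import pandas as pd

# Column and file names here are assumptions, not the real NCDEQ schema.
df = pd.read_csv("ncdeq_water_quality.csv", parse_dates=["sample_date"])

dare = df[df["county"] == "Dare"]
print(f"{len(df):,} statewide records; {len(dare):,} Dare County records")

# Summarize the sampling history of each Dare County site.
print(dare.groupby("site_id")["sample_date"].agg(["min", "max", "count"]))
~~~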
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The code repository to replicate the work (e.g., figures and results) from the publication "Advancing Data Quality Assurance with Machine Learning: A Case Study on Wind Vane Stalling Detection". The repository includes dedicated Python files and a README document.
This resource contains a video recording of a presentation given as part of the National Water Quality Monitoring Council conference in April 2021. The presentation covers the motivation for performing quality control for sensor data, the development of PyHydroQC, a Python package with functions for automating sensor quality control including anomaly detection and correction, and the performance of the algorithms applied to data from multiple sites in the Logan River Observatory.
The initial abstract for the presentation: Water quality sensors deployed to aquatic environments make measurements at high frequency and commonly include artifacts that do not represent the environmental phenomena targeted by the sensor. Sensors are subject to fouling from environmental conditions, often exhibit drift and calibration shifts, and report anomalies and erroneous readings due to issues with datalogging, transmission, and other unknown causes. The suitability of data for analyses and decision making often depends on subjective and time-consuming quality control processes consisting of manual review and adjustment of data. Data-driven and machine learning techniques have the potential to automate identification and correction of anomalous data, streamlining the quality control process. We explored documented approaches and selected several for implementation in a reusable, extensible Python package designed for anomaly detection for aquatic sensor data. Implemented techniques include regression approaches that estimate values in a time series, flag a point as anomalous if the difference between the sensor measurement and the estimate exceeds a threshold, and offer replacement values for correcting anomalies. Additional algorithms that scaffold the central regression approaches include rules-based preprocessing, thresholds for determining anomalies that adjust with data variability, and the ability to detect and correct anomalies using forecasted and backcasted estimation. The techniques were developed and tested based on several years of data from aquatic sensors deployed at multiple sites in the Logan River Observatory in northern Utah, USA. Performance was assessed based on labels and corrections applied previously by trained technicians. In this presentation, we describe the techniques for detection and correction, report their performance, illustrate the workflow for applying them to high frequency aquatic sensor data, and demonstrate the possibility for additional approaches to help increase automation of aquatic sensor data post processing.
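The flag-and-replace idea described above can be illustrated with a minimal, generic sketch (not PyHydroQC's ARIMA implementation): estimate each point, flag points where the residual exceeds a variability-adjusted threshold, and offer the estimate as a replacement. The rolling-median estimator and the 4-sigma multiplier below are illustrative choices only.

~~~python
import numpy as np
import pandas as pd

# Toy series standing in for a high-frequency sensor record, with one injected spike.
rng = np.random.default_rng(0)
series = pd.Series(10 + np.sin(np.linspace(0, 20, 2000)) + rng.normal(0, 0.05, 2000))
series.iloc[500] += 3.0

# Regression-style estimate of each point (here a centered rolling median).
estimate = series.rolling(window=30, center=True, min_periods=1).median()
residual = series - estimate

# Threshold that adjusts with data variability (rolling standard deviation of residuals).
sigma = residual.rolling(window=240, min_periods=30).std()
anomalous = residual.abs() > 4 * sigma

# Replacement values for flagged points come from the estimate.
corrected = series.where(~anomalous, estimate)
print(f"{int(anomalous.sum())} points flagged as anomalous")
~~~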
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset consists of 1,854 rows of real-world sensor data collected from automated manufacturing systems to detect and classify production faults. It includes domain-specific features such as temperature, vibration, acoustic signals, humidity, pressure, motor current, RPM, surface reflectance, machine cycle time, and tool wear level. Each instance is labeled with a fault type (normal, surface crack, overheating, vibration anomaly).
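As a hedged starting point for the fault-classification task this dataset supports, the sketch below trains a baseline classifier; the file name and column names (including a 'fault_type' label and numeric-only features) are assumptions.

~~~python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# File and column names are assumptions based on the description above;
# numeric-only feature columns are assumed (encode categoricals otherwise).
df = pd.read_csv("manufacturing_faults.csv")
X = df.drop(columns=["fault_type"])
y = df["fault_type"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
~~~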
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A realistic synthetic French insurance dataset specifically designed for practicing data cleaning, transformation, and analytics with PySpark and other big data tools. This dataset contains intentional data quality issues commonly found in real-world insurance data.
Perfect for practicing data cleaning and transformation:
- Dates in mixed formats: 2024-01-15, 15/01/2024, 01/15/2024
- Prices in mixed formats: 1250.50€, €1250.50, 1250.50 EUR, $1375.55, 1250.50, 1250.50 euros
- Gender values: M, F, Male, Female, empty strings
- Engine power: 150 HP, 150hp, 150 CV, 111 kW, missing values
PySpark functions to practice (see the sketch after this section):
- to_date() and date parsing functions
- regexp_replace() for price cleaning
- when().otherwise() conditional logic
- cast() for data type conversions
- fillna() and dropna() strategies
Realistic insurance business rules implemented:
- Age-based premium adjustments
- Geographic risk zone pricing
- Product-specific claim patterns
- Seasonal claim distributions
- Client lifecycle status transitions
Intermediate - Suitable for learners with basic Python/SQL knowledge ready to tackle real-world data challenges.
Generated with realistic French business context and intentional quality issues for educational purposes. All data is synthetic and does not represent real individuals or companies.
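A minimal PySpark sketch of the cleaning steps listed above. The file name and column names (premium, gender, start_date) are assumptions; adapt them to the actual schema of this dataset.

~~~python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("insurance_cleaning").getOrCreate()

# Hypothetical file and column names; adjust to the actual dataset schema.
df = spark.read.csv("insurance_contracts.csv", header=True, inferSchema=True)

cleaned = (
    df
    # Strip currency symbols/words so the premium can be cast to a numeric type.
    .withColumn("premium_clean", F.regexp_replace(F.col("premium"), r"[^0-9.,]", ""))
    .withColumn("premium_clean",
                F.regexp_replace("premium_clean", ",", ".").cast("double"))
    # Normalize gender codes to a single convention.
    .withColumn("gender_clean",
                F.when(F.col("gender").isin("M", "Male"), "M")
                 .when(F.col("gender").isin("F", "Female"), "F")
                 .otherwise(None))
    # Try several date formats; coalesce keeps the first one that parses.
    .withColumn("start_date_clean",
                F.coalesce(F.to_date("start_date", "yyyy-MM-dd"),
                           F.to_date("start_date", "dd/MM/yyyy"),
                           F.to_date("start_date", "MM/dd/yyyy")))
    # Drop rows with no usable premium, fill missing gender with a placeholder.
    .dropna(subset=["premium_clean"])
    .fillna({"gender_clean": "unknown"})
)
cleaned.show(5)
~~~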
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Model, Format and Python Library for ground truth data containing information on dynamic objects, map and environmental factors, optimized for representing urban traffic. The repository contains:
see ./docs/omega_prime_specification.md
The data model and format utilize ASAM OpenDRIVE and ASAM Open-Simulation-Interface (OSI) GroundTruth messages. omega-prime sets requirements on the presence and quality of ASAM OSI GroundTruth messages and ASAM OpenDRIVE files and defines a file format for their exchange and storage.
Omega-Prime is the successor to OMEGAFormat. Its definition is directly based on the established standards ASAM OSI and ASAM OpenDRIVE, and it carries over the data quality requirements and data tooling from OMEGAFormat. Therefore, it should be easier to incorporate omega-prime into existing workflows and tooling.
To learn more about the example data read example_files/README.md. Example data was taken and created from esmini.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This resource was created for the 2024 New Zealand Hydrological Society Data Workshop in Queenstown, NZ. This resource contains Jupyter Notebooks with examples for conducting quality control post processing for in situ aquatic sensor data. The code uses the Python pyhydroqc package to detect anomalies. This resource consists of 3 example notebooks and associated data files. For more information, see the original resource from which this was derived: http://www.hydroshare.org/resource/451c4f9697654b1682d87ee619cd7924.
Notebooks:
1. Example 1: Import and plot data
2. Example 2: Perform rules-based quality control
3. Example 3: Perform model-based quality control (ARIMA)
4. Example 4: Model-based quality control (ARIMA) with user data
Data files: Data files are available for 6 aquatic sites in the Logan River Observatory. Each file contains data for one site for a single year. The files are named according to monitoring site (FranklinBasin, TonyGrove, WaterLab, MainStreet, Mendon, BlackSmithFork) and year. The files were sourced by querying the Logan River Observatory relational database, and equivalent data could be obtained from the LRO website or on HydroShare. Additional information on sites, variables, and methods can be found on the LRO website (http://lrodata.usu.edu/tsa/) or HydroShare (https://www.hydroshare.org/search/?q=logan%20river%20observatory). Each file has the same structure: a datetime index column (mountain standard time) and three columns for each variable. Variable abbreviations and units are:
- temp: water temperature, degrees C
- cond: specific conductance, μS/cm
- ph: pH, standard units
- do: dissolved oxygen, mg/L
- turb: turbidity, NTU
- stage: stage height, cm
For each variable, there are 3 columns:
- Raw data value measured by the sensor (column header is the variable abbreviation).
- Technician quality controlled (corrected) value (column header is the variable abbreviation appended with '_cor').
- Technician labels/qualifiers (column header is the variable abbreviation appended with '_qual').
There is also a file "data.csv" for use with Example 4. Users who want to bring their own data file should structure it similarly: a single column of datetime values and a single column of numeric observations labeled "raw".
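A short pandas sketch of reshaping an arbitrary sensor export into that structure; the source file and its column names are hypothetical, and only the output format (a datetime column and a numeric column named "raw") follows the description above.

~~~python
import pandas as pd

# Hypothetical source file and column names; only the output structure matters:
# one datetime column and one numeric column named "raw".
src = pd.read_csv("my_sensor_export.csv", parse_dates=["timestamp"])

user_data = (
    src.rename(columns={"timestamp": "datetime", "water_temp": "raw"})
       .loc[:, ["datetime", "raw"]]
       .sort_values("datetime")
)
user_data.to_csv("data.csv", index=False)
~~~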
Mass spectrometry-based proteomics is increasingly employed in biology and medicine. To generate reliable information from large data sets and ensure comparability of results, it is crucial to implement and standardize the quality control of the raw data, the data processing steps, and the statistical analyses. The MSPypeline provides a platform for the import of MaxQuant output tables, the generation of quality control reports, the preprocessing of data including normalization, and exploratory analyses by statistical inference plots. These standardized steps assess data quality, provide customizable figures, and enable the identification of differentially expressed proteins to reach biologically relevant conclusions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The repository contains an extensive dataset of PV power measurements and a Python package (qcpv) for quality control of PV power measurements. The dataset features four years (2014-2017) of power measurements from 175 rooftop-mounted residential PV systems located in Utrecht, the Netherlands. The power measurements have a 1-min resolution.
PV power measurements
Three different versions of the power measurements are included as three data subsets in the repository. Unfiltered power measurements are provided in unfiltered_pv_power_measurements.csv. Filtered power measurements are provided as filtered_pv_power_measurements_sc.csv and filtered_pv_power_measurements_ac.csv: the former contains the quality-controlled power measurements after running single-system filters only, while the latter contains the output after running both single- and across-system filters. The metadata of the PV systems is provided in metadata.csv. For each PV system, this file holds a unique ID, the start and end time of registered power measurements, estimated DC and AC capacity, tilt and azimuth angles, annual yield, and the mapped grid of the system location (north, south, west, and east boundaries).
Quality control routine
An open-source quality control routine that can be applied to filter erroneous PV power measurements is added to the repository in the form of the Python package qcpv (qcpv.py). Sample code to call and run the functions in the qcpv package is available as example.py.
Objective
By publishing the dataset we provide access to high quality PV power measurements that can be used for research experiments on several topics related to PV power and the integration of PV in the electricity grid.
By publishing the qcpv package we strive to set a next step into developing a standardized routine for quality control of PV power measurements. We hope to stimulate others to adopt and improve the routine of quality control and work towards a widely adopted standardized routine.
Data usage
If you use the data and/or python package in a published work please cite: Visser, L., Elsinga, B., AlSkaif, T., van Sark, W., 2022. Open-source quality control routine and multi-year power generation data of 175 PV systems. Journal of Renewable and Sustainable Energy.
Units
Timestamps are in UTC (YYYY-MM-DD HH:MM:SS+00:00).
Power measurements are in Watt.
Installed capacities (DC and AC) are in Watt-peak.
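A hedged pandas sketch for getting started with the power measurements; the assumption that the CSV holds a UTC timestamp index plus one power column per system ID is illustrative and should be checked against the actual files.

~~~python
import pandas as pd

# Assumed layout: UTC timestamp index plus one power column (Watt) per PV system.
power = pd.read_csv(
    "unfiltered_pv_power_measurements.csv", index_col=0, parse_dates=True
)
meta = pd.read_csv("metadata.csv")

# Average 1-min power (W) over each hour equals the hourly energy yield in Wh;
# divide by 1000 for kWh.
system_id = power.columns[0]
hourly_kwh = power[system_id].resample("1h").mean() / 1000.0
print(hourly_kwh.head())
~~~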
Additional information
A detailed discussion of the data and qcpv package is presented in: Visser, L., Elsinga, B., AlSkaif, T., van Sark, W., 2022. Open-source quality control routine and multi-year power generation data of 175 PV systems. Journal of Renewable and Sustainable Energy.
Acknowledgements
This work is part of the Energy Intranets (NEAT: ESI-BiDa 647.003.002) project, which is funded by the Dutch Research Council NWO in the framework of the Energy Systems Integration & Big Data programme. The authors would especially like to thank the PV owners who volunteered to take part in the measurement campaign.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data consist of a Golgi image dataset and the pipeline to perform unsupervised phenotypic analysis on these images. The data are presented as a zipped file "Golgi_HCA_workflow.zip" and its contents include:
1) Data folder "snare_2" containing vignettes of Golgi images (.jpg) acquired from multiple fields of multiple wells and numerical data (.sta) corresponding to the image features extracted for each Golgi image.
2) Plate map folder "plate_maps" containing the .csv plate map file for the "snare_2" dataset with the well locations for all the siRNA treatments.
3) Repository folder "repository" containing "nqc.h5". A labeled set of good and bad nuclei was used to train the nuclei quality control (NQC) classifier; the results of this pre-trained classifier have been included in "nqc.h5" for the convenience of users.
4) Two Python scripts: "control_model_utils.py" for the control modeling module of the pipeline and "HCA_workflow.py", the main script for running the entire pipeline.
5) README file describing the steps to download and install this package and the Python software needed to run it.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Python Doctest Corpus
A curated corpus of Python doctest examples designed for training Python-to-Rust transpilers and testing code translation systems.
Dataset Description
This dataset contains Python function signatures, doctest inputs, and expected outputs that serve as high-quality training data for:
- Transpilation training: teaching models to translate Python patterns to Rust
- Test validation: verifying that transpiled code produces correct outputs
- Code understanding: …
See the full description on the dataset page: https://huggingface.co/datasets/paiml/python-doctest-corpus-test.
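Records like these can be validated by replaying the doctest examples against an implementation with Python's standard doctest module. The snippet below is a self-contained illustration with a made-up function; the dataset's actual field names may differ.

~~~python
import doctest

# A made-up corpus-style record: a function definition plus its doctest examples.
def add(a, b):
    return a + b

doctest_examples = """
>>> add(2, 3)
5
>>> add(-1, 1)
0
"""

# Parse the examples and replay them against the implementation.
parser = doctest.DocTestParser()
test = parser.get_doctest(doctest_examples, globs={"add": add},
                          name="add_examples", filename="<corpus>", lineno=0)
runner = doctest.DocTestRunner()
runner.run(test)
print(runner.summarize())   # e.g. TestResults(failed=0, attempted=2)
~~~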
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To standardize metabolomics data analysis and facilitate future computational developments, it is essential to have a set of well-defined templates for common data structures. Here we describe a collection of data structures involved in metabolomics data processing and illustrate how they are utilized in a full-featured Python-centric pipeline. We demonstrate the performance of the pipeline, and the details in annotation and quality control using large-scale LC-MS metabolomics and lipidomics data and LC-MS/MS data. Multiple previously published datasets are also reanalyzed to showcase its utility in biological data analysis. This pipeline allows users to streamline data processing, quality control, annotation, and standardization in an efficient and transparent manner. This work fills a major gap in the Python ecosystem for computational metabolomics.
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This competition features two independent synthetic data challenges that you can join separately:
- The FLAT DATA Challenge
- The SEQUENTIAL DATA Challenge
For each challenge, generate a dataset with the same size and structure as the original, capturing its statistical patterns, but without being significantly closer to the (released) original samples than to the (unreleased) holdout samples.
Train a generative model that generalizes well, using any open-source tools (Synthetic Data SDK, synthcity, reprosyn, etc.) or your own solution. Submissions must be fully open-source, reproducible, and runnable within 6 hours on a standard machine.
Flat Data:
- 100,000 records
- 80 data columns: 60 numeric, 20 categorical
Sequential Data:
- 20,000 groups
- each group contains 5-10 records
- 10 data columns: 7 numeric, 3 categorical
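One way to reason about the "not significantly closer to the released originals than to the holdout" criterion is a nearest-neighbour distance comparison, sketched below with random numeric placeholders (real submissions would need to encode categorical columns first); this is an illustration, not the competition's official metric.

~~~python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Placeholder numeric arrays standing in for the real tables.
rng = np.random.default_rng(0)
original = rng.normal(size=(1000, 5))    # released training sample (placeholder)
holdout = rng.normal(size=(1000, 5))     # unreleased holdout (placeholder)
synthetic = rng.normal(size=(1000, 5))   # submission to evaluate (placeholder)

# Distance from each synthetic record to its nearest original and holdout record.
d_train = NearestNeighbors(n_neighbors=1).fit(original).kneighbors(synthetic)[0].ravel()
d_hold = NearestNeighbors(n_neighbors=1).fit(holdout).kneighbors(synthetic)[0].ravel()

# If the generator only generalizes, roughly half of the synthetic records
# should be nearer to the holdout than to the training data.
share_closer_to_train = np.mean(d_train < d_hold)
print(f"Share closer to training data: {share_closer_to_train:.2f}")
~~~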
If you use this dataset in your research, please cite:
@dataset{mostlyaiprize,
author = {MOSTLY AI},
title = {MOSTLY AI Prize Dataset},
year = {2025},
url = {https://www.mostlyaiprize.com/},
}
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset is a synthetic yet realistic E-commerce retail dataset generated programmatically using Python (Faker + NumPy + Pandas).
It is designed to closely mimic real-world online shopping behavior, user patterns, product interactions, seasonal trends, and marketplace events.
- Machine Learning & Deep Learning
- Recommender Systems
- Customer Segmentation
- Sales Forecasting
- A/B Testing
- E-commerce Behaviour Analysis
- Data Cleaning / Feature Engineering Practice
- SQL practice
The dataset contains 6 CSV files:
~~~
File              Rows      Description
users.csv         ~10,000   User profiles, demographics & signup info
products.csv      ~2,000    Product catalog with rating and pricing
orders.csv        ~20,000   Order-level transactions
order_items.csv   ~60,000   Items purchased per order
reviews.csv       ~15,000   Customer-written product reviews
events.csv        ~80,000   User event logs: view, cart, wishlist, purchase
~~~
1. Users (users.csv)
- user_id: Unique user identifier
- name: Full customer name
- email: Email (synthetic, no real emails)
- gender: Male / Female / Other
- city: City of residence
- signup_date: Account creation date
2. Products (products.csv)
- product_id: Unique product identifier
- product_name: Product title
- category: Electronics, Clothing, Beauty, Home, Sports, etc.
- price: Actual selling price
- rating: Average product rating
3. Orders (orders.csv)
- order_id: Unique order identifier
- user_id: User who placed the order
- order_date: Timestamp of the order
- order_status: Completed / Cancelled / Returned
- total_amount: Total order value
4. Order Items (order_items.csv)
- order_item_id: Unique identifier
- order_id: Associated order
- product_id: Purchased product
- quantity: Quantity purchased
- item_price: Price per unit
5. Reviews (reviews.csv)
- review_id: Unique review identifier
- user_id: User who submitted review
- product_id: Reviewed product
- rating: 1-5 star rating
- review_text: Short synthetic review
- review_date: Submission date
6. Events (events.csv)
- event_id: Unique event identifier
- user_id: User performing event
- product_id: Viewed/added/purchased product
- event_type: view / cart / wishlist / purchase
- event_timestamp: Timestamp of event
Example analyses and ML projects:
- Customer churn prediction
- Review sentiment analysis (NLP)
- Recommendation engines
- Price optimization models
- Demand forecasting (Time-series)
- Market basket analysis
- RFM segmentation
- Cohort analysis
- Funnel conversion tracking (see the sketch after this list)
- A/B testing simulations
SQL practice topics:
- Joins
- Window functions
- Aggregations
- CTE-based funnels
- Complex queries
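As a small illustration of the funnel-conversion idea flagged above, the pandas sketch below walks users through a view -> cart -> purchase funnel using events.csv and the column names from the Events table; the stage definitions are illustrative.

~~~python
import pandas as pd

events = pd.read_csv("events.csv", parse_dates=["event_timestamp"])

# Users reaching each stage of a simple view -> cart -> purchase funnel.
stages = ["view", "cart", "purchase"]
stage_users = {
    s: set(events.loc[events["event_type"] == s, "user_id"]) for s in stages
}

reached = stage_users["view"]
print(f"view: {len(reached)} users")
for prev, cur in zip(stages, stages[1:]):
    converted = reached & stage_users[cur]
    rate = len(converted) / len(reached) if reached else 0.0
    print(f"{prev} -> {cur}: {len(converted)} users ({rate:.1%})")
    reached = converted
~~~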
Generated using:
- Faker for realistic user and review generation
- NumPy for probability-based event modeling
- Pandas for data processing
The simulation includes:
- demand variation
- user behavior simulation
- return/cancel probabilities
- seasonal order timestamp distribution
The dataset does not include any real personal data; everything is generated synthetically.
This dataset is released under CC BY 4.0, free to use for:
- Research
- Education
- Commercial projects
- Kaggle competitions
- Machine learning pipelines
Just provide attribution.
If you find the dataset useful:
- Upvote the dataset
- Leave a comment
- Share your notebooks using it
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
It is important to easily and efficiently obtain high quality species distribution data for predicting the potential distribution of species using species distribution models (SDMs). There is a need for a powerful software tool to automatically or semi-automatically assist in identifying and correcting errors. Here, we use Python to develop a web-based software tool (SDMdata) to easily collect occurrence data from the Global Biodiversity Information Facility (GBIF) and check species names and the accuracy of coordinates (latitude and longitude). It is open-source software (GNU Affero General Public License/AGPL) that allows anyone to access and modify the source code. SDMdata is available online free of charge from .
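The kind of coordinate checking described can be sketched in a few lines of pandas over a GBIF-style occurrence table (Darwin Core fields decimalLatitude and decimalLongitude); this is only an illustration of the idea, not SDMdata's own code.

~~~python
import pandas as pd

# GBIF-style occurrence table; the file name is a placeholder.
occ = pd.read_csv("occurrences.csv")

valid = (
    occ["decimalLatitude"].between(-90, 90)
    & occ["decimalLongitude"].between(-180, 180)
    # Drop the common "null island" artifact at (0, 0).
    & ~((occ["decimalLatitude"] == 0) & (occ["decimalLongitude"] == 0))
)
print(f"{(~valid).sum()} records flagged by coordinate checks")
occ_clean = occ[valid]
~~~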
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The datasets are the output of a Python-based control-plane simulator containing a deep reinforcement learning-based flow operation module, a lightweight Software-Defined Networking controller, a flow routing manager, and packet node agents. The data plane was accurately emulated using a simulator based on CURSA-SQ.
Notice: this is not the latest Heat Island Anomalies image service.

This layer contains the relative degrees Fahrenheit difference between any given pixel and the mean heat value for the city in which it is located, for every city in the contiguous United States, Alaska, Hawaii, and Puerto Rico. This 30-meter raster was derived from Landsat 8 imagery band 10 (ground-level thermal sensor) from the summer of 2022, with patching from the summer of 2021 where necessary.

Federal statistics over a 30-year period show extreme heat is the leading cause of weather-related deaths in the United States. Extreme heat exacerbated by urban heat islands can lead to increased respiratory difficulties, heat exhaustion, and heat stroke. These heat impacts significantly affect the most vulnerable: children, the elderly, and those with preexisting conditions.

The purpose of this layer is to show where certain areas of cities are hotter or cooler than the average temperature for that same city as a whole. This dataset represents a snapshot in time. It will be updated yearly, but is static between updates. It does not take into account changes in heat during a single day, for example, from building shadows moving. The thermal readings detected by the Landsat 8 sensor are surface-level, whether that surface is the ground or the top of a building. Although there is strong correlation between surface temperature and air temperature, they are not the same. We believe that this is useful at the national level, and for cities that don't have the ability to conduct their own hyper-local temperature survey. Where local data is available, it may be more accurate than this dataset.

Dataset Summary

This dataset was developed using proprietary Python code developed at The Trust for Public Land, running on the Descartes Labs platform through the Descartes Labs API for Python. The Descartes Labs platform allows for extremely fast retrieval and processing of imagery, which makes it possible to produce heat island data for all cities in the United States in a relatively short amount of time.

In order to click on the image service and see the raw pixel values in a map viewer, you must be signed in to ArcGIS Online, then Enable Pop-Ups and Configure Pop-Ups.

Using the Urban Heat Island (UHI) Image Services

The data is made available as an image service. There is a processing template applied that supplies the yellow-to-red or blue-to-red color ramp, but once this processing template is removed (you can do this in ArcGIS Pro or ArcGIS Desktop, or in QGIS), the actual data values come through the service and can be used directly in a geoprocessing tool (for example, to extract an area of interest). Following are instructions for doing this in Pro.

In ArcGIS Pro, in a Map view, in the Catalog window, click on Portal. In the Portal window, click on the far-right icon representing Living Atlas. Search on the acronyms "tpl" and "uhi". The results returned will be the UHI image services. Right-click on a result and select "Add to current map" from the context menu. When the image service is added to the map, right-click on it in the map view, and select Properties. In the Properties window, select Processing Templates. On the drop-down menu at the top of the window, the default Processing Template is either a yellow-to-red ramp or a blue-to-red ramp. Click the drop-down, and select "None", then "OK". Now you will have the actual pixel values displayed in the map, and available to any geoprocessing tool that takes a raster as input.

[Screenshot: ArcGIS Pro with a UHI image service loaded, color ramp removed, and symbology changed back to a yellow-to-red ramp (a classified renderer can also be used).]

A typical operation at this point is to clip out your area of interest. To do this, add your polygon shapefile or feature class to the map view, and use the Clip Raster tool to export your area of interest as a geoTIFF raster (file extension ".tif"). In the environments tab for the Clip Raster tool, click the dropdown for "Extent", select "Same as Layer:", and select the name of your polygon. If you then need to convert the output raster to a polygon shapefile or feature class, run the Raster to Polygon tool, and select "Value" as the field.

Other Sources of Heat Island Information

Please see these websites for valuable information on heat islands and to learn about exciting new heat island research being led by scientists across the country:
- EPA's Heat Island Resource Center
- Dr. Ladd Keith, University of Arizona
- Dr. Ben McMahan, University of Arizona
- Dr. Jeremy Hoffman, Science Museum of Virginia
- Dr. Hunter Jones, NOAA
- Daphne Lundi, Senior Policy Advisor, NYC Mayor's Office of Recovery and Resiliency

Disclaimer/Feedback

With nearly 14,000 cities represented, checking each city's heat island raster for quality assurance would be prohibitively time-consuming, so The Trust for Public Land checked a statistically significant sample size for data quality. The sample passed all quality checks, with about 98.5% of the output cities error-free, but there could be instances where the user finds errors in the data. These errors will most likely take the form of a line of discontinuity where there is no city boundary; this type of error is caused by large temperature differences in two adjacent Landsat scenes, so the discontinuity occurs along scene boundaries (see figure below). The Trust for Public Land would appreciate feedback on these errors so that version 2 of the national UHI dataset can be improved. Contact Dale.Watt@tpl.org with feedback.
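For users working outside ArcGIS, the clip-to-area-of-interest step described above can be approximated with open-source Python tools. The sketch below uses rasterio and geopandas with placeholder file names; it is an alternative to the Clip Raster tool, not part of The Trust for Public Land's workflow.

~~~python
import geopandas as gpd
import rasterio
from rasterio.mask import mask

# Placeholder file names: an exported UHI raster and a polygon area of interest.
aoi = gpd.read_file("area_of_interest.shp")

with rasterio.open("uhi_export.tif") as src:
    aoi = aoi.to_crs(src.crs)                          # match the raster's CRS
    clipped, transform = mask(src, aoi.geometry, crop=True)
    meta = src.meta.copy()
    meta.update(height=clipped.shape[1], width=clipped.shape[2],
                transform=transform)

with rasterio.open("uhi_clipped.tif", "w", **meta) as dst:
    dst.write(clipped)
~~~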
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data and software were used in the submitted paper "Seismicity patterns and multi-scale imaging at Krafla (N-E Iceland) with local earthquake tomography" by Glück et al.
The data and software provided here are used to compute the velocity models with TomoTV.
The raw data (.mseed format) can be visualised with the Python package Pyrocko/Snuffler, which was also used for the arrival time picking.
For the temporary network, the manual picks are provided along with the code to prepare them as input files for a localisation with NonLinLoc by weighting and quality checking the data. The resulting localisations and the weighted traveltimes are then used for the LET.
The same workflow was used for the picks from the permanent network.
Data:
- Raw data (\WaveformsPermanentStations): 7s waveform snippets of the events listed in the ISOR catalogue on http://lv.isor.is:8080/events/browse/ for the years 2021 and 2022.
- Raw data (\WaveformsNodes): 5s waveform snippets of the events listed in the ISOR catalogue on http://lv.isor.is:8080/events/browse/2022 recorded with the temporary network of 98 temporary nodes in June and July 2022.
- Pickfile (ManualPicks_100Nodes_Kafla2022.txt): Manual picks of the events listed in the ISOR catalogue for the events recorded with the temporary network.
Software (Hyp_format.py):
- Weighting: The picks are weighted according to their Signal-to-Noise ratio (described in more detail in Section 2.3 in the main text of the paper)
- Writing the input file for NonLinLoc (with the mode option "PorS" selected in line 118), including all picks, also for those stations where not both phases were picked. The file "endfile.txt" is needed to write the picks to the NonLinLoc input format.
- Quality check of the picks: computing a modified Wadati diagram from the traveltime differences of P and S phases for all available events (with the mode option "PandS" selected in line 118); see the sketch after this list.
- Python packages needed: numpy, scipy, matplotlib, pandas, obspy
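A minimal sketch of the Wadati-diagram idea behind the quality check (S-P time plotted against P arrival time, with slope approximately vp/vs - 1). The arrival times below are made up, and the modified diagram computed by Hyp_format.py will differ in detail.

~~~python
import matplotlib.pyplot as plt
import numpy as np

# Made-up P and S arrival times (s) at five stations for a single event.
tp = np.array([1.2, 1.8, 2.5, 3.1, 4.0])
ts = np.array([2.1, 3.2, 4.4, 5.5, 7.0])

sp = ts - tp                                 # S-P times
slope, intercept = np.polyfit(tp, sp, 1)     # slope is approximately vp/vs - 1
print(f"vp/vs estimate: {slope + 1:.2f}")

plt.scatter(tp, sp, label="picks")
plt.plot(tp, slope * tp + intercept, label="fit")
plt.xlabel("P arrival time [s]")
plt.ylabel("S-P time [s]")
plt.legend()
plt.show()
~~~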