We discuss a statistical framework that underlies envelope detection schemes as well as dynamical models based on Hidden Markov Models (HMM) that can encompass both discrete and continuous sensor measurements for use in Integrated System Health Management (ISHM) applications. The HMM allows for the rapid assimilation, analysis, and discovery of system anomalies. We motivate our work with a discussion of an aviation problem where the identification of anomalous sequences is essential for safety reasons. The data in this application are discrete and continuous sensor measurements and can be dealt with seamlessly using the methods described here to discover anomalous flights. We specifically treat the problem of discovering anomalous features in the time series that may be hidden from the sensor suite and compare those methods to standard envelope detection methods on test data designed to accentuate the differences between the two methods. Identification of these hidden anomalies is crucial to building stable, reusable, and cost-efficient systems. We also discuss a data mining framework for the analysis and discovery of anomalies in high-dimensional time series of sensor measurements that would be found in an ISHM system. We conclude with recommendations that describe the tradeoffs in building an integrated scalable platform for robust anomaly detection in ISHM applications.
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset contains data collected during the study "Towards High-Value Datasets determination for data-driven development: a systematic literature review" conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun, Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean) and Andrea Miletič (University of Zagreb). It is being made public both to act as supplementary data for the paper (a pre-print is available in Open Access at https://arxiv.org/abs/2305.10234) and to allow other researchers to use these data in their own work.
The protocol is intended for the systematic literature review (SLR) on the topic of high-value datasets, with the aim of gathering information on how the topic of high-value datasets (HVD) and their determination has been reflected in the literature over the years and what has been found by these studies to date, including the indicators used in them, the involved stakeholders, data-related aspects, and frameworks. The data in this dataset were collected as a result of the SLR over Scopus, Web of Science, and the Digital Government Research Library (DGRL) in 2023.
Methodology
To understand how HVD determination has been reflected in the literature over the years and what has been found by these studies to date, all relevant literature covering this topic has been studied. To this end, the SLR was carried out by searching the digital libraries covered by Scopus, Web of Science (WoS), and the Digital Government Research Library (DGRL).
These databases were queried for the keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), which were applied to the article title, keywords, and abstract to limit the results to papers in which these objects were primary research objects rather than merely mentioned in the body, e.g., as future work. After deduplication, 11 unique articles were found and further checked for relevance. As a result, a total of 9 articles were examined in depth. Each study was independently examined by at least two authors.
To attain the objective of our study, we developed a protocol in which the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design-related information, (3) quality-related information, (4) HVD determination-related information.
Test procedure: Each study was independently examined by at least two authors; after an in-depth examination of the full text of the article, the structured protocol was filled in for each study. The structure of the protocol is available in the supplementary files (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx). The data collected for each study by two researchers were then synthesized into one final version by a third researcher.
Description of the data in this data set
Protocol_HVD_SLR provides the structure of the protocol. Spreadsheet #1 provides the filled protocol for the relevant studies. Spreadsheet #2 provides the list of results after the search over the three indexing databases, i.e. before filtering out irrelevant studies.
The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information
Descriptive information
1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet
2) Complete reference - the complete source information to refer to the study
3) Year of publication - the year in which the study was published
4) Journal article / conference paper / book chapter - the type of the paper: {journal article, conference paper, book chapter}
5) DOI / Website - a link to the website where the study can be found
6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science
7) Availability in OA - availability of an article in the Open Access
8) Keywords - keywords of the paper as indicated by the authors
9) Relevance for this study - what is the relevance level of the article for this study? {high / medium / low}
Approach- and research design-related information
10) Objective / RQ - the research objective / aim, established research questions
11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analysis (country, organisation, specific unit that has been analysed, e.g., the number of use-cases, scope of the SLR, etc.)
12) Contributions - the contributions of the study
13) Method - whether the study uses a qualitative, quantitative, or mixed methods approach?
14) Availability of the underlying research data - whether there is a reference to the publicly available underlying research data, e.g., transcriptions of interviews, collected data, or an explanation why these data are not shared?
15) Period under investigation - period (or moment) in which the study was conducted
16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is theory used in the study?
Quality- and relevance-related information
17) Quality concerns - whether there are any quality concerns (e.g., limited information about the research methods used)?
18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused around the HVD determination, secondary - mentioned but not studied (e.g., as part of discussion, future work etc.))
HVD determination-related information
19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term?
20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, "input -> output")
21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the relationships between these components? (detailed description)
22) Stakeholders and their roles - what stakeholders or actors does HVD determination involve? What are their roles?
23) Data - what data do HVD cover?
24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)
Format of the files: .xls, .csv (for the first spreadsheet only), .odt, .docx
Licenses or restrictions CC-BY
For more info, see README.txt
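As a quick-start illustration, the spreadsheets can be loaded with pandas; the file names below are assumptions, so substitute the actual spreadsheet names shipped with the dataset.

import pandas as pd

# Hypothetical file names -- replace with the names used in the published dataset.
filled_protocol = pd.read_excel("Spreadsheet_1_filled_protocol.xls")   # filled protocol for the 9 relevant studies (.xls needs the xlrd engine)
search_results = pd.read_csv("Spreadsheet_2_search_results.csv")       # raw search results before filtering

# e.g. count studies per relevance level (field 9 of the protocol)
print(filled_protocol["Relevance for this study"].value_counts())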
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Abstract
Four documents describe the specifications, methods and scripts of the Impact and Risk Analysis Databases developed for the Bioregional Assessments Programme. They are:
Bioregional Assessment Impact and Risk Databases Installation Advice (IMIA Database Installation Advice v1.docx).
Naming Convention of the Bioregional Assessment Impact and Risk Databases (IMIA Project Naming Convention v39.docx).
Data treatments for the Bioregional Assessment Impact and Risk Databases (IMIA Project Data Treatments v02.docx).
Quality Assurance of the Bioregional Assessment Impact and Risk Databases (IMIA Project Quality Assurance Protocol v17.docx).
This dataset also includes the Materialised View Information Manager (MatInfoManager.zip). This Microsoft Access database is used to manage the overlay definitions of materialized views of the Impact and Risk Analysis Databases. For more information about this tool, refer to the Data Treatments document.
The documentation supports all five Impact and Risk Analysis Databases developed for the assessment areas:
Maranoa-Balonne-Condamine: http://data.bioregionalassessments.gov.au/dataset/69075f3e-67ba-405b-8640-96e6cb2a189a
Gloucester: http://data.bioregionalassessments.gov.au/dataset/d78c474c-5177-42c2-873c-64c7fe2b178c
Hunter: http://data.bioregionalassessments.gov.au/dataset/7c170d60-ff09-4982-bd89-dd3998a88a47
Namoi: http://data.bioregionalassessments.gov.au/dataset/1549c88d-927b-4cb5-b531-1d584d59be58
Galilee: http://data.bioregionalassessments.gov.au/dataset/3dbb5380-2956-4f40-a535-cbdcda129045
Purpose
These documents describe end-to-end treatments of scientific data for the Impact and Risk Analysis Databases, developed and published by the Bioregional Assessment Programme. The applied approach to data quality assurance is also described. These documents are intended for people with an advanced knowledge of geospatial analysis and database administration, who seek to understand, restore or utilise the Analysis Databases and their underlying methods of analysis.
Dataset History
The Impact and Risk Analysis Database Documentation was created for and by the Information Modelling and Impact Assessment Project (IMIA Project).
Dataset Citation
Bioregional Assessment Programme (2018) Impact and Risk Analysis Database Documentation. Bioregional Assessment Source Dataset. Viewed 12 December 2018, http://data.bioregionalassessments.gov.au/dataset/05e851cf-57a5-4127-948a-1b41732d538c.
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
Normalized variances calculated using the method described in the article, based on experimental data. Data are stored using Xarray, specifically in the NetCDF format, and can be easily accessed with the Xarray Python library by calling xarray.open_dataset().

The dataset is structured as follows:

two N-dimensional DataArrays, one for calculations with time displacements (labeled as time) and one for calculations with phase displacements with the time centroid already picked (labeled as final)

each DataArray has 5 dimensions: SNR, eps (separation), ph_disp/disp (displacement), sample/sample_time (bootstrapped sample), supersample (ensemble of bootstrapped samples)

coordinates label the parameters along each dimension

Usage examples

Opening the dataset

import numpy as np
import xarray as xr

variances = xr.open_dataset("coherent.nc")

Obtaining parameter estimates

def get_centroid_indices(variances):
    return np.bincount(
        variances.argmin(
            dim="disp" if "disp" in variances.dims else "ph_disp"
        ).values.flatten()
    )

def get_centroid_index(variances):
    return np.argmax(get_centroid_indices(variances))

def epsilon_estimator(var):
    return 4 * np.sqrt(np.clip(var, 0, None))

time_centroid_estimates = variances["time"].idxmin(dim="disp")
phase_centroid_estimates = variances["final"].idxmin(dim="ph_disp")
epsilon_estimates = epsilon_estimator(
    variances["final"].isel(ph_disp=get_centroid_index(variances["final"]))
)

Calculating and plotting precision

def plot(estimates):
    # variance of the estimator across bootstrapped samples
    estimator_variances = estimates.var(
        dim="sample" if "sample" in estimates.dims else "sample_time"
    )
    precision = (
        1.0
        / estimator_variances.snr
        / variances.attrs["SAMPLE_SIZE"]
        / estimator_variances
    )
    precision = precision.where(xr.apply_ufunc(np.isfinite, precision), other=0)
    mean_precision = precision.mean(dim="supersample")
    mean_precision = mean_precision.where(np.isfinite(mean_precision), 0)
    precision_error = 2 * precision.std(dim="supersample").fillna(0)
    g = mean_precision.plot.scatter(
        x="eps",
        col="snr",
        col_wrap=2,
        sharex=True,
        sharey=True,
    )
    for ax, snr in zip(g.axs.flat, variances.snr.values):
        ax.errorbar(
            precision.eps.values,
            mean_precision.sel(snr=snr),
            yerr=precision_error.sel(snr=snr),
            fmt="o",
        )

plot(time_centroid_estimates)
plot(phase_centroid_estimates)
plot(epsilon_estimates)
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset contains a variety of publicly available real-life event logs. We derived two types of Petri nets for each event log with two state-of-the-art process miners: Inductive Miner (IM) and Split Miner (SM). Each event log-Petri net pair is intended for evaluating the scalability of existing conformance checking techniques. We used this dataset to evaluate the scalability of the S-Component approach for measuring fitness. The dataset contains tables of descriptive statistics of both the process models and the event logs. In addition, it includes the results in terms of time performance, measured in milliseconds, for several approaches for both multi-threaded and single-threaded executions. Last, the dataset contains a cost comparison of different approaches and reports on the degree of over-approximation of the S-Component approach. The description of the compared conformance checking techniques can be found here: https://arxiv.org/abs/1910.09767.
Update: The dataset has been extended with the event logs of BPIC18 and BPIC19. BPIC19 is actually a collection of four different processes and was thus split into four event logs. For each of the additional five event logs, again, two process models have been mined with Inductive Miner and Split Miner. We used the extended dataset to test the scalability of our tandem repeats approach for measuring fitness. The dataset now contains updated tables of log and model statistics as well as tables of the conducted experiments measuring execution time and raw fitness cost of various fitness approaches. The description of the compared conformance checking techniques can be found here: https://arxiv.org/abs/2004.01781.
Update: The dataset has also been used to measure the scalability of a new Generalization measure based on concurrent and repetitive patterns. A concurrency oracle is used in tandem with partial orders to identify concurrent patterns in the log that are tested against parallel blocks in the process model. Tandem repeats are used with various trace reductions and extensions to define repetitive patterns in the log that are tested against loops in the process model. Each pattern is assigned a partial fulfillment. The generalization is then the average of pattern fulfillments weighted by the trace counts for which the patterns have been observed. The dataset now includes the time results and a breakdown of Generalization values for the dataset.
Overview
This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).
The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.
Its novelty and our main contributions lie in (1) the focus on medical news articles and blog posts as opposed to social media posts or political discussions; (2) providing multiple modalities (besides the full-texts of the articles, there are also images and videos), thus enabling research of multimodal approaches; (3) the mapping of the articles to the fact-checked claims (with manual as well as predicted labels); (4) providing source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.
The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).
The accompanying GitHub repository provides a small static sample of the dataset and the dataset's descriptive analysis in the form of Jupyter notebooks.
Options to access the dataset
There are two ways to access the dataset: a full static dump or the REST API.
In order to obtain access to the dataset (either the full static dump or the REST API), please request access by following the instructions provided below.
References
If you use this dataset in any publication, project, tool or in any other form, please cite the following papers:
@inproceedings{SrbaMonantPlatform, author = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria}, booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)}, pages = {1--7}, title = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior}, year = {2019} }
@inproceedings{SrbaMonantMedicalDataset, author = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria}, booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)}, numpages = {11}, title = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims}, year = {2022}, doi = {10.1145/3477495.3531726}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3477495.3531726}, }
Dataset creation process
In order to create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blogs from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (from RSS feeds, Wordpress sites, Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data is stored in a unified format in a central data storage.
Ethical considerations
The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.
The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.
As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.
Lastly, the dataset also contains automatically predicted labels of claim presence and article stance using our baselines described in the next section. These methods have their limitations and work with a certain accuracy, as reported in the paper. This should be taken into account when interpreting them.
Reporting mistakes in the dataset: The means to report considerable mistakes in the raw collected data or in the manual annotations is to create a new issue in the accompanying GitHub repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.
Dataset structure
Raw data
First, the dataset contains so-called raw data (i.e., data extracted by the web monitoring module of the Monant platform and stored in exactly the same form as they appear on the original websites). Raw data consist of articles from news sites and blogs (e.g. naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g. snopes.com). In addition, the dataset contains feedback (number of likes, shares, comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.
Raw data are contained in these CSV files (and corresponding REST API endpoints):
sources.csv
articles.csv
article_media.csv
article_authors.csv
discussion_posts.csv
discussion_post_authors.csv
fact_checking_articles.csv
fact_checking_article_media.csv
claims.csv
feedback_facebook.csv
Note: Personal information about discussion posts' authors (name, website, gravatar) are anonymised.
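As an illustration only, the articles can be joined to their sources with pandas; the foreign-key column names used below (source_id, id) are assumptions, so check the actual CSV headers or the REST API documentation first.

import pandas as pd

articles = pd.read_csv("articles.csv")
sources = pd.read_csv("sources.csv")

# Attach source metadata to each article; key column names are assumed, not confirmed.
articles_with_sources = articles.merge(
    sources, left_on="source_id", right_on="id", suffixes=("", "_source")
)
print(articles_with_sources.head())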
Annotations
Second, the dataset contains so-called annotations. Entity annotations describe individual raw data entities (e.g., article, source). Relation annotations describe a relation between two such entities.
Each annotation is described by the following attributes:
category of annotation (annotation_category). Possible values: label (annotation corresponds to ground truth, determined by human experts) and prediction (annotation was created by means of an AI method).
type of annotation (annotation_type_id). Example values: Source reliability (binary), Claim presence. The list of possible values can be obtained from the enumeration in annotation_types.csv.
method which created the annotation (method_id). Example values: Expert-based source reliability evaluation, Fact-checking article to claim transformation method. The list of possible values can be obtained from the enumeration in methods.csv.
its value (value). The value is stored in JSON format and its structure differs according to the particular annotation type.
At the same time, annotations are associated with a particular object identified by:
entity type (parameter entity_type in case of entity annotations, or source_entity_type and target_entity_type in case of relation annotations). Possible values: sources, articles, fact-checking-articles.
entity id (parameter entity_id in case of entity annotations, or source_entity_id and target_entity_id in case of relation annotations).
The dataset provides specifically these entity annotations:
Source reliability (binary). Determines the validity of a source (website) on a binary scale with two options: reliable source and unreliable source.
Article veracity. Aggregated information about veracity from article-claim pairs.
The dataset provides specifically these relation annotations:
Fact-checking article to claim mapping. Determines mapping between fact-checking article and claim.
Claim presence. Determines presence of claim in article.
Claim stance. Determines stance of an article to a claim.
Annotations are contained in these CSV files (and corresponding REST API endpoints):
entity_annotations.csv
relation_annotations.csv
Note: The identification of human annotators (the email provided in the annotation app) is anonymised.
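A minimal sketch, assuming the column names listed above, for reading the entity annotations and parsing the JSON-encoded value field:

import json
import pandas as pd

entity_annotations = pd.read_csv("entity_annotations.csv")

# The value column is JSON whose structure depends on the annotation type.
entity_annotations["value_parsed"] = entity_annotations["value"].apply(json.loads)

# e.g. keep only ground-truth labels (as opposed to model predictions)
labels = entity_annotations[entity_annotations["annotation_category"] == "label"]
print(labels["annotation_type_id"].value_counts())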
In the article (Ahmer, M., Sandin, F., Marklund, P. et al., 2022), we have investigated the effective use of sensors in a bearing ring grinder for failure classification in the condition-based maintenance context. The proposed methodology combines domain knowledge of process monitoring and condition monitoring to successfully achieve failure mode prediction with high accuracy using only a few key sensors. This enables manufacturing equipment to take advantage of advanced data processing and machine learning techniques.
The grinding machine is of type SGB55 from Lidköping Machine Tools and is used to produce the functional raceway surface of inner rings of the SKF-6210 deep groove ball bearing. Additional sensors for vibration, acoustic emission, force, and temperature are installed to monitor the machine condition while producing bearing components under different operating conditions. Data are sampled from the sensors as well as from the machine's numerical controller during operation. Selected parts are measured for the produced quality.
Ahmer, M., Sandin, F., Marklund, P., Gustafsson, M., & Berglund, K. (2022). Failure mode classification for condition-based maintenance in a bearing ring grinding machine. In The International Journal of Advanced Manufacturing Technology (Vol. 122, pp. 1479–1495). https://doi.org/10.1007/s00170-022-09930-6
The files are of three categories and are grouped in zipped folders. The pdf file named "readme_data_description.pdf" describes the content of the files in the folders. The "lib" folder includes information on the libraries needed to read the .tdms data files in Matlab or Python.
The raw time-domain sensor signal data are grouped in seven main folders named after each test run, e.g. "test_1" ... "test_7". Each test includes seven dressing cycles named e.g. "dresscyc_1" ... "dresscyc_7". Each dressing cycle includes .tdms files for fifteen rings, one for each individual grinding cycle. The column descriptions for both the "Analogue" and "Digital" channels are given in the "readme_data_description.pdf" file. The machine and process parameters used for the tests, as sampled from the machine's control system (numerical controller), are compiled for all test runs in a single file "process_data.csv" in the folder "proc_param"; the column description is available in "readme_data_description.pdf" under "Process Parameters". The measured quality data (nine quality parameters, normalized) of the selected produced parts are recorded in the file "measured_quality_param.csv" under the folder "quality"; the description of the quality parameters is available in "readme_data_description.pdf". The quality parameter disposition based on their actual acceptance tolerances for the process step is presented in the file "quality_disposition.csv" under the folder "quality".
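In Python, the .tdms files can be read with, for example, the npTDMS library; a minimal sketch follows, where the file path and group name are assumptions based on the folder layout and channel description above.

from nptdms import TdmsFile  # pip install npTDMS

# Hypothetical path following the described folder layout.
tdms_file = TdmsFile.read("test_1/dresscyc_1/ring_01.tdms")

# Inspect the available groups and channels, then convert one group to a
# DataFrame; see readme_data_description.pdf for the column meanings.
for group in tdms_file.groups():
    print(group.name, [channel.name for channel in group.channels()])

analogue = tdms_file["Analogue"].as_dataframe()  # group name assumed from the description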
This data release contains three different datasets that were used in the Scientific Investigations Report (SIR): Spatial and Temporal Distribution of Bacterial Indicators and Microbial Source Tracking within Tumacácori National Historical Park and the Upper Santa Cruz River, Arizona, 2015-16. These datasets contain regression model data, estimated discharge data, and calculated flux and yields data.
Regression Model Data: This dataset contains data used in the regression model development in the SIR. The period of data ranged from May 25, 1994 to May 19, 2017. Data from 2015 to 2017 were collected by the U.S. Geological Survey. Data prior to 2015 were provided by various agencies. Listed below are the different data contained within this dataset:
- Season represented as an indicator variable (Fall, Spring, Summer, and Winter)
- Hydrologic Condition represented as an indicator variable (rising limb, recession limb, peak, or unable to classify)
- Flood (binary variable indicating if the sample was collected during a flood event or not)
- Decimal Date (DT) represented as a continuous variable
- Sine of DT represented as a continuous variable for a periodic function to describe seasonal variation
- Cosine of DT represented as a continuous variable for a periodic function to describe seasonal variation
Estimated Discharge: This dataset contains estimated discharge at four different sites between 03/02/2015 and 12/14/2016. The discharge was estimated using nearby streamgage relations; methods are described in detail in the SIR. The sites where discharge was estimated are listed below:
- NW8; 312551110573901; Nogales Wash at Ruby Road
- SC3; 312654110573201; Santa Cruz River abv Nogales Wash
- SC10; 313343110024701; Santa Cruz River at Santa Gertrudis Lane
- SC14; 09481740; Santa Cruz River at Tubac, AZ
Calculated Flux and Yields: This dataset contains calculated flux and yields for E. coli and suspended sediment concentrations. Mean daily flux was calculated when mean daily discharge was available at a corresponding streamgage. Instantaneous flux was calculated when instantaneous discharge (at 15-minute intervals) was available at a corresponding streamgage, or from a measured or estimated discharge value. The yields were calculated using the calculated flux values and the area of the different watersheds. Methods and equations are described in detail in the SIR. Listed below are the data contained within this dataset:
- Mean daily E. coli flux, in most probable number per day
- Mean daily suspended sediment flux, in tons per day
- Instantaneous E. coli flux, in most probable number per second
- Instantaneous suspended sediment flux, in tons per second
- E. coli, in most probable number per square mile
- Suspended sediment, in tons per square mile
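For illustration only (not the USGS processing code), the periodic seasonal terms, sine and cosine of the decimal date (DT), can be reproduced along these lines:

import numpy as np
import pandas as pd

sample_dates = pd.to_datetime(["1994-05-25", "2015-07-04", "2017-05-19"])

# Approximate decimal date (year plus fraction of the year), then
# annual-period sine and cosine terms for the seasonal component.
decimal_date = sample_dates.year + (sample_dates.dayofyear - 1) / 365.25
sin_dt = np.sin(2 * np.pi * decimal_date)
cos_dt = np.cos(2 * np.pi * decimal_date)
print(pd.DataFrame({"DT": decimal_date, "sin_DT": sin_dt, "cos_DT": cos_dt}))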
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
Various climate variable summaries for all 15 subregions based on the Bureau of Meteorology Australian Water Availability Project (BAWAP) climate grids, including:
Time series mean annual BAWAP rainfall from 1900 - 2012.
Long term average BAWAP rainfall and Penman Potential Evapotranspiration (PET) from Jan 1981 - Dec 2012 for each month
Values calculated over the years 1981 - 2012 (inclusive), for 17 time periods (i.e., annual, 4 seasons and 12 months) for the following 8 meteorological variables: (i) BAWAP_P (precipitation); (ii) Penman ETp; (iii) Tavg (average temperature); (iv) Tmax (maximum temperature); (v) Tmin (minimum temperature); (vi) VPD (Vapour Pressure Deficit); (vii) Rn (net radiation); and (viii) Wind speed. For each of the 17 time periods and each of the 8 meteorological variables, we calculated the: (a) average; (b) maximum; (c) minimum; (d) average plus standard deviation (stddev); (e) average minus stddev; (f) stddev; and (g) trend.
Correlation coefficients (-1 to 1) between rainfall and 4 remote rainfall drivers between 1957-2006 for the four seasons. The data and methodology are described in Risbey et al. (2009).
As described in the Risbey et al. (2009) paper, the rainfall was from 0.05 degree gridded data described in Jeffrey et al. (2001 - known as the SILO datasets); sea surface temperature was from the Hadley Centre Sea Ice and Sea Surface Temperature dataset (HadISST) on a 1 degree grid. BLK=Blocking; DMI=Dipole Mode Index; SAM=Southern Annular Mode; SOI=Southern Oscillation Index; DJF=December, January, February; MAM=March, April, May; JJA=June, July, August; SON=September, October, November. The analysis is a summary of Fig. 15 of Risbey et al. (2009).
There are 4 csv files here:
BAWAP_P_annual_BA_SYB_GLO.csv
Desc: Time series mean annual BAWAP rainfall from 1900 - 2012.
Source data: annual BILO rainfall
P_PET_monthly_BA_SYB_GLO.csv
Desc: Long term average BAWAP rainfall and Penman PET from 198101 - 201212 for each month
Climatology_Trend_BA_SYB_GLO.csv
Desc: Values calculated over the years 1981 - 2012 (inclusive), for 17 time periods (i.e., annual, 4 seasons and 12 months) for the following 8 meteorological variables: (i) BAWAP_P; (ii) Penman ETp; (iii) Tavg; (iv) Tmax; (v) Tmin; (vi) VPD; (vii) Rn; and (viii) Wind speed. For each of the 17 time periods and each of the 8 meteorological variables, the following were calculated: (a) average; (b) maximum; (c) minimum; (d) average plus standard deviation (stddev); (e) average minus stddev; (f) stddev; and (g) trend
Risbey_Remote_Rainfall_Drivers_Corr_Coeffs_BA_NSB_GLO.csv
Correlation coefficients (-1 to 1) between rainfall and 4 remote rainfall drivers between 1957-2006 for the four seasons. The data and methodology are described in Risbey et al. (2009). As described in the Risbey et al. (2009) paper, the rainfall was from 0.05 degree gridded data described in Jeffrey et al. (2001 - known as the SILO datasets); sea surface temperature was from the Hadley Centre Sea Ice and Sea Surface Temperature dataset (HadISST) on a 1 degree grid. BLK=Blocking; DMI=Dipole Mode Index; SAM=Southern Annular Mode; SOI=Southern Oscillation Index; DJF=December, January, February; MAM=March, April, May; JJA=June, July, August; SON=September, October, November. The analysis is a summary of Fig. 15 of Risbey et al. (2009).
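For illustration only (not the programme's actual workflow), the summary statistics (a)-(g) listed above can be computed for one time period from a monthly series along these lines:

import numpy as np
import pandas as pd

# Hypothetical monthly BAWAP rainfall series for one subregion, 1981-2012.
months = pd.date_range("1981-01-01", "2012-12-01", freq="MS")
rain = pd.Series(np.random.gamma(2.0, 30.0, len(months)), index=months)

annual = rain.resample("A").sum()  # the "annual" time period; seasons and months are analogous
stats = {
    "average": annual.mean(),
    "maximum": annual.max(),
    "minimum": annual.min(),
    "avg_plus_stddev": annual.mean() + annual.std(),
    "avg_minus_stddev": annual.mean() - annual.std(),
    "stddev": annual.std(),
    "trend": np.polyfit(annual.index.year, annual.values, 1)[0],  # least-squares slope per year
}
print(stats)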
The dataset was created from various BAWAP source data, including monthly BAWAP rainfall, Tmax, Tmin, VPD, etc., and other source data including monthly Penman PET and correlation coefficient data. Data were extracted from national datasets for the GLO subregion.
Bioregional Assessment Programme (2014) GLO climate data stats summary. Bioregional Assessment Derived Dataset. Viewed 18 July 2018, http://data.bioregionalassessments.gov.au/dataset/afed85e0-7819-493d-a847-ec00a318e657.
Derived From Natural Resource Management (NRM) Regions 2010
Derived From Bioregional Assessment areas v03
Derived From BILO Gridded Climate Data: Daily Climate Data for each year from 1900 to 2012
Derived From Bioregional Assessment areas v01
Derived From Bioregional Assessment areas v02
Derived From GEODATA TOPO 250K Series 3
Derived From NSW Catchment Management Authority Boundaries 20130917
Derived From Geological Provinces - Full Extent
Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
———————————————————————————————— ORIGINAL PAPERS ———————————————————————————————— Mano, Marsel, Anatole Lécuyer, Elise Bannier, Lorraine Perronnet, Saman Noorzadeh, and Christian Barillot. 2017. “How to Build a Hybrid Neurofeedback Platform Combining EEG and FMRI.” Frontiers in Neuroscience 11 (140). https://doi.org/10.3389/fnins.2017.00140 Perronnet, Lorraine, Anatole Lécuyer, Marsel Mano, Elise Bannier, Maureen Clerc, Christian Barillot, et al. 2017. “Unimodal Versus Bimodal EEG-FMRI Neurofeedback of a Motor Imagery Task.” Frontiers in Human Neuroscience 11 (193). https://doi.org/10.3389/fnhum.2017.00193.
This dataset, named XP1, can be pulled together with the dataset XP2 (DOI: 10.18112/openneuro.ds002338.v1.0.0). Data acquisition methods have been described in Perronnet et al. (2017, Frontiers in Human Neuroscience). Simultaneous 64-channel EEG and fMRI during right-hand motor imagery and neurofeedback (NF) were acquired in this study (as well as in XP2). For this study, 10 subjects performed three types of NF runs (bimodal EEG-fMRI NF, unimodal EEG-NF and fMRI-NF).
————————————————————————————————
EXPERIMENTAL PARADIGM
————————————————————————————————
Subjects were instructed to perform a kinaesthetic motor imagery of the right hand and to find their own strategy to control and bring the ball to the target.
The experimental protocol consisted of 6 EEG-fMRI runs with a 20 s block design alternating rest and task:
motor localizer run (task-motorloc) - 8 blocks x (20 s rest + 20 s task)
motor imagery run without NF (task-MIpre) - 5 blocks x (20 s rest + 20 s task)
three NF runs with different NF conditions (task-eegNF, task-fmriNF, task-eegfmriNF) occurring in random order - 10 blocks x (20 s rest + 20 s task)
motor imagery run without NF (task-MIpost) - 5 blocks x (20 s rest + 20 s task)
———————————————————————————————— EEG DATA ———————————————————————————————— EEG data was recorded using a 64-channel MR compatible solution from Brain Products (Brain Products GmbH, Gilching, Germany).
RAW EEG DATA
EEG was sampled at 5 kHz with FCz as the reference electrode and AFz as the ground electrode, and a resolution of 0.5 microV. Following the BIDS arborescence, the raw EEG data for each task can be found for each subject in
XP1/sub-xp1*/eeg
in Brain Vision Recorder format (File Version 1.0). Each raw EEG recording includes three files: the data file (*.eeg), the header file (*.vhdr) and the marker file (*.vmrk). The header file contains information about the acquisition parameters and amplifier setup. For each electrode, the impedance at the beginning of the recording is also specified. For all subjects, channel 32 is the ECG channel. The 63 other channels are EEG channels.
The marker file contains the list of markers assigned to the EEG recordings and their properties (marker type, marker ID and position in data points). Three types of markers are relevant for the EEG processing:
R128 (Response): is the fMRI volume marker to correct for the gradient artifact
S 99 (Stimulus): is the protocol marker indicating the start of the Rest block
S 2 (Stimulus): is the protocol marker indicating the start of the Task (Motor Execution, Motor Imagery or Neurofeedback)
Warning: in a few EEG recordings, the first S 99 marker might be missing, but it can easily be “added” 20 s before the first S 2.
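A minimal MNE-Python sketch for reading a raw recording and adding the missing S 99 marker 20 s before the first S 2; the file name is hypothetical and the exact annotation strings should be checked against the .vmrk file.

import mne

raw = mne.io.read_raw_brainvision(
    "XP1/sub-xp101/eeg/sub-xp101_task-eegNF_eeg.vhdr", preload=False
)

# BrainVision markers are exposed as annotations, e.g. "Stimulus/S 99".
descriptions = [d.replace("  ", " ") for d in raw.annotations.description]
if not any(d.endswith("S 99") for d in descriptions):
    first_task_onset = min(
        onset
        for onset, d in zip(raw.annotations.onset, descriptions)
        if d.endswith("S 2")
    )
    raw.annotations.append(
        onset=first_task_onset - 20.0, duration=0.0, description="Stimulus/S 99"
    )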
PREPROCESSED EEG DATA
Following the BIDS arborescence, the processed EEG data for each task and subject can be found in the pre-processed data folder:
XP1/derivatives/sub-xp1*/eeg_pp/*eeg_pp.*
following the BrainVision Analyzer format. Each processed EEG recording includes three files: the data file (*.dat), the header file (*.vhdr) and the marker file (*.vmrk), containing information similar to those described for the raw data. In the header file of the preprocessed data, the channel locations are also specified. In the marker file, the locations in data points of the identified heart pulses (R markers) are specified as well.
EEG data were pre-processed using BrainVision Analyzer II Software, with the following steps:
Automatic gradient artifact correction using the artifact template subtraction method (sliding average calculation with 21 intervals for the sliding average and all channels enabled for correction).
Downsampling with factor 25 (200 Hz).
Low-pass FIR filter: cut-off frequency 50 Hz.
Ballistocardiogram (pulse) artifact correction using a semiautomatic procedure (pulse template searched between 40 s and 240 s in the ECG channel with the following parameters: Coherence Trigger = 0.5, Minimal Amplitude = 0.5, Maximal Amplitude = 1.3). The identified pulses were marked with R.
Segmentation relative to the first block marker (S 99) for the whole length of the training protocol (last S 2 + 20 s).
EEG NF SCORES
Neurofeedback scores can be found in the .mat structures in
XP1/derivatives/sub-xp1*/NF_eeg/d_sub*NFeeg_scores.mat
Structures named NF_eeg are composed of the following subfields:
ID: subject ID, for example sub-xp101
lapC3_ERD: a 1x1280 vector of neurofeedback scores (4 scores per second, for the whole session)
eeg: a 64x80200 matrix with the pre-processed EEG signals obtained with the steps described above, filtered between 8 and 30 Hz
lapC3_bandpower_8Hz_30Hz: a 1x1280 vector; bandpower of the filtered signal with a Laplacian centred on C3, used to estimate lapC3_ERD
lapC3_filter: a 1x64 vector; Laplacian filter centred on the C3 channel
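A minimal sketch for loading these structures in Python with SciPy; the subject and file name are hypothetical, so adjust them to the actual d_sub*NFeeg_scores.mat file.

from scipy.io import loadmat

mat = loadmat(
    "XP1/derivatives/sub-xp101/NF_eeg/d_sub-xp101NFeeg_scores.mat",
    squeeze_me=True,
    struct_as_record=False,
)
nf_eeg = mat["NF_eeg"]

print(nf_eeg.ID)                   # subject ID, e.g. sub-xp101
print(nf_eeg.lapC3_ERD.shape)      # (1280,) -> 4 NF scores per second
print(nf_eeg.eeg.shape)            # (64, 80200) pre-processed, 8-30 Hz filtered EEG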
———————————————————————————————— BOLD fMRI DATA ———————————————————————————————— All DICOM files were converted to NIfTI-1 and then to BIDS format (version 2.1.4) using the software dcm2niix (version v1.0.20190720 GVV7.4.0).
fMRI acquisitions were performed using echo-planar imaging (EPI) covering the entire brain, with the following parameters:
3T Siemens Verio, EPI sequence, TR = 2 s, TE = 23 ms, resolution 2x2x4 mm3, FOV = 210×210 mm2, number of slices: 32, no slice gap.
As specified by the onsets in the task event files (XP1\*events.tsv), the scanner began the EPI pulse sequence two seconds prior to the start of the protocol (first rest block), so the first two TRs should be discarded. The useful TRs for the runs are therefore:
task-motorloc: 320 s (2 to 322)
task-MIpre and task-MIpost: 200 s (2 to 202)
task-eegNF, task-fmriNF, task-eegfmriNF: 400 s (2 to 402)
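For example, the first two volumes can be dropped with nibabel before any analysis (a sketch; the file name is hypothetical):

import nibabel as nib

img = nib.load("XP1/sub-xp101/func/sub-xp101_task-motorloc_bold.nii.gz")
data = img.get_fdata()

# TR = 2 s and the sequence started 2 s before the protocol, so discard the
# first two volumes to align the time series with the first rest block.
useful = data[..., 2:]
print(data.shape[-1], "->", useful.shape[-1])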
In the task event files for the different tasks, each column represents:
Following the BIDS arborescence, the functional data and related metadata can be found for each subject in the following directory
XP1/sub-xp1*/func
BOLD-NF SCORES
For each subject and NF session, a matlab structure with BOLD-NF features can be found in
XP1/derivatives/sub-xp1*/NF_bold/
In view of BOLD-NF score computation, fMRI data were preprocessed using AutoMRI, a software based on SPM8, with the following steps: slice-time correction, spatial realignment and coregistration with the anatomical scan, spatial smoothing with a 6 mm Gaussian kernel, and normalization to the Montreal Neurological Institute template. For each session, a first-level general linear model analysis was then performed. The resulting activation maps (voxel-wise family-wise error corrected at p < 0.05) were used to define two ROIs (9x9x3 voxels) around the maximum of activation in the ipsilesional primary motor area (M1) and the supplementary motor area (SMA), respectively.
The BOLD-NF scores were calculated as the difference between the percentage signal change in the two ROIs (SMA and M1) and in a large deep background region (slice 3 out of 16) whose activity is not correlated with the NF task. A smoothed version of the NF scores over the preceding three volumes was also computed.
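A schematic numpy/pandas version of this computation; the baseline definition and the smoothing window follow the description above, so treat it as an illustration rather than the AutoMRI implementation.

import numpy as np
import pandas as pd

def percent_signal_change(ts):
    # Percentage signal change relative to the mean of the time series
    # (the actual baseline definition in the pipeline may differ).
    baseline = ts.mean()
    return 100.0 * (ts - baseline) / baseline

def bold_nf_scores(roi_ts, background_ts):
    # Difference of percentage signal change between an ROI (M1 or SMA)
    # and the deep background region, per volume.
    return percent_signal_change(roi_ts) - percent_signal_change(background_ts)

def smooth_nf(scores, n_previous=3):
    # Running mean over the current volume and the preceding three volumes.
    return pd.Series(scores).rolling(window=n_previous + 1, min_periods=1).mean().to_numpy()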
The NF_bold structure is organised as follows:
NF_bold
→ .m1 → .nf
→ .smoothnf
→ .roimean (averaged BOLD signal in the ROI)
→ .bgmean (averaged BOLD signal in the background slice)
→ .method
NFscores.fmri
→ .sma→ .nf
→ .smoothnf
→ .roimean (averaged BOLD signal in the ROI)
→ .bgmean (averaged BOLD signal in the background slice)
→ .method
Where the subfield method contains information about the ROI size (.roisize), the background mask (.bgmask) and ROI mask (.roimask).
More details about signal processing and NF calculation can be found in Perronnet et al. 2017 and Perronnet et al. 2018.
———————————————————————————————— ANATOMICAL MRI DATA ———————————————————————————————— As a structural reference for the fMRI analysis, a high-resolution 3D T1 MPRAGE sequence was acquired with the following parameters:
3T Siemens Verio, 3D T1 MPRAGE, TR = 1.9 s, TE = 22.6
CVEfixes is a comprehensive vulnerability dataset that is automatically collected and curated from Common Vulnerabilities and Exposures (CVE) records in the public U.S. National Vulnerability Database (NVD). The goal is to support data-driven security research based on source code and source code metrics related to fixes for CVEs in the NVD by providing detailed information at different interlinked levels of abstraction, such as the commit-, file-, and method level, as well as the repository- and CVE level.
This dataset is a preprocessed version of the CVEfixes dataset provided at the following link: https://zenodo.org/record/7029359
This dataset consists of two files:
- CVEFixes.csv: the preprocessed dataset.
- LICENSE.txt: the license information of this dataset.
In CVEFixes.csv, there are three columns:
- code: the source code of the data point.
- language: the programming language of the source code (c, java, php, etc.).
- safety: whether the code is vulnerable or safe.
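A minimal loading sketch with pandas, using the three columns described above:

import pandas as pd

df = pd.read_csv("CVEFixes.csv")

print(df["language"].value_counts())   # distribution of programming languages
print(df["safety"].value_counts())     # vulnerable vs. safe samples

# e.g. keep only the C samples for a language-specific experiment
c_samples = df[df["language"] == "c"]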
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Motivation: Entity matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of many research works. A large number of entity matching/record linkage tasks have been made available for evaluating entity matching methods. However, the lack of fixed development and test splits as well as of correspondence sets including both matching and non-matching record pairs hinders the reproducibility and comparability of benchmark experiments. In an effort to enhance the reproducibility and comparability of the experiments, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits. Dataset Description: An augmented version of the wdc phones dataset for benchmarking entity matching/record linkage methods found at: http://webdatacommons.org/productcorpus/index.html#toc4 The augmented version adds fixed splits for training, validation and testing as well as their corresponding feature vectors. The feature vectors are built using data type specific similarity metrics. The dataset contains 447 records describing products deriving from 17 e-shops which are matched against a product catalog of 50 products. The gold standards have manual annotations for 258 matching and 22,092 non-matching pairs. The total number of attributes used to describe the product records is 26, while the attribute density is 0.25. The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results. The dataset is part of the CompERBench repository which provides 21 complete benchmark tasks for entity matching for public download: http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html
Abstract
This dataset was derived by the Bioregional Assessment Programme. The parent datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
This dataset was created in order to calculate the steady-state drawdown extent for potential CSG developments in the Cooper subregion. The calculations are made using the de Glee method, as described in Krusemann (1994), in accordance with the SA Far North Water Allocation Plan (2009) explanation document. The data were created as part of the conceptual modelling for causal pathways to assess the potential for CSG development-related impacts to propagate via groundwater drawdown.
Dataset History
The dataset assumes water production rates for CSG in the Cooper subregion based on production rates in other eastern Australian Permian CSG fields, reported in Onshore co-produced water: extent and management (http://data.bioregionalassessments.gov.au/dataset/6b3d8096-f09d-40a2-b5ee-09c9f8b9bdfc) and Fell (2013) Discussion paper for Office of NSW Chief Scientist and Engineer: Water treatment and coal seam gas (http://data.bioregionalassessments.gov.au/dataset/714e35df-76bb-4a5d-b5d8-0fcc65329dfe). Distance-drawdown was calculated using the de Glee method for steady-state drawdown described in Krusemann (1994) Analysis and Evaluation of Pumping Test Data (http://data.bioregionalassessments.gov.au/dataset/c66b744e-e9bb-4d88-a82c-a6593efe91d2).
Dataset Citation
Bioregional Assessment Programme (2015) Cooper Basin Drawdown Calculations - De Glee method. Bioregional Assessment Derived Dataset. Viewed 27 November 2017, http://data.bioregionalassessments.gov.au/dataset/51323ab4-3613-47eb-acf3-3d89e8b9c062.
Dataset Ancestors
Derived From Discussion paper for Office of NSW Chief Scientist and Engineer: Water treatment and coal seam gas.
Derived From Analysis and Evaluation of Pumping Test Data.
Derived From Onshore co-produced water: extent and management
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset was updated April, 2024. This ownership dataset was generated primarily from CPAD data, which already tracks the majority of ownership information in California. CPAD is utilized without any snapping or clipping to FRA/SRA/LRA. CPAD has some important data gaps, so additional data sources are used to supplement the CPAD data. Currently this includes the most currently available data from BIA, DOD, and FWS. Additional sources may be added in subsequent versions. Decision rules were developed to identify priority layers in areas of overlap.
Starting in 2022, the ownership dataset was compiled using a new methodology. Previous versions attempted to match federal ownership boundaries to the FRA footprint, and used a manual process for checking and tracking Federal ownership changes within the FRA, with CPAD ownership information only being used for SRA and LRA lands. The manual portion of that process was proving difficult to maintain, and the new method (described below) was developed in order to decrease the manual workload, and increase accountability by using an automated process by which any final ownership designation could be traced back to a specific dataset.
The current process for compiling the data sources includes:
* Clipping input datasets to the California boundary
* Filtering the FWS data on the Primary Interest field to exclude lands that are managed by but not owned by FWS (ex: Leases, Easements, etc)
* Supplementing the BIA Pacific Region Surface Trust lands data with the Western Region portion of the LAR dataset which extends into California.
* Filtering the BIA data on the Trust Status field to exclude areas that represent mineral rights only.
* Filtering the CPAD data on the Ownership Level field to exclude areas that are Privately owned (ex: HOAs)
* In the case of overlap, sources were prioritized as follows: FWS > BIA > CPAD > DOD
* As an exception to the above, DOD lands on FRA which overlapped with CPAD lands that were incorrectly coded as non-Federal were treated as an override, such that the DOD designation could win out over CPAD.
In addition to this ownership dataset, a supplemental _source dataset is available which designates the source that was used to determine the ownership in this dataset.
Data Sources:
* GreenInfo Network's California Protected Areas Database (CPAD2023a). https://www.calands.org/cpad/; https://www.calands.org/wp-content/uploads/2023/06/CPAD-2023a-Database-Manual.pdf
* US Fish and Wildlife Service FWSInterest dataset (updated December, 2023). https://gis-fws.opendata.arcgis.com/datasets/9c49bd03b8dc4b9188a8c84062792cff_0/explore
* Department of Defense Military Bases dataset (updated September 2023) https://catalog.data.gov/dataset/military-bases
* Bureau of Indian Affairs, Pacific Region, Surface Trust and Pacific Region Office (PRO) land boundaries data (2023) via John Mosley John.Mosley@bia.gov
* Bureau of Indian Affairs, Land Area Representations (LAR) and BIA Regions datasets (updated Oct 2019) https://biamaps.doi.gov/bogs/datadownload.html
Data Gaps & Changes:
Known gaps include several BOR, ACE and Navy lands which were not included in CPAD nor the DOD MIRTA dataset. Our hope for future versions is to refine the process by pulling in additional data sources to fill in some of those data gaps. Additionally, any feedback received about missing or inaccurate data can be taken back to the appropriate source data where appropriate, so fixes can occur in the source data, instead of just in this dataset.
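A simplified geopandas sketch of the priority rule for overlaps; the file names, layer contents and CRS handling are assumptions, and the DOD-override exception described above is not handled here.

import geopandas as gpd
import pandas as pd

# Hypothetical, already clipped and filtered inputs, ordered by priority.
layers = {
    "FWS": gpd.read_file("fws_interest_filtered.gpkg"),
    "BIA": gpd.read_file("bia_trust_filtered.gpkg"),
    "CPAD": gpd.read_file("cpad_nonprivate.gpkg"),
    "DOD": gpd.read_file("dod_bases.gpkg"),
}

merged = None
for name, gdf in layers.items():  # dict order encodes FWS > BIA > CPAD > DOD
    gdf = gdf.assign(source=name)
    if merged is None:
        merged = gdf
    else:
        # keep only the parts of the lower-priority layer that do not
        # overlap anything already accepted
        remainder = gpd.overlay(gdf, merged[["geometry"]], how="difference")
        merged = pd.concat([merged, remainder], ignore_index=True)

merged.to_file("ownership_draft.gpkg", driver="GPKG")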
24_1: Input datasets this year included numerous changes since the previous version, particularly the CPAD and DOD inputs. Of particular note was the re-addition of Camp Pendleton to the DOD input dataset, which is reflected in this version of the ownership dataset. We were unable to obtain an updated input for tribal data, so the previous input was used for this version.
23_1: A few discrepancies were discovered between data changes that occurred in CPAD when compared with parcel data. These issues will be taken to CPAD for clarification for future updates, but ownership23_1 reflects the data as it was coded in CPAD at the time. In addition, there was a change in the DOD input data between last year and this year, with the removal of Camp Pendleton. An inquiry was sent for clarification on this change, but ownership23_1 reflects the data per the DOD input dataset.
22_1: Represents an initial version of ownership with a new methodology which was developed under a short timeframe. A comparison with previous versions of ownership highlighted some data gaps in the current version. Some of these known gaps include several BOR, ACE and Navy lands which were not included in CPAD nor the DOD MIRTA dataset. Our hope for future versions is to refine the process by pulling in additional data sources to fill in some of those data gaps. In addition, any topological errors (like overlaps or gaps) that exist in the input datasets may carry over to the ownership dataset. Ideally, any feedback received about missing or inaccurate data can be taken back to the relevant source data where appropriate, so fixes can occur in the source data, instead of just in this dataset.
Attribution 2.5 (CC BY 2.5)https://creativecommons.org/licenses/by/2.5/
License information was derived automatically
Abstract The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement. This dataset contains all the scripts used to carry out the uncertainty analysis for the maximum drawdown and time to maximum drawdown at the groundwater receptors in the Hunter bioregion, and all the resulting posterior predictions. This is described in product 2.6.2 Groundwater numerical modelling (Herron et al. 2016). See History for a detailed explanation of the dataset contents.

References: Herron N, Crosbie R, Peeters L, Marvanek S, Ramage A and Wilkins A (2016) Groundwater numerical modelling for the Hunter subregion. Product 2.6.2 for the Hunter subregion from the Northern Sydney Basin Bioregional Assessment. Department of the Environment, Bureau of Meteorology, CSIRO and Geoscience Australia, Australia.

Dataset History This dataset uses the results of the design of experiment runs of the groundwater model of the Hunter subregion to train emulators to (a) constrain the prior parameter ensembles into the posterior parameter ensembles and (b) generate the predictive posterior ensembles of maximum drawdown and time to maximum drawdown. This is described in product 2.6.2 Groundwater numerical modelling (Herron et al. 2016). A flow chart of the way the various files and scripts interact is provided in HUN_GW_UA_Flowchart.png (editable version in HUN_GW_UA_Flowchart.gliffy).

R-script HUN_DoE_Parameters.R creates the set of parameters for the design of experiment in HUN_DoE_Parameters.csv. Each of these parameter combinations is evaluated with the groundwater model (dataset HUN GW Model v01). Associated with this spreadsheet is file HUN_GW_Parameters.csv. This file contains, for each parameter, whether it is included in the sensitivity analysis, whether it is tied to another parameter, the initial value and range, the transformation, and the type of prior distribution with its mean and covariance structure.

The results of the design of experiment model runs are summarised in files HUN_GW_dmax_DoE_Predictions.csv, HUN_GW_tmax_DoE_Predictions.csv, HUN_GW_DoE_Observations.csv and HUN_GW_DoE_mean_BL_BF_hist.csv, which contain the maximum additional drawdown, the time to maximum additional drawdown for each receptor, and the simulated equivalents to observed groundwater levels and SW-GW fluxes, respectively. These are generated with post-processing scripts in dataset HUN GW Model v01 from the output (as exemplified in dataset HUN GW Model simulate ua999 pawsey v01). Spreadsheets HUN_GW_dmax_Predictions.csv and HUN_GW_tmax_Predictions.csv capture additional information on each prediction: the name of the prediction, the transformation, the min, max and median of the design of experiment, a boolean to indicate whether the prediction is to be included in the uncertainty analysis, the layer it is assigned to, and which objective function to use to constrain the prediction.
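As an illustration of the emulator idea described above, the following minimal sketch (not the actual R workflow used in the assessment) fits a regression emulator that maps the design-of-experiment parameter combinations to the simulated maximum drawdown of a single receptor, so that the full groundwater model does not have to be re-run during the uncertainty analysis. The column names and the choice of a Gaussian-process emulator are assumptions made for this example.

```python
# Illustrative emulator sketch; column names and emulator choice are assumed.
import pandas as pd
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

params = pd.read_csv("HUN_DoE_Parameters.csv")          # DoE parameter sets
dmax = pd.read_csv("HUN_GW_dmax_DoE_Predictions.csv")   # simulated dmax per receptor

X = params.drop(columns=["run_id"]).to_numpy()          # "run_id" column assumed
y = dmax["receptor_001"].to_numpy()                     # receptor column name assumed

emulator = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
emulator.fit(X, y)

# The fitted emulator can now predict dmax (with an uncertainty estimate) for
# any new parameter combination drawn from the prior or posterior.
mean, std = emulator.predict(X[:5], return_std=True)
print(mean, std)
```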
Spreadsheet HUN_GW_Observations.csv has additional information on each observation: the name of the observation, a boolean to indicate whether to use the observation, the min and max of the design of experiment, a metadata statement describing the observation, the spatial coordinates, the observed value, and the number of observations at this location (from dataset HUN bores v01). Further, it has the distance of each bore to the nearest blue line network and the distance to each prediction (both in km). Spreadsheet HUN_GW_mean_BL_BF_hist.csv has similar information, but on the SW-GW flux; the observed values are from dataset HUN Groundwater Flowrate Time Series v01.

These files are used in script HUN_GW_SI.py to generate sensitivity indices (based on the Plischke et al. (2013) method) for each group of observations and predictions. These indices are saved in spreadsheets HUN_GW_dmax_SI.csv, HUN_GW_tmax_SI.csv, HUN_GW_hobs_SI.py and HUN_GW_mean_BF_hist_SI.csv.

Script HUN_GW_dmax_ObjFun.py calculates the objective function values for the design of experiment runs. Each prediction has a tailored objective function, which is a weighted sum of the residuals between observations and predictions, with weights based on the distance between observation and prediction. In addition to that, there is an objective function for the baseflow rates. The results are stored in HUN_GW_DoE_ObjFun.csv and HUN_GW_ObjFun.csv.

The latter files are used in script HUN_GW_dmax_CreatePosteriorParameters.R to carry out the Monte Carlo sampling of the prior parameter distributions with the Approximate Bayesian Computation methodology, as described in Herron et al. (2016), by generating and applying emulators for each objective function. The scripts use the scripts in dataset R-scripts for uncertainty analysis v01 and are run on the high performance computation cluster machines with batch file HUN_GW_dmax_CreatePosterior.slurm. These scripts result in posterior parameter combinations for each objective function, stored in directory PosteriorParameters, with filename convention HUN_GW_dmax_Posterior_Parameters_OO_$OFName$.csv, where $OFName$ is the name of the objective function. Python script HUN_GW_PosteriorParameters_Percentiles.py summarises these posterior parameter combinations and stores the results in HUN_GW_PosteriorParameters_Percentiles.csv. The same set of spreadsheets is used to test convergence of the emulator performance with script HUN_GW_emulator_convergence.R and batch file HUN_GW_emulator_convergence.slurm, producing spreadsheet HUN_GW_convergence_objfun_BF.csv.

The posterior parameter distributions are sampled with script HUN_GW_dmax_tmax_MCsampler.R and the associated .slurm batch file. The script creates and applies an emulator for each prediction. The emulators and results are stored in directory Emulators. This directory is not part of this dataset but can be regenerated by running the scripts on the high performance computation clusters; a single emulator and associated output is included for illustrative purposes. Script HUN_GW_collate_predictions.csv collates all posterior predictive distributions in spreadsheets HUN_GW_dmax_PosteriorPredictions.csv and HUN_GW_tmax_PosteriorPredictions.csv. These files are further summarised in spreadsheet HUN_GW_dmax_tmax_excprob.csv with script HUN_GW_exc_prob.
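For readers unfamiliar with the construction, the sketch below illustrates a distance-weighted objective function of the kind described above: a weighted sum of residuals between observed and simulated values, with weights that decrease with the distance between each observation bore and the prediction of interest. The inverse-distance form used here is an assumption; the exact weighting scheme of HUN_GW_dmax_ObjFun.py is not reproduced.

```python
# Illustrative distance-weighted objective function (assumed weighting form).
import numpy as np

def objective_function(observed, simulated, distances_km, eps=0.1):
    """Weighted sum of absolute residuals; nearer observations weigh more."""
    weights = 1.0 / (distances_km + eps)      # assumed inverse-distance weights
    weights = weights / weights.sum()
    return float(np.sum(weights * np.abs(observed - simulated)))

# Example with made-up numbers: three bores at 2, 10 and 45 km from a receptor.
obs = np.array([120.3, 98.7, 101.2])
sim = np.array([121.0, 97.9, 104.0])
dist = np.array([2.0, 10.0, 45.0])
print(objective_function(obs, sim, dist))
```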
This spreadsheet contains, for all predictions, the coordinates, layer, number of samples in the posterior parameter distribution, the 5th, 50th and 95th percentile of dmax and tmax, the probability of exceeding 1 cm and 20 cm drawdown, the maximum dmax value from the design of experiment, the threshold of the objective function, and the acceptance rate. The script HUN_GW_dmax_tmax_MCsampler.R is also used to evaluate parameter distributions HUN_GW_dmax_Posterior_Parameters_HUN_OF_probe439.csv and HUN_GW_dmax_Posterior_Parameters_Mackie_OF_probe439.csv. These are, for one prediction, two different parameter distributions, of which the latter represents local information. The corresponding dmax values are stored in HUN_GW_dmax_probe439_HUN.csv and HUN_GW_dmax_probe439_Mackie.csv.

Dataset Citation Bioregional Assessment Programme (XXXX) HUN GW Uncertainty Analysis v01. Bioregional Assessment Derived Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/c25db039-5082-4dd6-bb9d-de7c37f6949a.

Dataset Ancestors
Derived From HUN GW Model code v01
Derived From Hydstra Groundwater Measurement Update - NSW Office of Water, Nov2013
Derived From Groundwater Economic Elements Hunter NSW 20150520 PersRem v02
Derived From NSW Office of Water - National Groundwater Information System 20140701
Derived From Travelling Stock Route Conservation Values
Derived From HUN GW Model v01
Derived From NSW Wetlands
Derived From Climate Change Corridors Coastal North East NSW
Derived From Communities of National Environmental Significance Database - RESTRICTED - Metadata only
Derived From Climate Change Corridors for Nandewar and New England Tablelands
Derived From National Groundwater Dependent Ecosystems (GDE) Atlas
Derived From Fauna Corridors for North East NSW
Derived From R-scripts for uncertainty analysis v01
Derived From Asset database for the Hunter subregion on 27 August 2015
Derived From Hunter CMA GDEs (DRAFT DPI pre-release)
Derived From Estuarine Macrophytes of Hunter Subregion NSW DPI Hunter 2004
Derived From Birds Australia - Important Bird Areas (IBA) 2009
Derived From Camerons Gorge Grassy White Box Endangered Ecological Community (EEC) 2008
Derived From Asset database for the Hunter subregion on 16 June 2015
Derived From Spatial Threatened Species and Communities (TESC) NSW 20131129
Derived From Gippsland Project boundary
Derived From Bioregional Assessment areas v04
Derived From Asset database for the Hunter subregion on 24 February 2016
Derived From Natural Resource Management (NRM) Regions 2010
Derived From Gosford Council Endangered Ecological Communities (Umina woodlands) EEC3906
Derived From NSW Office of Water Surface Water Offtakes - Hunter v1 24102013
Derived From National Groundwater Dependent Ecosystems (GDE) Atlas (including WA)
Derived From Bioregional Assessment areas v03
Derived From HUN groundwater flow rate time series v01
Derived From Asset list for Hunter - CURRENT
Derived From NSW Office of Water Surface Water Entitlements Locations v1_Oct2013
Derived From Species Profile and Threats Database (SPRAT) - Australia - Species of National Environmental Significance Database (BA subset - RESTRICTED - Metadata only)
Derived From HUN GW Model simulate ua999 pawsey v01
Derived From Northern Rivers CMA GDEs (DRAFT DPI
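As a side note on the exceedance summaries described above (the 5th, 50th and 95th percentiles of dmax and the probabilities of exceeding 1 cm and 20 cm of drawdown), these quantities can be derived from posterior predictive samples in a few lines. The sketch below assumes a column name and that dmax is stored in metres; it is illustrative only and not part of the dataset's scripts.

```python
# Illustrative summary of posterior predictive dmax samples (column name assumed).
import numpy as np
import pandas as pd

post = pd.read_csv("HUN_GW_dmax_PosteriorPredictions.csv")
samples = post["probe439"].to_numpy()      # posterior dmax samples for one receptor

p5, p50, p95 = np.percentile(samples, [5, 50, 95])
prob_gt_1cm = float(np.mean(samples > 0.01))   # dmax assumed to be in metres
prob_gt_20cm = float(np.mean(samples > 0.20))
print(p5, p50, p95, prob_gt_1cm, prob_gt_20cm)
```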
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract This dataset was created within the Bioregional Assessment Programme. Data has not been derived from any source datasets. Metadata has been compiled by the Bioregional Assessment Programme. In this dataset we describe the application of Impact Modes and Effects Analysis (IMEA) to the hazards associated with coal seam gas and coal mining operations in the Gloucester subregion. Attention is restricted to water-mediated hazards, i.e. hazards that might lead directly or indirectly to impacts on groundwater or surface water, and the assets that depend on them. All other hazards, for example effects on air quality, are explicitly excluded. Full details of the hazard analysis process are described in "M11: Systematic analysis of water-related hazards associated with coal resource development", available at http://bioregionalassessments.gov.au/methods/submethodologies.

Dataset History Full details of the hazard analysis process are described in M11: Systematic analysis of water-related hazards associated with coal resource development, available at http://bioregionalassessments.gov.au/methods/submethodologies.

Dataset Citation Bioregional Assessment Programme (2015) Impact Modes and Effects Analysis for the GLO subregion. Bioregional Assessment Source Dataset. Viewed 18 July 2018, http://data.bioregionalassessments.gov.au/dataset/52adbf75-b695-49fe-9d7b-b34ded9feb3a.
This version 2.0 MUSICA IASI / RemoTeC TROPOMI fused methane data set contains total (ground – top of atmosphere), tropospheric (ground – about 6 km a.s.l.), and UTLS (upper tropospheric/lower stratospheric, about 6 – 20 km a.s.l.) column-averaged dry-air mole fractions of methane (CH4). The data are obtained by combining the level 2 CH4 profiles and XCH4 total columns (generated from the IASI TIR spectra and the TROPOMI NIR/SWIR spectra, respectively). The level 2 CH4 profiles were generated by the MUSICA processor (version 3.3.0) and the level 2 XCH4 total columns by the RemoTeC processor (operational processing algorithm version 2.3.1; this version includes data over ocean in glint mode). The combination is realized by means of a Kalman filter that uses the MUSICA IASI data as the background and the TROPOMI data as the new observation. Details of the combination method, the IASI and TROPOMI collocation requirements, and the data quality are described in Schneider et al. (2022, https://doi.org/10.5194/amt-15-4339-2022). The data cover an example period for northern hemispheric winter and summer conditions (01 January – 30 January 2020 and 01 July – 30 July 2020, respectively). The only difference between this version 2.0 and version 1.0 of the fused MUSICA IASI / RemoTeC TROPOMI data set (accessible at https://doi.org/10.35097/689) is the use of TROPOMI RemoTeC operational processing version 2.3.1 (instead of version 2.2.0), which among other things offers additional data availability over ocean. The underlying instruments are Metop IASI and Sentinel-5 Precursor TROPOMI. The data fusion method is described in detail in Schneider et al. (2022, https://doi.org/10.5194/amt-15-4339-2022). The RemoTeC TROPOMI XCH4 data used (operational processing algorithm version 2.3.1) are described in Lorente et al. (2022, https://doi.org/10.5194/amt-2022-197). The MUSICA IASI CH4 profile data used (processing version 3.3.0) are described in Schneider et al. (2022, https://doi.org/10.5194/essd-14-709-2022).
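A minimal sketch of the Kalman-filter combination described above is given below, with the MUSICA IASI CH4 state as the background and a collocated TROPOMI XCH4 value as the new observation. The state dimension, covariances and observation operator are toy values chosen purely for illustration and do not reflect the MUSICA/RemoTeC production setup; see Schneider et al. (2022) for the actual method.

```python
# Toy Kalman-filter update illustrating the background/observation fusion;
# all numbers below are invented for illustration only.
import numpy as np

x_b = np.array([1850.0, 1900.0])        # background: [tropospheric, UTLS] CH4 in ppb (toy)
B = np.diag([30.0**2, 40.0**2])         # background error covariance (assumed)

y = np.array([1880.0])                  # TROPOMI XCH4 total-column observation (toy)
R = np.array([[15.0**2]])               # observation error covariance (assumed)
H = np.array([[0.6, 0.4]])              # observation operator: column-averaging weights (assumed)

# Kalman update: x_a = x_b + K (y - H x_b),  K = B H^T (H B H^T + R)^-1
K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
x_a = x_b + K @ (y - H @ x_b)
A = (np.eye(2) - K @ H) @ B             # analysis error covariance

print(x_a, np.sqrt(np.diag(A)))
```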
Data for "Image-based Backbone Reconstruction for Non-Slender Soft Robots" This dataset provides the data for the forthcoming paper "Image-based Backbone Reconstruction for Non-Slender Soft Robots". The backbone reconstruction method used is based on the method described in Hoffmann et al. 1. The modifications to this method to support the non-slender soft robot in this dataset are described in the forthcoming paper mentioned above. This dataset holds raw images of pressurized and elongated soft robots and the corresponding reconstructed backbones.
Dataset The dataset is split into two subsets with similar structure. The first subset is contained in dataset_01. The second subset is contained in dataset_02.
Each subset consists of five folders and one schedule file. The schedule file schedule.csv contains the index of the schedule entry, the angle $\alpha$ in degrees, the pressure of each chamber $p_1$ to $p_3$ in bar, and whether the pressurization is active (a small loading sketch is given after the folder descriptions below). Furthermore, the five folders of each subset can be described as follows:
raw: Contains the raw cropped images. The filenames are formatted as CROPPED_C{CAMERA_INDEX}_E{SCHEDULE_ENTRY}.png with the camera index CAMERA_INDEX and the schedule entry SCHEDULE_ENTRY.
constant_curvature_slender, constant_curvature_volumetric, cubic_curvature_slender and cubic_curvature_volumetric: These folders contain the reconstructed backbones based on the raw data from the raw folder. A different reconstruction approach was used in each of these folders:
constant_curvature_slender - A constant curvature backbone kinematic based on the slender model,
constant_curvature_volumetric - A constant curvature backbone kinematic based on the volumetric model,
cubic_curvature_slender - A cubic curvature backbone kinematic based on the slender model,
cubic_curvature_volumetric - A cubic curvature backbone kinematic based on the volumetric model.
Each of these folders contains a data and a figures folder. The data folder consists of PARAMETER_E{SCHEDULE_ENTRY}.json files listing the optimization parameters for each schedule entry SCHEDULE_ENTRY in JSON format. The figures folder contains annotated images of the reconstructed backbone overlaid on the cropped raw images. The filenames are structured as ANNOTATED_E{SCHEDULE_ENTRY}_C{CAMERA_INDEX}_EPOCH{EPOCH}.png with the schedule entry SCHEDULE_ENTRY, the camera index CAMERA_INDEX and the epoch EPOCH of the optimization algorithm.
The optimization parameters include the base position base_position of the reconstructed backbone in world coordinates, the coefficients for the curvature polynomials ux and uy, and the constant coefficient for the elongation polynomial la.
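The following minimal sketch shows how one subset of this dataset could be loaded: the schedule file and the optimization parameters of a single schedule entry. The folder layout and JSON keys follow the description above; the CSV column layout is an assumption.

```python
# Minimal loading sketch; folder layout and JSON keys follow the description
# above, CSV column names are assumed.
import json
import pandas as pd

subset = "dataset_01"
entry = 3  # SCHEDULE_ENTRY

# Schedule: entry index, angle alpha [deg], chamber pressures p1..p3 [bar], active flag
schedule = pd.read_csv(f"{subset}/schedule.csv")
print(schedule.head())

# Reconstructed backbone parameters for one approach and one schedule entry
with open(f"{subset}/cubic_curvature_volumetric/data/PARAMETER_E{entry}.json") as f:
    params = json.load(f)

base_position = params["base_position"]   # backbone base in world coordinates
ux, uy = params["ux"], params["uy"]       # curvature polynomial coefficients
la = params["la"]                         # constant elongation coefficient
print(base_position, ux, uy, la)
```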
Calibration Data The calibration data is located in the calibration folder and consists of multiple .npy files in the numpy format. The corresponding camera index for the calibrated camera is abbreviated with CAMERA_INDEX in the following:
C{CAMERA_INDEX}.npy - Stores the reprojection error, camera matrix, distortion coefficients, rotation, and translation vectors as returned by the cv2.calibrateCamera [2] method.
C{CAMERA_INDEX}_camera_matrix.npy - Stores the camera_matrix as returned by the cv2.calibrateCamera [2] method.
C{CAMERA_INDEX}_distortion_coefficients.npy - Stores the distortion coefficients as returned by the cv2.calibrateCamera [2] method.
C{CAMERA_INDEX}_projection_matrix.npy - Stores the projection matrix from world space to pixel space based on the stereo camera calibration.
STEREO.npy - Stores the reprojection error, R, T, E, F as returned by the cv2.stereoCalibrate [2] method as an object datatype.
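As a usage example, the sketch below loads one camera's calibration files and maps a 3D world point (for instance a reconstructed backbone point) to pixel coordinates via the stored projection matrix. The world point is a made-up example.

```python
# Sketch for using the calibration files; the world point is invented.
import numpy as np

cam = 0  # CAMERA_INDEX
P = np.load(f"calibration/C{cam}_projection_matrix.npy")         # 3x4 world-to-pixel matrix
K = np.load(f"calibration/C{cam}_camera_matrix.npy")             # intrinsics
dist = np.load(f"calibration/C{cam}_distortion_coefficients.npy")
stereo = np.load("calibration/STEREO.npy", allow_pickle=True)    # object datatype

X_world = np.array([0.05, 0.02, 0.30, 1.0])   # homogeneous world point [m] (example)
uvw = P @ X_world
u, v = uvw[:2] / uvw[2]                        # pixel coordinates
print(u, v)
```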
Acknowledgement Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – 501861263 – SPP2353
References
[1] M. K. Hoffmann, J. Mühlenhoff, Z. Ding, T. Sattel and K. Flaßkamp. An iterative closest point algorithm for marker-free 3D shape registration of continuum robots. arXiv. https://arxiv.org/abs/2405.15336
[2] OpenCV. Camera Calibration and 3D Reconstruction. OpenCV Documentation. https://docs.opencv.org/4.x/d9/d0c/group_calib3d.html, accessed May 27, 2024.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically