30 datasets found
  1. Supplementary material from "Visual comparison of two data sets: Do people...

    • figshare.com
    xlsx
    Updated Mar 14, 2017
    Cite
    Robin Kramer; Caitlin Telfer; Alice Towler (2017). Supplementary material from "Visual comparison of two data sets: Do people use the means and the variability?" [Dataset]. http://doi.org/10.6084/m9.figshare.4751095.v1
    Explore at:
Available download formats: xlsx
    Dataset updated
    Mar 14, 2017
    Dataset provided by
figshare (http://figshare.com/)
    Authors
    Robin Kramer; Caitlin Telfer; Alice Towler
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In our everyday lives, we are required to make decisions based upon our statistical intuitions. Often, these involve the comparison of two groups, such as luxury versus family cars and their suitability. Research has shown that the mean difference affects judgements where two sets of data are compared, but the variability of the data has only a minor influence, if any at all. However, prior research has tended to present raw data as simple lists of values. Here, we investigated whether displaying data visually, in the form of parallel dot plots, would lead viewers to incorporate variability information. In Experiment 1, we asked a large sample of people to compare two fictional groups (children who drank ‘Brain Juice’ versus water) in a one-shot design, where only a single comparison was made. Our results confirmed that only the mean difference between the groups predicted subsequent judgements of how much they differed, in line with previous work using lists of numbers. In Experiment 2, we asked each participant to make multiple comparisons, with both the mean difference and the pooled standard deviation varying across data sets they were shown. Here, we found that both sources of information were correctly incorporated when making responses. Taken together, we suggest that increasing the salience of variability information, through manipulating this factor across items seen, encourages viewers to consider this in their judgements. Such findings may have useful applications for best practices when teaching difficult concepts like sampling variation.
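The two quantities the experiments manipulated, the mean difference between groups and their pooled standard deviation, can be computed as in this minimal sketch (the sample values are made up for illustration, not taken from the study's data):

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def pooled_sd(a, b):
    """Pooled standard deviation of two samples (Bessel-corrected variances)."""
    va = sum((x - mean(a)) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mean(b)) ** 2 for x in b) / (len(b) - 1)
    return math.sqrt(((len(a) - 1) * va + (len(b) - 1) * vb)
                     / (len(a) + len(b) - 2))

# Hypothetical scores for the two fictional groups:
juice = [104, 110, 98, 107, 101]
water = [99, 102, 95, 100, 94]

mean_diff = mean(juice) - mean(water)  # the cue used in Experiment 1
sd_pool = pooled_sd(juice, water)      # the cue that varied in Experiment 2
```

With these toy values the mean difference is 6.0 and the pooled SD is about 4.12; varying both across items is what made viewers attend to variability in Experiment 2.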

  2. AIT Alert Data Set

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 14, 2024
    Cite
    Landauer, Max (2024). AIT Alert Data Set [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8263180
    Explore at:
    Dataset updated
    Oct 14, 2024
    Dataset provided by
    Skopik, Florian
    Wurzenberger, Markus
    Landauer, Max
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

This repository contains the AIT Alert Data Set (AIT-ADS), a collection of synthetic alerts suitable for evaluation of alert aggregation, alert correlation, alert filtering, and attack graph generation approaches. The alerts were forensically generated from the AIT Log Data Set V2 (AIT-LDSv2) and originate from three intrusion detection systems, namely Suricata, Wazuh, and AMiner. The data sets comprise eight scenarios, each of which has been targeted by a multi-step attack with attack steps such as scans, web application exploits, password cracking, remote command execution, privilege escalation, etc. Each scenario and attack chain has certain variations so that attack manifestations and resulting alert sequences vary in each scenario; this means that the data set makes it possible to develop and evaluate approaches that compute similarities of attack chains or merge them into meta-alerts. Since only a few benchmark alert data sets are publicly available, the AIT-ADS was developed to address common issues in the research domain of multi-step attack analysis; specifically, the alert data set contains many false positives caused by normal user behavior (e.g., user login attempts or software updates), heterogeneous alert formats (although all alerts are in JSON format, their fields differ for each IDS), repeated executions of attacks according to an attack plan, collection of alerts from diverse log sources (application logs and network traffic) and all components in the network (mail server, web server, DNS, firewall, file share, etc.), and labels for attack phases. For more information on how this alert data set was generated, check out our paper accompanying this data set [1] or our GitHub repository. More information on the original log data set, including a detailed description of scenarios and attacks, can be found in [2].

    The alert data set contains two files for each of the eight scenarios, and a file for their labels:

    _aminer.json contains alerts from AMiner IDS

    _wazuh.json contains alerts from Wazuh IDS and Suricata IDS

    labels.csv contains the start and end times of attack phases in each scenario
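A minimal sketch of reading these files, assuming the alert files hold one JSON object per line (JSON lines) and that labels.csv has a header row; both the format and the column names shown are assumptions to adjust against the real files:

```python
import csv
import io
import json

def load_alerts(fileobj):
    """Parse an alert file, assuming one JSON object per line."""
    return [json.loads(line) for line in fileobj if line.strip()]

def load_labels(fileobj):
    """Parse labels.csv rows; the column names used here are hypothetical."""
    return list(csv.DictReader(fileobj))

# Inline samples standing in for e.g. fox_aminer.json and labels.csv:
alerts = load_alerts(io.StringIO('{"ids": "aminer", "msg": "scan"}\n'))
labels = load_labels(io.StringIO("scenario,phase,start,end\nfox,scan,t0,t1\n"))
```

If the real files turn out to hold a single JSON array instead of JSON lines, `json.load(fileobj)` replaces the per-line parsing.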

Besides false-positive alerts, the alerts in the AIT-ADS correspond to the following attacks:

    Scans (nmap, WPScan, dirb)

    Webshell upload (CVE-2020-24186)

    Password cracking (John the Ripper)

    Privilege escalation

    Remote command execution

    Data exfiltration (DNSteal) and stopped service

The total number of alerts in the data set is 2,655,821, of which 2,293,628 originate from Wazuh, 306,635 originate from Suricata, and 55,558 originate from AMiner. The numbers of alerts in each scenario are as follows. fox: 473,104; harrison: 593,948; russellmitchell: 45,544; santos: 130,779; shaw: 70,782; wardbeck: 91,257; wheeler: 616,161; wilson: 634,246.

    Acknowledgements: Partially funded by the European Defence Fund (EDF) projects AInception (101103385) and NEWSROOM (101121403), and the FFG project PRESENT (FO999899544). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. The European Union cannot be held responsible for them.

    If you use the AIT-ADS, please cite the following publications:

    [1] Landauer, M., Skopik, F., Wurzenberger, M. (2024): Introducing a New Alert Data Set for Multi-Step Attack Analysis. Proceedings of the 17th Cyber Security Experimentation and Test Workshop. [PDF]

    [2] Landauer M., Skopik F., Frank M., Hotwagner W., Wurzenberger M., Rauber A. (2023): Maintainable Log Datasets for Evaluation of Intrusion Detection Systems. IEEE Transactions on Dependable and Secure Computing, vol. 20, no. 4, pp. 3466-3482. [PDF]

  3. CSDCIOP Structure Points

    • maine.hub.arcgis.com
    Updated Feb 26, 2020
    Cite
    State of Maine (2020). CSDCIOP Structure Points [Dataset]. https://maine.hub.arcgis.com/maps/maine::csdciop-structure-points
    Explore at:
    Dataset updated
    Feb 26, 2020
    Dataset authored and provided by
    State of Maine
    Area covered
    Description

Feature class that compares the elevations of seawall crests (extracted from available LiDAR datasets from 2010 and 2013) with published FEMA Base Flood Elevations (BFEs) from preliminary FEMA DFIRMs (panels issued in 2018 and 2019) in coastal York and Cumberland counties (up through Willard Beach in South Portland). The dataset included the development of an inventory of coastal armor structures from a range of different datasets; the resulting feature classes are named in the steps below. Steps to create the dataset included:
1. Shoreline structures from the most recent NOAA EVI LANDWARD_SHORETYPE feature class were extracted using the boundaries of York and Cumberland counties. This included 1B: Exposed, Solid Man-Made Structures; 8B: Sheltered, Solid Man-Made Structures; 6B: Riprap; and 8C: Sheltered Riprap. This resulted in the creation of Cumberland_ESIL_Structures and York_ESIL_Structures. Note that ESIL uses the MHW line as the feature base.
2. Shoreline structures from the work by Rice (2015) were extracted using the York and Cumberland county boundaries, resulting in Cumberland_Rice_Structures and York_Rice_Structures.
3. Additional feature classes (Slovinsky_York_Structures and Slovinsky_Cumberland_Structures) were created for York and Cumberland county structures that were missed. Google Earth imagery was inspected while additional structures were added to the GIS. 2012 York and Cumberland County imagery was used as the basemap, and structures were classified as bulkheads, riprap, or dunes (if known). Whether or not the structure was in contact with the 2015 HAT was also noted.
4. MEDEP was consulted to determine which permit data (both PBR and Individual Permit, IP, data) could be used to help determine where shoreline stabilization projects may have been conducted adjacent to or on coastal bluffs. A file was received for IP data and brought into GIS (DEP_Licensing_Points); this is a point file for shoreline stabilization permits under NRPA.
5. Clip GISVIEW.MEDEP.Permit_By_Rule_Locations to the boundaries of the study area and output DEP_PBR_Points.
6. Join GISVIEW.sde>GISVIEW.MEDEP.PBR_ACTIVITY to DEP_PBR_Points using the PBR_ID field, then export this file as DEP_PBR_Points2. Using the new ACTIVITY_DESC field, select only those activities that relate to shoreline stabilization projects: PBR_ACTIVITY 02 (Act. Adjacent to a Protected Natural Resource), 04 (Maint Repair & Replacement of Structure), and 08 (Shoreline Stabilization). Select by Attributes > PBR_ACTIVITY IN (‘02’, ‘04’, ‘08’) and export the selected data as DEP_PBR_Points3. Then delete versions 1 and 2, and rename the final product DEP_PBR_Points.
7. Visually inspect the Licensing and PBR files in ArcMap using 2012 and 2013 imagery, along with Google Earth imagery, to determine the extents of armoring along the shoreline.
8. Using EVI and Rice data as indicators, manually inspect and digitize armored sections of the coastline. Classify the seaward shoreline type (beach, mudflat, channel, dune, etc.) and the armor type (wall or bulkhead). Bring in the HAT line and, using that and visual indicators, identify whether or not the armored sections are in contact with HAT. Use Google Earth at the same time as digitizing in order to help constrain areas. Merge digitized armoring into Cumberland_York_Merged.
9. Bring in the preliminary FEMA DFIRM data and use Intersect to assign the different flood zones and elevations to the digitized armored sections, first for Cumberland, then for York County. Delete ancillary attributes as needed. The resulting layers are Cumberland_Structure_FloodZones and York_Structure_FloodZones.
10. Go to NOAA Digital Coast Data Layers and download the newest LiDAR data for York and Cumberland county beach, dune, and just-inland areas. This includes 2006 and newer topobathy data available from 2010 (entire coast) and selected areas from 2013 and 2014 (Wells, Scarborough, Kennebunk).
11. Mosaic the 2006, 2010, 2013 and 2014 data (with the 2013 and 2014 data laying on top of the 2010 data), then mosaic this dataset into the sacobaydem_ftNAVD raster (from the MEGIS bare-earth model). This covers almost all of the study area except for armor along several areas in York, resulting in LidAR206_2010_2013_Mosaic.tif.
12. Using the LiDAR data as a proxy, create a “seaward crest” line feature class (Dune_Crest) which follows along the coast and extracts the approximate highest point (cliff, bank, dune) along the shoreline. This is used to extract LiDAR data and compare with preliminary flood zone information.
13. Using the added tool Points Along Line, create points at 5 m spacing along each of the armored shoreline feature lines and the dune crest lines. Call the outputs PointsonLines and PointsonDunes.
14. Using Spatial Analyst, extract LiDAR elevations to the points using the 2006_2010_2013 mosaic first (LidarPointsonLines1). Select those points which have NULL values and export them as LiDARPointsonLines2, then rerun Extract Values to Points using just the selected data and the state MEGIS DEM. Convert RASTERVALU to feet by multiplying by 3.2808 (renamed Elev_ft). Select by Attributes, find all NULL values, and delete them from LiDARPointsonLines in an edit session. Then merge the two datasets as LidarPointsonLines. Do the same with the dune lines to create LidarPointsonDunes.
15. Use the Cumberland and York flood zone layers to intersect the points with the appropriate flood zone data, creating ….CumbFIRM and …YorkFIRM files for the dunes and lines.
16. Select those points from the dunes feature class that are within the X zone; these will NOT have an associated BFE for comparison with the LiDAR data. Export the dune points as Cumberland_York_Dunes_XZone. Run Near against the merged flood zone feature class (with only V, AE, and AO zones selected), then join the flood zone data to the feature class using FID (from the feature class) and OBJECTID (from the flood zone feature class). Export as Cumberland_York_Dunes_XZone_Flood. Delete ancillary columns of data, leaving the original FLD_ZONE (X), Elev_ft, NEAR_DIST (distance, in m, to the nearest flood zone), FLD_ZONE_1 (the nearest flood zone), and STATIC_BFE_1 (the nearest static BFE).
17. Do the same as above with the structures file (Cumberland_York_Structures_Lidar_DFIRM_Merged), but also select those features that are within the X zone and OPEN WATER. Export the points as Cumberland_York_Structures_XZone. Again run Near using the merged flood zone with only AE, VE, and AO zones selected, and export the file as Cumberland_York_Structures_XZone_Flood.
18. Merge the above feature classes with the original feature classes and add a field BFE_ELEV_COMPARE. Select all features with a VE or AE flood zone and use the field calculator to calculate the difference between Elev_ft and the BFE (subtracting STATIC_BFE from Elev_ft). Positive values mean the maximum wall elevation is higher than the BFE, while negative values mean it is below the BFE. Then switch the selection and calculate the same value using NEAR_STATIC_BFE instead. Select by Attributes > FLD_ZONE = AO and use the DEPTH value to enter into the above created fields as negative values. Delete ancillary attribute fields, leaving those listed in the _FINAL feature classes described above the process steps section.
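The final field-calculator logic can be summarized in a few lines. This is a hypothetical reimplementation for illustration, not the original GIS tooling; the function and argument names are invented here:

```python
def bfe_elev_compare(fld_zone, elev_ft, static_bfe=None,
                     near_static_bfe=None, depth=None):
    """Crest elevation minus the applicable BFE, per the steps above.
    Positive: crest above the BFE; negative: crest below it."""
    if fld_zone in ("VE", "AE") and static_bfe is not None:
        return elev_ft - static_bfe
    if fld_zone == "AO" and depth is not None:
        return -depth  # AO zones carry a flood depth, entered as a negative value
    if near_static_bfe is not None:
        return elev_ft - near_static_bfe  # X-zone points use the nearest BFE
    return None

# The metres-to-feet conversion applied to the LiDAR RASTERVALU field:
elev_ft = 3.0 * 3.2808
```

For example, a seawall crest at 12.0 ft in a VE zone with a 10.0 ft BFE yields +2.0 (crest above the BFE).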

  4. Clust_100_GE_datasets

    • zenodo.org
    pdf, zip
    Updated Aug 2, 2024
    Cite
    Basel Abu-Jamous; Basel Abu-Jamous; Steven Kelly; Steven Kelly (2024). Clust_100_GE_datasets [Dataset]. http://doi.org/10.5281/zenodo.1169191
    Explore at:
Available download formats: zip, pdf
    Dataset updated
    Aug 2, 2024
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
    Basel Abu-Jamous; Basel Abu-Jamous; Steven Kelly; Steven Kelly
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    100 microarray and RNA-seq gene expression datasets from five model species (human, mouse, fruit fly, arabidopsis plants, and baker's yeast). These datasets represent the benchmark set that was used to test our clust clustering method and to compare it with five widely used clustering methods (MCL, k-means, hierarchical clustering, WGCNA, and self-organising maps). This data resource includes raw data files, pre-processed data files, clustering results, clustering results evaluation, and scripts.

    The files are split into three zipped parts, 100Datasets_part_1.zip, 100Datasets_part_2.zip, and 100Datasets_part_3.zip. The contents of the three zipped files should be extracted to a single folder (e.g. 100Datasets).

    Below is a thorough description of the files and folders in this data resource.

    Scripts

    The scripts used to apply each one of the clustering methods to each one of the 100 datasets and to evaluate their results are all included in the folder (scripts/).

    Datasets and clustering results (folders starting with D)

    The datasets are labelled as D001 to D100. Each dataset has two folders: D###/ and D###_Res/, where ### is the number of the dataset. The first folder only includes the raw dataset while the second folder includes the results of applying the clustering methods to that dataset. The files ending with _B.tsv include clustering results in the form of a partition matrix. The files ending with _E include metrics evaluating the clustering results. The files ending with _go and _go_E respectively include the enriched GO terms in the clustering results and evaluation metrics of these GO terms.
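The D###/D###_Res naming makes the layout easy to enumerate. A minimal sketch, where the "<method>_B.tsv" file-name pattern is an assumption for illustration (the real result files may prefix names differently):

```python
from pathlib import Path

def result_files(root, method):
    """Yield the expected partition-matrix path for one clustering method
    across all 100 datasets, following the D###/D###_Res layout above."""
    for i in range(1, 101):
        yield Path(root) / f"D{i:03d}_Res" / f"{method}_B.tsv"

paths = list(result_files("100Datasets", "clust"))
```

The same pattern with a `_E` or `_go` suffix would enumerate the evaluation and GO-term files.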

    Simultaneous analysis of multiple datasets (folders starting with MD)

As our clust method is designed to extract clusters from multiple datasets simultaneously, we also tested it over multiple datasets. All folders starting with MD_ are related to "multiple datasets (MD)" results. Each MD experiment simultaneously analyses d randomly selected datasets, either out of a set of 10 arabidopsis datasets or out of a set of 10 yeast datasets. For each of the two species, all d values from 2 to 10 were tested, and at each of these d values, 10 different runs were conducted, where at each run a different subset of d datasets was selected randomly.

The folders MD_10A and MD_10Y include the full sets of 10 arabidopsis or 10 yeast datasets, respectively. Each folder with the format MD_10#_d#_Res## includes the results of applying the six clustering methods at one of the 10 random runs of one of the selected d values. For example, the "MD_10A_d4_Res03/" folder includes the clustering results of the 3rd random selection of 4 arabidopsis datasets (the letter A in the folder's name refers to arabidopsis).

    Our clust method is applied directly over multiple datasets where each dataset is in a separate data file. Each "MD_10#_d#_Res##" folder includes these individual files in a sub-folder named "Processed_Data/". However, the other clustering methods only accept a single input data file. Therefore, the datasets are merged first before being submitted to these methods. Each "MD_10#_d#_Res##" folder includes a file "X_merged.tsv" for the merged data.

    Evaluation metrics (folders starting with Metrics)

    Each clustering results folder (D##_Res or MD_10#_d#_Res##) includes some clustering evaluation files ending with _E. This information is combined into tables for all datasets, and these tables appear in the folders starting with "Metrics_".

    Other files and folders

The GO folder includes the reference GO term annotations for arabidopsis and yeast. The Datasets file includes a TAB-delimited table describing the 100 datasets. The SearchCriterion file includes the objective methodology of searching the NCBI database to select these 100 datasets. The Specials file includes some special considerations for a couple of datasets that differ slightly from what is described in the SearchCriterion file. The Norm### files and the files in the Reps/ folder describe normalisation codes and replicate structures for the datasets and were fed to the clust method as inputs. The Plots/ folder includes plots of the gene expression profiles of the individual genes in the clusters generated by each one of the 6 methods over each one of the 100 datasets. Only up to 14 clusters per method are plotted.

  5. AIRS/Aqua L3 Monthly Quantization in Physical Units (AIRS-only) 5 degrees x...

    • catalog.data.gov
    • datasets.ai
• +3 more
    Updated Jul 11, 2025
    + more versions
    Cite
    NASA/GSFC/SED/ESD/TISL/GESDISC (2025). AIRS/Aqua L3 Monthly Quantization in Physical Units (AIRS-only) 5 degrees x 5 degrees V006 (AIRS3QPM) at GES DISC [Dataset]. https://catalog.data.gov/dataset/airs-aqua-l3-monthly-quantization-in-physical-units-airs-only-5-degrees-x-5-degrees-v006-a-22c35
    Explore at:
    Dataset updated
    Jul 11, 2025
    Dataset provided by
NASA (http://nasa.gov/)
    Description

The Atmospheric Infrared Sounder (AIRS) is a grating spectrometer (R = 1200) aboard the second Earth Observing System (EOS) polar-orbiting platform, EOS Aqua. In combination with the Advanced Microwave Sounding Unit (AMSU) and the Humidity Sounder for Brazil (HSB), AIRS constitutes an innovative atmospheric sounding group of visible, infrared, and microwave sensors. The AIRS/Aqua Level 3 monthly quantization product is in physical units (AIRS only). The quantization products (QP) are distributional summaries derived from the Level-2 standard retrieval products (of swath type) to provide a more comprehensive set of statistical summaries than the traditional means and standard deviations. The QP products combine the Level-2 standard data parameters over grid cells of 5 x 5 deg spatial extent for temporal periods of a month. They preserve the multivariate distributional features of the original data and so provide a compressed data set that more accurately describes the disparate atmospheric states present in the original Level-2 swath data set. The geophysical parameters are: air temperature and water vapor profiles (11 levels/layers) and cloud fraction (vertical distribution).
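Gridding swath retrievals into 5 x 5 degree cells can be sketched as below. The indexing convention (origin at 90S, 180W) is an illustrative assumption, not the AIRS product's own convention:

```python
def grid_cell(lat, lon, cell_deg=5.0):
    """Map a (lat, lon) coordinate to the (row, col) index of its
    5 x 5 degree grid cell, with row 0 at 90S and col 0 at 180W."""
    row = int((lat + 90.0) // cell_deg)
    col = int((lon + 180.0) // cell_deg)
    return row, col
```

Summaries (means, quantization bins) would then be accumulated per (row, col) key over each month of Level-2 data.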

  6. National Hydrography Dataset Plus Version 2.1

    • oregonwaterdata.org
    • resilience.climate.gov
• +4 more
    Updated Aug 16, 2022
    + more versions
    Cite
    Esri (2022). National Hydrography Dataset Plus Version 2.1 [Dataset]. https://www.oregonwaterdata.org/maps/4bd9b6892530404abfe13645fcb5099a
    Explore at:
    Dataset updated
    Aug 16, 2022
    Dataset authored and provided by
Esri (http://esri.com/)
    Area covered
    Description

The National Hydrography Dataset Plus (NHDPlus) maps the lakes, ponds, streams, rivers and other surface waters of the United States. Created by the US EPA Office of Water and the US Geological Survey, the NHDPlus provides mean annual and monthly flow estimates for rivers and streams. Additional attributes provide connections between features, facilitating complicated analyses. For more information on the NHDPlus dataset, see the NHDPlus v2 User Guide.
Dataset Summary
Phenomenon Mapped: Surface waters and related features of the United States and associated territories, not including Alaska.
Geographic Extent: The United States, not including Alaska, Puerto Rico, Guam, US Virgin Islands, Marshall Islands, Northern Marianas Islands, Palau, Federated States of Micronesia, and American Samoa
Projection: Web Mercator Auxiliary Sphere
Visible Scale: Visible at all scales, but the layer draws best at scales larger than 1:1,000,000
Source: EPA and USGS
Update Frequency: There is no new data since this 2019 version, so no updates are planned in the future
Publication Date: March 13, 2019
Prior to publication, the NHDPlus network and non-network flowline feature classes were combined into a single flowline layer. Similarly, the NHDPlus Area and Waterbody feature classes were merged under a single schema. Attribute fields were added to the flowline and waterbody layers to simplify symbology and enhance the layer's pop-ups. Fields added include Pop-up Title, Pop-up Subtitle, On or Off Network (flowlines only), Esri Symbology (waterbodies only), and Feature Code Description. All other attributes are from the original NHDPlus dataset. No-data values -9999 and -9998 were converted to Null values for many of the flowline fields.
What can you do with this layer?
Feature layers work throughout the ArcGIS system. Generally your work flow with feature layers will begin in ArcGIS Online or ArcGIS Pro.
Below are just a few of the things you can do with a feature service in Online and Pro.
ArcGIS Online
Add this layer to a map in the map viewer. The layer is limited to scales of approximately 1:1,000,000 or larger, but a vector tile layer created from the same data can be used at smaller scales to produce a webmap that displays across the full range of scales. The layer or a map containing it can be used in an application.
Change the layer’s transparency and set its visibility range.
Open the layer’s attribute table and make selections. Selections made in the map or table are reflected in the other. Center on selection allows you to zoom to features selected in the map or table, and show selected records allows you to view the selected records in the table.
Apply filters. For example, you can set a filter to show larger streams and rivers using the mean annual flow attribute or the stream order attribute.
Change the layer’s style and symbology.
Add labels and set their properties.
Customize the pop-up.
Use as an input to the ArcGIS Online analysis tools. This layer works well as a reference layer with the trace downstream and watershed tools. The buffer tool can be used to draw protective boundaries around streams, and the extract data tool can be used to create copies of portions of the data.
ArcGIS Pro
Add this layer to a 2d or 3d map.
Use as an input to geoprocessing. For example, copy features allows you to select then export portions of the data to a new feature class.
Change the symbology and the attribute field used to symbolize the data.
Open the table and make interactive selections with the map.
Modify the pop-ups.
Apply definition queries to create sub-sets of the layer.
This layer is part of the ArcGIS Living Atlas of the World that provides an easy way to explore the landscape layers and many other beautiful and authoritative maps on hundreds of topics.
Questions? Please leave a comment below if you have a question about this layer, and we will get back to you as soon as possible.
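The no-data conversion and flow/stream-order filtering described above can be sketched as follows. The attribute names used here are placeholders, not the real NHDPlus field names:

```python
NODATA = (-9999, -9998)  # sentinels converted to Null in the flowline fields

def clean_nodata(value):
    """Replace NHDPlus no-data sentinels with None."""
    return None if value in NODATA else value

# Hypothetical rows standing in for flowline records:
flowlines = [
    {"name": "Small Creek", "mean_annual_flow": -9999, "stream_order": 2},
    {"name": "Big River", "mean_annual_flow": 1250.0, "stream_order": 6},
]
for f in flowlines:
    f["mean_annual_flow"] = clean_nodata(f["mean_annual_flow"])

# The kind of filter the text suggests: keep larger streams and rivers.
large = [f["name"] for f in flowlines if f["stream_order"] >= 5]
```

In ArcGIS itself this corresponds to a definition query such as `StreamOrder >= 5` rather than Python code.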

  7. ‘Population by Country - 2020’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 13, 2020
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘Population by Country - 2020’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-population-by-country-2020-c8b7/latest
    Explore at:
    Dataset updated
    Feb 13, 2020
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Population by Country - 2020’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/tanuprabhu/population-by-country-2020 on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    I always wanted to access a data set that was related to the world’s population (Country wise). But I could not find a properly documented data set. Rather, I just created one manually.

    Content

Now I knew I wanted to create a dataset, but I did not know how to do so. So I started to search for the content (population of countries) on the internet. Obviously, Wikipedia was my first search, but somehow the results were not acceptable, and it only listed around 190 countries. So I surfed the internet for quite some time until I stumbled upon a great website, which you have probably heard of: Worldometer. This is exactly the website I was looking for. It had more details than Wikipedia, and it had more rows, I mean more countries with their population.

Once I got the data, my next hard task was to download it. Of course, I could not get the raw form of the data, and I did not mail them regarding it. Instead, I learned a new skill which is very important for a data scientist: I read somewhere that to obtain data from websites you need to use this technique. Any guesses? Keep reading, you will come to know in the next paragraph.


You are right: it's web scraping. I learned this so that I could convert the data into a CSV format. Below is the scraper code that I wrote; I also found a way to directly convert the pandas data frame to a CSV (comma-separated values) file and store it on my computer. Just go through my code and you will know what I'm talking about.

Below is the code that I used to scrape the data from the website:

[Screenshot: the scraper code]
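The original code is only available as a screenshot, so here is a stdlib-only stand-in for this kind of table scraping (the author's actual scraper used pandas and is not reproduced; fetching the live page is omitted, and a tiny inline table stands in for the Worldometer page):

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the rows of an HTML table into lists of cell strings."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

# A tiny page standing in for the Worldometer population table:
sample = ("<table><tr><th>Country</th><th>Population</th></tr>"
          "<tr><td>China</td><td>1,439,323,776</td></tr></table>")
scraper = TableScraper()
scraper.feed(sample)
rows = scraper.rows
```

Writing `rows` out with `csv.writer` would then produce the CSV file the author describes.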

    Acknowledgements

    Now I couldn't have got the data without Worldometer. So special thanks to the website. It is because of them I was able to get the data.

    Inspiration

As far as I know, I don't have any questions to ask. You can find your own ways to use the data, and let me know via a kernel if you find something interesting.

    --- Original source retains full ownership of the source dataset ---

  8. IDAO-2019-MuonID

    • kaggle.com
    zip
    Updated Jan 28, 2019
    Cite
    Nikita Kazeev (YSDA) (2019). IDAO-2019-MuonID [Dataset]. https://www.kaggle.com/datasets/kazeev/idao2019muonid
    Explore at:
Available download formats: zip (0 bytes)
    Dataset updated
    Jan 28, 2019
    Authors
    Nikita Kazeev (YSDA)
    Description

    Context

[Beginning from the beginning][1]. Normal matter, the kind that planets, humans, and stars are made of, makes up only 5% of the mass in the Universe. The rest is invisible dark matter and dark energy, whose existence can be hinted at through gravitational effects. One way of studying these mysteries is to recreate conditions just after the Big Bang with particle accelerators. Using a very rough analogy, we collide automobiles at supersonic speed and try to learn how they work by looking at photos of the collisions. One such photo camera is the LHCb detector.

Here is a typical collision event recorded by the LHCb detector, one of the four big experiments at the Large Hadron Collider. The point on the left is where the protons collided; the lines are the secondary particles' tracks. ![A typical collision event recorded by the LHCb detector][2]

The muon subdetector (see the figure below) consists of five stations (sensitive planes perpendicular to the beam pipe). Only four of them are used in our competition (M2-M5). Green parallelepipeds in the 3D figure above are the detector pads which registered a charged particle passing through them. The physical idea is that only muons have penetration ability high enough to allow them to pass through the lead shielding that separates the muon subdetector from the rest of the detector. Of course, in the real world not all hits are generated by muons; that's why we need machine learning.

    ![Muon subdetector ][3]

    You are given tracks of three types: muon, pion and proton. Pions might decay in flight into genuine muons, so some of their tracks are very muon-like, you want to reject them as well.

    The data is real (i.e. not simulated), so the particle types cannot be known with certainty. To account for that, we use a statistical method called sPlot ([original paper][4], [blog post][5]). Each example is assigned a weight; when the examples are used with those weights, the distribution of the features matches the distribution over type-pure samples. Some of the weights are negative; this is expected.
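    In practice, sPlot weights enter any statistic or loss as per-example multipliers, negative values included. A minimal sketch with hypothetical toy values (not taken from the competition files):

```python
def weighted_mean(values, weights):
    """Weighted mean; sPlot weights may be negative, and that is fine
    as long as the total weight stays positive."""
    total_w = sum(weights)
    if total_w <= 0:
        raise ValueError("total weight must be positive")
    return sum(v * w for v, w in zip(values, weights)) / total_w

# Toy example: three feature values with sPlot weights; the negative
# weight down-weights a background-like example.
print(weighted_mean([1.0, 2.0, 10.0], [1.2, 1.1, -0.3]))  # ≈ 0.2
```

    The same weights would typically be passed as `sample_weight` to a learning algorithm that supports them.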

    Since the data for the different particle types have been obtained from different decays, the distributions of the tracks' kinematic observables differ. But in the end we need an algorithm that differentiates particle types in general, not only in these specific decays. In ML terms, this can be viewed as domain adaptation. To achieve that, we reweighted the sample so that the momentum distributions of signal and background match.
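    One common way to do such reweighting is a histogram ratio: bin the variable, then weight each source example by target count over source count in its bin. A self-contained sketch (the actual competition weights were computed by the organisers; bin edges and values here are made up):

```python
def hist_reweight(source, target, edges):
    """Per-bin weights that make the binned distribution of `source`
    match that of `target` (histogram-ratio reweighting sketch)."""
    def counts(xs):
        c = [0] * (len(edges) - 1)
        for x in xs:
            for i in range(len(edges) - 1):
                if edges[i] <= x < edges[i + 1]:
                    c[i] += 1
                    break
        return c

    src_c, tgt_c = counts(source), counts(target)
    ratios = [t / s if s else 0.0 for s, t in zip(src_c, tgt_c)]

    def weight(x):
        for i in range(len(edges) - 1):
            if edges[i] <= x < edges[i + 1]:
                return ratios[i]
        return 0.0

    return [weight(x) for x in source]

# Toy momenta: background over-represented at low momentum.
bg  = [1, 1, 1, 5]   # source sample
sig = [1, 5, 5, 5]   # target sample
print(hist_reweight(bg, sig, edges=[0, 3, 10]))
```

    Low-momentum background examples get weights below 1 and the lone high-momentum one gets a weight above 1, so the weighted momentum histograms agree.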

    Content

    The data is used for [IDAO 2019][6]. For convenience, the training dataset is split into two files. In the first phase of the competition (we call it public), the models are scored using 20% of the test data (test_public). The data is provided in two formats: csv and hdf. Both have been created with pandas (see [environment.yml][7] for versions); the hdf files contain pickled numpy arrays, so they might not be readable outside Python.

    Features

    WARNING: the description on Kaggle might be out of date. IDAO participants, please see the competition problem statement for the up-to-date version.

    1. label, integer in {0,1} - you need to predict it. 0 is background (pions and protons), 1 is signal (muons)
    2. particle_type, integer in {0,1,2} - type of the particle. 0 - pion, 1 - muon, 2 - proton. Available only for the training dataset.
    3. weight, float - example weight, used in both training and evaluation. Product of sWeight and kinWeight.
    4. sWeight, float - a component of the example weight that accounts for uncertainty in labeling
    5. kinWeight, float ≥ 0 - a component of the example weight that equalizes kinematic observables between signal and background
    6. id, integer - example id
    7. Lextra_{X,Y}[N], float - coordinates of the track linear extrapolation intersection with the Nth station. The extrapolation uses the following station Z coordinates: [15270, 16470, 17670, 18870]
    8. Mextra_D{X,Y}2[N], float - uncertainty for the squared {X, Y} coordinate of the track extrapolation.
    9. MatchedHit_{X,Y,Z}[N], float - coordinates of the hit in the Nth station that a physics-based tracking algorithm associated with the track. [Poster about the algorithm][8]
    10. MatchedHit_TYPE[N], categorical in {0, 1, 2} - whether the Matched hit is crossed. 1 means uncrossed, 2 means crossed. 0 means there is no matched hit in the station (missing value). See pages 6-8 [here][9]
    11. MatchedHit_T[N], integer in {255} ∪ [1, 20] - timing of the matched hit; 255 is the missing value (no matched hit in the station)
    12. MatchedHit_D{X,Y,Z}[N], float in {-9999} ∪ (0, +∞) - uncertainty of the matched hit coordinates
    13. MatchedHit_DT[N], integer - delta time for the matched hit in the Nth station
    14. FOI_hits_N, integer ≥ 0 - number of hits inside a physics-defined cone around the track (aka Field Of Interest, FOI)
    15. FOI_hits_{,D}{X,Y,Z,T}, array of float of ...
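    A natural derived feature from columns 7 and 9 is the per-station distance between the track's linear extrapolation and the matched hit. The sketch below is a hypothetical helper (not one of the provided columns), handling the missing-value conventions described above (MatchedHit_TYPE == 0, coordinates of -9999):

```python
def extrapolation_residuals(lextra_x, lextra_y, hit_x, hit_y, hit_type):
    """Per-station distance between the track's linear extrapolation
    (Lextra_{X,Y}[N]) and the matched hit (MatchedHit_{X,Y}[N]);
    None where the station has no matched hit (MatchedHit_TYPE == 0)."""
    res = []
    for ex, ey, hx, hy, t in zip(lextra_x, lextra_y, hit_x, hit_y, hit_type):
        if t == 0:  # missing hit in this station
            res.append(None)
        else:
            res.append(((ex - hx) ** 2 + (ey - hy) ** 2) ** 0.5)
    return res

# Toy track over the four stations M2-M5 (no matched hit in the third).
print(extrapolation_residuals(
    [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0],
    [3.0, 0.0, -9999.0, 0.0], [4.0, 1.0, -9999.0, 2.0],
    [1, 2, 0, 1]))
# → [5.0, 1.0, None, 2.0]
```

    Small residuals in all stations are muon-like; large residuals or many missing hits suggest background.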
  9. DCASE 2023 Challenge Task 2 Development Dataset

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated May 2, 2023
    Cite
    Kota Dohi (2023). DCASE 2023 Challenge Task 2 Development Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7687463
    Explore at:
    Dataset updated
    May 2, 2023
    Dataset provided by
    Tomoya
    Yuma
    Noboru
    Harsh
    Keisuke
    Takashi
    Kota Dohi
    Yohei
    Daisuke
    Description

    Description

    This dataset is the "development dataset" for the DCASE 2023 Challenge Task 2 "First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring".

    The data consists of the normal/anomalous operating sounds of seven types of real/toy machines. Each recording is a single-channel 10-second audio that includes both a machine's operating sound and environmental noise. The following seven types of real/toy machines are used in this task:

    ToyCar

    ToyTrain

    Fan

    Gearbox

    Bearing

    Slide rail

    Valve

    Overview of the task

    Anomalous sound detection (ASD) is the task of identifying whether the sound emitted from a target machine is normal or anomalous. Automatic detection of mechanical failure is an essential technology in the fourth industrial revolution, which involves artificial-intelligence-based factory automation. Prompt detection of machine anomalies by observing sounds is useful for monitoring the condition of machines.

    This task is a follow-up to DCASE 2020 Task 2 through DCASE 2022 Task 2. The task this year is to develop an ASD system that meets the following four requirements.

    1. Train a model using only normal sound (unsupervised learning scenario)

    Because anomalies rarely occur and are highly diverse in real-world factories, it can be difficult to collect exhaustive patterns of anomalous sounds. Therefore, the system must detect unknown types of anomalous sounds that are not provided in the training data. This is the same requirement as in the previous tasks.

    2. Detect anomalies regardless of domain shifts (domain generalization task)

    In real-world cases, the operational states of a machine or the environmental noise can change, causing domain shifts. Domain-generalization techniques can be useful for handling domain shifts that occur frequently or are hard to notice. In this task, the system is required to use domain-generalization techniques to handle these domain shifts. This requirement is the same as in DCASE 2022 Task 2.

    3. Train a model for a completely new machine type

    For a completely new machine type, hyperparameters of the trained model cannot be tuned. Therefore, the system should have the ability to train models without additional hyperparameter tuning.

    4. Train a model using only one machine from its machine type

    While sounds from multiple machines of the same machine type can be used to enhance detection performance, it is often the case that sound data from only one machine are available for a machine type. In such a case, the system should be able to train models using only one machine from a machine type.

    The last two requirements are newly introduced in DCASE 2023 Task 2 as the "first-shot problem".

    Definition

    We first define key terms in this task: "machine type," "section," "source domain," "target domain," and "attributes."

    "Machine type" indicates the type of machine, which in the development dataset is one of seven: fan, gearbox, bearing, slide rail, valve, ToyCar, and ToyTrain.

    A section is defined as a subset of the dataset for calculating performance metrics.

    The source domain is the domain under which most of the training data and some of the test data were recorded, and the target domain is a different set of domains under which some of the training data and some of the test data were recorded. There are differences between the source and target domains in terms of operating speed, machine load, viscosity, heating temperature, type of environmental noise, signal-to-noise ratio, etc.

    Attributes are parameters that define states of machines or types of noise.

    Dataset

    This dataset consists of seven machine types. For each machine type, one section is provided, and the section is a complete set of training and test data. For each section, this dataset provides (i) 990 clips of normal sounds in the source domain for training, (ii) ten clips of normal sounds in the target domain for training, and (iii) 100 clips each of normal and anomalous sounds for the test. The source/target domain of each sample is provided. Additionally, the attributes of each sample in the training and test data are provided in the file names and attribute csv files.

    File names and attribute csv files

    File names and attribute csv files provide reference labels for each clip. The given reference labels for each training/test clip include machine type, section index, normal/anomaly information, and attributes regarding the condition other than normal/anomaly. The machine type is given by the directory name. The section index is given by their respective file names. For the datasets other than the evaluation dataset, the normal/anomaly information and the attributes are given by their respective file names. Attribute csv files are for easy access to attributes that cause domain shifts. In these files, the file names, name of parameters that cause domain shifts (domain shift parameter, dp), and the value or type of these parameters (domain shift value, dv) are listed. Each row takes the following format:

    [filename (string)], [d1p (string)], [d1v (int | float | string)], [d2p], [d2v]...
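    A row therefore holds a file name followed by a variable number of (parameter, value) pairs. A minimal parser sketch for that format (the parameter names in the example are hypothetical, and values are kept as strings rather than coerced to int/float):

```python
def parse_attribute_row(row):
    """Parse one attribute-csv row of the form
    [filename], [d1p], [d1v], [d2p], [d2v], ...
    into (filename, {parameter: value})."""
    fields = [f.strip() for f in row.split(",")]
    filename, rest = fields[0], fields[1:]
    if len(rest) % 2 != 0:
        raise ValueError("expected (parameter, value) pairs after the filename")
    attrs = dict(zip(rest[0::2], rest[1::2]))
    return filename, attrs

name, attrs = parse_attribute_row(
    "section_00_source_train_normal_0000_.wav, vel, 6, noise, factory1")
print(name, attrs)
# → section_00_source_train_normal_0000_.wav {'vel': '6', 'noise': 'factory1'}
```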
    

    Recording procedure

    Normal and anomalous operating sounds of machines and their related equipment are recorded. Anomalous sounds were collected by deliberately damaging the target machines. To simplify the task, we use only the first channel of multi-channel recordings; all recordings are regarded as single-channel recordings from a fixed microphone. We mixed a target machine sound with environmental noise, and only noisy recordings are provided as training/test data. The environmental noise samples were recorded in several real factory environments. We will publish papers explaining the details of the recording procedure by the submission deadline.

    Directory structure

    • /dev_data

      • /raw
        • /fan
          • /train (only normal clips)
            • /section_00_source_train_normal_0000_.wav
            • ...
            • /section_00_source_train_normal_0989_.wav
            • /section_00_target_train_normal_0000_.wav
            • ...
            • /section_00_target_train_normal_0009_.wav
          • /test
            • /section_00_source_test_normal_0000_.wav
            • ...
            • /section_00_source_test_normal_0049_.wav
            • /section_00_source_test_anomaly_0000_.wav
            • ...
            • /section_00_source_test_anomaly_0049_.wav
            • /section_00_target_test_normal_0000_.wav
            • ...
            • /section_00_target_test_normal_0049_.wav
            • /section_00_target_test_anomaly_0000_.wav
            • ...
            • /section_00_target_test_anomaly_0049_.wav
          • attributes_00.csv (attribute csv for section 00)
      • /gearbox (The other machine types have the same directory structure as fan.)
      • /bearing
      • /slider (slider means "slide rail")
      • /ToyCar
      • /ToyTrain
      • /valve

    Baseline system

    The baseline system is available in the GitHub repository dcase2023_task2_baseline_ae. It provides a simple entry-level approach that gives reasonable performance on the Task 2 dataset, and is a good starting point, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.

    Condition of use

    This dataset was created jointly by Hitachi, Ltd. and NTT Corporation and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

    Citation

    If you use this dataset, please cite all the following papers. We will publish a paper describing DCASE 2023 Task 2, so please make sure to cite that paper, too.

    Noboru Harada, Daisuke Niizumi, Yasunori Ohishi, Daiki Takeuchi, and Masahiro Yasuda. First-shot anomaly detection for machine condition monitoring: A domain generalization baseline. In arXiv e-prints: 2303.00455, 2023. [URL]

    Kota Dohi, Tomoya Nishida, Harsh Purohit, Ryo Tanabe, Takashi Endo, Masaaki Yamamoto, Yuki Nikaido, and Yohei Kawaguchi. MIMII DG: sound dataset for malfunctioning industrial machine investigation and inspection for domain generalization task. In Proceedings of the 7th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), 31-35. Nancy, France, November 2022. [URL]

    Noboru Harada, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Masahiro Yasuda, and Shoichiro Saito. ToyADMOS2: another dataset of miniature-machine operating sounds for anomalous sound detection under domain shift conditions. In Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), 1–5. Barcelona, Spain, November 2021. [URL]

    Contact

    If there is any problem, please contact us:

    Kota Dohi, kota.dohi.gr@hitachi.com

    Keisuke Imoto, keisuke.imoto@ieee.org

    Noboru Harada, noboru@ieee.org

    Daisuke Niizumi, daisuke.niizumi.dt@hco.ntt.co.jp

    Yohei Kawaguchi, yohei.kawaguchi.xk@hitachi.com

  10. CSDCIOP Dune Crest Points

    • maine.hub.arcgis.com
    Updated Feb 26, 2020
    Cite
    State of Maine (2020). CSDCIOP Dune Crest Points [Dataset]. https://maine.hub.arcgis.com/maps/csdciop-dune-crest-points
    Explore at:
    Dataset updated
    Feb 26, 2020
    Dataset authored and provided by
    State of Maine
    Area covered
    Description

    Feature class that compares the elevations of sand dune crests (extracted from available LiDAR datasets from 2010 and 2013) with published FEMA Base Flood Elevations (BFEs) from preliminary FEMA DFIRMs (panels issued in 2018 and 2019) in coastal York and Cumberland counties (up through Willard Beach in South Portland). Steps to create the dataset included:

    1. Shoreline structures from the most recent NOAA EVI LANDWARD_SHORETYPE feature class were extracted using the boundaries of York and Cumberland counties. This included 1B: Exposed, Solid Man-Made Structures; 8B: Sheltered, Solid Man-Made Structures; 6B: Riprap; and 8C: Sheltered Riprap. This resulted in the creation of Cumberland_ESIL_Structures and York_ESIL_Structures. Note that ESIL uses the MHW line as the feature base.
    2. Shoreline structures from the work by Rice (2015) were extracted using the York and Cumberland county boundaries, resulting in Cumberland_Rice_Structures and York_Rice_Structures.
    3. Additional feature classes, Slovinsky_York_Structures and Slovinsky_Cumberland_Structures, were created for York and Cumberland county structures that were missed. Google Earth imagery was inspected while additional structures were added to the GIS. 2012 York and Cumberland county imagery was used as the basemap, and structures were classified as bulkheads, riprap, or dunes (if known), noting whether or not each structure was in contact with the 2015 HAT.
    4. MEDEP was consulted to determine which permit data (both Permit By Rule, PBR, and Individual Permit, IP, data) could be used to help determine where shoreline stabilization projects may have been conducted adjacent to or on coastal bluffs. A file of IP data was received and brought into GIS (DEP_Licensing_Points); this is a point file for shoreline stabilization permits under NRPA.
    5. Clip GISVIEW.MEDEP.Permit_By_Rule_Locations to the boundaries of the study area and output DEP_PBR_Points.
    6. Join GISVIEW.sde > GISVIEW.MEDEP.PBR_ACTIVITY to DEP_PBR_Points using the PBR_ID field, then export this file as DEP_PBR_Points2. Using the new ACTIVITY_DESC field, select only those activities that relate to shoreline stabilization projects:
       02 - Act. Adjacent to a Protected Natural Resource
       04 - Maint Repair & Replacement of Structure
       08 - Shoreline Stabilization
       Select by Attributes > PBR_ACTIVITY IN ('02', '04', '08') to select only those activities likely to be related to shoreline stabilization, and export the selection as DEP_PBR_Points3. Then delete versions 1 and 2 and rename the final product DEP_PBR_Points.
    7. Visually inspect the Licensing and PBR files against 2012 and 2013 imagery in ArcMap, along with Google Earth imagery, to determine the extents of armoring along the shoreline.
    8. Using EVI and Rice data as indicators, manually inspect and digitize sections of the coastline that are armored. Classify the seaward shoreline type (beach, mudflat, channel, dune, etc.) and the armor type (wall or bulkhead). Bring in the HAT line and, using that and visual indicators, identify whether or not the armored sections are in contact with HAT. Use Google Earth while digitizing to help constrain areas. Merge the digitized armoring into Cumberland_York_Merged.
    9. Bring in the preliminary FEMA DFIRM data and use "intersect" to assign the flood zones and elevations to the digitized armored sections, first for Cumberland and then for York counties. Delete ancillary attributes as needed. The resulting layers are Cumberland_Structure_FloodZones and York_Structure_FloodZones.
    10. From the NOAA Digital Coast Data Layers, download the newest LiDAR data for York and Cumberland county beach, dune, and just-inland areas. This includes 2006 and newer topobathy data available from 2010 (entire coast) and selected areas from 2013 and 2014 (Wells, Scarborough, Kennebunk).
    11. Mosaic the 2006, 2010, 2013, and 2014 data (with the 2013 and 2014 datasets lying on top of the 2010 data), then mosaic this dataset into the sacobaydem_ftNAVD raster (from the MEGIS bare-earth model). This covers almost all of the study area except for armor along several areas in York, resulting in LidAR206_2010_2013_Mosaic.tif.
    12. Using the LiDAR data as a proxy, create a "seaward crest" line feature class, Dune_Crest, which follows the coast and extracts the approximate highest point (cliff, bank, dune) along the shoreline. This is used to extract LiDAR data and compare it with preliminary flood zone information.
    13. Using the added tool Points Along Line, create points at 5 m spacing along each of the armored shoreline feature lines and the dune crest lines. Call the outputs PointsonLines and PointsonDunes.
    14. Using Spatial Analyst, extract LiDAR elevations to the points using the 2006_2010_2013 mosaic first; call the result LidarPointsonLines1. Select the points with NULL values and export them as LiDARPointsonLines2, then rerun Extract Values to Points on just the selected data using the state MEGIS DEM. Convert RASTERVALU to feet by multiplying by 3.2808 (rename as Elev_ft). Select by Attributes, find all remaining NULL values, and delete them from LiDARPointsonLines in an edit session. Then merge the two datasets into LidarPointsonLines. Do the same with the dune lines to create LidarPointsonDunes.
    15. Use the Cumberland and York flood zone layers to intersect the points with the appropriate flood zone data, creating ...CumbFIRM and ...YorkFIRM files for the dunes and lines.
    16. Select the points from the dunes feature class that fall within the X zone; these will NOT have an associated BFE for comparison with the LiDAR data. Export the dune points as Cumberland_York_Dunes_XZone. Run NEAR using the merged flood zone feature class (with only VE, AE, and AO zones selected), then join the flood zone data to the feature class using FID (from the feature class) and OBJECTID (from the flood zone feature class). Export as Cumberland_York_Dunes_XZone_Flood. Delete ancillary columns of data, leaving the original FLD_ZONE (X), Elev_ft, NEAR_DIST (distance, in m, to the nearest flood zone), FLD_ZONE_1 (the near flood zone), and STATIC_BFE_1 (the nearest static BFE).
    17. Do the same as above with the structures file (Cumberland_York_Structures_Lidar_DFIRM_Merged), but also select the features within the X zone and OPEN WATER. Export the points as Cumberland_York_Structures_XZone. Again run NEAR using the merged flood zone with only AE, VE, and AO zones selected, and export the file as Cumberland_York_Structures_XZone_Flood.
    18. Merge the above feature classes with the original feature classes and add a field BFE_ELEV_COMPARE. Select all features in a VE or AE flood zone and use the field calculator to compute the difference between Elev_ft and the BFE (subtracting STATIC_BFE from Elev_ft). Positive values mean the maximum wall elevation is higher than the BFE; negative values mean it is below the BFE. Then switch the selection and calculate the same value using NEAR_STATIC_BFE instead. Select by Attributes > FLD_ZONE = AO and enter the DEPTH value into the fields created above as negative values. Delete ancillary attribute fields, leaving those listed in the _FINAL feature classes described above the process steps section.
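    The BFE_ELEV_COMPARE logic in the final step can be sketched outside ArcGIS. The function below is plain Python standing in for the field calculator, following the zone rules described above (field names follow the description; this is an interpretation, not the published tool chain):

```python
def bfe_elev_compare(fld_zone, elev_ft, static_bfe, near_static_bfe, depth=None):
    """Sketch of the BFE_ELEV_COMPARE field logic: positive means the
    crest/structure elevation exceeds the BFE, negative means it sits
    below it. AO zones carry a depth instead, stored as a negative value."""
    if fld_zone in ("VE", "AE"):
        return elev_ft - static_bfe        # zone has its own static BFE
    if fld_zone == "AO":
        return -depth                      # depth recorded as a negative value
    return elev_ft - near_static_bfe       # X zone: compare to the nearest BFE

print(bfe_elev_compare("AE", elev_ft=12.0, static_bfe=10.0, near_static_bfe=None))  # → 2.0
print(bfe_elev_compare("AO", 0.0, None, None, depth=3.0))                           # → -3.0
```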

  11. Dataset used in the publication entitled "Application of machine learning to...

    • zenodo.org
    • data.niaid.nih.gov
    bin, txt
    Updated Jan 31, 2024
    Cite
    Biaobiao Yang; Valentin Vassilev-Galindo; Javier Llorca (2024). Dataset used in the publication entitled "Application of machine learning to assess the influence of microstructure on twin nucleation in Mg alloys" [Dataset]. http://doi.org/10.5281/zenodo.10225600
    Explore at:
    bin, txtAvailable download formats
    Dataset updated
    Jan 31, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Biaobiao Yang; Valentin Vassilev-Galindo; Javier Llorca
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Documentation for the Dataset used in the publication entitled "Application of machine learning to assess the influence of microstructure on twin nucleation in Mg alloys"
    ** These datasets comprise the 2D EBSD data acquired in the Mg-1Al (at.%) alloy and AZ31 Mg alloy, analyzed with MTEX 7.0 software. **
    ** More details about the experimental techniques can be found in the publication "Biaobiao Yang, Valentin Vassilev-Galindo, Javier Llorca, Application of machine learning to assess the influence of microstructure on twin nucleation in Mg alloys. npj Computational Materials, 2024." **

    1. AZ31_ML.xlsx
    - Description: Both twin and grain data were acquired by EBSD from AZ31 Mg sample before and after deformation at the same area
    - Number of grains: 2640 (rows == grains) corresponding to three samples deformed in different orientations: S0, S45, and S90
    - Number of analyzed variables (features): 31 (columns == grain characteristics)

    - Variable description by columns:
    1- (Twinned) - type: boolean
    Description: Indicates if the grain twinned or not after deformation
    0: non-twinned grain
    1: twinned grain
    2- (Orientation) - type: numerical (integer)
    Description: The loading (tensile) direction with respect to the c axis of lattice
    3- (Strain_level) - type: numerical (float)
    Description: The maximum strain level after deformation
    4- (Grain_size) - type: numerical (float)
    Description: The equivalent circle diameter (in micrometers) of the grain before deformation.
    5- (Triple_points) - type: numerical (integer)
    Description: The number of triple points of the grain before deformation
    6- (Near_edge) - type: boolean
    Description: Indicates if the grain is located near the edge of the 2D EBSD or not. This feature was used to filter out from the final dataset the grains near the edge of the sample. Hence, only those entries with Near_edge value of 0 were used to train and test the machine learning models.
    0: not near the EBSD edge
    1: near the EBSD edge
    7-12- (T_SF*) - type: numerical (float)
    Description: The twinning Schmid factor based on the loading condition, orientation of parent grain and twin variants information.
    T_SF1: The highest Schmid factor of extension twinning
    T_SF2: The 2nd highest ...
    T_SF3: 3rd
    T_SF4: 4th
    T_SF5: 5th
    T_SF6: The lowest Schmid factor of extension twinning
    13-15- (S_SF*) - type: numerical (float)
    Description: The Schmid factor for basal slip based on the loading condition, orientation of parent grain, and slip system information. Only the basal slip system is considered because it is the dominant deformation slip system in Mg during deformation.
    S_SF1: The highest Schmid factor of basal slip
    S_SF2: The second highest (middle) Schmid factor of basal slip
    S_SF3: The lowest Schmid factor of basal slip
    16- (Neighbor_grain_n) - type: numerical (integer)
    Description: The number of neighbors of the grain before deformation.
    17-19- (B-b_m) - type: numerical (float)
    Description: The Luster-Morris geometric compatibility factor (m') between the basal slip systems of the grain and its neighbors. Although there are 3 possible basal slip systems, only the one with the highest Schmid factor was considered to compute m'. Only maximum, minimum, and mean values were included in the dataset.
    (Max_B-b_m): The highest basal - basal m' between the grain and its neighbors
    (Min_B-b_m): The lowest basal - basal m' between the grain and its neighbors
    (Mean_B-b_m): The average basal - basal m' between the grain and its neighbors
    20-22- (B-t_m) - type: numerical (float)
    Description: The Luster-Morris geometric compatibility factor (m') between the 6 extension twin variants of the grain and the basal slip systems of its neighbors. Although there are 3 possible basal slip systems, only the one with the highest Schmid factor was considered to compute m'. However, all 6 twinning variants have been considered, given that slip-induced twinning is a localized process. Only maximum, minimum, and mean values were included in the dataset.
    (Max_B-t_m): The highest basal - twin m' between the grain and its neighbors
    (Min_B-t_m): The lowest basal - twin m' between the grain and its neighbors
    (Mean_B-t_m): The average basal - twin m' between the grain and its neighbors
    23-25- (GB_misang) - type: numerical (float)
    Description: The misorientation angle (in º) between the grain and its neighbors. In fact, disorientation angle is used for the misorientation angle. Only maximum, minimum, and mean values were included in the dataset.
    (Max_GBmisang): The highest GB misorientation angle between the grain and its neighbors
    (Min_GBmisang): The lowest GB misorientation angle between the grain and its neighbors
    (Mean_GBmisang): The average GB misorientation angle between the grain and its neighbors
    26-28- (delta_Gs) - type: numerical (float)
    Description: Grain size difference (in micrometers) between a given grain and its neighbors. The grain size is calculated as the diameter of a circular grain with the same area of the grain. Only maximum, minimum, and mean values were included in the dataset.
    (Max_deltaGs): The highest grain size difference between the grain and its neighbors
    (Min_deltaGs): The smallest grain size difference between the grain and its neighbors
    (Mean_deltaGs): The average grain size difference between the grain and its neighbors
    29-31- (delta_BSF) - type: numerical (float)
    Description: The difference in the basal slip Schmid factor between a given grain and its neighbors. Only the highest basal slip Schmid factor is considered. Only maximum, minimum, and mean values were included in the dataset.
    (Max_deltaBSF): The highest basal SF difference between the grain and its neighbors
    (Min_deltaBSF): The smallest basal SF difference between the grain and its neighbors
    (Mean_deltaBSF): The average basal SF difference between the grain and its neighbors

    2. Mg1Al_ML.xlsx
    - Description: Both twin and grain data were acquired by EBSD from Mg-1Al (at.%) sample before and after deformation at the same area
    - Number of grains: 1496 (rows == grains) corresponding to two true strain levels: ~6%, and ~10%.
    - Number of analyzed variables (features): 31 (columns == grain characteristics)

    - Variable descriptions by columns are the same as those of AZ31_ML.xlsx
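    The Luster-Morris factor behind columns 17-25 is commonly defined as m' = cos κ · cos ψ, where κ is the angle between the two slip/twin plane normals and ψ the angle between the two shear directions. A minimal sketch of that standard definition (unit vectors assumed; this is not the authors' MTEX code):

```python
def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))

def luster_morris_m(n1, b1, n2, b2):
    """m' = cos(kappa) * cos(psi) for two slip (or twin) systems given
    by unit plane normals n1, n2 and unit shear directions b1, b2."""
    return dot(n1, n2) * dot(b1, b2)

# Identical systems are perfectly compatible:
print(luster_morris_m((0, 0, 1), (1, 0, 0), (0, 0, 1), (1, 0, 0)))  # → 1
```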

  12. National Health and Nutrition Examination Survey (NHANES)

    • dataverse.harvard.edu
    Updated May 30, 2013
    Cite
    Anthony Damico (2013). National Health and Nutrition Examination Survey (NHANES) [Dataset]. http://doi.org/10.7910/DVN/IMWQPJ
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 30, 2013
    Dataset provided by
    Harvard Dataverse
    Authors
    Anthony Damico
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    analyze the national health and nutrition examination survey (nhanes) with r nhanes is this fascinating survey where doctors and dentists accompany survey interviewers in a little mobile medical center that drives around the country. while the survey folks are interviewing people, the medical professionals administer laboratory tests and conduct a real doctor's examination. the blood work and medical exam allow researchers like you and me to answer tough questions like, "how many people have diabetes but don't know they have diabetes?" conducting the lab tests and the physical isn't cheap, so a new nhanes data set becomes available once every two years and only includes about twelve thousand respondents. since the number of respondents is so small, analysts often pool multiple years of data together. the replication scripts below give a few different examples of how multiple years of data can be pooled with r. the survey gets conducted by the centers for disease control and prevention (cdc), and generalizes to the united states non-institutional, non-active duty military population. most of the data tables produced by the cdc include only a small number of variables, so importation with the foreign package's read.xport function is pretty straightforward. but that makes merging the appropriate data sets trickier, since it might not be clear what to pull for which variables. for every analysis, start with the table with 'demo' in the name -- this file includes basic demographics, weighting, and complex sample survey design variables. since it's quick to download the files directly from the cdc's ftp site, there's no massive ftp download automation script.
    this new github repository contains five scripts: 2009-2010 interview only - download and analyze.R download, import, save the demographics and health insurance files onto your local computer load both files, limit them to the variables needed for the analysis, merge them together perform a few example variable recodes create the complex sample survey object, using the interview weights run a series of pretty generic analyses on the health insurance questions 2009-2010 interview plus laboratory - download and analyze.R download, import, save the demographics and cholesterol files onto your local computer load both files, limit them to the variables needed for the analysis, merge them together perform a few example variable recodes create the complex sample survey object, using the mobile examination component (mec) weights perform a direct-method age-adjustment and match figure 1 of this cdc cholesterol brief replicate 2005-2008 pooled cdc oral examination figure.R download, import, save, pool, recode, create a survey object, run some basic analyses replicate figure 3 from this cdc oral health databrief - the whole barplot replicate cdc publications.R download, import, save, pool, merge, and recode the demographics file plus cholesterol laboratory, blood pressure questionnaire, and blood pressure laboratory files match the cdc's example sas and sudaan syntax file's output for descriptive means match the cdc's example sas and sudaan syntax file's output for descriptive proportions match the cdc's example sas and sudaan syntax file's output for descriptive percentiles replicate human exposure to chemicals report.R (user-contributed) download, import, save, pool, merge, and recode the demographics file plus urinary bisphenol a (bpa) laboratory files log-transform some of the columns to calculate the geometric means and quantiles match the 2007-2008 statistics shown on pdf page 21 of the cdc's fourth edition of the report click here to view these five scripts for
more detail about the national health and nutrition examination survey (nhanes), visit: the cdc's nhanes homepage the national cancer institute's page of nhanes web tutorials notes: nhanes includes interview-only weights and interview + mobile examination component (mec) weights. if you o nly use questions from the basic interview in your analysis, use the interview-only weights (the sample size is a bit larger). i haven't really figured out a use for the interview-only weights -- nhanes draws most of its power from the combination of the interview and the mobile examination component variables. if you're only using variables from the interview, see if you can use a data set with a larger sample size like the current population (cps), national health interview survey (nhis), or medical expenditure panel survey (meps) instead. confidential to sas, spss, stata, sudaan users: why are you still riding around on a donkey after we've invented the internal combustion engine? time to transition to r. :D
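    the first script's workflow -- download the demographics file and a questionnaire file, keep only the variables you need, merge them on the respondent identifier (seqn), and pool cycles by stacking -- looks roughly like the sketch below. note: the actual scripts are written in r with the foreign and survey packages; this python sketch only illustrates the merge-and-pool logic, and every column name besides seqn is made up.

```python
# illustrative sketch only: merge a demographics table with a
# questionnaire table on the respondent id (seqn), then pool two
# survey cycles by stacking.  all columns other than seqn are
# hypothetical stand-ins for real nhanes variables.

def merge_on_seqn(demo_rows, quest_rows):
    """inner-join two lists of dicts on the 'seqn' key."""
    quest_by_id = {row["seqn"]: row for row in quest_rows}
    merged = []
    for row in demo_rows:
        match = quest_by_id.get(row["seqn"])
        if match is not None:
            merged.append({**row, **match})
    return merged

# two pretend cycles of demographics plus health-insurance answers
demo_0910 = [{"seqn": 1, "age": 34}, {"seqn": 2, "age": 51}]
ins_0910 = [{"seqn": 1, "insured": True}]
demo_1112 = [{"seqn": 3, "age": 9}]
ins_1112 = [{"seqn": 3, "insured": False}]

# merge within each cycle, then pool the cycles by concatenation
pooled = merge_on_seqn(demo_0910, ins_0910) + merge_on_seqn(demo_1112, ins_1112)
print(pooled)
```

    in the real scripts the complex survey design (strata, psus, and the interview or mec weights from the demo file) must still be declared before any estimation -- the merge above is only the data-management half of the job.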

  13. DCASE 2025 Challenge Task 2 Development Dataset

    • zenodo.org
    zip
    Updated Apr 1, 2025
    + more versions
    Tomoya Nishida; Noboru Harada; Daisuke Niizumi; Davide Albertini; Roberto Sannino; Simone Pradolini; Filippo Augusti; Keisuke Imoto; Kota Dohi; Harsh Purohit; Takashi Endo; Yohei Kawaguchi (2025). DCASE 2025 Challenge Task 2 Development Dataset [Dataset]. http://doi.org/10.5281/zenodo.15097779
    Explore at:
    zip
    Available download formats
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tomoya Nishida; Noboru Harada; Daisuke Niizumi; Davide Albertini; Roberto Sannino; Simone Pradolini; Filippo Augusti; Keisuke Imoto; Kota Dohi; Harsh Purohit; Takashi Endo; Yohei Kawaguchi
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description


    This dataset is the "development dataset" for the DCASE 2025 Challenge Task 2.

    The data consists of the normal/anomalous operating sounds of seven types of real/toy machines. Each recording is a single-channel, 10- or 12-second audio clip that includes both a machine's operating sound and environmental noise. The following seven types of real/toy machines are used in this task:

    • ToyCar
    • ToyTrain
    • Fan
    • Gearbox
    • Bearing
    • Slide rail
    • Valve

    Overview of the task

    Anomalous sound detection (ASD) is the task of identifying whether the sound emitted from a target machine is normal or anomalous. Automatic detection of mechanical failure is an essential technology in the fourth industrial revolution, which involves artificial-intelligence-based factory automation. Prompt detection of machine anomalies by observing sounds is useful for monitoring the condition of machines.

    This task is a follow-up to the series running from DCASE 2020 Task 2 through DCASE 2024 Task 2. The task this year is to develop an ASD system that meets the following five requirements.

    1. Train a model using only normal sound (unsupervised learning scenario)
    Because anomalies rarely occur and are highly diverse in real-world factories, it can be difficult to collect exhaustive patterns of anomalous sounds. Therefore, the system must detect unknown types of anomalous sounds that are not provided in the training data, which is called UASD (unsupervised ASD). This is the same requirement as in the previous tasks.
    2. Detect anomalies regardless of domain shifts (domain generalization task)
    In real-world cases, the operational states of a machine or the environmental noise can change, causing domain shifts. Domain-generalization techniques can be useful for handling domain shifts that occur frequently or are hard to notice. In this task, the system is required to use domain-generalization techniques to handle these domain shifts. This requirement has remained the same since DCASE 2022 Task 2.
    3. Train a model for a completely new machine type
    For a completely new machine type, hyperparameters of the trained model cannot be tuned. Therefore, the system should have the ability to train models without additional hyperparameter tuning. This requirement has remained the same since DCASE 2023 Task 2.
    4. Train a model either with or without attribute information
    While additional attribute information can help enhance the detection performance, we cannot always obtain such information. Therefore, the system must work well both when attribute information is available and when it is not.
    5. Train a model with additional clean machine data or noise-only data (optional)
    Although the primary training data consists of machine sounds recorded under noisy conditions, in some situations it may be possible to collect clean machine data when the factory is idle or gather noise recordings when the machine itself is not running. Participants are free to incorporate these additional data sources to enhance the accuracy of their models.

    The last, optional requirement is newly introduced in DCASE 2025 Task 2.

    Definition

    We first define key terms in this task: "machine type," "section," "source domain," "target domain," and "attributes."

    • "Machine type" indicates the type of machine, which in the development dataset is one of seven: fan, gearbox, bearing, slide rail, valve, ToyCar, and ToyTrain.
    • A section is defined as a subset of the dataset for calculating performance metrics.
    • The source domain is the domain under which most of the training data and some of the test data were recorded, and the target domain is a different set of domains under which some of the training data and some of the test data were recorded. There are differences between the source and target domains in terms of operating speed, machine load, viscosity, heating temperature, type of environmental noise, signal-to-noise ratio, etc.
    • Attributes are parameters that define states of machines or types of noise. For several machine types, the attributes are hidden.

    Dataset

    This dataset consists of seven machine types. For each machine type, one section is provided, and the section is a complete set of training and test data. For each section, this dataset provides (i) 990 clips of normal sounds in the source domain for training, (ii) ten clips of normal sounds in the target domain for training, (iii) 100 clips of supplementary sound data containing either clean normal machine sounds in the source domain or noise-only sounds, and (iv) 100 clips each of normal and anomalous sounds for the test. The source/target domain of each sample is provided. Additionally, the attributes of each sample in the training and test data are provided in the file names and attribute csv files.

    File names and attribute csv files

    File names and attribute csv files provide reference labels for each clip. The reference labels for each training/test clip include the machine type, section index, normal/anomaly information, and attributes describing conditions other than normal/anomaly. The machine type is given by the directory name, and the section index by the file name. For all datasets other than the evaluation dataset, the normal/anomaly information and the attributes are also given by the file names. Note that for machine types whose attribute information is hidden, the attribute field in each file name is labeled "noAttributes". Attribute csv files provide easy access to the attributes that cause domain shifts. These files list the file names, the names of the parameters that cause domain shifts (domain shift parameter, dp), and the values or types of these parameters (domain shift value, dv). Each row takes the following format:

    [filename (string)], [d1p (string)], [d1v (int | float | string)], [d2p], [d2v]...

    For machine types that have their attribute information hidden, all columns except the filename column are left blank for each row.
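    As an illustration, a row in this format could be parsed as follows. This is a hedged sketch: the helper name and the example attribute names/values are invented, and real files should of course be read with a proper csv reader.

```python
# parse one already-split row of an attribute csv:
# [filename], [d1p], [d1v], [d2p], [d2v], ...
def parse_attribute_row(row_fields):
    """Return (filename, {param: value}) from one csv row.
    Blank parameter columns (attribute-hidden machine types) yield {}."""
    filename, *rest = row_fields
    params = {}
    # pair up alternating parameter-name / parameter-value columns
    for name, value in zip(rest[0::2], rest[1::2]):
        if name:  # skip the blank columns of attribute-hidden types
            params[name] = value
    return filename, params

# hypothetical example row
fname, attrs = parse_attribute_row(
    ["example_clip_0001.wav", "vel", "6", "noise", "low"]
)
print(fname, attrs)
```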

    Recording procedure

    Normal/anomalous operating sounds of machines and their related equipment were recorded. Anomalous sounds were collected by deliberately damaging the target machines. To simplify the task, we use only the first channel of multi-channel recordings; all recordings are regarded as single-channel recordings from a fixed microphone. Each target machine sound was mixed with environmental noise, and only the noisy recordings are provided as training/test data. The environmental noise samples were recorded in several real factory environments. We will publish papers explaining the details of the recording procedure by the submission deadline.

    Directory structure

    - /dev_data

    - /raw
    - /fan
    - /train (only normal clips)
    - /section_00_source_train_normal_0001_

    Baseline system

    The baseline system is available on the GitHub repository https://github.com/nttcslab/dcase2023_task2_baseline_ae. It provides a simple entry-level approach that gives reasonable performance on the Task 2 dataset and is a good starting point, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.
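    Reconstruction-based detectors like the autoencoder baseline score a clip by how poorly a model trained only on normal sounds reconstructs it, then flag the clip when the score exceeds a threshold fitted on normal training scores. A minimal, library-free sketch of that decision rule (the identity-like stand-in "model" and the 90th-percentile threshold are illustrative assumptions, not the baseline's actual settings):

```python
# anomaly decision rule used by reconstruction-based detectors:
# score = mean squared error between a feature vector and its
# reconstruction; threshold = a high percentile of normal scores.

def mse(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def fit_threshold(normal_scores, q=0.9):
    """Crude percentile: the value below which roughly a fraction q
    of the normal-data scores fall."""
    s = sorted(normal_scores)
    return s[min(int(q * len(s)), len(s) - 1)]

# stand-in "reconstruction": a model trained on normal data reproduces
# normal vectors well; anomalous vectors drift further from their input.
reconstruct = lambda x: [v * 0.95 for v in x]

normal_clips = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1]]
threshold = fit_threshold([mse(c, reconstruct(c)) for c in normal_clips])

test_clip = [5.0, 9.0]  # large feature values -> large reconstruction error
is_anomaly = mse(test_clip, reconstruct(test_clip)) > threshold
print(is_anomaly)
```

    The unsupervised requirement above is exactly why the threshold is fitted on normal scores only: no anomalous examples are available at training time.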

    Condition of use

    This dataset was created jointly by Hitachi, Ltd. and NTT Corporation and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

    Citation

    Contact

    If there is any problem, please contact us:

  14. TBX11K Simplified - TB X-rays with bounding boxes

    • kaggle.com
    Updated Feb 8, 2023
    vbookshelf (2023). TBX11K Simplified - TB X-rays with bounding boxes [Dataset]. https://www.kaggle.com/datasets/vbookshelf/tbx11k-simplified/suggestions?status=pending&yourSuggestions=true
    Explore at:
    Croissant
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 8, 2023
    Dataset provided by
    Kaggle
    Authors
    vbookshelf
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The TBX11K dataset is a large dataset containing 11,000 chest x-ray images. It's the only TB dataset I know of that includes TB bounding boxes, which allows both classification and detection models to be trained.

    However, it can be mentally tiring to get started with this dataset. It includes many xml, json and txt files that you need to sift through to try to understand what everything means, how it all fits together and how to extract the bounding box coordinates.

    Here I've simplified the dataset. Now there's just one csv file, one folder containing the training images and one folder containing the test images.


    Paper: Rethinking Computer-aided Tuberculosis Diagnosis

    Original TBX11K dataset on Kaggle


    Notes

    1- Please start by reading the paper. It will help you understand what everything means.
    2- The original dataset was split into train and validation sets. This split is shown in the 'source' column in the data.csv file.
    3- The test images are stored in the folder called "test". There are no labels for these images and I've not included them in data.csv.
    4- Each bounding box is on a separate row. Therefore, the file names in the "fname" column are not unique. For example, if an image has two bounding boxes then the file name for that image will appear twice in the "fname" column.
    5- The original dataset has a folder named "extra" that contains data from other TB datasets. I've not included that folder here.
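    Because each bounding box sits on its own row, code that needs per-image annotations must group the rows by file name first. A hedged sketch of that grouping (the coordinate column names and sample rows here are invented, so check them against the actual data.csv header):

```python
import csv
import io
from collections import defaultdict

# data.csv stores one bounding box per row, so an image with two boxes
# appears twice in the "fname" column; group rows to get per-image boxes.
# NOTE: the column names below are assumptions for illustration.
sample_csv = """fname,xmin,ymin,xmax,ymax
tb0001.png,10,20,50,60
tb0001.png,70,80,120,140
tb0002.png,5,5,40,40
"""

boxes_by_image = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample_csv)):
    boxes_by_image[row["fname"]].append(
        tuple(int(row[k]) for k in ("xmin", "ymin", "xmax", "ymax"))
    )

print(dict(boxes_by_image))
```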


    Acknowledgements

    Many thanks to the team that created the TBX11K dataset and generously made it publicly available.


    Citation

     # TBX11K dataset
     @inproceedings{liu2020rethinking,
      title={Rethinking computer-aided tuberculosis diagnosis},
      author={Liu, Yun and Wu, Yu-Huan and Ban, Yunfeng and Wang, Huifang and Cheng, Ming-Ming},
      booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition},
      pages={2646--2655},
      year={2020}
     }
    
    


    Helpful Resources

  15. ClimeMarine – Climate change predictions for Marine Spatial Planning

    • researchdata.se
    Updated Sep 29, 2022
    Oscar Törnqvist; Lars Arneborg; Duncan Hume (2022). ClimeMarine – Climate change predictions for Marine Spatial Planning [Dataset]. http://doi.org/10.5878/gwas-0254
    Explore at:
    (316973908), (19433787), (28261440), (319415533), (26767), (22035), (308975712)
    Available download formats
    Dataset updated
    Sep 29, 2022
    Dataset provided by
    SMHI (http://www.smhi.se/)
    Authors
    Oscar Törnqvist; Lars Arneborg; Duncan Hume
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1975 - Dec 31, 2099
    Area covered
    North Sea, Baltic Sea
    Description

    This series is composed of five select physical marine parameters (water salinity and water temperature for surface and near-bottom waters, and sea ice) for two climate scenarios (RCP 4.5 and RCP 8.5) and three statistics (minimum, median and maximum) from an ensemble of five downscaled global climate models. The source data for this data series is global climate model outcomes from the Coupled Model Intercomparison Project 5 (CMIP5) published by the Intergovernmental Panel on Climate Change (Stocker et al 2013).

    The source data were provided in NetCDF format for each of the downsampled climate models based on the five CMIP5 global climate models: MPI: MPI-ESM-LR, HAD: HadGEM2-ES, ECE: EC-EARTH, GFD: GFDL-ESM2M, IPS: IPSL-CM5A-MR. The data included monthly mean, maximum, minimum and standard deviation calculations and the physical variables provided with the climate scenario models included sea ice cover, water temperature, water salinity, sea level and current strength (as two vectors) as well as a range of derived biogeochemical variables (O2, PO4, NO3, NH4, Secci Depth and Phytoplankton).

    These global atmospheric climate model data were subsequently downscaled from global to regional scale and incorporated into the high-resolution ocean–sea ice–atmosphere model RCA4–NEMO by the Swedish Meteorological and Hydrological Institute (Gröger et al 2019) thus providing a wide range of marine specific parameters. The Swedish Geological Survey used these data in the form of monthly mean averages to calculate change in multi-annual (30-year) climate averages from the beginning and end of the 21st century for the five select parameters as proxies for climate change pressures.

    Each dataset uses only source data models based on an assumption of atmospheric climate gas concentrations in line with either the IPCC's representative concentration pathway RCP 4.5 or RCP 8.5. Changes were calculated as the difference between two multi-annual (30-year) mean averages: one for a historical reference climate period (1976-2005) and one for an end-of-century projection (2070-2099). These data were extracted for each of the five downscaled CMIP5 models individually and then combined into ensemble summary statistics (ensemble minimum, median and maximum). In the Ensemble_Maximum/Median/Minimum_Rasters datasets, changes in mean (May-Sept) surface temperature and bottom temperature are given in degrees Celsius (°C); changes in mean annual surface salinity and bottom salinity are given in Practical Salinity Units (PSU); changes in mean (October-April) sea ice are given in percentage points (pp).

    In the Normalized_Rasters datasets, the changes are normalized using a linear stretch so that a cell value of zero represents no projected change and a cell value of 100 represents a value equal to or above the mean change in Swedish national waters. The values representing 100 are: 4 °C for surface temperature; 3 °C for bottom temperature; -1.5 PSU for surface salinity; -2.0 PSU for bottom salinity; and -40 pp for sea ice. These were also the chosen reference values for determining, via expert review, the sensitivity of ecosystem components to changes in these parameters (for further information refer to the Symphony method).

    Notes on interpretation. This dataset does not highlight inter-annual or inter-decadal climate variability (e.g. extreme events) or changes in biochemical parameters (e.g. O2, chlorophyll, secchi depth etc.) resulting from change in surface temperature. Areas of no-data inshore were filled by extrapolating from nearby cells (using similar depths for benthic data), so data near the coast, and particularly within archipelagos, bays and estuaries, are not robust. Users should refer to the associated ClimeMarine uncertainty map for this parameter. The uncertainty map shows the interquartile range from the climate ensemble and the areas of no-data as 'interpolated values'. For any application which requires more temporally or spatially explicit information (e.g. sub-national decision making), it is highly recommended that the user contact SMHI for access to the latest climate model source data (in NetCDF format), which contains much more detail and a far wider selection of parameters. For regional applications (e.g. at the scale of the Baltic Sea), it should be noted that these data will likely require normalisation to regional rather than national values and that the sensitivity scores used may differ.

    ClimeMarine was selective in its choice of pressure parameters. SMHI have additional data available for other parameters such as O2, secchi depth and nutrients, which could be included in future. This is complicated because many parameters are influenced by riverine discharge, and therefore by decisions related to watershed management; disentangling the impacts of climate from those of river basin management becomes a complication. In a similar way, data on sea-level rise are also available and could be used to estimate impacts on the coast, but likewise complicating factors such as isostatic uplift and coastal defence and management policies would need to be considered.

    For simplicity, and to reduce the number of datasets to a manageable level for this assessment, the source data were further limited and summarised in several ways:

    Only the monthly mean averages of seawater temperature, salinity and sea ice (i.e. key physical parameters) were utilized.
    For seawater salinity and temperature, the depth dimension (i.e. the water column) was summarised from 56 depth levels to just two: the surface and the deepest (bottom) waters.
    Only two of the three climate periods were selected: a historical reference period, 1976-2005 (to represent the current status), and the projected end-of-century period, 2070-2099. Only two of the three available emission scenarios were selected, detailing the consequences of intermediate and very high climate gas emissions: Representative Concentration Pathway (RCP) 4.5 and 8.5 (see SEDAC 2021).

    Each dataset included in the series comes with extensive metadata.

    The data processing followed the following steps:

    1. Extraction of data for each parameter from NetCDF to TIFF rasters for each model, emission scenario and depth level (using scripts in NCO, CDO and R).
    2. Calculation of climate ensemble statistics - minimum, mean, median and maximum (using Arcpy and Numpy).
    3. Reprojection and resampling from the 2nm NEMO-RCO Lat/Long WGS84 grid to the 250m ETRS89 LAEA Symphony grid (using Arcpy).
    4. Extrapolation to fill no-data cells based on proximity and similar depths (using an Arcpy script and the ArcGIS Spatial Analyst extension).
    5. Calculation of change for each parameter as the end-of-century multi-annual mean minus the reference multi-annual mean (using an Arcpy script).
    6. Inversion of negative values (i.e. decreases) to positive values (i.e. magnitude of change).
    7. Normalisation as a linear stretch from 0 to 100, where zero equates to no change and 100 equates to the maximum pixel value in Swedish waters from the RCP 8.5 ensemble mean dataset, with any values above this also set to 100 (using an Arcpy script).
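    The normalisation step amounts to a clamped linear stretch. A minimal sketch (the function name is ours; the 4 °C value is the surface-temperature reference quoted earlier, applied to change magnitudes after the inversion step):

```python
def normalize_change(change, reference_max):
    """Linear stretch: 0 = no change, 100 = change at or above the
    reference maximum for Swedish national waters; clamp into [0, 100]."""
    scaled = change / reference_max * 100.0
    return min(max(scaled, 0.0), 100.0)

# surface temperature: a +2 degC change against the 4 degC reference -> 50.0
print(normalize_change(2.0, 4.0))
# changes above the reference clamp to 100.0
print(normalize_change(5.0, 4.0))
```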

    NetCDF source data used in this analysis can be requested from the Swedish Meteorological and Hydrological Institute - kundtjanst@smhi.se

    Processing scripts (R and arcpy) and interim raster data can be requested from the Geological Survey of Sweden - kundtjanst@sgu.se

  16. Global contemporary effective population sizes across taxonomic groups

    • data.niaid.nih.gov
    • datadryad.org
    • +1more
    zip
    Updated May 3, 2024
    Shannon H. Clarke; Elizabeth R. Lawrence; Jean-Michel Matte; Sarah J. Salisbury; Sozos N. Michaelides; Ramela Koumrouyan; Daniel E. Ruzzante; James W. A. Grant; Dylan J. Fraser (2024). Global contemporary effective population sizes across taxonomic groups [Dataset]. http://doi.org/10.5061/dryad.p2ngf1vzm
    Explore at:
    zip
    Available download formats
    Dataset updated
    May 3, 2024
    Dataset provided by
    Concordia University
    Dalhousie University
    Authors
    Shannon H. Clarke; Elizabeth R. Lawrence; Jean-Michel Matte; Sarah J. Salisbury; Sozos N. Michaelides; Ramela Koumrouyan; Daniel E. Ruzzante; James W. A. Grant; Dylan J. Fraser
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Effective population size (Ne) is a particularly useful metric for conservation as it affects genetic drift, inbreeding and adaptive potential within populations. Current guidelines recommend a minimum Ne of 50 and 500 to avoid short-term inbreeding and to preserve long-term adaptive potential, respectively. However, the extent to which wild populations reach these thresholds globally has not been investigated, nor has the relationship between Ne and human activities. Through a quantitative review, we generated a dataset with 4610 georeferenced Ne estimates from 3829 unique populations, extracted from 723 articles. These data show that certain taxonomic groups are less likely to meet 50/500 thresholds and are disproportionately impacted by human activities; plant, mammal, and amphibian populations had a <54% probability of reaching Ne ≥ 50 and a <9% probability of reaching Ne ≥ 500. Populations listed as being of conservation concern according to the IUCN Red List had a smaller median Ne than unlisted populations, and this was consistent across all taxonomic groups. Ne was reduced in areas with a greater Global Human Footprint, especially for amphibians, birds, and mammals; however, relationships varied between taxa. We also highlight several considerations for future works, including the role that gene flow and subpopulation structure play in the estimation of Ne in wild populations, and the need for finer-scale taxonomic analyses. Our findings provide guidance for more specific thresholds based on Ne and help prioritize assessment of populations from taxa most at risk of failing to meet conservation thresholds.

    Methods

    Literature search, screening, and data extraction

    A primary literature search was conducted using the ISI Web of Science Core Collection and any articles that referenced two popular single-sample Ne estimation software packages: LDNe (Waples & Do, 2008) and NeEstimator v2 (Do et al., 2014).
    The initial search included 4513 articles published up to the search date of May 26, 2020. Articles were screened for relevance in two steps, first based on title and abstract, and then based on the full text. For each step, a consistency check was performed using 100 articles to ensure they were screened consistently between reviewers (n = 6). We required a kappa score (Collaboration for Environmental Evidence, 2020) of ≥ 0.6 in order to proceed with screening of the remaining articles. Articles were screened based on three criteria: (1) is an estimate of Ne or Nb reported; (2) for a wild animal or plant population; (3) using a single-sample genetic estimation method. Further details on the literature search and article screening are found in the Supplementary Material (Fig. S1). We extracted data from all studies retained after both screening steps (title and abstract; full text). Each line of data entered in the database represents a single estimate from a population. Some populations had multiple estimates over several years, or from different estimation methods (see Table S1), and each of these was entered on a unique row in the database. Data on N̂e, N̂b, or N̂c were extracted from tables and figures using WebPlotDigitizer software version 4.3 (Rohatgi, 2020). A full list of data extracted is found in Table S2.

    Data Filtering

    After the initial data collation, correction, and organization, there was a total of 8971 Ne estimates (Fig. S1). We used regression analyses to compare Ne estimates on the same populations using different estimation methods (LD, Sibship, and Bayesian), and found that the R2 values were very low (R2 values of <0.1; Fig. S2 and Fig. S3). Given this inconsistency, and the fact that LD is the most frequently used method in the literature (74% of our database), we proceeded with only the LD estimates for our analyses.
    We further filtered the data to remove estimates where no sample size was reported or no bias correction (Waples, 2006) was applied (see Fig. S6 for more details). Ne is sometimes estimated to be infinite or negative within a population, which may reflect that a population is very large (i.e., where the drift signal-to-noise ratio is very low), and/or that there is low precision with the data due to small sample size or limited genetic marker resolution (Gilbert & Whitlock, 2015; Waples & Do, 2008; Waples & Do, 2010). We retained infinite and negative estimates only if they reported a positive lower confidence interval (LCI), and we used the LCI in place of a point estimate of Ne or Nb. We chose to use the LCI as a conservative proxy for Ne in cases where a point estimate could not be generated, given its relevance for conservation (Fraser et al., 2007; Hare et al., 2011; Waples & Do, 2008; Waples, 2023). We also compared results using the LCI to a dataset where infinite or negative values were all assumed to reflect very large populations, replacing the estimate with an arbitrary large value of 9,999 (for reference, in the LCI dataset only 51 estimates, or 0.9%, had an estimate > 9999). Using this 9999 dataset, we found that the main conclusions from the analyses remained the same as when using the LCI dataset, with the exception of the HFI analysis (see discussion in supplementary material; Table S3, Table S4, Fig. S4, S5). We also note that point estimates with an upper confidence interval of infinity (n = 1358) were larger on average (mean = 1380.82, compared to 689.44 and 571.64 for estimates with no CIs or with an upper boundary, respectively). Nevertheless, we chose to retain point estimates with an upper confidence interval of infinity because accounting for them in the analyses did not alter the main conclusions of our study and excluding them would have significantly decreased our sample size (Fig. S7, Table S5).
    We also retained estimates from populations that were reintroduced or translocated from a wild source (n = 309), whereas those from captive sources were excluded during article screening (see above). In exploratory analyses, the removal of these data did not influence our results, and many of these populations are relevant to real-world conservation efforts, as reintroductions and translocations are used to re-establish or support small, at-risk populations. We removed estimates based on duplication of markers (keeping estimates generated from SNPs when studies used both SNPs and microsatellites) and duplication of software (keeping estimates from NeEstimator v2 when studies used it alongside LDNe). Spatial and temporal replication were addressed with two separate datasets (see Table S6 for more information): the full dataset included spatially and temporally replicated samples, while these two types of replication were removed from the non-replicated dataset. Finally, for all populations included in our final datasets, we manually extracted their protection status according to the IUCN Red List of Threatened Species. Taxa were categorized as "Threatened" (Vulnerable, Endangered, Critically Endangered), "Nonthreatened" (Least Concern, Near Threatened), or "N/A" (Data Deficient, Not Evaluated).

    Mapping and Human Footprint Index (HFI)

    All populations were mapped in QGIS using the coordinates extracted from articles. The maps were created using a World Behrmann equal-area projection. For the summary maps, estimates were grouped into grid cells with an area of 250,000 km2 (roughly 500 km x 500 km, but the dimensions of each cell vary due to distortions from the projection). Within each cell, we generated the count and median of Ne. We used the Global Human Footprint dataset (WCS & CIESIN, 2005) to generate a value of human influence (HFI) for each population at its geographic coordinates.
The footprint ranges from zero (no human influence) to 100 (maximum human influence). Values were available in 1 km x 1 km grid cell size and were projected over the point estimates to assign a value of human footprint to each population. The human footprint values were extracted from the map into a spreadsheet to be used for statistical analyses. Not all geographic coordinates had a human footprint value associated with them (i.e., in the oceans and other large bodies of water), therefore marine fishes were not included in our HFI analysis. Overall, 3610 Ne estimates in our final dataset had an associated footprint value.
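    The handling of infinite and negative estimates described above can be sketched as a small decision helper. This is a hedged illustration only: the function name and the None-for-discard convention are ours, not the authors' actual pipeline.

```python
import math

def usable_ne(point_estimate, lower_ci):
    """Illustrative filtering rule: keep finite positive point estimates;
    for infinite or negative estimates, fall back to a positive lower
    confidence interval (LCI) as a conservative proxy; otherwise discard."""
    if (point_estimate is not None
            and math.isfinite(point_estimate)
            and point_estimate > 0):
        return point_estimate
    if lower_ci is not None and lower_ci > 0:
        return lower_ci  # conservative proxy for Ne
    return None  # estimate dropped from the dataset

print(usable_ne(250.0, 120.0))     # finite positive point estimate kept
print(usable_ne(math.inf, 180.0))  # infinite estimate -> positive LCI
print(usable_ne(-40.0, None))      # negative, no usable LCI -> discarded
```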

  17. Global Land Cover 1992-2020

    • cacgeoportal.com
    • climate.esri.ca
    • +4more
    Updated Apr 2, 2020
    Esri (2020). Global Land Cover 1992-2020 [Dataset]. https://www.cacgeoportal.com/datasets/1453082255024699af55c960bc3dc1fe
    Explore at:
    Dataset updated
    Apr 2, 2020
    Dataset authored and provided by
    Esri (http://esri.com/)
    Area covered
    Description

    This layer is a time series of the annual ESA CCI (Climate Change Initiative) land cover maps of the world. ESA has produced land cover maps for the years 1992-2020. These are available at the European Space Agency Climate Change Initiative website.

    Time Extent: 1992-2020
    Cell Size: 300 meter
    Source Type: Thematic
    Pixel Type: 8 Bit Unsigned
    Data Projection: GCS WGS84
    Mosaic Projection: Web Mercator Auxiliary Sphere
    Extent: Global
    Source: ESA Climate Change Initiative
    Update Cycle: Annual until 2020, no updates thereafter

    What can you do with this layer?

    This layer may be added to ArcGIS Online maps and applications and shown in a time series to watch a "time lapse" view of land cover change since 1992 for any part of the world. The same behavior exists when the layer is added to ArcGIS Pro. In addition to displaying all layers in a series, this layer may be queried so that only one year is displayed in a map.

    This layer can be used in analysis. For example, the layer may be added to ArcGIS Pro with a query set to display just one year. Then, an area count of land cover types may be produced for a feature dataset using the zonal statistics tool. Statistics may be compared with the statistics from other years to show a trend. To sum up area by land cover using this service, or for any other analysis, be sure to use an equal-area projection, such as Albers or Equal Earth.

    Different Classifications Available to Map

    Five processing templates are included in this layer. The processing templates may be used to display a smaller set of land cover classes.

    Cartographic Renderer (Default Template): Displays all ESA CCI land cover classes.*
    Forested Lands Template: Shows only forested lands (classes 50-90).
    Urban Lands Template: Shows only urban areas (class 190).
    Converted Lands Template: Shows only urban lands and lands converted to agriculture (classes 10-40 and 190).
    Simplified Renderer: Displays the map in ten simple classes which match the ten simplified classes used in 2050 Land Cover projections from Clark University.

    Any of these variables can be displayed or analyzed by selecting their processing template. In ArcGIS Online, select the Image Display Options on the layer, pull down the list of variables from the Renderer options, then click Apply and Close. In ArcGIS Pro, open the Layer Properties, select Processing Templates from the left-hand menu, and choose the variable to display from the Processing Template pull-down menu.

    Using Time

    By default, the map will display as a time series animation, one year per frame. A time slider will appear when you add this layer to your map. To see the most current data, move the time slider to the most current year. In addition to displaying the past quarter century of land cover maps as an animation, this time series can also display just one year of data by use of a definition query. For a step-by-step example using ArcGIS Pro on how to display just one year of this layer, as well as how to compare one year to another, see the blog called Calculating Impervious Surface Change.

    Hierarchical Classification

    Land cover types are defined using the land cover classification system (LCCS) developed by the United Nations FAO. It is designed to be as compatible as possible with other products, namely GLCC2000, GlobCover 2005 and 2009. This is a hierarchical classification system. For example, class 60 means "closed to open" canopy broadleaved deciduous tree cover. But in some places a more specific type of broadleaved deciduous tree cover may be available. In that case, a more specific code 61 or 62 may be used, which specifies "closed" (61) or "open" (62) cover.

    Land Cover Processing

    To provide consistency over time, these maps are produced from baseline land cover maps and are revised for changes each year depending on the best available satellite data from each period in time. These revisions were made from the AVHRR 1 km time series from 1992 to 1999, the SPOT-VGT time series between 1999 and 2013, and PROBA-V data for the years 2013, 2014 and 2015. When MERIS FR or PROBA-V time series are available, changes detected at 1 km are re-mapped at 300 m. The last step consists of back-dating and up-dating the 10-year baseline LC map to produce the 24 annual LC maps from 1992 to 2015.

    Source Data

    The datasets behind this layer were extracted from NetCDF and TIFF files produced by ESA. Years 1992-2015 were acquired from ESA CCI LC version 2.0.7 in TIFF format, and years 2016-2018 were acquired from version 2.1.1 in NetCDF format. These are downloadable from ESA with an account, after agreeing to their terms of use: https://maps.elie.ucl.ac.be/CCI/viewer/download.php

    Citation

    ESA. Land Cover CCI Product User Guide Version 2. Tech. Rep. (2017). Available at: maps.elie.ucl.ac.be/CCI/viewer/download/ESACCI-LC-Ph2-PUGv2_2.0.pdf

    More technical documentation on the source datasets is available here: https://cds.climate.copernicus.eu/cdsapp#!/dataset/satellite-land-cover?tab=doc

    *Index of all classes in this layer:

    10 Cropland, rainfed
    11 Herbaceous cover
    12 Tree or shrub cover
    20 Cropland, irrigated or post-flooding
    30 Mosaic cropland (>50%) / natural vegetation (tree, shrub, herbaceous cover) (<50%)
    40 Mosaic natural vegetation (tree, shrub, herbaceous cover) (>50%) / cropland (<50%)
    50 Tree cover, broadleaved, evergreen, closed to open (>15%)
    60 Tree cover, broadleaved, deciduous, closed to open (>15%)
    61 Tree cover, broadleaved, deciduous, closed (>40%)
    62 Tree cover, broadleaved, deciduous, open (15-40%)
    70 Tree cover, needleleaved, evergreen, closed to open (>15%)
    71 Tree cover, needleleaved, evergreen, closed (>40%)
    72 Tree cover, needleleaved, evergreen, open (15-40%)
    80 Tree cover, needleleaved, deciduous, closed to open (>15%)
    81 Tree cover, needleleaved, deciduous, closed (>40%)
    82 Tree cover, needleleaved, deciduous, open (15-40%)
    90 Tree cover, mixed leaf type (broadleaved and needleleaved)
    100 Mosaic tree and shrub (>50%) / herbaceous cover (<50%)
    110 Mosaic herbaceous cover (>50%) / tree and shrub (<50%)
    120 Shrubland
    121 Shrubland evergreen
    122 Shrubland deciduous
    130 Grassland
    140 Lichens and mosses
    150 Sparse vegetation (tree, shrub, herbaceous cover) (<15%)
    151 Sparse tree (<15%)
    152 Sparse shrub (<15%)
    153 Sparse herbaceous cover (<15%)
    160 Tree cover, flooded, fresh or brackish water
    170 Tree cover, flooded, saline water
    180 Shrub or herbaceous cover, flooded, fresh/saline/brackish water
    190 Urban areas
    200 Bare areas
    201 Consolidated bare areas
    202 Unconsolidated bare areas
    210 Water bodies
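    The area-count-by-class step described above is essentially a zonal histogram. A minimal numpy sketch, assuming a toy raster in an equal-area projection with 300 m cells and the ESA CCI class codes listed above:

```python
import numpy as np

# In an equal-area projection, each 300 m x 300 m cell covers 0.09 sq. km.
CELL_AREA_KM2 = 0.3 * 0.3

# Hypothetical tiny land cover raster (class codes from the ESA CCI index).
landcover = np.array([
    [10, 10, 190, 210],
    [10, 60, 190, 210],
    [30, 60,  61, 210],
])

# Count cells per class and convert counts to area.
classes, counts = np.unique(landcover, return_counts=True)
area_km2 = dict(zip(classes.tolist(), (counts * CELL_AREA_KM2).round(4).tolist()))
# Comparing area_km2 across years reveals land cover change trends.
```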

  18. Fractional Abundance Datasets for Salt Patches and Marshes Across the...

    • zenodo.org
    tiff
    Updated Jul 11, 2025
    + more versions
    Cite
    Manan Sarupria; Manan Sarupria; Pinki Mondal; Pinki Mondal; Rodrigo Vargas; Rodrigo Vargas; Matthew Walter; Matthew Walter; Jarrod Miller; Jarrod Miller (2025). Fractional Abundance Datasets for Salt Patches and Marshes Across the Delmarva Peninsula, v2 [Dataset]. http://doi.org/10.5281/zenodo.15866496
    Explore at:
    tiffAvailable download formats
    Dataset updated
    Jul 11, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Manan Sarupria; Manan Sarupria; Pinki Mondal; Pinki Mondal; Rodrigo Vargas; Rodrigo Vargas; Matthew Walter; Matthew Walter; Jarrod Miller; Jarrod Miller
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Delmarva Peninsula
    Description

    Abstract:

    Coastal agricultural lands in the eastern USA are increasingly plagued by escalating soil salinity, rendering them unsuitable for profitable farming. Saltwater intrusion into groundwater or soil salinization can lead to alterations in land cover, such as diminished plant growth, or complete land cover transformation. Two notable instances of such transformations include the conversion of farmland to marshland or to barren salt patches devoid of vegetation. However, quantifying these land cover changes across vast geographic areas poses a significant challenge due to their varying spatial granularity. To tackle this issue, a non-linear spectral unmixing approach utilizing a Random Forest (RF) algorithm was employed to quantify the fractional abundance of salt patches and marshes. Using 2022 Sentinel-2 imagery, gridded datasets for salt patches and marshes were generated across the Delmarva Peninsula (14 coastal counties in Delaware, Maryland and Virginia, USA), along with the associated uncertainty. The RF models were constructed using 100 trees and 27,437 reference data points, resulting in two sets of ten models: one for salt patches and another for marshes. Validation metrics for sub-pixel fractional abundances revealed a moderate R-squared value of 0.50 for the salt model ensemble and a high R-squared value of 0.90 for the marsh model ensemble. These models predicted a total area of 16.34 sq. km. for salt patches and 1,256.71 sq. km. for marshes. In these datasets, we only report fractional abundance values ranging from 0.4 to 1 for salt patches and 0.25 to 1 for marshes, along with the standard deviation associated with each value.


    --------------------------------------------


    This collection of gridded data layers provides the fractional abundance of salt patches and marshes for the year 2022 for 14 counties in the Delmarva Peninsula in the United States of America (USA). The collection comprises 10 single-band raster files:

    1. Five files for Fractional abundance mean: Salt patch – Mean of per-pixel fractional abundance from an ensemble of 10 RF models. Only pixels with salt patch fraction ≥ 0.40 were retained in this layer.

    2. Five files for Fractional abundance mean: Marsh – Mean of per-pixel fractional abundance from an ensemble of 10 RF models. Only pixels with marsh fraction ≥ 0.25 were retained in this layer.

    Input Data:

    This approach integrated Sentinel-2 Level-2A surface reflectance imagery (June, July, and August 2022), a global land use/land cover dataset from ESRI (Karra et al., 2021), a NAIP-derived Delmarva land cover dataset (Mondal et al., 2022), high-resolution PlanetScope true color images (Planet Team, 2017), very high-resolution Unoccupied Aerial Vehicle (UAV) imagery, and ground truth data.

    We derived several spectral indices (see table below) from the Sentinel-2 Level-2A bands and then used them as inputs to a Random Forest (RF) classifier in Python.

    Method:

    The research utilized Sentinel-2 Level-2A surface reflectance imagery for spectral unmixing. This multispectral dataset, corrected for atmospheric and radiometric effects, encompasses 13 spectral bands spanning visible to shortwave-infrared wavelengths (0.443-2.190 micrometers). The imagery offers spatial resolutions ranging from 10 m to 60 m, with a 5-day revisit time. To aid in selecting reference points for model training and testing, high-resolution (60 cm) UAV images of specific farmlands in Dorchester and Somerset counties, Maryland, were acquired under optimal weather conditions.

    The study incorporated multiple datasets to refine the analysis. The Sentinel-2 derived global land use/land cover dataset from ESRI was employed to isolate relevant land cover classes such as 'Crops' and 'Rangeland'. A NAIP-derived Delmarva land cover dataset with eight classes helped exclude non-agricultural land cover types. High-resolution PlanetScope true color images with 3 m spatial resolution were used as reference data for model validation.

    A composite image was generated from Sentinel-2 Level-2A images using a maximum Normalized Difference Vegetation Index (NDVI) filter. This composite was created from Sentinel-2 images captured between June 1 and August 30, 2022, retaining the pixels with the highest NDVI values. This approach effectively highlighted areas of reduced crop cover due to high salinity, even during peak growing season. Cloud masking was performed using Sentinel-2 cloud probability imagery, applying a 20% threshold for maximum cloud probability. The pre-processing of Sentinel-2 imagery was conducted on Google Earth Engine (GEE), a cloud-based geospatial data processing platform.

    NDVI = (Near infrared – Red) / (Near infrared + Red)

    The NDVI maximum composite incorporated seven original Sentinel-2 bands (R, G, B, Red-Edge 1 & 2, NIR, SWIR) and five additional indices. These indices included the Enhanced Vegetation Index (EVI), Moisture Stress Index (MSI), and Modified Soil Adjusted Vegetation Index (MSAVI). Furthermore, two new indices were developed for this study: the Normalized Difference Salt Patch Index (NDSPI) and Modified Salt Patch Index (MSPI). These novel indices were designed to enhance the spectral separability between salt patches and bare soil, maximizing the difference in values between these two land cover types.

    EVI (Enhanced Vegetation Index): 2.5 × (NIR - RED) / (NIR + 6 × RED - 7.5 × BLUE + 1)

    MSAVI (Modified Soil-Adjusted Vegetation Index): (2 × NIR + 1 - √((2 × NIR + 1)^2 - 8 × (NIR - RED))) / 2

    MSI (Moisture Stress Index): SWIR / NIR

    NDSPI (Normalized Difference Salt Patch Index): (SWIR - B) / (SWIR + B)

    MSPI (Modified Salt Patch Index): (R + G + B + NIR - SWIR) / (R + G + B + NIR + SWIR)
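    Assuming band values are surface reflectance in [0, 1], the indices in the table above can be evaluated directly with numpy. The helper name below is illustrative, not taken from the dataset's processing code:

```python
import numpy as np

def salt_indices(blue, green, red, nir, swir):
    """Evaluate NDVI plus the five auxiliary indices from the table above.

    Inputs are surface reflectance values (scalars or arrays) in [0, 1].
    """
    ndvi = (nir - red) / (nir + red)
    evi = 2.5 * (nir - red) / (nir + 6 * red - 7.5 * blue + 1)
    msavi = (2 * nir + 1 - np.sqrt((2 * nir + 1) ** 2 - 8 * (nir - red))) / 2
    msi = swir / nir
    ndspi = (swir - blue) / (swir + blue)
    mspi = (red + green + blue + nir - swir) / (red + green + blue + nir + swir)
    return {"NDVI": ndvi, "EVI": evi, "MSAVI": msavi,
            "MSI": msi, "NDSPI": ndspi, "MSPI": mspi}

# Illustrative reflectances: a bright salt patch is flat and bright across
# bands, while healthy crop reflects strongly in NIR (high NDVI, low MSI).
salt = salt_indices(blue=0.30, green=0.35, red=0.40, nir=0.45, swir=0.55)
crop = salt_indices(blue=0.03, green=0.06, red=0.04, nir=0.45, swir=0.15)
```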

    For the training process, we identified five common endmembers: salt patch, bare soil, crop, water, and marsh, which were present in and around the selected farmlands. Reference points for bare soil were defined as pixels of soil in farmlands that did not contain salt patches or crops. For salt, reference points were identified as pixels representing salt patches with little to no vegetation. These reference points were gathered using Sentinel-2 imagery, primarily captured on June 29, 2022, and were supplemented by additional UAV imagery from various dates. Farm locations were chosen based on the visibility of significant salt patches, with the imagery dates being as close as possible to the UAV flight dates. Additional ground truth data for land cover was collected during the summer of 2022 to enhance the remotely gathered points. In total, 27,437 reference points were collected for model training and testing: 239 for salt, 1,096 for bare soil, 5,198 for crops, 20,131 for water, and 773 for marsh. Out of these reference points, 142 (69 for salt, 23 for bare soil, and 50 for crops) were collected during field visits; the remainder was obtained digitally with visual support from PlanetLabs data.

    In this study, we applied a Random Forest (RF) classifier for nonlinear spectral unmixing. The RF classifier functions by utilizing an ensemble of decision trees that are independently trained on random subsets of training data through bootstrap aggregation. The final classification is determined by aggregating votes from all trees, with the endmember receiving the highest total votes being selected as the final output. To access soft voting information from the RF classifier, we used its probability prediction function called ‘predict_proba’. This function enables each decision tree to produce a probability distribution for each endmember instead of making a single class decision. The probability distribution from a decision tree indicates how likely it is that an input pixel belongs to each endmember. The final predicted probabilities are calculated by averaging these distributions across all decision trees for each of the five endmembers. As a result, each pixel in the final output is represented by five probability values that indicate the fractional abundance of each corresponding endmember within that pixel. These probabilities sum to one, effectively illustrating the spectral unmixing of a mixed pixel. A pixel value of 0 signifies the absence of a specific endmember, while a value of 1 indicates a pure pixel. Values between 0 and 1 reflect varying levels of mixed endmembers.
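    The soft-voting step can be sketched with scikit-learn's RandomForestClassifier, whose predict_proba performs exactly the tree-averaging described above. The data here are synthetic and the band/endmember setup is illustrative, not the dataset's actual training pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a 12-feature spectral stack with 5 endmember classes
# (the real models used 100 trees and tens of thousands of reference points).
X = rng.normal(size=(200, 12))
y = rng.integers(0, 5, size=200)   # 0=salt, 1=soil, 2=crop, 3=water, 4=marsh
X[np.arange(200), y] += 3.0        # shift one feature per class to make them separable

# Same 80/20 train/test split as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# predict_proba averages each tree's per-class distribution: one row per
# pixel, one column per endmember, with each row summing to 1. These are
# read as per-pixel fractional abundances.
fractions = rf.predict_proba(X_test)
```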

    The RF model used for salt patch unmixing included a total of 4,302 reference points: 239 for salt, 1,195 for crops, and 956 points each for bare soil, water, and marshes. The RF model for marsh unmixing utilized a total of 27,437 reference points: 239 for salt patches, 5,198 for crops, 1,096 for bare soil, 20,131 for water, and 773 for marshes. For both models, the input data was divided into 80% for training purposes and 20% for testing.

    Accuracy assessment:

    Validation of the salt patch model's predictions shows low Mean Squared Error (MSE) and Mean Absolute Error (MAE) values of 0.035 and 0.059, respectively (see table below). However, the model does not explain all of the variability in the data, as evidenced by the moderate R-squared value of 0.50.

    Parameter

    Salt (227 points)

    Marsh (761

  19. Fractional Abundance Datasets for Salt Patches and Marshes Across the...

    • zenodo.org
    tiff
    Updated Feb 7, 2025
    Cite
    Manan Sarupria; Manan Sarupria; Pinki Mondal; Pinki Mondal; Rodrigo Vargas; Rodrigo Vargas; Matthew Walter; Matthew Walter; Jarrod Miller; Jarrod Miller (2025). Fractional Abundance Datasets for Salt Patches and Marshes Across the Delmarva Peninsula, v1 [Dataset]. http://doi.org/10.5281/zenodo.14709313
    Explore at:
    tiffAvailable download formats
    Dataset updated
    Feb 7, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Manan Sarupria; Manan Sarupria; Pinki Mondal; Pinki Mondal; Rodrigo Vargas; Rodrigo Vargas; Matthew Walter; Matthew Walter; Jarrod Miller; Jarrod Miller
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Delmarva Peninsula
    Description

    Abstract:

    Coastal agricultural lands in the eastern USA are increasingly plagued by escalating soil salinity, rendering them unsuitable for profitable farming. Saltwater intrusion into groundwater or soil salinization can lead to alterations in land cover, such as diminished plant growth, or complete land cover transformation. Two notable instances of such transformations include the conversion of farmland to marshland or to barren salt patches devoid of vegetation. However, quantifying these land cover changes across vast geographic areas poses a significant challenge due to their varying spatial granularity. To tackle this issue, a non-linear spectral unmixing approach utilizing a Random Forest (RF) algorithm was employed to quantify the fractional abundance of salt patches and marshes. Using 2022 Sentinel-2 imagery, gridded datasets for salt patches and marshes were generated across the Delmarva Peninsula (14 coastal counties in Delaware, Maryland and Virginia, USA), along with the associated uncertainty. The RF models were constructed using 100 trees and 27,437 reference data points, resulting in two sets of ten models: one for salt patches and another for marshes. Validation metrics for sub-pixel fractional abundances revealed a moderate R-squared value of 0.50 for the salt model ensemble and a high R-squared value of 0.90 for the marsh model ensemble. These models predicted a total area of 16.34 sq. km. for salt patches and 1,256.71 sq. km. for marshes. In these datasets, we only report fractional abundance values ranging from 0.4 to 1 for salt patches and 0.25 to 1 for marshes, along with the standard deviation associated with each value.


    --------------------------------------------


    This collection of gridded data layers provides the fractional abundance of salt patches and marshes for the year 2022 for 14 counties in the Delmarva Peninsula in the United States of America (USA). The collection comprises 4 single-band raster files:

    1. Fractional abundance mean: Salt patch – Mean of per-pixel fractional abundance from an ensemble of 10 RF models. Only pixels with salt patch fraction ≥ 0.40 were retained in this layer.

    2. Standard deviation of fractional abundance means: Salt patch – Standard deviation of per-pixel fractional abundance means derived from an ensemble of 10 RF models.

    3. Fractional abundance mean: Marsh – Mean of per-pixel fractional abundance from an ensemble of 10 RF models. Only pixels with marsh fraction ≥ 0.25 were retained in this layer.

    4. Standard deviation of fractional abundance means: Marsh – Standard deviation of per-pixel fractional abundance means derived from an ensemble of 10 RF models.

    Input Data:

    This approach integrated Sentinel-2 Level-2A surface reflectance imagery (June, July, and August 2022), a global land use/land cover dataset from ESRI (Karra et al., 2021), a NAIP-derived Delmarva land cover dataset (Mondal et al., 2022), high-resolution PlanetScope true color images (Planet Team, 2017), very high-resolution Unoccupied Aerial Vehicle (UAV) imagery, and ground truth data.

    We derived several spectral indices (see table below) from the Sentinel-2 Level-2A bands and then used them as inputs to a Random Forest (RF) classifier in Python.

    Method:

    The research utilized Sentinel-2 Level-2A surface reflectance imagery for spectral unmixing. This multispectral dataset, corrected for atmospheric and radiometric effects, encompasses 13 spectral bands spanning visible to shortwave-infrared wavelengths (0.443-2.190 micrometers). The imagery offers spatial resolutions ranging from 10 m to 60 m, with a 5-day revisit time. To aid in selecting reference points for model training and testing, high-resolution (60 cm) UAV images of specific farmlands in Dorchester and Somerset counties, Maryland, were acquired under optimal weather conditions.

    The study incorporated multiple datasets to refine the analysis. The Sentinel-2 derived global land use/land cover dataset from ESRI was employed to isolate relevant land cover classes such as 'Crops' and 'Rangeland'. A NAIP-derived Delmarva land cover dataset with eight classes helped exclude non-agricultural land cover types. High-resolution PlanetScope true color images with 3 m spatial resolution were used as reference data for model validation.

    A composite image was generated from Sentinel-2 Level-2A images using a maximum Normalized Difference Vegetation Index (NDVI) filter. This composite was created from Sentinel-2 images captured between June 1 and August 30, 2022, retaining the pixels with the highest NDVI values. This approach effectively highlighted areas of reduced crop cover due to high salinity, even during peak growing season. Cloud masking was performed using Sentinel-2 cloud probability imagery, applying a 20% threshold for maximum cloud probability. The pre-processing of Sentinel-2 imagery was conducted on Google Earth Engine (GEE), a cloud-based geospatial data processing platform.

    NDVI = (Near infrared – Red) / (Near infrared + Red)

    The NDVI maximum composite incorporated seven original Sentinel-2 bands (R, G, B, Red-Edge 1 & 2, NIR, SWIR) and five additional indices. These indices included the Enhanced Vegetation Index (EVI), Moisture Stress Index (MSI), and Modified Soil Adjusted Vegetation Index (MSAVI). Furthermore, two new indices were developed for this study: the Normalized Difference Salt Patch Index (NDSPI) and Modified Salt Patch Index (MSPI). These novel indices were designed to enhance the spectral separability between salt patches and bare soil, maximizing the difference in values between these two land cover types.

    EVI (Enhanced Vegetation Index): 2.5 × (NIR - RED) / (NIR + 6 × RED - 7.5 × BLUE + 1)

    MSAVI (Modified Soil-Adjusted Vegetation Index): (2 × NIR + 1 - √((2 × NIR + 1)^2 - 8 × (NIR - RED))) / 2

    MSI (Moisture Stress Index): SWIR / NIR

    NDSPI (Normalized Difference Salt Patch Index): (SWIR - B) / (SWIR + B)

    MSPI (Modified Salt Patch Index): (R + G + B + NIR - SWIR) / (R + G + B + NIR + SWIR)

    For the training process, we identified five common endmembers: salt patch, bare soil, crop, water, and marsh, which were present in and around the selected farmlands. Reference points for bare soil were defined as pixels of soil in farmlands that did not contain salt patches or crops. For salt, reference points were identified as pixels representing salt patches with little to no vegetation. These reference points were gathered using Sentinel-2 imagery, primarily captured on June 29, 2022, and were supplemented by additional UAV imagery from various dates. Farm locations were chosen based on the visibility of significant salt patches, with the imagery dates being as close as possible to the UAV flight dates. Additional ground truth data for land cover was collected during the summer of 2022 to enhance the remotely gathered points. In total, 27,437 reference points were collected for model training and testing: 239 for salt, 1,096 for bare soil, 5,198 for crops, 20,131 for water, and 773 for marsh. Out of these reference points, 142 (69 for salt, 23 for bare soil, and 50 for crops) were collected during field visits; the remainder was obtained digitally with visual support from PlanetLabs data.

    In this study, we applied a Random Forest (RF) classifier for nonlinear spectral unmixing. The RF classifier functions by utilizing an ensemble of decision trees that are independently trained on random subsets of training data through bootstrap aggregation. The final classification is determined by aggregating votes from all trees, with the endmember receiving the highest total votes being selected as the final output. To access soft voting information from the RF classifier, we used its probability prediction function called ‘predict_proba’. This function enables each decision tree to produce a probability distribution for each endmember instead of making a single class decision. The probability distribution from a decision tree indicates how likely it is that an input pixel belongs to each endmember. The final predicted probabilities are calculated by averaging these distributions across all decision trees for each of the five endmembers. As a result, each pixel in the final output is represented by five probability values that indicate the fractional abundance of each corresponding endmember within that pixel. These probabilities sum to one, effectively illustrating the spectral unmixing of a mixed pixel. A pixel value of 0 signifies the absence of a specific endmember, while a value of 1 indicates a pure pixel. Values between 0 and 1 reflect varying levels of mixed endmembers.

    The RF model used for salt patch unmixing included a total of 4,302 reference points: 239 for salt, 1,195 for crops, and 956 points each for bare soil, water, and marshes. The RF model for marsh unmixing utilized a total of 27,437 reference points: 239 for salt patches, 5,198 for crops, 1,096 for bare soil, 20,131 for water, and 773 for marshes. For both models, the input data was divided into 80% for training purposes and 20% for testing.

    Accuracy assessment:

    Validation of the salt patch model's predictions shows low Mean Squared Error (MSE) and Mean Absolute Error (MAE)

  20. Data from: The Relative Importance of Domain Applicability Metrics for...

    • acs.figshare.com
    txt
    Updated May 31, 2023
    Cite
    Robert P. Sheridan (2023). The Relative Importance of Domain Applicability Metrics for Estimating Prediction Errors in QSAR Varies with Training Set Diversity [Dataset]. http://doi.org/10.1021/acs.jcim.5b00110.s001
    Explore at:
    txtAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    ACS Publications
    Authors
    Robert P. Sheridan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    In QSAR, a statistical model is generated from a training set of molecules (represented by chemical descriptors) and their biological activities (an “activity model”). The aim of the field of domain applicability (DA) is to estimate the uncertainty of prediction of a specific molecule on a specific activity model. A number of DA metrics have been proposed in the literature for this purpose. A quantitative model of the prediction uncertainty (an “error model”) can be built using one or more of these metrics. A previous publication from our laboratory (Sheridan, R. P. J. Chem. Inf. Model. 2013, 53, 2837−2850) suggested that QSAR methods such as random forest could be used to build error models by fitting unsigned prediction errors against DA metrics. The QSAR paradigm contains two useful techniques: descriptor importance can determine which DA metrics are most useful, and cross-validation can be used to tell which subset of DA metrics is sufficient to estimate the unsigned errors. Previously we studied 10 large, diverse data sets and seven DA metrics. For those data sets for which it is possible to build a significant error model from those seven metrics, only two metrics were sufficient to account for almost all of the information in the error model. These were TREE_SD (the variation of prediction among random forest trees) and PREDICTED (the predicted activity itself). In this paper we show that when data sets are less diverse, as for example in QSAR models of molecules in a single chemical series, these two DA metrics become less important in explaining prediction error, and the DA metric SIMILARITYNEAREST1 (the similarity of the molecule being predicted to the closest training set compound) becomes more important. 
    Our recommendation is that when the mean pairwise similarity within a QSAR training set (measured with the Carhart AP descriptor and the Dice similarity index) is less than 0.5, one can use only TREE_SD and PREDICTED to form the error model; otherwise, one should use TREE_SD, PREDICTED, and SIMILARITYNEAREST1.
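    A minimal sketch of such an error model: a random forest regressor fit on unsigned prediction errors against DA metrics, with descriptor importance read off feature_importances_. The metric names come from the text, but the data and coefficients below are invented for illustration (TREE_SD is made to drive the simulated error):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 500

# Hypothetical DA metrics for n predicted molecules (names from the paper).
tree_sd = rng.uniform(0.05, 1.0, n)        # variation of prediction among RF trees
predicted = rng.uniform(4.0, 9.0, n)       # the predicted activity itself
sim_nearest1 = rng.uniform(0.1, 1.0, n)    # similarity to closest training compound

# Simulated unsigned errors grow with TREE_SD and shrink with similarity,
# the qualitative pattern reported for low-diversity (single-series) sets.
unsigned_error = np.abs(0.8 * tree_sd + 0.4 * (1 - sim_nearest1)
                        + rng.normal(0, 0.05, n))

# Fit the "error model": unsigned error regressed on the DA metrics.
X = np.column_stack([tree_sd, predicted, sim_nearest1])
error_model = RandomForestRegressor(n_estimators=200, random_state=0)
error_model.fit(X, unsigned_error)

# Descriptor importance indicates which DA metrics carry the information.
importances = dict(zip(["TREE_SD", "PREDICTED", "SIMILARITYNEAREST1"],
                       error_model.feature_importances_))
```

    Here TREE_SD dominates by construction; on a real low-diversity training set the paper's finding is that SIMILARITYNEAREST1 gains importance instead.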
