99 datasets found
  1. Data example for vector processing

    • figshare.com
    zip
    Updated Oct 13, 2024
    Cite
    Alen Miranda (2024). Data example for vector processing [Dataset]. http://doi.org/10.6084/m9.figshare.27176223.v5
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 13, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Alen Miranda
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a .zip file containing multiple vector datasets. This material is intended solely for educational purposes and forms part of the "Análisis Geoespacial con Python" course (2024).
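
    For the course's Python workflow, such a bundle can be read directly with GeoPandas. A minimal sketch, assuming hypothetical archive and layer names (the real names are inside the zip):

    import geopandas as gpd

    # Hypothetical names: replace with the actual archive and shapefile from the bundle
    gdf = gpd.read_file("zip://data_example.zip!rivers.shp")
    print(gdf.crs)     # coordinate reference system of the layer
    print(gdf.head())  # first few features: geometry plus attributes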

  2. Data Bundle for PyPSA-Eur: An Open Optimisation Model of the European...

    • zenodo.org
    • data.niaid.nih.gov
    xz, zip
    Updated Jul 17, 2024
    + more versions
    Cite
    Jonas Hörsch; Fabian Hofmann; David Schlachtberger; Philipp Glaum; Fabian Neumann; Tom Brown; Iegor Riepin; Bobby Xiong (2024). Data Bundle for PyPSA-Eur: An Open Optimisation Model of the European Transmission System [Dataset]. http://doi.org/10.5281/zenodo.12760663
    Explore at:
    Available download formats: zip, xz
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jonas Hörsch; Fabian Hofmann; David Schlachtberger; Philipp Glaum; Fabian Neumann; Tom Brown; Iegor Riepin; Bobby Xiong
    Description

    PyPSA-Eur is an open model dataset of the European power system at the transmission network level that covers the full ENTSO-E area. It can be built using the code provided at https://github.com/PyPSA/PyPSA-eur.

    It contains alternating current lines at and above 220 kV voltage level and all high voltage direct current lines, substations, an open database of conventional power plants, time series for electrical demand and variable renewable generator availability, and geographic potentials for the expansion of wind and solar power.

    Not all data dependencies are shipped with the code repository, since git is not suited for handling large changing files. Instead we provide separate data bundles to be downloaded and extracted as noted in the documentation.

    This is the full data bundle to be used for rigorous research. It includes large bathymetry and natural protection area datasets.
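
    A minimal sketch, in Python, of downloading and extracting the bundle by hand, assuming a hypothetical file name on the Zenodo record (the workflow described in the documentation is the supported route):

    import io
    import requests
    import tarfile

    # Hypothetical file name: take the actual bundle link from the Zenodo record
    url = "https://zenodo.org/record/12760663/files/data-bundle.tar.xz"
    resp = requests.get(url)
    resp.raise_for_status()

    # Extract the xz-compressed bundle into a local data/ directory
    with tarfile.open(fileobj=io.BytesIO(resp.content), mode="r:xz") as tar:
        tar.extractall(path="data")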

    While the code in PyPSA-Eur is released as free software under the MIT License, different licenses and terms of use apply to the various input data, which are summarised below:

    corine/*

    Access to data is based on a principle of full, open and free access as established by the Copernicus data and information policy Regulation (EU) No 1159/2013 of 12 July 2013. This regulation establishes registration and licensing conditions for GMES/Copernicus users and can be found here. Free, full and open access to this data set is made on the conditions that:

    • When distributing or communicating Copernicus dedicated data and Copernicus service information to the public, users shall inform the public of the source of that data and information.

    • Users shall make sure not to convey the impression to the public that the user's activities are officially endorsed by the Union.

    • Where that data or information has been adapted or modified, the user shall clearly state this.

    • The data remain the sole property of the European Union. Any information and data produced in the framework of the action shall be the sole property of the European Union. Any communication and publication by the beneficiary shall acknowledge that the data were produced “with funding by the European Union”.

    eez/*

    Marine Regions’ products are licensed under CC-BY-NC-SA. Please contact us for other uses of the Licensed Material beyond license terms. We kindly request our users not to make our products available for download elsewhere and to always refer to marineregions.org for the most up-to-date products and services.

    natura/*

    EEA standard re-use policy: unless otherwise indicated, re-use of content on the EEA website for commercial or non-commercial purposes is permitted free of charge, provided that the source is acknowledged (https://www.eea.europa.eu/legal/copyright). Copyright holder: Directorate-General for Environment (DG ENV).

    naturalearth/*

    All versions of Natural Earth raster + vector map data found on this website are in the public domain. You may use the maps in any manner, including modifying the content and design, electronic dissemination, and offset printing. The primary authors, Tom Patterson and Nathaniel Vaughn Kelso, and all other contributors renounce all financial claim to the maps and invite you to use them for personal, educational, and commercial purposes.

    No permission is needed to use Natural Earth. Crediting the authors is unnecessary.

    NUTS_2013_60M_SH/*

    In addition to the general copyright and licence policy applicable to the whole Eurostat website, the following specific provisions apply to the datasets you are downloading. The download and usage of these data is subject to the acceptance of the following clauses:

    1. The Commission agrees to grant the non-exclusive and not transferable right to use and process the Eurostat/GISCO geographical data downloaded from this page (the "data").

    2. The permission to use the data is granted on condition that: the data will not be used for commercial purposes; the source will be acknowledged. A copyright notice, as specified below, will have to be visible on any printed or electronic publication using the data downloaded from this page.

    gebco/GEBCO_2014_2D.nc

    The GEBCO Grid is placed in the public domain and may be used free of charge. Use of the GEBCO Grid indicates that the user accepts the conditions of use and disclaimer information given below.

    Users are free to:

    • Copy, publish, distribute and transmit The GEBCO Grid

    • Adapt The GEBCO Grid

    • Commercially exploit The GEBCO Grid, by, for example, combining it with other information, or by including it in their own product or application

    Users must:

    • Acknowledge the source of The GEBCO Grid. A suitable form of attribution is given in the documentation that accompanies The GEBCO Grid.

    • Not use The GEBCO Grid in a way that suggests any official status or that GEBCO, or the IHO or IOC, endorses any particular application of The GEBCO Grid.

    • Not mislead others or misrepresent The GEBCO Grid or its source.

    je-e-21.03.02.xls

    Information on the websites of the Federal Authorities is accessible to the public. Downloading, copying or integrating content (texts, tables, graphics, maps, photos or any other data) does not entail any transfer of rights to the content.

    Copyright and any other rights relating to content available on the websites of the Federal Authorities are the exclusive property of the Federal Authorities or of any other expressly mentioned owners.

    Any reproduction requires the prior written consent of the copyright holder. The source of the content (statistical results) should always be given.

  3. Data from: Reference Measurements of Error Vector Magnitude

    • catalog.data.gov
    • data.nist.gov
    Updated Jul 29, 2022
    + more versions
    Cite
    National Institute of Standards and Technology (2022). Reference Measurements of Error Vector Magnitude [Dataset]. https://catalog.data.gov/dataset/reference-measurements-of-error-vector-magnitude
    Explore at:
    Dataset updated
    Jul 29, 2022
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    The experiment here was to demonstrate that we can reliably measure the Reference Waveforms designed in the IEEE P1765 proposed standard and calculate EVM along with the associated uncertainties. The measurements were performed using NIST's calibrated sampling oscilloscope and were traceable to the primary standards. We have uploaded the following two datasets. (1) Table 3 contains the EVM values (in %) for Reference Waveforms 1-7 after performing the uncertainty analyses. The Monte Carlo means are also compared with the ideal values from the calculations in the IEEE P1765 standard. (2) Figure 3 shows the complete EVM distribution upon performing uncertainty analysis for Reference Waveform 3 as an example. Each of the entries in Table 3 is associated with an EVM distribution similar to that shown in Fig. 3.
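
    For context, EVM is conventionally defined as the RMS magnitude of the error vector between measured and reference symbols, normalized to the reference RMS. A minimal numeric sketch with toy QPSK symbols (not the NIST reference waveforms; normalization conventions vary across standards):

    import numpy as np

    def evm_percent(measured, reference):
        # RMS error vector magnitude, normalized to the reference RMS, in percent
        err = measured - reference
        return 100 * np.sqrt(np.mean(np.abs(err) ** 2) / np.mean(np.abs(reference) ** 2))

    # Toy QPSK constellation with a small amount of complex noise (illustrative only)
    rng = np.random.default_rng(0)
    ref = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
    meas = ref + 0.01 * (rng.standard_normal(4) + 1j * rng.standard_normal(4))
    print(f"EVM = {evm_percent(meas, ref):.2f} %")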

  4. NewsMediaBias-Plus Dataset

    • zenodo.org
    • huggingface.co
    bin, zip
    Updated Nov 29, 2024
    Cite
    Shaina Raza (2024). NewsMediaBias-Plus Dataset [Dataset]. http://doi.org/10.5281/zenodo.13961155
    Explore at:
    Available download formats: bin, zip
    Dataset updated
    Nov 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shaina Raza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NewsMediaBias-Plus Dataset

    Overview

    The NewsMediaBias-Plus dataset is designed for the analysis of media bias and disinformation by combining textual and visual data from news articles. It aims to support research in detecting, categorizing, and understanding biased reporting in media outlets.

    Dataset Description

    NewsMediaBias-Plus pairs news articles with relevant images and annotations indicating perceived biases and the reliability of the content. It adds a multimodal dimension for bias detection in news media.

    Contents

    • unique_id: Unique identifier for each news item. Each unique_id matches an image for the same article.
    • outlet: The publisher of the article.
    • headline: The headline of the article.
    • article_text: The full content of the news article.
    • image_description: Description of the paired image.
    • image: The file path of the associated image.
    • date_published: The date the article was published.
    • source_url: The original URL of the article.
    • canonical_link: The canonical URL of the article.
    • new_categories: Categories assigned to the article.
    • news_categories_confidence_scores: Confidence scores for each category.

    Annotation Labels

    • text_label: Indicates the likelihood of the article being disinformation:

      • Likely: Likely to be disinformation.
      • Unlikely: Unlikely to be disinformation.
    • multimodal_label: Indicates the likelihood of disinformation from the combination of the text snippet and image content:

      • Likely: Likely to be disinformation.
      • Unlikely: Unlikely to be disinformation.

    Getting Started

    Prerequisites

    • Python 3.6+
    • Pandas
    • Hugging Face Datasets
    • Hugging Face Hub

    Installation

    Load the dataset into Python:

    from datasets import load_dataset

    ds = load_dataset("vector-institute/newsmediabias-plus")
    print(ds)               # View structure and splits
    print(ds['train'][0])   # Access the first record of the train split
    print(ds['train'][:5])  # Access the first five records

    Load a Few Records

    from datasets import load_dataset

    # Load the dataset in streaming mode
    streamed_dataset = load_dataset("vector-institute/newsmediabias-plus", streaming=True)

    # Take the first five records of the train split as an iterable
    dataset_iterable = streamed_dataset['train'].take(5)

    for record in dataset_iterable:
        print(record)
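
    Filter by Annotation Label

    A minimal sketch of subsetting by the annotation labels described above, using the datasets filter API (field names as listed under Contents):

    from datasets import load_dataset

    ds = load_dataset("vector-institute/newsmediabias-plus", split="train")

    # Keep only articles whose text annotation is "Likely" disinformation
    likely = ds.filter(lambda record: record["text_label"] == "Likely")
    print(len(likely), "of", len(ds), "articles are annotated Likely")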

    Contributions

    Contributions are welcome! You can:

    • Add Data: Contribute more data points.
    • Refine Annotations: Improve annotation accuracy.
    • Share Usage Examples: Help others use the dataset effectively.

    To contribute, fork the repository and create a pull request with your changes.

    License

    This dataset is released under a non-commercial license. See the LICENSE file for more details.

    Citation

    Please cite the dataset using this BibTeX entry:

    @misc{vector_institute_2024_newsmediabias_plus,
      title={NewsMediaBias-Plus: A Multimodal Dataset for Analyzing Media Bias},
      author={Vector Institute Research Team},
      year={2024},
      url={https://huggingface.co/datasets/vector-institute/newsmediabias-plus}
    }

    Contact

    For questions or support, contact Shaina Raza at: shaina.raza@vectorinstitute.ai

    Disclaimer and User Guidance

    Disclaimer: The labels Likely and Unlikely are based on LLM annotations and expert assessments, intended for informational use only. They should not be considered final judgments.

    Guidance: This dataset is for research purposes. Cross-reference findings with other reliable sources before drawing conclusions. The dataset aims to encourage critical thinking, not provide definitive classifications.

  5. Supporting information for Neural Network Embeddings based Similarity Search...

    • kilthub.cmu.edu
    txt
    Updated Jun 3, 2022
    Cite
    Yilin Yang; Mingjie Liu; John Kitchin (2022). Supporting information for Neural Network Embeddings based Similarity Search Method for Catalyst Systems [Dataset]. http://doi.org/10.1184/R1/19968323.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 3, 2022
    Dataset provided by
    Carnegie Mellon University
    Authors
    Yilin Yang; Mingjie Liu; John Kitchin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this repository, we include code to prepare the dataset, train the GemNet model, build the FAISS index, search the index, and visualize the search results in the notebook faiss-gemnet-qm9-mp.ipynb. It reproduces our examples in the manuscript for the QM9 and Materials Project datasets. For the OC20 dataset, we did not include the related data here because of its large size (> 50 GB); the code to process the OC20 dataset is almost the same as the code included in the notebook for the QM9 dataset.

    We include the intermediate data (GemNet checkpoints, lmdb files, the FAISS index, and the search results) for the QM9 and Materials Project datasets in the directory example-data. We also put the GemNet checkpoint for the OC20 dataset in this directory. The training and evaluation of the Gaussian process regression model using the searched molecules for the query benzene are demonstrated in the ben-gp-data directory, in which qm9-gp-gemnet-morgan-random-nrg.ipynb can be run on Colab.
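
    As an illustration of the index-and-search step, a minimal FAISS sketch with random vectors standing in for the GemNet embeddings (the notebook's index type and embedding dimension may differ):

    import numpy as np
    import faiss

    d = 128                                      # embedding dimension (example value)
    embeddings = np.random.rand(1000, d).astype("float32")

    index = faiss.IndexFlatL2(d)                 # exact L2 nearest-neighbor index
    index.add(embeddings)                        # index the embedding vectors

    # 5 nearest neighbors of the first embedding (itself included at distance 0)
    distances, ids = index.search(embeddings[:1], 5)
    print(ids)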

  6. Data for: MACHINE LEARNING IN MEDICINE: CLASSIFICATION AND PREDICTION OF...

    • data.mendeley.com
    Updated Jul 2, 2019
    Cite
    Gopi Battineni (2019). Data for: MACHINE LEARNING IN MEDICINE: CLASSIFICATION AND PREDICTION OF DEMENTIA BY SUPPORT VECTOR MACHINES (SVM) [Dataset]. http://doi.org/10.17632/tsy6rbc5d4.1
    Explore at:
    Dataset updated
    Jul 2, 2019
    Authors
    Gopi Battineni
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    This set consists of a longitudinal collection of 150 subjects aged 60 to 96. Each subject was scanned on two or more visits, separated by at least one year for a total of 373 imaging sessions. For each subject, 3 or 4 individual T1-weighted MRI scans obtained in single scan sessions are included. The subjects are all right-handed and include both men and women. 72 of the subjects were characterized as nondemented throughout the study. 64 of the included subjects were characterized as demented at the time of their initial visits and remained so for subsequent scans, including 51 individuals with mild to moderate Alzheimer’s disease. Another 14 subjects were characterized as nondemented at the time of their initial visit and were subsequently characterized as demented at a later visit.
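
    A minimal scikit-learn sketch of the kind of SVM classification this dataset supports, with random values standing in for the real features and labels:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Hypothetical feature matrix (e.g., age and MRI-derived measures) and labels
    rng = np.random.default_rng(0)
    X = rng.random((150, 4))
    y = rng.integers(0, 2, 150)          # 0 = nondemented, 1 = demented

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(X_tr, y_tr)
    print("held-out accuracy:", clf.score(X_te, y_te))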

  7. Dataset of acoustic intensity vector measurements around an upscaled ear...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 9, 2023
    Cite
    Geldert, Aaron (2023). Dataset of acoustic intensity vector measurements around an upscaled ear model [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7564879
    Explore at:
    Dataset updated
    Apr 9, 2023
    Dataset provided by
    Geldert, Aaron
    Pulkki, Ville
    Marschall, Marton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset of acoustic vector (particle velocity vector and scalar sound pressure) measurements of the sound field around an upscaled model of an ear. Data collected in July 2022 at the Aalto Acoustics Lab in Espoo, Finland.
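
    For orientation, the active intensity vector is the time average of sound pressure times the particle velocity vector. A minimal numpy sketch with synthetic signals (the dataset's actual file layout is described in the companion paper):

    import numpy as np

    fs = 48000
    t = np.arange(fs) / fs
    p = np.sin(2 * np.pi * 1000 * t)         # sound pressure, Pa (toy signal)
    v = np.stack([
        0.1 * np.sin(2 * np.pi * 1000 * t),  # particle velocity x component, m/s
        np.zeros_like(t),                    # y component
        np.zeros_like(t),                    # z component
    ])

    I = (p * v).mean(axis=1)  # time-averaged active intensity vector, W/m^2
    print(I)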

    See the companion paper at AES for information about the contents of the dataset, measurement methodology, and example scripts.

    See the companion repository github.com/aaron-geldert/upscaled-ear-model-scripts for example MATLAB scripts using the dataset.

    Correspondence should be directed to Aaron Geldert (aarongeldert@gmail.com).

  8. RICO dataset

    • kaggle.com
    Updated Dec 2, 2021
    Cite
    Onur Gunes (2021). RICO dataset [Dataset]. https://www.kaggle.com/onurgunes1993/rico-dataset/discussion
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 2, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Onur Gunes
    Description

    Context

    Data-driven models help mobile app designers understand best practices and trends, and can be used to make predictions about design performance and support the creation of adaptive UIs. This paper presents Rico, the largest repository of mobile app designs to date, created to support five classes of data-driven applications: design search, UI layout generation, UI code generation, user interaction modeling, and user perception prediction. To create Rico, we built a system that combines crowdsourcing and automation to scalably mine design and interaction data from Android apps at runtime. The Rico dataset contains design data from more than 9.3k Android apps spanning 27 categories. It exposes visual, textual, structural, and interactive design properties of more than 66k unique UI screens. To demonstrate the kinds of applications that Rico enables, we present results from training an autoencoder for UI layout similarity, which supports query-by-example search over UIs.

    Content

    Rico was built by mining Android apps at runtime via human-powered and programmatic exploration. Like its predecessor ERICA, Rico’s app mining infrastructure requires no access to — or modification of — an app’s source code. Apps are downloaded from the Google Play Store and served to crowd workers through a web interface. When crowd workers use an app, the system records a user interaction trace that captures the UIs visited and the interactions performed on them. Then, an automated agent replays the trace to warm up a new copy of the app and continues the exploration programmatically, leveraging a content-agnostic similarity heuristic to efficiently discover new UI states. By combining crowdsourcing and automation, Rico can achieve higher coverage over an app’s UI states than either crawling strategy alone. In total, 13 workers recruited on UpWork spent 2,450 hours using apps on the platform over five months, producing 10,811 user interaction traces. After collecting a user trace for an app, we ran the automated crawler on the app for one hour.

    Acknowledgements

    UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN https://interactionmining.org/rico

    Inspiration

    The Rico dataset is large enough to support deep learning applications. We trained an autoencoder to learn an embedding for UI layouts, and used it to annotate each UI with a 64-dimensional vector representation encoding visual layout. This vector representation can be used to compute structurally — and often semantically — similar UIs, supporting example-based search over the dataset. To create training inputs for the autoencoder that embed layout information, we constructed a new image for each UI capturing the bounding box regions of all leaf elements in its view hierarchy, differentiating between text and non-text elements. Rico’s view hierarchies obviate the need for noisy image processing or OCR techniques to create these inputs.
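
    A minimal sketch of example-based search over such 64-dimensional layout vectors, with random data standing in for the released embeddings:

    import numpy as np

    # Hypothetical embedding matrix: one 64-d vector per UI screen
    rng = np.random.default_rng(0)
    embeddings = rng.random((66000, 64)).astype("float32")

    query = embeddings[0]                               # query by a known UI screen
    dists = np.linalg.norm(embeddings - query, axis=1)  # Euclidean distance to all screens
    nearest = np.argsort(dists)[1:6]                    # 5 most similar screens, excluding the query
    print(nearest)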

  9. Topographical Vector Data – Northern Cape - Dataset - SASDI EMC-DCPR...

    • catalogue-staging.sasdi.gov.za
    Updated Oct 8, 2021
    Cite
    (2021). Topographical Vector Data – Northern Cape - Dataset - SASDI EMC-DCPR Staging [Dataset]. https://catalogue-staging.sasdi.gov.za/dataset/topographical-vector-data-a-northern-cape
    Explore at:
    Dataset updated
    Oct 8, 2021
    Area covered
    Northern Cape
    Description

    NGI topographical vector data content is arranged according to the following themes: Cultural, Hydrography, Hypsography, Land Cover Land Use, Physiographic and Transportation. Cultural data content contains features that describe the cultural geography of human settlement; it represents the cultural ecology (human adaptation to social and physical environments) within a country. Hydrography data content contains hydrological and coastal features: hydrological features represent accumulations of water on the land surface, including man-made accumulations such as dams and reservoirs, while coastal features represent the physical relationship between the land and the sea. Hypsography data content contains topographical features that represent the measurement of elevation above mean sea level; typical examples are contours and spot heights. Land Cover Land Use data content contains features that describe the land use and land cover within a country; the land use is presented as a vector dataset, unlike the land cover theme, which is raster based, and this dataset only contains vector data for both land use and land cover. Physiographic data content covers landforms: typical examples of natural landforms are Boulder, Cave, Cliff, Donga, Dune, Eroded Area, Gorge and Mountain, and typical examples of artificial landforms are Cutting, Embankment, Excavation, Mine Dump and Open Cast Mine. Transportation data content contains features that represent the transportation facilities and transportation nodes of a country; it represents a connected network of passages that facilitate the transport and movement of goods and people. The dataset coverage is national, and each feature instance carries attribute data that describe the classification, capture method, capture source, CUID (custodian unique ID) and the vintage of the source, and describe the correspondence with SAGDaD (South African Geospatial Data Dictionary), a feature content dictionary for South Africa (SANS 1880).

  10. Data from: Exploiting hierarchy in medical concept embedding

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Oct 27, 2021
    Cite
    Anthony Finch; Alexander Crowell; Mamta Bhatia; Pooja Parameshwarappa; Yung-Chieh Chang; Jose Martinez; Michael Horberg (2021). Exploiting hierarchy in medical concept embedding [Dataset]. http://doi.org/10.5061/dryad.v9s4mw6v0
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 27, 2021
    Dataset provided by
    Mid-Atlantic Permanente Research Institute
    Mid-Atlantic Permanente Medical Group
    Authors
    Anthony Finch; Alexander Crowell; Mamta Bhatia; Pooja Parameshwarappa; Yung-Chieh Chang; Jose Martinez; Michael Horberg
    License

    CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)

    Description

    Objective

    To construct and publicly release a set of medical concept embeddings for codes following the ICD-10 coding standard which explicitly incorporate hierarchical information from medical codes into the embedding formulation.

    Materials and Methods

    We trained concept embeddings using several new extensions to the Word2Vec algorithm on a dataset of approximately 600,000 patients from a major integrated healthcare organization in the Mid-Atlantic US. Our concept embeddings included additional entities to account for the medical categories assigned to codes by the Clinical Classification Software Revised (CCSR) dataset. We compare these results to sets of publicly released pretrained embeddings and alternative training methodologies.

    Results

    We found that Word2Vec models which included hierarchical data outperformed ordinary Word2Vec alternatives on tasks which compared naïve clusters to canonical ones provided by CCSR. Our Skip-Gram model with both codes and categories achieved 61.4% Normalized Mutual Information with canonical labels in comparison to 57.5% with traditional Skip-Gram. In models operating on two different outcomes we found that including hierarchical embedding data improved classification performance 96.2% of the time. When controlling for all other variables, we found that co-training embeddings improved classification performance 66.7% of the time. We found that all models outperformed our competitive benchmarks.

    Discussion

    We found significant evidence that our proposed algorithms can express the hierarchical structure of medical codes more fully than ordinary Word2Vec models, and that this improvement carries forward into classification tasks. As part of this publication, we have released several sets of pretrained medical concept embeddings using the ICD-10 standard which significantly outperform other well-known pretrained vectors on our tested outcomes.

    Methods

    This dataset includes trained medical concept embeddings for 5428 ICD-10 codes and 394 Clinical Classification Software (Revised) (CCSR) categories. We include several different sets of concept embeddings, each trained using a slightly different set of hyperparameters and algorithms.

    To train our models, we employed data from the Kaiser Permanente Mid-Atlantic States (KPMAS) medical system. KPMAS is an integrated medical system serving approximately 780,000 members in Maryland, Virginia, and the District of Columbia. KPMAS has a comprehensive Electronic Medical Record system which includes data from all patient interactions with primary or specialty caregivers, from which all data is derived. Our embeddings training set included diagnoses allocated to all adult patients in calendar year 2019.

    For each code, we also recovered an associated category, as assigned by the Clinical Classification Software (Revised).

    We trained 12 sets of embeddings using classical Word2Vec models with settings differing across three parameters. Our first parameter was the selection of training algorithm, where we trained both CBOW and SG models. Each model was trained using dimension k of 10, 50, and 100. Furthermore, each model-dimension combination was trained with categories and codes trained separately and together (referred to hereafter as ‘co-trained embeddings’ or ‘co-embeddings’). Each model was trained for 10 iterations. We employed an arbitrarily large context window (100), since all codes necessarily occurred within a short period (1 year).

    We also trained a set of validation embeddings only on ICD-10 codes using the Med2Vec architecture as a comparison. We trained the Med2Vec model on our data using its default settings, including the default vector size (200) and a training regime of 10 epochs. We grouped all codes occurring on the same calendar date as Med2Vec ‘visits.’ Our Med2Vec model benchmark did not include categorical entities or other novel innovations.

    Word2Vec embeddings were generated using the Gensim package in Python. Med2Vec embeddings were generated using the Med2Vec code published by Choi. The JSON files used in this repository were generated using the JSON package in Python.
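
    A minimal Gensim sketch of the co-training idea, mixing hypothetical ICD-10 codes and made-up category tokens in one sequence (the paper's exact preprocessing differs):

    from gensim.models import Word2Vec

    # Hypothetical corpus: each patient-year is one "sentence" of ICD-10 codes
    # plus CCSR category tokens (the token names here are made up)
    sequences = [
        ["E11.9", "CAT-END", "I10", "CAT-CIR"],
        ["J45.909", "CAT-RSP", "I10", "CAT-CIR"],
    ]

    model = Word2Vec(
        sequences,
        vector_size=100,  # embedding dimension k
        window=100,       # arbitrarily large context window, as described above
        sg=1,             # Skip-Gram (use sg=0 for CBOW)
        epochs=10,
        min_count=1,
    )
    print(model.wv["I10"][:5])  # first components of one code's embedding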

  11. Vector Tiles for water supply systems in Narok, Kenya - Dataset - openAFRICA...

    • open.africa
    Updated Nov 29, 2020
    + more versions
    Cite
    (2020). Vector Tiles for water supply systems in Narok, Kenya - Dataset - openAFRICA [Dataset]. https://open.africa/dataset/narok-water-vectortiles
    Explore at:
    Dataset updated
    Nov 29, 2020
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Narok, Kenya
    Description

    This is open data for rural water supply systems managed by Narok Water and Sewerage Services Co., Ltd. in Kenya. The data format is Mapbox Vector Tiles; you can use it in a browser with Mapbox GL JS or Leaflet, and QGIS or ArcGIS can also handle the Mapbox Vector Tiles format. You can see the specification of these vector tiles here. The vector tiles are served from GitHub Pages at the URL template https://narwassco.github.io/vt/tiles/{z}/{x}/{y}.mvt, and you can use the data together with your own Mapbox style.json; some example style.json files are on our website. Alternatively, you can use QGIS 3.14 or above, which officially supports vector tiles: download narok.mbtiles from openAFRICA and simply drag and drop it into QGIS. Please let us know if you have any problems using our data. We would also like to know your use cases for our water vector tiles. Enjoy our vector tiles!
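
    A minimal Python sketch of fetching and decoding one tile from the endpoint above; the z/x/y values are arbitrary examples and may fall outside the covered area, and mapbox-vector-tile is a third-party decoder:

    import requests
    import mapbox_vector_tile  # pip install mapbox-vector-tile

    url = "https://narwassco.github.io/vt/tiles/{z}/{x}/{y}.mvt".format(z=10, x=597, y=511)
    resp = requests.get(url)
    if resp.ok and resp.content:
        layers = mapbox_vector_tile.decode(resp.content)
        print(list(layers))  # layer names present in this tile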

  12. Educational Attainment in North Carolina Public Schools: Use of statistical...

    • data.mendeley.com
    Updated Nov 14, 2018
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Explore at:
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    North Carolina
    Description

    The purpose of data mining analysis is always to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset; before doing any work on the data, it has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance did not improve much. The reason may be that the features we selected for clustering are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

    From the dimensionality reduction perspective: clustering is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension can lose a lot of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses nearly all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

    From the creating new features perspective: clustering analysis creates labels based on the patterns of the data, and it brings uncertainties into the data. When using clustering prior to classification, the decision on the number of clusters will strongly affect the performance of the clustering, and in turn the performance of classification. If the subset of features we apply clustering to is well suited for it, it might increase the overall classification performance. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better.

    We did not lock in the clustering outputs using a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the selected methods at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering to the data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to continue to revise the models from time to time as things change.
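
    A minimal scikit-learn sketch of the comparison described above, with synthetic data standing in for the school datasets: cluster labels as a single replacement feature versus a PCA projection, scored with the same classifier:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.random((500, 20))          # synthetic pre-processed features
    y = rng.integers(0, 2, 500)

    # (a) cluster assignments used as a single replacement feature
    labels = KMeans(n_clusters=8, n_init=10).fit_predict(X).reshape(-1, 1)

    # (b) PCA projection retaining a few components
    X_pca = PCA(n_components=5).fit_transform(X)

    clf = LogisticRegression(max_iter=1000)
    print("cluster feature:", cross_val_score(clf, labels, y).mean())
    print("PCA features  :", cross_val_score(clf, X_pca, y).mean())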

  13. Vector-QM24 (VQM24) dataset

    • zenodo.org
    application/gzip, bin
    Updated May 20, 2025
    Cite
    Danish Khan; Anouar Benali; Scott Kim; Guido Falk von Rudorff; Anatole von Lilienfeld (2025). Vector-QM24 (VQM24) dataset [Dataset]. http://doi.org/10.5281/zenodo.15442257
    Explore at:
    Available download formats: bin, application/gzip
    Dataset updated
    May 20, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Danish Khan; Anouar Benali; Scott Kim; Guido Falk von Rudorff; Anatole von Lilienfeld
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Quantum chemistry dataset of ~836 thousand small organic and inorganic molecules.


    DFT properties for all 784,875 conformers in local minima; 258,242 constitutional isomers (most stable conformer) and 51,072 saddle point structures are available in the DFT_all.npz, DFT_uniques.npz and DFT_saddles.npz files respectively.
    DMC data for 10,793 constitutional isomers is available in the DMC.npz file.

    All molecules are ordered in the same way across every array.

    Keys for accessing each property are tabulated in the paper.

    Usage example:

    import numpy as np
    
    data = np.load('DFT_all.npz', allow_pickle=True)
    print(data.files) #see a list of all properties
    
    key = 'freqs'
    
    property = data[key] #DFT vibrational frequencies of all molecules
    print(property[42]) #Frequencies of molecule number 42 in the array (HSCl, Thiohypochlorous acid)

    Input file samples and tools: https://github.com/dkhan42/VQM24

    Atomic energies (in Hartree) used to calculate the atomization energies:

    #atomic energies wB97X-D3/cc-pVDZ
    eatomic = {'Hydrogen' : -0.5012728848846926,
    'Carbon' : -37.83859584856468,
    'Nitrogen' : -54.5760607136932450,
    'Oxygen' : -75.0474818911551438,
    'Fluorine' : -99.7031524437270917,
    'Bromine' : -2574.01253635198464,
    'Chlorine' : -460.13960793480203,
    'Phosphorous' : -341.2510291850040858,
    'Sulfur' : -398.1021030909759020,
    'Silicon' : -289.3578409507016431}
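
    A sketch of combining these atomic energies with the per-molecule totals to obtain atomization energies; it reuses the eatomic dictionary above, and the property keys 'Etot' and 'atoms' are hypothetical stand-ins for the real key names tabulated in the paper:

    import numpy as np

    data = np.load('DFT_uniques.npz', allow_pickle=True)

    i = 0
    total_energy = data['Etot'][i]   # hypothetical key for the molecular total energy
    atom_list = data['atoms'][i]     # hypothetical key, e.g. ['Carbon', 'Hydrogen', ...]

    # Atomization energy = sum of free-atom energies minus the molecular energy
    e_atomization = sum(eatomic[a] for a in atom_list) - total_energy
    print("atomization energy (Ha):", e_atomization)
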
    Wavefunctions of all 836 thousand molecules from the dataset are available as .molden files in wavefunctions.tar.gz. The .molden file for a specific molecule can be found using the 'compounds' array in the 'DFT_all.npz' file. For instance, the 0-th entry in the 'compounds' array of DFT_all.npz corresponds to 'SH2_0/conformer_1'; the wavefunction file for this molecule will be found at 'wavefunctions/SH2_0/conformer_1.molden' after untarring wavefunctions.tar.gz. Multiwfn (http://sobereva.com/multiwfn/) can be used to read the .molden wavefunction files.

    Dataset is described in the pre-print : https://arxiv.org/abs/2405.05961

  14. Synthetic data for assessing and comparing local post-hoc explanation of...

    • data.niaid.nih.gov
    Updated Mar 10, 2025
    Cite
    Macas, Martin (2025). Synthetic data for assessing and comparing local post-hoc explanation of detected process shift [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_15000634
    Explore at:
    Dataset updated
    Mar 10, 2025
    Dataset provided by
    Misar, Ondrej
    Macas, Martin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Synthetic data for assessing and comparing local post-hoc explanation of detected process shift

    DOI

    10.5281/zenodo.15000635

    The synthetic dataset contains the data used in the experiment described in an article submitted to the Computers in Industry journal, entitled "Assessing and Comparing Local Post-hoc Explanation for Shift Detection in Process Monitoring." The citation will be updated as soon as the article is accepted.

    The individual data.mat files are stored in a subfolder structure that assigns each file to one of the tested cases. For example, data for experiments with normally distributed data, a known number of shifted variables, and 5 variables are stored under the path normal\known_number\5_vars\rho0.1.

    The meaning of particular folders is explained here:

    normal - all variables are normally distributed

    not-normal - copula based multivariate distribution based on normal and gamma marginal distributions and defined correlation

    known_number - known number of shifted variables (the methods use this information, which is not available in the real world)

    unknown_number - unknown number of shifted variables, realistic case

    2_vars - data with 2 variables (n=2)

    ...

    10_vars - data with 10 variables (n=10)

    rho0.1 - correlation among all variables is 0.1

    ...

    rho0.9 - correlation among all variables is 0.9

    Each data.mat file contains the following variables:

    LIME_res (nval x n): results of LIME explanation

    MYT_res (nval x n): results of MYT explanation

    NN_res (nval x n): results of ANN explanation

    X (p x 11000): unshifted data

    S (n x n): sigma (covariance) matrix for the unshifted data

    mu (1 x n): mean parameter for the unshifted data

    n (1 x 1): number of variables (dimensionality)

    trn_set (n x ntrn x 2): train set for the ANN explainer; trn_set(:,:,1) are values of variables from the shifted process, and trn_set(:,:,2) are labels denoting which variables are shifted (trn_set(i,j,2) is 1 if the ith variable of the jth sample trn_set(:,j,1) is shifted)

    val_set (n x 95 x 2): validation set used for testing and generating LIME_res, MYT_res and NN_res
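
    A minimal sketch of loading one of these files in Python; scipy reads classic .mat files, while files saved in the v7.3 format would need h5py instead:

    from scipy.io import loadmat

    # Path follows the folder scheme described above
    mat = loadmat(r"normal\known_number\5_vars\rho0.1\data.mat")

    print(sorted(k for k in mat if not k.startswith("__")))
    print(mat["n"])                # number of variables
    print(mat["LIME_res"].shape)   # nval x n LIME explanation results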

  15. Vector Tiles for rural water supply systems in Rwanda - Dataset - openAFRICA...

    • open.africa
    Updated Oct 30, 2020
    + more versions
    Cite
    (2020). Vector Tiles for rural water supply systems in Rwanda - Dataset - openAFRICA [Dataset]. https://open.africa/dataset/rw-water-vectortiles
    Explore at:
    Dataset updated
    Oct 30, 2020
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Rwanda
    Description

    This is open data for rural water supply systems covering the entire country of Rwanda. The data format is Mapbox Vector Tiles; you can use it in a browser with Mapbox GL JS or Leaflet, and QGIS or ArcGIS can also handle the Mapbox Vector Tiles format. You can see the specification of these vector tiles here. The vector tiles are served from GitHub Pages at the URL template https://wasac.github.io/vt/tiles/{z}/{x}/{y}.mvt, and you can use the data together with your own Mapbox style.json; some example style.json files are on our website. Alternatively, you can use QGIS 3.14 or above, which officially supports vector tiles: download rwss.mbtiles from openAFRICA and simply drag and drop it into QGIS. Please let us know if you have any problems using our data. We would also like to know your use cases for our water vector tiles. Enjoy our vector tiles!

  16. National Hydrography Dataset Plus Version 2.1

    • hub.arcgis.com
    • resilience.climate.gov
    • +5more
    Updated Aug 16, 2022
    + more versions
    Cite
    Esri (2022). National Hydrography Dataset Plus Version 2.1 [Dataset]. https://hub.arcgis.com/maps/4bd9b6892530404abfe13645fcb5099a
    Explore at:
    Dataset updated
    Aug 16, 2022
    Dataset authored and provided by
    Esri (http://esri.com/)
    Description

    The National Hydrography Dataset Plus (NHDPlus) maps the lakes, ponds, streams, rivers and other surface waters of the United States. Created by the US EPA Office of Water and the US Geological Survey, the NHDPlus provides mean annual and monthly flow estimates for rivers and streams. Additional attributes provide connections between features, facilitating complicated analyses. For more information on the NHDPlus dataset see the NHDPlus v2 User Guide.

    Dataset Summary

    Phenomenon Mapped: Surface waters and related features of the United States and associated territories, not including Alaska
    Geographic Extent: The United States, not including Alaska, Puerto Rico, Guam, US Virgin Islands, Marshall Islands, Northern Marianas Islands, Palau, Federated States of Micronesia, and American Samoa
    Projection: Web Mercator Auxiliary Sphere
    Visible Scale: Visible at all scales, but the layer draws best at scales larger than 1:1,000,000
    Source: EPA and USGS
    Update Frequency: There is no new data since this 2019 version, so no updates are planned
    Publication Date: March 13, 2019

    Prior to publication, the NHDPlus network and non-network flowline feature classes were combined into a single flowline layer. Similarly, the NHDPlus Area and Waterbody feature classes were merged under a single schema. Attribute fields were added to the flowline and waterbody layers to simplify symbology and enhance the layer's pop-ups. Fields added include Pop-up Title, Pop-up Subtitle, On or Off Network (flowlines only), Esri Symbology (waterbodies only), and Feature Code Description. All other attributes are from the original NHDPlus dataset. No-data values -9999 and -9998 were converted to Null values for many of the flowline fields.

    What can you do with this layer?

    Feature layers work throughout the ArcGIS system. Generally your workflow with feature layers will begin in ArcGIS Online or ArcGIS Pro. Below are just a few of the things you can do with a feature service in Online and Pro.

    ArcGIS Online

    • Add this layer to a map in the map viewer. The layer is limited to scales of approximately 1:1,000,000 or larger, but a vector tile layer created from the same data can be used at smaller scales to produce a webmap that displays across the full range of scales. The layer or a map containing it can be used in an application.

    • Change the layer's transparency and set its visibility range.

    • Open the layer's attribute table and make selections. Selections made in the map or table are reflected in the other. Center on selection allows you to zoom to features selected in the map or table, and show selected records allows you to view the selected records in the table.

    • Apply filters. For example, you can set a filter to show larger streams and rivers using the mean annual flow attribute or the stream order attribute.

    • Change the layer's style and symbology.

    • Add labels and set their properties.

    • Customize the pop-up.

    • Use as an input to the ArcGIS Online analysis tools. This layer works well as a reference layer with the trace downstream and watershed tools. The buffer tool can be used to draw protective boundaries around streams, and the extract data tool can be used to create copies of portions of the data.

    ArcGIS Pro

    • Add this layer to a 2D or 3D map.

    • Use as an input to geoprocessing. For example, copy features allows you to select then export portions of the data to a new feature class.

    • Change the symbology and the attribute field used to symbolize the data.

    • Open the table and make interactive selections with the map.

    • Modify the pop-ups.

    • Apply Definition Queries to create subsets of the layer.

    This layer is part of the ArcGIS Living Atlas of the World, which provides an easy way to explore the landscape layers and many other beautiful and authoritative maps on hundreds of topics.

    Questions? Please leave a comment below if you have a question about this layer, and we will get back to you as soon as possible.
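
    Outside the ArcGIS applications, the same filters can be expressed against the layer's REST endpoint. A minimal sketch with a hypothetical service URL and field name (copy the real endpoint from the layer's item page):

    import requests

    # Hypothetical endpoint and field name, for illustration only
    layer_url = "https://services.arcgis.com/EXAMPLE/arcgis/rest/services/NHDPlus/FeatureServer/0/query"
    params = {
        "where": "StreamOrde >= 6",   # e.g., keep only larger streams by stream order
        "outFields": "*",
        "resultRecordCount": 10,
        "f": "json",
    }
    resp = requests.get(layer_url, params=params)
    print(resp.json().get("features", [])[:1])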

  17. Example 1: Long-format trapping data

    • springernature.figshare.com
    • figshare.com
    txt
    Updated May 31, 2023
    Cite
    Samuel SC Rund; Samraat Pawar; Bob MacCallum; Lauren J Cator; Sadie Jane Ryan; Dmitry S. Schigel; Scott Emrich; Cynthia Lord; Kyle Braak; Kyle Copas; Gloria I Giraldo-Calderón; Michael A Johansson; Naveed Heydari; Donald Hobern; Sarah A Kelly; Daniel Lawson; Dominique G. Roche; Kurt Vandegrift; Matthew Watts; Jennifer M Zaspel (2023). Example 1: Long-format trapping data [Dataset]. http://doi.org/10.6084/m9.figshare.7599572
    Explore at:
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Samuel SC Rund; Samraat Pawar; Bob MacCallum; Lauren J Cator; Sadie Jane Ryan; Dmitry S. Schigel; Scott Emrich; Cynthia Lord; Kyle Braak; Kyle Copas; Gloria I Giraldo-Calderón; Michael A Johansson; Naveed Heydari; Donald Hobern; Sarah A Kelly; Daniel Lawson; Dominique G. Roche; Kurt Vandegrift; Matthew Watts; Jennifer M Zaspel
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Each row captures count data for a single species' occurrence in a given sampling event. This illustrates the most common mosquito collection protocol.
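
    A minimal pandas sketch of reshaping such long-format records into a wide event-by-species table (the column names here are hypothetical, not the file's actual headers):

    import pandas as pd

    long_df = pd.DataFrame({
        "collection_event": ["e1", "e1", "e2"],
        "species": ["Aedes aegypti", "Culex pipiens", "Aedes aegypti"],
        "count": [12, 3, 7],
    })

    # One row per sampling event, one column per species, zeros where absent
    wide_df = long_df.pivot_table(index="collection_event", columns="species",
                                  values="count", fill_value=0)
    print(wide_df)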

  18. LabPics Dataset

    • paperswithcode.com
    Updated May 3, 2021
    Cite
    Sagi Eppel; Haoping Xu; Alan Aspuru-Guzik (2021). LabPics Dataset [Dataset]. https://paperswithcode.com/dataset/labpics
    Explore at:
    Dataset updated
    May 3, 2021
    Authors
    Sagi Eppel; Haoping Xu; Alan Aspuru-Guzik
    Description

    LabPics Chemistry Dataset

    Dataset for computer vision for materials segmentation and classification in chemistry labs, medical labs, and any setting where materials are handled inside containers. The Vector-LabPics dataset comprises 7,900 images of materials in various phases and processes within mostly transparent vessels in chemistry labs, medical labs and hospitals, and other environments. The images are annotated for both the vessels and the individual material phases inside them, and each instance is assigned one or more classes (liquid, solid, foam, suspension, powder, gel, granular, vapor). The fill level, labels, corks, and other parts of the vessel are also annotated. The material classes cover the main states of matter, including liquids, solids, vapors, foams, and gels, with subcategories like powder, granular, and suspension. Relationships between materials, such as which material is immersed inside other materials, are annotated. The vessel classes cover glassware, labware plates, bottles, and any other type of vessel that is used to contain or carry materials. The type of vessel (e.g., syringe, tube, cup, infusion bottle/bag) and the properties of the vessel (transparent, opaque) are annotated. In addition, vessel parts such as corks, labels, spikes, and valves are annotated. Relations and hierarchies between vessels and materials are also annotated, such as which vessel contains which material or which vessels are linked to or contain each other. The images were collected from various contributors and cover most aspects of chemistry lab work as well as a variety of other fields where materials are handled in container vessels. Documents specifying annotation formats are available inside the dataset file. Version 1 contains 2,200 images with simple instance and semantic annotations and is relatively simple to use; it is described in the paper "Computer Vision for Recognition of Materials and Vessels in Chemistry Lab Settings and the Vector-LabPics Data Set".

    Format

    The dataset contains annotated images for both materials and vessels in chemistry labs, medical labs, and any area where liquids and solids are handled within vessels. There are two levels of annotation for each image: one annotation set for vessels and a second for the material phases inside these vessels. Vessels are defined as any container that can carry materials, such as jars, Erlenmeyer flasks, tubes, funnels, syringes, IV bags, and any other labware or glassware that can contain or carry materials. Material phases are any material contained within or on the vessel. For example, for two-phase separating liquids, each liquid phase is annotated as one instance; if there is foam above the liquid or a chunk of solid inside the liquid, the foam, liquid, and solid are annotated as different phases. In addition, vessel parts like corks, labels, and valves are annotated as instances. For each instance, there is a list of all the classes it belongs to and a list of its properties. For vessels, the instance classes are the vessel type (cup, jar, separatory funnel…) and the vessel properties (transparent, opaque…). For materials, the classes are the material types (liquid, solid, suspension, foam, powder…) and their properties (scattered, on vessel surface…), and for parts, the part type (cork/label). In addition, the relations between instances are annotated. This includes which material instances are inside which vessels, which vessels are linked to each other or are inside each other (for vessels inside other vessels), and which material phase is immersed inside another material phase. In addition to instance segmentation maps, the dataset also includes semantic segmentation maps that give each pixel in the image all the classes to which it belongs. In other words, for each class (liquid, solid, vessel, foam), there is a map of all the regions in the image belonging to that class. Note that every pixel and every instance can have several classes, and instances often overlap, as in the case of material inside a vessel, a vessel inside a vessel, or a material phase immersed inside another material (like solid inside liquid).
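
    A minimal sketch of working with per-class semantic maps of this kind, assuming a hypothetical file layout with one binary mask image per class (the actual formats are specified in the documents inside the dataset file):

    import numpy as np
    from PIL import Image

    # Hypothetical paths: one semantic map per class for the same image
    liquid = np.array(Image.open("SemanticMaps/Liquid/image_001.png")) > 0
    vessel = np.array(Image.open("SemanticMaps/Vessel/image_001.png")) > 0

    print("liquid pixels:", int(liquid.sum()))
    # Pixels may carry several classes at once, e.g. liquid inside a vessel
    print("liquid-and-vessel pixels:", int((liquid & vessel).sum()))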

  19. Australia - Present Major Vegetation Groups - NVIS Version 4.1 (Albers 100m...

    • data.gov.au
    • researchdata.edu.au
    • +1more
    zip
    Updated Apr 13, 2022
    Cite
    Bioregional Assessment Program (2022). Australia - Present Major Vegetation Groups - NVIS Version 4.1 (Albers 100m analysis product) [Dataset]. https://data.gov.au/data/dataset/57c8ee5c-43e5-4e9c-9e41-fd5012536374
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 13, 2022
    Dataset authored and provided by
    Bioregional Assessment Program
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Australia
    Description

    Abstract

    This dataset and its metadata statement were supplied to the Bioregional Assessment Programme by a third party and are presented here as originally supplied.

    Resource contains an ArcGIS file geodatabase raster for the National Vegetation Information System (NVIS) Major Vegetation Groups - Australia-wide, present extent (FGDB_NVIS4_1_AUST_MVG_EXT).

    Related datasets are also included: FGDB_NVIS4_1_KEY_LAYERS_EXT - ArcGIS File Geodatabase Feature Class of the Key Datasets that make up NVIS Version 4.1 - Australia wide; and FGDB_NVIS4_1_LUT_KEY_LAYERS - Lookup table for Dataset Key Layers.

    This raster dataset provides the latest summary information (November 2012) on Australia's present (extant) native vegetation. It is in Albers Equal Area projection with a 100 m x 100 m (1 ha) cell size. A comparable Estimated Pre-1750 (pre-European, pre-clearing) raster dataset is available: NVIS4_1_AUST_MVG_PRE_ALB. State and Territory vegetation mapping agencies supplied a new version of the National Vegetation Information System (NVIS) in 2009-2011. Some agencies did not supply new data for this version but approved re-use of Version 3.1 data. Summaries were derived from the best available data in the NVIS extant theme as at June 2012. This product is derived from a compilation of data collected at different scales on different dates by different organisations. Please refer to the separate key map showing scales of the input datasets. Gaps in the NVIS database were filled by non-NVIS data, notably parts of South Australia and small areas of New South Wales such as the Curlewis area. The data represent on-ground dates of up to 2006 in Queensland, 2001 to 2005 in South Australia (depending on the region) and 2004/5 in other jurisdictions, except NSW. NVIS data was partially updated in NSW with 2001-09 data, with extensive areas of 1997 data remaining from the earlier version of NVIS. Major Vegetation Groups were identified to summarise the type and distribution of Australia's native vegetation. The classification contains different mixes of plant species within the canopy, shrub or ground layers, but the groups are structurally similar and are often dominated by a single genus. In a mapping sense, the groups reflect the dominant vegetation occurring in a map unit where there is a mix of several vegetation types. Subdominant vegetation groups which may also be present in the map unit are not shown. For example, the dominant vegetation in an area may be mapped as eucalypt open forest, although it contains pockets of rainforest, shrubland and grassland vegetation as subdominants. The (related) Major Vegetation Subgroups represent more detail about the understorey and floristics of the Major Vegetation Groups and are available as separate raster datasets: NVIS4_1_AUST_MVS_EXT_ALB and NVIS4_1_AUST_MVS_PRE_ALB. A number of other non-vegetation and non-native vegetation land cover types are also represented as Major Vegetation Groups. These are provided for cartographic purposes, but should not be used for analyses. For further background and other NVIS products, please see the links on http://www.environment.gov.au/erin/nvis/index.html.

    The current NVIS data products are available from http://www.environment.gov.au/land/native-vegetation/national-vegetation-information-system.

    Purpose

    For use in Bioregional Assessment land classification analyses

    Dataset History

    NVIS Version 4.1

    The input vegetation data were provided from over 100 individual projects representing the majority of Australia's regional vegetation mapping over the last 50 years. State and Territory custodians translated the vegetation descriptions from these datasets into a common attribute framework, the National Vegetation Information System (ESCAVI, 2003). Scales of input mapping ranged from 1:25,000 to 1:5,000,000. These were combined into an Australia-wide set of vector data. Non-terrestrial areas were mostly removed by the State and Territory custodians before supplying the data to the Environmental Resources Information Network (ERIN), Department of Sustainability, Environment, Water, Population and Communities (DSEWPaC).

    Each NVIS vegetation description was written to the NVIS XML format file by the custodian, transferred to ERIN and loaded into the NVIS database at ERIN. A considerable number of quality checks were performed automatically by this system to ensure conformity to the NVIS attribute standards (ESCAVI, 2003) and consistency between levels of the NVIS Information Hierarchy within each description. Descriptions for non-vegetation and non-native vegetation mapping codes were transferred via CSV files.

    The NVIS vector (polygon) data for Australia comprised a series of jigsaw pieces, each up to approximately 500,000 polygons, the maximum tractable size for routine geoprocessing. The spatial data were processed to conform to the NVIS spatial format (ESCAVI, 2003; other papers). Spatial processing and attribute additions were done mostly in ESRI File Geodatabases. Topology and minor geometric corrections were also performed at this stage. These datasets were then loaded into ESRI Spatial Database Engine as per the ERIN standard. NVIS attributes were then populated using Oracle database tables provided by custodians, mostly using PL/SQL Developer or, for simple cases, the field calculator in ArcGIS.

    Each spatial dataset was joined to and checked against a lookup table for the relevant State/Territory to ensure that all mapping codes in the dominant vegetation type of each polygon (NVISDSC1) had a valid lookup description, including an allocated MVG. Minor vegetation components of each map unit (NVISDSC2-6) were not checked, but could be considered mostly complete.
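    A minimal sketch of the lookup check described above, assuming the polygons and lookup table have been exported to open formats; file, layer and column names other than NVISDSC1 and MVG_NUMBER are hypothetical.

    import geopandas as gpd
    import pandas as pd

    polys = gpd.read_file('nvis_state.gpkg', layer='vegetation')  # hypothetical export
    lut = pd.read_csv('nvis_lut.csv')                             # hypothetical export

    # Join the dominant vegetation code of each polygon to the lookup table.
    joined = polys.merge(lut[['NVISDSC1', 'MVG_NUMBER']], on='NVISDSC1', how='left')

    # Flag polygons whose NVISDSC1 code has no valid lookup entry (hence no MVG).
    missing = joined[joined['MVG_NUMBER'].isna()]
    print(len(missing), 'polygons lack a valid lookup description')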

    Each NVIS vegetation description was allocated to a Major Vegetation Group (MVG) by manual interpretation at ERIN. The Australian Natural Resources Atlas (http://www.anra.gov.au/topics/vegetation/pubs/native_vegetation/vegfsheet.html) provides detailed descriptions of most Major Vegetation Groups. Three new MVGs were created for version 4.1 to better represent open woodland formations and forests (in the NT) with no further data available. NVIS vegetation descriptions were reallocated into these classes, if appropriate:

    • Unclassified Forest

    • Other Open Woodlands

    • Mallee Open Woodlands and Sparse Mallee Shrublands

    (Thus a total of 33 MVGs existed as at June 2012.) Data values defined as cleared or non-native by data custodians were attributed specific MVG values such as 25 - Cleared or non-native, 27 - naturally bare, 28 - seas & estuaries, and 99 - Unknown.

    As part of the process to fill gaps in NVIS, the descriptive data from non-NVIS sources were also referenced in the NVIS database, but with blank vegetation descriptions. In general, the gap-fill data comprised (a) fine-scale (1:250K or better) State/Territory vegetation maps for which NVIS descriptions were unavailable and (b) coarse-scale (1:1M) maps from Commonwealth and other sources. MVGs were then allocated to each description from the available descriptions in accompanying publications and other sources.

    Parts of New South Wales, South Australia, Queensland and the ACT have extensive areas of vector 'NoData', thus appearing as an inland sea. The 'NoData' areas were dealt with differently by each state. In the ACT and SA, the vector data were 'gap-filled' and attributed using satellite imagery as a guide prior to rasterising. Most of these areas comprised a mixture of MVG 24 (inland water) and 25 (cleared), and in some cases 99 (unknown). The NSW and QLD 'NoData' areas were filled using a raster mask over the 'holes'. These areas were attributed with MVG 24 or 26 (water and unclassified vegetation), MVG 25 (cleared), or MVG 99 (unknown/no data) where they were a mixture of unknown proportions.
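    As a rough illustration of the raster-mask gap-fill described above (a sketch only; the real processing used ESRI raster tools, and the array contents and NoData code here are hypothetical):

    import numpy as np

    NODATA, MVG_UNKNOWN = 0, 99

    # A tiny hypothetical state raster with two 'NoData' holes.
    state_raster = np.array([[ 3,  3, NODATA,  4],
                             [ 3, NODATA,  4,  4],
                             [25, 25, 24, 24],
                             [25, 24, 24, 24]], dtype=np.uint8)

    holes = state_raster == NODATA                             # the raster mask
    state_raster = np.where(holes, MVG_UNKNOWN, state_raster)  # fill the holes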

    Each spatial dataset with joined lookup table (including MVG_NUMBER linked to NVISDSC1) was exported to a File Geodatabase as a feature class. These were reprojected into Albers Equal Area projection (Central_Meridian: 132.000000, Standard_Parallel_1: -18.000000, Standard_Parallel_2: -36.000000, Linear Unit: Meter (1.000000), Datum GDA94, other parameters 0).
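    The stated parameters correspond to the GDA94 / Australian Albers system (EPSG:3577), so the reprojection step can be reproduced with open tools. A sketch, assuming a feature class exported to GeoPackage (file and layer names are hypothetical):

    import geopandas as gpd

    fc = gpd.read_file('nvis_state.gpkg', layer='vegetation')  # hypothetical export
    # GDA94 / Australian Albers: lon_0=132, lat_1=-18, lat_2=-36, units in metres.
    fc_albers = fc.to_crs(epsg=3577)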

    Each feature class was then rasterised to a 100 m raster with extents snapped to a multiple of 1,000 m, to ensure alignment. In some instances, areas of 'NoData' had to be modelled in raster, for example in NSW, where non-native areas (cleared, water bodies etc.) have not been mapped. The rasters were then merged into a 'state-wide' raster. State rasters were then merged into this Australia-wide raster dataset.
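    Continuing from the reprojection sketch above, a hedged sketch of the rasterisation step with rasterio; the NoData value of 0 and the MVG_NUMBER attribute carried over from the lookup join are assumptions of this example.

    import math
    import rasterio
    from rasterio import features
    from rasterio.transform import from_origin

    res = 100  # metres
    minx, miny, maxx, maxy = fc_albers.total_bounds

    # Snap the extent outwards to multiples of 1,000 m so state rasters align.
    minx, miny = (math.floor(v / 1000) * 1000 for v in (minx, miny))
    maxx, maxy = (math.ceil(v / 1000) * 1000 for v in (maxx, maxy))

    width, height = int((maxx - minx) / res), int((maxy - miny) / res)
    transform = from_origin(minx, maxy, res, res)

    # Burn each polygon's MVG number into the 100 m grid; 0 marks NoData.
    mvg = features.rasterize(
        zip(fc_albers.geometry, fc_albers['MVG_NUMBER'].astype(int)),
        out_shape=(height, width),
        transform=transform,
        fill=0,
        dtype='uint8',
    )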

    November 2012 Corrections

    Closer inspection of the original 4.1 MVG extant raster dataset highlighted some issues with the raster creation process, which meant that raster pixels in some areas did not align as intended. These were corrected, and the new, properly aligned rasters were released in November 2012.

    Dataset Citation

    Department of the Environment (2012) Australia - Present Major Vegetation Groups - NVIS Version 4.1 (Albers 100m analysis product). Bioregional Assessment Source Dataset. Viewed 10 July 2017, http://data.bioregionalassessments.gov.au/dataset/57c8ee5c-43e5-4e9c-9e41-fd5012536374.

  20. Vector grid system for a Quebec spatial data infrastructure, 2024 edition

    • catalogue.arctic-sdi.org
    • open.canada.ca
    • +1 more
    Updated Mar 9, 2024
    + more versions
    Cite
    (2024). Vector grid system for a Quebec spatial data infrastructure, 2024 edition [Dataset]. https://catalogue.arctic-sdi.org/geonetwork/srv/search?keyword=Vector%20grids
    Explore at:
    Dataset updated
    Mar 9, 2024
    Area covered
    Quebec
    Description

    The vector grid system provides a spatial and statistical infrastructure that allows the integration of environmental and socio-economic data. Using it, different spatial data can be cross-referenced within the same grid units, and the results of projects built on the grid system can be linked more easily. This grid system forms the geographic and statistical infrastructure of the Southern Quebec Land Accounts of the Institut de la statistique du Québec (ISQ) and provides the geospatial and statistical context for the development of ecosystem accounting in Quebec.

    In order to improve the vector grid system and the Southern Quebec Land Accounts and to better anticipate the future needs of users, the ISQ would like to be informed of how they are used (field of application, objectives of use, territory, association with other products, etc.). You can write to maxime.keith@stat.gouv.qc.ca.

    This grid system allows the spatial integration of various data relating, for example, to human populations, the economy or the characteristics of the land. The ISQ encourages the use of this system in projects that require the integration of several data sources, the analysis of these data at different spatial scales and the monitoring of these data over time. The fixed geographic references of the grids simplify the compilation of statistics according to different territorial divisions and facilitate the monitoring of changes over time. In particular, the grid system promotes the consistency of data at the provincial level.

    The spatial intersection of the grid with a spatial data layer transfers the information underlying that layer to each cell of the grid. In the case of the Southern Quebec Land Accounts, the spatial intersection of the grid with each of the three land cover layers (1990s, 2000s and 2010s) made it possible to report the dominant coverage within each grid cell, as sketched below; the set of raster files of the Southern Quebec Land Accounts is the result of this intersection.

    Characteristics: the product includes two vector grids: one formed of 1 km² cells (1,000 m on a side), which covers all of Quebec, and another of 2,500 m² cells (50 m on a side, or a quarter of a hectare), which nests exactly within the first and covers the Quebec territory south of the 52nd parallel. The nomenclature of this system, designed on a Cartesian plane, was developed so that cells with finer resolutions (down to 5 m on a side) can be integrated. In its 2024 update, the 50 m grid is divided into 331 parts of 50 km on a side in order to limit the number of cells per part to a few million and thus facilitate geospatial processing. The grid comprises approximately 350 million cells covering 875,000 km². It is backwards compatible with the 50 m grid published by the ISQ in 2018 (the spatial structure and unique identifiers are identical; only the partitioning differs).

    Attribute information for 50 m cells:

    • ID_m50: unique code of the cell;

    • CO_MUN_2022: geographic code of the municipality as of January 2022;

    • CERQ_NV2: code of the natural region of the ecological reference framework of Quebec (CERQ);

    • CL_COUV_T50: unique code of the cell;

    • CL_COUV_T00, CL_COUV_T01: codes of the land cover classes from the 1990s, 2000s and 2010s maps. Note: the 2000s are covered by two land cover maps, CL_COUV_T01A and CL_COUV_T01b; the first inventories land cover prior to reassessment using the 2010s map, while the second shows land cover after this reassessment process.

    Complementary entity classes:

    • Index_grille50m: index of the parts of the grid;

    • Decoupage_mun_01_2022: division of municipalities;

    • Decoupage_MRC_01_2022: division of geographical MRCs;

    • Decoupage_RA_01_2022: division of administrative regions. Source: the System on Administrative Divisions (SDA) of the Ministry of Natural Resources and Forests (MRNF), January 2022. These divisions allow statistical compilations according to administrative divisions hierarchically above municipalities.

    • Decoupage_CERQ_NV2_2018: division of level 2 of the CERQ, natural regions. Source: Ministry of the Environment, the Fight against Climate Change, Wildlife and Parks (MELCCFP).

    Geospatial processes delivered with the grid (FGDB dataset only):

    • an ArcGIS ModelBuilder model that performs the spatial intersection and selects the dominant value of the geographic layer to populate the grid;

    • a ModelBuilder model that compiles statistical results according to various divisions.

    Additional information on the grid is available in the Southern Quebec Land Accounts report published in October 2018 (p. 46). The results of the Southern Quebec Land Accounts can be viewed on the interactive map of the Institut de la statistique du Québec. This third-party metadata element was translated using an automated translation tool (Amazon Translate).
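    A minimal sketch of that dominant-coverage intersection with open tools (geopandas), assuming the grid and a land cover layer have been exported to GeoPackage; the file names and the CL_COUV column are hypothetical, while ID_m50 is the cell key described above.

    import geopandas as gpd

    grid = gpd.read_file('grille_50m.gpkg')   # 50 m cells, keyed by ID_m50
    cover = gpd.read_file('couverture.gpkg')  # land cover polygons, class in CL_COUV

    # Intersect cover polygons with cells, then keep the largest piece per cell.
    pieces = gpd.overlay(grid, cover, how='intersection')
    pieces['area'] = pieces.geometry.area
    dominant = (pieces.sort_values('area', ascending=False)
                      .drop_duplicates('ID_m50')[['ID_m50', 'CL_COUV']])

    # Attach the dominant class back to the grid.
    grid = grid.merge(dominant, on='ID_m50', how='left')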
