Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset for the paper "Large Language Models for Structuring and Integration of Heterogeneous Data" (add DOI).
It contains:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset can be used to test deep learning models that combine structured data (stock prices) and unstructured data (stock bar posts); it is therefore a multi-source heterogeneous dataset.
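As an illustration of how such a multi-source dataset might be consumed, the sketch below aligns structured daily prices with the unstructured posts by date; the file and column names (prices.csv, posts.csv, date, close, text) are assumptions for illustration, not part of the dataset description.

# Minimal sketch: join structured prices with unstructured posts by date.
# File and column names are hypothetical placeholders for whatever the
# dataset actually ships.
import pandas as pd

prices = pd.read_csv("prices.csv", parse_dates=["date"])  # structured: one row per trading day
posts = pd.read_csv("posts.csv", parse_dates=["date"])    # unstructured: one row per stock bar post

# Aggregate the free-text posts per day so they can sit next to the price row.
daily_posts = (
    posts.groupby(posts["date"].dt.date)["text"]
    .apply(list)
    .rename("posts")
    .reset_index()
)
prices["date"] = prices["date"].dt.date

# One record per day: numeric features plus the list of posts for that day,
# ready to feed a model that mixes tabular and text inputs.
merged = prices.merge(daily_posts, on="date", how="left")
print(merged.head())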
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the data used for “Heterogeneous Multi-Source Data Fusion Through Input Mapping And Latent Variable Gaussian Process” paper by Yigitcan Comlek, Sandipp Krishnan Ravi, Piyush Pandita, Sayan Ghosh, Liping Wang, and Wei Chen. For all correspondence, please contact Dr. Wei Chen (weichen@northwestern.edu) or Dr. Sandipp Krishnan Ravi (sandippk@umich.edu).
Please use the BibTeX entry below to cite this work:
@article{comlek2024heterogenous,
title={Heterogeneous Multi-Source Data Fusion Through Input Mapping and Latent Variable Gaussian Process},
author={Comlek, Yigitcan and Ravi, Sandipp Krishnan and Pandita, Piyush and Ghosh, Sayan and Wang, Liping and Chen, Wei},
journal={arXiv preprint arXiv:2407.11268},
year={2024}
}
The repository consists of data used in three case studies. All data are provided in .csv format; each CSV file contains the data for a specific source used in a case study. Below is a summary of the files for each of the three case studies (a minimal loading sketch follows the Case Study 3 references).
Case Study 1 (Cantilever Beam)
· Source1_RectangularBeam.csv
· Source2_RectangularHollowBeam.csv
· Source3_CircularHollowBeam.csv
Case Study 2 (Ellipsoidal Void)
· Source1_2DEllipse.csv
· Source2_3DEllipse.csv
· Source3_3DEllipseRot.csv
Case Study 3 (Ti-6Al-4V Alloys)
· Source1_LBPF.csv [1,2]
· Source2_EBM.csv [3]
· Source3_FSW.csv [4]
For this case study, the data were collected from the papers below:
[1] Q. Luo, L. Yin, T. W. Simpson, and A. M. Beese, “Effect of processing parameters on pore structures, grain features, and mechanical properties in Ti-6Al-4V by laser powder bed fusion,” Additive Manufacturing, vol. 56, p. 102915, 2022.
[2] Q. Luo, L. Yin, T. W. Simpson, and A. M. Beese, “Dataset of process-structure-property feature relationship for laser powder bed fusion additive manufactured Ti-6Al-4V material,” Data in Brief, vol. 46, p. 108911, 2023.
[3] J. Ran, F. Jiang, X. Sun, Z. Chen, C. Tian, and H. Zhao, “Microstructure and mechanical properties of Ti-6Al-4V fabricated by electron beam melting,” Crystals, vol. 10, no. 11, p. 972, 2020.
[4] A. Fall, M. Jahazi, A. Khdabandeh, and M. Fesharaki, “Effect of process parameters on microstructure and mechanical properties of friction stir-welded Ti–6Al–4V joints,” The International Journal of Advanced Manufacturing Technology, vol. 91, pp. 2919–2931, 2017.
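To make the per-source file layout concrete, here is a minimal loading sketch for Case Study 3. Only the file names above are taken from the repository; the columns are whatever each CSV actually contains, and the printout simply shows that the sources can differ in input dimensionality, which is why an input-mapping step is needed before fusion.

# Minimal sketch: load the three Case Study 3 sources into one dict.
import pandas as pd

files = {
    "LBPF": "Source1_LBPF.csv",
    "EBM": "Source2_EBM.csv",
    "FSW": "Source3_FSW.csv",
}

sources = {name: pd.read_csv(path) for name, path in files.items()}

for name, df in sources.items():
    # Each source can have its own input dimensionality and column set.
    print(name, df.shape, list(df.columns))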
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective: Driven by the need of pharmacovigilance centres and companies to routinely collect and review all available data about adverse drug reactions (ADRs) and adverse events of interest, we introduce and validate a computational framework exploiting dominant as well as emerging publicly available data sources for drug safety surveillance.
Methods: Our approach relies on appropriate query formulation for data acquisition and subsequent filtering, transformation and joint visualization of the obtained data. We acquired data from the FDA Adverse Event Reporting System (FAERS), PubMed and Twitter. In order to assess the validity and the robustness of the approach, we elaborated on two important case studies, namely, clozapine-induced cardiomyopathy/myocarditis versus haloperidol-induced cardiomyopathy/myocarditis, and apixaban-induced cerebral hemorrhage.
Results: The analysis of the obtained data provided interesting insights (identification of potential patient and health-care professional experiences regarding ADRs in Twitter, information/arguments against an ADR's existence across all sources), while illustrating the benefits (complementing data from multiple sources to strengthen/confirm evidence) and the underlying challenges (selecting search terms, data presentation) of exploiting heterogeneous information sources, thereby advocating the need for the proposed framework.
Conclusions: This work contributes to establishing a continuous learning system for drug safety surveillance by exploiting heterogeneous publicly available data sources via appropriate support tools.
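As a hedged illustration of the query formulation and data acquisition step, the sketch below pulls FAERS reports for one of the case studies (clozapine together with myocarditis) from the public openFDA endpoint. The query string is an illustrative guess following openFDA's documented search syntax; the description does not give the authors' actual queries or pipeline.

# Minimal sketch: fetch FAERS reports mentioning clozapine and myocarditis
# from the public openFDA API. The query is an illustrative guess, not the
# authors' search strategy.
import requests

query = "patient.drug.medicinalproduct:clozapine+AND+patient.reaction.reactionmeddrapt:myocarditis"
url = f"https://api.fda.gov/drug/event.json?search={query}&limit=5"

resp = requests.get(url, timeout=30)
resp.raise_for_status()

for report in resp.json().get("results", []):
    reactions = [r.get("reactionmeddrapt") for r in report.get("patient", {}).get("reaction", [])]
    print(report.get("receiptdate"), reactions)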
Data and software associated with the paper: PHENOstruct: Prediction of human phenotype ontology terms using heterogeneous data sources
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data sources’ characteristics*.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The peer-reviewed publication for this dataset has been presented in the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), and can be accessed here: https://arxiv.org/abs/2205.02596. Please cite this when using the dataset.
This dataset contains a heterogeneous set of True and False COVID claims and online sources of information for each claim.
The claims have been obtained from online fact-checking sources, existing datasets and research challenges. The dataset combines sources with different foci, thus enabling a comprehensive approach that spans different media (Twitter, Facebook, general websites, academia), information domains (health, scholarly, media), information types (news, claims) and applications (information retrieval, veracity evaluation).
The processing of the claims included an extensive de-duplication process eliminating repeated or very similar claims. The dataset is presented in a LARGE and a SMALL version, accounting for different degrees of similarity between the remaining claims (excluding respectively claims with a 90% and 99% probability of being similar, as obtained through the MonoT5 model). The similarity of claims was analysed using BM25 (Robertson et al., 1995; Crestani et al., 1998; Robertson and Zaragoza, 2009) with MonoT5 re-ranking (Nogueira et al., 2020), and BERTScore (Zhang et al., 2019).
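The following simplified sketch shows how a de-duplication pass of this flavour could look, using the rank_bm25 and bert-score packages: BM25 retrieves lexically similar candidates among already kept claims, and BERTScore confirms near-duplicates. It is only a stand-in for the BM25 + MonoT5 re-ranking pipeline actually used to build the LARGE and SMALL versions, and the thresholds are arbitrary illustrative values.

# Simplified de-duplication sketch (not the paper's exact pipeline).
from rank_bm25 import BM25Okapi
from bert_score import score

claims = [
    "Drinking hot water kills the coronavirus.",
    "Hot water consumption eliminates COVID-19.",
    "Masks reduce the spread of COVID-19.",
]

kept = []
for claim in claims:
    if kept:
        tokenized_kept = [c.lower().split() for c in kept]
        bm25 = BM25Okapi(tokenized_kept)
        scores = bm25.get_scores(claim.lower().split())
        # Candidate duplicates among already-kept claims (lexical overlap).
        candidates = [kept[j] for j, s in enumerate(scores) if s > 2.0]
        if candidates:
            # Semantic check with BERTScore F1 against the candidates.
            _, _, f1 = score([claim] * len(candidates), candidates, lang="en", verbose=False)
            if f1.max().item() > 0.95:
                continue  # near-duplicate of a claim already kept: drop it
    kept.append(claim)

print(kept)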
The processing of the content also involved removing claims making only a direct reference to existing content in other media (audio, video, photos); automatically obtained content not representing claims; and entries with claims or fact-checking sources in languages other than English.
The claims were analysed to identify types of claims that may be of particular interest, either for inclusion or exclusion depending on the type of analysis. The following types were identified: (1) Multimodal; (2) Social media references; (3) Claims including questions; (4) Claims including numerical content; (5) Named entities, including: PERSON − People, including fictional; ORGANIZATION − Companies, agencies, institutions, etc.; GPE − Countries, cities, states; FACILITY − Buildings, highways, etc. These entities have been detected using a RoBERTa base English model (Liu et al., 2019) trained on the OntoNotes Release 5.0 dataset (Weischedel et al., 2013) using Spacy.
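A minimal sketch of the named-entity tagging step is shown below. It uses spaCy's transformer pipeline en_core_web_trf (RoBERTa-based, trained on OntoNotes 5.0); note that spaCy's OntoNotes label set uses the short forms ORG and FAC for the ORGANIZATION and FACILITY categories listed above, and the example claim is invented.

# Minimal sketch: tag named entities in a claim with spaCy's RoBERTa-based
# OntoNotes pipeline. Requires: pip install spacy
# and: python -m spacy download en_core_web_trf
import spacy

nlp = spacy.load("en_core_web_trf")

claim = "The WHO said in March that Italy had over 10,000 confirmed cases."
doc = nlp(claim)

# Keep only the entity types used for the Named Entities claim type.
wanted = {"PERSON", "ORG", "GPE", "FAC"}
entities = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in wanted]
print(entities)  # e.g. [('WHO', 'ORG'), ('Italy', 'GPE')]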
The original labels for the claims have been reviewed and homogenised from the different criteria used by each original fact-checker into the final True and False labels.
The data sources used are:
The CoronaVirusFacts/DatosCoronaVirus Alliance Database. https://www.poynter.org/ifcn-covid-19-misinformation/
CoAID dataset (Cui and Lee, 2020) https://github.com/cuilimeng/CoAID
MM-COVID (Li et al., 2020) https://github.com/bigheiniu/MM-COVID
CovidLies (Hossain et al., 2020) https://github.com/ucinlp/covid19-data
TREC Health Misinformation track https://trec-health-misinfo.github.io/
TREC COVID challenge (Voorhees et al., 2021; Roberts et al., 2020) https://ir.nist.gov/covidSubmit/data.html
The LARGE dataset contains 5,143 claims (1,810 False and 3,333 True), and the SMALL version 1,709 claims (477 False and 1,232 True).
The entries in the dataset contain the following information:
Claim. Text of the claim.
Claim label. The labels are False and True.
Claim source. The sources include mostly fact-checking websites, health information websites, health clinics, public institutions sites, and peer-reviewed scientific journals.
Original information source. Information about which general information source was used to obtain the claim.
Claim type. The different types, previously explained, are: Multimodal, Social Media, Questions, Numerical, and Named Entities.
Funding. This work was supported by the UK Engineering and Physical Sciences Research Council (grant no. EP/V048597/1, EP/T017112/1). ML and YH are supported by Turing AI Fellowships funded by the UK Research and Innovation (grant no. EP/V030302/1, EP/V020579/1).
References
Arana-Catania M., Kochkina E., Zubiaga A., Liakata M., Procter R., He Y.. Natural Language Inference with Self-Attention for Veracity Assessment of Pandemic Claims. NAACL 2022 https://arxiv.org/abs/2205.02596
Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. NIST Special Publication SP, 109:109.
Fabio Crestani, Mounia Lalmas, Cornelis J Van Rijsbergen, and Iain Campbell. 1998. “Is this document relevant?… probably”: A survey of probabilistic models in information retrieval. ACM Computing Surveys (CSUR), 30(4):528–552.
Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc.
Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document ranking with a pre-trained sequence-to-sequence model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 708–718.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. Ontonotes release 5.0 ldc2013t19. Linguistic Data Consortium, Philadelphia, PA, 23.
Limeng Cui and Dongwon Lee. 2020. CoAID: COVID-19 healthcare misinformation dataset. arXiv preprint arXiv:2006.00885.
Yichuan Li, Bohan Jiang, Kai Shu, and Huan Liu. 2020. MM-COVID: A multilingual and multimodal data repository for combating COVID-19 disinformation.
Tamanna Hossain, Robert L. Logan IV, Arjuna Ugarte, Yoshitomo Matsubara, Sean Young, and Sameer Singh. 2020. COVIDLies: Detecting COVID-19 misinformation on social media. In Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020, Online. Association for Computational Linguistics.
Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. 2021. TREC-COVID: Constructing a pandemic information retrieval test collection. In ACM SIGIR Forum, volume 54, pages 1–12. ACM, New York, NY, USA.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Recent advances in Computer Science and the spread of internet connectivity have allowed specialists to virtualize complex environments on the web and offer further information with realistic exploration experiences. At the same time, the use of complex geospatial datasets (point clouds, Building Information Modelling (BIM) models, 2D and 3D models) on the web is still a challenge, because it usually involves different proprietary software solutions, and the input data need further simplification to reduce computational effort. Moreover, integrating geospatial datasets acquired in different ways with various sensors remains a challenge. An interesting question, in that respect, is how to integrate 3D information in a 3D GIS (Geographic Information System) environment and manage different scales of information in the same application. Integrating a multiscale level of information is currently the first step when it comes to digital twinning. This is needed to properly manage complex urban datasets in digital twins related to the management of buildings (cadastral management, prevention of natural and anthropogenic hazards, structure monitoring, etc.). Therefore, the current research presents the development of a freely accessible 3D Web navigation model based on open-source technology that allows the visualization of heterogeneous complex geospatial datasets in the same virtual environment. This solution employs JavaScript libraries based on WebGL technology. The model is accessible through web browsers and does not need software installation on the user side. The case study is the new building of the University of Twente-Faculty of Geo-Information (ITC), located in Enschede (the Netherlands). The developed solution allows switching between heterogeneous datasets (point clouds, BIM, 2D and 3D models) at different scales and visualization modes (indoor first-person navigation, outdoor navigation, urban navigation). This solution could be employed by governmental stakeholders or the private sector to remotely visualize complex datasets on the web in a single visualization, and to make decisions based only on open-source solutions. Furthermore, this system can incorporate underground data or real-time sensor data from the IoT (Internet of Things) for digital twinning tasks.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Trait data represent the basis for ecological and evolutionary research and have relevance for biodiversity conservation, ecosystem management and earth system modelling. The collection and mobilization of trait data has strongly increased over the last decade, but many trait databases still provide only species-level, aggregated trait values (e.g. ranges, means) and lack the direct observations on which those data are based. Thus, the vast majority of trait data measured directly from individuals remains hidden and highly heterogeneous, impeding their discoverability, semantic interoperability, digital accessibility and (re-)use. Here, we integrate quantitative measurements of verbatim trait information from plant individuals (e.g. lengths, widths, counts and angles of stems, leaves, fruits and inflorescence parts) from multiple sources such as field observations and herbarium collections. We develop a workflow to harmonize heterogeneous trait measurements (e.g. trait names and their values and units) as well as additional information related to taxonomy, measurement or fact and occurrence. This data integration and harmonization builds on vocabularies and terminology from existing metadata standards and ontologies such as the Ecological Trait-data Standard (ETS), the Darwin Core (DwC), the Thesaurus Of Plant characteristics (TOP) and the Plant Trait Ontology (TO). A metadata form filled out by data providers enables the automated integration of trait information from heterogeneous datasets. We illustrate our tools with data from palms (family Arecaceae), a globally distributed (pantropical), diverse plant family that is considered a good model system for understanding the ecology and evolution of tropical rainforests. We mobilize nearly 140,000 individual palm trait measurements in an interoperable format, identify semantic gaps in existing plant trait terminology and provide suggestions for the future development of a thesaurus of plant characteristics. Our work thereby promotes the semantic integration of plant trait data in a machine-readable way and shows how large amounts of small trait data sets and their metadata can be integrated into standardized data products.
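To give a flavour of what such harmonization can look like in practice, the sketch below maps verbatim trait names to a controlled vocabulary and converts values to a common unit. The lookup tables and measurement records are invented for illustration and are not drawn from the palm dataset or the ETS/DwC/TOP vocabularies themselves.

# Illustrative-only sketch of trait harmonization: map verbatim trait names
# to standard terms and convert values to a common unit (cm).
TRAIT_SYNONYMS = {
    "leaf length": "leaf_length",
    "length of leaf": "leaf_length",
    "stem diameter": "stem_diameter",
}
TO_CM = {"mm": 0.1, "cm": 1.0, "m": 100.0}

def harmonize(record):
    """Return a record with a standard trait name and a value in cm."""
    trait = TRAIT_SYNONYMS[record["verbatim_trait"].strip().lower()]
    value_cm = record["value"] * TO_CM[record["unit"]]
    return {"trait": trait, "value_cm": value_cm, "taxon": record["taxon"]}

raw = [
    {"taxon": "Arecaceae sp.", "verbatim_trait": "Leaf length", "value": 350, "unit": "mm"},
    {"taxon": "Arecaceae sp.", "verbatim_trait": "stem diameter", "value": 0.12, "unit": "m"},
]

print([harmonize(r) for r in raw])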
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Facing an increasing amount of movement at sea and its daily impact on ships, crews and our global ecosystem, many research centers, international organizations and industrial actors have promoted and developed sensors and detection techniques for the monitoring, analysis and visualization of sea movements. The Automatic Identification System (AIS) is one of these electronic systems; it enables ships to broadcast their dynamic (position, speed, destination...) and static (name, type, international identifier...) information via radio communications.
Having a spatially and temporally aligned maritime dataset that relies not only on ships' positions but also on a variety of complementary data sources is of great interest for understanding maritime activities and their impact on the environment.
This dataset contains ships' information collected through the Automatic Identification System, integrated with a set of complementary data whose spatial and temporal dimensions are aligned. The dataset contains four categories of data: navigation data, vessel-oriented data, geographic data, and environmental data. It covers a time span of six months, from October 1st, 2015 to March 31st, 2016, and provides ship positions within the Celtic Sea, the English Channel and the Bay of Biscay (France). The dataset comes with predefined integration and querying principles for relational databases, which rely on the free and widely used relational database management system PostgreSQL, with the PostGIS extension handling all spatial features in the dataset.
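As an example of the kind of querying principle mentioned above, the sketch below retrieves AIS positions inside a bounding box with PostGIS. The connection settings, table name (ais_positions) and column names are placeholders; the actual schema is defined by the dataset's own integration scripts.

# Minimal sketch: spatial query against a PostgreSQL/PostGIS database.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="ais", user="postgres", password="postgres")
cur = conn.cursor()

# Ship positions inside a rough Bay of Biscay bounding box (lon/lat, WGS84).
cur.execute(
    """
    SELECT mmsi, ts, ST_X(geom), ST_Y(geom)
    FROM ais_positions
    WHERE ST_Within(geom, ST_MakeEnvelope(-6.0, 43.0, -1.0, 48.0, 4326))
      AND ts BETWEEN %s AND %s
    """,
    ("2015-10-01", "2015-10-02"),
)

for mmsi, ts, lon, lat in cur.fetchall():
    print(mmsi, ts, lon, lat)

cur.close()
conn.close()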
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the simulation inputs and outputs for the manuscript "An Open Source, Heterogeneous, Nonlinear Optics Simulation" in Optics Continuum.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Relying on geological data to construct 3D models can provide a more intuitive and easily comprehensible spatial perspective. This process aids in exploring underground spatial structures and geological evolutionary processes, providing essential data and assistance for the exploration of geological resources, energy development, engineering decision-making, and various other applications. As one of the methods for 3D geological modeling, multipoint statistics can effectively describe and reconstruct the intricate geometric shapes of nonlinear geological bodies. However, existing multipoint statistics algorithms still face challenges in efficiently extracting and reconstructing the global spatial distribution characteristics of geological objects. Moreover, they lack a data-driven modeling framework that integrates diverse sources of heterogeneous data. This research introduces a novel approach that combines multipoint statistics with multimodal deep artificial neural networks and constructs the 3D crustal P-wave velocity structure model of the South China Sea by using 44 OBS forward profiles, gravity anomalies, magnetic anomalies and topographic relief data. The experimental results demonstrate that the new approach surpasses multipoint statistics and Kriging interpolation methods, and can generate a more accurate 3D geological model through the integration of multiple geophysical data. Furthermore, the reliability of the 3D crustal P-wave velocity structure model, established using the novel method, was corroborated through visual and statistical analyses. This model intuitively delineates the spatial distribution characteristics of the crustal velocity structure in the South China Sea, thereby offering a foundational data basis for researchers to gain a more comprehensive understanding of the geological evolution process within this region.
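The description does not give the network architecture, but a generic multimodal fusion regressor of the kind alluded to might look like the PyTorch sketch below: one small encoder per geophysical input (gravity, magnetic, topography), concatenated and regressed onto P-wave velocity. All layer sizes and dimensions are invented for illustration and do not reflect the authors' model.

# Generic multimodal fusion sketch (not the authors' architecture).
import torch
import torch.nn as nn

class MultimodalVelocityNet(nn.Module):
    def __init__(self, dims=(4, 4, 4), hidden=16):
        super().__init__()
        # One small encoder per modality: gravity, magnetic, topography.
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in dims
        )
        self.head = nn.Linear(hidden * len(dims), 1)  # predicted P-wave velocity

    def forward(self, gravity, magnetic, topo):
        feats = [enc(x) for enc, x in zip(self.encoders, (gravity, magnetic, topo))]
        return self.head(torch.cat(feats, dim=-1))

model = MultimodalVelocityNet()
batch = [torch.randn(8, 4) for _ in range(3)]  # dummy gravity/magnetic/topography features
velocity = model(*batch)
print(velocity.shape)  # torch.Size([8, 1])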
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sub-cellular localisation of proteins is an essential post-translational regulatory mechanism that can be assayed using high-throughput mass spectrometry (MS). These MS-based spatial proteomics experiments enable us to pinpoint the sub-cellular distribution of thousands of proteins in a specific system under controlled conditions. Recent advances in high-throughput MS methods have yielded a plethora of experimental spatial proteomics data for the cell biology community. Yet, there are many third-party data sources, such as immunofluorescence microscopy or protein annotations and sequences, which represent a rich and vast source of complementary information. We present a unique transfer learning classification framework that utilises a nearest-neighbour or support vector machine system, to integrate heterogeneous data sources to considerably improve on the quantity and quality of sub-cellular protein assignment. We demonstrate the utility of our algorithms through evaluation of five experimental datasets, from four different species in conjunction with four different auxiliary data sources to classify proteins to tens of sub-cellular compartments with high generalisation accuracy. We further apply the method to an experiment on pluripotent mouse embryonic stem cells to classify a set of previously unknown proteins, and validate our findings against a recent high resolution map of the mouse stem cell proteome. The methodology is distributed as part of the open-source Bioconductor pRoloc suite for spatial proteomics data analysis.
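The pRoloc implementation is distributed in R/Bioconductor, but the flavour of the nearest-neighbour transfer learning idea can be sketched in Python: train one classifier on the primary MS data and one on an auxiliary source, then blend their class probabilities with a weight. The synthetic data and the single global weight below are illustrative only; the actual algorithm optimises the weights per class and per auxiliary source.

# Simplified flavour of kNN-based transfer learning for protein localisation.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n, classes = 200, 3
y = rng.integers(0, classes, n)
X_primary = rng.normal(size=(n, 10)) + y[:, None]   # MS quantitation profiles (synthetic)
X_aux = rng.normal(size=(n, 5)) + 0.5 * y[:, None]  # auxiliary features, e.g. annotations (synthetic)

primary = KNeighborsClassifier(n_neighbors=5).fit(X_primary[:150], y[:150])
auxiliary = KNeighborsClassifier(n_neighbors=5).fit(X_aux[:150], y[:150])

theta = 0.7  # weight on the primary data; pRoloc learns such weights per class
proba = theta * primary.predict_proba(X_primary[150:]) + (1 - theta) * auxiliary.predict_proba(X_aux[150:])
pred = proba.argmax(axis=1)
print("accuracy:", (pred == y[150:]).mean())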
IncRML resources
This Zenodo dataset contains all the resources of the paper 'IncRML: Incremental Knowledge Graph Construction from Heterogeneous Data Sources' submitted to the Semantic Web Journal's Special Issue on Knowledge Graph Construction. This resource aims to make the paper's experiments fully reproducible through our experiment tool written in Python, which was already used in the Knowledge Graph Construction Challenge by the ESWC 2023 Workshop on Knowledge Graph Construction. The exact Java JAR file of the RMLMapper (rmlmapper.jar) used to execute the experiments is also provided in this dataset. This JAR file was executed with Java OpenJDK 11.0.20.1 on Ubuntu 22.04.1 LTS (Linux 5.15.0-53-generic). Each experiment was executed 5 times and the median values are reported together with the standard deviation of the measurements.
Datasets
We provide both dataset dumps of the GTFS-Madrid-Benchmark and of real-life use cases from Open Data in Belgium. The GTFS-Madrid-Benchmark dumps are used to analyze the impact on execution time and resources, while the real-life use cases aim to verify the approach on different types of datasets, since the GTFS-Madrid-Benchmark is a single type of dataset which does not advertise changes at all.
Benchmarks
· GTFS-Madrid-Benchmark: change types with fixed data size and amount of changes: additions-only, modifications-only, deletions-only (11 versions)
· GTFS-Madrid-Benchmark: amount of changes with fixed data size: 0%, 25%, 50%, 75%, and 100% changes (11 versions)
· GTFS-Madrid-Benchmark: data size with fixed amount of changes: scales 1, 10, 100 (11 versions)
Real-life use cases
· Traffic control center Vlaams Verkeerscentrum (Belgium): traffic board messages data (1 day, 28760 versions)
· Meteorological institute KMI (Belgium): weather sensor data (1 day, 144 versions)
· Public transport agency NMBS (Belgium): train schedule data (1 week, 7 versions)
· Public transport agency De Lijn (Belgium): bus schedule data (1 week, 7 versions)
· Bike-sharing company BlueBike (Belgium): bike-sharing availability data (1 day, 1440 versions)
· Bike-sharing company JCDecaux (EU): bike-sharing availability data (1 day, 1440 versions)
· OpenStreetMap (World): geographical map data (1 day, 1440 versions)
Remarks
· The first version of each dataset is always used as a baseline. All subsequent versions are applied as updates on the existing version. The reported results focus only on the updates, since these represent the actual incremental generation.
· The GTFS-Change-50_percent-{ALL, CHANGE}.tar.xz datasets are not uploaded because they share the same parameters as GTFS-Madrid-Benchmark scale 100 (50% changes, scale 100). Please use GTFS-Scale-100-{ALL, CHANGE}.tar.xz in place of GTFS-Change-50_percent-{ALL, CHANGE}.tar.xz.
· All datasets are compressed with XZ and provided as TAR archives; be aware that you need sufficient space to decompress these archives. 2 TB of free space is advised to decompress all benchmarks and use cases. The expected output is provided as a ZIP file in each TAR archive; decompressing these requires even more space (4 TB).
Reproducing
By using our experiment tool, you can easily reproduce the experiments as follows:
1. Download one of the TAR.XZ archives and unpack it.
2. Clone the GitHub repository of our experiment tool and install the Python dependencies with 'pip install -r requirements.txt'.
3. Download the rmlmapper.jar JAR file from this Zenodo dataset and place it inside the experiment tool root folder.
4. Execute the tool by running: './exectool --root=/path/to/the/root/of/the/tarxz/archive --runs=5 run'. The argument '--runs=5' is used to perform the experiment 5 times.
5. Once executed, generate the statistics by running: './exectool --root=/path/to/the/root/of/the/tarxz/archive stats'.
Testcases
Testcases to verify the integration of RML and LDES with IncRML, see https://doi.org/10.5281/zenodo.10171394
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Component algorithms description.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Long-term, accurate observations of atmospheric phenomena are essential for a myriad of applications, including historic and future climate assessments, resource management, and infrastructure planning. In Hawai‘i, climate data are available from individual researchers, local, State, and Federal agencies, and from large electronic repositories such as the National Centers for Environmental Information (NCEI). Researchers attempting to make use of available data are faced with a series of challenges that include: (1) identifying potential data sources; (2) acquiring data; (3) establishing data quality assurance and quality control (QA/QC) protocols; and (4) implementing robust gap filling techniques (a minimal gap-filling sketch follows the file list below). This paper addresses these challenges by providing: (1) a summary of the available climate data in Hawai‘i, including a detailed description of the various meteorological observation networks and data accessibility, and (2) a quality-controlled meteorological dataset across the Hawaiian Islands for the 25-year period 1990-2014. The dataset draws on observations from 471 climate stations and includes rainfall, maximum and minimum surface air temperature, relative humidity, wind speed, downward shortwave and longwave radiation data. Resource in this dataset: Compilation of climate data from heterogeneous networks across the Hawaiian Islands (web page: https://figshare.com/collections/Compilation_of_climate_data_from_heterogeneous_networks_across_the_Hawaiian_Islands/3858208, DOI: https://doi.org/10.6084/m9.figshare.c.3858208), which includes the following 12 datasets:
List of Active and Discontinued Climate Stations in Hawaii
Daily Downwelling Longwave Radiation in Hawaii
Daily Incoming Solar Radiation in Hawaii
Daily Wind Speed in Hawaii
Daily Relative Humidity Data in Hawaii
Daily Minimum Temperature Data in Hawaii
Daily Minimum Temperature Data in Hawaii (partially gap filled)
Daily Maximum Temperature in Hawaii
Daily Maximum Temperature Data in Hawaii (partially gap filled)
Daily Rainfall Data in Hawaii
Daily Rainfall Data in Hawaii (partially gap filled)
Column Headers for all Data Files
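For the partially gap-filled files listed above, a very reduced version of neighbour-station gap filling might look like the sketch below: regress a target station's daily values on a well-correlated neighbour and predict the missing days. The column names and values are placeholders, and the published dataset applies more robust QA/QC and gap-filling procedures than this.

# Reduced gap-filling sketch: fill missing daily values at a target station
# from a correlated neighbour via linear regression.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame(
    {
        "neighbour_mm": [0.0, 5.2, 12.1, 3.3, 0.0, 7.8],
        "target_mm": [0.1, 4.8, None, 3.0, 0.0, None],
    }
)

known = df.dropna(subset=["target_mm"])
model = LinearRegression().fit(known[["neighbour_mm"]], known["target_mm"])

missing = df["target_mm"].isna()
df.loc[missing, "target_mm"] = model.predict(df.loc[missing, ["neighbour_mm"]])
print(df)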
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The tool aims at environmental monitoring of the Mar Menor coastal lagoon (Spain) and monitoring of the land use of its watershed. It integrates heterogeneous data sources ranging from ecological data obtained from a multiparametric oceanographic sonde, to agro-meteorological data from IMIDA's network of stations and hydrological data from the SAIH network, as well as multispectral satellite images from the Sentinel and Landsat space missions. The system is based on free and open source software and has been designed to guarantee maximum levels of flexibility and scalability and minimum coupling, so that the incorporation of new components does not affect the existing ones. The platform is designed to handle a data volume of more than 12 million records, which has experienced exponential growth over the last six months. The tool allows the transformation of a large volume of data into information, offering it through microservices with optimal response times. As practical applications, the platform allows us to know the ecological state of the Mar Menor with a very high level of detail, both at biophysical and nutrient levels, being able to detect periods of oxygen deficit and delimit the affected area. In addition, it facilitates the detailed monitoring of the cultivated areas of the watershed, detecting the agricultural use and crop cycles at the plot level. It also makes it possible to calculate the amount of water precipitated on the watershed and to monitor the runoff produced and the amount of water entering the Mar Menor in extreme events. The information is offered in different ways depending on the user profile: a very high level of detail for research or data analysis profiles, concrete and direct information to support decision-making for users with managerial profiles, and validated and concise information for citizens. It is an integrated and distributed system that will provide data and services for the Mar Menor Observatory.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
"Please if you use this datasets we appreciated that you reference this repository and cite the works related that made possible the generation of this dataset." This change detection datastet has different events, satellites, resolutions and includes both homogeneous/heterogeneous cases. The main idea of the dataset is to bring a benchmark on semantic change detection in remote sensing field.This dataset is the outcome of the following publications:
@article{JimenezSierra2022graph, author={Jimenez-Sierra, David Alejandro and Quintero-Olaya, David Alfredo and Alvear-Mu{\~n}oz, Juan Carlos and Ben{\'i}tez-Restrepo, Hern{\'a}n Dar{\'i}o and Florez-Ospina, Juan Felipe and Chanussot, Jocelyn}, journal={IEEE Transactions on Geoscience and Remote Sensing}, title={Graph Learning Based on Signal Smoothness Representation for Homogeneous and Heterogeneous Change Detection}, year={2022}, volume={60}, pages={1-16}, doi={10.1109/TGRS.2022.3168126}}
@article{JimenezSierra2020graph, title={Graph-Based Data Fusion Applied to: Change Detection and Biomass Estimation in Rice Crops}, author={Jimenez-Sierra, David Alejandro and Ben{\'i}tez-Restrepo, Hern{\'a}n Dar{\'i}o and Vargas-Cardona, Hern{\'a}n Dar{\'i}o and Chanussot, Jocelyn}, journal={Remote Sensing}, volume={12}, number={17}, pages={2683}, year={2020}, publisher={Multidisciplinary Digital Publishing Institute}, doi={10.3390/rs12172683}}
@inproceedings{jimenez2021blue, title={Blue noise sampling and Nystrom extension for graph based change detection}, author={Jimenez-Sierra, David Alejandro and Ben{\'\i}tez-Restrepo, Hern{\'a}n Dar{\'\i}o and Arce, Gonzalo R and Florez-Ospina, Juan F}, booktitle={2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS}, pages={2895--2898}, year={2021}, organization={IEEE}, doi={10.1109/IGARSS47720.2021.9555107}}
@article{florez2023exploiting, title={Exploiting variational inequalities for generalized change detection on graphs}, author={Florez-Ospina, Juan F and Jimenez Sierra, David A and Benitez-Restrepo, Hernan D and Arce, Gonzalo}, journal={IEEE Transactions on Geoscience and Remote Sensing}, year={2023}, volume={61}, pages={1-16}, doi={10.1109/TGRS.2023.3322377}}
@article{florez2023exploitingxiv, title={Exploiting variational inequalities for generalized change detection on graphs}, author={Florez-Ospina, Juan F. and Jimenez-Sierra, David A. and Benitez-Restrepo, Hernan D. and Arce, Gonzalo R}, year={2023}, publisher={TechRxiv}, doi={10.36227/techrxiv.23295866.v1}}
In the table in the HTML file (dataset_table.html), all the metadata and details related to each case within the dataset are tabulated. The cases with a link were gathered from those sources and authors, so you should refer to their work as well. The rest of the cases or events (without a link) were obtained through the use of open sources such as:
· Copernicus
· European Space Agency
· Alaska Satellite Facility (Vertex)
· Earth Data
In addition, we carried out all the processing of the images by using the SNAP toolbox from the European Space Agency. This processing involves the following:
· Data co-registration
· Cropping
· Apply Orbit (for SAR data)
· Calibration (for SAR data)
· Speckle Filter (for SAR data)
· Terrain Correction (for SAR data)
Lastly, the ground truth was obtained from homogeneous images for pre/post events by drawing polygons to highlight the areas where a visible change was present. The images were laid out and synchronized so that they could be zoomed over the same area for a better view of the changes. This was exhaustive work, done to be as precise as possible. Feel free to improve and contribute to this dataset.
Europe_Asia_establishments_Dryad: Historical numbers of European and Asian Scolytinae established in USA by decade.