50 datasets found
  1. Data Mining in Systems Health Management

    • catalog.data.gov
    • data.wu.ac.at
    Updated Apr 10, 2025
    Cite
    Dashlink (2025). Data Mining in Systems Health Management [Dataset]. https://catalog.data.gov/dataset/data-mining-in-systems-health-management
    Explore at:
    13 scholarly articles cite this dataset (View in Google Scholar)
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    This chapter presents theoretical and practical aspects associated with the implementation of a combined model-based/data-driven approach for failure prognostics based on particle filtering algorithms, in which the current estimate of the state PDF is used to determine the operating condition of the system and predict the progression of a fault indicator, given a dynamic state model and a set of process measurements. In this approach, the task of estimating the current value of the fault indicator, as well as other important changing parameters in the environment, involves two basic steps: the prediction step, based on the process model, and an update step, which incorporates the new measurement into the a priori state estimate. This framework allows the probability of failure at future time instants (the RUL PDF) to be estimated in real time, providing information about time-to-failure (TTF) expectations, statistical confidence intervals, and long-term predictions, using for this purpose empirical knowledge about critical conditions for the system (also referred to as hazard zones). This information is of paramount significance for improving system reliability and the cost-effective operation of critical assets, as has been shown in a case study where feedback correction strategies (based on uncertainty measures) were implemented to lengthen the RUL of a rotorcraft transmission system with propagating fatigue cracks on a critical component. Although the feedback loop is implemented using simple linear relationships, it provides quick insight into the manner in which the system reacts to changes in its input signals, in terms of its predicted RUL. The method is able to manage non-Gaussian PDFs since it includes concepts such as nonlinear state estimation and confidence intervals in its formulation. Real data from a fault-seeded test showed that the proposed framework was able to anticipate modifications of the system input to lengthen its RUL. Results of this test indicate that the method successfully suggested the correction that the system required. Future work will focus on the development and testing of similar strategies using different input-output uncertainty metrics.
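
    As a rough illustration of the two-step loop described above, here is a toy particle filter for a one-dimensional fault indicator: a prediction step propagates particles through a hypothetical exponential crack-growth model, an update step reweights them against each new measurement, and long-term propagation against a hazard threshold yields an approximate RUL/TTF distribution. The state model, noise levels, measurements, and hazard zone are invented placeholders, not the chapter's actual rotorcraft transmission model.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(particles, dt=1.0, growth=0.05, q=0.01):
    """Prediction step: propagate each particle through the (hypothetical)
    exponential fault-growth model, adding process noise."""
    return particles * np.exp(growth * dt) + rng.normal(0.0, q, particles.size)

def update(weights, particles, z, r=0.05):
    """Update step: fold measurement z into the a priori estimate by
    reweighting particles with a Gaussian likelihood. (A full filter would
    also resample once the weights degenerate; omitted for brevity.)"""
    w = weights * np.exp(-0.5 * ((z - particles) / r) ** 2)
    return w / w.sum()

def rul_pdf(particles, weights, hazard=1.0, dt=1.0, horizon=200):
    """Long-term prediction: run the model forward with no further updates
    and record when each particle first enters the hazard zone; the weighted
    crossing times approximate the RUL (TTF) distribution."""
    p, ttf = particles.copy(), np.full(particles.size, np.inf)
    for k in range(1, horizon + 1):
        p = predict(p, dt)
        ttf = np.where(np.isinf(ttf) & (p >= hazard), k * dt, ttf)
    return ttf, weights

# Toy run: three measurements of a growing fault indicator.
particles = rng.normal(0.20, 0.02, 1000)                 # initial state PDF
weights = np.full(particles.size, 1.0 / particles.size)
for z in (0.22, 0.25, 0.29):
    particles = predict(particles)
    weights = update(weights, particles, z)

ttf, w = rul_pdf(particles, weights)
ok = np.isfinite(ttf)
print("expected TTF:", np.average(ttf[ok], weights=w[ok]))
print("5th-95th percentile TTF:", np.percentile(ttf[ok], [5, 95]))  # unweighted, for simplicity
```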

  2. A predictive model for opal exploration in Australia from a data mining...

    • researchdata.edu.au
    Updated May 1, 2015
    Cite
    Thomas Landgrebe; Thomas Landgrebe; Adriana Dutkiewicz; Dietmar Muller (2015). A predictive model for opal exploration in Australia from a data mining approach [Dataset]. http://doi.org/10.4227/11/5587A86C0FDF1
    Explore at:
    Dataset updated
    May 1, 2015
    Dataset provided by
    The University of Sydney
    Authors
    Thomas Landgrebe; Thomas Landgrebe; Adriana Dutkiewicz; Dietmar Muller
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Dataset funded by
    Australian Research Council
    Description

    This data collection is associated with the publications: Merdith, A. S., Landgrebe, T. C. W., Dutkiewicz, A., & Müller, R. D. (2013). Towards a predictive model for opal exploration using a spatio-temporal data mining approach. Australian Journal of Earth Sciences, 60(2), 217-229. doi: 10.1080/08120099.2012.754793

    and

    Landgrebe, T. C. W., Merdith, A., Dutkiewicz, A., & Müller, R. D. (2013). Relationships between palaeogeography and opal occurrence in Australia: A data-mining approach. Computers & Geosciences, 56(0), 76-82. doi: 10.1016/j.cageo.2013.02.002

    Publication Abstract - Merdith et al. (2013)

    Opal is Australia's national gemstone; however, until recently most significant opal discoveries were made in the early 1900s, more than 100 years ago. Currently there is no formal exploration model for opal, meaning there are no widely accepted concepts or methodologies available to suggest where new opal fields may be found. As a consequence, opal mining in Australia is a cottage industry, with the majority of opal exploration focused around old opal fields. The EarthByte Group has developed a new opal exploration methodology for the Great Artesian Basin. The work is based on the concept of applying “big data mining” approaches to data sets relevant for identifying regions that are prospective for opal. The group combined a multitude of geological and geophysical data sets that were jointly analysed to establish associations between particular features in the data and known opal mining sites. A “training set” of known opal localities (1036 opal mines) was assembled using localities featured in published reports and on maps. The data used include rock types, soil type, regolith type, topography, radiometric data and a stack of digital palaeogeographic maps. The different data layers were analysed via spatio-temporal data mining, combining the GPlates PaleoGIS software (www.gplates.org) with the Orange data mining software (orange.biolab.si) to produce the first opal prospectivity map for the Great Artesian Basin. One of the main results of the study is that the geological conditions favourable for opal were found to be related to a particular sequence of surface environments over geological time: alternating shallow seas and river systems followed by uplift and erosion. The approach reduces the entire area of the Great Artesian Basin to a mere 6% that is deemed prospective for opal exploration. The work is described in two companion papers in the Australian Journal of Earth Sciences and Computers and Geosciences.

    Publication Abstract - Landgrebe et al. (2013)

    Age-coded multi-layered geological datasets are becoming increasingly prevalent with the surge in open-access geodata, yet there are few methodologies for extracting geological information and knowledge from these data. We present a novel methodology, based on the open-source GPlates software in which age-coded digital palaeogeographic maps are used to “data-mine” spatio-temporal patterns related to the occurrence of Australian opal. Our aim is to test the concept that only a particular sequence of depositional/erosional environments may lead to conditions suitable for the formation of gem quality sedimentary opal. Time-varying geographic environment properties are extracted from a digital palaeogeographic dataset of the eastern Australian Great Artesian Basin (GAB) at 1036 opal localities. We obtain a total of 52 independent ordinal sequences sampling 19 time slices from the Early Cretaceous to the present-day. We find that 95% of the known opal deposits are tied to only 27 sequences all comprising fluvial and shallow marine depositional sequences followed by a prolonged phase of erosion. We then map the total area of the GAB that matches these 27 opal-specific sequences, resulting in an opal-prospective region of only about 10% of the total area of the basin. The key patterns underlying this association involve only a small number of key environmental transitions. We demonstrate that these key associations are generally absent at arbitrary locations in the basin. This new methodology allows for the simplification of a complex time-varying geological dataset into a single map view, enabling straightforward application for opal exploration and for future co-assessment with other datasets/geological criteria. This approach may help unravel the poorly understood opal formation process using an empirical spatio-temporal data-mining methodology and readily available datasets to aid hypothesis testing.
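
    The core data-mining step of both papers can be reduced to a simple idea: encode each grid location as an ordinal sequence of palaeo-environment codes across the time slices, and flag locations whose sequence matches one of the sequences observed at known opal mines. Below is a minimal sketch of that matching step only; the environment codes, sequences, and coordinates are hypothetical placeholders, not values from the actual GPlates/Orange workflow.

```python
from typing import Dict, List, Set, Tuple

def prospective_locations(
    location_seqs: Dict[Tuple[float, float], List[str]],
    opal_seqs: Set[Tuple[str, ...]],
) -> List[Tuple[float, float]]:
    """Return locations whose environment sequence matches a known opal sequence."""
    return [loc for loc, seq in location_seqs.items() if tuple(seq) in opal_seqs]

# The study identified 27 opal-specific sequences; two hypothetical stand-ins:
opal_seqs = {
    ("shallow_sea", "fluvial", "shallow_sea", "erosion"),
    ("fluvial", "shallow_sea", "fluvial", "erosion"),
}
# Hypothetical (lon, lat) grid cells with their environment sequences:
grid = {
    (141.5, -26.0): ["shallow_sea", "fluvial", "shallow_sea", "erosion"],
    (140.0, -30.0): ["deep_sea", "deep_sea", "fluvial", "fluvial"],
}
print(prospective_locations(grid, opal_seqs))   # -> [(141.5, -26.0)]
```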

    Authors and Institutions

    Andrew Merdith - EarthByte Research Group, School of Geosciences, The University of Sydney, Australia. ORCID: 0000-0002-7564-8149

    Thomas Landgrebe - EarthByte Research Group, School of Geosciences, The University of Sydney, Australia

    Adriana Dutkiewicz - EarthByte Research Group, School of Geosciences, The University of Sydney, Australia

    R. Dietmar Müller - EarthByte Research Group, School of Geosciences, The University of Sydney, Australia. ORCID: 0000-0002-3334-5764

    Overview of Resources Contained

    This collection contains geological data from Australia used for data mining in the publications Merdith et al. (2013) and Landgrebe et al. (2013). The resulting maps of opal prospectivity are also included.

    List of Resources

    Note: For details on the files included in this data collection, see “Description_of_Resources.txt”.

    Note: For information on file formats and what programs to use to interact with various file formats, see “File_Formats_and_Recommended_Programs.txt”.

    • Map of Barfield region, Australia (.jpg, 270 KB)
    • Map overviewing the Great Artesian basins and main opal mining camps (.png, 82 KB)
    • Maps showing opal prospectivity data mining results for different geological datasets (.tif, 23.1 MB)
    • Map of opal prospectivity from palaeogeography data mining (.pdf, 2.6 MB)
    • Raster of palaeogeography target regions for viewing in Google Earth (.jpg, 418 KB)
    • Opal mine locations (.gpml, .txt, .kmz, .shp, total 15.6 MB)
    • Map of opal prospectivity from all data mining results as a Google Earth overlay (.kmz, 12 KB)
    • Map of probability of opal occurrence in prospective regions from all data mining results (.tif, 5.9 MB)
    • Paleogeography of Australia (.gpml, .txt, .shp, total 114.2 MB)
    • Radiometric data showing potassium concentration contrasts (.tif, .kmz, total 311.3 MB)
    • Regolith data (.gpml, .txt, .kml, .shp, total 7.1 MB)
    • Soil type data (.gpml, .txt, .kml, .shp, total 7.1 MB)

    For more information on this data collection, and links to other datasets from the EarthByte Research Group, please visit EarthByte.

    For more information about using GPlates, including tutorials and a user manual, please visit GPlates or EarthByte.

  3. SPHERE: Students' performance dataset of conceptual understanding,...

    • data.mendeley.com
    Updated Jan 15, 2025
    Cite
    SPHERE: Students' performance dataset of conceptual understanding, scientific ability, and learning attitude in physics education research (PER) [Dataset]. https://data.mendeley.com/datasets/88d7m2fv7p
    Explore at:
    Dataset updated
    Jan 15, 2025
    Authors
    Purwoko Haryadi Santoso
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SPHERE is a students' performance dataset for physics education research. It is presented as a multi-domain learning dataset of students' performance on physics, collected through several research-based assessments (RBAs) established by the physics education research (PER) community. A total of 497 eleventh-grade students were involved, from three large public high schools and one small public high school located in a suburban district of a highly populated province in Indonesia. Variables related to demographics, accessibility to literature resources, and students' physics identity are also investigated. The RBAs utilized in this dataset were selected based on concepts learned by the students in the Indonesian physics curriculum. We commenced the survey of students' understanding of Newtonian mechanics at the end of the first semester using the Force Concept Inventory (FCI) and the Force and Motion Conceptual Evaluation (FMCE). In the second semester, we assessed the students' scientific abilities and learning attitude through the Scientific Abilities Assessment Rubrics (SAAR) and the Colorado Learning Attitudes about Science Survey (CLASS), respectively. The conceptual assessments continued in the second semester with the Rotational and Rolling Motion Conceptual Survey (RRMCS), the Fluid Mechanics Concept Inventory (FMCI), the Mechanical Waves Conceptual Survey (MWCS), the Thermal Concept Evaluation (TCE), and the Survey of Thermodynamic Processes and First and Second Laws (STPFaSL). We expect SPHERE to be a valuable dataset for supporting the advancement of the PER field, particularly in quantitative studies. For example, research on machine learning and data mining techniques in PER often faces challenges due to the lack of datasets collected specifically for PER purposes. SPHERE can be reused as a students' performance dataset on physics dedicated to PER scholars who wish to implement machine learning techniques in physics education.
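
    As a hedged illustration of the kind of quantitative PER study the authors envision, the sketch below fits a regression model predicting FCI scores from the other instrument scores. The file name and the column names ("FMCE", "CLASS", "SAAR", "FCI") are assumptions about the table layout, not documented facts about SPHERE.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical local export of the SPHERE tables with hypothetical columns.
df = pd.read_csv("sphere.csv")
X = df[["FMCE", "CLASS", "SAAR"]]   # conceptual, attitude, and ability scores
y = df["FCI"]                       # conceptual-understanding target

# 5-fold cross-validated R^2 of a simple nonlinear regressor.
scores = cross_val_score(RandomForestRegressor(random_state=0), X, y,
                         cv=5, scoring="r2")
print("mean R^2:", scores.mean())
```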

  4. Data from: Improving the semantic quality of conceptual models through text...

    • figshare.com
    Updated May 30, 2023
    Cite
    Tom Willaert (2023). Improving the semantic quality of conceptual models through text mining. A proof of concept [Dataset]. http://doi.org/10.6084/m9.figshare.6951608.v1
    Explore at:
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Tom Willaert
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Python code generated in the context of the dissertation 'Improving the semantic quality of conceptual models through text mining. A proof of concept' (Postgraduate studies Big Data & Analytics for Business and Management, KU Leuven Faculty of Economics and Business, 2018)

  5. Expanding the Kendrick Mass Plot Toolbox in MZmine 2 to Enable Rapid Polymer...

    • acs.figshare.com
    zip
    Updated May 31, 2023
    Cite
    Ansgar Korf; Thierry Fouquet; Robin Schmid; Heiko Hayen; Sebastian Hagenhoff (2023). Expanding the Kendrick Mass Plot Toolbox in MZmine 2 to Enable Rapid Polymer Characterization in Liquid Chromatography−Mass Spectrometry Data Sets [Dataset]. http://doi.org/10.1021/acs.analchem.9b03863.s002
    Explore at:
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    ACS Publications
    Authors
    Ansgar Korf; Thierry Fouquet; Robin Schmid; Heiko Hayen; Sebastian Hagenhoff
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Technological advances in mass spectrometry (MS) toward more accurate and faster data acquisition result in highly informative but also more complex data sets. Especially the hyphenation of liquid chromatography (LC) and MS yields large data files containing a high amount of compound-specific information. Using electrospray ionization for compounds such as polymers enables highly sensitive detection, yet results in very complex spectra containing multiply charged ions and adducts. Recent years have seen the development of novel or updated data mining strategies to reduce MS spectral complexity and to ultimately simplify the data analysis workflow. Among other techniques, Kendrick mass defect analysis, which graphically highlights compounds containing a given repeating unit, has been revitalized with applications in multiple fields of study, such as lipids and polymers. Especially for the latter, various data mining concepts have been developed which extend regular Kendrick mass defect analysis to multiply charged ion series. The aim of this work is to collect and subsequently implement these concepts in one of the most popular open-source MS data mining software packages, i.e., MZmine 2, to make them rapidly available for different MS-based measurement techniques and various vendor formats, with a special focus on hyphenated techniques such as LC-MS. In combination with already existing data mining modules, an example data set was processed and simplified, enabling an even faster evaluation and polymer characterization.
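
    For context, the following sketch shows the standard Kendrick mass defect (KMD) computation that such modules build on, using an ethylene oxide repeat unit (C2H4O, 44.02621 u) and the charge-aware rescaling needed for multiply charged ion series. The ion masses and the assumption of protonated [M + zH]z+ species are illustrative, not taken from the paper's data set.

```python
PROTON = 1.007276  # mass of a proton, u

def kendrick_mass(mz: float, z: int = 1,
                  repeat_exact: float = 44.02621,
                  repeat_nominal: float = 44.0) -> float:
    """Kendrick mass rescaled so the repeat unit has integer mass; the
    neutral mass is recovered from m/z assuming [M + zH]z+ ions."""
    neutral_mass = z * (mz - PROTON)
    return neutral_mass * repeat_nominal / repeat_exact

def kmd(mz: float, z: int = 1, **kw) -> float:
    """Kendrick mass defect: distance of the Kendrick mass to the nearest integer."""
    km = kendrick_mass(mz, z, **kw)
    return round(km) - km

# Members of one homologous series share (nearly) the same KMD.
# Hypothetical PEG [M+H]+ ions, n = 9, 10, 11 repeat units:
for mz in (415.2538, 459.2800, 503.3062):
    print(mz, round(kmd(mz), 5))
```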

  6. Description of data sets.

    • plos.figshare.com
    xls
    Updated Jul 7, 2023
    Cite
    Shaoxia Mou; Heming Zhang (2023). Description of data sets. [Dataset]. http://doi.org/10.1371/journal.pone.0288140.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jul 7, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Shaoxia Mou; Heming Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Due to the inherent characteristics of cumulative sequences of unbalanced data, mining results for this kind of data are often dominated by the majority categories, degrading mining performance. To solve this problem, the performance of data cumulative sequence mining is optimized, and an algorithm for mining cumulative sequences of unbalanced data based on probability matrix decomposition is studied. The natural nearest neighbors of the minority samples in the unbalanced cumulative sequence are determined, and the minority samples are clustered according to the natural nearest neighbor relationship. Within each cluster, new samples are generated from the core points of dense regions and the non-core points of sparse regions, and the new samples are added to the original cumulative sequence to balance it. The probability matrix decomposition method then generates two random matrices with Gaussian-distributed entries from the balanced cumulative sequence, and linear combinations of low-dimensional latent vectors are used to explain the preference of specific users for the data sequence. At the same time, from a global perspective, the AdaBoost idea is used to adaptively adjust sample weights and optimize the probability matrix decomposition algorithm, optimizing global errors as well as single-sample errors. Experimental results show that the algorithm can effectively generate new samples, reduce the imbalance of the data cumulative sequence, and obtain more accurate mining results. The minimum RMSE is obtained when the decomposition dimension is 5. The proposed algorithm has good classification performance on cumulative sequences of balanced data, with the best average rankings on the F-value, G-mean, and AUC indexes.
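
    The decomposition step described here is essentially probabilistic matrix factorization. The sketch below shows a generic stochastic-gradient version with Gaussian-initialized low-rank factors; the natural-nearest-neighbor oversampling and the AdaBoost-style weight adaptation from the abstract are not reproduced, and the toy preference matrix is invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def pmf(R, mask, k=5, lr=0.01, reg=0.1, epochs=200):
    """Factorize R ~ U @ V.T over observed entries (mask == 1) by SGD.
    U and V start as Gaussian random matrices, as in the abstract."""
    n, m = R.shape
    U = rng.normal(0, 0.1, (n, k))
    V = rng.normal(0, 0.1, (m, k))
    rows, cols = np.nonzero(mask)
    for _ in range(epochs):
        for i, j in zip(rows, cols):
            err = R[i, j] - U[i] @ V[j]          # residual on one observed entry
            U[i] += lr * (err * V[j] - reg * U[i])
            V[j] += lr * (err * U[i] - reg * V[j])
    return U, V

# Toy user-by-sequence preference matrix; zeros are unobserved entries.
R = np.array([[5., 3., 0.], [4., 0., 1.], [0., 2., 5.]])
mask = (R > 0).astype(int)
U, V = pmf(R, mask, k=2)
rmse = np.sqrt((((R - U @ V.T) * mask) ** 2).sum() / mask.sum())
print("train RMSE:", round(rmse, 3))
```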

  7. Longitudinal trends of EHR concepts in pediatric patients

    • zenodo.org
    • datadryad.org
    csv
    Updated Jun 10, 2022
    Cite
    Nicholas Giangreco; Nicholas Giangreco (2022). Longitudinal trends of EHR concepts in pediatric patients [Dataset]. http://doi.org/10.5061/dryad.j0zpc86g3
    Explore at:
    Available download formats: csv
    Dataset updated
    Jun 10, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Nicholas Giangreco; Nicholas Giangreco
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The longitudinal nature of the data motivated temporal trend identification in the pediatric EHR datatypes. Over the past three decades (1980-2018), we identified and quantified the temporal trend of 16,460 EHR concepts across measurement, visit, diagnosis, drug, and procedure datatypes.

  8. Simulated supermarket transaction data

    • researchdatafinder.qut.edu.au
    • researchdata.edu.au
    Updated May 31, 2010
    Cite
    Yuefeng Li (2010). Simulated supermarket transaction data [Dataset]. https://researchdatafinder.qut.edu.au/individual/q44
    Explore at:
    Dataset updated
    May 31, 2010
    Dataset provided by
    Queensland University of Technology (QUT)
    Authors
    Yuefeng Li
    Description

    A database of de-identified supermarket customer transactions. This large simulated dataset was created based on a real data sample.

  9. OceanXtremes: Oceanographic Data-Intensive Anomaly Detection and Analysis...

    • data.amerigeoss.org
    • data.wu.ac.at
    html
    Updated Jul 25, 2019
    Cite
    United States[old] (2019). OceanXtremes: Oceanographic Data-Intensive Anomaly Detection and Analysis Portal [Dataset]. https://data.amerigeoss.org/pl/dataset/0f24d562-556c-4895-955a-74fec4cc9993
    Explore at:
    Available download formats: html
    Dataset updated
    Jul 25, 2019
    Dataset provided by
    United States[old]
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Anomaly detection is a process of identifying items, events or observations which do not conform to an expected pattern in a dataset or time series. Current and future missions and our research communities challenge us to rapidly identify features and anomalies in complex and voluminous observations to further science and improve decision support. Given this data-intensive reality, we propose to develop an anomaly detection system, called OceanXtremes, powered by an intelligent, elastic Cloud-based analytic service backend that enables execution of domain-specific, multi-scale anomaly and feature detection algorithms across the entire archive of ocean science datasets. A parallel analytics engine will be developed as the key computational and data-mining core of OceanXtremes' backend processing. This analytic engine will demonstrate three new technology ideas to provide rapid turnaround on climatology computation and anomaly detection:

    1. An adaptation of the Hadoop/MapReduce framework for parallel data mining of science datasets, typically large 3- or 4-dimensional arrays packaged in NetCDF and HDF.
    2. An algorithm profiling service to efficiently and cost-effectively scale up hybrid Cloud computing resources based on the needs of scheduled jobs (CPU, memory, network, and bursting from a private Cloud computing cluster to a public cloud provider like Amazon Cloud services).
    3. An extension to industry-standard search solutions (OpenSearch and faceted search) to provide support for shared discovery and exploration of ocean phenomena and anomalies, along with unexpected correlations between key measured variables.

    We will use a hybrid Cloud compute cluster (private Eucalyptus on-premise at JPL with bursting to Amazon Web Services) as the operational backend. The key idea is that the parallel data-mining operations will be run 'near' the ocean data archives (a local 'network' hop) so that we can efficiently access the thousands of (say, daily) files making up a three-decade time series, and then cache key variables and pre-computed climatologies in a high-performance parallel database. OceanXtremes will be equipped with both web portal and web service interfaces for users and applications/systems to register and retrieve oceanographic anomaly data. By leveraging technology such as Datacasting (Bingham et al., 2007), users can also subscribe to anomaly or 'event' types of their interest and have newly computed anomaly metrics and other information delivered to them by metadata feeds packaged in standard Rich Site Summary (RSS) format. Upon receiving new feed entries, users can examine the metrics and download relevant variables, by simply clicking on a link, to begin further analyzing the event. The OceanXtremes web portal will allow users to define their own anomaly or feature types, where continuous backend processing will be scheduled to populate the new user-defined anomaly type by executing the chosen data mining algorithm (i.e. differences from climatology or gradients above a specified threshold). Metadata on the identified anomalies will be cataloged, including temporal and geospatial profiles, key physical metrics, related observational artifacts and other relevant metadata, to facilitate discovery, extraction, and visualization. Products created by the anomaly detection algorithm will be made explorable and subsettable using Webification (Huang et al., 2014) and OPeNDAP (http://opendap.org) technologies. Using this platform scientists can efficiently search for anomalies or ocean phenomena, compute data metrics for events or over time series of ocean variables, and efficiently find and access all of the data relevant to their study (and then download only that data).
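
    The first mining algorithm mentioned, differences from climatology above a threshold, is easy to sketch for a single grid cell with NumPy; in OceanXtremes the same computation would run in parallel across the whole archive. The synthetic sea-surface-temperature series below is an invented placeholder.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical daily sea-surface temperature for one grid cell, 30 years.
doy = np.arange(365)
sst = 20 + 5 * np.sin(2 * np.pi * doy / 365) + rng.normal(0, 0.5, (30, 365))
sst[-1, 200:205] += 3.0                     # inject a synthetic warm anomaly

climatology = sst.mean(axis=0)              # long-term mean per day of year
sigma = sst.std(axis=0)
anomaly = sst[-1] - climatology             # differences from climatology
events = np.nonzero(np.abs(anomaly) > 3 * sigma)[0]
print("anomalous days of year:", events)    # flags the injected days (plus, possibly, a stray noise spike)
```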

  10. The F-measure values of three experiments.

    • plos.figshare.com
    xls
    Updated May 30, 2023
    Cite
    Jyh-Jian Sheu; Ko-Tsung Chu; Nien-Feng Li; Cheng-Chi Lee (2023). The F-measure values of three experiments. [Dataset]. http://doi.org/10.1371/journal.pone.0171518.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS: http://plos.org/
    Authors
    Jyh-Jian Sheu; Ko-Tsung Chu; Nien-Feng Li; Cheng-Chi Lee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The F-measure values of three experiments.

  11. Data from: CONCEPT-DM2 DATA MODEL TO ANALYSE HEALTHCARE PATHWAYS OF TYPE 2...

    • zenodo.org
    bin, png, zip
    Updated Jul 12, 2024
    Cite
    Berta Ibáñez-Beroiz; Berta Ibáñez-Beroiz; Asier Ballesteros-Domínguez; Asier Ballesteros-Domínguez; Ignacio Oscoz-Villanueva; Ignacio Oscoz-Villanueva; Ibai Tamayo; Ibai Tamayo; Julián Librero; Julián Librero; Mónica Enguita-Germán; Mónica Enguita-Germán; Francisco Estupiñán-Romero; Francisco Estupiñán-Romero; Enrique Bernal-Delgado; Enrique Bernal-Delgado (2024). CONCEPT- DM2 DATA MODEL TO ANALYSE HEALTHCARE PATHWAYS OF TYPE 2 DIABETES [Dataset]. http://doi.org/10.5281/zenodo.7778291
    Explore at:
    Available download formats: bin, png, zip
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Berta Ibáñez-Beroiz; Berta Ibáñez-Beroiz; Asier Ballesteros-Domínguez; Asier Ballesteros-Domínguez; Ignacio Oscoz-Villanueva; Ignacio Oscoz-Villanueva; Ibai Tamayo; Ibai Tamayo; Julián Librero; Julián Librero; Mónica Enguita-Germán; Mónica Enguita-Germán; Francisco Estupiñán-Romero; Francisco Estupiñán-Romero; Enrique Bernal-Delgado; Enrique Bernal-Delgado
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Technical notes and documentation on the common data model of the project CONCEPT-DM2.

    This publication corresponds to the Common Data Model (CDM) specification of the CONCEPT-DM2 project for the implementation of a federated network analysis of the healthcare pathway of type 2 diabetes.

    Aims of the CONCEPT-DM2 project:

    General aim: To analyse the effectiveness and efficiency of chronic care pathways in diabetes, assuming the relevance of care pathways as independent factors of health outcomes, using real-world data (RWD) from five Spanish Regional Health Systems.

    Main specific aims:

    • To characterize the care pathways in patients with diabetes through the whole care system in terms of process indicators and pharmacologic recommendations
    • To compare these observed care pathways with the theoretical clinical pathways derived from the clinical practice guidelines
    • To assess whether adherence to clinical guidelines influences important health outcomes, such as cardiovascular hospitalizations.
    • To compare the traditional analytical methods with process mining methods in terms of modeling quality, prediction performance and information provided.

    Study Design: It is a population-based retrospective observational study centered on all T2D patients diagnosed in five Regional Health Services within the Spanish National Health Service. We will include all the contacts of these patients with the health services using the electronic medical record systems including Primary Care data, Specialized Care data, Hospitalizations, Urgent Care data, Pharmacy Claims, and also other registers such as the mortality and the population register.

    Cohort definition: All patients with a Type 2 Diabetes code in the clinical health records (a pandas sketch of these criteria follows the file list below).

    • Inclusion criteria: patients that, at 01/01/2017 or during the follow-up from 01/01/2017 to 31/12/2022, had an active health card (active TIS, tarjeta sanitaria activa) and a type 2 diabetes code (T2D; DM2 in Spanish) in the clinical records of primary care (CIAP2 T90 where the CIAP coding system is used)
    • Exclusion criteria:
      • patients with no contact with the health system from 01/01/2017 to 31/12/2022
      • patients with a T1D (DM1) code opened after the T2D code during the follow-up
    • Study period: from 01/01/2017 to 31/12/2022

    Files included in this publication:

    • Datamodel_CONCEPT_DM2_diagram.png
    • Common data model specification (Datamodel_CONCEPT_DM2_v.0.1.0.xlsx)
    • Synthetic datasets (Datamodel_CONCEPT_DM2_sample_data)
      • sample_data1_dm_patient.csv
      • sample_data2_dm_param.csv
      • sample_data3_dm_patient.csv
      • sample_data4_dm_param.csv
      • sample_data5_dm_patient.csv
      • sample_data6_dm_param.csv
      • sample_data7_dm_param.csv
      • sample_data8_dm_param.csv
    • Datamodel_CONCEPT_DM2_explanation.pptx
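
    As a hedged pandas sketch (not project code), the inclusion and exclusion criteria above might be applied to the synthetic patient table roughly as follows; every column name is a hypothetical placeholder, since the real field names are defined in the CDM specification spreadsheet.

```python
import pandas as pd

# Hypothetical column names; ISO date strings compare correctly as text.
patients = pd.read_csv("Datamodel_CONCEPT_DM2_sample_data/sample_data1_dm_patient.csv")
start, end = "2017-01-01", "2022-12-31"

cohort = patients[
    (patients["t2d_code_date"] <= end)                    # T2D code by end of follow-up
    & (patients["active_card"] == 1)                      # active health card (TIS)
    & patients["last_contact_date"].between(start, end)   # contact during follow-up
]
# Exclusion: a T1D (DM1) code opened after the T2D code during follow-up.
cohort = cohort[~(
    cohort["t1d_code_date"].notna()
    & (cohort["t1d_code_date"] > cohort["t2d_code_date"])
    & cohort["t1d_code_date"].between(start, end)
)]
print(len(cohort), "patients in cohort")
```
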
  12. AI-DataMining Dataset

    • paperswithcode.com
    Updated Aug 10, 2024
    Cite
    (2024). AI-DataMining Dataset [Dataset]. https://paperswithcode.com/dataset/ai-datamining
    Explore at:
    Dataset updated
    Aug 10, 2024
    Description

    Despite the availability of vast amounts of data, legal data is often unstructured, making it difficult even for law practitioners to ingest and comprehend. It is important to organise legal information in a way that is useful for practitioners and downstream automation tasks. The word ontology was used by Greek philosophers to discuss concepts of existence, being, becoming and reality. Today, scientists use this term to describe the relations between concepts, data, and entities. A great example of a working ontology was developed by Dhani and Bhatt. This ontology deals with Indian court cases on intellectual property rights (IPR). The future of legal ontologies is likely to be handled by computer experts and legal experts alike.

  13. Data and Model Checkpoints for "Weakly Supervised Concept Map Generation...

    • figshare.com
    application/x-gzip
    Updated May 31, 2023
    Cite
    Jiaying Lu (2023). Data and Model Checkpoints for "Weakly Supervised Concept Map Generation through Task-Guided Graph Translation" [Dataset]. http://doi.org/10.6084/m9.figshare.16415802.v2
    Explore at:
    Available download formats: application/x-gzip
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Jiaying Lu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data and model checkpoints for paper "Weakly Supervised Concept Map Generation through Task-Guided Graph Translation" by Jiaying Lu, Xiangjue Dong, and Carl Yang. The paper has been accepted by IEEE Transactions on Knowledge and Data Engineering (TKDE).

    GT-D2G-*.tar.gz are model checkpoints for GT-D2G variants. These models were trained with seed=27. nyt/dblp/yelp.*.win5.pickle.gz are the initial graphs generated by NLP pipelines. glove.840B.restaurant.400d.vec.gz is the pre-trained embedding for the Yelp dataset.

    For more instructions, please refer to our GitHub repo.

  14. Data from: Joint Behavior-Topic Model for Microblogs

    • researchdata.smu.edu.sg
    bin
    Updated May 31, 2023
    Cite
    QIU Minghui; Feida ZHU; Jing JIANG (2023). Joint Behavior-Topic Model for Microblogs [Dataset]. http://doi.org/10.25440/smu.12062724.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    May 31, 2023
    Dataset provided by
    SMU Research Data Repository (RDR)
    Authors
    QIU Minghui; Feida ZHU; Jing JIANG
    License

    In Copyright (InC 1.0): http://rightsstatements.org/vocab/InC/1.0/

    Description

    We propose an LDA-based behavior-topic model (B-LDA) which jointly models user topic interests and behavioral patterns. We focus the study of the model on online social network settings such as microblogs like Twitter, where the textual content is relatively short but user interactions are rich. Related Publication: Qiu, M., Zhu, F., & Jiang, J. (2013). It is not just what we say, but how we say them: LDA-based behavior-topic model. In 2013 SIAM International Conference on Data Mining (SDM'13): 2-4 May, Austin, Texas (pp. 794-802). Philadelphia: SIAM. http://doi.org/10.1137/1.9781611972832.88
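
    As a rough starting point for experiments in this direction, the sketch below fits a plain LDA topic model to a few toy microblog posts with scikit-learn. The joint modeling of behaviors (the "B" in B-LDA) is the paper's contribution and is not implemented here; the posts are invented.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

posts = [
    "new paper on topic models and data mining",
    "great coffee this morning",
    "retweeting this thread on lda and gibbs sampling",
    "morning run then coffee",
]
# Bag-of-words counts, then a 2-topic LDA fit.
X = CountVectorizer(stop_words="english").fit_transform(posts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X).round(2))          # per-post topic mixtures
```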

  15. Top five keyword counts by month.

    • figshare.com
    xls
    Updated May 31, 2023
    Cite
    Jyh-Jian Sheu; Ko-Tsung Chu; Nien-Feng Li; Cheng-Chi Lee (2023). Top five keyword counts by month. [Dataset]. http://doi.org/10.1371/journal.pone.0171518.t006
    Explore at:
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Jyh-Jian Sheu; Ko-Tsung Chu; Nien-Feng Li; Cheng-Chi Lee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Top five keyword counts by month.

  16. Data from: Building the graph of medicine from millions of clinical...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, txt
    Updated May 28, 2022
    Cite
    Samuel G. Finlayson; Paea LePendu; Nigam H. Shah; Samuel G. Finlayson; Paea LePendu; Nigam H. Shah (2022). Data from: Building the graph of medicine from millions of clinical narratives [Dataset]. http://doi.org/10.5061/dryad.jp917
    Explore at:
    Available download formats: application/gzip, txt
    Dataset updated
    May 28, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Samuel G. Finlayson; Paea LePendu; Nigam H. Shah; Samuel G. Finlayson; Paea LePendu; Nigam H. Shah
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Electronic health records (EHR) represent a rich and relatively untapped resource for characterizing the true nature of clinical practice and for quantifying the degree of inter-relatedness of medical entities such as drugs, diseases, procedures and devices. We provide a unique set of co-occurrence matrices, quantifying the pairwise mentions of 3 million terms mapped onto 1 million clinical concepts, calculated from the raw text of 20 million clinical notes spanning 19 years of data. Co-frequencies were computed by means of a parallelized annotation, hashing, and counting pipeline that was applied over clinical notes from Stanford Hospitals and Clinics. The co-occurrence matrix quantifies the relatedness among medical concepts which can serve as the basis for many statistical tests, and can be used to directly compute Bayesian conditional probabilities, association rules, as well as a range of test statistics such as relative risks and odds ratios. This dataset can be leveraged to quantitatively assess comorbidity, drug-drug, and drug-disease patterns for a range of clinical, epidemiological, and financial applications.
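
    The statistics mentioned follow directly from the co-occurrence counts. A small sketch for a single concept pair, with invented counts:

```python
def cooccurrence_stats(n_ab, n_a, n_b, n_total):
    """Pairwise statistics from note-level counts: n_ab notes mention both
    concepts, n_a / n_b mention each concept alone, n_total is all notes."""
    p_b_given_a = n_ab / n_a                          # conditional probability P(B|A)
    a, b = n_ab, n_a - n_ab                           # 2x2 contingency table
    c, d = n_b - n_ab, n_total - n_a - n_b + n_ab
    relative_risk = (a / (a + b)) / (c / (c + d))
    odds_ratio = (a * d) / (b * c)
    return p_b_given_a, relative_risk, odds_ratio

# Invented counts for a hypothetical drug-disease pair:
print(cooccurrence_stats(n_ab=1200, n_a=5000, n_b=8000, n_total=20_000_000))
```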

  17. The detailed datum of the Experiment C.

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Jyh-Jian Sheu; Ko-Tsung Chu; Nien-Feng Li; Cheng-Chi Lee (2023). The detailed datum of the Experiment C. [Dataset]. http://doi.org/10.1371/journal.pone.0171518.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Jyh-Jian Sheu; Ko-Tsung Chu; Nien-Feng Li; Cheng-Chi Lee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The detailed datum of the Experiment C.

  18. Concept Lab: Precomputed Associations for Shared Lexis Tool and Associated...

    • repository.cam.ac.uk
    bin, docx, txt
    Updated Apr 13, 2022
    Cite
    Recchia, Gabriel; Jones, Ewan; Nulty, Paul; de Bolla, Peter; Regan, John (2022). Concept Lab: Precomputed Associations for Shared Lexis Tool and Associated Files (Public) [Dataset]. http://doi.org/10.17863/CAM.43499
    Explore at:
    Available download formats: bin (178312758 bytes), txt (30364 bytes), docx (433340 bytes), bin (166916274 bytes), docx (1296659 bytes)
    Dataset updated
    Apr 13, 2022
    Dataset provided by
    Apollo
    University of Cambridge
    Authors
    Recchia, Gabriel; Jones, Ewan; Nulty, Paul; de Bolla, Peter; Regan, John
    Description

    This dataset consists of:

    I. Source code and documentation for the "Shared Lexis Tool", a Windows desktop application that provides a means of exploring all of the words that are statistically associated with a word provided by the user, in a given corpus of text (for certain predefined corpora), over a given date range.

    II. Source code and documentation for the "Coassociation Grapher", a Windows desktop application. Given a particular word of interest (a “focal token”) in a particular corpus of text, the Coassociation Grapher allows you to view the relative probability of observing other terms (“bound tokens”) before or after the focal token.

    III. Numerous precomputed files that need to be hosted on a webserver in order for the Shared Lexis Tool to function properly.

    IV. Files that were created in the course of conducting the research described in "Tracing shifting conceptual vocabularies through time" and "The idea of liberty" (full citations in the 'SHARING/ACCESS INFORMATION' section above), including "cliques" (https://en.wikipedia.org/wiki/Clique_(graph_theory)) of words that frequently appear together.

    V. Source code of text-processing scripts developed by the Concept Lab, primarily for the purpose of generating the precomputed files described in section III, and associated data.

    The Shared Lexis Tool and Coassociation Grapher (and the required precomputed files) are also being hosted at https://concept-lab.lib.cam.ac.uk/ from 2018 to 2023, and therefore those who are merely interested in using the tools within this time frame will have no use for the present dataset. However, these files may be useful for individuals who wish to host the files on their own webserver, for example, in order to use the Shared Lexis tool past 2023. See README.txt for more information.
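
    To give the flavour of the Coassociation Grapher's statistic without the precomputed files, the toy sketch below counts how often a bound token appears at each offset around a focal token and normalizes the counts into relative probabilities. It is an illustrative reimplementation of the idea, not the tool's own code.

```python
from collections import Counter

def coassociation(tokens, focal, bound, window=5):
    """Relative probability of seeing `bound` at each offset around `focal`."""
    counts = Counter()
    for i, t in enumerate(tokens):
        if t != focal:
            continue
        for off in range(-window, window + 1):
            if off and 0 <= i + off < len(tokens) and tokens[i + off] == bound:
                counts[off] += 1
    total = sum(counts.values()) or 1
    return {off: n / total for off, n in sorted(counts.items())}

tokens = "the idea of liberty is the idea of freedom under law".split()
print(coassociation(tokens, focal="idea", bound="of", window=3))  # {1: 1.0}
```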

  19. SDOstreamclust: Stream Clustering Robust to Concept Drift - Evaluation Tests...

    • researchdata.tuwien.at
    • researchdata.tuwien.ac.at
    zip
    Updated Sep 17, 2024
    Cite
    Felix Iglesias Vazquez; Felix Iglesias Vazquez (2024). SDOstreamclust: Stream Clustering Robust to Concept Drift - Evaluation Tests [Dataset]. http://doi.org/10.48436/xh0w2-q5x18
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 17, 2024
    Dataset provided by
    TU Wien
    Authors
    Felix Iglesias Vazquez; Felix Iglesias Vazquez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SDOstreamclust Evaluation Tests

    conducted for the paper: Stream Clustering Robust to Concept Drift

    Context and methodology

    SDOstreamclust is a stream clustering algorithm able to process data incrementally or per batches. It is a combination of the previous SDOstream (anomaly detection in data streams) and SDOclust (static clustering). SDOstreamclust holds the characteristics of SDO algorithms: lightweight, intuitive, self-adjusting, resistant to noise, capable of identifying non-convex clusters, and constructed upon robust parameters and interpretable models. Moreover, it shows excellent adaptation to concept drift.

    In this repository, SDOstreamclust is evaluated with 165 datasets (both synthetic and real) and compared with CluStream, DBstream, DenStream, and StreamKMeans.

    This repository is framed within research on the following domains: algorithm evaluation, stream clustering, unsupervised learning, machine learning, data mining, and streaming data analysis. Datasets and algorithms can be used for experiment replication and for further evaluation and comparison.

    Docker

    A Docker version is also available in: https://hub.docker.com/r/fiv5/sdostreamclust

    Technical details

    Experiments are conducted in Python v3.8.14. The file and folder structure is as follows:

    • [algorithms] contains a script with functions related to algorithm configurations.

    • [data] contains datasets in ARFF format.
    • [results] contains CSV files with algorithms' performances obtained from running the "run.sh" script (as shown in the paper).
    • "dependencies.sh" lists and installs python dependencies.
    • "pysdoclust-stream-main.zip" contains the SDOstreamclust python package.
    • "README.md" shows details and intructions to use this repository.
    • "run.sh" runs the complete experiments.
    • "run_comp.py"for running experiments specified by arguments.
    • "TSindex.py" implements functions for the Temporal Silhouette index.
    Note: if the code in SDOstreamclust is modified, the SWIG (v4.2.1) wrappers have to be rebuilt and SDOstreamclust reinstalled with pip.
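
    As a quick-start illustration (not one of the repository's own scripts), an ARFF dataset from [data] can be loaded with SciPy's ARFF reader; the file name below is a placeholder.

```python
import pandas as pd
from scipy.io import arff

# Placeholder file name; any dataset under [data] should load the same way.
data, meta = arff.loadarff("data/example_stream.arff")
df = pd.DataFrame(data)
print(meta.names(), df.shape)
```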

    License

    The CC-BY license applies to all data generated with MDCgen. All distributed code is under the GPLv3+ license.

  20. PRELEARN Dataset

    • live.european-language-grid.eu
    csv
    Updated Nov 23, 2021
    Cite
    (2021). PRELEARN Dataset [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/8084
    Explore at:
    Available download formats: csv
    Dataset updated
    Nov 23, 2021
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The PRELEARN dataset contains 6607 concept pairs and a “Wikipedia pages file” containing the raw text of the Wikipedia pages referring to the extracted concepts (obtained using WikiExtractor on a Wikipedia dump of Jan. 2020). The dataset was used for the PRELEARN shared task (https://sites.google.com/view/prelearn20/), organised as part of the Evalita 2020 evaluation campaign (http://www.evalita.it/2020). It was extracted from the ITA-PREREQ dataset (Miaschi et al., 2019), built upon the AL-CPL dataset (Liang et al., 2018), a collection of binary-labelled concept pairs extracted from textbooks on four domains: data mining, geometry, physics and pre-calculus.

    The concept pairs consist of target and prerequisite concepts (A, B), labelled as follows:

    1 if B is a prerequisite of A;

    0 in all other cases.

    Domain experts were asked to manually annotate whether pairs of concepts showed a prerequisite relation or not. The dataset is split into a training set (5908 pairs) and a test set (699 pairs). The distribution of prerequisite and non-prerequisite labels was balanced (50/50) for each domain only in the test datasets.
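
    A minimal sketch for inspecting the label balance is shown below; the file name, the assumption of a headerless CSV, and the column names are all guesses about the distribution format, not documented facts.

```python
import pandas as pd

# Hypothetical layout: target concept, prerequisite concept, domain, label.
pairs = pd.read_csv("prelearn_train.csv",
                    names=["target", "prerequisite", "domain", "label"])
print(pairs["label"].value_counts(normalize=True))   # 1 = B is a prerequisite of A
print(pairs.groupby("domain")["label"].mean())       # per-domain balance
```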
