52 datasets found
  1. Data Mining in Systems Health Management

    • catalog.data.gov
    • s.cnmilf.com
    • +1 more
    Updated Apr 10, 2025
    Cite
    Dashlink (2025). Data Mining in Systems Health Management [Dataset]. https://catalog.data.gov/dataset/data-mining-in-systems-health-management
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    This chapter presents theoretical and practical aspects of implementing a combined model-based/data-driven approach for failure prognostics based on particle filtering algorithms, in which the current estimate of the state PDF is used to determine the operating condition of the system and predict the progression of a fault indicator, given a dynamic state model and a set of process measurements. In this approach, the task of estimating the current value of the fault indicator, as well as other important changing parameters in the environment, involves two basic steps: the prediction step, based on the process model, and an update step, which incorporates the new measurement into the a priori state estimate. This framework allows estimating the probability of failure at future time instants (the RUL PDF) in real time, providing information about time-to-failure (TTF) expectations, statistical confidence intervals, and long-term predictions, using for this purpose empirical knowledge about critical conditions for the system (also referred to as hazard zones). This information is of paramount significance for improving system reliability and the cost-effective operation of critical assets, as has been shown in a case study where feedback correction strategies (based on uncertainty measures) were implemented to lengthen the RUL of a rotorcraft transmission system with propagating fatigue cracks on a critical component. Although the feedback loop is implemented using simple linear relationships, it is helpful in providing quick insight into how the system reacts to changes in its input signals, in terms of its predicted RUL. The method is able to manage non-Gaussian PDFs since it includes concepts such as nonlinear state estimation and confidence intervals in its formulation. Real data from a fault-seeded test showed that the proposed framework was able to anticipate the modifications to the system input needed to lengthen its RUL. Results of this test indicate that the method successfully suggested the correction that the system required. In this sense, future work will focus on the development and testing of similar strategies using different input-output uncertainty metrics.
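    As an illustration of the predict/update cycle described above, here is a minimal Python particle-filter sketch. The linear state model, noise levels, and hazard threshold are invented placeholders, not the chapter's rotorcraft transmission model.

    import numpy as np

    rng = np.random.default_rng(0)

    def pf_step(particles, weights, z, f, noise_std, meas_std):
        # Prediction step: propagate each particle through the state model f.
        particles = f(particles) + rng.normal(0.0, noise_std, particles.shape)
        # Update step: reweight particles by the Gaussian likelihood of measurement z.
        weights = weights * np.exp(-0.5 * ((z - particles) / meas_std) ** 2)
        weights = weights / weights.sum()
        # Multinomial resampling to counter particle degeneracy.
        idx = np.searchsorted(np.cumsum(weights), rng.random(len(weights)))
        return particles[idx], np.full(len(particles), 1.0 / len(particles))

    def rul_samples(particles, f, noise_std, hazard, horizon):
        # Propagate posterior particles forward; each particle's first crossing of the
        # hazard zone yields a time-to-failure sample, so the histogram of crossing
        # times approximates the RUL PDF.
        x, ttf = particles.copy(), np.full(len(particles), np.inf)
        for k in range(1, horizon + 1):
            x = f(x) + rng.normal(0.0, noise_std, x.shape)
            ttf = np.where(np.isinf(ttf) & (x >= hazard), k, ttf)
        return ttf

    f = lambda x: 1.02 * x  # hypothetical linear fault-growth model
    particles, weights = rng.normal(1.0, 0.1, 500), np.full(500, 1.0 / 500)
    particles, weights = pf_step(particles, weights, z=1.05, f=f, noise_std=0.01, meas_std=0.05)
    print(np.percentile(rul_samples(particles, f, 0.01, hazard=2.0, horizon=200), [5, 50, 95]))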

  2. SPHERE: Students' performance dataset of conceptual understanding,...

    • data.mendeley.com
    Updated Jan 15, 2025
    + more versions
    Cite
    Purwoko Haryadi Santoso (2025). SPHERE: Students' performance dataset of conceptual understanding, scientific ability, and learning attitude in physics education research (PER) [Dataset]. http://doi.org/10.17632/88d7m2fv7p.2
    Explore at:
    Dataset updated
    Jan 15, 2025
    Authors
    Purwoko Haryadi Santoso
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SPHERE is a students' performance dataset for physics education research (PER). It is presented as a multi-domain learning dataset of students' performance in physics, collected through several research-based assessments (RBAs) established by the PER community. A total of 497 eleventh-grade students participated, drawn from three large public high schools and one small one located in a suburban district of a highly populated province in Indonesia. Variables related to demographics, accessibility of literature resources, and students' physics identity are also investigated. The RBAs used in this dataset were selected based on concepts learned by the students in the Indonesian physics curriculum. We commenced the survey of students' understanding of Newtonian mechanics at the end of the first semester using the Force Concept Inventory (FCI) and the Force and Motion Conceptual Evaluation (FMCE). In the second semester, we assessed the students' scientific abilities and learning attitude through the Scientific Abilities Assessment Rubrics (SAAR) and the Colorado Learning Attitudes about Science Survey (CLASS), respectively. The conceptual assessments continued in the second semester with the Rotational and Rolling Motion Conceptual Survey (RRMCS), the Fluid Mechanics Concept Inventory (FMCI), the Mechanical Waves Conceptual Survey (MWCS), the Thermal Concept Evaluation (TCE), and the Survey of Thermodynamic Processes and First and Second Laws (STPFaSL). We expect SPHERE to be a valuable dataset for supporting the advancement of the PER field, particularly in quantitative studies. For example, research on applying machine learning and data mining techniques in PER can face challenges due to the lack of datasets dedicated to PER studies. SPHERE can be reused as a students' performance dataset in physics specifically dedicated to PER scholars who wish to apply machine learning techniques in physics education.

  3. Data from: CONCEPT-DM2 DATA MODEL TO ANALYSE HEALTHCARE PATHWAYS OF TYPE 2...

    • zenodo.org
    bin, png, zip
    Updated Jul 12, 2024
    + more versions
    Cite
    Berta Ibáñez-Beroiz; Berta Ibáñez-Beroiz; Asier Ballesteros-Domínguez; Asier Ballesteros-Domínguez; Ignacio Oscoz-Villanueva; Ignacio Oscoz-Villanueva; Ibai Tamayo; Ibai Tamayo; Julián Librero; Julián Librero; Mónica Enguita-Germán; Mónica Enguita-Germán; Francisco Estupiñán-Romero; Francisco Estupiñán-Romero; Enrique Bernal-Delgado; Enrique Bernal-Delgado (2024). CONCEPT- DM2 DATA MODEL TO ANALYSE HEALTHCARE PATHWAYS OF TYPE 2 DIABETES [Dataset]. http://doi.org/10.5281/zenodo.7778291
    Explore at:
    bin, png, zip (available download formats)
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Berta Ibáñez-Beroiz; Berta Ibáñez-Beroiz; Asier Ballesteros-Domínguez; Asier Ballesteros-Domínguez; Ignacio Oscoz-Villanueva; Ignacio Oscoz-Villanueva; Ibai Tamayo; Ibai Tamayo; Julián Librero; Julián Librero; Mónica Enguita-Germán; Mónica Enguita-Germán; Francisco Estupiñán-Romero; Francisco Estupiñán-Romero; Enrique Bernal-Delgado; Enrique Bernal-Delgado
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Technical notes and documentation on the common data model of the project CONCEPT-DM2.

    This publication corresponds to the Common Data Model (CDM) specification of the CONCEPT-DM2 project for the implementation of a federated network analysis of the healthcare pathway of type 2 diabetes.

    Aims of the CONCEPT-DM2 project:

    General aim: To analyse chronic care effectiveness and efficiency of care pathways in diabetes, assuming the relevance of care pathways as independent factors of health outcomes, using real-world data (RWD) from five Spanish Regional Health Systems.

    Main specific aims:

    • To characterize the care pathways in patients with diabetes through the whole care system in terms of process indicators and pharmacologic recommendations
    • To compare these observed care pathways with the theoretical clinical pathways derived from the clinical practice guidelines
    • To assess whether adherence to clinical guidelines influences important health outcomes, such as cardiovascular hospitalizations.
    • To compare the traditional analytical methods with process mining methods in terms of modeling quality, prediction performance and information provided.

    Study Design: It is a population-based retrospective observational study covering all T2D patients diagnosed in five Regional Health Services within the Spanish National Health Service. We will include all contacts of these patients with the health services recorded in the electronic medical record systems, including Primary Care data, Specialized Care data, Hospitalizations, Urgent Care data, and Pharmacy Claims, as well as other registers such as the mortality and population registers.

    Cohort definition: All patients with a Type 2 Diabetes code in the clinical health records

    • Inclusion criteria: patients who, at 01/01/2017 or during the follow-up from 01/01/2017 to 31/12/2022, had an active health card (active TIS - tarjeta sanitaria activa) and a type 2 diabetes code (T2D; DM2 in Spanish) in the primary care clinical records (CIAP2 T90 when the CIAP coding system is used); see the sketch after this list
    • Exclusion criteria:
      • patients with no contact with the health system from 01/01/2017 to 31/12/2022
      • patients that had a T1D (DM1) code opened after the T2D code during the follow-up.
    • Study period: from 01/01/2017 to 31/12/2022
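    For illustration only, a pandas sketch of applying this inclusion window to a patient table shaped like the synthetic sample files listed below; the column names (patient_id, t2d_code_date, last_contact_date) are assumptions, not the published CDM specification.

    import pandas as pd

    # Hypothetical columns; consult Datamodel_CONCEPT_DM2_v.0.1.0.xlsx for the real schema.
    patients = pd.read_csv("sample_data1_dm_patient.csv",
                           parse_dates=["t2d_code_date", "last_contact_date"])

    start, end = pd.Timestamp("2017-01-01"), pd.Timestamp("2022-12-31")
    cohort = patients[
        (patients["t2d_code_date"] <= end)            # T2D code at baseline or during follow-up
        & (patients["last_contact_date"] >= start)    # at least one contact in the study period
    ]
    print(f"{len(cohort)} of {len(patients)} patients meet the inclusion window")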

    Files included in this publication:

    • Datamodel_CONCEPT_DM2_diagram.png
    • Common data model specification (Datamodel_CONCEPT_DM2_v.0.1.0.xlsx)
    • Synthetic datasets (Datamodel_CONCEPT_DM2_sample_data)
      • sample_data1_dm_patient.csv
      • sample_data2_dm_param.csv
      • sample_data3_dm_patient.csv
      • sample_data4_dm_param.csv
      • sample_data5_dm_patient.csv
      • sample_data6_dm_param.csv
      • sample_data7_dm_param.csv
      • sample_data8_dm_param.csv
    • Datamodel_CONCEPT_DM2_explanation.pptx
  4. Data from: Improving the semantic quality of conceptual models through text...

    • figshare.com
    Updated May 30, 2023
    Cite
    Tom Willaert (2023). Improving the semantic quality of conceptual models through text mining. A proof of concept [Dataset]. http://doi.org/10.6084/m9.figshare.6951608.v1
    Explore at:
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Tom Willaert
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Python code generated in the context of the dissertation 'Improving the semantic quality of conceptual models through text mining. A proof of concept' (Postgraduate studies in Big Data & Analytics for Business and Management, KU Leuven Faculty of Economics and Business, 2018).

  5. Data from: Time in market: using data mining technologies to measure product...

    • resodate.org
    Updated Sep 13, 2022
    Cite
    Erik Poppe (2022). Time in market: using data mining technologies to measure product lifecycles [Dataset]. http://doi.org/10.14279/depositonce-16226
    Explore at:
    Dataset updated
    Sep 13, 2022
    Dataset provided by
    Technische Universität Berlin
    DepositOnce
    Authors
    Erik Poppe
    Description

    Time in Market (TIM) is a metric describing the time period of a product from its market entry to its decline and disappearance from the market. The concept is often used implicitly to describe the acceleration of product life cycles and innovation cycles, and is an essential part of the product life cycle concept. It can be assumed that time in market is an important indicator for manufacturers and marketers to plan and evaluate their market success. Moreover, time-in-market measurements are necessary to gauge the speed of product life cycles and their implications for the general development of product lifetime. This article's major contributions are (1) presenting time in market as a highly relevant concept for the assessment of product life cycles, although the indicator has received little attention so far, (2) explaining an automated internet-based data mining approach to gather semi-structured product data from 5 German internet shops for electronic consumer goods, and (3) presenting initial insights from half a year to one year of market data for smartphones. It turns out that longer periods of time are needed to obtain significant data on time in market; nevertheless, initial results show a high product rollover rate of 40-45% within one year and a time in market below 100 days for at least 16% of the captured products. Given the current state of the work, this article is addressed to researchers already engaged in data mining or interested in its application.
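    As a sketch of how such a metric can be computed from crawled shop data (the crawl-log layout below is invented for illustration, not the article's pipeline):

    import pandas as pd

    # Hypothetical crawl log: one row per product sighting in a shop.
    obs = pd.DataFrame({
        "product": ["phone-a", "phone-a", "phone-b", "phone-b"],
        "seen_on": pd.to_datetime(["2021-01-05", "2021-06-20", "2021-02-01", "2021-03-15"]),
    })

    # Time in market per product: span between first and last sighting.
    tim = obs.groupby("product")["seen_on"].agg(first="min", last="max")
    tim["days_in_market"] = (tim["last"] - tim["first"]).dt.days
    print(tim)
    print(f"share below 100 days: {(tim['days_in_market'] < 100).mean():.0%}")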

  6. SDOstreamclust: Stream Clustering Robust to Concept Drift - Evaluation Tests...

    • researchdata.tuwien.ac.at
    zip
    Updated Nov 25, 2025
    Cite
    Felix Iglesias Vazquez; Felix Iglesias Vazquez (2025). SDOstreamclust: Stream Clustering Robust to Concept Drift - Evaluation Tests [Dataset]. http://doi.org/10.48436/xh0w2-q5x18
    Explore at:
    zip (available download formats)
    Dataset updated
    Nov 25, 2025
    Dataset provided by
    TU Wien
    Authors
    Felix Iglesias Vazquez; Felix Iglesias Vazquez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SDOstreamclust Evaluation Tests

    Evaluation tests conducted for the paper "Stream Clustering Robust to Concept Drift". Please refer to:

    Iglesias Vazquez, F., Konzett, S., Zseby, T., & Bifet, A. (2025). Stream Clustering Robust to Concept Drift. In 2025 International Joint Conference on Neural Networks (IJCNN) (pp. 1–10). IEEE. https://doi.org/10.1109/IJCNN64981.2025.11227664

    Context and methodology

    SDOstreamclust is a stream clustering algorithm able to process data incrementally or in batches. It is a combination of the previous SDOstream (anomaly detection in data streams) and SDOclust (static clustering). SDOstreamclust retains the characteristics of SDO algorithms: lightweight, intuitive, self-adjusting, resistant to noise, capable of identifying non-convex clusters, and constructed upon robust parameters and interpretable models. Moreover, it shows excellent adaptation to concept drift.

    In this repository, SDOstreamclust is evaluated with 165 datasets (both synthetic and real) and compared with CluStream, DBstream, DenStream, and StreamKMeans.

    This repository is framed within the research on the following domains: algorithm evaluation, stream clustering, unsupervised learning, machine learning, data mining, streaming data analysis. Datasets and algorithms can be used for experiment replication and for further evaluation and comparison.

    Docker

    A Docker version is also available in: https://hub.docker.com/r/fiv5/sdostreamclust

    Technical details

    Experiments are conducted in Python v3.8.14. The file and folder structure is as follows (a minimal data-loading sketch appears after this list):

    • [algorithms] contains a script with functions related to algorithm configurations.
    • [data] contains datasets in ARFF format.
    • [results] contains CSV files with algorithms' performances obtained from running the "run.sh" script (as shown in the paper).
    • "dependencies.sh" lists and installs python dependencies.
    • "pysdoclust-stream-main.zip" contains the SDOstreamclust python package.
    • "README.md" shows details and intructions to use this repository.
    • "run.sh" runs the complete experiments.
    • "run_comp.py"for running experiments specified by arguments.
    • "TSindex.py" implements functions for the Temporal Silhouette index.
    Note: if the SDOstreamclust code is modified, the SWIG (v4.2.1) wrappers have to be rebuilt and SDOstreamclust reinstalled with pip.
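    For orientation, a minimal sketch of loading one of the ARFF datasets and iterating over it in batches; the file name is a placeholder, and the commented-out model call is hypothetical — the actual SDOstreamclust API is documented in README.md.

    import numpy as np
    import pandas as pd
    from scipy.io import arff

    # Placeholder file name; the real datasets live in the [data] folder.
    raw, meta = arff.loadarff("data/example_stream.arff")
    X = pd.DataFrame(raw).select_dtypes(include=[np.number]).to_numpy()

    BATCH = 500  # SDOstreamclust can process data incrementally or in batches
    for start in range(0, len(X), BATCH):
        batch = X[start:start + BATCH]
        # labels = model.fit_predict(batch)  # hypothetical call; see README.md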

    License

    The CC-BY license applies to all data generated with MDCgen. All distributed code is under the GPLv3+ license.

  7. Simulated supermarket transaction data

    • researchdata.edu.au
    • researchdatafinder.qut.edu.au
    Updated 2010
    Cite
    Li Yuefeng (2010). Simulated supermarket transaction data [Dataset]. http://doi.org/10.4225/09/5885968451acd
    Explore at:
    Dataset updated
    2010
    Dataset provided by
    Queensland University of Technology
    Authors
    Li Yuefeng
    Description

    A database of de-identified supermarket customer transactions. This large simulated dataset was created based on a real data sample.

  8. Taxo-KG-Bench

    • figshare.com
    zip
    Updated Sep 12, 2022
    Cite
    Jiaying Lu; Carl Yang (2022). Taxo-KG-Bench [Dataset]. http://doi.org/10.6084/m9.figshare.16415727.v2
    Explore at:
    zip (available download formats)
    Dataset updated
    Sep 12, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jiaying Lu; Carl Yang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The benchmark contains 6 datasets generated from 3 automatically constructed taxonomies (MSCG, SEMusic, SEMedical) and 2 open knowledge graphs (ReVerb, OPIEC). For each dataset, entity-concept pairs (cg_pairs.*.txt) and entity-relation-entity triples (oie_triples.*.txt) are split into train, dev, and test sets.
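    A loading sketch under assumed file layouts — the split file names and the tab-separated columns are guesses from the naming patterns above, not a documented format.

    import pandas as pd

    cg = pd.read_csv("cg_pairs.train.txt", sep="\t", names=["entity", "concept"])
    oie = pd.read_csv("oie_triples.train.txt", sep="\t", names=["head", "relation", "tail"])
    print(len(cg), "entity-concept pairs;", len(oie), "open IE triples")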

    Data for the AKBC'22 paper "Open-World Taxonomy and Knowledge Graph Co-Learning". For any suggestion or question, please feel free to create an issue or send an email (jiaying.lu@emory.edu and j.carlyang@emory.edu).

  9. OceanXtremes: Oceanographic Data-Intensive Anomaly Detection and Analysis...

    • data.amerigeoss.org
    • data.wu.ac.at
    html
    Updated Jul 25, 2019
    Cite
    United States[old] (2019). OceanXtremes: Oceanographic Data-Intensive Anomaly Detection and Analysis Portal [Dataset]. https://data.amerigeoss.org/pl/dataset/0f24d562-556c-4895-955a-74fec4cc9993
    Explore at:
    html (available download formats)
    Dataset updated
    Jul 25, 2019
    Dataset provided by
    United States[old]
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Anomaly detection is the process of identifying items, events, or observations that do not conform to an expected pattern in a dataset or time series. Current and future missions and our research communities challenge us to rapidly identify features and anomalies in complex and voluminous observations to further science and improve decision support. Given this data-intensive reality, we propose to develop an anomaly detection system, called OceanXtremes, powered by an intelligent, elastic Cloud-based analytic service backend that enables execution of domain-specific, multi-scale anomaly and feature detection algorithms across the entire archive of ocean science datasets.

    A parallel analytics engine will be developed as the key computational and data-mining core of OceanXtremes' backend processing. This analytic engine will demonstrate three new technology ideas to provide rapid turnaround on climatology computation and anomaly detection:

    1. An adaptation of the Hadoop/MapReduce framework for parallel data mining of science datasets, typically large 3- or 4-dimensional arrays packaged in NetCDF and HDF.
    2. An algorithm profiling service to efficiently and cost-effectively scale up hybrid Cloud computing resources based on the needs of scheduled jobs (CPU, memory, network, and bursting from a private Cloud computing cluster to a public cloud provider like Amazon Cloud services).
    3. An extension to industry-standard search solutions (OpenSearch and faceted search) to provide support for shared discovery and exploration of ocean phenomena and anomalies, along with unexpected correlations between key measured variables.

    We will use a hybrid Cloud compute cluster (private Eucalyptus on-premise at JPL with bursting to Amazon Web Services) as the operational backend. The key idea is that the parallel data-mining operations will be run 'near' the ocean data archives (a local 'network' hop) so that we can efficiently access the thousands of (say, daily) files making up a three-decade time series, and then cache key variables and pre-computed climatologies in a high-performance parallel database.

    OceanXtremes will be equipped with both web portal and web service interfaces for users and applications/systems to register and retrieve oceanographic anomaly data. By leveraging technology such as Datacasting (Bingham et al., 2007), users can also subscribe to anomaly or 'event' types of interest and have newly computed anomaly metrics and other information delivered to them by metadata feeds packaged in standard Rich Site Summary (RSS) format. Upon receiving new feed entries, users can examine the metrics and download relevant variables, by simply clicking on a link, to begin further analyzing the event. The OceanXtremes web portal will allow users to define their own anomaly or feature types, and continuous backend processing will be scheduled to populate each new user-defined anomaly type by executing the chosen data mining algorithm (i.e. differences from climatology or gradients above a specified threshold). Metadata on the identified anomalies will be cataloged, including temporal and geospatial profiles, key physical metrics, related observational artifacts, and other relevant metadata to facilitate discovery, extraction, and visualization. Products created by the anomaly detection algorithm will be made explorable and subsettable using Webification (Huang et al., 2014) and OPeNDAP (http://opendap.org) technologies. Using this platform, scientists can efficiently search for anomalies or ocean phenomena, compute data metrics for events or over time series of ocean variables, and efficiently find and access all of the data relevant to their study (and then download only that data).
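    As a toy illustration of the "differences from climatology" detection mode mentioned above (synthetic data, not the OceanXtremes engine):

    import numpy as np

    def anomalies(sst, cycle=12, threshold=2.0):
        # sst has shape (time, lat, lon); cycle is the climatology period (12 = monthly).
        t = np.arange(sst.shape[0]) % cycle
        clim = np.stack([sst[t == m].mean(axis=0) for m in range(cycle)])
        std = np.stack([sst[t == m].std(axis=0) for m in range(cycle)])
        z = (sst - clim[t]) / std[t]      # standardized departure from climatology
        return np.abs(z) > threshold      # boolean mask of anomalous grid cells

    sst = np.random.default_rng(1).normal(20.0, 1.0, size=(120, 4, 4))  # synthetic field
    print(f"flagged cells: {anomalies(sst).mean():.1%}")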

  10. Data from: Joint Behavior-Topic Model for Microblogs

    • researchdata.smu.edu.sg
    bin
    Updated May 31, 2023
    Cite
    QIU Minghui; Feida ZHU; Jing JIANG (2023). Joint Behavior-Topic Model for Microblogs [Dataset]. http://doi.org/10.25440/smu.12062724.v1
    Explore at:
    bin (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    SMU Research Data Repository (RDR)
    Authors
    QIU Minghui; Feida ZHU; Jing JIANG
    License

    http://rightsstatements.org/vocab/InC/1.0/

    Description

    We propose an LDA-based behavior-topic model (B-LDA) which jointly models user topic interests and behavioral patterns. We focus the study of the model on online social network settings such as microblogs like Twitter, where the textual content is relatively short but user interactions are rich. Related publication: Qiu, M., Zhu, F., & Jiang, J. (2013). It is not just what we say, but how we say them: LDA-based behavior-topic model. In 2013 SIAM International Conference on Data Mining (SDM'13): 2-4 May, Austin, Texas (pp. 794-802). Philadelphia: SIAM. http://doi.org/10.1137/1.9781611972832.88

  11. Additional file 1: of Next generation phenotyping using narrative reports in...

    • springernature.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated May 31, 2023
    Cite
    Nicolas Garcelon; Antoine Neuraz; Rémi Salomon; Nadia Bahi-Buisson; Jeanne Amiel; Capucine Picard; Nizar Mahlaoui; Vincent Benoit; Anita Burgun; Bastien Rance (2023). Additional file 1: of Next generation phenotyping using narrative reports in a rare disease clinical data warehouse [Dataset]. http://doi.org/10.6084/m9.figshare.6401906.v1
    Explore at:
    xls (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Nicolas Garcelon; Antoine Neuraz; Rémi Salomon; Nadia Bahi-Buisson; Jeanne Amiel; Capucine Picard; Nizar Mahlaoui; Vincent Benoit; Anita Burgun; Bastien Rance
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Extracted phenotypical concepts per cohort. For each cohort, we list the top 50 concepts ranked by frequency and TF-IDF. The first column is the UMLS code of the phenotypical concept, the second column is the French preferred term, the third column is the English preferred term, the fourth column is the frequency score (FREQ), the fifth column is the TF-IDF score, the sixth column is the rank of the concept sorted by the frequency score, the seventh column is the rank of the concept sorted by the TF-IDF score, and the eighth column is the expert evaluation (1: relevant concept, 0: non-relevant concept). (XLS 93 kb)
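    To illustrate the two rankings used here (raw frequency vs. TF-IDF), a small self-contained sketch with toy concept mentions; this is not the authors' extraction pipeline.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    # Toy corpora: one pseudo-document of extracted concept mentions per cohort.
    cohorts = {
        "cohort_A": "seizure seizure microcephaly hypotonia",
        "cohort_B": "fever seizure rash rash rash",
    }
    for name, vec in [("FREQ", CountVectorizer()), ("TF-IDF", TfidfVectorizer())]:
        matrix = vec.fit_transform(cohorts.values())
        terms = vec.get_feature_names_out()
        for cohort, row in zip(cohorts, matrix.toarray()):
            ranked = sorted(zip(terms, row), key=lambda p: -p[1])
            print(name, cohort, ranked[:3])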

  12. Age Prediction Using Genomic Information

    • kaggle.com
    zip
    Updated Jan 1, 2023
    Cite
    The Devastator (2023). Age Prediction Using Genomic Information [Dataset]. https://www.kaggle.com/datasets/thedevastator/age-prediction-for-individuals-using-multi-omic
    Explore at:
    zip (3712 bytes; available download formats)
    Dataset updated
    Jan 1, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Age Prediction for Individuals Using Multi-Omic Datasets

    A Machine Learning Approach

    About this dataset

    This dataset provides a comprehensive study of age prediction using machine learning based on multi-omics markers. It contains data from twenty-one different genes (RPA2_3, ZYG11A_4, F5_2, HOXC4_1, NKIRAS2_2, MEIS1_1, SAMD10_2, GRM2_9, TRIM59_5, LDB2

    How to use the dataset

    This dataset collects multi-omics information from individuals to predict their age. It includes markers such as RPA2_3, ZYG11A_4, F5_2, HOXC4_1, NKIRAS2_2, MEIS1_1, SAMD10_2, GRM2_9, TRIM59_5, LDB2_3, ELOVL2_6, DDO_1, and KLF14_2.

    Using this dataset effectively requires knowledge of both genetics and machine learning. For the former, the user should understand data mining approaches used in gene expression analysis; for the latter, techniques such as regression methods and decision tree methods.

    To get started, users should familiarize themselves with basic concepts such as principal component analysis (PCA) and feature selection algorithms, which make dimensionality reduction easier, before attempting more sophisticated methods such as neural networks or support vector machines. These later techniques can provide more accurate predictions when properly tuned, but have a steeper learning curve due to their complexity. Additionally, use hyperparameter optimization, which lets users test multiple models quickly and see which approach yields the best results given the available computing resources. Last but not least, once a good model has been identified, save it for future use, either by serializing it or saving its weights.
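    In that spirit, a hedged sketch of a PCA-plus-regression baseline on test_rows.csv; the gene columns come from the table below, while the target column name "Age" is an assumption to verify against the file's actual header.

    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("test_rows.csv")
    genes = ["RPA2_3", "ZYG11A_4", "F5_2", "HOXC4_1", "NKIRAS2_2", "MEIS1_1",
             "SAMD10_2", "GRM2_9", "TRIM59_5", "LDB2_3", "ELOVL2_6", "DDO_1", "KLF14_2"]
    X, y = df[genes], df["Age"]  # "Age" is assumed; check the CSV header

    # Scale, reduce dimensionality, then fit a simple regularized regression.
    model = make_pipeline(StandardScaler(), PCA(n_components=5), Ridge())
    print(cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error").mean())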

    Research Ideas

    • Analyzing the correlation between gene expression levels and age to identify key biomarkers associated with certain life stages.
    • Building machine learning models that can predict a person's age from their multi-omics data.
    • Identifying potential drug targets based on changes in gene expression associated with age-related diseases

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: test_rows.csv

    • RPA2_3 - Expression level of the RPA2 gene in the third sample. (Numeric)
    • ZYG11A_4 - Expression level of the ZYG11A gene in the fourth sample. (Numeric)
    • F5_2 - Expression level of the F5 gene in the second sample. (Numeric)
    • HOXC4_1 - Expression level of the HOXC4 gene in the first sample. (Numeric)
    • NKIRAS2_2 - Expression level of the NKIRAS2 gene in the second sample. (Numeric)
    • MEIS1_1 - Expression level of the MEIS1 gene in the first sample. (Numeric)
    • SAMD10_2 - Expression level of the SAMD10 gene in the second sample. (Numeric)
    • GRM2_9 - Expression level of the GRM2 gene in the ninth sample. (Numeric)
    • TRIM59_5 - Expression level of the TRIM59 gene in the fifth sample. (Numeric)
    • LDB2_3 - Expression level of the LDB2 gene in the third sample. (Numeric)
    • ELOVL2_6 - Expression level of the ELOVL2 gene in the sixth sample. (Numeric)
    • DDO_1 - Expression level of the DDO gene in the first sample. (Numeric)
    • KLF14_2 - Expression level of the KLF14 gene in the second sample. (Numeric)


  13. Dataset for training classifiers of comparative sentences

    • live.european-language-grid.eu
    csv
    Updated Apr 19, 2024
    Cite
    (2024). Dataset for training classifiers of comparative sentences [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7607
    Explore at:
    csv (available download formats)
    Dataset updated
    Apr 19, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As there was no large publicly available cross-domain dataset for comparative argument mining, we created one composed of sentences annotated with BETTER / WORSE markers (the first object is better / worse than the second object) or NONE (the sentence does not contain a comparison of the target objects). The BETTER sentences stand for a pro-argument in favor of the first compared object; WORSE sentences represent a con-argument and favor the second object. We aimed to minimize domain-specific biases in the dataset in order to capture the nature of comparison rather than the nature of the particular domains, and thus decided to control the specificity of domains through the selection of comparison targets. We hypothesized, and could confirm in preliminary experiments, that comparison targets usually have a common hypernym (i.e., are instances of the same class), which we utilized for selecting the compared object pairs.

    The most specific domain we chose is computer science, with comparison targets like programming languages, database products, and technology standards such as Bluetooth or Ethernet. Many computer science concepts can be compared objectively (e.g., on transmission speed or suitability for certain applications). The objects for this domain were manually extracted from "List of"-articles on Wikipedia. In the annotation process, annotators were asked to label sentences from this domain only if they had some basic knowledge in computer science. The second, broader domain is brands. It contains objects of different types (e.g., cars, electronics, and food). As brands are present in everyday life, anyone should be able to label the majority of sentences containing well-known brands such as Coca-Cola or Mercedes. Again, targets for this domain were manually extracted from "List of"-articles on Wikipedia. The third domain is not restricted to any topic: random. For each of 24 randomly selected seed words, 10 similar words were collected based on the distributional similarity API of JoBimText (http://www.jobimtext.org). Seed words were created using randomlists.com: book, car, carpenter, cellphone, Christmas, coffee, cork, Florida, hamster, hiking, Hoover, Metallica, NBC, Netflix, ninja, pencil, salad, soccer, Starbucks, sword, Tolkien, wine, wood, XBox, Yale.

    Especially for brands and computer science, the resulting object lists were large (4493 in brands and 1339 in computer science). In a manual inspection, low-frequency and ambiguous objects were removed from all object lists (e.g., RAID (a hardware concept) and Unity (a game engine) are also regularly used nouns). The remaining objects were combined into pairs. For each object type (seed Wikipedia list page or seed word), all possible combinations were created. These pairs were then used to find sentences containing both objects. The aforementioned approaches to selecting compared object pairs tend to minimize the inclusion of domain-specific data, but do not fully solve the problem. We leave extending the dataset with diverse object pairs, including abstract concepts, for future work.

    For the sentence mining, we used the publicly available index of dependency-parsed sentences from the Common Crawl corpus, containing over 14 billion English sentences filtered for duplicates. This index was queried for sentences containing both objects of each pair. For 90% of the pairs, we also added comparative cue words (better, easier, faster, nicer, wiser, cooler, decent, safer, superior, solid, terrific, worse, harder, slower, poorly, uglier, poorer, lousy, nastier, inferior, mediocre) to the query in order to bias the selection towards comparisons, while still admitting comparisons that do not contain any of the anticipated cues. This was necessary as random sampling would have yielded only a very tiny fraction of comparisons. Note that even sentences containing a cue word do not necessarily express a comparison between the desired targets (dog vs. cat: He's the best pet that you can get, better than a dog or cat.). It is thus especially crucial to enable a classifier to learn not to rely on the existence of cue words alone (very likely in a random sample of sentences with very few comparisons). For our corpus, we kept pairs with at least 100 retrieved sentences.

    From all sentences of those pairs, 2500 for each category were randomly sampled as candidates for a crowdsourced annotation that we conducted on figure-eight.com in several small batches. Each sentence was annotated by at least five trusted workers. We ranked annotations by confidence, which is the figure-eight-internal measure combining annotator trust and voting, and discarded annotations with a confidence below 50%. Of all annotated items, 71% received unanimous votes, and for over 85% at least 4 out of 5 workers agreed -- rendering the collection procedure, aimed at ease of annotation, successful.

    The final dataset contains 7199 sentences with 271 distinct object pairs. The majority of sentences (over 72%) are non-comparative despite biasing the selection with cue words; in 70% of the comparative sentences, the favored target is named first.

    You can browse through the data here: https://docs.google.com/spreadsheets/d/1U8i6EU9GUKmHdPnfwXEuBxi0h3aiRCLPRC-3c9ROiOE/edit?usp=sharing

    A full description of the dataset is available in the workshop paper at the ACL 2019 conference. Please cite this paper if you use the data: Franzek, Mirco, Alexander Panchenko, and Chris Biemann. "Categorization of Comparative Sentences for Argument Mining." arXiv preprint arXiv:1809.06152 (2018).

    @inproceedings{franzek2018categorization, title={Categorization of Comparative Sentences for Argument Mining}, author={Panchenko, Alexander and Bondarenko, and Franzek, Mirco and Hagen, Matthias and Biemann, Chris}, booktitle={Proceedings of the 6th Workshop on Argument Mining at ACL'2019}, year={2019}, address={Florence, Italy}}
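    As a starting point, a minimal sketch of a three-way BETTER / WORSE / NONE sentence classifier on toy examples; the model choice is illustrative and unrelated to the baselines reported in the paper.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy sentences in the dataset's three-label scheme.
    sentences = [
        "Python is easier to learn than Java.",                 # first object favored
        "This Ethernet standard is slower than the new one.",   # first object disfavored
        "I bought a Mercedes and a crate of Coca-Cola.",        # no comparison
    ]
    labels = ["BETTER", "WORSE", "NONE"]

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(sentences, labels)
    print(clf.predict(["Bluetooth is faster than Ethernet"]))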

  14. Top five keyword counts by month.

    • figshare.com
    xls
    Updated May 31, 2023
    Cite
    Jyh-Jian Sheu; Ko-Tsung Chu; Nien-Feng Li; Cheng-Chi Lee (2023). Top five keyword counts by month. [Dataset]. http://doi.org/10.1371/journal.pone.0171518.t006
    Explore at:
    xls (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Jyh-Jian Sheu; Ko-Tsung Chu; Nien-Feng Li; Cheng-Chi Lee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Top five keyword counts by month.

  15. Data and Model Checkpoints for "Weakly Supervised Concept Map Generation...

    • figshare.com
    application/x-gzip
    Updated May 31, 2023
    Cite
    Jiaying Lu (2023). Data and Model Checkpoints for "Weakly Supervised Concept Map Generation through Task-Guided Graph Translation" [Dataset]. http://doi.org/10.6084/m9.figshare.16415802.v2
    Explore at:
    application/x-gzip (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jiaying Lu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data and model checkpoints for paper "Weakly Supervised Concept Map Generation through Task-Guided Graph Translation" by Jiaying Lu, Xiangjue Dong, and Carl Yang. The paper has been accepted by IEEE Transactions on Knowledge and Data Engineering (TKDE).

    GT-D2G-*.tar.gz files are model checkpoints for the GT-D2G variants; these models were trained with seed=27. nyt/dblp/yelp.*.win5.pickle.gz are initial graphs generated by NLP pipelines. glove.840B.restaurant.400d.vec.gz is the pre-trained embedding for the Yelp dataset.

    For more instructions, please refer to our GitHub repo.

  16. PRELEARN Dataset

    • live.european-language-grid.eu
    csv
    Updated Nov 23, 2021
    Cite
    (2021). PRELEARN Dataset [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/8084
    Explore at:
    csv (available download formats)
    Dataset updated
    Nov 23, 2021
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The PRELEARN dataset contains 6607 concept pairs and a "Wikipedia pages file" containing the raw text of the Wikipedia pages referring to the extracted concepts (obtained using WikiExtractor on a Wikipedia dump of Jan. 2020). The dataset has been used for the PRELEARN shared task (https://sites.google.com/view/prelearn20/), organised as part of the Evalita 2020 evaluation campaign (http://www.evalita.it/2020). It was extracted from the ITA-PREREQ dataset (Miaschi et al., 2019), built upon the AL-CPL dataset (Liang et al., 2018), a collection of binary-labelled concept pairs extracted from textbooks on four domains: data mining, geometry, physics and pre-calculus.

    The concept pairs consist of target and prerequisite concepts (A, B), labelled as follows:

    1 if B is a prerequisite of A;

    0 in all other cases.

    Domain experts were asked to manually annotate whether pairs of concepts showed a prerequisite relation or not. The dataset is split into a training set (5908 pairs) and a test set (699 pairs). The distribution of prerequisite and non-prerequisite labels was balanced (50/50) for each domain only in the test sets.
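    A loading sketch under an assumed pair-file layout (target concept, candidate prerequisite, binary label); the file name and column order are guesses, so check them against the distributed CSVs.

    import pandas as pd

    pairs = pd.read_csv("prelearn_train.csv", names=["concept_a", "concept_b", "label"])
    print(pairs["label"].value_counts())   # the 50/50 balance holds only for the test split
    print(pairs.sample(3))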

  17. Data from: Data and systems for medication-related text classification and...

    • data.mendeley.com
    • commons.datacite.org
    Updated Jul 16, 2018
    + more versions
    Cite
    Abeed Sarker (2018). Data and systems for medication-related text classification and concept normalization from Twitter: Insights from the Social Media Mining for Health (SMM4H) 2017 shared task [Dataset]. http://doi.org/10.17632/rxwfb3tysd.1
    Explore at:
    Dataset updated
    Jul 16, 2018
    Authors
    Abeed Sarker
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data accompanies the following publication:

    Title: Data and systems for medication-related text classification and concept normalization from Twitter: Insights from the Social Media Mining for Health (SMM4H) 2017 shared task

    Journal: Journal of the American Medical Informatics Association (JAMIA)

    The evaluation data (in addition to the training data) was used for the SMM4H-2017 shared tasks, co-located with AMIA-2017 (Washington DC).

  18. Fake-Real News

    • kaggle.com
    zip
    Updated Jun 23, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kajal Yadav (2020). Fake-Real News [Dataset]. https://www.kaggle.com/techykajal/fakereal-news
    Explore at:
    zip (864364 bytes; available download formats)
    Dataset updated
    Jun 23, 2020
    Authors
    Kajal Yadav
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    As we all know, fake news has become the centre of attention worldwide because of its hazardous impact on our society. One recent example is the spread of fake news about Covid-19 cures, precautions, and symptoms, and you must understand by now how dangerous this bogus information can be. Distorted information propagated at election time to achieve a political agenda is hidden from no one.

    Fake news is quickly becoming an epidemic, and it alarms and angers me how often and how rapidly totally fabricated stories circulate. Why? In the first place, the deceptive effect: the fact that if a lie is repeated enough times, you’ll begin to believe it’s true.

    You understand by now that fake news and other types of false information can take on various appearances. They can likewise have significant effects, because information shapes our world view: we make important decisions based on information. We form an idea about people or a situation by obtaining information. So if the information we saw on the Web is invented, false, exaggerated or distorted, we won’t make good decisions.

    Hence, there is a dire need to do something about it, and it is a big data problem where data scientists can contribute from their end to fight fake news.

    Content

    Although fighting fake news is a big data problem, I have created this small dataset of approx. 10,000 news articles and metadata scraped from approx. 600 web pages of the PolitiFact website, to analyse it using data science skills, gain insights into how we can stop the spread of misinformation on a broader scale, and see which approach gives better accuracy for doing so.

    This dataset has 6 attributes, among which News_Headline is the most important for classifying news as FALSE or TRUE. If you look at the Label attribute closely, there are 6 classes specified in it. So it is entirely up to you whether to use the dataset for multi-class classification or to convert the class labels into FALSE or TRUE and perform binary classification. For your convenience, I will write a notebook on how to convert this dataset from multi-class to binary-class. To deal with the text data, you need good hands-on practice with NLP and data mining concepts.

    • News_Headline - the piece of information to be analysed.
    • Link_Of_News - the URL of the news headline in the first column.
    • Source - the names of the authors who posted the information on Facebook, Instagram, Twitter, or another social media platform.
    • Stated_On - the date when the information was posted by the authors on the social media platform.
    • Date - the date when this piece of information was analysed by the PolitiFact team of fact-checkers in order to label it as FAKE or REAL.
    • Label - contains six class labels: True, Mostly-True, Half-True, Barely-True, False, Pants on Fire.

    So you can either perform multi-class classification, or map Mostly-True, Half-True, and Barely-True to True, drop Pants on Fire, and perform binary classification, as in the sketch below.
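    A minimal pandas sketch of that conversion; the CSV file name is a placeholder, while the Label values are the six classes listed above.

    import pandas as pd

    df = pd.read_csv("fake_real_news.csv")  # placeholder name for this dataset's file

    # Collapse the six PolitiFact ratings into a binary label: the three partly-true
    # ratings become True, and Pants on Fire is dropped.
    mapping = {"True": True, "Mostly-True": True, "Half-True": True,
               "Barely-True": True, "False": False}
    binary = df[df["Label"] != "Pants on Fire"].copy()
    binary["Label"] = binary["Label"].map(mapping)
    print(binary["Label"].value_counts())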

    Inspiration

    • I want to see which approach can solve this problem of combating Fake-News with greater accuracy.
  19. Data from: Improving Scientific Information Extraction with Text Generation

    • curate.nd.edu
    pdf
    Updated Apr 9, 2025
    Cite
    Qingkai Zeng (2025). Improving Scientific Information Extraction with Text Generation [Dataset]. http://doi.org/10.7274/28571045.v1
    Explore at:
    pdf (available download formats)
    Dataset updated
    Apr 9, 2025
    Dataset provided by
    University of Notre Dame
    Authors
    Qingkai Zeng
    License

    Public Domain Mark 1.0: https://creativecommons.org/publicdomain/mark/1.0/
    License information was derived automatically

    Description

    As research communities expand, the number of scientific articles continues to grow rapidly, with no signs of slowing. This information overload drives the need for automated tools to identify relevant materials and extract key ideas. Information extraction (IE) focuses on converting unstructured scientific text into structured knowledge (e.g., ontologies, taxonomies, and knowledge graphs), enabling intelligent systems to excel in tasks like document organization, scientific literature retrieval and recommendation, claim verification, and even novel idea or hypothesis generation. To pinpoint its scope, this thesis focuses on taxonomic structure to represent knowledge in the scientific domain.

    To construct a taxonomy from scientific corpora, traditional methods often rely on pipeline frameworks. These frameworks typically follow a sequence: first, extracting scientific concepts or entities from the corpus; second, identifying hierarchical relationships between the concepts; and finally, organizing these relationships into a cohesive taxonomy. However, such methods encounter several challenges: (1) the quality of the corpus or annotation data, (2) error propagation within the pipeline framework, and (3) limited generalization and transferability to other specific domains. The development of large language models (LLMs) offers promising advancements, as these models have demonstrated remarkable abilities to internalize knowledge and respond effectively to a wide range of inquiries. Unlike traditional pipeline-based approaches, generative methods harness LLMs to achieve (1) better utilization of their internalized knowledge, (2) direct text-to-knowledge conversion, and (3) flexible, schema-free adaptability.

    This thesis explores innovative methods for integrating text generation technologies to improve IE in the scientific domain, with a focus on taxonomy construction. The approach begins with generating entity names and evolves to create or enrich taxonomies directly via text generation. I will explore combining neighborhood structural context, descriptive textual information, and LLMs' internal knowledge to improve output quality. Finally, this thesis will outline future research directions.

  20. Data from: Synthetic Event Logs for Concept Drift Detection

    • ieee-dataport.org
    • investigacion.usc.gal
    Updated May 18, 2022
    Cite
    Victor Gallego-Fontenla (2022). Synthetic Event Logs for Concept Drift Detection [Dataset]. https://ieee-dataport.org/open-access/synthetic-event-logs-concept-drift-detection
    Explore at:
    Dataset updated
    May 18, 2022
    Authors
    Victor Gallego-Fontenla
    Description

    Real life business processes change over time
