47 datasets found
  1. Table_1_Data Mining Techniques in Analyzing Process Data: A Didactic.pdf

    • frontiersin.figshare.com
    pdf
    Updated Jun 7, 2023
    Cite
    Xin Qiao; Hong Jiao (2023). Table_1_Data Mining Techniques in Analyzing Process Data: A Didactic.pdf [Dataset]. http://doi.org/10.3389/fpsyg.2018.02231.s001
    Available download formats: pdf
    Dataset updated
    Jun 7, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Xin Qiao; Hong Jiao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Due to the increasing use of technology-enhanced educational assessment, data mining methods have been explored to analyze process data in the log files produced by such assessments. However, most studies have been limited to one data mining technique under one specific scenario. The current study demonstrates the use of four frequently used supervised techniques, Classification and Regression Trees (CART), gradient boosting, random forest, and support vector machines (SVM), and two unsupervised methods, self-organizing maps (SOM) and k-means, fitted to a single assessment dataset. The U.S. sample (N = 426) from the 2012 Programme for International Student Assessment (PISA), responding to problem-solving items, is extracted to demonstrate the methods. After concrete feature generation and feature selection, classifier development procedures are implemented using the illustrated techniques. Results show satisfactory classification accuracy for all techniques. Suggestions for the selection of classifiers are presented based on the research questions and on the interpretability and simplicity of the classifiers. Interpretations of the results from both supervised and unsupervised learning methods are provided.
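The paper's own code is not included in this record; as an illustration of the simplest of the listed techniques, here is a minimal k-means sketch in pure Python (the toy feature values, deterministic initialization, and k=2 are invented for the example, not taken from the study):

```python
import math

def kmeans(points, k, iters=20):
    # Deterministic init: pick k points spread across the (ordered) input.
    centers = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center.
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster (keep old if empty).
        centers = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Toy process-data features: two well-separated behavioural profiles.
features = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centers, clusters = kmeans(features, k=2)
```

With well-separated groups like these, the loop converges after one iteration; real process-data features would of course be higher-dimensional and noisier.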

  2. Data Mining in Systems Health Management

    • catalog.data.gov
    • s.cnmilf.com
    Updated Apr 10, 2025
    Cite
    Dashlink (2025). Data Mining in Systems Health Management [Dataset]. https://catalog.data.gov/dataset/data-mining-in-systems-health-management
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    This chapter presents theoretical and practical aspects of implementing a combined model-based/data-driven approach for failure prognostics based on particle filtering algorithms, in which the current estimate of the state PDF is used to determine the operating condition of the system and predict the progression of a fault indicator, given a dynamic state model and a set of process measurements. In this approach, the task of estimating the current value of the fault indicator, as well as other important changing parameters in the environment, involves two basic steps: the prediction step, based on the process model, and an update step, which incorporates the new measurement into the a priori state estimate. This framework allows the probability of failure at future time instants (the RUL PDF) to be estimated in real time, providing information about time-to-failure (TTF) expectations, statistical confidence intervals, and long-term predictions, using for this purpose empirical knowledge about critical conditions for the system (also referred to as hazard zones). This information is of paramount significance for improving system reliability and the cost-effective operation of critical assets, as shown in a case study where feedback correction strategies (based on uncertainty measures) were implemented to lengthen the RUL of a rotorcraft transmission system with propagating fatigue cracks on a critical component. Although the feedback loop is implemented using simple linear relationships, it provides quick insight into how the system reacts to changes in its input signals, in terms of its predicted RUL. The method is able to handle non-Gaussian PDFs since it includes concepts such as nonlinear state estimation and confidence intervals in its formulation.
    Real data from a fault-seeded test showed that the proposed framework was able to anticipate modifications to the system input to lengthen its RUL. Results of this test indicate that the method successfully suggested the correction that the system required. Future work will focus on developing and testing similar strategies using different input-output uncertainty metrics.
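The chapter's rotorcraft model and fault data are not reproduced here, but the prediction/update/resample loop it describes can be sketched generically. In this sketch the linear growth model, noise levels, and measurement series are invented for illustration; the particle filter machinery (predict through the state model, weight by measurement likelihood, resample) matches the two-step scheme described above:

```python
import math
import random

def particle_filter(measurements, n=500, growth=1.0, proc_std=0.3,
                    meas_std=1.0, seed=1):
    """Track a fault indicator assumed to grow linearly with process noise."""
    rng = random.Random(seed)
    particles = [0.0] * n
    estimates = []
    for y in measurements:
        # Prediction step: propagate each particle through the state model.
        particles = [x + growth + rng.gauss(0.0, proc_std) for x in particles]
        # Update step: weight particles by the Gaussian measurement likelihood.
        weights = [math.exp(-0.5 * ((y - x) / meas_std) ** 2) for x in particles]
        total = sum(weights)
        estimates.append(sum(w * x for w, x in zip(weights, particles)) / total)
        # Resampling: draw a new particle set proportional to the weights.
        particles = rng.choices(particles, weights=weights, k=n)
    return estimates

# Noise-free synthetic measurements of a linearly growing fault indicator.
est = particle_filter([float(t) for t in range(1, 16)])
```

Extrapolating the resampled particle cloud forward without further updates is what yields the RUL PDF the chapter refers to; that extension is omitted here for brevity.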

  3. Data Mining in Systems Health Management - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Data Mining in Systems Health Management - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/data-mining-in-systems-health-management
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)

  4. fdata-01-00003_An Application of Data Mining Techniques to Explore...

    • frontiersin.figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Elizabeth Harrison; Caitlin Dreisbach; Nada Basit; Jessica Keim-Malpass (2023). fdata-01-00003_An Application of Data Mining Techniques to Explore Congressional Lobbying Records for Patterns in Pediatric Special Interest Expenditures Prior to the Affordable Care Act.pdf [Dataset]. http://doi.org/10.3389/fdata.2018.00003.s001
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Elizabeth Harrison; Caitlin Dreisbach; Nada Basit; Jessica Keim-Malpass
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The full text of this article can be freely accessed on the publisher's website.

  5. DataSheet1_Outlier detection using iterative adaptive mini-minimum spanning...

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    pdf
    Updated Oct 13, 2023
    Cite
    Jia Li; Jiangwei Li; Chenxu Wang; Fons J. Verbeek; Tanja Schultz; Hui Liu (2023). DataSheet1_Outlier detection using iterative adaptive mini-minimum spanning tree generation with applications on medical data.pdf [Dataset]. http://doi.org/10.3389/fphys.2023.1233341.s001
    Available download formats: pdf
    Dataset updated
    Oct 13, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Jia Li; Jiangwei Li; Chenxu Wang; Fons J. Verbeek; Tanja Schultz; Hui Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As an important technique for data pre-processing, outlier detection plays a crucial role in various real applications and has gained substantial attention, especially in medical fields. Despite the importance of outlier detection, many existing methods are vulnerable to the distribution of outliers and require prior knowledge, such as the outlier proportion. To address this problem to some extent, this article proposes an adaptive mini-minimum spanning tree-based outlier detection (MMOD) method, which utilizes a novel distance measure by scaling the Euclidean distance. For datasets containing different densities and taking on different shapes, our method can identify outliers without prior knowledge of outlier percentages. The results on both real-world medical data corpora and intuitive synthetic datasets demonstrate the effectiveness of the proposed method compared to state-of-the-art methods.
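MMOD itself is more involved (mini-MSTs over subsets, a scaled distance measure), but the underlying intuition, that outliers attach to a minimum spanning tree through unusually long edges, can be sketched with a plain MST in pure Python. The threshold factor and toy points below are arbitrary choices for the example, not the paper's parameters:

```python
import math

def mst_edges(points):
    """Prim's algorithm; returns (length, parent, child) for each MST edge."""
    n = len(points)
    best = {i: (math.dist(points[0], points[i]), 0) for i in range(1, n)}
    edges = []
    while best:
        # Attach the cheapest reachable point to the growing tree.
        j = min(best, key=lambda i: best[i][0])
        d, parent = best.pop(j)
        edges.append((d, parent, j))
        for i in best:
            d2 = math.dist(points[j], points[i])
            if d2 < best[i][0]:
                best[i] = (d2, j)
    return edges

def mst_outliers(points, factor=3.0):
    """Flag points attached to the tree by an edge much longer than average."""
    edges = mst_edges(points)
    mean_len = sum(d for d, _, _ in edges) / len(edges)
    return [child for d, _, child in edges if d > factor * mean_len]

cluster = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5)]
outliers = mst_outliers(cluster + [(50, 50)])  # index 5 is the far point
```

Note this heuristic still depends on a threshold, which is exactly the kind of prior-knowledge dependence the paper's adaptive method aims to remove.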

  6. Data Warehousing and Data Mining (Old), 7th Semester, Computer Science and...

    • paper.erudition.co.in
    html
    Updated Nov 23, 2025
    Cite
    Einetic (2025). Data Warehousing and Data Mining (Old), 7th Semester, Computer Science and Engineering, MAKAUT | Erudition Paper [Dataset]. https://paper.erudition.co.in/makaut/btech-in-computer-science-and-engineering/7/data-warehousing-and-data-mining
    Available download formats: html
    Dataset updated
    Nov 23, 2025
    Dataset authored and provided by
    Einetic
    License

    https://paper.erudition.co.in/terms

    Description

    Question Paper Solutions of Data Warehousing and Data Mining (Old), 7th Semester, Computer Science and Engineering, Maulana Abul Kalam Azad University of Technology

  7. Softcite Dataset: A dataset of software mentions in research publications

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 17, 2021
    Cite
    James Howison; Patrice Lopez; Caifan Du; Hannah Cohoon (2021). Softcite Dataset: A dataset of software mentions in research publications [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4444074
    Dataset updated
    Jan 17, 2021
    Dataset provided by
    SCIENCE-MINER
    The University of Texas at Austin
    Authors
    James Howison; Patrice Lopez; Caifan Du; Hannah Cohoon
    Description

    The Softcite dataset is a gold-standard dataset of software mentions in research publications, a free resource primarily for software entity recognition in scholarly text. This is the first release of this dataset.

    What's in the dataset

    With the aim of facilitating software entity recognition at scale, and eventually increasing the visibility of research software so that software contributions to scholarly research receive due credit, a team of trained annotators from the Howison Lab at the University of Texas at Austin annotated 4,093 software mentions in 4,971 open access research publications in biomedicine (from the PubMed Central Open Access collection) and economics (from Unpaywall open access services). The annotated software mentions, along with their publisher, version, and access URL where mentioned in the text, as well as the publications annotated as containing no software mentions, are all included in the released dataset as a TEI/XML corpus file.

    For understanding the schema of the Softcite corpus, its design considerations, and provenance, please refer to our paper included in this release (preprint version).
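The authoritative schema is the one described in that paper; assuming mentions are encoded as TEI `<rs>` elements with a `type` attribute (as in Softcite's annotation scheme), a minimal extraction sketch with the standard library looks like this. The inline sample document is fabricated for the example:

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical fragment in the style of a TEI-annotated sentence.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>Analyses were run in <rs type="software">SPSS</rs>
  version <rs type="version">25</rs>.</p></body></text></TEI>"""

def software_mentions(xml_text):
    """Collect the text of every rs element tagged as a software mention."""
    root = ET.fromstring(xml_text)
    return [rs.text for rs in root.iter(TEI + "rs") if rs.get("type") == "software"]
```

For the full corpus file one would pass the file contents (or use `ET.parse`) instead of the inline string; consult the released schema description for the exact element inventory.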

    Use scenarios

    The release of the Softcite dataset is intended to encourage researchers and stakeholders to make research software more visible in science, especially to academic databases and systems of information retrieval; and facilitate interoperability and collaboration among similar and relevant efforts in software entity recognition and building utilities for software information retrieval. This dataset can also be useful for researchers investigating software use in academic research.

    Current release content

    softcite-dataset v1.0 release includes:

    The Softcite dataset corpus file: softcite_corpus-full.tei.xml

    Softcite Dataset: A Dataset of Software Mentions in Biomedical and Economic Research Publications, our paper that describes the design consideration and creation process of the dataset: Softcite_Dataset_Description_RC.pdf. (This is a preprint version of our forthcoming publication in the Journal of the Association for Information Science and Technology.)

    The Softcite dataset is licensed under a Creative Commons Attribution 4.0 International License.

    If you have questions, please start a discussion or issue in the howisonlab/softcite-dataset GitHub repository.

  8. Experimental data for "Software Data Analytics: Architectural Model...

    • figshare.com
    zip
    Updated Jun 6, 2023
    Cite
    Cong Liu (2023). Experimental data for "Software Data Analytics: Architectural Model Discovery and Design Pattern Detection" [Dataset]. http://doi.org/10.4121/uuid:ca1b0690-d9c5-4626-a067-525ec9d5881b
    Available download formats: zip
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Cong Liu
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset includes all experimental data used for the PhD thesis of Cong Liu, entitled "Software Data Analytics: Architectural Model Discovery and Design Pattern Detection". These data were generated by instrumenting both synthetic and real-life software systems, and are formatted according to the IEEE XES standard. See http://www.xes-standard.org/ and https://www.win.tue.nl/ieeetfpm/lib/exe/fetch.php?media=shared:downloads:2017-06-22-xes-software-event-v5-2.pdf for more details.
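As a rough illustration of the XES layout referenced above, the sketch below parses a tiny software event log with the standard library. The inline log is made up (and omits the XES namespace and most attributes that real logs carry); it only shows the log/trace/event nesting and the `concept:name` convention:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal software event log in XES-like structure.
sample_log = """<log>
  <trace>
    <string key="concept:name" value="case_1"/>
    <event><string key="concept:name" value="methodEntry:Parser.parse"/></event>
    <event><string key="concept:name" value="methodExit:Parser.parse"/></event>
  </trace>
</log>"""

def event_names(xes_text):
    """Flatten the log into an ordered list of event names."""
    root = ET.fromstring(xes_text)
    names = []
    for trace in root.iter("trace"):
        for event in trace.iter("event"):
            for attr in event.iter("string"):
                if attr.get("key") == "concept:name":
                    names.append(attr.get("value"))
    return names
```

Paired entry/exit events like these are what architectural model discovery reconstructs call hierarchies from.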

  9. Make Data Count Dataset - MinerU Extraction

    • kaggle.com
    zip
    Updated Aug 26, 2025
    Cite
    Omid Erfanmanesh (2025). Make Data Count Dataset - MinerU Extraction [Dataset]. https://www.kaggle.com/datasets/omiderfanmanesh/make-data-count-dataset-mineru-extraction
    Available download formats: zip (4272989320 bytes)
    Dataset updated
    Aug 26, 2025
    Authors
    Omid Erfanmanesh
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Description

    This dataset contains PDF-to-text conversions of scientific research articles, prepared for the task of data citation mining. The goal is to identify references to research datasets within full-text scientific papers and classify them as Primary (data generated in the study) or Secondary (data reused from external sources).

    The PDF articles were processed using MinerU, which converts scientific PDFs into structured machine-readable formats (JSON, Markdown, images). This ensures participants can access both the raw text and layout information needed for fine-grained information extraction.

    Files and Structure

    Each paper directory contains the following files:

    • *_origin.pdf The original PDF file of the scientific article.

    • *_content_list.json Structured extraction of the PDF content, where each object represents a text or figure element with metadata. Example entry:

      {
       "type": "text",
       "text": "10.1002/2017JC013030",
       "text_level": 1,
       "page_idx": 0
      }
      
    • full.md The complete article content in Markdown format (linearized for easier reading).

    • images/ Folder containing figures and extracted images from the article.

    • layout.json Page layout metadata, including positions of text blocks and images.

    Data Mining Task

    The aim is to detect dataset references in the article text and classify them:

    Each dataset mention must be labeled as:

    • Primary: Data generated by the paper (new experiments, field observations, sequencing runs, etc.).
    • Secondary: Data reused from external repositories or prior studies.

    Training and Test Splits

    • train/ → Articles with gold-standard labels (train_labels.csv).
    • test/ → Articles without labels, used for evaluation.
    • train_labels.csv → Ground truth with:

      • article_id: Research paper DOI.
      • dataset_id: Extracted dataset identifier.
      • type: Citation type (Primary / Secondary).
    • sample_submission.csv → Example submission format.

    Example

    Paper: https://doi.org/10.1098/rspb.2016.1151
    Data: https://doi.org/10.5061/dryad.6m3n9
    In-text span: "The data we used in this publication can be accessed from Dryad at doi:10.5061/dryad.6m3n9."
    Citation type: Primary

    This dataset enables participants to develop and test NLP systems for:

    • Information extraction (locating dataset mentions).
    • Identifier normalization (mapping mentions to persistent IDs).
    • Citation classification (distinguishing Primary vs Secondary data usage).
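A naive baseline for the task can be sketched in a few lines; the DOI regex and the cue-word list below are simplistic assumptions for illustration, not the competition's reference solution (real systems would use the layout metadata and learned models):

```python
import re

# Rough DOI pattern; trailing punctuation is excluded by the final class.
DOI_RE = re.compile(r'10\.\d{4,9}/[A-Za-z0-9./_-]*[A-Za-z0-9]')

# Crude heuristic: phrases suggesting the authors produced the data.
PRIMARY_CUES = ("we generated", "we collected", "deposited")

def mine_citations(text):
    """Return (dataset_id, type) pairs found in a passage of article text."""
    label = "Primary" if any(c in text.lower() for c in PRIMARY_CUES) else "Secondary"
    return [(doi, label) for doi in DOI_RE.findall(text)]

hits = mine_citations(
    "The data we generated were deposited in Dryad at doi:10.5061/dryad.6m3n9."
)
```

Scoring this against `train_labels.csv` would immediately expose the heuristic's limits, e.g. mentions whose Primary/Secondary cues sit in a different sentence than the identifier.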
  10. Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North...

    • technavio.com
    pdf
    Updated Feb 8, 2025
    Cite
    Technavio (2025). Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, UK), APAC (China, India, Japan), South America (Brazil), and Middle East and Africa (UAE) [Dataset]. https://www.technavio.com/report/data-science-platform-market-industry-analysis
    Available download formats: pdf
    Dataset updated
    Feb 8, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    United States
    Description


    Data Science Platform Market Size 2025-2029

    The data science platform market is projected to grow by USD 763.9 million at a CAGR of 40.2% from 2024 to 2029. Integration of AI and ML technologies with data science platforms will drive the market.

    Major Market Trends & Insights

    North America dominated the market and is expected to account for 48% of market growth during the forecast period.
    By Deployment - On-premises segment was valued at USD 38.70 million in 2023
    By Component - Platform segment accounted for the largest market revenue share in 2023
    

    Market Size & Forecast

    Market Opportunities: USD 1.00 million
    Market Future Opportunities: USD 763.90 million
    CAGR : 40.2%
    North America: Largest market in 2023
    

    Market Summary

    The market represents a dynamic and continually evolving landscape, underpinned by advancements in core technologies and applications. Key technologies, such as machine learning and artificial intelligence, are increasingly integrated into data science platforms to enhance predictive analytics and automate data processing. Additionally, the emergence of containerization and microservices in data science platforms enables greater flexibility and scalability. However, the market also faces challenges, including data privacy and security risks, which necessitate robust compliance with regulations.
    According to recent estimates, the market is expected to account for over 30% of the overall big data analytics market by 2025, underscoring its growing importance in the data-driven business landscape.
    

    What will be the Size of the Data Science Platform Market during the forecast period?


    How is the Data Science Platform Market Segmented and what are the key trends of market segmentation?

    The data science platform industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Deployment

    • On-premises
    • Cloud

    Component

    • Platform
    • Services

    End-user

    • BFSI
    • Retail and e-commerce
    • Manufacturing
    • Media and entertainment
    • Others

    Sector

    • Large enterprises
    • SMEs

    Application

    • Data Preparation
    • Data Visualization
    • Machine Learning
    • Predictive Analytics
    • Data Governance
    • Others

    Geography

    • North America (US, Canada)
    • Europe (France, Germany, UK)
    • Middle East and Africa (UAE)
    • APAC (China, India, Japan)
    • South America (Brazil)
    • Rest of World (ROW)

    By Deployment Insights

    The on-premises segment is estimated to witness significant growth during the forecast period.

    In this dynamic and evolving market, big data processing is a key focus, enabling advanced model accuracy metrics through various data mining methods. Distributed computing and algorithm optimization are integral components, ensuring efficient handling of large datasets. Data governance policies are crucial for managing data security protocols and ensuring data lineage tracking. Software development kits, model versioning, and anomaly detection systems facilitate seamless development, deployment, and monitoring of predictive modeling techniques, including machine learning algorithms, regression analysis, and statistical modeling. Real-time data streaming and parallelized algorithms enable real-time insights, while predictive modeling techniques and machine learning algorithms drive business intelligence and decision-making.

    Cloud computing infrastructure, data visualization tools, high-performance computing, and database management systems support scalable data solutions and efficient data warehousing. ETL processes and data integration pipelines ensure data quality assessment and feature engineering techniques. Clustering techniques and natural language processing are essential for advanced data analysis. The market is witnessing significant growth, with adoption increasing by 18.7% in the past year, and industry experts anticipate a further expansion of 21.6% in the upcoming period. Companies across various sectors are recognizing the potential of data science platforms, leading to a surge in demand for scalable, secure, and efficient solutions.

    API integration services and deep learning frameworks are gaining traction, offering advanced capabilities and seamless integration with existing systems. Data security protocols and model explainability methods are becoming increasingly important, ensuring transparency and trust in data-driven decision-making. The market is expected to continue unfolding, with ongoing advancements in technology and evolving business needs shaping its future trajectory.


    The On-premises segment was valued at USD 38.70 million in 2019 and showed

  11. Company Documents Dataset

    • kaggle.com
    zip
    Updated May 23, 2024
    Cite
    Ayoub Cherguelaine (2024). Company Documents Dataset [Dataset]. https://www.kaggle.com/datasets/ayoubcherguelaine/company-documents-dataset
    Available download formats: zip (9789538 bytes)
    Dataset updated
    May 23, 2024
    Authors
    Ayoub Cherguelaine
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Overview

    This dataset contains a collection of over 2,000 company documents, categorized into four main types: invoices, inventory reports, purchase orders, and shipping orders. Each document is provided in PDF format, accompanied by a CSV file that includes the text extracted from these documents, their respective labels, and the word count of each document. This dataset is ideal for various natural language processing (NLP) tasks, including text classification, information extraction, and document clustering.

    Dataset Content

    PDF Documents: The dataset includes 2,677 PDF files, each representing a unique company document. These documents are derived from the Northwind dataset, which is commonly used for demonstrating database functionalities.

    The document types are:

    • Invoices: Detailed records of transactions between a buyer and a seller.
    • Inventory Reports: Records of inventory levels, including items in stock and units sold.
    • Purchase Orders: Requests made by a buyer to a seller to purchase products or services.
    • Shipping Orders: Instructions for the delivery of goods to specified recipients.

    Example Entries

    Here are a few example entries from the CSV file:

    Shipping Order:

    • Order ID: 10718
    • Shipping Details: "Ship Name: Königlich Essen, Ship Address: Maubelstr. 90, Ship City: ..."
    • Word Count: 120

    Invoice:

    • Order ID: 10707
    • Customer Details: "Customer ID: Arout, Order Date: 2017-10-16, Contact Name: Th..."
    • Word Count: 66

    Purchase Order:

    • Order ID: 10892
    • Order Details: "Order Date: 2018-02-17, Customer Name: Catherine Dewey, Products: Product ..."
    • Word Count: 26

    Applications

    This dataset can be used for:

    • Text Classification: Train models to classify documents into their respective categories.
    • Information Extraction: Extract specific fields and details from the documents.
    • Document Clustering: Group similar documents together based on their content.
    • OCR and Text Mining: Improve OCR (Optical Character Recognition) models and text mining techniques using real-world data.
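For the text-classification use case, a minimal bag-of-words naive Bayes sketch in pure Python is shown below. The toy training snippets are invented stand-ins; real use would train on the extracted text and labels in the dataset's CSV file:

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    words = defaultdict(Counter)   # per-label token counts
    labels = Counter()             # per-label document counts
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        words[label].update(tokens)
        labels[label] += 1
        vocab.update(tokens)
    return words, labels, vocab

def predict(model, text):
    words, labels, vocab = model
    n_docs = sum(labels.values())
    def log_prob(label):
        total = sum(words[label].values())
        lp = math.log(labels[label] / n_docs)
        for tok in text.lower().split():
            # Laplace smoothing keeps unseen tokens from zeroing the score.
            lp += math.log((words[label][tok] + 1) / (total + len(vocab)))
        return lp
    return max(labels, key=log_prob)

model = train([
    ("invoice total amount due payment", "invoice"),
    ("ship name ship address delivery", "shipping order"),
    ("units in stock reorder level", "inventory report"),
])
```

With four clearly worded document types like these, even this baseline tends to separate the classes; the word-count column in the CSV could also serve as an extra feature.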
  12. COVID-19 Open Research Dataset (CORD-19)

    • kaggle.com
    zip
    Updated Mar 7, 2022
    Cite
    Qusay AL-Btoush (2022). COVID-19 Open Research Dataset (CORD-19) [Dataset]. https://www.kaggle.com/datasets/qusaybtoush1990/covid19-open-research-dataset-cord19
    Available download formats: zip (15862822 bytes)
    Dataset updated
    Mar 7, 2022
    Authors
    Qusay AL-Btoush
    Description

    The COVID-19 Open Research Dataset is "a free resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community."

    In the news: On March 16, 2020, the White House issued a "call to action to the tech community" regarding the dataset, asking experts "to develop new text and data mining techniques that can help the science community answer high-priority scientific questions related to COVID-19."

    Included in this dataset:

    • Commercial use subset (includes PMC content): 9,000 papers, 186 MB
    • Non-commercial use subset (includes PMC content): 1,973 papers, 36 MB
    • PMC custom license subset: 1,426 papers, 19 MB
    • bioRxiv/medRxiv subset (pre-prints that are not peer reviewed): 803 papers, 13 MB

    Each paper is represented as a single JSON object. The schema is available here.

    A comprehensive metadata file of 29,000 coronavirus and COVID-19 research articles is also provided, with links to PubMed, Microsoft Academic, and the WHO COVID-19 database of publications (it includes articles without open access full text).

    Metadata file (readme): 47 MB. Source: https://pages.semanticscholar.org/coronavirus-research. Updated: weekly. License: https://data.world/kgarrett/covid-19-open-research-dataset/workspace/file?filename=COVID.DATA.LIC.AGMT.pdf

    Note: this data is intended for practicing data analysis.

  13. COVID-19 Combined Data-set with Improved Measurement Errors

    • data.mendeley.com
    Updated May 13, 2020
    Cite
    Afshin Ashofteh (2020). COVID-19 Combined Data-set with Improved Measurement Errors [Dataset]. http://doi.org/10.17632/nw5m4hs3jr.3
    Dataset updated
    May 13, 2020
    Authors
    Afshin Ashofteh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Public health-related decision-making on policies aimed at controlling the COVID-19 pandemic depends on complex epidemiological models that must be robust and use all relevant available data. This data article provides a new combined worldwide COVID-19 dataset obtained from official data sources, with systematic measurement errors corrected, and a dedicated dashboard for online data visualization and summary. The dataset adds new measures and attributes to the usual attributes of official data sources, such as daily mortality and fatality rates. We used comparative statistical analysis to evaluate the measurement errors of COVID-19 official data collections from the Chinese Center for Disease Control and Prevention (Chinese CDC), the World Health Organization (WHO), and the European Centre for Disease Prevention and Control (ECDC). The data were collected by using text mining techniques and reviewing PDF reports, metadata, and reference data. The combined dataset includes complete spatial data, such as country area, international country number, Alpha-2 code, Alpha-3 code, latitude, and longitude, plus additional attributes such as population. The improved dataset benefits from major corrections to the referenced data sets and official reports: adjustments to reporting dates, which suffered from a one- to two-day lag; removal of negative values; detection of unreasonable changes to historical data in new reports; and corrections of systematic measurement errors, which have been increasing as the outbreak spreads and more countries contribute data to the official repositories. Additionally, the root mean square error of attributes in paired comparisons of datasets was used to identify the main data problems. The data for China are presented separately and in more detail, extracted from the reports available on the main page of the CCDC website.
This dataset is a comprehensive and reliable source of worldwide COVID-19 data that can be used in epidemiological models assessing the magnitude and timeline of confirmed cases, in long-term predictions of deaths or hospital utilization, in studying the effects of quarantine, stay-at-home orders, and other social distancing measures, in locating the pandemic's turning point, or in economic and social impact analysis. It can thereby help inform national and local authorities on how to implement an adaptive response approach to re-opening the economy, re-opening schools, alleviating business and social distancing restrictions, designing economic programs, or allowing sports events to resume.
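The paired dataset comparison described in this entry can be sketched in a few lines of Python; the column names (`date`, `country`, `new_cases`) are hypothetical stand-ins, not the dataset's actual field names:

```python
import numpy as np
import pandas as pd

def compare_sources(df_a: pd.DataFrame, df_b: pd.DataFrame, col: str = "new_cases") -> float:
    """Root mean square error between two sources' daily counts,
    aligned on (date, country) pairs present in both."""
    merged = df_a.merge(df_b, on=["date", "country"], suffixes=("_a", "_b"))
    diff = merged[f"{col}_a"] - merged[f"{col}_b"]
    return float(np.sqrt((diff ** 2).mean()))

def clean_counts(df: pd.DataFrame, col: str = "new_cases") -> pd.DataFrame:
    """Drop negative daily values, one of the corrections described above."""
    return df[df[col] >= 0].copy()
```

A large RMSE between two official sources for the same country and period flags exactly the kind of systematic discrepancy the article describes.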

  14. DataSheet_1_Development and Verify of Survival Analysis Models for Chinese...

    • frontiersin.figshare.com
    pdf
    Updated Jun 14, 2023
    Linyu Geng; Wenqiang Qu; Jun Liang; Wei Kong; Xue Xu; Wenyou Pan; Lin Liu; Min Wu; Fuwan Ding; Huaixia Hu; Xiang Ding; Hua Wei; Yaohong Zou; Xian Qian; Meimei Wang; Jian Wu; Juan Tao; Jun Tan; Zhanyun Da; Miaojia Zhang; Jing Li; Huayong Zhang; Xuebing Feng; Jiaqi Chen; Lingyun Sun (2023). DataSheet_1_Development and Verify of Survival Analysis Models for Chinese Patients With Systemic Lupus Erythematosus.pdf [Dataset]. http://doi.org/10.3389/fimmu.2022.900332.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Linyu Geng; Wenqiang Qu; Jun Liang; Wei Kong; Xue Xu; Wenyou Pan; Lin Liu; Min Wu; Fuwan Ding; Huaixia Hu; Xiang Ding; Hua Wei; Yaohong Zou; Xian Qian; Meimei Wang; Jian Wu; Juan Tao; Jun Tan; Zhanyun Da; Miaojia Zhang; Jing Li; Huayong Zhang; Xuebing Feng; Jiaqi Chen; Lingyun Sun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: The aim of this study is to develop survival analysis models of hospitalized systemic lupus erythematosus (h-SLE) patients in Jiangsu province using data mining techniques to predict patient survival outcomes and survival status.
    Methods: Based on 1999–2009 survival data of 2453 h-SLE patients in Jiangsu Province, we not only used the Cox proportional hazards model to analyze patients' survival factors, but also used neural network models to predict survival outcomes. We used semi-supervised learning to label the censored data and introduced cost-sensitivity to achieve data augmentation, addressing category imbalance and pseudo-label credibility. In addition, a risk score model was developed by logistic regression.
    Results: The overall accuracy of the survival outcome prediction model exceeded 0.7 and the sensitivity was close to 0.8; comparative analysis of multiple indicators showed that our model outperformed traditional classifiers. The survival risk assessment model based on logistic regression revealed a clear survival threshold indicating the survival risk of patients, and cardiopulmonary and neuropsychiatric involvement, abnormal blood urea nitrogen levels, and abnormal alanine aminotransferase levels had the greatest impact on patient survival time. In addition, the study developed a graphical user interface (GUI) integrating the survival analysis models to assist physicians in diagnosis and treatment.
    Conclusions: The proposed survival analysis scheme identifies disease-related pathogenic and prognostic factors and has the potential to improve the effectiveness of clinical interventions.
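A minimal sketch of a logistic-regression risk score of the kind this entry describes, on synthetic data; the four binary indicator features and the 0.5 threshold are illustrative assumptions, not the study's fitted model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical patient matrix: columns stand in for indicators such as
# cardiopulmonary involvement, neuropsychiatric involvement, abnormal
# blood urea nitrogen, and abnormal alanine aminotransferase.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 4)).astype(float)
# Synthetic outcome: risk rises with the number of involved systems.
y = (X.sum(axis=1) + rng.normal(0, 0.5, 200) > 2).astype(int)

model = LogisticRegression().fit(X, y)
risk_scores = model.predict_proba(X)[:, 1]  # per-patient risk in [0, 1]
threshold = 0.5  # stands in for the survival threshold described above
high_risk = risk_scores > threshold
```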

  15. Anomaly Detection Market Analysis, Size, and Forecast 2025-2029: North...

    • technavio.com
    pdf
    Updated Jun 12, 2025
    Technavio (2025). Anomaly Detection Market Analysis, Size, and Forecast 2025-2029: North America (US, Canada, and Mexico), Europe (France, Germany, Spain, and UK), APAC (China, India, and Japan), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/anomaly-detection-market-industry-analysis
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 12, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    Canada, United States
    Description


    Anomaly Detection Market Size 2025-2029

    The anomaly detection market size is valued to increase by USD 4.44 billion, at a CAGR of 14.4% from 2024 to 2029. Anomaly detection tools gaining traction in BFSI will drive the anomaly detection market.

    Major Market Trends & Insights

    North America dominated the market and is estimated to account for 43% of the market's growth during the forecast period.
    By Deployment - Cloud segment was valued at USD 1.75 billion in 2023
    By Component - Solution segment accounted for the largest market revenue share in 2023
    

    Market Size & Forecast

    Market Opportunities: USD 173.26 million
    Market Future Opportunities: USD 4441.70 million
    CAGR from 2024 to 2029: 14.4%
    

    Market Summary

    Anomaly detection, a critical component of advanced analytics, is witnessing significant adoption across various industries, with the financial services sector leading the charge. The increasing incidence of internal threats and cybersecurity fraud necessitates robust anomaly detection solutions. These tools help organizations identify unusual patterns and deviations from normal behavior, enabling a proactive response to potential threats and ensuring operational efficiency. For instance, in a supply chain context, anomaly detection can help identify discrepancies in inventory levels or delivery schedules, leading to cost savings and improved customer satisfaction. In the realm of compliance, anomaly detection can assist in maintaining regulatory adherence by flagging unusual transactions or activities, thereby reducing the risk of penalties and reputational damage.
    According to recent research, organizations that implement anomaly detection solutions experience a reduction in error rates by up to 25%. This improvement not only enhances operational efficiency but also contributes to increased customer trust and satisfaction. Despite these benefits, challenges persist, including data quality and the need for real-time processing capabilities. As the market continues to evolve, advancements in machine learning and artificial intelligence are expected to address these challenges and drive further growth.
    

    What will be the Size of the Anomaly Detection Market during the forecast period?


    How is the Anomaly Detection Market Segmented ?

    The anomaly detection industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Deployment
    
      Cloud
      On-premises
    
    
    Component
    
      Solution
      Services
    
    
    End-user
    
      BFSI
      IT and telecom
      Retail and e-commerce
      Manufacturing
      Others
    
    
    Technology
    
      Big data analytics
      AI and ML
      Data mining and business intelligence
    
    
    Geography
    
      North America
    
        US
        Canada
        Mexico
    
    
      Europe
    
        France
        Germany
        Spain
        UK
    
    
      APAC
    
        China
        India
        Japan
    
    
      Rest of World (ROW)
    

    By Deployment Insights

    The cloud segment is estimated to witness significant growth during the forecast period.

    The market is witnessing significant growth, driven by the increasing adoption of advanced technologies such as machine learning algorithms, predictive modeling tools, and real-time monitoring systems. Businesses are increasingly relying on anomaly detection solutions to enhance their root cause analysis, improve system health indicators, and reduce false positives. This is particularly true in sectors where data is generated in real-time, such as cybersecurity threat detection, network intrusion detection, and fraud detection systems. Cloud-based anomaly detection solutions are gaining popularity due to their flexibility, scalability, and cost-effectiveness.

    This growth is attributed to cloud-based solutions' quick deployment, real-time data visibility, and customization capabilities, which are offered at flexible payment options like monthly subscriptions and pay-as-you-go models. Companies like Anodot, Ltd, Cisco Systems Inc, IBM Corp, and SAS Institute Inc provide both cloud-based and on-premise anomaly detection solutions. Anomaly detection methods include outlier detection, change point detection, and statistical process control. Data preprocessing steps, such as data mining techniques and feature engineering processes, are crucial in ensuring accurate anomaly detection. Data visualization dashboards and alert fatigue mitigation techniques help in managing and interpreting the vast amounts of data generated.

    Network traffic analysis, log file analysis, and sensor data integration are essential components of anomaly detection systems. Additionally, risk management frameworks, drift detection algorithms, time series forecasting, and performance degradation detection are vital in maintaining system performance and capacity planning.
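A simple statistical-process-control style detector, one of the anomaly detection methods named above, might look like the following sketch; the window size and threshold are illustrative choices, not values from any vendor's product:

```python
import numpy as np

def control_chart_anomalies(series, window=30, k=3.0):
    """Flag points outside mean +/- k*std of a trailing window
    (a basic statistical-process-control style detector)."""
    series = np.asarray(series, dtype=float)
    flags = np.zeros(len(series), dtype=bool)
    for i in range(window, len(series)):
        hist = series[i - window:i]          # trailing history only
        mu, sigma = hist.mean(), hist.std()
        if sigma > 0 and abs(series[i] - mu) > k * sigma:
            flags[i] = True
    return flags
```

Real deployments layer change-point detection, drift detection, and alert-fatigue mitigation on top of such a baseline, as the segment description notes.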

  16. Human Activity Recognition WISDM Lab dataset

    • kaggle.com
    zip
    Updated Jul 16, 2024
    Jiashuo Wang (2024). Human Activity Recognition WISDM Lab dataset [Dataset]. https://www.kaggle.com/datasets/wangboluo/mcm2024
    Explore at:
    Available download formats: zip (10311997 bytes)
    Dataset updated
    Jul 16, 2024
    Authors
    Jiashuo Wang
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Data Information: WISDM (Wireless Sensor Data Mining) smartphone-based sensor data, collected from 36 different users performing six different activities.

    Number of examples: 1,098,207

    Number of attributes: 6

    Missing attribute values: None

    Data processing:

    1. Convert the timestamp column from nanoseconds to seconds, and remove the user column, because each user performs the same set of actions.

    2. Use the sliding-window method to transform the data into sequences, then split each label into training and testing sets, ensuring each label has an 8:2 ratio between the training and testing sets.

    3. Shuffle the order of the labels in both the training and testing sets and interleave them, so that two sequences with the same label are not consecutive.
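The sliding-window transformation and per-label 8:2 split described in the processing steps can be sketched as follows; the window and step sizes are assumptions, not the uploader's exact settings:

```python
import numpy as np

def sliding_windows(values, labels, window=80, step=40):
    """Cut a sensor stream into fixed-length sequences; each window takes
    the label of its last sample (one common convention, assumed here)."""
    X, y = [], []
    for start in range(0, len(values) - window + 1, step):
        X.append(values[start:start + window])
        y.append(labels[start + window - 1])
    return np.array(X), np.array(y)

def stratified_split(X, y, train_frac=0.8):
    """Per-label 8:2 split, as described in processing step 2."""
    train_idx, test_idx = [], []
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        cut = int(len(idx) * train_frac)
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```

The shuffling and interleaving of step 3 would then be applied to the split sets.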

    Activity:

    0 = Downstairs 100,427 (9.1%)

    1 = Jogging 342,177 (31.2%)

    2 = Sitting 59,939 (5.5%)

    3 = Standing 48,395 (4.4%)

    4 = Upstairs 122,869 (11.2%)

    5 = Walking 424,400 (38.6%)

    Resource:

    The dataset was collected by the WISDM Lab [https://www.cis.fordham.edu/wisdm/dataset.php]

    Jeffrey W. Lockhart, Gary M. Weiss, Jack C. Xue, Shaun T. Gallagher, Andrew B. Grosner, and Tony T. Pulickal (2011). "Design Considerations for the WISDM Smart Phone-Based Sensor Mining Architecture," Proceedings of the Fifth International Workshop on Knowledge Discovery from Sensor Data (at KDD-11), San Diego, CA. [https://www.cis.fordham.edu/wisdm/includes/files/Lockhart-Design-SensorKDD11.pdf]

  17. Prediction of Online Orders

    • kaggle.com
    zip
    Updated May 23, 2023
    Oscar Aguilar (2023). Prediction of Online Orders [Dataset]. https://www.kaggle.com/datasets/oscarm524/prediction-of-orders/versions/3
    Explore at:
    Available download formats: zip (6680913 bytes)
    Dataset updated
    May 23, 2023
    Authors
    Oscar Aguilar
    Description

    A visit to an online shop by a potential customer is called a session. During a session the visitor clicks on products to see the corresponding detail pages and may add products to or remove them from the shopping basket. At the end of a session, one or several products from the shopping basket may be ordered. The activities of the user are also called transactions. The goal of the analysis is to predict, on the basis of the transaction data collected during the session, whether the visitor will place an order.

    Tasks

    In the first task, historical shop data are given, consisting of the session activities together with the information whether an order was placed or not. These data can be used to subsequently make order forecasts for other session activities in the same shop, for which the real outcome is of course not known. The first task can thus be understood as a classical data mining problem.

    The second task deals with the online scenario. Here the participants are to implement an agent that learns from transactions: the agent successively receives individual transactions and has to make a forecast for each of them with respect to the outcome of the shopping cart transaction. This task maps the practical scenario in which a transaction-based forecast is required and the corresponding algorithm should learn adaptively.
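The task-2 loop might be sketched with an incremental classifier such as scikit-learn's SGDClassifier: predict each incoming transaction, then learn from the revealed label. The four numeric features and random outcomes here are placeholders, not the competition's actual data arrays:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

agent = SGDClassifier(random_state=0)  # learns incrementally via partial_fit
rng = np.random.default_rng(0)
classes = np.array([0, 1])  # 0 = no order, 1 = order

# Placeholder transaction stream: (feature vector, true outcome) pairs.
stream = [(rng.normal(size=4), int(rng.random() < 0.5)) for _ in range(100)]

correct = 0
for i, (features, outcome) in enumerate(stream):
    x = features.reshape(1, -1)
    if i > 0:  # predict once the agent has seen at least one example
        correct += int(agent.predict(x)[0] == outcome)
    agent.partial_fit(x, [outcome], classes=classes)  # learn from the label
```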

    The Data

    For the individual tasks anonymised real shop data are provided in the form of structured text files consisting of individual data sets. The data sets represent in each case transactions in the shop and may contain redundant information. For the data, in particular the following applies:

    1. Each data set is on an individual line that is terminated by “LF” (“line feed”, 0xA), “CR” (“carriage return”, 0xD), or “CR” and “LF” (“carriage return” and “line feed”, 0xD and 0xA).
    2. The first line is structured analogously to the data sets but contains the names of the respective columns (data arrays).
    3. The header and each data set contain several arrays separated by the symbol “|”.
    4. There is no escape character, and no quoting is used.
    5. ASCII is used as the character set.
    6. There may be missing values, marked by the symbol “?”.

    In concrete terms, only the array names of the attached document “*features.pdf*” in their respective sequence will be used as column headings. The corresponding value ranges are listed there, too.

    The training file for task 1 (“*transact_train.txt*”) contains all data arrays of the document, whereas the corresponding classification file (“*transact_class.txt*”) of course does not contain the target attribute “*order*”.

    In task 2 data in the form of a string array are transferred to the implementations of the participants by means of a method. The individual fields of the array contain the same data arrays that are listed in “*features.pdf*”–also without the target attribute “*order*”–and exactly in the sequence used there.
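Under the file rules above, the training data can be loaded with pandas; the column names in this snippet are invented placeholders, since the real array names live in the attached features.pdf:

```python
import io
import pandas as pd

# A tiny stand-in for transact_train.txt following the rules above:
# "|"-separated columns, "?" marks missing values, header in the first line.
sample = "sessionNo|duration|basketSum|order\n1|12.5|?|y\n2|3.0|29.99|n\n"

# read_csv handles LF, CR, and CRLF line endings (rule 1) and maps "?" to NaN.
df = pd.read_csv(io.StringIO(sample), sep="|", na_values="?")
```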

    Acknowledgement

    This dataset is publicly available in the data-mining-cup-website.

  18. DataSheet1_Water quality monitoring and assessment based on cruise...

    • frontiersin.figshare.com
    pdf
    Updated May 31, 2023
    Jing Qian; Hongbo Liu; Li Qian; Jonas Bauer; Xiaobai Xue; Gongliang Yu; Qiang He; Qi Zhou; Yonghong Bi; Stefan Norra (2023). DataSheet1_Water quality monitoring and assessment based on cruise monitoring, remote sensing, and deep learning: A case study of Qingcaosha Reservoir.PDF [Dataset]. http://doi.org/10.3389/fenvs.2022.979133.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Jing Qian; Hongbo Liu; Li Qian; Jonas Bauer; Xiaobai Xue; Gongliang Yu; Qiang He; Qi Zhou; Yonghong Bi; Stefan Norra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Accurate monitoring and assessment of the environmental state, as a prerequisite for improved action, is valuable and necessary because of the growing number of environmental problems that have harmful effects on natural systems and human society. This study developed an integrated novel framework containing three modules: remote sensing technology (RST), cruise monitoring technology (CMT), and deep learning, to achieve robust performance for environmental monitoring and the subsequent assessment. The deep neural network (DNN), a type of deep learning, can adapt to and take advantage of the big data platform effectively provided by RST and CMT to obtain more accurate monitoring results. Our case study in the Qingcaosha Reservoir (QCSR) showed that the DNN achieved more robust performance (R2 = 0.89 for pH, R2 = 0.77 for DO, R2 = 0.86 for conductivity, and R2 = 0.95 for backscattered particles) than traditional machine learning methods, including multiple linear regression, support vector regression, and random forest regression. Based on the monitoring results, the water quality assessment of QCSR was carried out with a deep learning algorithm called improved deep embedding clustering. Deep clustering analysis enables the scientific delineation of joint control regions and determines the characteristic factors of each area. This study demonstrates the value of the framework, with big data mining at its core, for environmental monitoring and follow-up assessment at high frequency, across multiple dimensions, and in deep hierarchy.

  19. Table_8_Modular Characteristics and Mechanism of Action of Herbs for...

    • figshare.com
    pdf
    Updated Jun 1, 2023
    Weilin Zheng; Jiayi Wu; Jiangyong Gu; Heng Weng; Jie Wang; Tao Wang; Xuefang Liang; Lixing Cao (2023). Table_8_Modular Characteristics and Mechanism of Action of Herbs for Endometriosis Treatment in Chinese Medicine: A Data Mining and Network Pharmacology–Based Identification.pdf [Dataset]. http://doi.org/10.3389/fphar.2020.00147.s011
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Weilin Zheng; Jiayi Wu; Jiangyong Gu; Heng Weng; Jie Wang; Tao Wang; Xuefang Liang; Lixing Cao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Endometriosis is a common benign disease in women of reproductive age. It has been defined as a disorder characterized by inflammation, compromised immunity, hormone dependence, and neuroangiogenesis. Unfortunately, the mechanisms of endometriosis have not yet been fully elucidated, and available treatment methods are currently limited. The discovery of new therapeutic drugs and improvements to existing treatment schemes remain the focus of research initiatives. Chinese medicine can improve the symptoms associated with endometriosis, and many Chinese herbal medicines may exert antiendometriosis effects via comprehensive interactions with multiple targets; however, these interactions have not been defined. This study used association rule mining and systems pharmacology to develop a method for investigating potential antiendometriosis herbs. We analyzed various combinations and mechanisms of action of medicinal herbs to establish molecular networks showing interactions with multiple targets. The results showed that endometriosis treatment in Chinese medicine is mainly based on supplementation with blood-activating herbs and strengthening qi. Furthermore, we used network pharmacology to analyze the main herbs, facilitating the decoding of multiscale mechanisms of the herbal compounds. We found that Chinese medicine could affect the development of endometriosis by regulating inflammation, immunity, angiogenesis, and other clusters of processes identified by Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses. The antiendometriosis effect of Chinese medicine occurs mainly through nervous system–associated pathways, such as the serotonergic synapse, the neurotrophin signaling pathway, and the dopaminergic synapse, among others, to reduce pain. Chinese medicine could also regulate VEGF signaling, Toll-like receptor signaling, NF-κB signaling, MAPK signaling, PI3K-Akt signaling, and the HIF-1 signaling pathway, among others. Synergies often exist in herb pairs and herbal prescriptions. In conclusion, we identified some important targets, target pairs, and regulatory networks using bioinformatics and data mining. The combination of data mining and network pharmacology may offer an efficient method for drug discovery and development from herbal medicines.
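The association-rule-mining step can be illustrated with a naive co-occurrence count over prescriptions; the herb names and the support threshold below are made-up examples, not the study's data:

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(prescriptions, min_support=0.4):
    """Naive association-rule step: return herb pairs that co-occur in at
    least min_support of the prescriptions (an apriori-style count)."""
    counts = Counter()
    for herbs in prescriptions:
        for pair in combinations(sorted(set(herbs)), 2):
            counts[pair] += 1
    n = len(prescriptions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}
```

Full association rule mining would also compute confidence and lift for each surviving pair before building the herb-target network.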

  20. Insurance Analytics Market Analysis, Size, and Forecast 2025-2029: North...

    • technavio.com
    pdf
    Updated Aug 31, 2025
    Technavio (2025). Insurance Analytics Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, Italy, and UK), APAC (China, India, Japan, and South Korea), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/insurance-analytics-market-industry-analysis
    Explore at:
    Available download formats: pdf
    Dataset updated
    Aug 31, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    Italy, Japan, South Korea, United Kingdom, Germany, France, Europe, Canada, United States
    Description


    Insurance Analytics Market Size 2025-2029

    The insurance analytics market size is valued to increase by USD 16.12 billion, at a CAGR of 16.7% from 2024 to 2029. Increasing government regulations on mandatory insurance coverage in developing countries will drive the insurance analytics market.

    Market Insights

    North America dominated the market and is estimated to account for 36% of the market's growth during the 2025-2029 forecast period.
    By Deployment - Cloud segment was valued at USD 4.41 billion in 2023
    By Component - Tools segment accounted for the largest market revenue share in 2023
    

    Market Size & Forecast

    Market Opportunities: USD 328.64 million
    Market Future Opportunities: USD 16123.20 million
    CAGR from 2024 to 2029: 16.7%
    

    Market Summary

    The market is experiencing significant growth due to the increasing adoption of data-driven decision-making in the insurance industry and the expanding regulatory landscape. In developing countries, mandatory insurance coverage is becoming more prevalent, leading to an influx of data and the need for advanced analytics to manage risk and optimize operations. Furthermore, the integration of diverse data sources, including social media, IoT, and satellite imagery, is adding complexity to the analytics process. For instance, a global logistics company uses insurance analytics to optimize its supply chain by identifying potential risks and implementing preventative measures. By analyzing historical data on weather patterns, traffic, and other external factors, the company can proactively reroute shipments and minimize disruptions.
    Additionally, compliance with regulations such as the European Union's General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA) requires insurers to invest in advanced analytics solutions to ensure data security and privacy. Despite these opportunities, challenges remain. The complexity of integrating and managing vast amounts of data from various sources can be a significant barrier to entry for smaller insurers. Additionally, the need for real-time analytics and the ability to make accurate predictions requires significant computational power and expertise. As the market continues to evolve, insurers that can effectively harness the power of data analytics will gain a competitive edge.
    

    What will be the size of the Insurance Analytics Market during the forecast period?


    The market is a dynamic and ever-evolving landscape, driven by advancements in technology and the growing demand for data-driven insights. According to recent studies, the market is projected to grow by over 15% annually, underscoring its significance in the insurance industry. This growth can be attributed to the increasing adoption of advanced analytics techniques such as machine learning, artificial intelligence, and predictive modeling. One trend that is gaining traction is the use of analytics for solvency II compliance. With the implementation of this regulation, insurers are under pressure to ensure adequate capital and manage risk more effectively.
    Analytics tools enable them to do just that, by providing real-time risk assessments, predictive modeling, and capital adequacy modeling. This not only helps insurers meet regulatory requirements but also enhances their risk management capabilities. Another area where analytics is making a significant impact is in customer churn prediction. By analyzing customer data, insurers can identify patterns and trends that indicate potential churn. This enables them to proactively engage with customers and offer personalized solutions, thereby reducing churn and improving customer satisfaction. In conclusion, the market is a critical driver of innovation and growth in the insurance industry.
    Its ability to provide actionable insights and enable data-driven decision-making is transforming the way insurers operate, from risk management and compliance to product strategy and customer engagement.
    

    Unpacking the Insurance Analytics Market Landscape

    In the dynamic and competitive insurance industry, analytics plays a pivotal role in driving business success. Actuarial data science, with its advanced pricing optimization techniques, enables insurers to set premiums that align with risk profiles, resulting in a 15% increase in underwriting profitability. Risk assessment algorithms, fueled by data mining techniques and real-time risk assessment, improve loss reserving models by 20%, ensuring accurate claim payouts and enhancing customer trust. Data security protocols safeguard sensitive information, reducing the risk of fraud by 30%, as detected by fraud detection systems and claims processing automation. Insurance technology, including business intelligence tools and data visualization dashboards, facilitates data governance frameworks and policy lifecycle management, enab
