47 datasets found
  1. Table_1_Data Mining Techniques in Analyzing Process Data: A Didactic.pdf

    • frontiersin.figshare.com
    pdf
    Updated Jun 7, 2023
    Cite
    Xin Qiao; Hong Jiao (2023). Table_1_Data Mining Techniques in Analyzing Process Data: A Didactic.pdf [Dataset]. http://doi.org/10.3389/fpsyg.2018.02231.s001
    Available download formats: pdf
    Dataset updated
    Jun 7, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Xin Qiao; Hong Jiao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Due to the increasing use of technology-enhanced educational assessment, data mining methods have been explored to analyze process data in the log files produced by such assessments. However, most studies have been limited to one data mining technique under one specific scenario. The current study demonstrates the use of four frequently used supervised techniques, Classification and Regression Trees (CART), gradient boosting, random forest, and support vector machines (SVM), and two unsupervised methods, self-organizing maps (SOM) and k-means, fitted to a single assessment dataset. The U.S. sample (N = 426) from the 2012 Programme for International Student Assessment (PISA), responding to problem-solving items, is extracted to demonstrate the methods. After concrete feature generation and feature selection, classifier development procedures are implemented using the illustrated techniques. Results show satisfactory classification accuracy for all techniques. Suggestions for the selection of classifiers are presented based on the research questions and on the interpretability and simplicity of the classifiers. Interpretations of the results from both supervised and unsupervised learning methods are provided.
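The paper's own code is not included in this record; as an illustration of the simplest of the listed techniques, here is a minimal k-means sketch in pure Python (the toy feature values, deterministic initialization, and k=2 are invented for the example, not taken from the study):

```python
import math

def kmeans(points, k, iters=20):
    # Deterministic init: pick k points spread across the (ordered) input.
    centers = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center.
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster (keep old if empty).
        centers = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Toy process-data features: two well-separated behavioural profiles.
features = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centers, clusters = kmeans(features, k=2)
```

With well-separated groups like these, the loop converges after one iteration; real process-data features would of course be higher-dimensional and noisier.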

  2. Data Mining in Systems Health Management

    • catalog.data.gov
    • s.cnmilf.com
    Updated Apr 10, 2025
    Cite
    Dashlink (2025). Data Mining in Systems Health Management [Dataset]. https://catalog.data.gov/dataset/data-mining-in-systems-health-management
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    This chapter presents theoretical and practical aspects of implementing a combined model-based/data-driven approach for failure prognostics based on particle filtering algorithms, in which the current estimate of the state PDF is used to determine the operating condition of the system and predict the progression of a fault indicator, given a dynamic state model and a set of process measurements. In this approach, the task of estimating the current value of the fault indicator, as well as other important changing parameters in the environment, involves two basic steps: the prediction step, based on the process model, and an update step, which incorporates the new measurement into the a priori state estimate. This framework allows the probability of failure at future time instants (the RUL PDF) to be estimated in real time, providing information about time-to-failure (TTF) expectations, statistical confidence intervals, and long-term predictions, using for this purpose empirical knowledge about critical conditions for the system (also referred to as hazard zones). This information is of paramount significance for improving system reliability and the cost-effective operation of critical assets, as shown in a case study where feedback correction strategies (based on uncertainty measures) were implemented to lengthen the RUL of a rotorcraft transmission system with propagating fatigue cracks on a critical component. Although the feedback loop is implemented using simple linear relationships, it provides quick insight into how the system reacts to changes in its input signals, in terms of its predicted RUL. The method is able to handle non-Gaussian PDFs since it includes concepts such as nonlinear state estimation and confidence intervals in its formulation.
    Real data from a fault-seeded test showed that the proposed framework was able to anticipate modifications to the system input to lengthen its RUL. Results of this test indicate that the method successfully suggested the correction that the system required. Future work will focus on developing and testing similar strategies using different input-output uncertainty metrics.
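The chapter's rotorcraft model and fault data are not reproduced here, but the prediction/update/resample loop it describes can be sketched generically. In this sketch the linear growth model, noise levels, and measurement series are invented for illustration; the particle filter machinery (predict through the state model, weight by measurement likelihood, resample) matches the two-step scheme described above:

```python
import math
import random

def particle_filter(measurements, n=500, growth=1.0, proc_std=0.3,
                    meas_std=1.0, seed=1):
    """Track a fault indicator assumed to grow linearly with process noise."""
    rng = random.Random(seed)
    particles = [0.0] * n
    estimates = []
    for y in measurements:
        # Prediction step: propagate each particle through the state model.
        particles = [x + growth + rng.gauss(0.0, proc_std) for x in particles]
        # Update step: weight particles by the Gaussian measurement likelihood.
        weights = [math.exp(-0.5 * ((y - x) / meas_std) ** 2) for x in particles]
        total = sum(weights)
        estimates.append(sum(w * x for w, x in zip(weights, particles)) / total)
        # Resampling: draw a new particle set proportional to the weights.
        particles = rng.choices(particles, weights=weights, k=n)
    return estimates

# Noise-free synthetic measurements of a linearly growing fault indicator.
est = particle_filter([float(t) for t in range(1, 16)])
```

Extrapolating the resampled particle cloud forward without further updates is what yields the RUL PDF the chapter refers to; that extension is omitted here for brevity.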

  3. Data Mining in Systems Health Management - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Data Mining in Systems Health Management - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/data-mining-in-systems-health-management
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)

  4. fdata-01-00003_An Application of Data Mining Techniques to Explore...

    • frontiersin.figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Elizabeth Harrison; Caitlin Dreisbach; Nada Basit; Jessica Keim-Malpass (2023). fdata-01-00003_An Application of Data Mining Techniques to Explore Congressional Lobbying Records for Patterns in Pediatric Special Interest Expenditures Prior to the Affordable Care Act.pdf [Dataset]. http://doi.org/10.3389/fdata.2018.00003.s001
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Elizabeth Harrison; Caitlin Dreisbach; Nada Basit; Jessica Keim-Malpass
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The full text of this article can be freely accessed on the publisher's website.

  5. DataSheet1_Outlier detection using iterative adaptive mini-minimum spanning...

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    pdf
    Updated Oct 13, 2023
    Cite
    Jia Li; Jiangwei Li; Chenxu Wang; Fons J. Verbeek; Tanja Schultz; Hui Liu (2023). DataSheet1_Outlier detection using iterative adaptive mini-minimum spanning tree generation with applications on medical data.pdf [Dataset]. http://doi.org/10.3389/fphys.2023.1233341.s001
    Available download formats: pdf
    Dataset updated
    Oct 13, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Jia Li; Jiangwei Li; Chenxu Wang; Fons J. Verbeek; Tanja Schultz; Hui Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As an important technique for data pre-processing, outlier detection plays a crucial role in various real applications and has gained substantial attention, especially in medical fields. Despite the importance of outlier detection, many existing methods are vulnerable to the distribution of outliers and require prior knowledge, such as the outlier proportion. To address this problem to some extent, this article proposes an adaptive mini-minimum spanning tree-based outlier detection (MMOD) method, which utilizes a novel distance measure by scaling the Euclidean distance. For datasets containing different densities and taking on different shapes, our method can identify outliers without prior knowledge of outlier percentages. The results on both real-world medical data corpora and intuitive synthetic datasets demonstrate the effectiveness of the proposed method compared to state-of-the-art methods.
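MMOD itself is more involved (mini-MSTs over subsets, a scaled distance measure), but the underlying intuition, that outliers attach to a minimum spanning tree through unusually long edges, can be sketched with a plain MST in pure Python. The threshold factor and toy points below are arbitrary choices for the example, not the paper's parameters:

```python
import math

def mst_edges(points):
    """Prim's algorithm; returns (length, parent, child) for each MST edge."""
    n = len(points)
    best = {i: (math.dist(points[0], points[i]), 0) for i in range(1, n)}
    edges = []
    while best:
        # Attach the cheapest reachable point to the growing tree.
        j = min(best, key=lambda i: best[i][0])
        d, parent = best.pop(j)
        edges.append((d, parent, j))
        for i in best:
            d2 = math.dist(points[j], points[i])
            if d2 < best[i][0]:
                best[i] = (d2, j)
    return edges

def mst_outliers(points, factor=3.0):
    """Flag points attached to the tree by an edge much longer than average."""
    edges = mst_edges(points)
    mean_len = sum(d for d, _, _ in edges) / len(edges)
    return [child for d, _, child in edges if d > factor * mean_len]

cluster = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5)]
outliers = mst_outliers(cluster + [(50, 50)])  # index 5 is the far point
```

Note this heuristic still depends on a threshold, which is exactly the kind of prior-knowledge dependence the paper's adaptive method aims to remove.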

  6. Data Warehousing and Data Mining (Old), 7th Semester, Computer Science and...

    • paper.erudition.co.in
    html
    Updated Nov 23, 2025
    Cite
    Einetic (2025). Data Warehousing and Data Mining (Old), 7th Semester, Computer Science and Engineering, MAKAUT | Erudition Paper [Dataset]. https://paper.erudition.co.in/makaut/btech-in-computer-science-and-engineering/7/data-warehousing-and-data-mining
    Available download formats: html
    Dataset updated
    Nov 23, 2025
    Dataset authored and provided by
    Einetic
    License

    https://paper.erudition.co.in/terms

    Description

    Question Paper Solutions of Data Warehousing and Data Mining (Old), 7th Semester, Computer Science and Engineering, Maulana Abul Kalam Azad University of Technology

  7. Softcite Dataset: A dataset of software mentions in research publications

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 17, 2021
    Cite
    James Howison; Patrice Lopez; Caifan Du; Hannah Cohoon (2021). Softcite Dataset: A dataset of software mentions in research publications [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4444074
    Dataset updated
    Jan 17, 2021
    Dataset provided by
    SCIENCE-MINER
    The University of Texas at Austin
    Authors
    James Howison; Patrice Lopez; Caifan Du; Hannah Cohoon
    Description

    The Softcite dataset is a gold-standard dataset of software mentions in research publications, a free resource primarily for software entity recognition in scholarly text. This is the first release of this dataset.

    What's in the dataset

    With the aim of facilitating software entity recognition at scale, and eventually increasing the visibility of research software so that software contributions to scholarly research receive due credit, a team of trained annotators from the Howison Lab at the University of Texas at Austin annotated 4,093 software mentions in 4,971 open access research publications in biomedicine (from the PubMed Central Open Access collection) and economics (from Unpaywall open access services). The annotated software mentions, along with their publisher, version, and access URL where mentioned in the text, as well as the publications annotated as containing no software mentions, are all included in the released dataset as a TEI/XML corpus file.

    For understanding the schema of the Softcite corpus, its design considerations, and provenance, please refer to our paper included in this release (preprint version).
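The authoritative schema is the one described in that paper; assuming mentions are encoded as TEI `<rs>` elements with a `type` attribute (as in Softcite's annotation scheme), a minimal extraction sketch with the standard library looks like this. The inline sample document is fabricated for the example:

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical fragment in the style of a TEI-annotated sentence.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>Analyses were run in <rs type="software">SPSS</rs>
  version <rs type="version">25</rs>.</p></body></text></TEI>"""

def software_mentions(xml_text):
    """Collect the text of every rs element tagged as a software mention."""
    root = ET.fromstring(xml_text)
    return [rs.text for rs in root.iter(TEI + "rs") if rs.get("type") == "software"]
```

For the full corpus file one would pass the file contents (or use `ET.parse`) instead of the inline string; consult the released schema description for the exact element inventory.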

    Use scenarios

    The release of the Softcite dataset is intended to encourage researchers and stakeholders to make research software more visible in science, especially to academic databases and systems of information retrieval; and facilitate interoperability and collaboration among similar and relevant efforts in software entity recognition and building utilities for software information retrieval. This dataset can also be useful for researchers investigating software use in academic research.

    Current release content

    softcite-dataset v1.0 release includes:

    The Softcite dataset corpus file: softcite_corpus-full.tei.xml

    Softcite Dataset: A Dataset of Software Mentions in Biomedical and Economic Research Publications, our paper that describes the design consideration and creation process of the dataset: Softcite_Dataset_Description_RC.pdf. (This is a preprint version of our forthcoming publication in the Journal of the Association for Information Science and Technology.)

    The Softcite dataset is licensed under a Creative Commons Attribution 4.0 International License.

    If you have questions, please start a discussion or issue in the howisonlab/softcite-dataset GitHub repository.

  8. Experimental data for "Software Data Analytics: Architectural Model...

    • figshare.com
    zip
    Updated Jun 6, 2023
    Cite
    Cong Liu (2023). Experimental data for "Software Data Analytics: Architectural Model Discovery and Design Pattern Detection" [Dataset]. http://doi.org/10.4121/uuid:ca1b0690-d9c5-4626-a067-525ec9d5881b
    Available download formats: zip
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Cong Liu
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset includes all experimental data used for the PhD thesis of Cong Liu, entitled "Software Data Analytics: Architectural Model Discovery and Design Pattern Detection". These data were generated by instrumenting both synthetic and real-life software systems, and are formatted according to the IEEE XES standard. See http://www.xes-standard.org/ and https://www.win.tue.nl/ieeetfpm/lib/exe/fetch.php?media=shared:downloads:2017-06-22-xes-software-event-v5-2.pdf for more details.
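As a rough illustration of the XES layout referenced above, the sketch below parses a tiny software event log with the standard library. The inline log is made up (and omits the XES namespace and most attributes that real logs carry); it only shows the log/trace/event nesting and the `concept:name` convention:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal software event log in XES-like structure.
sample_log = """<log>
  <trace>
    <string key="concept:name" value="case_1"/>
    <event><string key="concept:name" value="methodEntry:Parser.parse"/></event>
    <event><string key="concept:name" value="methodExit:Parser.parse"/></event>
  </trace>
</log>"""

def event_names(xes_text):
    """Flatten the log into an ordered list of event names."""
    root = ET.fromstring(xes_text)
    names = []
    for trace in root.iter("trace"):
        for event in trace.iter("event"):
            for attr in event.iter("string"):
                if attr.get("key") == "concept:name":
                    names.append(attr.get("value"))
    return names
```

Paired entry/exit events like these are what architectural model discovery reconstructs call hierarchies from.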

  9. Make Data Count Dataset - MinerU Extraction

    • kaggle.com
    zip
    Updated Aug 26, 2025
    Cite
    Omid Erfanmanesh (2025). Make Data Count Dataset - MinerU Extraction [Dataset]. https://www.kaggle.com/datasets/omiderfanmanesh/make-data-count-dataset-mineru-extraction
    Available download formats: zip (4272989320 bytes)
    Dataset updated
    Aug 26, 2025
    Authors
    Omid Erfanmanesh
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Description

    This dataset contains PDF-to-text conversions of scientific research articles, prepared for the task of data citation mining. The goal is to identify references to research datasets within full-text scientific papers and classify them as Primary (data generated in the study) or Secondary (data reused from external sources).

    The PDF articles were processed using MinerU, which converts scientific PDFs into structured machine-readable formats (JSON, Markdown, images). This ensures participants can access both the raw text and layout information needed for fine-grained information extraction.

    Files and Structure

    Each paper directory contains the following files:

    • *_origin.pdf The original PDF file of the scientific article.

    • *_content_list.json Structured extraction of the PDF content, where each object represents a text or figure element with metadata. Example entry:

      {
       "type": "text",
       "text": "10.1002/2017JC013030",
       "text_level": 1,
       "page_idx": 0
      }
      
    • full.md The complete article content in Markdown format (linearized for easier reading).

    • images/ Folder containing figures and extracted images from the article.

    • layout.json Page layout metadata, including positions of text blocks and images.

    Data Mining Task

    The aim is to detect dataset references in the article text and classify them:

    Each dataset mention must be labeled as:

    • Primary: Data generated by the paper (new experiments, field observations, sequencing runs, etc.).
    • Secondary: Data reused from external repositories or prior studies.

    Training and Test Splits

    • train/ → Articles with gold-standard labels (train_labels.csv).
    • test/ → Articles without labels, used for evaluation.
    • train_labels.csv → Ground truth with:

      • article_id: Research paper DOI.
      • dataset_id: Extracted dataset identifier.
      • type: Citation type (Primary / Secondary).
    • sample_submission.csv → Example submission format.

    Example

    Paper: https://doi.org/10.1098/rspb.2016.1151
    Data: https://doi.org/10.5061/dryad.6m3n9
    In-text span: "The data we used in this publication can be accessed from Dryad at doi:10.5061/dryad.6m3n9."
    Citation type: Primary

    This dataset enables participants to develop and test NLP systems for:

    • Information extraction (locating dataset mentions).
    • Identifier normalization (mapping mentions to persistent IDs).
    • Citation classification (distinguishing Primary vs Secondary data usage).
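A naive baseline for the task can be sketched in a few lines; the DOI regex and the cue-word list below are simplistic assumptions for illustration, not the competition's reference solution (real systems would use the layout metadata and learned models):

```python
import re

# Rough DOI pattern; trailing punctuation is excluded by the final class.
DOI_RE = re.compile(r'10\.\d{4,9}/[A-Za-z0-9./_-]*[A-Za-z0-9]')

# Crude heuristic: phrases suggesting the authors produced the data.
PRIMARY_CUES = ("we generated", "we collected", "deposited")

def mine_citations(text):
    """Return (dataset_id, type) pairs found in a passage of article text."""
    label = "Primary" if any(c in text.lower() for c in PRIMARY_CUES) else "Secondary"
    return [(doi, label) for doi in DOI_RE.findall(text)]

hits = mine_citations(
    "The data we generated were deposited in Dryad at doi:10.5061/dryad.6m3n9."
)
```

Scoring this against `train_labels.csv` would immediately expose the heuristic's limits, e.g. mentions whose Primary/Secondary cues sit in a different sentence than the identifier.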
  10. Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North...

    • technavio.com
    pdf
    Updated Feb 8, 2025
    Cite
    Technavio (2025). Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, UK), APAC (China, India, Japan), South America (Brazil), and Middle East and Africa (UAE) [Dataset]. https://www.technavio.com/report/data-science-platform-market-industry-analysis
    Available download formats: pdf
    Dataset updated
    Feb 8, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    United States
    Description


    Data Science Platform Market Size 2025-2029

    The data science platform market is projected to grow by USD 763.9 million at a CAGR of 40.2% from 2024 to 2029. Integration of AI and ML technologies with data science platforms will drive the market.

    Major Market Trends & Insights

    North America dominated the market and is expected to account for 48% of market growth during the forecast period.
    By Deployment - On-premises segment was valued at USD 38.70 million in 2023
    By Component - Platform segment accounted for the largest market revenue share in 2023
    

    Market Size & Forecast

    Market Opportunities: USD 1.00 million
    Market Future Opportunities: USD 763.90 million
    CAGR : 40.2%
    North America: Largest market in 2023
    

    Market Summary

    The market represents a dynamic and continually evolving landscape, underpinned by advancements in core technologies and applications. Key technologies, such as machine learning and artificial intelligence, are increasingly integrated into data science platforms to enhance predictive analytics and automate data processing. Additionally, the emergence of containerization and microservices in data science platforms enables greater flexibility and scalability. However, the market also faces challenges, including data privacy and security risks, which necessitate robust compliance with regulations.
    According to recent estimates, the market is expected to account for over 30% of the overall big data analytics market by 2025, underscoring its growing importance in the data-driven business landscape.
    

    What will be the Size of the Data Science Platform Market during the forecast period?


    How is the Data Science Platform Market Segmented and what are the key trends of market segmentation?

    The data science platform industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Deployment

    • On-premises
    • Cloud

    Component

    • Platform
    • Services

    End-user

    • BFSI
    • Retail and e-commerce
    • Manufacturing
    • Media and entertainment
    • Others

    Sector

    • Large enterprises
    • SMEs

    Application

    • Data Preparation
    • Data Visualization
    • Machine Learning
    • Predictive Analytics
    • Data Governance
    • Others

    Geography

    • North America (US, Canada)
    • Europe (France, Germany, UK)
    • Middle East and Africa (UAE)
    • APAC (China, India, Japan)
    • South America (Brazil)
    • Rest of World (ROW)

    By Deployment Insights

    The on-premises segment is estimated to witness significant growth during the forecast period.

    In this dynamic and evolving market, big data processing is a key focus, enabling advanced model accuracy metrics through various data mining methods. Distributed computing and algorithm optimization are integral components, ensuring efficient handling of large datasets. Data governance policies are crucial for managing data security protocols and ensuring data lineage tracking. Software development kits, model versioning, and anomaly detection systems facilitate seamless development, deployment, and monitoring of predictive modeling techniques, including machine learning algorithms, regression analysis, and statistical modeling. Real-time data streaming and parallelized algorithms enable real-time insights, while predictive modeling techniques and machine learning algorithms drive business intelligence and decision-making.

    Cloud computing infrastructure, data visualization tools, high-performance computing, and database management systems support scalable data solutions and efficient data warehousing. ETL processes and data integration pipelines ensure data quality assessment and feature engineering techniques. Clustering techniques and natural language processing are essential for advanced data analysis. The market is witnessing significant growth, with adoption increasing by 18.7% in the past year, and industry experts anticipate a further expansion of 21.6% in the upcoming period. Companies across various sectors are recognizing the potential of data science platforms, leading to a surge in demand for scalable, secure, and efficient solutions.

    API integration services and deep learning frameworks are gaining traction, offering advanced capabilities and seamless integration with existing systems. Data security protocols and model explainability methods are becoming increasingly important, ensuring transparency and trust in data-driven decision-making. The market is expected to continue unfolding, with ongoing advancements in technology and evolving business needs shaping its future trajectory.


    The On-premises segment was valued at USD 38.70 million in 2019 and showed

  11. Company Documents Dataset

    • kaggle.com
    zip
    Updated May 23, 2024
    Cite
    Ayoub Cherguelaine (2024). Company Documents Dataset [Dataset]. https://www.kaggle.com/datasets/ayoubcherguelaine/company-documents-dataset
    Available download formats: zip (9789538 bytes)
    Dataset updated
    May 23, 2024
    Authors
    Ayoub Cherguelaine
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Overview

    This dataset contains a collection of over 2,000 company documents, categorized into four main types: invoices, inventory reports, purchase orders, and shipping orders. Each document is provided in PDF format, accompanied by a CSV file that includes the text extracted from these documents, their respective labels, and the word count of each document. This dataset is ideal for various natural language processing (NLP) tasks, including text classification, information extraction, and document clustering.

    Dataset Content

    PDF Documents: The dataset includes 2,677 PDF files, each representing a unique company document. These documents are derived from the Northwind dataset, which is commonly used for demonstrating database functionalities.

    The document types are:

    • Invoices: Detailed records of transactions between a buyer and a seller.
    • Inventory Reports: Records of inventory levels, including items in stock and units sold.
    • Purchase Orders: Requests made by a buyer to a seller to purchase products or services.
    • Shipping Orders: Instructions for the delivery of goods to specified recipients.

    Example Entries

    Here are a few example entries from the CSV file:

    Shipping Order:

    • Order ID: 10718
    • Shipping Details: "Ship Name: Königlich Essen, Ship Address: Maubelstr. 90, Ship City: ..."
    • Word Count: 120

    Invoice:

    • Order ID: 10707
    • Customer Details: "Customer ID: Arout, Order Date: 2017-10-16, Contact Name: Th..."
    • Word Count: 66

    Purchase Order:

    • Order ID: 10892
    • Order Details: "Order Date: 2018-02-17, Customer Name: Catherine Dewey, Products: Product ..."
    • Word Count: 26

    Applications

    This dataset can be used for:

    • Text Classification: Train models to classify documents into their respective categories.
    • Information Extraction: Extract specific fields and details from the documents.
    • Document Clustering: Group similar documents together based on their content.
    • OCR and Text Mining: Improve OCR (Optical Character Recognition) models and text mining techniques using real-world data.
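For the text-classification use case, a minimal bag-of-words naive Bayes sketch in pure Python is shown below. The toy training snippets are invented stand-ins; real use would train on the extracted text and labels in the dataset's CSV file:

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    words = defaultdict(Counter)   # per-label token counts
    labels = Counter()             # per-label document counts
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        words[label].update(tokens)
        labels[label] += 1
        vocab.update(tokens)
    return words, labels, vocab

def predict(model, text):
    words, labels, vocab = model
    n_docs = sum(labels.values())
    def log_prob(label):
        total = sum(words[label].values())
        lp = math.log(labels[label] / n_docs)
        for tok in text.lower().split():
            # Laplace smoothing keeps unseen tokens from zeroing the score.
            lp += math.log((words[label][tok] + 1) / (total + len(vocab)))
        return lp
    return max(labels, key=log_prob)

model = train([
    ("invoice total amount due payment", "invoice"),
    ("ship name ship address delivery", "shipping order"),
    ("units in stock reorder level", "inventory report"),
])
```

With four clearly worded document types like these, even this baseline tends to separate the classes; the word-count column in the CSV could also serve as an extra feature.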
  12. COVID-19 Open Research Dataset (CORD-19)

    • kaggle.com
    zip
    Updated Mar 7, 2022
    Cite
    Qusay AL-Btoush (2022). COVID-19 Open Research Dataset (CORD-19) [Dataset]. https://www.kaggle.com/datasets/qusaybtoush1990/covid19-open-research-dataset-cord19
    Available download formats: zip (15862822 bytes)
    Dataset updated
    Mar 7, 2022
    Authors
    Qusay AL-Btoush
    Description

    The COVID-19 Open Research Dataset is "a free resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community."

    In the news: On March 16, 2020, the White House issued a "call to action to the tech community" regarding the dataset, asking experts "to develop new text and data mining techniques that can help the science community answer high-priority scientific questions related to COVID-19."

    Included in this dataset:

    • Commercial use subset (includes PMC content): 9,000 papers, 186 MB
    • Non-commercial use subset (includes PMC content): 1,973 papers, 36 MB
    • PMC custom license subset: 1,426 papers, 19 MB
    • bioRxiv/medRxiv subset (pre-prints that are not peer reviewed): 803 papers, 13 MB

    Each paper is represented as a single JSON object. The schema is available here.

    A comprehensive metadata file of 29,000 coronavirus and COVID-19 research articles is also provided, with links to PubMed, Microsoft Academic, and the WHO COVID-19 database of publications (it includes articles without open access full text).

    Metadata file (readme): 47 MB. Source: https://pages.semanticscholar.org/coronavirus-research. Updated: weekly. License: https://data.world/kgarrett/covid-19-open-research-dataset/workspace/file?filename=COVID.DATA.LIC.AGMT.pdf

    Note: this data is intended for practicing data analysis.

  13. COVID-19 Combined Data-set with Improved Measurement Errors

    • data.mendeley.com
    Updated May 13, 2020
    Cite
    Afshin Ashofteh (2020). COVID-19 Combined Data-set with Improved Measurement Errors [Dataset]. http://doi.org/10.17632/nw5m4hs3jr.3
    Dataset updated
    May 13, 2020
    Authors
    Afshin Ashofteh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Public health-related decision-making on policies aimed at controlling the COVID-19 pandemic depends on complex epidemiological models that must be robust and use all relevant available data. This data article provides a new combined worldwide COVID-19 dataset obtained from official data sources, with systematic measurement errors corrected, and a dedicated dashboard for online data visualization and summary. The dataset adds new measures and attributes to the usual attributes of official data sources, such as daily mortality and fatality rates. We used comparative statistical analysis to evaluate the measurement errors of COVID-19 official data collections from the Chinese Center for Disease Control and Prevention (Chinese CDC), the World Health Organization (WHO), and the European Centre for Disease Prevention and Control (ECDC). The data were collected by using text mining techniques and reviewing PDF reports, metadata, and reference data. The combined dataset includes complete spatial data, such as country area, international country number, Alpha-2 code, Alpha-3 code, latitude, and longitude, plus additional attributes such as population. The improved dataset benefits from major corrections to the referenced data sets and official reports: adjustments to reporting dates, which suffered from a one- to two-day lag; removal of negative values; detection of unreasonable changes to historical data in new reports; and corrections of systematic measurement errors, which have been increasing as the outbreak spreads and more countries contribute data to the official repositories. Additionally, the root mean square error of attributes in paired comparisons of datasets was used to identify the main data problems. The data for China are presented separately and in more detail, extracted from the reports available on the main page of the CCDC website.
This dataset is a comprehensive and reliable source of worldwide COVID-19 data that can be used in epidemiological models assessing the magnitude and timeline of confirmed cases, in long-term predictions of deaths or hospital utilization, in studying the effects of quarantine, stay-at-home orders, and other social distancing measures, in locating the pandemic's turning point, or in economic and social impact analysis. It can thereby help inform national and local authorities on how to implement an adaptive response approach to re-opening the economy, re-opening schools, alleviating business and social distancing restrictions, designing economic programs, or allowing sports events to resume.
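The paired dataset comparison described in this entry can be sketched in a few lines of Python; the column names (`date`, `country`, `new_cases`) are hypothetical stand-ins, not the dataset's actual field names:

```python
import numpy as np
import pandas as pd

def compare_sources(df_a: pd.DataFrame, df_b: pd.DataFrame, col: str = "new_cases") -> float:
    """Root mean square error between two sources' daily counts,
    aligned on (date, country) pairs present in both."""
    merged = df_a.merge(df_b, on=["date", "country"], suffixes=("_a", "_b"))
    diff = merged[f"{col}_a"] - merged[f"{col}_b"]
    return float(np.sqrt((diff ** 2).mean()))

def clean_counts(df: pd.DataFrame, col: str = "new_cases") -> pd.DataFrame:
    """Drop negative daily values, one of the corrections described above."""
    return df[df[col] >= 0].copy()
```

A large RMSE between two official sources for the same country and period flags exactly the kind of systematic discrepancy the article describes.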

  14. DataSheet_1_Development and Verify of Survival Analysis Models for Chinese...

    • frontiersin.figshare.com
    pdf
    Updated Jun 14, 2023
    Linyu Geng; Wenqiang Qu; Jun Liang; Wei Kong; Xue Xu; Wenyou Pan; Lin Liu; Min Wu; Fuwan Ding; Huaixia Hu; Xiang Ding; Hua Wei; Yaohong Zou; Xian Qian; Meimei Wang; Jian Wu; Juan Tao; Jun Tan; Zhanyun Da; Miaojia Zhang; Jing Li; Huayong Zhang; Xuebing Feng; Jiaqi Chen; Lingyun Sun (2023). DataSheet_1_Development and Verify of Survival Analysis Models for Chinese Patients With Systemic Lupus Erythematosus.pdf [Dataset]. http://doi.org/10.3389/fimmu.2022.900332.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Linyu Geng; Wenqiang Qu; Jun Liang; Wei Kong; Xue Xu; Wenyou Pan; Lin Liu; Min Wu; Fuwan Ding; Huaixia Hu; Xiang Ding; Hua Wei; Yaohong Zou; Xian Qian; Meimei Wang; Jian Wu; Juan Tao; Jun Tan; Zhanyun Da; Miaojia Zhang; Jing Li; Huayong Zhang; Xuebing Feng; Jiaqi Chen; Lingyun Sun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: The aim of this study is to develop survival analysis models of hospitalized systemic lupus erythematosus (h-SLE) patients in Jiangsu province using data mining techniques to predict patient survival outcomes and survival status.
    Methods: Based on 1999–2009 survival data of 2453 h-SLE patients in Jiangsu Province, we not only used the Cox proportional hazards model to analyze patients' survival factors, but also used neural network models to predict survival outcomes. We used semi-supervised learning to label the censored data and introduced cost-sensitivity to achieve data augmentation, addressing category imbalance and pseudo-label credibility. In addition, a risk score model was developed by logistic regression.
    Results: The overall accuracy of the survival outcome prediction model exceeded 0.7 and the sensitivity was close to 0.8; comparative analysis of multiple indicators showed that our model outperformed traditional classifiers. The survival risk assessment model based on logistic regression revealed a clear survival threshold indicating the survival risk of patients, and cardiopulmonary and neuropsychiatric involvement, abnormal blood urea nitrogen levels, and abnormal alanine aminotransferase levels had the greatest impact on patient survival time. In addition, the study developed a graphical user interface (GUI) integrating the survival analysis models to assist physicians in diagnosis and treatment.
    Conclusions: The proposed survival analysis scheme identifies disease-related pathogenic and prognostic factors and has the potential to improve the effectiveness of clinical interventions.
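A minimal sketch of a logistic-regression risk score of the kind this entry describes, on synthetic data; the four binary indicator features and the 0.5 threshold are illustrative assumptions, not the study's fitted model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical patient matrix: columns stand in for indicators such as
# cardiopulmonary involvement, neuropsychiatric involvement, abnormal
# blood urea nitrogen, and abnormal alanine aminotransferase.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 4)).astype(float)
# Synthetic outcome: risk rises with the number of involved systems.
y = (X.sum(axis=1) + rng.normal(0, 0.5, 200) > 2).astype(int)

model = LogisticRegression().fit(X, y)
risk_scores = model.predict_proba(X)[:, 1]  # per-patient risk in [0, 1]
threshold = 0.5  # stands in for the survival threshold described above
high_risk = risk_scores > threshold
```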

  15. Anomaly Detection Market Analysis, Size, and Forecast 2025-2029: North...

    • technavio.com
    pdf
    Updated Jun 12, 2025
    Technavio (2025). Anomaly Detection Market Analysis, Size, and Forecast 2025-2029: North America (US, Canada, and Mexico), Europe (France, Germany, Spain, and UK), APAC (China, India, and Japan), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/anomaly-detection-market-industry-analysis
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 12, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    Canada, United States
    Description


    Anomaly Detection Market Size 2025-2029

    The anomaly detection market size is valued to increase by USD 4.44 billion, at a CAGR of 14.4% from 2024 to 2029. Anomaly detection tools gaining traction in BFSI will drive the anomaly detection market.

    Major Market Trends & Insights

    North America dominated the market and is estimated to account for 43% of the market's growth during the forecast period.
    By Deployment - Cloud segment was valued at USD 1.75 billion in 2023
    By Component - Solution segment accounted for the largest market revenue share in 2023
    

    Market Size & Forecast

    Market Opportunities: USD 173.26 million
    Market Future Opportunities: USD 4441.70 million
    CAGR from 2024 to 2029: 14.4%
    

    Market Summary

    Anomaly detection, a critical component of advanced analytics, is witnessing significant adoption across various industries, with the financial services sector leading the charge. The increasing incidence of internal threats and cybersecurity fraud necessitates robust anomaly detection solutions. These tools help organizations identify unusual patterns and deviations from normal behavior, enabling a proactive response to potential threats and ensuring operational efficiency. For instance, in a supply chain context, anomaly detection can help identify discrepancies in inventory levels or delivery schedules, leading to cost savings and improved customer satisfaction. In the realm of compliance, anomaly detection can assist in maintaining regulatory adherence by flagging unusual transactions or activities, thereby reducing the risk of penalties and reputational damage.
    According to recent research, organizations that implement anomaly detection solutions experience a reduction in error rates by up to 25%. This improvement not only enhances operational efficiency but also contributes to increased customer trust and satisfaction. Despite these benefits, challenges persist, including data quality and the need for real-time processing capabilities. As the market continues to evolve, advancements in machine learning and artificial intelligence are expected to address these challenges and drive further growth.
    

    What will be the Size of the Anomaly Detection Market during the forecast period?


    How is the Anomaly Detection Market Segmented ?

    The anomaly detection industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Deployment
    
      Cloud
      On-premises
    
    
    Component
    
      Solution
      Services
    
    
    End-user
    
      BFSI
      IT and telecom
      Retail and e-commerce
      Manufacturing
      Others
    
    
    Technology
    
      Big data analytics
      AI and ML
      Data mining and business intelligence
    
    
    Geography
    
      North America
    
        US
        Canada
        Mexico
    
    
      Europe
    
        France
        Germany
        Spain
        UK
    
    
      APAC
    
        China
        India
        Japan
    
    
      Rest of World (ROW)
    

    By Deployment Insights

    The cloud segment is estimated to witness significant growth during the forecast period.

    The market is witnessing significant growth, driven by the increasing adoption of advanced technologies such as machine learning algorithms, predictive modeling tools, and real-time monitoring systems. Businesses are increasingly relying on anomaly detection solutions to enhance their root cause analysis, improve system health indicators, and reduce false positives. This is particularly true in sectors where data is generated in real-time, such as cybersecurity threat detection, network intrusion detection, and fraud detection systems. Cloud-based anomaly detection solutions are gaining popularity due to their flexibility, scalability, and cost-effectiveness.

    This growth is attributed to cloud-based solutions' quick deployment, real-time data visibility, and customization capabilities, which are offered at flexible payment options like monthly subscriptions and pay-as-you-go models. Companies like Anodot, Ltd, Cisco Systems Inc, IBM Corp, and SAS Institute Inc provide both cloud-based and on-premise anomaly detection solutions. Anomaly detection methods include outlier detection, change point detection, and statistical process control. Data preprocessing steps, such as data mining techniques and feature engineering processes, are crucial in ensuring accurate anomaly detection. Data visualization dashboards and alert fatigue mitigation techniques help in managing and interpreting the vast amounts of data generated.

    Network traffic analysis, log file analysis, and sensor data integration are essential components of anomaly detection systems. Additionally, risk management frameworks, drift detection algorithms, time series forecasting, and performance degradation detection are vital in maintaining system performance and capacity planning.
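A simple statistical-process-control style detector, one of the anomaly detection methods named above, might look like the following sketch; the window size and threshold are illustrative choices, not values from any vendor's product:

```python
import numpy as np

def control_chart_anomalies(series, window=30, k=3.0):
    """Flag points outside mean +/- k*std of a trailing window
    (a basic statistical-process-control style detector)."""
    series = np.asarray(series, dtype=float)
    flags = np.zeros(len(series), dtype=bool)
    for i in range(window, len(series)):
        hist = series[i - window:i]          # trailing history only
        mu, sigma = hist.mean(), hist.std()
        if sigma > 0 and abs(series[i] - mu) > k * sigma:
            flags[i] = True
    return flags
```

Real deployments layer change-point detection, drift detection, and alert-fatigue mitigation on top of such a baseline, as the segment description notes.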

  16. Human Activity Recognition WISDM Lab dataset

    • kaggle.com
    zip
    Updated Jul 16, 2024
    Jiashuo Wang (2024). Human Activity Recognition WISDM Lab dataset [Dataset]. https://www.kaggle.com/datasets/wangboluo/mcm2024
    Explore at:
    Available download formats: zip (10311997 bytes)
    Dataset updated
    Jul 16, 2024
    Authors
    Jiashuo Wang
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Data Information: WISDM (Wireless Sensor Data Mining) smartphone-based sensor data, collected from 36 different users performing six different activities.

    Number of examples: 1,098,207

    Number of attributes: 6

    Missing attribute values: None

    Data processing:

    1. Convert the timestamp column from nanoseconds to seconds, and remove the user column, because each user performs the same set of actions.

    2. Use the sliding-window method to transform the data into sequences, then split each label into training and testing sets, ensuring each label has an 8:2 ratio between the training and testing sets.

    3. Shuffle the order of the labels in both the training and testing sets and interleave them, so that two sequences with the same label are not consecutive.
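The sliding-window transformation and per-label 8:2 split described in the processing steps can be sketched as follows; the window and step sizes are assumptions, not the uploader's exact settings:

```python
import numpy as np

def sliding_windows(values, labels, window=80, step=40):
    """Cut a sensor stream into fixed-length sequences; each window takes
    the label of its last sample (one common convention, assumed here)."""
    X, y = [], []
    for start in range(0, len(values) - window + 1, step):
        X.append(values[start:start + window])
        y.append(labels[start + window - 1])
    return np.array(X), np.array(y)

def stratified_split(X, y, train_frac=0.8):
    """Per-label 8:2 split, as described in processing step 2."""
    train_idx, test_idx = [], []
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        cut = int(len(idx) * train_frac)
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```

The shuffling and interleaving of step 3 would then be applied to the split sets.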

    Activity:

    0 = Downstairs 100,427 (9.1%)

    1 = Jogging 342,177 (31.2%)

    2 = Sitting 59,939 (5.5%)

    3 = Standing 48,395 (4.4%)

    4 = Upstairs 122,869 (11.2%)

    5 = Walking 424,400 (38.6%)

    Resource:

    The dataset was collected by the WISDM Lab [https://www.cis.fordham.edu/wisdm/dataset.php]

    Jeffrey W. Lockhart, Gary M. Weiss, Jack C. Xue, Shaun T. Gallagher, Andrew B. Grosner, and Tony T. Pulickal (2011). "Design Considerations for the WISDM Smart Phone-Based Sensor Mining Architecture," Proceedings of the Fifth International Workshop on Knowledge Discovery from Sensor Data (at KDD-11), San Diego, CA. [https://www.cis.fordham.edu/wisdm/includes/files/Lockhart-Design-SensorKDD11.pdf]

  17. Prediction of Online Orders

    • kaggle.com
    zip
    Updated May 23, 2023
    Oscar Aguilar (2023). Prediction of Online Orders [Dataset]. https://www.kaggle.com/datasets/oscarm524/prediction-of-orders/versions/3
    Explore at:
    Available download formats: zip (6680913 bytes)
    Dataset updated
    May 23, 2023
    Authors
    Oscar Aguilar
    Description

    A visit to an online shop by a potential customer is called a session. During a session the visitor clicks on products to see the corresponding detail pages and may add products to or remove them from the shopping basket. At the end of a session, one or several products from the shopping basket may be ordered. The activities of the user are also called transactions. The goal of the analysis is to predict, on the basis of the transaction data collected during the session, whether the visitor will place an order.

    Tasks

    In the first task, historical shop data are given, consisting of the session activities together with the information whether an order was placed or not. These data can be used to subsequently make order forecasts for other session activities in the same shop, for which the real outcome is of course not known. The first task can thus be understood as a classical data mining problem.

    The second task deals with the online scenario. Here the participants are to implement an agent that learns from transactions: the agent successively receives individual transactions and has to make a forecast for each of them with respect to the outcome of the shopping cart transaction. This task maps the practical scenario in which a transaction-based forecast is required and the corresponding algorithm should learn adaptively.
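The task-2 loop might be sketched with an incremental classifier such as scikit-learn's SGDClassifier: predict each incoming transaction, then learn from the revealed label. The four numeric features and random outcomes here are placeholders, not the competition's actual data arrays:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

agent = SGDClassifier(random_state=0)  # learns incrementally via partial_fit
rng = np.random.default_rng(0)
classes = np.array([0, 1])  # 0 = no order, 1 = order

# Placeholder transaction stream: (feature vector, true outcome) pairs.
stream = [(rng.normal(size=4), int(rng.random() < 0.5)) for _ in range(100)]

correct = 0
for i, (features, outcome) in enumerate(stream):
    x = features.reshape(1, -1)
    if i > 0:  # predict once the agent has seen at least one example
        correct += int(agent.predict(x)[0] == outcome)
    agent.partial_fit(x, [outcome], classes=classes)  # learn from the label
```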

    The Data

    For the individual tasks anonymised real shop data are provided in the form of structured text files consisting of individual data sets. The data sets represent in each case transactions in the shop and may contain redundant information. For the data, in particular the following applies:

    1. Each data set is on an individual line that is terminated by “LF” (“line feed”, 0xA), “CR” (“carriage return”, 0xD), or “CR” and “LF” (“carriage return” and “line feed”, 0xD and 0xA).
    2. The first line is structured analogously to the data sets but contains the names of the respective columns (data arrays).
    3. The header and each data set contain several arrays separated by the symbol “|”.
    4. There is no escape character, and no quoting is used.
    5. ASCII is used as the character set.
    6. There may be missing values, marked by the symbol “?”.

    In concrete terms, only the array names of the attached document “*features.pdf*” in their respective sequence will be used as column headings. The corresponding value ranges are listed there, too.

    The training file for task 1 (“*transact_train.txt*”) contains all data arrays of the document, whereas the corresponding classification file (“*transact_class.txt*”) of course does not contain the target attribute “*order*”.

    In task 2 data in the form of a string array are transferred to the implementations of the participants by means of a method. The individual fields of the array contain the same data arrays that are listed in “*features.pdf*”–also without the target attribute “*order*”–and exactly in the sequence used there.
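Under the file rules above, the training data can be loaded with pandas; the column names in this snippet are invented placeholders, since the real array names live in the attached features.pdf:

```python
import io
import pandas as pd

# A tiny stand-in for transact_train.txt following the rules above:
# "|"-separated columns, "?" marks missing values, header in the first line.
sample = "sessionNo|duration|basketSum|order\n1|12.5|?|y\n2|3.0|29.99|n\n"

# read_csv handles LF, CR, and CRLF line endings (rule 1) and maps "?" to NaN.
df = pd.read_csv(io.StringIO(sample), sep="|", na_values="?")
```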

    Acknowledgement

    This dataset is publicly available in the data-mining-cup-website.

  18. DataSheet1_Water quality monitoring and assessment based on cruise...

    • frontiersin.figshare.com
    pdf
    Updated May 31, 2023
    Jing Qian; Hongbo Liu; Li Qian; Jonas Bauer; Xiaobai Xue; Gongliang Yu; Qiang He; Qi Zhou; Yonghong Bi; Stefan Norra (2023). DataSheet1_Water quality monitoring and assessment based on cruise monitoring, remote sensing, and deep learning: A case study of Qingcaosha Reservoir.PDF [Dataset]. http://doi.org/10.3389/fenvs.2022.979133.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Jing Qian; Hongbo Liu; Li Qian; Jonas Bauer; Xiaobai Xue; Gongliang Yu; Qiang He; Qi Zhou; Yonghong Bi; Stefan Norra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Accurate monitoring and assessment of the environmental state, as a prerequisite for improved action, is valuable and necessary because of the growing number of environmental problems that have harmful effects on natural systems and human society. This study developed an integrated novel framework containing three modules: remote sensing technology (RST), cruise monitoring technology (CMT), and deep learning, to achieve robust performance for environmental monitoring and the subsequent assessment. The deep neural network (DNN), a type of deep learning, can adapt to and take advantage of the big data platform effectively provided by RST and CMT to obtain more accurate monitoring results. Our case study in the Qingcaosha Reservoir (QCSR) showed that the DNN achieved more robust performance (R2 = 0.89 for pH, R2 = 0.77 for DO, R2 = 0.86 for conductivity, and R2 = 0.95 for backscattered particles) than traditional machine learning methods, including multiple linear regression, support vector regression, and random forest regression. Based on the monitoring results, the water quality assessment of QCSR was carried out with a deep learning algorithm called improved deep embedding clustering. Deep clustering analysis enables the scientific delineation of joint control regions and determines the characteristic factors of each area. This study demonstrates the value of the framework, with big data mining at its core, for environmental monitoring and follow-up assessment at high frequency, across multiple dimensions, and in deep hierarchy.

  19. Table_8_Modular Characteristics and Mechanism of Action of Herbs for...

    • figshare.com
    pdf
    Updated Jun 1, 2023
    Weilin Zheng; Jiayi Wu; Jiangyong Gu; Heng Weng; Jie Wang; Tao Wang; Xuefang Liang; Lixing Cao (2023). Table_8_Modular Characteristics and Mechanism of Action of Herbs for Endometriosis Treatment in Chinese Medicine: A Data Mining and Network Pharmacology–Based Identification.pdf [Dataset]. http://doi.org/10.3389/fphar.2020.00147.s011
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Weilin Zheng; Jiayi Wu; Jiangyong Gu; Heng Weng; Jie Wang; Tao Wang; Xuefang Liang; Lixing Cao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Endometriosis is a common benign disease in women of reproductive age. It has been defined as a disorder characterized by inflammation, compromised immunity, hormone dependence, and neuroangiogenesis. Unfortunately, the mechanisms of endometriosis have not yet been fully elucidated, and available treatment methods are currently limited. The discovery of new therapeutic drugs and improvements to existing treatment schemes remain the focus of research initiatives. Chinese medicine can improve the symptoms associated with endometriosis, and many Chinese herbal medicines may exert antiendometriosis effects via comprehensive interactions with multiple targets; however, these interactions have not been defined. This study used association rule mining and systems pharmacology to develop a method for investigating potential antiendometriosis herbs. We analyzed various combinations and mechanisms of action of medicinal herbs to establish molecular networks showing interactions with multiple targets. The results showed that endometriosis treatment in Chinese medicine is mainly based on supplementation with blood-activating herbs and strengthening qi. Furthermore, we used network pharmacology to analyze the main herbs, facilitating the decoding of multiscale mechanisms of the herbal compounds. We found that Chinese medicine could affect the development of endometriosis by regulating inflammation, immunity, angiogenesis, and other clusters of processes identified by Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses. The antiendometriosis effect of Chinese medicine occurs mainly through nervous system–associated pathways, such as the serotonergic synapse, the neurotrophin signaling pathway, and the dopaminergic synapse, among others, to reduce pain. Chinese medicine could also regulate VEGF signaling, Toll-like receptor signaling, NF-κB signaling, MAPK signaling, PI3K-Akt signaling, and the HIF-1 signaling pathway, among others. Synergies often exist in herb pairs and herbal prescriptions. In conclusion, we identified some important targets, target pairs, and regulatory networks using bioinformatics and data mining. The combination of data mining and network pharmacology may offer an efficient method for drug discovery and development from herbal medicines.
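The association-rule-mining step can be illustrated with a naive co-occurrence count over prescriptions; the herb names and the support threshold below are made-up examples, not the study's data:

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(prescriptions, min_support=0.4):
    """Naive association-rule step: return herb pairs that co-occur in at
    least min_support of the prescriptions (an apriori-style count)."""
    counts = Counter()
    for herbs in prescriptions:
        for pair in combinations(sorted(set(herbs)), 2):
            counts[pair] += 1
    n = len(prescriptions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}
```

Full association rule mining would also compute confidence and lift for each surviving pair before building the herb-target network.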

  20. Insurance Analytics Market Analysis, Size, and Forecast 2025-2029: North...

    • technavio.com
    pdf
    Updated Aug 31, 2025
    Technavio (2025). Insurance Analytics Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, Italy, and UK), APAC (China, India, Japan, and South Korea), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/insurance-analytics-market-industry-analysis
    Explore at:
    Available download formats: pdf
    Dataset updated
    Aug 31, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    Italy, Japan, South Korea, United Kingdom, Germany, France, Europe, Canada, United States
    Description


    Insurance Analytics Market Size 2025-2029

    The insurance analytics market size is valued to increase by USD 16.12 billion, at a CAGR of 16.7% from 2024 to 2029. Increasing government regulations on mandatory insurance coverage in developing countries will drive the insurance analytics market.

    Market Insights

    North America dominated the market and is estimated to account for 36% of the market's growth during the 2025-2029 forecast period.
    By Deployment - Cloud segment was valued at USD 4.41 billion in 2023
    By Component - Tools segment accounted for the largest market revenue share in 2023
    

    Market Size & Forecast

    Market Opportunities: USD 328.64 million
    Market Future Opportunities: USD 16123.20 million
    CAGR from 2024 to 2029: 16.7%
    

    Market Summary

    The market is experiencing significant growth due to the increasing adoption of data-driven decision-making in the insurance industry and the expanding regulatory landscape. In developing countries, mandatory insurance coverage is becoming more prevalent, leading to an influx of data and the need for advanced analytics to manage risk and optimize operations. Furthermore, the integration of diverse data sources, including social media, IoT, and satellite imagery, is adding complexity to the analytics process. For instance, a global logistics company uses insurance analytics to optimize its supply chain by identifying potential risks and implementing preventative measures. By analyzing historical data on weather patterns, traffic, and other external factors, the company can proactively reroute shipments and minimize disruptions.
    Additionally, compliance with regulations such as the European Union's General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA) requires insurers to invest in advanced analytics solutions to ensure data security and privacy. Despite these opportunities, challenges remain. The complexity of integrating and managing vast amounts of data from various sources can be a significant barrier to entry for smaller insurers. Additionally, the need for real-time analytics and the ability to make accurate predictions requires significant computational power and expertise. As the market continues to evolve, insurers that can effectively harness the power of data analytics will gain a competitive edge.
    

    What will be the size of the Insurance Analytics Market during the forecast period?


    The market is a dynamic and ever-evolving landscape, driven by advancements in technology and the growing demand for data-driven insights. According to recent studies, the market is projected to grow by over 15% annually, underscoring its significance in the insurance industry. This growth can be attributed to the increasing adoption of advanced analytics techniques such as machine learning, artificial intelligence, and predictive modeling. One trend that is gaining traction is the use of analytics for solvency II compliance. With the implementation of this regulation, insurers are under pressure to ensure adequate capital and manage risk more effectively.
    Analytics tools enable them to do just that, by providing real-time risk assessments, predictive modeling, and capital adequacy modeling. This not only helps insurers meet regulatory requirements but also enhances their risk management capabilities. Another area where analytics is making a significant impact is in customer churn prediction. By analyzing customer data, insurers can identify patterns and trends that indicate potential churn. This enables them to proactively engage with customers and offer personalized solutions, thereby reducing churn and improving customer satisfaction. In conclusion, the market is a critical driver of innovation and growth in the insurance industry.
    Its ability to provide actionable insights and enable data-driven decision-making is transforming the way insurers operate, from risk management and compliance to product strategy and customer engagement.
    

    Unpacking the Insurance Analytics Market Landscape

    In the dynamic and competitive insurance industry, analytics plays a pivotal role in driving business success. Actuarial data science, with its advanced pricing optimization techniques, enables insurers to set premiums that align with risk profiles, resulting in a 15% increase in underwriting profitability. Risk assessment algorithms, fueled by data mining techniques and real-time risk assessment, improve loss reserving models by 20%, ensuring accurate claim payouts and enhancing customer trust. Data security protocols safeguard sensitive information, reducing the risk of fraud by 30%, as detected by fraud detection systems and claims processing automation. Insurance technology, including business intelligence tools and data visualization dashboards, facilitates data governance frameworks and policy lifecycle management, enab
