59 datasets found
  1. Table_1_Data Mining Techniques in Analyzing Process Data: A Didactic.pdf

    • frontiersin.figshare.com
    pdf
    Updated Jun 7, 2023
    Cite
    Xin Qiao; Hong Jiao (2023). Table_1_Data Mining Techniques in Analyzing Process Data: A Didactic.pdf [Dataset]. http://doi.org/10.3389/fpsyg.2018.02231.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 7, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Xin Qiao; Hong Jiao
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Due to the increasing use of technology-enhanced educational assessment, data mining methods have been explored to analyse process data in log files from such assessments. However, most studies were limited to one data mining technique under one specific scenario. The current study demonstrates the usage of four frequently used supervised techniques, Classification and Regression Trees (CART), gradient boosting, random forest, and support vector machine (SVM), and two unsupervised methods, self-organizing map (SOM) and k-means, fitted to a single assessment dataset. The USA sample (N = 426) from the 2012 Program for International Student Assessment (PISA) responding to problem-solving items is extracted to demonstrate the methods. After concrete feature generation and feature selection, classifier development procedures are implemented using the illustrated techniques. Results show satisfactory classification accuracy for all techniques. Suggestions for the selection of classifiers are presented based on the research questions, the interpretability, and the simplicity of the classifiers. Interpretations of the results from both supervised and unsupervised learning methods are provided.
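    As a concrete illustration of one of the unsupervised methods named in the abstract, a minimal k-means pass can be sketched as follows. The sample points, starting centroids, and k = 2 are invented for illustration; the study's actual features were engineered from PISA log files.

```python
# Minimal k-means sketch: assign each point to its nearest centroid,
# recompute centroids as cluster means, and stop when centroids settle.
# Assumes no cluster ever becomes empty (true for this toy data).

def kmeans(points, centroids):
    while True:
        # Assignment step: index of the nearest centroid for each point
        labels = [min(range(len(centroids)),
                      key=lambda c: sum((p - q) ** 2
                                        for p, q in zip(pt, centroids[c])))
                  for pt in points]
        # Update step: mean of the points assigned to each cluster
        new_centroids = []
        for c in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            new_centroids.append(tuple(sum(vals) / len(members)
                                       for vals in zip(*members)))
        if new_centroids == centroids:
            return labels, centroids
        centroids = new_centroids

# Two well-separated toy groups in 2-D feature space
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
labels, centroids = kmeans(points, [(1.0, 1.0), (9.0, 9.0)])
print(labels)  # -> [0, 0, 1, 1]
```

    In practice one would also repeat the run with several random initializations and pick the partition with the lowest within-cluster distance, since k-means only finds a local optimum.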

  2. Data supporting the Master thesis "Monitoring von Open Data Praktiken -...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Nov 21, 2024
    Cite
    Katharina Zinke (2024). Data supporting the Master thesis "Monitoring von Open Data Praktiken - Herausforderungen beim Auffinden von Datenpublikationen am Beispiel der Publikationen von Forschenden der TU Dresden" [Dataset]. http://doi.org/10.5281/zenodo.14196539
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 21, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Katharina Zinke
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Dresden
    Description

    Data supporting the Master thesis "Monitoring von Open Data Praktiken - Herausforderungen beim Auffinden von Datenpublikationen am Beispiel der Publikationen von Forschenden der TU Dresden" (Monitoring open data practices - challenges in finding data publications using the example of publications by researchers at TU Dresden) - Katharina Zinke, Institut für Bibliotheks- und Informationswissenschaften, Humboldt-Universität Berlin, 2023

    This ZIP file contains the data on which the thesis is based, interim exports of the results, and the R script with all pre-processing, data merging and analyses carried out. The documentation of the additional, explorative analysis is also included. The actual PDFs and text files of the scientific papers used are not included, as they are published open access.

    The folder structure is shown below with the file names and a brief description of the contents of each file. For details concerning the analysis approach, please refer to the master's thesis (publication following soon).

    ## Data sources

    Folder 01_SourceData/

    - PLOS-Dataset_v2_Mar23.csv (PLOS-OSI dataset)

    - ScopusSearch_ExportResults.csv (export of Scopus search results from Scopus)

    - ScopusSearch_ExportResults.ris (export of Scopus search results from Scopus)

    - Zotero_Export_ScopusSearch.csv (export of the file names and DOIs of the Scopus search results from Zotero)

    ## Automatic classification

    Folder 02_AutomaticClassification/

    - (NOT INCLUDED) PDFs folder (Folder for PDFs of all publications identified by the Scopus search, named AuthorLastName_Year_PublicationTitle_Title)

    - (NOT INCLUDED) PDFs_to_text folder (Folder for all texts extracted from the PDFs by ODDPub, named AuthorLastName_Year_PublicationTitle_Title)

    - PLOS_ScopusSearch_matched.csv (merge of the Scopus search results with the PLOS_OSI dataset for the files contained in both)

    - oddpub_results_wDOIs.csv (results file of the ODDPub classification)

    - PLOS_ODDPub.csv (merge of the results file of the ODDPub classification with the PLOS-OSI dataset for the publications contained in both)
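    The DOI-based merges described above (Scopus results joined with the PLOS-OSI dataset, and ODDPub results joined with PLOS-OSI, keeping only publications present in both) amount to inner joins keyed on DOI. A sketch with Python's csv module follows; the column names and miniature inputs are assumptions based on the file descriptions, not the thesis's actual headers.

```python
import csv
import io

# Hypothetical miniature versions of two of the input files;
# the real CSVs may use different column names.
scopus_csv = io.StringIO(
    "doi,title\n"
    "10.1000/a,Paper A\n"
    "10.1000/b,Paper B\n"
)
oddpub_csv = io.StringIO(
    "doi,open_data_detected\n"
    "10.1000/a,TRUE\n"
    "10.1000/c,FALSE\n"
)

# Index one file by DOI, then keep only rows whose DOI occurs in both
# inputs -- an inner join on the DOI column.
oddpub_by_doi = {row["doi"]: row for row in csv.DictReader(oddpub_csv)}
merged = [
    {**row, **oddpub_by_doi[row["doi"]]}
    for row in csv.DictReader(scopus_csv)
    if row["doi"] in oddpub_by_doi
]
print(merged)  # only 10.1000/a appears in both inputs
```

    Normalizing DOIs (lowercasing, stripping "https://doi.org/" prefixes) before joining avoids silently dropping matches.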

    ## Manual coding

    Folder 03_ManualCheck/

    - CodeSheet_ManualCheck.txt (Code sheet with descriptions of the variables for manual coding)

    - ManualCheck_2023-06-08.csv (Manual coding results file)

    - PLOS_ODDPub_Manual.csv (Merge of the results file of the ODDPub and PLOS-OSI classification with the results file of the manual coding)

    ## Explorative analysis for the discoverability of open data

    Folder 04_FurtherAnalyses/

    - Proof_of_of_Concept_Open_Data_Monitoring.pdf (Description of the explorative analysis of the discoverability of open data publications using the example of a researcher) - in German

    ## R-Script

    - Analyses_MA_OpenDataMonitoring.R (R script for preparing, merging and analyzing the data and for performing the ODDPub algorithm)

  3. DataSheet_1_The TargetMine Data Warehouse: Enhancement and Updates.pdf

    • frontiersin.figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Yi-An Chen; Lokesh P. Tripathi; Takeshi Fujiwara; Tatsuya Kameyama; Mari N. Itoh; Kenji Mizuguchi (2023). DataSheet_1_The TargetMine Data Warehouse: Enhancement and Updates.pdf [Dataset]. http://doi.org/10.3389/fgene.2019.00934.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Yi-An Chen; Lokesh P. Tripathi; Takeshi Fujiwara; Tatsuya Kameyama; Mari N. Itoh; Kenji Mizuguchi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Biological data analysis is the key to new discoveries in disease biology and drug discovery. The rapid proliferation of high-throughput ‘omics’ data has created a need for tools and platforms that allow researchers to combine and analyse different types of biological data and obtain biologically relevant knowledge. We previously developed TargetMine, an integrative data analysis platform for target prioritisation and broad-based biological knowledge discovery. Here, we describe the newly modelled biological data types and the enhanced visual and analytical features of TargetMine. These enhancements include an expanded coverage of gene–gene relations, small molecule metabolite to pathway mappings, an improved literature survey feature, and in silico prediction of gene functional associations such as protein–protein interactions and global gene co-expression. We also describe two usage examples on trans-omics data analysis and extraction of gene–disease associations using MeSH term descriptors. These examples demonstrate how the newer enhancements in TargetMine contribute to a more expansive coverage of the biological data space and can help interpret genotype–phenotype relations. TargetMine with its auxiliary toolkit is available at https://targetmine.mizuguchilab.org. The TargetMine source code is available at https://github.com/chenyian-nibio/targetmine-gradle.

  4. Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North...

    • technavio.com
    pdf
    Updated Feb 8, 2025
    Cite
    Technavio (2025). Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, UK), APAC (China, India, Japan), South America (Brazil), and Middle East and Africa (UAE) [Dataset]. https://www.technavio.com/report/data-science-platform-market-industry-analysis
    Explore at:
    Available download formats: pdf
    Dataset updated
    Feb 8, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    United States
    Description


    Data Science Platform Market Size 2025-2029

    The data science platform market is projected to grow by USD 763.9 million at a CAGR of 40.2% from 2024 to 2029. Integration of AI and ML technologies with data science platforms will drive this growth.
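    As a sanity check on figures of this kind, the relationship between a CAGR and the total growth multiple over the forecast window is plain compound arithmetic. The calculation below only illustrates the formula using the quoted rate and window, not the report's underlying model.

```python
# Compound growth: a CAGR of r sustained over n years multiplies
# the starting value by (1 + r) ** n.
cagr = 0.402   # 40.2%, as quoted for 2024-2029
years = 5      # 2024 -> 2029
multiple = (1 + cagr) ** years
print(round(multiple, 2))  # -> 5.42, i.e. ~5.4x the starting value
```

    Conversely, a known start and end value give the implied CAGR as `(end / start) ** (1 / years) - 1`.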

    Major Market Trends & Insights

    North America dominated the market and is expected to account for 48% of growth during the forecast period.
    By Deployment - On-premises segment was valued at USD 38.70 million in 2023
    By Component - Platform segment accounted for the largest market revenue share in 2023
    

    Market Size & Forecast

    Market Opportunities: USD 1.00 million
    Market Future Opportunities: USD 763.90 million
    CAGR: 40.2%
    North America: Largest market in 2023
    

    Market Summary

    The market represents a dynamic and continually evolving landscape, underpinned by advancements in core technologies and applications. Key technologies, such as machine learning and artificial intelligence, are increasingly integrated into data science platforms to enhance predictive analytics and automate data processing. Additionally, the emergence of containerization and microservices in data science platforms enables greater flexibility and scalability. However, the market also faces challenges, including data privacy and security risks, which necessitate robust compliance with regulations.
    According to recent estimates, the market is expected to account for over 30% of the overall big data analytics market by 2025, underscoring its growing importance in the data-driven business landscape.
    

    What will be the Size of the Data Science Platform Market during the forecast period?


    How is the Data Science Platform Market Segmented and what are the key trends of market segmentation?

    The data science platform industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Deployment

    • On-premises
    • Cloud

    Component

    • Platform
    • Services

    End-user

    • BFSI
    • Retail and e-commerce
    • Manufacturing
    • Media and entertainment
    • Others

    Sector

    • Large enterprises
    • SMEs

    Application

    • Data Preparation
    • Data Visualization
    • Machine Learning
    • Predictive Analytics
    • Data Governance
    • Others

    Geography

    • North America: US, Canada
    • Europe: France, Germany, UK
    • Middle East and Africa: UAE
    • APAC: China, India, Japan
    • South America: Brazil
    • Rest of World (ROW)

    By Deployment Insights

    The on-premises segment is estimated to witness significant growth during the forecast period.

    In this dynamic and evolving market, big data processing is a key focus, enabling advanced model accuracy metrics through various data mining methods. Distributed computing and algorithm optimization are integral components, ensuring efficient handling of large datasets. Data governance policies are crucial for managing data security protocols and ensuring data lineage tracking. Software development kits, model versioning, and anomaly detection systems facilitate seamless development, deployment, and monitoring of predictive modeling techniques, including machine learning algorithms, regression analysis, and statistical modeling. Real-time data streaming and parallelized algorithms enable real-time insights, while predictive modeling techniques and machine learning algorithms drive business intelligence and decision-making.

    Cloud computing infrastructure, data visualization tools, high-performance computing, and database management systems support scalable data solutions and efficient data warehousing. ETL processes and data integration pipelines ensure data quality assessment and feature engineering techniques. Clustering techniques and natural language processing are essential for advanced data analysis. The market is witnessing significant growth, with adoption increasing by 18.7% in the past year, and industry experts anticipate a further expansion of 21.6% in the upcoming period. Companies across various sectors are recognizing the potential of data science platforms, leading to a surge in demand for scalable, secure, and efficient solutions.

    API integration services and deep learning frameworks are gaining traction, offering advanced capabilities and seamless integration with existing systems. Data security protocols and model explainability methods are becoming increasingly important, ensuring transparency and trust in data-driven decision-making. The market is expected to continue unfolding, with ongoing advancements in technology and evolving business needs shaping its future trajectory.


    The On-premises segment was valued at USD 38.70 million in 2019 and showed

  5. Make Data Count Dataset - MinerU Extraction

    • kaggle.com
    zip
    Updated Aug 26, 2025
    Cite
    Omid Erfanmanesh (2025). Make Data Count Dataset - MinerU Extraction [Dataset]. https://www.kaggle.com/datasets/omiderfanmanesh/make-data-count-dataset-mineru-extraction
    Explore at:
    Available download formats: zip (4272989320 bytes)
    Dataset updated
    Aug 26, 2025
    Authors
    Omid Erfanmanesh
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Description

    This dataset contains PDF-to-text conversions of scientific research articles, prepared for the task of data citation mining. The goal is to identify references to research datasets within full-text scientific papers and classify them as Primary (data generated in the study) or Secondary (data reused from external sources).

    The PDF articles were processed using MinerU, which converts scientific PDFs into structured machine-readable formats (JSON, Markdown, images). This ensures participants can access both the raw text and layout information needed for fine-grained information extraction.

    Files and Structure

    Each paper directory contains the following files:

    • *_origin.pdf The original PDF file of the scientific article.

    • *_content_list.json Structured extraction of the PDF content, where each object represents a text or figure element with metadata. Example entry:

      {
       "type": "text",
       "text": "10.1002/2017JC013030",
       "text_level": 1,
       "page_idx": 0
      }
      
    • full.md The complete article content in Markdown format (linearized for easier reading).

    • images/ Folder containing figures and extracted images from the article.

    • layout.json Page layout metadata, including positions of text blocks and images.
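    Given the file layout above, pulling the plain-text elements out of a *_content_list.json file is a short exercise. The miniature list below mirrors the example object shown earlier; real files hold many more entries and fields.

```python
import json

# A miniature *_content_list.json, mirroring the example entry above.
content_list = json.loads("""
[
  {"type": "text", "text": "10.1002/2017JC013030", "text_level": 1, "page_idx": 0},
  {"type": "image", "img_path": "images/fig1.png", "page_idx": 1},
  {"type": "text", "text": "Data are archived at Dryad.", "page_idx": 2}
]
""")

# Keep only textual elements, preserving the page each one came from.
texts = [(item["page_idx"], item["text"])
         for item in content_list if item["type"] == "text"]
print(texts)
```

    The same pass over full.md works for linear reading, but the content list keeps page indices, which matter when a dataset mention must be located precisely.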

    Data Mining Task

    The aim is to detect dataset references in the article text and classify them:

    Each dataset mention must be labeled as:

    • Primary: Data generated by the paper (new experiments, field observations, sequencing runs, etc.).
    • Secondary: Data reused from external repositories or prior studies.

    Training and Test Splits

    • train/ → Articles with gold-standard labels (train_labels.csv).
    • test/ → Articles without labels, used for evaluation.
    • train_labels.csv → Ground truth with:

      • article_id: Research paper DOI.
      • dataset_id: Extracted dataset identifier.
      • type: Citation type (Primary / Secondary).
    • sample_submission.csv → Example submission format.

    Example

    Paper: https://doi.org/10.1098/rspb.2016.1151
    Data: https://doi.org/10.5061/dryad.6m3n9
    In-text span: "The data we used in this publication can be accessed from Dryad at doi:10.5061/dryad.6m3n9."
    Citation type: Primary

    This dataset enables participants to develop and test NLP systems for:

    • Information extraction (locating dataset mentions).
    • Identifier normalization (mapping mentions to persistent IDs).
    • Citation classification (distinguishing Primary vs Secondary data usage).
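    A naive baseline for the identifier-extraction subtask is a DOI-shaped regular expression over the article text. The pattern below is an illustrative starting point, not the competition's reference solution; the Primary/Secondary decision needs sentence-level context and is better handled by a trained classifier.

```python
import re

# Loose DOI pattern; trailing sentence punctuation is stripped afterwards.
DOI_RE = re.compile(r"10\.\d{4,9}/\S+")

def extract_dois(text):
    """Return DOI-like identifiers found in free text."""
    return [m.group(0).rstrip(".,;)") for m in DOI_RE.finditer(text)]

# The in-text span from the example above
span = ("The data we used in this publication can be accessed "
        "from Dryad at doi:10.5061/dryad.6m3n9.")
print(extract_dois(span))  # -> ['10.5061/dryad.6m3n9']
```

    Note that accession numbers (GenBank, PDB, etc.) follow different patterns and would each need their own rule or a learned tagger.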
  6. Figure 6 and 7 from manuscript Sparsely-Connected Autoencoder (SCA) for...

    • figshare.com
    zip
    Updated Aug 26, 2020
    Cite
    Raffaele Calogero (2020). Figure 6 and 7 from manuscript Sparsely-Connected Autoencoder (SCA) for single cell RNAseq data mining [Dataset]. http://doi.org/10.6084/m9.figshare.12866897.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 26, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Raffaele Calogero
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset used to generate figures 6 and 7.

    Figure 6: Analysis of human breast cancer (Block A Section 1), from 10XGenomics Visium Spatial Gene Expression 1.0.0 demonstration samples. A) SIMLR partitioning in 9 clusters (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_BAS1_expr-var-ann_matrix_Stability_Plot.pdf). B) Cell stability score plot for SIMLR clusters in A (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_BAS1_expr-var-ann_matrix_Stability_Plot.pdf). C) SIMLR clusters location in the tissue section (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_BAS1_expr-var-ann_matrix_spatial_Stability.pdf). D) Hematoxylin and eosin image (figure6and7/HBC_BAS1/spatial/V1_Breast_Cancer_Block_A_Section_1_image.tif).

    Figure 7: Information contents extracted by SCA analysis using a TF-based latent space. A) QCC (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_TF_SIMLRV2/9/HBC_BAS1_expr-var-ann_matrix_stabilityPlot.pdf). B) QCM (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_TF_SIMLRV2/9/HBC_BAS1_expr-var-ann_matrix_stabilityPlotUNBIAS.pdf). C) QCM/QCC plot, where only cluster 7 shows, for the majority of the cells, both QCC and QCM greater than 0.5 (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_TF_SIMLRV2/9/HBC_BAS1_expr-var-ann_matrix_StabilitySignificativityJittered.pdf). D) COMET analysis of SCA latent space. SOX5 was detected as the first top-ranked gene specific for cluster 7, using as input for COMET the latent space frequency table (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/outputvis/cluster_7_singleton/rank_1.png). The input counts table for SCA analysis is made of raw counts.

  7. Data from: Wine Quality

    • kaggle.com
    • tensorflow.org
    zip
    Updated Oct 29, 2017
    Cite
    Daniel S. Panizzo (2017). Wine Quality [Dataset]. https://www.kaggle.com/datasets/danielpanizzo/wine-quality
    Explore at:
    Available download formats: zip (111077 bytes)
    Dataset updated
    Oct 29, 2017
    Authors
    Daniel S. Panizzo
    License

    Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Citation Request: This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. Please include this citation if you plan to use this database:

    P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

    Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

    1. Title: Wine Quality

    2. Sources Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009

    3. Past Usage:

      P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

      In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. PH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity analysis procedure).

    4. Relevant Information:

      The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

      These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant, so it could be interesting to test feature selection methods.

    5. Number of Instances: red wine - 1599; white wine - 4898.

    6. Number of Attributes: 11 + output attribute

      Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.
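    The correlation-based screening suggested in the note above can be sketched in a few lines. The values below are invented stand-ins for two attribute columns and the quality score; real values would come from the winequality CSVs.

```python
from math import sqrt

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy stand-ins for two attribute columns and the quality score
alcohol = [9.4, 9.8, 10.5, 11.2, 12.0]
density = [0.9978, 0.9970, 0.9962, 0.9951, 0.9940]
quality = [5, 5, 6, 6, 7]

# Rank attributes by absolute correlation with quality
scores = {name: abs(pearson(col, quality))
          for name, col in [("alcohol", alcohol), ("density", density)]}
print(max(scores, key=scores.get))
```

    Pearson correlation only captures linear, univariate relationships; wrapper or embedded feature-selection methods would be the next step.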

    7. Attribute information:

      For more information, read [Cortez et al., 2009].

      Input variables (based on physicochemical tests):

      1 - fixed acidity (tartaric acid - g / dm^3)
      2 - volatile acidity (acetic acid - g / dm^3)
      3 - citric acid (g / dm^3)
      4 - residual sugar (g / dm^3)
      5 - chlorides (sodium chloride - g / dm^3)
      6 - free sulfur dioxide (mg / dm^3)
      7 - total sulfur dioxide (mg / dm^3)
      8 - density (g / cm^3)
      9 - pH
      10 - sulphates (potassium sulphate - g / dm^3)
      11 - alcohol (% by volume)

      Output variable (based on sensory data):

      12 - quality (score between 0 and 10)

    8. Missing Attribute Values: None

    9. Description of attributes:

      1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

      2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

      3 - citric acid: found in small quantities, citric acid can add 'freshness' and flavor to wines

      4 - residual sugar: the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

      5 - chlorides: the amount of salt in the wine

      6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

      7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

      8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

      9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

      10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

      11 - alcohol: the percent alcohol content of the wine

      Output variable (based on sensory data): 12 - quality (score between 0 and 10)

  8. Multi-aspect Reviews

    • kaggle.com
    zip
    Updated Oct 30, 2023
    Cite
    Ahmad (2023). Multi-aspect Reviews [Dataset]. https://www.kaggle.com/datasets/pypiahmad/multi-aspect-reviews
    Explore at:
    Available download formats: zip (875907419 bytes)
    Dataset updated
    Oct 30, 2023
    Authors
    Ahmad
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The Multi-aspect Reviews dataset primarily encompasses beer review data from RateBeer and BeerAdvocate, with a focus on multiple rated dimensions providing a comprehensive insight into sensory aspects such as taste, look, feel, and smell. This dataset facilitates the analysis of different facets of reviews, thus aiding in a deeper understanding of user preferences and product characteristics.

    Basic Statistics:

    • RateBeer
      • Number of users: 40,213
      • Number of items: 110,419
      • Number of ratings/reviews: 2,855,232
      • Timespan: Apr 2000 - Nov 2011

    • BeerAdvocate
      • Number of users: 33,387
      • Number of items: 66,051
      • Number of ratings/reviews: 1,586,259
      • Timespan: Jan 1998 - Nov 2011

    Metadata:

    • Reviews: Textual reviews provided by users.
    • Aspect-specific ratings: Ratings on taste, look, feel, smell, and overall impression.
    • Product Category: Categories of beer products.
    • ABV (Alcohol By Volume): Indicates the alcohol content in the beer.

    Examples:

    • RateBeer example:

      {
        "beer/name": "John Harvards Simcoe IPA",
        "beer/beerId": "63836",
        "beer/brewerId": "8481",
        "beer/ABV": "5.4",
        "beer/style": "India Pale Ale (IPA)",
        "review/appearance": "4/5",
        "review/aroma": "6/10",
        "review/palate": "3/5",
        "review/taste": "6/10",
        "review/overall": "13/20",
        "review/time": "1157587200",
        "review/profileName": "hopdog",
        "review/text": "On tap at the Springfield, PA location. Poured a deep and cloudy orange (almost a copper) color with a small sized off white head. Aromas or oranges and all around citric. Tastes of oranges, light caramel and a very light grapefruit finish. I too would not believe the 80+ IBUs - I found this one to have a very light bitterness with a medium sweetness to it. Light lacing left on the glass."
      }
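    Because the aspect ratings arrive as strings on different scales ("4/5", "6/10", "13/20"), normalizing them to a common [0, 1] range is usually the first pre-processing step. The field names follow the example record; the helper itself is an illustrative sketch.

```python
def normalize_rating(value):
    """Convert a 'score/max' string such as '13/20' to a float in [0, 1]."""
    score, maximum = value.split("/")
    return float(score) / float(maximum)

# Aspect fields as they appear in a RateBeer record
review = {
    "review/appearance": "4/5",
    "review/aroma": "6/10",
    "review/overall": "13/20",
}
normalized = {k: normalize_rating(v) for k, v in review.items()}
print(normalized["review/overall"])  # -> 0.65
```

    After this step, ratings across aspects and across the two sites become directly comparable for aspect-based analysis.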

    Download Links:

    • BeerAdvocate Data
    • RateBeer Data
    • Sentences with aspect labels (annotator 1)
    • Sentences with aspect labels (annotator 2)

    Citations:

    • Learning attitudes and attributes from multi-aspect reviews, Julian McAuley, Jure Leskovec, Dan Jurafsky, International Conference on Data Mining (ICDM), 2012.
    • From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews, Julian McAuley, Jure Leskovec, WWW, 2013.

    Use Cases:

    1. Aspect-Based Sentiment Analysis (ABSA): Analyzing sentiments on different aspects of beers like taste, look, feel, and smell to gain deeper insights into user preferences and opinions.
    2. Recommendation Systems: Developing personalized recommendation systems that consider multiple aspects of user preferences.
    3. Product Development: Utilizing the feedback on various aspects to improve the product.
    4. Consumer Behavior Analysis: Studying how different aspects influence consumer choice and satisfaction.
    5. Competitor Analysis: Comparing ratings on different aspects with competitors to identify strengths and weaknesses.
    6. Trend Analysis: Identifying trends in consumer preferences over time across different aspects.
    7. Marketing Strategies: Formulating marketing strategies based on insights drawn from aspect-based reviews.
    8. Natural Language Processing (NLP): Developing and enhancing NLP models to understand and categorize multi-aspect reviews.
    9. Learning User Expertise Evolution: Studying how user expertise evolves through reviews and ratings over time.
    10. Training Machine Learning Models: Training supervised learning models to predict aspect-based ratings from review text.

    This dataset is extremely valuable for researchers, marketers, product developers, and machine learning practitioners looking to delve into multi-dimensional review analysis and understand user-product interaction on a granular level.

  9. fdata-02-00012_Identifying Travel Regions Using Location-Based Social...

    • frontiersin.figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Avradip Sen; Linus W. Dietz (2023). fdata-02-00012_Identifying Travel Regions Using Location-Based Social Network Check-in Data.pdf [Dataset]. http://doi.org/10.3389/fdata.2019.00012.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Avradip Sen; Linus W. Dietz
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Travel regions are not necessarily defined by political or administrative boundaries. For example, in the Schengen area of Europe, tourists can travel freely irrespective of national borders. Identifying transboundary travel regions is an interesting problem which we aim to solve using mobility analysis of Twitter users. Our proposed solution comprises collecting geotagged tweets, combining them into trajectories and, thus, mining thousands of trips undertaken by Twitter users. After aggregating these trips into a mobility graph, we apply a community detection algorithm to find coherent regions throughout the world. The discovered regions provide insights into international travel and can reveal both domestic and transnational travel regions.
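    The trip-aggregation step described in the abstract, turning mined trips into a weighted mobility graph on which community detection then runs, can be sketched as follows. The trips below are invented placeholders; the paper's trips come from geotagged tweet trajectories.

```python
from collections import Counter

# Each trip is an ordered sequence of visited locations; consecutive
# pairs become undirected edges weighted by how often they occur.
trips = [
    ["Paris", "Brussels", "Amsterdam"],
    ["Paris", "Brussels"],
    ["Madrid", "Lisbon"],
]

edges = Counter()
for trip in trips:
    for a, b in zip(trip, trip[1:]):
        edges[frozenset((a, b))] += 1

print(edges[frozenset(("Paris", "Brussels"))])  # -> 2
```

    A community detection algorithm (the paper does not name which one; Louvain is a common choice) would then partition this weighted graph into the candidate travel regions.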

  10. Company Documents Dataset

    • kaggle.com
    zip
    Updated May 23, 2024
    Cite
    Ayoub Cherguelaine (2024). Company Documents Dataset [Dataset]. https://www.kaggle.com/datasets/ayoubcherguelaine/company-documents-dataset
    Explore at:
    Available download formats: zip (9,789,538 bytes)
    Dataset updated
    May 23, 2024
    Authors
    Ayoub Cherguelaine
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Overview

    This dataset contains a collection of over 2,000 company documents, categorized into four main types: invoices, inventory reports, purchase orders, and shipping orders. Each document is provided in PDF format, accompanied by a CSV file that includes the text extracted from these documents, their respective labels, and the word count of each document. This dataset is ideal for various natural language processing (NLP) tasks, including text classification, information extraction, and document clustering.

    Dataset Content

    PDF Documents: The dataset includes 2,677 PDF files, each representing a unique company document. These documents are derived from the Northwind dataset, which is commonly used for demonstrating database functionalities.

    The document types are:

    • Invoices: Detailed records of transactions between a buyer and a seller.
    • Inventory Reports: Records of inventory levels, including items in stock and units sold.
    • Purchase Orders: Requests made by a buyer to a seller to purchase products or services.
    • Shipping Orders: Instructions for the delivery of goods to specified recipients.

    Example Entries

    Here are a few example entries from the CSV file:

    Shipping Order:

    • Order ID: 10718
    • Shipping Details: "Ship Name: Königlich Essen, Ship Address: Maubelstr. 90, Ship City: ..."
    • Word Count: 120

    Invoice:

    • Order ID: 10707
    • Customer Details: "Customer ID: Arout, Order Date: 2017-10-16, Contact Name: Th..."
    • Word Count: 66

    Purchase Order:

    • Order ID: 10892
    • Order Details: "Order Date: 2018-02-17, Customer Name: Catherine Dewey, Products: Product ..."
    • Word Count: 26

    Applications

    This dataset can be used for:

    • Text Classification: Train models to classify documents into their respective categories.
    • Information Extraction: Extract specific fields and details from the documents.
    • Document Clustering: Group similar documents together based on their content.
    • OCR and Text Mining: Improve OCR (Optical Character Recognition) models and text mining techniques using real-world data.
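The text-classification application above can be sketched with scikit-learn; the snippets below stand in for the extracted text in the dataset's CSV, and the four labels mirror the document types listed earlier.

```python
# Hedged sketch: classify company documents into their four types using
# TF-IDF features and naive Bayes. The training snippets are invented
# stand-ins for the dataset's extracted-text column.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "Invoice Order ID 10707 amount due payment terms net 30",
    "Inventory report units in stock units sold reorder level",
    "Purchase order requested products quantity unit price supplier",
    "Shipping order ship name ship address ship city delivery date",
]
labels = ["invoice", "inventory_report", "purchase_order", "shipping_order"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["ship address and delivery date for order 10718"]))
```

With the real CSV, `texts` and `labels` would come from the extracted-text and label columns, with a held-out split for evaluation.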
  11. Anomaly Detection Market Analysis, Size, and Forecast 2025-2029: North...

    • technavio.com
    pdf
    Updated Jun 12, 2025
    Cite
    Technavio (2025). Anomaly Detection Market Analysis, Size, and Forecast 2025-2029: North America (US, Canada, and Mexico), Europe (France, Germany, Spain, and UK), APAC (China, India, and Japan), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/anomaly-detection-market-industry-analysis
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 12, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-noticehttps://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    Canada, United States
    Description


    Anomaly Detection Market Size 2025-2029

    The anomaly detection market size is valued to increase by USD 4.44 billion, at a CAGR of 14.4% from 2024 to 2029. Anomaly detection tools gaining traction in BFSI will drive the anomaly detection market.

    Major Market Trends & Insights

    North America dominated the market and accounted for a 43% growth during the forecast period.
    By Deployment - Cloud segment was valued at USD 1.75 billion in 2023
    By Component - Solution segment accounted for the largest market revenue share in 2023
    

    Market Size & Forecast

    Market Opportunities: USD 173.26 million
    Market Future Opportunities: USD 4441.70 million
    CAGR from 2024 to 2029: 14.4%
    

    Market Summary

    Anomaly detection, a critical component of advanced analytics, is witnessing significant adoption across various industries, with the financial services sector leading the charge. The increasing incidence of internal threats and cybersecurity fraud necessitates robust anomaly detection solutions. These tools help organizations identify unusual patterns and deviations from normal behavior, enabling proactive response to potential threats and ensuring operational efficiency. For instance, in a supply chain context, anomaly detection can help identify discrepancies in inventory levels or delivery schedules, leading to cost savings and improved customer satisfaction. In the realm of compliance, anomaly detection can assist in maintaining regulatory adherence by flagging unusual transactions or activities, thereby reducing the risk of penalties and reputational damage.
    According to recent research, organizations that implement anomaly detection solutions experience a reduction in error rates by up to 25%. This improvement not only enhances operational efficiency but also contributes to increased customer trust and satisfaction. Despite these benefits, challenges persist, including data quality and the need for real-time processing capabilities. As the market continues to evolve, advancements in machine learning and artificial intelligence are expected to address these challenges and drive further growth.
    

    What will be the Size of the Anomaly Detection Market during the forecast period?

    Get Key Insights on Market Forecast (PDF) Request Free Sample

    How is the Anomaly Detection Market Segmented ?

    The anomaly detection industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Deployment
    
      Cloud
      On-premises
    
    
    Component
    
      Solution
      Services
    
    
    End-user
    
      BFSI
      IT and telecom
      Retail and e-commerce
      Manufacturing
      Others
    
    
    Technology
    
      Big data analytics
      AI and ML
      Data mining and business intelligence
    
    
    Geography
    
      North America
    
        US
        Canada
        Mexico
    
    
      Europe
    
        France
        Germany
        Spain
        UK
    
    
      APAC
    
        China
        India
        Japan
    
    
      Rest of World (ROW)
    

    By Deployment Insights

    The cloud segment is estimated to witness significant growth during the forecast period.

    The market is witnessing significant growth, driven by the increasing adoption of advanced technologies such as machine learning algorithms, predictive modeling tools, and real-time monitoring systems. Businesses are increasingly relying on anomaly detection solutions to enhance their root cause analysis, improve system health indicators, and reduce false positives. This is particularly true in sectors where data is generated in real-time, such as cybersecurity threat detection, network intrusion detection, and fraud detection systems. Cloud-based anomaly detection solutions are gaining popularity due to their flexibility, scalability, and cost-effectiveness.

    This growth is attributed to cloud-based solutions' quick deployment, real-time data visibility, and customization capabilities, which are offered at flexible payment options like monthly subscriptions and pay-as-you-go models. Companies like Anodot, Ltd, Cisco Systems Inc, IBM Corp, and SAS Institute Inc provide both cloud-based and on-premise anomaly detection solutions. Anomaly detection methods include outlier detection, change point detection, and statistical process control. Data preprocessing steps, such as data mining techniques and feature engineering processes, are crucial in ensuring accurate anomaly detection. Data visualization dashboards and alert fatigue mitigation techniques help in managing and interpreting the vast amounts of data generated.

    Network traffic analysis, log file analysis, and sensor data integration are essential components of anomaly detection systems. Additionally, risk management frameworks, drift detection algorithms, time series forecasting, and performance degradation detection are vital in maintaining system performance and capacity planning.
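Outlier detection, one of the methods named above, can be illustrated with scikit-learn's IsolationForest; the "network traffic" readings below are synthetic, with two spikes injected as anomalies.

```python
# Hedged sketch: flag anomalous readings in a synthetic traffic series
# with an isolation forest. All values are invented.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=100.0, scale=5.0, size=(200, 1))  # baseline traffic
spikes = np.array([[180.0], [15.0]])                      # injected anomalies
readings = np.vstack([normal, spikes])

# contamination sets the expected anomaly fraction (~1% here)
detector = IsolationForest(contamination=0.01, random_state=0)
flags = detector.fit_predict(readings)  # -1 marks an anomaly

print(np.where(flags == -1)[0])  # indices flagged as anomalous
```

In production systems the same fit/predict loop would run over streaming features (traffic volumes, log-event counts, sensor values) with the contamination rate tuned to control alert fatigue.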

  12. Analyses of historic U.S. Bureau of Mines samples for geochemical...

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Jul 5, 2023
    Cite
    Alaska Division of Geological & Geophysical Surveys (Point of Contact) (2023). Analyses of historic U.S. Bureau of Mines samples for geochemical trace-element and rare-earth-element data from the Circle mining district, western Crazy Mountains, and the Lime Peak area of the White Mountains, Circle Quadrangle, east-central Alaska [Dataset]. https://catalog.data.gov/dataset/analyses-of-historic-u-s-bureau-of-mines-samples-for-geochemical-trace-element-and-rare-earth-e11
    Explore at:
    Dataset updated
    Jul 5, 2023
    Dataset provided by
    Alaska Division of Geological & Geophysical Surveys (Point of Contact)
    Area covered
    Alaska, Lime Peak, United States
    Description

    This report and digital data release present 286 new geochemical analyses of historic U.S. Bureau of Mines (USBM) samples, including 93 rock, 110 stream sediment, 52 soil, and 28 heavy mineral concentrate (pan concentrate) samples, as well as 3 samples of indeterminate type. These samples were originally collected as part of studies by the USBM in the Circle mining district, western Crazy Mountains, and Lime Peak area of the White Mountains, Circle Quadrangle, east-central Alaska. Historic USBM sample materials were retrieved by DGGS from the DGGS Geologic Materials Center (GMC), where the USBM samples were transferred as part of the federally funded Minerals Data and Information Rescue in Alaska (MDIRA) program in the late 1990s and early 2000s. The text, analytical data, and tables associated with this report are released in digital format as PDF and .csv files. We provide analytical data, detection limits and, when available, the method documentation provided to us by the lab. We also provide the sample location in geographic coordinates, the sample material cited by the originating literature, a reference to the originating report, and the type of sample material that was obtained from the archive and sent to the lab.

  13. winequality-white

    • kaggle.com
    zip
    Updated Oct 12, 2024
    Cite
    Vitalii Puzhenko (2024). winequality-white [Dataset]. https://www.kaggle.com/datasets/vitaliipuzhenko/winequality-white/suggestions?status=pending
    Explore at:
    Available download formats: zip (73,187 bytes)
    Dataset updated
    Oct 12, 2024
    Authors
    Vitalii Puzhenko
    Description

    P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

    Available at:
    • [Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
    • [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
    • [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

    1. Title: Wine Quality

    2. Sources Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009

    3. Past Usage:

      P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

      In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. PH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity analysis procedure).

    4. Relevant Information:

      The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

      These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are much more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant, so it could be interesting to test feature selection methods.

    5. Number of Instances: red wine - 1599; white wine - 4898.

    6. Number of Attributes: 11 + output attribute

      Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.

    7. Attribute information:

      For more information, read [Cortez et al., 2009].

      Input variables (based on physicochemical tests):
      1 - fixed acidity
      2 - volatile acidity
      3 - citric acid
      4 - residual sugar
      5 - chlorides
      6 - free sulfur dioxide
      7 - total sulfur dioxide
      8 - density
      9 - pH
      10 - sulphates
      11 - alcohol

      Output variable (based on sensory data):
      12 - quality (score between 0 and 10)

    8. Missing Attribute Values: None
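The regression approach from the reference (physicochemical inputs, a support vector machine, and MAD as a metric) can be sketched with scikit-learn. To keep the snippet self-contained it uses synthetic data with 11 input columns standing in for the attributes above, with quality driven mainly by the last column (playing the role of alcohol); it does not reproduce the paper's results.

```python
# Hedged sketch: SVM regression of wine quality from 11 inputs, scored
# by mean absolute deviation (MAD). Data is synthetic, not the UCI file.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 11))  # 11 physicochemical-style inputs
y = np.clip(5 + X[:, 10] + 0.5 * rng.normal(size=300), 0, 10)  # quality 0-10

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling matters for RBF-kernel SVMs; the pipeline applies it per fold
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
model.fit(X_train, y_train)

mad = np.mean(np.abs(model.predict(X_test) - y_test))
print(round(mad, 2))
```

With the real CSV, `X` and `y` would be read from the 11 input columns and the quality column, and feature selection (point 6's note on correlated attributes) could be slotted into the same pipeline.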

  14. Prescriptive Analytics Market Analysis, Size, and Forecast 2025-2029: North...

    • technavio.com
    pdf
    Updated Jun 20, 2025
    Cite
    Technavio (2025). Prescriptive Analytics Market Analysis, Size, and Forecast 2025-2029: North America (US, Canada, and Mexico), Europe (France, Germany, Italy, and UK), APAC (China, India, and Japan), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/prescriptive-analytics-market-industry-analysis
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 20, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-noticehttps://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    United States
    Description


    Prescriptive Analytics Market Size 2025-2029

    The prescriptive analytics market size is valued to increase by USD 10.96 billion, at a CAGR of 23.3% from 2024 to 2029. Rising demand for predictive analytics will drive the prescriptive analytics market.

    Major Market Trends & Insights

    North America dominated the market and accounted for a 39% growth during the forecast period.
    By Solution - Services segment was valued at USD 3 billion in 2023
    By Deployment - Cloud-based segment accounted for the largest market revenue share in 2023
    

    Market Size & Forecast

    Market Opportunities: USD 359.55 million
    Market Future Opportunities: USD 10962.00 million
    CAGR from 2024 to 2029: 23.3%
    

    Market Summary

    Prescriptive analytics, an advanced form of business intelligence, is gaining significant traction in today's data-driven business landscape. This analytical approach goes beyond traditional business intelligence and predictive analytics by providing actionable recommendations to optimize business processes and enhance operational efficiency. The market's growth is fueled by the increasing availability of real-time data, the rise of machine learning algorithms, and the growing demand for data-driven decision-making. One area where prescriptive analytics is making a significant impact is in supply chain optimization. For instance, a manufacturing company can use prescriptive analytics to analyze historical data and real-time market trends to optimize production schedules, minimize inventory costs, and improve delivery times.
    In a recent study, a leading manufacturing firm implemented prescriptive analytics and achieved a 15% reduction in inventory holding costs and a 12% improvement in on-time delivery rates. However, the adoption of prescriptive analytics is not without challenges. Data privacy and regulatory compliance are major concerns, particularly in industries such as healthcare and finance. Companies must ensure that they have robust data security measures in place to protect sensitive customer information and comply with regulations such as HIPAA and GDPR. Despite these challenges, the benefits of prescriptive analytics far outweigh the costs, making it an essential tool for businesses looking to gain a competitive edge in their respective markets.
    

    What will be the Size of the Prescriptive Analytics Market during the forecast period?

    Get Key Insights on Market Forecast (PDF) Request Free Sample

    How is the Prescriptive Analytics Market Segmented ?

    The prescriptive analytics industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Solution
    
      Services
      Product
    
    
    Deployment
    
      Cloud-based
      On-premises
    
    
    Sector
    
      Large enterprises
      Small and medium-sized enterprises (SMEs)
    
    
    Geography
    
      North America
    
        US
        Canada
        Mexico
    
    
      Europe
    
        France
        Germany
        Italy
        UK
    
    
      APAC
    
        China
        India
        Japan
    
    
      Rest of World (ROW)
    

    By Solution Insights

    The services segment is estimated to witness significant growth during the forecast period.

    In 2024, the market continues to evolve, becoming a pivotal force in data-driven decision-making across industries. With a projected growth of 15.2% annually, this market is transforming business landscapes by delivering actionable recommendations that align with strategic objectives. From enhancing customer satisfaction to optimizing operational efficiency and reducing costs, prescriptive analytics services are increasingly indispensable. Advanced optimization engines and AI-driven models now handle intricate decision variables, constraints, and trade-offs in real time. This real-time capability supports complex decision-making scenarios across strategic, tactical, and operational levels. Industries like healthcare, retail, manufacturing, and logistics are harnessing prescriptive analytics in unique ways.

    Monte Carlo simulation, scenario planning, and neural networks are just a few techniques used to optimize supply chain operations. Data visualization dashboards, what-if analysis, and natural language processing facilitate better understanding of complex data. Reinforcement learning, time series forecasting, and inventory management are essential components of prescriptive modeling, enabling AI-driven recommendations. Decision support systems, dynamic programming, causal inference, and multi-objective optimization are integral to the decision-making process. Machine learning models, statistical modeling, and optimization algorithms power these advanced systems. Real-time analytics, risk assessment modeling, and linear programming are crucial for managing uncertainty and mitigating risks. Data mining techniques and expert systems provide valuable insights.
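Linear programming, one of the optimization techniques named above, is the core of many prescriptive recommendations ("produce this much of each product"). A minimal sketch with scipy, using an invented two-product mix and capacity constraints:

```python
# Hedged sketch: maximize profit 3*x1 + 5*x2 subject to capacity limits,
# via scipy's linprog (which minimizes, so we negate the objective).
# All coefficients are invented.
from scipy.optimize import linprog

c = [-3, -5]                # minimize -(3*x1 + 5*x2)
A_ub = [[1, 2],             # machine hours: x1 + 2*x2 <= 14
        [3, 1]]             # labor hours:   3*x1 + x2  <= 12
b_ub = [14, 12]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)      # optimal quantities and maximized profit
```

The prescriptive layer is exactly this step: the solver's `res.x` becomes the recommended action, with the constraint rows encoding the business rules.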

  15. COVID-19 Combined Data-set with Improved Measurement Errors

    • data.mendeley.com
    Updated May 13, 2020
    Cite
    Afshin Ashofteh (2020). COVID-19 Combined Data-set with Improved Measurement Errors [Dataset]. http://doi.org/10.17632/nw5m4hs3jr.3
    Explore at:
    Dataset updated
    May 13, 2020
    Authors
    Afshin Ashofteh
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Public health-related decision-making on policies aimed at controlling the COVID-19 pandemic outbreak depends on complex epidemiological models that are compelled to be robust and use all relevant available data. This data article provides a new combined worldwide COVID-19 dataset obtained from official data sources with improved systematic measurement errors and a dedicated dashboard for online data visualization and summary. The dataset adds new measures and attributes to the normal attributes of official data sources, such as daily mortality and fatality rates. We used comparative statistical analysis to evaluate the measurement errors of COVID-19 official data collections from the Chinese Center for Disease Control and Prevention (Chinese CDC), World Health Organization (WHO) and European Centre for Disease Prevention and Control (ECDC). The data is collected by using text mining techniques and reviewing PDF reports, metadata, and reference data. The combined dataset includes complete spatial data such as country area, international number of countries, Alpha-2 code, Alpha-3 code, latitude, longitude, and some additional attributes such as population. The improved dataset benefits from major corrections on the referenced data sets and official reports, such as adjustments in the reporting dates, which suffered from a one- to two-day lag; removing negative values; detecting unreasonable changes in historical data in new reports; and corrections on systematic measurement errors, which have been increasing as the pandemic outbreak spreads and more countries contribute data to the official repositories. Additionally, the root mean square error of attributes in the paired comparison of datasets was used to identify the main data problems. The data for China is presented separately and in more detail, and it has been extracted from the attached reports available on the main page of the CCDC website.
This dataset is a comprehensive and reliable source of worldwide COVID-19 data that can be used in epidemiological models assessing the magnitude and timeline for confirmed cases, long-term predictions of deaths or hospital utilization, the effects of quarantine, stay-at-home orders and other social distancing measures, the pandemic’s turning point or in economic and social impact analysis, helping to inform national and local authorities on how to implement an adaptive response approach to re-opening the economy, re-open schools, alleviate business and social distancing restrictions, design economic programs or allow sports events to resume.
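The paired-comparison idea in the description (root mean square error between two sources reporting the same attribute) reduces to a one-line computation; the daily case counts below are invented, not taken from the dataset.

```python
# Hedged sketch: RMSE between two sources' reports of the same attribute,
# flagging how far the series disagree. Numbers are invented.
import numpy as np

source_a = np.array([100, 150, 210, 340, 560])  # e.g. WHO daily cases
source_b = np.array([100, 148, 215, 335, 570])  # e.g. ECDC, with discrepancies

rmse = np.sqrt(np.mean((source_a - source_b) ** 2))
print(round(rmse, 2))
```

Computed per attribute and per country pair of sources, large RMSE values point at the reporting-lag and sign-error problems the dataset corrects.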

  16. Geochemical and mineralogical analyses of uranium ores from the Hack II and...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 26, 2025
    Cite
    U.S. Geological Survey (2025). Geochemical and mineralogical analyses of uranium ores from the Hack II and Pigeon deposits, solution-collapse breccia pipes, Grand Canyon region, Mohave and Coconino Counties, Arizona, USA [Dataset]. https://catalog.data.gov/dataset/geochemical-and-mineralogical-analyses-of-uranium-ores-from-the-hack-ii-and-pigeon-deposit
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    United States Geological Surveyhttp://www.usgs.gov/
    Area covered
    Coconino County, Mohave County, Arizona, United States, Grand Canyon
    Description

    This data release compiles the whole-rock geochemistry, X-ray diffraction, and electron microscopy analyses of samples collected from the uranium ore bodies of two mined-out deposits in the Grand Canyon region of northwestern Arizona - the Hack II and Pigeon deposits. The samples are grab samples of ore collected underground at each mine by the U.S. Geological Survey (USGS) during the mid-1980s, while each mine was active. The Hack II and Pigeon mines were remediated after their closure, so these data, analyses of samples in the archives of the USGS, are provided as surviving, although limited representations of these ore bodies. The Hack II and Pigeon deposits are similar to numerous other uranium deposits hosted by solution-collapse breccia pipes in the Grand Canyon region of northwest Arizona. The uranium-copper deposits occur within matrix-supported columns of breccia (a "breccia pipe") that formed by solution and collapse of sedimentary strata (Wenrich, 1985; Alpine, 2010). The regions north and south of the Grand Canyon host hundreds of solution-collapse breccia pipes (Van Gosen and others, 2016). Breccia refers to the broken rock that fills these features, and pipe refers to their vertical, pipe-like shape. The breccia pipes average about 300 ft (90 m) in diameter and can extend vertically for as much as 3,000 ft (900 m), from their base in the Mississippian Redwall Limestone to as stratigraphically high as the Triassic Chinle Formation. The breccia fragments are blocks and pieces of rock units that have fallen downward, now resting below their original stratigraphic level. In contrast to many other types of breccia pipes, there are no igneous rocks associated with the northwestern Arizona breccia pipes, nor have igneous processes contributed to their formation. 
Many of these breccia pipes contain concentrated deposits of uranium, copper, arsenic, barium, cobalt, lead, molybdenum, nickel, antimony, strontium, vanadium, and zinc minerals (Wenrich, 1985), which is reflected in this data set. The Hack II and Pigeon mines were two of thirteen breccia pipe deposits in the Grand Canyon region mined for uranium from the 1950s to present (2020) (Alpine, 2010; Van Gosen and others, 2016). While hundreds of breccia pipes in the region have been identified (Van Gosen and others, 2016), six decades of exploration across the region has found that most are not mineralized or substantially mineralized, and only a small percentage of the breccia pipes contain economic uranium deposits. The most recent mining operation in a breccia pipe deposit in the region is the Canyon mine, located about 6.1 miles (10 km) south-southeast of Tusayan, Arizona. In 2018, Energy Fuels completed a mine shaft and other mining facilities at the Canyon deposit, a copper- uranium-bearing breccia pipe (Van Gosen and others, 2020); however, this mining operation is currently (2020) inactive, awaiting higher market prices for uranium oxide. The Hack II deposit is one of four breccia pipes mined in Hack Canyon near its intersection with Robinson Canyon (Chenoweth, 1988; Otton and Van Gosen, 2010), approximately 30 miles (48 km) southwest of Fredonia and 9 miles (14.5 km) north-northwest of Kanab Creek. Hack Canyon incised and exposed part of the "Hacks" (or "Hack Canyon") breccia pipe, which was discovered and mined as a surface mine in the early 1900s for copper and silver. The original Hacks mine and adjacent Hack I deposit were later mined underground for uranium from 1950 to 1954 (Chenoweth, 1988). The Hack II deposit was discovered in the late 1970s along Hack Canyon about 1 mile (1.6 km) upstream of the Hacks and Hack I mines. The Hack II mine is located at latitude 36.58219 north, longitude -112.81059 west (datum of WGS84). 
    Mining began at Hack II in 1981 and ended in May 1987. The USGS collected the ore samples reported in this data release in 1984 from underground exposures in the Hack II mine while it was in operation. Reclamation of the four mines in the area (Hacks, Hack I, Hack II, and Hack III) was planned and completed from March 1987 to April 1988, including infilling of the shafts and adits. Total production from the Hack II mine was reported as 7.00 million pounds (3.2 million kilograms) of uranium oxide from ore that had an average grade of 0.70 percent uranium oxide. This represents the largest uranium production from a breccia pipe deposit in the Grand Canyon region thus far (Otton and Van Gosen, 2010). The Pigeon mine was discovered along Kanab Creek in 1980. The site was prepared and developed from 1982 to 1984, and mining began in December 1984. The pipe was mined out in late 1989 and reclamation began shortly thereafter. The former mine site is located at latitude 36.7239 north, longitude -112.5275 west (datum of WGS84). The Pigeon mine reportedly produced 5.7 million pounds (2.6 million kilograms) of ore that had an average grade of 0.65 percent uranium oxide. The five Pigeon deposit samples reported in this data release were collected by the USGS from underground exposures in the Pigeon mine in 1985, while the mine was in operation. Fourteen samples of Hack II ore and two samples of Pigeon ore were analyzed for major and trace elements by a laboratory contracted by the USGS. Concentrations for 59 elements were determined by Inductively Coupled Plasma-Optical Emission Spectrometry (ICP-OES). Additionally, carbonate carbon (inorganic carbon), total carbon, total sulfur, iron oxide, and mercury concentrations were determined using other element-specific analytical techniques. These 16 samples and an additional four Hack II ore samples and three Pigeon ore samples were analyzed by X-ray diffraction (XRD) to determine their mineralogy.
    Polished thin sections cut from six of the Hack II ore samples were examined using a scanning electron microscope equipped with an energy dispersive spectrometer (SEM-EDS) to identify the ore minerals and observe their relationships at high magnification. The EDS vendor's auto identification algorithm was used for peak assignments; the user did not attempt to verify every peak identification. The spectra for each EDS measurement are provided in separate documents in Portable Document Format (PDF), one document for each of the six samples that were examined by SEM-EDS. The interpreted mineral phase(s), which is based solely on the judgement of the user, is given below each spectrum.

    References cited above:
    • Alpine, A.E., ed., 2010, Hydrological, geological, and biological site characterization of breccia pipe uranium deposits in northern Arizona: U.S. Geological Survey Scientific Investigations Report 2010-5025, 353 p., 1 plate, scale 1:375,000. Available at http://pubs.usgs.gov/sir/2010/5025/
    • Chenoweth, W.L., 1988, The production history and geology of the Hacks, Ridenour, Riverview and Chapel breccia pipes, northwestern Arizona: U.S. Geological Survey Open-File Report 88-648, 60 p. Available at https://pubs.usgs.gov/of/1988/0648/report.pdf
    • Otton, J.K., and Van Gosen, B.S., 2010, Uranium resource availability in breccia pipes in northern Arizona, in Alpine, A.E., ed., Hydrological, geological, and biological site characterization of breccia pipe uranium deposits in northern Arizona: U.S. Geological Survey Scientific Investigations Report 2010-5025, p. 23-41. Available at http://pubs.usgs.gov/sir/2010/5025/
    • Van Gosen, B.S., Johnson, M.R., and Goldman, M.A., 2016, Three GIS datasets defining areas permissive for the occurrence of uranium-bearing, solution-collapse breccia pipes in northern Arizona and southeast Utah: U.S. Geological Survey data release, https://doi.org/10.5066/F76D5R3Z
    • Van Gosen, B.S., Benzel, W.M., and Campbell, K.M., 2020, Geochemical and X-ray diffraction analyses of drill core samples from the Canyon uranium-copper deposit, a solution-collapse breccia pipe, Grand Canyon area, Coconino County, Arizona: U.S. Geological Survey data release, https://doi.org/10.5066/P9UUILQI
    • Wenrich, K.J., 1985, Mineralization of breccia pipes in northern Arizona: Economic Geology, v. 80, no. 6, p. 1722-1735, https://doi.org/10.2113/gsecongeo.80.6.1722

  17. Life Sciences Analytics Market Analysis, Size, and Forecast 2025-2029: North...

    • technavio.com
    pdf
    Updated May 22, 2025
    Cite
    Technavio (2025). Life Sciences Analytics Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, Italy, and UK), APAC (China, India, Japan, and South Korea), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/life-sciences-analytics-market-industry-analysis
    Explore at:
    pdfAvailable download formats
    Dataset updated
    May 22, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    United States
    Description


    Life Sciences Analytics Market Size 2025-2029

    The life sciences analytics market size is valued to increase USD 26.37 billion, at a CAGR of 20.6% from 2024 to 2029. Growing integration of big data with healthcare analytics will drive the life sciences analytics market.

    Major Market Trends & Insights

    Asia dominated the market and is expected to account for 37% of market growth during the forecast period.
    By Deployment - Cloud segment was valued at USD 7.18 billion in 2023
    By End-user - Pharmaceutical companies segment accounted for the largest market revenue share in 2023
    

    Market Size & Forecast

    Market Opportunities: USD 277.25 million
    Market Future Opportunities: USD 26,365.00 million
    CAGR from 2024 to 2029: 20.6%
    

    Market Summary

    The market represents a dynamic and continually evolving landscape, driven by the increasing integration of big data with healthcare analytics. This market encompasses core technologies such as machine learning, artificial intelligence, and data mining, which are revolutionizing the way life sciences companies analyze and interpret complex data. Applications of life sciences analytics span various sectors, including drug discovery, clinical research, and population health management. Despite its transformative potential, the high implementation cost of life sciences analytics poses a significant challenge for market growth. However, the growing emphasis on value-based medicine and the increasing regulatory focus on data-driven decision-making present substantial opportunities for market expansion. For instance, according to a recent report, the global market for life sciences analytics is projected to account for over 30% of the total healthcare analytics market by 2025. This underscores the immense potential of this market and the ongoing efforts to harness its power to drive innovation and improve patient outcomes.

    What will be the Size of the Life Sciences Analytics Market during the forecast period?

    Get Key Insights on Market Forecast (PDF) Request Free Sample

    How is the Life Sciences Analytics Market Segmented ?

    The life sciences analytics industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
    Deployment: Cloud, On-premises
    End-user: Pharmaceutical companies, Biotechnology companies, Others
    Type: Descriptive analytics, Predictive analytics, Prescriptive analytics, Diagnostic analytics
    Geography: North America (US, Canada), Europe (France, Germany, Italy, UK), APAC (China, India, Japan, South Korea), Rest of World (ROW)

    By Deployment Insights

    The cloud segment is estimated to witness significant growth during the forecast period.

    In the dynamic and evolving landscape of life sciences analytics, cloud-based solutions have emerged as a game-changer, revolutionizing data management and analysis in the healthcare sector. According to recent reports, the number of biotech and pharmaceutical companies adopting cloud analytics has increased by 18%, enabling real-world evidence synthesis and disease pathway mapping for improved patient care. Furthermore, the integration of genomic data, proteomic data processing, and systems biology approaches has led to a 21% rise in target identification validation and clinical outcome assessment. Data security measures are paramount in this industry, with regulatory compliance software ensuring pharmacovigilance signal detection and biostatistical modeling to maintain the highest standards. Advanced analytics techniques, such as machine learning algorithms and predictive modeling, have driven a 25% surge in drug development informatics and precision medicine insights. Toxicogenomics applications and network biology analysis have also gained significant traction, contributing to a 27% increase in drug metabolism prediction and AI-driven drug discovery. The integration of high-throughput screening data, patient stratification methods, and translational bioinformatics has further enhanced the value of cloud-based life sciences analytics. Pharmacokinetics modeling and biomarker discovery platforms have seen a 29% growth in usage, providing valuable insights for drug repurposing identification and regulatory compliance. The ongoing unfolding of these trends underscores the importance of cloud computing infrastructure, next-generation sequencing, and omics data integration in the life sciences sector.

    Request Free Sample

    The Cloud segment was valued at USD 7.18 billion in 2019 and showed a gradual increase during the forecast period.


    Regional Analysis

    Asia is estimated to contribute 37% to the growth of the global market during the forecast period. Technavio’s analysts have elaborately explained the regional trends and drivers that shape the market during the forecast period.

  18. Forest Fires Data Set

    • kaggle.com
    zip
    Updated Sep 4, 2017
    Cite
    Ahiale Darlington (2017). Forest Fires Data Set [Dataset]. https://www.kaggle.com/elikplim/forest-fires-data-set
    Explore at:
    zip(7268 bytes)Available download formats
    Dataset updated
    Sep 4, 2017
    Authors
    Ahiale Darlington
    Description

    Source: https://archive.ics.uci.edu/ml/datasets/forest+fires

    Citation Request: This dataset is publicly available for research. The details are described in [Cortez and Morais, 2007]. Please include this citation if you plan to use this database:

    P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data. In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence, Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December, Guimaraes, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9. Available at: http://www.dsi.uminho.pt/~pcortez/fires.pdf

    1. Title: Forest Fires

    2. Sources: Created by Paulo Cortez and Aníbal Morais (Univ. Minho) @ 2007

    3. Past Usage:

      P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data. In Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December, 2007. (http://www.dsi.uminho.pt/~pcortez/fires.pdf)

      In the above reference, the output "area" was first transformed with a ln(x+1) function. Then, several data mining methods were applied. After fitting the models, the outputs were post-processed with the inverse of the ln(x+1) transform. Four different input setups were used. The experiments were conducted using 10-fold cross-validation x 30 runs. Two regression metrics were measured: MAD and RMSE. A Gaussian support vector machine (SVM) fed with only 4 direct weather conditions (temp, RH, wind and rain) obtained the best MAD value: 12.71 +- 0.01 (mean and 95% confidence interval using a Student's t-distribution). The best RMSE was attained by the naive mean predictor. An analysis of the regression error curve (REC) shows that the SVM model predicts more examples within a lower admitted error. In effect, the SVM model is better at predicting small fires, which are the majority.
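      The ln(x+1) pre-/post-processing and the MAD metric described above can be sketched in a few lines of pure Python. This is only an illustration with toy values, not the actual dataset or the paper's SVM:

```python
import math

def mad(y_true, y_pred):
    """Mean absolute deviation: average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# The output "area" is modeled as z = ln(area + 1); model predictions are
# mapped back to hectares with the inverse transform, exp(z) - 1.
areas = [0.0, 0.9, 12.5, 1090.84]        # burned area in ha, skewed toward 0
z = [math.log(a + 1.0) for a in areas]   # forward transform
back = [math.exp(v) - 1.0 for v in z]    # inverse transform round-trips

print(round(mad(areas, back), 6))        # the transform itself is lossless
```

      The point of the transform is that the heavy right skew of "area" is compressed before fitting, so the regression is not dominated by the rare very large fires.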

    4. Relevant Information:

      This is a very difficult regression task. It can be used to test regression methods. It could also be used to test outlier detection methods, since it is not clear how many outliers there are. Yet, the number of examples of fires with a large burned area is very small.

    5. Number of Instances: 517

    6. Number of Attributes: 12 + output attribute

      Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.

    7. Attribute information:

      For more information, read [Cortez and Morais, 2007].

      1. X - x-axis spatial coordinate within the Montesinho park map: 1 to 9
      2. Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9
      3. month - month of the year: "jan" to "dec"
      4. day - day of the week: "mon" to "sun"
      5. FFMC - FFMC index from the FWI system: 18.7 to 96.20
      6. DMC - DMC index from the FWI system: 1.1 to 291.3
      7. DC - DC index from the FWI system: 7.9 to 860.6
      8. ISI - ISI index from the FWI system: 0.0 to 56.10
      9. temp - temperature in Celsius degrees: 2.2 to 33.30
      10. RH - relative humidity in %: 15.0 to 100
      11. wind - wind speed in km/h: 0.40 to 9.40
      12. rain - outside rain in mm/m2 : 0.0 to 6.4
      13. area - the burned area of the forest (in ha): 0.00 to 1090.84 (this output variable is very skewed towards 0.0, thus it may make sense to model with the logarithm transform).
    8. Missing Attribute Values: None
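    As a quick illustration, the attribute list above maps directly onto a typed record; the sample values below are invented for demonstration, not drawn from a verified row of the file:

```python
from dataclasses import dataclass

# One row of the forest-fires table, typed per the attribute list above.
@dataclass
class FireRecord:
    X: int        # map x-coordinate, 1 to 9
    Y: int        # map y-coordinate, 2 to 9
    month: str    # "jan" to "dec"
    day: str      # "mon" to "sun"
    FFMC: float   # FWI fine fuel moisture code
    DMC: float    # FWI duff moisture code
    DC: float     # FWI drought code
    ISI: float    # FWI initial spread index
    temp: float   # temperature, deg C
    RH: float     # relative humidity, %
    wind: float   # wind speed, km/h
    rain: float   # outside rain, mm/m2
    area: float   # output: burned area, ha (heavily skewed toward 0)

r = FireRecord(7, 5, "mar", "fri", 86.2, 26.2, 94.3, 5.1, 8.2, 51, 6.7, 0.0, 0.0)
print(r.area)  # 0.0
```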

  19. Adult Census Income

    • kaggle.com
    zip
    Updated Oct 7, 2016
    + more versions
    Cite
    UCI Machine Learning (2016). Adult Census Income [Dataset]. https://www.kaggle.com/datasets/uciml/adulT-census-income/code
    Explore at:
    zip(460936 bytes)Available download formats
    Dataset updated
    Oct 7, 2016
    Dataset authored and provided by
    UCI Machine Learning
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This data was extracted from the 1994 Census bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over $50K a year.
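    A minimal sketch of the extraction filter quoted above, treating the Census field names as hypothetical dictionary keys (the sample records are invented):

```python
# Hypothetical records keyed by the field names in the quoted filter:
# ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))
records = [
    {"AAGE": 39, "AGI": 2174, "AFNLWGT": 77516, "HRSWK": 40},
    {"AAGE": 15, "AGI": 500,  "AFNLWGT": 12345, "HRSWK": 20},  # fails AAGE > 16
    {"AAGE": 50, "AGI": 0,    "AFNLWGT": 83311, "HRSWK": 13},  # fails AGI > 100
]

def keep(r):
    """Apply the four conditions used to extract 'reasonably clean' records."""
    return r["AAGE"] > 16 and r["AGI"] > 100 and r["AFNLWGT"] > 1 and r["HRSWK"] > 0

clean = [r for r in records if keep(r)]
print(len(clean))  # 1
```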

    Description of fnlwgt (final weight)

    The weights on the Current Population Survey (CPS) files are controlled to independent estimates of the civilian noninstitutional population of the US. These are prepared monthly for us by Population Division here at the Census Bureau. We use 3 sets of controls. These are:

    1. A single cell estimate of the population 16+ for each state.

    2. Controls for Hispanic Origin by age and sex.

    3. Controls by Race, age and sex.

    We use all three sets of controls in our weighting program and "rake" through them 6 times so that by the end we come back to all the controls we used. The term estimate refers to population totals derived from CPS by creating "weighted tallies" of any specified socio-economic characteristics of the population. People with similar demographic characteristics should have similar weights. There is one important caveat to remember about this statement. That is that since the CPS sample is actually a collection of 51 state samples, each with its own probability of selection, the statement only applies within state.
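    The "raking" described above is iterative proportional fitting: weights are alternately rescaled so each margin matches its control total. A minimal sketch with a toy 2x2 weight table and made-up control totals (not the actual CPS controls):

```python
def rake(table, row_targets, col_targets, passes=6):
    """Iterative proportional fitting ('raking'): alternately scale rows and
    columns of a weight table until its margins match the control totals."""
    t = [row[:] for row in table]
    for _ in range(passes):
        for i, target in enumerate(row_targets):        # scale each row
            s = sum(t[i])
            t[i] = [v * target / s for v in t[i]]
        for j, target in enumerate(col_targets):        # scale each column
            s = sum(t[i][j] for i in range(len(t)))
            for i in range(len(t)):
                t[i][j] *= target / s
    return t

# Start from uniform weights; rake to row totals [30, 70] and column totals [40, 60].
w = rake([[1.0, 1.0], [1.0, 1.0]], row_targets=[30, 70], col_targets=[40, 60])
print([round(sum(row)) for row in w])  # row margins match: [30, 70]
```

    After enough passes both sets of margins agree with their controls simultaneously, which is why the CPS procedure can "come back to all the controls we used."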

    Relevant papers

    Ron Kohavi, "Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid", Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996. (PDF)

  20. South Fork Cherry River Water Quality

    • conservation-abra.hub.arcgis.com
    Updated Feb 22, 2023
    + more versions
    Cite
    Allegheny-Blue Ridge Alliance (2023). South Fork Cherry River Water Quality [Dataset]. https://conservation-abra.hub.arcgis.com/maps/3b366a6bc44e4392847b71ec82038173
    Explore at:
    Dataset updated
    Feb 22, 2023
    Dataset authored and provided by
    Allegheny-Blue Ridge Alliance
    Area covered
    Description

    Purpose: This feature layer describes water quality sampling data performed at several operating coal mines in the South Fork of Cherry watershed, West Virginia.
    Source & Data: Data was downloaded from WV Department of Environmental Protection's ApplicationXtender online database and EPA's ECHO online database between January and April, 2023. There are five data sets here: Surface Water Monitoring Sites, which contains basic information about monitoring sites (name, lat/long, etc.); NPDES Outlet Monitoring Sites, which contains similar information about outfall discharges surrounding the active mines; Biological Assessment Stations (BAS), which contain similar information for pre-project biological sampling; NOV Summary, which contains locations of Notices of Violation received by South Fork Coal Company from WV Department of Environmental Protection; and the Quarterly Monitoring Reports table, which contains the sampling data for the Surface Water Monitoring Sites and actually goes as far back as 2018 for some mines. Parameters of concern include iron, aluminum and selenium, among others. A relationship class between Surface Water Monitoring Sites and the Quarterly Monitoring Reports allows access to individual sample results.
    Processing: Notices of Violation were obtained from the WV DEP AppXtender database for Mining and Reclamation Article 3 (SMCRA) Permitting, and Mining and Reclamation NPDES Permitting. Violation data were entered into Excel and loaded into ArcGIS Pro as a CSV text file with Lat/Long coordinates for each Violation. The CSV file was converted to a point feature class. Water quality data were downloaded in PDF format from the WVDEP AppXtender website. Non-searchable PDFs were converted via Optical Character Recognition so that data could be copied. Sample results were copied and pasted manually into Notepad++, and several columns were re-ordered. Data was grouped by sample station and sorted chronologically. Sample data, contained in the associated table (SW_QM_Reports), were linked back to the monitoring station locations using the Station_ID text field in a geodatabase relationship class. Water monitoring station locations were taken from published Drainage Maps and from water quality reports. A CSV table was created with station Lat/Long locations and loaded into ArcGIS Pro; it was then converted to a point feature class. Stream Crossings and Road Construction Areas were digitized as polygon feature classes from project Drainage and Progress maps that were converted to TIFF image format from PDF and georeferenced. The ArcGIS Pro map, South Fork Cherry River Water Quality, was published as a service definition to ArcGIS Online.
    Symbology:
    NOV Summary - dark blue, solid point
    Lost Flats Surface Water Monitoring Sites: Data Available - medium blue point, black outline
    Lost Flats Surface Water Monitoring Sites: No Data Available - no-fill point, thick medium blue outline
    Lost Flats NPDES Outlet Monitoring Sites - orange point, black outline
    Blue Knob Surface Water Monitoring Sites: Data Available - medium blue point, black outline
    Blue Knob Surface Water Monitoring Sites: No Data Available - no-fill point, thick medium blue outline
    Blue Knob NPDES Outlet Monitoring Sites - orange point, black outline
    Blue Knob Biological Assessment Stations: Data Available - medium green point, black outline
    Blue Knob Biological Assessment Stations: No Data Available - no-fill point, thick medium green outline
    Rocky Run Surface Water Monitoring Sites: Data Available - medium blue point, black outline
    Rocky Run Surface Water Monitoring Sites: No Data Available - no-fill point, thick medium blue outline
    Rocky Run NPDES Outlet Monitoring Sites - orange point, black outline
    Rocky Run Biological Assessment Stations: Data Available - medium green point, black outline
    Rocky Run Biological Assessment Stations: No Data Available - no-fill point, thick medium green outline
    Rocky Run Stream Crossings - turquoise blue polygon with red outline
    Rocky Run Haul Road Construction Areas - dark red (40% transparent) polygon with black outline
    Haul Road No 2 Surface Water Monitoring Sites: Data Available - medium blue point, black outline
    Haul Road No 2 Surface Water Monitoring Sites: No Data Available - no-fill point, thick medium blue outline
    Haul Road No 2 NPDES Outlet Monitoring Sites - orange point, black outline
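    The CSV-to-point-feature step described above was done in ArcGIS Pro, but the same idea can be approximated outside ArcGIS. A minimal sketch that builds GeoJSON point features from a station CSV; the station IDs and coordinates here are invented for illustration:

```python
import csv
import io
import json

# Hypothetical station table mirroring the workflow described above:
# a CSV with a Station_ID and Lat/Long columns for each monitoring site.
raw = """Station_ID,Lat,Long
SW-01,38.1234,-80.5678
SW-02,38.1300,-80.5601
"""

features = []
for row in csv.DictReader(io.StringIO(raw)):
    features.append({
        "type": "Feature",
        # GeoJSON orders coordinates as [longitude, latitude]
        "geometry": {"type": "Point",
                     "coordinates": [float(row["Long"]), float(row["Lat"])]},
        "properties": {"Station_ID": row["Station_ID"]},
    })

layer = {"type": "FeatureCollection", "features": features}
print(len(layer["features"]))  # 2
print(json.dumps(layer["features"][0]["properties"]))
```

    The resulting GeoJSON could then be loaded into most GIS tools as a point layer, much like the point feature class produced in ArcGIS Pro.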
