Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Due to the increasing use of technology-enhanced educational assessment, data mining methods have been explored to analyse process data in log files from such assessments. However, most studies were limited to one data mining technique under one specific scenario. The current study demonstrates the usage of four frequently used supervised techniques, namely Classification and Regression Trees (CART), gradient boosting, random forest, and support vector machine (SVM), and two unsupervised methods, Self-organizing Map (SOM) and k-means, fitted to a single assessment dataset. The USA sample (N = 426) from the 2012 Program for International Student Assessment (PISA) responding to problem-solving items is extracted to demonstrate the methods. After concrete feature generation and feature selection, classifier development procedures are implemented using the illustrated techniques. Results show satisfactory classification accuracy for all the techniques. Suggestions for the selection of classifiers are presented based on the research questions, the interpretability and the simplicity of the classifiers. Interpretations for the results from both supervised and unsupervised learning methods are provided.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data supporting the master's thesis "Monitoring von Open Data Praktiken - Herausforderungen beim Auffinden von Datenpublikationen am Beispiel der Publikationen von Forschenden der TU Dresden" (Monitoring open data practices - challenges in finding data publications using the example of publications by researchers at TU Dresden) - Katharina Zinke, Institut für Bibliotheks- und Informationswissenschaften, Humboldt-Universität Berlin, 2023
This ZIP file contains the data the thesis is based on, interim exports of the results, and the R script with all pre-processing, data merging and analyses carried out. The documentation of the additional, explorative analysis is also included. The actual PDFs and text files of the scientific papers used are not included, as they are published open access.
The folder structure is shown below with the file names and a brief description of the contents of each file. For details concerning the analysis approach, please refer to the master's thesis (publication following soon).
## Data sources
Folder 01_SourceData/
- PLOS-Dataset_v2_Mar23.csv (PLOS-OSI dataset)
- ScopusSearch_ExportResults.csv (export of Scopus search results from Scopus)
- ScopusSearch_ExportResults.ris (export of Scopus search results from Scopus)
- Zotero_Export_ScopusSearch.csv (export of the file names and DOIs of the Scopus search results from Zotero)
## Automatic classification
Folder 02_AutomaticClassification/
- (NOT INCLUDED) PDFs folder (Folder for PDFs of all publications identified by the Scopus search, named AuthorLastName_Year_PublicationTitle_Title)
- (NOT INCLUDED) PDFs_to_text folder (Folder for all texts extracted from the PDFs by ODDPub, named AuthorLastName_Year_PublicationTitle_Title)
- PLOS_ScopusSearch_matched.csv (merge of the Scopus search results with the PLOS_OSI dataset for the files contained in both)
- oddpub_results_wDOIs.csv (results file of the ODDPub classification)
- PLOS_ODDPub.csv (merge of the results file of the ODDPub classification with the PLOS-OSI dataset for the publications contained in both)
## Manual coding
Folder 03_ManualCheck/
- CodeSheet_ManualCheck.txt (Code sheet with descriptions of the variables for manual coding)
- ManualCheck_2023-06-08.csv (Manual coding results file)
- PLOS_ODDPub_Manual.csv (Merge of the results file of the ODDPub and PLOS-OSI classification with the results file of the manual coding)
## Explorative analysis for the discoverability of open data
Folder 04_FurtherAnalyses/
- Proof_of_of_Concept_Open_Data_Monitoring.pdf (Description of the explorative analysis of the discoverability of open data publications using the example of a researcher, in German)
## R-Script
- Analyses_MA_OpenDataMonitoring.R (R script for preparing, merging and analyzing the data and for performing the ODDPub algorithm)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Biological data analysis is the key to new discoveries in disease biology and drug discovery. The rapid proliferation of high-throughput ‘omics’ data has created a need for tools and platforms that allow researchers to combine and analyse different types of biological data and obtain biologically relevant knowledge. We had previously developed TargetMine, an integrative data analysis platform for target prioritisation and broad-based biological knowledge discovery. Here, we describe the newly modelled biological data types and the enhanced visual and analytical features of TargetMine. These enhancements include an enhanced coverage of gene–gene relations, small molecule metabolite to pathway mappings, an improved literature survey feature, and in silico prediction of gene functional associations such as protein–protein interactions and global gene co-expression. We also describe two usage examples on trans-omics data analysis and extraction of gene-disease associations using MeSH term descriptors. These examples demonstrate how the newer enhancements in TargetMine contribute to a more expansive coverage of the biological data space and can help interpret genotype–phenotype relations. TargetMine with its auxiliary toolkit is available at https://targetmine.mizuguchilab.org. The TargetMine source code is available at https://github.com/chenyian-nibio/targetmine-gradle.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset used to generate Figures 6 and 7.
Figure 6: Analysis of human breast cancer (Block A Section 1), from 10XGenomics Visium Spatial Gene Expression 1.0.0. demonstration samples. A) SIMLR partitioning in 9 clusters (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_BAS1_expr-var-ann_matrix_Stability_Plot.pdf). B) Cell stability score plot for SIMLR clusters in A (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_BAS1_expr-var-ann_matrix_Stability_Plot.pdf). C) SIMLR clusters location in the tissue section (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_BAS1_expr-var-ann_matrix_spatial_Stability.pdf). D) Hematoxylin and eosin image (figure6and7/HBC_BAS1/spatial/V1_Breast_Cancer_Block_A_Section_1_image.tif).
Figure 7: Information contents extracted by SCA analysis using a TF-based latent space. A) QCC (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_TF_SIMLRV2/9/HBC_BAS1_expr-var-ann_matrix_stabilityPlot.pdf). B) QCM (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_TF_SIMLRV2/9/HBC_BAS1_expr-var-ann_matrix_stabilityPlotUNBIAS.pdf). C) QCM/QCC plot, where only cluster 7 shows, for the majority of the cells, both QCC and QCM greater than 0.5 (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/HBC_TF_SIMLRV2/9/HBC_BAS1_expr-var-ann_matrix_StabilitySignificativityJittered.pdf). D) COMET analysis of the SCA latent space. SOX5 was detected as the top-ranked gene specific for cluster 7, using the latent space frequency table as input for COMET (figure6and7/HBC_BAS1/Results_simlr/raw-counts/HBC_BAS1_expr-var-ann_matrix/9/outputvis/cluster_7_singleton/rank_1.png). The input counts table for SCA analysis consists of raw counts.
Data Science Platform Market Size 2025-2029
The data science platform market size is forecast to increase by USD 763.9 million, at a CAGR of 40.2% between 2024 and 2029.
The market is experiencing significant growth, driven by the increasing integration of Artificial Intelligence (AI) and Machine Learning (ML) technologies. This fusion enables organizations to derive deeper insights from their data, fueling business innovation and decision-making. Another trend shaping the market is the emergence of containerization and microservices in data science platforms. This approach offers enhanced flexibility, scalability, and efficiency, making it an attractive choice for businesses seeking to streamline their data science operations. However, the market also faces challenges. Data privacy and security remain critical concerns, with the increasing volume and complexity of data posing significant risks. Ensuring robust data security and privacy measures is essential for companies to maintain customer trust and comply with regulatory requirements. Additionally, managing the complexity of data science platforms and ensuring seamless integration with existing systems can be a daunting task, requiring significant investment in resources and expertise. Companies must navigate these challenges effectively to capitalize on the market's opportunities and stay competitive in the rapidly evolving data landscape.
What will be the Size of the Data Science Platform Market during the forecast period?
The market continues to evolve, driven by the increasing demand for advanced analytics and artificial intelligence solutions across various sectors. Real-time analytics and classification models are at the forefront of this evolution, with API integrations enabling seamless implementation. Deep learning and model deployment are crucial components, powering applications such as fraud detection and customer segmentation. Data science platforms provide essential tools for data cleaning and data transformation, ensuring data integrity for big data analytics. Feature engineering and data visualization facilitate model training and evaluation, while data security and data governance ensure data privacy and compliance. Machine learning algorithms, including regression models and clustering models, are integral to predictive modeling and anomaly detection.
Statistical analysis and time series analysis provide valuable insights, while ETL processes streamline data integration. Cloud computing enables scalability and cost savings, while risk management and algorithm selection optimize model performance. Natural language processing and sentiment analysis offer new opportunities for data storytelling and computer vision. Supply chain optimization and recommendation engines are among the latest applications of data science platforms, demonstrating their versatility and continuous value proposition. Data mining and data warehousing provide the foundation for these advanced analytics capabilities.
How is this Data Science Platform Industry segmented?
The data science platform industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Deployment: On-premises, Cloud
Component: Platform, Services
End-user: BFSI, Retail and e-commerce, Manufacturing, Media and entertainment, Others
Sector: Large enterprises, SMEs
Application: Data Preparation, Data Visualization, Machine Learning, Predictive Analytics, Data Governance, Others
Geography: North America (US, Canada), Europe (France, Germany, UK), Middle East and Africa (UAE), APAC (China, India, Japan), South America (Brazil), Rest of World (ROW)
By Deployment Insights
The on-premises segment is estimated to witness significant growth during the forecast period. In this dynamic market, businesses increasingly adopt solutions to gain real-time insights from their data, enabling them to make informed decisions. Classification models and deep learning algorithms are integral parts of these platforms, providing capabilities for fraud detection, customer segmentation, and predictive modeling. API integrations facilitate seamless data exchange between systems, while data security measures ensure the protection of valuable business information. Big data analytics and feature engineering are essential for deriving meaningful insights from vast datasets. Data transformation, data mining, and statistical analysis are crucial processes in data preparation and discovery. Machine learning models, including regression and clustering, are employed for model training and evaluation. Time series analysis and natural language processing are valuable tools for understanding trends and customer sen
Citation Request: This dataset is publicly available for research. The details are described in [Cortez and Morais, 2007]. Please include this citation if you plan to use this database:
P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data. In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence, Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December, Guimaraes, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9. Available at: http://www.dsi.uminho.pt/~pcortez/fires.pdf
Title: Forest Fires
Sources Created by: Paulo Cortez and Aníbal Morais (Univ. Minho) @ 2007
Past Usage:
P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data. In Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December, 2007. (http://www.dsi.uminho.pt/~pcortez/fires.pdf)
In the above reference, the output "area" was first transformed with a ln(x+1) function. Then, several Data Mining methods were applied. After fitting the models, the outputs were post-processed with the inverse of the ln(x+1) transform. Four different input setups were used. The experiments were conducted using 30 runs of 10-fold cross-validation. Two regression metrics were measured: MAD and RMSE. A Gaussian support vector machine (SVM) fed with only 4 direct weather conditions (temp, RH, wind and rain) obtained the best MAD value: 12.71 ± 0.01 (mean and 95% confidence interval using a Student's t-distribution). The best RMSE was attained by the naive mean predictor. An analysis of the regression error curve (REC) shows that the SVM model predicts more examples within a lower admitted error. In effect, the SVM model predicts small fires, which are the majority, better.
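The transform, its inverse, and the two error metrics mentioned above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes that MAD denotes the mean absolute deviation, the burned-area values are hypothetical rather than taken from the dataset, and the constant predictor is only a stand-in for a fitted model.

```python
import math

def forward_transform(area):
    """ln(x + 1) transform applied to the burned-area target."""
    return math.log1p(area)

def inverse_transform(value):
    """Inverse of ln(x + 1), i.e. exp(y) - 1, used to post-process predictions."""
    return math.expm1(value)

def mad(actual, predicted):
    """Mean absolute deviation between actual and predicted values (assumed meaning of MAD)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical burned areas, for illustration only.
areas = [0.0, 0.5, 2.0, 10.0, 100.0]

# Work in the transformed space, then map predictions back to the original scale.
log_areas = [forward_transform(a) for a in areas]
mean_log = sum(log_areas) / len(log_areas)
predictions = [inverse_transform(mean_log)] * len(areas)

print("MAD :", mad(areas, predictions))
print("RMSE:", rmse(areas, predictions))
```

Because RMSE squares each error, the few large fires dominate it, while MAD weights all errors linearly; this is one reason the two metrics can favour different models on a skewed target.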
Relevant Information:
This is a very difficult regression task. It can be used to test regression methods. It could also be used to test outlier detection methods, since it is not clear how many outliers there are. However, the number of examples of fires with a large burned area is very small.
Number of Instances: 517
Number of Attributes: 12 + output attribute
Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.
Attribute information:
For more information, read [Cortez and Morais, 2007].
Missing Attribute Values: None
1. Framework overview. This paper proposes a pipeline to construct high-quality datasets for text mining in materials science. First, we use a traceable automatic literature acquisition scheme to ensure the traceability of textual data. Then, a data processing method driven by downstream tasks is applied to generate high-quality pre-annotated corpora conditioned on the characteristics of materials texts. On this basis, we define a general annotation scheme derived from the materials science tetrahedron to complete high-quality annotation. Finally, a conditional data augmentation model incorporating materials domain knowledge (cDA-DK) is constructed to augment the data quantity.
2. Dataset information. The experimental datasets used in this paper include the Matscholar dataset published by Weston et al. (DOI: 10.1021/acs.jcim.9b00470) and the NASICON entity recognition dataset constructed by ourselves. Herein, we mainly introduce the details of the NASICON entity recognition dataset.
2.1 Data collection and preprocessing. First, 55 materials science publications related to the NASICON system are collected through the Crystallographic Information File (CIF), which contains a wealth of structure-activity relationship information. Note that materials science literature is mostly stored in portable document format (PDF), with content arranged in columns and mixed with tables, images, and formulas, which significantly compromises the readability of the text sequence. To tackle this issue, we employ the text parser PDFMiner (a Python toolkit) to standardize, segment, and parse the original documents, thereby converting PDF literature into plain text. In this process, the entire textual information of the literature, encompassing title, author, abstract, keywords, institution, publisher, and publication year, is retained and stored as a unified TXT document.
Subsequently, we apply rules based on Python regular expressions to remove redundant information, such as garbled characters and line breaks caused by figures, tables, and formulas. This results in a cleaner text corpus, enhancing its readability and enabling more efficient data analysis. Note that special symbols may also appear as garbled characters, but we refrain from directly deleting them, as they may contain valuable information such as chemical units. Therefore, we converted all such symbols to a special token
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Wine Quality’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/danielpanizzo/wine-quality on 13 February 2022.
--- Dataset description provided by original source is as follows ---
Citation Request: This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. Please include this citation if you plan to use this database:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
Title: Wine Quality
Sources Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009
Past Usage:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. The relative importances of the input variables (as measured by a sensitivity analysis procedure) were also plotted.
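Two of the ideas above lend themselves to a short sketch: the quality label as a median of expert grades, and scoring regression output against a fixed error tolerance T. This is a minimal illustration under stated assumptions, not the authors' procedure: the function names are hypothetical, the grades and predictions are made up, and a prediction is simply counted as correct when its absolute error does not exceed T.

```python
from statistics import median

def sensory_score(grades):
    """Median of the expert grades; the source states at least 3 evaluations per wine."""
    if len(grades) < 3:
        raise ValueError("expected at least 3 expert evaluations")
    return median(grades)

def accuracy_within_tolerance(actual, predicted, tolerance):
    """Share of predictions whose absolute error is at most `tolerance` (assumed criterion)."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= tolerance)
    return hits / len(actual)

# Hypothetical expert grades for one wine, on the 0-10 scale.
print(sensory_score([5, 6, 7]))

# Hypothetical true quality scores and regression outputs.
actual = [5, 6, 7, 4]
predicted = [5.4, 6.9, 6.2, 4.1]
for tolerance in (0.5, 1.0):
    print(tolerance, accuracy_within_tolerance(actual, predicted, tolerance))
```

Widening the tolerance trades precision for coverage: a larger T accepts predictions that land on a neighbouring quality grade, which matters because the classes are ordered.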
Relevant Information:
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant, so it could be interesting to test feature selection methods.
Number of Instances: red wine - 1599; white wine - 4898.
Number of Attributes: 11 + output attribute
Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.
Attribute information:
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Missing Attribute Values: None
Description of attributes:
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add 'freshness' and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it's rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output variable (based on sensory data): 12 - quality (score between 0 and 10)
--- Original source retains full ownership of the source dataset ---
This geodatabase reflects the U.S. Geological Survey’s (USGS) ongoing commitment to its mission of understanding the nature and distribution of global mineral commodity supply chains by updating and publishing the georeferenced locations of mineral commodity production and processing facilities, mineral exploration and development sites, and mineral commodity exporting ports in Africa. The geodatabase and geospatial data layers serve to create a new geographic information product in the form of a geospatial portable document format (PDF) map. The geodatabase contains data layers from USGS, foreign governmental, and open-source sources as follows: (1) mineral production and processing facilities, (2) mineral exploration and development sites, (3) mineral occurrence sites and deposits, (4) undiscovered mineral resource tracts for Gabon and Mauritania, (5) undiscovered mineral resource tracts for potash, platinum-group elements, and copper, (6) coal occurrence areas, (7) electric power generating facilities, (8) electric power transmission lines, (9) liquefied natural gas terminals, (10) oil and gas pipelines, (11) undiscovered, technically recoverable conventional and continuous hydrocarbon resources (by USGS geologic/petroleum province), (12) cumulative production, and recoverable conventional resources (by oil- and gas-producing nation), (13) major mineral exporting maritime ports, (14) railroads, (15) major roads, (16) major cities, (17) major lakes, (18) major river systems, (19) first-level administrative division (ADM1) boundaries for all countries in Africa, and (20) international boundaries for all countries in Africa.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
High-throughput sequencing has created an exponential increase in the amount of gene expression data, much of which is freely, publicly available in repositories such as NCBI's Gene Expression Omnibus (GEO). Querying this data for patterns such as similarity and distance, however, becomes increasingly challenging as the total amount of data increases. Furthermore, vectorization of the data is commonly required in Artificial Intelligence and Machine Learning (AI/ML) approaches. We present BioVDB, a vector database for storage and analysis of gene expression data, which enhances the potential for integrating biological studies with AI/ML tools. We used a previously developed approach called Automatic Label Extraction (ALE) to extract sample labels from metadata, including age, sex, and tissue/cell-line. BioVDB stores 438,562 samples from eight microarray GEO platforms. We show that it allows for efficient querying of data using similarity search, which can also be useful for identifying and inferring missing labels of samples, and for rapid similarity analysis.
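The similarity search and label inference described above can be illustrated with a small in-memory sketch. It is not BioVDB's implementation or API: cosine similarity is assumed as the similarity measure, the sample identifiers, vectors, and labels are hypothetical, and a plain dictionary stands in for the vector database.

```python
import math
from collections import Counter

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k(query, store, k):
    """Return the k stored samples most similar to the query, best first."""
    scored = [(cosine_similarity(query, vec), sample_id)
              for sample_id, (vec, _label) in store.items()]
    scored.sort(reverse=True)
    return [(sample_id, score) for score, sample_id in scored[:k]]

def infer_label(query, store, k):
    """Guess a missing label by majority vote among the k nearest stored samples."""
    neighbours = top_k(query, store, k)
    votes = Counter(store[sample_id][1] for sample_id, _score in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical expression vectors with tissue labels (toy values, three genes each).
store = {
    "sample_a": ([1.0, 0.0, 0.2], "liver"),
    "sample_b": ([0.9, 0.1, 0.3], "liver"),
    "sample_c": ([0.0, 1.0, 0.8], "brain"),
}

query = [1.0, 0.05, 0.25]  # a sample whose tissue label is missing
nearest = top_k(query, store, k=2)
inferred = infer_label(query, store, k=2)
print(nearest)
print(inferred)
```

A linear scan like this is quadratic across all-pairs queries; a dedicated vector database typically replaces it with an index so that search stays fast as the number of samples grows.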
This report and digital data release presents 286 new geochemical analyses on historic U.S. Bureau of Mines (USBM) samples, including 93 rock, 110 stream sediment, 52 soil, and 28 heavy mineral concentrate (pan concentrate) samples, as well as 3 samples of indeterminate type. These samples were originally collected as part of studies by the USBM in the Circle mining district, western Crazy Mountains, and Lime Peak area of the White Mountains, Circle Quadrangle, east-central Alaska. Historic USBM sample materials were retrieved by DGGS from the DGGS Geologic Materials Center (GMC), where the USBM samples were transferred as part of the federally funded Minerals Data and Information Rescue in Alaska (MDIRA) program in the late 1990s and early 2000s. The text and analytical data and tables associated with this report are being released in digital format as PDF files and .csv files. We provide analytical data, detection limits and, when available, the method documentation provided to us by the lab. We also provide the sample location in geographic coordinates, the sample material cited by the originating literature, a reference to the originating report, and the type of sample material that was obtained from the archive and sent to the lab.
Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Citation Request: This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. Please include this citation if you plan to use this database:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
Title: Wine Quality
Sources Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009
Past Usage:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. The relative importances of the input variables (as measured by a sensitivity analysis procedure) were also plotted.
Relevant Information:
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant, so it could be interesting to test feature selection methods.
Number of Instances: red wine - 1599; white wine - 4898.
Number of Attributes: 11 + output attribute
Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.
Attribute information:
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Missing Attribute Values: None
Description of attributes:
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add 'freshness' and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it's rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output variable (based on sensory data): 12 - quality (score between 0 and 10)
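The note about correlated attributes and unbalanced classes can be checked before any modelling. The following is a minimal sketch in plain Python, assuming the data have already been loaded into per-attribute lists; the function names, the 0.8 threshold, and the toy values are illustrative assumptions and are not taken from the dataset.

```python
from collections import Counter
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlated_pairs(columns, threshold=0.8):
    """List attribute pairs whose absolute correlation reaches the threshold."""
    names = sorted(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs

# Hypothetical toy values, NOT rows from the wine dataset.
toy = {
    "free sulfur dioxide": [10.0, 20.0, 30.0, 40.0],
    "total sulfur dioxide": [30.0, 55.0, 95.0, 120.0],
    "pH": [3.3, 3.1, 3.4, 3.2],
}
toy_quality = [5, 6, 5, 7]

if __name__ == "__main__":
    # Strongly correlated pairs are candidates for dropping one member.
    print(correlated_pairs(toy))
    # Class counts show how unbalanced the quality scores are.
    print(Counter(toy_quality))
```

A pair flagged here is only a candidate for removal; whether to drop an attribute still depends on the model and the task.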
In order to support science-based water resource management, a systematic effort was undertaken to characterize the nature and function of the hydrogeology in Jo Daviess County, Illinois. Jo Daviess County is a karst area. Karst is a geologically and hydrologically integrated or interconnected and self-organizing network of landforms and subsurface large-scale, secondary porosity created by a combination of fractured carbonate bedrock, the movement of water into and through the rock body as part of the hydrologic cycle, and physical and chemical weathering (Panno, S.V. et al, 2017). Springs, cover-collapse sinkholes, crevices, and caves are among the defining features of a karst terrain; each of these features is found in Jo Daviess County. Examples of these features have been located in the field and characterized by scientists from the Illinois State Geological and Water Surveys (Prairie Research Institute, University of Illinois at Urbana-Champaign). The lead-zinc ore deposits of the Driftless Area, which includes Jo Daviess County, were emplaced within the Galena Dolomite 270 million years ago (Brannon et al. 1992). Ore-forming and associated solutions (hot brines) migrated through carbonate rocks along existing fractures and were responsible for enlarging many of these fractures into crevices. The crevices and infilling sulfide ore deposits created by these solutions have the same distribution and orientation as those identified as crop lines by Panno, Luman and Kolata (2015) using remote sensing techniques. Consequently, maps of mines and mining activities reflect the fracture and crevice orientations and provide additional information about the physical characteristics of the bedrock and aquifers of the Driftless Area. 
This dataset was developed from the original IMDA documents by the Illinois State Geological Survey (ISGS) in fulfillment of a grant from the United States Geological Survey (USGS) National Geological and Geophysical Data Preservation Program (NGGDPP). The IMDA is a detailed set of paper records for the Lead-Zinc District in Jo Daviess County in northwestern Illinois for the period 1949-1970. The IMDA consists of large-scale (1"=200') 36"x30" sheet section maps depicting mining digs and borehole locations, and a set of 8-1/2"x11" datasheets containing borehole logs and mineral analyses (assays and/or visual estimates at various depths). The following document is directly related to this dataset: Klass, R. and Z. Lasemi. Preservation of Geologic Data and Collections in Illinois: Compilation, Documentation and Planning. Technical Report, Illinois State Geological Survey, Prairie Research Institute, University of Illinois Urbana-Champaign, 2014-15. The following documents are pertinent references providing background information: Brannon, J.C., F.A. Podosek, and R.K. McLimans, 1992, Alleghenian age of the Upper Mississippi Valley zinc-lead deposits determined by Rb-Sr dating of sphalerite: Nature, v. 356, p. 509–511. Mansberger, F., T. Townsend, and C. Stratton. The People Must be Crazy: The Lead and Zinc Mining Resources of Jo Daviess County, Illinois. Fever River Research, Springfield, Illinois, 1997 (revised July, 2020). http://illinoisarchaeology.com/Lead%20Mine%20Report%20Revised.pdf
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 4. Sample input file containing co-ordinates of molecules separated by comma delimiter.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Bioinformatics sequence data mining can reveal hidden microbial symbionts that might normally be filtered and removed as contaminants. Data mining can be helpful to detect Wolbachia, a widespread bacterial endosymbiont in insects and filarial nematodes whose distribution in plant-parasitic nematodes (PPNs) remains underexplored. To date, Wolbachia has only been reported in a few PPNs, yet nematode-infecting Wolbachia may have been widespread in the evolutionary history of the phylum based on evidence of horizontal gene transfers, suggesting there may be undiscovered Wolbachia infections in PPNs. The goal of this study was to more broadly sample PPN Wolbachia strains in tylenchid nematodes to enable further comparative genomic analyses that may reveal Wolbachia’s role and identify targets for biocontrol. Published whole-genome shotgun assemblies and their raw sequence data from 33 Meloidogyne spp. assemblies, seven Globodera spp. assemblies, and seven Heterodera spp. assemblies were analyzed to look for Wolbachia. No Wolbachia was found in Meloidogyne spp. and Globodera spp., but among seven genome assemblies for Heterodera spp., an H. schachtii assembly from the Netherlands was found to have a large Wolbachia-like sequence that, when re-assembled from reads, formed a complete, circular genome. Detailed analyses comparing read coverage, GC content, pseudogenes, and phylogenomic patterns clearly demonstrated that the H. schachtii Wolbachia represented a novel strain (hereafter, denoted wHet). Phylogenomic tree construction with PhyloBayes showed wHet was most closely related to another PPN Wolbachia, wTex, while 16S rRNA gene analysis showed it clustered with other Heterodera Wolbachia assembled from sequence databases. Pseudogenes in wHet suggested relatedness to the PPN clade, as did the lack of significantly enriched GO terms compared to PPN Wolbachia strains. It remains unclear whether the lack of Wolbachia in other published H. schachtii isolates represents the true absence of the endosymbiont from some hosts.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Drilling is an important and integral procedural component in mineral exploration, used to (i) ascertain the subsurface configuration of the ore body, (ii) bring out a three-dimensional model of the ore deposit, (iii) establish the reserve of blocks for ultimate exploitation, and (iv) arrive at the grade of the deposit. However, drilling can be expensive, and because of this it has become the most critical phase of exploration. Drill costs vary depending on hole depth, rock types, core size, etc.
Diamond drilling is the most important phase of exploration, and it is a very expensive method for collecting subsurface data. More than any other exploration technique, it provides the exploration geologist with the most concrete and accurate material upon which an economic evaluation of a block or area can be made. It also provides a detailed, continuous look at the subsurface geology.
Each drill run records the depth at which it starts (from), the depth at which it ends (to), and its length. The rock samples extracted during a run are placed in core boxes.
The current dataset contains images and metadata on multiple drilling runs. The from, to and length values of each run are specified in the image and the associated CSV files.
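Because each run carries a from, a to and a length value, the metadata can be sanity-checked by confirming that length equals to minus from. The sketch below assumes a CSV layout with columns named `run_id`, `from`, `to` and `length`; these column names, the tolerance, and the sample rows are hypothetical and may not match the actual files in the dataset.

```python
import csv
import io

def inconsistent_runs(csv_text, tolerance=0.01):
    """Return the ids of runs whose length differs from (to - from)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    bad = []
    for row in reader:
        start = float(row["from"])    # depth where the run starts
        end = float(row["to"])        # depth where the run ends
        length = float(row["length"])
        if abs((end - start) - length) > tolerance:
            bad.append(row["run_id"])
    return bad

# Hypothetical sample rows; the real column names may differ.
SAMPLE = """run_id,from,to,length
R1,0.0,3.0,3.0
R2,3.0,6.1,3.0
R3,6.1,9.1,3.0
"""

if __name__ == "__main__":
    print(inconsistent_runs(SAMPLE))
```

Runs flagged this way are worth comparing against the values written on the core box image before being used for training or analysis.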
https://www.mines.gov.in/writereaddata/UploadFile/nmet22052017.pdf https://www.sciencedirect.com/topics/earth-and-planetary-sciences/core-sampling https://www.linkedin.com/pulse/drill-core-box-run-block-marking-up-ahmed-emam/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DataSet for use in RapidMiner from the master's thesis OPEN DATA MINING: AN ANALYSIS OF THE USE OF BOTS IN THE ELECTRONIC TRADING FLOORS. Dissertation presented to the Graduate Program in Management in Learning Organizations in compliance with the requirements for completion of the Professional Master in Management in Learning Organizations-UFPB. Brazil's federal government has sought to match procurement procedures to trends in information and communication technologies. The electronic reverse auction was one of the products of these efforts, being characterized as a modality that presented structural solutions to improve the efficiency of purchases of common goods and services and that represents more than 94% of the bids that occurred in the country. Despite the benefits of the electronic format, this environment brings challenges, such as dealing with the use of bots, which work by bidding automatically. While there is no law prohibiting their use, judgments of the Federal Court of Auditors state that it gives suppliers holding this technology a competitive advantage over other bidders, characterizing an affront to the principle of isonomy. Increasing transparency through open data policies, as part of the context of Open Government and digital transformation, is another step toward modernizing public procurement. This study aims to analyze the situation of bot use in electronic reverse auctions through open data mining. Electronic reverse auctions held at the Ministry of Agriculture, Livestock and Supply in 2017 were analyzed. Data were obtained by request through the Electronic Information System for Citizen Information (e-SIC), and knowledge discovery in databases was adopted as the methodology. The results indicate that bot use in electronic reverse auctions in 2017 represented an advantage of more than 5% in successful bid items, observed for only 1.99% of the sample bidders, who were flagged as suspected bot users.
The most relevant indicator for classifying bidders as suspects was the high number of bids issued in relation to the behavior observed in the sample. Results are expected to foster discussion of the effects of bot use on e-trading and to highlight the need for open data policy development, so that data mining becomes an increasingly effective means to assess anomalies and increase the integrity of the bids made through the Federal Government Procurement Portal. Original title (Portuguese): MINERAÇÃO DE DADOS ABERTOS: UMA ANÁLISE DO USO DE BOTS EM PREGÕES ELETRÔNICOS. https://sig-arq.ufpb.br/arquivos/2019071230f6981803056bc243c9a4b41/Dissertao_-_Hugo_Medeiros_Souto_-_Minerao_de_Dados_Abertos_2.pdf
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Generative pre-trained transformers (GPT) have recently demonstrated excellent performance in various natural language tasks. The development of ChatGPT and the recently released GPT-4 model has shown competence in solving complex and higher-order reasoning tasks without further training or fine-tuning. However, the applicability and strength of these models in classifying legal texts in the context of argument mining are yet to be realized and have not been tested thoroughly. In this study, we investigate the effectiveness of GPT-like models, specifically GPT-3.5 and GPT-4, for argument mining via prompting. We closely study the model's performance considering diverse prompt formulation and example selection in the prompt via semantic search using state-of-the-art embedding models from OpenAI and sentence transformers. We primarily concentrate on the argument component classification task on the legal corpus from the European Court of Human Rights. To address these models' inherent non-deterministic nature and make our result statistically sound, we conducted 5-fold cross-validation on the test set. Our experiments demonstrate, quite surprisingly, that relatively small domain-specific models outperform GPT 3.5 and GPT-4 in the F1-score for premise and conclusion classes, with 1.9% and 12% improvements, respectively. We hypothesize that the performance drop indirectly reflects the complexity of the structure in the dataset, which we verify through prompt and data analysis. Nevertheless, our results demonstrate a noteworthy variation in the performance of GPT models based on prompt formulation. We observe comparable performance between the two embedding models, with a slight improvement in the local model's ability for prompt selection. This suggests that local models are as semantically rich as the embeddings from the OpenAI model. 
Our results indicate that the structure of prompts significantly impacts the performance of GPT models and should be considered when designing them.
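The example selection via semantic search described above can be sketched as a nearest-neighbour lookup over embedding vectors. This is a minimal illustration under stated assumptions, not the authors' pipeline: the embeddings are hand-written toy vectors standing in for the output of an embedding model, and the prompt wording and function names are hypothetical.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_examples(query_vec, pool, k=2):
    """Return the k pool entries (text, label, vector) closest to the query."""
    ranked = sorted(pool, key=lambda ex: cosine(query_vec, ex[2]), reverse=True)
    return ranked[:k]

def build_prompt(query_text, examples):
    """Assemble a few-shot prompt; the wording is a hypothetical template."""
    lines = ["Label each sentence as premise or conclusion."]
    for text, label, _ in examples:
        lines.append(f"Sentence: {text}\nLabel: {label}")
    lines.append(f"Sentence: {query_text}\nLabel:")
    return "\n\n".join(lines)

# Toy two-dimensional vectors standing in for real embeddings.
POOL = [
    ("Example sentence A", "premise", [0.9, 0.1]),
    ("Example sentence B", "conclusion", [0.0, 1.0]),
    ("Example sentence C", "premise", [1.0, 0.5]),
]

if __name__ == "__main__":
    chosen = select_examples([1.0, 0.0], POOL, k=2)
    print(build_prompt("Sentence to classify", chosen))
```

Swapping the toy vectors for embeddings from different models is what allows the comparison between a hosted and a local embedding model mentioned in the abstract.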
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Consumption of nuts has been associated with a range of favorable health outcomes. Evidence is now emerging to suggest that walnuts may also play an important role in supporting the consumption of a healthy dietary pattern. However, few studies have explored how walnuts are eaten at different meal occasions. The aim of this study was to explore the food choices in relation to walnuts at meal occasions as reported by a sample of overweight and obese adult participants of weight loss clinical trials. Baseline usual food intake data were retrospectively pooled from four food-based clinical trials (n = 758). A nut-specific food composition database was applied to determine walnut consumption within the food intake data. The Apriori algorithm for association rule mining was used to identify food choices associated with walnuts at different meal occasions using a nested hierarchical food group classification system. The proportion of participants who were consuming walnuts was 14.5% (n = 110). The median walnut intake was 5.14 (interquartile range, 1.10–11.45) g/d. A total of 128 food items containing walnuts were identified for walnut consumers. The proportion of participants who reported consuming unsalted raw walnut was 80.5% (n = 103). There were no identified patterns to food choices in relation to walnuts at the breakfast, lunch, or dinner meal occasions. A total of 24 clusters of food choices related to walnuts were identified at others (meals). By applying a novel food composition database, the present study was able to map the precise combinations of foods associated with walnut intakes at mealtimes using data mining. This study offers insights into the role of walnuts in the food choices of overweight adults and may support guidance and dietary behavior change strategies.
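Association rule mining of the kind described above first finds itemsets that occur together often enough, then derives rules from them. The following is a minimal sketch of Apriori-style frequent itemset mining in plain Python; the meal records, food names, and the 0.4 support threshold are invented for illustration and do not come from the study data.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: every single item seen in the data.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate.
        level = {}
        for cand in current:
            support = sum(1 for t in transactions if cand <= t) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        k += 1
        # Join step: merge frequent itemsets into candidates of size k.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep candidates whose (k-1)-subsets are all frequent.
        current = {c for c in joined
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]

# Hypothetical meal records, not data from the clinical trials.
MEALS = [
    {"walnuts", "yoghurt", "fruit"},
    {"walnuts", "yoghurt"},
    {"walnuts", "salad"},
    {"bread", "cheese"},
    {"yoghurt", "fruit"},
]

if __name__ == "__main__":
    itemsets = apriori(MEALS, min_support=0.4)
    for items, support in sorted(itemsets.items(), key=lambda kv: -kv[1]):
        print(sorted(items), support)
    print(confidence(itemsets, {"walnuts"}, {"yoghurt"}))
```

The pruning step is what keeps the search tractable: an itemset can only be frequent if all of its subsets are frequent, so most larger combinations never need to be counted.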
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Gene expression in individual cells can now be measured for thousands of cells in a single experiment thanks to innovative sample-preparation and sequencing technologies. State-of-the-art computational pipelines for single-cell RNA-sequencing data, however, still employ computational methods that were developed for traditional bulk RNA-sequencing data, thus not accounting for the peculiarities of single-cell data, such as sparseness and zero-inflated counts. Here, we present a ready-to-use pipeline named gf-icf (gene frequency–inverse cell frequency) for normalization of raw counts, feature selection, and dimensionality reduction of scRNA-seq data for their visualization and subsequent analyses. Our work is based on a data transformation model named term frequency–inverse document frequency (TF-IDF), which has been extensively used in the field of text mining where extremely sparse and zero-inflated data are common. Using benchmark scRNA-seq datasets, we show that the gf-icf pipeline outperforms existing state-of-the-art methods in terms of improved visualization and ability to separate and distinguish different cell types.
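The analogy with TF-IDF can be made concrete by treating each cell as a document and each gene as a term. The sketch below is an illustration under assumptions, not the published gf-icf implementation: it uses a smoothed inverse frequency, which is one common TF-IDF convention, and omits any further normalization the actual pipeline may apply. The count matrix is a toy example.

```python
from math import log

def gf_icf(counts):
    """Weight a cells-by-genes count matrix in TF-IDF fashion.

    gf  = gene count divided by the total counts of the cell
    icf = log((n_cells + 1) / (cells_detecting_gene + 1)) + 1
          (smoothed variant; an assumption, other variants exist)
    """
    n_cells = len(counts)
    n_genes = len(counts[0])
    # Number of cells in which each gene has a non-zero count.
    detected = [sum(1 for cell in counts if cell[g] > 0) for g in range(n_genes)]
    icf = [log((n_cells + 1) / (detected[g] + 1)) + 1 for g in range(n_genes)]
    weighted = []
    for cell in counts:
        total = sum(cell)
        gf = [c / total if total else 0.0 for c in cell]
        weighted.append([gf[g] * icf[g] for g in range(n_genes)])
    return weighted

# Toy matrix: 3 cells (rows) by 3 genes (columns), invented for illustration.
TOY_COUNTS = [
    [4, 0, 0],
    [2, 2, 0],
    [1, 0, 1],
]

if __name__ == "__main__":
    for row in gf_icf(TOY_COUNTS):
        print([round(x, 3) for x in row])
```

Zero counts stay zero, so the sparsity of the data is preserved, while genes detected in only a few cells receive more weight than genes detected everywhere.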