100+ datasets found
  1. Data from: Feature Subset Selection

    • kaggle.com
    zip
    Updated Oct 12, 2017
    Cite
    aumas (2017). Feature Subset Selection [Dataset]. https://www.kaggle.com/aumashe/feature-subset-selection
    Explore at:
    Available download formats: zip (12608 bytes)
    Dataset updated
    Oct 12, 2017
    Authors
    aumas
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset is provided for practising feature selection.

    Content

    xtrain.txt contains the features of the training set: around 30 anonymised features whose meanings are not needed for the task. The goal is to find the best subset of n features. ytrain.txt contains the targets of the training set, with class labels {1, 2, 3, 4}.

    Acknowledgements

    Inspiration

    Suggestions for feature selection algorithms and methods are welcome.

  2. Machine learning algorithm validation with a limited sample size

    • plos.figshare.com
    text/x-python
    Updated May 30, 2023
    Cite
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
    Explore at:
    Available download formats: text/x-python
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high dimensional datasets, which commonly have a small number of samples because of the intrinsic high cost of data collection involving human participants. High dimensional data with a small number of samples are of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, they can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contribution to bias by data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on what validation method was used.
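    The validation pitfall this study quantifies is easy to reproduce. The sketch below (a toy illustration on pure-noise data, not the paper's code) contrasts feature selection performed on the pooled data with selection nested inside each cross-validation fold, using scikit-learn:

```python
# Minimal sketch (not the paper's code): feature selection on the pooled data
# leaks information and inflates cross-validated accuracy on pure-noise data,
# whereas selection nested inside each CV fold does not.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 1000))      # 40 samples, 1000 noise features
y = rng.integers(0, 2, size=40)      # random labels: true accuracy ~ 0.5

# Leaky: select the 10 "best" features using *all* samples, then cross-validate.
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
leaky_acc = cross_val_score(SVC(), X_leaky, y, cv=5).mean()

# Unbiased: selection is refit inside every training fold via a Pipeline.
nested = make_pipeline(SelectKBest(f_classif, k=10), SVC())
nested_acc = cross_val_score(nested, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky_acc:.2f}")   # typically well above chance
print(f"nested CV accuracy: {nested_acc:.2f}")  # typically close to 0.5
```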

  3. Features used in the toothrow-morph training set.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Ilya Plyusnin; Alistair R. Evans; Aleksis Karme; Aristides Gionis; Jukka Jernvall (2023). Features used in the toothrow-morph training set. [Dataset]. http://doi.org/10.1371/journal.pone.0001742.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ilya Plyusnin; Alistair R. Evans; Aleksis Karme; Aristides Gionis; Jukka Jernvall
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Features used in the toothrow-morph training set.

  4. Table_1_Using Minimal-Redundant and Maximal-Relevant Whole-Brain Functional...

    • frontiersin.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Yen-Ling Chen; Pei-Chi Tu; Tzu-Hsuan Huang; Ya-Mei Bai; Tung-Ping Su; Mu-Hong Chen; Yu-Te Wu (2023). Table_1_Using Minimal-Redundant and Maximal-Relevant Whole-Brain Functional Connectivity to Classify Bipolar Disorder.DOCX [Dataset]. http://doi.org/10.3389/fnins.2020.563368.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Yen-Ling Chen; Pei-Chi Tu; Tzu-Hsuan Huang; Ya-Mei Bai; Tung-Ping Su; Mu-Hong Chen; Yu-Te Wu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: A number of mental illnesses are often re-diagnosed as bipolar disorder (BD). Furthermore, the prefronto-limbic-striatal regions seem to be associated with the main dysconnectivity of BD. Functional connectivity is potentially an appropriate objective neurobiological marker that can assist with BD diagnosis.

    Methods: Healthy controls (HC; n = 173) and patients with BD who had been diagnosed by experienced physicians (n = 192) were separated into 10 folds, namely a ninefold training set and a onefold testing set. The classification involved feature selection on the training set using minimum redundancy/maximum relevance. A support vector machine was used for training. The classification was repeated 10 times until each fold had been used as the testing set.

    Results: The mean accuracy of the 10 testing sets was 76.25%, and the area under the curve was 0.840. The selected functional within-network/between-network connectivity was mainly in the subcortical/cerebellar regions and the frontoparietal network. Furthermore, similarity within the BD patients, calculated by the cosine distance between two functional connectivity matrices, was smaller than between groups before feature selection and greater than between groups after feature selection.

    Limitations: The major limitations were that all the BD patients were receiving medication and that no independent dataset was included.

    Conclusion: Our approach effectively separates a relatively large group of BD patients from HCs. This was done by selecting functional connectivity that was more similar within BD patients and also seems to be related to the neuropathological factors associated with BD.
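    As a rough illustration of the pipeline described above (feature selection nested inside 10-fold cross-validation, followed by an SVM), the sketch below uses scikit-learn; a univariate mutual-information filter stands in for the minimum-redundancy/maximum-relevance (mRMR) criterion, which scikit-learn does not ship, and all names are placeholders rather than the authors' code:

```python
# Hedged sketch of the study's classification loop (not the authors' code):
# 10-fold CV with feature selection fit only on each training fold, then an SVM.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def classify_connectivity(X, y, n_features=200, n_splits=10, seed=0):
    """X: subjects x functional-connectivity edges, y: 0 = HC, 1 = BD."""
    accs, aucs = [], []
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in cv.split(X, y):
        model = make_pipeline(
            StandardScaler(),
            SelectKBest(mutual_info_classif, k=n_features),  # stand-in for mRMR
            SVC(kernel="linear", probability=True),
        )
        model.fit(X[train_idx], y[train_idx])
        accs.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
        aucs.append(roc_auc_score(y[test_idx],
                                  model.predict_proba(X[test_idx])[:, 1]))
    return np.mean(accs), np.mean(aucs)
```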

  5. madelon

    • openml.org
    Updated May 22, 2015
    Cite
    (2015). madelon [Dataset]. https://www.openml.org/d/1485
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 22, 2015
    Description

    Author: Isabelle Guyon
    Source: UCI
    Please cite: Isabelle Guyon, Steve R. Gunn, Asa Ben-Hur, Gideon Dror, 2004. Result analysis of the NIPS 2003 feature selection challenge.

    Abstract:

    MADELON is an artificial dataset, which was part of the NIPS 2003 feature selection challenge. This is a two-class classification problem with continuous input variables. The difficulty is that the problem is multivariate and highly non-linear.

    Source:

    Isabelle Guyon Clopinet 955 Creston Road Berkeley, CA 90708 isabelle '@' clopinet.com

    Data Set Information:

    MADELON is an artificial dataset containing data points grouped in 32 clusters placed on the vertices of a five-dimensional hypercube and randomly labeled +1 or -1. The five dimensions constitute 5 informative features. 15 linear combinations of those features were added to form a set of 20 (redundant) informative features. Based on those 20 features one must separate the examples into the 2 classes (corresponding to the +-1 labels). A number of distractor features called 'probes', having no predictive power, were added. The order of the features and patterns was randomized.

    This dataset is one of five datasets used in the NIPS 2003 feature selection challenge. The original data was split into training, validation and test sets. Target values are provided only for the first two sets (not for the test set). This dataset version therefore contains all the examples from the training and validation partitions.

    There is no attribute information provided to avoid biasing the feature selection process.
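    A minimal sketch for working with this OpenML copy of the data (assuming network access and scikit-learn; the baseline model is illustrative, not part of the challenge protocol):

```python
# Minimal sketch: fetch this OpenML copy of MADELON by its dataset id
# (1485, per the link above) and score a simple baseline classifier.
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

madelon = fetch_openml(data_id=1485, as_frame=False)
X, y = madelon.data, madelon.target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```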

    Relevant Papers:

    The best challenge entrants wrote papers collected in the book: Isabelle Guyon, Steve Gunn, Masoud Nikravesh, Lotfi Zadeh (Eds.), Feature Extraction, Foundations and Applications. Studies in Fuzziness and Soft Computing. Physica-Verlag, Springer.

    Isabelle Guyon, et al, 2007. Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark. Pattern Recognition Letters 28 (2007) 1438–1444.

    Isabelle Guyon, et al. 2006. Feature selection with the CLOP package. Technical Report.

  6. Data from: How many specimens make a sufficient training set for automated...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated May 31, 2024
    Cite
    James M. Mulqueeney; Alex Searle-Barnes; Anieke Brombacher; Marisa Sweeney; Anjali Goswami; Thomas H. G. Ezard (2024). How many specimens make a sufficient training set for automated three dimensional feature extraction? [Dataset]. http://doi.org/10.5061/dryad.1rn8pk12f
    Explore at:
    Available download formats: zip
    Dataset updated
    May 31, 2024
    Dataset provided by
    University of Southampton
    Natural History Museum
    Authors
    James M. Mulqueeney; Alex Searle-Barnes; Anieke Brombacher; Marisa Sweeney; Anjali Goswami; Thomas H. G. Ezard
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Deep learning has emerged as a robust tool for automating feature extraction from 3D images, offering an efficient alternative to labour-intensive and potentially biased manual image segmentation methods. However, there has been limited exploration into optimal training set sizes, including assessing whether artificial expansion by data augmentation can achieve consistent results in less time and how consistent these benefits are across different types of traits. In this study, we manually segmented 50 planktonic foraminifera specimens from the genus Menardella to determine the minimum number of training images required to produce accurate volumetric and shape data from internal and external structures. The results reveal, unsurprisingly, that deep learning models improve with a larger number of training images, with eight specimens being required to achieve 95% accuracy. Furthermore, data augmentation can enhance network accuracy by up to 8.0%. Notably, predicting both volumetric and shape measurements for the internal structure poses a greater challenge compared to the external structure, due to low contrast differences between different materials and increased geometric complexity. These results provide novel insight into optimal training set sizes for precise image segmentation of diverse traits and highlight the potential of data augmentation for enhancing multivariate feature extraction from 3D images.

    Methods

    Data collection

    50 planktonic foraminifera, comprising 4 Menardella menardii, 17 Menardella limbata, 18 Menardella exilis, and 11 Menardella pertenuis specimens, were used in our analyses (electronic supplementary material, figures S1 and S2). The taxonomic classification of these species was established based on the analysis of morphological characteristics observed in their shells. In this context, all species are characterised by lenticular, low trochospiral tests with a prominent keel [13]. Discrimination among these species is achievable, as M. limbata can be distinguished from its ancestor, M. menardii, by having a greater number of chambers and a smaller umbilicus. Moreover, M. exilis and M. pertenuis can be discerned from M. limbata by their thinner, more polished tests and reduced trochospirality. Furthermore, M. pertenuis is identifiable by a thin plate extending over the umbilicus and a greater number of chambers in the final whorl compared to M. exilis [13]. The samples containing these individuals and species spanned 5.65 million years ago (Ma) to 2.85 Ma [14] and were collected from the Ceara Rise in the Equatorial Atlantic region at Ocean Drilling Program (ODP) Site 925, which comprised Hole 925B (4°12.248'N, 43°29.349'W), Hole 925C (4°12.256'N, 43°29.349'W), and Hole 925D (4°12.260'N, 43°29.363'W). See Curry et al. [15] for more details. This group was chosen to provide inter- and intraspecific species variation, and to provide contemporary data to test how morphological distinctiveness maps to taxonomic hypotheses [16]. The non-destructive imaging of both internal and external structures of the foraminifera was conducted at the µ-VIS X-ray Imaging Centre, University of Southampton, UK, using a Zeiss Xradia 510 Versa X-ray tomography scanner. Employing a rotational target system, the scanner operated at a voltage of 110 kV and a power of 10 W. Projections were reconstructed using Zeiss Xradia software, resulting in 16-bit greyscale .tiff stacks characterised by a voxel size of 1.75 μm and an average dimension of 992 x 1015 pixels for each 2D slice.

    Generation of training sets

    We extracted the external calcite and internal cavity spaces from the micro-CT scans of the 50 individuals using manual segmentation within Dragonfly v. 2021.3 (Object Research Systems, Canada). This step took approximately 480 minutes per specimen (24,000 minutes total) and involved the manual labelling of 11,947 2D images. Segmentation data for each specimen were exported as multi-label (3 labels: external, internal, and background) 8-bit multipage .tiff stacks and paired with the original CT image data to allow for training (see figure 2). The 50 specimens were categorised into three distinct groups (electronic supplementary material, table S1): 20 training image stacks, 10 validation image stacks, and 20 test image stacks. From the training image category, we generated six distinct training sets, varying in size from 1 to 20 specimens (see table 1). These were used to assess the impact of training set size on segmentation accuracy, as determined through a comparative analysis against the validation set (see Section 2.3). From the initial six training sets, we created six additional training sets through data augmentation using the NumPy library [17] in Python. This augmentation method was chosen for its simplicity and accessibility to researchers with limited computational expertise, as it can be easily implemented using a straightforward batch code. This augmentation process entailed rotating the original images five times (the maximum amount permitted using this method), effectively producing six distinct 3D orientations per specimen for each of the original training sets (see figure 3). The augmented training sets comprised between 6 and 120 .tiff stacks (see table 1).

    Training the neural networks

    CNNs were trained using the offline version of Biomedisa, which utilises a 3D U-Net architecture [18] (the primary model employed for image segmentation [19]) and is optimised using Keras with a TensorFlow backend. We used patches of size 64 x 64 x 64 voxels, which were then scaled to a size of 256 x 256 x 256 voxels. This scaling was performed to improve the network's ability to capture spatial features and mitigate potential information loss during training. We trained 3 networks for each of the training sets to check the extent of stochastic variation in the results [20]. To train our models in Biomedisa, we used stochastic gradient descent with a learning rate of 0.01, a decay of 1 × 10^-6, momentum of 0.9, and Nesterov momentum enabled. A stride size of 32 pixels and a batch size of 24 samples per epoch were used alongside an automated cropping feature, which has been demonstrated to enhance accuracy [21]. The training of each network was performed on a Tesla V100S-PCIE-32GB graphics card with 30989 MB of available memory. All the analyses and training procedures were conducted on the High-Performance Computing (HPC) system at the Natural History Museum, London. To measure network accuracy, we used the Dice similarity coefficient (Dice score), a metric commonly used in biomedical image segmentation studies [22, 23]. The Dice score quantifies the level of overlap between two segmentations, providing a value between 0 (no overlap) and 1 (perfect match). We conducted experiments to evaluate the potential efficiency gains of using an early stopping mechanism within Biomedisa. After testing a variety of epoch limits, we opted for an early stopping criterion set at 25 epochs, which was found to be the lowest value at which all models trained correctly for every training set. By "trained correctly" we mean that if there is no increase in Dice score within a 25-epoch window, the optimal network is selected and training is terminated. To gauge the impact of early stopping on network accuracy, we compared the results obtained from the original six training sets under early stopping to those obtained on a full run of 200 epochs.

    Evaluation of feature extraction

    We used the median accuracy network from each of the 12 training sets to produce segmentation data for the external and internal structures of the 20 test specimens. The median accuracy was selected as it provides a more robust estimate of performance by ensuring that outliers had less impact on the overall result. We then compared the volumetric and shape measurements from the manual data to those from each training set. The volumetric measurements were total volume (comprising both external and internal volumes) and percentage calcite (calculated as the ratio of external volume to internal volume, multiplied by 100). To compare shape, mesh data for the external and internal structures were generated from the segmentation data of the 12 training sets and the manual data. Meshes were decimated to 50,000 faces and smoothed before being scaled and aligned using Python and Generalized Procrustes Surface Analysis (GPSA) [24], respectively. Shape was then analysed using the landmark-free morphometry pipeline, as outlined by Toussaint et al. [25]. We used a kernel width of 0.1 mm and a noise parameter of 1.0 for the analysis of shape for both the external and internal data, using a Keops kernel (PyKeops; https://pypi.org/project/pykeops/) as it performs better with large data [25]. The analyses were run for 150 iterations, using an initial step size of 0.01. The manually generated mesh for the individual st049_bl1_fo2 was used as the atlas for both the external and internal shape comparisons.
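    For orientation, the sketch below shows the two generic ingredients named in this description, the Dice similarity coefficient and NumPy rotation-based augmentation, in plain NumPy; it is an illustration under simple assumptions, not the authors' Biomedisa pipeline:

```python
# Minimal sketch (not the authors' pipeline): the Dice score used to compare a
# predicted label volume against a manual segmentation, plus the kind of simple
# rotation augmentation that NumPy makes easy for 3D image stacks.
import numpy as np

def dice_score(pred, truth, label):
    """Overlap between two integer label volumes for one label (0 = none, 1 = perfect)."""
    p, t = (pred == label), (truth == label)
    denom = p.sum() + t.sum()
    return 2.0 * np.logical_and(p, t).sum() / denom if denom else 1.0

def rotated_copies(volume):
    """Yield the original volume plus five rotated copies (one simple augmentation scheme)."""
    yield volume
    for axes in [(0, 1), (0, 2), (1, 2)]:
        yield np.rot90(volume, k=1, axes=axes)
    yield np.rot90(volume, k=2, axes=(0, 1))
    yield np.rot90(volume, k=3, axes=(0, 1))

vol = np.random.default_rng(0).integers(0, 3, size=(64, 64, 64))  # toy 3-label volume
print(dice_score(vol, vol, label=1))  # 1.0: identical volumes overlap perfectly
print(sum(1 for _ in rotated_copies(vol)))  # 6 orientations per specimen
```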

  7. NYC_building_energy_data

    • kaggle.com
    zip
    Updated Mar 4, 2020
    Cite
    Maksym Dubovyi (2020). NYC_building_energy_data [Dataset]. https://www.kaggle.com/maxbrain/nyc-building-energy-data
    Explore at:
    Available download formats: zip (9476304 bytes)
    Dataset updated
    Mar 4, 2020
    Authors
    Maksym Dubovyi
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Area covered
    New York
    Description

    In this notebook, we will walk through solving a complete machine learning problem using a real-world dataset. This was a "homework" assignment given to me for a job application over summer 2018. The entire assignment can be viewed here and the one sentence summary is:

    Use the provided building energy data to develop a model that can predict a building's Energy Star score, and then interpret the results to find the variables that are most predictive of the score.

    This is a supervised, regression machine learning task: given a set of data with targets (in this case the score) included, we want to train a model that can learn to map the features (also known as the explanatory variables) to the target.

    Supervised problem: we are given both the features and the target.
    Regression problem: the target is a continuous variable, in this case ranging from 0 to 100.

    During training, we want the model to learn the relationship between the features and the score, so we give it both the features and the answer. Then, to test how well the model has learned, we evaluate it on a testing set where it has never seen the answers!

    Machine Learning Workflow

    Although the exact implementation details can vary, the general structure of a machine learning project stays relatively constant:

    1. Data cleaning and formatting
    2. Exploratory data analysis
    3. Feature engineering and selection
    4. Establish a baseline and compare several machine learning models on a performance metric
    5. Perform hyperparameter tuning on the best model to optimize it for the problem
    6. Evaluate the best model on the testing set
    7. Interpret the model results to the extent possible
    8. Draw conclusions and write a well-documented report

    Setting up the structure of the pipeline ahead of time lets us see how one step flows into the other. However, the machine learning pipeline is an iterative procedure and so we don't always follow these steps in a linear fashion. We may revisit a previous step based on results from further down the pipeline. For example, while we may perform feature selection before building any models, we may use the modeling results to go back and select a different set of features. Or, the modeling may turn up unexpected results that mean we want to explore our data from another angle. Generally, you have to complete one step before moving on to the next, but don't feel like once you have finished one step the first time, you cannot go back and make improvements!
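    As a hedged illustration (not the notebook's code), steps 4-6 of this outline might look roughly like the following in scikit-learn, here on synthetic regression data:

```python
# Hedged sketch of steps 4-6 of the workflow above (not the notebook's code):
# compare baseline models, tune the best one, then evaluate once on the test set.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 4: establish a baseline and compare candidate models with cross-validation.
for name, model in [("linear", LinearRegression()),
                    ("gbm", GradientBoostingRegressor(random_state=0))]:
    mae = -cross_val_score(model, X_tr, y_tr, cv=5,
                           scoring="neg_mean_absolute_error").mean()
    print(f"{name}: CV MAE = {mae:.1f}")

# Step 5: hyperparameter tuning on the stronger candidate.
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      {"n_estimators": [100, 300], "max_depth": [2, 3]},
                      scoring="neg_mean_absolute_error", cv=5).fit(X_tr, y_tr)

# Step 6: a single final evaluation on the held-out test set.
print("test MAE:", mean_absolute_error(y_te, search.best_estimator_.predict(X_te)))
```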

    This notebook will cover the first three (and a half) steps of the pipeline, with the other parts discussed in two additional notebooks. Throughout this series, the objective is to show how all the different data science practices come together to form a complete project. I try to focus more on the implementations of the methods rather than explaining them at a low level, but have provided resources for those who want to go deeper. For the single best book (in my opinion) for learning the basics and implementing machine learning practices in Python, check out Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron.

    With this outline in place to guide us, let's get started!

  8. SP500_data

    • kaggle.com
    zip
    Updated May 28, 2023
    Cite
    Franco Dicosola (2023). SP500_data [Dataset]. https://www.kaggle.com/datasets/francod/s-and-p-500-data
    Explore at:
    Available download formats: zip (39005 bytes)
    Dataset updated
    May 28, 2023
    Authors
    Franco Dicosola
    Description

    Project Documentation: Predicting S&P 500 Price

    Problem Statement: The goal of this project is to develop a machine learning model that can predict the future price of the S&P 500 index based on historical data and relevant features. By accurately predicting the price movements, we aim to assist investors and financial professionals in making informed decisions and managing their portfolios effectively.

    Dataset Description: The dataset used for this project contains historical data of the S&P 500 index, along with several other features such as dividends, earnings, consumer price index (CPI), interest rates, and more. The dataset spans a certain time period and includes daily values of these variables.

    Steps Taken:

    1. Data Preparation and Exploration:
    • Loaded the dataset and performed initial exploration.
    • Checked for missing values and handled them if any.
    • Explored the statistical summary and distributions of the variables.
    • Conducted correlation analysis to identify potential features for prediction.

    2. Data Visualization and Analysis:
    • Plotted time series graphs to visualize the S&P 500 index and other variables over time.
    • Examined the trends, seasonality, and residual behavior of the time series using decomposition techniques.
    • Analyzed the relationships between the S&P 500 index and other features using scatter plots and correlation matrices.

    3. Feature Engineering and Selection:
    • Selected relevant features based on correlation analysis and domain knowledge.
    • Explored feature importance using tree-based models and selected informative features.
    • Prepared the final feature set for model training.

    4. Model Training and Evaluation:
    • Split the dataset into training and testing sets.
    • Selected a regression model (Linear Regression) for price prediction.
    • Trained the model using the training set.
    • Evaluated the model's performance using mean squared error (MSE) and R-squared (R^2) metrics on both training and testing sets.

    5. Prediction and Interpretation:
    • Obtained predictions for future S&P 500 prices using the trained model.
    • Interpreted the predicted prices in the context of the current market conditions and the percentage change from the current price.

    Limitations and Future Improvements:
    • The predictive performance of the model is based on the available features and historical data, and it may not capture all the complexities and factors influencing the S&P 500 index.
    • The model's accuracy and reliability are subject to the quality and representativeness of the training data.
    • The model assumes that the historical patterns and relationships observed in the data will continue in the future, which may not always hold true.
    • Future improvements could include incorporating additional relevant features, exploring different regression algorithms, and considering more sophisticated techniques such as time series forecasting models.
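    A minimal sketch of the modelling and evaluation steps described above, using scikit-learn; the file name and column names below are placeholders, not the dataset's actual schema:

```python
# Minimal sketch of the modelling step described above (not the project's code):
# a linear regression on index-level features, scored with MSE and R^2.
# The CSV name and columns ("sp500", "dividends", ...) are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("sp500_history.csv", parse_dates=["date"])   # hypothetical file
df["next_close"] = df["sp500"].shift(-1)                       # target: next day's close
features = ["sp500", "dividends", "earnings", "cpi", "interest_rate"]
df = df.dropna(subset=features + ["next_close"])

# Keep chronological order when splitting time-series data.
X_tr, X_te, y_tr, y_te = train_test_split(df[features], df["next_close"],
                                          test_size=0.2, shuffle=False)
model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)
print("MSE:", mean_squared_error(y_te, pred), "R^2:", r2_score(y_te, pred))
```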

  9. Supporting datasets PubFig05 for: "Heterogeneous Ensemble Combination Search...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Haque, Mohammad Nazmul; Noman, Nasimul; Berratta, Regina; Moscato, Pablo (2020). Supporting datasets PubFig05 for: "Heterogeneous Ensemble Combination Search using Genetic Algorithm for Class Imbalanced Data Classification" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_33539
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    The Priority Research Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, Hunter Medical Research Institute, New Lambton Heights, New South Wales, Australia
    Authors
    Haque, Mohammad Nazmul; Noman, Nasimul; Berratta, Regina; Moscato, Pablo
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Faces Dataset: PubFig05

    This is a subset of the "PubFig83" dataset [1], providing 100 images each of the 5 celebrities that are most difficult to recognise (each celebrity is referred to as a class in the classification problem). For each celebrity, we took 100 images and separated them into training and testing sets of 90 and 10 images, respectively:

    Person: Jenifer Lopez; Katherine Heigl; Scarlett Johansson; Mariah Carey; Jessica Alba

    Feature Extraction

    To extract features from images, we have applied the HT-L3-model as described in [2] and obtained 25600 features.

    Feature Selection

    Details of the feature selection procedure are briefly described below:

    Entropy Filtering: First, we applied an implementation of Fayyad and Irani's [3] entropy-based heuristic to discretise the dataset and discarded features using the minimum description length (MDL) principle; only 4878 features passed this entropy-based filtering method.

    Class-Distribution Balancing: Next, we converted the dataset to a binary-class problem by separating it into 5 binary-class datasets using a one-vs-all setup. Hence, these datasets became imbalanced at a ratio of 1:4. We then converted them into balanced binary-class datasets using random sub-sampling. Further processing of the dataset is described in the paper.
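    A small sketch of what this one-vs-all balancing step could look like in NumPy (an illustration under simple assumptions, not the authors' code):

```python
# Hedged sketch of the balancing step described above (not the authors' code):
# split a 5-class problem into one-vs-all binary datasets, then randomly
# under-sample the majority ("rest") class to a 1:1 ratio.
import numpy as np

def one_vs_all_balanced(X, y, seed=0):
    rng = np.random.default_rng(seed)
    datasets = {}
    for cls in np.unique(y):
        pos = np.flatnonzero(y == cls)
        neg = np.flatnonzero(y != cls)
        neg = rng.choice(neg, size=len(pos), replace=False)  # under-sample the rest
        idx = rng.permutation(np.concatenate([pos, neg]))
        datasets[cls] = (X[idx], (y[idx] == cls).astype(int))
    return datasets
```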

    (alpha,beta)-k Feature Selection: To get a good feature set for training the classifier, we selected features using an approach based on the (alpha,beta)-k feature selection problem [4]. It selects a minimum subset of features that maximises both within-class similarity and dissimilarity between different classes. We applied the entropy filtering and (alpha,beta)-k feature subset selection methods in three ways and obtained different numbers of features (in the table below) after consolidating them into a binary-class dataset.

    UAB: We applied the (alpha,beta)-k feature selection method to each of the balanced binary-class datasets and took the union of the features selected for each one. Finally, we applied the (alpha,beta)-k feature selection method to each of the binary-class datasets to get a set of features.

    IAB: We applied the (alpha,beta)-k feature selection method to each of the balanced binary-class datasets and took the intersection of the features selected for each one. Finally, we applied the (alpha,beta)-k feature selection method to each of the binary-class datasets to get a set of features.

    UEAB: We applied the (alpha,beta)-k feature selection method to each of the balanced binary-class datasets. Then, we applied the entropy filtering and (alpha,beta)-k feature selection methods to each of the balanced binary-class datasets. Finally, we took the union of the features selected for each balanced binary-class dataset to get a set of features.

    All of these datasets are inside the compressed folder, which also contains a document describing the process in detail.

    References

    [1] Pinto, N., Stone, Z., Zickler, T., & Cox, D. (2011). Scaling up biologically-inspired computer vision: A case study in unconstrained face recognition on facebook. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on (pp. 35–42).

    [2] Cox, D., & Pinto, N. (2011). Beyond simple features: A large-scale feature search approach to unconstrained face recognition. In Automatic Face Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on (pp. 8–15).

    [3] Fayyad, U. M., & Irani, K. B. (1993). Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In International Joint Conference on Artificial Intelligence (pp. 1022–1029).

    [4] Berretta, R., Mendes, A., & Moscato, P. (2005). Integer programming models and algorithms for molecular classification of cancer from microarray data. In Proceedings of the Twenty-eighth Australasian conference on Computer Science - Volume 38 (pp. 361–370). 1082201: Australian Computer Society, Inc.

  10. Dorothea drug discovery Data

    • kaggle.com
    zip
    Updated Sep 1, 2020
    Cite
    Dr. Amritpal Singh (2020). Dorothea drug discovery Data [Dataset]. https://www.kaggle.com/datasets/amritpal333/dorothea-drug-discovery-data/code
    Explore at:
    Available download formats: zip (5110801 bytes)
    Dataset updated
    Sep 1, 2020
    Authors
    Dr. Amritpal Singh
    Description

    DOROTHEA - drug discovery dataset.

    Chemical compounds represented by structural molecular features must be classified as active (binding to thrombin) or inactive. This is one of 5 datasets of the NIPS 2003 feature selection challenge.

    Drugs are typically small organic molecules that achieve their desired activity by binding to a target site on a receptor. The first step in the discovery of a new drug is usually to identify and isolate the receptor to which it should bind, followed by testing many small molecules for their ability to bind to the target site. This leaves researchers with the task of determining what separates the active (binding) compounds from the inactive (non-binding) ones. Such a determination can then be used in the design of new compounds that not only bind, but also have all the other properties required for a drug (solubility, oral absorption, lack of side effects, appropriate duration of action, toxicity, etc.). The original data were modified for the purpose of the feature selection challenge. In particular, we added a number of distractor features called 'probes' that have no predictive power. The order of the features and patterns was randomized.

    DOROTHEA -- Positive ex. -- Negative ex. -- Total
    Training set -- 78 -- 722 -- 800
    Validation set -- 34 -- 316 -- 350
    Test set -- 78 -- 722 -- 800
    All -- 190 -- 1760 -- 1950

    We mapped Active compounds to the target value +1 (positive examples) and Inactive compounds to the target value –1 (negative examples).

    Number of variables/features/attributes:

    Real: 50000
    Probes: 50000
    Total: 100000

    This dataset is one of five datasets used in the NIPS 2003 feature selection challenge. All details about the preparation of the data are found in our technical report: Design of experiments for the NIPS 2003 variable selection benchmark, Isabelle Guyon, July 2003 (also included in the dataset archive).

    The data are split into training, validation, and test sets. Target values are provided only for the first 2 sets.

    The data are in the following format:

    dataname.param: Parameters and statistics about the data.
    dataname.feat: Identities of the features (withheld, to avoid biasing feature selection).
    dataname_train.data: Training set (a sparse binary matrix, patterns in lines, features in columns: the numbers of the non-zero features are provided).
    dataname_valid.data: Validation set.
    dataname_test.data: Test set.
    dataname_train.labels: Labels (truth values of the classes) for training examples.
    dataname_valid.labels: Validation set labels (withheld during the benchmark, but provided now).
    dataname_test.labels: Test set labels (withheld, so the data can still be used as a benchmark).
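    A minimal loader sketch for this sparse format (not an official parser; the index base is assumed to be 1-based here and should be checked against dataname.param before use):

```python
# Sketch: read a .data file in which each line lists the indices of that
# pattern's non-zero binary features, and build a SciPy sparse matrix.
import numpy as np
from scipy.sparse import csr_matrix

def load_sparse_binary(data_path, n_features=100000, one_based=True):
    rows, cols = [], []
    n_rows = 0
    with open(data_path) as fh:
        for i, line in enumerate(fh):
            n_rows = i + 1
            for tok in line.split():
                rows.append(i)
                cols.append(int(tok) - 1 if one_based else int(tok))
    data = np.ones(len(rows), dtype=np.int8)
    return csr_matrix((data, (rows, cols)), shape=(n_rows, n_features))

# Labels files hold one value (+1 / -1) per line, e.g.:
# y_train = np.loadtxt("dorothea_train.labels")
```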

    Relevant Papers:

    Isabelle Guyon, Steve Gunn, Masoud Nikravesh, Lotfi Zadeh (Eds.), Feature Extraction, Foundations and Applications. Studies in Fuzziness and Soft Computing. Physica-Verlag, Springer.

    Source: DuPont Pharmaceuticals graciously provided this data set for the KDD Cup 2001 competition. For more info: https://archive.ics.uci.edu/ml/datasets/dorothea

  11. AI Training Dataset Market Analysis, Size, and Forecast 2025-2029: North...

    • technavio.com
    pdf
    Updated Jul 15, 2025
    Cite
    Technavio (2025). AI Training Dataset Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, and UK), APAC (China, India, Japan, and South Korea), South America (Brazil), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/ai-training-dataset-market-industry-analysis
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jul 15, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    United Kingdom, Canada, United States
    Description


    AI Training Dataset Market Size 2025-2029

    The AI training dataset market size is forecast to increase by USD 7.33 billion, at a CAGR of 29% from 2024 to 2029. The proliferation and increasing complexity of foundational AI models will drive the AI training dataset market.

    Market Insights

    North America dominated the market and accounted for 36% of the growth during 2025-2029.
    By Service Type - Text segment was valued at USD 742.60 billion in 2023
    By Deployment - On-premises segment accounted for the largest market revenue share in 2023
    

    Market Size & Forecast

    Market Opportunities: USD 479.81 million 
    Market Future Opportunities 2024: USD 7334.90 million
    CAGR from 2024 to 2029 : 29%
    

    Market Summary

    The market is experiencing significant growth as businesses increasingly rely on artificial intelligence (AI) to optimize operations, enhance customer experiences, and drive innovation. The proliferation and increasing complexity of foundational AI models necessitate large, high-quality datasets for effective training and improvement. This shift from data quantity to data quality and curation is a key trend in the market. Navigating data privacy, security, and copyright complexities, however, poses a significant challenge. Businesses must ensure that their datasets are ethically sourced, anonymized, and securely stored to mitigate risks and maintain compliance. For instance, in the supply chain optimization sector, companies use AI models to predict demand, optimize inventory levels, and improve logistics. Access to accurate and up-to-date training datasets is essential for these applications to function efficiently and effectively. Despite these challenges, the benefits of AI and the need for high-quality training datasets continue to drive market growth. The potential applications of AI are vast and varied, from healthcare and finance to manufacturing and transportation. As businesses continue to explore the possibilities of AI, the demand for curated, reliable, and secure training datasets will only increase.

    What will be the size of the AI Training Dataset Market during the forecast period?

    The market continues to evolve, with businesses increasingly recognizing the importance of high-quality datasets for developing and refining artificial intelligence models. According to recent studies, the use of AI in various industries is projected to grow by over 40% in the next five years, creating a significant demand for training datasets. This trend is particularly relevant for boardrooms, as companies grapple with compliance requirements, budgeting decisions, and product strategy. Moreover, the importance of data labeling, feature selection, and imbalanced data handling in model performance cannot be overstated. For instance, a mislabeled dataset can lead to biased and inaccurate models, potentially resulting in costly errors. Similarly, effective feature selection algorithms can significantly improve model accuracy and reduce computational resources. Despite these challenges, advances in model compression methods, dataset scalability, and data lineage tracking are helping to address some of the most pressing issues in the market. For example, model compression techniques can reduce the size of models, making them more efficient and easier to deploy. Similarly, data lineage tracking can help ensure data consistency and improve model interpretability. In conclusion, the market is a critical component of the broader AI ecosystem, with significant implications for businesses across industries. By focusing on data quality, effective labeling, and advanced techniques for handling imbalanced data and improving model performance, organizations can stay ahead of the curve and unlock the full potential of AI.

    Unpacking the AI Training Dataset Market Landscape

    In the realm of artificial intelligence (AI), the significance of high-quality training datasets is indisputable. Businesses harnessing AI technologies invest substantially in acquiring and managing these datasets to ensure model robustness and accuracy. According to recent studies, up to 80% of machine learning projects fail due to insufficient or poor-quality data. Conversely, organizations that effectively manage their training data experience an average ROI improvement of 15% through cost reduction and enhanced model performance.

    Distributed computing systems and high-performance computing facilitate the processing of vast datasets, enabling businesses to train models at scale. Data security protocols and privacy preservation techniques are crucial to protect sensitive information within these datasets. Reinforcement learning models and supervised learning models each have their unique applications, with the former demonstrating a 30% faster convergence rate in certain use cases.

    Data annot

  12. f

    Training and testing accuracies for bagging SVM with different number of...

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Md. Raihan-Al-Masud; M. Rubaiyat Hossain Mondal (2023). Training and testing accuracies for bagging SVM with different number of features. [Dataset]. http://doi.org/10.1371/journal.pone.0228422.t009
    Explore at:
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Md. Raihan-Al-Masud; M. Rubaiyat Hossain Mondal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Training and testing accuracies for bagging SVM with different number of features.

  13. Data from: Benchmarking parametric and machine learning models for genomic...

    • osti.gov
    • search.dataone.org
    • +1more
    Updated Oct 4, 2019
    + more versions
    Cite
    Azodi, Christina B.; Bolger, Emily; McCarren, Andrew; Roantree, Mark; Shiu, Shin-Han; de los Campos, Gustavo (2019). Benchmarking parametric and machine learning models for genomic prediction of complex traits [Dataset]. https://www.osti.gov/dataexplorer/biblio/1873858-benchmarking-parametric-machine-learning-models-genomic-prediction-complex-traits
    Explore at:
    Dataset updated
    Oct 4, 2019
    Dataset provided by
    Department of Energy Biological and Environmental Research Program
    Office of Science (http://www.er.doe.gov/)
    Great Lakes Bioenergy Research Center (http://www.glbrc.org/)
    Authors
    Azodi, Christina B.; Bolger, Emily; McCarren, Andrew; Roantree, Mark; Shiu, Shin-Han; de los Campos, Gustavo
    Description

    The usefulness of genomic prediction in crop and livestock breeding programs has prompted efforts to develop new and improved genomic prediction algorithms, such as artificial neural networks and gradient tree boosting. However, the performance of these algorithms has not been compared in a systematic manner using a wide range of datasets and models. Using data of 18 traits across six plant species with different marker densities and training population sizes, we compared the performance of six linear and six non-linear algorithms. First, we found that hyperparameter selection was necessary for all non-linear algorithms and that feature selection prior to model training was critical for artificial neural networks when the markers greatly outnumbered the number of training lines. Across all species and trait combinations, no one algorithm performed best; however, predictions based on a combination of results from multiple algorithms (i.e. ensemble predictions) performed consistently well. While linear and non-linear algorithms performed best for a similar number of traits, the performance of non-linear algorithms varied more between traits. Although artificial neural networks did not perform best for any trait, we identified strategies (i.e. feature selection, seeded starting weights) that boosted their performance to near the level of other algorithms. Our results highlight the importance of algorithm selection for the prediction of trait values.
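    The ensemble idea highlighted here (combining predictions from multiple algorithms) can be sketched in a few lines; the snippet below is an illustration with arbitrary base learners, not the study's implementation:

```python
# Minimal sketch of an unweighted prediction ensemble (not the study's code):
# average the test-set predictions of several regressors fit on the same data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge

def ensemble_predict(X_train, y_train, X_test):
    models = [Ridge(alpha=1.0),
              RandomForestRegressor(n_estimators=300, random_state=0),
              GradientBoostingRegressor(random_state=0)]
    preds = [m.fit(X_train, y_train).predict(X_test) for m in models]
    return np.mean(preds, axis=0)   # simple unweighted ensemble of trait predictions
```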

  14. Machine Learning Basics for Beginners🤖🧠

    • kaggle.com
    zip
    Updated Jun 22, 2023
    Cite
    Bhanupratap Biswas (2023). Machine Learning Basics for Beginners🤖🧠 [Dataset]. https://www.kaggle.com/datasets/bhanupratapbiswas/machine-learning-basics-for-beginners
    Explore at:
    Available download formats: zip (492015 bytes)
    Dataset updated
    Jun 22, 2023
    Authors
    Bhanupratap Biswas
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. Here are some key concepts and terms to help you get started:

    1. Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.

    2. Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.

    3. Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.

    4. Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).

    5. Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of correctly predicted examples), precision (the proportion of true positives among all positive predictions), recall (the proportion of actual positives that are correctly predicted), and F1 score (a combination of precision and recall); see the sketch after this list.

    6. Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.

    7. Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.

    8. Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.

    9. Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases.

    10. Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.

    These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.
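    A minimal sketch of the evaluation metrics from point 5, computed with scikit-learn on a toy spam-classification example:

```python
# Toy example: the four common classification metrics from point 5.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # toy labels: 1 = spam, 0 = not spam
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # toy model predictions

print("accuracy: ", accuracy_score(y_true, y_pred))   # 6/8 = 0.75
print("precision:", precision_score(y_true, y_pred))  # 3/4 = 0.75
print("recall:   ", recall_score(y_true, y_pred))     # 3/4 = 0.75
print("F1:       ", f1_score(y_true, y_pred))         # 0.75
```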

  15. s-sized Training and Evaluation Data for Publication "Using Supervised...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Apr 20, 2020
    + more versions
    Cite
    Tobias Weber; Tobias Weber (2020). s-sized Training and Evaluation Data for Publication "Using Supervised Learning to Classify Metadata of Research Data by Field of Study" [Dataset]. http://doi.org/10.5281/zenodo.3490396
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Apr 20, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tobias Weber
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Automated classification of metadata of research data by their discipline(s) of research can be used in scientometric research, by repository service providers, and in the context of research data aggregation services. Openly available metadata of the DataCite index for research data were used to compile a large training and evaluation set comprised of 609,524 records. This is the cleaned and vectorized version with a feature selection of small size.

  16. Data from: In Silico Prediction of Physicochemical Properties of...

    • catalog.data.gov
    • datasets.ai
    Updated Nov 12, 2020
    + more versions
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). In Silico Prediction of Physicochemical Properties of Environmental Chemicals Using Molecular Fingerprints and Machine Learning [Dataset]. https://catalog.data.gov/dataset/in-silico-prediction-of-physicochemical-properties-of-environmental-chemicals-using-molecu
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    QSAR Model Reporting Formats. Examples of R code: feature selection and regression analysis. Figure S1: Data distribution of logBCF, BP, MP and logVP. Figures S2–S5: Relationship between model complexity and prediction errors as well as the plots of estimated values versus experimental data for logBCF, BP, MP, and logVP, respectively. Figure S6: Plots of leverage versus standardized residuals for logBCF, BP, MP, and logVP models. Table S1: Chemical product classes for training and test sets. Tables S2–S5: Regression statistics for logBCF, BP, MP, and logVP, respectively. Table S6: Applicability domains for logBCF, BP, MP, and logVP. Tables S7–S12: Chemicals with large prediction residuals for the six properties (PDF) Chemical names, CAS registry number and SMILES as well as experimentally measured and estimated property values of the training and test sets (XLSX). This dataset is associated with the following publication: Zang, Q., K. Mansouri, A. Williams, R. Judson, D. Allen, W.M. Casey, and N.C. Kleinstreuer. (Journal of Chemical Information and Modeling) In Silico Prediction of Physicochemical Properties of Environmental Chemicals Using Molecular Fingerprints and Machine Learning. Journal of Chemical Information and Modeling. American Chemical Society, Washington, DC, USA, 57(1): 36-49, (2017).

  17. Weather Seattle

    • kaggle.com
    zip
    Updated Apr 18, 2024
    Cite
    AbdElRahman16 (2024). Weather Seattle [Dataset]. https://www.kaggle.com/datasets/abdelrahman16/weather-seattle
    Explore at:
    Available download formats: zip (11824 bytes)
    Dataset updated
    Apr 18, 2024
    Authors
    AbdElRahman16
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    Seattle
    Description

    Seattle weather analysis involves understanding various meteorological variables recorded daily. Using the seattle-weather.csv file, we can explore weather patterns and seasonal changes, and predict future weather conditions in Seattle.

    The dataset contains the following features:

    🔍 Dataset Overview:

    📅 Date: The date of the recorded weather data.

    ☔ Precipitation: The amount of precipitation (in mm) recorded on that day.

    🌡️ Temp_max: The maximum temperature (in degrees Celsius) recorded on that day.

    🌡️ Temp_min: The minimum temperature (in degrees Celsius) recorded on that day.

    💨 Wind: The wind speed (in m/s) recorded on that day.

    🌦️ Weather: The type of weather (e.g., drizzle, rain).

    Steps to Analyze the Dataset:

    1. Data Preprocessing:

    Handle Missing Values: Ensure there are no missing values in the dataset.

    Convert Data Types: Convert date columns to datetime format if necessary.

    Normalize Numerical Variables: Scale features like Precipitation, Temp_max, Temp_min, and Wind if needed.

    2. Feature Selection:

    Select Relevant Features: Use techniques like correlation analysis to select features that contribute most to the analysis.

    3. Exploratory Data Analysis:

    Visualize Data: Create plots to understand the distribution and trends of different weather variables.

    Seasonal Analysis: Analyze how weather patterns change with seasons.

    4. Model Selection for Prediction:

    Choose Algorithms: Consider various machine learning algorithms such as Linear Regression, Decision Tree, Random Forest, and Time Series models.

    Compare Performance: Train multiple models and compare their performance (see the sketch after these steps).

    5. Model Training and Evaluation:

    Train Models: Train the selected models on the data.

    Evaluate Performance: Use metrics such as RMSE, MAE, and R² score to evaluate model performance.

    6. Model Deployment:

    Deploy the Model: Deploy the best model for predicting future weather conditions.

    Ensure Robustness: Make sure the model is robust and can handle real-world data.
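    A compact sketch of steps 2 to 5 above (not a definitive pipeline); it assumes lower-case column names such as precipitation, temp_max, temp_min and wind in seattle-weather.csv, which should be checked against the actual file:

```python
# Hedged sketch of steps 2-5: correlation check, train/test split, and a
# comparison of two regressors for predicting the daily maximum temperature.
# Column names are assumptions about seattle-weather.csv, not guaranteed.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("seattle-weather.csv", parse_dates=["date"])
features = ["precipitation", "temp_min", "wind"]
target = "temp_max"

# Step 2: a quick correlation check to see which features track the target.
print(df[features + [target]].corr()[target])

X_tr, X_te, y_tr, y_te = train_test_split(df[features], df[target],
                                          test_size=0.2, random_state=42)

# Steps 4-5: compare two models on held-out data with MAE and R^2.
for model in (LinearRegression(), RandomForestRegressor(random_state=42)):
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(type(model).__name__,
          "MAE:", round(mean_absolute_error(y_te, pred), 2),
          "R2:", round(r2_score(y_te, pred), 3))
```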

    Exploring This Dataset Can Help With:

    📊 Weather Pattern Analysis: Understanding the weather patterns in Seattle.

    🌼 Seasonal Changes: Gaining insights into seasonal variations in weather.

    🌦️ Future Predictions: Predicting future weather conditions in Seattle.

    🔍 Research: Providing a solid foundation for research in meteorology and climate studies.

    This dataset is an invaluable resource for anyone looking to analyze weather patterns and predict future conditions in Seattle, offering detailed insights into the city's meteorological variables.

    Please upvote if you find this helpful! 👍

  18. Summary of selected MRI-based texture features to optimize CV training...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Nov 24, 2015
    Cite
    Frakes, David; Eschbacher, Jennifer M.; Swanson, Kristin R.; Ning, Shuluo; O’Neill, Brian P.; Loftus, Joseph; Ranjbar, Sara; Baxter, Leslie C.; Wu, Teresa; Dueck, Amylou C.; Plasencia, Jonathan; Sarkaria, Jann; Gaw, Nathan; Nakaji, Peter; Elmquist, William; Karis, John P.; Gao, Fei; Zwart, Christine; Tran, Nhan; Hu, Leland S.; Jenkins, Robert; Price, Stephen J.; Smith, Kris A.; Mitchell, J. Ross; Li, Jing (2015). Summary of selected MRI-based texture features to optimize CV training accuracy. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001855446
    Explore at:
    Dataset updated
    Nov 24, 2015
    Authors
    Frakes, David; Eschbacher, Jennifer M.; Swanson, Kristin R.; Ning, Shuluo; O’Neill, Brian P.; Loftus, Joseph; Ranjbar, Sara; Baxter, Leslie C.; Wu, Teresa; Dueck, Amylou C.; Plasencia, Jonathan; Sarkaria, Jann; Gaw, Nathan; Nakaji, Peter; Elmquist, William; Karis, John P.; Gao, Fei; Zwart, Christine; Tran, Nhan; Hu, Leland S.; Jenkins, Robert; Price, Stephen J.; Smith, Kris A.; Mitchell, J. Ross; Li, Jing
    Description

    Machine learning (ML) selected the 3 MRI-based texture features that optimized cross-validation (CV) accuracy, based on leave-one-out cross-validation (LOOCV) of the training-set data (60 biopsies, 11 patients). The overall CV accuracy based on these 3 features is 85%.
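
    As an illustration only (not the authors' pipeline or data), LOOCV accuracy on a small sample can be estimated with scikit-learn roughly as follows, using synthetic data as a stand-in for the 60 biopsies and 3 texture features:

    ```python
    # Illustrative sketch: leave-one-out CV accuracy on synthetic data standing in
    # for a small sample (60 cases, 3 features); not the study's actual data or model.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=60, n_features=3, n_informative=3,
                               n_redundant=0, random_state=0)

    scores = cross_val_score(SVC(kernel="linear"), X, y, cv=LeaveOneOut())
    print(f"LOOCV accuracy: {scores.mean():.2f}")  # fraction of held-out samples classified correctly
    ```

    Note that if features are chosen on the full sample before running LOOCV, the resulting accuracy estimate can be optimistic; selection should ideally happen inside each training fold.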

  19. Table_1_Comparing feature selection and machine learning approaches for predicting CYP2D6 methylation from genetic variation.docx

    • datasetcatalog.nlm.nih.gov
    Updated Feb 21, 2024
    + more versions
    Cite
    Fong, Wei Jing; Pan, Hong; Tan, Geoffrey Chern-Yee; Krishna, Bernadus; Han, Mei; Rane, Nikita; Tan, Hong Ming; Goh, Nicole; Chen, Zou Hui; Tan, Kok Hian; Meaney, Michael; Purwanto, Natania Yovela; Garg, Rishabh; Teh, Ai Ling; Keppo, Jussi; Chan, Shiao-Yng; Gupta, Varsha; Jiang, Yuheng; Yap, Fabian; Chan, Kok Yen Jerry; Tan, Ethel Siew Ee; Wang, Dennis (2024). Table_1_Comparing feature selection and machine learning approaches for predicting CYP2D6 methylation from genetic variation.docx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001475698
    Explore at:
    Dataset updated
    Feb 21, 2024
    Authors
    Fong, Wei Jing; Pan, Hong; Tan, Geoffrey Chern-Yee; Krishna, Bernadus; Han, Mei; Rane, Nikita; Tan, Hong Ming; Goh, Nicole; Chen, Zou Hui; Tan, Kok Hian; Meaney, Michael; Purwanto, Natania Yovela; Garg, Rishabh; Teh, Ai Ling; Keppo, Jussi; Chan, Shiao-Yng; Gupta, Varsha; Jiang, Yuheng; Yap, Fabian; Chan, Kok Yen Jerry; Tan, Ethel Siew Ee; Wang, Dennis
    Description

    Introduction: Pharmacogenetics currently supports clinical decision-making on the basis of a limited number of variants in a few genes and may benefit paediatric prescribing, where there is a need for more precise dosing. Integrating genomic information such as methylation into pharmacogenetic models holds the potential to improve their accuracy and consequently prescribing decisions. Cytochrome P450 2D6 (CYP2D6) is a highly polymorphic gene conventionally associated with the metabolism of commonly used drugs and endogenous substrates. We thus sought to predict epigenetic loci from single nucleotide polymorphisms (SNPs) related to CYP2D6 in children from the GUSTO cohort.

    Methods: Buffy coat DNA methylation was quantified using the Illumina Infinium Methylation EPIC beadchip. CpG sites associated with CYP2D6 were used as outcome variables in Linear Regression, Elastic Net and XGBoost models. We compared feature selection of SNPs from GWAS mQTLs, GTEx eQTLs and SNPs within 2 Mb of the CYP2D6 gene, and the impact of adding demographic data. The samples were split into training (75%) and test (25%) sets for validation. For the Elastic Net and XGBoost models, the optimal hyperparameters were searched using 10-fold cross-validation. Root Mean Square Error and R-squared values were obtained to assess each model's performance. A GWAS performed to determine SNPs associated with the CpG sites identified a total of 15 SNPs, several of which appeared to influence multiple CpG sites.

    Results: Overall, Elastic Net models of genetic features appeared to perform marginally better than heritability estimates and substantially better than the Linear Regression and XGBoost models. Adding non-genetic features appeared to improve performance for some, but not all, feature sets and probes. The best feature set and machine learning (ML) approach differed substantially between CpG sites, and a number of top variables were identified for each model.

    Discussion: The SNP-based prediction models for CYP2D6 CpG methylation developed in this study for Singaporean children of varying ethnicities have clinical application. With further validation, they may add to the set of tools available to improve precision medicine and pharmacogenetics-based dosing.
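
    A hedged sketch of the validation scheme described above (75/25 train/test split, an Elastic Net tuned with 10-fold cross-validation, RMSE and R² on the held-out set); synthetic regression data stands in for the SNP features and CpG methylation outcome, which are not reproduced here:

    ```python
    # Sketch under assumptions: synthetic data replaces the SNP/methylation matrices.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import ElasticNet
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import GridSearchCV, train_test_split

    X, y = make_regression(n_samples=300, n_features=50, n_informative=15, noise=5.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    # 10-fold CV hyperparameter search over the Elastic Net penalty and mixing ratio.
    search = GridSearchCV(
        ElasticNet(max_iter=10_000),
        param_grid={"alpha": [0.01, 0.1, 1.0, 10.0], "l1_ratio": [0.1, 0.5, 0.9]},
        cv=10,
        scoring="neg_root_mean_squared_error",
    )
    search.fit(X_train, y_train)

    pred = search.best_estimator_.predict(X_test)
    print("best params:", search.best_params_)
    print(f"test RMSE={np.sqrt(mean_squared_error(y_test, pred)):.2f}, R2={r2_score(y_test, pred):.2f}")
    ```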

  20. Classification with an Academic Success Dataset

    • kaggle.com
    zip
    Updated Jun 28, 2024
    Cite
    MosesNzoo (2024). Classification with an Academic Success Dataset [Dataset]. https://www.kaggle.com/datasets/mosesnzoo/classification-with-an-academic-success-dataset
    Explore at:
    zip (137108 bytes)
    Dataset updated
    Jun 28, 2024
    Authors
    MosesNzoo
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Classification with an Academic Success Dataset

    Objective:

    The goal of this project is to develop a classification model that can predict whether a student will pass based on various academic and demographic factors. By analyzing the provided dataset, we aim to identify key predictors of academic success and build a model that can help educational institutions improve student outcomes.

    Dataset:

    The dataset consists of 51,012 rows and includes the following columns:

    • student_id: Unique identifier for each student.
    • gender: Gender of the student (Male/Female).
    • age: Age of the student.
    • major: Major field of study (Engineering, Science, Arts, Business).
    • gpa: Grade Point Average (GPA) of the student.
    • study_hours: Average number of study hours per week.
    • extra_curricular: Participation in extracurricular activities (Yes/No).
    • attendance: Attendance percentage.
    • passed: Whether the student passed (Yes/No).

    Tools and Techniques (a brief code sketch follows this list):

    • Data Preprocessing: Clean and preprocess the data to handle missing values, encode categorical variables, and normalize numerical features.
    • Exploratory Data Analysis (EDA): Perform EDA to understand the distribution of data and identify patterns and correlations.
    • Feature Selection: Select the most relevant features that contribute to the prediction of academic success.
    • Model Training: Train various classification models such as Logistic Regression, Decision Trees, Random Forest, and Support Vector Machines (SVM).
    • Model Evaluation: Evaluate the performance of the models using metrics such as accuracy, precision, recall, and F1-score.
    • Hyperparameter Tuning: Optimize the model's performance by tuning hyperparameters using techniques such as Grid Search or Random Search.
    • Model Interpretation: Interpret the model's predictions to provide actionable insights for educational institutions.
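
    A minimal sketch of the workflow in the list above, assuming the column names described for this dataset and a hypothetical raw file name ("train.csv"); it is a starting point, not the definitive pipeline:

    ```python
    # Sketch under assumptions: the file name and exact column names are taken from
    # the description above and are hypothetical here.
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.read_csv("train.csv")  # hypothetical file name

    numeric = ["age", "gpa", "study_hours", "attendance"]
    categorical = ["gender", "major", "extra_curricular"]

    # Preprocessing: scale numeric features, one-hot encode categorical ones.
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])
    pipe = Pipeline([("pre", preprocess), ("clf", RandomForestClassifier(random_state=42))])

    X_train, X_test, y_train, y_test = train_test_split(
        df[numeric + categorical], df["passed"], test_size=0.2,
        stratify=df["passed"], random_state=42)

    # Hyperparameter tuning with a small grid and 5-fold CV.
    search = GridSearchCV(pipe, {"clf__n_estimators": [100, 300], "clf__max_depth": [None, 10]},
                          cv=5, scoring="f1_macro")
    search.fit(X_train, y_train)

    # Accuracy, precision, recall and F1 on the held-out set.
    print(classification_report(y_test, search.predict(X_test)))
    ```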

    Outcome:

    • A trained classification model that can predict whether a student will pass based on the provided features.
    • A detailed report outlining the steps taken in the analysis, the performance of different models, and the final model's evaluation metrics.
    • Visualizations to illustrate the relationships between different features and the target variable.
    • Recommendations for educational institutions based on the findings of the analysis.

    Skills Required:

    • Data Preprocessing and Cleaning
    • Exploratory Data Analysis (EDA)
    • Feature Selection and Engineering
    • Classification Algorithms
    • Model Evaluation and Hyperparameter Tuning
    • Data Visualization
    • Model Interpretation

    Expected Deliverables:

    • Cleaned and preprocessed dataset ready for modeling.
    • Trained classification model with optimized performance.
    • A comprehensive report detailing the analysis process and findings.
    • Visualizations and insights to aid in decision-making for educational institutions.