Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Faces Dataset: PubFig05
This is a subset of the ''PubFig83'' dataset [1], containing 100 images each of the 5 celebrities who are most difficult to recognise (each celebrity is treated as a class in the classification problem). For each celebrity, we took the 100 images and separated them into training and testing sets of 90 and 10 images, respectively:
Person: Jennifer Lopez; Katherine Heigl; Scarlett Johansson; Mariah Carey; Jessica Alba
Feature Extraction
To extract features from the images, we applied the HT-L3 model as described in [2] and obtained 25,600 features.
Feature Selection
The feature selection process, in brief, was as follows:
Entropy Filtering: First, we applied an implementation of Fayyad and Irani's [3] entropy-based heuristic to discretise the dataset, discarding features using the minimum description length (MDL) principle; only 4,878 features passed this entropy-based filter.
Class-Distribution Balancing: Next, we converted the dataset into binary-class problems by splitting it into 5 binary-class datasets using a one-vs-all setup. These datasets were therefore imbalanced at a ratio of 1:4, so we converted them into balanced binary-class datasets using random sub-sampling. Further processing of the dataset is described in the paper.
(alpha,beta)-k Feature Selection: To obtain a good feature set for training the classifier, we selected features using an approach based on the (alpha,beta)-k feature selection problem [4], which selects a minimum subset of features that maximises both within-class similarity and between-class dissimilarity. We applied the entropy filtering and (alpha,beta)-k feature subset selection methods in three ways and obtained different numbers of features (see the table below) after consolidating them into a binary-class dataset.
UAB: We applied the (alpha,beta)-k feature selection method to each of the balanced binary-class datasets and took the union of the features selected for each binary-class dataset. Finally, we applied the (alpha,beta)-k feature selection method once more to each of the binary-class datasets to obtain a final set of features.
IAB: We applied the (alpha,beta)-k feature selection method to each of the balanced binary-class datasets and took the intersection of the features selected for each binary-class dataset. Finally, we applied the (alpha,beta)-k feature selection method once more to each of the binary-class datasets to obtain a final set of features.
UEAB: We applied the (alpha,beta)-k feature selection method to each of the balanced binary-class datasets. Then, we applied entropy filtering and the (alpha,beta)-k feature selection method to each of the balanced binary-class datasets. Finally, we took the union of the features selected for each balanced binary-class dataset to obtain a final set of features.
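The three schemes differ mainly in how the per-class feature sets are combined. A minimal sketch of the union (UAB-style) versus intersection (IAB-style) consolidation step, with hypothetical feature index sets standing in for the real per-class (alpha,beta)-k selections:

```python
# Hypothetical per-class feature index sets, one per one-vs-all
# binary dataset (the real sets come from (alpha,beta)-k selection).
selected = {
    "class_0": {1, 4, 7, 9},
    "class_1": {2, 4, 7},
    "class_2": {4, 5, 7, 9},
}

# UAB-style consolidation: union of the per-class selections.
union_features = set().union(*selected.values())

# IAB-style consolidation: intersection of the per-class selections.
intersection_features = set.intersection(*selected.values())

print(sorted(union_features))         # every feature chosen for any class
print(sorted(intersection_features))  # only features chosen for all classes
```

The union keeps any feature useful for at least one class, while the intersection keeps only features useful for every class, which is why the three schemes yield different feature counts.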
All of these datasets are inside the compressed folder, which also contains a document describing the process in detail.
References
[1] Pinto, N., Stone, Z., Zickler, T., & Cox, D. (2011). Scaling up biologically-inspired computer vision: A case study in unconstrained face recognition on facebook. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on (pp. 35–42).
[2] Cox, D., & Pinto, N. (2011). Beyond simple features: A large-scale feature search approach to unconstrained face recognition. In Automatic Face Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on (pp. 8–15).
[3] Fayyad, U. M., & Irani, K. B. (1993). Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In International Joint Conference on Artificial Intelligence (pp. 1022–1029).
[4] Berretta, R., Mendes, A., & Moscato, P. (2005). Integer programming models and algorithms for molecular classification of cancer from microarray data. In Proceedings of the Twenty-eighth Australasian conference on Computer Science - Volume 38 (pp. 361–370). 1082201: Australian Computer Society, Inc.
Context Most work in Natural Language Processing (NLP) is done for the English language, since it is considered the vehicle of knowledge transmission. However, working on other languages is also very relevant to bring the technology to more people.
Spanish is the world's second language by number of native speakers: over 493 million people. If we add to this the people with limited proficiency in the language and the 24 million students of Spanish as a foreign language, we reach 591 million potential speakers (7.5% of the world's population), according to this year's edition of the Cervantes Institute yearbook.
But still, the lack of corpora for NLP research in Spanish is evident: when I searched for a dataset for binary sentiment-based text classification, I couldn't find anything. And this is how I ended up compiling my own collection of hotel reviews retrieved from TripAdvisor, which can be used for both binary and multi-class classification (and many other sentiment analysis approaches).
The Andalusian Hotel Reviews Corpus is my first ever dataset, and it gave me the idea for a web-scraping algorithm which I plan to use to update this dataset and create new ones, so stay tuned!
Content AHR is a dataset containing 18,172 hotel reviews in Spanish. 16,356 of them were retrieved from TripAdvisor by me in December 2021, and the rest derives from the COAH corpus (Corpus of Opinions about Andalusian Hotels), which was compiled by the SINAI research group in 2014. This corpus is publicly available and can be accessed in .xml format from SINAI's website.
I also include a small, but balanced version of this dataset, containing 7,615 reviews in total.
Here you can find detailed information about the columns of the .csv file:
title - the review's title
rating - the rating that the user gave to the hotel on a 5-star scale
review_text - the review's text
location - references to the city and the region of the hotel
hotel - the hotel's name
label - the label for binary classification. NOTE: all neutral reviews (3-star rating) are tagged with a «3» and must be removed to perform binary classification.
It is worth mentioning that the COAH corpus does not provide information about the location and the name of the hotel reviewed, so these columns were filled with NaNs.
Class imbalance This dataset is highly imbalanced: it seems like Andalusian hotels are generally great 😄. I've also uploaded a reduced and balanced version, in case you don't want to address the rare-event detection problem.
Citations The reference to the COAH corpus:
Molina-González, M. D., Martínez-Cámara, E., Martín-Valdivia, M. T., Ureña-López, L. A. (2014). Cross-domain sentiment analysis using spanish opinionated words. Natural Language Processing and Information Systems, Lecture Notes in Computer Science, vol. 8455, pp. 214-219. Springer International Publishing. DOI: 10.1007/978-3-319-07983-7_28
Inspiration This data is suitable for a variety of sentiment analysis tasks:
Binary sentiment classification (don't forget to remove neutral reviews)
Multi-class sentiment classification
Prediction of the review's rating
Topic modeling on reviews
…and all other tasks that your imagination suggests.
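For the binary setup, the neutral rows must be dropped first. A minimal pandas sketch, using a toy stand-in frame since the real .csv filename is not fixed here:

```python
import pandas as pd

# Toy stand-in with the same columns as the AHR .csv; with the real
# data you would call pd.read_csv on the downloaded file instead.
df = pd.DataFrame({
    "title": ["Great stay", "Awful", "Average"],
    "rating": [5, 1, 3],
    "review_text": ["Lovely hotel.", "Never again.", "It was fine."],
    "location": ["Sevilla, Andalucia", "Granada, Andalucia", None],
    "hotel": ["Hotel A", "Hotel B", None],
    "label": [1, 0, 3],
})

# Binary classification: drop neutral (3-star) reviews, tagged label == 3.
binary_df = df[df["label"] != 3]

# Multi-class work or rating prediction would instead keep all rows
# and target the 1-5 star "rating" column.
X, y = binary_df["review_text"], binary_df["label"]
print(len(binary_df))  # 2 rows remain
```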
Original Data Source: Andalusian Hotels’ Reviews
Open Data Commons Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
The State Grid Corporation of China (SGCC) dataset, with 1000 records, was used in the model. It is a key resource in the field of power distribution and management, with a large and varied set of data about electricity transport and grid operations. The dataset contains many different kinds of information, such as historical and real-time data on energy use, grid infrastructure, the integration of green energy, and grid performance. It is a key part of making power distribution networks more reliable and efficient, helping with tasks such as predicting demand, monitoring the grid, and detecting problems. Researchers, energy providers, and lawmakers can use this information to learn important things about electricity usage trends, the health of the grid, and the integration of green energy sources, helping the electric power industry develop new data-driven strategies and ideas.
Electricity theft detection dataset released by the State Grid Corporation of China (SGCC). The file data set.csv contains 1,037 columns and 42,372 rows of electricity consumption from 1 January 2014 to 30 October 2016. The first column is the consumer ID, which is alphanumeric. Columns 2 to 1,036 give the daily electricity consumption. The last column, named flag, holds the labels as 0 and 1 values. The small version of the dataset, datasetsmall.csv, only contains the electricity consumption for January 2014.
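The positional layout described above can be sliced directly with pandas. The frame below is a miniature stand-in: the column names are assumptions, only the positions match the description:

```python
import pandas as pd

# Miniature frame with the same positional layout as data set.csv:
# first column consumer ID, middle columns daily readings, last column flag.
# (The real file has 42,372 rows and 1,037 columns.)
df = pd.DataFrame({
    "CONS_NO": ["A001", "A002"],   # alphanumeric consumer ID (assumed name)
    "2014-01-01": [3.2, 0.0],
    "2014-01-02": [2.9, 0.1],
    "2014-01-03": [3.5, 0.0],
    "FLAG": [0, 1],                # 1 = electricity theft (assumed name)
})

consumer_id = df.iloc[:, 0]     # column 1: consumer ID
consumption = df.iloc[:, 1:-1]  # columns 2 .. n-1: daily consumption
labels = df.iloc[:, -1]         # last column: 0/1 theft flag
print(consumption.shape)  # (2, 3)
```

Positional (`iloc`) slicing keeps the code independent of whatever the real header names turn out to be.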
The data for this competition is from the RAICOM Mission Application Competition and Mo in China, originating from https://www.kaggle.com/datasets/uciml/mushroom-classification/
The copyright of datasets belongs to the organizers of "RAICOM Mission Application Competition"
The results of the Official Baseline are:
Accuracy: 0.7464409388226241
Precision: 0.7591353576942872
Recall: 0.6344086021505376
F1: 0.6911902530459232
Confusion matrix:
[[2405 468]
[ 850 1475]]
Mushrooms are a beloved delicacy, but beneath their glamorous appearance they may harbor deadly dangers. China is one of the countries with the largest variety of mushrooms in the world; at the same time, mushroom poisoning is one of the most serious food safety issues in China. According to relevant reports, in 2021 China investigated 327 mushroom poisoning incidents, involving 923 patients and 20 deaths, a total mortality rate of 2.17%. For non-professionals, it is impossible to distinguish poisonous mushrooms from edible ones by their appearance, shape, color, and so on; there is no simple standard that separates the two. To determine whether mushrooms are edible, it is necessary to collect mushrooms with different characteristic attributes and analyze whether they are toxic. In this competition, 22 characteristic attributes of mushrooms are analyzed to obtain a mushroom edibility model, which can better predict whether mushrooms are edible.
In the context of this mushroom usability model competition, several performance metrics can be utilized to evaluate the predictive accuracy of the model. Among them, the F1 score stands out due to its ability to provide a balance between precision and recall, which are crucial for this classification problem where distinguishing between poisonous and edible mushrooms can have severe real-world implications.
F1 Score The F1 score is the harmonic mean of precision and recall, and it is particularly useful in binary classification scenarios with imbalanced class distribution:
Precision (also known as positive predictive value) indicates the proportion of true positive observations among all observations classified as positive. It measures the accuracy of the positive predictions. \( \text{Precision} = \frac{TP}{TP + FP} \)
Recall (also known as sensitivity or true positive rate) measures the proportion of true positive observations out of all actual positives. It assesses the ability to capture all the true positive instances. \( \text{Recall} = \frac{TP}{TP + FN} \)
The F1 score is calculated as follows:
\[ \text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
Why F1 Score? Balance Between Precision and Recall: In the context where mushroom classification error can have critical health impacts, favoring either precision or recall solely might be dangerous. F1 score provides a more comprehensive evaluation by balancing these errors.
Handling Imbalanced Classes: Mushroom datasets often have an imbalance between the number of edible and poisonous instances. The F1 score is less influenced by the skewed class distributions compared to accuracy.
Critical Application: Misclassifying a poisonous mushroom as edible can lead to severe health risks. Hence, ensuring both high precision (minimizing false positives) and high recall (capturing all true positives) is crucial. The F1 score encapsulates the tradeoff between these aspects well.
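As a sanity check, all four baseline numbers can be reproduced from the reported confusion matrix, assuming it uses the scikit-learn layout [[TN, FP], [FN, TP]]:

```python
# Entries taken from the Official Baseline's confusion matrix,
# read with the assumed scikit-learn layout [[TN, FP], [FN, TP]].
tn, fp = 2405, 468
fn, tp = 850, 1475

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# -> 0.7464 0.7591 0.6344 0.6912
```

The fact that all four values match the baseline confirms the assumed matrix layout.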
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Description: Good and Bad Omelette Classification
This dataset is created for the binary classification task of identifying whether an omelette sample is of good or bad quality. The primary objective is to develop a machine learning model that can accurately classify an omelette image or sample based on predefined quality parameters.
1000 Good Omelettes
1000 Bad Omelettes
Each sample in the dataset represents one omelette and is accompanied by a corresponding label:
1 or "Good" for high-quality omelettes
0 or "Bad" for low-quality omelettes
Images: High-resolution photos of omelettes under consistent lighting conditions. Each image is labeled accordingly.
Optional Metadata (if available):
Texture metrics (e.g., crispiness, fluffiness)
Color balance (golden brown vs burnt or undercooked)
Shape regularity
Ingredients used
Cooking time and temperature
Quality Criteria (Labeling Guidelines):
i) Good Omelette Characteristics:
Evenly cooked (not burnt or undercooked)
Appealing golden-brown color
Balanced texture (not rubbery or overly crispy)
Well-shaped and visually appealing
Includes expected ingredients (e.g., eggs, milk, seasoning, optional vegetables)
ii) Bad Omelette Characteristics:
Undercooked or overcooked (burnt)
Pale or overly dark in color
Irregular shape, torn or folded poorly
Displeasing texture (e.g., too runny or rubbery)
Missing or wrong ingredients
Purpose of the Dataset:
The dataset is intended for:
Training and evaluating computer vision or quality assessment models
Image classification tasks in food quality control
Benchmarking performance of different ML algorithms in binary classification
Applications:
i) Automated food quality inspection in restaurants or food delivery services
ii) Educational tools for culinary training
iii) Quality assurance in pre-packaged meal production
Uploads: 1,000 properly labeled bad pictures and 1,000 good pictures are uploaded.
Ethics and Bias Consideration:
Care has been taken to ensure diversity in sample acquisition—different cooking styles, lighting, and plating are considered to avoid bias.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the accompanying dataset generated by the GitHub project https://github.com/tonyreina/tdc-tcr-epitope-antibody-binding. In that repository I show how to create machine learning models for predicting whether a T-cell receptor (TCR) and a protein epitope will bind to each other.
A model that can predict how well a TCR binds to an epitope can lead to more effective immunotherapy treatments. For example, in anti-cancer therapies it is important for the T-cell receptor to bind to the protein marker on the cancer cell so that the T-cell (actually, the T-cell's friends in the immune system) can kill the cancer cell.
These are Facebook's Evolutionary Scale Model (ESM-1b) embeddings for the TDC dataset for the TCR-Epitope Binding Affinity Prediction Task. The Facebook model is open-sourced and can be downloaded via the open-source bio-embeddings Python library.
To load them into Python use the Pandas library:
import pandas as pd

# Each pickle file holds a DataFrame with the columns described below.
train_data = pd.read_pickle("train_data.pkl")
validation_data = pd.read_pickle("validation_data.pkl")
test_data = pd.read_pickle("test_data.pkl")
The epitope_aa and the tcr_full columns are the protein (peptide) sequences for the epitope and the T-cell receptor, respectively. The letters correspond to the standard amino acid codes.
The epitope_smi column is the SMILES notation for the chemical structure of the epitope. We won't use this information. Instead, the ESM-1b embedder should be sufficient for the input to our binary classification model.
The tcr column is the hypervariable CDR3 loop. It's the part of the TCR that actually binds (assuming it binds) to the epitope.
The label column is whether the two proteins bind. 0 = No. 1 = Yes.
The tcr_vector and epitope_vector columns are the bio-embeddings of the TCR and epitope sequences generated by the Facebook ESM-1b model. These two vectors can be used to create a machine learning model that predicts whether the combination will produce a successful protein binding.
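A minimal sketch of how the two embedding columns can feed a classifier. The vectors below are random stand-ins for the real ESM-1b embeddings, and scikit-learn's logistic regression is just one reasonable choice, not the repository's exact model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Random stand-ins for the tcr_vector / epitope_vector columns
# (the real ESM-1b embeddings are much higher-dimensional).
n, dim = 200, 16
tcr_vecs = rng.normal(size=(n, dim))
epitope_vecs = rng.normal(size=(n, dim))
labels = rng.integers(0, 2, size=n)  # 0 = no binding, 1 = binding

# One simple pairing strategy: concatenate the two embeddings per pair.
X = np.hstack([tcr_vecs, epitope_vecs])

clf = LogisticRegression(max_iter=1000).fit(X, labels)
proba = clf.predict_proba(X[:1])  # binding probabilities for the first pair
print(proba.shape)  # (1, 2)
```

Concatenation is the simplest way to present a (TCR, epitope) pair to a standard classifier; more elaborate pairings (differences, attention over the two vectors) are also possible.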
From the TDC website:
T-cells are an integral part of the adaptive immune system, whose survival, proliferation, activation and function are all governed by the interaction of their T-cell receptor (TCR) with immunogenic peptides (epitopes). A large repertoire of T-cell receptors with different specificity is needed to provide protection against a wide range of pathogens. This new task aims to predict the binding affinity given a pair of TCR sequence and epitope sequence.
Weber et al.
Dataset Description: The dataset is from Weber et al., who assembled a large and diverse dataset from the VDJdb database and the ImmuneCODE project. It uses human TCR-beta chain sequences. Since this dataset is highly imbalanced, the authors excluded epitopes with fewer than 15 associated TCR sequences and downsampled to a limit of 400 TCRs per epitope. The dataset contains amino acid sequences either for the entire TCR or only for the hypervariable CDR3 loop. Epitopes are provided as amino acid sequences. Since Weber et al. proposed representing the peptides as SMILES strings (which reformulates the problem as protein-ligand binding prediction), the SMILES strings of the epitopes are also included. 50% negative samples were generated by shuffling the pairs, i.e. associating TCR sequences with epitopes they have not been shown to bind.
Task Description: Binary classification. Given the epitope (a peptide, either represented as amino acid sequence or as SMILES) and a T-cell receptor (amino acid sequence, either of the full protein complex or only of the hypervariable CDR3 loop), predict whether the epitope binds to the TCR.
Dataset Statistics: 47,182 TCR-Epitope pairs between 192 epitopes and 23,139 TCRs.
References:
Weber, Anna, Jannis Born, and María Rodriguez Martínez. “TITAN: T-cell receptor specificity prediction with bimodal attention networks.” Bioinformatics 37.Supplement_1 (2021): i237-i244.
Bagaev, Dmitry V., et al. “VDJdb in 2019: database extension, new analysis infrastructure and a T-cell receptor motif compendium.” Nucleic Acids Research 48.D1 (2020): D1057-D1062.
Dines, Jennifer N., et al. “The ImmuneRACE study: A prospective multicohort study of immune response action to COVID-19 events with the ImmuneCODE™ open access database.” medRxiv (2020).
Dataset License: CC BY 4.0.
Contributed by: Anna Weber and Jannis Born.
The Facebook ESM-1b model has the MIT license and was published in:
Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, Rob Fergus. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. bioRxiv 622803; doi: https://doi.org/10.1101/622803 https://www.biorxiv.org/content/10.1101/622803v4
License: unknown (https://choosealicense.com/licenses/unknown/)
Dataset Card for SST-2
Dataset Summary
The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/sst2.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ViSIR
Our dataset is a combination of ViHSD and ViHOS. Initially, we planned to use only ViHSD and relabel it into binary categories for toxic and non-toxic comment classification. However, after preprocessing, we noticed a class imbalance, with a significant skew toward non-toxic labels. To address this, we extracted approximately 10,000 toxic comments from ViHOS to balance the dataset.
Acknowledgment
This dataset is built upon the following datasets:
ViHSD
ViHOS
We… See the full description on the dataset page: https://huggingface.co/datasets/UngLong/ViSIR.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of the gene expression datasets. Number of samples, number of features, and class-wise frequency distribution are shown against each dataset.
GNU General Public License v3.0: https://www.gnu.org/licenses/gpl-3.0.html
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Synthetic Minority Over-sampling Technique (SMOTE) is a machine learning approach to addressing class imbalance in datasets, and it is beneficial for identifying antimicrobial resistance (AMR) patterns. In AMR studies, datasets often contain more susceptible isolates than resistant ones, leading to biased model performance. SMOTE overcomes this issue by generating synthetic samples of the minority class (resistant isolates) through interpolation rather than simple duplication, thereby improving model generalization. When applied to AMR prediction, SMOTE enhances the ability of classification models to accurately identify resistant Escherichia coli strains by balancing the dataset, ensuring that machine learning algorithms do not overlook rare resistance patterns. It is commonly used with classifiers like decision trees, support vector machines (SVM), and deep learning models to improve predictive accuracy. By mitigating class imbalance, SMOTE enables robust AMR detection, aiding in early identification of drug-resistant bacteria and informing antibiotic stewardship efforts.

Supervised machine learning is widely used in Escherichia coli genomic analysis to predict antimicrobial resistance, virulence factors, and strain classification. By training models on labeled genomic data (e.g., the presence or absence of resistance genes, SNP profiles, or MLST types), these classifiers help identify patterns and make accurate predictions.

10 supervised machine learning classifiers for E. coli genome analysis:

Logistic regression (LR): A simple yet effective statistical model for binary classification, such as predicting antibiotic resistance or susceptibility in E. coli.

Linear support vector machine (Linear SVM): Finds the optimal hyperplane to separate E. coli strains based on genomic features such as gene presence or sequence variations.

Radial basis function kernel support vector machine (RBF-SVM): A more flexible version of SVM that uses a non-linear kernel to capture complex relationships in genomic data, improving classification accuracy.

Extra trees classifier: A tree-based ensemble method that enhances classification by randomly selecting features and thresholds, improving robustness in E. coli strain differentiation.

Random forest (RF): An ensemble learning method that constructs multiple decision trees, reducing overfitting and improving prediction accuracy for resistance genes and virulence factors.

AdaBoost: A boosting algorithm that combines weak classifiers iteratively, refining predictions and improving the identification of antimicrobial resistance patterns.

XGBoost: An optimized gradient boosting algorithm that efficiently handles large genomic datasets, commonly used for high-accuracy predictions in E. coli classification.

Naïve Bayes (NB): A probabilistic classifier based on Bayes' theorem, suitable for predicting resistance phenotypes from genomic features.

Linear discriminant analysis (LDA): A statistical approach that maximizes class separability, helping distinguish between resistant and susceptible E. coli strains.

Quadratic discriminant analysis (QDA): A variation of LDA that allows for non-linear decision boundaries, improving classification in datasets with complex genomic structures.

When applied to E. coli genomes, these classifiers help predict antibiotic resistance, track outbreak strains, and understand genomic adaptations. Combining them with feature selection and optimization techniques enhances accuracy, making them valuable tools in bacterial genomics and clinical research.
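The interpolation step at the heart of SMOTE can be sketched in a few lines. This is a bare-bones illustration of the idea with synthetic stand-in data; real work should use imbalanced-learn's SMOTE implementation:

```python
import numpy as np

# A bare-bones sketch of the SMOTE idea: synthesize new minority samples
# by interpolating between a minority point and one of its nearest
# minority-class neighbours.
rng = np.random.default_rng(42)

# Stand-in minority class, e.g. resistant isolates' feature vectors.
minority = rng.normal(loc=2.0, size=(10, 5))

def smote_like(X, n_new, k=3, rng=rng):
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        # k nearest minority neighbours of X[i] (excluding itself)
        d = np.linalg.norm(X - X[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]
        j = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)

new_samples = smote_like(minority, n_new=20)
print(new_samples.shape)  # (20, 5)
```

Because each synthetic point lies on a segment between two real minority points, the new samples stay inside the minority class's region of feature space rather than duplicating existing rows.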