Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Faces Dataset: PubFig05
This is a subset of the ''PubFig83'' dataset [1], containing 100 images each of the 5 celebrities who are most difficult to recognise (each celebrity is treated as a class in the classification problem). For each celebrity, we took the 100 images and separated them into training and testing sets of 90 and 10 images, respectively:
Person: Jennifer Lopez; Katherine Heigl; Scarlett Johansson; Mariah Carey; Jessica Alba
Feature Extraction
To extract features from the images, we applied the HT-L3 model as described in [2] and obtained 25,600 features.
Feature Selection
The feature selection process, in brief, was as follows:
Entropy Filtering: First, we applied an implementation of Fayyad and Irani's [3] entropy-based heuristic to discretise the dataset, discarding features using the minimum description length (MDL) principle; only 4,878 features passed this entropy-based filter.
Class-Distribution Balancing: Next, we converted the dataset into binary-class problems by splitting it into 5 binary-class datasets using a one-vs-all setup. These datasets were therefore imbalanced at a ratio of 1:4, so we converted them into balanced binary-class datasets using random sub-sampling. Further processing of the dataset is described in the paper.
(alpha,beta)-k Feature Selection: To obtain a good feature set for training the classifier, we selected features using an approach based on the (alpha,beta)-k feature selection problem [4], which selects a minimum subset of features that maximises both within-class similarity and between-class dissimilarity. We applied the entropy filtering and (alpha,beta)-k feature subset selection methods in three ways and obtained different numbers of features (see the table below) after consolidating them into a binary-class dataset.
UAB: We applied the (alpha,beta)-k feature selection method to each of the balanced binary-class datasets and took the union of the features selected for each binary-class dataset. Finally, we applied the (alpha,beta)-k feature selection method once more to each of the binary-class datasets to obtain a final set of features.
IAB: We applied the (alpha,beta)-k feature selection method to each of the balanced binary-class datasets and took the intersection of the features selected for each binary-class dataset. Finally, we applied the (alpha,beta)-k feature selection method once more to each of the binary-class datasets to obtain a final set of features.
UEAB: We applied the (alpha,beta)-k feature selection method to each of the balanced binary-class datasets. Then, we applied entropy filtering and the (alpha,beta)-k feature selection method to each of the balanced binary-class datasets. Finally, we took the union of the features selected for each balanced binary-class dataset to obtain a final set of features.
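The three schemes differ mainly in how the per-class feature sets are combined. A minimal sketch of the union (UAB-style) versus intersection (IAB-style) consolidation step, with hypothetical feature index sets standing in for the real per-class (alpha,beta)-k selections:

```python
# Hypothetical per-class feature index sets, one per one-vs-all
# binary dataset (the real sets come from (alpha,beta)-k selection).
selected = {
    "class_0": {1, 4, 7, 9},
    "class_1": {2, 4, 7},
    "class_2": {4, 5, 7, 9},
}

# UAB-style consolidation: union of the per-class selections.
union_features = set().union(*selected.values())

# IAB-style consolidation: intersection of the per-class selections.
intersection_features = set.intersection(*selected.values())

print(sorted(union_features))         # every feature chosen for any class
print(sorted(intersection_features))  # only features chosen for all classes
```

The union keeps any feature useful for at least one class, while the intersection keeps only features useful for every class, which is why the three schemes yield different feature counts.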
All of these datasets are inside the compressed folder, which also contains a document describing the process in detail.
References
[1] Pinto, N., Stone, Z., Zickler, T., & Cox, D. (2011). Scaling up biologically-inspired computer vision: A case study in unconstrained face recognition on facebook. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on (pp. 35–42).
[2] Cox, D., & Pinto, N. (2011). Beyond simple features: A large-scale feature search approach to unconstrained face recognition. In Automatic Face Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on (pp. 8–15).
[3] Fayyad, U. M., & Irani, K. B. (1993). Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In International Joint Conference on Artificial Intelligence (pp. 1022–1029).
[4] Berretta, R., Mendes, A., & Moscato, P. (2005). Integer programming models and algorithms for molecular classification of cancer from microarray data. In Proceedings of the Twenty-eighth Australasian conference on Computer Science - Volume 38 (pp. 361–370). 1082201: Australian Computer Society, Inc.
Context Most work in Natural Language Processing (NLP) is done for the English language, since it is considered the vehicle of knowledge transmission. However, working on other languages is also very relevant to bring the technology to more people.
Spanish is the world's second language by number of native speakers: over 493 million people. If we add to this the people with limited proficiency in the language and the 24 million students of Spanish as a foreign language, we reach 591 million potential speakers (7.5% of the world's population), according to this year's edition of the Cervantes Institute yearbook.
But still, the lack of corpora for NLP research in Spanish is evident: when I searched for a dataset for binary sentiment-based text classification, I couldn't find anything. And this is how I ended up compiling my own collection of hotel reviews retrieved from TripAdvisor, which can be used for both binary and multi-class classification (and many other sentiment analysis approaches).
The Andalusian Hotel Reviews Corpus is my first ever dataset, and it gave me the idea for a web-scraping algorithm which I plan to use to update this dataset and create new ones, so stay tuned!
Content AHR is a dataset containing 18,172 hotel reviews in Spanish. 16,356 of them were retrieved from TripAdvisor by me in December 2021, and the rest derives from the COAH corpus (Corpus of Opinions about Andalusian Hotels), which was compiled by the SINAI research group in 2014. This corpus is publicly available and can be accessed in .xml format from SINAI's website.
I also include a small, but balanced version of this dataset, containing 7,615 reviews in total.
Here you can find detailed information about the columns of the .csv file:
title - the review's title
rating - the rating that the user gave to the hotel on a 5-star scale
review_text - the review's text
location - references to the city and the region of the hotel
hotel - the hotel's name
label - the label for binary classification. NOTE: all neutral reviews (3-star rating) are tagged with a «3» and must be removed to perform binary classification.
It is worth mentioning that the COAH corpus does not provide information about the location and the name of the hotel reviewed, so these columns were filled with NaNs.
Class imbalance This dataset is highly imbalanced: it seems like Andalusian hotels are generally great 😄. I've also uploaded a reduced and balanced version, in case you don't want to address the rare-event detection problem.
Citations The reference to the COAH corpus:
Molina-González, M. D., Martínez-Cámara, E., Martín-Valdivia, M. T., Ureña-López, L. A. (2014). Cross-domain sentiment analysis using spanish opinionated words. Natural Language Processing and Information Systems, Lecture Notes in Computer Science, vol. 8455, pp. 214-219. Springer International Publishing. DOI: 10.1007/978-3-319-07983-7_28
Inspiration This data is suitable for a variety of sentiment analysis tasks:
Binary sentiment classification (don't forget to remove neutral reviews)
Multi-class sentiment classification
Prediction of the review's rating
Topic modeling on reviews
…and all other tasks that your imagination suggests.
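For the binary setup, the neutral rows must be dropped first. A minimal pandas sketch, using a toy stand-in frame since the real .csv filename is not fixed here:

```python
import pandas as pd

# Toy stand-in with the same columns as the AHR .csv; with the real
# data you would call pd.read_csv on the downloaded file instead.
df = pd.DataFrame({
    "title": ["Great stay", "Awful", "Average"],
    "rating": [5, 1, 3],
    "review_text": ["Lovely hotel.", "Never again.", "It was fine."],
    "location": ["Sevilla, Andalucia", "Granada, Andalucia", None],
    "hotel": ["Hotel A", "Hotel B", None],
    "label": [1, 0, 3],
})

# Binary classification: drop neutral (3-star) reviews, tagged label == 3.
binary_df = df[df["label"] != 3]

# Multi-class work or rating prediction would instead keep all rows
# and target the 1-5 star "rating" column.
X, y = binary_df["review_text"], binary_df["label"]
print(len(binary_df))  # 2 rows remain
```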
Original Data Source: Andalusian Hotels’ Reviews
Open Data Commons Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
The State Grid Corporation of China (SGCC) dataset, with 1000 records, was used in the model. It is a key resource in the field of power distribution and management, with a large and varied set of data about electricity transport and grid operations. The dataset contains many different kinds of information, such as historical and real-time data on energy use, grid infrastructure, the integration of green energy, and grid performance. It is a key part of making power distribution networks more reliable and efficient, helping with tasks such as predicting demand, monitoring the grid, and detecting problems. Researchers, energy providers, and lawmakers can use this information to learn important things about electricity usage trends, the health of the grid, and the integration of green energy sources, helping the electric power industry develop new data-driven strategies and ideas.
Electricity theft detection dataset released by the State Grid Corporation of China (SGCC). The file data set.csv contains 1,037 columns and 42,372 rows of electricity consumption from 1 January 2014 to 30 October 2016. The first column is the consumer ID, which is alphanumeric. Columns 2 to 1,036 give the daily electricity consumption. The last column, named flag, holds the labels as 0 and 1 values. The small version of the dataset, datasetsmall.csv, only contains the electricity consumption for January 2014.
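The positional layout described above can be sliced directly with pandas. The frame below is a miniature stand-in: the column names are assumptions, only the positions match the description:

```python
import pandas as pd

# Miniature frame with the same positional layout as data set.csv:
# first column consumer ID, middle columns daily readings, last column flag.
# (The real file has 42,372 rows and 1,037 columns.)
df = pd.DataFrame({
    "CONS_NO": ["A001", "A002"],   # alphanumeric consumer ID (assumed name)
    "2014-01-01": [3.2, 0.0],
    "2014-01-02": [2.9, 0.1],
    "2014-01-03": [3.5, 0.0],
    "FLAG": [0, 1],                # 1 = electricity theft (assumed name)
})

consumer_id = df.iloc[:, 0]     # column 1: consumer ID
consumption = df.iloc[:, 1:-1]  # columns 2 .. n-1: daily consumption
labels = df.iloc[:, -1]         # last column: 0/1 theft flag
print(consumption.shape)  # (2, 3)
```

Positional (`iloc`) slicing keeps the code independent of whatever the real header names turn out to be.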
The data for this competition is from the RAICOM Mission Application Competition and Mo in China, originating from https://www.kaggle.com/datasets/uciml/mushroom-classification/
The copyright of datasets belongs to the organizers of "RAICOM Mission Application Competition"
The results of the Official Baseline are:
Accuracy: 0.7464409388226241
Precision: 0.7591353576942872
Recall: 0.6344086021505376
F1: 0.6911902530459232
Confusion matrix:
[[2405 468]
[ 850 1475]]
Mushrooms are a beloved delicacy, but beneath their glamorous appearance they may harbor deadly dangers. China is one of the countries with the largest variety of mushrooms in the world; at the same time, mushroom poisoning is one of the most serious food safety issues in China. According to relevant reports, in 2021 China investigated 327 mushroom poisoning incidents, involving 923 patients and 20 deaths, a total mortality rate of 2.17%. For non-professionals, it is impossible to distinguish poisonous mushrooms from edible ones by their appearance, shape, color, and so on; there is no simple standard that separates the two. To determine whether mushrooms are edible, it is necessary to collect mushrooms with different characteristic attributes and analyze whether they are toxic. In this competition, 22 characteristic attributes of mushrooms are analyzed to obtain a mushroom edibility model, which can better predict whether mushrooms are edible.
In the context of this mushroom usability model competition, several performance metrics can be utilized to evaluate the predictive accuracy of the model. Among them, the F1 score stands out due to its ability to provide a balance between precision and recall, which are crucial for this classification problem where distinguishing between poisonous and edible mushrooms can have severe real-world implications.
F1 Score The F1 score is the harmonic mean of precision and recall, and it is particularly useful in binary classification scenarios with imbalanced class distribution:
Precision (also known as positive predictive value) indicates the proportion of true positive observations among all observations classified as positive. It measures the accuracy of the positive predictions. \( \text{Precision} = \frac{TP}{TP + FP} \)
Recall (also known as sensitivity or true positive rate) measures the proportion of true positive observations out of all actual positives. It assesses the ability to capture all the true positive instances. \( \text{Recall} = \frac{TP}{TP + FN} \)
The F1 score is calculated as follows:
\[ \text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
Why F1 Score? Balance Between Precision and Recall: In the context where mushroom classification error can have critical health impacts, favoring either precision or recall solely might be dangerous. F1 score provides a more comprehensive evaluation by balancing these errors.
Handling Imbalanced Classes: Mushroom datasets often have an imbalance between the number of edible and poisonous instances. The F1 score is less influenced by the skewed class distributions compared to accuracy.
Critical Application: Misclassifying a poisonous mushroom as edible can lead to severe health risks. Hence, ensuring both high precision (minimizing false positives) and high recall (capturing all true positives) is crucial. The F1 score encapsulates the tradeoff between these aspects well.
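As a sanity check, all four baseline numbers can be reproduced from the reported confusion matrix, assuming it uses the scikit-learn layout [[TN, FP], [FN, TP]]:

```python
# Entries taken from the Official Baseline's confusion matrix,
# read with the assumed scikit-learn layout [[TN, FP], [FN, TP]].
tn, fp = 2405, 468
fn, tp = 850, 1475

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# -> 0.7464 0.7591 0.6344 0.6912
```

The fact that all four values match the baseline confirms the assumed matrix layout.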
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Description: Good and Bad Omelette Classification
This dataset is created for the binary classification task of identifying whether an omelette sample is of good or bad quality. The primary objective is to develop a machine learning model that can accurately classify an omelette image or sample based on predefined quality parameters.
1000 Good Omelettes
1000 Bad Omelettes
Each sample in the dataset represents one omelette and is accompanied by a corresponding label:
1 or "Good" for high-quality omelettes
0 or "Bad" for low-quality omelettes
Images: High-resolution photos of omelettes under consistent lighting conditions. Each image is labeled accordingly.
Optional Metadata (if available):
Texture metrics (e.g., crispiness, fluffiness)
Color balance (golden brown vs burnt or undercooked)
Shape regularity
Ingredients used
Cooking time and temperature
Quality Criteria (Labeling Guidelines):
i) Good Omelette Characteristics:
Evenly cooked (not burnt or undercooked)
Appealing golden-brown color
Balanced texture (not rubbery or overly crispy)
Well-shaped and visually appealing
Includes expected ingredients (e.g., eggs, milk, seasoning, optional vegetables)
ii) Bad Omelette Characteristics:
Undercooked or overcooked (burnt)
Pale or overly dark in color
Irregular shape, torn or folded poorly
Displeasing texture (e.g., too runny or rubbery)
Missing or wrong ingredients
Purpose of the Dataset:
The dataset is intended for:
Training and evaluating computer vision or quality assessment models
Image classification tasks in food quality control
Benchmarking performance of different ML algorithms in binary classification
Applications:
i) Automated food quality inspection in restaurants or food delivery services
ii) Educational tools for culinary training
iii) Quality assurance in pre-packaged meal production
Uploads: 1,000 properly labeled bad pictures and 1,000 good pictures are uploaded.
Ethics and Bias Consideration:
Care has been taken to ensure diversity in sample acquisition—different cooking styles, lighting, and plating are considered to avoid bias.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the accompanying dataset generated by the GitHub project https://github.com/tonyreina/tdc-tcr-epitope-antibody-binding. In that repository I show how to create machine learning models for predicting whether a T-cell receptor (TCR) and a protein epitope will bind to each other.
A model that can predict how well a TCR binds to an epitope can lead to more effective immunotherapy treatments. For example, in anti-cancer therapies it is important for the T-cell receptor to bind to the protein marker on the cancer cell so that the T-cell (actually, the T-cell's friends in the immune system) can kill the cancer cell.
These are Facebook's Evolutionary Scale Model (ESM-1b) embeddings for the TDC dataset for the TCR-Epitope Binding Affinity Prediction Task. The Facebook model is open-sourced and can be downloaded via the open-source bio-embeddings Python library.
To load them into Python use the Pandas library:
import pandas as pd

# Each pickle file holds a DataFrame with the columns described below.
train_data = pd.read_pickle("train_data.pkl")
validation_data = pd.read_pickle("validation_data.pkl")
test_data = pd.read_pickle("test_data.pkl")
The epitope_aa and the tcr_full columns are the protein (peptide) sequences for the epitope and the T-cell receptor, respectively. The letters correspond to the standard amino acid codes.
The epitope_smi column is the SMILES notation for the chemical structure of the epitope. We won't use this information. Instead, the ESM-1b embedder should be sufficient for the input to our binary classification model.
The tcr column is the hypervariable CDR3 loop. It's the part of the TCR that actually binds (assuming it binds) to the epitope.
The label column is whether the two proteins bind. 0 = No. 1 = Yes.
The tcr_vector and epitope_vector columns are the bio-embeddings of the TCR and epitope sequences generated by the Facebook ESM-1b model. These two vectors can be used to create a machine learning model that predicts whether the combination will produce a successful protein binding.
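A minimal sketch of how the two embedding columns can feed a classifier. The vectors below are random stand-ins for the real ESM-1b embeddings, and scikit-learn's logistic regression is just one reasonable choice, not the repository's exact model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Random stand-ins for the tcr_vector / epitope_vector columns
# (the real ESM-1b embeddings are much higher-dimensional).
n, dim = 200, 16
tcr_vecs = rng.normal(size=(n, dim))
epitope_vecs = rng.normal(size=(n, dim))
labels = rng.integers(0, 2, size=n)  # 0 = no binding, 1 = binding

# One simple pairing strategy: concatenate the two embeddings per pair.
X = np.hstack([tcr_vecs, epitope_vecs])

clf = LogisticRegression(max_iter=1000).fit(X, labels)
proba = clf.predict_proba(X[:1])  # binding probabilities for the first pair
print(proba.shape)  # (1, 2)
```

Concatenation is the simplest way to present a (TCR, epitope) pair to a standard classifier; more elaborate pairings (differences, attention over the two vectors) are also possible.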
From the TDC website:
T-cells are an integral part of the adaptive immune system, whose survival, proliferation, activation and function are all governed by the interaction of their T-cell receptor (TCR) with immunogenic peptides (epitopes). A large repertoire of T-cell receptors with different specificity is needed to provide protection against a wide range of pathogens. This new task aims to predict the binding affinity given a pair of TCR sequence and epitope sequence.
Weber et al.
Dataset Description: The dataset is from Weber et al., who assembled a large and diverse dataset from the VDJdb database and the ImmuneCODE project. It uses human TCR-beta chain sequences. Since this dataset is highly imbalanced, the authors excluded epitopes with fewer than 15 associated TCR sequences and downsampled to a limit of 400 TCRs per epitope. The dataset contains amino acid sequences either for the entire TCR or only for the hypervariable CDR3 loop. Epitopes are provided as amino acid sequences. Since Weber et al. proposed representing the peptides as SMILES strings (which reformulates the problem as protein-ligand binding prediction), the SMILES strings of the epitopes are also included. 50% negative samples were generated by shuffling the pairs, i.e. associating TCR sequences with epitopes they have not been shown to bind.
Task Description: Binary classification. Given the epitope (a peptide, either represented as amino acid sequence or as SMILES) and a T-cell receptor (amino acid sequence, either of the full protein complex or only of the hypervariable CDR3 loop), predict whether the epitope binds to the TCR.
Dataset Statistics: 47,182 TCR-Epitope pairs between 192 epitopes and 23,139 TCRs.
References:
Weber, Anna, Jannis Born, and María Rodriguez Martínez. “TITAN: T-cell receptor specificity prediction with bimodal attention networks.” Bioinformatics 37.Supplement_1 (2021): i237-i244.
Bagaev, Dmitry V., et al. “VDJdb in 2019: database extension, new analysis infrastructure and a T-cell receptor motif compendium.” Nucleic Acids Research 48.D1 (2020): D1057-D1062.
Dines, Jennifer N., et al. “The ImmuneRACE study: A prospective multicohort study of immune response action to COVID-19 events with the ImmuneCODE™ open access database.” medRxiv (2020).
Dataset License: CC BY 4.0.
Contributed by: Anna Weber and Jannis Born.
The Facebook ESM-1b model has the MIT license and was published in:
Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, Rob Fergus. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. bioRxiv 622803; doi: https://doi.org/10.1101/622803 https://www.biorxiv.org/content/10.1101/622803v4
License: unknown (https://choosealicense.com/licenses/unknown/)
Dataset Card for SST-2
Dataset Summary
The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/sst2.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ViSIR
Our dataset is a combination of ViHSD and ViHOS. Initially, we planned to use only ViHSD and relabel it into binary categories for toxic and non-toxic comment classification. However, after preprocessing, we noticed a class imbalance, with a significant skew toward non-toxic labels. To address this, we extracted approximately 10,000 toxic comments from ViHOS to balance the dataset.
Acknowledgment
This dataset is built upon the following datasets:
ViHSD
ViHOS
We… See the full description on the dataset page: https://huggingface.co/datasets/UngLong/ViSIR.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of the gene expression datasets. Number of samples, number of features, and class-wise frequency distribution are shown against each dataset.
GNU General Public License v3.0: https://www.gnu.org/licenses/gpl-3.0.html
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Synthetic Minority Over-sampling Technique (SMOTE) is a machine learning approach to addressing class imbalance in datasets, and it is beneficial for identifying antimicrobial resistance (AMR) patterns. In AMR studies, datasets often contain more susceptible isolates than resistant ones, leading to biased model performance. SMOTE overcomes this issue by generating synthetic samples of the minority class (resistant isolates) through interpolation rather than simple duplication, thereby improving model generalization. When applied to AMR prediction, SMOTE enhances the ability of classification models to accurately identify resistant Escherichia coli strains by balancing the dataset, ensuring that machine learning algorithms do not overlook rare resistance patterns. It is commonly used with classifiers like decision trees, support vector machines (SVM), and deep learning models to improve predictive accuracy. By mitigating class imbalance, SMOTE enables robust AMR detection, aiding in early identification of drug-resistant bacteria and informing antibiotic stewardship efforts.

Supervised machine learning is widely used in Escherichia coli genomic analysis to predict antimicrobial resistance, virulence factors, and strain classification. By training models on labeled genomic data (e.g., the presence or absence of resistance genes, SNP profiles, or MLST types), these classifiers help identify patterns and make accurate predictions.

10 supervised machine learning classifiers for E. coli genome analysis:

Logistic regression (LR): A simple yet effective statistical model for binary classification, such as predicting antibiotic resistance or susceptibility in E. coli.

Linear support vector machine (Linear SVM): Finds the optimal hyperplane to separate E. coli strains based on genomic features such as gene presence or sequence variations.

Radial basis function kernel support vector machine (RBF-SVM): A more flexible version of SVM that uses a non-linear kernel to capture complex relationships in genomic data, improving classification accuracy.

Extra trees classifier: A tree-based ensemble method that enhances classification by randomly selecting features and thresholds, improving robustness in E. coli strain differentiation.

Random forest (RF): An ensemble learning method that constructs multiple decision trees, reducing overfitting and improving prediction accuracy for resistance genes and virulence factors.

AdaBoost: A boosting algorithm that combines weak classifiers iteratively, refining predictions and improving the identification of antimicrobial resistance patterns.

XGBoost: An optimized gradient boosting algorithm that efficiently handles large genomic datasets, commonly used for high-accuracy predictions in E. coli classification.

Naïve Bayes (NB): A probabilistic classifier based on Bayes' theorem, suitable for predicting resistance phenotypes from genomic features.

Linear discriminant analysis (LDA): A statistical approach that maximizes class separability, helping distinguish between resistant and susceptible E. coli strains.

Quadratic discriminant analysis (QDA): A variation of LDA that allows for non-linear decision boundaries, improving classification in datasets with complex genomic structures.

When applied to E. coli genomes, these classifiers help predict antibiotic resistance, track outbreak strains, and understand genomic adaptations. Combining them with feature selection and optimization techniques enhances accuracy, making them valuable tools in bacterial genomics and clinical research.
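The interpolation step at the heart of SMOTE can be sketched in a few lines. This is a bare-bones illustration of the idea with synthetic stand-in data; real work should use imbalanced-learn's SMOTE implementation:

```python
import numpy as np

# A bare-bones sketch of the SMOTE idea: synthesize new minority samples
# by interpolating between a minority point and one of its nearest
# minority-class neighbours.
rng = np.random.default_rng(42)

# Stand-in minority class, e.g. resistant isolates' feature vectors.
minority = rng.normal(loc=2.0, size=(10, 5))

def smote_like(X, n_new, k=3, rng=rng):
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        # k nearest minority neighbours of X[i] (excluding itself)
        d = np.linalg.norm(X - X[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]
        j = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)

new_samples = smote_like(minority, n_new=20)
print(new_samples.shape)  # (20, 5)
```

Because each synthetic point lies on a segment between two real minority points, the new samples stay inside the minority class's region of feature space rather than duplicating existing rows.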