100+ datasets found
  1. Data from: RESEARCH ON IDENTIFICATION AND CLASSIFICATION METHOD OF...

    • scielo.figshare.com
    tiff
    Updated Feb 14, 2024
    Cite
    Min Jin; Bowen Yang; Chunguang Wang (2024). RESEARCH ON IDENTIFICATION AND CLASSIFICATION METHOD OF IMBALANCED DATA SET OF PIG BEHAVIOR [Dataset]. http://doi.org/10.6084/m9.figshare.23290691.v1
    Explore at:
    Available download formats: tiff
    Dataset updated
    Feb 14, 2024
    Dataset provided by
    SciELO (http://www.scielo.org/)
    Authors
    Min Jin; Bowen Yang; Chunguang Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT To address the problem of the low accuracy and poor robustness of modeling methods for imbalanced data sets of pig behavior identification and classification, the three commonly used re-sampling methods of under-sampling, SMOTE and Borderline-SMOTE are compared, and an adaptive boundary data augmentation algorithm AD-BL-SMOTE is proposed. The activity of the pigs was measured using triaxial accelerometers, which were fixed on the backs of the pigs. A multilayer feed-forward neural network was trained and validated with 21 input features to classify four pig activities: lying, standing, walking, and exploring. The results showed that re-sampling methods are an effective way to improve the performance of pig behavior identification and classification. Moreover, AD-BL-SMOTE could yield greater improvements in classification performance than the other three methods for balancing the training data set. The overall major mean accuracy of lying, standing, walking, and exploring by pigs A, B and C was significantly improved by using AD-BL-SMOTE, reaching 91.8%, 93.0% and 96.0%, respectively.
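    As a rough illustration of the baseline re-sampling strategies compared above (the proposed AD-BL-SMOTE is not available as a library routine, so it is not shown), here is a minimal Python sketch using the imbalanced-learn package; the feature matrix and labels are placeholders standing in for the 21 accelerometer-derived features and the four behaviour classes.

    # Minimal sketch (not the authors' AD-BL-SMOTE): compare the baseline
    # re-sampling strategies from the abstract using imbalanced-learn.
    import numpy as np
    from imblearn.over_sampling import SMOTE, BorderlineSMOTE
    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    # Placeholders for the 21 accelerometer features and 4 behaviour labels.
    X = rng.normal(size=(2000, 21))
    y = rng.choice(4, size=2000, p=[0.7, 0.2, 0.07, 0.03])  # imbalanced classes

    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    for name, sampler in [
        ("under-sampling", RandomUnderSampler(random_state=0)),
        ("SMOTE", SMOTE(random_state=0)),
        ("Borderline-SMOTE", BorderlineSMOTE(random_state=0)),
    ]:
        X_res, y_res = sampler.fit_resample(X_train, y_train)
        clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0)
        clf.fit(X_res, y_res)
        print(name, "accuracy:", clf.score(X_test, y_test))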

  2. Data from: color classification

    • kaggle.com
    zip
    Updated Apr 20, 2018
    Cite
    Aydin Ayanzadeh (2018). color classification [Dataset]. https://www.kaggle.com/ayanzadeh93/color-classification
    Explore at:
    Available download formats: zip (169343980 bytes)
    Dataset updated
    Apr 20, 2018
    Authors
    Aydin Ayanzadeh
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Introduction

    Color classification is an important application that is used in many areas. For example, systems that perform daily-life analysis can benefit from this classification process. Many classification algorithms can be used for this task; among the most popular machine learning algorithms are neural networks, decision trees, k-nearest neighbors, Bayes networks, and support vector machines. In this work, SVMs are used for training and a classifier model is obtained. The SVM algorithm is a supervised learning method and, like all supervised learning methods, is applied to regression and classification problems. It is typically used to train a model that separates and classifies differently labeled samples. The goal of training an SVM is to create an optimal hyperplane that separates the data into different classes; this hyperplane is located as far away from the data points as possible to reduce classification errors.
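    A minimal sketch of the SVM training step described above, using scikit-learn; the mean-RGB features and placeholder arrays are illustrative assumptions rather than the exact pipeline used for this dataset.

    # Minimal SVM sketch with scikit-learn. The features below are placeholders;
    # a real pipeline would compute one feature vector per image (e.g., the
    # average RGB of the pixels selected by the class mask).
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.random((160, 3))        # one mean-RGB feature vector per image
    y = rng.integers(0, 8, 160)     # 8 color classes (yellow, black, white, ...)

    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    clf.fit(X, y)
    print(clf.predict(X[:5]))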

    Dataset

    The dataset contains about 80 training images covering all color classes and 90 images for the test set. The colors prepared for this application are yellow, black, white, green, red, orange, blue, and violet. In this implementation, basic colors are preferred for classification, and a dataset containing images of these basic colors was created. The dataset also includes masks for all images. These masks were created by binarizing the images: the pixels belonging to the class color were painted white and the remaining pixels black.

  3. MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE...

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Apr 11, 2025
    + more versions
    Cite
    Dashlink (2025). MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE CLUSTERING [Dataset]. https://catalog.data.gov/dataset/multi-label-asrs-dataset-classification-using-semi-supervised-subspace-clustering
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE CLUSTERING
    Mohammad Salim Ahmed, Latifur Khan, Nikunj Oza, and Mandava Rajeswari

    Abstract. There has been a lot of research targeting text classification. Much of it focuses on a particular characteristic of text data: multi-labelity. This arises due to the fact that a document may be associated with multiple classes at the same time. The consequence of such a characteristic is the low performance of traditional binary or multi-class classification techniques on multi-label text data. In this paper, we propose a text classification technique that considers this characteristic and provides very good performance. Our multi-label text classification approach is an extension of our previously formulated [3] multi-class text classification approach called SISC (Semi-supervised Impurity based Subspace Clustering). We call this new classification model SISC-ML (SISC Multi-Label). Empirical evaluation on the real-world multi-label NASA ASRS (Aviation Safety Reporting System) data set reveals that our approach outperforms state-of-the-art text classification as well as subspace clustering algorithms.

  4. Plant Growth Data Classification

    • kaggle.com
    zip
    Updated Jul 10, 2024
    Cite
    gortorozyannnn (2024). Plant Growth Data Classification [Dataset]. https://www.kaggle.com/datasets/gorororororo23/plant-growth-data-classification
    Explore at:
    Available download formats: zip (4561 bytes)
    Dataset updated
    Jul 10, 2024
    Authors
    gortorozyannnn
    Description
    Content

    For this "Plant Growth Data Classification" dataset, the prediction task typically involves predicting or classifying the growth milestone that a plant reaches based on the provided environmental and management factors, such as soil type, sunlight hours, water frequency, fertilizer type, temperature, and humidity. This prediction can help in understanding how different conditions influence plant growth and can be valuable for optimizing agricultural practices or greenhouse management. A minimal modeling sketch follows the column list below.

    Description of the columns:
    • Soil_Type: The type or composition of soil in which the plants are grown.
    • Sunlight_Hours: The duration or intensity of sunlight exposure received by the plants.
    • Water_Frequency: How often the plants are watered, indicating the watering schedule.
    • Fertilizer_Type: The type of fertilizer used for nourishing the plants.
    • Temperature: The ambient temperature conditions under which the plants are grown.
    • Humidity: The level of moisture or humidity in the environment surrounding the plants.
    • Growth_Milestone: Descriptions or markers indicating stages or significant events in the growth process of the plants.
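    Below is a minimal, hypothetical modeling sketch for this prediction task using pandas and scikit-learn; the CSV file name and the choice of classifier are assumptions, while the column names follow the description above.

    # Hypothetical sketch: predict Growth_Milestone from the listed columns.
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("plant_growth_data.csv")  # hypothetical file name
    X = df.drop(columns=["Growth_Milestone"])
    y = df["Growth_Milestone"]

    categorical = ["Soil_Type", "Water_Frequency", "Fertilizer_Type"]

    pre = ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), categorical)],
        remainder="passthrough",  # numeric columns pass through unchanged
    )

    model = Pipeline([("pre", pre), ("clf", RandomForestClassifier(random_state=0))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model.fit(X_tr, y_tr)
    print("Accuracy:", model.score(X_te, y_te))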
  5. Comparison with base-line deep learning methods.

    • plos.figshare.com
    xls
    Updated Apr 10, 2024
    + more versions
    Cite
    Mahmood Ashraf; Raed Alharthi; Lihui Chen; Muhammad Umer; Shtwai Alsubai; Ala Abdulmajid Eshmawi (2024). Comparison with base-line deep learning methods. [Dataset]. http://doi.org/10.1371/journal.pone.0300013.t009
    Explore at:
    Available download formats: xls
    Dataset updated
    Apr 10, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Mahmood Ashraf; Raed Alharthi; Lihui Chen; Muhammad Umer; Shtwai Alsubai; Ala Abdulmajid Eshmawi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Hyperspectral Images (HSI) classification is a challenging task due to a large number of spatial-spectral bands of images with high inter-similarity, extra variability classes, and complex region relationships, including overlapping and nested regions. Classification becomes a complex problem in remote sensing images like HSIs. Convolutional Neural Networks (CNNs) have gained popularity in addressing this challenge by focusing on HSI data classification. However, the performance of 2D-CNN methods heavily relies on spatial information, while 3D-CNN methods offer an alternative approach by considering both spectral and spatial information. Nonetheless, the computational complexity of 3D-CNN methods increases significantly due to the large capacity size and spectral dimensions. These methods also face difficulties in manipulating information from local intrinsic detailed patterns of feature maps and low-rank frequency feature tuning. To overcome these challenges and improve HSI classification performance, we propose an innovative approach called the Attention 3D Central Difference Convolutional Dense Network (3D-CDC Attention DenseNet). Our 3D-CDC method leverages the manipulation of local intrinsic detailed patterns in the spatial-spectral features maps, utilizing pixel-wise concatenation and spatial attention mechanism within a dense strategy to incorporate low-rank frequency features and guide the feature tuning. Experimental results on benchmark datasets such as Pavia University, Houston 2018, and Indian Pines demonstrate the superiority of our method compared to other HSI classification methods, including state-of-the-art techniques. The proposed method achieved 97.93% overall accuracy on the Houston-2018, 99.89% on Pavia University, and 99.38% on the Indian Pines dataset with the 25 × 25 window size.
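    The snippet below sketches only the central-difference convolution idea referenced in this description, generalized to 3D in PyTorch; it is not the authors' full 3D-CDC Attention DenseNet, and the patch shape is an illustrative assumption.

    # Sketch of a 3D central-difference convolution block (PyTorch).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CentralDifferenceConv3d(nn.Module):
        def __init__(self, in_ch, out_ch, kernel_size=3, theta=0.7):
            super().__init__()
            self.conv = nn.Conv3d(in_ch, out_ch, kernel_size,
                                  padding=kernel_size // 2, bias=False)
            self.theta = theta  # weight of the central-difference term

        def forward(self, x):
            out = self.conv(x)  # vanilla 3D convolution
            # Central-difference term: subtract the kernel sum applied to the center voxel.
            kernel_sum = self.conv.weight.sum(dim=(2, 3, 4), keepdim=True)
            out_center = F.conv3d(x, kernel_sum)
            return out - self.theta * out_center

    # Example: a batch of hyperspectral patches (batch, channels, bands, height, width).
    x = torch.randn(2, 1, 30, 25, 25)
    y = CentralDifferenceConv3d(1, 16)(x)
    print(y.shape)  # torch.Size([2, 16, 30, 25, 25])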

  6. Pseudo-Label Generation for Multi-Label Text Classification

    • catalog.data.gov
    • datasets.ai
    • +1more
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). Pseudo-Label Generation for Multi-Label Text Classification [Dataset]. https://catalog.data.gov/dataset/pseudo-label-generation-for-multi-label-text-classification
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    With the advent and expansion of social networking, the amount of generated text data has seen a sharp increase. In order to handle such a huge volume of text data, new and improved text mining techniques are a necessity. One of the characteristics of text data that makes text mining difficult is multi-labelity. In order to build a robust and effective text classification method, which is an integral part of text mining research, we must consider this property more closely. This kind of property is not unique to text data, as it can be found in non-text (e.g., numeric) data as well; however, in text data it is most prevalent. This property also puts the text classification problem in the domain of multi-label classification (MLC), where each instance is associated with a subset of class labels instead of a single class, as in conventional classification. In this paper, we explore how the generation of pseudo labels (i.e., combinations of existing class labels) can help us perform better text classification and under what kind of circumstances. During classification, the high and sparse dimensionality of text data has also been considered. Although here we are proposing and evaluating a text classification technique, our main focus is on handling the multi-labelity of text data while utilizing the correlation among the multiple labels existing in the data set. Our text classification technique is called pseudo-LSC (pseudo-Label Based Subspace Clustering). It is a subspace clustering algorithm that considers the high and sparse dimensionality as well as the correlation among different class labels during the classification process to provide better performance than existing approaches. Results on three real-world multi-label data sets provide insight into how multi-labelity is handled in our classification process and show the effectiveness of our approach.
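    The following minimal sketch illustrates the pseudo-label idea described above (mapping each observed combination of class labels to a single pseudo-label, a label-powerset view); it is not the pseudo-LSC algorithm itself.

    # Sketch: derive pseudo-labels from combinations of existing class labels.
    import numpy as np

    # Y: binary multi-label indicator matrix, one row per document.
    Y = np.array([
        [1, 0, 1],   # labels {0, 2}
        [0, 1, 0],   # labels {1}
        [1, 0, 1],   # labels {0, 2} -> same pseudo-label as row 0
        [1, 1, 0],   # labels {0, 1}
    ])

    combos = {}
    pseudo_labels = []
    for row in map(tuple, Y):
        if row not in combos:
            combos[row] = len(combos)   # assign a new pseudo-label id
        pseudo_labels.append(combos[row])

    print(pseudo_labels)   # [0, 1, 0, 2]
    print(combos)          # mapping from label combination to pseudo-label id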

  7. Simple Fruit Tabular Classification Dataset

    • cubig.ai
    zip
    Updated Jul 8, 2025
    Cite
    CUBIG (2025). Simple Fruit Tabular Classification Dataset [Dataset]. https://cubig.ai/store/products/566/simple-fruit-tabular-classification-dataset
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 8, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
    Description

    1) Data Introduction
    • The Fruit Classification Dataset is designed to classify different types of fruits based on their spatial coordinates. It includes data points with 'x' and 'y' coordinates and their corresponding fruit class labels (apple, banana, orange), facilitating the development and testing of classification models for simple geometric data.

    2) Data Utilization
    (1) Fruit Classification data has characteristics that:
    • It contains detailed coordinates (x and y) for each fruit class, allowing for the visualization and analysis of fruit distribution in a two-dimensional space. This dataset is ideal for understanding basic classification algorithms and testing their performance.
    (2) Fruit Classification data can be used to:
    • Machine Learning Education: Supports the teaching and learning of classification techniques, data visualization, and feature extraction in an accessible and engaging manner.
    • Algorithm Testing: Provides a straightforward dataset for evaluating and comparing the performance of various classification algorithms in distinguishing between different fruit types based on coordinates.

  8. Data from: Evaluation of the preprocessing and training stages in text...

    • scielo.figshare.com
    jpeg
    Updated May 30, 2023
    Cite
    Lucas Marques Sathler Guimarães; Magali Rezende Gouvêa Meireles; Paulo Eduardo Maciel de Almeida (2023). Evaluation of the preprocessing and training stages in text classification algorithms in the context of information retrieval [Dataset]. http://doi.org/10.6084/m9.figshare.8162216.v1
    Explore at:
    Available download formats: jpeg
    Dataset updated
    May 30, 2023
    Dataset provided by
    SciELO (http://www.scielo.org/)
    Authors
    Lucas Marques Sathler Guimarães; Magali Rezende Gouvêa Meireles; Paulo Eduardo Maciel de Almeida
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: The amount of unstructured data grows with the popularization of the Internet. Texts in natural language represent a relevant and significant set for the analysis and production of knowledge. This work proposes a quantitative analysis of the preprocessing and training stages of a text classifier, which uses the feelings expressed by users as an attribute. An Artificial Neural Network was used as the classifier algorithm, and texts from the Amazon, IMDB, and Yelp sites were used for the experiments. The database allows the analysis of the expression of positive and negative feelings of users in evaluations of products and services in unstructured texts. Two distinct preprocessing processes and different trainings of the Artificial Neural Networks were carried out to classify the textual set. The results quantitatively confirm the importance of the preprocessing and training stages of the classifier, highlighting the importance of the vocabulary selected for the text representation and classification. The available classification techniques achieve satisfactory results. However, even when using two distinct preprocessing processes and identifying the best training process, it was not possible to completely eliminate the model's difficulty in learning and understanding classifications of feelings that involve subjective characteristics of human expression.
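    As a rough illustration of the kind of pipeline evaluated above, here is a hedged scikit-learn sketch contrasting two preprocessing variants (with and without English stop-word removal) feeding a small feed-forward neural network; the toy texts and parameter choices are assumptions, not the study's setup.

    # Sketch: bag-of-words preprocessing variants + a feed-forward neural network.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline

    texts = ["great product, works well", "terrible, broke after one day",
             "excellent service", "very disappointing quality"]
    labels = [1, 0, 1, 0]   # 1 = positive sentiment, 0 = negative

    for stop_words in (None, "english"):            # two preprocessing variants
        pipe = make_pipeline(
            CountVectorizer(stop_words=stop_words, max_features=5000),
            MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
        )
        pipe.fit(texts, labels)
        print(stop_words, pipe.predict(["works great", "broke immediately"]))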

  9. Pseudo-Label Generation for Multi-Label Text Classification - Dataset - NASA...

    • data.nasa.gov
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Pseudo-Label Generation for Multi-Label Text Classification - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/pseudo-label-generation-for-multi-label-text-classification
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    With the advent and expansion of social networking, the amount of generated text data has seen a sharp increase. In order to handle such a huge volume of text data, new and improved text mining techniques are a necessity. One of the characteristics of text data that makes text mining difficult is multi-labelity. In order to build a robust and effective text classification method, which is an integral part of text mining research, we must consider this property more closely. This kind of property is not unique to text data, as it can be found in non-text (e.g., numeric) data as well; however, in text data it is most prevalent. This property also puts the text classification problem in the domain of multi-label classification (MLC), where each instance is associated with a subset of class labels instead of a single class, as in conventional classification. In this paper, we explore how the generation of pseudo labels (i.e., combinations of existing class labels) can help us perform better text classification and under what kind of circumstances. During classification, the high and sparse dimensionality of text data has also been considered. Although here we are proposing and evaluating a text classification technique, our main focus is on handling the multi-labelity of text data while utilizing the correlation among the multiple labels existing in the data set. Our text classification technique is called pseudo-LSC (pseudo-Label Based Subspace Clustering). It is a subspace clustering algorithm that considers the high and sparse dimensionality as well as the correlation among different class labels during the classification process to provide better performance than existing approaches. Results on three real-world multi-label data sets provide insight into how multi-labelity is handled in our classification process and show the effectiveness of our approach.

  10. Description of datasets used for evaluation and comparison.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Ahmad B. A. Hassanat (2023). Description of datasets used for evaluation and comparison. [Dataset]. http://doi.org/10.1371/journal.pone.0207772.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ahmad B. A. Hassanat
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description of datasets used for evaluation and comparison.

  11. Classified Dataset

    • kaggle.com
    zip
    Updated Jun 25, 2020
    Cite
    Vineeth Vadlapalli (2020). Classified Dataset [Dataset]. https://www.kaggle.com/datasets/vvineeth/classified-dataset
    Explore at:
    Available download formats: zip (91443 bytes)
    Dataset updated
    Jun 25, 2020
    Authors
    Vineeth Vadlapalli
    Description

    Dataset

    This dataset was created by Vineeth Vadlapalli

    Released under Data files © Original Authors


  12. IRIS DATA SET WITH ANALYSIS AND DASHBOARD

    • kaggle.com
    zip
    Updated Oct 17, 2024
    Cite
    Mohamed Elkahwagy (2024). IRIS DATA SET WITH ANALYSIS AND DASHBOARD [Dataset]. https://www.kaggle.com/datasets/mohamedelkahwagy/iris-data-set-with-analysis-and-dashboard
    Explore at:
    Available download formats: zip (1073 bytes)
    Dataset updated
    Oct 17, 2024
    Authors
    Mohamed Elkahwagy
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Iris Petal and Sepal Dataset

    Description
    The Iris dataset is one of the most famous datasets in the field of machine learning and statistical classification. It was first introduced by British biologist and statistician Ronald Fisher in 1936 as an example of linear discriminant analysis. The dataset is widely used for educational purposes and model building in machine learning due to its simplicity and versatility.

    Dataset Overview
    The dataset contains 150 observations of Iris flowers from three species:
    • Iris Setosa
    • Iris Versicolor
    • Iris Virginica

    Each observation includes four numerical features:
    • Sepal Length (cm)
    • Sepal Width (cm)
    • Petal Length (cm)
    • Petal Width (cm)

    Additionally, the dataset provides a class label for the species of the Iris flower.

    Feature Descriptions:
    • Sepal Length: The length of the flower’s sepal in centimeters.
    • Sepal Width: The width of the flower’s sepal in centimeters.
    • Petal Length: The length of the flower’s petal in centimeters.
    • Petal Width: The width of the flower’s petal in centimeters.
    • Species: The class label that classifies the flower into one of three species (Setosa, Versicolor, Virginica).

    Data Summary:
    • 150 instances (50 samples per species)
    • 4 features (numeric data)
    • 1 target variable (categorical: species of the flower)

    Applications:
    The dataset is often used for:
    • Classification tasks: Building models to classify the species of Iris flowers.
    • Exploratory data analysis (EDA): Exploring relationships between features.
    • Data visualization: Plotting petal and sepal dimensions to understand patterns.
    • Predictive modeling: Training and testing machine learning algorithms such as k-nearest neighbors (KNN), support vector machines (SVM), and decision trees.

    Why This Dataset?
    The Iris dataset is ideal for beginners and experts alike, as it provides an easy introduction to supervised learning. It is perfect for understanding basic classification algorithms and exploring key concepts such as:
    • Multiclass classification
    • Feature correlation
    • Data visualization techniques

    This description is tailored for the Kaggle community and provides a clear overview of the dataset’s content and potential use cases.
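    A minimal sketch of the classic use case described above: a k-nearest neighbors classifier on the Iris data, here loaded from scikit-learn rather than from the Kaggle CSV.

    # KNN baseline on the Iris dataset.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    print("Test accuracy:", knn.score(X_test, y_test))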

  13. DBpedia Ontology

    • kaggle.com
    zip
    Updated Dec 2, 2023
    Cite
    The Devastator (2023). DBpedia Ontology [Dataset]. https://www.kaggle.com/datasets/thedevastator/dbpedia-ontology-dataset/code
    Explore at:
    Available download formats: zip (69520449 bytes)
    Dataset updated
    Dec 2, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    DBpedia Ontology

    Text Classification Dataset with 14 Classes

    By dbpedia_14 (From Huggingface) [source]

    About this dataset

    The DBpedia Ontology Classification Dataset, known as dbpedia_14, is a comprehensive and meticulously constructed dataset containing a vast collection of text samples. These samples have been expertly classified into 14 distinct and non-overlapping classes. The dataset draws its information from the highly reliable and up-to-date DBpedia 2014 knowledge base, ensuring the accuracy and relevance of the data.

    Each text sample in this extensive dataset consists of various components that provide valuable insights into its content. These components include a title, which succinctly summarizes the main topic or subject matter of the text sample, and content that comprehensively covers all relevant information related to a specific topic.

    To facilitate effective training of machine learning models for text classification tasks, each text sample is further associated with a corresponding label. This categorical label serves as an essential element for supervised learning algorithms to classify new instances accurately.

    Furthermore, this exceptional dataset is part of the larger DBpedia Ontology Classification Dataset with 14 Classes (dbpedia_14). It offers numerous possibilities for researchers, practitioners, and enthusiasts alike to conduct in-depth analyses ranging from sentiment analysis to topic modeling.

    Aspiring data scientists will find great value in utilizing this well-organized dataset for training their machine learning models. Although specific details about train.csv and test.csv files are not provided here due to their dynamic nature, they play pivotal roles during model training and testing processes by respectively providing labeled training samples and unseen test samples.

    Lastly, it's worth mentioning that users can refer to the included classes.txt file within this dataset for an exhaustive list of all 14 classes used in classifying these diverse text samples accurately.

    Overall, with its wealth of carefully curated textual data across multiple domains and precise class labels assigned based on well-defined categories derived from DBpedia 2014 knowledge base, the DBpedia Ontology Classification Dataset (dbpedia_14) proves instrumental in advancing research efforts related to natural language processing (NLP), text classification, and other related fields

    Research Ideas

    • Text classification: The DBpedia Ontology Classification Dataset can be used to train machine learning models for text classification tasks. With 14 different classes, the dataset is suitable for various classification tasks such as sentiment analysis, topic classification, or intent detection.
    • Ontology development: The dataset can also be used to improve or expand existing ontologies. By analyzing the text samples and their assigned labels, researchers can identify missing or incorrect relationships between concepts in the ontology and make improvements accordingly.
    • Semantic search engine: The DBpedia knowledge base is widely used in semantic search engines that aim to provide more accurate and relevant search results by understanding the meaning of user queries and matching them with structured data. This dataset can help in training models for improving the performance of these semantic search engines by enhancing their ability to classify and categorize information accurately based on user queries
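    As a concrete starting point for the text-classification idea above, the sketch below loads dbpedia_14 with the Hugging Face datasets library and fits a simple TF-IDF + logistic regression baseline; the subsample sizes are arbitrary, and the field names (title, content, label) follow the description above.

    # Baseline sketch on dbpedia_14 (Hugging Face datasets + scikit-learn).
    from datasets import load_dataset
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    ds = load_dataset("dbpedia_14")
    train = ds["train"].shuffle(seed=0).select(range(10_000))  # small subsample for speed
    test = ds["test"].shuffle(seed=0).select(range(2_000))

    def to_text(split):
        # Concatenate the title and content fields into one text per sample.
        return [t + " " + c for t, c in zip(split["title"], split["content"])]

    vec = TfidfVectorizer(max_features=50_000)
    X_train = vec.fit_transform(to_text(train))
    X_test = vec.transform(to_text(test))

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, train["label"])
    print("Test accuracy:", clf.score(X_test, test["label"]))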

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:------------|:------------|
    | label | The class label assigned to each text sample. (Categorical) |
    | title | The heading or name given to each text sample, providing some context or overview of its content. (Text) |

    File: test.csv

    | Column name | Description |
    |:--------------|:-----------------------...

  14. Refined Iris Dataset

    • kaggle.com
    zip
    Updated Jun 23, 2023
    Cite
    GIRITHARAN MANI (2023). Refined Iris Dataset [Dataset]. https://www.kaggle.com/datasets/mystifoe77/iris-clean-dataset
    Explore at:
    Available download formats: zip (1307 bytes)
    Dataset updated
    Jun 23, 2023
    Authors
    GIRITHARAN MANI
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset provides a refined version of the popular Iris dataset, tailored for enhanced usability in machine learning and data science applications. Key improvements include:
    - Data Quality: Removal of duplicate and inconsistent entries.
    - Feature Consistency: Verified feature distributions to ensure better modeling accuracy.
    - Enhanced Labeling: Clear and intuitive labels for easier interpretability.

    This dataset is ideal for beginners and professionals alike, offering a robust foundation for testing classification algorithms and exploring supervised learning workflows.

    Tags

    Classification, Machine Learning, Data Cleaning, Iris, Clean Data, Data Analysis

    File Details

    File Name: Iris_clean_dataset.csv
    - Size: 5.11 KB
    - Rows: 150
    - Columns: 6
    - Columns:
    1. Id
    2. SepalLengthCm
    3. SepalWidthCm
    4. PetalLengthCm
    5. PetalWidthCm
    6. Species

    Each row corresponds to a single observation of Iris flower measurements, including species classifications (Iris-setosa, Iris-versicolor, Iris-virginica).

    Usability

    Usability Score: 1.76
    This score reflects the dataset's ease of use for various machine learning and data analysis tasks.

    License

    License Type: CC BY 4.0
    You are free to use, modify, and distribute this dataset, provided appropriate credit is given to the original author.

    Expected Update Frequency

    Frequency: This dataset will not receive regular updates. However, feedback is welcomed for future revisions.

    Provenance

    Source: Original Iris dataset with modifications.
    Methodology: Data cleaning involved removing anomalies, revalidating measurements, and restructuring for compatibility with modern ML workflows.

    Encourage interaction:
    "_Your engagement improves this dataset’s visibility. Feel free to comment or share your use case._"

    Example Use Cases

    • Learning: Ideal for beginners experimenting with machine learning models like Logistic Regression, Random Forest, and KNN.
    • Research: Test your novel classification techniques on this cleaned version.
    • Application: Use it in practical ML projects for training supervised learning models.

    Notes to Users

    If you find this dataset helpful, consider leaving feedback or sharing your implementation in the Kaggle discussions section. Collaboration and suggestions are always welcome!


  15. Data from: Reviewing ensemble classification methods in breast cancer...

    • data.niaid.nih.gov
    • portalinvestigacion.um.es
    • +1more
    Updated Jan 29, 2025
    Cite
    Hosni, Mohamed; Abnane, Ibtissam; Idri, Ali; Carrillo de Gea, Juan Manuel; Fernández-Alemán, José Luis (2025). Reviewing ensemble classification methods in breast cancer (DATASET) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14767927
    Explore at:
    Dataset updated
    Jan 29, 2025
    Dataset provided by
    University of Murcia
    Universidad de Murcia
    University Mohammed V of Rabat
    Authors
    Hosni, Mohamed; Abnane, Ibtissam; Idri, Ali; Carrillo de Gea, Juan Manuel; Fernández-Alemán, José Luis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data on the mapping questions addressed by the selected studies. (MQ7 has 3 sub-questions: D: dataset, V: validation method, M: metrics.)

    The data is presented in a PDF file, MappingQuestions.pdf, which contains the list of responses to the mapping questions.

    Contact Information: For further information or inquiries about this dataset, please contact Juan Manuel Carrillo de Gea at jmcdg1@um.es.

  16. Fruits data for Binary classification

    • kaggle.com
    zip
    Updated Jan 21, 2024
    Cite
    Swap (2024). Fruits data for Binary classification [Dataset]. https://www.kaggle.com/datasets/swapnilnaique/fruits-data-for-binary-classification
    Explore at:
    Available download formats: zip (4660882 bytes)
    Dataset updated
    Jan 21, 2024
    Authors
    Swap
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset was curated for developing CNN models for binary classification in fruit recognition. It comprises images belonging to two classes: Apples and Mangos. The dataset is sourced from the Fruit-360 dataset.
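    A minimal, hypothetical Keras sketch for this binary Apples-vs-Mangos task; the directory layout, image size, and network shape are assumptions rather than details of this dataset.

    # Sketch: small binary CNN trained from an image directory (hypothetical layout).
    import tensorflow as tf

    train_ds = tf.keras.utils.image_dataset_from_directory(
        "fruits/train",                      # assumed directory with one folder per class
        image_size=(100, 100), batch_size=32, label_mode="binary")

    model = tf.keras.Sequential([
        tf.keras.layers.Rescaling(1.0 / 255),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(train_ds, epochs=5)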

  17. Data from: Datasets for a data-centric image classification benchmark for...

    • zenodo.org
    • openagrar.de
    txt, zip
    Updated Jul 5, 2023
    Cite
    Lars Schmarje; Lars Schmarje; Vasco Grossmann; Vasco Grossmann; Claudius Zelenka; Claudius Zelenka; Sabine Dippel; Sabine Dippel; Rainer Kiko; Rainer Kiko; Mariusz Oszust; Mariusz Oszust; Matti Pastell; Matti Pastell; Jenny Stracke; Jenny Stracke; Anna Valros; Anna Valros; Nina Volkmann; Nina Volkmann; Reinhard Koch; Reinhard Koch (2023). Datasets for a data-centric image classification benchmark for noisy and ambiguous label estimation [Dataset]. http://doi.org/10.5281/zenodo.7180818
    Explore at:
    Available download formats: zip, txt
    Dataset updated
    Jul 5, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lars Schmarje; Lars Schmarje; Vasco Grossmann; Vasco Grossmann; Claudius Zelenka; Claudius Zelenka; Sabine Dippel; Sabine Dippel; Rainer Kiko; Rainer Kiko; Mariusz Oszust; Mariusz Oszust; Matti Pastell; Matti Pastell; Jenny Stracke; Jenny Stracke; Anna Valros; Anna Valros; Nina Volkmann; Nina Volkmann; Reinhard Koch; Reinhard Koch
    Description

    This is the official data repository of the Data-Centric Image Classification (DCIC) Benchmark. The goal of this benchmark is to measure the impact of tuning the dataset instead of the model for a variety of image classification datasets. Full details about the collection process, the structure, and the automatic download are given at:

    Paper: https://arxiv.org/abs/2207.06214

    Source Code: https://github.com/Emprime/dcic

    The license information is given below as download.

    Citation

    Please cite as

    @article{schmarje2022benchmark,
      author = {Schmarje, Lars and Grossmann, Vasco and Zelenka, Claudius and Dippel, Sabine and Kiko, Rainer and Oszust, Mariusz and Pastell, Matti and Stracke, Jenny and Valros, Anna and Volkmann, Nina and Koch, Reinhard},
      journal = {36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks},
      title = {{Is one annotation enough? A data-centric image classification benchmark for noisy and ambiguous label estimation}},
      year = {2022}
    }

    Please see the full details about the used datasets below, which should also be cited as part of the license.

    @article{schoening2020Megafauna,
    author = {Schoening, T and Purser, A and Langenk{\"{a}}mper, D and Suck, I and Taylor, J and Cuvelier, D and Lins, L and Simon-Lled{\'{o}}, E and Marcon, Y and Jones, D O B and Nattkemper, T and K{\"{o}}ser, K and Zurowietz, M and Greinert, J and Gomes-Pereira, J},
    doi = {10.5194/bg-17-3115-2020},
    journal = {Biogeosciences},
    number = {12},
    pages = {3115--3133},
    title = {{Megafauna community assessment of polymetallic-nodule fields with cameras: platform and methodology comparison}},
    volume = {17},
    year = {2020}
    }
    
    @article{Langenkamper2020GearStudy,
    author = {Langenk{\"{a}}mper, Daniel and van Kevelaer, Robin and Purser, Autun and Nattkemper, Tim W},
    doi = {10.3389/fmars.2020.00506},
    issn = {2296-7745},
    journal = {Frontiers in Marine Science},
    title = {{Gear-Induced Concept Drift in Marine Images and Its Effect on Deep Learning Classification}},
    volume = {7},
    year = {2020}
    }
    
    
    @article{peterson2019cifar10h,
    author = {Peterson, Joshua and Battleday, Ruairidh and Griffiths, Thomas and Russakovsky, Olga},
    doi = {10.1109/ICCV.2019.00971},
    issn = {15505499},
    journal = {Proceedings of the IEEE International Conference on Computer Vision},
    pages = {9616--9625},
    title = {{Human uncertainty makes classification more robust}},
    volume = {2019-Octob},
    year = {2019}
    }
    
    @article{schmarje2019,
    author = {Schmarje, Lars and Zelenka, Claudius and Geisen, Ulf and Gl{\"{u}}er, Claus-C. and Koch, Reinhard},
    doi = {10.1007/978-3-030-33676-9_26},
    issn = {23318422},
    journal = {DAGM German Conference of Pattern Recognition},
    number = {November},
    pages = {374--386},
    publisher = {Springer},
    title = {{2D and 3D Segmentation of uncertain local collagen fiber orientations in SHG microscopy}},
    volume = {11824 LNCS},
    year = {2019}
    }
    
    @article{schmarje2021foc,
    author = {Schmarje, Lars and Br{\"{u}}nger, Johannes and Santarossa, Monty and Schr{\"{o}}der, Simon-Martin and Kiko, Rainer and Koch, Reinhard},
    doi = {10.3390/s21196661},
    issn = {1424-8220},
    journal = {Sensors},
    number = {19},
    pages = {6661},
    title = {{Fuzzy Overclustering: Semi-Supervised Classification of Fuzzy Labels with Overclustering and Inverse Cross-Entropy}},
    volume = {21},
    year = {2021}
    }
    
    @article{schmarje2022dc3,
    author = {Schmarje, Lars and Santarossa, Monty and Schr{\"{o}}der, Simon-Martin and Zelenka, Claudius and Kiko, Rainer and Stracke, Jenny and Volkmann, Nina and Koch, Reinhard},
    journal = {Proceedings of the European Conference on Computer Vision (ECCV)},
    title = {{A data-centric approach for improving ambiguous labels with combined semi-supervised classification and clustering}},
    year = {2022}
    }
    
    
    @article{obuchowicz2020qualityMRI,
    author = {Obuchowicz, Rafal and Oszust, Mariusz and Piorkowski, Adam},
    doi = {10.1186/s12880-020-00505-z},
    issn = {1471-2342},
    journal = {BMC Medical Imaging},
    number = {1},
    pages = {109},
    title = {{Interobserver variability in quality assessment of magnetic resonance images}},
    volume = {20},
    year = {2020}
    }
    
    
    @article{stepien2021cnnQuality,
    author = {St{\c{e}}pie{\'{n}}, Igor and Obuchowicz, Rafa{\l} and Pi{\'{o}}rkowski, Adam and Oszust, Mariusz},
    doi = {10.3390/s21041043},
    issn = {1424-8220},
    journal = {Sensors},
    number = {4},
    title = {{Fusion of Deep Convolutional Neural Networks for No-Reference Magnetic Resonance Image Quality Assessment}},
    volume = {21},
    year = {2021}
    }
    
    @article{volkmann2021turkeys,
    author = {Volkmann, Nina and Br{\"{u}}nger, Johannes and Stracke, Jenny and Zelenka, Claudius and Koch, Reinhard and Kemper, Nicole and Spindler, Birgit},
    doi = {10.3390/ani11092655},
    journal = {Animals 2021},
    pages = {1--13},
    title = {{Learn to train: Improving training data for a neural network to detect pecking injuries in turkeys}},
    volume = {11},
    year = {2021}
    }
    
    @article{volkmann2022keypoint,
    author = {Volkmann, Nina and Zelenka, Claudius and Devaraju, Archana Malavalli and Br{\"{u}}nger, Johannes and Stracke, Jenny and Spindler, Birgit and Kemper, Nicole and Koch, Reinhard},
    doi = {10.3390/s22145188},
    issn = {1424-8220},
    journal = {Sensors},
    number = {14},
    pages = {5188},
    title = {{Keypoint Detection for Injury Identification during Turkey Husbandry Using Neural Networks}},
    volume = {22},
    year = {2022}
    }

  18. Tree Point Classification - New Zealand

    • pacificgeoportal.com
    • digital-earth-pacificcore.hub.arcgis.com
    Updated Jul 26, 2022
    Cite
    Eagle Technology Group Ltd (2022). Tree Point Classification - New Zealand [Dataset]. https://www.pacificgeoportal.com/content/0e2e3d0d0ef843e690169cac2f5620f9
    Explore at:
    Dataset updated
    Jul 26, 2022
    Dataset authored and provided by
    Eagle Technology Group Ltd
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Description

    This New Zealand Point Cloud Classification Deep Learning Package will classify point clouds into tree and background classes. The model is optimized to work with New Zealand aerial LiDAR data. The classification of point cloud datasets to identify trees is useful in applications such as high-quality 3D basemap creation, urban planning, forestry workflows, and planning climate change response. Trees can have a complex, irregular geometrical structure that is hard to capture using traditional means. Deep learning models are highly capable of learning these complex structures and giving superior results. This model is designed to extract trees in both urban and rural areas in New Zealand. The training/testing/validation datasets were taken within New Zealand, resulting in high reliability in recognizing the patterns of common NZ building architecture.

    Licensing requirements
    ArcGIS Desktop - ArcGIS 3D Analyst extension for ArcGIS Pro

    Using the model
    The model can be used in ArcGIS Pro's Classify Point Cloud Using Trained Model tool. Before using this model, ensure that the supported deep learning frameworks libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS. Note: deep learning is computationally intensive, and a powerful GPU is recommended to process large datasets.

    Input
    The model is trained with classified LiDAR that follows the LINZ base specification. The input data should be similar to this specification. Note: the model depends on additional attributes such as Intensity, Number of Returns, etc., similar to the LINZ base specification. The model is trained to work on classified and unclassified point clouds that are in a projected coordinate system, in which the units of X, Y, and Z are based on the metric system of measurement. If the dataset is in degrees or feet, it needs to be re-projected accordingly. The model was trained using a training dataset with the full set of points. Therefore, it is important to make the full set of points available to the neural network while predicting, allowing it to better discriminate points of the 'class of interest' versus background points. It is recommended to use 'selective/target classification' and 'class preservation' functionalities during prediction to have better control over the classification and over scenarios with false positives. The model was trained on airborne lidar datasets and is expected to perform best with similar datasets. Classification of terrestrial point cloud datasets may work but has not been validated. For such cases, this pre-trained model may be fine-tuned to save on cost, time, and compute resources while improving accuracy. Another example where fine-tuning this model can be useful is when the object of interest is tram wires, railway wires, etc., which are geometrically similar to electricity wires. When fine-tuning this model, the target training data characteristics such as class structure, maximum number of points per block, and extra attributes should match those of the data originally used for training this model (see the Training data section below).

    Output
    The model will classify the point cloud into the following classes, with their meanings as defined by the American Society for Photogrammetry and Remote Sensing (ASPRS): 0 Background, 5 Trees / High-vegetation.

    Applicable geographies
    The model is expected to work well in New Zealand. It has been seen to produce favorable results in many regions. However, results can vary for datasets that are statistically dissimilar to the training data. Training dataset: Wellington City. Testing dataset: Tawa City. Validation/Evaluation dataset: Christchurch City.

    Model architecture
    This model uses the PointCNN model architecture implemented in the ArcGIS API for Python.

    Accuracy metrics
    The table below summarizes the accuracy of the predictions on the validation dataset.

    | Class | Precision | Recall | F1-score |
    |:--|:--|:--|:--|
    | Never Classified | 0.991200 | 0.975404 | 0.983239 |
    | High Vegetation | 0.933569 | 0.975559 | 0.954102 |

    Training data
    This model is trained on a classified dataset originally provided by OpenTopography, with < 1% manual labelling and correction. Train-test split percentage: {Train: 80%, Test: 20%}; this ratio was chosen based on the analysis of previous epoch statistics, which showed a decent improvement. The training data used has the following characteristics:
    • X, Y, and Z linear unit: Meter
    • Z range: -121.69 m to 26.84 m
    • Number of Returns: 1 to 5
    • Intensity: 16 to 65520
    • Point spacing: 0.2 ± 0.1
    • Scan angle: -15 to +15
    • Maximum points per block: 8192
    • Block Size: 20 Meters
    • Class structure: [0, 5]

    Sample results
    The model was used to classify a Christchurch city dataset with a density of 5 pts/m. The model's performance is directly proportional to the dataset's point density and to the exclusion of noise from the point clouds. To learn how to use this model, see this story.

  19. Animal Species Classification - V3

    • kaggle.com
    Updated Jan 24, 2023
    Cite
    DeepNets (2023). Animal Species Classification - V3 [Dataset]. https://www.kaggle.com/datasets/utkarshsaxenadn/animal-image-classification-dataset/code
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 24, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    DeepNets
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Idea

    The vision behind creating this dataset is to have a data set for classifying animal species. A lot of animal species can be included in this data set, which is why it gets revised regularly. This will help to create a machine-learning model that can accurately classify animal species.

    Class Distribution

    This is an Animal Classification Dataset made for the multi-class image recognition task. The dataset contains 15 classes; these classes are:

    1. Beetle
    2. Butterfly
    3. Cat
    4. Cow
    5. Dog
    6. Elephant
    7. Gorilla
    8. Hippo
    9. Lizard
    10. Monkey
    11. Mouse
    12. Panda
    13. Spider
    14. Tiger
    15. Zebra

    Data Distribution

    The data is split into 6 directories:

    • Interesting Data: As the name suggests, this folder contains 5 interesting images per class. These are called interesting images because it will be fascinating to see which class the model assigns to these shots; based on the model's predictions, we can gauge the model's understanding of each class.

    • Testing Data: This folder is filled with a random number of images per class. As the name indicates, this folder is purposefully made to hold testing images, that is, images on which the model will be tested after training.

    • TFRecords Data: This folder contains the data in TensorFlow records format. All the images present in TFRecords format have already been resized to 256 x 256 pixels and normalized.

    • Train Augmented: This time, additional augmented training data is added to the data set. As the name suggests, this directory contains augmented images per class: 5 augmented images per original image, for a total of 10,000 augmented images per class. This is done to increase the data set size because, as the total number of classes increases, the model complexity increases, and thus more data is required to train the model. The best way to get more data is data augmentation (a minimal augmentation sketch follows this list). It is highly recommended to shuffle the data before/after loading it.

    • Training Images: Each class contains 2000 images for training purposes. This is the data used for training the model. In this case, all the images are resized to 256 by 256 pixels and normalized to an input pixel range of 0 to 1.

    • Validation Images: This folder contains 100/200 images per class and is intentionally created for validation purposes. Images from this directory are used during training to validate the model's performance.
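    The following small TensorFlow sketch illustrates the augmentation idea described above (several augmented copies per original image); the specific augmentation layers are illustrative assumptions, not the transformations used to build this dataset.

    # Sketch: generate 5 augmented copies of one image with Keras preprocessing layers.
    import tensorflow as tf

    augment = tf.keras.Sequential([
        tf.keras.layers.RandomFlip("horizontal"),
        tf.keras.layers.RandomRotation(0.1),
        tf.keras.layers.RandomZoom(0.1),
    ])

    image = tf.random.uniform((256, 256, 3))   # stands in for one 256 x 256 training image
    augmented = [augment(image[tf.newaxis], training=True)[0] for _ in range(5)]
    print(len(augmented), augmented[0].shape)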

    DeepNets

  20. Educational Attainment in North Carolina Public Schools: Use of statistical...

    • data.mendeley.com
    Updated Nov 14, 2018
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Explore at:
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The purpose of data mining analysis is always to find patterns in the data using certain kinds of techniques, such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before doing any work on the data, the data has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features (a minimal sketch of this workflow follows below). Based on our project, after using clustering prior to classification, the performance did not improve much. The reason it did not improve could be that the features we selected to perform clustering are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

    From the dimensionality reduction perspective: clustering is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique for reducing the data dimension will lose a lot of information, since clustering techniques are based on a metric of 'distance'. At high dimensions, Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

    From the creating-new-features perspective: clustering analysis creates labels based on the patterns of the data, and it brings uncertainties into the data. When using clustering prior to classification, the decision on the number of clusters will highly affect the performance of the clustering, and in turn affect the performance of classification. If the subset of features we use clustering techniques on is well suited for it, it might increase the overall performance on classification. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better.

    We did not lock in the clustering outputs using a random_state, in an effort to see if they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, maybe the data just does not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering to the data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model's real-world effectiveness and also to continue to revise the models from time to time as things change.
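    A minimal scikit-learn sketch of the workflow discussed above: k-means cluster ids appended as a new feature before classification, compared against the raw features alone (on synthetic data, since the North Carolina school data is not bundled here).

    # Sketch: clustering prior to classification, with and without the cluster-id feature.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.cluster import KMeans
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X_tr)
    X_tr_aug = np.column_stack([X_tr, km.predict(X_tr)])   # raw features + cluster id
    X_te_aug = np.column_stack([X_te, km.predict(X_te)])

    for name, (a, b) in {"raw features": (X_tr, X_te),
                         "raw + cluster id": (X_tr_aug, X_te_aug)}.items():
        clf = RandomForestClassifier(random_state=0).fit(a, y_tr)
        print(name, "accuracy:", clf.score(b, y_te))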
