100+ datasets found
  1. Data mining as a hatchery process evaluation tool

    • scielo.figshare.com
    jpeg
    Updated Jun 3, 2023
    Cite
    Daniela Regina Klein; Marcos Martinez do Vale; Mariana Fernandes Ribas da Silva; Micheli Faccin Kuhn; Tatiane Branco; Mauricio Portella dos Santos (2023). Data mining as a hatchery process evaluation tool [Dataset]. http://doi.org/10.6084/m9.figshare.10258280.v1
    Available download formats
    jpeg
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    SciELO (http://www.scielo.org/)
    Authors
    Daniela Regina Klein; Marcos Martinez do Vale; Mariana Fernandes Ribas da Silva; Micheli Faccin Kuhn; Tatiane Branco; Mauricio Portella dos Santos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT The hatchery is one of the most important segments of the poultry chain, and generates an abundance of data, which, when analyzed, allow for identifying critical points of the process. The aim of this study was to evaluate the applicability of the data mining technique to databases of egg incubation of broiler breeders and laying hen breeders. The study uses a database recording egg incubation from broiler breeders housed in pens with shavings used for litters in natural mating, as well as laying hen breeders housed in cages using an artificial insemination mating system. The data mining technique (DM) was applied to analyses in a classification task, using the type of breeder and house system for delineating classes. The database was analyzed in three different ways: original database, attribute selection, and expert analysis. Models were selected on the basis of model precision and class accuracy. The data mining technique allowed for the classification of hatchery fertile eggs from different genetic groups, as well as hatching rates and the percentage of fertile eggs (the attributes with the greatest classification power). Broiler breeders showed higher fertility (> 95 %), but higher embryonic mortality between the third and seventh day post-hatching (> 0.5 %) when compared to laying hen breeders’ eggs. In conclusion, applying data mining to the hatchery process, selection of attributes and strategies based on the experience of experts can improve model performance.

  2. Supplementary Table S1 - A new classification system of beer categories and...

    • zenodo.org
    txt
    Updated May 23, 2025
    Cite
    Diego Bonatto (2025). Supplementary Table S1 - A new classification system of beer categories and styles based on large-scale data mining and self-organizing maps of beer recipes [Dataset]. http://doi.org/10.5281/zenodo.15498451
    Available download formats
    txt
    Dataset updated
    May 23, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Diego Bonatto
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset containing raw beer recipe data obtained from Brewer's Friend. The dataset contains malt, hops, adjunct types and quantities, and recipe vital statistics [alcohol by volume (ABV), beer color in the standard reference method (SRM), standard gravity (SG), final gravity (FG), international bitterness units (IBUs), and hop usage]. In addition, beer category information and the country or geographical area from which they originated, combined with supercluster and cluster data, were added to each beer recipe.

  3. Data from: A Proposed Churn Prediction Model

    • figshare.com
    pdf
    Updated Feb 24, 2019
    Cite
    Mona Nasr; Essam Shaaban; Yehia Helmy; Dr. Ayman Khedr (2019). A Proposed Churn Prediction Model [Dataset]. http://doi.org/10.6084/m9.figshare.7763183.v2
    Available download formats
    pdf
    Dataset updated
    Feb 24, 2019
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Mona Nasr; Essam Shaaban; Yehia Helmy; Dr. Ayman Khedr
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Churn prediction aims to detect customers who intend to leave a service provider. Gaining a new customer costs an organization 5 to 10 times more than retaining an existing one. Predictive models can identify likely churners in the near future so that a retention solution can be offered. This paper presents a new prediction model based on Data Mining (DM) techniques. The proposed model is composed of six steps: identifying the problem domain, data selection, investigating the data set, classification, clustering, and knowledge usage. A data set with 23 attributes and 5000 instances is used; 4000 instances are used for training the model and 1000 instances as a testing set. The predicted churners are clustered into 3 categories for use in a retention strategy. The data mining techniques used in this paper are Decision Tree, Support Vector Machine, and Neural Network, applied through the open-source software WEKA.
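
As a rough illustration of the 4000/1000 split described above, the sketch below partitions 5000 instance indices into a training set and a testing set. The description does not say how instances were assigned to each set, so the random shuffle, the seed, and the helper name `split_instances` are assumptions made for illustration and not the authors' WEKA procedure.

```python
import random


def split_instances(n_instances, n_train, seed=0):
    """Shuffle instance indices and split them into train and test lists.

    Hypothetical helper: random assignment and the fixed seed are
    assumptions, since the source does not describe the split method.
    """
    if not 0 <= n_train <= n_instances:
        raise ValueError("n_train must be between 0 and n_instances")
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


# 5000 instances, 4000 for training and the remaining 1000 for testing.
train_idx, test_idx = split_instances(5000, 4000)
print(len(train_idx), len(test_idx))
```

Keeping the seed fixed makes the split repeatable, which matters when the three classifiers are compared on the same testing set.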

  4. Bitcoin Wallet Classification Public Dataset

    • kaggle.com
    zip
    Updated Mar 15, 2025
    Cite
    Gregory Wilder (2025). Bitcoin Wallet Classification Public Dataset [Dataset]. https://www.kaggle.com/datasets/gregorywilder/bitcoin-wallet-classification-public-dataset
    Available download formats
    zip (2471936960 bytes)
    Dataset updated
    Mar 15, 2025
    Authors
    Gregory Wilder
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Classification of 43,621,232 Bitcoin Wallet addresses into 14 categories (following the classifications defined by BABD, cited below): (1) Blackmail, (2) Cyber-Security Service, (3) Darknet Market, (4) Centralized Exchange, (5) P2P Financial Infrastructure Service, (6) P2P Financial Service, (7) Gambling, (8) Government Criminal Blocklist, (9) Money Laundering, (10) Ponzi Scheme, (11) Mining Pool, (12) Tumbler, (13) Individual Wallets, and (14) Unknown Service.

    This database was made by combining info from four separate sources: Aleš Janda, who operates WalletExplorer.com, generously offers an API which provides significant wallet classification and identification that he was able to determine. Mr. Janda determined the owners of wallets by registering for certain services and learning the addresses used by those services (like SatoshiDice) and then backed in to other wallet addresses by determining which wallets were merged together (similar to a shadow wallet address analysis done by others). However, WalletExplorer states it has not updated much data since 2016. Thus, the data available from WalletExplorer is largely contained in the first 425,000 blocks (block 425,000 was mined on August 13, 2016). This classification data constituted, by far, the largest portion of classification data in this dataset, with over 43 million wallet addresses being classified. Preference was always given to WalletExplorer.com's classifications, as “Strong Addresses.” Aleš Janda. Wallet explorer. https://www.walletexplorer.com/info.

    Additionally, several datasets (in spreadsheets) identify and classify wallets:

    The largest of these spreadsheets available on Kaggle was made by Xiang, et al. They were able to classify 544,462 wallet addresses between blocks 585,000 (July 12, 2019) and 685,000 (May 26, 2021) into 13 classifications: (1) Blackmail, (2) Cyber-Security Service, (3) Darknet Market, (4) Centralized Exchange, (5) P2P Financial Infrastructure Service, (6) P2P Financial Service, (7) Gambling, (8) Government Criminal Blocklist, (9) Money Laundering, (10) Ponzi Scheme, (11) Mining Pool, (12) Tumbler, and (13) Individual Wallets. They used the WalletExplorer database and other governmental blocklist databases as their “strong addresses” and information from BitcoinAbuse.com (which now appears to be ChainAbuse.com) as “weak addresses” to train their AI models. The processes they used to identify and classify the wallets on their experimental sets resulted in a minimum F-1 score of 92.97%, accuracy of 93.24%, precision of 92.80%, and recall of 93.24%. They accomplished this by using a framework consisting of two parts: a statistical indicator (SI), and a local structural indicator (LSI). The SI considered four indicator types, Pure Amount Indicator (PAI), Pure Degree Indicator (PDI), Pure Time Indicator (PTI), and Combination Indicator (CI). The SI considered 132 features to predict the classification. For the LSI, they generated k-hop subgraphs with an algorithm, and then used various graph metrics and other features to predict the classification. Yiexin Xiang, Yuchen Lei, Ding Bao, Tiantian Li, Quingqing Yang, Wenmao Liu, Wei Ren, and Kim-Kwang Raymond Choo. Babd: A bitcoin address behavior dataset for pattern analysis. IEEE Transactions on Information Forensics and Security, (19):2171–85, 2024. https://www.kaggle.com/datasets/lemonx/babd13

    Michalski, et al., identified 8,808 addresses and classified them using machine learning techniques considering 149 features, into categories including (1) mining pools, (2) miners, (3) coinjoins, (4) gambling, (5) exchange, and (6) services. They analyzed the blocks between 520,850 and 520,950. They obtained training data from WalletExplorer.com and then used machine learning techniques including Evaluated Supervised Learning Algorithms. They determined that the Random Forest classification was the best method of classifying the wallets, and stated they obtained an F-score of 95%. Due to the small size of this dataset and the fact that only 100 blocks were covered by this dataset, I considered these classifications to be “weak addresses” for this work. The wallets they classified as “services” were added to my database as an “Unknown Service.” Radoslaw Michalski, Daria Dziubałtowska, and Piotr Macek. Bitcoin addresses and their categories. https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KEWU0N

    The US Office of Foreign Assets Control maintains a list of ‘sanctioned’ Bitcoin and other digital currency assets. Several GitHub contributors maintain a tool that extracts the Bitcoin addresses from this database. This added 390 wallet addresses to the dataset. U.S. Treasury. Specially designated nationals list of the U.S. Office of Foreign Assets Control. https://www.treasury.gov/ofac/downloads/sanctions/1.0/sdn_advanced.xml. 0xB10C, Michael Neale, and Yahiheb. https://github.com/0xB10C/ofac-sanctioned-digital-currency-addresses?tab=readme-ov-file

  5. Data from: Characterizing and classifying neuroendocrine neoplasms through...

    • datacatalog.mskcc.org
    • search.dataone.org
    • +1 more
    Updated Sep 19, 2023
    Cite
    Nanayakkara, Jina; Yang, Xiaojing; Tyryshkin, Kathrin; Wong, Justin J.M.; Vanderbeck, Kaitlin; Ginter, Paula S.; Scognamiglio, Theresa; Chen, Yao-Tseng; Panarelli, Nicole; Cheung, Nai-Kong; Dijk, Frederike; Ben-Dov, Iddo Z.; Kim, Michelle Kang; Singh, Simron; Morozov, Pavel; Max, Klaas E. A.; Tuschl, Thomas; Renwick, Neil (2023). Characterizing and classifying neuroendocrine neoplasms through microRNA sequencing and data mining [Dataset]. http://doi.org/10.5061/dryad.fn2z34tqj
    Dataset updated
    Sep 19, 2023
    Dataset provided by
    MSK Library
    Authors
    Nanayakkara, Jina; Yang, Xiaojing; Tyryshkin, Kathrin; Wong, Justin J.M.; Vanderbeck, Kaitlin; Ginter, Paula S.; Scognamiglio, Theresa; Chen, Yao-Tseng; Panarelli, Nicole; Cheung, Nai-Kong; Dijk, Frederike; Ben-Dov, Iddo Z.; Kim, Michelle Kang; Singh, Simron; Morozov, Pavel; Max, Klaas E. A.; Tuschl, Thomas; Renwick, Neil
    Description

    From Dryad entry:

    "Abstract
    Neuroendocrine neoplasms (NENs) are clinically diverse and incompletely characterized cancers that are challenging to classify. MicroRNAs (miRNAs) are small regulatory RNAs that can be used to classify cancers. Recently, a morphology-based classification framework for evaluating NENs from different anatomic sites was proposed by experts, with the requirement of improved molecular data integration. Here, we compiled 378 miRNA expression profiles to examine NEN classification through comprehensive miRNA profiling and data mining. Following data preprocessing, our final study cohort included 221 NEN and 114 non-NEN samples, representing 15 NEN pathological types and five site-matched non-NEN control groups. Unsupervised hierarchical clustering of miRNA expression profiles clearly separated NENs from non-NENs. Comparative analyses showed that miR-375 and miR-7 expression is substantially higher in NEN cases than non-NEN controls. Correlation analyses showed that NENs from diverse anatomic sites have convergent miRNA expression programs, likely reflecting morphologic and functional similarities. Using machine learning approaches, we identified 17 miRNAs to discriminate 15 NEN pathological types and subsequently constructed a multi-layer classifier, correctly identifying 217 (98%) of 221 samples and overturning one histologic diagnosis. Through our research, we have identified common and type-specific miRNA tissue markers and constructed an accurate miRNA-based classifier, advancing our understanding of NEN diversity.

    Methods
    Sequencing-based miRNA expression profiles from 378 clinical samples, comprising 239 neuroendocrine neoplasm (NEN) cases and 139 site-matched non-NEN controls, were used in this study. Expression profiles were either compiled from published studies (n=149) or generated through small RNA sequencing (n=229). Prior to sequencing, total RNA was isolated from formalin-fixed paraffin-embedded (FFPE) tissue blocks or fresh-frozen (FF) tissue samples. Small RNA cDNA libraries were sequenced on HiSeq 2500 Illumina platforms using an established small RNA sequencing (Hafner et al., 2012 Methods) and sequence annotation pipeline (Brown et al., 2013 Front Genet) to generate miRNA expression profiles. Scaling our existing approach to miRNA-based NEN classification (Panarelli et al., 2019 Endocr Relat Cancer; Ren et al., 2017 Oncotarget), we constructed and cross-validated a multi-layer classifier for discriminating NEN pathological types based on selected miRNAs.

    Usage notes
    Diagnostic histopathology and small RNA cDNA library preparation information for all samples are presented in Table S1 of the associated manuscript."

  6. Data from: MusicOSet: An Enhanced Open Dataset for Music Data Mining

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    bin, zip
    Updated Jun 7, 2021
    Cite
    Mariana O. Silva; Laís Mota; Mirella M. Moro (2021). MusicOSet: An Enhanced Open Dataset for Music Data Mining [Dataset]. http://doi.org/10.5281/zenodo.4904639
    Available download formats
    zip, bin
    Dataset updated
    Jun 7, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mariana O. Silva; Laís Mota; Mirella M. Moro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MusicOSet is an open and enhanced dataset of musical elements (artists, songs and albums) based on musical popularity classification. It provides a directly accessible collection of data suitable for numerous tasks in music data mining (e.g., data visualization, classification, clustering, similarity search, MIR, HSS and so forth). To create MusicOSet, the potential information sources were divided into three main categories: music popularity sources, metadata sources, and acoustic and lyrical features sources. Data from all three categories were initially collected between January and May 2019; the data were then updated and enhanced in June 2019.

    The attractive features of MusicOSet include:

    • Integration and centralization of different musical data sources
    • Calculation of popularity scores and classification of hits and non-hits musical elements, varying from 1962 to 2018
    • Enriched metadata for music, artists, and albums from the US popular music industry
    • Availability of acoustic and lyrical resources
    • Unrestricted access in two formats: SQL database and compressed .csv files

    | Data              | # Records |
    |:-----------------:|:---------:|
    | Songs             | 20,405    |
    | Artists           | 11,518    |
    | Albums            | 26,522    |
    | Lyrics            | 19,664    |
    | Acoustic Features | 20,405    |
    | Genres            | 1,561     |

  7. Data Analysis for the Systematic Literature Review of DL4SE

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4768586
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    College of William and Mary
    Washington and Lee University
    Authors
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong; 2014). An Exploratory Data Analysis (EDA) comprises a set of statistical and data mining procedures to describe data. We ran EDA to provide statistical facts and inform conclusions. The mined facts allow attaining arguments that would influence the Systematic Literature Review of DL4SE.

    The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.

    Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad, et al; 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD involves five stages:

    Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organize the data into 35 features or attributes that you find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.

    Preprocessing. The preprocessing applied was transforming the features into the correct type (nominal), removing outliers (papers that do not belong to the DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalize the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. Similarly, the same normalization was applied to other features like “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the paper by the data mining tasks or methods.

    Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis. We performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. PCA also allowed us to identify the number of clusters that exhibit the maximum reduction in variance. In other words, it helped us to identify the number of clusters to be used when tuning the explainable models.

    Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncover hidden relationships on the extracted features (Correlations and Association Rules) and to categorize the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.

    Interpretation/Evaluation. We used Knowledge Discovery to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).

    We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.

    Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.

    Support = number of occurrences in which the statement is true, divided by the total number of statements

    Confidence = number of occurrences in which the statement is true, divided by the number of occurrences of the premise
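
The two measures above can be sketched in a few lines of Python. This is a minimal illustration using the standard association-rule definitions (confidence as the support of premise and conclusion together, divided by the support of the premise), not the RapidMiner pipeline the authors used. The function names and the toy records are made up for the example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, premise, conclusion):
    """Support of premise and conclusion together, divided by support of the premise."""
    premise_support = support(transactions, premise)
    if premise_support == 0:
        return 0.0
    joint = set(premise) | set(conclusion)
    return support(transactions, joint) / premise_support


# Toy data, invented for illustration: each set lists features of one paper.
papers = [
    {"supervised", "irreproducible"},
    {"supervised", "irreproducible"},
    {"supervised", "reproducible"},
    {"unsupervised", "reproducible"},
]

rule_support = support(papers, {"supervised", "irreproducible"})
rule_confidence = confidence(papers, {"supervised"}, {"irreproducible"})
print(rule_support, rule_confidence)
```

In this toy set the rule "supervised implies irreproducible" holds in 2 of 4 papers, and in 2 of the 3 papers that satisfy the premise.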

  8. UVigoMED

    • data.mendeley.com
    Updated Dec 30, 2016
    Cite
    Marcos Antonio Mouriño-García (2016). UVigoMED [Dataset]. http://doi.org/10.17632/p3jkppwr29.1
    Dataset updated
    Dec 30, 2016
    Authors
    Marcos Antonio Mouriño-García
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    UVigoMED is a monolingual (English), single-label corpus composed of 92,661 biomedical abstracts extracted from MEDLINE, classified under 26 MeSH categories.

  9. Global Discovery Classification Market Research Report: By Application...

    • wiseguyreports.com
    Updated Sep 15, 2025
    + more versions
    Cite
    (2025). Global Discovery Classification Market Research Report: By Application (Scientific Research, Data Mining, Content Recommendation, Sentiment Analysis), By Deployment Type (Cloud-Based, On-Premise, Hybrid), By End User (Academic Institutions, Healthcare Organizations, Retail Sector, Financial Services), By Technology (Machine Learning, Natural Language Processing, Data Analytics, Artificial Intelligence) and By Regional (North America, Europe, South America, Asia Pacific, Middle East and Africa) - Forecast to 2035 [Dataset]. https://www.wiseguyreports.com/reports/discovery-classification-market
    Dataset updated
    Sep 15, 2025
    License

    https://www.wiseguyreports.com/pages/privacy-policy

    Time period covered
    Sep 25, 2025
    Area covered
    Global
    Description
    | Attribute | Details |
    |:----------|:--------|
    | Base year | 2024 |
    | Historical data | 2019 - 2023 |
    | Regions covered | North America, Europe, APAC, South America, MEA |
    | Report coverage | Revenue Forecast, Competitive Landscape, Growth Factors, and Trends |
    | Market size 2024 | 2308.3 (USD Million) |
    | Market size 2025 | 2426.0 (USD Million) |
    | Market size 2035 | 4000.0 (USD Million) |
    | Segments covered | Application, Deployment Type, End User, Technology, Regional |
    | Countries covered | US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA |
    | Key market dynamics | Technological advancements, Increasing data volumes, Rising regulatory requirements, Demand for efficient classification, Growth of cloud-based solutions |
    | Market forecast units | USD Million |
    | Key companies profiled | Roche, SeraCare Life Sciences, Illumina, Thermo Fisher Scientific, Danaher Corporation, Becton Dickinson, PerkinElmer, Tecan Group, Pacific Biosciences, Genomatix, QIAGEN, Merck Group, BioRad Laboratories, Abbott Laboratories, Agilent Technologies, NanoString Technologies |
    | Market forecast period | 2025 - 2035 |
    | Key market opportunities | AI-driven data analysis solutions, Increased demand for regulatory compliance, Growth in data-driven decision making, Expanding healthcare and pharma sectors, Integration with emerging technologies |
    | Compound annual growth rate (CAGR) | 5.1% (2025 - 2035) |

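
The stated growth rate can be checked against the 2025 and 2035 market sizes listed above. The sketch below uses the usual compound annual growth rate formula, (end / start)^(1 / years) - 1. The function name `cagr` is chosen for this example, and the report does not say which formula or rounding it used, so treat this as a plausibility check only.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate as a fraction (0.05 means 5%)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Figures from the listing: USD 2426.0 million (2025) to USD 4000.0 million (2035).
rate = cagr(2426.0, 4000.0, 2035 - 2025)
print(f"{rate:.1%}")
```

Rounded to one decimal place, the result agrees with the 5.1% figure in the listing.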
  10. White Wine Quality

    • kaggle.com
    zip
    Updated Sep 28, 2020
    + more versions
    Cite
    Piyush Agnihotri (2020). White Wine Quality [Dataset]. https://www.kaggle.com/datasets/piyushagni5/white-wine-quality/code
    Available download formats
    zip (74835 bytes)
    Dataset updated
    Sep 28, 2020
    Authors
    Piyush Agnihotri
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, refer to [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

    These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.

    Content

    For more information, read [Cortez et al., 2009].

    Input variables (based on physicochemical tests):
    1 - fixed acidity
    2 - volatile acidity
    3 - citric acid
    4 - residual sugar
    5 - chlorides
    6 - free sulfur dioxide
    7 - total sulfur dioxide
    8 - density
    9 - pH
    10 - sulphates
    11 - alcohol

    Output variable (based on sensory data):
    12 - quality (score between 0 and 10)

    Acknowledgements

    This dataset is also available from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/wine+quality. To get both datasets, i.e., the red and white vinho verde wine samples from the north of Portugal, please visit the above link.

    Please include this citation if you plan to use this database:

    P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

    Inspiration

    We Kagglers can apply several machine-learning algorithms to determine which physicochemical properties make a wine 'good'!

    Relevant papers

    P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

  11. Data from: An open dataset for intelligent recognition and classification of...

    • springernature.figshare.com
    bin
    Updated Jul 8, 2024
    Cite
    Xuhui Zhang; Wenjuan Yang; Bing Ma; Yanqun Wang; Yujia Wu; Jianxin Yan; Yongwei Liu; Chao Zhang; Jicheng Wan; Yue Wang; Mengyao Huang; Yuyang Li; Dian Zhao (2024). An open dataset for intelligent recognition and classification of abnormal condition in longwall mining [Dataset]. http://doi.org/10.6084/m9.figshare.22654945.v1
    Available download formats
    bin
    Dataset updated
    Jul 8, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Xuhui Zhang; Wenjuan Yang; Bing Ma; Yanqun Wang; Yujia Wu; Jianxin Yan; Yongwei Liu; Chao Zhang; Jicheng Wan; Yue Wang; Mengyao Huang; Yuyang Li; Dian Zhao
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This work developed an image dataset of the underground longwall mining face (DsLMF+), which consists of 138,004 images annotated with 6 categories: mine personnel, hydraulic support guard plate, large coal, towline, miners’ behaviour, and mine safety helmet. All the labels of the dataset are publicly available in YOLO format and COCO format. The dataset aims to support further research and advancement of the intelligent identification and classification of abnormal conditions for underground mining.

  12. Global Data Mining Tools Market - Global Outlook 2020-2033

    • htfmarketinsights.com
    pdf & excel
    Updated Oct 15, 2025
    Cite
    HTF Market Intelligence (2025). Global Data Mining Tools Market - Global Outlook 2020-2033 [Dataset]. https://www.htfmarketinsights.com/report/4364059-data-mining-tools-market
    Available download formats
    pdf & excel
    Dataset updated
    Oct 15, 2025
    Dataset authored and provided by
    HTF Market Intelligence
    License

    https://www.htfmarketinsights.com/privacy-policy

    Time period covered
    2019 - 2031
    Area covered
    Global
    Description

    Global Data Mining Tools Market is segmented by Application (Predictive analytics, Fraud detection, Marketing, Healthcare diagnostics, Manufacturing optimization), Type (Classification tools, Clustering tools, Regression tools, Association tools, Text mining tools), and Geography (North America, LATAM, West Europe, Central & Eastern Europe, Northern Europe, Southern Europe, East Asia, Southeast Asia, South Asia, Central Asia, Oceania, MEA)

  13. SIAM 2007 Text Mining Competition dataset

    • catalog.data.gov
    • data.nasa.gov
    • +1 more
    Updated Apr 11, 2025
    + more versions
    Cite
    Dashlink (2025). SIAM 2007 Text Mining Competition dataset [Dataset]. https://catalog.data.gov/dataset/siam-2007-text-mining-competition-dataset
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    Subject Area: Text Mining

    Description: This is the dataset used for the SIAM 2007 Text Mining competition. This competition focused on developing text mining algorithms for document classification. The documents in question were aviation safety reports that documented one or more problems that occurred during certain flights. The goal was to label the documents with respect to the types of problems that were described. This is a subset of the Aviation Safety Reporting System (ASRS) dataset, which is publicly available.

    How Data Was Acquired: The data for this competition came from human-generated reports on incidents that occurred during a flight.

    Sample Rates, Parameter Description, and Format: There is one document per incident. The datasets are in raw text format. All documents for each set will be contained in a single file. Each row in this file corresponds to a single document. The first characters on each line of the file are the document number, and a tilde separates the document number from the text itself.

    Anomalies/Faults: This is a document category classification problem.
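
    The row layout described above (document number, a tilde, then the text) can be read with a few lines of Python. This is only a minimal sketch: the function name `parse_documents` and the sample rows are invented for illustration, and it assumes the first tilde on a row is the separator.

```python
def parse_documents(lines):
    """Split rows of the form '<document number>~<text>' into (number, text) pairs."""
    documents = []
    for raw in lines:
        row = raw.rstrip("\n")
        if not row:
            continue  # skip blank rows
        # partition() splits on the first tilde only, so any later tilde stays in the text
        number, separator, text = row.partition("~")
        if not separator:
            raise ValueError(f"row has no tilde separator: {row[:40]!r}")
        documents.append((number.strip(), text))
    return documents


# Hypothetical rows, invented for illustration; they are not taken from the dataset.
sample_rows = [
    "1~crew reported engine vibration during climb\n",
    "2~altitude deviation ~ controller issued new clearance\n",
]
documents = parse_documents(sample_rows)
```

    In practice the same function could be fed an open file handle instead of a list, since it only iterates over lines.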

  14. Predictive data analysis techniques for higher education students dropout

    • scidb.cn
    Updated Apr 10, 2023
    Cite
    Cindy (2023). Predictive data analysis techniques for higher education students dropout [Dataset]. http://doi.org/10.57760/sciencedb.07894
    Dataset updated
    Apr 10, 2023
    Dataset provided by
    Science Data Bank
    Authors
    Cindy
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    In this research, we have generated student retention alerts. The alerts are classified into two types: preventive and corrective. This classification varies according to the level of maturity of the data systematization process. Therefore, to systematize the data, data mining techniques have been applied. The experimental analytical method has been used, with a population of 13,715 students with 62 sociological, academic, family, personal, economic, psychological, and institutional variables, and factors such as academic follow-up and performance, financial situation, and personal information. In particular, information is collected on each of the problems or a combination of problems that could affect dropout rates. Following the methodology, the information has been generated through an abstract data model to reflect the profile of the dropout student. As an advancement over previous research, this proposal will create preventive and corrective alternatives to avoid dropout in higher education. Also, in contrast to previous work, we generated corrective warnings with the application of data mining techniques such as neural networks, reaching a precision of 97% and a loss of 0.1052. In conclusion, this study aims to analyze the behavior of students who drop out of university through the evaluation of predictive patterns. The overall objective is to predict the profile of student dropout, considering reasons such as admission to higher education and career changes. Consequently, using a data systematization process promotes the permanence of students in higher education. Once the profile of the dropout has been identified, student retention strategies have been approached, according to the time of its appearance and the point of view of the institution.

  15. Table_1_Data Mining and Machine Learning Models for Predicting Drug Likeness...

    • frontiersin.figshare.com
    • figshare.com
    txt
    Updated May 31, 2023
    Cite
    Abraham Yosipof; Rita C. Guedes; Alfonso T. García-Sosa (2023). Table_1_Data Mining and Machine Learning Models for Predicting Drug Likeness and Their Disease or Organ Category.CSV [Dataset]. http://doi.org/10.3389/fchem.2018.00162.s001
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Abraham Yosipof; Rita C. Guedes; Alfonso T. García-Sosa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data mining approaches can uncover underlying patterns in chemical and pharmacological property space decisive for drug discovery and development. Two of the most common approaches are visualization and machine learning methods. Visualization methods use dimensionality reduction techniques in order to reduce multi-dimension data into 2D or 3D representations with a minimal loss of information. Machine learning attempts to find correlations between specific activities or classifications for a set of compounds and their features by means of recurring mathematical models. Both models take advantage of the different and deep relationships that can exist between features of compounds, and helpfully provide classification of compounds based on such features or in case of visualization methods uncover underlying patterns in the feature space. Drug-likeness has been studied from several viewpoints, but here we provide the first implementation in chemoinformatics of the t-Distributed Stochastic Neighbor Embedding (t-SNE) method for the visualization and the representation of chemical space, and the use of different machine learning methods separately and together to form a new ensemble learning method called AL Boost. The models obtained from AL Boost synergistically combine decision tree, random forests (RF), support vector machine (SVM), artificial neural network (ANN), k nearest neighbors (kNN), and logistic regression models. In this work, we show that together they form a predictive model that not only improves the predictive force but also decreases bias. This resulted in a corrected classification rate of over 0.81, as well as higher sensitivity and specificity rates for the models. In addition, separation and good models were also achieved for disease categories such as antineoplastic compounds and nervous system diseases, among others. 
    Such models can be used to guide decisions on the feature landscape of compounds and their likeness to either drugs or other characteristics, such as specific or multiple disease-category(ies) or organ(s) of action of a molecule.

  16. EthanolLevel UCR Archive Dataset

    • data.niaid.nih.gov
    Updated May 15, 2024
    Cite
    University of California, Riverside; University of Southampton (2024). EthanolLevel UCR Archive Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11190984
    Dataset updated
    May 15, 2024
    Authors
    University of California, Riverside; University of Southampton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is part of the UCR Archive maintained by University of Southampton researchers. Please cite a relevant or the latest full archive release if you use the datasets. See http://www.timeseriesclassification.com/.

    This dataset is part of a project with the Scotch Whisky Research Institute on detecting forged spirits in a non-intrusive manner. One way of detecting forgery without sampling the wine is through inspecting ethanol level by spectrograph. The dataset covers 20 different bottle types and four levels of alcohol: 35%, 38%, 40% and 45%. Each series is a spectrograph of 1751 observations. This dataset is an example of when it is wrong to merge and resample, because the train/test splits are constructed so that the same bottle type is never in both train and test sets. There are 4 classes:

      - Class 1: E35
      - Class 2: E38
      - Class 3: E40
      - Class 4: E45

    For more information about this dataset, see [1,2].

    [1] Lines, Jason, Sarah Taylor, and Anthony Bagnall. "Hive-cote: The hierarchical vote collective of transformation-based ensembles for time series classification." Data Mining (ICDM), 2016 IEEE 16th International Conference on. IEEE, 2016.

    [2] J. Large, E. K. Kemsley, N. Wellner, I. Goodall, and A. Bagnall, "Detecting forged alcohol non-invasively through vibrational spectroscopy and machine learning," in Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2018.

    Donator: A. Bagnall
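
    The split constraint described for EthanolLevel (no bottle type on both sides of the train/test split) can be illustrated with a small grouped split. This is a sketch under stated assumptions, not how the archive's split was produced: the function `split_by_bottle`, the record keys `bottle` and `label`, and the bottle names are all hypothetical.

```python
def split_by_bottle(series, test_bottles):
    """Put every series from a held-out bottle type in test, the rest in train."""
    train, test = [], []
    for item in series:
        target = test if item["bottle"] in test_bottles else train
        target.append(item)
    return train, test


# Hypothetical records: bottle names and the 'bottle'/'label' keys are invented.
series = [
    {"bottle": "A", "label": "E35"},
    {"bottle": "A", "label": "E40"},
    {"bottle": "B", "label": "E38"},
    {"bottle": "C", "label": "E45"},
    {"bottle": "C", "label": "E35"},
]
train, test = split_by_bottle(series, test_bottles={"C"})

train_bottles = {item["bottle"] for item in train}
held_out = {item["bottle"] for item in test}
# No bottle type appears on both sides of the split.
assert train_bottles.isdisjoint(held_out)
```

    Merging the two sets and resampling at random would break this property, because series from the same bottle type could then land in both train and test.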

  17. Data from: QuerTCI: A Tool Integrating GitHub Issue Querying with Comment...

    • data.niaid.nih.gov
    Updated Feb 21, 2022
    Cite
    Ye Paing; Tatiana Castro Vélez; Raffi Khatchadourian (2022). QuerTCI: A Tool Integrating GitHub Issue Querying with Comment Classification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6115403
    Dataset updated
    Feb 21, 2022
    Dataset provided by
    City University of New York (CUNY) Hunter College
    City University of New York (CUNY) Graduate Center
    Authors
    Ye Paing; Tatiana Castro Vélez; Raffi Khatchadourian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Issue tracking systems enable users and developers to comment on problems plaguing a software system. Empirical Software Engineering (ESE) researchers study (open-source) project issues and the comments and threads within to discover---among others---challenges developers face when, e.g., incorporating new technologies, platforms, and programming language constructs. However, issue discussion threads accumulate over time and thus can become unwieldy, hindering any insight that researchers may gain. While existing approaches alleviate this burden by classifying issue thread comments, there is a gap between searching popular open-source software repositories (e.g., those on GitHub) for issues containing particular keywords and feeding the results into a classification model. In this paper, we demonstrate a research infrastructure tool called QuerTCI that bridges this gap by integrating the GitHub issue comment search API with the classification models found in existing approaches. Using queries, ESE researchers can retrieve GitHub issues containing particular keywords, e.g., those related to a certain programming language construct, and subsequently classify the kinds of discussions occurring in those issues. Using our tool, our hope is that ESE researchers can uncover challenges related to particular technologies using certain keywords through popular open-source repositories more seamlessly than previously possible. A tool demonstration video may be found at: https://youtu.be/fADKSxn0QUk.

  18. Data from: HiPerMovelets: high-performance movelet extraction for trajectory...

    • tandf.figshare.com
    txt
    Updated May 30, 2023
    Cite
    Tarlis Tortelli Portela; Jonata Tyska Carvalho; Vania Bogorny (2023). HiPerMovelets: high-performance movelet extraction for trajectory classification [Dataset]. http://doi.org/10.6084/m9.figshare.17722278.v1
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Taylor & Francis: https://taylorandfrancis.com/
    Authors
    Tarlis Tortelli Portela; Jonata Tyska Carvalho; Vania Bogorny
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the last decade, trajectory classification has received significant attention. The vast amount of data generated on social media, the use of sensor networks, IoT devices and other Internet-enabled sources allowed the semantic enrichment of mobility data, making the classification task more challenging. Existing trajectory classification methods have mainly considered space, time and numerical data, ignoring the semantic dimensions. Only recently proposed methods such as Movelets and MASTERMovelets can handle all types of dimensions. MASTERMovelets is the only method that automatically discovers the best dimension combination and subtrajectory size for trajectory classification. However, although it outperformed the state-of-the-art in terms of accuracy, MASTERMovelets is computationally expensive and results in a high dimensionality problem, which makes it unfeasible for most real trajectory datasets that contain a large volume of data. To overcome this problem and enable the application of the movelets approach on large datasets, in this paper we propose a new high-performance method for extracting movelets and classifying trajectories, called HiPerMovelets (High-performance Movelets). Experimental results show that HiPerMovelets is 10 times faster than MASTERMovelets, reduces the high-dimensionality problem, is more scalable, and presents a high classification accuracy in all evaluated datasets with both raw and semantic trajectories.

  19. Data from: Classifying document types to enhance search and recommendations...

    • figshare.com
    txt
    Updated May 25, 2017
    Cite
    Aristotelis Charalampous; Petr Knoth (2017). Classifying document types to enhance search and recommendations in digital libraries - Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.4834229.v2
    Available download formats: txt
    Dataset updated
    May 25, 2017
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Aristotelis Charalampous; Petr Knoth
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Taken from "Classifying document types to enhance search and recommendations in digital libraries" (https://www.overleaf.com/read/zzzrvmzmwdck).

    Abstract: In this paper, we address the problem of classifying documents available from the global network of (open access) repositories according to their type. We show that the metadata provided by repositories enabling us to distinguish research papers, theses and slides are missing in over 60% of cases. While these metadata describing document types are useful in a variety of scenarios ranging from research analytics to improving search and recommender (SR) systems, this problem has not yet been sufficiently addressed in the context of the repositories infrastructure. We have developed a new approach for classifying document types using supervised machine learning based exclusively on text-specific features. We achieve a 0.96 F1-score using the random forest and Adaboost classifiers, which are the best performing models on our data. By analysing the SR system logs of the CORE digital library aggregator, we show that users are an order of magnitude more likely to click on research papers and theses than on slides. This suggests that using document types as a feature for ranking/filtering SR results in digital libraries has the potential to improve user experience.

    The descriptors, as featured in the study, are encoded in the dataset as follows:

      authors_len: Number of authors associated with the document entry.
      num_of_pages: Number of pages the document has in total.
      avg_word_per_page: Average words per page in the document.
      total_words: Total words in the document.
      source: The online service from which the document originated (can be either "CORE" or "SlideShare").
      id: Identifier with which the source's API can be queried to retrieve the corresponding document.
      label: The document's type, from "research", "thesis" or "slides".

  20. Data from: Wine Quality

    • kaggle.com
    • tensorflow.org
    zip
    Updated Oct 29, 2017
    Cite
    Daniel S. Panizzo (2017). Wine Quality [Dataset]. https://www.kaggle.com/datasets/danielpanizzo/wine-quality
    Available download formats: zip (111077 bytes)
    Dataset updated
    Oct 29, 2017
    Authors
    Daniel S. Panizzo
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Citation Request: This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. Please include this citation if you plan to use this database:

    P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

    Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

    1. Title: Wine Quality

    2. Sources Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009

    3. Past Usage:

      P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

      In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. PH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity analysis procedure).

    4. Relevant Information:

      The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

      These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant, so it could be interesting to test feature selection methods.

    5. Number of Instances: red wine - 1599; white wine - 4898.

    6. Number of Attributes: 11 + output attribute

      Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.

    7. Attribute information:

      For more information, read [Cortez et al., 2009].

      Input variables (based on physicochemical tests):

      1 - fixed acidity (tartaric acid - g / dm^3)
      2 - volatile acidity (acetic acid - g / dm^3)
      3 - citric acid (g / dm^3)
      4 - residual sugar (g / dm^3)
      5 - chlorides (sodium chloride - g / dm^3)
      6 - free sulfur dioxide (mg / dm^3)
      7 - total sulfur dioxide (mg / dm^3)
      8 - density (g / cm^3)
      9 - pH
      10 - sulphates (potassium sulphate - g / dm^3)
      11 - alcohol (% by volume)

      Output variable (based on sensory data):

      12 - quality (score between 0 and 10)

    8. Missing Attribute Values: None

    9. Description of attributes:

      1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

      2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

      3 - citric acid: found in small quantities, citric acid can add 'freshness' and flavor to wines

      4 - residual sugar: the amount of sugar remaining after fermentation stops, it's rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

      5 - chlorides: the amount of salt in the wine

      6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

      7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

      8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

      9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

      10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

      11 - alcohol: the percent alcohol content of the wine

      Output variable (based on sensory data): 12 - quality (score between 0 and 10)
