License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT The hatchery is one of the most important segments of the poultry chain, and it generates an abundance of data which, when analyzed, allow for identifying critical points of the process. The aim of this study was to evaluate the applicability of the data mining (DM) technique to databases of egg incubation of broiler breeders and laying hen breeders. The study uses a database recording egg incubation from broiler breeders housed in pens with shavings used for litter under natural mating, as well as laying hen breeders housed in cages using an artificial insemination mating system. The DM technique was applied in a classification task, using the type of breeder and housing system to delineate classes. The database was analyzed in three different ways: original database, attribute selection, and expert analysis. Models were selected on the basis of model precision and class accuracy. The data mining technique allowed for the classification of hatchery fertile eggs from different genetic groups, with hatching rate and the percentage of fertile eggs being the attributes with the greatest classification power. Broiler breeders showed higher fertility (> 95%), but higher embryonic mortality between the third and seventh day of incubation (> 0.5%) when compared to laying hen breeders' eggs. In conclusion, when applying data mining to the hatchery process, attribute selection and strategies based on the experience of experts can improve model performance.
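For readers who want to reproduce this kind of analysis, a minimal sketch follows; the file and column names are illustrative placeholders, and scikit-learn's mutual-information ranking stands in for whatever attribute-selection method the study actually used.

```python
# Minimal sketch of attribute selection plus classification on incubation
# records; file and column names are illustrative, not from the study, and
# mutual information stands in for the study's unspecified selection method.
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("incubation.csv")            # hypothetical export of hatchery data
X = df.drop(columns=["breeder_type"])         # class: broiler vs. laying hen breeder
y = df["breeder_type"]

# Rank attributes by mutual information with the class and keep the strongest.
selector = SelectKBest(mutual_info_classif, k=5).fit(X, y)
print(list(X.columns[selector.get_support()]))  # e.g. hatch rate, % fertile eggs

# Cross-validate a simple classifier on the selected attributes.
scores = cross_val_score(DecisionTreeClassifier(), selector.transform(X), y, cv=10)
print(round(scores.mean(), 3))
```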
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset containing raw beer recipe data obtained from Brewer's Friend. The dataset contains malt, hops, adjunct types and quantities, and recipe vital statistics [alcohol by volume (ABV), beer color in the standard reference method (SRM), standard gravity (SG), final gravity (FG), international bitterness units (IBUs), and hop usage]. In addition, beer category information and the country or geographical area from which they originated, combined with supercluster and cluster data, were added to each beer recipe.
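A quick pandas sketch of how such a table might be explored; the file name and the column names (ABV, IBU, SRM, SuperCluster) are assumptions about the schema rather than confirmed fields.

```python
# Quick exploration sketch; file name and column names are assumed.
import pandas as pd

recipes = pd.read_csv("recipes.csv")                  # hypothetical export
print(recipes[["ABV", "IBU", "SRM"]].describe())      # recipe vital statistics
print(recipes.groupby("SuperCluster")["ABV"].mean())  # ABV per style supercluster
```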
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Churn prediction aims to detect customers who intend to leave a service provider. Gaining a new customer costs an organization 5 to 10 times more than retaining an existing one. Predictive models can correctly identify likely churners in the near future so that a retention solution can be offered. This paper presents a new prediction model based on Data Mining (DM) techniques. The proposed model is composed of six steps: problem domain identification, data selection, data set investigation, classification, clustering, and knowledge usage. A data set with 23 attributes and 5,000 instances is used: 4,000 instances for training the model and 1,000 instances as a test set. The predicted churners are clustered into 3 categories for use in a retention strategy. The data mining techniques used in this paper are Decision Tree, Support Vector Machine, and Neural Network, applied through the open-source software WEKA.
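A rough Python analogue of that six-step workflow (the paper itself used WEKA); the 23-attribute schema, file name, and churn label are placeholders.

```python
# Rough Python analogue of the paper's WEKA workflow: train the three
# classifiers, hold out the last 1,000 instances, then cluster the predicted
# churners into 3 retention segments. Schema and file name are placeholders.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("churn.csv")                 # 5,000 instances, 23 attributes
X, y = df.drop(columns=["churn"]), df["churn"]
X_train, y_train, X_test, y_test = X[:4000], y[:4000], X[4000:], y[4000:]

for model in (DecisionTreeClassifier(), SVC(), MLPClassifier(max_iter=500)):
    print(type(model).__name__, model.fit(X_train, y_train).score(X_test, y_test))

# Segment the predicted churners into 3 categories for the retention strategy.
churners = X_test[model.predict(X_test) == 1]
segments = KMeans(n_clusters=3, n_init=10).fit_predict(churners)
```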
License: Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Classification of 43,621,232 Bitcoin Wallet addresses into 14 categories (following the classifications defined by BABD, cited below): (1) Blackmail, (2) Cyber-Security Service, (3) Darknet Market, (4) Centralized Exchange, (5) P2P Financial Infrastructure Service, (6) P2P Financial Service, (7) Gambling, (8) Government Criminal Blocklist, (9) Money Laundering, (10) Ponzi Scheme, (11) Mining Pool, (12) Tumbler, (13) Individual Wallets, and (14) Unknown Service.
This database was made by combining information from four separate sources. Aleš Janda, who operates WalletExplorer.com, generously offers an API providing the substantial wallet classification and identification he was able to determine. Mr. Janda determined the owners of wallets by registering for certain services and learning the addresses used by those services (like SatoshiDice), and then backed into other wallet addresses by determining which wallets were merged together (similar to shadow wallet address analyses done by others). However, WalletExplorer states it has not updated much data since 2016; thus, the data available from WalletExplorer is largely contained in the first 425,000 blocks (block 425,000 was mined on August 13, 2016). This classification data constituted, by far, the largest portion of classification data in this dataset, with over 43 million wallet addresses classified. Preference was always given to WalletExplorer.com's classifications, as “Strong Addresses.” Aleš Janda. Wallet explorer. https://www.walletexplorer.com/info.
Additionally, several datasets (in spreadsheets) identify and classify wallets:
The largest of these spreadsheets available on Kaggle was made by Xiang, et al. They were able to classify 544,462 wallet addresses between blocks 585,000 (July 12, 2019) and 685,000 (May 26, 2021) into 13 classifications: (1) Blackmail, (2) Cyber-Security Service, (3) Darknet Market, (4) Centralized Exchange, (5) P2P Financial Infrastructure Service, (6) P2P Financial Service, (7) Gambling, (8) Government Criminal Blocklist, (9) Money Laundering, (10) Ponzi Scheme, (11) Mining Pool, (12) Tumbler, and (13) Individual Wallets. They used the WalletExplorer database and other governmental blocklist databases as their “strong addresses” and information from BitcoinAbuse.com (which now appears to be ChainAbuse.com) as “weak addresses” to train their AI models. The processes they used to identify and classify the wallets on their experimental sets resulted in a minimum F-1 score of 92.97%, accuracy of 93.24%, precision of 92.80%, and recall of 93.24%. They accomplished this by using a framework consisting of two parts: a statistical indicator (SI) and a local structural indicator (LSI). The SI considered four indicator types: Pure Amount Indicator (PAI), Pure Degree Indicator (PDI), Pure Time Indicator (PTI), and Combination Indicator (CI); in total, the SI considered 132 features to predict the classification. For the LSI, they generated k-hop subgraphs with an algorithm and then used various graph metrics and other features to predict the classification. Yuexin Xiang, Yuchen Lei, Ding Bao, Tiantian Li, Qingqing Yang, Wenmao Liu, Wei Ren, and Kim-Kwang Raymond Choo. BABD: A Bitcoin address behavior dataset for pattern analysis. IEEE Transactions on Information Forensics and Security, (19):2171–85, 2024. https://www.kaggle.com/datasets/lemonx/babd13
Michalski et al. identified 8,808 addresses and classified them, using machine learning techniques over 149 features, into categories including (1) mining pools, (2) miners, (3) coinjoins, (4) gambling, (5) exchange, and (6) services. They analyzed the blocks between 520,850 and 520,950. They obtained training data from WalletExplorer.com and then evaluated several supervised learning algorithms. They determined that the Random Forest classifier was the best method of classifying the wallets, and stated they obtained an F-score of 95%. Due to the small size of this dataset and the fact that only 100 blocks were covered, I considered these classifications to be “weak addresses” for this work. The wallets they classified as “services” were added to my database as an “Unknown Service.” Radoslaw Michalski, Daria Dziubałtowska, and Piotr Macek. Bitcoin addresses and their categories. https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KEWU0N
The US Office of Foreign Assets Control maintains a list of ‘sanctioned’ Bitcoin and other digital currency assets. Several GitHub contributors maintain a tool that extracts the Bitcoin addresses from this database. This added 390 wallet addresses to the dataset. U.S. Treasury. Specially designated nationals list of the U.S. Office of Foreign Assets Control. https://www.treasury.gov/ofac/downloads/sanctions/1.0/sdn_advanced.xml. 0xB10C, Michael Neale, and Yahiheb. https://github.com/0xB10C/ofac-sanctioned-digital-currency-addresses?tab=readme-ov-file
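Putting the four sources together, a hedged sketch of the merging logic, with WalletExplorer's strong labels taking precedence over the weaker sources; the file names and the (address, category) schema are assumptions, not the dataset's actual layout.

```python
# Illustrative merging of the four sources, strong labels winning on conflict.
import pandas as pd

strong = pd.read_csv("walletexplorer.csv")     # strong addresses
weak = pd.concat([
    pd.read_csv("babd13.csv"),                 # Xiang et al. (BABD-13)
    pd.read_csv("michalski.csv"),              # "services" mapped to Unknown Service
    pd.read_csv("ofac_sdn.csv"),               # OFAC sanctioned addresses
])

# Concatenating strong first and keeping the first duplicate makes the strong
# classification win whenever both sources label the same address.
combined = pd.concat([strong, weak]).drop_duplicates(subset="address", keep="first")
combined.to_csv("bitcoin_address_categories.csv", index=False)
```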
From Dryad entry:
"Abstract
Neuroendocrine neoplasms (NENs) are clinically diverse and incompletely characterized cancers that are challenging to classify. MicroRNAs (miRNAs) are small regulatory RNAs that can be used to classify cancers. Recently, a morphology-based classification framework for evaluating NENs from different anatomic sites was proposed by experts, with the requirement of improved molecular data integration. Here, we compiled 378 miRNA expression profiles to examine NEN classification through comprehensive miRNA profiling and data mining. Following data preprocessing, our final study cohort included 221 NEN and 114 non-NEN samples, representing 15 NEN pathological types and five site-matched non-NEN control groups. Unsupervised hierarchical clustering of miRNA expression profiles clearly separated NENs from non-NENs. Comparative analyses showed that miR-375 and miR-7 expression is substantially higher in NEN cases than non-NEN controls. Correlation analyses showed that NENs from diverse anatomic sites have convergent miRNA expression programs, likely reflecting morphologic and functional similarities. Using machine learning approaches, we identified 17 miRNAs to discriminate 15 NEN pathological types and subsequently constructed a multi-layer classifier, correctly identifying 217 (98%) of 221 samples and overturning one histologic diagnosis. Through our research, we have identified common and type-specific miRNA tissue markers and constructed an accurate miRNA-based classifier, advancing our understanding of NEN diversity.
Methods
Sequencing-based miRNA expression profiles from 378 clinical samples, comprising 239 neuroendocrine neoplasm (NEN) cases and 139 site-matched non-NEN controls, were used in this study. Expression profiles were either compiled from published studies (n=149) or generated through small RNA sequencing (n=229). Prior to sequencing, total RNA was isolated from formalin-fixed paraffin-embedded (FFPE) tissue blocks or fresh-frozen (FF) tissue samples. Small RNA cDNA libraries were sequenced on HiSeq 2500 Illumina platforms using an established small RNA sequencing protocol (Hafner et al., 2012, Methods) and sequence annotation pipeline (Brown et al., 2013, Front Genet) to generate miRNA expression profiles. Scaling our existing approach to miRNA-based NEN classification (Panarelli et al., 2019, Endocr Relat Cancer; Ren et al., 2017, Oncotarget), we constructed and cross-validated a multi-layer classifier for discriminating NEN pathological types based on selected miRNAs.
Usage notes
Diagnostic histopathology and small RNA cDNA library preparation information for all samples are presented in Table S1 of the associated manuscript."
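As an illustration of the unsupervised step in the abstract, a minimal clustering sketch; the matrix layout, correlation distance, and average linkage are common choices for expression data, not the study's confirmed parameters.

```python
# Minimal sketch: hierarchical clustering of miRNA expression profiles to
# separate NEN from non-NEN samples; parameters are common defaults, assumed.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

expr = pd.read_csv("mirna_profiles.csv", index_col=0)  # rows: samples, cols: miRNAs
log_expr = np.log2(expr + 1)                           # variance stabilization

Z = linkage(log_expr.values, method="average", metric="correlation")
labels = fcluster(Z, t=2, criterion="maxclust")        # cut into two clusters
print(pd.Series(labels, index=expr.index).value_counts())
```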
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MusicOSet is an open and enhanced dataset of musical elements (artists, songs and albums) based on musical popularity classification. It provides a directly accessible collection of data suitable for numerous tasks in music data mining (e.g., data visualization, classification, clustering, similarity search, MIR, HSS, and so forth). To create MusicOSet, the potential information sources were divided into three main categories: music popularity sources, metadata sources, and acoustic and lyrical features sources. Data from all three categories were initially collected between January and May 2019, and then updated and enhanced in June 2019.
The contents of MusicOSet include:
| Data | # Records |
|:-----------------:|:---------:|
| Songs | 20,405 |
| Artists | 11,518 |
| Albums | 26,522 |
| Lyrics | 19,664 |
| Acoustic Features | 20,405 |
| Genres | 1,561 |
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). An EDA comprises a set of statistical and data mining procedures to describe data. We ran an EDA to provide statistical facts and inform conclusions. The mined facts supply arguments that bear on the Systematic Literature Review (SLR) of DL4SE.
The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD, process (Fayyad et al., 1996). The KDD process extracts knowledge from a DL4SE structured database, which was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:
Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into the 35 features, or attributes, found in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.
Preprocessing. Preprocessing consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to fill in missing information produced by the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”, where “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features like “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks or methods.
Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis. We performed a Principal Component Analysis (PCA) to reduce the 35 features into 2 components for visualization purposes. Furthermore, PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us identify the number of clusters to use when tuning the explainable models. A sketch of this step follows.
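The sketch below is written under stated assumptions: nominal features are one-hot encoded before PCA, and the cluster count is chosen where the drop in within-cluster variance flattens (the elbow); the file name is a placeholder.

```python
# Sketch of the transformation step: PCA to 2 components for plotting, and an
# elbow scan over k for the cluster count; schema and file name are assumed.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

papers = pd.read_csv("dl4se_features.csv")       # 35 nominal attributes (assumed)
X = pd.get_dummies(papers.astype(str))           # one-hot encode nominal features
coords = PCA(n_components=2).fit_transform(X)    # 2D view for visualization

inertias = {k: KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in range(2, 10)}
print(inertias)  # pick k at the elbow, then reuse it when tuning the models
```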
Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncovering hidden relationships among the extracted features (Correlations and Association Rules) and to categorizing the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.
Interpretation/Evaluation. We used the KDD process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes, which produced an argument support analysis (see this link).
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
Overview of the most meaningful Association Rules. Rectangles represent both premises and conclusions. An arrow connecting a premise with a conclusion implies that, given the premise, the conclusion is associated. E.g., given that an author used Supervised Learning, we can conclude that their approach is irreproducible, with a certain Support and Confidence.
Support: the number of instances for which the whole statement (premise and conclusion) holds, divided by the total number of instances. Confidence: the support of the statement divided by the number of occurrences of the premise, i.e., how often the conclusion holds when the premise does. A toy computation follows.
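```python
# Toy computation of Support and Confidence as defined above, on boolean
# columns; the column names are illustrative.
import pandas as pd

papers = pd.DataFrame({
    "supervised_learning": [1, 1, 1, 0, 1],
    "irreproducible":      [1, 1, 0, 0, 1],
})
premise = papers["supervised_learning"] == 1
conclusion = papers["irreproducible"] == 1

support = (premise & conclusion).mean()                    # joint frequency
confidence = (premise & conclusion).sum() / premise.sum()  # conditional frequency
print(f"support={support:.2f} confidence={confidence:.2f}")
```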
License: Attribution-NonCommercial 3.0 (CC BY-NC 3.0), https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
UVigoMED is a monolingual (English), single-label corpus composed of 92,661 biomedical abstracts extracted from MEDLINE, classified under 26 MeSH categories.
Privacy policy: https://www.wiseguyreports.com/pages/privacy-policy
| BASE YEAR | 2024 |
| HISTORICAL DATA | 2019 - 2023 |
| REGIONS COVERED | North America, Europe, APAC, South America, MEA |
| REPORT COVERAGE | Revenue Forecast, Competitive Landscape, Growth Factors, and Trends |
| MARKET SIZE 2024 | 2,308.3 (USD Million) |
| MARKET SIZE 2025 | 2,426.0 (USD Million) |
| MARKET SIZE 2035 | 4,000.0 (USD Million) |
| SEGMENTS COVERED | Application, Deployment Type, End User, Technology, Regional |
| COUNTRIES COVERED | US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA |
| KEY MARKET DYNAMICS | Technological advancements, Increasing data volumes, Rising regulatory requirements, Demand for efficient classification, Growth of cloud-based solutions |
| MARKET FORECAST UNITS | USD Million |
| KEY COMPANIES PROFILED | Roche, SeraCare Life Sciences, Illumina, Thermo Fisher Scientific, Danaher Corporation, Becton Dickinson, PerkinElmer, Tecan Group, Pacific Biosciences, Genomatix, QIAGEN, Merck Group, BioRad Laboratories, Abbott Laboratories, Agilent Technologies, NanoString Technologies |
| MARKET FORECAST PERIOD | 2025 - 2035 |
| KEY MARKET OPPORTUNITIES | AI-driven data analysis solutions, Increased demand for regulatory compliance, Growth in data-driven decision making, Expanding healthcare and pharma sectors, Integration with emerging technologies |
| COMPOUND ANNUAL GROWTH RATE (CAGR) | 5.1% (2025 - 2035) |
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, refer to [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.
For more information, read [Cortez et al., 2009]. Input variables (based on physicochemical tests): 1 - fixed acidity, 2 - volatile acidity, 3 - citric acid, 4 - residual sugar, 5 - chlorides, 6 - free sulfur dioxide, 7 - total sulfur dioxide, 8 - density, 9 - pH, 10 - sulphates, 11 - alcohol. Output variable (based on sensory data): 12 - quality (score between 0 and 10)
This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality; both datasets (red and white vinho verde wine samples from the north of Portugal) can be obtained from that link.
Please include this citation if you plan to use this database:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
We kagglers can apply several machine-learning algorithms to determine which physicochemical properties make a wine 'good'!
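One way to start, sketched below: a random forest over the 11 physicochemical inputs with an illustrative "good" cutoff of quality >= 7; the hyperparameters are untuned.

```python
# Minimal scikit-learn sketch; the "good" threshold and parameters are
# illustrative choices, not from the original study.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

wine = pd.read_csv("winequality-red.csv", sep=";")       # UCI files use ';'
X, y = wine.drop(columns=["quality"]), wine["quality"] >= 7

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("accuracy:", model.score(X_te, y_te))
print(pd.Series(model.feature_importances_, index=X.columns).sort_values())
```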
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This work developed an image dataset of the underground longwall mining face (DsLMF+), which consists of 138,004 images annotated across 6 categories: mine personnel, hydraulic support guard plate, large coal, towline, miners' behaviour, and mine safety helmet. All labels of the dataset are publicly available in YOLO format and COCO format. The dataset aims to support further research on and advancement of the intelligent identification and classification of abnormal conditions in underground mining.
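For orientation, a small sketch of reading one YOLO-format label file, where each line is "class_id x_center y_center width height" with coordinates normalized to [0, 1]; the file path and the class-index ordering are assumptions.

```python
# Parse one YOLO-format label file; path and class ordering are assumed.
from pathlib import Path

CLASSES = ["mine personnel", "hydraulic support guard plate", "large coal",
           "towline", "miners' behaviour", "mine safety helmet"]  # assumed order

for line in Path("labels/000001.txt").read_text().splitlines():
    cls, xc, yc, w, h = line.split()
    print(CLASSES[int(cls)], float(xc), float(yc), float(w), float(h))
```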
Privacy policy: https://www.htfmarketinsights.com/privacy-policy
Global Data Mining Tools Market is segmented by Application (Predictive Analytics, Fraud Detection, Marketing, Healthcare Diagnostics, Manufacturing Optimization), Type (Classification Tools, Clustering Tools, Regression Tools, Association Tools, Text Mining Tools), and Geography (North America, LATAM, West Europe, Central & Eastern Europe, Northern Europe, Southern Europe, East Asia, Southeast Asia, South Asia, Central Asia, Oceania, MEA).
Subject Area: Text Mining
Description: This is the dataset used for the SIAM 2007 Text Mining competition, which focused on developing text mining algorithms for document classification. The documents in question were aviation safety reports documenting one or more problems that occurred during certain flights. The goal was to label the documents with respect to the types of problems that were described. This is a subset of the Aviation Safety Reporting System (ASRS) dataset, which is publicly available.
How Data Was Acquired: The data for this competition came from human-generated reports on incidents that occurred during a flight.
Sample Rates, Parameter Description, and Format: There is one document per incident. The datasets are in raw text format. All documents for each set are contained in a single file. Each row in this file corresponds to a single document. The first characters on each line are the document number, and a tilde separates the document number from the text itself.
Anomalies/Faults: This is a document category classification problem.
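A minimal parsing sketch for that format; the file name is a placeholder.

```python
# One document per line; a tilde separates the document number from the text.
docs = {}
with open("asrs_train.txt", encoding="utf-8") as fh:   # hypothetical file name
    for line in fh:
        doc_id, _, text = line.partition("~")
        docs[doc_id.strip()] = text.strip()
print(len(docs), "documents loaded")
```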
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
In this research, we generated student retention alerts. The alerts are classified into two types: preventive and corrective. This classification varies according to the level of maturity of the data systematization process; therefore, data mining techniques have been applied to systematize the data. The experimental analytical method was used, with a population of 13,715 students with 62 sociological, academic, family, personal, economic, psychological, and institutional variables, and factors such as academic follow-up and performance, financial situation, and personal information. In particular, information is collected on each of the problems, or combination of problems, that could affect dropout rates. Following the methodology, the information has been generated through an abstract data model to reflect the profile of the dropout student. As an advance on previous research, this proposal creates preventive and corrective alternatives to avoid dropout from higher education. Also, in contrast to previous work, we generated corrective warnings with the application of data mining techniques such as neural networks, reaching a precision of 97% and a loss of 0.1052. In conclusion, this study aims to analyze the behavior of students who drop out of the university through the evaluation of predictive patterns. The overall objective is to predict the profile of student dropout, considering reasons such as admission to higher education and career changes. Consequently, using a data systematization process promotes the permanence of students in higher education. Once the profile of the dropout has been identified, student retention strategies are approached according to the time of their appearance and the point of view of the institution.
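A hedged sketch of the corrective-warning modeling step: a small neural network over the 62 student variables with a binary dropout target. The architecture, file name, and column names are illustrative, not taken from the study.

```python
# Illustrative dropout classifier; schema and architecture are assumptions.
import pandas as pd
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

students = pd.read_csv("students.csv")       # 13,715 rows, 62 variables + label
X = StandardScaler().fit_transform(students.drop(columns=["dropped_out"]))
y = students["dropped_out"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0)
clf.fit(X_tr, y_tr)
print("precision:", precision_score(y_te, clf.predict(X_te)))
```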
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data mining approaches can uncover underlying patterns in chemical and pharmacological property space that are decisive for drug discovery and development. Two of the most common approaches are visualization and machine learning methods. Visualization methods use dimensionality reduction techniques to reduce multi-dimensional data into 2D or 3D representations with a minimal loss of information. Machine learning attempts to find correlations between specific activities or classifications for a set of compounds and their features by means of recurring mathematical models. Both take advantage of the deep relationships that can exist between features of compounds, providing classification of compounds based on such features or, in the case of visualization methods, uncovering underlying patterns in the feature space. Drug-likeness has been studied from several viewpoints, but here we provide the first implementation in chemoinformatics of the t-Distributed Stochastic Neighbor Embedding (t-SNE) method for the visualization and representation of chemical space, and the use of different machine learning methods, separately and together, to form a new ensemble learning method called AL Boost. The models obtained from AL Boost synergistically combine decision tree, random forest (RF), support vector machine (SVM), artificial neural network (ANN), k nearest neighbors (kNN), and logistic regression models. In this work, we show that together they form a predictive model that not only improves the predictive force but also decreases bias. This resulted in a corrected classification rate of over 0.81, as well as higher sensitivity and specificity rates for the models. In addition, separation and good models were also achieved for disease categories such as antineoplastic compounds and nervous system diseases, among others. Such models can be used to guide decisions on the feature landscape of compounds and their likeness to drugs or to other characteristics, such as specific or multiple disease category(ies) or organ(s) of action of a molecule.
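A conceptual sketch of combining the six named base learners; the actual weighting and tuning of AL Boost are not specified in this description, so a plain soft-voting ensemble stands in for it.

```python
# AL Boost-style combination sketch: soft voting over the six base learners.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier()),
        ("rf", RandomForestClassifier()),
        ("svm", SVC(probability=True)),
        ("ann", MLPClassifier(max_iter=500)),
        ("knn", KNeighborsClassifier()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # average class probabilities across the six models
)
# ensemble.fit(X_train, y_train); ensemble.predict(X_test)
# For the visualization side, scikit-learn's t-SNE gives a 2D embedding:
# from sklearn.manifold import TSNE; coords = TSNE(n_components=2).fit_transform(X)
```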
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is part of the UCR Archive maintained by University of Southampton researchers. Please cite a relevant or the latest full archive release if you use the datasets. See http://www.timeseriesclassification.com/.
This dataset is part of a project with the Scotch Whisky Research Institute into detecting forged spirits in a non-intrusive manner. One way of detecting forgery without sampling the contents is through inspecting the ethanol level by spectrograph. The dataset covers 20 different bottle types and four levels of alcohol: 35%, 38%, 40% and 45%. Each series is a spectrograph of 1751 observations. This dataset is an example of when it is wrong to merge and resample, because the train/test splits are constructed so that the same bottle type is never in both the train and test sets. There are 4 classes:
- Class 1: E35
- Class 2: E38
- Class 3: E40
- Class 4: E45
For more information about this dataset, see [1,2].
[1] Lines, Jason, Sarah Taylor, and Anthony Bagnall. "Hive-cote: The hierarchical vote collective of transformation-based ensembles for time series classification." Data Mining (ICDM), 2016 IEEE 16th International Conference on. IEEE, 2016.
[2] J. Large, E. K. Kemsley, N. Wellner, I. Goodall, and A. Bagnall, "Detecting forged alcohol non-invasively through vibrational spectroscopy and machine learning," in Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2018.
Donor: A. Bagnall
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Issue tracking systems enable users and developers to comment on problems plaguing a software system. Empirical Software Engineering (ESE) researchers study (open-source) project issues and the comments and threads within to discover, among other things, challenges developers face when, e.g., incorporating new technologies, platforms, and programming language constructs. However, issue discussion threads accumulate over time and thus can become unwieldy, hindering any insight that researchers may gain. While existing approaches alleviate this burden by classifying issue thread comments, there is a gap between searching popular open-source software repositories (e.g., those on GitHub) for issues containing particular keywords and feeding the results into a classification model. In this paper, we demonstrate a research infrastructure tool called QuerTCI that bridges this gap by integrating the GitHub issue comment search API with the classification models found in existing approaches. Using queries, ESE researchers can retrieve GitHub issues containing particular keywords, e.g., those related to a certain programming language construct, and subsequently classify the kinds of discussions occurring in those issues. With our tool, we hope that ESE researchers can uncover challenges related to particular technologies using certain keywords through popular open-source repositories more seamlessly than previously possible. A tool demonstration video may be found at: https://youtu.be/fADKSxn0QUk.
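Conceptually, the pipeline looks like the hedged sketch below, which uses GitHub's public REST issue-search endpoint; the classify() step is a placeholder for the classification models from existing approaches, not QuerTCI's actual code.

```python
# Query GitHub's issue search for a keyword, then classify thread comments.
import requests

def fetch_issues(keyword: str, repo: str):
    # Unauthenticated search calls are heavily rate-limited by GitHub.
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": f"{keyword} repo:{repo} is:issue"},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["items"]

for issue in fetch_issues("lambda expression", "openjdk/jdk"):
    print(issue["number"], issue["title"])
    # Each comment under issue["comments_url"] would then be classified,
    # e.g. label = classify(comment_body)   # placeholder classifier
```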
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the last decade, trajectory classification has received significant attention. The vast amount of data generated on social media and by sensor networks, IoT devices, and other Internet-enabled sources has allowed the semantic enrichment of mobility data, making the classification task more challenging. Existing trajectory classification methods have mainly considered space, time and numerical data, ignoring the semantic dimensions. Only recently proposed methods such as Movelets and MASTERMovelets can handle all types of dimensions. MASTERMovelets is the only method that automatically discovers the best dimension combination and subtrajectory size for trajectory classification. However, although it outperformed the state-of-the-art in terms of accuracy, MASTERMovelets is computationally expensive and results in a high-dimensionality problem, which makes it unfeasible for most real trajectory datasets that contain a big volume of data. To overcome this problem and enable the application of the movelets approach on large datasets, in this paper we propose a new high-performance method for extracting movelets and classifying trajectories, called HiPerMovelets (High-performance Movelets). Experimental results show that HiPerMovelets is 10 times faster than MASTERMovelets, reduces the high-dimensionality problem, is more scalable, and presents high classification accuracy on all evaluated datasets with both raw and semantic trajectories.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Taken from "Classifying document types to enhance search and recommendations in digital libraries" (https://www.overleaf.com/read/zzzrvmzmwdck).
Abstract: In this paper, we address the problem of classifying documents available from the global network of (open access) repositories according to their type. We show that the metadata provided by repositories enabling us to distinguish research papers, theses and slides are missing in over 60% of cases. While these metadata describing document types are useful in a variety of scenarios ranging from research analytics to improving search and recommender (SR) systems, this problem has not yet been sufficiently addressed in the context of the repositories infrastructure. We have developed a new approach for classifying document types using supervised machine learning based exclusively on text-specific features. We achieve 0.96 F1-score using the random forest and AdaBoost classifiers, which are the best performing models on our data. By analysing the SR system logs of the CORE digital library aggregator, we show that users are an order of magnitude more likely to click on research papers and theses than on slides. This suggests that using document types as a feature for ranking/filtering SR results in digital libraries has the potential to improve user experience.
The descriptors, as featured in the study, are encoded in the dataset as follows:
- authors_len: Number of authors associated with the document entry.
- num_of_pages: Number of pages the document has in total.
- avg_word_per_page: Average words per page in the document.
- total_words: Total words in the document.
- source: The online service from which the document originated (either "CORE" or "SlideShare").
- id: Identifier with which the source's API can be queried to retrieve the corresponding document.
- label: The document's type, from "research", "thesis" or "slides".
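A sketch of that classification setup over the encoded descriptors, using a random forest (one of the two best-performing model families reported); the file name is a placeholder and the hyperparameters are untuned.

```python
# Predict "label" from the numeric descriptors; file name is assumed.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

docs = pd.read_csv("document_types.csv")
features = ["authors_len", "num_of_pages", "avg_word_per_page", "total_words"]
X, y = docs[features], docs["label"]        # "research", "thesis" or "slides"

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("macro F1:", f1_score(y_te, clf.predict(X_te), average="macro"))
```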
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
Citation Request: This dataset is public available for research. The details are described in [Cortez et al., 2009]. Please include this citation if you plan to use this database:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
Title: Wine Quality
Sources: Created by Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009
Past Usage:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. PH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity analysis procedure).
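A hedged sketch of that regression setup: a support vector regressor over the 11 inputs with quality as the target, reporting MAD and accuracy within a fixed tolerance T; the kernel and parameters are illustrative, not the tuned values from Cortez et al., 2009.

```python
# SVR over the physicochemical inputs; parameters are illustrative.
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

wine = pd.read_csv("winequality-white.csv", sep=";")
X, y = wine.drop(columns=["quality"]), wine["quality"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10)).fit(X_tr, y_tr)
pred = svr.predict(X_te)
print("MAD:", mean_absolute_error(y_te, pred))
print("within T=0.5:", (abs(pred - y_te) <= 0.5).mean())
```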
Relevant Information:
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.
Number of Instances: red wine - 1599; white wine - 4898.
Number of Attributes: 11 + output attribute
Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.
Attribute information:
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests): 1 - fixed acidity (tartaric acid - g / dm^3), 2 - volatile acidity (acetic acid - g / dm^3), 3 - citric acid (g / dm^3), 4 - residual sugar (g / dm^3), 5 - chlorides (sodium chloride - g / dm^3), 6 - free sulfur dioxide (mg / dm^3), 7 - total sulfur dioxide (mg / dm^3), 8 - density (g / cm^3), 9 - pH, 10 - sulphates (potassium sulphate - g / dm^3), 11 - alcohol (% by volume). Output variable (based on sensory data): 12 - quality (score between 0 and 10)
Missing Attribute Values: None
Description of attributes:
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add 'freshness' and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it's rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output variable (based on sensory data): 12 - quality (score between 0 and 10)