Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains data used to illustrate the utility of the Weka detector in TrackMate.
More detail on using these files can be found here: https://imagej.net/plugins/trackmate/trackmate-weka.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These datasets contain a set of news articles in English, French and Spanish extracted from Medisys (i.e. advanced search) according to the following criteria: (1) Keywords (at least): COVID-19, ncov2019, cov2019, coronavirus; (2) Keywords (all words): masque (French), mask (English), máscara (Spanish); (3) Periods: March 2020, May 2020, July 2020; (4) Countries: UK (English), Spain (Spanish), France (French). A corpus by country has been manually collected (copy/paste) from Medisys. For each country, 100 snippets by period (the 1st, 10th, 15th, 20th for each month) are built.
The datasets are composed of:
(1) A corpus preprocessed for the BioTex tool - https://gitlab.irstea.fr/jacques.fize/biotex_python (.txt) [~ 900 texts];
(2) The same corpus preprocessed for the Weka tool - https://www.cs.waikato.ac.nz/ml/weka/ (.arff);
(3) Terms extracted with BioTex according to spatio-temporal criteria (*.csv) [~ 9000 terms].
Other corpora can be collected with this same method. The Perl code used to preprocess textual data for terminology extraction (with BioTex) and classification (with Weka) tasks is available.
A new version of this dataset (December 2020) includes additional data:
- Python preprocessing and BioTex code [Execution_BioTex.tgz].
- Terms extracted with different ranking measures (i.e. C-Value, F-TFIDF-C_M) and methods (i.e. extraction of words and multi-word terms) with the online version of BioTex [Terminology_with_BioTex_online_dec2020.tgz].
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains data used to illustrate the utility of the Weka detector in TrackMate.
- classifier.model: trained Weka classifier.
- MDA231 paxillin DMSO 1 min.czi - MDA231 paxillin DMSO 1 min.czi #01_t1_t40_crop.tif: example image.
More detail on using these files can be found here: https://imagej.net/plugins/trackmate/trackmate-weka.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
File Name: WordsSelectedByInformationGain.csv
Data Preparation: Xiaoru Dong, Linh Hoang
Date of Preparation: 2018-12-12
Data Contributions: Jingyi Xie, Xiaoru Dong, Linh Hoang
Data Source: Cochrane systematic reviews published up to January 3, 2018 by 52 different Cochrane groups in 8 Cochrane group networks.
Associated Manuscript authors: Xiaoru Dong, Jingyi Xie, Linh Hoang, and Jodi Schneider.
Associated Manuscript, Working title: Machine classification of inclusion criteria from Cochrane systematic reviews.
Description: The file contains a list of 1655 informative words selected by applying the information gain feature selection strategy. Information gain is a commonly used feature selection method; it measures how many bits of information the presence of a word contributes to predicting the class, and can be computed with a specific formula [Jurafsky D, Martin JH. Speech and language processing. London: Pearson; 2014 Dec 30]. We ran Information Gain feature selection in Weka, a machine learning tool.
Notes: To reproduce the data in this file, get the project code published on GitHub at https://github.com/XiaoruDong/InclusionCriteria and run it following the instructions provided.
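As a rough illustration of the information gain described above, the sketch below computes it for a single word as the class entropy minus the class entropy conditioned on whether the word is present. The function names, class labels, and toy data are assumptions made for illustration; they are not taken from the project's code or from Weka, and the sketch does not claim to reproduce the values in the file.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(word_present, labels):
    """Class entropy minus the entropy left after splitting on word presence."""
    total = len(labels)
    with_word = [y for x, y in zip(word_present, labels) if x]
    without_word = [y for x, y in zip(word_present, labels) if not x]
    conditional = (
        (len(with_word) / total) * entropy(with_word)
        + (len(without_word) / total) * entropy(without_word)
    )
    return entropy(labels) - conditional


# Toy data (hypothetical): four documents, two classes.
labels = ["inclusion", "inclusion", "exclusion", "exclusion"]
perfect = [True, True, False, False]   # word appears only in one class
useless = [True, False, True, False]   # word appears equally in both classes

print(information_gain(perfect, labels))
print(information_gain(useless, labels))
```

A word that separates the classes perfectly gains the full class entropy, while a word spread evenly across classes gains nothing; ranking words by this score and keeping the top ones is the general idea behind the selection.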
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Faces Dataset: PubFig05
This is a subset of the "PubFig83" dataset [1] which provides 100 images for each of the 5 celebrities that are most difficult to recognise (each celebrity is a class in the classification problem). For each celebrity, we took 100 images and separated them into training and testing sets of 90 and 10 images, respectively:
Person: Jennifer Lopez; Katherine Heigl; Scarlett Johansson; Mariah Carey; Jessica Alba
Feature Extraction
To extract features from images, we have applied the HT-L3-model as described in [2] and obtained 25600 features.
Feature Selection
The feature selection steps are, in brief:
Entropy Filtering: First, we applied an implementation of Fayyad and Irani's [3] entropy-based heuristic to discretise the dataset and discarded features using the minimum description length (MDL) principle. Only 4878 features passed this entropy-based filtering method.
Class-Distribution Balancing: Next, we converted the dataset into a binary-class problem by separating it into 5 binary-class datasets using a one-vs-all setup. As a result, these datasets became imbalanced at a ratio of 1:4. We then converted them into balanced binary-class datasets using random sub-sampling. Further processing of the dataset is described in the paper.
(alpha,beta)-k Feature selection: To get a good feature set for training the classifier, we selected features using an approach based on the (alpha,beta)-k feature selection problem [4]. It selects a minimum subset of features that maximises both the similarity within each class and the dissimilarity between different classes. We applied the entropy filtering and (alpha,beta)-k feature subset selection methods in three ways and obtained different numbers of features (in the Table below) after consolidating them into a binary-class dataset.
UAB: We applied the (alpha,beta)-k feature set method to each of the balanced binary-class datasets and took the union of the features selected for each binary-class dataset. Finally, we applied the (alpha,beta)-k feature set selection method to each of the binary-class datasets and obtained a set of features.
IAB: We applied the (alpha,beta)-k feature set method to each of the balanced binary-class datasets and took the intersection of the features selected for each binary-class dataset. Finally, we applied the (alpha,beta)-k feature set selection method to each of the binary-class datasets and obtained a set of features.
UEAB: We applied the (alpha,beta)-k feature set method to each of the balanced binary-class datasets. Then, we applied the entropy filtering and (alpha,beta)-k feature set selection method to each of the balanced binary-class datasets. Finally, we took the union of the features selected for each balanced binary-class dataset and obtained a set of features.
All of these datasets are inside the compressed folder, which also contains a document describing the process in detail.
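The class-distribution balancing step described above (one-vs-all relabelling followed by random sub-sampling) can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the function names, the placeholder class names, the seed, and the use of 100 items per class are all hypothetical, and the sketch is not the procedure used to build the released files.

```python
import random


def one_vs_all(labels, positive_class):
    """Relabel a multi-class problem as binary: 1 for the chosen class, 0 otherwise."""
    return [1 if y == positive_class else 0 for y in labels]


def balanced_indices(binary_labels, seed=0):
    """Randomly sub-sample the larger class so both classes end up the same size."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(binary_labels) if y == 1]
    negatives = [i for i, y in enumerate(binary_labels) if y == 0]
    n = min(len(positives), len(negatives))
    return sorted(rng.sample(positives, n) + rng.sample(negatives, n))


# Toy labels: 5 placeholder classes with 100 items each.
labels = [c for c in "ABCDE" for _ in range(100)]
binary = one_vs_all(labels, "A")   # 100 positives vs 400 negatives (ratio 1:4)
keep = balanced_indices(binary)    # indices of a balanced 1:1 subset

print(sum(binary), len(binary) - sum(binary), len(keep))
```

Repeating this once per class yields the five balanced binary-class datasets on which a feature selection method can then be run.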
References
[1] Pinto, N., Stone, Z., Zickler, T., & Cox, D. (2011). Scaling up biologically-inspired computer vision: A case study in unconstrained face recognition on facebook. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on (pp. 35–42).
[2] Cox, D., & Pinto, N. (2011). Beyond simple features: A large-scale feature search approach to unconstrained face recognition. In Automatic Face Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on (pp. 8–15).
[3] Fayyad, U. M., & Irani, K. B. (1993). Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In International Joint Conference on Artificial Intelligence (pp. 1022–1029).
[4] Berretta, R., Mendes, A., & Moscato, P. (2005). Integer programming models and algorithms for molecular classification of cancer from microarray data. In Proceedings of the Twenty-eighth Australasian conference on Computer Science - Volume 38 (pp. 361–370). Australian Computer Society, Inc.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SI1_Supporting Information file (docx) brings together detailed information on the outstanding models obtained for each dataset analyzed in this study, such as statistical and training parameters and outliers. It also reports the responses, in spikes/s, of the mosquito Culex quinquefasciatus to the 50 IRs, and presents a full table of up-to-date studies related to QSAR and insect repellency.
SI2_EXP1_50IRs from Liu et al (2013) SDF file presents the structures of each of the 50 IRs analyzed.
SI3_EXP2_Datasets gathers the four datasets as SDF files from Oliferenko et al. (2013), Gaudin et al. (2008), Omolo et al. (2004), and Paluch et al. (2009) used for the repellency modeling in EXP2.
SI4_EXP3_Prospective analysis provides the Malaria Box Library (400 compounds) as an SDF file; these compounds were analyzed in our virtual screening to identify potential virtual hits.
SI5_QuBiLS-MIDAS MDs lists contains three TXT lists of 3D molecular descriptors used in QuBiLS-MIDAS to describe the molecules used in the present study.
SI6_EXP1_Sensillar Modeling comprises two subfolders: Classification and Regression models for each of the six sensilla. The models were built to predict the physiological interaction experimentally obtained by Liu et al. (2013). All of the models are implemented in the software SiLiS-PAPACS. Each folder contains a DOCX file with the detailed description of the model, an XLSX file with the output obtained from training in Weka 3.9.4, ARFF and CSV files with the MDs for each molecule, and the SDF of the study dataset.
SI7_EXP2_Repellency Modeling encompasses the four datasets in the study: Oliferenko et al. (2013), Gaudin et al. (2008), Omolo et al. (2004), and Paluch et al. (2009). Inside the subfolders, there are the three selected models per type of MDs (duplex, triple, generic, and mix) that best predict each dataset. As in the SI6 folder, each model includes six files: DOCX, XLSX, ARFF, CSV, and an SDF.
SI8_Virtual Hits includes the cluster analysis results and physico-chemical properties of new IR virtual leads.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Classifier result.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Experiments results.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LDA attributes.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Best parameter values.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These parameters were used to evaluate the Naïve Bayes (NB), Multilayer Perceptron (MLP), Random Forest (RF), Gradient Boosting (GB), Support Vector Machine (SVM) and K-Nearest Neighbors (KNN) algorithms when applied to the imbalanced sequences. The color trend of the F-score from blue to red indicates performance from the best to the poorest. Accuracy, sensitivity, specificity, and F-score are represented in the table as Acc, Sen, Spec, and F-sco, respectively.
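For readers unfamiliar with the four reported measures, the sketch below derives them from the counts of a binary confusion matrix. It is only an illustration: the function name and the example counts are hypothetical, and the F-score is assumed here to be the balanced F1 (harmonic mean of precision and sensitivity), which the description does not state explicitly.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and F1 from confusion-matrix counts."""
    def ratio(num, den):
        # Return 0.0 instead of failing when a denominator is zero.
        return num / den if den else 0.0

    acc = ratio(tp + tn, tp + fp + tn + fn)   # share of all correct predictions
    sen = ratio(tp, tp + fn)                  # true positive rate (recall)
    spec = ratio(tn, tn + fp)                 # true negative rate
    precision = ratio(tp, tp + fp)
    f_sco = ratio(2 * precision * sen, precision + sen)  # assumed F1
    return {"Acc": acc, "Sen": sen, "Spec": spec, "F-sco": f_sco}


# Hypothetical counts for an imbalanced test set (10 positives, 90 negatives).
scores = binary_metrics(tp=8, fp=2, tn=88, fn=2)
print(scores)
```

On imbalanced data, accuracy can stay high even when the minority class is poorly recognised, which is one reason sensitivity, specificity, and an F-score are typically reported alongside it.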
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Two synthetic datasets for binary classification, generated with the Random Radial Basis Function generator from WEKA. They have the same shape and size (104,952 instances, 185 attributes), but the "balanced" dataset has 52.13% of its instances belonging to class c0, while the "unbalanced" one has only 4.04% of its instances belonging to class c0. This set of datasets is therefore primarily meant for studying how class balance influences the behaviour of a machine learning model.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The performance of stage classification models based on different machine learning techniques, developed using 21 methylation CpG sites selected by WEKA (LS-CPG-WEKA).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the raw dataset associated with the scientific article "Stable psychological traits predict psychological perceived stress to COVID-19 outbreak", by L. Flesia, V. Fietta, B. Segatto, M. Monaro. Data are contained in the Excel file and organized as follows:
- the entire dataset used by the authors to perform statistical analysis
- the training set used by the authors to train and validate ML models
- the test set used by the authors to test the ML models
The "Legend" file contains the description of each variable in the Excel file.
Step-by-step instructions to replicate the results of the ML classification models reported in the paper are also included, together with two .arff files containing the training and test sets, which can be run directly in WEKA 3.9.
The "COVID-19 QUESTIONNAIRE" file contains the English version of the questions administered to participants.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The performance of stage classification models developed using 30 RNA transcripts selected using WEKA from 103 RNA transcripts (LS-RNA-WEKA).
-> If you use Turkish_Product_Reviews_by_Gozukara_and_Ozel_2016 dataset please cite: https://dergipark.org.tr/en/pub/cukurovaummfd/issue/28708/310341
@research article { cukurovaummfd310341,
  journal = {Çukurova Üniversitesi Mühendislik-Mimarlık Fakültesi Dergisi},
  issn = {1019-1011},
  eissn = {2564-7520},
  address = {Çukurova Üniversitesi Mühendislik-Mimarlık Fakültesi Dergisi Yayın Kurulu Başkanlığı 01330 ADANA},
  publisher = {Cukurova University},
  year = {2016},
  volume = {31},
  pages = {464 - 482},
  doi = {10.21605/cukurovaummfd.310341},
  title = {Türkçe ve İngilizce Yorumların Duygu Analizinde Doküman Vektörü Hesaplama Yöntemleri için Bir Deneysel İnceleme},
  key = {cite},
  author = {Gözükara, Furkan and Özel, Selma Ayşe}
}
https://doi.org/10.21605/cukurovaummfd.310341
-> Turkish_Product_Reviews_by_Gozukara_and_Ozel_2016 dataset is composed as below:
->-> The top 50 e-commerce sites in Turkey were crawled and their comments extracted. Then 2000 comments were randomly selected and manually labelled by a field expert.
->-> After manual labelling of the selected comments, 600 negative and 600 positive comments were left.
->-> This dataset contains these comments.
-> English_Movie_Reviews_by_Pang_and_Lee_2004
->-> Pang, B., Lee, L., 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts, In Proceedings of the 42nd annual meeting on Association for Computational Linguistics (p. 271).
->-> Source: https://www.cs.cornell.edu/people/pabo/movie-review-data/ | polarity dataset v2.0 - review_polarity.tar.gz
-> English_Movie_Reviews_Sentences_by_Pang_and_Lee_2005
->-> Pang, B., Lee, L., 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales, In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (pp. 115-124), Association for Computational Linguistics
->-> Source: https://www.cs.cornell.edu/people/pabo/movie-review-data/ | sentence polarity dataset v1.0 - rt-polaritydata.tar.gz
-> English_Product_Reviews_by_Blitzer_et_al_2007
->-> Article of the dataset: Blitzer, J., Dredze, M., Pereira, F., 2007. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification, In ACL (Vol. 7, pp. 440-447).
->-> Source: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/ | processed_acl.tar.gz
-> Turkish_Movie_Reviews_by_Demirtas_and_Pechenizkiy_2013
->-> Demirtas, E., Pechenizkiy, M., 2013. Cross-lingual polarity detection with machine translation, In Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining (p. 9). ACM.
->-> Source: http://www.win.tue.nl/~mpechen/projects/smm/#Datasets | Turkish_Movie_Sentiment.zip
-> The dataset files are provided as used in the article.
-> Weka files are generated with the raw frequency of terms rather than with the weighting schemes used in the article.
-> The folder Cross_Validation contains the files for each fold of the 10-fold cross-validation.
-> Inside the Cross_Validation folder, each turn of the cross-validation is named test_X, where X is the turn number.
-> Inside each test_X folder:
* Test_Set_Negative_RAW: Contains raw negative class Test data of that cross-validation turn
* Test_Set_Negative_Processed: Contains pre-processed negative class Test data of that cross-validation turn
* Test_Set_Positive_RAW: Contains raw positive class Test data of that cross-validation turn
* Test_Set_Positive_Processed: Contains pre-processed positive class Test data of that cross-validation turn
* Train_Set_Negative_RAW: Contains raw negative class Train data of that cross-validation turn
* Train_Set_Negative_Processed: Contains pre-processed negative class Train data of that cross-validation turn
* Train_Set_Positive_RAW: Contains raw positive class Train data of that cross-validation turn
* Train_Set_Positive_Processed: Contains pre-processed positive class Train data of that cross-validation turn
* Train_Set_For_Weka: Contains processed Train set formatted for Weka
* Test_Set_For_Weka: Contains processed Test set formatted for Weka
-> The folder Entire_Dataset contains the files for the entire dataset:
* Negative_Processed: Contains all negative comments processed data
* Positive_Processed: Contains all positive comments processed data
* Negative_RAW: Contains all negative comments RAW data
* Positive_RAW: Contains all positive comments RAW data
* Entire_Dataset_WEKA: Contains all documents processed data in WEKA format
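The Cross_Validation layout described above can be navigated with a small helper like the one below. This is a sketch under assumptions: the helper name and the root folder name are hypothetical, turn numbers are assumed to start at 1, and the entries are addressed by the names listed above without file extensions, since the description does not say whether they are files or folders.

```python
from pathlib import Path

# Entry names as listed in the description of each test_X folder.
FOLD_ENTRIES = [
    "Test_Set_Negative_RAW",
    "Test_Set_Negative_Processed",
    "Test_Set_Positive_RAW",
    "Test_Set_Positive_Processed",
    "Train_Set_Negative_RAW",
    "Train_Set_Negative_Processed",
    "Train_Set_Positive_RAW",
    "Train_Set_Positive_Processed",
    "Train_Set_For_Weka",
    "Test_Set_For_Weka",
]


def fold_paths(dataset_root, turn):
    """Map each entry name to its expected path for one cross-validation turn."""
    fold_dir = Path(dataset_root) / "Cross_Validation" / f"test_{turn}"
    return {name: fold_dir / name for name in FOLD_ENTRIES}


# Hypothetical root folder; numbering from 1 to 10 is an assumption.
all_folds = {turn: fold_paths("Turkish_Product_Reviews", turn) for turn in range(1, 11)}
print(all_folds[3]["Train_Set_For_Weka"])
```

Checking each path with `Path.exists()` before reading is a cheap way to confirm the assumed numbering and naming against the downloaded archive.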
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Despite the unquestionable advantages of Matrix-Assisted Laser Desorption/Ionization Mass Spectrometry Imaging in visualizing the spatial distribution and the relative abundance of biomolecules directly on-tissue, the resulting data is complex and high-dimensional. Therefore, analysis and interpretation of this huge amount of information is mathematically, statistically and computationally challenging. Areas covered: This article reviews some of the challenges in data elaboration, with particular emphasis on machine learning techniques employed in clinical applications, and can serve as a general entry point for those who want to study the computational aspects. Several characteristics of data processing are described, highlighting their advantages and disadvantages. Different approaches for data elaboration focused on clinical applications are also provided. A practical tutorial based on the Orange Canvas and Weka software is included to help readers become familiar with the data processing. Expert commentary: Recently, MALDI-MSI has gained considerable attention and has been employed for research and diagnostic purposes, with successful results. Data dimensionality constitutes an important issue, and statistical methods for information-preserving data reduction represent one of the most challenging aspects. The most common data reduction methods are characterized by collecting independent observations into a single table. However, the incorporation of relational information can improve the discriminatory capability of the data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The designated nucleotide locations correspond to the positions reported in the reference.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Churn prediction aims to detect customers who intend to leave a service provider. Retaining a customer costs an organization 5 to 10 times less than gaining a new one. Predictive models can correctly identify possible churners in the near future so that a retention solution can be provided. This paper presents a new prediction model based on Data Mining (DM) techniques. The proposed model is composed of six steps: identify problem domain, data selection, investigate data set, classification, clustering and knowledge usage. A data set with 23 attributes and 5000 instances is used; 4000 instances are used for training the model and 1000 instances as a testing set. The predicted churners are clustered into 3 categories for use in a retention strategy. The data mining techniques used in this paper are Decision Tree, Support Vector Machine and Neural Network, applied through the open-source software WEKA.
Estimation of obesity levels based on eating habits and physical condition Data Set
Abstract: This dataset includes data for the estimation of obesity levels in individuals from the countries of Mexico, Peru and Colombia, based on their eating habits and physical condition.
Data Set Characteristics: Multivariate
Number of Instances: 2111
Area: Life
Attribute Characteristics: Integer
Number of Attributes: 17
Date Donated: 2019-08-27
Associated Tasks: Classification, Regression, Clustering
Missing Values? N/A
Number of Web Hits: 70843
Fabio Mendoza Palechor, Email: fmendoza1 '@' cuc.edu.co, Cellphone: +573182929611
Alexis de la Hoz Manotas, Email: akdelahoz '@' gmail.com, Cellphone: +573017756983
This dataset includes data for the estimation of obesity levels in individuals from the countries of Mexico, Peru and Colombia, based on their eating habits and physical condition. The data contains 17 attributes and 2111 records. The records are labeled with the class variable NObesity (Obesity Level), which allows classification of the data using the values Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III. 77% of the data was generated synthetically using the Weka tool and the SMOTE filter; 23% of the data was collected directly from users through a web platform.
Read the article ([Web Link]) to see the description of the attributes.
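To give a sense of how SMOTE-style synthetic records arise, the sketch below generates new points by interpolating between a minority-class point and one of its nearest same-class neighbours. It is a simplified illustration, not Weka's SMOTE filter: the function name, parameters, and toy points are hypothetical, and only numeric attributes are handled, whereas the dataset described here also contains non-numeric attributes.

```python
import random


def smote_like(minority, k=2, n_new=3, seed=0):
    """Create n_new synthetic points between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority points by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Toy minority-class points with two numeric attributes (hypothetical values).
minority = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.5), (4.0, 5.0)]
new_points = smote_like(minority, k=2, n_new=5)
print(new_points)
```

Because every synthetic point lies on a segment between two real points, the generated records stay inside the range spanned by the original minority class, which is worth keeping in mind when interpreting a dataset that is largely synthetic.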
[1] Palechor, F. M., & de la Hoz Manotas, A. (2019). Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico. Data in Brief, 104344.
[2] De-La-Hoz-Correa, E., Mendoza Palechor, F., De-La-Hoz-Manotas, A., Morales Ortega, R., & Sánchez Hernández, A. B. (2019). Obesity level estimation software based on decision trees.