Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Source More Info : https://archive.ics.uci.edu/datasets
The **UCI Machine Learning Repository** is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.
The datasets collected in this project represent a diverse and comprehensive set of cancer-related data sourced from the UCI Machine Learning Repository. They cover a wide spectrum of cancer types and research perspectives, including breast cancer datasets such as the original, diagnostic, prognostic, and Coimbra variants, which focus on tumor features, recurrence, and biochemical markers. Cervical cancer is represented through datasets focusing on behavioral risks and general risk factors. The lung cancer dataset provides categorical diagnostic attributes, while the primary tumor dataset offers insights into tumor locations based on metastasis data. Additionally, specialized datasets like differentiated thyroid cancer recurrence, glioma grading with clinical and mutation features, and gene expression RNA-Seq data expand the scope into genetic and molecular-level cancer analysis. Together, these datasets support a wide range of machine learning applications including classification, prediction, survival analysis, and feature correlation across various types of cancer.
RRA_Think Differently, Create history’s next line.
Hello Data Hunters! Hope you're doing well. https://www.kaggle.com/shuvokumarbasak4004 (More Dataset) https://www.kaggle.com/shuvokumarbasak2030

Author: Alen Shapiro
Source: UCI
Please cite: UCI citation policy
Title: Chess End-Game -- King+Rook versus King+Pawn on a7 (usually abbreviated KRKPA7). The pawn on a7 means it is one square away from queening. It is the King+Rook's side (white) to move.
Sources: (a) Database originally generated and described by Alen Shapiro. (b) Donor/Coder: Rob Holte (holte@uottawa.bitnet). The database was supplied to Holte by Peter Clark of the Turing Institute in Glasgow (pete@turing.ac.uk). (c) Date: 1 August 1989
Past Usage:
Alen D. Shapiro (1983,1987), "Structured Induction in Expert Systems", Addison-Wesley. This book is based on Shapiro's Ph.D. thesis (1983) at the University of Edinburgh entitled "The Role of Structured Induction in Expert Systems".
Stephen Muggleton (1987), "Structuring Knowledge by Asking Questions", pp.218-229 in "Progress in Machine Learning", edited by I. Bratko and Nada Lavrac, Sigma Press, Wilmslow, England SK9 5BB.
Robert C. Holte, Liane Acker, and Bruce W. Porter (1989), "Concept Learning and the Problem of Small Disjuncts", Proceedings of IJCAI. Also available as technical report AI89-106, Computer Sciences Department, University of Texas at Austin, Austin, Texas 78712.
Relevant Information: The dataset format is described below. Note: the format of this database was modified on 2/26/90 to conform with the format of all the other databases in the UCI repository of machine learning databases.
Number of Instances: 3196 total
Number of Attributes: 36
Attribute Summaries: Classes (2): -- White-can-win ("won") and White-cannot-win ("nowin"). I believe that White is deemed to be unable to win if the Black pawn can safely advance. Attributes: see Shapiro's book.
Missing Attributes: -- none
Class Distribution: In 1669 of the positions (52%), White can win. In 1527 of the positions (48%), White cannot win.
The format for instances in this database is a sequence of 37 attribute values. Each instance is a board description for this chess endgame. The first 36 attributes describe the board. The last (37th) attribute is the classification: "won" or "nowin". There are 0 missing values. A typical board description is
f,f,f,f,f,f,f,f,f,f,f,f,l,f,n,f,f,t,f,f,f,f,f,f,f,t,f,f,f,f,f,f,f,t,t,n,won
The names of the features do not appear in the board descriptions. Instead, each feature corresponds to a particular position in the feature-value list. For example, the head of this list is the value for the feature "bkblk". The following is the list of features, in the order in which their values appear in the feature-value list:
[bkblk,bknwy,bkon8,bkona,bkspr,bkxbq,bkxcr,bkxwp,blxwp,bxqsq,cntxt,dsopp,dwipd, hdchk,katri,mulch,qxmsq,r2ar8,reskd,reskr,rimmx,rkxwp,rxmsq,simpl,skach,skewr, skrxp,spcop,stlmt,thrsk,wkcti,wkna8,wknck,wkovl,wkpos,wtoeg]
In the file, there is one instance (board position) per line.
Num Instances: 3196 Num Attributes: 37 Num Continuous: 0 (Int 0 / Real 0) Num Discrete: 37 Missing values: 0 / 0.0%
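Because feature names are positional, a reader typically has to zip each line against the feature list. The following is a minimal sketch of that step; the function name `parse_board` is hypothetical, and the sample line is the typical board description quoted above.

```python
# Feature names in file order, copied from the list in the description.
FEATURES = (
    "bkblk,bknwy,bkon8,bkona,bkspr,bkxbq,bkxcr,bkxwp,blxwp,bxqsq,cntxt,dsopp,"
    "dwipd,hdchk,katri,mulch,qxmsq,r2ar8,reskd,reskr,rimmx,rkxwp,rxmsq,simpl,"
    "skach,skewr,skrxp,spcop,stlmt,thrsk,wkcti,wkna8,wknck,wkovl,wkpos,wtoeg"
).split(",")


def parse_board(line):
    """Split one comma-separated instance into (feature dict, class label)."""
    values = line.strip().split(",")
    if len(values) != len(FEATURES) + 1:
        raise ValueError(f"expected {len(FEATURES) + 1} values, got {len(values)}")
    *board, label = values
    return dict(zip(FEATURES, board)), label


SAMPLE = "f,f,f,f,f,f,f,f,f,f,f,f,l,f,n,f,f,t,f,f,f,f,f,f,f,t,f,f,f,f,f,f,f,t,t,n,won"
features, label = parse_board(SAMPLE)
print(label, features["bkblk"], features["dwipd"])
```

The length check is a cheap guard against lines that were truncated or merged during download.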

This dataset integrates all the databases in the Heart Disease dataset available at the UCI Machine Learning Repository. The original contains four databases: Cleveland, Hungarian, Long Beach, and Switzerland. Most work to date has used the Cleveland database only.
The original dataset has 76 attributes; which ones to select depends on one's needs. Here I've taken 10 attributes for the prediction.

A collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. Datasets approved for the repository will be assigned a Digital Object Identifier (DOI) if they do not already possess one. Datasets will be licensed under a Creative Commons Attribution 4.0 International license (CC BY 4.0), which allows sharing and adaptation of the datasets for any purpose, provided that appropriate credit is given.

http://opendatacommons.org/licenses/dbcl/1.0/
Data From: UCI Machine Learning Repository http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wpbc.names
"Each record represents follow-up data for one breast cancer case. These are consecutive patients seen by Dr. Wolberg since 1984, and include only those cases exhibiting invasive breast cancer and no evidence of distant metastases at the time of diagnosis.
The first 30 features are computed from a digitized image of a
fine needle aspirate (FNA) of a breast mass. They describe
characteristics of the cell nuclei present in the image.
A few of the images can be found at
http://www.cs.wisc.edu/~street/images/
The separation described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree. Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.
The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].
The Recurrence Surface Approximation (RSA) method is a linear
programming model which predicts Time To Recur using both
recurrent and nonrecurrent cases. See references (i) and (ii)
above for details of the RSA method.
This database is also available through the UW CS ftp server:
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WPBC/
1) ID number
2) Outcome (R = recur, N = nonrecur)
3) Time (recurrence time if field 2 = R, disease-free time if field 2 = N)
4-33) Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)"
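As a worked illustration of item (f) in the feature list, the sketch below evaluates the compactness formula, perimeter^2 / area - 1.0, on two simple shapes. The function name and the test shapes are assumptions made for illustration; the dataset itself derives perimeter and area from the digitized nucleus contour.

```python
import math


def compactness(perimeter, area):
    """Compactness as defined in item (f): perimeter^2 / area - 1.0."""
    if area <= 0:
        raise ValueError("area must be positive")
    return perimeter ** 2 / area - 1.0


# A circle gives 4*pi - 1 regardless of its radius.
r = 3.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)

# A square of side 2 gives 8^2 / 4 - 1 = 15.
square = compactness(4 * 2.0, 2.0 ** 2)

print(circle, square)
```

The measure is scale-invariant, so larger values indicate a less circular outline rather than a larger nucleus.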
Creators:
Dr. William H. Wolberg, General Surgery Dept., University of
Wisconsin, Clinical Sciences Center, Madison, WI 53792
wolberg@eagle.surgery.wisc.edu
W. Nick Street, Computer Sciences Dept., University of
Wisconsin, 1210 West Dayton St., Madison, WI 53706
street@cs.wisc.edu 608-262-6619
Olvi L. Mangasarian, Computer Sciences Dept., University of
Wisconsin, 1210 West Dayton St., Madison, WI 53706
olvi@cs.wisc.edu
I'm really interested in trying out various machine learning algorithms on some real life science data.

This dataset is a curated collection of 5 classic, publicly available datasets from the UCI Machine Learning Repository.
Context This collection was compiled for the purpose of benchmarking and evaluating machine learning algorithms, particularly metaheuristic-based feature selection (FS) algorithms. The datasets were used in the research paper: "Hybrid-FS: A Novel Feature Selection Algorithm Integrating Sine-Cosine Optimization and Genetic Operators for High-Dimensional Data Classification."
Content The dataset contains 5 separate files, all originating from the UCI ML Repository:
Ionosphere: 351 instances, 34 features, 2 classes
Sonar: 208 instances, 60 features, 2 classes
Waveform (v2): 5000 instances, 40 features, 3 classes
Wine: 178 instances, 13 features, 3 classes
Zoo: 101 instances, 16 features, 7 classes
Original Sources & Citation All datasets are provided as-is from the UCI Machine Learning Repository. Please cite the original creators of each dataset as specified on their respective UCI pages.
Ionosphere: https://archive.ics.uci.edu/dataset/52/ionosphere
Sonar: https://archive.ics.uci.edu/dataset/151/connectionist-bench-sonar-mines-vs-rocks
Waveform: https://archive.ics.uci.edu/dataset/108/waveform+database+generator+version+2

Author: H. Altay Guvenir, Burak Acar, Haldun Muderrisoglu
Source: UCI
Please cite: UCI
Cardiac Arrhythmia Database
The aim is to determine the type of arrhythmia from the ECG recordings. This database contains 279 attributes, 206 of which are linear valued and the rest are nominal.
Concerning the study of H. Altay Guvenir: "The aim is to distinguish between the presence and absence of cardiac arrhythmia and to classify it in one of the 16 groups. Class 01 refers to 'normal' ECG classes, 02 to 15 refers to different classes of arrhythmia and class 16 refers to the rest of unclassified ones. For the time being, there exists a computer program that makes such a classification. However, there are differences between the cardiologist's and the program's classification. Taking the cardiologist's as a gold standard we aim to minimize this difference by means of machine learning tools."
The names and id numbers of the patients were recently removed from the database.
1 Age: Age in years , linear
2 Sex: Sex (0 = male; 1 = female) , nominal
3 Height: Height in centimeters , linear
4 Weight: Weight in kilograms , linear
5 QRS duration: Average of QRS duration in msec., linear
6 P-R interval: Average duration between onset of P and Q waves
in msec., linear
7 Q-T interval: Average duration between onset of Q and offset
of T waves in msec., linear
8 T interval: Average duration of T wave in msec., linear
9 P interval: Average duration of P wave in msec., linear
Vector angles in degrees on front plane of:, linear
10 QRS
11 T
12 P
13 QRST
14 J
15 Heart rate: Number of heart beats per minute ,linear
Of channel DI:
Average width, in msec., of: linear
16 Q wave
17 R wave
18 S wave
19 R' wave, small peak just after R
20 S' wave
21 Number of intrinsic deflections, linear
22 Existence of ragged R wave, nominal
23 Existence of diphasic derivation of R wave, nominal
24 Existence of ragged P wave, nominal
25 Existence of diphasic derivation of P wave, nominal
26 Existence of ragged T wave, nominal
27 Existence of diphasic derivation of T wave, nominal
Of channel DII:
28 .. 39 (similar to 16 .. 27 of channel DI)
Of channel DIII:
40 .. 51
Of channel AVR:
52 .. 63
Of channel AVL:
64 .. 75
Of channel AVF:
76 .. 87
Of channel V1:
88 .. 99
Of channel V2:
100 .. 111
Of channel V3:
112 .. 123
Of channel V4:
124 .. 135
Of channel V5:
136 .. 147
Of channel V6:
148 .. 159
Of channel DI:
Amplitude, * 0.1 millivolt, of
160 JJ wave, linear
161 Q wave, linear
162 R wave, linear
163 S wave, linear
164 R' wave, linear
165 S' wave, linear
166 P wave, linear
167 T wave, linear
168 QRSA , Sum of areas of all segments divided by 10,
( Area= width * height / 2 ), linear
169 QRSTA = QRSA + 0.5 * width of T wave * 0.1 * height of T
wave. (If T is diphasic then the bigger segment is
considered), linear
Of channel DII:
170 .. 179
Of channel DIII:
180 .. 189
Of channel AVR:
190 .. 199
Of channel AVL:
200 .. 209
Of channel AVF:
210 .. 219
Of channel V1:
220 .. 229
Of channel V2:
230 .. 239
Of channel V3:
240 .. 249
Of channel V4:
250 .. 259
Of channel V5:
260 .. 269
Of channel V6:
270 .. 279
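The per-channel attribute numbering above follows a regular pattern: twelve attributes per channel in the first block (16 to 159) and ten per channel in the amplitude block (160 to 279). A minimal sketch of that arithmetic follows; the function name and the block labels are hypothetical, and the channel order is the one used in the listing.

```python
# Channel order as it appears in the attribute listing.
CHANNELS = ["DI", "DII", "DIII", "AVR", "AVL", "AVF",
            "V1", "V2", "V3", "V4", "V5", "V6"]


def channel_attribute_ranges(channel):
    """Return ((first, last) of the wave-shape block, (first, last) of the amplitude block).

    Attribute numbers are 1-based, matching the listing.
    """
    i = CHANNELS.index(channel)
    shape_start = 16 + 12 * i   # 12 wave-shape attributes per channel
    amp_start = 160 + 10 * i    # 10 amplitude attributes per channel
    return (shape_start, shape_start + 11), (amp_start, amp_start + 9)


for name in ("DI", "AVR", "V6"):
    print(name, channel_attribute_ranges(name))
```

When indexing into a zero-based array or DataFrame, subtract one from each attribute number.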
Class code - class - number of instances:
01 - Normal - 245
02 - Ischemic changes (Coronary Artery Disease) - 44
03 - Old Anterior Myocardial Infarction - 15
04 - Old Inferior Myocardial Infarction - 15
05 - Sinus tachycardy - 13
06 - Sinus bradycardy - 25
07 - Ventricular Premature Contraction (PVC) - 3
08 - Supraventricular Premature Contraction - 2
09 - Left bundle branch block - 9
10 - Right bundle branch block - 50
11 - 1. degree AtrioVentricular block - 0
12 - 2. degree AV block - 0
13 - 3. degree AV block - 0
14 - Left ventricule hypertrophy - 4
15 - Atrial Fibrillation or Flutter - 5
16 - Others - 22

Author: David W. Aha
Source: UCI - 1991
Please cite: UCI
Tic-Tac-Toe Endgame database
This database encodes the complete set of possible board configurations at the end of tic-tac-toe games, where "x" is assumed to have played first. The target concept is "win for x" (i.e., true when "x" has one of 8 possible ways to create a "three-in-a-row").
(x=player x has taken, o=player o has taken, b=blank)
1. top-left-square: {x,o,b}
2. top-middle-square: {x,o,b}
3. top-right-square: {x,o,b}
4. middle-left-square: {x,o,b}
5. middle-middle-square: {x,o,b}
6. middle-right-square: {x,o,b}
7. bottom-left-square: {x,o,b}
8. bottom-middle-square: {x,o,b}
9. bottom-right-square: {x,o,b}
10. Class: {positive,negative}
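The target concept can be recomputed directly from the nine square attributes. The sketch below checks the 8 three-in-a-row lines for "x"; the function names are hypothetical, and squares are assumed to be given in the attribute order listed above (top-left to bottom-right).

```python
# The 8 winning lines, as indices into the 9-square board (row-major order).
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def x_wins(board):
    """Return True if 'x' occupies all three squares of any winning line."""
    if len(board) != 9:
        raise ValueError("board must have exactly 9 squares")
    return any(all(board[i] == "x" for i in line) for line in WIN_LINES)


def class_label(board):
    """Map a board to the dataset's class values."""
    return "positive" if x_wins(board) else "negative"


print(class_label("x,x,x,x,o,o,x,o,o".split(",")))  # top row is all x
print(class_label("x,o,x,x,o,o,o,x,x".split(",")))  # full board, no line for x
```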

Author: Rami Mustafa A Mohammad (University of Huddersfield, rami.mohammad '@' hud.ac.uk, rami.mustafa.a '@' gmail.com), Lee McCluskey (University of Huddersfield, t.l.mccluskey '@' hud.ac.uk), Fadi Thabtah (Canadian University of Dubai, fadi '@' cud.ac.ae)
Source: UCI
Please cite: Please refer to the Machine Learning Repository's citation policy
Source:
Rami Mustafa A Mohammad ( University of Huddersfield, rami.mohammad '@' hud.ac.uk, rami.mustafa.a '@' gmail.com) Lee McCluskey (University of Huddersfield,t.l.mccluskey '@' hud.ac.uk ) Fadi Thabtah (Canadian University of Dubai,fadi '@' cud.ac.ae)
Data Set Information:
One of the challenges faced by our research was the unavailability of reliable training datasets. In fact, this challenge faces any researcher in the field. Although plenty of articles about predicting phishing websites have been disseminated these days, no reliable training dataset has been published publicly, maybe because there is no agreement in the literature on the definitive features that characterize phishing webpages; hence it is difficult to shape a dataset that covers all possible features. In this dataset, we shed light on the important features that have proved to be sound and effective in predicting phishing websites. In addition, we propose some new features.
Attribute Information:
For Further information about the features see the features file in the data folder of UCI.
Relevant Papers:
Mohammad, Rami, McCluskey, T.L. and Thabtah, Fadi (2012) An Assessment of Features Related to Phishing Websites using an Automated Technique. In: International Conference For Internet Technology And Secured Transactions. ICITST 2012. IEEE, London, UK, pp. 492-497. ISBN 978-1-4673-5325-0
Mohammad, Rami, Thabtah, Fadi Abdeljaber and McCluskey, T.L. (2014) Predicting phishing websites based on self-structuring neural network. Neural Computing and Applications, 25 (2). pp. 443-458. ISSN 0941-0643
Mohammad, Rami, McCluskey, T.L. and Thabtah, Fadi Abdeljaber (2014) Intelligent Rule based Phishing Websites Classification. IET Information Security, 8 (3). pp. 153-160. ISSN 1751-8709
Citation Request:
Please refer to the Machine Learning Repository's citation policy

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Recognising human activities using smart devices has led to countless inventions in various domains like healthcare, security, sports, etc. Sensor-based human activity recognition (HAR), especially smartphone-based HAR, has become popular among the research community due to lightweight computation and user privacy protection. Deep learning models are the most preferred solutions in developing smartphone-based HAR as they can automatically capture salient and distinctive features from input signals and classify them into respective activity classes. However, in most cases, the architecture of these models needs to be deep and complex for better classification performance. Furthermore, training these models requires extensive computational resources. Hence, this research proposes a hybrid lightweight model that integrates an enhanced Temporal Convolutional Network (TCN) with Gated Recurrent Unit (GRU) layers for salient spatiotemporal feature extraction without tedious manual feature extraction. Essentially, dilations are incorporated into each convolutional kernel in the TCN-GRU model to extend the kernel’s field of view without imposing additional model parameters. Moreover, fewer short filters are applied for each convolutional layer to alleviate excess parameters. Despite reducing computational cost, the proposed model utilises dilations, residual connections, and GRU layers for longer-term time dependency modelling by retaining longer implicit features of the input inertial sequences throughout training to provide sufficient information for future prediction. The performance of the TCN-GRU model is verified on two benchmark smartphone-based HAR databases, i.e., UCI HAR and UniMiB SHAR. The model attains promising accuracy in recognising human activities with 97.25% on UCI HAR and 93.51% on UniMiB SHAR. 
Since the current study exclusively works on the inertial signals captured by smartphones, future studies will explore the generalisation of the proposed TCN-GRU across diverse datasets, including various sensor types, to ensure its adaptability across different applications.
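To make the point about dilations concrete, the sketch below compares the receptive field and parameter count of a stack of stride-1 1-D convolutions with and without dilation. The kernel size, channel counts, and dilation schedule are hypothetical values chosen for illustration, not the settings of the TCN-GRU model described above.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in time steps) of stacked stride-1 dilated 1-D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def conv1d_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of one 1-D convolution; independent of the dilation rate."""
    return kernel_size * in_channels * out_channels + out_channels


KERNEL = 3            # hypothetical kernel size
CHANNELS = 32         # hypothetical channel width
plain = [1, 1, 1, 1]      # four undilated layers
dilated = [1, 2, 4, 8]    # four layers with doubling dilation

params = 4 * conv1d_params(KERNEL, CHANNELS, CHANNELS)
print("parameters (same for both stacks):", params)
print("receptive field, plain:  ", receptive_field(KERNEL, plain))
print("receptive field, dilated:", receptive_field(KERNEL, dilated))
```

Both stacks have identical parameter counts, while the dilated stack sees a much longer window of the input sequence.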

Wine recognition dataset from UC Irvine. Great for testing out different classifiers.
Labels: "name" - Number denoting a specific wine class
Number of instances of each wine class
Features:
"This data set is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines"
Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
@misc{Lichman:2013 , author = "M. Lichman", year = "2013", title = "{UCI} Machine Learning Repository", url = "http://archive.ics.uci.edu/ml", institution = "University of California, Irvine, School of Information and Computer Sciences" }
UC Irvine data base: "https://archive.ics.uci.edu/ml/machine-learning-databases/wine"
Sources: (a) Forina, M. et al, PARVUS - An Extendible Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy. (b) Stefan Aeberhard, email: stefan@coral.cs.jcu.edu.au (c) July 1991 Past Usage: (1) S. Aeberhard, D. Coomans and O. de Vel, Comparison of Classifiers in High Dimensional Settings, Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland. (Also submitted to Technometrics).
The data was used with many others for comparing various classifiers. The classes are separable, though only RDA has achieved 100% correct classification. (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data)) (All results using the leave-one-out technique)
(2) S. Aeberhard, D. Coomans and O. de Vel, "THE CLASSIFICATION PERFORMANCE OF RDA" Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland. (Also submitted to Journal of Chemometrics).
This data set is great for drawing comparisons between algorithms and testing out classification models when learning new techniques.
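The past-usage results quoted above were obtained with the leave-one-out technique, with 1NN applied to z-transformed data. The following is a minimal pure-Python sketch of that evaluation loop, assuming Euclidean distance; the toy rows and labels are made up for illustration and are not taken from the wine data.

```python
import math


def z_transform(rows):
    """Standardize each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def loo_1nn_accuracy(rows, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    # Simplification: statistics are computed once on all rows,
    # not recomputed inside each left-out fold.
    X = z_transform(rows)
    correct = 0
    for i, x in enumerate(X):
        nearest = min((k for k in range(len(X)) if k != i),
                      key=lambda k: math.dist(x, X[k]))
        correct += labels[nearest] == labels[i]
    return correct / len(X)


# Hypothetical toy data: two well-separated groups with two features each.
toy_rows = [[1.0, 10.0], [1.2, 11.0], [0.9, 9.5],
            [5.0, 50.0], [5.1, 52.0], [4.8, 49.0]]
toy_labels = [1, 1, 1, 2, 2, 2]
accuracy = loo_1nn_accuracy(toy_rows, toy_labels)
print(accuracy)
```

Standardizing first matters for distance-based classifiers when the features are measured on very different scales.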

Dataset Card for Census Income (Adult)
This dataset is a precise version of Adult or Census Income. This dataset from UCI somehow happens to occupy two links, but we checked and confirm that they are identical. We used the following python script to create this Hugging Face dataset. import pandas as pd from datasets import Dataset, DatasetDict, Features, Value, ClassLabel
url1 = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" url2 =… See the full description on the dataset page: https://huggingface.co/datasets/cestwc/census-income.

Author: Robert P.W. Duin, Department of Applied Physics, Delft University of Technology
Source: UCI - 1998
Please cite: UCI
Multiple Features Dataset: Factors
One of a set of 6 datasets describing features of handwritten numerals (0 - 9) extracted from a collection of Dutch utility maps. Corresponding patterns in different datasets correspond to the same original character. 200 instances per class (for a total of 2,000 instances) have been digitized in binary images.
The attributes represent 216 profile correlations. No more information is known.
A slightly different version of the database is used in
M. van Breukelen, R.P.W. Duin, D.M.J. Tax, and J.E. den Hartog, Handwritten digit recognition by combined classifiers, Kybernetika, vol. 34, no. 4, 1998, 381-386.
The database as is is used in:
A.K. Jain, R.P.W. Duin, J. Mao, Statistical Pattern Recognition: A Review, IEEE Transactions on Pattern Analysis and Machine Intelligence archive, Volume 22 Issue 1, January 2000

The dataset was collected over 60 days and is a real database of a Brazilian logistics company. It has twelve predictive attributes and a target, which is the total number of orders for daily treatment. The database was used in academic research at the Universidade Nove de Julho.

The data used in this project is a sample of 1994 census data from the US census database. The generated output will contain income prediction models for the selected sample. When new data on a specific person is entered, the model will indicate whether that person is expected to reach the desired income level. A computational learning process will be applied to the data to perform inference through Bayesian networks.

Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
The different algorithms of the imbalanced-learn toolbox are evaluated on a set of common datasets, which are more or less balanced. This benchmark was proposed in [1]. The following section presents its main characteristics.
| ID | Name | Repository & Target | Ratio | # samples | # features |
|---|---|---|---|---|---|
| 1 | Ecoli | UCI, target: imU | 8.6:1 | 336 | 7 |
| 2 | Optical Digits | UCI, target: 8 | 9.1:1 | 5,620 | 64 |
| 3 | SatImage | UCI, target: 4 | 9.3:1 | 6,435 | 36 |
| 4 | Pen Digits | UCI, target: 5 | 9.4:1 | 10,992 | 16 |
| 5 | Abalone | UCI, target: 7 | 9.7:1 | 4,177 | 8 |
| 6 | Sick Euthyroid | UCI, target: sick euthyroid | 9.8:1 | 3,163 | 25 |
| 7 | Spectrometer | UCI, target: >=44 | 11:1 | 531 | 93 |
| 8 | Car_Eval_34 | UCI, target: good, v good | 12:1 | 1,728 | 6 |
| 9 | ISOLET | UCI, target: A, B | 12:1 | 7,797 | 617 |
| 10 | US Crime | UCI, target: >0.65 | 12:1 | 1,994 | 122 |
| 11 | Yeast_ML8 | LIBSVM, target: 8 | 13:1 | 2,417 | 103 |
| 12 | Scene | LIBSVM, target: >one label | 13:1 | 2,407 | 294 |
| 13 | Libras Move | UCI, target: 1 | 14:1 | 360 | 90 |
| 14 | Thyroid Sick | UCI, target: sick | 15:1 | 3,772 | 28 |
| 15 | Coil_2000 | KDD, CoIL, target: minority | 16:1 | 9,822 | 85 |
| 16 | Arrhythmia | UCI, target: 06 | 17:1 | 452 | 279 |
| 17 | Solar Flare M0 | UCI, target: M->0 | 19:1 | 1,389 | 10 |
| 18 | OIL | UCI, target: minority | 22:1 | 937 | 49 |
| 19 | Car_Eval_4 | UCI, target: vgood | 26:1 | 1,728 | 6 |
| 20 | Wine Quality | UCI, wine, target: <=4 | 26:1 | 4,898 | 11 |
| 21 | Letter Img | UCI, target: Z | 26:1 | 20,000 | 16 |
| 22 | Yeast_ME2 | UCI, target: ME2 | 28:1 | 1,484 | 8 |
| 23 | Webpage | LIBSVM, w7a, target: minority | 33:1 | 49,749 | 300 |
| 24 | Ozone Level | UCI, ozone, data | 34:1 | 2,536 | 72 |
| 25 | Mammography | UCI, target: minority | 42:1 | 11,183 | 6 |
| 26 | Protein homo. | KDD CUP 2004, minority | 111:1 | 145,751 | 74 |
| 27 | Abalone_19 | UCI, target: 19 | 130:1 | 4,177 | 8 |
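The Ratio column can be read as the number of non-target samples per target sample. A minimal sketch of that calculation follows, assuming this reading of the column; the function name is hypothetical. The example uses the Arrhythmia row (452 samples, target 06) together with the count of 25 instances for class 06 given in the Arrhythmia class listing earlier on this page.

```python
def imbalance_ratio(class_counts, minority_classes):
    """Majority-to-minority sample ratio for a one-vs-rest target definition."""
    minority = sum(class_counts[c] for c in minority_classes)
    majority = sum(n for c, n in class_counts.items() if c not in minority_classes)
    if minority == 0:
        raise ValueError("no samples in the minority classes")
    return majority / minority


# Arrhythmia: 25 instances of class 06, the remaining 427 of 452 grouped as "rest".
arrhythmia_counts = {"06": 25, "rest": 452 - 25}
ratio = imbalance_ratio(arrhythmia_counts, {"06"})
print(f"{ratio:.2f}:1")
```

Rounded to the nearest integer, this agrees with the 17:1 entry in the table.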
[1] Ding, Zejin, "Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University, (2011).
[2] Blake, Catherine, and Christopher J. Merz. "UCI Repository of machine learning databases." (1998).
[3] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.
[4] Caruana, Rich, Thorsten Joachims, and Lars Backstrom. "KDD-Cup 2004: results and analysis." ACM SIGKDD Explorations Newsletter 6.2 (2004): 95-108.

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery
This collection consists of ten open access relations commonly used by the data management community. In addition to the relations themselves (please take note of the references to the original sources below), we added three lists in this collection that describe approximate functional dependencies found in the relations. These lists are the result of a manual annotation process performed by two independent individuals by consulting the respective schemas of the relations and identifying column combinations where one column implies another based on its semantics. As an example, in the claims.csv file, the AirportCode implies AirportName, as each code should be unique for a given airport.
The file ground_truth.csv is a comma separated file containing approximate functional dependencies. table describes the relation we refer to, lhs and rhs reference two columns of those relations where semantically we found that lhs implies rhs.
The file excluded_candidates.csv and included_candidates.csv list all column combinations that were excluded or included in the manual annotation, respectively. We excluded a candidate if there was no tuple where both attributes had a value or if the g3_prime value was too small.
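To illustrate how a candidate such as AirportCode -> AirportName might be scored, the sketch below computes the classic g3 error: the fraction of tuples that would have to be removed for the dependency to hold exactly. This is a stand-in, since the exact definition of the g3_prime value mentioned above is not given here. The function name and the toy rows are hypothetical; only the column names come from the claims.csv example.

```python
from collections import Counter, defaultdict


def g3_error(rows, lhs, rhs):
    """Classic g3 error of the dependency lhs -> rhs.

    Rows where either attribute is missing are skipped. Returns None when
    no row has a value for both attributes.
    """
    groups = defaultdict(Counter)
    n = 0
    for row in rows:
        a, b = row.get(lhs), row.get(rhs)
        if a in (None, "") or b in (None, ""):
            continue
        groups[a][b] += 1
        n += 1
    if n == 0:
        return None
    # For each lhs value, keep the rows carrying its most frequent rhs value.
    kept = sum(max(counter.values()) for counter in groups.values())
    return (n - kept) / n


# Hypothetical toy rows; one JFK row carries a variant spelling of the name.
toy = [
    {"AirportCode": "JFK", "AirportName": "John F. Kennedy International"},
    {"AirportCode": "JFK", "AirportName": "John F. Kennedy International"},
    {"AirportCode": "JFK", "AirportName": "JFK Intl"},
    {"AirportCode": "LAX", "AirportName": "Los Angeles International"},
    {"AirportCode": "LAX", "AirportName": "Los Angeles International"},
    {"AirportCode": "SFO", "AirportName": ""},
]
error = g3_error(toy, "AirportCode", "AirportName")
print(error)
```

An error of 0 means the dependency holds exactly on the rows where both columns are populated; small positive values indicate an approximate dependency.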
Dataset References

Author: Gary Bradshaw
Source: UCI
Please cite:
Solar Flare database
Relevant Information:
-- The database contains 3 potential classes, one for the number of times a certain type of solar flare occurred in a 24 hour period.
-- Each instance represents captured features for 1 active region on the sun.
-- The data are divided into two sections. The second section (flare.data2) has had much more error correction applied to it, and has consequently been treated as more reliable.
Number of Instances: flare.data1: 323, flare.data2: 1066
Number of attributes: 13 (includes 3 class attributes)
1. Code for class (modified Zurich class) (A,B,C,D,E,F,H)
2. Code for largest spot size (X,R,S,A,H,K)
3. Code for spot distribution (X,O,I,C)
4. Activity (1 = reduced, 2 = unchanged)
5. Evolution (1 = decay, 2 = no growth,
3 = growth)
6. Previous 24 hour flare activity code (1 = nothing as big as an M1,
2 = one M1,
3 = more activity than one M1)
7. Historically-complex (1 = Yes, 2 = No)
8. Did region become historically complex (1 = yes, 2 = no)
on this pass across the sun's disk
9. Area (1 = small, 2 = large)
10. Area of the largest spot (1 = <=5, 2 = >5)
From all these predictors three classes of flares are predicted, which are represented in the last three columns.
C-class flares production by this region in the following 24 hours (common flares): Number
M-class flares production by this region in the following 24 hours (moderate flares): Number
X-class flares production by this region in the following 24 hours (severe flares): Number
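The attribute list above can be turned into a small record parser. This sketch assumes that each instance is one line of 13 whitespace-separated values in the listed order, which should be checked against the actual flare.data files; the field names and the sample line are hypothetical.

```python
# Hypothetical field names, in the order of the attribute list above.
FIELDS = [
    "zurich_class", "largest_spot_size", "spot_distribution",
    "activity", "evolution", "previous_24h_activity",
    "historically_complex", "became_complex_this_pass",
    "area", "largest_spot_area",
    "c_class_flares", "m_class_flares", "x_class_flares",
]


def parse_region(line):
    """Parse one active-region record into a dict (assumed whitespace-separated)."""
    tokens = line.split()
    if len(tokens) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(tokens)}")
    record = dict(zip(FIELDS, tokens))
    # The first three attributes are letter codes; the rest are integers.
    for name in FIELDS[3:]:
        record[name] = int(record[name])
    return record


# Made-up sample line for illustration only.
region = parse_region("H A X 1 3 1 1 1 1 1 0 0 0")
targets = [region[name] for name in FIELDS[-3:]]
print(region["zurich_class"], region["evolution"], targets)
```

The last three fields are the prediction targets: counts of C-, M-, and X-class flares in the following 24 hours.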
CLASSTYPE: nominal CLASSINDEX: first

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The DAGHAR benchmark is a curated dataset collection designed for domain adaptation and domain generalization studies in HAR tasks that use inertial sensors such as accelerometers and gyroscopes. It was introduced in the work "A benchmark for domain adaptation and generalization in smartphone-based human activity recognition". It features raw inertial sensor data sourced exclusively from smartphones. Six public datasets were selected and standardized in terms of accelerometer units of measurement, sampling rate, gravity component, activity labels, user partitioning, and time window size. This standardization makes it possible to build a comprehensive benchmark for evaluating the generalization capabilities of HAR models in cross-dataset scenarios.
The benchmark is based on the following datasets:

http://opendatacommons.org/licenses/dbcl/1.0/
This dataset was created by Ahmed Alghali
Released under Database: Open Database, Contents: Database Contents