Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Cuff-Less Blood Pressure Estimation Dataset [2] from the UCI Machine Learning Repository. It is a subset of the MIMIC-II Waveform Dataset that contains 12000 records of simultaneous PPG and ABP from 942 patients with a sampling rate of 125 Hz. The 12000 records were uniformly split into four parts with 3000 records each. However, as the subject information is lacking, the Hold-one-out strategy was utilized to generate training, validation, and test sets once the data was preprocessed. In the end, the UCI dataset had 291,078 segments, which was around 404 hours of recording, making it substantially the biggest data set with a considerably higher ratio of continuous segments per record (32.15).
[2] Kachuee, M., Kiani, M. M., Mohammadzade, H. & Shabany, M. Cuff-less blood pressure estimation data set (2015). UCI repository https://archive.ics.uci.edu/ml/datasets/Cuff-Less+Blood+Pressure+Estimation.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Sahar Pourahmad
Released under CC0: Public Domain
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Datasets available at UCI Machine Learning Repository and other repositories. List of datasets used in the experiment with their sources. ForestCover dataset @ https://archive.ics.uci.edu/ml/datasets/Covertype KDD Cup99 dataset @ https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data PAMAP dataset @ https://archive.ics.uci.edu/ml/datasets/PAMAP2+Physical+Activity+Monitoring Powersupply @ http://www.cse.fau.edu/~xqzhu/stream.html SEA @ http://www.liaad.up.pt/kdus/products/datasets-for-concept-drift Syn002 & Syn003 (generated) @ http://moa.cms.waikato.ac.nz/details/classification/streams/ MNIST @ https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html News20 @ https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html
Facebook
TwitterThe dataset used can be found on the UCI Machine Learning Repository at the following location:
There are several copies of this dataset to be found on Kaggle, with people focusing on different types of analyses of the data. This specific copy can be analysed by anyone interested, but is primarily used by a study group from the Udacity Bertelsmann Technology Scholarship to practice analysis of association between variables as well as implementation and comparison of various Machine Learning models.
According to the paper by (Detrano et al., 1989) as found on the UCI Dataset webpage, the data represents data collected for 303 patients referred for coronary angiography at the Cleveland Clinic between May 1981 and September 1984. The 13 independent/ features variables can be divided into 3 groups as follows:
Routine evaluation (based on historical data):
Non-invasive test data (informed consent obtained for data as part of research protocol):
Other demographic and clinical variables (based on routine data):
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3632459%2Fa01747fb0158dc51c12bc0824c9c4ae4%2Fdata_dictionary2.png?generation=1609522473018549&alt=media" alt="">
UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. Donor:
David W. Aha (aha '@' ics.uci.edu) (714) 856-8779
The objective of the analysis is to use statistical learning to identify factors associated with Coronary Artery Disease as indicated by a coronary angiography interpreted by a Cardiologist (as per paper written by Detrano et al cited before).
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Adaptation of http://archive.ics.uci.edu/ml/datasets/Statlog+(Heart)Ready for usage with ehrapy
Facebook
TwitterAttribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Collection of two datasets from the UCI website that could be used for structure learning tasks. Includes datasets regarding
Size: Two datasets of sizes 9471*17 and 2458285*68 correspondingly
Number of features: 15-68
Ground truth: No
Type of Graph: No ground truth
More information about the datasets is contained in the dataset_description.html files.
Facebook
TwitterThis CSV contain a data set prepared for the use of participants for the 1994 AAAI Spring Symposium on Artificial Intelligence in Medicine.
Original files were obtained from: https://archive.ics.uci.edu/ml/datasets/diabetes
Archived file diabetes-data.tar.z which contains 70 sets of data recorded on diabetes patients (several weeks' to months' worth of glucose, insulin, and lifestyle data per patient + a description of the problem domain) is extracted and processed and merged as a CSV file.
The Code field of the CSV is deciphered as follows:
33 = Regular insulin dose 34 = NPH insulin dose 35 = UltraLente insulin dose 48 = Unspecified blood glucose measurement 57 = Unspecified blood glucose measurement 58 = Pre-breakfast blood glucose measurement 59 = Post-breakfast blood glucose measurement 60 = Pre-lunch blood glucose measurement 61 = Post-lunch blood glucose measurement 62 = Pre-supper blood glucose measurement 63 = Post-supper blood glucose measurement 64 = Pre-snack blood glucose measurement 65 = Hypoglycemic symptoms 66 = Typical meal ingestion 67 = More-than-usual meal ingestion 68 = Less-than-usual meal ingestion 69 = Typical exercise activity 70 = More-than-usual exercise activity 71 = Less-than-usual exercise activity 72 = Unspecified special event
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Repackaging in parquet file of the well-known HIGGS dataset from UCI. Same rows and content of the original file (11M rows), size is almost 80% smaller than the original. Original source: https://archive.ics.uci.edu/dataset/280/higgs Link to the original paper: Whiteson, D. (2014). HIGGS [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5V312.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Adaptation of http://archive.ics.uci.edu/ml/datasets/HepatitisReady for usage with ehrapy
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains information for Machine Learning algorithms to forecast recurrence events (RE) for patients with breast cancer stages I to III. The dataset contains 252 instances and six attributes, including a binary class indicating whether RE occurred. This dataset has been reduced and denoised from the original Ljubljana, which holds 286 instances with ten attributes each (LBCD, Zwitter M. and Soklic M. (1988). Breast Cancer. UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/14/breast+cancer).
The ranking results by eight different Machine learning algorithms and statistical handling of the ranking 8-component vectors for attributes allow one to reduce ten features to six of the most relevant ones. The most pertinent features were the following five: {deg_malig, irradiat, node_caps, tumor_size, inv_nodes}. Less relevant found four attributes: {age, breast_quad, breast, menopause}.
The CAIRAD: Co-appearance based Analysis for Incorrect Records and Attribute-values Detection ( Rahman MG, Islam MZ, Bossomaier T, Gao J. CAIRAD: A co-appearance based analysis for incorrect records and attribute-values detection. Proc Int Jt Conf Neural Networks. 2012;(June). https://doi.org/10.1109/IJCNN.2012.6252669) filter has been determined the noises in attributes and class features. Per the filtering results, 34 instances of LBCD had noises in half (or even more than half) of their features. Those were removed from the data.
It is known that the noises in the class are riskier and teasing than those of attributes. Meantime, the class attribute had 35 (14%) missed values from 252 after COIRAD filtering. It was unacceptable, considering the comparable number (only 85 cases) of recurrence events in the class of initial LBCD. The imputation (reconstruction, "cure") of missed values was performed via the algorithm offered in:
Bai BM, Mangathayaru N, Rani BP. An approach to find missing values in medical datasets. In: ACM International Conference Proceeding Series. Vol 24-26-Sept. ; 2015. https://doi.org/10.1145/2832987.2833083. The noises presented in the remaining attributes, ranging from 1% to 14%, were neglected.
There are 252 instances in the dataset, of which 206 do not have RE, and the remaining 46 have RE. Six attributes, including its class, define each instance. This dataset is obtained from the initial version of the LBCD betterment, and it provides a significant advantage in the performance over the original LBCD for most classifying algorithms of Machine Learning. However, the dataset is slightly more imbalanced than the LBCD, which is a minus.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is part of the UCR Archive maintained by University of Southampton researchers. Please cite a relevant or the latest full archive release if you use the datasets. See http://www.timeseriesclassification.com/.
The traffic data are collected with the loop sensor installed on ramp for the 101 North freeway in Los Angeles. This location is close to Dodgers Stadium; therefore the traffic is affected by volume of visitors to the stadium. Missing values are represented with NaN. - Class 1: Normal Day - Class 2: Game Day There is nothing to infer from the order of examples in the train and test set. Missing values are represented with NaN in the text file. Data created by Ihler, Alexander, Jon Hutchins, and Padhraic Smyth (see [1][2][3]). Data edited by Chin-Chia Michael Yeh.
[1] Ihler, Alexander, Jon Hutchins, and Padhraic Smyth. "Adaptive event detection with time-varying poisson processes." Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006.
[2] “UCI Machine Learning Repository: Dodgers Loop Sensor Data Set.” UCI Machine Learning Repository, archive.ics.uci.edu/ml/datasets/dodgers+loop+sensor.
[3] “Caltrans PeMS.” Caltrans, pems.dot.ca.gov/.
Donator: C. Yeh
Facebook
TwitterThis dataset is downloaded from UCI repository. https://archive.ics.uci.edu/ml/datasets/nursery the dataset contains categorical data to rank nursery school applicants. The original dataset contains 5 classes. Classes were reorganized in order to remain with only two classes (”recommended” or ”not recommended”).
Facebook
Twitter
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
100 volunteers provide a semen sample analyzed according to the WHO 2010 criteria. Sperm concentration are related to socio-demographic data, environmental factors, health status, and life habits
Attribute Information:
Season in which the analysis was performed. 1) winter, 2) spring, 3) Summer, 4) fall. (-1, -0.33, 0.33, 1)
Age at the time of analysis. 18-36 (0, 1)
Childish diseases (ie , chicken pox, measles, mumps, polio) 1) yes, 2) no. (0, 1)
Accident or serious trauma 1) yes, 2) no. (0, 1)
Surgical intervention 1) yes, 2) no. (0, 1)
High fevers in the last year 1) less than three months ago, 2) more than three months ago, 3) no. (-1, 0, 1)
Frequency of alcohol consumption 1) several times a day, 2) every day, 3) several times a week, 4) once a week, 5) hardly ever or never (0, 1)
Smoking habit 1) never, 2) occasional 3) daily. (-1, 0, 1)
Number of hours spent sitting per day ene-16 (0, 1)
Output: Diagnosis normal (N), altered (O)
Source:
David Gil, dgil '@' dtic.ua.es, Lucentia Research Group, Department of Computer Technology, University of Alicante
Jose Luis Girela, girela '@' ua.es, Department of Biotechnology, University of Alicante
Relevant Papers:
David Gil, Jose Luis Girela, Joaquin De Juan, M. Jose Gomez-Torres, and Magnus Johnsson. Predicting seminal quality with artificial intelligence methods. Expert Systems with Applications, 39(16):12564 – 12573, 2012
Citation Request:
David Gil, Jose Luis Girela, Joaquin De Juan, M. Jose Gomez-Torres, and Magnus Johnsson. Predicting seminal quality with artificial intelligence methods. Expert Systems with Applications, 39(16):12564 – 12573, 2012
Facebook
TwitterThe dataset is downloaded from UCI repository https://archive.ics.uci.edu/ml/datasets/covertype The dataset contains 1 to 7 Forest Cover Type. The task is to predict the forest cover type from cartographic variables only (no remotely sensed data)
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data sets supporting the results reported in the paper: Hellinger Distance Trees for Imbalanced Streams, R. J. Lyon, J.M. Brooke, J.D. Knowles, B.W Stappers, 22nd International Conference on Pattern Recognition (ICPR), p.1969 - 1974, 2014. DOI: 10.1109/ICPR.2014.344 Contained in this distribution are results of stream classifier perfromance on four different data sets. Also included are the test results from our attempt at reproducing the outcome of the paper, Learning Decision Trees for Un-balanced Data, D. A. Cieslak and N. V. Chawla, in Machine Learning and Knowledge Discovery in Databases (W. Daelemans, B. Goethals, and K. Morik, eds.), vol. 5211 of LNCS, pp. 241-256, 2008. The data sets used for these experiments include, MAGIC Gamma Telescope Data Set : https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+TelescopeMiniBooNE particle identification Data Set : https://archive.ics.uci.edu/ml/datasets/MiniBooNE+particle+identificationSkin Segmentation Data Set : https://archive.ics.uci.edu/ml/datasets/Skin+SegmentationLetter Recognition Data Set : https://archive.ics.uci.edu/ml/datasets/Letter+RecognitionPen-Based Recognition of Handwritten Digits Data Set : https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+DigitsStatlog (Landsat Satellite) Data Set : https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite)Statlog (Image Segmentation) Data Set : https://archive.ics.uci.edu/ml/datasets/Statlog+(Image+Segmentation) A further data set used is not publicly available at present. However we are in the process of releasing it for public use. Please get in touch if you'd like to use it.
A readme file accompanies the data describing it in more detail.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
part of the dataset supplied in https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009 https://archive.ics.uci.edu/ml/datasets/wine+quality
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset is downloaded from UCI repository http://archive.ics.uci.edu/ml/datasets/connectionist+bench+(sonar,+mines+vs.+rocks) The task is to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock. The original file is converted to a csv file (2018-04-05)
Facebook
TwitterOriginal Data from: https://archive.ics.uci.edu/ml/datasets/Heart+Disease Changes made: - four rows with missing values were removed, leaving 299 records - Chest Pain Type, Restecg, Thal variables were converted to indicator variables - class attribute binarised to -1 (no disease) / +1 disease (original values 1,2,3) Attributes: Col 0: CLASS: -1: no disease +1: disease Col 1: Age (cts) Col 2: Sex (0/1) Col 3: indicator (0/1) for typ angina Col 4: indicator for atyp angina Col 5: indicator for non-ang pain Col 6: resting blood pressure (cts) Col 7: Serum cholest (cts) Col 8: fasting blood sugar >120mg/dl (0/1) Col 9: indicator for electrocardio value 1 Col 10: indicator for electrocardio value 2 Col 11: Max heart rate (cts) Col 12: exercised induced angina (0/1) Col 13: ST depression induced by exercise (cts) Col 14: indicator for slope of peak exercise up Col 15: indicator for slope of peak exercise down Col 16: no major vessels colored by fluro (ctsish: 0,1,2,3) Col 17: Thal reversible defect indicator Col 18: Thal fixed defect indicator Col 19: Class 0-4, where 0 is disease not present, 1-4 is present
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
First, I am new to ML, and just in case I slip up, apologies in advance!! So, I am doing an online ML course and this is an assignment where we are supposed to practice scikit-learn's PCA routine. Since the course has been ARCHIVED - which means the discussion posts are not answered!! - hence my posting of the problem here. What better way to learn than to get so many experts giving me feedback … right? Content The data was taken over a 2-month period in India with 25 features ( eg, red blood cell count, white blood cell count, etc). The target is the 'classification', which is either 'ckd' or 'notckd' - ckd=chronic kidney disease. There are 400 rows The data needs cleaning: in that it has NaNs and the numeric features need to be forced to floats. Basically, we were instructed to get rid of ALL ROWS with Nans, with no threshold - meaning, any row that has even one NaN, gets deleted. Part 1: We are asked to choose 3 features (bgr, rc, wc), visualize them, then run the PCA with n_components=2. the PCA is to be run twice: one with no scaling and the second run WITH scaling. And this is where my issue starts … in that after scaling I can hardly see any difference! I will stop here for now till I get feedback and then move to Part 2. Acknowledgements The dataset is available at: https://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Disease Inspiration I would like to get an intuitive and a practical understanding of PCA.
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Cuff-Less Blood Pressure Estimation Dataset [2] from the UCI Machine Learning Repository. It is a subset of the MIMIC-II Waveform Dataset that contains 12000 records of simultaneous PPG and ABP from 942 patients with a sampling rate of 125 Hz. The 12000 records were uniformly split into four parts with 3000 records each. However, as the subject information is lacking, the Hold-one-out strategy was utilized to generate training, validation, and test sets once the data was preprocessed. In the end, the UCI dataset had 291,078 segments, which was around 404 hours of recording, making it substantially the biggest data set with a considerably higher ratio of continuous segments per record (32.15).
[2] Kachuee, M., Kiani, M. M., Mohammadzade, H. & Shabany, M. Cuff-less blood pressure estimation data set (2015). UCI repository https://archive.ics.uci.edu/ml/datasets/Cuff-Less+Blood+Pressure+Estimation.