100+ datasets found
  1. UCI dataset

    • springernature.figshare.com
    bin
    Updated Mar 13, 2023
    Cite
    Wan-Ting Hsieh; Sergio González Vázquez; Trista Chen (2023). UCI dataset [Dataset]. http://doi.org/10.6084/m9.figshare.20496258.v1
    Available download formats: bin
    Dataset updated
    Mar 13, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Wan-Ting Hsieh; Sergio González Vázquez; Trista Chen
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The Cuff-Less Blood Pressure Estimation Dataset [2] from the UCI Machine Learning Repository. It is a subset of the MIMIC-II Waveform Dataset that contains 12000 records of simultaneous PPG and ABP from 942 patients with a sampling rate of 125 Hz. The 12000 records were uniformly split into four parts with 3000 records each. However, as the subject information is lacking, the Hold-one-out strategy was utilized to generate training, validation, and test sets once the data was preprocessed. In the end, the UCI dataset had 291,078 segments, which was around 404 hours of recording, making it substantially the biggest data set with a considerably higher ratio of continuous segments per record (32.15).

    [2] Kachuee, M., Kiani, M. M., Mohammadzade, H. & Shabany, M. Cuff-less blood pressure estimation data set (2015). UCI repository https://archive.ics.uci.edu/ml/datasets/Cuff-Less+Blood+Pressure+Estimation.
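
As a rough sanity check on the figures quoted above, the sketch below derives the average segment length implied by 291,078 segments and about 404 hours of recording. The description does not state a segment length, so the result is an inference from rounded totals, not a documented property of the dataset.

```python
# Figures quoted in the dataset description.
n_segments = 291_078
total_hours = 404          # "around 404 hours", so the result is approximate
sampling_rate_hz = 125

# Implied average segment duration in seconds.
seconds_per_segment = total_hours * 3600 / n_segments

# Implied average number of samples per segment at 125 Hz.
samples_per_segment = seconds_per_segment * sampling_rate_hz

print(f"{seconds_per_segment:.2f} s per segment")
print(f"{samples_per_segment:.0f} samples per segment")
```

The totals are consistent with segments of roughly five seconds each, which is only an estimate because the hour count is rounded.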

  2. Data from: Online Retail Data Set

    • kaggle.com
    Updated Sep 11, 2021
    Cite
    Sahar Pourahmad (2021). Online Retail Data Set [Dataset]. https://www.kaggle.com/saharpourahmad/online-retail-data-set/discussion
    Dataset updated
    Sep 11, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sahar Pourahmad
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Sahar Pourahmad

    Released under CC0: Public Domain

    Contents

  3. Replication Data for: Scalable Kernel Mean Matching

    • dataverse.harvard.edu
    Updated Apr 3, 2016
    Cite
    Swarup Chandra (2016). Replication Data for: Scalable Kernel Mean Matching [Dataset]. http://doi.org/10.7910/DVN/ELFPEM
    Dataset updated
    Apr 3, 2016
    Dataset provided by
    Harvard Dataverse
    Authors
    Swarup Chandra
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description
  4. UCI Heart Disease Data Set

    • kaggle.com
    Updated Jan 1, 2021
    Cite
    Lourens Walters (2021). UCI Heart Disease Data Set [Dataset]. https://www.kaggle.com/lourenswalters/uci-heart-disease-data-set/metadata
    Dataset updated
    Jan 1, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Lourens Walters
    Description

    Context

    The dataset used can be found on the UCI Machine Learning Repository at the following location:

    Heart Disease Dataset

    There are several copies of this dataset to be found on Kaggle, with people focusing on different types of analyses of the data. This specific copy can be analysed by anyone interested, but is primarily used by a study group from the Udacity Bertelsmann Technology Scholarship to practice analysis of association between variables as well as implementation and comparison of various Machine Learning models.

    Content

    According to Detrano et al. (1989), as cited on the UCI dataset webpage, the data were collected from 303 patients referred for coronary angiography at the Cleveland Clinic between May 1981 and September 1984. The 13 independent (feature) variables can be divided into 3 groups as follows:

    Routine evaluation (based on historical data):

    • ECG at rest
    • Serum Cholesterol
    • Fasting blood sugar

    Non-invasive test data (informed consent obtained for data as part of research protocol):

    • Exercise ECG
      • ST-segment peak slope (upsloping, flat or downsloping)
      • ST-segment depression
    • Exercise Thallium scintigraphy (fixed, reversible or none)
    • Cardiac fluoroscopy (number of vessels that appeared to contain calcium)

    Other demographic and clinical variables (based on routine data):

    • Age
    • Sex
    • Chest pain type
    • Systolic blood pressure
    • ST-T-wave abnormality (T-wave abnormality)
    • Probable or definite left ventricular hypertrophy (Estes' criteria)

    The dependent (response) variable was the angiographic test result indicating a >50% diameter narrowing.

    Data Dictionary

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3632459%2Fa01747fb0158dc51c12bc0824c9c4ae4%2Fdata_dictionary2.png?generation=1609522473018549&alt=media

    Acknowledgements

    UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. Donor:

    David W. Aha (aha '@' ics.uci.edu) (714) 856-8779

    Inspiration

    The objective of the analysis is to use statistical learning to identify factors associated with coronary artery disease, as indicated by a coronary angiography interpreted by a cardiologist (as per the paper by Detrano et al. cited above).

  5. Statlog (Heart) Data Set

    • figshare.com
    txt
    Updated May 31, 2023
    Cite
    Xinyue Zhang (2023). Statlog (Heart) Data Set [Dataset]. http://doi.org/10.6084/m9.figshare.19236777.v1
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Xinyue Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
  6. UCI datasets

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Apr 4, 2023
    Cite
    Mathias Drton; Stephan Haug; David Reifferscheidt; Oleksandr Zadorozhnyi (2023). UCI datasets [Dataset]. http://doi.org/10.5281/zenodo.7681792
    Available download formats: zip
    Dataset updated
    Apr 4, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mathias Drton; Stephan Haug; David Reifferscheidt; Oleksandr Zadorozhnyi
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Collection of two datasets from the UCI website that could be used for structure learning tasks. Includes datasets regarding

    • Air Quality
    • US census 1990

    Size: Two datasets of sizes 9471 × 17 and 2458285 × 68, respectively

    Number of features: 15-68

    Ground truth: No

    Type of Graph: No ground truth

    More information about the datasets is contained in the dataset_description.html files.

  7. UCI Diabetes Data Set

    • kaggle.com
    zip
    Updated May 1, 2020
    Cite
    Ergin Altıntaş (2020). UCI Diabetes Data Set [Dataset]. https://www.kaggle.com/ealtintas/uci-machine-learning-repository-diabetes-data-set
    Available download formats: zip (217048 bytes)
    Dataset updated
    May 1, 2020
    Authors
    Ergin Altıntaş
    Description

    About this Dataset

    This CSV contains a data set prepared for participants in the 1994 AAAI Spring Symposium on Artificial Intelligence in Medicine.

    Content

    Original files were obtained from: https://archive.ics.uci.edu/ml/datasets/diabetes

    The archive diabetes-data.tar.z, which contains 70 sets of data recorded on diabetes patients (several weeks' to months' worth of glucose, insulin, and lifestyle data per patient, plus a description of the problem domain), was extracted, processed, and merged into a single CSV file.

    The Code field of the CSV is deciphered as follows:

    • 33 = Regular insulin dose
    • 34 = NPH insulin dose
    • 35 = UltraLente insulin dose
    • 48 = Unspecified blood glucose measurement
    • 57 = Unspecified blood glucose measurement
    • 58 = Pre-breakfast blood glucose measurement
    • 59 = Post-breakfast blood glucose measurement
    • 60 = Pre-lunch blood glucose measurement
    • 61 = Post-lunch blood glucose measurement
    • 62 = Pre-supper blood glucose measurement
    • 63 = Post-supper blood glucose measurement
    • 64 = Pre-snack blood glucose measurement
    • 65 = Hypoglycemic symptoms
    • 66 = Typical meal ingestion
    • 67 = More-than-usual meal ingestion
    • 68 = Less-than-usual meal ingestion
    • 69 = Typical exercise activity
    • 70 = More-than-usual exercise activity
    • 71 = Less-than-usual exercise activity
    • 72 = Unspecified special event
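
The code table above maps naturally onto a lookup dictionary. The following is a minimal sketch of decoding the Code field while reading the CSV. The column names (`Date`, `Time`, `Code`, `Value`) and the two sample rows are illustrative assumptions, since the listing does not show the merged file's header.

```python
import csv
import io

# Code-to-meaning table, copied from the dataset description.
CODE_LABELS = {
    33: "Regular insulin dose",
    34: "NPH insulin dose",
    35: "UltraLente insulin dose",
    48: "Unspecified blood glucose measurement",
    57: "Unspecified blood glucose measurement",
    58: "Pre-breakfast blood glucose measurement",
    59: "Post-breakfast blood glucose measurement",
    60: "Pre-lunch blood glucose measurement",
    61: "Post-lunch blood glucose measurement",
    62: "Pre-supper blood glucose measurement",
    63: "Post-supper blood glucose measurement",
    64: "Pre-snack blood glucose measurement",
    65: "Hypoglycemic symptoms",
    66: "Typical meal ingestion",
    67: "More-than-usual meal ingestion",
    68: "Less-than-usual meal ingestion",
    69: "Typical exercise activity",
    70: "More-than-usual exercise activity",
    71: "Less-than-usual exercise activity",
    72: "Unspecified special event",
}


def decode(code):
    """Return the description for a Code value, or a fallback if unknown."""
    return CODE_LABELS.get(int(code), "Unknown code")


# Illustrative rows only; the header names are assumed, not taken from the file.
sample = "Date,Time,Code,Value\n04-21-1991,9:09,58,100\n04-21-1991,9:09,33,9\n"
rows = list(csv.DictReader(io.StringIO(sample)))
labels = [decode(row["Code"]) for row in rows]
print(labels)
```

Using `dict.get` with a fallback keeps the loop from failing on any code that is not in the published table.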

  8. higgs_parquet

    • huggingface.co
    Updated Feb 23, 2023
    Cite
    Alberto Danese (2023). higgs_parquet [Dataset]. https://huggingface.co/datasets/albedan/higgs_parquet
    Dataset updated
    Feb 23, 2023
    Authors
    Alberto Danese
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A repackaging of the well-known HIGGS dataset from UCI as a Parquet file. It has the same rows and content as the original file (11M rows), but is almost 80% smaller. Original source: https://archive.ics.uci.edu/dataset/280/higgs. Citation for the original dataset: Whiteson, D. (2014). HIGGS [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5V312.

  9. Hepatitis Data Set

    • figshare.com
    txt
    Updated May 31, 2023
    Cite
    Xinyue Zhang (2023). Hepatitis Data Set [Dataset]. http://doi.org/10.6084/m9.figshare.19236768.v1
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Xinyue Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Adaptation of http://archive.ics.uci.edu/ml/datasets/Hepatitis. Ready for usage with ehrapy.

  10. Updated Ljubljana Breast Cancer Data Set: reduced and cleaned version

    • data.mendeley.com
    Updated Oct 25, 2023
    Cite
    Gennady Chuiko (2023). Updated Ljubljana Breast Cancer Data Set: reduced and cleaned version [Dataset]. http://doi.org/10.17632/fgs9pyfv2z.2
    Dataset updated
    Oct 25, 2023
    Authors
    Gennady Chuiko
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains information for machine learning algorithms to forecast recurrence events (RE) in patients with breast cancer stages I to III. It contains 252 instances and six attributes, including a binary class indicating whether RE occurred. It has been reduced and denoised from the original Ljubljana Breast Cancer Dataset (LBCD), which holds 286 instances with ten attributes each (Zwitter M. and Soklic M. (1988). Breast Cancer. UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/14/breast+cancer).

    Ranking the attributes with eight different machine learning algorithms, followed by statistical handling of the resulting 8-component ranking vectors, allowed the ten features to be reduced to the six most relevant ones. The five most pertinent features were {deg_malig, irradiat, node_caps, tumor_size, inv_nodes}. Four attributes were found to be less relevant: {age, breast_quad, breast, menopause}.

    The CAIRAD filter (Co-appearance based Analysis for Incorrect Records and Attribute-values Detection; Rahman MG, Islam MZ, Bossomaier T, Gao J. CAIRAD: A co-appearance based analysis for incorrect records and attribute-values detection. Proc Int Jt Conf Neural Networks. 2012;(June). https://doi.org/10.1109/IJCNN.2012.6252669) was used to detect noise in the attributes and the class. Per the filtering results, 34 instances of LBCD had noise in half or more of their features; these were removed from the data. Noise in the class is known to be riskier and more troublesome than noise in the attributes. After CAIRAD filtering, the class attribute had 35 missing values (14%) out of 252. This was unacceptable given the comparable number of recurrence events (only 85 cases) in the class of the initial LBCD. The missing values were imputed using the algorithm proposed in: Bai BM, Mangathayaru N, Rani BP. An approach to find missing values in medical datasets. In: ACM International Conference Proceeding Series. Vol 24-26-Sept. ; 2015. https://doi.org/10.1145/2832987.2833083. The noise in the remaining attributes, ranging from 1% to 14%, was neglected.

    The dataset has 252 instances, of which 206 do not have RE and the remaining 46 do. Six attributes, including the class, define each instance. The dataset is derived from the initial version of the improved LBCD and offers a significant performance advantage over the original LBCD for most machine learning classifiers. However, it is slightly more imbalanced than the LBCD, which is a drawback.

  11. DodgerLoopGame UCR Archive Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated May 14, 2024
    Cite
    Zenodo (2024). DodgerLoopGame UCR Archive Dataset [Dataset]. http://doi.org/10.5281/zenodo.11186628
    Available download formats: bin
    Dataset updated
    May 14, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is part of the UCR Archive maintained by University of Southampton researchers. Please cite a relevant or the latest full archive release if you use the datasets. See http://www.timeseriesclassification.com/.

    The traffic data were collected by a loop sensor installed on a ramp of the 101 North freeway in Los Angeles. This location is close to Dodgers Stadium, so traffic is affected by the volume of visitors to the stadium. Missing values are represented with NaN in the text file.

    • Class 1: Normal Day
    • Class 2: Game Day

    There is nothing to infer from the order of examples in the train and test sets. Data created by Ihler, Alexander, Jon Hutchins, and Padhraic Smyth (see [1][2][3]). Data edited by Chin-Chia Michael Yeh.

    [1] Ihler, Alexander, Jon Hutchins, and Padhraic Smyth. "Adaptive event detection with time-varying poisson processes." Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006.

    [2] “UCI Machine Learning Repository: Dodgers Loop Sensor Data Set.” UCI Machine Learning Repository, archive.ics.uci.edu/ml/datasets/dodgers+loop+sensor.

    [3] “Caltrans PeMS.” Caltrans, pems.dot.ca.gov/.

    Donator: C. Yeh

  12. Replication Data for: Nursery Data Set

    • dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    Wang, Wenjuan (2023). Replication Data for: Nursery Data Set [Dataset]. http://doi.org/10.7910/DVN/MBFQK0
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Wang, Wenjuan
    Description

    This dataset was downloaded from the UCI repository: https://archive.ics.uci.edu/ml/datasets/nursery. It contains categorical data used to rank nursery school applicants. The original dataset contains 5 classes, which were reorganized into only two classes ("recommended" or "not recommended").

  13. projectData

    • huggingface.co
    Updated May 1, 2024
    Cite
    Haley Rindfleisch (2024). projectData [Dataset]. https://huggingface.co/datasets/h2mrind/projectData
    Dataset updated
    May 1, 2024
    Authors
    Haley Rindfleisch
    Description
  14. Fertility Data Set

    • kaggle.com
    Updated Sep 11, 2021
    Cite
    Sahar Pourahmad (2021). Fertility Data Set [Dataset]. https://www.kaggle.com/saharpourahmad/fertility-data-set/code
    Dataset updated
    Sep 11, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sahar Pourahmad
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    100 volunteers provided a semen sample analyzed according to the WHO 2010 criteria. Sperm concentrations are related to socio-demographic data, environmental factors, health status, and life habits.

    Content

    Attribute Information:

    Season in which the analysis was performed. 1) winter, 2) spring, 3) summer, 4) fall. (-1, -0.33, 0.33, 1)

    Age at the time of analysis. 18-36 (0, 1)

    Childhood diseases (i.e., chicken pox, measles, mumps, polio) 1) yes, 2) no. (0, 1)

    Accident or serious trauma 1) yes, 2) no. (0, 1)

    Surgical intervention 1) yes, 2) no. (0, 1)

    High fevers in the last year 1) less than three months ago, 2) more than three months ago, 3) no. (-1, 0, 1)

    Frequency of alcohol consumption 1) several times a day, 2) every day, 3) several times a week, 4) once a week, 5) hardly ever or never (0, 1)

    Smoking habit 1) never, 2) occasional 3) daily. (-1, 0, 1)

    Number of hours spent sitting per day. 1-16 (0, 1)

    Output: Diagnosis normal (N), altered (O)
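
The parenthesised values in the attribute list look like linearly rescaled versions of the raw codes. The sketch below assumes plain min-max rescaling, which reproduces the listed season values (-1, -0.33, 0.33, 1). The source does not confirm that this is how the published values were produced, and the helper name `rescale` is hypothetical.

```python
def rescale(value, lo, hi, new_lo=0.0, new_hi=1.0):
    """Linearly map value from the range [lo, hi] to [new_lo, new_hi]."""
    return new_lo + (value - lo) * (new_hi - new_lo) / (hi - lo)


# Season codes 1..4 mapped onto [-1, 1], rounded to two decimals.
season_codes = [round(rescale(s, 1, 4, -1, 1), 2) for s in (1, 2, 3, 4)]

# Age range 18..36 mapped onto [0, 1]; 27 is the midpoint of that range.
age_27 = rescale(27, 18, 36)

print(season_codes)
print(age_27)
```

The same helper would cover the three-level attributes (fever, smoking) with `rescale(code, 1, 3, -1, 1)`, again under the assumption of linear scaling.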

    Acknowledgements

    Source:

    David Gil, dgil '@' dtic.ua.es, Lucentia Research Group, Department of Computer Technology, University of Alicante

    Jose Luis Girela, girela '@' ua.es, Department of Biotechnology, University of Alicante

    Relevant Papers:

    David Gil, Jose Luis Girela, Joaquin De Juan, M. Jose Gomez-Torres, and Magnus Johnsson. Predicting seminal quality with artificial intelligence methods. Expert Systems with Applications, 39(16):12564 – 12573, 2012

    Citation Request:

    David Gil, Jose Luis Girela, Joaquin De Juan, M. Jose Gomez-Torres, and Magnus Johnsson. Predicting seminal quality with artificial intelligence methods. Expert Systems with Applications, 39(16):12564 – 12573, 2012

  15. Replication Data for: Covtype

    • dataone.org
    Updated Nov 22, 2023
    Cite
    Wang, Wenjuan (2023). Replication Data for: Covtype [Dataset]. http://doi.org/10.7910/DVN/NTIWVN
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Wang, Wenjuan
    Description

    The dataset was downloaded from the UCI repository: https://archive.ics.uci.edu/ml/datasets/covertype. The class label is the forest cover type, coded 1 to 7. The task is to predict the forest cover type from cartographic variables only (no remotely sensed data).

  16. Classification results for: Hellinger Distance Trees for Imbalanced Streams

    • figshare.com
    application/gzip
    Updated May 31, 2023
    Cite
    Robert Lyon (2023). Classification results for: Hellinger Distance Trees for Imbalanced Streams [Dataset]. http://doi.org/10.6084/m9.figshare.1534549.v1
    Available download formats: application/gzip
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Robert Lyon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data sets supporting the results reported in the paper: Hellinger Distance Trees for Imbalanced Streams, R. J. Lyon, J.M. Brooke, J.D. Knowles, B.W Stappers, 22nd International Conference on Pattern Recognition (ICPR), p.1969 - 1974, 2014. DOI: 10.1109/ICPR.2014.344

    Contained in this distribution are results of stream classifier performance on four different data sets. Also included are the test results from our attempt at reproducing the outcome of the paper, Learning Decision Trees for Un-balanced Data, D. A. Cieslak and N. V. Chawla, in Machine Learning and Knowledge Discovery in Databases (W. Daelemans, B. Goethals, and K. Morik, eds.), vol. 5211 of LNCS, pp. 241-256, 2008.

    The data sets used for these experiments include:

    • MAGIC Gamma Telescope Data Set: https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope
    • MiniBooNE particle identification Data Set: https://archive.ics.uci.edu/ml/datasets/MiniBooNE+particle+identification
    • Skin Segmentation Data Set: https://archive.ics.uci.edu/ml/datasets/Skin+Segmentation
    • Letter Recognition Data Set: https://archive.ics.uci.edu/ml/datasets/Letter+Recognition
    • Pen-Based Recognition of Handwritten Digits Data Set: https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits
    • Statlog (Landsat Satellite) Data Set: https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite)
    • Statlog (Image Segmentation) Data Set: https://archive.ics.uci.edu/ml/datasets/Statlog+(Image+Segmentation)

    A further data set used is not publicly available at present. However, we are in the process of releasing it for public use. Please get in touch if you'd like to use it.

    A readme file accompanies the data describing it in more detail.

  17. Wine Quality Full

    • figshare.com
    txt
    Updated Jul 4, 2022
    + more versions
    Cite
    Deepchecks Data (2022). Wine Quality Full [Dataset]. http://doi.org/10.6084/m9.figshare.20223303.v1
    Available download formats: txt
    Dataset updated
    Jul 4, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Deepchecks Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
  18. Replication Data for: Sonar

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Apr 5, 2018
    Cite
    Wenjuan Wang (2018). Replication Data for: Sonar [Dataset]. http://doi.org/10.7910/DVN/LG2FSS
    Dataset updated
    Apr 5, 2018
    Dataset provided by
    Harvard Dataverse
    Authors
    Wenjuan Wang
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The dataset is downloaded from UCI repository http://archive.ics.uci.edu/ml/datasets/connectionist+bench+(sonar,+mines+vs.+rocks) The task is to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock. The original file is converted to a csv file (2018-04-05)

  19. Replication Data for: Cleveland Heart Disease

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Bartley, Christopher (2023). Replication Data for: Cleveland Heart Disease [Dataset]. http://doi.org/10.7910/DVN/QWXVNT
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Bartley, Christopher
    Description

    Original data from: https://archive.ics.uci.edu/ml/datasets/Heart+Disease

    Changes made:

    • Four rows with missing values were removed, leaving 299 records
    • Chest Pain Type, Restecg, Thal variables were converted to indicator variables
    • Class attribute binarised to -1 (no disease) / +1 (disease; original values 1,2,3)

    Attributes:

    • Col 0: CLASS: -1: no disease, +1: disease
    • Col 1: Age (cts)
    • Col 2: Sex (0/1)
    • Col 3: indicator (0/1) for typ angina
    • Col 4: indicator for atyp angina
    • Col 5: indicator for non-ang pain
    • Col 6: resting blood pressure (cts)
    • Col 7: Serum cholest (cts)
    • Col 8: fasting blood sugar >120mg/dl (0/1)
    • Col 9: indicator for electrocardio value 1
    • Col 10: indicator for electrocardio value 2
    • Col 11: Max heart rate (cts)
    • Col 12: exercise-induced angina (0/1)
    • Col 13: ST depression induced by exercise (cts)
    • Col 14: indicator for slope of peak exercise up
    • Col 15: indicator for slope of peak exercise down
    • Col 16: number of major vessels colored by fluoroscopy (ctsish: 0,1,2,3)
    • Col 17: Thal reversible defect indicator
    • Col 18: Thal fixed defect indicator
    • Col 19: Class 0-4, where 0 is disease not present, 1-4 is present
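
The relationship between the original class (Col 19) and the binarised class (Col 0) can be sketched as follows. This follows the Col 19 wording (0 means disease not present, 1 to 4 means present) and is an illustration, not the author's actual preprocessing script. The function name `binarise` is hypothetical.

```python
def binarise(original_class: int) -> int:
    """Map the original 0-4 class to -1 (no disease) or +1 (disease)."""
    if not 0 <= original_class <= 4:
        raise ValueError(f"expected a class in 0..4, got {original_class}")
    return -1 if original_class == 0 else 1


# Apply the mapping to every possible original class value.
mapped = [binarise(c) for c in range(5)]
print(mapped)
```

Rejecting values outside 0 to 4 makes a mislabelled or shifted column fail loudly instead of silently becoming "disease".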

  20. Chronic KIdney Disease dataset - Dataset - CKAN

    • data.poltekkes-smg.ac.id
    Updated Sep 21, 2024
    + more versions
    Cite
    (2024). Chronic KIdney Disease dataset - Dataset - CKAN [Dataset]. https://data.poltekkes-smg.ac.id/dataset/chronic-kidney-disease-dataset
    Dataset updated
    Sep 21, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    First, I am new to ML, and just in case I slip up, apologies in advance! I am doing an online ML course, and this is an assignment where we are supposed to practice scikit-learn's PCA routine. The course has been archived, which means discussion posts are not answered, hence my posting of the problem here. What better way to learn than to get so many experts giving me feedback, right?

    Content

    The data was taken over a 2-month period in India with 25 features (e.g., red blood cell count, white blood cell count, etc.). The target is the 'classification', which is either 'ckd' or 'notckd' (ckd = chronic kidney disease). There are 400 rows.

    The data needs cleaning: it has NaNs, and the numeric features need to be forced to floats. We were instructed to get rid of all rows with NaNs, with no threshold, meaning any row that has even one NaN gets deleted.

    Part 1: We are asked to choose 3 features (bgr, rc, wc), visualize them, then run PCA with n_components=2. The PCA is to be run twice: once with no scaling and once with scaling. This is where my issue starts: after scaling I can hardly see any difference! I will stop here for now until I get feedback and then move to Part 2.

    Acknowledgements

    The dataset is available at: https://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Disease

    Inspiration

    I would like to get an intuitive and practical understanding of PCA.
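
The cleaning step and the scaling question described above can be illustrated with the standard library alone. The sketch below drops every row containing a NaN, then compares each feature's share of the total variance before and after standardisation. It is not scikit-learn's PCA. It shows only the variance imbalance that PCA reacts to, and the rows are made-up values in the order (bgr, rc, wc), not records from the dataset.

```python
import math


def drop_nan_rows(rows):
    """Keep only rows in which no value is NaN (the 'no threshold' rule)."""
    return [r for r in rows if not any(math.isnan(v) for v in r)]


def column_variances(rows):
    """Population variance of each column."""
    variances = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        variances.append(sum((v - mean) ** 2 for v in col) / len(col))
    return variances


def standardise(rows):
    """Scale every column to zero mean and unit variance."""
    means = [sum(col) / len(col) for col in zip(*rows)]
    stds = [math.sqrt(v) for v in column_variances(rows)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def variance_share(rows):
    """Fraction of the total variance contributed by each column."""
    variances = column_variances(rows)
    total = sum(variances)
    return [v / total for v in variances]


# Illustrative rows only, in the order (bgr, rc, wc).
nan = float("nan")
raw = [
    [120.0, 5.2, 7800.0],
    [nan, 4.1, 6000.0],
    [420.0, 3.9, 6700.0],
    [100.0, 4.6, 9100.0],
    [90.0, nan, 7200.0],
    [140.0, 4.4, 5000.0],
]

clean = drop_nan_rows(raw)
unscaled = variance_share(clean)
scaled = variance_share(standardise(clean))
print(len(clean), unscaled, scaled)
```

With these illustrative values, nearly all of the unscaled variance sits in the wc column, so an unscaled first principal component would essentially follow wc. After standardisation each feature contributes equally, which is when the remaining two features can influence the components.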
