Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
biology
UCI Machine Learning Repository is a collection of over 550 datasets.
https://academictorrents.com/nolicensespecified
The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets. As an indication of the impact of the archive, it has been cited over 1000 times, making it one of the top 100 most cited "papers" in all of computer science. The current version of the web site was designed in 2007 by Arthur Asuncion and David Newman, and this project is in collaboration with Rexa.info at the University of Massachusetts Amherst. Funding support from the National Science Foundation is gratefully acknowledged. Many people deserve thanks for making the repository a success. Foremost among them are the d
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Cuff-Less Blood Pressure Estimation Dataset [2] from the UCI Machine Learning Repository is a subset of the MIMIC-II Waveform Dataset. It contains 12,000 records of simultaneous PPG and ABP signals from 942 patients, sampled at 125 Hz. The 12,000 records were split uniformly into four parts of 3,000 records each. However, because subject information is lacking, the hold-one-out strategy was used to generate the training, validation, and test sets once the data had been preprocessed. In the end, the UCI dataset had 291,078 segments, or around 404 hours of recording, making it substantially the largest dataset, with a considerably higher ratio of continuous segments per record (32.15).
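The segment count and total duration quoted above imply an average segment length, which is not stated in the description. The short check below derives it from the quoted figures only; because "around 404 hours" is itself rounded, the result is approximate and should not be read as the authors' stated segment length.

```python
# Figures quoted in the description above.
total_segments = 291_078
total_hours = 404            # "around 404 hours", so everything below is approximate
sampling_rate_hz = 125

# Implied average segment duration in seconds.
segment_seconds = total_hours * 3600 / total_segments

# Implied number of samples per segment at 125 Hz.
samples_per_segment = segment_seconds * sampling_rate_hz

print(f"{segment_seconds:.2f} s per segment, about {samples_per_segment:.0f} samples")
```

Under these figures the average segment comes out at roughly 5 seconds, or about 625 samples at 125 Hz.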
[2] Kachuee, M., Kiani, M. M., Mohammadzade, H. & Shabany, M. Cuff-less blood pressure estimation data set (2015). UCI repository https://archive.ics.uci.edu/ml/datasets/Cuff-Less+Blood+Pressure+Estimation.
The dataset used can be found on the UCI Machine Learning Repository at the following location:
There are several copies of this dataset to be found on Kaggle, with people focusing on different types of analyses of the data. This specific copy can be analysed by anyone interested, but is primarily used by a study group from the Udacity Bertelsmann Technology Scholarship to practice analysis of association between variables as well as implementation and comparison of various Machine Learning models.
According to the paper by Detrano et al. (1989), as found on the UCI dataset webpage, the data were collected for 303 patients referred for coronary angiography at the Cleveland Clinic between May 1981 and September 1984. The 13 independent (feature) variables can be divided into 3 groups as follows:
Routine evaluation (based on historical data):
Non-invasive test data (informed consent obtained for data as part of research protocol):
Other demographic and clinical variables (based on routine data):
UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. Donor:
David W. Aha (aha '@' ics.uci.edu) (714) 856-8779
The objective of the analysis is to use statistical learning to identify factors associated with coronary artery disease, as indicated by a coronary angiography interpreted by a cardiologist (as per the paper by Detrano et al. cited above).
This is the Occupancy Detection Data Set from UCI, as used in the article how-to-predict-room-occupancy-based-on-environmental-factors. Its columns are:
"no","date","Temperature","Humidity","Light","CO2","HumidityRatio","Occupancy"
UC Irvine Machine Learning Repository
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Collection of two datasets from the UCI website that could be used for structure learning tasks. Includes datasets regarding
Size: Two datasets of sizes 9471 × 17 and 2458285 × 68, respectively
Number of features: 15-68
Ground truth: No
Type of Graph: No ground truth
More information about the datasets is contained in the dataset_description.html files.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Irvine.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Classifying wine varieties’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/brynja/wineuci on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Wine recognition dataset from UC Irvine. Great for testing out different classifiers
Labels: "name" - Number denoting a specific wine class
Number of instances of each wine class
Features:
"This data set is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines"
Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
@misc{Lichman:2013,
  author = "M. Lichman",
  year = "2013",
  title = "{UCI} Machine Learning Repository",
  url = "http://archive.ics.uci.edu/ml",
  institution = "University of California, Irvine, School of Information and Computer Sciences"
}
UC Irvine data base: "https://archive.ics.uci.edu/ml/machine-learning-databases/wine"
Sources:
(a) Forina, M. et al, PARVUS - An Extendible Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.
(b) Stefan Aeberhard, email: stefan@coral.cs.jcu.edu.au
(c) July 1991
Past Usage:
(1) S. Aeberhard, D. Coomans and O. de Vel, Comparison of Classifiers in High Dimensional Settings, Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland. (Also submitted to Technometrics).
The data was used with many others for comparing various classifiers. The classes are separable, though only RDA has achieved 100% correct classification. (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data)) (All results using the leave-one-out technique)
(2) S. Aeberhard, D. Coomans and O. de Vel, "THE CLASSIFICATION PERFORMANCE OF RDA" Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland. (Also submitted to Journal of Chemometrics).
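The past-usage notes above report 1NN accuracy on z-transformed data under the leave-one-out technique. As a rough illustration of that evaluation scheme, here is a minimal pure-Python sketch. The toy rows and labels are made up for illustration and are not taken from the wine data, and computing the z-transform once over all samples is an assumption about the procedure, not something the source confirms.

```python
import math


def z_transform(rows):
    """Scale every column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0  # guard constant columns
        for col, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def loo_1nn_accuracy(rows, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    scaled = z_transform(rows)  # assumption: scaling uses all samples at once
    correct = 0
    for i, query in enumerate(scaled):
        # Hold out sample i and find its nearest neighbour among the rest.
        nearest = min(
            (j for j in range(len(scaled)) if j != i),
            key=lambda j: math.dist(query, scaled[j]),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


# Hypothetical two-feature samples forming two well-separated classes.
toy_rows = [[1.0, 10.0], [1.2, 11.0], [0.9, 9.5], [5.0, 50.0], [5.2, 52.0], [4.8, 49.0]]
toy_labels = [1, 1, 1, 2, 2, 2]

accuracy = loo_1nn_accuracy(toy_rows, toy_labels)
print(f"leave-one-out 1NN accuracy: {accuracy:.3f}")
```

To apply the same function to the wine data, the 13 constituent columns would be passed as rows and the class column as labels.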
This data set is great for drawing comparisons between algorithms and testing out classification models when learning new techniques.
--- Original source retains full ownership of the source dataset ---
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This beginner level data set has 403 rows and 6 columns.
It is a real dataset on students' knowledge status in the subject of Electrical DC Machines.
This data set is recommended for learning and practicing your skills in exploratory data analysis, data visualization, and classification and clustering techniques.
Feel free to explore the data set with multiple supervised and unsupervised learning techniques. The following data dictionary gives more details on this data set:
|Column Position|Attribute Name|Definition|Data Type|Example|
| --- | --- | --- | --- | --- |
|1 |STG|The degree of study time for goal object materials |Quantitative |0.060, 0.100, 0.080 |
|2 |SCG|The degree of repetition number of user for goal object materials |Quantitative |0.000, 0.100, 0.250 |
|3 |STR|The degree of study time of user for related objects with goal object |Quantitative |0.10, 0.15, 0.05 |
|4 |LPR|The exam performance of user for related objects with goal object |Quantitative |0.98, 0.10, 0.01 |
|5 |PEG|The exam performance of user for goal objects |Quantitative |0.66, 0.56, 0.33 |
|6 |UNS|The knowledge level of user (Very Low, Low, Middle, High) |Categorical |"High", "Middle", "Low" |
This data set has been sourced from the Machine Learning Repository of the University of California, Irvine (UC Irvine): User Knowledge Modeling Data Set. The UCI page mentions the following publication as the original source of the data set: H. T. Kahraman, S. Sagiroglu, I. Colak, Developing intuitive knowledge classifier and modeling of users' domain dependent data in web, Knowledge Based Systems, vol. 37, pp. 283-295, 2013
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Three datasets from the UC Irvine (UCI) machine learning repository, namely the Australian, German, and Japanese credit datasets, were adopted for the current study. The Australian credit dataset contains 690 samples, of which 307 are positive and 383 are negative, and has 15 input features. The German credit dataset contains 1000 samples, of which 700 are positive and 300 are negative, and has 21 input features. The Japanese credit dataset contains 690 samples, of which 383 are positive and 307 are negative, and has 16 input features.
https://choosealicense.com/licenses/odbl/
Dataset Details
1. Dataset Loading:
Initially, we load the Drug Review Dataset from the UC Irvine Machine Learning Repository. This dataset contains patient reviews of different drugs, along with the medical condition being treated and the patients' satisfaction ratings.
2. Data Preprocessing:
The dataset is preprocessed to ensure data integrity and consistency. We handle missing values and ensure that each patient ID is unique across the dataset.
3. Text… See the full description on the dataset page: https://huggingface.co/datasets/Mouwiya/drug-reviews.
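The preprocessing step described above (handling missing values and enforcing unique patient IDs) could look roughly like the sketch below. The description does not say how missing values are handled, so dropping incomplete records is an assumption here, as is keeping the first record per ID. The field names (`patient_id`, `drug`, `condition`, `review`, `rating`) and all record values are hypothetical placeholders, not the dataset's actual schema.

```python
def preprocess(records, id_field="patient_id"):
    """Drop records with missing values, then keep the first record per ID.

    Both choices are assumptions; the source only states that missing
    values are handled and that patient IDs end up unique.
    """
    seen_ids = set()
    cleaned = []
    for record in records:
        has_missing = any(v is None or str(v).strip() == "" for v in record.values())
        if has_missing:
            continue
        if record[id_field] in seen_ids:
            continue
        seen_ids.add(record[id_field])
        cleaned.append(record)
    return cleaned


# Hypothetical placeholder records, for illustration only.
toy_records = [
    {"patient_id": 1, "drug": "drug_a", "condition": "condition_x", "review": "placeholder text", "rating": 9},
    {"patient_id": 2, "drug": "drug_b", "condition": "", "review": "placeholder text", "rating": 4},
    {"patient_id": 1, "drug": "drug_a", "condition": "condition_x", "review": "duplicate ID", "rating": 8},
    {"patient_id": 3, "drug": "drug_c", "condition": "condition_y", "review": "placeholder text", "rating": None},
    {"patient_id": 4, "drug": "drug_d", "condition": "condition_z", "review": "placeholder text", "rating": 7},
]

cleaned = preprocess(toy_records)
kept_ids = [record["patient_id"] for record in cleaned]
print(kept_ids)
```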
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Travel Review Rating Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/wirachleelakiatiwong/travel-review-rating-dataset on 30 September 2021.
--- Dataset description provided by original source is as follows ---
This data set has been sourced from the Machine Learning Repository of the University of California, Irvine (UC Irvine): Travel Review Ratings Data Set. This data set is populated by capturing user ratings from Google reviews. Reviews on attractions from 24 categories across Europe are considered. Google user ratings range from 1 to 5, and the average user rating per category is calculated.
Attribute 1: Unique user id
Attribute 2: Average ratings on churches
Attribute 3: Average ratings on resorts
Attribute 4: Average ratings on beaches
Attribute 5: Average ratings on parks
Attribute 6: Average ratings on theatres
Attribute 7: Average ratings on museums
Attribute 8: Average ratings on malls
Attribute 9: Average ratings on zoo
Attribute 10: Average ratings on restaurants
Attribute 11: Average ratings on pubs/bars
Attribute 12: Average ratings on local services
Attribute 13: Average ratings on burger/pizza shops
Attribute 14: Average ratings on hotels/other lodgings
Attribute 15: Average ratings on juice bars
Attribute 16: Average ratings on art galleries
Attribute 17: Average ratings on dance clubs
Attribute 18: Average ratings on swimming pools
Attribute 19: Average ratings on gyms
Attribute 20: Average ratings on bakeries
Attribute 21: Average ratings on beauty & spas
Attribute 22: Average ratings on cafes
Attribute 23: Average ratings on view points
Attribute 24: Average ratings on monuments
Attribute 25: Average ratings on gardens
This data set has been sourced from the Machine Learning Repository of the University of California, Irvine (UC Irvine): Travel Review Ratings Data Set.
The UCI page mentions the following publication as the original source of the data set: Renjith, Shini, A. Sreekumar, and M. Jathavedan. 2018. Evaluation of Partitioning Clustering Algorithms for Processing Social Media Data in Tourism Domain. In 2018 IEEE Recent Advances in Intelligent Computational Systems (RAICS), 12731. IEEE
I'm the kind of person who loves traveling. But sometimes I have problems such as: where should I visit? Are there any interesting places that match my lifestyle? I often spend hours searching for an interesting place to go. Such a waste of time.
What if we could build a recommender system that recommends several interesting venues based on your preferences? With information from Google reviews, I'll try to divide Google review users into clusters of similar interests, as groundwork for building a recommender system based on their preferences.
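The idea of dividing users into clusters of similar interests can be sketched with a small k-means routine. This is only one possible partitioning method and not necessarily the one used by the original author. The rating vectors below are hypothetical (three categories instead of the 24 in the data set), and seeding the centroids with the first k points is a simplification that keeps the run deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means; centroids are seeded with the first k points (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical average ratings per user for three categories.
toy_ratings = [
    [4.5, 1.0, 1.2],
    [1.1, 4.8, 4.5],
    [4.7, 1.3, 1.0],
    [1.0, 4.6, 4.9],
]

assignment, centroids = kmeans(toy_ratings, k=2)
print(assignment)
```

Users that share a cluster index have similar rating profiles, which is the grouping a recommender could then build on.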
--- Original source retains full ownership of the source dataset ---
The problem is to predict user ratings for web pages (within a subject category). The HTML source of a web page is given. Users looked at each web page and rated it on a 3-point scale (hot, medium, cold), covering 50-100 pages per domain.
This database contains the HTML source of web pages plus the ratings of a single user on these web pages. The web pages cover four separate subjects (Bands, i.e. recording artists; Goats; Sheep; and BioMedical).
Data originally from the UCI ML Repository. Donated by:
Michael Pazzani, Department of Information and Computer Science, University of California, Irvine, Irvine, CA 92697-3425, pazzani@ics.uci.edu
Concept based Information Access with Google for Personalized Information Retrieval
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this experiment, the datasets are from the UC Irvine (UCI) machine learning repository (Zięba et al., 2016) and contain real-world financial indicators of Polish manufacturing companies from 2007 to 2011. The datasets were separated into five parts, one per fiscal year, covering the period from the 1st year (2007 fiscal year) to the 5th year (2011 fiscal year), which correspond to five different bankruptcy cycles. The class labels (“0” is operating and “1” is bankruptcy) of the datasets are determined by the bankruptcy status of the enterprise in 2012. Furthermore, the real-world Creator dataset, published in 2019 by a Chinese intelligent government services provider called Creator Information Technology Co., Ltd, was also adopted. The Creator dataset includes company management information of 35960 Chinese companies.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Nine datasets from the UC Irvine (UCI) machine learning repository, i.e., the Australian, Japanese, German (Asuncion & Newman, 2007), Taiwan (Yeh & Lien, 2009), and Polish credit datasets (Zięba et al., 2016), were adopted for the current study. The Polish credit datasets comprise five datasets representing five classification cases that depend on the forecasting period (i.e., Polish 1, Polish 2, Polish 3, Polish 4, and Polish 5). Also listed are the AER credit dataset (Greene, 2003), which is a credit card dataset for econometric analysis, and the Creator dataset, which was published in 2019 by a Chinese digital government services provider named Creator Information Technology Co., Ltd[1]. The Creator dataset contains the property rights, financial statements, and basic company information of 35960 Chinese companies.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery
This collection consists of ten open access relations commonly used by the data management community. In addition to the relations themselves (please take note of the references to the original sources below), we added three lists in this collection that describe approximate functional dependencies found in the relations. These lists are the result of a manual annotation process performed by two independent individuals by consulting the respective schemas of the relations and identifying column combinations where one column implies another based on its semantics. As an example, in the claims.csv file, the AirportCode implies AirportName, as each code should be unique for a given airport.
The file ground_truth.csv is a comma-separated file containing approximate functional dependencies. The column table names the relation being referred to; lhs and rhs reference two columns of that relation for which we found that, semantically, lhs implies rhs.
The files excluded_candidates.csv and included_candidates.csv list all column combinations that were excluded from or included in the manual annotation, respectively. We excluded a candidate if there was no tuple where both attributes had a value or if the g3_prime value was too small.
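To make the notion of an approximate functional dependency concrete, the sketch below computes the classic g3 error of a candidate lhs -> rhs: the smallest fraction of tuples that would have to be removed for the dependency to hold exactly. Note that this is the standard g3 measure, not the g3_prime variant mentioned above, whose exact definition is not given here. The airport codes and names are made-up placeholders, not rows from claims.csv.

```python
from collections import Counter, defaultdict


def g3_error(rows, lhs, rhs):
    """Fraction of tuples to remove so that lhs -> rhs holds exactly (classic g3).

    Tuples where either attribute is missing are skipped. Returns None when
    no tuple has a value for both attributes.
    """
    groups = defaultdict(Counter)
    total = 0
    for row in rows:
        if row.get(lhs) in (None, "") or row.get(rhs) in (None, ""):
            continue
        groups[row[lhs]][row[rhs]] += 1
        total += 1
    if total == 0:
        return None
    # For each lhs value, keep only the tuples carrying its most frequent rhs value.
    kept = sum(max(counts.values()) for counts in groups.values())
    return (total - kept) / total


# Hypothetical tuples: one inconsistent spelling breaks the exact dependency.
toy_claims = [
    {"AirportCode": "AAA", "AirportName": "Alpha Airport"},
    {"AirportCode": "AAA", "AirportName": "Alpha Airport"},
    {"AirportCode": "AAA", "AirportName": "Alpha Arpt"},
    {"AirportCode": "BBB", "AirportName": "Beta Airport"},
]

error = g3_error(toy_claims, "AirportCode", "AirportName")
print(error)
```

An error of 0 means the dependency holds exactly; small positive values indicate a dependency that holds approximately.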
Dataset References
adult.csv: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
claims.csv: TSA Claims Data 2002 to 2006, published by the U.S. Department of Homeland Security.
dblp10k.csv: Frequency-aware Similarity Measures. Lange, Dustin; Naumann, Felix (2011). 243–248. Made available as DBLP Dataset 2.
hospital.csv: Hospital dataset used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part of the dataset collection to that paper.
t_biocase_... files: t_bioc_... files used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part of the dataset collection to that paper.
tax.csv: Tax dataset used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part of the dataset collection to that paper.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets include: MVB waveform feature datasets; 15 datasets from the University of California, Irvine (UCI) machine learning repository; and 2 synthetic datasets.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ChinaZJB dataset consists of 1,329 valid samples of SMEs after merging the non-financial behavioral information and soft information on credit rating with the financial information, loan information, and non-financial basic information found in the annual loan ledger data. Among them, 108 SMEs have default records, while 1,221 SMEs have no default records, resulting in an imbalance ratio of approximately 1:11. Five datasets from the UC Irvine (UCI) machine-learning repository, that is, the Polish 1, Polish 2, Polish 3, Australian, and Taiwan credit datasets, were used for robustness checks.
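The imbalance ratio quoted above follows directly from the two class counts. The short check below reproduces it using only the figures given in the description; the variable names are illustrative.

```python
# Class counts quoted in the description above.
defaults = 108
non_defaults = 1221

total = defaults + non_defaults          # should match the 1,329 valid samples
ratio = non_defaults / defaults          # non-default samples per default sample
default_rate = defaults / total          # share of the minority class

print(f"total={total}, ratio=1:{ratio:.1f}, default rate={default_rate:.1%}")
```

The counts sum to the stated 1,329 samples, and the ratio of about 1:11.3 is consistent with the "approximately 1:11" figure.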
The Orange County Survey - a collaborative effort of the Public Policy Institute of California and the School of Social Ecology at the University of California, Irvine - is a special edition of the PPIC Statewide Survey. This is the first of an annual series of PPIC surveys of Orange County. The purpose of this study is to inform policymakers by providing timely, accurate, and objective information about policy preferences and economic, social, and political trends. The sample size is 2,004 Orange County adult residents.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
biology