License: CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
The diamond is 58 times harder than any other mineral in the world, and its elegance as a jewel has long been appreciated. Forecasting diamond prices is challenging due to nonlinearity in important features such as carat, cut, clarity, table, and depth. Against this backdrop, this study conducted a comparative analysis of the performance of multiple supervised machine learning models (regressors and classifiers) in predicting diamond prices. Eight supervised machine learning algorithms were evaluated: Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron. The analysis covers data preprocessing, exploratory data analysis (EDA), training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metrics, eXtreme Gradient Boosting was the best-performing algorithm in both classification and regression, with an R² score of 97.45% and an accuracy of 74.28%. As a result, eXtreme Gradient Boosting was recommended as the optimal regressor and classifier for forecasting the price of a diamond specimen.

Methods

Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish datasets, study them, and construct models in a web-based data-science environment.
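As a rough illustration of the regression setup described above (a sketch, not the study's actual pipeline), one can fit an XGBoost regressor to the standard Kaggle diamonds CSV and report an R² score; the file name, encoding choices, and hyperparameters here are assumptions.

```python
# Sketch: XGBoost regression on the Kaggle diamonds dataset.
# Assumes a local diamonds.csv with the usual columns (carat, cut, color,
# clarity, depth, table, price, x, y, z); hyperparameters are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

df = pd.read_csv("diamonds.csv")
for col in ["cut", "color", "clarity"]:
    df[col] = df[col].astype("category").cat.codes  # ordinal integer encoding

X, y = df.drop(columns=["price"]), df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=42)
model.fit(X_train, y_train)
print("R2 on held-out data:", r2_score(y_test, model.predict(X_test)))
```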
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Advances in neuroimaging, genomics, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High-dimensional data with a small number of samples are of critical importance for identifying biomarkers and conducting feasibility and pilot work, but they can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1,000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contributions to bias of data dimensionality, hyper-parameter space and number of CV folds were explored, and validation methods were compared on discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies depending on which validation method was used.
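To make the validation argument concrete, here is a minimal scikit-learn sketch (not the paper's code) of nested CV with feature selection kept inside the pipeline, so neither selection nor tuning ever sees the outer test folds; on pure-noise data the resulting estimate should hover near chance (0.5).

```python
# Nested CV on pure-noise, high-dimensional, small-sample data.
# Feature selection lives inside the pipeline, so it is re-fit on each
# training fold rather than on pooled train+test data (the leakage above).
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 1000))   # 40 samples, 1000 features
y = rng.integers(0, 2, size=40)   # random labels: no real signal

pipe = Pipeline([("select", SelectKBest(f_classif, k=10)), ("clf", SVC())])
inner = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=3)  # tuning loop
scores = cross_val_score(inner, X, y, cv=5)                 # evaluation loop
print(f"Nested CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```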
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
Despite the ongoing success of populist parties in many parts of the world, we lack comprehensive information about parties' level of populism over time. A recent contribution to Political Analysis by Di Cocco and Monechi (DCM) suggests that this research gap can be closed by predicting parties' populism scores from their election manifestos using supervised machine-learning. In this paper, we provide a detailed discussion of the suggested approach. Building on recent debates about the validation of machine-learning models, we argue that the validity checks provided in DCM's paper are insufficient. We conduct a series of additional validity checks and empirically demonstrate that the approach is not suitable for deriving populism scores from texts. We conclude that measuring populism over time and between countries remains an immense challenge for empirical research. More generally, our paper illustrates the importance of more comprehensive validations of supervised machine-learning models.
One of the primary challenges inherent in utilizing deep learning models is the scarcity and accessibility hurdles associated with acquiring datasets of sufficient size to facilitate effective training of these networks. This is particularly significant in object detection, shape completion, and fracture assembly. Instead of scanning a large number of real-world fragments, it is possible to generate massive datasets with synthetic pieces. However, realistic fragmentation is computationally intensive in both preparation (e.g., pre-fractured models) and generation. Simpler algorithms such as Voronoi diagrams provide faster processing at the expense of realism. Hence, computational efficiency and realism must be balanced when generating large datasets for machine learning.
We proposed a GPU-based fragmentation method to improve the baseline Discrete Voronoi Chain aimed at completing this dataset generation task. The dataset in this repository includes voxelized fragments from high-resolution 3D models, curated to be used as training sets for machine learning models. More specifically, these models come from an archaeological dataset, which led to more than 1M fragments from 1,052 Iberian vessels. In this dataset, fragments are not stored individually; instead, the fragmented voxelizations are provided in a compressed binary file (.rle.zip). Once uncompressed, each fragment is represented by a different number in the grid. The class to which each vessel belongs is also included in class.csv. The GPU-based pipeline that generated this dataset is explained at https://doi.org/10.1016/j.cag.2024.104104.
Please note that this dataset originally provided voxel data, point clouds, and triangle meshes. However, we opted to include only voxel data because (1) the original dataset is too large to be uploaded to Zenodo and (2) the original intent of our paper is to generate implicit data in the form of voxels. If interested in the whole dataset (450GB), please visit the web page of our research institute.
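For illustration only, a decoder for the compressed fragment grids might look like the sketch below. The actual binary layout of the .rle.zip files is not specified here, so the (value, run-length) uint32 pair format, the grid shape, and the file name are all assumptions; consult the dataset documentation before use.

```python
# Hypothetical decoder for a fragmented voxel grid stored as .rle.zip.
# ASSUMPTION: the stream is a flat sequence of (value, run_length) uint32
# pairs; the real format may differ.
import zipfile
import numpy as np

def load_fragment_grid(path, shape, dtype=np.uint32):
    with zipfile.ZipFile(path) as zf:
        raw = zf.read(zf.namelist()[0])
    pairs = np.frombuffer(raw, dtype=dtype).reshape(-1, 2)
    grid = np.repeat(pairs[:, 0], pairs[:, 1]).reshape(shape)
    return grid  # each fragment is a distinct integer label in the grid

grid = load_fragment_grid("vessel_0001.rle.zip", shape=(256, 256, 256))
print("Fragment labels:", np.unique(grid)[1:])  # 0 assumed to be empty space
```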
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
The dataset has 21 columns carrying the features (questions) of 988 respondents. The efficiency of any machine learning model depends heavily on its raw initial dataset, so we had to be extra careful in gathering our information. For our particular problem, we needed data that was not only authentic but also versatile enough to capture the proper information from relevant sources. Hence we opted to build our dataset by dispatching a survey questionnaire among targeted audiences. First, we built the questionnaire with inquiries formulated after keen observation. Studying the behavior of our intended audience, we came up with factual and informative queries that generated appropriate data. Our prime audience was people who frequently buy fashion accessories, so we created a set of questionnaires emphasizing questions related to that field. We had a total of twenty-one well-revised questions that gave us an overview of all the answers our system would need. In this way, we gathered 988 authentic responses and concluded our initial raw dataset accordingly.
Deep Learning (DL) has consistently surpassed other Machine Learning methods and achieved state-of-the-art performance in multiple cases. Several modern applications like financial and recommender systems require models that are constantly updated with fresh data. The prominent approach for keeping a DL model fresh is to trigger full retraining from scratch when enough new data are available. However, retraining large and complex DL models is time-consuming and compute-intensive. This makes full retraining costly, wasteful, and slow. In this paper, we present an approach to continuously train and deploy DL models. First, we enable continuous training through proactive training that combines samples of historical data with new streaming data. Second, we enable continuous deployment through gradient sparsification that allows us to send a small percentage of the model updates per training iteration. Our experimental results with LeNet5 on MNIST and modern DL models on CIFAR-10 show that proactive training keeps models fresh with comparable—if not superior—performance to full retraining at a fraction of the time. Combined with gradient sparsification, sparse proactive training enables very fast updates of a deployed model with arbitrarily large sparsity, reducing communication per iteration up to four orders of magnitude, with minimal—if any—losses in model quality. Sparse training, however, comes at a price; it incurs overhead on the training that depends on the size of the model and increases the training time by factors ranging from 1.25 to 3 in our experiments. Arguably, this is a small price to pay for successfully enabling the continuous training and deployment of large DL models.
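The gradient-sparsification idea can be sketched in a few lines of PyTorch: after the backward pass, keep only the top-k entries of each gradient by magnitude, so that only a small fraction of the update needs to be communicated. This is a simplified sketch; the paper's exact scheme (e.g., any error accumulation for the dropped entries) may differ.

```python
import torch

@torch.no_grad()
def sparsify_gradients(model, keep_ratio=0.01):
    """Zero all but the largest keep_ratio fraction of each gradient."""
    for p in model.parameters():
        if p.grad is None:
            continue
        g = p.grad.view(-1)
        k = max(1, int(keep_ratio * g.numel()))
        threshold = g.abs().topk(k).values.min()
        g[g.abs() < threshold] = 0.0  # only kept entries need be transmitted

# Usage per iteration: loss.backward(); sparsify_gradients(model); optimizer.step()
```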
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
The dataset used in this study is publicly available for research purposes. If you are using this dataset, please cite the following paper, which outlines the complete details of the dataset and the methodology used for its generation:
Amit Karamchandani, Javier Núñez, Luis de-la-Cal, Yenny Moreno, Alberto Mozo, Antonio Pastor, "On the Applicability of Network Digital Twins in Generating Synthetic Data for Heavy Hitter Discrimination," under submission.
This is a synthetic dataset generated to differentiate between benign and malicious heavy hitter (HH) flows within complex network environments. Heavy hitter flows, which include high-volume data transfers, can significantly impact network performance, leading to congestion and degraded quality of service. Distinguishing legitimate heavy hitter activity from malicious Distributed Denial-of-Service traffic is critical for network management and security, yet existing datasets lack the granularity needed to train machine learning models to make this distinction effectively.
To address this, a Network Digital Twin (NDT) approach was utilized to emulate realistic network conditions and traffic patterns, enabling automated generation of labeled data for both benign and malicious HH flows alongside regular traffic.
The feature set includes flow statistics commonly used in network analysis.
License: etalab 2.0 (https://spdx.org/licenses/etalab-2.0.html)
Determining the association constant between a cyclodextrin and a guest molecule is an important task for applications in various industrial and academic fields. However, such a task is time-consuming, tedious, and requires samples of both molecules. A significant number of association constants and relevant data are available from the literature, which makes it possible to use machine learning techniques to predict association constants. However, such data are mainly available as tables in articles or appendices; they must be made available in a computer-friendly format and curated. Furthermore, the raw data need to be enriched with physicochemical information about each molecule, and when such information does not suffice to discriminate between molecules, some additional data is needed. We present a dataset built from data gathered from the literature.
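As one hedged example of such enrichment (the dataset authors' actual tooling and descriptor set are not specified here), RDKit can compute basic physicochemical descriptors from a SMILES string:

```python
# Enrich a guest molecule with simple physicochemical descriptors.
# The descriptor choice is illustrative, not the curated dataset's exact set.
from rdkit import Chem
from rdkit.Chem import Descriptors

def physchem_descriptors(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return {
        "mol_weight": Descriptors.MolWt(mol),
        "logp": Descriptors.MolLogP(mol),
        "h_bond_donors": Descriptors.NumHDonors(mol),
        "h_bond_acceptors": Descriptors.NumHAcceptors(mol),
    }

print(physchem_descriptors("c1ccccc1O"))  # phenol, a typical small guest
```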
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Correlation of class sample size in the training set with classification performance.
We assess which waters the Clean Water Act protects and how Supreme Court and White House rules change this regulation. We train a deep learning model using aerial imagery and geophysical data to predict 150,000 jurisdictional determinations from the Army Corps of Engineers, each deciding regulation for one water resource. Under a 2006 Supreme Court ruling, the Clean Water Act protects two-thirds of US streams and over half of wetlands; under a 2020 White House rule, it protects under half of streams and a fourth of wetlands, implying deregulation of 690,000 stream miles, 35 million wetland acres, and 30% of waters around drinking water sources. Our framework can support permitting, policy design, and use of machine learning in regulatory implementation problems.

Training data from: Machine learning predicts which rivers, streams, and wetlands the Clean Water Act regulates
This dataset contains data used to train the models in Greenhill et al. (2023). All data are publicly available and can be accessed either through Google Earth Engine or directly from the data providers, as described in Table S3 of the Supplementary Material. In addition, we are providing access to the full set of pre-processed inputs for model training via this repository. We are also providing access to a subset of the data used for prediction, as well as all data needed for reproducing the results of the paper, in another Dryad repository: . All code written for the project is available at .
The files here include:
wotus_model.pth.tar
resource_type_model.pth.tar
cowardin_code_model.pth.tar
ajd_model.pth.tar
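The checkpoints can presumably be opened with PyTorch; the sketch below only inspects one. The key layout (e.g., a 'state_dict' entry) follows common PyTorch conventions and is an assumption, not documentation of these specific files.

```python
# Peek inside one of the provided model checkpoints.
import torch

ckpt = torch.load("wotus_model.pth.tar", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # often 'state_dict', 'epoch', 'optimizer', ...
```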
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
A Flexible Machine Learning-Aware Architecture for Future WLANs
Authors: Francesc Wilhelmi, Sergio Barrachina-Muñoz, Boris Bellalta, Cristina Cano, Anders Jonsson & Vishnu Ram.
Abstract: Lots of hopes have been placed in Machine Learning (ML) as a key enabler of future wireless networks. By taking advantage of the large volumes of data generated by networks, ML is expected to deal with the ever-increasing complexity of networking problems. Unfortunately, current networking systems are not yet prepared to support the ensuing requirements of ML-based applications, especially for enabling procedures related to data collection, processing, and output distribution. This article points out the architectural requirements needed to pervasively include ML as part of future wireless network operation. To this aim, we propose to adopt the International Telecommunications Union (ITU) unified architecture for 5G and beyond. Specifically, we look into Wireless Local Area Networks (WLANs), which, due to their nature, can be found in multiple forms, ranging from cloud-based to edge-computing-like deployments. Based on ITU's architecture, we provide insights on the main requirements and the major challenges of introducing ML to the multiple modalities of WLANs.
Dataset description: This is the dataset generated for training a Neural Network (NN) in the Access Point (AP) (re)association problem in IEEE 802.11 Wireless Local Area Networks (WLANs).
In particular, the NN is meant to output a prediction function of the throughput that a given station (STA) can obtain from a given Access Point (AP) after association. The features included in the dataset are:
Identifier of the AP to which the STA has been associated.
RSSI obtained from the AP to which the STA has been associated.
Data rate in bits per second (bps) that the STA is allowed to use for the selected AP.
Load in packets per second (pkt/s) that the STA generates.
Percentage of data that the AP is able to serve before the user association is done.
Amount of traffic load in pkt/s handled by the AP before the user association is done.
Airtime in % that the AP enjoys before the user association is done.
Throughput in pkt/s that the STA receives after the user association is done.
The dataset has been generated through random simulations, based on the model provided in https://github.com/toniadame/WiFi_AP_Selection_Framework. More details regarding the dataset generation have been provided in https://github.com/fwilhelmi/machine_learning_aware_architecture_wlans.
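A minimal sketch of the learning task, assuming the CSV layout follows the eight features above (the file name and column names here are placeholders to be mapped to the published headers):

```python
# Train a small neural network to predict post-association throughput
# from the seven input features; "throughput" is the target column.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("wlan_dataset.csv")  # hypothetical file name
X, y = df.drop(columns=["throughput"]), df["throughput"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0),
)
model.fit(X_train, y_train)
print("Test R2:", model.score(X_test, y_test))
```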
Community science image libraries offer a massive, but largely untapped, source of observational data for phenological research. The iNaturalist platform offers a particularly rich archive, containing more than 49 million verifiable, georeferenced, open access images, encompassing seven continents and over 278,000 species. A critical limitation preventing scientists from taking full advantage of this rich data source is labor. Each image must be manually inspected and categorized by phenophase, which is both time-intensive and costly. Consequently, researchers may only be able to use a subset of the total number of images available in the database. While iNaturalist has the potential to yield enough data for high-resolution and spatially extensive studies, it requires more efficient tools for phenological data extraction. A promising solution is automation of the image annotation process using deep learning. Recent innovations in deep learning have made these open-source tools accessible...
Embeddings and raw files to complement the paper "General Chemically Intuitive Atom-Level DFT Descriptors for Machine Learning Approaches to Reaction Condition Prediction". The embeddings should be all the data needed for full reproducibility of the published results. The GitHub repo GeneralDFT (https://github.com/moleculebits/GeneralDFT) contains the Python scripts required to make use of the data, along with some basic plotting functionality.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This repository introduces RadIOCD (Radar-based Interior Object Classification Dataset), which contains sparse point cloud representations of interior objects, collected by subjects wearing a commercial off-the-shelf mmWave radar. RadIOCD includes recordings of 10 volunteers, aged between 25 and 50 years old. A total of 5 objects were recorded, with the participants moving towards them in 2 different environments. RadIOCD includes sparse 3D point cloud data, together with doppler velocity provided by the mmWave radar. The files are stored in CSV format to ensure reuse.
The scope of RadIOCD is to make data available for the recognition of objects recorded solely by the mmWave radar, to be used in applications where vision-based classification is not robust (e.g., in search and rescue operations where there is smoke inside a building). Furthermore, we showcase that this dataset contains enough data to apply machine learning techniques, and that models trained on it can generalize to different environments and "unseen" subjects.
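As a hedged starting point (the exact CSV columns and labeling scheme are assumptions, not dataset documentation), one could summarize each recording into simple statistics and fit a baseline classifier:

```python
# Baseline object classification from per-recording point cloud summaries.
# ASSUMPTIONS: columns x, y, z, doppler; object label encoded in file name.
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def summarize(csv_path):
    pc = pd.read_csv(csv_path)
    return np.concatenate([pc.mean().to_numpy(), pc.std().to_numpy()])

files = sorted(Path("radiocd").glob("*.csv"))
X = np.stack([summarize(f) for f in files])
y = [f.stem.split("_")[0] for f in files]  # hypothetical label convention

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# For a fair test of the generalization claim, evaluate on held-out
# subjects/environments rather than on the training recordings.
print("Training accuracy:", clf.score(X, y))
```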
License: custom license (https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.11588/DATA/D3WZID)
Automatic damage assessment by analysing UAV-derived 3D point clouds provides fast information on the damage situation after an earthquake. However, the assessment of different damage grades is challenging given the variety in damage characteristics and limited transferability of methods to other geographic regions or data sources. We present a novel change-based approach to automatically assess multi-class building damage from real-world point clouds using a machine learning model trained on virtual laser scanning (VLS) data. Therein, we (1) identify object-specific point cloud-based change features, (2) extract changed building parts using k-means clustering, (3) train a random forest machine learning model with VLS data based on object-specific change features, and (4) use the classifier to assess building damage in real-world photogrammetric point clouds. We evaluate the classifier with respect to its capacity to classify three damage grades (heavy, extreme, destruction) in pre-event and post-event point clouds of an earthquake in L'Aquila (Italy). Using object-specific change features derived from bi-temporal point clouds, our approach is transferable with respect to multi-source input point clouds used for model training (VLS) and application (real-world photogrammetry). We further achieve geographic transferability by using simulated training data which characterises damage grades across different geographic regions. The model yields high multi-target classification accuracies (overall accuracy: 92.0%–95.1%). Classification performance improves only slightly when using real-world region-specific training data (3% higher overall accuracies). We consider our approach especially relevant for applications where timely information on the damage situation is required and sufficient real-world training data is not available.

This dataset includes: 3D building models (building_models.zip) representing the target damage grades (no damage, heavy damage, extreme damage, destruction) of this study, and Python source code (code.zip) used in this study to (1) generate simulated multi-temporal 3D point clouds using HELIOS++ (https://github.com/3dgeo-heidelberg/helios), (2) extract damaged building parts using k-means clustering, (3) compute object-specific geometric change features per building, and (4) train a multi-target random forest classifier to classify buildings into four damage grades based on object-specific change features.
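An illustrative sketch of step (4), distinct from the released code.zip: train the random forest on per-building change features. The file name and feature columns are placeholders.

```python
# Classify buildings into four damage grades from object-specific
# change features, using simulated (VLS) training data.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("vls_change_features.csv")  # hypothetical export
X = train.drop(columns=["damage_grade"])
y = train["damage_grade"]  # no damage / heavy / extreme / destruction

clf = RandomForestClassifier(n_estimators=500, random_state=0)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```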
Dataset Card for ArPod
Dataset Summary
[More Information Needed]
Supported Tasks and Leaderboards
[More Information Needed]
Languages
[More Information Needed]
Dataset Structure
Data Instances
[More Information Needed]
Data Fields
[More Information Needed]
Data Splits
[More Information Needed]
Dataset Creation
Curation Rationale
[More Information Needed]
Source Data… See the full description on the dataset page: https://huggingface.co/datasets/arbml/ArPod.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Abstract

The dataset was collected for 332 compute nodes throughout May 19 - 23, 2023. May 19 - 22 characterizes normal compute cluster behavior, while May 23 includes an anomalous event. The dataset includes eight CPU, 11 disk, 47 memory, and 22 Slurm metrics. It represents five distinct hardware configurations and contains over one million records, totaling more than 180GB of raw data.

Background

Motivated by the goal of developing a digital twin of a compute cluster, the dataset was collected using a Prometheus server (1) scraping the Thomas Jefferson National Accelerator Facility (JLab) batch cluster, which runs an assortment of physics analysis and simulation jobs; analysis workloads leverage data generated from the laboratory's electron accelerator, and simulation workloads generate large amounts of flat data that is then carved to verify amplitudes. Metrics were scraped from the cluster throughout May 19 - 23, 2023. Data from May 19 to May 22 primarily reflected normal system behavior, while May 23, 2023, recorded a notable anomaly. This anomaly was severe enough to necessitate intervention by JLab IT Operations staff. The metrics were collected from CPU, disk, memory, and Slurm. Metrics related to CPU, disk, and memory provide insights into the status of individual compute nodes. Furthermore, Slurm metrics collected from the network can detect anomalies that may propagate to compute nodes executing the same job.

Usage Notes

While the data from May 19 - 22 characterizes normal compute cluster behavior and May 23 includes anomalous observations, the dataset cannot be considered labeled data: the set of affected nodes and the exact start and end times of the abnormal effects are unclear. Thus, the dataset could be used to develop unsupervised machine-learning algorithms to detect anomalous events in a batch cluster. https://doi.org/10.48550/arXiv.2311.16129
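For example, a simple unsupervised baseline over the node metrics could look like the following sketch (the CSV layout, one row per node and timestamp with numeric metric columns, is an assumption about how one might arrange the scraped data):

```python
# Unsupervised anomaly scoring of compute-node metrics.
import pandas as pd
from sklearn.ensemble import IsolationForest

metrics = pd.read_csv("node_metrics.csv", parse_dates=["timestamp"])
features = metrics.select_dtypes("number")

iso = IsolationForest(contamination=0.01, random_state=0).fit(features)
metrics["anomaly_score"] = iso.decision_function(features)  # lower = more anomalous
print(metrics.nsmallest(10, "anomaly_score")[["timestamp", "anomaly_score"]])
```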
NOTE: The manuscript associated with this data package is currently in review. The data may be revised based on reviewer feedback. Upon manuscript acceptance, this data package will be updated with the final dataset and additional metadata.

This data package is associated with the manuscript "Artificial intelligence-guided iterations between observations and modeling significantly improve environmental predictions" (Malhotra et al., in prep). This effort was designed following ICON (integrated, coordinated, open, and networked) principles to facilitate a model-experiment (ModEx) iteration approach, leveraging crowdsourced sampling across the contiguous United States (CONUS). New machine learning models were created every month to guide sampling locations. Data from the resulting samples were used to test and rebuild the machine learning models for the next round of sampling guidance. Associated sediment and water geochemistry and in situ sensor data can be found at https://data.ess-dive.lbl.gov/datasets/doi:10.15485/1923689, https://data.ess-dive.lbl.gov/datasets/doi:10.15485/1729719, and https://data.ess-dive.lbl.gov/datasets/doi:10.15485/1603775. This data package is associated with two GitHub repositories found at https://github.com/parallelworks/dynamic-learning-rivers and https://github.com/WHONDRS-Hub/ICON-ModEx_Open_Manuscript. In addition to this readme, this data package also includes two file-level metadata (FLMD) files that describe each file and two data dictionaries (DD) that describe all column/row headers and variable definitions.

This data package consists of two main folders, (1) dynamic-learning-rivers and (2) ICON-ModEx_Open_Manuscript, which contain snapshots of the associated GitHub repositories. The input data, output data, and machine learning models used to guide sampling locations are within dynamic-learning-rivers. The folder is organized into five top-level directories: (1) “input_data” holds the training data for the ML models; (2) “ml_models” holds machine learning (ML) models trained on the data in “input_data”; (3) “examples” contains files for direct experimentation with the machine learning model, including scripts for setting up a “hindcast” run; (4) “scripts” contains data preprocessing and postprocessing scripts and intermediate results specific to this data set that bookend the ML workflow; and (5) “output_data” holds the overall results of the ML model on that branch. Each trained ML model resides on its own branch in the repository; this means that inputs and outputs can differ branch-to-branch. There is also one hidden directory, “.github/workflows”, which contains information for running the ML workflow as an end-to-end automated GitHub Action, but it is not needed for reusing the ML models archived here. Please see the top-level README.md in the GitHub repository for more details on the automation.

The scripts and data used to create figures in the manuscript are within ICON-ModEx_Open_Manuscript. The folder is organized into four folders which contain the scripts, data, and PDF for each figure. Within the “fig-model-score-evolution” folder, there is a folder called “intermediate_branch_data” which contains some intermediate files pulled from dynamic-learning-rivers and reorganized to integrate easily into the workflows. NOTE: THIS FOLDER INCLUDES THE FILES AT THE POINT OF PAPER SUBMISSION. IT WILL BE UPDATED ONCE THE PAPER IS ACCEPTED WITH ANY REVISIONS AND WILL INCLUDE A DD/FLMD AT THAT POINT.
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
The data set represents movies released from xxx up to 2017. It is kept quite general and does not have any real problem / challenge as a background. The whole data set is meant for practicing different types of techniques as a data analyst / data scientist.
I'd also like to mention that the dataset is not fully cleaned. The reasoning is that it should demonstrate the real life of being an analyst / scientist: Get Data - Prep Data - Analyse Data - Visualize Data - Predict Outcomes of different Use Cases ;-)
I love watching movies and therefore tried to combine this hobby with my current self-studies of becoming a data scientist. I needed a way to obtain a data set that included information on movies so that I could play around and use my learnings. At first glance I could see that the data set can be used for regression, classification, or potentially even deep learning (such as image recognition - poster URLs are given).
I acquired this dataset in several steps. First I checked the internet for a specific API which I could use to receive movie information. After a short time I got to know omdbapi.com. With the help of this API I was able to fetch information based on the title of a movie.
Then I had another problem: I was missing movie titles. The next search began. I couldn't find an API for that, but I did see that Wikipedia was quite well structured in regards to movie titles. So I built a scraper to fetch all movie titles from 1990 to 2017.
After receiving all the data I could finally obtain the full information for a movie given its title + year (there might be movies which have the same name). Unfortunately some movie titles were written differently, so I had a failure rate of 10% when obtaining the movie data. Based on the 10% failed movie titles, I did a text analysis and found around 400,000 new movies / series. The latest version should include nearly 200,000 different movies based on the imdbID.
Additionally, I cleaned some of the information such as Genre, Actors and Writer for better analysis. Each of the CSV files can be joined by the imdbID. Be aware that some information is missing and declared as _NOT_GIVEN.
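For instance (file names here are illustrative), the per-topic CSVs can be joined on imdbID, treating the _NOT_GIVEN placeholder as missing:

```python
# Join two of the dataset's CSV files on imdbID and inspect missingness.
import pandas as pd

movies = pd.read_csv("movies.csv", na_values=["_NOT_GIVEN"])
genres = pd.read_csv("genres.csv", na_values=["_NOT_GIVEN"])
merged = movies.merge(genres, on="imdbID", how="left")
print(merged.isna().mean().sort_values(ascending=False).head())
```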
The inspiration for this data set came from getting into the practical flow of developing an image recognition application: recognizing the genre of a movie from its poster. On request I could also provide the movie poster images. But for the given dataset, I do have a few guiding questions in mind.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Dataset created for machine learning and deep learning training and teaching purposes.
Can for instance be used for classification, regression, and forecasting tasks.
Complex enough to demonstrate realistic issues such as overfitting and unbalanced data, while still remaining intuitively accessible.
ORIGINAL DATA TAKEN FROM:
EUROPEAN CLIMATE ASSESSMENT & DATASET (ECA&D), file created on 22-04-2021
THESE DATA CAN BE USED FREELY PROVIDED THAT THE FOLLOWING SOURCE IS ACKNOWLEDGED:
Klein Tank, A.M.G. and Coauthors, 2002. Daily dataset of 20th-century surface
air temperature and precipitation series for the European Climate Assessment.
Int. J. of Climatol., 22, 1441-1453.
Data and metadata available at http://www.ecad.eu
For more information see metadata.txt file.
The Python code used to create the weather prediction dataset from the ECA&D data can be found on GitHub: https://github.com/florian-huber/weather_prediction_dataset
(this repository also contains Jupyter notebooks with teaching examples)