License: ODC Public Domain Dedication and Licence (PDDL) v1.0 (http://www.opendatacommons.org/licenses/pddl/1.0/)
License information was derived automatically
This is an introduction to machine learning basics for beginners. Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. Here are some key concepts and terms to help you get started:
Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.
Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.
Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.
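As a concrete illustration of the split, here is a minimal, hedged sketch in Python, assuming scikit-learn is installed; the Iris dataset is only a stand-in for your own data:

```python
# Minimal sketch of a train/test split, assuming scikit-learn is available.
# The Iris dataset stands in for your own features and labels.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 25% of the examples as test data to estimate generalization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                            # learn from training data only
print("test accuracy:", model.score(X_test, y_test))   # evaluate on unseen data
```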
Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).
Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of correctly predicted examples), precision (the proportion of true positives among all positive predictions), recall (the proportion of true positives predicted correctly), and F1 score (a combination of precision and recall).
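The sketch below computes these metrics with scikit-learn (assumed installed) on a toy set of binary predictions; the labels are illustrative only:

```python
# Minimal sketch of common evaluation metrics, assuming scikit-learn is available.
# y_true and y_pred are toy binary labels (1 = spam, 0 = not spam).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))   # fraction predicted correctly
print("precision:", precision_score(y_true, y_pred))  # true positives / predicted positives
print("recall   :", recall_score(y_true, y_pred))     # true positives / actual positives
print("F1 score :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```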
Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.
Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.
Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.
Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases.
Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.
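A minimal, hedged sketch of two of these techniques is shown below (scikit-learn assumed; the Iris features are used without their labels, purely as example unlabeled data):

```python
# Minimal sketch of k-means clustering and PCA, assuming scikit-learn is available.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)          # ignore the labels: treat the data as unlabeled

# k-means: group the examples into 3 clusters.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# PCA: reduce the 4 original features to 2 principal components.
X_2d = PCA(n_components=2).fit_transform(X)

print("cluster sizes:", [int((clusters == c).sum()) for c in range(3)])
print("reduced shape:", X_2d.shape)
```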
These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.
License: GNU Lesser General Public License v3.0 (http://www.gnu.org/licenses/lgpl-3.0.html)
This dataset provides information about vibration levels, torque, process temperature, and faults.
The dataset is a spreadsheet containing information about engine performance. The spreadsheet has the following variables:

- UDI: This is likely a unique identifier for each engine.
- Product ID: This could be a specific code or identifier for the engine model.
- Type: This indicates the type of engine, possibly categorized by fuel type (e.g., M - motor, L - liquid).
- Air temperature [K]: This is the air temperature in Kelvin around the engine.
- Process temperature [K]: This is the internal temperature of the engine during operation, measured in Kelvin.
- Speed (rpm): This is the rotational speed of the engine in revolutions per minute.
- Torque (Nm): This is the twisting force exerted by the engine, measured in Newton meters.
- Vibration Levels: This could be a measure of the engine's vibration intensity.
- Operational Hours: This is the total number of hours the engine has been operational.
- Failure Type: This indicates the type of failure the engine experienced, if any.
- Rotational: This might be a specific type of failure related to the engine's rotation.

This dataset could be used for various analytical purposes related to engine performance and maintenance. Here are some examples:
- Identifying patterns of engine failure: By analyzing the data, you could identify correlations between specific variables (e.g., air temperature, operational hours) and engine failures. This could help predict potential failures and schedule preventative maintenance.
- Optimizing engine performance: By analyzing the data, you could identify the operating conditions (e.g., temperature, speed) that lead to optimal engine performance. This could help improve fuel efficiency and engine lifespan.
- Comparing engine types: The data could be used to compare the performance and efficiency of different engine types under various operating conditions.
- Building predictive models: The data could be used to train machine learning models to predict engine failures, optimize maintenance schedules, and improve overall engine performance (a minimal sketch follows below).

It's important to note that the specific value of this dataset would depend on the context and the intended use case. For example, if you are only interested in a specific type of engine or a particular type of failure, you might need to filter or subset the data accordingly.
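The sketch below illustrates the "building predictive models" use case. The file name and column names are assumptions based on the variable list above, not confirmed fields of the dataset:

```python
# Hedged sketch: predicting failure type from sensor readings, assuming the dataset
# is available as a CSV with the columns described above (names are assumptions).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("engine_data.csv")                      # hypothetical file name
features = ["Air temperature [K]", "Process temperature [K]",
            "Speed (rpm)", "Torque (Nm)", "Vibration Levels",
            "Operational Hours"]                         # assumed column names
X = df[features]
y = df["Failure Type"]                                   # assumed target column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```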
License: CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
The diamond is 58 times harder than any other mineral in the world, and its elegance as a jewel has long been appreciated. Forecasting diamond prices is challenging due to nonlinearity in important features such as carat, cut, clarity, table, and depth. Against this backdrop, the study conducted a comparative analysis of the performance of multiple supervised machine learning models (regressors and classifiers) in predicting diamond prices. Eight supervised machine learning algorithms were evaluated in this work, including Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron. The analysis is based on data preprocessing, exploratory data analysis (EDA), training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metric values and analysis, eXtreme Gradient Boosting was found to be the most optimal algorithm in both classification and regression, with an R² score of 97.45% and an accuracy of 74.28%. As a result, eXtreme Gradient Boosting was recommended as the optimal regressor and classifier for forecasting the price of a diamond specimen.

Methods: Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish various datasets. In a web-based data-science environment, they can study datasets and construct models.
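For readers who want to reproduce the flavour of the regression experiment, here is a minimal, hedged sketch using seaborn's built-in diamonds table as a stand-in for the Kaggle data (xgboost and scikit-learn are assumed to be installed; the hyperparameters are illustrative, not those of the study):

```python
# Hedged sketch of an XGBoost price regressor on a diamonds dataset.
# seaborn's "diamonds" table stands in for the Kaggle dataset used in the study.
import pandas as pd
import seaborn as sns
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

df = sns.load_dataset("diamonds")
X = pd.get_dummies(df.drop(columns="price")).astype(float)  # one-hot cut, color, clarity
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=6, random_state=0)
model.fit(X_train, y_train)

print("R^2 on held-out data:", r2_score(y_test, model.predict(X_test)))
```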
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
We propose a standard synthetic data set for the training of speckle reduction algorithms.
License: https://images.cv/license
Labeled Stop sign images suitable for training and evaluating computer vision and deep learning models.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains information for Machine Learning algorithms to forecast recurrence events (RE) for patients with breast cancer stages I to III. The dataset contains 252 instances and six attributes, including a binary class indicating whether RE occurred. It has been reduced and denoised from the original Ljubljana Breast Cancer Dataset (LBCD), which holds 286 instances with ten attributes each (Zwitter M. and Soklic M. (1988). Breast Cancer. UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/14/breast+cancer).
Ranking by eight different Machine Learning algorithms, followed by statistical handling of the resulting 8-component ranking vectors for the attributes, made it possible to reduce the ten features to the six most relevant ones. The five most pertinent features were {deg_malig, irradiat, node_caps, tumor_size, inv_nodes}; together with the class they form the six retained attributes. The four less relevant attributes were {age, breast_quad, breast, menopause}.
The CAIRAD (Co-appearance based Analysis for Incorrect Records and Attribute-values Detection) filter (Rahman MG, Islam MZ, Bossomaier T, Gao J. CAIRAD: A co-appearance based analysis for incorrect records and attribute-values detection. Proc Int Jt Conf Neural Networks. 2012;(June). https://doi.org/10.1109/IJCNN.2012.6252669) was used to detect noise in the attributes and the class feature. According to the filtering results, 34 instances of the LBCD had noise in half or more of their features; those were removed from the data.
Noise in the class is known to be riskier and more misleading than noise in the attributes. Meanwhile, after CAIRAD filtering, the class attribute had 35 missing values out of 252 (14%). This was unacceptable, considering the comparable number (only 85 cases) of recurrence events in the class of the initial LBCD. The imputation (reconstruction, "cure") of the missing values was performed via the algorithm offered in:
Bai BM, Mangathayaru N, Rani BP. An approach to find missing values in medical datasets. In: ACM International Conference Proceeding Series. Vol 24-26-Sept.; 2015. https://doi.org/10.1145/2832987.2833083. The noise present in the remaining attributes, ranging from 1% to 14%, was neglected.
There are 252 instances in the dataset, of which 206 do not have RE and the remaining 46 have RE. Each instance is defined by six attributes, including the class. This dataset was obtained by improving the initial version of the LBCD, and it provides a significant performance advantage over the original LBCD for most Machine Learning classification algorithms. However, the dataset is slightly more imbalanced than the LBCD, which is a drawback.
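As an illustration of the kind of attribute ranking described above, here is a minimal, hedged sketch using a random forest's impurity-based importances as a stand-in for the eight-algorithm ranking procedure; the file name and exact column spellings are assumptions:

```python
# Hedged sketch: ranking the attributes of the reduced breast-cancer dataset by importance.
# The CSV name and column spellings are assumptions; the study aggregated rankings from
# eight different algorithms rather than a single forest.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("breast_cancer_reduced.csv")            # hypothetical file name
features = ["deg_malig", "irradiat", "node_caps", "tumor_size", "inv_nodes"]
X = pd.get_dummies(df[features])                         # one-hot encode categorical attributes
y = df["class"]                                          # assumed class column (RE yes/no)

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
ranking = sorted(zip(X.columns, forest.feature_importances_), key=lambda t: -t[1])
for name, score in ranking:
    print(f"{name:25s} {score:.3f}")
```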
Images have always played a vital role in human life because vision is the most crucial sense for humans. As a result, image processing has a wide range of applications. Photographs are everywhere nowadays, more than ever, and it is quite easy for anyone to make a large number of photographs using a smartphone. Given the complexities of vision, machine learning has emerged as a critical component of intelligent computer vision programs when adaptability is required. Deep learning is a subfield of artificial intelligence that combines a number of statistical, probabilistic, and optimisation techniques to enable computers to "learn" from previous examples and find difficult-to-detect patterns in big, noisy, or complex data sets. This capability is particularly well-suited to medical applications, especially those that depend on complex proteomic and genomic measurements. A novel integration of deep learning in image processing is very likely to benefit the field and contribute to a better understanding of complex images.

A country's economy is dependent on agricultural productivity. The identification of plant diseases is critical for reducing production losses and enhancing agricultural product quality. Traditional methods are dependable, but they necessitate a human resource to visually observe plant leaf patterns and identify disease, which takes more time and requires more labour. Early identification of plant disease utilising automated procedures will reduce productivity loss in large farm fields. In this research, we propose vision-based automatic plant disease detection utilising image processing techniques. Image processing algorithms are developed to detect plant illness or disease by recognising the colour features of the leaf region. The K-means algorithm is utilised for colour segmentation, whereas GLCM features are employed for disease classification. The vision-based plant infection detection yielded efficient results and promising performance.
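A minimal, hedged sketch of the colour-segmentation plus GLCM pipeline described above is shown below; the image file and the choice of "diseased" cluster are placeholders, and a real workflow would train a classifier on GLCM features from many labelled leaves:

```python
# Hedged sketch: K-means colour segmentation followed by GLCM texture features.
# "leaf.jpg" is a hypothetical input image; scikit-image and scikit-learn are assumed.
import numpy as np
from skimage import color, img_as_ubyte, io
from skimage.feature import graycomatrix, graycoprops
from sklearn.cluster import KMeans

rgb = io.imread("leaf.jpg")                              # hypothetical input image
pixels = rgb.reshape(-1, 3).astype(float)

# K-means colour segmentation: group pixels into k colour clusters.
k = 3
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
labels = km.labels_.reshape(rgb.shape[:2])

# Pick the cluster assumed to contain the diseased region (here: darkest mean colour).
diseased = int(np.argmin(km.cluster_centers_.sum(axis=1)))
mask = labels == diseased

# GLCM texture features of the masked grey-level region.
gray = img_as_ubyte(color.rgb2gray(rgb))
gray_masked = np.where(mask, gray, 0).astype(np.uint8)
glcm = graycomatrix(gray_masked, distances=[1], angles=[0],
                    levels=256, symmetric=True, normed=True)
features = {prop: graycoprops(glcm, prop)[0, 0]
            for prop in ("contrast", "homogeneity", "energy", "correlation")}
print(features)   # these feature vectors would feed a classifier (e.g. an SVM)
```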
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains simulated data modeled after real-world Emergency Room operations to support a Lean Six Sigma DMAIC (Define, Measure, Analyze, Improve, Control) project aimed at reducing ER wait times and patient walkouts.
baseline_data.csv: Simulated patient records before intervention (250 cases)
post_intervention_data.csv: Simulated patient records after DMAIC-based process improvements (250 cases)

| Column Name | Description |
|---|---|
| Patient_ID | Unique patient identifier |
| WaitTime_Mins | Patient wait time in minutes |
| Triage_Level | Emergency triage category (1 = critical, 5 = low) |
| Staff_On_Duty | Number of ER staff assigned during patient care |
| Critical_Case | 1 = Critical case, 0 = Non-critical |
| Shift | Shift during patient visit (Morning, Afternoon, Night) |
| Walkout | 1 = Patient walked out, 0 = Completed visit |
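A minimal, hedged sketch of how the two files might be compared is shown below (assuming pandas and SciPy are available and the column names match the table above):

```python
# Hedged sketch: comparing wait times before and after the DMAIC improvements.
# Assumes the two CSV files and the column names described in the table above.
import pandas as pd
from scipy import stats

baseline = pd.read_csv("baseline_data.csv")
post = pd.read_csv("post_intervention_data.csv")

print("mean wait (baseline):", baseline["WaitTime_Mins"].mean())
print("mean wait (post)    :", post["WaitTime_Mins"].mean())
print("walkout rate change :",
      post["Walkout"].mean() - baseline["Walkout"].mean())

# Welch's two-sample t-test on wait times (no equal-variance assumption).
t, p = stats.ttest_ind(baseline["WaitTime_Mins"], post["WaitTime_Mins"], equal_var=False)
print(f"t = {t:.2f}, p = {p:.4f}")
```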
If you use this dataset, please cite:
Briones, F.J. (2025). Simulated ER Wait Time Reduction Dataset (DMAIC Lean Six Sigma Project). Kaggle Datasets.
Most SaaS organizations spend a chunk of their revenue on various marketing initiatives - digital marketing, media outreach, search engine optimization, and more.
However, if there’s a way to target a highly qualified set of customers to buy your product, the organization reaps multiple benefits, such as enhanced revenue generation, higher deal closure rates, and an increase in profit margins.
An organization that offers a hiring assessment platform is looking to reduce its yearly marketing spend, and you have been appointed as the Machine Learning engineer for this project.
Your task is to build a sophisticated Machine Learning model that predicts the probability percentage of marketing leads purchasing their product, based on information provided in the given dataset.
The dataset consists of parameters such as the deal value and pitch, the lead’s source, its revenue and funding information, assigned points of contact for the lead (internal and external), and the like.
The benefits of practising on this problem using Machine Learning techniques are as follows:
This challenge encourages you to apply your Machine Learning skills to build a model that predicts the probability percentage of a marketing lead converting into a client and purchasing the product. It will also help you enhance your knowledge of regression, one of the basic building blocks of Machine Learning.
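A minimal, hedged sketch of such a regression model is shown below. The file name and column names are assumptions for illustration, since the actual schema is only described loosely above:

```python
# Hedged sketch: predicting the conversion probability (in percent) for marketing leads.
# File name and column names are assumptions; replace them with the real dataset schema.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("leads_train.csv")                         # hypothetical file name
numeric = ["deal_value", "lead_revenue", "funding_amount"]  # assumed numeric columns
categorical = ["lead_source", "pitch", "internal_poc"]      # assumed categorical columns
target = "success_probability"                              # assumed target (0-100)

pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", "passthrough", numeric),
])
model = Pipeline([("pre", pre),
                  ("reg", GradientBoostingRegressor(random_state=0))])

X_train, X_val, y_train, y_val = train_test_split(
    df[numeric + categorical], df[target], test_size=0.2, random_state=0)
model.fit(X_train, y_train)
print("validation R^2:", model.score(X_val, y_val))
```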
License: Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
License information was derived automatically
This dataset is a transformed and preprocessed version of the Bank Churn Dataset from a Kaggle competition. The original dataset was designed to predict customer churn in the banking industry, containing key customer attributes such as credit score, age, account balance, and activity status.
In this version, I have applied a complete data preprocessing pipeline, ensuring the dataset is cleaned, structured, and optimized for machine learning models. This includes handling missing values, encoding categorical features, scaling numerical attributes, detecting and treating outliers, and feature engineering. The processed dataset is now ready for training and evaluation, making it an ideal resource for anyone working on churn prediction, customer retention strategies, or financial analytics.
This work was inspired by the need for high-quality, well-prepared datasets that enable better model performance and reduce preprocessing time for data scientists and machine learning practitioners. 🚀
Below is the refined breakdown of the dataset columns, incorporating feature engineering and transformations:
| Column Name | Description | Data Type |
|---|---|---|
| CustomerId | Unique identifier for each customer. | int64 |
| Surname | Last name of the customer (not used in ML modeling). | object |
| CreditScore | Customer's credit score, ranging from 350 to 850. | int64 |
| Geography | Country of the customer (France, Germany, or Spain). | object |
| Gender | Gender of the customer (Male or Female). | object |
| Age | Age of the customer (18-92 years). | float64 |
| Tenure | Number of years the customer has been with the bank (0-10). | int64 |
| Balance | Account balance of the customer (0.0 to 250,898.09). | float64 |
| NumOfProducts | Number of products the customer uses (1-4). | int64 |
| HasCrCard | Whether the customer owns a credit card (1 = Yes, 0 = No). | int64 |
| IsActiveMember | Whether the customer is an active bank member (1 = Yes, 0 = No). | int64 |
| EstimatedSalary | Estimated annual salary of the customer (11.58 to 199,992.48). | float64 |
| Exited (Only in train_preprocessed.csv) | Target variable indicating if the customer churned (1 = Yes, 0 = No). | int64 |
| AgeGroup | Categorized age group (Child, Teen, Young Adult, Middle-Aged Adult, Senior). | object |
| BalanceCategory | Categorized balance levels (No Balance, 0-100K, ..., 900K-1M). | object |
| SalaryCategory | Categorized salary levels (Zero Income, Low Income, ..., Very High Income). | object |
| CreditScoreCategory | Categorized credit score (Low, Fair, Good, High, Exceptional). | object |
This breakdown provides a comprehensive overview of the dataset's structure and transformations. 🚀
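For orientation, here is a minimal, hedged sketch of the kind of preprocessing pipeline described above, not the author's exact steps. Column names follow the table; the file name is an assumption:

```python
# Hedged sketch of a churn-data preprocessing pipeline (imputation, scaling, encoding).
# Column names follow the table above; "train.csv" is a hypothetical raw file.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("train.csv")
numeric = ["CreditScore", "Age", "Tenure", "Balance",
           "NumOfProducts", "EstimatedSalary"]
categorical = ["Geography", "Gender", "HasCrCard", "IsActiveMember"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

X = preprocess.fit_transform(df[numeric + categorical])
print("processed feature matrix shape:", X.shape)
```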
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Data used for producing Figures in the manuscript are uploaded here, with the ML model weights and ocean model grid info.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The purpose of data mining analysis is always to find patterns in the data using certain kinds of techniques, such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before doing any work on the data, it has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance did not improve much. The reason could be that the features we selected to perform clustering on are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

From the dimensionality reduction perspective: clustering is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique for reducing the data dimension can lose a lot of information, since clustering techniques are based on a metric of "distance", and at high dimensions Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

From the perspective of creating new features: clustering analysis creates labels based on the patterns of the data, which brings uncertainty into the data. When using clustering prior to classification, the decision on the number of clusters will strongly affect the performance of the clustering, and in turn the performance of the classification. If the subset of features we apply clustering techniques to is well suited for it, it might increase the overall classification performance. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better.

We did not lock in the clustering outputs using a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, maybe the data just does not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering in the data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model's real-world effectiveness and also to continue to revise the models from time to time as things change.
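To make the comparison concrete, here is a minimal, hedged sketch contrasting cluster labels and PCA components as derived features before classification (scikit-learn assumed; the digits dataset is only a stand-in for the project's data):

```python
# Hedged sketch: cluster labels vs. PCA components as derived features before classification.
# The digits dataset is a stand-in; results on other data may differ, as discussed above.
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder

X, y = load_digits(return_X_y=True)
clf = LogisticRegression(max_iter=5000)

# (a) Replace the features with one-hot encoded k-means cluster labels.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X).reshape(-1, 1)
X_clustered = OneHotEncoder().fit_transform(labels).toarray()

# (b) Reduce the features to 10 principal components instead.
X_pca = PCA(n_components=10, random_state=0).fit_transform(X)

print("cluster-label features:", cross_val_score(clf, X_clustered, y, cv=5).mean())
print("PCA features          :", cross_val_score(clf, X_pca, y, cv=5).mean())
```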
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains 3102 experimental measurements. It was used to evaluate a two-stage experimental design approach combining cluster-based data reduction and machine learning models for accurate prediction of optical band gap values.
License: https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
Oral diseases affect nearly 3.5 billion people, with the majority residing in low- and middle-income countries. Due to limited healthcare resources, many individuals are unable to access proper oral healthcare services. Image-based machine learning technology is one of the most promising approaches to improving oral healthcare services and reducing patient costs. Openly accessible datasets play a crucial role in facilitating the development of machine learning techniques. However, existing dental datasets have limitations such as a scarcity of Cone Beam Computed Tomography (CBCT) data, lack of matched multi-modal data, and insufficient complexity and diversity of the data. This project addresses these challenges by providing a dataset that includes 329 CBCT images from 169 patients, multi-modal data with matching modalities, and images representing various oral health conditions.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This upload contains slices 1 – 1,000 from the data collection described in
Maximilian B. Kiss, Sophia B. Coban, K. Joost Batenburg, Tristan van Leeuwen, and Felix Lucka, "2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography dataset for machine learning", Sci Data 10, 576 (2023), or arXiv:2306.05907 (2023)
Abstract:
"Recent research in computational imaging largely focuses on developing machine learning (ML) techniques for image reconstruction, which requires large-scale training datasets consisting of measurement data and ground-truth images. However, suitable experimental datasets for X-ray Computed Tomography (CT) are scarce, and methods are often developed and evaluated only on simulated data. We fill this gap by providing the community with a versatile, open 2D fan-beam CT dataset suitable for developing ML techniques for a range of image reconstruction tasks. To acquire it, we designed a sophisticated, semi-automatic scan procedure that utilizes a highly-flexible laboratory X-ray CT setup. A diverse mix of samples with high natural variability in shape and density was scanned slice-by-slice (5000 slices in total) with high angular and spatial resolution and three different beam characteristics: A high-fidelity, a low-dose and a beam-hardening-inflicted mode. In addition, 750 out-of-distribution slices were scanned with sample and beam variations to accommodate robustness and segmentation tasks. We provide raw projection data, reference reconstructions and segmentations based on an open-source data processing pipeline."
The data collection has been acquired using a highly flexible, programmable and custom-built X-ray CT scanner, the FleX-ray scanner, developed by TESCAN-XRE NV, located in the FleX-ray Lab at the Centrum Wiskunde & Informatica (CWI) in Amsterdam, Netherlands. It consists of a cone-beam microfocus X-ray point source (limited to 90 kV and 90 W) that projects polychromatic X-rays onto a 14-bit CMOS (complementary metal-oxide semiconductor) flat panel detector with CsI(Tl) scintillator (Dexella 1512NDT) and 1536-by-1944 pixels, 74.8 µm² each. To create a 2D dataset, a fan-beam geometry was mimicked by only reading out the central row of the detector. Between source and detector there is a rotation stage, upon which samples can be mounted. The machine components (i.e., the source, the detector panel, and the rotation stage) are mounted on translation belts that allow the moving of the components independently from one another.
Please refer to the paper for all further technical details.
The complete dataset can be found via the following links: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, OOD.
The reference reconstructions and segmentations can be found via the following links: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, OOD.
The corresponding Python scripts for loading, pre-processing, reconstructing and segmenting the projection data in the way described in the paper can be found on GitHub. A machine-readable file with the used scanning parameters and instrument data for each acquisition mode, as well as a script for loading it, can be found in the GitHub repository as well.
Note: It is advisable to use the graphical user interface when decompressing the .zip archives. If you experience a zipbomb error when unzipping the files on a Linux system, rerun the command with the UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE environment variable set, for example by adding "export UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE" to your .bashrc.
For more information or guidance in using the data collection, please get in touch with
Maximilian.Kiss [at] cwi.nl
Felix.Lucka [at] cwi.nl
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Overview: This dataset was designed for understanding the influence of various classroom environmental factors on student performance. It contains synthetic data based on real-world variables known to impact the learning experience, including air quality, classroom layout, student density, and environmental conditions. The data is primarily focused on classroom dimensions, air quality metrics, student engagement factors, and dynamic performance outcomes.
The dataset contains 15,000 data points of classroom settings from a variety of student environments, providing valuable insights for educational researchers, policymakers, and institutions looking to enhance the learning experience by optimizing environmental factors.
Features:

- Length (L) (m): The length of the classroom (in meters), ranging from 8 to 12 meters. This affects space utilization and airflow, contributing to overall classroom comfort.
- Width (W) (m): The width of the classroom (in meters), ranging from 6 to 10 meters. Influences classroom layout and seating arrangement.
- Height (H) (m): The height of the classroom (in meters), ranging from 2.5 to 4 meters. Affects air circulation and overall comfort levels.
- Number of Students (N): The number of students in the classroom, ranging from 51 to 120 students. A higher student count can affect air quality and classroom dynamics.
- Airflow (Q) (m³/hr): The total airflow in cubic meters per hour, based on the number of students in the classroom. The airflow helps to maintain air quality, and reduced airflow may correlate with lower student performance due to poor ventilation.
- Heat Generation (Q, Watts): The heat generated by students in the classroom, assuming each student generates 100 watts of heat. This can impact temperature levels, influencing comfort and student focus.
- Lighting Intensity (lux): The intensity of lighting in the classroom (measured in lux), which can affect visual comfort and focus. Ranges from 200 to 1000 lux, with dim lighting potentially causing fatigue and reducing performance.
- Noise Level (dB): The noise level in decibels, influenced by the number of students. More students lead to higher noise levels, which could negatively impact concentration.
- Ergonomic Comfort: A rating of seating comfort, ranging from 50 to 100, depending on the number of students and classroom layout. Higher comfort levels correlate with better student focus and engagement.
- Classroom Layout: A categorical variable indicating the classroom layout (0 = Rows, 1 = Clusters, 2 = Circles). Different layouts can influence student interaction, visibility, and engagement.
- Visual Accessibility: A score indicating how well students can see and interact with materials (e.g., the blackboard or projector). Larger classrooms or improper seating arrangements can reduce visual accessibility, impacting learning.
- Greenery (%): The percentage of classroom space covered by plants, ranging from 0% to 10%. Greenery has been shown to improve cognitive function and learning outcomes by creating a more pleasant and relaxed environment.
- Time of Day (hrs): The time during which the class is conducted, ranging from 8 AM to 4 PM. Later classes might see a decrease in student performance due to fatigue or circadian rhythms.
- Dynamic Learning Outcome: The overall learning outcome for a given session, measured as a score between 50 and 90. The outcome is influenced by all of the above factors and includes additional adjustments for class dynamics, environmental conditions, and temporal factors (e.g., time of day).
Purpose of the Dataset: This dataset serves to model and analyze the relationship between various classroom environmental factors and student performance. Researchers can explore the effects of air quality, classroom layout, and other factors on learning outcomes. The dataset may also be used to develop predictive models to optimize classroom environments for improved student performance.
Example Applications:

- Educational Research: Studying how air quality and classroom layout affect student concentration and learning efficiency.
- Environmental Psychology: Analyzing the relationship between environmental comfort (e.g., lighting, temperature) and cognitive performance.
- Policy Development: Providing evidence-based recommendations for improving school infrastructure, air quality, and classroom design to support better learning outcomes.
- Machine Learning Models: Training machine learning algorithms to predict student performance based on environmental features and class conditions (a minimal sketch follows below).
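The sketch below illustrates the last application. The file name and column names are assumptions for illustration; the real CSV headers may differ:

```python
# Hedged sketch: predicting the Dynamic Learning Outcome from classroom features.
# File name and column names are assumptions; adjust them to the actual CSV headers.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("classroom_environment.csv")             # hypothetical file name
features = ["Length", "Width", "Height", "Number_of_Students", "Airflow",
            "Heat_Generation", "Lighting_Intensity", "Noise_Level",
            "Ergonomic_Comfort", "Classroom_Layout", "Visual_Accessibility",
            "Greenery", "Time_of_Day"]                    # assumed column names
X, y = df[features], df["Dynamic_Learning_Outcome"]       # assumed target column

model = RandomForestRegressor(n_estimators=300, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("cross-validated R^2:", scores.mean())
```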
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
A web framework designed for researchers to perform comparative analysis of various machine learning algorithms in the context of fake news detection. The folder also includes several datasets for experimentation, alongside the source code. The rise of social media has transformed the landscape of news dissemination, presenting new challenges in combating the spread of fake news. This study addresses the automated detection of misinformation within written content, a task that has prompted extensive research efforts across various methodologies. We evaluate existing benchmarks, introduce a novel hybrid word embedding model, and implement a web framework for text classification. Our approach integrates traditional term frequency–inverse document frequency (TF–IDF) methods with sophisticated feature extraction techniques, considering linguistic, psychological, morphological, and grammatical aspects of the text. Through a series of experiments on diverse datasets, applying transfer and incremental learning techniques, we demonstrate the effectiveness of our hybrid model in surpassing benchmarks and outperforming alternative experimental setups. Furthermore, our findings emphasize the importance of dataset alignment and balance in transfer learning, as well as the utility of incremental learning in maintaining high detection performance while reducing runtime. This research offers promising avenues for further advancements in fake news detection methodologies, with implications for future research and development in this critical domain.
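A minimal, hedged sketch of the TF-IDF baseline underlying such a framework is shown below; the texts and labels are toy placeholders, and the hybrid embedding and linguistic features described above are not reproduced here:

```python
# Hedged sketch: a TF-IDF + linear classifier baseline for fake news detection.
# The texts and labels are toy placeholders; real experiments would use the included datasets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["Scientists confirm the new vaccine passed all clinical trials.",
         "Celebrity reveals miracle cure that doctors don't want you to know.",
         "City council approves budget for new public library.",
         "Secret world government controls the weather with satellites."]
labels = [0, 1, 0, 1]                     # 0 = real, 1 = fake (toy labels)

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1),
                      LogisticRegression(max_iter=1000))
model.fit(texts, labels)
print(model.predict(["Miracle cure hidden by doctors, insiders say."]))
```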
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This upload contains samples 9 - 16 from the data collection described in
Henri Der Sarkissian, Felix Lucka, Maureen van Eijnatten, Giulia Colacicco, Sophia Bethany Coban, Kees Joost Batenburg, "A Cone-Beam X-Ray CT Data Collection Designed for Machine Learning", Sci Data 6, 215 (2019). https://doi.org/10.1038/s41597-019-0235-y or arXiv:1905.04787 (2019)
Abstract:
"Unlike previous works, this open data collection consists of X-ray cone-beam (CB) computed tomography (CT) datasets specifically designed for machine learning applications and high cone-angle artefact reduction: Forty-two walnuts were scanned with a laboratory X-ray setup to provide not only data from a single object but from a class of objects with natural variability. For each walnut, CB projections on three different orbits were acquired to provide CB data with different cone angles as well as being able to compute artefact-free, high-quality ground truth images from the combined data that can be used for supervised learning. We provide the complete image reconstruction pipeline: raw projection data, a description of the scanning geometry, pre-processing and reconstruction scripts using open software, and the reconstructed volumes. Due to this, the dataset can not only be used for high cone-angle artefact reduction but also for algorithm development and evaluation for other tasks, such as image reconstruction from limited or sparse-angle (low-dose) scanning, super resolution, or segmentation."
The scans are performed using a custom-built, highly flexible X-ray CT scanner, the FleX-ray scanner, developed by XRE NV and located in the FleX-ray Lab at the Centrum Wiskunde & Informatica (CWI) in Amsterdam, Netherlands. The general purpose of the FleX-ray Lab is to conduct proof of concept experiments directly accessible to researchers in the field of mathematics and computer science. The scanner consists of a cone-beam microfocus X-ray point source that projects polychromatic X-rays onto a 1536-by-1944 pixels, 14-bit flat panel detector (Dexella 1512NDT) and a rotation stage in-between, upon which a sample is mounted. All three components are mounted on translation stages which allow them to move independently from one another.
Please refer to the paper for all further technical details.
The complete data set can be found via the following links: 1-8, 9-16, 17-24, 25-32, 33-37, 38-42
The corresponding Python scripts for loading, pre-processing and reconstructing the projection data in the way described in the paper can be found on GitHub.
For more information or guidance in using this dataset, please get in touch with
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
TWNERTC and EWNERTC are collections of automatically categorized and annotated sentences obtained from Turkish and English Wikipedia for named-entity recognition and text categorization.
Firstly, we construct large-scale gazetteers by using a graph crawler algorithm to extract relevant entity and domain information from a semantic knowledge base, Freebase. The final gazetteers have 77 domains (categories) and more than 1000 fine-grained entity types for both languages. The Turkish gazetteer contains approximately 300K named entities and the English gazetteer approximately 23M named entities.
By leveraging the large-scale gazetteers and linked Wikipedia articles, we construct TWNERTC and EWNERTC. Since the categorization and annotation processes are automated, the raw collections are prone to ambiguity. Hence, we introduce two noise reduction methodologies: (a) domain-dependent and (b) domain-independent. We produce two different versions by post-processing the raw collections. As a result of this process, we introduce three versions of TWNERTC and EWNERTC: (a) raw, (b) domain-dependent post-processed, and (c) domain-independent post-processed. The Turkish collections have approximately 700K sentences for each version (the count varies between versions), while the English collections contain more than 7M sentences.
We also introduce "Coarse-Grained NER" versions of the same datasets. We reduce the fine-grained types to "organization", "person", "location" and "misc" by mapping each fine-grained type to the most similar coarse-grained one. Note that this process also eliminated many domains and fine-grained annotations due to a lack of information for coarse-grained NER. Hence, the "Coarse-Grained NER" labelled datasets contain only 25 domains, and the number of sentences is decreased compared to the "Fine-Grained NER" versions.
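As an illustration of the coarse-graining step, here is a minimal, hedged sketch; the fine-grained type names are invented examples, not the actual Freebase-derived labels in the datasets:

```python
# Hedged sketch of mapping fine-grained entity types to coarse-grained NER labels.
# The fine-grained names below are illustrative, not the actual labels in TWNERTC/EWNERTC.
FINE_TO_COARSE = {
    "football_team": "organization",
    "government_agency": "organization",
    "politician": "person",
    "musician": "person",
    "river": "location",
    "capital_city": "location",
    "award": "misc",
}

def coarsen(tokens_with_types):
    """Replace each fine-grained type with its coarse label; unknown types become 'O'."""
    return [(tok, FINE_TO_COARSE.get(t, "O")) for tok, t in tokens_with_types]

print(coarsen([("Ankara", "capital_city"), ("hosted", "O"), ("UEFA", "football_team")]))
```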
All processes are explained in our published white paper for Turkish; however, major methods (gazetteers creation, automatic categorization/annotation, noise reduction) do not change for English.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The demand for high-dimensional data processing in machine learning has led to the increasing use of dimensionality reduction techniques. These techniques aim to extract the most important information from high-dimensional data, reducing it to a lower-dimensional representation that can be easily processed by machine learning algorithms. However, with the availability of a multitude of dimensionality reduction techniques and heterogeneous datasets, it can be challenging for researchers to select the most appropriate one for their specific application. This research conducts a comparative analysis to identify the distinctive behaviors of various dimensionality reduction techniques under different data situations. The state-of-the-art linear and non-linear dimensionality reduction techniques are analyzed. The study also analyses the performance of each technique in terms of its ability to extract meaningful, interpretable, and low-dimensional features from high-dimensional data. The analysis results provide insights into each technique's strengths and weaknesses and highlight the most appropriate technique when dealing with heterogeneous datasets for different machine-learning tasks. We use multiple tabular, text, and image datasets to validate our findings.
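As a concrete illustration of comparing a linear and a non-linear technique, here is a minimal, hedged sketch on a standard image-like dataset (scikit-learn assumed; this is not the benchmark used in the study):

```python
# Hedged sketch: comparing a linear (PCA) and a non-linear (t-SNE) reduction on one dataset,
# judged here by how well a simple classifier separates classes in the reduced space.
# (t-SNE cannot embed unseen points, so this comparison is illustrative only.)
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

X_pca = PCA(n_components=2, random_state=0).fit_transform(X)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

print("kNN accuracy on PCA(2)  :", cross_val_score(knn, X_pca, y, cv=5).mean())
print("kNN accuracy on t-SNE(2):", cross_val_score(knn, X_tsne, y, cv=5).mean())
```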