MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by chelbi Zineb
Released under MIT
This dataset was created by Sachin Jain
This dataset was created by Vishal Kr. Srivastava
This dataset was created by Constantius
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Kaggle Oracle Dataset
Expert Instruction-Following Data for Competitive Machine Learning
Overview
The Kaggle Oracle Dataset is a high-quality collection of instruction-response pairs tailored for fine-tuning LLMs to provide expert guidance in Kaggle competitions. Built from 14.9M+ kernels and 9,700 competitions, this is the most comprehensive dataset for competitive ML strategy.
Highlights
175 expert-curated instruction-response pairs; 100% real-world Kaggle… See the full description on the dataset page: https://huggingface.co/datasets/Aktraiser/Oracle_Kaggle.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MLCommons Dollar Street Dataset is a collection of images of everyday household items from homes around the world that visually captures socioeconomic diversity of traditionally underrepresented populations. It consists of public domain data, licensed for academic, commercial and non-commercial usage, under CC-BY and CC-BY-SA 4.0. The dataset was developed because similar datasets lack socioeconomic metadata and are not representative of global diversity.
This is a subset of the original dataset that can be used for multiclass classification with 10 categories. It is designed to be used in teaching, similar to the widely used, but unlicensed CIFAR-10 dataset.
These are the preprocessing steps that were performed:
This is the label mapping:
| Category | Label |
|---|---|
| day bed | 0 |
| dishrag | 1 |
| plate | 2 |
| running shoe | 3 |
| soap dispenser | 4 |
| street sign | 5 |
| table lamp | 6 |
| tile roof | 7 |
| toilet seat | 8 |
| washing machine | 9 |
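For convenience, the label mapping above can be expressed directly in code. This is a small illustrative snippet; the dict and variable names are my own, not part of the dataset:

```python
# The 10-class label mapping from the table above, as a plain Python dict
LABELS = {
    "day bed": 0,
    "dishrag": 1,
    "plate": 2,
    "running shoe": 3,
    "soap dispenser": 4,
    "street sign": 5,
    "table lamp": 6,
    "tile roof": 7,
    "toilet seat": 8,
    "washing machine": 9,
}

# Inverse mapping, useful for turning model predictions back into names
ID2LABEL = {v: k for k, v in LABELS.items()}

print(ID2LABEL[3])  # running shoe
```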
Check out this notebook to see how the subset was created: https://github.com/carpentries-lab/deep-learning-intro/blob/main/instructors/prepare-dollar-street-data.ipynb
The original dataset was downloaded from https://www.kaggle.com/datasets/mlcommons/the-dollar-street-dataset. See https://mlcommons.org/datasets/dollar-street/ for more information.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Medical Cost Personal Datasets’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mirichoi0218/insurance on 12 November 2021.
--- Dataset description provided by original source is as follows ---
Machine Learning with R by Brett Lantz is a book that provides an introduction to machine learning using R. As far as I can tell, Packt Publishing does not make its datasets available online unless you buy the book and create a user account which can be a problem if you are checking the book out from the library or borrowing the book from a friend. All of these datasets are in the public domain but simply needed some cleaning up and recoding to match the format in the book.
Columns - age: age of primary beneficiary
sex: insurance contractor gender, female, male
bmi: body mass index (kg/m²), an objective index of body weight relative to height; values of 18.5 to 24.9 are considered ideal
children: Number of children covered by health insurance / Number of dependents
smoker: smoking status (yes/no)
region: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.
charges: Individual medical costs billed by health insurance
The dataset is available on GitHub here.
Can you accurately predict insurance costs?
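As a starting point for that question, here is a minimal baseline sketch assuming the columns described above. The six-row sample frame is invented for illustration (the real data lives in insurance.csv), and pandas and scikit-learn are assumed to be available:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny synthetic sample mirroring the columns described above
# (values are made up for illustration; the real data is insurance.csv)
df = pd.DataFrame({
    "age":      [19, 33, 45, 52, 28, 60],
    "sex":      ["female", "male", "male", "female", "male", "female"],
    "bmi":      [27.9, 22.7, 30.1, 26.4, 33.0, 25.8],
    "children": [0, 1, 2, 3, 0, 1],
    "smoker":   ["yes", "no", "no", "yes", "no", "no"],
    "region":   ["southwest", "southeast", "northwest",
                 "northeast", "southeast", "northwest"],
    "charges":  [16884.9, 4449.5, 8606.2, 24671.7, 3866.9, 13228.8],
})

# One-hot encode the categorical columns, then fit a simple baseline
X = pd.get_dummies(df.drop(columns="charges"), drop_first=True)
y = df["charges"]

model = LinearRegression().fit(X, y)
print(round(model.score(X, y), 3))  # R^2 on the training sample
```

On the real dataset, the same pipeline with a train/test split gives an honest baseline against which more capable models (e.g. gradient boosting) can be compared.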
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Pokemon with stats’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/abcsds/pokemon on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This data set includes 721 Pokemon, including their number, name, first and second type, and basic stats: HP, Attack, Defense, Special Attack, Special Defense, and Speed. It has been of great use when teaching statistics to kids. With certain types you can also give a geeky introduction to machine learning.
These are the raw attributes used for calculating how much damage an attack does in the games. This dataset is about the Pokemon games (NOT Pokemon cards or Pokemon Go).
The data as described by Myles O'Neill is:
The data for this table has been acquired from several different sites, including:
One question has been answered with this database: the type of a Pokemon cannot be inferred from its Attack and Defense alone. It would be worthwhile to find which two variables, if any, can define the type of a Pokemon. Two variables can be plotted in a 2D space and used as an example for machine learning, making a visual example any geeky machine learning class would love.
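As a toy illustration of the two-variable idea, here is a sketch that classifies a type label from (Attack, Defense) pairs. All numbers are invented for illustration (the real dataset has 721 rows with columns such as Attack, Defense, and Type 1), and scikit-learn is assumed to be available:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up (Attack, Defense) clusters for two illustrative types
rng = np.random.default_rng(0)
rock  = rng.normal([80, 110], 10, size=(30, 2))   # high Defense
ghost = rng.normal([95,  60], 10, size=(30, 2))   # lower Defense
X = np.vstack([rock, ghost])
y = np.array(["Rock"] * 30 + ["Ghost"] * 30)

# A k-nearest-neighbours classifier on the 2D feature space
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(clf.predict([[85, 115]])[0])
```

Plotting the same two columns with the decision boundary of the fitted classifier gives exactly the kind of visual example the description has in mind.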
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Pokemon’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mlomuscio/pokemon on 14 February 2022.
--- Dataset description provided by original source is as follows ---
I acquired the data from Alberto Barradas at https://www.kaggle.com/abcsds/pokemon. I needed to edit some of the variable names and remove the Total variable in order for my students to use this data for class. Otherwise, I would have just had them use his version of the data.
This dataset is for my Introduction to Data Science and Machine Learning Course. Using a modified Pokémon dataset acquired from Kaggle.com, I created example code for students demonstrating how to explore data with R.
Barradas provides the following description of each variable. I have modified the variable names to make them easier to deal with.
--- Original source retains full ownership of the source dataset ---
DEEP-VOICE: Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion
This dataset contains examples of real human speech, and DeepFake versions of those speeches generated using Retrieval-based Voice Conversion.
Can machine learning be used to detect when speech is AI-generated?
Introduction There are growing implications surrounding generative AI in the speech domain that enable voice cloning and real-time voice conversion from one individual to another. This technology poses a significant ethical threat and could lead to breaches of privacy and misrepresentation, thus there is an urgent need for real-time detection of AI-generated speech for DeepFake Voice Conversion.
To address the above emerging issues, we are introducing the DEEP-VOICE dataset. DEEP-VOICE is comprised of real human speech from eight well-known figures and their speech converted to one another using Retrieval-based Voice Conversion.
For each speech, the accompaniment ("background noise") was removed before conversion using RVC. The original accompaniment is then added back to the DeepFake speech:
(Above: Overview of the Retrieval-based Voice Conversion process to generate DeepFake speech with Ryan Gosling's speech converted to Margot Robbie. Conversion is run on the extracted vocals before being layered on the original background ambience.)
Dataset There are two forms to the dataset that are made available.
First, the raw audio can be found in the "AUDIO" directory. They are arranged within "REAL" and "FAKE" class directories. The audio filenames note which speakers provided the real speech, and which voices they were converted to. For example "Obama-to-Biden" denotes that Barack Obama's speech has been converted to Joe Biden's voice.
Second, the extracted features can be found in the "DATASET-balanced.csv" file. This is the data that was used in the study below. Each feature is extracted from one-second windows of audio, and the dataset is balanced through random sampling.
Note: All experimental data is found within the "KAGGLE" directory. The "DEMONSTRATION" directory is used for playing cropped and compressed demos in notebooks due to Kaggle's limitations on file size.
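The one-second windowing described above can be sketched with plain NumPy. The two per-window features here (RMS energy and zero-crossing rate) are placeholders of my own choosing; the study's actual feature set is whatever is stored in DATASET-balanced.csv:

```python
import numpy as np

def one_second_windows(signal, sr):
    """Split a 1-D audio signal into non-overlapping 1-second windows."""
    n = len(signal) // sr          # number of full one-second windows
    return signal[: n * sr].reshape(n, sr)

def window_features(win):
    """Illustrative per-window features: RMS energy and zero-crossing rate."""
    rms = np.sqrt(np.mean(win ** 2))
    zcr = np.mean(np.abs(np.diff(np.sign(win))) > 0)
    return rms, zcr

sr = 16_000                                  # assumed sample rate
# Synthetic 3-second sine tone standing in for a speech clip
audio = np.sin(2 * np.pi * 440 * np.arange(3 * sr + 500) / sr)

wins = one_second_windows(audio, sr)
feats = np.array([window_features(w) for w in wins])
print(wins.shape, feats.shape)               # (3, 16000) (3, 2)
```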
A successful detection system could potentially be used as follows:
(Above: Usage of the real-time system. The end user is notified when the machine learning model has processed the speech audio (e.g. a phone or conference call) and predicted that audio chunks contain AI-generated speech.)
Kaggle The dataset is available on the Kaggle data science platform.
The Kaggle page can be found by clicking here: Dataset on Kaggle
Attribution This dataset was produced from the study "Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion"
The preprint can be found on ArXiv by clicking here: Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion
License This dataset is provided under the MIT License:
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Titanic: cleaned data’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/jamesleslie/titanic-cleaned-data on 30 September 2021.
--- Dataset description provided by original source is as follows ---
This dataset was created in this notebook as part of a three-part series. The data is in machine-learning-ready format, with all missing values for the Age, Fare and Embarked columns having been imputed:
* Age: imputed using the median age for the passenger's title (Mr, Mrs, Dr, etc.)
* Fare: the single missing value was imputed using the median value for that passenger's class
* Embarked: the two missing values were imputed using the Pandas backfill method
This data is used in both the second and third parts of the series.
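The imputation steps just described (median age per title, median fare per class, backfill for Embarked) can be sketched in pandas. The toy frame below is invented; column names follow the usual Kaggle Titanic conventions:

```python
import numpy as np
import pandas as pd

# Toy frame with the same kinds of gaps described above
df = pd.DataFrame({
    "Title":    ["Mr", "Mr", "Mrs", "Mrs", "Dr"],
    "Age":      [22.0, np.nan, 35.0, np.nan, 44.0],
    "Pclass":   [3, 3, 1, 1, 2],
    "Fare":     [7.25, 8.05, np.nan, 71.28, 13.00],
    "Embarked": ["S", None, "C", "C", "S"],
})

# Age: median age for the passenger's title
df["Age"] = df["Age"].fillna(df.groupby("Title")["Age"].transform("median"))
# Fare: median fare for the passenger's class
df["Fare"] = df["Fare"].fillna(df.groupby("Pclass")["Fare"].transform("median"))
# Embarked: pandas backfill
df["Embarked"] = df["Embarked"].bfill()

print(df.isna().sum().sum())  # 0 remaining missing values
```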
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Titanic Solution for Beginner's Guide’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/harunshimanto/titanic-solution-for-beginners-guide on 14 February 2022.
--- Dataset description provided by original source is as follows ---
The data has been split into two groups:
training set (train.csv)
test set (test.csv)
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
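The gender_submission baseline described above is a one-liner in pandas. The four-row frame below is an invented stand-in for test.csv, which has more columns:

```python
import pandas as pd

# A toy stand-in for test.csv
test = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895],
    "Sex":         ["male", "female", "female", "male"],
})

# gender_submission-style baseline: predict survival iff female
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived":    (test["Sex"] == "female").astype(int),
})
print(submission["Survived"].tolist())  # [0, 1, 1, 0]
```

Writing `submission.to_csv("submission.csv", index=False)` produces a file in the same two-column format as gender_submission.csv.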
| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way... Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview: This dataset was collected and curated to support research on predicting real estate prices using machine learning algorithms, specifically Support Vector Regression (SVR) and Gradient Boosting Machine (GBM). The dataset includes comprehensive information on residential properties, enabling the development and evaluation of predictive models for accurate and transparent real estate appraisals.
Data Source: The data was sourced from Department of Lands and Survey real estate listings.
Features: The dataset contains the following key attributes for each property:
* Area (in square meters): The total living area of the property.
* Floor Number: The floor on which the property is located.
* Location: Geographic coordinates or city/region where the property is situated.
* Type of Apartment: The classification of the property, such as studio, one-bedroom, two-bedroom, etc.
* Number of Bathrooms: The total number of bathrooms in the property.
* Number of Bedrooms: The total number of bedrooms in the property.
* Property Age (in years): The number of years since the property was constructed.
* Property Condition: A categorical variable indicating the condition of the property (e.g., new, good, fair, needs renovation).
* Proximity to Amenities: The distance to nearby amenities such as schools, hospitals, shopping centers, and public transportation.
* Market Price (target variable): The actual sale price or listed price of the property.
Data Preprocessing:
* Normalization: Numeric features such as area and proximity to amenities were normalized to ensure consistency and improve model performance.
* Categorical Encoding: Categorical features like property condition and type of apartment were encoded using one-hot encoding or label encoding, depending on the specific model requirements.
* Missing Values: Missing data points were handled using appropriate imputation techniques or by excluding records with significant missing information.
Usage: This dataset was utilized to train and test machine learning models, aiming to predict the market price of residential properties based on the provided attributes. The models developed using this dataset demonstrated improved accuracy and transparency over traditional appraisal methods.
Dataset Availability: The dataset is available for public use under the CC BY 4.0 license. Users are encouraged to cite the related publication when using the data in their research or applications.
Citation: If you use this dataset in your research, please cite the following publication: "Real Estate Decision-Making: Precision in Price Prediction through Advanced Machine Learning Algorithms".
Android is one of the most used mobile operating systems worldwide. Due to its technological impact, its open-source code, and the possibility of installing applications from third parties without any central control, Android has recently become a malware target. Even though it includes security mechanisms, recent news about malicious activities and Android's vulnerabilities points to the importance of continuing to develop methods and frameworks to improve its security.
To prevent malware attacks, researchers and developers have proposed different security solutions, applying static analysis, dynamic analysis, and artificial intelligence. Indeed, data science has become a promising area in cybersecurity, since analytical models based on data allow for the discovery of insights that can help to predict malicious activities.
In this work, we propose to consider some network layer features as the basis for machine learning models that can successfully detect malware applications, using open datasets from the research community.
This dataset is based on another dataset (DroidCollector) where you can get all the network traffic in pcap files. In our research, we preprocessed the files to obtain the network features described in the following articles:
López, C. C. U., Villarreal, J. S. D., Belalcazar, A. F. P., Cadavid, A. N., & Cely, J. G. D. (2018, May). Features to Detect Android Malware. In 2018 IEEE Colombian Conference on Communications and Computing (COLCOM) (pp. 1-6). IEEE.
Cao, D., Wang, S., Li, Q., Cheny, Z., Yan, Q., Peng, L., & Yang, B. (2016, August). DroidCollector: A High Performance Framework for High Quality Android Traffic Collection. In Trustcom/BigDataSE/I SPA, 2016 IEEE (pp. 1753-1758). IEEE
Having a pet is one of life’s most fulfilling experiences. Your pets spoil you with their love, compassion, and loyalty. And dare anyone lay a finger on you in your pet’s presence, they are in for a lot of trouble. Thanks to social media, videos of clumsy and fussy (yet adorable) pets from across the globe entertain you all day long. Their love is pure and infinite. So, in return, all pets deserve a warm and loving family, indeed. And occasional boops, of course.
Numerous organizations across the world provide shelter to all homeless animals until they are adopted into a new home. However, finding a loving family for them can be a daunting task at times. This International Homeless Animals Day, we present a Machine Learning challenge to you: Adopt a buddy.
The brighter side of the pandemic is an increase in animal adoption and fostering. To ensure that their customers stay indoors, a leading pet adoption agency plans on creating a virtual-tour experience, showcasing all animals available in their shelter. To enable that, you have been tasked to build a Machine Learning model that determines the type and breed of an animal based on its physical attributes and other factors.
The dataset consists of parameters such as a unique ID assigned to each animal that is up for adoption, the date on which they arrived at the shelter, their physical attributes such as color, length, and height, among other factors.
The benefits of practicing this problem by using Machine Learning techniques are as follows:
This challenge will help you to actively enhance your knowledge of multi-label classification, one of the basic building blocks of Machine Learning. We challenge you to build a predictive model that detects the type and breed of an animal based on its condition, appearance, and other factors.
Considering these unprecedented times that the world is facing due to the Coronavirus pandemic, we wish to do our bit and contribute the prize money for the welfare of society.
Machine Learning is an application of Artificial Intelligence (AI) that provides systems with the ability to automatically learn and improve from experiences without being explicitly programmed. Machine Learning is a Science that determines patterns in data. These patterns provide a deeper meaning to problems. First, it helps you understand the problems better and then solve the same with elegance.
Here is the new HackerEarth Machine Learning Challenge—Adopt a buddy
This challenge is designed to help you improve your Machine Learning skills by competing and learning from fellow participants.
To analyze and implement multiple algorithms, and determine which is more appropriate for a problem. To get hands-on experience of Machine Learning problems.
Working professionals. Data Science or Machine Learning enthusiasts. College students (if you understand the basics of predictive modeling).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
I notice there are some problems with the anomalous sequences. Please do not use the firing sequences marked as anomalous. I am investigating the problem and working towards a new release. I recommend not using this dataset for anomaly detection at the moment.
Testing hardware to qualify it for Spaceflight is critical to model and verify performances. Hot fire tests (also known as life-tests) are typically run during the qualification campaigns of satellite thrusters, but results remain proprietary data, hence making it difficult for the machine learning community to develop suitable data-driven predictive models. This synthetic dataset was generated partially based on the real-world physics of monopropellant chemical thrusters, to foster the development and benchmarking of new data-driven analytical methods (machine learning, deep-learning, etc.).
A monopropellant thruster is an engine that provides thrust using a single propellant, as opposed to bipropellant systems, which use the combustion of a fuel and an oxidizer. The propellant flow into the chamber is controlled by a valve, usually an integral part of the thruster. The propellant is injected into a catalyst bed, where it decomposes. A monopropellant must be a slightly unstable chemical that decomposes exothermically to produce a hot gas. The resulting hot gases are expelled through a converging/diverging nozzle, generating thrust. The gas temperature is high, which requires high-temperature alloys to manufacture the nozzle.
The most classical type of monopropellant thruster is the reaction control thruster, generating about 1 to 10 newtons of thrust using hydrazine as propellant. These reaction control thrusters are used, for instance, to control the attitude of a spacecraft and/or to desaturate the reaction wheels.
The performance of a monopropellant thruster (and its degradation) is mostly driven by the valve performance and the state of the catalyst bed on which the propellant decomposes. The life of the catalyst bed is mainly affected by the degradation of catalyst granules. The catalyst is made of alumina-based iridium metal granules (about 1 mm in diameter) that are carefully designed and selected to optimize its lifetime. However, catalyst granules are easily damaged by thermoelastic shocks, collisions with other granules, and so on; they are broken up into fine particles, which reduces their efficiency. After long firing durations, large voids form in the catalyst bed and induce unstable decomposition of hydrazine and degradation of thruster performance.
The properties of these simulated thruster fire tests are fictitious and not necessarily equivalent to a real-world thruster available on the market. Nevertheless, they provide sufficient granularity and challenge to benchmark algorithms that may then be tested on real fire test sequences. This is possible because the simulator is based, partially, on the real-world physics of such reaction control thrusters. The details of the simulator are deliberately not provided, to avoid leakage into the feature engineering methods and modelling approaches developed.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
The computational de novo design of new drugs and materials requires a thorough and unbiased exploration of chemical compound space. However, this space remains largely unexplored due to its combinatorial scaling with molecular size. To address this challenge, a dataset of ~134,000 stable small organic molecules composed of carbon (C), hydrogen (H), oxygen (O), nitrogen (N), and fluorine (F) has been meticulously computed. These molecules comprise all 133,885 species with up to nine heavy atoms (C, O, N, F) enumerated in the GDB-17 chemical universe, which encompasses 166 billion organic molecules.
For each molecule, computed geometric, energetic, electronic, and thermodynamic properties are provided, including:
This dataset offers a relevant, consistent, and comprehensive exploration of chemical space for small organic molecules, providing a valuable resource for benchmarking existing methods, developing new methodologies (such as hybrid quantum mechanics/machine learning approaches), and systematically identifying structure-property relationships [1].
[1] Ramakrishnan, Raghunathan, et al. "Quantum chemistry structures and properties of 134 kilo molecules." Scientific data 1.1 (2014): 1-7.
In this notebook, we aim to leverage this dataset (QM9) to predict the molecular properties of these small organic molecules using the Coulomb matrix representation. Specifically, we will focus on using the eigenvalues of the Coulomb matrix, which serve as a crucial descriptor for capturing the electronic structure of molecules for predicting molecular properties.
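As a concrete sketch of that representation: in the standard Coulomb matrix definition (Rupp et al., 2012), the diagonal entries are 0.5·Z_i^2.4 and the off-diagonal entries are Z_i·Z_j/|R_i − R_j|. The geometry below is an illustrative water-like molecule, not taken from QM9, and note that the canonical definition uses atomic units for distances:

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Coulomb matrix: C_ii = 0.5 * Z_i**2.4, C_ij = Z_i*Z_j / |R_i - R_j|."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                C[i, j] = 0.5 * Z[i] ** 2.4
            else:
                C[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return C

# Water-like geometry (illustrative values)
Z = [8, 1, 1]                        # O, H, H nuclear charges
R = [[ 0.000, 0.000, 0.000],
     [ 0.757, 0.586, 0.000],
     [-0.757, 0.586, 0.000]]

C = coulomb_matrix(Z, R)
# Sorted eigenvalues give a permutation-invariant molecular descriptor
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]
print(eigvals.shape)  # (3,)
```

Because the eigenvalues are invariant to atom ordering, padding each eigenvalue vector to the size of the largest molecule yields a fixed-length feature vector suitable for standard regression models.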
By the end of this notebook, you will have:
Let's begin by loading and exploring the dataset.
Enjoy! ⚛
No. | Property | Unit | Description |
---|---|---|---|
1 | tag | — | ‘gdb9’ string to facilitate extraction |
2 | i | — | Consecutive, 1-based integer identifier |
3 | A | GHz | Rotational constant |
4 | B | GHz | Rotational constant |
5 | C | GHz | Rotational constant |
6 | μ | D | Dipole moment |
7 | α | a₀³ | Isotropic polarizability |
8 | εHOMO | Ha | Energy of HOMO |
9 | εLUMO | Ha | Energy of LUMO |
10 | εgap | Ha | Gap (εLUMO − εHOMO) |
11 | ⟨R²⟩ | a₀² | Electronic spatial extent |
12 | zpve | Ha | Zero point vibrational energy |
13 | U0 | Ha | Internal energy at 0 K |
14 | U | Ha | Internal energy at 298.15 K |
15 | H | Ha | Enthalpy at 298.15 K |
16 | G | Ha | Free energy at 298.15 K |
17 | Cᵥ | cal/(mol·K) | Heat capacity at 298.15 K |
For each molecule, atomic coordinates and calculated properties are stored in a file named dataset_index.xyz. The XYZ format [1] is a widespread plain-text format for encoding Cartesian coordinates of molecules, with no formal specification. It contains a header line specifying the number of atoms nₐ, a comment line, and nₐ lines containing the element type and atomic coordinates, one atom per line. Here the comment line is used to store all scalar properties, and Mulliken charges are added as a fifth column. Harmonic vibrational frequencies, SMILES and InChI [2] are appended as additional lines.
[1] https://open-babel.readthedocs.io/en/latest/FileFormats/XYZ_cartesian_coordinates_format.html
[2] https://iupac.org/who-we-are/divisions/division-details/inchi/
| Line | Content |
|------|---------|
| 1 | Number of atoms nₐ |
| 2 | Comment line with all scalar properties |
| 3 to nₐ+2 | Element type, coordinates (x, y, z), and Mulliken charge, one atom per line |
| nₐ+3 | Harmonic vibrational frequencies |
| nₐ+4 | SMILES |
| nₐ+5 | InChI |
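A minimal parser for the core of this layout (atom count, property line, per-atom lines with a Mulliken-charge fifth column) can be sketched as follows. The sample string is hand-made with illustrative values, the parser ignores the appended frequency/SMILES/InChI lines, and some distributions of the real files use tabs and a nonstandard exponent notation that a production parser would need to handle:

```python
def parse_qm9_xyz(text):
    """Parse the extended XYZ layout described above:
    line 1 = atom count, line 2 = scalar properties,
    then one line per atom: element x y z mulliken_charge."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    properties = lines[1].split()
    atoms = []
    for line in lines[2 : 2 + n_atoms]:
        el, x, y, z, q = line.split()
        atoms.append((el, float(x), float(y), float(z), float(q)))
    return n_atoms, properties, atoms

# Tiny hand-made example in the same layout (values are illustrative)
sample = """3
gdb 1 157.7 157.7 157.7 0.0 13.21 -0.38 0.11 0.50 35.3 0.04 -40.47 -40.47 -40.47 -40.49 6.46
O  0.0000 0.0000 0.0000 -0.6
H  0.7570 0.5860 0.0000  0.3
H -0.7570 0.5860 0.0000  0.3
"""
n, props, atoms = parse_qm9_xyz(sample)
print(n, atoms[0][0])  # 3 O
```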
HackerEarth Deep Learning challenge: Keep babies safe (Sep 11, 07:30 PM IST - Oct 26, 07:30 PM IST)
The dataset consists of 1500 images depicting numerous baby products - baby-proofing kits, toys, gadgets, and the like.
The benefits of practicing this problem by using unsupervised Machine Learning/Deep Learning techniques are as follows:
This challenge encourages you to apply your unsupervised Deep Learning skills to build models that can extract, identify, and tag brand names of various products. This challenge will help you enhance your knowledge of image processing and optical character recognition (OCR), which is one of the advanced fields of Machine Learning and Artificial Intelligence. We challenge you to build a model that will tag images with corresponding brand names of baby/kid products.
Your task, as a Machine Learning expert, is to build a Deep Learning model that will tag each image with the extracted product types and brand names of these products. In case there is no brand name mentioned on a product, the model should tag the image as Unnamed.
Deep Learning is an application of Artificial Intelligence (AI) that provides systems with the ability to automatically learn and improve from experience without being explicitly programmed. Deep Learning is a science that determines patterns in data. These patterns provide deeper meaning to problems and help you to first understand problems better and then solve the same with elegance. HackerEarth’s Deep Learning challenge is designed to help you improve your Deep Learning skills by competing and learning from fellow participants.
Here’s presenting HackerEarth’s Deep Learning Challenge—Keep babies safe
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Datasets with dissolved gases concentrations in power transformer oil for remaining useful life (RUL), fault detection and diagnosis (FDD) problems.
Power transformers (PTs) are an important component of a nuclear power plant (NPP). They convert alternating voltage and are instrumental in the power supply of both external NPP energy consumers and NPPs themselves. Currently, many PTs have exceeded their planned service life, which has been extended beyond the designated 25 years. Due to this extension, monitoring the PT technical condition becomes an urgent matter.
An important method for monitoring and diagnosing PTs is Chromatographic Analysis of Dissolved Gas (CADG). It is based on the principle of forced extraction and analysis of dissolved gases from PT oil. Almost all types of equipment defects are accompanied by the formation of gases that dissolve in oil; certain types of defects generate certain gases in different quantities. Concentrations also differ at various stages of defect development, which allows the RUL of the PT to be calculated. At present, NPP control and diagnostic systems for PT equipment use predefined control limits for the concentration of dissolved gases in oil. The main disadvantages of this approach are the lack of automatic control and insufficient quality of diagnostics, especially for PTs with extended service life. To combat these shortcomings in diagnostic systems for the analysis of data obtained using CADG, machine learning (ML) methods can be used, as they are used in the diagnostics of many other NPP components.
The datasets are available as .csv files containing 420 records of gas concentration, presented as a time dependence. The gases are H₂, CO, C₂H₄, and C₂H₂. The period between time points is 12 hours. There are 3000 datasets split into train (2100 datasets) and test (900 datasets) sets.
For the RUL problem, annotations are available in separate files: each .csv file corresponds to a value, in points, equal to the time remaining until the equipment fails at the end of the record.
For FDD problems, there are labels (in the separate files) with four PT operating modes (classes): 1. Normal mode (2436 datasets); 2. Partial discharge: local dielectric breakdown in gas-filled cavities (127 datasets); 3. Low energy discharge: sparking or arc discharges in poor contact connections of structural elements with different or floating potential; discharges between PT core structural elements, high voltage winding taps and the tank, high voltage winding and grounding; discharges in oil during contact switching (162 datasets); 4. Low-temperature overheating: oil flow disruption in windings cooling channels, magnetic system causing low efficiency of the cooling system for temperatures < 300 °C (275 datasets).
Data in this repository is an extension (test set added) of data from here and here.
In our case, the fault detection problem becomes a classification problem, since each record belongs to one of four labeled classes (one normal and three anomalous), so the model's output needs to be a class number. The problem can be stated as binary classification (healthy/anomalous) for fault detection, or as multi-class classification (one of the 4 states) for fault diagnosis.
To ensure high-quality maintenance and repair, it is vital to be aware of potential malfunctions and predict RUL of transformer equipment. Therefore, it is necessary to create a mathematical model that will determine RUL by the final 420 points.
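Each record is a 420-point, 4-gas time series, so one simple approach to the classification formulation is to reduce every series to a handful of summary features and fit a standard classifier. The sketch below uses synthetic stand-ins for the real .csv files (flat "normal" series versus drifting "fault" series), and scikit-learn is assumed to be available:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def summarize(series):
    """Reduce a (420, 4) gas-concentration series to simple features:
    per-gas mean, per-gas std, and overall trend (last - first)."""
    return np.concatenate([series.mean(axis=0), series.std(axis=0),
                           series[-1] - series[0]])

def make_series(rising):
    """Synthetic stand-in: flat noise, optionally with an upward drift."""
    base = rng.normal(10, 1, size=(420, 4))
    if rising:
        base += np.linspace(0, 5, 420)[:, None]
    return base

X = np.array([summarize(make_series(i % 2 == 1)) for i in range(60)])
y = np.array([i % 2 for i in range(60)])          # 0 = normal, 1 = fault

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.score(X, y))
```

With the real data, the same summarize-then-classify pipeline extends directly to the four-class diagnosis labels, and the held-out test set (900 datasets) provides the evaluation split.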
As part of my Korean language learning hobby, I write and type out daily conversations from Naver Conversation of the Day. After getting introduced to data science and machine learning, I wanted to use programming to facilitate my learning process by collecting data and trying out projects. So I scraped data from Naver Dictionary using a Python script to be used later when I train a bilingual AI study buddy chatbot or automate Anki flashcards.
This is a corpus of Korean-English paired conversations (parallel text) extracted from Naver Dictionary. This dataset consists of 4563 parallel text pairs from December 4, 2017 to August 19, 2020 of Naver's Conversation of the Day. The files and their headers are listed below.
* conversations.csv
  * date - 'Conversation of the Day' date
  * conversation_id - ordered numbering to indicate conversation flow
  * kor_sent - Korean sentence
  * eng_sent - English translation
  * qna_id - from sender or receiver, message or feedback
* conversation_titles.csv
  * date - 'Conversation of the Day' date
  * kor_title - 'Conversation of the Day' title in Korean
  * eng_title - English translation of the title
  * grammar - grammar of the day
  * grammar_desc - grammar description
The data was collected from Naver Dictionary and the conversations were from the Korean Language Institute of Yonsei University.