47 datasets found
  1. introduction to machine learning

    • kaggle.com
    Updated Dec 11, 2024
    Cite
    chelbi Zineb (2024). introduction to machine learning [Dataset]. https://www.kaggle.com/datasets/chelbizineb/introduction-to-machine-learning/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 11, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    chelbi Zineb
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by chelbi Zineb

    Released under MIT

    Contents

  2. Introduction to Machine Learning - Part1

    • kaggle.com
    Updated Jan 21, 2021
    Cite
    Sachin Jain (2021). Introduction to Machine Learning - Part1 [Dataset]. https://www.kaggle.com/sachinlnm/introduction-to-machine-learning-part1
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 21, 2021
    Dataset provided by
    Kaggle
    Authors
    Sachin Jain
    Description

    Dataset

    This dataset was created by Sachin Jain

    Contents

  3. Intro to Machine Learning

    • kaggle.com
    zip
    Updated Jun 2, 2020
    + more versions
    Cite
    Vishal Kr. Srivastava (2020). Intro to Machine Learning [Dataset]. https://www.kaggle.com/vishalkrsrivastava/intro-to-machine-learning
    Explore at:
    zip (96211 bytes). Available download formats.
    Dataset updated
    Jun 2, 2020
    Authors
    Vishal Kr. Srivastava
    Description

    Dataset

    This dataset was created by Vishal Kr. Srivastava

    Contents

    It contains the following files:

  4. udacity-intro-to-machine-learning

    • kaggle.com
    Updated Jul 26, 2020
    Cite
    Lien Suitnatsnoc (2020). udacity-intro-to-machine-learning [Dataset]. https://www.kaggle.com/datasets/davydev/udacity-intro-to-ml/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 26, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Lien Suitnatsnoc
    Description

    Dataset

    This dataset was created by Constantius

    Contents

  5. Oracle_Kaggle

    • huggingface.co
    Cite
    bometon, Oracle_Kaggle [Dataset]. https://huggingface.co/datasets/Aktraiser/Oracle_Kaggle
    Explore at:
    Authors
    bometon
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Kaggle Oracle Dataset

    Expert Instruction-Following Data for Competitive Machine Learning

      Overview
    

    The Kaggle Oracle Dataset is a high-quality collection of instruction-response pairs tailored for fine-tuning LLMs to provide expert guidance in Kaggle competitions. Built from 14.9M+ kernels and 9,700 competitions, this is the most comprehensive dataset for competitive ML strategy.

      Highlights
    

    175 expert-curated instruction-response pairs; 100% real-world Kaggle… See the full description on the dataset page: https://huggingface.co/datasets/Aktraiser/Oracle_Kaggle.

  6. Dollar street 10 - 64x64x3

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated May 6, 2025
    + more versions
    Cite
    Sven van der burg; Sven van der burg (2025). Dollar street 10 - 64x64x3 [Dataset]. http://doi.org/10.5281/zenodo.10970014
    Explore at:
    bin. Available download formats.
    Dataset updated
    May 6, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sven van der burg; Sven van der burg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The MLCommons Dollar Street Dataset is a collection of images of everyday household items from homes around the world that visually captures socioeconomic diversity of traditionally underrepresented populations. It consists of public domain data, licensed for academic, commercial and non-commercial usage, under CC-BY and CC-BY-SA 4.0. The dataset was developed because similar datasets lack socioeconomic metadata and are not representative of global diversity.

    This is a subset of the original dataset that can be used for multiclass classification with 10 categories. It is designed to be used in teaching, similar to the widely used, but unlicensed CIFAR-10 dataset.

    These are the preprocessing steps that were performed:

    1. Only take examples with one imagenet_synonym label
    2. Use only examples with the 10 most frequently occurring labels
    3. Downscale images to 64 x 64 pixels
    4. Split data in train and test
    5. Store as numpy array

    This is the label mapping:

    | Category | Label |
    |----------|-------|
    | day bed | 0 |
    | dishrag | 1 |
    | plate | 2 |
    | running shoe | 3 |
    | soap dispenser | 4 |
    | street sign | 5 |
    | table lamp | 6 |
    | tile roof | 7 |
    | toilet seat | 8 |
    | washing machine | 9 |

    Check out this notebook to see how the subset was created: https://github.com/carpentries-lab/deep-learning-intro/blob/main/instructors/prepare-dollar-street-data.ipynb

    The original dataset was downloaded from https://www.kaggle.com/datasets/mlcommons/the-dollar-street-dataset. See https://mlcommons.org/datasets/dollar-street/ for more information.
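
    Below is a minimal loading sketch, assuming the subset is distributed as numpy arrays (as stated in the preprocessing steps). The file names are hypothetical; adjust the paths to the files actually present in the record.

    ```python
    # Minimal sketch: load the 64x64x3 subset and map integer labels back to category names.
    # File names under data/ are hypothetical; adjust them to the actual files in the record.
    import numpy as np

    LABELS = {
        0: "day bed", 1: "dishrag", 2: "plate", 3: "running shoe", 4: "soap dispenser",
        5: "street sign", 6: "table lamp", 7: "tile roof", 8: "toilet seat", 9: "washing machine",
    }

    x_train = np.load("data/x_train.npy")  # expected shape: (n_train, 64, 64, 3)
    y_train = np.load("data/y_train.npy")  # integer labels 0-9

    print(x_train.shape, y_train.shape)
    print("first label:", LABELS[int(y_train[0])])
    ```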

  7. ‘Medical Cost Personal Datasets’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Nov 12, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘ Medical Cost Personal Datasets’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-medical-cost-personal-datasets-703f/f489ee08/?iid=012-673&v=presentation
    Explore at:
    Dataset updated
    Nov 12, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘ Medical Cost Personal Datasets’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mirichoi0218/insurance on 12 November 2021.

    --- Dataset description provided by original source is as follows ---

    Context

    Machine Learning with R by Brett Lantz is a book that provides an introduction to machine learning using R. As far as I can tell, Packt Publishing does not make its datasets available online unless you buy the book and create a user account which can be a problem if you are checking the book out from the library or borrowing the book from a friend. All of these datasets are in the public domain but simply needed some cleaning up and recoding to match the format in the book.

    Content

    Columns:

    • age: age of primary beneficiary

    • sex: insurance contractor gender, female, male

    • bmi: Body mass index, an objective index of body weight relative to height (kg / m^2), indicating weights that are relatively high or low for a given height; the ideal range is 18.5 to 24.9

    • children: Number of children covered by health insurance / Number of dependents

    • smoker: whether the beneficiary smokes

    • region: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.

    • charges: Individual medical costs billed by health insurance

    Acknowledgements

    The dataset is available on GitHub here.

    Inspiration

    Can you accurately predict insurance costs?
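
    As a hedged illustration of that question, here is a minimal baseline regression on the columns described above. The file name insurance.csv and the use of scikit-learn are assumptions, not specified by the source.

    ```python
    # Hedged baseline sketch: predict charges from the described columns.
    # Assumes insurance.csv with columns age, sex, bmi, children, smoker, region, charges.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    df = pd.read_csv("insurance.csv")
    X = pd.get_dummies(df.drop(columns="charges"), drop_first=True)  # encode sex, smoker, region
    y = df["charges"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print("held-out R^2:", model.score(X_test, y_test))
    ```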

    --- Original source retains full ownership of the source dataset ---

  8. ‘Pokemon with stats’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Aug 22, 2016
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2016). ‘Pokemon with stats’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-pokemon-with-stats-2520/04882d1e/?iid=005-178&v=presentation
    Explore at:
    Dataset updated
    Aug 22, 2016
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Pokemon with stats’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/abcsds/pokemon on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    This data set includes 721 Pokemon, including their number, name, first and second type, and basic stats: HP, Attack, Defense, Special Attack, Special Defense, and Speed. It has been of great use when teaching statistics to kids. With certain types you can also give a geeky introduction to machine learning.

    These are the raw attributes that are used for calculating how much damage an attack will do in the games. This dataset is about the Pokemon games (NOT Pokemon cards or Pokemon Go).

    The data as described by Myles O'Neill is:

    • #: ID for each pokemon
    • Name: Name of each pokemon
    • Type 1: Each pokemon has a type, this determines weakness/resistance to attacks
    • Type 2: Some pokemon are dual type and have 2
    • Total: sum of all stats that come after this, a general guide to how strong a pokemon is
    • HP: hit points, or health, defines how much damage a pokemon can withstand before fainting
    • Attack: the base modifier for normal attacks (eg. Scratch, Punch)
    • Defense: the base damage resistance against normal attacks
    • SP Atk: special attack, the base modifier for special attacks (e.g. fire blast, bubble beam)
    • SP Def: the base damage resistance against special attacks
    • Speed: determines which pokemon attacks first each round

    The data for this table has been acquired from several different sites.

    One question has been answered with this database: the type of a Pokemon cannot be inferred only from its Attack and Defense. It would be worthwhile to find which two variables can define the type of a Pokemon, if any. Two variables can be plotted in a 2D space and used as an example for machine learning, which could make for a visual example any geeky Machine Learning class would love (see the sketch below).
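
    A minimal plotting sketch of that idea, assuming the file is Pokemon.csv and uses the column names listed above:

    ```python
    # Hedged sketch: plot two stats in 2D, colored by primary type.
    # Assumes Pokemon.csv with the columns listed above (Type 1, Attack, Defense, ...).
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("Pokemon.csv")
    for ptype, group in df.groupby("Type 1"):
        plt.scatter(group["Attack"], group["Defense"], label=ptype, s=10)

    plt.xlabel("Attack")
    plt.ylabel("Defense")
    plt.legend(fontsize=6, ncol=2)
    plt.title("Attack vs Defense by primary type")
    plt.show()
    ```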

    --- Original source retains full ownership of the source dataset ---

  9. ‘Pokemon’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Sep 3, 2019
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2019). ‘Pokemon’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-pokemon-a6b8/latest
    Explore at:
    Dataset updated
    Sep 3, 2019
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Pokemon’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mlomuscio/pokemon on 14 February 2022.

    --- Dataset description provided by original source is as follows ---

    I acquired the data from Alberto Barradas at https://www.kaggle.com/abcsds/pokemon. I needed to edit some of the variable names and remove the Total variable in order for my students to use this data for class. Otherwise, I would have just had them use his version of the data.

    This dataset is for my Introduction to Data Science and Machine Learning Course. Using a modified Pokémon dataset acquired from Kaggle.com, I created example code for students demonstrating how to explore data with R.

    Barradas provides the following description of each variable. I have modified the variable names to make them easier to deal with.

    • Num: ID for each Pokémon.
    • Name: Name of each Pokémon.
    • Type1: Each Pokémon has a type, this determines weakness/resistance to attacks.
    • Type2: Some Pokémon are dual type and have 2.
    • HP: Hit points, or health, defines how much damage a Pokémon can withstand before fainting.
    • Attack: The base modifier for normal attacks (eg. Scratch, Punch).
    • Defense: The base damage resistance against normal attacks.
    • SPAtk: Special attack, the base modifier for special attacks (e.g. fire blast, bubble beam).
    • SPDef: The base damage resistance against special attacks.
    • Speed: Determines which Pokémon attacks first each round.
    • Generation: Number of generation.
    • Legendary: True if Legendary Pokémon, False if not.

    --- Original source retains full ownership of the source dataset ---

  10. DEEP-VOICE: DeepFake Voice Recognition Dataset

    • paperswithcode.com
    Updated Aug 23, 2023
    Cite
    (2023). DEEP-VOICE: DeepFake Voice Recognition Dataset [Dataset]. https://paperswithcode.com/dataset/deep-voice-deepfake-voice-recognition
    Explore at:
    Dataset updated
    Aug 23, 2023
    Description

    DEEP-VOICE: Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion. This dataset contains examples of real human speech and DeepFake versions of those speeches generated using Retrieval-based Voice Conversion.

    Can machine learning be used to detect when speech is AI-generated?

    Introduction There are growing implications surrounding generative AI in the speech domain that enable voice cloning and real-time voice conversion from one individual to another. This technology poses a significant ethical threat and could lead to breaches of privacy and misrepresentation, thus there is an urgent need for real-time detection of AI-generated speech for DeepFake Voice Conversion.

    To address the above emerging issues, we are introducing the DEEP-VOICE dataset. DEEP-VOICE is comprised of real human speech from eight well-known figures and their speech converted to one another using Retrieval-based Voice Conversion.

    For each speech, the accompaniment ("background noise") was removed before conversion using RVC. The original accompaniment is then added back to the DeepFake speech:

    (Above: Overview of the Retrieval-based Voice Conversion process to generate DeepFake speech with Ryan Gosling's speech converted to Margot Robbie. Conversion is run on the extracted vocals before being layered on the original background ambience.)

    Dataset There are two forms to the dataset that are made available.

    First, the raw audio can be found in the "AUDIO" directory. They are arranged within "REAL" and "FAKE" class directories. The audio filenames note which speakers provided the real speech, and which voices they were converted to. For example "Obama-to-Biden" denotes that Barack Obama's speech has been converted to Joe Biden's voice.

    Second, the extracted features can be found in the "DATASET-balanced.csv" file. This is the data that was used in the study below. Each feature is extracted from one-second windows of audio, and the dataset is balanced through random sampling.
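
    A minimal classification sketch on those pre-extracted features; the label column name is an assumption, so check the actual header of DATASET-balanced.csv before running.

    ```python
    # Hedged sketch: REAL/FAKE classification on the pre-extracted one-second-window features.
    # The label column name "LABEL" is an assumption; match it to the actual CSV header.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("DATASET-balanced.csv")
    y = df["LABEL"]
    X = df.drop(columns=["LABEL"]).select_dtypes("number")

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    print("5-fold CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
    ```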

    Note: All experimental data is found within the "KAGGLE" directory. The "DEMONSTRATION" directory is used for playing cropped and compressed demos in notebooks due to Kaggle's limitations on file size.

    A successful system could potentially be used for the following:

    (Above: Usage of the real-time system. The end user is notified when the machine learning model has processed the speech audio (e.g. a phone or conference call) and predicted that audio chunks contain AI-generated speech.)

    Kaggle The dataset is available on the Kaggle data science platform.

    The Kaggle page can be found by clicking here: Dataset on Kaggle

    Attribution This dataset was produced from the study "Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion"

    The preprint can be found on ArXiv by clicking here: Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion

    License This dataset is provided under the MIT License:

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

  11. ‘Titanic: cleaned data’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Sep 30, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Titanic: cleaned data’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-titanic-cleaned-data-cbf4/dc9cd7ff/?iid=055-046&v=presentation
    Explore at:
    Dataset updated
    Sep 30, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Titanic: cleaned data’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/jamesleslie/titanic-cleaned-data on 30 September 2021.

    --- Dataset description provided by original source is as follows ---

    Introduction

    This dataset was created in this notebook as part of a three-part series. The data is in machine-learning-ready format, with all missing values for the Age, Fare and Embarked columns having been imputed.

    Data imputation

    • Age: this column was imputed by using the median age for the passenger's title (Mr, Mrs, Dr etc).
    • Fare: the single missing value in this column was imputed using the median value for that passenger's class.
    • Embarked: the two missing values here were imputed using the Pandas backfill method.
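
    For reference, here is a minimal pandas sketch that reproduces the imputation recipe above on the raw Kaggle Titanic file (train.csv with the standard Name, Age, Fare, Pclass and Embarked columns). It illustrates the recipe only and is not the notebook's exact code.

    ```python
    # Hedged sketch of the imputation recipe described above, applied to the raw Titanic data.
    import pandas as pd

    df = pd.read_csv("train.csv")

    # Age: median age per title (Mr, Mrs, Dr, ...), with the title extracted from the Name column
    df["Title"] = df["Name"].str.extract(r",\s*([^\.]+)\.", expand=False)
    df["Age"] = df["Age"].fillna(df.groupby("Title")["Age"].transform("median"))

    # Fare: median fare for the passenger's class
    df["Fare"] = df["Fare"].fillna(df.groupby("Pclass")["Fare"].transform("median"))

    # Embarked: pandas backfill
    df["Embarked"] = df["Embarked"].bfill()

    print(df[["Age", "Fare", "Embarked"]].isna().sum())
    ```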

    Usage

    This data is used in both the second and third parts of the series.

    --- Original source retains full ownership of the source dataset ---

  12. ‘Titanic Solution for Beginner's Guide’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 14, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Titanic Solution for Beginner's Guide’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-titanic-solution-for-beginner-s-guide-03a8/ae3641d4/?iid=014-162&v=presentation
    Explore at:
    Dataset updated
    Feb 14, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Titanic Solution for Beginner's Guide’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/harunshimanto/titanic-solution-for-beginners-guide on 14 February 2022.

    --- Dataset description provided by original source is as follows ---

    Overview

    The data has been split into two groups:

    training set (train.csv)
    test set (test.csv)
    

    The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

    The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

    We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
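
    As an illustration of that file, here is a minimal sketch that produces the same all-female-survive baseline from test.csv (standard Kaggle Titanic column names assumed):

    ```python
    # Hedged sketch: the gender_submission-style baseline (only female passengers predicted to survive).
    import pandas as pd

    test = pd.read_csv("test.csv")
    submission = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": (test["Sex"] == "female").astype(int),
    })
    submission.to_csv("submission.csv", index=False)
    print(submission.head())
    ```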

    Data Dictionary

    | Variable | Definition | Key |
    |----------|------------|-----|
    | survival | Survival | 0 = No, 1 = Yes |
    | pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
    | sex | Sex | |
    | Age | Age in years | |
    | sibsp | # of siblings / spouses aboard the Titanic | |
    | parch | # of parents / children aboard the Titanic | |
    | ticket | Ticket number | |
    | fare | Passenger fare | |
    | cabin | Cabin number | |
    | embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

    Variable Notes

    pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower

    age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

    sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)

    parch: The dataset defines family relations in this way... Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.

    --- Original source retains full ownership of the source dataset ---

  13. Real Estate Price Prediction Data

    • figshare.com
    txt
    Updated Aug 8, 2024
    Cite
    Mohammad Shbool; Rand Al-Dmour; Bashar Al-Shboul; Nibal Albashabsheh; Najat Almasarwah (2024). Real Estate Price Prediction Data [Dataset]. http://doi.org/10.6084/m9.figshare.26517325.v1
    Explore at:
    txt. Available download formats.
    Dataset updated
    Aug 8, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Mohammad Shbool; Rand Al-Dmour; Bashar Al-Shboul; Nibal Albashabsheh; Najat Almasarwah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview: This dataset was collected and curated to support research on predicting real estate prices using machine learning algorithms, specifically Support Vector Regression (SVR) and Gradient Boosting Machine (GBM). The dataset includes comprehensive information on residential properties, enabling the development and evaluation of predictive models for accurate and transparent real estate appraisals.

    Data Source: The data was sourced from Department of Lands and Survey real estate listings.

    Features: The dataset contains the following key attributes for each property:

    • Area (in square meters): The total living area of the property.
    • Floor Number: The floor on which the property is located.
    • Location: Geographic coordinates or city/region where the property is situated.
    • Type of Apartment: The classification of the property, such as studio, one-bedroom, two-bedroom, etc.
    • Number of Bathrooms: The total number of bathrooms in the property.
    • Number of Bedrooms: The total number of bedrooms in the property.
    • Property Age (in years): The number of years since the property was constructed.
    • Property Condition: A categorical variable indicating the condition of the property (e.g., new, good, fair, needs renovation).
    • Proximity to Amenities: The distance to nearby amenities such as schools, hospitals, shopping centers, and public transportation.
    • Market Price (target variable): The actual sale price or listed price of the property.

    Data Preprocessing:

    • Normalization: Numeric features such as area and proximity to amenities were normalized to ensure consistency and improve model performance.
    • Categorical Encoding: Categorical features like property condition and type of apartment were encoded using one-hot encoding or label encoding, depending on the specific model requirements.
    • Missing Values: Missing data points were handled using appropriate imputation techniques or by excluding records with significant missing information.

    Usage: This dataset was utilized to train and test machine learning models, aiming to predict the market price of residential properties based on the provided attributes. The models developed using this dataset demonstrated improved accuracy and transparency over traditional appraisal methods.

    Dataset Availability: The dataset is available for public use under CC BY 4.0. Users are encouraged to cite the related publication when using the data in their research or applications.

    Citation: If you use this dataset in your research, please cite the following publication: "Real Estate Decision-Making: Precision in Price Prediction through Advanced Machine Learning Algorithms".
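
    To make the described pipeline concrete, here is a hedged sketch of SVR and GBM models with the stated preprocessing (scaling numeric features, one-hot encoding categoricals). The file name and exact column names are assumptions taken from the feature list above.

    ```python
    # Hedged sketch: SVR and GBM with the preprocessing described above.
    # real_estate.csv and the column names are assumptions based on the feature list.
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVR
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("real_estate.csv")
    numeric = ["Area", "Floor Number", "Number of Bathrooms", "Number of Bedrooms",
               "Property Age", "Proximity to Amenities"]
    categorical = ["Location", "Type of Apartment", "Property Condition"]

    X_train, X_test, y_train, y_test = train_test_split(
        df[numeric + categorical], df["Market Price"], test_size=0.2, random_state=0)

    for name, reg in [("SVR", SVR()), ("GBM", GradientBoostingRegressor())]:
        pre = ColumnTransformer([("num", StandardScaler(), numeric),
                                 ("cat", OneHotEncoder(handle_unknown="ignore"), categorical)])
        model = Pipeline([("pre", pre), ("reg", reg)]).fit(X_train, y_train)
        print(name, "R^2:", model.score(X_test, y_test))
    ```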

  14. Network Traffic Android Malware

    • kaggle.com
    zip
    Updated Sep 12, 2019
    Cite
    Christian Urcuqui (2019). Network Traffic Android Malware [Dataset]. https://www.kaggle.com/datasets/xwolf12/network-traffic-android-malware
    Explore at:
    zip (116603 bytes). Available download formats.
    Dataset updated
    Sep 12, 2019
    Authors
    Christian Urcuqui
    Description

    Introduction

    Android is one of the most used mobile operating systems worldwide. Due to its technological impact, its open-source code, and the possibility of installing applications from third parties without any central control, Android has recently become a malware target. Even though it includes security mechanisms, recent news about malicious activities and Android's vulnerabilities points to the importance of continuing to develop methods and frameworks that improve its security.

    To prevent malware attacks, researchers and developers have proposed different security solutions, applying static analysis, dynamic analysis, and artificial intelligence. Indeed, data science has become a promising area in cybersecurity, since analytical models based on data allow for the discovery of insights that can help to predict malicious activities.

    In this work, we propose to consider some network layer features as the basis for machine learning models that can successfully detect malware applications, using open datasets from the research community.

    Content

    This dataset is based on another dataset (DroidCollector) where you can get all the network traffic as pcap files. In our research, we preprocessed those files to obtain the network features described in the following article:

    López, C. C. U., Villarreal, J. S. D., Belalcazar, A. F. P., Cadavid, A. N., & Cely, J. G. D. (2018, May). Features to Detect Android Malware. In 2018 IEEE Colombian Conference on Communications and Computing (COLCOM) (pp. 1-6). IEEE.

    Acknowledgements

    Cao, D., Wang, S., Li, Q., Cheny, Z., Yan, Q., Peng, L., & Yang, B. (2016, August). DroidCollector: A High Performance Framework for High Quality Android Traffic Collection. In Trustcom/BigDataSE/I SPA, 2016 IEEE (pp. 1753-1758). IEEE

  15. HackerEarth ML challenge: Adopt a buddy

    • kaggle.com
    Updated Jul 31, 2020
    Cite
    Manvendra Singh (2020). HackerEarth ML challenge: Adopt a buddy [Dataset]. https://www.kaggle.com/mannsingh/hackerearth-ml-challenge-pet-adoption/tasks
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 31, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Manvendra Singh
    Description

    Problem statement

    Having a pet is one of life’s most fulfilling experiences. Your pets spoil you with their love, compassion, and loyalty. And dare anyone lay a finger on you in your pet’s presence, they are in for a lot of trouble. Thanks to social media, videos of clumsy and fussy (yet adorable) pets from across the globe entertain you all day long. Their love is pure and infinite. So, in return, all pets deserve a warm and loving family, indeed. And occasional boops, of course.

    Numerous organizations across the world provide shelter to all homeless animals until they are adopted into a new home. However, finding a loving family for them can be a daunting task at times. This International Homeless Animals Day, we present a Machine Learning challenge to you: Adopt a buddy.

    The brighter side of the pandemic is an increase in animal adoption and fostering. To ensure that their customers stay indoors, a leading pet adoption agency plans on creating a virtual-tour experience, showcasing all animals available in their shelter. To enable that, you have been tasked to build a Machine Learning model that determines the type and breed of an animal based on its physical attributes and other factors.

    Dataset

    The dataset consists of parameters such as a unique ID assigned to each animal that is up for adoption, the date on which they arrived at the shelter, their physical attributes such as color, length, and height, among other factors.

    The benefits of practicing this problem by using Machine Learning techniques are as follows:

    This challenge will help you to actively enhance your knowledge of multi-label classification, one of the basic building blocks of Machine Learning. We challenge you to build a predictive model that detects the type and breed of an animal based on its condition, appearance, and other factors.

    Prizes

    Considering these unprecedented times that the world is facing due to the Coronavirus pandemic, we wish to do our bit and contribute the prize money for the welfare of society.

    Overview

    Machine Learning is an application of Artificial Intelligence (AI) that provides systems with the ability to automatically learn and improve from experiences without being explicitly programmed. Machine Learning is a Science that determines patterns in data. These patterns provide a deeper meaning to problems. First, it helps you understand the problems better and then solve the same with elegance.

    Here is the new HackerEarth Machine Learning Challenge—Adopt a buddy

    This challenge is designed to help you improve your Machine Learning skills by competing and learning from fellow participants.

    Why should you participate?

    To analyze and implement multiple algorithms, and determine which is more appropriate for a problem. To get hands-on experience of Machine Learning problems.

    Who should participate?

    Working professionals. Data Science or Machine Learning enthusiasts. College students (if you understand the basics of predictive modeling).

  16. Spacecraft Thruster Firing Tests Dataset

    • kaggle.com
    zip
    Updated Oct 15, 2022
    Cite
    astro_pat (2022). Spacecraft Thruster Firing Tests Dataset [Dataset]. https://www.kaggle.com/datasets/patrickfleith/spacecraft-thruster-firing-tests-dataset
    Explore at:
    zip (3194276953 bytes). Available download formats.
    Dataset updated
    Oct 15, 2022
    Authors
    astro_pat
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Warning - Anomalies currently not usable

    I noticed there are some problems with the anomalous sequences. Please do not use the firing sequences marked as anomalous. I am investigating the problem and working towards a new release. I recommend not using this dataset for anomaly detection at the moment.

    Context

    Testing hardware to qualify it for Spaceflight is critical to model and verify performances. Hot fire tests (also known as life-tests) are typically run during the qualification campaigns of satellite thrusters, but results remain proprietary data, hence making it difficult for the machine learning community to develop suitable data-driven predictive models. This synthetic dataset was generated partially based on the real-world physics of monopropellant chemical thrusters, to foster the development and benchmarking of new data-driven analytical methods (machine learning, deep-learning, etc.).

    Overview of chemical monopropellant thruster

    A monopropellant thruster is an engine that provides thrust by using a single propellant, as opposed to bipropellant systems, which use the combustion of a fuel and an oxidizer. The propellant flow into the chamber is controlled by a valve, usually an integral part of the thruster. The propellant is injected into a catalyst bed, where it decomposes. A monopropellant must be a slightly unstable chemical that decomposes exothermically to produce a hot gas. The resulting hot gases are expelled through a converging/diverging nozzle, generating thrust. The gas temperature is high, which requires high-temperature alloys to manufacture the nozzle.

    The most common type of monopropellant thruster is the reaction control thruster, generating about 1 to 10 newtons of thrust using hydrazine as propellant. These reaction control thrusters are used, for instance, to control the attitude of a spacecraft and/or to desaturate the reaction wheels.

    The performance of a monopropellant thruster (and its degradation) is mostly driven by the valve performance and the state of the catalyst bed on which the propellant decomposes. The life of the catalyst bed is mainly affected by the degradation of catalyst granules. The catalyst is made of alumina-based Indium metal granules (about 1 mm in diameter) that are carefully designed and selected to optimize its lifetime. However, catalyst granules are easily damaged by thermoelastic shocks, collisions with other granules, and so on; thus they are broken up into fine particles, which reduces their efficiency. After long firing durations, large voids form in the catalyst bed, inducing unstable decomposition of hydrazine and degradation of thruster performance.

    The properties of these simulated thruster firing tests are fictitious and not necessarily equivalent to a real-world thruster available on the market. Nevertheless, the data provides sufficient granularity and challenge to benchmark algorithms that may then be tested on real firing test sequences. This is possible because the simulator is based, in part, on the real-world physics of such reaction control thrusters. The details of the simulator are deliberately not provided, to avoid leakage into feature engineering methods and modelling approaches.

    Tasks and use cases

    Regression for Performance Prediction

    • Performance Modelling: Prediction of the thruster performances (the target can be thrust, mass flow rate, and/or the average specific impulse over a given sequence, which can be calculated from the first two). This task may also be referred to as "time series forecasting with exogenous inputs", or "system dynamics modelling with control", where the control is the command sent to the thruster (on/off) in the column "ton". It is very unlikely in practice to have more than a few complete SN qualification datasets, so we suggest a 50:50 split between the train and test sets (see the sketch after this list).
      • Use SN01 to SN12 to build your train and validation sets. These are data assumed to be collected on-ground.
      • Use SN13 to SN24 as the final test set. Warning: given that these thrusters are expected to be mounted on a flying satellite, you cannot retrain the model progressively using SN13 to SN24 data. The data is the ground truth as if we were able to measure it on-board.
    • Acceptance Test for Individualised Performance Model refinement:
      • In practice, every newly manufactured thruster is tested via acceptance tests: test_id 1 to 12. So these measurements are actually available for each thruster serial number. For this reason, it is possible to use the acceptance test data (test_id 1 to 12) of SN13 to SN24 as input to the model to predict the flight performance of these SN over test_id 13 to 112. Taking the acceptance tests of individual thrusters into account might help to generate an individualized thruster predictive model. This use case proposes to investigate...
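
    A small sketch of the suggested 50:50 split by serial number; the directory and per-file naming convention are assumptions for illustration and should be adapted to the actual dataset layout.

    ```python
    # Hedged sketch: assemble the suggested SN01-SN12 train / SN13-SN24 test split.
    # The directory and file-naming pattern (e.g. data/SN01_test_001.csv) are assumptions.
    from pathlib import Path

    data_dir = Path("data")
    train_sns = {f"SN{i:02d}" for i in range(1, 13)}  # ground qualification thrusters
    test_sns = {f"SN{i:02d}" for i in range(13, 25)}  # treated as unseen "flight" thrusters

    train_files = [p for p in data_dir.glob("*.csv") if p.name[:4] in train_sns]
    test_files = [p for p in data_dir.glob("*.csv") if p.name[:4] in test_sns]
    print(len(train_files), "train files,", len(test_files), "test files")
    ```
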
  17. QM9_molecules

    • kaggle.com
    Updated Sep 9, 2024
    Cite
    Mario Vozza (2024). QM9_molecules [Dataset]. https://www.kaggle.com/datasets/mariovozza5/qm9-molecules
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 9, 2024
    Dataset provided by
    Kaggle
    Authors
    Mario Vozza
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Notebook powered by https://daimoners.eu

    PREDICTING MOLECULAR PROPERTIES WITH MACHINE LEARNING

    Introduction and Objectives

    The computational de novo design of new drugs and materials requires a thorough and unbiased exploration of chemical compound space. However, this space remains largely unexplored due to its combinatorial scaling with molecular size. To address this challenge, a dataset of 134,000 stable small organic molecules composed of carbon (C), hydrogen (H), oxygen (O), nitrogen (N), and fluorine (F) has been meticulously computed. These molecules represent a subset of all 133,885 species with up to nine heavy atoms (C, O, N, F) from the GDB-17 chemical universe, which encompasses 166 billion organic molecules.

    For each molecule, computed geometric, energetic, electronic, and thermodynamic properties are provided, including:

    This dataset offers a relevant, consistent, and comprehensive exploration of chemical space for small organic molecules, providing a valuable resource for benchmarking existing methods, developing new methodologies (such as hybrid quantum mechanics/machine learning approaches), and systematically identifying structure-property relationships [1].

    [1] Ramakrishnan, Raghunathan, et al. "Quantum chemistry structures and properties of 134 kilo molecules." Scientific data 1.1 (2014): 1-7.

    In this notebook, we aim to leverage this dataset (QM9) to predict the molecular properties of these small organic molecules using the Coulomb matrix representation. Specifically, we will focus on using the eigenvalues of the Coulomb matrix, which serve as a crucial descriptor for capturing the electronic structure of molecules for predicting molecular properties.
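
    A minimal sketch of that descriptor, assuming the atomic numbers and Cartesian coordinates have already been parsed from the .xyz files; it uses the standard Coulomb matrix definition (0.5 * Z^2.4 on the diagonal, Z_i * Z_j / |R_i - R_j| off the diagonal, with distances in atomic units) and zero-pads the eigenvalue vector to a common size.

    ```python
    # Hedged sketch: Coulomb matrix eigenvalues as a fixed-size molecular descriptor.
    import numpy as np

    def coulomb_matrix(Z, R):
        """Z: (n,) atomic numbers; R: (n, 3) coordinates (atomic units in the standard definition)."""
        n = len(Z)
        C = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                C[i, j] = 0.5 * Z[i] ** 2.4 if i == j else Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
        return C

    def eigenvalue_descriptor(Z, R, size):
        eig = np.linalg.eigvalsh(coulomb_matrix(Z, R))
        eig = np.sort(np.abs(eig))[::-1]          # sort by magnitude, descending
        return np.pad(eig, (0, size - len(eig)))  # zero-pad to a common length

    # Example: water (O, H, H), padded to 29 entries (QM9 molecules have at most 29 atoms)
    Z = np.array([8.0, 1.0, 1.0])
    R = np.array([[0.0, 0.0, 0.0], [1.81, 0.0, 0.0], [-0.45, 1.75, 0.0]])  # rough geometry, Bohr
    print(eigenvalue_descriptor(Z, R, size=29)[:5])
    ```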

    By the end of this notebook, you will have:

    • Explored the dataset and visualized key molecular properties
    • Generated Coulomb matrices for the molecules in the dataset
    • Calculated the eigenvalues of the Coulomb matrices and predicted properties using machine learning models
    • Evaluated the performance of these models in accurately predicting molecular properties

    Let's begin by loading and exploring the dataset.

    Enjoy! ⚛

    Properties in the QM9 Dataset

    | No. | Property | Unit | Description |
    |-----|----------|------|-------------|
    | 1 | tag | - | ‘gdb9’ string to facilitate extraction |
    | 2 | i | - | Consecutive, 1-based integer identifier |
    | 3 | A | GHz | Rotational constant |
    | 4 | B | GHz | Rotational constant |
    | 5 | C | GHz | Rotational constant |
    | 6 | μ | D | Dipole moment |
    | 7 | α | a0³ | Isotropic polarizability |
    | 8 | εHOMO | Ha | Energy of HOMO |
    | 9 | εLUMO | Ha | Energy of LUMO |
    | 10 | εgap | Ha | Gap (εLUMO − εHOMO) |
    | 11 | ⟨R²⟩ | a0² | Electronic spatial extent |
    | 12 | zpve | Ha | Zero point vibrational energy |
    | 13 | U0 | Ha | Internal energy at 0 K |
    | 14 | U | Ha | Internal energy at 298.15 K |
    | 15 | H | Ha | Enthalpy at 298.15 K |
    | 16 | G | Ha | Free energy at 298.15 K |
    | 17 | Cv | cal/(mol·K) | Heat capacity at 298.15 K |

    Dataset Structure

    For each molecule, atomic coordinates and calculated properties are stored in a file named dataset_index.xyz. The XYZ format [1] is a widespread plain-text format for encoding Cartesian coordinates of molecules, with no formal specification. It contains a header line specifying the number of atoms n_a, a comment line, and n_a lines containing the element type and atomic coordinates, one atom per line. The comment line is used to store all scalar properties; Mulliken charges are added as a fifth column. Harmonic vibrational frequencies, SMILES and InChI [2] are appended as additional lines.

    [1] https://open-babel.readthedocs.io/en/latest/FileFormats/XYZ_cartesian_coordinates_format.html

    [2] https://iupac.org/who-we-are/divisions/division-details/inchi/

    QM9 xyz format

    | Line | Content | |------|----------------------------------------------------------...

  18. Keep babies safe

    • kaggle.com
    Updated Sep 22, 2020
    Cite
    Akash Gupta (2020). Keep babies safe [Dataset]. https://www.kaggle.com/akash14/keep-babies-safe/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 22, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Akash Gupta
    Description

    Context

    HackerEarth Deep Learning challenge: Keep babies safe (Sep 11, 07:30 PM IST - Oct 26, 07:30 PM IST)

    Content

    The dataset consists of 1500 images depicting numerous baby products - baby-proofing kits, toys, gadgets, and the like.

    The benefits of practicing this problem by using unsupervised Machine Learning/Deep Learning techniques are as follows:

    This challenge encourages you to apply your unsupervised Deep Learning skills to build models that can extract, identify, and tag brand names of various products. This challenge will help you enhance your knowledge of image processing and optical character recognition (OCR), which is one of the advanced fields of Machine Learning and Artificial Intelligence. We challenge you to build a model that will tag images with corresponding brand names of baby/kid products.

    Problem statement

    Your task, as a Machine Learning expert, is to build a Deep Learning model that will tag each image with the extracted product types and brand names of these products. In case there is no brand name mentioned on a product, the model should tag the image as Unnamed.

    Overview

    Deep Learning is an application of Artificial Intelligence (AI) that provides systems with the ability to automatically learn and improve from experience without being explicitly programmed. Deep Learning is a science that determines patterns in data. These patterns provide deeper meaning to problems and help you to first understand problems better and then solve the same with elegance. HackerEarth’s Deep Learning challenge is designed to help you improve your Deep Learning skills by competing and learning from fellow participants.

    Here’s presenting HackerEarth’s Deep Learning Challenge—Keep babies safe

  19. Power Transformers FDD and RUL

    • kaggle.com
    zip
    Updated Sep 1, 2024
    Cite
    Iurii Katser (2024). Power Transformers FDD and RUL [Dataset]. https://www.kaggle.com/datasets/yuriykatser/power-transformers-fdd-and-rul
    Explore at:
    zip (33405750 bytes). Available download formats.
    Dataset updated
    Sep 1, 2024
    Authors
    Iurii Katser
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Datasets with dissolved gas concentrations in power transformer oil for remaining useful life (RUL) and fault detection and diagnosis (FDD) problems.

    Introduction

    Power transformers (PTs) are an important component of a nuclear power plant (NPP). They convert alternating voltage and are instrumental in the power supply of both external NPP energy consumers and NPPs themselves. Currently, many PTs have exceeded their planned service life, which had already been extended beyond the designated 25 years. Due to the extension, monitoring the PT technical condition becomes an urgent matter.

    An important method for monitoring and diagnosing PTs is Chromatographic Analysis of Dissolved Gas (CADG). It is based on the principle of forced extraction and analysis of dissolved gases from PT oil. Almost all types of equipment defects are accompanied by the formation of gases that dissolve in oil; certain types of defects generate certain gases in different quantities. The concentrations also differ at various stages of defect development, which allows the RUL of the PT to be calculated. At present, NPP control and diagnostic systems for PT equipment use predefined control limits for the concentration of dissolved gases in oil. The main disadvantages of this approach are the lack of automatic control and the insufficient quality of diagnostics, especially for PTs with extended service life. To combat these shortcomings in diagnostic systems for the analysis of data obtained using CADG, machine learning (ML) methods can be used, as they are used in the diagnostics of many NPP components.

    Data description

    The datasets are available as .csv files containing 420 records of gas concentrations, presented as a time dependence. The gases are H2, CO, C2H4 and C2H2. The period between time points is 12 hours. There are 3000 datasets split into train (2100 datasets) and test (900 datasets) sets.

    For the RUL problem, annotations are available (in separate files): each .csv file is annotated with a value (in points) equal to the time remaining, at the end of the record, until the equipment fails.

    For FDD problems, there are labels (in the separate files) with four PT operating modes (classes): 1. Normal mode (2436 datasets); 2. Partial discharge: local dielectric breakdown in gas-filled cavities (127 datasets); 3. Low energy discharge: sparking or arc discharges in poor contact connections of structural elements with different or floating potential; discharges between PT core structural elements, high voltage winding taps and the tank, high voltage winding and grounding; discharges in oil during contact switching (162 datasets); 4. Low-temperature overheating: oil flow disruption in windings cooling channels, magnetic system causing low efficiency of the cooling system for temperatures < 300 °C (275 datasets).

    Data in this repository is an extension (test set added) of data from here and here.

    FDD problems statement

    In our case, the fault detection problem transforms into a classification problem, since the data is related to one of four labeled classes (including one normal and three anomalous), so the model's output needs to be a class number. The problem can be stated as binary classification (healthy/anomalous) for fault detection, or multi-class classification (one of 4 states) for fault diagnosis.
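
    A hedged sketch of the multi-class formulation, summarizing each 420-step record with simple per-gas statistics; the directory layout and label-file format are assumptions, not the repository's actual structure.

    ```python
    # Hedged sketch: 4-class FDD from per-gas summary statistics of each 420-step record.
    # The train/ directory and train_labels.csv layout are assumptions for illustration.
    from pathlib import Path
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    def featurize(csv_path):
        df = pd.read_csv(csv_path)  # expected columns: H2, CO, C2H4, C2H2
        return np.concatenate([df.mean(), df.std(), df.iloc[-1]])  # per-gas mean, std, last value

    labels = pd.read_csv("train_labels.csv", index_col=0)["class"]  # hypothetical label file
    X = np.stack([featurize(Path("train") / f"{idx}.csv") for idx in labels.index])

    clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, labels)
    print("train accuracy:", clf.score(X, labels))
    ```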

    RUL problem statement

    To ensure high-quality maintenance and repair, it is vital to be aware of potential malfunctions and predict RUL of transformer equipment. Therefore, it is necessary to create a mathematical model that will determine RUL by the final 420 points.

    Data usage examples

    • Dataset was used in this article.
    • Dataset was used in this research by Katser et al., which solves the problem by proposing an ensemble of classifiers.
  20. Korean - English Parallel Corpus

    • kaggle.com
    Updated Aug 19, 2020
    Cite
    Ramzel Renz Loto (2020). Korean - English Parallel Corpus [Dataset]. https://www.kaggle.com/rareloto/naver-dictionary-conversation-of-the-day/tasks
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 19, 2020
    Dataset provided by
    Kaggle
    Authors
    Ramzel Renz Loto
    Description

    Context

    As part of my Korean language learning hobby, I write and type out daily conversations from Naver Conversation of the Day. After getting introduced to data science and machine learning, I wanted to use programming to facilitate my learning process by collecting data and trying out projects. So I scraped data from Naver Dictionary using a Python script to be used later when I train a bilingual AI study buddy chatbot or automate Anki flashcards.

    Content

    This is a corpus of Korean - English paired conversations (parallel text) extracted from Naver Dictionary. The dataset consists of 4563 parallel text pairs from Naver's Conversation of the Day, covering December 4, 2017 to August 19, 2020. The files and their headers are listed below.

    • conversations.csv
      • date - 'Conversation of the Day' date
      • conversation_id - ordered numbering to indicate conversation flow
      • kor_sent - Korean sentence
      • eng_sent - English translation
      • qna_id - from sender or receiver, message or feedback
    • conversation_titles.csv
      • date - 'Conversation of the Day' date
      • kor_title - 'Conversation of the Day' title in Korean
      • eng_title - English translation of the title
      • grammar - grammar of the day
      • grammar_desc - grammar description
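
    A small sketch of how the two files could be loaded and joined into sentence pairs; the column names are taken from the header list above.

    ```python
    # Hedged sketch: load the parallel sentences and attach the daily titles.
    import pandas as pd

    conversations = pd.read_csv("conversations.csv")
    titles = pd.read_csv("conversation_titles.csv")

    pairs = conversations.merge(titles, on="date", how="left")
    pairs = pairs.sort_values(["date", "conversation_id"])
    print(pairs[["date", "kor_sent", "eng_sent", "eng_title"]].head())
    ```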

    Acknowledgements

    The data was collected from Naver Dictionary and the conversations were from the Korean Language Institute of Yonsei University.
