47 datasets found
  1. introduction to machine learning

    • kaggle.com
    Updated Dec 11, 2024
    Cite
    chelbi Zineb (2024). introduction to machine learning [Dataset]. https://www.kaggle.com/datasets/chelbizineb/introduction-to-machine-learning/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 11, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    chelbi Zineb
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by chelbi Zineb

    Released under MIT

    Contents

  2. Introduction to Machine Learning - Part1

    • kaggle.com
    Updated Jan 21, 2021
    Cite
    Sachin Jain (2021). Introduction to Machine Learning - Part1 [Dataset]. https://www.kaggle.com/sachinlnm/introduction-to-machine-learning-part1
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 21, 2021
    Dataset provided by
    Kaggle
    Authors
    Sachin Jain
    Description

    Dataset

    This dataset was created by Sachin Jain

    Contents

  3. Intro to Machine Learning

    • kaggle.com
    zip
    Updated Jun 2, 2020
    + more versions
    Cite
    Vishal Kr. Srivastava (2020). Intro to Machine Learning [Dataset]. https://www.kaggle.com/vishalkrsrivastava/intro-to-machine-learning
    Explore at:
    zip (96211 bytes). Available download formats.
    Dataset updated
    Jun 2, 2020
    Authors
    Vishal Kr. Srivastava
    Description

    Dataset

    This dataset was created by Vishal Kr. Srivastava

    Contents

    It contains the following files:

  4. udacity-intro-to-machine-learning

    • kaggle.com
    Updated Jul 26, 2020
    Cite
    Lien Suitnatsnoc (2020). udacity-intro-to-machine-learning [Dataset]. https://www.kaggle.com/datasets/davydev/udacity-intro-to-ml/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 26, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Lien Suitnatsnoc
    Description

    Dataset

    This dataset was created by Constantius

    Contents

  5. Oracle_Kaggle

    • huggingface.co
    Cite
    bometon, Oracle_Kaggle [Dataset]. https://huggingface.co/datasets/Aktraiser/Oracle_Kaggle
    Explore at:
    Authors
    bometon
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Kaggle Oracle Dataset

    Expert Instruction-Following Data for Competitive Machine Learning

      Overview
    

    The Kaggle Oracle Dataset is a high-quality collection of instruction-response pairs tailored for fine-tuning LLMs to provide expert guidance in Kaggle competitions. Built from 14.9M+ kernels and 9,700 competitions, this is the most comprehensive dataset for competitive ML strategy.

      Highlights
    

    175 expert-curated instruction-response pairs; 100% real-world Kaggle… See the full description on the dataset page: https://huggingface.co/datasets/Aktraiser/Oracle_Kaggle.

  6. Dollar street 10 - 64x64x3

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated May 6, 2025
    + more versions
    Cite
    Sven van der burg; Sven van der burg (2025). Dollar street 10 - 64x64x3 [Dataset]. http://doi.org/10.5281/zenodo.10970014
    Explore at:
    bin. Available download formats.
    Dataset updated
    May 6, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sven van der burg; Sven van der burg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The MLCommons Dollar Street Dataset is a collection of images of everyday household items from homes around the world that visually captures socioeconomic diversity of traditionally underrepresented populations. It consists of public domain data, licensed for academic, commercial and non-commercial usage, under CC-BY and CC-BY-SA 4.0. The dataset was developed because similar datasets lack socioeconomic metadata and are not representative of global diversity.

    This is a subset of the original dataset that can be used for multiclass classification with 10 categories. It is designed to be used in teaching, similar to the widely used, but unlicensed CIFAR-10 dataset.

    These are the preprocessing steps that were performed:

    1. Only take examples with one imagenet_synonym label
    2. Use only examples with the 10 most frequently occurring labels
    3. Downscale images to 64 x 64 pixels
    4. Split data in train and test
    5. Store as numpy array

    This is the label mapping:

    | Category | Label |
    |----------|-------|
    | day bed | 0 |
    | dishrag | 1 |
    | plate | 2 |
    | running shoe | 3 |
    | soap dispenser | 4 |
    | street sign | 5 |
    | table lamp | 6 |
    | tile roof | 7 |
    | toilet seat | 8 |
    | washing machine | 9 |

    Check out this notebook to see how the subset was created: https://github.com/carpentries-lab/deep-learning-intro/blob/main/instructors/prepare-dollar-street-data.ipynb

    The original dataset was downloaded from https://www.kaggle.com/datasets/mlcommons/the-dollar-street-dataset. See https://mlcommons.org/datasets/dollar-street/ for more information.
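
    Below is a minimal loading sketch, assuming the subset is distributed as numpy arrays (as stated in the preprocessing steps). The file names are hypothetical; adjust the paths to the files actually present in the record.

    ```python
    # Minimal sketch: load the 64x64x3 subset and map integer labels back to category names.
    # File names under data/ are hypothetical; adjust them to the actual files in the record.
    import numpy as np

    LABELS = {
        0: "day bed", 1: "dishrag", 2: "plate", 3: "running shoe", 4: "soap dispenser",
        5: "street sign", 6: "table lamp", 7: "tile roof", 8: "toilet seat", 9: "washing machine",
    }

    x_train = np.load("data/x_train.npy")  # expected shape: (n_train, 64, 64, 3)
    y_train = np.load("data/y_train.npy")  # integer labels 0-9

    print(x_train.shape, y_train.shape)
    print("first label:", LABELS[int(y_train[0])])
    ```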

  7. ‘Medical Cost Personal Datasets’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Nov 12, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘ Medical Cost Personal Datasets’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-medical-cost-personal-datasets-703f/f489ee08/?iid=012-673&v=presentation
    Explore at:
    Dataset updated
    Nov 12, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘ Medical Cost Personal Datasets’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mirichoi0218/insurance on 12 November 2021.

    --- Dataset description provided by original source is as follows ---

    Context

    Machine Learning with R by Brett Lantz is a book that provides an introduction to machine learning using R. As far as I can tell, Packt Publishing does not make its datasets available online unless you buy the book and create a user account which can be a problem if you are checking the book out from the library or borrowing the book from a friend. All of these datasets are in the public domain but simply needed some cleaning up and recoding to match the format in the book.

    Content

    Columns:

    • age: age of primary beneficiary

    • sex: insurance contractor gender, female, male

    • bmi: Body mass index, an objective index of body weight relative to height (kg / m^2), indicating weights that are relatively high or low for a given height; the ideal range is 18.5 to 24.9

    • children: Number of children covered by health insurance / Number of dependents

    • smoker: whether the beneficiary smokes

    • region: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.

    • charges: Individual medical costs billed by health insurance

    Acknowledgements

    The dataset is available on GitHub here.

    Inspiration

    Can you accurately predict insurance costs?
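
    As a hedged illustration of that question, here is a minimal baseline regression on the columns described above. The file name insurance.csv and the use of scikit-learn are assumptions, not specified by the source.

    ```python
    # Hedged baseline sketch: predict charges from the described columns.
    # Assumes insurance.csv with columns age, sex, bmi, children, smoker, region, charges.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    df = pd.read_csv("insurance.csv")
    X = pd.get_dummies(df.drop(columns="charges"), drop_first=True)  # encode sex, smoker, region
    y = df["charges"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print("held-out R^2:", model.score(X_test, y_test))
    ```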

    --- Original source retains full ownership of the source dataset ---

  8. ‘Pokemon with stats’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Aug 22, 2016
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2016). ‘Pokemon with stats’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-pokemon-with-stats-2520/04882d1e/?iid=005-178&v=presentation
    Explore at:
    Dataset updated
    Aug 22, 2016
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Pokemon with stats’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/abcsds/pokemon on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    This data set includes 721 Pokemon, including their number, name, first and second type, and basic stats: HP, Attack, Defense, Special Attack, Special Defense, and Speed. It has been of great use when teaching statistics to kids. With certain types you can also give a geeky introduction to machine learning.

    These are the raw attributes that are used for calculating how much damage an attack will do in the games. This dataset is about the Pokemon games (NOT Pokemon cards or Pokemon Go).

    The data as described by Myles O'Neill is:

    • #: ID for each pokemon
    • Name: Name of each pokemon
    • Type 1: Each pokemon has a type, this determines weakness/resistance to attacks
    • Type 2: Some pokemon are dual type and have 2
    • Total: sum of all stats that come after this, a general guide to how strong a pokemon is
    • HP: hit points, or health, defines how much damage a pokemon can withstand before fainting
    • Attack: the base modifier for normal attacks (eg. Scratch, Punch)
    • Defense: the base damage resistance against normal attacks
    • SP Atk: special attack, the base modifier for special attacks (e.g. fire blast, bubble beam)
    • SP Def: the base damage resistance against special attacks
    • Speed: determines which pokemon attacks first each round

    The data for this table has been acquired from several different sites.

    One question has been answered with this database: the type of a Pokemon cannot be inferred only from its Attack and Defense. It would be worthwhile to find which two variables can define the type of a Pokemon, if any. Two variables can be plotted in a 2D space and used as an example for machine learning, which could make for a visual example any geeky Machine Learning class would love (see the sketch below).
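
    A minimal plotting sketch of that idea, assuming the file is Pokemon.csv and uses the column names listed above:

    ```python
    # Hedged sketch: plot two stats in 2D, colored by primary type.
    # Assumes Pokemon.csv with the columns listed above (Type 1, Attack, Defense, ...).
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("Pokemon.csv")
    for ptype, group in df.groupby("Type 1"):
        plt.scatter(group["Attack"], group["Defense"], label=ptype, s=10)

    plt.xlabel("Attack")
    plt.ylabel("Defense")
    plt.legend(fontsize=6, ncol=2)
    plt.title("Attack vs Defense by primary type")
    plt.show()
    ```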

    --- Original source retains full ownership of the source dataset ---

  9. ‘Pokemon’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Sep 3, 2019
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2019). ‘Pokemon’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-pokemon-a6b8/latest
    Explore at:
    Dataset updated
    Sep 3, 2019
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Pokemon’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mlomuscio/pokemon on 14 February 2022.

    --- Dataset description provided by original source is as follows ---

    I acquired the data from Alberto Barradas at https://www.kaggle.com/abcsds/pokemon. I needed to edit some of the variable names and remove the Total variable in order for my students to use this data for class. Otherwise, I would have just had them use his version of the data.

    This dataset is for my Introduction to Data Science and Machine Learning Course. Using a modified Pokémon dataset acquired from Kaggle.com, I created example code for students demonstrating how to explore data with R.

    Barradas provides the following description of each variable. I have modified the variable names to make them easier to deal with.

    • Num: ID for each Pokémon.
    • Name: Name of each Pokémon.
    • Type1: Each Pokémon has a type, this determines weakness/resistance to attacks.
    • Type2: Some Pokémon are dual type and have 2.
    • HP: Hit points, or health, defines how much damage a Pokémon can withstand before fainting.
    • Attack: The base modifier for normal attacks (eg. Scratch, Punch).
    • Defense: The base damage resistance against normal attacks.
    • SPAtk: Special attack, the base modifier for special attacks (e.g. fire blast, bubble beam).
    • SPDef: The base damage resistance against special attacks.
    • Speed: Determines which Pokémon attacks first each round.
    • Generation: Number of generation.
    • Legendary: True if Legendary Pokémon, False if not.

    --- Original source retains full ownership of the source dataset ---

  10. DEEP-VOICE: DeepFake Voice Recognition Dataset

    • paperswithcode.com
    Updated Aug 23, 2023
    Cite
    (2023). DEEP-VOICE: DeepFake Voice Recognition Dataset [Dataset]. https://paperswithcode.com/dataset/deep-voice-deepfake-voice-recognition
    Explore at:
    Dataset updated
    Aug 23, 2023
    Description

    DEEP-VOICE: Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion. This dataset contains examples of real human speech and DeepFake versions of those speeches generated using Retrieval-based Voice Conversion.

    Can machine learning be used to detect when speech is AI-generated?

    Introduction There are growing implications surrounding generative AI in the speech domain that enable voice cloning and real-time voice conversion from one individual to another. This technology poses a significant ethical threat and could lead to breaches of privacy and misrepresentation, thus there is an urgent need for real-time detection of AI-generated speech for DeepFake Voice Conversion.

    To address the above emerging issues, we are introducing the DEEP-VOICE dataset. DEEP-VOICE is comprised of real human speech from eight well-known figures and their speech converted to one another using Retrieval-based Voice Conversion.

    For each speech, the accompaniment ("background noise") was removed before conversion using RVC. The original accompaniment is then added back to the DeepFake speech:

    (Above: Overview of the Retrieval-based Voice Conversion process to generate DeepFake speech with Ryan Gosling's speech converted to Margot Robbie. Conversion is run on the extracted vocals before being layered on the original background ambience.)

    Dataset There are two forms to the dataset that are made available.

    First, the raw audio can be found in the "AUDIO" directory. They are arranged within "REAL" and "FAKE" class directories. The audio filenames note which speakers provided the real speech, and which voices they were converted to. For example "Obama-to-Biden" denotes that Barack Obama's speech has been converted to Joe Biden's voice.

    Second, the extracted features can be found in the "DATASET-balanced.csv" file. This is the data that was used in the study below. Each feature is extracted from one-second windows of audio, and the dataset is balanced through random sampling.
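
    A minimal classification sketch on those pre-extracted features; the label column name is an assumption, so check the actual header of DATASET-balanced.csv before running.

    ```python
    # Hedged sketch: REAL/FAKE classification on the pre-extracted one-second-window features.
    # The label column name "LABEL" is an assumption; match it to the actual CSV header.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("DATASET-balanced.csv")
    y = df["LABEL"]
    X = df.drop(columns=["LABEL"]).select_dtypes("number")

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    print("5-fold CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
    ```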

    Note: All experimental data is found within the "KAGGLE" directory. The "DEMONSTRATION" directory is used for playing cropped and compressed demos in notebooks due to Kaggle's limitations on file size.

    A successful system could potentially be used for the following:

    (Above: Usage of the real-time system. The end user is notified when the machine learning model has processed the speech audio (e.g. a phone or conference call) and predicted that audio chunks contain AI-generated speech.)

    Kaggle The dataset is available on the Kaggle data science platform.

    The Kaggle page can be found by clicking here: Dataset on Kaggle

    Attribution This dataset was produced from the study "Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion"

    The preprint can be found on ArXiv by clicking here: Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion

    License This dataset is provided under the MIT License:

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

  11. ‘Titanic: cleaned data’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Sep 30, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Titanic: cleaned data’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-titanic-cleaned-data-cbf4/dc9cd7ff/?iid=055-046&v=presentation
    Explore at:
    Dataset updated
    Sep 30, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Titanic: cleaned data’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/jamesleslie/titanic-cleaned-data on 30 September 2021.

    --- Dataset description provided by original source is as follows ---

    Introduction

    This dataset was created in this notebook as part of a three-part series. The data is in machine-learning-ready format, with all missing values for the Age, Fare and Embarked columns having been imputed.

    Data imputation

    • Age: this column was imputed by using the median age for the passenger's title (Mr, Mrs, Dr etc).
    • Fare: the single missing value in this column was imputed using the median value for that passenger's class.
    • Embarked: the two missing values here were imputed using the Pandas backfill method.
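
    For reference, here is a minimal pandas sketch that reproduces the imputation recipe above on the raw Kaggle Titanic file (train.csv with the standard Name, Age, Fare, Pclass and Embarked columns). It illustrates the recipe only and is not the notebook's exact code.

    ```python
    # Hedged sketch of the imputation recipe described above, applied to the raw Titanic data.
    import pandas as pd

    df = pd.read_csv("train.csv")

    # Age: median age per title (Mr, Mrs, Dr, ...), with the title extracted from the Name column
    df["Title"] = df["Name"].str.extract(r",\s*([^\.]+)\.", expand=False)
    df["Age"] = df["Age"].fillna(df.groupby("Title")["Age"].transform("median"))

    # Fare: median fare for the passenger's class
    df["Fare"] = df["Fare"].fillna(df.groupby("Pclass")["Fare"].transform("median"))

    # Embarked: pandas backfill
    df["Embarked"] = df["Embarked"].bfill()

    print(df[["Age", "Fare", "Embarked"]].isna().sum())
    ```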

    Usage

    This data is used in both the second and third parts of the series.

    --- Original source retains full ownership of the source dataset ---

  12. ‘Titanic Solution for Beginner's Guide’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 14, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Titanic Solution for Beginner's Guide’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-titanic-solution-for-beginner-s-guide-03a8/ae3641d4/?iid=014-162&v=presentation
    Explore at:
    Dataset updated
    Feb 14, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Titanic Solution for Beginner's Guide’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/harunshimanto/titanic-solution-for-beginners-guide on 14 February 2022.

    --- Dataset description provided by original source is as follows ---

    Overview

    The data has been split into two groups:

    training set (train.csv)
    test set (test.csv)
    

    The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

    The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

    We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
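
    As an illustration of that file, here is a minimal sketch that produces the same all-female-survive baseline from test.csv (standard Kaggle Titanic column names assumed):

    ```python
    # Hedged sketch: the gender_submission-style baseline (only female passengers predicted to survive).
    import pandas as pd

    test = pd.read_csv("test.csv")
    submission = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": (test["Sex"] == "female").astype(int),
    })
    submission.to_csv("submission.csv", index=False)
    print(submission.head())
    ```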

    Data Dictionary

    | Variable | Definition | Key |
    |----------|------------|-----|
    | survival | Survival | 0 = No, 1 = Yes |
    | pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
    | sex | Sex | |
    | Age | Age in years | |
    | sibsp | # of siblings / spouses aboard the Titanic | |
    | parch | # of parents / children aboard the Titanic | |
    | ticket | Ticket number | |
    | fare | Passenger fare | |
    | cabin | Cabin number | |
    | embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

    Variable Notes

    pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower

    age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

    sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)

    parch: The dataset defines family relations in this way... Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.

    --- Original source retains full ownership of the source dataset ---

  13. Real Estate Price Prediction Data

    • figshare.com
    txt
    Updated Aug 8, 2024
    Cite
    Mohammad Shbool; Rand Al-Dmour; Bashar Al-Shboul; Nibal Albashabsheh; Najat Almasarwah (2024). Real Estate Price Prediction Data [Dataset]. http://doi.org/10.6084/m9.figshare.26517325.v1
    Explore at:
    txt. Available download formats.
    Dataset updated
    Aug 8, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Mohammad Shbool; Rand Al-Dmour; Bashar Al-Shboul; Nibal Albashabsheh; Najat Almasarwah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview: This dataset was collected and curated to support research on predicting real estate prices using machine learning algorithms, specifically Support Vector Regression (SVR) and Gradient Boosting Machine (GBM). The dataset includes comprehensive information on residential properties, enabling the development and evaluation of predictive models for accurate and transparent real estate appraisals.

    Data Source: The data was sourced from Department of Lands and Survey real estate listings.

    Features: The dataset contains the following key attributes for each property:

    • Area (in square meters): The total living area of the property.
    • Floor Number: The floor on which the property is located.
    • Location: Geographic coordinates or city/region where the property is situated.
    • Type of Apartment: The classification of the property, such as studio, one-bedroom, two-bedroom, etc.
    • Number of Bathrooms: The total number of bathrooms in the property.
    • Number of Bedrooms: The total number of bedrooms in the property.
    • Property Age (in years): The number of years since the property was constructed.
    • Property Condition: A categorical variable indicating the condition of the property (e.g., new, good, fair, needs renovation).
    • Proximity to Amenities: The distance to nearby amenities such as schools, hospitals, shopping centers, and public transportation.
    • Market Price (target variable): The actual sale price or listed price of the property.

    Data Preprocessing:

    • Normalization: Numeric features such as area and proximity to amenities were normalized to ensure consistency and improve model performance.
    • Categorical Encoding: Categorical features like property condition and type of apartment were encoded using one-hot encoding or label encoding, depending on the specific model requirements.
    • Missing Values: Missing data points were handled using appropriate imputation techniques or by excluding records with significant missing information.

    Usage: This dataset was utilized to train and test machine learning models, aiming to predict the market price of residential properties based on the provided attributes. The models developed using this dataset demonstrated improved accuracy and transparency over traditional appraisal methods.

    Dataset Availability: The dataset is available for public use under CC BY 4.0. Users are encouraged to cite the related publication when using the data in their research or applications.

    Citation: If you use this dataset in your research, please cite the following publication: "Real Estate Decision-Making: Precision in Price Prediction through Advanced Machine Learning Algorithms".
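
    To make the described pipeline concrete, here is a hedged sketch of SVR and GBM models with the stated preprocessing (scaling numeric features, one-hot encoding categoricals). The file name and exact column names are assumptions taken from the feature list above.

    ```python
    # Hedged sketch: SVR and GBM with the preprocessing described above.
    # real_estate.csv and the column names are assumptions based on the feature list.
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVR
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("real_estate.csv")
    numeric = ["Area", "Floor Number", "Number of Bathrooms", "Number of Bedrooms",
               "Property Age", "Proximity to Amenities"]
    categorical = ["Location", "Type of Apartment", "Property Condition"]

    X_train, X_test, y_train, y_test = train_test_split(
        df[numeric + categorical], df["Market Price"], test_size=0.2, random_state=0)

    for name, reg in [("SVR", SVR()), ("GBM", GradientBoostingRegressor())]:
        pre = ColumnTransformer([("num", StandardScaler(), numeric),
                                 ("cat", OneHotEncoder(handle_unknown="ignore"), categorical)])
        model = Pipeline([("pre", pre), ("reg", reg)]).fit(X_train, y_train)
        print(name, "R^2:", model.score(X_test, y_test))
    ```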

  14. Network Traffic Android Malware

    • kaggle.com
    zip
    Updated Sep 12, 2019
    Cite
    Christian Urcuqui (2019). Network Traffic Android Malware [Dataset]. https://www.kaggle.com/datasets/xwolf12/network-traffic-android-malware
    Explore at:
    zip (116603 bytes). Available download formats.
    Dataset updated
    Sep 12, 2019
    Authors
    Christian Urcuqui
    Description

    Introduction

    Android is one of the most used mobile operating systems worldwide. Due to its technological impact, its open-source code, and the possibility of installing applications from third parties without any central control, Android has recently become a malware target. Even though it includes security mechanisms, recent news about malicious activities and Android's vulnerabilities points to the importance of continuing to develop methods and frameworks that improve its security.

    To prevent malware attacks, researchers and developers have proposed different security solutions, applying static analysis, dynamic analysis, and artificial intelligence. Indeed, data science has become a promising area in cybersecurity, since analytical models based on data allow for the discovery of insights that can help to predict malicious activities.

    In this work, we propose to consider some network layer features as the basis for machine learning models that can successfully detect malware applications, using open datasets from the research community.

    Content

    This dataset is based on another dataset (DroidCollector) where you can get all the network traffic as pcap files. In our research, we preprocessed those files to obtain the network features described in the following article:

    López, C. C. U., Villarreal, J. S. D., Belalcazar, A. F. P., Cadavid, A. N., & Cely, J. G. D. (2018, May). Features to Detect Android Malware. In 2018 IEEE Colombian Conference on Communications and Computing (COLCOM) (pp. 1-6). IEEE.

    Acknowledgements

    Cao, D., Wang, S., Li, Q., Cheny, Z., Yan, Q., Peng, L., & Yang, B. (2016, August). DroidCollector: A High Performance Framework for High Quality Android Traffic Collection. In Trustcom/BigDataSE/I SPA, 2016 IEEE (pp. 1753-1758). IEEE

  15. HackerEarth ML challenge: Adopt a buddy

    • kaggle.com
    Updated Jul 31, 2020
    Cite
    Manvendra Singh (2020). HackerEarth ML challenge: Adopt a buddy [Dataset]. https://www.kaggle.com/mannsingh/hackerearth-ml-challenge-pet-adoption/tasks
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 31, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Manvendra Singh
    Description

    Problem statement

    Having a pet is one of life’s most fulfilling experiences. Your pets spoil you with their love, compassion, and loyalty. And dare anyone lay a finger on you in your pet’s presence, they are in for a lot of trouble. Thanks to social media, videos of clumsy and fussy (yet adorable) pets from across the globe entertain you all day long. Their love is pure and infinite. So, in return, all pets deserve a warm and loving family, indeed. And occasional boops, of course.

    Numerous organizations across the world provide shelter to all homeless animals until they are adopted into a new home. However, finding a loving family for them can be a daunting task at times. This International Homeless Animals Day, we present a Machine Learning challenge to you: Adopt a buddy.

    The brighter side of the pandemic is an increase in animal adoption and fostering. To ensure that their customers stay indoors, a leading pet adoption agency plans on creating a virtual-tour experience, showcasing all animals available in their shelter. To enable that, you have been tasked to build a Machine Learning model that determines the type and breed of an animal based on its physical attributes and other factors.

    Dataset

    The dataset consists of parameters such as a unique ID assigned to each animal that is up for adoption, the date on which they arrived at the shelter, their physical attributes such as color, length, and height, among other factors.

    The benefits of practicing this problem by using Machine Learning techniques are as follows:

    This challenge will help you to actively enhance your knowledge of multi-label classification, one of the basic building blocks of Machine Learning. We challenge you to build a predictive model that detects the type and breed of an animal based on its condition, appearance, and other factors.

    Prizes

    Considering these unprecedented times that the world is facing due to the Coronavirus pandemic, we wish to do our bit and contribute the prize money for the welfare of society.

    Overview

    Machine Learning is an application of Artificial Intelligence (AI) that provides systems with the ability to automatically learn and improve from experiences without being explicitly programmed. Machine Learning is a Science that determines patterns in data. These patterns provide a deeper meaning to problems. First, it helps you understand the problems better and then solve the same with elegance.

    Here is the new HackerEarth Machine Learning Challenge—Adopt a buddy

    This challenge is designed to help you improve your Machine Learning skills by competing and learning from fellow participants.

    Why should you participate?

    To analyze and implement multiple algorithms, and determine which is more appropriate for a problem. To get hands-on experience of Machine Learning problems.

    Who should participate?

    Working professionals. Data Science or Machine Learning enthusiasts. College students (if you understand the basics of predictive modeling).

  16. Spacecraft Thruster Firing Tests Dataset

    • kaggle.com
    zip
    Updated Oct 15, 2022
    Cite
    astro_pat (2022). Spacecraft Thruster Firing Tests Dataset [Dataset]. https://www.kaggle.com/datasets/patrickfleith/spacecraft-thruster-firing-tests-dataset
    Explore at:
    zip (3194276953 bytes). Available download formats.
    Dataset updated
    Oct 15, 2022
    Authors
    astro_pat
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Warning - Anomalies currently not usable

    I noticed there are some problems with the anomalous sequences. Please do not use the firing sequences marked as anomalous. I am investigating the problem and working towards a new release. I recommend not using this dataset for anomaly detection at the moment.

    Context

    Testing hardware to qualify it for Spaceflight is critical to model and verify performances. Hot fire tests (also known as life-tests) are typically run during the qualification campaigns of satellite thrusters, but results remain proprietary data, hence making it difficult for the machine learning community to develop suitable data-driven predictive models. This synthetic dataset was generated partially based on the real-world physics of monopropellant chemical thrusters, to foster the development and benchmarking of new data-driven analytical methods (machine learning, deep-learning, etc.).

    Overview of chemical monopropellant thruster

    A monopropellant thruster is an engine that provides thrust by using a single propellant, as opposed to bipropellant systems, which use the combustion of a fuel and an oxidizer. The propellant flow into the chamber is controlled by a valve, usually an integral part of the thruster. The propellant is injected into a catalyst bed, where it decomposes. A monopropellant must be a slightly unstable chemical that decomposes exothermically to produce a hot gas. The resulting hot gases are expelled through a converging/diverging nozzle, generating thrust. The gas temperature is high, which requires high-temperature alloys to manufacture the nozzle.

    The most common type of monopropellant thruster is the reaction control thruster, generating about 1 to 10 newtons of thrust using hydrazine as propellant. These reaction control thrusters are used, for instance, to control the attitude of a spacecraft and/or to desaturate the reaction wheels.

    The performance of a monopropellant thruster (and its degradation) is mostly driven by the valve performance and the state of the catalyst bed on which the propellant decomposes. The life of the catalyst bed is mainly affected by the degradation of catalyst granules. The catalyst is made of alumina-based Indium metal granules (about 1 mm in diameter) that are carefully designed and selected to optimize its lifetime. However, catalyst granules are easily damaged by thermoelastic shocks, collisions with other granules, and so on; thus they are broken up into fine particles, which reduces their efficiency. After long firing durations, large voids form in the catalyst bed, inducing unstable decomposition of hydrazine and degradation of thruster performance.

    The properties of these simulated thruster firing tests are fictitious and not necessarily equivalent to a real-world thruster available on the market. Nevertheless, the data provides sufficient granularity and challenge to benchmark algorithms that may then be tested on real firing test sequences. This is possible because the simulator is based, in part, on the real-world physics of such reaction control thrusters. The details of the simulator are deliberately not provided, to avoid leakage into feature engineering methods and modelling approaches.

    Tasks and use cases

    Regression for Performance Prediction

    • Performance Modelling: Prediction of the thruster performances (the target can be thrust, mass flow rate, and/or the average specific impulse over a given sequence, which can be calculated from the first two). This task may also be referred to as "time series forecasting with exogenous inputs", or "system dynamics modelling with control", where the control is the command sent to the thruster (on/off) in the column "ton". It is very unlikely in practice to have more than a few complete SN qualification datasets, so we suggest a 50:50 split between the train and test sets (see the sketch after this list).
      • Use SN01 to SN12 to build your train and validation sets. These are data assumed to be collected on-ground.
      • Use SN13 to SN24 as the final test set. Warning: given that these thrusters are expected to be mounted on a flying satellite, you cannot retrain the model progressively using SN13 to SN24 data. The data is the ground truth as if we were able to measure it on-board.
    • Acceptance Test for Individualised Performance Model refinement:
      • In practice, every newly manufactured thruster is tested via acceptance tests: test_id 1 to 12. So these measurements are actually available for each thruster serial number. For this reason, it is possible to use the acceptance test data (test_id 1 to 12) of SN13 to SN24 as input to the model to predict the flight performance of these SN over test_id 13 to 112. Taking the acceptance tests of individual thrusters into account might help to generate an individualized thruster predictive model. This use case proposes to investigate...
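
    A small sketch of the suggested 50:50 split by serial number; the directory and per-file naming convention are assumptions for illustration and should be adapted to the actual dataset layout.

    ```python
    # Hedged sketch: assemble the suggested SN01-SN12 train / SN13-SN24 test split.
    # The directory and file-naming pattern (e.g. data/SN01_test_001.csv) are assumptions.
    from pathlib import Path

    data_dir = Path("data")
    train_sns = {f"SN{i:02d}" for i in range(1, 13)}  # ground qualification thrusters
    test_sns = {f"SN{i:02d}" for i in range(13, 25)}  # treated as unseen "flight" thrusters

    train_files = [p for p in data_dir.glob("*.csv") if p.name[:4] in train_sns]
    test_files = [p for p in data_dir.glob("*.csv") if p.name[:4] in test_sns]
    print(len(train_files), "train files,", len(test_files), "test files")
    ```
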
  17. QM9_molecules

    • kaggle.com
    Updated Sep 9, 2024
    Cite
    Mario Vozza (2024). QM9_molecules [Dataset]. https://www.kaggle.com/datasets/mariovozza5/qm9-molecules
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 9, 2024
    Dataset provided by
    Kaggle
    Authors
    Mario Vozza
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Notebook powered by https://daimoners.eu

    PREDICTING MOLECULAR PROPERTIES WITH MACHINE LEARNING

    Introduction and Objectives

    The computational de novo design of new drugs and materials requires a thorough and unbiased exploration of chemical compound space. However, this space remains largely unexplored due to its combinatorial scaling with molecular size. To address this challenge, a dataset of 134,000 stable small organic molecules composed of carbon (C), hydrogen (H), oxygen (O), nitrogen (N), and fluorine (F) has been meticulously computed. These molecules represent a subset of all 133,885 species with up to nine heavy atoms (C, O, N, F) from the GDB-17 chemical universe, which encompasses 166 billion organic molecules.

    For each molecule, computed geometric, energetic, electronic, and thermodynamic properties are provided, including:

    This dataset offers a relevant, consistent, and comprehensive exploration of chemical space for small organic molecules, providing a valuable resource for benchmarking existing methods, developing new methodologies (such as hybrid quantum mechanics/machine learning approaches), and systematically identifying structure-property relationships [1].

    [1] Ramakrishnan, Raghunathan, et al. "Quantum chemistry structures and properties of 134 kilo molecules." Scientific data 1.1 (2014): 1-7.

    In this notebook, we aim to leverage this dataset (QM9) to predict the molecular properties of these small organic molecules using the Coulomb matrix representation. Specifically, we will focus on using the eigenvalues of the Coulomb matrix, which serve as a crucial descriptor for capturing the electronic structure of molecules for predicting molecular properties.
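
    A minimal sketch of that descriptor, assuming the atomic numbers and Cartesian coordinates have already been parsed from the .xyz files; it uses the standard Coulomb matrix definition (0.5 * Z^2.4 on the diagonal, Z_i * Z_j / |R_i - R_j| off the diagonal, with distances in atomic units) and zero-pads the eigenvalue vector to a common size.

    ```python
    # Hedged sketch: Coulomb matrix eigenvalues as a fixed-size molecular descriptor.
    import numpy as np

    def coulomb_matrix(Z, R):
        """Z: (n,) atomic numbers; R: (n, 3) coordinates (atomic units in the standard definition)."""
        n = len(Z)
        C = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                C[i, j] = 0.5 * Z[i] ** 2.4 if i == j else Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
        return C

    def eigenvalue_descriptor(Z, R, size):
        eig = np.linalg.eigvalsh(coulomb_matrix(Z, R))
        eig = np.sort(np.abs(eig))[::-1]          # sort by magnitude, descending
        return np.pad(eig, (0, size - len(eig)))  # zero-pad to a common length

    # Example: water (O, H, H), padded to 29 entries (QM9 molecules have at most 29 atoms)
    Z = np.array([8.0, 1.0, 1.0])
    R = np.array([[0.0, 0.0, 0.0], [1.81, 0.0, 0.0], [-0.45, 1.75, 0.0]])  # rough geometry, Bohr
    print(eigenvalue_descriptor(Z, R, size=29)[:5])
    ```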

    By the end of this notebook, you will have:

    • Explored the dataset and visualized key molecular properties
    • Generated Coulomb matrices for the molecules in the dataset
    • Calculated the eigenvalues of the Coulomb matrices and predicted properties using machine learning models
    • Evaluated the performance of these models in accurately predicting molecular properties

    Let's begin by loading and exploring the dataset.

    Enjoy! ⚛

    Properties in the QM9 Dataset

    | No. | Property | Unit | Description |
    |-----|----------|------|-------------|
    | 1 | tag | - | ‘gdb9’ string to facilitate extraction |
    | 2 | i | - | Consecutive, 1-based integer identifier |
    | 3 | A | GHz | Rotational constant |
    | 4 | B | GHz | Rotational constant |
    | 5 | C | GHz | Rotational constant |
    | 6 | μ | D | Dipole moment |
    | 7 | α | a0³ | Isotropic polarizability |
    | 8 | εHOMO | Ha | Energy of HOMO |
    | 9 | εLUMO | Ha | Energy of LUMO |
    | 10 | εgap | Ha | Gap (εLUMO − εHOMO) |
    | 11 | ⟨R²⟩ | a0² | Electronic spatial extent |
    | 12 | zpve | Ha | Zero point vibrational energy |
    | 13 | U0 | Ha | Internal energy at 0 K |
    | 14 | U | Ha | Internal energy at 298.15 K |
    | 15 | H | Ha | Enthalpy at 298.15 K |
    | 16 | G | Ha | Free energy at 298.15 K |
    | 17 | Cv | cal/(mol·K) | Heat capacity at 298.15 K |

    Dataset Structure

    For each molecule, atomic coordinates and calculated properties are stored in a file named dataset_index.xyz. The XYZ format [1] is a widespread plain-text format for encoding Cartesian coordinates of molecules, with no formal specification. It contains a header line specifying the number of atoms n_a, a comment line, and n_a lines containing the element type and atomic coordinates, one atom per line. The comment line is used to store all scalar properties; Mulliken charges are added as a fifth column. Harmonic vibrational frequencies, SMILES and InChI [2] are appended as additional lines.

    [1] https://open-babel.readthedocs.io/en/latest/FileFormats/XYZ_cartesian_coordinates_format.html

    [2] https://iupac.org/who-we-are/divisions/division-details/inchi/

    QM9 xyz format

    | Line | Content | |------|----------------------------------------------------------...

  18. Keep babies safe

    • kaggle.com
    Updated Sep 22, 2020
    Cite
    Akash Gupta (2020). Keep babies safe [Dataset]. https://www.kaggle.com/akash14/keep-babies-safe/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 22, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Akash Gupta
    Description

    Context

    HackerEarth Deep Learning challenge: Keep babies safe (Sep 11, 07:30 PM IST - Oct 26, 07:30 PM IST)

    Content

    The dataset consists of 1500 images depicting numerous baby products - baby-proofing kits, toys, gadgets, and the like.

    The benefits of practicing this problem by using unsupervised Machine Learning/Deep Learning techniques are as follows:

    This challenge encourages you to apply your unsupervised Deep Learning skills to build models that can extract, identify, and tag brand names of various products. This challenge will help you enhance your knowledge of image processing and optical character recognition (OCR), which is one of the advanced fields of Machine Learning and Artificial Intelligence. We challenge you to build a model that will tag images with corresponding brand names of baby/kid products.

    Problem statement

    Your task, as a Machine Learning expert, is to build a Deep Learning model that will tag each image with the extracted product types and brand names of these products. In case there is no brand name mentioned on a product, the model should tag the image as Unnamed.

    Overview

    Deep Learning is an application of Artificial Intelligence (AI) that provides systems with the ability to automatically learn and improve from experience without being explicitly programmed. Deep Learning is a science that determines patterns in data. These patterns provide deeper meaning to problems and help you to first understand problems better and then solve the same with elegance. HackerEarth’s Deep Learning challenge is designed to help you improve your Deep Learning skills by competing and learning from fellow participants.

    Here’s presenting HackerEarth’s Deep Learning Challenge—Keep babies safe

  19. Power Transformers FDD and RUL

    • kaggle.com
    zip
    Updated Sep 1, 2024
    Cite
    Iurii Katser (2024). Power Transformers FDD and RUL [Dataset]. https://www.kaggle.com/datasets/yuriykatser/power-transformers-fdd-and-rul
    Explore at:
    zip (33405750 bytes). Available download formats.
    Dataset updated
    Sep 1, 2024
    Authors
    Iurii Katser
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Datasets with dissolved gas concentrations in power transformer oil for remaining useful life (RUL) and fault detection and diagnosis (FDD) problems.

    Introduction

    Power transformers (PTs) are an important component of a nuclear power plant (NPP). They convert alternating voltage and are instrumental in the power supply of both external NPP energy consumers and NPPs themselves. Currently, many PTs have exceeded their planned service life, which had already been extended beyond the designated 25 years. Due to the extension, monitoring the PT technical condition becomes an urgent matter.

    An important method for monitoring and diagnosing PTs is Chromatographic Analysis of Dissolved Gas (CADG). It is based on the principle of forced extraction and analysis of dissolved gases from PT oil. Almost all types of equipment defects are accompanied by the formation of gases that dissolve in oil; certain types of defects generate certain gases in different quantities. The concentrations also differ at various stages of defect development, which allows the RUL of the PT to be calculated. At present, NPP control and diagnostic systems for PT equipment use predefined control limits for the concentration of dissolved gases in oil. The main disadvantages of this approach are the lack of automatic control and the insufficient quality of diagnostics, especially for PTs with extended service life. To combat these shortcomings in diagnostic systems for the analysis of data obtained using CADG, machine learning (ML) methods can be used, as they are used in the diagnostics of many NPP components.

    Data description

    The datasets are available as .csv files containing 420 records of gas concentrations, presented as a time dependence. The gases are H2, CO, C2H4 and C2H2. The period between time points is 12 hours. There are 3000 datasets split into train (2100 datasets) and test (900 datasets) sets.

    For the RUL problem, annotations are available (in separate files): each .csv file is annotated with a value (in points) equal to the time remaining, at the end of the record, until the equipment fails.

    For FDD problems, there are labels (in the separate files) with four PT operating modes (classes): 1. Normal mode (2436 datasets); 2. Partial discharge: local dielectric breakdown in gas-filled cavities (127 datasets); 3. Low energy discharge: sparking or arc discharges in poor contact connections of structural elements with different or floating potential; discharges between PT core structural elements, high voltage winding taps and the tank, high voltage winding and grounding; discharges in oil during contact switching (162 datasets); 4. Low-temperature overheating: oil flow disruption in windings cooling channels, magnetic system causing low efficiency of the cooling system for temperatures < 300 °C (275 datasets).

    Data in this repository is an extension (test set added) of data from here and here.

    FDD problems statement

    In our case, the fault detection problem transforms into a classification problem, since the data is related to one of four labeled classes (including one normal and three anomalous), so the model's output needs to be a class number. The problem can be stated as binary classification (healthy/anomalous) for fault detection, or multi-class classification (one of 4 states) for fault diagnosis.
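
    A hedged sketch of the multi-class formulation, summarizing each 420-step record with simple per-gas statistics; the directory layout and label-file format are assumptions, not the repository's actual structure.

    ```python
    # Hedged sketch: 4-class FDD from per-gas summary statistics of each 420-step record.
    # The train/ directory and train_labels.csv layout are assumptions for illustration.
    from pathlib import Path
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    def featurize(csv_path):
        df = pd.read_csv(csv_path)  # expected columns: H2, CO, C2H4, C2H2
        return np.concatenate([df.mean(), df.std(), df.iloc[-1]])  # per-gas mean, std, last value

    labels = pd.read_csv("train_labels.csv", index_col=0)["class"]  # hypothetical label file
    X = np.stack([featurize(Path("train") / f"{idx}.csv") for idx in labels.index])

    clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, labels)
    print("train accuracy:", clf.score(X, labels))
    ```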

    RUL problem statement

    To ensure high-quality maintenance and repair, it is vital to be aware of potential malfunctions and predict RUL of transformer equipment. Therefore, it is necessary to create a mathematical model that will determine RUL by the final 420 points.

    Data usage examples

    • Dataset was used in this article.
    • Dataset was used in this research by Katser et al., which solves the problem by proposing an ensemble of classifiers.
  20. Korean - English Parallel Corpus

    • kaggle.com
    Updated Aug 19, 2020
    Cite
    Ramzel Renz Loto (2020). Korean - English Parallel Corpus [Dataset]. https://www.kaggle.com/rareloto/naver-dictionary-conversation-of-the-day/tasks
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 19, 2020
    Dataset provided by
    Kaggle
    Authors
    Ramzel Renz Loto
    Description

    Context

    As part of my Korean language learning hobby, I write and type out daily conversations from Naver Conversation of the Day. After getting introduced to data science and machine learning, I wanted to use programming to facilitate my learning process by collecting data and trying out projects. So I scraped data from Naver Dictionary using a Python script to be used later when I train a bilingual AI study buddy chatbot or automate Anki flashcards.

    Content

    This is a corpus of Korean - English paired conversations (parallel text) extracted from Naver Dictionary. The dataset consists of 4563 parallel text pairs from Naver's Conversation of the Day, covering December 4, 2017 to August 19, 2020. The files and their headers are listed below.

    • conversations.csv
      • date - 'Conversation of the Day' date
      • conversation_id - ordered numbering to indicate conversation flow
      • kor_sent - Korean sentence
      • eng_sent - English translation
      • qna_id - from sender or receiver, message or feedback
    • conversation_titles.csv
      • date - 'Conversation of the Day' date
      • kor_title - 'Conversation of the Day' title in Korean
      • eng_title - English translation of the title
      • grammar - grammar of the day
      • grammar_desc - grammar description
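
    A small sketch of how the two files could be loaded and joined into sentence pairs; the column names are taken from the header list above.

    ```python
    # Hedged sketch: load the parallel sentences and attach the daily titles.
    import pandas as pd

    conversations = pd.read_csv("conversations.csv")
    titles = pd.read_csv("conversation_titles.csv")

    pairs = conversations.merge(titles, on="date", how="left")
    pairs = pairs.sort_values(["date", "conversation_id"])
    print(pairs[["date", "kor_sent", "eng_sent", "eng_title"]].head())
    ```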

    Acknowledgements

    The data was collected from Naver Dictionary and the conversations were from the Korean Language Institute of Yonsei University.
