57 datasets found
  1. CS229- Machine Learning Course Transcripts

    • kaggle.com
    Updated Jan 3, 2025
    Cite
    Himanshu Nakrani (2025). CS229- Machine Learning Course Transcripts [Dataset]. https://www.kaggle.com/datasets/himanshunakrani/rag-with-langchain-deeplearning-ai
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 3, 2025
    Dataset provided by
    Kaggle
    Authors
    Himanshu Nakrani
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains 20 files in PDF format. Each file contains the text transcript of one lecture. The data can be used to build a question-answering application with an LLM.

  2. home data for ml course

    • kaggle.com
    zip
    Updated Aug 27, 2019
    Cite
    Julián Pérez Pesce (2019). home data for ml course [Dataset]. https://www.kaggle.com/datasets/estrotococo/home-data-for-ml-course
    Explore at:
    Available download formats: zip (199207 bytes)
    Dataset updated
    Aug 27, 2019
    Authors
    Julián Pérez Pesce
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Exercise: Machine Learning Competitions

    When you click Run / All, the notebook gives you an error: "Files doesn't exist". This dataset fixes that. It is the same data as DanB's. Please upvote!

    Enjoy!

  3. Udemy - Machine Learning A-Z Become Kaggle Master

    • academictorrents.com
    bittorrent
    Updated Apr 24, 2023
    Cite
    None (2023). Udemy - Machine Learning A-Z Become Kaggle Master [Dataset]. https://academictorrents.com/details/9e378efb6e2f67de46c6c3660d9675be50bfc21f
    Explore at:
    Available download formats: bittorrent (15004863898)
    Dataset updated
    Apr 24, 2023
    Authors
    None
    License

    https://academictorrents.com/nolicensespecified

    Description

    A BitTorrent file to download data with the title 'Udemy - Machine Learning A-Z Become Kaggle Master'

  4. ‘Deep Learning A-Z - ANN dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Nov 21, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Deep Learning A-Z - ANN dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-deep-learning-a-z-ann-dataset-e900/a6df077a/?iid=030-068&v=presentation
    Explore at:
    Dataset updated
    Nov 21, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Deep Learning A-Z - ANN dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/filippoo/deep-learning-az-ann on 21 November 2021.

    --- Dataset description provided by original source is as follows ---

    Context

    This is the dataset used in the section "ANN (Artificial Neural Networks)" of the Udemy course from Kirill Eremenko (Data Scientist & Forex Systems Expert) and Hadelin de Ponteves (Data Scientist), called Deep Learning A-Z™: Hands-On Artificial Neural Networks. The dataset is very useful for machine-learning beginners and is a simple playground for comparing several techniques and skills.

    It can be freely downloaded here: https://www.superdatascience.com/deep-learning/

    The story: A bank is investigating a very high rate of customers leaving the bank. Here is a dataset of 10,000 records to investigate and predict which customers are more likely to leave the bank soon.

    The story of the story: I'd like to compare several techniques (better if not alone, and with the experience of several Kaggle users) to improve my basic knowledge on Machine Learning.

    Content

    I will write more later, but the column names are self-explanatory.

    Acknowledgements

    Udemy instructors Kirill Eremenko (Data Scientist & Forex Systems Expert) and Hadelin de Ponteves (Data Scientist), and their efforts to provide this dataset to their students.

    Inspiration

    Which methods score best with this dataset? Which are fastest (or executable in a reasonable time)? What are the basic steps with such a simple dataset, which is very useful to beginners?

    --- Original source retains full ownership of the source dataset ---

  5. Cat Classifier Dataset for Neural Network

    • kaggle.com
    Updated Sep 26, 2020
    Cite
    Devesh Kashyap (2020). Cat Classifier Dataset for Neural Network [Dataset]. https://www.kaggle.com/deveshkashyap/cat-classifier-dataset-for-neural-network/discussion
    Explore at:
    Croissant
    Dataset updated
    Sep 26, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Devesh Kashyap
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    I was attempting the week 2 programming assignment of Coursera's "Neural Networks and Deep Learning" course (Deep Learning Specialization) when I thought about recreating the assignment's notebook from scratch, using the same datasets and the same model algorithm.

    In the Coursera assignment, most of the code is already given and we are only required to make small changes to the functions and submit. I found this approach utterly distasteful, so I am creating a new notebook on the same task.

    This notebook was created just for practice, so that I get the hang of writing neural network code from scratch. However, if you find the notebook interesting, do upvote. 😀

    ADIOS!

  6. U_tech / Machine-learning course / Mini-project1

    • kaggle.com
    Updated Nov 22, 2021
    Cite
    Asma Abeyat (2021). U_tech / Machine-learning course / Mini-project1 [Dataset]. https://www.kaggle.com/asmaabeyat/u-tech-machinelearning-course-miniproject1/code
    Explore at:
    Croissant
    Dataset updated
    Nov 22, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Asma Abeyat
    Description

    Dataset

    This dataset was created by Asma Abeyat

    Contents

  7. Coursera Courses and Skills dataset 2025

    • kaggle.com
    Updated Mar 7, 2025
    Cite
    Yosefxx590 (2025). Coursera Courses and Skills dataset 2025 [Dataset]. https://www.kaggle.com/datasets/yosefxx590/coursera-courses-and-skills-dataset-2025
    Explore at:
    Croissant
    Dataset updated
    Mar 7, 2025
    Dataset provided by
    Kaggle
    Authors
    Yosefxx590
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The dataset contains basic information about Coursera courses: subject, title, institution, skills gained, rating, reviews, level, learning product, and duration. The data was scraped directly from the Coursera website.

    Content

    Here is a description for each column:

    Subject: This column represents the academic or professional category or field of study of the course, such as Business, Data Science, Information Technology, or Computer Science.

    Title: The specific name or title of the course or program, which gives an idea of what the course covers (e.g., "Business Analysis & Process Management," "Financial Markets").

    Institution: The organization or platform offering the course, such as "IBM," "Yale University," or "Università Bocconi."

    Gained Skills: The skills and knowledge learners are expected to gain upon completing the course, such as "Data Analysis," "Machine learning," or "Artificial intelligence."

    Rate: The rating given by participants based on their experience of the course, on a scale from 1 to 5 stars.

    Reviews: The number of user reviews or ratings provided for the course.

    Level: This column categorizes the difficulty level of the course, such as "Beginner," "Intermediate," or "Mixed."

    Learning Product: The type of course or learning experience, such as "Guided Project" or "Course".

    Duration: The length of time required to complete the course, which could be listed as "Less Than 2 Hours," "1 - 3 Months," etc.

  8. NNDL_HW5_S2025

    • huggingface.co
    Updated May 29, 2025
    Cite
    Arian (2025). NNDL_HW5_S2025 [Dataset]. https://huggingface.co/datasets/ArianFiroozi/NNDL_HW5_S2025
    Explore at:
    Dataset updated
    May 29, 2025
    Authors
    Arian
    Description

    Dataset Card for "NNDL_HW5_S2025"

    This dataset was created for the Neural Networks and Deep Learning course at the University of Tehran. The original data can be accessed at https://www.kaggle.com/datasets/emmarex/plantdisease/data

  9. df-materials-gemini

    • kaggle.com
    Updated Dec 1, 2024
    Cite
    Artem Volgin (2024). df-materials-gemini [Dataset]. https://www.kaggle.com/datasets/artvolgin/df-materials-gemini
    Explore at:
    Croissant
    Dataset updated
    Dec 1, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Artem Volgin
    Description

    Materials from CS224N-2024, CS224N-2019, and CS231N-2024 Stanford courses in text format. The collected resources include slides, notes, code, readings, and subtitles from YouTube videos of these courses. Additional scripts that were used to parse and preprocess this dataset can be found here: https://github.com/artvolgin/gemini-long-context-dataset

  10. ‘Pokemon’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Sep 30, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Pokemon’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-pokemon-7442/b5a1c10f/?iid=014-813&v=presentation
    Explore at:
    Dataset updated
    Sep 30, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Pokemon’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mlomuscio/pokemon on 30 September 2021.

    --- Dataset description provided by original source is as follows ---

    I acquired the data from Alberto Barradas at https://www.kaggle.com/abcsds/pokemon. I needed to edit some of the variable names and remove the Total variable in order for my students to use this data for class. Otherwise, I would have just had them use his version of the data.

    This dataset is for my Introduction to Data Science and Machine Learning Course. Using a modified Pokémon dataset acquired from Kaggle.com, I created example code for students demonstrating how to explore data with R.

    Barradas provides the following description of each variable. I have modified the variable names to make them easier to deal with.

    • Num: ID for each Pokémon.
    • Name: Name of each Pokémon.
    • Type1: Each Pokémon has a type, which determines its weakness/resistance to attacks.
    • Type2: Some Pokémon are dual type and have a second type.
    • HP: Hit points, or health, defines how much damage a Pokémon can withstand before fainting.
    • Attack: The base modifier for normal attacks (e.g. Scratch, Punch).
    • Defense: The base damage resistance against normal attacks.
    • SPAtk: Special attack, the base modifier for special attacks (e.g. fire blast, bubble beam).
    • SPDef: The base damage resistance against special attacks.
    • Speed: Determines which Pokémon attacks first each round.
    • Generation: Number of generation.
    • Legendary: True if Legendary Pokémon, False if not.
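
    The description above notes that the Total variable was removed from Barradas's version of the data. As a minimal sketch, assuming Total in the original dataset was the sum of the six base stats (an assumption, not something this description states), it could be recomputed from the remaining columns. The function name and the sample row are illustrative only.

```python
# Column names follow the variable list above.
STAT_COLUMNS = ("HP", "Attack", "Defense", "SPAtk", "SPDef", "Speed")


def total_stats(row):
    """Sum the six base stats of one Pokémon row (a dict keyed by column name)."""
    return sum(row[column] for column in STAT_COLUMNS)


# Illustrative row using Bulbasaur's base stats.
bulbasaur = {
    "Num": 1, "Name": "Bulbasaur", "Type1": "Grass", "Type2": "Poison",
    "HP": 45, "Attack": 49, "Defense": 49, "SPAtk": 65, "SPDef": 65,
    "Speed": 45, "Generation": 1, "Legendary": False,
}
print(total_stats(bulbasaur))  # prints 318
```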

    --- Original source retains full ownership of the source dataset ---

  11. Human Activity Recognition (HAR - Video Dataset)

    • kaggle.com
    Updated May 19, 2023
    Cite
    Sharjeel M. (2023). Human Activity Recognition (HAR - Video Dataset) [Dataset]. http://doi.org/10.34740/kaggle/dsv/5722068
    Explore at:
    Croissant
    Dataset updated
    May 19, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sharjeel M.
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The dataset contains a collection of human activity videos spanning 7 distinct classes: clapping, meeting and splitting, sitting, standing still, walking, walking while reading a book, and walking while using the phone.

    Each video clip in the dataset showcases a specific human activity and has been labeled with the corresponding class to facilitate supervised learning.

    The primary inspiration behind creating this dataset is to enable machines to recognize and classify human activities accurately. With the advent of computer vision and deep learning techniques, it has become increasingly important to train machine learning models on large and diverse datasets to improve their accuracy and robustness.

  12. CIFAR-10 dataset from packthub

    • kaggle.com
    Updated Sep 21, 2020
    Cite
    Mark Dodds (2020). CIFAR-10 dataset from packthub [Dataset]. https://www.kaggle.com/markjansen/cifar/activity
    Explore at:
    Croissant
    Dataset updated
    Sep 21, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mark Dodds
    Description

    Context

    The data set was

  13. Student Performance & Learning Style

    • kaggle.com
    Updated Feb 12, 2025
    Cite
    Adil Shamim (2025). Student Performance & Learning Style [Dataset]. https://www.kaggle.com/datasets/adilshamim8/student-performance-and-learning-style/data
    Explore at:
    Croissant
    Dataset updated
    Feb 12, 2025
    Dataset provided by
    Kaggle
    Authors
    Adil Shamim
    Description

    You should not take this dataset seriously, as it is a synthetic representation based on true trends in education and career outcomes.

    About the Dataset

    This dataset provides insights into how different study habits, learning styles, and external factors influence student performance. It includes 10,000 records, covering details about students' study hours, online learning participation, exam scores, and other factors impacting academic success.

    Dataset Features

    • Student_ID – Unique identifier for each student
    • Age – Student's age (18-30 years)
    • Gender – Male, Female, or Other
    • Study_Hours_per_Week – Hours spent studying per week (5-50 hours)
    • Preferred_Learning_Style – Visual, Auditory, Reading/Writing, Kinesthetic
    • Online_Courses_Completed – Number of online courses completed (0-20)
    • Participation_in_Discussions – Whether the student actively participates in discussions (Yes/No)
    • Assignment_Completion_Rate (%) – Percentage of assignments completed (50%-100%)
    • Exam_Score (%) – Student’s final exam score (40%-100%)
    • Attendance_Rate (%) – Percentage of classes attended (50%-100%)
    • Use_of_Educational_Tech – Whether the student uses educational technology (Yes/No)
    • Self_Reported_Stress_Level – Student’s stress level (Low, Medium, High)
    • Time_Spent_on_Social_Media (hours/week) – Weekly hours spent on social media (0-30 hours)
    • Sleep_Hours_per_Night – Average sleep duration (4-10 hours)
    • Final_Grade – Assigned grade based on exam score (A, B, C, D, F)

    Use Cases

    • Predicting Student Performance – Analyze how different factors influence exam scores.
    • Educational Insights – Understand the impact of study habits, learning styles, and external activities.
    • Machine Learning Applications – Train predictive models for student success.
  14. MNIST Digit Data

    • kaggle.com
    Updated Jun 23, 2020
    Cite
    Arga Adyatama (2020). MNIST Digit Data [Dataset]. https://www.kaggle.com/datasets/argaadya/course-mnist/suggestions?status=pending&yourSuggestions=true
    Explore at:
    Croissant
    Dataset updated
    Jun 23, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Arga Adyatama
    Description

    Dataset

    This dataset was created by Arga Adyatama

    Contents

  15. Fish Recognition Ground-Truth data

    • kaggle.com
    Updated Nov 25, 2023
    Cite
    Madhushree Sannigrahi (2023). Fish Recognition Ground-Truth data [Dataset]. https://www.kaggle.com/datasets/madhushreesannigrahi/fish-recognition-ground-truth-data
    Explore at:
    Croissant
    Dataset updated
    Nov 25, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Madhushree Sannigrahi
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This fish data is acquired from a live video dataset, resulting in 27370 verified fish images. The whole dataset is divided into 23 clusters, and each cluster is presented by a representative species, which is based on the synapomorphies characteristic from the extent that the taxon is monophyletic. The representative image indicates the distinction between clusters shown in the figure below, e.g. the presence or absence of components (anal-fin, nasal, infraorbitals), specific number (six dorsal-fin spines, two spiny dorsal-fins), particular shape (second dorsal-fin spine long), etc. This figure shows the representative fish species name and the numbers of detections. The data is very imbalanced: the most frequent species has about 1000 times more detections than the least frequent one. The fish detection and tracking software described in [1] is used to obtain the fish images. The fish species are manually labeled by following instructions from marine biologists [2]. Figure: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F5980358%2F5cc6093c54b3dc535bed661e93fc7a12%2Fgt_labels.png?generation=1700950637892694&alt=media

    Original page created by Phoenix X. Huang, Bastiaan B. Boom and Robert B. Fisher. Permission is granted for anyone to copy, use, modify, or distribute this data and accompanying documents for any purpose, provided this copyright notice is retained and prominently displayed, along with a note saying that the original data are available from our web page and referring to [2]. The data and documents are distributed without any warranty, express or implied. As the data were acquired for research purposes only, they have not been tested to the degree that would be advisable in any important application. All use of these data is entirely at the user's own risk.

    Acknowledgments: This research was funded by European Commission FP7 grant 257024, in the Fish4Knowledge project.

    • [1]. B. J. Boom, P. X. Huang, C. Spampinato, S. Palazzo, J. He, C. Beyan, E. Beauxis-Aussalet, J. van Ossenbruggen, G. Nadarajan, J. Y. Chen-Burger, D. Giordano, L. Hardman, F.-P. Lin, R. B. Fisher, "Long-term underwater camera surveillance for monitoring and analysis of fish populations", Proc. Int. Workshop on Visual observation and Analysis of Animal and Insect Behavior (VAIB), in conjunction with ICPR 2012, Tsukuba, Japan, 2012.

    • [2]. B. J. Boom, P. X. Huang, J. He, R. B. Fisher, "Supporting Ground-Truth annotation of image datasets using clustering", 21st Int. Conf. on Pattern Recognition (ICPR), 2012.

  16. notMNIST

    • kaggle.com
    • opendatalab.com
    • +2more
    Updated Feb 14, 2018
    Cite
    jwjohnson314 (2018). notMNIST [Dataset]. https://www.kaggle.com/datasets/jwjohnson314/notmnist/code
    Explore at:
    Croissant
    Dataset updated
    Feb 14, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    jwjohnson314
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The MNIST dataset is one of the best-known image classification problems out there, and a veritable classic of the field of machine learning. This dataset is a more challenging version of the same root problem: classifying letters from images. This is a multiclass classification dataset of glyphs of English letters A - J.

    This dataset is used extensively in the Udacity Deep Learning course, and is available in the Tensorflow Github repo (under Examples). I'm not aware of any license governing the use of this data, so I'm posting it here so that the community can use it with Kaggle kernels.

    Content

    notMNIST_large.zip is a large but dirty version of the dataset with 529,119 images, and notMNIST_small.zip is a small hand-cleaned version of the dataset with 18,726 images. The dataset was assembled by Yaroslav Bulatov, and can be obtained on his blog. According to this blog entry there is about a 6.5% label error rate on the large uncleaned dataset, and a 0.5% label error rate on the small hand-cleaned dataset.

    The two files each contain 28x28 grayscale images of letters A - J, organized into directories by letter. notMNIST_large.zip contains 529,119 images and notMNIST_small.zip contains 18,726 images.
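
    As a minimal sketch of how the layout described above could be indexed, the snippet below walks one sub-directory per letter and pairs each file with a class index. It assumes the archive has been extracted to a folder containing sub-directories named A to J; the function names are hypothetical and no image decoding is attempted. A second helper turns the quoted label error rates into rough image counts.

```python
from pathlib import Path

LETTERS = "ABCDEFGHIJ"  # classes A - J, as stated in the description


def list_labeled_images(root):
    """Return (path, class_index) pairs for files under root/<letter>/."""
    root = Path(root)
    samples = []
    for index, letter in enumerate(LETTERS):
        class_dir = root / letter
        if not class_dir.is_dir():
            continue  # tolerate a missing class folder
        for path in sorted(class_dir.iterdir()):
            if path.is_file():
                samples.append((path, index))
    return samples


def expected_mislabeled(num_images, error_rate):
    """Rough number of mislabeled images implied by a label error rate."""
    return round(num_images * error_rate)


print(expected_mislabeled(529119, 0.065))  # large set: about 34393 images
print(expected_mislabeled(18726, 0.005))   # small set: about 94 images
```

    Under the quoted rates, the large file would hold roughly 34,000 mislabeled images against fewer than 100 in the small one, which is one reason to evaluate on the hand-cleaned set.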

    Acknowledgements

    Thanks to Yaroslav Bulatov for putting together the dataset.

  17. Monkeypox Skin Lesion Dataset

    • kaggle.com
    Updated Jul 5, 2022
    Cite
    TensorKitty (2022). Monkeypox Skin Lesion Dataset [Dataset]. https://www.kaggle.com/datasets/nafin59/monkeypox-skin-lesion-dataset
    Explore at:
    Croissant
    Dataset updated
    Jul 5, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    TensorKitty
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    An updated version of the MSLD dataset, MSLD v2.0 has been released after being verified by an expert dermatologist!

    For details, check our GitHub repo!

    Context

    The recent monkeypox outbreak has become a global healthcare concern owing to its rapid spread in more than 65 countries around the globe. To obstruct its expeditious pace, early diagnosis is a must. But the confirmatory Polymerase Chain Reaction (PCR) tests and other biochemical assays are not readily available in sufficient quantities. In this scenario, computer-aided monkeypox identification from skin lesion images can be a beneficial measure. Nevertheless, so far, such datasets are not available. Hence, the "Monkeypox Skin Lesion Dataset (MSLD)" was created by collecting and processing images obtained through web scraping, i.e., from news portals, websites and publicly accessible case reports.

    The creation of the "Monkeypox Skin Lesion Dataset" is primarily focused on distinguishing monkeypox cases from similar non-monkeypox cases. Therefore, along with the 'Monkeypox' class, we included skin lesion images of 'Chickenpox' and 'Measles', because of their resemblance to the monkeypox rash and pustules in the initial state, in another class named 'Others' to perform binary classification.

    Content

    There are 3 folders in the dataset.

    1) Original Images: It contains a total of 228 images, of which 102 belong to the 'Monkeypox' class and the remaining 126 represent the 'Others' class, i.e., non-monkeypox (chickenpox and measles) cases.

    2) Augmented Images: To aid the classification task, several data augmentation methods such as rotation, translation, reflection, shear, hue, saturation, contrast and brightness jitter, noise, scaling etc. have been applied using MATLAB R2020a. Although this can be readily done using ImageGenerator/other image augmentors, to ensure reproducibility of the results, the augmented images are provided in this folder. Post-augmentation, the number of images increased approximately 14-fold. The classes 'Monkeypox' and 'Others' have 1428 and 1764 images, respectively.

    3) Fold1: One of the three-fold cross-validation datasets. To avoid any sort of bias in training, three-fold cross-validation was performed. The original images were split into training, validation and test set(s) in the approximate proportion 70 : 10 : 20 while maintaining patient independence. Following common data preparation practice, only the training and validation images were augmented, while the test set contained only the original images. Users have the option of using the folds directly or using the original data and employing other algorithms to augment it.

    Additionally, a CSV file is provided that has 228 rows and two columns. The table contains the list of all the ImageID(s) with their corresponding label.
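
    A patient-independent split like the one described above can be sketched as follows. This is not the authors' procedure: the image-to-patient mapping is a hypothetical input (the CSV described here lists only ImageID and label), and the function name and seed are illustrative. Whole patients are assigned to one set each, so the 70 : 10 : 20 proportion holds over patients and only approximately over images.

```python
import random
from collections import defaultdict


def split_by_patient(image_to_patient, ratios=(0.7, 0.1, 0.2), seed=0):
    """Assign whole patients to train/val/test so no patient spans two sets."""
    by_patient = defaultdict(list)
    for image_id, patient_id in image_to_patient.items():
        by_patient[patient_id].append(image_id)

    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)  # deterministic for a given seed

    n_train = round(len(patients) * ratios[0])
    n_val = round(len(patients) * ratios[1])
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],  # remainder goes to the test set
    }
    return {
        name: sorted(img for p in members for img in by_patient[p])
        for name, members in groups.items()
    }


# Hypothetical mapping: 10 patients with 2 images each.
mapping = {f"img{p}_{k}": f"patient{p}" for p in range(10) for k in range(2)}
splits = split_by_patient(mapping)
print({name: len(ids) for name, ids in splits.items()})
```

    The augmented counts quoted above are consistent with a factor of 14 per class: 102 × 14 = 1428 and 126 × 14 = 1764.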

    Web Application

    Since monkeypox is demonstrating a very rapid community transmission pattern, consumer-level software is truly necessary to increase awareness and encourage people to take rapid action. We have developed an easy-to-use web application named Monkey Pox Detector using the open-source Python Streamlit framework that uses our trained model to address this issue. It makes predictions on whether or not to see a specialist along with the prediction accuracy. Future updates will benefit from the user data we continue to collect and use to improve our model. The web app has a Flask core, so that it can be deployed cross-platform in the future.

    Learn more at our GitHub repo!

    Citation

    If this dataset helped your research, please cite the following articles:

    Ali, S. N., Ahmed, M. T., Paul, J., Jahan, T., Sani, S. M. Sakeef, Noor, N., & Hasan, T. (2022). Monkeypox Skin Lesion Detection Using Deep Learning Models: A Preliminary Feasibility Study. arXiv preprint arXiv:2207.03342.

    @article{Nafisa2022, title={Monkeypox Skin Lesion Detection Using Deep Learning Models: A Preliminary Feasibility Study}, author={Ali, Shams Nafisa and Ahmed, Md. Tazuddin and Paul, Joydip and Jahan, Tasnim and Sani, S. M. Sakeef and Noor, Nawshaba and Hasan, Taufiq}, journal={arXiv preprint arXiv:2207.03342}, year={2022} }

    Ali, S. N., Ahmed, M. T., Jahan, T., Paul, J., Sani, S. M. Sakeef, Noor, N., Asma, A. N., & Hasan, T. (2023). A Web-based Mpox Skin Lesion Detection System Using State-of-the-art Deep Learning Models Considering Racial Diversity. arXiv preprint arXiv:2306.14169.

    @article{Nafisa2023, title={A Web-base...

  18. cyber security and analytics

    • kaggle.com
    Updated Apr 29, 2024
    Cite
    basharath ali (2024). cyber security and analytics [Dataset]. https://www.kaggle.com/datasets/basharath123/cyber-security-and-analytics/code
    Explore at:
    Croissant
    Dataset updated
    Apr 29, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    basharath ali
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by basharath ali

    Released under Apache 2.0

    Contents

  19. HPV_IEEE_ML_SJTU

    • kaggle.com
    Updated May 10, 2018
    Cite
    ldslds (2018). HPV_IEEE_ML_SJTU [Dataset]. https://www.kaggle.com/lidasong/hpv-ieee-ml-sjtu/kernels
    Explore at:
    Croissant
    Dataset updated
    May 10, 2018
    Dataset provided by
    Kaggle
    Authors
    ldslds
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    2018 ML course project in IEEE Honor Class.

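    Protein subcellular localization based on molecular microscopy images

    My view: the labels in this project are inaccurate... SOTA only reaches 50%-60%.

    dataset description

    The Human Protein Atlas (HPA) is a free, publicly available biological image database. The current Version 16.1 stores millions of microscopy images of 16,998 proteins in healthy and cancerous human tissues and cells.

    difficulty

    • Large image size: 3000*3000
    • Images contain little useful information and a lot of noise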
    • Multi-label
    • Multi-instance

    related work

    • Image based: Xu, Y., et al., Bioimaging-based detection of mislocalized proteins in human cancers by semi-supervised learning. Bioinformatics, 2014. 31(7): p. 1111-1119.
    • Image based: Xu, Y.Y., et al., An image-based multi-label human protein subcellular localization predictor (iLocator) reveals protein mislocalizations in cancer tissues. Bioinformatics, 2013. 29(16): p. 2032-2040.
    • Sequence based: Almagro, J.A., et al., DeepLoc: prediction of protein subcellular localization using deep learning. Bioinformatics, 2017.
  20. COVID Rearrange Dataset

    • kaggle.com
    Updated Nov 21, 2024
    Cite
    DD.Zh (2024). COVID Rearrange Dataset [Dataset]. https://www.kaggle.com/datasets/dadadazhang/covid-data-rearrange
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 21, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    DD.Zh
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The original dataset is from https://www.kaggle.com/datasets/andyczhao/covidx-cxr2

    The data is separated based on the .txt file (see link) into positive and negative.
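
A minimal sketch of how such a split could be done. It assumes the label file is whitespace-separated, with the image file name in the second column and the class (for example `positive` or `negative`) in the third. The column positions, the helper name `split_by_label`, and the directory layout are assumptions for illustration and are not taken from the dataset description.

```python
import shutil
from pathlib import Path


def split_by_label(label_file, image_dir, output_dir, name_col=1, label_col=2):
    """Copy each listed image into output_dir/<label>/ and return counts per label."""
    image_dir, output_dir = Path(image_dir), Path(output_dir)
    counts = {}
    for line in Path(label_file).read_text().splitlines():
        fields = line.split()
        if len(fields) <= max(name_col, label_col):
            continue  # skip blank or malformed lines
        name, label = fields[name_col], fields[label_col]
        source = image_dir / name
        if not source.is_file():
            continue  # listed in the label file but missing on disk
        target_dir = output_dir / label
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target_dir / name)
        counts[label] = counts.get(label, 0) + 1
    return counts
```

Copying rather than moving keeps the original download intact, so the split can be rerun if the assumed column positions turn out to be wrong.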

    Data Augmentation Code

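    import os

    from tensorflow.keras.preprocessing.image import (
      ImageDataGenerator,
      img_to_array,
      load_img,
    )

    # input_dir (source images) and output_dir (augmented images) must be set
    # before running; the original description does not give their values.

    datagen = ImageDataGenerator(
      rescale=1./255,        # Normalize pixel values to [0, 1]
      rotation_range=20,      # Rotation range in degrees (reference value)
      zoom_range=0.2,        # Zoom range (reference value)
      width_shift_range=0.2,    # Horizontal shift, as a fraction of width
      height_shift_range=0.2,    # Vertical shift, as a fraction of height
      shear_range=0.2,       # Add shear transformation
      brightness_range=(0.7, 1.3), # Wider brightness adjustment - reference 0.3
      horizontal_flip=True,
      fill_mode='nearest'
    )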
    
    
    # Counts
    current_count = len(os.listdir(input_dir))
    target_count = 57199
    required_augmented_count = target_count - current_count
    
    print(f"Original negatives: {current_count}")
    print(f"Required augmented images: {required_augmented_count}")
    
    # augmenting ...
    augmented_count = 0
    max_augmentations_per_image = 10 #I used 5 and 10, this dataset was generated with 10
    
    for img_file in os.listdir(input_dir):
      img_path = os.path.join(input_dir, img_file)
      img = load_img(img_path, target_size=(480, 480)) # 480 by 480 referring to reference.
      img_array = img_to_array(img)
      img_array = img_array.reshape((1,) + img_array.shape)
    
      # Generate multiple augmentations per image
      i = 0
      for batch in datagen.flow(
        img_array,
        batch_size=1,
        save_to_dir=output_dir,
        save_prefix='aug',
        save_format='jpeg'
      ):
        i += 1
        augmented_count += 1
        if i >= max_augmentations_per_image:
          break
        if augmented_count >= required_augmented_count:
          break
    
      if augmented_count >= required_augmented_count:
        break
    

    I tried using different max_augmentations_per_image, or without setting this parameter; both ways generated augmented data (around 9,000) ...
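
The augmentation loop stops at whichever limit is reached first: the per-image cap or the overall shortfall. The sketch below replays only those stopping rules, with a hypothetical helper `count_augmented` and made-up counts rather than the dataset's real numbers. It is a counting model only and does not account for anything Keras does when writing files.

```python
def count_augmented(current_count, target_count, max_per_image):
    """Replay the loop's stopping rules without touching any images."""
    required = target_count - current_count
    augmented = 0
    for _ in range(current_count):        # one pass per source image
        i = 0
        while True:                       # stands in for datagen.flow(...)
            i += 1
            augmented += 1
            if i >= max_per_image:
                break
            if augmented >= required:
                break
        if augmented >= required:
            break
    return augmented


print(count_augmented(100, 1000, 5))   # cap binds: 100 images * 5 each
print(count_augmented(100, 300, 10))   # shortfall binds: 300 - 100
```

With a positive shortfall and a cap of at least 1, the result equals `min(target_count - current_count, current_count * max_per_image)`, so a small cap can leave the output short of the target.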

    positive_balanced:

    ```python
    random.seed(42)

    # Total negative samples
    target_count = 20579

    all_positive_images = os.listdir(positive_dir)
    selected_positive_images = random.sample(all_positive_images, target_count)
    ```
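
A self-contained sketch of the same down-sampling idea, using an in-memory list of file names instead of a directory. The helper name `sample_balanced` and the example names are hypothetical. Sorting before sampling is an added assumption: `os.listdir` returns entries in arbitrary order, so a fixed seed alone may not reproduce the same selection on another machine.

```python
import random


def sample_balanced(names, target_count, seed=42):
    """Pick target_count names reproducibly, independent of listing order."""
    if target_count > len(names):
        raise ValueError("target_count exceeds the number of available items")
    rng = random.Random(seed)              # local generator, leaves global state alone
    return rng.sample(sorted(names), target_count)


names = [f"img_{i}.png" for i in range(10)]
selected = sample_balanced(names, 4)
print(len(selected))
```

Using a local `random.Random` instance instead of `random.seed` keeps the selection from being affected by other code that draws from the global generator.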

Cite
Himanshu Nakrani (2025). CS229- Machine Learning Course Transcripts [Dataset]. https://www.kaggle.com/datasets/himanshunakrani/rag-with-langchain-deeplearning-ai
CS229- Machine Learning Course Transcripts

Lecture transcripts for CS229 - Machine Learning Course by Andrew Ng

Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 3, 2025
Dataset provided by
Kaggle
Authors
Himanshu Nakrani
License

MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically

Description

This dataset contains 20 files in PDF format. Each file contains the text transcript of one lecture. The data can be used to build a question-answering application with an LLM.
