MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains 20 files in PDF format. Each file consists of the text transcript of one lecture. This data can be used for creating a question-answering application using an LLM.
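One common preprocessing step for a retrieval-based QA application is splitting each transcript into overlapping chunks. A minimal sketch in plain Python (the PDF-to-text extraction step itself, e.g. with a library such as pypdf, is assumed to have already been done):

```python
def chunk_transcript(text, chunk_size=500, overlap=100):
    """Split a lecture transcript into overlapping character chunks for QA retrieval."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

transcript = "word " * 300  # stand-in for text extracted from one PDF
chunks = chunk_transcript(transcript, chunk_size=500, overlap=100)
print(len(chunks), len(chunks[0]))
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.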
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
Exercise: Machine Learning Competitions
When you click Run / All, the notebook gives the error "Files doesn't exist". With this dataset you can fix that. It's the same data from DanB. Please UPVOTE!
Enjoy!
No license specified: https://academictorrents.com/nolicensespecified
A BitTorrent file to download data with the title 'Udemy - Machine Learning A-Z Become Kaggle Master'
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Deep Learning A-Z - ANN dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/filippoo/deep-learning-az-ann on 21 November 2021.
--- Dataset description provided by original source is as follows ---
This is the dataset used in the section "ANN (Artificial Neural Networks)" of the Udemy course from Kirill Eremenko (Data Scientist & Forex Systems Expert) and Hadelin de Ponteves (Data Scientist), called Deep Learning A-Z™: Hands-On Artificial Neural Networks. The dataset is very useful for beginners in Machine Learning, and a simple playground in which to compare several techniques and skills.
It can be freely downloaded here: https://www.superdatascience.com/deep-learning/
The story: A bank is investigating a very high rate of customers leaving the bank. Here is a 10,000-record dataset to investigate and predict which customers are most likely to leave the bank soon.
The story of the story: I'd like to compare several techniques (better together with the experience of several Kaggle users than alone) to improve my basic knowledge of Machine Learning.
I will write more later, but the column names are self-explanatory.
Udemy instructors Kirill Eremenko (Data Scientist & Forex Systems Expert) and Hadelin de Ponteves (Data Scientist), and their efforts to provide this dataset to their students.
Which methods score best with this dataset? Which are fastest (or, executable in a decent time)? Which are the basic steps with such a simple dataset, very useful to beginners?
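As a starting point for comparing methods on a churn task like this, here is a baseline sketch with scikit-learn. The column names (CreditScore, Age, Balance, Exited) are assumptions about the file, and a small synthetic frame stands in for the real 10,000-row dataset:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the churn dataset; real column names are an assumption.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "CreditScore": rng.integers(300, 850, 1000),
    "Age": rng.integers(18, 90, 1000),
    "Balance": rng.uniform(0, 250000, 1000),
    "Exited": rng.integers(0, 2, 1000),  # target: 1 = customer left the bank
})
X, y = df.drop(columns="Exited"), df["Exited"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
print(f"Baseline accuracy: {accuracy_score(y_te, clf.predict(X_te)):.2f}")
```

Swapping the classifier (logistic regression, gradient boosting, a small neural net) against the same split gives the kind of method comparison asked about above.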
--- Original source retains full ownership of the source dataset ---
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
I was attempting the second week's programming assignment of Coursera's "Neural Networks and Deep Learning" course (Deep Learning Specialization) when I thought about recreating the assignment's notebook from scratch, using the same datasets and the same model algorithm.
In the Coursera assignment, most of the code is already given to us, and we are required to make only small changes in the functions and submit. I found this approach unsatisfying, so I am creating a new notebook covering the same material.
This notebook is created just for practice, so that I get the hang of writing neural network code from scratch. However, if you find the notebook interesting, do upvote. 😀
ADIOS!
This dataset was created by Asma Abeyat
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The dataset contains basic information about Coursera courses, including subject, title, institution, skills acquired, rating, reviews, level, learning product, and duration. This data was scraped directly from the Coursera website.
Here is a description for each column:
Subject: This column represents the academic or professional category or field of study of the course, such as Business, Data Science, Information Technology, or Computer Science.
Title: The specific name or title of the course or program, which gives an idea of what the course covers (e.g., "Business Analysis & Process Management," "Financial Markets").
Institution: The organization or platform offering the course, such as "IBM," "Yale University," or "Università Bocconi."
Gained Skills: The skills and knowledge learners are expected to gain upon completing the course, such as "Data Analysis," "Machine learning," or "Artificial intelligence."
Rate: The rating or score given by participants based on their experience in the course, on a scale from 1 to 5 stars.
Reviews: The number of user reviews or ratings provided for the course.
Level: This column categorizes the difficulty level of the course, such as "Beginner," "Intermediate," or "Mixed".
Learning Product: The type of course or learning experience, such as "Guided Project" or "Course".
Duration: The length of time required to complete the course, which could be listed as "Less Than 2 Hours," "1 - 3 Months," etc.
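The columns described above can be loaded and summarized with pandas. The exact file name and header spellings are assumptions, so a two-row sample frame stands in for the scraped CSV:

```python
import pandas as pd

# Two sample rows mirroring the columns described above (values illustrative).
courses = pd.DataFrame({
    "Subject": ["Data Science", "Business"],
    "Title": ["Financial Markets", "Business Analysis & Process Management"],
    "Institution": ["Yale University", "Coursera Project Network"],
    "Gained Skills": ["Machine learning", "Process Analysis"],
    "Rate": [4.8, 4.5],
    "Reviews": [27000, 3200],
    "Level": ["Beginner", "Mixed"],
    "Learning Product": ["Course", "Guided Project"],
    "Duration": ["1 - 3 Months", "Less Than 2 Hours"],
})

# Example summary: average rating per subject.
summary = courses.groupby("Subject")["Rate"].mean()
print(summary)
```

With the real file, `pd.read_csv(...)` would replace the hand-built frame and the same groupby works unchanged.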
Dataset Card for "NNDL_HW5_S2025"
This is a dataset created for the Neural Networks and Deep Learning course at the University of Tehran. The original data can be accessed at https://www.kaggle.com/datasets/emmarex/plantdisease/data. More information needed.
Materials from CS224N-2024, CS224N-2019, and CS231N-2024 Stanford courses in text format. The collected resources include slides, notes, code, readings, and subtitles from YouTube videos of these courses. Additional scripts that were used to parse and preprocess this dataset can be found here: https://github.com/artvolgin/gemini-long-context-dataset
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Pokemon’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mlomuscio/pokemon on 30 September 2021.
--- Dataset description provided by original source is as follows ---
I acquired the data from Alberto Barradas at https://www.kaggle.com/abcsds/pokemon. I needed to edit some of the variable names and remove the Total variable in order for my students to use this data for class. Otherwise, I would have just had them use his version of the data.
This dataset is for my Introduction to Data Science and Machine Learning Course. Using a modified Pokémon dataset acquired from Kaggle.com, I created example code for students demonstrating how to explore data with R.
Barradas provides the following description of each variable. I have modified the variable names to make them easier to deal with.
--- Original source retains full ownership of the source dataset ---
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The dataset contains a comprehensive collection of human activity videos spanning 7 distinct classes: clapping, meeting and splitting, sitting, standing still, walking, walking while reading a book, and walking while using a phone.
Each video clip in the dataset showcases a specific human activity and has been labeled with the corresponding class to facilitate supervised learning.
The primary inspiration behind creating this dataset is to enable machines to recognize and classify human activities accurately. With the advent of computer vision and deep learning techniques, it has become increasingly important to train machine learning models on large and diverse datasets to improve their accuracy and robustness.
The data set was
You should not take this dataset seriously, as it is a synthetic representation based on true trends in education and career outcomes.
This dataset provides insights into how different study habits, learning styles, and external factors influence student performance. It includes 10,000 records, covering details about students' study hours, online learning participation, exam scores, and other factors impacting academic success.
This dataset was created by Arga Adyatama
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This fish data is acquired from a live video dataset, resulting in 27,370 verified fish images. The whole dataset is divided into 23 clusters, and each cluster is represented by a representative species, based on the synapomorphy characteristic to the extent that the taxon is monophyletic. The representative image indicates the distinction between clusters shown in the figure below, e.g. the presence or absence of components (anal fin, nasal, infraorbitals), a specific number (six dorsal-fin spines, two spiny dorsal fins), a particular shape (long second dorsal-fin spine), etc. This figure shows the representative fish species names and the numbers of detections. The data is very imbalanced: the most frequent species has about 1,000 times more images than the least frequent one. The fish detection and tracking software described in [1] is used to obtain the fish images. The fish species are manually labeled following instructions from marine biologists [2].
[Figure: representative species and their detection counts (gt_labels.png)]
Original page created by Phoenix X. Huang, Bastiaan B. Boom and Robert B. Fisher. Permission is granted for anyone to copy, use, modify, or distribute this data and accompanying documents for any purpose, provided this copyright notice is retained and prominently displayed, along with a note saying that the original data are available from our web page and referring to [2]. The data and documents are distributed without any warranty, express or implied. As the data were acquired for research purposes only, they have not been tested to the degree that would be advisable in any important application. All use of these data is entirely at the user's own risk.
Acknowledgments: This research was funded by European Commission FP7 grant 257024, in the Fish4Knowledge project.
[1]. B. J. Boom, P. X. Huang, C. Spampinato, S. Palazzo, J. He, C. Beyan, E. Beauxis-Aussalet, J. van Ossenbruggen, G. Nadarajan, J. Y. Chen-Burger, D. Giordano, L. Hardman, F.-P. Lin, R. B. Fisher, "Long-term underwater camera surveillance for monitoring and analysis of fish populations", Proc. Int. Workshop on Visual observation and Analysis of Animal and Insect Behavior (VAIB), in conjunction with ICPR 2012, Tsukuba, Japan, 2012.
[2]. B. J. Boom, P. X. Huang, J. He, R. B. Fisher, "Supporting Ground-Truth annotation of image datasets using clustering", 21st Int. Conf. on Pattern Recognition (ICPR), 2012.
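Given the roughly 1,000:1 imbalance noted above, inverse-frequency class weighting is one common mitigation when training a classifier on such data. A minimal sketch (the per-species counts below are illustrative, not the real cluster sizes):

```python
# Inverse-frequency class weights for an imbalanced species distribution.
# Counts are illustrative stand-ins, not the real per-cluster numbers.
counts = {"Dascyllus reticulatus": 12000, "Chromis chrysura": 3500, "rare species": 12}

total = sum(counts.values())
n_classes = len(counts)
# weight = total / (n_classes * class_count): rare classes get larger weights.
weights = {species: total / (n_classes * n) for species, n in counts.items()}

for species, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{species}: {w:.3f}")
```

Such a dictionary can be passed, for example, as the `class_weight` argument of Keras `Model.fit` or scikit-learn classifiers (with classes mapped to integer indices).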
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
The MNIST dataset is one of the best known image classification problems out there, and a veritable classic of the field of machine learning. This dataset is a more challenging version of the same root problem: classifying letters from images. It is a multiclass classification dataset of glyphs of the English letters A - J.
This dataset is used extensively in the Udacity Deep Learning course, and is available in the Tensorflow Github repo (under Examples). I'm not aware of any license governing the use of this data, so I'm posting it here so that the community can use it with Kaggle kernels.
notMNIST_large.zip is a large but dirty version of the dataset with 529,119 images, and notMNIST_small.zip is a small hand-cleaned version with 18,726 images. The dataset was assembled by Yaroslav Bulatov, and can be obtained on his blog. According to this blog entry there is about a 6.5% label error rate on the large uncleaned dataset, and a 0.5% label error rate on the small hand-cleaned dataset.
The two files each contain 28x28 grayscale images of letters A - J, organized into directories by letter. notMNIST_large.zip contains 529,119 images and notMNIST_small.zip contains 18,726 images.
Thanks to Yaroslav Bulatov for putting together the dataset.
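Since the images are organized into one directory per letter, labels can be derived directly from directory names. A minimal sketch using a mock tree (with the real data, point `root` at the unpacked zip instead):

```python
import tempfile
from pathlib import Path

# Mock the notMNIST layout: one directory per letter, PNG files inside.
root = Path(tempfile.mkdtemp())
for letter in "ABCDEFGHIJ":
    d = root / letter
    d.mkdir()
    (d / f"{letter.lower()}_0.png").touch()  # placeholder file per class

# Derive (path, label) pairs from the directory names, as an image loader would.
samples = [(p, p.parent.name) for p in sorted(root.glob("*/*.png"))]
labels = sorted({label for _, label in samples})
print(f"{len(samples)} images across classes {labels}")
```

Utilities such as Keras `image_dataset_from_directory` infer labels from exactly this directory-per-class layout.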
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
For details, check our GitHub repo!
The recent monkeypox outbreak has become a global healthcare concern owing to its rapid spread in more than 65 countries around the globe. To obstruct its expeditious pace, early diagnosis is a must. But the confirmatory Polymerase Chain Reaction (PCR) tests and other biochemical assays are not readily available in sufficient quantities. In this scenario, computer-aided monkeypox identification from skin lesion images can be a beneficial measure. Nevertheless, so far, such datasets are not available. Hence, the "Monkeypox Skin Lesion Dataset (MSLD)" was created by collecting and processing images from different means of web scraping, i.e. from news portals, websites and publicly accessible case reports.
The creation of the "Monkeypox Skin Lesion Dataset" is primarily focused on distinguishing monkeypox cases from similar non-monkeypox cases. Therefore, along with the 'Monkeypox' class, we included skin lesion images of 'Chickenpox' and 'Measles', because of their resemblance to the monkeypox rash and pustules in the initial state, in another class named 'Others', to allow binary classification.
There are 3 folders in the dataset.
1) Original Images: It contains a total of 228 images, of which 102 belong to the 'Monkeypox' class and the remaining 126 represent the 'Others' class, i.e., non-monkeypox (chickenpox and measles) cases.
2) Augmented Images: To aid the classification task, several data augmentation methods such as rotation, translation, reflection, shear, hue, saturation, contrast and brightness jitter, noise, scaling etc. have been applied using MATLAB R2020a. Although this can be readily done using ImageDataGenerator or other image augmentors, the augmented images are provided in this folder to ensure reproducibility of the results. Post-augmentation, the number of images increased approximately 14-fold. The classes 'Monkeypox' and 'Others' have 1428 and 1764 images, respectively.
3) Fold1: One of the three-fold cross-validation datasets. To avoid any sort of bias in training, three-fold cross-validation was performed. The original images were split into training, validation and test sets in the approximate proportion 70 : 10 : 20 while maintaining patient independence. Following common data preparation practice, only the training and validation images were augmented, while the test set contained only original images. Users have the option of using the folds directly, or using the original data and employing other algorithms to augment it.
Additionally, a CSV file is provided that has 228 rows and two columns. The table contains the list of all the ImageID(s) with their corresponding label.
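The label CSV can be inspected with pandas. The exact file name and column spellings are assumptions, so a few sample rows stand in for the real 228-row table:

```python
import pandas as pd

# Sample rows mirroring the described CSV; exact column names are an assumption.
labels = pd.DataFrame({
    "ImageID": ["M01_01.jpg", "M01_02.jpg", "NM01_01.jpg"],
    "Label": ["Monkeypox", "Monkeypox", "Others"],
})

# Class balance check, as one would run on the real file after pd.read_csv(...).
print(labels["Label"].value_counts())
```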
Since monkeypox is demonstrating a very rapid community transmission pattern, consumer-level software is truly necessary to increase awareness and encourage people to take rapid action. We have developed an easy-to-use web application named Monkey Pox Detector, using the open-source Python Streamlit framework, that uses our trained model to address this issue. It predicts whether or not the user should see a specialist, along with the prediction accuracy. Future updates will benefit from the user data we continue to collect and use to improve our model. The web app has a Flask core, so that it can be deployed cross-platform in the future.
Learn more at our GitHub repo!
If this dataset helped your research, please cite the following articles:
Ali, S. N., Ahmed, M. T., Paul, J., Jahan, T., Sani, S. M. Sakeef, Noor, N., & Hasan, T. (2022). Monkeypox Skin Lesion Detection Using Deep Learning Models: A Preliminary Feasibility Study. arXiv preprint arXiv:2207.03342.
@article{Nafisa2022, title={Monkeypox Skin Lesion Detection Using Deep Learning Models: A Preliminary Feasibility Study}, author={Ali, Shams Nafisa and Ahmed, Md. Tazuddin and Paul, Joydip and Jahan, Tasnim and Sani, S. M. Sakeef and Noor, Nawshaba and Hasan, Taufiq}, journal={arXiv preprint arXiv:2207.03342}, year={2022} }
Ali, S. N., Ahmed, M. T., Jahan, T., Paul, J., Sani, S. M. Sakeef, Noor, N., Asma, A. N., & Hasan, T. (2023). A Web-based Mpox Skin Lesion Detection System Using State-of-the-art Deep Learning Models Considering Racial Diversity. arXiv preprint arXiv:2306.14169.
@article{Nafisa2023, title={A Web-base...
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by basharath ali
Released under Apache 2.0
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
The Human Protein Atlas (HPA) is a publicly available, free database of biological images. The current Version 16.1 contains millions of microscopy images covering 16,998 proteins in healthy human tissues, cancer tissues, and cells.
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
The original dataset is from https://www.kaggle.com/datasets/andyczhao/covidx-cxr2
The data is separated into positive and negative classes based on the .txt file (see link).
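One way to perform that separation is to parse the split file into positive and negative filename lists. The exact line layout of the .txt (which columns hold the filename and label) is an assumption, so a few sample lines stand in for the real file:

```python
# Parse a split file into positive/negative filename lists.
# The column layout (filename second, label third) is an assumption about the .txt.
lines = [
    "patient1 img_001.png positive cohort-a",
    "patient2 img_002.png negative cohort-a",
    "patient3 img_003.png positive cohort-b",
]

positive, negative = [], []
for line in lines:
    parts = line.split()
    filename, label = parts[1], parts[2]
    (positive if label == "positive" else negative).append(filename)

print(len(positive), len(negative))
```

With the real file, the `lines` list would come from `open(path).read().splitlines()`.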
```python
import os

from tensorflow.keras.preprocessing.image import (
    ImageDataGenerator,
    img_to_array,
    load_img,
)

# input_dir (original negative images) and output_dir (augmented output)
# are assumed to be defined elsewhere.
datagen = ImageDataGenerator(
    rescale=1./255,               # Normalize
    rotation_range=20,            # Rotation reference
    zoom_range=0.2,               # Zoom reference
    width_shift_range=0.2,        # wrap
    height_shift_range=0.2,       # wrap
    shear_range=0.2,              # Add shear transformation
    brightness_range=(0.7, 1.3),  # Wider brightness adjustment - reference 0.3
    horizontal_flip=True,
    fill_mode='nearest'
)

# Counts
current_count = len(os.listdir(input_dir))
target_count = 57199
required_augmented_count = target_count - current_count
print(f"Original negatives: {current_count}")
print(f"Required augmented images: {required_augmented_count}")

# Augmenting ...
augmented_count = 0
max_augmentations_per_image = 10  # I used 5 and 10; this dataset was generated with 10
for img_file in os.listdir(input_dir):
    img_path = os.path.join(input_dir, img_file)
    img = load_img(img_path, target_size=(480, 480))  # 480 by 480, referring to reference
    img_array = img_to_array(img)
    img_array = img_array.reshape((1,) + img_array.shape)
    # Generate multiple augmentations per image
    i = 0
    for batch in datagen.flow(
        img_array,
        batch_size=1,
        save_to_dir=output_dir,
        save_prefix='aug',
        save_format='jpeg'
    ):
        i += 1
        augmented_count += 1
        if i >= max_augmentations_per_image:
            break
        if augmented_count >= required_augmented_count:
            break
    if augmented_count >= required_augmented_count:
        break
```
I tried using different values of max_augmentations_per_image, and also leaving the parameter unset; both ways generated augmented data (around 9,000 images) ...
positive_balanced:

```python
import os
import random

# positive_dir (folder of positive-class images) is assumed to be defined elsewhere.
random.seed(42)
target_count = 20579
all_positive_images = os.listdir(positive_dir)
selected_positive_images = random.sample(all_positive_images, target_count)
```