18 datasets found
  1. Data Mining Project - Boston

    • kaggle.com
    zip
    Updated Nov 25, 2019
    Cite
    SophieLiu (2019). Data Mining Project - Boston [Dataset]. https://www.kaggle.com/sliu65/data-mining-project-boston
    Explore at:
    Available download formats: zip (59313797 bytes)
    Dataset updated
    Nov 25, 2019
    Authors
    SophieLiu
    Area covered
    Boston
    Description

    Context

    To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important to our analysis. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data file has both Lyft and Uber, but it is still a cleaned version of the dataset we downloaded from Kaggle.

    Use of Data Files

    You can easily subset the data into the car types that you will be modeling by first loading the CSV into R. Here is the code for doing this:

    This loads the file into R

    df<-read.csv('uber.csv')

    The next line of code subsets the data into specific car types. The example below keeps only the Uber 'Black' car type.

    df_black<-subset(df, df$name == 'Black')

    The next step is to save this subset so it can be shared or loaded into R later. To do that, we write the dataframe to a CSV file on the computer.

    write.csv(df_black, "nameofthefileyouwanttosaveas.csv")

    The file will appear in your working directory. If you are not sure where your working directory is, run this code:

    getwd()

    The output will be the file path to your working directory. You will find the file you just created in that folder.

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  2. ISIC 2016 - 256x256

    • kaggle.com
    zip
    Updated Aug 7, 2024
    + more versions
    Cite
    Mehran Ziadloo (2024). ISIC 2016 - 256x256 [Dataset]. https://www.kaggle.com/datasets/ziadloo/isic-2016-256x256
    Explore at:
    Available download formats: zip (17534629 bytes)
    Dataset updated
    Aug 7, 2024
    Authors
    Mehran Ziadloo
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is derived from the ISIC Archive with the following changes:

    1. A new integer column is added named "target" with values 0, 1, null. This column is populated using two other columns: "benign_malignant" and "diagnosis". If the first column explicitly confirms that the record is either "benign" or "malignant", the target is set to "0" or "1" respectively. If the "benign_malignant" column is null, then the value of the "diagnosis" column is used to determine the value for "target". The following diagnosis values are considered cancerous and, as a result, "target" is set to "1":
    • squamous cell carcinoma
    • basal cell carcinoma
    • melanoma

    If the "benign_malignant" column is null and the "diagnosis" column is "vascular lesion", the target is set to null.

    DISCLAIMER I'm not a dermatologist and I'm not affiliated with ISIC in any way. I don't know if my approach to setting the target value is acceptable by the ISIC competition. Use at your own risk.

    2. All the images are resized to 256x256 using the following Python code:
    import os
    import multiprocessing as mp
    from PIL import Image, ImageOps
    import glob
    from functools import partial
    
    
    def list_jpg_files(folder_path):
      # Ensure the folder path ends with a slash
      if not folder_path.endswith('/'):
        folder_path += '/'
    
      # Use glob to find all .jpg files in the specified folder (non-recursive)
      jpg_files = glob.glob(folder_path + '*.jpg')
    
      return jpg_files
    
    
    
    def resize_image(image_path, destination_folder):
      # Open the image file
      with Image.open(image_path) as img:
        # Get the original dimensions
        original_width, original_height = img.size
    
        # Calculate the aspect ratio
        aspect_ratio = original_width / original_height
    
        # Determine the new dimensions based on the aspect ratio
        if aspect_ratio > 1:
          # Width is larger, so we will crop the width
          new_width = int(256 * aspect_ratio)
          new_height = 256
        else:
          # Height is larger, so we will crop the height
          new_width = 256
          new_height = int(256 / aspect_ratio)
    
        # Resize the image while maintaining the aspect ratio
        img = img.resize((new_width, new_height))
    
        # Calculate the crop box to center the image
        left = (new_width - 256) / 2
        top = (new_height - 256) / 2
        right = (new_width + 256) / 2
        bottom = (new_height + 256) / 2
    
        # Crop the image if it results in shrinking
        if new_width > 256 or new_height > 256:
          img = img.crop((left, top, right, bottom))
        else:
          # Add black edges if it results in scaling up
          img = ImageOps.expand(img, border=(int(left), int(top), int(left), int(top)), fill='black')
    
        # Resize the image to the final dimensions
        img = img.resize((256, 256))
    
      img.save(os.path.join(destination_folder, os.path.basename(image_path)))
    
    
    source_folder = ""
    destination_folder = ""
    
    images = list_jpg_files(source_folder)
    
    with mp.Pool(processes=12) as pool:
      images = pool.map(partial(resize_image, destination_folder=destination_folder), images)
    print("All images resized")
    

    This code will shrink (down-sample) the image if it is larger than 256x256. But if the image is smaller than 256x256, it will add either vertical or horizontal black edges after scaling up the image. In both scenarios, it will keep the center of the input image in the center of the output image.

    The HDF5 file is created using the following code:

    import os
    import pandas as pd
    from PIL import Image
    import h5py
    import io
    import numpy as np
    
    # File paths
    base_folder = "./isic-2018-task-12-256x256"
    csv_file_path = 'train-metadata.csv'
    image_folder_path = 'train-image/image'
    hdf5_file_path = 'train-image.hdf5'
    
    # Read the CSV file
    df = pd.read_csv(os.path.join(base_folder, csv_file_path))
    
    # Open an HDF5 file
    with h5py.File(os.path.join(base_folder, hdf5_file_path), 'w') as hdf5_file:
      for index, row in df.iterrows():
        isic_id = row['isic_id']
        image_file_path = os.path.join(base_folder, image_folder_path, f'{isic_id}.jpg')
        
        if os.path.exists(image_file_path):
          # Open the image file
          with Image.open(image_file_path) as img:
            # Convert the image to a byte buffer
            img_byte_arr = io.BytesIO()
            img.save(img_byte_arr, format=img.format)
            img_byte_arr = img_byte_arr.getvalue()
            hdf5_file.create_dataset(isic_id, data=np.void(img_byte_arr))
        else:
          print(f"Image file for {isic_id} not found.")
    
    print("HDF5 file created successfully.")
    

    To read the hdf5 file, use the following code:

    import h5py
    from PIL import Image
    ...
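
    The snippet above is cut off in the original description. As a minimal reader sketch, assuming the layout produced by the writer code above (each key is an ISIC id whose value holds the raw JPEG bytes), one might do:

    import io
    import h5py
    from PIL import Image
    
    hdf5_file_path = 'train-image.hdf5'  # adjust to your local copy
    
    with h5py.File(hdf5_file_path, 'r') as hdf5_file:
      for isic_id in list(hdf5_file.keys())[:5]:  # first few images only
        jpeg_bytes = hdf5_file[isic_id][()].tobytes()  # np.void -> raw bytes
        img = Image.open(io.BytesIO(jpeg_bytes))
        print(isic_id, img.size)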
    
  3. Bangladeshi License Plates for OCR

    • kaggle.com
    zip
    Updated Jun 20, 2023
    Cite
    Abrar Chowdhury (2023). Bangladeshi License Plates for OCR [Dataset]. https://www.kaggle.com/datasets/abrarchowdhury/bangladeshi-license-plates-382359-images
    Explore at:
    Available download formats: zip (622682351 bytes)
    Dataset updated
    Jun 20, 2023
    Authors
    Abrar Chowdhury
    License

    https://cdla.io/permissive-1-0/

    Area covered
    Bangladesh
    Description

    Bangladeshi Vehicle License Plate Dataset

    Pre-Processing

    Make the images sharper and larger for better training

    import os
    import cv2
    import numpy as np
    from multiprocessing import Pool, cpu_count
    
    input_dir = "/Users/abrarahasanadil/Downloads/Thesis/dataset/distorted_images"
    output_dir = "/Users/abrarahasanadil/Downloads/Thesis/dataset/clear_images"
    
    def preprocess_image(img_file):
      img = cv2.imread(img_file)
    
      # Resize the image to a height of 400 pixels, preserving the aspect ratio
      height, width, _ = img.shape
      new_height = 400
      new_width = int((new_height / height) * width)
      img = cv2.resize(img, (new_width, new_height))
    
      # Apply bilateral filtering to remove noise while keeping edges sharp
      img = cv2.bilateralFilter(img, 9, 75, 75)
    
      # Apply unsharp masking to enhance edges: blend the original with a blurred copy
      blurred = cv2.GaussianBlur(img, (0, 0), 3)
      img = cv2.addWeighted(
        img, 1.2, blurred, -0.2, 0
      ) # Reduce the value of alpha from 1.5 to 1.2
    
      # Convert the image to grayscale
      gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    
      # Increase contrast in darker regions using adaptive histogram equalization
      clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
      gray = clahe.apply(gray)
    
      # Apply a sharpening filter
      kernel = np.array([[-1, -1, -1], [-1, 9, -1], [-1, -1, -1]])
      gray = cv2.filter2D(gray, -1, kernel)
    
      # Save the output image
      output_file = os.path.join(output_dir, os.path.basename(img_file))
      os.makedirs(
        os.path.dirname(output_file), exist_ok=True
      ) # Create the output directory if it doesn't exist
      cv2.imwrite(output_file, gray)
    
    if __name__ == "__main__":
      # Get a list of all the image files in the input directory
      image_files = [
        os.path.join(input_dir, f) for f in os.listdir(input_dir) if f.endswith(".jpg")
      ]
    
      # Create a pool of worker processes
      num_workers = cpu_count() # Use all available CPU cores
      with Pool(num_workers) as pool:
        # Preprocess all the images in parallel
        pool.map(preprocess_image, image_files)
    
  4. NIH Chest X-rays Preprocessed Version

    • kaggle.com
    zip
    Updated Sep 13, 2025
    Cite
    Yasiru-210329E (2025). NIH Chest X-rays Preprocessed Version [Dataset]. https://www.kaggle.com/datasets/laksaraky210329e/nih-chest-x-rays-preprocessed-version
    Explore at:
    Available download formats: zip (60316035929 bytes)
    Dataset updated
    Sep 13, 2025
    Authors
    Yasiru-210329E
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    NIH Chest X-rays Preprocessed Version

    This dataset is a preprocessed version of the NIH Chest X-ray Dataset. The original images were systematically organized, explored, and enhanced to improve their quality for research and machine learning applications.

    What was done:

    • The full directory structure of the dataset was explored and documented, including image counts and sample filenames for each folder.
    • All chest X-ray images were processed using CLAHE (Contrast Limited Adaptive Histogram Equalization) to improve local contrast and highlight important features in the images.
    • The processed images were saved in a new directory structure that mirrors the original, ensuring easy traceability and organization.
    • Sample images were visualized and compared before and after preprocessing, with histograms provided to illustrate the improvement in contrast.
    • Batch processing was performed, with summary statistics and verification steps to confirm successful image enhancement and saving.

    Outputs:

    • A complete set of CLAHE-enhanced chest X-ray images, organized in the same way as the original dataset.
    • Visualizations and statistics demonstrating the effectiveness of the preprocessing, including side-by-side comparisons and histogram overlays.
    • Verified counts of processed images for each folder, ensuring data integrity.

    This preprocessed dataset is ready for use in further analysis, model training, or clinical research, with improved image quality and consistent organization. No changes were made to the original labels or metadata.
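
    For readers who want to reproduce a similar step, here is a minimal CLAHE sketch with OpenCV; the clip limit, tile size, and folder names are illustrative assumptions, not necessarily the settings used for this dataset.

    import os
    import cv2
    
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    
    src_dir = 'images_001/images'        # hypothetical input folder
    dst_dir = 'images_001_clahe/images'  # hypothetical output folder
    os.makedirs(dst_dir, exist_ok=True)
    
    for name in os.listdir(src_dir):
      if not name.endswith('.png'):
        continue
      gray = cv2.imread(os.path.join(src_dir, name), cv2.IMREAD_GRAYSCALE)
      enhanced = clahe.apply(gray)  # local contrast enhancement
      cv2.imwrite(os.path.join(dst_dir, name), enhanced)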

  5. Stone Classification

    • kaggle.com
    zip
    Updated Mar 18, 2025
    Cite
    Khadgar (2025). Stone Classification [Dataset]. https://www.kaggle.com/datasets/claydonwang/stone-classification
    Explore at:
    Available download formats: zip (69490 bytes)
    Dataset updated
    Mar 18, 2025
    Authors
    Khadgar
    Description

    Outline

    The dataset is used in the final project of STA325 at SUSTech.

    How to Generate submission.csv from test_loader

    1. Define the Prediction Function

    Use the following function to extract predictions from test_loader:

    import os
    import torch
    from tqdm import tqdm
    
    def predict(model, loader, device):
      model.eval() # Set the model to evaluation mode
      predictions = [] # Store predicted classes
      image_ids = [] # Store image filenames
    
      with torch.no_grad(): # Disable gradient computation
        for images, img_paths in tqdm(loader, desc="Predicting on test set"):
          images = images.to(device) # Move images to the specified device
          outputs = model(images) # Forward pass to get model outputs
          _, predicted = torch.max(outputs, 1) # Get predicted classes
    
          # Collect predictions and image IDs
          predictions.extend(predicted.cpu().numpy())
          image_ids.extend([os.path.basename(path) for path in img_paths])
    
      return image_ids, predictions

    2. Run Predictions

    Call the prediction function with the trained model, test_loader, and device:

    image_ids, predictions = predict(model, test_loader, device)

    3. Create the Submission File

    import pandas as pd
    import os
    
    # Create DataFrame
    submission_df = pd.DataFrame({
      "id": image_ids,  # Image filenames
      "label": predictions # Predicted classes
    })
    
    # Save to the specified path
    OUTPUT_DIR = "logs"
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    submission_path = os.path.join(OUTPUT_DIR, "submission.csv")
    submission_df.to_csv(submission_path, index=False)
    print(f"Kaggle submission file saved to {submission_path}")
    

    Output Description

    • submission.csv format: the file contains two columns:
      • id: Filenames of test images (without paths, e.g., image1.jpg).
      • label: Predicted class indices (e.g., 0, 1, 2, depending on the number of classes).

    • Example content:

      id,label
      000001.jpg,0
      000002.jpg,1
      000003.jpg,2

    Then submit the submission.csv to Kaggle.

  6. R

    Accident Detection Model Dataset

    • universe.roboflow.com
    zip
    Updated Apr 8, 2024
    Cite
    Accident detection model (2024). Accident Detection Model Dataset [Dataset]. https://universe.roboflow.com/accident-detection-model/accident-detection-model/model/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 8, 2024
    Dataset authored and provided by
    Accident detection model
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Accident Bounding Boxes
    Description

    Accident-Detection-Model

    Accident Detection Model is made using YOLOv8, Google Colab, Python, Roboflow, Deep Learning, OpenCV, Machine Learning, and Artificial Intelligence. It can detect an accident from a live camera feed, image, or video. This model is trained on a dataset of 3200+ images; these images were annotated on Roboflow.

    Problem Statement

    • Road accidents are a major problem in India, with thousands of people losing their lives and many more suffering serious injuries every year.
    • According to the Ministry of Road Transport and Highways, India witnessed around 4.5 lakh road accidents in 2019, which resulted in the deaths of more than 1.5 lakh people.
    • The age range that is most severely hit by road accidents is 18 to 45 years old, which accounts for almost 67 percent of all accidental deaths.

    Accidents survey

    Survey image: https://user-images.githubusercontent.com/78155393/233774342-287492bb-26c1-4acf-bc2c-9462e97a03ca.png

    Literature Survey

    • Sreyan Ghosh (Mar 2019): the goal is to develop a system using a deep learning convolutional neural network trained to identify video frames as accident or non-accident.
    • Deeksha Gour (Sep 2019): uses computer vision technology, neural networks, deep learning, and various approaches and algorithms to detect objects.

    Research Gap

    • Lack of real-world data - We trained the model on more than 3200 images.
    • Large interpretability time and space needed - We use Google Colab to reduce the time and space required.
    • Outdated versions in previous works - We are using the latest version of YOLOv8.

    Proposed methodology

    • We are using YOLOv8 to train on our custom dataset of 3200+ images, collected from different platforms.
    • After training for 25 iterations, the model is ready to detect an accident with a significant probability.

    Model Set-up

    Preparing Custom dataset

    • We have collected 1200+ images from different sources like YouTube, Google Images, Kaggle.com, etc.
    • Then we annotated all of them individually on a tool called Roboflow.
    • During annotation we marked the images with no accident as NULL, and we drew a box on the site of the accident for the images containing an accident.
    • Then we divided the dataset into train, val, and test in the ratio of 8:1:1.
    • At the final step we downloaded the dataset in YOLOv8 format.
      #### Using Google Colab
    • We are using Google Colaboratory to code this model because Colab provides a GPU, which is faster than most local environments.
    • You can use Jupyter notebooks, which let you blend code, text, and visualisations in a single document, to write and run Python code in Google Colab.
    • Users can run individual code cells in Jupyter notebooks and quickly view the results, which is helpful for experimenting and debugging. Additionally, they enable the development of visualisations that make use of well-known frameworks like Matplotlib, Seaborn, and Plotly.
    • In Google Colab, first of all we changed the runtime from TPU to GPU.
    • We cross-checked it by running the command '!nvidia-smi'
      #### Coding
    • First of all, we installed YOLOv8 with the command '!pip install ultralytics==8.0.20'
    • Further, we checked the YOLOv8 installation with the commands 'from ultralytics import YOLO' and 'from IPython.display import display, Image'
    • Then we connected and mounted our Google Drive account with the code 'from google.colab import drive' and 'drive.mount('/content/drive')'
    • Then we ran our main command to start the training process: '%cd /content/drive/MyDrive/Accident Detection model' followed by '!yolo task=detect mode=train model=yolov8s.pt data=data.yaml epochs=1 imgsz=640 plots=True'
    • After the training we ran commands to test and validate our model: '!yolo task=detect mode=val model=runs/detect/train/weights/best.pt data=data.yaml' and '!yolo task=detect mode=predict model=runs/detect/train/weights/best.pt conf=0.25 source=data/test/images'
    • Further, to get results from any video or image we ran this command: '!yolo task=detect mode=predict model=runs/detect/train/weights/best.pt source="/content/drive/MyDrive/Accident-Detection-model/data/testing1.jpg/mp4"'
    • The results are stored in the runs/detect/predict folder.
      Hence our model is trained, validated and tested, and is able to detect accidents in any video or image. The same steps can also be driven from the ultralytics Python API, as shown in the sketch below.
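
    A minimal sketch of the equivalent workflow using the ultralytics Python API (model, data and path names are the ones quoted in the steps above and should be treated as illustrative):

    from ultralytics import YOLO
    
    # Train a YOLOv8-small model on the custom dataset (data.yaml from the Roboflow export)
    model = YOLO('yolov8s.pt')
    model.train(data='data.yaml', epochs=1, imgsz=640, plots=True)
    
    # Validate the best checkpoint produced by training
    best = YOLO('runs/detect/train/weights/best.pt')
    best.val(data='data.yaml')
    
    # Run inference on test images; results are saved under runs/detect/predict
    best.predict(source='data/test/images', conf=0.25, save=True)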

    Challenges I ran into

    I majorly ran into 3 problems while making this model

    • I had difficulty saving the results in a folder. As YOLOv8 is the latest version, it is still under development, so I read some blogs and referred to Stack Overflow, and learned that in the new v8 we need to add an extra argument, 'save=True', to save the results in a folder.
    • I was facing a problem on the CVAT website because I was not sure what
  7. r/cosplay hot top images with titles

    • kaggle.com
    zip
    Updated Mar 2, 2023
    Cite
    dinhanhx (2023). r/cosplay hot top images with titles [Dataset]. https://www.kaggle.com/datasets/inhanhv/rcosplay-hot-top-images-with-titles
    Explore at:
    Available download formats: zip (1251562500 bytes)
    Dataset updated
    Mar 2, 2023
    Authors
    dinhanhx
    License

    https://www.reddit.com/wiki/api

    Description

    Please visit dinhanhx/rct

    Sauce for the thumbnail

    r/cosplay title crawler

    Available on Kaggle

    Please take time to read all this readme before using the dataset. Yes I'm serious!

    Setup

    pip install -e .
    

    Go to this PRAW doc page, follow the instructions to get your client id, client secret, and user agent.

    Then store them in confidential/reddit.json like this (don't actually write "spooky"):

    {
      "id": "spooky",
      "secret": "spooky",
      "user-agent": "windows-10:spooky:v0.0.1 (by u/spooky)"
    }
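
    The crawler itself lives in rct/crawl.py; purely as an illustration (not code from the repo), reading those credentials into PRAW could look like this:

    import json
    import praw
    
    # Hypothetical helper: load the credentials saved in confidential/reddit.json
    with open('confidential/reddit.json') as f:
      creds = json.load(f)
    
    reddit = praw.Reddit(
      client_id=creds['id'],
      client_secret=creds['secret'],
      user_agent=creds['user-agent'],
    )
    
    # Iterate over the subreddit's hot posts, as the crawler does for hot and top
    for submission in reddit.subreddit('cosplay').hot(limit=10):
      print(submission.title, submission.url)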

    Run

    Download all posts in top and hot (but the number in each category is limited by Reddit)

    • Output file: data/cosplay.jsonl
    • 2161 posts (on 01/03/2023)

    python rct/crawl.py

    Clean text (in posts' titles) enclosed by square brackets such as [self], [found], ...

    • Input file: data/cosplay.jsonl
    • Output file: data/clean_cosplay.jsonl

    python rct/clean.py

    Download images

    • Input file: data/clean_cosplay.jsonl
    • Output files: data/map_cosplay.jsonl, data/bad_response.jsonl
    • 2160 downloaded images, 1 bad/deleted/deprecated image (on 02/03/2023)

    python rct/download.py

    ⚠ The image_id and image_path attributes' values are NOT linearly continuous. For example, in data/bad_response.jsonl:

    {"image_id": "001912", "image_path": "data/image/001912.jpg"}

    and in data/map_cosplay.jsonl:

    # omit other json objects
    {"image_id": "001911", "image_path": "data/image/001911.jpg"}
    {"image_id": "001913", "image_path": "data/image/001913.jpg"}
    # omit other json objects

    ⚠ image_path attribute's values are data/image/*.jpg. They are relative to the folder data containing all .jsonl files and the image folder. The folder data is produced by the Python scripts.

    ⚠ image_path attribute's values MISMATCH the name of the folder containing all .jsonl files and the image folder on Kaggle. When you load the data from the Kaggle Dataset, the data prefix in data/image/000000.jpg should be replaced with the Kaggle path (see the demo notebook: https://www.kaggle.com/code/inhanhv/rct-demo). It becomes /kaggle/input/rcosplay-hot-top-images-with-titles/image/000000.jpg
    
  8. Diabetic Retinopathy (resized)

    • kaggle.com
    zip
    Updated May 8, 2019
    Cite
    ilovescience (2019). Diabetic Retinopathy (resized) [Dataset]. https://www.kaggle.com/tanlikesmath/diabetic-retinopathy-resized
    Explore at:
    Available download formats: zip (7785957896 bytes)
    Dataset updated
    May 8, 2019
    Authors
    ilovescience
    Description

    Diabetic Retinopathy Detection Competition Dataset Resized/Cropped

    In this dataset, I have included both a resized version of the dataset, and a cropped then resized version of the data.

    trainLabels.csv

    This file contains the name of the file under the 'image' column and the label under the 'level' column.

    resized_train:

    This folder was created by simply resizing each image so that its width is at most 1024 pixels (preserving the aspect ratio); smaller images remain unchanged. The code used to create this dataset is:

    import glob
    import os
    from tqdm import tqdm
    import math
    from PIL import Image 
    files = glob.glob('D:\\Experiments with Deep Learning\\DR Kaggle\\train\\train\\train\\*.jpeg')
    
    new_width = 1024
    
    for i in tqdm(range(len(files))):
      img = Image.open(files[i])
      width,height = img.size
      ratio = height/width
      if width > new_width:
        new_image = img.resize((new_width,math.ceil(ratio*new_width)))  
      else:
        new_image = img
      new_image.save('D:\\Experiments with Deep Learning\\DR Kaggle\\train\\train\\resized_train\\'+os.path.basename(files[i]))
    


    resized_train_cropped:

    In this case, as much of the black space is cropped out by trying to identify the center and radius of the circle of the fundus image. Some of the images turned out to be fully black or very close to fully black, and no mask was found. Hence, those images were manually removed. There may still be some noisy images remaining, however.

    The code used to create this dataset is:

    # import the necessary packages
    import numpy as np
    import cv2
    import glob
    import os
    from tqdm import tqdm
    import math
    from PIL import Image
    files = glob.glob('D:\\Experiments with Deep Learning\\DR Kaggle\\train\\train\\train\\*.jpeg')
    
    new_sz = 1024
    
    def crop_image(image):
      output = image.copy()
      gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
      ret,gray = cv2.threshold(gray,10,255,cv2.THRESH_BINARY)
      contours,hierarchy = cv2.findContours(gray,cv2.RETR_EXTERNAL,cv2.CHAIN_APPROX_SIMPLE)
      if not contours:
        print('no contours!')
        flag = 0
        return image, flag
      cnt = max(contours, key=cv2.contourArea)
      ((x, y), r) = cv2.minEnclosingCircle(cnt)
      x = int(x); y = int(y); r = int(r)
      flag = 1
      #print(x,y,r)
      if r > 100:
        return output[0 + (y-r)*int(r
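      # --- The original description is cut off here by the dataset page. ---
      # Purely as an illustration (not the author's remaining code), a crop
      # around the detected circle, clamped to the image borders, could look like:
      #   h, w = output.shape[:2]
      #   x1, x2 = max(x - r, 0), min(x + r, w)
      #   y1, y2 = max(y - r, 0), min(y + r, h)
      #   return output[y1:y2, x1:x2], flag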
    
  9. CelebsV2_Faces_224

    • kaggle.com
    zip
    Updated Jun 8, 2024
    Cite
    Shreyansh Manav Shukla (2024). CelebsV2_Faces_224 [Dataset]. https://www.kaggle.com/datasets/shreyanshmanavshukla/celebsv2-faces-224/data
    Explore at:
    Available download formats: zip (110015744 bytes)
    Dataset updated
    Jun 8, 2024
    Authors
    Shreyansh Manav Shukla
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Name: Celeb-DF Faces Dataset

    Description: The Celeb-DF Faces Dataset is a curated collection of facial images extracted from the Celeb-DF dataset. This dataset focuses on providing a comprehensive set of facial images for research and analysis in the field of deepfake detection and facial image analysis. The images are categorized into two classes: "Fake" and "Real," based on the source of the videos.

    Dataset Structure:

    Image Size: 224x224 pixels

    Source Folders:

    • celeb-df-v2/Celeb-real: Contains authentic facial videos.
    • celeb-df-v2/Celeb-synthesis: Contains synthesized (fake) facial videos.
    • celeb-df-v2/YouTube-real: Contains additional authentic facial videos from YouTube.

    Output Folder:

    • celeb_faces_224/: Contains the extracted and resized facial images.

    Metadata File:

    • metadata_celebs.csv: A CSV file storing metadata information for each extracted image with the following columns:
      • Name: The filename of the extracted image.
      • Label: The label indicating whether the image is "Fake" or "Real."

    Creation Process:

    1. Video Frame Extraction: The first frame from each video in the source folders is extracted.
    2. Image Resizing: The extracted frames are resized to 224x224 pixels to ensure uniformity and compatibility with common machine learning models.
    3. Image Storage: The resized images are saved in the celeb_faces_224/ folder with filenames corresponding to the original video names.
    4. Metadata Compilation: A metadata CSV file (metadata_celebs.csv) is created to store the filenames and labels of the images, indicating whether they are from "Fake" or "Real" videos.

    Intended Use: The dataset is ideal for tasks such as:

    • Deepfake detection and analysis
    • Training and evaluation of machine learning models for facial image classification
    • Image forensics research and development

    Note: This dataset is derived from the Celeb-DF dataset and is intended for research and educational purposes only.
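
    A minimal sketch of the creation process described above (illustrative only: folder names follow the description, face cropping, if any was applied, is omitted, and this is not necessarily the exact script used):

    import os
    import cv2
    import pandas as pd
    
    sources = {
      'celeb-df-v2/Celeb-real': 'Real',
      'celeb-df-v2/Celeb-synthesis': 'Fake',
      'celeb-df-v2/YouTube-real': 'Real',
    }
    out_dir = 'celeb_faces_224'
    os.makedirs(out_dir, exist_ok=True)
    
    rows = []
    for folder, label in sources.items():
      for video_name in os.listdir(folder):
        cap = cv2.VideoCapture(os.path.join(folder, video_name))
        ok, frame = cap.read()  # first frame only
        cap.release()
        if not ok:
          continue
        face = cv2.resize(frame, (224, 224))  # resize to 224x224
        image_name = os.path.splitext(video_name)[0] + '.jpg'
        cv2.imwrite(os.path.join(out_dir, image_name), face)
        rows.append({'Name': image_name, 'Label': label})
    
    pd.DataFrame(rows).to_csv('metadata_celebs.csv', index=False)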

  10. External CFD Aerodynamics Dataset

    • kaggle.com
    zip
    Updated Sep 9, 2021
    Cite
    Andrey Shtrauss (2021). External CFD Aerodynamics Dataset [Dataset]. https://www.kaggle.com/shtrausslearning/external-cfd-aero
    Explore at:
    Available download formats: zip (2848480902 bytes)
    Dataset updated
    Sep 9, 2021
    Authors
    Andrey Shtrauss
    Description

    Image: https://i.imgur.com/gAwhrwd.jpg

    External CFD Aerodynamics Dataset

    • The theme for this data is external flow aerodynamics: simulation of the flow physics around some geometry (e.g. aircraft, vehicle).
    • Iterative simulations of the flow physics are run until some final convergence criterion is reached.
    • The dataset contains simulation results obtained from different CFD solvers, sorted into folders for similar types of data outputs.


    Data origin

    Where the data is stored:

    • Depending on the type of solver and export option, data can be stored at vertices (grid intersections) or in cells (each little sub-domain, usually the cell centre).
    • Before solving the nonlinear equations, we need to discretise the domain into smaller zones; once the solution achieves convergence, it can be exported as either the entire domain data, subsets of the flowfield domain, or just tabular data (at specific points).

    [Figures: "Discretised region around a geometry" and "Part of the result is visualised via a 2D slice"]


    Folder Structure

    Utilising the dataset with notebook classes, the recommended data storage structure:

    • Main Folder ( Geometry name used in simulation )
      • Case Name ( Brief simulation name; what was tested etc )
        • Individual Case Name ( If multiple cases were tested etc ):
          • flowfield folder (stores multiblock file content - automatically created when saving VTM)
          • tab_final ( final iteration tabular data output content )
          • tab_iter ( iteratively changing tabular data, etc convergence history of a parameter )


    Current Dataset Content

    Content in the dataset:

    • [**30p30n**] [**anglevar_sst**]
      • [10][11][12][13][3][4][5][6][7][8][9]
    • [**crm**][**nacelle_effect**]
      • [crm_wb_clean][crm_wb_eng]
    • [**m6**]
      • [**3p06_sst_ep**]
    • [**rae2822**][**angvar_sa**]
      • [0][1][10][11][12][13][14][15][16][17][18][19][2][20][3][4][5][6][7][8][9]
  11. Churn Prediction and Transaction Forecasting

    • kaggle.com
    zip
    Updated Aug 19, 2025
    Cite
    Richa Patel (2025). Churn Prediction and Transaction Forecasting [Dataset]. https://www.kaggle.com/datasets/richapatel912/churn-prediction-and-transaction-forecasting/discussion
    Explore at:
    Available download formats: zip (3258802 bytes)
    Dataset updated
    Aug 19, 2025
    Authors
    Richa Patel
    Description

    “AI-Powered Banking Analytics: Automated Power BI Documentation, Churn Prediction, and Transaction Forecasting”

    Project Workflow

    1. Data Acquisition (Kaggle)
      • Dataset sourced from Kaggle (credit card / banking dataset).
      • Contains customer demographics, credit card transactions, and account details.
      • Cleaned and transformed data in Power BI for dashboard building.
    2. Interactive Power BI Dashboard
      • Built two key analytics pages:
        • Customer Churn Insights → shows churn risk, drivers, segmentation.
        • Transaction Forecasting → predicts future monthly transactions with confidence bands.
      • Added KPI cards, slicers, and professional formatting.
      • Ensured the design covers customer risk, forecasting, and governance.
    3. Automated Documentation (Python + VPAX)
      • Exported the Power BI data model (VPAX) using DAX Studio.
      • Created a Python script to automatically generate:
        • a Word doc with model documentation,
        • an Excel file with tables, relationships, and fields,
        • an ER diagram image.
      • This automation saves analysts hours of manual work and enforces governance.
    4. Churn Prediction Model (Python + Power BI)
      • Built a Random Forest model for churn prediction.
      • Output:
        • Customer-level churn probability.
        • Risk categories (Low, Medium, High).
        • Feature importance (drivers of churn).
      • Exported predictions to Excel → imported into Power BI.
      • Added a Churn Risk Dashboard:
        • Distribution of churn risk.
        • Top churn drivers (feature importance bar chart).
    5. Transaction Forecasting Model (Python + Prophet)
      • Used Prophet (Facebook's forecasting library) to model monthly transaction volumes.
      • Forecasted the next 12 months with confidence intervals (yhat_lower, yhat_upper).
      • Exported results to Excel → integrated into Power BI.
      • Added a Transaction Forecasting Dashboard:
        • Actual vs Forecast line chart (with confidence band).
        • KPI cards (Next Month Forecast, YoY Growth).
        • Clustered column chart for the recent 12 months.
    6. End-to-End Data & AI Pipeline
      • Data Source (Kaggle) → Power BI Dashboard → Automated Documentation → AI/ML Models → Power BI Insights.
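
    As a rough sketch of the forecasting step described above (the column names transaction_date and amount are illustrative assumptions, not taken from the project files):

    import pandas as pd
    from prophet import Prophet
    
    # Aggregate transactions to monthly totals; Prophet expects columns ds (date) and y (value)
    monthly = pd.read_excel('credit_card.xlsx')  # illustrative input
    monthly['transaction_date'] = pd.to_datetime(monthly['transaction_date'])
    df = (monthly.groupby(pd.Grouper(key='transaction_date', freq='MS'))['amount']
                 .sum().reset_index()
                 .rename(columns={'transaction_date': 'ds', 'amount': 'y'}))
    
    model = Prophet()
    model.fit(df)
    
    # Forecast the next 12 months with confidence bounds (yhat_lower, yhat_upper)
    future = model.make_future_dataframe(periods=12, freq='MS')
    forecast = model.predict(future)
    forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].to_excel(
      'Transaction_Forecast.xlsx', index=False)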

    File Details:

    • .idea/: PyCharm IDE configuration folder (auto-generated).
    • Churn Prediction + Forecasting.py: Main Python script for churn prediction (Random Forest) and transaction forecasting (Prophet).
    • churn_model.pkl: Saved machine learning model (Random Forest) for churn prediction.
    • Churn_Predictions.xlsx: Excel output of churn probabilities and risk categories per customer.
    • Credit Card Financial Dashboard.pbix: Power BI dashboard file (interactive BI report).
    • Credit Card Financial Dashboard.pdf: Exported PDF version of the Power BI dashboard.
    • credit_card.xlsx: Kaggle dataset (credit card transactions / account features).
    • customer.xlsx: Kaggle dataset (customer demographic and account info).
    • DocumentationGenerator.py: Python script that parses the VPAX model and generates automated Power BI documentation.
    • Feature_Importance.xlsx: Feature importance scores from the churn model (top churn drivers).
    • forecast_model.pkl: Saved Prophet model for forecasting monthly transactions.
    • LICENSE: License file for open-source/public sharing.
    • model.vpax: Exported Power BI data model (via DAX Studio) for documentation.
    • PowerBI_Documentation.docx: Word output of auto-generated Power BI documentation.
    • PowerBI_Documentation.xlsx: Excel output of auto-generated Power BI documentation.
    • PowerBI_ER_Diagram.png: Entity-Relationship diagram image generated from the Power BI model.
    • README.md: Markdown summary file for GitHub/Kaggle.
    • Transaction_Forecast.xlsx: Excel output containing actuals + forecast (Prophet) with confidence bounds.

  12. BBC NEWS SUMMARY(CSV FORMAT)

    • kaggle.com
    zip
    Updated Sep 9, 2024
    Cite
    Dhiraj (2024). BBC NEWS SUMMARY(CSV FORMAT) [Dataset]. https://www.kaggle.com/datasets/dignity45/bbc-news-summarycsv-format
    Explore at:
    Available download formats: zip (2097600 bytes)
    Dataset updated
    Sep 9, 2024
    Authors
    Dhiraj
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Description: Text Summarization Dataset

    This dataset is designed for users aiming to train models for text summarization. It contains 2,225 rows of data with two columns: "Text" and "Summary". Each row features a detailed news article or piece of text paired with its corresponding summary, providing a rich resource for developing and fine-tuning summarization algorithms.

    Key Features:

    • Text: Full-length articles or passages that serve as the input for summarization.
    • Summary: Concise summaries of the articles, which are ideal for training models to generate brief, coherent summaries from longer texts.

    Future Enhancements:

    This evolving dataset is planned to include additional features, such as text class labels, in future updates. These enhancements will provide more context and facilitate the development of models that can perform summarization across different categories of news content.

    Usage:

    Ideal for researchers and developers focused on text summarization tasks, this dataset enables the training of models to effectively compress information while retaining the essence of the original content.

    Acknowledgment

    We would like to extend our sincere gratitude to the dataset creator for their contribution to this valuable resource. This dataset, sourced from the BBC News Summary dataset on Kaggle, was created by Pariza. Their work has provided an invaluable asset for those working on text summarization tasks, and we appreciate their efforts in curating and sharing this data with the community.

    Thank you for supporting research and development in the field of natural language processing!

    File Description

    This script processes and consolidates text data from various directories containing news articles and their corresponding summaries. It reads the files from specified folders, handles encoding issues, and then creates a DataFrame that is saved as a CSV file for further analysis.

    Key Components:

    1. Imports:

      • numpy (np): Numerical operations library, though it's not used in this script.
      • pandas (pd): Data manipulation and analysis library.
      • os: For interacting with the operating system, e.g., building file paths.
      • glob: For file pattern matching and retrieving file paths.
    2. Function: get_texts

      • Parameters:
        • text_folders: List of folders containing news article text files.
        • text_list: List to store the content of text files.
        • summ_folder: List of folders containing summary text files.
        • sum_list: List to store the content of summary files.
        • encodings: List of encodings to try for reading files.
      • Purpose:
        • Reads text files from specified folders, handles different encodings, and appends the content to text_list and sum_list.
        • Returns the updated lists of texts and summaries.
    3. Data Preparation:

      • text_folder: List of directories for news articles.
      • summ_folder: List of directories for summaries.
      • text_list and summ_list: Initialize empty lists to store the contents.
      • data_df: Empty DataFrame to store the final data.
    4. Execution:

      • Calls get_texts function to populate text_list and summ_list.
      • Creates a DataFrame data_df with columns 'Text' and 'Summary'.
      • Saves data_df to a CSV file at /kaggle/working/bbc_news_data.csv.
    5. Output:

      • Prints the first few entries of the DataFrame to verify the content.

    Column Descriptions:

    • Text: Contains the full-length articles or passages of news content. This column is used as the input for summarization models.
    • Summary: Contains concise summaries of the corresponding articles in the "Text" column. This column is used as the target output for summarization models.

    Usage:

    • This script is designed to be run in a Kaggle environment where paths to text data are predefined.
    • It is intended for preprocessing and saving text data from news articles and summaries for subsequent analysis or model training.
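
    As a minimal sketch of the consolidation described above (assuming the usual News Articles/ and Summaries/ layout of the BBC News Summary dataset; the exact Kaggle paths may differ):

    import os
    import glob
    import pandas as pd
    
    def get_texts(text_folders, text_list, summ_folders, sum_list,
                  encodings=('utf-8', 'latin-1')):
      # Read every article and its matching summary, trying several encodings
      for t_folder, s_folder in zip(text_folders, summ_folders):
        for t_path in sorted(glob.glob(os.path.join(t_folder, '*.txt'))):
          s_path = os.path.join(s_folder, os.path.basename(t_path))
          for enc in encodings:
            try:
              with open(t_path, encoding=enc) as t, open(s_path, encoding=enc) as s:
                article, summary = t.read(), s.read()
              text_list.append(article)
              sum_list.append(summary)
              break
            except UnicodeDecodeError:
              continue
      return text_list, sum_list
    
    text_folders = sorted(glob.glob('BBC News Summary/News Articles/*'))  # assumed layout
    summ_folders = sorted(glob.glob('BBC News Summary/Summaries/*'))      # assumed layout
    text_list, summ_list = get_texts(text_folders, [], summ_folders, [])
    
    data_df = pd.DataFrame({'Text': text_list, 'Summary': summ_list})
    data_df.to_csv('bbc_news_data.csv', index=False)
    print(data_df.head())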
  13. ISIC 2018 Task 3 - 256x256

    • kaggle.com
    zip
    Updated Aug 7, 2024
    Cite
    Mehran Ziadloo (2024). ISIC 2018 Task 3 - 256x256 [Dataset]. https://www.kaggle.com/datasets/ziadloo/isic-2018-task-3-256x256/code
    Explore at:
    Available download formats: zip (164325064 bytes)
    Dataset updated
    Aug 7, 2024
    Authors
    Mehran Ziadloo
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset is derived from the ISIC Archive with the following changes:

    1. A new integer column is added named "target" with values 0, 1, null. This column is populated using two other columns: "benign_malignant" and "diagnosis". If the first column explicitly confirms that the record is either "benign" or "malignant", the target is set to "0" or "1" respectively. If the "benign_malignant" column is null, then the value of the "diagnosis" column is used to determine the value for "target". The following diagnosis values are considered cancerous and, as a result, "target" is set to "1":
    • squamous cell carcinoma
    • basal cell carcinoma
    • melanoma

    If the "benign_malignant" column is null and the "diagnosis" column is "vascular lesion", the target is set to null.

    DISCLAIMER I'm not a dermatologist and I'm not affiliated with ISIC in any way. I don't know if my approach to setting the target value is acceptable by the ISIC competition. Use at your own risk.
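
    As a rough illustration of the mapping described above (a sketch based on this description, not necessarily the exact code used to build the dataset; cases the description does not cover are left as null here):

    import numpy as np
    import pandas as pd
    
    cancerous = {'squamous cell carcinoma', 'basal cell carcinoma', 'melanoma'}
    
    def derive_target(row):
      # Prefer the explicit benign/malignant label when it is present
      if row['benign_malignant'] == 'benign':
        return 0
      if row['benign_malignant'] == 'malignant':
        return 1
      # Otherwise fall back to the diagnosis column
      if pd.isna(row['benign_malignant']):
        if row['diagnosis'] in cancerous:
          return 1
        if row['diagnosis'] == 'vascular lesion':
          return np.nan
      # Not covered by the description above
      return np.nan
    
    metadata = pd.read_csv('train-metadata.csv')
    metadata['target'] = metadata.apply(derive_target, axis=1)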

    2. All the images are resized to 256x256 using the following Python code:
    import os
    import multiprocessing as mp
    from PIL import Image, ImageOps
    import glob
    from functools import partial
    
    
    def list_jpg_files(folder_path):
      # Ensure the folder path ends with a slash
      if not folder_path.endswith('/'):
        folder_path += '/'
    
      # Use glob to find all .jpg files in the specified folder (non-recursive)
      jpg_files = glob.glob(folder_path + '*.jpg')
    
      return jpg_files
    
    
    
    def resize_image(image_path, destination_folder):
      # Open the image file
      with Image.open(image_path) as img:
        # Get the original dimensions
        original_width, original_height = img.size
    
        # Calculate the aspect ratio
        aspect_ratio = original_width / original_height
    
        # Determine the new dimensions based on the aspect ratio
        if aspect_ratio > 1:
          # Width is larger, so we will crop the width
          new_width = int(256 * aspect_ratio)
          new_height = 256
        else:
          # Height is larger, so we will crop the height
          new_width = 256
          new_height = int(256 / aspect_ratio)
    
        # Resize the image while maintaining the aspect ratio
        img = img.resize((new_width, new_height))
    
        # Calculate the crop box to center the image
        left = (new_width - 256) / 2
        top = (new_height - 256) / 2
        right = (new_width + 256) / 2
        bottom = (new_height + 256) / 2
    
        # Crop the image if it results in shrinking
        if new_width > 256 or new_height > 256:
          img = img.crop((left, top, right, bottom))
        else:
          # Add black edges if it results in scaling up
          img = ImageOps.expand(img, border=(int(left), int(top), int(left), int(top)), fill='black')
    
        # Resize the image to the final dimensions
        img = img.resize((256, 256))
    
      img.save(os.path.join(destination_folder, os.path.basename(image_path)))
    
    
    source_folder = ""
    destination_folder = ""
    
    images = list_jpg_files(source_folder)
    
    with mp.Pool(processes=12) as pool:
      images = pool.map(partial(resize_image, destination_folder=destination_folder), images)
    print("All images resized")
    

    This code will shrink (down-sample) the image if it is larger than 256x256. But if the image is smaller than 256x256, it will add either vertical or horizontal black edges after scaling up the image. In both scenarios, it will keep the center of the input image in the center of the output image.

    The HDF5 file is created using the following code:

    import os
    import pandas as pd
    from PIL import Image
    import h5py
    import io
    import numpy as np
    
    # File paths
    base_folder = "./isic-2018-task-3-256x256"
    csv_file_path = 'train-metadata.csv'
    image_folder_path = 'train-image/image'
    hdf5_file_path = 'train-image.hdf5'
    
    # Read the CSV file
    df = pd.read_csv(os.path.join(base_folder, csv_file_path))
    
    # Open an HDF5 file
    with h5py.File(os.path.join(base_folder, hdf5_file_path), 'w') as hdf5_file:
      for index, row in df.iterrows():
        isic_id = row['isic_id']
        image_file_path = os.path.join(base_folder, image_folder_path, f'{isic_id}.jpg')
        
        if os.path.exists(image_file_path):
          # Open the image file
          with Image.open(image_file_path) as img:
            # Convert the image to a byte buffer
            img_byte_arr = io.BytesIO()
            img.save(img_byte_arr, format=img.format)
            img_byte_arr = img_byte_arr.getvalue()
            hdf5_file.create_dataset(isic_id, data=np.void(img_byte_arr))
        else:
          print(f"Image file for {isic_id} not found.")
    
    print("HDF5 file created successfully.")
    

    To read the hdf5 file, use the following code:

    import h5py
    from PIL import Image
    
    ...
    
  14. OLM Converter for Mac

    • kaggle.com
    zip
    Updated Mar 23, 2022
    Cite
    BitVare Software (2022). OLM Converter for Mac [Dataset]. https://www.kaggle.com/datasets/bitvaresoftware/olm-converter-for-mac
    Explore at:
    Available download formats: zip (18499539 bytes)
    Dataset updated
    Mar 23, 2022
    Authors
    BitVare Software
    Description

    OLM Converter for Mac allows users to export OLM to PST, PDF, MBOX, EML, MSG, EMLX, VCF, ICS, etc. It can add OLM files including contacts, emails, tasks, calendars, journals, etc. and convert them to multiple file formats. Mac OLM Converter is a reliable tool to bulk convert OLM files to multiple file formats. The software preserves mail metadata elements like the mailing list, From, Cc, To, Bcc, date, email formatting, folder hierarchy, images, colors, links, attachments, etc. The tool supports OLM file conversion without any data loss and provides various options for saving the resultant file. Export OLM contacts to CSV format and calendars to ICS format. The Mac OLM Converter is compatible with all Mac OS versions. OLM Converter makes sure that the data folder hierarchy stays intact. The Mac OLM Converter allows users to convert Mac OLM files to 6+ different file formats. After the OLM file conversion, the output file can be used in any Mac- and Windows-supported application.

    Get complete information - https://www.bitvare.com/olm/

  15. The files on your computer

    • kaggle.com
    zip
    Updated Jan 15, 2017
    Cite
    cogs (2017). The files on your computer [Dataset]. https://www.kaggle.com/cogitoe/crab
    Explore at:
    Available download formats: zip (14326302 bytes)
    Dataset updated
    Jan 15, 2017
    Authors
    cogs
    Description

    Dataset: The files on your computer.

    Crab is a command line tool for Mac and Windows that scans file data into a SQLite database, so you can run SQL queries over it.

    e.g. (Win)    C:> crab C:\some\path\MyProject
    or (Mac)    $ crab /some/path/MyProject
    

    You get a CRAB> prompt where you can enter SQL queries on the data, e.g. Count files by extension

    SELECT extension, count(*) 
    FROM files 
    GROUP BY extension;
    

    e.g. List the 5 biggest directories

    SELECT parentpath, sum(bytes)/1e9 as GB 
    FROM files 
    GROUP BY parentpath 
    ORDER BY sum(bytes) DESC LIMIT 5;
    

    Crab provides a virtual table, fileslines, which exposes file contents to SQL

    e.g. Count TODO and FIXME entries in any .c files, recursively

    SELECT fullpath, count(*) FROM fileslines 
    WHERE parentpath like '/Users/GN/HL3/%' and extension = '.c'
      and (data like '%TODO%' or data like '%FIXME%')
    GROUP BY fullpath;
    

    As well there are functions to run programs or shell commands on any subset of files, or lines within files e.g. (Mac) unzip all the .zip files, recursively

    SELECT exec('unzip', '-n', fullpath, '-d', '/Users/johnsmith/Target Dir/') 
    FROM files 
    WHERE parentpath like '/Users/johnsmith/Source Dir/%' and extension = '.zip';
    

    (Here -n tells unzip not to overwrite anything, and -d specifies target directory)

    There is also a function to write query output to file, e.g. (Win) Sort the lines of all the .txt files in a directory and write them to a new file

    SELECT writeln('C:\Users\SJohnson\dictionary2.txt', data) 
    FROM fileslines 
    WHERE parentpath = 'C:\Users\SJohnson\' and extension = '.txt'
    ORDER BY data;
    

    In place of the interactive prompt you can run queries in batch mode, e.g. here is a one-liner that returns the full path of all the files in the current directory:

    C:> crab -batch -maxdepth 1 . "SELECT fullpath FROM files"
    

    Crab SQL can also be used in Windows batch files, or Bash scripts, e.g. for ETL processing.

    Crab is free for personal use, $5/mo commercial

    See more details here (Mac): http://etia.co.uk/ or here (Win): http://etia.co.uk/win/about/

    An example SQLite database (Mac data) has been uploaded for you to play with. It includes an example files table for the directory tree you get when downloading the Project Gutenberg corpus, which contains 95k directories and 123k files.
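
    If you prefer to explore the example database from Python rather than the Crab shell, a minimal sketch (assuming the uploaded file is named database.sqlite, as described below) might be:

    import sqlite3
    
    conn = sqlite3.connect('database.sqlite')  # the example files table uploaded here
    
    # Count files by extension, mirroring the first Crab query above
    rows = conn.execute(
      'SELECT extension, count(*) FROM files GROUP BY extension ORDER BY count(*) DESC LIMIT 10'
    ).fetchall()
    for extension, n in rows:
      print(extension, n)
    
    conn.close()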

    To scan your own files, and get access to the virtual tables and support functions, you have to use the Crab SQLite shell, available for download from this page (Mac): http://etia.co.uk/download/ or this page (Win): http://etia.co.uk/win/download/

    Content

    FILES TABLE

    The FILES table contains details of every item scanned, file or directory. All columns are indexed except 'mode'

    COLUMNS
     fileid (int) primary key -- files table row number, a unique id for each item
     name (text)        -- item name e.g. 'Hei.ttf'
     bytes (int)        -- item size in bytes e.g. 7502752
     depth (int)        -- how far scan recursed to find the item, starts at 0
     accessed (text)      -- datetime item was accessed
     modified (text)      -- datetime item was modified
     basename (text)      -- item name without path or extension, e.g. 'Hei'
     extension (text)     -- item extension including the dot, e.g. '.ttf'
     type (text)        -- item type, 'f' for file or 'd' for directory
     mode (text)        -- further type info and permissions, e.g. 'drwxr-xr-x'
     parentpath (text)     -- absolute path of directory containing the item, e.g. '/Library/Fonts/'
     fullpath (text) unique  -- parentpath of the item concatenated with its name, e.g. '/Library/Fonts/Hei.ttf'
    
    PATHS
    1) parentpath and fullpath don't support abbreviations such as ~ . or .. They're just strings.
    2) Directory paths all have a '/' on the end.
    

    FILESLINES TABLE

    The FILESLINES table is for querying data content of files. It has line number and data columns, with one row for each line of data in each file scanned by Crab.

    This table isn't available in the example dataset, because it's a virtual table and doesn't physically contain data.

    COLUMNS
     linenumber (int) -- line number within file, restarts count from 1 at the first line of each file
     data (text)    -- data content of the files, one entry for each line
    

    FILESLINES also duplicates the columns of the FILES table: fileid, name, bytes, depth, accessed, modified, basename, extension, type, mode, parentpath, and fullpath. This way you can restrict which files are searched without having to join tables.

    Example Gutenberg data

    An example SQLite database (Mac data), database.sqlite, has been uploaded for you to play with. It includes an example files table...

  16. USA_Contracts_medical_equip_2019_2024

    • kaggle.com
    zip
    Updated Apr 7, 2025
    Cite
    Phil Gieschen (2025). USA_Contracts_medical_equip_2019_2024 [Dataset]. https://www.kaggle.com/datasets/philgieschen/usa-contracts-medical-equip-2019-2024/code
    Explore at:
    Available download formats: zip (16588354 bytes)
    Dataset updated
    Apr 7, 2025
    Authors
    Phil Gieschen
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    United States
    Description

    US Government Contract Awards for Medical Equipment (product codes 6515 & 6640), 2019-2024. Shows total amounts obligated and outlaid by zip code, with Metropolitan Statistical Area added for mapping & visualization.

    https://www.usaspending.gov/download_center/award_data_archive

    Used Git Bash to remove other products and merge CSVs together:

    #!/bin/bash

    # Define variables
    input_folder="/c/Users/phgie/Downloads/FY2024_All_Contracts"
    output_file="combined_filtered.csv"
    temp_file="temp_filtered.csv"

    # Create or clear the output file
    > "$output_file"

    # Loop through all CSV files in the folder
    for file in "$input_folder"/*.csv; do
      # Skip the header for all but the first file
      if [ ! -s "$output_file" ]; then
        # Include the header row from the first file
        awk -F, 'NR == 1 || $104 == "6515" || $104 == "6640"' "$file" > "$temp_file"
      else
        # Exclude the header row from subsequent files
        awk -F, 'NR > 1 && ($104 == "6515" || $104 == "6640")' "$file" > "$temp_file"
      fi

      # Append the filtered content to the output file
      cat "$temp_file" >> "$output_file"
    done

    # Clean up temporary file
    rm -f "$temp_file"

    echo "Combined and filtered CSVs are saved in $output_file"

  17. Embryo classification based on microscopic images

    • kaggle.com
    Updated Oct 3, 2023
    Cite
    Gaurav Dutta (2023). Embryo classification based on microscopic images [Dataset]. https://www.kaggle.com/datasets/gauravduttakiit/embryo-classification-based-on-microscopic-images
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 3, 2023
    Dataset provided by
    Kaggle
    Authors
    Gaurav Dutta
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset Description Welcome to the "Hung Vuong Hospital Embryo Classification" dataset. This page provides a comprehensive overview of the data files, their formats, and the essential columns you'll encounter in this competition. Taking a moment to understand the data will help you navigate the challenge effectively and make informed decisions during your analysis and modeling.

    The dataset comprises the following key files:

    • train folder - Contains images of embryos at day-3 and day-5 for training purposes.
    • test folder - Contains images of embryos at day-3 and day-5 for testing purposes.
    • train.csv - Contains information about the training set.
    • test.csv - Contains information about the test set.
    • sample_submission.csv - A sample submission file that demonstrates the correct submission format.

    Data Format Expectations

    The embryo images are arranged within subfolders under the train and test directories. Each image is saved in JPG format and is labeled with a prefix. Images corresponding to day-3 embryos have the prefix D3 while images related to day-5 embryos bear the prefix D5. This prefix-based categorization allows for easy identification of the embryo's developmental stage.

    Expected Output

    Your task in this competition is to create a deep learning model that can accurately classify embryo images as 1 for good or 0 for not good for both day-3 and day-5 stages. The model should be trained on the training set and then used to predict the embryo quality in the test set. The ID column assigns an ID to each image. You will create the Class column as the result of model classification. The submission file contains only 2 columns: ID and Class (See the sample submission file)

    Columns

    You will encounter the following columns throughout the dataset:

    • ID - Refers to the ID of the images in the test set.
    • Image - Refers to the file name of the embryo images in the train or test folder.
    • Class - Represents the evaluation of the embryo images. This column provides the ground truth label for each image, indicating whether the embryo is classified as 'good' or 'not good'.

    We encourage you to explore, analyze, and preprocess the provided data to build a robust model for accurate embryo quality classification. Good luck, and may your innovative solutions contribute to advancements in reproductive science!

  18. Complete Pokemon Image Dataset

    • kaggle.com
    zip
    Updated Nov 15, 2020
    Cite
    hlrhegemony (2020). Complete Pokemon Image Dataset [Dataset]. https://www.kaggle.com/hlrhegemony/pokemon-image-dataset
    Explore at:
    Available download formats: zip (60660766 bytes)
    Dataset updated
    Nov 15, 2020
    Authors
    hlrhegemony
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    I was searching for labeled Pokemon images which satisfy these requirements:

    • Uniform, white backgrounds
    • Generations 1 through 8
    • Multiple images per Pokemon

    I could not find any after searching for a while, so I built one myself!

    Content

    These images are all scraped from https://pokemondb.net/. Each folder contains between 1 and 8 images (all .jpg) of the Pokemon, all with white backgrounds and reasonable file size. There are 2,500+ total images, which is far more than any other Kaggle dataset I have found that preserves background and picture quality (no random backgrounds, nor some white and some black, etc.). Note that all other forms of the Pokemon (Gigantamax, Mega Evolution, Alolan, Galarian, etc.) are included in the same folder.

    Inspiration

    I hope that a larger, cleaner dataset like this one can result in better GANs and VAEs.

