18 datasets found
  1. Data Mining Project - Boston

    • kaggle.com
    zip
    Updated Nov 25, 2019
    Cite
    SophieLiu (2019). Data Mining Project - Boston [Dataset]. https://www.kaggle.com/sliu65/data-mining-project-boston
    Explore at:
    Available download formats: zip (59313797 bytes)
    Dataset updated
    Nov 25, 2019
    Authors
    SophieLiu
    Area covered
    Boston
    Description

    Context

    To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important to our analysis. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data file has both Lyft and Uber, but it is still a cleaned version of the dataset we downloaded from Kaggle.

    Use of Data Files

    You can easily subset the data into the car types that you will be modeling by first loading the CSV into R. Here is the code for doing this:

    This loads the file into R

    df<-read.csv('uber.csv')

    The next line of code subsets the data into specific car types. The example below keeps only the Uber 'Black' car type.

    df_black<-subset(df, df$name == 'Black')

    The next step is to save this subset so it can be shared or loaded into R later. To do that, we write the dataframe to a CSV file on the computer.

    write.csv(df_black, "nameofthefileyouwanttosaveas.csv")

    The file will appear in your working directory. If you are not sure where your working directory is, run this code:

    getwd()

    The output will be the file path to your working directory. You will find the file you just created in that folder.

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  2. ISIC 2016 - 256x256

    • kaggle.com
    zip
    Updated Aug 7, 2024
    + more versions
    Cite
    Mehran Ziadloo (2024). ISIC 2016 - 256x256 [Dataset]. https://www.kaggle.com/datasets/ziadloo/isic-2016-256x256
    Explore at:
    Available download formats: zip (17534629 bytes)
    Dataset updated
    Aug 7, 2024
    Authors
    Mehran Ziadloo
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is derived from the ISIC Archive with the following changes:

    1. A new integer column is added named "target" with values 0, 1, null. This column is populated using two other columns: "benign_malignant" and "diagnosis". If the first column explicitly confirms that the record is either "benign" or "malignant", the target is set to "0" or "1" respectively. If the "benign_malignant" column is null, then the value of the "diagnosis" column is used to determine the value for "target". The following diagnosis values are considered cancerous and, as a result, "target" is set to "1":
    • squamous cell carcinoma
    • basal cell carcinoma
    • melanoma

    If the "benign_malignant" column is null and the "diagnosis" column is "vascular lesion", the target is set to null.

    DISCLAIMER I'm not a dermatologist and I'm not affiliated with ISIC in any way. I don't know if my approach to setting the target value is acceptable by the ISIC competition. Use at your own risk.

    2. All the images are resized to 256x256 using the following Python code:
    import os
    import multiprocessing as mp
    from PIL import Image, ImageOps
    import glob
    from functools import partial
    
    
    def list_jpg_files(folder_path):
      # Ensure the folder path ends with a slash
      if not folder_path.endswith('/'):
        folder_path += '/'
    
      # Use glob to find all .jpg files in the specified folder (non-recursive)
      jpg_files = glob.glob(folder_path + '*.jpg')
    
      return jpg_files
    
    
    
    def resize_image(image_path, destination_folder):
      # Open the image file
      with Image.open(image_path) as img:
        # Get the original dimensions
        original_width, original_height = img.size
    
        # Calculate the aspect ratio
        aspect_ratio = original_width / original_height
    
        # Determine the new dimensions based on the aspect ratio
        if aspect_ratio > 1:
          # Width is larger, so we will crop the width
          new_width = int(256 * aspect_ratio)
          new_height = 256
        else:
          # Height is larger, so we will crop the height
          new_width = 256
          new_height = int(256 / aspect_ratio)
    
        # Resize the image while maintaining the aspect ratio
        img = img.resize((new_width, new_height))
    
        # Calculate the crop box to center the image
        left = (new_width - 256) / 2
        top = (new_height - 256) / 2
        right = (new_width + 256) / 2
        bottom = (new_height + 256) / 2
    
        # Crop the image if it results in shrinking
        if new_width > 256 or new_height > 256:
          img = img.crop((left, top, right, bottom))
        else:
          # Add black edges if it results in scaling up
          img = ImageOps.expand(img, border=(int(left), int(top), int(left), int(top)), fill='black')
    
        # Resize the image to the final dimensions
        img = img.resize((256, 256))
    
      img.save(os.path.join(destination_folder, os.path.basename(image_path)))
    
    
    source_folder = ""
    destination_folder = ""
    
    images = list_jpg_files(source_folder)
    
    with mp.Pool(processes=12) as pool:
      images = pool.map(partial(resize_image, destination_folder=destination_folder), images)
    print("All images resized")
    

    This code will shrink (down-sample) the image if it is larger than 256x256. But if the image is smaller than 256x256, it will add either vertical or horizontal black edges after scaling up the image. In both scenarios, it will keep the center of the input image in the center of the output image.

    The HDF5 file is created using the following code:

    import os
    import pandas as pd
    from PIL import Image
    import h5py
    import io
    import numpy as np
    
    # File paths
    base_folder = "./isic-2018-task-12-256x256"
    csv_file_path = 'train-metadata.csv'
    image_folder_path = 'train-image/image'
    hdf5_file_path = 'train-image.hdf5'
    
    # Read the CSV file
    df = pd.read_csv(os.path.join(base_folder, csv_file_path))
    
    # Open an HDF5 file
    with h5py.File(os.path.join(base_folder, hdf5_file_path), 'w') as hdf5_file:
      for index, row in df.iterrows():
        isic_id = row['isic_id']
        image_file_path = os.path.join(base_folder, image_folder_path, f'{isic_id}.jpg')
        
        if os.path.exists(image_file_path):
          # Open the image file
          with Image.open(image_file_path) as img:
            # Convert the image to a byte buffer
            img_byte_arr = io.BytesIO()
            img.save(img_byte_arr, format=img.format)
            img_byte_arr = img_byte_arr.getvalue()
            hdf5_file.create_dataset(isic_id, data=np.void(img_byte_arr))
        else:
          print(f"Image file for {isic_id} not found.")
    
    print("HDF5 file created successfully.")
    

    To read the hdf5 file, use the following code:

    import h5py
    from PIL import Image
    ...
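
    The snippet above is cut off in the original description. As a minimal reader sketch, assuming the layout produced by the writer code above (each key is an ISIC id whose value holds the raw JPEG bytes), one might do:

    import io
    import h5py
    from PIL import Image
    
    hdf5_file_path = 'train-image.hdf5'  # adjust to your local copy
    
    with h5py.File(hdf5_file_path, 'r') as hdf5_file:
      for isic_id in list(hdf5_file.keys())[:5]:  # first few images only
        jpeg_bytes = hdf5_file[isic_id][()].tobytes()  # np.void -> raw bytes
        img = Image.open(io.BytesIO(jpeg_bytes))
        print(isic_id, img.size)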
    
  3. Bangladeshi License Plates for OCR

    • kaggle.com
    zip
    Updated Jun 20, 2023
    Cite
    Abrar Chowdhury (2023). Bangladeshi License Plates for OCR [Dataset]. https://www.kaggle.com/datasets/abrarchowdhury/bangladeshi-license-plates-382359-images
    Explore at:
    Available download formats: zip (622682351 bytes)
    Dataset updated
    Jun 20, 2023
    Authors
    Abrar Chowdhury
    License

    https://cdla.io/permissive-1-0/

    Area covered
    Bangladesh
    Description

    Bangladeshi Vehicle License Plate Dataset

    Pre-Processing

    Make the images sharper and larger for better training

    import os
    import cv2
    import numpy as np
    from multiprocessing import Pool, cpu_count
    
    input_dir = "/Users/abrarahasanadil/Downloads/Thesis/dataset/distorted_images"
    output_dir = "/Users/abrarahasanadil/Downloads/Thesis/dataset/clear_images"
    
    def preprocess_image(img_file):
      img = cv2.imread(img_file)
    
      # Resize the image to a height of 400 pixels, preserving the aspect ratio
      height, width, _ = img.shape
      new_height = 400
      new_width = int((new_height / height) * width)
      img = cv2.resize(img, (new_width, new_height))
    
      # Apply bilateral filtering to remove noise while keeping edges sharp
      img = cv2.bilateralFilter(img, 9, 75, 75)
    
      # Apply unsharp masking to enhance edges: blend the original with a blurred copy
      blurred = cv2.GaussianBlur(img, (0, 0), 3)
      img = cv2.addWeighted(
        img, 1.2, blurred, -0.2, 0
      ) # Reduce the value of alpha from 1.5 to 1.2
    
      # Convert the image to grayscale
      gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    
      # Increase contrast in darker regions using adaptive histogram equalization
      clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
      gray = clahe.apply(gray)
    
      # Apply a sharpening filter
      kernel = np.array([[-1, -1, -1], [-1, 9, -1], [-1, -1, -1]])
      gray = cv2.filter2D(gray, -1, kernel)
    
      # Save the output image
      output_file = os.path.join(output_dir, os.path.basename(img_file))
      os.makedirs(
        os.path.dirname(output_file), exist_ok=True
      ) # Create the output directory if it doesn't exist
      cv2.imwrite(output_file, gray)
    
    if __name__ == "__main__":
      # Get a list of all the image files in the input directory
      image_files = [
        os.path.join(input_dir, f) for f in os.listdir(input_dir) if f.endswith(".jpg")
      ]
    
      # Create a pool of worker processes
      num_workers = cpu_count() # Use all available CPU cores
      with Pool(num_workers) as pool:
        # Preprocess all the images in parallel
        pool.map(preprocess_image, image_files)
    
  4. NIH Chest X-rays Preprocessed Version

    • kaggle.com
    zip
    Updated Sep 13, 2025
    Cite
    Yasiru-210329E (2025). NIH Chest X-rays Preprocessed Version [Dataset]. https://www.kaggle.com/datasets/laksaraky210329e/nih-chest-x-rays-preprocessed-version
    Explore at:
    Available download formats: zip (60316035929 bytes)
    Dataset updated
    Sep 13, 2025
    Authors
    Yasiru-210329E
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    NIH Chest X-rays Preprocessed Version

    This dataset is a preprocessed version of the NIH Chest X-ray Dataset. The original images were systematically organized, explored, and enhanced to improve their quality for research and machine learning applications.

    What was done:

    • The full directory structure of the dataset was explored and documented, including image counts and sample filenames for each folder.
    • All chest X-ray images were processed using CLAHE (Contrast Limited Adaptive Histogram Equalization) to improve local contrast and highlight important features in the images.
    • The processed images were saved in a new directory structure that mirrors the original, ensuring easy traceability and organization.
    • Sample images were visualized and compared before and after preprocessing, with histograms provided to illustrate the improvement in contrast.
    • Batch processing was performed, with summary statistics and verification steps to confirm successful image enhancement and saving.

    Outputs:

    • A complete set of CLAHE-enhanced chest X-ray images, organized in the same way as the original dataset.
    • Visualizations and statistics demonstrating the effectiveness of the preprocessing, including side-by-side comparisons and histogram overlays.
    • Verified counts of processed images for each folder, ensuring data integrity.

    This preprocessed dataset is ready for use in further analysis, model training, or clinical research, with improved image quality and consistent organization. No changes were made to the original labels or metadata.
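
    For readers who want to reproduce a similar step, here is a minimal CLAHE sketch with OpenCV; the clip limit, tile size, and folder names are illustrative assumptions, not necessarily the settings used for this dataset.

    import os
    import cv2
    
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    
    src_dir = 'images_001/images'        # hypothetical input folder
    dst_dir = 'images_001_clahe/images'  # hypothetical output folder
    os.makedirs(dst_dir, exist_ok=True)
    
    for name in os.listdir(src_dir):
      if not name.endswith('.png'):
        continue
      gray = cv2.imread(os.path.join(src_dir, name), cv2.IMREAD_GRAYSCALE)
      enhanced = clahe.apply(gray)  # local contrast enhancement
      cv2.imwrite(os.path.join(dst_dir, name), enhanced)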

  5. Stone Classification

    • kaggle.com
    zip
    Updated Mar 18, 2025
    Cite
    Khadgar (2025). Stone Classification [Dataset]. https://www.kaggle.com/datasets/claydonwang/stone-classification
    Explore at:
    Available download formats: zip (69490 bytes)
    Dataset updated
    Mar 18, 2025
    Authors
    Khadgar
    Description

    Outline

    The dataset is used in the final project of STA325 at SUSTech.

    How to Generate submission.csv from test_loader

    1. Define the Prediction Function

    Use the following function to extract predictions from test_loader:

    import os
    import torch
    from tqdm import tqdm
    
    def predict(model, loader, device):
      model.eval() # Set the model to evaluation mode
      predictions = [] # Store predicted classes
      image_ids = [] # Store image filenames
    
      with torch.no_grad(): # Disable gradient computation
        for images, img_paths in tqdm(loader, desc="Predicting on test set"):
          images = images.to(device) # Move images to the specified device
          outputs = model(images) # Forward pass to get model outputs
          _, predicted = torch.max(outputs, 1) # Get predicted classes
    
          # Collect predictions and image IDs
          predictions.extend(predicted.cpu().numpy())
          image_ids.extend([os.path.basename(path) for path in img_paths])
    
      return image_ids, predictions

    2. Run Predictions

    Call the prediction function with the trained model, test_loader, and device:

    image_ids, predictions = predict(model, test_loader, device)

    3. Create the Submission File

    import pandas as pd
    import os
    
    # Create DataFrame
    submission_df = pd.DataFrame({
      "id": image_ids,  # Image filenames
      "label": predictions # Predicted classes
    })
    
    # Save to the specified path
    OUTPUT_DIR = "logs"
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    submission_path = os.path.join(OUTPUT_DIR, "submission.csv")
    submission_df.to_csv(submission_path, index=False)
    print(f"Kaggle submission file saved to {submission_path}")
    

    Output Description

    • submission.csv format: the file contains two columns:
      • id: Filenames of test images (without paths, e.g., image1.jpg).
      • label: Predicted class indices (e.g., 0, 1, 2, depending on the number of classes).

    • Example content:

      id,label
      000001.jpg,0
      000002.jpg,1
      000003.jpg,2

    Then submit the submission.csv to Kaggle.

  6. R

    Accident Detection Model Dataset

    • universe.roboflow.com
    zip
    Updated Apr 8, 2024
    Cite
    Accident detection model (2024). Accident Detection Model Dataset [Dataset]. https://universe.roboflow.com/accident-detection-model/accident-detection-model/model/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 8, 2024
    Dataset authored and provided by
    Accident detection model
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Accident Bounding Boxes
    Description

    Accident-Detection-Model

    Accident Detection Model is made using YOLOv8, Google Colab, Python, Roboflow, Deep Learning, OpenCV, Machine Learning, and Artificial Intelligence. It can detect an accident from a live camera feed, image, or video. This model is trained on a dataset of 3200+ images; these images were annotated on Roboflow.

    Problem Statement

    • Road accidents are a major problem in India, with thousands of people losing their lives and many more suffering serious injuries every year.
    • According to the Ministry of Road Transport and Highways, India witnessed around 4.5 lakh road accidents in 2019, which resulted in the deaths of more than 1.5 lakh people.
    • The age range that is most severely hit by road accidents is 18 to 45 years old, which accounts for almost 67 percent of all accidental deaths.

    Accidents survey

    Survey image: https://user-images.githubusercontent.com/78155393/233774342-287492bb-26c1-4acf-bc2c-9462e97a03ca.png

    Literature Survey

    • Sreyan Ghosh (Mar 2019): the goal is to develop a system using a deep learning convolutional neural network trained to identify video frames as accident or non-accident.
    • Deeksha Gour (Sep 2019): uses computer vision technology, neural networks, deep learning, and various approaches and algorithms to detect objects.

    Research Gap

    • Lack of real-world data - We trained the model on more than 3200 images.
    • Large interpretability time and space needed - We use Google Colab to reduce the time and space required.
    • Outdated versions in previous works - We are using the latest version of YOLOv8.

    Proposed methodology

    • We are using YOLOv8 to train on our custom dataset of 3200+ images, collected from different platforms.
    • After training for 25 iterations, the model is ready to detect an accident with a significant probability.

    Model Set-up

    Preparing Custom dataset

    • We have collected 1200+ images from different sources like YouTube, Google Images, Kaggle.com, etc.
    • Then we annotated all of them individually on a tool called Roboflow.
    • During annotation we marked the images with no accident as NULL, and we drew a box on the site of the accident for the images containing an accident.
    • Then we divided the dataset into train, val, and test in the ratio of 8:1:1.
    • At the final step we downloaded the dataset in YOLOv8 format.
      #### Using Google Colab
    • We are using Google Colaboratory to code this model because Colab provides a GPU, which is faster than most local environments.
    • You can use Jupyter notebooks, which let you blend code, text, and visualisations in a single document, to write and run Python code in Google Colab.
    • Users can run individual code cells in Jupyter notebooks and quickly view the results, which is helpful for experimenting and debugging. Additionally, they enable the development of visualisations that make use of well-known frameworks like Matplotlib, Seaborn, and Plotly.
    • In Google Colab, first of all we changed the runtime from TPU to GPU.
    • We cross-checked it by running the command '!nvidia-smi'
      #### Coding
    • First of all, we installed YOLOv8 with the command '!pip install ultralytics==8.0.20'
    • Further, we checked the YOLOv8 installation with the commands 'from ultralytics import YOLO' and 'from IPython.display import display, Image'
    • Then we connected and mounted our Google Drive account with the code 'from google.colab import drive' and 'drive.mount('/content/drive')'
    • Then we ran our main command to start the training process: '%cd /content/drive/MyDrive/Accident Detection model' followed by '!yolo task=detect mode=train model=yolov8s.pt data=data.yaml epochs=1 imgsz=640 plots=True'
    • After the training we ran commands to test and validate our model: '!yolo task=detect mode=val model=runs/detect/train/weights/best.pt data=data.yaml' and '!yolo task=detect mode=predict model=runs/detect/train/weights/best.pt conf=0.25 source=data/test/images'
    • Further, to get results from any video or image we ran this command: '!yolo task=detect mode=predict model=runs/detect/train/weights/best.pt source="/content/drive/MyDrive/Accident-Detection-model/data/testing1.jpg/mp4"'
    • The results are stored in the runs/detect/predict folder.
      Hence our model is trained, validated and tested, and is able to detect accidents in any video or image. The same steps can also be driven from the ultralytics Python API, as shown in the sketch below.
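
    A minimal sketch of the equivalent workflow using the ultralytics Python API (model, data and path names are the ones quoted in the steps above and should be treated as illustrative):

    from ultralytics import YOLO
    
    # Train a YOLOv8-small model on the custom dataset (data.yaml from the Roboflow export)
    model = YOLO('yolov8s.pt')
    model.train(data='data.yaml', epochs=1, imgsz=640, plots=True)
    
    # Validate the best checkpoint produced by training
    best = YOLO('runs/detect/train/weights/best.pt')
    best.val(data='data.yaml')
    
    # Run inference on test images; results are saved under runs/detect/predict
    best.predict(source='data/test/images', conf=0.25, save=True)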

    Challenges I ran into

    I majorly ran into 3 problems while making this model

    • I had difficulty saving the results in a folder. As YOLOv8 is the latest version, it is still under development, so I read some blogs and referred to Stack Overflow, and learned that in the new v8 we need to add an extra argument, 'save=True', to save the results in a folder.
    • I was facing a problem on the CVAT website because I was not sure what
  7. r/cosplay hot top images with titles

    • kaggle.com
    zip
    Updated Mar 2, 2023
    Cite
    dinhanhx (2023). r/cosplay hot top images with titles [Dataset]. https://www.kaggle.com/datasets/inhanhv/rcosplay-hot-top-images-with-titles
    Explore at:
    Available download formats: zip (1251562500 bytes)
    Dataset updated
    Mar 2, 2023
    Authors
    dinhanhx
    License

    https://www.reddit.com/wiki/api

    Description

    Please visit dinhanhx/rct

    Sauce for the thumbnail

    r/cosplay title crawler

    Available on Kaggle

    Please take time to read all this readme before using the dataset. Yes I'm serious!

    Setup

    pip install -e .
    

    Go to this PRAW doc page, follow the instructions to get your client id, client secret, and user agent.

    Then store them in confidential/reddit.json like this (don't actually write "spooky"):

    {
      "id": "spooky",
      "secret": "spooky",
      "user-agent": "windows-10:spooky:v0.0.1 (by u/spooky)"
    }
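
    The crawler itself lives in rct/crawl.py; purely as an illustration (not code from the repo), reading those credentials into PRAW could look like this:

    import json
    import praw
    
    # Hypothetical helper: load the credentials saved in confidential/reddit.json
    with open('confidential/reddit.json') as f:
      creds = json.load(f)
    
    reddit = praw.Reddit(
      client_id=creds['id'],
      client_secret=creds['secret'],
      user_agent=creds['user-agent'],
    )
    
    # Iterate over the subreddit's hot posts, as the crawler does for hot and top
    for submission in reddit.subreddit('cosplay').hot(limit=10):
      print(submission.title, submission.url)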

    Run

    Download all posts in top and hot (but the number in each category is limited by Reddit)

    • Output file: data/cosplay.jsonl
    • 2161 posts (on 01/03/2023)

    python rct/crawl.py

    Clean text (in posts' titles) enclosed by square brackets such as [self], [found], ...

    • Input file: data/cosplay.jsonl
    • Output file: data/clean_cosplay.jsonl

    python rct/clean.py

    Download images

    • Input file: data/clean_cosplay.jsonl
    • Output files: data/map_cosplay.jsonl, data/bad_response.jsonl
    • 2160 downloaded images, 1 bad/deleted/deprecated image (on 02/03/2023)

    python rct/download.py

    ⚠ The image_id and image_path attributes' values are NOT linearly continuous. For example, in data/bad_response.jsonl:

    {"image_id": "001912", "image_path": "data/image/001912.jpg"}

    and in data/map_cosplay.jsonl:

    # omit other json objects
    {"image_id": "001911", "image_path": "data/image/001911.jpg"}
    {"image_id": "001913", "image_path": "data/image/001913.jpg"}
    # omit other json objects

    ⚠ image_path attribute's values are data/image/*.jpg. They are relative to the folder data containing all .jsonl files and the image folder. The folder data is produced by the Python scripts.

    ⚠ image_path attribute's values MISMATCH the name of the folder containing all .jsonl files and the image folder on Kaggle. When you load the data from the Kaggle Dataset, the data prefix in data/image/000000.jpg should be replaced with the Kaggle path (see the demo notebook: https://www.kaggle.com/code/inhanhv/rct-demo). It becomes /kaggle/input/rcosplay-hot-top-images-with-titles/image/000000.jpg
    
  8. Diabetic Retinopathy (resized)

    • kaggle.com
    zip
    Updated May 8, 2019
    Cite
    ilovescience (2019). Diabetic Retinopathy (resized) [Dataset]. https://www.kaggle.com/tanlikesmath/diabetic-retinopathy-resized
    Explore at:
    Available download formats: zip (7785957896 bytes)
    Dataset updated
    May 8, 2019
    Authors
    ilovescience
    Description

    Diabetic Retinopathy Detection Competition Dataset Resized/Cropped

    In this dataset, I have included both a resized version of the dataset, and a cropped then resized version of the data.

    trainLabels.csv

    This file contains the name of the file under the 'image' column and the label under the 'level' column.

    resized_train:

    This folder was created by simply resizing each image so that its width is at most 1024 pixels (preserving the aspect ratio); smaller images remain unchanged. The code used to create this dataset is:

    import glob
    import os
    from tqdm import tqdm
    import math
    from PIL import Image 
    files = glob.glob('D:\\Experiments with Deep Learning\\DR Kaggle\\train\\train\\train\\*.jpeg')
    
    new_width = 1024
    
    for i in tqdm(range(len(files))):
      img = Image.open(files[i])
      width,height = img.size
      ratio = height/width
      if width > new_width:
        new_image = img.resize((new_width,math.ceil(ratio*new_width)))  
      else:
        new_image = img
      new_image.save('D:\\Experiments with Deep Learning\\DR Kaggle\\train\\train\\resized_train\\'+os.path.basename(files[i]))
    


    resized_train_cropped:

    In this case, as much of the black space is cropped out by trying to identify the center and radius of the circle of the fundus image. Some of the images turned out to be fully black or very close to fully black, and no mask was found. Hence, those images were manually removed. There may still be some noisy images remaining, however.

    The code used to create this dataset is:

    # import the necessary packages
    import numpy as np
    import cv2
    import glob
    import os
    from tqdm import tqdm
    import math
    from PIL import Image
    files = glob.glob('D:\\Experiments with Deep Learning\\DR Kaggle\\train\\train\\train\\*.jpeg')
    
    new_sz = 1024
    
    def crop_image(image):
      output = image.copy()
      gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
      ret,gray = cv2.threshold(gray,10,255,cv2.THRESH_BINARY)
      contours,hierarchy = cv2.findContours(gray,cv2.RETR_EXTERNAL,cv2.CHAIN_APPROX_SIMPLE)
      if not contours:
        print('no contours!')
        flag = 0
        return image, flag
      cnt = max(contours, key=cv2.contourArea)
      ((x, y), r) = cv2.minEnclosingCircle(cnt)
      x = int(x); y = int(y); r = int(r)
      flag = 1
      #print(x,y,r)
      if r > 100:
        return output[0 + (y-r)*int(r
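      # --- The original description is cut off here by the dataset page. ---
      # Purely as an illustration (not the author's remaining code), a crop
      # around the detected circle, clamped to the image borders, could look like:
      #   h, w = output.shape[:2]
      #   x1, x2 = max(x - r, 0), min(x + r, w)
      #   y1, y2 = max(y - r, 0), min(y + r, h)
      #   return output[y1:y2, x1:x2], flag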
    
  9. CelebsV2_Faces_224

    • kaggle.com
    zip
    Updated Jun 8, 2024
    Cite
    Shreyansh Manav Shukla (2024). CelebsV2_Faces_224 [Dataset]. https://www.kaggle.com/datasets/shreyanshmanavshukla/celebsv2-faces-224/data
    Explore at:
    Available download formats: zip (110015744 bytes)
    Dataset updated
    Jun 8, 2024
    Authors
    Shreyansh Manav Shukla
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Name: Celeb-DF Faces Dataset

    Description: The Celeb-DF Faces Dataset is a curated collection of facial images extracted from the Celeb-DF dataset. This dataset focuses on providing a comprehensive set of facial images for research and analysis in the field of deepfake detection and facial image analysis. The images are categorized into two classes: "Fake" and "Real," based on the source of the videos.

    Dataset Structure:

    Image Size: 224x224 pixels

    Source Folders:

    • celeb-df-v2/Celeb-real: Contains authentic facial videos.
    • celeb-df-v2/Celeb-synthesis: Contains synthesized (fake) facial videos.
    • celeb-df-v2/YouTube-real: Contains additional authentic facial videos from YouTube.

    Output Folder:

    • celeb_faces_224/: Contains the extracted and resized facial images.

    Metadata File:

    • metadata_celebs.csv: A CSV file storing metadata information for each extracted image with the following columns:
      • Name: The filename of the extracted image.
      • Label: The label indicating whether the image is "Fake" or "Real."

    Creation Process:

    1. Video Frame Extraction: The first frame from each video in the source folders is extracted.
    2. Image Resizing: The extracted frames are resized to 224x224 pixels to ensure uniformity and compatibility with common machine learning models.
    3. Image Storage: The resized images are saved in the celeb_faces_224/ folder with filenames corresponding to the original video names.
    4. Metadata Compilation: A metadata CSV file (metadata_celebs.csv) is created to store the filenames and labels of the images, indicating whether they are from "Fake" or "Real" videos.

    Intended Use: The dataset is ideal for tasks such as:

    • Deepfake detection and analysis
    • Training and evaluation of machine learning models for facial image classification
    • Image forensics research and development

    Note: This dataset is derived from the Celeb-DF dataset and is intended for research and educational purposes only.
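
    A minimal sketch of the creation process described above (illustrative only: folder names follow the description, face cropping, if any was applied, is omitted, and this is not necessarily the exact script used):

    import os
    import cv2
    import pandas as pd
    
    sources = {
      'celeb-df-v2/Celeb-real': 'Real',
      'celeb-df-v2/Celeb-synthesis': 'Fake',
      'celeb-df-v2/YouTube-real': 'Real',
    }
    out_dir = 'celeb_faces_224'
    os.makedirs(out_dir, exist_ok=True)
    
    rows = []
    for folder, label in sources.items():
      for video_name in os.listdir(folder):
        cap = cv2.VideoCapture(os.path.join(folder, video_name))
        ok, frame = cap.read()  # first frame only
        cap.release()
        if not ok:
          continue
        face = cv2.resize(frame, (224, 224))  # resize to 224x224
        image_name = os.path.splitext(video_name)[0] + '.jpg'
        cv2.imwrite(os.path.join(out_dir, image_name), face)
        rows.append({'Name': image_name, 'Label': label})
    
    pd.DataFrame(rows).to_csv('metadata_celebs.csv', index=False)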

  10. External CFD Aerodynamics Dataset

    • kaggle.com
    zip
    Updated Sep 9, 2021
    Cite
    Andrey Shtrauss (2021). External CFD Aerodynamics Dataset [Dataset]. https://www.kaggle.com/shtrausslearning/external-cfd-aero
    Explore at:
    Available download formats: zip (2848480902 bytes)
    Dataset updated
    Sep 9, 2021
    Authors
    Andrey Shtrauss
    Description

    Image: https://i.imgur.com/gAwhrwd.jpg

    External CFD Aerodynamics Dataset

    • The theme for this data is external flow aerodynamics: simulation of the flow physics around some geometry (e.g. aircraft, vehicle).
    • Iterative simulations of the flow physics are run until some final convergence criterion is reached.
    • The dataset contains simulation results obtained from different CFD solvers, sorted into folders for similar types of data outputs.


    Data origin

    Where the data is stored:

    • Depending on the type of solver and export option, data can be stored at vertices (grid intersections) or in cells (each little sub-domain, usually the cell centre).
    • Before solving the nonlinear equations, we need to discretise the domain into smaller zones; once the solution achieves convergence, it can be exported as either the entire domain data, subsets of the flowfield domain, or just tabular data (at specific points).

    [Figures: "Discretised region around a geometry" and "Part of the result is visualised via a 2D slice"]


    Folder Structure

    Utilising the dataset with notebook classes, the recommended data storage structure:

    • Main Folder ( Geometry name used in simulation )
      • Case Name ( Brief simulation name; what was tested etc )
        • Individual Case Name ( If multiple cases were tested etc ):
          • flowfield folder (stores multiblock file content - automatically created when saving VTM)
          • tab_final ( final iteration tabular data output content )
          • tab_iter ( iteratively changing tabular data, etc convergence history of a parameter )


    Current Dataset Content

    Content in the dataset:

    • [**30p30n**] [**anglevar_sst**]
      • [10][11][12][13][3][4][5][6][7][8][9]
    • [**crm**][**nacelle_effect**]
      • [crm_wb_clean][crm_wb_eng]
    • [**m6**]
      • [**3p06_sst_ep**]
    • [**rae2822**][**angvar_sa**]
      • [0][1][10][11][12][13][14][15][16][17][18][19][2][20][3][4][5][6][7][8][9]
  11. Churn Prediction and Transaction Forecasting

    • kaggle.com
    zip
    Updated Aug 19, 2025
    Cite
    Richa Patel (2025). Churn Prediction and Transaction Forecasting [Dataset]. https://www.kaggle.com/datasets/richapatel912/churn-prediction-and-transaction-forecasting/discussion
    Explore at:
    Available download formats: zip (3258802 bytes)
    Dataset updated
    Aug 19, 2025
    Authors
    Richa Patel
    Description

    “AI-Powered Banking Analytics: Automated Power BI Documentation, Churn Prediction, and Transaction Forecasting”

    Project Workflow

    1. Data Acquisition (Kaggle)
      • Dataset sourced from Kaggle (credit card / banking dataset).
      • Contains customer demographics, credit card transactions, and account details.
      • Cleaned and transformed data in Power BI for dashboard building.
    2. Interactive Power BI Dashboard
      • Built two key analytics pages:
        • Customer Churn Insights → shows churn risk, drivers, segmentation.
        • Transaction Forecasting → predicts future monthly transactions with confidence bands.
      • Added KPI cards, slicers, and professional formatting.
      • Ensured the design covers customer risk, forecasting, and governance.
    3. Automated Documentation (Python + VPAX)
      • Exported the Power BI data model (VPAX) using DAX Studio.
      • Created a Python script to automatically generate:
        • a Word doc with model documentation,
        • an Excel file with tables, relationships, and fields,
        • an ER diagram image.
      • This automation saves analysts hours of manual work and enforces governance.
    4. Churn Prediction Model (Python + Power BI)
      • Built a Random Forest model for churn prediction.
      • Output:
        • Customer-level churn probability.
        • Risk categories (Low, Medium, High).
        • Feature importance (drivers of churn).
      • Exported predictions to Excel → imported into Power BI.
      • Added a Churn Risk Dashboard:
        • Distribution of churn risk.
        • Top churn drivers (feature importance bar chart).
    5. Transaction Forecasting Model (Python + Prophet)
      • Used Prophet (Facebook's forecasting library) to model monthly transaction volumes.
      • Forecasted the next 12 months with confidence intervals (yhat_lower, yhat_upper).
      • Exported results to Excel → integrated into Power BI.
      • Added a Transaction Forecasting Dashboard:
        • Actual vs Forecast line chart (with confidence band).
        • KPI cards (Next Month Forecast, YoY Growth).
        • Clustered column chart for the recent 12 months.
    6. End-to-End Data & AI Pipeline
      • Data Source (Kaggle) → Power BI Dashboard → Automated Documentation → AI/ML Models → Power BI Insights.
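
    As a rough sketch of the forecasting step described above (the column names transaction_date and amount are illustrative assumptions, not taken from the project files):

    import pandas as pd
    from prophet import Prophet
    
    # Aggregate transactions to monthly totals; Prophet expects columns ds (date) and y (value)
    monthly = pd.read_excel('credit_card.xlsx')  # illustrative input
    monthly['transaction_date'] = pd.to_datetime(monthly['transaction_date'])
    df = (monthly.groupby(pd.Grouper(key='transaction_date', freq='MS'))['amount']
                 .sum().reset_index()
                 .rename(columns={'transaction_date': 'ds', 'amount': 'y'}))
    
    model = Prophet()
    model.fit(df)
    
    # Forecast the next 12 months with confidence bounds (yhat_lower, yhat_upper)
    future = model.make_future_dataframe(periods=12, freq='MS')
    forecast = model.predict(future)
    forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].to_excel(
      'Transaction_Forecast.xlsx', index=False)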

    File Details:

    • .idea/: PyCharm IDE configuration folder (auto-generated).
    • Churn Prediction + Forecasting.py: Main Python script for churn prediction (Random Forest) and transaction forecasting (Prophet).
    • churn_model.pkl: Saved machine learning model (Random Forest) for churn prediction.
    • Churn_Predictions.xlsx: Excel output of churn probabilities and risk categories per customer.
    • Credit Card Financial Dashboard.pbix: Power BI dashboard file (interactive BI report).
    • Credit Card Financial Dashboard.pdf: Exported PDF version of the Power BI dashboard.
    • credit_card.xlsx: Kaggle dataset (credit card transactions / account features).
    • customer.xlsx: Kaggle dataset (customer demographic and account info).
    • DocumentationGenerator.py: Python script that parses the VPAX model and generates automated Power BI documentation.
    • Feature_Importance.xlsx: Feature importance scores from the churn model (top churn drivers).
    • forecast_model.pkl: Saved Prophet model for forecasting monthly transactions.
    • LICENSE: License file for open-source/public sharing.
    • model.vpax: Exported Power BI data model (via DAX Studio) for documentation.
    • PowerBI_Documentation.docx: Word output of auto-generated Power BI documentation.
    • PowerBI_Documentation.xlsx: Excel output of auto-generated Power BI documentation.
    • PowerBI_ER_Diagram.png: Entity-Relationship diagram image generated from the Power BI model.
    • README.md: Markdown summary file for GitHub/Kaggle.
    • Transaction_Forecast.xlsx: Excel output containing actuals + forecast (Prophet) with confidence bounds.

  12. BBC NEWS SUMMARY(CSV FORMAT)

    • kaggle.com
    zip
    Updated Sep 9, 2024
    Cite
    Dhiraj (2024). BBC NEWS SUMMARY(CSV FORMAT) [Dataset]. https://www.kaggle.com/datasets/dignity45/bbc-news-summarycsv-format
    Explore at:
    Available download formats: zip (2097600 bytes)
    Dataset updated
    Sep 9, 2024
    Authors
    Dhiraj
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Description: Text Summarization Dataset

    This dataset is designed for users aiming to train models for text summarization. It contains 2,225 rows of data with two columns: "Text" and "Summary". Each row features a detailed news article or piece of text paired with its corresponding summary, providing a rich resource for developing and fine-tuning summarization algorithms.

    Key Features:

    • Text: Full-length articles or passages that serve as the input for summarization.
    • Summary: Concise summaries of the articles, which are ideal for training models to generate brief, coherent summaries from longer texts.

    Future Enhancements:

    This evolving dataset is planned to include additional features, such as text class labels, in future updates. These enhancements will provide more context and facilitate the development of models that can perform summarization across different categories of news content.

    Usage:

    Ideal for researchers and developers focused on text summarization tasks, this dataset enables the training of models to effectively compress information while retaining the essence of the original content.

    Acknowledgment

    We would like to extend our sincere gratitude to the dataset creator for their contribution to this valuable resource. This dataset, sourced from the BBC News Summary dataset on Kaggle, was created by Pariza. Their work has provided an invaluable asset for those working on text summarization tasks, and we appreciate their efforts in curating and sharing this data with the community.

    Thank you for supporting research and development in the field of natural language processing!

    File Description

    This script processes and consolidates text data from various directories containing news articles and their corresponding summaries. It reads the files from specified folders, handles encoding issues, and then creates a DataFrame that is saved as a CSV file for further analysis.

    Key Components:

    1. Imports:

      • numpy (np): Numerical operations library, though it's not used in this script.
      • pandas (pd): Data manipulation and analysis library.
      • os: For interacting with the operating system, e.g., building file paths.
      • glob: For file pattern matching and retrieving file paths.
    2. Function: get_texts

      • Parameters:
        • text_folders: List of folders containing news article text files.
        • text_list: List to store the content of text files.
        • summ_folder: List of folders containing summary text files.
        • sum_list: List to store the content of summary files.
        • encodings: List of encodings to try for reading files.
      • Purpose:
        • Reads text files from specified folders, handles different encodings, and appends the content to text_list and sum_list.
        • Returns the updated lists of texts and summaries.
    3. Data Preparation:

      • text_folder: List of directories for news articles.
      • summ_folder: List of directories for summaries.
      • text_list and summ_list: Initialize empty lists to store the contents.
      • data_df: Empty DataFrame to store the final data.
    4. Execution:

      • Calls get_texts function to populate text_list and summ_list.
      • Creates a DataFrame data_df with columns 'Text' and 'Summary'.
      • Saves data_df to a CSV file at /kaggle/working/bbc_news_data.csv.
    5. Output:

      • Prints the first few entries of the DataFrame to verify the content.

    Column Descriptions:

    • Text: Contains the full-length articles or passages of news content. This column is used as the input for summarization models.
    • Summary: Contains concise summaries of the corresponding articles in the "Text" column. This column is used as the target output for summarization models.

    Usage:

    • This script is designed to be run in a Kaggle environment where paths to text data are predefined.
    • It is intended for preprocessing and saving text data from news articles and summaries for subsequent analysis or model training.
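
    As a minimal sketch of the consolidation described above (assuming the usual News Articles/ and Summaries/ layout of the BBC News Summary dataset; the exact Kaggle paths may differ):

    import os
    import glob
    import pandas as pd
    
    def get_texts(text_folders, text_list, summ_folders, sum_list,
                  encodings=('utf-8', 'latin-1')):
      # Read every article and its matching summary, trying several encodings
      for t_folder, s_folder in zip(text_folders, summ_folders):
        for t_path in sorted(glob.glob(os.path.join(t_folder, '*.txt'))):
          s_path = os.path.join(s_folder, os.path.basename(t_path))
          for enc in encodings:
            try:
              with open(t_path, encoding=enc) as t, open(s_path, encoding=enc) as s:
                article, summary = t.read(), s.read()
              text_list.append(article)
              sum_list.append(summary)
              break
            except UnicodeDecodeError:
              continue
      return text_list, sum_list
    
    text_folders = sorted(glob.glob('BBC News Summary/News Articles/*'))  # assumed layout
    summ_folders = sorted(glob.glob('BBC News Summary/Summaries/*'))      # assumed layout
    text_list, summ_list = get_texts(text_folders, [], summ_folders, [])
    
    data_df = pd.DataFrame({'Text': text_list, 'Summary': summ_list})
    data_df.to_csv('bbc_news_data.csv', index=False)
    print(data_df.head())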
  13. ISIC 2018 Task 3 - 256x256

    • kaggle.com
    zip
    Updated Aug 7, 2024
    Cite
    Mehran Ziadloo (2024). ISIC 2018 Task 3 - 256x256 [Dataset]. https://www.kaggle.com/datasets/ziadloo/isic-2018-task-3-256x256/code
    Explore at:
    Available download formats: zip (164325064 bytes)
    Dataset updated
    Aug 7, 2024
    Authors
    Mehran Ziadloo
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset is derived from the ISIC Archive with the following changes:

    1. A new integer column is added named "target" with values 0, 1, null. This column is populated using two other columns: "benign_malignant" and "diagnosis". If the first column explicitly confirms that the record is either "benign" or "malignant", the target is set to "0" or "1" respectively. If the "benign_malignant" column is null, then the value of the "diagnosis" column is used to determine the value for "target". The following diagnosis values are considered cancerous and, as a result, "target" is set to "1":
    • squamous cell carcinoma
    • basal cell carcinoma
    • melanoma

    If the "benign_malignant" column is null and the "diagnosis" column is "vascular lesion", the target is set to null.

    DISCLAIMER I'm not a dermatologist and I'm not affiliated with ISIC in any way. I don't know if my approach to setting the target value is acceptable by the ISIC competition. Use at your own risk.
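
    As a rough illustration of the mapping described above (a sketch based on this description, not necessarily the exact code used to build the dataset; cases the description does not cover are left as null here):

    import numpy as np
    import pandas as pd
    
    cancerous = {'squamous cell carcinoma', 'basal cell carcinoma', 'melanoma'}
    
    def derive_target(row):
      # Prefer the explicit benign/malignant label when it is present
      if row['benign_malignant'] == 'benign':
        return 0
      if row['benign_malignant'] == 'malignant':
        return 1
      # Otherwise fall back to the diagnosis column
      if pd.isna(row['benign_malignant']):
        if row['diagnosis'] in cancerous:
          return 1
        if row['diagnosis'] == 'vascular lesion':
          return np.nan
      # Not covered by the description above
      return np.nan
    
    metadata = pd.read_csv('train-metadata.csv')
    metadata['target'] = metadata.apply(derive_target, axis=1)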

    2. All the images are resized to 256x256 using the following Python code:
    import os
    import multiprocessing as mp
    from PIL import Image, ImageOps
    import glob
    from functools import partial
    
    
    def list_jpg_files(folder_path):
      # Ensure the folder path ends with a slash
      if not folder_path.endswith('/'):
        folder_path += '/'
    
      # Use glob to find all .jpg files in the specified folder (non-recursive)
      jpg_files = glob.glob(folder_path + '*.jpg')
    
      return jpg_files
    
    
    
    def resize_image(image_path, destination_folder):
      # Open the image file
      with Image.open(image_path) as img:
        # Get the original dimensions
        original_width, original_height = img.size
    
        # Calculate the aspect ratio
        aspect_ratio = original_width / original_height
    
        # Determine the new dimensions based on the aspect ratio
        if aspect_ratio > 1:
          # Width is larger, so we will crop the width
          new_width = int(256 * aspect_ratio)
          new_height = 256
        else:
          # Height is larger, so we will crop the height
          new_width = 256
          new_height = int(256 / aspect_ratio)
    
        # Resize the image while maintaining the aspect ratio
        img = img.resize((new_width, new_height))
    
        # Calculate the crop box to center the image
        left = (new_width - 256) / 2
        top = (new_height - 256) / 2
        right = (new_width + 256) / 2
        bottom = (new_height + 256) / 2
    
        # Crop the image if it results in shrinking
        if new_width > 256 or new_height > 256:
          img = img.crop((left, top, right, bottom))
        else:
          # Add black edges if it results in scaling up
          img = ImageOps.expand(img, border=(int(left), int(top), int(left), int(top)), fill='black')
    
        # Resize the image to the final dimensions
        img = img.resize((256, 256))
    
      img.save(os.path.join(destination_folder, os.path.basename(image_path)))
    
    
    source_folder = ""
    destination_folder = ""
    
    images = list_jpg_files(source_folder)
    
    with mp.Pool(processes=12) as pool:
      images = pool.map(partial(resize_image, destination_folder=destination_folder), images)
    print("All images resized")
    

    This code will shrink (down-sample) the image if it is larger than 256x256. But if the image is smaller than 256x256, it will add either vertical or horizontal black edges after scaling up the image. In both scenarios, it will keep the center of the input image in the center of the output image.

    The HDF5 file is created using the following code:

    import os
    import pandas as pd
    from PIL import Image
    import h5py
    import io
    import numpy as np
    
    # File paths
    base_folder = "./isic-2018-task-3-256x256"
    csv_file_path = 'train-metadata.csv'
    image_folder_path = 'train-image/image'
    hdf5_file_path = 'train-image.hdf5'
    
    # Read the CSV file
    df = pd.read_csv(os.path.join(base_folder, csv_file_path))
    
    # Open an HDF5 file
    with h5py.File(os.path.join(base_folder, hdf5_file_path), 'w') as hdf5_file:
      for index, row in df.iterrows():
        isic_id = row['isic_id']
        image_file_path = os.path.join(base_folder, image_folder_path, f'{isic_id}.jpg')
        
        if os.path.exists(image_file_path):
          # Open the image file
          with Image.open(image_file_path) as img:
            # Convert the image to a byte buffer
            img_byte_arr = io.BytesIO()
            img.save(img_byte_arr, format=img.format)
            img_byte_arr = img_byte_arr.getvalue()
            hdf5_file.create_dataset(isic_id, data=np.void(img_byte_arr))
        else:
          print(f"Image file for {isic_id} not found.")
    
    print("HDF5 file created successfully.")
    

    To read the hdf5 file, use the following code:

    import h5py
    from PIL import Image
    
    ...
    
  14. OLM Converter for Mac

    • kaggle.com
    zip
    Updated Mar 23, 2022
    Cite
    BitVare Software (2022). OLM Converter for Mac [Dataset]. https://www.kaggle.com/datasets/bitvaresoftware/olm-converter-for-mac
    Explore at:
    Available download formats: zip (18499539 bytes)
    Dataset updated
    Mar 23, 2022
    Authors
    BitVare Software
    Description

    OLM Converter for Mac allows users to export OLM to PST, PDF, MBOX, EML, MSG, EMLX, VCF, ICS, etc. It can add OLM files including contacts, emails, tasks, calendars, journals, etc. and convert them to multiple file formats. Mac OLM Converter is a reliable tool to bulk convert OLM files to multiple file formats. The software preserves mail metadata elements like the mailing list, From, Cc, To, Bcc, date, email formatting, folder hierarchy, images, colors, links, attachments, etc. The tool supports OLM file conversion without any data loss and provides various options for saving the resultant file. Export OLM contacts to CSV format and calendars to ICS format. The Mac OLM Converter is compatible with all Mac OS versions. OLM Converter makes sure that the data folder hierarchy stays intact. The Mac OLM Converter allows users to convert Mac OLM files to 6+ different file formats. After the OLM file conversion, the output file can be used in any Mac- and Windows-supported application.

    Get complete information - https://www.bitvare.com/olm/

  15. The files on your computer

    • kaggle.com
    zip
    Updated Jan 15, 2017
    Cite
    cogs (2017). The files on your computer [Dataset]. https://www.kaggle.com/cogitoe/crab
    Explore at:
    Available download formats: zip (14326302 bytes)
    Dataset updated
    Jan 15, 2017
    Authors
    cogs
    Description

    Dataset: The files on your computer.

    Crab is a command line tool for Mac and Windows that scans file data into a SQLite database, so you can run SQL queries over it.

    e.g. (Win)    C:> crab C:\some\path\MyProject
    or (Mac)    $ crab /some/path/MyProject
    

    You get a CRAB> prompt where you can enter SQL queries on the data, e.g. Count files by extension

    SELECT extension, count(*) 
    FROM files 
    GROUP BY extension;
    

    e.g. List the 5 biggest directories

    SELECT parentpath, sum(bytes)/1e9 as GB 
    FROM files 
    GROUP BY parentpath 
    ORDER BY sum(bytes) DESC LIMIT 5;
    

    Crab provides a virtual table, fileslines, which exposes file contents to SQL

    e.g. Count TODO and FIXME entries in any .c files, recursively

    SELECT fullpath, count(*) FROM fileslines 
    WHERE parentpath like '/Users/GN/HL3/%' and extension = '.c'
      and (data like '%TODO%' or data like '%FIXME%')
    GROUP BY fullpath;
    

    As well there are functions to run programs or shell commands on any subset of files, or lines within files e.g. (Mac) unzip all the .zip files, recursively

    SELECT exec('unzip', '-n', fullpath, '-d', '/Users/johnsmith/Target Dir/') 
    FROM files 
    WHERE parentpath like '/Users/johnsmith/Source Dir/%' and extension = '.zip';
    

    (Here -n tells unzip not to overwrite anything, and -d specifies target directory)

    There is also a function to write query output to file, e.g. (Win) Sort the lines of all the .txt files in a directory and write them to a new file

    SELECT writeln('C:\Users\SJohnson\dictionary2.txt', data) 
    FROM fileslines 
    WHERE parentpath = 'C:\Users\SJohnson\' and extension = '.txt'
    ORDER BY data;
    

    In place of the interactive prompt you can run queries in batch mode, e.g. here is a one-liner that returns the full path of all the files in the current directory:

    C:> crab -batch -maxdepth 1 . "SELECT fullpath FROM files"
    

    Crab SQL can also be used in Windows batch files, or Bash scripts, e.g. for ETL processing.

    Crab is free for personal use, $5/mo commercial

    See more details here (Mac): http://etia.co.uk/ or here (Win): http://etia.co.uk/win/about/

    An example SQLite database (Mac data) has been uploaded for you to play with. It includes an example files table for the directory tree you get when downloading the Project Gutenberg corpus, which contains 95k directories and 123k files.
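
    If you prefer to explore the example database from Python rather than the Crab shell, a minimal sketch (assuming the uploaded file is named database.sqlite, as described below) might be:

    import sqlite3
    
    conn = sqlite3.connect('database.sqlite')  # the example files table uploaded here
    
    # Count files by extension, mirroring the first Crab query above
    rows = conn.execute(
      'SELECT extension, count(*) FROM files GROUP BY extension ORDER BY count(*) DESC LIMIT 10'
    ).fetchall()
    for extension, n in rows:
      print(extension, n)
    
    conn.close()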

    To scan your own files, and get access to the virtual tables and support functions, you have to use the Crab SQLite shell, available for download from this page (Mac): http://etia.co.uk/download/ or this page (Win): http://etia.co.uk/win/download/

    Content

    FILES TABLE

    The FILES table contains details of every item scanned, file or directory. All columns are indexed except 'mode'

    COLUMNS
     fileid (int) primary key -- files table row number, a unique id for each item
     name (text)        -- item name e.g. 'Hei.ttf'
     bytes (int)        -- item size in bytes e.g. 7502752
     depth (int)        -- how far scan recursed to find the item, starts at 0
     accessed (text)      -- datetime item was accessed
     modified (text)      -- datetime item was modified
     basename (text)      -- item name without path or extension, e.g. 'Hei'
     extension (text)     -- item extension including the dot, e.g. '.ttf'
     type (text)        -- item type, 'f' for file or 'd' for directory
     mode (text)        -- further type info and permissions, e.g. 'drwxr-xr-x'
     parentpath (text)     -- absolute path of directory containing the item, e.g. '/Library/Fonts/'
     fullpath (text) unique  -- parentpath of the item concatenated with its name, e.g. '/Library/Fonts/Hei.ttf'
    
    PATHS
    1) parentpath and fullpath don't support abbreviations such as ~ . or .. They're just strings.
    2) Directory paths all have a '/' on the end.
    

    FILESLINES TABLE

    The FILESLINES table is for querying data content of files. It has line number and data columns, with one row for each line of data in each file scanned by Crab.

    This table isn't available in the example dataset, because it's a virtual table and doesn't physically contain data.

    COLUMNS
     linenumber (int) -- line number within file, restarts count from 1 at the first line of each file
     data (text)    -- data content of the files, one entry for each line
    

    FILESLINES also duplicates the columns of the FILES table: fileid, name, bytes, depth, accessed, modified, basename, extension, type, mode, parentpath, and fullpath. This way you can restrict which files are searched without having to join tables.

    Example Gutenberg data

    An example SQLite database (Mac data), database.sqlite, has been uploaded for you to play with. It includes an example files table...

  16. USA_Contracts_medical_equip_2019_2024

    • kaggle.com
    zip
    Updated Apr 7, 2025
    Cite
    Phil Gieschen (2025). USA_Contracts_medical_equip_2019_2024 [Dataset]. https://www.kaggle.com/datasets/philgieschen/usa-contracts-medical-equip-2019-2024/code
    Explore at:
    Available download formats: zip (16588354 bytes)
    Dataset updated
    Apr 7, 2025
    Authors
    Phil Gieschen
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    United States
    Description

    US Government Contract Awards for Medical Equipment (product codes 6515 & 6640), 2019-2024. Shows total amounts obligated and outlaid by zip code, with Metropolitan Statistical Area added for mapping & visualization.

    https://www.usaspending.gov/download_center/award_data_archive

    Used Git Bash to remove other products and merge CSVs together:

    #!/bin/bash

    # Define variables
    input_folder="/c/Users/phgie/Downloads/FY2024_All_Contracts"
    output_file="combined_filtered.csv"
    temp_file="temp_filtered.csv"

    # Create or clear the output file
    > "$output_file"

    # Loop through all CSV files in the folder
    for file in "$input_folder"/*.csv; do
      # Skip the header for all but the first file
      if [ ! -s "$output_file" ]; then
        # Include the header row from the first file
        awk -F, 'NR == 1 || $104 == "6515" || $104 == "6640"' "$file" > "$temp_file"
      else
        # Exclude the header row from subsequent files
        awk -F, 'NR > 1 && ($104 == "6515" || $104 == "6640")' "$file" > "$temp_file"
      fi

      # Append the filtered content to the output file
      cat "$temp_file" >> "$output_file"
    done

    # Clean up temporary file
    rm -f "$temp_file"

    echo "Combined and filtered CSVs are saved in $output_file"

  17. Embryo classification based on microscopic images

    • kaggle.com
    Updated Oct 3, 2023
    Cite
    Gaurav Dutta (2023). Embryo classification based on microscopic images [Dataset]. https://www.kaggle.com/datasets/gauravduttakiit/embryo-classification-based-on-microscopic-images
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 3, 2023
    Dataset provided by
    Kaggle
    Authors
    Gaurav Dutta
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset Description Welcome to the "Hung Vuong Hospital Embryo Classification" dataset. This page provides a comprehensive overview of the data files, their formats, and the essential columns you'll encounter in this competition. Taking a moment to understand the data will help you navigate the challenge effectively and make informed decisions during your analysis and modeling.

    The dataset comprises the following key files:

    • train folder - Contains images of embryos at day-3 and day-5 for training purposes.
    • test folder - Contains images of embryos at day-3 and day-5 for testing purposes.
    • train.csv - Contains information about the training set.
    • test.csv - Contains information about the test set.
    • sample_submission.csv - A sample submission file that demonstrates the correct submission format.

    Data Format Expectations

    The embryo images are arranged within subfolders under the train and test directories. Each image is saved in JPG format and is labeled with a prefix. Images corresponding to day-3 embryos have the prefix D3 while images related to day-5 embryos bear the prefix D5. This prefix-based categorization allows for easy identification of the embryo's developmental stage.

    Expected Output

    Your task in this competition is to create a deep learning model that can accurately classify embryo images as 1 for good or 0 for not good for both day-3 and day-5 stages. The model should be trained on the training set and then used to predict the embryo quality in the test set. The ID column assigns an ID to each image. You will create the Class column as the result of model classification. The submission file contains only 2 columns: ID and Class (See the sample submission file)

    Columns

    You will encounter the following columns throughout the dataset:

    • ID - Refers to the ID of the images in the test set.
    • Image - Refers to the file name of the embryo images in the train or test folder.
    • Class - Represents the evaluation of the embryo images. This column provides the ground truth label for each image, indicating whether the embryo is classified as 'good' or 'not good'.

    We encourage you to explore, analyze, and preprocess the provided data to build a robust model for accurate embryo quality classification. Good luck, and may your innovative solutions contribute to advancements in reproductive science!

  18. Complete Pokemon Image Dataset

    • kaggle.com
    zip
    Updated Nov 15, 2020
    Cite
    hlrhegemony (2020). Complete Pokemon Image Dataset [Dataset]. https://www.kaggle.com/hlrhegemony/pokemon-image-dataset
    Explore at:
    Available download formats: zip (60660766 bytes)
    Dataset updated
    Nov 15, 2020
    Authors
    hlrhegemony
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    I was searching for labeled Pokemon images which satisfy these requirements:

    • Uniform, white backgrounds
    • Generations 1 through 8
    • Multiple images per Pokemon

    I could not find any after searching for a while, so I built one myself!

    Content

    These images are all scraped from https://pokemondb.net/. Each folder contains between 1 and 8 images (all .jpg) of the Pokemon, all with white backgrounds and reasonable file size. There are 2,500+ total images, which is far more than any other Kaggle dataset I have found that preserves background and picture quality (no random backgrounds, nor some white and some black, etc.). Note that all other forms of the Pokemon (Gigantamax, Mega Evolution, Alolan, Galarian, etc.) are included in the same folder.

    Inspiration

    I hope that a larger, cleaner dataset like this one can result in better GANs and VAEs.

