To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important for our purposes. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data file has both Lyft and Uber, but it is still a cleaned version of the dataset we downloaded from Kaggle.
You can easily subset the data into the car types that you will be modeling by first loading the CSV into R. Here is how:
df <- read.csv('uber.csv')
df_black <- subset(df, df$name == 'Black')   # keep only the 'Black' car type
write.csv(df_black, "nameofthefileyouwanttosaveas.csv")
getwd()   # shows the directory the CSV was written to
License: https://creativecommons.org/publicdomain/zero/1.0/
This dataset is derived from the ISIC Archive with the following changes:
If the "benign_malignant" column is null and the "diagnosis" column is "vascular lesion", the target is set to null.
DISCLAIMER: I'm not a dermatologist and I'm not affiliated with ISIC in any way. I don't know whether my approach to setting the target value is acceptable for the ISIC competition. Use at your own risk.
import os
import multiprocessing as mp
from PIL import Image, ImageOps
import glob
from functools import partial

def list_jpg_files(folder_path):
    # Ensure the folder path ends with a slash
    if not folder_path.endswith('/'):
        folder_path += '/'
    # Use glob to find all .jpg files in the specified folder (non-recursive)
    jpg_files = glob.glob(folder_path + '*.jpg')
    return jpg_files

def resize_image(image_path, destination_folder):
    # Open the image file
    with Image.open(image_path) as img:
        # Get the original dimensions
        original_width, original_height = img.size
        # Calculate the aspect ratio
        aspect_ratio = original_width / original_height
        # Determine the new dimensions based on the aspect ratio
        if aspect_ratio > 1:
            # Width is larger, so we will crop the width
            new_width = int(256 * aspect_ratio)
            new_height = 256
        else:
            # Height is larger, so we will crop the height
            new_width = 256
            new_height = int(256 / aspect_ratio)
        # Resize the image while maintaining the aspect ratio
        img = img.resize((new_width, new_height))
        # Calculate the crop box to center the image
        left = (new_width - 256) / 2
        top = (new_height - 256) / 2
        right = (new_width + 256) / 2
        bottom = (new_height + 256) / 2
        # Crop the image if it results in shrinking
        if new_width > 256 or new_height > 256:
            img = img.crop((left, top, right, bottom))
        else:
            # Add black edges if it results in scaling up
            img = ImageOps.expand(img, border=(int(left), int(top), int(left), int(top)), fill='black')
        # Resize the image to the final dimensions
        img = img.resize((256, 256))
        img.save(os.path.join(destination_folder, os.path.basename(image_path)))

source_folder = ""
destination_folder = ""
images = list_jpg_files(source_folder)
with mp.Pool(processes=12) as pool:
    images = pool.map(partial(resize_image, destination_folder=destination_folder), images)
print("All images resized")
This code will shrink (down-sample) the image if it is larger than 256x256. But if the image is smaller than 256x256, it will add either vertical or horizontal black edges after scaling up the image. In both scenarios, it will keep the center of the input image in the center of the output image.
The HDF5 file is created using the following code:
import os
import pandas as pd
from PIL import Image
import h5py
import io
import numpy as np

# File paths
base_folder = "./isic-2018-task-12-256x256"
csv_file_path = 'train-metadata.csv'
image_folder_path = 'train-image/image'
hdf5_file_path = 'train-image.hdf5'

# Read the CSV file
df = pd.read_csv(os.path.join(base_folder, csv_file_path))

# Open an HDF5 file
with h5py.File(os.path.join(base_folder, hdf5_file_path), 'w') as hdf5_file:
    for index, row in df.iterrows():
        isic_id = row['isic_id']
        image_file_path = os.path.join(base_folder, image_folder_path, f'{isic_id}.jpg')
        if os.path.exists(image_file_path):
            # Open the image file
            with Image.open(image_file_path) as img:
                # Convert the image to a byte buffer
                img_byte_arr = io.BytesIO()
                img.save(img_byte_arr, format=img.format)
                img_byte_arr = img_byte_arr.getvalue()
                hdf5_file.create_dataset(isic_id, data=np.void(img_byte_arr))
        else:
            print(f"Image file for {isic_id} not found.")

print("HDF5 file created successfully.")
To read the hdf5 file, use the following code:
import h5py
from PIL import Image
...
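A minimal reading sketch, assuming the dataset keys are the isic_id values and each entry holds the raw JPEG bytes written by the creation code above:

import io
import h5py
from PIL import Image

hdf5_file_path = 'train-image.hdf5'  # adjust to your local path

with h5py.File(hdf5_file_path, 'r') as hdf5_file:
    for isic_id in list(hdf5_file.keys())[:5]:        # look at the first few entries
        raw_bytes = hdf5_file[isic_id][()].tobytes()  # stored as np.void, convert back to bytes
        img = Image.open(io.BytesIO(raw_bytes))
        print(isic_id, img.size, img.mode)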
License: https://cdla.io/permissive-1-0/
Make the images sharper and larger for better training
import os
import cv2
import numpy as np
from multiprocessing import Pool, cpu_count

input_dir = "/Users/abrarahasanadil/Downloads/Thesis/dataset/distorted_images"
output_dir = "/Users/abrarahasanadil/Downloads/Thesis/dataset/clear_images"

def preprocess_image(img_file):
    img = cv2.imread(img_file)
    # Resize the image to a height of 400 pixels, preserving the aspect ratio
    height, width, _ = img.shape
    new_height = 400
    new_width = int((new_height / height) * width)
    img = cv2.resize(img, (new_width, new_height))
    # Apply bilateral filtering to remove noise while keeping edges sharp
    img = cv2.bilateralFilter(img, 9, 75, 75)
    # Apply unsharp masking to enhance edges
    img = cv2.GaussianBlur(img, (0, 0), 3)
    img = cv2.addWeighted(
        img, 1.2, img, -0.2, 0
    )  # Reduce the value of alpha from 1.5 to 1.2
    # Convert the image to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Increase contrast in darker regions using adaptive histogram equalization
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    gray = clahe.apply(gray)
    # Apply a sharpening filter
    kernel = np.array([[-1, -1, -1], [-1, 9, -1], [-1, -1, -1]])
    gray = cv2.filter2D(gray, -1, kernel)
    # Save the output image
    output_file = os.path.join(output_dir, os.path.basename(img_file))
    os.makedirs(
        os.path.dirname(output_file), exist_ok=True
    )  # Create the output directory if it doesn't exist
    cv2.imwrite(output_file, gray)

if __name__ == "__main__":
    # Get a list of all the image files in the input directory
    image_files = [
        os.path.join(input_dir, f) for f in os.listdir(input_dir) if f.endswith(".jpg")
    ]
    # Create a pool of worker processes
    num_workers = cpu_count()  # Use all available CPU cores
    with Pool(num_workers) as pool:
        # Preprocess all the images in parallel
        pool.map(preprocess_image, image_files)
License: https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a preprocessed version of the NIH Chest X-ray Dataset. The original images were systematically organized, explored, and enhanced to improve their quality for research and machine learning applications.
This preprocessed dataset is ready for use in further analysis, model training, or clinical research, with improved image quality and consistent organization. No changes were made to the original labels or metadata.
The dataset is used in the final project of STA325 at SUSTech.
Use the following function to extract predictions from test_loader:
```python
import os
import torch
from tqdm import tqdm

def predict(model, loader, device):
    model.eval()          # Set the model to evaluation mode
    predictions = []      # Store predicted classes
    image_ids = []        # Store image filenames
    with torch.no_grad():  # Disable gradient computation
        for images, img_paths in tqdm(loader, desc="Predicting on test set"):
            images = images.to(device)              # Move images to the specified device
            outputs = model(images)                 # Forward pass to get model outputs
            _, predicted = torch.max(outputs, 1)    # Get predicted classes
            # Collect predictions and image IDs
            predictions.extend(predicted.cpu().numpy())
            image_ids.extend([os.path.basename(path) for path in img_paths])
    return image_ids, predictions
```
Call the prediction function with the trained model, test_loader, and device:
```python
image_ids, predictions = predict(model, test_loader, device)

import pandas as pd
import os

# Create DataFrame
submission_df = pd.DataFrame({
    "id": image_ids,       # Image filenames
    "label": predictions   # Predicted classes
})

# Save to the specified path
OUTPUT_DIR = "logs"
os.makedirs(OUTPUT_DIR, exist_ok=True)
submission_path = os.path.join(OUTPUT_DIR, "submission.csv")
submission_df.to_csv(submission_path, index=False)
print(f"Kaggle submission file saved to {submission_path}")
```
submission.csv format:
- id: Filenames of test images (without paths, e.g., image1.jpg).
- label: Predicted class indices (e.g., 0, 1, 2, depending on the number of classes).
Example Content:
id,label
000001.jpg,0
000002.jpg,1
000003.jpg,2
Then submit the submission.csv to Kaggle.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Accident Detection Model is built using YOLOv8, Google Colab, Python, Roboflow, deep learning, OpenCV, machine learning, and artificial intelligence. It can detect an accident from a live camera feed, an image, or a video. The model is trained on a dataset of 3,200+ images, which were annotated on Roboflow.
Survey image: https://user-images.githubusercontent.com/78155393/233774342-287492bb-26c1-4acf-bc2c-9462e97a03ca.png
Terms of use: https://www.reddit.com/wiki/api
Please visit dinhanhx/rct
Please take time to read all this readme before using the dataset. Yes I'm serious!
pip install -e .
Go to this PRAW doc page, follow the instructions to get your client id, client secret, and user agent.
Then store them in confidential/reddit.json like this (don't actually write "spooky"):
```json
{
  "id": "spooky",
  "secret": "spooky",
  "user-agent": "windows-10:spooky:v0.0.1 (by u/spooky)"
}
```
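For reference, a minimal sketch of feeding these credentials to PRAW (an illustration only, not necessarily how rct/crawl.py does it):

```python
import json
import praw

# Load the credentials stored in confidential/reddit.json (layout shown above)
with open("confidential/reddit.json") as f:
    creds = json.load(f)

reddit = praw.Reddit(
    client_id=creds["id"],
    client_secret=creds["secret"],
    user_agent=creds["user-agent"],
)
print(reddit.read_only)  # True when no username/password is supplied
```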
(but the number of posts in each category is limited by Reddit)
- Output file: data/cosplay.jsonl
- 2161 posts (on 01/03/2023)
python rct/crawl.py
(in the post's title) enclosed by square brackets, such as [self], [found], ...
- Input file: data/cosplay.jsonl
- Output file: data/clean_cosplay.jsonl
python rct/clean.py
- Input file: data/clean_cosplay.jsonl
- Output files: data/map_cosplay.jsonl, data/bad_response.jsonl
python rct/download.py
⚠ The image_id and image_path attributes' values are NOT linearly continuous. For example,
in data/bad_response.jsonl
```python
{"image_id": "001912", "image_path": "data/image/001912.jpg"}
```
and in data/map_cosplay.jsonl
```python
{"image_id": "001911", "image_path": "data/image/001911.jpg"}
{"image_id": "001913", "image_path": "data/image/001913.jpg"}
```
⚠ `image_path` attribute's values are `data/image/*.jpg`. They are relative to the folder `data` containing all `.jsonl` files and `image` folder. The folder `data` is produced by Python scripts.
⚠ `image_path` attribute's values MISMATCH with *the name of folder containing all `.jsonl` files and `image` folder on _Kaggle_*. When you load the data from Kaggle Dataset, `data/image/000000.jpg`'s `data` should be replaced with Kaggle path (see [this notebook](https://www.kaggle.com/code/inhanhv/rct-demo)). It shall become `/kaggle/input/rcosplay-hot-top-images-with-titles/image/000000.jpg`
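A minimal sketch of that path fix when loading the metadata from the Kaggle dataset (the dataset slug is taken from the example path above):

```python
import json

KAGGLE_ROOT = "/kaggle/input/rcosplay-hot-top-images-with-titles"

records = []
with open(f"{KAGGLE_ROOT}/map_cosplay.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Replace the leading "data/" folder with the Kaggle input path
        record["image_path"] = record["image_path"].replace("data/", f"{KAGGLE_ROOT}/", 1)
        records.append(record)

print(records[0]["image_path"])  # e.g. /kaggle/input/rcosplay-hot-top-images-with-titles/image/000000.jpg
```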
In this dataset, I have included both a resized version of the original dataset and a cropped-then-resized version.
This file contains the name of the file under the 'image' column and the label under the 'level' column.
This folder was created by resizing each image so that its width is at most 1024 pixels (preserving the aspect ratio); smaller images remain unchanged. The code used to create this dataset is:
import glob
import os
from tqdm import tqdm
import math
from PIL import Image

files = glob.glob('D:\\Experiments with Deep Learning\\DR Kaggle\\train\\train\\train\\*.jpeg')
new_width = 1024

for i in tqdm(range(len(files))):
    img = Image.open(files[i])
    width, height = img.size
    ratio = height / width
    if width > new_width:
        new_image = img.resize((new_width, math.ceil(ratio * new_width)))
    else:
        new_image = img
    new_image.save('D:\\Experiments with Deep Learning\\DR Kaggle\\train\\train\\resized_train\\' + os.path.basename(files[i]))
In this case, as much of the black space is cropped out by trying to identify the center and radius of the circle of the fundus image. Some of the images turned out to be fully black or very close to fully black, and no mask was found. Hence, those images were manually removed. There may still be some noisy images remaining, however.
The code used to create this dataset is:
# import the necessary packages
import numpy as np
import cv2
import glob
import os
from tqdm import tqdm
import math
from PIL import Image

files = glob.glob('D:\\Experiments with Deep Learning\\DR Kaggle\\train\\train\\train\\*.jpeg')
new_sz = 1024

def crop_image(image):
    output = image.copy()
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    ret, gray = cv2.threshold(gray, 10, 255, cv2.THRESH_BINARY)
    contours, hierarchy = cv2.findContours(gray, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        print('no contours!')
        flag = 0
        return image, flag
    cnt = max(contours, key=cv2.contourArea)
    ((x, y), r) = cv2.minEnclosingCircle(cnt)
    x = int(x); y = int(y); r = int(r)
    flag = 1
    # print(x, y, r)
    if r > 100:
        return output[0 + (y-r)*int(r
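As a rough illustration of the cropping idea described above (a hypothetical sketch, not the author's original code), the detected circle's bounding box can be clamped to the image and the crop resized:

import cv2

def crop_to_circle(image, min_radius=100, out_size=1024):
    # Hypothetical reimplementation of the circle-based crop described above
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 10, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None  # fully (or nearly) black image: no fundus circle found
    (x, y), r = cv2.minEnclosingCircle(max(contours, key=cv2.contourArea))
    x, y, r = int(x), int(y), int(r)
    if r <= min_radius:
        return None
    h, w = image.shape[:2]
    # Clamp the circle's bounding box to the image borders before cropping
    top, bottom = max(0, y - r), min(h, y + r)
    left, right = max(0, x - r), min(w, x + r)
    cropped = image[top:bottom, left:right]
    return cv2.resize(cropped, (out_size, out_size))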
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Name: Celeb-DF Faces Dataset
Description: The Celeb-DF Faces Dataset is a curated collection of facial images extracted from the Celeb-DF dataset. This dataset focuses on providing a comprehensive set of facial images for research and analysis in the field of deepfake detection and facial image analysis. The images are categorized into two classes: "Fake" and "Real," based on the source of the videos.
Dataset Structure:
Image Size: 224x224 pixels
Source Folders:
- celeb-df-v2/Celeb-real: Contains authentic facial videos.
- celeb-df-v2/Celeb-synthesis: Contains synthesized (fake) facial videos.
- celeb-df-v2/YouTube-real: Contains additional authentic facial videos from YouTube.

Output Folder:
- celeb_faces_224/: Contains the extracted and resized facial images.

Metadata File:
- metadata_celebs.csv: A CSV file storing metadata information for each extracted image, with the following columns:
  - Name: The filename of the extracted image.
  - Label: The label indicating whether the image is "Fake" or "Real."

Creation Process (a minimal code sketch follows this section):
1. Video Frame Extraction: The first frame from each video in the source folders is extracted.
2. Image Resizing: The extracted frames are resized to 224x224 pixels to ensure uniformity and compatibility with common machine learning models.
3. Image Storage: The resized images are saved in the celeb_faces_224/ folder with filenames corresponding to the original video names.
4. Metadata Compilation: A metadata CSV file (metadata_celebs.csv) is created to store the filenames and labels of the images, indicating whether they are from "Fake" or "Real" videos.

Intended Use: The dataset is ideal for tasks such as:
- Deepfake detection and analysis
- Training and evaluation of machine learning models for facial image classification
- Image forensics research and development

Note: This dataset is derived from the Celeb-DF dataset and is intended for research and educational purposes only.
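A rough sketch of the creation process described above (not the dataset author's script; the OpenCV usage and filename handling are assumptions):

import csv
import os
import cv2

SOURCE_FOLDERS = {
    "celeb-df-v2/Celeb-real": "Real",
    "celeb-df-v2/Celeb-synthesis": "Fake",
    "celeb-df-v2/YouTube-real": "Real",
}
OUTPUT_FOLDER = "celeb_faces_224"
os.makedirs(OUTPUT_FOLDER, exist_ok=True)

rows = []
for folder, label in SOURCE_FOLDERS.items():
    for video_name in os.listdir(folder):
        cap = cv2.VideoCapture(os.path.join(folder, video_name))
        ok, frame = cap.read()   # take the first frame only
        cap.release()
        if not ok:
            continue
        frame = cv2.resize(frame, (224, 224))
        image_name = os.path.splitext(video_name)[0] + ".jpg"
        cv2.imwrite(os.path.join(OUTPUT_FOLDER, image_name), frame)
        rows.append({"Name": image_name, "Label": label})

with open("metadata_celebs.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Name", "Label"])
    writer.writeheader()
    writer.writerows(rows)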
https://i.imgur.com/gAwhrwd.jpg
External CFD Aerodynamics Dataset
- The theme for this data is external flow aerodynamics: simulation of the flow physics around some geometry (e.g. aircraft, vehicle).
- Iteration-based simulation results for the flow physics are obtained upon reaching some final convergence criterion.
- The dataset contains simulation results obtained from different CFD solvers, sorted into folders for similar types of data outputs.
Data origin / where the data is stored
- Depending on the type of solver and export option, data can be stored at vertices (grid intersections) or in cells (each little domain, usually the cell centre).
- Before solving the nonlinear equations, we need to discretise the domain into smaller zones; when the solution achieves convergence, it can be exported as either the entire domain data, subsets of the flowfield domain, or just tabular data (at specific points).
(Figures: a discretised region around a geometry; part of the result visualised via a 2D slice.)
Folder Structure: when utilising the dataset with the notebook classes, the recommended data storage structure is:
- Main Folder ( Geometry name used in simulation )
- Case Name ( Brief simulation name; what was tested etc )
- Individual Case Name ( If multiple cases were tested etc ):
- flowfield folder (stores multiblock file content - automatically created when saving VTM)
- tab_final ( final iteration tabular data output content )
- tab_iter ( iteratively changing tabular data, e.g. convergence history of a parameter )
Current Dataset Content: External CFD Aerodynamics Dataset
“AI-Powered Banking Analytics: Automated Power BI Documentation, Churn Prediction, and Transaction Forecasting”
Project Workflow
1. Data Acquisition (Kaggle)
• Dataset sourced from Kaggle (credit card / banking dataset).
• Contains customer demographics, credit card transactions, and account details.
• Cleaned and transformed data in Power BI for dashboard building.
File Details (a short sketch of using the saved forecast model follows the table):

|File / Folder Name|Description|
|--|--|
|.idea/|PyCharm IDE configuration folder (auto-generated).|
|Churn Prediction + Forecasting.py|Main Python script for churn prediction (Random Forest) and transaction forecasting (Prophet).|
|churn_model.pkl|Saved machine learning model (Random Forest) for churn prediction.|
|Churn_Predictions.xlsx|Excel output of churn probabilities and risk categories per customer.|
|Credit Card Financial Dashboard.pbix|Power BI dashboard file (interactive BI report).|
|Credit Card Financial Dashboard.pdf|Exported PDF version of the Power BI dashboard.|
|credit_card.xlsx|Kaggle dataset (credit card transactions / account features).|
|customer.xlsx|Kaggle dataset (customer demographic and account info).|
|DocumentationGenerator.py|Python script that parses the VPAX model and generates automated Power BI documentation.|
|Feature_Importance.xlsx|Feature importance scores from the churn model (top churn drivers).|
|forecast_model.pkl|Saved Prophet model for forecasting monthly transactions.|
|LICENSE|License file for open-source/public sharing.|
|model.vpax|Exported Power BI data model (via DAX Studio) for documentation.|
|PowerBI_Documentation.docx|Word output of auto-generated Power BI documentation.|
|PowerBI_Documentation.xlsx|Excel output of auto-generated Power BI documentation.|
|PowerBI_ER_Diagram.png|Entity-Relationship diagram image generated from the Power BI model.|
|README.md|Markdown summary file for GitHub/Kaggle.|
|Transaction_Forecast.xlsx|Excel output containing actuals + forecast (Prophet) with confidence bounds.|
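As a rough illustration of how the saved forecasting artifact relates to Transaction_Forecast.xlsx (a sketch under the assumption that the Prophet model was pickled; not the project's actual script):

import pickle

# Load the saved Prophet model (assumption: it was serialized with pickle)
with open("forecast_model.pkl", "rb") as f:
    forecast_model = pickle.load(f)

# Forecast the next 12 months of transactions
future = forecast_model.make_future_dataframe(periods=12, freq="M")
forecast = forecast_model.predict(future)

# Keep the forecast and its confidence bounds, as in Transaction_Forecast.xlsx
forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].to_excel(
    "Transaction_Forecast.xlsx", index=False
)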
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is designed for users aiming to train models for text summarization. It contains 2,225 rows of data with two columns: "Text" and "Summary". Each row features a detailed news article or piece of text paired with its corresponding summary, providing a rich resource for developing and fine-tuning summarization algorithms.
This evolving dataset is planned to include additional features, such as text class labels, in future updates. These enhancements will provide more context and facilitate the development of models that can perform summarization across different categories of news content.
Ideal for researchers and developers focused on text summarization tasks, this dataset enables the training of models to effectively compress information while retaining the essence of the original content.
We would like to extend our sincere gratitude to the dataset creator for their contribution to this valuable resource. This dataset, sourced from the BBC News Summary dataset on Kaggle, was created by Pariza. Their work has provided an invaluable asset for those working on text summarization tasks, and we appreciate their efforts in curating and sharing this data with the community.
Thank you for supporting research and development in the field of natural language processing!
This script processes and consolidates text data from various directories containing news articles and their corresponding summaries. It reads the files from specified folders, handles encoding issues, and then creates a DataFrame that is saved as a CSV file for further analysis.
Imports:
- numpy (np): Numerical operations library, though it's not used in this script.
- pandas (pd): Data manipulation and analysis library.
- os: For interacting with the operating system, e.g., building file paths.
- glob: For file pattern matching and retrieving file paths.

Function: get_texts
- text_folders: List of folders containing news article text files.
- text_list: List to store the content of text files.
- summ_folder: List of folders containing summary text files.
- sum_list: List to store the content of summary files.
- encodings: List of encodings to try for reading files.
- Appends the contents it reads to text_list and sum_list.

Data Preparation:
- text_folder: List of directories for news articles.
- summ_folder: List of directories for summaries.
- text_list and summ_list: Initialize empty lists to store the contents.
- data_df: Empty DataFrame to store the final data.

Execution:
- Calls the get_texts function to populate text_list and summ_list.
- Builds data_df with columns 'Text' and 'Summary'.
- Saves data_df to a CSV file at /kaggle/working/bbc_news_data.csv.

Output:
- bbc_news_data.csv, written to /kaggle/working/.
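A minimal sketch of a get_texts function along the lines described above (the folder paths and the encodings list are assumptions, not taken from the original script):

import glob
import os
import pandas as pd

def get_texts(text_folders, text_list, summ_folder, sum_list,
              encodings=('utf-8', 'latin-1', 'cp1252')):
    # Walk paired article/summary folders and read each file, trying several encodings
    for articles_dir, summaries_dir in zip(text_folders, summ_folder):
        for article_path in sorted(glob.glob(os.path.join(articles_dir, '*.txt'))):
            summary_path = os.path.join(summaries_dir, os.path.basename(article_path))
            for encoding in encodings:
                try:
                    with open(article_path, encoding=encoding) as f:
                        article = f.read()
                    with open(summary_path, encoding=encoding) as f:
                        summary = f.read()
                except UnicodeDecodeError:
                    continue  # try the next encoding
                text_list.append(article)
                sum_list.append(summary)
                break

# Hypothetical folder layout: one subfolder per news category
text_folder = ['News Articles/business', 'News Articles/tech']
summ_folder = ['Summaries/business', 'Summaries/tech']
text_list, summ_list = [], []
get_texts(text_folder, text_list, summ_folder, summ_list)

data_df = pd.DataFrame({'Text': text_list, 'Summary': summ_list})
data_df.to_csv('/kaggle/working/bbc_news_data.csv', index=False)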
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset is derived from the ISIC Archive with the following changes:
If the "benign_malignant" column is null and the "diagnosis" column is "vascular lesion", the target is set to null.
DISCLAIMER: I'm not a dermatologist and I'm not affiliated with ISIC in any way. I don't know whether my approach to setting the target value is acceptable for the ISIC competition. Use at your own risk.
import os
import multiprocessing as mp
from PIL import Image, ImageOps
import glob
from functools import partial

def list_jpg_files(folder_path):
    # Ensure the folder path ends with a slash
    if not folder_path.endswith('/'):
        folder_path += '/'
    # Use glob to find all .jpg files in the specified folder (non-recursive)
    jpg_files = glob.glob(folder_path + '*.jpg')
    return jpg_files

def resize_image(image_path, destination_folder):
    # Open the image file
    with Image.open(image_path) as img:
        # Get the original dimensions
        original_width, original_height = img.size
        # Calculate the aspect ratio
        aspect_ratio = original_width / original_height
        # Determine the new dimensions based on the aspect ratio
        if aspect_ratio > 1:
            # Width is larger, so we will crop the width
            new_width = int(256 * aspect_ratio)
            new_height = 256
        else:
            # Height is larger, so we will crop the height
            new_width = 256
            new_height = int(256 / aspect_ratio)
        # Resize the image while maintaining the aspect ratio
        img = img.resize((new_width, new_height))
        # Calculate the crop box to center the image
        left = (new_width - 256) / 2
        top = (new_height - 256) / 2
        right = (new_width + 256) / 2
        bottom = (new_height + 256) / 2
        # Crop the image if it results in shrinking
        if new_width > 256 or new_height > 256:
            img = img.crop((left, top, right, bottom))
        else:
            # Add black edges if it results in scaling up
            img = ImageOps.expand(img, border=(int(left), int(top), int(left), int(top)), fill='black')
        # Resize the image to the final dimensions
        img = img.resize((256, 256))
        img.save(os.path.join(destination_folder, os.path.basename(image_path)))

source_folder = ""
destination_folder = ""
images = list_jpg_files(source_folder)
with mp.Pool(processes=12) as pool:
    images = pool.map(partial(resize_image, destination_folder=destination_folder), images)
print("All images resized")
This code will shrink (down-sample) the image if it is larger than 256x256. But if the image is smaller than 256x256, it will add either vertical or horizontal black edges after scaling up the image. In both scenarios, it will keep the center of the input image in the center of the output image.
The HDF5 file is created using the following code:
import os
import pandas as pd
from PIL import Image
import h5py
import io
import numpy as np

# File paths
base_folder = "./isic-2018-task-3-256x256"
csv_file_path = 'train-metadata.csv'
image_folder_path = 'train-image/image'
hdf5_file_path = 'train-image.hdf5'

# Read the CSV file
df = pd.read_csv(os.path.join(base_folder, csv_file_path))

# Open an HDF5 file
with h5py.File(os.path.join(base_folder, hdf5_file_path), 'w') as hdf5_file:
    for index, row in df.iterrows():
        isic_id = row['isic_id']
        image_file_path = os.path.join(base_folder, image_folder_path, f'{isic_id}.jpg')
        if os.path.exists(image_file_path):
            # Open the image file
            with Image.open(image_file_path) as img:
                # Convert the image to a byte buffer
                img_byte_arr = io.BytesIO()
                img.save(img_byte_arr, format=img.format)
                img_byte_arr = img_byte_arr.getvalue()
                hdf5_file.create_dataset(isic_id, data=np.void(img_byte_arr))
        else:
            print(f"Image file for {isic_id} not found.")

print("HDF5 file created successfully.")
To read the hdf5 file, use the following code:
import h5py
from PIL import Image
...
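The same minimal HDF5-reading sketch shown for the other ISIC 256x256 dataset earlier on this page applies here unchanged (iterate over the isic_id keys and decode each np.void entry back to JPEG bytes with PIL).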
OLM Converter for Mac allows users to export OLM to PST, PDF, MBOX, EML, MSG, EMLX, VCF, ICS, etc. It converts OLM files, including contacts, emails, tasks, calendars, and journals, to multiple file formats. Mac OLM Converter is a reliable tool to bulk convert OLM files to multiple file formats. The software preserves mail metadata elements such as mailing lists, From, Cc, To, Bcc, date, email formatting, folder hierarchy, images, colors, links, and attachments. The tool supports OLM file conversion without any data loss and provides various options for saving the resultant file. Export OLM contacts to CSV format and calendars to ICS format. The Mac OLM Converter is compatible with all Mac OS versions and keeps the data's folder hierarchy intact. It allows users to convert Mac OLM files to 6+ different file formats, and after conversion the output file can be used in any Mac- or Windows-supported application.
Get complete information - https://www.bitvare.com/olm/
Crab is a command line tool for Mac and Windows that scans file data into a SQLite database, so you can run SQL queries over it.
e.g. (Win) C:> crab C:\some\path\MyProject
or (Mac) $ crab /some/path/MyProject
You get a CRAB> prompt where you can enter SQL queries on the data, e.g. Count files by extension
SELECT extension, count(*)
FROM files
GROUP BY extension;
e.g. List the 5 biggest directories
SELECT parentpath, sum(bytes)/1e9 as GB
FROM files
GROUP BY parentpath
ORDER BY sum(bytes) DESC LIMIT 5;
Crab provides a virtual table, fileslines, which exposes file contents to SQL
e.g. Count TODO and FIXME entries in any .c files, recursively
SELECT fullpath, count(*) FROM fileslines
WHERE parentpath like '/Users/GN/HL3/%' and extension = '.c'
and (data like '%TODO%' or data like '%FIXME%')
GROUP BY fullpath;
There are also functions to run programs or shell commands on any subset of files, or on lines within files, e.g. (Mac) unzip all the .zip files, recursively
SELECT exec('unzip', '-n', fullpath, '-d', '/Users/johnsmith/Target Dir/')
FROM files
WHERE parentpath like '/Users/johnsmith/Source Dir/%' and extension = '.zip';
(Here -n tells unzip not to overwrite anything, and -d specifies target directory)
There is also a function to write query output to file, e.g. (Win) Sort the lines of all the .txt files in a directory and write them to a new file
SELECT writeln('C:\Users\SJohnson\dictionary2.txt', data)
FROM fileslines
WHERE parentpath = 'C:\Users\SJohnson\' and extension = '.txt'
ORDER BY data;
In place of the interactive prompt you can run queries in batch mode. E.g. here is a one-liner that returns the full path of all the files in the current directory
C:> crab -batch -maxdepth 1 . "SELECT fullpath FROM files"
Crab SQL can also be used in Windows batch files, or Bash scripts, e.g. for ETL processing.
Crab is free for personal use, $5/mo commercial
See more details here (mac): [http://etia.co.uk/][1] or here (win): [http://etia.co.uk/win/about/][2]
An example SQLite database (Mac data), database.sqlite, has been uploaded for you to play with. It includes an example files table for the directory tree you get when downloading the Project Gutenberg corpus, which contains 95k directories and 123k files.
To scan your own files, and get access to the virtual tables and support functions you have to use the Crab SQLite shell, available for download from this page (Mac): [http://etia.co.uk/download/][3] or this page (Win): [http://etia.co.uk/win/download/][4]
The FILES table contains details of every item scanned, file or directory. All columns are indexed except 'mode'
COLUMNS
fileid (int) primary key -- files table row number, a unique id for each item
name (text) -- item name e.g. 'Hei.ttf'
bytes (int) -- item size in bytes e.g. 7502752
depth (int) -- how far scan recursed to find the item, starts at 0
accessed (text) -- datetime item was accessed
modified (text) -- datetime item was modified
basename (text) -- item name without path or extension, e.g. 'Hei'
extension (text) -- item extension including the dot, e.g. '.ttf'
type (text) -- item type, 'f' for file or 'd' for directory
mode (text) -- further type info and permissions, e.g. 'drwxr-xr-x'
parentpath (text) -- absolute path of directory containing the item, e.g. '/Library/Fonts/'
fullpath (text) unique -- parentpath of the item concatenated with its name, e.g. '/Library/Fonts/Hei.ttf'
PATHS
1) parentpath and fullpath don't support abbreviations such as ~ . or .. They're just strings.
2) Directory paths all have a '/' on the end.
The FILESLINES table is for querying data content of files. It has line number and data columns, with one row for each line of data in each file scanned by Crab.
This table isn't available in the example dataset, because it's a virtual table and doesn't physically contain data.
COLUMNS
linenumber (int) -- line number within file, restarts count from 1 at the first line of each file
data (text) -- data content of the files, one entry for each line
FILESLINES also duplicates the columns of the FILES table: fileid, name, bytes, depth, accessed, modified, basename, extension, type, mode, parentpath, and fullpath. This way you can restrict which files are searched without having to join tables.
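Since the example data ships as an ordinary SQLite file, the documented files table can also be queried outside the Crab shell, e.g. from Python; a small sketch using the columns described above:

import sqlite3

# Open the example database that accompanies this dataset
conn = sqlite3.connect("database.sqlite")

# Five biggest files, using the documented files-table columns
query = """
    SELECT fullpath, bytes
    FROM files
    WHERE type = 'f'
    ORDER BY bytes DESC
    LIMIT 5;
"""
for fullpath, size_bytes in conn.execute(query):
    print(f"{size_bytes / 1e6:.1f} MB  {fullpath}")

conn.close()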
License: https://creativecommons.org/publicdomain/zero/1.0/
US Government Contract Awards for Medical Equipment (product codes 6515 & 6640), 2029-2024. Shows total amounts obligated and outlaid by ZIP code, with Metropolitan Statistical Area added for mapping & visualization.
https://www.usaspending.gov/download_center/award_data_archive
Used Git Bash to remove other products and merge CSVs together:
input_folder="/c/Users/phgie/Downloads/FY2024_All_Contracts"
output_file="combined_filtered.csv"
temp_file="temp_filtered.csv"

# Start with an empty output file
> "$output_file"

for file in "$input_folder"/*.csv; do
    # Skip the header for all but the first file
    if [ ! -s "$output_file" ]; then
        # Include the header row from the first file
        awk -F, 'NR == 1 || $104 == "6515" || $104 == "6640"' "$file" > "$temp_file"
    else
        # Exclude the header row from subsequent files
        awk -F, 'NR > 1 && ($104 == "6515" || $104 == "6640")' "$file" > "$temp_file"
    fi

    # Append the filtered content to the output file
    cat "$temp_file" >> "$output_file"
done

rm -f "$temp_file"

echo "Combined and filtered CSVs are saved in $output_file"
License: https://creativecommons.org/publicdomain/zero/1.0/
Dataset Description Welcome to the "Hung Vuong Hospital Embryo Classification" dataset. This page provides a comprehensive overview of the data files, their formats, and the essential columns you'll encounter in this competition. Taking a moment to understand the data will help you navigate the challenge effectively and make informed decisions during your analysis and modeling.
The dataset comprises the following key files:
- train folder - Contains images of embryos at day-3 and day-5 for training purposes.
- test folder - Contains images of embryos at day-3 and day-5 for testing purposes.
- train.csv - Contains information about the training set.
- test.csv - Contains information about the test set.
- sample_submission.csv - A sample submission file that demonstrates the correct submission format.

Data Format Expectations
The embryo images are arranged within subfolders under the train and test directories. Each image is saved in JPG format and is labeled with a prefix. Images corresponding to day-3 embryos have the prefix D3 while images related to day-5 embryos bear the prefix D5. This prefix-based categorization allows for easy identification of the embryo's developmental stage.
Expected Output
Your task in this competition is to create a deep learning model that can accurately classify embryo images as 1 for good or 0 for not good for both day-3 and day-5 stages. The model should be trained on the training set and then used to predict the embryo quality in the test set. The ID column assigns an ID to each image. You will create the Class column as the result of model classification. The submission file contains only 2 columns: ID and Class (See the sample submission file)
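A minimal sketch of producing a file in that format (assuming sample_submission.csv already lists the test IDs in an ID column alongside a Class column; the predictions below are placeholders):

import pandas as pd

# Start from the provided sample so the IDs and row order match exactly
submission = pd.read_csv("sample_submission.csv")

# Replace the Class column with your model's predictions (1 = good, 0 = not good)
predicted_classes = [0] * len(submission)  # placeholder predictions
submission["Class"] = predicted_classes

submission.to_csv("submission.csv", index=False)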
Columns
You will encounter the following columns throughout the dataset:
- ID - Refers to the ID of the images in the test set.
- Image - Refers to the file name of the embryo images in the train or test folder.
- Class - Represents the evaluation of the embryo images. This column provides the ground truth label for each image, indicating whether the embryo is classified as 'good' or 'not good'.

We encourage you to explore, analyze, and preprocess the provided data to build a robust model for accurate embryo quality classification. Good luck, and may your innovative solutions contribute to advancements in reproductive science!
License: https://creativecommons.org/publicdomain/zero/1.0/
I was searching for labeled Pokemon images which satisfy these requirements:
- Uniform, white backgrounds
- Generations 1 through 8
- Multiple images per Pokemon
I could not find any after searching for a while, so I built one myself!
These images are all scraped from https://pokemondb.net/. Each folder contains between 1 and 8 images (all .jpg) of the Pokemon, all with white backgrounds and reasonable file size. There are 2,500+ total images, which is much greater than any other Kaggle dataset I have found which preserve background and picture quality (no random backgrounds, nor some white and some black, etc.). Note that all other forms of the Pokemon (gigantamax, mega evolution, Alolan, Galarian, etc.) are included in the same folder.
I hope that a larger, cleaner dataset like this one can result in better GANs and VAEs.