Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data sets are used in a controlled experiment in which two classifiers are compared. train_a.csv and explain.csv are slices of the original data set. train_b.csv contains the same instances as train_a.csv, but with feature x1 set to 0 so that classifier B cannot use it.
The original data set was created and split using this Python code:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Generate a two-feature binary classification problem and rescale the features
X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, n_informative=2,
                           n_clusters_per_class=1, class_sep=0.75, random_state=0)
X *= 100

# Classifier A is trained on the unmodified features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
lm = LogisticRegression()
lm.fit(X_train, y_train)
clf_a = lm

# Classifier B is trained on the same instances with the first feature zeroed out
clf_b = LogisticRegression()
X2 = X.copy()
X2[:, 0] = 0
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, test_size=0.5, random_state=0)
clf_b.fit(X2_train, y2_train)

# The held-out half serves as the explain set
X_explain = X_test
y_explain = y_test
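As a quick sanity check on this setup, the sketch below refits both classifiers and compares them on the explain split. It is an illustration only, not part of the published data set code; the names `acc_a`, `acc_b`, and `coef_b_x1` are hypothetical. Because x1 is constant zero in classifier B's training data, its coefficient for that feature is expected to remain at zero.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Recreate the data with the same parameters as in the description.
X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1,
                           class_sep=0.75, random_state=0)
X *= 100

# Classifier B sees the same instances, but with x1 (column 0) zeroed out.
X2 = X.copy()
X2[:, 0] = 0

# The same random_state yields the same row split for both feature matrices.
X_train, X_explain, y_train, y_explain = train_test_split(
    X, y, test_size=0.5, random_state=0)
X2_train, X2_explain, _, _ = train_test_split(
    X2, y, test_size=0.5, random_state=0)

clf_a = LogisticRegression().fit(X_train, y_train)
clf_b = LogisticRegression().fit(X2_train, y_train)

# Compare accuracy on the explain split (illustrative names).
acc_a = clf_a.score(X_explain, y_explain)
acc_b = clf_b.score(X2_explain, y_explain)

# Weight that classifier B assigns to the zeroed feature.
coef_b_x1 = clf_b.coef_[0, 0]
print(f"accuracy A: {acc_a:.3f}, accuracy B: {acc_b:.3f}, B's x1 weight: {coef_b_x1:.3g}")
```

Any difference between the two accuracies can then be attributed to the information carried by x1, which is the point of the controlled comparison.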
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data used for the publication "Assessment of surrogate models for flood inundation: The physics-guided LSG model vs. state-of-the-art machine learning models". Five surrogate models for flood inundation are used to emulate the results of high-resolution hydrodynamic models. The surrogate models are compared on accuracy and computational speed for three distinct case studies: Carlisle (United Kingdom), Chowilla floodplain (Australia), and Burnett River (Australia).

The dataset is structured in 5 files: "Carlisle", "Chowilla", "BurnettRV", "Comparison_results", and "Python_data". At a minimum, running the models requires the "Python_data" file and one of "Carlisle", "Chowilla", or "BurnettRV". We suggest using the "Carlisle" case study for initial testing, given its small size and small data requirement.

"Carlisle", "Chowilla", and "BurnettRV" files

These files contain hydrodynamic modelling data for training and validation for each individual case study, as well as specific Python scripts for training and running the surrogate models in each case study. There are only small differences between the folders, depending on the hydrodynamic model being emulated and the input boundary conditions (input features).

Each case study file has the following folders:
- Geometry_data: DEM files; .npz files containing the high-fidelity model's grid (XYZ coordinates) and areas (the same data is available for the low-fidelity model used in the LSG model); .shp files indicating the location of boundaries and main flow paths (mainly used in the LSTM-SRR model).
- XXX_modeldata: Folder for storing trained model data for each XXX surrogate model. For example, GP_EOF_modeldata contains files used to store the trained GP-EOF model.
- HD_model_data: High-fidelity (and low-fidelity) simulation results for all flood events of that case study. This folder also contains all boundary input conditions.
- HF_EOF_analysis: Stores the data used in the EOF analysis. EOF analysis is applied for the LSG, GP-EOF, and LSTM-EOF surrogate models.
- Results_data: Stores the results of evaluating the surrogate models.
- Train_test_split_data: The train-test-validation data split is the same for all surrogate models. The specific split for each cross-validation fold is stored in this folder.

And Python files:
- YYY_event_summary, YYY_Extrap_event_summary: Files containing an overview of all events, and of which events are connected between the low- and high-fidelity models, for each YYY case study.
- EOF_analysis_HFdata_preprocessing, EOF_analysis_HFdata: Preprocessing before the EOF analysis, and the EOF analysis of the high-fidelity data. This is used for the LSG, GP-EOF, and LSTM-EOF surrogate models.
- Evaluation, Evaluation_extrap: Scripts for evaluating the surrogate models for that case study and saving the results for each cross-validation fold.
- train_test_split: Script for splitting the flood datasets for each cross-validation fold, so that all surrogate models train on the same data.
- XXX_training: Script for training each XXX surrogate model.
- XXX_preprocessing: Some surrogate models rely on information that needs to be generated before training. These scripts generate it.

"Comparison_results" file

Files used for comparing the surrogate models and generating the figures in the paper "Assessment of surrogate models for flood inundation: The physics-guided LSG model vs. state-of-the-art machine learning models". The figures are also included.

"Python_data" file

Folder containing Python scripts with utility functions for setting up, training, running, and evaluating the surrogate models. This folder also contains a python_environment.yml file with all Python package versions and dependencies.

This folder also contains two sub-folders:
- LSG_mods_and_func: Python scripts for using the LSG model. Some of these scripts are also used when working with the other surrogate models.
- SRR_method_master_Zhou2021: Scripts obtained from https://github.com/yuerongz/SRR-method. Small edits have been made for speed and for use in this study.
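EOF analysis, which the LSG, GP-EOF, and LSTM-EOF models rely on, is commonly computed as a singular value decomposition of the mean-removed snapshot matrix. The sketch below illustrates that general idea on random stand-in data. It is not taken from the dataset's scripts; the array shapes and the names `snapshots`, `n_modes`, `eofs`, and `pcs` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data (assumption): 50 time steps of a flood map with 200 grid cells.
snapshots = rng.normal(size=(50, 200))

# Remove the temporal mean so the EOFs describe variability around it.
mean_field = snapshots.mean(axis=0)
anomalies = snapshots - mean_field

# Rows of Vt are the spatial patterns (EOFs), ordered by explained variance.
U, s, Vt = np.linalg.svd(anomalies, full_matrices=False)

n_modes = 5                      # number of retained modes (illustrative choice)
eofs = Vt[:n_modes]              # shape: (n_modes, n_cells)
pcs = anomalies @ eofs.T         # time-varying coefficients, shape: (n_time, n_modes)

# Fraction of total variance captured by the retained modes.
explained = (s[:n_modes] ** 2).sum() / (s ** 2).sum()

# Low-rank reconstruction of the full fields from the retained modes.
reconstruction = mean_field + pcs @ eofs
print(f"{n_modes} modes explain {explained:.1%} of the variance")
```

A surrogate model then only has to predict the few coefficients in `pcs` rather than every grid cell, which is the usual motivation for this kind of dimensionality reduction.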
Training code (Python):
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import os
import pandas as pd
import numpy as np

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
TEMP_DIR = "tmp"
os.makedirs(TEMP_DIR, exist_ok=True)

train = pd.read_csv('input/map-charting-student-math-misunderstandings/train.csv')
train.Misconception = train.Misconception.fillna('NA')
train['target'] = train.Category + ":" + train.Misconception
le = LabelEncoder()
train['label'] = le.fit_transform(train['target'])
n_classes = len(le.classes_)  # Number of unique target classes
print(f"Train shape: {train.shape} with {n_classes} target classes")
print("Train head:")
train.head()

# Among rows whose Category starts with "True", take the most frequent answer per question as the correct one
idx = train.apply(lambda row: row.Category.split('_')[0], axis=1) == 'True'
correct = train.loc[idx].copy()
correct['c'] = correct.groupby(['QuestionId', 'MC_Answer']).MC_Answer.transform('count')
correct = correct.sort_values('c', ascending=False)
correct = correct.drop_duplicates(['QuestionId'])
correct = correct[['QuestionId', 'MC_Answer']]
correct['is_correct'] = 1  # Mark these as correct answers

train = train.merge(correct, on=['QuestionId', 'MC_Answer'], how='left')
train.is_correct = train.is_correct.fillna(0)
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
Model_Name = "unsloth/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForSequenceClassification.from_pretrained(Model_Name, num_labels=n_classes, torch_dtype=torch.bfloat16, device_map="balanced", cache_dir=TEMP_DIR)
tokenizer = AutoTokenizer.from_pretrained(Model_Name, cache_dir=TEMP_DIR)
def format_input(row):
    x = "Yes"
    if not row['is_correct']:
        x = "No"
    return (
        f"Question: {row['QuestionText']} "
        f"Answer: {row['MC_Answer']} "
        f"Correct? {x} "
        f"Student Explanation: {row['StudentExplanation']}"
    )

train['text'] = train.apply(format_input, axis=1)
print("Example prompt for our LLM:")
print()
print(train.text.values[0])
from datasets import Dataset
COLS = ['text', 'label']
train_df_clean = train[COLS].copy() # Use 'train' instead of 'train_df'
train_df_clean['label'] = train_df_clean['label'].astype(np.int64)
train_df_clean = train_df_clean.reset_index(drop=True)
train_ds = Dataset.from_pandas(train_df_clean, preserve_index=False)
def tokenize(batch):
    """Tokenizes a batch of text inputs."""
    return tokenizer(batch["text"], truncation=True, max_length=256)
train_ds = train_ds.map(tokenize, batched=True, remove_columns=['text'])
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id
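import os
from huggingface_hub import scan_cache_dir

# Free disk space by deleting every cached revision in the default Hugging Face cache.
# delete_revisions() expects commit hashes, so collect them from each cached repo.
cache_info = scan_cache_dir()
revision_hashes = [rev.commit_hash for repo in cache_info.repos for rev in repo.revisions]
cache_info.delete_revisions(*revision_hashes).execute()

from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
import tempfile
import shutil

os.makedirs(f"{TEMP_DIR}/training_output/", exist_ok=True)
os.makedirs(f"{TEMP_DIR}/logs/", exist_ok=True)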
training_args = TrainingArguments(
output_dir=f"{TEMP_DIR}/training_output/",
do_train=True,
do_eval=False,
save_strategy="no",
num_train_epochs=3,
per_device_train_batch_size=16,
learning_rate=5e-5,
logging_dir=f"{TEMP_DIR}/logs/",
logging_steps=500,
bf16=True,
fp16=False,
report_to="none",
warmup_ratio=0.1,
lr_scheduler_type="cosine",
dataloader_pin_memory=False,
gradient_checkpointing=True,
)
def compute_map3(eval_pred):
    """Computes Mean Average Precision at 3 (MAP@3) for evaluation."""
    logits, labels = eval_pred
    probs = torch.nn.functional.softmax(torch.tensor(logits), dim=-1).numpy()
    # Get top 3 predicted class indi...
Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics. Authors: Kazutaka Kamiya, MD, PhD, Ik Hee Ryu, MD, MS, Tae Keun Yoo, MD, Jung Sub Kim MD, In Sik Lee, MD, PhD, Jin Kook Kim MD, Wakako Ando CO, Nobuyuki Shoji, MD, PhD, Tomofusa Yamauchi, MD, PhD, Hitoshi Tabuchi, MD, PhD.
We hypothesize that machine learning of preoperative biometric data obtained by As-OCT may be clinically beneficial for predicting the actual ICL vault. Therefore, we built a machine learning model using Random Forest to predict the ICL vault after surgery.
This multicenter study comprised 1745 eyes of 1745 consecutive patients (656 men and 1089 women) who underwent EVO ICL implantation (V4c and V5 Visian ICL with KS-AquaPORT) for the correction of moderate to high myopia and myopic astigmatism, and who completed at least a 1-month follow-up, at Kitasato University Hospital (Kanagawa, Japan) or at B&VIIT Eye Center (Seoul, Korea).
This data file (RFR_model(feature=12).mat) is the final trained random forest model for MATLAB 2020a.
Python version:
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor

# Mount Google Drive (Google Colab only)
from google.colab import auth
auth.authenticate_user()
from google.colab import drive
drive.mount('/content/gdrive')

dataset = pd.read_csv('gdrive/My Drive/ICL/data_icl.csv')
dataset.head()

# Vault_1M is the prediction target; all remaining columns are features
y = dataset['Vault_1M']
X = dataset.drop(['Vault_1M'], axis=1)

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

# 'absolute_error' is the current scikit-learn name for the criterion formerly called 'mae'
# (scikit-learn versions before 1.0 only accept 'mae')
parameters = {'bootstrap': True, 'min_samples_leaf': 3, 'n_estimators': 500, 'criterion': 'absolute_error',
              'min_samples_split': 10, 'max_features': 'sqrt', 'max_depth': 6, 'max_leaf_nodes': None}

RF_model = RandomForestRegressor(**parameters)
RF_model.fit(train_X, train_y)
RF_predictions = RF_model.predict(test_X)
importance = RF_model.feature_importances_
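The listing stops after computing predictions and feature importances. One way such a regressor might be assessed is sketched below, using mean absolute error against a predict-the-mean baseline. The data here is synthetic stand-in data, not the ICL data set: 12 features are assumed only because the model file name mentions `feature=12`, and the default squared-error criterion with 100 trees is used to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 300 eyes, 12 preoperative features.
X = rng.normal(size=(300, 12))
y = 500 + 80 * X[:, 0] - 40 * X[:, 1] + rng.normal(scale=20, size=300)

train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=100, min_samples_leaf=3,
                              max_depth=6, max_features='sqrt',
                              random_state=0)
model.fit(train_X, train_y)
pred = model.predict(test_X)

# Error of the model versus always predicting the training mean.
mae = mean_absolute_error(test_y, pred)
baseline_pred = np.full(len(test_y), train_y.mean())
baseline_mae = mean_absolute_error(test_y, baseline_pred)

# Rank features by impurity-based importance, most important first.
ranking = np.argsort(model.feature_importances_)[::-1]
print(f"MAE: {mae:.1f} (baseline: {baseline_mae:.1f}), top features: {ranking[:3]}")
```

Comparing against a trivial baseline makes the error easier to interpret than a raw MAE value on its own.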
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains 200 patient records with 8 health indicators used to predict diabetes risk. The dataset is ideal for binary classification tasks in healthcare analytics, diabetes prediction, and medical research. With balanced gender distribution (50% male, 50% female) and diverse age groups (21-79 years), it provides a solid foundation for machine learning experiments.
| Feature Name | Description | Type | Range/Values |
|---|---|---|---|
| PatientID | Unique patient identifier | Integer | 1-200 |
| Age | Patient's age in years | Integer | 21-79 |
| Gender | Patient's gender | Categorical | Male/Female |
| BMI | Body Mass Index | Float | 18.98-49.35 |
| BloodPressure | Diastolic blood pressure (mm Hg) | Integer | 71-178 |
| Insulin | Insulin level (mu U/ml) | Integer | 15-273 |
| Glucose | Plasma glucose concentration | Integer | 70-198 |
| DiabetesPedigreeFunction | Genetic diabetes likelihood score | Float | 0.148-2.467 |
| Outcome | Diabetes diagnosis (1/0) | Binary | 0 or 1 |
| Feature | Count | Mean | Std Dev | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|---|
| Age | 200 | 48.27 | 16.15 | 21 | 36 | 50 | 60 | 79 |
| BMI | 200 | 31.99 | 8.21 | 18.98 | 26.06 | 30.93 | 36.33 | 49.35 |
| BloodPressure | 200 | 122.19 | 25.61 | 71 | 102 | 124 | 140 | 178 |
| Insulin | 200 | 137.65 | 70.57 | 15 | 88.25 | 131.5 | 187 | 273 |
| Glucose | 200 | 133.30 | 33.59 | 70 | 112 | 130 | 154 | 198 |
| DiabetesPedigree | 200 | 1.04 | 0.55 | 0.148 | 0.64 | 0.92 | 1.32 | 2.46 |
| Feature | Value | Count | Percentage |
|---|---|---|---|
| Gender | Male | 100 | 50.0% |
| Gender | Female | 100 | 50.0% |
| Outcome | 0 (No Diabetes) | 103 | 51.5% |
| Outcome | 1 (Diabetes) | 97 | 48.5% |
Age Factor:
BMI Thresholds:
Critical Health Markers:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
# Load dataset
df = pd.read_csv("diabetes_dataset.csv")
# Preprocessing
X = df[['Age','BMI','Glucose','BloodPressure','Insulin','DiabetesPedigreeFunction']]
y = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Train model
model = RandomForestClassifier()
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.2f}")
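The example above leaves out the categorical Gender column and does not fix the split. A possible variant is sketched below: it encodes Gender as 0/1, stratifies the split on Outcome, and sets `random_state` so results are repeatable. To stay self-contained it runs on a randomly generated stand-in frame that only borrows a few column names and ranges from the tables above, so the printed accuracy is meaningless (about chance level).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200

# Random stand-in data (assumption), NOT the real diabetes_dataset.csv.
df = pd.DataFrame({
    "Age": rng.integers(21, 80, size=n),
    "Gender": rng.choice(["Male", "Female"], size=n),
    "BMI": rng.uniform(18.98, 49.35, size=n),
    "Glucose": rng.integers(70, 199, size=n),
    "Outcome": rng.integers(0, 2, size=n),
})

# Encode the categorical column as 0/1 so tree models can consume it.
df["Gender"] = (df["Gender"] == "Male").astype(int)

X = df.drop(columns="Outcome")
y = df["Outcome"]

# Stratify on the target and fix the seed for a reproducible split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Accuracy on random stand-in data: {accuracy:.2f}")
```

With only 200 records, a single 40-row test split gives a noisy estimate, so cross-validation may be a better fit for the real data.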
This dataset contains short video clips organized into six classes of Singapore landmarks/stations: Bedok, City Hall, Clementi, Esplanade, MBS, and Orchard. Each class has 30 .mp4 clips, for a total of 180 videos. It’s designed for tasks like video classification, keypoint extraction, and sequence modeling.
The clips are curated to be consistent per class and are suitable for computer vision pipelines that work with raw videos or frame-level features. The dataset pairs well with downstream processing such as keyframe extraction, pose/hand landmark extraction, and LSTM/Transformer sequence modeling.
- Classes: Bedok, City Hall, Clementi, Esplanade, MBS, Orchard
- File format: .mp4
- File naming: <ClassName>_<Index>.mp4, where Index is 0–29
MP_Videos_All/
├── Bedok/
│ ├── Bedok_0.mp4
│ ├── ...
│ └── Bedok_29.mp4
├── City Hall/
│ ├── City Hall_0.mp4
│ ├── ...
│ └── City Hall_29.mp4
├── Clementi/
│ ├── Clementi_0.mp4
│ ├── ...
│ └── Clementi_29.mp4
├── Esplanade/
│ ├── Esplanade_0.mp4
│ ├── ...
│ └── Esplanade_29.mp4
├── MBS/
│ ├── MBS_0.mp4
│ ├── ...
│ └── MBS_29.mp4
└── Orchard/
    ├── Orchard_0.mp4
    ├── ...
    └── Orchard_29.mp4
Tip: If you need numeric labels, sort class names alphabetically and map to indices:
from pathlib import Path
root = Path("MP_Videos_All")
classes = sorted([p.name for p in root.iterdir() if p.is_dir()])
label_to_index = {label: i for i, label in enumerate(classes)}
label_to_index # {'Bedok': 0, 'City Hall': 1, 'Clementi': 2, 'Esplanade': 3, 'MBS': 4, 'Orchard': 5}
from pathlib import Path
root = Path("MP_Videos_All")
video_paths = sorted(root.rglob("*.mp4"))
print(len(video_paths), "videos")
print(video_paths[0])
print(video_paths[0].parent.name) # class label
import cv2
from pathlib import Path
def read_video_frames(path, max_frames=None):
    cap = cv2.VideoCapture(str(path))
    frames, count = [], 0
    ok, frame = cap.read()
    while ok:
        frames.append(frame)  # BGR format
        count += 1
        if max_frames and count >= max_frames:
            break
        ok, frame = cap.read()
    cap.release()
    return frames
sample_video = next(Path("MP_Videos_All").rglob("*.mp4"))
frames = read_video_frames(sample_video, max_frames=64)
len(frames)
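Sequence models usually expect a fixed number of frames per clip, whereas reading with `max_frames` simply keeps the first frames. One common alternative, sketched here as an assumption rather than as part of the dataset's own tooling, is to pick frame indices spread evenly across the clip. The helper name `sample_frame_indices` is hypothetical.

```python
import numpy as np

def sample_frame_indices(n_frames, n_samples):
    """Return n_samples indices spread evenly over [0, n_frames - 1]."""
    if n_frames <= 0:
        return np.array([], dtype=int)
    # linspace includes both endpoints, so the first and last frames are kept.
    return np.linspace(0, n_frames - 1, num=n_samples).round().astype(int)

# Example: choose 16 of 64 decoded frames.
indices = sample_frame_indices(n_frames=64, n_samples=16)
print(indices)
```

The selected frames can then be gathered with `[frames[i] for i in indices]`. If a clip has fewer frames than requested, some indices repeat, which pads short clips by duplicating frames.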
from pathlib import Path
from sklearn.model_selection import train_test_split
root = Path("MP_Videos_All")
all_videos = sorted(root.rglob("*.mp4"))
labels = [p.parent.name for p in all_videos]
X_train, X_tmp, y_train, y_tmp = train_test_split(all_videos, labels, test_size=0.3, stratify=labels, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)
len(X_train), len(X_val), len(X_test)
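To see what the stratified 70/15/15 split above produces without having the videos on disk, the sketch below rebuilds the expected file list from the documented naming scheme (an assumption that all 180 files are present) and counts clips per class in each subset.

```python
from collections import Counter
from pathlib import Path
from sklearn.model_selection import train_test_split

classes = ["Bedok", "City Hall", "Clementi", "Esplanade", "MBS", "Orchard"]

# Rebuild the expected paths from the naming scheme; nothing is read from disk.
all_videos = [Path("MP_Videos_All") / c / f"{c}_{i}.mp4"
              for c in classes for i in range(30)]
labels = [p.parent.name for p in all_videos]

X_train, X_tmp, y_train, y_tmp = train_test_split(
    all_videos, labels, test_size=0.3, stratify=labels, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

# Per-class counts in each subset.
train_counts = Counter(y_train)
val_counts = Counter(y_val)
test_counts = Counter(y_test)
print(len(X_train), len(X_val), len(X_test))
print(train_counts)
```

Each class contributes 21 clips to training. The remaining 9 clips per class cannot be halved evenly, so validation and test each receive 4 or 5 clips per class.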
If you plan to perform landmark-based or frame-based modeling, consider creating or using companion assets like:
- Annotated variants (e.g., hand-only landmarks)
- Keyframe .npy sequences
- Precomputed datasets for 3-class or 6-class experiments
These resources are commonly produced downstream from this dataset and can be shared as separate Kaggle datasets for convenience.
If you use this dataset, please cite:
@dataset{mp_videos_all_2025,
title = {MP_Videos_All},
author = {Authors},
year = {2025},
url = {Kaggle dataset URL}
}
This dataset is released under CC BY-NC 4.0 (Attribution-NonCommercial). You may use and adapt it for non-commercial purposes with appropriate attribution. For commercial use, please contact the authors.