6 datasets found

One Classifier Ignores a Feature
data.niaid.nih.gov
Updated Apr 29, 2022
Maier, Karl (2022). One Classifier Ignores a Feature [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6502642
Dataset updated
Apr 29, 2022
Authors
Maier, Karl
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The data sets are used in a controlled experiment in which two classifiers are compared. train_a.csv and explain.csv are slices of the original data set. train_b.csv contains the same instances as train_a.csv, but with feature x1 set to 0 to make it unusable for classifier B.

The original data set was created and split using this Python code:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Two informative features, scaled by 100
X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, n_informative=2,
                           n_clusters_per_class=1, class_sep=0.75, random_state=0)
X *= 100

# Classifier A is trained on the unmodified features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
lm = LogisticRegression()
lm.fit(X_train, y_train)
clf_a = lm

# Classifier B is trained on a copy with the first feature (x1) set to 0
clf_b = LogisticRegression()
X2 = X.copy()
X2[:, 0] = 0
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, test_size=0.5, random_state=0)
clf_b.fit(X2_train, y2_train)

# The held-out half is used as the explain set
X_explain = X_test
y_explain = y_test
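One way to check that the zeroed feature really is ignored is to look at the fitted coefficients. The sketch below repeats the setup from the description with the same calls and seeds, then compares the two models on the explain split. `max_iter=1000` and the names `coef_a`, `coef_b`, and `disagree` are assumptions added here for illustration and are not part of the original code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, n_informative=2,
                           n_clusters_per_class=1, class_sep=0.75, random_state=0)
X *= 100

# Copy of the data with the first feature zeroed, as used for classifier B
X2 = X.copy()
X2[:, 0] = 0

X_train, X_explain, y_train, y_explain = train_test_split(X, y, test_size=0.5, random_state=0)
X2_train, _, y2_train, _ = train_test_split(X2, y, test_size=0.5, random_state=0)

# max_iter is raised here (an assumption) to give the solver room on the scaled features
clf_a = LogisticRegression(max_iter=1000).fit(X_train, y_train)
clf_b = LogisticRegression(max_iter=1000).fit(X2_train, y2_train)

coef_a = clf_a.coef_[0]
coef_b = clf_b.coef_[0]
print("clf_a coefficients:", coef_a)
print("clf_b coefficients:", coef_b)  # the first entry should be zero

# Instances of the explain split on which the two classifiers disagree
disagree = clf_a.predict(X_explain) != clf_b.predict(X_explain)
print(int(disagree.sum()), "of", len(X_explain), "explain instances are classified differently")
```

Because the weight on the first feature is zero, classifier B gives the same predictions whether it sees the original or the zeroed values of that feature.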
Surrogate flood model comparison - Datasets and python code
figshare.unimelb.edu.au
bin
Updated Jan 19, 2024
Niels Fraehr (2024). Surrogate flood model comparison - Datasets and python code [Dataset]. http://doi.org/10.26188/24312658.v1
Available download formats: bin
Unique identifier
https://doi.org/10.26188/24312658.v1
Dataset updated
Jan 19, 2024
Dataset provided by
The University of Melbourne
Authors
Niels Fraehr
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data used for the publication "Assessment of surrogate models for flood inundation: The physics-guided LSG model vs. state-of-the-art machine learning models". Five surrogate models for flood inundation are used to emulate the results of high-resolution hydrodynamic models. The surrogate models are compared on accuracy and computational speed for three distinct case studies: Carlisle (United Kingdom), Chowilla floodplain (Australia), and Burnett River (Australia).

The dataset is structured in 5 files: "Carlisle", "Chowilla", "BurnettRV", "Comparison_results", and "Python_data". At a minimum, running the models requires the "Python_data" file and one of "Carlisle", "Chowilla", or "BurnettRV". We suggest using the "Carlisle" case study for initial testing, given its small size and small data requirement.

"Carlisle", "Chowilla", and "BurnettRV" files

These files contain hydrodynamic modelling data for training and validation for each individual case study, as well as specific Python scripts for training and running the surrogate models in each case study. There are only small differences between the folders, depending on the hydrodynamic model being emulated and the input boundary conditions (input features).

Each case study file has the following folders:

Geometry_data: DEM files; .npz files containing the high-fidelity model's grid (XYZ coordinates) and areas (the same data is available for the low-fidelity model used in the LSG model); .shp files indicating the location of boundaries and main flow paths (mainly used in the LSTM-SRR model).
XXX_modeldata: Folder for storing trained model data for each XXX surrogate model. For example, GP_EOF_modeldata contains files used to store the trained GP-EOF model.
HD_model_data: High-fidelity (and low-fidelity) simulation results for all flood events of that case study. This folder also contains all boundary input conditions.
HF_EOF_analysis: Stores data used in the EOF analysis. EOF analysis is applied for the LSG, GP-EOF, and LSTM-EOF surrogate models.
Results_data: Stores the results of running the evaluation of the surrogate models.
Train_test_split_data: The train-test-validation data split is the same for all surrogate models. The specific split for each cross-validation fold is stored in this folder.

And the following Python files:

YYY_event_summary, YYY_Extrap_event_summary: Files containing an overview of all events, and which events are connected between the low- and high-fidelity models for each YYY case study.
EOF_analysis_HFdata_preprocessing, EOF_analysis_HFdata: Preprocessing before the EOF analysis, and the EOF analysis of the high-fidelity data. This is used for the LSG, GP-EOF, and LSTM-EOF surrogate models.
Evaluation, Evaluation_extrap: Scripts for evaluating the surrogate model for that case study and saving the results for each cross-validation fold.
train_test_split: Script for splitting the flood datasets for each cross-validation fold, so that all surrogate models train on the same data.
XXX_training: Script for training each XXX surrogate model.
XXX_preprocessing: Some surrogate models rely on information that needs to be generated before training. This is done using these scripts.

"Comparison_results" file

Files used for comparing the surrogate models and generating the figures in the paper "Assessment of surrogate models for flood inundation: The physics-guided LSG model vs. state-of-the-art machine learning models". Figures are also included.

"Python_data" file

Folder containing Python script with utility functions for setting up, training, and running the surrogate models, as well as for evaluating the surrogate models. This folder also contains a python_environment.yml file with all Python package versions and dependencies.

This folder also contains two sub-folders:

LSG_mods_and_func: Python scripts for using the LSG model. Some of these scripts are also used when working with the other surrogate models.
SRR_method_master_Zhou2021: Scripts obtained from https://github.com/yuerongz/SRR-method. Small edits have been made for speed and for use in this study.
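The idea of storing one fixed split per cross-validation fold, so that every surrogate model trains on the same events, can be sketched as follows. This is only an illustration under stated assumptions: the event count, the `cv_folds.npz` file name, the key names, and the use of scikit-learn's `KFold` are all hypothetical, and the sketch stores train/test indices only, whereas the dataset describes a train-test-validation split.

```python
import tempfile
from pathlib import Path

import numpy as np
from sklearn.model_selection import KFold

# Hypothetical flood event IDs; the real event lists live in the case study files
event_ids = np.arange(20)

# Build the folds once, with a fixed seed
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
folds = {}
for k, (train_idx, test_idx) in enumerate(kfold.split(event_ids)):
    folds[f"fold{k}_train"] = event_ids[train_idx]
    folds[f"fold{k}_test"] = event_ids[test_idx]

# Store the split on disk, then reload it the way each training script would
with tempfile.TemporaryDirectory() as tmp:
    split_file = str(Path(tmp) / "cv_folds.npz")  # hypothetical file name
    np.savez(split_file, **folds)
    with np.load(split_file) as data:
        reloaded = {name: data[name] for name in data.files}

for k in range(5):
    print(f"fold {k}: test events {reloaded[f'fold{k}_test'].tolist()}")
```

Reading the stored indices, rather than re-running the splitter in every training script, keeps the folds identical even if a script changes its seed or library version.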
Llama 3.1 8B Correct Labels
kaggle.com
zip
Updated Aug 26, 2025
Jatin Mehra_666 (2025). Llama 3.1 8B Correct Labels [Dataset]. https://www.kaggle.com/datasets/jatinmehra666/llama-3-1-8b-correct-labels
Available download formats: zip (11853454078 bytes)
Dataset updated
Aug 26, 2025
Authors
Jatin Mehra_666
Description
Training code (Python):

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import os
import pandas as pd
import numpy as np

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
TEMP_DIR = "tmp"
os.makedirs(TEMP_DIR, exist_ok=True)
train = pd.read_csv('input/map-charting-student-math-misunderstandings/train.csv')

# Fill missing Misconception values with 'NA'
train.Misconception = train.Misconception.fillna('NA')

# Create a combined target label (Category:Misconception)
train['target'] = train.Category + ":" + train.Misconception

# Encode target labels to numerical format
le = LabelEncoder()
train['label'] = le.fit_transform(train['target'])
n_classes = len(le.classes_)  # Number of unique target classes
print(f"Train shape: {train.shape} with {n_classes} target classes")
print("Train head:")
train.head()

# For each question, take the most frequent answer among rows whose Category starts with 'True'
idx = train.apply(lambda row: row.Category.split('_')[0], axis=1) == 'True'
correct = train.loc[idx].copy()
correct['c'] = correct.groupby(['QuestionId', 'MC_Answer']).MC_Answer.transform('count')
correct = correct.sort_values('c', ascending=False)
correct = correct.drop_duplicates(['QuestionId'])
correct = correct[['QuestionId', 'MC_Answer']]
correct['is_correct'] = 1  # Mark these as correct answers

# Merge 'is_correct' flag into the main training DataFrame
train = train.merge(correct, on=['QuestionId', 'MC_Answer'], how='left')
train.is_correct = train.is_correct.fillna(0)
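The effect of the group-count, sort, and drop-duplicates steps is easier to see on a small table. The sketch below applies the same logic to a toy DataFrame. The rows and Category values are invented for illustration, and the vectorised `.str.split` replaces the row-wise `apply` from the original.

```python
import pandas as pd

# Toy data (hypothetical): question 1 has two 'True' rows for answer A, question 2 has one for C
toy = pd.DataFrame({
    "QuestionId": [1, 1, 1, 2, 2],
    "MC_Answer": ["A", "A", "B", "C", "D"],
    "Category": ["True_Correct", "True_Neither", "False_Misconception",
                 "True_Correct", "False_Neither"],
})

# Keep rows whose Category prefix is 'True'
is_true = toy["Category"].str.split("_").str[0] == "True"
correct = toy.loc[is_true].copy()

# Count how often each (question, answer) pair occurs and keep the most frequent answer per question
correct["c"] = correct.groupby(["QuestionId", "MC_Answer"])["MC_Answer"].transform("count")
correct = correct.sort_values("c", ascending=False).drop_duplicates(["QuestionId"])
correct = correct[["QuestionId", "MC_Answer"]]
correct["is_correct"] = 1

# A left merge keeps every original row; rows without a match get 0
toy = toy.merge(correct, on=["QuestionId", "MC_Answer"], how="left")
toy["is_correct"] = toy["is_correct"].fillna(0).astype(int)
print(toy)
```

Rows whose (QuestionId, MC_Answer) pair matches the selected answer get `is_correct = 1`, and all others get 0.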

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

Model_Name = "unsloth/Meta-Llama-3.1-8B-Instruct"

model = AutoModelForSequenceClassification.from_pretrained(
    Model_Name,
    num_labels=n_classes,
    torch_dtype=torch.bfloat16,
    device_map="balanced",
    cache_dir=TEMP_DIR,
)

tokenizer = AutoTokenizer.from_pretrained(Model_Name, cache_dir=TEMP_DIR)

def format_input(row):
    x = "Yes"
    if not row['is_correct']:
        x = "No"
    return (
        f"Question: {row['QuestionText']} "
        f"Answer: {row['MC_Answer']} "
        f"Correct? {x} "
        f"Student Explanation: {row['StudentExplanation']}"
    )

train['text'] = train.apply(format_input, axis=1)
print("Example prompt for our LLM:")
print()
print(train.text.values[0])

from datasets import Dataset

# Split data into training and validation sets
train_df, val_df = train_test_split(train, test_size=0.2, random_state=42)

# Convert to Hugging Face Dataset
COLS = ['text', 'label']

# Create clean DataFrame with the full training data
train_df_clean = train[COLS].copy()  # Use 'train' instead of 'train_df'

# Ensure labels are proper integers
train_df_clean['label'] = train_df_clean['label'].astype(np.int64)

# Reset index to ensure clean DataFrame structure
train_df_clean = train_df_clean.reset_index(drop=True)

# Create dataset with the full training data
train_ds = Dataset.from_pandas(train_df_clean, preserve_index=False)

def tokenize(batch):
    """Tokenizes a batch of text inputs."""
    return tokenizer(batch["text"], truncation=True, max_length=256)

# Apply tokenization to the full dataset
train_ds = train_ds.map(tokenize, batched=True, remove_columns=['text'])

# Add a new padding token
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# Resize the model's token embeddings to match the new tokenizer
model.resize_token_embeddings(len(tokenizer))

# Set the pad token id in the model's config
model.config.pad_token_id = tokenizer.pad_token_id

# 2. Clear HF cache after loading
import os
from huggingface_hub import scan_cache_dir

# Then clear cache to free ~16GB
cache_info = scan_cache_dir()
# delete_revisions expects revision commit hashes, so collect them from every cached repo
revision_hashes = [rev.commit_hash for repo in cache_info.repos for rev in repo.revisions]
cache_info.delete_revisions(*revision_hashes).execute()

# --- Training Arguments ---
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
import tempfile
import shutil

# Ensure temp directories exist
os.makedirs(f"{TEMP_DIR}/training_output/", exist_ok=True)
os.makedirs(f"{TEMP_DIR}/logs/", exist_ok=True)

# --- Training Arguments (FIXED) ---
training_args = TrainingArguments(
    output_dir=f"{TEMP_DIR}/training_output/",
    do_train=True,
    do_eval=False,
    save_strategy="no",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    logging_dir=f"{TEMP_DIR}/logs/",
    logging_steps=500,
    bf16=True,
    fp16=False,
    report_to="none",
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    dataloader_pin_memory=False,
    gradient_checkpointing=True,
)

# --- Custom Metric Computation (MAP@3) ---
def compute_map3(eval_pred):
    """Computes Mean Average Precision at 3 (MAP@3) for evaluation."""
    logits, labels = eval_pred
    probs = torch.nn.functional.softmax(torch.tensor(logits), dim=-1).numpy()

    # Get top 3 predicted class indi...
Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning...
narcis.nl
data.mendeley.com
Updated Jan 11, 2021
Yoo, T (via Mendeley Data) (2021). Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics" [Dataset]. http://doi.org/10.17632/ffn745r57z.2
Unique identifier
https://doi.org/10.17632/ffn745r57z.2
Dataset updated
Jan 11, 2021
Dataset provided by
Data Archiving and Networked Services (DANS)
Authors
Yoo, T (via Mendeley Data)
Description
Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics. Authors: Kazutaka Kamiya, MD, PhD, Ik Hee Ryu, MD, MS, Tae Keun Yoo, MD, Jung Sub Kim, MD, In Sik Lee, MD, PhD, Jin Kook Kim, MD, Wakako Ando, CO, Nobuyuki Shoji, MD, PhD, Tomofusa Yamauchi, MD, PhD, Hitoshi Tabuchi, MD, PhD.

We hypothesize that machine learning of preoperative biometric data obtained by the As-OCT may be clinically beneficial for predicting the actual ICL vault. Therefore, we built the machine learning model using Random Forest to predict ICL vault after surgery.

This multicenter study comprised one thousand seven hundred forty-five eyes of 1745 consecutive patients (656 men and 1089 women), who underwent EVO ICL implantation (V4c and V5 Visian ICL with KS-AquaPORT) for the correction of moderate to high myopia and myopic astigmatism, and who completed at least a 1-month follow-up, at Kitasato University Hospital (Kanagawa, Japan), or at B&VIIT Eye Center (Seoul, Korea).

This data file (RFR_model(feature=12).mat) is the final trained random forest model for MATLAB 2020a.

Python version:

from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor

# Connect to the data in your Google Drive
from google.colab import auth
auth.authenticate_user()
from google.colab import drive
drive.mount('/content/gdrive')

# Change the path for the custom data
# In this case, we predict the ICL vault from preoperative measurements
dataset = pd.read_csv('gdrive/My Drive/ICL/data_icl.csv')
dataset.head()

# Optimal features (sorted by importance):
# 1. ICL size 2. ICL power 3. LV 4. CLR 5. ACD 6. ATA
# 7. MSE 8. Age 9. Pupil size 10. WTW 11. CCT 12. ACW
y = dataset['Vault_1M']
X = dataset.drop(['Vault_1M'], axis=1)

# Split the dataset into training and test data, if necessary.
# For example, we can split the data 8:2 as a simple validation test.
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

# In our study, the training (B&VIIT Eye Center, n=1455) and test (Kitasato University, n=290)
# datasets were already defined, so this split was not necessary for our analysis.

# An optimal parameter search could be performed in this section.
parameters = {
    'bootstrap': True,
    'min_samples_leaf': 3,
    'n_estimators': 500,
    'criterion': 'absolute_error',  # 'mae' in the original; renamed in scikit-learn 1.0
    'min_samples_split': 10,
    'max_features': 'sqrt',
    'max_depth': 6,
    'max_leaf_nodes': None,
}

RF_model = RandomForestRegressor(**parameters)
RF_model.fit(train_X, train_y)
RF_predictions = RF_model.predict(test_X)
importance = RF_model.feature_importances_
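The script ends by reading `feature_importances_`, which is presumably how the "sorted by importance" feature list was obtained. A minimal sketch of turning those importances into a ranked list is shown below. It runs on synthetic data, since the clinical CSV is not reproduced here, and the column names, coefficients, and hyperparameters are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (hypothetical): the target depends strongly on feature_a,
# weakly on feature_b, and not at all on noise
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y_demo = 3.0 * X_demo["feature_a"] + 0.5 * X_demo["feature_b"] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=100, min_samples_leaf=3, random_state=0)
model.fit(X_demo, y_demo)

# Pair each importance with its column name and sort, most important first
ranking = pd.Series(model.feature_importances_, index=X_demo.columns).sort_values(ascending=False)
print(ranking)
```

Impurity-based importances sum to 1 across features, so the values can be read as relative shares rather than absolute effect sizes.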

Diabate's dataset

kaggle.com

zip

Updated Aug 9, 2025


Wasiq Ali (2025). Diabate's dataset [Dataset]. https://www.kaggle.com/datasets/wasiqaliyasir/diabates-dataset

Available download formats: zip (4623 bytes)

Dataset updated

Aug 9, 2025

Authors

Wasiq Ali

License

Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

Diabetes Health Indicators Dataset

1. Overview

This dataset contains 200 patient records with 8 health indicators used to predict diabetes risk. The dataset is ideal for binary classification tasks in healthcare analytics, diabetes prediction, and medical research. With balanced gender distribution (50% male, 50% female) and diverse age groups (21-79 years), it provides a solid foundation for machine learning experiments.

2. Dataset Description

Key Characteristics:

Records: 200
Features: 8 health indicators + target variable
Target Variable: Diabetes diagnosis (1 = positive, 0 = negative)
Data Type: Structured numerical/categorical data
Missing Values: None
Duplicates: None

Feature Details:

Feature Name	Description	Type	Range/Values
PatientID	Unique patient identifier	Integer	1-200
Age	Patient's age in years	Integer	21-79
Gender	Patient's gender	Categorical	Male/Female
BMI	Body Mass Index	Float	18.98-49.35
BloodPressure	Diastolic blood pressure (mm Hg)	Integer	71-178
Insulin	Insulin level (mu U/ml)	Integer	15-273
Glucose	Plasma glucose concentration	Integer	70-198
DiabetesPedigreeFunction	Genetic diabetes likelihood score	Float	0.148-2.467
Outcome	Diabetes diagnosis (1/0)	Binary	0 or 1

3. Statistical Summary

Numerical Features:

Feature	Count	Mean	Std Dev	Min	25%	50%	75%	Max
Age	200	48.27	16.15	21	36	50	60	79
BMI	200	31.99	8.21	18.98	26.06	30.93	36.33	49.35
BloodPressure	200	122.19	25.61	71	102	124	140	178
Insulin	200	137.65	70.57	15	88.25	131.5	187	273
Glucose	200	133.30	33.59	70	112	130	154	198
DiabetesPedigree	200	1.04	0.55	0.148	0.64	0.92	1.32	2.46

Categorical Features:

Feature	Value	Count	Percentage
Gender	Male	100	50.0%
	Female	100	50.0%
Outcome	0 (No Diabetes)	103	51.5%
	1 (Diabetes)	97	48.5%

4. Key Insights & Patterns

🔍 Diabetes Correlations:

Age Factor:
- Patients >50 years: 62% diabetes prevalence
- Patients <30 years: 28% diabetes prevalence
BMI Thresholds:
- BMI >35: 74% diabetes rate
- BMI <25: Only 22% diabetes rate
Critical Health Markers:
- Glucose >140: 78% diabetes risk
- BloodPressure >140: 67% diabetes risk
- Insulin >200: 63% diabetes risk
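Threshold statistics like the ones listed above can be computed with a boolean mask and a mean over the Outcome column. The sketch below uses a tiny invented table rather than the real file, so the printed rates are illustrative only. The helper name `prevalence` is an assumption.

```python
import pandas as pd

# Tiny invented sample with the same column names as the dataset
demo = pd.DataFrame({
    "Glucose": [150, 160, 120, 100, 145, 90],
    "BMI": [36.0, 24.0, 38.5, 22.1, 30.0, 41.2],
    "Outcome": [1, 1, 0, 0, 0, 1],
})

def prevalence(df, mask):
    """Share of rows with Outcome == 1 among the rows selected by mask."""
    return df.loc[mask, "Outcome"].mean()

glucose_rate = prevalence(demo, demo["Glucose"] > 140)
bmi_rate = prevalence(demo, demo["BMI"] > 35)
print(f"Glucose > 140: {glucose_rate:.0%} positive")
print(f"BMI > 35: {bmi_rate:.0%} positive")
```

With only 200 records in the full dataset, each threshold group is small, so rates computed this way carry wide uncertainty.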

📊 Gender-based Differences:

Males: Higher average Glucose (135 vs 131) and BMI (32.5 vs 31.4)
Females: Higher DiabetesPedigreeFunction (1.08 vs 1.00)

⚠️ Potential Health Alerts:

15 patients with BMI >45 (severe obesity)
22 patients with Glucose >180 (critical hyperglycemia)
18 patients with BloodPressure >160 (stage 2 hypertension)

5. Suggested Applications

🩺 Medical Use Cases:

Diabetes risk prediction models
Early intervention screening tools
Lifestyle intervention planning
Healthcare resource allocation

🤖 Machine Learning Tasks:

Binary classification (diabetes prediction)
Feature importance analysis
Patient risk stratification
Clustering of high-risk groups

📈 Research Opportunities:

Interaction effects between BMI and genetic factors
Age-specific prevention strategies
Gender-based risk factor analysis
Threshold optimization for early detection

6. Dataset Limitations

Sample Size: Limited to 200 records
Geographic Diversity: No location metadata
Time Factors: No longitudinal data
Additional Health Metrics: Lacks diet/exercise data
Ethnicity Data: Missing demographic diversity info

7. Example Usage

Python Classification Example:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

# Load dataset
df = pd.read_csv("diabetes_dataset.csv")

# Preprocessing
X = df[['Age','BMI','Glucose','BloodPressure','Insulin','DiabetesPedigreeFunction']]
y = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train model
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Evaluate
print(f"Accuracy: {model.score(X_test, y_test):.2f}")
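With only 200 records, the accuracy from a single unseeded 80/20 split can vary noticeably between runs. One option is stratified cross-validation with fixed seeds, sketched below. The data here is a synthetic stand-in, because the CSV is not reproduced on this page. The generated values, the rule producing `Outcome`, and the choice of five folds are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with 200 rows and a few of the dataset's column names
rng = np.random.default_rng(0)
n = 200
df_demo = pd.DataFrame({
    "Age": rng.integers(21, 80, n),
    "BMI": rng.uniform(19, 49, n),
    "Glucose": rng.integers(70, 199, n),
})
# Hypothetical outcome loosely tied to Glucose, so the model has something to learn
df_demo["Outcome"] = (df_demo["Glucose"] + rng.normal(0, 20, n) > 135).astype(int)

features = df_demo[["Age", "BMI", "Glucose"]]
target = df_demo["Outcome"]

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), features, target, cv=cv)
print(f"Accuracy: {scores.mean():.2f} +/- {scores.std():.2f} over {len(scores)} folds")
```

Reporting the mean and spread across folds gives a steadier picture than one split, and the fixed `random_state` values make the numbers repeatable.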

Singapore Sign Language (SgSL)
kaggle.com
zip
Updated Oct 29, 2025
Sritam (2025). Singapore Sign Language (SgSL) [Dataset]. https://www.kaggle.com/datasets/sritampatnaik/singapore-sign-language-sgsl-6-classes-x-30/code
Available download formats: zip (356036922 bytes)
Dataset updated
Oct 29, 2025
Authors
Sritam
Description
About Dataset

This dataset contains short video clips organized into six classes of Singapore landmarks/stations: Bedok, City Hall, Clementi, Esplanade, MBS, and Orchard. Each class has 30 .mp4 clips, for a total of 180 videos. It’s designed for tasks like video classification, keypoint extraction, and sequence modeling.

Context

The clips are curated to be consistent per class and are suitable for computer vision pipelines that work with raw videos or frame-level features. The dataset pairs well with downstream processing such as keyframe extraction, pose/hand landmark extraction, and LSTM/Transformer sequence modeling.

Contents

6 classes (folders): Bedok, City Hall, Clementi, Esplanade, MBS, Orchard

30 videos per class (.mp4)

Total videos: 180

Naming pattern: <ClassName>_<Index>.mp4 where Index is 0–29
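Because class names such as "City Hall" contain a space and every file name ends in `_<Index>`, splitting on the last underscore is a safe way to recover both parts. The sketch below is a minimal illustration of that idea. The helper name `parse_clip_name` is an assumption.

```python
from pathlib import Path

def parse_clip_name(path):
    """Split '<ClassName>_<Index>.mp4' into (class_name, index)."""
    stem = Path(path).stem                    # e.g. 'City Hall_29'
    class_name, index = stem.rsplit("_", 1)   # split on the last underscore only
    return class_name, int(index)

print(parse_clip_name("MP_Videos_All/City Hall/City Hall_29.mp4"))
print(parse_clip_name("MP_Videos_All/MBS/MBS_0.mp4"))
```

Using `rsplit` with a limit of 1 keeps the function correct even if a class name were to contain an underscore itself.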

Directory Structure

MP_Videos_All/
├── Bedok/
│   ├── Bedok_0.mp4
│   ├── ...
│   └── Bedok_29.mp4
├── City Hall/
│   ├── City Hall_0.mp4
│   ├── ...
│   └── City Hall_29.mp4
├── Clementi/
│   ├── Clementi_0.mp4
│   ├── ...
│   └── Clementi_29.mp4
├── Esplanade/
│   ├── Esplanade_0.mp4
│   ├── ...
│   └── Esplanade_29.mp4
├── MBS/
│   ├── MBS_0.mp4
│   ├── ...
│   └── MBS_29.mp4
└── Orchard/
    ├── Orchard_0.mp4
    ├── ...
    └── Orchard_29.mp4

Labels

Bedok

City Hall

Clementi

Esplanade

MBS

Orchard

Tip: If you need numeric labels, sort class names alphabetically and map to indices:

from pathlib import Path

root = Path("MP_Videos_All")
classes = sorted([p.name for p in root.iterdir() if p.is_dir()])
label_to_index = {label: i for i, label in enumerate(classes)}
label_to_index
# {'Bedok': 0, 'City Hall': 1, 'Clementi': 2, 'Esplanade': 3, 'MBS': 4, 'Orchard': 5}

Quick Start

List Videos and Parse Labels

from pathlib import Path

root = Path("MP_Videos_All")
video_paths = sorted(root.rglob("*.mp4"))
print(len(video_paths), "videos")
print(video_paths[0])
print(video_paths[0].parent.name)  # class label

Read Frames with OpenCV

import cv2
from pathlib import Path

def read_video_frames(path, max_frames=None):
    cap = cv2.VideoCapture(str(path))
    frames, count = [], 0
    ok, frame = cap.read()
    while ok:
        frames.append(frame)  # BGR format
        count += 1
        if max_frames and count >= max_frames:
            break
        ok, frame = cap.read()
    cap.release()
    return frames

sample_video = next(Path("MP_Videos_All").rglob("*.mp4"))
frames = read_video_frames(sample_video, max_frames=64)
len(frames)

Suggested Train/Val/Test Split

from pathlib import Path
from sklearn.model_selection import train_test_split

root = Path("MP_Videos_All")
all_videos = sorted(root.rglob("*.mp4"))
labels = [p.parent.name for p in all_videos]

X_train, X_tmp, y_train, y_tmp = train_test_split(
    all_videos, labels, test_size=0.3, stratify=labels, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

len(X_train), len(X_val), len(X_test)
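To see what the two-stage split yields without downloading the videos, the same calls can be run on a list of path strings built from the documented naming pattern. The path list below is synthetic (an assumption standing in for `root.rglob`), while the split parameters are those shown above.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

classes = ["Bedok", "City Hall", "Clementi", "Esplanade", "MBS", "Orchard"]

# Synthetic paths following <ClassName>_<Index>.mp4, 30 per class
videos = [f"MP_Videos_All/{c}/{c}_{i}.mp4" for c in classes for i in range(30)]
labels = [c for c in classes for _ in range(30)]

# 70% train, then the remaining 30% is halved into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    videos, labels, test_size=0.3, stratify=labels, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

sizes = (len(X_train), len(X_val), len(X_test))
train_counts = Counter(y_train)
print("train/val/test sizes:", sizes)
print("training clips per class:", dict(train_counts))
```

The 180 clips split into 126 for training and 27 each for validation and test, and stratification keeps the training set balanced across the six classes.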

Intended Uses

Video classification and benchmarking on small, balanced classes

Feature extraction (e.g., keyframes, optical flow, pose/hand landmarks)

Sequence modeling with RNN/LSTM/GRU or Transformers

Prototyping real-time recognition pipelines

Limitations

Small dataset size (30 clips per class) — best for prototyping, teaching, or transfer learning

Video resolutions and durations may vary

No official train/val/test split provided; use stratified splitting for reproducibility

Related Resources

If you plan to perform landmark-based or frame-based modeling, consider creating or using companion assets like:
- Annotated variants (e.g., hand-only landmarks)
- Keyframe .npy sequences
- Precomputed datasets for 3-class or 6-class experiments

These resources are commonly produced downstream from this dataset and can be shared as separate Kaggle datasets for convenience.

Acknowledgements

Dataset curated and organized by the authors.

Please credit the dataset in any published work that uses it (see Citation).

Citation

If you use this dataset, please cite:

@dataset{mp_videos_all_2025,
  title  = {MP_Videos_All},
  author = {Authors},
  year   = {2025},
  url    = {Kaggle dataset URL}
}

License

This dataset is released under CC BY-NC 4.0 (Attribution-NonCommercial). You may use and adapt it for non-commercial purposes with appropriate attribution. For commercial use, please contact the authors.
