29 datasets found
  1. ASDiv-train-test

    • huggingface.co
    Updated Nov 3, 2025
    Cite
    Jeong Seong Cheol (2025). ASDiv-train-test [Dataset]. https://huggingface.co/datasets/lejelly/ASDiv-train-test
    Authors
    Jeong Seong Cheol
    Description

    ASDiv (train/test 1:9)

    This dataset is derived from EleutherAI/asdiv by splitting the original validation split into train and test with a ratio of 1:9.

      Source
    

    Original dataset: EleutherAI/asdiv
    Link: https://huggingface.co/datasets/EleutherAI/asdiv

      License
    

    Inherits the original dataset's license (CC-BY-NC-4.0) unless otherwise noted in this repository.

      Splitting Details
    

    Method: datasets.Dataset.train_test_split
    Source split: validation
    Test… See the full description on the dataset page: https://huggingface.co/datasets/lejelly/ASDiv-train-test.
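    For reference, a minimal sketch of how a 1:9 split of the source validation split could be reproduced with datasets.Dataset.train_test_split; the seed is an assumption for illustration, not stated on the card:

    from datasets import load_dataset

    # Load the original ASDiv validation split (the source split named above).
    ds = load_dataset("EleutherAI/asdiv", split="validation")

    # 10% train / 90% test, matching the stated 1:9 ratio; the seed is assumed.
    splits = ds.train_test_split(test_size=0.9, seed=42)
    train_ds, test_ds = splits["train"], splits["test"]
    print(len(train_ds), len(test_ds))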

  2. One Classifier Ignores a Feature

    • data.niaid.nih.gov
    Updated Apr 29, 2022
    Cite
    Maier, Karl (2022). One Classifier Ignores a Feature [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6502642
    Authors
    Maier, Karl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data sets are used in a controlled experiment in which two classifiers are compared. train_a.csv and explain.csv are slices of the original data set. train_b.csv contains the same instances as train_a.csv, but with feature x1 set to 0 to make it unusable to classifier B.

    The original data set was created and split using this Python code:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, n_informative=2,
                               n_clusters_per_class=1, class_sep=0.75, random_state=0)
    X *= 100

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
    lm = LogisticRegression()
    lm.fit(X_train, y_train)
    clf_a = lm

    clf_b = LogisticRegression()
    X2 = X.copy()
    X2[:, 0] = 0
    X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, test_size=0.5, random_state=0)
    clf_b.fit(X2_train, y2_train)

    X_explain = X_test
    y_explain = y_test

  3. Diabetes_Dataset_1.1

    • kaggle.com
    zip
    Updated Nov 2, 2023
    Cite
    KIRANMAYI G 777 (2023). Diabetes_Dataset_1.1 [Dataset]. https://www.kaggle.com/datasets/kiranmayig777/diabetes-dataset-1-1/code
    Available download formats: zip (779755 bytes)
    Authors
    KIRANMAYI G 777
    Description

    import pandas as pd
    import numpy as np

    PERFORMING EDA

    data.head()
    data.info()

    attributes_data = data.iloc[:, 1:]
    attributes_data

    attributes_data.describe()
    attributes_data.corr()

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Calculate correlation matrix
    correlation_matrix = attributes_data.corr()
    plt.figure(figsize=(18, 10))

    # Create a heatmap
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
    plt.show()

    CHECKING IF DATASET IS LINEAR OR NON-LINEAR

    # Calculate correlations between target and predictor columns
    correlations = data.corr()['Diabetes_binary'].drop('Diabetes_binary')

    # Create a bar chart
    plt.figure(figsize=(10, 6))
    correlations.plot(kind='bar')
    plt.xlabel('Predictor Columns')
    plt.ylabel('Correlation values')
    plt.title('Correlation between Diabetes_binary and Predictors')
    plt.show()

    CHECKING FOR NULL AND MISSING VALUES, CLEANING THEM

    # Count the number of null values in each column
    print(data.isnull().sum())

    # Check for missing values in all columns
    print(data.isna().sum())

    LASSO

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV, KFold

    X = data.iloc[:, 1:]
    y = data.iloc[:, 0]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # GridSearchCV is used to find the optimal combination of hyperparameters for a given model,
    # so in the end we can select the best parameters from the listed hyperparameters.
    parameters = {"alpha": np.arange(0.00001, 10, 500)}
    kfold = KFold(n_splits=10, shuffle=True, random_state=42)
    lassoReg = Lasso()
    lasso_cv = GridSearchCV(lassoReg, param_grid=parameters, cv=kfold)
    lasso_cv.fit(X, y)
    print("Best Params {}".format(lasso_cv.best_params_))

    column_names = list(data)
    column_names = column_names[1:]
    column_names

    lassoModel = Lasso(alpha=0.00001)
    lassoModel.fit(X_train, y_train)
    lasso_coeff = np.abs(lassoModel.coef_)  # making all coefficients positive
    plt.bar(column_names, lasso_coeff, color='orange')
    plt.xticks(rotation=90)
    plt.grid()
    plt.title("Feature Selection Based on Lasso")
    plt.xlabel("Features")
    plt.ylabel("Importance")
    plt.ylim(0, 0.16)
    plt.show()

    RFE

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    from sklearn.feature_selection import RFECV
    from sklearn.tree import DecisionTreeClassifier
    model = DecisionTreeClassifier()
    rfecv = RFECV(estimator=model, step=1, cv=20, scoring="accuracy")
    rfecv = rfecv.fit(X_train, y_train)

    num_features_selected = len(rfecv.ranking_)

    # Cross-validation scores
    cv_scores = rfecv.ranking_

    # Plotting the number of features vs. cross-validation score
    plt.figure(figsize=(10, 6))
    plt.xlabel("Number of features selected")
    plt.ylabel("Score (accuracy)")
    plt.plot(range(1, num_features_selected + 1), cv_scores, marker='o', color='r')
    plt.xticks(range(1, num_features_selected + 1))  # Set x-ticks to integers
    plt.grid()
    plt.title("RFECV: Number of Features vs. Score (accuracy)")
    plt.show()

    print("The optimal number of features:", rfecv.n_features_)
    print("Best features:", X_train.columns[rfecv.support_])

    PCA

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = data.drop(["Diabetes_binary"], axis=1)
    y = data["Diabetes_binary"]

    df1 = pd.DataFrame(data=data, columns=data.columns)
    print(df1)

    scaling = StandardScaler()
    scaling.fit(df1)
    Scaled_data = scaling.transform(df1)
    principal = PCA(n_components=3)
    principal.fit(Scaled_data)
    x = principal.transform(Scaled_data)
    print(x.shape)

    principal.components_

    plt.figure(figsize=(10, 10))
    plt.scatter(x[:, 0], x[:, 1], c=data['Diabetes_binary'], cmap='plasma')
    plt.xlabel('pc1')
    plt.ylabel('pc2')

    print(principal.explained_variance_ratio_)

    T-SNE

    from sklearn.manifold import TSNE
    from numpy import reshape
    import seaborn as sns

    tsne = TSNE(n_components=3, verbose=1, random_state=42)
    z = tsne.fit_transform(X)

    df = pd.DataFrame()
    df["y"] = y
    df["comp-1"] = z[:, 0]
    df["comp-2"] = z[:, 1]
    df["comp-3"] = z[:, 2]
    sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
                    palette=sns.color_palette("husl", 2),
                    data=df).set(title="Diabetes data T-SNE projection")

  4. Llama 3.1 8B Correct Labels

    • kaggle.com
    zip
    Updated Aug 26, 2025
    Cite
    Jatin Mehra_666 (2025). Llama 3.1 8B Correct Labels [Dataset]. https://www.kaggle.com/datasets/jatinmehra666/llama-3-1-8b-correct-labels
    Available download formats: zip (11853454078 bytes)
    Authors
    Jatin Mehra_666
    Description

    Training code:

    ```Python
    from sklearn.preprocessing import LabelEncoder
    from sklearn.model_selection import train_test_split
    import os
    import pandas as pd
    import numpy as np

    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
    TEMP_DIR = "tmp"
    os.makedirs(TEMP_DIR, exist_ok=True)
    train = pd.read_csv('input/map-charting-student-math-misunderstandings/train.csv')

    # Fill missing Misconception values with 'NA'
    train.Misconception = train.Misconception.fillna('NA')

    # Create a combined target label (Category:Misconception)
    train['target'] = train.Category + ":" + train.Misconception

    # Encode target labels to numerical format
    le = LabelEncoder()
    train['label'] = le.fit_transform(train['target'])
    n_classes = len(le.classes_)  # Number of unique target classes
    print(f"Train shape: {train.shape} with {n_classes} target classes")
    print("Train head:")
    train.head()

    idx = train.apply(lambda row: row.Category.split('_')[0], axis=1) == 'True'
    correct = train.loc[idx].copy()
    correct['c'] = correct.groupby(['QuestionId', 'MC_Answer']).MC_Answer.transform('count')
    correct = correct.sort_values('c', ascending=False)
    correct = correct.drop_duplicates(['QuestionId'])
    correct = correct[['QuestionId', 'MC_Answer']]
    correct['is_correct'] = 1  # Mark these as correct answers

    # Merge 'is_correct' flag into the main training DataFrame
    train = train.merge(correct, on=['QuestionId', 'MC_Answer'], how='left')
    train.is_correct = train.is_correct.fillna(0)

    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch

    Model_Name = "unsloth/Meta-Llama-3.1-8B-Instruct"

    model = AutoModelForSequenceClassification.from_pretrained(
        Model_Name,
        num_labels=n_classes,
        torch_dtype=torch.bfloat16,
        device_map="balanced",
        cache_dir=TEMP_DIR,
    )

    tokenizer = AutoTokenizer.from_pretrained(Model_Name, cache_dir=TEMP_DIR)

    def format_input(row):
        x = "Yes"
        if not row['is_correct']:
            x = "No"
        return (
            f"Question: {row['QuestionText']} "
            f"Answer: {row['MC_Answer']} "
            f"Correct? {x} "
            f"Student Explanation: {row['StudentExplanation']}"
        )

    train['text'] = train.apply(format_input, axis=1)
    print("Example prompt for our LLM:")
    print()
    print(train.text.values[0])

    from datasets import Dataset

    # Split data into training and validation sets
    train_df, val_df = train_test_split(train, test_size=0.2, random_state=42)

    # Convert to Hugging Face Dataset
    COLS = ['text', 'label']

    # Create clean DataFrame with the full training data
    train_df_clean = train[COLS].copy()  # Use 'train' instead of 'train_df'

    # Ensure labels are proper integers
    train_df_clean['label'] = train_df_clean['label'].astype(np.int64)

    # Reset index to ensure clean DataFrame structure
    train_df_clean = train_df_clean.reset_index(drop=True)

    # Create dataset with the full training data
    train_ds = Dataset.from_pandas(train_df_clean, preserve_index=False)

    def tokenize(batch):
        """Tokenizes a batch of text inputs."""
        return tokenizer(batch["text"], truncation=True, max_length=256)

    # Apply tokenization to the full dataset
    train_ds = train_ds.map(tokenize, batched=True, remove_columns=['text'])

    # Add a new padding token
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

    # Resize the model's token embeddings to match the new tokenizer
    model.resize_token_embeddings(len(tokenizer))

    # Set the pad token id in the model's config
    model.config.pad_token_id = tokenizer.pad_token_id

    # 2. Clear HF cache after loading
    import os
    from huggingface_hub import scan_cache_dir

    # Then clear cache to free ~16GB
    cache_info = scan_cache_dir()
    cache_info.delete_revisions(*[repo.revisions for repo in cache_info.repos]).execute()

    # --- Training Arguments ---
    from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
    import tempfile
    import shutil

    # Ensure temp directories exist
    os.makedirs(f"{TEMP_DIR}/training_output/", exist_ok=True)
    os.makedirs(f"{TEMP_DIR}/logs/", exist_ok=True)

    # --- Training Arguments (FIXED) ---
    training_args = TrainingArguments(
        output_dir=f"{TEMP_DIR}/training_output/",
        do_train=True,
        do_eval=False,
        save_strategy="no",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=5e-5,
        logging_dir=f"{TEMP_DIR}/logs/",
        logging_steps=500,
        bf16=True,
        fp16=False,
        report_to="none",
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        dataloader_pin_memory=False,
        gradient_checkpointing=True,
    )

    # --- Custom Metric Computation (MAP@3) ---
    def compute_map3(eval_pred):
        """
        Computes Mean Average Precision at 3 (MAP@3) for evaluation.
        """
        logits, labels = eval_pred
        probs = torch.nn.functional.softmax(torch.tensor(logits), dim=-1).numpy()

        # Get top 3 predicted class indi...
    ```
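    The compute_map3 function above is cut off. As a point of reference, here is a minimal sketch of how MAP@3 is commonly computed for top-3 predictions; this is an assumption for illustration, not the author's exact code:

    import numpy as np

    def map_at_3(logits: np.ndarray, labels: np.ndarray) -> float:
        """Mean Average Precision at 3: a hit at 1-based rank k scores 1/k, else 0."""
        top3 = np.argsort(-logits, axis=1)[:, :3]   # top-3 class indices per row
        hits = top3 == labels.reshape(-1, 1)        # (n, 3) boolean hit matrix
        ranks = hits.argmax(axis=1) + 1             # rank of the true label, if hit
        return float(np.where(hits.any(axis=1), 1.0 / ranks, 0.0).mean())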
    
  5. NoteChat

    • huggingface.co
    Updated Sep 12, 2024
    Cite
    Daniel Montecino (2024). NoteChat [Dataset]. https://huggingface.co/datasets/DanielMontecino/NoteChat
    Authors
    Daniel Montecino
    Description

    This dataset is just a split of the original akemiH/NoteChat.

    • 70% for train
    • 15% for validation
    • 15% for test

    Below is the code snippet used to split the dataset.

    from datasets import DatasetDict
    from datasets import load_dataset

    DATASET_SRC_NAME = "akemiH/NoteChat"
    DATASET_DST_NAME = "DanielMontecino/NoteChat"

    dataset = load_dataset(DATASET_SRC_NAME, split="train")

    # 70% train, 30% test + validation
    train_testvalid = dataset.train_test_split(test_size=0.3, seed=2024)

    Split the 30%… See the full description on the dataset page: https://huggingface.co/datasets/DanielMontecino/NoteChat.
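    The description is truncated here; presumably the held-out 30% is split in half into validation and test along these lines (a sketch, not necessarily the exact code from the card):

    # Split the held-out 30% in half: 15% validation, 15% test (seed assumed to match above).
    test_valid = train_testvalid["test"].train_test_split(test_size=0.5, seed=2024)
    dataset_dict = DatasetDict({
        "train": train_testvalid["train"],
        "validation": test_valid["train"],
        "test": test_valid["test"],
    })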

  6. movielens-20m-dataset_train_test

    • kaggle.com
    zip
    Updated May 7, 2023
    Cite
    Sas Pav (2023). movielens-20m-dataset_train_test [Dataset]. https://www.kaggle.com/saspav/movielens-20m-dataset-train-test
    Available download formats: zip (123408611 bytes)
    Authors
    Sas Pav
    Description

    import numpy as np
    import pandas as pd
    from tqdm import tqdm_notebook

    def train_test_split(X, train_size=0.7, user_col='userId', item_col='movieId',
                         rating_col='rating', time_col='timestamp'):
        X.sort_values(by=[time_col], inplace=True)
        user_ids = X[user_col].unique()
        X_train_data = []
        X_test_data = []
        for user_id in tqdm_notebook(user_ids):
            cur_user = X[X[user_col] == user_id]
            idx = int(cur_user.shape[0] * train_size)
            X_train_data.append(cur_user[[user_col, item_col, rating_col]].iloc[:idx, :].values)
            X_test_data.append(cur_user[[user_col, item_col, rating_col]].iloc[idx:, :].values)
        X_train = pd.DataFrame(np.vstack(X_train_data), columns=[user_col, item_col, rating_col])
        X_test = pd.DataFrame(np.vstack(X_test_data), columns=[user_col, item_col, rating_col])
        return X_train, X_test

    # careful: this is a very slow process

    X_train, X_test = train_test_split(data)

  7. earnings22_robust_split

    • huggingface.co
    Updated Nov 6, 2023
    Cite
    Sanchit Gandhi (2023). earnings22_robust_split [Dataset]. https://huggingface.co/datasets/sanchit-gandhi/earnings22_robust_split
    Authors
    Sanchit Gandhi
    Description

    from datasets import load_dataset, DatasetDict

    ds = load_dataset("anton-l/earnings22_robust", split="test")
    print(ds)
    print(" ", "Split to ==>", " ")

    # split train 90% / dev 5% / test 5%
    # split twice and combine
    train_devtest = ds.train_test_split(shuffle=True, seed=1, test_size=0.1)
    dev_test = train_devtest['test'].train_test_split(shuffle=True, seed=1, test_size=0.5)
    ds_train_dev_test = DatasetDict({'train': train_devtest['train'], 'validation': dev_test['train'], 'test':… See the full description on the dataset page: https://huggingface.co/datasets/sanchit-gandhi/earnings22_robust_split.
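    The snippet above is cut off; the DatasetDict is presumably completed along these lines (a sketch, not necessarily the author's verbatim code):

    ds_train_dev_test = DatasetDict({
        'train': train_devtest['train'],
        'validation': dev_test['train'],
        'test': dev_test['test'],
    })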

  8. Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning...

    • narcis.nl
    • data.mendeley.com
    Updated Jan 11, 2021
    + more versions
    Cite
    Yoo, T (via Mendeley Data) (2021). Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics" [Dataset]. http://doi.org/10.17632/ffn745r57z.2
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Yoo, T (via Mendeley Data)
    Description

    Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics. Authors: Kazutaka Kamiya, MD, PhD, Ik Hee Ryu, MD, MS, Tae Keun Yoo, MD, Jung Sub Kim, MD, In Sik Lee, MD, PhD, Jin Kook Kim, MD, Wakako Ando, CO, Nobuyuki Shoji, MD, PhD, Tomofusa Yamauchi, MD, PhD, Hitoshi Tabuchi, MD, PhD.

    We hypothesize that machine learning of preoperative biometric data obtained by AS-OCT may be clinically beneficial for predicting the actual ICL vault. Therefore, we built a machine learning model using Random Forest to predict the ICL vault after surgery.

    This multicenter study comprised one thousand seven hundred forty-five eyes of 1745 consecutive patients (656 men and 1089 women), who underwent EVO ICL implantation (V4c and V5 Visian ICL with KS-AquaPORT) for the correction of moderate to high myopia and myopic astigmatism, and who completed at least a 1-month follow-up, at Kitasato University Hospital (Kanagawa, Japan), or at B&VIIT Eye Center (Seoul, Korea).

    This data file (RFR_model(feature=12).mat) is the final trained random forest model for MATLAB 2020a.

    Python version:

    from sklearn.model_selection import train_test_split
    import pandas as pd
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import RandomForestRegressor

    # Connect data in your Google Drive
    from google.colab import auth
    auth.authenticate_user()
    from google.colab import drive
    drive.mount('/content/gdrive')

    # Change the path for the custom data.
    # In this case, we used ICL vault prediction using preoperative measurements.
    dataset = pd.read_csv('gdrive/My Drive/ICL/data_icl.csv')
    dataset.head()

    # Optimal features (sorted by importance):
    # 1. ICL size  2. ICL power  3. LV  4. CLR  5. ACD  6. ATA
    # 7. MSE  8. Age  9. Pupil size  10. WTW  11. CCT  12. ACW

    y = dataset['Vault_1M']
    X = dataset.drop(['Vault_1M'], axis=1)

    # Split the dataset into train and test data, if necessary.
    # For example, we can split the data 8:2 as a simple validation test.
    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

    In our study, we already defined the training (B&VIIT Eye Center, n=1455) and test (Kitasato University, n=290) datasets, so this code was not necessary to perform our analysis.

    # Optimal parameter search could be performed in this section.
    parameters = {'bootstrap': True, 'min_samples_leaf': 3, 'n_estimators': 500,
                  'criterion': 'mae', 'min_samples_split': 10, 'max_features': 'sqrt',
                  'max_depth': 6, 'max_leaf_nodes': None}

    RF_model = RandomForestRegressor(**parameters)
    RF_model.fit(train_X, train_y)
    RF_predictions = RF_model.predict(test_X)
    importance = RF_model.feature_importances_

  9. AIME_2024-train-test

    • huggingface.co
    Updated Nov 3, 2025
    Cite
    Jeong Seong Cheol (2025). AIME_2024-train-test [Dataset]. https://huggingface.co/datasets/lejelly/AIME_2024-train-test
    Authors
    Jeong Seong Cheol
    Description

    AIME 2024 (train/test 1:9)

    This dataset is derived from Maxwell-Jia/AIME_2024 by splitting the original single train split into train and test with a ratio of 1:9.

      Source
    

    Original dataset: Maxwell-Jia/AIME_2024
    Link: https://huggingface.co/datasets/Maxwell-Jia/AIME_2024

      License
    

    Inherits the original dataset's license (MIT) unless otherwise noted in this repository.

      Splitting Details
    

    Method: datasets.Dataset.train_test_split
    Test size: 90.0%… See the full description on the dataset page: https://huggingface.co/datasets/lejelly/AIME_2024-train-test.
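    As a usage note, the derived splits can presumably be loaded directly from the Hub like this (split names assumed from the card, not verified):

    from datasets import load_dataset

    # Hypothetical usage; assumes the repository exposes 'train' and 'test' splits.
    ds = load_dataset("lejelly/AIME_2024-train-test")
    print(ds)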

  10. Surrogate flood model comparison - Datasets and python code

    • figshare.unimelb.edu.au
    bin
    Updated Jan 19, 2024
    Cite
    Niels Fraehr (2024). Surrogate flood model comparison - Datasets and python code [Dataset]. http://doi.org/10.26188/24312658.v1
    Available download formats: bin
    Dataset provided by
    The University of Melbourne
    Authors
    Niels Fraehr
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data used for the publication "Assessment of surrogate models for flood inundation: The physics-guided LSG model vs. state-of-the-art machine learning models". Five surrogate models for flood inundation are used to emulate the results of high-resolution hydrodynamic models. The surrogate models are compared based on accuracy and computational speed for three distinct case studies, namely Carlisle (United Kingdom), the Chowilla floodplain (Australia), and the Burnett River (Australia).

    The dataset is structured in 5 files - "Carlisle", "Chowilla", "BurnettRV", "Comparison_results", and "Python_data". As a minimum, to run the models the "Python_data" file and one of "Carlisle", "Chowilla", or "BurnettRV" are needed. We suggest using the "Carlisle" case study for initial testing given its small size and small data requirement.

    "Carlisle", "Chowilla", and "BurnettRV" files
    These files contain hydrodynamic modelling data for training and validation for each individual case study, as well as specific Python scripts for training and running the surrogate models in each case study. There are only small differences between the folders, depending on the hydrodynamic model to be emulated and the input boundary conditions (input features). Each case study file has the following folders:

    • Geometry_data: DEM files, .npz files containing the high-fidelity model grid (XYZ coordinates) and areas (the same data is available for the low-fidelity model used in the LSG model), and .shp files indicating the location of boundaries and main flow paths (mainly used in the LSTM-SRR model).
    • XXX_modeldata: Folder for storing trained model data for each XXX surrogate model. For example, GP_EOF_modeldata contains files used to store the trained GP-EOF model.
    • HD_model_data: High-fidelity (and low-fidelity) simulation results for all flood events of that case study. This folder also contains all boundary input conditions.
    • HF_EOF_analysis: Storage of data used in the EOF analysis. EOF analysis is applied for the LSG, GP-EOF, and LSTM-EOF surrogate models.
    • Results_data: Storage of the results of running the evaluation of the surrogate models.
    • Train_test_split_data: The train-test-validation data split is the same for all surrogate models. The specific split for each cross-validation fold is stored in this folder.

    And the following Python files:

    • YYY_event_summary, YYY_Extrap_event_summary: Files containing an overview of all events, and which events are connected between the low- and high-fidelity models, for each YYY case study.
    • EOF_analysis_HFdata_preprocessing, EOF_analysis_HFdata: Preprocessing before EOF analysis and the EOF analysis of the high-fidelity data. This is used for the LSG, GP-EOF, and LSTM-EOF surrogate models.
    • Evaluation, Evaluation_extrap: Scripts for evaluating the surrogate models for that case study and saving the results for each cross-validation fold.
    • train_test_split: Script for splitting the flood datasets for each cross-validation fold, so all surrogate models train on the same data.
    • XXX_training: Script for training each XXX surrogate model.
    • XXX_preprocessing: Some surrogate models rely on information that needs to be generated before training; this is performed using these scripts.

    "Comparison_results" file
    Files used for comparing the surrogate models and generating the figures in the paper "Assessment of surrogate models for flood inundation: The physics-guided LSG model vs. state-of-the-art machine learning models". Figures are also included.

    "Python_data" file
    Folder containing Python scripts with utility functions for setting up, training, and running the surrogate models, as well as for evaluating them. This folder also contains a python_environment.yml file with all Python package versions and dependencies. It also contains two sub-folders:

    • LSG_mods_and_func: Python scripts for using the LSG model. Some of these scripts are also utilized when working with the other surrogate models.
    • SRR_method_master_Zhou2021: Scripts obtained from https://github.com/yuerongz/SRR-method. Small edits have been made for speed and for use in this study.

  11. Privacy-Sensitive Conversations between Care Workers and Care Home Residents...

    • test.researchdata.tuwien.at
    • researchdata.tuwien.ac.at
    bin, text/markdown
    Updated Dec 6, 2024
    Cite
    Reinhard Grabler; Michael Starzinger; Matthias Hirschmanner; Helena Anna Frijns (2024). Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home [Dataset]. http://doi.org/10.70124/hbtq5-ykv92
    Available download formats: bin, text/markdown
    Dataset provided by
    TU Wien
    Authors
    Reinhard Grabler; Michael Starzinger; Matthias Hirschmanner; Helena Anna Frijns
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 2024 - Aug 2024
    Description

    Dataset Card for "privacy-care-interactions"

    Table of Contents

    Dataset Description

    Purpose and Features

    🔒 Collection of Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home 🔒

    The dataset is useful to train and evaluate models to identify and classify privacy-sensitive parts of conversations from text, especially in the context of AI assistants and LLMs.

    Dataset Overview

    Language Distribution 🌍

    • English (en): 95

    Locale Distribution 🌎

    • United States (US) 🇺🇸: 95

    Key Facts 🔑

    • This is synthetic data! Generated using proprietary algorithms - no privacy violations!
    • Conversations are classified following the taxonomy for privacy-sensitive robotics by Rueben et al. (2017).
    • The data was manually labeled by an expert.

    Dataset Structure

    Data Instances

    The provided data format is .jsonl, the JSON Lines text format, also called newline-delimited JSON. An example entry looks as follows.

    { "text": "CW: Have you ever been to Italy? CR: Oh, yes... many years ago.", "taxonomy": 0, "category": 0, "affected_speaker": 1, "language": "en", "locale": "US", "data_type": 1, "uid": 16, "split": "train" }

    Data Fields

    The data fields are:

    • text: a string feature. The abbreviations of the speakers refer to the care worker (CW) and the care recipient (CR).
    • taxonomy: a classification label, with possible values including informational (0), invasion (1), collection (2), processing (3), dissemination (4), physical (5), personal-space (6), territoriality (7), intrusion (8), obtrusion (9), contamination (10), modesty (11), psychological (12), interrogation (13), psychological-distance (14), social (15), association (16), crowding-isolation (17), public-gaze (18), solitude (19), intimacy (20), anonymity (21), reserve (22). The taxonomy is derived from Rueben et al. (2017). The classifications were manually labeled by an expert.
    • category: a classification label, with possible values including personal-information (0), family (1), health (2), thoughts (3), values (4), acquaintance (5), appointment (6). The privacy category affected in the conversation. The classifications were manually labeled by an expert.
    • affected_speaker: a classification label, with possible values including care-worker (0), care-recipient (1), other (2), both (3). The speaker whose privacy is impacted during the conversation. The classifications were manually labeled by an expert.
    • language: a string feature. Language code as defined by ISO 639.
    • locale: a string feature. Regional code as defined by ISO 3166-1 alpha-2.
    • data_type: a classification label, with possible values including real (0), synthetic (1).
    • uid: an int64 feature. A unique identifier within the dataset.
    • split: a string feature. Either train, validation or test.

    Dataset Splits

    The dataset has 2 subsets:

    • split: with a total of 95 examples split into train, validation and test (70%-15%-15%)
    • unsplit: with a total of 95 examples in a single train split
    name     train  validation  test
    split    66     14          15
    unsplit  95     n/a         n/a

    The files follow the naming convention subset-split-language.jsonl. The following files are contained in the dataset:

    • split-train-en.jsonl
    • split-validation-en.jsonl
    • split-test-en.jsonl
    • unsplit-train-en.jsonl

    Dataset Creation

    Curation Rationale

    Recording audio of care workers and residents during care interactions, which includes partial and full body washing, giving of medication, as well as wound care, is a highly privacy-sensitive use case. Therefore, a dataset is created, which includes privacy-sensitive parts of conversations, synthesized from real-world data. This dataset serves as a basis for fine-tuning a local LLM to highlight and classify privacy-sensitive sections of transcripts created in care interactions, to further mask them to protect privacy.

    Source Data

    Initial Data Collection

    The initial data was collected in the project Caring Robots of TU Wien in cooperation with Caritas Wien. One project track aims to use Large Language Models (LLMs) to support the documentation work of care workers, with LLM-generated summaries of audio recordings of interactions between care workers and care home residents. The initial data are the transcriptions of those care interactions.

    Data Processing

    The transcriptions were thoroughly reviewed, and sections containing privacy-sensitive information were identified and marked using qualitative data analysis software by two experts. Subsequently, the accessible portions of the interviews were translated from German to US English using the locally executed LLM icky/translate. In the next step, a llama3.1:70b model was used locally to synthesize the conversation segments. This process involved generating similar, yet distinct and new, conversations that are not linked to the original data. The dataset was split using the train_test_split function from scikit-learn (https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.train_test_split.html).
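    For illustration, a minimal sketch of a 70%-15%-15% split with scikit-learn's train_test_split; the placeholder data and random_state are assumptions, not taken from the dataset card:

    from sklearn.model_selection import train_test_split

    examples = list(range(95))  # stand-in for the 95 labeled conversation records

    # 70% train, then split the remaining 30% in half -> 15% validation, 15% test.
    train, rest = train_test_split(examples, test_size=0.30, random_state=0)
    validation, test = train_test_split(rest, test_size=0.50, random_state=0)
    print(len(train), len(validation), len(test))  # expected: 66 14 15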

  12. Amazon_employee_access_seed_4_nrows_2000_nclasses_10_ncols_100_stratify_True...

    • openml.org
    Updated Nov 17, 2022
    Cite
    Eddie Bergman (2022). Amazon_employee_access_seed_4_nrows_2000_nclasses_10_ncols_100_stratify_True [Dataset]. https://www.openml.org/d/44712
    Authors
    Eddie Bergman
    Description

    Subsampling of the dataset Amazon_employee_access (4135) with

    seed=4
    args.nrows=2000
    args.ncols=100
    args.nclasses=10
    args.no_stratify=True

    Generated with the following source code:

      def subsample(
        self,
        seed: int,
        nrows_max: int = 2_000,
        ncols_max: int = 100,
        nclasses_max: int = 10,
        stratified: bool = True,
      ) -> Dataset:
        rng = np.random.default_rng(seed)
    
        x = self.x
        y = self.y
    
        # Uniformly sample
        classes = y.unique()
        if len(classes) > nclasses_max:
          vcs = y.value_counts()
          selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
          )
    
          # Select the indices where one of these classes is present
          idxs = y.index[y.isin(classes)]
          x = x.iloc[idxs]
          y = y.iloc[idxs]
    
        # Uniformly sample columns if required
        if len(x.columns) > ncols_max:
          columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
          )
          sorted_column_idxs = sorted(columns_idxs)
          selected_columns = list(x.columns[sorted_column_idxs])
          x = x[selected_columns]
        else:
          sorted_column_idxs = list(range(len(x.columns)))
    
        if len(x) > nrows_max:
          # Stratify accordingly
          target_name = y.name
          data = pd.concat((x, y), axis="columns")
          _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
          )
          x = subset.drop(target_name, axis="columns")
          y = subset[target_name]
    
        # We need to convert categorical columns to string for openml
        categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
        columns = list(x.columns)
    
        return Dataset(
          # Technically this is not the same but it's where it was derived from
          dataset=self.dataset,
          x=x,
          y=y,
          categorical_mask=categorical_mask,
          columns=columns,
        )
    
  13. Cleaned Concrete Datset

    • kaggle.com
    zip
    Updated Aug 13, 2025
    Cite
    Divyanshu_CODER (2025). Cleaned Concrete Datset [Dataset]. https://www.kaggle.com/datasets/divyanshucoder/concrete-dataset
    Available download formats: zip (11326 bytes)
    Authors
    Divyanshu_CODER
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    🏗 Concrete Strength Dataset

    📌 Subtitle

    "Predicting the Compressive Strength of Concrete Based on Material Composition and Age"

    📖 Description

    This dataset contains detailed measurements of concrete composition and the corresponding compressive strength (in MPa). It can be used for predictive modeling, regression analysis, and feature engineering in the field of civil engineering and material science.

    Concrete is the most widely used construction material in the world, and predicting its strength accurately is crucial for structural safety, cost optimization, and sustainability. The dataset includes major components like cement, water, aggregates, admixtures, and curing time — all of which play a key role in determining the final strength.

    📊 Dataset Overview

    Total Rows: 1,030

    Total Columns: 9

    No Missing Values ✅

    Feature             Description                                 Unit
    Cement              Amount of cement used                       kg/m³
    Blast Furnace Slag  Amount of blast furnace slag used           kg/m³
    Fly Ash             Amount of fly ash used                      kg/m³
    Water               Water content                               kg/m³
    Superplasticizer    Chemical admixture to enhance workability   kg/m³
    Coarse Aggregate    Gravel/stones in the mix                    kg/m³
    Fine Aggregate      Sand in the mix                             kg/m³
    Age                 Curing time                                 days
    Strength            Compressive strength of the concrete        MPa

    🚀 Use Cases

    Machine Learning Regression Models

    Predict concrete strength based on mix design.

    Feature Engineering Practice

    Apply transformations, scaling, and interaction features.

    Civil Engineering Insights

    Analyze the impact of different materials on strength.

    Optimization Studies

    Reduce cost while maintaining strength requirements.

    📂 File Information

    Filename: concrete_data.csv

    Format: CSV (Comma-Separated Values)

    Encoding: UTF-8

    🛠 Recommended Libraries for Analysis

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score, mean_squared_error

    📌 Example Code Snippet

    # Load dataset
    df = pd.read_csv("concrete_data.csv")

    # Features & target
    X = df.drop("Strength", axis=1)
    y = df["Strength"]

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Model training
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Prediction
    y_pred = model.predict(X_test)

    # Evaluation
    print("R² Score:", r2_score(y_test, y_pred))
    print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))

    🎯 Potential Projects

    Strength Prediction App — Build a web app to predict concrete strength.

    Material Optimization Dashboard — Visualize how ingredient changes affect strength.

    AI-Driven Quality Control — Use ML to detect suboptimal concrete mixes before production.

    📜 License

    This dataset is available for educational and research purposes. If you use it in publications or projects, kindly cite the source.


  14. connect-4_seed_2_nrows_2000_nclasses_10_ncols_100_stratify_True

    • openml.org
    Updated Nov 17, 2022
    Cite
    Eddie Bergman (2022). connect-4_seed_2_nrows_2000_nclasses_10_ncols_100_stratify_True [Dataset]. https://www.openml.org/d/44705
    Authors
    Eddie Bergman
    Description

    Subsampling of the dataset connect-4 (40668) with

    seed=2
    args.nrows=2000
    args.ncols=100
    args.nclasses=10
    args.no_stratify=True

    Generated with the following source code:

      def subsample(
        self,
        seed: int,
        nrows_max: int = 2_000,
        ncols_max: int = 100,
        nclasses_max: int = 10,
        stratified: bool = True,
      ) -> Dataset:
        rng = np.random.default_rng(seed)
    
        x = self.x
        y = self.y
    
        # Uniformly sample
        classes = y.unique()
        if len(classes) > nclasses_max:
          vcs = y.value_counts()
          selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
          )
    
          # Select the indices where one of these classes is present
          idxs = y.index[y.isin(classes)]
          x = x.iloc[idxs]
          y = y.iloc[idxs]
    
        # Uniformly sample columns if required
        if len(x.columns) > ncols_max:
          columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
          )
          sorted_column_idxs = sorted(columns_idxs)
          selected_columns = list(x.columns[sorted_column_idxs])
          x = x[selected_columns]
        else:
          sorted_column_idxs = list(range(len(x.columns)))
    
        if len(x) > nrows_max:
          # Stratify accordingly
          target_name = y.name
          data = pd.concat((x, y), axis="columns")
          _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
          )
          x = subset.drop(target_name, axis="columns")
          y = subset[target_name]
    
        # We need to convert categorical columns to string for openml
        categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
        columns = list(x.columns)
    
        return Dataset(
          # Technically this is not the same but it's where it was derived from
          dataset=self.dataset,
          x=x,
          y=y,
          categorical_mask=categorical_mask,
          columns=columns,
        )
    
  15. Ubiquant Competition Train Data Divided - QS

    • kaggle.com
    zip
    Updated Feb 9, 2022
    Cite
    Fabbasso (2022). Ubiquant Competition Train Data Divided - QS [Dataset]. https://www.kaggle.com/fabrizio78/ubiquant-competition-train-data-divided-qs
    Available download formats: zip (2265752383 bytes)
    Authors
    Fabbasso
    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Ubiquant dataset training data split into different csv-files.

    The original training dataset from the Ubiquant Competition has been divided into three subsets: Training, Validation, and Test. The data has been scaled using QuantileTransformer from scikit-learn with the following parameters:

    • output_distribution='normal'
    • random_state=17

    Two pkl files included in the dataset contain the scalers used for the features and target variables.

    Out of the original dataset, 80% has been used to create the training set, 10% for the validation set, and the remaining 10% for the test set. The scikit-learn train_test_split function has been used for this purpose.
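    A minimal sketch of how such an 80/10/10 split and scaling could be reproduced; the file path, the feature-column naming, and the use of random_state=17 for the split are assumptions, while the scaler parameters come from the description above:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import QuantileTransformer

    df = pd.read_csv("train.csv")  # hypothetical path to the original Ubiquant training data

    # 80% train, then split the remaining 20% in half -> 10% validation, 10% test.
    train_df, rest_df = train_test_split(df, test_size=0.2, random_state=17)
    val_df, test_df = train_test_split(rest_df, test_size=0.5, random_state=17)

    # Scale the features with the parameters stated above; the 'f_' prefix is an assumption.
    feature_cols = [c for c in df.columns if c.startswith("f_")]
    scaler = QuantileTransformer(output_distribution='normal', random_state=17)
    train_df = train_df.copy()
    train_df[feature_cols] = scaler.fit_transform(train_df[feature_cols])
    val_df = val_df.copy()
    val_df[feature_cols] = scaler.transform(val_df[feature_cols])
    test_df = test_df.copy()
    test_df[feature_cols] = scaler.transform(test_df[feature_cols])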

  16. car_seed_2_nrows_2000_nclasses_10_ncols_100_stratify_True

    • openml.org
    Updated Nov 17, 2022
    Cite
    Eddie Bergman (2022). car_seed_2_nrows_2000_nclasses_10_ncols_100_stratify_True [Dataset]. https://www.openml.org/d/44620
    Authors
    Eddie Bergman
    Description

    Subsampling of the dataset car (40975) with

    seed=2
    args.nrows=2000
    args.ncols=100
    args.nclasses=10
    args.no_stratify=True

    Generated with the following source code:

      def subsample(
        self,
        seed: int,
        nrows_max: int = 2_000,
        ncols_max: int = 100,
        nclasses_max: int = 10,
        stratified: bool = True,
      ) -> Dataset:
        rng = np.random.default_rng(seed)
    
        x = self.x
        y = self.y
    
        # Uniformly sample
        classes = y.unique()
        if len(classes) > nclasses_max:
          vcs = y.value_counts()
          selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
          )
    
          # Select the indices where one of these classes is present
          idxs = y.index[y.isin(classes)]
          x = x.iloc[idxs]
          y = y.iloc[idxs]
    
        # Uniformly sample columns if required
        if len(x.columns) > ncols_max:
          columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
          )
          sorted_column_idxs = sorted(columns_idxs)
          selected_columns = list(x.columns[sorted_column_idxs])
          x = x[selected_columns]
        else:
          sorted_column_idxs = list(range(len(x.columns)))
    
        if len(x) > nrows_max:
          # Stratify accordingly
          target_name = y.name
          data = pd.concat((x, y), axis="columns")
          _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
          )
          x = subset.drop(target_name, axis="columns")
          y = subset[target_name]
    
        # We need to convert categorical columns to string for openml
        categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
        columns = list(x.columns)
    
        return Dataset(
          # Technically this is not the same but it's where it was derived from
          dataset=self.dataset,
          x=x,
          y=y,
          categorical_mask=categorical_mask,
          columns=columns,
        )
    
  17. miia-pothole-train

    • huggingface.co
    Updated Feb 13, 2024
    Cite
    Mahmoud Abughali (2024). miia-pothole-train [Dataset]. https://huggingface.co/datasets/mabughali/miia-pothole-train
    Authors
    Mahmoud Abughali
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    How to use:

    pip install datasets

    dataset = load_dataset("mabughali/miia-pothole-train", split="train") splits = dataset.train_test_split(test_size=0.2) train_ds = splits['train'] val_ds = splits['test']

  18. Salary vs Years of Experience

    • kaggle.com
    zip
    Updated Oct 6, 2023
    Cite
    Sakshi Gangwani (2023). Salary vs Years of Experience [Dataset]. https://www.kaggle.com/datasets/sakshigangwani/salary-vs-years-of-experience
    Available download formats: zip (51852 bytes)
    Authors
    Sakshi Gangwani
    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    dataset = pd.read_csv('Salary_dataset.csv')
    X = dataset.iloc[:, 1:2].values
    y = dataset.iloc[:, -1].values

    dataset.head()

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    from sklearn.linear_model import LinearRegression
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)

    y_pred = regressor.predict(X_test)

    plt.scatter(X_train, y_train, color="red")
    plt.plot(X_train, regressor.predict(X_train), color="blue")
    plt.title('Salary vs Experience (Training set)')
    plt.xlabel('Years of Experience')
    plt.ylabel('Salary')
    plt.show()

    plt.scatter(X_test, y_test, color='red')
    plt.plot(X_train, regressor.predict(X_train), color='blue')
    plt.title('Salary vs Experience (Test set)')
    plt.xlabel('Years of Experience')
    plt.ylabel('Salary')
    plt.show()

  19. Internet-Advertisements_seed_2_nrows_2000_nclasses_10_ncols_100_stratify_True...

    • openml.org
    Updated Nov 17, 2022
    + more versions
    Cite
    Eddie Bergman (2022). Internet-Advertisements_seed_2_nrows_2000_nclasses_10_ncols_100_stratify_True [Dataset]. https://www.openml.org/d/44650
    Authors
    Eddie Bergman
    Description

    Subsampling of the dataset Internet-Advertisements (40978) with

    seed=2
    args.nrows=2000
    args.ncols=100
    args.nclasses=10
    args.no_stratify=True

    Generated with the following source code:

      def subsample(
        self,
        seed: int,
        nrows_max: int = 2_000,
        ncols_max: int = 100,
        nclasses_max: int = 10,
        stratified: bool = True,
      ) -> Dataset:
        rng = np.random.default_rng(seed)
    
        x = self.x
        y = self.y
    
        # Uniformly sample
        classes = y.unique()
        if len(classes) > nclasses_max:
          vcs = y.value_counts()
          selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
          )
    
          # Select the indices where one of these classes is present
          idxs = y.index[y.isin(classes)]
          x = x.iloc[idxs]
          y = y.iloc[idxs]
    
        # Uniformly sample columns if required
        if len(x.columns) > ncols_max:
          columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
          )
          sorted_column_idxs = sorted(columns_idxs)
          selected_columns = list(x.columns[sorted_column_idxs])
          x = x[selected_columns]
        else:
          sorted_column_idxs = list(range(len(x.columns)))
    
        if len(x) > nrows_max:
          # Stratify accordingly
          target_name = y.name
          data = pd.concat((x, y), axis="columns")
          _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
          )
          x = subset.drop(target_name, axis="columns")
          y = subset[target_name]
    
        # We need to convert categorical columns to string for openml
        categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
        columns = list(x.columns)
    
        return Dataset(
          # Technically this is not the same but it's where it was derived from
          dataset=self.dataset,
          x=x,
          y=y,
          categorical_mask=categorical_mask,
          columns=columns,
        )
    
  20. CIFAR100-custom

    • huggingface.co
    Updated Apr 16, 2024
    + more versions
    Cite
    Andrei Semenov (2024). CIFAR100-custom [Dataset]. https://huggingface.co/datasets/Andron00e/CIFAR100-custom
    Authors
    Andrei Semenov
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Example of usage:

    from datasets import load_dataset

    dataset = load_dataset("Andron00e/CIFAR100-custom")
    splitted_dataset = dataset["train"].train_test_split(test_size=0.2)
