8 datasets found
  1. rsna_small_for_faster_experimentation

    • kaggle.com
    zip
    Updated Dec 27, 2022
    Cite
    Mohammed Hasan goni (2022). rsna_small_for_faster_experimentation [Dataset]. https://www.kaggle.com/datasets/hasangoni/rsna-small-for-faster-experimentation/code
    Explore at:
    zip (177242332 bytes). Available download formats.
    Dataset updated
    Dec 27, 2022
    Authors
    Mohammed Hasan goni
    Description

    RSNA breast cancer ROI images subset, created with a scikit-learn train-test split. The original dataset can be found in this competition. I found PNG ROI images here, then created a subset of that data, only 10% of it, to get faster iteration per epoch. A sketch of one way to draw such a subset follows.
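    As a rough illustration, such a subset can be drawn with scikit-learn's train_test_split; the metadata file name and the "cancer" label column below are assumptions, not taken from this dataset.

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Assumed metadata file and label column; the real layout may differ.
    df = pd.read_csv("train.csv")

    # Keep a 10% subset, stratified on the label so class balance is preserved.
    _, subset = train_test_split(df, test_size=0.10, stratify=df["cancer"], random_state=42)
    subset.to_csv("train_small.csv", index=False)
    print(f"Kept {len(subset)} of {len(df)} rows")
    ```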

  2. hate_speech_dataset

    • huggingface.co
    Updated Jul 27, 2024
    + more versions
    Cite
    Christina Christodoulou (2024). hate_speech_dataset [Dataset]. https://huggingface.co/datasets/christinacdl/hate_speech_dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Jul 27, 2024
    Authors
    Christina Christodoulou
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    32,579 texts in total: 14,012 NOT hateful texts and 18,567 HATEFUL texts. All duplicate values were removed. Split using sklearn into 80% train and 20% temporary test (stratified by label); the temporary test set was then split 50/50 into test and validation (stratified by label). Split: 80/10/10. Train set label distribution: 0 ==> 11,210, 1 ==> 14,853, 26,063 in total. Validation set label distribution: 0 ==> 1,401, 1 ==> 1,857, 3,258 in total. Test set label distribution: 0 ==> 1,401, 1 ==> 1,857, 3,258 in… See the full description on the dataset page: https://huggingface.co/datasets/christinacdl/hate_speech_dataset. The two-stage split is sketched below.
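    A minimal sketch of that two-stage stratified split, using a toy DataFrame in place of the real corpus (column names are assumptions); the offensive_language_dataset and clickbait_detection_dataset entries below describe the same recipe.

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Toy stand-in for the real corpus: a text column and a binary label.
    df = pd.DataFrame({
        "text": [f"example {i}" for i in range(100)],
        "label": [i % 2 for i in range(100)],
    })

    # 80% train / 20% temporary test, stratified on the label.
    train_df, temp_df = train_test_split(df, test_size=0.20, stratify=df["label"], random_state=42)
    # Split the temporary set 50/50 into validation and test (10% each overall).
    val_df, test_df = train_test_split(temp_df, test_size=0.50, stratify=temp_df["label"], random_state=42)
    print(len(train_df), len(val_df), len(test_df))  # 80 10 10
    ```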

  3. Prediction of Personality Traits using the Big 5 Framework

    • zenodo.org
    csv, text/x-python
    Updated Feb 2, 2023
    Cite
    Neelima Brahmbhatt (2023). Prediction of Personality Traits using the Big 5 Framework [Dataset]. http://doi.org/10.5281/zenodo.7596072
    Explore at:
    text/x-python, csv. Available download formats.
    Dataset updated
    Feb 2, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Neelima Brahmbhatt
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The methodology is the core component of any research work, since it documents the methods used to obtain the results. Here, the entire implementation is done in Python. The work involves the following steps:

    1. Acquire Personality Dataset

    Kaggle hosts a collection of datasets and data generators used by the machine-learning community for analysis. The personality prediction dataset was acquired from the Kaggle website. It was collected (2016-2018) through an interactive online personality test constructed from the IPIP. The dataset can be downloaded as a zip file by clicking the link provided. It consists of two CSV files (test.csv & train.csv). The test.csv file has 0 missing values, 7 attributes, and a final label output, and the dataset has multivariate characteristics. Data preprocessing is then performed to check for inconsistent behaviors or trends.

    2. Data preprocessing

    After data acquisition, the next step is to clean and preprocess the data. All features in the dataset are numerical. The target value is a five-level personality label: serious, lively, responsible, dependable & extraverted. The preprocessed dataset is split into training and testing sets by passing the feature values, target values, and test size to the train_test_split method of the scikit-learn package. After the split, the training data is used to fit the Logistic Regression & SVM models, and the test data is used to estimate the accuracy of the trained models.

    3. Feature Extraction

    The following items were presented on one page, and each was rated on a five-point scale using radio buttons. The order on the page was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc. The scale was labeled 1=Disagree, 3=Neutral, 5=Agree.

            EXT1 I am the life of the party.
            EXT2  I don't talk a lot.
            EXT3  I feel comfortable around people.
            EXT4  I am quiet around strangers.
            EST1  I get stressed out easily.
            EST2  I get irritated easily.
            EST3  I worry about things.
            EST4  I change my mood a lot.
            AGR1  I have a soft heart.
            AGR2  I am interested in people.
            AGR3  I insult people.
            AGR4  I am not really interested in others.
            CSN1  I am always prepared.
            CSN2  I leave my belongings around.
            CSN3  I follow a schedule.
            CSN4  I make a mess of things.
            OPN1  I have a rich vocabulary.
            OPN2  I have difficulty understanding abstract ideas.
            OPN3  I do not have a good imagination.
            OPN4  I use difficult words.

    4. Training the Model

    Train/Test is a method to measure the accuracy of a model. It is called Train/Test because the data set is split into two sets, a training set and a testing set: 80% for training and 20% for testing. The model is trained on the training set. In this work we trained on the dataset using linear_model.LogisticRegression() & svm.SVC() from the sklearn package, as sketched below.
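    As a rough illustration of steps 2 and 4, the snippet below performs the 80/20 split and fits both classifiers; synthetic data stands in for the 20 item responses and the five-level personality label, so nothing here is taken from the actual CSV files.

    ```python
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    # Synthetic stand-in for the 20 item responses and the 5-class label.
    X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                               n_classes=5, random_state=42)

    # 80/20 train/test split, then fit both classifiers named in the text.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
    log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    svc = SVC().fit(X_train, y_train)
    ```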

    5. Personality Prediction Output

    After training, the Logistic Regression & SVM models are tested using cohen_kappa_score & accuracy_score, as in the sketch below.
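    Continuing the sketch above, evaluation with the two metrics named in the text:

    ```python
    from sklearn.metrics import accuracy_score, cohen_kappa_score

    # Score both trained models on the held-out test set.
    for name, clf in [("LogisticRegression", log_reg), ("SVM", svc)]:
        preds = clf.predict(X_test)
        print(name,
              "accuracy:", round(accuracy_score(y_test, preds), 3),
              "kappa:", round(cohen_kappa_score(y_test, preds), 3))
    ```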

  4. offensive_language_dataset

    • huggingface.co
    Updated Feb 1, 2024
    Cite
    Christina Christodoulou (2024). offensive_language_dataset [Dataset]. https://huggingface.co/datasets/christinacdl/offensive_language_dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Feb 1, 2024
    Authors
    Christina Christodoulou
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    36,528 English texts in total: 12,955 NOT offensive and 23,573 OFFENSIVE texts. All duplicate values were removed. Split using sklearn into 80% train and 20% temporary test (stratified by label); the temporary test set was then split 50/50 into test and validation (stratified by label), the same recipe sketched under hate_speech_dataset above. Split: 80/10/10. Train set label distribution: 0 ==> 10,364, 1 ==> 18,858. Validation set label distribution: 0 ==> 1,296, 1 ==> 2,357. Test set label distribution: 0 ==> 1,295, 1 ==> 2,358. The OLID dataset (Zampieri et al., 2019)… See the full description on the dataset page: https://huggingface.co/datasets/christinacdl/offensive_language_dataset.

  5. Llama 3.1 8B Correct Labels

    • kaggle.com
    zip
    Updated Aug 26, 2025
    Cite
    Jatin Mehra_666 (2025). Llama 3.1 8B Correct Labels [Dataset]. https://www.kaggle.com/datasets/jatinmehra666/llama-3-1-8b-correct-labels
    Explore at:
    zip (11853454078 bytes). Available download formats.
    Dataset updated
    Aug 26, 2025
    Authors
    Jatin Mehra_666
    Description

    Training code:

    ```python
    from sklearn.preprocessing import LabelEncoder
    from sklearn.model_selection import train_test_split
    import os
    import pandas as pd
    import numpy as np

    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
    TEMP_DIR = "tmp"
    os.makedirs(TEMP_DIR, exist_ok=True)

    train = pd.read_csv('input/map-charting-student-math-misunderstandings/train.csv')

    # Fill missing Misconception values with 'NA'
    train.Misconception = train.Misconception.fillna('NA')

    # Create a combined target label (Category:Misconception)
    train['target'] = train.Category + ":" + train.Misconception

    # Encode target labels to numerical format
    le = LabelEncoder()
    train['label'] = le.fit_transform(train['target'])
    n_classes = len(le.classes_)  # Number of unique target classes
    print(f"Train shape: {train.shape} with {n_classes} target classes")
    print("Train head:")
    train.head()

    # Identify the correct answer for each question (rows whose Category starts with 'True')
    idx = train.apply(lambda row: row.Category.split('_')[0], axis=1) == 'True'
    correct = train.loc[idx].copy()
    correct['c'] = correct.groupby(['QuestionId', 'MC_Answer']).MC_Answer.transform('count')
    correct = correct.sort_values('c', ascending=False)
    correct = correct.drop_duplicates(['QuestionId'])
    correct = correct[['QuestionId', 'MC_Answer']]
    correct['is_correct'] = 1  # Mark these as correct answers

    # Merge 'is_correct' flag into the main training DataFrame
    train = train.merge(correct, on=['QuestionId', 'MC_Answer'], how='left')
    train.is_correct = train.is_correct.fillna(0)

    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch

    Model_Name = "unsloth/Meta-Llama-3.1-8B-Instruct"

    model = AutoModelForSequenceClassification.from_pretrained(
        Model_Name,
        num_labels=n_classes,
        torch_dtype=torch.bfloat16,
        device_map="balanced",
        cache_dir=TEMP_DIR,
    )

    tokenizer = AutoTokenizer.from_pretrained(Model_Name, cache_dir=TEMP_DIR)

    def format_input(row):
        x = "Yes"
        if not row['is_correct']:
            x = "No"
        return (
            f"Question: {row['QuestionText']} "
            f"Answer: {row['MC_Answer']} "
            f"Correct? {x} "
            f"Student Explanation: {row['StudentExplanation']}"
        )

    train['text'] = train.apply(format_input, axis=1)
    print("Example prompt for our LLM:")
    print()
    print(train.text.values[0])

    from datasets import Dataset

    # Split data into training and validation sets
    train_df, val_df = train_test_split(train, test_size=0.2, random_state=42)

    # Convert to Hugging Face Dataset
    COLS = ['text', 'label']

    # Create clean DataFrame with the full training data
    train_df_clean = train[COLS].copy()  # Use 'train' instead of 'train_df'

    # Ensure labels are proper integers
    train_df_clean['label'] = train_df_clean['label'].astype(np.int64)

    # Reset index to ensure clean DataFrame structure
    train_df_clean = train_df_clean.reset_index(drop=True)

    # Create dataset with the full training data
    train_ds = Dataset.from_pandas(train_df_clean, preserve_index=False)

    def tokenize(batch):
        """Tokenizes a batch of text inputs."""
        return tokenizer(batch["text"], truncation=True, max_length=256)

    # Apply tokenization to the full dataset
    train_ds = train_ds.map(tokenize, batched=True, remove_columns=['text'])

    # Add a new padding token
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

    # Resize the model's token embeddings to match the new tokenizer
    model.resize_token_embeddings(len(tokenizer))

    # Set the pad token id in the model's config
    model.config.pad_token_id = tokenizer.pad_token_id

    # 2. Clear HF cache after loading, to free ~16GB
    from huggingface_hub import scan_cache_dir

    cache_info = scan_cache_dir()
    # delete_revisions expects commit hashes, so flatten each repo's revisions
    cache_info.delete_revisions(
        *[rev.commit_hash for repo in cache_info.repos for rev in repo.revisions]
    ).execute()

    # --- Training Arguments ---
    from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
    import tempfile
    import shutil

    # Ensure temp directories exist
    os.makedirs(f"{TEMP_DIR}/training_output/", exist_ok=True)
    os.makedirs(f"{TEMP_DIR}/logs/", exist_ok=True)

    training_args = TrainingArguments(
        output_dir=f"{TEMP_DIR}/training_output/",
        do_train=True,
        do_eval=False,
        save_strategy="no",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=5e-5,
        logging_dir=f"{TEMP_DIR}/logs/",
        logging_steps=500,
        bf16=True,
        fp16=False,
        report_to="none",
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        dataloader_pin_memory=False,
        gradient_checkpointing=True,
    )

    # --- Custom Metric Computation (MAP@3) ---
    def compute_map3(eval_pred):
        """Computes Mean Average Precision at 3 (MAP@3) for evaluation."""
        logits, labels = eval_pred
        probs = torch.nn.functional.softmax(torch.tensor(logits), dim=-1).numpy()

        # Get top 3 predicted class indi...
    ```
    
  6. clickbait_detection_dataset

    • huggingface.co
    Updated Jan 15, 2024
    Cite
    Christina Christodoulou (2024). clickbait_detection_dataset [Dataset]. https://huggingface.co/datasets/christinacdl/clickbait_detection_dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Jan 15, 2024
    Authors
    Christina Christodoulou
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    37,870 texts in total: 17,850 NOT clickbait texts and 20,020 CLICKBAIT texts

    All duplicate values were removed

    Split using sklearn into 80% train and 20% temporary test (stratified by label); the temporary test set was then split 50/50 into test and validation (stratified by label)

    Split: 80/10/10

    Train set label distribution: 0 ==> 14,280, 1 ==> 16,016

    Validation set label distribution: 0 ==> 1,785, 1 ==> 2,002

    Test set label distribution: 0 ==> 1,785, 1 ==> 2,002

    The dataset was created from the… See the full description on the dataset page: https://huggingface.co/datasets/christinacdl/clickbait_detection_dataset.

  7. Classifier Model

    • kaggle.com
    zip
    Updated Feb 4, 2025
    Cite
    Jeriann L Rhymer (2025). Classifier Model [Dataset]. https://www.kaggle.com/datasets/jeriannlrhymer/regression-model/discussion
    Explore at:
    zip (2163 bytes). Available download formats.
    Dataset updated
    Feb 4, 2025
    Authors
    Jeriann L Rhymer
    License

    Community Data License Agreement - Sharing, Version 1.0: https://cdla.io/sharing-1-0/

    Description

    The purpose of this dataset is linear regression practice: handling categorical features in a scikit-learn model, carrying out a train/test split, training a model, and evaluating that model on the testing data. A sketch follows the table below.

    The mpg data set represents the fuel economy (in miles per gallon) for 38 popular models of car, measured between 1999 and 2008.

    Factor        Type                   Description
    manufacturer  multi-valued discrete  Vehicle manufacturer
    model         multi-valued discrete  Model of the vehicle
    displ         continuous             Size of engine [litres]
    year          multi-valued discrete  Year of vehicle manufacture
    cyl           multi-valued discrete  Number of ignition cylinders
    trans         multi-valued discrete  Transmission type (manual or automatic)
    drv           multi-valued discrete  Driven wheels (f=front, 4=4-wheel, r=rear wheel drive)
    city          continuous             Miles per gallon, city driving conditions (fuel economy)
    hwy           continuous             Miles per gallon, highway driving conditions (fuel economy)
    fl            multi-valued discrete  Fuel type
    class         multi-valued discrete  Vehicle class (suv, compact, etc.)
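    A hedged sketch of that exercise, assuming the mpg data is available locally as a CSV with the columns listed above (the file name and choice of features are assumptions):

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    df = pd.read_csv("mpg.csv")  # assumed file name for the dataset described above

    # One-hot encode the categorical features, keep the numeric ones as-is.
    X = pd.get_dummies(df[["displ", "cyl", "drv", "class"]], columns=["drv", "class"])
    y = df["hwy"]  # predict highway fuel economy

    # Train/test split, fit, and evaluate on the held-out data.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
    model = LinearRegression().fit(X_train, y_train)
    print("R^2 on test data:", round(model.score(X_test, y_test), 3))
    ```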

  8. 食品安全主题数据集 (Food Safety Topic Dataset)

    • kaggle.com
    zip
    Updated Mar 2, 2023
    Cite
    镜中青日 (2023). 食品安全主题数据集 [Dataset]. https://www.kaggle.com/datasets/modeststudent/foodsafety-data-zh
    Explore at:
    zip (1395205 bytes). Available download formats.
    Dataset updated
    Mar 2, 2023
    Authors
    镜中青日
    Description

    1. Data source
    From: https://www.luge.ai/#/luge/dataDetail?id=71

    2. Description
    1. Government-affairs data: the dataset suits a food-safety topic classification system that classifies information records and, through model building and semantic analysis, filters out food-safety-related information, helping the relevant authorities supervise efficiently and precisely.
    2. The data has been de-identified; all key information involving locations, personal names, and organizations has been replaced with "*".
    3. In this dataset, 1 = involves a food-safety issue, 0 = does not.

    3. Release notes
    v2-split:
    1. The text fields were merged using the template f"主题:{event_name};详细描述:{content}";
    2. sklearn.model_selection.train_test_split was used to divide the original data into train, dev, and test parts, as sketched below.

    Initial Release: the original dataset.
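    A minimal sketch of the v2-split recipe under stated assumptions: toy data stands in for the real corpus, and the 80/10/10 ratio and the stratification label are assumptions not given in the release note.

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Toy stand-in for the real corpus (real column names may differ).
    df = pd.DataFrame({
        "event_name": ["事件A", "事件B"] * 50,
        "content": [f"描述 {i}" for i in range(100)],
        "label": [i % 2 for i in range(100)],
    })

    # Merge the text fields using the template from the release note.
    df["text"] = df.apply(lambda r: f"主题:{r['event_name']};详细描述:{r['content']}", axis=1)

    # Two-stage split into train/dev/test (80/10/10 assumed).
    train_df, temp_df = train_test_split(df, test_size=0.20, stratify=df["label"], random_state=42)
    dev_df, test_df = train_test_split(temp_df, test_size=0.50, stratify=temp_df["label"], random_state=42)
    print(len(train_df), len(dev_df), len(test_df))  # 80 10 10
    ```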
