8 datasets found
  1. rsna_small_for_faster_experimentation

    • kaggle.com
    zip
    Updated Dec 27, 2022
    Cite
    Mohammed Hasan goni (2022). rsna_small_for_faster_experimentation [Dataset]. https://www.kaggle.com/datasets/hasangoni/rsna-small-for-faster-experimentation/code
    Explore at:
    zip (177242332 bytes). Available download formats.
    Dataset updated
    Dec 27, 2022
    Authors
    Mohammed Hasan goni
    Description

    RSNA breast cancer ROI images subset, created with a scikit-learn train-test split. The original dataset can be found in this competition. I found PNG ROI images here, then created a subset of that data, only 10% of it, to get faster iteration per epoch. A sketch of one way to draw such a subset follows.
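    As a rough illustration, such a subset can be drawn with scikit-learn's train_test_split; the metadata file name and the "cancer" label column below are assumptions, not taken from this dataset.

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Assumed metadata file and label column; the real layout may differ.
    df = pd.read_csv("train.csv")

    # Keep a 10% subset, stratified on the label so class balance is preserved.
    _, subset = train_test_split(df, test_size=0.10, stratify=df["cancer"], random_state=42)
    subset.to_csv("train_small.csv", index=False)
    print(f"Kept {len(subset)} of {len(df)} rows")
    ```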

  2. hate_speech_dataset

    • huggingface.co
    Updated Jul 27, 2024
    + more versions
    Cite
    Christina Christodoulou (2024). hate_speech_dataset [Dataset]. https://huggingface.co/datasets/christinacdl/hate_speech_dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Jul 27, 2024
    Authors
    Christina Christodoulou
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    32,579 texts in total: 14,012 NOT hateful texts and 18,567 HATEFUL texts. All duplicate values were removed. Split using sklearn into 80% train and 20% temporary test (stratified by label); the temporary test set was then split 50/50 into test and validation (stratified by label). Split: 80/10/10. Train set label distribution: 0 ==> 11,210, 1 ==> 14,853, 26,063 in total. Validation set label distribution: 0 ==> 1,401, 1 ==> 1,857, 3,258 in total. Test set label distribution: 0 ==> 1,401, 1 ==> 1,857, 3,258 in… See the full description on the dataset page: https://huggingface.co/datasets/christinacdl/hate_speech_dataset. The two-stage split is sketched below.
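    A minimal sketch of that two-stage stratified split, using a toy DataFrame in place of the real corpus (column names are assumptions); the offensive_language_dataset and clickbait_detection_dataset entries below describe the same recipe.

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Toy stand-in for the real corpus: a text column and a binary label.
    df = pd.DataFrame({
        "text": [f"example {i}" for i in range(100)],
        "label": [i % 2 for i in range(100)],
    })

    # 80% train / 20% temporary test, stratified on the label.
    train_df, temp_df = train_test_split(df, test_size=0.20, stratify=df["label"], random_state=42)
    # Split the temporary set 50/50 into validation and test (10% each overall).
    val_df, test_df = train_test_split(temp_df, test_size=0.50, stratify=temp_df["label"], random_state=42)
    print(len(train_df), len(val_df), len(test_df))  # 80 10 10
    ```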

  3. Prediction of Personality Traits using the Big 5 Framework

    • zenodo.org
    csv, text/x-python
    Updated Feb 2, 2023
    Cite
    Neelima Brahmbhatt (2023). Prediction of Personality Traits using the Big 5 Framework [Dataset]. http://doi.org/10.5281/zenodo.7596072
    Explore at:
    text/x-python, csv. Available download formats.
    Dataset updated
    Feb 2, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Neelima Brahmbhatt
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The methodology is the core component of any research work, since it documents the methods used to obtain the results. Here, the entire implementation is done in Python. The work involves the following steps:

    1. Acquire Personality Dataset

    Kaggle hosts a collection of datasets and data generators used by the machine-learning community for analysis. The personality prediction dataset was acquired from the Kaggle website. It was collected (2016-2018) through an interactive online personality test constructed from the IPIP. The dataset can be downloaded as a zip file by clicking the link provided. It consists of two CSV files (test.csv & train.csv). The test.csv file has 0 missing values, 7 attributes, and a final label output, and the dataset has multivariate characteristics. Data preprocessing is then performed to check for inconsistent behaviors or trends.

    2. Data preprocessing

    After data acquisition, the next step is to clean and preprocess the data. All features in the dataset are numerical. The target value is a five-level personality label: serious, lively, responsible, dependable & extraverted. The preprocessed dataset is split into training and testing sets by passing the feature values, target values, and test size to the train_test_split method of the scikit-learn package. After the split, the training data is used to fit the Logistic Regression & SVM models, and the test data is used to estimate the accuracy of the trained models.

    3. Feature Extraction

    The following items were presented on one page, and each was rated on a five-point scale using radio buttons. The order on the page was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc. The scale was labeled 1=Disagree, 3=Neutral, 5=Agree.

            EXT1 I am the life of the party.
            EXT2  I don't talk a lot.
            EXT3  I feel comfortable around people.
            EXT4  I am quiet around strangers.
            EST1  I get stressed out easily.
            EST2  I get irritated easily.
            EST3  I worry about things.
            EST4  I change my mood a lot.
            AGR1  I have a soft heart.
            AGR2  I am interested in people.
            AGR3  I insult people.
            AGR4  I am not really interested in others.
            CSN1  I am always prepared.
            CSN2  I leave my belongings around.
            CSN3  I follow a schedule.
            CSN4  I make a mess of things.
            OPN1  I have a rich vocabulary.
            OPN2  I have difficulty understanding abstract ideas.
            OPN3  I do not have a good imagination.
            OPN4  I use difficult words.

    4. Training the Model

    Train/Test is a method to measure the accuracy of a model. It is called Train/Test because the data set is split into two sets, a training set and a testing set: 80% for training and 20% for testing. The model is trained on the training set. In this work we trained on the dataset using linear_model.LogisticRegression() & svm.SVC() from the sklearn package, as sketched below.
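    As a rough illustration of steps 2 and 4, the snippet below performs the 80/20 split and fits both classifiers; synthetic data stands in for the 20 item responses and the five-level personality label, so nothing here is taken from the actual CSV files.

    ```python
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    # Synthetic stand-in for the 20 item responses and the 5-class label.
    X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                               n_classes=5, random_state=42)

    # 80/20 train/test split, then fit both classifiers named in the text.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
    log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    svc = SVC().fit(X_train, y_train)
    ```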

    5. Personality Prediction Output

    After training, the Logistic Regression & SVM models are tested using cohen_kappa_score & accuracy_score, as in the sketch below.
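    Continuing the sketch above, evaluation with the two metrics named in the text:

    ```python
    from sklearn.metrics import accuracy_score, cohen_kappa_score

    # Score both trained models on the held-out test set.
    for name, clf in [("LogisticRegression", log_reg), ("SVM", svc)]:
        preds = clf.predict(X_test)
        print(name,
              "accuracy:", round(accuracy_score(y_test, preds), 3),
              "kappa:", round(cohen_kappa_score(y_test, preds), 3))
    ```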

  4. offensive_language_dataset

    • huggingface.co
    Updated Feb 1, 2024
    Cite
    Christina Christodoulou (2024). offensive_language_dataset [Dataset]. https://huggingface.co/datasets/christinacdl/offensive_language_dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Feb 1, 2024
    Authors
    Christina Christodoulou
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    36,528 English texts in total: 12,955 NOT offensive and 23,573 OFFENSIVE texts. All duplicate values were removed. Split using sklearn into 80% train and 20% temporary test (stratified by label); the temporary test set was then split 50/50 into test and validation (stratified by label), the same recipe sketched under hate_speech_dataset above. Split: 80/10/10. Train set label distribution: 0 ==> 10,364, 1 ==> 18,858. Validation set label distribution: 0 ==> 1,296, 1 ==> 2,357. Test set label distribution: 0 ==> 1,295, 1 ==> 2,358. The OLID dataset (Zampieri et al., 2019)… See the full description on the dataset page: https://huggingface.co/datasets/christinacdl/offensive_language_dataset.

  5. Llama 3.1 8B Correct Labels

    • kaggle.com
    zip
    Updated Aug 26, 2025
    Cite
    Jatin Mehra_666 (2025). Llama 3.1 8B Correct Labels [Dataset]. https://www.kaggle.com/datasets/jatinmehra666/llama-3-1-8b-correct-labels
    Explore at:
    zip (11853454078 bytes). Available download formats.
    Dataset updated
    Aug 26, 2025
    Authors
    Jatin Mehra_666
    Description

    Training code:

    ```python
    from sklearn.preprocessing import LabelEncoder
    from sklearn.model_selection import train_test_split
    import os
    import pandas as pd
    import numpy as np

    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
    TEMP_DIR = "tmp"
    os.makedirs(TEMP_DIR, exist_ok=True)

    train = pd.read_csv('input/map-charting-student-math-misunderstandings/train.csv')

    # Fill missing Misconception values with 'NA'
    train.Misconception = train.Misconception.fillna('NA')

    # Create a combined target label (Category:Misconception)
    train['target'] = train.Category + ":" + train.Misconception

    # Encode target labels to numerical format
    le = LabelEncoder()
    train['label'] = le.fit_transform(train['target'])
    n_classes = len(le.classes_)  # Number of unique target classes
    print(f"Train shape: {train.shape} with {n_classes} target classes")
    print("Train head:")
    train.head()

    # Identify the correct answer for each question (rows whose Category starts with 'True')
    idx = train.apply(lambda row: row.Category.split('_')[0], axis=1) == 'True'
    correct = train.loc[idx].copy()
    correct['c'] = correct.groupby(['QuestionId', 'MC_Answer']).MC_Answer.transform('count')
    correct = correct.sort_values('c', ascending=False)
    correct = correct.drop_duplicates(['QuestionId'])
    correct = correct[['QuestionId', 'MC_Answer']]
    correct['is_correct'] = 1  # Mark these as correct answers

    # Merge 'is_correct' flag into the main training DataFrame
    train = train.merge(correct, on=['QuestionId', 'MC_Answer'], how='left')
    train.is_correct = train.is_correct.fillna(0)

    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch

    Model_Name = "unsloth/Meta-Llama-3.1-8B-Instruct"

    model = AutoModelForSequenceClassification.from_pretrained(
        Model_Name,
        num_labels=n_classes,
        torch_dtype=torch.bfloat16,
        device_map="balanced",
        cache_dir=TEMP_DIR,
    )

    tokenizer = AutoTokenizer.from_pretrained(Model_Name, cache_dir=TEMP_DIR)

    def format_input(row):
        x = "Yes"
        if not row['is_correct']:
            x = "No"
        return (
            f"Question: {row['QuestionText']} "
            f"Answer: {row['MC_Answer']} "
            f"Correct? {x} "
            f"Student Explanation: {row['StudentExplanation']}"
        )

    train['text'] = train.apply(format_input, axis=1)
    print("Example prompt for our LLM:")
    print()
    print(train.text.values[0])

    from datasets import Dataset

    # Split data into training and validation sets
    train_df, val_df = train_test_split(train, test_size=0.2, random_state=42)

    # Convert to Hugging Face Dataset
    COLS = ['text', 'label']

    # Create clean DataFrame with the full training data
    train_df_clean = train[COLS].copy()  # Use 'train' instead of 'train_df'

    # Ensure labels are proper integers
    train_df_clean['label'] = train_df_clean['label'].astype(np.int64)

    # Reset index to ensure clean DataFrame structure
    train_df_clean = train_df_clean.reset_index(drop=True)

    # Create dataset with the full training data
    train_ds = Dataset.from_pandas(train_df_clean, preserve_index=False)

    def tokenize(batch):
        """Tokenizes a batch of text inputs."""
        return tokenizer(batch["text"], truncation=True, max_length=256)

    # Apply tokenization to the full dataset
    train_ds = train_ds.map(tokenize, batched=True, remove_columns=['text'])

    # Add a new padding token
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

    # Resize the model's token embeddings to match the new tokenizer
    model.resize_token_embeddings(len(tokenizer))

    # Set the pad token id in the model's config
    model.config.pad_token_id = tokenizer.pad_token_id

    # 2. Clear HF cache after loading, to free ~16GB
    from huggingface_hub import scan_cache_dir

    cache_info = scan_cache_dir()
    # delete_revisions expects commit hashes, so flatten each repo's revisions
    cache_info.delete_revisions(
        *[rev.commit_hash for repo in cache_info.repos for rev in repo.revisions]
    ).execute()

    # --- Training Arguments ---
    from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
    import tempfile
    import shutil

    # Ensure temp directories exist
    os.makedirs(f"{TEMP_DIR}/training_output/", exist_ok=True)
    os.makedirs(f"{TEMP_DIR}/logs/", exist_ok=True)

    training_args = TrainingArguments(
        output_dir=f"{TEMP_DIR}/training_output/",
        do_train=True,
        do_eval=False,
        save_strategy="no",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=5e-5,
        logging_dir=f"{TEMP_DIR}/logs/",
        logging_steps=500,
        bf16=True,
        fp16=False,
        report_to="none",
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        dataloader_pin_memory=False,
        gradient_checkpointing=True,
    )

    # --- Custom Metric Computation (MAP@3) ---
    def compute_map3(eval_pred):
        """Computes Mean Average Precision at 3 (MAP@3) for evaluation."""
        logits, labels = eval_pred
        probs = torch.nn.functional.softmax(torch.tensor(logits), dim=-1).numpy()

        # Get top 3 predicted class indi...
    ```
    
  6. clickbait_detection_dataset

    • huggingface.co
    Updated Jan 15, 2024
    Cite
    Christina Christodoulou (2024). clickbait_detection_dataset [Dataset]. https://huggingface.co/datasets/christinacdl/clickbait_detection_dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Jan 15, 2024
    Authors
    Christina Christodoulou
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    37,870 texts in total: 17,850 NOT clickbait texts and 20,020 CLICKBAIT texts

    All duplicate values were removed

    Split using sklearn into 80% train and 20% temporary test (stratified by label); the temporary test set was then split 50/50 into test and validation (stratified by label)

    Split: 80/10/10

    Train set label distribution: 0 ==> 14,280, 1 ==> 16,016

    Validation set label distribution: 0 ==> 1,785, 1 ==> 2,002

    Test set label distribution: 0 ==> 1,785, 1 ==> 2,002

    The dataset was created from the… See the full description on the dataset page: https://huggingface.co/datasets/christinacdl/clickbait_detection_dataset.

  7. Classifier Model

    • kaggle.com
    zip
    Updated Feb 4, 2025
    Cite
    Jeriann L Rhymer (2025). Classifier Model [Dataset]. https://www.kaggle.com/datasets/jeriannlrhymer/regression-model/discussion
    Explore at:
    zip (2163 bytes). Available download formats.
    Dataset updated
    Feb 4, 2025
    Authors
    Jeriann L Rhymer
    License

    Community Data License Agreement - Sharing, Version 1.0: https://cdla.io/sharing-1-0/

    Description

    The purpose of this dataset is linear regression practice: handling categorical features in a scikit-learn model, carrying out a train/test split, training a model, and evaluating that model on the testing data. A sketch follows the table below.

    The mpg data set represents the fuel economy (in miles per gallon) for 38 popular models of car, measured between 1999 and 2008.

    Factor        Type                   Description
    manufacturer  multi-valued discrete  Vehicle manufacturer
    model         multi-valued discrete  Model of the vehicle
    displ         continuous             Size of engine [litres]
    year          multi-valued discrete  Year of vehicle manufacture
    cyl           multi-valued discrete  Number of ignition cylinders
    trans         multi-valued discrete  Transmission type (manual or automatic)
    drv           multi-valued discrete  Driven wheels (f=front, 4=4-wheel, r=rear wheel drive)
    city          continuous             Miles per gallon, city driving conditions (fuel economy)
    hwy           continuous             Miles per gallon, highway driving conditions (fuel economy)
    fl            multi-valued discrete  Fuel type
    class         multi-valued discrete  Vehicle class (suv, compact, etc.)
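    A hedged sketch of that exercise, assuming the mpg data is available locally as a CSV with the columns listed above (the file name and choice of features are assumptions):

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    df = pd.read_csv("mpg.csv")  # assumed file name for the dataset described above

    # One-hot encode the categorical features, keep the numeric ones as-is.
    X = pd.get_dummies(df[["displ", "cyl", "drv", "class"]], columns=["drv", "class"])
    y = df["hwy"]  # predict highway fuel economy

    # Train/test split, fit, and evaluate on the held-out data.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
    model = LinearRegression().fit(X_train, y_train)
    print("R^2 on test data:", round(model.score(X_test, y_test), 3))
    ```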

  8. 食品安全主题数据集 (Food Safety Topic Dataset)

    • kaggle.com
    zip
    Updated Mar 2, 2023
    Cite
    镜中青日 (2023). 食品安全主题数据集 [Dataset]. https://www.kaggle.com/datasets/modeststudent/foodsafety-data-zh
    Explore at:
    zip (1395205 bytes). Available download formats.
    Dataset updated
    Mar 2, 2023
    Authors
    镜中青日
    Description

    1. Data source
    From: https://www.luge.ai/#/luge/dataDetail?id=71

    2. Description
    1. Government-affairs data: the dataset suits a food-safety topic classification system that classifies information records and, through model building and semantic analysis, filters out food-safety-related information, helping the relevant authorities supervise efficiently and precisely.
    2. The data has been de-identified; all key information involving locations, personal names, and organizations has been replaced with "*".
    3. In this dataset, 1 = involves a food-safety issue, 0 = does not.

    3. Release notes
    v2-split:
    1. The text fields were merged using the template f"主题:{event_name};详细描述:{content}";
    2. sklearn.model_selection.train_test_split was used to divide the original data into train, dev, and test parts, as sketched below.

    Initial Release: the original dataset.
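    A minimal sketch of the v2-split recipe under stated assumptions: toy data stands in for the real corpus, and the 80/10/10 ratio and the stratification label are assumptions not given in the release note.

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Toy stand-in for the real corpus (real column names may differ).
    df = pd.DataFrame({
        "event_name": ["事件A", "事件B"] * 50,
        "content": [f"描述 {i}" for i in range(100)],
        "label": [i % 2 for i in range(100)],
    })

    # Merge the text fields using the template from the release note.
    df["text"] = df.apply(lambda r: f"主题:{r['event_name']};详细描述:{r['content']}", axis=1)

    # Two-stage split into train/dev/test (80/10/10 assumed).
    train_df, temp_df = train_test_split(df, test_size=0.20, stratify=df["label"], random_state=42)
    dev_df, test_df = train_test_split(temp_df, test_size=0.50, stratify=temp_df["label"], random_state=42)
    print(len(train_df), len(dev_df), len(test_df))  # 80 10 10
    ```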
