8 datasets found

Z
One Classifier Ignores a Feature
data.niaid.nih.gov
Updated Apr 29, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Maier, Karl (2022). One Classifier Ignores a Feature [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6502642
Explore at:
Dataset updated
Apr 29, 2022
Dataset authored and provided by
Maier, Karl
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The data sets are used in a controlled experiment, where two classifiers should be compared. train_a.csv and explain.csv are slices from the original data set. train_b.csv contains the same instances as in train_a.csv, but with feature x1 set to 0 to make it unusable to classifier B.

The original data set was created and split using this Python code:

from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, n_informative=2, n_clusters_per_class=1, class_sep=0.75, random_state=0) X *= 100

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0) lm = LogisticRegression() lm.fit(X_train, y_train) clf_a = lm

clf_b = LogisticRegression() X2 = X.copy() X2[:, 0] = 0 X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, test_size=0.5, random_state=0) clf_b.fit(X2_train, y2_train)

X_explain = X_test y_explain = y_test
1200 pixels spectral datasets
zenodo.org
zip
Updated May 21, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hui Zhang; Hui Zhang (2024). 1200 pixels spectral datasets [Dataset]. http://doi.org/10.5281/zenodo.11082600
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.11082600
Dataset updated
May 21, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Hui Zhang; Hui Zhang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In the Zip, spectral. npy was the average spectral data of red ginseng, mycotoxins and interference impurities, and label. npy was the corresponding label. Spectral data format was [1200,510] and label data format was [1200,1]. The example of data usage (sklearn in Python database was used to establish the classification model) was as follows:

import numpy as np
from sklearn. model_selection import train_test_split
from sklearn. preprocessing import StandardScaler
from sklearn. neighbors import KNeighborsClassifier
from sklearn. metrics import classification_report, accuracy_score

# Load spectral data and labels
x = np.load('.../spectral.npy')[:,1:-1]
y = np.load('.../label.npy')

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Data standardization
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# Train the KNN model
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model. fit(x_train, y_train)

# Predict
y_pred = knn_model.predict(x_test)

# Print classification reports and accuracy rates
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Accuracy Score:")
print(accuracy_score(y_test, y_pred))
h
CIFAR100-custom
huggingface.co
Updated Apr 16, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Andrei Semenov (2024). CIFAR100-custom [Dataset]. https://huggingface.co/datasets/Andron00e/CIFAR100-custom
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 16, 2024
Authors
Andrei Semenov
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Example of usage: from datasets import load_dataset

dataset = load_dataset("Andron00e/CIFAR100-custom") splitted_dataset = dataset["train"].train_test_split(test_size=0.2)
t
Privacy-Sensitive Conversations between Care Workers and Care Home Residents...
test.researchdata.tuwien.ac.at
researchdata.tuwien.ac.at
bin, text/markdown
Updated Dec 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Reinhard Grabler; Michael Starzinger; Matthias Hirschmanner; Matthias Hirschmanner; Helena Anna Frijns; Helena Anna Frijns; Reinhard Grabler; Michael Starzinger; Reinhard Grabler; Michael Starzinger; Reinhard Grabler; Michael Starzinger (2024). Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home [Dataset]. http://doi.org/10.70124/hbtq5-ykv92
Explore at:
bin, text/markdownAvailable download formats
Unique identifier
https://doi.org/10.70124/hbtq5-ykv92
Dataset updated
Dec 6, 2024
Dataset provided by
TU Wien
Authors
Reinhard Grabler; Michael Starzinger; Matthias Hirschmanner; Matthias Hirschmanner; Helena Anna Frijns; Helena Anna Frijns; Reinhard Grabler; Michael Starzinger; Reinhard Grabler; Michael Starzinger; Reinhard Grabler; Michael Starzinger
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Apr 2024 - Aug 2024
Description
Dataset Card for "privacy-care-interactions"

Table of Contents

Dataset Description

Purpose and Features

Dataset Overview

Language Distribution

Locale Distribution

Key Facts

Dataset Structure

Data Instances

Data Fields

Data Splits

Dataset Creation

Curation Rationale

Source Data

Annotations

Personal and Sensitive Information

Considerations for Using the Data

Social Impact of Dataset

Discussion of Biases

Other Known Limitations

Additional Information

Dataset Curators

Licensing Information

Citation Information

Contributions

Dataset Description

Purpose and Features

🔒 Collection of Privacy-Sensitive Conversations between Care Workers and Care Home Residents in an Residential Care Home 🔒

The dataset is useful to train and evaluate models to identify and classify privacy-sensitive parts of conversations from text, especially in the context of AI assistants and LLMs.

Dataset Overview

Total entries: 95

Number of distinct taxonomy categories in the public dataset: 4

Number of distinct conversational categories in public dataset: 7

Papers:

Continues the work of: Privacy Agents: Utilizing Large Language Models to Safeguard Contextual Integrity in Elderly Care

Continues the work of: Prototype of a care documentation support system using audio recordings of care actions and large language models

Language Distribution 🌍

English (en): 95

Locale Distribution 🌎

United States (US) 🇺🇸: 95

Key Facts 🔑

This is synthetic data! Generated using proprietary algorithms - no privacy violations!

Conversations are classified following the taxonomy for privacy-sensitive robotics by Rueben et al. (2017).

The data was manually labeled by an expert.

Dataset Structure

Data Instances

The provided data format is .jsonl, the JSON Lines text format, also called newline-delimited JSON. An example entry looks as follows.

{ "text": "CW: Have you ever been to Italy? CR: Oh, yes... many years ago.", "taxonomy": 0, "category": 0, "affected_speaker": 1, "language": "en", "locale": "US", "data_type": 1, "uid": 16, "split": "train" }

Data Fields

The data fields are:

text: a string feature. The abbreviaton of the speakers refer to the care worker (CW) and the care recipient (CR).

taxonomy: a classification label, with possible values including informational (0), invasion (1), collection (2), processing (3), dissemination (4), physical (5), personal-space (6), territoriality (7), intrusion (8), obtrusion (9), contamination (10), modesty (11), psychological (12), interrogation (13), psychological-distance (14), social (15), association (16), crowding-isolation (17), public-gaze (18), solitude (19), intimacy (20), anonymity (21), reserve (22). The taxonomy is derived from Rueben et al. (2017). The classifications were manually labeled by an expert.

category: a classification label, with possible values including personal-information (0), family (1), health (2), thoughts (3), values (4), acquaintance (5), appointment (6). The privacy category affected in the conversation. The classifications were manually labeled by an expert.

affected_speaker: a classification label, with possible values including care-worker (0), care-recipient (1), other (2), both (3). The speaker whose privacy is impacted during the conversation. The classifications were manually labeled by an expert.

language: a string feature. Language code as defined by ISO 639.

locale: a string feature. Regional code as defined by ISO 3166-1 alpha-2.

data_type: a string a classification label, with possible values including real (0), synthetic (1).

uid: a int64 feature. A unique identifier within the dataset.

split: a string feature. Either train, validation or test.

Dataset Splits

The dataset has 2 subsets:

split: with a total of 95 examples split into train, validation and test (70%-15%-15%)

unsplit: with a total of 95 examples in a single train split

name train validation test
split 66 14 15
unsplit 95 n/a n/a

The files follow the naming convention subset-split-language.jsonl. The following files are contained in the dataset:

split-train-en.jsonl

split-validation-en.jsonl

split-test-en.jsonl

unsplit-train-en.jsonl

Dataset Creation

Curation Rationale

Recording audio of care workers and residents during care interactions, which includes partial and full body washing, giving of medication, as well as wound care, is a highly privacy-sensitive use case. Therefore, a dataset is created, which includes privacy-sensitive parts of conversations, synthesized from real-world data. This dataset serves as a basis for fine-tuning a local LLM to highlight and classify privacy-sensitive sections of transcripts created in care interactions, to further mask them to protect privacy.

Source Data

Initial Data Collection

The intial data was collected in the project Caring Robots of TU Wien in cooperation with Caritas Wien. One project track aims to facilitate Large Languge Models (LLM) to support documentation of care workers, with LLM-generated summaries of audio recordings of interactions between care workers and care home residents. The initial data are the transcriptions of those care interactions.

Data Processing

The transcriptions were thoroughly reviewed, and sections containing privacy-sensitive information were identified and marked using qualitative data analysis software by two experts. Subsequently, the accessible portions of the interviews were translated from German to US English using the locally executed LLM icky/translate. In the next step, another llama3.1:70b was used locally to synthesize the conversation segments. This process involved generating similar, yet distinct and new, conversations that are not linked to the original data. The dataset was split using the train_test_split function from the <a href="https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.train_test_split.html" target="_blank"
CUB 200 Bird Species XML Detection Dataset
kaggle.com
Updated Jan 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sovit Ranjan Rath (2023). CUB 200 Bird Species XML Detection Dataset [Dataset]. https://www.kaggle.com/datasets/sovitrath/cub-200-bird-species-xml-detection-dataset/versions/1
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 8, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Sovit Ranjan Rath
Description
This dataset contains the bounding box annotations of the Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset for object detection in XML format. It has been adapted from 200 Bird Species with 11,788 Images. The train/test split is according to the information provided in the original train_test_split.txt file.
h
AutoGUI-v1-280k
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aldo Rivaldo, AutoGUI-v1-280k [Dataset]. https://huggingface.co/datasets/4lspace/AutoGUI-v1-280k
Explore at:
Authors
Aldo Rivaldo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Authors: Hongxin Li, Jingfan Chen, Jingran Su, Yuntao Chen, Qing Li, Zhaoxiang Zhang Authors' Paper: arXiv Authors' Project Website: AutoGUI: Scaling GUI Grounding with Autonomous Functionality Annotations from LLMs Original dataset: AutoGUI/AutoGUI-v1-702k

changes made:

excluded data with with image resolution of 427x745 and 590x1008 by using Dataset.filter(lambda e: e["image_size"] not in ["427x745","590x1008"]) sliced 40% of total dataset by using Dataset.train_test_split(train_size=0.4… See the full description on the dataset page: https://huggingface.co/datasets/4lspace/AutoGUI-v1-280k.
h
MIDIstral
huggingface.co
Updated Nov 28, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alex (2024). MIDIstral [Dataset]. https://huggingface.co/datasets/asigalov61/MIDIstral
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 28, 2024
Authors
Alex
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
MIDIstral

MIDI images with detailed captions for MIDI description and classification Install and use

from PIL import Image import io

from datasets import load_dataset

Function to deserialize an image

def deserialize_image(byte_data): img_byte_arr = io.BytesIO(byte_data) img = Image.open(img_byte_arr) return img

dataset = load_dataset("asigalov61/MIDIstral", split='train').train_test_split(test_size=0.01)

dataset_split = 'train'… See the full description on the dataset page: https://huggingface.co/datasets/asigalov61/MIDIstral.
h
Korean_STS_all
huggingface.co
Updated Mar 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Haryeom (2025). Korean_STS_all [Dataset]. https://huggingface.co/datasets/CocoRoF/Korean_STS_all
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 18, 2025
Authors
Haryeom
Description
Kakao에서 공개한 NLI Task를 위한 학습용 데이터셋입니다. 2개의 Sentence와 그 관계를 서술하는 총 3개의 컬럼으로 되어있습니다. 총 7,128쌍의 데이터가 존재합니다. Reference: (https://github.com/kakaobrain/kor-nlu-datasets)

사용예시

dataset_repo = "x2bee/Korean_STS_all" dataset = load_dataset(dataset_repo) dataset = dataset['train']

test_size = 0.1 test_split_seed = 42

split_dataset = dataset.train_test_split(test_size=test_size, seed=test_split_seed) train_dataset = split_dataset["train"] test_dataset = split_dataset["test"]

def… See the full description on the dataset page: https://huggingface.co/datasets/CocoRoF/Korean_STS_all.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

name	train	validation	test
split	66	14	15
unsplit	95	n/a	n/a

Facebook

Twitter

Click to copy link

Link copied

Cite

Maier, Karl (2022). One Classifier Ignores a Feature [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6502642

One Classifier Ignores a Feature

Explore at:

Dataset updated

Apr 29, 2022

Dataset authored and provided by

Maier, Karl

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The data sets are used in a controlled experiment, where two classifiers should be compared. train_a.csv and explain.csv are slices from the original data set. train_b.csv contains the same instances as in train_a.csv, but with feature x1 set to 0 to make it unusable to classifier B.

The original data set was created and split using this Python code:

from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, n_informative=2, n_clusters_per_class=1, class_sep=0.75, random_state=0) X *= 100

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0) lm = LogisticRegression() lm.fit(X_train, y_train) clf_a = lm

clf_b = LogisticRegression() X2 = X.copy() X2[:, 0] = 0 X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, test_size=0.5, random_state=0) clf_b.fit(X2_train, y2_train)

X_explain = X_test y_explain = y_test

Clear search

Close search

Google apps

Main menu

One Classifier Ignores a Feature

1200 pixels spectral datasets

CIFAR100-custom

Privacy-Sensitive Conversations between Care Workers and Care Home Residents...

Dataset Card for "privacy-care-interactions"

Table of Contents

Dataset Description

Purpose and Features

Dataset Overview

Language Distribution 🌍

Locale Distribution 🌎

Key Facts 🔑

Dataset Structure

Data Instances

Data Fields

Dataset Splits

Dataset Creation

Curation Rationale

Source Data

Initial Data Collection

Data Processing

CUB 200 Bird Species XML Detection Dataset

AutoGUI-v1-280k

MIDIstral

Function to deserialize an image

Korean_STS_all

One Classifier Ignores a Feature