4 datasets found
  1. One Classifier Ignores a Feature

    • data.niaid.nih.gov
    Updated Apr 29, 2022
    Cite
    Maier, Karl (2022). One Classifier Ignores a Feature [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6502642
    Dataset updated
    Apr 29, 2022
    Dataset authored and provided by
    Maier, Karl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data sets are used in a controlled experiment in which two classifiers are compared. train_a.csv and explain.csv are slices of the original data set. train_b.csv contains the same instances as train_a.csv, but with feature x1 set to 0 so that classifier B cannot use it.

    The original data set was created and split using this Python code:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    # Generate a two-feature, two-class problem and scale it up
    X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                               n_informative=2, n_clusters_per_class=1,
                               class_sep=0.75, random_state=0)
    X *= 100

    # Classifier A is trained on the full feature set (train_a.csv)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
    lm = LogisticRegression()
    lm.fit(X_train, y_train)
    clf_a = lm

    # Classifier B is trained on a copy with feature x1 zeroed out (train_b.csv)
    clf_b = LogisticRegression()
    X2 = X.copy()
    X2[:, 0] = 0
    X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, test_size=0.5, random_state=0)
    clf_b.fit(X2_train, y2_train)

    # The held-out half serves as the explanation set (explain.csv)
    X_explain = X_test
    y_explain = y_test
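
    The description mentions comparing the two classifiers; a minimal sketch of such a comparison on the shared explanation set, reusing the variables defined above:

    # Score both classifiers on the same held-out explanation set
    print("clf_a accuracy:", clf_a.score(X_explain, y_explain))
    print("clf_b accuracy:", clf_b.score(X_explain, y_explain))

    # clf_b's weight for x1 stays near zero, since that feature was constantly 0 in its training data
    print("clf_a coefficients:", clf_a.coef_)
    print("clf_b coefficients:", clf_b.coef_)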

  2. 1200 pixels spectral datasets

    • zenodo.org
    zip
    Updated May 21, 2024
    Cite
    Hui Zhang (2024). 1200 pixels spectral datasets [Dataset]. http://doi.org/10.5281/zenodo.11082600
    Dataset updated
    May 21, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hui Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the zip archive, spectral.npy contains the average spectral data of red ginseng, mycotoxins, and interfering impurities, and label.npy contains the corresponding labels. The spectral data have shape [1200, 510] and the label data have shape [1200, 1]. An example of data usage (the scikit-learn Python library is used to build the classification model) follows:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import classification_report, accuracy_score

    # Load spectral data and labels
    x = np.load('.../spectral.npy')[:, 1:-1]
    y = np.load('.../label.npy').ravel()  # flatten [1200, 1] labels to the 1-D shape scikit-learn expects

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

    # Data standardization
    scaler = StandardScaler()
    x_train = scaler.fit_transform(x_train)
    x_test = scaler.transform(x_test)

    # Train the KNN model
    knn_model = KNeighborsClassifier(n_neighbors=5)
    knn_model.fit(x_train, y_train)

    # Predict
    y_pred = knn_model.predict(x_test)

    # Print classification reports and accuracy rates
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    print("Accuracy Score:")
    print(accuracy_score(y_test, y_pred))
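
    Since the labels cover three material classes (red ginseng, mycotoxins, and interfering impurities), a confusion matrix can additionally show which classes get mixed up; a minimal sketch reusing y_test and y_pred from the example above:

    from sklearn.metrics import confusion_matrix

    # Rows are true classes, columns are predicted classes
    print(confusion_matrix(y_test, y_pred))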

  3. Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics"

    • narcis.nl
    • data.mendeley.com
    Updated Jan 11, 2021
    + more versions
    Cite
    Yoo, T (via Mendeley Data) (2021). Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics" [Dataset]. http://doi.org/10.17632/ffn745r57z.2
    Dataset updated
    Jan 11, 2021
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Yoo, T (via Mendeley Data)
    Description

    Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics. Authors: Kazutaka Kamiya, MD, PhD, Ik Hee Ryu, MD, MS, Tae Keun Yoo, MD, Jung Sub Kim, MD, In Sik Lee, MD, PhD, Jin Kook Kim, MD, Wakako Ando, CO, Nobuyuki Shoji, MD, PhD, Tomofusa Yamauchi, MD, PhD, Hitoshi Tabuchi, MD, PhD.

    We hypothesize that machine learning of preoperative biometric data obtained by AS-OCT may be clinically beneficial for predicting the actual ICL vault. Therefore, we built a machine learning model using random forest regression to predict the ICL vault after surgery.

    This multicenter study comprised 1745 eyes of 1745 consecutive patients (656 men and 1089 women) who underwent EVO ICL implantation (V4c and V5 Visian ICL with KS-AquaPORT) for the correction of moderate to high myopia and myopic astigmatism, and who completed at least a 1-month follow-up, at Kitasato University Hospital (Kanagawa, Japan) or at B&VIIT Eye Center (Seoul, Korea).

    This data file (RFR_model(feature=12).mat) is the final trained random forest model for MATLAB 2020a.

    Python version:

    from sklearn.model_selection import train_test_split
    import pandas as pd
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import RandomForestRegressor

    Connect data in your Google Drive:

    from google.colab import auth
    auth.authenticate_user()
    from google.colab import drive
    drive.mount('/content/gdrive')

    Change the path to your custom data. In this case, we predict the ICL vault from preoperative measurements:

    dataset = pd.read_csv('gdrive/My Drive/ICL/data_icl.csv')
    dataset.head()

    Optimal features (sorted by importance):

    1. ICL size
    2. ICL power
    3. LV
    4. CLR
    5. ACD
    6. ATA
    7. MSE
    8. Age
    9. Pupil size
    10. WTW
    11. CCT
    12. ACW

    y = dataset['Vault_1M']
    X = dataset.drop(['Vault_1M'], axis=1)

    Split the dataset into training and test sets, if necessary. For example, we can split the data 8:2 as a simple validation test:

    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

    In our study, the training (B&VIIT Eye Center, n=1455) and test (Kitasato University, n=290) datasets were already defined, so this code was not necessary to perform our analysis.

    An optimal parameter search could be performed in this section:

    parameters = {'bootstrap': True,
                  'min_samples_leaf': 3,
                  'n_estimators': 500,
                  'criterion': 'mae',  # renamed to 'absolute_error' in newer scikit-learn versions
                  'min_samples_split': 10,
                  'max_features': 'sqrt',
                  'max_depth': 6,
                  'max_leaf_nodes': None}

    RF_model = RandomForestRegressor(**parameters)
    RF_model.fit(train_X, train_y)
    RF_predictions = RF_model.predict(test_X)
    importance = RF_model.feature_importances_
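
    To quantify how well the model predicts the vault, the error on the held-out set can be reported; a minimal sketch reusing test_y and RF_predictions from above (the study's own evaluation protocol may differ):

    from sklearn.metrics import mean_absolute_error

    # Mean absolute error between measured and predicted 1-month vault
    print("MAE:", mean_absolute_error(test_y, RF_predictions))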

  4. Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home

    • test.researchdata.tuwien.ac.at
    • researchdata.tuwien.ac.at
    • +1 more
    bin, text/markdown
    Updated Dec 6, 2024
    Cite
    Reinhard Grabler; Michael Starzinger; Matthias Hirschmanner; Helena Anna Frijns (2024). Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home [Dataset]. http://doi.org/10.70124/hbtq5-ykv92
    Dataset updated
    Dec 6, 2024
    Dataset provided by
    TU Wien
    Authors
    Reinhard Grabler; Michael Starzinger; Matthias Hirschmanner; Helena Anna Frijns
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 2024 - Aug 2024
    Description

    Dataset Card for "privacy-care-interactions"

    Dataset Description

    Purpose and Features

    🔒 Collection of Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home 🔒

    The dataset is useful for training and evaluating models that identify and classify privacy-sensitive parts of conversations in text, especially in the context of AI assistants and LLMs.

    Dataset Overview

    Language Distribution 🌍

    • English (en): 95

    Locale Distribution 🌎

    • United States (US) 🇺🇸: 95

    Key Facts 🔑

    • This is synthetic data! Generated using proprietary algorithms - no privacy violations!
    • Conversations are classified following the taxonomy for privacy-sensitive robotics by Rueben et al. (2017).
    • The data was manually labeled by an expert.

    Dataset Structure

    Data Instances

    The provided data format is .jsonl, the JSON Lines text format, also called newline-delimited JSON. An example entry looks as follows.

    { "text": "CW: Have you ever been to Italy? CR: Oh, yes... many years ago.", "taxonomy": 0, "category": 0, "affected_speaker": 1, "language": "en", "locale": "US", "data_type": 1, "uid": 16, "split": "train" }

    Data Fields

    The data fields are:

    • text: a string feature. The abbreviations of the speakers refer to the care worker (CW) and the care recipient (CR).
    • taxonomy: a classification label, with possible values including informational (0), invasion (1), collection (2), processing (3), dissemination (4), physical (5), personal-space (6), territoriality (7), intrusion (8), obtrusion (9), contamination (10), modesty (11), psychological (12), interrogation (13), psychological-distance (14), social (15), association (16), crowding-isolation (17), public-gaze (18), solitude (19), intimacy (20), anonymity (21), reserve (22). The taxonomy is derived from Rueben et al. (2017). The classifications were manually labeled by an expert.
    • category: a classification label, with possible values including personal-information (0), family (1), health (2), thoughts (3), values (4), acquaintance (5), appointment (6). The privacy category affected in the conversation. The classifications were manually labeled by an expert.
    • affected_speaker: a classification label, with possible values including care-worker (0), care-recipient (1), other (2), both (3). The speaker whose privacy is impacted during the conversation. The classifications were manually labeled by an expert.
    • language: a string feature. Language code as defined by ISO 639.
    • locale: a string feature. Regional code as defined by ISO 3166-1 alpha-2.
    • data_type: a classification label, with possible values including real (0), synthetic (1).
    • uid: an int64 feature. A unique identifier within the dataset.
    • split: a string feature. Either train, validation or test.

    Dataset Splits

    The dataset has 2 subsets:

    • split: with a total of 95 examples split into train, validation and test (70%-15%-15%)
    • unsplit: with a total of 95 examples in a single train split
    name     train  validation  test
    split    66     14          15
    unsplit  95     n/a         n/a

    The files follow the naming convention subset-split-language.jsonl. The following files are contained in the dataset (a loading sketch follows the list):

    • split-train-en.jsonl
    • split-validation-en.jsonl
    • split-test-en.jsonl
    • unsplit-train-en.jsonl
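
    Each line of these files is one JSON record, so they can be parsed with the Python standard library alone; a minimal loading sketch (file name taken from the list above):

    import json

    # Read one record per line from the training file
    with open('split-train-en.jsonl', encoding='utf-8') as f:
        records = [json.loads(line) for line in f]

    print(len(records))            # number of examples
    print(records[0]['text'])      # the conversation snippet
    print(records[0]['taxonomy'])  # numeric taxonomy label, e.g. 0 = informational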

    Dataset Creation

    Curation Rationale

    Recording audio of care workers and residents during care interactions, which include partial and full body washing, administering medication, and wound care, is a highly privacy-sensitive use case. Therefore, a dataset was created that contains privacy-sensitive parts of conversations synthesized from real-world data. This dataset serves as a basis for fine-tuning a local LLM to highlight and classify privacy-sensitive sections of transcripts created during care interactions, so that they can be masked to protect privacy.

    Source Data

    Initial Data Collection

    The initial data was collected in the project Caring Robots of TU Wien in cooperation with Caritas Wien. One project track aims to use Large Language Models (LLMs) to support the documentation work of care workers, with LLM-generated summaries of audio recordings of interactions between care workers and care home residents. The initial data are the transcriptions of those care interactions.

    Data Processing

    The transcriptions were thoroughly reviewed, and sections containing privacy-sensitive information were identified and marked by two experts using qualitative data analysis software. Subsequently, the accessible portions of the interviews were translated from German to US English using the locally executed LLM icky/translate. In the next step, a locally run llama3.1:70b model was used to synthesize the conversation segments. This process involved generating similar, yet distinct and new, conversations that are not linked to the original data. The dataset was split using the train_test_split function from scikit-learn (https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.train_test_split.html).
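
    A 70%-15%-15% split like the one described in Dataset Splits can be reproduced with two consecutive train_test_split calls; a minimal sketch (records stands for a list of parsed examples, e.g. loaded from unsplit-train-en.jsonl as in the sketch above; the random_state is an assumption, not the authors' setting):

    from sklearn.model_selection import train_test_split

    # Hold out 30% first, then halve it into validation and test
    train, rest = train_test_split(records, test_size=0.30, random_state=42)  # random_state assumed
    validation, test = train_test_split(rest, test_size=0.50, random_state=42)

    print(len(train), len(validation), len(test))  # 66 / 14 / 15 for 95 examples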

