10 datasets found
  1. MeDAL Dataset

    • kaggle.com
    • opendatalab.com
    • +1 more
    zip
    Updated Nov 16, 2020
    Cite
    xhlulu (2020). MeDAL Dataset [Dataset]. https://www.kaggle.com/xhlulu/medal-emnlp
    Explore at:
    Available download formats: zip (7,324,382,521 bytes)
    Dataset updated
    Nov 16, 2020
    Authors
    xhlulu
    Description


    Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) is a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. It was published at the ClinicalNLP workshop at EMNLP.

    💻 Code · 🤗 Dataset (Hugging Face) · 💾 Dataset (Kaggle) · 💽 Dataset (Zenodo) · 📜 Paper (ACL) · 📝 Paper (Arxiv) · ⚡ Pre-trained ELECTRA (Hugging Face)

    Downloading the data

    We recommend downloading from Kaggle if you can authenticate through their API. The advantage of Kaggle is that the data is compressed, so it will be faster to download. Links to the data can be found at the top of the readme.

    First, you will need to create an account on kaggle.com. Afterwards, you will need to install the kaggle API:

    pip install kaggle

    Then, you will need to follow the instructions here to add your username and key. Once that's done, you can run:

    kaggle datasets download xhlulu/medal-emnlp

    Now, unzip everything and place them inside the data directory:

    unzip -nq crawl-300d-2M-subword.zip -d data
    mv data/pretrain_sample/* data/
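
    If you prefer to stay in Python, the same download step can be scripted with the Kaggle API client. This is a minimal sketch, assuming your ~/.kaggle/kaggle.json credentials are already configured; it is not part of the official instructions above.

    ```python
    # Sketch: download and unzip the MeDAL dataset with the Kaggle Python client.
    # Assumes ~/.kaggle/kaggle.json already holds your username and key.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()

    # Downloads the archive into ./data and unzips it in place.
    api.dataset_download_files("xhlulu/medal-emnlp", path="data", unzip=True)
    ```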

    Loading FastText Embeddings

    For the LSTM models, we will need to use the fastText embeddings. To do so, first download and extract the weights:

    wget -nc -P data/ https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
    unzip -nq data/crawl-300d-2M-subword.zip -d data/
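
    As a quick way to check that the embeddings extracted correctly, you can load them with the fasttext package. This is only a sketch: it assumes the archive contains the usual crawl-300d-2M-subword.bin file, and the training scripts in the repo may consume the vectors differently.

    ```python
    # Sketch: load the fastText subword model and query a vector.
    # Assumes data/crawl-300d-2M-subword.bin was extracted by the step above.
    import fasttext

    ft = fasttext.load_model("data/crawl-300d-2M-subword.bin")
    vec = ft.get_word_vector("electrocardiogram")  # works for OOV words via subwords
    print(vec.shape)  # (300,)
    ```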

    Model Quickstart

    Using Torch Hub

    You can directly load LSTM and LSTM-SA with torch.hub:

    ```python
    import torch

    lstm = torch.hub.load("BruceWen120/medal", "lstm")
    lstm_sa = torch.hub.load("BruceWen120/medal", "lstm_sa")
    ```

    If you want to use the Electra model, you need to first install transformers:

    pip install transformers

    Then, you can load it with torch.hub:

    import torch
    electra = torch.hub.load("BruceWen120/medal", "electra")

    Using Huggingface transformers

    If you are only interested in the pre-trained ELECTRA weights (without the disambiguation head), you can load it directly from the Hugging Face Repository:

    from transformers import AutoModel, AutoTokenizer
    
    model = AutoModel.from_pretrained("xhlu/electra-medal")
    tokenizer = AutoTokenizer.from_pretrained("xhlu/electra-medal")
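
    As a quick sanity check of the pre-trained encoder (a sketch using the standard transformers API; the example sentence is illustrative only), you can tokenize a snippet of medical text and inspect the contextual embeddings:

    ```python
    # Sketch: run the pre-trained ELECTRA encoder on a short medical sentence.
    import torch
    from transformers import AutoModel, AutoTokenizer

    model = AutoModel.from_pretrained("xhlu/electra-medal")
    tokenizer = AutoTokenizer.from_pretrained("xhlu/electra-medal")

    text = "The patient was administered 5 mg of IV morphine."
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
    ```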
    

    Citation

    Download the bibtex here, or copy the text below:

    @inproceedings{wen-etal-2020-medal,
        title = "{M}e{DAL}: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining",
        author = "Wen, Zhi and Lu, Xing Han and Reddy, Siva",
        booktitle = "Proceedings of the 3rd Clinical Natural Language Processing Workshop",
        month = nov,
        year = "2020",
        address = "Online",
        publisher = "Association for Computational Linguistics",
        url = "https://www.aclweb.org/anthology/2020.clinicalnlp-1.15",
        pages = "130--135",
    }

    License, Terms and Conditions

    The ELECTRA model is licensed under Apache 2.0. The licenses for the libraries used in this project (transformers, pytorch, etc.) can be found in their respective GitHub repositories. Our model is released under an MIT license.

    The original dataset was retrieved and modified from the NLM website. By using this dataset, you are bound by the terms and conditions specified by NLM:

    INTRODUCTION

    Downloading data from the National Library of Medicine FTP servers indicates your acceptance of the following Terms and Conditions: No charges, usage fees or royalties are paid to NLM for this data.

    MEDLINE/PUBMED SPECIFIC TERMS

    NLM freely provides PubMed/MEDLINE data. Please note some PubMed/MEDLINE abstracts may be protected by copyright.

    GENERAL TERMS AND CONDITIONS

    • Users of the data agree to:

      • acknowledge NLM as the source of the data by including the phrase "Courtesy of the U.S. National Library of Medicine" in a clear and conspicuous manner,
      • properly use registration and/or trademark symbols when referring to NLM products, and
      • not indicate or imply that NLM has endorsed its products/services/applications.
    • Users who republish or redistribute the data (services, products or raw data) agree to:

      • maintain the most current version of all distributed data, or
      • make known in a clear and conspicuous manner that the products/services/applications do not reflect the most current/accurate data available from NLM.
    • These data are produced with a reasonable standard of care, but NLM makes no warranties express or implied, including no warranty of merchantability or fitness for particular purpose, regarding the accuracy or completeness of the data. Users agree to hold NLM and the U.S. Government harmless from any liability resulting from errors in the data. NLM disclaims any liability for any consequences due to use, misuse, or interpretation of information contained or not contained in the data.

    • NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. See the NLM Copyright page.

    • NLM reserves the right to change the type and format of its machine-readable data. NLM will take reasonable steps to inform users of any changes to the format of the data before the data are distributed via the announcement section or subscription to email and RSS updates.

  2. Bill Authentication

    • kaggle.com
    Updated Dec 4, 2020
    Cite
    Sagnick Bhar (2020). Bill Authentication [Dataset]. https://www.kaggle.com/sagnickbhar/bill-authentication/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 4, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sagnick Bhar
    Description

    Dataset

    This dataset was created by Sagnick Bhar


  3. Banknote authentication

    • kaggle.com
    Updated Feb 8, 2025
    Cite
    Pranjali Pilankar (2025). Banknote authentication [Dataset]. https://www.kaggle.com/datasets/pranjalipilankar/banknote-authentication/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 8, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Pranjali Pilankar
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Pranjali Pilankar

    Released under Apache 2.0


  4. Bank Note Authentication

    • kaggle.com
    Updated Nov 2, 2022
    Cite
    Ranjitha C (2022). Bank Note Authentication [Dataset]. https://www.kaggle.com/datasets/ranjithablr955/bank-note-authentication/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 2, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ranjitha C
    Description

    Dataset

    This dataset was created by Ranjitha C


  5. data_banknote_authentication

    • kaggle.com
    Updated Dec 18, 2017
    Cite
    Jackson Harper (2017). data_banknote_authentication [Dataset]. https://www.kaggle.com/jacksonharper/data_banknote_authentication/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 18, 2017
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jackson Harper
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Context

    Taken from: http://archive.ics.uci.edu/ml/datasets/banknote+authentication


  6. BankNote Authentication UCI

    • kaggle.com
    Updated Jan 26, 2020
    Cite
    Shantanu (2020). BankNote Authentication UCI [Dataset]. https://www.kaggle.com/shantanuss/banknote-authentication-uci/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 26, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shantanu
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Context

    Data were extracted from images that were taken from genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print inspection was used. The final images have 400x400 pixels. Due to the object lens and the distance to the investigated object, gray-scale pictures with a resolution of about 660 dpi were obtained. A Wavelet Transform tool was used to extract features from the images. (A short loading sketch follows the attribute list below.)

    Attribute Information:

    1. variance of Wavelet Transformed image (continuous)
    2. skewness of Wavelet Transformed image (continuous)
    3. curtosis (kurtosis) of Wavelet Transformed image (continuous)
    4. entropy of image (continuous)
    5. class (integer)
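
    The sketch below shows one way to load the data and fit a baseline classifier. It assumes the raw UCI file (comma-separated, no header row); the file name and column names are assumptions, so adjust them if you use the Kaggle CSV instead.

    ```python
    # Sketch: load the banknote data and fit a simple baseline classifier.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    cols = ["variance", "skewness", "curtosis", "entropy", "class"]
    df = pd.read_csv("data_banknote_authentication.txt", header=None, names=cols)

    X_train, X_test, y_train, y_test = train_test_split(
        df[cols[:-1]], df["class"], test_size=0.2, random_state=0
    )
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(f"held-out accuracy: {clf.score(X_test, y_test):.3f}")
    ```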

    Acknowledgements

    https://archive.ics.uci.edu/ml/datasets/banknote+authentication#

    Source:

    Owner of database: Volker Lohweg (University of Applied Sciences, Ostwestfalen-Lippe, volker.lohweg '@' hs-owl.de)
    Donor of database: Helene Dörksen (University of Applied Sciences, Ostwestfalen-Lippe, helene.doerksen '@' hs-owl.de)
    Date received: August, 2012

  7. Intrusion Detect. CICEV2023: DDoS Attack Profiling

    • kaggle.com
    zip
    Updated Mar 27, 2025
    Cite
    Agung Pambudi (2025). Intrusion Detect. CICEV2023: DDoS Attack Profiling [Dataset]. https://www.kaggle.com/datasets/agungpambudi/secure-intrusion-detection-ddos-attacks-profiling/code
    Explore at:
    Available download formats: zip (231,762,852 bytes)
    Dataset updated
    Mar 27, 2025
    Authors
    Agung Pambudi
    Description

    To cite the dataset please reference it as Y. Kim, S. Hakak, and A. Ghorbani. "DDoS Attack Dataset (CICEV2023) against EV Authentication in Charging Infrastructure," in 2023 20th Annual International Conference on Privacy, Security and Trust (PST), IEEE Computer Society, pp. 1-9, August 2023.

    Explore a comprehensive dataset capturing DDoS attack scenarios within electric vehicle (EV) charging infrastructure. This dataset features diverse machine learning attributes, including packet access counts, system status details, and authentication profiles across multiple charging stations and grid services. Simulated attack scenarios, authentication protocols, and extensive profiling results offer invaluable insights for training and testing detection models in safeguarding EV charging systems against cyber threats.

    Figure 1: Proposed simulator structure (source: Y. Kim, S. Hakak, and A. Ghorbani).


    Acknowledgment:

    The authors sincerely appreciate the support provided by the Canadian Institute for Cybersecurity (CIC), as well as the funding received from the Canada Research Chair and the Atlantic Canada Opportunities Agency (ACOA).


    Reference:

    Y. Kim, S. Hakak, and A. Ghorbani. "DDoS Attack Dataset (CICEV2023) against EV Authentication in Charging Infrastructure," in 2023 20th Annual International Conference on Privacy, Security and Trust (PST), IEEE Computer Society, pp. 1-9, August 2023.

  8. Employee Total Hours Timeseries Prediction

    • kaggle.com
    Updated Jul 17, 2021
    Cite
    sunil sharanappa (2021). Employee Total Hours Timeseries Prediction [Dataset]. https://www.kaggle.com/datasets/sunilsharanappa/employee-totalhours-timeseries-prediction/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 17, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    sunil sharanappa
    Description

    Context

    This dataset is intended for developers who want to build and train models on employee login, logout, and total time spent at work. It covers 1 January 2020 through 26 June 2021 and has four columns: Date, In, Out, and Total_Hours.

    This is genuine data from an employee whose login and logout were recorded in an external application. The employee is expected to log in and log out in that application every day (due to mandatory work from home during the pandemic); otherwise, when the employee goes to the office, the data is captured automatically through a flap barrier and stored in the external system.

    Content

    The file contains data for one employee: login time, logout time, and total time spent (logout time minus login time), captured over a period of 1.5 years (2020-2021).

    For this employee, Saturday and Sunday are off and Indian public holidays apply. On Saturdays, Sundays, public holidays, and days when the employee is on leave (sick/privilege), the In and Out columns contain the value "Public holiday/weekend" and total hours are 0 for that day.
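
    A minimal loading sketch is shown below. The file name, date format, and the assumption that Total_Hours is numeric are all guesses; adjust them to the actual CSV.

    ```python
    # Sketch: load the timesheet, drop non-working days, and plot weekly averages.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("employee_total_hours.csv", parse_dates=["Date"])  # assumed file name

    # Weekends, public holidays, and leave days are marked in the In/Out columns.
    work = df[df["In"] != "Public holiday/weekend"].copy()
    work["Total_Hours"] = pd.to_numeric(work["Total_Hours"], errors="coerce")

    weekly = work.set_index("Date")["Total_Hours"].resample("W").mean()
    weekly.plot(title="Average hours worked per week")
    plt.show()
    ```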

  9. Data from: User activity dataset

    • kaggle.com
    Updated Apr 7, 2025
    Cite
    Rasika Ekanayaka @ devLK (2025). User activity dataset [Dataset]. https://www.kaggle.com/datasets/rasikaekanayakadevlk/user-activity-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 7, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rasika Ekanayaka @ devLK
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Behavioral Intelligence Report: Multi-Factor User Risk Profiling

    This dataset is a high-resolution behavioral footprint of digital identities tracked across secure systems. Each entry captures a single user's interaction pattern within a session or time slice, painting a picture far deeper than mere login records. Think of this as a blend of cyber forensics, psychology, and data science.

    What's Captured?

    Each row is a behavioral profile with metrics spanning authentication behavior, interaction fidelity, device trust, and anomaly indicators. Together, these features feed into systems that assess identity trustworthiness, risk of compromise, and insider threat potential; a baseline modeling sketch follows the attribute list below.

    Key Attributes Explained:

    • Authentication Behavior

      • failedLoginAttempts: A direct indicator of either poor user memory or brute-force attempts.
      • accessFrequency: Number of times a user accessed the system within the profiling window.
      • loginConsistency: Measures how predictably a user logs in (e.g., same times/days).
    • Device & Location Trust

      • deviceConsistency: Tracks if the user is using familiar devices (0 = new/unknown).
      • accessLocationConsistency: Are they logging in from expected geo-zones?
    • Behavioral Biometrics

      • dwellTime: Time actively spent during the session (in seconds).
      • mouseMovements, scrollBehavior: Subtle cues that indicate "humanness"; used to detect bots, fatigue, or impersonation.
    • Security Alerts

      • incidentReports: Number of alerts triggered due to policy or behavioral violations.
      • passwordResets: Self-service or forced resets, often signs of account insecurity.
      • anomalousActivity: Binary flag (1 = something off), typically flagged by an AI/ML anomaly engine.
      • failedTransactions: Operations attempted but not successfully executed, possibly unauthorized access to functions or errors due to unfamiliar environments.
    • Session & Access Details

      • sessionDuration: Total time (HH:MM:SS) the session remained open.
      • mfaEnabled: Is Multi-Factor Authentication active? (1 = Yes, 0 = No)
      • accessToSensitiveData: Whether this session accessed confidential datasets or restricted zones.
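
    As one way to turn these attributes into a baseline risk model, the sketch below trains a supervised classifier against the anomalousActivity flag. The file name is an assumption, the feature list follows the attributes above, and sessionDuration is skipped because it is stored as an HH:MM:SS string.

    ```python
    # Sketch: baseline supervised risk model using anomalousActivity as the label.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("user_activity.csv")  # assumed file name

    features = [
        "failedLoginAttempts", "accessFrequency", "loginConsistency",
        "deviceConsistency", "accessLocationConsistency",
        "dwellTime", "mouseMovements", "scrollBehavior",
        "incidentReports", "passwordResets", "failedTransactions",
        "mfaEnabled", "accessToSensitiveData",
    ]
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["anomalousActivity"], test_size=0.2, random_state=0
    )
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)
    print(f"held-out accuracy: {clf.score(X_test, y_test):.3f}")
    ```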

    Use Cases

    This telemetry log is ideal for:

    • Zero Trust Security Models: Assessing whether to escalate authentication requirements in real time.
    • Insider Threat Detection: Watching for signs of abnormal but authenticated misuse.
    • User Risk Scoring: Feeding into scoring models for risk-adaptive access control systems.
    • Behavioral Forensics: Reconstructing incidents during breach investigations.
    • ML Training Sets: Labeled behaviors for supervised models in security analytics.

    Sample Insights

    • Repeated use of unknown devices with inconsistent locations might indicate account hijacking.
    • Low mouse movement with long dwell time and high scroll behavior? Possibly bot or automated scraping.
    • Sudden spikes in failed transactions or password resets? Red flags for social engineering or session hijack attempts.
  10. noxmap

    • kaggle.com
    zip
    Updated Nov 29, 2020
    Cite
    parseltung (2020). noxmap [Dataset]. https://www.kaggle.com/parselt/noxmap
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Nov 29, 2020
    Authors
    parseltung
    Description

    import os
    import datetime

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    import ee
    import ee.mapclient
    from google.oauth2.credentials import Credentials

    # Authenticate and initialize the Earth Engine client.
    ee.Authenticate()
    ee.Initialize()

    # Monthly mean tropospheric NO2 from Sentinel-5P (January 2019).
    column = 'tropospheric_NO2_column_number_density'
    dataset = "COPERNICUS/S5P/OFFL/L3_NO2"
    begin_date = '2019-01-01'
    end_date = '2019-01-31'

    s5p = ee.ImageCollection(dataset).filterDate(begin_date, end_date)
    ime = s5p.mean().select(column)

    # Export the mean image to Google Drive at 10 km resolution.
    task = ee.batch.Export.image.toDrive(**{
        'image': ime,
        'description': 'ime',
        'folder': 'Example_folder',
        'scale': 10000
    })
    task.start()
