Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This evaluation dataset contains example utterances for the "change order" intent from Bitext's pre-built Customer Service domain (which covers common intents shared across Bitext's 20 pre-built domains). The data can be used to evaluate intent recognition models and Natural Language Understanding (NLU) platforms.
The dataset contains 10,000 utterances, extracted from a larger dataset of over 1,000,000 utterances, and includes language register variations such as politeness, colloquial language, swearing, and indirect style. To select the utterances, we used stratified sampling to generate a dataset with a general user language register profile.
The dataset also reflects commonly occurring linguistic phenomena of real-life chatbots, such as:
- spelling mistakes
- run-on words
- missing punctuation
Each entry in the dataset contains an example utterance along with its corresponding intent, category, and additional linguistic information. Each line contains the following four fields:
- flags: the applicable linguistic flags
- utterance: an example user utterance
- category: the high-level intent category
- intent: the intent corresponding to the user utterance
The dataset contains annotations for linguistic phenomena, which can be used to adapt bot training to different user language profiles. These flags are:
- B: Basic syntactic structure
- L: Lexical variation (synonyms)
- M: Morphological variation (plurals, tenses…)
- C: Complex/Coordinated syntactic structure
- E: Expanded abbreviations (I'm -> I am, I'd -> I would…)
- I: Interrogative structure
- K: Keyword only
- P: Politeness variation
- Q: Colloquial variation
- W: Offensive language
- Z: Noise (spelling, punctuation…)
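As a rough sketch of working with this four-field layout, the snippet below parses an invented two-row sample (the utterances and flag combinations are hypothetical, not taken from the real file) and expands each flags string into its individual phenomenon codes:

```python
import csv
import io

# Hypothetical sample in the four-field layout described above
# (flags, utterance, category, intent); real rows will differ.
sample = """flags,utterance,category,intent
BIP,could you help me change my order?,ORDER,change_order
BLZ,i wanna modify my order,ORDER,change_order
"""

rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    # Each character in the flags field marks one phenomenon, e.g. P = politeness.
    phenomena = sorted(set(row["flags"]))
    print(row["intent"], phenomena)
```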
These phenomena make the training dataset more effective and make bots more accurate and robust.
The intent categories covered by the dataset are: ORDER
The intents covered by the dataset are: change_order
(c) Bitext Innovations, 2022
http://rightsstatements.org/vocab/InC/1.0/
The dataset and source code for paper "Automating Intention Mining".
The code is based on dennybritz's implementation of Yoon Kim's paper Convolutional Neural Networks for Sentence Classification.
By default, the code uses TensorFlow 0.12. Other TensorFlow versions may report errors due to API incompatibilities.
Running 'online_prediction.py', you can input any sentence and check the classification result produced by a pre-trained CNN model. The model uses all sentences from the four GitHub projects as training data.
Running 'play.py', you can get the evaluation result of cross-project prediction. Please check the code for more details of the configuration. By default, it uses the four GitHub projects as training data to predict the sentences in the DECA dataset; in this setting, the categories 'aspect evaluation' and 'others' are dropped since the DECA dataset does not contain them.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The proposed research intends to improve the current service desk model by embedding Conversational Language Understanding (CLU) processes in the chatbot, so that it can understand the user's input, automate the ticket resolution process, and improve customer service experience and efficiency. The CLU model will be trained to cover the expected range of user inputs. The chatbot will then be designed around five main dialogue flows: changing the user's current password, checking the user's mobile number listed in Azure Active Directory (AAD), updating the user's mobile number in AAD, creating a new ticket in the ticketing system, and creating a follow-up ticket in the ticketing system. A trained CLU model with a high prediction score on the proposed dialogue flows will then be embedded in the chatbot design. The result would be a next-level chatbot that can understand and classify the user's intent, automate the user's Level 1 (L1) requests without any human technician's interaction, and create a ticket in the ticketing system for any request that the chatbot does not yet cover.
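The routing described above can be sketched as a mapping from classified intents to the five dialogue flows, with ticket creation as the fallback for uncovered or low-confidence requests. All intent labels and handler names below are hypothetical placeholders, not the proposal's actual CLU schema:

```python
# Minimal sketch: route a classified intent to one of five dialogue flows.
# Intent labels, handler names, and the confidence threshold are assumptions.

def change_password(user): return f"Password-change flow started for {user}"
def check_mobile(user): return f"Looking up mobile number for {user} in AAD"
def update_mobile(user): return f"Updating mobile number for {user} in AAD"
def create_ticket(user): return f"New ticket created for {user}"
def follow_up_ticket(user): return f"Follow-up ticket created for {user}"

FLOWS = {
    "change_password": change_password,
    "check_mobile_number": check_mobile,
    "update_mobile_number": update_mobile,
    "create_ticket": create_ticket,
    "create_followup_ticket": follow_up_ticket,
}

def route(intent, user, score=1.0, threshold=0.8):
    # Intents the bot does not cover, or low-confidence predictions,
    # fall back to creating a ticket, as the proposal describes.
    handler = FLOWS.get(intent) if score >= threshold else None
    return (handler or create_ticket)(user)
```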
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset containing basic conversations, mental health FAQ, classical therapy conversations, and general advice provided to people suffering from anxiety and depression.
This dataset can be used to train a model for a chatbot that can behave like a therapist in order to provide emotional support to people with anxiety & depression.
The dataset contains intents. An "intent" is the intention behind a user's message. For instance, if a user says "I am sad" to the chatbot, the intent would be "sad". Each intent has an associated set of Patterns and Responses: Patterns are example user messages that align with the intent, while Responses are the replies the chatbot provides for that intent. Various intents are defined, and their patterns and responses are used as the model's training data for identifying a particular intent.
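The intent/patterns/responses structure described above is typically stored as a list of tagged entries; a minimal sketch (with an invented "sad" entry, not actual rows from this dataset) and the derivation of (pattern, tag) training pairs might look like this:

```python
# Hypothetical entry in the intents format described above: an intent tag
# with its example patterns and candidate responses.
intents = [
    {
        "tag": "sad",
        "patterns": ["I am sad", "I feel down", "I'm feeling low"],
        "responses": [
            "I'm sorry to hear that. Do you want to talk about it?",
            "That sounds hard. I'm here to listen.",
        ],
    }
]

# Training pairs for an intent classifier: (pattern, tag)
training_pairs = [(p, item["tag"]) for item in intents for p in item["patterns"]]
```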
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Online Shoppers Purchasing Intention Dataset
Dataset Summary
This dataset is a reupload of the Online Shoppers Purchasing Intention Dataset from the UCI Machine Learning Repository.
NOTE: The information below is from the original dataset description from UCI's website.
Overview
Of the 12,330 sessions in the dataset, 84.5% (10,422) were negative class samples that did not end with shopping, and the rest (1,908) were positive class samples… See the full description on the dataset page: https://huggingface.co/datasets/jlh/uci-shopper.
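The class balance quoted above can be verified directly from the stated counts:

```python
# Reproducing the class balance quoted above: of 12,330 sessions,
# 10,422 are negative (no purchase); the rest are positive.
total = 12330
negative = 10422
positive = total - negative

negative_share = negative / total
print(positive, round(100 * negative_share, 1))
```

With this imbalance, accuracy alone is a weak metric: a model predicting "no purchase" for every session already scores 84.5%.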
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Job Classification Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/HRAnalyticRepository/job-classification-dataset on 30 September 2021.
--- Dataset description provided by original source is as follows ---
This is a dataset containing fictional job class spec information. Typically, job class specs contain information that characterizes the job class (its features) and a label to predict from those features, in this case a pay grade.
The data is a static snapshot. The contents are:
- ID: a sequential number
- Job Family ID
- Job Family Description
- Job Class ID
- Job Class Description
- PayGrade: numeric
- Education Level
- Experience
- Organizational Impact
- Problem Solving
- Supervision
- Contact Level
- Financial Budget
- PG: alpha label for PayGrade
This data is purely fictional
The intent is to use machine learning classification algorithms to predict PG from Educational level through to Financial budget information.
Typically, job classification in HR is time-consuming and cumbersome as a manual activity. The intent is to show how machine learning and People Analytics can be brought to bear on this task.
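A toy sketch of the prediction task: encode the ordinal job-spec features and classify the pay grade. The rows, encodings, and the 1-nearest-neighbour rule below are invented for illustration; the description does not prescribe a particular algorithm:

```python
# Invented ordinal encodings of (education, experience, org_impact,
# problem_solving, supervision, contact_level, financial_budget) -> PG.
train = [
    ((1, 1, 1, 1, 1, 1, 1), "A"),
    ((2, 2, 2, 2, 2, 2, 2), "B"),
    ((4, 4, 4, 4, 4, 4, 4), "D"),
]

def predict(features):
    # 1-nearest neighbour on squared distance; a simple stand-in for the
    # classification algorithms the description has in mind.
    def dist(row):
        return sum((a - b) ** 2 for a, b in zip(features, row[0]))
    return min(train, key=dist)[1]
```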
--- Original source retains full ownership of the source dataset ---
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The data was obtained from the following website: https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset
Sakar, C. and Kastro, Y. (2018). Online Shoppers Purchasing Intention Dataset. UCI Machine Learning Repository. https://doi.org/10.24432/C5F88Q.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Quantitative analysis of intent classification module.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset consists of 400 text-only multi-turn conversations in the English language, based on 10 categories and 19 use cases and prepared for fine-tuning. It has been generated with ethically sourced human-in-the-loop data methods and aligned with supervised fine-tuning, direct preference optimization, and reinforcement learning from human feedback.
The human-annotated data is focused on data quality and precision to enhance the generative response of models used for AI chatbots, thereby improving their recall memory and recognition ability for continued assistance.
Key Features
- Prompts focused on user intent, devised using natural language processing techniques.
- Multi-turn prompts with up to 5 turns to enhance the responsive memory of large language models for pretraining.
- Conversational interactions for queries related to varied aspects of writing, coding, knowledge assistance, data manipulation, reasoning, and classification.
Dataset Source
Subject matter expert annotators @SoftAgeAI have annotated the data at simple and complex levels, focusing on quality factors such as content accuracy, clarity, coherence, grammar, depth of information, and overall usefulness.
Structure & Fields
The dataset is organized into different columns, which are detailed below:
- P1, R1, P2, R2, P3, R3, P4, R4, P5 (object): The sequence of prompts (P) and responses (R) within a single interaction. Each interaction can have up to 5 prompts and 5 corresponding responses, capturing the flow of a conversation. The prompts are user inputs, and the responses are the model's outputs.
- Use Case (object): Specifies the primary application or scenario for which the interaction is designed, such as "Q&A helper" or "Writing assistant." This classification helps in identifying the purpose of the dialogue.
- Type (object): Indicates the complexity of the interaction, with entries labeled as "Complex" in this dataset. This denotes that the dialogues involve more intricate and multi-layered exchanges.
- Category (object): Broadly categorizes the interaction type, such as "Open-ended QA" or "Writing." This provides context on the nature of the conversation, whether it is for generating creative content, providing detailed answers, or engaging in complex problem-solving.

Intended Use Cases
The dataset can enhance query-assistance models for shopping, coding, creative writing, travel assistance, marketing, citation, academic writing, language assistance, research topics, specialized knowledge, reasoning, and STEM-based queries. The dataset intends to aid generative models for e-commerce, customer assistance, marketing, education, suggestive user queries, and generic chatbots. It can pre-train large language models with supervision-based fine-tuned annotated data and support retrieval-augmented generative models. The dataset is free of violence-based interactions that could lead to harm, conflict, discrimination, brutality, or misinformation.

Potential Limitations & Biases
This is a static dataset, so the information is dated May 2024.
Note
If you have any questions related to our data annotation and human review services for large language model training and fine-tuning, please contact us at SoftAge Information Technology Limited at info@softage.ai.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
One of the primary challenges in using deep learning models is the scarcity of datasets of sufficient size to effectively train these networks, and the hurdles in acquiring them. This is particularly significant in object detection, shape completion, and fracture assembly. Instead of scanning a large number of real-world fragments, it is possible to generate massive datasets with synthetic pieces. However, realistic fragmentation is computationally intensive in both the preparation (e.g., pre-fractured models) and the generation. In contrast, simpler algorithms such as Voronoi diagrams provide faster processing speeds at the expense of realism. Hence, a balance between computational efficiency and realism is required when generating large datasets for machine learning.
We propose a GPU-based fragmentation method that improves on the baseline Discrete Voronoi Chain to complete this dataset generation task. The dataset in this repository includes voxelized fragments from high-resolution 3D models, curated to be used as training sets for machine learning models. More specifically, these models come from an archaeological dataset, which led to more than 1M fragments from 1,052 Iberian vessels. In this dataset, fragments are not stored individually; instead, the fragmented voxelizations are provided in a compressed binary file (.rle.zip). Once uncompressed, each fragment is represented by a different number in the grid. The class to which each vessel belongs is also included in class.csv. The GPU-based pipeline that generated this dataset is explained at https://doi.org/10.1016/j.cag.2024.104104.
Please note that this dataset originally provided voxel data, point clouds, and triangle meshes. However, we opted to include only voxel data because (1) the original dataset is too large to be uploaded to Zenodo and (2) the original intent of our paper is to generate implicit data in the form of voxels. If you are interested in the whole dataset (450 GB), please visit the web page of our research institute.
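Since each fragment is a distinct integer in the uncompressed grid, fragment ids and sizes can be tallied with a simple pass over the voxels. The tiny grid below is an invented stand-in for a real uncompressed .rle file, and the assumption that 0 marks empty space is ours; consult the linked paper for the actual encoding:

```python
from collections import Counter

# Toy stand-in for an uncompressed fragment grid: each nonzero integer
# labels one fragment; 0 is assumed to be empty space.
grid = [
    [0, 1, 1, 0],
    [2, 2, 1, 0],
    [2, 2, 3, 3],
]

voxels_per_fragment = Counter(v for row in grid for v in row if v != 0)
fragment_ids = sorted(voxels_per_fragment)
print(fragment_ids, dict(voxels_per_fragment))
```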
snips
This is a text classification dataset. It is intended for machine learning research and experimentation. This dataset is obtained via formatting another publicly available data to be compatible with our AutoIntent Library.
Usage
It is intended to be used with our AutoIntent Library: from autointent import Dataset
snips = Dataset.from_hub("AutoIntent/snips")
Source
This dataset is taken from benayas/snips and formatted with our AutoIntent Library:… See the full description on the dataset page: https://huggingface.co/datasets/DeepPavlov/snips.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset containing the images and labels for the MultNIST data used in the CVPR NAS workshop Unseen-data challenge under the codename "Mateo". The MultNIST dataset is constructed from MNIST images. The intention of this dataset is to require machine learning models to do more than image classification: they must also perform a calculation, in this case multiplication followed by a mod operation. For each image, three MNIST images were randomly chosen and combined through the colour channels, resulting in a three-colour-channel image in which each MNIST image occupies one colour channel. The data is in channels-first format with a shape of (n, 3, 28, 28), where n is the number of samples in the corresponding set (50,000 for training, 10,000 for validation, and 10,000 for testing). There are ten classes in the dataset, with 7,000 examples of each, distributed evenly between the three subsets. The label of each image is generated using the formula "(r * g * b) % 10", where r, g, and b are the red, green, and blue colour channels respectively. For example, an RGB configuration of 3, 7, and 4 would result in a label of 4 ((3 * 7 * 4) % 10).
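The label rule is simple enough to state in one line of code, using the worked example from the description:

```python
# The MultNIST label rule quoted above: multiply the three channel digits,
# then take the result mod 10.
def multnist_label(r, g, b):
    return (r * g * b) % 10

# The description's example: channels (3, 7, 4) -> (3 * 7 * 4) % 10 = 4
print(multnist_label(3, 7, 4))  # 4
```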
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
**************** NTU Dataset ReadMe file *******************
Please consider the latest version.

Attached files contain our data, collected inside the Nanyang Technological University campus, for pedestrian intention prediction. The dataset is particularly designed to capture spontaneous vehicle influences on pedestrian crossing/not-crossing intention. We utilize this dataset in our paper "Context Model for Pedestrian Intention Prediction using Factored Latent-Dynamic Conditional Random Fields", submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence.

The dataset consists of 35 crossing and 35 stopping* (not-crossing) scenarios. The image sequences are in the 'Image_sequences' folder. The 'stopping_instants.csv' and 'crossing_instants.csv' files provide the stopping and crossing instants respectively, used for labeling the data and providing ground truth for evaluation. Camera1 and Camera2 images are synchronized; two cameras were used to capture the whole scene of interest.

We provide pedestrian and vehicle bounding boxes obtained from [1]. Occlusions and mis-detections are linearly interpolated. All necessary detections are stored in the 'Object_detector_pedestrians_vehicles' folder. Each column within the csv files ('car_bndbox_..') corresponds to a unique tracked car within each image sequence. Each of the pedestrian csv files ('ped_bndbox_..') contains only one column, as we consider each pedestrian in the scene separately.

Additional details:
* [xmin xmax ymin ymax] = left right top down
* Dataset frequency: 15 fps
* Camera parameters (in pixels): f = 1135, principal point = (960, 540)

Additionally, we provide semantic segmentation output [2] and our depth parameters.
As the data were collected in two phases, there are two files in each folder, highlighting the sequences in each phase. Crossing sequences 1-28 and stopping sequences 1-24 were collected in Phase 1, while crossing sequences 29-35 and stopping sequences 25-35 were collected in Phase 2. We obtained the optical flow from [3]. Our model (FLDCRF and LSTM) codes are available in the 'Models' folder.

If you use our dataset in your research, please cite our paper:
"S. Neogi, M. Hoy, W. Chaoqun, J. Dauwels, 'Context Based Pedestrian Intention Prediction Using Factored Latent Dynamic Conditional Random Fields', IEEE SSCI-2017."

Please email us if you have any questions:
1. Satyajit Neogi, PhD Student, Nanyang Technological University @ satyajit001@e.ntu.edu.sg
2. Justin Dauwels, Associate Professor, Nanyang Technological University @ jdauwels@ntu.edu.sg

Our other group members include:
3. Dr. Michael Hoy @ mch.hoy@gmail.com
4. Dr. Kang Dang @ kangdang@gmail.com
5. Ms. Lakshmi Prasanna Kachireddy
6. Mr. Mok Bo Chuan Lance
7. Dr. Hang Yu @ fhlyhv@gmail.com

References:
1. S. Ren, K. He, R. Girshick, J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", NIPS 2015.
2. A. Kendall, V. Badrinarayanan, R. Cipolla, "Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding", BMVC 2017.
3. C. Liu, "Beyond Pixels: Exploring New Representations and Applications for Motion Analysis", Doctoral Thesis, Massachusetts Institute of Technology, May 2009.

* Please note, we had to remove sequence Stopping-33 for privacy reasons.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 4,992 rows of structured information derived from a triage system designed for managing and prioritizing comments in collaborative environments. Using advanced machine learning techniques, such as GEMMA-2B for intent classification, Hugging Face models for sentiment analysis, and Latent Dirichlet Allocation (LDA) for topic modeling, each comment is analyzed across six dimensions: urgency, importance, sentiment, actionability, resolution status, and thematic relevance.
The dataset can support tasks in comment triage and prioritization, sentiment analysis, intent classification, and topic modeling.
Key Features:
- Hierarchical Labels: Multi-level classifications (level_0 to level_4) for each comment.
- Priority Scores: Aggregated values representing the criticality of each comment.
- Sentiment Analysis: Positive, neutral, and negative sentiment scoring.
- LDA Topics: Thematic insights for comment context.
Metadata:
- Rows: 4,992
- Columns: 49
- Tags: NLP, Machine Learning, Sentiment Analysis, Comment Triage, Topic Modeling, Collaboration
File Details:
- Filename: triaged_comments_with_priority_and_labels_hierarchy.csv
- License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The sensitivity of hyperspectral imaging (imaging spectroscopy) to haemoglobin derivatives makes it a promising tool for detection and classification of blood. However, due to complexity and high dimensionality of hyperspectral images, the development of hyperspectral blood detection algorithms is challenging. To facilitate their development, we present a new hyperspectral blood detection dataset. This dataset consists of 14 hyperspectral images (ENVI format) of a mock-up scene containing blood and visually similar substances (e.g. artificial blood or tomato concentrate). Images were taken over a period of three weeks and differ in terms of background composition and lighting intensity. To facilitate the use of data, the dataset includes an annotation of classes: pixels where blood and similar substances are visible have been marked by the authors. The main intention behind the dataset is to serve as testing data for Machine Learning methods for hyperspectral target detection and classification.
Deep Learning Hard (DL-HARD) is an annotated dataset designed to more effectively evaluate neural ranking models on complex topics. It builds on TREC Deep Learning (DL) questions extensively annotated with query intent categories, answer types, wikified entities, topic categories, and result type metadata from a leading web search engine. DL-HARD contains 50 queries from the official 2019/2020 evaluation benchmark, half of which are newly and independently assessed. Overall, DL-HARD is a new resource that promotes research on neural ranking methods by focusing on challenging and complex queries.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview: This is a lab-based dataset with videos recording volunteers (medical students) washing their hands as part of a hand-washing monitoring and feedback experiment. The dataset is collected in the Medical Education Technology Center (METC) of Riga Stradins University, Riga, Latvia. In total, 72 participants took part in the experiments, each washing their hands three times, in a randomized order, going through three different hand-washing feedback approaches (user interfaces of a mobile app). The data was annotated in real time by a human operator, in order to give the experiment participants real-time feedback on their performance. There are 212 hand washing episodes in total, each of which is annotated by a single person. The annotations classify the washing movements according to the World Health Organization's (WHO) guidelines by marking each frame in each video with a certain movement code.
This dataset is part of a series of three datasets, all following the same format:
https://zenodo.org/record/4537209 - data collected in Pauls Stradins Clinical University Hospital
https://zenodo.org/record/5808764 - data collected in Jurmala Hospital
https://zenodo.org/record/5808789 - data collected in the Medical Education Technology Center (METC) of Riga Stradins University
Note #1: we recommend that when using this dataset for machine learning, allowances are made for the reaction speed of the human operator labeling the data. For example, the annotations can be expected to be incorrect a short while after the person in the video switches their washing movements.
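One way to implement the allowance suggested in Note #1 is to mark frames shortly after each annotated movement change as unreliable, absorbing the operator's reaction delay. The window length below is an illustrative assumption, not a value recommended by the dataset authors:

```python
# Build a reliability mask over per-frame movement labels: frames within
# `window` frames after a label change are flagged as unreliable, since the
# human operator needed time to react to the movement switch.
def mask_transition_frames(labels, window=8):
    mask = [True] * len(labels)
    for i in range(1, len(labels)):
        if labels[i] != labels[i - 1]:
            for j in range(i, min(i + window, len(labels))):
                mask[j] = False
    return mask
```

Such a mask can then be used to exclude the flagged frames from training or to down-weight them in the loss.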
Application: The intention of this dataset is to serve as a basis for training machine learning classifiers for automated hand washing movement recognition and quality control.
Statistics:
Frame rate: ~16 FPS (slightly variable, as the videos are reconstructed from sequences of JPG images taken at the maximum framerate supported by the capturing devices).
Resolution: 640x480
Number of videos: 212
Number of annotation files: 212
Movement codes (in JSON files):
1: Hand washing movement — Palm to palm
2: Hand washing movement — Palm over dorsum, fingers interlaced
3: Hand washing movement — Palm to palm, fingers interlaced
4: Hand washing movement — Backs of fingers to opposing palm, fingers interlocked
5: Hand washing movement — Rotational rubbing of the thumb
6: Hand washing movement — Fingertips to palm
0: Other hand washing movement
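The movement codes above translate directly into a lookup table for decoding the annotation files. The JSON field name below ("labels") is a hypothetical example; the actual schema of the annotation files may differ:

```python
import json

# Movement codes as listed above, mapped to the WHO movement names.
MOVEMENTS = {
    0: "Other hand washing movement",
    1: "Palm to palm",
    2: "Palm over dorsum, fingers interlaced",
    3: "Palm to palm, fingers interlaced",
    4: "Backs of fingers to opposing palm, fingers interlocked",
    5: "Rotational rubbing of the thumb",
    6: "Fingertips to palm",
}

# Hypothetical per-frame annotation snippet; the real JSON schema may differ.
annotation = json.loads('{"labels": [1, 1, 3, 3, 0]}')
names = [MOVEMENTS[c] for c in annotation["labels"]]
```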
Note #2: The original dataset of JPG images is available upon request. There are 13 annotation classes in the original dataset: for each of the six washing movements defined by the WHO, "correct" and "incorrect" execution is marked with two different labels. In this published dataset, all incorrect executions are marked with code 0, as "other" washing movement.
Acknowledgments: The dataset collection was funded by the Latvian Council of Science project: "Automated hand washing quality control and quality evaluation system with real-time feedback", No: lzp - Nr. 2020/2-0309.
References: For more detailed information, see this article, describing a similar dataset collected in a different project:
M. Lulla, A. Rutkovskis, A. Slavinska, A. Vilde, A. Gromova, M. Ivanovs, A. Skadins, R. Kadikis, A. Elsts. Hand-Washing Video Dataset Annotated According to the World Health Organization’s Hand-Washing Guidelines. Data. 2021; 6(4):38. https://doi.org/10.3390/data6040038
Contact information: atis.elsts@edi.lv
https://creativecommons.org/publicdomain/zero/1.0/
By SoLID (From Huggingface) [source]
The dataset consists of multiple files for different purposes. The validation.csv file contains a set of carefully selected assembly shellcodes that serve the purpose of validation. These shellcodes are used to ensure the accuracy and integrity of any models or algorithms trained on this dataset.
The train.csv file contains both the intent column, which describes the purpose or objective behind each specific shellcode, and its corresponding assembly code snippets in order to facilitate supervised learning during training procedures. This file proves to be immensely valuable for researchers, practitioners, and developers seeking to study or develop effective techniques for dealing with malicious code analysis or security-related tasks.
For testing purposes, the test.csv file provides yet another collection of assembly shellcodes that can be employed as test cases to assess the performance, robustness, and generalization capability of various models or methodologies developed within this domain.
Understanding the Dataset
The dataset consists of multiple files that serve different purposes:
train.csv: This file contains the intent and corresponding assembly code snippets for training purposes. It can be used to train machine learning models or develop algorithms based on shellcode analysis.

test.csv: The test.csv file in the dataset contains a collection of assembly shellcodes specifically designed for testing purposes. You can use these shellcodes to evaluate and validate your models or analysis techniques.

validation.csv: The validation.csv file includes a set of assembly shellcodes that are specifically reserved for validation purposes. These shellcodes can be used separately to ensure the accuracy and reliability of your models.

Columns in the Dataset
The columns available in each CSV file are as follows:
intent: The intent column describes the purpose or objective of each specific shellcode entry. It provides information regarding what action or achievement is intended by using that particular piece of code.
snippet: The snippet column contains the actual assembly code corresponding to each intent entry in its respective row. It includes all necessary instructions and data required to execute the desired action specified by that intent.
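Loading the two columns is a standard CSV read; the sketch below uses an invented two-row sample in place of the real train.csv, and the quoted snippet shows how a multi-line assembly snippet would sit inside one CSV field:

```python
import csv
import io

# Invented two-row sample in the intent/snippet layout described above.
sample = """intent,snippet
exit the program,"mov eax, 1
int 0x80"
clear a register,"xor eax, eax"
"""

with io.StringIO(sample) as f:
    rows = list(csv.DictReader(f))

# (intent, snippet) pairs ready for supervised training.
pairs = [(r["intent"], r["snippet"]) for r in rows]
```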
Utilizing the Dataset
To effectively utilize this dataset, follow these general steps:
Familiarize yourself with assembly language: Assembly language is essential when working with shellcodes since they consist of low-level machine instructions understood by processors directly.
Explore intents: Start by analyzing and understanding different intents present in the dataset entries thoroughly. Each intent represents a specific goal or purpose behind creating an individual piece of code.
Examine snippets: Review the assembly code snippets corresponding to each intent entry. Carefully study the instructions and data used in the shellcode, as they directly influence their intended actions.
Train your models: If you are working on machine learning or algorithm development, utilize the train.csv file to train your models based on the labeled intent and snippet data provided. This step will enable you to build powerful tools for analyzing or detecting shellcodes automatically.

Evaluate using test datasets: Use the various assembly shellcodes present in test.csv to evaluate and validate your trained models or analysis techniques. This evaluation will help assess the performance, robustness, and generalization capability of your approach.
- Malware analysis: The dataset can be used for studying and analyzing various shellcode techniques used in malware attacks. Researchers and security professionals can use this dataset to develop detection and prevention mechanisms against such attacks.
- Penetration testing: Security experts can use this dataset to simulate real-world attack scenarios and test the effectiveness of their defensive measures. By having access to a diverse range of shellcodes, they can identify vulnerabilities in systems and patch them before malicious actors exploit them.
- Machine learning training: This dataset can be used to train machine learning models for automatic detection or classification of shellcodes. By combining the intent column (which describes the objective of each shellcode) with the corresponding assembly code snippets, researchers can develop algorithms that automatically identify the purpose or ...
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically