MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The benchmarking datasets used for deepBlink. The npz files contain train/valid/test splits and can be used directly. The files belong to the following challenges / classes:
- ISBI Particle Tracking Challenge: microtubule, vesicle, receptor
- Custom synthetic (based on http://smal.ws): particle
- Custom fixed cell: smfish
- Custom live cell: suntag

The csv files map each image in the test splits to its original image, SNR, and density.
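A minimal sketch of opening one of the npz files with numpy (the file name and the split key names are assumptions; inspect `data.files` to see what your copy actually contains):

```python
import numpy as np

# Load one of the deepBlink benchmark files (path is an assumption).
data = np.load("microtubule.npz", allow_pickle=True)

# List the arrays stored inside; the train/valid/test splits live under these keys.
print(data.files)

# Access a split once you know its key name, e.g.:
# x_train = data["x_train"]
```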
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains the text of a remarkable collection of short stories known as the TinyStories Corpus. With over 2,000 annotated stories, it spans a diverse range of styles and genres from multiple sources. Each narrative is enriched with annotations, making the corpus a valuable resource for narrative text classification. The text field in each row contains the full story and can be used to identify plots, characters, and other storytelling features. Through this collection, users gain insight into a wide range of narratives that can be used to build powerful machine learning models for narrative text classification.
In this dataset, each row contains a short story for narrative text classification tasks. The data consists of the following:
- text: the story text itself (string)
- train.csv: the short stories used for narrative text classification training (dataframe)
- validation.csv: a set of short stories for validation (dataframe)
The data in both files can be used for a variety of machine learning tasks related to narrative text classification, including genre classification, predicting reader reactions, and sentiment analysis.
To get started, download the validation and train csv files from the Kaggle datasets page and save them to your local environment. You may then need to preprocess both files, cleaning up wrongly formatted values and removing duplicate entries, since these issues can compromise the accuracy of any downstream experiments.
Next, load the two files into Python pandas dataframes so they can be manipulated and analyzed with common Natural Language Processing (NLP) tools. This only takes a few lines using pandas functions such as read_csv() and concat(), after which the data can be used in a Jupyter Notebook or with machine learning libraries such as scikit-learn for more complex tasks; a short loading sketch follows the next paragraph.
With everything above in place, you can start exploring connections between narratives or character traits using supervised machine learning models such as a Naive Bayes classifier, which can reveal patterns underlying the texts. Once the data is loaded into Python, feel free to make interesting discoveries and predictions from this richly annotated TinyStories narrative dataset!
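A minimal sketch of the loading step described above (the file names train.csv and validation.csv come from this page; the local paths and the cleaning choices are assumptions):

```python
import pandas as pd

# Load the two splits (paths are assumptions; adjust to where you saved the Kaggle files).
train_df = pd.read_csv("train.csv")
valid_df = pd.read_csv("validation.csv")

# Basic cleaning: drop exact duplicates and empty stories.
train_df = train_df.drop_duplicates(subset="text").dropna(subset=["text"])

# Combine both splits into a single frame for exploratory analysis.
all_df = pd.concat([train_df, valid_df], ignore_index=True)
print(all_df.shape)
```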
- Creating a text classification algorithm to automatically categorize short stories by genre.
- Developing an AI-based summarization tool to quickly summarize the main points in a story.
- Developing an AI-based story generator that can generate new stories based on existing ones in the dataset
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv

| Column name | Description |
|:------------|:--------------------------------|
| text | The text of the story. (String) |
File: train.csv

| Column name | Description |
|:------------|:--------------------------------|
| text | The text of the story. (String) |
The details of the files provided, as well as the column information for domain knowledge, is given below:
Files
- train.csv - the training set
- test.csv - the test set
- sample_submission.csv - a sample submission file in the correct format
Columns
- ID - ID of the patient
- A1_Score to A10_Score - Score based on the Autism Spectrum Quotient (AQ) 10-item screening tool
- age - Age of the patient in years
- gender - Gender of the patient
- ethnicity - Ethnicity of the patient
- jaundice - Whether the patient had jaundice at the time of birth
- autism - Whether an immediate family member has been diagnosed with autism
- contry_of_res - Country of residence of the patient
- used_app_before - Whether the patient has undergone a screening test before
- result - Score for the AQ1-10 screening test
- age_desc - Age description of the patient
- relation - Relation of the person who completed the test
- Class/ASD - Classified result as 0 or 1 (0 = No, 1 = Yes). This is the target column; during submission, submit the values as 0 or 1 only.
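A minimal sketch of how these files might be used for the binary Class/ASD prediction task (file and column names come from the lists above; the preprocessing and model choice are illustrative assumptions, not a reference solution):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

train = pd.read_csv("train.csv")

# One-hot encode categorical columns; crude NaN handling for the sketch.
X = pd.get_dummies(train.drop(columns=["ID", "Class/ASD"])).fillna(0)
y = train["Class/ASD"]

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("validation accuracy:", model.score(X_val, y_val))
```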
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
Welcome to the DailyDialog dataset, your gateway to unlocking conversation potential through multi-turn dialog experiences! Our dataset consists of conversations written by humans, which serve as a more accurate reflection of our day-to-day conversations than other datasets. Additionally, we have included manually labeled communication intentions and emotion fields in our data that can be used for advancing dialog systems.
Whether you’re a researcher looking for new approaches in dialog systems or someone simply curious about conversation dynamics from the perspective of computer science – this dataset is here to help! We invite you to explore and make use of this data for its full potential and advance the research field further.
Our three main files (train.csv, validation.csv, test.csv) each provide key columns such as dialog, act, and emotion, enabling you to get an even deeper understanding of how effective conversations really work -- so what are you waiting for? Unlock your conversation potential today with DailyDialog!
Welcome and thank you for your interest in the DailyDialog dataset! This dataset is designed to unlock conversation potential through multi-turn dialog experiences and provide a better understanding of conversations in our day-to-day lives. Whether you are a student, researcher, or just plain curious, this guide is here to help you get started with using the DailyDialog dataset for your own research or exploration.
The DailyDialog dataset includes three files: train.csv, validation.csv, and test.csv which all contain dialog, act and emotion fields that can be used by those who wish to evaluate existing approaches in the field of dialogue systems or perform new experiments on conversational models. All data found in this dataset is written by humans and thus contains less noise than other datasets typically seen online.
The first step when using this dataset is to familiarize yourself with the fields found within each file:
- Dialog – the conversation between two people (String).
- Act – the communication intentions of both parties in the dialogue (String).
- Emotion – labels for any emotions expressed during the dialogue (String).
Once you understand what each of these three fields means, it is time to start exploring! You can use any programming language or statistical method, including text analysis tools such as RapidMiner or Natural Language Processing libraries such as NLTK or spaCy, to explore the fields individually or together. If you are interested specifically in machine learning tasks, there are also possibilities such as generating new conversations from the dataset (e.g., chatbots) using reinforcement learning or deep learning architectures / neural networks for natural language understanding.
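A minimal sketch of loading and inspecting the files described above (file and column names come from this page; how the dialog, act, and emotion fields are serialized inside the csv is an assumption you should verify against your copy):

```python
import pandas as pd

# Load the training split (path is an assumption).
train = pd.read_csv("train.csv")
print(train.columns.tolist())  # expected to include 'dialog', 'act', 'emotion'

# Peek at one conversation and its annotations.
row = train.iloc[0]
print(row["dialog"])
print(row["act"], row["emotion"])

# Rough distribution of emotion annotations across the corpus.
print(train["emotion"].value_counts().head())
```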
All said and done, we believe that the ability to uncover patterns embedded in real-life conversations will help researchers across domains and research areas (e.g., AI / ML) succeed in their efforts, and we wish you an exciting journey :)
- Developing a conversational AI system that can replicate authentic conversations by modeling the emotion and communication intentions present in the DailyDialog dataset.
- Creating a language-learning tool which can customize personalized dialogues based on the DailyDialog data to help foreign language learners get used to spoken dialogue.
- Utilizing the DailyDialog data to develop an interactive chatbot with customized responses and emotions, allowing users to learn more about their conversational skills through simulated conversations
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
NSL-KDD is a data set suggested to solve some of the inherent problems of the KDD'99 data set. Although this new version of the KDD data set still suffers from some of the problems discussed by McHugh and may not be a perfect representation of existing real networks, we believe that, given the lack of public data sets for network-based IDSs, it can still be applied as an effective benchmark data set to help researchers compare different intrusion detection methods.
Furthermore, the number of records in the NSL-KDD train and test sets is reasonable, which makes it affordable to run experiments on the complete set without having to randomly select a small portion. Consequently, the evaluation results of different research works will be consistent and comparable.
Data files
KDDTrain+.ARFF: The full NSL-KDD train set with binary labels in ARFF format
KDDTrain+.TXT: The full NSL-KDD train set including attack-type labels and difficulty level in CSV format
KDDTrain+_20Percent.ARFF: A 20% subset of the KDDTrain+.arff file
KDDTrain+_20Percent.TXT: A 20% subset of the KDDTrain+.txt file
KDDTest+.ARFF: The full NSL-KDD test set with binary labels in ARFF format
KDDTest+.TXT: The full NSL-KDD test set including attack-type labels and difficulty level in CSV format
KDDTest-21.ARFF: A subset of the KDDTest+.arff file which does not include records with difficulty level of 21 out of 21
KDDTest-21.TXT: A subset of the KDDTest+.txt file which does not include records with difficulty level of 21 out of 21
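A minimal sketch of loading the CSV-formatted TXT files with pandas. The files above have no header row; the assumption that each record consists of 41 features followed by an attack-type label and a difficulty level should be checked against the accompanying documentation:

```python
import pandas as pd

# Load the full training set (path is an assumption).
train = pd.read_csv("KDDTrain+.TXT", header=None)
print(train.shape)

# Under the assumption above, the last two columns are the attack-type label
# and the difficulty level.
labels = train.iloc[:, -2]
difficulty = train.iloc[:, -1]
print(labels.value_counts().head())
```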
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the second version of the Google Landmarks dataset (GLDv2), which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version of the dataset contains approximately 5 million images, split into 3 sets of images: train, index and test. The dataset was presented in our CVPR'20 paper. In this repository, we present download links for all dataset files and relevant code for metric computation. This dataset was associated with two Kaggle challenges, on landmark recognition and landmark retrieval. Results were discussed as part of a CVPR'19 workshop. In this repository, we also provide scores for the top 10 teams in the challenges, based on the latest ground-truth version. Please visit the challenge and workshop webpages for more details on the data, tasks and technical solutions from top teams.
https://creativecommons.org/publicdomain/zero/1.0/
One of the leading retail stores in the US, Walmart, would like to predict sales and demand accurately. Certain events and holidays impact sales on each day. Sales data are available for 45 Walmart stores. The business faces a challenge from unforeseen demand and sometimes runs out of stock because the current forecasting approach is inadequate. An ideal ML algorithm would predict demand accurately while accounting for economic factors such as the CPI and the Unemployment Index.
Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labour Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data. Historical sales data for 45 Walmart stores located in different regions are available.
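The five-times weighting described above can be expressed as a weighted mean absolute error. A minimal sketch (function and variable names are illustrative; this is an interpretation of the weighting scheme described here, not the competition's official scoring code):

```python
import numpy as np

def weighted_mae(y_true, y_pred, is_holiday_week):
    """Mean absolute error where holiday weeks get weight 5 and other weeks weight 1."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.where(np.asarray(is_holiday_week, dtype=bool), 5.0, 1.0)
    return np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights)

# Example: the third week is a holiday week, so its error counts five times.
print(weighted_mae([100, 200, 300], [110, 190, 250], [False, False, True]))
```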
The dataset is taken from Kaggle.
https://creativecommons.org/publicdomain/zero/1.0/
Huggingface Hub: link
The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided.
This dataset consists of science questions and their corresponding distractors, correct answers, and supports. The questions are designed to evaluate a person's knowledge of science. The distractors are designed to confuse the test taker and lead them away from the correct answer. The correct answers are provided so that the test taker can check their work. The supports are designed to help the test taker understand the question and find the correct answer
- Train a model to answer scientific questions
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv

| Column name | Description |
|:---------------|:---------------------------------------------------|
| question | The question text. (String) |
| distractor3 | One of the distractors for the question. (String) |
| distractor1 | One of the distractors for the question. (String) |
| distractor2 | One of the distractors for the question. (String) |
| correct_answer | The correct answer for the question. (String) |
| support | The supporting text for the question. (String) |

File: train.csv

| Column name | Description |
|:---------------|:---------------------------------------------------|
| question | The question text. (String) |
| distractor3 | One of the distractors for the question. (String) |
| distractor1 | One of the distractors for the question. (String) |
| distractor2 | One of the distractors for the question. (String) |
| correct_answer | The correct answer for the question. (String) |
| support | The supporting text for the question. (String) |

File: test.csv

| Column name | Description |
|:---------------|:---------------------------------------------------|
| question | The question text. (String) |
| distractor3 | One of the distractors for the question. (String) |
| distractor1 | One of the distractors for the question. (String) |
| distractor2 | One of the distractors for the question. (String) |
| correct_answer | The correct answer for the question. (String) |
| support | The supporting text for the question. (String) |
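A minimal sketch of assembling a multiple-choice item from these columns (file and column names come from the tables above; the shuffling and prompt format are illustrative assumptions):

```python
import random
import pandas as pd

train = pd.read_csv("train.csv")
row = train.iloc[0]

# Mix the correct answer in with the three distractors.
options = [row["correct_answer"], row["distractor1"], row["distractor2"], row["distractor3"]]
random.shuffle(options)

# Build a simple A/B/C/D question string.
prompt = row["question"] + "\n" + "\n".join(
    f"{letter}. {text}" for letter, text in zip("ABCD", options)
)
print(prompt)
print("Answer:", "ABCD"[options.index(row["correct_answer"])])
```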
https://creativecommons.org/publicdomain/zero/1.0/
Context
The data presented here were obtained on a Kali machine at the University of Cincinnati, Cincinnati, Ohio by carrying out packet captures for 1 hour during the evening of Oct 9th, 2023 using Wireshark. The dataset consists of 394,137 instances stored in a CSV (comma-separated values) file. This large dataset can be utilised for different machine learning applications, for instance classification of network traffic, network performance monitoring, network security management, network traffic management, network intrusion detection, and anomaly detection.
The dataset can be used for a variety of machine learning tasks, such as network intrusion detection, traffic classification, and anomaly detection.
Content:
This network traffic dataset consists of 7 features. Each instance contains source and destination IP addresses. The majority of the properties are numeric, but there are also nominal and date types due to the timestamp.
The network traffic flow statistics (No. Time Source Destination Protocol Length Info) were obtained using Wireshark (https://www.wireshark.org/).
Dataset Columns:
- No: number of the instance
- Timestamp: timestamp of the network traffic instance
- Source IP: IP address of the source
- Destination IP: IP address of the destination
- Protocol: protocol used by the instance
- Length: length of the instance
- Info: information about the traffic instance
Acknowledgements:
I would like to thank the University of Cincinnati for providing the infrastructure used to generate this network traffic dataset.
Ravikumar Gattu , Susmitha Choppadandi
Inspiration: This dataset goes beyond the majority of network traffic classification datasets, which only identify the type of application (WWW, DNS, ICMP, ARP, RARP) that an IP flow contains. Instead, it supports machine learning models that can identify specific applications (such as TikTok, Wikipedia, Instagram, YouTube, websites, blogs, etc.) from IP flow statistics (there are currently 25 applications in total).
**Dataset License:** CC0: Public Domain
Dataset Usages: This dataset can be used for different machine learning applications in the field of cybersecurity, such as classification of network traffic, network performance monitoring, network security management, network traffic management, network intrusion detection, and anomaly detection.
ML techniques that benefit from this dataset:
This dataset is highly useful because it consists of 394,137 instances of network traffic generated by 25 applications on public, private, and enterprise networks. It also contains important features that can support most machine learning applications in cybersecurity. A few of the potential machine learning applications that could benefit from this dataset are:
1. Network Performance Monitoring: this large network traffic dataset can be used to analyse traffic and identify patterns in the network, which helps in designing network security algorithms that minimise network problems.
2. Anomaly Detection: a large network traffic dataset can be used to train machine learning models to find irregularities in the traffic, which can help identify cyber attacks.
3. Network Intrusion Detection: this large dataset can be used to train machine learning algorithms and design models for detecting traffic issues, malicious traffic, network attacks, and DoS attacks.
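As a minimal sketch of the anomaly-detection use case above (the Length column name comes from the column list; the file name, feature choice, and model are illustrative assumptions, not a reference pipeline for this dataset):

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Load the capture (path is an assumption) and use packet length as a simple feature.
df = pd.read_csv("network_traffic.csv")
X = df[["Length"]].astype(float)

# Fit an unsupervised anomaly detector; -1 marks instances flagged as anomalous.
model = IsolationForest(contamination=0.01, random_state=0).fit(X)
df["anomaly"] = model.predict(X)
print(df[df["anomaly"] == -1].head())
```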
The ChestX-ray8 dataset contains 108,948 frontal-view X-ray images of 32,717 unique patients.
Each image in the data set contains multiple text-mined labels identifying 14 different pathological conditions. These in turn can be used by physicians to diagnose 8 different diseases. We will use this data to develop a single model that will provide binary classification predictions for each of the 14 labeled pathologies. In other words, it will predict 'positive' or 'negative' for each of the pathologies. You can download the entire dataset for free here (https://nihcc.app.box.com/v/ChestXray-NIHCC).
I have provided a ~1000 image subset of the images here. The dataset includes a CSV file that provides the labels for each X-ray.
To make your job a bit easier, I have processed the labels for our small sample and generated three new files to get you started. These three files are:
- train-small-new.csv: 875 images from our dataset to be used for training.
- valid-small-new.csv: 109 images from our dataset to be used for validation.
- test-small-new.csv: 420 images from our dataset to be used for testing.

This dataset has been annotated by consensus among four different radiologists for 5 of our 14 pathologies:
- Consolidation
- Edema
- Effusion
- Cardiomegaly
- Atelectasis
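A minimal sketch of inspecting the label file described above (the file name comes from this page; the assumption that each of the five consensus pathologies appears as a 0/1 column should be checked against the actual CSV header):

```python
import pandas as pd

# Load the small training split (path is an assumption).
labels = pd.read_csv("train-small-new.csv")
print(labels.columns.tolist())

# If the five consensus pathologies appear as binary columns, this shows how
# often each one is marked positive across the 875 training images.
pathologies = ["Consolidation", "Edema", "Effusion", "Cardiomegaly", "Atelectasis"]
present = [p for p in pathologies if p in labels.columns]
print(labels[present].mean())
```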
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This data set is a resized version of the original Railway Track Fault Detection dataset. The original images are very large, and it takes a considerable amount of time to resize them for training. To eliminate that, I recreated the dataset with the images reduced to 224 x 224 x 3. I also added a csv file, rails.csv, which provides an easy means to load the data set.
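A minimal sketch of the resizing step described above (only the 224 x 224 x 3 target size comes from this page; the image path and the structure of rails.csv are assumptions you should verify against the files):

```python
from PIL import Image
import pandas as pd

# rails.csv is assumed to hold at least a column of image file paths;
# check its header to see what it actually contains.
rails = pd.read_csv("rails.csv")
print(rails.columns.tolist())

# Resize one image to the 224 x 224 RGB format used in this dataset.
img = Image.open("example_track.jpg").convert("RGB").resize((224, 224))
img.save("example_track_224.jpg")
```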
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
From a young age, hopeful talents devote time, money, and training to the sport. Yet, while the next superstar is guaranteed to start off in youth or semi-professional leagues, these leagues often have the fewest resources to invest. This includes resources for the collection of event data which helps generate insights into the performance of the teams and players.
**About Dataset:** This dataset, with 460 training and test videos in 2 folders, was collected from the competition videos. All videos are in MP4 format.
Please note that the number of videos in each folder is different.
Version 1 --> 460 MP4 files in 2 folders + .CSV file
Version 2 --> Coming soon!
competition page: https://www.kaggle.com/competitions/dfl-bundesliga-data-shootout
wish you all the best
We provide you with a data set in CSV format. The data set contains 8,000 train instances and 2,000 test instances. There are 304 input features, labeled x001 to x304.
The target variable is labeled y.
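A minimal sketch of splitting such a file into features and target (the file name is an assumption; the column names x001..x304 and y come from the description above):

```python
import pandas as pd

# Load the training file (path is an assumption).
df = pd.read_csv("train.csv")

# Feature columns are x001 ... x304; the target column is y.
feature_cols = [f"x{i:03d}" for i in range(1, 305)]
X = df[feature_cols]
y = df["y"]
print(X.shape, y.shape)
```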
https://creativecommons.org/publicdomain/zero/1.0/
Huggingface Hub: link
The OpenAI HumanEval dataset is a handcrafted set of 164 programming problems designed to challenge code generation models. The problems include a function signature, docstring, body, and several unit tests, all handwritten to ensure they're not included in the training set of code generation models. The entry point for each problem is the prompt, making it an ideal dataset for testing natural language processing and machine learning models' ability to generate Python programs from scratch
To use this dataset, simply download the zip file and extract it. The resulting directory will contain the following files:
- canonical_solution.py: The solution to the problem. (String)
- entry_point.py: The entry point for the problem. (String)
- prompt.txt: The prompt for the problem. (String)
- test.py: The unit tests for the problem. (String)
- The dataset could be used to develop a model that generates programs from natural language.
- The dataset could be used to develop a model that completes or debugs programs.
- The dataset could be used to develop a model that writes unit tests for programs
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: test.csv

| Column name | Description |
|:-------------------|:----------------------------------------------------------------------------------------------------|
| prompt | A natural language description of the programming problem. (String) |
| canonical_solution | The correct Python code solution to the problem. (String) |
| test | A set of unit tests that the generated code must pass in order to be considered correct. (String) |
| entry_point | The starting point for the generated code. (String) |
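A minimal sketch of how the columns above fit together when checking a candidate solution (column names come from the table; using the canonical_solution in place of model output, and the assumption that each test field defines a check(candidate) function, are illustrative and should be verified; HumanEval-style evaluation is normally run in a sandbox rather than a bare exec):

```python
import pandas as pd

df = pd.read_csv("test.csv")
row = df.iloc[0]

# Assemble a complete program: the prompt (function signature + docstring),
# a candidate completion (here the canonical solution stands in for model output),
# and the unit tests.
program = row["prompt"] + row["canonical_solution"] + "\n" + row["test"]

namespace = {}
exec(program, namespace)                              # run in an isolated namespace
namespace["check"](namespace[row["entry_point"]])     # assumed check(candidate) convention
print("problem passed")
```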