MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The benchmarking datasets used for deepBlink. The npz files contain train/valid/test splits and can be used directly. The files belong to the following challenges / classes:
- ISBI Particle Tracking Challenge: microtubule, vesicle, receptor
- Custom synthetic (based on http://smal.ws): particle
- Custom fixed cell: smfish
- Custom live cell: suntag

The csv files map each image in the test splits to its original image, SNR, and density.
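A minimal sketch of opening one of the npz files with numpy (the file name and the split key names are assumptions; inspect `data.files` to see what your copy actually contains):

```python
import numpy as np

# Load one of the deepBlink benchmark files (path is an assumption).
data = np.load("microtubule.npz", allow_pickle=True)

# List the arrays stored inside; the train/valid/test splits live under these keys.
print(data.files)

# Access a split once you know its key name, e.g.:
# x_train = data["x_train"]
```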
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains the text of a remarkable collection of short stories known as the TinyStories Corpus. With over 2,000 annotated stories, it spans a diverse range of styles and genres from multiple sources. Each narrative is enriched with annotations, making the corpus a valuable resource for narrative text classification. The text field in each row contains the full story and can be used to identify plots, characters, and other storytelling features. Through this collection, users gain insight into a wide range of narratives that can be used to build powerful machine learning models for narrative text classification.
In this dataset, each row contains a short story for narrative text classification tasks. The data consists of the following:
- text: the story text itself (string)
- train.csv: the short stories used for narrative text classification training (dataframe)
- validation.csv: a set of short stories for validation (dataframe)
The data in both files can be used for a variety of machine learning tasks related to narrative text classification, including genre classification, predicting reader reactions, and sentiment analysis.
To get started, download the validation and train csv files from the Kaggle datasets page and save them to your local environment. You may then need to preprocess both files, cleaning up wrongly formatted values and removing duplicate entries, since these issues can compromise the accuracy of any downstream experiments.
Next, load the two files into Python pandas dataframes so they can be manipulated and analyzed with common Natural Language Processing (NLP) tools. This only takes a few lines using pandas functions such as read_csv() and concat(), after which the data can be used in a Jupyter Notebook or with machine learning libraries such as scikit-learn for more complex tasks; a short loading sketch follows the next paragraph.
With everything above in place, you can start exploring connections between narratives or character traits using supervised machine learning models such as a Naive Bayes classifier, which can reveal patterns underlying the texts. Once the data is loaded into Python, feel free to make interesting discoveries and predictions from this richly annotated TinyStories narrative dataset!
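A minimal sketch of the loading step described above (the file names train.csv and validation.csv come from this page; the local paths and the cleaning choices are assumptions):

```python
import pandas as pd

# Load the two splits (paths are assumptions; adjust to where you saved the Kaggle files).
train_df = pd.read_csv("train.csv")
valid_df = pd.read_csv("validation.csv")

# Basic cleaning: drop exact duplicates and empty stories.
train_df = train_df.drop_duplicates(subset="text").dropna(subset=["text"])

# Combine both splits into a single frame for exploratory analysis.
all_df = pd.concat([train_df, valid_df], ignore_index=True)
print(all_df.shape)
```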
- Creating a text classification algorithm to automatically categorize short stories by genre.
- Developing an AI-based summarization tool to quickly summarize the main points in a story.
- Developing an AI-based story generator that can generate new stories based on existing ones in the dataset
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv

| Column name | Description |
|:------------|:--------------------------------|
| text | The text of the story. (String) |
File: train.csv

| Column name | Description |
|:------------|:--------------------------------|
| text | The text of the story. (String) |
The details of the files provided, as well as the column information for domain knowledge, is given below:
Files
- train.csv - the training set
- test.csv - the test set
- sample_submission.csv - a sample submission file in the correct format
Columns
- ID - ID of the patient
- A1_Score to A10_Score - Score based on the Autism Spectrum Quotient (AQ) 10-item screening tool
- age - Age of the patient in years
- gender - Gender of the patient
- ethnicity - Ethnicity of the patient
- jaundice - Whether the patient had jaundice at the time of birth
- autism - Whether an immediate family member has been diagnosed with autism
- contry_of_res - Country of residence of the patient
- used_app_before - Whether the patient has undergone a screening test before
- result - Score for the AQ1-10 screening test
- age_desc - Age description of the patient
- relation - Relation of the person who completed the test
- Class/ASD - Classified result as 0 or 1 (0 = No, 1 = Yes). This is the target column; during submission, submit the values as 0 or 1 only.
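A minimal sketch of how these files might be used for the binary Class/ASD prediction task (file and column names come from the lists above; the preprocessing and model choice are illustrative assumptions, not a reference solution):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

train = pd.read_csv("train.csv")

# One-hot encode categorical columns; crude NaN handling for the sketch.
X = pd.get_dummies(train.drop(columns=["ID", "Class/ASD"])).fillna(0)
y = train["Class/ASD"]

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("validation accuracy:", model.score(X_val, y_val))
```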
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
Welcome to the DailyDialog dataset, your gateway to unlocking conversation potential through multi-turn dialog experiences! Our dataset consists of conversations written by humans, which serve as a more accurate reflection of our day-to-day conversations than other datasets. Additionally, we have included manually labeled communication intentions and emotion fields in our data that can be used for advancing dialog systems.
Whether you’re a researcher looking for new approaches in dialog systems or someone simply curious about conversation dynamics from the perspective of computer science – this dataset is here to help! We invite you to explore and make use of this data for its full potential and advance the research field further.
Our three main files (train.csv, validation.csv, test.csv) each provide key columns such as dialog, act, and emotion, enabling you to get an even deeper understanding of how effective conversations really work -- so what are you waiting for? Unlock your conversation potential today with DailyDialog!
Welcome and thank you for your interest in the DailyDialog dataset! This dataset is designed to unlock conversation potential through multi-turn dialog experiences and provide a better understanding of conversations in our day-to-day lives. Whether you are a student, researcher, or just plain curious, this guide is here to help you get started with using the DailyDialog dataset for your own research or exploration.
The DailyDialog dataset includes three files: train.csv, validation.csv, and test.csv which all contain dialog, act and emotion fields that can be used by those who wish to evaluate existing approaches in the field of dialogue systems or perform new experiments on conversational models. All data found in this dataset is written by humans and thus contains less noise than other datasets typically seen online.
The first step when using this dataset is to familiarize yourself with the fields found within each file:
- Dialog – the conversation between two people (String).
- Act – the communication intentions of both parties in the dialogue (String).
- Emotion – labels for any emotions expressed during the dialogue (String).
Once you understand what each of these three fields means, it is time to start exploring! You can use any programming language or statistical method, including text analysis tools such as RapidMiner or Natural Language Processing libraries such as NLTK or spaCy, to explore the fields individually or together. If you are interested specifically in machine learning tasks, there are also possibilities such as generating new conversations from the dataset (e.g., chatbots) using reinforcement learning or deep learning architectures / neural networks for natural language understanding.
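A minimal sketch of loading and inspecting the files described above (file and column names come from this page; how the dialog, act, and emotion fields are serialized inside the csv is an assumption you should verify against your copy):

```python
import pandas as pd

# Load the training split (path is an assumption).
train = pd.read_csv("train.csv")
print(train.columns.tolist())  # expected to include 'dialog', 'act', 'emotion'

# Peek at one conversation and its annotations.
row = train.iloc[0]
print(row["dialog"])
print(row["act"], row["emotion"])

# Rough distribution of emotion annotations across the corpus.
print(train["emotion"].value_counts().head())
```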
All said and done, we believe that the ability to uncover patterns embedded in real-life conversations will help researchers across domains and research areas (e.g., AI / ML) succeed in their efforts, and we wish you an exciting journey :)
- Developing a conversational AI system that can replicate authentic conversations by modeling the emotion and communication intentions present in the DailyDialog dataset.
- Creating a language-learning tool which can customize personalized dialogues based on the DailyDialog data to help foreign language learners get used to spoken dialogue.
- Utilizing the DailyDialog data to develop an interactive chatbot with customized responses and emotions, allowing users to learn more about their conversational skills through simulated conversations
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
NSL-KDD is a data set suggested to solve some of the inherent problems of the KDD'99 data set. Although this new version of the KDD data set still suffers from some of the problems discussed by McHugh and may not be a perfect representation of existing real networks, we believe that, given the lack of public data sets for network-based IDSs, it can still be applied as an effective benchmark data set to help researchers compare different intrusion detection methods.
Furthermore, the number of records in the NSL-KDD train and test sets is reasonable, which makes it affordable to run experiments on the complete set without having to randomly select a small portion. Consequently, the evaluation results of different research works will be consistent and comparable.
Data files
KDDTrain+.ARFF: The full NSL-KDD train set with binary labels in ARFF format
KDDTrain+.TXT: The full NSL-KDD train set including attack-type labels and difficulty level in CSV format
KDDTrain+_20Percent.ARFF: A 20% subset of the KDDTrain+.arff file
KDDTrain+_20Percent.TXT: A 20% subset of the KDDTrain+.txt file
KDDTest+.ARFF: The full NSL-KDD test set with binary labels in ARFF format
KDDTest+.TXT: The full NSL-KDD test set including attack-type labels and difficulty level in CSV format
KDDTest-21.ARFF: A subset of the KDDTest+.arff file which does not include records with difficulty level of 21 out of 21
KDDTest-21.TXT: A subset of the KDDTest+.txt file which does not include records with difficulty level of 21 out of 21
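A minimal sketch of loading the CSV-formatted TXT files with pandas. The files above have no header row; the assumption that each record consists of 41 features followed by an attack-type label and a difficulty level should be checked against the accompanying documentation:

```python
import pandas as pd

# Load the full training set (path is an assumption).
train = pd.read_csv("KDDTrain+.TXT", header=None)
print(train.shape)

# Under the assumption above, the last two columns are the attack-type label
# and the difficulty level.
labels = train.iloc[:, -2]
difficulty = train.iloc[:, -1]
print(labels.value_counts().head())
```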
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the second version of the Google Landmarks dataset (GLDv2), which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version of the dataset contains approximately 5 million images, split into 3 sets of images: train, index and test. The dataset was presented in our CVPR'20 paper. In this repository, we present download links for all dataset files and relevant code for metric computation. This dataset was associated with two Kaggle challenges, on landmark recognition and landmark retrieval. Results were discussed as part of a CVPR'19 workshop. In this repository, we also provide scores for the top 10 teams in the challenges, based on the latest ground-truth version. Please visit the challenge and workshop webpages for more details on the data, tasks and technical solutions from top teams.
https://creativecommons.org/publicdomain/zero/1.0/
One of the leading retail stores in the US, Walmart, would like to predict sales and demand accurately. Certain events and holidays impact sales on each day. Sales data are available for 45 Walmart stores. The business faces a challenge from unforeseen demand and sometimes runs out of stock because the current forecasting approach is inadequate. An ideal ML algorithm would predict demand accurately while accounting for economic factors such as the CPI and the Unemployment Index.
Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labour Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data. Historical sales data for 45 Walmart stores located in different regions are available.
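The five-times weighting described above can be expressed as a weighted mean absolute error. A minimal sketch (function and variable names are illustrative; this is an interpretation of the weighting scheme described here, not the competition's official scoring code):

```python
import numpy as np

def weighted_mae(y_true, y_pred, is_holiday_week):
    """Mean absolute error where holiday weeks get weight 5 and other weeks weight 1."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.where(np.asarray(is_holiday_week, dtype=bool), 5.0, 1.0)
    return np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights)

# Example: the third week is a holiday week, so its error counts five times.
print(weighted_mae([100, 200, 300], [110, 190, 250], [False, False, True]))
```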
The dataset is taken from Kaggle.
https://creativecommons.org/publicdomain/zero/1.0/
Huggingface Hub: link
The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided.
This dataset consists of science questions and their corresponding distractors, correct answers, and supports. The questions are designed to evaluate a person's knowledge of science. The distractors are designed to confuse the test taker and lead them away from the correct answer. The correct answers are provided so that the test taker can check their work. The supports are designed to help the test taker understand the question and find the correct answer
- Train a model to answer scientific questions
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv

| Column name | Description |
|:---------------|:---------------------------------------------------|
| question | The question text. (String) |
| distractor3 | One of the distractors for the question. (String) |
| distractor1 | One of the distractors for the question. (String) |
| distractor2 | One of the distractors for the question. (String) |
| correct_answer | The correct answer for the question. (String) |
| support | The supporting text for the question. (String) |

File: train.csv

| Column name | Description |
|:---------------|:---------------------------------------------------|
| question | The question text. (String) |
| distractor3 | One of the distractors for the question. (String) |
| distractor1 | One of the distractors for the question. (String) |
| distractor2 | One of the distractors for the question. (String) |
| correct_answer | The correct answer for the question. (String) |
| support | The supporting text for the question. (String) |

File: test.csv

| Column name | Description |
|:---------------|:---------------------------------------------------|
| question | The question text. (String) |
| distractor3 | One of the distractors for the question. (String) |
| distractor1 | One of the distractors for the question. (String) |
| distractor2 | One of the distractors for the question. (String) |
| correct_answer | The correct answer for the question. (String) |
| support | The supporting text for the question. (String) |
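A minimal sketch of assembling a multiple-choice item from these columns (file and column names come from the tables above; the shuffling and prompt format are illustrative assumptions):

```python
import random
import pandas as pd

train = pd.read_csv("train.csv")
row = train.iloc[0]

# Mix the correct answer in with the three distractors.
options = [row["correct_answer"], row["distractor1"], row["distractor2"], row["distractor3"]]
random.shuffle(options)

# Build a simple A/B/C/D question string.
prompt = row["question"] + "\n" + "\n".join(
    f"{letter}. {text}" for letter, text in zip("ABCD", options)
)
print(prompt)
print("Answer:", "ABCD"[options.index(row["correct_answer"])])
```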
https://creativecommons.org/publicdomain/zero/1.0/
Context
The data presented here were obtained on a Kali machine at the University of Cincinnati, Cincinnati, Ohio by carrying out packet captures for 1 hour during the evening of Oct 9th, 2023 using Wireshark. The dataset consists of 394,137 instances stored in a CSV (comma-separated values) file. This large dataset can be utilised for different machine learning applications, for instance classification of network traffic, network performance monitoring, network security management, network traffic management, network intrusion detection, and anomaly detection.
The dataset can be used for a variety of machine learning tasks, such as network intrusion detection, traffic classification, and anomaly detection.
Content:
This network traffic dataset consists of 7 features. Each instance contains source and destination IP addresses. The majority of the properties are numeric, but there are also nominal and date types due to the timestamp.
The network traffic flow statistics (No. Time Source Destination Protocol Length Info) were obtained using Wireshark (https://www.wireshark.org/).
Dataset Columns:
- No: number of the instance
- Timestamp: timestamp of the network traffic instance
- Source IP: IP address of the source
- Destination IP: IP address of the destination
- Protocol: protocol used by the instance
- Length: length of the instance
- Info: information about the traffic instance
Acknowledgements:
I would like to thank the University of Cincinnati for providing the infrastructure used to generate this network traffic dataset.
Ravikumar Gattu , Susmitha Choppadandi
Inspiration: This dataset goes beyond the majority of network traffic classification datasets, which only identify the type of application (WWW, DNS, ICMP, ARP, RARP) that an IP flow contains. Instead, it supports machine learning models that can identify specific applications (such as TikTok, Wikipedia, Instagram, YouTube, websites, blogs, etc.) from IP flow statistics (there are currently 25 applications in total).
**Dataset License:** CC0: Public Domain
Dataset Usages: This dataset can be used for different machine learning applications in the field of cybersecurity, such as classification of network traffic, network performance monitoring, network security management, network traffic management, network intrusion detection, and anomaly detection.
ML techniques that benefit from this dataset:
This dataset is highly useful because it consists of 394,137 instances of network traffic generated by 25 applications on public, private, and enterprise networks. It also contains important features that can support most machine learning applications in cybersecurity. A few of the potential machine learning applications that could benefit from this dataset are:
1. Network Performance Monitoring: this large network traffic dataset can be used to analyse traffic and identify patterns in the network, which helps in designing network security algorithms that minimise network problems.
2. Anomaly Detection: a large network traffic dataset can be used to train machine learning models to find irregularities in the traffic, which can help identify cyber attacks.
3. Network Intrusion Detection: this large dataset can be used to train machine learning algorithms and design models for detecting traffic issues, malicious traffic, network attacks, and DoS attacks.
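As a minimal sketch of the anomaly-detection use case above (the Length column name comes from the column list; the file name, feature choice, and model are illustrative assumptions, not a reference pipeline for this dataset):

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Load the capture (path is an assumption) and use packet length as a simple feature.
df = pd.read_csv("network_traffic.csv")
X = df[["Length"]].astype(float)

# Fit an unsupervised anomaly detector; -1 marks instances flagged as anomalous.
model = IsolationForest(contamination=0.01, random_state=0).fit(X)
df["anomaly"] = model.predict(X)
print(df[df["anomaly"] == -1].head())
```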
The ChestX-ray8 dataset contains 108,948 frontal-view X-ray images of 32,717 unique patients.
Each image in the data set contains multiple text-mined labels identifying 14 different pathological conditions. These in turn can be used by physicians to diagnose 8 different diseases. We will use this data to develop a single model that will provide binary classification predictions for each of the 14 labeled pathologies. In other words, it will predict 'positive' or 'negative' for each of the pathologies. You can download the entire dataset for free here (https://nihcc.app.box.com/v/ChestXray-NIHCC).
I have provided a ~1000 image subset of the images here. The dataset includes a CSV file that provides the labels for each X-ray.
To make your job a bit easier, I have processed the labels for our small sample and generated three new files to get you started. These three files are:
- train-small-new.csv: 875 images from our dataset to be used for training.
- valid-small-new.csv: 109 images from our dataset to be used for validation.
- test-small-new.csv: 420 images from our dataset to be used for testing.

This dataset has been annotated by consensus among four different radiologists for 5 of our 14 pathologies:
- Consolidation
- Edema
- Effusion
- Cardiomegaly
- Atelectasis
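A minimal sketch of inspecting the label file described above (the file name comes from this page; the assumption that each of the five consensus pathologies appears as a 0/1 column should be checked against the actual CSV header):

```python
import pandas as pd

# Load the small training split (path is an assumption).
labels = pd.read_csv("train-small-new.csv")
print(labels.columns.tolist())

# If the five consensus pathologies appear as binary columns, this shows how
# often each one is marked positive across the 875 training images.
pathologies = ["Consolidation", "Edema", "Effusion", "Cardiomegaly", "Atelectasis"]
present = [p for p in pathologies if p in labels.columns]
print(labels[present].mean())
```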
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This data set is a resized version of the original Railway Track Fault Detection dataset. The original images are very large, and it takes a considerable amount of time to resize them for training. To eliminate that, I recreated the dataset with the images reduced to 224 x 224 x 3. I also added a csv file, rails.csv, which provides an easy means to load the data set.
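A minimal sketch of the resizing step described above (only the 224 x 224 x 3 target size comes from this page; the image path and the structure of rails.csv are assumptions you should verify against the files):

```python
from PIL import Image
import pandas as pd

# rails.csv is assumed to hold at least a column of image file paths;
# check its header to see what it actually contains.
rails = pd.read_csv("rails.csv")
print(rails.columns.tolist())

# Resize one image to the 224 x 224 RGB format used in this dataset.
img = Image.open("example_track.jpg").convert("RGB").resize((224, 224))
img.save("example_track_224.jpg")
```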
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
From a young age, hopeful talents devote time, money, and training to the sport. Yet, while the next superstar is guaranteed to start off in youth or semi-professional leagues, these leagues often have the fewest resources to invest. This includes resources for the collection of event data which helps generate insights into the performance of the teams and players.
**About Dataset:** This dataset, with 460 training and test videos in 2 folders, was collected from the competition videos. All videos are in MP4 format.
Please note that the number of videos in each folder is different.
Version 1 --> 460 MP4 files in 2 folders + .CSV file
Version 2 --> Coming soon!
competition page: https://www.kaggle.com/competitions/dfl-bundesliga-data-shootout
wish you all the best
We provide you with a data set in CSV format. The data set contains 8,000 train instances and 2,000 test instances. There are 304 input features, labeled x001 to x304.
The target variable is labeled y.
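A minimal sketch of splitting such a file into features and target (the file name is an assumption; the column names x001..x304 and y come from the description above):

```python
import pandas as pd

# Load the training file (path is an assumption).
df = pd.read_csv("train.csv")

# Feature columns are x001 ... x304; the target column is y.
feature_cols = [f"x{i:03d}" for i in range(1, 305)]
X = df[feature_cols]
y = df["y"]
print(X.shape, y.shape)
```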
https://creativecommons.org/publicdomain/zero/1.0/
Huggingface Hub: link
The OpenAI HumanEval dataset is a handcrafted set of 164 programming problems designed to challenge code generation models. The problems include a function signature, docstring, body, and several unit tests, all handwritten to ensure they're not included in the training set of code generation models. The entry point for each problem is the prompt, making it an ideal dataset for testing natural language processing and machine learning models' ability to generate Python programs from scratch
To use this dataset, simply download the zip file and extract it. The resulting directory will contain the following files:
- canonical_solution.py: The solution to the problem. (String)
- entry_point.py: The entry point for the problem. (String)
- prompt.txt: The prompt for the problem. (String)
- test.py: The unit tests for the problem. (String)
- The dataset could be used to develop a model that generates programs from natural language.
- The dataset could be used to develop a model that completes or debugs programs.
- The dataset could be used to develop a model that writes unit tests for programs
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: test.csv

| Column name | Description |
|:-------------------|:----------------------------------------------------------------------------------------------------|
| prompt | A natural language description of the programming problem. (String) |
| canonical_solution | The correct Python code solution to the problem. (String) |
| test | A set of unit tests that the generated code must pass in order to be considered correct. (String) |
| entry_point | The starting point for the generated code. (String) |
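A minimal sketch of how the columns above fit together when checking a candidate solution (column names come from the table; using the canonical_solution in place of model output, and the assumption that each test field defines a check(candidate) function, are illustrative and should be verified; HumanEval-style evaluation is normally run in a sandbox rather than a bare exec):

```python
import pandas as pd

df = pd.read_csv("test.csv")
row = df.iloc[0]

# Assemble a complete program: the prompt (function signature + docstring),
# a candidate completion (here the canonical solution stands in for model output),
# and the unit tests.
program = row["prompt"] + row["canonical_solution"] + "\n" + row["test"]

namespace = {}
exec(program, namespace)                              # run in an isolated namespace
namespace["check"](namespace[row["entry_point"]])     # assumed check(candidate) convention
print("problem passed")
```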