100+ datasets found

The LargeST Benchmark Dataset
kaggle.com
Updated Jun 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
liuxu77 (2023). The LargeST Benchmark Dataset [Dataset]. https://www.kaggle.com/datasets/liuxu77/largest
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 13, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
liuxu77
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This is the official website for downloading the CA sub-dataset of the LargeST benchmark dataset. There are a total of 7 files in this page. Among them, 5 files in .h5 format contain the traffic flow raw data from 2017 to 2021, 1 file in .csv format provides the metadata for sensors, and 1 file in .npy format represents the adjacency matrix constructed based on road network distances. Please refer to https://github.com/liuxu77/LargeST for more information.
Predictive Maintenance Dataset
kaggle.com
Updated Nov 7, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Himanshu Agarwal (2022). Predictive Maintenance Dataset [Dataset]. https://www.kaggle.com/datasets/hiimanshuagarwal/predictive-maintenance-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 7, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Himanshu Agarwal
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
A company has a fleet of devices transmitting daily sensor readings. They would like to create a predictive maintenance solution to proactively identify when maintenance should be performed. This approach promises cost savings over routine or time based preventive maintenance, because tasks are performed only when warranted.

The task is to build a predictive model using machine learning to predict the probability of a device failure. When building this model, be sure to minimize false positives and false negatives. The column you are trying to Predict is called failure with binary value 0 for non-failure and 1 for failure.
Fake News Prediction Dataset
kaggle.com
Updated Nov 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rajat Kumar (2023). Fake News Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/rajatkumar30/fake-news
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 3, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Rajat Kumar
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Description
** Please Upvote if you like the dataset **

Fake news or hoax news is false or misleading information presented as news. Fake news often has the aim of damaging the reputation of a person or entity, or making money through advertising revenue.

This dataset is having Both Fake and Real news.

The columns present in the dataset are:-

1) Title -> Title of the News

2) Text -> Text or Content of the News

3) Label -> Labelling the news as Fake or Real
🫀 Heart Disease Dataset
kaggle.com
Updated Apr 8, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
mexwell (2024). 🫀 Heart Disease Dataset [Dataset]. https://www.kaggle.com/datasets/mexwell/heart-disease-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 8, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
mexwell
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This heart disease dataset is curated by combining 5 popular heart disease datasets already available independently but not combined before. In this dataset, 5 heart datasets are combined over 11 common features which makes it the largest heart disease dataset available so far for research purposes. The five datasets used for its curation are:

Cleveland

Hungarian

Switzerland

Long Beach VA

Statlog (Heart) Data Set.

This dataset consists of 1190 instances with 11 features. These datasets were collected and combined at one place to help advance research on CAD-related machine learning and data mining algorithms, and hopefully to ultimately advance clinical diagnosis and early treatment.

Acknowlegement

Foto von Kenny Eliason auf Unsplash
Online Sales Dataset - Popular Marketplace Data
kaggle.com
Updated May 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ShreyanshVerma27 (2024). Online Sales Dataset - Popular Marketplace Data [Dataset]. https://www.kaggle.com/datasets/shreyanshverma27/online-sales-dataset-popular-marketplace-data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 25, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
ShreyanshVerma27
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset provides a comprehensive overview of online sales transactions across different product categories. Each row represents a single transaction with detailed information such as the order ID, date, category, product name, quantity sold, unit price, total price, region, and payment method.

Columns:

Order ID: Unique identifier for each sales order.

Date:Date of the sales transaction.

Category:Broad category of the product sold (e.g., Electronics, Home Appliances, Clothing, Books, Beauty Products, Sports).

Product Name:Specific name or model of the product sold.

Quantity:Number of units of the product sold in the transaction.

Unit Price:Price of one unit of the product.

Total Price: Total revenue generated from the sales transaction (Quantity * Unit Price).

Region:Geographic region where the transaction occurred (e.g., North America, Europe, Asia).

Payment Method: Method used for payment (e.g., Credit Card, PayPal, Debit Card).

Insights:

1. Analyze sales trends over time to identify seasonal patterns or growth opportunities.

2. Explore the popularity of different product categories across regions.

3. Investigate the impact of payment methods on sales volume or revenue.

4. Identify top-selling products within each category to optimize inventory and marketing strategies.

5. Evaluate the performance of specific products or categories in different regions to tailor marketing campaigns accordingly.
Book-Crossing Dataset
kaggle.com
zip
Updated Sep 7, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
somnambWl (2019). Book-Crossing Dataset [Dataset]. https://www.kaggle.com/datasets/somnambwl/bookcrossing-dataset
Explore at:
zip(17632108 bytes)Available download formats
Dataset updated
Sep 7, 2019
Authors
somnambWl
Description
Book-Crossing dataset mined by Cai-Nicolas Ziegler

Freely available for research use when acknowledged with the following reference (further details on the dataset are given in this publication):

PDF

Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, Georg Lausen; Proceedings of the 14th International World Wide Web Conference (WWW '05), May 10-14, 2005, Chiba, Japan. To appear.

Further information and the original dataset can be found at the original webpage.

Changes to the dataset:

Location removed as it comes in different formats not in default (city, state, country).

Transferred from ISO-8859-1 to UTF-8

Manually fixed a few rows with incorrect number of columns

Note:

out of 278859 users:

only 99053 rated at least 1 book

only 43385 rated at least 2 books.

only 12306 rated at least 10 books.

out of 271379 books:

only 270171 are rated at least once.

only 124513 have at least 2 ratings.

only 17480 have at least 10 ratings.
Apple Quality
kaggle.com
Updated Jan 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nidula Elgiriyewithana ⚡ (2024). Apple Quality [Dataset]. http://doi.org/10.34740/kaggle/dsv/7384155
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/7384155
Dataset updated
Jan 11, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Nidula Elgiriyewithana ⚡
Description
Description:

This dataset contains information about various attributes of a set of fruits, providing insights into their characteristics. The dataset includes details such as fruit ID, size, weight, sweetness, crunchiness, juiciness, ripeness, acidity, and quality.

Key Features:

A_id: Unique identifier for each fruit

Size: Size of the fruit

Weight: Weight of the fruit

Sweetness: Degree of sweetness of the fruit

Crunchiness: Texture indicating the crunchiness of the fruit

Juiciness: Level of juiciness of the fruit

Ripeness: Stage of ripeness of the fruit

Acidity: Acidity level of the fruit

Quality: Overall quality of the fruit

Potential Use Cases:

Fruit Classification: Develop a classification model to categorize fruits based on their features.

Quality Prediction: Build a model to predict the quality rating of fruits using various attributes.

The dataset was generously provided by an American agriculture company. The data has been scaled and cleaned for ease of use.

If you find this dataset useful, your support through an upvote would be greatly appreciated ❤️🙂 Thank you
May 2015 Reddit Comments
kaggle.com
zip
Updated Jun 4, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kaggle (2019). May 2015 Reddit Comments [Dataset]. https://www.kaggle.com/datasets/kaggle/reddit-comments-may-2015
Explore at:
zip(21429083286 bytes)Available download formats
Dataset updated
Jun 4, 2019
Dataset authored and provided by
Kagglehttp://kaggle.com/
License
https://www.reddit.com/wiki/apihttps://www.reddit.com/wiki/api
Description
Recently Reddit released an enormous dataset containing all ~1.7 billion of their publicly available comments. The full dataset is an unwieldy 1+ terabyte uncompressed, so we've decided to host a small portion of the comments here for Kagglers to explore. (You don't even need to leave your browser!)

You can find all the comments from May 2015 on scripts for your natural language processing pleasure. What had redditors laughing, bickering, and NSFW-ing this spring?

Who knows? Top visualizations may just end up on Reddit.

Data Description

The database has one table, May2015, with the following fields:

created_utc

ups

subreddit_id

link_id

name

score_hidden

author_flair_css_class

author_flair_text

subreddit

id

removal_reason

gilded

downs

archived

author

score

retrieved_on

body

distinguished

edited

controversiality

parent_id
ML Datasets
kaggle.com
Updated May 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bikram Saha (2023). ML Datasets [Dataset]. https://www.kaggle.com/datasets/imbikramsaha/ml-datasets/data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 1, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Bikram Saha
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
The dataset contains a diverse range of examples, including classification, regression, clustering, and dimensionality reduction problems, with varying levels of complexity and varying numbers of features. Each dataset comes with a detailed description of the problem and the corresponding features, making it easy to understand and work with. Additionally, the dataset provides an opportunity for machine learning enthusiasts to experiment with different SkLearn algorithms and evaluate their performance on different datasets. This dataset is perfect for both beginners and advanced practitioners looking to hone their skills in various machine learning techniques.
AI Vs Human Text
kaggle.com
Updated Jan 10, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shayan Gerami (2024). AI Vs Human Text [Dataset]. https://www.kaggle.com/datasets/shanegerami/ai-vs-human-text
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 10, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Shayan Gerami
Description
Around 500K essays are available in this dataset, both created by AI and written by Human.

I have gathered the data from multiple sources, added them together and removed the duplicates
LLM: 7 prompt training dataset
kaggle.com
Updated Nov 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carl McBride Ellis (2023). LLM: 7 prompt training dataset [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/llm-7-prompt-training-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 15, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Carl McBride Ellis
License
https://cdla.io/sharing-1-0/https://cdla.io/sharing-1-0/
Description
Version 4: Adding the data from "LLM-generated essay using PaLM from Google Gen-AI" kindly generated by Kingki19 / Muhammad Rizqi.
File: train_essays_RDizzl3_seven_v2.csv
Human texts: 14247 LLM texts: 3004

See also: a new dataset of an additional 4900 LLM generated texts: LLM: Mistral-7B Instruct texts

Version 3: "**The RDizzl3 Seven**"
File: train_essays_RDizzl3_seven_v1.csv

"Car-free cities"

"Does the electoral college work?"

"Exploring Venus"

"The Face on Mars"

"Facial action coding system"

"A Cowboy Who Rode the Waves"

"Driverless cars"

How this dataset was made: see the notebook "LLM: Make 7 prompt train dataset"

Version 2: (train_essays_7_prompts_v2.csv) This dataset is composed of 13,712 human texts and 1638 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.

Namely:

"Car-free cities"

"Does the electoral college work?"

"Exploring Venus"

"The Face on Mars"

"Facial action coding system"

"Seeking multiple opinions"

"Phones and driving"

This dataset is a derivative of the datasets

LLM Generated Essays for the Detect AI Comp! by Radek Osmulski

persuade corpus 2.0 provided by Nicholas Broad

daigt data - llama 70b and falcon180b by Nicholas Broad

Hello, Claude! 1000 essays from Anthropic... by Darragh

as well as the original competition training dataset

Version 1:This dataset is composed of 13,712 human texts and 1165 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.

Loan Approval Classification Dataset

kaggle.com

Updated Oct 29, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Ta-wei Lo (2024). Loan Approval Classification Dataset [Dataset]. https://www.kaggle.com/datasets/taweilo/loan-approval-classification-data

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Oct 29, 2024

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Ta-wei Lo

License

Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

1. Data Source

This dataset is a synthetic version inspired by the original Credit Risk dataset on Kaggle and enriched with additional variables based on Financial Risk for Loan Approval data. SMOTENC was used to simulate new data points to enlarge the instances. The dataset is structured for both categorical and continuous features.

2. Metadata

The dataset contains 45,000 records and 14 variables, each described below:

Column	Description	Type
`person_age`	Age of the person	Float
`person_gender`	Gender of the person	Categorical
`person_education`	Highest education level	Categorical
`person_income`	Annual income	Float
`person_emp_exp`	Years of employment experience	Integer
`person_home_ownership`	Home ownership status (e.g., rent, own, mortgage)	Categorical
`loan_amnt`	Loan amount requested	Float
`loan_intent`	Purpose of the loan	Categorical
`loan_int_rate`	Loan interest rate	Float
`loan_percent_income`	Loan amount as a percentage of annual income	Float
`cb_person_cred_hist_length`	Length of credit history in years	Float
`credit_score`	Credit score of the person	Integer
`previous_loan_defaults_on_file`	Indicator of previous loan defaults	Categorical
`loan_status` (target variable)	Loan approval status: 1 = approved; 0 = rejected	Integer

3. Data Usage

The dataset can be used for multiple purposes:

Exploratory Data Analysis (EDA): Analyze key features, distribution patterns, and relationships to understand credit risk factors.
Classification: Build predictive models to classify the loan_status variable (approved/not approved) for potential applicants.
Regression: Develop regression models to predict the credit_score variable based on individual and loan-related attributes.

Mind the data issue from the original data, such as the instance > 100-year-old as age.

This dataset provides a rich basis for understanding financial risk factors and simulating predictive modeling processes for loan approval and credit scoring.

Feel free to leave comments on the discussion. I'd appreciate your upvote if you find my dataset useful! 😀

Human Activity Recognition (HAR - Video Dataset)
kaggle.com
Updated May 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sharjeel M. (2023). Human Activity Recognition (HAR - Video Dataset) [Dataset]. http://doi.org/10.34740/kaggle/dsv/5722068
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/5722068
Dataset updated
May 19, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Sharjeel M.
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
The dataset contains a comprehensive collection of human activity videos, spanning across 7 distinct classes. These classes include clapping, meeting and splitting, sitting, standing still, walking, walking while reading book, and walking while using the phone.

Each video clip in the dataset showcases a specific human activity and has been labeled with the corresponding class to facilitate supervised learning.

The primary inspiration behind creating this dataset is to enable machines to recognize and classify human activities accurately. With the advent of computer vision and deep learning techniques, it has become increasingly important to train machine learning models on large and diverse datasets to improve their accuracy and robustness.
Legal Text Classification Dataset
kaggle.com
Updated Oct 17, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
A.Mohan kumar (2023). Legal Text Classification Dataset [Dataset]. https://www.kaggle.com/datasets/amohankumar/legal-text-classification-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 17, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
A.Mohan kumar
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
The dataset contains a total of 25000 legal cases in the form of text documents. Each document has been annotated with catchphrases, citations sentences, citation catchphrases, and citation classes. Citation classes indicate the type of treatment given to the cases cited by the present case.
Predicting Heart Failure
kaggle.com
Updated Sep 13, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aman Chauhan (2022). Predicting Heart Failure [Dataset]. https://www.kaggle.com/datasets/whenamancodes/heart-failure-clinical-records
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 13, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Aman Chauhan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worlwide. Heart failure is a common event caused by CVDs and this dataset contains 12 features that can be used to predict mortality by heart failure.

Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.

Attribute Information:

Thirteen (13) clinical features: - age: age of the patient (years) - anaemia: decrease of red blood cells or hemoglobin (boolean) - high blood pressure: if the patient has hypertension (boolean) - creatinine phosphokinase (CPK): level of the CPK enzyme in the blood (mcg/L) - diabetes: if the patient has diabetes (boolean) - ejection fraction: percentage of blood leaving the heart at each contraction (percentage) - platelets: platelets in the blood (kiloplatelets/mL) - sex: woman or man (binary) - serum creatinine: level of serum creatinine in the blood (mg/dL) - serum sodium: level of serum sodium in the blood (mEq/L) - smoking: if the patient smokes or not (boolean) - time: follow-up period (days) - [target] death event: if the patient deceased during the follow-up period (boolean)

More - Find More Exciting🙀 Datasets Here - An Upvote👍 A Dayᕙ(`▿´)ᕗ , Keeps Aman Hurray Hurray..... ٩(˘◡˘)۶Haha
Network Traffic Dataset
kaggle.com
Updated Oct 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ravikumar Gattu (2023). Network Traffic Dataset [Dataset]. https://www.kaggle.com/datasets/ravikumargattu/network-traffic-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 31, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Ravikumar Gattu
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

The data presented here was obtained in a Kali Machine from University of Cincinnati,Cincinnati,OHIO by carrying out packet captures for 1 hour during the evening on Oct 9th,2023 using Wireshark.This dataset consists of 394137 instances were obtained and stored in a CSV (Comma Separated Values) file.This large dataset could be used utilised for different machine learning applications for instance classification of Network traffic,Network performance monitoring,Network Security Management , Network Traffic Management ,network intrusion detection and anomaly detection.

The dataset can be used for a variety of machine learning tasks, such as network intrusion detection, traffic classification, and anomaly detection.

Content :

This network traffic dataset consists of 7 features.Each instance contains the information of source and destination IP addresses, The majority of the properties are numeric in nature, however there are also nominal and date kinds due to the Timestamp.

The network traffic flow statistics (No. Time Source Destination Protocol Length Info) were obtained using Wireshark (https://www.wireshark.org/).

Dataset Columns:

No : Number of Instance. Timestamp : Timestamp of instance of network traffic Source IP: IP address of Source Destination IP: IP address of Destination Portocol: Protocol used by the instance Length: Length of Instance Info: Information of Traffic Instance

Acknowledgements :

I would like thank University of Cincinnati for giving the infrastructure for generation of network traffic data set.

Ravikumar Gattu , Susmitha Choppadandi

Inspiration : This dataset goes beyond the majority of network traffic classification datasets, which only identify the type of application (WWW, DNS, ICMP,ARP,RARP) that an IP flow contains. Instead, it generates machine learning models that can identify specific applications (like Tiktok,Wikipedia,Instagram,Youtube,Websites,Blogs etc.) from IP flow statistics (there are currently 25 applications in total).

**Dataset License: ** CC0: Public Domain

Dataset Usages : This dataset can be used for different machine learning applications in the field of cybersecurity such as classification of Network traffic,Network performance monitoring,Network Security Management , Network Traffic Management ,network intrusion detection and anomaly detection.

ML techniques benefits from this Dataset :

This dataset is highly useful because it consists of 394137 instances of network traffic data obtained by using the 25 applications on a public,private and Enterprise networks.Also,the dataset consists of very important features that can be used for most of the applications of Machine learning in cybersecurity.Here are few of the potential machine learning applications that could be benefited from this dataset are :

Network Performance Monitoring : This large network traffic data set can be utilised for analysing the network traffic to identifying the network patterns in the network .This help in designing the network security algorithms for minimise the network probelms.

Anamoly Detection : Large network traffic dataset can be utilised training the machine learning models for finding the irregularitues in the traffic which could help identify the cyber attacks.

3.Network Intrusion Detection : This large dataset could be utilised for machine algorithms training and designing the models for detection of the traffic issues,Malicious traffic network attacks and DOS attacks as well.
Medical Text Dataset -Cancer Doc Classification
kaggle.com
Updated Aug 6, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Falgunipatel19 (2022). Medical Text Dataset -Cancer Doc Classification [Dataset]. https://www.kaggle.com/datasets/falgunipatel19/biomedical-text-publication-classification
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 6, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Falgunipatel19
Description
For Biomedical text document classification, abstract and full papers(whose length less than or equal to 6 pages) available and used. This dataset focused on long research paper whose page size more than 6 pages. Dataset includes cancer documents to be classified into 3 categories like 'Thyroid_Cancer','Colon_Cancer','Lung_Cancer'. Total publications=7569. it has 3 class labels in dataset. number of samples in each categories: colon cancer=2579, lung cancer=2180, thyroid cancer=2810
DFL Bundesliga 460 MP4 Videos in 30Sec. + CSV
kaggle.com
Updated Aug 4, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Saber (2022). DFL Bundesliga 460 MP4 Videos in 30Sec. + CSV [Dataset]. https://www.kaggle.com/datasets/saberghaderi/-dfl-bundesliga-460-mp4-videos-in-30sec-csv
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 4, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Saber
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
From a young age, hopeful talents devote time, money, and training to the sport. Yet, while the next superstar is guaranteed to start off in youth or semi-professional leagues, these leagues often have the fewest resources to invest. This includes resources for the collection of event data which helps generate insights into the performance of the teams and players.

****About Dataset:**** This dataset with 460 training and test videos in 2 folders was collected by dataset of competition videos. All videos are in MP4 format.

** Please note that the number of videos in each folder is different

Version 1 --> 460 MP4 file in 2 Folder + .CSV file Version 2 --> Coming Soon!

competition page: https://www.kaggle.com/competitions/dfl-bundesliga-data-shootout

wish you all the best
Laptop Price Explorer: The ML Model
kaggle.com
Updated Dec 12, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sagar Puniyani (2023). Laptop Price Explorer: The ML Model [Dataset]. https://www.kaggle.com/datasets/sagaraiarchitect/laptop-price-explorer-the-ml-model/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 12, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Sagar Puniyani
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Explore the dynamic world of laptops with our comprehensive dataset that delves into the intricate details of various portable computing devices. This dataset is a treasure trove of information for tech enthusiasts, market analysts, and anyone interested in understanding the diverse landscape of laptops.

https://github.com/Sagar-Puniyani/DataGeneration the link for the GitHub code where the Dataset is Generated, and how the Implementation of the Model.
Physical Exercise Recognition Dataset
kaggle.com
Updated Feb 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Muhannad Tuameh (2023). Physical Exercise Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/muhannadtuameh/exercise-recognition
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 16, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Muhannad Tuameh
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Note:

Because this dataset has been used in a competition, we had to hide some of the data to prepare the test dataset for the competition. Thus, in the previous version of the dataset, only train.csv file is existed.

Content

This dataset represents 10 different physical poses that can be used to distinguish 5 exercises. The exercises are Push-up, Pull-up, Sit-up, Jumping Jack and Squat. For every exercise, 2 different classes have been used to represent the terminal positions of that exercise (e.g., “up” and “down” positions for push-ups).

Collection Process

About 500 videos of people doing the exercises have been used in order to collect this data. The videos are from Countix Dataset that contain the YouTube links of several human activity videos. Using a simple Python script, the videos of 5 different physical exercises are downloaded. From every video, at least 2 frames are manually extracted. The extracted frames represent the terminal positions of the exercise.

Processing Data

For every frame, MediaPipe framework is used for applying pose estimation, which detects the human skeleton of the person in the frame. The landmark model in MediaPipe Pose predicts the location of 33 pose landmarks (see figure below). Visit Mediapipe Pose Classification page for more details.

https://mediapipe.dev/images/mobile/pose_tracking_full_body_landmarks.png" alt="33 pose landmarks">

Facebook

Twitter

Click to copy link

Link copied

Cite

liuxu77 (2023). The LargeST Benchmark Dataset [Dataset]. https://www.kaggle.com/datasets/liuxu77/largest

The LargeST Benchmark Dataset

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jun 13, 2023

Dataset provided by

Kagglehttp://kaggle.com/

Authors

liuxu77

License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

This is the official website for downloading the CA sub-dataset of the LargeST benchmark dataset. There are a total of 7 files in this page. Among them, 5 files in .h5 format contain the traffic flow raw data from 2017 to 2021, 1 file in .csv format provides the metadata for sensors, and 1 file in .npy format represents the adjacency matrix constructed based on road network distances. Please refer to https://github.com/liuxu77/LargeST for more information.

Clear search

Close search

Google apps

Main menu

The LargeST Benchmark Dataset

Predictive Maintenance Dataset

Fake News Prediction Dataset

🫀 Heart Disease Dataset

Acknowlegement

Online Sales Dataset - Popular Marketplace Data

Columns:

Insights:

Book-Crossing Dataset

Apple Quality

Description:

Key Features:

Potential Use Cases:

May 2015 Reddit Comments

Data Description

ML Datasets

AI Vs Human Text

LLM: 7 prompt training dataset

Loan Approval Classification Dataset

1. Data Source

2. Metadata

3. Data Usage

Feel free to leave comments on the discussion. I'd appreciate your upvote if you find my dataset useful! 😀

Human Activity Recognition (HAR - Video Dataset)

Legal Text Classification Dataset

Predicting Heart Failure

Attribute Information:

Network Traffic Dataset

Medical Text Dataset -Cancer Doc Classification

DFL Bundesliga 460 MP4 Videos in 30Sec. + CSV

Laptop Price Explorer: The ML Model

Physical Exercise Recognition Dataset

Note:

Content

Collection Process

Processing Data

The LargeST Benchmark Dataset