100+ datasets found

Student Score - Hypothesis Testing (T Test)
kaggle.com
zip
Updated Sep 21, 2023
Cite
vikram amin (2023). Student Score - Hypothesis Testing (T Test) [Dataset]. https://www.kaggle.com/datasets/vikramamin/student-score-hypothesis-testing-t-test
Available download formats: zip (7328 bytes)
Dataset updated
Sep 21, 2023
Authors
vikram amin
License
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
Description
Data Cleaning

Convert the data types of the required variables.

Load the libraries dplyr, ggplot2, tidyverse, and tidyr.

Find out the count of male vs female students.

We keep only two columns, 'Sex' and 'G3', and remove the other columns.

t = -2.0651 is the test statistic: the difference between the group means expressed in standard errors, so it measures the distance from 0.

df = 390.57 is the degrees of freedom. It is related to the sample size: how many free data points are available for making comparisons.

p-value = 0.03958 is less than alpha (0.05), so we can reject the null hypothesis. Hence the difference is statistically significant.

The 95% confidence interval suggests that the true difference in means lies between -1.85 and -0.04. The interval excludes 0, which agrees with the p-value.

The difference in means between the two groups is 10.91 - 9.96 = 0.95.
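The fractional df = 390.57 suggests that the test was Welch's unequal-variance t-test (the default of R's t.test), although the description does not say so. As a rough sketch under that assumption, the statistic and the degrees of freedom can be computed as below. The function name and the sample scores are hypothetical and are not taken from the dataset.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    # Squared standard error contributed by each group
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Hypothetical scores for illustration only
female = [10, 9, 12, 8, 11, 10]
male = [12, 11, 13, 10, 12, 14]
t, df = welch_t(female, male)
```

A negative t here simply means the first group's mean is lower than the second's, which matches the sign of the reported statistic.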

[Figure: Histogram]

[Figure: Density plot]

Both the histogram and the density plot indicate that there are students who got 0. Could this be due to non-attendance at exams? Let us find out the number of students who got 0.

- 38 students in total out of 395 got a score of 0. That is 9.62% of students.
- Let us check the mean for both groups after removing the students who got zeros.
- We have created a new data frame called student 2, which includes a total of 357 students with no zero marks.
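The zero-score filtering step can be sketched as follows. This is only an illustration: the records are hypothetical (sex, G3) pairs, not rows from the dataset, and the helper name is made up. The last line checks the 38-out-of-395 percentage quoted above.

```python
from statistics import fmean

# Hypothetical (sex, G3) records for illustration only
records = [("F", 0), ("F", 10), ("F", 12), ("M", 0), ("M", 11), ("M", 13)]


def group_means(rows):
    """Return the mean score per group as a dict."""
    groups = {}
    for sex, score in rows:
        groups.setdefault(sex, []).append(score)
    return {sex: fmean(scores) for sex, scores in groups.items()}


# Drop the students who scored 0, then compare means before and after
nonzero = [row for row in records if row[1] != 0]
means_all = group_means(records)
means_nonzero = group_means(nonzero)

# Share of zero scores reported in the description: 38 of 395 students
zero_share = round(100 * 38 / 395, 2)
```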

Conclusion:

The mean is 11.20 for females and 11.86 for males. The difference in means between the two groups is 0.66, compared with the earlier mean difference of 0.95.

The p-value is 0.05335. To reject the null hypothesis, the p-value should be less than 0.05.

Since 0.05335 is slightly above 0.05, we cannot reject the null hypothesis at the 5% level, so the difference is not statistically significant once the zero scores are removed.
(🌅 Sunset) Kaggle Users' Country + Regions Info
kaggle.com
zip
Updated Feb 14, 2024
Cite
BwandoWando (2024). (🌅 Sunset) Kaggle Users' Country + Regions Info [Dataset]. https://www.kaggle.com/datasets/bwandowando/kaggle-user-country-regions
Available download formats: zip (2376511 bytes)
Dataset updated
Feb 14, 2024
Authors
BwandoWando
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
[Context]

The official Meta-Kaggle dataset contains the Users.csv file which contains Username, DisplayName, RegisterDate, and PerformanceTier fields but doesn't contain location data of the Kaggle Users. This dataset augments that data with additional country and region information.

[Note]

I haven't included the username and displayname values on purpose, just the userid to be joined back to the Meta-Kaggle official Users.csv file.
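Joining this dataset back to Users.csv on the user ID might look like the sketch below. The column names ("Id", "UserId", "Country") and the sample rows are assumptions made for illustration; check the actual file headers before relying on them.

```python
import csv
import io

# Hypothetical stand-ins for Users.csv and this dataset's file.
# Column names and values are assumptions for illustration only.
users_csv = io.StringIO("Id,PerformanceTier\n1,2\n2,0\n3,1\n")
locations_csv = io.StringIO("UserId,Country\n1,Philippines\n3,India\n")

# Index the location rows by user ID for constant-time lookup
country_by_user = {
    row["UserId"]: row["Country"] for row in csv.DictReader(locations_csv)
}

# Left join: keep every user, with None where no location was scraped
joined = [
    {**row, "Country": country_by_user.get(row["Id"])}
    for row in csv.DictReader(users_csv)
]
```

A left join keeps users whose location was never scraped, which matters given the missing-data caveats described for this dataset.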

[Limitations]

It is possible that some users haven't inputted their details when the scraper went through their accounts and thus have missing data. Another possibility is that users may have updated their info after the scraper went through their accounts, thus resulting in inconsistencies.

[How I defined active in this dataset]

Users that have received an upvote in the forums, datasets, or notebooks

Users that have given an upvote in the forums, datasets, or notebooks

Users that have created a thread, a forum post, a notebook, or a dataset

Users that made a competition submission

Users that exist in the Meta-Kaggle Users dataset

Date cut-off of Jan 01, 2019

[Update]

15-Feb-2024 - Since the update to the Kaggle member profile page, the scrapers aren't working anymore because the UI layout has changed. Will fix this when we get the time.
T-Shirts Dataset
kaggle.com
zip
Updated Jul 15, 2022
Cite
Sunny Kusawa (2022). T-Shirts Dataset [Dataset]. https://www.kaggle.com/datasets/sunnykusawa/tshirts/code
Available download formats: zip (375026934 bytes)
Dataset updated
Jul 15, 2022
Authors
Sunny Kusawa
License
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
Description
This data set was collected from an online e-commerce site. It is very raw and contains junk data, much as you would encounter in industry data science projects.

Applying your image preprocessing skills to such data will help you understand the real-world problems and challenges of industry projects.

It contains junk and partial images, as well as images showing multiple t-shirt views in a single image.

You can perform different tasks on this data set, such as:

Beginner
- Resize all images to 48 * 48
- Convert all images to grayscale

Intermediate
- Perform image masking on all images
- Develop a classifier to detect whether a given image is a t-shirt or not

Advanced
- Try to cluster the t-shirt images
- Try clustering based on color
- Try clustering based on full, partial, multiple, and junk t-shirt images
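The two beginner tasks can be sketched without any imaging library, as below. This is a pure-Python illustration on nested lists; in practice an imaging library would normally be used instead. The function names and the tiny sample image are hypothetical, and the luminance weights are the common ITU-R BT.601 values.

```python
def to_grayscale(rgb):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb
    ]


def resize_nearest(gray, width, height):
    """Nearest-neighbour resize of a 2-D list to width x height."""
    src_h, src_w = len(gray), len(gray[0])
    return [
        [gray[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


# Hypothetical 2x2 RGB image: white, black / red, white
image = [[(255, 255, 255), (0, 0, 0)], [(255, 0, 0), (255, 255, 255)]]
small = resize_nearest(to_grayscale(image), 48, 48)
```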
Clothes Dataset
kaggle.com
zip
Updated Dec 20, 2024
Cite
RyanBadai (2024). Clothes Dataset [Dataset]. https://www.kaggle.com/datasets/ryanbadai/clothes-dataset
Available download formats: zip (1473495763 bytes)
Dataset updated
Dec 20, 2024
Authors
RyanBadai
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset contains images of clothing items scraped from Carousell, an online marketplace, specifically curated for image classification tasks. It includes a diverse set of classes representing different types of clothing, making it an excellent resource for machine learning and computer vision projects. The dataset is organized into the following 15 classes:
- Blazer
- Celana_Panjang (Long Pants)
- Celana_Pendek (Shorts)
- Gaun (Dresses)
- Hoodie
- Jaket (Jacket)
- Jaket_Denim (Denim Jacket)
- Jaket_Olahraga (Sports Jacket)
- Jeans
- Kaos (T-shirt)
- Kemeja (Shirt)
- Mantel (Coat)
- Polo
- Rok (Skirt)
- Sweter (Sweater)

The images in this dataset represent various styles, textures, and colors, offering a comprehensive resource for training models to recognize and classify clothing categories. It is ideal for tasks such as building fashion recommendation systems, creating virtual try-on applications, or studying visual trends in fashion e-commerce. Whether you are an enthusiast or a professional, this dataset can help explore and experiment with deep learning techniques in the realm of fashion.
Website Performance dataset
kaggle.com
zip
Updated Jul 11, 2024
Cite
maida sajid (2024). Website Performance dataset [Dataset]. https://www.kaggle.com/datasets/maidasajid/website-performance-dataset
Available download formats: zip (19979 bytes)
Dataset updated
Jul 11, 2024
Authors
maida sajid
Description
This dataset contains website performance metrics, including response time and throughput, collected from Pingdom and Site24x7. The data has been meticulously labeled by students from FAST NUCES and UMT.
Twitter Tweets Sentiment Dataset
kaggle.com
zip
Updated Apr 8, 2022
Cite
M Yasser H (2022). Twitter Tweets Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset
Available download formats: zip (1289519 bytes)
Dataset updated
Apr 8, 2022
Authors
M Yasser H
License
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
Description
Description:

Twitter is an online social media platform where people share their thoughts as tweets. It is observed that some people misuse it to tweet hateful content. Twitter is trying to tackle this problem, and we shall help by creating a strong NLP-based classifier model to identify negative tweets so that they can be blocked. Can you build a strong classifier model to predict the same?

Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.

Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.
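One way to avoid carrying the surrounding quotes into the training text is to let a CSV parser handle the quoting rather than splitting on commas by hand. The sketch below uses Python's csv module on a hypothetical two-row sample that follows the column layout listed for this dataset; the IDs and tweet texts are made up.

```python
import csv
import io

# Hypothetical sample rows; IDs and texts are invented for illustration
raw = (
    "textID,text,sentiment\n"
    'abc123,"I love this, really",positive\n'
    'def456,"""quoted"" start",neutral\n'
)

# DictReader strips the enclosing quotes and unescapes doubled quotes
rows = list(csv.DictReader(io.StringIO(raw)))
texts = [row["text"] for row in rows]
```

Note that quotes which are part of the tweet itself (written as doubled quotes in the file) are preserved, while the field delimiters are removed.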

You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.)

Columns:

textID - unique ID for each piece of text

text - the text of the tweet

sentiment - the general sentiment of the tweet

Acknowledgement:

The dataset was downloaded from Kaggle Competitions:
https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv

Objective:

Understand the Dataset & cleanup (if required).

Build classification models to predict the twitter sentiments.

Compare the evaluation metrics of various classification algorithms.
sample data T
kaggle.com
zip
Updated Oct 18, 2022
Cite
SelcukCan (2022). sample data T [Dataset]. https://www.kaggle.com/datasets/selcukcan/sample-data
Available download formats: zip (557 bytes)
Dataset updated
Oct 18, 2022
Authors
SelcukCan
Description
Dataset

This dataset was created by SelcukCan

Contents
Theft Detection
kaggle.com
zip
Updated Nov 11, 2018
Cite
Simran Singh (2018). Theft Detection [Dataset]. https://www.kaggle.com/datasets/thatbrownkid/thefttest
Available download formats: zip (22734050133 bytes)
Dataset updated
Nov 11, 2018
Authors
Simran Singh
Description
Dataset

This dataset was created by Simran Singh

Contents
Lung disease 5 Class dataset T, P, N, E, C
kaggle.com
zip
Updated Aug 29, 2025
Cite
Obaidul Haque (2025). Lung disease 5 Class dataset T, P, N, E, C [Dataset]. https://www.kaggle.com/datasets/obaidulhaque/lung-disease-5-class-dataset-t-p-n-e-c
Available download formats: zip (3814132911 bytes)
Dataset updated
Aug 29, 2025
Authors
Obaidul Haque
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
The dataset is a mix of data gathered from across the Kaggle platform. Here are the links to the source datasets:
1. https://www.kaggle.com/datasets/fernando2rad/x-ray-lung-diseases-images-9-classes?select=04+Doen%C3%A7as+Pulmonares+Obstrutivas+%28Enfisema%2C+Broncopneumonia%2C+Bronquiectasia%2C+Embolia%29
2. https://www.kaggle.com/datasets/yasserhessein/tuberculosis-chest-x-rays-images/data
3. https://www.kaggle.com/datasets/basitkhan12/covid-and-pneumonia-chest-x-rays
4. https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia

The dataset is a mix of 5 classes of data:
1. Tuberculosis - 5144 images
2. Pneumonia - 5121
3. Normal - 5083
4. Emphysema - 4928
5. COVID-19 - 5523

In total, this dataset consists of 25799 images. All of them are radiography X-ray images.
Multimodal Stroke Image Dataset
kaggle.com
zip
Updated Apr 18, 2025
Cite
Türker TUNCER (2025). Multimodal Stroke Image Dataset [Dataset]. https://www.kaggle.com/datasets/turkertuncer/multimodal-stroke-image-dataset
Available download formats: zip (582781243 bytes)
Dataset updated
Apr 18, 2025
Authors
Türker TUNCER
License
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset was collected retrospectively from a single medical center between 2021 and 2023. CT imaging was performed on a 128‐slice GE Revolution scanner in the axial plane with a 5 mm slice thickness. MRI was acquired on a GE Signa 1.5 T system using axial diffusion‑weighted imaging (DWI) sequences at b‐values of 0 and 1000 s/mm². All acquisitions followed the hospital’s standard stroke screening protocol. There are two distinct classes: (1) stroke and (2) control. These classes were clinically verified by neurologists and neuroradiologists. The dataset comprises data from 230 participants, with a gender distribution of 113 females and 117 males. Among these participants, 115 were diagnosed with stroke, while the remaining 115 were categorized under the control group. An average of 7-8 cross-sectional images were used for each imaging type. The dataset includes a total of 5,336 CT and MRI (2,226 CT + 3,110 MR) images, with 2,695 images representing stroke cases and 2,641 images corresponding to control cases. All patient imaging data were fully anonymized before analysis. Identifiers such as name, date of birth, patient ID, and acquisition timestamps were removed from all image headers. We reviewed the dataset for missing images or labels and excluded any cases with incomplete CT or MR series; no imputation was performed. Reference labels were assigned by one neuroradiologist and two emergency medicine specialists, based on clinical reports and follow‑up data.
Telco Customer Churn
kaggle.com
zip
Updated Feb 23, 2018
Cite
BlastChar (2018). Telco Customer Churn [Dataset]. https://www.kaggle.com/datasets/blastchar/telco-customer-churn
Available download formats: zip (175758 bytes)
Dataset updated
Feb 23, 2018
Authors
BlastChar
Description
Context

"Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs." [IBM Sample Data Sets]

Content

Each row represents a customer, and each column contains one of the customer’s attributes, as described in the column metadata.

The data set includes information about:

Customers who left within the last month – the column is called Churn

Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies

Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges

Demographic info about customers – gender, age range, and if they have partners and dependents

Inspiration

To explore this type of model and learn more about the subject.

New version from IBM: https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113
t-data
kaggle.com
zip
Updated Oct 28, 2024
Cite
kusanovbaiastan (2024). t-data [Dataset]. https://www.kaggle.com/datasets/kusanovbaiastan/t-data/discussion
Available download formats: zip (557 bytes)
Dataset updated
Oct 28, 2024
Authors
kusanovbaiastan
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset

This dataset was created by kusanovbaiastan

Released under Apache 2.0

Contents
Stock Price and News Headlines
kaggle.com
zip
Updated Sep 10, 2023
Cite
Prateik Lohani (2023). Stock Price and News Headlines [Dataset]. https://www.kaggle.com/datasets/prateiklohani/stock-price-and-news-headlines
Available download formats: zip (3302851 bytes)
Dataset updated
Sep 10, 2023
Authors
Prateik Lohani
Description
Basic dataset for sentiment analysis, and for prediction of stock prices based on news headlines. I tried to look for this dataset on kaggle but couldn't find this anywhere.

(By no means am I the original creator of this dataset. I just found it and uploaded it here since I couldn't find it on kaggle. Please let me know if it is already present on kaggle- if it is, then I'll remove this one at once.)

Anyway, there are label values in the file (Label column):
- 0: Stock price DECREASED
- 1: Stock price INCREASED due to headlines (or at least didn't decrease)

Encoding info given in file desc. 👇
English Wikipedia People Dataset
kaggle.com
zip
Updated Jul 31, 2025
Cite
Wikimedia (2025). English Wikipedia People Dataset [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/english-wikipedia-people-dataset
Available download formats: zip (4293465577 bytes)
Dataset updated
Jul 31, 2025
Dataset provided by
Wikimedia Foundation: http://www.wikimedia.org/
Authors
Wikimedia
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Summary

This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.

The beta sample dataset is a subset of the Structured Contents Snapshot focusing on people with infoboxes in EN wikipedia; outputted as json files (compressed in tar.gz).

We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.

Data Structure

File name: wme_people_infobox.tar.gz

Size of compressed file: 4.12 GB

Size of uncompressed file: 21.28 GB

Noteworthy Included Fields:
- name - title of the article.
- identifier - ID of the article.
- image - main image representing the article's subject.
- description - one-sentence description of the article for quick reference.
- abstract - lead section, summarizing what the article is about.
- infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
- sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.
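Because the uncompressed data is over 21 GB, it may be preferable to stream records out of the archive rather than extract it first. The sketch below assumes each file inside the tar.gz holds one JSON object per line; that layout is an assumption and should be checked against the actual archive. The helper name, the member file name, and the sample records are hypothetical.

```python
import io
import json
import tarfile


def iter_articles(fileobj):
    """Yield one dict per JSON line from every file inside a .tar.gz stream."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            for line in archive.extractfile(member):
                if line.strip():
                    yield json.loads(line)


# Build a tiny in-memory archive so the sketch runs without the real file
payload = (
    b'{"name": "Example Person A", "identifier": 1}\n'
    b'{"name": "Example Person B", "identifier": 2}\n'
)
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    info = tarfile.TarInfo("people_0.ndjson")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))
buffer.seek(0)

names = [article["name"] for article in iter_articles(buffer)]
```

For the real data, the same generator could be fed a file opened in binary mode on wme_people_infobox.tar.gz.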

The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.

Stats

Infoboxes
- Compressed: 2GB
- Uncompressed: 11GB

Infoboxes + sections + short description
- Size of compressed file: 4.12 GB
- Size of uncompressed file: 21.28 GB

Article analysis and filtering breakdown:
- total # of articles analyzed: 6,940,949
- # people found with QID: 1,778,226
- # people found with Category: 158,996
- # people found with Biography Project: 76,150
- Total # of people articles found: 2,013,372
- Total # people articles with infoboxes: 1,559,985

End stats
- Total number of people articles in this dataset: 1,559,985
- that have a short description: 1,416,701
- that have an infobox: 1,559,985
- that have article sections: 1,559,921

This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.

Maintenance and Support

This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024. The information in this dataset may therefore be out of date. This dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs.

Initial Data Collection and Normalization

The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).

Who are the source language producers?

Wikipedia is a human generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community; the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of English Wikipedia language editions: English https://en.wikipedia.org/, written by the community.

Attribution

Terms and conditions

Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...
Fruit-Image-Dataset
kaggle.com
zip
Updated Aug 21, 2022
Cite
Ishan Dandekar (2022). Fruit-Image-Dataset [Dataset]. https://www.kaggle.com/datasets/ishandandekar/fruitimagedataset
Available download formats: zip (416927622 bytes)
Dataset updated
Aug 21, 2022
Authors
Ishan Dandekar
License
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset contains images of 131 various fruits and vegetables. The original version of this dataset is available here. The original version of the dataset was used for Fruits-360. Although the original dataset hasn't been updated for over 2 years, the dataset on Kaggle has been updated various times, providing better images. The dataset should be used for image classification. Do check the GitHub repository of the source here.

Dataset Properties: (taken from the description in the repository itself)
- Total number of images : 90483
- Training set size : 67692 images (one fruit or vegetable per image)
- Test set size : 22688 images (20% of total data)
- Number of classes : 131 total fruits and vegetables
- Filename format : image_index_100.jpg (e.g. 32_100.jpg) or r_image_index_100.jpg (e.g. r_32_100.jpg) or r2_image_index_100.jpg or r3_image_index_100.jpg. "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis. "100" comes from image size (100x100 pixels).

Folder structure
- Train : This folder has multiple subfolders labelled with the fruit's/vegetable's name, each containing the respective images. These images were used to train the models in the research paper.
- Test : This folder has multiple subfolders labelled with the fruit's/vegetable's name, each containing the respective images. These images were used to test the models in the research paper.
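The filename format described above can be parsed with a regular expression, as in the sketch below. The pattern and the function name are hypothetical and cover only the variants named in the description (no prefix, r, r2, r3).

```python
import re

# Matches names such as 32_100.jpg, r_32_100.jpg, r2_32_100.jpg
FILENAME = re.compile(r"^(?:(r\d*)_)?(\d+)_100\.jpg$")


def parse_name(filename):
    """Return (rotation_prefix_or_None, image_index), or None if no match."""
    match = FILENAME.match(filename)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


examples = [parse_name(n) for n in ("32_100.jpg", "r_32_100.jpg", "r2_7_100.jpg")]
```

Returning None for names that do not fit makes it easy to spot files that fall outside the documented naming scheme.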

All credits to the researchers themselves. I made this dataset for my own ease-of-use.
Pre and Post-Exercise Heart Rate Analysis
kaggle.com
zip
Updated Sep 29, 2024
Cite
Abdullah M Almutairi (2024). Pre and Post-Exercise Heart Rate Analysis [Dataset]. https://www.kaggle.com/datasets/abdullahmalmutairi/pre-and-post-exercise-heart-rate-analysis
Available download formats: zip (3857 bytes)
Dataset updated
Sep 29, 2024
Authors
Abdullah M Almutairi
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Dataset Overview:

This dataset contains simulated (hypothetical) but almost realistic (based on AI) data related to sleep, heart rate, and exercise habits of 500 individuals. It includes both pre-exercise and post-exercise resting heart rates, allowing for analyses such as a dependent t-test (Paired Sample t-test) to observe changes in heart rate after an exercise program. The dataset also includes additional health-related variables, such as age, hours of sleep per night, and exercise frequency.

The data is designed for tasks involving hypothesis testing, health analytics, or even machine learning applications that predict changes in heart rate based on personal attributes and exercise behavior. It can be used to understand the relationships between exercise frequency, sleep, and changes in heart rate.

File:
- Filename: heart_rate_data.csv
- File Format: CSV

Features (Columns):

Age
- Description: The age of the individual.
- Type: Integer
- Range: 18-60 years
- Relevance: Age is an important factor in determining heart rate and the effects of exercise.

Sleep Hours
- Description: The average number of hours the individual sleeps per night.
- Type: Float
- Range: 3.0 - 10.0 hours
- Relevance: Sleep is a crucial health metric that can impact heart rate and exercise recovery.

Exercise Frequency (Days/Week)
- Description: The number of days per week the individual engages in physical exercise.
- Type: Integer
- Range: 1-7 days/week
- Relevance: More frequent exercise may lead to greater heart rate improvements and better cardiovascular health.

Resting Heart Rate Before
- Description: The individual’s resting heart rate measured before beginning a 6-week exercise program.
- Type: Integer
- Range: 50 - 100 bpm (beats per minute)
- Relevance: This is a key health indicator, providing a baseline measurement for the individual’s heart rate.

Resting Heart Rate After
- Description: The individual’s resting heart rate measured after completing the 6-week exercise program.
- Type: Integer
- Range: 45 - 95 bpm (lower than the "Resting Heart Rate Before" due to the effects of exercise).
- Relevance: This variable is essential for understanding how exercise affects heart rate over time, and it can be used to perform a dependent t-test analysis.

Max Heart Rate During Exercise
- Description: The maximum heart rate the individual reached during exercise sessions.
- Type: Integer
- Range: 120 - 190 bpm
- Relevance: This metric helps in understanding cardiovascular strain during exercise and can be linked to exercise frequency or fitness levels.

Potential Uses:

Dependent T-Test Analysis: The dataset is particularly suited for a dependent (paired) t-test where you compare the resting heart rate before and after the exercise program for each individual.

Exploratory Data Analysis (EDA): Investigate relationships between sleep, exercise frequency, and changes in heart rate. Potential analyses include correlations between sleep hours and resting heart rate improvement, or regression analyses to predict heart rate after exercise.

Machine Learning: Use the dataset for predictive modeling, and build a beginner regression model to predict post-exercise heart rate using age, sleep, and exercise frequency as features.

Health and Fitness Insights: This dataset can be useful for studying how different factors like sleep and age influence heart rate changes and overall cardiovascular health.
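The dependent (paired) t-test mentioned among the potential uses can be sketched with the standard library alone. The function name and the sample heart rates below are hypothetical; they are not values from heart_rate_data.csv.

```python
import math
import statistics


def paired_t(before, after):
    """Return (t, df) for a dependent (paired) t-test on matched samples."""
    if len(before) != len(after):
        raise ValueError("paired samples must have equal length")
    # Per-individual change: after minus before
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    # t = mean difference divided by its standard error
    t = statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical resting heart rates in bpm, for illustration only
before = [72, 80, 68, 75, 90]
after = [70, 76, 67, 71, 84]
t, df = paired_t(before, after)
```

A negative t indicates that the resting heart rate was lower after the program than before it. Turning t and df into a p-value requires the t distribution, which a statistics library would provide.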

License: Choose an appropriate open license, such as:

CC BY 4.0 (Attribution 4.0 International).

Inspiration for Kaggle Users:
- How does exercise frequency influence the reduction in resting heart rate?
- Is there a relationship between sleep and heart rate improvements post-exercise?
- Can we predict the post-exercise heart rate using other health variables?
- How do age and exercise frequency interact to affect heart rate?

Acknowledgments: This is a simulated dataset for educational purposes, generated to demonstrate statistical and machine learning applications in the field of health analytics.
Human vs. LLM Text Corpus
kaggle.com
zip
Updated Jan 10, 2024
Cite
Zachary Grinberg (2024). Human vs. LLM Text Corpus [Dataset]. https://www.kaggle.com/datasets/starblasters8/human-vs-llm-text-corpus
Available download formats: zip (2059496493 bytes)
Dataset updated
Jan 10, 2024
Authors
Zachary Grinberg
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
I'm currently writing a research paper on AI Detection and its accuracy/effectiveness. While doing so, over the past few months I've generated a large amount of text using various LLMs. This is a dataset/corpus containing all of the data I generated/gathered as well as the text that was generated by various other users.

If you have any questions please post them on the Discussion page or contact me through Kaggle. Generating all of this took many hours of work and a few hundred dollars, all I ask in return is that you credit me if you find this dataset useful in your research. Also, an upvote would mean the world.

Ps. The picture is of my dog, Tessa, who passed away recently. I wasn't sure what to put as the picture so I thought that was better than nothing.

Here are the datasets I used in addition to the text I generated. PLEASE UPVOTE THEM!

https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset

https://www.kaggle.com/datasets/thedrcat/daigt-v3-train-dataset

https://www.kaggle.com/datasets/alejopaullier/daigt-external-dataset

https://www.kaggle.com/datasets/nbroad/persaude-corpus-2

https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b

https://www.kaggle.com/datasets/radek1/llm-generated-essays

https://www.kaggle.com/datasets/darraghdog/hello-claude-1000-essays-from-anthropic

https://www.kaggle.com/datasets/kingki19/llm-generated-essay-using-palm-from-google-gen-ai

https://www.kaggle.com/datasets/thedrcat/daigt-proper-train-dataset

https://www.kaggle.com/competitions/llm-detect-ai-generated-text/data

https://www.kaggle.com/competitions/feedback-prize-english-language-learning/data

https://www.kaggle.com/datasets/japkeeratsingh/ielts-writing

https://github.com/yafuly/DeepfakeTextDetect

https://huggingface.co/datasets/qwedsacf/ivypanda-essays

https://huggingface.co/datasets/nid989/EssayFroum-Dataset

https://huggingface.co/datasets/whateverweird17/essay_grade_v1

https://huggingface.co/datasets/dim/essayforum_raw_writing_10k

https://huggingface.co/datasets/ChristophSchuhmann/essays-with-instructions

https://huggingface.co/datasets/whateverweird17/essay_grade_v2
T-Test_Left_Temporal_Right_Temporal_ttv
kaggle.com
zip
Updated Dec 26, 2024
+ more versions
Cite
Tariq Javed (2024). T-Test_Left_Temporal_Right_Temporal_ttv [Dataset]. https://www.kaggle.com/datasets/tariqjaved/t-test-left-temporal-right-temporal-ttv
Explore at:
zip (1092704621 bytes)
Dataset updated
Dec 26, 2024
Authors
Tariq Javed
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Dataset

This dataset was created by Tariq Javed

Released under Apache 2.0

Contents

Cora with Semi-Supervised

kaggle.com

zip

Updated Feb 11, 2024

Cite

cyz020403 (2024). Cora with Semi-Supervised [Dataset]. https://www.kaggle.com/datasets/cyz020403/corasupervised

Explore at:

zip (195584 bytes)

Dataset updated

Feb 11, 2024

Authors

cyz020403

License

MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically

Description

Cora is a widely used node classification dataset. What you are seeing now is the processed version from PyG. Its source file comes from the following paper:

Revisiting Semi-Supervised Learning with Graph Embeddings. Zhilin Yang, William W. Cohen, Ruslan Salakhutdinov. ICML 2016.

Please cite the above paper if this dataset is useful to you.

Statistical data

Name	#nodes	#edges	#features	#classes
Cora	2708	10556	1433	7

For further description of the data please refer to the 'File Description' section below.

Processing

This dataset can be downloaded directly from PyG. For use on Kaggle, I only exported it to .csv files without changing the data.

You can run the following code to get the same .csv files:

import os

import pandas as pd
import torch
from torch_geometric.datasets import Planetoid

# Download (if needed) and load the Cora graph from PyG.
dataset = Planetoid('./', 'Cora')
data = dataset[0]

x = data.x                    # node feature matrix
y = data.y                    # node labels
edge_index = data.edge_index  # edges, shape [2, num_edges]
train_mask = data.train_mask
val_mask = data.val_mask
test_mask = data.test_mask

# Labels of the nodes in each split.
y_train = y[train_mask]
y_val = y[val_mask]
y_test = y[test_mask]

# Node indices of each split; these ranges match the masks of the
# default Planetoid split of Cora.
train_index = torch.arange(0, 140)
val_index = torch.arange(140, 640)
test_index = torch.arange(1708, 2708)

# Pair every node index with its label: one (index, label) row per node.
y_train = torch.cat((train_index.reshape(-1, 1), y_train.reshape(-1, 1)), dim=1)
y_val = torch.cat((val_index.reshape(-1, 1), y_val.reshape(-1, 1)), dim=1)
y_test = torch.cat((test_index.reshape(-1, 1), y_test.reshape(-1, 1)), dim=1)

# to_csv does not create missing directories, so make sure ./data exists.
os.makedirs('./data', exist_ok=True)

# Node features, one column per feature (x0, x1, ...).
x_df = pd.DataFrame(x.numpy())
x_header = ['x' + str(i) for i in range(x_df.shape[1])]
x_df.to_csv('./data/x.csv', index=False, header=x_header)

# Edge list, one (source, target) row per edge.
edge_index_df = pd.DataFrame(edge_index.t().numpy())
edge_index_header = ['source', 'target']
edge_index_df.to_csv('./data/edge_index.csv', index=False, header=edge_index_header)

# Labels of each split.
y_header = ['index', 'label']
y_train_df = pd.DataFrame(y_train.numpy())
y_train_df.to_csv('./data/y_train.csv', index=False, header=y_header)

y_val_df = pd.DataFrame(y_val.numpy())
y_val_df.to_csv('./data/y_val.csv', index=False, header=y_header)

y_test_df = pd.DataFrame(y_test.numpy())
y_test_df.to_csv('./data/y_test.csv', index=False, header=y_header)
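
As a rough illustration of how the exported edge list might be consumed, the sketch below builds an adjacency list from a source,target CSV using only the Python standard library. The sample rows and the load_adjacency helper are made up for this example and are not real Cora edges; only the 'source' and 'target' column names come from the export code above.

```python
import csv
import io
from collections import defaultdict

# Made-up stand-in for ./data/edge_index.csv: same header, NOT real Cora edges.
SAMPLE_EDGES_CSV = """source,target
0,1
1,0
0,2
2,0
"""


def load_adjacency(csv_file):
    """Build {node: sorted list of neighbours} from a source,target edge list."""
    adjacency = defaultdict(set)
    for row in csv.DictReader(csv_file):
        adjacency[int(row["source"])].add(int(row["target"]))
    return {node: sorted(neighbours) for node, neighbours in adjacency.items()}


adjacency = load_adjacency(io.StringIO(SAMPLE_EDGES_CSV))
print(adjacency)  # {0: [1, 2], 1: [0], 2: [0]}
```

To read the real file, pass an open file handle for ./data/edge_index.csv instead of the in-memory sample.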

T-Test_Parietal_ttv
kaggle.com
zip
Updated Dec 20, 2024
+ more versions
Cite
Tariq Javed (2024). T-Test_Parietal_ttv [Dataset]. https://www.kaggle.com/datasets/tariqjaved/t-test-parietal-ttv/code
Explore at:
zip (1118037814 bytes)
Dataset updated
Dec 20, 2024
Authors
Tariq Javed
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Dataset

This dataset was created by Tariq Javed

Released under Apache 2.0

Contents

Cite

vikram amin (2023). Student Score - Hypothesis Testing (T Test) [Dataset]. https://www.kaggle.com/datasets/vikramamin/student-score-hypothesis-testing-t-test

Student Score - Hypothesis Testing (T Test)

Using T Test for finding out the difference in means of two groups

Explore at:

zip (7328 bytes)

Dataset updated

Sep 21, 2023

Authors

vikram amin

License

https://creativecommons.org/publicdomain/zero/1.0/

Description

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fbf9b8f2d8afc8aad16aadf167ee53777%2FPicture1.png?generation=1695275487466508&alt=media

Data Cleaning https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F5ac517a06cd3aff12b58297504902583%2FPicture2.png?generation=1695276101423952&alt=media
Convert data types of the required variables https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F61c8e665b906ba21d0579d06ab85b028%2FPicture3.png?generation=1695276209705142&alt=media
Load the libraries dplyr, ggplot2, tidyverse, tidyr
Find out the count of male vs female students https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fb33f152cfb579742aca479923f271b6d%2FPicture4.png?generation=1695276542256981&alt=media https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F210dd1f7bf238efff7227f5465c77806%2FCount%20of%20Students.jpeg?generation=1695276553831777&alt=media
We keep only two columns, namely 'Sex' and 'G3', and remove the other columns https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F148205c33fd7cafc1a0ac05ac205c2b1%2FPicture5.png?generation=1695276691132338&alt=media
t = -2.0651 is the test statistic: the observed difference in group means, measured in standard errors away from 0 (the value expected under the null hypothesis).
df = 390.57 is the degrees of freedom, which depends on the sample sizes and indicates how many free data points are available for making comparisons. A fractional value like this suggests Welch's unequal-variance t-test.
p-value = 0.03958 is less than alpha (0.05), so we can reject the null hypothesis of equal means. Hence the difference is statistically significant.
The 95% confidence interval for the true difference in means runs from -1.85 to -0.04. It excludes 0, which is consistent with the p-value being below 0.05.
The difference in sample means between the two groups is (10.91 - 9.96) = 0.95.
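
The quantities above can be reproduced with a short calculation. The following is a minimal sketch of Welch's t statistic and the Welch-Satterthwaite degrees of freedom in plain Python, assuming an unequal-variance test was used (the source only shows the output in screenshots). The welch_t_test helper and both score lists are made up for illustration and are not taken from the student dataset.

```python
import math
import statistics


def welch_t_test(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    n_a, n_b = len(a), len(b)
    # Squared standard error of each group mean (sample variance / n).
    se2_a = statistics.variance(a) / n_a
    se2_b = statistics.variance(b) / n_b
    # t: difference in means, in units of its standard error.
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation; usually not a whole number.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical scores for two groups (NOT the real G3 values).
group_a = [9, 11, 10, 8, 12, 10]
group_b = [12, 11, 13, 10, 14, 12]

t_stat, df = welch_t_test(group_a, group_b)
print(f"t = {t_stat:.4f}, df = {df:.2f}")  # t = -2.4495, df = 10.00
```

The sign of t only reflects which group is subtracted from which. Turning t and df into a p-value needs the t distribution's CDF; in Python, scipy.stats.ttest_ind(a, b, equal_var=False) would report t and the p-value directly.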

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F02b61a708cb074be592362c39ad33779%2FPicture6.png?generation=1695277010381962&alt=media

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F769ed45d3ef398e14589e461e3d3fedd%2FHistogram.jpeg?generation=1695277023581085&alt=media

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F21025562578ad7901abd35319a09579d%2FPicture7.png?generation=1695277093476017&alt=media

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F0ead3643b8e945b83a257cdb30871143%2FDensity%20plot.jpeg?generation=1695277110253483&alt=media

Both the histogram and the density plot indicate that there are students who got 0. Could this be due to non-attendance at exams? Let us find out the number of students who got 0.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F407028608e7d32361198d15ef854ace2%2FPicture8.png?generation=1695277271891422&alt=media

- 38 of the 395 students got a score of 0. That is 9.62% of the students.
- Let us check the mean for both groups after removing the students who got zeros.
- We have created a new data frame called student 2, which includes a total of 357 students with no zero marks.
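
The counts quoted above are easy to check, and the filtering step can be sketched in a few lines. This is only an illustration in Python: the scores list is made up, and the original analysis appears to have been done in R with the data shown in the screenshots.

```python
# Hypothetical final scores (NOT the real G3 column); 0 marks a zero score.
scores = [0, 10, 15, 0, 8, 12, 0, 11]

# Keep only the students with a non-zero score.
nonzero_scores = [s for s in scores if s != 0]
zero_count = len(scores) - len(nonzero_scores)
zero_share = 100 * zero_count / len(scores)
print(f"{zero_count} of {len(scores)} scores are zero ({zero_share:.2f}%)")

# Check of the figures quoted in the text: 38 zeros out of 395 students.
reported_share = round(100 * 38 / 395, 2)  # percentage of zero scores
remaining_students = 395 - 38              # students left after filtering
print(reported_share, remaining_students)  # 9.62 357
```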

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F06194aa5ed468045ef0d8cdeb82945d5%2FPicture9.png?generation=1695277409031566&alt=media

Conclusion:
The mean is 11.20 for females and 11.86 for males. The difference in means between the two groups is 0.66, compared with the earlier difference of 0.95.
The p-value is 0.05335. For us to reject the null hypothesis, the p-value should be less than 0.05.
Therefore we fail to reject the null hypothesis: with the zero scores removed, the difference is not statistically significant at the 5% level, although the p-value is close to the threshold.
