100+ datasets found
  1. Student Score - Hypothesis Testing (T Test)

    • kaggle.com
    zip
    Updated Sep 21, 2023
    Cite
    vikram amin (2023). Student Score - Hypothesis Testing (T Test) [Dataset]. https://www.kaggle.com/datasets/vikramamin/student-score-hypothesis-testing-t-test
    Explore at:
    zip(7328 bytes)Available download formats
    Dataset updated
    Sep 21, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    • Data Cleaning

    • Convert the data types of the required variables

    • Load the libraries dplyr, ggplot2, tidyverse, tidyr

    • Find out the count of male vs female students [Figure: Count of Students]

    • We keep only two columns, namely 'Sex' and 'G3', and remove the other columns

    • t = -2.0651 is the test statistic: how far the observed difference in means lies from 0, measured in standard errors.

    • df = 390.57 is the degrees of freedom, which depends on the sample sizes: roughly, how many free data points are available for making comparisons.

    • p value = 0.03958 is the probability of seeing a difference at least this large if the null hypothesis were true. Because it is less than alpha (0.05), we can reject the null hypothesis; the result is statistically significant.

    • The 95% confidence interval suggests that the true difference in means lies between -1.85 and -0.04 (an interval built this way covers the true difference 95% of the time).

    • The difference in means between the two groups is (10.91 - 9.96) = 0.95.
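
For readers who want to see where t and df come from, here is a minimal sketch of Welch's two-sample t statistic and the Welch-Satterthwaite degrees of freedom in plain Python. It assumes the reported test was Welch's unequal-variance t-test (the fractional df = 390.57 points that way, but the description does not say so). The score lists are hypothetical and are not values from the dataset.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's two-sample t-test on samples a and b."""
    na, nb = len(a), len(b)
    # Squared standard error of each group mean (sample variance / n).
    va, vb = variance(a) / na, variance(b) / nb
    # t: difference in means measured in standard errors.
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation; usually not a whole number.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical G3 scores, for illustration only.
female = [10, 8, 12, 9, 11, 0, 13]
male = [12, 11, 14, 10, 13, 9, 15]

t, df = welch_t(female, male)
print(f"t = {t:.4f}, df = {df:.2f}")
```

A negative t here simply means the first group's mean is lower than the second's. Turning t and df into a p value needs the t distribution's CDF, which the standard library does not provide, so that step is left out.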

    [Figures: Histogram; Density plot]

    • Both the histogram and the density plot show that some students scored 0. Could this be due to non-attendance at exams? Let us find out how many students scored 0.

    • 38 of the 395 students have a score of 0. That is 9.62% of students.
    • Let us check the mean for both groups after removing the students who got zeros.
    • We have created a new data frame called student 2, which includes the 357 students with non-zero marks.

    • Conclusion:
    • The mean is 11.20 for females and 11.86 for males. The difference between the two group means is 0.66, compared with the earlier difference of 0.95.
    • The p value is 0.05335. To reject the null hypothesis, the p value must be less than 0.05.
    • Since 0.05335 is slightly above 0.05, we fail to reject the null hypothesis: with the zero scores removed, the difference is not statistically significant at the 5% level.
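
The arithmetic and the decision rule behind this conclusion can be checked with a few lines of Python. This is only a sketch using the figures quoted in the description; the helper name `reject_null` is made up for illustration.

```python
# Figures quoted in the description above.
total_students = 395
zero_scores = 38
alpha = 0.05

share_zero = zero_scores / total_students * 100  # percent of students with a score of 0
remaining = total_students - zero_scores         # students left after dropping zeros

diff_all = 10.91 - 9.96       # difference in means, all students
diff_nonzero = 11.86 - 11.20  # difference in means, zero scores removed


def reject_null(p_value, alpha=alpha):
    """Conventional rule: reject the null hypothesis only when p < alpha."""
    return p_value < alpha


print(f"{share_zero:.2f}% scored 0; {remaining} students remain")
print(f"mean differences: {diff_all:.2f} vs {diff_nonzero:.2f}")
print("all students, reject H0:", reject_null(0.03958))
print("zeros removed, reject H0:", reject_null(0.05335))
```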
  2. (🌅 Sunset) Kaggle Users' Country + Regions Info

    • kaggle.com
    zip
    Updated Feb 14, 2024
    Cite
    BwandoWando (2024). (🌅 Sunset) Kaggle Users' Country + Regions Info [Dataset]. https://www.kaggle.com/datasets/bwandowando/kaggle-user-country-regions
    Explore at:
    zip(2376511 bytes)Available download formats
    Dataset updated
    Feb 14, 2024
    Authors
    BwandoWando
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    [Context]

    The official Meta-Kaggle dataset contains the Users.csv file which contains Username, DisplayName, RegisterDate, and PerformanceTier fields but doesn't contain location data of the Kaggle Users. This dataset augments that data with additional country and region information.

    [Note]

    I haven't included the username and displayname values on purpose, just the userid to be joined back to the Meta-Kaggle official Users.csv file.
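
Joining the location rows back to Users.csv by user id might look like the sketch below. The column names (`Id`, `UserId`, `Country`, `Region`) and the sample rows are assumptions made for illustration; check the actual file headers before relying on them.

```python
import csv
import io

# Hypothetical miniature stand-ins for the two CSV files.
# Column names are assumptions, not confirmed by the dataset description.
users_csv = io.StringIO(
    "Id,UserName,DisplayName\n"
    "1,alice,Alice A\n"
    "2,bob,Bob B\n"
)
locations_csv = io.StringIO(
    "UserId,Country,Region\n"
    "1,Philippines,Asia\n"
)

# Index the location rows by user id for constant-time lookup.
locations = {row["UserId"]: row for row in csv.DictReader(locations_csv)}

joined = []
for user in csv.DictReader(users_csv):
    # Left join: users without location data are kept with empty fields.
    loc = locations.get(user["Id"], {})
    joined.append({
        "Id": user["Id"],
        "UserName": user["UserName"],
        "Country": loc.get("Country"),
        "Region": loc.get("Region"),
    })

for row in joined:
    print(row)
```

A left join is used so that users with missing location data, which the limitations below describe, are not silently dropped.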

    [Limitations]

    It is possible that some users haven't inputted their details when the scraper went through their accounts and thus have missing data. Another possibility is that users may have updated their info after the scraper went through their accounts, thus resulting in inconsistencies.

    [How I defined active in this dataset]

    • Users that have received an upvote in the forums, datasets, or notebooks
    • Users that have given an upvote in the forums, datasets, or notebooks
    • Users that have created a thread, a forum post, a notebook, or a dataset
    • Users that made a competition submission
    • Users that exist in the Meta-Kaggle Users dataset
    • Date cut-off of Jan 01, 2019

    [Update]

    • 15-Feb-2024 - Since the update to Kaggle members' profile pages, the scrapers aren't working anymore because the UI layout has changed. Will fix this when we get the time.
  3. T-Shirts Dataset

    • kaggle.com
    zip
    Updated Jul 15, 2022
    Cite
    Sunny Kusawa (2022). T-Shirts Dataset [Dataset]. https://www.kaggle.com/datasets/sunnykusawa/tshirts/code
    Explore at:
    zip(375026934 bytes)Available download formats
    Dataset updated
    Jul 15, 2022
    Authors
    Sunny Kusawa
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset was collected from an online e-commerce site. It is raw, noisy data of the kind you encounter in industry data science projects.

    Applying your image preprocessing skills to such data will help you understand the real-world problems and challenges of industry projects.

    It contains some junk images, partial views, and images with multiple t-shirt views in a single image.

    You can perform different tasks on this dataset, for example:

    Beginner
    • Resize all images to 48 * 48
    • Convert all images to grayscale

    Intermediate
    • Perform image masking on all images
    • Develop a classifier to detect whether a given image is a t-shirt or not

    Advanced
    • Try to cluster the t-shirt images
    • Try clustering based on color
    • Try clustering based on full, partial, multiple, and junk t-shirt images
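
As a rough illustration of the two beginner tasks, the sketch below converts an image to grayscale and resizes it to 48 x 48 using only plain Python lists. The 2 x 2 image is hypothetical, the luminance weights are the common ITU-R BT.601 ones, and nearest-neighbour sampling is just one possible resizing method. In practice an imaging library such as Pillow or OpenCV would normally handle this.

```python
def to_grayscale(pixels):
    """Convert rows of (R, G, B) tuples to rows of 0-255 luminance values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in pixels
    ]


def resize_nearest(pixels, width, height):
    """Resize a 2-D grid to width x height with nearest-neighbour sampling."""
    src_h, src_w = len(pixels), len(pixels[0])
    return [
        [pixels[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


# Hypothetical 2 x 2 RGB image: red, green / blue, white.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

small = resize_nearest(to_grayscale(image), 48, 48)
print(len(small), "rows x", len(small[0]), "columns")
```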

  4. Clothes Dataset

    • kaggle.com
    zip
    Updated Dec 20, 2024
    Cite
    RyanBadai (2024). Clothes Dataset [Dataset]. https://www.kaggle.com/datasets/ryanbadai/clothes-dataset
    Explore at:
    zip(1473495763 bytes)Available download formats
    Dataset updated
    Dec 20, 2024
    Authors
    RyanBadai
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains images of clothing items scraped from Carousell, an online marketplace, specifically curated for image classification tasks. It includes a diverse set of classes representing different types of clothing, making it an excellent resource for machine learning and computer vision projects. The dataset is organized into the following 15 classes:
    • Blazer
    • Celana_Panjang (Long Pants)
    • Celana_Pendek (Shorts)
    • Gaun (Dresses)
    • Hoodie
    • Jaket (Jacket)
    • Jaket_Denim (Denim Jacket)
    • Jaket_Olahraga (Sports Jacket)
    • Jeans
    • Kaos (T-shirt)
    • Kemeja (Shirt)
    • Mantel (Coat)
    • Polo
    • Rok (Skirt)
    • Sweter (Sweater)

    The images in this dataset represent various styles, textures, and colors, offering a comprehensive resource for training models to recognize and classify clothing categories. It is ideal for tasks such as building fashion recommendation systems, creating virtual try-on applications, or studying visual trends in fashion e-commerce. Whether you are an enthusiast or a professional, this dataset can help explore and experiment with deep learning techniques in the realm of fashion.

  5. Website Performance dataset

    • kaggle.com
    zip
    Updated Jul 11, 2024
    Cite
    maida sajid (2024). Website Performance dataset [Dataset]. https://www.kaggle.com/datasets/maidasajid/website-performance-dataset
    Explore at:
    zip(19979 bytes)Available download formats
    Dataset updated
    Jul 11, 2024
    Authors
    maida sajid
    Description

    This dataset contains website performance metrics, including response time and throughput, collected from Pingdom and Site24x7. The data has been meticulously labeled by students from FAST NUCES and UMT.

  6. Twitter Tweets Sentiment Dataset

    • kaggle.com
    zip
    Updated Apr 8, 2022
    Cite
    M Yasser H (2022). Twitter Tweets Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset
    Explore at:
    zip(1289519 bytes)Available download formats
    Dataset updated
    Apr 8, 2022
    Authors
    M Yasser H
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description


    Description:

    Twitter is an online social media platform where people share their thoughts as tweets. It is observed that some people misuse it to tweet hateful content. Twitter is trying to tackle this problem, and we shall help by creating a strong NLP-based classifier model to distinguish negative tweets and block them. Can you build a strong classifier model to predict the same?

    Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.

    Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.
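
A minimal sketch of that clean-up step, using Python's built-in `csv` module, is shown below. The two sample rows are hypothetical, and `clean_text` is a helper name invented for this example. The `csv` reader already removes the quotes that delimit a field, so the helper only has to strip quote characters left over inside the field value.

```python
import csv
import io

# Hypothetical two-row sample in the documented column layout.
sample = io.StringIO(
    'textID,text,sentiment\n'
    'a1,"""I love this""",positive\n'
    'b2," so sad today",negative\n'
)


def clean_text(raw):
    """Trim whitespace and any leftover leading/trailing double quotes."""
    return raw.strip().strip('"').strip()


rows = [
    {**row, "text": clean_text(row["text"])}
    for row in csv.DictReader(sample)
]

for row in rows:
    print(repr(row["text"]), row["sentiment"])
```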

    You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.)

    Columns:

    1. textID - unique ID for each piece of text
    2. text - the text of the tweet
    3. sentiment - the general sentiment of the tweet

    Acknowledgement:

    The dataset is downloaded from Kaggle Competitions:
    https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv

    Objective:

    • Understand the Dataset & cleanup (if required).
    • Build classification models to predict the twitter sentiments.
    • Compare the evaluation metrics of various classification algorithms.
  7. sample data T

    • kaggle.com
    zip
    Updated Oct 18, 2022
    Cite
    SelcukCan (2022). sample data T [Dataset]. https://www.kaggle.com/datasets/selcukcan/sample-data
    Explore at:
    zip(557 bytes)Available download formats
    Dataset updated
    Oct 18, 2022
    Authors
    SelcukCan
    Description

    Dataset

    This dataset was created by SelcukCan

    Contents

  8. Theft Detection

    • kaggle.com
    zip
    Updated Nov 11, 2018
    Cite
    Simran Singh (2018). Theft Detection [Dataset]. https://www.kaggle.com/datasets/thatbrownkid/thefttest
    Explore at:
    zip(22734050133 bytes)Available download formats
    Dataset updated
    Nov 11, 2018
    Authors
    Simran Singh
    Description

    Dataset

    This dataset was created by Simran Singh

    Contents

  9. Lung disease 5 Class dataset T, P, N, E, C

    • kaggle.com
    zip
    Updated Aug 29, 2025
    Cite
    Obaidul Haque (2025). Lung disease 5 Class dataset T, P, N, E, C [Dataset]. https://www.kaggle.com/datasets/obaidulhaque/lung-disease-5-class-dataset-t-p-n-e-c
    Explore at:
    zip(3814132911 bytes)Available download formats
    Dataset updated
    Aug 29, 2025
    Authors
    Obaidul Haque
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The dataset is a mix of data gathered from across the Kaggle platform. The source datasets are:
    1. https://www.kaggle.com/datasets/fernando2rad/x-ray-lung-diseases-images-9-classes?select=04+Doen%C3%A7as+Pulmonares+Obstrutivas+%28Enfisema%2C+Broncopneumonia%2C+Bronquiectasia%2C+Embolia%29
    2. https://www.kaggle.com/datasets/yasserhessein/tuberculosis-chest-x-rays-images/data
    3. https://www.kaggle.com/datasets/basitkhan12/covid-and-pneumonia-chest-x-rays
    4. https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia

    The dataset contains 5 classes:
    1. Tuberculosis - 5144 images
    2. Pneumonia - 5121
    3. Normal - 5083
    4. Emphysema - 4928
    5. COVID-19 - 5523

    In total, the dataset consists of 25799 images. All of them are radiography X-ray images.
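
The class counts can be summed and turned into proportions with a short script, which is a quick way to confirm the stated total and see how balanced the classes are. The inverse-frequency class weights at the end are one common way to compensate for mild imbalance; they are an illustrative assumption, not something the dataset author prescribes.

```python
# Image counts per class, as listed in the description.
class_counts = {
    "Tuberculosis": 5144,
    "Pneumonia": 5121,
    "Normal": 5083,
    "Emphysema": 4928,
    "COVID-19": 5523,
}

total = sum(class_counts.values())

# Share of the dataset taken up by each class.
shares = {name: count / total for name, count in class_counts.items()}

# Inverse-frequency weights: rarer classes get weights above 1.
weights = {
    name: total / (len(class_counts) * count)
    for name, count in class_counts.items()
}

print("total images:", total)
for name in class_counts:
    print(f"{name}: {shares[name]:.1%} of data, weight {weights[name]:.3f}")
```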

  10. Multimodal Stroke Image Dataset

    • kaggle.com
    zip
    Updated Apr 18, 2025
    Cite
    Türker TUNCER (2025). Multimodal Stroke Image Dataset [Dataset]. https://www.kaggle.com/datasets/turkertuncer/multimodal-stroke-image-dataset
    Explore at:
    zip(582781243 bytes)Available download formats
    Dataset updated
    Apr 18, 2025
    Authors
    Türker TUNCER
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset was collected retrospectively from a single medical center between 2021 and 2023. CT imaging was performed on a 128-slice GE Revolution scanner in the axial plane with a 5 mm slice thickness. MRI was acquired on a GE Signa 1.5 T system using axial diffusion-weighted imaging (DWI) sequences at b-values of 0 and 1000 s/mm². All acquisitions followed the hospital's standard stroke screening protocol.

    There are two distinct classes: (1) stroke and (2) control. These classes were clinically verified by neurologists and neuroradiologists. The dataset comprises data from 230 participants, with a gender distribution of 113 females and 117 males. Among these participants, 115 were diagnosed with stroke, while the remaining 115 were assigned to the control group. An average of 7-8 cross-sectional images were used for each imaging type. The dataset includes a total of 5,336 CT and MRI images (2,226 CT + 3,110 MR), with 2,695 images representing stroke cases and 2,641 images corresponding to control cases.

    All patient imaging data were fully anonymized before analysis. Identifiers such as name, date of birth, patient ID, and acquisition timestamps were removed from all image headers. We reviewed the dataset for missing images or labels and excluded any cases with incomplete CT or MR series; no imputation was performed. Reference labels were assigned by one neuroradiologist and two emergency medicine specialists, based on clinical reports and follow-up data.

  11. Telco Customer Churn

    • kaggle.com
    zip
    Updated Feb 23, 2018
    + more versions
    Cite
    BlastChar (2018). Telco Customer Churn [Dataset]. https://www.kaggle.com/datasets/blastchar/telco-customer-churn
    Explore at:
    zip(175758 bytes)Available download formats
    Dataset updated
    Feb 23, 2018
    Authors
    BlastChar
    Description

    Context

    "Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs." [IBM Sample Data Sets]

    Content

    Each row represents a customer, each column contains customer’s attributes described on the column Metadata.

    The data set includes information about:

    • Customers who left within the last month – the column is called Churn
    • Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
    • Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
    • Demographic info about customers – gender, age range, and if they have partners and dependents
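
As a first look at the Churn column, the sketch below computes the churn rate per group. Only the column name `Churn` comes from the description; the `customerID` and `Contract` column names, the "Yes"/"No" values, and the sample rows are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the column names other than Churn are assumed.
sample = io.StringIO(
    "customerID,Contract,Churn\n"
    "0001,Month-to-month,Yes\n"
    "0002,Month-to-month,No\n"
    "0003,Two year,No\n"
    "0004,Month-to-month,Yes\n"
)


def churn_rate_by(rows, key):
    """Return {group value: share of customers with Churn == 'Yes'}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [churned, total]
    for row in rows:
        counts[row[key]][1] += 1
        if row["Churn"] == "Yes":  # assumed encoding of churned customers
            counts[row[key]][0] += 1
    return {group: churned / total for group, (churned, total) in counts.items()}


rates = churn_rate_by(csv.DictReader(sample), "Contract")
for group, rate in rates.items():
    print(f"{group}: {rate:.0%} churned")
```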

    Inspiration

    To explore this type of model and learn more about the subject.

    New version from IBM: https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113

  12. t-data

    • kaggle.com
    zip
    Updated Oct 28, 2024
    Cite
    kusanovbaiastan (2024). t-data [Dataset]. https://www.kaggle.com/datasets/kusanovbaiastan/t-data/discussion
    Explore at:
    zip(557 bytes)Available download formats
    Dataset updated
    Oct 28, 2024
    Authors
    kusanovbaiastan
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by kusanovbaiastan

    Released under Apache 2.0

    Contents

  13. Stock Price and News Headlines

    • kaggle.com
    zip
    Updated Sep 10, 2023
    Cite
    Prateik Lohani (2023). Stock Price and News Headlines [Dataset]. https://www.kaggle.com/datasets/prateiklohani/stock-price-and-news-headlines
    Explore at:
    zip(3302851 bytes)Available download formats
    Dataset updated
    Sep 10, 2023
    Authors
    Prateik Lohani
    Description

    Basic dataset for sentiment analysis, and for prediction of stock prices based on news headlines. I tried to look for this dataset on kaggle but couldn't find this anywhere.

    (By no means am I the original creator of this dataset. I just found it and uploaded it here since I couldn't find it on kaggle. Please let me know if it is already present on kaggle- if it is, then I'll remove this one at once.)

    Anyway, there are label values in the file (Label column):
    • 0: Stock price DECREASED
    • 1: Stock price INCREASED due to headlines (or at least didn't decrease)

    Encoding info given in file desc. 👇

  14. English Wikipedia People Dataset

    • kaggle.com
    zip
    Updated Jul 31, 2025
    Cite
    Wikimedia (2025). English Wikipedia People Dataset [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/english-wikipedia-people-dataset
    Explore at:
    zip(4293465577 bytes)Available download formats
    Dataset updated
    Jul 31, 2025
    Dataset provided by
    Wikimedia Foundation http://www.wikimedia.org/
    Authors
    Wikimedia
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Summary

    This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.

    The beta sample dataset is a subset of the Structured Contents Snapshot focusing on people with infoboxes in EN wikipedia; outputted as json files (compressed in tar.gz).

    We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.

    Data Structure

    • File name: wme_people_infobox.tar.gz
    • Size of compressed file: 4.12 GB
    • Size of uncompressed file: 21.28 GB

    Noteworthy Included Fields:
    • name - title of the article.
    • identifier - ID of the article.
    • image - main image representing the article's subject.
    • description - one-sentence description of the article for quick reference.
    • abstract - lead section, summarizing what the article is about.
    • infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
    • sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.

    The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.
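
Reading records out of a tar.gz archive of JSON files could be sketched as follows. It assumes each file inside the archive holds one JSON object per line, which the description does not confirm. The member name `people_0.ndjson` and the sample records are made up so the example runs without the real 4.12 GB download.

```python
import io
import json
import tarfile


def iter_articles(fileobj):
    """Yield one dict per JSON line from every regular file in a tar.gz archive."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            handle = archive.extractfile(member)
            for line in handle:
                if line.strip():
                    yield json.loads(line)


def make_sample_archive():
    """Build a tiny in-memory archive with hypothetical records."""
    records = [
        {"name": "Example Person A", "identifier": 1, "infoboxes": []},
        {"name": "Example Person B", "identifier": 2, "infoboxes": []},
    ]
    payload = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="people_0.ndjson")  # hypothetical member name
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


names = [article["name"] for article in iter_articles(make_sample_archive())]
print(names)
```

Because the generator yields one record at a time, the same loop should work on a large archive without holding every article in memory.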

    Stats

    Infoboxes
    • Compressed: 2GB
    • Uncompressed: 11GB

    Infoboxes + sections + short description
    • Size of compressed file: 4.12 GB
    • Size of uncompressed file: 21.28 GB

    Article analysis and filtering breakdown:
    • Total # of articles analyzed: 6,940,949
    • # people found with QID: 1,778,226
    • # people found with Category: 158,996
    • # people found with Biography Project: 76,150
    • Total # of people articles found: 2,013,372
    • Total # people articles with infoboxes: 1,559,985

    End stats
    • Total number of people articles in this dataset: 1,559,985
    • that have a short description: 1,416,701
    • that have an infobox: 1,559,985
    • that have article sections: 1,559,921

    This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.

    Maintenance and Support

    This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024. The information in this dataset may therefore be out of date. This dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs

    Initial Data Collection and Normalization

    The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).

    Who are the source language producers?

    Wikipedia is a human generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community; the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of English Wikipedia language editions: English https://en.wikipedia.org/, written by the community.

    Attribution

    Terms and conditions

    Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...

  15. Fruit-Image-Dataset

    • kaggle.com
    zip
    Updated Aug 21, 2022
    Cite
    Ishan Dandekar (2022). Fruit-Image-Dataset [Dataset]. https://www.kaggle.com/datasets/ishandandekar/fruitimagedataset
    Explore at:
    zip(416927622 bytes)Available download formats
    Dataset updated
    Aug 21, 2022
    Authors
    Ishan Dandekar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains images of 131 kinds of fruits and vegetables. The original version of this dataset is available here. The original version was used for Fruits-360. Although the original dataset hasn't been updated for over 2 years, the dataset on Kaggle has been updated several times with better images. The dataset is intended for image classification. Do check the GitHub repository of the source here.

    Dataset Properties: (taken from the description in the repository itself)
    • Total number of images : 90483
    • Training set size : 67692 images (one fruit or vegetable per image)
    • Test set size : 22688 images (20% of total data)
    • Number of classes : 131 total fruits and vegetables
    • Filename format : image_index_100.jpg (e.g. 32_100.jpg) or r_image_index_100.jpg (e.g. r_32_100.jpg) or r2_image_index_100.jpg or r3_image_index_100.jpg. "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis. "100" comes from image size (100x100 pixels).
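
The filename format described above can be parsed with a regular expression. The sketch below is one possible reading of that format; the function name `parse_name` and the returned field names are invented for this example.

```python
import re

# Optional rotation prefix (r, r2, r3), then the image index, then the size.
FILENAME = re.compile(r"^(?:(?P<rot>r\d*)_)?(?P<index>\d+)_(?P<size>\d+)\.jpg$")


def parse_name(filename):
    """Split a filename like 'r2_32_100.jpg' into its parts, or return None."""
    match = FILENAME.match(filename)
    if match is None:
        return None
    return {
        "rotation": match["rot"],      # None for unrotated images
        "index": int(match["index"]),
        "size": int(match["size"]),    # 100 means 100x100 pixels
    }


for name in ["32_100.jpg", "r_32_100.jpg", "r2_7_100.jpg", "notes.txt"]:
    print(name, "->", parse_name(name))
```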

    Folder structure
    • Train : This folder has multiple subfolders labelled with the fruit's/vegetable's name, each containing the respective images. These images were used to train the models in the research paper.
    • Test : This folder has multiple subfolders labelled with the fruit's/vegetable's name, each containing the respective images. These images were used to test the models in the research paper.

    All credits to the researchers themselves. I made this dataset for my own ease-of-use.

  16. Pre and Post-Exercise Heart Rate Analysis

    • kaggle.com
    zip
    Updated Sep 29, 2024
    Cite
    Abdullah M Almutairi (2024). Pre and Post-Exercise Heart Rate Analysis [Dataset]. https://www.kaggle.com/datasets/abdullahmalmutairi/pre-and-post-exercise-heart-rate-analysis
    Explore at:
    zip(3857 bytes)Available download formats
    Dataset updated
    Sep 29, 2024
    Authors
    Abdullah M Almutairi
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Overview:

    This dataset contains simulated (hypothetical) but almost realistic (based on AI) data related to sleep, heart rate, and exercise habits of 500 individuals. It includes both pre-exercise and post-exercise resting heart rates, allowing for analyses such as a dependent t-test (Paired Sample t-test) to observe changes in heart rate after an exercise program. The dataset also includes additional health-related variables, such as age, hours of sleep per night, and exercise frequency.

    The data is designed for tasks involving hypothesis testing, health analytics, or even machine learning applications that predict changes in heart rate based on personal attributes and exercise behavior. It can be used to understand the relationships between exercise frequency, sleep, and changes in heart rate.

    File:
    • Filename: heart_rate_data.csv
    • File format: CSV

    Features (Columns):

    Age
    • Description: The age of the individual.
    • Type: Integer
    • Range: 18-60 years
    • Relevance: Age is an important factor in determining heart rate and the effects of exercise.

    Sleep Hours
    • Description: The average number of hours the individual sleeps per night.
    • Type: Float
    • Range: 3.0 - 10.0 hours
    • Relevance: Sleep is a crucial health metric that can impact heart rate and exercise recovery.

    Exercise Frequency (Days/Week)
    • Description: The number of days per week the individual engages in physical exercise.
    • Type: Integer
    • Range: 1-7 days/week
    • Relevance: More frequent exercise may lead to greater heart rate improvements and better cardiovascular health.

    Resting Heart Rate Before
    • Description: The individual's resting heart rate measured before beginning a 6-week exercise program.
    • Type: Integer
    • Range: 50 - 100 bpm (beats per minute)
    • Relevance: This is a key health indicator, providing a baseline measurement for the individual's heart rate.

    Resting Heart Rate After
    • Description: The individual's resting heart rate measured after completing the 6-week exercise program.
    • Type: Integer
    • Range: 45 - 95 bpm (lower than the "Resting Heart Rate Before" due to the effects of exercise).
    • Relevance: This variable is essential for understanding how exercise affects heart rate over time, and it can be used to perform a dependent t-test analysis.

    Max Heart Rate During Exercise
    • Description: The maximum heart rate the individual reached during exercise sessions.
    • Type: Integer
    • Range: 120 - 190 bpm
    • Relevance: This metric helps in understanding cardiovascular strain during exercise and can be linked to exercise frequency or fitness levels.

    Potential Uses: Dependent T-Test Analysis: The dataset is particularly suited for a dependent (paired) t-test where you compare the resting heart rate before and after the exercise program for each individual.

    Exploratory Data Analysis (EDA): Investigate relationships between sleep, exercise frequency, and changes in heart rate. Potential analyses include correlations between sleep hours and resting heart rate improvement, or regression analyses to predict heart rate after exercise.

    Machine Learning: Use the dataset for predictive modeling, and build a beginner regression model to predict post-exercise heart rate using age, sleep, and exercise frequency as features.

    Health and Fitness Insights: This dataset can be useful for studying how different factors like sleep and age influence heart rate changes and overall cardiovascular health.
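
The dependent (paired) t-test mentioned under Potential Uses can be sketched in plain Python as follows. The heart rate values are hypothetical and are not rows from heart_rate_data.csv. The function returns only the t statistic and degrees of freedom; a p value would need the t distribution's CDF from a statistics library.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """Return (t, df) for a paired t-test on before/after measurements."""
    if len(before) != len(after):
        raise ValueError("paired samples must have equal length")
    # Work on the per-person differences, not on the two groups separately.
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical resting heart rates (bpm) for five people.
before = [72, 80, 65, 90, 77]
after = [68, 75, 66, 84, 70]

t, df = paired_t(before, after)
print(f"t = {t:.3f}, df = {df}")
```

A positive t here means resting heart rate was lower after the program on average, since each difference is computed as before minus after.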

    License: Choose an appropriate open license, such as:

    CC BY 4.0 (Attribution 4.0 International).

    Inspiration for Kaggle Users:
    • How does exercise frequency influence the reduction in resting heart rate?
    • Is there a relationship between sleep and heart rate improvements post-exercise?
    • Can we predict the post-exercise heart rate using other health variables?
    • How do age and exercise frequency interact to affect heart rate?

    Acknowledgments: This is a simulated dataset for educational purposes, generated to demonstrate statistical and machine learning applications in the field of health analytics.

  17. Human vs. LLM Text Corpus

    • kaggle.com
    zip
    Updated Jan 10, 2024
    Cite
    Zachary Grinberg (2024). Human vs. LLM Text Corpus [Dataset]. https://www.kaggle.com/datasets/starblasters8/human-vs-llm-text-corpus
    Explore at:
    zip (2059496493 bytes) Available download formats
    Dataset updated
    Jan 10, 2024
    Authors
    Zachary Grinberg
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    I'm currently writing a research paper on AI Detection and its accuracy/effectiveness. While doing so, over the past few months I've generated a large amount of text using various LLMs. This is a dataset/corpus containing all of the data I generated/gathered as well as the text that was generated by various other users.

    If you have any questions, please post them on the Discussion page or contact me through Kaggle. Generating all of this took many hours of work and a few hundred dollars; all I ask in return is that you credit me if you find this dataset useful in your research. Also, an upvote would mean the world.

    P.S. The picture is of my dog, Tessa, who passed away recently. I wasn't sure what to put as the picture, so I thought that was better than nothing.

    Here are the datasets I used in addition to the text I generated (please upvote them!):

  18. T-Test_Left_Temporal_Right_Temporal_ttv

    • kaggle.com
    zip
    Updated Dec 26, 2024
    + more versions
    Cite
    Tariq Javed (2024). T-Test_Left_Temporal_Right_Temporal_ttv [Dataset]. https://www.kaggle.com/datasets/tariqjaved/t-test-left-temporal-right-temporal-ttv
    Explore at:
    zip (1092704621 bytes) Available download formats
    Dataset updated
    Dec 26, 2024
    Authors
    Tariq Javed
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Tariq Javed

    Released under Apache 2.0

    Contents

  19. Cora with Semi-Supervised

    • kaggle.com
    zip
    Updated Feb 11, 2024
    Cite
    cyz020403 (2024). Cora with Semi-Supervised [Dataset]. https://www.kaggle.com/datasets/cyz020403/corasupervised
    Explore at:
    zip (195584 bytes) Available download formats
    Dataset updated
    Feb 11, 2024
    Authors
    cyz020403
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description


    Cora is a widely used node classification dataset. What you are seeing now is the processed version from PyG. Its source file comes from the following paper:

    Revisiting Semi-Supervised Learning with Graph Embeddings. Zhilin Yang, William W. Cohen, Ruslan Salakhutdinov. ICML 2016.

    Please cite the above paper if these are useful to you.

    Statistical data

    Name    #nodes    #edges    #features    #classes
    Cora    2708      10556     1433         7

    For further description of the data please refer to the 'File Description' section below.

    Processing

    This dataset can be downloaded directly from PyG. For the needs of Kaggle evaluation, I simply processed it.

    You can run the following code to get the same .csv file:

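    import os

    import pandas as pd
    import torch
    from torch_geometric.datasets import Planetoid

    # Download (or load from cache) the Cora graph; the dataset holds a single graph object.
    dataset = Planetoid('./', 'Cora')
    data = dataset[0]

    # The to_csv calls below write into ./data, so make sure the directory exists.
    os.makedirs('./data', exist_ok=True)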
    
    x = data.x
    y = data.y
    edge_index = data.edge_index
    train_mask = data.train_mask
    val_mask = data.val_mask
    test_mask = data.test_mask
    
    y_train = y[train_mask]
    y_val = y[val_mask]
    y_test = y[test_mask]
    
    train_index = torch.arange(0, 140)
    val_index = torch.arange(140, 640)
    test_index = torch.arange(1708, 2708)
    
    y_train = torch.cat((train_index.reshape(-1, 1), y_train.reshape(-1, 1)), dim=1)
    y_val = torch.cat((val_index.reshape(-1, 1), y_val.reshape(-1, 1)), dim=1)
    y_test = torch.cat((test_index.reshape(-1, 1), y_test.reshape(-1, 1)), dim=1)
    
    x_df = pd.DataFrame(x.numpy())
    x_header = ['x' + str(i) for i in range(x_df.shape[1])]
    x_df.to_csv('./data/x.csv', index=False, header=x_header)
    
    edge_index_df = pd.DataFrame(edge_index.t().numpy())
    edge_index_header = ['source', 'target']
    edge_index_df.to_csv('./data/edge_index.csv', index=False, header=edge_index_header)
    
    y_header = ['index', 'label']
    y_train_df = pd.DataFrame(y_train.numpy())
    y_train_df.to_csv('./data/y_train.csv', index=False, header=y_header)
    
    y_val_df = pd.DataFrame(y_val.numpy())
    y_val_df.to_csv('./data/y_val.csv', index=False, header=y_header)
    
    y_test_df = pd.DataFrame(y_test.numpy())
    y_test_df.to_csv('./data/y_test.csv', index=False, header=y_header)
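    The script above pairs each label with a node index using hard-coded ranges, which assumes that the train, validation, and test masks select exactly those contiguous positions. As a small sketch of the same pairing derived from a mask instead, here is a pure-Python version; the helper `index_label_rows`, the toy labels, and the toy mask are hypothetical and unrelated to the real Cora data.

```python
def index_label_rows(mask, labels):
    """Pair each selected node's position with its label, like the rows of y_*.csv."""
    return [(i, labels[i]) for i, keep in enumerate(mask) if keep]


# Toy graph with six nodes; the labels and mask are made up for illustration.
labels = [3, 0, 2, 2, 1, 0]
train_mask = [True, True, False, False, True, False]

rows = index_label_rows(train_mask, labels)
print(rows)  # [(0, 3), (1, 0), (4, 1)]
```

    Deriving the indices from the mask keeps the index column correct even if a split is not contiguous.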
    

    ​

  20. T-Test_Parietal_ttv

    • kaggle.com
    zip
    Updated Dec 20, 2024
    + more versions
    Cite
    Tariq Javed (2024). T-Test_Parietal_ttv [Dataset]. https://www.kaggle.com/datasets/tariqjaved/t-test-parietal-ttv/code
    Explore at:
    zip (1118037814 bytes) Available download formats
    Dataset updated
    Dec 20, 2024
    Authors
    Tariq Javed
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Tariq Javed

    Released under Apache 2.0

    Contents

Cite
vikram amin (2023). Student Score - Hypothesis Testing (T Test) [Dataset]. https://www.kaggle.com/datasets/vikramamin/student-score-hypothesis-testing-t-test

Student Score - Hypothesis Testing (T Test)

Using T Test for finding out the difference in means of two groups

Explore at:
zip (7328 bytes) Available download formats
Dataset updated
Sep 21, 2023
Authors
vikram amin
License

https://creativecommons.org/publicdomain/zero/1.0/

Description

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fbf9b8f2d8afc8aad16aadf167ee53777%2FPicture1.png?generation=1695275487466508&alt=media

  • Data Cleaning https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F5ac517a06cd3aff12b58297504902583%2FPicture2.png?generation=1695276101423952&alt=media

  • Convert data types of the required variables https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F61c8e665b906ba21d0579d06ab85b028%2FPicture3.png?generation=1695276209705142&alt=media

  • Load the libraries dplyr, ggplot2, tidyverse, and tidyr

  • Find out the count of male vs female students https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fb33f152cfb579742aca479923f271b6d%2FPicture4.png?generation=1695276542256981&alt=media https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F210dd1f7bf238efff7227f5465c77806%2FCount%20of%20Students.jpeg?generation=1695276553831777&alt=media

  • We keep only two columns, namely 'Sex' and 'G3', and remove the other columns https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F148205c33fd7cafc1a0ac05ac205c2b1%2FPicture5.png?generation=1695276691132338&alt=media

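  • t = -2.0651 is the test statistic: the difference between the two group means measured in standard errors. The further it is from 0, the stronger the evidence against the null hypothesis.

  • df = 390.57 is the degrees of freedom, which depends on the sample sizes and indicates how many free data points are available for making comparisons. A non-integer value like this is characteristic of Welch's unequal-variance t-test.

  • p-value = 0.03958 is less than alpha (0.05), so we can reject the null hypothesis that the two group means are equal. Hence the difference is statistically significant.

  • The 95% confidence interval suggests that the true difference in means lies between -1.85 and -0.04. Because the interval does not include 0, it agrees with the test result.

  • The difference in sample means between the two groups is 10.91 - 9.96 = 0.95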
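The t statistic and degrees of freedom discussed in the bullets above can be sketched as a Welch (unequal-variance) two-sample test. The source does not name the variant, so Welch is an assumption based on the non-integer df; the function `welch_t` and both score lists are hypothetical and are not the student data.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Each group's contribution to the squared standard error of the difference.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation; usually not a whole number.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical final grades for two small groups.
group_a = [10, 12, 9, 11]
group_b = [14, 15, 13, 16, 12]

t_stat, df = welch_t(group_a, group_b)
print(f"t = {t_stat:.4f}, df = {df:.2f}")
```

The sign of t only reflects which group is subtracted from which; swapping the arguments flips the sign but leaves the p-value unchanged.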

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F02b61a708cb074be592362c39ad33779%2FPicture6.png?generation=1695277010381962&alt=media

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F769ed45d3ef398e14589e461e3d3fedd%2FHistogram.jpeg?generation=1695277023581085&alt=media

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F21025562578ad7901abd35319a09579d%2FPicture7.png?generation=1695277093476017&alt=media

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F0ead3643b8e945b83a257cdb30871143%2FDensity%20plot.jpeg?generation=1695277110253483&alt=media

  • Both the histogram and the density plot show that some students have a score of 0. Could this be due to students not attending the exam? Let us find out how many students scored 0.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F407028608e7d32361198d15ef854ace2%2FPicture8.png?generation=1695277271891422&alt=media

  • 38 of the 395 students have a score of 0. That is 9.62% of the students.
  • Let us check the mean for both groups after removing the students who scored 0.
  • We have created a new data frame called student 2, which contains the 357 students with non-zero marks
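The percentage and the filtered row count quoted above follow from simple arithmetic, sketched here. The two counts come from the text; the short `scores` list is hypothetical and only demonstrates the filtering step.

```python
total_students = 395  # from the text above
zero_scores = 38      # from the text above

# Share of students with a score of 0, as a percentage rounded to two decimals.
share = round(zero_scores / total_students * 100, 2)
remaining = total_students - zero_scores
print(share, remaining)

# Hypothetical G3 scores, used only to show how zero scores are dropped.
scores = [0, 11, 15, 0, 9, 10]
non_zero = [s for s in scores if s != 0]
print(non_zero)
```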

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F06194aa5ed468045ef0d8cdeb82945d5%2FPicture9.png?generation=1695277409031566&alt=media

  • Conclusion:
  • The mean score is 11.20 for females and 11.86 for males. The difference in means is 0.66, compared with 0.95 before the zero scores were removed.
  • The p-value is 0.05335. To reject the null hypothesis, the p-value must be less than 0.05.
  • Therefore we fail to reject the null hypothesis at the 5% level: once the zero scores are excluded, the difference in means is not statistically significant, although the p-value is close to the threshold.
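The decision rule applied to both tests can be written out explicitly. This is a minimal sketch; the function name `decision` and the dictionary labels are illustrative, while the two p-values and the alpha of 0.05 are the ones reported above.

```python
ALPHA = 0.05  # significance level used in the text


def decision(p_value, alpha=ALPHA):
    """Reject the null hypothesis only when the p-value is below alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


results = {
    "all 395 students": decision(0.03958),
    "357 students with non-zero scores": decision(0.05335),
}
for label, outcome in results.items():
    print(f"{label}: {outcome}")
```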