100+ datasets found
  1. Adventure Works 2022 CSVs

    • kaggle.com
    zip
    Updated Nov 2, 2022
    Cite
    Algorismus (2022). Adventure Works 2022 CSVs [Dataset]. https://www.kaggle.com/datasets/algorismus/adventure-works-in-excel-tables
    Explore at:
    zip (567646 bytes)
    Dataset updated
    Nov 2, 2022
    Authors
    Algorismus
    License

    http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    Adventure Works 2022 dataset

    How was this dataset created?

    On the official website the dataset is available over SQL Server (localhost) and as CSVs to be used via Power BI Desktop running on a Virtual Lab (virtual machine). The first two steps of importing data were executed in the virtual lab, and the resulting Power BI tables were then copied into CSVs. Records were added up to the year 2022 as required.

    How may this dataset help you?

    This dataset is helpful if you want to work offline with Adventure Works data in Power BI Desktop in order to carry out the lab instructions in the training material on the official website. It is also useful if you want to work on the Power BI Desktop Sales Analysis example from the Microsoft PL-300 learning path.

    How do you use this dataset?

    Download the CSV file(s) and import them into Power BI Desktop as tables. The CSVs are named after the tables created in the first two steps of importing data, as described in the PL-300 Microsoft Power BI Data Analyst exam lab.

  2. Stylish Product Image Dataset

    • kaggle.com
    zip
    Updated May 21, 2022
    Cite
    Santosh Kumar (2022). Stylish Product Image Dataset [Dataset]. https://www.kaggle.com/datasets/kuchhbhi/stylish-product-image-dataset
    Explore at:
    zip (9509715613 bytes)
    Dataset updated
    May 21, 2022
    Authors
    Santosh Kumar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context:

    The idea to scrape this data came to me while I was working on an e-commerce project, Fashion Product Recommendation (an end-to-end project): upload any fashion image and it shows the 10 closest recommendations.

    Project screenshots:
    https://user-images.githubusercontent.com/40932902/169657090-20d3342d-d472-48e3-bc34-8a9686b09961.png
    https://user-images.githubusercontent.com/40932902/169657035-870bb803-f985-482a-ac16-789d0fcf2a2b.png
    https://user-images.githubusercontent.com/40932902/169013855-099838d6-8612-45ce-8961-28ccf44f81f7.png

    I completed my project on this image dataset. The problem came while deploying to the Heroku server: due to the large project file size I was unable to deploy, as Heroku offers limited storage space for a free account.

    Currently I am only familiar with Heroku and am learning AWS for big projects. So I decided to scrape my own image dataset with much more information that could help me take this project to the next level. I scraped this data from flipkart.com (an e-commerce website) in two formats: images and textual data in tabular form.

    About this Dataset:

    This dataset contains 65k images (400x450 pixels) of fashion/style products and accessories such as clothing, footwear, and many more. There is also a CSV file mapped to the images via the image name and the id column in the tabular data. Image names are unique numbers such as 1.png or 62299.png, and the image name and the Id column are the same. So, if you want to find the details of any image, take its numeric name, look it up in the Id column of the CSV file, and that row holds the details of the image. You can find the notebook I used to scrape this data in the code section.

    Columns of CSV Dataset:
    1. id: Unique id, same as the image name
    2. brand: Brand name of the product
    3. title: Title of the product
    4. sold_price: Selling price of the product
    5. actual_price: Actual price of the product
    6. url: Unique URL of every product
    7. img: Image URL
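
    A minimal pandas sketch of the id-to-image lookup described above; the CSV file name used here is an assumption, so adjust it to the file shipped with the dataset:

    import pandas as pd

    # Hypothetical file name; use the CSV included in this dataset.
    df = pd.read_csv("flipkart_fashion_products.csv")

    # Look up the tabular details of an image such as "62299.png" via its numeric id.
    image_name = "62299.png"
    image_id = int(image_name.split(".")[0])
    details = df.loc[df["id"] == image_id]
    print(details[["brand", "title", "sold_price", "actual_price", "url"]])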

    How this dataset helped me:
    1. I trained my CNN model using the image data; that is the only use of the image dataset.
    2. On the front-end page of the project, to display results I used the image URL and showed the image after fetching it from the web. This meant I did not have to upload the image dataset with the project to the server, which saved a huge amount of memory.
    3. Using the url column, the project displays the live price and ratings from the Flipkart website.
    4. There is a Buy button mapped to the url; you are redirected to the original product page and can buy it from there. After using this dataset I changed my project name from Fashion Product Recommender to Flipkart Fashion Product Recommender.

    Still, the memory problem was not fully resolved, as the trained model file was above 500MB on the complete dataset. So I tried multiple subsets and finally deployed after training on only 1,000 images. In the future, I will try another platform to deploy the complete project. I learned many new things while working on this dataset.

    Your Job:

    1. You can use this dataset in your deep learning projects; go and try to create interesting projects.
    2. You can use the CSV data in your machine learning projects; first you need to do feature construction from the title column, as a lot of information is hidden there, and some data cleaning is required.
    3. Two complete records are missing from the CSV data; your job is to find the missing data with the help of the image dataset and fill it in as best you can.

    This is a huge dataset in terms of records as well as memory size. To download this dataset you need a fast internet connection.

    A smaller version of the same dataset (less than 500MB) is available here; everything is the same as this dataset, except that the images are reduced from 400x450px to 65x80px.

    Please rate this work.

    Support it with an upvote; that encourages me to research more.

    Share your feedback, reviews, and suggestions if any.

    Thanks!!

  3. train csv file

    • kaggle.com
    zip
    Updated May 5, 2018
    Cite
    Emmanuel Arias (2018). train csv file [Dataset]. https://www.kaggle.com/datasets/eamanu/train
    Explore at:
    zip (33695 bytes)
    Dataset updated
    May 5, 2018
    Authors
    Emmanuel Arias
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Dataset

    This dataset was created by Emmanuel Arias

    Released under Database: Open Database License, Contents: Database Contents License


  4. Top Rated TV Shows

    • kaggle.com
    zip
    Updated Jan 5, 2025
    Cite
    Shreya Gupta (2025). Top Rated TV Shows [Dataset]. https://www.kaggle.com/datasets/shreyajii/top-rated-tv-shows
    Explore at:
    zip (314571 bytes)
    Dataset updated
    Jan 5, 2025
    Authors
    Shreya Gupta
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset provides information about top-rated TV shows, collected from The Movie Database (TMDb) API. It can be used for data analysis, recommendation systems, and insights on popular television content.

    Key Stats:

    • Total Pages: 109
    • Total Results: 2098 TV shows
    • Data Source: TMDb API
    • Sorting Criteria: Highest-rated by vote_average (average rating) with a minimum vote count of 200

    Data Fields (Columns):

    • id: Unique identifier for the TV show
    • name: Title of the TV show
    • vote_average: Average rating given by users
    • vote_count: Total number of votes received
    • first_air_date: The date when the show was first aired
    • original_language: Language in which the show was originally produced
    • genre_ids: Genre IDs linked to the show's genres
    • overview: A brief summary of the show
    • popularity: Popularity score based on audience engagement
    • poster_path: URL path for the show's poster image

    Accessing the Dataset via API (Python Example):

    import requests

    api_key = 'YOUR_API_KEY_HERE'
    url = "https://api.themoviedb.org/3/discover/tv"
    params = {
        'api_key': api_key,
        'include_adult': 'false',
        'language': 'en-US',
        'page': 1,
        'sort_by': 'vote_average.desc',
        'vote_count.gte': 200
    }

    response = requests.get(url, params=params)
    data = response.json()

    # Display the first show
    print(data['results'][0])

    Dataset Use Cases:

    • Data Analysis: Explore trends in highly-rated TV shows.
    • Recommendation Systems: Build personalized TV show suggestions.
    • Visualization: Create charts to showcase ratings or genre distribution.
    • Machine Learning: Predict show popularity using historical data.

    Exporting and Sharing the Dataset (Google Colab Example):

    import pandas as pd

    # Convert the API data to a DataFrame
    df = pd.DataFrame(data['results'])

    # Save to CSV and upload to Google Drive
    from google.colab import drive
    drive.mount('/content/drive')
    df.to_csv('/content/drive/MyDrive/top_rated_tv_shows.csv', index=False)

    Ways to Share the Dataset:

    • Google Drive: Upload and share a public link.
    • Kaggle: Create a public dataset for collaboration.
    • GitHub: Host the CSV file in a repository for easy sharing.

  5. UCI-dataset

    • kaggle.com
    zip
    Updated Aug 17, 2022
    Cite
    Md Waquar Azam (2022). UCI-dataset [Dataset]. https://www.kaggle.com/datasets/mdwaquarazam/ucidatasetlist
    Explore at:
    zip (20774 bytes)
    Dataset updated
    Aug 17, 2022
    Authors
    Md Waquar Azam
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is a list of the datasets provided by the UCI ML Repository. If you are a learner and want data filtered by year, category, profession, or some other criterion, you can search for it here.

    There are 8 columns in the dataset, in which all details are given: link, Data-Name, data type, default task, attribute-type, instances, attributes, and year.

    Some missing values are present as well.

    You can analyse the data as per your requirements.
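
    A minimal pandas sketch of this kind of filtering; the CSV file name and the exact column labels are assumptions, so adjust them to the file in this dataset:

    import pandas as pd

    # Hypothetical file name; use the CSV shipped with this dataset.
    df = pd.read_csv("uci_dataset_list.csv")

    # Example: classification datasets added in or after 2015 (column names assumed).
    df["year"] = pd.to_numeric(df["year"], errors="coerce")
    subset = df[df["default task"].str.contains("Classification", na=False) & (df["year"] >= 2015)]
    print(subset[["Data-Name", "instances", "attributes", "year"]])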

    EDA

  6. Chicago Data Portal

    • kaggle.com
    zip
    Updated Dec 8, 2020
    Cite
    David (2020). Chicago Data Portal [Dataset]. https://www.kaggle.com/zhaodianwen/chicago-data-portal
    Explore at:
    zip (125083 bytes)
    Dataset updated
    Dec 8, 2020
    Authors
    David
    Description

    Assignment Topic: In this assignment, you will download the datasets provided, load them into a database, write and execute SQL queries to answer the problems provided, and upload a screenshot showing the correct SQL query and result for review by your peers. A Jupyter notebook is provided in the preceding lesson to help you with the process.

    This assignment involves 3 datasets for the city of Chicago obtained from the Chicago Data Portal:

    1. Chicago Socioeconomic Indicators

    This dataset contains a selection of six socioeconomic indicators of public health significance and a hardship index, by Chicago community area, for the years 2008 – 2012.

    2. Chicago Public Schools

    This dataset shows all school level performance data used to create CPS School Report Cards for the 2011-2012 school year.

    3. Chicago Crime Data

    This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days.

    Instructions:

    1. Review the datasets

    Before you begin, you will need to become familiar with the datasets. Snapshots for the three datasets in .CSV format can be downloaded from the following links:

    Chicago Socioeconomic Indicators: Click here

    Chicago Public Schools: Click here

    Chicago Crime Data: Click here

    NOTE: Ensure you have downloaded the datasets using the links above instead of directly from the Chicago Data Portal. The versions linked here are subsets of the original datasets and have some of the column names modified to be more database friendly which will make it easier to complete this assignment. The CSV file provided above for the Chicago Crime Data is a very small subset of the full dataset available from the Chicago Data Portal. The original dataset is over 1.55GB in size and contains over 6.5 million rows. For the purposes of this assignment you will use a much smaller sample with only about 500 rows.

    2. Load the datasets into a database

    Perform this step using the LOAD tool in the Db2 console. You will need to create 3 tables in the database, one for each dataset, named as follows, and then load the respective .CSV file into the table:

    CENSUS_DATA

    CHICAGO_PUBLIC_SCHOOLS

    CHICAGO_CRIME_DATA
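
    The assignment targets the Db2 LOAD tool, but the same load-then-query flow can be sketched locally with pandas and SQLite; the CSV file names below are assumptions:

    import sqlite3
    import pandas as pd

    con = sqlite3.connect("chicago.db")

    # Load each CSV snapshot into its own table (file names are assumptions).
    for csv_file, table in [
        ("ChicagoCensusData.csv", "CENSUS_DATA"),
        ("ChicagoPublicSchools.csv", "CHICAGO_PUBLIC_SCHOOLS"),
        ("ChicagoCrimeData.csv", "CHICAGO_CRIME_DATA"),
    ]:
        pd.read_csv(csv_file).to_sql(table, con, if_exists="replace", index=False)

    # Example query: total number of crime records loaded.
    print(pd.read_sql("SELECT COUNT(*) AS total_crimes FROM CHICAGO_CRIME_DATA", con))
    con.close()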

  7. Ecommerce Dataset (Products & Sizes Included)

    • kaggle.com
    zip
    Updated Nov 13, 2025
    Cite
    Anvit kumar (2025). Ecommerce Dataset (Products & Sizes Included) [Dataset]. https://www.kaggle.com/datasets/anvitkumar/shopping-dataset
    Explore at:
    zip (1274856 bytes)
    Dataset updated
    Nov 13, 2025
    Authors
    Anvit kumar
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Ecommerce Dataset (Products & Sizes Included)

    Essential Data for Building an Ecommerce Website & Analyzing Online Shopping Trends

    Overview

    This dataset contains 1,000+ ecommerce products, including detailed information on pricing, ratings, product specifications, seller details, and more. It is designed to help data scientists, developers, and analysts build product recommendation systems, price prediction models, and sentiment analysis tools.

    Dataset Features

    • product_id: Unique identifier for the product
    • title: Product name/title
    • product_description: Detailed product description
    • rating: Average customer rating (0-5)
    • ratings_count: Number of ratings received
    • initial_price: Original product price
    • discount: Discount percentage (%)
    • final_price: Discounted price
    • currency: Currency of the price (e.g., USD, INR)
    • images: URL(s) of product images
    • delivery_options: Available delivery methods (e.g., standard, express)
    • product_details: Additional product attributes
    • breadcrumbs: Category path (e.g., Electronics > Smartphones)
    • product_specifications: Technical specifications of the product
    • amount_of_stars: Distribution of star ratings (1-5 stars)
    • what_customers_said: Customer reviews (sentiments)
    • seller_name: Name of the product seller
    • sizes: Available sizes (for clothing, shoes, etc.)
    • videos: Product video links (if available)
    • seller_information: Seller details, such as location and rating
    • variations: Different variants of the product (e.g., color, size)
    • best_offer: Best available deal for the product
    • more_offers: Other available deals/offers
    • category: Product category

    Potential Use Cases

    • Build an Ecommerce Website: Use this dataset to design a functional online store with product listings, filtering, and sorting.
    • Price Prediction Models: Predict product prices based on features like ratings, category, and discount.
    • Recommendation Systems: Suggest products based on user preferences, rating trends, and customer feedback.
    • Sentiment Analysis: Analyze what_customers_said to understand customer satisfaction and product popularity.
    • Market & Competitor Analysis: Track pricing trends, popular categories, and seller performance.

    Why Use This Dataset?

    • Rich Feature Set: Includes all necessary ecommerce attributes.
    • Realistic Pricing & Rating Data: Useful for price analysis and recommendations.
    • Multi-Purpose: Suitable for machine learning, web development, and data visualization.
    • Structured Format: Easy-to-use CSV format for quick integration.

    Dataset Format

    • CSV file (ecommerce_dataset.csv)
    • 1000+ samples
    • Multi-category coverage

    How to Use?

    Download the dataset from Kaggle and load it in Python using pandas:

    import pandas as pd

    df = pd.read_csv("ecommerce_dataset.csv")
    df.head()

    Explore trends & patterns using visualization tools (Seaborn, Matplotlib), and build models & applications based on the dataset!

  8. Induction Motor Fault Dataset

    • kaggle.com
    zip
    Updated Jun 21, 2023
    Cite
    Saber MalekzadeH (2023). Induction Motor Fault Dataset [Dataset]. https://www.kaggle.com/datasets/sabermalek/imfds
    Explore at:
    zip (4071800548 bytes)
    Dataset updated
    Jun 21, 2023
    Authors
    Saber MalekzadeH
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The test bench used to acquire the dataset was composed of two similar triphasic squirrel cage induction machines, two frequency converters, a failure emulation control panel, and a resistor load bank. One of the induction machines was properly prepared to enable the emulation of stator winding inter-turns to short-circuit. Its stator circuit was re-winded, making it possible to access the ramifications of the winding, in order to insert inter-turn short circuits. Different levels of short-circuit can be emulated, from very incipient defects to severe situations. It operates as a motor and the other machine emulates the mechanical load of the motor. The frequency converters are used to drive the induction machines. This way, the machines can work at different driving frequencies. The induction machines used have the following specifications: 4 poles, 1 HP of mechanical power, delta configuration, 220V of supply voltage, and 3A of rated current. The frequency converters are both WEG CFW-08 (WEG, 2019). Two types of faults were simulated:

    • High Impedance (HI): Represents the initial stage of the fault, in which the electric insulator is beginning to degrade and a parallel current path appears.
    • Low Impedance (LI): Represents a full short-circuit. The current flows in the new path and a voltage is induced in the shorted coil.

    For all faults simulated, the short-circuit current intensity is limited to its rated value, using a variable resistor (50 Ω), to prevent permanent damage to the windings. Different intensity levels are also emulated depending on the number of shorted turns. Three levels are considered: 1.41%, 4.81%, and 9.26% of the stator winding. Combining the type and intensity of the defects, there are, respectively, HI-1, HI-2, and HI-3 for high impedance, and LI-1, LI-2, and LI-3 for low impedance failure. The signals from the flux and current transducers are filtered, conditioned, and digitalized while the motor is operating under a specific configuration of frequency, load, and failure (type and intensity). In total, 2590 patterns were acquired: 350 of the normal class and 2240 of fault conditions, distributed into 6 defective classes: high impedance fault of levels 1, 2, and 3; and low impedance fault of levels 1, 2, and 3. For each class, there are patterns acquired with no mechanical load attached, 50% of the rated load, and 100% of the rated load. The driving frequency also varied from 30 Hz to 60 Hz, in steps of 5 Hz. To monitor the axial leakage flux, a coil of 100 turns of 24 AWG copper wire was placed around the motor shaft. The current of the 3 phases of the motor was acquired using current transformers (CT) model SCT013-030.

    x1 is the first channel, x2 the second, x3 the third, and x4 the fourth. Every 100,000 consecutive data points form one sample in the whole dataset.
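
    A minimal sketch of slicing the flat signal into per-sample windows of 100,000 points; the file and column names here are assumptions, so adapt them to the files in the download:

    import numpy as np
    import pandas as pd

    # Hypothetical file name; each channel column (x1..x4) is one long concatenated signal.
    data = pd.read_csv("induction_motor_signals.csv")

    points_per_sample = 100_000
    n_samples = len(data) // points_per_sample

    # Reshape channel x1 into (n_samples, 100000) so that each row is one acquisition.
    x1 = data["x1"].to_numpy()[: n_samples * points_per_sample].reshape(n_samples, points_per_sample)
    print(x1.shape)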

    Preprocessed from: This link

  9. ECB speeches etc.since 1997 (updated weekly)

    • kaggle.com
    zip
    Updated Nov 3, 2025
    Cite
    Roberto Lofaro (2025). ECB speeches etc.since 1997 (updated weekly) [Dataset]. https://www.kaggle.com/robertolofaro/ecb-speeches-1997-to-20191122-frequencies-dm
    Explore at:
    zip (1814115 bytes)
    Dataset updated
    Nov 3, 2025
    Authors
    Roberto Lofaro
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    I am preparing a book on change to add to my publications (https://robertolofaro.com/published), and I was looking into speeches delivered by ECB, and the search on the website wasn't what I needed.

    I started posting online updates in late 2019; currently, the online web app that allows searching via a tag cloud is updated on a weekly basis, each Monday evening.

    Search by tag: https://robertolofaro.com/ECBSpeech (links also to dataset on kaggle)

    From 2024-03-25, the dataset also contains the AI-based audio transcripts of any ECB item collected, whenever the audio file is accessible.

    source: ECB website

    Content

    In late October/early November 2019, ECB posted on Linkedin a link to a CSV dataset extending from 1997 up to 2019-10-25 with all the speeches delivered, as per their website

    The dataset was "flat", and I needed both to search quickly for associations of people to concepts and to see the relevant speech directly in a human-readable format (as some speeches had pictures, tables, attachments, etc.)

    So, I recycled a concept that I had developed for other purposes and used in an experimental "search by tag cloud on structured content" on https://robertolofaro.com/BFM2013tag

    The result is https://robertolofaro.com/ECBSpeech, that contains information from the CSV file (see website for the link to the source), with the additional information as shown within the "About this file".

    The concept behind sharing this dataset on Kaggle, and releasing on my public website the application I use to navigate the data (I have a local XAMPP where I use this and other applications to support the research side of my past business and current publication activities), is described at http://robertolofaro.com/datademocracy

    This tag cloud contains the most common words 1997-2020 across the dataset

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3925987%2Fcf58205d2447ed7355c1a4e213f5b477%2F20200902_kagglerelease.png?generation=1599033600865103&alt=media

    Acknowledgements

    Thanks to the ECB for saving my time (I was going to copy-and-paste or "scrape" with R from the speeches posted on their website) by releasing the dataset https://www.ecb.europa.eu/press/key/html/downloads.en.html

    Inspiration

    In my cultural and organizational change activities, and within data collection, collation, and processing to support management decision-making (including my own) since the 1980s, I have always seen that the more data we collect, the less time there is to retrieve it when needed.

    I usually worked across multiple environments, industries, cultures, and "collecting" was never good enough if I could not then "retrieve by association".

    In storytelling it is fine just to roughly remember "cameos from the past", but in data storytelling (or when trying to implement a new organization, process, or even just software or data analysis), being able to pinpoint a source that might have been there before is equally important.

    So, I am simply exploring different ways to cross-reference information from different domains, as I am quite confident that within all the open data (including the ECB speeches) there are the results of what niche experts saw on various items.

    Therefore, why should time and resources be wasted on redoing what was already done by others, when you can start from their endpoint, before adapting first and adopting then (if relevant)?

    Updates

    2020-01-25: added GITHUB repository for versioning and release of additional material as the upload of the new export_datamart.csv wasn't possible, it is now available at: https://github.com/robertolofaro/ecbspeech

    changes in the dataset: 1. fixed language codes 2. added speeches published on the ECB website in January 2020 (up to 2020-01-25 09:00 CET) 3. added all the items listed under the "interview" section of the ECB website

    current content: 340 interviews, 2374 speeches

    2020-01-29: the same file released on GitHub on 2020-01-25, containing both speeches and interviews, and with an additional column to differentiate between the two, is now available on Kaggle

    current content: 340 interviews, 2374 speeches

    2020-02-26: monthly update, with items released on the ECB website up to 2020-02-22

    current content: 2731 items, 345 interviews, 2386 speeches

    2020-03-25: monthly update, with items released on the ECB website up to 2020-03-20

    since March 2020, the dataset also includes press conferences available on the ECB website

    current content: 2988 records (2392 speeches, 351 interviews, 245 press conferences)

    2020-06-07: update, with items released on the ECB website up to 2020-06-07

    since June 2020, the dataset includes also press conferences, blog posts, and podcasts available on the ECB website

    current content: 3030 records (2399 speeches, 369 interviews, 247 press conferences, 8 blog posts, 7 ECB Podcast). ...

  10. Reddit /r/datasets Dataset

    • kaggle.com
    zip
    Updated Nov 28, 2022
    Cite
    The Devastator (2022). Reddit /r/datasets Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/the-meta-corpus-of-datasets-the-reddit-dataset
    Explore at:
    zip (9619636 bytes)
    Dataset updated
    Nov 28, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Meta-Corpus of Datasets: The Reddit Dataset

    The Complete Collection of Datasets Posted on Reddit

    By SocialGrep [source]

    About this dataset

    A subreddit dataset is a collection of posts and comments made on Reddit's /r/datasets board. This dataset contains all the posts and comments made on the /r/datasets subreddit from its inception to March 1, 2022. The dataset was procured using SocialGrep. The data does not include usernames, to preserve users' anonymity and to prevent targeted harassment.


    How to use the dataset

    In order to use this dataset, you will need to have a text editor such as Microsoft Word or LibreOffice installed on your computer. You will also need a web browser such as Google Chrome or Mozilla Firefox.

    Once you have the necessary software installed, open the The Reddit Dataset folder and double-click on the the-reddit-dataset-dataset-posts.csv file to open it in your preferred text editor.

    In the document, you will see a list of posts with the following information for each one: title, sentiment, score, URL, created UTC, permalink, subreddit NSFW status, and subreddit name.

    You can use this information to analyze trends in datasets posted on /r/datasets over time. For example, you could calculate the average score for all posts and compare it with the average score for posts in specific subreddits. Additionally, sentiment analysis could be performed on the titles of posts to see if there is a correlation between positive/negative sentiment and upvotes/downvotes.
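
    A minimal pandas sketch along those lines, using the file and column names from the Columns section below:

    import pandas as pd

    posts = pd.read_csv("the-reddit-dataset-dataset-posts.csv")
    comments = pd.read_csv("the-reddit-dataset-dataset-comments.csv")

    # Average post score and the most common link domains.
    print(posts["score"].mean())
    print(posts["domain"].value_counts().head(10))

    # Average comment score per sentiment label.
    print(comments.groupby("sentiment")["score"].mean())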

    Research Ideas

    • Finding correlations between different types of datasets
    • Determining which datasets are most popular on Reddit
    • Analyzing the sentiments of post and comments on Reddit's /r/datasets board

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: the-reddit-dataset-dataset-comments.csv

    | Column name    | Description                                         |
    |:---------------|:----------------------------------------------------|
    | type           | The type of post. (String)                          |
    | subreddit.name | The name of the subreddit. (String)                 |
    | subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean)     |
    | created_utc    | The time the post was created, in UTC. (Timestamp)  |
    | permalink      | The permalink for the post. (String)                |
    | body           | The body of the post. (String)                      |
    | sentiment      | The sentiment of the post. (String)                 |
    | score          | The score of the post. (Integer)                    |

    File: the-reddit-dataset-dataset-posts.csv

    | Column name    | Description                                         |
    |:---------------|:----------------------------------------------------|
    | type           | The type of post. (String)                          |
    | subreddit.name | The name of the subreddit. (String)                 |
    | subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean)     |
    | created_utc    | The time the post was created, in UTC. (Timestamp)  |
    | permalink      | The permalink for the post. (String)                |
    | score          | The score of the post. (Integer)                    |
    | domain         | The domain of the post. (String)                    |
    | url            | The URL of the post. (String)                       |
    | selftext       | The self-text of the post. (String)                 |
    | title          | The title of the post. (String)                     |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors, SocialGrep.

  11. Telco Customer Churn

    • kaggle.com
    zip
    Updated Feb 23, 2018
    Cite
    BlastChar (2018). Telco Customer Churn [Dataset]. https://www.kaggle.com/datasets/blastchar/telco-customer-churn
    Explore at:
    zip (175758 bytes)
    Dataset updated
    Feb 23, 2018
    Authors
    BlastChar
    Description

    Context

    "Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs." [IBM Sample Data Sets]

    Content

    Each row represents a customer; each column contains a customer's attributes, described in the column Metadata.

    The data set includes information about:

    • Customers who left within the last month – the column is called Churn
    • Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
    • Customer account information – how long they've been a customer, contract, payment method, paperless billing, monthly charges, and total charges
    • Demographic info about customers – gender, age range, and if they have partners and dependents
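
    A minimal pandas sketch for a first look at churn; the file name below is the usual Kaggle distribution and is an assumption here:

    import pandas as pd

    # File name as commonly distributed on Kaggle; adjust if yours differs.
    df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

    # Overall churn rate and churn rate by contract type.
    print((df["Churn"] == "Yes").mean())
    print(df.groupby("Contract")["Churn"].value_counts(normalize=True))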

    Inspiration

    To explore this type of model and learn more about the subject.

    New version from IBM: https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113

  12. Iris Species

    • kaggle.com
    zip
    Updated Sep 27, 2016
    Cite
    UCI Machine Learning (2016). Iris Species [Dataset]. https://www.kaggle.com/datasets/uciml/iris
    Explore at:
    zip (3687 bytes)
    Dataset updated
    Sep 27, 2016
    Dataset authored and provided by
    UCI Machine Learning
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.

    It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.

    The columns in this dataset are:

    • Id
    • SepalLengthCm
    • SepalWidthCm
    • PetalLengthCm
    • PetalWidthCm
    • Species

    Sepal Width vs. Sepal Length
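
    A minimal seaborn sketch of that scatter plot; the file name Iris.csv is assumed from the usual Kaggle distribution of this dataset:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Assumed file name; adjust the path if your copy differs.
    iris = pd.read_csv("Iris.csv")

    sns.scatterplot(data=iris, x="SepalLengthCm", y="SepalWidthCm", hue="Species")
    plt.title("Sepal Width vs. Sepal Length")
    plt.show()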

  13. Social Media and Mental Health

    • kaggle.com
    zip
    Updated Jul 18, 2023
    Cite
    SouvikAhmed071 (2023). Social Media and Mental Health [Dataset]. https://www.kaggle.com/datasets/souvikahmed071/social-media-and-mental-health
    Explore at:
    zip (10944 bytes)
    Dataset updated
    Jul 18, 2023
    Authors
    SouvikAhmed071
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    This dataset was originally collected for a data science and machine learning project that aimed at investigating the potential correlation between the amount of time an individual spends on social media and the impact it has on their mental health.

    The project involves conducting a survey to collect data, organizing the data, and using machine learning techniques to create a predictive model that can determine whether a person should seek professional help based on their answers to the survey questions.

    This project was completed as part of a Statistics course at a university, and the team is currently in the process of writing a report and completing a paper that summarizes and discusses the findings in relation to other research on the topic.

    The following is the Google Colab link to the project, done on Jupyter Notebook -

    https://colab.research.google.com/drive/1p7P6lL1QUw1TtyUD1odNR4M6TVJK7IYN

    The following is the GitHub Repository of the project -

    https://github.com/daerkns/social-media-and-mental-health

    Libraries used for the Project -

    Pandas
    Numpy
    Matplotlib
    Seaborn
    Sci-kit Learn
    
  14. Webpage Information for 5000+ Kaggle Competitions

    • kaggle.com
    zip
    Updated Nov 8, 2023
    Cite
    Anthony Wynne (2023). Webpage Information for 5000+ Kaggle Competitions [Dataset]. https://www.kaggle.com/anthony35813/webpage-data-for-kaggle-competitions
    Explore at:
    zip (102059495 bytes)
    Dataset updated
    Nov 8, 2023
    Authors
    Anthony Wynne
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    I produced the dataset whilst working on the 2023 Kaggle AI report. The Meta Kaggle dataset provides helpful information about the Kaggle competitions but not the original descriptive text from the Kaggle web pages for each competition. We have information about the solutions but not the original problem. So, I wrote some web scraping scripts to collect and store that information.

    Not all Kaggle web pages have that information available; some are missing or broken. Hence the nulls in the data. Secondly, note that not all previous Kaggle competitions exist in the Meta Kaggle data, which was used to collect the webpage slugs.

    The scraping scripts iterate over the IDs in the Meta Kaggle competitions.csv data and attempt to collect the webpage data for a competition if it is currently null in the database. Hence, new IDs will cause the scripts to go and collect their data, and each week the scripts will try to fill in any links that were not working previously.

    I have recently converted the original local scraping scripts on my machine into a Kaggle notebook that now updates this dataset weekly on Mondays. The notebook also explains the scraping procedure and its automation to keep this dataset up-to-date.

    Note that the CompetitionId field joins to the Id of the competitions.csv of the Meta Kaggle dataset so that this information can be combined with the rest of Meta Kaggle.
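
    A minimal pandas sketch of that join; the file names below are assumptions (competitions.csv from Meta Kaggle and a placeholder name for this dataset's CSV):

    import pandas as pd

    # File names are assumptions; adjust to the actual Meta Kaggle and scraped-data files.
    competitions = pd.read_csv("competitions.csv")
    webpages = pd.read_csv("kaggle_competition_webpages.csv")

    # Join the scraped page text onto Meta Kaggle via CompetitionId -> Id.
    merged = webpages.merge(competitions, left_on="CompetitionId", right_on="Id", how="left")
    print(merged.head())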

    My primary reason for collecting the data was for some text classification work I wanted to do, and I will publish it here soon. I hope that the data is useful to some other projects as well :-)

  15. Comprehensive Goodreads Book Dataset

    • kaggle.com
    zip
    Updated Aug 8, 2024
    Cite
    Evil Spirit05 (2024). Comprehensive Goodreads Book Dataset [Dataset]. https://www.kaggle.com/datasets/evilspirit05/comprehensive-goodreads-book-dataset
    Explore at:
    zip (2866123 bytes)
    Dataset updated
    Aug 8, 2024
    Authors
    Evil Spirit05
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description
    The data for this project was meticulously gathered from Goodreads, focusing on the curated list of books that are deemed essential reading. The data collection process was carried out in two distinct phases to ensure comprehensive and accurate capture of all relevant information.
    

    Source:

    Goodreads Listing: https://www.goodreads.com/list/show/264.Books_That_Everyone_Should_Read_At_Least_Once

    Data Collection Steps:

    Book URL Scraping:

    • Objective: The primary goal of this step was to extract the URLs of the books listed on the Goodreads page, along with their corresponding titles. This is a crucial preliminary step that allows for subsequent detailed data collection.
    • Methodology: I employed a custom-built Python script, scraper\book_url_scraper.py, designed specifically to navigate the Goodreads page and identify each book's URL. The script systematically parses the HTML structure of the listing page, extracts the URLs, and pairs them with the book titles.
    • Data Storage: The collected URLs and titles were compiled into a CSV file named book_urls.csv, which is stored in the scraper folder. This CSV file acts as a reference list, containing essential links and titles needed for the next phase of data collection.

    Book Details Scraping:

    • Objective: This phase aimed to enrich the dataset by collecting detailed descriptions and genre classifications for each book using the URLs obtained in the previous step. This provides a deeper understanding of each book's content and category.
    • Methodology: Utilizing the URLs stored in book_urls.csv, I developed and executed another Python script, scraper\book_details_scraper.py. This script accesses each URL, retrieves the book's detailed description, and identifies its genre(s). The process involves parsing the book's page to extract relevant information accurately.
    • Data Storage: The extracted descriptions and genres were organized and saved into a CSV file named book_details.csv, located in the data folder. This file contains comprehensive information about each book, including its description and genre, facilitating detailed analysis and research.
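
    A minimal sketch of the two-phase flow described above (not the author's actual scripts); the requests/BeautifulSoup calls, the assumed columns of book_urls.csv, and the CSS selector for the description are all assumptions:

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    # Phase 1 output: book_urls.csv with (assumed) columns "title" and "url".
    book_urls = pd.read_csv("book_urls.csv")

    # Phase 2, sketched for a single book page.
    first_url = book_urls.iloc[0]["url"]
    html = requests.get(first_url, headers={"User-Agent": "Mozilla/5.0"}).text
    soup = BeautifulSoup(html, "html.parser")

    # The selector for the description block is an assumption about the Goodreads page layout.
    description = soup.find("div", {"data-testid": "description"})
    print(description.get_text(strip=True) if description else "description not found")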

    Summary:

    The data collection effort resulted in the comprehensive gathering of details for 6,313 books. This dataset includes essential information such as book titles, URLs, detailed descriptions, and genres. The structured approach, involving separate scripts for URL extraction and detailed data scraping, ensures that the dataset is both thorough and well-organized. The final dataset, encapsulated in book_details.csv, provides a robust foundation for further exploration, analysis, and insights into the literary works recommended on Goodreads.
    
  16. Housing Prices Dataset

    • kaggle.com
    zip
    Updated Jan 12, 2022
    Cite
    M Yasser H (2022). Housing Prices Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/housing-prices-dataset
    Explore at:
    zip (4740 bytes)
    Dataset updated
    Jan 12, 2022
    Authors
    M Yasser H
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    https://raw.githubusercontent.com/Masterx-AI/Project_Housing_Price_Prediction_/main/hs.jpg

    Description:

    A simple yet challenging project: predict the housing price based on factors such as house area, bedrooms, furnishing, proximity to the main road, etc. The dataset is small, yet its complexity arises from strong multicollinearity. Can you overcome these obstacles and build a decent predictive model?

    Acknowledgement:

    Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81โ€“102. Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.

    Objective:

    • Understand the dataset & clean it up (if required).
    • Build regression models to predict the price with respect to a single feature and to multiple features (see the sketch below).
    • Also evaluate the models & compare their respective scores, such as R2, RMSE, etc.
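
    A minimal scikit-learn baseline along those lines; the file name Housing.csv and the target column price are assumptions based on the usual Kaggle distribution:

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split

    # Assumed file and target names; adjust to your copy of the dataset.
    df = pd.read_csv("Housing.csv")
    X = pd.get_dummies(df.drop(columns=["price"]), drop_first=True)  # encode yes/no and categorical factors
    y = df["price"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression().fit(X_train, y_train)

    pred = model.predict(X_test)
    print("R2:", r2_score(y_test, pred))
    print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
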
  17. Stress316L

    • kaggle.com
    zip
    Updated Feb 1, 2021
    Cite
    Mahshad Lotfinia (2021). Stress316L [Dataset]. https://www.kaggle.com/datasets/mahshadlotfinia/stress316l
    Explore at:
    zip (516534 bytes)
    Dataset updated
    Feb 1, 2021
    Authors
    Mahshad Lotfinia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description


    In case you use this dataset, please cite the original paper:

    Mahshad Lotfinia, and Soroosh Tayebi Arasteh. "Machine Learning-Based Generalized Model for Finite Element Analysis of Roll Deflection During the Austenitic Stainless Steel 316L Strip Rolling". arXiv:2102.02470, February 2021.

    BibTex

    @misc{Stress316L,
      title={Machine Learning-Based Generalized Model for Finite Element Analysis of Roll Deflection During the Austenitic Stainless Steel 316L Strip Rolling},
      author={Mahshad Lotfinia and Soroosh Tayebi Arasteh},
      year={2021},
      eprint={2102.02470},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
    }

    SUMMARY

    Unlike other groups of metals, Austenitic Stainless Steel 316L has an unpredictable Strain-Stress curve. Thus, we conducted a series of mechanical tensile tests at different strain rates. Afterwards, using this dataset, a neural network can be trained to predict a Strain-Stress curve that gives more accurate values of the flow stress during cold deformation.

    DATA COLLECTION

    We conducted four sets of uniaxial tensile tests at strain rates of 0.001 s⁻¹, 0.00052 s⁻¹, 0.0052 s⁻¹, and 0.052 s⁻¹ at room temperature on our Austenitic Stainless Steel 316L sample. According to the ASTM E8 standard, ASS316L sheets with an initial thickness of 4 mm, width of 6 mm, and gauge length of 32 mm were used for the tensile tests, using a compression test machine (Electro Mechanic Instron 4208). The results were transferred to the Santam Machine Controller software for recording, which yielded the extension data (in mm) and the force data (in N); these were converted to true-strain and true-stress values. The data conversion was done by considering the cross-section of the loaded force, which in our case was 24 mm^2.

    DATASET CONTENTS

    15,858 different Strain-Stress values at 4 different strain rates.

    • ./Stress316L_data/labels.csv: Stress values.
    • ./Stress316L_data/features.csv: Strain & Strain rate values for the corresponding points in the ./Stress316L_data/labels.csv.
    • ./Stress316L_data/x_y_initial.csv: Strain-Stress values.
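
    A minimal pandas sketch for loading the feature/label pair listed above; the column layouts are assumed from the file descriptions:

    import pandas as pd

    # Paths as listed above; columns are assumed from the descriptions.
    features = pd.read_csv("Stress316L_data/features.csv")  # strain and strain-rate values
    labels = pd.read_csv("Stress316L_data/labels.csv")      # corresponding flow-stress values

    print(features.shape, labels.shape)  # expected: 15,858 rows each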

    DATA FORMAT FOR ALL THE FILES

    All the files are provided in the "csv" format.

    The dataset URL:

    https://kaggle.com/mahshadlotfinia/Stress316L/
    

    LICENSE

    The accompanying dataset is released under a Creative Commons Attribution 4.0 International License.

    SOURCE CODE

    The official source code of the paper: https://github.com/mahshadlotfinia/Stress316L/

    CONTACT

    E-mail: mahshad.lotfinia@alum.sharif.edu

    REFERENCES:

    Materials Science and Engineering Mechanical Lab, the Sharif University of Technology, Tehran, Iran.

  18. Tensorflow-Friendly-MRNA-Competition-Dataset

    • kaggle.com
    zip
    Updated Oct 25, 2023
    Cite
    Harrison TW White (2023). Tensorflow-Friendly-MRNA-Competition-Dataset [Dataset]. https://www.kaggle.com/datasets/harrisontwwhite/tensorflow-friendly-mrna-competition-dataset
    Explore at:
    zip (1046236048 bytes)
    Dataset updated
    Oct 25, 2023
    Authors
    Harrison TW White
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset acts as a link between the competition dataset CSV: https://www.kaggle.com/competitions/rsna-breast-cancer-detection

    and the 256x256 images of that data set created here: https://www.kaggle.com/datasets/theoviel/rsna-breast-cancer-256-pngs

    This should allow the data to be read in as a directory from TensorFlow allowing the labels to be attached to the images themselves rather than in a separate csv file.
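
    A minimal TensorFlow sketch of reading such a directory; the folder name and layout (one sub-folder per label) are assumptions about how this dataset is organised:

    import tensorflow as tf

    # Assumed layout: images_256/<label>/<image>.png, one sub-folder per class.
    train_ds = tf.keras.utils.image_dataset_from_directory(
        "images_256",
        image_size=(256, 256),
        batch_size=32,
    )

    for images, labels in train_ds.take(1):
        print(images.shape, labels.shape)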

  19. CIFAR-10 Python in CSV

    • kaggle.com
    zip
    Updated Jun 22, 2021
    Cite
    fedesoriano (2021). CIFAR-10 Python in CSV [Dataset]. https://www.kaggle.com/fedesoriano/cifar10-python-in-csv
    Explore at:
    zip (218807675 bytes)
    Dataset updated
    Jun 22, 2021
    Authors
    fedesoriano
    Description

    Context

    The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. The classes are completely mutually exclusive. There are 50000 training images and 10000 test images.

    The batches.meta file contains the label names of each class.

    The dataset was originally divided into 5 training batches with 10000 images per batch. The original dataset can be found here: https://www.cs.toronto.edu/~kriz/cifar.html. This dataset contains all the training data and test data in the same CSV file, so it is easier to load.

    Content

    Here is the list of the 10 classes in the CIFAR-10:

    Classes:
    0: airplane
    1: automobile
    2: bird
    3: cat
    4: deer
    5: dog
    6: frog
    7: horse
    8: ship
    9: truck

    Acknowledgements

    • Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009. Link

    How to load the batches.meta file (Python)

    The function used to open the file:

    def unpickle(file):
        import pickle
        with open(file, 'rb') as fo:
            dict = pickle.load(fo, encoding='bytes')
        return dict

    Example of how to read the file:

    metadata_path = './cifar-10-python/batches.meta'  # change this path
    metadata = unpickle(metadata_path)

  20. DeBERTa-v3-Base for Sentiment Regression

    • kaggle.com
    zip
    Updated Aug 10, 2024
    Cite
    AnthonyTherrien (2024). DeBERTa-v3-Base for Sentiment Regression [Dataset]. https://www.kaggle.com/datasets/anthonytherrien/deberta-v3-base-for-sentiment-regression
    Explore at:
    zip (664634762 bytes)
    Dataset updated
    Aug 10, 2024
    Authors
    AnthonyTherrien
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Project Overview

    • Objective: Fine-tune the microsoft/deberta-v3-base model for sentiment regression.
    • Dataset: CSV file containing 1.6 million tweets with sentiment labels.

    Dataset

    • Source: training.1600000.processed.noemoticon.csv
    • Link: https://www.kaggle.com/datasets/kazanova/sentiment140
    • Columns:
      • target: Sentiment polarity (converted to float)
      • ids: Tweet IDs
      • date: Date of the tweet
      • flag: Query flag
      • user: User handle
      • text: Tweet text
    • Size: 1.6 million rows

    Preprocessing Steps

    1. Load Dataset: Loaded CSV file without headers using ISO-8859-1 encoding.
    2. Rename Columns: Renamed columns for better readability.
    3. Target Conversion: Converted target column to float.
    4. Shuffle Dataset: Shuffled dataset with a seed for randomness.

    Model Selection

    • Model: microsoft/deberta-v3-base
    • Tokenizer: Used the AutoTokenizer from Hugging Face with max_length=160 and padding='max_length'.

    Tokenization

    • Process:
      • Tokenized the dataset using multiprocessing (12 cores).
      • Applied padding and truncation to ensure uniform input size.

    Dataset Split

    • Train/Test Split:
      • Training set: 97.5% of the data
      • Validation set: 2.5% of the data

    Training Configuration

    • Training Arguments:
      • Learning Rate: 1.25e-5
      • Batch Size: 24
      • Epochs: 2
      • Weight Decay: 0.001
      • Gradient Accumulation: 6 steps
      • Warmup Steps: 256
      • Evaluation Strategy: Evaluate at the end of each epoch
      • Mixed Precision Training: Enabled (fp16=True)
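
    A minimal sketch of this configuration with the Hugging Face Trainer; the tiny stand-in dataset and output directory name are assumptions, and a real run would use the tokenized Sentiment140 splits described above:

    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    model = AutoModelForSequenceClassification.from_pretrained(
        "microsoft/deberta-v3-base", num_labels=1)  # single output unit -> regression head
    tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

    # Tiny stand-in for the tokenized Sentiment140 splits (illustration only).
    raw = Dataset.from_dict({"text": ["great day", "awful service"], "label": [4.0, 0.0]})
    tokenized = raw.map(lambda b: tokenizer(b["text"], max_length=160,
                                            padding="max_length", truncation=True), batched=True)

    args = TrainingArguments(
        output_dir="deberta-sentiment-regression",  # assumed output directory
        learning_rate=1.25e-5,
        per_device_train_batch_size=24,
        num_train_epochs=2,
        weight_decay=0.001,
        gradient_accumulation_steps=6,
        warmup_steps=256,
        evaluation_strategy="epoch",
        fp16=True,  # mixed precision; requires a GPU
    )

    trainer = Trainer(model=model, args=args, train_dataset=tokenized, eval_dataset=tokenized)
    # trainer.train()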

    Model Training

    • Trainer: Used Hugging Face's Trainer class for model training and evaluation.

    Evaluation

    • Results: The model was evaluated on the validation set after training, with results saved for further analysis.

    Conclusion

    • The fine-tuned DeBERTa-v3 model is now ready for sentiment regression tasks, with the final model and tokenizer saved for deployment.

    Citation

    @misc{he2021debertav3,
    title={DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing}, 
    author={Pengcheng He and Jianfeng Gao and Weizhu Chen},
    year={2021},
    eprint={2111.09543},
    archivePrefix={arXiv},
    primaryClass={cs.CL}}
    
    @inproceedings{
    he2021deberta,
    title={DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION},
    author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
    booktitle={International Conference on Learning Representations},
    year={2021},
    url={https://openreview.net/forum?id=XPZIaotutsD}}
    