100+ datasets found
  1. Adventure Works 2022 CSVs

    • kaggle.com
    zip
    Updated Nov 2, 2022
    Cite
    Algorismus (2022). Adventure Works 2022 CSVs [Dataset]. https://www.kaggle.com/datasets/algorismus/adventure-works-in-excel-tables
    Explore at:
    zip (567646 bytes)
    Dataset updated
    Nov 2, 2022
    Authors
    Algorismus
    License

    http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    Adventure Works 2022 dataset

    How was this dataset created?

    On the official website the dataset is available over SQL Server (localhost) and as CSVs to be used via Power BI Desktop running on a Virtual Lab (virtual machine). The first two steps of importing data were executed in the virtual lab, and the resulting Power BI tables were then copied into CSVs. Records were added up to the year 2022 as required.

    How may this dataset help you?

    This dataset is helpful if you want to work offline with Adventure Works data in Power BI Desktop in order to carry out the lab instructions in the training material on the official website. It is also useful if you want to work on the Power BI Desktop Sales Analysis example from the Microsoft PL-300 learning path.

    How do you use this dataset?

    Download the CSV file(s) and import them into Power BI Desktop as tables. The CSVs are named after the tables created in the first two steps of importing data, as described in the PL-300 Microsoft Power BI Data Analyst exam lab.

  2. Stylish Product Image Dataset

    • kaggle.com
    zip
    Updated May 21, 2022
    Cite
    Santosh Kumar (2022). Stylish Product Image Dataset [Dataset]. https://www.kaggle.com/datasets/kuchhbhi/stylish-product-image-dataset
    Explore at:
    zip (9509715613 bytes)
    Dataset updated
    May 21, 2022
    Authors
    Santosh Kumar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context:

    The idea to scrape this data came to me while I was working on an e-commerce project, Fashion Product Recommendation (an end-to-end project): upload any fashion image and it shows the 10 closest recommendations.

    Project screenshots:
    https://user-images.githubusercontent.com/40932902/169657090-20d3342d-d472-48e3-bc34-8a9686b09961.png
    https://user-images.githubusercontent.com/40932902/169657035-870bb803-f985-482a-ac16-789d0fcf2a2b.png
    https://user-images.githubusercontent.com/40932902/169013855-099838d6-8612-45ce-8961-28ccf44f81f7.png

    I completed my project on this image dataset. The problem came while deploying to the Heroku server: due to the large project file size I was unable to deploy, as Heroku offers limited storage space for a free account.

    Currently I am only familiar with Heroku and am learning AWS for big projects. So I decided to scrape my own image dataset with much more information that could help me take this project to the next level. I scraped this data from flipkart.com (an e-commerce website) in two formats: images and textual data in tabular form.

    About this Dataset:

    This dataset contains 65k images (400x450 pixels) of fashion/style products and accessories such as clothing, footwear, and many more. There is also a CSV file mapped to the images via the image name and the id column in the tabular data. Image names are unique numbers such as 1.png or 62299.png, and the image name and the Id column are the same. So, if you want to find the details of any image, take its numeric name, look it up in the Id column of the CSV file, and that row holds the details of the image. You can find the notebook I used to scrape this data in the code section.

    Columns of CSV Dataset:
    1. id: Unique id, same as the image name
    2. brand: Brand name of the product
    3. title: Title of the product
    4. sold_price: Selling price of the product
    5. actual_price: Actual price of the product
    6. url: Unique URL of every product
    7. img: Image URL
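
    A minimal pandas sketch of the id-to-image lookup described above; the CSV file name used here is an assumption, so adjust it to the file shipped with the dataset:

    import pandas as pd

    # Hypothetical file name; use the CSV included in this dataset.
    df = pd.read_csv("flipkart_fashion_products.csv")

    # Look up the tabular details of an image such as "62299.png" via its numeric id.
    image_name = "62299.png"
    image_id = int(image_name.split(".")[0])
    details = df.loc[df["id"] == image_id]
    print(details[["brand", "title", "sold_price", "actual_price", "url"]])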

    How this dataset helped me:
    1. I trained my CNN model using the image data; that is the only use of the image dataset.
    2. On the front-end page of the project, to display results I used the image URL and showed the image after fetching it from the web. This meant I did not have to upload the image dataset with the project to the server, which saved a huge amount of memory.
    3. Using the url column, the project displays the live price and ratings from the Flipkart website.
    4. There is a Buy button mapped to the url; you are redirected to the original product page and can buy it from there. After using this dataset I changed my project name from Fashion Product Recommender to Flipkart Fashion Product Recommender.

    Still, the memory problem was not fully resolved, as the trained model file was above 500MB on the complete dataset. So I tried multiple subsets and finally deployed after training on only 1,000 images. In the future, I will try another platform to deploy the complete project. I learned many new things while working on this dataset.

    Your Job:

    1. You can use this dataset in your deep learning projects; go and try to create interesting projects.
    2. You can use the CSV data in your machine learning projects; first you need to do feature construction from the title column, as a lot of information is hidden there, and some data cleaning is required.
    3. Two complete records are missing from the CSV data; your job is to find the missing data with the help of the image dataset and fill it in as best you can.

    This is a huge dataset in terms of records as well as memory size. To download this dataset you need a fast internet connection.

    A smaller version of the same dataset (less than 500MB) is available here; everything is the same as this dataset, except that the images are reduced from 400x450px to 65x80px.

    Please rate this work.

    Support it with an upvote; that encourages me to research more.

    Share your feedback, reviews, and suggestions if any.

    Thanks!!

  3. train csv file

    • kaggle.com
    zip
    Updated May 5, 2018
    Cite
    Emmanuel Arias (2018). train csv file [Dataset]. https://www.kaggle.com/datasets/eamanu/train
    Explore at:
    zip (33695 bytes)
    Dataset updated
    May 5, 2018
    Authors
    Emmanuel Arias
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Dataset

    This dataset was created by Emmanuel Arias

    Released under Database: Open Database License, Contents: Database Contents License


  4. Top Rated TV Shows

    • kaggle.com
    zip
    Updated Jan 5, 2025
    Cite
    Shreya Gupta (2025). Top Rated TV Shows [Dataset]. https://www.kaggle.com/datasets/shreyajii/top-rated-tv-shows
    Explore at:
    zip (314571 bytes)
    Dataset updated
    Jan 5, 2025
    Authors
    Shreya Gupta
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset provides information about top-rated TV shows, collected from The Movie Database (TMDb) API. It can be used for data analysis, recommendation systems, and insights on popular television content.

    Key Stats:

    • Total Pages: 109
    • Total Results: 2098 TV shows
    • Data Source: TMDb API
    • Sorting Criteria: Highest-rated by vote_average (average rating) with a minimum vote count of 200

    Data Fields (Columns):

    • id: Unique identifier for the TV show
    • name: Title of the TV show
    • vote_average: Average rating given by users
    • vote_count: Total number of votes received
    • first_air_date: The date when the show was first aired
    • original_language: Language in which the show was originally produced
    • genre_ids: Genre IDs linked to the show's genres
    • overview: A brief summary of the show
    • popularity: Popularity score based on audience engagement
    • poster_path: URL path for the show's poster image

    Accessing the Dataset via API (Python Example):

    import requests

    api_key = 'YOUR_API_KEY_HERE'
    url = "https://api.themoviedb.org/3/discover/tv"
    params = {
        'api_key': api_key,
        'include_adult': 'false',
        'language': 'en-US',
        'page': 1,
        'sort_by': 'vote_average.desc',
        'vote_count.gte': 200
    }

    response = requests.get(url, params=params)
    data = response.json()

    # Display the first show
    print(data['results'][0])

    Dataset Use Cases:

    • Data Analysis: Explore trends in highly-rated TV shows.
    • Recommendation Systems: Build personalized TV show suggestions.
    • Visualization: Create charts to showcase ratings or genre distribution.
    • Machine Learning: Predict show popularity using historical data.

    Exporting and Sharing the Dataset (Google Colab Example):

    import pandas as pd

    # Convert the API data to a DataFrame
    df = pd.DataFrame(data['results'])

    # Save to CSV and upload to Google Drive
    from google.colab import drive
    drive.mount('/content/drive')
    df.to_csv('/content/drive/MyDrive/top_rated_tv_shows.csv', index=False)

    Ways to Share the Dataset:

    • Google Drive: Upload and share a public link.
    • Kaggle: Create a public dataset for collaboration.
    • GitHub: Host the CSV file in a repository for easy sharing.

  5. UCI-dataset

    • kaggle.com
    zip
    Updated Aug 17, 2022
    Cite
    Md Waquar Azam (2022). UCI-dataset [Dataset]. https://www.kaggle.com/datasets/mdwaquarazam/ucidatasetlist
    Explore at:
    zip (20774 bytes)
    Dataset updated
    Aug 17, 2022
    Authors
    Md Waquar Azam
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is a list of the datasets provided by the UCI ML Repository. If you are a learner and want data filtered by year, category, profession, or some other criterion, you can search for it here.

    There are 8 columns in the dataset, in which all details are given: link, Data-Name, data type, default task, attribute-type, instances, attributes, and year.

    Some missing values are present as well.

    You can analyse the data as per your requirements.
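
    A minimal pandas sketch of this kind of filtering; the CSV file name and the exact column labels are assumptions, so adjust them to the file in this dataset:

    import pandas as pd

    # Hypothetical file name; use the CSV shipped with this dataset.
    df = pd.read_csv("uci_dataset_list.csv")

    # Example: classification datasets added in or after 2015 (column names assumed).
    df["year"] = pd.to_numeric(df["year"], errors="coerce")
    subset = df[df["default task"].str.contains("Classification", na=False) & (df["year"] >= 2015)]
    print(subset[["Data-Name", "instances", "attributes", "year"]])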

    EDA

  6. Chicago Data Portal

    • kaggle.com
    zip
    Updated Dec 8, 2020
    Cite
    David (2020). Chicago Data Portal [Dataset]. https://www.kaggle.com/zhaodianwen/chicago-data-portal
    Explore at:
    zip (125083 bytes)
    Dataset updated
    Dec 8, 2020
    Authors
    David
    Description

    Assignment Topic: In this assignment, you will download the datasets provided, load them into a database, write and execute SQL queries to answer the problems provided, and upload a screenshot showing the correct SQL query and result for review by your peers. A Jupyter notebook is provided in the preceding lesson to help you with the process.

    This assignment involves 3 datasets for the city of Chicago obtained from the Chicago Data Portal:

    1. Chicago Socioeconomic Indicators

    This dataset contains a selection of six socioeconomic indicators of public health significance and a hardship index, by Chicago community area, for the years 2008 – 2012.

    2. Chicago Public Schools

    This dataset shows all school level performance data used to create CPS School Report Cards for the 2011-2012 school year.

    3. Chicago Crime Data

    This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days.

    Instructions:

    1. Review the datasets

    Before you begin, you will need to become familiar with the datasets. Snapshots for the three datasets in .CSV format can be downloaded from the following links:

    Chicago Socioeconomic Indicators: Click here

    Chicago Public Schools: Click here

    Chicago Crime Data: Click here

    NOTE: Ensure you have downloaded the datasets using the links above instead of directly from the Chicago Data Portal. The versions linked here are subsets of the original datasets and have some of the column names modified to be more database friendly which will make it easier to complete this assignment. The CSV file provided above for the Chicago Crime Data is a very small subset of the full dataset available from the Chicago Data Portal. The original dataset is over 1.55GB in size and contains over 6.5 million rows. For the purposes of this assignment you will use a much smaller sample with only about 500 rows.

    2. Load the datasets into a database

    Perform this step using the LOAD tool in the Db2 console. You will need to create 3 tables in the database, one for each dataset, named as follows, and then load the respective .CSV file into the table:

    CENSUS_DATA

    CHICAGO_PUBLIC_SCHOOLS

    CHICAGO_CRIME_DATA
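
    The assignment targets the Db2 LOAD tool, but the same load-then-query flow can be sketched locally with pandas and SQLite; the CSV file names below are assumptions:

    import sqlite3
    import pandas as pd

    con = sqlite3.connect("chicago.db")

    # Load each CSV snapshot into its own table (file names are assumptions).
    for csv_file, table in [
        ("ChicagoCensusData.csv", "CENSUS_DATA"),
        ("ChicagoPublicSchools.csv", "CHICAGO_PUBLIC_SCHOOLS"),
        ("ChicagoCrimeData.csv", "CHICAGO_CRIME_DATA"),
    ]:
        pd.read_csv(csv_file).to_sql(table, con, if_exists="replace", index=False)

    # Example query: total number of crime records loaded.
    print(pd.read_sql("SELECT COUNT(*) AS total_crimes FROM CHICAGO_CRIME_DATA", con))
    con.close()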

  7. Ecommerce Dataset (Products & Sizes Included)

    • kaggle.com
    zip
    Updated Nov 13, 2025
    Cite
    Anvit kumar (2025). Ecommerce Dataset (Products & Sizes Included) [Dataset]. https://www.kaggle.com/datasets/anvitkumar/shopping-dataset
    Explore at:
    zip (1274856 bytes)
    Dataset updated
    Nov 13, 2025
    Authors
    Anvit kumar
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Ecommerce Dataset (Products & Sizes Included)

    Essential Data for Building an Ecommerce Website & Analyzing Online Shopping Trends

    Overview

    This dataset contains 1,000+ ecommerce products, including detailed information on pricing, ratings, product specifications, seller details, and more. It is designed to help data scientists, developers, and analysts build product recommendation systems, price prediction models, and sentiment analysis tools.

    Dataset Features

    • product_id: Unique identifier for the product
    • title: Product name/title
    • product_description: Detailed product description
    • rating: Average customer rating (0-5)
    • ratings_count: Number of ratings received
    • initial_price: Original product price
    • discount: Discount percentage (%)
    • final_price: Discounted price
    • currency: Currency of the price (e.g., USD, INR)
    • images: URL(s) of product images
    • delivery_options: Available delivery methods (e.g., standard, express)
    • product_details: Additional product attributes
    • breadcrumbs: Category path (e.g., Electronics > Smartphones)
    • product_specifications: Technical specifications of the product
    • amount_of_stars: Distribution of star ratings (1-5 stars)
    • what_customers_said: Customer reviews (sentiments)
    • seller_name: Name of the product seller
    • sizes: Available sizes (for clothing, shoes, etc.)
    • videos: Product video links (if available)
    • seller_information: Seller details, such as location and rating
    • variations: Different variants of the product (e.g., color, size)
    • best_offer: Best available deal for the product
    • more_offers: Other available deals/offers
    • category: Product category

    Potential Use Cases

    • Build an Ecommerce Website: Use this dataset to design a functional online store with product listings, filtering, and sorting.
    • Price Prediction Models: Predict product prices based on features like ratings, category, and discount.
    • Recommendation Systems: Suggest products based on user preferences, rating trends, and customer feedback.
    • Sentiment Analysis: Analyze what_customers_said to understand customer satisfaction and product popularity.
    • Market & Competitor Analysis: Track pricing trends, popular categories, and seller performance.

    Why Use This Dataset?

    • Rich Feature Set: Includes all necessary ecommerce attributes.
    • Realistic Pricing & Rating Data: Useful for price analysis and recommendations.
    • Multi-Purpose: Suitable for machine learning, web development, and data visualization.
    • Structured Format: Easy-to-use CSV format for quick integration.

    Dataset Format

    • CSV file (ecommerce_dataset.csv)
    • 1000+ samples
    • Multi-category coverage

    How to Use?

    Download the dataset from Kaggle and load it in Python using pandas:

    import pandas as pd

    df = pd.read_csv("ecommerce_dataset.csv")
    df.head()

    Explore trends & patterns using visualization tools (Seaborn, Matplotlib), and build models & applications based on the dataset!

  8. Induction Motor Fault Dataset

    • kaggle.com
    zip
    Updated Jun 21, 2023
    Cite
    Saber MalekzadeH (2023). Induction Motor Fault Dataset [Dataset]. https://www.kaggle.com/datasets/sabermalek/imfds
    Explore at:
    zip (4071800548 bytes)
    Dataset updated
    Jun 21, 2023
    Authors
    Saber MalekzadeH
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The test bench used to acquire the dataset was composed of two similar triphasic squirrel cage induction machines, two frequency converters, a failure emulation control panel, and a resistor load bank. One of the induction machines was properly prepared to enable the emulation of stator winding inter-turns to short-circuit. Its stator circuit was re-winded, making it possible to access the ramifications of the winding, in order to insert inter-turn short circuits. Different levels of short-circuit can be emulated, from very incipient defects to severe situations. It operates as a motor and the other machine emulates the mechanical load of the motor. The frequency converters are used to drive the induction machines. This way, the machines can work at different driving frequencies. The induction machines used have the following specifications: 4 poles, 1 HP of mechanical power, delta configuration, 220V of supply voltage, and 3A of rated current. The frequency converters are both WEG CFW-08 (WEG, 2019). Two types of faults were simulated:

    • High Impedance (HI): Represents the initial stage of the fault, in which the electric insulator is beginning to degrade and a parallel current path appears.
    • Low Impedance (LI): Represents a full short-circuit. The current flows in the new path and a voltage is induced in the shorted coil.

    For all faults simulated, the short-circuit current intensity is limited to its rated value, using a variable resistor (50 Ω), to prevent permanent damage to the windings. Different intensity levels are also emulated depending on the number of shorted turns. Three levels are considered: 1.41%, 4.81%, and 9.26% of the stator winding. Combining the type and intensity of the defects, there are, respectively, HI-1, HI-2, and HI-3 for high impedance, and LI-1, LI-2, and LI-3 for low impedance failure. The signals from the flux and current transducers are filtered, conditioned, and digitalized while the motor is operating under a specific configuration of frequency, load, and failure (type and intensity). In total, 2590 patterns were acquired: 350 of the normal class and 2240 of fault conditions, distributed into 6 defective classes: high impedance fault of levels 1, 2, and 3; and low impedance fault of levels 1, 2, and 3. For each class, there are patterns acquired with no mechanical load attached, 50% of the rated load, and 100% of the rated load. The driving frequency also varied from 30 Hz to 60 Hz, in steps of 5 Hz. To monitor the axial leakage flux, a coil of 100 turns of 24 AWG copper wire was placed around the motor shaft. The current of the 3 phases of the motor was acquired using current transformers (CT) model SCT013-030.

    x1 is the first channel, x2 the second, x3 the third, and x4 the fourth. Every 100,000 consecutive data points form one sample in the whole dataset.
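
    A minimal sketch of slicing the flat signal into per-sample windows of 100,000 points; the file and column names here are assumptions, so adapt them to the files in the download:

    import numpy as np
    import pandas as pd

    # Hypothetical file name; each channel column (x1..x4) is one long concatenated signal.
    data = pd.read_csv("induction_motor_signals.csv")

    points_per_sample = 100_000
    n_samples = len(data) // points_per_sample

    # Reshape channel x1 into (n_samples, 100000) so that each row is one acquisition.
    x1 = data["x1"].to_numpy()[: n_samples * points_per_sample].reshape(n_samples, points_per_sample)
    print(x1.shape)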

    Preprocessed from: This link

  9. ECB speeches etc.since 1997 (updated weekly)

    • kaggle.com
    zip
    Updated Nov 3, 2025
    Cite
    Roberto Lofaro (2025). ECB speeches etc.since 1997 (updated weekly) [Dataset]. https://www.kaggle.com/robertolofaro/ecb-speeches-1997-to-20191122-frequencies-dm
    Explore at:
    zip (1814115 bytes)
    Dataset updated
    Nov 3, 2025
    Authors
    Roberto Lofaro
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    I am preparing a book on change to add to my publications (https://robertolofaro.com/published), and I was looking into speeches delivered by ECB, and the search on the website wasn't what I needed.

    I started posting online updates in late 2019; currently, the online web app that allows searching via a tag cloud is updated on a weekly basis, each Monday evening.

    Search by tag: https://robertolofaro.com/ECBSpeech (links also to dataset on kaggle)

    From 2024-03-25, the dataset also contains the AI-based audio transcripts of any ECB item collected, whenever the audio file is accessible.

    source: ECB website

    Content

    In late October/early November 2019, ECB posted on Linkedin a link to a CSV dataset extending from 1997 up to 2019-10-25 with all the speeches delivered, as per their website

    The dataset was "flat", and I needed both to search quickly for associations of people to concepts and to see the relevant speech directly in a human-readable format (as some speeches had pictures, tables, attachments, etc.)

    So, I recycled a concept that I had developed for other purposes and used in an experimental "search by tag cloud on structured content" on https://robertolofaro.com/BFM2013tag

    The result is https://robertolofaro.com/ECBSpeech, that contains information from the CSV file (see website for the link to the source), with the additional information as shown within the "About this file".

    The concept behind sharing this dataset on Kaggle, and releasing on my public website the application I use to navigate the data (I have a local XAMPP where I use this and other applications to support the research side of my past business and current publication activities), is described at http://robertolofaro.com/datademocracy

    This tag cloud contains the most common words 1997-2020 across the dataset

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3925987%2Fcf58205d2447ed7355c1a4e213f5b477%2F20200902_kagglerelease.png?generation=1599033600865103&alt=media

    Acknowledgements

    Thanks to the ECB for saving my time (I was going to copy-and-paste or "scrape" with R from the speeches posted on their website) by releasing the dataset https://www.ecb.europa.eu/press/key/html/downloads.en.html

    Inspiration

    In my cultural and organizational change activities, and within data collection, collation, and processing to support management decision-making (including my own) since the 1980s, I have always seen that the more data we collect, the less time there is to retrieve it when needed.

    I usually worked across multiple environments, industries, cultures, and "collecting" was never good enough if I could not then "retrieve by association".

    In storytelling it is fine just to roughly remember "cameos from the past", but in data storytelling (or when trying to implement a new organization, process, or even just software or data analysis), being able to pinpoint a source that might have been there before is equally important.

    So, I am simply exploring different ways to cross-reference information from different domains, as I am quite confident that within all the open data (including the ECB speeches) there are the results of what niche experts saw on various items.

    Therefore, why should time and resources be wasted on redoing what was already done by others, when you can start from their endpoint, before adapting first and adopting then (if relevant)?

    Updates

    2020-01-25: added GITHUB repository for versioning and release of additional material as the upload of the new export_datamart.csv wasn't possible, it is now available at: https://github.com/robertolofaro/ecbspeech

    changes in the dataset: 1. fixed language codes 2. added speeches published on the ECB website in January 2020 (up to 2020-01-25 09:00 CET) 3. added all the items listed under the "interview" section of the ECB website

    current content: 340 interviews, 2374 speeches

    2020-01-29: the same file released on GitHub on 2020-01-25, containing both speeches and interviews, and with an additional column to differentiate between the two, is now available on Kaggle

    current content: 340 interviews, 2374 speeches

    2020-02-26: monthly update, with items released on the ECB website up to 2020-02-22

    current content: 2731 items, 345 interviews, 2386 speeches

    2020-03-25: monthly update, with items released on the ECB website up to 2020-03-20

    since March 2020, the dataset also includes press conferences available on the ECB website

    current content: 2988 records (2392 speeches, 351 interviews, 245 press conferences)

    2020-06-07: update, with items released on the ECB website up to 2020-06-07

    since June 2020, the dataset includes also press conferences, blog posts, and podcasts available on the ECB website

    current content: 3030 records (2399 speeches, 369 interviews, 247 press conferences, 8 blog posts, 7 ECB Podcast). ...

  10. Reddit /r/datasets Dataset

    • kaggle.com
    zip
    Updated Nov 28, 2022
    Cite
    The Devastator (2022). Reddit /r/datasets Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/the-meta-corpus-of-datasets-the-reddit-dataset
    Explore at:
    zip (9619636 bytes)
    Dataset updated
    Nov 28, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Meta-Corpus of Datasets: The Reddit Dataset

    The Complete Collection of Datasets Posted on Reddit

    By SocialGrep [source]

    About this dataset

    A subreddit dataset is a collection of posts and comments made on Reddit's /r/datasets board. This dataset contains all the posts and comments made on the /r/datasets subreddit from its inception to March 1, 2022. The dataset was procured using SocialGrep. The data does not include usernames, to preserve users' anonymity and to prevent targeted harassment.


    How to use the dataset

    In order to use this dataset, you will need to have a text editor such as Microsoft Word or LibreOffice installed on your computer. You will also need a web browser such as Google Chrome or Mozilla Firefox.

    Once you have the necessary software installed, open the The Reddit Dataset folder and double-click on the the-reddit-dataset-dataset-posts.csv file to open it in your preferred text editor.

    In the document, you will see a list of posts with the following information for each one: title, sentiment, score, URL, created UTC, permalink, subreddit NSFW status, and subreddit name.

    You can use this information to analyze trends in datasets posted on /r/datasets over time. For example, you could calculate the average score for all posts and compare it with the average score for posts in specific subreddits. Additionally, sentiment analysis could be performed on the titles of posts to see if there is a correlation between positive/negative sentiment and upvotes/downvotes.
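
    A minimal pandas sketch along those lines, using the file and column names from the Columns section below:

    import pandas as pd

    posts = pd.read_csv("the-reddit-dataset-dataset-posts.csv")
    comments = pd.read_csv("the-reddit-dataset-dataset-comments.csv")

    # Average post score and the most common link domains.
    print(posts["score"].mean())
    print(posts["domain"].value_counts().head(10))

    # Average comment score per sentiment label.
    print(comments.groupby("sentiment")["score"].mean())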

    Research Ideas

    • Finding correlations between different types of datasets
    • Determining which datasets are most popular on Reddit
    • Analyzing the sentiments of post and comments on Reddit's /r/datasets board

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: the-reddit-dataset-dataset-comments.csv

    | Column name    | Description                                         |
    |:---------------|:----------------------------------------------------|
    | type           | The type of post. (String)                          |
    | subreddit.name | The name of the subreddit. (String)                 |
    | subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean)     |
    | created_utc    | The time the post was created, in UTC. (Timestamp)  |
    | permalink      | The permalink for the post. (String)                |
    | body           | The body of the post. (String)                      |
    | sentiment      | The sentiment of the post. (String)                 |
    | score          | The score of the post. (Integer)                    |

    File: the-reddit-dataset-dataset-posts.csv

    | Column name    | Description                                         |
    |:---------------|:----------------------------------------------------|
    | type           | The type of post. (String)                          |
    | subreddit.name | The name of the subreddit. (String)                 |
    | subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean)     |
    | created_utc    | The time the post was created, in UTC. (Timestamp)  |
    | permalink      | The permalink for the post. (String)                |
    | score          | The score of the post. (Integer)                    |
    | domain         | The domain of the post. (String)                    |
    | url            | The URL of the post. (String)                       |
    | selftext       | The self-text of the post. (String)                 |
    | title          | The title of the post. (String)                     |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors, SocialGrep.

  11. Telco Customer Churn

    • kaggle.com
    zip
    Updated Feb 23, 2018
    Cite
    BlastChar (2018). Telco Customer Churn [Dataset]. https://www.kaggle.com/datasets/blastchar/telco-customer-churn
    Explore at:
    zip (175758 bytes)
    Dataset updated
    Feb 23, 2018
    Authors
    BlastChar
    Description

    Context

    "Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs." [IBM Sample Data Sets]

    Content

    Each row represents a customer; each column contains a customer's attributes, described in the column Metadata.

    The data set includes information about:

    • Customers who left within the last month – the column is called Churn
    • Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
    • Customer account information – how long they've been a customer, contract, payment method, paperless billing, monthly charges, and total charges
    • Demographic info about customers – gender, age range, and if they have partners and dependents
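
    A minimal pandas sketch for a first look at churn; the file name below is the usual Kaggle distribution and is an assumption here:

    import pandas as pd

    # File name as commonly distributed on Kaggle; adjust if yours differs.
    df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

    # Overall churn rate and churn rate by contract type.
    print((df["Churn"] == "Yes").mean())
    print(df.groupby("Contract")["Churn"].value_counts(normalize=True))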

    Inspiration

    To explore this type of model and learn more about the subject.

    New version from IBM: https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113

  12. Iris Species

    • kaggle.com
    zip
    Updated Sep 27, 2016
    Cite
    UCI Machine Learning (2016). Iris Species [Dataset]. https://www.kaggle.com/datasets/uciml/iris
    Explore at:
    zip (3687 bytes)
    Dataset updated
    Sep 27, 2016
    Dataset authored and provided by
    UCI Machine Learning
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.

    It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.

    The columns in this dataset are:

    • Id
    • SepalLengthCm
    • SepalWidthCm
    • PetalLengthCm
    • PetalWidthCm
    • Species

    Sepal Width vs. Sepal Length
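
    A minimal seaborn sketch of that scatter plot; the file name Iris.csv is assumed from the usual Kaggle distribution of this dataset:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Assumed file name; adjust the path if your copy differs.
    iris = pd.read_csv("Iris.csv")

    sns.scatterplot(data=iris, x="SepalLengthCm", y="SepalWidthCm", hue="Species")
    plt.title("Sepal Width vs. Sepal Length")
    plt.show()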

  13. Social Media and Mental Health

    • kaggle.com
    zip
    Updated Jul 18, 2023
    Cite
    SouvikAhmed071 (2023). Social Media and Mental Health [Dataset]. https://www.kaggle.com/datasets/souvikahmed071/social-media-and-mental-health
    Explore at:
    zip (10944 bytes)
    Dataset updated
    Jul 18, 2023
    Authors
    SouvikAhmed071
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    This dataset was originally collected for a data science and machine learning project that aimed at investigating the potential correlation between the amount of time an individual spends on social media and the impact it has on their mental health.

    The project involves conducting a survey to collect data, organizing the data, and using machine learning techniques to create a predictive model that can determine whether a person should seek professional help based on their answers to the survey questions.

    This project was completed as part of a Statistics course at a university, and the team is currently in the process of writing a report and completing a paper that summarizes and discusses the findings in relation to other research on the topic.

    The following is the Google Colab link to the project, done on Jupyter Notebook -

    https://colab.research.google.com/drive/1p7P6lL1QUw1TtyUD1odNR4M6TVJK7IYN

    The following is the GitHub Repository of the project -

    https://github.com/daerkns/social-media-and-mental-health

    Libraries used for the Project -

    Pandas
    Numpy
    Matplotlib
    Seaborn
    Sci-kit Learn
    
  14. Webpage Information for 5000+ Kaggle Competitions

    • kaggle.com
    zip
    Updated Nov 8, 2023
    Cite
    Anthony Wynne (2023). Webpage Information for 5000+ Kaggle Competitions [Dataset]. https://www.kaggle.com/anthony35813/webpage-data-for-kaggle-competitions
    Explore at:
    zip (102059495 bytes)
    Dataset updated
    Nov 8, 2023
    Authors
    Anthony Wynne
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    I produced the dataset whilst working on the 2023 Kaggle AI report. The Meta Kaggle dataset provides helpful information about the Kaggle competitions but not the original descriptive text from the Kaggle web pages for each competition. We have information about the solutions but not the original problem. So, I wrote some web scraping scripts to collect and store that information.

    Not all Kaggle web pages have that information available; some are missing or broken. Hence the nulls in the data. Secondly, note that not all previous Kaggle competitions exist in the Meta Kaggle data, which was used to collect the webpage slugs.

    The scraping scripts iterate over the IDs in the Meta Kaggle competitions.csv data and attempt to collect the webpage data for a competition if it is currently null in the database. Hence, new IDs will cause the scripts to go and collect their data, and each week the scripts will try to fill in any links that were not working previously.

    I have recently converted the original local scraping scripts on my machine into a Kaggle notebook that now updates this dataset weekly on Mondays. The notebook also explains the scraping procedure and its automation to keep this dataset up-to-date.

    Note that the CompetitionId field joins to the Id of the competitions.csv of the Meta Kaggle dataset so that this information can be combined with the rest of Meta Kaggle.
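
    A minimal pandas sketch of that join; the file names below are assumptions (competitions.csv from Meta Kaggle and a placeholder name for this dataset's CSV):

    import pandas as pd

    # File names are assumptions; adjust to the actual Meta Kaggle and scraped-data files.
    competitions = pd.read_csv("competitions.csv")
    webpages = pd.read_csv("kaggle_competition_webpages.csv")

    # Join the scraped page text onto Meta Kaggle via CompetitionId -> Id.
    merged = webpages.merge(competitions, left_on="CompetitionId", right_on="Id", how="left")
    print(merged.head())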

    My primary reason for collecting the data was for some text classification work I wanted to do, and I will publish it here soon. I hope that the data is useful to some other projects as well :-)

  15. Comprehensive Goodreads Book Dataset

    • kaggle.com
    zip
    Updated Aug 8, 2024
    Cite
    Evil Spirit05 (2024). Comprehensive Goodreads Book Dataset [Dataset]. https://www.kaggle.com/datasets/evilspirit05/comprehensive-goodreads-book-dataset
    Explore at:
    zip (2866123 bytes)
    Dataset updated
    Aug 8, 2024
    Authors
    Evil Spirit05
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description
    The data for this project was meticulously gathered from Goodreads, focusing on the curated list of books that are deemed essential reading. The data collection process was carried out in two distinct phases to ensure comprehensive and accurate capture of all relevant information.
    

    Source:

    Goodreads Listing: https://www.goodreads.com/list/show/264.Books_That_Everyone_Should_Read_At_Least_Once

    Data Collection Steps:

    Book URL Scraping:

    • Objective: The primary goal of this step was to extract the URLs of the books listed on the Goodreads page, along with their corresponding titles. This is a crucial preliminary step that allows for subsequent detailed data collection.
    • Methodology: I employed a custom-built Python script, scraper\book_url_scraper.py, designed specifically to navigate the Goodreads page and identify each book's URL. The script systematically parses the HTML structure of the listing page, extracts the URLs, and pairs them with the book titles.
    • Data Storage: The collected URLs and titles were compiled into a CSV file named book_urls.csv, which is stored in the scraper folder. This CSV file acts as a reference list, containing essential links and titles needed for the next phase of data collection.

    Book Details Scraping:

    • Objective: This phase aimed to enrich the dataset by collecting detailed descriptions and genre classifications for each book using the URLs obtained in the previous step. This provides a deeper understanding of each book's content and category.
    • Methodology: Utilizing the URLs stored in book_urls.csv, I developed and executed another Python script, scraper\book_details_scraper.py. This script accesses each URL, retrieves the book's detailed description, and identifies its genre(s). The process involves parsing the book's page to extract relevant information accurately.
    • Data Storage: The extracted descriptions and genres were organized and saved into a CSV file named book_details.csv, located in the data folder. This file contains comprehensive information about each book, including its description and genre, facilitating detailed analysis and research.
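
    A minimal sketch of the two-phase flow described above (not the author's actual scripts); the requests/BeautifulSoup calls, the assumed columns of book_urls.csv, and the CSS selector for the description are all assumptions:

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    # Phase 1 output: book_urls.csv with (assumed) columns "title" and "url".
    book_urls = pd.read_csv("book_urls.csv")

    # Phase 2, sketched for a single book page.
    first_url = book_urls.iloc[0]["url"]
    html = requests.get(first_url, headers={"User-Agent": "Mozilla/5.0"}).text
    soup = BeautifulSoup(html, "html.parser")

    # The selector for the description block is an assumption about the Goodreads page layout.
    description = soup.find("div", {"data-testid": "description"})
    print(description.get_text(strip=True) if description else "description not found")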

    Summary:

    The data collection effort resulted in the comprehensive gathering of details for 6,313 books. This dataset includes essential information such as book titles, URLs, detailed descriptions, and genres. The structured approach, involving separate scripts for URL extraction and detailed data scraping, ensures that the dataset is both thorough and well-organized. The final dataset, encapsulated in book_details.csv, provides a robust foundation for further exploration, analysis, and insights into the literary works recommended on Goodreads.
    
  16. Housing Prices Dataset

    • kaggle.com
    zip
    Updated Jan 12, 2022
    Cite
    M Yasser H (2022). Housing Prices Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/housing-prices-dataset
    Explore at:
    zip (4740 bytes)
    Dataset updated
    Jan 12, 2022
    Authors
    M Yasser H
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    https://raw.githubusercontent.com/Masterx-AI/Project_Housing_Price_Prediction_/main/hs.jpg

    Description:

    A simple yet challenging project: predict the housing price based on factors such as house area, bedrooms, furnishing, proximity to the main road, etc. The dataset is small, yet its complexity arises from strong multicollinearity. Can you overcome these obstacles and build a decent predictive model?

    Acknowledgement:

    Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81โ€“102. Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.

    Objective:

    • Understand the dataset & clean it up (if required).
    • Build regression models to predict the price with respect to a single feature and to multiple features (see the sketch below).
    • Also evaluate the models & compare their respective scores, such as R2, RMSE, etc.
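
    A minimal scikit-learn baseline along those lines; the file name Housing.csv and the target column price are assumptions based on the usual Kaggle distribution:

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split

    # Assumed file and target names; adjust to your copy of the dataset.
    df = pd.read_csv("Housing.csv")
    X = pd.get_dummies(df.drop(columns=["price"]), drop_first=True)  # encode yes/no and categorical factors
    y = df["price"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression().fit(X_train, y_train)

    pred = model.predict(X_test)
    print("R2:", r2_score(y_test, pred))
    print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
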
  17. Stress316L

    • kaggle.com
    zip
    Updated Feb 1, 2021
    Cite
    Mahshad Lotfinia (2021). Stress316L [Dataset]. https://www.kaggle.com/datasets/mahshadlotfinia/stress316l
    Explore at:
    zip (516534 bytes)
    Dataset updated
    Feb 1, 2021
    Authors
    Mahshad Lotfinia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description


    In case you use this dataset, please cite the original paper:

    Mahshad Lotfinia, and Soroosh Tayebi Arasteh. "Machine Learning-Based Generalized Model for Finite Element Analysis of Roll Deflection During the Austenitic Stainless Steel 316L Strip Rolling". arXiv:2102.02470, February 2021.

    BibTex

    @misc{Stress316L,
      title={Machine Learning-Based Generalized Model for Finite Element Analysis of Roll Deflection During the Austenitic Stainless Steel 316L Strip Rolling},
      author={Mahshad Lotfinia and Soroosh Tayebi Arasteh},
      year={2021},
      eprint={2102.02470},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
    }

    SUMMARY

    Unlike other groups of metals, Austenitic Stainless Steel 316L has an unpredictable Strain-Stress curve. Thus, we conducted a series of mechanical tensile tests at different strain rates. Afterwards, using this dataset, a neural network can be trained to predict a Strain-Stress curve that gives more accurate values of the flow stress during cold deformation.

    DATA COLLECTION

    We conducted four sets of uniaxial tensile tests at strain rates of 0.001 s⁻¹, 0.00052 s⁻¹, 0.0052 s⁻¹, and 0.052 s⁻¹ at room temperature on our Austenitic Stainless Steel 316L sample. According to the ASTM E8 standard, ASS316L sheets with an initial thickness of 4 mm, width of 6 mm, and gauge length of 32 mm were used for the tensile tests, using a compression test machine (Electro Mechanic Instron 4208). The results were transferred to the Santam Machine Controller software for recording, which yielded the extension data (in mm) and the force data (in N); these were converted to true-strain and true-stress values. The data conversion was done by considering the cross-section of the loaded force, which in our case was 24 mm^2.

    DATASET CONTENTS

    15,858 different Strain-Stress values at 4 different strain rates.

    • ./Stress316L_data/labels.csv: Stress values.
    • ./Stress316L_data/features.csv: Strain & Strain rate values for the corresponding points in the ./Stress316L_data/labels.csv.
    • ./Stress316L_data/x_y_initial.csv: Strain-Stress values.
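
    A minimal pandas sketch for loading the feature/label pair listed above; the column layouts are assumed from the file descriptions:

    import pandas as pd

    # Paths as listed above; columns are assumed from the descriptions.
    features = pd.read_csv("Stress316L_data/features.csv")  # strain and strain-rate values
    labels = pd.read_csv("Stress316L_data/labels.csv")      # corresponding flow-stress values

    print(features.shape, labels.shape)  # expected: 15,858 rows each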

    DATA FORMAT FOR ALL THE FILES

    All the files are provided in the "csv" format.

    The dataset URL:

    https://kaggle.com/mahshadlotfinia/Stress316L/
    

    LICENSE

    The accompanying dataset is released under a Creative Commons Attribution 4.0 International License.

    SOURCE CODE

    The official source code of the paper: https://github.com/mahshadlotfinia/Stress316L/

    CONTACT

    E-mail: mahshad.lotfinia@alum.sharif.edu

    REFERENCES:

    Materials Science and Engineering Mechanical Lab, the Sharif University of Technology, Tehran, Iran.

  18. Tensorflow-Friendly-MRNA-Competition-Dataset

    • kaggle.com
    zip
    Updated Oct 25, 2023
    Cite
    Harrison TW White (2023). Tensorflow-Friendly-MRNA-Competition-Dataset [Dataset]. https://www.kaggle.com/datasets/harrisontwwhite/tensorflow-friendly-mrna-competition-dataset
    Explore at:
    zip (1046236048 bytes)
    Dataset updated
    Oct 25, 2023
    Authors
    Harrison TW White
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset acts as a link between the competition dataset CSV: https://www.kaggle.com/competitions/rsna-breast-cancer-detection

    and the 256x256 images of that data set created here: https://www.kaggle.com/datasets/theoviel/rsna-breast-cancer-256-pngs

    This should allow the data to be read in as a directory from TensorFlow allowing the labels to be attached to the images themselves rather than in a separate csv file.
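
    A minimal TensorFlow sketch of reading such a directory; the folder name and layout (one sub-folder per label) are assumptions about how this dataset is organised:

    import tensorflow as tf

    # Assumed layout: images_256/<label>/<image>.png, one sub-folder per class.
    train_ds = tf.keras.utils.image_dataset_from_directory(
        "images_256",
        image_size=(256, 256),
        batch_size=32,
    )

    for images, labels in train_ds.take(1):
        print(images.shape, labels.shape)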

  19. CIFAR-10 Python in CSV

    • kaggle.com
    zip
    Updated Jun 22, 2021
    Cite
    fedesoriano (2021). CIFAR-10 Python in CSV [Dataset]. https://www.kaggle.com/fedesoriano/cifar10-python-in-csv
    Explore at:
    zip (218807675 bytes)
    Dataset updated
    Jun 22, 2021
    Authors
    fedesoriano
    Description

    Context

    The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. The classes are completely mutually exclusive. There are 50000 training images and 10000 test images.

    The batches.meta file contains the label names of each class.

    The dataset was originally divided into 5 training batches with 10000 images per batch. The original dataset can be found here: https://www.cs.toronto.edu/~kriz/cifar.html. This dataset contains all the training data and test data in the same CSV file, so it is easier to load.

    Content

    Here is the list of the 10 classes in the CIFAR-10:

    Classes:
    0: airplane
    1: automobile
    2: bird
    3: cat
    4: deer
    5: dog
    6: frog
    7: horse
    8: ship
    9: truck

    Acknowledgements

    • Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009. Link

    How to load the batches.meta file (Python)

    The function used to open the file:

    def unpickle(file):
        import pickle
        with open(file, 'rb') as fo:
            dict = pickle.load(fo, encoding='bytes')
        return dict

    Example of how to read the file:

    metadata_path = './cifar-10-python/batches.meta'  # change this path
    metadata = unpickle(metadata_path)

  20. DeBERTa-v3-Base for Sentiment Regression

    • kaggle.com
    zip
    Updated Aug 10, 2024
    Cite
    AnthonyTherrien (2024). DeBERTa-v3-Base for Sentiment Regression [Dataset]. https://www.kaggle.com/datasets/anthonytherrien/deberta-v3-base-for-sentiment-regression
    Explore at:
    zip (664634762 bytes)
    Dataset updated
    Aug 10, 2024
    Authors
    AnthonyTherrien
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Project Overview

    • Objective: Fine-tune the microsoft/deberta-v3-base model for sentiment regression.
    • Dataset: CSV file containing 1.6 million tweets with sentiment labels.

    Dataset

    • Source: training.1600000.processed.noemoticon.csv
    • Link: https://www.kaggle.com/datasets/kazanova/sentiment140
    • Columns:
      • target: Sentiment polarity (converted to float)
      • ids: Tweet IDs
      • date: Date of the tweet
      • flag: Query flag
      • user: User handle
      • text: Tweet text
    • Size: 1.6 million rows

    Preprocessing Steps

    1. Load Dataset: Loaded CSV file without headers using ISO-8859-1 encoding.
    2. Rename Columns: Renamed columns for better readability.
    3. Target Conversion: Converted target column to float.
    4. Shuffle Dataset: Shuffled dataset with a seed for randomness.

    Model Selection

    • Model: microsoft/deberta-v3-base
    • Tokenizer: Used the AutoTokenizer from Hugging Face with max_length=160 and padding='max_length'.

    Tokenization

    • Process:
      • Tokenized the dataset using multiprocessing (12 cores).
      • Applied padding and truncation to ensure uniform input size.

    Dataset Split

    • Train/Test Split:
      • Training set: 97.5% of the data
      • Validation set: 2.5% of the data

    Training Configuration

    • Training Arguments:
      • Learning Rate: 1.25e-5
      • Batch Size: 24
      • Epochs: 2
      • Weight Decay: 0.001
      • Gradient Accumulation: 6 steps
      • Warmup Steps: 256
      • Evaluation Strategy: Evaluate at the end of each epoch
      • Mixed Precision Training: Enabled (fp16=True)
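
    A minimal sketch of this configuration with the Hugging Face Trainer; the tiny stand-in dataset and output directory name are assumptions, and a real run would use the tokenized Sentiment140 splits described above:

    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    model = AutoModelForSequenceClassification.from_pretrained(
        "microsoft/deberta-v3-base", num_labels=1)  # single output unit -> regression head
    tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

    # Tiny stand-in for the tokenized Sentiment140 splits (illustration only).
    raw = Dataset.from_dict({"text": ["great day", "awful service"], "label": [4.0, 0.0]})
    tokenized = raw.map(lambda b: tokenizer(b["text"], max_length=160,
                                            padding="max_length", truncation=True), batched=True)

    args = TrainingArguments(
        output_dir="deberta-sentiment-regression",  # assumed output directory
        learning_rate=1.25e-5,
        per_device_train_batch_size=24,
        num_train_epochs=2,
        weight_decay=0.001,
        gradient_accumulation_steps=6,
        warmup_steps=256,
        evaluation_strategy="epoch",
        fp16=True,  # mixed precision; requires a GPU
    )

    trainer = Trainer(model=model, args=args, train_dataset=tokenized, eval_dataset=tokenized)
    # trainer.train()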

    Model Training

    • Trainer: Used Hugging Face's Trainer class for model training and evaluation.

    Evaluation

    • Results: The model was evaluated on the validation set after training, with results saved for further analysis.

    Conclusion

    • The fine-tuned DeBERTa-v3 model is now ready for sentiment regression tasks, with the final model and tokenizer saved for deployment.

    Citation

    @misc{he2021debertav3,
    title={DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing}, 
    author={Pengcheng He and Jianfeng Gao and Weizhu Chen},
    year={2021},
    eprint={2111.09543},
    archivePrefix={arXiv},
    primaryClass={cs.CL}}
    
    @inproceedings{
    he2021deberta,
    title={DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION},
    author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
    booktitle={International Conference on Learning Representations},
    year={2021},
    url={https://openreview.net/forum?id=XPZIaotutsD}}
    