This dataset was downloaded from excelbianalytics.com, where it was generated with random VBA logic. I recently performed an extensive exploratory data analysis on it and added new columns, namely Unit margin, Order year, Order month, Order weekday, and Order_Ship_Days, which I think can help with analysis of the data. I shared it because I thought it was a great dataset for newbies like myself to practice analytical processes on.
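The five derived columns can be reproduced with a few lines of code. The sketch below is a minimal, dependency-free illustration. The source column names (`Order Date`, `Ship Date`, `Unit Price`, `Unit Cost`), the ISO date format, and the definition of unit margin as price minus cost are assumptions, not details confirmed by the dataset description.

```python
from datetime import date

# Weekday names indexed by date.weekday() (Monday is 0).
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def add_derived_columns(row):
    """Return a copy of one sales record with the five derived columns added."""
    order_date = date.fromisoformat(row["Order Date"])
    ship_date = date.fromisoformat(row["Ship Date"])
    enriched = dict(row)
    # Assumed definition: margin earned on a single unit.
    enriched["Unit margin"] = round(row["Unit Price"] - row["Unit Cost"], 2)
    enriched["Order year"] = order_date.year
    enriched["Order month"] = order_date.month
    enriched["Order weekday"] = WEEKDAYS[order_date.weekday()]
    # Number of days between placing and shipping the order.
    enriched["Order_Ship_Days"] = (ship_date - order_date).days
    return enriched


# Made-up record used only to exercise the function.
sample = {
    "Order Date": "2023-01-15",
    "Ship Date": "2023-01-20",
    "Unit Price": 9.33,
    "Unit Cost": 6.92,
}
enriched = add_derived_columns(sample)
print(enriched)
```

The same logic ports directly to a dataframe library if the file is loaded that way.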
License: https://creativecommons.org/publicdomain/zero/1.0/
Comprehensive football (soccer) data lake from Transfermarkt, clean and structured for analysis and machine learning.
Everything in raw CSV format – perfect for EDA, ML, and advanced football analytics.
A complete football data lake covering players, teams, transfers, performances, market values, injuries, and national team stats. Perfect for analysts, data scientists, researchers, and enthusiasts.
Here’s the high-level schema to help you understand the dataset structure:
Figure: Transfermarkt Dataset ER Diagram (https://i.imgur.com/WXLIx3L.png)
Organized into 10 well-structured CSV categories:
Most football datasets are pre-processed and restrictive. This one is raw, rich, and flexible:
I’m always excited to collaborate on innovative football data projects. If you’ve got an idea, let’s make it happen together!
If this dataset helps you:
- Upvote on Kaggle
- Star the GitHub repo
- Share with others in the football analytics community
Keywords: football analytics, soccer dataset, transfermarkt, sports analytics, machine learning, football research, player statistics
🔥 Analyze football like never before. Your next AI or analytics project starts here.
This data is artificially generated. It can be used for practicing data visualization and analysis skills. Please note that since the data is generated randomly, it may not reflect real-world sales data accurately. However, it should serve as a good starting point for practicing data analysis and visualization.
Description:
- Sales Date: This column contains the date of each sale. The dates are generated for a period of 120 days starting from January 1, 2023.
- Category: This column contains the category of the product sold. The categories include ‘Electronics’, ‘Clothing’, and ‘Home & Kitchen’.
- Subcategory: This column contains the subcategory of the product sold. Each category has its own set of subcategories. For example, the ‘Electronics’ category includes subcategories such as ‘Communication’, ‘Computers’, and ‘Wearables’.
- ProductName: This column contains the name of the product sold. Each subcategory has its own set of products. For example, the ‘Communication’ subcategory includes products such as ‘Walkie Talkie’, ‘Cell Phone’, and ‘Smart Phone’.
- Salesperson: This column contains the name of the salesperson who made the sale. There are different salespersons assigned to each category.
- Gender: This column contains the gender of the salesperson. The gender is determined based on the salesperson’s name.
- Unit sold: This column contains the number of units of the product sold in the sale. The number of units sold is a random number between 1 and 100.
- Original Price: This column contains the original price of the product. The original price is a random number between 10 and 1000.
- Sales Price: This column contains the sales price of the product. The sales price is calculated as a random fraction of the original price, ensuring that the sales price is always slightly higher than the original price.
For information on 'How to generate a dataset', click here.
This dataset was created by Merve Afranur ARTAR
Description: This dataset contains detailed information about videos from various YouTube channels that specialize in data science and analytics. It includes metrics such as views, likes, comments, and publication dates. The dataset consists of 22,862 rows, providing a robust sample for analyzing trends in content engagement, popularity of topics over time, and comparison of channels' performance.
Column Descriptors:
- Channel_Name: The name of the YouTube channel.
- Title: The title of the video.
- Published_date: The date when the video was published.
- Views: The number of views the video has received.
- Like_count: The number of likes the video has received.
- Comment_Count: The number of comments on the video.
This dataset contains information from the following YouTube channels:
['sentdex', 'freeCodeCamp.org', 'CampusX', 'Darshil Parmar', 'Keith Galli', 'Alex The Analyst', 'Socratica', 'Krish Naik', 'StatQuest with Josh Starmer', 'Nicholas Renotte', 'Leila Gharani', 'Rob Mulla', 'Ryan Nolan Data', 'techTFQ', 'Dataquest', 'WsCube Tech', 'Chandoo', 'Luke Barousse', 'Andrej Karpathy', 'Thu Vu data analytics', 'Guy in a Cube', 'Tableau Tim', 'codebasics', 'DeepLearningAI', 'Rishabh Mishra', 'ExcelIsFun', 'Kevin Stratvert', 'Ken Jee', 'Kaggle', 'Tina Huang']
This dataset can be used for various analyses, including but not limited to:
Identifying the most popular videos and channels in the data science field.
Understanding viewer engagement trends over time.
Comparing the performance of different types of content across multiple channels.
Performing a comparison between different channels to find the best-performing ones.
Identifying the best videos to watch for specific topics in data science and analytics.
Conducting a detailed analysis of your favorite YouTube channel to understand its content strategy and performance.
Note: The data is current as of the date of extraction and may not reflect real-time changes on YouTube. For any analysis, be sure to consider the date when the data was last updated to maintain accuracy and relevance.
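As a starting point for the channel comparisons listed above, the sketch below aggregates the documented columns per channel and ranks channels by likes per 1,000 views. The metric itself and the sample rows are illustrative assumptions. Only the column names (`Channel_Name`, `Views`, `Like_count`, `Comment_Count`) come from the dataset description.

```python
from collections import defaultdict


def channel_engagement(videos):
    """Sum views, likes, and comments per channel and add likes per 1,000 views."""
    totals = defaultdict(lambda: {"views": 0, "likes": 0, "comments": 0})
    for video in videos:
        channel = totals[video["Channel_Name"]]
        channel["views"] += video["Views"]
        channel["likes"] += video["Like_count"]
        channel["comments"] += video["Comment_Count"]

    summary = {}
    for name, channel in totals.items():
        # Guard against channels whose videos have no recorded views.
        rate = 1000 * channel["likes"] / channel["views"] if channel["views"] else 0.0
        summary[name] = {**channel, "likes_per_1k_views": round(rate, 2)}
    return summary


# Made-up rows; real rows would be read from the CSV file.
videos = [
    {"Channel_Name": "Channel A", "Title": "Intro", "Views": 1000, "Like_count": 50, "Comment_Count": 5},
    {"Channel_Name": "Channel A", "Title": "Part 2", "Views": 3000, "Like_count": 70, "Comment_Count": 10},
    {"Channel_Name": "Channel B", "Title": "Demo", "Views": 500, "Like_count": 40, "Comment_Count": 2},
]
summary = channel_engagement(videos)
ranked = sorted(summary, key=lambda name: summary[name]["likes_per_1k_views"], reverse=True)
print(ranked)
```

Normalizing by views keeps small channels comparable with large ones, which raw like counts would not.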
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Dirty Retail Store Sales dataset contains 12,575 rows of synthetic data representing sales transactions from a retail store. The dataset includes eight product categories with 25 items per category, each having static prices. It is designed to simulate real-world sales data, including intentional "dirtiness" such as missing or inconsistent values. This dataset is suitable for practicing data cleaning, exploratory data analysis (EDA), and feature engineering.
File: retail_store_sales.csv
| Column Name | Description | Example Values |
|---|---|---|
| Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567 |
| Customer ID | A unique identifier for each customer. 25 unique customers. | CUST_01 |
| Category | The category of the purchased item. | Food, Furniture |
| Item | The name of the purchased item. May contain missing values or None. | Item_1_FOOD, None |
| Price Per Unit | The static price of a single unit of the item. May contain missing or None values. | 4.00, None |
| Quantity | The quantity of the item purchased. May contain missing or None values. | 1, None |
| Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, None |
| Payment Method | The method of payment used. May contain missing or invalid values. | Cash, Credit Card |
| Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Online |
| Transaction Date | The date of the transaction. Always present and valid. | 2023-01-15 |
| Discount Applied | Indicates if a discount was applied to the transaction. May contain missing values. | True, False, None |
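Because Total Spent is defined as Quantity * Price Per Unit, any one of the three fields can be recovered when the other two are present. The sketch below shows one way to do that. The function name and the sample rows are hypothetical, and it assumes missing values have already been parsed to `None`.

```python
def repair_row(row):
    """Fill whichever of Quantity, Price Per Unit, or Total Spent is missing,
    provided the other two are present. Rows missing two or more stay unchanged."""
    quantity = row.get("Quantity")
    price = row.get("Price Per Unit")
    total = row.get("Total Spent")
    fixed = dict(row)
    if total is None and quantity is not None and price is not None:
        fixed["Total Spent"] = round(quantity * price, 2)
    elif quantity is None and total is not None and price:
        # Quantities are whole units, so round to the nearest integer.
        fixed["Quantity"] = round(total / price)
    elif price is None and total is not None and quantity:
        fixed["Price Per Unit"] = round(total / quantity, 2)
    return fixed


# Made-up rows, each missing one of the three linked fields.
missing_total = repair_row({"Quantity": 2, "Price Per Unit": 4.0, "Total Spent": None})
missing_quantity = repair_row({"Quantity": None, "Price Per Unit": 4.0, "Total Spent": 8.0})
missing_price = repair_row({"Quantity": 2, "Price Per Unit": None, "Total Spent": 8.0})
unrecoverable = repair_row({"Quantity": None, "Price Per Unit": None, "Total Spent": 8.0})
print(missing_total, missing_quantity, missing_price, unrecoverable)
```

Since prices are static per item, a missing Price Per Unit could also be looked up from the item tables when the Item value is known.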
The dataset includes the following categories, each containing 25 items with corresponding codes, names, and static prices:
| Item Code | Item Name | Price |
|---|---|---|
| Item_1_EHE | Blender | 5.0 |
| Item_2_EHE | Microwave | 6.5 |
| Item_3_EHE | Toaster | 8.0 |
| Item_4_EHE | Vacuum Cleaner | 9.5 |
| Item_5_EHE | Air Purifier | 11.0 |
| Item_6_EHE | Electric Kettle | 12.5 |
| Item_7_EHE | Rice Cooker | 14.0 |
| Item_8_EHE | Iron | 15.5 |
| Item_9_EHE | Ceiling Fan | 17.0 |
| Item_10_EHE | Table Fan | 18.5 |
| Item_11_EHE | Hair Dryer | 20.0 |
| Item_12_EHE | Heater | 21.5 |
| Item_13_EHE | Humidifier | 23.0 |
| Item_14_EHE | Dehumidifier | 24.5 |
| Item_15_EHE | Coffee Maker | 26.0 |
| Item_16_EHE | Portable AC | 27.5 |
| Item_17_EHE | Electric Stove | 29.0 |
| Item_18_EHE | Pressure Cooker | 30.5 |
| Item_19_EHE | Induction Cooktop | 32.0 |
| Item_20_EHE | Water Dispenser | 33.5 |
| Item_21_EHE | Hand Blender | 35.0 |
| Item_22_EHE | Mixer Grinder | 36.5 |
| Item_23_EHE | Sandwich Maker | 38.0 |
| Item_24_EHE | Air Fryer | 39.5 |
| Item_25_EHE | Juicer | 41.0 |
| Item Code | Item Name | Price |
|---|---|---|
| Item_1_FUR | Office Chair | 5.0 |
| Item_2_FUR | Sofa | 6.5 |
| Item_3_FUR | Coffee Table | 8.0 |
| Item_4_FUR | Dining Table | 9.5 |
| Item_5_FUR | Bookshelf | 11.0 |
| Item_6_FUR | Bed F... |
License: https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides a comprehensive collection of synthetic job postings to facilitate research and analysis in the field of job market trends, natural language processing (NLP), and machine learning. Created for educational and research purposes, this dataset offers a diverse set of job listings across various industries and job types.
We would like to express our gratitude to the Python Faker library for its invaluable contribution to the dataset generation process. Additionally, we appreciate the guidance provided by ChatGPT in fine-tuning the dataset, ensuring its quality, and adhering to ethical standards.
Please note that the examples provided are fictional and for illustrative purposes. You can tailor the descriptions and examples to match the specifics of your dataset. It is not suitable for real-world applications and should only be used within the scope of research and experimentation. You can also reach me via email at: rrana157@gmail.com
License: https://creativecommons.org/publicdomain/zero/1.0/
📊 Delhi Metro Ridership & Operational Statistics Dataset
A comprehensive dataset representing ridership, ticket revenue, and operational performance of the Delhi Metro, one of the largest urban transit systems in the world.
The Delhi Metro is a rapid transit system serving the National Capital Region (NCR) of India. It plays a crucial role in reducing traffic congestion and providing sustainable public transportation to millions of passengers every day.
This dataset captures multiple performance indicators of the Delhi Metro network over time, including:
- Total metro trips operated
- Daily total passengers
- Ticket revenue
- Average passenger distance traveled per trip
- Top stations based on passenger demand
- Total stations operational
These data points help in analyzing metro usage patterns, operational efficiency, and transit demand in the region.
This dataset enables research in:
- Urban transport planning
- Revenue & demand forecasting
- Passenger travel behavior analysis
- Transportation infrastructure optimization
- Dashboard development & data storytelling
- Academic machine learning projects
Data has been collected, cleaned, and aggregated using publicly available metro operational insights, news reports, and transit performance summaries released by the Delhi Metro Rail Corporation (DMRC).
| Field | Description |
|---|---|
| Date | Date of operation |
| Total_Trips | Number of train trips operated on that day |
| Total_Passengers | Total ridership for that day |
| Total_Revenue | Ticketing revenue (₹ INR) |
| Avg_Fare | Revenue divided by passengers |
| Avg_Distance | Estimated average travel distance per passenger |
| Passengers_per_Trip | Ridership divided by number of trips |
| Revenue_Ticket | Ticket revenue per trip |
| Ticket_Type (optional) | Type of ticket or trip category |
| Top_Stations | Highest-demand stations on that day |
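Two of the fields in the table are defined as ratios of other fields, so they can be recomputed as a consistency check. The sketch below follows the table's definitions of Avg_Fare and Passengers_per_Trip. The function name and the sample figures are made up for illustration and are not taken from the dataset.

```python
def derive_metrics(day):
    """Recompute the ratio fields for one day of operation."""
    passengers = day["Total_Passengers"]
    trips = day["Total_Trips"]
    derived = dict(day)
    # Avg_Fare: revenue divided by passengers.
    derived["Avg_Fare"] = round(day["Total_Revenue"] / passengers, 2) if passengers else None
    # Passengers_per_Trip: ridership divided by number of trips.
    derived["Passengers_per_Trip"] = round(passengers / trips, 2) if trips else None
    return derived


# Illustrative figures only.
day = {
    "Date": "2023-01-01",
    "Total_Trips": 4000,
    "Total_Passengers": 5_000_000,
    "Total_Revenue": 150_000_000,
}
derived = derive_metrics(day)
print(derived["Avg_Fare"], derived["Passengers_per_Trip"])
```

Comparing the recomputed values with the stored columns is a quick way to spot rows where the aggregates disagree.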
License: CC BY 4.0 (Users must provide attribution when using the dataset)
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The Car Prices dataset contains detailed information about various car models, including their manufacturing year, make, model, trim, body type, transmission, and state of condition. With over 550,000 entries, this dataset is an excellent resource for exploring trends in car prices, analyzing market value fluctuations, and developing predictive models for the automotive industry.
| Year | Make | Model | Trim | Body | Transmission | State | Condition | Odometer |
|---|---|---|---|---|---|---|---|---|
| 2015 | Kia | Sorento | LX | SUV | Automatic | CA | 5 | 16,639 |
| 2014 | BMW | 3 Series | 328i | Sedan | Automatic | CA | 4 | 13,310 |
| 2015 | Nissan | Altima | 2.5 S | Sedan | Automatic | CA | 1 | 5,554 |
| 2014 | Chevrolet | Camaro | LT | Convertible | Automatic | CA | 3 | 4,809 |
| 2015 | Ford | Fusion | SE | Sedan | Automatic | CA | 2 | 5,559 |
This dataset is available under the MIT License, making it suitable for both commercial and non-commercial use.
Download Now and explore the intricacies of car prices with this rich and diverse dataset!
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Detailed dataset comprising health and demographic data of 100,000 individuals, aimed at facilitating diabetes-related research and predictive modeling. This dataset includes information on gender, age, location, race, hypertension, heart disease, smoking history, BMI, HbA1c level, blood glucose level, and diabetes status.
This dataset can be used for various analytical and machine learning purposes, such as:
License: https://creativecommons.org/publicdomain/zero/1.0/
🛒 E-Commerce Customer Behavior and Sales Dataset
📊 Dataset Overview
This comprehensive dataset contains 5,000 e-commerce transactions from a Turkish online retail platform, spanning from January 2023 to March 2024. The dataset provides detailed insights into customer demographics, purchasing behavior, product preferences, and engagement metrics.
🎯 Use Cases
This dataset is perfect for:
- Customer Segmentation Analysis: Identify distinct customer groups based on behavior
- Sales Forecasting: Predict future sales trends and patterns
- Recommendation Systems: Build product recommendation engines
- Customer Lifetime Value (CLV) Prediction: Estimate customer value
- Churn Analysis: Identify customers at risk of leaving
- Marketing Campaign Optimization: Target customers effectively
- Price Optimization: Analyze price sensitivity across categories
- Delivery Performance Analysis: Optimize logistics and shipping
📁 Dataset Structure
The dataset contains 18 columns with the following features:
Order Information
- Order_ID: Unique identifier for each order (ORD_XXXXXX format)
- Date: Transaction date (2023-01-01 to 2024-03-26)
Customer Demographics
- Customer_ID: Unique customer identifier (CUST_XXXXX format)
- Age: Customer age (18-75 years)
- Gender: Customer gender (Male, Female, Other)
- City: Customer city (10 major Turkish cities)
Product Information
- Product_Category: 8 categories (Electronics, Fashion, Home & Garden, Sports, Books, Beauty, Toys, Food)
- Unit_Price: Price per unit (in TRY/Turkish Lira)
- Quantity: Number of units purchased (1-5)
Transaction Details
- Discount_Amount: Discount applied (if any)
- Total_Amount: Final transaction amount after discount
- Payment_Method: Payment method used (5 types)
Customer Behavior Metrics
- Device_Type: Device used for purchase (Mobile, Desktop, Tablet)
- Session_Duration_Minutes: Time spent on website (1-120 minutes)
- Pages_Viewed: Number of pages viewed during session (1-50)
- Is_Returning_Customer: Whether customer has purchased before (True/False)
Post-Purchase Metrics
- Delivery_Time_Days: Delivery duration (1-30 days)
- Customer_Rating: Customer satisfaction rating (1-5 stars)
📈 Key Statistics
- Total Records: 5,000 transactions
- Date Range: January 2023 - March 2024 (15 months)
- Average Transaction Value: ~450 TRY
- Customer Satisfaction: 3.9/5.0 average rating
- Returning Customer Rate: 60%
- Mobile Usage: 55% of transactions
🔍 Data Quality
✅ No missing values
✅ Consistent formatting across all fields
✅ Realistic data distributions
✅ Proper data types for all columns
✅ Logical relationships between features
💡 Sample Analysis Ideas
- Customer Segmentation with K-Means Clustering: Segment customers based on spending, frequency, and recency
- Sales Trend Analysis: Identify seasonal patterns and peak shopping periods
- Product Category Performance: Compare revenue, ratings, and return rates across categories
- Device-Based Behavior Analysis: Understand how device choice affects purchasing patterns
- Predictive Modeling: Build models to predict customer ratings or purchase amounts
- City-Level Market Analysis: Compare market performance across different cities
🛠️ Technical Details
- File Format: CSV (Comma-Separated Values)
- Encoding: UTF-8
- File Size: ~500 KB
- Delimiter: Comma (,)
📚 Column Descriptions
| Column Name | Data Type | Description | Example |
|---|---|---|---|
| Order_ID | String | Unique order identifier | ORD_001337 |
| Customer_ID | String | Unique customer identifier | CUST_01337 |
| Date | DateTime | Transaction date | 2023-06-15 |
| Age | Integer | Customer age | 35 |
| Gender | String | Customer gender | Female |
| City | String | Customer city | Istanbul |
| Product_Category | String | Product category | Electronics |
| Unit_Price | Float | Price per unit | 1299.99 |
| Quantity | Integer | Units purchased | 2 |
| Discount_Amount | Float | Discount applied | 129.99 |
| Total_Amount | Float | Final amount paid | 2469.99 |
| Payment_Method | String | Payment method | Credit Card |
| Device_Type | String | Device used | Mobile |
| Session_Duration_Minutes | Integer | Session time | 15 |
| Pages_Viewed | Integer | Pages viewed | 8 |
| Is_Returning_Customer | Boolean | Returning customer | True |
| Delivery_Time_Days | Integer | Delivery duration | 3 |
| Customer_Rating | Integer | Satisfaction rating | 5 |
🎓 Learning Outcomes
By working with this dataset, you can learn:
- Data cleaning and preprocessing techniques
- Exploratory Data Analysis (EDA) with Python/R
- Statistical analysis and hypothesis testing
- Machine learning model development
- Data visualization best practices
- Business intelligence and reporting
📝 Citation
If you use this dataset in your research or project, please cite:
E-Commerce Customer Behavior and Sales Dataset (2024)
Turkish Online Retail Platform Data (2023-2024)
Available on Kaggle
⚖️ License
This dataset is released under the CC0: Public Domain license. You are free to use it for any purpose.
🤝 Contribution Found any issues or have suggestions? Feel free to provide feedback!
📞 Contact For questions or collaborations, please reach out through Kaggle.
Happy Analyzing! 🚀
Keywords: e-c...
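For the e-commerce dataset above, the "logical relationships between features" claim can be spot-checked. The sketch below assumes that Total_Amount equals Unit_Price times Quantity minus Discount_Amount. The description only says "final transaction amount after discount", so the formula is an assumption, although it matches the example values given in the column descriptions (1299.99, 2, 129.99, 2469.99). The function name and tolerance are hypothetical.

```python
import math


def total_is_consistent(row, tolerance=0.01):
    """Check Total_Amount against Unit_Price * Quantity - Discount_Amount (assumed formula)."""
    expected = row["Unit_Price"] * row["Quantity"] - row["Discount_Amount"]
    # Compare with a small absolute tolerance to absorb floating-point rounding.
    return math.isclose(expected, row["Total_Amount"], abs_tol=tolerance)


# Example values from the column descriptions.
example_row = {
    "Unit_Price": 1299.99,
    "Quantity": 2,
    "Discount_Amount": 129.99,
    "Total_Amount": 2469.99,
}
# Same row with the discount ignored in the total, which should fail the check.
bad_row = {**example_row, "Total_Amount": 2599.98}

example_ok = total_is_consistent(example_row)
bad_ok = total_is_consistent(bad_row)
print(example_ok, bad_ok)
```

Counting the rows that fail this check is a cheap first test of the stated data quality.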
License: http://opendatacommons.org/licenses/dbcl/1.0/
Overview
Dataset Title: Diabetes Dataset 2019
Year: 2019
Variables (Columns)
- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration (mg/dL)
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skinfold thickness (mm)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg / (height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function (a measure of genetic influence)
- Age: Age (years)
- Outcome: Binary variable indicating the presence (1) or absence (0) of diabetes
Data Examples:
The dataset contains multiple rows, with each row representing an individual case or patient. Each row includes information on the number of pregnancies, glucose levels, blood pressure, skin thickness, insulin levels, BMI, diabetes pedigree function, age, and outcome (diabetes presence or absence).
The purpose of this dataset is to study the relationship between various factors (e.g., pregnancies, glucose levels, BMI) and the presence or absence of diabetes.
Diabetes Dataset Analysis
- Exploratory Data Analysis (EDA): Explore the distributions, relationships, and summary statistics of the variables.
- Predictive Modeling: Develop predictive models to determine the likelihood of diabetes based on the given variables.
- Feature Importance: Assess the importance of each variable in predicting the presence or absence of diabetes.
- Risk Assessment: Identify key risk factors associated with diabetes based on the dataset.
License: https://creativecommons.org/publicdomain/zero/1.0/
The Social Media Sentiments Analysis Dataset offers a fascinating glimpse into the intricate tapestry of emotions, trends, and interactions prevalent across diverse social media platforms. This dataset serves as a snapshot of user-generated content, encompassing textual expressions, timestamps, hashtags, geographical locations, engagement metrics such as likes and retweets, and user identifiers. Each entry unveils a unique narrative—moments of surprise, excitement, admiration, thrill, contentment, and more—shared by individuals globally.
Key Features
Text: The user-generated content, a window into diverse sentiments.
Sentiment: Emotions categorized for insightful analysis.
Timestamp: Date and time details providing a temporal dimension.
User: Unique identifiers of contributors, enabling user-specific insights.
Platform: Indicates the social media platform of origin, allowing platform-specific analysis.
Hashtags: Identifies trending topics and themes, unraveling popular narratives.
Likes: Quantifies user engagement, reflecting content appreciation.
Retweets: Reflects content popularity, showcasing the extent of its reach.
Country: Geographical origin of each post, facilitating geographical analysis.
Year, Month, Day, Hour: Temporal details for comprehensive temporal analysis.
The richness of the dataset allows for versatile analytical applications:
Sentiment Analysis: Explore the emotional landscape by categorizing user-generated content into surprise, excitement, admiration, thrill, contentment, and more.
Temporal Analysis: Investigate trends over time, identifying patterns, fluctuations, or recurring themes in social media content.
User Behavior Insights: Analyze user engagement through likes and retweets, discovering popular content and user preferences.
Platform-Specific Analysis: Examine variations in content across different social media platforms, understanding how sentiments vary.
Hashtag Trends: Identify trending topics and themes by analyzing hashtags, uncovering popular or recurring ones.
Geographical Analysis: Explore content distribution based on the country of origin, understanding regional variations in sentiment and topic preferences.
User Identification: Utilize user identifiers to track specific contributors, analyzing the impact of influential users on sentiment trends.
Cross-Analysis: Combine multiple features for in-depth insights. For example, analyze sentiment trends over time or across different platforms and countries.
In conclusion, the Social Media Sentiments Analysis Dataset provides a robust foundation for nuanced explorations into the dynamic world of social media interactions, offering researchers and analysts a wealth of opportunities for comprehensive insights.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains a list of sales and movement data by item and department appended monthly.
It is rich in information that can be leveraged for various data science applications. For instance, analyzing this dataset can offer insights into consumer behavior, such as preferences for specific types of beverages (e.g., wine, beer) during different times of the year. Furthermore, the dataset can be used to identify trends in sales and transfers, highlighting seasonal effects or the impact of certain suppliers on the market.
One could start with exploratory data analysis (EDA) to understand the basic distribution of sales and transfers across different item types and suppliers. Time series analysis can provide insights into seasonal trends and sales forecasts. Cluster analysis might reveal groups of suppliers or items with similar sales patterns, which can be useful for targeted marketing and inventory management.
The Customer Shopping Preferences Dataset offers valuable insights into consumer behavior and purchasing patterns. Understanding customer preferences and trends is critical for businesses to tailor their products, marketing strategies, and overall customer experience. This dataset captures a wide range of customer attributes including age, gender, purchase history, preferred payment methods, frequency of purchases, and more. Analyzing this data can help businesses make informed decisions, optimize product offerings, and enhance customer satisfaction. The dataset stands as a valuable resource for businesses aiming to align their strategies with customer needs and preferences. It's important to note that this is a synthetic dataset created for beginners to learn more about data analysis and machine learning.
This dataset encompasses various features related to customer shopping preferences, gathering essential information for businesses seeking to enhance their understanding of their customer base. The features include customer age, gender, purchase amount, preferred payment methods, frequency of purchases, and feedback ratings. Additionally, data on the type of items purchased, shopping frequency, preferred shopping seasons, and interactions with promotional offers is included. With a collection of 3900 records, this dataset serves as a foundation for businesses looking to apply data-driven insights for better decision-making and customer-centric strategies.
This dataset is a synthetic creation generated using ChatGPT to simulate a realistic customer shopping experience. Its purpose is to provide a platform for beginners and data enthusiasts, allowing them to create, enjoy, practice, and learn from a dataset that mirrors real-world customer shopping behavior. The aim is to foster learning and experimentation in a simulated environment, encouraging a deeper understanding of data analysis and interpretation in the context of consumer preferences and retail scenarios.
Cover Photo by: Freepik
Thumbnail by: Clothing icons created by Flat Icons - Flaticon
License: https://creativecommons.org/publicdomain/zero/1.0/
This extensive dataset comprises approximately 50,000 academic papers along with their corresponding metadata, designed to facilitate various natural language processing (NLP) tasks such as classification and retrieval. The dataset covers a diverse range of research domains, including but not limited to computer science, biology, social sciences, engineering, and more. The list of all categories can be found here. With its comprehensive collection of academic papers and enriched metadata, this dataset serves as a valuable resource for researchers and data enthusiasts interested in advancing NLP applications in the academic domain.
Metadata: The dataset includes essential metadata for each paper, such as the publish date, title, summary/abstract, author(s), and category. The metadata is meticulously curated to ensure accuracy and consistency, enabling researchers to swiftly extract valuable insights and conduct exploratory data analysis.
Vast Paper Collection: With nearly 50,000 academic papers, this dataset encompasses a broad spectrum of research topics and domains, making it suitable for a wide range of NLP tasks, including but not limited to document classification, topic modeling, and document retrieval.
Application Flexibility: The dataset is meticulously preprocessed and annotated, making it adaptable for various NLP applications. Researchers and practitioners can use it for tasks like sentiment analysis, keyword extraction, and more.
Document Classification: Leverage this dataset to build powerful classifiers capable of categorizing academic papers into relevant research domains or topics. This can aid in automated content organization and information retrieval.
Document Retrieval: Develop efficient retrieval models that can quickly identify and retrieve relevant papers based on user queries or specific keywords. Such models can streamline the research process and assist researchers in finding relevant literature faster.
Topic Modeling: Use this dataset to perform topic modeling and extract meaningful topics or themes present within the academic papers. This can provide valuable insights into the prevailing research trends and interests within different disciplines.
Recommendation Systems: Employ the dataset to build personalized recommendation systems that suggest relevant papers to researchers based on their previous interests or research focus.
We would like to express our gratitude to the authors and publishers of the academic papers included in this dataset for their valuable contributions to the research community. By making this dataset publicly available, we hope to foster advancements in natural language processing and support data-driven research across diverse domains.
As the curators of this dataset, we have made every effort to ensure the accuracy and quality of the data. However, we cannot guarantee the absolute correctness of the information or the suitability of the dataset for any specific purpose. Users are encouraged to exercise their judgment and discretion while utilizing the dataset for their research projects.
We sincerely hope that this dataset proves to be a valuable resource for the NLP community and contributes to the development of innovative solutions in academic research and beyond. Happy analyzing and modeling!
This dataset contains Kaggle ML & DS Survey data for 2018-2021. Cleaned and improved dataset.
In the original data (2018, 2019, 2020, 2021), answers to the questions were contained in different columns, and the questions and answer options could differ between years. Single-column and multi-column questions had the same header type: Q1, Q2, ...
In this dataset, questions are grouped into SA / GA categories: single answers and group answers. Columns were also cleaned of stray spaces and inconsistent answer options.
Large categories were modified: values were grouped or categorized as Other. A category is filled only where the value is empty, by replacement rather than by simple summation.
This dataset contains the following:
- kaggle_survey_2018-2021_header.csv: the tabular dataset containing the header data
- kaggle_survey_2018-2021_data.csv: the tabular dataset containing the aggregated data from 2018 to 2021
- code_samples.pdf: pdf file containing code examples
Link: https://www.kaggle.com/c/kaggle-survey-2021
Link: https://www.kaggle.com/c/kaggle-survey-2020
Link: https://www.kaggle.com/c/kaggle-survey-2019
Link: https://www.kaggle.com/kaggle/kaggle-survey-2018
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Employee Satisfaction Survey dataset is a comprehensive collection of information regarding employees within a company. It includes essential details such as employee identification numbers, self-reported satisfaction levels, performance evaluations, project involvement, work hours, tenure with the company, work accidents, promotions received in the last 5 years, departmental affiliations, and salary levels. This dataset offers valuable insights into the factors influencing employee satisfaction and can be used to analyze and understand various aspects of the workplace environment.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Dirty Cafe Sales dataset contains 10,000 rows of synthetic data representing sales transactions in a cafe. This dataset is intentionally "dirty," with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA). It can be used to practice cleaning techniques, data wrangling, and feature engineering.
dirty_cafe_sales.csv
| Column Name | Description | Example Values |
|---|---|---|
| Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567 |
| Item | The name of the item purchased. May contain missing or invalid values (e.g., "ERROR"). | Coffee, Sandwich |
| Quantity | The quantity of the item purchased. May contain missing or invalid values. | 1, 3, UNKNOWN |
| Price Per Unit | The price of a single unit of the item. May contain missing or invalid values. | 2.00, 4.00 |
| Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, 12.00 |
| Payment Method | The method of payment used. May contain missing or invalid values (e.g., None, "UNKNOWN"). | Cash, Credit Card |
| Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Takeaway |
| Transaction Date | The date of the transaction. May contain missing or incorrect values. | 2023-01-01 |
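The relation Total Spent = Quantity * Price Per Unit means that when exactly one of the three amounts is missing or invalid, it can be recovered from the other two. The following is a minimal sketch of that idea in plain Python; the helper names (`to_number`, `repair_amounts`), the set of invalid markers, and the sample transaction IDs are illustrative assumptions, not part of the dataset.

```python
# Markers treated as "no usable value" (an assumption based on the column notes).
INVALID_MARKERS = {"", "ERROR", "UNKNOWN", "None"}


def to_number(raw):
    """Return a float, or None when the cell is missing or holds an invalid marker."""
    if raw is None or raw.strip() in INVALID_MARKERS:
        return None
    try:
        return float(raw)
    except ValueError:
        return None


def repair_amounts(row):
    """Recover one missing amount from the other two, using Total = Quantity * Price."""
    qty = to_number(row.get("Quantity"))
    price = to_number(row.get("Price Per Unit"))
    total = to_number(row.get("Total Spent"))

    if total is None and qty is not None and price is not None:
        total = qty * price
    elif qty is None and total is not None and price:
        qty = total / price  # price is non-zero here
    elif price is None and total is not None and qty:
        price = total / qty  # qty is non-zero here
    # With two or more amounts missing, nothing can be recovered; they stay None.
    return {**row, "Quantity": qty, "Price Per Unit": price, "Total Spent": total}


# Hypothetical rows shaped like dirty_cafe_sales.csv records.
rows = [
    {"Transaction ID": "TXN_0000001", "Quantity": "3", "Price Per Unit": "4.00", "Total Spent": "ERROR"},
    {"Transaction ID": "TXN_0000002", "Quantity": "UNKNOWN", "Price Per Unit": "2.00", "Total Spent": "8.00"},
    {"Transaction ID": "TXN_0000003", "Quantity": "", "Price Per Unit": "", "Total Spent": "5.00"},
]
repaired = [repair_amounts(r) for r in rows]
```

Rows with two or more unusable amounts are left with None values so that a later imputation step can decide how to handle them.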
Missing Values: Some columns (e.g., Item, Payment Method, Location) may contain missing values represented as None or empty cells.
Invalid Values: Some entries contain "ERROR" or "UNKNOWN" to simulate real-world data issues.
Price Consistency:
The dataset includes the following menu items with their respective prices:
| Item | Price($) |
|---|---|
| Coffee | 2 |
| Tea | 1.5 |
| Sandwich | 4 |
| Salad | 5 |
| Cake | 3 |
| Cookie | 1 |
| Smoothie | 4 |
| Juice | 3 |
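Because each menu item has a fixed price, the table above can double as a lookup for repairing records. The sketch below, in plain Python, fills a missing price from the item name and infers an item from a price only when that price is unambiguous; the function names `price_for` and `item_for` are hypothetical. Note that Sandwich and Smoothie share a price of 4, and Cake and Juice share a price of 3, so those items cannot be recovered from price alone.

```python
# Menu prices copied from the table above.
MENU_PRICES = {
    "Coffee": 2.0,
    "Tea": 1.5,
    "Sandwich": 4.0,
    "Salad": 5.0,
    "Cake": 3.0,
    "Cookie": 1.0,
    "Smoothie": 4.0,
    "Juice": 3.0,
}

# Reverse index: price -> list of items sold at that price.
ITEMS_BY_PRICE = {}
for item, price in MENU_PRICES.items():
    ITEMS_BY_PRICE.setdefault(price, []).append(item)


def price_for(item):
    """Return the menu price for a known item, or None for unknown or invalid names."""
    return MENU_PRICES.get(item)


def item_for(price):
    """Return the item sold at this price, but only if exactly one item matches."""
    candidates = ITEMS_BY_PRICE.get(price, [])
    return candidates[0] if len(candidates) == 1 else None
```

Returning None for ambiguous prices keeps the repair conservative: a guessed item would look clean but could be wrong.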
This dataset is suitable for:
- Practicing data cleaning techniques such as handling missing values, removing duplicates, and correcting invalid entries.
- Exploring EDA techniques like visualizations and summary statistics.
- Performing feature engineering for machine learning workflows.
To clean this dataset, consider the following steps:
1. Handle Missing Values:
   - Fill missing numeric values with the median or mean.
   - Replace missing categorical values with the mode or "Unknown."
2. Handle Invalid Values: Replace "ERROR" and "UNKNOWN" with NaN or appropriate values.
3. Date Consistency: Check that Transaction Date values are valid and consistently formatted.
4. Feature Engineering: Create new columns, such as Day of the Week or Transaction Month, for further analysis.
This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.
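The cleaning steps listed above can be sketched in plain Python as follows. This is one possible approach under stated assumptions, not a prescribed implementation: the helper names, the set of invalid markers, the `YYYY-MM-DD` date format (taken from the example value 2023-01-01), and the sample values are all illustrative.

```python
from collections import Counter
from datetime import datetime
from statistics import median

# Markers treated as invalid or missing (an assumption based on the column notes).
INVALID_MARKERS = {"", "ERROR", "UNKNOWN", "None"}
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def clean_cell(raw):
    """Map missing cells and invalid markers to None; otherwise return stripped text."""
    if raw is None:
        return None
    text = raw.strip()
    return None if text in INVALID_MARKERS else text


def to_float(raw):
    """Convert a raw cell to float, or None when the cell is missing or invalid."""
    cleaned = clean_cell(raw)
    return float(cleaned) if cleaned is not None else None


def fill_numeric(values):
    """Replace None entries with the median of the values that are present."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    mid = median(present)
    return [mid if v is None else v for v in values]


def fill_categorical(values, default="Unknown"):
    """Replace None entries with the mode, or with `default` if no value is present."""
    present = [v for v in values if v is not None]
    fill = Counter(present).most_common(1)[0][0] if present else default
    return [fill if v is None else v for v in values]


def date_features(raw, fmt="%Y-%m-%d"):
    """Return (day of week, month) for a valid date, or (None, None) otherwise."""
    cleaned = clean_cell(raw)
    if cleaned is None:
        return None, None
    try:
        parsed = datetime.strptime(cleaned, fmt)
    except ValueError:
        return None, None
    return WEEKDAYS[parsed.weekday()], parsed.month


# Hypothetical column samples.
quantities = fill_numeric([to_float(v) for v in ["1", "3", "UNKNOWN", "5"]])
payments = fill_categorical([clean_cell(v) for v in ["Cash", "Credit Card", "ERROR", "Cash"]])
day_name, month = date_features("2023-01-01")
```

The median is used here rather than the mean because it is less sensitive to outliers; either choice is consistent with the first step.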
If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.
The "Daily Transactions" dataset contains information on dummy transactions made by an individual on a daily basis. The dataset includes data on the products that were purchased, the amount spent on each product, the date and time of each transaction, the payment mode of each transaction, and the source of each record (Expense/Income).
This dataset can be used to analyze purchasing behavior and money management, forecast expenses, and optimize savings and budgeting strategies. It is well suited to data analysis and machine learning applications, and it can be used to train predictive models and support data-driven decisions.
Column Descriptors
This dataset was downloaded from excelbianalytics.com, where it was generated with random VBA logic. I recently performed an extensive exploratory data analysis on it and added new columns, namely Unit margin, Order year, Order month, Order weekday, and Order_Ship_Days, which I think can help with analysis of the data. I shared it because I thought it was a great dataset for newbies like myself to practice analytical processes on.