26 datasets found
  1. Data analysis with pandas and python

    • kaggle.com
    zip
    Updated Apr 16, 2023
    Cite
    乡TOBY乡 (2023). Data analysis with pandas and python [Dataset]. https://www.kaggle.com/datasets/toby000/data-analysis-with-pandas-and-python
    Explore at:
    Available download formats: zip (701073 bytes)
    Dataset updated
    Apr 16, 2023
    Authors
    乡TOBY乡
    Description

    This dataset includes data that is provided in the Udemy course "Data Analysis with Pandas and Python" by Boris Paskhaver.

  2. Sales_2019_Analysis

    • kaggle.com
    zip
    Updated Jan 15, 2023
    Cite
    Đức Phát Trương (2023). Sales_2019_Analysis [Dataset]. https://www.kaggle.com/datasets/cphttrng/sales-2019-analysis
    Explore at:
    Available download formats: zip (2504483 bytes)
    Dataset updated
    Jan 15, 2023
    Authors
    Đức Phát Trương
    Description

    Dataset

    This dataset was created by Đức Phát Trương


  3. Practice Dataset

    • kaggle.com
    zip
    Updated Sep 20, 2025
    Cite
    Seif Hafez (2025). Practice Dataset [Dataset]. https://www.kaggle.com/datasets/seifhafez/practice-dataset
    Explore at:
    Available download formats: zip (13049 bytes)
    Dataset updated
    Sep 20, 2025
    Authors
    Seif Hafez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was created in 2025 by the CATReloaded team in the Data Science Circle at Mansoura University, Faculty of Engineering, Egypt.

    The dataset was originally prepared as the supporting material for a pandas practice notebook. That notebook was designed as a practical task to follow Corey Schafer's YouTube pandas course.

    The goal was to create a comprehensive pandas challenge that includes almost every skill you might need when working with pandas. The idea is that you can save the code and revisit it later whenever you need a reference.

    I believe this task could be useful for:

    • Anyone just starting with pandas

    • Learners who want a structured challenge to test and refresh their skills

    • People looking for a practice task they can build on, enhance, or adapt

    📌 The full task is available in the pinned notebook here:

    👉 Link to Notebook: https://www.kaggle.com/code/seifhafez/pandas-exercise/edit

    💡 Notes:

    • The task may contain non-beginner-friendly questions, so don’t worry if they take some time.

    • I plan to provide solutions/answers when I have free time to write them down.

    • If anyone from the community shares model answers, I’ll be very grateful. I will gladly give credit and mention those contributions so others can benefit from them too.

    • You are welcome to design new tasks or variations using this dataset or notebook, as long as credit is kept to the CATReloaded team.

    📖 To explore more of what we do, check out the CATReloaded Roadmap on GitHub.


  4. Panda Code 2

    • kaggle.com
    zip
    Updated Jul 25, 2020
    Cite
    Jay (2020). Panda Code 2 [Dataset]. https://www.kaggle.com/jakobw/panda-code-2
    Explore at:
    Available download formats: zip (5641817960 bytes)
    Dataset updated
    Jul 25, 2020
    Authors
    Jay
    Description

    Dataset

    This dataset was created by Jay

    Released under Data files © Original Authors


  5. Python Recipes

    • kaggle.com
    zip
    Updated Aug 1, 2022
    Cite
    Olga Belitskaya (2022). Python Recipes [Dataset]. https://www.kaggle.com/datasets/olgabelitskaya/python-recipes/code
    Explore at:
    Available download formats: zip (72646 bytes)
    Dataset updated
    Aug 1, 2022
    Authors
    Olga Belitskaya
    Description

    Context

    The main idea is to create collections with standard code recipes.

    Content

    Files with the .py (and similar) formats.

    Acknowledgments

    Many thanks for the user comments.

    Inspiration

    Could this data be a time saver in data processing?

    Russian Notes: Python Data Analysis

    Topic 2

    2.1 Introduction to the "Data Analyst" profession

    2.2 Introduction to programming in Python

    2.3 Syntax of the Python programming language

    2.4 Data types in Python, Part 1

    2.5 Data types in Python, Part 2

    2.6 Python Standard Library

    2.7 Converting data types in Python

    2.8 Python: input and output

    2.9 The if control structure and ternary operators

    2.10 Control structures: loops

    2.11 Control structures: exceptions

    2.12 Strings and string-processing methods

    2.13 Lists. Tuples

    2.14 Sets and dictionaries

    2.15 Combining sequence types

    2.16 - 2.18 In development

    2.19 Functions and iterators

    2.20 The sorted(), map(), filter(), reduce() functions

    2.21 - 2.24 In development

    2.25 Object-oriented programming (OOP)

    2.26 Exercises in object-oriented programming

    2.27 In development

    2.28 Python NumPy, Part 1

    2.29 Python NumPy, Part 2

    2.30 Python NumPy, Part 3

    2.31 Pandas: data types and structures

    2.32 Pandas: basic operations

    2.33 Pandas: data transformation

    2.34 - 2.36 In development

    2.37 Python SciPy

    2.38 Python Matplotlib, Part 1: Adjusting parameters

    2.39 Python Matplotlib, Part 2: Composing plots

    2.40 Python Matplotlib, Part 3: Graphic design

    2.41, 2.42 In development

    2.43 Graphics: an overview of Python and other tools, Part 1

    2.44 Graphics: an overview of Python and other tools, Part 2

    Topic 3

    3.1 Stages of data analysis

    3.2 Data types

    3.3 Measurement scales in analytics

    Topic 4

    4.1 Exploratory data analysis

    [4.2...

  6. panda 40c

    • kaggle.com
    zip
    Updated Jul 22, 2020
    Cite
    Dmitry A. Grechka (2020). panda 40c [Dataset]. https://www.kaggle.com/dgrechka/panda-40c
    Explore at:
    Available download formats: zip (83316746 bytes)
    Dataset updated
    Jul 22, 2020
    Authors
    Dmitry A. Grechka
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Dmitry A. Grechka

    Released under Attribution 3.0 Unported (CC BY 3.0)


  7. injection-molding-QA

    • kaggle.com
    • huggingface.co
    zip
    Updated Apr 2, 2024
    Cite
    Mustafa Keser (2024). injection-molding-QA [Dataset]. https://www.kaggle.com/datasets/mustafakeser4/injection-molding-qa/data
    Explore at:
    Available download formats: zip (2998736 bytes)
    Dataset updated
    Apr 2, 2024
    Authors
    Mustafa Keser
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    injection-molding-QA

    Description

    This dataset contains questions and answers related to injection molding, focusing on topics such as Materials, Techniques, Machinery, Troubleshooting, Safety, Design, Maintenance, Manufacturing, Development, and R&D. The dataset is provided in CSV format with two columns: Questions and Answers.

    Usage

    Researchers, practitioners, and enthusiasts in the field of injection molding can utilize this dataset for tasks such as:

    • Natural Language Processing (NLP) tasks such as question answering, text generation, and summarization.
    • Training and evaluation of machine learning models for understanding and generating responses related to injection molding.

    Example pandas

    import pandas as pd
    
    # Load the dataset
    dataset = pd.read_csv('injection_molds_dataset.csv')
    
    # Display the first few rows
    print(dataset.head())
    

    Example datasets

    from datasets import load_dataset
    
    # Load the dataset
    dataset = load_dataset("mustafakeser/injection-molding-QA")
    
    # Display dataset info
    print(dataset)
    
    # Accessing the first few examples
    print(dataset['train'][:5])
    # or
    dataset['train'].to_pandas()
    
    

    Columns

    1. Questions: Contains questions related to injection molding.
    2. Answers: Provides detailed answers to the corresponding questions.

    Citation

    If you use this dataset in your work, please consider citing it as:

    @misc{injectionmold_dataset,
     author = {Your Name},
     title = {Injection Molds Dataset},
     year = {2024},
     publisher = {Hugging Face},
     journal = {Hugging Face Datasets},
     howpublished = {\url{link to the dataset}},
    }
    

    Huggingface

    mustafakeser/injection-molding-QA: https://huggingface.co/datasets/mustafakeser/injection-molding-QA

    Notes

    This dataset was curated with gemini-1.0-pro.

    license: apache-2.0

  8. 8 Nations' YouTube Data Videos Trends: 3200

    • kaggle.com
    zip
    Updated May 9, 2025
    Cite
    suat selvi (2025). 8 Nations' YouTube Data Videos Trends: 3200 [Dataset]. https://www.kaggle.com/datasets/suatselvi/8-nations-youtube-data-videos-trends-3200
    Explore at:
    Available download formats: zip (484381 bytes)
    Dataset updated
    May 9, 2025
    Authors
    suat selvi
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    YouTube
    Description

    YouTube data analysis course videos from 8 countries

    Overview

    This dataset includes ~3200 YouTube videos focused on data analysis from 8 countries (~400 videos per country). Featuring data from Turkey, USA, Russia, Italy, France, Germany, Japan, and Spain, each video provides 8 key features. Ideal for global data science trend analysis!

    Content

    • Countries: Turkey (TR), USA (US), Russia (RU), Italy (IT), France (FR), Germany (DE), Japan (JP), Spain (ES)
    • Features (8 columns, example):
      • title: Video title
      • views_count: Total views
      • comment_count: Total comments
      • likes_count: Total likes
      • dislike_count: Total dislikes
    • Additional Features (if added):
      • country_code: Country code (e.g., TR, US)
      • country_name: Full country name
      • like_view_ratio: Likes-to-views ratio
    • Size: ~3200 rows, 8+ columns
    • Format: CSV files
      • all_countries.csv: Combined dataset
      • Country-specific files (e.g., TR_videos.csv, US_videos.csv)

    Potential Use Cases

    • Compare engagement (views, likes) across Turkey, USA, Russia, and other nations.
    • Analyze trending data science topics using tags from different countries.
    • Study how publish_date impacts video popularity in each region.
    • Visualize country-specific trends with Seaborn or Matplotlib.

    Data Preparation

    • Cleaning: Missing values in likes and views filled with median/zero. NaN in tags set to "Unknown".
    • Standardization: publish_date formatted as YYYY-MM-DD.
    • Structure: Includes individual country files and a combined all_countries.csv.
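
    A minimal sketch of the cleaning steps above, assuming the column names listed under Content (likes_count, views_count, tags, publish_date):

    import pandas as pd

    df = pd.read_csv('all_countries.csv')

    # Fill missing engagement metrics: median for likes, zero for views
    df['likes_count'] = df['likes_count'].fillna(df['likes_count'].median())
    df['views_count'] = df['views_count'].fillna(0)

    # Replace missing tags with "Unknown"
    df['tags'] = df['tags'].fillna('Unknown')

    # Standardize publish_date as YYYY-MM-DD
    df['publish_date'] = pd.to_datetime(df['publish_date']).dt.strftime('%Y-%m-%d')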

    Notes

    • Data collected from the YouTube API.
    • Some metrics may be incomplete due to API limitations.
    • Feedback and suggestions are welcome!

    Get Started

    Load the dataset with Pandas:

    import pandas as pd

    df = pd.read_csv('all_countries.csv')

    # Example: top 5 videos by views (column name as listed under Content above)
    print(df.sort_values('views_count', ascending=False).head())

  9. panda 37c

    • kaggle.com
    zip
    Updated Jul 21, 2020
    Cite
    Dmitry A. Grechka (2020). panda 37c [Dataset]. https://www.kaggle.com/datasets/dgrechka/panda-37c
    Explore at:
    Available download formats: zip (83370085 bytes)
    Dataset updated
    Jul 21, 2020
    Authors
    Dmitry A. Grechka
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Dmitry A. Grechka

    Released under Attribution 3.0 Unported (CC BY 3.0)


  10. PANDA init class model1

    • kaggle.com
    zip
    Updated Jul 24, 2020
    Cite
    Iafoss (2020). PANDA init class model1 [Dataset]. https://www.kaggle.com/datasets/iafoss/panda-init-class-model1
    Explore at:
    Available download formats: zip (8248473039 bytes)
    Dataset updated
    Jul 24, 2020
    Authors
    Iafoss
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Iafoss

    Released under Attribution 4.0 International (CC BY 4.0)


  11. parquetfile-python-25k

    • kaggle.com
    zip
    Updated Feb 25, 2024
    Cite
    JS (2024). parquetfile-python-25k [Dataset]. https://www.kaggle.com/datasets/jayshah1234/parquetfile-python-25k
    Explore at:
    Available download formats: zip (20082586 bytes)
    Dataset updated
    Feb 25, 2024
    Authors
    JS
    Description

    Go to Hugging Face, search for flytech/python-codes-25k, download the Parquet file, upload it as a dataset on Kaggle, and load it with pandas, as sketched below.
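
    A hedged sketch of that workflow once the file is attached to a Kaggle notebook (the input path and file name below are assumptions; adjust them to the actual upload):

    import pandas as pd

    # Read the Parquet file with pandas (requires pyarrow or fastparquet)
    df = pd.read_parquet('/kaggle/input/parquetfile-python-25k/python-codes-25k.parquet')
    print(df.head())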

  12. Mlcourse.ai-2020

    • kaggle.com
    zip
    Updated Oct 14, 2020
    Cite
    anas qais (2020). Mlcourse.ai-2020 [Dataset]. https://www.kaggle.com/anasqais/mlcourseai2020
    Explore at:
    Available download formats: zip (15881 bytes)
    Dataset updated
    Oct 14, 2020
    Authors
    anas qais
    Description

    Open Machine Learning Course. mlcourse.ai is designed to perfectly balance theory and practice; therefore, each topic is followed by an assignment with a deadline in a week. You can also take part in several Kaggle Inclass competitions held during the course and write your own tutorials. The next session launches in September 2019. For more info, go to the mlcourse.ai main page.

    Outline

    This is the list of published articles on medium.com (English), habr.com (Russian), and jqr.com (Chinese). See Kernels of this Dataset for the same material in English.

    1. Exploratory Data Analysis with Pandas (uk, ru, cn; Kaggle Kernel)
    2. Visual Data Analysis with Python (uk, ru, cn; Kaggle Kernels: part1, part2)
    3. Classification, Decision Trees and k Nearest Neighbors (uk, ru, cn; Kaggle Kernel)
    4. Linear Classification and Regression (uk, ru, cn; Kaggle Kernels: part1, part2, part3, part4, part5)
    5. Bagging and Random Forest (uk, ru, cn; Kaggle Kernels: part1, part2, part3)
    6. Feature Engineering and Feature Selection (uk, ru, cn; Kaggle Kernel)
    7. Unsupervised Learning: Principal Component Analysis and Clustering (uk, ru, cn; Kaggle Kernel)
    8. Vowpal Wabbit: Learning with Gigabytes of Data (uk, ru, cn; Kaggle Kernel)
    9. Time Series Analysis with Python, part 1 (uk, ru, cn); Predicting future with Facebook Prophet, part 2 (uk, cn); Kaggle Kernels: part1, part2
    10. Gradient Boosting (uk, ru, cn; Kaggle Kernel)

    Assignments

    Each topic is followed by an assignment. See demo versions in this Dataset. Solutions will be discussed in the upcoming run of the course.

    Kaggle competitions

    1. Catch Me If You Can: Intruder Detection through Webpage Session Tracking (Kaggle Inclass)
    2. How good is your Medium article? (Kaggle Inclass)

    Rating

    Throughout the course we maintain a student rating. It takes into account credits scored in assignments and Kaggle competitions. Top students (according to the final rating) will be listed on a special Wiki page.

    Community

    Discussions between students are held in the #mlcourse_ai channel of the OpenDataScience Slack team. A registration form will be shared prior to the start of the new session.

    Collaboration

    You can publish Kernels using this Dataset, but please respect others' interests: don't share solutions to assignments or well-performing solutions for the Kaggle Inclass competitions. If you notice any typos/errors in the course material, please open an Issue or make a pull request in the course repo.

    The course is free, but you can support the organizers by making a pledge on Patreon (monthly support) or a one-time payment on Ko-fi.

  13. Toys Images

    • kaggle.com
    zip
    Updated Nov 6, 2025
    Cite
    Mohamed Hisham Abdelzaher (2025). Toys Images [Dataset]. https://www.kaggle.com/datasets/mh0386/toys-images
    Explore at:
    Available download formats: zip (53485433 bytes)
    Dataset updated
    Nov 6, 2025
    Authors
    Mohamed Hisham Abdelzaher
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset comprises a diverse collection of images featuring two classes of toys: 105 images of panda toys and 150 images of rabbit toys. It offers versatility for researchers and developers interested in creating AI models capable of generating realistic and novel toy-related images. It includes labelled categories for ease of classification and can be a valuable resource for advancing generative AI in playful and imaginative content creation, as well as for classification between the panda and rabbit classes.

  14. Israel-Palestine Conflict Tweets Dataset

    • kaggle.com
    zip
    Updated Jan 1, 2024
    Cite
    MehyarMlaweh (2024). Israel-Palestine Conflict Tweets Dataset [Dataset]. https://www.kaggle.com/datasets/mehyarmlaweh/israel-palestine-conflict-tweets-dataset
    Explore at:
    Available download formats: zip (2016138 bytes)
    Dataset updated
    Jan 1, 2024
    Authors
    MehyarMlaweh
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    Israel
    Description

    This dataset contains tweets related to the Israel-Palestine conflict from October 17, 2023, to December 17, 2023. It includes information on tweet IDs, links, text, date, likes, and comments, categorized into different ranges of like counts.

    Dataset Details

    • Date Range: October 17, 2023 - December 17, 2023
    • Total Tweets: 15,478
    • Unique Tweets: 14,854

    Data Description

    The dataset consists of the following columns:

    Column | Description
    id | Unique identifier for the tweet
    link | URL link to the tweet
    text | Text content of the tweet
    date | Date and time when the tweet was posted
    likes | Number of likes the tweet received
    comments | Number of comments the tweet received
    Label | Like count range category
    Count | Number of tweets in the like count range category

    How to Process the Data

    To process the dataset, you can use the following Python code. This code reads the CSV file, cleans the tweets, tokenizes and lemmatizes the text, and filters out non-English tweets.

    Required Libraries

    Make sure you have the following libraries installed:

    pip install pandas nltk langdetect
    

    Data Processing Code

    Here’s the code to process the tweets:

    import re

    import pandas as pd
    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from langdetect import detect, LangDetectException

    # NLTK resources needed by the tokenizer, stopword list, and lemmatizer
    nltk.download('punkt')
    nltk.download('stopwords')
    nltk.download('wordnet')

    # Define the TweetProcessor class
    class TweetProcessor:
        def __init__(self, file_path):
            """
            Initialize the object with the path to the CSV file.
            """
            self.df = pd.read_csv(file_path)
            # Convert 'text' column to string type
            self.df['text'] = self.df['text'].astype(str)

        def clean_tweet(self, tweet):
            """
            Clean a tweet by removing links, special characters, and extra spaces.
            """
            # Remove links
            tweet = re.sub(r'http\S+', '', tweet, flags=re.MULTILINE)
            # Remove special characters and numbers
            tweet = re.sub(r'\W', ' ', tweet)
            # Replace multiple spaces with a single space
            tweet = re.sub(r'\s+', ' ', tweet)
            # Remove leading and trailing spaces
            tweet = tweet.strip()
            return tweet

        def tokenize_and_lemmatize(self, tweet):
            """
            Tokenize and lemmatize a tweet by converting to lowercase, removing stopwords, and lemmatizing.
            """
            # Tokenize the text
            tokens = word_tokenize(tweet)
            # Keep alphabetic tokens only, converted to lowercase
            tokens = [word.lower() for word in tokens if word.isalpha()]
            # Remove stopwords
            stop_words = set(stopwords.words('english'))
            tokens = [word for word in tokens if word not in stop_words]
            # Lemmatize the tokens
            lemmatizer = WordNetLemmatizer()
            tokens = [lemmatizer.lemmatize(word) for word in tokens]
            # Join tokens back into a single string
            return ' '.join(tokens)

        def process_tweets(self):
            """
            Apply cleaning and lemmatization functions to the tweets in the DataFrame.
            """
            def is_english(x):
                try:
                    return detect(x) == 'en'
                except LangDetectException:
                    return False

            # Filter tweets for English language
            self.df = self.df[self.df['text'].apply(is_english)]
            # Apply cleaning function
            self.df['cleaned_text'] = self.df['text'].apply(self.clean_tweet)
            # Apply tokenization and lemmatization function
            self.df['tokenized_and_lemmatized'] = self.df['cleaned_text'].apply(self.tokenize_and_lemmatize)

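    A possible usage sketch (the csv file name below is hypothetical; substitute the actual file from this dataset):

    # Process the tweets and inspect the results
    processor = TweetProcessor('israel_palestine_tweets.csv')
    processor.process_tweets()
    print(processor.df[['cleaned_text', 'tokenized_and_lemmatized']].head())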

    Usage

    This dataset can be used for various research purposes, including sentiment analysis, trend analysis, and event impact studies related to the Israel-Palestine conflict. For questions or feedback, please contact:

    • Name: Mehyar Mlaweh
    • Email: mehyarmlaweh0@gmail.com
  15. Cora with Semi-Supervised

    • kaggle.com
    zip
    Updated Feb 11, 2024
    Cite
    cyz020403 (2024). Cora with Semi-Supervised [Dataset]. https://www.kaggle.com/datasets/cyz020403/corasupervised
    Explore at:
    Available download formats: zip (195584 bytes)
    Dataset updated
    Feb 11, 2024
    Authors
    cyz020403
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Cora is a widely used node classification dataset. What you are seeing now is the processed version from PyG. Its source file comes from the following paper:

    Revisiting Semi-Supervised Learning with Graph Embeddings. Zhilin Yang, William W. Cohen, Ruslan Salakhutdinov. ICML 2016.

    Please cite the above paper if this dataset is useful to you.

    Statistical data

    Name | #nodes | #edges | #features | #classes
    Cora | 2708 | 10556 | 1433 | 7

    For further description of the data please refer to the 'File Description' section below.

    Processing

    This dataset can be downloaded directly from PyG. For the needs of Kaggle evaluation, I simply processed it.

    You can run the following code to get the same .csv file:

    import os

    import pandas as pd
    import torch
    from torch_geometric.datasets import Planetoid

    # Make sure the output directory for the generated .csv files exists
    os.makedirs('./data', exist_ok=True)

    # Download/load Cora via PyG
    dataset = Planetoid('./', 'Cora')
    data = dataset[0]
    
    x = data.x
    y = data.y
    edge_index = data.edge_index
    train_mask = data.train_mask
    val_mask = data.val_mask
    test_mask = data.test_mask
    
    y_train = y[train_mask]
    y_val = y[val_mask]
    y_test = y[test_mask]
    
    train_index = torch.arange(0, 140)
    val_index = torch.arange(140, 640)
    test_index = torch.arange(1708, 2708)
    
    y_train = torch.cat((train_index.reshape(-1, 1), y_train.reshape(-1, 1)), dim=1)
    y_val = torch.cat((val_index.reshape(-1, 1), y_val.reshape(-1, 1)), dim=1)
    y_test = torch.cat((test_index.reshape(-1, 1), y_test.reshape(-1, 1)), dim=1)
    
    x_df = pd.DataFrame(x.numpy())
    x_header = ['x' + str(i) for i in range(x_df.shape[1])]
    x_df.to_csv('./data/x.csv', index=False, header=x_header)
    
    edge_index_df = pd.DataFrame(edge_index.t().numpy())
    edge_index_header = ['source', 'target']
    edge_index_df.to_csv('./data/edge_index.csv', index=False, header=edge_index_header)
    
    y_header = ['index', 'label']
    y_train_df = pd.DataFrame(y_train.numpy())
    y_train_df.to_csv('./data/y_train.csv', index=False, header=y_header)
    
    y_val_df = pd.DataFrame(y_val.numpy())
    y_val_df.to_csv('./data/y_val.csv', index=False, header=y_header)
    
    y_test_df = pd.DataFrame(y_test.numpy())
    y_test_df.to_csv('./data/y_test.csv', index=False, header=y_header)
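
    To load the generated files back (same ./data layout as above):

    import pandas as pd

    x = pd.read_csv('./data/x.csv')
    edge_index = pd.read_csv('./data/edge_index.csv')
    y_train = pd.read_csv('./data/y_train.csv')
    print(x.shape, edge_index.shape, y_train.shape)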
    

  16. 🏃🏻‍♂️ Long-distance running dataset

    • kaggle.com
    zip
    Updated Mar 7, 2024
    Cite
    mexwell (2024). 🏃🏻‍♂️ Long-distance running dataset [Dataset]. https://www.kaggle.com/datasets/mexwell/long-distance-running-dataset
    Explore at:
    Available download formats: zip (393989255 bytes)
    Dataset updated
    Mar 7, 2024
    Authors
    mexwell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About

    This dataset contains 10,703,690 records of running training during 2019 and 2020, from 36,412 athletes from around the world. The records were obtained through web scraping of a large social network for athletes on the internet.

    The athletes' activity data are contained in dataframe objects (tabular data) saved in the Parquet file format using the Pandas library, part of the Python ecosystem for data science. Each Pandas dataframe contains the following data, with one column per field and one row per athlete:

    • datetime: date of the running activity
    • athlete: a computer-generated ID for the athlete (integer)
    • distance: distance of running (floating-point number, in kilometers)
    • duration: duration of running (floating-point number, in minutes)
    • gender: gender (string 'M' or 'F')
    • age_group: age interval (one of the strings '18 - 34', '35 - 54', or '55 +')
    • country: country of origin of the athlete (string)
    • major: marathon(s) and year(s) the athlete ran (comma-separated list of strings)

    For convenience, we created files with the athletes' activity data sampled at different frequencies: day 'd', week 'w', month 'm', and quarter 'q' (i.e., there are files with the distance and duration of running accumulated at each day, week, month, and quarter) for each year, 2019 and 2020. Accordingly, the files are named 'run_ww_yyyy_f.parquet', where 'yyyy' is '2019' or '2020' and 'f' is 'd', 'w', 'm' or 'q' (without quotes). The dataset also contains data on different governments' stringency indexes for the COVID-19 pandemic. These data are saved as text files and were obtained from https://ourworldindata.org/covid-stringency-index.
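
    A short loading sketch based on the naming scheme above (assuming pyarrow or fastparquet is installed for Parquet support):

    import pandas as pd

    # Weekly-accumulated activities for 2019 ('w' = week frequency)
    df = pd.read_parquet('run_ww_2019_w.parquet')
    print(df.head())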

    Acknowledgement

    Photo by sporlab on Unsplash

  17. MMM Weekly Data - Geo:India

    • kaggle.com
    zip
    Updated Jul 18, 2025
    Cite
    SubhagatoAdak (2025). MMM Weekly Data - Geo:India [Dataset]. https://www.kaggle.com/datasets/subhagatoadak/mmm-weekly-data-geoindia
    Explore at:
    Available download formats: zip (2463044 bytes)
    Dataset updated
    Jul 18, 2025
    Authors
    SubhagatoAdak
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    India
    Description

    Synthetic India FMCG MMM Dataset (Weekly, 3 Years, Multi-Geo / Multi-Channel)

    Subtitle: 3-Year Weekly Multi-Channel FMCG Marketing Mix Panel for India
    Grain: Week-ending Saturday × Geography × Brand × SKU
    Span: 156 weeks (2 Jul 2022 – 27 Jun 2025)
    Scope: 8 Indian geographies • 3 brands × 3 SKUs each (9 SKUs) • Full marketing, trade, price, distribution & macro controls • AI creative quality scores for digital banners.

    This dataset is synthetic but behaviorally realistic, generated to help analysts experiment with Marketing Mix Modeling (MMM), media effectiveness, price/promo analytics, distribution effects, and hierarchical causal inference without using proprietary commercial data.

    Why This Dataset?

    Real MMM training data is rarely public due to confidentiality. This synthetic panel:

    • Mirrors common FMCG (CPG) category dynamics in India (festive spikes, monsoon effects, geo scale differences).
    • Includes paid media channels (TV, YouTube, Facebook, Instagram, Print, Radio).
    • Captures promotions & trade levers (feature, display, temporary price reduction, trade spend).
    • Provides distribution & availability metrics (Weighted Distribution, Numeric Distribution, TDP, NOS).
    • Includes pricing (MRP, Net Price under TPR).
    • Adds macro signals (CPI, GDP, Festival Index, Rainfall Index) aligned to India’s seasonality.
    • Introduces AI Content Scores (Facebook & Instagram banner creative quality) — letting you explore creative × media interaction models.
    • Delivered at a granular panel (Geo × Brand × SKU) suitable for pooled, hierarchical, or Bayesian MMM workflows.

    Files

    File | Description
    synthetic_mmm_weekly_india_SAT.csv | Main dataset. 11,232 rows × 28 columns. Weekly (week-ending Saturday).


    Quick Start

    import pandas as pd
    
    df = pd.read_csv("/kaggle/input/synthetic-india-fmcg-mmm/synthetic_mmm_weekly_india_SAT.csv",
             parse_dates=["Week"])
    
    df.info()
    df.head()
    

    Aggregate to Geo-Brand Weekly

    geo_brand = (
      df.groupby(["Week","Geo","Brand"], as_index=False)
       .sum(numeric_only=True)
    )
    

    Create Modeling-Friendly Features

    Example: log-transform sales value, normalize media, build price index.

    import numpy as np
    
    m = geo_brand.copy()
    m["log_sales_val"] = np.log1p(m["Sales_Value"])
    m["price_index"] = m["Net_Price"] / m.groupby(["Geo","Brand"])["Net_Price"].transform("mean")
    

    Calendar Notes

    • Week variable = week-ending Saturday (Pandas freq W-SAT).
    • First week: 2022-07-02; last week: 2025-06-27 (depending on 156-week span anchor).
    • To derive a week-start (Sunday) date:

      df["Week_Start"] = df["Week"] - pd.Timedelta(days=6)
      

    Data Dictionary

    Key Dimensions

    Column | Type | Description
    Week | date | Week-ending Saturday timestamp.
    Geo | categorical | 8 rollups: NORTH, SOUTH, EAST, WEST, CENTRAL, NORTHEAST, METRO_DELHI, METRO_MUMBAI.
    Brand | categorical | BrandA / BrandB / BrandC.
    SKU | categorical | Brand-level SKU IDs (3 per brand).

    Commercial Outcomes

    Column | Type | Notes
    Sales_Units | float | Modeled weekly unit sales after macro, distribution, price, promo & media effects. Lognormal noise added.
    Sales_Value | float | Sales_Units × Net_Price. Use for revenue MMM or ROI analyses.

    Pricing

    Column | Type | Notes
    MRP | float | Baseline list price (per-unit). Drifts with CPI & brand positioning.
    Net_Price | float | Effective real...
  18. Stone Classification

    • kaggle.com
    zip
    Updated Mar 18, 2025
    Cite
    Khadgar (2025). Stone Classification [Dataset]. https://www.kaggle.com/datasets/claydonwang/stone-classification
    Explore at:
    Available download formats: zip (69490 bytes)
    Dataset updated
    Mar 18, 2025
    Authors
    Khadgar
    Description

    Outline

    This dataset is used in the final project of STA325 at SUSTech.

    How to Generate submission.csv from test_loader

    1. Define the Prediction Function

    Use the following function to extract predictions from test_loader:

    import os

    import torch
    from tqdm import tqdm

    def predict(model, loader, device):
        model.eval()  # Set the model to evaluation mode
        predictions = []  # Store predicted classes
        image_ids = []  # Store image filenames

        with torch.no_grad():  # Disable gradient computation
            for images, img_paths in tqdm(loader, desc="Predicting on test set"):
                images = images.to(device)  # Move images to the specified device
                outputs = model(images)  # Forward pass to get model outputs
                _, predicted = torch.max(outputs, 1)  # Get predicted classes

                # Collect predictions and image IDs
                predictions.extend(predicted.cpu().numpy())
                image_ids.extend([os.path.basename(path) for path in img_paths])

        return image_ids, predictions

    2. Run Predictions

    Call the prediction function with the trained model, test_loader, and device:

    image_ids, predictions = predict(model, test_loader, device)

    3. Create the Submission File

    import pandas as pd
    import os
    
    # Create DataFrame
    submission_df = pd.DataFrame({
      "id": image_ids,  # Image filenames
      "label": predictions # Predicted classes
    })
    
    # Save to the specified path
    OUTPUT_DIR = "logs"
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    submission_path = os.path.join(OUTPUT_DIR, "submission.csv")
    submission_df.to_csv(submission_path, index=False)
    print(f"Kaggle submission file saved to {submission_path}")
    

    Output Description

    • submission.csv Format:
      The file contains two columns:
    • id: Filenames of test images (without paths, e.g., image1.jpg).
    • label: Predicted class indices (e.g., 0, 1, 2, depending on the number of classes).

    • Example content:

      id,label
      000001.jpg,0
      000002.jpg,1
      000003.jpg,2

    Then submit the submission.csv to Kaggle.

  19. Vezora/Tested-188k-Python-Alpaca: Functional

    • kaggle.com
    zip
    Updated Nov 30, 2023
    Cite
    The Devastator (2023). Vezora/Tested-188k-Python-Alpaca: Functional [Dataset]. https://www.kaggle.com/datasets/thedevastator/vezora-tested-188k-python-alpaca-functional-pyth/discussion
    Explore at:
    Available download formats: zip (12200606 bytes)
    Dataset updated
    Nov 30, 2023
    Authors
    The Devastator
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Vezora/Tested-188k-Python-Alpaca: Functional Python Code Dataset

    188k Functional Python Code Samples

    By Vezora (From Huggingface) [source]

    About this dataset

    The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, specifically designed for training and analysis purposes. With 188,000 samples, this dataset offers an extensive range of examples that cater to the research needs of Python programming enthusiasts.

    This valuable resource consists of various columns, including input, which represents the input or parameters required for executing the Python code sample. The instruction column describes the task or objective that the Python code sample aims to solve. Additionally, there is an output column that showcases the resulting output generated by running the respective Python code.

    By utilizing this dataset, researchers can effectively study and analyze real-world scenarios and applications of Python programming. Whether for educational purposes or development projects, this dataset serves as a reliable reference for individuals seeking practical examples and solutions using Python.

    How to use the dataset

    The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, containing 188,000 samples in total. This dataset can be a valuable resource for researchers and programmers interested in exploring various aspects of Python programming.

    Contents of the Dataset

    The dataset consists of several columns:

    • output: This column represents the expected output or result that is obtained when executing the corresponding Python code sample.
    • instruction: It provides information about the task or instruction that each Python code sample is intended to solve.
    • input: The input parameters or values required to execute each Python code sample.

    Exploring the Dataset

    To make effective use of this dataset, it is essential to understand its structure and content properly. Here are some steps you can follow:

    • Importing Data: Load the dataset into your preferred environment for data analysis using appropriate tools like pandas in Python.
    import pandas as pd
    
    # Load the dataset
    df = pd.read_csv('train.csv')
    
    • Understanding Column Names: Familiarize yourself with the column names and their meanings by referring to the provided description.
    # Display column names
    print(df.columns)
    
    • Sample Exploration: Get an initial understanding of the data structure by examining a few random samples from different columns.
    # Display random samples from 'output' column
    print(df['output'].sample(5))
    
    • Analyzing Instructions: Analyze different instructions or tasks present in the 'instruction' column to identify specific areas you are interested in studying or learning about.
    # Count unique instructions and display top ones with highest occurrences
    instruction_counts = df['instruction'].value_counts()
    print(instruction_counts.head(10))
    

    Potential Use Cases

    The Vezora/Tested-188k-Python-Alpaca dataset can be utilized in various ways:

    • Code Analysis: Analyze the code samples to understand common programming patterns and best practices.
    • Code Debugging: Use code samples with known outputs to test and debug your own Python programs.
    • Educational Purposes: Utilize the dataset as a teaching tool for Python programming classes or tutorials.
    • Machine Learning Applications: Train machine learning models to predict outputs based on given inputs.

    Remember that this dataset provides a plethora of diverse Python coding examples, allowing you to explore different

    Research Ideas

    • Code analysis: Researchers and developers can use this dataset to analyze various Python code samples and identify patterns, best practices, and common mistakes. This can help in improving code quality and optimizing performance.
    • Language understanding: Natural language processing techniques can be applied to the instruction column of this dataset to develop models that can understand and interpret natural language instructions for programming tasks.
    • Code generation: The input column of this dataset contains the required inputs for executing each Python code sample. Researchers can build models that generate Python code based on specific inputs or task requirements using the examples provided in this dataset. This can be useful in automating repetitive programming tasks o...
  20. FBI Hate Crimes in USA (1991-2020)

    • kaggle.com
    zip
    Updated Dec 9, 2021
    Cite
    Jonathan (2021). FBI Hate Crimes in USA (1991-2020) [Dataset]. https://www.kaggle.com/jonathanrevere/fbi-hate-crimes-in-usa-19912020
    Explore at:
    Available download formats: zip (8929139 bytes)
    Dataset updated
    Dec 9, 2021
    Authors
    Jonathan
    License

    U.S. Government Works: https://www.usa.gov/government-works/

    Area covered
    United States
    Description

    Background

    I recently finished the offered courses in Python and Pandas and wanted to practice sorting, creating dataframes, and grouping. I decided to use the hate crime data offered by the FBI. To practice, I preemptively split the full csv file into one file per state and territory, for ease of use by anyone who wants to access their state's data right away. It also provided good coding practice.

    Content

    These datasets contain the date of the crime, what kind of crime it was, the offender's race, the victim's race, victim counts (whether the victim was a minor or adult), what state and city the crime occurred in, and so on.

    Also included is the methodology file so that you can see more context of the data itself and how it was collected.

    Acknowledgements

    I thought this would be a totally new dataset that had yet to be uploaded to kaggle, but I did notice another dataset here that hasn't been updated in 2 years. I would like to thank that author, since their work helped me structure how to write this out 😃.

    Further credit to the FBI for collecting this data which can be found here.

    And of course thanks to kaggle for the free courses.

    Inspiration

    You can use this dataset to answer several questions, such as which years (or decades) had the highest concentration of hate crimes. You can also use the full csv file to organize the data by region for a similar question. If you want to concentrate on your own state, that is also doable: just download the appropriate table. You can then find which areas in your state had the most hate crimes.

    You can also figure out what's the most common hate crime victim over a specific timeframe.
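
    As a rough sketch of the year-grouping idea (the INCIDENT_DATE column name below is an assumption; check the csv or the methodology file for the actual name):

    import pandas as pd

    hate_crime = pd.read_csv(filepath)  # filepath: path to the full csv

    # Count incidents per year and list the years with the most hate crimes
    hate_crime['year'] = pd.to_datetime(hate_crime['INCIDENT_DATE']).dt.year
    print(hate_crime.groupby('year').size().sort_values(ascending=False).head())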

    Code I used

    (Any feedback is appreciated!)

    # import relevant packages
    import pandas as pd

    # load dataset; filepath is the path to the full csv
    hate_crime = pd.read_csv(filepath)

    # list of states
    states = ['AL','AK','AZ','AR','CA','CO','CT','DC','DE','FL','FS','GA','GM','HI','IA','ID',
              'IL','IN','KS','KY','LA','MD','ME','MI','MN','MO','MS','MT','NB','NC','ND','NH',
              'NJ','NM','NV','NY','OH','OK','OR','PA','RI','SC','SD','TN','TX','UT','VA',
              'VT','WA','WI','WV','WY']

    # create DataFrames for individual states
    def create_DataFrame(State_Abbr):
        '''
        Parameters
        ----------
        State_Abbr : str
            Two-letter state abbreviation entered by the user.

        Returns
        -------
        DataFrame of hate crimes in that state.
        '''
        # overall this step is unnecessary because I'm not making an executable or anything
        if type(State_Abbr) != str or len(State_Abbr) != 2 or State_Abbr not in states:
            print('Please enter the state abbreviation for the desired state')
        else:  # here's the useful bits ^_^
            hate_df = pd.DataFrame(hate_crime.loc[hate_crime.STATE_ABBR == State_Abbr])
            return hate_df

    def create_csv(state_lst):
        '''
        Parameters
        ----------
        state_lst : list
            List of state abbreviations; creates a separate csv file for each state.

        Returns
        -------
        A csv of hate crimes within each individual state.
        '''
        for state in state_lst:
            df = create_DataFrame(state)
            df.to_csv('Hate Crimes in {} 1991-2020.csv'.format(state))
        return
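
    A possible usage example of the helpers above:

    # Build one state's DataFrame, then write a csv for every state in the list
    ca_df = create_DataFrame('CA')
    create_csv(states)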
    