66 datasets found
  1. All Seaborn Built-in Datasets πŸ“Šβœ¨

    • kaggle.com
    zip
    Updated Aug 27, 2024
    Cite
    Abdelrahman Mohamed (2024). All Seaborn Built-in Datasets πŸ“Šβœ¨ [Dataset]. https://www.kaggle.com/datasets/abdoomoh/all-seaborn-built-in-datasets
    Explore at:
    Available download formats: zip (1383218 bytes)
    Dataset updated
    Aug 27, 2024
    Authors
    Abdelrahman Mohamed
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset includes all 22 built-in datasets from the Seaborn library, a widely used Python data visualization tool. Seaborn's built-in datasets are essential resources for anyone interested in practicing data analysis, visualization, and machine learning. They span a wide range of topics, from classic datasets like the Iris flower classification to real-world data such as Titanic survival records and diamond characteristics.

    • Included Datasets:
      • Anagrams: Analysis of word anagram patterns.
      • Anscombe: Anscombe's quartet demonstrating the importance of data visualization.
      • Attention: Data on attention span variations in different scenarios.
      • Brain Networks: Connectivity data within brain networks.
      • Car Crashes: US car crash statistics.
      • Diamonds: Data on diamond properties including price, cut, and clarity.
      • Dots: Randomly generated data for scatter plot visualization.
      • Dow Jones: Historical records of the Dow Jones Industrial Average.
      • Exercise: The relationship between exercise and health metrics.
      • Flights: Monthly passenger numbers on flights.
      • FMRI: Functional MRI data capturing brain activity.
      • Geyser: Eruption times of the Old Faithful geyser.
      • Glue: Strength of glue under different conditions.
      • Health Expenditure: Health expenditure statistics across countries.
      • Iris: Famous dataset for classifying Iris species.
      • MPG: Miles per gallon for various vehicles.
      • Penguins: Data on penguin species and their features.
      • Planets: Characteristics of discovered exoplanets.
      • Sea Ice: Measurements of sea ice extent.
      • Taxis: Taxi trips data in a city.
      • Tips: Tipping data collected from a restaurant.
      • Titanic: Survival data from the Titanic disaster.

    This complete collection serves as an excellent starting point for anyone looking to improve their data science skills, offering a wide array of datasets suitable for both beginners and advanced users.
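
    Any of these tables can be pulled directly into a pandas DataFrame with seaborn's own loader; a minimal example using the tips dataset from the list above:

    ```python
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Each built-in dataset loads by name as a pandas DataFrame.
    tips = sns.load_dataset("tips")
    print(tips.head())

    # A classic starter plot: tip amount vs. total bill, split by meal time.
    sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time")
    plt.show()
    ```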

  2. Python Seaborn Datas

    • kaggle.com
    zip
    Updated Feb 13, 2020
    Cite
    Muhammet Ikbal Elek (2020). Python Seaborn Datas [Dataset]. https://www.kaggle.com/mielek/python-seaborn-datas
    Explore at:
    Available download formats: zip (27575 bytes)
    Dataset updated
    Feb 13, 2020
    Authors
    Muhammet Ikbal Elek
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Muhammet Ikbal Elek

    Released under CC0: Public Domain


  3. practisingondatasets with seaborn python library

    • kaggle.com
    zip
    Updated Apr 3, 2020
    Cite
    Emrecan Γ–zkan (2020). practisingondatasets with seaborn python library [Dataset]. https://www.kaggle.com/emrecanozkan/practisingondatasets-with-seaborn-python-library
    Explore at:
    Available download formats: zip (998 bytes)
    Dataset updated
    Apr 3, 2020
    Authors
    Emrecan Γ–zkan
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is about the Seaborn Python library; I was just practising.

  4. Seaborn (Flights, Iris, Tips)

    • kaggle.com
    zip
    Updated Jan 3, 2024
    Cite
    Mohan Pradhan (2024). Seaborn (Flights, Iris, Tips) [Dataset]. https://www.kaggle.com/datasets/mohanpradhan42/seaborn-flights-iris-tips
    Explore at:
    Available download formats: zip (3639 bytes)
    Dataset updated
    Jan 3, 2024
    Authors
    Mohan Pradhan
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Mohan Pradhan

    Released under Apache 2.0


  5. python

    • huggingface.co
    Updated Feb 20, 2025
    Cite
    changamonika (2025). python [Dataset]. https://huggingface.co/datasets/changamonika/python
    Explore at:
    Dataset updated
    Feb 20, 2025
    Authors
    changamonika
    Description

    pip install numpy pandas scikit-learn matplotlib seaborn

    ```python
    import pandas as pd

    # Sample dataset: Age, Salary, and whether they purchased (1 = Yes, 0 = No)
    data = {
        'Age': [22, 25, 47, 52, 46, 56, 24, 27, 32, 37],
        'Salary': [20000, 25000, 50000, 60000, 58000, 70000, 22000, 27000, 32000, 37000],
        'Purchased': [0, 0, 1, 1, 1, 1, 0, 0, 1, 1]
    }
    df = pd.DataFrame(data)

    # Split dataset into Features (X) and Target (y)
    X = df[['Age', 'Salary']]  # Independent variables
    y = df['Purchased']  # …
    ```

    See the full description on the dataset page: https://huggingface.co/datasets/changamonika/python.
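
    The snippet breaks off at the train/test split. A minimal completion in the same spirit (the classifier choice and split parameters below are assumptions, not taken from the dataset page):

    ```python
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    # Hold out a test set, scale the features, and fit a simple classifier.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    scaler = StandardScaler()
    X_train_s = scaler.fit_transform(X_train)
    X_test_s = scaler.transform(X_test)

    clf = LogisticRegression().fit(X_train_s, y_train)
    print("Test accuracy:", clf.score(X_test_s, y_test))
    ```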

  6. Ecommerce Dataset for Data Analysis

    • kaggle.com
    zip
    Updated Sep 19, 2024
    Cite
    Shrishti Manja (2024). Ecommerce Dataset for Data Analysis [Dataset]. https://www.kaggle.com/datasets/shrishtimanja/ecommerce-dataset-for-data-analysis/code
    Explore at:
    Available download formats: zip (2028853 bytes)
    Dataset updated
    Sep 19, 2024
    Authors
    Shrishti Manja
    Description

    This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.

    About the Dataset:
    • CID (Customer ID): A unique identifier for each customer.
    • TID (Transaction ID): A unique identifier for each transaction.
    • Gender: The gender of the customer, categorized as Male or Female.
    • Age Group: Age group of the customer, divided into several ranges.
    • Purchase Date: The timestamp of when the transaction took place.
    • Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
    • Discount Availed: Indicates whether the customer availed any discount (Yes/No).
    • Discount Name: Name of the discount applied (e.g., FESTIVE50).
    • Discount Amount (INR): The amount of discount availed by the customer.
    • Gross Amount: The total amount before applying any discount.
    • Net Amount: The final amount after applying the discount.
    • Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
    • Location: The city where the purchase took place.

    Use Cases:
    1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data.
    2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis.
    3. Data Visualization: Use tools like Python's Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
    4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.

    This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.

    This is not a real dataset; it was generated using Python's Faker library for the sole purpose of learning.
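
    Because the data comes from Faker, the generation recipe is easy to approximate. A hedged sketch of how such rows could be produced (the real generation script is not published; field names follow the column list above, and the locale is a guess based on the INR amounts):

    ```python
    import random
    import pandas as pd
    from faker import Faker

    fake = Faker("en_IN")  # assumed locale, since discounts are in INR

    rows = []
    for _ in range(5):  # the real dataset has 55,000 rows
        gross = round(random.uniform(100.0, 10000.0), 2)
        discount = round(random.choice([0.0, 0.1, 0.5]) * gross, 2)
        rows.append({
            "CID": fake.uuid4(),
            "TID": fake.uuid4(),
            "Gender": random.choice(["Male", "Female"]),
            "Purchase Date": fake.date_time_this_year(),
            "Product Category": random.choice(["Electronics", "Apparel"]),
            "Discount Availed": "Yes" if discount else "No",
            "Discount Amount (INR)": discount,
            "Gross Amount": gross,
            "Net Amount": gross - discount,
            "Location": fake.city(),
        })
    print(pd.DataFrame(rows))
    ```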

  7. watches

    • huggingface.co
    Updated Nov 17, 2025
    Cite
    gil (2025). watches [Dataset]. https://huggingface.co/datasets/yotam22/watches
    Explore at:
    Dataset updated
    Nov 17, 2025
    Authors
    gil
    Description

    πŸ•°οΈ Exploratory Data Analysis of Luxury Watch Prices

      Overview
    

    This project analyzes a large dataset of luxury watches to understand which factors influence price. We focus on brand, movement type, case material, size, gender, and production year. All work was done in Python (Pandas, NumPy, Matplotlib/Seaborn) on Google Colab.

      Dataset
    

    Rows: ~172,000
    Columns: 14
    Unit of observation: one watch listing

    Main columns

    name – watch/listing title
    price – listed… See the full description on the dataset page: https://huggingface.co/datasets/yotam22/watches.
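
    The description is cut off, but the listed columns already support a first pass at the price question. A sketch, assuming the repository loads directly with the datasets library and that a brand column exists as the prose suggests:

    ```python
    from datasets import load_dataset

    ds = load_dataset("yotam22/watches", split="train")  # split name is an assumption
    df = ds.to_pandas()

    # Median listed price per brand (column names assumed from the description).
    print(df.groupby("brand")["price"].median().sort_values(ascending=False).head(10))
    ```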

  8. cyclistic case study visualisations

    • kaggle.com
    zip
    Updated Feb 9, 2023
    Cite
    Ishita Arora 11 (2023). cyclistic case study visualisations [Dataset]. https://www.kaggle.com/datasets/ishitaarora1111/cyclistic-case-study-visualisations
    Explore at:
    Available download formats: zip (124120 bytes)
    Dataset updated
    Feb 9, 2023
    Authors
    Ishita Arora 11
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    These are the visuals created from the findings of the Cyclistic bike-share data analysis, using the latest 12 months of data (January 2022 to December 2022).

  9. Data from: Can LLMs Replace Manual Annotation of Software Engineering...

    • zenodo.org
    pdf, text/x-python +1
    Updated Oct 10, 2024
    Cite
    Blinded; Blinded (2024). Can LLMs Replace Manual Annotation of Software Engineering Artifacts? [Dataset]. http://doi.org/10.5281/zenodo.13917054
    Explore at:
    Available download formats: zip, text/x-python, pdf
    Dataset updated
    Oct 10, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Blinded; Blinded
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Required Libraries

    The following libraries are required to run the scripts in this repository. You can install them using `pip`:

    ```bash
    pip install pandas numpy openai krippendorff scikit-learn seaborn matplotlib together anthropic google-generativeai
    ```

    (argparse, json, time, random, copy, and statistics are part of the Python standard library and need no installation; `sklearn` is provided by the `scikit-learn` package.)

    Make sure to also install any other dependencies required by the specific model API if you plan on using models like GPT-4 or Claude:

    • openai
    • anthropic
    • together

    All the experiments were done using Python 3.10.11.

    For each dataset, we have a folder that contains process.py, heatmap.py, ira_sample.py. The folder also contains the relevant datasets and plots.

    File Description:

    1. data_result: This folder contains the file with the dataset and few-shot samples. After running process.py, all the results are accumulated in the data_result folder. Note that this folder already contains all the data and model-generated results as .jsonl files; you do not need to run process.py to regenerate them.
    2. Plots: This folder contains the generated plots, which can be reproduced by running heatmap.py and ira_sample.py.
    3. process.py: This file generates the results/annotations from the model based on the given parameters. The necessary commands to run this file are listed below. Note that you need API keys from the different organizations to run the script; however, all the model-generated results are already shared in the data_result folder.
    4. heatmap.py: Running this file generates the heatmaps presented in Figures 1-5 of the paper. The generated plots are stored in the "Plots" folder.
    5. ira_sample.py: Running this file generates the plots presented in Figures 7-10 of the paper. The generated plots are stored in the "Plots" folder.

    Commands for datasets (Except Code Summarization):

    Generating samples for different models:

    python process.py --model gpt-4 --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx

    python process.py --model gpt-3.5-turbo --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx

    python process.py --model llama3 --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx

    python process.py --model mixtral --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx

    python process.py --model claude --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx

    python process.py --model gemini --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx

    For Figure (1-5):

    python heatmap.py

    For Figure (7-10):

    python ira_sample.py

    Commands for datasets (Code Summarization):

    python process.py --what accurate --model gpt-4 --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx

    python process.py --what accurate --model gpt-3.5-turbo --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx

    python process.py --what accurate --model llama3 --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx

    python process.py --what accurate --model mixtral --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx

    python process.py --what accurate --model claude --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx

    python process.py --what accurate --model gemini --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx

    For Figure (1-5):

    python heatmap.py

    For Figure (7-10):

    python ira_sample.py

    The --what parameter takes one of: "accurate", "adequate", "concise", "similarity".

    For Figure 6:

    python scatter.py

    For Figures 12 and 13, please copy majority.py and probability.py outside the shared folders.

    For Figure 12:

    python probability.py

    For Figure 13:

    python majority.py

    We also provide sample prompts from all datasets in Prompts.pdf.
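
    Since the scripts lean on the krippendorff package for inter-rater agreement, here is a minimal, self-contained sketch of that computation on toy data (not the paper's data or code):

    ```python
    import numpy as np
    import krippendorff

    # Rows are raters (e.g., two humans and one LLM), columns are annotated items;
    # np.nan marks an item a rater skipped.
    ratings = np.array([
        [1, 2, 3, 3, 2, 1, np.nan],
        [1, 2, 3, 3, 2, 2, 4],
        [1, 2, 3, 3, 1, 1, 4],
    ])
    alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")
    print(f"Krippendorff's alpha: {alpha:.3f}")
    ```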

  10. Dataset for 'Identifying Key Drivers of Product Formation in Microbial...

    • data.4tu.nl
    zip
    Cite
    Marika Zegers; Moumita Roy; Ludovic Jourdin, Dataset for 'Identifying Key Drivers of Product Formation in Microbial Electrosynthesis with a Mixed Linear Regression Analysis' [Dataset]. http://doi.org/10.4121/5e840d08-55f6-4daa-a639-048cebcd8266.v1
    Explore at:
    Available download formats: zip
    Dataset provided by
    4TU.ResearchData
    Authors
    Marika Zegers; Moumita Roy; Ludovic Jourdin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Feb 1, 2024 - Dec 1, 2024
    Dataset funded by
    Delft University of Technology
    NWO
    Description

    The analysed data and complete scripts for the permutation tests and mixed linear regression models (MLRMs) used in the paper 'Identifying Key Drivers of Product Formation in Microbial Electrosynthesis with a Mixed Linear Regression Analysis'.

    Python version 3.10.13 with the packages numpy, pandas, scipy.optimize, scipy.stats, sklearn.metrics, matplotlib.pyplot, statsmodels.formula.api, and seaborn is required to run the .py files (os is part of the standard library). Ensure all packages are installed before running the scripts. The data files required to run the code (.xlsx and .csv format) are included in the relevant folders.
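
    As a pointer to the statsmodels API named above, a minimal mixed linear regression sketch with a random intercept per group (the variable names are placeholders, not those of the paper):

    ```python
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Toy data: an outcome measured across repeated runs of several reactors.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "production_rate": rng.normal(10.0, 2.0, 60),
        "current_density": rng.normal(5.0, 1.0, 60),
        "reactor": np.repeat(list("ABCDEF"), 10),
    })

    # Fixed effect for current density, random intercept per reactor.
    model = smf.mixedlm("production_rate ~ current_density", df, groups=df["reactor"])
    print(model.fit().summary())
    ```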

  11. Data from: Code for: Experimental Investigations of the Flow-Following...

    • darus.uni-stuttgart.de
    Updated Mar 13, 2023
    Cite
    Sebastian Hofmann; Ryan Rautenbach (2023). Code for: Experimental Investigations of the Flow-Following Capabilities and Hydrodynamic Characteristics of Lagrangian Sensor Particles With Respect to Their Centre of Mass [Dataset]. http://doi.org/10.18419/DARUS-3314
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Mar 13, 2023
    Dataset provided by
    DaRUS
    Authors
    Sebastian Hofmann; Ryan Rautenbach
    License

    GNU GPL v3.0: https://www.gnu.org/licenses/gpl-3.0-standalone.html

    Dataset funded by
    DFG
    Description

    Data for 2D Lagrangian Particle tracking and evaluation of their hydrodynamic characteristics.

    Abstract: This dataset contains Python code for the fluid-mechanic evaluation of Lagrangian Particles with the "Consensus-Based Tracking with Selective Rejection of Tracklets" (CSRT) algorithm from the OpenCV library, written by Ryan Rautenbach in the framework of his Master's thesis.

    Workflow for Lagrangian Particle tracking and evaluation via OpenCV: a brief guide based on the folders in the repository follows; more code-specific instructions can be found in the respective code files. A generic sketch of the tracking loop is given after this list.

    working_env_RMR.yml --> Contains the entire environment, including software versions (used here with the Spyder IDE and Conda), with which the datasets were evaluated.

    01 --> Tracking always begins with the 01_milti[...] folder, which holds the Python code with the OpenCV algorithm. For tracking to work, certain directories are required: one in which the raw images are stored (separate from anything else) and one in which the results are saved (not the same directory as the raw data). After tracking is completed for all experiments and the results directories are adequately labelled and stored, any of the other code files can be used for the respective analyses. The order of the folders beyond 01 has no bearing on the order of evaluation, but following it can ease understanding of the evaluated data.

    02 --> Evaluation of the number of circulations and the respective circulation times in the experimental vat. (The code can be extended to calculate the circulation time based on the various planes that are artificially set.)

    03 --> Code for calculating the number of contacts with the vat floor. The code requires some visual evaluation based on the LP trajectories, as the plane/barrier for the contact evaluation has to be set manually.

    04 --> Contains two codes that can be applied to the results data to combine individual results into larger, more processable arrays within Python.

    05 --> Contains the code to plot the trajectory of single Lagrangian-particle experiments based on their positional results and the velocity at each position, highlighting the trajectory over the experiment.

    06 --> Codes to create 1D histograms of the probability density and velocity distributions across cumulative experiments.

    07 --> Codes for plotting the 2D probability density distribution (2D histograms) of Lagrangian Particles across cumulative experiments. The code provides values for the 2D grid; plotting is conducted in Origin Lab or similar graphing tools, though graphing can also be done in Python, for which the seaborn (matplotlib-based) library is suggested.

    08 --> Contains the code for the dimensionless evaluation of the results based on the respective Stokes-number approaches and weighted averages. 2D histograms are also vital to this evaluation; plotting is again conducted in Origin Lab, as values are only calculated in code.

    09 --> Contains no Python code; instead it holds the Origin Lab files for the graphing, plotting, and evaluation of the results calculated via Python. Tables, histograms, and heat maps are provided here as templates if necessary.
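
    For orientation, the core of a CSRT tracking loop in OpenCV looks roughly like the sketch below. This is a generic illustration, not the repository's script; depending on the OpenCV build, the factory function is cv2.TrackerCSRT_create or cv2.legacy.TrackerCSRT_create, and the video path is a placeholder:

    ```python
    import cv2

    cap = cv2.VideoCapture("experiment.avi")  # placeholder path to raw footage
    ok, frame = cap.read()

    # Draw the particle's initial bounding box, then hand it to CSRT.
    bbox = cv2.selectROI("Select particle", frame)
    tracker = cv2.TrackerCSRT_create()
    tracker.init(frame, bbox)

    positions = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        found, bbox = tracker.update(frame)
        if found:
            x, y, w, h = (int(v) for v in bbox)
            positions.append((x + w / 2, y + h / 2))  # particle centre per frame
    cap.release()
    print(f"Tracked {len(positions)} frames")
    ```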

  12. 911 Calls Data (Subset)

    • kaggle.com
    zip
    Updated Jun 3, 2020
    Cite
    hardly_human (2020). 911 Calls Data (Subset) [Dataset]. https://www.kaggle.com/rehan1024/911-calls-data-subset
    Explore at:
    Available download formats: zip (3828316 bytes)
    Dataset updated
    Jun 3, 2020
    Authors
    hardly_human
    License

    U.S. Government Works: https://www.usa.gov/government-works/

    Description

    Dataset

    This dataset was created by hardly_human

    Released under U.S. Government Works


  13. Bank Data Analysis

    • kaggle.com
    zip
    Updated Feb 23, 2022
    Cite
    Steve Gallegos (2022). Bank Data Analysis [Dataset]. https://www.kaggle.com/stevegallegos/bank-marketing-data-set
    Explore at:
    Available download formats: zip (376757 bytes)
    Dataset updated
    Feb 23, 2022
    Authors
    Steve Gallegos
    Description

    Data Set Information

    The bank.csv dataset describes phone calls between customers and customer-care staff working for a Portuguese banking institution. It records whether the customer took up a product such as a bank term deposit; the target is mostly 'yes'/'no' data.

    Goal

    The main goal is to predict if clients will subscribe to a term deposit or not.

    Attribute Information

    Input Variables:

    Bank Client Data:
    1 - age (numeric)
    2 - job: type of job (categorical: admin., blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown)
    3 - marital: marital status (categorical: divorced, married, single, unknown; note: divorced means either divorced or widowed)
    4 - education (categorical: basic.4y, basic.6y, basic.9y, high.school, illiterate, professional.course, university.degree, unknown)
    5 - default: has credit in default? (categorical: no, yes, unknown)
    6 - housing: has housing loan? (categorical: no, yes, unknown)
    7 - loan: has personal loan? (categorical: no, yes, unknown)

    Related to the Last Contact of the Current Campaign:
    8 - contact: contact communication type (categorical: cellular, telephone)
    9 - month: last contact month of year (categorical: jan, feb, mar, ..., nov, dec)
    10 - day_of_week: last contact day of the week (categorical: mon, tue, wed, thu, fri)
    11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'), yet the duration is not known before a call is performed, and after the call ends y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

    Other Attributes:
    12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
    13 - pdays: number of days that passed after the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted)
    14 - previous: number of contacts performed before this campaign and for this client (numeric)
    15 - poutcome: outcome of the previous marketing campaign (categorical: failure, nonexistent, success)

    Social and Economic Context Attributes:
    16 - emp.var.rate: employment variation rate, quarterly indicator (numeric)
    17 - cons.price.idx: consumer price index, monthly indicator (numeric)
    18 - cons.conf.idx: consumer confidence index, monthly indicator (numeric)
    19 - euribor3m: euribor 3-month rate, daily indicator (numeric)
    20 - nr.employed: number of employees, quarterly indicator (numeric)

    Output Variable (Desired Target):
    21 - y (deposit): has the client subscribed to a term deposit? (binary: yes, no). Note: the column title was changed from 'y' to 'deposit'.
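
    A minimal modelling sketch that honours the note on duration (assuming a comma-separated bank.csv with the renamed deposit target; the model choice is illustrative):

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    df = pd.read_csv("bank.csv")

    # 'duration' is only known after the call ends, so drop it for a realistic model.
    X = pd.get_dummies(df.drop(columns=["deposit", "duration"]), drop_first=True)
    y = (df["deposit"] == "yes").astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("Held-out accuracy:", clf.score(X_test, y_test))
    ```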

    Source

    [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

  14. Analysis of network performance when confirmed traffic is present in Long...

    • researchdata.up.ac.za
    zip
    Updated Feb 16, 2024
    Cite
    Jaco Marais; Gerhardus Hancke; Adnan Abu Mahfouz (2024). Analysis of network performance when confirmed traffic is present in Long Range Wide Area Networks (LoRaWANs) [Dataset]. http://doi.org/10.25403/UPresearchdata.22113050.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 16, 2024
    Dataset provided by
    University of Pretoria
    Authors
    Jaco Marais; Gerhardus Hancke; Adnan Abu Mahfouz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Quantitative data of figures and graphing scripts from the thesis titled 'Developing a congestion management scheme to reduce the impact of congestion in mixed traffic LoRaWANs'. The files contain the processed output of simulations conducted with a modified version of the ns-3 lorawan plugin; the processed simulation output consists of Pandas dataframes stored in text files. Software used: ns-3 (version 3.30), Jupyter notebooks, and Python with the packages sem, pandas, and seaborn, plus a modified version of the lorawan module from signetlabdei. The Python scripts refer to Std and Ex: Std refers to the standard LoRaWAN module, and Ex refers to the extended version of the module with the algorithms presented in the thesis. Each text file contains a legend at the top listing all fields present in the dataframe.

  15. CpG Signature Profiling and Heatmap Visualization of SARS-CoV Genomes:...

    • figshare.com
    txt
    Updated Apr 5, 2025
    Cite
    Tahir Bhatti (2025). CpG Signature Profiling and Heatmap Visualization of SARS-CoV Genomes: Tracing the Genomic Divergence From SARS-CoV (2003) to SARS-CoV-2 (2019) [Dataset]. http://doi.org/10.6084/m9.figshare.28736501.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Apr 5, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Tahir Bhatti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective
    The primary objective of this study was to analyze CpG dinucleotide dynamics in coronaviruses by comparing Wuhan-Hu-1 with its closest and most distant relatives. Heatmaps were generated to visualize CpG counts and O/E ratios across intergenic regions, providing a clear depiction of conserved and divergent CpG patterns.

    Methods
    1. Data Collection: The dataset includes CpG counts and O/E ratios for various coronaviruses, extracted from publicly available genomic sequences. Data was compiled into a CSV file containing columns for intergenic regions, CpG counts, and O/E ratios for each virus.
    2. Preprocessing: Missing values (NaN), infinite values (inf, -inf), and blank entries were handled using Python's pandas library. Missing values were replaced with column means, and infinite values were capped at a large finite value (1e9). The data was reshaped into matrices for CpG counts and O/E ratios using pandas' melt() and pivot() functions.
    3. Distance Calculation: Pairwise Euclidean distances were calculated between Wuhan-Hu-1 and other viruses using the scipy.spatial.distance.euclidean function. Distances were computed separately for CpG counts and O/E ratios, and the total distance was derived as the sum of both metrics.
    4. Identification of Closest and Distant Relatives: The virus with the smallest total distance was identified as the closest relative; the virus with the largest total distance was identified as the most distant relative.
    5. Heatmap Generation: Heatmaps were generated using Python's seaborn library (sns.heatmap) and matplotlib, annotated with numerical values for clarity. A color gradient (coolwarm) was used to represent varying CpG counts and O/E ratios, and titles and axis labels describe the comparison between Wuhan-Hu-1 and its relatives.

    Results
    Closest relative: identified by the smallest Euclidean distance; heatmaps for CpG counts and O/E ratios show high similarity in specific intergenic regions. Most distant relative: identified by the largest Euclidean distance; heatmaps reveal significant differences in CpG dynamics compared to Wuhan-Hu-1.

    Tools and Libraries
    Python 3.13 with pandas (data manipulation and cleaning), numpy (numerical operations and handling missing/infinite values), scipy.spatial.distance (Euclidean distances), seaborn (heatmaps), and matplotlib (additional visualization enhancements). Input: CSV files containing CpG counts and O/E ratios. Output: PNG images of heatmaps.

    Files Included
    • CSV file: contains the raw data of CpG counts and O/E ratios for all viruses.
    • Heatmap images: heatmaps for CpG counts and O/E ratios comparing Wuhan-Hu-1 with its closest and most distant relatives.
    • Python script: the full code used for data processing, distance calculation, and heatmap generation.

    Usage Notes
    Researchers can use this dataset to further explore the evolutionary dynamics of CpG dinucleotides in coronaviruses. The Python script can be adapted to analyze other viral genomes or datasets. Heatmaps provide a visual summary of CpG dynamics, aiding in hypothesis generation and experimental design.

    Acknowledgments
    Special thanks to the open-source community for developing tools like pandas, numpy, seaborn, and matplotlib. This work was conducted as part of an independent research project in molecular biology and bioinformatics.

    License
    This dataset is shared under the CC BY 4.0 License, allowing others to share and adapt the material as long as proper attribution is given. DOI: 10.6084/m9.figshare.28736501
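
    A condensed sketch of the distance-plus-heatmap pipeline described above, with toy values standing in for the CSV inputs:

    ```python
    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from scipy.spatial.distance import euclidean

    # Rows: viruses; columns: intergenic regions; values: CpG O/E ratios (toy numbers).
    oe = pd.DataFrame(
        [[0.41, 0.35, 0.52], [0.40, 0.37, 0.50], [0.22, 0.61, 0.30]],
        index=["Wuhan-Hu-1", "Virus_A", "Virus_B"],
        columns=["Region_1", "Region_2", "Region_3"],
    )
    oe = oe.fillna(oe.mean()).replace([np.inf, -np.inf], 1e9)  # cleaning rule from Methods

    # Euclidean distance to Wuhan-Hu-1 identifies closest and most distant relatives.
    ref = oe.loc["Wuhan-Hu-1"]
    dists = {v: euclidean(ref, oe.loc[v]) for v in oe.index if v != "Wuhan-Hu-1"}
    print("Closest:", min(dists, key=dists.get), "| Most distant:", max(dists, key=dists.get))

    sns.heatmap(oe, annot=True, cmap="coolwarm")
    plt.title("CpG O/E ratios by intergenic region")
    plt.show()
    ```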

  16. Customer Sale Dataset for Data Visualization

    • kaggle.com
    Updated Jun 6, 2025
    Cite
    Atul (2025). Customer Sale Dataset for Data Visualization [Dataset]. https://www.kaggle.com/datasets/atulkgoyl/customer-sale-dataset-for-visualization
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 6, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Atul
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This synthetic dataset is designed specifically for practicing data visualization and exploratory data analysis (EDA) using popular Python libraries like Seaborn, Matplotlib, and Pandas.

    Unlike most public datasets, this one includes a diverse mix of column types:

    πŸ“… Date columns (for time series and trend plots)
    πŸ”’ Numerical columns (for histograms, boxplots, scatter plots)
    🏷️ Categorical columns (for bar charts, group analysis)

    Whether you are a beginner learning how to visualize data or an intermediate user testing new charting techniques, this dataset offers a versatile playground.

    Feel free to:

    Create EDA notebooks
    Practice plotting techniques
    Experiment with filtering, grouping, and aggregations

    πŸ› οΈ No missing values, no data cleaning needed; just download and start exploring!

    Hope you find this helpful. Looking forward to hearing from you all.
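
    A starter EDA cell of the kind this dataset targets (the column names below are placeholders; check df.columns for the real date, numeric, and categorical fields):

    ```python
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("customer_sales.csv", parse_dates=["order_date"])  # placeholder names

    print(df.describe(include="all"))

    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    sns.histplot(df["sale_amount"], ax=axes[0])               # a numeric column
    sns.countplot(data=df, x="product_category", ax=axes[1])  # a categorical column
    plt.tight_layout()
    plt.show()
    ```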

  17. User Profiling and Segmentation Project

    • kaggle.com
    Updated Jul 9, 2024
    Cite
    Sanjana Murthy (2024). User Profiling and Segmentation Project [Dataset]. https://www.kaggle.com/datasets/sanjanamurthy392/user-profiling-and-segmentation-project
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 9, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sanjana Murthy
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    About the Dataset:
    • Domain: Marketing
    • Project: User Profiling and Segmentation
    • Datasets: user_profile_for_ads.csv
    • Dataset Type: Excel Data
    • Dataset Size: 16k+ records

    KPIs:
    1. Distribution of Key Demographic Variables: a. Count of Age; b. Count of Gender; c. Count of Education Level; d. Count of Income Level; e. Count of Device Usage
    2. Understanding Online Behavior: a. Count of Time Spent Online (hrs/Weekday); b. Count of Time Spent Online (hrs/Weekend)
    3. Ad Interaction Metrics: a. Count of Likes and Reactions; b. Count of Click-Through Rates (CTR); c. Count of Conversion Rate; d. Count of Ad Interaction Time (secs); e. Count of Ad Interaction Time by Top Interests

    Process:
    1. Understanding the problem
    2. Data Collection
    3. Exploring and analyzing the data
    4. Interpreting the results

    The accompanying analysis makes use of: pandas, matplotlib, seaborn, isnull, set_style, suptitle, countplot, palette, tight_layout, figsize, histplot, barplot, sklearn, StandardScaler, OneHotEncoder, ColumnTransformer, Pipeline, KMeans, cluster_means, groupby, numpy, radar_df. A minimal sketch of the segmentation step follows below.
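
    Those building blocks chain together in the usual scikit-learn way (the feature names here are placeholders drawn from the KPI list, not verified against the file):

    ```python
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.pipeline import Pipeline
    from sklearn.cluster import KMeans

    df = pd.read_csv("user_profile_for_ads.csv")

    numeric = ["Time Spent Online (hrs/Weekday)", "Ad Interaction Time (secs)"]  # placeholders
    categorical = ["Gender", "Device Usage"]                                     # placeholders

    pre = ColumnTransformer([
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])
    pipe = Pipeline([("pre", pre), ("kmeans", KMeans(n_clusters=5, n_init=10, random_state=42))])

    df["segment"] = pipe.fit_predict(df)
    print(df.groupby("segment")[numeric].mean())  # per-segment means, cf. cluster_means
    ```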

  18. Compare Baseball Player Statistics using Visualiza

    • kaggle.com
    zip
    Updated Sep 28, 2024
    Cite
    Abdelaziz Sami (2024). Compare Baseball Player Statistics using Visualiza [Dataset]. https://www.kaggle.com/datasets/abdelazizsami/compare-baseball-player-statistics-using-visualiza
    Explore at:
    Available download formats: zip (1030978 bytes)
    Dataset updated
    Sep 28, 2024
    Authors
    Abdelaziz Sami
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    To compare baseball player statistics effectively using visualization, we can create some insightful plots. Below are the steps to accomplish this in Python using libraries like Pandas and Matplotlib or Seaborn.

    1. Load the Data

    First, we need to load the judge.csv file into a DataFrame. This will allow us to manipulate and analyze the data easily.

    2. Explore the Data

    Before creating visualizations, it’s good to understand the data structure and identify the columns we want to compare. The relevant columns in your data include pitch_type, release_speed, game_date, and events.

    3. Visualization

    We can create various visualizations, such as:
    • A bar chart to compare the average release speed of different pitch types.
    • A line plot to visualize trends over time based on game dates.
    • A scatter plot to analyze the relationship between release speed and the outcome of the pitches (e.g., strikeouts, home runs).

    Example Code

    Here is sample code demonstrating how to create these visualizations using Matplotlib and Seaborn:

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # Load the data
    df = pd.read_csv('judge.csv')
    
    # Display the first few rows of the dataframe
    print(df.head())
    
    # Set the style of seaborn
    sns.set(style="whitegrid")
    
    # 1. Average Release Speed by Pitch Type
    plt.figure(figsize=(12, 6))
    avg_speed = df.groupby('pitch_type')['release_speed'].mean().sort_values()
    sns.barplot(x=avg_speed.values, y=avg_speed.index, palette="viridis")
    plt.title('Average Release Speed by Pitch Type')
    plt.xlabel('Average Release Speed (mph)')
    plt.ylabel('Pitch Type')
    plt.show()
    
    # 2. Trends in Release Speed Over Time
    # First, convert the 'game_date' to datetime
    df['game_date'] = pd.to_datetime(df['game_date'])
    
    plt.figure(figsize=(14, 7))
    sns.lineplot(data=df, x='game_date', y='release_speed', estimator='mean', ci=None)
    plt.title('Trends in Release Speed Over Time')
    plt.xlabel('Game Date')
    plt.ylabel('Average Release Speed (mph)')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    
    # 3. Scatter Plot of Release Speed vs. Events
    plt.figure(figsize=(12, 6))
    sns.scatterplot(data=df, x='release_speed', y='events', hue='pitch_type', alpha=0.7)
    plt.title('Release Speed vs. Events')
    plt.xlabel('Release Speed (mph)')
    plt.ylabel('Event Type')
    plt.legend(title='Pitch Type', bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.show()
    

    Explanation of the Code

    • Data Loading: The CSV file is loaded into a Pandas DataFrame.
    • Average Release Speed: A bar chart shows the average release speed for each pitch type.
    • Trends Over Time: A line plot illustrates the trend in release speed over time, which can indicate changes in performance or strategy.
    • Scatter Plot: A scatter plot visualizes the relationship between release speed and different events, providing insight into performance outcomes.

    Conclusion

    These visualizations will help you compare player statistics in a meaningful way. You can customize the plots further based on your specific needs, such as filtering data for specific players or seasons. If you have any specific comparisons in mind or additional data to visualize, let me know!

  19. Coffee Sales Dataset

    • kaggle.com
    zip
    Updated Aug 19, 2025
    Cite
    Navjot Kaushal (2025). Coffee Sales Dataset [Dataset]. https://www.kaggle.com/datasets/navjotkaushal/coffee-sales-dataset/code
    Explore at:
    Available download formats: zip (38970 bytes)
    Dataset updated
    Aug 19, 2025
    Authors
    Navjot Kaushal
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains coffee shop transaction records, including details about sales, payment type, time of purchase, and customer preferences. It is specifically curated for data visualization, dashboard building, and business analytics projects in tools like Power BI, Tableau, and Python visualization libraries (Matplotlib, Seaborn, Plotly).

    With attributes covering time of day, weekdays, months, coffee types, and revenue, this dataset provides a strong foundation for analyzing customer behavior, sales patterns, and business performance trends.

    Dataset Structure-

    File format: CSV

    Columns (features):

    1. hour_of_day β†’ Hour of purchase (0–23)

    2. cash_type β†’ Mode of payment (cash / card)

    3. money β†’ Transaction amount (in local currency)

    4. coffee_name β†’ Type of coffee purchased (e.g., Latte, Americano, Hot Chocolate)

    5. Time_of_Day β†’ Categorized time of purchase (Morning, Afternoon, Night)

    6. Weekday β†’ Day of the week (e.g., Mon, Tue, …)

    7. Month_name β†’ Month of purchase (e.g., Jan, Feb, Mar)

    8. Weekdaysort β†’ Numeric representation for weekday ordering (1 = Mon, 7 = Sun)

    9. Monthsort β†’ Numeric representation for month ordering (1 = Jan, 12 = Dec)

    10. Date β†’ Date of transaction (YYYY-MM-DD)

    11. Time β†’ Exact time of transaction (HH:MM:SS)

    Potential Data Visualizations-

    This dataset is well-suited for interactive dashboards and visual reports, such as:

    πŸ“Š Sales by Coffee Type (e.g., top-selling drinks)

    ⏰ Sales by Hour of Day (peak business hours)

    πŸŒ… Sales by Time of Day (Morning vs Afternoon vs Night trends)

    πŸ“… Sales by Weekday & Month (seasonal & weekly demand patterns)

    πŸ’³ Payment Method Breakdown (cash vs card usage)

    πŸ“ˆ Revenue Trends Over Time (daily/monthly growth analysis)
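
    With the documented columns, the first two charts above take only a few lines (the file name is a placeholder):

    ```python
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("coffee_sales.csv")  # placeholder file name

    # Revenue by coffee type (top-selling drinks).
    by_type = df.groupby("coffee_name")["money"].sum().sort_values(ascending=False)
    sns.barplot(x=by_type.values, y=by_type.index)
    plt.xlabel("Revenue")
    plt.show()

    # Revenue by hour of day (peak business hours).
    by_hour = df.groupby("hour_of_day")["money"].sum()
    sns.lineplot(x=by_hour.index, y=by_hour.values)
    plt.xlabel("Hour of day (0-23)")
    plt.ylabel("Revenue")
    plt.show()
    ```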

    Use Cases-

    • Power BI / Tableau dashboards

    • Python data visualization (Matplotlib, Seaborn, Plotly)

    • Data storytelling projects

    • Business intelligence & decision-making simulations

    • Training projects for beginners in data visualization & analytics

  20. Superstore Sales dataset

    • kaggle.com
    zip
    Updated Jul 14, 2025
    Cite
    NayakGanesh007 (2025). Superstore Sales dataset [Dataset]. https://www.kaggle.com/datasets/nayakganesh007/superstore-sales-dataset
    Explore at:
    Available download formats: zip (1030287 bytes)
    Dataset updated
    Jul 14, 2025
    Authors
    NayakGanesh007
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    πŸ“Œ Description: This dataset contains sales records from a global superstore, including details about orders, customers, products, shipping, and profitability. The goal of this analysis is to uncover business insights related to sales performance, regional trends, shipping efficiency, and product profitability.

    πŸ” Key Objectives: Analyze overall sales, profit, and discount trends

    Identify top-performing regions, segments, and categories

    Evaluate the impact of shipping mode on delivery and profit

    Perform RFM (Recency, Frequency, Monetary) analysis on customers

    Visualize key metrics with Matplotlib and Seaborn

    πŸ“ Dataset Features: Column --Description Order ID--- Unique identifier for each order Order Date--- Date when the order was placed Ship Mode --Mode of shipping used Customer Name --Full name of the customer Segment ---Customer segment (Consumer, Corporate, Home Office) Region ---Geographical region of the order Product Category --Category and sub-category of the product Sales --Sales amount Quantity --Number of units sold Profit --Profit earned on the sale Discount--- Discount applied on the product

    πŸ›  Tools & Libraries: Python

    Pandas, NumPy – for data manipulation

    Matplotlib, Seaborn – for data visualization

    Excel – for data import and inspection

    🎯 Business Impact: By understanding sales and profitability patterns, this analysis helps identify opportunities for cost optimization, product focus, and regional strategy β€” essential for scaling business performance.
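
    The RFM step called out above can be sketched directly with pandas (column names follow the feature table; the quartile scoring is one common convention, not prescribed by the dataset):

    ```python
    import pandas as pd

    df = pd.read_csv("superstore_sales.csv", parse_dates=["Order Date"])  # placeholder file name

    snapshot = df["Order Date"].max() + pd.Timedelta(days=1)
    rfm = df.groupby("Customer Name").agg(
        recency=("Order Date", lambda d: (snapshot - d.max()).days),
        frequency=("Order ID", "nunique"),
        monetary=("Sales", "sum"),
    )

    # Score each dimension into quartiles (higher = better) and combine.
    rfm["R"] = pd.qcut(rfm["recency"], 4, labels=[4, 3, 2, 1]).astype(int)
    rfm["F"] = pd.qcut(rfm["frequency"].rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)
    rfm["M"] = pd.qcut(rfm["monetary"], 4, labels=[1, 2, 3, 4]).astype(int)
    rfm["RFM"] = rfm[["R", "F", "M"]].sum(axis=1)
    print(rfm.sort_values("RFM", ascending=False).head())
    ```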
