License: Apache License v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Abstract This project presents a comprehensive analysis of a company's annual sales, using the classic dataset classicmodels as the database. Python is used as the main programming language, along with the Pandas, NumPy and SQLAlchemy libraries for data manipulation and analysis, and PostgreSQL as the database management system.
The main objective of the project is to answer key questions related to the company's sales performance, such as: Which were the most profitable products and customers? Were sales goals met? The results obtained serve as input for strategic decision making in future sales campaigns.
Methodology
1. Data Extraction (sketched below):
2. Data Cleansing and Transformation:
3. Exploratory Data Analysis (EDA):
4. Modeling and Prediction:
5. Report Generation:
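As a hedged illustration of step 1, the sketch below pulls order data from PostgreSQL into Pandas with SQLAlchemy; the connection string, credentials, and query are examples based on the standard classicmodels schema, not the project's actual code.

```python
# Minimal extraction sketch: classicmodels order lines into a DataFrame.
import pandas as pd
from sqlalchemy import create_engine

# Example connection string; replace user, password, and host with your own.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/classicmodels")

query = """
    SELECT o.orderNumber, od.productCode, od.quantityOrdered, od.priceEach
    FROM orders o
    JOIN orderdetails od ON o.orderNumber = od.orderNumber
"""
sales = pd.read_sql(query, engine)
print(sales.head())
```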
Results
- Identification of top products and customers: The best-selling products and the customers that generate the most revenue are identified.
- Analysis of sales trends: Sales trends over time are analyzed and possible factors that influence sales behavior are identified.
- Calculation of key metrics: Metrics such as average profit margin and sales growth rate are calculated.
Conclusions This project demonstrates how Python and PostgreSQL can be effectively used to analyze large data sets and obtain valuable insights for business decision making. The results obtained can serve as a starting point for future research and development in the area of sales analysis.
Technologies Used
- Python: Pandas, NumPy, SQLAlchemy, Matplotlib/Seaborn
- Database: PostgreSQL
- Tools: Jupyter Notebook
- Keywords: data analysis, Python, PostgreSQL, Pandas, NumPy, SQLAlchemy, EDA, sales, business intelligence
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
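For instance, a minimal sketch (the file name below is a placeholder for whichever daily or hourly CSV you downloaded):

```python
import pandas as pd

# Replace with the actual path/name of the LifeSnaps CSV you want to load.
daily = pd.read_csv("path/to/lifesnaps_daily.csv")
print(daily.head())
```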
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
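For example, the Fitbit command would then become (placeholders in angle brackets; depending on your setup you may also need --authenticationDatabase):
mongorestore --host localhost:27017 --username <your_username> --password <your_password> -d rais_anonymized -c fitbit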
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:
{
_id:
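Once restored, a minimal pymongo sketch for inspecting the collections (assuming a default local MongoDB instance and the database name used in the commands above):

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["rais_anonymized"]

# Rough document counts per collection, plus a peek at one Fitbit record.
for name in ("fitbit", "sema", "surveys"):
    print(name, db[name].estimated_document_count())
print(db["fitbit"].find_one())
```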
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Introduction
This Dataverse record contains data for reproducing the results in our corresponding journal article. For more information about the computational protocols used to generate the data, please see the journal article or the ChemRxiv entry (see below).

How to use
This data set contains two data files: molecular coordinates (ALL_GEOMETRIES.txt) and metal-ligand interaction energy data (Raw_Data.csv). These formats lend themselves to easy preparation and analysis with Python. For example, in order to load the data set into a Pandas DataFrame, do the following:

import pandas as pd
data = pd.read_csv('Raw_Data.csv')

You can prepare a list of all geometries in the following way:

with open('ALL_GEOMETRIES.txt') as f:
    raw_string = f.read()
molecules = [mol.split(' ') for mol in raw_string.split('\n')]

The ReadMe file contains descriptions of all data fields found in Raw_Data.csv. All energies are given in Hartrees, and all geometries are given in Angströms.

Journal article
Brakestad et al. "Multiwavelets applied to metal–ligand interactions: Energies free from basis set errors". J. Chem. Phys. (2021)

Abstract from journal article
Transition metal-catalyzed reactions invariably include steps where ligands associate or dissociate. In order to obtain reliable energies for such reactions, sufficiently large basis sets need to be employed. In this paper, we have used high-precision multiwavelet calculations to compute the metal–ligand association energies for 27 transition metal complexes with common ligands, such as H2, CO, olefins, and solvent molecules. By comparing our multiwavelet results to a variety of frequently used Gaussian-type basis sets, we show that counterpoise corrections, which are widely employed to correct for basis set superposition errors, often lead to underbinding. Additionally, counterpoise corrections are difficult to employ when the association step also involves a chemical transformation. Multiwavelets, which can be conveniently applied to all types of reactions, provide a promising alternative for computing electronic interaction energies free from any basis set errors.

ChemRxiv record
https://doi.org/10.26434/chemrxiv.13669951.v1
License: Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
The dataset was fetched on 8 September 2023 at 18:00 London time.
The dataset was generated using a web scraping script written in Python, utilizing the Scrapy library. The script navigates through IMDb's list of animations originating from Japan, scraping relevant information from each listing. The spider starts from the URL https://www.imdb.com/search/title/?genres=Animation&countries=jp and follows the "Next" links to traverse through multiple pages of listings.
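For illustration only, a hedged sketch of such a spider is shown below; the CSS selectors and field names are assumptions (IMDb's markup changes frequently), and the actual scraping script is not distributed with this dataset.

```python
import scrapy


class JapanAnimationSpider(scrapy.Spider):
    # Hypothetical spider mirroring the description above.
    name = "japan_animation"
    start_urls = [
        "https://www.imdb.com/search/title/?genres=Animation&countries=jp"
    ]

    def parse(self, response):
        # Extract a few representative fields from each listing block.
        for item in response.css("div.lister-item-content"):
            yield {
                "title": item.css("h3 a::text").get(),
                "year": item.css("span.lister-item-year::text").get(),
                "rating": item.css("div.ratings-imdb-rating strong::text").get(),
            }
        # Follow the "Next" link to traverse multiple pages of listings.
        next_page = response.css("a.lister-page-next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```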
The dataset provides a comprehensive view of various animations listed on IMDb that are categorized under the genre "Animation" and originate from Japan. It includes details such as the title, genre, user rating, number of votes, runtime, year of release, summary, stars, certificate, metascore, gross earnings, episode flag, and episode title when applicable.
However, the dataset also includes some animations not regarded as Japanese anime, e.g. the Toy Story films. This is because the listings can only be filtered by region, not by the origin of production.
- Title: The name of the animation.
- Genre: The genre(s) under which the animation falls, e.g., Action, Adventure, etc.
- User Rating: The IMDb user rating out of 10.
- Number of Votes: The total number of IMDb users who have rated the animation.
- Runtime: The duration of the animation in minutes.
- Year: The year the animation was released or started airing.
- Summary: A brief or full summary of the animation's plot. Full summaries are fetched when available.
- Stars: List of main actors or voice actors involved in the animation.
- Certificate: The certification of the animation, e.g., PG, PG-13, etc.
- Metascore: The Metascore rating, if available, which is an aggregated score from various critics.
- Gross: The gross earnings or box office collection of the animation.
- Episode: A binary flag indicating whether the listing is for an episode of a series (1 for yes, 0 for no).
- Episode Title: The title of the episode if the listing is for an episode; otherwise, it will be None.
Exploratory Data Analysis (EDA) Genre Popularity: Analyze which genres are most popular based on user ratings and number of votes. Year-wise Trends: Examine how the popularity of anime has evolved over the years.
Predictive Modeling Rating Prediction: Use machine learning algorithms to predict the rating of an anime based on features like genre, runtime, and stars. Success Prediction: Predict the financial success (Gross earnings) of an anime based on various features.
Content Recommendation Personalized Recommendations: Use user ratings and genre information to build a recommendation system.
Sentiment Analysis Summary Sentiment: Perform sentiment analysis on the summary to see if the tone of the summary correlates with user ratings or other features.
Network Analysis Actor Collaboration: Create a network graph to analyze frequent collaborations between actors.
Time-Series Analysis Rating Over Time: Analyze how ratings evolve over time for long-running series.
Market Research Target Audience: Use the certificate and genre information to identify target demographics for marketing anime-related products.
Academic Research Cultural Impact: Study the cultural impact of anime by analyzing its popularity, genres, and actors.
Data Visualization Interactive Dashboards: Create dashboards to visualize the data and allow users to filter by various criteria like genre, year, or rating.
Natural Language Processing (NLP) Topic Modeling: Use NLP techniques to identify common themes or topics in the summaries.
By leveraging Python for data analysis, you can use libraries like Pandas for data manipulation, Matplotlib and Seaborn for data visualization, and scikit-learn for machine learning to extract valuable insights from this dataset.
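A small hedged example of the EDA ideas above, assuming the scraped listings were saved to a CSV; the file name and exact column headers are assumptions and may differ in the actual export:

```python
import pandas as pd

df = pd.read_csv("imdb_japan_animation.csv")

# Genre popularity: average rating and total votes per genre
# (multi-genre rows may need splitting first).
genre_stats = df.groupby("Genre").agg(
    {"User Rating": "mean", "Number of Votes": "sum"}
).sort_values("Number of Votes", ascending=False)
print(genre_stats.head())

# Year-wise trend: average user rating per release year.
print(df.groupby("Year")["User Rating"].mean().tail(10))
```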
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
By Ændrew Rininsland [source]
This dataset provides an eye-opening look into the characters and actors involved in the globally acclaimed TV series Game of Thrones. By examining each character's screen time, episode count, and the IMDB URL of the actor or actress portraying them, one can gain remarkable insight into which characters have seized the spotlight in this epic saga. The list was compiled by ninewheels0 on IMDB; it took a long time to amass and deserves appreciation. Each character's screen time is measured in minutes with fractional seconds (i.e., 1.5 minutes means one minute and thirty seconds). Since screen time is a rough measure of how memorable a character is to viewers around the world, the data shows how much of our attention each actor has claimed across multiple seasons.
How to Use the GoT Characters Screen Time Dataset
This Kaggle dataset contains information about the screen time of characters in the Game of Thrones TV series, including their name, IMDB URL, screen time (in minutes with fractional seconds), number of episodes appeared in, and the actor/actress portraying them. It is a helpful resource for fan theorists who want to further study character arcs and build theories around key points in the story.
First off, make sure you acknowledge ninewheels0 on IMDB, who created this list before it was uploaded to Kaggle. Read the “About this dataset” section carefully before getting started so that all sources and credits are given out properly.
To begin studying and analyzing this data set, you can use various software tools for analyzing and visualizing large data sets. Python's pandas library lets you easily study every column provided, such as name or portrayed_by_name. You can also use Tableau, which lets you turn selected columns into charts or graphs so that patterns are easily found within large datasets. Excel can be used for similar purposes, but it is far less convenient once you need more than a few interactions between different elements of the CSV file.
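As a hedged starting point in pandas (the column names below are guesses based on the description; check the CSV header and adjust):

```python
import pandas as pd

got = pd.read_csv("GOT_screentimes.csv")

# Characters ranked by total screen time (minutes, with fractional seconds).
print(got.sort_values("screentime", ascending=False).head(10))

# Rough prominence measure: minutes on screen per episode appeared in.
got["minutes_per_episode"] = got["screentime"] / got["episodes"]
print(got.nlargest(10, "minutes_per_episode")[["name", "minutes_per_episode"]])
```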
When analyzing these datasets, it is important to note down the key questions you want to answer and to understand what information is already present: What correlations exist? What could affect a certain element? Is there something specific I want to uncover through my analysis? After deciding on those major points, look at the distinct elements present within each column, such as its highest (max) and lowest (min) values. Then check whether the patterns you find are consistent, and watch for outliers before drawing any conclusions; sometimes outliers influence results more than expected, and sometimes they provide little insight and only confuse the story. Knowing how each element interacts with other variables in the same dataset helps when examining the relationship between two separate items in the same file.
Finally, after working through the steps described above, you can start drawing parallels between different parts of the dataset. Once enough observations have been made, you may have the evidence needed to finish most of the research phase, such as confirming or rejecting a hypothesis.
- Create an interactive feature on a website/app to compare the screentime across all characters in Game of Thrones
- Analyze how the screentime of characters evolve over time and seasons
- Using ML algorithms, explore different patterns in the data to identify relationships between screen time and other factors such as character gender, type etc
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: GOT_screentimes.csv | Column...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset used in this study was from the preliminary competition dataset of the 2018 Guangdong Industrial Intelligent Manufacturing Big Data Intelligent Algorithm Competition organized by Tianchi Feiyue Cloud (https://tianchi.aliyun.com/competition/entrance/231682/introduction). We selected from this dataset, removing images that do not meet the requirements of our experiment. All data have been split for training and testing. The image resolution is 2560×1960 pixels. Before training, all defects need to be labeled using labelImg and saved as json files. Then, all json files are converted to txt files. Finally, the organized defect dataset is used for detection and classification.

Description of the data and file structure
This is a project based on a YOLOv8 enhanced algorithm for aluminum defect classification and detection tasks. All code has been tested on Windows computers with Anaconda and CUDA-enabled GPUs. The following instructions allow users to run the code in this repository on a Windows + CUDA GPU system.

Files and variables
File: defeat_dataset.zip

Setup
Please follow the steps below to set up the project.

Download project repository
- Download the project repository defeat_dataset.zip from the following location.
- Unzip and navigate to the project folder; it should contain a subfolder: quexian_dataset

Download data
1. Download the data: defeat_dataset.zip
2. Unzip the downloaded data and move the 'defeat_dataset' folder into the project's main folder.
3. Make sure that your defeat_dataset folder now contains a subfolder: quexian_dataset.
4. Within the folder you should find various subfolders such as addquexian-13, quexian_dataset, new_dataset-13, etc.

Software
Set up the Python environment:
1. Download and install Anaconda.
2. Once Anaconda is installed, open the Anaconda Prompt. For Windows, click Start, search for Anaconda Prompt, and open it.
3. Create a new conda environment with Python 3.8. You can name it whatever you like, for example yolov8. Enter the following command: conda create -n yolov8 python=3.8
4. Activate the created environment. If the name is yolov8, enter: conda activate yolov8
5. Download and install Visual Studio Code.
6. Install PyTorch based on your system. For Windows/Linux users with a CUDA GPU: conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
7. Install the remaining necessary libraries:
   - Install scikit-learn with the command: conda install -c anaconda scikit-learn=0.24.1
   - Install astropy with: conda install astropy=4.2.1
   - Install pandas using: conda install -c anaconda pandas=1.2.4
   - Install Matplotlib with: conda install -c conda-forge matplotlib=3.5.3
   - Install scipy by entering: conda install scipy=1.10.1

Repeatability
For PyTorch, it's a well-known fact that there is no guarantee of fully reproducible results between PyTorch versions, individual commits, or different platforms. In addition, results may not be reproducible between CPU and GPU executions, even if the same seed is used. All results in the Analysis Notebook that involve only model evaluation are fully reproducible. However, when it comes to updating the model on the GPU, the results of model training on different machines vary.

Access information
Other publicly accessible locations of the data: https://tianchi.aliyun.com/dataset/public/
Data was derived from the following source: https://tianchi.aliyun.com/dataset/140666

Data availability statement
The ten defect classes used in this study come from the Guangdong Industrial Wisdom Big Data Innovation Competition - Intelligent Algorithm Competition Rematch; the dataset download link is https://tianchi.aliyun.com/competition/entrance/231682/information?lang=en-us. The official website provides 4,356 images, including single-defect images, multi-defect images and defect-free images. We selected only the single-defect and multi-defect images, 3,233 images in total. The ten defects are non-conductive, effacement, miss bottom corner, orange peel, varicolored, jet, lacquer bubble, jump into a pit, divulge the bottom and blotch. Each image contains one or more defects, and the resolution of the defect images is 2560×1920.
By investigating the literature, we found that most experiments use these 10 types of defects, so we chose three additional defect types that differ more from these ten and have more samples, which are suitable for the experiments. The three newly added defect types come from the preliminary dataset of the Guangdong Industrial Wisdom Big Data Intelligent Algorithm Competition, which can be downloaded from https://tianchi.aliyun.com/dataset/140666. It contains 3,000 images in total, among which 109, 73 and 43 images are for the defects of bruise, camouflage and coating cracking, respectively. Finally, the 10 types of defects from the rematch and the 3 types of defects selected from the preliminary round are fused into a new dataset, which is the dataset examined in this study.
In processing the dataset, we tried different division ratios, such as 8:2, 7:3, 7:2:1, etc. After testing, we found that the experimental results did not differ much across division ratios. Therefore, we divided the dataset according to the ratio 7:2:1: the training set accounts for 70%, the validation set for 20%, and the testing set for 10%. At the same time, the random number seed is set to 0 to ensure that the results are consistent every time the model is trained.
Finally, the mean Average Precision (mAP) metric was measured on the dataset a total of three times. Each time the results differed very little, but for the accuracy of the experimental results, we took the average of the highest and lowest results. The highest was 71.5% and the lowest was 71.1%, giving an average detection accuracy of 71.3% for the final experiment.
All data and images utilized in this research are from publicly available sources, and the original creators have given their consent for these materials to be published in open-access formats.
The settings for the other parameters are as follows: epochs: 200, patience: 50, batch: 16, imgsz: 640, pretrained: true, optimizer: SGD, close_mosaic: 10, iou: 0.7, momentum: 0.937, weight_decay: 0.0005, box: 7.5, cls: 0.5, dfl: 1.5, pose: 12.0, kobj: 1.0, save_dir: runs/train.
The defeat_dataset.zip is mentioned in the Supporting information section of our manuscript. The underlying data are held at Figshare, DOI: 10.6084/m9.figshare.27922929. The results_images.zip in the system contains the experimental result graphs. The images_1.zip and images_2.zip in the system contain all the images needed to generate the manuscript from manuscript.tex.
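The authors' training scripts ship with the project repository; purely as a hedged sketch, a YOLOv8 run with the hyperparameters listed above might look like the following (the ultralytics package and the dataset YAML path are assumptions):

```python
from ultralytics import YOLO

# Pretrained weights, consistent with "pretrained: true" above.
model = YOLO("yolov8n.pt")
model.train(
    data="quexian_dataset.yaml",  # placeholder path to the defect dataset config
    epochs=200,
    patience=50,
    batch=16,
    imgsz=640,
    optimizer="SGD",
    momentum=0.937,
    weight_decay=0.0005,
    close_mosaic=10,
    iou=0.7,
    seed=0,
)
```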
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains a selection of behavioral datasets collected using soluble agents and labeled using realistic threat simulation and IDS rules. The collected datasets are anonymized and aggregated using time-window representations. The dataset generation pipeline preprocesses the application logs from the corporate network, structures them according to the entity and user inventory, and labels them based on the IDS and phishing simulation appliances.
This repository is associated with the article "RBD24: A labelled dataset with risk activities using log applications data" published in the journal Computers & Security. For more information go to https://doi.org/10.1016/j.cose.2024.104290
The RBD24 dataset comprises various risk activities collected from real entities and users over a period of 15 days, with the samples segmented by Desktop (DE) and Smartphone (SM) devices.
| DatasetId | Entity | Observed Behaviour | Groundtruth | Sample Shape |
| --- | --- | --- | --- | --- |
| Crypto_desktop.parquet | DE | Miner Checking | IDS | 0: 738/161202, 1: 11/1343 |
| Crypto_smarphone.parquet | SM | Miner Checking | IDS | 0: 613/180021, 1: 4/956 |
| OutFlash_desktop.parquet | DE | Outdated software components | IDS | 0: 738/161202, 1: 56/10820 |
| OutFlash_smartphone.parquet | SM | Outdated software components | IDS | 0: 613/180021, 1: 22/6639 |
| OutTLS_desktop.parquet | DE | Outdated TLS protocol | IDS | 0: 738/161202, 1: 18/2458 |
| OutTLS_smartphone.parquet | SM | Outdated TLS protocol | IDS | 0: 613/180021, 1: 11/2930 |
| P2P_desktop.parquet | DE | P2P Activity | IDS | 0: 738/161202, 1: 177/35892 |
| P2P_smartphone.parquet | SM | P2P Activity | IDS | 0: 613/180021, 1: 94/21688 |
| NonEnc_desktop.parquet | DE | Non-encrypted password | IDS | 0: 738/161202, 1: 291/59943 |
| NonEnc_smaprthone.parquet | SM | Non-encrypted password | IDS | 0: 613/180021, 1: 167/41434 |
| Phishing_desktop.parquet | DE | Phishing email | Experimental Campaign | 0: 98/13864, 1: 19/3072 |
| Phishing_smartphone.parquet | SM | Phishing email | Experimental Campaign | 0: 117/34006, 1: 26/8968 |
To collect the dataset, we have deployed multiple agents and soluble agents within an infrastructure with
more than 3k entities, comprising laptops, workstations, and smartphone devices. The methods to build
ground truth are as follows:
- Simulator: We launch different realistic phishing campaigns, aiming to expose user credentials or defeat access to a service.
- IDS: We deploy an IDS to collect various alerts associated with behavioral anomalies, such as cryptomining or peer-to-peer traffic.
For each user exposed to the behaviors stated in the summary table, different time windows (TWs) are computed, aggregating user behavior within a fixed time interval. These TWs serve as the basis for various supervised and unsupervised methods.
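As a generic illustration of this kind of aggregation (not the RBD24 schema; the column names and window length are assumptions), raw log events can be rolled up into fixed time windows with pandas:

```python
import pandas as pd

# Hypothetical raw log events with a per-user id and a timestamp.
logs = pd.DataFrame({
    "user": ["u1", "u1", "u2"],
    "timestamp": pd.to_datetime(
        ["2024-01-01 10:02", "2024-01-01 10:47", "2024-01-01 10:10"]
    ),
    "bytes_sent": [1200, 300, 5000],
})

# One row per (user, 1-hour window) with simple aggregated indicators.
tw = (
    logs.set_index("timestamp")
    .groupby("user")["bytes_sent"]
    .resample("1h")
    .agg(["count", "sum"])
    .reset_index()
)
print(tw)
```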
The time windows (TW) are a data representation based on aggregated logs from multimodal sources between two
timestamps. In this study, logs from HTTP, DNS, SSL, and SMTP are taken into consideration, allowing the
construction of rich behavioral profiles. The indicators described in the TW are a set of manually curated
interpretable features designed to describe device-level properties within the specified time frame. The most
influential features are described below.
Parquet format uses a columnar storage format, which enhances efficiency and compression, making it suitable for large datasets and complex analytical tasks. It has support across various tools and languages, including Python. Parquet can be used with pandas library in Python, allowing pandas to read and write Parquet files through the `pyarrow` or `fastparquet` libraries. Its efficient data retrieval and fast query execution improve performance over other formats. Compared to row-based storage formats such as CSV, Parquet's columnar storage greatly reduces read times and storage costs for large datasets. Although binary formats like HDF5 are effective for specific use cases, Parquet provides broader compatibility and optimization. The provided datasets use the Parquet format. Here’s an example of how to retrieve data using pandas, ensure you have the fastparquet library installed:
```python
import pandas as pd

# Reading a Parquet file
df = pd.read_parquet('path_to_your_file.parquet', engine='fastparquet')
```
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Kidmose CANid Dataset (KCID)

The Kidmose CANid Dataset (KCID) contains CAN bus data collected by Brooke and Andreas Kidmose from 16 different drivers across 4 different vehicles. This dataset is designed to support driver identification and authentication research. The term "CANid" reflects the dataset's dual purpose: data collected from the CAN bus for driver identification research.

VEHICLES
The dataset includes data from four different vehicles across various manufacturers and model years:
- 2011 Chevrolet Traverse: 5-door full-size SUV crossover, AWD, 8 drivers (8 unique drivers in single-driver traces; 1 additional driver in a mixed trace)
- 2017 Ford Focus: 5-door compact station wagon, FWD, 4 drivers
- 2017 Subaru Forester: 5-door compact SUV crossover, AWD, 6 drivers (6 unique drivers in single-driver traces; 3 additional drivers in mixed traces)
- 2022 Honda CR-V Touring: 5-door compact SUV crossover, AWD, 1 driver
Note: The number of drivers includes volunteer drivers whose data was captured in single-driver traces, where we know who was driving at all times. We exclude volunteer drivers whose data is only available in mixed traces because we do not know when each specific driver was actually operating the vehicle.

DRIVERS
The dataset includes 16 drivers across different demographic categories:
Male drivers:
- Under 30 years: 4 drivers ("male-under30-1" through "male-under30-4")
- 30-55 years: 4 drivers ("male-30-55-1" through "male-30-55-4")
- Over 55 years: 3 drivers ("male-over55-1" through "male-over55-3")
Female drivers:
- All ages: 5 drivers ("female-all-ages-1" through "female-all-ages-5")
Driver directory structure: Driver identifiers are used as directory/folder names. Within each directory, you will find traces collected from that particular driver, with additional information (location, data collection method, etc.) specified in the filename.
Note: We use "unknown driver(s)" in directory names when we know that one or more volunteer drivers was operating the vehicle, but we cannot identify who was driving or when. We used a standalone data logger for some data collection sessions. If we failed to download the data and clear the logger's memory before switching drivers, this resulted in mixed traces and, occasionally, "unknown driver(s)" entries. Unfortunately, some of our volunteer drivers were short-term visitors, so we did not have the opportunity to redo their traces as single-driver traces.

LOCATIONS
Data collection took place across multiple locations:
- DK: Denmark
- USA: United States of America
- FL: Florida
- NE: Nebraska
- NE-to-FL: Trip from Nebraska to Florida
- TN: Tennessee
- TN-to-NE: Trip from Tennessee to Nebraska
Location codes appear in filenames (e.g., USA-FL-CANEdge-00000001.mf4 indicates data collected in Florida, USA).

DATA COLLECTION METHODS
Three different data collection methods were employed:
- CANEdge (CSS Electronics CANEdge2): Standalone data logger that connects to the OBD-II port and logs to an SD card
- Korlan (Korlan USB2CAN): CAN-to-USB cable connecting the vehicle's OBD-II port to a laptop
- Kvaser (Kvaser Hybrid CAN-LIN): CAN-to-USB cable connecting the vehicle's OBD-II port to a laptop
The data collection method is indicated in filenames (e.g., USA-FL-CANEdge-00000001.mf4).

FILE TYPES
The dataset provides data in three formats to support different use cases:
.mf4 (MDF4) format: Measurement Data Format version 4 (MDF4)
- Binary format standardized by the Association for Standardization of Automation and Measuring Systems (ASAM)
- Advantages: Compact size, popular with automotive/CAN tools
- Use case: Native format from the CSS Electronics CANEdge2
- Reference: https://www.csselectronics.com/pages/mf4-mdf4-measurement-data-format
.log format: Text-based log format
- Compatibility: Linux SocketCAN can-utils
- Advantages: Compatibility with SocketCAN can-utils; if a .log file is replayed, the data can be captured and monitored using Python's python-can library
- References: https://github.com/linux-can/can-utils, https://packages.debian.org/sid/can-utils, https://python-can.readthedocs.io/en/stable/
.csv format: Text-based comma-separated values (CSV) format
- Advantages: Easy to load with Python using the pandas library; easy to use with Python-based machine learning frameworks (e.g., scikit-learn, Keras, TensorFlow, PyTorch)
- Usage: Load with Python pandas: pd.read_csv()
- Reference: https://pandas.pydata.org/

SPECIALIZED EXPERIMENTS
The KCID Dataset includes five specialized experiments:
Fixed Routes Experiment
- Vehicles: 2011 Chevrolet Traverse, 2017 Subaru Forester
- Drivers: male-30-55-3, male-30-55-4, male-over55-1, female-all-ages-1, female-all-ages-2, female-all-ages-5
- Location: Florida, USA (specific routes)
- Data collection methods: CSS Electronics CANEdge2, Kvaser Hybrid CAN-LIN
- Purpose: Capture CAN traces for specific, mappable routes; eliminate route-based variations in driver authentication data (e.g., low-speed local routes vs. high-speed long-distance routes)
OBD Requests and Responses Experiment
- Vehicle: 2011 Chevrolet Traverse
- Driver: female-all-ages-5
- Location: Florida, USA
- Data collection method: CSS Electronics CANEdge2
- Purpose: Capture OBD requests and responses
- Arbitration IDs: Requests: 0x7DF, Responses: 0x7E8
Tire Pressure Experiment
- Vehicle: 2011 Chevrolet Traverse
- Driver: female-all-ages-5
- Location: Florida, USA
- Data collection method: Kvaser Hybrid CAN-LIN
- Purpose: Capture normal and low tire pressure scenarios
- Applications: Detect tire pressure issues via CAN bus analysis; develop predictive maintenance strategies
Driving Modes and Features Experiment
- Vehicle: 2017 Ford Focus
- Driver: male-30-55-1
- Location: Denmark
- Data collection method: Korlan USB2CAN
- Purpose: Capture different driving (and non-driving) modes and features
- Examples: gear (park, reverse, neutral, drive, sport); headlights on/off
Stationary Vehicles Experiment
- Vehicles: 2024 Chevrolet Malibu, 2025 Toyota Corolla
- Driver: N/A (vehicles remained stationary)
- Location: Florida, USA
- Data collection method: Kvaser Hybrid CAN-LIN
- Purpose: Capture CAN bus traffic from very new, very modern vehicles; identify differences between an older vehicle's CAN bus (e.g., 2011 Chevrolet Traverse) and a newer vehicle's CAN bus (e.g., 2024 Chevrolet Malibu)

ADDITIONAL DOCUMENTATION
Each specialized experiment directory contains a detailed README.md file with specific information about the experiment and the data collected.

RESEARCH APPLICATIONS
This dataset supports various research areas:
- Driver authentication and driver fingerprinting
- Behavioral biometrics in the automotive domain
- Vehicle diagnostics and predictive maintenance
- Machine learning in the automotive domain
- CAN bus analysis and reverse engineering

CITATION
If you use the Kidmose CANid Dataset in your research, please cite appropriately. Citation information will be updated when our paper is published in a peer-reviewed venue.
Article citation:
- APA style: Kidmose, B. E., Kidmose, A. B., and Zou, C. C. (2025). A critical roadmap to driver authentication via CAN bus: Dataset review, introduction of the Kidmose CANid Dataset (KCID), and proof of concept. arXiv. https://arxiv.org/pdf/2510.25856
- MLA style: Kidmose, Brooke Elizabeth, Andreas Brasen Kidmose, and Cliff C. Zou. "A Critical Roadmap to Driver Authentication via CAN Bus: Dataset Review, Introduction of the Kidmose CANid Dataset (KCID), and Proof of Concept." arXiv, 2025. doi:10.48550/arXiv.2510.25856
- Chicago style: Kidmose, Brooke Elizabeth, Andreas Brasen Kidmose, and Cliff C. Zou. "A Critical Roadmap to Driver Authentication via CAN Bus: Dataset Review, Introduction of the Kidmose CANid Dataset (KCID), and Proof of Concept." arXiv (2025). doi:10.48550/arXiv.2510.25856
Dataset citation:
- APA style: Kidmose, B. E. and Kidmose, A. B. (2025). Kidmose CANid Dataset (KCID) v1. [Data set]. Technical University of Denmark. https://doi.org/10.11583/DTU.30483005.v1
- MLA style: Kidmose, Brooke Elizabeth, and Andreas Brasen Kidmose. "Kidmose CANid Dataset (KCID) v1." Technical University of Denmark, 30 Oct. 2025. Web. {Date accessed in dd mmm yyyy format}. doi:10.11583/DTU.30483005.v1
- Chicago style: Kidmose, Brooke Elizabeth, and Andreas Brasen Kidmose. 2025. "Kidmose CANid Dataset (KCID) v1." Technical University of Denmark. doi:10.11583/DTU.30483005.v1
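For the .csv traces, a minimal hedged loading sketch (the file path is a placeholder and the column layout depends on the capture tool):

```python
import pandas as pd

# Replace with the path to one of the CSV traces in the dataset.
trace = pd.read_csv("path/to/USA-FL-CANEdge-00000001.csv")
print(trace.shape)
print(trace.head())
```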
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
By [source]
This remarkable dataset provides an awe-inspiring collection of over 50,000 books, encompassing some of the world's best works of literature and poetry. For each book included in the dataset, users can access a wealth of insightful information such as the title; the author(s); the average rating given by readers and critics alike; a brief description highlighting its plot or characteristics; the language it is written in; the unique ISBN, which enables potential buyers to locate their favorite works with ease; the genres it belongs to; any awards it has won; and the characters that inhabit its storyworld.
Additionally, seeking out readers' opinions on exceptional books is made easier by the bbeScore (best books ever score), alongside detailed rating breakdowns in the ratingsByStars column. Whether a title is a classic novel from time immemorial or a recently released newcomer, this source also lets us evaluate stories based on reader engagement, captured by the likedPercent column (the percentage of readers who liked the book), bbeVotes (the number of votes cast), and entries related to publication dates, including the firstPublishDate!
Aspiring literature researchers, literary historians, and those seeking hidden literary gems alike would no doubt benefit from delving into this magnificent collection of 25 variables on different novels and poets, presented in the Kaggle open source dataset “Best Books Ever: A Comprehensive Historical Collection of Literary Greats”. What worlds await you?
Whether you are a student, researcher, or enthusiast of literature, this dataset provides a valuable source for exploring literary works from varied time periods and genres. With all 25 variables in the dataset, readers can build visualizations, create new analysis tools and models, or find books they might be interested in reading.
First, after downloading the dataset into the Kaggle Notebooks platform or another programming interface of your choice, such as RStudio or Python Jupyter Notebooks (Pandas), make sure the data is arranged into columns with clearly labeled titles. This will help you understand which variable relates to which piece of information. Afterwards, explore each variable by looking for patterns across particular titles or interesting findings about certain authors or ratings relevant to your research interests.
Start with the key columns Title (title), Author (author), Rating (rating), Description (description), Language (language), Genres (genres), and Characters (characters); these can help you discover trends between books according to style of composition, character types, and so on. Then examine the more specific details offered by Book Format (bookFormat), Edition (edition), and Pages (pages). Peruse publisher information along with Publish Date (publishDate). Also take note of the Awards column, considering the recognition different titles have received; observe how many ratings have been collected per text through the Number of Ratings column (numRatings); analyze readers' feedback via Ratings By Stars (ratingsByStars); and view the liked-percentage rate provided by readers for a particular book (likedPercent).
Beyond the more accessible factors mentioned previously, delve deeper into the more sophisticated data presented: Setting (setting), Cover Image (coverImg), BBE Score (bbeScore), and BBE Votes (bbeVotes). All of these should provide greater insight when trying to explain why a certain book has made its way onto the GoodReads top selections list! To estimate value, test out the Price (price) column too, to determine whether some texts retain large popularity despite rather costly publishing options currently available on the market.
Finally, combine the different aspects observed while researching individual titles to create personalized recommendations based on the released comprehensive lists. To achieve that, use the ISBN code provided, compare publication versus first publication dates, and check the awards labels to give context on how the books discussed here have progressed over the years.
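A hedged pandas sketch using a few of the columns named above (the CSV file name is a placeholder for the downloaded Kaggle file):

```python
import pandas as pd

books = pd.read_csv("best_books_ever.csv")

# Highest-rated books among those with a substantial number of ratings.
popular = books[books["numRatings"] > 10000]
top = popular.sort_values("rating", ascending=False)
print(top[["title", "author", "rating", "numRatings", "likedPercent"]].head(10))
```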
- Creating a web or mobile...
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
By [source]
This dataset contains a wealth of information on electrical half-hourly data for Great Britain from 2008 up to the present day. It is sourced from both the Elexon Portal and National Grid, providing an in-depth view into electricity supply and demand in the UK. It includes conventional generation, wind generation, nuclear generation, pumped storage, and imports and exports. With columns such as ELEXM_SETTLEMENT_DATE, ELEXM_SETTLEMENT_PERIOD, ELEXM_UTC, etc., this dataset is ideal for anyone looking to gain a truly comprehensive understanding of the current energy situation in Britain!
Introduction
This data set contains compiled and cleaned half-hourly electricity data for Great Britain. It is sourced from providers such as the Elexon Portal and National Grid, making it a great tool for studying electrical supply and demand in the UK. This guide provides an overview of the dataset and walks through the steps for using it effectively.
Getting Familiar with the Data: The first step is to get familiar with the dataset's features. Look at the available columns/variables, their descriptions and units, and their meaning. Under each column you'll find additional information about its data type (e.g., integer or float), which is helpful to understand before performing any kind of analysis. Another excellent way to explore the data is simply to print out a few rows of the table and inspect some example values from each column. Doing so should give you more clarity over what type of questions you can answer with your analyses, keeping in mind that not all datasets are suitable for addressing every potential research question.
Understanding Relationships: After getting familiar with the features and attributes, start understanding how they relate to each other. Consider variable characteristics, such as the presence or absence of correlations between certain columns, and construct relationships among the elements under study through operations like merging adjacent tables within the same framework, transforming raw input into meaningful knowledge as you complete each analytics task. This helps you gain insight into patterns present throughout the entire collection as well as in individual items, whether considered individually or collectively over time, leading towards the outputs needed to answer particular questions about the underlying trends in the data.
Performing Analyses: Finally, run analytical approaches directly on the information extracted in the previous step, for example within statistical environments like RStudio or Python's Pandas library. These tools let you build models and visualizations that reveal patterns and help interpret different feature combinations, with a particular focus on understanding the interdependencies and correlations among the variables studied, until you reach the insights required to solve the problem at hand and generate targeted solutions for the questions posed during the initial exploratory use cases.
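A minimal hedged sketch along these lines, assuming the ELEXM_UTC column is a parseable timestamp and using a placeholder file name:

```python
import pandas as pd

gb = pd.read_csv("gb_half_hourly_electricity.csv", parse_dates=["ELEXM_UTC"])

# Daily totals of the numeric columns as a first look at long-term trends.
daily = gb.set_index("ELEXM_UTC").resample("D").sum(numeric_only=True)
print(daily.head())
```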
- Analyzing long-term trends in electricity generation and consumption of different sources over time.
- Using machine learning algorithms to predict future energy consumption, production and pricing in the UK electricity market.
- Developing more efficient methods of powering homes, businesses and other organizations based on energy consumption patterns from this dataset
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy,...
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
Understanding the interaction between life history, demography and population genetics in threatened species is critical for the conservation of viable populations. In the context of habitat loss and fragmentation, identifying the factors that underpin the structuring of genetic variation within populations can allow conservationists to evaluate habitat quality and connectivity and help to design dispersal corridors effectively. In this study, we carried out a detailed, fine-scale landscape genetic investigation of a giant panda population for the first time, using a large microsatellite data set, and examined the role of isolation-by-barriers (IBB), isolation-by-distance (IBD) and isolation-by-resistance (IBR) in shaping the genetic variation pattern of giant pandas in the Qinling Mountains. We found that the Qinling population comprises one continuous genetic cluster, and among the landscape hypotheses tested, gene flow was found to be correlated with resistance gradients for two topographic factors, rather than geographical distance or barriers. Gene flow was inferred to be facilitated by easterly slope aspect and to be constrained by land surface with high topographic complexity. These factors are related to benign micro-climatic conditions for both the pandas and the food resources they rely on and more accessible topographic conditions for movement, respectively. We identified optimal corridors based on these results, aiming to promote gene flow between human-induced habitat fragments. These findings provide insight into the permeability and affinities of the giant panda habitat and offer an important reference for the conservation of the giant panda and its habitat.
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains historical stock price data for Walmart Inc. (WMT) from October 1, 1970, to January 31, 2025. The data includes key stock market indicators such as opening price, closing price, adjusted closing price, highest and lowest prices of the day, and trading volume. This dataset can be valuable for financial analysis, stock market trend prediction, and machine learning applications in quantitative finance.
The data has been collected from publicly available financial sources and covers over 13,000 trading days, providing a comprehensive view of Walmart’s stock performance over several decades.
Date: The trading date (e.g., 1970-10-01).
Open: The opening price of Walmart stock for the day.
High: The highest price reached during the trading session.
Low: The lowest price recorded during the trading session.
Close: The closing price at the end of the trading day.
Adj Close: The adjusted closing price, which accounts for stock splits and dividends.
Volume: The total number of shares traded on that particular day.
This dataset can be used for a variety of financial and data science applications, including:
✔ Stock Market Analysis – Study historical trends and price movements.
✔ Time Series Forecasting – Develop predictive models using machine learning.
✔ Technical Analysis – Apply moving averages, RSI, and other trading indicators.
✔ Market Volatility Analysis – Assess market fluctuations over different periods.
✔ Algorithmic Trading – Backtest trading strategies based on historical data.
No missing values.
Data spans over 50 years, ensuring long-term trend analysis.
Preprocessed and structured for easy use in Python, R, and other data science tools.
You can load the dataset using Pandas in Python:

```python
import pandas as pd

df = pd.read_csv("WMT_1970-10-01_2025-01-31.csv")
df.head()
```
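Building on the DataFrame above, a small hedged example of the technical-analysis use case (assuming the Close column documented above):

```python
# 50-day simple moving average and daily returns on the closing price.
df["SMA_50"] = df["Close"].rolling(window=50).mean()
df["daily_return"] = df["Close"].pct_change()
print(df[["Date", "Close", "SMA_50", "daily_return"]].tail())
```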
This dataset is provided for educational and research purposes. Please ensure proper attribution if used in projects or research.
This data set was scraped by Muhammad Atif Latif.
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
Sustainable cities depend on urban forests. City trees -- a pillar of urban forests -- improve our health, clean the air, store CO2, and cool local temperatures. Comparatively less is known about urban forests as ecosystems, particularly their spatial composition, nativity statuses, biodiversity, and tree health. Here, we assembled and standardized a new dataset of N=5,660,237 trees from 63 of the largest US cities. The data comes from tree inventories conducted at the level of cities and/or neighborhoods. Each data sheet includes detailed information on tree location, species, nativity status (whether a tree species is naturally occurring or introduced), health, size, whether it is in a park or urban area, and more (comprising 28 standardized columns per datasheet). This dataset could be analyzed in combination with citizen-science datasets on bird, insect, or plant biodiversity; social and demographic data; or data on the physical environment. Urban forests offer a rare opportunity to intentionally design biodiverse, heterogenous, rich ecosystems. Methods See eLife manuscript for full details. Below, we provide a summary of how the dataset was collected and processed.
Data Acquisition We limited our search to the 150 largest cities in the USA (by census population). To acquire raw data on street tree communities, we used a search protocol on both Google and Google Datasets Search (https://datasetsearch.research.google.com/). We first searched the city name plus each of the following: street trees, city trees, tree inventory, urban forest, and urban canopy (all combinations totaled 20 searches per city, 10 each in Google and Google Datasets Search). We then read the first page of google results and the top 20 results from Google Datasets Search. If the same named city in the wrong state appeared in the results, we redid the 20 searches adding the state name. If no data were found, we contacted a relevant state official via email or phone with an inquiry about their street tree inventory. Datasheets were received and transformed to .csv format (if they were not already in that format). We received data on street trees from 64 cities. One city, El Paso, had data only in summary format and was therefore excluded from analyses.
Data Cleaning All code used is in the zipped folder Data S5 in the eLife publication. Before cleaning the data, we ensured that all reported trees for each city were located within the greater metropolitan area of the city (for certain inventories, many suburbs were reported - some within the greater metropolitan area, others not). First, we renamed all columns in the received .csv sheets, referring to the metadata and according to our standardized definitions (Table S4). To harmonize tree health and condition data across different cities, we inspected metadata from the tree inventories and converted all numeric scores to a descriptive scale including “excellent,” “good”, “fair”, “poor”, “dead”, and “dead/dying”. Some cities included only three points on this scale (e.g., “good”, “poor”, “dead/dying”) while others included five (e.g., “excellent,” “good”, “fair”, “poor”, “dead”). Second, we used pandas in Python (W. McKinney & Others, 2011) to correct typos, non-ASCII characters, variable spellings, date format, units used (we converted all units to metric), address issues, and common name format. In some cases, units were not specified for tree diameter at breast height (DBH) and tree height; we determined the units based on typical sizes for trees of a particular species. Wherever diameter was reported, we assumed it was DBH. We standardized health and condition data across cities, preserving the highest granularity available for each city. For our analysis, we converted this variable to a binary (see section Condition and Health). We created a column called “location_type” to label whether a given tree was growing in the built environment or in green space. All of the changes we made, and decision points, are preserved in Data S9. Third, we checked the scientific names reported using gnr_resolve in the R library taxize (Chamberlain & Szöcs, 2013), with the option Best_match_only set to TRUE (Data S9). Through an iterative process, we manually checked the results and corrected typos in the scientific names until all names were either a perfect match (n=1771 species) or partial match with threshold greater than 0.75 (n=453 species). BGS manually reviewed all partial matches to ensure that they were the correct species name, and then we programmatically corrected these partial matches (for example, Magnolia grandifolia-- which is not a species name of a known tree-- was corrected to Magnolia grandiflora, and Pheonix canariensus was corrected to its proper spelling of Phoenix canariensis). Because many of these tree inventories were crowd-sourced or generated in part through citizen science, such typos and misspellings are to be expected. Some tree inventories reported species by common names only. Therefore, our fourth step in data cleaning was to convert common names to scientific names. We generated a lookup table by summarizing all pairings of common and scientific names in the inventories for which both were reported. We manually reviewed the common to scientific name pairings, confirming that all were correct. Then we programmatically assigned scientific names to all common names (Data S9). Fifth, we assigned native status to each tree through reference to the Biota of North America Project (Kartesz, 2018), which has collected data on all native and non-native species occurrences throughout the US states. 
Specifically, we determined whether each tree species in a given city was native to that state, not native to that state, or that we did not have enough information to determine nativity (for cases where only the genus was known). Sixth, some cities reported only the street address but not latitude and longitude. For these cities, we used the OpenCageGeocoder (https://opencagedata.com/) to convert addresses to latitude and longitude coordinates (Data S9). OpenCageGeocoder leverages open data and is used by many academic institutions (see https://opencagedata.com/solutions/academia). Seventh, we trimmed each city dataset to include only the standardized columns we identified in Table S4. After each stage of data cleaning, we performed manual spot checking to identify any issues.
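The authors' actual cleaning code is in Data S5; purely as an illustrative sketch of the condition-harmonization step described above (hypothetical scores and mapping):

```python
import pandas as pd

# Hypothetical numeric condition scores from one city's inventory.
trees = pd.DataFrame({"condition_raw": [5, 4, 3, 2, 1]})

# Map numeric scores onto the shared descriptive scale used across cities.
score_to_label = {5: "excellent", 4: "good", 3: "fair", 2: "poor", 1: "dead"}
trees["condition"] = trees["condition_raw"].map(score_to_label)
print(trees)
```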
This pizza sales dataset is made up of 12 relevant features:
- order_id: Unique identifier for each order placed by a table
- order_details_id: Unique identifier for each pizza placed within each order (pizzas of the same type and size are kept in the same row, and the quantity increases)
- pizza_id: Unique key identifier that ties the pizza ordered to its details, like size and price
- quantity: Quantity ordered for each pizza of the same type and size
- order_date: Date the order was placed (entered into the system prior to cooking & serving)
- order_time: Time the order was placed (entered into the system prior to cooking & serving)
- unit_price: Price of the pizza in USD
- total_price: unit_price * quantity
- pizza_size: Size of the pizza (Small, Medium, Large, X Large, or XX Large)
- pizza_type: Unique key identifier that ties the pizza ordered to its details, like size and price
- pizza_ingredients: ingredients used in the pizza as shown in the menu (they all include Mozzarella Cheese, even if not specified; and they all include Tomato Sauce, unless another sauce is specified)
- pizza_name: Name of the pizza as shown in the menu
For the Maven Pizza Challenge, you’ll be playing the role of a BI Consultant hired by Plato's Pizza, a Greek-inspired pizza place in New Jersey. You've been hired to help the restaurant use data to improve operations, and just received the following note:
Welcome aboard, we're glad you're here to help!
Things are going OK here at Plato's, but there's room for improvement. We've been collecting transactional data for the past year, but really haven't been able to put it to good use. Hoping you can analyze the data and put together a report to help us find opportunities to drive more sales and work more efficiently.
Here are some questions that we'd like to be able to answer:
- What days and times do we tend to be busiest?
- How many pizzas are we making during peak periods?
- What are our best and worst-selling pizzas?
- What's our average order value?
- How well are we utilizing our seating capacity? (we have 15 tables and 60 seats)
That's all I can think of for now, but if you have any other ideas I'd love to hear them – you're the expert!
Thanks in advance,
Mario Maven (Manager, Plato's Pizza)
The public dataset is available on the Maven Analytics website, which stores and consolidates all of its available datasets for analysis in the Data Playground. The specific dataset can be obtained at the link below: https://www.mavenanalytics.io/blog/maven-pizza-challenge
📌 I set up the data model to include all the related data in one single table, so obtaining data for analysis is easier.
Complete details about the challenge are also provided at the link, if you are interested. The purpose of uploading the dataset here is to conduct exploratory data analysis beforehand with Pandas and data visualization libraries, in order to review the data comprehensively and translate my findings and insights into a single-page visualization.
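As a minimal sketch of that exploratory analysis (assuming the single flat table and the column names from the data dictionary above; the file name pizza_sales.csv is a placeholder), a few of the manager's questions could be approached with pandas like this:

```python
import pandas as pd

# Placeholder file name; columns follow the data dictionary above.
df = pd.read_csv("pizza_sales.csv", parse_dates=["order_date"])

# Average order value: total revenue divided by the number of distinct orders.
avg_order_value = df["total_price"].sum() / df["order_id"].nunique()

# Busiest hours: pizzas sold per hour of the day (assumes HH:MM:SS time strings).
df["hour"] = pd.to_datetime(df["order_time"], format="%H:%M:%S").dt.hour
pizzas_per_hour = df.groupby("hour")["quantity"].sum().sort_values(ascending=False)

# Best- and worst-selling pizzas by quantity sold.
by_pizza = df.groupby("pizza_name")["quantity"].sum().sort_values()

print(f"Average order value: ${avg_order_value:.2f}")
print(pizzas_per_hour.head())
print("Worst seller:", by_pizza.index[0], "| Best seller:", by_pizza.index[-1])
```

Answering the seating-utilization question would likely require an additional assumption about average dining time, since the table records order timestamps rather than table occupancy.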
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Overview: This dataset offers a comprehensive collection of daily weather readings from major cities around the world. The first release included only capital cities, but the dataset now also covers major cities worldwide, along with hourly data, bringing the total to roughly 1,250 cities. Some locations provide historical data tracing back to January 2, 1833, giving users a deep dive into long-term weather patterns and their evolution.
Data License and Updates: This dataset is updated every Sunday using data from the Meteostat API, ensuring access to the latest week's data without overburdening the data source.
cities.csv: This dataframe offers details about individual cities and weather stations.
- Columns:
- station_id: Unique ID for the weather station.
- city_name: Name of the city.
- country: The country where the city is located.
- state: The state or province within the country.
- iso2: The two-letter country code.
- iso3: The three-letter country code.
- latitude: Latitude coordinate of the city.
- longitude: Longitude coordinate of the city.
countries.csv: This dataframe contains information about different countries, providing insights into their geographic and demographic characteristics.
- Columns:
- iso3: The three-letter code representing the country.
- country: The English name of the country.
- native_name: The native name of the country.
- iso2: The two-letter code representing the country.
- population: The population of the country.
- area: The total land area of the country in square kilometers.
- capital: The name of the capital city.
- capital_lat: The latitude coordinate of the capital city.
- capital_lng: The longitude coordinate of the capital city.
- region: The specific region within the continent where the country is located.
- continent: The continent to which the country belongs.
- hemisphere: The hemisphere in which the country is located (e.g., Northern, Southern).
daily_weather.parquet: This dataframe provides weather data on a daily basis.
- Columns:
- station_id: Unique ID for the weather station.
- city_name: Name of the city where the station is located.
- date: Date of the weather record.
- season: Season corresponding to the date (e.g., summer, winter).
- avg_temp_c: Average temperature in Celsius.
- min_temp_c: Minimum temperature in Celsius.
- max_temp_c: Maximum temperature in Celsius.
- precipitation_mm: Precipitation in millimeters.
- snow_depth_mm: Snow depth in millimeters.
- avg_wind_dir_deg: Average wind direction in degrees.
- avg_wind_speed_kmh: Average wind speed in kilometers per hour.
- peak_wind_gust_kmh: Peak wind gust in kilometers per hour.
- avg_sea_level_pres_hpa: Average sea-level pressure in hectopascals.
- sunshine_total_min: Total sunshine duration in minutes.
These dataframes can be utilized for various analyses such as weather trend prediction, climate studies, geographic analysis, demographic insights, and more.
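As a minimal sketch of such an analysis (file paths are placeholders, and reading the parquet file assumes pyarrow or fastparquet is installed), the three files could be combined with pandas as follows:

```python
import pandas as pd

# Placeholder paths; reading the parquet file requires pyarrow or fastparquet.
cities = pd.read_csv("cities.csv")
countries = pd.read_csv("countries.csv")
weather = pd.read_parquet("daily_weather.parquet")

# Attach country and continent metadata to each station
# (assumes the country names in cities.csv match those in countries.csv).
weather = weather.merge(cities[["station_id", "country"]], on="station_id", how="left")
weather = weather.merge(countries[["country", "continent"]], on="country", how="left")

# Example: mean of the daily average temperature per continent and year.
weather["year"] = pd.to_datetime(weather["date"]).dt.year
annual = (
    weather.groupby(["continent", "year"])["avg_temp_c"]
    .mean()
    .reset_index(name="mean_avg_temp_c")
)
print(annual.head())
```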
Dataset Image Source: Photo credits to 越过山丘. View the original image here.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
We utilized a dataset of machine design materials, which includes information on their mechanical properties. The dataset was obtained from the Autodesk Material Library and comprises 15 columns, also referred to as features/attributes. It is a real-world dataset and does not contain any random values; however, due to missing values, we only utilized seven of these columns for our ML model. You can access the related GitHub repository here: https://github.com/purushottamnawale/material-selection-using-machine-learning
To develop an ML model, we employed several Python libraries, including NumPy, pandas, scikit-learn, and graphviz, alongside other technologies such as Weka, MS Excel, VS Code, Kaggle, Jupyter Notebook, and GitHub. We used Weka to quickly visualize the data and understand the relationships between the features, without requiring any programming expertise.
My problem statement is material selection for an EV chassis, so if you have any specific ideas, feel free to implement them and add your code on Kaggle.
A detailed research paper is available at https://iopscience.iop.org/article/10.1088/1742-6596/2601/1/012014
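Purely as an illustrative sketch, and not the repository's actual pipeline, a scikit-learn decision tree could be trained on the mechanical-property columns roughly as follows. The file path and column names ("Su", "Sy", "E", "G", "mu", "Ro", "Use") are hypothetical placeholders, and export_text is used here instead of graphviz to keep the example dependency-free.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical file path and column names -- adjust them to the actual CSV in the repository.
df = pd.read_csv("material_properties.csv")
features = ["Su", "Sy", "E", "G", "mu", "Ro"]   # placeholder mechanical-property columns
target = "Use"                                  # placeholder suitability label

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.2, random_state=42
)

clf = DecisionTreeClassifier(max_depth=4, random_state=42)
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))
print(export_text(clf, feature_names=features))
```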
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
This public dataset contains data on public and private insurance companies provided by IRDAI (Insurance Regulatory and Development Authority of India) from 2013 to 2022. It is multi-index data and makes great practice for honing manipulation of pandas multi-index dataframes. The dataset focuses mainly on the companies' business (total premiums and number of policies), subscription information (number of people covered), claims incurred, and the network hospitals enrolled by Third Party Administrators.
The Excel file contains the following data:

| Table No. | Contents |
| --- | --- |
| **A** | **III.A: HEALTH INSURANCE BUSINESS OF GENERAL AND HEALTH INSURERS** |
| 62 | Health Insurance - Number of Policies, Number of Persons Covered and Gross Premium |
| 63 | Personal Accident Insurance - Number of Policies, Number of Persons Covered and Gross Premium |
| 64 | Overseas Travel Insurance - Number of Policies, Number of Persons Covered and Gross Premium |
| 65 | Domestic Travel Insurance - Number of Policies, Number of Persons Covered and Gross Premium |
| 66 | Health Insurance - Net Premium Earned, Incurred Claims and Incurred Claims Ratio |
| 67 | Personal Accident Insurance - Net Premium Earned, Incurred Claims and Incurred Claims Ratio |
| 68 | Overseas Travel Insurance - Net Earned Premium, Incurred Claims and Incurred Claims Ratio |
| 69 | Domestic Travel Insurance - Net Earned Premium, Incurred Claims and Incurred Claims Ratio |
| 70 | Details of Claims Development and Aging - Health Insurance Business |
| 71 | State-wise Health Insurance Business |
| 72 | State-wise Individual Health Insurance Business |
| 73 | State-wise Personal Accident Insurance Business |
| 74 | State-wise Overseas Insurance Business |
| 75 | State-wise Domestic Insurance Business |
| 76 | State-wise Claims Settlement under Health Insurance Business |
| **B** | **III.B: HEALTH INSURANCE BUSINESS OF LIFE INSURERS** |
| 77 | Health Insurance Business in respect of Products offered by Life Insurers - New Business |
| 78 | Health Insurance Business in respect of Products offered by Life Insurers - Renewal Business |
| 79 | Health Insurance Business in respect of Riders attached to Life Insurance Products - New Business |
| 80 | Health Insurance Business in respect of Riders attached to Life Insurance Products - Renewal Business |
| **C** | **III.C: OTHERS** |
| 81 | Network Hospitals Enrolled by TPAs |
| 82 | State-wise Details on Number of Network Providers |
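As a minimal sketch of working with these tables as pandas MultiIndex dataframes (the file name, sheet name, and header layout below are hypothetical, since the exact Excel structure varies by table):

```python
import pandas as pd

# Hypothetical file name, sheet name, and header layout; reading .xlsx requires openpyxl.
df = pd.read_excel(
    "irdai_health_insurance.xlsx",
    sheet_name="Table 62",
    header=[0, 1],      # e.g. a (metric, year) column MultiIndex
    index_col=0,        # e.g. insurer name as the row index
)

# Typical MultiIndex manipulations:
premiums = df.xs("Gross Premium", axis=1, level=0)   # select one metric across all years
long_form = df.stack(level=1).reset_index()          # reshape (insurer, year) into long format
print(premiums.head())
print(long_form.head())
```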
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
By Philipp Schmid (From Huggingface) [source]
The dataset is provided in two separate files: train.csv and test.csv. The train.csv file contains a substantial amount of labeled data with columns for the text data itself, as well as their corresponding binary and multi-class labels. This enables users to develop and train machine learning models effectively using this dataset.
Similarly, test.csv includes additional examples for evaluating pre-trained models or assessing model performance after training on train.csv. It follows a similar structure as train.csv with columns representing text data, binary labels, and multi-class labels.
Its rich content, its extensive labeling scheme for binary and multi-class classification tasks, and its ease of use thanks to the tabular CSV format make this dataset an excellent choice for anyone looking to advance their NLP capabilities through diverse text classification challenges.
How to Use this Dataset for Text Classification
This guide will provide you with useful information on how to effectively utilize this dataset for your text classification projects.
Understanding the Columns
The dataset consists of several columns, each serving a specific purpose:
text: This column contains the actual text data that needs to be classified. It is the primary feature for your modeling task.
binary: This column represents the binary classification label associated with each text entry. The label indicates whether the text belongs to one class or another. For example, it could be used to classify emails as either spam or not spam.
multi: This column represents the multi-class classification label associated with each text entry. The label indicates which class or category the text belongs to out of multiple possible classes. For instance, it can be used to categorize news articles into topics like sports, politics, entertainment, etc.
Dataset Files
The dataset is provided in two files: train.csv and test.csv.
- train.csv: This file contains a subset of labeled data specifically intended for training your models. It includes columns for the text data and their corresponding binary and multi-class labels.
- test.csv: In order to evaluate your trained models' performance on unseen data, this file provides additional examples similar in structure and format to train.csv. It includes columns for the texts and their respective binary and multi-class labels as well.
Getting Started
To make use of this dataset effectively, here are some steps you can follow:
- Download both train.csv and test.csv files containing labeled examples.
- Load these datasets into your preferred machine learning environment (such as Python with libraries like Pandas or Scikit-learn).
- Explore the dataset by examining its structure, summary statistics, and visualizations.
- Preprocess the text data as needed, which may include techniques like tokenization, removing stop words, stemming/lemmatizing, and encoding text into numerical representations (such as bag-of-words or TF-IDF vectors).
- Consider splitting the train.csv data further into training and validation sets for model development and evaluation.
- Select appropriate machine learning algorithms for your text classification task (e.g., Naive Bayes, Logistic Regression, Support Vector Machines) and train them.
- Sentiment Analysis: The dataset can be used to classify text data into positive or negative sentiment, based on the binary classification label. This can be helpful in analyzing customer reviews, social media sentiment, and feedback analysis.
- Topic Categorization: The multi-class classification label can be used to categorize text into different topics or themes. This can be useful in organizing large amounts of text data, such as news articles or research papers.
- Spam Detection: The binary classification label can be used to identify whether a text message or email is spam or not. This can help users filter out unwanted messages and improve their overall communication experience.

Overall, this dataset provides an opportunity to create models for various applications of text classification, such as sentiment analysis, topic categorization, and spam detection.
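As a minimal sketch of the workflow outlined above, the snippet below combines TF-IDF features with logistic regression in a scikit-learn pipeline. The column names follow the data dictionary ("text", "binary", "multi"), and logistic regression stands in for any of the suggested algorithms.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

# Column names follow the data dictionary above: "text", "binary", "multi".
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# TF-IDF features followed by logistic regression (one of the suggested algorithm choices).
model = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)

# Train on the binary label; swap in train["multi"] / test["multi"] for the multi-class task.
model.fit(train["text"], train["binary"])
print(classification_report(test["binary"], model.predict(test["text"])))
```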
If you use this dataset in your research, please credit the original authors. [Data Source](https://huggingface.co/datase...