License: Apache License v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Abstract This project presents a comprehensive analysis of a company's annual sales, using the classic dataset classicmodels as the database. Python is used as the main programming language, along with the Pandas, NumPy and SQLAlchemy libraries for data manipulation and analysis, and PostgreSQL as the database management system.
The main objective of the project is to answer key questions related to the company's sales performance, such as: Which were the most profitable products and customers? Were sales goals met? The results obtained serve as input for strategic decision making in future sales campaigns.
Methodology
1. Data Extraction (sketched below):
2. Data Cleansing and Transformation:
3. Exploratory Data Analysis (EDA):
4. Modeling and Prediction:
5. Report Generation:
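As a hedged illustration of step 1, the sketch below pulls order data from PostgreSQL into Pandas with SQLAlchemy; the connection string, credentials, and query are examples based on the standard classicmodels schema, not the project's actual code.

```python
# Minimal extraction sketch: classicmodels order lines into a DataFrame.
import pandas as pd
from sqlalchemy import create_engine

# Example connection string; replace user, password, and host with your own.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/classicmodels")

query = """
    SELECT o.orderNumber, od.productCode, od.quantityOrdered, od.priceEach
    FROM orders o
    JOIN orderdetails od ON o.orderNumber = od.orderNumber
"""
sales = pd.read_sql(query, engine)
print(sales.head())
```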
Results
- Identification of top products and customers: The best-selling products and the customers that generate the most revenue are identified.
- Analysis of sales trends: Sales trends over time are analyzed and possible factors that influence sales behavior are identified.
- Calculation of key metrics: Metrics such as average profit margin and sales growth rate are calculated.
Conclusions This project demonstrates how Python and PostgreSQL can be effectively used to analyze large data sets and obtain valuable insights for business decision making. The results obtained can serve as a starting point for future research and development in the area of sales analysis.
Technologies Used
- Python: Pandas, NumPy, SQLAlchemy, Matplotlib/Seaborn
- Database: PostgreSQL
- Tools: Jupyter Notebook
- Keywords: data analysis, Python, PostgreSQL, Pandas, NumPy, SQLAlchemy, EDA, sales, business intelligence
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
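For instance, a minimal sketch (the file name below is a placeholder for whichever daily or hourly CSV you downloaded):

```python
import pandas as pd

# Replace with the actual path/name of the LifeSnaps CSV you want to load.
daily = pd.read_csv("path/to/lifesnaps_daily.csv")
print(daily.head())
```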
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
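For example, the Fitbit command would then become (placeholders in angle brackets; depending on your setup you may also need --authenticationDatabase):
mongorestore --host localhost:27017 --username <your_username> --password <your_password> -d rais_anonymized -c fitbit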
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:
{
_id:
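Once restored, a minimal pymongo sketch for inspecting the collections (assuming a default local MongoDB instance and the database name used in the commands above):

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["rais_anonymized"]

# Rough document counts per collection, plus a peek at one Fitbit record.
for name in ("fitbit", "sema", "surveys"):
    print(name, db[name].estimated_document_count())
print(db["fitbit"].find_one())
```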
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Introduction
This Dataverse record contains data for reproducing the results in our corresponding journal article. For more information about the computational protocols used to generate the data, please see the journal article or the ChemRxiv entry (see below).

How to use
This data set contains two data files: molecular coordinates (ALL_GEOMETRIES.txt) and metal-ligand interaction energy data (Raw_Data.csv). These formats lend themselves to easy preparation and analysis with Python. For example, in order to load the data set into a Pandas DataFrame, do the following:

import pandas as pd
data = pd.read_csv('Raw_Data.csv')

You can prepare a list of all geometries in the following way:

with open('ALL_GEOMETRIES.txt') as f:
    raw_string = f.read()
molecules = [mol.split(' ') for mol in raw_string.split('\n')]

The ReadMe file contains descriptions of all data fields found in Raw_Data.csv. All energies are given in Hartrees, and all geometries are given in Angströms.

Journal article
Brakestad et al. "Multiwavelets applied to metal–ligand interactions: Energies free from basis set errors". J. Chem. Phys. (2021)

Abstract from journal article
Transition metal-catalyzed reactions invariably include steps where ligands associate or dissociate. In order to obtain reliable energies for such reactions, sufficiently large basis sets need to be employed. In this paper, we have used high-precision multiwavelet calculations to compute the metal–ligand association energies for 27 transition metal complexes with common ligands, such as H2, CO, olefins, and solvent molecules. By comparing our multiwavelet results to a variety of frequently used Gaussian-type basis sets, we show that counterpoise corrections, which are widely employed to correct for basis set superposition errors, often lead to underbinding. Additionally, counterpoise corrections are difficult to employ when the association step also involves a chemical transformation. Multiwavelets, which can be conveniently applied to all types of reactions, provide a promising alternative for computing electronic interaction energies free from any basis set errors.

ChemRxiv record
https://doi.org/10.26434/chemrxiv.13669951.v1
License: Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
The dataset was fetched on 8 September 2023 at 18:00 London time.
The dataset was generated using a web scraping script written in Python, utilizing the Scrapy library. The script navigates through IMDb's list of animations originating from Japan, scraping relevant information from each listing. The spider starts from the URL https://www.imdb.com/search/title/?genres=Animation&countries=jp and follows the "Next" links to traverse through multiple pages of listings.
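For illustration only, a hedged sketch of such a spider is shown below; the CSS selectors and field names are assumptions (IMDb's markup changes frequently), and the actual scraping script is not distributed with this dataset.

```python
import scrapy


class JapanAnimationSpider(scrapy.Spider):
    # Hypothetical spider mirroring the description above.
    name = "japan_animation"
    start_urls = [
        "https://www.imdb.com/search/title/?genres=Animation&countries=jp"
    ]

    def parse(self, response):
        # Extract a few representative fields from each listing block.
        for item in response.css("div.lister-item-content"):
            yield {
                "title": item.css("h3 a::text").get(),
                "year": item.css("span.lister-item-year::text").get(),
                "rating": item.css("div.ratings-imdb-rating strong::text").get(),
            }
        # Follow the "Next" link to traverse multiple pages of listings.
        next_page = response.css("a.lister-page-next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```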
The dataset provides a comprehensive view of various animations listed on IMDb that are categorized under the genre "Animation" and originate from Japan. It includes details such as the title, genre, user rating, number of votes, runtime, year of release, summary, stars, certificate, metascore, gross earnings, episode flag, and episode title when applicable.
However, the dataset also includes some animations not regarded as Japanese anime, e.g. the Toy Story films. This is because the listings can only be filtered by region, not by the origin of production.
- Title: The name of the animation.
- Genre: The genre(s) under which the animation falls, e.g., Action, Adventure, etc.
- User Rating: The IMDb user rating out of 10.
- Number of Votes: The total number of IMDb users who have rated the animation.
- Runtime: The duration of the animation in minutes.
- Year: The year the animation was released or started airing.
- Summary: A brief or full summary of the animation's plot. Full summaries are fetched when available.
- Stars: List of main actors or voice actors involved in the animation.
- Certificate: The certification of the animation, e.g., PG, PG-13, etc.
- Metascore: The Metascore rating, if available, which is an aggregated score from various critics.
- Gross: The gross earnings or box office collection of the animation.
- Episode: A binary flag indicating whether the listing is for an episode of a series (1 for yes, 0 for no).
- Episode Title: The title of the episode if the listing is for an episode; otherwise, it will be None.
Exploratory Data Analysis (EDA) Genre Popularity: Analyze which genres are most popular based on user ratings and number of votes. Year-wise Trends: Examine how the popularity of anime has evolved over the years.
Predictive Modeling Rating Prediction: Use machine learning algorithms to predict the rating of an anime based on features like genre, runtime, and stars. Success Prediction: Predict the financial success (Gross earnings) of an anime based on various features.
Content Recommendation Personalized Recommendations: Use user ratings and genre information to build a recommendation system.
Sentiment Analysis Summary Sentiment: Perform sentiment analysis on the summary to see if the tone of the summary correlates with user ratings or other features.
Network Analysis Actor Collaboration: Create a network graph to analyze frequent collaborations between actors.
Time-Series Analysis Rating Over Time: Analyze how ratings evolve over time for long-running series.
Market Research Target Audience: Use the certificate and genre information to identify target demographics for marketing anime-related products.
Academic Research Cultural Impact: Study the cultural impact of anime by analyzing its popularity, genres, and actors.
Data Visualization Interactive Dashboards: Create dashboards to visualize the data and allow users to filter by various criteria like genre, year, or rating.
Natural Language Processing (NLP) Topic Modeling: Use NLP techniques to identify common themes or topics in the summaries.
By leveraging Python for data analysis, you can use libraries like Pandas for data manipulation, Matplotlib and Seaborn for data visualization, and scikit-learn for machine learning to extract valuable insights from this dataset.
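A small hedged example of the EDA ideas above, assuming the scraped listings were saved to a CSV; the file name and exact column headers are assumptions and may differ in the actual export:

```python
import pandas as pd

df = pd.read_csv("imdb_japan_animation.csv")

# Genre popularity: average rating and total votes per genre
# (multi-genre rows may need splitting first).
genre_stats = df.groupby("Genre").agg(
    {"User Rating": "mean", "Number of Votes": "sum"}
).sort_values("Number of Votes", ascending=False)
print(genre_stats.head())

# Year-wise trend: average user rating per release year.
print(df.groupby("Year")["User Rating"].mean().tail(10))
```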
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
By Ændrew Rininsland [source]
This dataset provides an eye-opening look into the characters and actors involved in the globally acclaimed TV series Game of Thrones. By examining each character's screen time, episode count, and the IMDB URL of the actor or actress portraying them, one can gain remarkable insight into which characters have seized the spotlight in this epic saga. The list was compiled by ninewheels0 on IMDB; it took a long time to amass and deserves appreciation. Each character's screen time is measured in minutes with fractional seconds (i.e., 1.5 minutes means one minute and thirty seconds). Since screen time is a rough measure of how memorable a character is to viewers around the world, the data shows how much of our attention each actor has claimed across multiple seasons.
How to Use the GoT Characters Screen Time Dataset
This Kaggle dataset contains information about the screen time of characters in the Game of Thrones TV series, including their name, IMDB URL, screen time (in minutes with fractional seconds), number of episodes appeared in, and the actor/actress portraying them. It is a helpful resource for fan theorists who want to further study character arcs and build theories around key points in the story.
First off, make sure you acknowledge ninewheels0 on IMDB, who created this list before it was uploaded to Kaggle. Read the “About this dataset” section carefully before getting started so that all sources and credits are given out properly.
To begin studying and analyzing this data set, you can use various software tools for analyzing and visualizing large data sets. Python's pandas library lets you easily study every column provided, such as name or portrayed_by_name. You can also use Tableau, which lets you turn selected columns into charts or graphs so that patterns are easily found within large datasets. Excel can be used for similar purposes, but it is far less convenient once you need more than a few interactions between different elements of the CSV file.
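As a hedged starting point in pandas (the column names below are guesses based on the description; check the CSV header and adjust):

```python
import pandas as pd

got = pd.read_csv("GOT_screentimes.csv")

# Characters ranked by total screen time (minutes, with fractional seconds).
print(got.sort_values("screentime", ascending=False).head(10))

# Rough prominence measure: minutes on screen per episode appeared in.
got["minutes_per_episode"] = got["screentime"] / got["episodes"]
print(got.nlargest(10, "minutes_per_episode")[["name", "minutes_per_episode"]])
```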
When analyzing these datasets, it is important to note down the key questions you want to answer and to understand what information is already present: What correlations exist? What could affect a certain element? Is there something specific I want to uncover through my analysis? After deciding on those major points, look at the distinct elements present within each column, such as its highest (max) and lowest (min) values. Then check whether the patterns you find are consistent, and watch for outliers before drawing any conclusions; sometimes outliers influence results more than expected, and sometimes they provide little insight and only confuse the story. Knowing how each element interacts with other variables in the same dataset helps when examining the relationship between two separate items in the same file.
Finally, after working through the steps described above, you can start drawing parallels between different parts of the dataset. Once enough observations have been made, you may have the evidence needed to finish most of the research phase, such as confirming or rejecting a hypothesis.
- Create an interactive feature on a website/app to compare the screentime across all characters in Game of Thrones
- Analyze how the screentime of characters evolve over time and seasons
- Using ML algorithms, explore different patterns in the data to identify relationships between screen time and other factors such as character gender, type etc
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: GOT_screentimes.csv | Column...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset used in this study was from the preliminary competition dataset of the 2018 Guangdong Industrial Intelligent Manufacturing Big Data Intelligent Algorithm Competition organized by Tianchi Feiyue Cloud (https://tianchi.aliyun.com/competition/entrance/231682/introduction). We selected from this dataset, removing images that do not meet the requirements of our experiment. All data have been split for training and testing. The image resolution is 2560×1960 pixels. Before training, all defects need to be labeled using labelImg and saved as json files. Then, all json files are converted to txt files. Finally, the organized defect dataset is used for detection and classification.

Description of the data and file structure
This is a project based on a YOLOv8 enhanced algorithm for aluminum defect classification and detection tasks. All code has been tested on Windows computers with Anaconda and CUDA-enabled GPUs. The following instructions allow users to run the code in this repository on a Windows + CUDA GPU system.

Files and variables
File: defeat_dataset.zip

Setup
Please follow the steps below to set up the project.

Download project repository
- Download the project repository defeat_dataset.zip from the following location.
- Unzip and navigate to the project folder; it should contain a subfolder: quexian_dataset

Download data
1. Download the data: defeat_dataset.zip
2. Unzip the downloaded data and move the 'defeat_dataset' folder into the project's main folder.
3. Make sure that your defeat_dataset folder now contains a subfolder: quexian_dataset.
4. Within the folder you should find various subfolders such as addquexian-13, quexian_dataset, new_dataset-13, etc.

Software
Set up the Python environment:
1. Download and install Anaconda.
2. Once Anaconda is installed, open the Anaconda Prompt. For Windows, click Start, search for Anaconda Prompt, and open it.
3. Create a new conda environment with Python 3.8. You can name it whatever you like, for example yolov8. Enter the following command: conda create -n yolov8 python=3.8
4. Activate the created environment. If the name is yolov8, enter: conda activate yolov8
5. Download and install Visual Studio Code.
6. Install PyTorch based on your system. For Windows/Linux users with a CUDA GPU: conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
7. Install the remaining necessary libraries:
   - Install scikit-learn with the command: conda install -c anaconda scikit-learn=0.24.1
   - Install astropy with: conda install astropy=4.2.1
   - Install pandas using: conda install -c anaconda pandas=1.2.4
   - Install Matplotlib with: conda install -c conda-forge matplotlib=3.5.3
   - Install scipy by entering: conda install scipy=1.10.1

Repeatability
For PyTorch, it's a well-known fact that there is no guarantee of fully reproducible results between PyTorch versions, individual commits, or different platforms. In addition, results may not be reproducible between CPU and GPU executions, even if the same seed is used. All results in the Analysis Notebook that involve only model evaluation are fully reproducible. However, when it comes to updating the model on the GPU, the results of model training on different machines vary.

Access information
Other publicly accessible locations of the data: https://tianchi.aliyun.com/dataset/public/
Data was derived from the following source: https://tianchi.aliyun.com/dataset/140666

Data availability statement
The ten defect classes used in this study come from the Guangdong Industrial Wisdom Big Data Innovation Competition - Intelligent Algorithm Competition Rematch; the dataset download link is https://tianchi.aliyun.com/competition/entrance/231682/information?lang=en-us. The official website provides 4,356 images, including single-defect images, multi-defect images and defect-free images. We selected only the single-defect and multi-defect images, 3,233 images in total. The ten defects are non-conductive, effacement, miss bottom corner, orange peel, varicolored, jet, lacquer bubble, jump into a pit, divulge the bottom and blotch. Each image contains one or more defects, and the resolution of the defect images is 2560×1920.
By investigating the literature, we found that most experiments use these 10 types of defects, so we chose three additional defect types that differ more from these ten and have more samples, which are suitable for the experiments. The three newly added defect types come from the preliminary dataset of the Guangdong Industrial Wisdom Big Data Intelligent Algorithm Competition, which can be downloaded from https://tianchi.aliyun.com/dataset/140666. It contains 3,000 images in total, among which 109, 73 and 43 images are for the defects of bruise, camouflage and coating cracking, respectively. Finally, the 10 types of defects from the rematch and the 3 types of defects selected from the preliminary round are fused into a new dataset, which is the dataset examined in this study.
In processing the dataset, we tried different division ratios, such as 8:2, 7:3, 7:2:1, etc. After testing, we found that the experimental results did not differ much across division ratios. Therefore, we divided the dataset according to the ratio 7:2:1: the training set accounts for 70%, the validation set for 20%, and the testing set for 10%. At the same time, the random number seed is set to 0 to ensure that the results are consistent every time the model is trained.
Finally, the mean Average Precision (mAP) metric was measured on the dataset a total of three times. Each time the results differed very little, but for the accuracy of the experimental results, we took the average of the highest and lowest results. The highest was 71.5% and the lowest was 71.1%, giving an average detection accuracy of 71.3% for the final experiment.
All data and images utilized in this research are from publicly available sources, and the original creators have given their consent for these materials to be published in open-access formats.
The settings for the other parameters are as follows: epochs: 200, patience: 50, batch: 16, imgsz: 640, pretrained: true, optimizer: SGD, close_mosaic: 10, iou: 0.7, momentum: 0.937, weight_decay: 0.0005, box: 7.5, cls: 0.5, dfl: 1.5, pose: 12.0, kobj: 1.0, save_dir: runs/train.
The defeat_dataset.zip is mentioned in the Supporting information section of our manuscript. The underlying data are held at Figshare, DOI: 10.6084/m9.figshare.27922929. The results_images.zip in the system contains the experimental result graphs. The images_1.zip and images_2.zip in the system contain all the images needed to generate the manuscript from manuscript.tex.
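The authors' training scripts ship with the project repository; purely as a hedged sketch, a YOLOv8 run with the hyperparameters listed above might look like the following (the ultralytics package and the dataset YAML path are assumptions):

```python
from ultralytics import YOLO

# Pretrained weights, consistent with "pretrained: true" above.
model = YOLO("yolov8n.pt")
model.train(
    data="quexian_dataset.yaml",  # placeholder path to the defect dataset config
    epochs=200,
    patience=50,
    batch=16,
    imgsz=640,
    optimizer="SGD",
    momentum=0.937,
    weight_decay=0.0005,
    close_mosaic=10,
    iou=0.7,
    seed=0,
)
```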
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains a selection of behavioral datasets collected using soluble agents and labeled using realistic threat simulation and IDS rules. The collected datasets are anonymized and aggregated using time-window representations. The dataset generation pipeline preprocesses the application logs from the corporate network, structures them according to the entity and user inventory, and labels them based on the IDS and phishing simulation appliances.
This repository is associated with the article "RBD24: A labelled dataset with risk activities using log applications data" published in the journal Computers & Security. For more information go to https://doi.org/10.1016/j.cose.2024.104290
The RBD24 dataset comprises various risk activities collected from real entities and users over a period of 15 days, with the samples segmented by Desktop (DE) and Smartphone (SM) devices.
| DatasetId | Entity | Observed Behaviour | Groundtruth | Sample Shape |
| --- | --- | --- | --- | --- |
| Crypto_desktop.parquet | DE | Miner Checking | IDS | 0: 738/161202, 1: 11/1343 |
| Crypto_smarphone.parquet | SM | Miner Checking | IDS | 0: 613/180021, 1: 4/956 |
| OutFlash_desktop.parquet | DE | Outdated software components | IDS | 0: 738/161202, 1: 56/10820 |
| OutFlash_smartphone.parquet | SM | Outdated software components | IDS | 0: 613/180021, 1: 22/6639 |
| OutTLS_desktop.parquet | DE | Outdated TLS protocol | IDS | 0: 738/161202, 1: 18/2458 |
| OutTLS_smartphone.parquet | SM | Outdated TLS protocol | IDS | 0: 613/180021, 1: 11/2930 |
| P2P_desktop.parquet | DE | P2P Activity | IDS | 0: 738/161202, 1: 177/35892 |
| P2P_smartphone.parquet | SM | P2P Activity | IDS | 0: 613/180021, 1: 94/21688 |
| NonEnc_desktop.parquet | DE | Non-encrypted password | IDS | 0: 738/161202, 1: 291/59943 |
| NonEnc_smaprthone.parquet | SM | Non-encrypted password | IDS | 0: 613/180021, 1: 167/41434 |
| Phishing_desktop.parquet | DE | Phishing email | Experimental Campaign | 0: 98/13864, 1: 19/3072 |
| Phishing_smartphone.parquet | SM | Phishing email | Experimental Campaign | 0: 117/34006, 1: 26/8968 |
To collect the dataset, we have deployed multiple agents and soluble agents within an infrastructure with
more than 3k entities, comprising laptops, workstations, and smartphone devices. The methods to build
ground truth are as follows:
- Simulator: We launch different realistic phishing campaigns, aiming to expose user credentials or defeat access to a service.
- IDS: We deploy an IDS to collect various alerts associated with behavioral anomalies, such as cryptomining or peer-to-peer traffic.
For each user exposed to the behaviors stated in the summary table, different time windows (TWs) are computed, aggregating user behavior within a fixed time interval. These TWs serve as the basis for various supervised and unsupervised methods.
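As a generic illustration of this kind of aggregation (not the RBD24 schema; the column names and window length are assumptions), raw log events can be rolled up into fixed time windows with pandas:

```python
import pandas as pd

# Hypothetical raw log events with a per-user id and a timestamp.
logs = pd.DataFrame({
    "user": ["u1", "u1", "u2"],
    "timestamp": pd.to_datetime(
        ["2024-01-01 10:02", "2024-01-01 10:47", "2024-01-01 10:10"]
    ),
    "bytes_sent": [1200, 300, 5000],
})

# One row per (user, 1-hour window) with simple aggregated indicators.
tw = (
    logs.set_index("timestamp")
    .groupby("user")["bytes_sent"]
    .resample("1h")
    .agg(["count", "sum"])
    .reset_index()
)
print(tw)
```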
The time windows (TW) are a data representation based on aggregated logs from multimodal sources between two
timestamps. In this study, logs from HTTP, DNS, SSL, and SMTP are taken into consideration, allowing the
construction of rich behavioral profiles. The indicators described in the TW are a set of manually curated
interpretable features designed to describe device-level properties within the specified time frame. The most
influential features are described below.
Parquet format uses a columnar storage format, which enhances efficiency and compression, making it suitable for large datasets and complex analytical tasks. It has support across various tools and languages, including Python. Parquet can be used with pandas library in Python, allowing pandas to read and write Parquet files through the `pyarrow` or `fastparquet` libraries. Its efficient data retrieval and fast query execution improve performance over other formats. Compared to row-based storage formats such as CSV, Parquet's columnar storage greatly reduces read times and storage costs for large datasets. Although binary formats like HDF5 are effective for specific use cases, Parquet provides broader compatibility and optimization. The provided datasets use the Parquet format. Here’s an example of how to retrieve data using pandas, ensure you have the fastparquet library installed:
```python
import pandas as pd

# Reading a Parquet file
df = pd.read_parquet('path_to_your_file.parquet', engine='fastparquet')
```
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Kidmose CANid Dataset (KCID)

The Kidmose CANid Dataset (KCID) contains CAN bus data collected by Brooke and Andreas Kidmose from 16 different drivers across 4 different vehicles. This dataset is designed to support driver identification and authentication research. The term "CANid" reflects the dataset's dual purpose: data collected from the CAN bus for driver identification research.

VEHICLES
The dataset includes data from four different vehicles across various manufacturers and model years:
- 2011 Chevrolet Traverse: 5-door full-size SUV crossover, AWD, 8 drivers (8 unique drivers in single-driver traces; 1 additional driver in a mixed trace)
- 2017 Ford Focus: 5-door compact station wagon, FWD, 4 drivers
- 2017 Subaru Forester: 5-door compact SUV crossover, AWD, 6 drivers (6 unique drivers in single-driver traces; 3 additional drivers in mixed traces)
- 2022 Honda CR-V Touring: 5-door compact SUV crossover, AWD, 1 driver
Note: The number of drivers includes volunteer drivers whose data was captured in single-driver traces, where we know who was driving at all times. We exclude volunteer drivers whose data is only available in mixed traces because we do not know when each specific driver was actually operating the vehicle.

DRIVERS
The dataset includes 16 drivers across different demographic categories:
Male drivers:
- Under 30 years: 4 drivers ("male-under30-1" through "male-under30-4")
- 30-55 years: 4 drivers ("male-30-55-1" through "male-30-55-4")
- Over 55 years: 3 drivers ("male-over55-1" through "male-over55-3")
Female drivers:
- All ages: 5 drivers ("female-all-ages-1" through "female-all-ages-5")
Driver directory structure: Driver identifiers are used as directory/folder names. Within each directory, you will find traces collected from that particular driver, with additional information (location, data collection method, etc.) specified in the filename.
Note: We use "unknown driver(s)" in directory names when we know that one or more volunteer drivers was operating the vehicle, but we cannot identify who was driving or when. We used a standalone data logger for some data collection sessions. If we failed to download the data and clear the logger's memory before switching drivers, this resulted in mixed traces and, occasionally, "unknown driver(s)" entries. Unfortunately, some of our volunteer drivers were short-term visitors, so we did not have the opportunity to redo their traces as single-driver traces.

LOCATIONS
Data collection took place across multiple locations:
- DK: Denmark
- USA: United States of America
- FL: Florida
- NE: Nebraska
- NE-to-FL: Trip from Nebraska to Florida
- TN: Tennessee
- TN-to-NE: Trip from Tennessee to Nebraska
Location codes appear in filenames (e.g., USA-FL-CANEdge-00000001.mf4 indicates data collected in Florida, USA).

DATA COLLECTION METHODS
Three different data collection methods were employed:
- CANEdge (CSS Electronics CANEdge2): Standalone data logger that connects to the OBD-II port and logs to an SD card
- Korlan (Korlan USB2CAN): CAN-to-USB cable connecting the vehicle's OBD-II port to a laptop
- Kvaser (Kvaser Hybrid CAN-LIN): CAN-to-USB cable connecting the vehicle's OBD-II port to a laptop
The data collection method is indicated in filenames (e.g., USA-FL-CANEdge-00000001.mf4).

FILE TYPES
The dataset provides data in three formats to support different use cases:
.mf4 (MDF4) format: Measurement Data Format version 4 (MDF4)
- Binary format standardized by the Association for Standardization of Automation and Measuring Systems (ASAM)
- Advantages: Compact size, popular with automotive/CAN tools
- Use case: Native format from the CSS Electronics CANEdge2
- Reference: https://www.csselectronics.com/pages/mf4-mdf4-measurement-data-format
.log format: Text-based log format
- Compatibility: Linux SocketCAN can-utils
- Advantages: Compatibility with SocketCAN can-utils; if a .log file is replayed, the data can be captured and monitored using Python's python-can library
- References: https://github.com/linux-can/can-utils, https://packages.debian.org/sid/can-utils, https://python-can.readthedocs.io/en/stable/
.csv format: Text-based comma-separated values (CSV) format
- Advantages: Easy to load with Python using the pandas library; easy to use with Python-based machine learning frameworks (e.g., scikit-learn, Keras, TensorFlow, PyTorch)
- Usage: Load with Python pandas: pd.read_csv()
- Reference: https://pandas.pydata.org/

SPECIALIZED EXPERIMENTS
The KCID Dataset includes five specialized experiments:
Fixed Routes Experiment
- Vehicles: 2011 Chevrolet Traverse, 2017 Subaru Forester
- Drivers: male-30-55-3, male-30-55-4, male-over55-1, female-all-ages-1, female-all-ages-2, female-all-ages-5
- Location: Florida, USA (specific routes)
- Data collection methods: CSS Electronics CANEdge2, Kvaser Hybrid CAN-LIN
- Purpose: Capture CAN traces for specific, mappable routes; eliminate route-based variations in driver authentication data (e.g., low-speed local routes vs. high-speed long-distance routes)
OBD Requests and Responses Experiment
- Vehicle: 2011 Chevrolet Traverse
- Driver: female-all-ages-5
- Location: Florida, USA
- Data collection method: CSS Electronics CANEdge2
- Purpose: Capture OBD requests and responses
- Arbitration IDs: Requests: 0x7DF, Responses: 0x7E8
Tire Pressure Experiment
- Vehicle: 2011 Chevrolet Traverse
- Driver: female-all-ages-5
- Location: Florida, USA
- Data collection method: Kvaser Hybrid CAN-LIN
- Purpose: Capture normal and low tire pressure scenarios
- Applications: Detect tire pressure issues via CAN bus analysis; develop predictive maintenance strategies
Driving Modes and Features Experiment
- Vehicle: 2017 Ford Focus
- Driver: male-30-55-1
- Location: Denmark
- Data collection method: Korlan USB2CAN
- Purpose: Capture different driving (and non-driving) modes and features
- Examples: gear (park, reverse, neutral, drive, sport); headlights on/off
Stationary Vehicles Experiment
- Vehicles: 2024 Chevrolet Malibu, 2025 Toyota Corolla
- Driver: N/A (vehicles remained stationary)
- Location: Florida, USA
- Data collection method: Kvaser Hybrid CAN-LIN
- Purpose: Capture CAN bus traffic from very new, very modern vehicles; identify differences between an older vehicle's CAN bus (e.g., 2011 Chevrolet Traverse) and a newer vehicle's CAN bus (e.g., 2024 Chevrolet Malibu)

ADDITIONAL DOCUMENTATION
Each specialized experiment directory contains a detailed README.md file with specific information about the experiment and the data collected.

RESEARCH APPLICATIONS
This dataset supports various research areas:
- Driver authentication and driver fingerprinting
- Behavioral biometrics in the automotive domain
- Vehicle diagnostics and predictive maintenance
- Machine learning in the automotive domain
- CAN bus analysis and reverse engineering

CITATION
If you use the Kidmose CANid Dataset in your research, please cite appropriately. Citation information will be updated when our paper is published in a peer-reviewed venue.
Article citation:
- APA style: Kidmose, B. E., Kidmose, A. B., and Zou, C. C. (2025). A critical roadmap to driver authentication via CAN bus: Dataset review, introduction of the Kidmose CANid Dataset (KCID), and proof of concept. arXiv. https://arxiv.org/pdf/2510.25856
- MLA style: Kidmose, Brooke Elizabeth, Andreas Brasen Kidmose, and Cliff C. Zou. "A Critical Roadmap to Driver Authentication via CAN Bus: Dataset Review, Introduction of the Kidmose CANid Dataset (KCID), and Proof of Concept." arXiv, 2025. doi:10.48550/arXiv.2510.25856
- Chicago style: Kidmose, Brooke Elizabeth, Andreas Brasen Kidmose, and Cliff C. Zou. "A Critical Roadmap to Driver Authentication via CAN Bus: Dataset Review, Introduction of the Kidmose CANid Dataset (KCID), and Proof of Concept." arXiv (2025). doi:10.48550/arXiv.2510.25856
Dataset citation:
- APA style: Kidmose, B. E. and Kidmose, A. B. (2025). Kidmose CANid Dataset (KCID) v1. [Data set]. Technical University of Denmark. https://doi.org/10.11583/DTU.30483005.v1
- MLA style: Kidmose, Brooke Elizabeth, and Andreas Brasen Kidmose. "Kidmose CANid Dataset (KCID) v1." Technical University of Denmark, 30 Oct. 2025. Web. {Date accessed in dd mmm yyyy format}. doi:10.11583/DTU.30483005.v1
- Chicago style: Kidmose, Brooke Elizabeth, and Andreas Brasen Kidmose. 2025. "Kidmose CANid Dataset (KCID) v1." Technical University of Denmark. doi:10.11583/DTU.30483005.v1
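For the .csv traces, a minimal hedged loading sketch (the file path is a placeholder and the column layout depends on the capture tool):

```python
import pandas as pd

# Replace with the path to one of the CSV traces in the dataset.
trace = pd.read_csv("path/to/USA-FL-CANEdge-00000001.csv")
print(trace.shape)
print(trace.head())
```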
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
By [source]
This remarkable dataset provides an awe-inspiring collection of over 50,000 books, encompassing some of the world's best works of literature and poetry. For each book included in the dataset, users can access a wealth of insightful information such as the title; the author(s); the average rating given by readers and critics alike; a brief description highlighting its plot or characteristics; the language it is written in; the unique ISBN, which enables potential buyers to locate their favorite works with ease; the genres it belongs to; any awards it has won; and the characters that inhabit its storyworld.
Additionally, seeking out readers' opinions on exceptional books is made easier by the bbeScore (best books ever score), alongside detailed rating breakdowns in the ratingsByStars column. Whether a title is a classic novel from time immemorial or a recently released newcomer, this source also lets us evaluate stories based on reader engagement, captured by the likedPercent column (the percentage of readers who liked the book), bbeVotes (the number of votes cast), and entries related to publication dates, including the firstPublishDate!
Aspiring literature researchers, literary historians, and those seeking hidden literary gems alike would no doubt benefit from delving into this magnificent collection of 25 variables on different novels and poets, presented in the Kaggle open source dataset “Best Books Ever: A Comprehensive Historical Collection of Literary Greats”. What worlds await you?
Whether you are a student, researcher, or enthusiast of literature, this dataset provides a valuable source for exploring literary works from varied time periods and genres. With all 25 variables in the dataset, readers can build visualizations, create new analysis tools and models, or find books they might be interested in reading.
First, after downloading the dataset into the Kaggle Notebooks platform or another programming interface of your choice, such as RStudio or Python Jupyter Notebooks (Pandas), make sure the data is arranged into columns with clearly labeled titles. This will help you understand which variable relates to which piece of information. Afterwards, explore each variable by looking for patterns across particular titles or interesting findings about certain authors or ratings relevant to your research interests.
Start with the key columns Title (title), Author (author), Rating (rating), Description (description), Language (language), Genres (genres), and Characters (characters); these can help you discover trends between books according to style of composition, character types, and so on. Then examine the more specific details offered by Book Format (bookFormat), Edition (edition), and Pages (pages). Peruse publisher information along with Publish Date (publishDate). Also take note of the Awards column, considering the recognition different titles have received; observe how many ratings have been collected per text through the Number of Ratings column (numRatings); analyze readers' feedback via Ratings By Stars (ratingsByStars); and view the liked-percentage rate provided by readers for a particular book (likedPercent).
Beyond the more accessible factors mentioned previously, delve deeper into the more sophisticated data presented: Setting (setting), Cover Image (coverImg), BBE Score (bbeScore), and BBE Votes (bbeVotes). All of these should provide greater insight when trying to explain why a certain book has made its way onto the GoodReads top selections list! To estimate value, test out the Price (price) column too, to determine whether some texts retain large popularity despite rather costly publishing options currently available on the market.
Finally, combine the different aspects observed while researching individual titles to create personalized recommendations based on the released comprehensive lists. To achieve that, use the ISBN code provided, compare publication versus first publication dates, and check the awards labels to give context on how the books discussed here have progressed over the years.
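A hedged pandas sketch using a few of the columns named above (the CSV file name is a placeholder for the downloaded Kaggle file):

```python
import pandas as pd

books = pd.read_csv("best_books_ever.csv")

# Highest-rated books among those with a substantial number of ratings.
popular = books[books["numRatings"] > 10000]
top = popular.sort_values("rating", ascending=False)
print(top[["title", "author", "rating", "numRatings", "likedPercent"]].head(10))
```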
- Creating a web or mobile...
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
By [source]
This dataset contains a wealth of information on electrical half-hourly data for Great Britain from 2008 up to the present day. It is sourced from both the Elexon Portal and National Grid, providing an in-depth view into electricity supply and demand in the UK. It includes conventional generation, wind generation, nuclear generation, pumped storage, and imports and exports. With columns such as ELEXM_SETTLEMENT_DATE, ELEXM_SETTLEMENT_PERIOD, ELEXM_UTC, etc., this dataset is ideal for anyone looking to gain a truly comprehensive understanding of the current energy situation in Britain!
Introduction
This data set contains compiled and cleaned half-hourly electricity data for Great Britain. It is sourced from providers such as the Elexon Portal and National Grid, making it a great tool for studying electrical supply and demand in the UK. This guide provides an overview of the dataset and walks through the steps for using it effectively.
Getting Familiar with the Data: The first step is to get familiar with the dataset's features. Look at the available columns/variables, their descriptions and units, and their meaning. Under each column you'll find additional information about its data type (e.g., integer or float), which is helpful to understand before performing any kind of analysis. Another excellent way to explore the data is simply to print out a few rows of the table and inspect some example values from each column. Doing so should give you more clarity over what type of questions you can answer with your analyses, keeping in mind that not all datasets are suitable for addressing every potential research question.
Understanding Relationships: After getting familiar with the features and attributes, start understanding how they relate to each other. Consider variable characteristics, such as the presence or absence of correlations between certain columns, and construct relationships among the elements under study through operations like merging adjacent tables within the same framework, transforming raw input into meaningful knowledge as you complete each analytics task. This helps you gain insight into patterns present throughout the entire collection as well as in individual items, whether considered individually or collectively over time, leading towards the outputs needed to answer particular questions about the underlying trends in the data.
Performing Analyses: Finally, run analytical approaches directly on the information extracted in the previous step, for example within statistical environments like RStudio or Python's Pandas library. These tools let you build models and visualizations that reveal patterns and help interpret different feature combinations, with a particular focus on understanding the interdependencies and correlations among the variables studied, until you reach the insights required to solve the problem at hand and generate targeted solutions for the questions posed during the initial exploratory use cases.
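A minimal hedged sketch along these lines, assuming the ELEXM_UTC column is a parseable timestamp and using a placeholder file name:

```python
import pandas as pd

gb = pd.read_csv("gb_half_hourly_electricity.csv", parse_dates=["ELEXM_UTC"])

# Daily totals of the numeric columns as a first look at long-term trends.
daily = gb.set_index("ELEXM_UTC").resample("D").sum(numeric_only=True)
print(daily.head())
```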
- Analyzing long-term trends in electricity generation and consumption of different sources over time.
- Using machine learning algorithms to predict future energy consumption, production and pricing in the UK electricity market.
- Developing more efficient methods of powering homes, businesses and other organizations based on energy consumption patterns from this dataset
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy,...
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
Understanding the interaction between life history, demography and population genetics in threatened species is critical for the conservation of viable populations. In the context of habitat loss and fragmentation, identifying the factors that underpin the structuring of genetic variation within populations can allow conservationists to evaluate habitat quality and connectivity and help to design dispersal corridors effectively. In this study, we carried out a detailed, fine-scale landscape genetic investigation of a giant panda population for the first time, using a large microsatellite data set, and examined the role of isolation-by-barriers (IBB), isolation-by-distance (IBD) and isolation-by-resistance (IBR) in shaping the genetic variation pattern of giant pandas in the Qinling Mountains. We found that the Qinling population comprises one continuous genetic cluster, and among the landscape hypotheses tested, gene flow was found to be correlated with resistance gradients for two topographic factors, rather than geographical distance or barriers. Gene flow was inferred to be facilitated by easterly slope aspect and to be constrained by land surface with high topographic complexity. These factors are related to benign micro-climatic conditions for both the pandas and the food resources they rely on and more accessible topographic conditions for movement, respectively. We identified optimal corridors based on these results, aiming to promote gene flow between human-induced habitat fragments. These findings provide insight into the permeability and affinities of the giant panda habitat and offer an important reference for the conservation of the giant panda and its habitat.
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains historical stock price data for Walmart Inc. (WMT) from October 1, 1970, to January 31, 2025. The data includes key stock market indicators such as opening price, closing price, adjusted closing price, highest and lowest prices of the day, and trading volume. This dataset can be valuable for financial analysis, stock market trend prediction, and machine learning applications in quantitative finance.
The data has been collected from publicly available financial sources and covers over 13,000 trading days, providing a comprehensive view of Walmart’s stock performance over several decades.
Date: The trading date (e.g., 1970-10-01).
Open: The opening price of Walmart stock for the day.
High: The highest price reached during the trading session.
Low: The lowest price recorded during the trading session.
Close: The closing price at the end of the trading day.
Adj Close: The adjusted closing price, which accounts for stock splits and dividends.
Volume: The total number of shares traded on that particular day.
This dataset can be used for a variety of financial and data science applications, including:
✔ Stock Market Analysis – Study historical trends and price movements.
✔ Time Series Forecasting – Develop predictive models using machine learning.
✔ Technical Analysis – Apply moving averages, RSI, and other trading indicators.
✔ Market Volatility Analysis – Assess market fluctuations over different periods.
✔ Algorithmic Trading – Backtest trading strategies based on historical data.
No missing values.
Data spans over 50 years, ensuring long-term trend analysis.
Preprocessed and structured for easy use in Python, R, and other data science tools.
You can load the dataset using Pandas in Python:

```python
import pandas as pd

df = pd.read_csv("WMT_1970-10-01_2025-01-31.csv")
df.head()
```
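Building on the DataFrame above, a small hedged example of the technical-analysis use case (assuming the Close column documented above):

```python
# 50-day simple moving average and daily returns on the closing price.
df["SMA_50"] = df["Close"].rolling(window=50).mean()
df["daily_return"] = df["Close"].pct_change()
print(df[["Date", "Close", "SMA_50", "daily_return"]].tail())
```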
This dataset is provided for educational and research purposes. Please ensure proper attribution if used in projects or research.
This data set was scraped by Muhammad Atif Latif.
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
Sustainable cities depend on urban forests. City trees -- a pillar of urban forests -- improve our health, clean the air, store CO2, and cool local temperatures. Comparatively less is known about urban forests as ecosystems, particularly their spatial composition, nativity statuses, biodiversity, and tree health. Here, we assembled and standardized a new dataset of N=5,660,237 trees from 63 of the largest US cities. The data comes from tree inventories conducted at the level of cities and/or neighborhoods. Each data sheet includes detailed information on tree location, species, nativity status (whether a tree species is naturally occurring or introduced), health, size, whether it is in a park or urban area, and more (comprising 28 standardized columns per datasheet). This dataset could be analyzed in combination with citizen-science datasets on bird, insect, or plant biodiversity; social and demographic data; or data on the physical environment. Urban forests offer a rare opportunity to intentionally design biodiverse, heterogenous, rich ecosystems. Methods See eLife manuscript for full details. Below, we provide a summary of how the dataset was collected and processed.
Data Acquisition We limited our search to the 150 largest cities in the USA (by census population). To acquire raw data on street tree communities, we used a search protocol on both Google and Google Datasets Search (https://datasetsearch.research.google.com/). We first searched the city name plus each of the following: street trees, city trees, tree inventory, urban forest, and urban canopy (all combinations totaled 20 searches per city, 10 each in Google and Google Datasets Search). We then read the first page of google results and the top 20 results from Google Datasets Search. If the same named city in the wrong state appeared in the results, we redid the 20 searches adding the state name. If no data were found, we contacted a relevant state official via email or phone with an inquiry about their street tree inventory. Datasheets were received and transformed to .csv format (if they were not already in that format). We received data on street trees from 64 cities. One city, El Paso, had data only in summary format and was therefore excluded from analyses.
Data Cleaning All code used is in the zipped folder Data S5 in the eLife publication. Before cleaning the data, we ensured that all reported trees for each city were located within the greater metropolitan area of the city (for certain inventories, many suburbs were reported - some within the greater metropolitan area, others not). First, we renamed all columns in the received .csv sheets, referring to the metadata and according to our standardized definitions (Table S4). To harmonize tree health and condition data across different cities, we inspected metadata from the tree inventories and converted all numeric scores to a descriptive scale including “excellent,” “good”, “fair”, “poor”, “dead”, and “dead/dying”. Some cities included only three points on this scale (e.g., “good”, “poor”, “dead/dying”) while others included five (e.g., “excellent,” “good”, “fair”, “poor”, “dead”). Second, we used pandas in Python (W. McKinney & Others, 2011) to correct typos, non-ASCII characters, variable spellings, date format, units used (we converted all units to metric), address issues, and common name format. In some cases, units were not specified for tree diameter at breast height (DBH) and tree height; we determined the units based on typical sizes for trees of a particular species. Wherever diameter was reported, we assumed it was DBH. We standardized health and condition data across cities, preserving the highest granularity available for each city. For our analysis, we converted this variable to a binary (see section Condition and Health). We created a column called “location_type” to label whether a given tree was growing in the built environment or in green space. All of the changes we made, and decision points, are preserved in Data S9. Third, we checked the scientific names reported using gnr_resolve in the R library taxize (Chamberlain & Szöcs, 2013), with the option Best_match_only set to TRUE (Data S9). Through an iterative process, we manually checked the results and corrected typos in the scientific names until all names were either a perfect match (n=1771 species) or partial match with threshold greater than 0.75 (n=453 species). BGS manually reviewed all partial matches to ensure that they were the correct species name, and then we programmatically corrected these partial matches (for example, Magnolia grandifolia-- which is not a species name of a known tree-- was corrected to Magnolia grandiflora, and Pheonix canariensus was corrected to its proper spelling of Phoenix canariensis). Because many of these tree inventories were crowd-sourced or generated in part through citizen science, such typos and misspellings are to be expected. Some tree inventories reported species by common names only. Therefore, our fourth step in data cleaning was to convert common names to scientific names. We generated a lookup table by summarizing all pairings of common and scientific names in the inventories for which both were reported. We manually reviewed the common to scientific name pairings, confirming that all were correct. Then we programmatically assigned scientific names to all common names (Data S9). Fifth, we assigned native status to each tree through reference to the Biota of North America Project (Kartesz, 2018), which has collected data on all native and non-native species occurrences throughout the US states. 
Specifically, we determined whether each tree species in a given city was native to that state, not native to that state, or that we did not have enough information to determine nativity (for cases where only the genus was known). Sixth, some cities reported only the street address but not latitude and longitude. For these cities, we used the OpenCageGeocoder (https://opencagedata.com/) to convert addresses to latitude and longitude coordinates (Data S9). OpenCageGeocoder leverages open data and is used by many academic institutions (see https://opencagedata.com/solutions/academia). Seventh, we trimmed each city dataset to include only the standardized columns we identified in Table S4. After each stage of data cleaning, we performed manual spot checking to identify any issues.
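The authors' actual cleaning code is in Data S5; purely as an illustrative sketch of the condition-harmonization step described above (hypothetical scores and mapping):

```python
import pandas as pd

# Hypothetical numeric condition scores from one city's inventory.
trees = pd.DataFrame({"condition_raw": [5, 4, 3, 2, 1]})

# Map numeric scores onto the shared descriptive scale used across cities.
score_to_label = {5: "excellent", 4: "good", 3: "fair", 2: "poor", 1: "dead"}
trees["condition"] = trees["condition_raw"].map(score_to_label)
print(trees)
```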
This pizza sales dataset is made up of 12 relevant features:
- order_id: Unique identifier for each order placed by a table
- order_details_id: Unique identifier for each pizza placed within each order (pizzas of the same type and size are kept in the same row, and the quantity increases)
- pizza_id: Unique key identifier that ties the pizza ordered to its details, like size and price
- quantity: Quantity ordered for each pizza of the same type and size
- order_date: Date the order was placed (entered into the system prior to cooking & serving)
- order_time: Time the order was placed (entered into the system prior to cooking & serving)
- unit_price: Price of the pizza in USD
- total_price: unit_price * quantity
- pizza_size: Size of the pizza (Small, Medium, Large, X Large, or XX Large)
- pizza_type: Unique key identifier that ties the pizza ordered to its details, like size and price
- pizza_ingredients: ingredients used in the pizza as shown in the menu (they all include Mozzarella Cheese, even if not specified; and they all include Tomato Sauce, unless another sauce is specified)
- pizza_name: Name of the pizza as shown in the menu
For the Maven Pizza Challenge, you’ll be playing the role of a BI Consultant hired by Plato's Pizza, a Greek-inspired pizza place in New Jersey. You've been hired to help the restaurant use data to improve operations, and just received the following note:
Welcome aboard, we're glad you're here to help!
Things are going OK here at Plato's, but there's room for improvement. We've been collecting transactional data for the past year, but really haven't been able to put it to good use. Hoping you can analyze the data and put together a report to help us find opportunities to drive more sales and work more efficiently.
Here are some questions that we'd like to be able to answer:
- What days and times do we tend to be busiest?
- How many pizzas are we making during peak periods?
- What are our best and worst-selling pizzas?
- What's our average order value?
- How well are we utilizing our seating capacity? (we have 15 tables and 60 seats)
That's all I can think of for now, but if you have any other ideas I'd love to hear them – you're the expert!
Thanks in advance,
Mario Maven (Manager, Plato's Pizza)
The public dataset is available on the Maven Analytics website, which stores and consolidates all of its available datasets for analysis in the Data Playground. The specific dataset can be obtained at the link below: https://www.mavenanalytics.io/blog/maven-pizza-challenge
📌 I set up the data model to include all the related data in one single table, so obtaining data for analysis is easier.
Complete details about the challenge are also provided at the link, if you are interested. The purpose of uploading the dataset here is to conduct exploratory data analysis beforehand with Pandas and data visualization libraries, in order to review the data comprehensively and translate my findings and insights into a single-page visualization.
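As a minimal sketch of that exploratory analysis (assuming the single flat table and the column names from the data dictionary above; the file name pizza_sales.csv is a placeholder), a few of the manager's questions could be approached with pandas like this:

```python
import pandas as pd

# Placeholder file name; columns follow the data dictionary above.
df = pd.read_csv("pizza_sales.csv", parse_dates=["order_date"])

# Average order value: total revenue divided by the number of distinct orders.
avg_order_value = df["total_price"].sum() / df["order_id"].nunique()

# Busiest hours: pizzas sold per hour of the day (assumes HH:MM:SS time strings).
df["hour"] = pd.to_datetime(df["order_time"], format="%H:%M:%S").dt.hour
pizzas_per_hour = df.groupby("hour")["quantity"].sum().sort_values(ascending=False)

# Best- and worst-selling pizzas by quantity sold.
by_pizza = df.groupby("pizza_name")["quantity"].sum().sort_values()

print(f"Average order value: ${avg_order_value:.2f}")
print(pizzas_per_hour.head())
print("Worst seller:", by_pizza.index[0], "| Best seller:", by_pizza.index[-1])
```

Answering the seating-utilization question would likely require an additional assumption about average dining time, since the table records order timestamps rather than table occupancy.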
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Overview: This dataset offers a comprehensive collection of daily weather readings from major cities around the world. The first release included only capital cities, but the dataset now also covers major cities worldwide, along with hourly data, bringing the total to roughly 1,250 cities. Some locations provide historical data tracing back to January 2, 1833, giving users a deep dive into long-term weather patterns and their evolution.
Data License and Updates: This dataset is updated every Sunday using data from the Meteostat API, ensuring access to the latest week's data without overburdening the data source.
cities.csv: This dataframe offers details about individual cities and weather stations.
- Columns:
- station_id: Unique ID for the weather station.
- city_name: Name of the city.
- country: The country where the city is located.
- state: The state or province within the country.
- iso2: The two-letter country code.
- iso3: The three-letter country code.
- latitude: Latitude coordinate of the city.
- longitude: Longitude coordinate of the city.
countries.csv: This dataframe contains information about different countries, providing insights into their geographic and demographic characteristics.
- Columns:
- iso3: The three-letter code representing the country.
- country: The English name of the country.
- native_name: The native name of the country.
- iso2: The two-letter code representing the country.
- population: The population of the country.
- area: The total land area of the country in square kilometers.
- capital: The name of the capital city.
- capital_lat: The latitude coordinate of the capital city.
- capital_lng: The longitude coordinate of the capital city.
- region: The specific region within the continent where the country is located.
- continent: The continent to which the country belongs.
- hemisphere: The hemisphere in which the country is located (e.g., Northern, Southern).
daily_weather.parquet: This dataframe provides weather data on a daily basis.
- Columns:
- station_id: Unique ID for the weather station.
- city_name: Name of the city where the station is located.
- date: Date of the weather record.
- season: Season corresponding to the date (e.g., summer, winter).
- avg_temp_c: Average temperature in Celsius.
- min_temp_c: Minimum temperature in Celsius.
- max_temp_c: Maximum temperature in Celsius.
- precipitation_mm: Precipitation in millimeters.
- snow_depth_mm: Snow depth in millimeters.
- avg_wind_dir_deg: Average wind direction in degrees.
- avg_wind_speed_kmh: Average wind speed in kilometers per hour.
- peak_wind_gust_kmh: Peak wind gust in kilometers per hour.
- avg_sea_level_pres_hpa: Average sea-level pressure in hectopascals.
- sunshine_total_min: Total sunshine duration in minutes.
These dataframes can be utilized for various analyses such as weather trend prediction, climate studies, geographic analysis, demographic insights, and more.
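As a minimal sketch of such an analysis (file paths are placeholders, and reading the parquet file assumes pyarrow or fastparquet is installed), the three files could be combined with pandas as follows:

```python
import pandas as pd

# Placeholder paths; reading the parquet file requires pyarrow or fastparquet.
cities = pd.read_csv("cities.csv")
countries = pd.read_csv("countries.csv")
weather = pd.read_parquet("daily_weather.parquet")

# Attach country and continent metadata to each station
# (assumes the country names in cities.csv match those in countries.csv).
weather = weather.merge(cities[["station_id", "country"]], on="station_id", how="left")
weather = weather.merge(countries[["country", "continent"]], on="country", how="left")

# Example: mean of the daily average temperature per continent and year.
weather["year"] = pd.to_datetime(weather["date"]).dt.year
annual = (
    weather.groupby(["continent", "year"])["avg_temp_c"]
    .mean()
    .reset_index(name="mean_avg_temp_c")
)
print(annual.head())
```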
Dataset Image Source: Photo credits to 越过山丘. View the original image here.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
We utilized a dataset of machine design materials, which includes information on their mechanical properties. The dataset was obtained from the Autodesk Material Library and comprises 15 columns, also referred to as features/attributes. It is a real-world dataset and does not contain any random values; however, due to missing values, we only utilized seven of these columns for our ML model. You can access the related GitHub repository here: https://github.com/purushottamnawale/material-selection-using-machine-learning
To develop an ML model, we employed several Python libraries, including NumPy, pandas, scikit-learn, and graphviz, alongside other technologies such as Weka, MS Excel, VS Code, Kaggle, Jupyter Notebook, and GitHub. We used Weka to quickly visualize the data and understand the relationships between the features, without requiring any programming expertise.
My problem statement is material selection for an EV chassis, so if you have any specific ideas, feel free to implement them and add your code on Kaggle.
A detailed research paper is available at https://iopscience.iop.org/article/10.1088/1742-6596/2601/1/012014
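Purely as an illustrative sketch, and not the repository's actual pipeline, a scikit-learn decision tree could be trained on the mechanical-property columns roughly as follows. The file path and column names ("Su", "Sy", "E", "G", "mu", "Ro", "Use") are hypothetical placeholders, and export_text is used here instead of graphviz to keep the example dependency-free.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical file path and column names -- adjust them to the actual CSV in the repository.
df = pd.read_csv("material_properties.csv")
features = ["Su", "Sy", "E", "G", "mu", "Ro"]   # placeholder mechanical-property columns
target = "Use"                                  # placeholder suitability label

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.2, random_state=42
)

clf = DecisionTreeClassifier(max_depth=4, random_state=42)
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))
print(export_text(clf, feature_names=features))
```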
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
This public dataset contains data on public and private insurance companies provided by IRDAI (Insurance Regulatory and Development Authority of India) from 2013 to 2022. It is multi-index data and makes great practice for honing manipulation of pandas multi-index dataframes. The dataset focuses mainly on the companies' business (total premiums and number of policies), subscription information (number of people covered), claims incurred, and the network hospitals enrolled by Third Party Administrators.
The Excel file contains the following data:

| Table No. | Contents |
| --- | --- |
| **A** | **III.A: HEALTH INSURANCE BUSINESS OF GENERAL AND HEALTH INSURERS** |
| 62 | Health Insurance - Number of Policies, Number of Persons Covered and Gross Premium |
| 63 | Personal Accident Insurance - Number of Policies, Number of Persons Covered and Gross Premium |
| 64 | Overseas Travel Insurance - Number of Policies, Number of Persons Covered and Gross Premium |
| 65 | Domestic Travel Insurance - Number of Policies, Number of Persons Covered and Gross Premium |
| 66 | Health Insurance - Net Premium Earned, Incurred Claims and Incurred Claims Ratio |
| 67 | Personal Accident Insurance - Net Premium Earned, Incurred Claims and Incurred Claims Ratio |
| 68 | Overseas Travel Insurance - Net Earned Premium, Incurred Claims and Incurred Claims Ratio |
| 69 | Domestic Travel Insurance - Net Earned Premium, Incurred Claims and Incurred Claims Ratio |
| 70 | Details of Claims Development and Aging - Health Insurance Business |
| 71 | State-wise Health Insurance Business |
| 72 | State-wise Individual Health Insurance Business |
| 73 | State-wise Personal Accident Insurance Business |
| 74 | State-wise Overseas Insurance Business |
| 75 | State-wise Domestic Insurance Business |
| 76 | State-wise Claims Settlement under Health Insurance Business |
| **B** | **III.B: HEALTH INSURANCE BUSINESS OF LIFE INSURERS** |
| 77 | Health Insurance Business in respect of Products offered by Life Insurers - New Business |
| 78 | Health Insurance Business in respect of Products offered by Life Insurers - Renewal Business |
| 79 | Health Insurance Business in respect of Riders attached to Life Insurance Products - New Business |
| 80 | Health Insurance Business in respect of Riders attached to Life Insurance Products - Renewal Business |
| **C** | **III.C: OTHERS** |
| 81 | Network Hospitals Enrolled by TPAs |
| 82 | State-wise Details on Number of Network Providers |
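As a minimal sketch of working with these tables as pandas MultiIndex dataframes (the file name, sheet name, and header layout below are hypothetical, since the exact Excel structure varies by table):

```python
import pandas as pd

# Hypothetical file name, sheet name, and header layout; reading .xlsx requires openpyxl.
df = pd.read_excel(
    "irdai_health_insurance.xlsx",
    sheet_name="Table 62",
    header=[0, 1],      # e.g. a (metric, year) column MultiIndex
    index_col=0,        # e.g. insurer name as the row index
)

# Typical MultiIndex manipulations:
premiums = df.xs("Gross Premium", axis=1, level=0)   # select one metric across all years
long_form = df.stack(level=1).reset_index()          # reshape (insurer, year) into long format
print(premiums.head())
print(long_form.head())
```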
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
By Philipp Schmid (From Huggingface) [source]
The dataset is provided in two separate files: train.csv and test.csv. The train.csv file contains a substantial amount of labeled data with columns for the text data itself, as well as their corresponding binary and multi-class labels. This enables users to develop and train machine learning models effectively using this dataset.
Similarly, test.csv includes additional examples for evaluating pre-trained models or assessing model performance after training on train.csv. It follows a similar structure as train.csv with columns representing text data, binary labels, and multi-class labels.
Its rich content, its extensive labeling scheme for binary and multi-class classification tasks, and its ease of use thanks to the tabular CSV format make this dataset an excellent choice for anyone looking to advance their NLP capabilities through diverse text classification challenges.
How to Use this Dataset for Text Classification
This guide will provide you with useful information on how to effectively utilize this dataset for your text classification projects.
Understanding the Columns
The dataset consists of several columns, each serving a specific purpose:
text: This column contains the actual text data that needs to be classified. It is the primary feature for your modeling task.
binary: This column represents the binary classification label associated with each text entry. The label indicates whether the text belongs to one class or another. For example, it could be used to classify emails as either spam or not spam.
multi: This column represents the multi-class classification label associated with each text entry. The label indicates which class or category the text belongs to out of multiple possible classes. For instance, it can be used to categorize news articles into topics like sports, politics, entertainment, etc.
Dataset Files
The dataset is provided in two files: train.csv and test.csv.
- train.csv: This file contains a subset of labeled data specifically intended for training your models. It includes columns for the text data and their corresponding binary and multi-class labels.
- test.csv: In order to evaluate your trained models' performance on unseen data, this file provides additional examples similar in structure and format to train.csv. It includes columns for the texts and their respective binary and multi-class labels as well.
Getting Started
To make use of this dataset effectively, here are some steps you can follow:
- Download both train.csv and test.csv files containing labeled examples.
- Load these datasets into your preferred machine learning environment (such as Python with libraries like Pandas or Scikit-learn).
- Explore the dataset by examining its structure, summary statistics, and visualizations.
- Preprocess the text data as needed, which may include techniques like tokenization, removing stop words, stemming/lemmatizing, and encoding text into numerical representations (such as bag-of-words or TF-IDF vectors).
- Consider splitting the train.csv data further into training and validation sets for model development and evaluation.
- Select appropriate machine learning algorithms for your text classification task (e.g., Naive Bayes, Logistic Regression, Support Vector Machines) and train them.
- Sentiment Analysis: The dataset can be used to classify text data into positive or negative sentiment, based on the binary classification label. This can be helpful in analyzing customer reviews, social media sentiment, and feedback analysis.
- Topic Categorization: The multi-class classification label can be used to categorize text into different topics or themes. This can be useful in organizing large amounts of text data, such as news articles or research papers.
- Spam Detection: The binary classification label can be used to identify whether a text message or email is spam or not. This can help users filter out unwanted messages and improve their overall communication experience.

Overall, this dataset provides an opportunity to create models for various applications of text classification, such as sentiment analysis, topic categorization, and spam detection.
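As a minimal sketch of the workflow outlined above, the snippet below combines TF-IDF features with logistic regression in a scikit-learn pipeline. The column names follow the data dictionary ("text", "binary", "multi"), and logistic regression stands in for any of the suggested algorithms.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

# Column names follow the data dictionary above: "text", "binary", "multi".
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# TF-IDF features followed by logistic regression (one of the suggested algorithm choices).
model = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)

# Train on the binary label; swap in train["multi"] / test["multi"] for the multi-class task.
model.fit(train["text"], train["binary"])
print(classification_report(test["binary"], model.predict(test["text"])))
```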
If you use this dataset in your research, please credit the original authors. [Data Source](https://huggingface.co/datase...