https://creativecommons.org/publicdomain/zero/1.0/
This dataset is built on data from Overbuff with the help of Python and Selenium. Development environment: Jupyter Notebook.
The tables contain the data for competitive seasons 1-4 and for quick play for each hero and rank along with the standard statistics (common to each hero as well as information belonging to a specific hero).
Note: data for some columns is missing on the Overbuff site (there is '—' instead of a specific value), so those columns were dropped: Scoped Crits for Ashe and Widowmaker, Rip Tire Kills for Junkrat, Minefield Kills for Wrecking Ball. The 'Self Healing' column for Bastion was dropped too, as Bastion doesn't have this ability anymore in OW2. Also, there are no values for "Javelin Spin Kills / 10min" for Orisa in season 1 (the column was dropped). Overall, all missing values were cleaned.
Attention: Overbuff doesn't contain info about OW 1 competitive seasons (when you switch the skill-tier filter, the displayed data doesn't change). If you know a site where it's possible to get this data, please leave a comment. Thank you!
The code is available on GitHub.
The whole procedure is done in 5 stages:
Data is retrieved directly from HTML elements on the page with Selenium in Python.
After scraping, the data is cleansed (see the sketch after this list): 1) the comma thousands separator is removed (e.g. 1,009 => 1009); 2) time representations (e.g. '01:23') are translated to seconds (1*60 + 23 => 83); 3) Lúcio becomes Lucio and Torbjörn becomes Torbjorn.
Data are arranged into a table and saved to CSV.
Columns which are supposed to have only numeric values are checked. All non-numeric values are dropped. This stage helps to find missing values which contain '—' instead and delete them.
Additional missing values are searched for and dealt with, either by renaming a column (since the program cannot infer the correct column name for missing values) or by dropping it. This stage ensures all wrong data are truly fixed.
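As a minimal illustration of the cleansing stage, the sketch below shows how the three rules could look in Python. The helper names are hypothetical and this is not the code from the linked repository:

```python
# Hypothetical sketch of the cleansing rules described in stage 2.
def clean_value(value: str) -> str:
    """Normalize a single scraped cell."""
    value = value.replace(",", "")          # 1) drop thousands separator: '1,009' -> '1009'
    if ":" in value:                        # 2) translate '01:23' to seconds
        minutes, seconds = value.split(":")
        value = str(int(minutes) * 60 + int(seconds))
    return value

def clean_hero_name(name: str) -> str:
    # 3) strip diacritics: 'Lúcio' -> 'Lucio', 'Torbjörn' -> 'Torbjorn'
    return name.replace("ú", "u").replace("ö", "o")

print(clean_value("1,009"))      # 1009
print(clean_value("01:23"))      # 83
print(clean_hero_name("Lúcio"))  # Lucio
```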
The procedure to fetch the data takes 7 minutes on average.
This project and its code were originally based on this GitHub code.
import_missing_lat_long.py: This script takes a GeoNames URL of a location, retrieves the latitude and longitude of this location from the GeoNames database and inserts these values in the corresponding Location knowledge element in the CAP.

import_missing_biograpgy.py: This script takes a ULAN URL of an artist, retrieves his/her biographical details from the ULAN database and inserts these details in the corresponding Person knowledge element in the CAP.

import missing nationalities.py: This script takes a ULAN URL of an artist, retrieves his/her nationality information from the ULAN database and inserts these details in the corresponding Person knowledge element in the CAP.

import missing alt_names.py: This script takes a ULAN URL of an artist, retrieves the alternative names by which he or she is also known from the ULAN database and inserts these details in the corresponding Person knowledge element in the CAP.

Find_missing_birth_and_death_information.py: This script takes a ULAN URL of an artist, retrieves his/her birth and death dates from the ULAN database and inserts these details in the corresponding Person knowledge element in the CAP.
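As a hedged sketch of what import_missing_lat_long.py does, the example below parses a geonameId out of a GeoNames URL and queries the public GeoNames JSON web service. The CAP insertion step is specific to that system and is only indicated as a comment, and a registered GeoNames username is assumed:

```python
import re
import requests

def fetch_lat_long(geonames_url: str, username: str) -> tuple[float, float]:
    """Fetch latitude/longitude for a GeoNames URL, e.g. https://www.geonames.org/2643743/."""
    geoname_id = re.search(r"/(\d+)", geonames_url).group(1)
    resp = requests.get(
        "http://api.geonames.org/getJSON",
        params={"geonameId": geoname_id, "username": username},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    return float(data["lat"]), float(data["lng"])

# lat, lng = fetch_lat_long("https://www.geonames.org/2643743/", username="your_geonames_user")
# ...insert lat/lng into the corresponding Location knowledge element in the CAP (system-specific).
```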
https://creativecommons.org/publicdomain/zero/1.0/
Overview

This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in Python. The data covers three synchronous areas of the European power grid: Continental Europe, Great Britain and Nordic. This work is part of the paper "Predictability of Power Grid Frequency" [1]. Please cite this paper when using the data and the code. For a detailed documentation of the pre-processing procedure we refer to the supplementary material of the paper.

Data sources

We downloaded the frequency recordings from publicly available repositories of three different Transmission System Operators (TSOs).

Continental Europe [2]: We downloaded the data from the German TSO TransnetBW GmbH, which retains the copyright on the data but allows re-publishing it upon request [3].

Great Britain [4]: The download was supported by National Grid ESO Open Data, which belongs to the British TSO National Grid. They publish the frequency recordings under the NGESO Open License [5].

Nordic [6]: We obtained the data from the Finnish TSO Fingrid, which provides the data under the open license CC-BY 4.0 [7].

Content of the repository

A) Scripts: In the "Download_scripts" folder you will find three scripts to automatically download frequency data from the TSOs' websites. In "convert_data_format.py" we save the data with corrected timestamp formats. Missing data is marked as NaN (processing step (1) in the supplementary material of [1]). In "clean_corrupted_data.py" we load the converted data and identify corrupted recordings. We mark them as NaN and clean some of the resulting data holes (processing step (2) in the supplementary material of [1]). The Python scripts run with Python 3.7 and with the packages found in "requirements.txt".

B) Yearly converted and cleansed data: The folders "
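For orientation, one of the yearly files can typically be loaded with pandas along these lines. The folder and file names below are placeholders; check the repository for the actual layout:

```python
import pandas as pd

# Placeholder path; use one of the yearly converted or cleansed files from the repository.
freq = pd.read_csv(
    "Data_cleansed/Continental_Europe/2019.csv",
    index_col=0,          # timestamp column
    parse_dates=True,
)

# Corrupted or missing recordings are marked as NaN by the pre-processing scripts.
print(freq.isna().mean())  # fraction of missing samples per column
```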
Data Description

The CADDI dataset is designed to support research in in-class activity recognition using IMU data from low-cost sensors. It provides multimodal data capturing 19 different activities performed by 12 participants in a classroom environment, utilizing both IMU sensors from a Samsung Galaxy Watch 5 and synchronized stereo camera images. This dataset enables the development and validation of activity recognition models using sensor fusion techniques.

Data Generation Procedures

The data collection process involved recording both continuous and instantaneous activities that typically occur in a classroom setting. The activities were captured using a custom setup, which included: a Samsung Galaxy Watch 5 to collect accelerometer, gyroscope, and rotation vector data at 100 Hz; a ZED stereo camera capturing 1080p images at 25-30 fps; a synchronized computer acting as a data hub, receiving IMU data and storing images in real time; and a D-Link DSR-1000AC router for wireless communication between the smartwatch and the computer. Participants were instructed to arrange their workspace as they would in a real classroom, including a laptop, notebook, pens, and a backpack. Data collection was performed under realistic conditions, ensuring that activities were captured naturally.

Temporal and Spatial Scope

The dataset contains a total of 472.03 minutes of recorded data. The IMU sensors operate at 100 Hz, while the stereo camera captures images at 25-30 Hz. Data was collected from 12 participants, each performing all 19 activities multiple times. The geographical scope of data collection was Alicante, Spain, under controlled indoor conditions.

Dataset Components

The dataset is organized into JSON and PNG files, structured hierarchically.

IMU Data: Stored in JSON files, containing: Samsung Linear Acceleration Sensor (X, Y, Z values, 100 Hz); LSM6DSO Gyroscope (X, Y, Z values, 100 Hz); Samsung Rotation Vector (X, Y, Z, W quaternion values, 100 Hz); Samsung HR Sensor (heart rate, 1 Hz); OPT3007 Light Sensor (ambient light levels, 5 Hz).

Stereo Camera Images: High-resolution 1920×1080 PNG files from the left and right cameras.

Synchronization: Each IMU data record and image is timestamped for precise alignment.

Data Structure

The dataset is divided into continuous and instantaneous activities. Continuous activities (e.g., typing, writing, drawing) were recorded for 210 seconds, with the central 200 seconds retained. Instantaneous activities (e.g., raising a hand, drinking) were repeated 20 times per participant, with data captured only during execution. The dataset is structured as:

/continuous/subject_id/activity_name/
  /camera_a/ → Left camera images
  /camera_b/ → Right camera images
  /sensors/ → JSON files with IMU data
/instantaneous/subject_id/activity_name/repetition_id/
  /camera_a/
  /camera_b/
  /sensors/

Data Quality & Missing Data

The smartwatch buffers 100 readings per second before sending them, ensuring minimal data loss. Synchronization latency between the smartwatch and the computer is negligible. Not all IMU samples have corresponding images due to the different recording rates. Outliers and anomalies were handled by discarding incomplete sequences at the start and end of continuous activities.

Error Ranges & Limitations

Sensor data may contain noise due to minor hand movements. The heart rate sensor operates at 1 Hz, limiting its temporal resolution. Camera exposure settings were automatically adjusted, which may introduce slight variations in lighting.

File Formats & Software Compatibility

IMU data is stored in JSON format, readable with Python's json library. Images are in PNG format, compatible with all standard image processing tools. Recommended libraries for data analysis: numpy, pandas, scikit-learn, tensorflow and pytorch (Python); matplotlib and seaborn (visualization); Keras and PyTorch (deep learning).

Potential Applications

Development of activity recognition models in educational settings. Study of student engagement based on movement patterns. Investigation of sensor fusion techniques combining visual and IMU data. This dataset represents a unique contribution to activity recognition research, providing rich multimodal data for developing robust models in real-world educational environments.

Citation

If you find this project helpful for your research, please cite our work using the following bibtex entry:

@misc{marquezcarpintero2025caddiinclassactivitydetection,
  title={CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost sensors},
  author={Luis Marquez-Carpintero and Sergio Suescun-Ferrandiz and Monica Pina-Navarro and Miguel Cazorla and Francisco Gomez-Donoso},
  year={2025},
  eprint={2503.02853},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.02853},
}
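As a quick, hedged illustration of how the JSON sensor files might be read, the sketch below assumes one JSON file containing a list of timestamped IMU samples; the path and keys are assumptions, so inspect a real file under /continuous/subject_id/activity_name/sensors/ first:

```python
import json
from pathlib import Path

import pandas as pd

# Hypothetical path and layout: one JSON file with a list of per-sample dictionaries.
sensor_file = Path("continuous/subject_01/typing/sensors/imu_data.json")
with sensor_file.open() as f:
    records = json.load(f)

# A list of per-sample dictionaries maps directly onto a DataFrame.
imu = pd.DataFrame(records)
print(imu.head())

# Images and IMU samples are both timestamped, so the two modalities can be aligned
# by nearest timestamp despite the different recording rates (100 Hz vs 25-30 fps).
```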
Market basket analysis with Apriori algorithm
The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data, which covers all transactions that happened over a period of time. The retailer will use the results to grow their business and provide customers with itemset suggestions, allowing them to increase customer engagement, improve the customer experience and identify customer behaviour. I will solve this problem with Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another.
Association Rules are most useful when you are planning to find associations between different objects in a set, i.e. frequent patterns in a transaction database. They can tell you which items customers frequently buy together and allow the retailer to identify relationships between those items.
Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat and 8 bought both. For the rule "bought computer mouse => bought mouse mat": support = P(mouse & mat) = 8/100 = 0.08; confidence = support / P(computer mouse) = 0.08/0.10 = 0.8; lift = confidence / P(mouse mat) = 0.8/0.09 ≈ 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
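For concreteness, the metrics for the rule above can be computed directly from a one-hot basket matrix with pandas. This is a toy sketch mirroring the example, not the assignment data; libraries such as R's arules (used below) or Python's mlxtend automate this for all itemsets:

```python
import pandas as pd

# Toy one-hot basket matrix: 100 customers, 10 bought a computer mouse,
# 9 bought a mouse mat, 8 bought both.
basket = pd.DataFrame({
    "computer_mouse": [True] * 10 + [False] * 90,
    "mouse_mat":      [True] * 8 + [False] * 2 + [True] * 1 + [False] * 89,
})

support_both = (basket["computer_mouse"] & basket["mouse_mat"]).mean()  # 0.08
confidence   = support_both / basket["computer_mouse"].mean()           # 0.08 / 0.10 = 0.8
lift         = confidence / basket["mouse_mat"].mean()                  # 0.8 / 0.09 ≈ 8.9
print(support_both, confidence, lift)
```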
Number of Attributes: 7
https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png
First, we need to load the required libraries. Below, I briefly describe each library.
https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.
https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png
Next, we will clean our data frame and remove missing values.
https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png
To apply Association Rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Terrestrial vessel automatic identification system (AIS) data was collected around Ålesund, Norway in 2020, from multiple receiving stations with unsynchronized clocks. Features are 'mmsi', 'imo', 'length', 'latitude', 'longitude', 'sog', 'cog', 'true_heading', 'datetime UTC', 'navigational status', and 'message number'. The compact parquet files can be turned into data frames with Python's pandas library. Data is irregularly sampled because of the navigational status. The preprocessing script for training the machine learning models can be found here, together with a dozen trainable models and hundreds of datasets. Visit this website for more information about the data. If you have additional questions, you can find our contact information in the links below:
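As noted above, the parquet files load directly into pandas data frames; a minimal sketch follows (the file name is a placeholder, and reading parquet requires pyarrow or fastparquet to be installed):

```python
import pandas as pd

# Placeholder file name; use one of the provided parquet files.
ais = pd.read_parquet("ais_alesund_2020.parquet")

# Column labels follow the feature list above (e.g. 'mmsi', 'latitude', 'longitude', 'sog').
print(ais.head())
print(ais["mmsi"].nunique(), "vessels")
```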
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain/Project:
This dataset is part of the Tour Recommendation System project, which focuses on predicting user preferences and ratings for various tourist places and events. It belongs to the field of Machine Learning, specifically applied to Recommender Systems and Predictive Analytics.
Purpose:
The dataset serves as the training and evaluation data for a Decision Tree Regressor model, which predicts ratings (from 1-5) for different tourist destinations based on user preferences. The model can be used to recommend places or events to users based on their predicted ratings.
Creation Methodology:
The dataset was originally collected from a tourism platform where users rated various tourist places and events. The data was preprocessed to remove missing or invalid entries (such as #NAME? in rating columns). It was then split into subsets for training, validation, and testing the model.
Structure of the Dataset:
The dataset is stored as a CSV file (user_ratings_dataset.csv) and contains the following columns:
place_or_event_id: Unique identifier for each tourist place or event.
rating: Rating given by the user, ranging from 1 to 5.
The data is split into three subsets:
Training Set: 80% of the dataset used to train the model.
Validation Set: A small portion used for hyperparameter tuning.
Test Set: 20% used to evaluate model performance.
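For concreteness, here is a hedged sketch of training, evaluating and saving such a Decision Tree Regressor with the libraries listed further below. Treating place_or_event_id as the only (numeric) input feature is an assumption; the real project may encode additional user and place features:

```python
import joblib
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("user_ratings_dataset.csv")

# Minimal feature set based on the columns described above; assumes place_or_event_id
# is numeric (otherwise encode it first). A separate validation split for
# hyperparameter tuning is omitted here for brevity.
X = df[["place_or_event_id"]]
y = df["rating"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeRegressor(max_depth=5, random_state=42)
model.fit(X_train, y_train)
print("Test MAE:", mean_absolute_error(y_test, model.predict(X_test)))

joblib.dump(model, "tour_recommendation_model.pkl")
```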
Folder and File Naming Conventions:
The dataset files are stored in the following structure:
user_ratings_dataset.csv: The original dataset file containing user ratings.
tour_recommendation_model.pkl: The saved model after training.
actual_vs_predicted_chart.png: A chart comparing actual and predicted ratings.
Software Requirements:
To open and work with this dataset, the following software and libraries are required:
Python 3.x
Pandas for data manipulation
Scikit-learn for training and evaluating machine learning models
Matplotlib for chart generation
Joblib for saving and loading the trained model
The dataset can be opened and processed using any Python environment that supports these libraries.
Additional Resources:
The model training code, README file, and performance chart are available in the project repository.
For detailed explanation and code, please refer to the GitHub repository (or any other relevant link for the code).
Dataset Reusability:
The dataset is structured for easy use in training machine learning models for recommendation systems. Researchers and practitioners can utilize it to:
Train other types of models (e.g., regression, classification).
Experiment with different features or add more metadata to enrich the dataset.
Data Integrity:
The dataset has been cleaned and preprocessed to remove invalid values (such as #NAME? or missing ratings). However, users should ensure they understand the structure and the preprocessing steps taken before reusing it.
Licensing:
The dataset is provided under the CC BY 4.0 license, which allows free usage, distribution, and modification, provided that proper attribution is given.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
I used Paul Rossotti's data set for my personal projects. However, after using it for a long time I noticed that I needed both older and more recent data, so I ended up with a more complete data set and thought it might help someone. Since his data set was used as a base, all the credit goes to him; I only extended it. I am also willing to update this data set yearly.
You can access his work using this link on the reference section.
This data set contains box score information for every NBA game from the 1949-50 season until now. You can get the data individually for each season, by decade, or as a compilation of all the data. In total the data set has approximately 120 features/columns/attributes that go from basic stats (like total points, rebounds, assists, blocks, and so on) to more advanced ones (like floor impact counter, assist rate, possessions, pace, play% and much more!).
Each game contains the same features for the home team and its opponent (away team), plus some features related to the game itself (like game date, season, season type and match winner). If you like stats and the NBA, this data set was made for you!
If you want to know more about the formulas used and their meaning, please check the reference section. You can also check the "features_description" file, where you will find a brief description of each feature and its respective formula (only for the more advanced stats).
LAST TIME THE DATA SET WAS UPDATED:
July 26, 2021 (07/26/2021) – 1pm EDT
Questions about the dataset:
Q: How did you collect the data? A: I created a web scraper using Python to do the hard work.
Q: How did you fill the missing values? A: The float columns were filled with "0.0". The object columns were left with a NaN value, but you don't need to worry about it; the only columns where that was needed were teamWins, teamLosses, opptWins, opptLosses. Only 8 rows in the entire data set have NaN values! Great news, isn't it?
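In pandas terms, the fill described above amounts to roughly the following (the file name is a placeholder):

```python
import pandas as pd

# Placeholder file name; use the compiled CSV or a single-season file from the data set.
games = pd.read_csv("nba_boxscores_compiled.csv")

# Float columns are filled with 0.0; object columns keep their NaN values.
float_cols = games.select_dtypes(include="float").columns
games[float_cols] = games[float_cols].fillna(0.0)

# The handful of remaining NaN values sit in the columns called out above.
print(games[["teamWins", "teamLosses", "opptWins", "opptLosses"]].isna().sum())
```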
Q: Where can I see the description/formula for each attribute/column/feature? A: You can check it out in the “features_informations” file inside the data set.
Q: Will you constantly update the data set? A: Yes!
Q: Does the data contain only regular season games? A: No! The data contains playoff games as well.
About the stats and formulas used: https://www.basketball-reference.com/about/glossary.html https://basketball.realgm.com/info/glossary https://www.kaggle.com/pablote/nba-enhanced-stats (Paul Rossotti’s data set)
Where the data was collected: https://www.basketball-reference.com/leagues/
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This data set contains box score information for every WNBA game from 1997 until now. You can get the data individually for each season, by decade, or as a compilation of all the data. In total the data set has approximately 100 features/columns/attributes that go from basic stats (like total points, rebounds, assists, blocks, and so on) to more advanced ones (like floor impact counter, assist rate, possessions, pace, play% and much more!).
Each game contains the same features for the home team and its opponent (away team), plus some features related to the game itself (like game date, season, season type and match winner). If you like stats and the WNBA, this data set was made for you!
If you want to know more about the formulas used and their meaning, please check the reference section. You can also check the "features_description" file, where you will find a brief description of each feature and its respective formula (only for the more advanced stats).
LAST TIME THE DATA SET WAS UPDATED:
January 13, 2021 (01/13/2021) – 1pm EDT
Questions about the dataset:
Q: How did you collect the data? A: I created a web scraper using Python to do the hard work.
Q: How did you fill the missing values? A: The float columns were filled with "0.0". The object columns were left with a NaN value, but you don't need to worry about it; the only columns where that was needed were teamWins, teamLosses, opptWins, opptLosses. Only 8 rows in the entire data set have NaN values! Great news, isn't it?
Q: Where can I see the description/formula for each attribute/column/feature? A: You can check it out in the “features_informations” file inside the data set.
Q: Will you constantly update the data set? A: Yes!
Q: Does the data contain only regular season games? A: No! The data contains playoff games as well.
About the stats and formulas used: https://www.basketball-reference.com/about/glossary.html https://basketball.realgm.com/info/glossary https://www.kaggle.com/rafaelgreca/nba-games-box-score-since-1949 (My other data set about the NBA)
Where the data was collected: https://www.basketball-reference.com/leagues/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains trajectories as well as body poses of pedestrians and cyclists in road traffic recorded in Aschaffenburg, Germany. It is appropriate for training and testing methods for trajectory forecasting and intention prediction of vulnerable road users (VRUs) based on the past trajectory and body poses.
The dataset consists of more than 6526 trajectories of pedestrians and 1734 trajectories of cyclists recorded by a research vehicle of the University of Applied Sciences Aschaffenburg (Kooperative Automatisierte Verkehrssysteme) in urban traffic. The trajectories have been measured with the help of a stereo camera while compensating the vehicle's own motion. The body posture of the pedestrians and cyclists is available in the form of 2D and 3D poses. The 2D poses contain joint positions in an image coordinate system, while the 3D poses contain actual three-dimensional positions. A detailed description and evaluation of the pose estimation method can be found in [1]. In addition to the trajectories and the poses, manually created labels of the respective motion states are included.
To read the provided data, unzip the file first. It contains one JSON file for each trajectory. Each JSON file contains the following data (a minimal reading sketch follows this list):
vru_type: type of the VRU (pedestrian ('ped') or cyclist ('bike'))
timestamps: UTC timestamps. The motions of the VRUs were recorded at a frequency of 25 Hz.
set: Assignment to one of the three datasets train, validation or test. For pedestrians and cyclists, 60% of the data is used for training, 20% for validation and the remaining 20% for testing. During all splits, it was ensured that the distribution of the motion states is as similar as possible.
pose2d: 2D poses with 18 joint positions in image coordinates with an additional uncertainty between 0 and 1 (third coordinate). Missing positions are encoded as 'nan'.
pose3d: 3D poses with the trajectories of 14 joints in a three-dimensional coordinate system. Missing positions are encoded as 'nan'.
head_smoothed: Smoothed (by RTS smoother) trajectory of the head in a three-dimensional coordinate system. It is treated as the ground truth position and must not be used as input for a prediction method.
motion_primitives: One-hot encoded labels of the respective motion state. For pedestrians, a distinction is made between the states wait, start, move, and stop. For cyclists, the states wait, start, move, stop, turn left, and turn right are annotated.
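Before turning to the official reader linked below, a single trajectory file can be inspected along these lines; the file name and the exact nesting of the arrays are assumptions:

```python
import json

import numpy as np

# Placeholder file name; the unzipped dataset contains one JSON file per trajectory.
with open("some_trajectory.json") as f:
    sample = json.load(f)

print(sample["vru_type"])          # 'ped' or 'bike'
print(len(sample["timestamps"]))   # number of 25 Hz samples
print(sample["set"])               # 'train', 'validation' or 'test'

# Ground-truth smoothed head trajectory; assumed to be a nested list of x, y, z per timestamp.
head = np.asarray(sample["head_smoothed"], dtype=float)
print(head.shape)
```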
Python code for reading the data can be found on Github: github.com/CooperativeAutomatedTrafficSystemsLab/Aschaffenburg-Pose-Dataset
Citation
If you find this dataset useful, please cite this paper (and refer to the data as the Aschaffenburg Pose Dataset or APD):
Kress, V. ; Zernetsch, S. ; Doll, K. ; Sick, B. : Pose Based Trajectory Forecast of Vulnerable Road Users Using Recurrent Neural Networks. In: Pattern Recognition. ICPR International Workshops and Challenges, Springer International Publishing, 2020, pp. 57-71
Similar Datasets
Pedestrians and Cyclists in Road Traffic: Trajectories, 3D Poses and Semantic Maps
Cyclist Actions: Optical Flow Sequences and Trajectories
Cyclist Actions: Motion History Images and Trajectories
More datasets
Acknowledgment
This work was supported by “Zentrum Digitalisierung.Bayern”. In addition, the work is backed by the project DeCoInt2, supported by the German Research Foundation (DFG) within the priority program SPP 1835: “Kooperativ interagierende Automobile”, grant numbers DO 1186/1-2 and SI 674/11-2.
References
[1] Kress, V. ; Jung, J. ; Zernetsch, S. ; Doll, K. ; Sick, B. : Human Pose Estimation in Real Traffic Scenes. In: IEEE Symposium Series on Computational Intelligence (SSCI), 2018, pp. 518–523, doi: 10.1109/SSCI.2018.8628660
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides pre-processed frequency time series data for 2020-2023, covering three synchronous areas of the European power grid: Continental Europe, Great Britain and Nordic.
This work is part of the paper "Probabilistic and Explainable Machine Learning for Tabular Power Grid Data" [1]. Please cite this paper when using the data and the code.
This dataset extends the time coverage of the original dataset [2], which covered 2012-2021. For a detailed documentation of the pre-processing procedure we refer to the supplementary material of paper "Predictability of Power Grid Frequency"[3]. The same methodology and preprocessing procedures have been applied to maintain consistency and comparability with the original work.
1) In the `Download_scripts` folder you will find three scripts to automatically download frequency data from the TSO's websites.
2) In `convert_data_format.py` we save the data with corrected timestamp formats.
3) In `clean_corrupted_data.py` we load the converted data and identify corrupted recordings. We mark them as NaN and clean some of the resulting data holes (processing step (2) in the supplementary material of [3]).
The Python scripts were adapted to run with Python 3.11 and with the packages found in `requirements.txt`.
The folder `Data_cleansed` contains the output of `clean_corrupted_data.py`.