https://creativecommons.org/publicdomain/zero/1.0/
This dataset is built on data from Overbuff with the help of Python and Selenium. Development environment: Jupyter Notebook.
The tables contain the data for competitive seasons 1-4 and for quick play for each hero and rank along with the standard statistics (common to each hero as well as information belonging to a specific hero).
Note: data for some columns is missing on the Overbuff site (there is '—' instead of a specific value), so those columns were dropped: Scoped Crits for Ashe and Widowmaker, Rip Tire Kills for Junkrat, Minefield Kills for Wrecking Ball. The 'Self Healing' column for Bastion was dropped too, as Bastion doesn't have this ability anymore in OW2. Also, there are no values for "Javelin Spin Kills / 10min" for Orisa in season 1 (the column was dropped). Overall, all missing values were cleaned.
Attention: Overbuff doesn't contain info about OW 1 competitive seasons (when you switch the skill-tier filter, the displayed data doesn't change). If you know a site where it's possible to get this data, please leave a comment. Thank you!
The code is available on GitHub.
The whole procedure is done in 5 stages:
Data is retrieved directly from HTML elements on the page with Selenium in Python.
After scraping, the data is cleansed (see the sketch after this list): 1) the comma thousands separator is removed (e.g. 1,009 => 1009); 2) time representations (e.g. '01:23') are translated to seconds (1*60 + 23 => 83); 3) Lúcio becomes Lucio and Torbjörn becomes Torbjorn.
Data are arranged into a table and saved to CSV.
Columns which are supposed to have only numeric values are checked. All non-numeric values are dropped. This stage helps to find missing values which contain '—' instead and delete them.
Additional missing values are searched for and dealt with, either by renaming a column (since the program cannot infer the correct column name for missing values) or by dropping it. This stage ensures all wrong data are truly fixed.
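As a minimal illustration of the cleansing stage, the sketch below shows how the three rules could look in Python. The helper names are hypothetical and this is not the code from the linked repository:

```python
# Hypothetical sketch of the cleansing rules described in stage 2.
def clean_value(value: str) -> str:
    """Normalize a single scraped cell."""
    value = value.replace(",", "")          # 1) drop thousands separator: '1,009' -> '1009'
    if ":" in value:                        # 2) translate '01:23' to seconds
        minutes, seconds = value.split(":")
        value = str(int(minutes) * 60 + int(seconds))
    return value

def clean_hero_name(name: str) -> str:
    # 3) strip diacritics: 'Lúcio' -> 'Lucio', 'Torbjörn' -> 'Torbjorn'
    return name.replace("ú", "u").replace("ö", "o")

print(clean_value("1,009"))      # 1009
print(clean_value("01:23"))      # 83
print(clean_hero_name("Lúcio"))  # Lucio
```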
The procedure to fetch the data takes 7 minutes on average.
This project and its code were originally based on this GitHub code.
import_missing_lat_long.py: This script takes a GeoNames URL of a location, retrieves the latitude and longitude of this location from the GeoNames database and inserts these values in the corresponding Location knowledge element in the CAP.

import_missing_biograpgy.py: This script takes a ULAN URL of an artist, retrieves his/her biographical details from the ULAN database and inserts these details in the corresponding Person knowledge element in the CAP.

import missing nationalities.py: This script takes a ULAN URL of an artist, retrieves his/her nationality information from the ULAN database and inserts these details in the corresponding Person knowledge element in the CAP.

import missing alt_names.py: This script takes a ULAN URL of an artist, retrieves the alternative names by which he or she is also known from the ULAN database and inserts these details in the corresponding Person knowledge element in the CAP.

Find_missing_birth_and_death_information.py: This script takes a ULAN URL of an artist, retrieves his/her birth and death dates from the ULAN database and inserts these details in the corresponding Person knowledge element in the CAP.
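As a hedged sketch of what import_missing_lat_long.py does, the example below parses a geonameId out of a GeoNames URL and queries the public GeoNames JSON web service. The CAP insertion step is specific to that system and is only indicated as a comment, and a registered GeoNames username is assumed:

```python
import re
import requests

def fetch_lat_long(geonames_url: str, username: str) -> tuple[float, float]:
    """Fetch latitude/longitude for a GeoNames URL, e.g. https://www.geonames.org/2643743/."""
    geoname_id = re.search(r"/(\d+)", geonames_url).group(1)
    resp = requests.get(
        "http://api.geonames.org/getJSON",
        params={"geonameId": geoname_id, "username": username},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    return float(data["lat"]), float(data["lng"])

# lat, lng = fetch_lat_long("https://www.geonames.org/2643743/", username="your_geonames_user")
# ...insert lat/lng into the corresponding Location knowledge element in the CAP (system-specific).
```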
https://creativecommons.org/publicdomain/zero/1.0/
Overview

This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in Python. The data covers three synchronous areas of the European power grid: Continental Europe, Great Britain and Nordic. This work is part of the paper "Predictability of Power Grid Frequency" [1]. Please cite this paper when using the data and the code. For a detailed documentation of the pre-processing procedure we refer to the supplementary material of the paper.

Data sources

We downloaded the frequency recordings from publicly available repositories of three different Transmission System Operators (TSOs).

Continental Europe [2]: We downloaded the data from the German TSO TransnetBW GmbH, which retains the copyright on the data but allows re-publishing it upon request [3].

Great Britain [4]: The download was supported by National Grid ESO Open Data, which belongs to the British TSO National Grid. They publish the frequency recordings under the NGESO Open License [5].

Nordic [6]: We obtained the data from the Finnish TSO Fingrid, which provides the data under the open license CC-BY 4.0 [7].

Content of the repository

A) Scripts: In the "Download_scripts" folder you will find three scripts to automatically download frequency data from the TSOs' websites. In "convert_data_format.py" we save the data with corrected timestamp formats. Missing data is marked as NaN (processing step (1) in the supplementary material of [1]). In "clean_corrupted_data.py" we load the converted data and identify corrupted recordings. We mark them as NaN and clean some of the resulting data holes (processing step (2) in the supplementary material of [1]). The Python scripts run with Python 3.7 and with the packages found in "requirements.txt".

B) Yearly converted and cleansed data: The folders "
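For orientation, one of the yearly files can typically be loaded with pandas along these lines. The folder and file names below are placeholders; check the repository for the actual layout:

```python
import pandas as pd

# Placeholder path; use one of the yearly converted or cleansed files from the repository.
freq = pd.read_csv(
    "Data_cleansed/Continental_Europe/2019.csv",
    index_col=0,          # timestamp column
    parse_dates=True,
)

# Corrupted or missing recordings are marked as NaN by the pre-processing scripts.
print(freq.isna().mean())  # fraction of missing samples per column
```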
Data Description

The CADDI dataset is designed to support research in in-class activity recognition using IMU data from low-cost sensors. It provides multimodal data capturing 19 different activities performed by 12 participants in a classroom environment, utilizing both IMU sensors from a Samsung Galaxy Watch 5 and synchronized stereo camera images. This dataset enables the development and validation of activity recognition models using sensor fusion techniques.

Data Generation Procedures

The data collection process involved recording both continuous and instantaneous activities that typically occur in a classroom setting. The activities were captured using a custom setup, which included: a Samsung Galaxy Watch 5 to collect accelerometer, gyroscope, and rotation vector data at 100 Hz; a ZED stereo camera capturing 1080p images at 25-30 fps; a synchronized computer acting as a data hub, receiving IMU data and storing images in real time; and a D-Link DSR-1000AC router for wireless communication between the smartwatch and the computer. Participants were instructed to arrange their workspace as they would in a real classroom, including a laptop, notebook, pens, and a backpack. Data collection was performed under realistic conditions, ensuring that activities were captured naturally.

Temporal and Spatial Scope

The dataset contains a total of 472.03 minutes of recorded data. The IMU sensors operate at 100 Hz, while the stereo camera captures images at 25-30 Hz. Data was collected from 12 participants, each performing all 19 activities multiple times. The geographical scope of data collection was Alicante, Spain, under controlled indoor conditions.

Dataset Components

The dataset is organized into JSON and PNG files, structured hierarchically.

IMU Data: Stored in JSON files, containing: Samsung Linear Acceleration Sensor (X, Y, Z values, 100 Hz); LSM6DSO Gyroscope (X, Y, Z values, 100 Hz); Samsung Rotation Vector (X, Y, Z, W quaternion values, 100 Hz); Samsung HR Sensor (heart rate, 1 Hz); OPT3007 Light Sensor (ambient light levels, 5 Hz).

Stereo Camera Images: High-resolution 1920×1080 PNG files from the left and right cameras.

Synchronization: Each IMU data record and image is timestamped for precise alignment.

Data Structure

The dataset is divided into continuous and instantaneous activities. Continuous activities (e.g., typing, writing, drawing) were recorded for 210 seconds, with the central 200 seconds retained. Instantaneous activities (e.g., raising a hand, drinking) were repeated 20 times per participant, with data captured only during execution. The dataset is structured as:

/continuous/subject_id/activity_name/
  /camera_a/ → Left camera images
  /camera_b/ → Right camera images
  /sensors/ → JSON files with IMU data
/instantaneous/subject_id/activity_name/repetition_id/
  /camera_a/
  /camera_b/
  /sensors/

Data Quality & Missing Data

The smartwatch buffers 100 readings per second before sending them, ensuring minimal data loss. Synchronization latency between the smartwatch and the computer is negligible. Not all IMU samples have corresponding images due to the different recording rates. Outliers and anomalies were handled by discarding incomplete sequences at the start and end of continuous activities.

Error Ranges & Limitations

Sensor data may contain noise due to minor hand movements. The heart rate sensor operates at 1 Hz, limiting its temporal resolution. Camera exposure settings were automatically adjusted, which may introduce slight variations in lighting.

File Formats & Software Compatibility

IMU data is stored in JSON format, readable with Python's json library. Images are in PNG format, compatible with all standard image processing tools. Recommended libraries for data analysis: numpy, pandas, scikit-learn, tensorflow and pytorch (Python); matplotlib and seaborn (visualization); Keras and PyTorch (deep learning).

Potential Applications

Development of activity recognition models in educational settings. Study of student engagement based on movement patterns. Investigation of sensor fusion techniques combining visual and IMU data. This dataset represents a unique contribution to activity recognition research, providing rich multimodal data for developing robust models in real-world educational environments.

Citation

If you find this project helpful for your research, please cite our work using the following bibtex entry:

@misc{marquezcarpintero2025caddiinclassactivitydetection,
  title={CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost sensors},
  author={Luis Marquez-Carpintero and Sergio Suescun-Ferrandiz and Monica Pina-Navarro and Miguel Cazorla and Francisco Gomez-Donoso},
  year={2025},
  eprint={2503.02853},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.02853},
}
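As a quick, hedged illustration of how the JSON sensor files might be read, the sketch below assumes one JSON file containing a list of timestamped IMU samples; the path and keys are assumptions, so inspect a real file under /continuous/subject_id/activity_name/sensors/ first:

```python
import json
from pathlib import Path

import pandas as pd

# Hypothetical path and layout: one JSON file with a list of per-sample dictionaries.
sensor_file = Path("continuous/subject_01/typing/sensors/imu_data.json")
with sensor_file.open() as f:
    records = json.load(f)

# A list of per-sample dictionaries maps directly onto a DataFrame.
imu = pd.DataFrame(records)
print(imu.head())

# Images and IMU samples are both timestamped, so the two modalities can be aligned
# by nearest timestamp despite the different recording rates (100 Hz vs 25-30 fps).
```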
Market basket analysis with Apriori algorithm
The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data, which covers all transactions that happened over a period of time. The retailer will use the results to grow their business and provide customers with itemset suggestions, allowing them to increase customer engagement, improve the customer experience and identify customer behaviour. I will solve this problem with Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another.
Association Rules are most useful when you are planning to find associations between different objects in a set, i.e. frequent patterns in a transaction database. They can tell you which items customers frequently buy together and allow the retailer to identify relationships between those items.
Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat and 8 bought both. For the rule "bought computer mouse => bought mouse mat": support = P(mouse & mat) = 8/100 = 0.08; confidence = support / P(computer mouse) = 0.08/0.10 = 0.8; lift = confidence / P(mouse mat) = 0.8/0.09 ≈ 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
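For concreteness, the metrics for the rule above can be computed directly from a one-hot basket matrix with pandas. This is a toy sketch mirroring the example, not the assignment data; libraries such as R's arules (used below) or Python's mlxtend automate this for all itemsets:

```python
import pandas as pd

# Toy one-hot basket matrix: 100 customers, 10 bought a computer mouse,
# 9 bought a mouse mat, 8 bought both.
basket = pd.DataFrame({
    "computer_mouse": [True] * 10 + [False] * 90,
    "mouse_mat":      [True] * 8 + [False] * 2 + [True] * 1 + [False] * 89,
})

support_both = (basket["computer_mouse"] & basket["mouse_mat"]).mean()  # 0.08
confidence   = support_both / basket["computer_mouse"].mean()           # 0.08 / 0.10 = 0.8
lift         = confidence / basket["mouse_mat"].mean()                  # 0.8 / 0.09 ≈ 8.9
print(support_both, confidence, lift)
```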
Number of Attributes: 7
https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png
First, we need to load the required libraries. Below, I briefly describe each library.
https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.
https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png
Next, we will clean our data frame and remove missing values.
https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png
To apply Association Rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Terrestrial vessel automatic identification system (AIS) data was collected around Ålesund, Norway in 2020, from multiple receiving stations with unsynchronized clocks. Features are 'mmsi', 'imo', 'length', 'latitude', 'longitude', 'sog', 'cog', 'true_heading', 'datetime UTC', 'navigational status', and 'message number'. The compact parquet files can be turned into data frames with Python's pandas library. Data is irregularly sampled because of the navigational status. The preprocessing script for training the machine learning models can be found here, together with a dozen trainable models and hundreds of datasets. Visit this website for more information about the data. If you have additional questions, you can find our contact information in the links below:
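As noted above, the parquet files load directly into pandas data frames; a minimal sketch follows (the file name is a placeholder, and reading parquet requires pyarrow or fastparquet to be installed):

```python
import pandas as pd

# Placeholder file name; use one of the provided parquet files.
ais = pd.read_parquet("ais_alesund_2020.parquet")

# Column labels follow the feature list above (e.g. 'mmsi', 'latitude', 'longitude', 'sog').
print(ais.head())
print(ais["mmsi"].nunique(), "vessels")
```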
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain/Project:
This dataset is part of the Tour Recommendation System project, which focuses on predicting user preferences and ratings for various tourist places and events. It belongs to the field of Machine Learning, specifically applied to Recommender Systems and Predictive Analytics.
Purpose:
The dataset serves as the training and evaluation data for a Decision Tree Regressor model, which predicts ratings (from 1-5) for different tourist destinations based on user preferences. The model can be used to recommend places or events to users based on their predicted ratings.
Creation Methodology:
The dataset was originally collected from a tourism platform where users rated various tourist places and events. The data was preprocessed to remove missing or invalid entries (such as #NAME? in rating columns). It was then split into subsets for training, validation, and testing the model.
Structure of the Dataset:
The dataset is stored as a CSV file (user_ratings_dataset.csv) and contains the following columns:
place_or_event_id: Unique identifier for each tourist place or event.
rating: Rating given by the user, ranging from 1 to 5.
The data is split into three subsets:
Training Set: 80% of the dataset used to train the model.
Validation Set: A small portion used for hyperparameter tuning.
Test Set: 20% used to evaluate model performance.
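For concreteness, here is a hedged sketch of training, evaluating and saving such a Decision Tree Regressor with the libraries listed further below. Treating place_or_event_id as the only (numeric) input feature is an assumption; the real project may encode additional user and place features:

```python
import joblib
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("user_ratings_dataset.csv")

# Minimal feature set based on the columns described above; assumes place_or_event_id
# is numeric (otherwise encode it first). A separate validation split for
# hyperparameter tuning is omitted here for brevity.
X = df[["place_or_event_id"]]
y = df["rating"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeRegressor(max_depth=5, random_state=42)
model.fit(X_train, y_train)
print("Test MAE:", mean_absolute_error(y_test, model.predict(X_test)))

joblib.dump(model, "tour_recommendation_model.pkl")
```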
Folder and File Naming Conventions:
The dataset files are stored in the following structure:
user_ratings_dataset.csv: The original dataset file containing user ratings.
tour_recommendation_model.pkl: The saved model after training.
actual_vs_predicted_chart.png: A chart comparing actual and predicted ratings.
Software Requirements:
To open and work with this dataset, the following software and libraries are required:
Python 3.x
Pandas for data manipulation
Scikit-learn for training and evaluating machine learning models
Matplotlib for chart generation
Joblib for saving and loading the trained model
The dataset can be opened and processed using any Python environment that supports these libraries.
Additional Resources:
The model training code, README file, and performance chart are available in the project repository.
For detailed explanation and code, please refer to the GitHub repository (or any other relevant link for the code).
Dataset Reusability:
The dataset is structured for easy use in training machine learning models for recommendation systems. Researchers and practitioners can utilize it to:
Train other types of models (e.g., regression, classification).
Experiment with different features or add more metadata to enrich the dataset.
Data Integrity:
The dataset has been cleaned and preprocessed to remove invalid values (such as #NAME? or missing ratings). However, users should ensure they understand the structure and the preprocessing steps taken before reusing it.
Licensing:
The dataset is provided under the CC BY 4.0 license, which allows free usage, distribution, and modification, provided that proper attribution is given.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
I used Paul Rossotti's data set for my personal projects. However, after using it for a long time I noticed that I needed both older and more recent data, so I ended up with a more complete data set and thought it might help someone. Since his data set was used as a base, all the credit goes to him; I only extended it. I am also willing to update this data set yearly.
You can access his work using this link on the reference section.
This data set contains box score information for every NBA game from the 1949-50 season until now. You can get the data individually for each season, by decade, or as a compilation of all the data. In total the data set has approximately 120 features/columns/attributes that go from basic stats (like total points, rebounds, assists, blocks, and so on) to more advanced ones (like floor impact counter, assist rate, possessions, pace, play% and much more!).
Each game contains the same features for the home team and its opponent (away team), plus some features related to the game itself (like game date, season, season type and match winner). If you like stats and the NBA, this data set was made for you!
If you want to know more about the formulas used and their meaning, please check the reference section. You can also check the "features_description" file, where you will find a brief description of each feature and its respective formula (only for the more advanced stats).
LAST TIME THE DATA SET WAS UPDATED:
July 26, 2021 (07/26/2021) – 1pm EDT
Questions about the dataset:
Q: How did you collect the data? A: I created a web scraper using Python to do the hard work.
Q: How did you fill the missing values? A: The float columns were filled with "0.0". The object columns were left with a NaN value, but you don't need to worry about it; the only columns where that was needed were teamWins, teamLosses, opptWins, opptLosses. Only 8 rows in the entire data set have NaN values! Great news, isn't it?
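In pandas terms, the fill described above amounts to roughly the following (the file name is a placeholder):

```python
import pandas as pd

# Placeholder file name; use the compiled CSV or a single-season file from the data set.
games = pd.read_csv("nba_boxscores_compiled.csv")

# Float columns are filled with 0.0; object columns keep their NaN values.
float_cols = games.select_dtypes(include="float").columns
games[float_cols] = games[float_cols].fillna(0.0)

# The handful of remaining NaN values sit in the columns called out above.
print(games[["teamWins", "teamLosses", "opptWins", "opptLosses"]].isna().sum())
```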
Q: Where can I see the description/formula for each attribute/column/feature? A: You can check it out in the “features_informations” file inside the data set.
Q: Will you constantly update the data set? A: Yes!
Q: Does the data contain only regular season games? A: No! The data contains playoff games as well.
About the stats and formulas used: https://www.basketball-reference.com/about/glossary.html https://basketball.realgm.com/info/glossary https://www.kaggle.com/pablote/nba-enhanced-stats (Paul Rossotti’s data set)
Where the data was collected: https://www.basketball-reference.com/leagues/
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This data set contains box score information for every WNBA game from 1997 until now. You can get the data individually for each season, by decade, or as a compilation of all the data. In total the data set has approximately 100 features/columns/attributes that go from basic stats (like total points, rebounds, assists, blocks, and so on) to more advanced ones (like floor impact counter, assist rate, possessions, pace, play% and much more!).
Each game contains the same features for the home team and its opponent (away team), plus some features related to the game itself (like game date, season, season type and match winner). If you like stats and the WNBA, this data set was made for you!
If you want to know more about the formulas used and their meaning, please check the reference section. You can also check the "features_description" file, where you will find a brief description of each feature and its respective formula (only for the more advanced stats).
LAST TIME THE DATA SET WAS UPDATED:
January 13, 2021 (01/13/2021) – 1pm EDT
Questions about the dataset:
Q: How did you collect the data? A: I created a web scraper using Python to do the hard work.
Q: How did you fill the missing values? A: The float columns were filled with "0.0". The object columns were left with a NaN value, but you don't need to worry about it; the only columns where that was needed were teamWins, teamLosses, opptWins, opptLosses. Only 8 rows in the entire data set have NaN values! Great news, isn't it?
Q: Where can I see the description/formula for each attribute/column/feature? A: You can check it out in the “features_informations” file inside the data set.
Q: Will you constantly update the data set? A: Yes!
Q: Does the data contain only regular season games? A: No! The data contains playoff games as well.
About the stats and formulas used: https://www.basketball-reference.com/about/glossary.html https://basketball.realgm.com/info/glossary https://www.kaggle.com/rafaelgreca/nba-games-box-score-since-1949 (My other data set about the NBA)
Where the data was collected: https://www.basketball-reference.com/leagues/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains trajectories as well as body poses of pedestrians and cyclists in road traffic recorded in Aschaffenburg, Germany. It is appropriate for training and testing methods for trajectory forecasting and intention prediction of vulnerable road users (VRUs) based on the past trajectory and body poses.
The dataset consists of more than 6526 trajectories of pedestrians and 1734 trajectories of cyclists recorded by a research vehicle of the University of Applied Sciences Aschaffenburg (Kooperative Automatisierte Verkehrssysteme) in urban traffic. The trajectories have been measured with the help of a stereo camera while compensating the vehicle's own motion. The body posture of the pedestrians and cyclists is available in the form of 2D and 3D poses. The 2D poses contain joint positions in an image coordinate system, while the 3D poses contain actual three-dimensional positions. A detailed description and evaluation of the pose estimation method can be found in [1]. In addition to the trajectories and the poses, manually created labels of the respective motion states are included.
To read the provided data, unzip the file first. It contains one JSON file for each trajectory. Each JSON file contains the following data (a minimal reading sketch follows this list):
vru_type: type of the VRU (pedestrian ('ped') or cyclist ('bike'))
timestamps: UTC timestamps. The motions of the VRUs were recorded at a frequency of 25 Hz.
set: Assignment to one of the three datasets train, validation or test. For pedestrians and cyclists, 60% of the data is used for training, 20% for validation and the remaining 20% for testing. During all splits, it was ensured that the distribution of the motion states is as similar as possible.
pose2d: 2D poses with 18 joint positions in image coordinates with an additional uncertainty between 0 and 1 (third coordinate). Missing positions are encoded as 'nan'.
pose3d: 3D poses with the trajectories of 14 joints in a three-dimensional coordinate system. Missing positions are encoded as 'nan'.
head_smoothed: Smoothed (by RTS smoother) trajectory of the head in a three-dimensional coordinate system. It is treated as the ground truth position and must not be used as input for a prediction method.
motion_primitives: One-hot encoded labels of the respective motion state. For pedestrians, a distinction is made between the states wait, start, move, and stop. For cyclists, the states wait, start, move, stop, turn left, and turn right are annotated.
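Before turning to the official reader linked below, a single trajectory file can be inspected along these lines; the file name and the exact nesting of the arrays are assumptions:

```python
import json

import numpy as np

# Placeholder file name; the unzipped dataset contains one JSON file per trajectory.
with open("some_trajectory.json") as f:
    sample = json.load(f)

print(sample["vru_type"])          # 'ped' or 'bike'
print(len(sample["timestamps"]))   # number of 25 Hz samples
print(sample["set"])               # 'train', 'validation' or 'test'

# Ground-truth smoothed head trajectory; assumed to be a nested list of x, y, z per timestamp.
head = np.asarray(sample["head_smoothed"], dtype=float)
print(head.shape)
```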
Python code for reading the data can be found on Github: github.com/CooperativeAutomatedTrafficSystemsLab/Aschaffenburg-Pose-Dataset
Citation
If you find this dataset useful, please cite this paper (and refer to the data as the Aschaffenburg Pose Dataset or APD):
Kress, V. ; Zernetsch, S. ; Doll, K. ; Sick, B. : Pose Based Trajectory Forecast of Vulnerable Road Users Using Recurrent Neural Networks. In: Pattern Recognition. ICPR International Workshops and Challenges, Springer International Publishing, 2020, pp. 57-71
Similar Datasets
Pedestrians and Cyclists in Road Traffic: Trajectories, 3D Poses and Semantic Maps
Cyclist Actions: Optical Flow Sequences and Trajectories
Cyclist Actions: Motion History Images and Trajectories
More datasets
Acknowledgment
This work was supported by “Zentrum Digitalisierung.Bayern”. In addition, the work is backed by the project DeCoInt2, supported by the German Research Foundation (DFG) within the priority program SPP 1835: “Kooperativ interagierende Automobile”, grant numbers DO 1186/1-2 and SI 674/11-2.
References
[1] Kress, V. ; Jung, J. ; Zernetsch, S. ; Doll, K. ; Sick, B. : Human Pose Estimation in Real Traffic Scenes. In: IEEE Symposium Series on Computational Intelligence (SSCI), 2018, pp. 518–523, doi: 10.1109/SSCI.2018.8628660
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides pre-processed frequency time series data for 2020-2023, covering three synchronous areas of the European power grid: Continental Europe, Great Britain and Nordic.
This work is part of the paper "Probabilistic and Explainable Machine Learning for Tabular Power Grid Data" [1]. Please cite this paper when using the data and the code.
This dataset extends the time coverage of the original dataset [2], which covered 2012-2021. For a detailed documentation of the pre-processing procedure we refer to the supplementary material of paper "Predictability of Power Grid Frequency"[3]. The same methodology and preprocessing procedures have been applied to maintain consistency and comparability with the original work.
1) In the `Download_scripts` folder you will find three scripts to automatically download frequency data from the TSO's websites.
2) In `convert_data_format.py` we save the data with corrected timestamp formats.
3) In `clean_corrupted_data.py` we load the converted data and identify corrupted recordings. We mark them as NaN and clean some of the resulting data holes (processing step (2) in the supplementary material of [3]).
The Python scripts were adapted to run with Python 3.11 and with the packages found in `requirements.txt`.
The folder `Data_cleansed` contains the output of `clean_corrupted_data.py`.