12 datasets found
  1. Overwatch 2 statistics

    overwatch 2 seasons 1-4 statistics + quick play

    • kaggle.com
    Updated Jun 27, 2023
    Cite
    Mykhailo Kachan (2023). Overwatch 2 statistics [Dataset]. https://www.kaggle.com/datasets/mykhailokachan/overwatch-2-statistics
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 27, 2023
    Dataset provided by
    Kaggle
    Authors
    Mykhailo Kachan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is built on data scraped from Overbuff using Python and Selenium. Development environment: Jupyter Notebook.

    The tables contain data for competitive seasons 1-4 and for quick play, for each hero and rank, along with the standard statistics (those common to every hero as well as hero-specific ones).

    Note: data for some columns are missing on the Overbuff site (there is '—' instead of a specific value), so those columns were dropped: Scoped Crits for Ashe and Widowmaker, Rip Tire Kills for Junkrat, and Minefield Kills for Wrecking Ball. The 'Self Healing' column for Bastion was dropped too, as Bastion no longer has this property in OW2. There are also no values for "Javelin Spin Kills / 10min" for Orisa in season 1 (the column was dropped). Overall, all missing values were cleaned.

    Attention: Overbuff doesn't contain info about OW 1 competitive seasons (when you change the skill tier, the data doesn't change). If you know a site where it's possible to get this data, please leave a comment. Thank you!

    The code is on GitHub.

    The whole procedure is done in 5 stages:

    Stage 1:

    Data is retrieved directly from HTML elements on the page using Selenium in Python.
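
    A minimal sketch of this stage, assuming Selenium 4; the page URL and CSS selector are placeholders, since the real ones depend on Overbuff's markup:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # requires a local Chrome installation
driver.get("https://www.overbuff.com/heroes")  # placeholder page

# Placeholder selector: collect the text of each rendered stats row.
rows = driver.find_elements(By.CSS_SELECTOR, "table tr")
raw_rows = [row.text for row in rows]

driver.quit()
```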

    Stage 2:

    After scraping, the data was cleansed: 1) Removed the comma thousands separator (e.g. 1,009 => 1009). 2) Translated time representations (e.g. '01:23') to seconds (1*60 + 23 => 83). 3) Normalized diacritics in hero names: Lúcio became Lucio, Torbjörn became Torbjorn.
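
    A sketch of these three transformations in plain Python (the helper names are illustrative, not from the original code):

```python
import unicodedata

def drop_thousands_separator(s: str) -> str:
    # 1) '1,009' -> '1009'
    return s.replace(",", "")

def time_to_seconds(s: str) -> int:
    # 2) '01:23' -> 1*60 + 23 = 83
    minutes, seconds = s.split(":")
    return int(minutes) * 60 + int(seconds)

def strip_diacritics(s: str) -> str:
    # 3) 'Lúcio' -> 'Lucio', 'Torbjörn' -> 'Torbjorn'
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```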

    Stage 3:

    Data were arranged into a table and saved to CSV.

    Stage 4:

    Columns that are supposed to contain only numeric values are checked, and all non-numeric values are dropped. This stage finds the missing values encoded as '—' and deletes them.
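
    A minimal sketch of this check with pandas; the file and column names are placeholders:

```python
import pandas as pd

df = pd.read_csv("season_1.csv")  # placeholder file name

numeric_cols = ["Pick Rate", "Eliminations / 10min"]  # placeholder column names
for col in numeric_cols:
    # '—' placeholders fail to parse and become NaN, making them easy to find.
    df[col] = pd.to_numeric(df[col], errors="coerce")

df = df.dropna(subset=numeric_cols)
```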

    Stage 5:

    Additional missing values are searched for and dealt with: either a column is renamed (when the program cannot infer the correct column name for missing values) or the column is dropped. This stage ensures all wrong data are truly fixed.

    The procedure to fetch the data takes 7 minutes on average.

    This project and its code were inspired by this GitHub code.

  2. Python scripts for automatically enchancing MOP entries from geonames and...

    • explore.openaire.eu
    Updated Jan 19, 2024
    Cite
    Effie Karuzaki (2024). Python scripts for automatically enchancing MOP entries from geonames and Ulan [Dataset]. http://doi.org/10.5281/zenodo.10532597
    Explore at:
    Dataset updated
    Jan 19, 2024
    Authors
    Effie Karuzaki
    Description

    • import_missing_lat_long.py: This script takes a GeoNames URL of a location, retrieves the latitude and longitude of that location from the GeoNames database, and inserts these values into the corresponding Location knowledge element in the CAP.
    • import_missing_biograpgy.py: This script takes a ULAN URL of an artist, retrieves his/her biographical details from the ULAN database, and inserts these details into the corresponding Person knowledge element in the CAP.
    • import missing nationalities.py: This script takes a ULAN URL of an artist, retrieves his/her nationality information from the ULAN database, and inserts these details into the corresponding Person knowledge element in the CAP.
    • import missing alt_names.py: This script takes a ULAN URL of an artist, retrieves the alternative names by which he or she is also known from the ULAN database, and inserts these details into the corresponding Person knowledge element in the CAP.
    • Find_missing_birth_and_death_information.py: This script takes a ULAN URL of an artist, retrieves his/her birth and death dates from the ULAN database, and inserts these details into the corresponding Person knowledge element in the CAP.
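
    As an illustration of the GeoNames lookup these scripts perform, a sketch using the public GeoNames JSON web service; the URL parsing, the username parameter, and the omitted CAP-insertion step are assumptions, not the original code:

```python
import re
import requests

def fetch_lat_long(geonames_url: str, username: str) -> tuple[float, float]:
    # A GeoNames URL such as https://www.geonames.org/2950159/berlin.html
    # embeds the numeric geonameId, which the JSON API accepts directly.
    geoname_id = re.search(r"/(\d+)", geonames_url).group(1)
    resp = requests.get(
        "http://api.geonames.org/getJSON",
        params={"geonameId": geoname_id, "username": username},
        timeout=10,
    )
    data = resp.json()
    return float(data["lat"]), float(data["lng"])

# Inserting the values into the Location knowledge element in the CAP is
# project-specific and omitted here.
```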

  3. Learn Data Science Series Part 1

    • kaggle.com
    Updated Dec 30, 2022
    Cite
    Rupesh Kumar (2022). Learn Data Science Series Part 1 [Dataset]. https://www.kaggle.com/datasets/hunter0007/learn-data-science-part-1
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 30, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rupesh Kumar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Please feel free to share it with others and consider supporting me if you find it helpful ⭐️.

    Overview:

    • Chapter 1: Getting started with pandas
    • Chapter 2: Analysis: Bringing it all together and making decisions
    • Chapter 3: Appending to DataFrame
    • Chapter 4: Boolean indexing of dataframes
    • Chapter 5: Categorical data
    • Chapter 6: Computational Tools
    • Chapter 7: Creating DataFrames
    • Chapter 8: Cross sections of different axes with MultiIndex
    • Chapter 9: Data Types
    • Chapter 10: Dealing with categorical variables
    • Chapter 11: Duplicated data
    • Chapter 12: Getting information about DataFrames
    • Chapter 13: Gotchas of pandas
    • Chapter 14: Graphs and Visualizations
    • Chapter 15: Grouping Data
    • Chapter 16: Grouping Time Series Data
    • Chapter 17: Holiday Calendars
    • Chapter 18: Indexing and selecting data
    • Chapter 19: IO for Google BigQuery
    • Chapter 20: JSON
    • Chapter 21: Making Pandas Play Nice With Native Python Datatypes
    • Chapter 22: Map Values
    • Chapter 23: Merge, join, and concatenate
    • Chapter 24: Meta: Documentation Guidelines
    • Chapter 25: Missing Data
    • Chapter 26: MultiIndex
    • Chapter 27: Pandas Datareader
    • Chapter 28: Pandas IO tools (reading and saving data sets)
    • Chapter 29: pd.DataFrame.apply
    • Chapter 30: Read MySQL to DataFrame
    • Chapter 31: Read SQL Server to Dataframe
    • Chapter 32: Reading files into pandas DataFrame
    • Chapter 33: Resampling
    • Chapter 34: Reshaping and pivoting
    • Chapter 35: Save pandas dataframe to a csv file
    • Chapter 36: Series
    • Chapter 37: Shifting and Lagging Data
    • Chapter 38: Simple manipulation of DataFrames
    • Chapter 39: String manipulation
    • Chapter 40: Using .ix, .iloc, .loc, .at and .iat to access a DataFrame
    • Chapter 41: Working with Time Series
  4. Pre-Processed Power Grid Frequency Time Series

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1 more
    Updated Apr 22, 2020
    + more versions
    Cite
    Johannes Kruse; Benjamin Schäfer; Dirk Witthaut (2020). Pre-Processed Power Grid Frequency Time Series [Dataset]. http://doi.org/10.5281/zenodo.5105820
    Explore at:
    Dataset updated
    Apr 22, 2020
    Authors
    Johannes Kruse; Benjamin Schäfer; Dirk Witthaut
    Description

    Overview

    This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in Python. The data covers three synchronous areas of the European power grid:

    • Continental Europe
    • Great Britain
    • Nordic

    This work is part of the paper "Predictability of Power Grid Frequency" [1]. Please cite this paper when using the data and the code. For a detailed documentation of the pre-processing procedure we refer to the supplementary material of the paper.

    Data sources

    We downloaded the frequency recordings from publicly available repositories of three different Transmission System Operators (TSOs).

    • Continental Europe [2]: We downloaded the data from the German TSO TransnetBW GmbH, which retains the copyright on the data but allows re-publishing it upon request [3].
    • Great Britain [4]: The download was supported by National Grid ESO Open Data, which belongs to the British TSO National Grid. They publish the frequency recordings under the NGESO Open License [5].
    • Nordic [6]: We obtained the data from the Finnish TSO Fingrid, which provides the data under the open license CC-BY 4.0 [7].

    Content of the repository

    A) Scripts

    1) In the "Download_scripts" folder you will find three scripts to automatically download frequency data from the TSOs' websites.
    2) In "convert_data_format.py" we save the data with corrected timestamp formats. Missing data is marked as NaN (processing step (1) in the supplementary material of [1]).
    3) In "clean_corrupted_data.py" we load the converted data and identify corrupted recordings. We mark them as NaN and clean some of the resulting data holes (processing step (2) in the supplementary material of [1]).

    The Python scripts run with Python 3.7 and with the packages found in "requirements.txt".

    B) Yearly converted and cleansed data

    The folders "
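
    The actual scripts are in the repository; purely as an illustration of the NaN-marking idea in "clean_corrupted_data.py" (the file name and corruption criteria below are assumptions, the real criteria are documented in the supplementary material of [1]):

```python
import numpy as np
import pandas as pd

# Load one year of converted data (hypothetical file name).
freq = pd.read_csv("2019_converted.csv", index_col=0).squeeze("columns")

# Assumed criterion, for illustration only: values far outside the nominal
# 50 Hz band are treated as corrupted and marked as NaN.
freq[(freq < 49.0) | (freq > 51.0)] = np.nan

# Short data holes can then be filled; longer ones stay NaN.
freq = freq.interpolate(limit=5)
```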

  5. Data from: CADDI: An in-Class Activity Detection Dataset using IMU data from...

    • observatorio-cientifico.ua.es
    • scidb.cn
    Updated 2025
    Cite
    Marquez-Carpintero, Luis; Suescun-Ferrandiz, Sergio; Pina-Navarro, Monica; Gomez-Donoso, Francisco; Cazorla, Miguel (2025). CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost sensors [Dataset]. https://observatorio-cientifico.ua.es/documentos/668fc49bb9e7c03b01be251c
    Explore at:
    Dataset updated
    2025
    Authors
    Marquez-Carpintero, Luis; Suescun-Ferrandiz, Sergio; Pina-Navarro, Monica; Gomez-Donoso, Francisco; Cazorla, Miguel
    Description

    Data Description

    The CADDI dataset is designed to support research in in-class activity recognition using IMU data from low-cost sensors. It provides multimodal data capturing 19 different activities performed by 12 participants in a classroom environment, utilizing both IMU sensors from a Samsung Galaxy Watch 5 and synchronized stereo camera images. This dataset enables the development and validation of activity recognition models using sensor fusion techniques.

    Data Generation Procedures

    The data collection process involved recording both continuous and instantaneous activities that typically occur in a classroom setting. The activities were captured using a custom setup, which included:

    • A Samsung Galaxy Watch 5 to collect accelerometer, gyroscope, and rotation vector data at 100 Hz.
    • A ZED stereo camera capturing 1080p images at 25-30 fps.
    • A synchronized computer acting as a data hub, receiving IMU data and storing images in real time.
    • A D-Link DSR-1000AC router for wireless communication between the smartwatch and the computer.

    Participants were instructed to arrange their workspace as they would in a real classroom, including a laptop, notebook, pens, and a backpack. Data collection was performed under realistic conditions, ensuring that activities were captured naturally.

    Temporal and Spatial Scope

    • The dataset contains a total of 472.03 minutes of recorded data.
    • The IMU sensors operate at 100 Hz, while the stereo camera captures images at 25-30 Hz.
    • Data was collected from 12 participants, each performing all 19 activities multiple times.
    • The geographical scope of data collection was Alicante, Spain, under controlled indoor conditions.

    Dataset Components

    The dataset is organized into JSON and PNG files, structured hierarchically:

    • IMU Data: Stored in JSON files, containing:
      • Samsung Linear Acceleration Sensor (X, Y, Z values, 100 Hz)
      • LSM6DSO Gyroscope (X, Y, Z values, 100 Hz)
      • Samsung Rotation Vector (X, Y, Z, W quaternion values, 100 Hz)
      • Samsung HR Sensor (heart rate, 1 Hz)
      • OPT3007 Light Sensor (ambient light levels, 5 Hz)
    • Stereo Camera Images: High-resolution 1920×1080 PNG files from left and right cameras.
    • Synchronization: Each IMU data record and image is timestamped for precise alignment.

    Data Structure

    The dataset is divided into continuous and instantaneous activities:

    • Continuous Activities (e.g., typing, writing, drawing) were recorded for 210 seconds, with the central 200 seconds retained.
    • Instantaneous Activities (e.g., raising a hand, drinking) were repeated 20 times per participant, with data captured only during execution.

    The dataset is structured as:

    /continuous/subject_id/activity_name/
      /camera_a/ → left camera images
      /camera_b/ → right camera images
      /sensors/ → JSON files with IMU data

    /instantaneous/subject_id/activity_name/repetition_id/
      /camera_a/
      /camera_b/
      /sensors/

    Data Quality & Missing Data

    • The smartwatch buffers 100 readings per second before sending them, ensuring minimal data loss.
    • Synchronization latency between the smartwatch and the computer is negligible.
    • Not all IMU samples have corresponding images due to the different recording rates.
    • Outliers and anomalies were handled by discarding incomplete sequences at the start and end of continuous activities.

    Error Ranges & Limitations

    • Sensor data may contain noise due to minor hand movements.
    • The heart rate sensor operates at 1 Hz, limiting its temporal resolution.
    • Camera exposure settings were automatically adjusted, which may introduce slight variations in lighting.

    File Formats & Software Compatibility

    • IMU data is stored in JSON format, readable with Python's json library.
    • Images are in PNG format, compatible with all standard image processing tools.
    • Recommended libraries for data analysis: numpy, pandas, scikit-learn, tensorflow, pytorch
    • Visualization: matplotlib, seaborn
    • Deep Learning: Keras, PyTorch

    Potential Applications

    • Development of activity recognition models in educational settings.
    • Study of student engagement based on movement patterns.
    • Investigation of sensor fusion techniques combining visual and IMU data.

    This dataset represents a unique contribution to activity recognition research, providing rich multimodal data for developing robust models in real-world educational environments.

    Citation

    If you find this project helpful for your research, please cite our work using the following bibtex entry:

    @misc{marquezcarpintero2025caddiinclassactivitydetection,
      title={CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost sensors},
      author={Luis Marquez-Carpintero and Sergio Suescun-Ferrandiz and Monica Pina-Navarro and Miguel Cazorla and Francisco Gomez-Donoso},
      year={2025},
      eprint={2503.02853},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.02853},
    }
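
    Returning to the file layout described above, a minimal loading sketch with Python's json library; the subject/activity IDs, file names, and JSON keys below are hypothetical and should be checked against the actual files:

```python
import json
from pathlib import Path

base = Path("continuous") / "subject_01" / "typing"  # hypothetical IDs

with open(base / "sensors" / "imu.json") as f:       # hypothetical file name
    imu = json.load(f)

left_images = sorted((base / "camera_a").glob("*.png"))
right_images = sorted((base / "camera_b").glob("*.png"))

# Each IMU record and image is timestamped, so the streams can be aligned by
# nearest-timestamp matching despite the different sampling rates.
```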

  6. Market Basket Analysis

    • kaggle.com
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 9, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data; it covers all the transactions that happened over a period of time. The retailer will use the results to grow its business and to suggest itemsets to customers, which lets us increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem with Association Rules, a type of unsupervised learning technique that checks for the dependency of one data item on another.

    Introduction

    Association Rule mining is most often used when you are planning to build associations between different objects in a set, and it works well for finding frequent patterns in a transaction database. It can tell you which items customers frequently buy together, allowing the retailer to identify relationships between the items.

    An Example of Association Rules

    Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat":

    • support = P(mouse & mat) = 8/100 = 0.08
    • confidence = support / P(mouse) = 0.08/0.10 = 0.80
    • lift = confidence / P(mat) = 0.80/0.09 ≈ 8.9

    This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
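
    The arithmetic above, checked in a few lines of Python (the analysis itself below uses R):

```python
# Worked example: association-rule metrics for the rule mouse -> mat.
n_customers = 100
n_mouse, n_mat, n_both = 10, 9, 8

support = n_both / n_customers             # P(mouse & mat) = 0.08
confidence = n_both / n_mouse              # support / P(mouse) = 0.80
lift = confidence / (n_mat / n_customers)  # 0.80 / 0.09 ~ 8.9

print(support, confidence, round(lift, 1))  # 0.08 0.8 8.9
```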

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of Rows: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.


    Libraries in R

    First, we need to load the required libraries. Below I briefly describe each one.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - An opinionated collection of R packages designed for data science; it makes it easy to install and load multiple 'tidyverse' packages in a single step.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.


    Data Pre-processing

    Next, we load Assignment-1_Data.xlsx into R and read the dataset. Now we can see our data in R.


    After that, we clean our data frame and remove missing values.


    To apply Association Rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...

  7. AIS data

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jun 26, 2023
    Cite
    Luka Grgičević (2023). AIS data [Dataset]. http://doi.org/10.5281/zenodo.8064564
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 26, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Luka Grgičević
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Terrestrial vessel automatic identification system (AIS) data was collected around Ålesund, Norway in 2020, from multiple receiving stations with unsynchronized clocks. Features are 'mmsi', 'imo', 'length', 'latitude', 'longitude', 'sog', 'cog', 'true_heading', 'datetime UTC', 'navigational status', and 'message number'. The compact parquet files can be turned into data frames with Python's pandas library, as sketched below. Data is irregularly sampled because of the navigational status. The preprocessing script for training the machine learning models can be found here; there you will also find a dozen trainable models and hundreds of datasets. Visit this website for more information about the data. If you have additional questions, please find our contact information in the links below:
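
    A minimal loading sketch, assuming a hypothetical file name (reading parquet with pandas requires the pyarrow or fastparquet engine):

```python
import pandas as pd

ais = pd.read_parquet("alesund_2020.parquet")  # hypothetical file name

# Reconstruct per-vessel tracks; the data is irregularly sampled, so keep timestamps.
ais["datetime UTC"] = pd.to_datetime(ais["datetime UTC"])
tracks = ais.sort_values("datetime UTC").groupby("mmsi")

for mmsi, track in tracks:
    print(mmsi, len(track), track["sog"].mean())  # points and mean speed over ground
    break
```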

  8. Tour Recommendation Model

    • test.researchdata.tuwien.at
    bin, png +1
    Updated May 14, 2025
    Cite
    Muhammad Mobeel Akbar (2025). Tour Recommendation Model [Dataset]. http://doi.org/10.70124/akpf6-8p175
    Explore at:
    Available download formats: text/markdown, png, bin
    Dataset updated
    May 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Muhammad Mobeel Akbar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 28, 2025
    Description

    Dataset Description for Tour Recommendation Model

    Context and Methodology:

    • Research Domain/Project:
      This dataset is part of the Tour Recommendation System project, which focuses on predicting user preferences and ratings for various tourist places and events. It belongs to the field of Machine Learning, specifically applied to Recommender Systems and Predictive Analytics.

    • Purpose:
      The dataset serves as the training and evaluation data for a Decision Tree Regressor model, which predicts ratings (from 1-5) for different tourist destinations based on user preferences. The model can be used to recommend places or events to users based on their predicted ratings.

    • Creation Methodology:
      The dataset was originally collected from a tourism platform where users rated various tourist places and events. The data was preprocessed to remove missing or invalid entries (such as #NAME? in rating columns). It was then split into subsets for training, validation, and testing the model.

    Technical Details:

    • Structure of the Dataset:
      The dataset is stored as a CSV file (user_ratings_dataset.csv) and contains the following columns:

      • place_or_event_id: Unique identifier for each tourist place or event.

      • rating: Rating given by the user, ranging from 1 to 5.

      The data is split into three subsets:

      • Training Set: 80% of the dataset used to train the model.

      • Validation Set: A small portion used for hyperparameter tuning.

      • Test Set: 20% used to evaluate model performance.

    • Folder and File Naming Conventions:
      The dataset files are stored in the following structure:

      • user_ratings_dataset.csv: The original dataset file containing user ratings.

      • tour_recommendation_model.pkl: The saved model after training.

      • actual_vs_predicted_chart.png: A chart comparing actual and predicted ratings.

    • Software Requirements:
      To open and work with this dataset, the following software and libraries are required:

      • Python 3.x

      • Pandas for data manipulation

      • Scikit-learn for training and evaluating machine learning models

      • Matplotlib for chart generation

      • Joblib for saving and loading the trained model

      The dataset can be opened and processed using any Python environment that supports these libraries.

    • Additional Resources:

      • The model training code, README file, and performance chart are available in the project repository.

      • For detailed explanation and code, please refer to the GitHub repository (or any other relevant link for the code).
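
    As a quick orientation, a minimal sketch of the pipeline described above, assuming the two documented columns and illustrative hyperparameters; the actual training code lives in the project repository:

```python
import joblib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("user_ratings_dataset.csv")
X, y = df[["place_or_event_id"]], df["rating"]

# 80/20 train/test split, with a small validation slice carved from the train set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=42)

model = DecisionTreeRegressor(max_depth=5, random_state=42)  # hyperparameters assumed
model.fit(X_train, y_train)
print("validation R^2:", model.score(X_val, y_val))

joblib.dump(model, "tour_recommendation_model.pkl")
```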

    Further Details:

    • Dataset Reusability:
      The dataset is structured for easy use in training machine learning models for recommendation systems. Researchers and practitioners can utilize it to:

      • Train other types of models (e.g., regression, classification).

      • Experiment with different features or add more metadata to enrich the dataset.

    • Data Integrity:
      The dataset has been cleaned and preprocessed to remove invalid values (such as #NAME? or missing ratings). However, users should ensure they understand the structure and the preprocessing steps taken before reusing it.

    • Licensing:
      The dataset is provided under the CC BY 4.0 license, which allows free usage, distribution, and modification, provided that proper attribution is given.

  9. NBA Games Box Score Since 1949

    • kaggle.com
    Updated Jul 27, 2021
    Cite
    Rafael Greca (2021). NBA Games Box Score Since 1949 [Dataset]. https://www.kaggle.com/rafaelgreca/nba-games-box-score-since-1949/tasks
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 27, 2021
    Dataset provided by
    Kaggle
    Authors
    Rafael Greca
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    I used Paul Rossotti's data set in my personal projects. However, after a long time using it I noticed that I needed both older and more recent data, so I ended up building a more complete data set, and I thought it might help someone. Since his data set was used as a base, all the credit goes to him; I only extended it. I am also willing to update this data set yearly.

    You can access his work using the link in the reference section.

    Content

    This data set contains information about the box score of every NBA game from the 1949-50 season until now. You can get the data individually for each season or decade, or as a compilation of all the data. In total the data set has approximately 120 features/columns/attributes that go from basic stats (like total points, rebounds, assists, blocks, and so on) to more advanced ones (like floor impact counter, assist rate, possessions, pace, play% and much more!).

    Each game contains the same features for the home team and its opponent (the away team), plus some features related to the game itself (like game date, season, season type and match winner). If you like stats and the NBA, this data set was made for you!

    If you want to know more about the formulas used and their meaning, please check the reference section. You can also check the "features_description" file, where you will find a brief description of each feature and its respective formula (only for the more advanced stats).

    Acknowledgements

    LAST TIME THE DATA SET WAS UPDATED:

    July 26, 2021 (07/26/2021) – 1pm EDT

    Questions about the dataset:

    Q: How did you collect the data? A: I created a web scraper in Python to do the hard work.

    Q: How did you fill the missing values? A: The float columns I filled with "0.0". The object columns I left with a NaN value, but you don't need to worry about it. The only columns where I needed to do that were teamWins, teamLosses, opptWins, and opptLosses, and only 8 rows in the entire data set have NaN values! Great news, isn't it?

    Q: Where can I see the description/formula for each attribute/column/feature? A: You can check it out in the “features_informations” file inside the data set.

    Q: Will you constantly update the data set? A: Yes!

    Q: Does the data contain only regular season games? A: No! The data contains playoff games as well.
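
    A sketch of the fill strategy described above, assuming a hypothetical file name:

```python
import pandas as pd

games = pd.read_csv("nba_games_all.csv")  # hypothetical file name

# Float columns are filled with 0.0; object columns keep their NaN values.
float_cols = games.select_dtypes(include="float").columns
games[float_cols] = games[float_cols].fillna(0.0)
```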

    References

    About the stats and formulas used: https://www.basketball-reference.com/about/glossary.html https://basketball.realgm.com/info/glossary https://www.kaggle.com/pablote/nba-enhanced-stats (Paul Rossotti’s data set)

    Where the data was collected: https://www.basketball-reference.com/leagues/

  10. WNBA Games Box Score Since 1997

    • kaggle.com
    Updated Jan 13, 2021
    Cite
    Rafael Greca (2021). WNBA Games Box Score Since 1997 [Dataset]. https://www.kaggle.com/datasets/rafaelgreca/wnba-games-box-score-since-1997/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 13, 2021
    Dataset provided by
    Kaggle
    Authors
    Rafael Greca
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Content

    This data set contains information about the box score of every WNBA game from 1997 until now. You can get the data individually for each season or decade, or as a compilation of all the data. In total the data set has approximately 100 features/columns/attributes that go from basic stats (like total points, rebounds, assists, blocks, and so on) to more advanced ones (like floor impact counter, assist rate, possessions, pace, play% and much more!).

    Each game contains the same features for the home team and its opponent (the away team), plus some features related to the game itself (like game date, season, season type and match winner). If you like stats and the WNBA, this data set was made for you!

    If you want to know more about the formulas used and their meaning, please check the reference section. You can also check the "features_description" file, where you will find a brief description of each feature and its respective formula (only for the more advanced stats).

    Acknowledgements

    LAST TIME THE DATA SET WAS UPDATED:

    January 13, 2021 (01/13/2021) – 1pm EDT

    Questions about the dataset:

    Q: How did you collect the data? A: I created a web scraper in Python to do the hard work.

    Q: How did you fill the missing values? A: The float columns I filled with "0.0". The object columns I left with a NaN value, but you don't need to worry about it. The only columns where I needed to do that were teamWins, teamLosses, opptWins, and opptLosses, and only 8 rows in the entire data set have NaN values! Great news, isn't it?

    Q: Where can I see the description/formula for each attribute/column/feature? A: You can check it out in the “features_informations” file inside the data set.

    Q: Will you constantly update the data set? A: Yes!

    Q: Does the data contain only regular season games? A: No! The data contains playoff games as well.

    References

    About the stats and formulas used: https://www.basketball-reference.com/about/glossary.html https://basketball.realgm.com/info/glossary https://www.kaggle.com/rafaelgreca/nba-games-box-score-since-1949 (My other data set about the NBA)

    Where the data was collected: https://www.basketball-reference.com/leagues/

  11. Aschaffenburg Pose Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 11, 2022
    Cite
    Sick, Bernhard (2022). Aschaffenburg Pose Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5724485
    Explore at:
    Dataset updated
    Feb 11, 2022
    Dataset provided by
    Sick, Bernhard
    Hetzel, Manuel
    Bieshaar, Maarten
    Reichert, Hannes
    Fuchs, Erich
    Kress, Viktor
    Doll, Konrad
    Zernetsch, Stefan
    Reitberger, Günther
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Aschaffenburg
    Description

    This dataset contains trajectories as well as body poses of pedestrians and cyclists in road traffic recorded in Aschaffenburg, Germany. It is appropriate for training and testing methods for trajectory forecasting and intention prediction of vulnerable road users (VRUs) based on the past trajectory and body poses.

    The dataset consists of more than 6526 trajectories of pedestrians and 1734 trajectories of cyclists recorded by a research vehicle of the University of Applied Sciences Aschaffenburg (Kooperative Automatisierte Verkehrssysteme) in urban traffic. The trajectories have been measured with the help of a stereo camera while compensating the vehicle's own motion. The body posture of the pedestrians and cyclists is available in the form of 2D and 3D poses. The 2D poses contain joint positions in an image coordinate system, while the 3D poses contain actual three-dimensional positions. A detailed description and evaluation of the pose estimation method can be found in [1]. In addition to the trajectories and the poses, manually created labels of the respective motion states are included.

    To read the provided data, unzip the file first. It contains one json file for each of the trajectories. Each json file contains the following data:

    vru_type: type of the VRU (pedestrian ('ped') or cyclist ('bike'))

    timestamps: UTC-Timestamps. The motions of the VRUs were recorded at a frequency of 25 Hz.

    set: Assignment to one of the three datasets train, validation or test. For pedestrians and cyclists, 60% of the data is used for training, 20% for validation and the remaining 20% for testing. During all splits, it was ensured that the distribution of the motion states is as similar as possible.

    pose2d: 2D poses with 18 joint positions in image coordinates with an additional uncertainty between 0 and 1 (third coordinate). Missing positions are encoded as 'nan'.

    pose3d: 3D poses with the trajectories of 14 joints in a three-dimensional coordinate system. Missing positions are encoded as 'nan'.

    head_smoothed: Smoothed (by RTS smoother) trajectory of the head in a three-dimensional coordinate system. It is treated as the ground-truth position and must not be used as input for a prediction method.

    motion_primitives: One-hot encoded labels of the respective motion state. For pedestrians, a distinction is made between the states wait, start, move, and stop. For cyclists, the states wait, start, move, stop, turn left, and turn right are annotated.

    Python code for reading the data can be found on Github: github.com/CooperativeAutomatedTrafficSystemsLab/Aschaffenburg-Pose-Dataset
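
    A minimal reading sketch using the documented keys; the file name is hypothetical, and the official reader linked above is authoritative:

```python
import json

with open("trajectory_0001.json") as f:  # hypothetical file name
    traj = json.load(f)

# Example of filtering by the documented 'set' and 'vru_type' fields:
# keep only training pedestrians.
if traj["set"] == "train" and traj["vru_type"] == "ped":
    timestamps = traj["timestamps"]   # recorded at 25 Hz
    head = traj["head_smoothed"]      # ground-truth 3D head trajectory
    poses3d = traj["pose3d"]          # 14 joints; missing positions are 'nan'
```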

    Citation

    If you find this dataset useful, please cite this paper (and refer the data as Aschaffenburg Pose Dataset or APD):

    Kress, V. ; Zernetsch, S. ; Doll, K. ; Sick, B. : Pose Based Trajectory Forecast of Vulnerable Road Users Using Recurrent Neural Networks. In: Pattern Recognition. ICPR International Workshops and Challenges, Springer International Publishing, 2020, pp. 57-71

    Similar Datasets

    Pedestrians and Cyclists in Road Traffic: Trajectories, 3D Poses and Semantic Maps

    Cyclist Actions: Optical Flow Sequences and Trajectories

    Cyclist Actions: Motion History Images and Trajectories

    More datasets

    Acknowledgment

    This work was supported by “Zentrum Digitalisierung.Bayern”. In addition, the work is backed by the project DeCoInt2 , supported by the German Research Foundation (DFG) within the priority program SPP 1835: “Kooperativ interagierende Automobile”, grant numbers DO 1186/1-2 and SI 674/11-2.

    References

    [1] Kress, V. ; Jung, J. ; Zernetsch, S. ; Doll, K. ; Sick, B. : Human Pose Estimation in Real Traffic Scenes. In: IEEE Symposium Series on Computational Intelligence (SSCI), 2018, pp. 518–523, doi: 10.1109/SSCI.2018.8628660

  12. Pre-Processed Power Grid Frequency Time Series (2020-2023)

    • zenodo.org
    bin, zip
    Updated Jul 4, 2025
    Cite
    Alexandra Nikoltchovska; Sebastian Pütz; Xiao Li; Veit Hagenmeyer; Benjamin Schäfer (2025). Pre-Processed Power Grid Frequency Time Series (2020-2023) [Dataset]. http://doi.org/10.5281/zenodo.15784548
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Jul 4, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alexandra Nikoltchovska; Sebastian Pütz; Xiao Li; Veit Hagenmeyer; Benjamin Schäfer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset provides pre-processed frequency time series data for 2020-2023, covering two synchronous areas of the European power grid:

    • Continental Europe
    • Nordic

    This work is part of the paper "Probabilistic and Explainable Machine Learning for Tabular Power Grid Data" [1]. Please cite this paper when using the data and the code.

    Relationship to Previous Work

    This dataset extends the time coverage of the original dataset [2], which covered 2012-2021. For a detailed documentation of the pre-processing procedure we refer to the supplementary material of paper "Predictability of Power Grid Frequency"[3]. The same methodology and preprocessing procedures have been applied to maintain consistency and comparability with the original work.

    Data Sources

    • Continental Europe [4]: To get the data, just replace the year and month in the URL with the ones needed. In June 2022 the frequency data of Continental Europe was moved to [5]. The download and preprocessing scripts were adapted to the new data sources. To distinguish between both data sources, the data from the SG HoBA data [5] was stored in a separate folder. In order to keep the structure of one file per year, the combine_2022_ce_data.ipynb notebook was used to combine the data from the first and second half of 2022, and the result was saved in the 2022_cleansed/TransnetBW/2022.zip folder.
    • Nordic [6]: We obtained the data from the Finnish TSO Fingrid, which provides the data under the open license CC-BY 4.0 [7].

    Content of the repository

    A) Scripts

    1) In the `Download_scripts` folder you will find three scripts to automatically download frequency data from the TSO's websites.
    2) In `convert_data_format.py` we save the data with corrected timestamp formats.
    3) In `clean_corrupted_data.py` we load the converted data and identify corrupted recordings. We mark them as NaN and clean some of the resulting data holes (processing step (2) in the supplementary material of [3]).

    The python scripts were adapted to run with Python 3.11 and with the packages found in `requirements.txt`.

    B) Yearly converted and cleansed data

    The folder `Data_cleansed` contains the output of `clean_corrupted_data.py`.

    • File type: The files are zipped csv-files, where each file comprises one year.
    • Data format: The files contain two columns. The second column contains the frequency values in Hz, and the first the time stamps in the format *Year-Month-Day Hour-Minute-Second*, given as naive local time. The local time refers to the following time zones and includes Daylight Saving Time (python time zone in brackets):
      • TransnetBW: Continental European Time (*CET*)
      • Fingrid: Finland (*Europe/Helsinki*)
    • NaN representation: We mark corrupted and missing data as "NaN" in the csv-files.
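
    A minimal loading sketch for one cleansed yearly file (the extracted file name is hypothetical). Because the timestamps are naive local time including DST, localization has to handle ambiguous and nonexistent hours:

```python
import pandas as pd

# Load one year of cleansed TransnetBW data (two columns: timestamp, frequency in Hz).
freq = pd.read_csv("2022.csv", index_col=0, parse_dates=True).squeeze("columns")

# Naive local CET/CEST timestamps: mark DST-ambiguous/nonexistent times as NaT.
freq.index = freq.index.tz_localize("CET", ambiguous="NaT", nonexistent="NaT")

print(freq.isna().mean())  # fraction of samples marked as NaN (corrupted/missing)
```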