Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was acquired during Cyber Czech, a hands-on cyber defense exercise (Red Team/Blue Team) held in March 2019 at Masaryk University, Brno, Czech Republic. Network traffic flows and a wide variety of event logs were captured in an exercise network deployed in the KYPO Cyber Range Platform.
Contents
The dataset covers two distinct time intervals, corresponding to the official schedule of the exercise. The timestamps below are in ISO 8601 format.
Day 1, March 19, 2019
Start: 2019-03-19T11:00:00.000000+01:00
End: 2019-03-19T18:00:00.000000+01:00
Day 2, March 20, 2019
Start: 2019-03-20T08:00:00.000000+01:00
End: 2019-03-20T15:30:00.000000+01:00
The captured and collected data were normalized into three distinct event types and are stored as structured JSON. The data are sorted by timestamp, which represents the time the events were observed. Each event type includes a raw payload ready for further processing and analysis. Descriptions of the respective event types and the corresponding data files follow.
cz.muni.csirt.IpfixEntry.tgz: an archive of IPFIX traffic flows enriched with an additional payload of parsed application protocols in raw JSON.
cz.muni.csirt.SyslogEntry.tgz: an archive of Linux Syslog entries with the payload of corresponding text-based log messages.
cz.muni.csirt.WinlogEntry.tgz: an archive of Windows Event Log entries with the payload of original events in raw XML.
Each archive listed above includes a directory of the same name with the following four files, ready to be processed (a loading sketch follows the file list).
data.json.gz: the actual data entries in a single gzipped JSON file.
dictionary.yml: data dictionary for the entries.
schema.ddl: data schema for the Apache Spark analytics engine.
schema.jsch: JSON schema for the entries.
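The schema files make it straightforward to load one event type into Spark. Below is a minimal sketch, assuming the archive has been extracted in place and that schema.ddl holds a DDL-formatted schema string; PySpark's reader accepts DDL strings and reads gzipped JSON transparently.

```python
# Minimal sketch: load the Syslog event type into Apache Spark, assuming
# the archive has been extracted into a directory of the same name.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cyber-czech").getOrCreate()

# schema.ddl is assumed to contain a DDL-formatted schema string,
# which DataFrameReader.schema() accepts directly.
with open("cz.muni.csirt.SyslogEntry/schema.ddl") as f:
    ddl_schema = f.read().strip()

# Spark decompresses .gz JSON transparently.
events = spark.read.schema(ddl_schema).json("cz.muni.csirt.SyslogEntry/data.json.gz")
events.printSchema()
events.show(5, truncate=False)
```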
Finally, the exercise network topology is described in the machine-readable NetJSON format and is part of an archive of auxiliary files, auxiliary-material.tgz, which includes the following (a small inspection sketch follows the list).
global-gateway-config.json: the network configuration of the global gateway in the NetJSON format.
global-gateway-routing.json: the routing configuration of the global gateway in the NetJSON format.
redteam-attack-schedule.{csv,odt}: the schedule of the Red Team attacks in CSV and ODT format. Source for Table 2.
redteam-reserved-ip-ranges.{csv,odt}: the list of IP segments reserved for the Red Team in CSV and ODT format. Source for Table 1.
topology.{json,pdf,png}: the topology of the complete Cyber Czech exercise network in the NetJSON, PDF and PNG format.
topology-small.{pdf,png}: simplified topology in the PDF and PNG format. Source for Figure 1.
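For the topology, here is a minimal inspection sketch, assuming topology.json follows the NetJSON NetworkGraph convention with top-level "nodes" and "links" arrays (the extraction path is also an assumption):

```python
# Minimal sketch: inspect the exercise topology. The path and the
# "nodes"/"links" keys assume the NetJSON NetworkGraph convention.
import json

with open("auxiliary-material/topology.json") as f:  # assumed path
    topo = json.load(f)

print(f"{len(topo.get('nodes', []))} nodes, {len(topo.get('links', []))} links")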
This dataset was created by Chinmay Choudhary
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains key characteristics of the data described in the Data Descriptor "Reference values for resting and post-exercise hemodynamic parameters in a 6-18 year old population". Contents:
1. a human-readable metadata summary table in CSV format
2. a machine-readable metadata file in JSON format
Versioning Note: Version 2 was generated when the metadata format was updated from JSON to JSON-LD. This was an automatic process that changed only the format, not the contents, of the metadata.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A piano dataset containing 127 audio recordings of the first six exercises from Hanon's The Virtuoso Pianist, with the corresponding onset and chroma labeling. It is used for the automatic assessment of piano exercises.
It also contains the annotation (.json) and LilyPond (.ly) files for the feature extraction and visualization of the music score of each exercise, as well as the pretrained model files (all_exs_p.joblib and all_ex_r.joblib); a loading sketch follows.
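The .joblib files can presumably be loaded with joblib. A minimal sketch, assuming scikit-learn-compatible estimators (what the _p/_r suffixes denote is not stated here):

```python
# Minimal sketch: load the pretrained model files shipped with the dataset.
# The meaning of the "_p" and "_r" suffixes is an open assumption; consult
# the dataset's annotation files for the expected input features.
from joblib import load

model_p = load("all_exs_p.joblib")
model_r = load("all_ex_r.joblib")
print(type(model_p), type(model_r))
```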
PSB2: The Second Program Synthesis Benchmark Suite Datasets
Version 1.0.1 (see version history at bottom)
This repository contains datasets for the 25 problems described in the paper PSB2: The Second Program Synthesis Benchmark Suite. These problems come from a variety of sources and require a range of programming constructs and datatypes to solve. These datasets are designed to be usable for any method of performing general program synthesis, including but not limited to inductive program synthesis and evolutionary methods such as genetic programming. For more information, see the associated website: https://cs.hamilton.edu/~thelmuth/PSB2/PSB2.html
Use
Each problem in the benchmark suite is located in a separate directory within the datasets directory. For each problem, we provide a set of edge cases and a set of random cases. The edge cases are hand-chosen cases representing the limits of the problem. The random cases are generated from problem-specific distributions. For each problem, we include exactly 1 million random cases. A typical use of these datasets for a set of program synthesis runs would be: for each run, use every edge case in the training set; for each run, use a different, randomly sampled set of random cases in the training set; and use a larger set of random cases as an unseen test set.
Sampling Libraries
We provide the following libraries to make downloading and sampling these datasets easier. Using these libraries, you do not need to download the entire dataset from Zenodo; the individual problem datasets are downloaded and stored once when first sampled.
Python: https://github.com/thelmuth/psb2-python
Clojure: https://github.com/thelmuth/psb2-clojure
Dataset format
Each edge and random dataset is provided in three formats, CSV, JSON, and EDN, with all three containing identical data. The CSV files are formatted as follows: the first row of the file gives the column names, and each following row corresponds to one set of program inputs and expected outputs. Input columns are labeled input1, input2, etc., and output columns are labeled output1, output2, etc. In CSVs, string inputs and outputs are double-quoted only when necessary. Newlines within strings are escaped, and columns are comma-separated. The JSON and EDN files are formatted using the JSON Lines standard (adapted for EDN): each case is put on its own line of the data file. The files should be read line by line, with each line parsed into an object/map using a JSON/EDN parser. A small reading sketch follows at the end of this entry.
Citation
If you use these datasets in a publication, please cite the paper PSB2: The Second Program Synthesis Benchmark Suite and include a link to this repository. BibTeX entry for the paper:
@InProceedings{Helmuth:2021:GECCO,
  author = "Thomas Helmuth and Peter Kelly",
  title = "{PSB2}: The Second Program Synthesis Benchmark Suite",
  booktitle = "2021 Genetic and Evolutionary Computation Conference",
  series = {GECCO '21},
  year = "2021",
  isbn13 = {978-1-4503-8350-9},
  address = {Lille, France},
  size = {10 pages},
  doi = {10.1145/3449639.3459285},
  publisher = {ACM},
  publisher_address = {New York, NY, USA},
  month = {10-14} # jul,
  doi-url = {https://doi.org/10.1145/3449639.3459285},
  URL = {https://dl.acm.org/doi/10.1145/3449639.3459285},
}
Version History
1.0.0 - 2021/4/10 - Initial publication of PSB2 datasets on Zenodo.
1.0.1 - 2021/7/9 - Changes to CSVs to quote all strings that could be read as integers. No changes to the actual data, only formatting.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
While not as popular as in the USA, European basketball has its own fan base. I have seen some scattered datasets on European basketball, but could not find one with historical play-by-play data. Hence, I decided to create my own with the help of Euroleague's easily accessible JSON API.
The data includes play-by-play records for the Euroleague (the top European competition) since 2007 and for the Eurocup (the second-tier international competition) from 2012 to 2020.
The data is acquired via Euroleague's JSON API, which is easy to query. I created a short do-it-yourself (DIY) notebook for those who want to fetch the data themselves; a sketch follows.
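The sketch below shows the general shape of such a fetch. The endpoint URL and its gamecode/seasoncode parameters are my assumptions about Euroleague's publicly reachable live API, not a documented contract, and may need adjusting.

```python
# Minimal sketch of fetching play-by-play JSON. The endpoint and its
# parameters are assumptions, not guaranteed API behavior.
import requests

URL = "https://live.euroleague.net/api/PlaybyPlay"  # assumed endpoint

def fetch_playbyplay(gamecode: int, season: int) -> dict:
    resp = requests.get(
        URL,
        params={"gamecode": gamecode, "seasoncode": f"E{season}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

pbp = fetch_playbyplay(gamecode=1, season=2019)
print(list(pbp.keys()))
```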
Two GitHub repos were my starting points; without them, it would have been hard to compile this data.
Regularly updating this data and producing interesting analyses would definitely be worthwhile. As my time allows, I will try to do so.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains 50,000 ranked ladder matches from the Dota 2 data dump created by OpenDota. It was inspired by the Dota 2 Matches data published here by Joe Ramir. This is an updated and improved version of that dataset. I have kept the same image and a similar title.
Dota 2 is a popular free-to-play MOBA that can take up thousands of hours of your life. Roughly as many games as this dataset contains are played every hour. If you like the data, an additional 2-3 million matches are readily available for download.
The aim of this dataset is to enable exploration of player behavior, skill estimation, or anything else you find interesting. The intent is to create an accessible, easy-to-use resource that can be expanded and modified as needed. As such, I am open to a wide variety of suggestions for additions or changes.
See https://github.com/odota/core/wiki/JSON-Data-Dump for documentation on the data. I have found a few undocumented areas, including the objectives information. player_slot can be used to combine most of the data and is available in most of the tables. Additionally, all tables include match_id, and some have account_id to make it easier to look at an individual player's matches. match_id and account_id have been re-encoded to save a little space; I can upload conversion tables if needed. I plan on adding a small amount of information soon, including outcomes for an additional 50k-100k matches that occurred after the ones currently uploaded, and some tables for determining which continent or region each match was played in. A join sketch follows the table descriptions below.
matches: contains top-level information about each match. See https://wiki.teamfortress.com/wiki/WebAPI/GetMatchDetails#Tower_Status for interpreting tower and barracks status. The cluster field can link matches to a geographic region.
players: Individual players are identified by account_id, but there is an option to play anonymously, and roughly one third of the account_id values are not available. Anonymous users have an account_id of 0. Contains totals for kills, deaths, denies, etc. Player action counts are available and are indicated by variable names beginning with unit_order_. Counts of the reasons for acquiring or losing gold and for gaining experience have the prefixes gold_ and xp_.
player_time: Contains last hits, experience, and gold sampled at one-minute intervals for all players in all matches. The column names indicate the player_slot; for instance, xp_t_1 holds the experience sums for the player in slot one.
teamfights: Start and stop times of teamfights, as well as the last death time. Teamfights appear to be all battles with three or more deaths; as such, this does not include every battle in a match.
teamfights_players: Additional information provided for each player in each teamfight. player_slot can be used to link this back to players.csv.
objectives: Gives information on all the objectives completed, by which player and at what time.
chat: All chat for the 50k matches. There is plenty of profanity and good-natured trolling.
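As a minimal sketch of the joins described above: players.csv and the join keys are named in the text, while matches.csv as a file name, the radiant_win column, and hero_id as a players column are assumptions based on standard OpenDota match data.

```python
# Minimal sketch: combine per-player rows with match-level outcomes via
# match_id. matches.csv, radiant_win, and hero_id are assumptions.
import pandas as pd

players = pd.read_csv("players.csv")
matches = pd.read_csv("matches.csv")  # assumed file name

df = players.merge(matches[["match_id", "radiant_win"]], on="match_id")

# Dota 2 convention: player_slot values below 128 are Radiant, the rest Dire.
df["is_radiant"] = df["player_slot"] < 128
df["won"] = df["is_radiant"] == df["radiant_win"]
print(df.groupby("hero_id")["won"].mean().sort_values(ascending=False).head())
```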
There seem to be some efforts to establish indicators of skillful play based on specific aspects of gameplay. OpenDota has many statistics and some analysis of specific benchmarks at different points in the game. Dotabuff has a lot of information, but I have not explored it deeply; this is an area in which to gather more information.
Insights from domain experts would also be useful to help clarify which problems are interesting to work on. Some initial task ideas follow.
All of these areas have been worked on, but I am not aware of the most up-to-date research on Dota 2 gameplay.
I plan on setting up several different predictive tasks in the upcoming weeks: a test set of an additional 50 to 100 thousand matches with just hero_id and account_id included, along with the outcome of each match.
The current dataset seems pretty small for modeling individual players. For the moment, I would prefer a wide range of features over a larger dataset.
Dataset idea for anyone interested in creating their own Dota 2 dataset. It would be useful to have a few full matches avai...
The dataset contains test results from a digital intervention study of the CALLIDUS Project in a high school in Berlin. Thirteen students were randomly sampled into two groups and completed various linguistic tasks. The focus of the study was to find out whether learning Latin vocabulary in authentic contexts leads to higher lexical competence compared to memorizing traditional vocabulary lists. The data is available in JSON format as provided by the H5P implementation of xAPI. File names indicate the time of test completion, in the concatenated form "year-month-day-hour-minute-second-millisecond" (a parsing sketch follows below). This allows us to trace the development of individual learners who were fast enough to perform the test twice in a row.
Changelog:
Version 2.0: Each exercise now has a unique ID that is consistent across the whole dataset, so evaluation/visualization can refer to specific exercises more easily.
Version 3.0: A simplified Excel spreadsheet has been added to enhance the reusability of the dataset. It contains a slightly reduced overview of the data, but the core information (user ID, task statement, correct solution, given answer, score, duration) is still present.
Funded by the German Research Foundation (DFG), project number 316618374.
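A minimal sketch of recovering the completion time from a file name, assuming dash-separated numeric components as described (the example file name is made up):

```python
# Minimal sketch: parse "year-month-day-hour-minute-second-millisecond"
# file names into datetimes. The example name below is hypothetical.
from datetime import datetime

def completion_time(filename: str) -> datetime:
    stem = filename.rsplit(".", 1)[0]
    parts = stem.split("-")
    year, month, day, hour, minute, second, millisecond = map(int, parts[:7])
    # datetime's last positional argument is microseconds.
    return datetime(year, month, day, hour, minute, second, millisecond * 1000)

print(completion_time("2019-03-14-10-25-31-123.json"))
```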
Open Data Commons Attribution License (ODC-By) v1.0 https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
The data was downloaded from the extensive cricket data website cricsheet.org in JSON format. I used the pandas Python library to transform the match data into ball-by-ball data with several relevant fields. This allows the data to be used to train regression models and the like; a sketch of the transformation follows.
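A minimal sketch of that transformation, not the project's actual code: the key names ("innings", "overs", "deliveries", "runs") reflect my reading of Cricsheet's newer JSON layout and may differ between files, and the input file name is hypothetical.

```python
# Minimal sketch: flatten one Cricsheet match JSON into ball-by-ball rows.
# Key names are assumptions about the newer Cricsheet JSON layout.
import json
import pandas as pd

with open("match.json") as f:  # hypothetical file name
    match = json.load(f)

rows = []
for inn_no, innings in enumerate(match["innings"], start=1):
    for over in innings["overs"]:
        for ball_no, ball in enumerate(over["deliveries"], start=1):
            rows.append({
                "innings": inn_no,
                "batting_team": innings["team"],
                "over": over["over"],
                "ball": ball_no,
                "batter": ball["batter"],
                "bowler": ball["bowler"],
                "runs_total": ball["runs"]["total"],
            })

balls = pd.DataFrame(rows)
print(balls.head())
```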
This dataset was created as part of a project where I created metrics to rank players for T20 Internationals and the Indian Premier League (IPL). The entire project materials can be found at https://github.com/jamiewelsh25/Cricket_Data_Project/
Notebooks can be found below, where I delve into predicting second-innings chase success as well as first-innings scores. Furthermore, I build a model to evaluate batters, bowlers, and all-rounders using a Runs Added Over Average Player metric.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Top football scorers data from the BBC Sport section, collected on 16/03/2025.
Extracts:
- Player Name
- Goals Scored
- Team Name
- Matches
- Assists
- Shots
Ideal for football enthusiasts and sports analysts; outputs data in CSV, JSON, and XLSX formats for analysis.
**A Python web scraping script is also available.**
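The sketch below shows the general shape of such a scrape, not the actual script: the URL and the assumption that the first HTML table on the page holds the scorer stats are both hypothetical, and BBC Sport's markup changes often, so the selectors will need adjusting.

```python
# Minimal sketch, not the actual script: fetch the page and let pandas
# parse any <table> elements. The URL and table index are assumptions.
from io import StringIO

import pandas as pd
import requests

URL = "https://www.bbc.com/sport/football/premier-league/top-scorers"  # assumed page

html = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=30).text
tables = pd.read_html(StringIO(html))

scorers = tables[0]  # assumption: the first table holds the scorer stats
scorers.to_csv("top_scorers.csv", index=False)
scorers.to_json("top_scorers.json", orient="records")
```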