Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was acquired during Cyber Czech, a hands-on cyber defense exercise (Red Team/Blue Team) held in March 2019 at Masaryk University, Brno, Czech Republic. Network traffic flows and a wide variety of event logs were captured in an exercise network deployed in the KYPO Cyber Range Platform.
Contents
The dataset covers two distinct time intervals, corresponding to the official schedule of the exercise. The timestamps below are in ISO 8601 format.
Day 1, March 19, 2019
Start: 2019-03-19T11:00:00.000000+01:00
End: 2019-03-19T18:00:00.000000+01:00
Day 2, March 20, 2019
Start: 2019-03-20T08:00:00.000000+01:00
End: 2019-03-20T15:30:00.000000+01:00
The captured and collected data were normalized into three distinct event types and are stored as structured JSON. The data are sorted by timestamp, which represents the time the events were observed. Each event type includes a raw payload ready for further processing and analysis. Descriptions of the respective event types and the corresponding data files follow.
cz.muni.csirt.IpfixEntry.tgz: an archive of IPFIX traffic flows enriched with an additional payload of parsed application protocols in raw JSON.
cz.muni.csirt.SyslogEntry.tgz: an archive of Linux Syslog entries with the payload of corresponding text-based log messages.
cz.muni.csirt.WinlogEntry.tgz: an archive of Windows Event Log entries with the payload of original events in raw XML.
Each archive listed above includes a directory of the same name with the following four files, ready to be processed (a loading sketch follows the file list).
data.json.gz: the actual data entries in a single gzipped JSON file.
dictionary.yml: data dictionary for the entries.
schema.ddl: data schema for the Apache Spark analytics engine.
schema.jsch: JSON schema for the entries.
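The schema files make it straightforward to load one event type into Spark. Below is a minimal sketch, assuming the archive has been extracted in place and that schema.ddl holds a DDL-formatted schema string; PySpark's reader accepts DDL strings and reads gzipped JSON transparently.

```python
# Minimal sketch: load the Syslog event type into Apache Spark, assuming
# the archive has been extracted into a directory of the same name.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cyber-czech").getOrCreate()

# schema.ddl is assumed to contain a DDL-formatted schema string,
# which DataFrameReader.schema() accepts directly.
with open("cz.muni.csirt.SyslogEntry/schema.ddl") as f:
    ddl_schema = f.read().strip()

# Spark decompresses .gz JSON transparently.
events = spark.read.schema(ddl_schema).json("cz.muni.csirt.SyslogEntry/data.json.gz")
events.printSchema()
events.show(5, truncate=False)
```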
Finally, the exercise network topology is described in the machine-readable NetJSON format and is part of an archive of auxiliary files, auxiliary-material.tgz, which includes the following (a small inspection sketch follows the list).
global-gateway-config.json: the network configuration of the global gateway in the NetJSON format.
global-gateway-routing.json: the routing configuration of the global gateway in the NetJSON format.
redteam-attack-schedule.{csv,odt}: the schedule of the Red Team attacks in CSV and ODT format. Source for Table 2.
redteam-reserved-ip-ranges.{csv,odt}: the list of IP segments reserved for the Red Team in CSV and ODT format. Source for Table 1.
topology.{json,pdf,png}: the topology of the complete Cyber Czech exercise network in the NetJSON, PDF and PNG format.
topology-small.{pdf,png}: simplified topology in the PDF and PNG format. Source for Figure 1.
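For the topology, here is a minimal inspection sketch, assuming topology.json follows the NetJSON NetworkGraph convention with top-level "nodes" and "links" arrays (the extraction path is also an assumption):

```python
# Minimal sketch: inspect the exercise topology. The path and the
# "nodes"/"links" keys assume the NetJSON NetworkGraph convention.
import json

with open("auxiliary-material/topology.json") as f:  # assumed path
    topo = json.load(f)

print(f"{len(topo.get('nodes', []))} nodes, {len(topo.get('links', []))} links")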
This dataset was created by Chinmay Choudhary
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains key characteristics of the data described in the Data Descriptor "Reference values for resting and post-exercise hemodynamic parameters in a 6-18 year old population". Contents:
1. a human-readable metadata summary table in CSV format
2. a machine-readable metadata file in JSON format
Versioning Note: Version 2 was generated when the metadata format was updated from JSON to JSON-LD. This was an automatic process that changed only the format, not the contents, of the metadata.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A piano dataset containing 127 audio recordings of the first six exercises from Hanon's The Virtuoso Pianist, with the corresponding onset and chroma labeling. It is used for the automatic assessment of piano exercises.
It also contains the annotation (.json) and LilyPond (.ly) files for the feature extraction and visualization of the music score of each exercise, as well as the pretrained model files (all_exs_p.joblib and all_ex_r.joblib); a loading sketch follows.
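The .joblib files can presumably be loaded with joblib. A minimal sketch, assuming scikit-learn-compatible estimators (what the _p/_r suffixes denote is not stated here):

```python
# Minimal sketch: load the pretrained model files shipped with the dataset.
# The meaning of the "_p" and "_r" suffixes is an open assumption; consult
# the dataset's annotation files for the expected input features.
from joblib import load

model_p = load("all_exs_p.joblib")
model_r = load("all_ex_r.joblib")
print(type(model_p), type(model_r))
```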
PSB2: The Second Program Synthesis Benchmark Suite Datasets
Version 1.0.1 (see version history at bottom)
This repository contains datasets for the 25 problems described in the paper PSB2: The Second Program Synthesis Benchmark Suite. These problems come from a variety of sources and require a range of programming constructs and datatypes to solve. These datasets are designed to be usable for any method of performing general program synthesis, including but not limited to inductive program synthesis and evolutionary methods such as genetic programming. For more information, see the associated website: https://cs.hamilton.edu/~thelmuth/PSB2/PSB2.html
Use
Each problem in the benchmark suite is located in a separate directory within the datasets directory. For each problem, we provide a set of edge cases and a set of random cases. The edge cases are hand-chosen cases representing the limits of the problem. The random cases are generated from problem-specific distributions. For each problem, we include exactly 1 million random cases. A typical use of these datasets for a set of program synthesis runs would be: for each run, use every edge case in the training set; for each run, use a different, randomly sampled set of random cases in the training set; and use a larger set of random cases as an unseen test set.
Sampling Libraries
We provide the following libraries to make downloading and sampling these datasets easier. Using these libraries, you do not need to download the entire dataset from Zenodo; the individual problem datasets are downloaded and stored once when first sampled.
Python: https://github.com/thelmuth/psb2-python
Clojure: https://github.com/thelmuth/psb2-clojure
Dataset format
Each edge and random dataset is provided in three formats, CSV, JSON, and EDN, with all three containing identical data. The CSV files are formatted as follows: the first row of the file gives the column names, and each following row corresponds to one set of program inputs and expected outputs. Input columns are labeled input1, input2, etc., and output columns are labeled output1, output2, etc. In CSVs, string inputs and outputs are double-quoted only when necessary. Newlines within strings are escaped, and columns are comma-separated. The JSON and EDN files are formatted using the JSON Lines standard (adapted for EDN): each case is put on its own line of the data file. The files should be read line by line, with each line parsed into an object/map using a JSON/EDN parser. A small reading sketch follows at the end of this entry.
Citation
If you use these datasets in a publication, please cite the paper PSB2: The Second Program Synthesis Benchmark Suite and include a link to this repository. BibTeX entry for the paper:
@InProceedings{Helmuth:2021:GECCO,
  author = "Thomas Helmuth and Peter Kelly",
  title = "{PSB2}: The Second Program Synthesis Benchmark Suite",
  booktitle = "2021 Genetic and Evolutionary Computation Conference",
  series = {GECCO '21},
  year = "2021",
  isbn13 = {978-1-4503-8350-9},
  address = {Lille, France},
  size = {10 pages},
  doi = {10.1145/3449639.3459285},
  publisher = {ACM},
  publisher_address = {New York, NY, USA},
  month = {10-14} # jul,
  doi-url = {https://doi.org/10.1145/3449639.3459285},
  URL = {https://dl.acm.org/doi/10.1145/3449639.3459285},
}
Version History
1.0.0 - 2021/4/10 - Initial publication of PSB2 datasets on Zenodo.
1.0.1 - 2021/7/9 - Changes to CSVs to quote all strings that could be read as integers. No changes to the actual data, only formatting.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
While not as popular as in the USA, European basketball has its own fan base. I have seen some scattered datasets on European basketball, but could not find one with historical play-by-play data. Hence, I decided to create my own with the help of Euroleague's easily accessible JSON API.
The data includes play-by-play records for the Euroleague (the top European competition) since 2007 and for the Eurocup (the second-tier international competition) from 2012 to 2020.
The data is acquired via Euroleague's JSON API, which is easy to query. I created a short do-it-yourself (DIY) notebook for those who want to fetch the data themselves; a sketch follows.
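The sketch below shows the general shape of such a fetch. The endpoint URL and its gamecode/seasoncode parameters are my assumptions about Euroleague's publicly reachable live API, not a documented contract, and may need adjusting.

```python
# Minimal sketch of fetching play-by-play JSON. The endpoint and its
# parameters are assumptions, not guaranteed API behavior.
import requests

URL = "https://live.euroleague.net/api/PlaybyPlay"  # assumed endpoint

def fetch_playbyplay(gamecode: int, season: int) -> dict:
    resp = requests.get(
        URL,
        params={"gamecode": gamecode, "seasoncode": f"E{season}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

pbp = fetch_playbyplay(gamecode=1, season=2019)
print(list(pbp.keys()))
```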
Two GitHub repos were my starting points; without them, it would have been hard to compile this data.
Regularly updating this data and producing interesting analyses would definitely be worthwhile. As my time allows, I will try to do so.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains 50,000 ranked ladder matches from the Dota 2 data dump created by OpenDota. It was inspired by the Dota 2 Matches data published here by Joe Ramir. This is an updated and improved version of that dataset. I have kept the same image and a similar title.
Dota 2 is a popular free-to-play MOBA that can take up thousands of hours of your life. Roughly as many games as this dataset contains are played every hour. If you like the data, an additional 2-3 million matches are readily available for download.
The aim of this dataset is to enable exploration of player behavior, skill estimation, or anything else you find interesting. The intent is to create an accessible, easy-to-use resource that can be expanded and modified as needed. As such, I am open to a wide variety of suggestions for additions or changes.
See https://github.com/odota/core/wiki/JSON-Data-Dump for documentation on the data. I have found a few undocumented areas, including the objectives information. player_slot can be used to combine most of the data and is available in most of the tables. Additionally, all tables include match_id, and some have account_id to make it easier to look at an individual player's matches. match_id and account_id have been re-encoded to save a little space; I can upload conversion tables if needed. I plan on adding a small amount of information soon, including outcomes for an additional 50k-100k matches that occurred after the ones currently uploaded, and some tables for determining which continent or region each match was played in. A join sketch follows the table descriptions below.
matches: contains top-level information about each match. See https://wiki.teamfortress.com/wiki/WebAPI/GetMatchDetails#Tower_Status for interpreting tower and barracks status. The cluster field can link matches to a geographic region.
players: Individual players are identified by account_id, but there is an option to play anonymously, and roughly one third of the account_id values are not available. Anonymous users have an account_id of 0. Contains totals for kills, deaths, denies, etc. Player action counts are available and are indicated by variable names beginning with unit_order_. Counts of the reasons for acquiring or losing gold and for gaining experience have the prefixes gold_ and xp_.
player_time: Contains last hits, experience, and gold sampled at one-minute intervals for all players in all matches. The column names indicate the player_slot; for instance, xp_t_1 holds the experience sums for the player in slot one.
teamfights: Start and stop times of teamfights, as well as the last death time. Teamfights appear to be all battles with three or more deaths; as such, this does not include every battle in a match.
teamfights_players: Additional information provided for each player in each teamfight. player_slot can be used to link this back to players.csv.
objectives: Gives information on all the objectives completed, by which player and at what time.
chat: All chat for the 50k matches. There is plenty of profanity and good-natured trolling.
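As a minimal sketch of the joins described above: players.csv and the join keys are named in the text, while matches.csv as a file name, the radiant_win column, and hero_id as a players column are assumptions based on standard OpenDota match data.

```python
# Minimal sketch: combine per-player rows with match-level outcomes via
# match_id. matches.csv, radiant_win, and hero_id are assumptions.
import pandas as pd

players = pd.read_csv("players.csv")
matches = pd.read_csv("matches.csv")  # assumed file name

df = players.merge(matches[["match_id", "radiant_win"]], on="match_id")

# Dota 2 convention: player_slot values below 128 are Radiant, the rest Dire.
df["is_radiant"] = df["player_slot"] < 128
df["won"] = df["is_radiant"] == df["radiant_win"]
print(df.groupby("hero_id")["won"].mean().sort_values(ascending=False).head())
```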
There seem to be some efforts to establish indicators of skillful play based on specific aspects of gameplay. OpenDota has many statistics and some analysis of specific benchmarks at different points in the game. Dotabuff has a lot of information, but I have not explored it deeply; this is an area in which to gather more information.
Insights from domain experts would also be useful to help clarify which problems are interesting to work on. Some initial task ideas follow.
All of these areas have been worked on, but I am not aware of the most up-to-date research on Dota 2 gameplay.
I plan on setting up several different predictive tasks in the upcoming weeks: a test set of an additional 50 to 100 thousand matches with just hero_id and account_id included, along with the outcome of each match.
The current dataset seems pretty small for modeling individual players. For the moment, I would prefer a wide range of features over a larger dataset.
Dataset idea for anyone interested in creating their own Dota 2 dataset. It would be useful to have a few full matches avai...
The dataset contains test results from a digital intervention study of the CALLIDUS Project in a high school in Berlin. Thirteen students were randomly sampled into two groups and completed various linguistic tasks. The focus of the study was to find out whether learning Latin vocabulary in authentic contexts leads to higher lexical competence compared to memorizing traditional vocabulary lists. The data is available in JSON format as provided by the H5P implementation of xAPI. File names indicate the time of test completion, in the concatenated form "year-month-day-hour-minute-second-millisecond" (a parsing sketch follows below). This allows us to trace the development of individual learners who were fast enough to perform the test twice in a row.
Changelog:
Version 2.0: Each exercise now has a unique ID that is consistent across the whole dataset, so evaluation/visualization can refer to specific exercises more easily.
Version 3.0: A simplified Excel spreadsheet has been added to enhance the reusability of the dataset. It contains a slightly reduced overview of the data, but the core information (user ID, task statement, correct solution, given answer, score, duration) is still present.
Funded by the German Research Foundation (DFG), project number 316618374.
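A minimal sketch of recovering the completion time from a file name, assuming dash-separated numeric components as described (the example file name is made up):

```python
# Minimal sketch: parse "year-month-day-hour-minute-second-millisecond"
# file names into datetimes. The example name below is hypothetical.
from datetime import datetime

def completion_time(filename: str) -> datetime:
    stem = filename.rsplit(".", 1)[0]
    parts = stem.split("-")
    year, month, day, hour, minute, second, millisecond = map(int, parts[:7])
    # datetime's last positional argument is microseconds.
    return datetime(year, month, day, hour, minute, second, millisecond * 1000)

print(completion_time("2019-03-14-10-25-31-123.json"))
```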
Open Data Commons Attribution License (ODC-By) v1.0 https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
The data was downloaded from the extensive cricket data website cricsheet.org in JSON format. I used the pandas Python library to transform the match data into ball-by-ball data with several relevant fields. This allows the data to be used to train regression models and the like; a sketch of the transformation follows.
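A minimal sketch of that transformation, not the project's actual code: the key names ("innings", "overs", "deliveries", "runs") reflect my reading of Cricsheet's newer JSON layout and may differ between files, and the input file name is hypothetical.

```python
# Minimal sketch: flatten one Cricsheet match JSON into ball-by-ball rows.
# Key names are assumptions about the newer Cricsheet JSON layout.
import json
import pandas as pd

with open("match.json") as f:  # hypothetical file name
    match = json.load(f)

rows = []
for inn_no, innings in enumerate(match["innings"], start=1):
    for over in innings["overs"]:
        for ball_no, ball in enumerate(over["deliveries"], start=1):
            rows.append({
                "innings": inn_no,
                "batting_team": innings["team"],
                "over": over["over"],
                "ball": ball_no,
                "batter": ball["batter"],
                "bowler": ball["bowler"],
                "runs_total": ball["runs"]["total"],
            })

balls = pd.DataFrame(rows)
print(balls.head())
```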
This dataset was created as part of a project where I created metrics to rank players for T20 Internationals and the Indian Premier League (IPL). The entire project materials can be found at https://github.com/jamiewelsh25/Cricket_Data_Project/
Notebooks can be found below, where I delve into predicting second-innings chase success as well as first-innings scores. Furthermore, I build a model to evaluate batters, bowlers, and all-rounders using a Runs Added Over Average Player metric.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Top football scorers data from the BBC Sport section, collected on 16/03/2025.
Extracts:
- Player Name
- Goals Scored
- Team Name
- Matches
- Assists
- Shots
Ideal for football enthusiasts and sports analysts; outputs data in CSV, JSON, and XLSX formats for analysis.
**A Python web scraping script is also available.**
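The sketch below shows the general shape of such a scrape, not the actual script: the URL and the assumption that the first HTML table on the page holds the scorer stats are both hypothetical, and BBC Sport's markup changes often, so the selectors will need adjusting.

```python
# Minimal sketch, not the actual script: fetch the page and let pandas
# parse any <table> elements. The URL and table index are assumptions.
from io import StringIO

import pandas as pd
import requests

URL = "https://www.bbc.com/sport/football/premier-league/top-scorers"  # assumed page

html = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=30).text
tables = pd.read_html(StringIO(html))

scorers = tables[0]  # assumption: the first table holds the scorer stats
scorers.to_csv("top_scorers.csv", index=False)
scorers.to_json("top_scorers.json", orient="records")
```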