10 datasets found
  1. Data from: Traffic and Log Data Captured During a Cyber Defense Exercise

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 12, 2020
    Cite
    Jan Vykopal (2020). Traffic and Log Data Captured During a Cyber Defense Exercise [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3746128
    Dataset updated
    Jun 12, 2020
    Dataset provided by
    Jan Vykopal
    Stanislav Špaček
    Daniel Tovarňák
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was acquired during Cyber Czech – a hands-on cyber defense exercise (Red Team/Blue Team) held in March 2019 at Masaryk University, Brno, Czech Republic. Network traffic flows and a high variety of event logs were captured in an exercise network deployed in the KYPO Cyber Range Platform.

    Contents

    The dataset covers two distinct time intervals, which correspond to the official schedule of the exercise. The timestamps provided below are in the ISO 8601 date format.

    Day 1, March 19, 2019

    Start: 2019-03-19T11:00:00.000000+01:00

    End: 2019-03-19T18:00:00.000000+01:00

    Day 2, March 20, 2019

    Start: 2019-03-20T08:00:00.000000+01:00

    End: 2019-03-20T15:30:00.000000+01:00

    The captured and collected data were normalized into three distinct event types and they are stored as structured JSON. The data are sorted by a timestamp, which represents the time they were observed. Each event type includes a raw payload ready for further processing and analysis. The description of the respective event types and the corresponding data files follows.

    cz.muni.csirt.IpfixEntry.tgz – an archive of IPFIX traffic flows enriched with an additional payload of parsed application protocols in raw JSON.

    cz.muni.csirt.SyslogEntry.tgz – an archive of Linux Syslog entries with the payload of corresponding text-based log messages.

    cz.muni.csirt.WinlogEntry.tgz – an archive of Windows Event Log entries with the payload of original events in raw XML.

    Each archive listed above includes a directory of the same name with the following four files, ready to be processed.

    data.json.gz – the actual data entries in a single gzipped JSON file.

    dictionary.yml – data dictionary for the entries.

    schema.ddl – data schema for Apache Spark analytics engine.

    schema.jsch – JSON schema for the entries.
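
    A minimal Python sketch for streaming the entries follows. It assumes one JSON object per line inside data.json.gz (the description above does not state the line layout explicitly) and an illustrative extraction path; validate against schema.jsch if the layout differs.

        import gzip
        import json

        # Stream events without loading the whole gzipped file into memory.
        # Path assumes the archive was extracted into its same-named directory.
        with gzip.open("cz.muni.csirt.SyslogEntry/data.json.gz", "rt", encoding="utf-8") as f:
            for line in f:
                event = json.loads(line)
                # inspect the observation timestamp and raw payload here

        # For Spark, the bundled DDL schema can be applied directly (sketch):
        # df = spark.read.schema(open("schema.ddl").read()).json("data.json.gz")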

    Finally, the exercise network topology is described in the machine-readable NetJSON format and is part of the auxiliary files archive – auxiliary-material.tgz – which includes the following.

    global-gateway-config.json – the network configuration of the global gateway in the NetJSON format.

    global-gateway-routing.json – the routing configuration of the global gateway in the NetJSON format.

    redteam-attack-schedule.{csv,odt} – the schedule of the Red Team attacks in CSV and ODT format. Source for Table 2.

    redteam-reserved-ip-ranges.{csv,odt} – the list of IP segments reserved for the Red Team in CSV and ODT format. Source for Table 1.

    topology.{json,pdf,png} – the topology of the complete Cyber Czech exercise network in the NetJSON, PDF and PNG format.

    topology-small.{pdf,png} – simplified topology in the PDF and PNG format. Source for Figure 1.
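
    The topology file can be inspected with the standard library alone; a short sketch, assuming the NetJSON NetworkGraph layout with top-level "nodes" and "links" arrays (see netjson.org for the spec):

        import json

        # Count nodes and links in the exercise network topology.
        with open("topology.json", encoding="utf-8") as f:
            topo = json.load(f)
        print(len(topo.get("nodes", [])), "nodes,", len(topo.get("links", [])), "links")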

  2. iplDatasetJson 2008-2023

    • kaggle.com
    Updated Jan 6, 2024
    Cite
    Chinmay Choudhary (2024). iplDatasetJson 2008-2023 [Dataset]. https://www.kaggle.com/datasets/chinmayc3/ipldatasetjson/discussion
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 6, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Chinmay Choudhary
    Description

    Dataset

    This dataset was created by Chinmay Choudhary


  3. Metadata record for: Reference values for resting and post exercise hemodynamic parameters in a 6-18 year old population

    • springernature.figshare.com
    txt
    Updated May 31, 2023
    Cite
    Scientific Data Curation Team (2023). Metadata record for: Reference values for resting and post exercise hemodynamic parameters in a 6-18 year old population [Dataset]. http://doi.org/10.6084/m9.figshare.11417481.v2
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Scientific Data Curation Team
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains key characteristics about the data described in the Data Descriptor "Reference values for resting and post exercise hemodynamic parameters in a 6-18 year old population". Contents:

    1. human-readable metadata summary table in CSV format

    2. machine-readable metadata file in JSON format

    Versioning Note: Version 2 was generated when the metadata format was updated from JSON to JSON-LD. This was an automatic process that changed only the format, not the contents, of the metadata.
    
  4. Piano Dataset

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jun 18, 2020
    Cite
    Sara Fernandez (2020). Piano Dataset [Dataset]. http://doi.org/10.5281/zenodo.3898631
    Available download formats: zip
    Dataset updated
    Jun 18, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sara Fernandez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A piano dataset containing 127 audio recordings of the first 6 exercises from Hanon's The Virtuoso Pianist, together with the corresponding onset and chroma labeling. It is used for the automatic assessment of piano exercises.

    It also contains the annotation (.json) and LilyPond (.ly) files for the feature extraction and visualization of the music score of each exercise, as well as the pretrained model files (all_exs_p.joblib and all_ex_r.joblib).
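
    A minimal sketch for loading the pretrained model files named above; given the .joblib extension, joblib is assumed to be the library that produced them.

        from joblib import load

        # Load the bundled pretrained models; what each one predicts is not
        # stated here, so inspect the objects after loading.
        model_p = load("all_exs_p.joblib")
        model_r = load("all_ex_r.joblib")
        print(type(model_p), type(model_r))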

  5. Data from: PSB2: The Second Program Synthesis Benchmark Suite

    • explore.openaire.eu
    • zenodo.org
    Updated Apr 10, 2021
    Cite
    Thomas Helmuth; Peter Kelly (2021). PSB2: The Second Program Synthesis Benchmark Suite [Dataset]. http://doi.org/10.5281/zenodo.4678739
    Dataset updated
    Apr 10, 2021
    Authors
    Thomas Helmuth; Peter Kelly
    Description

    PSB2: The Second Program Synthesis Benchmark Suite Datasets

    Version 1.0.1 (see version history at bottom)

    This repository contains datasets for the 25 problems described in the paper PSB2: The Second Program Synthesis Benchmark Suite. These problems come from a variety of sources and require a range of programming constructs and datatypes to solve. These datasets are designed to be usable for any method of performing general program synthesis, including but not limited to inductive program synthesis and evolutionary methods such as genetic programming. For more information, see the associated website: https://cs.hamilton.edu/~thelmuth/PSB2/PSB2.html

    Use

    Each problem in the benchmark suite is located in a separate directory in the datasets directory. For each problem, we provide a set of edge cases and a set of random cases. The edge cases are hand-chosen cases representing the limits of the problem. The random cases are all generated based on problem-specific distributions. For each problem, we included exactly 1 million random cases.

    A typical use of these datasets for a set of runs of program synthesis would be:

    • For each run, use every edge case in the training set.

    • For each run, use a different, randomly-sampled set of random cases in the training set.

    • Use a larger set of random cases as an unseen test set.

    Sampling Libraries

    We provide the following libraries to make the downloading and sampling of these datasets easier. Using these libraries, you do not need to download the entire dataset from Zenodo; the individual problem datasets are downloaded and stored once when first sampled.

    • Python: https://github.com/thelmuth/psb2-python

    • Clojure: https://github.com/thelmuth/psb2-clojure

    Dataset format

    Each edge and random dataset is provided in three formats: CSV, JSON, and EDN, with all three formats containing identical data.

    The CSV files are formatted as follows: the first row of the file holds the column names. Each following row corresponds to one set of program inputs and expected outputs. Input columns are labeled input1, input2, etc., and output columns are labeled output1, output2, etc. String inputs and outputs are double-quoted when necessary, but not otherwise. Newlines within strings are escaped. Columns are comma-separated.

    The JSON and EDN files are formatted using the JSON Lines standard (adapted for EDN). Each case is put on its own line of the data file. The files should be read line by line, with each line parsed into an object/map using a JSON/EDN parser.

    Citation

    If you use these datasets in a publication, please cite the paper PSB2: The Second Program Synthesis Benchmark Suite and include a link to this repository. BibTeX entry for the paper:

        @InProceedings{Helmuth:2021:GECCO,
          author = "Thomas Helmuth and Peter Kelly",
          title = "{PSB2}: The Second Program Synthesis Benchmark Suite",
          booktitle = "2021 Genetic and Evolutionary Computation Conference",
          series = {GECCO '21},
          year = "2021",
          isbn13 = {978-1-4503-8350-9},
          address = {Lille, France},
          size = {10 pages},
          doi = {10.1145/3449639.3459285},
          publisher = {ACM},
          publisher_address = {New York, NY, USA},
          month = {10-14} # jul,
          doi-url = {https://doi.org/10.1145/3449639.3459285},
          URL = {https://dl.acm.org/doi/10.1145/3449639.3459285},
        }

    Version History

    • 1.0.0 - 2021/4/10 - Initial publication of PSB2 datasets on Zenodo.

    • 1.0.1 - 2021/7/9 - Changes to CSVs to quote all strings that could be read as integers. No changes in actual data, just formatting.
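
    As noted under "Dataset format" above, reading a JSON data file is plain line-by-line parsing; a minimal Python sketch, where the path and file name are illustrative (the exact names inside each problem directory are not listed here):

        import json

        # One case per line (JSON Lines); inputs and outputs are assumed to be
        # keyed like the CSV columns (input1, ..., output1, ...).
        cases = []
        with open("datasets/fizz-buzz/random.json", encoding="utf-8") as f:
            for line in f:
                cases.append(json.loads(line))
        print(len(cases), cases[0])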

  6. Euroleague / Eurocup Play By Play Data 2007-2020

    • kaggle.com
    Updated Feb 7, 2021
    Cite
    Efehan (2021). Euroleague / Eurocup Play By Play Data 2007-2020 [Dataset]. https://www.kaggle.com/efehandanisman/euroleague-play-by-play-data-20072020/metadata
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 7, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Efehan
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Euroleague Basketball

    While not as popular as in the USA, European basketball has its own fan base. I have seen some datasets on European basketball scattered around, but could not find one with historical play-by-play data. Hence, I decided to create my own with the help of Euroleague's easily accessible JSON format.

    Content

    The data includes play-by-play data for the Euroleague (the top European basketball competition) from 2007 to 2020, and for the Eurocup (the second-tier international competition) from 2012 to 2020.

    The data is acquired via Euroleague's JSON format, which is easy to get. I created a short Do-It-Yourself (DIY) notebook for those who want to fetch the data themselves.

    Acknowledgements

    These two GitHub repos were my starting points; without their groundwork it would have been hard to compile this data: Solmos and Jan Sodoge.

    Future Work

    Regularly updating this data and producing interesting analyses would definitely be worthwhile. As my time allows, I will try to do so.

  7. Dota 2 Matches

    • kaggle.com
    zip
    Updated Oct 24, 2016
    Cite
    Devin Anzelmo (2016). Dota 2 Matches [Dataset]. https://www.kaggle.com/datasets/devinanzelmo/dota-2-matches/versions/1
    Available download formats: zip (0 bytes)
    Dataset updated
    Oct 24, 2016
    Authors
    Devin Anzelmo
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset contains 50,000 ranked ladder matches from the Dota 2 data dump created by OpenDota. It was inspired by the Dota 2 Matches data published by Joe Ramir. This is an updated and improved version of that dataset; I have kept the same image and a similar title.

    Dota 2 is a popular free-to-play MOBA that can take up thousands of hours of your life. Roughly as many games as this dataset contains are played about every hour. If you like the data, an additional 2-3 million matches are easily available for download.

    The aim of this dataset is to enable the exploration of player behavior, skill estimation, or anything else you find interesting. The intent is to create an accessible, easy-to-use resource, which can be expanded and modified if needed. As such, I am open to a wide variety of suggestions for additions or changes.

    What's Currently Available

    See https://github.com/odota/core/wiki/JSON-Data-Dump for documentation on the data. I have found a few undocumented areas in the data, including the objectives information. player_slot can be used to combine most of the data, and it is available in most of the tables (see the pandas sketch after the table list below). Additionally, all tables include match_id, and some have account_id, to make it easier to look at an individual player's matches. match_id and account_id have been re-encoded to save a little space; I can upload conversion tables if needed. I plan on adding a small amount of information very soon, including outcomes for an additional 50k-100k matches that occurred after the ones currently uploaded, and some tables for determining which continent or region each match was played in.

    • matches: contains top-level information about each match. See https://wiki.teamfortress.com/wiki/WebAPI/GetMatchDetails#Tower_Status for interpreting tower and barracks status. Cluster can link matches to geographic region.

    • players: individual players are identified by account_id, but there is an option to play anonymously, and roughly one third of account_ids are not available; anonymous users have an account_id of 0. Contains totals for kills, deaths, denies, etc. Player action counts are available and are indicated by variable names beginning with unit_order_. Counts of reasons for acquiring or losing gold, and for gaining experience, have the prefixes gold_ and xp_.

    • player_time: contains last hits, experience, and gold sampled at one-minute intervals for all players in all matches. The column names indicate the player_slot; for instance, xp_t_1 holds experience sums for the player in slot one.

    • teamfights: start and stop times of teamfights, as well as the last death time. Teamfights appear to be all battles with three or more deaths; as such, this does not include every battle in a match.

    • teamfights_players: additional information provided for each player in each teamfight. player_slot can be used to link this back to players.csv.

    • objectives: Gives information on all the objectives completed, by which player and at what time.

    • chat: All chat for the 50k matches. There is plenty of profanity, and good natured trolling.
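
    As a concrete sketch of linking tables on match_id and player_slot with pandas (file names are assumed to mirror the table names above; the selected stat columns come from the players description):

        import pandas as pd

        # Attach per-player match totals to each teamfight participation row.
        players = pd.read_csv("players.csv")
        tf_players = pd.read_csv("teamfights_players.csv")
        merged = tf_players.merge(
            players[["match_id", "player_slot", "account_id", "kills", "deaths"]],
            on=["match_id", "player_slot"],
            how="left",
        )
        print(merged.head())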

    Past Research

    There seem to be some efforts to establish indicators of skillful play based on specific parts of gameplay. OpenDota has many statistics and some analysis of specific benchmarks at different times in the game. Dotabuff has a lot of information, but I have not explored it deeply. This is an area in which to gather more information.

    Some possible directions of investigation

    Insight from domain experts would also be useful to help clarify which problems are interesting to work on. Some initial task ideas:

    • Predict match outcomes based on aggregates for individual players, using only account_id as prior information
    • Add hero_id to this and see if there is a difference in performance
    • Estimate player skill based on a sample of in-game play (this might need an external MMR source or a different definition of skill)
    • Create improved indicators of skillful play based on game actions, to help players target areas for improvement

    All of these areas have been worked on, but I am not aware of the most up-to-date research on Dota 2 gameplay.

    I plan on setting up several different predictive tasks in the upcoming weeks, starting with a test set of an additional 50 to 100 thousand matches that includes just hero_id and account_id, along with the outcome of each match.

    The current dataset seems pretty small for modeling individual players. For the moment, I would prefer a wide range of features over a larger dataset.

    Dataset idea for anyone interested in creating their own Dota 2 dataset. It would be useful to have a few full matches avai...

  8. Learner Data from a Study on Latin Language Learning

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1more
    Updated Jan 7, 2020
    + more versions
    Cite
    Konstantin Schulz (2020). Learner Data from a Study on Latin Language Learning [Dataset]. http://doi.org/10.5281/zenodo.4108359
    Dataset updated
    Jan 7, 2020
    Authors
    Konstantin Schulz
    Description

    The dataset contains test results from a digital intervention study of the CALLIDUS Project in a high school in Berlin. 13 students were randomly sampled into two groups and completed various linguistic tasks. The focus of the study was to find out whether learning Latin vocabulary in authentic contexts leads to higher lexical competence, compared to memorizing traditional vocabulary lists.

    The data is available in JSON format as provided by the H5P implementation of xAPI. File names indicate the time of test completion, in the concatenated form of "year-month-day-hour-minute-second-millisecond". This allows us to trace the development of single learners who were fast enough to perform the test twice in a row.

    Changelog:

    • Version 2.0: Each exercise now has a unique ID that is consistent across the whole dataset, so evaluation/visualization can refer to specific exercises more easily.

    • Version 3.0: A simplified Excel spreadsheet has been added to enhance the reusability of the dataset. It contains a slightly reduced overview of the data, but the core information (user ID, task statement, correct solution, given answer, score, duration) is still present.

    Funded by the German Research Foundation (DFG), project number 316618374.
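
    A small sketch for recovering completion times from the stated file-name convention (the sample name is invented):

        from datetime import datetime

        # "year-month-day-hour-minute-second-millisecond"
        stem = "2019-11-27-10-15-42-123"  # illustrative file-name stem
        *dt_parts, millis = stem.split("-")
        ts = datetime.strptime("-".join(dt_parts), "%Y-%m-%d-%H-%M-%S")
        ts = ts.replace(microsecond=int(millis) * 1000)
        print(ts.isoformat())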

  9. T20I Men's Cricket Match Data (2003 - 2023)

    • kaggle.com
    Updated Sep 20, 2023
    Cite
    Jamie Welsh (2023). T20I Men's Cricket Match Data (2003 - 2023) [Dataset]. https://www.kaggle.com/datasets/jamiewelsh2/ball-by-ball-it20/code
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 20, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jamie Welsh
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    The data was downloaded in JSON format from the extensive cricket data website cricsheet.org. I used the pandas Python library to transform the match data into ball-by-ball data with several relevant fields. This allows the data to be used to train regression models and the like.
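
    A sketch of that flattening step is below. The innings → overs → deliveries field names reflect cricsheet's JSON layout as I understand it; treat them as assumptions and check against an actual match file.

        import json
        import pandas as pd

        # Build one row per delivery from a single match file (name illustrative).
        with open("match.json", encoding="utf-8") as f:
            match = json.load(f)

        rows = []
        for innings in match["innings"]:
            for over in innings["overs"]:
                for ball in over["deliveries"]:
                    rows.append({
                        "batting_team": innings["team"],
                        "over": over["over"],
                        "batter": ball["batter"],
                        "bowler": ball["bowler"],
                        "runs_total": ball["runs"]["total"],
                    })

        df = pd.DataFrame(rows)
        print(df.head())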

    This dataset was created as part of a project where I created metrics to rank players for T20 Internationals and the Indian Premier League (IPL). The entire project materials can be found at https://github.com/jamiewelsh25/Cricket_Data_Project/

    Notebooks can be found below, where I delve into predicting second-innings chase success as well as first-innings scores. Furthermore, I build a model to evaluate batters, bowlers, and all-rounders using a Runs Added Over Average Player metric.

  10. BBCSports-Top-Scorers-Football

    • kaggle.com
    Updated Mar 16, 2025
    Cite
    Vivek Prasad Kushwaha (2025). BBCSports-Top-Scorers-Football [Dataset]. https://www.kaggle.com/datasets/vivekprasadkushwaha/bbcsports-top-scorers-football/versions/1
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 16, 2025
    Dataset provided by
    Kaggle
    Authors
    Vivek Prasad Kushwaha
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    ⚽ Top football scorers data from the BBC Sports section, captured on 16/03/2025.

    Extracts:
    - Player Name
    - Goals Scored
    - Team Name
    - Matches
    - Assists
    - Shots

    Ideal for football enthusiasts and sports analysts; outputs data in CSV/JSON/XLSX format for analysis.
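
    For instance, a minimal pandas sketch over the CSV output (the file name and exact header spellings are assumptions):

        import pandas as pd

        # Rank scorers by goals; column names follow the fields listed above.
        df = pd.read_csv("bbc_top_scorers.csv")
        top10 = df.sort_values("Goals Scored", ascending=False).head(10)
        print(top10[["Player Name", "Team Name", "Goals Scored", "Assists"]])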

    🐍 A Python web scraping script is also available 🐍

