Mapping incident locations from a CSV file in a web map (YouTube video).
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
List of valid web domain names collected by the (bulk) crawling bots (stage-1 bots) running on varocarbas.com.
These bots perform a blind recursive analysis of links, based on the "everything is connected" idea: they start from a given webpage and are expected to retrieve a significant proportion of all the existing domain names.
Instructions on how to create a layer of recent earthquakes in a Web Map from a CSV file downloaded from GNS Science's GeoNet website. The CSV file must contain latitude and longitude fields for the earthquake locations so that it can be added to a Web Map as a point layer. This document is designed to support the Natural Hazards - Earthquakes story map.
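As a rough illustration of that requirement, the following Python/pandas sketch (the file name and exact column spellings are assumptions, not taken from the guide) checks that a downloaded GeoNet CSV has the latitude and longitude fields needed for a point layer:

    import pandas as pd

    # Hypothetical GeoNet export; the real column names may differ (e.g. "latitude"/"longitude")
    quakes = pd.read_csv("earthquakes.csv")

    required = {"latitude", "longitude"}
    missing = required - {c.lower() for c in quakes.columns}
    if missing:
        raise ValueError(f"CSV cannot be added as a point layer; missing fields: {missing}")
    print(f"{len(quakes)} earthquakes ready to be added to the Web Map as a point layer")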
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Semicolon-delimited text file equivalent of the Rdata file. See the Rdata file for a description of the data in each column.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the results of the interlinking process between selected CSV datasets harvested by the European Data Portal and the DBpedia knowledge graph.
We aim to answer the following questions:
What are the most popular column types? This will provide insight into what the datasets hold and how they can be joined. It will also provide insight into which specific linking schemes could be applied in the future.
What datasets have columns of the same type? This will suggest datasets that may be similar or related.
What entities appear in most datasets (co-referent entities)? This will suggest entities for which more data is published.
What datasets share a particular entity? This will suggest datasets that may be joined, or are related through that particular entity.
Results are provided as augmented tables that contain the columns of the original CSV, plus a metadata file in JSON-LD format. The metadata files can be loaded into an RDF store and queried.
Refer to the accompanying report of activities for more details on the methodology and how to query the dataset.
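For example, a minimal sketch of that workflow with rdflib in Python (the metadata file name is a placeholder; the actual vocabulary is described in the accompanying report):

    from rdflib import Graph

    # Load one of the JSON-LD metadata files into an in-memory RDF store
    g = Graph()
    g.parse("metadata.jsonld", format="json-ld")

    # List every predicate used, as a first look at how the columns were annotated
    for row in g.query("SELECT DISTINCT ?p WHERE { ?s ?p ?o }"):
        print(row.p)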
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is a set of network traffic traces in pcap/csv format captured from a single user. The traffic is classified into 5 different activities (Video, Bulk, Idle, Web, and Interactive), and the label is shown in the filename. There is also a file (mapping.csv) with the mapping between the host's IP address, the csv/pcap filename, and the activity label.
Activities:
Interactive: applications that perform real-time interactions to provide a suitable user experience, such as editing a file in Google Docs and remote CLI sessions over SSH.
Bulk data transfer: applications that transfer large-volume files over the network. Examples are SCP/FTP applications and direct downloads of large files from web servers such as Mediafire, Dropbox, or the university repository, among others.
Web browsing: all the traffic generated while searching and consuming different web pages. Examples of those pages are several blogs and news sites and the university Moodle.
Video playback: traffic from applications that consume video in streaming or pseudo-streaming. The best-known servers used are Twitch and YouTube, but the university online classroom has also been used.
Idle behaviour: the background traffic generated by the user's computer when the user is idle. This traffic has been captured with every application closed and with some open pages such as Google Docs, YouTube, and several web pages, but always without user interaction.
The capture is performed on a network probe attached to the router that forwards the user's network traffic, using a SPAN port. The traffic is stored in pcap format with the full packet payload. In the csv files, every non-TCP/UDP packet is filtered out, as well as every packet with no payload. The fields in the csv files are the following (one line per packet): timestamp, protocol, payload size, source and destination IP address, and source and destination UDP/TCP port. The fields are also included as a header in every csv file.
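A minimal sketch of reading one of those CSV traces with pandas (the file name and the exact header spellings below are assumptions based on the field list above):

    import pandas as pd

    # One line per packet; header names are assumed from the description above
    pkts = pd.read_csv("video_capture_01.csv")

    # Per-protocol packet counts and total payload bytes, as a quick sanity check
    print(pkts.groupby("protocol")[["payload_size"]].agg(["count", "sum"]))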
The amount of data is stated as follows:
Bulk: 19 traces, 3599 s of total duration, 8704 MBytes of pcap files
Video: 23 traces, 4496 s, 1405 MBytes
Web: 23 traces, 4203 s, 148 MBytes
Interactive: 42 traces, 8934 s, 30.5 MBytes
Idle: 52 traces, 6341 s, 0.69 MBytes
The code of our machine learning approach is also included. There is a README.txt file with the documentation of how to use the code.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A selection of analytics metrics for the data.gov.au service. Starting from January 2015 these metrics are aggregated by month and include;
If you have suggestions for additional analytics please send an email to data@pmc.gov.au for consideration.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset includes data from various Decentralized Autonomous Organization (DAO) platforms, namely Aragon, DAOHaus, DAOstack, Realms, Snapshot, and Tally. DAOs are a new form of self-governed online communities deployed on the blockchain. DAO members typically use governance tokens to participate in the DAO decision-making process, often through a voting system where members submit proposals and vote on them.
The description of the methods used for the generation of data, for processing it and the quality-assurance procedures performed on the data can be found here:
https://doi.org/10.1145/3589335.3651481
The dataset comprises three CSV files: deployments.csv, proposals.csv, and votes.csv, each containing essential information regarding DAO deployments, their proposals, and the corresponding votes.
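For illustration, a hedged pandas sketch of combining the three files (the join key proposal_id is an assumption; the real column names are documented in the paper linked above):

    import pandas as pd

    deployments = pd.read_csv("deployments.csv")
    proposals = pd.read_csv("proposals.csv")
    votes = pd.read_csv("votes.csv")

    # Hypothetical join: attach each vote to its proposal, then count votes per proposal
    votes_per_proposal = (votes.merge(proposals, on="proposal_id", how="left")
                               .groupby("proposal_id").size())
    print(votes_per_proposal.describe())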
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A selection of analytics metrics for the NationalMap service. Starting from September 2015 these metrics are aggregated by month and include;
If you have suggestions for additional analytics please send an email to data@pmc.gov.au for consideration.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This folder contains the annotated corpus in CSV format, organized as follows:
* page-data.csv: contains all the annotated web pages with their HTML content.
* document-data.csv: contains the documents extracted from the web pages, where each document contains a single paragraph and has a set of related tables.
* table-data.csv: contains the tables related to each document. It also contains the HTML content of the table extracted from the web page.
* mention-data.csv: contains all the quantity mentions with ground-truth mapping extracted from the documents.
* mention_table-data.csv: contains the related table for each mention.
* annotations-GT.csv: contains the collected ground-truth annotations.
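As a sketch of how the files relate, the following pandas snippet joins the mention, mention-table, and document files (the id columns used for the joins, mention_id and document_id, are hypothetical; adjust them to the actual headers):

    import pandas as pd

    documents = pd.read_csv("document-data.csv")
    mentions = pd.read_csv("mention-data.csv")
    mention_tables = pd.read_csv("mention_table-data.csv")

    # Hypothetical keys: link each quantity mention to its related table and source document
    linked = (mentions.merge(mention_tables, on="mention_id", how="left")
                      .merge(documents, on="document_id", how="left"))
    print(linked.head())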
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MaRV dataset consists of 693 manually evaluated code pairs extracted from 126 GitHub Java repositories, covering four types of refactoring. The dataset also includes metadata describing the refactored elements. Each code pair was assessed by two reviewers selected from a pool of 40 participants. The MaRV dataset is continuously evolving and is supported by a web-based tool for evaluating refactoring representations. This dataset aims to enhance the accuracy and reliability of state-of-the-art models in refactoring tasks, such as refactoring candidate identification and code generation, by providing high-quality annotated data.
Our dataset is located at the path dataset/MaRV.json
The guidelines for replicating the study are provided below:
Dependencies are listed in requirements.txt. Create a .env file based on .env.example in the src folder and set the variables:
CSV_PATH: Path to the CSV file containing the list of repositories to be processed.
CLONE_DIR: Directory where repositories will be cloned.
JAVA_PATH: Path to the Java executable.
REFACTORING_MINER_PATH: Path to RefactoringMiner.
Install the dependencies with pip install -r requirements.txt.
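For illustration only, a filled-in .env might look like the following (all paths are placeholders, not part of the replication package):

    CSV_PATH=/home/user/marv/repositories.csv
    CLONE_DIR=/home/user/marv/clones
    JAVA_PATH=/usr/bin/java
    REFACTORING_MINER_PATH=/opt/RefactoringMiner/bin/RefactoringMiner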
The CSV file referenced by CSV_PATH should contain a column named name with GitHub repository names (format: username/repo). Once the .env file and the repositories CSV are set up, run:
python3 src/run_rm.py
The script clones each repository into CLONE_DIR, retrieves the default branch, and runs RefactoringMiner to analyze it. RefactoringMiner results are saved as .json files in CLONE_DIR, and logs as .log files in the same directory. To count the detected refactorings, run:
python3 src/count_refactorings.py
The output, refactoring_count_by_type_and_file, shows the number of refactorings for each technique, grouped by repository. To collect snippets before and after refactoring and their metadata, run:
python3 src/diff.py '[refactoring technique]'
Replace [refactoring technique] with the desired technique name (e.g., Extract Method).
The script creates a directory for each repository and subdirectories named with the commit SHA. Each commit may have one or more refactorings.
Dataset Availability: The dataset is available in the dataset directory. To generate the SQL file for the Web tool, run:
python3 src/generate_refactorings_sql.py
The Web tool is located in the web directory. Populate the data/output/snippets folder with the output of src/diff.py, run the sql/create_database.sql script in your database, load the SQL file generated by src/generate_refactorings_sql.py, and run dataset.php to generate the MaRV dataset file. The MaRV dataset is also available in the dataset directory of the replication package.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset of DNS over HTTPS traffic from Firefox (Comcast, CZNIC, DNSForge, DNSSB, DOHli). The dataset contains DoH and HTTPS traffic that was captured in a virtualized environment (Docker) and generated automatically by the Firefox browser with DoH enabled towards 5 different DoH servers (Comcast, CZNIC, DNSForge, DNSSB, DOHli), with web page loads towards a sample of web pages taken from the Majestic Million dataset. The data are provided in the form of PCAP files. However, we also provide TLS-enriched flow data generated with the open-source ipfixprobe flow exporter. Information other than TLS-related fields is not relevant, since the dataset comprises only encrypted TLS traffic. The TLS-enriched flow data are provided in the form of CSV files with the following columns:
Column Name | Column Description
DST_IP | Destination IP address
SRC_IP | Source IP address
BYTES | The number of transmitted bytes from Source to Destination
BYTES_REV | The number of transmitted bytes from Destination to Source
TIME_FIRST | Timestamp of the first packet in the flow in format YYYY-MM-DDTHH-MM-SS
TIME_LAST | Timestamp of the last packet in the flow in format YYYY-MM-DDTHH-MM-SS
PACKETS | The number of packets transmitted from Source to Destination
PACKETS_REV | The number of packets transmitted from Destination to Source
DST_PORT | Destination port
SRC_PORT | Source port
PROTOCOL | The number of the transport protocol
TCP_FLAGS | Logical OR across all TCP flags in the packets transmitted from Source to Destination
TCP_FLAGS_REV | Logical OR across all TCP flags in the packets transmitted from Destination to Source
TLS_ALPN | The value of the Application Protocol Negotiation Extension sent by the Server
TLS_JA3 | The JA3 fingerprint
TLS_SNI | The value of the Server Name Indication Extension sent by the Client
The DoH resolvers in the dataset can be identified by the IP addresses listed in the doh_resolver_ip.csv file.
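For example, a hedged sketch of splitting one of the flow CSVs into DoH and non-DoH traffic with pandas (the flow file name and the column name inside doh_resolver_ip.csv, assumed here to be ip, are assumptions):

    import pandas as pd

    flows = pd.read_csv("firefox_flows.csv")          # one of the TLS flow CSVs
    resolvers = pd.read_csv("doh_resolver_ip.csv")    # resolver IPs; column name assumed

    doh_ips = set(resolvers["ip"])
    is_doh = flows["DST_IP"].isin(doh_ips) | flows["SRC_IP"].isin(doh_ips)

    print(f"DoH flows: {is_doh.sum()}, non-DoH flows: {(~is_doh).sum()}")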
The main part of the dataset is located in DoH-Gen-F-CCDDD.tar.gz and has the following structure:
.
└── data                 - Main directory with data
    └── generated        - Directory with generated captures
        ├── pcap         - Generated PCAPs
        │   └── firefox
        └── tls-flow-csv - Generated CSV flow data
            └── firefox
Total stats of generated data:
Total Data Size: 40.2 GB
Total files: 10
DoH extracted TLS flows: ~100 K
Non-DoH extracted TLS flows: ~315 K
DoH Server information:
Name | Provider | DoH query URL
Comcast | https://corporate.comcast.com | https://doh.xfinity.com/dns-query
CZNIC | https://www.nic.cz | https://odvr.nic.cz/doh
DNSForge | https://dnsforge.de | https://dnsforge.de/dns-query
DNSSB | https://dns.sb/doh/ | https://doh.dns.sb/dns-query
DOHli | https://doh.li | https://doh.li/dns-query
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Dataset contains ratios of 13C/12C and 15N/14N during different time intervals for top predator fish and select lower trophic level organisms for use as input for "Stable C and N changes in LML food web R code".
biogas/biogas_0/supplydata197.csv (in step 2, where supply data are specified). This dataset is associated with the following publication: Hu, Y., W. Zhang, P. Tominac, M. Shen, D. Göreke, E. Martín-Hernández, M. Martín, G.J. Ruiz-Mercado, and V.M. Zavala. ADAM: A web platform for graph-based modeling and optimization of supply chains. COMPUTERS AND CHEMICAL ENGINEERING. Elsevier Science Ltd, New York, NY, USA, 165: 107911, (2022).
This script will go through an entire ArcGIS Online Organization or a Portal Organization and look through all of the Web Maps. It will then check all of the URLs of the map services within each Web Map to determine whether they are valid. If a URL is not valid, the script writes the result to a CSV file so it can be taken care of; the CSV file can then be used to aid the administrator in cleaning up the map services with invalid URLs. This is a Jupyter Notebook written using the ArcGIS Python API.
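Below is a condensed sketch of that workflow using the ArcGIS API for Python (the credentials, the ?f=json availability check, and the output file name are illustrative assumptions; the actual notebook may differ):

    import csv
    import requests
    from arcgis.gis import GIS

    gis = GIS("https://www.arcgis.com", "admin_user", "password")   # or your Portal URL

    with open("invalid_service_urls.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["webmap", "layer_title", "url"])
        # Walk every Web Map in the organization and ping each operational layer's service URL
        for item in gis.content.search(query="", item_type="Web Map", max_items=1000):
            for layer in item.get_data().get("operationalLayers", []):
                url = layer.get("url")
                if not url:
                    continue
                try:
                    ok = requests.get(url, params={"f": "json"}, timeout=10).status_code == 200
                except requests.RequestException:
                    ok = False
                if not ok:
                    writer.writerow([item.title, layer.get("title", ""), url])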
This feature service shows Chicago crimes in August 2017.
Abstract: WEB-IDS23 is a network intrusion detection dataset that includes over 12 million flows, categorizing 20 attack types across FTP, HTTP/S, SMTP, SSH, and network scanning activities. The dataset is documented in the paper "Technical Report: Generating the WEB-IDS23 Dataset," which provides insights into the generation, structure, and key characteristics of the dataset. Data: the dataset is available as CSV files under web-ids23. Each file includes the data of one class, and each row corresponds to a flow extracted using Zeek FlowMeter. In total, the dataset includes over 12 million samples. Short documentation: a short documentation of the data and the corresponding labels can be found in the files readme.md and readme.pdf.
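As an illustration, a small sketch of loading those per-class files with pandas and deriving the label from the file name (the directory layout shown is an assumption based on the description above):

    import glob
    import os
    import pandas as pd

    frames = []
    for path in glob.glob("web-ids23/*.csv"):
        df = pd.read_csv(path)
        # Each file holds one class, so use the file name as the attack/benign label
        df["label"] = os.path.splitext(os.path.basename(path))[0]
        frames.append(df)

    dataset = pd.concat(frames, ignore_index=True)
    print(dataset["label"].value_counts())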
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This web-scraped dataset, collected from the cricbuzz website, contains the top 100 Test batsmen, ranked by performance with the best-performing player at the top, in the file top100batsman.csv. The dataset includes only the top 100 players who have performed best in Test cricket, and the data was collected on 7th January 2023.
The dataset contains the following columns:
test_ranking: the current Test ranking of the player.
player_id: a unique player id assigned by cricbuzz.
batsman: the name of the batsman.
rating: the rating assigned by the ICC.
team: the name of the team to which the player belongs.
matches: the number of matches played by the player to date.
innings: the number of times in a match the player has batted.
runs: total number of runs scored by the batsman.
high_score: the highest score achieved by the batsman.
average: the ratio of total runs scored to the number of times the batsman got out.
strike_rate: the overall strike rate of the batsman, calculated as runs scored divided by balls faced.
century: number of centuries scored by the batsman.
double_century: number of double centuries scored by the batsman.
half_century: number of half-centuries scored by the batsman.
fours: total number of fours hit to date.
sixes: total number of sixes hit to date.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General Information
Cebulka (Polish dark web cryptomarket and image board) messages data.
Haitao Shi (The University of Edinburgh, UK); Patrycja Cheba (Jagiellonian University); Leszek Świeca (Kazimierz Wielki University in Bydgoszcz, Poland).
The dataset is part of the research supported by the Polish National Science Centre (Narodowe Centrum Nauki) grant 2021/43/B/HS6/00710.
Project title: “Rhizomatic networks, circulation of meanings and contents, and offline contexts of online drug trade” (2022-2025; PLN 956 620; funding institution: Polish National Science Centre [NCN], call: OPUS 22; Principal Investigator: Piotr Siuda [Kazimierz Wielki University in Bydgoszcz, Poland]).
Data Collection Context
Polish dark web cryptomarket and image board called Cebulka (http://cebulka7uxchnbpvmqapg5pfos4ngaxglsktzvha7a5rigndghvadeyd.onion/index.php).
This dataset was developed within the abovementioned project. The project focuses on studying internet behavior concerning disruptive actions, particularly emphasizing the online narcotics market in Poland. The research seeks to (1) investigate how the open internet, including social media, is used in the drug trade; (2) outline the significance of darknet platforms in the distribution of drugs; and (3) explore the complex exchange of content related to the drug trade between the surface web and the darknet, along with understanding meanings constructed within the drug subculture.
Within this context, Cebulka is identified as a critical digital venue in Poland’s dark web illicit substances scene. Besides serving as a marketplace, it plays a crucial role in shaping the narratives and discussions prevalent in the drug subculture. The dataset has proved to be a valuable tool for performing the analyses needed to achieve the project’s objectives.
Data Content
The data was collected in three periods, i.e., in January 2023, June 2023, and January 2024.
The dataset comprises a sample of messages posted on Cebulka from its inception until January 2024 (including all the messages with drug advertisements). These messages include the initial posts that start each thread and the subsequent posts (replies) within those threads. The dataset is organized into two directories. The “cebulka_adverts” directory contains posts related to drug advertisements (both advertisements and comments). In contrast, the “cebulka_community” directory holds a sample of posts from other parts of the cryptomarket, i.e., those not related directly to trading drugs but rather focusing on discussing illicit substances. The dataset consists of 16,842 posts.
The data has been cleaned and processed using regular expressions in Python, and all personal information was removed in the same way. Identifiers related to instant messaging apps and email addresses have been hashed, and all usernames appearing in messages have been eliminated.
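To illustrate the kind of step described above, here is a minimal regex-plus-hashing sketch in Python (the patterns and the truncated SHA-256 replacement are illustrative assumptions, not the project's actual code):

    import hashlib
    import re

    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    HANDLE_RE = re.compile(r"@\w{5,32}")   # e.g. messenger usernames

    def hash_identifier(match):
        # Irreversibly replace an identifier with a short, stable hash
        return hashlib.sha256(match.group(0).encode()).hexdigest()[:12]

    def anonymize(text):
        text = EMAIL_RE.sub(hash_identifier, text)
        text = HANDLE_RE.sub(hash_identifier, text)
        return text

    print(anonymize("contact me at dealer@example.com or @seller12345"))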
The dataset consists of the following files:
Zipped .txt files (“cebulka_adverts.zip” and “cebulka_community.zip”) containing all messages. These files are organized into individual directories that mirror the folder structure found on Cebulka.
Two .csv files that list all the messages, including file names and the content of each post. The first .csv lists messages from “cebulka_adverts.zip,” and the second .csv lists messages from “cebulka_community.zip.”
Ethical Considerations
A set of data handling policies aimed at ensuring safety and ethics has been outlined in the following paper:
Harviainen, J.T., Haasio, A., Ruokolainen, T., Hassan, L., Siuda, P., Hamari, J. (2021). Information Protection in Dark Web Drug Markets Research [in:] Proceedings of the 54th Hawaii International Conference on System Sciences, HICSS 2021, Grand Hyatt Kauai, Hawaii, USA, 4-8 January 2021, Maui, Hawaii, (ed.) Tung X. Bui, Honolulu, HI, pp. 4673-4680.
The primary safeguard was the early-stage hashing of usernames and identifiers from the messages, utilizing automated systems for irreversible hashing. Recognizing that automatic name removal might not catch all identifiers, the data underwent manual review to ensure compliance with research ethics and thorough anonymization.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In our manuscript, we introduce "ImmunoCheckDB," a web-based tool developed using the Shiny framework, which addresses the need for comprehensive analysis of ICI efficacy data and multiomic markers across different cancer types. ImmunoCheckDB enables users to conduct online meta-analyses and multiomic analyses by collecting and organizing extensive data from published clinical trials and multiomic experiments.