Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The datasets demonstrate the malware economy and the value chain described in our paper, Malware Finances and Operations: a Data-Driven Study of the Value Chain for Infections and Compromised Access, presented at the 12th International Workshop on Cyber Crime (IWCC 2023), part of the ARES Conference, and published in the ACM International Conference Proceeding Series (ICPS).
Using the well-documented scripts, it is straightforward to reproduce our findings. It takes an estimated 1 hour of human time and 3 hours of computing time to duplicate our key findings from MalwareInfectionSet; around one hour with VictimAccessSet; and minutes to replicate the price calculations using AccountAccessSet. See the included README.md files and Python scripts.
We choose to represent each victim by a single JavaScript Object Notation (JSON) data file. Data sources provide sets of victim JSON data files from which we've extracted the essential information and omitted Personally Identifiable Information (PII). We collected, curated, and modelled three datasets, which we publish under the Creative Commons Attribution 4.0 International License.
MalwareInfectionSet: We discover (and, to the best of our knowledge, document scientifically for the first time) that malware networks appear to dump their data collections online. We collected these freely available infostealer malware logs. We utilise 245 malware log dumps from 2019 and 2020 originating from 14 malware networks. The dataset contains 1.8 million victim files, with a dataset size of 15 GB.
VictimAccessSet: We demonstrate how infostealer malware networks sell access to infected victims. Genesis Market focuses on user-friendliness and a continuous supply of compromised data. Marketplace listings include everything necessary to gain access to the victim's online accounts: passwords and usernames, as well as a detailed collection of information that provides a clone of the victim's browser session. Indeed, Genesis Market simplifies the import of compromised victim authentication data into a web browser session. We measure the prices on Genesis Market and how compromised device prices are determined. We crawled the website between April 2019 and May 2022, collecting the web pages offering the resources for sale. The dataset contains 0.5 million victim files, with a dataset size of 3.5 GB.
AccountAccessSet: The Database marketplace operates inside the anonymous Tor network. Vendors offer their goods for sale, and customers can purchase them with Bitcoin. The marketplace sells online accounts, such as PayPal and Spotify, as well as private datasets, such as driver's licence photographs and tax forms. We collect data from Database Market, where vendors sell online credentials, and analyse it in a similar way. To build our dataset, we crawled the website between November 2021 and June 2022, collecting the web pages offering the credentials for sale. The dataset contains 33,896 victim files, with a dataset size of 400 MB.
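Because each victim is stored as one JSON file, a first look at any of the three datasets can be a simple walk over the files. The following is a minimal sketch, not one of the authors' scripts: the function name `count_top_level_keys` is hypothetical, and it assumes only that the files end in `.json` and that most of them hold a JSON object at the top level. The README.md files shipped with the datasets remain the reference for the actual schema.

```python
import json
from collections import Counter
from pathlib import Path


def count_top_level_keys(dataset_dir):
    """Count how often each top-level key occurs across victim JSON files.

    Hypothetical helper: assumes one JSON document per *.json file.
    Returns (number_of_readable_files, Counter of top-level keys).
    """
    key_counts = Counter()
    n_files = 0
    for path in sorted(Path(dataset_dir).rglob("*.json")):
        try:
            with path.open(encoding="utf-8") as fh:
                record = json.load(fh)
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # skip files that are not valid UTF-8 JSON
        n_files += 1
        if isinstance(record, dict):
            key_counts.update(record.keys())
    return n_files, key_counts
```

Calling `count_top_level_keys("path/to/dataset")` on an extracted dataset directory gives a quick inventory of which fields are present before running the heavier analysis scripts.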
Credits
Authors
Billy Bob Brumley (Tampere University, Tampere, Finland)
Juha Nurmi (Tampere University, Tampere, Finland)
Mikko Niemelä (Cyber Intelligence House, Singapore)
Funding
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under project numbers 804476 (SCARE) and 952622 (SPIRS).
Alternative links to download: AccountAccessSet, MalwareInfectionSet, and VictimAccessSet.
This dataset provides geospatial location data and scripts used to analyze the relationship between MODIS-derived NDVI and solar and sensor angles in a pinyon-juniper ecosystem in Grand Canyon National Park. The data are provided in support of the following publication: "Solar and sensor geometry, not vegetation response, drive satellite NDVI phenology in widespread ecosystems of the western United States". The data and scripts allow users to replicate, test, or further explore results.
The file GrcaScpnModisCellCenters.csv contains locations (latitude-longitude) of all the 250-m MODIS (MOD09GQ) cell centers associated with the Grand Canyon pinyon-juniper ecosystem that the Southern Colorado Plateau Network (SCPN) is monitoring through its land surface phenology and integrated upland monitoring programs. The file SolarSensorAngles.csv contains MODIS angle measurements for the pixel at the phenocam location plus a random 100-point subset of pixels within the GRCA-PJ ecosystem. The script files (folder: 'Code') consist of 1) a Google Earth Engine (GEE) script used to download MODIS data through the GEE JavaScript interface, and 2) a script used to calculate derived variables and to test relationships between solar and sensor angles and NDVI using the statistical software package 'R'.
The file Fig_8_NdviSolarSensor.JPG shows NDVI dependence on solar and sensor geometry demonstrated for both a single pixel/year and for multiple pixels over time. (Left) MODIS NDVI versus solar-to-sensor angle for the Grand Canyon phenocam location in 2018, the year for which there is corresponding phenocam data. (Right) Modeled r-squared values by year for 100 randomly selected MODIS pixels in the SCPN-monitored Grand Canyon pinyon-juniper ecosystem. The model for forward-scatter MODIS-NDVI is log(NDVI) ~ solar-to-sensor angle. The model for back-scatter MODIS-NDVI is log(NDVI) ~ solar-to-sensor angle + sensor zenith angle. Boxplots show interquartile ranges; whiskers extend to 10th and 90th percentiles. The horizontal line marking the average median value for forward-scatter r-squared (0.835) is nearly indistinguishable from the back-scatter line (0.833).
The dataset folder also includes supplemental R-project and packrat files that allow the user to apply the workflow by opening a project that will use the same package versions used in this study (e.g., the folders .Rproj.user and packrat, and the files .RData and PhenocamPR.Rproj). The empty folder GEE_DataAngles is included so that the user can save the data files from the Google Earth Engine scripts to this location, where they can then be incorporated into the R processing scripts without needing to change folder names. To successfully use the packrat information to replicate the exact processing steps that were used, the user should refer to packrat documentation available at https://cran.r-project.org/web/packages/packrat/index.html and at https://www.rdocumentation.org/packages/packrat/versions/0.5.0. Alternatively, the user may use the descriptive documentation, the phenopix package documentation, and the descriptions and references provided in the associated journal article to process the data to achieve the same results using newer packages or other software programs.
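The forward-scatter model stated above, log(NDVI) ~ solar-to-sensor angle, is an ordinary least-squares fit of log-transformed NDVI on a single predictor. The sketch below illustrates that fit and its r-squared in plain Python. It is an illustration only: the published analysis uses the R script in the 'Code' folder, the function name `fit_log_linear` is hypothetical, and the two-predictor back-scatter model is not covered here.

```python
import math


def fit_log_linear(angles, ndvi):
    """Fit log(NDVI) = intercept + slope * angle by ordinary least squares.

    Hypothetical helper for illustration; returns (intercept, slope, r_squared).
    NDVI values must be positive so that the logarithm is defined.
    """
    y = [math.log(v) for v in ndvi]
    n = len(y)
    mean_x = sum(angles) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in angles)
    sxy = sum((x - mean_x) * (yi - mean_y) for x, yi in zip(angles, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r-squared = 1 - residual sum of squares / total sum of squares.
    ss_res = sum((yi - (intercept + slope * x)) ** 2 for x, yi in zip(angles, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1.0 - ss_res / ss_tot
```

Applied per pixel and per year, a fit of this kind yields the r-squared values that the right-hand panel of the figure summarises as boxplots.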
This repository contains all the data used for the article "Monitoring cropland daily carbon dioxide exchange at field scales with Sentinel-2 satellite imagery" by Pia Gottschalk, Aram Kalhori, Zhan Li, Christian Wille, Torsten Sachs. The data are used to exemplify how ground-measured CO2 fluxes of an agricultural field can be linked with remotely sensed vegetation indices to provide an upscaling approach for spatial CO2-flux projection. The provided data form the basis for running the data processing scripts sequentially to (re-)produce all statistical analyses, results and figures in the article. The data are given in the formats used in the data-processing scripts written in R, MATLAB and the JavaScript of Google Earth Engine. All code for processing the data and a workflow description can be found here.
The dataset covers three types of data: half-hourly eddy covariance (EC) data, satellite-derived vegetation indices and GIS/image data. Continuous EC CO2 fluxes (03/2020 - 08/2023) are measured at the agricultural site "Heydenhof" in Northeastern Germany. The data file is provided in .mat (MATLAB) format containing the standard EddyPro software output variables, which are described in an accompanying metadata file. The land use information used for footprint modeling is included as .jpeg and .png files for visualisation and as a .mat file to be used for running the footprint modeling script. Sentinel-2 vegetation indices are provided as .csv files. These files are provided for convenience and version control only, as the JavaScript for generating Sentinel-2 derived vegetation indices in Google Earth Engine is provided in the associated code repository. Here, the field boundaries are provided as a shapefile.
Data file description:
"HEY_LandUse_image.mat": MATLAB file in raster format, containing the land use codes in a 4x4 km raster with a resolution of 1 m, used for running the Kormann-Meixner footprint model for flux source area attribution.
"meta_data_HEY_LandUse_image.txt": description of the land use codes used in "HEY_LandUse_image.mat".
"HEY_LandUse_image.png": visualisation of "HEY_LandUse_image.mat" (Figure A2 in the manuscript), showing the land use distribution around the measurement tower encoded in the number of land use classes used for footprint modeling.
"HEYDENHOF.jpeg": visualisation of land use classes from digitisation (auxiliary information), showing the land use distribution around the measurement tower.
"HEY_FluxData_20200304_20220824_all_data.mat": MATLAB data file containing the half-hourly EC measurements plus auxiliary meteorological variables from 04/03/2020 to 24/08/2022 in matrix format, with rows being the half-hourly measurements, including the unique time identifier "Timestamp", and with "NaN" as the missing data value.
"meta_data_HEY_FluxData.txt": text file accompanying "HEY_FluxData_20200304_20220824_all_data.mat", containing the variable names, units, format, range and description for its variables.
"TERENO_prec_data_2020_2022.csv": comma-separated text file containing the half-hourly precipitation data for the measurement site (HEY) from 01/01/2020 to 13/10/2022.
"meta_data_TERENO_prec.txt": text file accompanying "TERENO_prec_data_2020_2022.csv", containing the variable description of the TERENO precipitation data.
"HEY_tower_field.zip": zipped shapefile outlining the agricultural field used as source area for the satellite data retrieval.
"S2.csv": comma-separated text file containing the vegetation indices from Sentinel-2 for the agricultural field from 02/03/2020 to 29/08/2022.
"meta_data_Sentinel2_S2.txt": text file accompanying "S2.csv", containing the variable description of the Sentinel-2 derived vegetation indices.
"S2_SD.csv": comma-separated text file containing the standard deviation of the vegetation indices for the agricultural field from 02/03/2020 to 29/08/2022.
"meta_data_Sentinel2_S2_SD.txt": text file accompanying "S2_SD.csv", containing the variable description of the standard deviation for the Sentinel-2 derived vegetation indices.
https://creativecommons.org/publicdomain/zero/1.0/
Data set containing Tweets captured during the 2018 UEFA Champions League Final between Real Madrid and Liverpool.
All Twitter APIs that return Tweets provide that data encoded using JavaScript Object Notation (JSON). JSON is based on key-value pairs, with named attributes and associated values. The JSON file includes the following objects and attributes:
Tweet - Tweets are the basic atomic building block of all things Twitter. The Tweet object has a long list of ‘root-level’ attributes, including fundamental attributes such as id, created_at, and text. Tweet child objects include user, entities, and extended_entities. Tweets that are geo-tagged will have a place child object.
User - Contains public Twitter account metadata and describes the author of the Tweet with attributes such as name, description, followers_count, friends_count, etc.
Entities - Provide metadata and additional contextual information about content posted on Twitter. The entities section provides arrays of common things included in Tweets: hashtags, user mentions, links, stock tickers (symbols), Twitter polls, and attached media.
Extended Entities - All Tweets with attached photos, videos and animated GIFs will include an extended_entities JSON object.
Places - Tweets can be associated with a location, generating a Tweet that has been ‘geo-tagged.’
More information here.
I used the filterStream() function to open a connection to Twitter's Streaming API, using the keyword #UCLFinal.
The capture started on Saturday, May 27th 6:45 pm UTC (beginning of the match) and finished on Saturday, May 27th 8:45 pm UTC.
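The Tweet, User, Entities, Extended Entities and Places objects described above can be pulled out of a parsed Tweet with ordinary dictionary access. The sketch below is a hedged illustration, not code from the dataset's author: the function name `summarize_tweet` is hypothetical, and it assumes the standard Tweet JSON layout in which each hashtag entry has a `text` field and a place object has a `full_name` field.

```python
import json


def summarize_tweet(tweet):
    """Extract a few commonly used fields from one parsed Tweet object.

    Hypothetical helper; missing attributes are returned as None or empty.
    """
    user = tweet.get("user") or {}
    entities = tweet.get("entities") or {}
    place = tweet.get("place") or {}  # place is null unless the Tweet is geo-tagged
    return {
        "id": tweet.get("id"),
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text"),
        "user_name": user.get("name"),
        "followers_count": user.get("followers_count"),
        "hashtags": [h.get("text") for h in entities.get("hashtags", [])],
        "has_media": "extended_entities" in tweet,
        "place_name": place.get("full_name"),
    }


def summarize_json_line(line):
    """Parse one JSON-encoded Tweet (for example, one line of a capture file)."""
    return summarize_tweet(json.loads(line))
```

How the capture file is laid out (one Tweet per line or a single JSON array) is not stated here, so `summarize_json_line` assumes the one-per-line case and would need adapting otherwise.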