The global number of households with a computer was forecast to increase continuously between 2024 and 2029 by a total of 88.6 million households (+8.6 percent). After the fifteenth consecutive year of growth, the number of computer households is estimated to reach 1.1 billion in 2029, a new peak. Notably, the number of households with a computer has been increasing continuously over the past years. Computer households are defined as households possessing at least one computer. The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic, and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations, and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of households with a computer in regions like the Caribbean and Africa.
The global household computer penetration rate was forecast to increase continuously between 2024 and 2029 by a total of 2.4 percentage points. After the eleventh consecutive year of growth, the computer penetration rate is estimated to reach 52.78 percent in 2029, a new peak. Depicted is the estimated share of households owning at least one computer. The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic, and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations, and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for household computer penetration in regions like Australia & Oceania and the Caribbean.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Imports of Computers in the United States increased to 9348.57 USD Million in February from 7737.25 USD Million in January of 2024. This dataset includes a chart with historical data for the United States Imports of Computers.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book series. It has 1 row and is filtered where the book is How to sell computers and accessories on eBay. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Yelp Dataset JSON
Each file is composed of a single object type, one JSON object per line.
Take a look at some examples to get you started: https://github.com/Yelp/dataset-examples.
Note: the following examples contain inline comments, which are technically not valid JSON. This is done here to simplify the documentation and explain the structure; the JSON files you download will not contain any comments and will be fully valid JSON.
business.json Contains business data including location data, attributes, and categories.
{
// string, 22 character unique string business id
"business_id": "tnhfDv5Il8EaGSXZGiuQGg",
// string, the business's name
"name": "Garaje",
// string, the full address of the business
"address": "475 3rd St",
// string, the city
"city": "San Francisco",
// string, 2 character state code, if applicable
"state": "CA",
// string, the postal code
"postal code": "94107",
// float, latitude
"latitude": 37.7817529521,
// float, longitude
"longitude": -122.39612197,
// float, star rating, rounded to half-stars
"stars": 4.5,
// integer, number of reviews
"review_count": 1198,
// integer, 0 or 1 for closed or open, respectively
"is_open": 1,
// object, business attributes to values. note: some attribute values might be objects
"attributes": {
"RestaurantsTakeOut": true,
"BusinessParking": {
"garage": false,
"street": true,
"validated": false,
"lot": false,
"valet": false
},
},
// an array of strings of business categories
"categories": [
"Mexican",
"Burgers",
"Gastropubs"
],
// an object of key day to value hours, hours are using a 24hr clock
"hours": {
"Monday": "10:00-21:00",
"Tuesday": "10:00-21:00",
"Friday": "10:00-21:00",
"Wednesday": "10:00-21:00",
"Thursday": "10:00-21:00",
"Sunday": "11:00-18:00",
"Saturday": "10:00-21:00"
}
}
review.json Contains full review text data including the user_id that wrote the review and the business_id the review is written for.
{
// string, 22 character unique review id
"review_id": "zdSx_SD6obEhz9VrW9uAWA",
// string, 22 character unique user id, maps to the user in user.json
"user_id": "Ha3iJu77CxlrFm-vQRs_8g",
// string, 22 character business id, maps to business in business.json
"business_id": "tnhfDv5Il8EaGSXZGiuQGg",
// integer, star rating
"stars": 4,
// string, date formatted YYYY-MM-DD
"date": "2016-03-09",
// string, the review itself
"text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.",
// integer, number of useful votes received
"useful": 0,
// integer, number of funny votes received
"funny": 0,
// integer, number of cool votes received
"cool": 0
}
user.json User data including the user's friend mapping and all the metadata associated with the user.
{
// string, 22 character unique user id, maps to the user in user.json
"user_id": "Ha3iJu77CxlrFm-vQRs_8g",
// string, the user's first name
"name": "Sebastien",
// integer, the number of reviews they've written
"review_count": 56,
// string, when the user joined Yelp, formatted like YYYY-MM-DD
"yelping_since": "2011-01-01",
// array of strings, an array of the user's friend as user_ids
"friends": [
"wqoXYLWmpkEH0YvTmHBsJQ",
"KUXLLiJGrjtSsapmxmpvTA",
"6e9rJKQC3n0RSKyHLViL-Q"
],
// integer, number of useful votes sent by the user
"useful": 21,
// integer, number of funny votes sent by the user
"funny": 88,
// integer, number of cool votes sent by the user
"cool": 15,
// integer, number of fans the user has
"fans": 1032,
// array of integers, the years the user was elite
"elite": [
2012,
2013
],
// float, average rating of all reviews
"average_stars": 4.31,
// integer, number of hot compliments received by the user
"compliment_hot": 339,
// integer, number of more compliments received by the user
"compliment_more": 668,
// integer, number of profile compliments received by the user
"compliment_profile": 42,
// integer, number of cute compliments received by the user
"compliment_cute": 62,
// integer, number of list compliments received by the user
"compliment_list": 37,
// integer, number of note compliments received by the user
"compliment_note": 356,
// integer, number of plain compliments received by the user
"compliment_plain": 68,
// integer, number of coo...
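Since each file contains one JSON object per line, a minimal Python sketch for loading and filtering business.json could look like the following (an illustrative sketch, not part of the official Yelp documentation; the field names follow the business.json description above):

import json

# Read line-delimited JSON: each line of business.json is one business object.
businesses = []
with open("business.json", "r", encoding="utf-8") as f:
    for line in f:
        businesses.append(json.loads(line))

# Example: names of highly rated businesses that are currently open.
top = [b["name"] for b in businesses if b["stars"] >= 4.5 and b["is_open"] == 1]
print(top[:10])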
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.3/customlicense?persistentId=doi:10.7910/DVN/WIYLEH
Originally published by Harte-Hanks, the CiTDS dataset is now produced by Aberdeen Group, a subsidiary of Spiceworks Ziff Davis (SWZD). It is also referred to as CiTDB (Computer Intelligence Technology Database). CiTDS provides data on digital investments of businesses across the globe. It includes two types of technology datasets: (i) hardware expenditures and (ii) product installs. Hardware expenditure data is constructed through a combination of surveys and modeling. A survey is administered to a number of companies, and the survey data is used to develop a model that predicts expenditures as a function of firm characteristics. CiTDS uses this model to predict the expenditures of non-surveyed firms and reports them in the dataset. In contrast, CiTDS does not do any imputation for product install data, which comes entirely from web scraping and surveys. A confidence score between 1 and 3 is assigned to indicate how much the source of information can be trusted: a 3 corresponds to 90-100 percent install likelihood, 2 corresponds to 75-90 percent install likelihood, and 1 corresponds to 65-75 percent install likelihood. CiTDS reports technology adoption at the site level with a unique DUNS identifier. One of these sites is identified as an "enterprise," corresponding to the firm that owns the sites. Therefore, it is possible to analyze technology adoption both at the site (establishment) and enterprise (firm) levels. CiTDS sources the site population from Dun and Bradstreet every year and drops sites that are not relevant to their clients. Due to this sample selection, there is quite a bit of variation in the number of sites from year to year; on average, 10-15 percent of sites enter and exit every year in the US data, and this number is higher in the EU data. We observe similar year-to-year turnover in the products included in the dataset: some products become obsolete, and some new products are added every year. There are two versions of the data: (i) version 3, which covers 2016-2020, and (ii) version 4, which covers 2020-2021. The quality of version 4 is significantly better regarding the information included about the technology products. In version 3, product categories have missing values, and they are abbreviated in ways that are sometimes difficult to interpret. Version 4 does not have any major issues. Since both versions of the data are available in 2020, CiTDS provides a crosswalk between the versions. This makes it possible to use information about products in version 4 for the products in version 3, with the caveat that there is no crosswalk for products that exist in 2016-2019 but not in 2020. Finally, special attention should be paid to data from 2016, where the coverage is significantly different from 2017. From 2017 onwards, coverage is more consistent. Years of coverage: APac: 2019-2021; Canada: 2015-2021; EMEA: 2019-2021; Europe: 2015-2018; Latin America: 2015, 2019-2021; United States: 2015-2021.
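As a small illustration of the confidence-score coding described above, a hedged Python sketch (the mapping values come from the description; the function and variable names are ours, not the actual CiTDS schema):

# Map CiTDS product-install confidence scores to the likelihood ranges above.
INSTALL_LIKELIHOOD = {
    3: "90-100 percent install likelihood",
    2: "75-90 percent install likelihood",
    1: "65-75 percent install likelihood",
}

def describe_confidence(score):
    return INSTALL_LIKELIHOOD.get(score, "unknown score")

print(describe_confidence(2))  # 75-90 percent install likelihood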
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Acknowledgment to supporters: "Thank you to everyone who supported the UGRansome dataset; it has received a Bronze medal on Kaggle!"
The UGRansome dataset is a versatile cybersecurity resource designed for the analysis of ransomware and zero-day cyber-attacks, particularly those exhibiting cyclostationary behavior. This dataset features various essential components, including timestamps for attack time tracking, flags for categorizing attack types, protocol data for understanding attack vectors, network flow details to observe data transfer patterns, and ransomware family classifications.
It also provides insight into the associated malware, numeric clustering for pattern recognition, and quantifies financial damage in both USD and bitcoins (BTC). The dataset employs machine learning to generate attack signatures and offers synthetic signatures for testing and simulating cybersecurity defenses.
Additionally, it can be used to identify and document anomalies, contributing to anomaly detection research and enhancing cybersecurity understanding and preparedness. This dataset offers valuable information for researchers and practitioners interested in leveraging it for various analytical and investigatory purposes such as ransomware and zero-day threats detection and classification. The dataset required deduplication and transformation.
The UGRansome dataset has been previously utilized in studies by Tokmak (2022); Alhashmi et al. (2024); Chaudhary & Adhikari (2024); Sokhonn, Park, & Lee (2024); P. Yan et al. (2024); Sharath Kumar et al. (2024); and Mohamed, A.A., Al-Saleh, A., Sharma, S.K. et al. (2025).
It has been utilized and cited in several master's dissertations and reports, demonstrating its relevance in the field of anomaly intrusion detection. Notable examples include:
S. R. Zahra, 2022. "UGRansome: Optimal Approach for Anomaly Intrusion Detection and Zero-day Threats using Cloud Environment." Master's Research in Cloud Computing, School of Computing, National College of Ireland. https://www.researchgate.net/publication/365172610_UGRansome_Optimal_Approach_for_Anomaly_Intrusion_Detection_and_Zero-day_Threats_using_Cloud_Environment_MSc_Research_Project_Cloud_Computing/citations
B. Torky, 2023. "Ensemble Methods for Anomaly Detection in Enterprise Systems." Thesis, Rochester Institute of Technology, Dubai. Advisor: Sanjay Modak. https://repository.rit.edu/theses/11497/
A. Igugu, 2024. "Evaluating the Effectiveness of AI and Machine Learning Techniques for Zero-Day Attacks Detection in Cloud Environments" Master of Science in Information Security, Luleå University of Technology, Sweden. Department of Computer Science, Electrical and Space Engineering. Supervisor: Dr. Saguna. Examiner: Prof. Christer Ahlund. https://www.diva-portal.org/smash/get/diva2:1890285/FULLTEXT02
Duran, M., duSoft Yazılım, A.Ş. and Kilinc, H., 2024. D2.1 – Academic and Technology SoTA Report. Sierra (Panel), 1, pp.26-11. Edited by: Hakan Kilinc (Orion, Türkiye), Eva Catarina Gomes Maia (ISEP, Portugal), Orhan Yildirim (Beam Teknoloji, Türkiye), Gabriela Sousa (VisionWare, Portugal), Özgü Özkan, Melike Çolak, Nesil Bor (Bites, Türkiye), Daniel Esteban Villamil Sierra (Panel, Spain). https://itea4.org/project/vesta.html
Kaliberda A. A. Development of an anti-virus solution based on neural networks: master's thesis; Ural Federal University, Institute of Radio Electronics and Information Technologies-RTF, Department of Information Technologies and Control Systems. Russia — Yekaterinburg, 2024. — 52 p. http://elar.urfu.ru/handle/10995/140331
These citations underline the impact of the UGRansome dataset in advancing research on intrusion detection and cybersecurity:
• Mohamed, A.A., Al-Saleh, A., Sharma, S.K. et al. Zero-day exploits detection with adaptive WavePCA-Autoencoder (AWPA) adaptive hybrid exploit detection network (AHEDNet). Sci Rep 15, 4036 (2025). https://doi.org/10.1038/s41598-025-87615-2
• P. Yan, T. T. Khoei, R. S. Hyder and R. S. Hyder, "A Dual-Stage Ensemble Approach to Detect and Classify Ransomware Attacks," 2024 IEEE 15th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), Yorktown Heights, NY, USA, 2024, pp. 781-786, doi: 10.1109/UEMCON62879.2024.10754695.
• Por, L.Y., Dai, Z., Leem, S.J., Chen, Y., Yang, J., Binbeshr, F., Phan, K.Y. and Ku, C.S., 2024. A Systematic Literature Review on the Methods and Challenges in Detecting Zero-Day Attacks: Insights from the Recent CrowdStrike Incident. IEEE Access.
• Torky, B., Karamitsos, I., Najar, T. (2024). Anomaly Detection in Enterprise Payment Systems: An Ensemble Machine Learning Approach. In: Emrouznejad, A., Zervopoulos, P.D., Ozturk, I., Jamali, D., Rice, J. (eds) Business Analytics and Decision Making in Practice. ICBAP 2024. Lecture Notes in Operations Research. Springer, Cham. https://doi.org/10.1007/978-3-...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Exports of Computer & Units in Japan increased to 30064.14 JPY Million in February from 26241.37 JPY Million in January of 2024. This dataset includes a chart with historical data for Japan Exports of Computer & Units.
Integrated computing curricula combine learning objectives in computing with those in another discipline, like literacy, math, or science, to give all students experience with computing, typically before they must decide whether to take standalone CS courses. One goal of integrated computing curricula is to provide an accessible path to an introductory computing course by introducing computing concepts and practices in required courses. This dataset analyzed integrated computing curricula to determine which CS practices and concepts they teach and how extensively and, thus, how they prepare students for later computing courses. The authors conducted a content analysis to examine primary and lower secondary (i.e., K-8) curricula that are taught in non-CS classrooms, have explicit CS learning objectives (i.e., CS+X), and that took 5+ hours to complete. Lesson plans, descriptions, and resources were scored based on frameworks developed from the K-12 CS Framework, including programming conc...

Search and Inclusion Criteria: While the current dataset used many of the same tools as a systematic literature review to find curricula, it is not a systematic review. Unlike in literature reviews, there are no databases of integrated computing curricula to search systematically. Instead, we searched the literature for evidence-based curricula. We first searched the ACM Digital Library for papers with "(integration OR integrated) AND (computing OR 'computer science' OR CS) AND curriculum" to find curricula that had been studied. We repeated the search with Google Scholar in journals that include "(computing OR 'computer science' OR computers) AND (education OR research)" in their titles, such as Computer Science Education, Computers & Education, and Journal of Educational Computing Research. Last, we examined each entry in CSforAll's curriculum directory for curricula that matched our inclusion criteria. We used four inclusion criteria to select curricula for analysis. Our first cri...

# Extended computing integrated curricula scored for K-12 CS standards
https://doi.org/10.5061/dryad.j6q573nnt
Framework Development and Scoring Training
Full details about the framework development and training for the scorers can be found in Margulieux, L. E., Liao, Y-C., Anderson, E., Parker, M. C., & Calandra, B. D. (2024). Intent and extent: Computer science concepts and practices in integrated computing. ACM Transactions on Computing Education. doi: 10.1145/3664825
The listed computing integrated extended curricula were scored for which concepts and practices they included. The concepts and practices are based on the K-12 CS framework.
1 = Present, Blank = Not present
<h1>Similarity Computation with Extremely Randomized Clustering Forests (ERCF)</h1>
<h2> Content </h2>
This page briefly describes our work on similarity computation. <br>
It provides:
<ul>
<li>the related <b>CVPR'07 paper</b></li>
<li>the toy car <b>dataset</b></li>
<li>a <b>binary</b> of the algorithm.</li>
</ul>
<h2> Objective </h2>
<img src="sameordifferent.png" alt="same or different?">
<p>
Our purpose is to compute a similarity measure between two images.
That measure should deal with never seen objects (never seen car
models, never seen faces, ...) and should be robust to modifications in
pose, background, light. It is trained with pairs of images labeled
"Same" or "Different". This is less informative than fully labeled
training images ("Car model 1", "Car model 2", ...) but much cheaper to
obtain.
</p>
<h2> Algorithm, Paper</h2>
The algorithm is fully described in the following paper, section 2 (quite self-contained).<br>
Eric Nowak and Frédéric Jurie,
<i>Learning Visual Similarity Measures for Comparing Never Seen Objects</i>,
Computer Vision and Pattern Recognition 2007 (CVPR'07), <a href="../dwl/cvpr07.pdf">pdf</a>.
<br>You can also download <a href="../dwl/nowak_jurie_cvpr07_slides.pdf">the slides of the talk</a>.
<h2> Datasets </h2>
<p>
Our algorithm is evaluated on three public datasets, and also on our own dataset of toy cars.
The toy car dataset can be downloaded <a href="../dwl/toycarlear.tar.gz">here (23Mb)</a>.
The archive contains the images and a metadata file (pairs.txt).
</p>
<p>
The pairs of images of the toycar dataset are made from these
vehicles:<br>
<img src="allcars_nb.jpg" alt="cars from toycar dataset">
</p>
<h2> Binary </h2>
<p>
You can <a href="../dwl/pRazSimiERCF.gz">download a binary (~1Mb)</a> of our algorithm for linux machines.
It should work on many distributions, but we have only tested it on
Mandrakelinux 10.1 for i586, kernel 2.6.
The binary requires the following standard libraries: linux-gate.so.1,
libpng.so.3, libjpeg.so.62, libpthread.so.0, libstdc++.so.6, libm.so.6,
libgcc_s.so.1, libc.so.6, libz.so.1, /lib/ld-linux.so.2.
</p>
<p>
This binary is a reimplementation of
our CVPR'07 algorithm; for simplicity it does NOT contain
the geometry-based split conditions, which usually increase the overall
EER-PR by 1%.
</p>
<p>
<b>Help</b> about command line options is obtained by:
<br>
<b>The best way to understand the behavior of the algorithm is to try it on our toy car dataset.</b>
<br>
Use randseed=1 to reproduce the following result, or randseed=0 to
initialize the random number generator with the current time.
</p>
<table>
<tbody><tr>
<td>Algorithm</td>
<td>
Eric Nowak and Frédéric Jurie,
<i>Learning Visual Similarity Measures for Comparing Never Seen Objects</i>,
Computer Vision and Pattern Recognition 2007 (CVPR'07).
<a href="../dwl/cvpr07.pdf">pdf, algorithm in section 2</a>.
<br>You can also download <a href="../dwl/nowak_jurie_cvpr07_slides.pdf">the slides of the talk</a>.
</td>
</tr>
<tr>
<td>Binary</td>
<td><a href="../dwl/pRazSimiERCF.gz">for linux (~1Mb)</a></td>
</tr>
<tr>
<td>Dataset</td>
<td><a href="../dwl/toycarlear.tar.gz">toycars dataset (~23Mb)</a></td>
</tr>
<tr>
<td>Command line</td>
<td>
</td>
</tr>
<tr>
<td> Output files</td>
<td>
<a href="../dwl/run_5_trees.tar.gz">outputs of the previous command line (~6Mb)</a>,
shows the trees, mem usage, detailed performance information, etc.
</td>
</tr>
<tr>
<td> Performance<br> (SVM C=1) </td>
<td>
<ul>
<li>Precision Recall Equal Error Rate (EER-PR): 84.4%</li>
<li>Computation time (learn+test): 17 hours on a P4-3.4GHz</li>
<li>Maximum memory usage: 465Mb </li>
</ul>
</td>
</tr>
</tbody></table>
<p>
The binary can also visualize the patch pairs used to learn the trees.
The following patch pairs have been produced with:<br>
</p>
<table>
<tbody><tr><th>Pair label</th><th>Random patch in first image</th><th>Corresponding patch in second image</th></tr>
<tr><td>Different</td><td><img src="pairs/res_treepatches_neg_0_0_I0.jpg" alt="im0"></td><td><img src="pairs/res_treepatches_neg_0_0_I1.jpg" alt="im1"></td></tr>
<tr><td>Different</td><td><img src="pairs/res_treepatches_neg_1_0_I0.jpg" alt="im0"></td><td><img src="pairs/res_treepatches_neg_1_0_I1.jpg" alt="im1"></td></tr>
<tr><td>Different</td><td><img src="pairs/res_treepatches_neg_2_0_I0.jpg" alt="im0"></td><td><img src="pairs/res_treepatches_neg_2_0_I1.jpg" alt="im1"></td></tr>
<tr><td>Different</td><td><img src="pairs/res_tr...
https://creativecommons.org/publicdomain/zero/1.0/
E-commerce has become a new channel to support business development. Through e-commerce, businesses can gain access to and establish a wider market presence by providing cheaper and more efficient distribution channels for their products or services. E-commerce has also changed the way people shop and consume products and services. Many people are turning to their computers or smart devices to order goods, which can easily be delivered to their homes.
This is a sales transaction data set of UK-based e-commerce (online retail) for one year. This London-based shop has been selling gifts and homewares for adults and children through the website since 2007. Their customers come from all over the world and usually make direct purchases for themselves. There are also small businesses that buy in bulk and sell to other customers through retail outlet channels.
The data set contains 500K rows and 8 columns. The following is the description of each column. 1. TransactionNo (categorical): a six-digit unique number that defines each transaction. The letter "C" in the code indicates a cancellation. 2. Date (numeric): the date when each transaction was generated. 3. ProductNo (categorical): a five- or six-digit unique code used to identify a specific product. 4. Product (categorical): product/item name. 5. Price (numeric): the price of each product per unit in pound sterling (£). 6. Quantity (numeric): the quantity of each product per transaction. Negative values relate to cancelled transactions. 7. CustomerNo (categorical): a five-digit unique number that defines each customer. 8. Country (categorical): name of the country where the customer resides.
There is a small percentage of order cancellations in the data set. Most of these cancellations were due to out-of-stock conditions on some products. In this situation, customers tend to cancel an order because they want all products delivered at once.
Information is a main asset of businesses nowadays. The success of a business in a competitive environment depends on its ability to acquire, store, and utilize information. Data is one of the main sources of information. Therefore, data analysis is an important activity for acquiring new and useful information. Analyze this dataset and try to answer the following questions. 1. How was the sales trend over the months? 2. What are the most frequently purchased products? 3. How many products does the customer purchase in each transaction? 4. Which customer segments are the most profitable? 5. Based on your findings, what strategy could you recommend to the business to gain more profit?
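As a starting point for question 1, a minimal pandas sketch (the file name "sales.csv" is a placeholder assumption; the column names follow the description above):

import pandas as pd

# Load the transactions; "sales.csv" is a placeholder file name.
df = pd.read_csv("sales.csv", parse_dates=["Date"])

# Exclude cancellations (TransactionNo starting with "C"), then compute
# monthly revenue = Price * Quantity, grouped by month.
valid = df[~df["TransactionNo"].astype(str).str.startswith("C")]
monthly = (valid["Price"] * valid["Quantity"]).groupby(valid["Date"].dt.to_period("M")).sum()
print(monthly)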
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Network traffic datasets with a novel extended IP flow called NetTiSA flow
Datasets were created for the paper: NetTiSA: Extended IP Flow with Time-series Features for Universal Bandwidth-constrained High-speed Network Traffic Classification -- Josef Koumar, Karel Hynek, Jaroslav Pešek, Tomáš Čejka -- published in Computer Networks (The International Journal of Computer and Telecommunications Networking), https://doi.org/10.1016/j.comnet.2023.110147. Please cite the usage of our datasets as:
Josef Koumar, Karel Hynek, Jaroslav Pešek, Tomáš Čejka, "NetTiSA: Extended IP flow with time-series features for universal bandwidth-constrained high-speed network traffic classification", Computer Networks, Volume 240, 2024, 110147, ISSN 1389-1286
@article{KOUMAR2024110147,
  title = {NetTiSA: Extended IP flow with time-series features for universal bandwidth-constrained high-speed network traffic classification},
  journal = {Computer Networks},
  volume = {240},
  pages = {110147},
  year = {2024},
  issn = {1389-1286},
  doi = {https://doi.org/10.1016/j.comnet.2023.110147},
  url = {https://www.sciencedirect.com/science/article/pii/S1389128623005923},
  author = {Josef Koumar and Karel Hynek and Jaroslav Pešek and Tomáš Čejka}
}
This Zenodo repository contains 23 datasets created from 15 well-known published datasets, which are cited in the table below. Each dataset contains the NetTiSA flow feature vector.
NetTiSA flow feature vector
The novel extended IP flow called NetTiSA (Network Time Series Analysed) flow contains a universal bandwidth-constrained feature vector consisting of 20 features. We divide the NetTiSA flow classification features into three groups by computation. The first group of features is based on classical bidirectional flow information: the number of transferred bytes and packets. The second group contains statistical and time-based features calculated using the time-series analysis of the packet sequences. The third type of features can be computed from the previous groups (i.e., on the flow collector) and improves the classification performance without any impact on the telemetry bandwidth.
Flow features
The flow features are:
Packets is the number of packets in the direction from the source to the destination IP address.
Packets in reverse order is the number of packets in the direction from the destination to the source IP address.
Bytes is the size of the payload in bytes transferred in the direction from the source to the destination IP address.
Bytes in reverse order is the size of the payload in bytes transferred in the direction from the destination to the source IP address.
Statistical and Time-based features
These features are exported in the extended part of the flow. All of them can be computed (exactly or approximately) in a stream-wise fashion, which is necessary for keeping memory requirements low. The second feature set contains the following features:
Mean represents mean of the payload lengths of packets
Min is the minimal value from payload lengths of all packets in a flow
Max is the maximum value from payload lengths of all packets in a flow
Standard deviation is a measure of the variation of payload lengths from the mean payload length
Root mean square is the measure of the magnitude of payload lengths of packets
Average dispersion is the average absolute difference between each payload length of the packet and the mean value
Kurtosis is the measure describing the extent to which the tails of a distribution differ from the tails of a normal distribution
Mean of relative times is the mean of the relative times, a sequence defined as \(st = \{t_1 - t_1, t_2 - t_1, \dots, t_n - t_1\}\)
Mean of time differences is the mean of the time differences, a sequence defined as \(dt = \{ t_j - t_i \mid j = i + 1,\ i \in \{1, 2, \dots, n - 1\} \}\).
Min from time differences is the minimal value from all time differences, i.e., min space between packets.
Max from time differences is the maximum value from all time differences, i.e., max space between packets.
Time distribution describes the deviation of time differences between individual packets within the time series. The feature is computed by the following equation: \(tdist = \frac{\frac{1}{n-1} \sum_{i=1}^{n-1} \left| \mu_{dt_{n-1}} - dt_i \right|}{\frac{1}{2} \left( \max(dt_{n-1}) - \min(dt_{n-1}) \right)}\), where \(\mu_{dt_{n-1}}\) denotes the mean of the time differences.
Switching ratio represents a value change ratio (switching) between payload lengths. It is computed by the equation \(sr = \frac{s_n}{\frac{1}{2} (n - 1)}\),
where \(s_n\) is the number of switches.
Features computed at the collector
The third set contains features that are computed from the previous two groups prior to classification. Therefore, they do not influence the network telemetry size, and their computation does not put additional load on resource-constrained flow monitoring probes. The NetTiSA flow combined with this feature set is called the Enhanced NetTiSA flow and contains the following features:
Max minus min is the difference between the maximum and minimum payload lengths
Percent deviation is the dispersion of the average absolute difference to the mean value
Variance is the spread measure of the data from its mean
Burstiness is the degree of peakedness in the central part of the distribution
Coefficient of variation is a dimensionless quantity that compares the dispersion of a time series to its mean value and is often used to compare the variability of different time series that have different units of measurement
Directions describes the percentage ratio of packet directions, computed as \(\frac{d_1}{d_1 + d_0}\), where \(d_1\) is the number of packets in the direction from the source to the destination IP address and \(d_0\) is the number in the opposite direction. Both \(d_1\) and \(d_0\) come from the classical bidirectional flow.
Duration is the duration of the flow
The NetTiSA flow is implemented in the IP flow exporter ipfixprobe.
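To make the definitions above concrete, here is a minimal batch-mode Python sketch of a few of these features computed from per-packet payload lengths and timestamps (toy values; variable names are ours, and the real exporter computes these stream-wise):

import numpy as np

# Toy per-packet payload lengths and arrival times for one flow.
lengths = np.array([60.0, 1500.0, 1500.0, 60.0, 800.0])
times = np.array([0.00, 0.02, 0.05, 0.30, 0.31])

mean_len = lengths.mean()                        # Mean
rms_len = np.sqrt((lengths ** 2).mean())         # Root mean square
avg_disp = np.abs(lengths - mean_len).mean()     # Average dispersion

dt = np.diff(times)                              # time differences
tdist = np.abs(dt.mean() - dt).mean() / (0.5 * (dt.max() - dt.min()))  # Time distribution

switches = np.count_nonzero(np.diff(lengths))    # payload-length value changes
sr = switches / (0.5 * (len(lengths) - 1))       # Switching ratio
print(mean_len, rms_len, avg_disp, tdist, sr)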
Description of dataset files
The following table describes each dataset file:

File name | Detection problem | Citation of the original raw dataset
---|---|---
botnet_binary.csv | Binary detection of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014.
botnet_multiclass.csv | Multi-class classification of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014.
cryptomining_design.csv | Binary detection of cryptomining; the design part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022.
cryptomining_evaluation.csv | Binary detection of cryptomining; the evaluation part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022.
dns_malware.csv | Binary detection of malware DNS | Samaneh Mahdavifar et al. Classifying Malicious Domains using DNS Traffic Analysis. In DASC/PiCom/CBDCom/CyberSciTech 2021, pages 60–67. IEEE, 2021.
doh_cic.csv | Binary detection of DoH | Mohammadreza MontazeriShatoori et al. Detection of DoH tunnels using time-series classification of encrypted traffic. In DASC/PiCom/CBDCom/CyberSciTech 2020, pages 63–70. IEEE, 2020.
doh_real_world.csv | Binary detection of DoH | Kamil Jeřábek et al. Collection of datasets with DNS over HTTPS traffic. Data in Brief, 42:108310, 2022.
dos.csv | Binary detection of DoS | Nickolaos Koroniotis et al. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst., 100:779–796, 2019.
edge_iiot_binary.csv | Binary detection of IoT malware | Mohamed Amine Ferrag et al. Edge-IIoTset: A new comprehensive realistic cyber security dataset of IoT and IIoT applications: Centralized and federated learning, 2022.
edge_iiot_multiclass.csv | Multi-class classification of IoT malware | Mohamed Amine Ferrag et al. Edge-IIoTset: A new comprehensive realistic cyber security dataset of IoT and IIoT applications: Centralized and federated learning, 2022.
https_brute_force.csv | Binary detection of HTTPS brute force | Jan Luxemburk et al. HTTPS Brute-force dataset with extended network flows, November 2020.
ids_cic_binary.csv | Binary detection of intrusion in IDS | Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp, 1:108–116, 2018.
ids_cic_multiclass.csv | Multi-class classification of intrusion in IDS | Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp, 1:108–116, 2018.
unsw_binary.csv | Binary detection of intrusion in IDS | Nour Moustafa and Jill Slay. UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In 2015 Military Communications and Information Systems Conference (MilCIS), pages 1–6. IEEE, 2015.
unsw_multiclass.csv | Multi-class classification of intrusion in IDS | Nour Moustafa and Jill Slay. UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In 2015 Military Communications and Information Systems Conference (MilCIS), pages 1–6. IEEE, 2015.
iot_23.csv | Binary detection of IoT malware | Sebastian Garcia et al. IoT-23: A labeled dataset with malicious and benign IoT network traffic, January 2020. More details at https://www.stratosphereips.org/datasets-iot23
ton_iot_binary.csv | Binary detection of IoT malware | Nour Moustafa. A new distributed architecture for evaluating AI-based security systems at the edge: Network TON_IoT datasets. Sustainable Cities and Society, 72:102994, 2021.
ton_iot_multiclass.csv | Multi-class classification of IoT malware | Nour Moustafa. A new distributed architecture for evaluating AI-based security systems at the edge: Network TON_IoT datasets.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Virtualisation has received widespread adoption and deployment across a wide range of enterprises and industries throughout the years. Network Function Virtualisation (NFV) is a technical concept that presents a method for dynamically delivering virtualised network functions as virtualised or software components. The Virtualised Network Function (VNF) has distinct advantages, but it also faces serious security challenges: cyberattacks such as Denial of Service (DoS), malware/rootkit injection, port scanning, and so on can target VNF appliances just like any other network infrastructure. To train machine or deep learning (ML/DL) models to combat cyberattacks in VNF, a suitable dataset (VNFCYBERDATA) is required that reflects, or comes reasonably close to reflecting, the actual problem the ML/DL models will address. This article describes a real VNF dataset that contains over seven million data points and twenty-five cyberattacks generated from five VNF appliances. To facilitate a realistic examination of VNF traffic, the dataset includes both benign and malicious traffic.

Citation
If you are using this dataset for your research, please reference it as: "Ayodele, B.; Buttigieg, V. The VNF Cybersecurity Dataset for Research (VNFCYBERDATA). Data 2024, 9, 132. https://doi.org/10.3390/data9110132"

Documentation
Dataset documentation is available at: https://www.mdpi.com/2306-5729/9/11/132
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As there was no large publicly available cross-domain dataset for comparative argument mining, we created one composed of sentences, potentially annotated with BETTER / WORSE markers (the first object is better / worse than the second object) or NONE (the sentence does not contain a comparison of the target objects). The BETTER sentences stand for a pro-argument in favor of the first compared object, and WORSE sentences represent a con-argument and favor the second object. We aimed to minimize domain-specific biases in the dataset in order to capture the nature of comparison rather than the nature of the particular domains, and thus decided to control the specificity of domains through the selection of comparison targets. We hypothesized, and could confirm in preliminary experiments, that comparison targets usually have a common hypernym (i.e., are instances of the same class), which we utilized for the selection of compared object pairs. The most specific domain we chose is computer science, with comparison targets like programming languages, database products, and technology standards such as Bluetooth or Ethernet. Many computer science concepts can be compared objectively (e.g., on transmission speed or suitability for certain applications). The objects for this domain were manually extracted from "List of"-articles at Wikipedia. In the annotation process, annotators were asked to only label sentences from this domain if they had some basic knowledge in computer science. The second, broader domain is brands. It contains objects of different types (e.g., cars, electronics, and food). As brands are present in everyday life, anyone should be able to label the majority of sentences containing well-known brands such as Coca-Cola or Mercedes. Again, targets for this domain were manually extracted from "List of"-articles at Wikipedia. The third domain is not restricted to any topic: random. For each of 24 randomly selected seed words, 10 similar words were collected based on the distributional similarity API of JoBimText (http://www.jobimtext.org). Seed words were created using randomlists.com: book, car, carpenter, cellphone, Christmas, coffee, cork, Florida, hamster, hiking, Hoover, Metallica, NBC, Netflix, ninja, pencil, salad, soccer, Starbucks, sword, Tolkien, wine, wood, XBox, Yale. Especially for brands and computer science, the resulting object lists were large (4493 in brands and 1339 in computer science). In a manual inspection, low-frequency and ambiguous objects were removed from all object lists (e.g., RAID (a hardware concept) and Unity (a game engine) are also regularly used nouns). The remaining objects were combined into pairs. For each object type (seed Wikipedia list page or seed word), all possible combinations were created. These pairs were then used to find sentences containing both objects. The aforementioned approaches to selecting compared object pairs tend to minimize the inclusion of domain-specific data but do not fully solve the problem; we leave extending the dataset with more diverse object pairs, including abstract concepts, for future work. For the sentence mining, we used the publicly available index of dependency-parsed sentences from the Common Crawl corpus, containing over 14 billion English sentences filtered for duplicates. This index was queried for sentences containing both objects of each pair.
For 90% of the pairs, we also added comparative cue words (better, easier, faster, nicer, wiser, cooler, decent, safer, superior, solid, terrific, worse, harder, slower, poorly, uglier, poorer, lousy, nastier, inferior, mediocre) to the query in order to bias the selection towards comparisons, while still admitting comparisons that do not contain any of the anticipated cues. This was necessary, as random sampling would have resulted in only a very tiny fraction of comparisons. Note that even sentences containing a cue word do not necessarily express a comparison between the desired targets (dog vs. cat: He's the best pet that you can get, better than a dog or cat.). It is thus especially crucial to enable a classifier to learn not to rely on the existence of cue words only (very likely in a random sample of sentences with very few comparisons). For our corpus, we kept pairs with at least 100 retrieved sentences. From all sentences of those pairs, 2500 for each category were randomly sampled as candidates for a crowdsourced annotation that we conducted on figure-eight.com in several small batches. Each sentence was annotated by at least five trusted workers. We ranked annotations by confidence, which is the figure-eight internal measure combining annotator trust and voting, and discarded annotations with a confidence below 50%. Of all annotated items, 71% received unanimous votes, and for over 85% at least 4 out of 5 workers agreed -- rendering the collection procedure, aimed at ease of annotation, successful. The final dataset contains 7199 sentences with 271 distinct object pairs. The majority of sentences (over 72%) are non-comparative despite biasing the selection with cue words; in 70% of the comparative sentences, the favored target is named first.

You can browse through the data here: https://docs.google.com/spreadsheets/d/1U8i6EU9GUKmHdPnfwXEuBxi0h3aiRCLPRC-3c9ROiOE/edit?usp=sharing

A full description of the dataset is available in the workshop paper at the ACL 2019 conference. Please cite this paper if you use the data: Franzek, Mirco, Alexander Panchenko, and Chris Biemann. "Categorization of Comparative Sentences for Argument Mining." arXiv preprint arXiv:1809.06152 (2018).

@inproceedings{franzek2018categorization,
  title = {Categorization of Comparative Sentences for Argument Mining},
  author = {Panchenko, Alexander and Bondarenko, Alexander and Franzek, Mirco and Hagen, Matthias and Biemann, Chris},
  booktitle = {Proceedings of the 6th Workshop on Argument Mining at ACL'2019},
  year = {2019},
  address = {Florence, Italy}
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite the following paper when using this dataset:
N. Thakur, V. Su, M. Shao, K. Patel, H. Jeong, V. Knieling, and A. Bian “A labelled dataset for sentiment analysis of videos on YouTube, TikTok, and other sources about the 2024 outbreak of measles,” Proceedings of the 26th International Conference on Human-Computer Interaction (HCII 2024), Washington, USA, 29 June - 4 July 2024. (Accepted as a Late Breaking Paper, Preprint Available at: https://doi.org/10.48550/arXiv.2406.07693)
Abstract
This dataset contains the data of 4011 videos about the ongoing outbreak of measles published on 264 websites on the internet between January 1, 2024, and May 31, 2024. These websites primarily include YouTube and TikTok, which account for 48.6% and 15.2% of the videos, respectively. The remainder of the websites include Instagram and Facebook as well as the websites of various global and local news organizations. For each of these videos, the URL of the video, title of the post, description of the post, and the date of publication of the video are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis (using VADER), subjectivity analysis (using TextBlob), and fine-grain sentiment analysis (using DistilRoBERTa-base) of the video titles and video descriptions were performed. This included classifying each video title and video description into (i) one of the sentiment classes i.e. positive, negative, or neutral, (ii) one of the subjectivity classes i.e. highly opinionated, neutral opinionated, or least opinionated, and (iii) one of the fine-grain sentiment classes i.e. fear, surprise, joy, sadness, anger, disgust, or neutral. These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for performing sentiment analysis or subjectivity analysis in this field as well as for other applications. The paper associated with this dataset (please see the above-mentioned citation) also presents a list of open research questions that may be investigated using this dataset.
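As an illustration of the VADER sentiment step described above, a minimal sketch using the vaderSentiment package (the example title is invented; the ±0.05 compound thresholds are the package's conventional defaults, not necessarily the exact ones used by the authors):

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
title = "Health officials urge vaccination as measles cases rise"  # invented example
scores = analyzer.polarity_scores(title)

# Classify by the compound score using the conventional VADER thresholds.
if scores["compound"] >= 0.05:
    label = "positive"
elif scores["compound"] <= -0.05:
    label = "negative"
else:
    label = "neutral"
print(label, scores)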
https://creativecommons.org/publicdomain/zero/1.0/
Oracle Corporation is an American software and hardware manufacturer with its headquarters in Redwood City (Silicon Valley), California. The company specializes in the development and marketing of computer hardware and software for corporate customers - especially the Oracle Database system. Oracle is one of the world's largest software manufacturers in terms of sales. Oracle employs more than 138,000 people and has 430,000 customers in 175 countries. In addition to the database product Oracle Database, Oracle produces and sells the Oracle Fusion middleware as well as the JEE servers Oracle Application Server and Oracle WebLogic.
With sales of $39.5 billion and a profit of $3.7 billion, Oracle ranked 107th among the world's largest companies according to Forbes Global 2000 (as of fiscal year 2017). The company had a market cap of approximately $191 billion in mid-2018.
Market cap: $471.13 billion USD. As of November 2024, Oracle has a market cap of $471.13 billion USD, making it the world's 20th most valuable company by market cap according to our data. The market capitalization, commonly called market cap, is the total market value of a publicly traded company's outstanding shares and is commonly used to measure how much a company is worth.
Geography: USA
Time period: Jan 1996- October 2024
Unit of analysis: Oracle Stock Data 2024
Variable | Description |
---|---|
date | The trading date. |
open | The price at market open. |
high | The highest price for that day. |
low | The lowest price for that day. |
close | The price at market close, adjusted for splits. |
adj_close | The closing price after adjustments for all applicable splits and dividend distributions. Data is adjusted using appropriate split and dividend multipliers, adhering to Center for Research in Security Prices (CRSP) standards. |
volume | The number of shares traded on that day. |
This dataset belongs to me. I’m sharing it here for free. You may do with it as you wish.
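A minimal pandas sketch for working with the columns above (the file name "oracle_stock.csv" is a placeholder assumption):

import pandas as pd

# Load daily prices; "oracle_stock.csv" is a placeholder file name.
df = pd.read_csv("oracle_stock.csv", parse_dates=["date"]).set_index("date")

# Daily simple returns from the split- and dividend-adjusted close.
df["return"] = df["adj_close"].pct_change()
print(df["return"].describe())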
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Broadband Adoption and Computer Use by year, state, demographic characteristics’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/720f8c4b-7a1c-415c-9297-55904ba24840 on 26 January 2022.
--- Dataset description provided by original source is as follows ---
This dataset is imported from the US Department of Commerce, National Telecommunications and Information Administration (NTIA) and its "Data Explorer" site. The underlying data comes from the US Census
dataset: Specifies the month and year of the survey as a string, in "Mon YYYY" format. The CPS is a monthly survey, and NTIA periodically sponsors Supplements to that survey.
variable: Contains the standardized name of the variable being measured. NTIA identified the availability of similar data across Supplements, and assigned variable names to ease time-series comparisons.
description: Provides a concise description of the variable.
universe: Specifies the variable representing the universe of persons or households included in the variable's statistics. The specified variable is always included in the file. The only variables lacking universes are isPerson and isHouseholder, as they are themselves the broadest universes measured in the CPS.
A large number of *Prop, *PropSE, *Count, and *CountSE columns comprise the remainder of the columns. For each demographic being measured (see below), four statistics are produced, including the estimated proportion of the group for which the variable is true (*Prop), the standard error of that proportion (*PropSE), the estimated number of persons or households in that group for which the variable is true (*Count), and the standard error of that count (*CountSE).
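For example, a minimal pandas sketch pulling one variable's national proportion over time (the file name is a placeholder assumption; dataset, variable, and usProp are column names taken from the descriptions here, and internetUser is the variable referenced below):

import pandas as pd

# Load the NTIA data; "ntia-analyze-table.csv" is a placeholder file name.
df = pd.read_csv("ntia-analyze-table.csv")

# National proportion of Internet users for each survey month
# (the internetUser variable, usProp column).
trend = df.loc[df["variable"] == "internetUser", ["dataset", "usProp"]]
print(trend)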
DEMOGRAPHIC CATEGORIES
us: The usProp, usPropSE, usCount, and usCountSE columns contain statistics about all persons and households in the universe (which represents the population of the fifty states and the District of Columbia). For example, to see how the prevalence of Internet use by Americans has changed over time, look at the usProp column for each survey's internetUser variable.
age: The age category is divided into five ranges: ages 3-14, 15-24, 25-44, 45-64, and 65+. The CPS only includes data on Americans ages 3 and older. Also note that household reference persons must be at least 15 years old, so the age314* columns are blank for household-based variables. Those columns are also blank for person-based variables where the universe is "isAdult" (or a sub-universe of "isAdult"), as the CPS defines adults as persons ages 15 or older. Finally, note that some variables where children are technically in the universe will show zero values for the age314* columns. This occurs in cases where a variable simply cannot be true of a child (e.g. the workInternetUser variable, as the CPS presumes children under 15 are not eligible to work), but the topic of interest is relevant to children (e.g. locations of Internet use).
work: Employment status is divided into "Employed," "Unemployed," and "NILF" (Not in the Labor Force). These three categories reflect the official BLS definitions used in official labor force statistics. Note that employment status is only recorded in the CPS for individuals ages 15 and older. As a result, children are excluded from the universe when calculating statistics by work status, even if they are otherwise considered part of the universe for the variable of interest.
income: The income category represents annual family income, rather than just an individual person's income. It is divided into five ranges: below $25K, $25K-49,999, $50K-74,999, $75K-99,999, and $100K or more. Statistics by income group are only available in this file for Supplements beginning in 2010; prior to 2010, family income range is available in public use datasets, but is not directly comparable to newer datasets due to the 2010 introduction of the practice of allocating "don't know," "refused," and other responses that result in missing data. Prior to 2010, family income is unknown for approximately 20 percent of persons, while in 2010 the Census Bureau began imputing likely income ranges to replace missing data.
education: Educational attainment is divided into "No Diploma," "High School Grad,
--- Original source retains full ownership of the source dataset ---
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
We propose the Safe Human dataset, consisting of 17 different object classes, referred to as the SH17 dataset. We scraped images from the Pexels website (https://www.pexels.com/license/), which offers clear usage rights for all its images, showcasing a range of human activities across diverse industrial operations.
To extract relevant images, we used multiple queries such as manufacturing worker, industrial worker, human worker, labor, etc. The tags associated with Pexels images proved reasonably accurate. After removing duplicate samples, we obtained a dataset of 8,099 images. The dataset exhibits significant diversity, representing manufacturing environments globally, thus minimizing potential regional or racial biases. Samples of the dataset are shown below.
The data consists of three folders and two file lists:
- images contains all images
- labels contains labels in YOLO format for all images (see the parsing sketch below)
- voc_labels contains labels in VOC format for all images
- train_files.txt contains the list of all images used for training
- val_files.txt contains the list of all images used for validation
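As referenced in the list above, a minimal sketch for parsing one YOLO-format label file (standard YOLO convention: class index followed by normalized box center and size; the path is illustrative):

from pathlib import Path

# Each line of a YOLO label file is "class x_center y_center width height",
# with coordinates normalized to [0, 1]. The path is a placeholder.
boxes = []
for line in Path("labels/example.txt").read_text().splitlines():
    if not line.strip():
        continue
    cls, xc, yc, w, h = line.split()
    boxes.append((int(cls), float(xc), float(yc), float(w), float(h)))
print(boxes)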
This dataset, scraped from the Pexels website, is intended for educational, research, and analysis purposes only. The data may be used for training machine learning models only. Users are urged to use this data responsibly, ethically, and within the bounds of legal stipulations.
Legal Simplicity: All photos and videos on Pexels can be downloaded and used for free.
The dataset is provided "as is," without warranty, and the creator disclaims any legal liability for its use by others.
Users are encouraged to consider the ethical implications of their analyses and the potential impact on the broader community.
https://github.com/ahmadmughees/SH17dataset
@misc{ahmad2024sh17datasethumansafety,
title={SH17: A Dataset for Human Safety and Personal Protective Equipment Detection in Manufacturing Industry},
author={Hafiz Mughees Ahmad and Afshin Rahimi},
year={2024},
eprint={2407.04590},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2407.04590},
}
https://creativecommons.org/publicdomain/zero/1.0/
Akcora, C.G., Li, Y., Gel, Y.R. and Kantarcioglu, M., 2019. BitcoinHeist: Topological Data Analysis for Ransomware Detection on the Bitcoin Blockchain. IJCAI-PRICAI 2020.
We have downloaded and parsed the entire Bitcoin transaction graph from January 2009 to December 2018. Using a time interval of 24 hours, we extracted daily transactions on the network and formed the Bitcoin graph. We filtered out the network edges that transfer less than B0.3, since ransom amounts are rarely below this threshold.
Ransomware addresses are taken from three widely adopted studies: Montreal, Princeton and Padua. Please see the BitcoinHeist article for references.
On the heterogeneous Bitcoin network, in each 24-hour snapshot we extract the following six features for an address: income, neighbors, weight, length, count, loop.
In 24 ransomware families, at least one address appears in more than one 24-hour time window. CryptoLocker has 13 addresses that appear more than 100 times each. The CryptoLocker address 1LXrSb67EaH1LGc6d6kWHq8rgv4ZBQAcpU appears a maximum of 420 times. Four addresses have conflicting ransomware labels between the Montreal and Padua datasets. The APT (Montreal) and Jigsaw (Padua) ransomware families have two and one P2SH addresses (which start with '3'), respectively. All other addresses are ordinary addresses that start with '1'.
address: String. Bitcoin address.
year: Integer. Year.
day: Integer. Day of the year. 1 is the first day, 365 is the last day.
length: Integer.
weight: Float.
count: Integer.
looped: Integer.
neighbors: Integer.
income: Integer. Satoshi amount (1 bitcoin = 100 million satoshis).
label: Category String. Name of the ransomware family (e.g., Cryptxxx, CryptoLocker, etc.) or white (i.e., not known to be ransomware).
Our graph features are designed to quantify specific transaction patterns. Loop is intended to count how many transactions i) split their coins, ii) move these coins through the network using different paths, and finally iii) merge them in a single address. Coins at this final address can then be sold and converted to fiat currency. Weight quantifies the merge behavior (i.e., the transaction has more input addresses than output addresses), where coins in multiple addresses are each passed through a succession of merging transactions and accumulated in a final address. Similar to weight, the count feature is designed to quantify the merging pattern. However, the count feature represents information on the number of transactions, whereas the weight feature represents information on the amount (i.e., what percentage of these transactions' output) involved. Length is designed to quantify mixing rounds on Bitcoin, where transactions receive and distribute similar amounts of coins in multiple rounds with newly created addresses to hide the coin origin.
White Bitcoin addresses are capped at 1K per day (Bitcoin has 800K addresses daily).
Note that although we are certain about ransomware labels, we do not know if all white addresses are in fact not related to ransomware.
When compared to non-ransomware addresses, ransomware addresses exhibit more profound right skewness in distributions of feature values.
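To check the skewness claim above, a minimal pandas sketch (the file name is a placeholder assumption; column names follow the field list above):

import pandas as pd

# Load BitcoinHeist records; "bitcoinheist.csv" is a placeholder file name.
df = pd.read_csv("bitcoinheist.csv")

# Compare right-skewness of the income feature for ransomware vs. white addresses.
ransom_skew = df.loc[df["label"] != "white", "income"].skew()
white_skew = df.loc[df["label"] == "white", "income"].skew()
print(f"ransomware income skew: {ransom_skew:.2f}, white income skew: {white_skew:.2f}")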
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Overview: Three new datasets available here represent normal household areas with common objects - lounge, kitchen and garden - with varying trajectories.

Description:
Lounge: The lounge dataset with common household objects.
Lounge_oc: The lounge dataset with object occlusions near the end of the trajectory.
Kitchen: The kitchen dataset with common household objects.
Kitchen_oc: The kitchen dataset with object occlusions near the end of the trajectory.
Garden: The garden dataset with common household objects.
Garden_oc: The garden dataset with object occlusions near the end of the trajectory.
convert.py: Python script to convert a video file into jpgs.

Paper: The datasets were used for the paper "SymbioLCD: Ensemble-Based Loop Closure Detection using CNN-Extracted Objects and Visual Bag-of-Words", accepted at the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems.

Abstract: Loop closure detection is an essential tool of Simultaneous Localization and Mapping (SLAM) to minimize drift in its localization. Many state-of-the-art loop closure detection (LCD) algorithms use visual Bag-of-Words (vBoW), which is robust against partial occlusions in a scene but cannot perceive the semantics or spatial relationships between feature points. CNN object extraction can address those issues by providing semantic labels and spatial relationships between objects in a scene. Previous work has mainly focused on replacing vBoW with CNN-derived features. In this paper we propose SymbioLCD, a novel ensemble-based LCD that utilizes both CNN-extracted objects and vBoW features for LCD candidate prediction. When used in tandem, the added elements of object semantics and spatial awareness create a more robust and symbiotic loop closure detection system. The proposed SymbioLCD uses scale-invariant spatial and semantic matching, Hausdorff distance with temporal constraints, and a Random Forest that utilizes combined information from both CNN-extracted objects and vBoW features for predicting accurate loop closure candidates. Evaluation of the proposed method shows it outperforms other Machine Learning (ML) algorithms - such as SVM, Decision Tree and Neural Network - and demonstrates that there is a strong symbiosis between CNN-extracted object information and vBoW features which assists accurate LCD candidate prediction. Furthermore, it is able to perceive loop closure candidates earlier than state-of-the-art SLAM algorithms, utilizing added spatial and semantic information from CNN-extracted objects.

Citation: Please use the bibtex below for citing the paper:
@inproceedings{kim2021symbiolcd,
  title = {SymbioLCD: Ensemble-Based Loop Closure Detection using CNN-Extracted Objects and Visual Bag-of-Words},
  author = {Jonathan Kim and Martin Urschler and Pat Riddle and J\"{o}rg Wicker},
  year = {2021},
  date = {2021-09-27},
  booktitle = {IEEE/RSJ International Conference on Intelligent Robots and Systems},
  pubstate = {forthcoming},
  tppubtype = {inproceedings}
}
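For reference, a hedged OpenCV sketch of what convert.py presumably does (the actual script and file names may differ):

import cv2  # pip install opencv-python

# Convert a video file into numbered JPG frames; "lounge.mp4" is a placeholder.
cap = cv2.VideoCapture("lounge.mp4")
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imwrite(f"frame_{idx:06d}.jpg", frame)
    idx += 1
cap.release()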