Demo to save data from a Space to a Dataset. Goal is to provide reusable snippets of code.
Documentation: https://huggingface.co/docs/huggingface_hub/main/en/guides/upload#scheduled-uploads Space: https://huggingface.co/spaces/Wauplin/space_to_dataset_saver/ JSON dataset: https://huggingface.co/datasets/Wauplin/example-space-to-dataset-json Image dataset: https://huggingface.co/datasets/Wauplin/example-space-to-dataset-image Image (zipped) dataset:… See the full description on the dataset page: https://huggingface.co/datasets/Wauplin/example-space-to-dataset-json.
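The pattern the demo implements can be sketched as follows. This is a minimal local illustration, not the Space's actual code: records are appended to a JSON-lines file inside a folder; in a real Space, huggingface_hub's CommitScheduler (covered in the linked scheduled-uploads guide) would periodically commit that folder to the dataset repo.

```python
import json
from pathlib import Path
from threading import Lock

# A CommitScheduler would read this folder concurrently in a real Space,
# so writes are guarded with a lock. File and field names are illustrative.
_lock = Lock()

def save_record(path: Path, record: dict) -> None:
    """Append one record as a JSON line to the data file."""
    with _lock:
        with path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

# Example: persist one user submission
data_file = Path("data") / "train.jsonl"
data_file.parent.mkdir(parents=True, exist_ok=True)
save_record(data_file, {"input": "hello", "label": "greeting"})
```

Appending one JSON object per line keeps each write atomic enough for periodic uploads and lets the dataset viewer parse partial files.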
PromptCloud offers specialized data extraction services for eCommerce businesses, focusing on acquiring detailed product and customer review datasets from a variety of eCommerce websites. This service is instrumental for businesses aiming to refine their eCommerce strategies through in-depth market analysis, competitive research, and enhanced customer insights.
Customization is a key aspect of PromptCloud's offerings. PromptCloud provides bespoke scraping services, tailored to the unique requirements of each business. This adaptability is especially beneficial for companies seeking a competitive advantage in the dynamic eCommerce market. A distinctive feature of PromptCloud's approach is the provision of a free sample, allowing potential clients to experience the quality and accuracy of their data firsthand. This commitment to quality is reflected in their use of advanced technologies that ensure the delivery of precise, up-to-date data.
PromptCloud's versatility extends to data delivery, offering various formats like JSON, CSV, and XML. This flexibility facilitates seamless integration of data into different business systems, highlighting their focus on creating user-friendly and effective solutions.
PromptCloud positions itself as a vital resource for eCommerce businesses looking to utilize data for strategic planning and customer understanding. Their tailored scraping services, combined with a commitment to delivering current and accurate data, make PromptCloud the best option for businesses seeking to improve their market presence and deepen their understanding of customer behavior.
We are committed to putting data at the heart of your business. Reach out for a no-frills PromptCloud experience: professional, technologically advanced, and reliable.
This dataset contains resources transformed from other datasets on HDX. They exist here only in a format modified to support visualization on HDX and may not be as up to date as the source datasets from which they are derived.
Source datasets: https://data.hdx.rwlabs.org/dataset/idps-data-by-region-in-mali
CSV output from https://github.com/marks/health-insurance-marketplace-analytics/blob/master/flattener/flatten_from_index.py
taichi256/example-space-to-dataset-json dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset contains the articles published in the Covid-19 FAQ for companies by the Directorate-General for Enterprises at https://info-entreprises-covid19.economie.fr The data are presented in JSON format as follows:

[
  {
    "title": "Example article for documentation",
    "content": [
      "this is the first page of the article.",
      "here the second,",
      "<div>these articles incorporate some HTML formatting</div>"
    ],
    "path": ["File to visit in the FAQ", "to join the article"]
  },
  ...
]

The update is done every day at 6:00 UTC. This data is extracted directly from the site; the source code of the script used to extract the data is available at https://github.com/chrnin/docCovidDGE
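The format described above (a list of objects with "title", "content", and "path" fields) can be read with the standard library. The inline sample below stands in for the downloaded file:

```python
import json

# Hedged sketch: parse the FAQ export and list article titles by their
# navigation path. The sample mirrors the structure in the description.
sample = '''
[
  {
    "title": "Example article for documentation",
    "content": ["first page of the article", "<div>second page with HTML</div>"],
    "path": ["File to visit in the FAQ", "to join the article"]
  }
]
'''

articles = json.loads(sample)
for article in articles:
    # "path" gives the FAQ navigation trail, "content" the per-page bodies
    print(" / ".join(article["path"]), "->", article["title"])
```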
https://crawlfeeds.com/privacy_policy
We have successfully extracted a comprehensive news dataset from CNBC, covering not only financial updates but also an extensive range of news categories relevant to diverse audiences in Europe, the US, and the UK. This dataset includes over 500,000 records, meticulously structured in JSON format for seamless integration and analysis.
This extensive extraction spans multiple segments, such as:
Each record in the dataset is enriched with metadata tags, enabling precise filtering by region, sector, topic, and publication date.
The comprehensive news dataset provides real-time insights into global developments, corporate strategies, leadership changes, and sector-specific trends. Designed for media analysts, research firms, and businesses, it empowers users to perform:
Additionally, the JSON format ensures easy integration with analytics platforms for advanced processing.
Looking for a rich repository of structured news data? Visit our news dataset collection to explore additional offerings tailored to your analysis needs.
To get a preview, check out the CSV sample of the CNBC economy articles dataset.
CSV output from https://github.com/marks/health-insurance-marketplace-analytics/blob/master/flattener/flatten_from_index.py
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
Many types of data from genomic analyses can be represented as genomic tracks, i.e. features linked to the genomic coordinates of a reference genome. Examples of such data are epigenetic DNA methylation data, ChIP-seq peaks, germline or somatic DNA variants, or RNA-seq expression levels. Researchers often face difficulties in locating, accessing and combining relevant tracks from external sources, as well as locating the raw data, reducing the value of the generated information.
FAIRtracks software ecosystem
We have, as an output of the ELIXIR Implementation Study "FAIRification of Genomic Tracks", developed a basic set of recommendations for genomic track metadata together with an implementation called FAIRtracks in the form of a JSON Schema. We propose FAIRtracks as a draft standard for genomic track metadata in order to advance the application of FAIR data principles (Findable, Accessible, Interoperable, and Reusable). We have demonstrated practical usage of this approach by designing a software ecosystem around the FAIRtracks draft standard, integrating globally identifiable metadata from various track hubs in the Track Hub Registry and other relevant repositories into a novel track search service, called TrackFind. The software ecosystem also includes the FAIRtracks augmentation service, which assists metadata producers by automatically augmenting minimal machine-readable metadata with their human-readable counterparts, as well as the FAIRtracks validation service, which extends basic JSON Schema validation to include FAIR-related features (global identifiers, ontology terms, and object references). Finally, we have implemented track metadata search and import functionality into relevant analytical tools: EPICO and the GSuite HyperBrowser. For an overview of the FAIRtracks software ecosystem, please visit: http://fairtracks.github.io/
Example FAIRtracks JSON document - augmented
The "Example FAIRtracks JSON document - augmented" is generated as part of the build process of the FAIRtracks draft standard JSON Schema (source code: https://github.com/fairtracks/fairtracks_standard/). The example FAIRtracks document contains a small selection of tracks and objects from the ENCODE project metadata (https://www.encodeproject.org/), adapted to align with the FAIRtracks draft standard. In addition to being available in the above-mentioned GitHub repository, the "Example FAIRtracks JSON document - augmented" is also published here on Zenodo in order for the document to be globally uniquely identifiable by a Digital Object Identifier (DOI).
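The validation service mentioned above extends plain JSON Schema checks with FAIR-related features such as global identifiers. As a simplified illustration (not the FAIRtracks validator itself, and with a hypothetical field name), a check of that kind might look like:

```python
import re

# Assumed, illustrative pattern for a DOI-style global identifier.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def check_global_id(record: dict, field: str = "doc_doi") -> list:
    """Return a list of validation errors for the record's DOI field.

    The field name "doc_doi" is a placeholder, not a FAIRtracks key.
    """
    errors = []
    value = record.get(field)
    if value is None:
        errors.append(f"missing required field: {field}")
    elif not DOI_PATTERN.match(value):
        errors.append(f"{field} is not a valid DOI: {value!r}")
    return errors

print(check_global_id({"doc_doi": "10.1234/example"}))  # []
```

The real service also resolves ontology terms and object references, which a structural schema validator alone cannot express.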
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accessing data in structured formats such as XML, CSV and JSON in statically typed languages is difficult, because the languages do not understand the structure of the data. Dynamically typed languages make this syntactically easier, but lead to error-prone code. Despite numerous efforts, most of the data available on the web do not come with a schema. The only information available to developers is a set of examples, such as typical server responses. We describe an inference algorithm that infers a type of structured formats including CSV, XML and JSON. The algorithm is based on finding a common supertype of types representing individual samples (or values in collections). We use the algorithm as a basis for an F# type provider that integrates the inference into the F# type system. As a result, users can access CSV, XML and JSON data in a statically-typed fashion just by specifying a representative sample document.
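The core idea of the abstract, finding a common supertype of the types inferred from individual samples, can be sketched in a few lines. This is a simplified Python illustration of the principle (the paper's implementation is an F# type provider, and its type lattice is richer than the one shown here):

```python
# Infer a simple structural type for a JSON-like value.
def infer(value):
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, list):
        if not value:
            return ("list", "any")
        t = infer(value[0])
        for item in value[1:]:
            t = supertype(t, infer(item))
        return ("list", t)
    if isinstance(value, dict):
        return ("record", {k: infer(v) for k, v in value.items()})
    return "any"

def supertype(a, b):
    """Least common supertype of two inferred types (simplified)."""
    if a == b:
        return a
    if isinstance(a, tuple) and isinstance(b, tuple) and a[0] == b[0] == "record":
        merged = {}
        for k in set(a[1]) | set(b[1]):
            if k in a[1] and k in b[1]:
                merged[k] = supertype(a[1][k], b[1][k])
            else:
                # a field missing from one sample becomes optional
                merged[k] = ("optional", a[1].get(k) or b[1].get(k))
        return ("record", merged)
    return "any"

samples = [{"name": "Ada", "age": 36}, {"name": "Alan"}]
print(supertype(infer(samples[0]), infer(samples[1])))
```

Given the two samples, the inferred common type is a record whose "name" is a string and whose "age" is an optional number, which is exactly the shape a statically typed accessor needs.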
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A JSON file used as an example to illustrate queries and to benchmark some tools.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Prototype of TrainingDML-AI JSON schema and examples in the paper "Towards an interoperable training data markup language for artificial intelligence in earth observation"
This dataset comprises behavioural data recorded from 61 children diagnosed with Autism Spectrum Disorders (ASD). The data was collected during a large-scale evaluation of Robot Enhanced Therapy (RET). The dataset covers over 3000 therapy sessions and more than 300 hours of therapy. Half of the children interacted with the social robot NAO supervised by a therapist. The other half, constituting a control group, interacted directly with a therapist. Both groups followed the Applied Behavior Analysis (ABA) protocol. Each session was recorded with three RGB cameras and two RGBD (Kinect) cameras, providing detailed information about children's behaviour during therapy. This public release of the dataset does not include video recordings or other personal information. Instead, it comprises body motion, head position and orientation, and eye gaze variables, all specified as 3D data in a joint frame of reference. In addition, metadata including participant age, gender, and autism diagnosis (ADOS) variables are included. All data in this dataset is stored in JavaScript Object Notation (JSON) and can be downloaded here as DREAMdataset.zip. A much smaller archive comprising example data recorded from a single session is provided in DREAMdata-example.zip. The JSON format is specified in detail by the JSON Schema (dream.1.1.json) provided with this dataset. JSON data can be read using standard libraries in most programming languages. Basic instructions on how to load and plot the data using Python and Jupyter are available in DREAMdata-documentation.zip, attached with this dataset. Please refer to https://github.com/dream2020/data for more details. The DREAM Dataset can be visualized using the DREAM Data Visualizer, an open-source software available at https://github.com/dream2020/DREAM-data-visualizer. The DREAM RET System that was used for collecting this dataset is available at https://github.com/dream2020/DREAM.
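As noted above, the session files are plain JSON and can be loaded with standard libraries. A minimal sketch in Python follows; the keys used below are illustrative placeholders, and the authoritative structure is given by the dream.1.1.json schema distributed with the dataset:

```python
import json
import tempfile

def load_session(path):
    """Read one session file from the extracted archive."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Synthetic stand-in for one extracted session file; real files come from
# DREAMdataset.zip or DREAMdata-example.zip.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump({"participant": {"age_months": 60}, "frames": []}, tmp)

session = load_session(tmp.name)
print(session["participant"]["age_months"])
```

The bundled DREAMdata-documentation.zip shows the project's own loading and plotting workflow in Jupyter.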
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example Microscopy Metadata JSON files produced using the Micro-Meta App documenting an example raw-image file acquired using the custom-built TIRF Epifluorescence Structured Illumination Microscope.
For this use case, which is presented in Figure 5 of Rigano et al., 2021, Micro-Meta App was utilized to document:
1) The Hardware Specifications of the custom-built TIRF Epifluorescence Structured light Microscope (TESM; Navaroli et al., 2010), developed on the basis of an Olympus IX71 microscope stand and owned by the Biomedical Imaging Group (http://big.umassmed.edu/) at the Program in Molecular Medicine of the University of Massachusetts Medical School. Because TESM was custom-built, the most appropriate documentation level is Tier 3 (Manufacturing/Technical Development/Full Documentation) as specified by the 4DN-BINA-OME Microscopy Metadata model (Hammer et al., 2021).
The TESM Hardware Specifications are stored in: Rigano et al._Figure 5_UseCase_Biomedical Imaging Group_TESM.JSON
2) The Image Acquisition Settings that were applied to the TESM microscope for the acquisition of an example image (FSWT-6hVirus-10minFIX-stk_4-EPI.tif.ome.tif) obtained by Nicholas Vecchietti and Caterina Strambio-De-Castillia. For this image, TZM-bl human cells were infected with HIV-1 retroviral three-part vector (FSWT+PAX2+pMD2.G). Six hours post-infection cells were fixed for 10 min with 1% formaldehyde in PBS, and permeabilized. Cells were stained with mouse anti-p24 primary antibody followed by DyLight488-anti-Mouse secondary antibody, to detect HIV-1 viral Capsid. In addition, cells were counterstained using rabbit anti-Lamin B1 primary antibody followed by DyLight649-anti-Rabbit secondary antibody, to visualize the nuclear envelope and with DAPI to visualize the nuclear chromosomal DNA.
The Image Acquisition Settings used to acquire the FSWT-6hVirus-10minFIX-stk_4-EPI.tif.ome.tif image are stored in: Rigano et al._Figure 5_UseCase_AS_fswt-6hvirus-10minfix-stk_4-epi.tif.JSON
Instructional video tutorials on how to use these example data files:
Use these videos to get started with using Micro-Meta App after downloading the example data files available here.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Contains counts of the 50 most common hashtags on Twitter from 2013 through 2016, sampled each hour. The counts were derived from a 10% sample of all tweets.
In working on Unicode implementations, it is often useful to access the full content of the Unicode Character Database (UCD). For example, in establishing mappings from characters to glyphs in fonts, it is convenient to see the character scalar value, the character name, the character East Asian width, along with the shape and metrics of the proposed glyph to map to; looking at all this data simultaneously helps in evaluating the mapping.
This is a machine-readable version of the Unicode Character Database in JSON format.
The majority of information about individual codepoints is represented using properties. Each property, except for the Special_Case_Condition and Name_Alias properties, is represented by an attribute. In an XML data file, the absence of an attribute (possibly only on some code points) means that the document does not express the value of the corresponding property. Conversely, the presence of an attribute is an expression of the corresponding property value; the implied null value is represented by the empty string.
The Name_Alias property is represented by zero or more name-alias child elements. Unlike the situation for properties represented by attributes, it is not possible to determine whether all of the aliases have been represented in a data file by inspecting that data file.
The name of an attribute is the abbreviated name of the property as given in the file PropertyAliases.txt in version 6.1.0 of the UCD. For the Unihan properties, the name is that given in the various versions of the Unihan database (some properties are no longer present in version 6.1.0).
For catalog and enumerated properties, the values are those listed in the file PropertyValueAliases.txt in version 6.1.0 of the UCD; if there is an abbreviated name, it is used, otherwise the long name is used. Note that the set of possible values for a property captured in this schema may change from one version to the next.
The following properties are associated with code points:
For additional information, please consult the full documentation on the Unicode website.
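An illustrative lookup against a machine-readable UCD export might look like the following. The exact JSON layout of this dataset is not reproduced here; the fragment below assumes a simple mapping from code point to abbreviated property names (e.g. "na" for Name, "ea" for East_Asian_Width, "gc" for General_Category, as listed in PropertyAliases.txt):

```python
import json

# Tiny stand-in fragment with the assumed layout: hex code point -> properties.
ucd_fragment = json.loads('''
{
  "0041": {"na": "LATIN CAPITAL LETTER A", "ea": "Na", "gc": "Lu"},
  "3042": {"na": "HIRAGANA LETTER A", "ea": "W", "gc": "Lo"}
}
''')

def east_asian_width(cp: int) -> str:
    """Return the East_Asian_Width property for a code point, if present."""
    entry = ucd_fragment.get(f"{cp:04X}", {})
    return entry.get("ea", "")

print(east_asian_width(0x3042))  # "W" in this sample fragment
```

Combining such lookups with glyph metrics is exactly the mapping-evaluation workflow described above.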
Copyright © 1991-2017 Unicode, Inc. All rights reserved. Distributed under the Terms of Use in http://www.unicode.org/copyright.html.
Permission is hereby granted, free of charge, to any person obtaining a copy of the Unicode data files and any associated documentation (the "Data Files") or Unicode software and any associated documentation (the "Software") to deal in the Data Files or Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sell copies of the Data Files or Software, and to permit persons to whom the Data Files or Software are furnished to do so, provided that either (a) this copyright and permission notice appear with all copies of the Data Files or Software, or (b) this copyright and permission notice appear in associated Documentation.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We offer two data formats: a richer dataset is provided in the JSON format, which is organised by the directory structure of the Git repository. JSON supports more hierarchical or nested information such as subjects. We also provide CSVs of flattened data, which is less comprehensive but perhaps easier to grok. The CSVs provide a good introduction to the overall contents of the Tate metadata and create opportunities for artistic pivot tables.

JSON

Artists: Each artist has his or her own JSON file. They are found in the artists folder, then filed away by first letter of the artist's surname.

Artworks: Artworks are found in the artworks folder. They are filed away by accession number. This is the unique identifier given to artworks when they come into the Tate collection. In many cases, the format has significance. For example, the ar accession number prefix indicates that the artwork is part of the ARTIST ROOMS collection. The n prefix indicates works that once were part of the National Gallery collection.

CSV

There is one CSV file for artists (artist_data.csv) and one (very large) for artworks (artwork_data.csv), which we may one day break up into more manageable chunks. The CSV headings should be helpful. Let us know if not. Entrepreneurial hackers could use the CSVs as an index to the JSON collections if they wanted richer data.

Usage guidelines for open data

These usage guidelines are based on goodwill. They are not a legal contract, but Tate requests that you follow these guidelines if you use Metadata from our Collection dataset. The Metadata published by Tate is available free of restrictions under the Creative Commons Zero Public Domain Dedication. This means that you can use it for any purpose without having to give attribution. However, Tate requests that you actively acknowledge and give attribution to Tate wherever possible. Attribution supports future efforts to release other data.
It also reduces the amount of 'orphaned data', helping retain links to authoritative sources.

Give attribution to Tate

Make sure that others are aware of the rights status of Tate and are aware of these guidelines by keeping intact links to the Creative Commons Zero Public Domain Dedication. If for technical or other reasons you cannot include all the links to all sources of the Metadata and rights information directly with the Metadata, you should consider including them separately, for example in a separate document that is distributed with the Metadata or dataset. If for technical or other reasons you cannot include all the links to all sources of the Metadata and rights information, you may consider linking only to the Metadata source on Tate's website, where all available sources and rights information can be found, including in machine-readable formats.

Metadata is dynamic

When working with Metadata obtained from Tate, please be aware that this Metadata is not static. It sometimes changes daily. Tate continuously updates its Metadata in order to correct mistakes and include new and additional information. Museum collections are under constant study and research, and new information is frequently added to objects in the collection.

Mention your modifications of the Metadata and contribute your modified Metadata back

Whenever you transform, translate or otherwise modify the Metadata, make it clear that the resulting Metadata has been modified by you. If you enrich or otherwise modify Metadata, consider publishing the derived Metadata without reuse restrictions, preferably via the Creative Commons Zero Public Domain Dedication.

Be responsible

Ensure that you do not use the Metadata in a way that suggests any official status or that Tate endorses you or your use of the Metadata, unless you have prior permission to do so.
Ensure that you do not mislead others or misrepresent the Metadata or its sources. Ensure that your use of the Metadata does not breach any applicable national legislation, notably concerning (but not limited to) data protection, defamation or copyright. Please note that you use the Metadata at your own risk. Tate offers the Metadata as-is and makes no representations or warranties of any kind concerning any Metadata published by Tate. The writers of these guidelines are deeply indebted to the Smithsonian Cooper-Hewitt, National Design Museum; and Europeana.
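The directory layout described above (artists filed by surname initial, artworks by accession number) can be navigated with a short helper. This is a hedged sketch: the root folder name and the exact file-naming scheme are assumptions, so check the repository itself for the precise convention:

```python
from pathlib import Path

def artist_dir(surname: str, root: Path = Path("collection")) -> Path:
    """Directory holding an artist's JSON file, filed by surname initial."""
    return root / "artists" / surname[0].lower()

def artwork_dir(accession: str, root: Path = Path("collection")) -> Path:
    """Directory for an artwork, filed here by accession-number prefix.

    As the description notes, the "ar" prefix marks ARTIST ROOMS works and
    "n" marks works once in the National Gallery collection.
    """
    prefix = "".join(ch for ch in accession if ch.isalpha()).lower()
    return root / "artworks" / prefix

print(artist_dir("Turner"))    # e.g. collection/artists/t
print(artwork_dir("AR00001"))  # e.g. collection/artworks/ar
```

Pairing this with the flattened CSVs, used as an index into the JSON files, matches the workflow the description suggests for "entrepreneurial hackers".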
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Three data sets providing assistance and guidance for data processing using Kmasker plants. We also compare Kmasker plants functionality with the KAT tool; this DOI holds the input and output data of that analysis. Example_1 uses an Aegilops speltoides dataset, and the results show that the tested repeat sequences have B chromosome origin. Example_2 uses the winter barley specific gene VRN-H2, and the results show that it is absent in the spring barley cultivar Morex. Example_3 uses the full barley gene set and compares winter and spring barley presence/absence. Related commands and updates of this tutorial are provided on GitHub in the tutorial section of Kmasker plants. For the most recent version of this tutorial, please have a look at the project page (https://github.com/tschmutzer/kmasker).
The Forager.ai Global Dataset is a leading source of firmographic data, backed by advanced AI and offering the highest refresh rate in the industry.
| Volume and Stats |
| Use Cases |
Sales Platforms, ABM and Intent Data Platforms, Identity Platforms, Data Vendors:
Example applications include:
Uncover trending technologies or tools gaining popularity.
Pinpoint lucrative business prospects by identifying similar solutions utilized by a specific company.
Study a company's tech stacks to understand the technical capability and skills available within that company.
B2B Tech Companies:
Venture Capital and Private Equity:
| Delivery Options |
Our dataset provides a unique blend of volume, freshness, and detail that is perfect for Sales Platforms, B2B Tech, VCs & PE firms, Marketing Automation, ABM & Intent. It stands as a cornerstone in our broader data offering, ensuring you have the information you need to drive decision-making and growth.
Tags: Company Data, Company Profiles, Employee Data, Firmographic Data, AI-Driven Data, High Refresh Rate, Company Classification, Private Market Intelligence, Workforce Intelligence, Public Companies.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
AIT Log Data Sets
This repository contains synthetic log data suitable for evaluation of intrusion detection systems, federated learning, and alert aggregation. A detailed description of the dataset is available in [1]. The logs were collected from eight testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by [2]. Please cite these papers if the data is used for academic publications.
In brief, each of the datasets corresponds to a testbed representing a small enterprise network including mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise over a time span of 4-6 days. At some point, a sequence of attack steps is launched against the network. Log data is collected from all hosts and includes Apache access and error logs, authentication logs, DNS logs, VPN logs, audit logs, Suricata logs, network traffic packet captures, horde logs, exim logs, syslog, and system monitoring logs. Separate ground truth files are used to label events that are related to the attacks. Compared to the AIT-LDSv1.1, a more complex network and diverse user behavior is simulated, and logs are collected from all hosts in the network. If you are only interested in network traffic analysis, we also provide the AIT-NDS containing the labeled netflows of the testbed networks. We also provide the AIT-ADS, an alert data set derived by forensically applying open-source intrusion detection systems on the log data.
The datasets in this repository have the following structure:
The following table summarizes relevant properties of the datasets:
The following attacks are launched in the network:
Note that attack parameters and their execution orders vary in each dataset. Labeled log files are trimmed to the simulation time to ensure that their labels (which reference the related event by the line number in the file) are not misleading. Other log files, however, also contain log events generated before or after the simulation time and may therefore be affected by testbed setup or data collection. It is therefore recommended to only consider logs with timestamps within the simulation time for analysis.
The structure of labels is explained using the audit logs from the intranet server in the russellmitchell data set as an example in the following. The first four labels in the labels/intranet_server/logs/audit/audit.log file are as follows:
{"line": 1860, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
{"line": 1861, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
{"line": 1862, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
{"line": 1863, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
Each JSON object in this file assigns a label to one specific log line in the corresponding log file located at gather/intranet_server/logs/audit/audit.log. The field "line" in the JSON objects specifies the line number of the respective event in the original log file, while the field "labels" comprises the corresponding labels. For example, the lines in the sample above provide the information that lines 1860-1863 in the gather/intranet_server/logs/audit/audit.log file are labeled with "attacker_change_user" and "escalate", corresponding to the attack step where the attacker receives escalated privileges. Inspecting these lines shows that they indeed correspond to the user authenticating as root:
type=USER_AUTH msg=audit(1642999060.603:2226): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:authentication acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
type=USER_ACCT msg=audit(1642999060.603:2227): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:accounting acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
type=CRED_ACQ msg=audit(1642999060.615:2228): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:setcred acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
type=USER_START msg=audit(1642999060.627:2229): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:session_open acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
The same applies to all other labels for this log file and all other log files. There are no labels for logs generated by "normal" (i.e., non-attack) behavior; instead, all log events that have no corresponding JSON object in one of the files from the labels directory, such as the lines 1-1859 in the example above, can be considered to be labeled as "normal". This means that in order to figure out the labels for the log data it is necessary to store the line numbers when processing the original logs from the gather directory and see if these line numbers also appear in the corresponding file in the labels directory.
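The matching procedure just described can be sketched as follows. The in-memory samples below stand in for a labels file and its corresponding log file from the gather/ directory; in practice you would read both from disk:

```python
import json

# One label object per line, as in labels/<host>/logs/audit/audit.log
label_lines = [
    '{"line": 2, "labels": ["attacker_change_user", "escalate"]}',
]
# The corresponding raw log from gather/<host>/logs/audit/audit.log
log_lines = [
    "type=USER_LOGIN ... (normal event)",
    'type=USER_AUTH ... exe="/bin/su" ... (attack event)',
]

# Map line number -> labels; anything absent from the map counts as "normal"
labels_by_line = {}
for raw in label_lines:
    obj = json.loads(raw)
    labels_by_line[obj["line"]] = obj["labels"]

for number, event in enumerate(log_lines, start=1):
    tags = labels_by_line.get(number, ["normal"])
    print(number, tags, event)
```

As the description notes, line numbers must be tracked while streaming the gather/ logs, since the labels reference events only by position.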
Besides the attack labels, a general overview of the exact times when specific attack steps are launched is available in gather/attacker_0/logs/attacks.log. An enumeration of all hosts and their IP addresses is stated in processing/config/servers.yml. Moreover, configurations of each host are provided in gather/ and gather/.
Version history:
Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU projects GUARD (833456) and PANDORA (SI2.835928).
If you use the dataset, please cite the following publications:
[1] M. Landauer, F. Skopik, M. Frank, W. Hotwagner,