22 datasets found
  1. Microsoft Bing Search For Corona Virus Intent

    • kaggle.com
    zip
    Updated Jan 24, 2021
    Cite
    Saurabh Shahane (2021). Microsoft Bing Search For Corona Virus Intent [Dataset]. https://www.kaggle.com/saurabhshahane/microsoft-bing-search-for-corona-virus-intent
    Explore at:
zip (64939376 bytes)
    Dataset updated
    Jan 24, 2021
    Authors
    Saurabh Shahane
    Description

    Context

This dataset was curated from Bing search logs (desktop users only) over the period from January 1st, 2020 to (Current Month - 1). Only searches that were issued many times by multiple users were included. The dataset includes queries from all over the world that had an intent related to the Coronavirus or Covid-19. In some cases this intent is explicit in the query itself, e.g. “Coronavirus updates Seattle”; in other cases it is implicit, e.g. “Shelter in place”. The implicit intent of search queries (e.g. “Toilet paper”) was extracted using the random-walks-on-the-click-graph approach outlined in the following paper by Nick Craswell et al. at Microsoft Research: https://www.microsoft.com/en-us/research/wp-content/uploads/2007/07/craswellszummer-random-walks-sigir07.pdf. All personal data was removed. Source: https://msropendata.com/datasets/c5031874-835c-48ed-8b6d-31de2dad0654

    Acknowledgements

    Data Source: Bing Coronavirus Query set (https://github.com/microsoft/BingCoronavirusQuerySet)

    License - Open Use of Data Agreement v1.0

    Content

    Inside the data folder there is a folder 2020 (for the year) which contains two kinds of files.

QueriesByCountry_DateRange.tsv : A tab-separated text file that contains queries with Coronavirus intent, by Country.

    QueriesByState_DateRange.tsv : A tab-separated text file that contains queries with Coronavirus intent, by State.

QueriesByCountry

    Date : string, Date on which the query was issued.

    Query : string, The actual search query issued by user(s).

IsImplicitIntent : bool, True if the query did not mention covid, coronavirus, or sarsncov2 (e.g., “Shelter in place”). False otherwise.

    Country : string, Country from where the query was issued.

PopularityScore : int, Value between 1 and 100 inclusive. 1 indicates the least popular query with Coronavirus intent for that Country on that day, and 100 indicates the most popular query for the same Country on the same day.

QueriesByState

    Date : string, Date on which the query was issued.

    Query : string, The actual search query issued by user(s).

IsImplicitIntent : bool, True if the query did not mention covid, coronavirus, or sarsncov2 (e.g., “Shelter in place”). False otherwise.

    State : string, State from where the query was issued.

    Country :string, Country from where the query was issued.

PopularityScore : int, Value between 1 and 100 inclusive. 1 indicates the least popular query with Coronavirus intent for that day/State/Country, and 100 indicates the most popular query for the same geography on the same day.
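
    A minimal pandas sketch for loading one of these files is shown below. The exact file name is an assumption following the QueriesByCountry_DateRange.tsv naming pattern; the column names are the documented ones.

    ```python
    import pandas as pd

    # Minimal sketch: load one by-country TSV (file name is an assumption
    # following the QueriesByCountry_DateRange.tsv pattern) and list the
    # most popular implicit-intent queries.
    df = pd.read_csv("2020/QueriesByCountry_2020-01-01_2020-12-31.tsv", sep="\t")

    implicit = df[df["IsImplicitIntent"] == True]
    top = implicit.sort_values("PopularityScore", ascending=False)
    print(top[["Date", "Country", "Query", "PopularityScore"]].head())
    ```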

  2. Package and Dependency Metadata for CZI Hackathon: Mapping the Impact of...

    • zenodo.org
    application/gzip
    Updated Oct 26, 2023
    Cite
    Andrew Nesbitt; Andrew Nesbitt (2023). Package and Dependency Metadata for CZI Hackathon: Mapping the Impact of Research Software in Science [Dataset]. http://doi.org/10.5281/zenodo.10045361
    Explore at:
application/gzip
    Dataset updated
    Oct 26, 2023
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
    Andrew Nesbitt; Andrew Nesbitt
    License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    A collection of useful datasets extracted from https://packages.ecosyste.ms and https://repos.ecosyste.ms for use at the CZI Hackathon: Mapping the Impact of Research Software in Science.

All data is provided as NDJSON (newline-delimited JSON): each line is a valid JSON object, and objects are separated by newline characters. There are Python and R libraries for reading these files, or you can manually read the file and parse each line as a single JSON object.

Each ndjson file has been compressed with gzip (actual command: `tar -czvf`) to reduce download size; the files expand to significantly larger sizes after extraction.
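
    As a minimal sketch, records can be streamed straight out of one of these archives with Python's tarfile module; the archive and member names below are illustrative assumptions.

    ```python
    import json
    import tarfile

    def read_ndjson_from_targz(archive_path, member_name):
        """Yield one JSON object per NDJSON line inside a tar.gz archive."""
        with tarfile.open(archive_path, mode="r:gz") as tar:
            fh = tar.extractfile(member_name)
            for raw_line in fh:
                line = raw_line.strip()
                if line:
                    yield json.loads(line)

    # Inspect the keys of the first record rather than assuming a schema.
    for record in read_ndjson_from_targz("github.ndjson.tar.gz", "github.ndjson"):
        print(sorted(record.keys()))
        break
    ```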

    Package Data

Package names from CRAN, Bioconductor and PyPI that have been parsed by the software-mentions project (data: https://datadryad.org/stash/dataset/doi:10.5061/dryad.6wwpzgn2c) are collected together with their latest release at time of publishing, along with the names of their dependencies. Those dependency names have then also been recursively fetched, with latest release and dependencies, until the full list of transitive dependencies is included.

Note: This approach uses a simplified method of dependency resolution, always picking the latest version of each package rather than taking into account each dependency's specific version range requirements. This is primarily due to time constraints, and it allows all software ecosystems to be processed in the same way. A future improvement would be to use each package ecosystem's specific dependency resolution algorithm to compute the full transitive dependency tree for each mentioned software package.

    GitHub Data

    Two different approaches were taken for collecting data for referenced GitHub mentions:

1. `github.ndjson` is metadata for each repository from GitHub, including "manifest" files, which are known files that contain dependency information for a project, such as requirements.txt, DESCRIPTION and package.json, parsed using https://github.com/ecosyste-ms/bibliothecary. This may include transitive dependencies that have been discovered in a `lockfile` within the repository.

2. `github_packages.ndjson` is metadata for each package found on any package manager that references the GitHub URL as its repository URL/source/homepage. These packages, like the CRAN and PyPI data above, include the latest release and their direct dependencies. There may be more than one package for each GitHub URL, as it is a one-to-many relationship. `github_packages_with_transitive.ndjson` follows the same format but also includes the extra resolved transitive dependencies of all packages, using the same approach as with the CRAN and PyPI data above, with the same caveats.

There are also many more ecosystems referenced in these files than just CRAN, Bioconductor and PyPI; https://packages.ecosyste.ms provides a standardized metadata format for all of them to enable comparison and simplify automation.

    Contact

    If you would like any help, support or more data from Ecosyste.ms please do get in touch via email: hello@ecosyste.ms or open an issue on GitHub: https://github.com/ecosyste-ms/packages/issues

  3. Data from: Reliance on Science in Patenting

    • explore.openaire.eu
    • zenodo.org
    Updated Oct 13, 2020
    Cite
    Matt Marx; Aaron Fuegi (2020). Reliance on Science in Patenting [Dataset]. http://doi.org/10.5281/zenodo.3236339
    Explore at:
    Dataset updated
    Oct 13, 2020
    Authors
    Matt Marx; Aaron Fuegi
    Description

This dataset contains citations from USPTO patents granted 1947-2018 to articles captured by the Microsoft Academic Graph (MAG) from 1800-2018. If you use the data, please cite these two papers:

    • For the dataset of citations: Marx, Matt and Aaron Fuegi, "Reliance on Science in Patenting: USPTO Front-Page Citations to Scientific Articles" (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3331686).
    • For the underlying dataset of papers: Sinha, Arnab, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW '15 Companion). ACM, New York, NY, USA, 243-246.

    The main file, pcs.tsv, contains the resolved citations. Fields are tab-separated. Each match has the patent number, MAG ID, the original citation from the patent, an indicator for whether the citation was supplied by the applicant, examiner, or unknown, and a confidence score (1-10) indicating how likely it is that the match is correct. Note that this distribution does not contain matches with confidence 2 or 1. There is also a PubMed-specific match in pcs-pubmed.tsv.

    The remaining files are a redistribution of the 1 January 2019 release of the Microsoft Academic Graph. All of these files are compressed using ZIP compression under CentOS5. Original files, documented at https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema, can be downloaded from https://aka.ms/msracad; this redistribution carves up the original files into smaller, variable-specific files that can be loaded individually (see _relianceonscience.pdf for full details). Source code for generating the patent citations to science in pcs.tsv is available at https://github.com/mattmarx/reliance_on_science. Source code for generating jif.zip and jcif.zip (Journal Impact Factor and Journal Commercial Impact Factor) is at https://github.com/mattmarx/jcif.

    Although MAG contains authors and affiliations for each paper, it does not contain locations for affiliations. We have created a dataset of locations for affiliations appearing at least 100x using Bing Maps and Google Maps; however, it is unclear to us whether the API licensing terms allow us to repost their data. In any case, you can download our source code for doing so here: https://github.com/ksjiaxian/api-requester-locations.

    MAG extracts field keywords for each paper (paperfieldid.zip and fieldidname.zip), more than 200,000 fields in all! When looking to study industries or technical areas you might find this a bit overwhelming. We mapped the MAG subjects to six OECD fields and 39 subfields, defined here: http://www.oecd.org/science/inno/38235147.pdf. Clarivate provides a crosswalk between the OECD classifications and Web of Science fields, so we include WoS fields as well. This file is magfield_oecd_wos_crosswalk.zip.
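
    A minimal loading sketch for pcs.tsv follows. The column names are illustrative (the field order follows the description above), and whether the file carries a header row is an assumption to verify against the PDF documentation.

    ```python
    import pandas as pd

    # Illustrative column names; field order follows the description above.
    cols = ["patent", "magid", "citation", "source", "confidence"]
    pcs = pd.read_csv("pcs.tsv", sep="\t", names=cols, header=None)

    # Keep only high-confidence matches (scores run from 1 to 10; matches
    # with confidence 1 or 2 are not distributed at all).
    high_conf = pcs[pcs["confidence"] >= 8]
    print(len(high_conf), "matches with confidence >= 8")
    ```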

  4. Distribution of data used in the MS-PINPOINT project

    • rdr.ucl.ac.uk
    • datasetcatalog.nlm.nih.gov
    csv
    Updated Nov 26, 2024
    + more versions
    Cite
    Benjamin Ford (2024). Distribution of data used in the MS-PINPOINT project [Dataset]. http://doi.org/10.5522/04/27604563.v1
    Explore at:
csv
    Dataset updated
    Nov 26, 2024
    Dataset provided by
    University College London
    Authors
    Benjamin Ford
    License

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

This item is part of the SAFEHR scheme at UCLH. The purpose of the scheme is to publicly display what kinds of patient data we use, to encourage collaboration and transparency. More information can be found at https://safehr-data.github.io/uclh-research-discovery/.

    This dataset describes the structured health records used as part of the MS-PINPOINT project at UCL. It mainly describes patient demographics such as patient-reported gender, ethnicity and other features. Any category with fewer than 5 entries is not reported, in line with privacy guidelines.

  5. deberta_v3_variants

    • kaggle.com
    zip
    Updated Oct 30, 2022
    Cite
    Wojciech "Victor" Fulmyk (2022). deberta_v3_variants [Dataset]. https://www.kaggle.com/datasets/wisawesome/deberta-v3-variants
    Explore at:
zip (8407232748 bytes)
    Dataset updated
    Oct 30, 2022
    Authors
    Wojciech "Victor" Fulmyk
    Description

    DeBERTa v3 variants. Downloaded using:

```
    sudo apt-get install git-lfs
    git lfs install
    git clone https://huggingface.co/microsoft/deberta-v3-xsmall
    git clone https://huggingface.co/microsoft/deberta-v3-small
    git clone https://huggingface.co/microsoft/deberta-v3-base
    git clone https://huggingface.co/microsoft/deberta-v3-large
    ```

    For more details refer to:

    • https://github.com/microsoft/DeBERTa
    • https://huggingface.co/microsoft/deberta-v3-xsmall
    • https://huggingface.co/microsoft/deberta-v3-small
    • https://huggingface.co/microsoft/deberta-v3-base
    • https://huggingface.co/microsoft/deberta-v3-large

The objective of this upload: use the trained models in Kaggle competitions without needing to connect to the internet.
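
    A minimal offline-loading sketch is shown below; the local path is an assumption for a Kaggle notebook with this dataset attached.

    ```python
    from transformers import AutoModel, AutoTokenizer

    # Assumed path for a Kaggle notebook with this dataset attached.
    model_dir = "/kaggle/input/deberta-v3-variants/deberta-v3-base"

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModel.from_pretrained(model_dir)

    inputs = tokenizer("Offline inference test.", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)
    ```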

    There is no intention to infringe rights of any kind on my part, I simply want to use these models in competitions that require no internet connection. If you are one of the rights holders of these models and you feel your rights are being infringed by this upload, please contact me and I will rectify the issue as soon as possible.

  6. Tarrant County Building Footprints

    • data-tarrantcounty.opendata.arcgis.com
    • hub.arcgis.com
    Updated Sep 18, 2018
    Cite
    Tarrant County (2018). Tarrant County Building Footprints [Dataset]. https://data-tarrantcounty.opendata.arcgis.com/datasets/tarrant-county-building-footprints/explore?showTable=true
    Explore at:
    Dataset updated
    Sep 18, 2018
    Dataset authored and provided by
    Tarrant County
    License

Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Area covered
    Description

Tarrant County Building Footprints. Computer-generated building footprints for the United States. The original dataset contains 125,192,184 computer-generated building footprints covering all 50 US states and is freely available for download and use. It has been pared down here to include only Tarrant County building footprints; the filter extent used also includes portions of the counties that surround Tarrant County. License: this data is licensed by Microsoft under the Open Data Commons Open Database License (ODbL). From the FAQ, what the data include: approximately 125 million building footprint polygon geometries in all 50 US states, in GeoJSON format. Source: https://github.com/Microsoft/USBuildingFootprints
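
    A minimal sketch for exploring a GeoJSON export with geopandas is below; the file name is an assumption (the Hub page offers several export formats, GeoJSON among them).

    ```python
    import geopandas as gpd

    # File name is an assumption; export the layer as GeoJSON first.
    buildings = gpd.read_file("tarrant_county_building_footprints.geojson")

    print(len(buildings), "footprints")
    # Areas in a geographic CRS would be square degrees, so reproject to a
    # local projected CRS (UTM zone 14N covers Tarrant County) first.
    print(buildings.to_crs(epsg=32614).geometry.area.describe())
    ```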

  7. GeoPIXE Demo Data (Windows)

    • researchdata.edu.au
    • data.csiro.au
    datadownload
    Updated Jul 2, 2024
    + more versions
    Cite
    David Cousens; Barbara Etschmann; Murray Jensen; Chris Ryan (2024). GeoPIXE Demo Data (Windows) [Dataset]. http://doi.org/10.25919/FF5B-WR11
    Explore at:
datadownload
    Dataset updated
    Jul 2, 2024
    Dataset provided by
CSIRO (http://www.csiro.au/)
    Authors
    David Cousens; Barbara Etschmann; Murray Jensen; Chris Ryan
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Feb 1, 2000 - Jul 1, 2024
    Description

Demo data files for the program GeoPIXE (with a simple directory structure for a single user). Use these in conjunction with the worked-examples notes as an aid in training to use the GeoPIXE program for SXRF and PIXE imaging. They can be used for personal training; however, the Linux version is better suited to multiple users and a training workshop. The collection has been expanded to make it suitable for self-guided personal training.

    Requires the GeoPIXE program, which is available from geopixe@csiro.au and will soon be released as open source on GitHub. GeoPIXE runs under IDL, which must be obtained separately.

    Lineage: Data was produced using a range of detectors, such as Ge and Si(Li), SDD and the Maia 384 element detector array, at various synchrotron and ion-beam laboratories, including the XFM X-ray microprobe beamline of the Australian Synchrotron, the 2-ID-E beamline at the Advanced Photon Source, the CSIRO Maia Mapper and the CSIRO Nuclear Microprobe, and processed using the GeoPIXE software package.

  8. BrowardCountyBuildingFootprints

    • data.pompanobeachfl.gov
    • hub.arcgis.com
    Updated Apr 16, 2021
    Cite
    External Datasets (2021). BrowardCountyBuildingFootprints [Dataset]. https://data.pompanobeachfl.gov/dataset/browardcountybuildingfootprints
    Explore at:
kml, html, geojson, arcgis geoservices rest api, zip, csv
    Dataset updated
    Apr 16, 2021
    Dataset provided by
    BCGISData
    Authors
    External Datasets
    Description

Polygons of building footprints clipped to Broward County. This is a product of Microsoft.

    The original dataset contains 125,192,184 computer-generated building footprints covering all 50 US states. This data is freely available for download and use.

    The dataset was clipped to the Broward County developed boundary.

    Additional information: https://github.com/microsoft/USBuildingFootprints/blob/master/README.md

  9. Immersive Analytics Software Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 9, 2025
    + more versions
    Cite
    Market Report Analytics (2025). Immersive Analytics Software Report [Dataset]. https://www.marketreportanalytics.com/reports/immersive-analytics-software-73177
    Explore at:
doc, ppt, pdf
    Dataset updated
    Apr 9, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Immersive Analytics Software market is experiencing rapid growth, projected to reach $453 million in 2025 and exhibiting a Compound Annual Growth Rate (CAGR) of 33.2%. This robust expansion is driven by several key factors. The increasing adoption of immersive technologies like Virtual Reality (VR) and Augmented Reality (AR) across diverse sectors—business applications, education, healthcare, and public policy—is a significant catalyst. Businesses are leveraging immersive analytics for enhanced data visualization, improved decision-making, and more engaging training programs. The healthcare sector utilizes these tools for surgical planning, medical training simulations, and patient education, while public policy applications focus on creating interactive models for urban planning and disaster response. Furthermore, the continuous advancements in hardware and software capabilities, along with decreasing costs of VR/AR devices, are further fueling market growth. The availability of user-friendly software solutions is also widening the market's accessibility, attracting a larger user base. However, the market faces certain restraints. The high initial investment required for VR/AR infrastructure and software implementation can be a barrier for smaller organizations. Additionally, concerns regarding data security and privacy, as well as the potential for motion sickness and user fatigue associated with extended use of VR/AR devices, need to be addressed. Despite these challenges, the long-term prospects for immersive analytics remain highly positive, driven by ongoing technological innovations and the increasing demand for more efficient and engaging data analysis solutions across various industries. Market segmentation reveals a strong preference for PC and Mac applications, but the mobile (iOS) and VR/AR device segments are showing significant growth potential and are expected to capture considerable market share in the coming years. The major players – Immersion Analytics, GitHub, Microsoft, IBM, Accenture, Google, SAP, Meta, HTC, HP, Tibco, and Magic Leap – are actively shaping the market through continuous innovation and strategic partnerships.

  10. VAR-wlasl-complete

    • kaggle.com
    zip
    Updated Jun 6, 2025
    Cite
    Simone Ceredi (2025). VAR-wlasl-complete [Dataset]. https://www.kaggle.com/datasets/simoneceredi/var-wlasl-complete
    Explore at:
zip (47071065220 bytes)
    Dataset updated
    Jun 6, 2025
    Authors
    Simone Ceredi
    Description

    WLASL Recognition and signer classification

This is the project for the course "Visione Artificiale e Riconoscimento" (Computer Vision and Recognition) at the University of Bologna. The project aims to classify videos of Word-Level American Sign Language into their glosses. It is also possible to classify each signer using traditional methods and representation learning.

    Structure of the dataset

```
    data/
      WLASL_v0.3.json
      missing.txt
      labels.npz
      wlasl_class_list.txt
      videos/
      frames_no_bg/
      original_videos_sample/
      hf/
      mp/
    ```

    You can find more info about the content of the dataset here
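
    A minimal sketch for reading the metadata, assuming the standard WLASL_v0.3.json layout (a list of gloss entries, each with an instances array):

    ```python
    import json

    with open("data/WLASL_v0.3.json") as f:
        entries = json.load(f)

    # Count video instances per gloss for the first few entries.
    for entry in entries[:5]:
        print(entry["gloss"], len(entry["instances"]))
    ```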

    Acknowledgements

    All the WLASL data is intended for academic and computational use only. No commercial usage is allowed.

    Made by Dongxu Li and Hongdong Li. Please read the WLASL paper and visit the official website and repository.

    Licensed under the Computational Use of Data Agreement (C-UDA). Please refer to the C-UDA-1.0 page for more information.

  11. Data from: Performance Evolution Matrix: Visualizing Performance Variations...

    • zenodo.org
    zip
    Updated Jan 24, 2020
    Cite
    Juan Pablo Sandoval Alcocer; Fabian Beck; Alexandre Bergel; Juan Pablo Sandoval Alcocer; Fabian Beck; Alexandre Bergel (2020). Performance Evolution Matrix: Visualizing Performance Variations along Software Versions [Dataset]. http://doi.org/10.5281/zenodo.3355414
    Explore at:
zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
    Juan Pablo Sandoval Alcocer; Fabian Beck; Alexandre Bergel; Juan Pablo Sandoval Alcocer; Fabian Beck; Alexandre Bergel
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    # Performance Evolution Matrix
    This repository contains the artifacts needed to replicate our experiment in the paper "Performance Evolution Matrix".

    # Video Demo
    [download](https://github.com/jpsandoval/PerfEvoMatrix/blob/master/MatrixMovie.mp4)

    # XMLSupport and GraphET Examples

    To open the XMLSupport and GraphET Examples (which appears in the paper) execute the following commands in a Terminal.

**MacOSX.** We ran all the experiments on a MacBook Pro. To open the Matrix, execute the following command in the folder where this project was downloaded.

    ```
    ./Pharo-OSX/Pharo.app/Contents/MacOS/Pharo Matrix.image
    ```

    **Windows.**
You may also run the experiment on Windows, but depending on the Windows version you have installed there may be some UI bugs.
    ```
    cd Pharo-Windows
    Pharo.exe ../XMLSupportExample.image
    ```

    **Open the Visualization.**
    Please select the following code, then execute it using the green play button (at the top right of the window).
    ```
    ToadBuilder xmlSupportExample.
    ```
    or
    ```
    ToadBuilder graphETExample.
    ```
**Note.** There are two buttons at the panel's top left: In (zoom in) and Out (zoom out). To move the visualization, just drag the mouse over the panel.

    # Experiment
This subsection describes how to execute the tools for replicating our experiment.

    ## Baseline
The baseline contains the tools and the project dataset needed to perform the tasks described in the paper (identifying and understanding performance variations).

    ## Open the Baseline

**MacOSX.** We ran all the experiments on a MacBook Pro. To open the Baseline, execute the following command in the folder where this project was downloaded.

    ```
    ./Pharo-OSX/Pharo.app/Contents/MacOS/Pharo Baseline.image
    ```

    **Windows.**
You may also run the experiment on Windows, but depending on the Windows version you have installed there may be some UI bugs.
    ```
    cd Pharo-Windows
    Pharo.exe ../Baseline.image
    ```

    ## Open a Project

There are three projects under study; depending on the project you want to use for the task, execute one of the following scripts. To execute a script, press Cmd-d, or right-click and select "do it".

    **Roassal**
    ```
    TProfileVersion openRoassal.
    ```

    **XML**
    ```
    TProfileVersion openXML.
    ```
    **Grapher**
    ```
    TProfileVersion openGrapher.
    ```

    ## Baseline Options
    For each project, we provide a UI which contains all the tools we use as a baseline. Each item in the list is a version of the selected project.

    - Browse: open a standard window to inspect the code of the project in the selected version.
    - Profile: open a window with a call context tree for the selected version.
    - Source Diff: open a window with the code differences between the selected version and the previous one.
    - Execution Diff: open a window with the merge call context tree gathered from the selected version and the previous one.

**Note.** All these options require that you first select an item in the list.

    # Matrix

    ## Open Matrix Image.

**MacOSX.** We ran all the experiments on a MacBook Pro. To open the Matrix, execute the following command in the folder where this project was downloaded.

    ```
    ./Pharo-OSX/Pharo.app/Contents/MacOS/Pharo Matrix.image
    ```

    **Windows.**
You may also run the experiment on Windows, but depending on the Windows version you have installed there may be some UI bugs.
    ```
    cd Pharo-Windows
    Pharo.exe ../Matrix.image
    ```

    ## Open a project

There are three projects under study; depending on the project you want to use for the task, execute one of the following scripts. To execute a script, press Cmd-d, or right-click and select "do it".

    **Roassal**
    ```
    ToadBuilder roassal.
    ```

    **XML**
    ```
    ToadBuilder xml.
    ```
    **Grapher**
    ```
    ToadBuilder grapher.
    ```

    # Data Gathering

Before each participant starts a task, we execute the following script in Smalltalk (to execute a script, press Cmd-d, or right-click and select "do it"). It allows us to track when a user starts the experiment and to count mouse clicks and movements.
    ```
    UProfiler newSession.
    UProfiler current start.
    ```

After finishing the task, we executed the following script. It stops recording the mouse events and saves the stop time.
    ```
    UProfiler current end.
    ```

    The last script generates a file with the following information: start time, end time, number of clicks, number of mouse movements, and the number of mouse drags (we do not use this last one).
    ```
    11:34:52.5205 am,11:34:56.38016 am,14,75,0

    ```
    # Quit
To close the artifact, just close the window, or click in any free space of the window and select quit.

  12. WLASL (World Level American Sign Language) Video

    • kaggle.com
    zip
    Updated Sep 20, 2021
    + more versions
    Cite
    Risang Baskoro (2021). WLASL (World Level American Sign Language) Video [Dataset]. https://www.kaggle.com/datasets/risangbaskoro/wlasl-processed/code
    Explore at:
zip (5177253885 bytes)
    Dataset updated
    Sep 20, 2021
    Authors
    Risang Baskoro
    Area covered
    World, United States
    Description

    Context

WLASL is the largest video dataset for Word-Level American Sign Language (ASL) recognition, featuring 2,000 different common words in ASL. We hope WLASL will facilitate research in sign language understanding and eventually benefit communication between deaf and hearing communities.

    Content

    The WLASL_v0.3.json file contains the glossary and instances of the videos.

Inside the videos folder, there are about 12k videos, each named after its corresponding video_id.
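
    As a minimal sketch, the JSON can be used to map video files back to glosses; the .mp4 extension and the gloss/instances layout are assumptions based on the standard WLASL release.

    ```python
    import json
    from pathlib import Path

    with open("WLASL_v0.3.json") as f:
        entries = json.load(f)

    # video_id -> gloss lookup (layout assumed from the standard release).
    id_to_gloss = {
        inst["video_id"]: entry["gloss"]
        for entry in entries
        for inst in entry["instances"]
    }

    for video in sorted(Path("videos").glob("*.mp4"))[:5]:
        print(video.name, "->", id_to_gloss.get(video.stem, "<unlabeled>"))
    ```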

    Acknowledgements

    All the WLASL data is intended for academic and computational use only. No commercial usage is allowed.

    Made by Dongxu Li and Hongdong Li. Please read the WLASL paper and visit the official website and repository.

    Licensed under the Computational Use of Data Agreement (C-UDA). Please refer to the C-UDA-1.0 page for more information.

    Inspiration

    • How to classify word-level action recognition to text?
    • What is the most accurate model to do word-level sign language recognition?
  13. MM CELEBA HQ DATASET

    • kaggle.com
    zip
    Updated Nov 9, 2024
    + more versions
    Cite
    Kashyap KVH (2024). MM CELEBA HQ DATASET [Dataset]. https://www.kaggle.com/datasets/kashyapkvh/mm-celeba-hq-dataset
    Explore at:
zip (3169699323 bytes)
    Dataset updated
    Nov 9, 2024
    Authors
    Kashyap KVH
    License

Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Multi-Modal-CelebA-HQ

    Multi-Modal-CelebA-HQ (MM-CelebA-HQ) is a dataset containing 30,000 high-resolution face images selected from CelebA, following CelebA-HQ. Each image in the dataset is accompanied by a semantic mask, sketch, descriptive text, and an image with a transparent background.

    Multi-Modal-CelebA-HQ can be used to train and evaluate algorithms for a range of face generation and understanding tasks, including text-to-image generation, sketch-to-image generation, text-guided image editing, image captioning, and visual question answering. This dataset is introduced and employed in TediGAN.

    TediGAN: Text-Guided Diverse Face Image Generation and Manipulation.
    Weihao Xia, Yujiu Yang, Jing-Hao Xue, and Baoyuan Wu.
    CVPR 2021.

    Updates :triangular_flag_on_post:

    • [07/10/2023] 3DMM coefficients and corresponding rendered images have been added to the repository.
    • [04/10/2023] The scripts for text and sketch generation have been added to the repository.
    • [06/12/2020] The paper is released on arXiv.
    • [11/13/2020] The multi-modal-celeba-hq dataset has been released.

    Data Generation

    Description

    • The textual descriptions are generated using probabilistic context-free grammar (PCFG) based on the given attributes. We create ten unique single sentence descriptions per image to obtain more training data following the format of the popular CUB dataset and COCO dataset. The previous study proposed CelebTD-HQ, but it is not publicly available.
    • For semantic labels, we use CelebAMask-HQ dataset, which contains manually-annotated semantic mask of facial attributes corresponding to CelebA-HQ.
• For sketches, we follow the same data generation pipeline as in DeepFaceDrawing. We first apply the Photocopy filter in Photoshop to extract edges, which preserves facial details but introduces excessive noise, and then apply sketch simplification to get edge maps resembling hand-drawn sketches.
    • For background removal, we use an open-source tool, Rembg, and a commercial software, removebg. Different backgrounds can be further added using image composition or harmonization methods like DoveNet.
    • For 3DMM coefficients and the corresponding rendered image, we use Deep3DFaceReconstruction. Please follow the instructions for data generation. We also provide the Cleaned Face Datasets, the "cleaned" version of two popular face datasets, CelebAHQ and FFHQ, made by removing instances with extreme poses, occlusions, blurriness, and the presence of multiple individuals in the frame.

    Usage

    This section outlines the process of generating the data for our task.

    The scripts provided here are not restricted to the CelebA-HQ dataset and can be utilized to preprocess any dataset that includes attribute annotations, be it image, video, or 3D shape data. This flexibility enables the creation of custom datasets that meet specific requirements. For example, the create_caption.py script can be applied to generate diverse descriptions for each video by using video facial attributes (e.g., those provided by CelebV-HQ), leading to a text-video dataset, similar to CelebV-Text.

    Text

    Please download celeba-hq-attribute.txt (CelebAMask-HQ-attribute-anno.txt) and run the following script.

```
    python create_caption.py
    ```

    The generated textual descriptions can be found at ./celeba_caption.

    Please fill out the form to request the processing script. If feasible, please send me a follow-up email after submitting the form to remind me.

    Sketch

If Photoshop is available to you, please apply the Photocopy filter in Photoshop to extract edges. Photoshop allows batch processing, so you don't have to manually process each image. The Sobel operator is an alternative way to extract edges when Photoshop is unavailable or a simpler approach is preferred. This process preserve...

  14. Drone obstacle avoidance AirSim

    • kaggle.com
    zip
    Updated Apr 8, 2025
    Cite
    Lukáš Pellant (2025). Drone obstacle avoidance AirSim [Dataset]. https://www.kaggle.com/datasets/lukpellant/droneflight-obs-avoidanceairsimrgbdepth10k-320x320/suggestions
    Explore at:
zip (2116088484 bytes)
    Dataset updated
    Apr 8, 2025
    Authors
    Lukáš Pellant
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

DroneFlight_Obs_AvoidanceAirSimRGBDepth10k_320x320

    Overview

    This dataset contains 10,000 samples designed for drone navigation and obstacle avoidance research. It includes RGB images (320x320), Depth maps (320x320), and corresponding Commands (vx, vy, vz, yaw_rate). The data was collected in AirSim, a realistic drone simulator by Microsoft, using a drone controlled by a script implementing potential fields for navigation and obstacle avoidance.

    This dataset is ideal for researchers and developers working on autonomous drone navigation, computer vision, or robotics projects involving RGB and Depth data.

    Dataset Details

    • Size: 10,000 samples
    • Data Types:
      • RGB Images: 320x320 resolution, 3 channels (RGB), stored as PNG files
      • Depth Maps: 320x320 resolution, 1 channel, normalized to [0, 1] with max_depth=25.0, stored as NumPy arrays (.npy)
      • Commands: 4 values (vx, vy, vz, yaw_rate), normalized based on statistics (means: [2.43, 0, 0.025, -1.17], stds: [0.87, 0, 0.32, 20.56]), stored as NumPy arrays (.npy)
    • Collection Method: Data was generated in AirSim using a drone controlled by a potential fields navigation script for obstacle avoidance.
    • Simulator: AirSim (Microsoft), licensed under the MIT License (https://github.com/microsoft/AirSim)

    File Structure

    • rgb/: Directory containing 10,000 RGB images (e.g., 000000.png, ..., 009999.png)
    • depth/: Directory containing 10,000 Depth maps as NumPy arrays (e.g., 000000.npy, ..., 009999.npy)
    • commands/: Directory containing 10,000 Commands as NumPy arrays (e.g., 000000.npy, ..., 009999.npy), each file with 4 values: vx, vy, vz, yaw_rate
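
    A minimal sketch for loading one sample and undoing the normalization is below, assuming the usual z-score convention (x_norm = (x - mean) / std) for the commands:

    ```python
    import numpy as np
    from PIL import Image

    means = np.array([2.43, 0.0, 0.025, -1.17])
    stds = np.array([0.87, 0.0, 0.32, 20.56])

    rgb = np.asarray(Image.open("rgb/000000.png"))  # (320, 320, 3)
    depth = np.load("depth/000000.npy")             # normalized to [0, 1]
    cmd = np.load("commands/000000.npy")            # normalized vx, vy, vz, yaw_rate

    # Undo normalization; vy has std 0, so it simply stays at its mean of 0.
    vx, vy, vz, yaw_rate = cmd * stds + means
    depth_m = depth * 25.0                          # undo max_depth=25.0 scaling
    print(rgb.shape, float(depth_m.max()), (vx, vy, vz, yaw_rate))
    ```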

    Usage

This dataset is suitable for:

    • Developing models for autonomous drone navigation
    • Research in obstacle avoidance and path planning
    • Computer vision tasks involving RGB and Depth data
    • Robotics and simulation-based studies

    Example use case: Use the RGB and Depth data to develop algorithms for real-time obstacle avoidance in drones.

    License

This dataset is licensed under CC BY 4.0. You are free to use, modify, and distribute it as long as you provide attribution to the author and acknowledge the source of the data:

    • Attribution: "Dataset DroneFlight_Obs_AvoidanceAirSimRGBDepth10k_320x320 by https://www.kaggle.com/lukpellant, data generated using AirSim (MIT License)."
    • AirSim License: The data was collected in AirSim, which is licensed under the MIT License (https://github.com/microsoft/AirSim).

    Acknowledgments

    • AirSim: Thanks to Microsoft for providing the AirSim simulator under the MIT License.
    • Potential Fields: The navigation script uses potential fields for obstacle avoidance, inspired by classical robotics techniques.
  15. The ORBIT (Object Recognition for Blind Image Training)-India Dataset

    • data.niaid.nih.gov
    • nde-dev.biothings.io
    • +1more
    Updated Jul 2, 2024
    Cite
    India, Gesu; Grayson, Martin; Massiceti, Daniela; Morrison, Cecily; Robinson, Simon; Pearson, Jennifer; Jones, Matt (2024). The ORBIT (Object Recognition for Blind Image Training)-India Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11394528
    Explore at:
    Dataset updated
    Jul 2, 2024
    Dataset provided by
Microsoft (http://microsoft.com/)
    Swansea University
    Authors
    India, Gesu; Grayson, Martin; Massiceti, Daniela; Morrison, Cecily; Robinson, Simon; Pearson, Jennifer; Jones, Matt
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

The ORBIT (Object Recognition for Blind Image Training)-India Dataset is a collection of 105,243 images of 76 commonly used objects, collected by 12 individuals in India who are blind or have low vision. This dataset is an "Indian subset" of the original ORBIT dataset [1, 2], which was collected in the UK and Canada. In contrast to the ORBIT dataset, which was created in a Global North, Western, and English-speaking context, the ORBIT-India dataset features images taken in a low-resource, non-English-speaking, Global South context, home to 90% of the world's population of people with blindness. Since it is easier for blind or low-vision individuals to gather high-quality data by recording videos, this dataset, like the ORBIT dataset, contains images (each sized 224x224) derived from 587 videos. These videos were taken by our data collectors from various parts of India using the Find My Things [3] Android app. Each data collector was asked to record eight videos of at least 10 objects of their choice.

    Collected between July and November 2023, this dataset represents a set of objects commonly used by people who are blind or have low vision in India, including earphones, talking watches, toothbrushes, and typical Indian household items like a belan (rolling pin), and a steel glass. These videos were taken in various settings of the data collectors' homes and workspaces using the Find My Things Android app.

The image dataset is stored in the ‘Dataset’ folder, organized into folders assigned to the data collectors (P1, P2, ... P12) who collected the images. Each collector's folder includes sub-folders named with the object labels as provided by our data collectors. Within each object folder, there are two subfolders: ‘clean’ for images taken on clean surfaces and ‘clutter’ for images taken in the cluttered environments where the objects are typically found. The annotations are saved inside an ‘Annotations’ folder containing one JSON file per video (e.g., P1--coffee mug--clean--231220_084852_coffee mug_224.json) whose keys correspond to all frames/images in that video (e.g., "P1--coffee mug--clean--231220_084852_coffee mug_224--000001.jpeg": {"object_not_present_issue": false, "pii_present_issue": false}, ...). The ‘object_not_present_issue’ key is True if the object is not present in the image, and the ‘pii_present_issue’ key is True if personally identifiable information (PII) is present in the image. Note: all PII present in the images has been blurred to protect the identity and privacy of our data collectors. This dataset version was created by cropping images originally sized at 1080 x 1920; an unscaled version of the dataset will follow soon.
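
    A minimal sketch for tallying the per-frame flags in one annotation file, using the JSON layout quoted above:

    ```python
    import json
    from pathlib import Path

    ann_path = Path("Annotations/P1--coffee mug--clean--231220_084852_coffee mug_224.json")
    frames = json.loads(ann_path.read_text())

    missing = sum(v["object_not_present_issue"] for v in frames.values())
    pii = sum(v["pii_present_issue"] for v in frames.values())
    print(f"{len(frames)} frames, {missing} without the object, {pii} with PII")
    ```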

    This project was funded by the Engineering and Physical Sciences Research Council (EPSRC) Industrial ICASE Award with Microsoft Research UK Ltd. as the Industrial Project Partner. We would like to acknowledge and express our gratitude to our data collectors for their efforts and time invested in carefully collecting videos to build this dataset for their community. The dataset is designed for developing few-shot learning algorithms, aiming to support researchers and developers in advancing object-recognition systems. We are excited to share this dataset and would love to hear from you if and how you use this dataset. Please feel free to reach out if you have any questions, comments or suggestions.

    REFERENCES:

[1] Daniela Massiceti, Lida Theodorou, Luisa Zintgraf, Matthew Tobias Harris, Simone Stumpf, Cecily Morrison, Edward Cutrell, and Katja Hofmann. 2021. ORBIT: A real-world few-shot dataset for teachable object recognition collected from people who are blind or low vision. DOI: https://doi.org/10.25383/city.14294597

    [2] microsoft/ORBIT-Dataset. https://github.com/microsoft/ORBIT-Dataset

    [3] Linda Yilin Wen, Cecily Morrison, Martin Grayson, Rita Faia Marques, Daniela Massiceti, Camilla Longden, and Edward Cutrell. 2024. Find My Things: Personalized Accessibility through Teachable AI for People who are Blind or Low Vision. In Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems (CHI EA '24). Association for Computing Machinery, New York, NY, USA, Article 403, 1-6. https://doi.org/10.1145/3613905.3648641

  16. Multi-Turn Chats With Context Classification

    • kaggle.com
    zip
    Updated Apr 13, 2024
    Cite
    TheItCrow (2024). Multi-Turn Chats With Context Classification [Dataset]. https://www.kaggle.com/datasets/kevinbnisch/multi-turn-chats-with-context-classification
    Explore at:
zip (2311398 bytes)
    Dataset updated
    Apr 13, 2024
    Authors
    TheItCrow
    License

Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

This is the dataset on which CCC-BERT was fine-tuned to classify whether a user input requires a new context retrieval by the RAG model or not. The creation of the dataset can be found in this notebook in section 2.

    The dataset consists of two files:

    context_chats_35k.json

This file contains 35,000 multi-turn chats on random topics, all ending on a user's input, synthetically produced with GPT-3.5 and labeled with a fetch_context flag that tells us whether the user input should require the retrieval of context or not. For a more detailed explanation of this flag, please consult the CCC-BERT model card.

    Chit-Chat

The chit-chat_dataset.tsv contains around 10,000 "nonsense" chats provided by Microsoft on this GitHub repository. I've added this small dataset as it can be used to augment the chats a bit more, but the higher quality lies within the context_chats_35k.json file.
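
    A minimal loading sketch for both files is below; since the JSON schema is not spelled out above, it inspects the first record rather than assuming field names.

    ```python
    import json
    import pandas as pd

    with open("context_chats_35k.json") as f:
        chats = json.load(f)
    print(type(chats), len(chats))
    # Peek at one record to discover the schema instead of assuming it.
    print(chats[0] if isinstance(chats, list) else next(iter(chats.items())))

    chit_chat = pd.read_csv("chit-chat_dataset.tsv", sep="\t")
    print(chit_chat.head())
    ```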

    Usage

Multi-turn chats are useful for fine-tuning LLMs and for training on tasks such as POS tagging, lemmatization, named-entity recognition, and more.

    Feel free to utilize these chats to your liking.

  17. Randomised Synthetic Online Game Purchases Data

    • kaggle.com
    zip
    Updated Apr 24, 2022
    Cite
    zaclovell (2022). Randomised Synthetic Online Game Purchases Data [Dataset]. https://www.kaggle.com/datasets/zaclovell/randomised-synthetic-online-game-purchases-data
    Explore at:
zip (1208739 bytes)
    Dataset updated
    Apr 24, 2022
    Authors
    zaclovell
    Description

    1. Why build a dataset?

    I wanted to run data analysis and machine learning on a large dataset to build my data science skills but I felt out of touch with the various datasets available so I thought... how about I try and build my own dataset?

    2. Why gaming data?

    I wondered what data should be in the dataset and settled with online digital game purchases since I am an avid gamer. Imagine getting sales data from the PlayStation Store or Xbox Microsoft Store, this is what I was aiming to replicate.

    3. Scope of the dataset

I envisaged the dataset to be data created through the purchase of a digital game on either the UK PlayStation Store or Xbox Microsoft Store. Considering this, the scope of the dataset varies depending on which column of data you are viewing, for example:

    - Date and Time: purchases were defined between a start/end date (this can be altered, see point 4) and, of course, any time across the 24hr clock
    - Geographically: purchases were set up to come from any postcode in the UK; in total this is over 1,000,000 active postcodes
    - Purchases: the list of game titles available for purchase is 24
    - Registered Banks: the list of registered banks in the UK (as of 03/2022) was 159

    4. Over 42,000 rows isn't enough?

To generate the dataset, I built a Python function that, when called with the number of rows you want, generates the dataset. For example, calling function(1000) will give you a dataset with 1000 rows (see the illustrative sketch below).

    Considering this, if just over 42,000 rows of data (42,892 to be exact) isn't enough, feel free to check out the code on my GitHub to run the function yourself with as many rows as you want.

    Note: You can also edit the start/end dates of the function depending on which timespan you want the dataset to cover.
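
    For flavor, here is a hypothetical sketch in the spirit of that generator; it is not the author's actual function (that lives in their GitHub repo), and the title and bank lists are tiny stand-ins for the real 24-title and 159-bank lists.

    ```python
    import random
    from datetime import datetime, timedelta

    # Tiny stand-in lists; the real dataset draws from 24 titles and 159 banks.
    TITLES = ["Title A", "Title B", "Title C"]
    BANKS = ["Bank A", "Bank B", "Bank C"]

    def generate_purchases(n, start=datetime(2022, 1, 1), end=datetime(2022, 4, 24)):
        """Return n synthetic purchase rows with random timestamps in [start, end]."""
        span = (end - start).total_seconds()
        return [
            {
                "datetime": start + timedelta(seconds=random.uniform(0, span)),
                "title": random.choice(TITLES),
                "bank": random.choice(BANKS),
            }
            for _ in range(n)
        ]

    print(generate_purchases(3))
    ```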

    5. Disclaimer - this is still a work in progress!

    Yes, as stated above, this dataset is still a work in progress and is therefore not 100% perfect. There is a backlog of issues that need to be resolved. Feel free to check out the backlog.

One example of this is how, in various columns, the distribution of data is uniform when, for the dataset to be genuinely random, it should not be. An example of this issue is the Time column. These issues will be resolved in a later update.

    Last updated: 24/04/2022

  18. Mammogram Mass Analyzer Desktop App

    • kaggle.com
    zip
    Updated Dec 29, 2022
    Cite
    vbookshelf (2022). Mammogram Mass Analyzer Desktop App [Dataset]. https://www.kaggle.com/datasets/vbookshelf/mammogram-mass-analyzer-v00
    Explore at:
zip (207834390 bytes)
    Dataset updated
    Dec 29, 2022
    Authors
    vbookshelf
    Description

    Mammogram Mass Analyzer

This is a free desktop computer-aided diagnosis (CAD) tool that uses computer vision to detect and localize masses on full-field digital mammograms. It's a Flask app running on the desktop. Internally there are two ensembled YOLOv5L models that were trained on data from the VinDr-Mammo dataset. The model ensemble has a validation accuracy of 0.65 and a validation recall of 0.63.

    My aim was to create a proof of concept for a free desktop computer aided diagnosis (CAD) system that could be used as an aid when diagnosing breast cancer. Unlike a web app, this tool does not need an internet connection and there are no monthly costs for hosting and web server rental. I think a desktop tool could be helpful to radiologists in private practice and to medical non-profits that work in remote areas.

    The complete project folder, including the trained models, is stored in this Kaggle dataset.

    For a full project description please refer to the GitHub repo: https://github.com/vbookshelf/Mammogram-Mass-Analyzer

    For info on model training and validation, please refer to the model card. I've included a confusion matrix and classification report. https://github.com/vbookshelf/Mammogram-Mass-Analyzer/blob/main/mammogram-mass-analyzer-v0.0/Model-Card-and-App-Info.pdf

    Demo

Demo GIF: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F1086574%2F421d04920cd6a4dfed890a81df0f13c8%2Fdemo1.gif?generation=1669902576106757&alt=media

    Demo showing what happens after a user submits three dicom mammograms


    1- Main features

    • Free to use. Free to deploy. No monthly server rental costs like with a web app.
    • Completely transparent. All code is accessible and therefore fully auditable.
    • Runs locally without needing an internet connection
    • Takes mammograms in dicom format as input
    • Can analyze multiple mammograms simultaneously
    • Uses the computer’s cpu. A gpu would make the app much faster, but it's not essential.
    • Results are explainable because it draws bounding boxes around detected masses
    • Patient data remains private because it never leaves the user’s computer
    • Easy to customize because this is just a Flask app built using html, css and javascript.

    2- Cons

    • It’s not a one click setup. The user needs to have a basic knowledge of how to use the command line to set up a virtual environment, download requirements and launch a python app.
    • The inference time is about 10 seconds per image, because inference is being done on the CPU.
    • When diagnosing breast cancer radiologists look for masses, calcifications and architectural distortions. However, this app can only detect masses.
    • The amount of positive samples in the training data was limited. The accuracy and recall could be improved with more training data.

    3- How to run this app

    First download the project folder from Kaggle

    The project folder (named mammogram-mass-analyzer-v0.1) is stored in this Kaggle dataset.

I suggest that you download the project folder from Kaggle instead of from the GitHub repo, because the project folder on Kaggle includes the two trained models. The project folder in the GitHub repo does not include the trained models, because GitHub does not allow files larger than 25MB to be uploaded.
    The models are located inside a folder called TRAINED_MODEL_FOLDER, which is located inside the yolov5 folder: mammogram-mass-analyzer-v0.0/yolov5/TRAINED_MODEL_FOLDER/


    Overview

    This is a standard flask app. The steps to set up and run the app are the same for both Mac and Windows.

    1. Download the project folder.
    2. Use the command line to pip install the requirements listed in the requirements.txt file. (It’s located inside the project folder.)
    3. Run the app.py file from the command line.
    4. Copy the url that gets printed in the console.
    5. Paste that url into your chrome browser and press Enter. The app will open in the browser.

    This app is based on Flask and Pytorch, both of which are pure python. If you encounter any errors during installation you should be able to solve them quite easily. You won’t have to deal with the package dependency issues that happen when using Tensorflow.


    Detailed setup instructions

The instructions below are for a Mac. I didn't include instructions for Windows because I don't have a Windows PC and therefore could not test the installation process on Windows. If you're using a Windows PC then please adapt the commands below to suit Windows.

    You’ll need an internet connection during the first setup. After that you’ll be able to use the app without an internet connection.

    If you are a beginner you may find these resources helpful:

    The Complete Guide to Python Virtual Environments! Teclado (Includes instructions for Windows) https://www.youtube.com/watch?v=KxvKCSwlUv8&t=947s

    How To Create Python Virtual Envi...

  19. Competitions Shake-up

    • kaggle.com
    zip
    Updated Sep 27, 2020
    Cite
    Daniboy370 (2020). Competitions Shake-up [Dataset]. https://www.kaggle.com/daniboy370/competitions-shakeup
    Explore at:
zip (388789 bytes)
    Dataset updated
    Sep 27, 2020
    Authors
    Daniboy370
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

Shake-what?!

The Shake phenomenon occurs when the competition shifts between two different test sets:

\[ \text{Public test set} \Rightarrow \text{Private test set} \quad \Leftrightarrow \quad \text{LB-public} \Rightarrow \text{LB-private} \]

The private test set, unavailable until then, becomes available, and thus the models' scores are recalculated. This re-evaluation elicits a corresponding re-ranking of the contestants in the competition. The shake allows participants to assess the severity of their overfitting to the public dataset and act to improve their models until the deadline.

Unable to find a uniform conventional term for this mechanism, I will use my common sense to define the following intuition:

    \[ \text{Shake} = \text{Rank}_{\text{public}} - \text{Rank}_{\text{private}} \]
    

    From the starter kernel :

Demo from the starter kernel: https://github.com/Daniboy370/Uploads/blob/master/Kaggle-shake-ups/vids/shakeup_VID.gif?raw=true
    

    Content

Seven datasets of competitions that were scraped from Kaggle:

    | Competition | Name of file |
    | --- | --- |
    | Elo Merchant Category Recommendation | df_{Elo} |
    | Human Protein Atlas Image Classification | df_{Protein} |
    | Humpback Whale Identification | df_{Humpback} |
    | Microsoft Malware Prediction | df_{Microsoft} |
    | Quora Insincere Questions Classification | df_{Quora} |
    | TGS Salt Identification Challenge | df_{TGS} |
    | VSB Power Line Fault Detection | df_{VSB} |

As an example, consider the following dataframe from the Quora competition:

    | Team Name | Rank-private | Rank-public | Shake | Score-private | Score-public |
    | --- | --- | --- | --- | --- | --- |
    | The Zoo | 1 | 7 | 6 | 0.71323 | 0.71123 |
    | ... | ... | ... | ... | ... | ... |
    | D.J. Trump | 1401 | 65 | -1336 | 0.000 | 0.70573 |
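
    A minimal sketch for recomputing the shake from one of these dataframes; the file name is an assumption, and the column names follow the table above.

    ```python
    import pandas as pd

    df = pd.read_csv("df_Quora.csv")  # file name is an assumption

    # Consistent with the rows above: 7 - 1 = 6 and 65 - 1401 = -1336.
    df["Shake"] = df["Rank-public"] - df["Rank-private"]
    print(df.nlargest(5, "Shake")[["Team Name", "Rank-public", "Rank-private", "Shake"]])
    ```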

I encourage everybody to investigate the dataset thoroughly in search of interesting findings!

    \[ \text{Enjoy !}\]

  20. API Call based Malware Dataset

    • kaggle.com
    zip
    Updated May 8, 2019
    Cite
    Ferhat Ozgur Catak (2019). API Call based Malware Dataset [Dataset]. https://www.kaggle.com/focatak/malapi2019
    Explore at:
zip (5944171 bytes)
    Dataset updated
    May 8, 2019
    Authors
    Ferhat Ozgur Catak
    License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description


    Windows Malware Dataset with PE API Calls

Our public malware dataset was generated by Cuckoo Sandbox based on analysis of Windows OS API calls. It is provided in CSV file format for cyber security researchers to use in malware analysis and machine learning applications.
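
    A minimal first-look sketch is below; the file name and column layout are assumptions, so it prints the header before anything else.

    ```python
    import pandas as pd

    # File name and column layout are assumptions; inspect before use.
    df = pd.read_csv("mal-api-2019.csv")
    print(df.columns.tolist())
    print(df.head())
    ```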

    Cite The DataSet
If you find these results useful, please cite:

    @article{10.7717/peerj-cs.285,
     title = {Deep learning based Sequential model for malware analysis using Windows exe API Calls},
     author = {Catak, Ferhat Ozgur and Yazı, Ahmet Faruk and Elezaj, Ogerta and Ahmed, Javed},
     year = 2020,
     month = jul,
     keywords = {Malware analysis, Sequential models, Network security, Long-short-term memory, Malware dataset},
     volume = 6,
     pages = {e285},
     journal = {PeerJ Computer Science},
     issn = {2376-5992},
     url = {https://doi.org/10.7717/peerj-cs.285},
     doi = {10.7717/peerj-cs.285}
    }
    

    Publications

The details of the Mal-API-2019 dataset are published in the following papers:

    * [Link] AF. Yazı, FÖ Çatak, E. Gül, Classification of Metamorphic Malware with Deep Learning (LSTM), IEEE Signal Processing and Applications Conference, 2019.
    * [Link] Catak, FÖ., Yazi, AF., A Benchmark API Call Dataset for Windows PE Malware Classification, arXiv:1905.01999, 2019.

    Introduction

This study seeks to obtain data that will help to address gaps in machine-learning-based malware research. The specific objective of this study is to build a benchmark dataset of Windows operating system API calls for various malware. This is the first study to use metamorphic malware to build sequential API calls. It is hoped that this research will contribute to a deeper understanding of how metamorphic malware change their behavior (i.e., API calls) by adding meaningless opcodes with their own disassembler/assembler parts.

    Malware Types and System Overall

In our research, we have grouped the families produced by each piece of software into 8 main malware families: Trojan, Backdoor, Downloader, Worms, Spyware, Adware, Dropper, Virus. Table 1 shows the number of malware samples belonging to each family in our data set. As you can see in the table, the sample counts of all malware families except Adware are quite close to each other. The difference exists because we did not find much malware from the Adware family.

| Malware Family | Samples | Description |
    | --- | --- | --- |
    | Spyware | 832 | Enables a user to obtain covert information about another's computer activities by transmitting data covertly from their hard drive. |
    | Downloader | 1001 | Shares the primary functionality of downloading content. |
    | Trojan | 1001 | Misleads users of its true intent. |
    | Worms | 1001 | Spreads copies of itself from computer to computer. |
    | Adware | 379 | Hides on your device and serves you advertisements. |
    | Dropper | 891 | Surreptitiously carries viruses, back doors and other malicious software so they can be executed on the compromised machine. |
    | Virus | 1001 | Designed to spread from host to host and has the ability to replicate itself. |
    | Backdoor | 1001 | A technique in which a system security mechanism is bypassed undetectably to access a computer or its data. |

The figure shows the general flow of the generation of the malware data set. As shown in the figure, we obtained the MD5 hash values of the malware we collected from GitHub. We searched these hash values using the VirusTotal API and obtained the families of this malicious software from the reports of 67 different antivirus engines in VirusTotal. We observed that the malware families found in the reports of these 67 antivirus engines differ from each other.


    Data Description
