This dataset was curated from Bing search logs (desktop users only) over the period January 1st, 2020 – (Current Month - 1). Only searches that were issued many times by multiple users were included. The dataset includes queries from all over the world that had an intent related to the Coronavirus or Covid-19. In some cases this intent is explicit in the query itself, e.g. “Coronavirus updates Seattle”; in other cases it is implicit, e.g. “Shelter in place”. Implicit intent of search queries (e.g. “Toilet paper”) was extracted using the random-walks-on-the-click-graph approach outlined in the following paper by Nick Craswell et al. at Microsoft Research: https://www.microsoft.com/en-us/research/wp-content/uploads/2007/07/craswellszummer-random-walks-sigir07.pdf. All personal data was removed. Source: https://msropendata.com/datasets/c5031874-835c-48ed-8b6d-31de2dad0654
Data Source: Bing Coronavirus Query set (https://github.com/microsoft/BingCoronavirusQuerySet)
Inside the data folder there is a folder 2020 (for the year) which contains two kinds of files.
- QueriesByCountry_DateRange.tsv: a tab-separated text file that contains queries with Coronavirus intent, by country.
- QueriesByState_DateRange.tsv: a tab-separated text file that contains queries with Coronavirus intent, by state.
QueriesByCountry
- Date: string. Date on which the query was issued.
- Query: string. The actual search query issued by user(s).
- IsImplicitIntent: bool. True if the query did not mention covid, coronavirus, or sarsncov2 (e.g., “Shelter in place”); False otherwise.
- Country: string. Country from where the query was issued.
- PopularityScore: int. Value between 1 and 100 inclusive; 1 indicates the least popular query with Coronavirus intent for that country on that day, and 100 indicates the most popular query for the same country on the same day.
QueriesByState
- Date: string. Date on which the query was issued.
- Query: string. The actual search query issued by user(s).
- IsImplicitIntent: bool. True if the query did not mention covid, coronavirus, or sarsncov2 (e.g., “Shelter in place”); False otherwise.
- State: string. State from where the query was issued.
- Country: string. Country from where the query was issued.
- PopularityScore: int. Value between 1 and 100 inclusive; 1 indicates the least popular query with Coronavirus intent for that day/State/Country, and 100 indicates the most popular query for the same geography on the same day.
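For illustration, here is a minimal pandas sketch of loading one of the country files; the date range in the file name is an assumption for illustration, and the columns are exactly those documented above:

```
import pandas as pd

# Load one QueriesByCountry file; the date range in the name is illustrative.
df = pd.read_csv("2020/QueriesByCountry_2020-01-01_2020-01-31.tsv", sep="\t")

# Implicit-intent queries only (no explicit covid/coronavirus mention).
implicit = df[df["IsImplicitIntent"] == True]

# The most popular coronavirus-intent query per country per day.
top = df[df["PopularityScore"] == 100][["Date", "Country", "Query"]]
print(top.head())
```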
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
A collection of useful datasets extracted from https://packages.ecosyste.ms and https://repos.ecosyste.ms for use at the CZI Hackathon: Mapping the Impact of Research Software in Science.
All data is provided as NDJSON (newline-delimited JSON): each line is a valid JSON object, and records are separated by newline characters. There are Python and R libraries for reading these files, or you can manually read the file line by line and parse each line as a single JSON object.
Each NDJSON file has been compressed with gzip (actual command: `tar -czvf`) to reduce download size; the files expand to significantly larger sizes after extraction.
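For example, a minimal sketch of streaming one of the files record by record; the file name is illustrative, and since the archives were built with `tar -czvf` you may need to un-tar them first:

```
import gzip
import json

# Stream a gzip-compressed NDJSON file one record at a time,
# without loading the whole (large) file into memory.
with gzip.open("github.ndjson.gz", "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)  # each line is one valid JSON object
        # ... process `record` here ...
```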
Package names from cran, bioconductor and pypi that have been parsed by the software-mentions project (data: https://datadryad.org/stash/dataset/doi:10.5061/dryad.6wwpzgn2c) are collected together with their latest release at time of publishing, along with the names of their dependencies. Those dependency names have then also been recursively fetched, each with its latest release and dependencies, until the full list of transitive dependencies is included.
Note: this approach uses a simplified method of dependency resolution, always picking the latest version of each package rather than taking into account each dependency's specific version-range requirements. This is primarily due to time constraints, and it allows all software ecosystems to be processed in the same way. A future improvement would be to use each package ecosystem's specific dependency resolution algorithm to compute the full transitive dependency tree for each mentioned software package.
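A minimal sketch of this simplified resolution, assuming a hypothetical `fetch_latest(name)` helper (e.g. one backed by the https://packages.ecosyste.ms API) that returns a package's latest release as a dict with a list of dependency names:

```
from collections import deque

def resolve_transitive(root_packages, fetch_latest):
    """Always pick the latest release of each package and recurse over its
    dependency names, ignoring version-range requirements."""
    resolved = {}
    queue = deque(root_packages)
    while queue:
        name = queue.popleft()
        if name in resolved:
            continue  # one (latest) version per package, ever
        release = fetch_latest(name)  # hypothetical helper
        resolved[name] = release
        queue.extend(release.get("dependencies", []))
    return resolved
```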
Two different approaches were taken for collecting data for referenced GitHub mentions:
1. `github.ndjson` is metadata for each repository from GitHub, including "manifest" files (known files that contain dependency information for a project, such as requirements.txt, DESCRIPTION and package.json) parsed using https://github.com/ecosyste-ms/bibliothecary; this may include transitive dependencies discovered in a `lockfile` within the repository.
2. `github_packages.ndjson` is metadata for each package found on any package manager that references the GitHub URL as its repository URL/source/homepage. These packages, like the cran and pypi data above, include the latest release and their direct dependencies; there may be more than one package per GitHub URL, as this is a one-to-many relationship. `github_packages_with_transitive.ndjson` follows the same format but also includes the resolved transitive dependencies of all packages, using the same approach (and the same caveats) as the cran and pypi data above.
There are also many more ecosystems referenced in these files than just cran, bioconductor and pypi; https://packages.ecosyste.ms provides a standardized metadata format for all of them to enable comparison and simplify automation.
If you would like any help, support or more data from Ecosyste.ms please do get in touch via email: hello@ecosyste.ms or open an issue on GitHub: https://github.com/ecosyste-ms/packages/issues
This dataset contains citations from USPTO patents granted 1947-2018 to articles captured by the Microsoft Academic Graph (MAG) from 1800-2018. If you use the data, please cite these two papers: for the dataset of citations, Marx, Matt and Aaron Fuegi, "Reliance on Science in Patenting: USPTO Front-Page Citations to Scientific Articles" (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3331686); for the underlying dataset of papers, Sinha, Arnab, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW ’15 Companion). ACM, New York, NY, USA, 243-246.

The main file, pcs.tsv, contains the resolved citations. Fields are tab-separated. Each match has the patent number, MAG ID, the original citation from the patent, an indicator for whether the citation was supplied by the applicant, examiner, or unknown, and a confidence score (1-10) indicating how likely it is that the match is correct. Note that this distribution does not contain matches with confidence 1 or 2. There is also a PubMed-specific match in pcs-pubmed.tsv.

The remaining files are a redistribution of the 1 January 2019 release of the Microsoft Academic Graph. All of these files are compressed using ZIP compression under CentOS5. Original files, documented at https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema, can be downloaded from https://aka.ms/msracad; this redistribution carves up the original files into smaller, variable-specific files that can be loaded individually (see _relianceonscience.pdf for full details). Source code for generating the patent citations to science in pcs.tsv is available at https://github.com/mattmarx/reliance_on_science. Source code for generating jif.zip and jcif.zip (Journal Impact Factor and Journal Commercial Impact Factor) is at https://github.com/mattmarx/jcif.

Although MAG contains authors and affiliations for each paper, it does not contain the location for affiliations. We have created a dataset of locations for affiliations appearing at least 100x using Bing Maps and Google Maps; however, it is unclear to us whether the API licensing terms allow us to repost their data. In any case, you can download our source code for doing so here: https://github.com/ksjiaxian/api-requester-locations.

MAG extracts field keywords for each paper (paperfieldid.zip and fieldidname.zip), more than 200,000 fields in all! When looking to study industries or technical areas you might find this a bit overwhelming, so we mapped the MAG subjects to six OECD fields and 39 subfields, defined here: http://www.oecd.org/science/inno/38235147.pdf. Clarivate provides a crosswalk between the OECD classifications and Web of Science fields, so we include WoS fields as well. This file is magfield_oecd_wos_crosswalk.zip.
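As a quick start, here is a minimal sketch of loading the pcs.tsv match file described above; the column names are assumptions based on the field description, so check the actual header (or _relianceonscience.pdf) before relying on them:

```
import pandas as pd

# Assumed column order, following the field description above.
cols = ["patent", "magid", "rawcitation", "wherefound", "confidence"]
pcs = pd.read_csv("pcs.tsv", sep="\t", names=cols, header=None)

# Keep only high-confidence matches (confidence 1-2 are already excluded).
high_conf = pcs[pcs["confidence"] >= 8]
print(len(high_conf), "high-confidence patent-paper citation pairs")
```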
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This item is part of the SAFEHR scheme at UCLH. The purpose of the scheme is to publicly display what kinds of patient data we use, to encourage collaboration and transparency. More information can be found at https://safehr-data.github.io/uclh-research-discovery/. This dataset describes the structured health records used as part of the MS-PINPOINT project at UCL. It mainly describes patient demographics such as patient-reported gender, ethnicity and other features. Any category with fewer than 5 entries is not reported, in line with privacy guidelines.
DeBERTa v3 variants. Downloaded using:
```
sudo apt-get install git-lfs
git lfs install
git clone https://huggingface.co/microsoft/deberta-v3-xsmall
git clone https://huggingface.co/microsoft/deberta-v3-small
git clone https://huggingface.co/microsoft/deberta-v3-base
git clone https://huggingface.co/microsoft/deberta-v3-large
```
For more details refer to:
- https://github.com/microsoft/DeBERTa
- https://huggingface.co/microsoft/deberta-v3-xsmall
- https://huggingface.co/microsoft/deberta-v3-small
- https://huggingface.co/microsoft/deberta-v3-base
- https://huggingface.co/microsoft/deberta-v3-large
The objective of this upload is to use the trained models in Kaggle competitions without needing to connect to the internet, as sketched below.
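A minimal sketch of loading one of the cloned checkpoints from a local path in a Kaggle notebook; the input path is an assumption, so adjust it to wherever the clone is attached:

```
from transformers import AutoModel, AutoTokenizer

model_dir = "/kaggle/input/deberta-v3/deberta-v3-base"  # assumed mount path
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModel.from_pretrained(model_dir)

inputs = tokenizer("DeBERTa v3 running fully offline.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```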
There is no intention on my part to infringe rights of any kind; I simply want to use these models in competitions that require no internet connection. If you are one of the rights holders of these models and feel your rights are being infringed by this upload, please contact me and I will rectify the issue as soon as possible.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Tarrant County Building Footprints: computer-generated building footprints for the United States. The original dataset contains 125,192,184 computer-generated building footprints in all 50 US states and is freely available for download and use. It has been pared down here to include only Tarrant County building footprints; the filter extent used also includes portions of the surrounding counties. License: this data is licensed by Microsoft under the Open Data Commons Open Database License (ODbL). What the data include: approximately 125 million building footprint polygon geometries in all 50 US states, in GeoJSON format. Source: https://github.com/Microsoft/USBuildingFootprints
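A minimal sketch of loading the footprints with geopandas; the file name is illustrative, and the statewide source files from the Microsoft repo use the same GeoJSON format:

```
import geopandas as gpd

# Illustrative file name for the pared-down county extract.
footprints = gpd.read_file("TarrantCountyBuildingFootprints.geojson")
print(len(footprints), "building footprint polygons")
print(footprints.geometry.iloc[0].geom_type)  # Polygon
```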
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Demo data files for the program GeoPIXE (with a simple directory structure for a single user). Use these in conjunction with the worked-examples notes to aid in training to use the GeoPIXE program for SXRF and PIXE imaging. They can be used for personal training; however, the Linux version is better suited to multiple users and a training workshop. The collection has been expanded to make it suitable for self-guided personal training.
Requires the GeoPIXE program, which is available from geopixe@csiro.au and will soon be released as open source on GitHub. GeoPIXE runs under IDL, which must be obtained separately.
Lineage: Data was produced using a range of detectors, such as Ge and Si(Li), SDD and the Maia 384 element detector array, at various synchrotron and ion-beam laboratories, including the XFM X-ray microprobe beamline of the Australian Synchrotron, the 2-ID-E beamline at the Advanced Photon Source, the CSIRO Maia Mapper and the CSIRO Nuclear Microprobe, and processed using the GeoPIXE software package.
Polygons of building footprints clipped to Broward County. This is a Microsoft product.
The original dataset contains 125,192,184 computer-generated building footprints in all 50 US states. This data is freely available for download and use.
The data set was clipped to the Broward County developed boundary.
Additional information: https://github.com/microsoft/USBuildingFootprints/blob/master/README.md
https://www.marketreportanalytics.com/privacy-policy
The Immersive Analytics Software market is experiencing rapid growth, projected to reach $453 million in 2025 and exhibiting a Compound Annual Growth Rate (CAGR) of 33.2%. This robust expansion is driven by several key factors. The increasing adoption of immersive technologies like Virtual Reality (VR) and Augmented Reality (AR) across diverse sectors—business applications, education, healthcare, and public policy—is a significant catalyst. Businesses are leveraging immersive analytics for enhanced data visualization, improved decision-making, and more engaging training programs. The healthcare sector utilizes these tools for surgical planning, medical training simulations, and patient education, while public policy applications focus on creating interactive models for urban planning and disaster response. Furthermore, the continuous advancements in hardware and software capabilities, along with decreasing costs of VR/AR devices, are further fueling market growth. The availability of user-friendly software solutions is also widening the market's accessibility, attracting a larger user base.

However, the market faces certain restraints. The high initial investment required for VR/AR infrastructure and software implementation can be a barrier for smaller organizations. Additionally, concerns regarding data security and privacy, as well as the potential for motion sickness and user fatigue associated with extended use of VR/AR devices, need to be addressed. Despite these challenges, the long-term prospects for immersive analytics remain highly positive, driven by ongoing technological innovations and the increasing demand for more efficient and engaging data analysis solutions across various industries.

Market segmentation reveals a strong preference for PC and Mac applications, but the mobile (iOS) and VR/AR device segments are showing significant growth potential and are expected to capture considerable market share in the coming years. The major players – Immersion Analytics, GitHub, Microsoft, IBM, Accenture, Google, SAP, Meta, HTC, HP, Tibco, and Magic Leap – are actively shaping the market through continuous innovation and strategic partnerships.
This is the project for the course "Visione Artificiale e Riconoscimento" (Computer Vision and Recognition) at the University of Bologna. The project aims to classify videos of word-level American Sign Language into their glosses. It is also possible to classify each signer using traditional methods and representation learning.
data/
- WLASL_v0.3.json
- missing.txt
- labels.npz
- wlasl_class_list.txt
- videos/
- frames_no_bg/
- original_videos_sample/
- hf/
- mp/
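For instance, a minimal sketch of inspecting the arrays stored in labels.npz without assuming its key names in advance:

```
import numpy as np

data = np.load("data/labels.npz")
for key in data.files:
    print(key, data[key].shape)  # list each stored array and its shape
```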
You can find more info about the content of the dataset here
All the WLASL data is intended for academic and computational use only. No commercial usage is allowed.
Made by Dongxu Li and Hongdong Li. Please read the WLASL paper and visit the official website and repository.
Licensed under the Computational Use of Data Agreement (C-UDA). Please refer to the C-UDA-1.0 page for more information.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# Performance Evolution Matrix
This repository contains the artifacts needed to replicate our experiment in the paper "Performance Evolution Matrix".
# Video Demo
[download](https://github.com/jpsandoval/PerfEvoMatrix/blob/master/MatrixMovie.mp4)
# XMLSupport and GraphET Examples
To open the XMLSupport and GraphET examples (which appear in the paper), execute the following commands in a terminal.
**MacOSX.** We ran all the experiments on a MacBook Pro. To open the Matrix, execute the following command in the folder where this project was downloaded.
```
./Pharo-OSX/Pharo.app/Contents/MacOS/Pharo Matrix.image
```
**Windows.**
You may also run the experiment on Windows, but depending on the Windows version you have installed, there may be some UI bugs.
```
cd Pharo-Windows
Pharo.exe ../XMLSupportExample.image
```
**Open the Visualization.**
Please select the following code, then execute it using the green play button (at the top right of the window).
```
ToadBuilder xmlSupportExample.
```
or
```
ToadBuilder graphETExample.
```
**Note.** There are two buttons at the panel's top left: In (zoom in) and Out (zoom out). To move the visualization, just drag the mouse over the panel.
# Experiment
This section describes how to execute the tools for replicating our experiment.
## Baseline
The baseline contains the tools and the project dataset needed to perform the tasks described in the paper (identifying and understanding performance variations).
## Open the Baseline
**MacOSX.** We ran all the experiments on a MacBook Pro. To open the Baseline, execute the following command in the folder where this project was downloaded.
```
./Pharo-OSX/Pharo.app/Contents/MacOS/Pharo Baseline.image
```
**Windows.**
You may also run the experiment on Windows, but depending on the Windows version you have installed, there may be some UI bugs.
```
cd Pharo-Windows
Pharo.exe ../Baseline.image
```
## Open a Project
There are three projects under study. Depending on the project you want to use for the task, execute one of the following scripts. To execute a script, press Cmd-D or right-click and choose "Do it".
**Roassal**
```
TProfileVersion openRoassal.
```
**XML**
```
TProfileVersion openXML.
```
**Grapher**
```
TProfileVersion openGrapher.
```
## Baseline Options
For each project, we provide a UI which contains all the tools we use as a baseline. Each item in the list is a version of the selected project.
- Browse: open a standard window to inspect the code of the project in the selected version.
- Profile: open a window with a call context tree for the selected version.
- Source Diff: open a window with the code differences between the selected version and the previous one.
- Execution Diff: open a window with the merge call context tree gathered from the selected version and the previous one.
**Note.** All these options require that you first select an item in the list.
# Matrix
## Open Matrix Image.
**MacOSX.** We ran all the experiments on a MacBook Pro. To open the Matrix, execute the following command in the folder where this project was downloaded.
```
./Pharo-OSX/Pharo.app/Contents/MacOS/Pharo Matrix.image
```
**Windows.**
You may also run the experiment on Windows, but depending on the Windows version you have installed, there may be some UI bugs.
```
cd Pharo-Windows
Pharo.exe ../Matrix.image
```
## Open a project
There are three projects under study. Depending on the project you want to use for the task, execute one of the following scripts. To execute a script, press Cmd-D or right-click and choose "Do it".
**Roassal**
```
ToadBuilder roassal.
```
**XML**
```
ToadBuilder xml.
```
**Grapher**
```
ToadBuilder grapher.
```
# Data Gathering
Before each participant starts a task, we execute the following script in Smalltalk (to execute a script, press Cmd-D or right-click and choose "Do it"). It allows us to track the time at which a user starts the experiment, as well as the number of mouse clicks and mouse movements.
```
UProfiler newSession.
UProfiler current start.
```
After finishing the task, we executed the following script. It stops recording the mouse events and saves the stop time.
```
UProfiler current end.
```
The last script generates a file with the following information: start time, end time, number of clicks, number of mouse movements, and the number of mouse drags (we do not use this last one).
```
11:34:52.5205 am,11:34:56.38016 am,14,75,0
```
# Quit
To close the artifact, just close the window, or click in any free space of the window and select Quit.
WLASL is the largest video dataset for Word-Level American Sign Language (ASL) recognition, featuring 2,000 common words in ASL. We hope WLASL will facilitate research in sign language understanding and eventually benefit communication between the deaf and hearing communities.
The WLASL_v0.3.json file contains the glossary and instances of the videos.
Inside the videos folder there are about 12k videos, each named after its corresponding video_id.
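A minimal sketch of walking the gloss/instance structure; the field names ("gloss", "instances", "video_id") follow the WLASL repository's format, so verify them against your copy of the JSON:

```
import json
import os

with open("WLASL_v0.3.json") as f:
    glosses = json.load(f)

for entry in glosses:
    for inst in entry["instances"]:
        path = os.path.join("videos", inst["video_id"] + ".mp4")
        if os.path.exists(path):  # skip instances whose video is absent
            print(entry["gloss"], path)
```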
All the WLASL data is intended for academic and computational use only. No commercial usage is allowed.
Made by Dongxu Li and Hongdong Li. Please read the WLASL paper and visit the official website and repository.
Licensed under the Computational Use of Data Agreement (C-UDA). Please refer to the C-UDA-1.0 page for more information.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Multi-Modal-CelebA-HQ (MM-CelebA-HQ) is a dataset containing 30,000 high-resolution face images selected from CelebA, following CelebA-HQ. Each image in the dataset is accompanied by a semantic mask, sketch, descriptive text, and an image with a transparent background.
Multi-Modal-CelebA-HQ can be used to train and evaluate algorithms for a range of face generation and understanding tasks, including text-to-image generation, sketch-to-image generation, text-guided image editing, image captioning, and visual question answering. This dataset is introduced and employed in TediGAN.
TediGAN: Text-Guided Diverse Face Image Generation and Manipulation.
Weihao Xia, Yujiu Yang, Jing-Hao Xue, and Baoyuan Wu.
CVPR 2021.
This section outlines the process of generating the data for our task.
The scripts provided here are not restricted to the CelebA-HQ dataset and can be utilized to preprocess any dataset that includes attribute annotations, be it image, video, or 3D shape data. This flexibility enables the creation of custom datasets that meet specific requirements. For example, the create_caption.py script can be applied to generate diverse descriptions for each video by using video facial attributes (e.g., those provided by CelebV-HQ), leading to a text-video dataset, similar to CelebV-Text.
Please download celeba-hq-attribute.txt (CelebAMask-HQ-attribute-anno.txt) and run the following script.
python create_caption.py
The generated textual descriptions can be found at ./celeba_caption.
Please fill out the form to request the processing script. If feasible, please send me a follow-up email after submitting the form to remind me.
If Photoshop is available to you, apply the Photocopy filter in Photoshop to extract edges. Photoshop allows batch processing, so you don't have to manually process each image. The Sobel operator is an alternative way to extract edges when Photoshop is unavailable or a simpler approach is preferred. This process preserves...
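A minimal OpenCV sketch of the Sobel alternative, assuming an illustrative input file name; it computes the gradient magnitude and inverts it to give dark lines on a white background:

```
import cv2
import numpy as np

img = cv2.imread("face.jpg", cv2.IMREAD_GRAYSCALE)  # illustrative file name
gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)      # horizontal gradient
gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)      # vertical gradient
magnitude = np.sqrt(gx ** 2 + gy ** 2)
edges = 255 - cv2.convertScaleAbs(magnitude)        # invert for a sketch look
cv2.imwrite("face_sketch.jpg", edges)
```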
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 10,000 samples designed for drone navigation and obstacle avoidance research. It includes RGB images (320x320), Depth maps (320x320), and corresponding Commands (vx, vy, vz, yaw_rate). The data was collected in AirSim, a realistic drone simulator by Microsoft, using a drone controlled by a script implementing potential fields for navigation and obstacle avoidance.
This dataset is ideal for researchers and developers working on autonomous drone navigation, computer vision, or robotics projects involving RGB and Depth data.
- rgb/: directory containing 10,000 RGB images (e.g., 000000.png, ..., 009999.png)
- depth/: directory containing 10,000 depth maps as NumPy arrays (e.g., 000000.npy, ..., 009999.npy)
- commands/: directory containing 10,000 commands as NumPy arrays (e.g., 000000.npy, ..., 009999.npy), each file with 4 values: vx, vy, vz, yaw_rate

This dataset is suitable for:
- Developing models for autonomous drone navigation
- Research in obstacle avoidance and path planning
- Computer vision tasks involving RGB and Depth data
- Robotics and simulation-based studies
Example use case: Use the RGB and Depth data to develop algorithms for real-time obstacle avoidance in drones.
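For example, a minimal sketch of loading one aligned sample, following the zero-padded naming scheme described above (shapes per the dataset description):

```
import numpy as np
from PIL import Image

idx = "000000"
rgb = np.array(Image.open(f"rgb/{idx}.png"))           # (320, 320, 3)
depth = np.load(f"depth/{idx}.npy")                    # 320x320 depth map
vx, vy, vz, yaw_rate = np.load(f"commands/{idx}.npy")  # 4 command values
print(rgb.shape, depth.shape, vx, vy, vz, yaw_rate)
```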
This dataset is licensed under CC BY 4.0. You are free to use, modify, and distribute it as long as you provide attribution to the author and acknowledge the source of the data:
- Attribution: "Dataset DroneFlight_Obs_AvoidanceAirSimRGBDepth10k_320x320 by https://www.kaggle.com/lukpellant, data generated using AirSim (MIT License)."
- AirSim License: the data was collected in AirSim, which is licensed under the MIT License (https://github.com/microsoft/AirSim).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ORBIT (Object Recognition for Blind Image Training) India Dataset is a collection of 105,243 images of 76 commonly used objects, collected by 12 individuals in India who are blind or have low vision. This dataset is an "Indian subset" of the original ORBIT dataset [1, 2], which was collected in the UK and Canada. In contrast to the ORBIT dataset, which was created in a Global North, Western, and English-speaking context, the ORBIT-India dataset features images taken in a low-resource, non-English-speaking, Global South context, home to 90% of the world’s population of people with blindness. Since it is easier for blind or low-vision individuals to gather high-quality data by recording videos, this dataset, like the ORBIT dataset, contains images (each sized 224x224) derived from 587 videos. These videos were taken by our data collectors from various parts of India using the Find My Things [3] Android app. Each data collector was asked to record eight videos of at least 10 objects of their choice.
Collected between July and November 2023, this dataset represents a set of objects commonly used by people who are blind or have low vision in India, including earphones, talking watches, toothbrushes, and typical Indian household items like a belan (rolling pin), and a steel glass. These videos were taken in various settings of the data collectors' homes and workspaces using the Find My Things Android app.
The image dataset is stored in the ‘Dataset’ folder, organized into folders assigned to each data collector (P1, P2, ... P12). Each collector's folder includes sub-folders named with the object labels as provided by our data collectors. Within each object folder there are two subfolders: ‘clean’ for images taken on clean surfaces and ‘clutter’ for images taken in cluttered environments where the objects are typically found. The annotations are saved inside an ‘Annotations’ folder containing one JSON file per video (e.g., P1--coffee mug--clean--231220_084852_coffee mug_224.json) with keys corresponding to all frames/images in that video (e.g., "P1--coffee mug--clean--231220_084852_coffee mug_224--000001.jpeg": {"object_not_present_issue": false, "pii_present_issue": false}, ...). The ‘object_not_present_issue’ key is True if the object is not present in the image, and the ‘pii_present_issue’ key is True if personally identifiable information (PII) is present in the image. Note: all PII present in the images has been blurred to protect the identity and privacy of our data collectors. This dataset version was created by cropping images originally sized at 1080 × 1920; an unscaled version of the dataset will follow soon.
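A minimal sketch of filtering usable frames from one per-video annotation file; the file name is taken from the example above:

```
import json

ann_path = "Annotations/P1--coffee mug--clean--231220_084852_coffee mug_224.json"
with open(ann_path) as f:
    frames = json.load(f)

# Keep frames where the object is visible and no PII was flagged.
usable = [name for name, flags in frames.items()
          if not flags["object_not_present_issue"]
          and not flags["pii_present_issue"]]
print(f"{len(usable)} of {len(frames)} frames usable")
```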
This project was funded by the Engineering and Physical Sciences Research Council (EPSRC) Industrial ICASE Award with Microsoft Research UK Ltd. as the Industrial Project Partner. We would like to acknowledge and express our gratitude to our data collectors for their efforts and time invested in carefully collecting videos to build this dataset for their community. The dataset is designed for developing few-shot learning algorithms, aiming to support researchers and developers in advancing object-recognition systems. We are excited to share this dataset and would love to hear from you if and how you use this dataset. Please feel free to reach out if you have any questions, comments or suggestions.
REFERENCES:
[1] Daniela Massiceti, Lida Theodorou, Luisa Zintgraf, Matthew Tobias Harris, Simone Stumpf, Cecily Morrison, Edward Cutrell, and Katja Hofmann. 2021. ORBIT: A real-world few-shot dataset for teachable object recognition collected from people who are blind or low vision. DOI: https://doi.org/10.25383/city.14294597
[2] microsoft/ORBIT-Dataset. https://github.com/microsoft/ORBIT-Dataset
[3] Linda Yilin Wen, Cecily Morrison, Martin Grayson, Rita Faia Marques, Daniela Massiceti, Camilla Longden, and Edward Cutrell. 2024. Find My Things: Personalized Accessibility through Teachable AI for People who are Blind or Low Vision. In Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems (CHI EA '24). Association for Computing Machinery, New York, NY, USA, Article 403, 1-6. https://doi.org/10.1145/3613905.3648641
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the dataset on which CCC-BERT was fine-tuned to classify whether a user input requires a new context retrieval by the RAG model or not. The creation of the dataset can be found in this notebook, in section 2.
The dataset consists of two files:
context_chats_35k.json: this file contains 35,000 multi-turn chats on random topics, all ending on a user's input, synthetically produced with GPT-3.5 and labeled with a fetch_context flag that tells us whether the user input should require the retrieval of new context or not. For a more detailed explanation of this flag, please consult the CCC-BERT model card.
chit-chat_dataset.tsv: contains around 10,000 "nonsense" chats provided by Microsoft on this GitHub repository. I've added this small dataset as it can be used to augment the chats a bit more, but the higher quality lies within the context_chats_35k.json file.
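A minimal sketch of loading both files; the JSON layout (a list of labeled multi-turn chats) is an assumption based on the description above, so consult the linked notebook for the authoritative schema:

```
import json
import pandas as pd

with open("context_chats_35k.json") as f:
    chats = json.load(f)  # assumed: a list of labeled multi-turn chats
print(len(chats), "multi-turn chats")

chit_chat = pd.read_csv("chit-chat_dataset.tsv", sep="\t")
print(chit_chat.shape)
```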
Multi-turn chats are useful for fine-tuning LLMs and for training on LLM tasks such as POS tagging, lemmatization, named-entity recognition, and more.
Feel free to utilize these chats to your liking.
I wanted to run data analysis and machine learning on a large dataset to build my data science skills, but I felt out of touch with the various datasets available, so I thought: how about I try to build my own dataset?
I wondered what data should be in the dataset and settled on online digital game purchases, since I am an avid gamer. Imagine getting sales data from the PlayStation Store or Xbox Microsoft Store; this is what I was aiming to replicate.
I envisaged the dataset to be data created through the purchase of a digital game on either the UK PlayStation Store or Xbox Microsoft Store. Considering this, the scope of the dataset varies depending on which column of data you are viewing, for example:
- Date and Time: purchases were defined between a start/end date (this can be altered, see point 4) and, of course, any time across the 24-hour clock
- Geographically: purchases were set up to come from any postcode in the UK; in total this is over 1,000,000 active postcodes
- Purchases: the list of game titles available for purchase is 24
- Registered Banks: the list of registered banks in the UK (as of 03/2022) was 159
To generate the dataset, I built a function in Python. This function, when called with the number of rows you want in your dataset, will generate the dataset: for example, calling function(1000) will provide you with a dataset of 1,000 rows. A sketch of the idea appears after the notes below.
Considering this, if just over 42,000 rows of data (42,892 to be exact) isn't enough, feel free to check out the code on my GitHub to run the function yourself with as many rows as you want.
Note: You can also edit the start/end dates of the function depending on which timespan you want the dataset to cover.
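A minimal sketch of the idea behind the generator (the real function lives in the GitHub repo; every list below is an illustrative stand-in for the full 24 titles, 159 banks, and 1,000,000+ postcodes):

```
import random
from datetime import datetime, timedelta

GAMES = ["Game A", "Game B", "Game C"]                  # stand-in for 24 titles
BANKS = ["Bank 1", "Bank 2"]                            # stand-in for 159 banks
STORES = ["PlayStation Store", "Xbox Microsoft Store"]

def generate_dataset(n_rows, start=datetime(2022, 1, 1), end=datetime(2022, 12, 31)):
    """Generate n_rows of synthetic purchases between start and end."""
    span = (end - start).total_seconds()
    return [{
        "datetime": start + timedelta(seconds=random.uniform(0, span)),
        "store": random.choice(STORES),
        "game": random.choice(GAMES),
        "bank": random.choice(BANKS),
        "postcode": f"AB{random.randint(1, 99)} {random.randint(1, 9)}XY",  # illustrative
    } for _ in range(n_rows)]

rows = generate_dataset(1000)  # cf. function(1000) above
```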
As stated above, this dataset is still a work in progress and is therefore not 100% perfect. There is a backlog of issues that need to be resolved; feel free to check out the backlog.
One example of this is that, in various columns, the distribution of values is uniform, when in fact, for the dataset to be entirely random, this should not be the case. An example of this issue is the Time column. These issues will be resolved in a later update.
This is a free desktop computer-aided diagnosis (CAD) tool that uses computer vision to detect and localize masses on full-field digital mammograms. It's a Flask app that runs on the desktop. Internally there are two ensembled YOLOv5L models that were trained on data from the VinDr-Mammo dataset. The model ensemble has a validation accuracy of 0.65 and a validation recall of 0.63.
My aim was to create a proof of concept for a free desktop computer aided diagnosis (CAD) system that could be used as an aid when diagnosing breast cancer. Unlike a web app, this tool does not need an internet connection and there are no monthly costs for hosting and web server rental. I think a desktop tool could be helpful to radiologists in private practice and to medical non-profits that work in remote areas.
The complete project folder, including the trained models, is stored in this Kaggle dataset.
For a full project description please refer to the GitHub repo: https://github.com/vbookshelf/Mammogram-Mass-Analyzer
For info on model training and validation, please refer to the model card. I've included a confusion matrix and classification report. https://github.com/vbookshelf/Mammogram-Mass-Analyzer/blob/main/mammogram-mass-analyzer-v0.0/Model-Card-and-App-Info.pdf
Demo (animated GIF) showing what happens after a user submits three DICOM mammograms.
The project folder (named mammogram-mass-analyzer-v0.1) is stored in this Kaggle dataset.
I suggest that you download the project folder from Kaggle instead of from the GitHub repo. This is because the project folder on Kaggle includes the two trained models; the project folder on GitHub does not, because GitHub does not allow files larger than 25MB to be uploaded.
The models are located inside a folder called TRAINED_MODEL_FOLDER, which is located inside the yolov5 folder:
mammogram-mass-analyzer-v0.0/yolov5/TRAINED_MODEL_FOLDER/
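A minimal sketch of loading one of the trained ensemble members via the standard YOLOv5 custom-weights hook; the weight file name inside TRAINED_MODEL_FOLDER is an assumption, so check the folder contents:

```
import torch

model = torch.hub.load(
    "ultralytics/yolov5", "custom",
    path="mammogram-mass-analyzer-v0.0/yolov5/TRAINED_MODEL_FOLDER/best.pt",  # assumed name
)
results = model("mammogram.png")  # illustrative input image
results.print()                   # detections with confidences
```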
This is a standard flask app. The steps to set up and run the app are the same for both Mac and Windows.
This app is based on Flask and PyTorch, both of which are pure Python. If you encounter any errors during installation, you should be able to solve them quite easily; you won't have to deal with the package dependency issues that can happen when using TensorFlow.
The instructions below are for a Mac. I didn't include instructions for Windows because I don't have a Windows PC and therefore could not test the installation process on Windows. If you're using a Windows PC, please adapt the commands below to suit Windows.
You’ll need an internet connection during the first setup. After that you’ll be able to use the app without an internet connection.
If you are a beginner you may find these resources helpful:
The Complete Guide to Python Virtual Environments! Teclado (Includes instructions for Windows) https://www.youtube.com/watch?v=KxvKCSwlUv8&t=947s
How To Create Python Virtual Envi...
https://creativecommons.org/publicdomain/zero/1.0/
The Shake phenomenon occurs when the competition switches between two different test sets:
\[ \text{Public test set} \Rightarrow \text{Private test set} \quad \Leftrightarrow \quad \text{LB-public} \Rightarrow \text{LB-private} \]
The private test set, which was so far unavailable, becomes available, and thus the models' scores are re-calculated. This re-evaluation elicits a corresponding re-ranking of the contestants in the competition. The shake allows participants to assess the severity of their overfitting to the public dataset, and to act to improve their model until the deadline.
Unable to find a uniform, conventional term for this mechanism, I will use my common sense to define the following intuition:
<img src="https://github.com/Daniboy370/Uploads/blob/master/Kaggle-shake-ups/images/latex.png?raw=true" width="550">
From the starter kernel :
<img src="https://github.com/Daniboy370/Uploads/blob/master/Kaggle-shake-ups/vids/shakeup_VID.gif?raw=true" width="625">
Seven datasets of competitions that were scraped from Kaggle:
| Competition | Name of file |
|---|---|
| Elo Merchant Category Recommendation | df_{Elo} |
| Human Protein Atlas Image Classification | df_{Protein} |
| Humpback Whale Identification | df_{Humpback} |
| Microsoft Malware Prediction | df_{Microsoft} |
| Quora Insincere Questions Classification | df_{Quora} |
| TGS Salt Identification Challenge | df_{TGS} |
| VSB Power Line Fault Detection | df_{VSB} |
As an example, consider the following dataframe from the Quora competition :
| Team Name | Rank-private | Rank-public | Shake | Score-private | Score-public |
|---|---|---|---|---|---|
| The Zoo | 1 | 7 | 6 | 0.71323 | 0.71123 |
| ... | ... | ... | ... | ... | ... |
| D.J. Trump | 1401 | 65 | -1336 | 0.000 | 0.70573 |
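For instance, a minimal sketch of recomputing the Shake column from the ranks (the file name is illustrative; column names follow the header above):

```
import pandas as pd

df = pd.read_csv("df_Quora.csv")  # illustrative file name
df["Shake"] = df["Rank-public"] - df["Rank-private"]
print(df.nlargest(5, "Shake"))    # biggest risers, e.g. The Zoo: 7 - 1 = +6
print(df.nsmallest(5, "Shake"))   # biggest fallers, e.g. 65 - 1401 = -1336
```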
I encourage everybody to investigate the dataset thoroughly in search of interesting findings!
\[ \text{Enjoy!} \]
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Our public malware dataset was generated by Cuckoo Sandbox based on Windows OS API-call analysis. It is provided in CSV file format for cyber security researchers to use in malware analysis and machine learning applications.
Cite the Dataset
If you find these results useful, please cite:
@article{10.7717/peerj-cs.285,
title = {Deep learning based Sequential model for malware analysis using Windows exe API Calls},
author = {Catak, Ferhat Ozgur and Yazı, Ahmet Faruk and Elezaj, Ogerta and Ahmed, Javed},
year = 2020,
month = jul,
keywords = {Malware analysis, Sequential models, Network security, Long-short-term memory, Malware dataset},
volume = 6,
pages = {e285},
journal = {PeerJ Computer Science},
issn = {2376-5992},
url = {https://doi.org/10.7717/peerj-cs.285},
doi = {10.7717/peerj-cs.285}
}
The details of the Mal-API-2019 dataset are published in the following papers:
- [Link] AF. Yazı, FÖ Çatak, E. Gül, Classification of Metamorphic Malware with Deep Learning (LSTM), IEEE Signal Processing and Applications Conference, 2019.
- [Link] Catak, FÖ., Yazi, AF., A Benchmark API Call Dataset for Windows PE Malware Classification, arXiv:1905.01999, 2019.
This study seeks to obtain data that will help address gaps in machine-learning-based malware research. The specific objective of this study is to build a benchmark dataset of Windows operating system API calls for various malware. This is the first study to use metamorphic malware to build sequential API calls. It is hoped that this research will contribute to a deeper understanding of how metamorphic malware change their behavior (i.e., API calls) by adding meaningless opcodes with their own disassembler/assembler parts.
In our research, we have translated the family labels produced by each piece of software into 8 main malware families: Trojan, Backdoor, Downloader, Worms, Spyware, Adware, Dropper, Virus. Table 1 shows the number of malware samples belonging to each malware family in our data set. As you can see in the table, the sample counts of the malware families other than Adware are quite close to each other. This difference exists because we did not find much malware from the Adware family.
| Malware Family | Samples | Description |
|---|---|---|
| Spyware | 832 | enables a user to obtain covert information about another's computer activities by transmitting data covertly from their hard drive. |
| Downloader | 1001 | share the primary functionality of downloading content. |
| Trojan | 1001 | misleads users of its true intent. |
| Worms | 1001 | spreads copies of itself from computer to computer. |
| Adware | 379 | hides on your device and serves you advertisements. |
| Dropper | 891 | surreptitiously carries viruses, back doors and other malicious software so they can be executed on the compromised machine. |
| Virus | 1001 | designed to spread from host to host and has the ability to replicate itself. |
| Backdoor | 1001 | a technique in which a system security mechanism is bypassed undetectably to access a computer or its data. |
The figure shows the general flow of the generation of the malware data set. As shown in the figure, we obtained the MD5 hash values of the malware we collected from GitHub. We searched for these hash values using the VirusTotal API and obtained the families of this malicious software from the reports of 67 different antivirus engines in VirusTotal. We observed that the malware families reported by these 67 different antivirus engines differ from one another.