This dataset was curated from Bing search logs (desktop users only) over the period January 1st, 2020 – (Current Month - 1). Only searches that were issued many times by multiple users were included. The dataset includes queries from all over the world that had an intent related to the Coronavirus or Covid-19. In some cases this intent is explicit in the query itself, e.g. “Coronavirus updates Seattle”; in other cases it is implicit, e.g. “Shelter in place”. Implicit intent of search queries (e.g. “Toilet paper”) was extracted using the random-walks-on-the-click-graph approach outlined in the following paper by Nick Craswell et al. at Microsoft Research: https://www.microsoft.com/en-us/research/wp-content/uploads/2007/07/craswellszummer-random-walks-sigir07.pdf. All personal data was removed. Source: https://msropendata.com/datasets/c5031874-835c-48ed-8b6d-31de2dad0654
Data Source: Bing Coronavirus Query set (https://github.com/microsoft/BingCoronavirusQuerySet)
Inside the data folder there is a folder 2020 (for the year) which contains two kinds of files.
- QueriesByCountry_DateRange.tsv: a tab-separated text file that contains queries with Coronavirus intent, by country.
- QueriesByState_DateRange.tsv: a tab-separated text file that contains queries with Coronavirus intent, by state.
QueriesByCountry
- Date: string. Date on which the query was issued.
- Query: string. The actual search query issued by user(s).
- IsImplicitIntent: bool. True if the query did not mention covid, coronavirus, or sarsncov2 (e.g., “Shelter in place”); False otherwise.
- Country: string. Country from where the query was issued.
- PopularityScore: int. Value between 1 and 100 inclusive; 1 indicates the least popular query with Coronavirus intent for that country on that day, and 100 indicates the most popular query for the same country on the same day.
QueriesByState
- Date: string. Date on which the query was issued.
- Query: string. The actual search query issued by user(s).
- IsImplicitIntent: bool. True if the query did not mention covid, coronavirus, or sarsncov2 (e.g., “Shelter in place”); False otherwise.
- State: string. State from where the query was issued.
- Country: string. Country from where the query was issued.
- PopularityScore: int. Value between 1 and 100 inclusive; 1 indicates the least popular query with Coronavirus intent for that day/State/Country, and 100 indicates the most popular query for the same geography on the same day.
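For illustration, here is a minimal pandas sketch of loading one of the country files; the date range in the file name is an assumption for illustration, and the columns are exactly those documented above:

```
import pandas as pd

# Load one QueriesByCountry file; the date range in the name is illustrative.
df = pd.read_csv("2020/QueriesByCountry_2020-01-01_2020-01-31.tsv", sep="\t")

# Implicit-intent queries only (no explicit covid/coronavirus mention).
implicit = df[df["IsImplicitIntent"] == True]

# The most popular coronavirus-intent query per country per day.
top = df[df["PopularityScore"] == 100][["Date", "Country", "Query"]]
print(top.head())
```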
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
A collection of useful datasets extracted from https://packages.ecosyste.ms and https://repos.ecosyste.ms for use at the CZI Hackathon: Mapping the Impact of Research Software in Science.
All data is provided as NDJSON (newline-delimited JSON): each line is a valid JSON object, and records are separated by newline characters. There are Python and R libraries for reading these files, or you can manually read the file line by line and parse each line as a single JSON object.
Each NDJSON file has been compressed with gzip (actual command: `tar -czvf`) to reduce download size; the files expand to significantly larger sizes after extraction.
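For example, a minimal sketch of streaming one of the files record by record; the file name is illustrative, and since the archives were built with `tar -czvf` you may need to un-tar them first:

```
import gzip
import json

# Stream a gzip-compressed NDJSON file one record at a time,
# without loading the whole (large) file into memory.
with gzip.open("github.ndjson.gz", "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)  # each line is one valid JSON object
        # ... process `record` here ...
```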
Package names from cran, bioconductor and pypi that have been parsed by the software-mentions project (data: https://datadryad.org/stash/dataset/doi:10.5061/dryad.6wwpzgn2c) are collected together with their latest release at time of publishing, along with the names of their dependencies. Those dependency names have then also been recursively fetched, each with its latest release and dependencies, until the full list of transitive dependencies is included.
Note: this approach uses a simplified method of dependency resolution, always picking the latest version of each package rather than taking into account each dependency's specific version-range requirements. This is primarily due to time constraints, and it allows all software ecosystems to be processed in the same way. A future improvement would be to use each package ecosystem's specific dependency resolution algorithm to compute the full transitive dependency tree for each mentioned software package.
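A minimal sketch of this simplified resolution, assuming a hypothetical `fetch_latest(name)` helper (e.g. one backed by the https://packages.ecosyste.ms API) that returns a package's latest release as a dict with a list of dependency names:

```
from collections import deque

def resolve_transitive(root_packages, fetch_latest):
    """Always pick the latest release of each package and recurse over its
    dependency names, ignoring version-range requirements."""
    resolved = {}
    queue = deque(root_packages)
    while queue:
        name = queue.popleft()
        if name in resolved:
            continue  # one (latest) version per package, ever
        release = fetch_latest(name)  # hypothetical helper
        resolved[name] = release
        queue.extend(release.get("dependencies", []))
    return resolved
```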
Two different approaches were taken for collecting data for referenced GitHub mentions:
1. `github.ndjson` is metadata for each repository from GitHub, including "manifest" files (known files that contain dependency information for a project, such as requirements.txt, DESCRIPTION and package.json) parsed using https://github.com/ecosyste-ms/bibliothecary; this may include transitive dependencies discovered in a `lockfile` within the repository.
2. `github_packages.ndjson` is metadata for each package found on any package manager that references the GitHub URL as its repository URL/source/homepage. These packages, like the cran and pypi data above, include the latest release and their direct dependencies; there may be more than one package per GitHub URL, as this is a one-to-many relationship. `github_packages_with_transitive.ndjson` follows the same format but also includes the resolved transitive dependencies of all packages, using the same approach (and the same caveats) as the cran and pypi data above.
There are also many more ecosystems referenced in these files than just cran, bioconductor and pypi; https://packages.ecosyste.ms provides a standardized metadata format for all of them to enable comparison and simplify automation.
If you would like any help, support or more data from Ecosyste.ms please do get in touch via email: hello@ecosyste.ms or open an issue on GitHub: https://github.com/ecosyste-ms/packages/issues
This dataset contains citations from USPTO patents granted 1947-2018 to articles captured by the Microsoft Academic Graph (MAG) from 1800-2018. If you use the data, please cite these two papers: for the dataset of citations, Marx, Matt and Aaron Fuegi, "Reliance on Science in Patenting: USPTO Front-Page Citations to Scientific Articles" (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3331686); for the underlying dataset of papers, Sinha, Arnab, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW ’15 Companion). ACM, New York, NY, USA, 243-246.

The main file, pcs.tsv, contains the resolved citations. Fields are tab-separated. Each match has the patent number, MAG ID, the original citation from the patent, an indicator for whether the citation was supplied by the applicant, examiner, or unknown, and a confidence score (1-10) indicating how likely it is that the match is correct. Note that this distribution does not contain matches with confidence 1 or 2. There is also a PubMed-specific match in pcs-pubmed.tsv.

The remaining files are a redistribution of the 1 January 2019 release of the Microsoft Academic Graph. All of these files are compressed using ZIP compression under CentOS5. Original files, documented at https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema, can be downloaded from https://aka.ms/msracad; this redistribution carves up the original files into smaller, variable-specific files that can be loaded individually (see _relianceonscience.pdf for full details). Source code for generating the patent citations to science in pcs.tsv is available at https://github.com/mattmarx/reliance_on_science. Source code for generating jif.zip and jcif.zip (Journal Impact Factor and Journal Commercial Impact Factor) is at https://github.com/mattmarx/jcif.

Although MAG contains authors and affiliations for each paper, it does not contain the location for affiliations. We have created a dataset of locations for affiliations appearing at least 100x using Bing Maps and Google Maps; however, it is unclear to us whether the API licensing terms allow us to repost their data. In any case, you can download our source code for doing so here: https://github.com/ksjiaxian/api-requester-locations.

MAG extracts field keywords for each paper (paperfieldid.zip and fieldidname.zip), more than 200,000 fields in all! When looking to study industries or technical areas you might find this a bit overwhelming, so we mapped the MAG subjects to six OECD fields and 39 subfields, defined here: http://www.oecd.org/science/inno/38235147.pdf. Clarivate provides a crosswalk between the OECD classifications and Web of Science fields, so we include WoS fields as well. This file is magfield_oecd_wos_crosswalk.zip.
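As a quick start, here is a minimal sketch of loading the pcs.tsv match file described above; the column names are assumptions based on the field description, so check the actual header (or _relianceonscience.pdf) before relying on them:

```
import pandas as pd

# Assumed column order, following the field description above.
cols = ["patent", "magid", "rawcitation", "wherefound", "confidence"]
pcs = pd.read_csv("pcs.tsv", sep="\t", names=cols, header=None)

# Keep only high-confidence matches (confidence 1-2 are already excluded).
high_conf = pcs[pcs["confidence"] >= 8]
print(len(high_conf), "high-confidence patent-paper citation pairs")
```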
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This item is part of the SAFEHR scheme at UCLH. The purpose of the scheme is to publicly display what kinds of patient data we use, to encourage collaboration and transparency. More information can be found at https://safehr-data.github.io/uclh-research-discovery/. This dataset describes the structured health records used as part of the MS-PINPOINT project at UCL. It mainly describes patient demographics such as patient-reported gender, ethnicity and other features. Any category with fewer than 5 entries is not reported, in line with privacy guidelines.
DeBERTa v3 variants. Downloaded using:
```
sudo apt-get install git-lfs
git lfs install
git clone https://huggingface.co/microsoft/deberta-v3-xsmall
git clone https://huggingface.co/microsoft/deberta-v3-small
git clone https://huggingface.co/microsoft/deberta-v3-base
git clone https://huggingface.co/microsoft/deberta-v3-large
```
For more details refer to:
- https://github.com/microsoft/DeBERTa
- https://huggingface.co/microsoft/deberta-v3-xsmall
- https://huggingface.co/microsoft/deberta-v3-small
- https://huggingface.co/microsoft/deberta-v3-base
- https://huggingface.co/microsoft/deberta-v3-large
The objective of this upload is to use the trained models in Kaggle competitions without needing to connect to the internet, as sketched below.
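A minimal sketch of loading one of the cloned checkpoints from a local path in a Kaggle notebook; the input path is an assumption, so adjust it to wherever the clone is attached:

```
from transformers import AutoModel, AutoTokenizer

model_dir = "/kaggle/input/deberta-v3/deberta-v3-base"  # assumed mount path
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModel.from_pretrained(model_dir)

inputs = tokenizer("DeBERTa v3 running fully offline.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```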
There is no intention on my part to infringe rights of any kind; I simply want to use these models in competitions that require no internet connection. If you are one of the rights holders of these models and feel your rights are being infringed by this upload, please contact me and I will rectify the issue as soon as possible.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Tarrant County Building Footprints: computer-generated building footprints for the United States. The original dataset contains 125,192,184 computer-generated building footprints in all 50 US states and is freely available for download and use. It has been pared down here to include only Tarrant County building footprints; the filter extent used also includes portions of the surrounding counties. License: this data is licensed by Microsoft under the Open Data Commons Open Database License (ODbL). What the data include: approximately 125 million building footprint polygon geometries in all 50 US states, in GeoJSON format. Source: https://github.com/Microsoft/USBuildingFootprints
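A minimal sketch of loading the footprints with geopandas; the file name is illustrative, and the statewide source files from the Microsoft repo use the same GeoJSON format:

```
import geopandas as gpd

# Illustrative file name for the pared-down county extract.
footprints = gpd.read_file("TarrantCountyBuildingFootprints.geojson")
print(len(footprints), "building footprint polygons")
print(footprints.geometry.iloc[0].geom_type)  # Polygon
```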
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Demo data files for the program GeoPIXE (with a simple directory structure for a single user). Use these in conjunction with the worked-examples notes to aid in training to use the GeoPIXE program for SXRF and PIXE imaging. They can be used for personal training; however, the Linux version is better suited to multiple users and a training workshop. The collection has been expanded to make it suitable for self-guided personal training.
Requires the GeoPIXE program, which is available from geopixe@csiro.au and will soon be released as open source on GitHub. GeoPIXE runs under IDL, which must be obtained separately.
Lineage: Data was produced using a range of detectors, such as Ge and Si(Li), SDD and the Maia 384 element detector array, at various synchrotron and ion-beam laboratories, including the XFM X-ray microprobe beamline of the Australian Synchrotron, the 2-ID-E beamline at the Advanced Photon Source, the CSIRO Maia Mapper and the CSIRO Nuclear Microprobe, and processed using the GeoPIXE software package.
Polygons of building footprints clipped to Broward County. This is a Microsoft product.
The original dataset contains 125,192,184 computer-generated building footprints in all 50 US states. This data is freely available for download and use.
The data set was clipped to the Broward County developed boundary.
Additional information: https://github.com/microsoft/USBuildingFootprints/blob/master/README.md
https://www.marketreportanalytics.com/privacy-policy
The Immersive Analytics Software market is experiencing rapid growth, projected to reach $453 million in 2025 and exhibiting a Compound Annual Growth Rate (CAGR) of 33.2%. This robust expansion is driven by several key factors. The increasing adoption of immersive technologies like Virtual Reality (VR) and Augmented Reality (AR) across diverse sectors—business applications, education, healthcare, and public policy—is a significant catalyst. Businesses are leveraging immersive analytics for enhanced data visualization, improved decision-making, and more engaging training programs. The healthcare sector utilizes these tools for surgical planning, medical training simulations, and patient education, while public policy applications focus on creating interactive models for urban planning and disaster response. Furthermore, the continuous advancements in hardware and software capabilities, along with decreasing costs of VR/AR devices, are further fueling market growth. The availability of user-friendly software solutions is also widening the market's accessibility, attracting a larger user base.

However, the market faces certain restraints. The high initial investment required for VR/AR infrastructure and software implementation can be a barrier for smaller organizations. Additionally, concerns regarding data security and privacy, as well as the potential for motion sickness and user fatigue associated with extended use of VR/AR devices, need to be addressed. Despite these challenges, the long-term prospects for immersive analytics remain highly positive, driven by ongoing technological innovations and the increasing demand for more efficient and engaging data analysis solutions across various industries.

Market segmentation reveals a strong preference for PC and Mac applications, but the mobile (iOS) and VR/AR device segments are showing significant growth potential and are expected to capture considerable market share in the coming years. The major players – Immersion Analytics, GitHub, Microsoft, IBM, Accenture, Google, SAP, Meta, HTC, HP, Tibco, and Magic Leap – are actively shaping the market through continuous innovation and strategic partnerships.
This is the project for the course "Visione Artificiale e Riconoscimento" (Computer Vision and Recognition) at the University of Bologna. The project aims to classify videos of word-level American Sign Language into their glosses. It is also possible to classify each signer using traditional methods and representation learning.
data/
- WLASL_v0.3.json
- missing.txt
- labels.npz
- wlasl_class_list.txt
- videos/
- frames_no_bg/
- original_videos_sample/
- hf/
- mp/
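For instance, a minimal sketch of inspecting the arrays stored in labels.npz without assuming its key names in advance:

```
import numpy as np

data = np.load("data/labels.npz")
for key in data.files:
    print(key, data[key].shape)  # list each stored array and its shape
```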
You can find more info about the content of the dataset here
All the WLASL data is intended for academic and computational use only. No commercial usage is allowed.
Made by Dongxu Li and Hongdong Li. Please read the WLASL paper and visit the official website and repository.
Licensed under the Computational Use of Data Agreement (C-UDA). Please refer to the C-UDA-1.0 page for more information.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# Performance Evolution Matrix
This repository contains the artifacts needed to replicate our experiment in the paper "Performance Evolution Matrix".
# Video Demo
[download](https://github.com/jpsandoval/PerfEvoMatrix/blob/master/MatrixMovie.mp4)
# XMLSupport and GraphET Examples
To open the XMLSupport and GraphET examples (which appear in the paper), execute the following commands in a terminal.
**MacOSX.** We ran all the experiments on a MacBook Pro. To open the Matrix, execute the following command in the folder where this project was downloaded.
```
./Pharo-OSX/Pharo.app/Contents/MacOS/Pharo Matrix.image
```
**Windows.**
You may also run the experiment on Windows, but depending on the Windows version you have installed, there may be some UI bugs.
```
cd Pharo-Windows
Pharo.exe ../XMLSupportExample.image
```
**Open the Visualization.**
Please select the following code, then execute it using the green play button (at the top right of the window).
```
ToadBuilder xmlSupportExample.
```
or
```
ToadBuilder graphETExample.
```
**Note.** There are two buttons at the panel's top left: In (zoom in) and Out (zoom out). To move the visualization, just drag the mouse over the panel.
# Experiment
This section describes how to execute the tools for replicating our experiment.
## Baseline
The baseline contains the tools and the project dataset needed to perform the tasks described in the paper (identifying and understanding performance variations).
## Open the Baseline
**MacOSX.** We ran all the experiments on a MacBook Pro. To open the Baseline, execute the following command in the folder where this project was downloaded.
```
./Pharo-OSX/Pharo.app/Contents/MacOS/Pharo Baseline.image
```
**Windows.**
You may also run the experiment on Windows, but depending on the Windows version you have installed, there may be some UI bugs.
```
cd Pharo-Windows
Pharo.exe ../Baseline.image
```
## Open a Project
There are three projects under study. Depending on the project you want to use for the task, execute one of the following scripts. To execute a script, press Cmd-D or right-click and choose "Do it".
**Roassal**
```
TProfileVersion openRoassal.
```
**XML**
```
TProfileVersion openXML.
```
**Grapher**
```
TProfileVersion openGrapher.
```
## Baseline Options
For each project, we provide a UI which contains all the tools we use as a baseline. Each item in the list is a version of the selected project.
- Browse: open a standard window to inspect the code of the project in the selected version.
- Profile: open a window with a call context tree for the selected version.
- Source Diff: open a window with the code differences between the selected version and the previous one.
- Execution Diff: open a window with the merge call context tree gathered from the selected version and the previous one.
**Note.** All these options require that you first select an item in the list.
# Matrix
## Open Matrix Image.
**MacOSX.** We ran all the experiments on a MacBook Pro. To open the Matrix, execute the following command in the folder where this project was downloaded.
```
./Pharo-OSX/Pharo.app/Contents/MacOS/Pharo Matrix.image
```
**Windows.**
You may also run the experiment on Windows, but depending on the Windows version you have installed, there may be some UI bugs.
```
cd Pharo-Windows
Pharo.exe ../Matrix.image
```
## Open a project
There are three projects under study. Depending on the project you want to use for the task, execute one of the following scripts. To execute a script, press Cmd-D or right-click and choose "Do it".
**Roassal**
```
ToadBuilder roassal.
```
**XML**
```
ToadBuilder xml.
```
**Grapher**
```
ToadBuilder grapher.
```
# Data Gathering
Before each participant starts a task, we execute the following script in Smalltalk (to execute a script, press Cmd-D or right-click and choose "Do it"). It allows us to track the time at which a user starts the experiment, as well as the number of mouse clicks and mouse movements.
```
UProfiler newSession.
UProfiler current start.
```
After finishing the task, we executed the following script. It stops recording the mouse events and saves the stop time.
```
UProfiler current end.
```
The last script generates a file with the following information: start time, end time, number of clicks, number of mouse movements, and the number of mouse drags (we do not use this last one).
```
11:34:52.5205 am,11:34:56.38016 am,14,75,0
```
# Quit
To close the artifact, just close the window, or click in any free space of the window and select Quit.
WLASL is the largest video dataset for Word-Level American Sign Language (ASL) recognition, featuring 2,000 common words in ASL. We hope WLASL will facilitate research in sign language understanding and eventually benefit communication between the deaf and hearing communities.
The WLASL_v0.3.json file contains the glossary and instances of the videos.
Inside the videos folder there are about 12k videos, each named after its corresponding video_id.
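A minimal sketch of walking the gloss/instance structure; the field names ("gloss", "instances", "video_id") follow the WLASL repository's format, so verify them against your copy of the JSON:

```
import json
import os

with open("WLASL_v0.3.json") as f:
    glosses = json.load(f)

for entry in glosses:
    for inst in entry["instances"]:
        path = os.path.join("videos", inst["video_id"] + ".mp4")
        if os.path.exists(path):  # skip instances whose video is absent
            print(entry["gloss"], path)
```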
All the WLASL data is intended for academic and computational use only. No commercial usage is allowed.
Made by Dongxu Li and Hongdong Li. Please read the WLASL paper and visit the official website and repository.
Licensed under the Computational Use of Data Agreement (C-UDA). Please refer to the C-UDA-1.0 page for more information.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Multi-Modal-CelebA-HQ (MM-CelebA-HQ) is a dataset containing 30,000 high-resolution face images selected from CelebA, following CelebA-HQ. Each image in the dataset is accompanied by a semantic mask, sketch, descriptive text, and an image with a transparent background.
Multi-Modal-CelebA-HQ can be used to train and evaluate algorithms for a range of face generation and understanding tasks, including text-to-image generation, sketch-to-image generation, text-guided image editing, image captioning, and visual question answering. This dataset is introduced and employed in TediGAN.
TediGAN: Text-Guided Diverse Face Image Generation and Manipulation.
Weihao Xia, Yujiu Yang, Jing-Hao Xue, and Baoyuan Wu.
CVPR 2021.
This section outlines the process of generating the data for our task.
The scripts provided here are not restricted to the CelebA-HQ dataset and can be utilized to preprocess any dataset that includes attribute annotations, be it image, video, or 3D shape data. This flexibility enables the creation of custom datasets that meet specific requirements. For example, the create_caption.py script can be applied to generate diverse descriptions for each video by using video facial attributes (e.g., those provided by CelebV-HQ), leading to a text-video dataset, similar to CelebV-Text.
Please download celeba-hq-attribute.txt (CelebAMask-HQ-attribute-anno.txt) and run the following script.
python create_caption.py
The generated textual descriptions can be found at ./celeba_caption.
Please fill out the form to request the processing script. If feasible, please send me a follow-up email after submitting the form to remind me.
If Photoshop is available to you, apply the Photocopy filter in Photoshop to extract edges. Photoshop allows batch processing, so you don't have to manually process each image. The Sobel operator is an alternative way to extract edges when Photoshop is unavailable or a simpler approach is preferred. This process preserves...
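A minimal OpenCV sketch of the Sobel alternative, assuming an illustrative input file name; it computes the gradient magnitude and inverts it to give dark lines on a white background:

```
import cv2
import numpy as np

img = cv2.imread("face.jpg", cv2.IMREAD_GRAYSCALE)  # illustrative file name
gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)      # horizontal gradient
gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)      # vertical gradient
magnitude = np.sqrt(gx ** 2 + gy ** 2)
edges = 255 - cv2.convertScaleAbs(magnitude)        # invert for a sketch look
cv2.imwrite("face_sketch.jpg", edges)
```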
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 10,000 samples designed for drone navigation and obstacle avoidance research. It includes RGB images (320x320), Depth maps (320x320), and corresponding Commands (vx, vy, vz, yaw_rate). The data was collected in AirSim, a realistic drone simulator by Microsoft, using a drone controlled by a script implementing potential fields for navigation and obstacle avoidance.
This dataset is ideal for researchers and developers working on autonomous drone navigation, computer vision, or robotics projects involving RGB and Depth data.
- rgb/: directory containing 10,000 RGB images (e.g., 000000.png, ..., 009999.png)
- depth/: directory containing 10,000 depth maps as NumPy arrays (e.g., 000000.npy, ..., 009999.npy)
- commands/: directory containing 10,000 commands as NumPy arrays (e.g., 000000.npy, ..., 009999.npy), each file with 4 values: vx, vy, vz, yaw_rate

This dataset is suitable for:
- Developing models for autonomous drone navigation
- Research in obstacle avoidance and path planning
- Computer vision tasks involving RGB and Depth data
- Robotics and simulation-based studies
Example use case: Use the RGB and Depth data to develop algorithms for real-time obstacle avoidance in drones.
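For example, a minimal sketch of loading one aligned sample, following the zero-padded naming scheme described above (shapes per the dataset description):

```
import numpy as np
from PIL import Image

idx = "000000"
rgb = np.array(Image.open(f"rgb/{idx}.png"))           # (320, 320, 3)
depth = np.load(f"depth/{idx}.npy")                    # 320x320 depth map
vx, vy, vz, yaw_rate = np.load(f"commands/{idx}.npy")  # 4 command values
print(rgb.shape, depth.shape, vx, vy, vz, yaw_rate)
```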
This dataset is licensed under CC BY 4.0. You are free to use, modify, and distribute it as long as you provide attribution to the author and acknowledge the source of the data:
- Attribution: "Dataset DroneFlight_Obs_AvoidanceAirSimRGBDepth10k_320x320 by https://www.kaggle.com/lukpellant, data generated using AirSim (MIT License)."
- AirSim License: the data was collected in AirSim, which is licensed under the MIT License (https://github.com/microsoft/AirSim).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ORBIT (Object Recognition for Blind Image Training) India Dataset is a collection of 105,243 images of 76 commonly used objects, collected by 12 individuals in India who are blind or have low vision. This dataset is an "Indian subset" of the original ORBIT dataset [1, 2], which was collected in the UK and Canada. In contrast to the ORBIT dataset, which was created in a Global North, Western, and English-speaking context, the ORBIT-India dataset features images taken in a low-resource, non-English-speaking, Global South context, home to 90% of the world’s population of people with blindness. Since it is easier for blind or low-vision individuals to gather high-quality data by recording videos, this dataset, like the ORBIT dataset, contains images (each sized 224x224) derived from 587 videos. These videos were taken by our data collectors from various parts of India using the Find My Things [3] Android app. Each data collector was asked to record eight videos of at least 10 objects of their choice.
Collected between July and November 2023, this dataset represents a set of objects commonly used by people who are blind or have low vision in India, including earphones, talking watches, toothbrushes, and typical Indian household items like a belan (rolling pin), and a steel glass. These videos were taken in various settings of the data collectors' homes and workspaces using the Find My Things Android app.
The image dataset is stored in the ‘Dataset’ folder, organized into folders assigned to each data collector (P1, P2, ... P12). Each collector's folder includes sub-folders named with the object labels as provided by our data collectors. Within each object folder there are two subfolders: ‘clean’ for images taken on clean surfaces and ‘clutter’ for images taken in cluttered environments where the objects are typically found. The annotations are saved inside an ‘Annotations’ folder containing one JSON file per video (e.g., P1--coffee mug--clean--231220_084852_coffee mug_224.json) with keys corresponding to all frames/images in that video (e.g., "P1--coffee mug--clean--231220_084852_coffee mug_224--000001.jpeg": {"object_not_present_issue": false, "pii_present_issue": false}, ...). The ‘object_not_present_issue’ key is True if the object is not present in the image, and the ‘pii_present_issue’ key is True if personally identifiable information (PII) is present in the image. Note: all PII present in the images has been blurred to protect the identity and privacy of our data collectors. This dataset version was created by cropping images originally sized at 1080 × 1920; an unscaled version of the dataset will follow soon.
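A minimal sketch of filtering usable frames from one per-video annotation file; the file name is taken from the example above:

```
import json

ann_path = "Annotations/P1--coffee mug--clean--231220_084852_coffee mug_224.json"
with open(ann_path) as f:
    frames = json.load(f)

# Keep frames where the object is visible and no PII was flagged.
usable = [name for name, flags in frames.items()
          if not flags["object_not_present_issue"]
          and not flags["pii_present_issue"]]
print(f"{len(usable)} of {len(frames)} frames usable")
```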
This project was funded by the Engineering and Physical Sciences Research Council (EPSRC) Industrial ICASE Award with Microsoft Research UK Ltd. as the Industrial Project Partner. We would like to acknowledge and express our gratitude to our data collectors for their efforts and time invested in carefully collecting videos to build this dataset for their community. The dataset is designed for developing few-shot learning algorithms, aiming to support researchers and developers in advancing object-recognition systems. We are excited to share this dataset and would love to hear from you if and how you use this dataset. Please feel free to reach out if you have any questions, comments or suggestions.
REFERENCES:
[1] Daniela Massiceti, Lida Theodorou, Luisa Zintgraf, Matthew Tobias Harris, Simone Stumpf, Cecily Morrison, Edward Cutrell, and Katja Hofmann. 2021. ORBIT: A real-world few-shot dataset for teachable object recognition collected from people who are blind or low vision. DOI: https://doi.org/10.25383/city.14294597
[2] microsoft/ORBIT-Dataset. https://github.com/microsoft/ORBIT-Dataset
[3] Linda Yilin Wen, Cecily Morrison, Martin Grayson, Rita Faia Marques, Daniela Massiceti, Camilla Longden, and Edward Cutrell. 2024. Find My Things: Personalized Accessibility through Teachable AI for People who are Blind or Low Vision. In Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems (CHI EA '24). Association for Computing Machinery, New York, NY, USA, Article 403, 1-6. https://doi.org/10.1145/3613905.3648641
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the dataset on which CCC-BERT was fine-tuned to classify whether a user input requires a new context retrieval by the RAG model or not. The creation of the dataset can be found in this notebook, in section 2.
The dataset consists of two files:
context_chats_35k.json: this file contains 35,000 multi-turn chats on random topics, all ending on a user's input, synthetically produced with GPT-3.5 and labeled with a fetch_context flag that tells us whether the user input should require the retrieval of new context or not. For a more detailed explanation of this flag, please consult the CCC-BERT model card.
chit-chat_dataset.tsv: contains around 10,000 "nonsense" chats provided by Microsoft on this GitHub repository. I've added this small dataset as it can be used to augment the chats a bit more, but the higher quality lies within the context_chats_35k.json file.
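A minimal sketch of loading both files; the JSON layout (a list of labeled multi-turn chats) is an assumption based on the description above, so consult the linked notebook for the authoritative schema:

```
import json
import pandas as pd

with open("context_chats_35k.json") as f:
    chats = json.load(f)  # assumed: a list of labeled multi-turn chats
print(len(chats), "multi-turn chats")

chit_chat = pd.read_csv("chit-chat_dataset.tsv", sep="\t")
print(chit_chat.shape)
```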
Multi-turn chats are useful for fine-tuning LLMs and for training on LLM tasks such as POS tagging, lemmatization, named-entity recognition, and more.
Feel free to utilize these chats to your liking.
I wanted to run data analysis and machine learning on a large dataset to build my data science skills, but I felt out of touch with the various datasets available, so I thought: how about I try to build my own dataset?
I wondered what data should be in the dataset and settled on online digital game purchases, since I am an avid gamer. Imagine getting sales data from the PlayStation Store or Xbox Microsoft Store; this is what I was aiming to replicate.
I envisaged the dataset to be data created through the purchase of a digital game on either the UK PlayStation Store or Xbox Microsoft Store. Considering this, the scope of the dataset varies depending on which column of data you are viewing, for example:
- Date and Time: purchases were defined between a start/end date (this can be altered, see point 4) and, of course, any time across the 24-hour clock
- Geographically: purchases were set up to come from any postcode in the UK; in total this is over 1,000,000 active postcodes
- Purchases: the list of game titles available for purchase is 24
- Registered Banks: the list of registered banks in the UK (as of 03/2022) was 159
To generate the dataset, I built a function in Python. This function, when called with the number of rows you want in your dataset, will generate the dataset: for example, calling function(1000) will provide you with a dataset of 1,000 rows. A sketch of the idea appears after the notes below.
Considering this, if just over 42,000 rows of data (42,892 to be exact) isn't enough, feel free to check out the code on my GitHub to run the function yourself with as many rows as you want.
Note: You can also edit the start/end dates of the function depending on which timespan you want the dataset to cover.
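A minimal sketch of the idea behind the generator (the real function lives in the GitHub repo; every list below is an illustrative stand-in for the full 24 titles, 159 banks, and 1,000,000+ postcodes):

```
import random
from datetime import datetime, timedelta

GAMES = ["Game A", "Game B", "Game C"]                  # stand-in for 24 titles
BANKS = ["Bank 1", "Bank 2"]                            # stand-in for 159 banks
STORES = ["PlayStation Store", "Xbox Microsoft Store"]

def generate_dataset(n_rows, start=datetime(2022, 1, 1), end=datetime(2022, 12, 31)):
    """Generate n_rows of synthetic purchases between start and end."""
    span = (end - start).total_seconds()
    return [{
        "datetime": start + timedelta(seconds=random.uniform(0, span)),
        "store": random.choice(STORES),
        "game": random.choice(GAMES),
        "bank": random.choice(BANKS),
        "postcode": f"AB{random.randint(1, 99)} {random.randint(1, 9)}XY",  # illustrative
    } for _ in range(n_rows)]

rows = generate_dataset(1000)  # cf. function(1000) above
```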
As stated above, this dataset is still a work in progress and is therefore not 100% perfect. There is a backlog of issues that need to be resolved; feel free to check out the backlog.
One example of this is that, in various columns, the distribution of values is uniform, when in fact, for the dataset to be entirely random, this should not be the case. An example of this issue is the Time column. These issues will be resolved in a later update.
This is a free desktop computer-aided diagnosis (CAD) tool that uses computer vision to detect and localize masses on full-field digital mammograms. It's a Flask app that runs on the desktop. Internally there are two ensembled YOLOv5L models that were trained on data from the VinDr-Mammo dataset. The model ensemble has a validation accuracy of 0.65 and a validation recall of 0.63.
My aim was to create a proof of concept for a free desktop computer aided diagnosis (CAD) system that could be used as an aid when diagnosing breast cancer. Unlike a web app, this tool does not need an internet connection and there are no monthly costs for hosting and web server rental. I think a desktop tool could be helpful to radiologists in private practice and to medical non-profits that work in remote areas.
The complete project folder, including the trained models, is stored in this Kaggle dataset.
For a full project description please refer to the GitHub repo: https://github.com/vbookshelf/Mammogram-Mass-Analyzer
For info on model training and validation, please refer to the model card. I've included a confusion matrix and classification report. https://github.com/vbookshelf/Mammogram-Mass-Analyzer/blob/main/mammogram-mass-analyzer-v0.0/Model-Card-and-App-Info.pdf
Demo (animated GIF) showing what happens after a user submits three DICOM mammograms.
The project folder (named mammogram-mass-analyzer-v0.1) is stored in this Kaggle dataset.
I suggest that you download the project folder from Kaggle instead of from the GitHub repo. This is because the project folder on Kaggle includes the two trained models; the project folder on GitHub does not, because GitHub does not allow files larger than 25MB to be uploaded.
The models are located inside a folder called TRAINED_MODEL_FOLDER, which is located inside the yolov5 folder:
mammogram-mass-analyzer-v0.0/yolov5/TRAINED_MODEL_FOLDER/
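A minimal sketch of loading one of the trained ensemble members via the standard YOLOv5 custom-weights hook; the weight file name inside TRAINED_MODEL_FOLDER is an assumption, so check the folder contents:

```
import torch

model = torch.hub.load(
    "ultralytics/yolov5", "custom",
    path="mammogram-mass-analyzer-v0.0/yolov5/TRAINED_MODEL_FOLDER/best.pt",  # assumed name
)
results = model("mammogram.png")  # illustrative input image
results.print()                   # detections with confidences
```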
This is a standard flask app. The steps to set up and run the app are the same for both Mac and Windows.
This app is based on Flask and PyTorch, both of which are pure Python. If you encounter any errors during installation, you should be able to solve them quite easily; you won't have to deal with the package dependency issues that can happen when using TensorFlow.
The instructions below are for a Mac. I didn't include instructions for Windows because I don't have a Windows PC and therefore could not test the installation process on Windows. If you're using a Windows PC, please adapt the commands below to suit Windows.
You’ll need an internet connection during the first setup. After that you’ll be able to use the app without an internet connection.
If you are a beginner you may find these resources helpful:
The Complete Guide to Python Virtual Environments! Teclado (Includes instructions for Windows) https://www.youtube.com/watch?v=KxvKCSwlUv8&t=947s
How To Create Python Virtual Envi...
https://creativecommons.org/publicdomain/zero/1.0/
The Shake phenomenon occurs when the competition switches between two different test sets:
\[ \text{Public test set} \Rightarrow \text{Private test set} \quad \Leftrightarrow \quad \text{LB-public} \Rightarrow \text{LB-private} \]
The private test set, which was so far unavailable, becomes available, and thus the models' scores are re-calculated. This re-evaluation elicits a corresponding re-ranking of the contestants in the competition. The shake allows participants to assess the severity of their overfitting to the public dataset, and to act to improve their model until the deadline.
Unable to find a uniform, conventional term for this mechanism, I will use my common sense to define the following intuition:
<img src="https://github.com/Daniboy370/Uploads/blob/master/Kaggle-shake-ups/images/latex.png?raw=true" width="550">
From the starter kernel :
<img src="https://github.com/Daniboy370/Uploads/blob/master/Kaggle-shake-ups/vids/shakeup_VID.gif?raw=true" width="625">
Seven datasets of competitions that were scraped from Kaggle:
| Competition | Name of file |
|---|---|
| Elo Merchant Category Recommendation | df_{Elo} |
| Human Protein Atlas Image Classification | df_{Protein} |
| Humpback Whale Identification | df_{Humpback} |
| Microsoft Malware Prediction | df_{Microsoft} |
| Quora Insincere Questions Classification | df_{Quora} |
| TGS Salt Identification Challenge | df_{TGS} |
| VSB Power Line Fault Detection | df_{VSB} |
As an example, consider the following dataframe from the Quora competition :
| Team Name | Rank-private | Rank-public | Shake | Score-private | Score-public |
|---|---|---|---|---|---|
| The Zoo | 1 | 7 | 6 | 0.71323 | 0.71123 |
| ... | ... | ... | ... | ... | ... |
| D.J. Trump | 1401 | 65 | -1336 | 0.000 | 0.70573 |
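For instance, a minimal sketch of recomputing the Shake column from the ranks (the file name is illustrative; column names follow the header above):

```
import pandas as pd

df = pd.read_csv("df_Quora.csv")  # illustrative file name
df["Shake"] = df["Rank-public"] - df["Rank-private"]
print(df.nlargest(5, "Shake"))    # biggest risers, e.g. The Zoo: 7 - 1 = +6
print(df.nsmallest(5, "Shake"))   # biggest fallers, e.g. 65 - 1401 = -1336
```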
I encourage everybody to investigate the dataset thoroughly in search of interesting findings!
\[ \text{Enjoy!} \]
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Our public malware dataset was generated by Cuckoo Sandbox based on Windows OS API-call analysis. It is provided in CSV file format for cyber security researchers to use in malware analysis and machine learning applications.
Cite the Dataset
If you find these results useful, please cite:
@article{10.7717/peerj-cs.285,
title = {Deep learning based Sequential model for malware analysis using Windows exe API Calls},
author = {Catak, Ferhat Ozgur and Yazı, Ahmet Faruk and Elezaj, Ogerta and Ahmed, Javed},
year = 2020,
month = jul,
keywords = {Malware analysis, Sequential models, Network security, Long-short-term memory, Malware dataset},
volume = 6,
pages = {e285},
journal = {PeerJ Computer Science},
issn = {2376-5992},
url = {https://doi.org/10.7717/peerj-cs.285},
doi = {10.7717/peerj-cs.285}
}
The details of the Mal-API-2019 dataset are published in the following papers:
- [Link] AF. Yazı, FÖ Çatak, E. Gül, Classification of Metamorphic Malware with Deep Learning (LSTM), IEEE Signal Processing and Applications Conference, 2019.
- [Link] Catak, FÖ., Yazi, AF., A Benchmark API Call Dataset for Windows PE Malware Classification, arXiv:1905.01999, 2019.
This study seeks to obtain data that will help address gaps in machine-learning-based malware research. The specific objective of this study is to build a benchmark dataset of Windows operating system API calls for various malware. This is the first study to use metamorphic malware to build sequential API calls. It is hoped that this research will contribute to a deeper understanding of how metamorphic malware change their behavior (i.e., API calls) by adding meaningless opcodes with their own disassembler/assembler parts.
In our research, we have translated the family labels produced by each piece of software into 8 main malware families: Trojan, Backdoor, Downloader, Worms, Spyware, Adware, Dropper, Virus. Table 1 shows the number of malware samples belonging to each malware family in our data set. As you can see in the table, the sample counts of the malware families other than Adware are quite close to each other. This difference exists because we did not find much malware from the Adware family.
| Malware Family | Samples | Description |
|---|---|---|
| Spyware | 832 | enables a user to obtain covert information about another's computer activities by transmitting data covertly from their hard drive. |
| Downloader | 1001 | share the primary functionality of downloading content. |
| Trojan | 1001 | misleads users of its true intent. |
| Worms | 1001 | spreads copies of itself from computer to computer. |
| Adware | 379 | hides on your device and serves you advertisements. |
| Dropper | 891 | surreptitiously carries viruses, back doors and other malicious software so they can be executed on the compromised machine. |
| Virus | 1001 | designed to spread from host to host and has the ability to replicate itself. |
| Backdoor | 1001 | a technique in which a system security mechanism is bypassed undetectably to access a computer or its data. |
The figure shows the general flow of the generation of the malware data set. As shown in the figure, we obtained the MD5 hash values of the malware we collected from GitHub. We searched for these hash values using the VirusTotal API and obtained the families of this malicious software from the reports of 67 different antivirus engines in VirusTotal. We observed that the malware families reported by these 67 different antivirus engines differ from one another.