Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
📌 Context of the Dataset
The Healthcare Ransomware Dataset was created to simulate real-world cyberattacks in the healthcare industry. Hospitals, clinics, and research labs have become prime targets for ransomware due to their reliance on real-time patient data and legacy IT infrastructure. This dataset provides insight into attack patterns, recovery times, and cybersecurity practices across different healthcare organizations.
Why is this important?
Ransomware attacks on healthcare organizations can shut down entire hospitals, delay treatments, and put lives at risk. Understanding how different healthcare organizations respond to attacks can help develop better security strategies. The dataset allows cybersecurity analysts, data scientists, and researchers to study patterns in ransomware incidents and explore predictive modeling for risk mitigation.
📌 Sources and Research Inspiration
This simulated dataset was inspired by real-world cybersecurity reports and built using insights from official sources, including:
1️⃣ IBM Cost of a Data Breach Report (2024)
The healthcare sector had the highest average cost of data breaches ($10.93 million per incident). On average, organizations recovered only 64.8% of their data after paying ransom. Healthcare breaches took 277 days on average to detect and contain.
2️⃣ Sophos State of Ransomware in Healthcare (2024)
67% of healthcare organizations were hit by ransomware in 2024, an increase from 60% in 2023. 66% of backup compromise attempts succeeded, making data recovery significantly more difficult. The most common attack vectors included exploited vulnerabilities (34%) and compromised credentials (34%).
3️⃣ Health & Human Services (HHS) Cybersecurity Reports
Ransomware incidents in healthcare have doubled since 2016. Organizations that fail to monitor threats frequently experience higher infection rates.
4️⃣ Cybersecurity & Infrastructure Security Agency (CISA) Alerts
Identified phishing, unpatched software, and exposed RDP ports as top ransomware entry points. Only 13% of healthcare organizations monitor cyber threats more than once per day, increasing the risk of undetected attacks.
5️⃣ Emsisoft 2020 Report on Ransomware in Healthcare
The number of ransomware attacks in healthcare increased by 278% between 2018 and 2023. 560 healthcare facilities were affected in a single year, disrupting patient care and emergency services.
📌 Why is This a Simulated Dataset?
This dataset does not contain real patient data or actual ransomware cases. Instead, it was built using probabilistic modeling and structured randomness based on industry benchmarks and cybersecurity reports.
How It Was Created:
1️⃣ Defining the Dataset Structure
The dataset was designed to simulate realistic attack patterns in healthcare, using actual ransomware case studies as inspiration.
Columns were selected based on what real-world cybersecurity teams track, such as:
- Attack methods (phishing, RDP exploits, credential theft)
- Infection rates, recovery time, and backup compromise rates
- Organization type (hospitals, clinics, research labs) and monitoring frequency
2️⃣ Generating Realistic Data Using ChatGPT & Python
ChatGPT assisted in defining relationships between attack factors, ensuring that key cybersecurity concepts were accurately reflected. Python's NumPy and Pandas libraries were used to introduce randomized attack simulations based on real-world statistics (a generation sketch follows this list). Data was validated against industry research to ensure it aligns with actual ransomware attack trends.
3️⃣ Ensuring Logical Relationships Between Data Points
Hospitals take longer to recover due to larger infrastructure and compliance requirements. Organizations that track more cyber threats recover faster because they detect attacks earlier. Backup security significantly impacts recovery time, reflecting the real-world risk of backup encryption attacks.
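How these steps translate into code can be made concrete with a small sketch. The snippet below is a minimal illustration of the probabilistic generation described in steps 2 and 3, not the authors' actual script: all column names, category lists, and distribution parameters are assumptions chosen for the example.

```python
# Minimal sketch of probabilistic dataset generation (assumed schema/parameters).
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000  # number of simulated incidents

org_type = rng.choice(["hospital", "clinic", "research_lab"], size=n, p=[0.5, 0.3, 0.2])
attack_vector = rng.choice(["phishing", "rdp_exploit", "credential_theft"],
                           size=n, p=[0.4, 0.3, 0.3])
monitoring_per_day = rng.integers(0, 5, size=n)          # threat checks per day
backup_compromised = rng.random(n) < 0.66                # cf. the Sophos 66% figure

# Encode the logical relationships from step 3: hospitals recover more slowly,
# frequent monitoring speeds recovery, compromised backups slow it down.
base_days = rng.normal(14, 4, size=n)
recovery_days = (base_days
                 + np.where(org_type == "hospital", 7, 0)
                 - 1.5 * monitoring_per_day
                 + np.where(backup_compromised, 10, 0)).clip(min=1)

df = pd.DataFrame({
    "org_type": org_type,
    "attack_vector": attack_vector,
    "monitoring_per_day": monitoring_per_day,
    "backup_compromised": backup_compromised,
    "recovery_days": recovery_days.round(1),
})
print(df.groupby("org_type")["recovery_days"].mean())
```

Sanity-checking the grouped means against the intended relationships (hospitals slowest, heavily monitored organizations fastest) mirrors the validation step described above.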
Success.ai’s Healthcare Professionals Data for Healthcare & Hospital Executives in Europe provides a reliable and comprehensive dataset tailored for businesses aiming to connect with decision-makers in the European healthcare and hospital sectors. Covering healthcare executives, hospital administrators, and medical directors, this dataset offers verified contact details, professional insights, and leadership profiles.
With access to over 700 million verified global profiles and data from 70 million businesses, Success.ai ensures your outreach, market research, and partnership strategies are powered by accurate, continuously updated, and GDPR-compliant data. Backed by our Best Price Guarantee, this solution is indispensable for navigating and thriving in Europe’s healthcare industry.
Why Choose Success.ai’s Healthcare Professionals Data?
Verified Contact Data for Targeted Engagement
Comprehensive Coverage of European Healthcare Professionals
Continuously Updated Datasets
Ethical and Compliant
Data Highlights:
Key Features of the Dataset:
Comprehensive Professional Profiles
Advanced Filters for Precision Campaigns
Healthcare Industry Insights
AI-Driven Enrichment
Strategic Use Cases:
Marketing and Outreach to Healthcare Executives
Partnership Development and Collaboration
Market Research and Competitive Analysis
Recruitment and Workforce Solutions
Why Choose Success.ai?
Best Price Guarantee
Seamless Integration
...
By US Open Data Portal, data.gov [source]
This dataset provides an inside look at the performance of Veterans Health Administration (VHA) hospitals on timely and effective care measures. It contains detailed information such as hospital names, addresses, census-designated cities and locations, states, ZIP codes, county names, phone numbers and associated conditions. Additionally, each entry includes a score, sample size and any notes or footnotes to give further context. This data is collected either through Quality Improvement Organizations' external peer review programs or directly from electronic medical records. By understanding these performance scores of VHA hospitals on timely care measures, we can gain valuable insights into how VA healthcare services are delivering value throughout the country.
This dataset contains information about the performance of Veterans Health Administration hospitals on timely and effective care measures. In this dataset, you can find the hospital name, address, city, state, ZIP code, county name, phone number associated with each hospital as well as data related to the timely and effective care measure such as conditions being measured and their associated scores.
To use this dataset effectively, we recommend first identifying an area of interest for analysis, for example: which condition most impacts wait times for patients? Once that has been identified, you can narrow down which fields best fit your needs; if you are studying wait times, then "Score" may be more valuable to filter on than "Footnote". Additionally, consider using aggregation functions over certain fields (such as the average score over time) to better understand overall performance by a factor such as Location.
Ultimately, this dataset provides a snapshot of how Veterans Health Administration hospitals are performing on timely and effective care measures, so any research should focus on that aspect of healthcare delivery; a loading-and-aggregation sketch follows.
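As a starting point, here is a minimal loading-and-aggregation sketch in the spirit of the recommendations above. It assumes the csv-1.csv file and column names from the column table below; the condition label used in the filter is a hypothetical placeholder, and "Score" is coerced because such files often contain "Not Available" entries.

```python
# Minimal sketch: average timely-care Score by State (assumed condition label).
import pandas as pd

df = pd.read_csv("csv-1.csv")
df["Score"] = pd.to_numeric(df["Score"], errors="coerce")  # drop non-numeric scores

subset = df[df["Condition"] == "Emergency Department"]     # hypothetical label
by_state = subset.groupby("State")["Score"].agg(["mean", "count"]).sort_values("mean")
print(by_state.head(10))
```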
- Analyzing and predicting hospital performance on a regional level to improve the quality of healthcare for veterans across the country.
- Using this dataset to identify trends and develop strategies for hospitals that consistently score low on timely and effective care measures, with the goal of improving patient outcomes.
- Comparison analysis between different VHA hospitals to discover patterns and best practices in providing effective care, so they can be shared with other hospitals in the system.
If you use this dataset in your research, please credit the original authors. Data Source
License: Dataset copyright by authors
- You are free to:
  - Share - copy and redistribute the material in any medium or format for any purpose, even commercially.
  - Adapt - remix, transform, and build upon the material for any purpose, even commercially.
- You must:
  - Give appropriate credit - provide a link to the license, and indicate if changes were made.
  - ShareAlike - you must distribute your contributions under the same license as the original.
  - Keep intact - all notices that refer to this license, including copyright notices.
File: csv-1.csv

| Column name | Description |
|:------------|:------------|
| Hospital Name | Name of the VHA hospital. (String) |
| Address | Street address of the VHA hospital. (String) |
| City | City where the VHA hospital is located. (String) |
| State | State where the VHA hospital is located. (String) |
| ZIP Code | ZIP code of the VHA hospital. (Integer) |
| County Name | County where the VHA hospital is located. (String) |
| Phone Number | Phone number of the VHA hospital. (String) |
| Condition | Condition being measured. (String) |
| Measure Name | Measure used to measure the condition. (String) |
| Score | Score achieved by the VHA h... |
Success.ai’s B2B Contact Data and Healthcare Professionals Data for Global Healthcare Professionals offers businesses a powerful resource to connect with healthcare administrators and decision-makers across the globe. Derived from over 170 million verified professional profiles, this dataset delivers unparalleled accuracy and reach, enabling effective outreach and building strategic relationships with professionals in the healthcare sector.
Why Choose Success.ai’s Global Healthcare Professionals Contact Data?
Every profile is validated using advanced AI algorithms, ensuring up to 99% accuracy.
Global Reach Across Healthcare:
Connect with healthcare professionals and decision-makers in hospitals, clinics, research institutions, and public health organizations worldwide.
Includes data for regions such as North America, Europe, Asia-Pacific, and beyond.
Continuously Updated Profiles:
Ensure your campaigns are supported by the latest data with our real-time updates.
Adapt to industry trends and professional movements dynamically.
Compliance with Data Privacy Laws:
Fully adheres to global regulations like GDPR and CCPA, ensuring ethical use of contact information.
Data Highlights:
- 170M+ Verified Professional Profiles: A vast dataset encompassing professionals from multiple industries, including healthcare.
- 50M Work Emails: Verified and AI-validated for precise and reliable communication.
- 30M Company Profiles: Gain insights into the organizations where healthcare professionals operate.
- 700M Global Professional Profiles: Comprehensive datasets that enhance your outreach and analytics.
Key Features of the Dataset:
- Comprehensive Professional Profiles: Verified work emails, direct phone numbers, and LinkedIn profiles for accurate targeting.
- Customizable Segmentation: Filter by job titles, industries, company sizes, and geographic locations.
- AI-Driven Insights: Profiles enriched with role-specific and industry-specific insights for maximum relevance.
Strategic Use Cases:
Streamline outreach efforts to build relationships and close deals faster.
Targeted Marketing Campaigns:
Execute tailored email and phone campaigns for healthcare professionals and organizations.
Maximize engagement with hyper-personalized outreach strategies.
Recruitment in Healthcare:
Find and connect with top talent for executive, administrative, and clinical roles in the healthcare industry.
Ensure you’re reaching the right candidates with continuously updated contact information.
Market Research and Strategic Planning:
Analyze trends in healthcare hiring, administration, and innovation using rich data insights.
Identify partnership opportunities with institutions and organizations at the forefront of healthcare.
Why Is Success.ai Your Trusted Partner?
High-quality datasets at the most competitive prices in the market.
Tailored Solutions:
Flexible options for accessing and integrating datasets based on your unique business needs.
Seamless Integration:
Choose API integration or downloadable datasets to match your workflows.
Unmatched Accuracy and Scale:
Sourced from 170M verified professional profiles, our data is enriched and validated with AI to deliver industry-leading accuracy.
APIs for Advanced Data Solutions:
Data Enrichment API: Enhance your existing data with real-time updates and additional insights for healthcare professionals.
Lead Generation API: Directly integrate our healthcare professional contact data into your CRM or marketing platforms for seamless campaigns.
Transform your outreach and engagement strategies with B2B Contact Data for Global Healthcare Professionals from Success.ai. Whether you’re targeting administrators, executives, or decision-makers, our verified and continuously updated profiles provide the precision and depth you need to succeed.
Enjoy the benefits of our Best Price Guarantee and experience the difference with Success.ai. Contact us now to empower your business with AI-validated contact data that drives real results!
No one beats us on price. Period.
Overview: This is a large-scale real-world dataset with videos recording medical staff washing their hands as part of their normal job duties in Jurmala Hospital, located in Jurmala, Latvia. There are 2427 hand washing episodes in total, almost all of which are annotated by two persons. The annotations classify the washing movements according to the World Health Organization's (WHO) guidelines by marking each frame in each video with a certain movement code. This dataset is part of a three-dataset series, all following the same format:
https://zenodo.org/record/4537209 - data collected in Pauls Stradins Clinical University Hospital
https://zenodo.org/record/5808764 - data collected in Jurmala Hospital
https://zenodo.org/record/5808789 - data collected in the Medical Education Technology Center (METC) of Riga Stradins University
Applications: The intention of this dataset is twofold: to serve as a basis for training machine learning classifiers for automated hand washing movement recognition and quality control, and to allow investigation of the real-world quality of washing performed by working medical staff.
Statistics: Frame rate: 30 FPS. Resolution: 320x240 and 640x480. Number of videos: 2427. Number of annotation files: 4818.
Movement codes (both in CSV and JSON files):
1: Hand washing movement - Palm to palm
2: Hand washing movement - Palm over dorsum, fingers interlaced
3: Hand washing movement - Palm to palm, fingers interlaced
4: Hand washing movement - Backs of fingers to opposing palm, fingers interlocked
5: Hand washing movement - Rotational rubbing of the thumb
6: Hand washing movement - Fingertips to palm
7: Turning off the faucet with a paper towel
0: Other hand washing movement
Acknowledgments: The dataset collection was funded by the Latvian Council of Science project "Automated hand washing quality control and quality evaluation system with real-time feedback", No: lzp-2020/2-0309.
References: For more detailed information, see this article describing a similar dataset collected in a different project: M. Lulla, A. Rutkovskis, A. Slavinska, A. Vilde, A. Gromova, M. Ivanovs, A. Skadins, R. Kadikis, A. Elsts. Hand-Washing Video Dataset Annotated According to the World Health Organization's Hand-Washing Guidelines. Data. 2021; 6(4):38. https://doi.org/10.3390/data6040038
Contact information: atis.elsts@edi.lv
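For orientation, here is a minimal sketch of consuming the per-frame annotations. It assumes a CSV with one row per frame and a code column holding the movement codes listed above; the actual annotation layout may differ, so inspect a real file first and adjust.

```python
# Minimal sketch: seconds spent per WHO movement in one episode (assumed layout).
import pandas as pd

FPS = 30  # frame rate stated in the dataset statistics

CODES = {
    0: "Other hand washing movement",
    1: "Palm to palm",
    2: "Palm over dorsum, fingers interlaced",
    3: "Palm to palm, fingers interlaced",
    4: "Backs of fingers to opposing palm, fingers interlocked",
    5: "Rotational rubbing of the thumb",
    6: "Fingertips to palm",
    7: "Turning off the faucet with a paper towel",
}

ann = pd.read_csv("episode_0001_annotator1.csv")  # hypothetical file name
seconds = ann["code"].value_counts().sort_index() / FPS
for code, secs in seconds.items():
    print(f"{CODES.get(code, '?'):55s} {secs:6.1f} s")
```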
By Amber Thomas [source]
This dataset contains machine-readable hospital pricing information for Children's Hospitals and Clinics of Minnesota. It includes three separate files:
- 2022-top-25-hospital-based-clinics-list.csv: This file provides the top 25 primary care procedure prices, including procedure codes, fees, and insurance coverage details.
- 2022-standard-list-of-charges-hospital-op.csv: This file includes machine-readable hospital pricing information, including procedure codes, fees, and insurance coverage details.
- 2022-msdrg.csv: This file also contains machine-readable hospital pricing information, including procedure codes, fees, and insurance coverage details.
The data was collected programmatically using a custom script written in Node.js and Microsoft Playwright. These files were then mirrored on the data.world platform using the Import from URL option.
If you find any errors in the dataset or have any questions or concerns, please leave a note in the Discussion tab of this dataset or contact support@data.world for assistance.
Dataset Overview:
- The dataset contains three files:
  a) 2022-top-25-hospital-based-clinics-list.csv: This file includes the top 25 primary care procedure prices for Children's Hospitals and Clinics of Minnesota, including procedure codes, fees, and insurance coverages.
  b) 2022-standard-list-of-charges-hospital-op.csv: This file includes machine-readable hospital pricing information for Children's Hospitals and Clinics of Minnesota, including procedure codes, fees, and insurance coverages.
  c) 2022-msdrg.csv: This file includes machine-readable hospital pricing information for Children's Hospitals and Clinics of Minnesota, including MSDRG (Medicare Severity Diagnosis Related Groups) codes, fees, and insurance coverages.
Data Collection:
- The data was collected programmatically using a custom script written in Node.js with the assistance of Microsoft Playwright.
- These datasets were programmatically mirrored on the data.world platform using the Import from URL option.
Usage Guidelines:
Explore Procedure Prices: You can analyze the top 25 primary care procedure prices by referring to the '2022-top-25-hospital-based-clinics-list.csv' file. It provides information on procedure codes (identifiers), associated fees (costs), and insurance coverage details.
Analyze Hospital Price Information: The '2022-standard-list-of-charges-hospital-op.csv' contains comprehensive machine-readable hospital pricing information. You can examine various procedures by their respective codes along with associated fees as well as corresponding insurance coverage details.
Understand MSDRG Codes & Fees: The '2022-msdrg.csv' file includes machine-readable hospital pricing information based on MSDRG (Medicare Severity Diagnosis Related Groups) codes. You can explore the relationship between diagnosis groups and associated fees, along with insurance coverage details; a short loading sketch follows.
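The sketch below shows one way to start exploring these files. The column names used (procedure_code, fee) are assumptions for illustration; print the real headers first and adjust accordingly.

```python
# Minimal sketch: inspect schemas, then list the highest assumed fees.
import pandas as pd

top25 = pd.read_csv("2022-top-25-hospital-based-clinics-list.csv")
charges = pd.read_csv("2022-standard-list-of-charges-hospital-op.csv")

print(top25.columns.tolist())    # discover the real schema before filtering
print(charges.columns.tolist())

charges["fee"] = pd.to_numeric(charges["fee"], errors="coerce")   # assumed column
print(charges.nlargest(10, "fee")[["procedure_code", "fee"]])     # assumed columns
```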
Reporting Errors:
- If you identify any errors or discrepancies in the dataset, please leave a note in the Discussion tab of this dataset to notify others who may be interested.
- Alternatively, you can reach out to the data.world team at support@data.world for further assistance.
- Comparative Analysis: Researchers and healthcare professionals can use this dataset to compare the pricing of primary care procedures at Children's Hospitals and Clinics of Minnesota with other hospitals. This can help identify any variations or discrepancies in pricing, enabling better cost management and transparency.
- Insurance Coverage Analysis: The insurance coverage information provided in this dataset can be used to analyze which procedures are covered by different insurance providers. This analysis can help patients understand their out-of-pocket expenses for specific procedures and choose the best insurance plan accordingly.
- Cost Estimation: Patients can utilize this dataset to estimate the cost of primary care procedures at Children's Hospitals and Clinics of Minnesota before seeking medical treatment. By comparing procedure fees across different hospitals, patients can make informed decisions about where to receive their healthcare services based on affordability and quality.
If you use this dataset in your research, please credit the original authors. Data Source
**Unknown License - Please chec...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects and has 5 rows, filtered to books that include "Which country has the World's best health care?". It features 10 columns including book subject, number of authors, number of books, earliest publication date, and latest publication date. The preview is ordered by number of books (descending).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Global patterns of current and future road infrastructure - Supplementary spatial data
Authors: Johan Meijer, Mark Huijbregts, Kees Schotten, Aafke Schipper
Research paper summary: Georeferenced information on road infrastructure is essential for spatial planning, socio-economic assessments and environmental impact analyses. Yet current global road maps are typically outdated or characterized by spatial bias in coverage. In the Global Roads Inventory Project we gathered, harmonized and integrated nearly 60 geospatial datasets on road infrastructure into a global roads dataset. The resulting dataset covers 222 countries and includes over 21 million km of roads, which is two to three times the total length in the currently best available country-based global roads datasets. We then related total road length per country to country area, population density, GDP and OECD membership, resulting in a regression model with an adjusted R2 of 0.90, and found that the highest road densities are associated with densely populated and wealthier countries. Applying our regression model to future population densities and GDP estimates from the Shared Socioeconomic Pathway (SSP) scenarios, we obtained a tentative estimate of 3.0–4.7 million km of additional road length for the year 2050. Large increases in road length were projected for developing nations in some of the world's last remaining wilderness areas, such as the Amazon, the Congo basin and New Guinea. This highlights the need for accurate spatial road datasets to underpin strategic spatial planning in order to reduce the impacts of roads in remaining pristine ecosystems.
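To make the modelling step concrete, the sketch below fits the kind of country-level regression the summary describes, on a hypothetical extract of the country table. The log-log specification and the variable names are assumptions for illustration; the actual model is documented in Meijer et al. (2018).

```python
# Minimal sketch: road length vs. area, population density, GDP, OECD membership.
import numpy as np
import pandas as pd
import statsmodels.api as sm

countries = pd.read_csv("grip_country_table.csv")  # hypothetical extract

X = pd.DataFrame({
    "log_area": np.log(countries["area_km2"]),
    "log_pop_density": np.log(countries["pop_per_km2"]),
    "log_gdp": np.log(countries["gdp_usd"]),
    "oecd": countries["oecd_member"].astype(int),
})
y = np.log(countries["road_length_km"])

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())  # adjusted R-squared comparable to the 0.90 reported
```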
Contents: The GRIP dataset consists of global and regional vector datasets in ESRI filegeodatabase and shapefile format, and global raster datasets of road density at a 5 arcminute resolution (~8x8 km). The GRIP dataset is mainly aimed at providing a roads dataset that is easily usable for scientific global environmental and biodiversity modelling projects. The dataset is not suitable for navigation. GRIP4 is based on many different sources (including OpenStreetMap) and to the best of our ability we have verified their public availability, as a criterion in our research. The UNSDI-Transportation data model was applied for harmonization of the individual source datasets. GRIP4 is provided under a Creative Commons License (CC-0) and is free to use. The GRIP database and future global road infrastructure scenario projections following the Shared Socioeconomic Pathways (SSPs) are described in the paper by Meijer et al (2018). Due to shapefile file size limitations, the global file is only available in ESRI filegeodatabase format.
Regional coding of the other vector datasets in shapefile and ESRI fgdb format:
Road density raster data:
Keywords: global, data, roads, infrastructure, network, global roads inventory project (GRIP), SSP scenarios
Notice of data discontinuation: Since the start of the pandemic, AP has reported case and death counts from data provided by Johns Hopkins University. Johns Hopkins University has announced that they will stop their daily data collection efforts after March 10. As Johns Hopkins stops providing data, the AP will also stop collecting daily numbers for COVID cases and deaths. The HHS and CDC now collect and visualize key metrics for the pandemic. AP advises using those resources when reporting on the pandemic going forward.
Dataset update dates:
- April 9, 2020
- April 20, 2020
- April 29, 2020
- September 1, 2020
- February 12, 2021 (new_deaths column)
- February 16, 2021
The AP is using data collected by the Johns Hopkins University Center for Systems Science and Engineering as our source for outbreak caseloads and death counts for the United States and globally.
The Hopkins data is available at the county level in the United States. The AP has paired this data with population figures and county rural/urban designations, and has calculated caseload and death rates per 100,000 people. Be aware that caseloads may reflect the availability of tests, and the ability to turn around test results quickly, rather than actual disease spread or true infection rates.
This data is from the Hopkins dashboard that is updated regularly throughout the day. Like all organizations dealing with data, Hopkins is constantly refining and cleaning up their feed, so there may be brief moments where data does not appear correctly. At this link, you’ll find the Hopkins daily data reports, and a clean version of their feed.
The AP is updating this dataset hourly at 45 minutes past the hour.
To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, go here or email kromano@ap.org.
Use AP's queries to filter the data or to join to other datasets we've made available to help cover the coronavirus pandemic
Filter cases by state here
Rank states by their status as current hotspots. Calculates the 7-day rolling average of new cases per capita in each state: https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker/workspace/query?queryid=481e82a4-1b2f-41c2-9ea1-d91aa4b3b1ac
Find recent hotspots within your state by running a query to calculate the 7-day rolling average of new cases per capita in each county: https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker/workspace/query?queryid=b566f1db-3231-40fe-8099-311909b7b687&showTemplatePreview=true
Join county-level case data to an earlier dataset released by AP on local hospital capacity here. To find out more about the hospital capacity dataset, see the full details.
Pull the 100 counties with the highest per-capita confirmed cases here
Rank all the counties by the highest per-capita rate of new cases in the past 7 days here. Be aware that because this ranks per-capita caseloads, very small counties may rise to the very top, so take raw caseload figures into account as well; a sketch of the underlying rolling-average calculation follows.
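For readers reproducing these rankings outside the query interface, here is a minimal pandas sketch of the 7-day rolling average per 100,000 people; the column names (date, fips, county, cases, population) are assumptions about the extract, and the hosted queries above remain the authoritative versions.

```python
# Minimal sketch: 7-day rolling average of new cases per 100k, by county.
import pandas as pd

df = pd.read_csv("covid_counties.csv", parse_dates=["date"])
df = df.sort_values(["fips", "date"])

# Cumulative counts -> daily new cases, then a 7-day mean within each county.
df["new_cases"] = df.groupby("fips")["cases"].diff().clip(lower=0)
df["avg7"] = df.groupby("fips")["new_cases"].transform(lambda s: s.rolling(7).mean())
df["avg7_per_100k"] = df["avg7"] / df["population"] * 100_000

latest = df[df["date"] == df["date"].max()]
print(latest.nlargest(10, "avg7_per_100k")[["fips", "county", "avg7_per_100k"]])
```

As the caveat above notes, very small counties can dominate such a per-capita ranking, so raw counts should be reported alongside it.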
The AP has designed an interactive map to track COVID-19 cases reported by Johns Hopkins.
Johns Hopkins timeseries data:
- Johns Hopkins pulls data regularly to update their dashboard. Once a day, around 8pm EDT, Johns Hopkins adds the counts for all areas they cover to the timeseries file. These counts are snapshots of the latest cumulative counts provided by the source on that day. This can lead to inconsistencies if a source updates their historical data for accuracy, either increasing or decreasing the latest cumulative count.
- Johns Hopkins periodically edits their historical timeseries data for accuracy. They provide a file documenting all errors in their timeseries files that they have identified and fixed here.
This data should be credited to Johns Hopkins University COVID-19 tracking project
World Countries Generalized represents generalized boundaries for the countries of the world as of August 2022. The generalized political boundaries improve draw performance and effectiveness at a global or continental level. This layer is best viewed out beyond a scale of 1:5,000,000.This layer's geography was developed by Esri and sourced from Garmin International, Inc., the U.S. Central Intelligence Agency (The World Factbook), and the National Geographic Society for use as a world basemap. It is updated annually as country names or significant borders change.
Time series of tropical cyclone "best track" position and intensity data are provided for all ocean basins where tropical cyclones occur. Position and intensity data are available at 6-hourly intervals over the duration of each cyclone's life. The general period of record begins in 1851, but this varies by ocean basin. See the inventories [http://rda.ucar.edu/datasets/ds824.1/inventories/] for data availability specific to each basin. This data set was received as a revision to an NCDC tropical cyclone data set, with data generally available through the late 1990s. Since then, the set is being continually updated from the U.S. NOAA National Hurricane Center and the U.S. Navy Joint Typhoon Warning Center best track archives. For a complete history of updates for each ocean basin, see the dataset documentation [http://rda.ucar.edu/datasets/ds824.1/docs/].
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the code for Relevance and Redundancy ranking (RaR): an efficient filter-based feature ranking framework for evaluating relevance based on multi-feature interactions and redundancy on mixed datasets. Source code is in .scala and .sbt format, metadata in .xml, all of which can be accessed and edited in standard, openly accessible text editing software. Diagrams are in the openly accessible .png format.

Supplementary_2.pdf: contains the results of experiments on multiple classifiers, along with parameter settings and a description of how KLD converges to mutual information based on its symmetricity.

dataGenerator.zip: synthetic data generator inspired by the NIPS Workshop on Variable and Feature Selection (2001), http://www.clopinet.com/isabelle/Projects/NIPS2001/

rar-mfs-master.zip: Relevance and Redundancy Framework containing the overview diagram, example datasets, source code and metadata. Details on installing and running are provided below.

Background. Feature ranking is beneficial for gaining knowledge and identifying the relevant features in a high-dimensional dataset. However, in several datasets, a few features by themselves may have only a small correlation with the target classes, yet become strongly correlated with the target when combined with other features. This means that multiple features exhibit interactions among themselves, and it is necessary to rank the features based on these interactions for better analysis and classifier performance. Evaluating these interactions on large datasets, however, is computationally challenging. Furthermore, datasets often have features with redundant information; using such redundant features hinders both the efficiency and the generalization capability of the classifier. The major challenge is to efficiently rank the features based on relevance and redundancy on mixed datasets. In the related publication, we propose a filter-based framework based on Relevance and Redundancy (RaR). RaR computes a single score that quantifies the feature relevance by considering interactions between features and redundancy. The top-ranked features of RaR are characterized by maximum relevance and non-redundancy. The evaluation on synthetic and real-world datasets demonstrates that our approach outperforms several state-of-the-art feature selection techniques.

# Relevance and Redundancy Framework (rar-mfs)

rar-mfs is an algorithm for feature selection and can be employed to select features from labelled data sets. The Relevance and Redundancy Framework (RaR), which is the theory behind the implementation, is a novel feature selection algorithm that

- works on large data sets (polynomial runtime),
- can handle differently typed features (e.g. nominal features and continuous features), and
- handles multivariate correlations.

## Installation

The tool is written in Scala and uses the weka framework to load and handle data sets. You can either run it independently, providing the data as an `.arff` or `.csv` file, or you can include the algorithm as a (maven / ivy) dependency in your project. As an example data set we use heart-c.

### Project dependency

The project is published to maven central (link). To depend on the project use:

- maven:

```xml
<dependency>
  <groupId>de.hpi.kddm</groupId>
  <artifactId>rar-mfs_2.11</artifactId>
  <version>1.0.2</version>
</dependency>
```

- sbt:

```sbt
libraryDependencies += "de.hpi.kddm" %% "rar-mfs" % "1.0.2"
```
To run the algorithm use:

```scala
import java.io.File
import de.hpi.kddm.rar._
// ...
val dataSet = de.hpi.kddm.rar.Runner.loadCSVDataSet(new File("heart-c.csv"), isNormalized = false, "")
val algorithm = new RaRSearch(
  HicsContrastPramsFA(numIterations = config.samples, maxRetries = 1, alphaFixed = config.alpha, maxInstances = 1000),
  RaRParamsFixed(k = 5, numberOfMonteCarlosFixed = 5000, parallelismFactor = 4))
algorithm.selectFeatures(dataSet)
```
### Command line tool

- EITHER download the prebuilt binary, which requires only an installation of a recent Java version (>= 6):
  1. download the prebuilt jar from the releases tab (latest)
  2. run `java -jar rar-mfs-1.0.2.jar --help`
Using the prebuilt jar, here is an example usage:

```sh
rar-mfs > java -jar rar-mfs-1.0.2.jar arff --samples 100 --subsetSize 5 --nonorm heart-c.arff
Feature Ranking:
  1 - age (12)
  2 - sex (8)
  3 - cp (11)
  ...
```
- OR build the repository on your own:
  1. make sure sbt is installed
  2. clone the repository
  3. run `sbt run`
Simple example using sbt directly after cloning the repository:

```sh
rar-mfs > sbt "run arff --samples 100 --subsetSize 5 --nonorm heart-c.arff"
Feature Ranking:
  1 - age (12)
  2 - sex (8)
  3 - cp (11)
  ...
```
### [Optional]

To speed up the algorithm, consider using a fast solver such as Gurobi (http://www.gurobi.com/). Install the solver and put the provided `gurobi.jar` into the Java classpath.

## Algorithm

### Idea

Abstract overview of the different steps of the proposed feature selection algorithm (see the diagram at https://github.com/tmbo/rar-mfs/blob/master/docu/images/algorithm_overview.png, "Algorithm Overview").

The Relevance and Redundancy ranking framework (RaR) is a method able to handle large-scale data sets and data sets with mixed features. Instead of directly selecting a subset, a feature ranking gives a more detailed overview of the relevance of the features. The method consists of a multistep approach where we

1. repeatedly sample subsets from the whole feature space and examine their relevance and redundancy: exploration of the search space to gather more and more knowledge about the relevance and redundancy of features,
2. deduce scores for features based on the scores of the subsets, and
3. create the best possible ranking given the sampled insights.

### Parameters

| Parameter | Default value | Description |
| --------- | ------------- | ----------- |
| m - contrast iterations | 100 | Number of different slices to evaluate while comparing marginal and conditional probabilities |
| alpha - subspace slice size | 0.01 | Percentage of all instances to use as part of a slice which is used to compare distributions |
| n - sampling iterations | 1000 | Number of different subsets to select in the sampling phase |
| k - sample set size | 5 | Maximum size of the subsets to be selected in the sampling phase |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To create the dataset, the top 10 countries leading in the incidence of COVID-19 in the world were selected as of October 22, 2020 (on the eve of the second wave of the pandemic), as presented in the Global 500 ranking for 2020: USA, India, Brazil, Russia, Spain, France and Mexico. For each of these countries, no more than 10 of the largest transnational corporations included in the Global 500 rating for 2020 and 2019 were selected separately. The arithmetic averages were calculated, along with the change (increase) in indicators such as the profitability of enterprises, their ranking position (competitiveness), asset value and number of employees. The arithmetic mean values of these indicators for all countries in the sample were found, characterizing the situation in international entrepreneurship as a whole in the context of the COVID-19 crisis in 2020 on the eve of the second wave of the pandemic. The data is collected in a single Microsoft Excel workbook.

The dataset is a unique database that combines COVID-19 statistics and entrepreneurship statistics. It is flexible and can be supplemented with data from other countries and newer statistics on the COVID-19 pandemic. Because the cells of the dataset contain formulas rather than ready-made numbers, adding or changing values in the original table at the beginning of the dataset automatically recalculates most of the subsequent tables and updates the graphs. This allows the dataset to be used not just as an array of data, but as an analytical tool for automating scientific research on the impact of the COVID-19 pandemic and crisis on international entrepreneurship.

The dataset includes not only tabular data but also charts that provide data visualization. It contains not only actual but also forecast data on morbidity and mortality from COVID-19 for the period of the second wave of the pandemic in 2020. The forecasts are presented as a normal distribution of predicted values together with the probability of their occurrence in practice. This allows for broad scenario analysis of the impact of the COVID-19 pandemic and crisis on international entrepreneurship, substituting various predicted morbidity and mortality rates into the risk assessment tables and obtaining automatically calculated consequences (changes) for the characteristics of international entrepreneurship. It is also possible to substitute the actual values identified during and after the second wave of the pandemic to check the reliability of pre-made forecasts and conduct a plan-fact analysis. The dataset contains not only the numerical values of the initial and predicted indicators but also their qualitative interpretation, reflecting the presence and level of risks of the pandemic and COVID-19 crisis for international entrepreneurship.
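As an illustration of the scenario analysis the workbook automates, the sketch below reads a probability off an assumed normal forecast distribution, mirroring the dataset's treatment of forecasts as normally distributed predicted values. The mean and standard deviation here are placeholders, not figures from the dataset.

```python
# Minimal sketch: probability that cases exceed a threshold under a normal forecast.
from scipy.stats import norm

mean_cases, std_cases = 8_500_000, 1_200_000   # hypothetical forecast parameters
threshold = 10_000_000

p_exceed = norm.sf(threshold, loc=mean_cases, scale=std_cases)  # survival function
print(f"P(cases > {threshold:,}) = {p_exceed:.1%}")
```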
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
In our work, we have designed and implemented a novel workflow with several heuristic methods to combine state-of-the-art methods for gathering CVE fix commits. As a consequence of our improvements, we have been able to gather the largest programming-language-independent real-world dataset of CVE vulnerabilities with the associated fix commits. Our dataset, containing 29,203 unique CVEs coming from 7,238 unique GitHub projects, is, to the best of our knowledge, by far the biggest CVE vulnerability dataset with fix commits available today. These CVEs are associated with 35,276 unique commits in the SQL dump and 39,931 patch commit files that fixed those vulnerabilities (some patch files cannot be saved as SQL for several technical reasons). Our larger dataset thus substantially improves over current real-world vulnerability datasets and enables further progress in research on vulnerability detection and software security. We used the NVD (nvd.nist.gov) and the GitHub Security Advisory Database as the main sources for our pipeline.
We release to the community a 16GB PostgreSQL database that contains information on CVEs up to 2024-09-26, CWEs of each CVE, files and methods changed by each commit, and repository metadata. Additionally, patch files related to the fix commits are available as a separate package. Furthermore, we make our dataset collection tool also available to the community.
The cvedataset-patches.zip file contains the fix patches, and postgrescvedumper.sql.zip contains a PostgreSQL dump of the fixes, together with several other fields such as CVEs, CWEs, repository metadata, commit data, file changes, methods changed, etc.
The MoreFixes data-storage strategy is based on CVEFixes for storing CVE fix commits from open-source repositories, and it uses a modified version of Prospector (part of Project KB from SAP) as a module to detect the fix commits of a CVE. Our full methodology is presented in the paper "MoreFixes: A Large-Scale Dataset of CVE Fix Commits Mined through Enhanced Repository Discovery", published at the PROMISE conference (2024).
For more information about usage and sample queries, visit the Github repository: https://github.com/JafarAkhondali/Morefixes
If you are using this dataset, please be aware that the mined repositories are under various licenses, and you are responsible for handling any licensing issues. The same applies to CVEFixes.
This product uses the NVD API but is not endorsed or certified by the NVD.
This research was partially supported by the Dutch Research Council (NWO) under the project NWA.1215.18.008 Cyber Security by Integrated Design (C-SIDe).
To restore the dataset, you can use the docker-compose file available at the GitHub repository. Dataset default credentials after restoring the dump:
POSTGRES_USER=postgrescvedumper POSTGRES_DB=postgrescvedumper POSTGRES_PASSWORD=a42a18537d74c3b7e584c769152c3d
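Once the dump is restored, a connection check along the following lines can verify access with the default credentials above. The table names and schema are documented in the GitHub repository, so this sketch only lists tables via the standard information_schema; host and port are assumptions about a local docker-compose setup.

```python
# Minimal sketch: connect to the restored dump and list its public tables.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432,               # assumed local mapping
    dbname="postgrescvedumper",
    user="postgrescvedumper",
    password="a42a18537d74c3b7e584c769152c3d",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT table_name FROM information_schema.tables
        WHERE table_schema = 'public' ORDER BY table_name
    """)
    for (name,) in cur.fetchall():
        print(name)
conn.close()
```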
Please use this for citation:

@inproceedings{morefixes2024,
  title={MoreFixes: A large-scale dataset of CVE fix commits mined through enhanced repository discovery},
  author={Akhoundali, Jafar and Nouri, Sajad Rahim and Rietveld, Kristian and Gadyatskaya, Olga},
  booktitle={Proceedings of the 20th International Conference on Predictive Models and Data Analytics in Software Engineering},
  pages={42--51},
  year={2024}
}
By Health [source]
This dataset contains detailed information about 30-day readmission and mortality rates of U.S. hospitals. It is an essential tool for stakeholders aiming to identify opportunities for improving healthcare quality and performance across the country. Providers benefit from having access to comprehensive data regarding readmission, mortality rate, score, measure start/end dates, comparison to the national average, as well as other pertinent details like ZIP codes, phone numbers and county names. Use this data set to evaluate how hospitals are meeting industry standards from a quality and outcomes perspective in order to make more informed decisions when designing patient care strategies and policies.
This dataset provides data on 30-day readmission and mortality rates of U.S. hospitals, useful in understanding the quality of healthcare being provided. This data can provide insight into the effectiveness of treatments, patient care, and staff performance at different healthcare facilities throughout the country.
In order to use this dataset effectively, it is important to understand each column and how best to interpret it. The 'Hospital Name' column displays the name of the facility; 'Address' lists a street address for the hospital; 'City' indicates its geographic location; 'State' specifies a two-letter abbreviation for that state; 'ZIP Code' provides each facility's 5-digit ZIP code; 'County Name' specifies the county in which the hospital resides; 'Phone Number' lists a phone contact for the facility; 'Measure Name' identifies which measure is being recorded (for instance: Elective Delivery Before 39 Weeks); and the 'Score' value reflects an average score based on patient feedback surveys taken over the time frame listed under 'Measure Start Date'. There are also columns tracking both lower estimates ('Lower Estimate') and higher estimates ('Higher Estimate'); these capture variability that can be tracked by researchers seeking further answers or formulating future studies in this field. Lastly, there is one more field associated with this set, 'Footnote', which may highlight additional important details pertinent to analysis, such as values outlying national averages.
This data set can be used by hospitals, research facilities and other interested parties to provide insightful information when making decisions about patient care standards throughout America. It can help find patterns in readmissions and mortality along county lines, or answer questions about performance fluctuations between different hospital locations over an extended period of time. So if you are ever curious about 30-day readmissions within U.S. hospitals, don't hesitate to dive into this insightful dataset!
- Comparing hospitals on a regional or national basis to measure the quality of care provided for readmission and mortality rates.
- Analyzing the effects of technological advancements such as telemedicine, virtual visits, and AI on readmission and mortality rates at different hospitals.
- Using measures such as the 'Lower Estimate' and 'Higher Estimate' scores to identify systematic problems in readmission or mortality rate management at hospitals, and to inform public health care policy.
If you use this dataset in your research, please credit the original authors. Data Source
License: Dataset copyright by authors
- You are free to:
  - Share - copy and redistribute the material in any medium or format for any purpose, even commercially.
  - Adapt - remix, transform, and build upon the material for any purpose, even commercially.
- You must:
  - Give appropriate credit - provide a link to the license, and indicate if changes were made.
  - ShareAlike - you must distribute your contributions under the same license as the original.
  - Keep intact - all notices that refer to this license, including copyright notices.
File: Readmissions_and_Deaths_-_Hospital.csv

| Column name | Description |
|:------------|:------------|
| Hospital Name ... | |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MGD: Music Genre Dataset
Over recent years, the world has seen a dramatic change in the way people consume music, moving from physical records to streaming services. Since 2017, such services have become the main source of revenue within the global recorded music market. Therefore, this dataset is built using data from Spotify. It provides a weekly chart of the 200 most streamed songs for each country and territory in which Spotify is present, as well as an aggregated global chart.
Considering that countries behave differently when it comes to musical tastes, we use chart data from global and regional markets from January 2017 to December 2019, considering eight of the top 10 music markets according to IFPI: United States (1st), Japan (2nd), United Kingdom (3rd), Germany (4th), France (5th), Canada (8th), Australia (9th), and Brazil (10th).
We also provide information about the hit songs and artists present in the charts, such as all collaborating artists within a song (since the charts only provide the main ones) and their respective genres, which is the core of this work. MGD also provides data about musical collaboration, as we build collaboration networks based on artist partnerships in hit songs. Therefore, this dataset contains:
Genre Networks: Success-based genre collaboration networks
Genre Mapping: Genre mapping from Spotify genres to super-genres
Artist Networks: Success-based artist collaboration networks
Artists: Some artist data
Hit Songs: Hit Song data and features
Charts: Enhanced data from Spotify Weekly Top 200 Charts
This dataset was originally built for a conference paper at ISMIR 2020. If you make use of the dataset, please also cite the following paper:
Gabriel P. Oliveira, Mariana O. Silva, Danilo B. Seufitelli, Anisio Lacerda, and Mirella M. Moro. Detecting Collaboration Profiles in Success-based Music Genre Networks. In Proceedings of the 21st International Society for Music Information Retrieval Conference (ISMIR 2020), 2020.
@inproceedings{ismir/OliveiraSSLM20,
  title = {Detecting Collaboration Profiles in Success-based Music Genre Networks},
  author = {Gabriel P. Oliveira and Mariana O. Silva and Danilo B. Seufitelli and Anisio Lacerda and Mirella M. Moro},
  booktitle = {21st International Society for Music Information Retrieval Conference},
  pages = {726--732},
  year = {2020}
}
This dataset contains information on application install interactions of users in the Myket android application market. The dataset was created for the purpose of evaluating interaction prediction models, requiring user and item identifiers along with timestamps of the interactions. Hence, the dataset can be used for interaction prediction and building a recommendation system. Furthermore, the data forms a dynamic network of interactions, and we can also perform network representation learning on the nodes in the network, which are users and applications.
Data Creation

The dataset was initially generated by the Myket data team, and later cleaned and subsampled by Erfan Loghmani, a master's student at Sharif University of Technology at the time. The data team focused on a two-week period and randomly sampled 1/3 of the users with interactions during that period. They then selected install and update interactions for three months before and after the two-week period, resulting in interactions spanning about 6 months and two weeks.
We further subsampled and cleaned the data to focus on application download interactions. We identified the top 8000 most installed applications and selected interactions related to them. We retained users with more than 32 interactions, resulting in 280,391 users. From this group, we randomly selected 10,000 users, and the data was filtered to include only interactions for these users. The detailed procedure can be found here.
Data Structure

The dataset has two main files:
myket.csv: This file contains the interaction information and follows the same format as the datasets used in the "JODIE: Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks" (ACM SIGKDD 2019) project. However, this data does not contain state labels or interaction features, so the associated columns are all zero.

app_info_sample.csv: This file comprises features associated with the applications present in the sample. For each individual application, information such as the approximate number of installs, average rating, count of ratings, and category is included. These features provide insights into the applications present in the dataset.
Dataset Details
- Total Instances: 694,121 install interaction instances
- Instance Format: triplets of user_id, app_name, timestamp
- 10,000 users and 7,988 Android applications
- Item features for 7,606 applications
For a detailed summary of the data's statistics, including information on users, applications, and interactions, please refer to the Python notebook available at summary-stats.ipynb. The notebook provides an overview of the dataset's characteristics and can be helpful for understanding the data's structure before using it for research or analysis.
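A minimal loading sketch is shown below; it assumes the first three columns of myket.csv are the user, item, and timestamp fields described above, and checks the headline statistics.

```python
# Minimal sketch: load interaction triplets and verify the dataset statistics.
import pandas as pd

inter = pd.read_csv("myket.csv")
inter = inter.rename(columns=dict(zip(inter.columns[:3],
                                      ["user_id", "app_name", "timestamp"])))

print(inter["user_id"].nunique(), "users")           # expect 10,000
print(inter["app_name"].nunique(), "applications")   # expect 7,988
print(round(len(inter) / inter["user_id"].nunique(), 1),
      "interactions per user on average")            # expect ~69.4
```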
Top 20 Most Installed Applications

| Package Name | Count of Interactions |
| ------------ | --------------------- |
| com.instagram.android | 15292 |
| ir.resaneh1.iptv | 12143 |
| com.tencent.ig | 7919 |
| com.ForgeGames.SpecialForcesGroup2 | 7797 |
| ir.nomogame.ClutchGame | 6193 |
| com.dts.freefireth | 6041 |
| com.whatsapp | 5876 |
| com.supercell.clashofclans | 5817 |
| com.mojang.minecraftpe | 5649 |
| com.lenovo.anyshare.gps | 5076 |
| ir.medu.shad | 4673 |
| com.firsttouchgames.dls3 | 4641 |
| com.activision.callofduty.shooter | 4357 |
| com.tencent.iglite | 4126 |
| com.aparat | 3598 |
| com.kiloo.subwaysurf | 3135 |
| com.supercell.clashroyale | 2793 |
| co.palang.QuizOfKings | 2589 |
| com.nazdika.app | 2436 |
| com.digikala | 2413 |
Comparison with SNAP Datasets

The Myket dataset introduced in this repository exhibits distinct characteristics compared to the real-world datasets used by the JODIE project. The table below provides a comparative overview of the key dataset characteristics:
| Dataset | #Users | #Items | #Interactions | Average Interactions per User | Average Unique Items per User |
| ------- | ------ | ------ | ------------- | ----------------------------- | ----------------------------- |
| Myket | 10,000 | 7,988 | 694,121 | 69.4 | 54.6 |
| LastFM | 980 | 1,000 | 1,293,103 | 1,319.5 | 158.2 |
| Reddit | 10,000 | 984 | 672,447 | 67.2 | 7.9 |
| Wikipedia | 8,227 | 1,000 | 157,474 | 19.1 | 2.2 |
| MOOC | 7,047 | 97 | 411,749 | 58.4 | 25.3 |
The Myket dataset stands out by having an ample number of both users and items, highlighting its relevance for real-world, large-scale applications. Unlike LastFM, Reddit, and Wikipedia datasets, where users exhibit repetitive item interactions, the Myket dataset contains a comparatively lower amount of repetitive interactions. This unique characteristic reflects the diverse nature of user behaviors in the Android application market environment.
Citation

If you use this dataset in your research, please cite the following preprint:
@misc{loghmani2023effect,
  title={Effect of Choosing Loss Function when Using T-batching for Representation Learning on Dynamic Networks},
  author={Erfan Loghmani and MohammadAmin Fazli},
  year={2023},
  eprint={2308.06862},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
Success.ai’s LinkedIn Data Solutions offer unparalleled access to a vast dataset of 700 million public LinkedIn profiles and 70 million LinkedIn company records, making it one of the most comprehensive and reliable LinkedIn datasets available on the market today. Our employee data and LinkedIn data are ideal for businesses looking to streamline recruitment efforts, build highly targeted lead lists, or develop personalized B2B marketing campaigns.
Whether you’re looking for recruiting data, conducting investment research, or seeking to enrich your CRM systems with accurate and up-to-date LinkedIn profile data, Success.ai provides everything you need with pinpoint precision. By tapping into LinkedIn company data, you’ll have access to over 40 critical data points per profile, including education, professional history, and skills.
Key Benefits of Success.ai’s LinkedIn Data: Our LinkedIn data solution offers more than just a dataset. With GDPR-compliant data, AI-enhanced accuracy, and a price match guarantee, Success.ai ensures you receive the highest-quality data at the best price in the market. Our datasets are delivered in Parquet format for easy integration into your systems, and with millions of profiles updated daily, you can trust that you’re always working with fresh, relevant data.
Global Reach and Industry Coverage: Our LinkedIn data covers professionals across all industries and sectors, providing you with detailed insights into businesses around the world. Our geographic coverage spans 259M profiles in the United States, 22M in the United Kingdom, 27M in India, and thousands of profiles in regions such as Europe, Latin America, and Asia Pacific. With LinkedIn company data, you can access profiles of top companies from the United States (6M+), United Kingdom (2M+), and beyond, helping you scale your outreach globally.
Why Choose Success.ai’s LinkedIn Data: Success.ai stands out for its tailored approach and white-glove service, making it easy for businesses to receive exactly the data they need without managing complex data platforms. Our dedicated Success Managers will curate and deliver your dataset based on your specific requirements, so you can focus on what matters most—reaching the right audience. Whether you’re sourcing employee data, LinkedIn profile data, or recruiting data, our service ensures a seamless experience with 99% data accuracy.
Key Use Cases:
LinkedIn URL: Access direct links to LinkedIn profiles for immediate insights. Full Name: Verified first and last names. Job Title: Current job titles, and prior experience. Company Information: Company name, LinkedIn URL, domain, and location. Work and Per...
As of June 2024, the most popular database management system (DBMS) worldwide was Oracle, with a ranking score of 1244.08; MySQL and Microsoft SQL server rounded out the top three. Although the database management industry contains some of the largest companies in the tech industry, such as Microsoft, Oracle and IBM, a number of free and open-source DBMSs such as PostgreSQL and MariaDB remain competitive. Database Management Systems As the name implies, DBMSs provide a platform through which developers can organize, update, and control large databases. Given the business world’s growing focus on big data and data analytics, knowledge of SQL programming languages has become an important asset for software developers around the world, and database management skills are seen as highly desirable. In addition to providing developers with the tools needed to operate databases, DBMS are also integral to the way that consumers access information through applications, which further illustrates the importance of the software.
Survey-based Harmonized Indicators (SHIP) files are harmonized data files from household surveys conducted by countries in Africa. To ensure the quality and transparency of the data, it is critical to document the procedures for compiling consumption aggregates and other indicators so that the results can be replicated with ease. This documentation supports consistency and continuity, making temporal and cross-country comparisons more reliable.
Four harmonized data files are prepared for each survey to generate a set of harmonized variables that share the same variable names. Invariably, each survey asks questions in a slightly different way, which makes it challenging to define harmonized variables consistently. The harmonized household survey data therefore present the best available variables with harmonized definitions, but not identical variables. The four harmonized data files are:
a) Individual-level file (labor force indicators are in a separate file): basic characteristics of individuals such as age and sex, literacy, education, health, anthropometry, and child survival.
b) Labor force file: labor force information, including employment/unemployment, earnings, sectors of employment, etc.
c) Household-level file: household expenditure, household head characteristics (age and sex, level of education, employment), housing amenities, assets, and access to infrastructure and services.
d) Household expenditure file: consumption/expenditure aggregates by consumption group according to the UN Classification of Individual Consumption According to Purpose (COICOP).
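To see how the four files fit together, here is a minimal pandas sketch. The file names and key columns (hhid, indiv_id) are hypothetical, since identifiers vary from survey to survey:

```python
import pandas as pd

# Hypothetical file names; actual SHIP releases differ by country and year.
individuals = pd.read_csv("ship_individual.csv")    # one row per person
labor       = pd.read_csv("ship_labor_force.csv")   # labor force indicators
households  = pd.read_csv("ship_household.csv")     # one row per household
expenditure = pd.read_csv("ship_expenditure.csv")   # COICOP aggregates

# Person-level files share an individual key; household-level files share hhid.
person = individuals.merge(labor, on=["hhid", "indiv_id"], how="left")
hh = households.merge(expenditure, on="hhid", how="left")

# Attach household context to each person for cross-level analysis.
full = person.merge(hh, on="hhid", how="left")
```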
National
The survey covered all de jure household members (usual residents).
Sample survey data [ssd]
SAMPLE DESIGN FOR ROUND 4 OF THE GLSS
A nationally representative sample of households was selected in order to achieve the survey objectives.
Sample Frame
For the purposes of this survey, the list of the 1984 population census Enumeration Areas (EAs) with population and household information was used as the sampling frame. The primary sampling units were the 1984 EAs, with the secondary units being the households within the EAs. This frame, though quite old, was considered adequate, as it was the best available at the time. Indeed, this frame was used for the earlier rounds of the GLSS.
Stratification
To increase the precision and reliability of the estimates, stratification was employed in the sample design, using geographical factors, ecological zones, and location of residence as the main controls. Specifically, the EAs were first stratified according to the three ecological zones, namely Coastal, Forest, and Savannah; within each zone, further stratification was done by locality size into rural and urban.
SAMPLE SELECTION
EAs
A two-stage sample was selected for the survey. At the first stage, 300 EAs were selected using systematic sampling with probability proportional to size (PPS), where the size measure was the 1984 number of households in the EA. This was achieved by ordering the list of EAs, with their sizes, according to the strata. The size column was then cumulated, and with a random start and a fixed interval the sample EAs were selected.
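The cumulate-and-step procedure described above translates directly into code. Below is a minimal NumPy sketch of systematic PPS selection for a single stratum (in practice it would be run within each stratum separately); the function name and the simulated frame are hypothetical:

```python
import numpy as np

def pps_systematic_sample(sizes, n_sample, rng):
    """Systematic PPS: cumulate the size measures, pick a random start,
    then step through the cumulated column at a fixed interval."""
    cum = np.cumsum(sizes)                    # cumulated size column
    interval = cum[-1] / n_sample             # fixed selection interval
    start = rng.uniform(0, interval)          # random start
    points = start + interval * np.arange(n_sample)
    # each selection point lands in exactly one unit's cumulated range
    return np.searchsorted(cum, points, side="right")

rng = np.random.default_rng(1984)
ea_households = rng.integers(80, 400, size=13_000)  # hypothetical 1984 frame
selected_eas = pps_systematic_sample(ea_households, 300, rng)
```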
It was observed that some of the selected EAs had grown in size over time and therefore needed segmentation. In this connection, such EAs were divided into approximately equal parts, each segment constituting about 200 households. Only one segment was then randomly selected for listing of the households.
Households
At the second stage, a fixed number of 20 households was systematically selected from each selected EA, giving a total of 6,000 households. An additional 5 households were selected as a reserve to replace missing households. An equal number of households was selected from each EA in order to reflect the labour force focus of the survey.
NOTE: The above sample selection procedure deviated slightly from that used for the earlier rounds of the GLSS; as such, the sample is not self-weighting. This is because:
1. given the long period between 1984 and the GLSS 4 fieldwork, the number of households in the various EAs is likely to have grown at different rates;
2. the listing exercise was not properly done, as some of the selected EAs were not listed completely. Moreover, the segmentation done for larger EAs during the listing was somewhat arbitrary.
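Because the sample is not self-weighting, each household needs a design weight that reverses its unequal selection probability. Here is a minimal sketch under textbook two-stage PPS assumptions; the function and variable names are mine, and it ignores segmentation and non-response adjustments:

```python
def design_weight(ea_size_1984, total_size_1984, listed_households,
                  n_ea=300, m=20):
    # First stage: EA drawn with PPS out of n_ea selections.
    p_first = n_ea * ea_size_1984 / total_size_1984
    # Second stage: m households drawn from those listed at fieldwork time.
    p_second = m / listed_households
    return 1.0 / (p_first * p_second)

# If every EA had grown at the same rate and listing were complete, these
# weights would be constant across households (a self-weighting design).
w = design_weight(ea_size_1984=250, total_size_1984=3_000_000,
                  listed_households=310)
```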
Face-to-face [f2f]
3️⃣ Ensuring Logical Relationships Between Data Points
Hospitals take longer to recover due to larger infrastructure and compliance requirements. Organizations that track more cyber threats recover faster because they detect attacks earlier. Backup security significantly impacts recovery time, reflecting the real-world risk of backup encryption attacks.
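A minimal sketch of how such relationships might be encoded with NumPy and pandas; the column names and effect sizes are illustrative assumptions, not the dataset's actual parameters:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

df = pd.DataFrame({
    "org_type": rng.choice(["hospital", "clinic", "research_lab"], size=n),
    "threats_monitored_per_day": rng.integers(0, 5, size=n),
    "backup_compromised": rng.random(n) < 0.66,  # Sophos 2024 benchmark
})

# Start from a noisy baseline, then layer in the logical relationships:
recovery = rng.normal(10, 3, size=n)
recovery += np.where(df["org_type"] == "hospital", 5.0, 0.0)   # hospitals slower
recovery -= 1.5 * df["threats_monitored_per_day"].to_numpy()   # monitoring detects earlier
recovery += np.where(df["backup_compromised"], 7.0, 0.0)       # encrypted backups delay restore
df["recovery_time_days"] = recovery.clip(min=1).round(1)
```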