96 datasets found
  1. Stroke_Analysis

    • data.mendeley.com
    Updated Dec 2, 2020
    Cite
    Vamsi Bandi (2020). Stroke_Analysis [Dataset]. http://doi.org/10.17632/jpb5tds9f6.1
    Explore at:
    Dataset updated
    Dec 2, 2020
    Authors
    Vamsi Bandi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set contains the following primary attributes, with their means and standard deviations:

    • Age: mean 47.12, SD 23.69
    • NIHSS: mean 18.12, SD 11.27
    • mRS: mean 3.67, SD 1.87
    • Systolic blood pressure: mean 153.09, SD 24.92
    • Diastolic blood pressure: mean 103.65, SD 18.34
    • Glucose: mean 225.85, SD 56.11
    • Paralysis: mean 1.36, SD 1.106
    • Smoking: mean 0.88, SD 0.9
    • BMI: mean 33.73, SD 6.23
    • Cholesterol: mean 217.53, SD 20.26

  2. Accompanying simulated data for "Go multivariate: a Monte Carlo study of a...

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Mar 25, 2022
    Cite
    Sebastian Mildiner Moraga; Emmeke Aarts (2022). Accompanying simulated data for "Go multivariate: a Monte Carlo study of a multilevel hidden Markov model with categorical data of varying complexity" [Dataset]. http://doi.org/10.5281/zenodo.6384007
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 25, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sebastian Mildiner Moraga; Emmeke Aarts
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The multilevel hidden Markov model (MHMM) is a promising vehicle to investigate latent dynamics over time in social and behavioral processes. By including continuous individual random effects, the model accommodates variability between individuals, providing individual-specific trajectories and facilitating the study of individual differences. However, the performance of the MHMM has not been sufficiently explored. Currently, there are no practical guidelines on the sample size needed to obtain reliable estimates related to categorical data characteristics. We performed an extensive simulation to assess the effect of the number of dependent variables (1-4), the number of individuals (5-90), and the number of observations per individual (100-1600) on the estimation performance of group-level parameters and between-individual variability in a Bayesian MHMM with categorical data of various levels of complexity. We found that using multivariate data generally reduces the sample size needed and improves the stability of the results. Regarding the estimation of group-level parameters, the number of individuals and observations largely compensate for each other. Meanwhile, only the former drives the estimation of between-individual variability. We conclude with guidelines on the sample size necessary based on the complexity of the data and the study objectives of the practitioners.

    This repository contains data generated for the manuscript: "Go multivariate: a Monte Carlo study of a multilevel hidden Markov model with categorical data of varying complexity". It comprises: (1) model outputs (maximum a posteriori estimates) for each repetition (n=100) of each scenario (n=324) of the main simulation, and (2) complete model outputs (including estimates for 4000 MCMC iterations) for two chains of each repetition (n=3) of each scenario (n=324). Please note that the empirical data used in the manuscript are not available as part of this repository. A subsample of the data used in the empirical example is openly available as an example data set in the R package mHMMbayes on CRAN. The full data set is available on request from the authors.

  3. The StreamCat Dataset: Accumulated Attributes for NHDPlusV2 (Version 2.1)...

    • catalog.data.gov
    • gimi9.com
    Updated Feb 4, 2025
    + more versions
    Cite
    U.S. Environmental Protection Agency, Office of Research and Development (ORD), Center for Public Health and Environmental Assessment (CPHEA), Pacific Ecological Systems Division (PESD), (2025). The StreamCat Dataset: Accumulated Attributes for NHDPlusV2 (Version 2.1) Catchments for the Conterminous United States: Surficial Lithology in Watershed [Dataset]. https://catalog.data.gov/dataset/the-streamcat-dataset-accumulated-attributes-for-nhdplusv2-version-2-1-catchments-for-the--5783e
    Explore at:
    Dataset updated
    Feb 4, 2025
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Area covered
    Contiguous United States, United States
    Description

    This dataset represents the density of 18 USGS lithology classes within individual, local NHDPlusV2 catchments and upstream, contributing watersheds (see Data Sources for links to NHDPlusV2 data and USGS). Attributes were calculated for every local NHDPlusV2 catchment and then accumulated to provide watershed-level metrics for USGS lithology data. This data set is derived from the USGS raster map of 18 lithology classes (categorical data type) for the conterminous USA. The map was produced based on texture, internal structure, thickness, and environment of deposition or formation of materials. These 18 lithology classes were summarized by local catchment and by watershed to produce 18 local catchment-level and watershed-level metrics as a categorical data type.

  4. Summary of variables of the data set included in the analysis.

    • plos.figshare.com
    xls
    Updated Jun 8, 2023
    Cite
    Owen Bodger; Aidan Byrne; Philip A. Evans; Sarah Rees; Gwen Jones; Claire Cowell; Mike B. Gravenor; Rhys Williams (2023). Summary of variables of the data set included in the analysis. [Dataset]. http://doi.org/10.1371/journal.pone.0027161.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Owen Bodger; Aidan Byrne; Philip A. Evans; Sarah Rees; Gwen Jones; Claire Cowell; Mike B. Gravenor; Rhys Williams
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Footnote: (f) denotes a categorical variable, (c) a continuous covariate and (n) a nominal variable.

  5. Replication Data for: Nursery Data Set

    • dataverse.harvard.edu
    Updated Apr 5, 2018
    Cite
    Wenjuan Wang (2018). Replication Data for: Nursery Data Set [Dataset]. http://doi.org/10.7910/DVN/MBFQK0
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 5, 2018
    Dataset provided by
    Harvard Dataverse
    Authors
    Wenjuan Wang
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset was downloaded from the UCI repository: https://archive.ics.uci.edu/ml/datasets/nursery. The dataset contains categorical data used to rank nursery school applicants. The original dataset contains 5 classes; these were reorganized so that only two classes remain ("recommended" or "not recommended"), as in the sketch below.
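
    A minimal sketch of that binarization in Python/pandas, assuming the raw UCI file and its documented five class labels; the publisher's exact grouping is not stated in this record, so the mapping below is one plausible reading:

        import pandas as pd

        # Column names from the UCI "nursery" documentation (the raw file has no header).
        cols = ["parents", "has_nurs", "form", "children", "housing",
                "finance", "social", "health", "class"]
        df = pd.read_csv("nursery.data", header=None, names=cols)

        # One plausible binarization: keep "not_recom" as the negative class and
        # collapse the remaining labels ("recommend", "very_recom", "priority",
        # "spec_prior") into "recommended".
        df["class"] = df["class"].map(
            lambda c: "not recommended" if c == "not_recom" else "recommended"
        )
        print(df["class"].value_counts())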

  6. Data from: car sales

    • kaggle.com
    zip
    Updated Oct 30, 2023
    Cite
    sridhar jakkaraju (2023). car sales [Dataset]. https://www.kaggle.com/datasets/sridharjakkaraju/car-sales/code
    Explore at:
    Available download formats: zip (120379 bytes)
    Dataset updated
    Oct 30, 2023
    Authors
    sridhar jakkaraju
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset was created by sridhar jakkaraju and released under CC0: Public Domain.

  7. Bridging the Gap in Hypertension Management: Evaluating Blood Pressure...

    • data.mendeley.com
    Updated Jan 15, 2025
    + more versions
    Cite
    abu sufian (2025). Bridging the Gap in Hypertension Management: Evaluating Blood Pressure Control and Associated Risk Factors in a Resource-Constrained Setting [Dataset]. http://doi.org/10.17632/56jyjndvcr.1
    Explore at:
    Dataset updated
    Jan 15, 2025
    Authors
    abu sufian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    This dataset contains a simulated collection of 100,000 patient records designed to explore hypertension management in resource-constrained settings. It provides comprehensive data for analyzing blood pressure control rates, associated risk factors, and complications. The dataset is suited to predictive modelling, risk analysis, and treatment optimization, offering insights into demographic, clinical, and treatment-related variables.

    Dataset Structure

    1. Dataset Volume

      • Size: 10,000 records.
      • Features: 19 variables, categorized into Sociodemographic, Clinical, Complications, and Treatment/Control groups.

    2. Variables and Categories

    A. Sociodemographic Variables

    1. Age:
    •  Continuous variable in years.
    •  Range: 18–80 years.
    •  Mean ± SD: 49.37 ± 12.81.
    2. Sex:
    •  Categorical variable.
    •  Values: Male, Female.
    3. Education:
    •  Categorical variable.
    •  Values: No Education, Primary, Secondary, Higher Secondary, Graduate, Post-Graduate, Madrasa.
    4. Occupation:
    •  Categorical variable.
    •  Values: Service, Business, Agriculture, Retired, Unemployed, Housewife.
    5. Monthly Income:
    •  Categorical variable in Bangladeshi Taka.
    •  Values: <5000, 5001–10000, 10001–15000, >15000.
    6. Residence:
    •  Categorical variable.
    •  Values: Urban, Sub-urban, Rural.
    

    B. Clinical Variables

    7. Systolic BP:
    •  Continuous variable in mmHg.
    •  Range: 100–200 mmHg.
    •  Mean ± SD: 140 ± 15 mmHg.
    8. Diastolic BP:
    •  Continuous variable in mmHg.
    •  Range: 60–120 mmHg.
    •  Mean ± SD: 90 ± 10 mmHg.
    9. Elevated Creatinine:
    •  Binary variable (≥ 1.4 mg/dL).
    •  Values: Yes, No.
    10. Diabetes Mellitus:
    •  Binary variable.
    •  Values: Yes, No.
    11. Family History of CVD:
    •  Binary variable.
    •  Values: Yes, No.
    12. Elevated Cholesterol:
    •  Binary variable (≥ 200 mg/dL).
    •  Values: Yes, No.
    13. Smoking:
    •  Binary variable.
    •  Values: Yes, No.
    

    C. Complications

    14. LVH (Left Ventricular Hypertrophy):
    •  Binary variable (ECG diagnosis).
    •  Values: Yes, No.
    15. IHD (Ischemic Heart Disease):
    •  Binary variable.
    •  Values: Yes, No.
    16. CVD (Cerebrovascular Disease):
    •  Binary variable.
    •  Values: Yes, No.
    17. Retinopathy:
    •  Binary variable.
    •  Values: Yes, No.
    

    D. Treatment and Control

    18. Treatment:
    •  Categorical variable indicating therapy type.
    •  Values: Single Drug, Combination Drugs.
    19. Control Status:
    •  Binary variable.
    •  Values: Controlled, Uncontrolled.
    

    Dataset Applications

    1. Predictive Modeling:
    •  Develop models to predict blood pressure control status using demographic and clinical data (see the sketch after this list).
    2. Risk Analysis:
    •  Identify significant factors influencing hypertension control and complications.
    3. Severity Scoring:
    •  Quantify hypertension severity for patient risk stratification.
    4. Complications Prediction:
    •  Forecast complications like IHD, LVH, and CVD for early intervention.
    5. Treatment Guidance:
    •  Analyze therapy efficacy to recommend optimal treatment strategies.
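
    As a minimal sketch of the first application above, the snippet below fits a logistic regression for control status with scikit-learn. The file name and column labels are assumptions based on the variable list in this record, not the actual CSV headers:

        import pandas as pd
        from sklearn.compose import ColumnTransformer
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import OneHotEncoder, StandardScaler

        df = pd.read_csv("hypertension.csv")  # hypothetical file name

        num_cols = ["Age", "Systolic BP", "Diastolic BP"]          # assumed headers
        cat_cols = ["Sex", "Education", "Occupation", "Monthly Income",
                    "Residence", "Diabetes Mellitus", "Smoking", "Treatment"]

        X = df[num_cols + cat_cols]
        y = (df["Control Status"] == "Controlled").astype(int)

        # Scale continuous variables, one-hot-encode categorical ones.
        pre = ColumnTransformer([
            ("num", StandardScaler(), num_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ])
        model = make_pipeline(pre, LogisticRegression(max_iter=1000))

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
        model.fit(X_tr, y_tr)
        print("held-out accuracy:", model.score(X_te, y_te))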
    
  8. Black Friday Sales EDA

    • kaggle.com
    Updated Oct 29, 2022
    + more versions
    Cite
    Rushikesh Konapure (2022). Black Friday Sales EDA [Dataset]. https://www.kaggle.com/datasets/rishikeshkonapure/black-friday-sales-eda
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Oct 29, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rushikesh Konapure
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset History

    A retail company "ABC Private Limited" wants to understand customer purchase behaviour (specifically, purchase amount) across various products of different categories. They have shared purchase summaries of various customers for selected high-volume products from last month. The data set also contains customer demographics (age, gender, marital status, city type, stay in current city), product details (productid and product category) and the total purchase amount from last month.

    Now, they want to build a model to predict the purchase amount of customers against various products which will help them to create a personalized offer for customers against different products.

    Tasks to perform

    The Purchase column is the target variable; perform univariate analysis and bivariate analysis with respect to Purchase.

    "Masked" in a column description means the values have already been converted from categorical to numerical.

    The points below are just to get you started with the dataset; it is not mandatory to follow the same sequence.

    DATA PREPROCESSING

    • Check the basic statistics of the dataset

    • Check for missing values in the data

    • Check for unique values in data

    • Perform EDA

    • Purchase Distribution

    • Check for outliers

    • Analysis by Gender, Marital Status, occupation, occupation vs purchase, purchase by city, purchase by age group, etc

    • Drop unnecessary fields

    • Convert categorical data into integers using the map function (e.g. the 'Gender' column; see the sketch after this list)

    • Missing value treatment

    • Rename columns

    • Fill nan values

    • Map range variables into integers (e.g. the 'Age' column)
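
    A minimal sketch of the two map() conversions above in pandas; the file name is hypothetical, and the value sets follow the standard Black Friday columns:

        import pandas as pd

        df = pd.read_csv("black_friday.csv")  # hypothetical file name

        # Convert a categorical column to integers with map().
        df["Gender"] = df["Gender"].map({"F": 0, "M": 1})

        # Map the binned 'Age' ranges onto ordinal integers.
        age_order = {"0-17": 0, "18-25": 1, "26-35": 2, "36-45": 3,
                     "46-50": 4, "51-55": 5, "55+": 6}
        df["Age"] = df["Age"].map(age_order)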

    Data Visualisation

    • visualize individual column
    • Age vs Purchased
    • Occupation vs Purchased
    • Productcategory1 vs Purchased
    • Productcategory2 vs Purchased
    • Productcategory3 vs Purchased
    • City category pie chart
    • check for more possible plots

    All the Best!!

  9. Drug consumption database: quantified categorical attributes

    • figshare.le.ac.uk
    • figshare.com
    txt
    Updated May 30, 2023
    + more versions
    Cite
    Elaine Fehrman; Vincent Egan; Evgeny Mirkes (2023). Drug consumption database: quantified categorical attributes [Dataset]. http://doi.org/10.25392/leicester.data.7588409.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    University of Leicester
    Authors
    Elaine Fehrman; Vincent Egan; Evgeny Mirkes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Drug consumption database with quantified categorical attributes. DescriptionDB.pdf contains a detailed description of the database.

  10. Accompanying simulated data for "Go multivariate: recommendations on...

    • explore.openaire.eu
    Updated Mar 25, 2022
    + more versions
    Cite
    Sebastian Mildiner Moraga; Emmeke Aarts (2022). Accompanying simulated data for "Go multivariate: recommendations on multilevel hidden Markov models with categorical data of varying complexity" [Dataset]. http://doi.org/10.5281/zenodo.6385196
    Explore at:
    Dataset updated
    Mar 25, 2022
    Authors
    Sebastian Mildiner Moraga; Emmeke Aarts
    Description

    The multilevel hidden Markov model (MHMM) is a promising vehicle to investigate latent dynamics over time in social and behavioral processes. By including continuous individual random effects, the model accommodates variability between individuals, providing individual-specific trajectories and facilitating the study of individual differences. However, the performance of the MHMM has not been sufficiently explored. Currently, there are no practical guidelines on the sample size needed to obtain reliable estimates related to categorical data characteristics. We performed an extensive simulation to assess the effect of the number of dependent variables (1-4), the number of individuals (5-90), and the number of observations per individual (100-1600) on the estimation performance of group-level parameters and between-individual variability in a Bayesian MHMM with categorical data of various levels of complexity. We found that using multivariate data generally reduces the sample size needed and improves the stability of the results. Regarding the estimation of group-level parameters, the number of individuals and observations largely compensate for each other. Meanwhile, only the former drives the estimation of between-individual variability. We conclude with guidelines on the sample size necessary based on the complexity of the data and the study objectives of the practitioners.

    This repository contains data generated for the manuscript: "Go multivariate: recommendations on multilevel hidden Markov models with categorical data of varying complexity". It comprises: (1) model outputs (maximum a posteriori estimates) for each repetition (n=100) of each scenario (n=324) of the main simulation, and (2) complete model outputs (including estimates for 4000 MCMC iterations) for two chains of each repetition (n=3) of each scenario (n=324). Please note that the empirical data used in the manuscript are not available as part of this repository. A subsample of the data used in the empirical example is openly available as an example data set in the R package mHMMbayes on CRAN. The full data set is available on request from the authors.

  11. Prostate Cancer - Dataset - CKAN

    • data.poltekkes-smg.ac.id
    Updated Oct 7, 2024
    + more versions
    Cite
    (2024). Prostate Cancer - Dataset - CKAN [Dataset]. https://data.poltekkes-smg.ac.id/dataset/prostate-cancer
    Explore at:
    Dataset updated
    Oct 7, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dataset of 100 patients for implementing machine learning algorithms and interpreting the results. The data set consists of 100 observations and 10 variables (an ID, 8 numeric variables, and one categorical variable, diagnosis_result), which are as follows: Id, 1. Radius, 2. Texture, 3. Perimeter, 4. Area, 5. Smoothness, 6. Compactness, 7. diagnosis_result, 8. Symmetry, 9. Fractal dimension.

  12. Risky Business: Factor Analysis of Survey Data – Assessing the Probability...

    • plos.figshare.com
    txt
    Updated May 31, 2023
    Cite
    Cees van der Eijk; Jonathan Rose (2023). Risky Business: Factor Analysis of Survey Data – Assessing the Probability of Incorrect Dimensionalisation [Dataset]. http://doi.org/10.1371/journal.pone.0118900
    Explore at:
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Cees van der Eijk; Jonathan Rose
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This paper undertakes a systematic assessment of the extent to which factor analysis recovers the correct number of latent dimensions (factors) when applied to ordered-categorical survey items (so-called Likert items). We simulate 2400 data sets of uni-dimensional Likert items that vary systematically over a range of conditions such as the underlying population distribution, the number of items, the level of random error, and characteristics of items and item-sets. Each of these datasets is factor analysed in a variety of ways that are frequently used in the extant literature, or that are recommended in current methodological texts. These include exploratory factor retention heuristics such as Kaiser's criterion, Parallel Analysis and a non-graphical scree test, and (for exploratory and confirmatory analyses) evaluations of model fit. These analyses are conducted on the basis of Pearson and polychoric correlations. We find that, irrespective of the particular mode of analysis, factor analysis applied to ordered-categorical survey data very often leads to over-dimensionalisation. The magnitude of this risk depends on the specific way in which factor analysis is conducted, the number of items, the properties of the set of items, and the underlying population distribution. The paper concludes with a discussion of the consequences of over-dimensionalisation, and a brief mention of alternative modes of analysis that are much less prone to such problems.

  13. The LakeCat Dataset: Accumulated Attributes for NHDPlusV2 (Version 2.1)...

    • catalog.data.gov
    • gimi9.com
    Updated Feb 5, 2025
    + more versions
    Cite
    U.S. Environmental Protection Agency, Office of Research and Development (ORD), Center for Public Health and Environmental Assessment (CPHEA), Pacific Ecological Systems Division (PESD), (2025). The LakeCat Dataset: Accumulated Attributes for NHDPlusV2 (Version 2.1) Catchments for the Conterminous United States: National Land Cover Database [Dataset]. https://catalog.data.gov/dataset/the-lakecat-dataset-accumulated-attributes-for-nhdplusv2-version-2-1-catchments-for-the-co-2c040
    Explore at:
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Area covered
    Contiguous United States, United States
    Description

    This dataset represents the land cover data within individual local and accumulated upstream catchments for NHDPlusV2 Waterbodies based on the NLCD. Catchment boundaries in LakeCat are defined in one of two ways, on-network or off-network. The on-network catchment boundaries follow the catchments provided in the NHDPlusV2 and the metrics for these lakes mirror metrics from StreamCat, but will substitute the COMID of the NHDWaterbody for that of the NHDFlowline. The off-network catchment framework uses the NHDPlusV2 flow direction rasters to define non-overlapping lake-catchment boundaries and then links them through an off-network flow table. This data set is derived from the NLCD raster composed of 16 land cover classes (categorical data type) for the conterminous USA. Four classes of the NLCD were excluded as they were specific to Alaska land covers. This raster was produced based on a decision-tree classification of 2001, 2004, 2006, 2008, 2011, 2013, 2016, and 2019 Landsat satellite data. This dataset will include additional years as they become available.

  14. Adult dataset preprocessed

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 1, 2024
    Cite
    Adult dataset preprocessed [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12533513
    Explore at:
    Dataset updated
    Jul 1, 2024
    Dataset provided by
    Schuster, Verena
    Pustozerova, Anastasia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The files "adult_train.csv" and "adult_test.csv" contain preprocessed versions of the Adult dataset from the USI repository.

    The file "adult_preprocessing.ipynb" contains a python notebook file with all the preprocessing steps used to generate "adult_train.csv" and "adult_test.csv" from the original Adult dataset.

    The preprocessing steps include (see the sketch after this list):

    One-hot-encoding of categorical values

    Imputation of missing values using knn-imputer with k=1

    Standard scaling of ordinal attributes

    Note: we assume the scenario in which the test set is available before training (every attribute besides the target, "income"); therefore we combine the train and test sets before preprocessing.
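
    A minimal sketch of these steps with pandas and scikit-learn, assuming hypothetical file names, that both files carry the target column, and an assumed subset of ordinal columns; the bundled notebook defines the actual steps:

        import pandas as pd
        from sklearn.impute import KNNImputer
        from sklearn.preprocessing import StandardScaler

        train = pd.read_csv("adult_train_raw.csv")  # hypothetical paths
        test = pd.read_csv("adult_test_raw.csv")

        # Combine train and test before preprocessing, per the note above.
        full = pd.concat([train, test], keys=["train", "test"])
        y = full.pop("income")  # the target is excluded from preprocessing

        # 1. One-hot-encode the categorical columns.
        cat_cols = full.select_dtypes(include="object").columns
        full = pd.get_dummies(full, columns=list(cat_cols))

        # 2. Impute missing values with a kNN imputer, k=1.
        full = pd.DataFrame(KNNImputer(n_neighbors=1).fit_transform(full),
                            index=full.index, columns=full.columns)

        # 3. Standard-scale the ordinal attributes (assumed subset).
        ord_cols = ["age", "education-num", "hours-per-week"]
        full[ord_cols] = StandardScaler().fit_transform(full[ord_cols])

        # Split back into the published train/test files.
        train_out, test_out = full.loc["train"], full.loc["test"]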

  15. BE-KONFORM data set (Group A) (Bedarfsermittlung im Rahmen der Erstellung...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 12, 2024
    Cite
    Brühmann, Boris. A. (2024). BE-KONFORM data set (Group A) (Bedarfsermittlung im Rahmen der Erstellung des Konzepts für das Forschungsdatenmanagement an der Medizinischen Fakultät der Universität Freiburg) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7390789
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Brühmann, Boris. A.
    Knaus, Jochen
    Binder, Harald
    Fichtner, Urs A.
    Horstmeier, Lukas M.
    Area covered
    Freiburg im Breisgau
    Description

    This dataset was collected within the BE-KONFORM study investigating employees' needs regarding research data management at the Medical Faculty of the University of Freiburg. The full dataset captures 236 complete cases. The study included a randomized module allocating subjects to one of two groups: group A received the information that the data would be published (n=113) and group B did not. Due to data protection law, only data from group A subjects who gave written informed consent (n=112) could be published here. This dataset had to be prepared for publication in order to avoid de-anonymisation of subjects. Therefore, the variable [anzahl_m] was recoded into a categorical variable, and open text answers were changed where a combination of variables might have led to identification of the subjects. Changes are marked with brackets: [changed text]. Information on survey mode and sampling is provided in the data note.

  16. Data from: A TripAdvisor Dataset for Dyadic Context Analysis

    • portalinvestigacion.udc.gal
    • data.niaid.nih.gov
    • +1more
    Updated 2022
    Cite
    López-Riobóo Botana, Iñigo Luis; Alonso-Betanzos, Amparo; Bolón-Canedo, Verónica; Guijarro-Berdiñas, Bertha (2022). A TripAdvisor Dataset for Dyadic Context Analysis [Dataset]. https://portalinvestigacion.udc.gal/documentos/668fc448b9e7c03b01bd8b43
    Explore at:
    Dataset updated
    2022
    Authors
    López-Riobóo Botana, Iñigo Luis; Alonso-Betanzos, Amparo; Bolón-Canedo, Verónica; Guijarro-Berdiñas, Bertha
    Description

    There are many contexts where dyadic data are present. In social networks, users are linked to a variety of items, defining interactions. On the social platform of TripAdvisor, users are linked to restaurants by means of the reviews they post. Using the information in these interactions, we can get valuable insights for forecasting and propose tasks related to recommender systems, sentiment analysis, text-based personalisation or text summarisation, among others. Furthermore, in the context of TripAdvisor there is a scarcity of public datasets and a lack of well-known benchmarks for model assessment. We present six new TripAdvisor datasets from the restaurants of six different cities: London, New York, New Delhi, Paris, Barcelona and Madrid. If you use this data, please cite the following paper under submission process (preprint - arXiv). We exclusively collected the reviews written in English from the restaurants of each city.

    The tabular data is comprised of a set of six different CSV files, containing numerical, categorical and text features:

    • parse_count: numerical (integer), sequence number of the review extracted by the web scraper (auto-incremental)
    • author_id: categorical (string), univocal, incremental and anonymous identifier of the user (UID_XXXXXXXXXX)
    • restaurant_name: categorical (string), name of the restaurant matching the review
    • rating_review: numerical (integer), review score in the range 1-5
    • sample: categorical (string), indicating "positive" sample for scores 4-5 and "negative" for scores 1-3
    • review_id: categorical (string), univocal and internal identifier of the review (review_XXXXXXXXX)
    • title_review: text, review title
    • review_preview: text, preview of the review, truncated on the website when the text is very long
    • review_full: text, complete review
    • date: timestamp, publication date of the review in the format (day, month, year)
    • city: categorical (string), city of the restaurant the review was written for
    • url_restaurant: text, restaurant url
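
    A minimal sketch of loading one city file and re-deriving the sample field from rating_review; the file name is hypothetical, and the field names follow the list above:

        import pandas as pd

        reviews = pd.read_csv("tripadvisor_london.csv")  # hypothetical file name

        # 'sample' encodes polarity: "positive" for scores 4-5, "negative" for 1-3.
        derived = reviews["rating_review"].map(
            lambda r: "positive" if r >= 4 else "negative")
        print((derived == reviews["sample"]).all())  # should print True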

  17. INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 19, 2024
    Cite
    Nafiz Sadman (2024). INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4047647
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Nafiz Sadman
    Nishat Anjum
    Kishor Datta Gupta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Bangladesh, United States
    Description

    Introduction

    There are several works applying Natural Language Processing to newspaper reports. Rameshbhai et al. [1] mined opinions from headlines using Stanford NLP and SVM, comparing several algorithms on a small and a large dataset. Rubin et al., in their paper [2], created a mechanism to differentiate fake news from real news by building a set of characteristics of news according to their types; the purpose was to contribute to the low-resource data available for training machine learning algorithms. Doumit et al. [3] implemented LDA, a topic modeling approach, to study bias present in online news media.

    However, little NLP research has been invested in studying COVID-19. Most applications include classification of chest X-rays and CT scans to detect the presence of pneumonia in lungs [4], a consequence of the virus. Other research areas include studying the genome sequence of the virus [5][6][7] and replicating its structure to fight it and find a vaccine. This research is crucial in battling the pandemic. The few NLP-based research publications include sentiment classification of online tweets by Samuel et al. [8] to understand the fear persisting in people due to the virus. Similar work has been done using an LSTM network to classify sentiments from online discussion forums by Jelodar et al. [9]. To the best of our knowledge, the NNK dataset is the first study on a comparatively larger dataset of newspaper reports on COVID-19, contributing to awareness of the virus.

    2 Data-set Introduction

    2.1 Data Collection

    We accumulated 1000 online newspaper reports from the United States of America (USA) on COVID-19. The newspapers include The Washington Post (USA) and StarTribune (USA). We have named this collection "Covid-News-USA-NNK". We also accumulated 50 online newspaper reports from Bangladesh on the issue and named it "Covid-News-BD-NNK". The newspapers include The Daily Star (BD) and Prothom Alo (BD). All these newspapers are among the top providers and most read in their respective countries. The collection was done manually by 10 human data-collectors of age group 23- with university degrees. This approach was preferable to automation in ensuring the news was highly relevant to the subject: the newspaper sites had dynamic content with advertisements in no particular order, so automated scrapers ran a high risk of collecting inaccurate news reports. One challenge in collecting the data was the subscription requirement; each newspaper required $1 per subscription. Some criteria for collecting the news reports, provided as guidelines to the human data-collectors, were as follows:

    The headline must have one or more words directly or indirectly related to COVID-19.

    The content of each news must have 5 or more keywords directly or indirectly related to COVID-19.

    The genre of the news can be anything as long as it is relevant to the topic. Political, social, and economic genres are to be prioritized.

    Avoid taking duplicate reports.

    Maintain a time frame for the above mentioned newspapers.

    To collect these data we used a Google form for the USA and BD. Two human editors went through each entry to check for spam or troll entries.

    2.2 Data Pre-processing and Statistics

    Some pre-processing steps performed on the newspaper report dataset are as follows (see the sketch below):

    Remove hyperlinks.

    Remove non-English alphanumeric characters.

    Remove stop words.

    Lemmatize text.

    While more pre-processing could have been applied, we tried to keep the data as unchanged as possible, since altering sentence structures could cause a loss of valuable information. While this was done with the help of a script, we also assigned the same human collectors to cross-check for the presence of the above mentioned criteria.
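
    A minimal sketch of these pre-processing steps using NLTK; this is an illustration under stated assumptions, not the authors' script, and the character filter below is approximate:

        import re
        import nltk
        from nltk.corpus import stopwords
        from nltk.stem import WordNetLemmatizer

        nltk.download("stopwords")
        nltk.download("wordnet")

        STOP = set(stopwords.words("english"))
        LEMMA = WordNetLemmatizer()

        def preprocess(text: str) -> str:
            text = re.sub(r"https?://\S+", " ", text)      # remove hyperlinks
            text = re.sub(r"[^A-Za-z\s]", " ", text)       # drop non-English characters
            tokens = [t.lower() for t in text.split()]
            tokens = [t for t in tokens if t not in STOP]  # remove stop words
            return " ".join(LEMMA.lemmatize(t) for t in tokens)  # lemmatize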

    The primary data statistics of the two datasets are shown in Tables 1 and 2.

    Table 1: Covid-News-USA-NNK data statistics

    No. of words per headline: 7 to 20
    No. of words per body content: 150 to 2100

    Table 2: Covid-News-BD-NNK data statistics

    No. of words per headline: 10 to 20
    No. of words per body content: 100 to 1500

    2.3 Dataset Repository

    We used GitHub as our primary data repository under the account name NKK^1. Here, we created two repositories, USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON format. We regularly update the CSV files and regenerate the JSON using a Python script. We provide a Python script file for essential operations. We welcome all outside collaboration to enrich the dataset.

    3 Literature Review

    Natural Language Processing (NLP) deals with text (also known as categorical) data in computer science, utilizing numerous diverse methods like one-hot encoding, word embedding, etc., that transform text to machine language, which can be fed to multiple machine learning and deep learning algorithms.

    Some well-known applications of NLP include fraud detection on online media sites [10], authorship attribution in fallback authentication systems [11], intelligent conversational agents or chatbots [12] and machine translation as used by Google Translate [13]. While these are all downstream tasks, several exciting developments have been made in algorithms solely for Natural Language Processing. The two most trending ones are BERT [14], which uses a bidirectional encoder architecture to build a transformer model that can do near-perfect classification tasks and next-word prediction, and the GPT-3 models released by OpenAI [15] that can generate almost human-like text. However, these are all pre-trained models, since they carry a huge computation cost. Information Extraction is a generalized concept of retrieving information from a dataset. Information extraction from an image could be retrieving vital feature spaces or targeted portions of an image; information extraction from speech could be retrieving information about names, places, etc. [16]. Information extraction in texts could be identifying named entities and locations or essential data. Topic modeling is a sub-task of NLP and also a process of information extraction. It clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives us a brief idea about a set of texts. One commonly used topic model is Latent Dirichlet Allocation, or LDA [17].
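
    As an illustration of the LDA approach described above (a sketch with a toy corpus, not the study's code), scikit-learn's implementation clusters words into topic groups like so:

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        docs = [  # toy stand-ins for preprocessed news bodies
            "virus outbreak china spread masks",
            "economy market stocks fall amid virus fears",
            "election campaign president vote november",
            "hospital patients doctors masks virus cases",
        ]

        vec = CountVectorizer(stop_words="english")
        X = vec.fit_transform(docs)
        lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

        # The top-weighted words per topic give a quick idea of each cluster.
        terms = vec.get_feature_names_out()
        for k, weights in enumerate(lda.components_):
            top = [terms[i] for i in weights.argsort()[-5:][::-1]]
            print(f"topic {k}: {', '.join(top)}")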

    Keyword extraction is a process of information extraction and a sub-task of NLP that extracts essential words and phrases from a text. TextRank [18] is an efficient keyword extraction technique that uses graphs to calculate the weight of each word and picks the words with the most weight.

    Word clouds are a great visualization technique to understand the overall ’talk of the topic’. The clustered words give us a quick understanding of the content.

    4 Our experiments and Result analysis

    We used the wordcloud library^4 to create the word clouds. Figures 1 and 3 present the word clouds of the Covid-News-USA-NNK dataset by month, from February to May. From Figures 1, 2, and 3, we can note the following:

    In February, both newspapers talked about China and the source of the outbreak.

    StarTribune emphasized Minnesota as the most concerned state, and in April appeared even more concerned.

    Both newspapers talked about the virus impacting the economy, i.e., banks, elections, administrations, markets.

    Washington Post discussed global issues more than StarTribune.

    StarTribune in February mentioned the first precautionary measure, wearing masks, and the uncontrollable spread of the virus throughout the nation.

    While both newspapers mentioned the outbreak in China in February, the spread in the United States is more heavily weighted from March through May, displaying the critical impact caused by the virus.

    We used a script to extract all numbers related to certain keywords like 'Deaths', 'Infected', 'Died', 'Infections', 'Quarantined', 'Lock-down', 'Diagnosed', etc. from the news reports and built a case-count series for both newspapers. Figure 4 shows the statistics of this series. From this extraction technique, we can observe that April was the peak month for COVID cases, rising gradually from February. Both newspapers clearly show that the rise in cases from February to March was slower than the rise from March to April. This is an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows the positive response against the attack.

    We used VADER sentiment analysis to extract the sentiment of the headlines and the bodies (see the sketch below). On average, the sentiments were from -0.5 to -0.9. The VADER sentiment scale ranges from -1 (highly negative) to 1 (highly positive). There were some cases where the sentiment scores of the headline and body contradicted each other, i.e., the sentiment of the headline was negative but the sentiment of the body was slightly positive. Overall, sentiment analysis can assist us in sorting the most concerning (most negative) news from the positive ones, from which we can learn more about the indicators related to COVID-19 and its serious impact. Moreover, sentiment analysis can also provide information about how a state or country is reacting to the pandemic.

    We used the PageRank algorithm to extract keywords from the headlines as well as the body content. PageRank efficiently highlights important relevant keywords in the text. Some frequently occurring important keywords extracted from both datasets are: 'China', 'Government', 'Masks', 'Economy', 'Crisis', 'Theft', 'Stock market', 'Jobs', 'Election', 'Missteps', 'Health', 'Response'. Keyword extraction acts as a filter allowing quick searches for indicators in case of locating situations of the economy,
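
    A minimal sketch of the word-cloud and VADER steps referenced above, using the wordcloud and vaderSentiment packages; the text inputs and output file name are placeholders:

        from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
        from wordcloud import WordCloud

        analyzer = SentimentIntensityAnalyzer()
        headline = "Virus outbreak deepens economic crisis"  # placeholder headline
        print(analyzer.polarity_scores(headline)["compound"])  # compound score in [-1, 1]

        # Word cloud of a month of (preprocessed) news bodies, as in Figures 1-3.
        wc = WordCloud(width=800, height=400).generate("virus china outbreak economy masks")
        wc.to_file("february_usa.png")  # placeholder output file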

  18. Data from: Las Vegas Strip

    • data.mendeley.com
    Updated Jul 29, 2017
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sérgio Moro (2017). Las Vegas Strip [Dataset]. http://doi.org/10.17632/tsf9sjdwh2.1
    Explore at:
    Dataset updated
    Jul 29, 2017
    Authors
    Sérgio Moro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Las Vegas Strip, Las Vegas
    Description

    This dataset includes quantitative and categorical features from online reviews of 21 hotels located on the Las Vegas Strip, extracted from TripAdvisor (http://www.tripadvisor.com). All 504 reviews were collected between January and August of 2016. The dataset contains 504 records and 20 tuned features (those with "status = included" in Table 1 of the article mentioned below), 24 records per hotel (two per month, randomly selected), regarding the year 2015.

  19. Enriched Data of Wind Farms (EDWin)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 23, 2023
    Cite
    Haller, Marina (2023). Enriched Data of Wind Farms (EDWin) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7558884
    Explore at:
    Dataset updated
    Jan 23, 2023
    Dataset authored and provided by
    Haller, Marina
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    EDWin (Enriched Data of Wind Farms) is a dataset developed to provide information about global wind farms. The dataset is based on OpenStreetMap (OSM) data and has been enriched with additional variables obtained from various databases. The dataset includes two separate data sets, one for global turbines and one for wind farms. As of September 2022, this dataset contains the most recent information available.

    The datasets have the following structures:

    Wind Turbine data

    The data for wind turbines includes 359,947 entries and 12 columns.

    • id: Key value of the data point
    • lon: Longitude of the location
    • lat: Latitude of the location
    • country: Country where the turbine is located
    • continent: Continent where the turbine is located
    • land cover: The type of land on which the turbine is located
    • landform: The physical features of the land on which the turbine is located
    • elevation: The altitude of the turbine
    • turbine spacing: The distance between turbines in the wind farm
    • WFid: Wind Farm ID
    • number of turbines: The number of turbines in the wind farm
    • shape: The rough shape of the wind farm

    Wind Farm data

    The data for wind farms includes 20,608 entries and 11 columns.

    • WFid: Wind Farm ID
    • lon: Longitude of the location (center of the wind farm)
    • lat: Latitude of the location (center of the wind farm)
    • country: Country where the wind farm is located
    • continent: Continent where the wind farm is located
    • land cover: The modal value of the land cover for the turbines in the wind farm
    • landform: The average value of the landform for the turbines in the wind farm
    • elevation: The average elevation of the turbines in the wind farm
    • turbine spacing: The average turbine spacing for the turbines in the wind farm
    • number of turbines: The number of turbines in the wind farm
    • shape: The rough shape of the wind farm

    Note that the data for "Country", "Continent", "Land Cover", "Landform", "Elevation" and "Turbine spacing" were collected per turbine and later added to the wind farm dataset in aggregated form: for categorical variables, the mode of the respective turbine values was taken, and for numerical variables, the average was calculated (see the sketch below). The two variables number of turbines (i.e. wind farm size) and wind farm shape (i.e. a rough shape of the wind farm) were obtained from the wind farm data and added to the turbine dataset.
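
    A minimal pandas sketch of that aggregation, assuming a hypothetical CSV export of the turbine table above:

        import pandas as pd

        turbines = pd.read_csv("edwin_turbines.csv")  # hypothetical file name

        # Aggregate turbine attributes to the wind-farm level, keyed by WFid:
        # mode for categorical variables, mean for numerical ones.
        farms = turbines.groupby("WFid").agg(
            country=("country", lambda s: s.mode().iat[0]),
            continent=("continent", lambda s: s.mode().iat[0]),
            land_cover=("land cover", lambda s: s.mode().iat[0]),
            elevation=("elevation", "mean"),
            turbine_spacing=("turbine spacing", "mean"),
            number_of_turbines=("id", "size"),
        )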

    Sources

    [1] OpenStreetMap. https://openstreetmap.org/. [Online] Accessed: 2022-10-02.

    [2] Cutler J. Cleveland, Christopher Morris, Dictionary of Energy (Second Edition), Elsevier, 2015, Pages 638-655, ISBN 9780080968117. https://doi.org/10.1016/B978-0-08-096811-7.50023-8.

    [4] Dunnett, S., Sorichetta, A., Taylor, G. et al. Harmonised global datasets of wind and solar farm locations and power. Sci Data 7, 130 (2020). https://doi.org/10.1038/s41597-020-0469-8

    [5] Buchhorn, M.; Lesiv, M.; Tsendbazar, N.-E.; Herold, M.; Bertels, L.; Smets, B. Copernicus Global Land Cover Layers - Collection 2. Remote Sensing 2020, 12, 1044. doi:10.3390/rs12061044

    [6] Theobald, D. M., Harrison-Atlas, D., Monahan, W. B., & Albano, C. M. (2015). Ecologically-relevant maps of landforms and physiographic diversity for climate adaptation planning. PloS one, 10(12), e0143619.

    [7] Global Multi-resolution Terrain Elevation Data 2010, courtesy of the U.S. Geological Survey.

  20. Data from: Login Data Set for Risk-Based Authentication

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Jun 30, 2022
    Cite
    Stephan Wiefling; Paul René Jørgensen; Sigurd Thunem; Luigi Lo Iacono (2022). Login Data Set for Risk-Based Authentication [Dataset]. http://doi.org/10.5281/zenodo.6782156
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 30, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Stephan Wiefling; Paul René Jørgensen; Sigurd Thunem; Luigi Lo Iacono
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Login Data Set for Risk-Based Authentication

    Synthesized login feature data of >33M login attempts and >3.3M users on a large-scale online service in Norway. Original data collected between February 2020 and February 2021.

    This data set aims to foster research and development of Risk-Based Authentication (RBA) systems. The data was synthesized from the real-world login behavior of more than 3.3M users at a large-scale single sign-on (SSO) online service in Norway.

    The users used this SSO to access sensitive data provided by the online service, e.g., cloud storage and billing information. We used this data set to study how the Freeman et al. (2016) RBA model behaves on a large-scale online service in the real world (see Publication). The synthesized data set can reproduce the results obtained on the original data set (see Study Reproduction). Beyond that, you can use this data set to evaluate and improve RBA algorithms under real-world conditions.

    WARNING: The feature values are plausible, but still totally artificial. Therefore, you should NOT use this data set in productive systems, e.g., intrusion detection systems.

    Overview

    The data set contains the following features related to each login attempt on the SSO:

    • IP Address (String): IP address belonging to the login attempt. Range: 0.0.0.0 - 255.255.255.255
    • Country (String): Country derived from the IP address. Example: US
    • Region (String): Region derived from the IP address. Example: New York
    • City (String): City derived from the IP address. Example: Rochester
    • ASN (Integer): Autonomous system number derived from the IP address. Range: 0 - 600000
    • User Agent String (String): User agent string submitted by the client. Example: Mozilla/5.0 (Windows NT 10.0; Win64; ...
    • OS Name and Version (String): Operating system name and version derived from the user agent string. Example: Windows 10
    • Browser Name and Version (String): Browser name and version derived from the user agent string. Example: Chrome 70.0.3538
    • Device Type (String): Device type derived from the user agent string. Values: (mobile, desktop, tablet, bot, unknown)^1
    • User ID (Integer): Identification number related to the affected user account. [Random pseudonym]
    • Login Timestamp (Integer): Timestamp related to the login attempt. [64 Bit timestamp]
    • Round-Trip Time (RTT) [ms] (Integer): Server-side measured latency between client and server. Range: 1 - 8600000
    • Login Successful (Boolean): True: Login was successful, False: Login failed. Values: (true, false)
    • Is Attack IP (Boolean): IP address was found in known attacker data set. Values: (true, false)
    • Is Account Takeover (Boolean): Login attempt was identified as account takeover by the incident response team of the online service. Values: (true, false)
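
    To make the schema concrete, here is a simplified, illustrative risk score in the spirit of the Freeman et al. (2016) model; this is not the authors' implementation, the file name is hypothetical, and the column names follow the feature list above:

        import pandas as pd

        logins = pd.read_csv("rba_dataset.csv")  # hypothetical file name
        FEATURES = ["Country", "Device Type", "Browser Name and Version"]

        def risk_score(user_id, attempt, smoothing=0.01):
            # Simplified score: product over features of global frequency divided
            # by per-user frequency. Higher values mean the attempt looks less
            # like this user's login history.
            history = logins[logins["User ID"] == user_id]
            score = 1.0
            for f in FEATURES:
                p_global = (logins[f] == attempt[f]).mean() + smoothing
                p_user = (history[f] == attempt[f]).mean() + smoothing
                score *= p_global / p_user
            return score

        example = {"Country": "US", "Device Type": "desktop",
                   "Browser Name and Version": "Chrome 70.0.3538"}
        print(risk_score(user_id=42, attempt=example))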

    Data Creation

    As the data set targets RBA systems, especially the Freeman et al. (2016) model, the statistical feature probabilities between all users, globally and locally, are identical for the categorical data. All the other data was randomly generated while maintaining logical relations and timely order between the features.

    The timestamps, however, are not identical and contain randomness. The feature values related to IP address and user agent string were randomly generated by publicly available data, so they were very likely not present in the real data set. The RTTs resemble real values but were randomly assigned among users per geolocation. Therefore, the RTT entries were probably in other positions in the original data set.

    • The country was randomly assigned per unique feature value. Based on that, we randomly assigned an ASN related to the country, and generated the IP addresses for this ASN. The cities and regions were derived from the generated IP addresses for privacy reasons and do not reflect the real logical relations from the original data set.

    • The device types are identical to the real data set. Based on that, we randomly assigned the OS, and based on the OS the browser information. From this information, we randomly generated the user agent string. Therefore, all the logical relations regarding the user agent are identical as in the real data set.

    • The RTT was randomly drawn from the login success status and synthesized geolocation data. We did this to ensure that the RTTs are realistic ones.

    Regarding the Data Values

    Due to unresolvable conflicts during the data creation, we had to assign some unrealistic IP addresses and ASNs that are not present in the real world. Nevertheless, these do not have any effects on the risk scores generated by the Freeman et al. (2016) model.

    You can recognize them by the following values:

    • ASNs with values >= 500,000

    • IP addresses in the range 10.0.0.0 - 10.255.255.255 (10.0.0.0/8 CIDR range)

    Study Reproduction

    Based on our evaluation, this data set can reproduce our study results regarding the RBA behavior of an RBA model using the IP address (IP address, country, and ASN) and user agent string (Full string, OS name and version, browser name and version, device type) as features.

    The calculated RTT significances for countries and regions inside Norway are not identical using this data set, but have similar tendencies. The same is true for the Median RTTs per country. This is due to the fact that the available number of entries per country, region, and city changed with the data creation procedure. However, the RTTs still reflect the real-world distributions of different geolocations by city.

    See RESULTS.md for more details.

    Ethics

    By using the SSO service, the users agreed to the data collection and evaluation for research purposes. For study reproduction and to foster RBA research, we agreed with the data owner to create a synthesized data set that does not allow re-identification of customers.

    The synthesized data set does not contain any sensitive data values, as the IP addresses, browser identifiers, login timestamps, and RTTs were randomly generated and assigned.

    Publication

    You can find more details on our conducted study in the following journal article:

    Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service (2022)
    Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono.
    ACM Transactions on Privacy and Security

    Bibtex

    @article{Wiefling_Pump_2022,
     author = {Wiefling, Stephan and Jørgensen, Paul René and Thunem, Sigurd and Lo Iacono, Luigi},
     title = {Pump {Up} {Password} {Security}! {Evaluating} and {Enhancing} {Risk}-{Based} {Authentication} on a {Real}-{World} {Large}-{Scale} {Online} {Service}},
     journal = {{ACM} {Transactions} on {Privacy} and {Security}},
     doi = {10.1145/3546069},
     publisher = {ACM},
     year  = {2022}
    }

    License

    This data set and the contents of this repository are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. See the LICENSE file for details. If the data set is used within a publication, the following journal article has to be cited as the source of the data set:

    Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono: Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service. In: ACM Transactions on Privacy and Security (2022). doi: 10.1145/3546069

    1. A few (invalid) user agent strings from the original data set could not be parsed, so their device type is empty. Perhaps this parse error is useful information for your studies, so we kept these 1526 entries.
