Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Individual participant data (IPD) meta-analyses that obtain “raw” data from studies rather than summary data typically adopt a “two-stage” approach to analysis whereby IPD within trials generate summary measures, which are combined using standard meta-analytical methods. Recently, a range of “one-stage” approaches which combine all individual participant data in a single meta-analysis have been suggested as providing a more powerful and flexible approach. However, they are more complex to implement and require statistical support. This study uses a dataset to compare “two-stage” and “one-stage” models of varying complexity, to ascertain whether results obtained from the approaches differ in a clinically meaningful way.

Methods and Findings: We included data from 24 randomised controlled trials, evaluating antiplatelet agents, for the prevention of pre-eclampsia in pregnancy. We performed two-stage and one-stage IPD meta-analyses to estimate overall treatment effect and to explore potential treatment interactions whereby particular types of women and their babies might benefit differentially from receiving antiplatelets. Two-stage and one-stage approaches gave similar results, showing a benefit of using antiplatelets (relative risk 0.90, 95% CI 0.84 to 0.97). Neither approach suggested that any particular type of women benefited more or less from antiplatelets. There were no material differences in results between different types of one-stage model.

Conclusions: For these data, two-stage and one-stage approaches to analysis produce similar results. Although one-stage models offer a flexible environment for exploring model structure and are useful where across-study patterns relating to types of participant, intervention and outcome mask similar relationships within trials, the additional insights provided by their usage may not outweigh the costs of statistical support for routine application in syntheses of randomised controlled trials. Researchers considering undertaking an IPD meta-analysis should not necessarily be deterred by a perceived need for sophisticated statistical methods when combining information from large randomised trials.
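As a rough illustration of the two-stage idea described above (not the authors' code; the trial-level estimates below are made up), the sketch pools per-trial log relative risks with fixed-effect inverse-variance weighting:

import numpy as np

# Stage 1 (assumed already done): each trial's IPD reduced to a log relative
# risk and its standard error. These values are illustrative, not trial data.
log_rr = np.array([-0.15, -0.05, -0.20, 0.02])   # per-trial log(RR)
se     = np.array([0.10, 0.08, 0.12, 0.09])      # per-trial standard errors

# Stage 2: fixed-effect inverse-variance pooling of the trial-level estimates.
w = 1.0 / se**2
pooled = np.sum(w * log_rr) / np.sum(w)
pooled_se = np.sqrt(1.0 / np.sum(w))

ci_low, ci_high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"Pooled RR {np.exp(pooled):.2f} "
      f"(95% CI {np.exp(ci_low):.2f} to {np.exp(ci_high):.2f})")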
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Comparing AUROC, exploited biases, use cases and issues of Graph Attention Networks on similarity and patient-centric graphs (directed, reversed directed and undirected) for the classification of sepsis on complete blood count data (higher values represent better performance). We evaluated the classification performance on two datasets (internal and external dataset). Bold values represent the best values in each column.
Psychological scientists increasingly study web data, such as user ratings or social media postings. However, whether research relying on such web data leads to the same conclusions as research based on traditional data is largely unknown. To test this, we (re)analyzed three datasets, thereby comparing web data with lab and online survey data. We calculated correlations across these different datasets (Study 1) and investigated identical, illustrative research questions in each dataset (Studies 2 to 4). Our results suggest that web and traditional data are not fundamentally different and usually lead to similar conclusions, but also that it is important to consider differences between data types such as populations and research settings. Web data can be a valuable tool for psychologists when accounting for such differences, as it allows for testing established research findings in new contexts, complementing them with insights from novel data sources.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In life cycle assessment (LCA), collecting unit process data from empirical sources (e.g., meter readings, operation logs/journals) is often costly and time-consuming. We propose a new computational approach to estimate missing unit process data, relying solely on limited known data, based on a similarity-based link prediction method. The intuition is that similar processes in a unit process network tend to have similar material/energy inputs and waste/emission outputs. We use the ecoinvent 3.1 unit process data sets to test our method in four steps: (1) dividing the data sets into a training set and a test set; (2) randomly removing certain numbers of data in the test set indicated as missing; (3) using similarity-weighted means of various numbers of most similar processes in the training set to estimate the missing data in the test set; and (4) comparing estimated data with the original values to determine the performance of the estimation. The results show that missing data can be accurately estimated when less than 5% of the data are missing in one process. The estimation performance decreases as the percentage of missing data increases. This study provides a new approach to compile unit process data and demonstrates a promising potential of using computational approaches for LCA data compilation.
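A minimal sketch of the similarity-weighted estimation idea (not the authors' implementation; the toy process vectors and the choice of cosine similarity are illustrative): missing flows in a target process are estimated as the similarity-weighted mean of the corresponding flows in the k most similar complete processes.

import numpy as np

def estimate_missing(target, train, k=3):
    # target: 1-D array with NaNs marking missing flows
    # train:  2-D array (processes x flows) with complete data
    known = ~np.isnan(target)
    # Cosine similarity computed only over the flows known in the target.
    t = target[known]
    T = train[:, known]
    sims = (T @ t) / (np.linalg.norm(T, axis=1) * np.linalg.norm(t) + 1e-12)
    top = np.argsort(sims)[-k:]                   # k most similar processes
    w = sims[top] / (sims[top].sum() + 1e-12)     # similarity weights
    estimate = target.copy()
    estimate[~known] = w @ train[top][:, ~known]  # weighted mean of neighbours
    return estimate

# Toy example: 4 known processes, one process missing its last two flows.
train = np.array([[1.0, 0.5, 2.0, 0.1],
                  [1.1, 0.6, 2.1, 0.1],
                  [5.0, 3.0, 0.2, 4.0],
                  [4.8, 2.9, 0.3, 3.9]])
target = np.array([1.05, 0.55, np.nan, np.nan])
print(estimate_missing(target, train, k=2))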
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Clustering by distance normally requires all-against-all matching. This new algorithm can cluster 7 million proteins in under one hour using approximate clustering.
cat: contains the hierarchical sequence. protein_names: list of proteins in the group. Original data can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/env_nr.gz
Researchers can use the data to find relationships between proteins more easily.
The data set has two files. The protein_groupings file contains the clustered data and holds only protein names; the sequences for those names can be found in the protein_name_letter file.
The data was downloaded from the NCBI site, the FASTA format was converted into full-length sequences, and the sequences were fed into the clustering algorithm.
As this is hierarchical clustering, the relationship between sequences can be found by comparing the values in gn_list.
All groups start with cluster_id:0, split:0 and progress into matched splits. The difference between splits indicates how closely two sequences match, and comparing cluster_id values shows whether sequences belong to the same group or to different groups.
cluster_id = unique id for the cluster. split = approximate similarity between the sequences, given as an absolute value; a split of 63 means roughly 63 letters match between the sequences, and higher values mean greater similarity. inner_cluster_id = unique id for comparing matches within a cluster. total clusters = number of clusters after the approximate match is generated.
Due to space restrictions in Kaggle, this data set has only 9093 groups containing 129696 sequences.
One sequence may appear in more than one cluster because similarity is calculated as if an all-against-all comparison were used.
Example: for A, B and C, if A~B = 50, B~C = 50 and A~C = 0, then clustering will produce two groups, [A, B] and [B, C].
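A short sketch of how the fields described above might be used, assuming the grouping file can be read into rows holding cluster_id and protein_names (the real file layout on Kaggle may differ, so treat this purely as an illustration): two proteins are related if they ever share a cluster_id.

import pandas as pd

# Hypothetical rows mirroring the described fields; real layout may differ.
groups = pd.DataFrame({
    "cluster_id":    [0, 0, 1],
    "split":         [0, 63, 0],
    "protein_names": [["A", "B"], ["B", "C"], ["D"]],
})

def same_cluster(p1, p2, groups):
    # True if the two proteins ever appear under the same cluster_id.
    c1 = {cid for cid, names in zip(groups.cluster_id, groups.protein_names) if p1 in names}
    c2 = {cid for cid, names in zip(groups.cluster_id, groups.protein_names) if p2 in names}
    return bool(c1 & c2)

print(same_cluster("A", "C", groups))   # True: both reachable via cluster_id 0
print(same_cluster("A", "D", groups))   # False: different cluster_id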
If you need the full dataset for your research, contact me.
The previous dataset had issues with similarity comparisons between clusters, while inner-cluster comparison worked. This is fixed in the new version.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A collection of code snippets solving common programming problems in multiple variations.
Each problem has 20+ versions, written in different styles and logic patterns, making this dataset ideal for studying code similarity, logic variation, and coding style.
The dataset includes the following tasks:
- Reverse a String
- Find Max in List
- Check if a Number is Prime
- Check if a String is a Palindrome
- Generate Fibonacci Sequence
Each task contains:
- 20 variations of code
- Metadata file describing method and notes
- README with usage instructions
The full_metadata.csv file contains the following fields:
| Column Name | Description |
|---|---|
| problem_type | The programming task solved (e.g., reverse_string, max_in_list) |
| id | Unique ID of the snippet within that problem group |
| filename | Filename of the code snippet (e.g., snip_01.py) |
| language | Programming language used (Python) |
| method | Type of approach used (e.g., Slicing, Recursive, While loop) |
| notes | Additional details about the logic or style used in the snippet |
CodeSimilarityDataset/
│
├── reverse_string/
│   ├── snippets/
│   ├── metadata.csv
│   └── README.txt
│
├── max_in_list/
│   ├── snippets/
│   ├── metadata.csv
│   └── README.txt
│
├── is_prime/
│   ├── snippets/
│   ├── metadata.csv
│   └── README.txt
│
├── is_palindrome/
│   ├── snippets/
│   ├── metadata.csv
│   └── README.txt
│
├── fibonacci/
│   ├── snippets/
│   ├── metadata.csv
│   └── README.txt
│
└── full_metadata.csv   ← Combined metadata across all problems
- Visualize logic type distribution
- Compare structural similarity (AST/difflib/token matching)
- Cluster similar snippets using embeddings
- Train code-style-aware LLMs
All code snippets are .py files. Metadata is provided in CSV format for easy loading into pandas or other tools.
You can load metadata easily with Python:
import pandas as pd
df = pd.read_csv('full_metadata.csv')
print(df.sample(5))
Then read any snippet:
with open("reverse_string/snippets/snip_01.py") as f:
    code = f.read()
print(code)
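To illustrate the structural-similarity use case listed above, here is a minimal difflib sketch comparing two snippets of the same problem (snip_02.py is assumed to exist in the same folder; adjust the paths to whichever files you have):

import difflib

# Load two variations of the same task (second filename is an assumption).
with open("reverse_string/snippets/snip_01.py") as f:
    a = f.read()
with open("reverse_string/snippets/snip_02.py") as f:
    b = f.read()

# A ratio close to 1.0 means the two variations are textually very similar.
ratio = difflib.SequenceMatcher(None, a, b).ratio()
print(f"similarity: {ratio:.2f}")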
This dataset is released under the MIT License — free to use, modify, and distribute with proper attribution.
https://spdx.org/licenses/CC0-1.0.html
Time series are a critical component of ecological analysis, used to track changes in biotic and abiotic variables. Information can be extracted from the properties of time series for tasks such as classification (e.g. assigning species to individual bird calls); clustering (e.g. clustering similar responses in population dynamics to abrupt changes in the environment or management interventions); prediction (e.g. accuracy of model predictions to original time series data); and anomaly detection (e.g. detecting possible catastrophic events from population time series). These common tasks in ecological research rely on the notion of (dis-) similarity, which can be determined using distance measures. A plethora of distance measures have been described, predominantly in the computer and information sciences, but many have not been introduced to ecologists. Furthermore, little is known about how to select appropriate distance measures for time-series-related tasks. Therefore, many potential applications remain unexplored. Here we describe 16 properties of distance measures that are likely to be of importance to a variety of ecological questions involving time series. We then test 42 distance measures for each property and use the results to develop an objective method to select appropriate distance measures for any task and ecological dataset. We demonstrate our selection method by applying it to a set of real-world data on breeding bird populations in the UK and discuss other potential applications for distance measures, along with associated technical issues common in ecology. Our real-world population trends exhibit a common challenge for time series comparisons: a high level of stochasticity. We demonstrate two different ways of overcoming this challenge, first by selecting distance measures with properties that make them well-suited to comparing noisy time series, and second by applying a smoothing algorithm before selecting appropriate distance measures. In both cases, the distance measures chosen through our selection method are not only fit-for-purpose but are consistent in their rankings of the population trends. The results of our study should lead to an improved understanding of, and greater scope for, the use of distance measures for comparing ecological time series, and help us answer new ecological questions.

Methods: Distance measure test results were produced using R and can be replicated using scripts available on GitHub at https://github.com/shawndove/Trend_compare. Detailed information on wading bird trends can be found in Jellesmark et al. (2021) below.

Jellesmark, S., Ausden, M., Blackburn, T. M., Gregory, R. D., Hoffmann, M., Massimino, D., McRae, L., & Visconti, P. (2021). A counterfactual approach to measure the impact of wet grassland conservation on U.K. breeding bird populations. Conservation Biology, 35(5), 1575–1585. https://doi.org/10.1111/cobi.13692
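As a small, hedged illustration of the kind of measures discussed (the study's own tests are implemented in R in the linked repository; the toy series below are made up), this Python sketch computes two simple distances between population trends:

import numpy as np

# Two toy population index series (illustrative values only).
x = np.array([1.00, 0.95, 0.90, 0.92, 0.88])
y = np.array([1.00, 0.97, 0.93, 0.94, 0.90])

# Euclidean distance: sensitive to absolute differences in value.
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Correlation-based distance: sensitive to differences in shape, not level.
corr_dist = 1.0 - np.corrcoef(x, y)[0, 1]

print(f"Euclidean: {euclidean:.3f}, correlation-based: {corr_dist:.3f}")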
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains customer satisfaction scores collected from a survey, alongside key demographic and behavioral data. It includes variables such as customer age, gender, location, purchase history, support contact status, loyalty level, and satisfaction factors. The dataset is designed to help analyze customer satisfaction, identify trends, and develop insights that can drive business decisions.
File Information: File Name: customer_satisfaction_data.csv
File Type: CSV
Number of Rows: 120
Number of Columns: 10
Column Names:
Customer_ID – Unique identifier for each customer (e.g., 81-237-4704)
Group – The group to which the customer belongs (A or B)
Satisfaction_Score – Customer's satisfaction score on a scale of 1-10
Age – Age of the customer
Gender – Gender of the customer (Male, Female)
Location – Customer's location (e.g., Phoenix.AZ, Los Angeles.CA)
Purchase_History – Whether the customer has made a purchase (Yes or No)
Support_Contacted – Whether the customer has contacted support (Yes or No)
Loyalty_Level – Customer's loyalty level (Low, Medium, High)
Satisfaction_Factor – Primary factor contributing to customer satisfaction (e.g., Price, Product Quality)
Statistical Analyses:
Descriptive Statistics:
Calculate mean, median, mode, standard deviation, and range for key numerical variables (e.g., Satisfaction Score, Age).
Summarize categorical variables (e.g., Gender, Loyalty Level, Purchase History) with frequency distributions and percentages.
Two-Sample t-Test (Independent t-test):
Compare the mean satisfaction scores between two independent groups (e.g., Group A vs. Group B) to determine if there is a significant difference in their average satisfaction scores.
Paired t-Test:
If there are two related measurements (e.g., satisfaction scores before and after a certain event), you can compare the means using a paired t-test.
One-Way ANOVA (Analysis of Variance):
Test if there are significant differences in mean satisfaction scores across more than two groups (e.g., comparing the mean satisfaction score across different Loyalty Levels).
Chi-Square Test for Independence:
Examine the relationship between two categorical variables (e.g., Gender vs. Purchase History or Loyalty Level vs. Support Contacted) to determine if there’s a significant association.
Mann-Whitney U Test:
For non-normally distributed data, use this test to compare satisfaction scores between two independent groups (e.g., Group A vs. Group B) to see if their distributions differ significantly.
Kruskal-Wallis Test:
Similar to ANOVA, but used for non-normally distributed data. This test can compare the median satisfaction scores across multiple groups (e.g., comparing satisfaction scores across Loyalty Levels or Satisfaction Factors).
Spearman’s Rank Correlation:
Test for a monotonic relationship between two ordinal or continuous variables (e.g., Age vs. Satisfaction Score or Satisfaction Score vs. Loyalty Level).
Regression Analysis:
Linear Regression: Model the relationship between a continuous dependent variable (e.g., Satisfaction Score) and independent variables (e.g., Age, Gender, Loyalty Level).
Logistic Regression: If analyzing binary outcomes (e.g., Purchase History or Support Contacted), you could model the probability of an outcome based on predictors.
Factor Analysis:
To identify underlying patterns or groups in customer behavior or satisfaction factors, you can apply Factor Analysis to reduce the dimensionality of the dataset and group similar variables.
Cluster Analysis:
Use K-Means Clustering or Hierarchical Clustering to group customers based on similarity in their satisfaction scores and other features (e.g., Loyalty Level, Purchase History).
Confidence Intervals:
Calculate confidence intervals for the mean of satisfaction scores or any other metric to estimate the range in which the true population mean might lie.
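A minimal sketch of two of the analyses listed above, assuming the column names described earlier (Group, Satisfaction_Score) and the file customer_satisfaction_data.csv:

import pandas as pd
from scipy import stats

df = pd.read_csv("customer_satisfaction_data.csv")
a = df.loc[df["Group"] == "A", "Satisfaction_Score"]
b = df.loc[df["Group"] == "B", "Satisfaction_Score"]

# Two-sample t-test (Welch's version, which does not assume equal variances).
t, p_t = stats.ttest_ind(a, b, equal_var=False)

# Mann-Whitney U test as the non-parametric alternative.
u, p_u = stats.mannwhitneyu(a, b, alternative="two-sided")

print(f"t-test p = {p_t:.3f}, Mann-Whitney p = {p_u:.3f}")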
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In research evaluating statistical analysis methods, a common aim is to compare point estimates and confidence intervals (CIs) calculated from different analyses. This can be challenging when the outcomes (and their scale ranges) differ across datasets. We therefore developed a plot to facilitate pairwise comparisons of point estimates and confidence intervals from different statistical analyses both within and across datasets.
The plot was developed and refined over the course of an empirical study. To compare results from a variety of different studies, a system of centring and scaling is used. Firstly, the point estimates from reference analyses are centred to zero, followed by scaling confidence intervals to span a range of one. The point estimates and confidence intervals from matching comparator analyses are then adjusted by the same amounts. This enables the relative positions of the point estimates and CI widths to be quickly assessed while maintaining the relative magnitudes of the difference in point estimates and confidence interval widths between the two analyses. Banksia plots can be graphed in a matrix, showing all pairwise comparisons of multiple analyses. In this paper, we show how to create a banksia plot and present two examples: the first relates to an empirical evaluation assessing the difference between various statistical methods across 190 interrupted time series (ITS) data sets with widely varying characteristics, while the second example assesses data extraction accuracy comparing results obtained from analysing original study data (43 ITS studies) with those obtained by four researchers from datasets digitally extracted from graphs from the accompanying manuscripts.
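A minimal numerical sketch of the centring and scaling step described above, using made-up estimates (the published code accompanying the paper is in Stata and R; this Python version is only illustrative):

import numpy as np

# Reference analysis: point estimate with 95% CI (illustrative values).
ref_est, ref_low, ref_high = 0.50, 0.20, 0.80
# Comparator analysis of the same dataset.
cmp_est, cmp_low, cmp_high = 0.55, 0.15, 0.95

shift = ref_est                 # centre the reference estimate at zero
scale = ref_high - ref_low      # scale the reference CI to span one

def transform(est, low, high):
    return ((est - shift) / scale, (low - shift) / scale, (high - shift) / scale)

print(transform(ref_est, ref_low, ref_high))   # (0.0, -0.5, 0.5) by construction
print(transform(cmp_est, cmp_low, cmp_high))   # comparator, same shift and scale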
In the banksia plot of statistical method comparison, it was clear that there was no difference, on average, in point estimates and it was straightforward to ascertain which methods resulted in smaller, similar or larger confidence intervals than others. In the banksia plot comparing analyses from digitally extracted data to those from the original data it was clear that both the point estimates and confidence intervals were all very similar among data extractors and original data.
The banksia plot, a graphical representation of centred and scaled confidence intervals, provides a concise summary of comparisons between multiple point estimates and associated CIs in a single graph. Through this visualisation, patterns and trends in the point estimates and confidence intervals can be easily identified.
This collection of files allows the user to create the images used in the companion paper and to amend the code to create their own banksia plots using either Stata version 17 or R version 4.3.1.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Until recently, researchers who wanted to examine the determinants of state respect for most specific negative rights needed to rely on data from the CIRI or the Political Terror Scale (PTS). The new V-DEM dataset offers scholars a potential alternative to the individual human rights variables from CIRI. We analyze a set of key Cingranelli-Richards (CIRI) Human Rights Data Project and Varieties of Democracy (V-DEM) negative rights indicators, finding unusual and unexpectedly large patterns of disagreement between the two sets. First, we discuss the new V-DEM dataset by comparing it to the disaggregated CIRI indicators, discussing the history of each project, and describing its empirical domain. Second, we identify a set of disaggregated human rights measures that are similar across the two datasets and discuss each project's measurement approach. Third, we examine how these measures compare to each other empirically, showing that they diverge considerably across both time and space. These findings point to several important directions for future work, such as how conceptual approaches and measurement strategies affect rights scores. For the time being, our findings suggest that researchers should think carefully about using the measures as substitutes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective: Literature-based Discovery (LBD) identifies new knowledge by leveraging existing literature. It exploits interconnecting implicit relationships to build bridges between isolated sets of non-interacting literatures. It has been used to facilitate drug repurposing, new drug discovery, and study adverse event reactions. Within the last decade, LBD systems have transitioned from using statistical methods to exploring deep learning (DL) to analyze semantic spaces between non-interacting literatures. Recent works explore knowledge graphs (KG) to represent explicit relationships. These works envision LBD as a knowledge graph completion (KGC) task and use DL to generate implicit relationships. However, these systems require the researcher to have domain-expert knowledge when submitting relevant queries for novel hypothesis discovery.

Methods: Our method explores a novel approach to identify all implicit hypotheses given the researcher's search query and expedites the knowledge discovery process. We revise the KGC task as the task of predicting interconnecting vertex embeddings within the graph. We train our model using a similarity learning objective and compare our model's predictions against all known vertices within the graph to determine the likelihood of an implicit relationship (i.e., connecting edge). We also explore three approaches to represent edge connections between vertices within the KG: average, concatenation, and Hadamard. Lastly, we explore an approach to induce inductive biases and expedite model convergence (i.e., input representation scaling).

Results: We evaluate our method by replicating five known discoveries within the Hallmark of Cancer (HOC) datasets and compare our method to two existing works. Our results show no significant difference in reported ranks and model convergence rate when comparing scaling our input representations and not using this method. Comparing our method to previous works, we found our method achieves optimal performance on two of five datasets and achieves comparable performance on the remaining datasets. We further analyze our results using statistical significance testing to demonstrate the efficacy of our method.

Conclusion: We found our similarity-based learning objective predicts linking vertex embeddings for single relationship closed discovery replication. Our method also provides a ranked list of linking vertices between a set of inputs. This approach reduces researcher burden and allows further exploration of generated hypotheses.
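A small numpy sketch of the three edge-representation schemes mentioned above (average, concatenation, Hadamard), with toy vertex embeddings; this is illustrative only and not the authors' code:

import numpy as np

# Toy vertex embeddings for two KG vertices (e.g., a drug and a disease term).
u = np.array([0.2, -0.1, 0.7, 0.4])
v = np.array([0.5, 0.3, 0.1, -0.2])

edge_avg    = (u + v) / 2.0           # average
edge_concat = np.concatenate([u, v])  # concatenation (doubles the dimension)
edge_had    = u * v                   # Hadamard (element-wise) product

# Cosine similarity against a candidate vertex embedding gives a rough score
# for an implicit connecting edge, mirroring the similarity-learning setup.
cand = np.array([0.3, 0.0, 0.4, 0.1])
cos = edge_avg @ cand / (np.linalg.norm(edge_avg) * np.linalg.norm(cand))
print(edge_avg, edge_had, round(float(cos), 3))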
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
We investigate the impact of computer vision models, a prominent artificial intelligence tool, on critical knowledge infrastructure, using the case of Google search engines. We answer the following research question: How do search results for Google Images compare internationally with those for Google Search, and how can these results be explained by changes in Google’s knowledge infrastructure? To answer this question, we carry out four steps: 1) theorise the relationship between web epistemology, calculative technology, platform vernacular and issue configuration, illustrating the dynamics of critical knowledge infrastructures on the web; 2) provide a potted history of Google’s use of computer vision in search; 3) undertake the first international comparison of search results from Google Search with Google Images; 4) analyse the visual content of search results from Google Images. Using quanti-quali digital methods including visual content analysis, social semiotics and computer vision network analysis, we analyse search results related to environmental change across six countries, with two key findings. First, Google Images search results contain fewer authoritative sources than Google Search across all countries. Second, Google Images results constitute a narrow, homogenised visual repertoire across all countries. This constitutes a transformation in web epistemology from ranking-by-authority to ranking-by-similarity, driven by a shift in calculative technology from web links (Google Search) to computer vision (Google Images). Our findings and theoretical model open up new questions regarding the impact of computer vision on the public availability of knowledge in our increasingly image-saturated digital societies.
https://creativecommons.org/publicdomain/zero/1.0/
Benchmarks allow for easy comparison between multiple devices by scoring their performance on a standardized series of tests, and they are useful in many instances, such as when buying a new phone or tablet.
Newest data as of May 3rd, 2022. This dataset contains benchmarks of Android and iOS devices.
Benchmark apps give your device an overall numerical score as well as individual scores for each test they perform. The overall score is created by adding the results of those individual scores. These score numbers don't mean much on their own; they're just helpful for comparing different devices. For example, if your device's score is 300000, a device with a score of 600000 is about twice as fast. You can use individual test scores to compare the relative performance of specific parts of different devices. For example, you could compare how fast your phone's storage performs compared to another phone's storage.
The first part of the overall score is your CPU score. The CPU score in turn includes the output of CPU Mathematical Operations, CPU Common Algorithms, and CPU Multi-Core. In simpler words, the CPU score means how fast your phone processes commands. Your device's central processing unit (CPU) does most of the number-crunching. A faster CPU can run apps faster, so everything on your device will seem faster. Of course, once you get to a certain point, CPU speed won't affect performance much. However, a faster CPU may still help when running more demanding applications, such as high-end games.
The second part of the overall score is your GPU score. This score is composed of the output of graphical components like Metal, OpenGL or Vulkan, depending on your device. The GPU score means how well your phone displays 2D and 3D graphics. Your device's graphics processing unit (GPU) handles accelerated graphics. When you play a game, your GPU kicks into gear and renders the 3D graphics or accelerates the shiny 2D graphics. Many interface animations and other transitions also use the GPU. The GPU is optimized for these sorts of graphics operations. The CPU could perform them, but it's more general-purpose and would take more time and battery power. You can say that your GPU does the graphics number-crunching, so a higher score here is better.
The third part of the overall score is your MEM score. The MEM score includes the results of the output of RAM Access, ROM APP IO, ROM Sequential Read and Write, and ROM Random Access. In simpler words, the MEM score means how fast and how much memory your phone possesses. RAM stands for random-access memory, while ROM stands for read-only memory. Your device uses RAM as working memory, while flash storage or an internal SD card is used for long-term storage. The faster it can write to and read data from its RAM, the faster your device will perform. Your RAM is constantly being used on your device, whatever you're doing. While RAM is volatile in nature, ROM is its opposite. RAM mostly stores temporary data, while ROM is used to store permanent data like the firmware of your phone. Both the RAM and ROM make up the memory of your phone, helping it to perform tasks efficiently.
The fourth and final part of the overall score is your UX score. The UX score is made up of the results of the output of the Data Security, Data Processing, Image Processing, User Experience, and Video CTS and Decode tests. The UX score means an overall score that represents how the device's "user experience" will be in the real world. It's a number you can look at to get a feel for a device's overall performance without digging into the above benchmarks or relying too much on the overall score.
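A toy sketch of how the overall score is composed from the four parts described above and then compared between two devices, using made-up sub-scores:

# Made-up sub-scores for two devices (CPU, GPU, MEM, UX).
device_a = {"CPU": 120_000, "GPU": 90_000, "MEM": 60_000, "UX": 30_000}
device_b = {"CPU": 250_000, "GPU": 200_000, "MEM": 100_000, "UX": 50_000}

overall_a = sum(device_a.values())
overall_b = sum(device_b.values())

# Relative comparison: a device with twice the score is roughly twice as fast.
print(f"A: {overall_a}, B: {overall_b}, B/A = {overall_b / overall_a:.2f}x")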
Data scraped from AnTuTu, cross-platform adjusted using 3DMark and Geekbench.
By Andy Bramwell [source]
This dataset contains sales data for video games from all around the world, across different platforms, genres and regions. From the latest thought-provoking RPG releases to thrilling racing games, this database provides insight into what constitutes a hit game in today’s gaming industry. Armed with this data and analysis, future developers can better understand what types of gameplay and mechanics resonate more with players to create a new gaming experience. Through its comprehensive analysis of various game titles, genres and platforms, this dataset displays detailed insights into how video games can achieve global success, as well as providing a window into the ever-changing trends of gaming culture.
This dataset can be used to uncover hidden trends in Global Video Games Sales. To make the most of this data, it is important to understand the different columns and their respective values.
The 'Rank' column identifies each game's ranking according to its global sales (highest to lowest). This can help you identify which games are most popular globally. The 'Game Title' column contains the name of each video game, which allows you to easily discern one entry from another. The 'Platform' column lists the type of platform on which each game was released, e.g., PlayStation 4 or Xbox One, so that you can make comparisons between platforms as well as specific games for each platform. The 'Year' column provides an additional way of making year-on-year comparisons and tracking changes over time in global video game sales.
In addition, this dataset also contains metadata such as genre ('Genre'), publisher ('Publisher'), and review score ('Review') that add context when considering a particular title's performance in terms of global sales rankings. For example, it might be more compelling to compare two similar genres than two disparate ones when analyzing how successful a select set of titles have been at generating revenue in comparison with others released globally within that timeline.

Lastly, but no less important, are the variables dedicated exclusively to geographic breakdowns: North America ('North America'), Europe ('Europe'), Japan ('Japan'), Rest of World ('Rest of World'), and Global ('Global'). These allow us to see how certain regions contribute individually or collectively towards a given title's overall sales figures; by comparing these metrics regionally or collectively, an interesting picture arises from which inferences about consumer preferences and supplier priorities emerge.

Overall, this powerful dataset allows researchers and marketers alike a deep dive into market performance for those persistent questions about demand patterns across demographics around the world.
- Analyzing the effects of genre and platform on a game's success - By comparing different genres and platforms, one can get a better understanding of what type of games have the highest sales in different regions across the globe. This could help developers decide which type of gaming content to create in order to maximize their profits.
- Tracking changes in global video games trends over time - This dataset could be used to analyze how various elements such as genre or platform affect success over various years, allowing developers an inside look into what kind of videos are being favored at any given moment across the world.
- Identifying highly successful games and their key elements - Developers could look at this data to find any common factors, such as publisher or platform, shared by successful titles to uncover characteristics that lead to a high rate of return when creating video games or other forms of media entertainment.
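A small pandas sketch of the first use case above, assuming the column names described in this page (Genre, Global, and the regional columns) and the file Video Games Sales.csv; the genre value used for the regional contrast is an assumption:

import pandas as pd

df = pd.read_csv("Video Games Sales.csv")

# Total global sales by genre, highest first: a quick view of which genres sell.
genre_sales = df.groupby("Genre")["Global"].sum().sort_values(ascending=False)
print(genre_sales.head(10))

# Regional contrast for a single (assumed) genre value.
print(df[df["Genre"] == "Racing"][["North America", "Europe", "Japan"]].sum())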
If you use this dataset in your research, please credit the original authors. Data Source
See the dataset description for more information.
File: Video Games Sales.csv

| Column name | Description |
|:------------------|:------------------------------------------------------------|
| Rank | The ranking of the game in terms of global sales. (Integer) |
| Game Title | The title of the game. (String) |
| Platform | The platform the game was released on. (String) |
...
A new decomposition algorithm based on QR factorisation is introduced for processing and comparing irregularly shaped stress and deformation datasets found in structural analysis. The algorithm improves the comparison of two-dimensional data fields from the surface of components where data is missing from the field of view due to obstructed measurement systems or component geometry that results in areas where no data is present. The technique enables the comparison of these irregularly shaped datasets without the need for interpolation or warping of the data. This ensures comparisons are only made between the available data in each dataset and thus similarity metrics are not biased by missing data. The decomposition and comparison technique has been applied during an impact experiment, a modal analysis, and a fatigue study, with the stress and displacement data obtained from finite element analysis, digital image correlation and thermoelastic stress analysis. The results demonstrate tha...
https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18419/DARUS-5519
README

Repository for publication: A. Shamooni et al., Super-resolution reconstruction of scalar fields from the pyrolysis of pulverised biomass using deep learning, Proc. Combust. Inst. (2025)

torch_code
The main PyTorch source code used for training/testing is provided in the torch_code.tar.gz file.

torch_code_tradGAN
To compare with a traditional GAN, we use the code in torch_code_tradGAN with similar particle-laden datasets. The source code is in the torch_code_tradGAN.tar.gz file.

datasets
The training/validation/testing datasets are provided in lmdb format, ready to use in the code. The datasets in datasets.tar.gz contain:
- Training dataset: data_train_OF-mass_kinematics_mk0x_1x_2x_FHIT_particle_128_Re52-2D_20736_lmdb.lmdb
- Test dataset: data_valid_inSample_OF-mass_kinematics_mk0x_1x_2x_FHIT_particle_128_Re52-2D_3456_lmdb.lmdb
Note that the samples from 9 DNS cases are collected in order (each case: 2304 samples for training and 384 samples for testing), which can be recognized using the provided metadata file in each folder.
- Out-of-distribution test dataset (used in Fig 10 of the paper): data_valid_inSample_OF-mass_kinematics_mk3x_FHIT_particle_128_Re52-2D_nonUniform_1024_lmdb.lmdb | We have two separate OOD DNS cases and from each we select 512 samples.

experiments
The main trained models are provided in the experiments.tar.gz file. Each experiment contains the log file of the training, the last training state (for restart) and the model weights used in the publication.
- Trained model using the main dataset (used in Figs 2-10 of the paper): h_oldOrder_mk_700-11-c_PFT_Inp4TrZk_outTrZ_RRDBNetCBAM-4Prt_DcondPrtWav_f128g64b16_BS16x4_LrG45D5_DS-mk012-20k_LStandLog
To compare with a traditional GAN, we use the code in torch_code_tradGAN with similar particle-laden datasets as above. The training consists of one pre-training step and two separate fine-tunings: one fine-tuning with the loss weights from the literature and one fine-tuning with tuned loss weights. The final results are in experiments/trad_GAN/experiments/
- Pre-trained traditional GAN model (used in Figs 8-9 of the paper): train_RRDB_SRx4_particle_PSNR
- Fine-tuned traditional GAN model with loss weights from the literature (used in Figs 8-9 of the paper): train_ESRGAN_SRx4_particle_Nista_oneBlock
- Fine-tuned traditional GAN model with optimized loss weights (used in Figs 8-9 of the paper): train_ESRGAN_SRx4_particle_oneBlock_betaA

inference_notebooks
The inference_notebooks folder contains example notebooks for inference. The folder contains "torch_code_inference" and "torch_code_tradGAN_inference". "torch_code_inference" is the inference for the main trained model; "torch_code_tradGAN_inference" is the inference for the traditional GAN approach. Move the inference folders in each of these folders into the corresponding torch_code roots. Also create softlinks of datasets and experiments in the main torch_code roots. Note that in each notebook you must double-check the required paths to make sure they are set correctly.

How to build the environment
To build the environment required for training and inference you need Anaconda. Go to the torch_code folder and run:
conda env create -f environment.yml
Then create an ipython kernel for post-processing:
conda activate torch_22_2025_Shamooni_PCI
python -m ipykernel install --user --name ipyk_torch_22_2025_Shamooni_PCI --display-name "ipython kernel for post processing of PCI2025"

Perform training
It is suggested to create softlinks to the dataset folder directly in the torch_code folder:
cd torch_code
ln -s datasets
You can also simply move the datasets and inference folders into the torch_code folder beside the cfd_sr folder and other files. In general, we prefer to have a root structure as below:
root files and directories:
cfd_sr
datasets
experiments
inference
options
init.py
test.py
train.py
version.py
Then activate the conda environment:
conda activate torch_22_2025_Shamooni_PCI
An example script to run on a single node with 2 GPUs:
torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py -opt options/train/condSRGAN/use_h_mk_700-011_PFT.yml --launcher pytorch
Make sure that the paths to the datasets ("dataroot_gt" and "meta_info_file") for both training and validation data in the option files are set correctly.
Western U.S. rangelands have been quantified as six fractional cover (0-100%) components over the Landsat archive (1985-2018) at 30-m resolution, termed the “Back-in-Time” (BIT) dataset. Robust validation through space and time is needed to quantify product accuracy. We leverage field data observed concurrently with HRS imagery over multiple years and locations in the Western U.S. to dramatically expand the spatial extent and sample size of validation analysis relative to a direct comparison to field observations and to previous work. We compare HRS and BIT data in the corresponding space and time. Our objectives were to evaluate the temporal and spatio-temporal relationships between HRS and BIT data, and to compare their response to spatio-temporal variation in climate. We hypothesize that strong temporal and spatio-temporal relationships will exist between HRS and BIT data and that they will exhibit similar climate response. We evaluated a total of 42 HRS sites across the western U.S. with 32 sites in Wyoming, and 5 sites each in Nevada and Montana. HRS sites span a broad range of vegetation, biophysical, climatic, and disturbance regimes. Our HRS sites were strategically located to collectively capture the range of biophysical conditions within a region. Field data were used to train 2-m predictions of fractional component cover at each HRS site and year. The 2-m predictions were degraded to 30-m, and some were used to train regional Landsat-scale, 30-m, “base” maps of fractional component cover representing circa 2016 conditions. A Landsat-imagery time-series spanning 1985-2018, excluding 2012, was analyzed for change through time. Pixels and times identified as changed from the base were trained using the base fractional component cover from the pixels identified as unchanged. Changed pixels were labeled with the updated predictions, while the base was maintained in the unchanged pixels. The resulting BIT suite includes the fractional cover of the six components described above for 1985-2018. We compare the two datasets, HRS and BIT, in space and time. The two tabular data files presented here correspond to a temporal and a spatio-temporal validation of the BIT data. First, the temporal data are HRS and BIT component cover and climate variable means by site by year. Second, the spatio-temporal data are HRS and BIT component cover and associated climate variables at individual pixels in a site-year.
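A hedged sketch of the temporal comparison described above, assuming a tidy table with hypothetical column names (site, year, hrs_cover, bit_cover); the actual column names and values in the released tabular data may differ:

import pandas as pd

# Hypothetical layout of the temporal table: one row per site-year.
temporal = pd.DataFrame({
    "site":      ["WY01", "WY01", "NV01", "NV01"],
    "year":      [2016, 2017, 2016, 2017],
    "hrs_cover": [34.0, 31.5, 12.0, 14.5],   # HRS fractional cover (%)
    "bit_cover": [32.5, 30.0, 13.5, 15.0],   # BIT fractional cover (%)
})

# Agreement between the two products across site-years.
r = temporal["hrs_cover"].corr(temporal["bit_cover"])
bias = (temporal["bit_cover"] - temporal["hrs_cover"]).mean()
print(f"Pearson r = {r:.2f}, mean bias (BIT - HRS) = {bias:.2f} %")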
This dataset, Vietnamese Sentiment Analysis - Food Reviews, is a combined and curated collection of two existing datasets:
It contains user-generated food reviews written in Vietnamese and labeled with sentiment ratings. The dataset consists of two columns:
0: Negative
1: Positive

This dataset is highly valuable for exploring sentiment analysis in the context of Vietnamese food reviews, offering a rich resource for developing, training, and evaluating machine learning and deep learning models.
This dataset is particularly useful for the following machine learning and natural language processing (NLP) tasks:
Sentiment Analysis
- Building models to classify user reviews as positive or negative sentiments.
- Developing solutions for businesses to understand customer satisfaction and improve services.
Text Classification
- Training supervised learning algorithms for binary classification tasks.
- Benchmarking Vietnamese NLP models on labeled datasets.
Feature Extraction
- Exploring feature extraction techniques such as TF-IDF, word embeddings (e.g., Word2Vec, FastText), or transformer-based embeddings (e.g., BERT, PhoBERT).
Natural Language Understanding
- Analyzing user sentiments for insights into food preferences and trends in Vietnamese culinary culture.
Transfer Learning
- Fine-tuning pre-trained Vietnamese language models like PhoBERT for downstream tasks.
Multi-Language Sentiment Analysis
- Augmenting cross-lingual sentiment analysis by comparing this dataset with similar datasets in other languages.
Recommender Systems
- Using sentiment scores as input features for food recommendation systems.
Aspect-Based Sentiment Analysis (ABSA)
- Extending the dataset to identify sentiment toward specific aspects of food reviews, such as taste, service, or price.
This dataset opens opportunities for researchers and practitioners to advance Vietnamese NLP and develop practical applications in the food and hospitality industry.
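As a minimal illustration of the sentiment-classification and TF-IDF feature-extraction use cases above (the reviews and labels below are tiny made-up placeholders; swap in the dataset's review text and its 0/1 labels):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative sample; replace with the dataset's reviews and labels.
texts  = ["mon an rat ngon", "phuc vu qua te", "rat hai long", "do an nguoi va nhat"]
labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["phuc vu rat ngon"]))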
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.

Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.

Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.

Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.

Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets, which can replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
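A hedged sketch of the kind of fidelity check described above (two-sample t-test plus 95% CI overlap) for one continuous parameter, using simulated values rather than VitalDB or GPT-4o output:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
real      = rng.normal(loc=62.0, scale=12.0, size=500)   # stand-in for a real parameter
synthetic = rng.normal(loc=61.5, scale=12.5, size=500)   # stand-in for LLM-generated values

# Two-sample t-test for equality of means (Welch's version).
t, p = stats.ttest_ind(real, synthetic, equal_var=False)

# 95% confidence interval for each mean, checked for overlap.
def ci(x):
    m, se = x.mean(), stats.sem(x)
    h = se * stats.t.ppf(0.975, len(x) - 1)
    return m - h, m + h

(r_lo, r_hi), (s_lo, s_hi) = ci(real), ci(synthetic)
overlap = r_lo <= s_hi and s_lo <= r_hi
print(f"p = {p:.3f}, CI overlap = {overlap}")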