Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Individual participant data (IPD) meta-analyses that obtain “raw” data from studies rather than summary data typically adopt a “two-stage” approach to analysis whereby IPD within trials generate summary measures, which are combined using standard meta-analytical methods. Recently, a range of “one-stage” approaches which combine all individual participant data in a single meta-analysis have been suggested as providing a more powerful and flexible approach. However, they are more complex to implement and require statistical support. This study uses a dataset to compare “two-stage” and “one-stage” models of varying complexity, to ascertain whether results obtained from the approaches differ in a clinically meaningful way.

Methods and Findings: We included data from 24 randomised controlled trials, evaluating antiplatelet agents, for the prevention of pre-eclampsia in pregnancy. We performed two-stage and one-stage IPD meta-analyses to estimate overall treatment effect and to explore potential treatment interactions whereby particular types of women and their babies might benefit differentially from receiving antiplatelets. Two-stage and one-stage approaches gave similar results, showing a benefit of using antiplatelets (relative risk 0.90, 95% CI 0.84 to 0.97). Neither approach suggested that any particular type of women benefited more or less from antiplatelets. There were no material differences in results between different types of one-stage model.

Conclusions: For these data, two-stage and one-stage approaches to analysis produce similar results. Although one-stage models offer a flexible environment for exploring model structure and are useful where across-study patterns relating to types of participant, intervention and outcome mask similar relationships within trials, the additional insights provided by their usage may not outweigh the costs of statistical support for routine application in syntheses of randomised controlled trials. Researchers considering undertaking an IPD meta-analysis should not necessarily be deterred by a perceived need for sophisticated statistical methods when combining information from large randomised trials.
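As a rough illustration of the two-stage idea described above (not the authors' code; the trial-level estimates below are made up), the sketch pools per-trial log relative risks with fixed-effect inverse-variance weighting:

import numpy as np

# Stage 1 (assumed already done): each trial's IPD reduced to a log relative
# risk and its standard error. These values are illustrative, not trial data.
log_rr = np.array([-0.15, -0.05, -0.20, 0.02])   # per-trial log(RR)
se     = np.array([0.10, 0.08, 0.12, 0.09])      # per-trial standard errors

# Stage 2: fixed-effect inverse-variance pooling of the trial-level estimates.
w = 1.0 / se**2
pooled = np.sum(w * log_rr) / np.sum(w)
pooled_se = np.sqrt(1.0 / np.sum(w))

ci_low, ci_high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"Pooled RR {np.exp(pooled):.2f} "
      f"(95% CI {np.exp(ci_low):.2f} to {np.exp(ci_high):.2f})")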
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Comparing AUROC, exploited biases, use cases and issues of Graph Attention Networks on similarity and patient-centric graphs (directed, reversed directed and undirected) for the classification of sepsis on complete blood count data (higher values represent better performance). We evaluated the classification performance on two datasets (internal and external dataset). Bold values represent the best values in each column.
Psychological scientists increasingly study web data, such as user ratings or social media postings. However, whether research relying on such web data leads to the same conclusions as research based on traditional data is largely unknown. To test this, we (re)analyzed three datasets, thereby comparing web data with lab and online survey data. We calculated correlations across these different datasets (Study 1) and investigated identical, illustrative research questions in each dataset (Studies 2 to 4). Our results suggest that web and traditional data are not fundamentally different and usually lead to similar conclusions, but also that it is important to consider differences between data types such as populations and research settings. Web data can be a valuable tool for psychologists when accounting for such differences, as it allows for testing established research findings in new contexts, complementing them with insights from novel data sources.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In life cycle assessment (LCA), collecting unit process data from empirical sources (e.g., meter readings, operation logs/journals) is often costly and time-consuming. We propose a new computational approach to estimate missing unit process data, relying solely on limited known data, based on a similarity-based link prediction method. The intuition is that similar processes in a unit process network tend to have similar material/energy inputs and waste/emission outputs. We use the ecoinvent 3.1 unit process data sets to test our method in four steps: (1) dividing the data sets into a training set and a test set; (2) randomly removing certain numbers of data in the test set indicated as missing; (3) using similarity-weighted means of various numbers of most similar processes in the training set to estimate the missing data in the test set; and (4) comparing estimated data with the original values to determine the performance of the estimation. The results show that missing data can be accurately estimated when less than 5% of the data are missing in one process. The estimation performance decreases as the percentage of missing data increases. This study provides a new approach to compile unit process data and demonstrates a promising potential of using computational approaches for LCA data compilation.
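A minimal sketch of the similarity-weighted estimation idea (not the authors' implementation; the toy process vectors and the choice of cosine similarity are illustrative): missing flows in a target process are estimated as the similarity-weighted mean of the corresponding flows in the k most similar complete processes.

import numpy as np

def estimate_missing(target, train, k=3):
    # target: 1-D array with NaNs marking missing flows
    # train:  2-D array (processes x flows) with complete data
    known = ~np.isnan(target)
    # Cosine similarity computed only over the flows known in the target.
    t = target[known]
    T = train[:, known]
    sims = (T @ t) / (np.linalg.norm(T, axis=1) * np.linalg.norm(t) + 1e-12)
    top = np.argsort(sims)[-k:]                   # k most similar processes
    w = sims[top] / (sims[top].sum() + 1e-12)     # similarity weights
    estimate = target.copy()
    estimate[~known] = w @ train[top][:, ~known]  # weighted mean of neighbours
    return estimate

# Toy example: 4 known processes, one process missing its last two flows.
train = np.array([[1.0, 0.5, 2.0, 0.1],
                  [1.1, 0.6, 2.1, 0.1],
                  [5.0, 3.0, 0.2, 4.0],
                  [4.8, 2.9, 0.3, 3.9]])
target = np.array([1.05, 0.55, np.nan, np.nan])
print(estimate_missing(target, train, k=2))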
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Clustering by distance normally requires all-against-all matching. This new algorithm can cluster 7 million proteins in under one hour using approximate clustering.
cat: contains the hierarchical sequence. protein_names: list of proteins in the group. Original data can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/env_nr.gz
Researchers can use the data to find relationships between proteins more easily.
The data set has two files. The protein_groupings file contains the clustered data and holds only protein names; the sequences for those names can be found in the protein_name_letter file.
The data was downloaded from the NCBI site, the FASTA format was converted into full-length sequences, and the sequences were fed into the clustering algorithm.
As this is hierarchical clustering, the relationship between sequences can be found by comparing the values in gn_list.
All groups start with cluster_id:0, split:0 and progress into matched splits. The difference between splits indicates how closely two sequences match, and comparing cluster_id values shows whether sequences belong to the same group or to different groups.
cluster_id = unique id for the cluster. split = approximate similarity between the sequences, given as an absolute value; a split of 63 means roughly 63 letters match between the sequences, and higher values mean greater similarity. inner_cluster_id = unique id for comparing matches within a cluster. total clusters = number of clusters after the approximate match is generated.
Due to space restrictions in Kaggle, this data set has only 9093 groups containing 129696 sequences.
One sequence may appear in more than one cluster because similarity is calculated as if an all-against-all comparison were used.
Example: for A, B and C, if A~B = 50, B~C = 50 and A~C = 0, then clustering will produce two groups, [A, B] and [B, C].
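A short sketch of how the fields described above might be used, assuming the grouping file can be read into rows holding cluster_id and protein_names (the real file layout on Kaggle may differ, so treat this purely as an illustration): two proteins are related if they ever share a cluster_id.

import pandas as pd

# Hypothetical rows mirroring the described fields; real layout may differ.
groups = pd.DataFrame({
    "cluster_id":    [0, 0, 1],
    "split":         [0, 63, 0],
    "protein_names": [["A", "B"], ["B", "C"], ["D"]],
})

def same_cluster(p1, p2, groups):
    # True if the two proteins ever appear under the same cluster_id.
    c1 = {cid for cid, names in zip(groups.cluster_id, groups.protein_names) if p1 in names}
    c2 = {cid for cid, names in zip(groups.cluster_id, groups.protein_names) if p2 in names}
    return bool(c1 & c2)

print(same_cluster("A", "C", groups))   # True: both reachable via cluster_id 0
print(same_cluster("A", "D", groups))   # False: different cluster_id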
If you need the full dataset for your research, contact me.
The previous dataset had issues with similarity comparisons between clusters, while inner-cluster comparison worked. This is fixed in the new version.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A collection of code snippets solving common programming problems in multiple variations.
Each problem has 20+ versions, written in different styles and logic patterns, making this dataset ideal for studying code similarity, logic variation, and coding style.
The dataset includes the following tasks:
- Reverse a String
- Find Max in List
- Check if a Number is Prime
- Check if a String is a Palindrome
- Generate Fibonacci Sequence
Each task contains:
- 20 variations of code
- Metadata file describing method and notes
- README with usage instructions
The full_metadata.csv file contains the following fields:
| Column Name | Description |
|---|---|
| problem_type | The programming task solved (e.g., reverse_string, max_in_list) |
| id | Unique ID of the snippet within that problem group |
| filename | Filename of the code snippet (e.g., snip_01.py) |
| language | Programming language used (Python) |
| method | Type of approach used (e.g., Slicing, Recursive, While loop) |
| notes | Additional details about the logic or style used in the snippet |
CodeSimilarityDataset/
│
├── reverse_string/
│   ├── snippets/
│   ├── metadata.csv
│   └── README.txt
│
├── max_in_list/
│   ├── snippets/
│   ├── metadata.csv
│   └── README.txt
│
├── is_prime/
│   ├── snippets/
│   ├── metadata.csv
│   └── README.txt
│
├── is_palindrome/
│   ├── snippets/
│   ├── metadata.csv
│   └── README.txt
│
├── fibonacci/
│   ├── snippets/
│   ├── metadata.csv
│   └── README.txt
│
└── full_metadata.csv   ← Combined metadata across all problems
- Visualize logic type distribution
- Compare structural similarity (AST/difflib/token matching)
- Cluster similar snippets using embeddings
- Train code-style-aware LLMs
All code snippets are .py files. Metadata is provided in CSV format for easy loading into pandas or other tools.
You can load metadata easily with Python:
import pandas as pd
df = pd.read_csv('full_metadata.csv')
print(df.sample(5))
Then read any snippet:
with open("reverse_string/snippets/snip_01.py") as f:
    code = f.read()
print(code)
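To illustrate the structural-similarity use case listed above, here is a minimal difflib sketch comparing two snippets of the same problem (snip_02.py is assumed to exist in the same folder; adjust the paths to whichever files you have):

import difflib

# Load two variations of the same task (second filename is an assumption).
with open("reverse_string/snippets/snip_01.py") as f:
    a = f.read()
with open("reverse_string/snippets/snip_02.py") as f:
    b = f.read()

# A ratio close to 1.0 means the two variations are textually very similar.
ratio = difflib.SequenceMatcher(None, a, b).ratio()
print(f"similarity: {ratio:.2f}")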
This dataset is released under the MIT License — free to use, modify, and distribute with proper attribution.
https://spdx.org/licenses/CC0-1.0.html
Time series are a critical component of ecological analysis, used to track changes in biotic and abiotic variables. Information can be extracted from the properties of time series for tasks such as classification (e.g. assigning species to individual bird calls); clustering (e.g. clustering similar responses in population dynamics to abrupt changes in the environment or management interventions); prediction (e.g. accuracy of model predictions to original time series data); and anomaly detection (e.g. detecting possible catastrophic events from population time series). These common tasks in ecological research rely on the notion of (dis-) similarity, which can be determined using distance measures. A plethora of distance measures have been described, predominantly in the computer and information sciences, but many have not been introduced to ecologists. Furthermore, little is known about how to select appropriate distance measures for time-series-related tasks. Therefore, many potential applications remain unexplored. Here we describe 16 properties of distance measures that are likely to be of importance to a variety of ecological questions involving time series. We then test 42 distance measures for each property and use the results to develop an objective method to select appropriate distance measures for any task and ecological dataset. We demonstrate our selection method by applying it to a set of real-world data on breeding bird populations in the UK and discuss other potential applications for distance measures, along with associated technical issues common in ecology. Our real-world population trends exhibit a common challenge for time series comparisons: a high level of stochasticity. We demonstrate two different ways of overcoming this challenge, first by selecting distance measures with properties that make them well-suited to comparing noisy time series, and second by applying a smoothing algorithm before selecting appropriate distance measures. In both cases, the distance measures chosen through our selection method are not only fit-for-purpose but are consistent in their rankings of the population trends. The results of our study should lead to an improved understanding of, and greater scope for, the use of distance measures for comparing ecological time series, and help us answer new ecological questions.

Methods: Distance measure test results were produced using R and can be replicated using scripts available on GitHub at https://github.com/shawndove/Trend_compare. Detailed information on wading bird trends can be found in Jellesmark et al. (2021) below.

Jellesmark, S., Ausden, M., Blackburn, T. M., Gregory, R. D., Hoffmann, M., Massimino, D., McRae, L., & Visconti, P. (2021). A counterfactual approach to measure the impact of wet grassland conservation on U.K. breeding bird populations. Conservation Biology, 35(5), 1575–1585. https://doi.org/10.1111/cobi.13692
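As a small, hedged illustration of the kind of measures discussed (the study's own tests are implemented in R in the linked repository; the toy series below are made up), this Python sketch computes two simple distances between population trends:

import numpy as np

# Two toy population index series (illustrative values only).
x = np.array([1.00, 0.95, 0.90, 0.92, 0.88])
y = np.array([1.00, 0.97, 0.93, 0.94, 0.90])

# Euclidean distance: sensitive to absolute differences in value.
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Correlation-based distance: sensitive to differences in shape, not level.
corr_dist = 1.0 - np.corrcoef(x, y)[0, 1]

print(f"Euclidean: {euclidean:.3f}, correlation-based: {corr_dist:.3f}")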
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains customer satisfaction scores collected from a survey, alongside key demographic and behavioral data. It includes variables such as customer age, gender, location, purchase history, support contact status, loyalty level, and satisfaction factors. The dataset is designed to help analyze customer satisfaction, identify trends, and develop insights that can drive business decisions.
File Information: File Name: customer_satisfaction_data.csv
File Type: CSV
Number of Rows: 120
Number of Columns: 10
Column Names:
Customer_ID – Unique identifier for each customer (e.g., 81-237-4704)
Group – The group to which the customer belongs (A or B)
Satisfaction_Score – Customer's satisfaction score on a scale of 1-10
Age – Age of the customer
Gender – Gender of the customer (Male, Female)
Location – Customer's location (e.g., Phoenix.AZ, Los Angeles.CA)
Purchase_History – Whether the customer has made a purchase (Yes or No)
Support_Contacted – Whether the customer has contacted support (Yes or No)
Loyalty_Level – Customer's loyalty level (Low, Medium, High)
Satisfaction_Factor – Primary factor contributing to customer satisfaction (e.g., Price, Product Quality)
Statistical Analyses:
Descriptive Statistics:
Calculate mean, median, mode, standard deviation, and range for key numerical variables (e.g., Satisfaction Score, Age).
Summarize categorical variables (e.g., Gender, Loyalty Level, Purchase History) with frequency distributions and percentages.
Two-Sample t-Test (Independent t-test):
Compare the mean satisfaction scores between two independent groups (e.g., Group A vs. Group B) to determine if there is a significant difference in their average satisfaction scores.
Paired t-Test:
If there are two related measurements (e.g., satisfaction scores before and after a certain event), you can compare the means using a paired t-test.
One-Way ANOVA (Analysis of Variance):
Test if there are significant differences in mean satisfaction scores across more than two groups (e.g., comparing the mean satisfaction score across different Loyalty Levels).
Chi-Square Test for Independence:
Examine the relationship between two categorical variables (e.g., Gender vs. Purchase History or Loyalty Level vs. Support Contacted) to determine if there’s a significant association.
Mann-Whitney U Test:
For non-normally distributed data, use this test to compare satisfaction scores between two independent groups (e.g., Group A vs. Group B) to see if their distributions differ significantly.
Kruskal-Wallis Test:
Similar to ANOVA, but used for non-normally distributed data. This test can compare the median satisfaction scores across multiple groups (e.g., comparing satisfaction scores across Loyalty Levels or Satisfaction Factors).
Spearman’s Rank Correlation:
Test for a monotonic relationship between two ordinal or continuous variables (e.g., Age vs. Satisfaction Score or Satisfaction Score vs. Loyalty Level).
Regression Analysis:
Linear Regression: Model the relationship between a continuous dependent variable (e.g., Satisfaction Score) and independent variables (e.g., Age, Gender, Loyalty Level).
Logistic Regression: If analyzing binary outcomes (e.g., Purchase History or Support Contacted), you could model the probability of an outcome based on predictors.
Factor Analysis:
To identify underlying patterns or groups in customer behavior or satisfaction factors, you can apply Factor Analysis to reduce the dimensionality of the dataset and group similar variables.
Cluster Analysis:
Use K-Means Clustering or Hierarchical Clustering to group customers based on similarity in their satisfaction scores and other features (e.g., Loyalty Level, Purchase History).
Confidence Intervals:
Calculate confidence intervals for the mean of satisfaction scores or any other metric to estimate the range in which the true population mean might lie.
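A minimal sketch of two of the analyses listed above, assuming the column names described earlier (Group, Satisfaction_Score) and the file customer_satisfaction_data.csv:

import pandas as pd
from scipy import stats

df = pd.read_csv("customer_satisfaction_data.csv")
a = df.loc[df["Group"] == "A", "Satisfaction_Score"]
b = df.loc[df["Group"] == "B", "Satisfaction_Score"]

# Two-sample t-test (Welch's version, which does not assume equal variances).
t, p_t = stats.ttest_ind(a, b, equal_var=False)

# Mann-Whitney U test as the non-parametric alternative.
u, p_u = stats.mannwhitneyu(a, b, alternative="two-sided")

print(f"t-test p = {p_t:.3f}, Mann-Whitney p = {p_u:.3f}")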
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In research evaluating statistical analysis methods, a common aim is to compare point estimates and confidence intervals (CIs) calculated from different analyses. This can be challenging when the outcomes (and their scale ranges) differ across datasets. We therefore developed a plot to facilitate pairwise comparisons of point estimates and confidence intervals from different statistical analyses both within and across datasets.
The plot was developed and refined over the course of an empirical study. To compare results from a variety of different studies, a system of centring and scaling is used. Firstly, the point estimates from reference analyses are centred to zero, followed by scaling confidence intervals to span a range of one. The point estimates and confidence intervals from matching comparator analyses are then adjusted by the same amounts. This enables the relative positions of the point estimates and CI widths to be quickly assessed while maintaining the relative magnitudes of the difference in point estimates and confidence interval widths between the two analyses. Banksia plots can be graphed in a matrix, showing all pairwise comparisons of multiple analyses. In this paper, we show how to create a banksia plot and present two examples: the first relates to an empirical evaluation assessing the difference between various statistical methods across 190 interrupted time series (ITS) data sets with widely varying characteristics, while the second example assesses data extraction accuracy comparing results obtained from analysing original study data (43 ITS studies) with those obtained by four researchers from datasets digitally extracted from graphs from the accompanying manuscripts.
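A minimal numerical sketch of the centring and scaling step described above, using made-up estimates (the published code accompanying the paper is in Stata and R; this Python version is only illustrative):

import numpy as np

# Reference analysis: point estimate with 95% CI (illustrative values).
ref_est, ref_low, ref_high = 0.50, 0.20, 0.80
# Comparator analysis of the same dataset.
cmp_est, cmp_low, cmp_high = 0.55, 0.15, 0.95

shift = ref_est                 # centre the reference estimate at zero
scale = ref_high - ref_low      # scale the reference CI to span one

def transform(est, low, high):
    return ((est - shift) / scale, (low - shift) / scale, (high - shift) / scale)

print(transform(ref_est, ref_low, ref_high))   # (0.0, -0.5, 0.5) by construction
print(transform(cmp_est, cmp_low, cmp_high))   # comparator, same shift and scale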
In the banksia plot of statistical method comparison, it was clear that there was no difference, on average, in point estimates and it was straightforward to ascertain which methods resulted in smaller, similar or larger confidence intervals than others. In the banksia plot comparing analyses from digitally extracted data to those from the original data it was clear that both the point estimates and confidence intervals were all very similar among data extractors and original data.
The banksia plot, a graphical representation of centred and scaled confidence intervals, provides a concise summary of comparisons between multiple point estimates and associated CIs in a single graph. Through this visualisation, patterns and trends in the point estimates and confidence intervals can be easily identified.
This collection of files allows the user to create the images used in the companion paper and to amend the code to create their own banksia plots using either Stata version 17 or R version 4.3.1.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Until recently, researchers who wanted to examine the determinants of state respect for most specific negative rights needed to rely on data from the CIRI or the Political Terror Scale (PTS). The new V-DEM dataset offers scholars a potential alternative to the individual human rights variables from CIRI. We analyze a set of key Cingranelli-Richards (CIRI) Human Rights Data Project and Varieties of Democracy (V-DEM) negative rights indicators, finding unusual and unexpectedly large patterns of disagreement between the two sets. First, we discuss the new V-DEM dataset by comparing it to the disaggregated CIRI indicators, discussing the history of each project, and describing its empirical domain. Second, we identify a set of disaggregated human rights measures that are similar across the two datasets and discuss each project's measurement approach. Third, we examine how these measures compare to each other empirically, showing that they diverge considerably across both time and space. These findings point to several important directions for future work, such as how conceptual approaches and measurement strategies affect rights scores. For the time being, our findings suggest that researchers should think carefully about using the measures as substitutes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective: Literature-based Discovery (LBD) identifies new knowledge by leveraging existing literature. It exploits interconnecting implicit relationships to build bridges between isolated sets of non-interacting literatures. It has been used to facilitate drug repurposing, new drug discovery, and study adverse event reactions. Within the last decade, LBD systems have transitioned from using statistical methods to exploring deep learning (DL) to analyze semantic spaces between non-interacting literatures. Recent works explore knowledge graphs (KG) to represent explicit relationships. These works envision LBD as a knowledge graph completion (KGC) task and use DL to generate implicit relationships. However, these systems require the researcher to have domain-expert knowledge when submitting relevant queries for novel hypothesis discovery.

Methods: Our method explores a novel approach to identify all implicit hypotheses given the researcher's search query and expedites the knowledge discovery process. We revise the KGC task as the task of predicting interconnecting vertex embeddings within the graph. We train our model using a similarity learning objective and compare our model's predictions against all known vertices within the graph to determine the likelihood of an implicit relationship (i.e., connecting edge). We also explore three approaches to represent edge connections between vertices within the KG: average, concatenation, and Hadamard. Lastly, we explore an approach to induce inductive biases and expedite model convergence (i.e., input representation scaling).

Results: We evaluate our method by replicating five known discoveries within the Hallmark of Cancer (HOC) datasets and compare our method to two existing works. Our results show no significant difference in reported ranks and model convergence rate when comparing scaling our input representations and not using this method. Comparing our method to previous works, we found our method achieves optimal performance on two of five datasets and achieves comparable performance on the remaining datasets. We further analyze our results using statistical significance testing to demonstrate the efficacy of our method.

Conclusion: We found our similarity-based learning objective predicts linking vertex embeddings for single relationship closed discovery replication. Our method also provides a ranked list of linking vertices between a set of inputs. This approach reduces researcher burden and allows further exploration of generated hypotheses.
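A small numpy sketch of the three edge-representation schemes mentioned above (average, concatenation, Hadamard), with toy vertex embeddings; this is illustrative only and not the authors' code:

import numpy as np

# Toy vertex embeddings for two KG vertices (e.g., a drug and a disease term).
u = np.array([0.2, -0.1, 0.7, 0.4])
v = np.array([0.5, 0.3, 0.1, -0.2])

edge_avg    = (u + v) / 2.0           # average
edge_concat = np.concatenate([u, v])  # concatenation (doubles the dimension)
edge_had    = u * v                   # Hadamard (element-wise) product

# Cosine similarity against a candidate vertex embedding gives a rough score
# for an implicit connecting edge, mirroring the similarity-learning setup.
cand = np.array([0.3, 0.0, 0.4, 0.1])
cos = edge_avg @ cand / (np.linalg.norm(edge_avg) * np.linalg.norm(cand))
print(edge_avg, edge_had, round(float(cos), 3))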
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
We investigate the impact of computer vision models, a prominent artificial intelligence tool, on critical knowledge infrastructure, using the case of Google search engines. We answer the following research question: How do search results for Google Images compare internationally with those for Google Search, and how can these results be explained by changes in Google’s knowledge infrastructure? To answer this question, we carry out four steps: 1) theorise the relationship between web epistemology, calculative technology, platform vernacular and issue configuration, illustrating the dynamics of critical knowledge infrastructures on the web; 2) provide a potted history of Google’s use of computer vision in search; 3) undertake the first international comparison of search results from Google Search with Google Images; 4) analyse the visual content of search results from Google Images. Using quanti-quali digital methods including visual content analysis, social semiotics and computer vision network analysis, we analyse search results related to environmental change across six countries, with two key findings. First, Google Images search results contain fewer authoritative sources than Google Search across all countries. Second, Google Images results constitute a narrow, homogenised visual repertoire across all countries. This constitutes a transformation in web epistemology from ranking-by-authority to ranking-by-similarity, driven by a shift in calculative technology from web links (Google Search) to computer vision (Google Images). Our findings and theoretical model open up new questions regarding the impact of computer vision on the public availability of knowledge in our increasingly image-saturated digital societies.
https://creativecommons.org/publicdomain/zero/1.0/
Benchmarks allow for easy comparison between multiple devices by scoring their performance on a standardized series of tests, and they are useful in many instances, such as when buying a new phone or tablet.
Newest data as of May 3rd, 2022. This dataset contains benchmarks of Android and iOS devices.
Benchmark apps give your device an overall numerical score as well as individual scores for each test they perform. The overall score is created by adding the results of those individual scores. These score numbers don't mean much on their own; they're just helpful for comparing different devices. For example, if your device's score is 300000, a device with a score of 600000 is about twice as fast. You can use individual test scores to compare the relative performance of specific parts of different devices. For example, you could compare how fast your phone's storage performs compared to another phone's storage.
The first part of the overall score is your CPU score. The CPU score in turn includes the output of CPU Mathematical Operations, CPU Common Algorithms, and CPU Multi-Core. In simpler words, the CPU score means how fast your phone processes commands. Your device's central processing unit (CPU) does most of the number-crunching. A faster CPU can run apps faster, so everything on your device will seem faster. Of course, once you get to a certain point, CPU speed won't affect performance much. However, a faster CPU may still help when running more demanding applications, such as high-end games.
The second part of the overall score is your GPU score. This score is composed of the output of graphical components like Metal, OpenGL or Vulkan, depending on your device. The GPU score means how well your phone displays 2D and 3D graphics. Your device's graphics processing unit (GPU) handles accelerated graphics. When you play a game, your GPU kicks into gear and renders the 3D graphics or accelerates the shiny 2D graphics. Many interface animations and other transitions also use the GPU. The GPU is optimized for these sorts of graphics operations. The CPU could perform them, but it's more general-purpose and would take more time and battery power. You can say that your GPU does the graphics number-crunching, so a higher score here is better.
The third part of the overall score is your MEM score. The MEM score includes the results of the output of RAM Access, ROM APP IO, ROM Sequential Read and Write, and ROM Random Access. In simpler words, the MEM score means how fast and how much memory your phone possesses. RAM stands for random-access memory, while ROM stands for read-only memory. Your device uses RAM as working memory, while flash storage or an internal SD card is used for long-term storage. The faster it can write to and read data from its RAM, the faster your device will perform. Your RAM is constantly being used on your device, whatever you're doing. While RAM is volatile in nature, ROM is its opposite. RAM mostly stores temporary data, while ROM is used to store permanent data like the firmware of your phone. Both the RAM and ROM make up the memory of your phone, helping it to perform tasks efficiently.
The fourth and final part of the overall score is your UX score. The UX score is made up of the results of the output of the Data Security, Data Processing, Image Processing, User Experience, and Video CTS and Decode tests. The UX score means an overall score that represents how the device's "user experience" will be in the real world. It's a number you can look at to get a feel for a device's overall performance without digging into the above benchmarks or relying too much on the overall score.
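A toy sketch of how the overall score is composed from the four parts described above and then compared between two devices, using made-up sub-scores:

# Made-up sub-scores for two devices (CPU, GPU, MEM, UX).
device_a = {"CPU": 120_000, "GPU": 90_000, "MEM": 60_000, "UX": 30_000}
device_b = {"CPU": 250_000, "GPU": 200_000, "MEM": 100_000, "UX": 50_000}

overall_a = sum(device_a.values())
overall_b = sum(device_b.values())

# Relative comparison: a device with twice the score is roughly twice as fast.
print(f"A: {overall_a}, B: {overall_b}, B/A = {overall_b / overall_a:.2f}x")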
Data scraped from AnTuTu, cross-platform adjusted using 3DMark and Geekbench.
By Andy Bramwell [source]
This dataset contains sales data for video games from all around the world, across different platforms, genres and regions. From the latest thought-provoking RPG releases to thrilling racing games, this database provides insight into what constitutes a hit game in today’s gaming industry. Armed with this data and analysis, future developers can better understand what types of gameplay and mechanics resonate more with players to create a new gaming experience. Through its comprehensive analysis of various game titles, genres and platforms, this dataset displays detailed insights into how video games can achieve global success, as well as providing a window into the ever-changing trends of gaming culture.
This dataset can be used to uncover hidden trends in Global Video Games Sales. To make the most of this data, it is important to understand the different columns and their respective values.
The 'Rank' column identifies each game's ranking according to its global sales (highest to lowest). This can help you identify which games are most popular globally. The 'Game Title' column contains the name of each video game, which allows you to easily discern one entry from another. The 'Platform' column lists the type of platform on which each game was released, e.g., PlayStation 4 or Xbox One, so that you can make comparisons between platforms as well as specific games for each platform. The 'Year' column provides an additional way of making year-on-year comparisons and tracking changes over time in global video game sales.
In addition, this dataset also contains metadata such as genre ('Genre'), publisher ('Publisher'), and review score ('Review') that add context when considering a particular title's performance in terms of global sales rankings. For example, it might be more compelling to compare two similar genres than two disparate ones when analyzing how successful a select set of titles have been at generating revenue in comparison with others released globally within that timeline.

Lastly, but no less important, are the variables dedicated exclusively to geographic breakdowns: North America ('North America'), Europe ('Europe'), Japan ('Japan'), Rest of World ('Rest of World'), and Global ('Global'). These allow us to see how certain regions contribute individually or collectively towards a given title's overall sales figures; by comparing these metrics regionally or collectively, an interesting picture arises from which inferences about consumer preferences and supplier priorities emerge.

Overall, this powerful dataset allows researchers and marketers alike a deep dive into market performance for those persistent questions about demand patterns across demographics around the world.
- Analyzing the effects of genre and platform on a game's success - By comparing different genres and platforms, one can get a better understanding of what type of games have the highest sales in different regions across the globe. This could help developers decide which type of gaming content to create in order to maximize their profits.
- Tracking changes in global video games trends over time - This dataset could be used to analyze how various elements such as genre or platform affect success over various years, allowing developers an inside look into what kind of videos are being favored at any given moment across the world.
- Identifying highly successful games and their key elements - Developers could look at this data to find any common factors, such as publisher or platform, shared by successful titles to uncover characteristics that lead to a high rate of return when creating video games or other forms of media entertainment.
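A small pandas sketch of the first use case above, assuming the column names described in this page (Genre, Global, and the regional columns) and the file Video Games Sales.csv; the genre value used for the regional contrast is an assumption:

import pandas as pd

df = pd.read_csv("Video Games Sales.csv")

# Total global sales by genre, highest first: a quick view of which genres sell.
genre_sales = df.groupby("Genre")["Global"].sum().sort_values(ascending=False)
print(genre_sales.head(10))

# Regional contrast for a single (assumed) genre value.
print(df[df["Genre"] == "Racing"][["North America", "Europe", "Japan"]].sum())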
If you use this dataset in your research, please credit the original authors. Data Source
See the dataset description for more information.
File: Video Games Sales.csv

| Column name | Description |
|:------------------|:------------------------------------------------------------|
| Rank | The ranking of the game in terms of global sales. (Integer) |
| Game Title | The title of the game. (String) |
| Platform | The platform the game was released on. (String) |
...
A new decomposition algorithm based on QR factorisation is introduced for processing and comparing irregularly shaped stress and deformation datasets found in structural analysis. The algorithm improves the comparison of two-dimensional data fields from the surface of components where data is missing from the field of view due to obstructed measurement systems or component geometry that results in areas where no data is present. The technique enables the comparison of these irregularly shaped datasets without the need for interpolation or warping of the data. This ensures comparisons are only made between the available data in each dataset and thus similarity metrics are not biased by missing data. The decomposition and comparison technique has been applied during an impact experiment, a modal analysis, and a fatigue study, with the stress and displacement data obtained from finite element analysis, digital image correlation and thermoelastic stress analysis. The results demonstrate tha...
https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18419/DARUS-5519
README

Repository for publication: A. Shamooni et al., Super-resolution reconstruction of scalar fields from the pyrolysis of pulverised biomass using deep learning, Proc. Combust. Inst. (2025)

torch_code
The main PyTorch source code used for training/testing is provided in the torch_code.tar.gz file.

torch_code_tradGAN
To compare with a traditional GAN, we use the code in torch_code_tradGAN with similar particle-laden datasets. The source code is in the torch_code_tradGAN.tar.gz file.

datasets
The training/validation/testing datasets are provided in lmdb format, ready to use in the code. The datasets in datasets.tar.gz contain:
- Training dataset: data_train_OF-mass_kinematics_mk0x_1x_2x_FHIT_particle_128_Re52-2D_20736_lmdb.lmdb
- Test dataset: data_valid_inSample_OF-mass_kinematics_mk0x_1x_2x_FHIT_particle_128_Re52-2D_3456_lmdb.lmdb
Note that the samples from 9 DNS cases are collected in order (each case: 2304 samples for training and 384 samples for testing), which can be recognized using the provided metadata file in each folder.
- Out-of-distribution test dataset (used in Fig 10 of the paper): data_valid_inSample_OF-mass_kinematics_mk3x_FHIT_particle_128_Re52-2D_nonUniform_1024_lmdb.lmdb | We have two separate OOD DNS cases and from each we select 512 samples.

experiments
The main trained models are provided in the experiments.tar.gz file. Each experiment contains the log file of the training, the last training state (for restart) and the model weights used in the publication.
- Trained model using the main dataset (used in Figs 2-10 of the paper): h_oldOrder_mk_700-11-c_PFT_Inp4TrZk_outTrZ_RRDBNetCBAM-4Prt_DcondPrtWav_f128g64b16_BS16x4_LrG45D5_DS-mk012-20k_LStandLog
To compare with a traditional GAN, we use the code in torch_code_tradGAN with similar particle-laden datasets as above. The training consists of one pre-training step and two separate fine-tunings: one fine-tuning with the loss weights from the literature and one fine-tuning with tuned loss weights. The final results are in experiments/trad_GAN/experiments/
- Pre-trained traditional GAN model (used in Figs 8-9 of the paper): train_RRDB_SRx4_particle_PSNR
- Fine-tuned traditional GAN model with loss weights from the literature (used in Figs 8-9 of the paper): train_ESRGAN_SRx4_particle_Nista_oneBlock
- Fine-tuned traditional GAN model with optimized loss weights (used in Figs 8-9 of the paper): train_ESRGAN_SRx4_particle_oneBlock_betaA

inference_notebooks
The inference_notebooks folder contains example notebooks for inference. The folder contains "torch_code_inference" and "torch_code_tradGAN_inference". "torch_code_inference" is the inference for the main trained model; "torch_code_tradGAN_inference" is the inference for the traditional GAN approach. Move the inference folders in each of these folders into the corresponding torch_code roots. Also create softlinks of datasets and experiments in the main torch_code roots. Note that in each notebook you must double-check the required paths to make sure they are set correctly.

How to build the environment
To build the environment required for training and inference you need Anaconda. Go to the torch_code folder and run:
conda env create -f environment.yml
Then create an ipython kernel for post-processing:
conda activate torch_22_2025_Shamooni_PCI
python -m ipykernel install --user --name ipyk_torch_22_2025_Shamooni_PCI --display-name "ipython kernel for post processing of PCI2025"

Perform training
It is suggested to create softlinks to the dataset folder directly in the torch_code folder:
cd torch_code
ln -s datasets
You can also simply move the datasets and inference folders into the torch_code folder beside the cfd_sr folder and other files. In general, we prefer to have a root structure as below:
root files and directories:
cfd_sr
datasets
experiments
inference
options
init.py
test.py
train.py
version.py
Then activate the conda environment:
conda activate torch_22_2025_Shamooni_PCI
An example script to run on a single node with 2 GPUs:
torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py -opt options/train/condSRGAN/use_h_mk_700-011_PFT.yml --launcher pytorch
Make sure that the paths to the datasets ("dataroot_gt" and "meta_info_file") for both training and validation data in the option files are set correctly.
Western U.S. rangelands have been quantified as six fractional cover (0-100%) components over the Landsat archive (1985-2018) at 30-m resolution, termed the “Back-in-Time” (BIT) dataset. Robust validation through space and time is needed to quantify product accuracy. We leverage field data observed concurrently with HRS imagery over multiple years and locations in the Western U.S. to dramatically expand the spatial extent and sample size of validation analysis relative to a direct comparison to field observations and to previous work. We compare HRS and BIT data in the corresponding space and time. Our objectives were to evaluate the temporal and spatio-temporal relationships between HRS and BIT data, and to compare their response to spatio-temporal variation in climate. We hypothesize that strong temporal and spatio-temporal relationships will exist between HRS and BIT data and that they will exhibit similar climate response. We evaluated a total of 42 HRS sites across the western U.S. with 32 sites in Wyoming, and 5 sites each in Nevada and Montana. HRS sites span a broad range of vegetation, biophysical, climatic, and disturbance regimes. Our HRS sites were strategically located to collectively capture the range of biophysical conditions within a region. Field data were used to train 2-m predictions of fractional component cover at each HRS site and year. The 2-m predictions were degraded to 30-m, and some were used to train regional Landsat-scale, 30-m, “base” maps of fractional component cover representing circa 2016 conditions. A Landsat-imagery time-series spanning 1985-2018, excluding 2012, was analyzed for change through time. Pixels and times identified as changed from the base were trained using the base fractional component cover from the pixels identified as unchanged. Changed pixels were labeled with the updated predictions, while the base was maintained in the unchanged pixels. The resulting BIT suite includes the fractional cover of the six components described above for 1985-2018. We compare the two datasets, HRS and BIT, in space and time. The two tabular data files presented here correspond to a temporal and a spatio-temporal validation of the BIT data. First, the temporal data are HRS and BIT component cover and climate variable means by site by year. Second, the spatio-temporal data are HRS and BIT component cover and associated climate variables at individual pixels in a site-year.
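A hedged sketch of the temporal comparison described above, assuming a tidy table with hypothetical column names (site, year, hrs_cover, bit_cover); the actual column names and values in the released tabular data may differ:

import pandas as pd

# Hypothetical layout of the temporal table: one row per site-year.
temporal = pd.DataFrame({
    "site":      ["WY01", "WY01", "NV01", "NV01"],
    "year":      [2016, 2017, 2016, 2017],
    "hrs_cover": [34.0, 31.5, 12.0, 14.5],   # HRS fractional cover (%)
    "bit_cover": [32.5, 30.0, 13.5, 15.0],   # BIT fractional cover (%)
})

# Agreement between the two products across site-years.
r = temporal["hrs_cover"].corr(temporal["bit_cover"])
bias = (temporal["bit_cover"] - temporal["hrs_cover"]).mean()
print(f"Pearson r = {r:.2f}, mean bias (BIT - HRS) = {bias:.2f} %")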
This dataset, Vietnamese Sentiment Analysis - Food Reviews, is a combined and curated collection of two existing datasets:
It contains user-generated food reviews written in Vietnamese and labeled with sentiment ratings. The dataset consists of two columns:
0: Negative
1: Positive

This dataset is highly valuable for exploring sentiment analysis in the context of Vietnamese food reviews, offering a rich resource for developing, training, and evaluating machine learning and deep learning models.
This dataset is particularly useful for the following machine learning and natural language processing (NLP) tasks:
Sentiment Analysis
- Building models to classify user reviews as positive or negative sentiments.
- Developing solutions for businesses to understand customer satisfaction and improve services.
Text Classification
- Training supervised learning algorithms for binary classification tasks.
- Benchmarking Vietnamese NLP models on labeled datasets.
Feature Extraction
- Exploring feature extraction techniques such as TF-IDF, word embeddings (e.g., Word2Vec, FastText), or transformer-based embeddings (e.g., BERT, PhoBERT).
Natural Language Understanding
- Analyzing user sentiments for insights into food preferences and trends in Vietnamese culinary culture.
Transfer Learning
- Fine-tuning pre-trained Vietnamese language models like PhoBERT for downstream tasks.
Multi-Language Sentiment Analysis
- Augmenting cross-lingual sentiment analysis by comparing this dataset with similar datasets in other languages.
Recommender Systems
- Using sentiment scores as input features for food recommendation systems.
Aspect-Based Sentiment Analysis (ABSA)
- Extending the dataset to identify sentiment toward specific aspects of food reviews, such as taste, service, or price.
This dataset opens opportunities for researchers and practitioners to advance Vietnamese NLP and develop practical applications in the food and hospitality industry.
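As a minimal illustration of the sentiment-classification and TF-IDF feature-extraction use cases above (the reviews and labels below are tiny made-up placeholders; swap in the dataset's review text and its 0/1 labels):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative sample; replace with the dataset's reviews and labels.
texts  = ["mon an rat ngon", "phuc vu qua te", "rat hai long", "do an nguoi va nhat"]
labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["phuc vu rat ngon"]))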
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.

Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.

Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.

Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.

Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets, which can replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
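A hedged sketch of the kind of fidelity check described above (two-sample t-test plus 95% CI overlap) for one continuous parameter, using simulated values rather than VitalDB or GPT-4o output:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
real      = rng.normal(loc=62.0, scale=12.0, size=500)   # stand-in for a real parameter
synthetic = rng.normal(loc=61.5, scale=12.5, size=500)   # stand-in for LLM-generated values

# Two-sample t-test for equality of means (Welch's version).
t, p = stats.ttest_ind(real, synthetic, equal_var=False)

# 95% confidence interval for each mean, checked for overlap.
def ci(x):
    m, se = x.mean(), stats.sem(x)
    h = se * stats.t.ppf(0.975, len(x) - 1)
    return m - h, m + h

(r_lo, r_hi), (s_lo, s_hi) = ci(real), ci(synthetic)
overlap = r_lo <= s_hi and s_lo <= r_hi
print(f"p = {p:.3f}, CI overlap = {overlap}")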