Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TrueFace is the first dataset of real and synthetic faces processed by social media platforms: the synthetic faces were generated with StyleGAN models, and all images were shared on Facebook, Twitter and Telegram.
Images have historically been a universal and cross-cultural communication medium, capable of reaching people of any social background, status or education. Unsurprisingly, though, their social impact has often been exploited for malicious purposes, like spreading misinformation and manipulating public opinion. With today's technologies, generating highly realistic fakes is within everyone's reach. A major threat comes in particular from synthetically generated faces, which can deceive even the most experienced observer. To counter this fake-news phenomenon, researchers have employed artificial intelligence to detect synthetic images by analysing patterns and artifacts introduced by the generative models. However, most online images are subject to repeated sharing operations on social media platforms. These platforms process uploaded images by applying operations (like compression) that progressively degrade the useful forensic traces, compromising the effectiveness of the developed detectors. To solve the synthetic-vs-real problem "in the wild", more realistic image databases, like TrueFace, are needed to train specialised detectors.
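The degradation described above can be sketched numerically. The snippet below is an illustrative stand-in, not the platforms' actual pipelines: it uses coarse value quantisation as a proxy for lossy compression and shows how repeated "shares" push an image further from its original, erasing exactly the fine detail that forensic detectors rely on.

```python
import numpy as np

def lossy_share(img: np.ndarray, step: int) -> np.ndarray:
    """Proxy for one platform re-share: quantise pixel values,
    discarding the fine detail that forensic detectors rely on."""
    return (img // step) * step

rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 64), dtype=np.int64)

shared = original
for step in (4, 8, 16):          # three increasingly lossy sharing hops
    shared = lossy_share(shared, step)

# Mean absolute pixel error grows with each hop: the traces degrade.
err = float(np.abs(original - shared).mean())
```

After three hops every pixel is a multiple of the coarsest step, so any low-amplitude generator artifact riding on the pixel values is gone.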
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Skin Disease GAN-Generated Lightweight Dataset
This dataset is a collection of skin disease images generated using a Generative Adversarial Network (GAN) approach. Specifically, a GAN was utilized with Stable Diffusion as the generator and a transformer-based discriminator to create realistic images of various skin diseases. The GAN approach enhances the accuracy and realism of the generated images, making this dataset a valuable resource for machine learning and computer vision applications in dermatology.
To create this dataset, a series of Low-Rank Adaptations (LoRAs) was trained, one per disease category. These LoRAs were trained on the base dataset for 60 epochs and 30,000 steps using OneTrainer. Images were then generated for the following disease categories:
Melanoma was excluded from the generation process because ample public images are already available. The Fooocus API served as the generator within the GAN framework, creating images based on the LoRAs.
To ensure quality and accuracy, a transformer-based discriminator was employed to verify the generated images, classifying them into the correct disease categories.
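The verification step can be sketched as a simple filter. The `classify` stub below is a hypothetical stand-in for the transformer-based discriminator (whose weights and API are not part of this dataset description): a generated image is kept only when its predicted category matches the category it was generated for.

```python
from typing import Callable

def verify(images: list[str], intended: str,
           classify: Callable[[str], str]) -> list[str]:
    """Keep only the generated images whose predicted disease
    category matches the category they were generated for."""
    return [img for img in images if classify(img) == intended]

# Hypothetical stand-in discriminator: the label is encoded in the
# file name, so mislabelled (misclassified) images get filtered out.
fake_classify = lambda name: name.split("_")[0]
batch = ["eczema_001.png", "psoriasis_002.png", "eczema_003.png"]
accepted = verify(batch, "eczema", fake_classify)   # drops the psoriasis image
```

In the real pipeline the stub would be replaced by a forward pass through the transformer classifier, but the accept/reject logic is the same.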
The original base dataset used to create this GAN-based dataset includes reputable sources such as:
2019 HAM10000 Challenge (Kaggle), Google Images, Dermnet NZ, Bing Images, Yandex, Hellenic Atlas, and the Dermatological Atlas. The LoRAs and their recommended weights for generating images are available for download on our CivitAi profile, which also provides detailed instructions and access to the LoRAs used in this dataset.
Generated Images: High-quality images of skin diseases generated via a GAN with Stable Diffusion, using a transformer-based discriminator for accurate classification.
This dataset is suitable for:
When using this dataset, please cite the following reference: Espinosa, E.G., Castilla, J.S.R., Lamont, F.G. (2025). Skin Disease Pre-diagnosis with Novel Visual Transformers. In: Figueroa-García, J.C., Hernández, G., Suero Pérez, D.F., Gaona García, E.E. (eds) Applied Computer Sciences in Engineering. WEA 2024. Communications in Computer and Information Science, vol 2222. Springer, Cham. https://doi.org/10.1007/978-3-031-74595-9_10
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Monthly nighttime lights (NTL) can clearly depict an area's prevailing intra-year socio-economic dynamics. The Earth Observation Group at the Colorado School of Mines provides monthly NTL products from the Day/Night Band (DNB) of the Visible Infrared Imaging Radiometer Suite (VIIRS) instrument (April 2012 onwards) and from the Operational Linescan System (OLS) sensor on board the Defense Meteorological Satellite Program (DMSP) satellites (April 1992 onwards). In the current study, an attempt has been made to generate synthetic monthly VIIRS-like products for 1992-2012 using a deep learning-based image translation network. Initially, defects in the 216 monthly DMSP images (1992-2013) were corrected to remove geometric errors, background noise, and radiometric errors. The monthly VIIRS imagery was corrected to remove background noise and ephemeral lights using low and high thresholds. The improved DMSP and corrected VIIRS images from April 2012 to December 2013 were then used in a conditional generative adversarial network (cGAN), with Land Use Land Cover as auxiliary input, to generate VIIRS-like imagery for 1992-2012. The modelled imagery was aggregated annually and showed an R2 of 0.94 against other annual-scale VIIRS-like imagery products of India, an R2 of 0.85 with respect to GDP, and an R2 of 0.69 with respect to population. Regression analysis of the generated VIIRS-like products against the actual VIIRS images for 2012 and 2013 over India indicated a good approximation, with R2 values of 0.64 and 0.67 respectively, while the spatial density relation revealed an under-estimation of brightness by the model at extremely high radiance values, with R2 values of 0.56 and 0.53 respectively. Qualitative analysis was also performed at both national and state scales. Visual analysis over 1992-2013 confirms a gradual increase in the brightness of the lights, indicating that the cGAN-modelled images closely follow the actual pattern of the nighttime lights.
Finally, a synthetically generated monthly VIIRS-like product is delivered to the research community which will be useful for studying the changes in socio-economic dynamics over time.
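The two-threshold cleaning step applied to the monthly VIIRS imagery can be sketched as follows. The threshold values and radiances here are illustrative assumptions, not those used in the study: pixels below the low threshold are treated as background noise, pixels above the high threshold as ephemeral lights (fires, gas flares), and both are masked out.

```python
import numpy as np

def mask_ntl(radiance: np.ndarray, low: float, high: float) -> np.ndarray:
    """Zero out pixels below `low` (background noise) or above
    `high` (ephemeral lights such as fires and gas flares)."""
    cleaned = radiance.copy()
    cleaned[(radiance < low) | (radiance > high)] = 0.0
    return cleaned

# Toy radiance values: 0.1 falls below the noise floor and 900.0 is
# treated as an ephemeral light, so both are masked.
tile = np.array([0.1, 0.6, 5.0, 120.0, 900.0])
cleaned = mask_ntl(tile, low=0.5, high=500.0)
```

The same elementwise masking applies unchanged to full 2-D monthly composites, since the comparison broadcasts over the whole array.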
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Protective coatings based on two-dimensional (2D) materials such as graphene have gained traction for diverse applications. Their impermeability, inertness, excellent bonding with metals, and amenability to functionalization render them promising coatings against both abiotic and microbiologically influenced corrosion (MIC). Owing to the success of graphene coatings, the whole family of 2D materials, including hexagonal boron nitride and molybdenum disulphide, is being screened to identify other promising coatings. AI-based data-driven models can accelerate virtual screening of 2D coatings with desirable physical and chemical properties. However, the lack of large experimental datasets makes training classifiers difficult and often results in over-fitting. Generating large datasets for the MIC resistance of 2D coatings is both complex and laborious. Deep learning data augmentation methods can alleviate this issue by generating synthetic electrochemical data that resembles the training data classes. Here, we investigated two different deep generative models, namely the variational autoencoder (VAE) and the generative adversarial network (GAN), for generating synthetic data to expand small experimental datasets. Our model experimental system consisted of few-layer graphene over copper surfaces. The synthetic data generated using the GAN yielded higher neural network performance (83-85% accuracy) than the VAE-generated synthetic data (78-80% accuracy). However, the VAE data performed better (90% accuracy) than the GAN data (84-85% accuracy) when using XGBoost. Finally, we show that synthetic data based on VAE and GAN models can drive machine learning models for developing MIC-resistant 2D coatings.
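The augmentation idea can be illustrated with a deliberately simple stand-in generator. The sketch below fits a per-feature Gaussian to a small "experimental" dataset and samples synthetic rows from it; in the study this role is played by the VAE and GAN, so the Gaussian here is only a self-contained proxy showing how synthetic rows expand a small training set.

```python
import numpy as np

def augment(X: np.ndarray, n_new: int, rng: np.random.Generator) -> np.ndarray:
    """Sample synthetic rows from a per-feature Gaussian fit to X
    (a toy proxy for the VAE/GAN generators used in the study)."""
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-9
    return rng.normal(mu, sigma, size=(n_new, X.shape[1]))

rng = np.random.default_rng(42)
real = rng.normal([1.0, -0.5], 0.1, size=(20, 2))    # 20 real measurements
synthetic = augment(real, n_new=200, rng=rng)        # 200 synthetic rows
X_train = np.vstack([real, synthetic])               # augmented training set
```

A downstream classifier (XGBoost or a neural network, as in the study) would then be trained on `X_train` instead of the 20 original rows.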
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Electrocardiogram (ECG) datasets tend to be highly imbalanced due to the scarcity of abnormal cases. Additionally, the use of real patients' ECGs is highly regulated due to privacy concerns. Therefore, there is a constant need for more ECG data, especially for training automatic-diagnosis machine learning models, which perform better when trained on a balanced dataset. We studied the synthetic ECG generation capability of 5 different models from the generative adversarial network (GAN) family and compared their performance, focusing only on Normal cardiac cycles. Dynamic Time Warping (DTW), Fréchet, and Euclidean distance functions were employed to quantitatively measure performance. Five different methods for evaluating generated beats were proposed and applied. We also proposed 3 new concepts (threshold, accepted beat, and productivity rate) and employed them along with the aforementioned methods as a systematic way to compare models. The results show that all the tested models can, to an extent, successfully mass-generate acceptable heartbeats with high similarity in morphological features, and potentially all of them can be used to augment imbalanced datasets. However, visual inspection of generated beats favors BiLSTM-DC GAN and WGAN, as they produce statistically more acceptable beats. With regard to productivity rate, the Classic GAN is superior with a 72% productivity rate. We also designed a simple experiment with a state-of-the-art classifier (ECGResNet34) to show empirically that augmenting the imbalanced dataset with synthetic ECG signals can significantly improve classification performance.
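The proposed acceptance metric can be sketched directly: a generated beat is "accepted" when its distance to a reference beat falls below a threshold, and the productivity rate is the accepted fraction. The distance choice, threshold, and toy signals below are illustrative assumptions (plain Euclidean distance stands in for the DTW and Fréchet variants also used in the study).

```python
import numpy as np

def productivity_rate(generated: np.ndarray, reference: np.ndarray,
                      threshold: float) -> float:
    """Fraction of generated beats whose Euclidean distance to the
    reference beat falls below the acceptance threshold."""
    dists = np.linalg.norm(generated - reference, axis=1)
    return float(np.mean(dists < threshold))

t = np.linspace(0.0, 1.0, 100)
reference = np.sin(2 * np.pi * t)            # toy "Normal" cardiac cycle

# 8 low-noise beats (close to the reference) and 2 high-noise beats.
rng = np.random.default_rng(1)
noise_scale = np.array([0.05] * 8 + [0.8] * 2)[:, None]
generated = reference + rng.normal(0.0, 1.0, (10, 100)) * noise_scale

rate = productivity_rate(generated, reference, threshold=2.0)  # 8/10 accepted
```

Swapping `np.linalg.norm` for a DTW or Fréchet distance changes only the distance line; the threshold/acceptance logic stays the same.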
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The global importance of mesopelagic fish is increasingly recognised, but they remain relatively poorly studied. This is particularly true in the Southern Ocean, where mesopelagic fishes are both key predators and prey, but where the remote environment makes sampling challenging. Despite this, multiple national Antarctic research programs have undertaken regional sampling of mesopelagic fish over several decades. However, the data are dispersed, and sampling methodologies often differ, precluding comparisons and limiting synthetic analyses. We identified potential data holders by compiling a metadata catalogue of existing survey data for Southern Ocean mesopelagic fishes. Data holders contributed 17,491 occurrence and 11,190 abundance records from 4,780 net hauls across 72 different research cruises. Data span the years 1991 to 2019 and include trait-based information (length, weight, maturity). The final dataset underwent quality control, and detailed metadata were provided for each sampling event. The dataset can be accessed through the Antarctic Biodiversity Portal. Myctobase will enhance research capacity by providing the broadscale baseline data necessary for observing and modelling mesopelagic fishes. The paper to which this dataset refers is available at https://doi.org/10.1038/s41597-022-01496-y. The original dataset as submitted for the paper is also available on Zenodo at https://doi.org/10.5281/zenodo.6131579.
This dataset is published by SCAR-AntOBIS under the licence CC-BY 4.0. We would appreciate it if you could follow the guidelines of the SCAR and IPY Data Policies (https://www.scar.org/excom-meetings/xxxi-scar-delegates-2010-buenos-aires-argentina/4563-scar-xxxi-ip04b-scar-data-policy/file/) when using the data. If you have any questions regarding this dataset, please do not hesitate to contact us via the contact information provided in the metadata or via data-biodiversity-aq@naturalsciences.be. Issues with the dataset can be reported at https://github.com/biodiversity-aq/data-publication/. This dataset is part of the Belgian contribution to the EU-Lifewatch project funded by the Belgian Science Policy Office (BELSPO, contract n°FR/36/AN1/AntaBIS).