Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the data we used to evaluate the Louvain method in the study "Benchmarking Graph Databases on the Problem of Community Detection". The data were synthetically generated using the LFR-Benchmark (3rd link). There are two types of files, networkX.dat and communityX.dat. The networkX.dat file contains the list of edges (nodes are labelled from 1 to the number of nodes; the edges are ordered and each appears twice, i.e. source-target and target-source). The first four lines of the networkX.dat file list the parameters we used to generate the data. The communityX.dat file contains a list of the nodes and their community membership (memberships are labelled by integers >= 1). Note that X corresponds to the number of nodes each dataset contains.
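The file layout described above can be read with a short parser. The following is a minimal sketch, assuming whitespace-separated integer columns (the description does not state the delimiter); the function names and the tiny in-memory sample, including its placeholder parameter lines, are hypothetical.

```python
from collections import defaultdict


def parse_network(text, header_lines=4):
    """Split networkX.dat contents into parameter lines and undirected edges.

    Assumption: after the first `header_lines` lines, each line holds
    two whitespace-separated integer node labels.
    """
    lines = text.splitlines()
    params = lines[:header_lines]
    edges = set()
    for line in lines[header_lines:]:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        u, v = int(parts[0]), int(parts[1])
        # Every edge is listed in both directions; keep one canonical copy.
        edges.add((min(u, v), max(u, v)))
    return params, sorted(edges)


def parse_communities(text):
    """Map each node label to its community label (communityX.dat)."""
    membership = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            membership[int(parts[0])] = int(parts[1])
    return membership


# Hypothetical miniature files, for illustration only.
sample_network = "\n".join([
    "parameter line 1", "parameter line 2",
    "parameter line 3", "parameter line 4",
    "1 2", "2 1", "2 3", "3 2",
])
sample_community = "1 1\n2 1\n3 2\n"

params, edges = parse_network(sample_network)
membership = parse_communities(sample_community)

# Group nodes by community to compare against a detected partition.
groups = defaultdict(list)
for node, community in membership.items():
    groups[community].append(node)
```

Deduplicating the two directed copies of each edge keeps the edge count consistent with an undirected graph, which is what community detection methods such as Louvain usually expect.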
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
Overview
This dataset contains 1,000 synthetic financial transactions, mimicking real-world spending behaviors across various expense categories. It is ideal for machine learning, data analysis, and financial modeling tasks such as expense classification, anomaly detection, and trend analysis.
Dataset Features
Transaction_ID: Unique identifier for each transaction (e.g., TX0001).
Date: Transaction date (randomly generated within the past year).
Amount: Transaction value (ranging from $5 to $150, following a uniform distribution).
Description: Short description of the transaction.
Merchant: Business or service provider where the transaction occurred.
Category: High-level expense category (e.g., Food & Beverage, Bills, Healthcare).
Categories & Merchants
Food & Beverage: Starbucks, McDonald's, Subway, Dunkin
Bills: Local Utility, Internet Provider, Mobile Carrier
Entertainment: AMC Theatres, Netflix, Spotify
Transportation: Uber, Lyft, Local Transit
Groceries: Walmart, Target, Costco
Healthcare: CVS Pharmacy, Walgreens, Local Clinic
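The feature and category lists above are enough to sketch how a comparable table could be generated. This is an illustrative sketch, not the author's generator: the function name, the fixed reference date, the seed, and the wording of the Description field are all assumptions.

```python
import random
from datetime import date, timedelta

# Category-to-merchant mapping taken from the lists above.
MERCHANTS = {
    "Food & Beverage": ["Starbucks", "McDonald's", "Subway", "Dunkin"],
    "Bills": ["Local Utility", "Internet Provider", "Mobile Carrier"],
    "Entertainment": ["AMC Theatres", "Netflix", "Spotify"],
    "Transportation": ["Uber", "Lyft", "Local Transit"],
    "Groceries": ["Walmart", "Target", "Costco"],
    "Healthcare": ["CVS Pharmacy", "Walgreens", "Local Clinic"],
}


def generate_transactions(n=1000, seed=0, reference_date=date(2025, 1, 1)):
    """Generate n synthetic transactions with the documented columns.

    `reference_date` is a hypothetical stand-in for "today"; dates are
    drawn uniformly from the 365 days ending on it.
    """
    rng = random.Random(seed)
    categories = list(MERCHANTS)
    rows = []
    for i in range(1, n + 1):
        category = rng.choice(categories)
        merchant = rng.choice(MERCHANTS[category])
        day = reference_date - timedelta(days=rng.randint(0, 364))
        rows.append({
            "Transaction_ID": f"TX{i:04d}",
            "Date": day.isoformat(),
            # Uniform amount between $5 and $150, rounded to cents.
            "Amount": round(rng.uniform(5, 150), 2),
            # Assumed description format; the real one is not documented.
            "Description": f"Purchase at {merchant}",
            "Merchant": merchant,
            "Category": category,
        })
    return rows


transactions = generate_transactions()
```

Because amounts are uniform and independent of category, any model trained on such data can only learn the category from the merchant or description, not from the amount.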
Use Cases
✅ Financial Analysis: Understand spending patterns across different categories.
✅ Anomaly Detection: Identify potential fraud by analyzing transaction amounts.
✅ Time-Series Analysis: Study spending behavior trends over time.
✅ Classification & Clustering: Build models to categorize transactions automatically.
✅ Synthetic Data Research: Use it as a benchmark dataset for developing synthetic data generation techniques.
Limitations
This dataset is fully synthetic and does not reflect real financial data.
Spending patterns are generated using random sampling, without real-world statistical distributions.
Does not include user profiles, locations, or payment methods.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Diabetic retinopathy (DR) is a leading cause of blindness globally. It is diagnostically challenging owing to the intricate process of its development and the complexity of the human eye, which consists of nearly forty connected components such as the retina, iris, and optic nerve. This study proposes a novel approach to the identification of DR that combines synthetic data generation, a K-Means Clustering-Based Binary Grey Wolf Optimizer (KCBGWO), and Fully Convolutional Encoder-Decoder Networks (FCEDN). Generative Adversarial Networks (GANs) are used to generate high-quality synthetic data, and transfer learning is used for feature extraction and classification, integrated with Extreme Learning Machines (ELM). In an extensive evaluation on the IDRiD dataset, the proposed model achieves 99.87% accuracy, 99.33% sensitivity, and 99.78% specificity. These outcomes are promising for the further development of the proposed approach to DR diagnosis, for establishing a new reference point in medical image analysis, and for providing more effective and timely treatments.
https://creativecommons.org/publicdomain/zero/1.0/
This synthetic dataset simulates 300 global cities across 6 major geographic regions, designed specifically for unsupervised machine learning and clustering analysis. It explores how economic status, environmental quality, infrastructure, and digital access shape urban lifestyles worldwide.
| Feature | Description | Range |
|---|---|---|
| 10 Features | Economic, environmental & social indicators | Realistically scaled |
| 300 Cities | Europe, Asia, Americas, Africa, Oceania | Diverse distributions |
| Strong Correlations | Income ↔ Rent (+0.8), Density ↔ Pollution (+0.6) | ML-ready |
| No Missing Values | Clean, preprocessed data | Ready for analysis |
| 4-5 Natural Clusters | Metropolitan hubs, eco-towns, developing centers | Pre-validated |
✅ Realistic Correlations: Income strongly predicts rent (+0.8), internet access (+0.7), and happiness (+0.6)
✅ Regional Diversity: Each region has distinct economic and environmental characteristics
✅ Clustering-Ready: Naturally separable into 4-5 lifestyle archetypes
✅ Beginner-Friendly: No data cleaning required, includes example code
✅ Documented: Comprehensive README with methodology and use cases
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Load the data and keep only the numeric feature columns
df = pd.read_csv('city_lifestyle_dataset.csv')
X = df.drop(['city_name', 'country'], axis=1).select_dtypes('number')
X_scaled = StandardScaler().fit_transform(X)
# Cluster the standardized features into 5 groups
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)
# Summarize each cluster by the mean of its numeric features
print(df.groupby('cluster').mean(numeric_only=True))
After working with this dataset, you will be able to:
1. Apply K-Means, DBSCAN, and Hierarchical Clustering
2. Use PCA for dimensionality reduction and visualization
3. Interpret correlation matrices and feature relationships
4. Create geographic visualizations with cluster assignments
5. Profile and name discovered clusters based on characteristics
| Cluster | Characteristics | Example Cities |
|---|---|---|
| Metropolitan Tech Hubs | High income, density, rent | Silicon Valley, Singapore |
| Eco-Friendly Towns | Low density, clean air, high happiness | Nordic cities |
| Developing Centers | Mid income, high density, poor air | Emerging markets |
| Low-Income Suburban | Low infrastructure, income | Rural areas |
| Industrial Mega-Cities | Very high density, pollution | Manufacturing hubs |
Unlike random synthetic data, this dataset was carefully engineered with:
- ✨ Realistic correlation structures based on urban research
- 🌍 Regional characteristics matching real-world patterns
- 🎯 Optimal cluster separability (validated via silhouette scores)
- 📚 Comprehensive documentation and starter code
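Since cluster separability is said to be validated via silhouette scores, it may help to see what that score measures. The sketch below is a plain-Python version of the mean silhouette coefficient with Euclidean distance; the function name and the four toy points are illustrative assumptions, and in practice a library routine such as scikit-learn's `silhouette_score` would normally be used instead.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance).

    For each point, a = mean distance to the other members of its own
    cluster, b = smallest mean distance to any other cluster, and
    s = (b - a) / max(a, b). Points in singleton clusters score 0.
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # contributes 0 by convention
        a = sum(math.dist(points[i], points[j]) for j in own if j != i)
        a /= len(own) - 1
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        if denominator > 0:
            total += (b - a) / denominator
    return total / len(points)


# Toy example: two tight, well-separated pairs of points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good = mean_silhouette(toy_points, [0, 0, 1, 1])  # matches the geometry
bad = mean_silhouette(toy_points, [0, 1, 0, 1])   # mixes the two pairs
```

Scores close to 1 indicate compact, well-separated clusters, while scores at or below 0 suggest that points sit closer to a neighbouring cluster than to their own.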
✓ Learn clustering without data cleaning hassles
✓ Practice PCA and dimensionality reduction
✓ Create beautiful geographic visualizations
✓ Understand feature correlation in real-world contexts
✓ Build a portfolio project with clear business insights
This dataset was designed for educational purposes in machine learning and data science. While synthetic, it reflects real patterns observed in global urban development research.
Happy Clustering! 🎉
MIT License https://opensource.org/licenses/MIT
High-Fidelity Synthetic Medical Records for AI, ML Modeling, LLM Training & HealthTech Research
This is a synthetic dataset of healthcare records generated using Syncora.ai, a next-generation synthetic data generation platform designed for privacy-safe AI development.
It simulates patient demographics, medical conditions, treatments, billing, and admission data, preserving statistical realism while ensuring 0% privacy risk.
This free dataset is designed for:
Think of this as fake data that mimics real-world healthcare patterns — statistically accurate, but without any sensitive patient information.
The dataset captures patient-level hospital information, including:
All records are 100% synthetic, maintaining the statistical properties of real-world healthcare data while remaining safe to share and use for ML & LLM tasks.
Unlike most healthcare datasets, this one is tailored for LLM training:
Syncora.ai is a synthetic data generation platform designed for healthcare, finance, and enterprise AI.
Key benefits:
Take your AI projects to the next level with Syncora.ai:
→ Generate your own synthetic datasets now
This is a free dataset, 100% synthetic, and contains no real patient information.
It is safe for public use in education, research, open-source contributions, LLM training, and AI development.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Hyperparameters and performance metrics of BGWO with K-means clustering.
http://www.apache.org/licenses/LICENSE-2.0
Accompanying data and analyses of the article "Generating brain-wide connectome using synthetic axonal morphologies". The code to reproduce the figures is available at this repository.
Main contents:
Additional files:
*These software packages might not be open source at the time this data is published, but a public link will be provided as soon as they are.
This paper presents a method of generating Mercer Kernels from an ensemble of probabilistic mixture models, where each mixture model is generated from a Bayesian mixture density estimate. We show how to convert the ensemble estimates into a Mercer Kernel, describe the properties of this new kernel function, and give examples of the performance of this kernel on unsupervised clustering of synthetic data and also in the domain of unsupervised multispectral image understanding.
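The abstract does not give the kernel formula, so the following is only a sketch of one plausible construction, stated here as an assumption: the kernel value for two points is the inner product of their component posterior vectors, summed over the ensemble and divided by the number of models. Because this is an inner product of concatenated feature vectors, the resulting matrix is symmetric and positive semidefinite, which is the Mercer condition on a finite sample. The function name, the normalization, and the toy posteriors are hypothetical.

```python
def ensemble_posterior_kernel(posteriors):
    """Build a kernel matrix from an ensemble of mixture-model posteriors.

    posteriors[m][i][c] is assumed to be P_m(c | x_i): the posterior
    probability of component c for point i under mixture model m.
    Models may have different numbers of components.
    """
    n_models = len(posteriors)
    n_points = len(posteriors[0])
    kernel = [[0.0] * n_points for _ in range(n_points)]
    for model in posteriors:
        for i in range(n_points):
            for j in range(n_points):
                # Inner product of the two points' posterior vectors.
                kernel[i][j] += sum(p * q for p, q in zip(model[i], model[j]))
    # Assumed normalization: average over the models in the ensemble.
    return [[value / n_models for value in row] for row in kernel]


# Toy ensemble: two models, three points. Points 0 and 1 tend to share
# components, while point 2 falls in a different component.
toy_posteriors = [
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.0, 0.1, 0.9]],
]
K = ensemble_posterior_kernel(toy_posteriors)
```

Under this construction, points that the ensemble repeatedly assigns to the same components receive a kernel value near 1, and points that rarely share a component receive a value near 0.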
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Classification outcomes utilizing feature extraction from diverse pre-trained CNN models and various down sampling approaches.
MIT License https://opensource.org/licenses/MIT
This dataset contains synthetic data representing energy consumption patterns for 5,000 customers across different regions. The data includes both residential and commercial properties, with information about building characteristics, occupancy, and monthly energy costs.
Energy consumption analysis is crucial for understanding customer behavior, optimizing resource allocation, and developing targeted energy efficiency programs. This dataset provides a foundation for exploring relationships between building size, occupancy, geographic location, and energy costs.
The dataset includes the following features:
Potential Use Cases
This dataset can be used for:
This dataset was generated using Python with NumPy and Pandas. Below is the complete code used to create the data:
# Import packages
import pandas as pd
import numpy as np
np.random.seed(42)
# Number of customers
num_customer = 5000
# Customer ID
customer_id = ["CUSTOMER_" + str(i).zfill(4) for i in range(1, num_customer + 1)]
# Type of customer
customer_type = np.random.choice(a=['residential', 'commercial'], size=num_customer, replace=True, p=[0.65, 0.35])
# Regions
regions = ['North', 'Northeast', 'Midwest', 'Southeast', 'South']
regions = np.random.choice(a=regions, size=num_customer, replace=True, p=[0.15, 0.25, 0.4, 0.1, 0.1])
# Building size between 17 and 77 square meters
building_size_m2 = np.random.choice(
a=[17, 24, 45, 52, 77],
size=num_customer,
replace=True,
p=[0.15, 0.25, 0.4, 0.1, 0.1])
# Number of occupants, conditioned on building size
occupants = []
for size in building_size_m2:
    if size <= 44:
        # Smaller buildings: 1 to 3 occupants (high is exclusive)
        occupants.append(np.random.randint(low=1, high=4))
    else:
        # Larger buildings: 1 to 4 occupants
        occupants.append(np.random.randint(low=1, high=5))
# Energy cost in BRL (Brazilian Real)
energy_cost_brl = []
# Iterate through the number of occupants of each customer
for res in occupants:
    if 1 <= res <= 3:
        energy_cost_brl.append(np.random.uniform(52.5, 103.85))
    else:
        energy_cost_brl.append(np.random.uniform(103.86, 158.67))
energy_cost_brl = [round(x, 2) for x in energy_cost_brl]
# Dataframe
df = pd.DataFrame(
    {
        'customer_id': customer_id,
        'customer_type': customer_type,
        'regions': regions,
        'building_size_m2': building_size_m2,
        'occupants': occupants,
        'energy_cost_brl': energy_cost_brl
    }
)
This is a synthetic dataset created for analytical and educational purposes. The data distributions and relationships are designed to simulate realistic energy consumption patterns.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Optimization algorithms performance comparison across benchmark functions.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
DrCyZ: Techniques for analyzing and extracting useful information from CyZ.
Samples from NASA Perseverance and set of GAN generated synthetic images from Neural Mars.
Repository: https://github.com/decurtoidiaz/drcyz
Subset of samples from (includes tools to visualize and analyse the dataset):
CyZ: MARS Space Exploration Dataset. [https://doi.org/10.5281/zenodo.5655473]
Images from NASA missions of the celestial body.
Repository: https://github.com/decurtoidiaz/cyz
Authors:
J. de Curtò c@decurto.be
I. de Zarzà z@dezarza.be
• Subset of samples from Perseverance (drcyz/c).
∙ png (drcyz/c/png).
PNG files (5025) selected from NASA Perseverance (CyZ-1.1) after t-SNE and K-means Clustering.
∙ csv (drcyz/c/csv).
CSV file.
• Resized samples from Perseverance (drcyz/c+).
∙ png 64x64; 128x128; 256x256; 512x512; 1024x1024 (drcyz/c+/drcyz_64-1024).
PNG files resized at the corresponding size.
∙ TFRecords 64x64; 128x128; 256x256; 512x512; 1024x1024 (drcyz/c+/tfr_drcyz_64-1024).
TFRecord resized at the corresponding size to import on Tensorflow.
• Synthetic images from Neural Mars generated using Stylegan2-ada (drcyz/drcyz+).
∙ png 100; 1000; 10000 (drcyz/drcyz+/drcyz_256_100-10000)
PNG files subset of 100, 1000 and 10000 at size 256x256.
• Network Checkpoint from Stylegan2-ada trained at size 256x256 (drcyz/model_drcyz).
∙ network-snapshot-000798-drcyz.pkl
• Notebooks in python to analyse the original dataset and reproduce the experiments; K-means Clustering, t-SNE, PCA, synthetic generation using Stylegan2-ada and instance segmentation using Deeplab (https://github.com/decurtoidiaz/drcyz/tree/main/dr_cyz+).
∙ clustering_curiosity_de_curto_and_de_zarza.ipynb
K-means Clustering and PCA(2) with images from Curiosity.
∙ clustering_perseverance_de_curto_and_de_zarza.ipynb
K-means Clustering and PCA(2) with images from Perseverance.
∙ tsne_curiosity_de_curto_and_de_zarza.ipynb
t-SNE and PCA (components selected to explain 99% of variance) with images from Curiosity.
∙ tsne_perseverance_de_curto_and_de_zarza.ipynb
t-SNE and PCA (components selected to explain 99% of variance) with images from Perseverance.
∙ Stylegan2-ada_de_curto_and_de_zarza.ipynb
Stylegan2-ada trained on a subset of images from NASA Perseverance (DrCyZ).
∙ statistics_perseverance_de_curto_and_de_zarza.ipynb
Compute statistics from synthetic samples generated by Stylegan2-ada (DrCyZ) and images from NASA Perseverance (CyZ).
∙ DeepLab_TFLite_ADE20k_de_curto_and_de_zarza.ipynb
Example of instance segmentation using Deeplab with a sample from NASA Perseverance (DrCyZ).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Swiss-PDGs: Synthetic low- and medium-voltage grids for Switzerland
For details on the model used to generate this dataset, please refer to the article "Large-scale generation of geo-referenced power distribution grids from open data with load clustering" (2025), by A. Oneto, B. Gjorgiev, F. Tettamanti, and G. Sansavini, published in Sustainable Energy, Grids and Networks.
https://doi.org/10.1016/j.segan.2025.101678
https://www.gnu.org/licenses/gpl-3.0.html.en
The ProbINet package is designed to be a comprehensive and user-friendly toolset for researchers and practitioners interested in modeling network data through probabilistic generative approaches. Our goal is to provide a unified resource that brings together different advances scattered across many code repositories. By doing so, we aim not only to enhance the usability of existing models but also to facilitate the comparison of different approaches. Moreover, through a range of tutorials, we aim at simplifying the use of these methods to perform inferential tasks, including the prediction of missing network edges, node clustering (community detection), anomaly identification, and the generation of synthetic data from latent variables.
To realize the molecular design of new functional silver(I) clusters, a new synthetic approach has been proposed, by which the weakly coordinating ligands NO3– in a Ag20 thiolate cluster precursor can be substituted by carboxylic ligands while keeping its inner core intact. By rational design, novel atom-precise carboxylic or amino acid protected 20-core Ag(I)-thiolate clusters have been demonstrated for the first time. The fluorescence and electrochemical activity of the postmodified Ag20 clusters can be modulated by alrestatin or ferrocenecarboxylic acid substitution. More strikingly, when chiral amino acids were used as postmodified ligands, CD-activity was observed for the Ag20 clusters, unveiling an efficient way to obtain atom-precise chiral silver(I) clusters.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Mathematical representation of performance evaluation matrices.
All kingdoms of life use the transient 5′-deoxyadenosyl radical (5′-dAdo•) to initiate a wide range of difficult chemical reactions. Because of its high reactivity, the 5′-dAdo• must be generated in a controlled manner to abstract a specific H atom and avoid unproductive reactions. In radical S-adenosylmethionine (SAM) enzymes, the 5′-dAdo• is formed upon reduction of SAM by an [Fe4S4] cluster. An organometallic precursor featuring an Fe–C bond between the [Fe4S4] cluster and the 5′-dAdo group was recently characterized and shown to be competent for substrate radical generation, presumably via Fe–C bond homolysis. Such reactivity is without precedent for Fe–S clusters. Here, we show that synthetic [Fe4S4]–alkyl clusters undergo Fe–C bond homolysis when the alkylated Fe site has a suitable coordination number, thereby providing support for the intermediacy of organometallic species in radical SAM enzymes. Addition of pyridine donors to [(IMes)3Fe4S4–R]+ clusters (R = alkyl or benzyl; IMes = 1,3-dimesitylimidazol-2-ylidene) generates R•, ultimately forming R–R coupled hydrocarbons. This process is facile at room temperature and allows for the generation of highly reactive radicals including primary carbon radicals. Mechanistic studies, including use of the 5-hexenyl radical clock, demonstrate that Fe–C bond homolysis occurs reversibly. Using these experimental insights and kinetic simulations, we evaluate the circumstances in which an organometallic intermediate can direct the 5′-dAdo• toward productive H-atom abstraction. Our findings demonstrate that reversible homolysis of even weak M–C bonds is a feasible protective mechanism for the 5′-dAdo• that can allow selective X–H bond activation in both radical SAM and adenosylcobalamin enzymes.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
The PowerPulse dataset is a synthetically generated collection of 100,000 records designed to simulate real-world energy usage patterns. It provides insights into household energy consumption, solar energy generation, and environmental impact across various regions and weather conditions. With detailed fields like EnergyConsumed_kWh, SolarEnergyGenerated_kWh, WeatherCondition, and CO2Emissions_kg, this dataset is ideal for exploring energy trends, building predictive models, and analyzing sustainability initiatives.
Key Features:
Comprehensive Coverage: Includes attributes like energy consumption, solar generation, CO2 emissions, and appliance usage.
Scalable Insights: Designed to handle large-scale data processing with tools like PySpark.
Real-World Relevance: Captures modern energy challenges such as renewable energy optimization and carbon footprint analysis.
Flexible Use Cases: Suitable for regression, classification, clustering, and exploratory data analysis.
This dataset simulates Jet2 airline passenger bookings and is designed for segmentation, clustering, and behavioral analysis.
The Jet2 Synthetic Booking dataset provides a realistic simulation of passenger booking behavior for Jet2, a UK-based leisure airline. It is ideal for data science projects involving customer segmentation, predictive modeling, and operational insights.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
Generative image models have revolutionized artificial intelligence by enabling the synthesis of high-quality, realistic images. These models utilize deep learning techniques to learn complex data distributions and generate novel images that closely resemble the training dataset. Recent advancements, particularly in diffusion models, have led to remarkable improvements in image fidelity, diversity, and controllability. In this work, we investigate the application of a conditional latent diffusion model in the healthcare domain. Specifically, we trained a latent diffusion model using unlabeled histopathology images. Initially, these images were embedded into a lower-dimensional latent space using a Vector Quantized Generative Adversarial Network (VQ-GAN). Subsequently, a diffusion process was applied within this latent space, and clustering was performed on the resulting latent features. The clustering results were then used as a conditioning mechanism for the diffusion model, enabling conditional image generation. Finally, we determined the optimal number of clusters using cluster validation metrics and assessed the quality of the synthetic images through quantitative methods. To enhance the interpretability of the synthetic image generation process, expert input was incorporated into the cluster assignments.