26 datasets found
  1. Synthetic Data for graphdb-benchmark

    • figshare.com
    txt
    Updated Jun 3, 2023
    Cite
    Sotiris Beis; Symeon Papadopoulos; Yannis Kompatsiaris (2023). Synthetic Data for graphdb-benchmark [Dataset]. http://doi.org/10.6084/m9.figshare.1221760.v1
    Available download formats: txt
    Dataset updated: Jun 3, 2023
    Dataset provided by: Figshare (http://figshare.com/)
    Authors: Sotiris Beis; Symeon Papadopoulos; Yannis Kompatsiaris
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data we used to evaluate the Louvain method in the study Benchmarking Graph Databases on the Problem of Community Detection. These data were synthetically generated using the LFR-Benchmark (3rd link). There are two types of files, networkX.dat and communityX.dat. The networkX.dat file contains the list of edges (nodes are labelled from 1 to the number of nodes; the edges are ordered, and each edge appears twice, i.e. as source-target and as target-source). The first four lines of the networkX.dat file list the parameters we used to generate the data. The communityX.dat file contains a list of the nodes and their community membership (communities are labelled by integers >= 1). Note that X corresponds to the number of nodes each dataset contains.
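
    The file layout described above can be read with a few lines of Python. This is a minimal sketch, not the authors' loader: it assumes whitespace-separated integer columns, and the function names and the sample lines are made up for illustration. It skips the four parameter lines and collapses each source-target / target-source pair into one undirected edge.

```python
def parse_network(lines, header_lines=4):
    """Return undirected edges as (low, high) node pairs.

    Assumption: the first `header_lines` lines hold generation parameters
    and every remaining non-empty line is "source target".
    """
    edges = set()
    for index, line in enumerate(lines):
        if index < header_lines or not line.strip():
            continue
        source, target = (int(token) for token in line.split()[:2])
        # Each edge is listed in both directions; store it once.
        edges.add((min(source, target), max(source, target)))
    return edges


def parse_communities(lines):
    """Return a {node: community} mapping from "node community" lines."""
    membership = {}
    for line in lines:
        if not line.strip():
            continue
        node, community = (int(token) for token in line.split()[:2])
        membership[node] = community
    return membership


# Hypothetical sample content, not taken from the real files.
sample_network = [
    "# parameter 1", "# parameter 2", "# parameter 3", "# parameter 4",
    "1 2", "2 1", "2 3", "3 2",
]
sample_communities = ["1 1", "2 1", "3 2"]

edges = parse_network(sample_network)
membership = parse_communities(sample_communities)
```

    For a real file, pass an open file object instead of the sample lists, for example `parse_network(open(path))`.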

  2. Realistic Synthetic Spending Data

    • kaggle.com
    zip
    Updated Mar 29, 2025
    Cite
    Atishay Jain (2025). Realistic Synthetic Spending Data [Dataset]. https://www.kaggle.com/datasets/atishayjain07/realistic-synthetic-spending-data
    Available download formats: zip (14288 bytes)
    Dataset updated: Mar 29, 2025
    Authors: Atishay Jain
    License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Overview

    This dataset contains 1,000 synthetic financial transactions, mimicking real-world spending behaviors across various expense categories. It is ideal for machine learning, data analysis, and financial modeling tasks such as expense classification, anomaly detection, and trend analysis.

    Dataset Features

    • Transaction_ID: Unique identifier for each transaction (e.g., TX0001).
    • Date: Transaction date (randomly generated within the past year).
    • Amount: Transaction value (ranging from $5 to $150, following a uniform distribution).
    • Description: Short description of the transaction.
    • Merchant: Business or service provider where the transaction occurred.
    • Category: High-level expense category (e.g., Food & Beverage, Bills, Healthcare).

    Categories & Merchants

    • Food & Beverage: Starbucks, McDonald's, Subway, Dunkin
    • Bills: Local Utility, Internet Provider, Mobile Carrier
    • Entertainment: AMC Theatres, Netflix, Spotify
    • Transportation: Uber, Lyft, Local Transit
    • Groceries: Walmart, Target, Costco
    • Healthcare: CVS Pharmacy, Walgreens, Local Clinic

    Use Cases

    ✅ Financial Analysis: Understand spending patterns across different categories.
    ✅ Anomaly Detection: Identify potential fraud by analyzing transaction amounts.
    ✅ Time-Series Analysis: Study spending behavior trends over time.
    ✅ Classification & Clustering: Build models to categorize transactions automatically.
    ✅ Synthetic Data Research: Use it as a benchmark dataset for developing synthetic data generation techniques.

    Limitations

    • This dataset is fully synthetic and does not reflect real financial data.
    • Spending patterns are generated using random sampling, without real-world statistical distributions.
    • Does not include user profiles, locations, or payment methods.
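
    The generation scheme described for this dataset (uniform amounts between $5 and $150, dates within the past year, a merchant drawn from the chosen category) can be sketched with the Python standard library. This is not the author's generator: the function name, the seed, the reference date, and the uniform choice of category are assumptions, and the Description column is left out because the listing does not say how it is produced.

```python
import random
from datetime import date, timedelta

# Category-to-merchant mapping copied from the dataset description.
MERCHANTS = {
    "Food & Beverage": ["Starbucks", "McDonald's", "Subway", "Dunkin"],
    "Bills": ["Local Utility", "Internet Provider", "Mobile Carrier"],
    "Entertainment": ["AMC Theatres", "Netflix", "Spotify"],
    "Transportation": ["Uber", "Lyft", "Local Transit"],
    "Groceries": ["Walmart", "Target", "Costco"],
    "Healthcare": ["CVS Pharmacy", "Walgreens", "Local Clinic"],
}


def make_transactions(count, today, seed=0):
    """Build `count` synthetic transactions as a list of dicts."""
    rng = random.Random(seed)  # local generator keeps runs reproducible
    categories = sorted(MERCHANTS)
    rows = []
    for number in range(1, count + 1):
        category = rng.choice(categories)  # assumption: categories equally likely
        rows.append({
            "Transaction_ID": f"TX{number:04d}",
            "Date": today - timedelta(days=rng.randint(0, 364)),
            "Amount": round(rng.uniform(5, 150), 2),
            "Merchant": rng.choice(MERCHANTS[category]),
            "Category": category,
        })
    return rows


transactions = make_transactions(1000, date(2025, 3, 29))
```

    Because every amount comes from the same uniform range regardless of category, anomaly detection on this kind of data has no built-in outliers to find unless they are injected separately.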

  3. IDRiD-based state-of-the-art comparison.

    • plos.figshare.com
    xls
    Updated Dec 5, 2024
    Cite
    Sundreen Asad Kamal; Youtian Du; Majdi Khalid; Majed Farrash; Sahraoui Dhelim (2024). IDRiD-based state-of-the-art comparison. [Dataset]. http://doi.org/10.1371/journal.pone.0312016.t005
    Available download formats: xls
    Dataset updated: Dec 5, 2024
    Dataset provided by: PLOS (http://plos.org/)
    Authors: Sundreen Asad Kamal; Youtian Du; Majdi Khalid; Majed Farrash; Sahraoui Dhelim
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Diabetic retinopathy (DR) is a leading cause of blindness globally. It is diagnostically challenging owing to the intricate process of its development and the complexity of the human eye, which consists of nearly forty connected components such as the retina, iris, and optic nerve. This study proposes a novel approach to identifying DR that employs synthetic data generation, a K-Means Clustering-Based Binary Grey Wolf Optimizer (KCBGWO), and Fully Convolutional Encoder-Decoder Networks (FCEDN). Generative Adversarial Networks (GANs) generate high-quality synthetic data, and transfer learning provides accurate feature extraction and classification, integrated with Extreme Learning Machines (ELM). Evaluated on the IDRiD dataset, the proposed model achieves 99.87% accuracy, 99.33% sensitivity, and 99.78% specificity. These results are promising for the further development of the proposed approach to DR diagnosis, for establishing a new reference point in medical image analysis, and for providing more effective and timely treatment.

  4. 🌆 City Lifestyle Segmentation Dataset

    • kaggle.com
    zip
    Updated Nov 15, 2025
    Cite
    UmutUygurr (2025). 🌆 City Lifestyle Segmentation Dataset [Dataset]. https://www.kaggle.com/datasets/umuttuygurr/city-lifestyle-segmentation-dataset
    Available download formats: zip (11274 bytes)
    Dataset updated: Nov 15, 2025
    Authors: UmutUygurr
    License: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🌆 About This Dataset

    This synthetic dataset simulates 300 global cities across 6 major geographic regions, designed specifically for unsupervised machine learning and clustering analysis. It explores how economic status, environmental quality, infrastructure, and digital access shape urban lifestyles worldwide.

    🎯 Perfect For:

    • 📊 K-Means, DBSCAN, Agglomerative Clustering
    • 🔬 PCA & t-SNE Dimensionality Reduction
    • 🗺️ Geospatial Visualization (Plotly, Folium)
    • 📈 Correlation Analysis & Feature Engineering
    • 🎓 Educational Projects (Beginner to Intermediate)

    📦 What's Inside?

    Feature              | Description                                       | Range
    10 Features          | Economic, environmental & social indicators       | Realistically scaled
    300 Cities           | Europe, Asia, Americas, Africa, Oceania           | Diverse distributions
    Strong Correlations  | Income ↔ Rent (+0.8), Density ↔ Pollution (+0.6)  | ML-ready
    No Missing Values    | Clean, preprocessed data                          | Ready for analysis
    4-5 Natural Clusters | Metropolitan hubs, eco-towns, developing centers  | Pre-validated

    🔥 Key Features

    Realistic Correlations: Income strongly predicts rent (+0.8), internet access (+0.7), and happiness (+0.6)
    Regional Diversity: Each region has distinct economic and environmental characteristics
    Clustering-Ready: Naturally separable into 4-5 lifestyle archetypes
    Beginner-Friendly: No data cleaning required, includes example code
    Documented: Comprehensive README with methodology and use cases

    🚀 Quick Start Example

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler
    
    # Load the data and drop the two categorical columns
    df = pd.read_csv('city_lifestyle_dataset.csv')
    X = df.drop(['city_name', 'country'], axis=1)
    X_scaled = StandardScaler().fit_transform(X)
    
    # Cluster (n_init is set explicitly so results match across scikit-learn versions)
    kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
    df['cluster'] = kmeans.fit_predict(X_scaled)
    
    # Analyze: per-cluster mean of the numeric columns only
    print(df.groupby('cluster').mean(numeric_only=True))
    

    🎓 Learning Outcomes

    After working with this dataset, you will be able to:

    1. Apply K-Means, DBSCAN, and Hierarchical Clustering
    2. Use PCA for dimensionality reduction and visualization
    3. Interpret correlation matrices and feature relationships
    4. Create geographic visualizations with cluster assignments
    5. Profile and name discovered clusters based on characteristics

    📚 Ideal For These Projects

    • 🏆 Kaggle Competitions: Practice clustering techniques
    • 📝 Academic Projects: Urban planning, sociology, environmental science
    • 💼 Portfolio Work: Showcase ML skills to employers
    • 🎓 Learning: Hands-on practice with unsupervised learning
    • 🔬 Research: Urban lifestyle segmentation studies

    🌍 Expected Clusters

    Cluster                | Characteristics                        | Example Cities
    Metropolitan Tech Hubs | High income, density, rent             | Silicon Valley, Singapore
    Eco-Friendly Towns     | Low density, clean air, high happiness | Nordic cities
    Developing Centers     | Mid income, high density, poor air     | Emerging markets
    Low-Income Suburban    | Low infrastructure, income             | Rural areas
    Industrial Mega-Cities | Very high density, pollution           | Manufacturing hubs

    🛠️ Technical Details

    • Format: CSV (UTF-8)
    • Size: ~300 rows × 10 columns
    • Missing Values: 0%
    • Data Types: 2 categorical, 8 numerical
    • Target Variable: None (unsupervised)
    • Correlation Strength: Pre-validated (r: 0.4 to 0.8)
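
    The quoted correlation strengths can be checked directly once the CSV is loaded. The sketch below assumes the r values are Pearson correlation coefficients and uses made-up numbers in place of the real income and rent columns, whose names are not given in this listing.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical values standing in for two numeric columns of the dataset.
income = [1.0, 2.0, 3.0, 4.0]
rent = [2.0, 4.0, 6.0, 8.0]
r_income_rent = pearson_r(income, rent)
```

    A perfectly linear pair such as the one above gives r = 1; the dataset's advertised Income ↔ Rent value of +0.8 would indicate a strong but noisy relationship.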

    📖 What Makes This Dataset Special?

    Unlike random synthetic data, this dataset was carefully engineered with:

    • ✨ Realistic correlation structures based on urban research
    • 🌍 Regional characteristics matching real-world patterns
    • 🎯 Optimal cluster separability (validated via silhouette scores)
    • 📚 Comprehensive documentation and starter code

    🏅 Use This Dataset If You Want To:

    ✓ Learn clustering without data cleaning hassles
    ✓ Practice PCA and dimensionality reduction
    ✓ Create beautiful geographic visualizations
    ✓ Understand feature correlation in real-world contexts
    ✓ Build a portfolio project with clear business insights

    📊 Acknowledgments

    This dataset was designed for educational purposes in machine learning and data science. While synthetic, it reflects real patterns observed in global urban development research.

    Happy Clustering! 🎉

  5. synthetic-medical-records-dataset

    • kaggle.com
    zip
    Updated Sep 11, 2025
    Cite
    Syncora_ai (2025). synthetic-medical-records-dataset [Dataset]. https://www.kaggle.com/datasets/syncoraai/synthetic-medical-records-dataset
    Available download formats: zip (1582643 bytes)
    Dataset updated: Sep 11, 2025
    Authors: Syncora_ai
    License: MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Synthetic Healthcare Dataset — Powered by Syncora

    High-Fidelity Synthetic Medical Records for AI, ML Modeling, LLM Training & HealthTech Research

    About This Dataset

    This is a synthetic dataset of healthcare records generated using Syncora.ai, a next-generation synthetic data generation platform designed for privacy-safe AI development.

    It simulates patient demographics, medical conditions, treatments, billing, and admission data, preserving statistical realism while ensuring 0% privacy risk.

    This free dataset is designed for:

    • Healthcare AI research
    • Predictive analytics (disease risk, treatment outcomes)
    • LLM training on structured tabular healthcare data
    • Medical data science education & experimentation

    Think of this as fake data that mimics real-world healthcare patterns — statistically accurate, but without any sensitive patient information.

    Dataset Context & Features

    The dataset captures patient-level hospital information, including:

    • Demographics: Age, Gender, Blood Type
    • Medical Details: Diagnosed medical condition, prescribed medication, test results
    • Hospital Records: Admission type (emergency, planned, outpatient), billing amount
    • Target Applications: Predictive modeling, anomaly detection, cost optimization

    All records are 100% synthetic, maintaining the statistical properties of real-world healthcare data while remaining safe to share and use for ML & LLM tasks.

    LLM Training & Generative AI Applications 🧠

    Unlike most healthcare datasets, this one is tailored for LLM training:

    • Fine-tune LLMs on tabular + medical data for reasoning tasks
    • Create medical report generators from structured fields (e.g., convert demographics + condition + test results into natural language summaries)
    • Use as fake data for prompt engineering, synthetic QA pairs, or generative simulations
    • Safely train LLMs to understand healthcare schemas without exposing private patient data
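
    The report-generation idea in the list above (structured fields in, a natural-language summary out) can be illustrated with a simple template. This is a sketch only: the field names and the example record are hypothetical and are not taken from the dataset's actual schema.

```python
def summarize_record(record):
    """Render one structured patient record as a one-sentence summary.

    The keys used here (age, gender, blood_type, condition, medication,
    test_results) are assumed names, not the dataset's real column names.
    """
    return (
        f"{record['age']}-year-old {record['gender'].lower()} patient "
        f"(blood type {record['blood_type']}) diagnosed with {record['condition']}, "
        f"prescribed {record['medication']}; "
        f"test results: {record['test_results'].lower()}."
    )


# Made-up record for illustration only.
example_record = {
    "age": 54,
    "gender": "Female",
    "blood_type": "O+",
    "condition": "hypertension",
    "medication": "lisinopril",
    "test_results": "Normal",
}
summary = summarize_record(example_record)
```

    Pairs of (record, summary) built this way could serve as synthetic training or prompt-engineering examples, with the template swapped for a model-generated report where more variety is needed.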

    Machine Learning & AI Use Cases

    • Predictive Modeling: Forecast patient outcomes or readmission likelihood
    • Classification: Disease diagnosis prediction using demographic and medical variables
    • Clustering: Patient segmentation by condition, treatment, or billing pattern
    • Healthcare Cost Prediction: Estimate and optimize billing amounts
    • Bias & Fairness Testing: Study algorithmic bias without exposing sensitive patient data

    Why Syncora?

    Syncora.ai is a synthetic data generation platform designed for healthcare, finance, and enterprise AI.

    Key benefits:

    • Privacy-first: 100% synthetic, zero risk of re-identification
    • Statistical accuracy: Feature relationships preserved for ML & LLM training
    • Regulatory compliance: HIPAA, GDPR, DPDP safe
    • Scalability: Generate millions of synthetic patient records with agentic AI

    Ideas for Exploration

    • Which medical conditions correlate with higher billing amounts?
    • Can test results predict hospitalization type?
    • How do demographics influence treatment or billing trends?
    • Can synthetic datasets reduce bias in healthcare AI & LLMs?

    🔗 Generate Your Own Synthetic Data

    Take your AI projects to the next level with Syncora.ai:
    → Generate your own synthetic datasets now

    Licensing & Compliance

    This is a free dataset, 100% synthetic, and contains no real patient information.
    It is safe for public use in education, research, open-source contributions, LLM training, and AI development.

  6. Hyperparameters and performance metrics of BGWO with K-means clustering.

    • plos.figshare.com
    xls
    Updated Dec 5, 2024
    Cite
    Sundreen Asad Kamal; Youtian Du; Majdi Khalid; Majed Farrash; Sahraoui Dhelim (2024). Hyperparameters and performance metrics of BGWO with K-means clustering. [Dataset]. http://doi.org/10.1371/journal.pone.0312016.t002
    Available download formats: xls
    Dataset updated: Dec 5, 2024
    Dataset provided by: PLOS (http://plos.org/)
    Authors: Sundreen Asad Kamal; Youtian Du; Majdi Khalid; Majed Farrash; Sahraoui Dhelim
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Hyperparameters and performance metrics of BGWO with K-means clustering.

  7. Long-range axonal projections analyses of the mouse brain

    • zenodo.org
    zip
    Updated Oct 15, 2024
    Cite
    Remy Petkantchin (2024). Long-range axonal projections analyses of the mouse brain [Dataset]. http://doi.org/10.5281/zenodo.13790069
    Available download formats: zip
    Dataset updated: Oct 15, 2024
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Remy Petkantchin
    License: http://www.apache.org/licenses/LICENSE-2.0

    Description

    Accompanying data and analyses of the article "Generating brain-wide connectome using synthetic axonal morphologies". The code to reproduce the figures is available at this repository.

    Main contents:

    • SEU_morphs.zip, peng_2021_morphs.zip, ML_morphs.zip: input morphologies from various sources (novel morphologies collected by H. Peng, Southeast University, Peng et al. 2021, Winnubst et al. 2019) after morphology-workflows Repair post-processing steps.
    • atlas : atlas of the mouse brain used (enhanced version of Allen Brain CCFv3)
    • out_a_p : output of the axonal projection* analysis and clustering on the 3601 morphologies of the dataset.
      • axon_lengths_12.csv, axon_terminals_12.csv : the lengths of axons in each subregion where they terminate, and terminals, at hierarchy level 12 from the brain hierarchy.
      • clustering_output.csv : output of the GMM clustering.
      • config_a_p.cfg : configuration file used to produce this analysis by running the axonal projection code.
    • circuit : contains files describing the circuit synthesized with Blue Brain's circuit-build*.
      • bioname : parameters used to synthesize the circuit (regions to synthesize, cell densities per region, location of axons to graft...).
      • sonata : files that contain nodes and edges of the circuit.
      • auxiliary : various cell collections from the synthesized cells, filtered by region. Cell collections are to be read with Voxcell.
      • conn_mat.h5: the connectivity matrices obtained for the case where MOp5 long-range axons were synthesized, computed with ConnectomeUtilities. Contains connectivity matrices for the synthesized LRAs, biological LRAs, and grafted local axons.
    • local_axons : biological local axons that are grafted to the synthesized dendrites that do not have synthesized long-range axons.
    • synthesized_MOp5_LRAs : 1695 synthesized cells with long-range axons of the MOp5 region, in the atlas reference frame.
    • out_a_p_synth_MOp5 : axonal projection analysis of the synthesized MOp5 axons.
    • synthesized_isocortex_cells : all synthesized cells of the isocortex region, except MOp5 cells. They are in h5 format, which takes less space than asc and swc. The h5 format can be read and converted for instance with MorphIO.
    • synthesized_isocortex_LRAs : 21680 synthesized cells with long-range axons of the isocortex regions for which a GMM cluster was created.

    Additional files:

    • flatmap_both.nrrd : file used to generate a flat map visualization of the mouse isocortex, shown in the article.
    • config_a_s.cfg : configuration file used to synthesize the long-range axons with the axon-synthesis* code.
    • target_pts : tufts common ancestors for the synthesized MOp5 axons.

    *These software packages might not be open-source at the time of publication of this data, but a public link will be provided as soon as they are.

  8. Mixture Density Mercer Kernels: A Method to Learn Kernels - Dataset - NASA...

    • data.nasa.gov
    Updated Mar 31, 2025
    + more versions
    Cite
    nasa.gov (2025). Mixture Density Mercer Kernels: A Method to Learn Kernels - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/mixture-density-mercer-kernels-a-method-to-learn-kernels
    Dataset updated: Mar 31, 2025
    Dataset provided by: NASA (http://nasa.gov/)
    Description

    This paper presents a method of generating Mercer Kernels from an ensemble of probabilistic mixture models, where each mixture model is generated from a Bayesian mixture density estimate. We show how to convert the ensemble estimates into a Mercer Kernel, describe the properties of this new kernel function, and give examples of the performance of this kernel on unsupervised clustering of synthetic data and also in the domain of unsupervised multispectral image understanding.
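
    One way a kernel can be built from an ensemble of mixture models is to average, over the models, the inner product of the two points' posterior component-membership vectors. The sketch below illustrates that idea under the assumption that each model supplies such a posterior vector for every point; the function name, the normalization, and the numbers are illustrative and may differ from the paper's exact definition.

```python
def mixture_density_kernel(posteriors_x, posteriors_y):
    """Average over ensemble members of the dot product of posterior vectors.

    posteriors_x[m][c] is the probability that point x belongs to component c
    under ensemble member m (likewise for y). Each term is an inner product,
    so the resulting Gram matrix is symmetric and positive semidefinite.
    """
    if not posteriors_x or len(posteriors_x) != len(posteriors_y):
        raise ValueError("both points need posteriors from the same non-empty ensemble")
    total = 0.0
    for member_x, member_y in zip(posteriors_x, posteriors_y):
        total += sum(p * q for p, q in zip(member_x, member_y))
    return total / len(posteriors_x)


# Made-up posteriors for two points under a two-member ensemble.
x_posteriors = [[0.9, 0.1], [0.8, 0.2]]
y_posteriors = [[0.1, 0.9], [0.3, 0.7]]

k_xx = mixture_density_kernel(x_posteriors, x_posteriors)
k_xy = mixture_density_kernel(x_posteriors, y_posteriors)
k_yx = mixture_density_kernel(y_posteriors, x_posteriors)
```

    Points that the ensemble consistently assigns to the same components get a large kernel value, which is what makes this kind of kernel usable for unsupervised clustering.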

  9. Classification outcomes utilizing feature extraction from diverse...

    • plos.figshare.com
    xls
    Updated Dec 5, 2024
    Cite
    Sundreen Asad Kamal; Youtian Du; Majdi Khalid; Majed Farrash; Sahraoui Dhelim (2024). Classification outcomes utilizing feature extraction from diverse pre-trained CNN models and various down sampling approaches. [Dataset]. http://doi.org/10.1371/journal.pone.0312016.t004
    Available download formats: xls
    Dataset updated: Dec 5, 2024
    Dataset provided by: PLOS (http://plos.org/)
    Authors: Sundreen Asad Kamal; Youtian Du; Majdi Khalid; Majed Farrash; Sahraoui Dhelim
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification outcomes utilizing feature extraction from diverse pre-trained CNN models and various down sampling approaches.

  10. Residential and Commercial Energy Cost Dataset

    • kaggle.com
    zip
    Updated Oct 11, 2025
    Cite
    Andrey Silva (2025). Residential and Commercial Energy Cost Dataset [Dataset]. https://www.kaggle.com/datasets/andreylss/residential-and-commercial-energy-cost-dataset
    Available download formats: zip (39189 bytes)
    Dataset updated: Oct 11, 2025
    Authors: Andrey Silva
    License: MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains synthetic data representing energy consumption patterns for 5,000 customers across different regions. The data includes both residential and commercial properties, with information about building characteristics, occupancy, and monthly energy costs.

    Context

    Energy consumption analysis is crucial for understanding customer behavior, optimizing resource allocation, and developing targeted energy efficiency programs. This dataset provides a foundation for exploring relationships between building size, occupancy, geographic location, and energy costs.

    Content

    The dataset includes the following features:

    • customer_id: Unique identifier for each customer (CUSTOMER_0001 to CUSTOMER_5000)
    • customer_type: Type of property (residential or commercial)
    • regions: Geographic region (North, Northeast, Midwest, Southeast, South)
    • building_size_m2: Building size in square meters (17, 24, 45, 52, or 77 m²)
    • num_residents: Number of occupants in the property (1-4)
    • energy_cost_brl: Monthly energy cost in local currency

    Potential Use Cases

    This dataset can be used for:

    • Exploratory Data Analysis (EDA)
    • Predictive modeling of energy costs
    • Customer segmentation and clustering analysis
    • Machine learning practice and educational purposes

    Data Generation Code

    This dataset was generated using Python with NumPy and Pandas. Below is the complete code used to create the data:

    # Import packages
    import pandas as pd
    import numpy as np
    np.random.seed(42)
    
    # Number of customers
    num_customer = 5000
    
    # Customer ID
    customer_id = ["CUSTOMER_" + str(i).zfill(4) for i in range(1, num_customer + 1)]
    
    # Type of customer
    customer_type = np.random.choice(a=['residential', 'commercial'], size=num_customer, replace=True, p=[0.65, 0.35])
    
    # Regions
    regions = ['North', 'Northeast', 'Midwest', 'Southeast', 'South']
    regions = np.random.choice(a=regions, size=num_customer, replace=True, p=[0.15, 0.25, 0.4, 0.1, 0.1])
    
    # Building size between 17 and 77 square meters
    building_size_m2 = np.random.choice(
      a=[17, 24, 45, 52, 77],
      size=num_customer,
      replace=True,
      p=[0.15, 0.25, 0.4, 0.1, 0.1])
    
    # Number of residents, conditioned on building size:
    # buildings up to 44 m² get 1-3 residents, larger ones get 1-4
    num_residents = []
    for size in building_size_m2:
      if size <= 44:
        num_residents.append(np.random.randint(low=1, high=4))
      else:
        num_residents.append(np.random.randint(low=1, high=5))
    
    # Energy cost in BRL (Brazilian Real)
    energy_cost_brl = []
    
    # Cost range depends on the number of residents of each customer
    for res in num_residents:
      if 1 <= res <= 3:
        energy_cost_brl.append(np.random.uniform(52.5, 103.85))
      else:
        energy_cost_brl.append(np.random.uniform(103.86, 158.67))
    energy_cost_brl = [round(x, 2) for x in energy_cost_brl]
    
    # Dataframe (column names match the feature list above)
    df = pd.DataFrame(
      {
        'customer_id': customer_id,
        'customer_type': customer_type,
        'regions': regions,
        'building_size_m2': building_size_m2,
        'num_residents': num_residents,
        'energy_cost_brl': energy_cost_brl
      }
    )
    
    

    Acknowledgments

    This is a synthetic dataset created for analytical and educational purposes. The data distributions and relationships are designed to simulate realistic energy consumption patterns.

  11. Optimization algorithms performance comparison across benchmark functions.

    • plos.figshare.com
    xls
    Updated Dec 5, 2024
    Cite
    Sundreen Asad Kamal; Youtian Du; Majdi Khalid; Majed Farrash; Sahraoui Dhelim (2024). Optimization algorithms performance comparison across benchmark functions. [Dataset]. http://doi.org/10.1371/journal.pone.0312016.t003
    Available download formats: xls
    Dataset updated: Dec 5, 2024
    Dataset provided by: PLOS (http://plos.org/)
    Authors: Sundreen Asad Kamal; Youtian Du; Majdi Khalid; Majed Farrash; Sahraoui Dhelim
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Optimization algorithms performance comparison across benchmark functions.

  12. DrCyZ: Techniques for analyzing and extracting useful information from CyZ.

    • data.niaid.nih.gov
    Updated Jan 19, 2022
    Cite
    de Curtò, J.; de Zarzà, I. (2022). DrCyZ: Techniques for analyzing and extracting useful information from CyZ. [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5816857
    Dataset updated: Jan 19, 2022
    Dataset provided by: Universitat Oberta de Catalunya
    Authors: de Curtò, J.; de Zarzà, I.
    License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    DrCyZ: Techniques for analyzing and extracting useful information from CyZ.

    Samples from NASA Perseverance and a set of GAN-generated synthetic images from Neural Mars.

    Repository: https://github.com/decurtoidiaz/drcyz

    Subset of samples from (includes tools to visualize and analyse the dataset):

    CyZ: MARS Space Exploration Dataset. [https://doi.org/10.5281/zenodo.5655473]

    Images from NASA missions of the celestial body.

    Repository: https://github.com/decurtoidiaz/cyz

    Authors:

    J. de Curtò c@decurto.be

    I. de Zarzà z@dezarza.be

    File Information from DrCyZ-1.1

    • Subset of samples from Perseverance (drcyz/c).
      ∙ png (drcyz/c/png).
        PNG files (5025) selected from NASA Perseverance (CyZ-1.1) after t-SNE and K-means Clustering. 
      ∙ csv (drcyz/c/csv).
        CSV file.
    
    
    • Resized samples from Perseverance (drcyz/c+).
      ∙ png 64x64; 128x128; 256x256; 512x512; 1024x1024 (drcyz/c+/drcyz_64-1024).
        PNG files resized at the corresponding size. 
      ∙ TFRecords 64x64; 128x128; 256x256; 512x512; 1024x1024 (drcyz/c+/tfr_drcyz_64-1024).
        TFRecord resized at the corresponding size to import on Tensorflow.
    
    
    • Synthetic images from Neural Mars generated using Stylegan2-ada (drcyz/drcyz+).
      ∙ png 100; 1000; 10000 (drcyz/drcyz+/drcyz_256_100-10000)
        PNG files subset of 100, 1000 and 10000 at size 256x256.
    
    
    • Network Checkpoint from Stylegan2-ada trained at size 256x256 (drcyz/model_drcyz).
      ∙ network-snapshot-000798-drcyz.pkl
    
    
    • Notebooks in python to analyse the original dataset and reproduce the experiments; K-means Clustering, t-SNE, PCA, synthetic generation using Stylegan2-ada and instance segmentation using Deeplab (https://github.com/decurtoidiaz/drcyz/tree/main/dr_cyz+).
      ∙ clustering_curiosity_de_curto_and_de_zarza.ipynb
        K-means Clustering and PCA(2) with images from Curiosity.
      ∙ clustering_perseverance_de_curto_and_de_zarza.ipynb
        K-means Clustering and PCA(2) with images from Perseverance.
      ∙ tsne_curiosity_de_curto_and_de_zarza.ipynb
        t-SNE and PCA (components selected to explain 99% of variance) with images from Curiosity.
      ∙ tsne_perseverance_de_curto_and_de_zarza.ipynb
        t-SNE and PCA (components selected to explain 99% of variance) with images from Perseverance.
      ∙ Stylegan2-ada_de_curto_and_de_zarza.ipynb
        Stylegan2-ada trained on a subset of images from NASA Perseverance (DrCyZ).
      ∙ statistics_perseverance_de_curto_and_de_zarza.ipynb
        Compute statistics from synthetic samples generated by Stylegan2-ada (DrCyZ) and images from NASA Perseverance (CyZ).
      ∙ DeepLab_TFLite_ADE20k_de_curto_and_de_zarza.ipynb
        Example of instance segmentation using Deeplab with a sample from NASA Perseverance (DrCyZ).
    
  13. Synthetic low- and medium-voltage grids for Switzerland

    • zenodo.org
    zip
    Updated Apr 7, 2025
    Cite
    Alfredo Ernesto Oneto; Filippo Tettamanti; Blazhe Gjorgiev; Giovanni Sansavini (2025). Synthetic low- and medium-voltage grids for Switzerland [Dataset]. http://doi.org/10.5281/zenodo.15167589
    Available download formats: zip
    Dataset updated: Apr 7, 2025
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Alfredo Ernesto Oneto; Filippo Tettamanti; Blazhe Gjorgiev; Giovanni Sansavini
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered: Mar 7, 2025
    Area covered: Switzerland
    Description

    Swiss-PDGs: Synthetic low- and medium-voltage grids for Switzerland

    For details on the model used to generate this dataset, please refer to the article "Large-scale generation of geo-referenced power distribution grids from open data with load clustering" (2025), by A. Oneto, B. Gjorgiev, F. Tettamanti, and G. Sansavini, published in Sustainable Energy, Grids and Networks.
    https://doi.org/10.1016/j.segan.2025.101678

  14. ProbINet: Bridging Usability Gaps in Probabilistic Network Analysis

    • edmond.mpg.de
    zip
    Updated Nov 28, 2025
    Cite
    Diego Alejandro Baptista Theuerkauf; Martina Contisciani; Caterina De Bacco; Jean-Claude Passy (2025). ProbINet: Bridging Usability Gaps in Probabilistic Network Analysis [Dataset]. http://doi.org/10.17617/3.AYGXPP
    Explore at:
    Available download formats: zip (18787627)
    Dataset updated
    Nov 28, 2025
    Dataset provided by
    Edmond
    Authors
    Diego Alejandro Baptista Theuerkauf; Martina Contisciani; Caterina De Bacco; Jean-Claude Passy
    License

    https://www.gnu.org/licenses/gpl-3.0.html.en

    Description

    The ProbINet package is designed to be a comprehensive and user-friendly toolset for researchers and practitioners interested in modeling network data through probabilistic generative approaches. Our goal is to provide a unified resource that brings together advances scattered across many code repositories. By doing so, we aim not only to enhance the usability of existing models but also to facilitate the comparison of different approaches. Moreover, through a range of tutorials, we aim to simplify the use of these methods for inferential tasks, including the prediction of missing network edges, node clustering (community detection), anomaly identification, and the generation of synthetic data from latent variables.

  15. Data from: Atom-Precise Modification of Silver(I) Thiolate Cluster by Shell...

    • datasetcatalog.nlm.nih.gov
    • acs.figshare.com
    • +1 more
    Updated Jan 2, 2018
    Cite
    Li, Bing; Gao, Guang-Gang; Zang, Shuang-Quan; Du, Xiang-Sha; Li, Guo-Ping; Wang, Jia-Yin; Li, Si (2018). Atom-Precise Modification of Silver(I) Thiolate Cluster by Shell Ligand Substitution: A New Approach to Generation of Cluster Functionality and Chirality [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001800459
    Explore at:
    Dataset updated
    Jan 2, 2018
    Authors
    Li, Bing; Gao, Guang-Gang; Zang, Shuang-Quan; Du, Xiang-Sha; Li, Guo-Ping; Wang, Jia-Yin; Li, Si
    Description

    To realize the molecular design of new functional silver(I) clusters, a new synthetic approach has been proposed, by which the weakly coordinating ligands NO3– in a Ag20 thiolate cluster precursor can be substituted by carboxylic ligands while keeping its inner core intact. By rational design, novel atom-precise carboxylic or amino acid protected 20-core Ag(I)-thiolate clusters have been demonstrated for the first time. The fluorescence and electrochemical activity of the postmodified Ag20 clusters can be modulated by alrestatin or ferrocenecarboxylic acid substitution. More strikingly, when chiral amino acids were used as postmodified ligands, CD-activity was observed for the Ag20 clusters, unveiling an efficient way to obtain atom-precise chiral silver(I) clusters.

  16. Mathematical representation of performance evaluation matrices.

    • figshare.com
    • plos.figshare.com
    xls
    Updated Dec 5, 2024
    Cite
    Sundreen Asad Kamal; Youtian Du; Majdi Khalid; Majed Farrash; Sahraoui Dhelim (2024). Mathematical representation of performance evaluation matrices. [Dataset]. http://doi.org/10.1371/journal.pone.0312016.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Dec 5, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Sundreen Asad Kamal; Youtian Du; Majdi Khalid; Majed Farrash; Sahraoui Dhelim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Mathematical representation of performance evaluation matrices.

  17. Data from: Reversible Formation of Alkyl Radicals at [Fe4S4] Clusters and...

    • datasetcatalog.nlm.nih.gov
    Updated Aug 6, 2020
    Cite
    Brown, Alexandra C.; Suess, Daniel L. M. (2020). Reversible Formation of Alkyl Radicals at [Fe4S4] Clusters and Its Implications for Selectivity in Radical SAM Enzymes [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000532251
    Explore at:
    Dataset updated
    Aug 6, 2020
    Authors
    Brown, Alexandra C.; Suess, Daniel L. M.
    Description

    All kingdoms of life use the transient 5′-deoxyadenosyl radical (5′-dAdo•) to initiate a wide range of difficult chemical reactions. Because of its high reactivity, the 5′-dAdo• must be generated in a controlled manner to abstract a specific H atom and avoid unproductive reactions. In radical S-adenosylmethionine (SAM) enzymes, the 5′-dAdo• is formed upon reduction of SAM by an [Fe4S4] cluster. An organometallic precursor featuring an Fe–C bond between the [Fe4S4] cluster and the 5′-dAdo group was recently characterized and shown to be competent for substrate radical generation, presumably via Fe–C bond homolysis. Such reactivity is without precedent for Fe–S clusters. Here, we show that synthetic [Fe4S4]–alkyl clusters undergo Fe–C bond homolysis when the alkylated Fe site has a suitable coordination number, thereby providing support for the intermediacy of organometallic species in radical SAM enzymes. Addition of pyridine donors to [(IMes)3Fe4S4–R]+ clusters (R = alkyl or benzyl; IMes = 1,3-dimesitylimidazol-2-ylidene) generates R•, ultimately forming R–R coupled hydrocarbons. This process is facile at room temperature and allows for the generation of highly reactive radicals including primary carbon radicals. Mechanistic studies, including use of the 5-hexenyl radical clock, demonstrate that Fe–C bond homolysis occurs reversibly. Using these experimental insights and kinetic simulations, we evaluate the circumstances in which an organometallic intermediate can direct the 5′-dAdo• toward productive H-atom abstraction. Our findings demonstrate that reversible homolysis of even weak M–C bonds is a feasible protective mechanism for the 5′-dAdo• that can allow selective X–H bond activation in both radical SAM and adenosylcobalamin enzymes.

  18. PowerPulse: A Synthetic Energy Insights Dataset

    • kaggle.com
    zip
    Updated Jan 20, 2025
    Cite
    Purvansh Singh (2025). PowerPulse: A Synthetic Energy Insights Dataset [Dataset]. https://www.kaggle.com/datasets/purvanshsingh/powerpulse-a-synthetic-energy-insights-dataset
    Explore at:
    Available download formats: zip (8498640 bytes)
    Dataset updated
    Jan 20, 2025
    Authors
    Purvansh Singh
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The PowerPulse dataset is a synthetically generated collection of 100,000 records designed to simulate real-world energy usage patterns. It provides insights into household energy consumption, solar energy generation, and environmental impact across various regions and weather conditions. With detailed fields like EnergyConsumed_kWh, SolarEnergyGenerated_kWh, WeatherCondition, and CO2Emissions_kg, this dataset is ideal for exploring energy trends, building predictive models, and analyzing sustainability initiatives.

    Key Features:

    • Comprehensive Coverage: Includes attributes like energy consumption, solar generation, CO2 emissions, and appliance usage.
    • Scalable Insights: Designed to handle large-scale data processing with tools like PySpark.
    • Real-World Relevance: Captures modern energy challenges such as renewable energy optimization and carbon footprint analysis.
    • Flexible Use Cases: Suitable for regression, classification, clustering, and exploratory data analysis.
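
As a rough illustration of the kind of exploratory analysis these fields allow, the sketch below averages net household consumption (consumed minus solar-generated energy) per weather condition. The column names are taken from the description above; the CSV layout, the sample rows, and the helper name `net_consumption_by_weather` are assumptions made for this example, not part of the dataset.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; just the column names come from
# the dataset description. The real file layout may differ.
SAMPLE = """WeatherCondition,EnergyConsumed_kWh,SolarEnergyGenerated_kWh
Sunny,10.0,6.0
Sunny,12.0,8.0
Cloudy,11.0,2.0
Rainy,9.0,1.0
"""


def net_consumption_by_weather(csv_file):
    """Average (consumed - solar generated) kWh per weather condition."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(csv_file):
        net = float(row["EnergyConsumed_kWh"]) - float(row["SolarEnergyGenerated_kWh"])
        totals[row["WeatherCondition"]] += net
        counts[row["WeatherCondition"]] += 1
    return {weather: totals[weather] / counts[weather] for weather in totals}


averages = net_consumption_by_weather(io.StringIO(SAMPLE))
```

To run this against the downloaded file, pass an open file handle instead of the in-memory sample, after checking that the header names match.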

  19. Jet2 Synthetic Booking

    • kaggle.com
    zip
    Updated Nov 6, 2025
    Cite
    Adythio Niramoyo (2025). Jet2 Synthetic Booking [Dataset]. https://www.kaggle.com/datasets/adythioniramoyo/jet2-synthetic-booking
    Explore at:
    Available download formats: zip (10065345 bytes)
    Dataset updated
    Nov 6, 2025
    Authors
    Adythio Niramoyo
    Description

    This dataset simulates Jet2 airline passenger bookings and is designed for segmentation, clustering, and behavioral analysis.

    📊 Dataset Description: Jet2 Synthetic Booking

    The Jet2 Synthetic Booking dataset provides a realistic simulation of passenger booking behavior for Jet2, a UK-based leisure airline. It is ideal for data science projects involving customer segmentation, predictive modeling, and operational insights.

    🧾 Key Features

    • Passenger-level booking records with anonymized identifiers
    • Temporal booking patterns: Includes booking dates, departure dates, and lead times
    • Flight details: Routes, departure airports, destination airports
    • Fare and pricing data: Ticket prices, taxes, and total spend
    • Passenger segmentation: Useful for clustering into groups like Early Birds, Mid-Range, and Late Volatility
    • Synthetic generation: Modeled to reflect realistic Jet2 booking trends without using proprietary or personal data

    🎯 Use Cases

    • K-Means clustering to identify booking behavior segments
    • Time series analysis of booking lead times and seasonal demand
    • Revenue optimization based on fare classes and booking windows
    • Marketing strategy development by understanding customer booking habits
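
The first use case can be sketched with a plain k-means (Lloyd's algorithm) on a single feature, the booking lead time in days. This is a minimal standard-library sketch, not the dataset author's method: the lead-time values are made up, the function name `kmeans_1d` is hypothetical, and the quantile-based initialisation is an assumption chosen so the result is reproducible.

```python
from statistics import mean


def kmeans_1d(values, k, iterations=50):
    """Lloyd's algorithm on scalar values; returns (centroids, labels)."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: one centroid at the middle of each of k equal slices.
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    labels = [0] * n
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(mean(members) if members else centroids[c])
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return centroids, labels


# Made-up lead times (days between booking and departure), not dataset values.
lead_times = [3, 5, 7, 8, 40, 45, 50, 55, 150, 160, 170, 180]
centroids, labels = kmeans_1d(lead_times, k=3)
```

On real data one would typically cluster on several features (lead time, spend, route) and scale them first; a library implementation would be the more usual choice there.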

    📁 Format & Accessibility

    • Available as a CSV file on Kaggle
    • Cleaned and structured for immediate use in Python, R, or BI tools
    • No missing values or privacy concerns due to synthetic generation
  20. A sample set of original and generated images using the 14 cluster model.

    • figshare.com
    zip
    Updated Jul 17, 2025
    Cite
    Naoaki ONO (2025). A sample set of original and generated images using the 14 cluster model. [Dataset]. http://doi.org/10.6084/m9.figshare.29588849.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 17, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Naoaki ONO
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Generative image models have revolutionized artificial intelligence by enabling the synthesis of high-quality, realistic images. These models utilize deep learning techniques to learn complex data distributions and generate novel images that closely resemble the training dataset. Recent advancements, particularly in diffusion models, have led to remarkable improvements in image fidelity, diversity, and controllability. In this work, we investigate the application of a conditional latent diffusion model in the healthcare domain. Specifically, we trained a latent diffusion model using unlabeled histopathology images. Initially, these images were embedded into a lower-dimensional latent space using a Vector Quantized Generative Adversarial Network (VQ-GAN). Subsequently, a diffusion process was applied within this latent space, and clustering was performed on the resulting latent features. The clustering results were then used as a conditioning mechanism for the diffusion model, enabling conditional image generation. Finally, we determined the optimal number of clusters using cluster validation metrics and assessed the quality of the synthetic images through quantitative methods. To enhance the interpretability of the synthetic image generation process, expert input was incorporated into the cluster assignments.
