14 datasets found
  1. OpenResume: Advancing Career Trajectory Modeling with Anonymized and Synthetic Resume Datasets

    • zenodo.org
    Updated Feb 24, 2025
    Cite
    Michiharu Yamashita; Thanh Tran; Dongwon Lee (2025). OpenResume: Advancing Career Trajectory Modeling with Anonymized and Synthetic Resume Datasets [Dataset]. http://doi.org/10.1109/bigdata62323.2024.10825519
    Dataset provided by
    Institute of Electrical and Electronics Engineers (http://www.ieee.ro/)
    Authors
    Michiharu Yamashita; Thanh Tran; Dongwon Lee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    The OpenResume dataset is designed for researchers and practitioners in career trajectory modeling and job-domain machine learning, as described in the IEEE BigData 2024 paper. It includes both anonymized realistic resumes and synthetically generated resumes, offering a comprehensive resource for developing and benchmarking predictive models across a variety of career-related tasks. By employing anonymization and differential privacy techniques, OpenResume ensures that research can be conducted while maintaining privacy. The dataset is available in this repository. Please see the paper for more details: 10.1109/BigData62323.2024.10825519

    If you find this paper useful in your research or use this dataset in any publications, projects, tools, or other forms, please cite:

    @inproceedings{yamashita2024openresume,
      title={{OpenResume: Advancing Career Trajectory Modeling with Anonymized and Synthetic Resume Datasets}},
      author={Yamashita, Michiharu and Tran, Thanh and Lee, Dongwon},
      booktitle={2024 IEEE International Conference on Big Data (BigData)},
      year={2024},
      organization={IEEE}
    }

    @inproceedings{yamashita2023james,
      title={{JAMES: Normalizing Job Titles with Multi-Aspect Graph Embeddings and Reasoning}},
      author={Yamashita, Michiharu and Shen, Jia Tracy and Tran, Thanh and Ekhtiari, Hamoon and Lee, Dongwon},
      booktitle={2023 IEEE International Conference on Data Science and Advanced Analytics (DSAA)},
      year={2023},
      organization={IEEE}
    }

    Data Contents and Organization

    The dataset consists of two primary components:

    • Realistic Data: An anonymized dataset utilizing differential privacy techniques.
    • Synthetic Data: A synthetic dataset generated from real-world job transition graphs.

    The dataset includes the following features:

    • Anonymized User Identifiers: Unique IDs for anonymized users.
    • Anonymized Company Identifiers: Unique IDs for anonymized companies.
    • Normalized Job Titles: Job titles standardized into the ESCO taxonomy.
    • Job Durations: Start and end dates, either anonymized or synthetically generated with differential privacy.

    Detailed information on how the OpenResume dataset is constructed can be found in our paper.
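    The paper specifies the actual anonymization pipeline; purely as an illustration of the Laplace mechanism commonly used in differential privacy, a sketch of perturbing a job-duration value might look like this (the epsilon, sensitivity, and post-processing choices are illustrative assumptions, not the paper's parameters):

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # The difference of two Exponential(scale) draws is Laplace(0, scale).
    e1 = -scale * math.log(1.0 - random.random())
    e2 = -scale * math.log(1.0 - random.random())
    return e1 - e2

def privatize_duration(months: int, epsilon: float = 1.0, sensitivity: float = 1.0) -> int:
    # Laplace mechanism: noise scale = sensitivity / epsilon.
    noisy = months + laplace_noise(sensitivity / epsilon)
    return max(0, round(noisy))  # a duration cannot be negative

random.seed(0)
print(privatize_duration(18, epsilon=0.5))
```

    Smaller epsilon means more noise and stronger privacy; the rounding and non-negativity clamp are illustrative post-processing choices.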

    Dataset Extension

    Job titles in the OpenResume dataset are normalized into the ESCO occupation taxonomy. You can easily integrate the OpenResume dataset with ESCO job and skill databases to perform additional downstream tasks.

    • Applicable Tasks:
      • Next Job Title Prediction (Career Path Prediction)
      • Next Company Prediction (Career Path Prediction)
      • Turnover Prediction
      • Link Prediction
      • Required Skill Prediction (with ESCO dataset integration)
      • Existing Skill Prediction (with ESCO dataset integration)
      • Job Description Classification (with ESCO dataset integration)
      • Job Title Classification (with ESCO dataset integration)
      • Text Feature-Based Model Development (with ESCO dataset integration)
      • LLM Development for Resume-Related Tasks (with ESCO dataset integration)
      • And more!
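    Because titles are normalized to ESCO, such an integration can be as simple as a join on the normalized title. A minimal sketch (the field names and skill lists below are hypothetical, not the actual OpenResume or ESCO schema):

```python
# Hypothetical field names and skill lists -- not the actual OpenResume/ESCO schema.
resume_steps = [
    {"user_id": "u1", "esco_title": "data scientist"},
    {"user_id": "u1", "esco_title": "software developer"},
    {"user_id": "u2", "esco_title": "data scientist"},
]
esco_skills = {
    "data scientist": ["machine learning", "statistics"],
    "software developer": ["debugging", "version control"],
}

# Join each career step to its ESCO skills via the shared normalized title.
enriched = [
    {**step, "skills": esco_skills.get(step["esco_title"], [])}
    for step in resume_steps
]
print(enriched[0])
```

    The enriched records can then feed tasks like required-skill prediction or text-feature model development.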

    Intended Uses

    The primary objective of OpenResume is to provide an open resource for:

    1. Evaluating and comparing newly developed career models in a standardized manner.
    2. Fostering AI advancements in career trajectory modeling and job market analytics.

    With its manageable size, the dataset allows for quick validation of model performance, accelerating innovation in the field. It is particularly useful for researchers who face barriers in accessing proprietary datasets.

    While OpenResume is an excellent tool for research and model development, it is not intended for commercial, real-world applications. Companies and job platforms are expected to rely on proprietary data for their operational systems. By excluding sensitive attributes such as race and gender, OpenResume minimizes the risk of bias propagation during model training.

    Our goal is to support transparent, open research by providing this dataset. We encourage responsible use to ensure fairness and integrity in research, particularly in the context of ethical AI practices.

    Ethical and Responsible Use

    The OpenResume dataset was developed with a strong emphasis on privacy and ethical considerations. Personal identifiers and company names have been anonymized, and differential privacy techniques have been applied to protect individual privacy. We expect all users to adhere to ethical research practices and respect the privacy of data subjects.

    Related Work

    JAMES: Normalizing Job Titles with Multi-Aspect Graph Embeddings and Reasoning
    Michiharu Yamashita, Jia Tracy Shen, Thanh Tran, Hamoon Ekhtiari, and Dongwon Lee
    IEEE Int'l Conf. on Data Science and Advanced Analytics (DSAA), 2023

    Fake Resume Attacks: Data Poisoning on Online Job Platforms
    Michiharu Yamashita, Thanh Tran, and Dongwon Lee
    The ACM Web Conference 2024 (WWW), 2024

  2. Supplementary material for the paper "Comparison of stochastic and machine learning methods for multi-step ahead forecasting of hydrological processes"

    • figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Georgia Papacharalampous; Hristos Tyralis (2023). Supplementary material for the paper "Comparison of stochastic and machine learning methods for multi-step ahead forecasting of hydrological processes" [Dataset]. http://doi.org/10.6084/m9.figshare.7092824.v1
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Georgia Papacharalampous; Hristos Tyralis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset serves as supplementary material to the fully reproducible paper entitled "Comparison of stochastic and machine learning methods for multi-step ahead forecasting of hydrological processes". We provide the R codes and their outcomes. We also provide the reports entitled "Definitions of the stochastic processes", "Definitions of the forecast quality metrics", and "Selected figures for the qualitative comparison of the forecasting methods". The previous version of this dataset is available at the provided link.

  3. S1 Data

    • plos.figshare.com
    zip
    Updated Dec 13, 2024
    Cite
    JiaMing Gong; MingGang Dong (2024). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0311133.s007
    Dataset provided by
    PLOS ONE
    Authors
    JiaMing Gong; MingGang Dong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Online imbalanced learning is an emerging topic that combines the challenges of class imbalance and concept drift. However, most current works address class imbalance and concept drift separately, and only a few consider both simultaneously. To this end, this paper proposes an entropy-based dynamic ensemble classification algorithm (EDAC) that handles data streams with class imbalance and concept drift simultaneously. First, to address imbalanced learning in training data chunks arriving at different times, EDAC adopts an entropy-based balancing strategy: it divides each data chunk into multiple balanced sample pairs based on the differences in information entropy between the classes in the chunk. Additionally, we propose a density-based sampling method that improves the accuracy of classifying minority-class samples by separating them into high-quality samples and common samples via the density of similar samples; high-quality and common samples are then randomly selected for training the classifier. Finally, to address concept drift, EDAC designs and implements an ensemble classifier that uses a self-feedback strategy to determine the initial weight of the classifier, adjusting the weight of each sub-classifier according to its performance on the arrived data chunks. Experimental results demonstrate that EDAC outperforms five state-of-the-art algorithms on four synthetic and one real-world data streams.
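    The EDAC implementation itself is not reproduced here; as a small illustration of the signal its entropy-based balancing strategy relies on, the Shannon entropy of a chunk's class distribution can be computed like this (a sketch, not the authors' code):

```python
import math
from collections import Counter

def class_entropy(labels) -> float:
    # Shannon entropy (in bits) of the class distribution within one data chunk.
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

balanced = ["a"] * 50 + ["b"] * 50   # maximally balanced two-class chunk
skewed = ["a"] * 90 + ["b"] * 10     # imbalanced chunk
print(class_entropy(balanced), class_entropy(skewed))
```

    A balanced two-class chunk scores 1.0 bit, while a skewed one scores lower; this gap is the kind of between-class entropy difference a balancing strategy can react to.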

  4. Surgical Scene Segmentation in Robotic Gastrectomy

    • kaggle.com
    Updated Dec 19, 2022
    Cite
    Jihun Yoon (2022). Surgical Scene Segmentation in Robotic Gastrectomy [Dataset]. http://doi.org/10.34740/kaggle/ds/2744937
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jihun Yoon
    Description

    Paper

    Abstract

    Previous image-synthesis research for surgical vision had limited real-world applicability: it relied on simple simulators covering only a few organs and surgical tools, and on outdated segmentation models to evaluate image quality. Furthermore, none of that research released complete datasets to the public to enable open research. We therefore release a new dataset to encourage further study, and we provide novel methods with extensive experiments for surgical scene segmentation using semantic image synthesis in a more complex virtual surgery environment. First, we created three cross-validation sets of real image data, considering demographic and clinical information, from 40 cases of real surgical videos of gastrectomy with the da Vinci Surgical System (dVSS). Second, we created a virtual surgery environment in the Unity engine with five organs from real patient CT data and 22 da Vinci surgical instruments from actual measurements. Third, we converted this environment photo-realistically with representative semantic image synthesis models, SEAN and SPADE. Lastly, we evaluated it with various state-of-the-art instance and semantic segmentation models. We succeeded in substantially improving our segmentation models with the help of synthetic training data. More methods, statistics, and visualizations are available at https://sisvse.github.io/.

    The contribution of our work

    • We release the first large-scale instance and semantic segmentation dataset, including both real and synthetic data, that can be used for visual object recognition and image-to-image translation research for gastrectomy with the dVSS.
    • We systematically analyzed surgical scene segmentation using semantic image synthesis with state-of-the-art models with ten combinations of real and synthetic data.
    • We found exciting results that synthetic data improved low-performance classes and was very effective for Mask AP improvement while improving the segmentation models overall.

    Data generation

    We collected 40 cases of real surgical videos of distal gastrectomy for gastric cancer with the da Vinci Surgical System (dVSS), approved by an institutional review board at the medical institution. In order to evaluate generalization performance, we created three cross-validation datasets considering demographic and clinical variations such as gender, age, BMI, operation time, and patient bleeding. Each cross-validation set consists of 30 cases for train/validation and 10 cases for test data. You can find the overall statistics and demographic and clinical information details in the paper.

    Object categories

    We list five organs (Gallbladder, Liver, Pancreas, Spleen, and Stomach) and 13 surgical instruments that commonly appear in surgeries (Harmonic Ace; HA, Stapler, Cadiere Forceps; CF, Maryland Bipolar Forceps; MBF, Medium-large Clip Applier; MCA, Small Clip Applier; SCA, Curved Atraumatic Graspers; CAG, Suction, Drain Tube; DT, Endotip, Needle, Specimenbag, Gauze). We classify some rare organs and instruments as "other tissues" and "other instruments" classes. The surgical instruments consist of robotic and laparoscopic instruments and auxiliary tools mainly used for robotic subtotal gastrectomy. In addition, we divide some surgical instruments according to their head (H), wrist (W), and body (B) structures, which leads to 24 instrument classes in total.

    Virtual Surgery Environment and Synthetic Data

    Abdominal computed tomography (CT) DICOM data of a patient and actual measurements of each surgical instrument are used to build a virtual surgery environment. We aim to generate meaningful synthetic data from a sample patient. We annotated five organs listed for real data and reconstructed 3D models by using VTK. In addition, we precisely measured the actual size of each instrument commonly used for laparoscopic and robotic surgery with dVSS. We built 3D models with commercial software such as 3DMax, Zbrush, and Substance Painter. After that, we integrated 3D organ and instrument models into the unity environment for virtual surgery. A user can control a camera and two surgical instruments like actual robotic surgery through a keyboard and mouse in this environment. To reproduce the same camera viewpoint as dVSS, we set the exact parameters of an endoscope used in the surgery. While the user simulates a surgery, a snapshot function projects a 3D scene into a 2D image. According to the projected 2D image, the environment automatically generates corresponding segmentation masks.

    Qualified annotations

    Seven annotators trained for surgical tools and organs annotated six organs and 14 surgical instruments divided into 24 instruments according to head, wrist, and body structures with a web-based computer visio...

  5. Stanford-ORB Dataset

    • paperswithcode.com
    Updated Apr 10, 2025
    Cite
    Zhengfei Kuang; Yunzhi Zhang; Hong-Xing Yu; Samir Agarwala; Shangzhe Wu; Jiajun Wu (2025). Stanford-ORB Dataset [Dataset]. https://paperswithcode.com/dataset/stanford-orb
    Authors
    Zhengfei Kuang; Yunzhi Zhang; Hong-Xing Yu; Samir Agarwala; Shangzhe Wu; Jiajun Wu
    Description

    We introduce Stanford-ORB, a new real-world 3D Object inverse Rendering Benchmark. Recent advances in inverse rendering have enabled a wide range of real-world applications in 3D content generation, moving rapidly from research and commercial use cases to consumer devices. While the results continue to improve, there is no real-world benchmark that can quantitatively assess and compare the performance of various inverse rendering methods. Existing real-world datasets typically only consist of the shape and multi-view images of objects, which are not sufficient for evaluating the quality of material recovery and object relighting. Methods capable of recovering material and lighting often resort to synthetic data for quantitative evaluation, which on the other hand does not guarantee generalization to complex real-world environments. We introduce a new dataset of real-world objects captured under a variety of natural scenes with ground-truth 3D scans, multi-view images, and environment lighting. Using this dataset, we establish the first comprehensive real-world evaluation benchmark for object inverse rendering tasks from in-the-wild scenes, and compare the performance of various existing methods.

  6. MedalCare-XL Dataset

    • paperswithcode.com
    • zenodo.org
    Updated Nov 28, 2022
    Cite
    (2022). MedalCare-XL Dataset [Dataset]. https://paperswithcode.com/dataset/medalcare-xl
    Description

    Mechanistic cardiac electrophysiology models allow for personalized simulations of the electrical activity in the heart and the ensuing electrocardiogram (ECG) on the body surface. As such, synthetic signals possess precisely known ground truth labels of the underlying disease (model parameterization) and can be employed for validation of machine learning ECG analysis tools in addition to clinical signals. Recently, synthetic ECG signals were used to enrich sparse clinical data for machine learning or even replace them completely during training leading to good performance on real-world clinical test data.

    We thus generated a large synthetic database comprising a total of 16,900 12-lead ECGs based on multi-scale electrophysiological simulations, equally distributed into 1 normal healthy control class and 7 pathology classes. The pathological case of myocardial infarction had 6 sub-classes. A comparison of extracted timing and amplitude features between the virtual cohort and a large publicly available clinical ECG database demonstrated that the synthetic signals represent clinical ECGs for healthy and pathological subpopulations with high fidelity. The novel dataset of simulated ECG signals is split into training, validation and test folds for the development of novel machine learning algorithms and their objective assessment.

    The folder WP2_largeDataset_Noise contains the 12-lead ECGs of 10 seconds length. Each ECG is stored in a separate CSV file with one row per lead (lead order: I, II, III, aVR, aVL, aVF, V1-V6) and one sample per column (sampling rate: 500 Hz). Data are split by pathology (avblock = AV block, lbbb = left bundle branch block, rbbb = right bundle branch block, sinus = normal sinus rhythm, lae = left atrial enlargement, fam = fibrotic atrial cardiomyopathy, iab = interatrial conduction block, mi = myocardial infarction). MI data are further split into subclasses depending on the occlusion site (LAD, LCX, RCA) and transmurality (0.3 or 1.0). Each pathology subclass contains training, validation and testing data (~70/15/15 split). The splits were defined according to the model with which QRST complexes were simulated, i.e., ECGs calculated with the same anatomical model but different electrophysiological parameters are present in only one of the training, validation and testing datasets, never in multiple. Each subfolder also contains a "siginfo.csv" file specifying the respective simulation run for the P wave and the QRST segment used to synthesize the 10-second ECG segment. Each signal is available in three variations:

    • run*_raw.csv: the synthesized ECG without added noise and without filtering
    • run*_noise.csv: the synthesized ECG (unfiltered) with superimposed noise
    • run*_filtered.csv: the filtered synthesized ECG (filter settings: highpass cutoff frequency 0.5 Hz, lowpass cutoff frequency 150 Hz, Butterworth filters of order 3)
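    Assuming the layout described above (12 comma-separated rows of samples at 500 Hz), a file can be loaded with NumPy; the dummy file written here only stands in for one of the dataset's CSV files:

```python
import os
import tempfile
import numpy as np

# Write a stand-in file in the described layout: 12 rows (leads I, II, III,
# aVR, aVL, aVF, V1-V6), one sample per column, 500 Hz x 10 s = 5000 columns.
path = os.path.join(tempfile.gettempdir(), "example_ecg.csv")
rng = np.random.default_rng(0)
np.savetxt(path, rng.normal(size=(12, 5000)), delimiter=",")

ecg = np.loadtxt(path, delimiter=",")
n_leads, n_samples = ecg.shape
duration_s = n_samples / 500.0  # sampling rate from the dataset description
print(n_leads, duration_s)
```

    Row index then selects a lead, e.g. ecg[1] is lead II.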

    The folder WP2_largeDataset_ParameterFiles contains the parameter files used to simulate the 12-lead ECGs. Parameters are split for atrial and ventricular simulations, which were run independently from one another. See Gillette, Gsell, Nagel et al., "MedalCare-XL: 16,900 healthy and pathological electrocardiograms obtained through multi-scale electrophysiological models", for a description of the model parameters.

  7. ArtiFact Dataset

    • paperswithcode.com
    Updated May 5, 2025
    Cite
    Md Awsafur Rahman; Bishmoy Paul; Najibul Haque Sarker; Zaber Ibn Abdul Hakim; Shaikh Anowarul Fattah (2025). ArtiFact Dataset [Dataset]. https://paperswithcode.com/dataset/artifact
    Authors
    Md Awsafur Rahman; Bishmoy Paul; Najibul Haque Sarker; Zaber Ibn Abdul Hakim; Shaikh Anowarul Fattah
    Description

    The ArtiFact dataset is a large-scale image dataset that aims to include a diverse collection of real and synthetic images from multiple categories, including Human/Human Faces, Animal/Animal Faces, Places, Vehicles, Art, and many other real-life objects. The dataset comprises 8 sources that were carefully chosen to ensure diversity and includes images synthesized from 25 distinct methods, including 13 GANs, 7 Diffusion, and 5 other miscellaneous generators. The dataset contains 2,496,738 images, comprising 964,989 real images and 1,531,749 fake images.

    To ensure diversity across different sources, the real images of the dataset are randomly sampled from source datasets containing numerous categories, whereas synthetic images are generated within the same categories as the real images. Captions and image masks from the COCO dataset are utilized to generate images for text2image and inpainting generators, while normally distributed noise with different random seeds is used for noise2image generators. The dataset is further processed to reflect real-world scenarios by applying random cropping, downscaling, and JPEG compression, in accordance with the IEEE VIP Cup 2022 standards.
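    The exact IEEE VIP Cup 2022 processing rules are not restated here; as a rough sketch of the random-crop-and-downscale idea (the crop bounds, nearest-neighbour interpolation, and omission of the JPEG step are simplifying assumptions), one might write:

```python
import numpy as np

rng = np.random.default_rng(42)

def crop_and_downscale(img: np.ndarray, out_size: int = 200) -> np.ndarray:
    # Random square crop, then nearest-neighbour resize to out_size x out_size.
    h, w = img.shape[:2]
    side = int(rng.integers(out_size, min(h, w) + 1))
    top = int(rng.integers(0, h - side + 1))
    left = int(rng.integers(0, w - side + 1))
    crop = img[top:top + side, left:left + side]
    idx = np.arange(out_size) * side // out_size  # nearest-neighbour indices
    return crop[idx][:, idx]

img = rng.integers(0, 256, size=(512, 512, 3), dtype=np.uint8)
print(crop_and_downscale(img).shape)
```

    The output matches the dataset's stated 200 x 200 resolution; a real pipeline would add JPEG re-compression on top.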

    The ArtiFact dataset is intended to serve as a benchmark for evaluating the performance of synthetic image detectors under real-world conditions. It includes a broad spectrum of diversity in terms of generators used and syntheticity, providing a challenging dataset for image detection tasks.

    • Total number of images: 2,496,738
    • Number of real images: 964,989
    • Number of fake images: 1,531,749
    • Number of generators used for fake images: 25 (13 GANs, 7 Diffusion, 5 miscellaneous)
    • Number of sources used for real images: 8
    • Categories: Human/Human Faces, Animal/Animal Faces, Places, Vehicles, Art, and other real-life objects
    • Image resolution: 200 x 200

  8. FastLloyd Clustering Datasets

    • zenodo.org
    xz
    Updated May 28, 2025
    Cite
    Abdulrahman Diaa; Thomas Humphries; Florian Kerschbaum (2025). FastLloyd Clustering Datasets [Dataset]. http://doi.org/10.5281/zenodo.15530593
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Abdulrahman Diaa; Thomas Humphries; Florian Kerschbaum
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This artifact bundles the five dataset archives used in our private federated clustering evaluation, corresponding to the real-world benchmarks, scaling experiments, ablation studies, and timing performance tests described in the paper. real_datasets.tar.xz includes ten established clustering benchmarks drawn from UCI and the Clustering basic benchmark (DOI: https://doi.org/10.1007/s10489-018-1238-7); scale_datasets.tar.xz contains the SynthNew family, generated with the R clusterGeneration package to assess scalability; ablate_datasets.tar.xz holds the AblateSynth sets, also generated with clusterGeneration, varying cluster separation for ablation analysis; g2_datasets.tar.xz packages the G2 sets (Gaussian clusters of size 2048 across dimensions 2-1024, two clusters each), collected from the Clustering basic benchmark; and timing_datasets.tar.xz includes the real s1 and lsun datasets alongside TimeSynth files (balanced synthetic clusters for timing), following Mohassel et al.'s experimental framework.

    Contents

    1. real_datasets.tar.xz

    Contains ten real-world benchmark datasets, formatted as one sample per line with space-separated features:

    • iris.txt: 150 samples, 4 features, 3 classes; classic UCI Iris dataset for petal/sepal measurements.

    • lsun.txt: 400 samples, 2 features, 3 clusters; two-dimensional variant of the LSUN dataset for clustering experiments.

    • s1.txt: 5,000 samples, 2 features, 15 clusters; synthetic benchmark from Fränti’s S1 series.

    • house.txt: 1,837 samples, 3 features, 3 clusters; housing data transformed for clustering tasks.

    • adult.txt: 48,842 samples, 6 features, 3 clusters; UCI Census Income (“Adult”) dataset for income bracket prediction.

    • wine.txt: 178 samples, 13 features, 3 cultivars; UCI Wine dataset with chemical analysis features.

    • breast.txt: 569 samples, 9 features, 2 classes; Wisconsin Diagnostic Breast Cancer dataset.

    • yeast.txt: 1,484 samples, 8 features, 10 localization sites; yeast protein localization data.

    • mnist.txt: 10,000 samples, 784 features (28×28 pixels), 10 digit classes; MNIST handwritten digits.

    • birch2.txt: a random 25,000-sample subset of the 100,000-sample synthetic BIRCH2 data, 2 features, 100 clusters; used for high-cluster-count evaluation.
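    Each file is plain text with one space-separated sample per line, so it loads directly with np.loadtxt. Since the artifact accompanies a k-means (Lloyd) evaluation, a single Lloyd iteration in plain NumPy over a tiny stand-in for such a file can be sketched as follows; this is an illustration only, not the paper's private federated protocol:

```python
import numpy as np

def lloyd_step(points: np.ndarray, centroids: np.ndarray):
    # One Lloyd iteration: assign each point to its nearest centroid, then
    # recompute each centroid as the mean of the points assigned to it.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    new_centroids = np.array([
        points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(len(centroids))
    ])
    return labels, new_centroids

# Tiny stand-in for a file like s1.txt (one space-separated sample per line).
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, centroids = lloyd_step(points, np.array([[0.0, 0.0], [5.0, 5.0]]))
print(labels, centroids)
```

    Iterating this step until the assignments stop changing gives standard k-means.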

    2. scale_datasets.tar.xz

    Holds the SynthNew_{k}_{d}_{s}.txt files for scaling experiments, where:

    • $k \in \{2,4,8,16,32\}$ is the number of clusters,

    • $d \in \{2,4,8,16,32,64,128,256,512\}$ is the dimensionality,

    • $s \in \{1,2,3\}$ are different random seeds.

    These are generated with the R clusterGeneration package with cluster sizes following a $1:2:...:k$ ratio. We incorporate a random number (in $[0, 100]$) of randomly sampled outliers and set the cluster separation degrees randomly in $[0.16, 0.26]$, spanning partially overlapping to separated clusters.
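    To make the 1:2:...:k size ratio concrete, here is a sketch of the cluster sizes it implies for a dataset of n_total samples (the rounding scheme is an assumption of this sketch; clusterGeneration handles sizing internally):

```python
def ratio_sizes(n_total: int, k: int) -> list:
    # Cluster sizes proportional to 1:2:...:k; the rounding remainder is
    # assigned to the largest cluster (an illustrative choice).
    denom = k * (k + 1) // 2
    sizes = [n_total * i // denom for i in range(1, k + 1)]
    sizes[-1] += n_total - sum(sizes)
    return sizes

print(ratio_sizes(1000, 4))
```

    For n_total=1000 and k=4 this yields [100, 200, 300, 400].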

    3. ablate_datasets.tar.xz

    Contains the AblateSynth_{k}_{d}_{sep}.txt files for ablation studies, with:

    • $k \in \{2,4,8,16\}$ clusters,

    • $d \in \{2,4,8,16\}$ dimensions,

    • $sep \in \{0.25, 0.5, 0.75\}$ controlling cluster separation degrees.

    Also generated via clusterGeneration.

    4. g2_datasets.tar.xz

    Packages the G2 synthetic sets (g2-{dim}-{var}.txt) from the clustering-data benchmarks:

    • $N=2048$ samples, $k=2$ Gaussian clusters,

    • Dimensions $d \in \{1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024\}$

    • Cluster overlap $var \in \{10, 20, 30, 40, 50, 60, 70, 80, 90, 100\}$

    5. timing_datasets.tar.xz

    Includes:

    • s1.txt, lsun.txt: two real datasets for baseline timing.

    • timesynth_{k}_{d}_{n}.txt: synthetic timing datasets with balanced cluster sizes ($C_{avg} = N/k$), varying:

      • $k \in \{2,5\}$

      • $d \in \{2,5\}$

      • $N \in \{10000, 100000\}$

    Generated similarly to the scaling sets, following Mohassel et al.'s timing experiment protocol.

    Usage:

    Unpack any archive with tar -xJf, e.g., tar -xJf real_datasets.tar.xz

  9. Data from: LLFF Dataset

    • paperswithcode.com
    • library.toponeai.link
    Updated Jan 12, 2025
    Cite
    Ben Mildenhall; Pratul P. Srinivasan; Rodrigo Ortiz-Cayon; Nima Khademi Kalantari; Ravi Ramamoorthi; Ren Ng; Abhishek Kar (2025). LLFF Dataset [Dataset]. https://paperswithcode.com/dataset/llff
    Authors
    Ben Mildenhall; Pratul P. Srinivasan; Rodrigo Ortiz-Cayon; Nima Khademi Kalantari; Ravi Ramamoorthi; Ren Ng; Abhishek Kar
    Description

    Local Light Field Fusion (LLFF) is a practical and robust deep learning solution for capturing and rendering novel views of complex real-world scenes for virtual exploration. The dataset consists of both renderings and real images of natural scenes. The synthetic images are rendered from SUNCG and UnrealCV, where SUNCG contains 45,000 simplistic house and room environments with texture-mapped surfaces and low geometric complexity, and UnrealCV contains a few large-scale environments modeled and rendered with extreme detail. The real images are 24 scenes captured with a handheld cellphone.

  10. Synthetic Multimodal Drone Delivery Dataset

    • zenodo.org
    zip
    Updated Apr 2, 2025
    Cite
    Diyar Altinses (2025). Synthetic Multimodal Drone Delivery Dataset [Dataset]. http://doi.org/10.5281/zenodo.15124580
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Diyar Altinses
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Nov 4, 2024
    Description

    README: Synthetic Logistics Dataset Structure and Components

    This dataset provides a structured representation of logistics data designed to evaluate and optimize hybrid truck-and-drone delivery networks. It captures a comprehensive set of parameters essential for modeling real-world logistics scenarios, including spatial coordinates, environmental conditions, and operational constraints. The data is meticulously organized into distinct keys, each representing a critical aspect of the delivery network, enabling researchers and practitioners to conduct flexible and in-depth analyses.

    The dataset is a curated subset derived from the research presented in the paper "Synthetic Dataset Generation for Optimizing Multimodal Drone Delivery Systems" by Altinses et al. (2024), published in Drones. It serves as a practical resource for studying the interplay between ground-based and aerial delivery systems, with a focus on efficiency, environmental impact, and operational feasibility.

    Altinses, D., Torres, D. O. S., Gobachew, A. M., Lier, S., & Schwung, A. (2024). Synthetic Dataset Generation for Optimizing Multimodal Drone Delivery Systems. Drones (2504-446X), 8(12).

    Each data file contains information on ten customer locations, specified by their x and y coordinates, which facilitate the modeling of delivery routes and service areas. Additionally, the dataset includes communication data represented as a two-dimensional grid, which can be used to assess signal strength, connectivity, or other network-related factors that influence drone operations.

    A key feature of this dataset is the inclusion of wind data, structured as a two-dimensional grid with four distinct features per grid point. These features likely represent wind velocity components (such as horizontal and vertical directions) along with auxiliary parameters like turbulence intensity or wind shear, which are crucial for drone path planning and energy consumption estimation. The wind data enables researchers to simulate realistic environmental conditions and evaluate their impact on drone performance, stability, and battery life.

    By integrating geospatial, environmental, and operational data, this dataset supports a wide range of applications, from route optimization and energy efficiency studies to risk assessment and resilience planning in multimodal delivery systems. Its synthetic nature ensures reproducibility while maintaining relevance to real-world logistics challenges, making it a valuable tool for advancing research in drone-assisted delivery networks.

    The 4 wind channels represent:

    1. X and Y (Grid Positions)

      • These define where the arrows start (usually a meshgrid).

    2. U and V (Arrow Directions)

      • U = Horizontal component (e.g., gradient in x).

      • V = Vertical component (e.g., gradient in y).

    How to load the files using Python:

    import numpy as np

    data = np.loadtxt('data.txt')

    # For the wind data only: reshape the flat array into 4 channels on a 16x16 grid
    data = data.reshape((4, 16, 16))
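The channel layout described above can be exercised end to end with a small sketch. The trigonometric components below are placeholder values, not drawn from the dataset; they only stand in for a loaded 4x16x16 wind grid so the unpacking step is concrete:

```python
import numpy as np

# Synthesize a wind grid in the same 4x16x16 layout described above
# (X, Y grid positions; U, V direction components); the cos/sin fields
# are illustrative placeholders, not real dataset values.
xs, ys = np.meshgrid(np.arange(16), np.arange(16))
u = np.cos(ys / 4.0)  # placeholder horizontal component
v = np.sin(xs / 4.0)  # placeholder vertical component
data = np.stack([xs, ys, u, v]).reshape(-1)  # flattened, as stored on disk

wind = data.reshape((4, 16, 16))
X, Y, U, V = wind  # one 16x16 array per channel

# Wind speed magnitude at each grid point, e.g. for energy-consumption estimates
speed = np.hypot(U, V)
print(speed.shape)  # (16, 16)
```

The unpacked `X, Y, U, V` arrays plug directly into a quiver-style visualization or a drone path-planning cost function.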

  11. DAGM2007 Dataset

    • paperswithcode.com
    • opendatalab.com
    + more versions
    Cite
    DAGM2007 Dataset [Dataset]. https://paperswithcode.com/dataset/dagm2007
    Description

    This is a synthetic dataset for defect detection on textured surfaces. It was originally created for a competition at the 2007 symposium of the DAGM (Deutsche Arbeitsgemeinschaft für Mustererkennung e.V., the German chapter of the International Association for Pattern Recognition). The competition was hosted together with the GNSS (German Chapter of the European Neural Network Society).

    Since the competition, the dataset has been used as a test set in multiple projects and research papers. It is publicly available from the University of Heidelberg website (Heidelberg Collaboratory for Image Processing).

    The data is artificially generated, but similar to real-world problems. The first six of the ten datasets, denoted as development datasets, are intended for algorithm development. The remaining four, referred to as competition datasets, can be used to evaluate performance. As a code of honour, researchers should not use or analyze the competition datasets until development is completed.

  12. Data from: Multilevel Modeling of Training Needs in Artificial Intelligence

    • zenodo.org
    bin, xls
    Updated Mar 7, 2025
    Cite
    Veronica Distefano; Sandra De Iaco; Sabrina Maggio (2025). Multilevel Modeling of Training Needs in Artificial Intelligence [Dataset]. http://doi.org/10.5281/zenodo.13890780
    Available download formats: bin, xls
    Dataset updated
    Mar 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Veronica Distefano; Sandra De Iaco; Sabrina Maggio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Nowadays, Artificial Intelligence (AI) is playing a rapidly increasing role in several fields of research and in almost all sectors of real life. However, few studies have assessed the effects of AI applications on training needs. This paper proposes an innovative multilevel modeling approach to investigate Awareness, Attitude, and Trust towards AI and their reflections on learning needs. In particular, it is shown how a machine learning variable selection algorithm can support the definition of the optimal subset of all relevant covariates with respect to the outcome variable and improve the multilevel model performance for estimating the probability of educational needs. Thus, starting from a complex web survey of European citizens distributed across eight countries, the estimation of a multilevel binary model, defined on the basis of covariates selected through the Boruta random forest algorithm, is proposed. A discussion of the gender differences in the related estimated multilevel logit models is presented. A sensitivity analysis is also included in order to assess the prediction accuracy of the proposed multilevel logit modeling.

    This repository contains data generated for the manuscript "A two-stage procedure for optimal modeling of the probability of training needs in artificial intelligence". It comprises: (1) the dataset Data_Boruta_Random_Forest, used to estimate variable importance; (2) the dataset Data_Multilevel, used to compare the different multilevel binary models proposed in the paper.

  13. Description of the real-world dataset.

    • plos.figshare.com
    xls
    Updated Jun 27, 2023
    Cite
    Fadi K. Dib; Peter Rodgers (2023). Description of the real-world dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0287744.t010
    Available download formats: xls
    Dataset updated
    Jun 27, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Fadi K. Dib; Peter Rodgers
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Graph drawing, involving the automatic layout of graphs, is vital for clear data visualization and interpretation but poses challenges due to the optimization of a multi-metric objective function, an area where current search-based methods seek improvement. In this paper, we investigate the performance of Jaya algorithm for automatic graph layout with straight lines. Jaya algorithm has not been previously used in the field of graph drawing. Unlike most population-based methods, Jaya algorithm is a parameter-less algorithm in that it requires no algorithm-specific control parameters and only population size and number of iterations need to be specified, which makes it easy for researchers to apply in the field. To improve Jaya algorithm's performance, we applied Latin Hypercube Sampling to initialize the population of individuals so that they widely cover the search space.

    We developed a visualization tool that simplifies the integration of search methods, allowing for easy performance testing of algorithms on graphs with weighted aesthetic metrics. We benchmarked the Jaya algorithm and its enhanced version against Hill Climbing and Simulated Annealing, commonly used graph-drawing search algorithms which have a limited number of parameters, to demonstrate Jaya algorithm's effectiveness in the field. We conducted experiments on synthetic datasets with varying numbers of nodes and edges using the Erdős–Rényi model and real-world graph datasets and evaluated the quality of the generated layouts, and the performance of the methods based on number of function evaluations. We also conducted a scalability experiment on Jaya algorithm to evaluate its ability to handle large-scale graphs.

    Our results showed that Jaya algorithm significantly outperforms Hill Climbing and Simulated Annealing in terms of the quality of the generated graph layouts and the speed at which the layouts were produced. Using improved population sampling generated better layouts compared to the original Jaya algorithm using the same number of function evaluations. Moreover, Jaya algorithm was able to draw layouts for graphs with 500 nodes in a reasonable time.
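The parameter-less update the abstract describes can be sketched in a few lines. This is a toy illustration on a sphere function, not the paper's graph-layout objective: each candidate moves toward the current best individual and away from the worst, and only population size and iteration count are specified.

```python
import numpy as np

rng = np.random.default_rng(0)

def objective(x):
    # Toy stand-in for a weighted aesthetics metric (minimize)
    return np.sum(x ** 2)

pop_size, dims, iters = 20, 5, 200
pop = rng.uniform(-10, 10, size=(pop_size, dims))
init_best = min(objective(x) for x in pop)

for _ in range(iters):
    fitness = np.array([objective(x) for x in pop])
    best, worst = pop[fitness.argmin()], pop[fitness.argmax()]
    r1, r2 = rng.random((2, pop_size, dims))
    # Jaya update: move toward the best individual, away from the worst
    candidates = pop + r1 * (best - np.abs(pop)) - r2 * (worst - np.abs(pop))
    cand_fit = np.array([objective(x) for x in candidates])
    improved = cand_fit < fitness
    pop[improved] = candidates[improved]  # greedy acceptance

best_fit = min(objective(x) for x in pop)
print(best_fit)
```

In the paper's setting, `objective` would be replaced by the weighted sum of layout aesthetic metrics, and the Latin Hypercube Sampling enhancement would replace the uniform initialization above.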

  14. OOD-CV Dataset

    • paperswithcode.com
    Updated Jun 16, 2024
    Cite
    Bingchen Zhao; Shaozuo Yu; Wufei Ma; Mingxin Yu; Shenxiao Mei; Angtian Wang; Ju He; Alan Yuille; Adam Kortylewski (2024). OOD-CV Dataset [Dataset]. https://paperswithcode.com/dataset/ood-cv
    Dataset updated
    Jun 16, 2024
    Authors
    Bingchen Zhao; Shaozuo Yu; Wufei Ma; Mingxin Yu; Shenxiao Mei; Angtian Wang; Ju He; Alan Yuille; Adam Kortylewski
    Description

    Enhancing the robustness of vision algorithms in real-world scenarios is challenging. One reason is that existing robustness benchmarks are limited, as they either rely on synthetic data or ignore the effects of individual nuisance factors. We introduce OOD-CV, a benchmark dataset that includes out-of-distribution examples of 10 object categories in terms of pose, shape, texture, context, and weather conditions, and enables benchmarking models for image classification, object detection, and 3D pose estimation. In addition to this novel dataset, we contribute extensive experiments using popular baseline methods, which reveal that: (1) some nuisance factors have a much stronger negative effect on performance than others, also depending on the vision task; (2) current approaches to enhance robustness have only marginal effects, and can even reduce robustness; (3) we do not observe significant differences between convolutional and transformer architectures. We believe our dataset provides a rich test bed to study robustness and will help push forward research in this area.


Cite
Michiharu Yamashita; Thanh Tran; Dongwon Lee (2025). OpenResume: Advancing Career Trajectory Modeling with Anonymized and Synthetic Resume Datasets [Dataset]. http://doi.org/10.1109/bigdata62323.2024.10825519

OpenResume: Advancing Career Trajectory Modeling with Anonymized and Synthetic Resume Datasets

14 scholarly articles cite this dataset
Dataset updated
Feb 24, 2025
Dataset provided by
Institute of Electrical and Electronics Engineers (http://www.ieee.ro/)
Authors
Michiharu Yamashita; Thanh Tran; Dongwon Lee
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Overview

The OpenResume dataset is designed for researchers and practitioners in career trajectory modeling and job-domain machine learning, as described in the IEEE BigData 2024 paper. It includes both anonymized realistic resumes and synthetically generated resumes, offering a comprehensive resource for developing and benchmarking predictive models across a variety of career-related tasks. By employing anonymization and differential privacy techniques, OpenResume ensures that research can be conducted while maintaining privacy. The dataset is available in this repository. Please see the paper for more details: 10.1109/BigData62323.2024.10825519

If you find this paper useful in your research or use this dataset in any publications, projects, tools, or other forms, please cite:

@inproceedings{yamashita2024openresume,
  title={{OpenResume: Advancing Career Trajectory Modeling with Anonymized and Synthetic Resume Datasets}},
  author={Yamashita, Michiharu and Tran, Thanh and Lee, Dongwon},
  booktitle={2024 IEEE International Conference on Big Data (BigData)},
  year={2024},
  organization={IEEE}
}

@inproceedings{yamashita2023james,
  title={{JAMES: Normalizing Job Titles with Multi-Aspect Graph Embeddings and Reasoning}},
  author={Yamashita, Michiharu and Shen, Jia Tracy and Tran, Thanh and Ekhtiari, Hamoon and Lee, Dongwon},
  booktitle={2023 IEEE International Conference on Data Science and Advanced Analytics (DSAA)},
  year={2023},
  organization={IEEE}
}

Data Contents and Organization

The dataset consists of two primary components:

  • Realistic Data: An anonymized dataset utilizing differential privacy techniques.
  • Synthetic Data: A synthetic dataset generated from real-world job transition graphs.

The dataset includes the following features:

  • Anonymized User Identifiers: Unique IDs for anonymized users.
  • Anonymized Company Identifiers: Unique IDs for anonymized companies.
  • Normalized Job Titles: Job titles standardized into the ESCO taxonomy.
  • Job Durations: Start and end dates, either anonymized or synthetically generated with differential privacy.

Detailed information on how the OpenResume dataset is constructed can be found in our paper.
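As an illustration of how the features listed above combine into career trajectories, the sketch below builds next-job-title training pairs from toy records. The field names (`user_id`, `esco_title`, `start`, `end`) and the sample values are assumptions for illustration, not the dataset's actual schema:

```python
from collections import defaultdict
from datetime import date

# Hypothetical records mirroring the feature list above: anonymized user
# and company IDs, ESCO-normalized titles, and job durations.
records = [
    {"user_id": "u1", "company_id": "c1", "esco_title": "data scientist",
     "start": date(2018, 1, 1), "end": date(2020, 6, 1)},
    {"user_id": "u1", "company_id": "c2", "esco_title": "machine learning engineer",
     "start": date(2020, 7, 1), "end": date(2023, 1, 1)},
    {"user_id": "u2", "company_id": "c3", "esco_title": "software developer",
     "start": date(2019, 3, 1), "end": date(2021, 9, 1)},
]

# Group by user and sort by start date to form career trajectories.
trajectories = defaultdict(list)
for rec in sorted(records, key=lambda r: (r["user_id"], r["start"])):
    trajectories[rec["user_id"]].append(rec["esco_title"])

# Each (history prefix, next title) pair is one next-job-prediction example;
# users with a single job contribute no pairs.
examples = [
    (titles[:i], titles[i])
    for titles in trajectories.values()
    for i in range(1, len(titles))
]
print(examples)  # [(['data scientist'], 'machine learning engineer')]
```

The same grouping extends naturally to the other listed tasks, e.g. pairing company IDs instead of titles for next-company prediction.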

Dataset Extension

Job titles in the OpenResume dataset are normalized into the ESCO occupation taxonomy. You can easily integrate the OpenResume dataset with ESCO job and skill databases to perform additional downstream tasks.

  • Applicable Tasks:
    • Next Job Title Prediction (Career Path Prediction)
    • Next Company Prediction (Career Path Prediction)
    • Turnover Prediction
    • Link Prediction
    • Required Skill Prediction (with ESCO dataset integration)
    • Existing Skill Prediction (with ESCO dataset integration)
    • Job Description Classification (with ESCO dataset integration)
    • Job Title Classification (with ESCO dataset integration)
    • Text Feature-Based Model Development (with ESCO dataset integration)
    • LLM Development for Resume-Related Tasks (with ESCO dataset integration)
    • And more!

Intended Uses

The primary objective of OpenResume is to provide an open resource for:

  1. Evaluating and comparing newly developed career models in a standardized manner.
  2. Fostering AI advancements in career trajectory modeling and job market analytics.

With its manageable size, the dataset allows for quick validation of model performance, accelerating innovation in the field. It is particularly useful for researchers who face barriers in accessing proprietary datasets.

While OpenResume is an excellent tool for research and model development, it is not intended for commercial, real-world applications. Companies and job platforms are expected to rely on proprietary data for their operational systems. By excluding sensitive attributes such as race and gender, OpenResume minimizes the risk of bias propagation during model training.

Our goal is to support transparent, open research by providing this dataset. We encourage responsible use to ensure fairness and integrity in research, particularly in the context of ethical AI practices.

Ethical and Responsible Use

The OpenResume dataset was developed with a strong emphasis on privacy and ethical considerations. Personal identifiers and company names have been anonymized, and differential privacy techniques have been applied to protect individual privacy. We expect all users to adhere to ethical research practices and respect the privacy of data subjects.

Related Work

JAMES: Normalizing Job Titles with Multi-Aspect Graph Embeddings and Reasoning
Michiharu Yamashita, Jia Tracy Shen, Thanh Tran, Hamoon Ekhtiari, and Dongwon Lee
IEEE Int'l Conf. on Data Science and Advanced Analytics (DSAA), 2023

Fake Resume Attacks: Data Poisoning on Online Job Platforms
Michiharu Yamashita, Thanh Tran, and Dongwon Lee
The ACM Web Conference 2024 (WWW), 2024
