Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.
The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.
The synthetic data have been used so far in two analyses described in related peer-reviewed publications, which also provide information about the original data sources:
Note: while the synthetic versions of the datasets resemble the real ones in several aspects, users should be aware that these data are fake and must not be used to test or make inferences about specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.
The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).
The series of synthetic datasets (each stored in two versions, in CSV and RDS formats) is the following:
The repository also provides the following additional files:
The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).
The first part merges all the data, including the annual PM2.5 levels, into a single wide-format dataset (with a row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the individual datasets. In the second part, a Cox proportional hazards model is fitted on the original data to estimate risks associated with various predictors (including the main exposure, represented by PM2.5), and these relationships are then used to simulate death events in each year. Details on the modelling aspects are provided in the article.
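To make the second step concrete, here is a minimal Python/NumPy sketch of how yearly death events can be sampled from Cox-type hazards. This is only a conceptual illustration, not the authors' R code from the GitHub repo; the baseline hazards, coefficients, and covariates below are invented placeholders.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical illustration: yearly death events sampled from risks implied by
# a Cox-type model. All numbers below are made-up placeholders.
n_subjects, n_years = 1000, 5
baseline_hazard = np.full(n_years, 0.01)               # assumed yearly baseline hazard h0(t)
beta_pm25, beta_age = 0.08, 0.07                       # assumed log-hazard ratios
pm25 = rng.normal(10, 2, size=(n_subjects, n_years))   # fake annual PM2.5 exposure
age0 = rng.integers(40, 70, size=n_subjects)

alive = np.ones(n_subjects, dtype=bool)
death_year = np.full(n_subjects, -1)
for t in range(n_years):
    # Cox-type yearly hazard: h_i(t) = h0(t) * exp(beta' x_i(t))
    lin_pred = beta_pm25 * (pm25[:, t] - 10) + beta_age * (age0 + t - 55)
    hazard = baseline_hazard[t] * np.exp(lin_pred)
    p_death = 1 - np.exp(-hazard)                      # probability of dying in year t
    dies = alive & (rng.random(n_subjects) < p_death)
    death_year[dies] = t
    alive &= ~dies

print(f"simulated deaths: {(~alive).sum()} / {n_subjects}")
```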
This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables, as well as the mortality risks, resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are used only for illustrative purposes, and they must not be used to test other research hypotheses.
MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Description: This dataset contains 5 sample PDF Electronic Health Records (EHRs), generated as part of a synthetic healthcare data project. The purpose of this dataset is to assist with sales distribution, offering potential users and stakeholders a glimpse of how synthetic EHRs can look and function. These records have been crafted to mimic realistic admission data while ensuring privacy and compliance with all data protection regulations.
Key Features:
1. Synthetic Data: Entirely artificial data created for testing and demonstration purposes.
2. PDF Format: Records are presented in PDF format, commonly used in healthcare systems.
3. Diverse Use Cases: Useful for evaluating tools related to data parsing, machine learning in healthcare, or EHR management systems.
4. Rich Admission Details: Includes admission-related data that highlights the capabilities of synthetic EHR generation.
Potential Use Cases:
Feel free to use this dataset for non-commercial testing and demonstration purposes. Feedback and suggestions for improvements are always welcome!
A synthetic dataset including simulated versions of Nightingale Health's NMR quantification of 251 metabolites and biomarkers.
Based on the UK Biobank synthetic data. See the UK Biobank Showcase schema for descriptions of the columns included.
https://spdx.org/licenses/CC0-1.0.html
To study data-scarcity mitigation for learning-based visual localization methods via sim-to-real transfer, we curate and present the CrossLoc benchmark datasets: multimodal aerial sim-to-real data for flights over natural and urban terrain. Unlike previous computer vision datasets that focus on localization in a single domain (mostly real RGB images), the provided benchmark datasets include various multimodal synthetic cues paired with every real photo. Complementary to the paired real and synthetic data, we offer rich synthetic data that efficiently fills the flight envelope volume in the vicinity of the real data.
The synthetic data were rendered using the proposed data generation workflow TOPO-DataGen. The provided CrossLoc datasets were used as an initial benchmark to showcase the use of synthetic data to assist visual localization in the real world with limited real data. The dataset collection, processing, and validation details are explained in our paper at https://arxiv.org/abs/2112.09081, and the code is available at https://github.com/TOPO-EPFL/CrossLoc.
https://creativecommons.org/publicdomain/zero/1.0/
This is a synthetic data set derived from a simple hierarchical decision model, built to demonstrate the decision support system DEX. The decision model comprised six attributes, including buying and maintenance price, the number of passengers, and the size of the luggage boot, and evaluated the utility of the car from a buyer's perspective. All attributes were discrete, each having three or four values. The data set provides the car's utility for every possible combination of attribute values. It was originally created to showcase the ability of machine learning by function decomposition to recreate the hierarchy of the decision model.
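As a small illustration of "every possible combination of attribute values", the Python sketch below enumerates such a grid. The attribute names and values are an assumption (they mirror the classic car evaluation data this description appears to refer to), and the original DEX utility labels are not reproduced here.

```python
from itertools import product

# Hypothetical attribute space, assumed to mirror the classic car evaluation
# data; the original DEX utility labels are not reproduced here.
attributes = {
    "buying":   ["vhigh", "high", "med", "low"],
    "maint":    ["vhigh", "high", "med", "low"],
    "doors":    ["2", "3", "4", "5more"],
    "persons":  ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety":   ["low", "med", "high"],
}
rows = list(product(*attributes.values()))
print(len(rows))   # 4*4*4*3*3*3 = 1728 combinations, one row each
print(rows[0])
```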
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains sample synthetic data used for training a solution for reading analog pressure gauge values. We used this data while writing our paper and blog posts, which showcase how synthetic data can be used to train and apply computer vision models. We chose the topic of analog gauge reading because it is a common problem in many industries and exemplifies how the output from multiple models can be consumed by heuristics to produce a final reading.
The dataset contains the following:
- A subset of the synthetic data used for training; we have included the two latest dataset versions. Each contains both the images and the COCO annotations for segmentation and pose estimation.
- Inference data for the test videos available in the Kaggle dataset. For each video there is one CSV file that contains, for every frame, the bounding box of the (main) gauge, the keypoint locations for the needle tip, gauge center, and min and max scale ticks, and the predicted reading.
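Given keypoints like those in the inference CSVs (needle tip, gauge center, min and max scale ticks), a reading can be derived with a simple angle-interpolation heuristic. The Python sketch below is a hypothetical illustration of that idea, not the exact heuristic from the paper; the scale range and keypoint coordinates are made up.

```python
import math

def gauge_reading(center, needle_tip, min_tick, max_tick, min_value=0.0, max_value=10.0):
    """Estimate a gauge reading from predicted keypoints (hypothetical heuristic).

    center, needle_tip, min_tick, max_tick: (x, y) pixel coordinates.
    The scale range (min_value, max_value) is an assumption; real gauges vary.
    """
    def angle(p):
        # Angle of point p around the gauge center, in radians in [0, 2*pi).
        return math.atan2(p[1] - center[1], p[0] - center[0]) % (2 * math.pi)

    a_min, a_max, a_needle = angle(min_tick), angle(max_tick), angle(needle_tip)
    sweep = (a_max - a_min) % (2 * math.pi)          # arc from min tick to max tick
    pos = (a_needle - a_min) % (2 * math.pi)         # needle position along that arc
    frac = max(0.0, min(1.0, pos / sweep))           # clamp if needle is outside the scale
    return min_value + frac * (max_value - min_value)

# Example with made-up keypoints (pixel coordinates):
print(gauge_reading(center=(320, 240), needle_tip=(250, 180),
                    min_tick=(250, 310), max_tick=(390, 310)))
```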
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data scarcity and discontinuity are common in healthcare and epidemiological datasets, which are often needed to make informed decisions and forecast upcoming scenarios. To avoid these problems, the data are often processed as monthly or yearly aggregates, on which prevalent forecasting tools such as the Autoregressive Integrated Moving Average (ARIMA), the Seasonal Autoregressive Integrated Moving Average (SARIMA), and TBATS often fail to provide satisfactory results. Artificial data synthesis methods have proven to be a powerful tool for tackling these challenges. This paper proposes a novel Bayesian algorithm, Stochastic Bayesian Downscaling (SBD), that can regenerate downscaled time series of varying lengths from aggregated data while preserving most of the statistical characteristics and the aggregated sum of the original data. The paper presents two epidemiological time series case studies from Bangladesh (Dengue, Covid-19) to showcase the workflow of the algorithm. The case studies illustrate that the synthesized data agree with the original data in terms of statistical properties, trend, seasonality, and residuals. In terms of forecasting performance, using the last 12 years of Dengue infection data from Bangladesh, we were able to decrease error terms by up to 72.76% using synthetic data instead of the actual aggregated data.
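As a rough illustration of the downscaling idea (not the SBD algorithm itself), the Python sketch below splits an aggregated value into sub-period values whose sum is preserved, using Dirichlet-distributed weights; the concentration parameter and case counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def downscale_preserving_sum(aggregate, n_periods, concentration=5.0):
    """Toy illustration of sum-preserving downscaling (not the SBD algorithm):
    sub-period shares are drawn from a Dirichlet distribution, so they vary
    stochastically while their total always equals the aggregate."""
    weights = rng.dirichlet(np.full(n_periods, concentration))
    return aggregate * weights

# Example: break a yearly case count into 12 synthetic monthly counts.
yearly_cases = 12_000
monthly = downscale_preserving_sum(yearly_cases, 12)
print(monthly.round(1), monthly.sum())   # the sum equals 12000 (up to float precision)
```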
This demonstration utilized a fraud detection data set and kernel, referenced below, to showcase the accuracy and safety of using the products of the kymera fabrication machine.
The original data set we used is the Synthetic Financial Datasets For Fraud Detection. This file accurately mimics the features of the original data set while in fact generating the entire data set from scratch.
https://creativecommons.org/publicdomain/zero/1.0/
Context: This synthetic healthcare dataset has been created to serve as a valuable resource for data science, machine learning, and data analysis enthusiasts. It is designed to mimic real-world healthcare data, enabling users to practice, develop, and showcase their data manipulation and analysis skills in the context of the healthcare industry.
Inspiration: The inspiration behind this dataset is rooted in the need for practical and diverse healthcare data for educational and research purposes. Healthcare data is often sensitive and subject to privacy regulations, making it challenging to access for learning and experimentation. To address this gap, I have leveraged Python's Faker library to generate a dataset that mirrors the structure and attributes commonly found in healthcare records. By providing this synthetic data, I hope to foster innovation, learning, and knowledge sharing in the healthcare analytics domain.
Dataset Information: Each column provides specific information about the patient, their admission, and the healthcare services provided, making this dataset suitable for various data analysis and modeling tasks in the healthcare domain. Here's a brief explanation of each column in the dataset:
- Name: This column represents the name of the patient associated with the healthcare record.
- Age: The age of the patient at the time of admission, expressed in years.
- Gender: Indicates the gender of the patient, either "Male" or "Female."
- Blood Type: The patient's blood type, which can be one of the common blood types (e.g., "A+", "O-", etc.).
- Medical Condition: This column specifies the primary medical condition or diagnosis associated with the patient, such as "Diabetes," "Hypertension," "Asthma," and more.
- Date of Admission: The date on which the patient was admitted to the healthcare facility.
- Doctor: The name of the doctor responsible for the patient's care during their admission.
- Hospital: Identifies the healthcare facility or hospital where the patient was admitted.
- Insurance Provider: This column indicates the patient's insurance provider, which can be one of several options, including "Aetna," "Blue Cross," "Cigna," "UnitedHealthcare," and "Medicare."
- Billing Amount: The amount of money billed for the patient's healthcare services during their admission. This is expressed as a floating-point number.
- Room Number: The room number where the patient was accommodated during their admission.
- Admission Type: Specifies the type of admission, which can be "Emergency," "Elective," or "Urgent," reflecting the circumstances of the admission.
- Discharge Date: The date on which the patient was discharged from the healthcare facility, based on the admission date and a random number of days within a realistic range.
- Medication: Identifies a medication prescribed or administered to the patient during their admission. Examples include "Aspirin," "Ibuprofen," "Penicillin," "Paracetamol," and "Lipitor."
- Test Results: Describes the results of a medical test conducted during the patient's admission. Possible values include "Normal," "Abnormal," or "Inconclusive," indicating the outcome of the test.
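For a sense of how such records can be produced with Faker, here is a minimal hypothetical sketch; the value pools, ranges, and helper function are illustrative guesses, not the exact generation code behind this dataset.

```python
import random
from datetime import timedelta

from faker import Faker

fake = Faker()
random.seed(7)

def make_record():
    """Generate one hypothetical record following the column description above."""
    admission = fake.date_between(start_date="-2y", end_date="today")
    return {
        "Name": fake.name(),
        "Age": random.randint(18, 90),
        "Gender": random.choice(["Male", "Female"]),
        "Blood Type": random.choice(["A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"]),
        "Medical Condition": random.choice(["Diabetes", "Hypertension", "Asthma"]),
        "Date of Admission": admission,
        "Doctor": fake.name(),
        "Hospital": fake.company(),
        "Insurance Provider": random.choice(
            ["Aetna", "Blue Cross", "Cigna", "UnitedHealthcare", "Medicare"]),
        "Billing Amount": round(random.uniform(500.0, 50_000.0), 2),
        "Room Number": random.randint(100, 500),
        "Admission Type": random.choice(["Emergency", "Elective", "Urgent"]),
        "Discharge Date": admission + timedelta(days=random.randint(1, 30)),
        "Medication": random.choice(["Aspirin", "Ibuprofen", "Penicillin", "Paracetamol", "Lipitor"]),
        "Test Results": random.choice(["Normal", "Abnormal", "Inconclusive"]),
    }

print(make_record())
```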
Usage Scenarios: This dataset can be utilized for a wide range of purposes, including:
- Developing and testing healthcare predictive models.
- Practicing data cleaning, transformation, and analysis techniques.
- Creating data visualizations to gain insights into healthcare trends.
- Learning and teaching data science and machine learning concepts in a healthcare context.
- Treating it as a multi-class classification problem and solving it for Test Results, which contains 3 categories (Normal, Abnormal, and Inconclusive).
Acknowledgments: Image Credit: Image by BC Y from Pixabay
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We provide a synthetic reference data set covering over 100,000 labeled references (mostly Russian language) and a manually annotated set of real references (771 in number) gathered from multidisciplinary Cyrillic script publications.
Background:
Extracting structured data from bibliographic references is a crucial task for the creation of scholarly databases. While approaches, tools, and evaluation data sets for the task exist, there is a distinct lack of support for languages other than English and scripts other than the Latin alphabet. A significant portion of the scientific literature that is thereby excluded consists of publications written in Cyrillic script languages. To address this problem, we introduce a new multilingual and multidisciplinary data set of over 100,000 labeled reference strings. The data set covers multiple Cyrillic languages and contains over 700 manually labeled references, while the remainder are generated synthetically. Using random samples of varying sizes from this data, we train multiple well-performing sequence-labeling BERT models and thus show the usability of our proposed data set. To this end, we showcase an implementation of a multilingual BERT model trained on the synthetic data and evaluated on the manually labeled references. Our model achieves an F1 score of 0.93 and thereby significantly outperforms a state-of-the-art model we retrain and evaluate on our data.
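As a sketch of the modelling setup (multilingual BERT fine-tuned for token-level sequence labeling of reference strings), the snippet below shows how such a model can be instantiated with the Hugging Face transformers library. The label set is an assumption for illustration, the actual label scheme is defined in the linked repository, and the model here is untrained, so its predictions are meaningless.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Assumed label scheme for illustration only; see the linked repo for the real one.
labels = ["O", "B-AUTHOR", "I-AUTHOR", "B-TITLE", "I-TITLE", "B-YEAR"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels))

reference = "Иванов И. И. Методы анализа данных. Москва: Наука, 2015."
inputs = tokenizer(reference, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                 # (1, seq_len, num_labels)

pred_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for tok, pid in zip(tokens, pred_ids):
    print(tok, labels[pid])                         # untrained model => arbitrary labels
```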
The code for generating the data set is available at https://github.com/igor261/Sequence-Labeling-for-Citation-Field-Extraction-from-Cyrillic-Script-References
When using the data set, please cite the following paper:
Igor Shapiro, Tarek Saier, Michael Färber: "Sequence Labeling for Citation Field Extraction from Cyrillic Script References". In Proceedings of the AAAI-22 Workshop on Scientific Document Understanding (SDU@AAAI'22), 2022.
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset simulates a medical appointment scheduling system, designed to demonstrate practical applications of data generation techniques in the healthcare field. Although synthetic, the data is based on real-world values to enhance its realism and utility.
The primary goals of this dataset are:
The dataset contains three main tables:
The dataset simulates a medical office operating Monday to Friday, from 8:00 AM to 6:00 PM, with appointments scheduled every 15 minutes (4 per hour). Key parameters include:
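A toy Python sketch of the slot grid implied by these parameters (Monday to Friday, 8:00 AM to 6:00 PM, one slot every 15 minutes) is shown below; it only illustrates the schedule, not the actual generation scripts or the three tables.

```python
from datetime import datetime, timedelta

def weekly_slots(week_start: datetime):
    """Enumerate the appointment slots of one week: Mon-Fri, 8:00-18:00, every 15 min."""
    slots = []
    for day in range(5):                                  # Monday .. Friday
        t = week_start + timedelta(days=day, hours=8)     # 8:00 AM
        end = week_start + timedelta(days=day, hours=18)  # 6:00 PM
        while t < end:
            slots.append(t)
            t += timedelta(minutes=15)                    # 4 slots per hour
    return slots

slots = weekly_slots(datetime(2024, 1, 1))                # 2024-01-01 is a Monday
print(len(slots), slots[0], slots[-1])                    # 200 slots per week
```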
Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
🏦 Synthetic Loan Approval Dataset
A Realistic, High-Quality Dataset for Credit Risk Modelling
🎯 Why This Dataset?
Most loan datasets on Kaggle have unrealistic patterns where:
Unlike most loan datasets available online, this one is built on real banking criteria from US and Canadian financial institutions. Drawing from 3 years of hands-on finance industry experience, the dataset incorporates realistic correlations and business logic that reflect how actual lending decisions are made. This makes it perfect for data scientists looking to build portfolio projects that showcase not just coding ability, but genuine understanding of credit risk modelling.
📊 Dataset Overview
| Metric | Value |
|---|---|
| Total Records | 50,000 |
| Features | 20 (customer_id + 18 predictors + 1 target) |
| Target Distribution | 55% Approved, 45% Rejected |
| Missing Values | 0 (Complete dataset) |
| Product Types | Credit Card, Personal Loan, Line of Credit |
| Market | United States & Canada |
| Use Case | Binary Classification (Approved/Rejected) |
🔑 Key Features
Identifier:
- Customer ID (unique identifier for each application)

Demographics:
- Age, Occupation Status, Years Employed

Financial Profile:
- Annual Income, Credit Score, Credit History Length
- Savings/Assets, Current Debt

Credit Behaviour:
- Defaults on File, Delinquencies, Derogatory Marks

Loan Request:
- Product Type, Loan Intent, Loan Amount, Interest Rate

Calculated Ratios:
- Debt-to-Income, Loan-to-Income, Payment-to-Income
💡 What Makes This Dataset Special?
1️⃣ Real-World Approval Logic
The dataset implements actual banking criteria (a rough code sketch of these rules follows the lists below):
- DTI ratio > 50% = automatic rejection
- Defaults on file = instant reject
- Credit score bands match real lending thresholds
- Employment verification for loans ≥$20K

2️⃣ Realistic Correlations
- Higher income → Better credit scores
- Older applicants → Longer credit history
- Students → Lower income, special treatment for small loans
- Loan intent affects approval (Education best, Debt Consolidation worst)

3️⃣ Product-Specific Rules
- Credit Cards: More lenient, higher limits
- Personal Loans: Standard criteria, up to $100K
- Line of Credit: Capped at $50K, manual review for high amounts

4️⃣ Edge Cases Included
- Young applicants (age 18) building first credit
- Students with thin credit files
- Self-employed with variable income
- High debt-to-income ratios
- Multiple delinquencies
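To make the approval logic under 1️⃣ concrete, here is a hypothetical Python re-implementation of the headline rules. The field names are illustrative, not the dataset's actual column names, and any threshold not explicitly stated above (such as the credit-score cut-off) is an invented assumption; the dataset's full generation logic is of course richer than this.

```python
def approve(application: dict) -> bool:
    """Hypothetical sketch of the headline approval rules listed above."""
    if application["defaults_on_file"] > 0:        # defaults on file = instant reject
        return False
    if application["debt_to_income"] > 0.50:       # DTI > 50% = automatic rejection
        return False
    if application["loan_amount"] >= 20_000 and not application["employment_verified"]:
        return False                               # employment verification for large loans
    return application["credit_score"] >= 640      # assumed cut-off within a realistic band

print(approve({"defaults_on_file": 0, "debt_to_income": 0.32,
               "loan_amount": 15_000, "employment_verified": False,
               "credit_score": 700}))              # True
```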
🎓 Perfect For
- Machine Learning Practice: Binary classification with real patterns
- Credit Risk Modelling: Learn actual lending criteria
- Portfolio Projects: Build impressive, explainable models
- Feature Engineering: Rich dataset with meaningful relationships
- Business Analytics: Understand financial decision-making
📈 Quick Stats
Approval Rates by Product
- Credit Card: 60.4% (more lenient)
- Personal Loan: 46.9% (standard)
- Line of Credit: 52.6% (moderate)

Loan Intent (Best → Worst Approval Odds)
1. Education (63% approved)
2. Personal (58% approved)
3. Medical/Home (52% approved)
4. Business (48% approved)
5. Debt Consolidation (40% approved)

Credit Score Distribution
- Mean: 644
- Range: 300-850
- Realistic bell curve around 600-700

Income Distribution
- Mean: $50,063
- Median: $41,608
- Range: $15K - $250K
🎯 Expected Model Performance
With proper feature engineering and tuning:
- Accuracy: 75-85%
- ROC-AUC: 0.80-0.90
- F1-Score: 0.75-0.85

Important: Feature importance should show:
1. Credit Score (most important)
2. Debt-to-Income Ratio
3. Delinquencies
4. Loan Amount
5. Income
If your model shows different patterns, something's wrong!
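A quick, hypothetical sanity check along these lines could look like the sketch below. The file name loan_approval.csv and the column names customer_id and approved are guesses for illustration; adjust them to the actual files.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical file and column names; replace with the real ones.
df = pd.read_csv("loan_approval.csv")
y = df["approved"]
X = pd.get_dummies(df.drop(columns=["customer_id", "approved"]))   # one-hot categoricals

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top = sorted(zip(X.columns, model.feature_importances_), key=lambda t: -t[1])[:5]
print(top)   # expect credit score and debt-to-income ratio near the top
```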
🏆 Use Cases & Projects
Beginner
- Binary classification with XGBoost/Random Forest
- EDA and visualization practice
- Feature importance analysis

Intermediate
- Custom threshold optimization (profit maximization)
- Cost-sensitive learning (false positive vs false negative)
- Ensemble methods and stacking

Advanced
- Explainable AI (SHAP, LIME)
- Fairness analysis across demographics
- Production-ready API with FastAPI/Flask
- Streamlit deployment with business rules
⚠️ Important Notes
This is SYNTHETIC Data
- Generated based on real banking criteria
- No real customer data was used
- Safe for public sharing and portfolio use

Limitations
- Simplified approval logic (real banks use 100+ factors)
- No temporal component (no time series)
- Single country/currency assumed (USD)
- No external factors (economy, market conditions)

Educational Purpose
This dataset is designed for:
- Learning credit risk modeling
- Portfolio projects
- ML practice
- Understanding lending criteria

NOT for:
- Actual lending decisions
- Financial advice
- Production use without validation
🤝 Contributing
Found an issue? Have suggestions?
- Open an issue on GitHub
- Suggest i...
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ChatGPT, a general artificial intelligence, has been recognized as a powerful tool in scientific writing and programming, but its use as a medical tool is largely overlooked. The general accessibility, rapid response time and comprehensive training database might enable ChatGPT to serve as a diagnostic augmentation tool in certain clinical settings. The diagnostic process in neurology is often challenging and complex. In certain time-sensitive scenarios, rapid evaluation and diagnostic decisions are needed, while in other cases clinicians are faced with rare disorders and atypical disease manifestations. Due to these factors, the diagnostic accuracy in neurology is often suboptimal. Here we evaluated whether ChatGPT can be utilized as a valuable and innovative diagnostic augmentation tool in various neurological settings. We used synthetic data generated by neurological experts to represent descriptive anamneses of patients with known neurology-related diseases, then measured the probability of an appropriate diagnosis made by ChatGPT. To give clarity to the accuracy of the AI-determined diagnosis, all cases were cross-validated by other experts and by general medical doctors as well. We found that ChatGPT's diagnostic accuracy (ranging from 68.5% ± 3.28% to 83.83% ± 2.73%) can reach the accuracy of other experts (81.66% ± 2.02%); furthermore, it surpasses the probability of an appropriate diagnosis when the examiner is a general medical doctor (57.15% ± 2.64%). Our results showcase the efficacy of a general artificial intelligence like ChatGPT as a diagnostic augmentation tool in medicine. In the future, AI-based supporting tools might be useful amendments to medical practice and help to improve the diagnostic process in neurology.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Sua Música (suamusica.com.br) is one of the largest online platforms in Latin America and the ultimate online destination for Brazilian music enthusiasts. Whether you're a passionate listener, a budding musician, or simply curious about the rich sounds of Brazilian culture, you've come to the right place. At suamusica.com.br, we revolutionize the way music is shared and enjoyed in Brazil. Our platform offers a vast collection of songs, albums, and playlists spanning various genres and artists, ensuring that there's something for everyone. From samba and bossa nova to funk and pagode, our extensive catalog covers it all. Explore, discover, and create personalized playlists that match your mood and taste. Immerse yourself in exclusive content, such as live performances, interviews, and behind-the-scenes glimpses into the lives of your favorite musicians. Join our thriving community of music lovers, connect with fellow fans, and embark on a musical journey that will transport you to the vibrant world of Brazilian music. Experience the rhythm, energy, and diversity of suamusica.com.br, and let the melodies of Brazil captivate your senses.
Welcome to the Kaggle challenge dedicated to creating a recommendation system for the suamusica.com.br platform! If you're passionate about music and data science, this challenge is the perfect opportunity to showcase your skills and contribute to enhancing the music experience for its users. The suamusica.com.br platform, with its vast collection of songs and genres, presents an exciting opportunity to develop an intelligent recommendation system that can suggest personalized music choices to users based on their preferences. By participating in this challenge, you'll dive into the world of collaborative filtering, machine learning algorithms, and data analysis to create a recommendation system that will revolutionize how users discover new music on suamusica.com.br. Join us on this exciting journey and let's unlock the power of data to provide personalized music recommendations to millions of users.
This challenge is a little different from what Kaggle users are used to: it is not only about machine learning and high accuracy. We expect you to create a pipeline for a recommendation system for a music streaming platform. We provide a script that creates synthetic data: one dataset contains transactional data with play counts by user and by day, a second contains dimensional data correlating track ids with artist ids and musical genre, and a final dataset contains metrics about artists. The actual values are not the main point of the challenge; the pipeline is. Focus on the algorithms you can use and the types of features you can use, bearing in mind that it is a streaming platform, so think about average track duration, likes, follows, plays received on specific days of the week or at specific times of the day, bpm of songs, genres, and so forth. The synthetic data generation scripts were also left as a challenge if you want to improve them, for example by creating correlations between features, adding metric features to the transactional data, or adding more information to the dimensional datasets. Explore all the information available and be thorough in the pipeline description; the ETL is also very important, and naming technologies (stacks) matters as well, for example the use of AWS Lambdas or Airflow to orchestrate the whole pipeline. The codes are written in Python, mostly NumPy. Feel free to explore other libraries with out-of-the-box solutions, but keep in mind that we will score creative and technical solutions with deterministic mathematical and statistical algorithms higher. Another important point: once the model is deployed, describe how you would monitor the performance of your system, which performance indicators (KPIs) you would use, and how you would measure them.
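As a starting point for the modelling part, here is a tiny item-based collaborative-filtering sketch in NumPy (the style the provided scripts use). The play-count matrix below is invented; in a real pipeline it would be aggregated from the synthetic transactional data.

```python
import numpy as np

# Invented user x track play-count matrix, only to illustrate the technique.
plays = np.array([
    [12, 0, 3, 0],   # plays per track for user 0
    [0, 7, 0, 5],
    [10, 1, 4, 0],
    [0, 6, 0, 8],
], dtype=float)

# Cosine similarity between track columns.
norms = np.linalg.norm(plays, axis=0, keepdims=True)
norms[norms == 0] = 1.0
sim = (plays / norms).T @ (plays / norms)          # (n_tracks, n_tracks)
np.fill_diagonal(sim, 0.0)

# Score unseen tracks for a user by similarity-weighted plays, then recommend.
user = 0
scores = sim @ plays[user]
scores[plays[user] > 0] = -np.inf                  # do not re-recommend played tracks
print("recommended track:", int(np.argmax(scores)))
```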
The scripts provided by the Data Science team of Sua Música do not contain information about the platform database; the averages and standard deviations do not represent statistical population information about the platform's users. The data structure is also generic and represents the usual refined relational datasets that any streaming platform data team would possess.