Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is a standard table representing steps of patient care. It contains 4 standard variables: a patient identifier, the label of the step, the start date of the step, and the end date of the step. One patient may have several steps. The step labels are synthetic (i.e., A, B, C, D, E, F) and may correspond to stays in a care unit, successive drug administrations, or medical procedures performed.
This dataset is used for a tutorial dedicated to the Sankey diagram: https://gitlab.com/d8096/health_data_science_tutorials/-/tree/main/tutorials/sankey_diagram
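For a sense of how such a step table feeds a Sankey diagram, here is a minimal Python sketch (not the tutorial's actual code; the column names and example rows are hypothetical, and pandas plus plotly are assumed):

```python
import pandas as pd
import plotly.graph_objects as go

# Hypothetical stand-in for the step table described above:
# one row per care step, with patient id, step label, start and end dates.
steps = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3],
    "step": ["A", "B", "A", "C", "B"],
    "start": pd.to_datetime(["2021-01-01", "2021-01-05", "2021-01-02",
                             "2021-01-08", "2021-01-03"]),
    "end": pd.to_datetime(["2021-01-05", "2021-01-09", "2021-01-08",
                           "2021-01-12", "2021-01-06"]),
})

# Order each patient's steps chronologically and count the transitions
# between consecutive steps; these counts become the Sankey links.
steps = steps.sort_values(["patient_id", "start"])
steps["next_step"] = steps.groupby("patient_id")["step"].shift(-1)
links = (steps.dropna(subset=["next_step"])
              .groupby(["step", "next_step"]).size().reset_index(name="n"))

labels = sorted(set(links["step"]) | set(links["next_step"]))
index = {label: i for i, label in enumerate(labels)}

fig = go.Figure(go.Sankey(
    node=dict(label=labels),
    link=dict(source=links["step"].map(index),
              target=links["next_step"].map(index),
              value=links["n"]),
))
fig.show()
```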
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Synthetic data, source code, and supplementary text for the article "Euler deconvolution of potential field data" by Leonardo Uieda, Vanderlei C. Oliveira Jr., and Valéria C. F. Barbosa. This is part of a tutorial submitted to The Leading Edge (http://library.seg.org/journal/tle). Results were generated using the open-source Python package Fatiando a Terra version 0.2 (http://www.fatiando.org). This material, along with the manuscript, can also be found at https://github.com/pinga-lab/paper-tle-euler-tutorial

Synthetic data and model

Examples in the tutorial use synthetic data generated with the IPython notebook create_synthetic_data.ipynb. File synthetic_data.txt has 4 columns: x (north), y (east), z (down), and the total field magnetic anomaly. x, y, and z are in meters. The total field anomaly is in nanoTesla (nT). File metadata.json contains extra information about the data, such as inclination and declination of the inducing field (in degrees), shape of the data grid (number of points in y and x, respectively), the area containing the data (W, E, S, N, in meters), and the model boundaries (W, E, S, N, top, bottom, in meters). File model.pickle is a serialized version of the model used to generate the data. It contains a list of instances of the PolygonalPrism class of Fatiando a Terra. The serialization was done using the cPickle Python module.

Reproducing the results in the tutorial

The notebook euler-deconvolution-examples.ipynb runs the Euler deconvolution on the synthetic data and generates the figures for the manuscript. It also presents a more detailed explanation of the method and more tests than went into the finished manuscript.
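Not part of the deposit, but a minimal Python sketch of reading the files described above (the metadata key name is an assumption, and unpickling the model requires the fatiando package to be importable):

```python
import json
import pickle

import numpy as np

# synthetic_data.txt: 4 columns as documented above
# (x north, y east, z down, in meters; total field anomaly in nT).
x, y, z, anomaly = np.loadtxt("synthetic_data.txt", unpack=True)

# metadata.json: inducing-field direction, grid shape, area, model bounds.
# The exact key names are an assumption; adjust to the actual file.
with open("metadata.json") as f:
    meta = json.load(f)
shape = tuple(meta["shape"])  # (points in y, points in x)

# model.pickle: a list of Fatiando a Terra PolygonalPrism instances,
# serialized with cPickle under Python 2, hence the latin1 encoding.
with open("model.pickle", "rb") as f:
    model = pickle.load(f, encoding="latin1")

print(anomaly.reshape(shape).shape)
```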
The deposit contains a dataset created for the paper "Many Models in R: A Tutorial". ncds.Rds is an R-format synthetic dataset created with the synthpop package in R, using data from the National Child Development Study (NCDS), a birth cohort of individuals born in a single week of March 1958 in Britain. The dataset contains data on fourteen biomarkers collected at the age 46/47 sweep of the survey, four measures of cognitive ability from ages 11 and 16, and three covariates: sex, body mass index at age 11, and father's social class. The data are only intended to be used in the tutorial; they are not to be used for drawing statistical inferences.
This project contains data used in the paper "Many Models in R: A Tutorial". The data are a simplified, synthetic, and imputed version of the National Child Development Study. There are variables for 14 biomarkers from the age 46/47 biomedical survey, 4 measures of cognitive ability from tests at ages 11 and 16, and 3 covariates (sex, father's socioeconomic class, and BMI at age 11).
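For readers working outside R, a minimal sketch of loading the file in Python with the pyreadr package (an illustration, not part of the deposit):

```python
import pyreadr

# read_r returns a dict-like object; for an .Rds file the single
# DataFrame is stored under the key None.
ncds = pyreadr.read_r("ncds.Rds")[None]

# Expect 14 biomarkers + 4 cognitive measures + 3 covariates = 21 columns.
print(ncds.shape)
print(list(ncds.columns))
```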
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 6,000 example images generated with the process described in Roboflow's How to Create a Synthetic Dataset tutorial.
The images are composed of a background (randomly selected from Google's Open Images dataset) and a number of fruits (from Horea94's Fruit Classification Dataset) superimposed on top with a random orientation, scale, and color transformation. All images are 416x550 to simulate a smartphone aspect ratio.
To generate your own images, follow our tutorial or download the code.
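For a sense of the compositing step, a minimal Pillow sketch (the file names and transform ranges are hypothetical, not the tutorial's actual code):

```python
import random

from PIL import Image, ImageEnhance

# Hypothetical inputs standing in for an Open Images background and a
# fruit cutout with an alpha channel from the Fruit Classification Dataset.
background = Image.open("background.jpg").convert("RGB").resize((416, 550))
fruit = Image.open("fruit.png").convert("RGBA")
fruit.thumbnail((200, 200))  # keep the cutout smaller than the background

# Random orientation, scale, and color transformation, as described above.
fruit = fruit.rotate(random.uniform(0, 360), expand=True)
scale = random.uniform(0.3, 1.0)
fruit = fruit.resize((int(fruit.width * scale), int(fruit.height * scale)))
fruit = ImageEnhance.Color(fruit).enhance(random.uniform(0.5, 1.5))

# Paste at a random position, using the alpha channel as the mask,
# and keep the bounding box as the detection label.
x = random.randint(0, background.width - fruit.width)
y = random.randint(0, background.height - fruit.height)
background.paste(fruit, (x, y), fruit)
print("bounding box:", (x, y, x + fruit.width, y + fruit.height))
background.save("synthetic_example.jpg")
```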
Example: https://blog.roboflow.ai/content/images/2020/04/synthetic-fruit-examples.jpg
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
MikroTik RouterOS Configuration Dataset
A structured dataset containing MikroTik RouterOS configuration guides and tutorials, generated using Google's Gemini 2.0 model (model_name="gemini-2.0-flash-exp").
Dataset Details
Size: 3,000+ configuration examples
Source: synthetic data generated with the gemini-2.0-flash-exp LLM
Format: Parquet file with structured columns
Columns
filename: original MD file name
title: configuration guide title
prompt: scenario description
… See the full description on the dataset page: https://huggingface.co/datasets/vivek-dodia/synthetic-data-gemini-2.0-ComplexConfigurations.
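A minimal sketch of loading the dataset with the Hugging Face datasets library (the split name "train" is an assumption):

```python
from datasets import load_dataset

# Load the Parquet-backed dataset straight from the Hugging Face Hub.
ds = load_dataset("vivek-dodia/synthetic-data-gemini-2.0-ComplexConfigurations",
                  split="train")

print(ds.column_names)   # expect filename, title, prompt, ...
print(ds[0]["title"])
```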
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This is a Lego brick image dataset annotated in PASCAL VOC format, ready for an ML object detection pipeline. Additionally, I made tutorials on how to:
- Generate synthetic images and create bounding box annotations in Pascal VOC format using Blender.
- Train ML models (YoloV5 and SSD) for detecting multiple objects in an image.
The tutorials, with Blender scripts for rendering the dataset and Jupyter notebooks for training the ML models, can be found here: https://github.com/mantyni/Multi-object-detection-lego
The dataset contains:
- Lego brick images in JPG format, 300x300 resolution
- Annotations in PASCAL VOC format
There are 6 Lego bricks in this dataset, each appearing approximately 600 times across the dataset: brick_2x2, brick_2x4, brick_1x6, plate_1x2, plate_2x2, plate_2x4.
Lego brick 3D models obtained from: Mecabricks - https://www.mecabricks.com/
The first 500 images show individual Lego bricks rendered at different angles and on different backgrounds; the images after that contain multiple bricks. Each image is rendered with varying backgrounds, brick colours, and shadows to enable Sim2Real transfer. After training the ML models (YoloV5 and SSD) on the synthetic dataset, I tested them on real images, achieving ~70% detection accuracy.
The main purpose of this project is to show how to create your own realistic synthetic image datasets for training computer vision models without needing real-world data.
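For orientation, a minimal sketch (standard library only) of reading one of the PASCAL VOC annotation files; the file path is hypothetical:

```python
import xml.etree.ElementTree as ET

# One PASCAL VOC XML file per image; the path here is hypothetical.
root = ET.parse("annotations/image_0001.xml").getroot()

for obj in root.iter("object"):
    name = obj.findtext("name")  # e.g. brick_2x4 or plate_1x2
    box = obj.find("bndbox")
    xmin, ymin = float(box.findtext("xmin")), float(box.findtext("ymin"))
    xmax, ymax = float(box.findtext("xmax")), float(box.findtext("ymax"))
    print(name, (xmin, ymin, xmax, ymax))
```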
This dataset was created to be the basis of the data.world SQL tutorial exercises. Data was generated using Synthea, a synthetic patient generator that models the medical history of synthetic patients. Its mission is to output high-quality synthetic, realistic but not real, patient data and associated health records covering every aspect of healthcare. The resulting data is free from cost, privacy, and security restrictions, enabling research with health IT data that is otherwise legally or practically unavailable. De-identified real data still presents a challenge in the medical field because there are people who excel at re-identifying such data. For that reason the average medical center will not share its patient data. Most governmental data is at the hospital level; NHANES data is an exception.
You can read Synthea's first academic paper here.
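By way of illustration, a self-contained sketch of the kind of SQL exercise this dataset supports; the toy table and columns below are hypothetical and much simpler than Synthea's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified stand-in for a Synthea-style patients table.
conn.execute("""CREATE TABLE patients (
    id INTEGER PRIMARY KEY, gender TEXT, birth_year INTEGER)""")
conn.executemany("INSERT INTO patients VALUES (?, ?, ?)",
                 [(1, "F", 1956), (2, "M", 1989), (3, "F", 2001)])

# Example tutorial-style query: patient counts by gender.
for gender, n in conn.execute(
        "SELECT gender, COUNT(*) FROM patients GROUP BY gender"):
    print(gender, n)
```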
Photo by Rubaitul Azad on Unsplash
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This upload contains the data required to replicate a tutorial that applies regression-based unmixing of spectral-temporal metrics for sub-pixel land cover mapping with synthetically created training data. The tutorial uses the Framework for Operational Radiometric Correction for Environmental monitoring (FORCE).
This dataset contains intermediate and final results of the workflow described in that tutorial as well as auxiliary data such as parameter files.
Please refer to the above-mentioned tutorial for more information.
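To make the idea concrete, here is a sketch of regression-based unmixing with scikit-learn on made-up arrays; this illustrates the general technique only, not the tutorial's FORCE workflow:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic training data: pure endmember spectra mixed in random fractions,
# mimicking the synthetically created training data described above.
n_bands, n_endmembers, n_samples = 10, 3, 500
endmembers = rng.random((n_endmembers, n_bands))
fractions = rng.dirichlet(np.ones(n_endmembers), size=n_samples)
spectra = fractions @ endmembers + rng.normal(0, 0.01, (n_samples, n_bands))

# Regress the fraction of the first endmember from the mixed spectra.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(spectra, fractions[:, 0])

# Predict sub-pixel cover fractions for new (here: synthetic) pixels.
test = rng.dirichlet(np.ones(n_endmembers), size=5) @ endmembers
print(model.predict(test))
```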
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A synthetic Grammar Error Correction dataset consisting of 185 million sentence pairs, created by running a Tagged Corruption model on Google's C4 dataset.
This version of the dataset was extracted from Li Liwei's HuggingFace dataset (https://huggingface.co/datasets/liweili/c4_200m) and converted to TSV format.
The corruption edits by Felix Stahlberg and Shankar Kumar are licensed under CC BY 4.0. The C4 dataset was released by AllenAI under the terms of ODC-BY. By using this dataset, you are also bound by the Common Crawl terms of use in respect of the content contained in it.
This dataset is now converted to Parquet format; a TSV version is available in previous versions. The reason for the conversion was the poor performance when accessing each file. I'm open to requests and suggestions on how to better handle such a big dataset.
The TSV version is split into 10 files of approximately 18M samples each. Each sample is a pair formed by the incorrect and the corrected sentence:

| Incorrect | Corrected |
| --------- | --------- |
| Much many brands and sellers still in the market. | Many brands and sellers still in the market. |
| She likes playing in park and come here every week | She likes playing in the park and comes here every week |
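A minimal sketch of streaming one of the TSV shards with pandas (the shard file name below is hypothetical):

```python
import pandas as pd

# Each row is an (incorrect, corrected) pair; shards hold ~18M rows,
# so read in chunks rather than loading a whole file at once.
chunks = pd.read_csv("c4_200m_shard_00.tsv", sep="\t",
                     names=["incorrect", "corrected"],
                     quoting=3,  # csv.QUOTE_NONE: sentences may contain quotes
                     chunksize=100_000)

first = next(iter(chunks))
print(first.iloc[0]["incorrect"], "->", first.iloc[0]["corrected"])
```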
I'm planning to release a notebook where I'll show Grammar Error Correction using a seq2seq architecture based on BERT and LSTM. Until then, you can try to build your own model!
This dataset can be used to train sequence-to-sequence models based on the encoder-decoder approach.
The task is quite similar to the NMT task, here are some tutorials:
- NLP from scratch: translation with a seq2seq network and attention
- Language Translation with nn.Transformer and torchtext
Grammar Error Correction example: https://production-media.paperswithcode.com/tasks/gec_foTfIZW.png
Thanks to the dataset creators Felix Stahlberg and Shankar Kumar and to Li Liwei for first giving access to the processed dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This tutorial includes two PowerPoint presentations developed by Mary Mangan from OpenHelix. Students should start with the Introduction prior to moving on to the Advanced tutorial. The slide decks include numerous comments that will help students go through the tutorials. In order to perform the hands-on activities, students need to download the GenoCAD Training Set. This dataset includes a list of parts and a grammar used as part of the GenoCAD Introductory tutorial. To import this data set into GenoCAD, proceed as follows:
1. Log into GenoCAD; create an account if you don't already have one.
2. Click on the Parts tab.
3. Click on the Grammars tab.
4. Click on the Add/Import Grammar button.
5. Using the "choose file" button, select the grammar file (.genocad) and click on import grammar.
6. Click on "use existing icon set" and click on "continue import".
Upon completion of this procedure you should have a new grammar with a library of 37 parts in your workspace.
The tutorial also includes a series of additional exercises that reinforce the concepts introduced in the tutorial. Please visit the GenoCAD page for videos of the tutorials.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This fileset includes a grammar file (.zip) and a list of parts used as part of the GenoCAD Introductory tutorial. To import this data set into GenoCAD, proceed as follows:
1. Log into GenoCAD; create an account if you don't already have one.
2. Click on the Parts tab.
3. Click on the Grammars tab.
4. Click on the Add/Import Grammar button.
5. Using the "choose file" button, select the grammar file (.zip) and click on import grammar.
6. Click on "use existing icon set" and click on "continue import".
7. Click on "1-Parts" to return to the parts management tool.
8. On the Libraries tab, click on the "New Library" button.
9. Select the "Training Set E.Coli v3" grammar and give the parts library a name such as "Training Set Parts Library".
10. Click on the "My Parts" tab and click on the "Import parts" button.
11. Select the "Training Set Parts Library" parts library and the tab-delimited option. Select the Training_Set_Parts_v3.txt file and click import.
Upon completion of this procedure you should have a new grammar with a library of 37 parts in your workspace.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Synthetic data generating parameters. The table summarizes the generating parameters for the synthetic networks, showing for each parameter the corresponding symbol, name, and range after applying the constraints in Section e.2.
This dataset was collected by an edtech startup that teaches entrepreneurial life skills to kids aged 6-14 through an animated, gamified video series. Through its learning management system the company tracks the progress made by all of its subscribers on the platform. The company records platform content usage activity data and tries to follow up with parents if their child is inactive on the platform. Here's more information about the dataset:
There is some missing data as well. I hope it will be a good dataset for beginners practicing their NLP skills.
Image by Steven Weirather from Pixabay
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Synthetic test data for a tutorial that explains how to convert spreadsheet data to tidy data.
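As a small illustration of the wide-to-tidy conversion (with made-up column names, not the tutorial's actual data):

```python
import pandas as pd

# Wide "spreadsheet" layout: one column per year (made-up example data).
wide = pd.DataFrame({"site": ["A", "B"],
                     "2019": [10, 20],
                     "2020": [12, 18]})

# Tidy layout: one row per observation (site, year, value).
tidy = wide.melt(id_vars="site", var_name="year", value_name="value")
print(tidy)
```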
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Synthetic extracellular recordings generated by MEArec for the SpikeInterface tutorials and results.
Dataset descriptions
recordings_36cells_four-tetrodes_30.0_10.0uV_20-06-2019_14_48.h5: extracellular recording from 4 tetrodes on a shank. Each tetrode is in a diamond configuration and the inter-tetrode distance is 300 um. There are 36 ground-truth neurons distributed over the 4 tetrodes. The duration is 30 seconds and the noise is uncorrelated Gaussian noise with 10 uV standard deviation. It is used in this notebook.
recordings_50cells_SqMEA-10-15um_60.0_10.0uV_27-03-2019_13-31.h5: extracellular recording from a square MEA in a 10x10 configuration with 15 um inter-electrode distance. There are 50 ground-truth neurons. The duration is 60 seconds and the noise is uncorrelated Gaussian noise with 10 uV standard deviation. It is used in this notebook.
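A minimal sketch of loading one of these files, assuming a recent SpikeInterface version in which read_mearec returns both the recording and the ground-truth sorting:

```python
import spikeinterface.extractors as se

# MEArec files bundle the extracellular traces and the ground-truth spikes.
recording, sorting_true = se.read_mearec(
    "recordings_36cells_four-tetrodes_30.0_10.0uV_20-06-2019_14_48.h5")

print(recording.get_num_channels(), "channels,",
      recording.get_total_duration(), "s")
print(len(sorting_true.get_unit_ids()), "ground-truth units")
```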
The formation of oxygen–carbon bonds is one of the fundamental transformations in organic synthesis. In this regard, the application of palladium-based catalysts has been studied extensively in recent years. It is now an established methodology whose success has been proven in manifold synthetic procedures. This tutorial review summarizes the advances in palladium-catalysed C–O bond formation, namely hydroxylation and alkoxylation reactions.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The main reason for making this dataset is the publication of the paper Learning from Simulated and Unsupervised Images through Adversarial Training and the idea of the SimGAN. The dataset and kernels should make it easier to get started building SimGAN networks, testing them, and comparing them to other approaches such as KNN, GAN, InfoGAN, and the like.
gaze.csv: A full table of values produced by the UnityEyes tool for every image in the gaze.h5 file
gaze.json: A json version of the CSV table (easier to read in pandas)
gaze.h5: The synthetic gazes from the UnityEyes tool
real_gaze.h5: The gaze images from MPII packed into a single hdf5
The synthetic images were generated with the Windows version of UnityEyes: http://www.cl.cam.ac.uk/research/rainbow/projects/unityeyes/tutorial.html
The real images were taken from the MPIIGaze project (https://www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/research/gaze-based-human-computer-interaction/appearance-based-gaze-estimation-in-the-wild-mpiigaze/), which can be cited as: Appearance-based Gaze Estimation in the Wild, X. Zhang, Y. Sugano, M. Fritz and A. Bulling, Proc. of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 4511-4520.
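A quick way to inspect the HDF5 files with h5py; the internal dataset keys are not documented above, so this sketch just walks the file and prints what it finds:

```python
import h5py

# Walk the HDF5 hierarchy and print every dataset's path, shape, and dtype.
def show(name, obj):
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, obj.dtype)

with h5py.File("gaze.h5", "r") as f:
    f.visititems(show)
```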
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ANOVA results for the distributions in Fig 8. Four different one-way ANOVAs, one for each parameter combination in toy example #2. The corresponding p and F values are shown in this table.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Classification performance for the Breast Cancer Wisconsin dataset. Performance indices are extracted from the confusion matrix obtained by comparing the original ground truth with the labels given by the Fiedler vector.