100+ datasets found

D
SYNERGY - Open machine learning dataset on study selection in systematic...
dataverse.nl
csv, json, txt, zip
Updated Apr 24, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jonathan De Bruin; Jonathan De Bruin; Yongchao Ma; Yongchao Ma; Gerbrich Ferdinands; Gerbrich Ferdinands; Jelle Teijema; Jelle Teijema; Rens Van de Schoot; Rens Van de Schoot (2023). SYNERGY - Open machine learning dataset on study selection in systematic reviews [Dataset]. http://doi.org/10.34894/HE6NAQ
Explore at:
txt(212), json(702), zip(16028323), json(19426), txt(263), zip(3560967), txt(305), json(470), txt(279), zip(2355371), json(23201), csv(460956), txt(200), json(685), json(546), csv(63996), zip(2989015), zip(5749455), txt(331), txt(315), json(691), json(23775), csv(672721), json(468), txt(415), json(22778), csv(31919), csv(746832), json(18392), zip(62992826), csv(234822), txt(283), zip(34788857), json(475), txt(242), json(533), csv(42227), json(24548), zip(738232), json(22477), json(25491), zip(11463283), json(17741), csv(490660), json(19662), json(578), csv(19786), zip(14708207), zip(24619707), zip(2404439), json(713), json(27224), json(679), json(26426), txt(185), json(906), zip(18534723), json(23550), txt(266), txt(317), zip(6019723), json(33943), txt(436), csv(388378), json(469), zip(2106498), txt(320), csv(451336), txt(338), zip(19428163), json(14326), json(31652), txt(299), csv(96153), txt(220), csv(114789), json(15452), csv(5372708), json(908), csv(317928), csv(150923), json(465), csv(535584), json(26090), zip(8164831), json(19633), txt(316), json(23494), csv(133950), json(18638), csv(3944082), json(15345), json(473), zip(4411063), zip(10396095), zip(835096), txt(255), json(699), csv(654705), txt(294), csv(989865), zip(1028035), txt(322), zip(15085090), txt(237), txt(310), json(756), json(30628), json(19490), json(25908), txt(401), json(701), zip(5543909), json(29397), zip(14007470), json(30058), zip(58869042), csv(852937), json(35711), csv(298011), csv(187163), txt(258), zip(3526740), json(568), json(21552), zip(66466788), csv(215250), json(577), csv(103010), txt(306), zip(11840006)Available download formats
Unique identifier
https://doi.org/10.34894/HE6NAQ
Dataset updated
Apr 24, 2023
Dataset provided by
DataverseNL
Authors
Jonathan De Bruin; Jonathan De Bruin; Yongchao Ma; Yongchao Ma; Gerbrich Ferdinands; Gerbrich Ferdinands; Jelle Teijema; Jelle Teijema; Rens Van de Schoot; Rens Van de Schoot
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
SYNERGY is a free and open dataset on study selection in systematic reviews, comprising 169,288 academic works from 26 systematic reviews. Only 2,834 (1.67%) of the academic works in the binary classified dataset are included in the systematic reviews. This makes the SYNERGY dataset a unique dataset for the development of information retrieval algorithms, especially for sparse labels. Due to the many available variables available per record (i.e. titles, abstracts, authors, references, topics), this dataset is useful for researchers in NLP, machine learning, network analysis, and more. In total, the dataset contains 82,668,134 trainable data points. The easiest way to get the SYNERGY dataset is via the synergy-dataset Python package. See https://github.com/asreview/synergy-dataset for all information.
Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects...
figshare.com
txt
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO (2023). NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python [Dataset]. http://doi.org/10.6084/m9.figshare.21967265.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.21967265.v1
Dataset updated
May 30, 2023
Dataset provided by
Figsharehttp://figshare.com/
figshare
Authors
Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts in filtering those projects to curate ML projects of high quality. The limited availability of such high-quality dataset poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidences of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide "NICHE.csv" file that contains the list of the project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.

GitHub page: https://github.com/soarsmu/NICHE
Learning Path Index Dataset
kaggle.com
zip
Updated Nov 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mani Sarkar (2024). Learning Path Index Dataset [Dataset]. https://www.kaggle.com/datasets/neomatrix369/learning-path-index-dataset/code
Explore at:
zip(151846 bytes)Available download formats
Dataset updated
Nov 6, 2024
Authors
Mani Sarkar
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Description

The Learning Path Index Dataset is a comprehensive collection of byte-sized courses and learning materials tailored for individuals eager to delve into the fields of Data Science, Machine Learning, and Artificial Intelligence (AI), making it an indispensable reference for students, professionals, and educators in the Data Science and AI communities.

This Kaggle Dataset along with the KaggleX Learning Path Index GitHub Repo were created by the mentors and mentees of Cohort 3 KaggleX BIPOC Mentorship Program (between August 2023 and November 2023, also see this). See Credits section at the bottom of the long description.

Inspiration

This dataset was created out of a commitment to facilitate learning and growth within the Data Science, Machine Learning, and AI communities. It started off as an idea at the end of Cohort 2 of the KaggleX BIPOC Mentorship Program brainstorming and feedback session. It was one of the ideas to create byte-sized learning material to help our KaggleX mentees learn things faster. It aspires to simplify the process of finding, evaluating, and selecting the most fitting educational resources.

Context

This dataset was meticulously curated to assist learners in navigating the vast landscape of Data Science, Machine Learning, and AI education. It serves as a compass for those aiming to develop their skills and expertise in these rapidly evolving fields.

The mentors and mentees communicated via Discord, Trello, Google Hangout, etc... to put together these artifacts and made them public for everyone to use and contribute back.

Sources

The dataset compiles data from a curated selection of reputable sources including leading educational platforms such as Google Developer, Google Cloud Skill Boost, IBM, Fast AI, etc. By drawing from these trusted sources, we ensure that the data is both accurate and pertinent. The raw data and other artifacts as a result of this exercise can be found on the GitHub Repo i.e. KaggleX Learning Path Index GitHub Repo.

Content

The dataset encompasses the following attributes:

Course / Learning Material: The title of the Data Science, Machine Learning, or AI course or learning material.

Source: The provider or institution offering the course.

Course Level: The proficiency level, ranging from Beginner to Advanced.

Type (Free or Paid): Indicates whether the course is available for free or requires payment.

Module: Specific module or section within the course.

Duration: The estimated time required to complete the module or course.

Module / Sub-module Difficulty Level: The complexity level of the module or sub-module.

Keywords / Tags / Skills / Interests / Categories: Relevant keywords, tags, or categories associated with the course with a focus on Data Science, Machine Learning, and AI.

Links: Hyperlinks to access the course or learning material directly.

How to contribute to this initiative?

You can also join us by taking part in the next KaggleX BIPOC Mentorship program (also see this)

Keep your eyes open on the Kaggle Discussions page and other KaggleX social media channels. Or find us on the Kaggle Discord channel to learn more about the next steps

Create notebooks from this data

Create supplementary or complementary data for or from this dataset

Submit corrections/enhancements or anything else to help improve this dataset so it has a wider use and purpose

License

The Learning Path Index Dataset is openly shared under a permissive license, allowing users to utilize the data for educational, analytical, and research purposes within the Data Science, Machine Learning, and AI domains. Feel free to fork the dataset and make it your own, we would be delighted if you contributed back to the dataset and/or our KaggleX Learning Path Index GitHub Repo as well.

Important Links

KaggleX BIPOC Mentorship program (also see this)

KaggleX Learning Path Index Dataset

KaggleX Learning Path Index GitHub Repo

New Official Kaggle Discord Server!

Credits

Credits for all the work done to create this Kaggle Dataset and the KaggleX [Learnin...
h
github-issues
huggingface.co
Updated Oct 20, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jason-ice-SCUT (2025). github-issues [Dataset]. https://huggingface.co/datasets/Jason-ice-SCUT/github-issues
Explore at:
Dataset updated
Oct 20, 2025
Authors
Jason-ice-SCUT
Description
Dataset Card for GitHub Issues without Comments

Dataset Summary

The GitHub Issues dataset contains issues and pull requests from the 🤗 Datasets repository ,but it does not include the comments.It supports tasks like Text classification and text retrieval. Each entry is an English-language discussion centered around NLP, computer vision, and other machine learning datasets.

Dataset Metadata

Attribute Value

Modalities Tabular, Text

Data Formats… See the full description on the dataset page: https://huggingface.co/datasets/Jason-ice-SCUT/github-issues.
Data from: ManyTypes4Py: A Benchmark Python Dataset for Machine...
data.europa.eu
unknown
Updated Jul 3, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zenodo (2025). ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-5244636?locale=lv
Explore at:
unknown(1052407809)Available download formats
Dataset updated
Jul 3, 2025
Dataset authored and provided by
Zenodohttp://zenodo.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset is gathered on Sep. 17th 2020 from GitHub. It has clean and complete versions (from v0.7): The clean version has 5.1K type-checked Python repositories and 1.2M type annotations. The complete version has 5.2K Python repositories and 3.3M type annotations. The dataset's source files are type-checked using mypy (clean version). The dataset is also de-duplicated using the CD4Py tool. Check out the README.MD file for the description of the dataset. Notable changes to each version of the dataset are documented in CHANGELOG.md. The dataset's scripts and utilities are available on its GitHub repository.
Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...
data.europa.eu
unknown
Updated Feb 28, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zenodo (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. https://data.europa.eu/88u/dataset/oai-zenodo-org-4571228
Explore at:
unknown(395470535)Available download formats
Dataset updated
Feb 28, 2021
Dataset authored and provided by
Zenodohttp://zenodo.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset is gathered on Sep. 17th 2020. It has more than 5.4K Python repositories that are hosted on GitHub. Check out the file ManyTypes4PyDataset.spec for repositories URL and their commit SHA. The dataset is also de-duplicated using the CD4Py tool. The list of duplicate files is provided in duplicate_files.txt file. All of its Python projects are processed in JSON-formatted files. They contain a seq2seq representation of each file, type-related hints, and information for machine learning models. The structure of JSON-formatted files is described in JSONOutput.md file. The dataset is split into train, validation and test sets by source code files. The list of files and their corresponding set is provided in dataset_split.csv file. Notable changes to each version of the dataset are documented in CHANGELOG.md.
F
Data from: A Neural Approach for Text Extraction from Scholarly Figures
data.uni-hannover.de
zip
Updated Jan 20, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
TIB (2022). A Neural Approach for Text Extraction from Scholarly Figures [Dataset]. https://data.uni-hannover.de/dataset/a-neural-approach-for-text-extraction-from-scholarly-figures
Explore at:
zipAvailable download formats
Dataset updated
Jan 20, 2022
Dataset authored and provided by
TIB
License
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
A Neural Approach for Text Extraction from Scholarly Figures

This is the readme for the supplemental data for our ICDAR 2019 paper.

You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202

If you found this dataset useful, please consider citing our paper:

@inproceedings{DBLP:conf/icdar/MorrisTE19, author = {David Morris and Peichen Tang and Ralph Ewerth}, title = {A Neural Approach for Text Extraction from Scholarly Figures}, booktitle = {2019 International Conference on Document Analysis and Recognition, {ICDAR} 2019, Sydney, Australia, September 20-25, 2019}, pages = {1438--1443}, publisher = {{IEEE}}, year = {2019}, url = {https://doi.org/10.1109/ICDAR.2019.00231}, doi = {10.1109/ICDAR.2019.00231}, timestamp = {Tue, 04 Feb 2020 13:28:39 +0100}, biburl = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }

This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).

Datasets

We used different sources of data for testing, validation, and training. Our testing set was assembled by the work we cited by Böschen et al. We excluded the DeGruyter dataset, and use it as our validation dataset.

Testing

These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2

Validation

The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.

Training

We used label_generator's generated dataset, which the author made available on a requester-pays amazon s3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.

Code

We have made our code available in code.zip. We will upload code, announce further news, and field questions via the github repo.

Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.

We used a tesseract script to run text extraction from detected text rows. This is inside our code code.tar as text_recognition_multipro.py.

We used a java script provided by Falk Böschen and adapted to our file structure. We included this as evaluator.jar.

Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.
Data from: A Benchmark Suite for Systematically Evaluating Reasoning...
zenodo.org
data.niaid.nih.gov
zip
Updated Jun 13, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bortolotti Samuele; Marconato Emanuele; Carraro Tommaso; Morettin Paolo; van Krieken Emile; Vergari Antonio; Teso Stefano; Passerini Andrea; Passerini Andrea; Bortolotti Samuele; Marconato Emanuele; Carraro Tommaso; Morettin Paolo; van Krieken Emile; Vergari Antonio; Teso Stefano (2024). A Benchmark Suite for Systematically Evaluating Reasoning Shortcuts [Dataset]. http://doi.org/10.5281/zenodo.11612556
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.11612556
Dataset updated
Jun 13, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Bortolotti Samuele; Marconato Emanuele; Carraro Tommaso; Morettin Paolo; van Krieken Emile; Vergari Antonio; Teso Stefano; Passerini Andrea; Passerini Andrea; Bortolotti Samuele; Marconato Emanuele; Carraro Tommaso; Morettin Paolo; van Krieken Emile; Vergari Antonio; Teso Stefano
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Codebase [Github] | Dataset [Zenodo]

Abstract

The advent of powerful neural classifiers has increased interest in problems that require both learning and reasoning. These problems are critical for understanding important properties of models, such as trustworthiness, generalization, interpretability, and compliance to safety and structural constraints. However, recent research observed that tasks requiring both learning and reasoning on background knowledge often suffer from reasoning shortcuts (RSs): predictors can solve the downstream reasoning task without associating the correct concepts to the high-dimensional data. To address this issue, we introduce rsbench, a comprehensive benchmark suite designed to systematically evaluate the impact of RSs on models by providing easy access to highly customizable tasks affected by RSs. Furthermore, rsbench implements common metrics for evaluating concept quality and introduces novel formal verification procedures for assessing the presence of RSs in learning tasks. Using rsbench, we highlight that obtaining high quality concepts in both purely neural and neuro-symbolic models is a far-from-solved problem. rsbench is available on Github.

Usage

We recommend visiting the official code website for instructions on how to use the dataset and accompaying software code.

License

All ready-made data sets and generated datasets are distributed under the CC-BY-SA 4.0 license, with the exception of Kand-Logic, which is derived from Kandinsky-patterns and as such is distributed under the GPL-3.0 license.

Datasets Overview

CLIP-embeddings. This folder contains the saved activations from a pretrained CLIP model applied to the tested dataset. It includes embeddings that represent the dataset in a format suitable for further analysis and experimentation.

BDD_OIA-original-dataset. This directory holds the original files from the X-OIA project by Xu et al. [1]. These datasets have been made publicly available for ease of access and further research. If you are going to use it, please consider citing the original authors.

kand-logic-3k. This folder contains all images generated for the Kand-Logic project. Each image is accompanied by annotations for both concepts and labels.

bbox-kand-logic-3k. In this directory, you will find images from the Kand-Logic project that have undergone a preprocessing step. These images are extracted based on bounding boxes, rescaled, and include annotations for concepts and labels.

sdd-oia. This folder includes all images and labels generated using rsbench.

sdd-oia-embeddings. This directory contains 512-dimensional embeddings extracted from a pretrained ResNet18 model on ImageNet. The embeddings are derived from the sdd-oia`dataset.

BDD-OIA-preprocessed. Here you will find preprocessed data that follow the methodology outlined by Sawada and Nakamura [2]. The folder contains 2048-dimensional embeddings extracted from a pretrained Faster-RCNN model on the BDD-100k dataset.

The original BDD datasets can be downloaded from the following Google Drive link: [Download BDD Dataset].

References

[1] Xu et al., *Explainable Object-Induced Action Decision for Autonomous Vehicles*, CVPR 2020.

[2] Sawada and Nakamura, *Concept Bottleneck Model With Additional Unsupervised Concepts*, IEEE 2022.
I
Dataset: Breaking the barrier of human-annotated training data for...
databank.illinois.edu
Updated Dec 12, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sebastian Varela; Andrew Leakey (2024). Dataset: Breaking the barrier of human-annotated training data for machine-learning-aided plant research using aerial imagery [Dataset]. http://doi.org/10.13012/B2IDB-8462244_V2
Explore at:
Unique identifier
https://doi.org/10.13012/B2IDB-8462244_V2
Dataset updated
Dec 12, 2024
Authors
Sebastian Varela; Andrew Leakey
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset funded by
U.S. Department of Energy (DOE)
Description
This dataset supports the implementation described in the manuscript "Breaking the Barrier of Human-Annotated Training Data for Machine-Learning-Aided Biological Research Using Aerial Imagery." It comprises UAV aerial imagery used to execute the code available at https://github.com/pixelvar79/GAN-Flowering-Detection-paper. For detailed information on dataset usage and instructions for implementing the code to reproduce the study, please refer to the GitHub repository.
o
Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...
explore.openaire.eu
data.europa.eu
Updated Apr 26, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amir M. Mir; Evaldas Latoskinas; Georgios Gousios (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. http://doi.org/10.5281/zenodo.4044635
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.4044635
Dataset updated
Apr 26, 2021
Authors
Amir M. Mir; Evaldas Latoskinas; Georgios Gousios
Description
The dataset is gathered on Sep. 17th 2020 from GitHub. It has more than 5.2K Python repositories and 4.2M type annotations. The dataset is also de-duplicated using the CD4Py tool. Check out the README.MD file for the description of the dataset. Notable changes to each version of the dataset are documented in CHANGELOG.md. The dataset's scripts and utilities are available on its GitHub repository.
Data from: FISBe: A real-world benchmark dataset for instance segmentation...
zenodo.org
data.niaid.nih.gov
+1more
bin, json +3
Updated Apr 2, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lisa Mais; Lisa Mais; Peter Hirsch; Peter Hirsch; Claire Managan; Claire Managan; Ramya Kandarpa; Josef Lorenz Rumberger; Josef Lorenz Rumberger; Annika Reinke; Annika Reinke; Lena Maier-Hein; Lena Maier-Hein; Gudrun Ihrke; Gudrun Ihrke; Dagmar Kainmueller; Dagmar Kainmueller; Ramya Kandarpa (2024). FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures [Dataset]. http://doi.org/10.5281/zenodo.10875063
Explore at:
zip, text/x-python, bin, json, txtAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10875063
Dataset updated
Apr 2, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Lisa Mais; Lisa Mais; Peter Hirsch; Peter Hirsch; Claire Managan; Claire Managan; Ramya Kandarpa; Josef Lorenz Rumberger; Josef Lorenz Rumberger; Annika Reinke; Annika Reinke; Lena Maier-Hein; Lena Maier-Hein; Gudrun Ihrke; Gudrun Ihrke; Dagmar Kainmueller; Dagmar Kainmueller; Ramya Kandarpa
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Feb 26, 2024
Description
General

For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.

Summary

A new dataset for neuron instance segmentation in 3d multicolor light microscopy data of fruit fly brains

30 completely labeled (segmented) images

71 partly labeled images

altogether comprising ∼600 expert-labeled neuron instances (labeling a single neuron takes between 30-60 min on average, yet a difficult one can take up to 4 hours)

To the best of our knowledge, the first real-world benchmark dataset for instance segmentation of long thin filamentous objects

A set of metrics and a novel ranking score for respective meaningful method benchmarking

An evaluation of three baseline methods in terms of the above metrics and score

Abstract

Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.

Dataset documentation:

We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:

>> FISBe Datasheet

Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.

Files

fisbe_v1.0_{completely,partly}.zip

contains the image and ground truth segmentation data; there is one zarr file per sample, see below for more information on how to access zarr files.

fisbe_v1.0_mips.zip

maximum intensity projections of all samples, for convenience.

sample_list_per_split.txt

a simple list of all samples and the subset they are in, for convenience.

view_data.py

a simple python script to visualize samples, see below for more information on how to use it.

dim_neurons_val_and_test_sets.json

a list of instance ids per sample that are considered to be of low intensity/dim; can be used for extended evaluation.

Readme.md

general information

How to work with the image files

Each sample consists of a single 3d MCFO image of neurons of the fruit fly.
For each image, we provide a pixel-wise instance segmentation for all separable neurons.
Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification.").
The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file.
The segmentation mask for each neuron is stored in a separate channel.
The order of dimensions is CZYX.

We recommend to work in a virtual environment, e.g., by using conda:

conda create -y -n flylight-env -c conda-forge python=3.9
conda activate flylight-env

How to open zarr files

Install the python zarr package:
pip install zarr

Opened a zarr file with:

import zarr
raw = zarr.open(
seg = zarr.open(

# optional:
import numpy as np
raw_np = np.array(raw)

Zarr arrays are read lazily on-demand.
Many functions that expect numpy arrays also work with zarr arrays.
Optionally, the arrays can also explicitly be converted to numpy arrays.

How to view zarr image files

We recommend to use napari to view the image data.

Install napari:
pip install "napari[all]"

Save the following Python script:

import zarr, sys, napari

raw = zarr.load(sys.argv[1], mode='r', path="volumes/raw")
gts = zarr.load(sys.argv[1], mode='r', path="volumes/gt_instances")

viewer = napari.Viewer(ndisplay=3)
for idx, gt in enumerate(gts):
viewer.add_labels(
gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
napari.run()

Execute:
python view_data.py

Metrics

S: Average of avF1 and C

avF1: Average F1 Score

C: Average ground truth coverage

clDice_TP: Average true positives clDice

FS: Number of false splits

FM: Number of false merges

tp: Relative number of true positives

For more information on our selected metrics and formal definitions please see our paper.

Baseline

To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN) and a non-learnt application-specific color clustering from Duan et al..
For detailed information on the methods and the quantitative results please see our paper.

License

The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Citation

If you use FISBe in your research, please use the following BibTeX entry:

@misc{mais2024fisbe, title = {FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures}, author = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller}, year = 2024, eprint = {2404.00130}, archivePrefix ={arXiv}, primaryClass = {cs.CV} }

Acknowledgments

We thank Aljoscha Nern for providing unpublished MCFO images as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable
discussions.
P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program.
This work was co-funded by Helmholtz Imaging.

Changelog

There have been no changes to the dataset so far.
All future change will be listed on the changelog page.

Contributing

If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying github repository.

All contributions are welcome!
Machine Learning users on Github
kaggle.com
zip
Updated Jan 9, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
prosper chuks (2022). Machine Learning users on Github [Dataset]. https://www.kaggle.com/prosperchuks/machine-learning-users-on-github
Explore at:
zip(52282 bytes)Available download formats
Dataset updated
Jan 9, 2022
Authors
prosper chuks
License
Open Data Commons Attribution License (ODC-By) v1.0https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Description
Data was scraped from Github's API.

Columns

LOGIN: shows the user's Github login ID: user's id URL: API link to the user's profile NAME: fullname of the user COMPANY: organization the user's affiliated with BLOG: link to the user's blog site LOCATION: location where the user resides EMAIL: user's email address BIO: about the user

This dataset contains over 600 users from Lagos, Nigeria and Rwanda

Source: https://github.com/ProsperChuks/Github-Data-Ingestion/tree/main/data
GitHub Social Network
kaggle.com
Updated Jan 12, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gitanjali Wadhwa (2023). GitHub Social Network [Dataset]. https://www.kaggle.com/datasets/gitanjali1425/github-social-network-graph-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 12, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Gitanjali Wadhwa
Description
Description

An extensive social network of GitHub developers was collected from the public API in June 2019. Nodes are developers who have starred at most minuscule 10 repositories, and edges are mutual follower relationships between them. The vertex features are extracted based on the location; repositories starred, employer and e-mail address. The task related to the graph is binary node classification - one has to predict whether the GitHub user is a web or a machine learning developer. This targeting feature was derived from the job title of each user.

Properties

Directed: No.

Node features: Yes.

Edge features: No.

Node labels: Yes. Binary-labeled.

Temporal: No.

Nodes: 37,700

Edges: 289,003

Density: 0.001

Transitvity: 0.013

Possible Tasks

Binary node classification

Link prediction

Community detection

Network visualisation
R
Data from: Fashion Mnist Dataset
universe.roboflow.com
opendatalab.com
+3more
zip
Updated Aug 10, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Popular Benchmarks (2022). Fashion Mnist Dataset [Dataset]. https://universe.roboflow.com/popular-benchmarks/fashion-mnist-ztryt/model/3
Explore at:
zipAvailable download formats
Dataset updated
Aug 10, 2022
Dataset authored and provided by
Popular Benchmarks
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Clothing
Description
Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

Authors:

Han Xiao, Kashif Rasul and Roland Vollgraf

https://arxiv.org/abs/1708.07747

Dataset Obtained From: https://github.com/zalandoresearch/fashion-mnist

All images were sized 28x28 in the original dataset

Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits. * Source

Here's an example of how the data looks (each class takes three-rows): https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png" alt="Visualized Fashion MNIST dataset">

Version 1 (original-images_Original-FashionMNIST-Splits):

Original images, with the original splits for MNIST: train (86% of images - 60,000 images) set and test (14% of images - 10,000 images) set only.

This version was not trained

Version 3 (original-images_trainSetSplitBy80_20):

Original, raw images, with the train set split to provide 80% of its images to the training set and 20% of its images to the validation set

https://blog.roboflow.com/train-test-split/ https://i.imgur.com/angfheJ.png" alt="Train/Valid/Test Split Rebalancing">

Citation:

@online{xiao2017/online, author = {Han Xiao and Kashif Rasul and Roland Vollgraf}, title = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms}, date = {2017-08-28}, year = {2017}, eprintclass = {cs.LG}, eprinttype = {arXiv}, eprint = {cs.LG/1708.07747}, }
Z
Bio-logger Ethogram Benchmark: A benchmark for computational analysis of...
data.niaid.nih.gov
portalcienciaytecnologia.jcyl.es
+4more
Updated Apr 19, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hoffman, Benjamin; Cusimano, Maddie; Baglione, Vittorio; Canestrari, Daniela; Chevallier, Damien; DeSantis, Dominic L.; Jeantet, Lorène; Ladds, Monique A.; Maekawa, Takuya; Mata-Silva, Vicente; Moreno-González, Víctor; Trapote, Eva; Vainio, Outi; Vehkaoja, Antti; Yoda, Ken; Zacarian, Katherine; Friedlaender, Ari (2024). Bio-logger Ethogram Benchmark: A benchmark for computational analysis of animal behavior, using animal-borne tags [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7807280
Explore at:
Dataset updated
Apr 19, 2024
Dataset provided by
Universidad de León
University of California, Santa Cruz
Centre national de la recherche scientifique Borea
Earth Species Project
University of Texas, El Paso
Department of Conservation, New Zealand
Osaka University
African Institute for Mathematical Sciences, Stellenbosch University
Nagoya University
Tampere University
Georgia College & State University
University of Helsinki
Authors
Hoffman, Benjamin; Cusimano, Maddie; Baglione, Vittorio; Canestrari, Daniela; Chevallier, Damien; DeSantis, Dominic L.; Jeantet, Lorène; Ladds, Monique A.; Maekawa, Takuya; Mata-Silva, Vicente; Moreno-González, Víctor; Trapote, Eva; Vainio, Outi; Vehkaoja, Antti; Yoda, Ken; Zacarian, Katherine; Friedlaender, Ari
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the datasets and experiment results presented in our arxiv paper:

B. Hoffman, M. Cusimano, V. Baglione, D. Canestrari, D. Chevallier, D. DeSantis, L. Jeantet, M. Ladds, T. Maekawa, V. Mata-Silva, V. Moreno-González, A. Pagano, E. Trapote, O. Vainio, A. Vehkaoja, K. Yoda, K. Zacarian, A. Friedlaender, "A benchmark for computational analysis of animal behavior, using animal-borne tags," 2023.

Standardized code to implement, train, and evaluate models can be found at https://github.com/earthspecies/BEBE/.

Please note the licenses in each dataset folder.

Zip folders beginning with "formatted": These are the datasets we used to run the experiments reported in the benchmark paper.

Zip folders beginning with "raw": These are the unprocessed datasets used in BEBE. Code to process these raw datasets into the formatted ones used by BEBE can be found at https://github.com/earthspecies/BEBE-datasets/.

Zip folders beginning with "experiments": Results of the cross-validation experiments reported in the paper, as well as hyperparameter optimization. Confusion matrices for all experiments can also be found here. Note that dt, rf, and svm refer to the feature set from Nathan et al., 2012.

Results used in Fig. 4 of arxiv paper (deep neural networks vs. classical models){dataset}_ harnet_nogyr{dataset}_CRNN{dataset}_CNN{dataset}_dt{dataset}_rf{dataset}_svm{dataset}_wavelet_dt{dataset}_wavelet_rf{dataset}_wavelet_svm

Results used in Fig. 5D of arxiv paper (full data setting)If dataset contains gyroscope (HAR, jeantet_turtles, vehkaoja_dogs):{dataset}_harnet_nogyr{dataset}_harnet_random_nogyr{dataset}_harnet_unfrozen_nogyr{dataset}_RNN_nogyr{dataset}_CRNN_nogyr{dataset}_rf_nogyrOtherwise:{dataset}_harnet_nogyr{dataset}_harnet_unfrozen_nogyr{dataset}_harnet_random_nogyr{dataset}_RNN_nogyr{dataset}_CRNN{dataset}_rf

Results used in Fig. 5E of arxiv paper (reduced data setting)If dataset contains gyroscope (HAR, jeantet_turtles, vehkaoja_dogs):{dataset}_harnet_low_data_nogyr{dataset}_harnet_random_low_data_nogyr{dataset}_harnet_unfrozen_low_data_nogyr{dataset}_RNN_low_data_nogyr{dataset}_wavelet_RNN_low_data_nogyr{dataset}_CRNN_low_data_nogyr{dataset}_rf_low_data_nogyr

Otherwise:{dataset}_harnet_low_data_nogyr{dataset}_harnet_random_low_data_nogyr{dataset}_harnet_unfrozen_low_data_nogyr{dataset}_RNN_low_data_nogyr{dataset}_wavelet_RNN_low_data_nogyr{dataset}_CRNN_low_data{dataset}_rf_low_data

CSV files: we also include summaries of the experimental results in experiments_summary.csv, experiments_by_fold_individual.csv, experiments_by_fold_behavior.csv.

experiments_summary.csv - results averaged over individuals and behavior classesdataset (str): name of datasetexperiment (str): name of model with experiment setting fig4 (bool): True if dataset+experiment was used in figure 4 of arxiv paperfig5d (bool): True if dataset+experiment was used in figure 5d of arxiv paperfig5e (bool): True if dataset+experiment was used in figure 5e of arxiv paperf1_mean (float): mean of macro-averaged F1 score, averaged over individuals in test foldsf1_std (float): standard deviation of macro-averaged F1 score, computed over individuals in test foldsprec_mean, prec_std (float): analogous for precisionrec_mean, rec_std (float): analogous for recallexperiments_by_fold_individual.csv - results per individual in the test foldsdataset (str): name of datasetexperiment (str): name of model with experiment setting fig4 (bool): True if dataset+experiment was used in figure 4 of arxiv paperfig5d (bool): True if dataset+experiment was used in figure 5d of arxiv paperfig5e (bool): True if dataset+experiment was used in figure 5e of arxiv paperfold (int): test fold indexindividual (int): individuals are numbered zero-indexed, starting from fold 1f1 (float): macro-averaged f1 score for this individualprecision (float): macro-averaged precision for this individualrecall (float): macro-averaged recall for this individual

experiments_by_fold_behavior.csv - results per behavior class, for each test folddataset (str): name of datasetexperiment (str): name of model with experiment setting fig4 (bool): True if dataset+experiment was used in figure 4 of arxiv paperfig5d (bool): True if dataset+experiment was used in figure 5d of arxiv paperfig5e (bool): True if dataset+experiment was used in figure 5e of arxiv paperfold (int): test fold indexbehavior_class (str): name of behavior classf1 (float): f1 score for this behavior, averaged over individuals in the test foldprecision (float): precision for this behavior, averaged over individuals in the test foldrecall (float): recall for this behavior, averaged over individuals in the test foldtrain_ground_truth_label_counts (int): number of timepoints labeled with this behavior class, in the training set
Z
Data from: Dataset of paper "Why do Machine Learning Notebooks Crash?"
nde-dev.biothings.io
data-staging.niaid.nih.gov
Updated Mar 12, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Meijer, Willem (2025). Dataset of paper "Why do Machine Learning Notebooks Crash?" [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_14070487
Explore at:
Dataset updated
Mar 12, 2025
Dataset provided by
Meijer, Willem
Varró, Dániel
Wang, Yiran
López, José Antonio Hernández
Nilsson, Ulf
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
All the related data of our paper "Why do Machine Learning Notebooks Crash?" includes:

GitHub and Kaggle notebooks that contain error outputs.

GitHub notebooks are from The Stack repository[1].

Kaggle notebooks are public notebooks on Kaggle platform from year 2023, downloaded via KGTorrent[2].

Identified programming language results of GitHub notebooks.

Identified ML library results from Kaggle notebooks.

Datasets of crashes from GitHub and Kaggle.

Clustering results of crashes from all crashes, and from GitHub and Kaggle respectively.

Sampled crashes and associated notebooks (organized by cluster id).

Manual labeling and reviewing results.

Reproducing results.

The related code repository can be found here.
h
ml-bench
huggingface.co
Updated May 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yanjun Shao (2024). ml-bench [Dataset]. https://huggingface.co/datasets/super-dainiu/ml-bench
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 29, 2024
Authors
Yanjun Shao
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code

📖 Paper • 🚀 Github Page • 🦙 GitHub

ML-Bench is a novel dual-setup benchmark designed to evaluate Large Language Models (LLMs) and AI agents in generating repository-level code for machine learning tasks. The benchmark consists of 9,641 examples from 169 diverse tasks across 18 GitHub machine learning repositories. This dataset contains the following fields:… See the full description on the dataset page: https://huggingface.co/datasets/super-dainiu/ml-bench.
GitHub Bugs Prediction Challenge (Machine Hack)
kaggle.com
zip
Updated Oct 8, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shadab Hussain (2020). GitHub Bugs Prediction Challenge (Machine Hack) [Dataset]. https://www.kaggle.com/datasets/shadabhussain/github-bugs-prediction-challenge-machine-hack/code
Explore at:
zip(103105294 bytes)Available download formats
Dataset updated
Oct 8, 2020
Authors
Shadab Hussain
Description
Foreseeing bugs, features, and questions on GitHub can be fun, especially when one is provided with a colossal dataset containing the GitHub issues. In this hackathon, we are challenging the MachineHack community to come up with an algorithm that can predict the bugs, features, and questions based on GitHub titles and the text body. With text data, there can be a lot of challenges especially when the dataset is big. Analyzing such a dataset requires a lot to be taken into account mainly due to the preprocessing involved to represent raw text and make them machine-understandable. Usually, we stem and lemmatize the raw information and then represent it using TF-IDF, Word Embeddings, etc.

However, provided the state-of-the-art NLP models such as Transformer based BERT models one can skip the manual feature engineering like TF-IDF and Count Vectorizers. In this short span of time, we would encourage you to leverage the ImageNet moment (Transfer Learning) in NLP using various pre-trained models.

Hackathon Link- https://www.machinehack.com/hackathons/predict_github_issues_embold_sponsored_hackathon/overview
C
Community-Driven Model Service Platform Report
marketreportanalytics.com
doc, pdf, ppt
Updated Apr 9, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Market Report Analytics (2025). Community-Driven Model Service Platform Report [Dataset]. https://www.marketreportanalytics.com/reports/community-driven-model-service-platform-73127
Explore at:
doc, ppt, pdfAvailable download formats
Dataset updated
Apr 9, 2025
Dataset authored and provided by
Market Report Analytics
License
https://www.marketreportanalytics.com/privacy-policyhttps://www.marketreportanalytics.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The Community-Driven Model Service Platform market is experiencing robust growth, projected to reach $35.14 billion in 2025 and maintain a Compound Annual Growth Rate (CAGR) of 10.1% from 2025 to 2033. This expansion is fueled by several key factors. The increasing availability of open-source models and datasets, fostered by platforms like Kaggle, GitHub, and Hugging Face, is democratizing access to advanced machine learning capabilities. This, in turn, accelerates innovation and reduces the barrier to entry for both developers and businesses. Furthermore, the growing demand for specialized AI solutions across diverse sectors—from healthcare and finance to manufacturing and retail—is driving adoption. The cloud-based segment holds a significant market share due to its scalability, accessibility, and cost-effectiveness compared to on-premises solutions. The adult application segment is currently the largest, reflecting the high concentration of skilled professionals and research activities within this group; however, the children's application segment shows significant growth potential given increasing educational initiatives incorporating AI. Geographic distribution shows North America and Europe currently leading market adoption, while Asia-Pacific is expected to witness rapid expansion driven by increasing digitalization and technological advancements. The competitive landscape is characterized by a mix of established technology giants and emerging startups. Platforms like TensorFlow Hub and Model Zoo provide comprehensive model repositories, while companies like DrivenData and Cortex focus on data-centric approaches. This competitive environment encourages continuous improvement and innovation within the platform offerings. Challenges include ensuring data security and privacy, addressing biases in datasets, and maintaining a balance between open collaboration and intellectual property rights. However, the overall trajectory points toward sustained market growth, fueled by ongoing technological advancements, increasing adoption across diverse industries, and the continuous contribution of a vibrant community of developers and researchers. Future growth will hinge on platforms successfully addressing the challenges and further enhancing collaborative features, fostering community engagement, and expanding the available resources.
d
Synthea synthetic patient data for lung cancer risk prediction machine...
search.dataone.org
dataverse.harvard.edu
Updated Nov 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Chen, AJ (2023). Synthea synthetic patient data for lung cancer risk prediction machine learning [Dataset]. http://doi.org/10.7910/DVN/GD5XWE
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/GD5XWE
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Chen, AJ
Description
This dataset contains Synthea synthetic patient data used in building ML models for lung cancer risk prediction. The ML models are used to simulate ML-enabled LHS. This open dataset is part of the synthetic data repository of the Open LHS project on GitHub: https://github.com/lhs-open/synthetic-data. For data source and methods, see the first ML-LHS simulation paper published in Nature Scientific Reports: https://www.nature.com/articles/s41598-022-23011-4.

Facebook

Twitter

Click to copy link

Link copied

Cite

Jonathan De Bruin; Jonathan De Bruin; Yongchao Ma; Yongchao Ma; Gerbrich Ferdinands; Gerbrich Ferdinands; Jelle Teijema; Jelle Teijema; Rens Van de Schoot; Rens Van de Schoot (2023). SYNERGY - Open machine learning dataset on study selection in systematic reviews [Dataset]. http://doi.org/10.34894/HE6NAQ

SYNERGY - Open machine learning dataset on study selection in systematic reviews

Explore at:

17 scholarly articles cite this dataset (View in Google Scholar)

txt(212), json(702), zip(16028323), json(19426), txt(263), zip(3560967), txt(305), json(470), txt(279), zip(2355371), json(23201), csv(460956), txt(200), json(685), json(546), csv(63996), zip(2989015), zip(5749455), txt(331), txt(315), json(691), json(23775), csv(672721), json(468), txt(415), json(22778), csv(31919), csv(746832), json(18392), zip(62992826), csv(234822), txt(283), zip(34788857), json(475), txt(242), json(533), csv(42227), json(24548), zip(738232), json(22477), json(25491), zip(11463283), json(17741), csv(490660), json(19662), json(578), csv(19786), zip(14708207), zip(24619707), zip(2404439), json(713), json(27224), json(679), json(26426), txt(185), json(906), zip(18534723), json(23550), txt(266), txt(317), zip(6019723), json(33943), txt(436), csv(388378), json(469), zip(2106498), txt(320), csv(451336), txt(338), zip(19428163), json(14326), json(31652), txt(299), csv(96153), txt(220), csv(114789), json(15452), csv(5372708), json(908), csv(317928), csv(150923), json(465), csv(535584), json(26090), zip(8164831), json(19633), txt(316), json(23494), csv(133950), json(18638), csv(3944082), json(15345), json(473), zip(4411063), zip(10396095), zip(835096), txt(255), json(699), csv(654705), txt(294), csv(989865), zip(1028035), txt(322), zip(15085090), txt(237), txt(310), json(756), json(30628), json(19490), json(25908), txt(401), json(701), zip(5543909), json(29397), zip(14007470), json(30058), zip(58869042), csv(852937), json(35711), csv(298011), csv(187163), txt(258), zip(3526740), json(568), json(21552), zip(66466788), csv(215250), json(577), csv(103010), txt(306), zip(11840006)Available download formats

Unique identifier

https://doi.org/10.34894/HE6NAQ

Dataset updated

Apr 24, 2023

Dataset provided by

DataverseNL

Authors

Jonathan De Bruin; Jonathan De Bruin; Yongchao Ma; Yongchao Ma; Gerbrich Ferdinands; Gerbrich Ferdinands; Jelle Teijema; Jelle Teijema; Rens Van de Schoot; Rens Van de Schoot

License

CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically

Description

SYNERGY is a free and open dataset on study selection in systematic reviews, comprising 169,288 academic works from 26 systematic reviews. Only 2,834 (1.67%) of the academic works in the binary classified dataset are included in the systematic reviews. This makes the SYNERGY dataset a unique dataset for the development of information retrieval algorithms, especially for sparse labels. Due to the many available variables available per record (i.e. titles, abstracts, authors, references, topics), this dataset is useful for researchers in NLP, machine learning, network analysis, and more. In total, the dataset contains 82,668,134 trainable data points. The easiest way to get the SYNERGY dataset is via the synergy-dataset Python package. See https://github.com/asreview/synergy-dataset for all information.

Clear search

Close search

Google apps

Main menu

SYNERGY - Open machine learning dataset on study selection in systematic...

Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects...

Learning Path Index Dataset

Description

Inspiration

Context

Sources

Content

How to contribute to this initiative?

License

Important Links

Credits

github-issues

Data from: ManyTypes4Py: A Benchmark Python Dataset for Machine...

Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...

Data from: A Neural Approach for Text Extraction from Scholarly Figures

A Neural Approach for Text Extraction from Scholarly Figures

Datasets

Testing

Validation

Training

Code

Data from: A Benchmark Suite for Systematically Evaluating Reasoning...

Dataset: Breaking the barrier of human-annotated training data for...

Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...

Data from: FISBe: A real-world benchmark dataset for instance segmentation...

General

Summary

Abstract

Dataset documentation:

Files

How to work with the image files

How to open zarr files

How to view zarr image files

Metrics

Baseline

License

Citation

Acknowledgments

Changelog

Contributing

Machine Learning users on Github

Columns

GitHub Social Network

Data from: Fashion Mnist Dataset

Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

Authors:

Dataset Obtained From: https://github.com/zalandoresearch/fashion-mnist

All images were sized 28x28 in the original dataset

Version 1 (original-images_Original-FashionMNIST-Splits):

Version 3 (original-images_trainSetSplitBy80_20):

Citation:

Bio-logger Ethogram Benchmark: A benchmark for computational analysis of...

Data from: Dataset of paper "Why do Machine Learning Notebooks Crash?"

ml-bench

GitHub Bugs Prediction Challenge (Machine Hack)

Community-Driven Model Service Platform Report

Synthea synthetic patient data for lung cancer risk prediction machine...

SYNERGY - Open machine learning dataset on study selection in systematic reviews