CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
SYNERGY is a free and open dataset on study selection in systematic reviews, comprising 169,288 academic works from 26 systematic reviews. Only 2,834 (1.67%) of the academic works in the binary classified dataset are included in the systematic reviews. This makes the SYNERGY dataset a unique dataset for the development of information retrieval algorithms, especially for sparse labels. Because many variables are available per record (i.e. titles, abstracts, authors, references, topics), this dataset is useful for researchers in NLP, machine learning, network analysis, and more. In total, the dataset contains 82,668,134 trainable data points. The easiest way to get the SYNERGY dataset is via the synergy-dataset Python package. See https://github.com/asreview/synergy-dataset for all information.
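To make the label sparsity concrete, the short sketch below recomputes the inclusion rate from the two counts quoted above. The variable names are illustrative only and nothing here uses the synergy-dataset package itself.

```python
# Counts quoted in the SYNERGY description.
total_records = 169_288
included_records = 2_834

# Share of records that ended up included in a systematic review.
inclusion_rate = included_records / total_records

# Number of excluded records for every included one (label imbalance).
imbalance_ratio = (total_records - included_records) / included_records

# Accuracy of a trivial classifier that excludes every record.
exclude_all_accuracy = 1 - inclusion_rate

print(f"inclusion rate: {inclusion_rate:.2%}")
print(f"excluded per included record: {imbalance_ratio:.1f}")
print(f"accuracy of always predicting 'excluded': {exclude_all_accuracy:.2%}")
```

Because a classifier that excludes everything already scores above 98% accuracy, recall-oriented or ranking-based metrics are usually more informative than plain accuracy on data this sparse.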

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts to filter those projects to curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.
GitHub page: https://github.com/soarsmu/NICHE
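As a rough illustration of how a label file like NICHE.csv could be tallied, the sketch below counts projects per label using only the standard library. The column names (project, label, stars, commits) and the three sample rows are hypothetical stand-ins, not the real file contents; check the actual header before adapting this.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for NICHE.csv; the real column names may differ.
SAMPLE_CSV = """project,label,stars,commits
org-a/project-one,engineered,1200,850
org-b/project-two,non-engineered,310,42
org-c/project-three,engineered,5400,2300
"""


def count_labels(csv_text, label_column="label"):
    """Count how many projects carry each label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[label_column] for row in reader)


label_counts = count_labels(SAMPLE_CSV)
print(label_counts)
```

On the real file, the two counts would be expected to match the 441 engineered and 131 non-engineered projects reported above.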

Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Learning Path Index Dataset is a comprehensive collection of byte-sized courses and learning materials tailored for individuals eager to delve into the fields of Data Science, Machine Learning, and Artificial Intelligence (AI), making it an indispensable reference for students, professionals, and educators in the Data Science and AI communities.
This Kaggle Dataset, along with the KaggleX Learning Path Index GitHub Repo, was created by the mentors and mentees of Cohort 3 of the KaggleX BIPOC Mentorship Program (between August 2023 and November 2023). See the Credits section at the bottom of the long description.
This dataset was created out of a commitment to facilitate learning and growth within the Data Science, Machine Learning, and AI communities. It started as an idea during the brainstorming and feedback session at the end of Cohort 2 of the KaggleX BIPOC Mentorship Program. One of the ideas was to create byte-sized learning material to help our KaggleX mentees learn things faster. The dataset aspires to simplify the process of finding, evaluating, and selecting the most fitting educational resources.
This dataset was meticulously curated to assist learners in navigating the vast landscape of Data Science, Machine Learning, and AI education. It serves as a compass for those aiming to develop their skills and expertise in these rapidly evolving fields.
The mentors and mentees communicated via Discord, Trello, Google Hangout, and similar tools to put together these artifacts, and made them public for everyone to use and contribute to.
The dataset compiles data from a curated selection of reputable sources, including leading educational platforms such as Google Developer, Google Cloud Skill Boost, IBM, and Fast AI. By drawing from these trusted sources, we aim to keep the data both accurate and pertinent. The raw data and other artifacts resulting from this exercise can be found in the KaggleX Learning Path Index GitHub Repo.
The dataset encompasses the following attributes:
The Learning Path Index Dataset is openly shared under a permissive license, allowing users to utilize the data for educational, analytical, and research purposes within the Data Science, Machine Learning, and AI domains. Feel free to fork the dataset and make it your own. We would be delighted if you contributed back to the dataset and/or our KaggleX Learning Path Index GitHub Repo as well.
Credits for all the work done to create this Kaggle Dataset and the KaggleX [Learnin...

Dataset Card for GitHub Issues without Comments
Dataset Summary
The GitHub Issues dataset contains issues and pull requests from the 🤗 Datasets repository, but it does not include the comments. It supports tasks like text classification and text retrieval. Each entry is an English-language discussion centered around NLP, computer vision, and other machine learning datasets.
Dataset Metadata
Attribute Value
Modalities Tabular, Text
Data Formats… See the full description on the dataset page: https://huggingface.co/datasets/Jason-ice-SCUT/github-issues.

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was gathered on Sep. 17th 2020 from GitHub. It comes in clean and complete versions (from v0.7): the clean version has 5.1K type-checked Python repositories and 1.2M type annotations, and the complete version has 5.2K Python repositories and 3.3M type annotations. The source files of the clean version are type-checked using mypy. The dataset is also de-duplicated using the CD4Py tool. Check out the README.MD file for the description of the dataset. Notable changes to each version of the dataset are documented in CHANGELOG.md. The dataset's scripts and utilities are available on its GitHub repository.

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was gathered on Sep. 17th 2020. It has more than 5.4K Python repositories that are hosted on GitHub. Check out the file ManyTypes4PyDataset.spec for the repositories' URLs and their commit SHAs. The dataset is also de-duplicated using the CD4Py tool. The list of duplicate files is provided in the duplicate_files.txt file. All of its Python projects are processed into JSON-formatted files. They contain a seq2seq representation of each file, type-related hints, and information for machine learning models. The structure of the JSON-formatted files is described in the JSONOutput.md file. The dataset is split into train, validation and test sets by source code files. The list of files and their corresponding set is provided in the dataset_split.csv file. Notable changes to each version of the dataset are documented in CHANGELOG.md.

Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This is the readme for the supplemental data for our ICDAR 2019 paper.
You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202
If you found this dataset useful, please consider citing our paper:
@inproceedings{DBLP:conf/icdar/MorrisTE19,
author = {David Morris and
Peichen Tang and
Ralph Ewerth},
title = {A Neural Approach for Text Extraction from Scholarly Figures},
booktitle = {2019 International Conference on Document Analysis and Recognition,
{ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
pages = {1438--1443},
publisher = {{IEEE}},
year = {2019},
url = {https://doi.org/10.1109/ICDAR.2019.00231},
doi = {10.1109/ICDAR.2019.00231},
timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
biburl = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).
We used different sources of data for testing, validation, and training. Our testing set was assembled in the work we cited by Böschen et al. We excluded the DeGruyter dataset from it and used it as our validation dataset.
These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2
The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.
We used label_generator's generated dataset, which the author made available on a requester-pays Amazon S3 bucket. We also used the Multi-Type Web Images dataset.
We have made our code available in code.zip. We will upload code, announce further news, and field questions via the GitHub repo.
Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.
We used a Tesseract script to run text extraction on detected text rows. This is included in our code archive (code.tar) as text_recognition_multipro.py.
We used a Java program provided by Falk Böschen and adapted it to our file structure. We included this as evaluator.jar.
Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.

Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Codebase [Github] | Dataset [Zenodo]
Abstract
The advent of powerful neural classifiers has increased interest in problems that require both learning and reasoning. These problems are critical for understanding important properties of models, such as trustworthiness, generalization, interpretability, and compliance to safety and structural constraints. However, recent research observed that tasks requiring both learning and reasoning on background knowledge often suffer from reasoning shortcuts (RSs): predictors can solve the downstream reasoning task without associating the correct concepts to the high-dimensional data. To address this issue, we introduce rsbench, a comprehensive benchmark suite designed to systematically evaluate the impact of RSs on models by providing easy access to highly customizable tasks affected by RSs. Furthermore, rsbench implements common metrics for evaluating concept quality and introduces novel formal verification procedures for assessing the presence of RSs in learning tasks. Using rsbench, we highlight that obtaining high quality concepts in both purely neural and neuro-symbolic models is a far-from-solved problem. rsbench is available on Github.
Usage
We recommend visiting the official code website for instructions on how to use the dataset and accompanying software code.
License
All ready-made data sets and generated datasets are distributed under the CC-BY-SA 4.0 license, with the exception of Kand-Logic, which is derived from Kandinsky-patterns and as such is distributed under the GPL-3.0 license.
Datasets Overview
The original BDD datasets can be downloaded from the following Google Drive link: [Download BDD Dataset].
References
[1] Xu et al., *Explainable Object-Induced Action Decision for Autonomous Vehicles*, CVPR 2020.
[2] Sawada and Nakamura, *Concept Bottleneck Model With Additional Unsupervised Concepts*, IEEE 2022.

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset supports the implementation described in the manuscript "Breaking the Barrier of Human-Annotated Training Data for Machine-Learning-Aided Biological Research Using Aerial Imagery." It comprises UAV aerial imagery used to execute the code available at https://github.com/pixelvar79/GAN-Flowering-Detection-paper. For detailed information on dataset usage and instructions for implementing the code to reproduce the study, please refer to the GitHub repository.

The dataset was gathered on Sep. 17th 2020 from GitHub. It has more than 5.2K Python repositories and 4.2M type annotations. The dataset is also de-duplicated using the CD4Py tool. Check out the README.MD file for the description of the dataset. Notable changes to each version of the dataset are documented in CHANGELOG.md. The dataset's scripts and utilities are available on its GitHub repository.

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.
Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.
We provide detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:
Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.
Each sample consists of a single 3d MCFO image of neurons of the fruit fly.
For each image, we provide a pixel-wise instance segmentation for all separable neurons.
Each sample is stored as a separate zarr file (zarr is "a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification").
The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file.
The segmentation mask for each neuron is stored in a separate channel.
The order of dimensions is CZYX.
We recommend working in a virtual environment, e.g., by using conda:
conda create -y -n flylight-env -c conda-forge python=3.9
conda activate flylight-env
pip install zarr
import zarr
raw = zarr.open(
seg = zarr.open(
# optional:
import numpy as np
raw_np = np.array(raw)
Zarr arrays are read lazily on-demand.
Many functions that expect numpy arrays also work with zarr arrays.
Optionally, the arrays can also explicitly be converted to numpy arrays.
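Since each neuron mask sits in its own channel (CZYX order), a common first step is to check for overlaps and, if needed, flatten the masks into a single label volume. The sketch below does this on a tiny in-memory array rather than a real sample; it assumes that a voxel belongs to a neuron wherever that neuron's channel is nonzero, which should be verified against the actual data.

```python
import numpy as np

# Toy stand-in for "gt_instances": 2 neurons (channels), Z=1, Y=2, X=3.
gt_instances = np.zeros((2, 1, 2, 3), dtype=np.uint8)
gt_instances[0, 0, 0, :2] = 1  # neuron 0 covers x=0 and x=1
gt_instances[1, 0, 0, 1:] = 1  # neuron 1 covers x=1 and x=2 (overlap at x=1)

# Number of neurons claiming each voxel; values above 1 mark overlaps.
coverage = (gt_instances > 0).sum(axis=0)

# Single label volume: 0 = background, i + 1 = neuron stored in channel i.
# Where neurons overlap, the later channel overwrites the earlier one.
labels = np.zeros(gt_instances.shape[1:], dtype=np.int32)
for channel_index, mask in enumerate(gt_instances):
    labels[mask > 0] = channel_index + 1

print(coverage[0, 0], labels[0, 0])
```

Flattening loses information wherever coverage exceeds 1, so the per-channel layout is the safer representation to keep when neurons overlap.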
We recommend using napari to view the image data.
pip install "napari[all]"
import zarr, sys, napari

raw = zarr.load(sys.argv[1], mode='r', path="volumes/raw")
gts = zarr.load(sys.argv[1], mode='r', path="volumes/gt_instances")

viewer = napari.Viewer(ndisplay=3)
for idx, gt in enumerate(gts):
    viewer.add_labels(
        gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
napari.run()
python view_data.py
For more information on our selected metrics and formal definitions please see our paper.
To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN) and a non-learnt application-specific color clustering from Duan et al.
For detailed information on the methods and the quantitative results please see our paper.
The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
If you use FISBe in your research, please use the following BibTeX entry:
@misc{mais2024fisbe,
title = {FISBe: A real-world benchmark dataset for instance
segmentation of long-range thin filamentous structures},
author = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya
Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena
Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
year = 2024,
eprint = {2404.00130},
archivePrefix ={arXiv},
primaryClass = {cs.CV}
}
We thank Aljoscha Nern for providing unpublished MCFO images as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable discussions.
P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program.
This work was co-funded by Helmholtz Imaging.
There have been no changes to the dataset so far.
All future changes will be listed on the changelog page.
If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying GitHub repository.
All contributions are welcome!

Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Data was scraped from Github's API.
LOGIN: the user's GitHub login
ID: user's id
URL: API link to the user's profile
NAME: full name of the user
COMPANY: organization the user is affiliated with
BLOG: link to the user's blog site
LOCATION: location where the user resides
EMAIL: user's email address
BIO: about the user
This dataset contains over 600 users from Lagos, Nigeria and Rwanda.
Source: https://github.com/ProsperChuks/Github-Data-Ingestion/tree/main/data

Description
An extensive social network of GitHub developers was collected from the public API in June 2019. Nodes are developers who have starred at least 10 repositories, and edges are mutual follower relationships between them. The vertex features are extracted based on the location, repositories starred, employer and e-mail address. The task related to the graph is binary node classification: one has to predict whether the GitHub user is a web or a machine learning developer. This target feature was derived from the job title of each user.
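The edge definition above (mutual follower relationships) can be sketched as follows. The user names and follow pairs are made up for illustration and do not come from the dataset.

```python
# Hypothetical directed "follows" pairs: (follower, followed).
follows = {
    ("ana", "bo"),
    ("bo", "ana"),
    ("bo", "cy"),
    ("cy", "ana"),
    ("ana", "cy"),
}


def mutual_edges(directed_pairs):
    """Return undirected edges for pairs of users who follow each other."""
    edges = set()
    for follower, followed in directed_pairs:
        if follower != followed and (followed, follower) in directed_pairs:
            # Sort so that each undirected edge is stored exactly once.
            edges.add(tuple(sorted((follower, followed))))
    return edges


edges = mutual_edges(follows)
print(sorted(edges))
```

Here bo follows cy without being followed back, so that pair yields no edge; only reciprocal relationships survive.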
Properties
Possible Tasks

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.
* Source
Here's an example of how the data looks (each class takes three rows):
https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png (Visualized Fashion MNIST dataset)
The dataset provides a train set (86% of images, 60,000 images) and a test set (14% of images, 10,000 images) only. The train set is split to provide 80% of its images to the training set and 20% of its images to the validation set.
@online{xiao2017/online,
author = {Han Xiao and Kashif Rasul and Roland Vollgraf},
title = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms},
date = {2017-08-28},
year = {2017},
eprintclass = {cs.LG},
eprinttype = {arXiv},
eprint = {cs.LG/1708.07747},
}
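The 80/20 train/validation split described above can be sketched with the standard library alone. The function name, the fixed seed and the use of a random shuffle are assumptions for illustration; the description does not say how the images were actually assigned.

```python
import random


def split_indices(n_items, validation_fraction=0.2, seed=0):
    """Shuffle item indices and split them into train and validation parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_validation = round(n_items * validation_fraction)
    return indices[n_validation:], indices[:n_validation]


# 60,000 images in the original train set, as stated in the description.
train_idx, val_idx = split_indices(60_000)
print(len(train_idx), len(val_idx))
```

With 60,000 images this gives 48,000 training and 12,000 validation images, while the 10,000 test images stay untouched.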

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the datasets and experiment results presented in our arxiv paper:
B. Hoffman, M. Cusimano, V. Baglione, D. Canestrari, D. Chevallier, D. DeSantis, L. Jeantet, M. Ladds, T. Maekawa, V. Mata-Silva, V. Moreno-González, A. Pagano, E. Trapote, O. Vainio, A. Vehkaoja, K. Yoda, K. Zacarian, A. Friedlaender, "A benchmark for computational analysis of animal behavior, using animal-borne tags," 2023.
Standardized code to implement, train, and evaluate models can be found at https://github.com/earthspecies/BEBE/.
Please note the licenses in each dataset folder.
Zip folders beginning with "formatted": These are the datasets we used to run the experiments reported in the benchmark paper.
Zip folders beginning with "raw": These are the unprocessed datasets used in BEBE. Code to process these raw datasets into the formatted ones used by BEBE can be found at https://github.com/earthspecies/BEBE-datasets/.
Zip folders beginning with "experiments": Results of the cross-validation experiments reported in the paper, as well as hyperparameter optimization. Confusion matrices for all experiments can also be found here. Note that dt, rf, and svm refer to the feature set from Nathan et al., 2012.
Results used in Fig. 4 of arxiv paper (deep neural networks vs. classical models):
  {dataset}_harnet_nogyr
  {dataset}_CRNN
  {dataset}_CNN
  {dataset}_dt
  {dataset}_rf
  {dataset}_svm
  {dataset}_wavelet_dt
  {dataset}_wavelet_rf
  {dataset}_wavelet_svm
Results used in Fig. 5D of arxiv paper (full data setting)
If dataset contains gyroscope (HAR, jeantet_turtles, vehkaoja_dogs):
  {dataset}_harnet_nogyr
  {dataset}_harnet_random_nogyr
  {dataset}_harnet_unfrozen_nogyr
  {dataset}_RNN_nogyr
  {dataset}_CRNN_nogyr
  {dataset}_rf_nogyr
Otherwise:
  {dataset}_harnet_nogyr
  {dataset}_harnet_unfrozen_nogyr
  {dataset}_harnet_random_nogyr
  {dataset}_RNN_nogyr
  {dataset}_CRNN
  {dataset}_rf
Results used in Fig. 5E of arxiv paper (reduced data setting)
If dataset contains gyroscope (HAR, jeantet_turtles, vehkaoja_dogs):
  {dataset}_harnet_low_data_nogyr
  {dataset}_harnet_random_low_data_nogyr
  {dataset}_harnet_unfrozen_low_data_nogyr
  {dataset}_RNN_low_data_nogyr
  {dataset}_wavelet_RNN_low_data_nogyr
  {dataset}_CRNN_low_data_nogyr
  {dataset}_rf_low_data_nogyr
Otherwise:
  {dataset}_harnet_low_data_nogyr
  {dataset}_harnet_random_low_data_nogyr
  {dataset}_harnet_unfrozen_low_data_nogyr
  {dataset}_RNN_low_data_nogyr
  {dataset}_wavelet_RNN_low_data_nogyr
  {dataset}_CRNN_low_data
  {dataset}_rf_low_data
CSV files: we also include summaries of the experimental results in experiments_summary.csv, experiments_by_fold_individual.csv, experiments_by_fold_behavior.csv.
experiments_summary.csv - results averaged over individuals and behavior classes
  dataset (str): name of dataset
  experiment (str): name of model with experiment setting
  fig4 (bool): True if dataset+experiment was used in figure 4 of arxiv paper
  fig5d (bool): True if dataset+experiment was used in figure 5d of arxiv paper
  fig5e (bool): True if dataset+experiment was used in figure 5e of arxiv paper
  f1_mean (float): mean of macro-averaged F1 score, averaged over individuals in test folds
  f1_std (float): standard deviation of macro-averaged F1 score, computed over individuals in test folds
  prec_mean, prec_std (float): analogous for precision
  rec_mean, rec_std (float): analogous for recall
experiments_by_fold_individual.csv - results per individual in the test folds
  dataset (str): name of dataset
  experiment (str): name of model with experiment setting
  fig4 (bool): True if dataset+experiment was used in figure 4 of arxiv paper
  fig5d (bool): True if dataset+experiment was used in figure 5d of arxiv paper
  fig5e (bool): True if dataset+experiment was used in figure 5e of arxiv paper
  fold (int): test fold index
  individual (int): individuals are numbered zero-indexed, starting from fold 1
  f1 (float): macro-averaged f1 score for this individual
  precision (float): macro-averaged precision for this individual
  recall (float): macro-averaged recall for this individual
experiments_by_fold_behavior.csv - results per behavior class, for each test fold
  dataset (str): name of dataset
  experiment (str): name of model with experiment setting
  fig4 (bool): True if dataset+experiment was used in figure 4 of arxiv paper
  fig5d (bool): True if dataset+experiment was used in figure 5d of arxiv paper
  fig5e (bool): True if dataset+experiment was used in figure 5e of arxiv paper
  fold (int): test fold index
  behavior_class (str): name of behavior class
  f1 (float): f1 score for this behavior, averaged over individuals in the test fold
  precision (float): precision for this behavior, averaged over individuals in the test fold
  recall (float): recall for this behavior, averaged over individuals in the test fold
  train_ground_truth_label_counts (int): number of timepoints labeled with this behavior class, in the training set
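For readers unfamiliar with the f1_mean column, the sketch below shows one way a macro-averaged F1 score per individual, followed by a mean over individuals, could be computed. It is not the BEBE evaluation code: the behavior labels and predictions are invented, and scoring a class with no true or predicted samples as 0.0 is an assumption.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score for one class, computed from true/false positive counts."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    # Assumption: a class with no true or predicted samples scores 0.0.
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


# Hypothetical per-individual ground truth and predictions.
labels = ["rest", "walk"]
individuals = {
    0: (["rest", "rest", "walk", "walk"], ["rest", "walk", "walk", "walk"]),
    1: (["rest", "walk", "walk"], ["rest", "walk", "walk"]),
}

f1_by_individual = {
    ind: macro_f1(y_true, y_pred, labels)
    for ind, (y_true, y_pred) in individuals.items()
}
f1_mean = sum(f1_by_individual.values()) / len(f1_by_individual)
print(f1_by_individual, f1_mean)
```

Macro averaging weights every behavior class equally, so rare behaviors influence the score as much as common ones.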

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All the related data of our paper "Why do Machine Learning Notebooks Crash?" includes:
GitHub and Kaggle notebooks that contain error outputs.
GitHub notebooks are from The Stack repository[1].
Kaggle notebooks are public notebooks on Kaggle platform from year 2023, downloaded via KGTorrent[2].
Identified programming language results of GitHub notebooks.
Identified ML library results from Kaggle notebooks.
Datasets of crashes from GitHub and Kaggle.
Clustering results of crashes from all crashes, and from GitHub and Kaggle respectively.
Sampled crashes and associated notebooks (organized by cluster id).
Manual labeling and reviewing results.
Reproducing results.
The related code repository can be found here.

MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code
📖 Paper • 🚀 Github Page • 🦙 GitHub
ML-Bench is a novel dual-setup benchmark designed to evaluate Large Language Models (LLMs) and AI agents in generating repository-level code for machine learning tasks. The benchmark consists of 9,641 examples from 169 diverse tasks across 18 GitHub machine learning repositories. This dataset contains the following fields:… See the full description on the dataset page: https://huggingface.co/datasets/super-dainiu/ml-bench.

Foreseeing bugs, features, and questions on GitHub can be fun, especially when one is provided with a colossal dataset containing GitHub issues. In this hackathon, we are challenging the MachineHack community to come up with an algorithm that can predict bugs, features, and questions based on GitHub issue titles and text bodies. Text data can pose many challenges, especially when the dataset is big. Analyzing such a dataset requires a lot to be taken into account, mainly due to the preprocessing involved in representing raw text in a machine-understandable form. Usually, we stem and lemmatize the raw text and then represent it using TF-IDF, word embeddings, etc.
However, given state-of-the-art NLP models such as Transformer-based BERT models, one can skip manual feature engineering such as TF-IDF and count vectorizers. In this short span of time, we encourage you to leverage the ImageNet moment (transfer learning) in NLP using various pre-trained models.
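For reference, the classical TF-IDF representation mentioned above can be sketched in a few lines. This is the plain textbook variant (term frequency times log of inverse document frequency) with whitespace tokenization and no stemming; the issue titles are invented, and library implementations often apply smoothing and normalization that are omitted here.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per document (unsmoothed TF-IDF)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical issue titles.
titles = [
    "app crashes on login",
    "add dark mode",
    "how to reset password on login",
]
vectors = tf_idf(titles)
print(vectors[0])
```

Terms that appear in only one title (such as "crashes") receive a higher weight than terms shared across titles (such as "login"), which is what makes the representation useful for telling issue types apart.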
Hackathon Link- https://www.machinehack.com/hackathons/predict_github_issues_embold_sponsored_hackathon/overview

https://www.marketreportanalytics.com/privacy-policy
The Community-Driven Model Service Platform market is experiencing robust growth, projected to reach $35.14 billion in 2025 and maintain a Compound Annual Growth Rate (CAGR) of 10.1% from 2025 to 2033. This expansion is fueled by several key factors. The increasing availability of open-source models and datasets, fostered by platforms like Kaggle, GitHub, and Hugging Face, is democratizing access to advanced machine learning capabilities. This, in turn, accelerates innovation and reduces the barrier to entry for both developers and businesses. Furthermore, the growing demand for specialized AI solutions across diverse sectors—from healthcare and finance to manufacturing and retail—is driving adoption. The cloud-based segment holds a significant market share due to its scalability, accessibility, and cost-effectiveness compared to on-premises solutions. The adult application segment is currently the largest, reflecting the high concentration of skilled professionals and research activities within this group; however, the children's application segment shows significant growth potential given increasing educational initiatives incorporating AI. Geographic distribution shows North America and Europe currently leading market adoption, while Asia-Pacific is expected to witness rapid expansion driven by increasing digitalization and technological advancements. The competitive landscape is characterized by a mix of established technology giants and emerging startups. Platforms like TensorFlow Hub and Model Zoo provide comprehensive model repositories, while companies like DrivenData and Cortex focus on data-centric approaches. This competitive environment encourages continuous improvement and innovation within the platform offerings. Challenges include ensuring data security and privacy, addressing biases in datasets, and maintaining a balance between open collaboration and intellectual property rights. 
However, the overall trajectory points toward sustained market growth, fueled by ongoing technological advancements, increasing adoption across diverse industries, and the continuous contribution of a vibrant community of developers and researchers. Future growth will hinge on platforms successfully addressing the challenges and further enhancing collaborative features, fostering community engagement, and expanding the available resources.

This dataset contains Synthea synthetic patient data used in building ML models for lung cancer risk prediction. The ML models are used to simulate ML-enabled LHS. This open dataset is part of the synthetic data repository of the Open LHS project on GitHub: https://github.com/lhs-open/synthetic-data. For data source and methods, see the first ML-LHS simulation paper published in Nature Scientific Reports: https://www.nature.com/articles/s41598-022-23011-4.