100+ datasets found

Amos: A large-scale abdominal multi-organ benchmark for versatile medical...
explore.openaire.eu
zenodo.org
Updated Oct 29, 2022
Cite
YuanfengJi (2022). Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation (Unlabeled Data Part III) [Dataset]. http://doi.org/10.5281/zenodo.7295816
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.7295816
Dataset updated
Oct 29, 2022
Authors
YuanfengJi
Description
Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark from diverse clinical scenarios. Constrained by the high cost of collecting and labeling 3D medical data, most of the deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully comprehensive and fair estimate of various methods. To mitigate the limitations, we present AMOS, a large-scale, diverse, clinical dataset for abdominal organ segmentation. AMOS provides 500 CT and 100 MRI scans collected from multi-center, multi-vendor, multi-modality, multi-phase, multi-disease patients, each with voxel-level annotations of 15 abdominal organs, providing challenging examples and a test-bed for studying robust segmentation algorithms under diverse targets and scenarios. We further benchmark several state-of-the-art medical segmentation models to evaluate the status of the existing methods on this new challenging dataset. We have made our datasets, benchmark servers, and baselines publicly available, and hope to inspire future research. The paper can be found at https://arxiv.org/pdf/2206.08023.pdf

In addition to providing the labeled 600 CT and MRI scans, we expect to provide 2000 CT and 1200 MRI scans without labels to support more learning tasks (semi-supervised, unsupervised, domain adaptation, ...).
The links can be found in:
labeled data (500CT+100MRI)
unlabeled data Part I (900CT)
unlabeled data Part II (1100CT) (Now there are 1000CT, we will replenish to 1100CT)
unlabeled data Part III (1200MRI)

If you find this dataset useful for your research, please cite:

@article{ji2022amos,
  title={AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation},
  author={Ji, Yuanfeng and Bai, Haotian and Yang, Jie and Ge, Chongjian and Zhu, Ye and Zhang, Ruimao and Li, Zhen and Zhang, Lingyan and Ma, Wanling and Wan, Xiang and others},
  journal={arXiv preprint arXiv:2206.08023},
  year={2022}
}
Average dice coefficients of the few-supervised learning models using 2%,...
figshare.com
xls
Updated Sep 6, 2024
Cite
Seung-Ah Lee; Hyun Su Kim; Ehwa Yang; Young Cheol Yoon; Ji Hyun Lee; Byung-Ok Choi; Jae-Hun Kim (2024). Average dice coefficients of the few-supervised learning models using 2%, 5%, and 10% of the labeled data, and semi-supervised learning models using 10% of the labeled data for training. [Dataset]. http://doi.org/10.1371/journal.pone.0310203.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0310203.t002
Dataset updated
Sep 6, 2024
Dataset provided by
PLOS ONE
Authors
Seung-Ah Lee; Hyun Su Kim; Ehwa Yang; Young Cheol Yoon; Ji Hyun Lee; Byung-Ok Choi; Jae-Hun Kim
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Average dice coefficients of the few-supervised learning models using 2%, 5%, and 10% of the labeled data, and semi-supervised learning models using 10% of the labeled data for training.
‘BLE RSSI Dataset for Indoor localization’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 26, 2018
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2018). ‘BLE RSSI Dataset for Indoor localization’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-ble-rssi-dataset-for-indoor-localization-85fd/latest
Explore at:
Dataset updated
Jan 26, 2018
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘BLE RSSI Dataset for Indoor localization’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mehdimka/ble-rssi-dataset on 28 January 2022.

--- Dataset description provided by original source is as follows ---

Content

The dataset was created using the RSSI readings of an array of 13 iBeacons on the first floor of Waldo Library, Western Michigan University. Data was collected using an iPhone 6S. The dataset contains two sub-datasets: a labeled dataset (1420 instances) and an unlabeled dataset (5191 instances). The recording was performed during the operational hours of the library. For the labeled dataset, the input data contains the location (label column), a timestamp, followed by RSSI readings of 13 iBeacons. RSSI measurements are negative values. Larger RSSI values indicate closer proximity to a given iBeacon (e.g., an RSSI of -65 represents a closer distance to a given iBeacon than an RSSI of -85). For out-of-range iBeacons, the RSSI is indicated by -200. The locations related to RSSI readings are combined in one column consisting of a letter for the column and a number for the row of the position. The following figure depicts the layout of the iBeacons as well as the arrangement of locations.

Figure: iBeacons Layout (https://www.kaggle.com/mehdimka/ble-rssi-dataset/downloads/iBeacon_Layout.jpg)

Attribute Information

location: The location of receiving RSSIs from ibeacons b3001 to b3013; symbolic values showing the column and row of the location on the map (e.g., A01 stands for column A, row 1).

date: Datetime in the format of ‘d-m-yyyy hh:mm:ss’

b3001 - b3013: RSSI readings corresponding to the iBeacons; numeric, integers only.
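
The attribute description above can be turned into a small parsing sketch. The example below is illustrative only: the sample row is made up, the helper names `parse_location` and `strongest_beacon` are hypothetical, and it assumes location codes always consist of one letter followed by digits, as in the A01 example.

```python
import re

OUT_OF_RANGE = -200  # sentinel the description gives for out-of-range iBeacons


def parse_location(code):
    """Split a label such as 'A01' into (column letter, row number)."""
    match = re.fullmatch(r"([A-Za-z])(\d+)", code.strip())
    if match is None:
        raise ValueError(f"unexpected location code: {code!r}")
    return match.group(1).upper(), int(match.group(2))


def strongest_beacon(readings):
    """Return (beacon name, RSSI) for the closest in-range beacon, or None."""
    in_range = {name: rssi for name, rssi in readings.items() if rssi != OUT_OF_RANGE}
    if not in_range:
        return None
    # RSSI values are negative; the largest value is the closest beacon.
    name = max(in_range, key=in_range.get)
    return name, in_range[name]


# Made-up row using three of the thirteen beacon columns.
row = {"location": "A01", "b3001": -200, "b3002": -85, "b3003": -65}
readings = {k: v for k, v in row.items() if k.startswith("b30")}
print(parse_location(row["location"]), strongest_beacon(readings))
```

Filtering the -200 sentinel before any comparison or averaging matters because it is a placeholder rather than a measured signal strength.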

Acknowledgements

Provider: Mehdi Mohammadi and Ala Al-Fuqaha, {mehdi.mohammadi, ala-alfuqaha}@wmich.edu, Department of Computer Science, Western Michigan University

Citation Request:

M. Mohammadi, A. Al-Fuqaha, M. Guizani, J. Oh, “Semi-supervised Deep Reinforcement Learning in Support of IoT and Smart City Services,” IEEE Internet of Things Journal, Vol. PP, No. 99, 2017.

Inspiration

How unlabeled data can help build an improved learning system. How a GAN model can synthesize viable paths based on a small amount of labeled data and a larger set of unlabeled data.

--- Original source retains full ownership of the source dataset ---
Stanford STL-10 Image Dataset
academictorrents.com
bittorrent
Updated Nov 26, 2015
Cite
Adam Coates and Honglak Lee and Andrew Y. Ng (2015). Stanford STL-10 Image Dataset [Dataset]. https://academictorrents.com/details/a799a2845ac29a66c07cf74e2a2838b6c5698a6a
Explore at:
Available download formats: bittorrent (2640397119)
Dataset updated
Nov 26, 2015
Dataset authored and provided by
Adam Coates and Honglak Lee and Andrew Y. Ng
License
https://academictorrents.com/nolicensespecified
Description
The STL-10 dataset is an image recognition dataset for developing unsupervised feature learning, deep learning, self-taught learning algorithms. It is inspired by the CIFAR-10 dataset but with some modifications. In particular, each class has fewer labeled training examples than in CIFAR-10, but a very large set of unlabeled examples is provided to learn image models prior to supervised training. The primary challenge is to make use of the unlabeled data (which comes from a similar but different distribution from the labeled data) to build a useful prior. We also expect that the higher resolution of this dataset (96x96) will make it a challenging benchmark for developing more scalable unsupervised learning methods.

Overview

10 classes: airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck. Images are 96x96 pixels, color. 500 training images (10 pre-defined folds), 800 test images per class. 100000 unlabeled images for uns
Number of images used for the training and testing of the models with...
plos.figshare.com
xls
Updated Sep 6, 2024
Cite
Seung-Ah Lee; Hyun Su Kim; Ehwa Yang; Young Cheol Yoon; Ji Hyun Lee; Byung-Ok Choi; Jae-Hun Kim (2024). Number of images used for the training and testing of the models with different labeling strategies. [Dataset]. http://doi.org/10.1371/journal.pone.0310203.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0310203.t001
Dataset updated
Sep 6, 2024
Dataset provided by
PLOS ONE
Authors
Seung-Ah Lee; Hyun Su Kim; Ehwa Yang; Young Cheol Yoon; Ji Hyun Lee; Byung-Ok Choi; Jae-Hun Kim
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Number of images used for the training and testing of the models with different labeling strategies.
AI in Semi-supervised Learning Market Market Research Report 2033
researchintelo.com
csv, pdf, pptx
Updated Jul 24, 2025
Cite
Research Intelo (2025). AI in Semi-supervised Learning Market Market Research Report 2033 [Dataset]. https://researchintelo.com/report/ai-in-semi-supervised-learning-market-market
Explore at:
Available download formats: csv, pptx, pdf
Dataset updated
Jul 24, 2025
Dataset authored and provided by
Research Intelo
License
https://researchintelo.com/privacy-and-policy
Time period covered
2024 - 2033
Area covered
Global
Description
AI in Semi-supervised Learning Market Outlook

According to our latest research, the AI in Semi-supervised Learning market size reached USD 1.82 billion in 2024 globally, driven by rapid advancements in artificial intelligence and machine learning applications across diverse industries. The market is expected to expand at a robust CAGR of 28.1% from 2025 to 2033, reaching a projected value of USD 17.17 billion by 2033. This exponential growth is primarily fueled by the increasing need for efficient data labeling, the proliferation of unstructured data, and the growing adoption of AI-driven solutions in both large enterprises and small and medium businesses. As per the latest research, the surging demand for automation, accuracy, and cost-efficiency in data processing is significantly accelerating the adoption of semi-supervised learning models worldwide.
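
The quoted growth figures can be sanity-checked with the standard compound annual growth rate formula. The sketch below is not part of the report: the function names are illustrative, and it assumes 2024 is the base year, giving nine compounding periods up to 2033.

```python
def cagr(start_value, end_value, periods):
    """Compound annual growth rate implied by two values `periods` years apart."""
    return (end_value / start_value) ** (1.0 / periods) - 1.0


def project(start_value, rate, periods):
    """Value after compounding `rate` once per year for `periods` years."""
    return start_value * (1.0 + rate) ** periods


# Figures quoted in the summary above (USD billion).
size_2024 = 1.82
size_2033 = 17.17
years = 2033 - 2024  # assumption: 2024 is the base year

implied = cagr(size_2024, size_2033, years)
print(f"implied CAGR 2024-2033: {implied:.1%}")
print(f"2033 value at 28.1% per year: {project(size_2024, 0.281, years):.2f}")
```

Small gaps between the implied and quoted rates can come from rounding or from a different choice of base year, so the check is only indicative.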

One of the most significant growth factors for the AI in Semi-supervised Learning market is the explosive increase in data generation across industries such as healthcare, finance, retail, and automotive. Organizations are continually collecting vast amounts of structured and unstructured data, but the process of labeling this data for supervised learning remains time-consuming and expensive. Semi-supervised learning offers a compelling solution by leveraging small amounts of labeled data alongside large volumes of unlabeled data, thus reducing the dependency on extensive manual annotation. This approach not only accelerates the deployment of AI models but also enhances their accuracy and scalability, making it highly attractive for enterprises seeking to maximize the value of their data assets while minimizing operational costs.

Another critical driver propelling the growth of the AI in Semi-supervised Learning market is the increasing sophistication of AI algorithms and the integration of advanced technologies such as deep learning, natural language processing, and computer vision. These advancements have enabled semi-supervised learning models to achieve remarkable performance in complex tasks like image and speech recognition, medical diagnostics, and fraud detection. The ability to process and interpret vast datasets with minimal supervision is particularly valuable in sectors where labeled data is scarce or expensive to obtain. Furthermore, the ongoing investments in research and development by leading technology companies and academic institutions are fostering innovation, resulting in more robust and scalable semi-supervised learning frameworks that can be seamlessly integrated into enterprise workflows.

The proliferation of cloud computing and the increasing adoption of hybrid and multi-cloud environments are also contributing significantly to the expansion of the AI in Semi-supervised Learning market. Cloud-based deployment offers unparalleled scalability, flexibility, and cost-efficiency, allowing organizations of all sizes to access cutting-edge AI tools and infrastructure without the need for substantial upfront investments. This democratization of AI technology is empowering small and medium enterprises to leverage semi-supervised learning for competitive advantage, driving widespread adoption across regions and industries. Additionally, the emergence of AI-as-a-Service (AIaaS) platforms is further simplifying the integration and management of semi-supervised learning models, enabling businesses to accelerate their digital transformation initiatives and unlock new growth opportunities.

From a regional perspective, North America currently dominates the AI in Semi-supervised Learning market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The strong presence of leading AI vendors, robust technological infrastructure, and high investments in AI research and development are key factors driving market growth in these regions. Asia Pacific is expected to witness the fastest CAGR during the forecast period, fueled by rapid digitalization, expanding IT infrastructure, and increasing government initiatives to promote AI adoption. Meanwhile, Latin America and the Middle East & Africa are also showing promising growth potential, supported by rising awareness of AI benefits and growing investments in digital transformation projects across various sectors.

Component Analysis

The component segment of the AI in Semi-supervised Learning market is divided into software, hardware, and services, each playing a pivotal role in the adoption and implementation of semi-s
Square dataset - Dataset - LDM
service.tib.eu
Updated Dec 2, 2024
Cite
(2024). Square dataset - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/square-dataset
Explore at:
Dataset updated
Dec 2, 2024
Description
The dataset used in the paper is a wide domain image dataset, and the authors propose a weakly semi-supervised method for disentangling using both labeled and unlabeled data.
Data used in Machine learning reveals the waggle drift's role in the honey...
data.niaid.nih.gov
zenodo.org
Updated May 18, 2023
Cite
Wild, Benjamin (2023). Data used in Machine learning reveals the waggle drift's role in the honey bee dance communication system [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7928120
Explore at:
Dataset updated
May 18, 2023
Dataset provided by
Wild, Benjamin
Dormagen, David M
Wario, Fernando
Landgraf, Tim
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data and metadata used in "Machine learning reveals the waggle drift’s role in the honey bee dance communication system"

All timestamps are given in ISO 8601 format.

The following files are included:

Berlin2019_waggle_phases.csv, Berlin2021_waggle_phases.csv

Automatic individual detections of waggle phases during our recording periods in 2019 and 2021.

timestamp: Date and time of the detection.

cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).

x_median, y_median: Median position of the bee during the waggle phase (for 2019 given in millimeters after applying a homography, for 2021 in the original image coordinates).

waggle_angle: Body orientation of the bee during the waggle phase in radians (0: oriented to the right, PI / 4: oriented upwards).

Berlin2019_dances.csv

Automatic detections of dance behavior during our recording period in 2019.

dancer_id: Unique ID of the individual bee.

dance_id: Unique ID of the dance.

ts_from, ts_to: Date and time of the beginning and end of the dance.

cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).

median_x, median_y: Median position of the individual during the dance.

feeder_cam_id: ID of the feeder that the bee was detected at prior to the dance.

Berlin2019_followers.csv

Automatic detections of attendance and following behavior, corresponding to the dances in Berlin2019_dances.csv.

dance_id: Unique ID of the dance being attended or followed.

follower_id: Unique ID of the individual attending or following the dance.

ts_from, ts_to: Date and time of the beginning and end of the interaction.

label: “attendance” or “follower”

cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).
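
Because Berlin2019_followers.csv refers to dances through dance_id, the two files can be joined on that column. The following is a minimal sketch with made-up rows that only mimic the documented column names; it assumes the ISO 8601 timestamps are in a form that `datetime.fromisoformat` accepts, which may not hold for every variant of the format.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Made-up rows; the real files may contain additional columns.
DANCES_CSV = """dancer_id,dance_id,ts_from,ts_to,cam_id
17,100,2019-08-01T10:00:00+00:00,2019-08-01T10:00:30+00:00,0
23,101,2019-08-01T10:05:00+00:00,2019-08-01T10:05:12+00:00,1
"""
FOLLOWERS_CSV = """dance_id,follower_id,ts_from,ts_to,label,cam_id
100,31,2019-08-01T10:00:05+00:00,2019-08-01T10:00:20+00:00,follower,0
100,32,2019-08-01T10:00:10+00:00,2019-08-01T10:00:15+00:00,attendance,0
101,31,2019-08-01T10:05:02+00:00,2019-08-01T10:05:10+00:00,follower,1
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def duration_seconds(row):
    """Length of a dance or interaction from its ts_from/ts_to columns."""
    start = datetime.fromisoformat(row["ts_from"])
    end = datetime.fromisoformat(row["ts_to"])
    return (end - start).total_seconds()


dances = {row["dance_id"]: row for row in read_rows(DANCES_CSV)}
followers = read_rows(FOLLOWERS_CSV)

# Count interactions labeled "follower" (as opposed to "attendance") per dance.
follower_counts = Counter(r["dance_id"] for r in followers if r["label"] == "follower")

for dance_id, dance in dances.items():
    print(dance_id, duration_seconds(dance), follower_counts[dance_id])
```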

Berlin2019_dances_with_manually_verified_times.csv

A sample of dances from Berlin2019_dances.csv where the exact timestamps have been manually verified to correspond to the beginning of the first and last waggle phase down to a precision of ca. 166 ms (video material was recorded at 6 FPS).

dance_id: Unique ID of the dance.

dancer_id: Unique ID of the dancing individual.

cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).

feeder_cam_id: ID of the feeder that the bee was detected at prior to the dance.

dance_start, dance_end: Manually verified date and times of the beginning and end of the dance.

Berlin2019_dance_classifier_labels.csv

Manually annotated waggle phases or following behavior for our recording season in 2019 that was used to train the dancing and following classifier. Can be merged with the supplied individual detections.

timestamp: Timestamp of the individual frame the behavior was observed in.

frame_id: Unique ID of the video frame the behavior was observed in.

bee_id: Unique ID of the individual bee.

label: One of “nothing”, “waggle”, “follower”

Berlin2019_dance_classifier_unlabeled.csv

Additional unlabeled samples of timestamp and individual ID with the same format as Berlin2019_dance_classifier_labels.csv, but without a label. The data points have been sampled close to detections of our waggle phase classifier, so behaviors related to the waggle dance are likely overrepresented in that sample.

Berlin2021_waggle_phase_classifier_labels.csv

Manually annotated detections of our waggle phase detector (bb_wdd2) that were used to train the neural network filter (bb_wdd_filter) for the 2021 data.

detection_id: Unique ID of the waggle phase.

label: One of “waggle”, “activating”, “ventilating”, “trembling”, “other”. Here “waggle” denotes a waggle phase, “activating” is the shaking signal, and “ventilating” is a bee fanning her wings. “trembling” denotes a tremble dance, but the distinction from the “other” class was often not clear, so “trembling” was merged into “other” for training.

orientation: The body orientation of the bee that triggered the detection in radians (0: facing to the right, PI /4: facing up).

metadata_path: Path to the individual detection in the same directory structure as created by the waggle dance detector.

Berlin2021_waggle_phase_classifier_ground_truth.zip

The output of the waggle dance detector (bb_wdd2) that corresponds to Berlin2021_waggle_phase_classifier_labels.csv and is used for training. The archive includes a directory structure as output by the bb_wdd2 and each directory includes the original image sequence that triggered the detection in an archive and the corresponding metadata. The training code supplied in bb_wdd_filter directly works with this directory structure.

Berlin2019_tracks.zip

Detections and tracks from the recording season in 2019 as produced by our tracking system. As the full data is several terabytes in size, we include the subset of our data here that is relevant for our publication which comprises over 46 million detections. We included tracks for all detected behaviors (dancing, following, attending) including one minute before and after the behavior. We also included all tracks that correspond to the labeled and unlabeled data that was used to train the dance classifier including 30 seconds before and after the data used for training. We grouped the exported data by date to make the handling easier, but to efficiently work with the data, we recommend importing it into an indexable database.

The individual files contain the following columns:

cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).

timestamp: Date and time of the detection.

frame_id: Unique ID of the video frame of the recording from which the detection was extracted.

track_id: Unique ID of an individual track (short motion path from one individual). For longer tracks, the detections can be linked based on the bee_id.

bee_id: Unique ID of the individual bee.

bee_id_confidence: Confidence between 0 and 1 that the bee_id is correct as output by our tracking system.

x_pos_hive, y_pos_hive: Spatial position of the bee in the hive on the side indicated by cam_id. Given in millimeters after applying a homography on the video material.

orientation_hive: Orientation of the bees’ thorax in the hive in radians (0: oriented to the right, PI / 4: oriented upwards).
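
The recommendation to import the tracks into an indexable database could look like the sketch below, which uses SQLite from the Python standard library. The rows are made up, the integer ID types and the index choice are assumptions, and reading the exported files inside the archive is left out because their exact format is not described here.

```python
import sqlite3

# Made-up detections in the documented column order.
rows = [
    (0, "2019-08-01T10:00:00.000000+00:00", 1, 10, 42, 0.98, 120.5, 80.0, 0.10),
    (0, "2019-08-01T10:00:00.166667+00:00", 2, 10, 42, 0.97, 121.0, 80.4, 0.12),
    (1, "2019-08-01T10:00:00.000000+00:00", 3, 11, 7, 0.60, 30.0, 200.0, 1.50),
]

conn = sqlite3.connect(":memory:")  # a file path would be used for the real data
conn.execute(
    """
    CREATE TABLE detections (
        cam_id INTEGER, timestamp TEXT, frame_id INTEGER, track_id INTEGER,
        bee_id INTEGER, bee_id_confidence REAL,
        x_pos_hive REAL, y_pos_hive REAL, orientation_hive REAL
    )
    """
)
conn.executemany("INSERT INTO detections VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)
# Index chosen for per-bee, time-ordered lookups.
conn.execute("CREATE INDEX idx_bee_time ON detections (bee_id, timestamp)")


def trajectory(conn, bee_id, min_confidence=0.9):
    """Time-ordered positions of one bee, skipping low-confidence IDs."""
    cur = conn.execute(
        "SELECT timestamp, x_pos_hive, y_pos_hive FROM detections "
        "WHERE bee_id = ? AND bee_id_confidence >= ? ORDER BY timestamp",
        (bee_id, min_confidence),
    )
    return cur.fetchall()


print(trajectory(conn, 42))
```

Storing timestamps as text keeps the sketch simple; ordering by the text column matches chronological order only if all timestamps share one format and time zone.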

Berlin2019_feeder_experiment_log.csv

Experiment log for our feeder experiments in 2019.

date: Date given in the format year-month-day.

feeder_cam_id: Numeric ID of the feeder.

coordinates: Longitude and latitude of the feeder. For feeders 1 and 2 this is only given once and held constant. Feeder 3 had varying locations.

time_opened, time_closed: Date and time when the feeder was set up or closed again.

sucrose_solution: Concentration of the sucrose solution given as sugar:water (in terms of weight). On days where feeder 3 was open, the other two feeders offered water without sugar.

Software used to acquire and analyze the data:

bb_pipeline: Tag localization and decoding pipeline

bb_pipeline_models: Pretrained localizer and decoder models for bb_pipeline

bb_binary: Raw detection data storage format

bb_irflash: IR flash system schematics and arduino code

bb_imgacquisition: Recording and network storage

bb_behavior: Database interaction and data (pre)processing, feature extraction

bb_tracking: Tracking of bee detections over time

bb_wdd2: Automatic detection and decoding of honey bee waggle dances

bb_wdd_filter: Machine learning model to improve the accuracy of the waggle dance detector

bb_dance_networks: Detection of dancing and following behavior from trajectories
AgriShelf: A Multi-Class, Bi-Source Image Dataset for Smart Agri-Food...
data.mendeley.com
Updated Apr 24, 2025
Cite
Tala Jano (2025). AgriShelf: A Multi-Class, Bi-Source Image Dataset for Smart Agri-Food Retailing Applications [Dataset]. http://doi.org/10.17632/s3vc2552sf.4
Explore at:
Unique identifier
https://doi.org/10.17632/s3vc2552sf.4
Dataset updated
Apr 24, 2025
Authors
Tala Jano
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In this dataset, we have compiled a comprehensive collection of 16,592 agri-food retail images across various classes commonly found in grocery and supermarket environments. To ensure generalizability, the dataset was collected using two distinct sources: a smartphone and an Intel RealSense Depth Camera (D435i), under diverse, real-world conditions, such as shelf inclinations, lighting levels, and different angles. The dataset is structured into two main subsets: unlabeled and labeled. The unlabeled subset is curated for key computer vision tasks relevant to retail applications, including classification, object detection, and product recognition. The labeled subset consists of 2,416 samples with detailed centroid annotations, making it suitable for On-Shelf Availability (OSA) estimation, counting, or multi-task learning approaches. Altogether, both subsets serve as valuable benchmarks for evaluating and testing automated inventory monitoring systems and real-time retail analytics applications.
Urban Sound & Sight (Urbansas) - Labeled set
zenodo.org
explore.openaire.eu
txt, zip
Updated Jun 20, 2022
Cite
Magdalena Fuentes; Bea Steers; Pablo Zinemanas; Martín Rocamora; Luca Bondi; Julia Wilkins; Qianyi Shi; Yao Hou; Samarjit Das; Xavier Serra; Juan Pablo Bello (2022). Urban Sound & Sight (Urbansas) - Labeled set [Dataset]. http://doi.org/10.5281/zenodo.6658386
Explore at:
Available download formats: txt, zip
Unique identifier
https://doi.org/10.5281/zenodo.6658386
Dataset updated
Jun 20, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Magdalena Fuentes; Bea Steers; Pablo Zinemanas; Martín Rocamora; Luca Bondi; Julia Wilkins; Qianyi Shi; Yao Hou; Samarjit Das; Xavier Serra; Juan Pablo Bello
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Urban Sound & Sight (Urbansas):

Version 1.0, May 2022

Created by
Magdalena Fuentes (1, 2), Bea Steers (1, 2), Pablo Zinemanas (3), Martín Rocamora (4), Luca Bondi (5), Julia Wilkins (1, 2), Qianyi Shi (2), Yao Hou (2), Samarjit Das (5), Xavier Serra (3), Juan Pablo Bello (1, 2)
1. Music and Audio Research Lab, New York University
2. Center for Urban Science and Progress, New York University
3. Universitat Pompeu Fabra, Barcelona, Spain
4. Universidad de la República, Montevideo, Uruguay
5. Bosch Research, Pittsburgh, PA, USA

Publication

If using this data in academic work, please cite the following paper, which presented this dataset:
M. Fuentes, B. Steers, P. Zinemanas, M. Rocamora, L. Bondi, J. Wilkins, Q. Shi, Y. Hou, S. Das, X. Serra, J. Bello. “Urban Sound & Sight: Dataset and Benchmark for Audio-Visual Urban Scene Understanding”. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.

Description

Urbansas is a dataset for the development and evaluation of machine listening systems for audiovisual spatial urban understanding. One of the main challenges to this field of study is a lack of realistic, labeled data to train and evaluate models on their ability to localize using a combination of audio and video.
We set four main goals for creating this dataset:
1. To compile a set of real-field audio-visual recordings;
2. The recordings should be stereo to allow exploring sound localization in the wild;
3. The compilation should be varied in terms of scenes and recording conditions to be meaningful for training and evaluation of machine learning models;
4. The labeled collection should be accompanied by a bigger unlabeled collection with similar characteristics to allow exploring self-supervised learning in urban contexts.
Audiovisual data
We have compiled and manually annotated Urbansas from two publicly available datasets, plus the addition of unreleased material. The public datasets are the TAU Urban Audio-Visual Scenes 2021 Development dataset (street-traffic subset) and the Montevideo Audio-Visual Dataset (MAVD):

Wang, Shanshan, et al. "A curated dataset of urban scenes for audio-visual scene analysis." ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.

Zinemanas, Pablo, Pablo Cancela, and Martín Rocamora. "MAVD: A dataset for sound event detection in urban environments." Detection and Classification of Acoustic Scenes and Events, DCASE 2019, New York, NY, USA, 25–26 oct, page 263--267 (2019).

The TAU dataset consists of 10-second segments of audio and video from different scenes across European cities, traffic being one of the scenes. Only the scenes labeled as traffic were included in Urbansas. MAVD is an audio-visual traffic dataset curated in different locations of Montevideo, Uruguay, with annotations of vehicles and vehicle components sounds (e.g. engine, brakes) for sound event detection. Besides the published datasets, we include a total of 9.5 hours of unpublished material recorded in Montevideo, with the same recording devices of MAVD but including new locations and scenes.

Recordings for TAU were acquired using a GoPro Hero 5 (30fps, 1280x720) and a Soundman OKM II Klassik/studio A3 electret binaural in-ear microphone with a Zoom F8 audio recorder (48kHz, 24 bits, stereo). Recordings for MAVD were collected using a GoPro Hero 3 (24fps, 1920x1080) and a SONY PCM-D50 recorder (48kHz, 24 bits, stereo).

Compiled together, Urbansas includes 15 hours of stereo audio and video, stored in separate 10-second MPEG4 (1280x720, 24fps) and WAV (48kHz, 24 bit, 2 channel) files. Both released video datasets are already anonymized to obscure people and license plates; the unpublished MAVD data was anonymized similarly using this anonymizer. We also distribute the 2fps video used for producing the annotations.

The audio and video files both share the same filename stem, meaning that they can be associated after removing the parent directory and extension.

MAVD:
video/

TAU:
video/

where location_id in both cases includes the city and an ID number.
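
Associating audio and video through the shared filename stem can be sketched as follows. The paths are hypothetical placeholders rather than the dataset's actual directory layout, and the helper names are illustrative.

```python
from pathlib import PurePosixPath


def file_id(path):
    """File ID as described above: the path minus parent directory and extension."""
    return PurePosixPath(path).stem


def pair_by_stem(audio_paths, video_paths):
    """Map each shared stem to its (audio path, video path) pair."""
    videos = {file_id(p): p for p in video_paths}
    return {
        file_id(a): (a, videos[file_id(a)])
        for a in audio_paths
        if file_id(a) in videos
    }


# Hypothetical paths for illustration only.
audio = ["audio/clip_a.wav", "audio/clip_b.wav"]
video = ["video/clip_a.mp4", "video/clip_c.mp4"]
pairs = pair_by_stem(audio, video)
print(pairs)
```

Files without a counterpart are dropped silently here; a stricter version could report them instead.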

city        places  clips  mins  frames  labeled mins
Montevideo       8   4085   681  980400            92
Stockholm        3     91    15   21840             2
Barcelona        4    144    24   34560            24
Helsinki         4    144    24   34560            16
Lisbon           4    144    24   34560            19
Lyon             4    144    24   34560             6
Paris            4    144    24   34560             2
Prague           4    144    24   34560             2
Vienna           4    144    24   34560             6
London           5    144    24   34560             4
Milan            6    144    24   34560             6
------------------------------------------------------
Total           50   5472   912    1.3M           180
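
The mins and frames columns follow from the clip counts, given the 10-second clip length and the 24fps video stated in the description. A short consistency check, with an illustrative helper name and minutes rounded to the nearest integer, might look like this:

```python
CLIP_SECONDS = 10  # clip length stated in the description
VIDEO_FPS = 24     # frame rate of the distributed MPEG4 files


def clip_stats(n_clips):
    """Return (minutes rounded to the nearest integer, frame count)."""
    seconds = n_clips * CLIP_SECONDS
    return round(seconds / 60), seconds * VIDEO_FPS


# Clip counts taken from three rows of the table.
for city, clips in [("Montevideo", 4085), ("Stockholm", 91), ("Barcelona", 144)]:
    minutes, frames = clip_stats(clips)
    print(city, minutes, frames)
```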

Annotations

Of the 15 hours of audio and video, 3 hours of data (1.5 hours TAU, 1.5 hours MAVD) are manually annotated by our team both in audio and image, along with 12 hours of unlabeled data (2.5 hours TAU, 9.5 hours of unpublished material) for the benefit of unsupervised models. The distribution of clips across locations was selected to maximize variance across different scenes. The annotations were collected at 2 frames per second (FPS) as it provided a balance between temporal granularity and clip coverage.

The annotation data is contained in video_annotations.csv and audio_annotations.csv.

Video Annotations

Each row in the video annotations represents a single object in a single frame of the video. The annotation schema is as follows:

frame_id: The index of the frame within the clip the annotation is associated with. This index is 0-based and goes up to 19 (assuming 10-second clips with annotations at 2 FPS)

track_id: The ID of the detected instance that identifies the same object across different frames. These IDs are guaranteed to be unique within a clip.

x, y, w, h: The top-left corner and width and height of the object’s bounding box in the video. The values are given in absolute coordinates with respect to the image size (1280x720).

class_id: The index of the class corresponding to: [0, 1, 2, 3, -1] — see label for the index mapping. The -1 value corresponds to the case where there are no events, but still clip-level annotations, like night and city. When operating on bounding boxes, class_id of -1 should be filtered.

label: The label text. This is equivalent to LABELS[class_id], where LABELS=[car, bus, motorbike, truck, -1]. The label -1 has the same role as above.

visibility: The visibility of the object. This is 1 unless the object becomes obstructed, where it changes to 0.

filename: The file ID of the associated file. This is the file’s path minus the parent directory and extension.

city: The city where the clip was collected.

location_id: The specific name of the location. This may include an integer ID following the city name for cases where there are multiple collection points.

time: The time (in seconds) of the annotation, relative to the start of the file. Equivalent to frame_id / fps.

night: Whether the clip takes place during the day or at night. There is a single value per clip.

subset: Which data source the data originally belongs to (TAU or MAVD).
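The schema above can be exercised with a small loader. This is a minimal sketch under stated assumptions: the sample rows are hypothetical, only a subset of the documented columns is used, and the real video_annotations.csv may encode values differently (for example, as floats or with empty cells on class_id -1 rows).

```python
import csv
import io

ANNOTATION_FPS = 2  # annotations were collected at 2 frames per second

# Hypothetical rows using a subset of the documented columns.
SAMPLE_CSV = """frame_id,track_id,x,y,w,h,class_id,label,visibility,filename
0,1,100,200,50,40,0,car,1,clip_a
3,1,130,200,50,40,0,car,0,clip_a
0,0,0,0,0,0,-1,-1,1,clip_b
"""


def load_boxes(csv_text, fps=ANNOTATION_FPS):
    """Parse rows, drop class_id == -1, and derive time and box corners."""
    boxes = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if int(row["class_id"]) == -1:
            continue  # clip-level row without a bounding box
        x, y, w, h = (float(row[k]) for k in ("x", "y", "w", "h"))
        boxes.append({
            "filename": row["filename"],
            "track_id": int(row["track_id"]),
            "label": row["label"],
            "time": int(row["frame_id"]) / fps,  # time = frame_id / fps
            "corners": (x, y, x + w, y + h),     # top-left and bottom-right
            "visible": row["visibility"] == "1",
        })
    return boxes


boxes = load_boxes(SAMPLE_CSV)
```

Converting (x, y, w, h) to corner coordinates is a convenience for detection libraries that expect that layout; the source itself stores top-left plus width and height.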

Audio Annotations

Each row represents a single object instance, along with the time range that it exists within the clip. The annotation schema is as follows:

filename: The file ID of the associated audio file. See filename above.

class_id, label: See above. Audio has an additional class_id of 4 (label=offscreen) which indicates an off-screen vehicle - meaning a vehicle that is heard but not seen. A class_id of -1 indicates a clip-level annotation for a clip that has no object annotations (an empty scene).

non_identifiable_vehicle_sound: True if the region contains the sound of vehicles where individual instances cannot be uniquely identified.

start, end: The start and end times (in seconds) of the annotation relative to the file.
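Because audio annotations are time ranges while video annotations are per-frame, one way to relate them is to look up which audio events are active at a frame's timestamp. The sketch below assumes rows have already been parsed into dictionaries; the sample rows and the function name are hypothetical.

```python
OFFSCREEN_CLASS_ID = 4  # audio-only class: vehicle heard but not seen

# Hypothetical audio annotation rows (subset of the documented columns).
audio_rows = [
    {"filename": "clip_a", "class_id": 0, "label": "car", "start": 0.0, "end": 2.4},
    {"filename": "clip_a", "class_id": 4, "label": "offscreen", "start": 1.0, "end": 6.0},
    {"filename": "clip_b", "class_id": -1, "label": "-1", "start": 0.0, "end": 10.0},
]


def active_events(rows, filename, time_s):
    """Return audio events in `filename` whose [start, end] range covers time_s."""
    return [
        r for r in rows
        if r["filename"] == filename
        and r["class_id"] != -1  # skip clip-level rows for empty scenes
        and r["start"] <= time_s <= r["end"]
    ]


# Audio events active at video frame_id 3 (time = 3 / 2 FPS = 1.5 s).
events = active_events(audio_rows, "clip_a", 3 / 2)
on_screen = [e["label"] for e in events if e["class_id"] != OFFSCREEN_CLASS_ID]
off_screen = [e["label"] for e in events if e["class_id"] == OFFSCREEN_CLASS_ID]
```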

Conditions of use

Dataset created by Magdalena Fuentes, Bea Steers, Pablo Zinemanas, Martín Rocamora, Luca Bondi, Julia Wilkins, Qianyi Shi, Yao Hou, Samarjit Das, Xavier Serra, and Juan Pablo Bello.

The Urbansas dataset is offered free of charge under the following terms:

Urbansas annotations are released under the CC BY 4.0 license

Urbansas video and audio retain the licenses of their original sources:

MAVD subset is released under CC BY 4.0

TAU subset is released under a Non-Commercial license

Feedback

Please help us improve Urbansas by sending your feedback to:

Magdalena Fuentes: mfuentes@nyu.edu

Bea Steers: bsteers@nyu.edu

In case of a problem, please include as many details as possible.

Acknowledgments

This work was partially supported by the National Science
Self-Supervised Learning Market Report | Global Forecast From 2025 To 2033
dataintelo.com
csv, pdf, pptx
Updated Sep 23, 2024
Cite
Dataintelo (2024). Self-Supervised Learning Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-self-supervised-learning-market
Available download formats: pdf, pptx, csv
Dataset updated
Sep 23, 2024
Dataset provided by
Authors
Dataintelo
License
https://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Self-Supervised Learning Market Outlook

As of 2023, the global self-supervised learning market size is valued at approximately USD 1.5 billion and is expected to escalate to around USD 10.8 billion by 2032, reflecting a compound annual growth rate (CAGR) of 24.1% during the forecast period. This robust growth is driven by the increasing demand for advanced AI models that can learn from large volumes of unlabeled data, significantly reducing the dependency on labeled datasets, thereby making AI training more cost-effective and scalable.
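For readers who want to relate the start value, end value, and growth rate quoted above, the standard compound annual growth rate formula can be sketched as follows. Taking 2023 to 2032 as nine annual periods is an assumption; the report does not state which endpoints its rate is computed over, so the numbers produced here need not match the quoted CAGR exactly.

```python
def cagr(start_value: float, end_value: float, periods: int) -> float:
    """Compound annual growth rate: (end / start) ** (1 / periods) - 1."""
    if start_value <= 0 or periods <= 0:
        raise ValueError("start_value and periods must be positive")
    return (end_value / start_value) ** (1 / periods) - 1


def project(start_value: float, rate: float, periods: int) -> float:
    """Value after compounding `rate` for `periods` years."""
    return start_value * (1 + rate) ** periods


# Figures from the paragraph above, in USD billions.
# Assumption: nine annual periods (2023 -> 2032).
implied_rate = cagr(1.5, 10.8, 9)
projected_value = project(1.5, 0.241, 9)
```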

The growth of the self-supervised learning market is fueled by several factors, one of which is the exponential increase in data generation. With the proliferation of digital devices, IoT technologies, and social media platforms, there is an unprecedented amount of data being created every second. Self-supervised learning models leverage this vast amount of unlabeled data to train themselves, making them particularly valuable in industries where data labeling is time-consuming and expensive. This capability is especially pertinent in fields like healthcare, finance, and retail, where the rapid analysis of extensive datasets can lead to significant advancements in predictive analytics and customer insights.

Another critical driver is the advancement in computational technologies that support more sophisticated machine learning models. The development of more powerful GPUs and cloud-based AI platforms has enabled the efficient training and deployment of self-supervised learning models. These technological advancements not only reduce the time required for training but also enhance the accuracy and performance of the models. Furthermore, the integration of self-supervised learning with other AI paradigms such as reinforcement learning and deep learning is opening new avenues for research and application, further propelling market growth.

The increasing adoption of AI across various industries is also a significant growth factor. Businesses are increasingly recognizing the potential of AI to optimize operations, enhance customer experiences, and drive innovation. Self-supervised learning, with its ability to make sense of large, unstructured datasets, is becoming a cornerstone of AI strategies across sectors. For instance, in the healthcare sector, self-supervised learning is being used to develop predictive models for disease diagnosis and treatment planning, while in the finance sector, it aids in fraud detection and risk management.

Regionally, North America is expected to dominate the self-supervised learning market, owing to the presence of leading technology companies and extensive R&D activities in AI. However, the Asia Pacific region is anticipated to witness the fastest growth during the forecast period, driven by rapid digital transformation, increasing investment in AI technologies, and supportive government initiatives. Europe also presents a significant market opportunity, with a strong focus on AI research and development, particularly in countries like Germany, the UK, and France.

Component Analysis

The self-supervised learning market is segmented by component into software, hardware, and services. The software segment is expected to hold the largest market share, driven by the development and adoption of advanced AI algorithms and platforms. These software solutions are designed to leverage the vast amounts of unlabeled data available, making them highly valuable for various applications such as natural language processing, computer vision, and predictive analytics. Furthermore, continuous advancements in software capabilities, such as improved model training techniques and enhanced data preprocessing tools, are expected to fuel the growth of this segment.

The hardware segment, while smaller in comparison to software, is crucial for the efficient deployment of self-supervised learning models. This includes high-performance computing systems, GPUs, and specialized AI accelerators that provide the necessary computational power to train and run complex AI models. Innovations in hardware technology, such as the development of more energy-efficient and powerful processing units, are expected to drive growth in this segment. Additionally, the increasing adoption of edge computing devices that can perform AI tasks locally, thereby reducing latency and bandwidth usage, is also contributing to the expansion of the hardware segment.

Services are another vital component of the self-supervised learning market. This segment encompasses various professional services such as consulting, int
Machine Learning Courses Market Report | Global Forecast From 2025 To 2033
dataintelo.com
csv, pdf, pptx
Updated Jan 7, 2025
Cite
Dataintelo (2025). Machine Learning Courses Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/machine-learning-courses-market
Available download formats: csv, pptx, pdf
Dataset updated
Jan 7, 2025
Dataset provided by
Authors
Dataintelo
License
https://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Machine Learning Courses Market Outlook

The global market for Machine Learning (ML) courses is witnessing substantial growth, with the market estimated at $3.1 billion in 2023 and projected to reach $12.6 billion by 2032, a CAGR of 16.5% over the forecast period. This expansion is fueled by the increasing adoption of artificial intelligence (AI) and machine learning technologies across various industries, the rising need for upskilling and reskilling in the workforce, and the growing penetration of online education platforms.

One of the most significant growth factors driving the ML courses market is the escalating demand for AI and ML expertise in the job market. As industries increasingly integrate AI and machine learning into their operations to enhance efficiency and innovation, there is a burgeoning need for professionals with relevant skills. Companies across sectors such as finance, healthcare, retail, and manufacturing are investing heavily in training programs to bridge the skills gap, thus driving the demand for ML courses. Additionally, the rapid evolution of technology necessitates continuous learning, further bolstering market growth.

Another crucial factor contributing to the market's expansion is the proliferation of online education platforms that offer flexible and affordable ML courses. Platforms like Coursera, Udacity, edX, and Khan Academy have made high-quality education accessible to a global audience. These platforms offer an array of courses tailored to different skill levels, from beginners to advanced learners, making it easier for individuals to pursue continuous learning and career advancement. The convenience and flexibility of online learning are particularly appealing to working professionals and students, thereby driving the market's growth.

The increasing collaboration between educational institutions and technology companies is also playing a pivotal role in the growth of the ML courses market. Many universities and colleges are partnering with leading tech firms to develop specialized curricula that align with industry requirements. These collaborations help ensure that the courses offered are up-to-date with the latest technological advancements and industry standards. As a result, students and professionals are better equipped with the skills needed to thrive in a technology-driven job market, further propelling the demand for ML courses.

On a regional level, North America holds a significant share of the ML courses market, driven by the presence of numerous leading tech companies and educational institutions, as well as a highly skilled workforce. The region's strong emphasis on innovation and technological advancement is a key driver of market growth. Additionally, Asia Pacific is emerging as a lucrative market for ML courses, with countries like China, India, and Japan witnessing increased investments in AI and ML education and training. The rising internet penetration, growing popularity of online education, and government initiatives to promote digital literacy are some of the factors contributing to the market's growth in this region.

Self-Supervised Learning, a cutting-edge approach in the realm of machine learning, is gaining traction as a pivotal element in the development of more autonomous AI systems. Unlike traditional supervised learning, which relies heavily on labeled data, self-supervised learning leverages unlabeled data to train models, significantly reducing the dependency on human intervention for data annotation. This method is particularly advantageous in scenarios where acquiring labeled data is costly or impractical. By enabling models to learn from vast amounts of unlabeled data, self-supervised learning enhances the ability of AI systems to generalize from limited labeled examples, thereby improving their performance in real-world applications. The integration of self-supervised learning techniques into machine learning courses is becoming increasingly important, as it equips learners with the knowledge to tackle complex AI challenges and develop more robust models.

Course Type Analysis

The Machine Learning Courses market is segmented by course type into online courses, offline courses, bootcamps, and workshops. Online courses dominate the segment due to their accessibility, flexibility, and cost-effectiveness. Platforms like Coursera and Udacity have democratized access to high-quality ML education, enabling lear
Data from: Leveraging Unlabeled Data for Superior ROC Curve Estimation via a...
tandf.figshare.com
bin
Updated Feb 26, 2025
Cite
Menghua Zhang; Mengjiao Peng; Yong Zhou (2025). Leveraging Unlabeled Data for Superior ROC Curve Estimation via a Semiparametric Approach [Dataset]. http://doi.org/10.6084/m9.figshare.28156199.v2
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.28156199.v2
Dataset updated
Feb 26, 2025
Dataset provided by
Taylor & Francis
Authors
Menghua Zhang; Mengjiao Peng; Yong Zhou
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The receiver operating characteristic (ROC) curve is a widely used tool in various fields, including economics, medicine, and machine learning, for evaluating classification performance and comparing treatment effects. Clear and readily available labels are frequently absent when estimating the ROC curve, for reasons such as labeling cost, time constraints, data privacy, and information asymmetry. Traditional supervised estimators commonly rely solely on labeled data, where each sample is associated with a fully observed response variable. We propose a new set of semi-supervised (SS) estimators that exploit available unlabeled data (samples lacking observed responses) to enhance estimation precision under a semiparametric setting, assuming that the distribution of the response variable for one group is known up to unknown parameters. The proposed SS estimators have attractive properties, such as adaptability and efficiency, gained by leveraging the flexibility of the kernel smoothing method. We establish the large-sample properties of the SS estimators, which demonstrate that they consistently outperform the supervised estimator under mild assumptions. Numerical experiments provide empirical evidence to support our theoretical findings. Finally, we showcase the practical applicability of the proposed methodology by applying it to two real datasets.
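For orientation, the conventional supervised baseline that the abstract contrasts against can be sketched as the empirical ROC curve computed from labeled data only. This is the textbook estimator, not the semi-supervised method proposed by the authors; the scores and labels below are made up for illustration.

```python
def roc_points(scores, labels):
    """Empirical ROC: (FPR, TPR) at each distinct threshold, labeled data only."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical classifier scores with binary labels (1 = positive).
scores = [0.9, 0.85, 0.8, 0.1]
labels = [1, 0, 1, 0]
curve = roc_points(scores, labels)
area = auc(curve)
```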
SemiEvol
huggingface.co
Updated Oct 22, 2024
Cite
junyu (2024). SemiEvol [Dataset]. https://huggingface.co/datasets/luojunyu/SemiEvol
Available format: Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 22, 2024
Authors
junyu
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Card for Dataset Name

The SemiEvol dataset is part of the broader work on semi-supervised fine-tuning for Large Language Models (LLMs). The dataset includes labeled and unlabeled data splits designed to enhance the reasoning capabilities of LLMs through a bi-level knowledge propagation and selection framework, as proposed in the paper SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation.

Dataset Details Dataset Sources [optional]… See the full description on the dataset page: https://huggingface.co/datasets/luojunyu/SemiEvol.
Hyper Kvasir Dataset
universe.roboflow.com
zip
Updated Jul 24, 2024
Cite
Simula (2024). Hyper Kvasir Dataset [Dataset]. https://universe.roboflow.com/simula/hyper-kvasir/model/1
Available download formats: zip
Dataset updated
Jul 24, 2024
Dataset authored and provided by
Simula
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
GI Tract
Description
Overview

This is the largest gastrointestinal dataset, generously provided by Simula Research Laboratory in Norway.

You can read their research paper here in Nature

In total, the dataset contains 10,662 labeled images stored in JPEG format. The images can be found in the images folder. The class each image belongs to corresponds to the folder it is stored in (e.g., the ’polyp’ folder contains all polyp images, the ’barretts’ folder contains all images of Barrett’s esophagus, etc.). Each class folder is located in a subfolder describing the type of finding, which in turn is located in a folder describing whether it is a lower GI or upper GI finding. The number of images per class is not balanced, which is a general challenge in the medical field because some findings occur more often than others. This adds a further challenge for researchers, since methods applied to the data should also be able to learn from a small amount of training data. The labeled images represent 23 different classes of findings.
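The folder convention described above (tract, then finding type, then class) means labels can be read directly from each image path. The sketch below assumes that three-level layout; apart from 'polyp' and 'barretts', the folder names and all file names are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath


def parse_image_path(path: str) -> dict:
    """Split .../<GI tract>/<finding type>/<class>/<file> into its parts."""
    parts = PurePosixPath(path).parts
    return {
        "tract": parts[-4],
        "finding_type": parts[-3],
        "class": parts[-2],
        "file": parts[-1],
    }


# Hypothetical paths; only 'polyp' and 'barretts' are named in the description.
paths = [
    "images/lower-gi-tract/pathological-findings/polyp/0001.jpg",
    "images/lower-gi-tract/pathological-findings/polyp/0002.jpg",
    "images/upper-gi-tract/pathological-findings/barretts/0003.jpg",
]

# Counting images per class makes the class imbalance visible.
class_counts = Counter(parse_image_path(p)["class"] for p in paths)
```

With the real dataset, the per-class counts could feed class weights or a stratified split, which is one common way to handle the imbalance the description mentions.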

The data was collected during real gastro- and colonoscopy examinations at a hospital in Norway and partly labeled by experienced gastrointestinal endoscopists.

Use Cases

"Artificial intelligence is currently a hot topic in medicine. The fact that medical data is often sparse and hard to obtain due to legal restrictions and lack of medical personnel to perform the cumbersome and tedious labeling of the data, leads to technical limitations. In this respect, we share the Hyper-Kvasir dataset, which is the largest image and video dataset from the gastrointestinal tract available today."

"We have used the labeled data to research the classification and segmentation of GI findings using both computer vision and ML approaches to potentially be used in live and post-analysis of patient examinations. Areas of potential utilization are analysis, classification, segmentation, and retrieval of images and videos with particular findings or particular properties from the computer science area. The labeled data can also be used for teaching and training in medical education. Having expert gastroenterologists providing the ground truths over various findings, HyperKvasir provides a unique and diverse learning set for future clinicians. Moreover, the unlabeled data is well suited for semi-supervised and unsupervised methods, and, if even more ground truth data is needed, the users of the data can use their own local medical experts to provide the needed labels. Finally, the videos can in addition be used to simulate live endoscopies feeding the video into the system like it is captured directly from the endoscopes enable developers to do image classification."

Borgli, H., Thambawita, V., Smedsrud, P.H. et al. HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy. Sci Data 7, 283 (2020). https://doi.org/10.1038/s41597-020-00622-y

Using this Dataset

Hyper-Kvasir is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source. This means that in all documents and papers that use or refer to the Hyper-Kvasir dataset or report experimental results based on the dataset, a reference to the related article needs to be added: PREPRINT: https://osf.io/mkzcq/. Additionally, one should provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

About Roboflow

Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.

Developers reduce 50% of their boilerplate code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.
LEPset
data.niaid.nih.gov
zenodo.org
Updated Jun 15, 2023
Cite
Sheng,Bin (2023). LEPset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8041284
Explore at:
Dataset updated
Jun 15, 2023
Dataset provided by
Sheng,Bin
Wang,Teng
Wang,Kaixuan
Li, Jiajia
Zhang, Pingping
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
LEPset is a large-scale EUS-based pancreas image dataset from the Department of Gastroenterology, Changhai Hospital, Second Military Medical University/Naval Medical University. The labeled portion consists of 3,500 images from 420 patients, divided into two categories (PC and NPC). Experienced clinicians annotated the category labels for all 3,500 EUS images. LEPset also includes 8,000 EUS images without any classification annotation.

After downloading LEPset.zip, extract it with an appropriate unzip tool.

After unzipping, there will be two folders: unlabeled and labeled.

The unlabeled folder contains 8,000 EUS images. The labeled folder contains two folders, NPC and PC, representing non-pancreatic cancer and pancreatic cancer respectively: 140 patients (1,820 images) in NPC and 280 patients (1,680 images) in PC.

Unlabeled images can be used for pre-training the model, and labeled images can be used for training and validating the supervised model.
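The folder layout above can be indexed with a short directory scan. This is a sketch under stated assumptions: the labeled and unlabeled folders are assumed to sit directly under the extraction root, the integer label encoding is an arbitrary choice, and the placeholder file names used in the demonstration are hypothetical.

```python
import tempfile
from pathlib import Path

# Assumption: integer encoding chosen for illustration (NPC = 0, PC = 1).
LABELS = {"NPC": 0, "PC": 1}


def index_lepset(root):
    """Return (labeled, unlabeled): (path, label) pairs and bare paths."""
    root = Path(root)
    labeled = [
        (p, LABELS[name])
        for name in LABELS
        for p in sorted((root / "labeled" / name).glob("*"))
        if p.is_file()
    ]
    unlabeled = sorted(p for p in (root / "unlabeled").glob("*") if p.is_file())
    return labeled, unlabeled


# Demonstration on a temporary tree with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("labeled/NPC/a.png", "labeled/PC/b.png", "unlabeled/c.png"):
        f = Path(tmp, rel)
        f.parent.mkdir(parents=True, exist_ok=True)
        f.touch()
    labeled, unlabeled = index_lepset(tmp)
    labeled_summary = [(p.name, y) for p, y in labeled]
    unlabeled_names = [p.name for p in unlabeled]
```

The unlabeled list would feed a pre-training stage and the labeled pairs a supervised stage, mirroring the intended use stated above.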
AI in Unsupervised Learning Market Market Research Report 2033
researchintelo.com
csv, pdf, pptx
Updated Jul 24, 2025
Cite
Research Intelo (2025). AI in Unsupervised Learning Market Market Research Report 2033 [Dataset]. https://researchintelo.com/report/ai-in-unsupervised-learning-market-market
Available download formats: pdf, csv, pptx
Dataset updated
Jul 24, 2025
Dataset authored and provided by
Research Intelo
License
https://researchintelo.com/privacy-and-policy
Time period covered
2024 - 2033
Area covered
Global
Description
AI in Unsupervised Learning Market Outlook

According to our latest research, the AI in Unsupervised Learning market size reached USD 3.8 billion globally in 2024, demonstrating robust expansion as organizations increasingly leverage unsupervised techniques for extracting actionable insights from unlabelled data. The market is forecasted to grow at a CAGR of 28.2% from 2025 to 2033, propelling the industry to an estimated USD 36.7 billion by 2033. This remarkable growth trajectory is primarily fueled by the escalating adoption of artificial intelligence across diverse sectors, an exponential surge in data generation, and the pressing need for advanced analytics that can operate without manual data labeling.

One of the key growth factors driving the AI in Unsupervised Learning market is the rising complexity and volume of data generated by enterprises in the digital era. Organizations are inundated with unstructured and unlabelled data from sources such as social media, IoT devices, and transactional systems. Traditional supervised learning methods are often impractical due to the time and cost associated with manual labeling. Unsupervised learning algorithms, such as clustering and dimensionality reduction, offer a scalable solution by autonomously identifying patterns, anomalies, and hidden structures within vast datasets. This capability is increasingly vital for industries aiming to enhance decision-making, streamline operations, and gain a competitive edge through advanced analytics.

Another significant driver is the rapid advancement in computational power and AI infrastructure, which has made it feasible to implement sophisticated unsupervised learning models at scale. The proliferation of cloud computing and specialized AI hardware has reduced barriers to entry, enabling even small and medium enterprises to deploy unsupervised learning solutions. Additionally, the evolution of neural networks and deep learning architectures has expanded the scope of unsupervised algorithms, allowing for more complex tasks such as image recognition, natural language processing, and anomaly detection. These technological advancements are not only accelerating adoption but also fostering innovation across sectors including healthcare, finance, manufacturing, and retail.

Furthermore, regulatory compliance and the growing emphasis on data privacy are pushing organizations to adopt unsupervised learning methods. Unlike supervised approaches that require sensitive data labeling, unsupervised algorithms can process data without explicit human intervention, thereby reducing the risk of privacy breaches. This is particularly relevant in sectors such as healthcare and BFSI, where stringent data protection regulations are in place. The ability to derive insights from unlabelled data while maintaining compliance is a compelling value proposition, further propelling the market forward.

Regionally, North America continues to dominate the AI in Unsupervised Learning market owing to its advanced technological ecosystem, significant investments in AI research, and strong presence of leading market players. Europe follows closely, driven by robust regulatory frameworks and a focus on ethical AI deployment. The Asia Pacific region is exhibiting the fastest growth, fueled by rapid digital transformation, government initiatives, and increasing adoption of AI across industries. Latin America and the Middle East & Africa are also witnessing steady growth, albeit at a slower pace, as awareness and infrastructure continue to develop.

Component Analysis

The Component segment of the AI in Unsupervised Learning market is categorized into Software, Hardware, and Services, each playing a pivotal role in the overall ecosystem. The software segment, comprising machine learning frameworks, data analytics platforms, and AI development tools, holds the largest market share. This dominance is attributed to the continuous evolution of AI algorithms and the increasing availability of open-source and proprietary solutions tailored for unsupervised learning. Enterprises are investing heavily in software that can facilitate the seamless integration of unsupervised learning capabilities into existing workflows, enabling automation, predictive analytics, and pattern recognition without the need for labeled data.

The hardware segment, while smaller in comparison to software, is experiencing significant growth due to the escalating demand for high-perf
Classification of gravure printed patterns using singular value...
b2find.eudat.eu
Updated Mar 22, 2025
Cite
(2025). Classification of gravure printed patterns using singular value decomposition and machine learning (MATLAB code) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/4ed725a1-1a97-5f87-b261-31f0be5d7483
Explore at:
Dataset updated
Mar 22, 2025
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This dataset contains MATLAB code ('code_MachLearn_ImgClass.zip') for automated classification of gravure printed patterns from the HYPA-p dataset. The developed algorithm performs singular value decomposition (SVD) and training of several machine learning classifiers, such as k-Nearest Neighbors (kNN). The classifiers are trained and tested on labeled data. Afterwards, the trained classifiers can be used for automated classification of unlabeled data. Further information can be found in the provided README-file.
Adverse Drug Reaction (ADR) Text Dataset
data.niaid.nih.gov
Updated Apr 11, 2025
Cite
Monko, Gloriana (2025). Adverse Drug Reaction (ADR) Text Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13889330
Explore at:
Dataset updated
Apr 11, 2025
Dataset authored and provided by
Monko, Gloriana
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
This repository contains text data and code related to the identification and clustering of Adverse Drug Reactions (ADR) using Sentence-BERT (S-BERT) embeddings and the SS-DBSCAN clustering algorithm. The dataset includes both labeled and unlabeled patient reports extracted from the publicly available MIMIC-III database.

The labeled data has been manually annotated to distinguish between ADR and non-ADR cases. The unlabeled dataset is used for unsupervised clustering experiments, particularly to assess high-dimensional data clustering performance.

New in This Version:
- Added Jupyter Notebook: mimic-5k_PCA_tSNE_clustering.ipynb
- Included detailed README_ADR_Clustering_Task.txt with step-by-step instructions to reproduce clustering results
- Explained how to scale experiments from 1,000 to full dataset size
DC_inside_comments
huggingface.co
Updated Mar 9, 2025
Cite
DasolChoi (2025). DC_inside_comments [Dataset]. https://huggingface.co/datasets/Dasool/DC_inside_comments
Available format: Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 9, 2025
Authors
DasolChoi
License
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
DC_inside_comments

This dataset contains 110,000 raw comments collected from DC Inside. It is intended for unsupervised learning or pretraining purposes.

Dataset Summary

Data Type: Unlabeled raw comments
Number of Examples: 110,000
Source: DC Inside

Related Dataset

For labeled data and multi-task annotated examples, please refer to the KoMultiText dataset.

How to Load the Dataset

from datasets import load_dataset

Load the unlabeled dataset… See the full description on the dataset page: https://huggingface.co/datasets/Dasool/DC_inside_comments.

Cite

YuanfengJi (2022). Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation (Unlabeled Data Part III) [Dataset]. http://doi.org/10.5281/zenodo.7295816

Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation (Unlabeled Data Part III)

24 scholarly articles cite this dataset (View in Google Scholar)

Unique identifier

https://doi.org/10.5281/zenodo.7295816

Dataset updated

Oct 29, 2022

Authors

YuanfengJi

Description

Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark from diverse clinical scenarios. Constraint by the high cost of collecting and labeling 3D medical data, most of the deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully comprehensive and fair estimate of various methods. To mitigate the limitations, we present AMOS, a large-scale, diverse, clinical dataset for abdominal organ segmentation. AMOS provides 500 CT and 100 MRI scans collected from multi-center, multi-vendor, multi-modality, multi-phase, multi-disease patients, each with voxel-level annotations of 15 abdominal organs, providing challenging examples and test-bed for studying robust segmentation algorithms under diverse targets and scenarios. We further benchmark several state-of-the-art medical segmentation models to evaluate the status of the existing methods on this new challenging dataset. We have made our datasets, benchmark servers, and baselines publicly available, and hope to inspire future research. The paper can be found at https://arxiv.org/pdf/2206.08023.pdf In addition to providing the labeled 600 CT and MRI scans, we expect to provide 2000 CT and 1200 MRI scans without labels to support more learning tasks (semi-supervised, un-supervised, domain adaption, ...). 
The links can be found in:

labeled data (500CT+100MRI)

unlabeled data Part I (900CT)

unlabeled data Part II (1100CT) (Now there are 1000CT, we will replenish to 1100CT)

unlabeled data Part III (1200MRI)

If you found this dataset useful for your research, please cite:

@article{ji2022amos,
  title={AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation},
  author={Ji, Yuanfeng and Bai, Haotian and Yang, Jie and Ge, Chongjian and Zhu, Ye and Zhang, Ruimao and Li, Zhen and Zhang, Lingyan and Ma, Wanling and Wan, Xiang and others},
  journal={arXiv preprint arXiv:2206.08023},
  year={2022}
}
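The description mentions voxel-level annotations of 15 abdominal organs and a benchmark of segmentation models. One common way to score such predictions is the per-organ Dice coefficient. The following is a minimal sketch, not the authors' evaluation code: it assumes (this page does not state it) that each volume is flattened to a sequence of integer labels, with 0 as background and 1 to 15 as organ ids. The function name `dice_per_label` is hypothetical.

```python
from typing import Dict, Sequence

NUM_ORGANS = 15  # from the description; label ids 1..15 are an assumed encoding


def dice_per_label(pred: Sequence[int], truth: Sequence[int],
                   num_labels: int = NUM_ORGANS) -> Dict[int, float]:
    """Dice = 2 * |P and T| / (|P| + |T|) for each label id in 1..num_labels.

    pred and truth are flattened voxel label volumes of equal length.
    Labels absent from both volumes are skipped rather than scored.
    """
    if len(pred) != len(truth):
        raise ValueError("volumes must have the same number of voxels")

    overlap = [0] * (num_labels + 1)
    pred_size = [0] * (num_labels + 1)
    truth_size = [0] * (num_labels + 1)

    # Single pass over voxels: count per-label sizes and agreements.
    for p, t in zip(pred, truth):
        if 0 < p <= num_labels:
            pred_size[p] += 1
        if 0 < t <= num_labels:
            truth_size[t] += 1
            if p == t:
                overlap[t] += 1

    scores: Dict[int, float] = {}
    for label in range(1, num_labels + 1):
        denom = pred_size[label] + truth_size[label]
        if denom:  # skip organs that appear in neither volume
            scores[label] = 2.0 * overlap[label] / denom
    return scores


if __name__ == "__main__":
    # Toy six-voxel example with two organs (hypothetical data).
    truth = [0, 1, 1, 2, 2, 2]
    pred = [0, 1, 2, 2, 2, 0]
    print(dice_per_label(pred, truth))
```

In practice the volumes would be read from the dataset files with an imaging library and flattened first; skipping organs that are absent from both volumes avoids a division by zero and is one of several conventions in use.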

Amos: A large-scale abdominal multi-organ benchmark for versatile medical...

Average dice coefficients of the few-supervised learning models using 2%,...

‘BLE RSSI Dataset for Indoor localization’ analyzed by Analyst-2

Content

Attribute Information

Acknowledgements

Inspiration

How unlabeled data can help build an improved learning system. How a GAN model can synthesize viable paths from a small amount of labeled data and a larger set of unlabeled data.

Stanford STL-10 Image Dataset

Number of images used for the training and testing of the models with...

AI in Semi-supervised Learning Market Market Research Report 2033

AI in Semi-supervised Learning Market Outlook

Component Analysis

Square dataset - Dataset - LDM

Data used in Machine learning reveals the waggle drift's role in the honey...

AgriShelf: A Multi-Class, Bi-Source Image Dataset for Smart Agri-Food...

Urban Sound & Sight (Urbansas) - Labeled set

Self-Supervised Learning Market Report | Global Forecast From 2025 To 2033

Self-Supervised Learning Market Outlook

Component Analysis

Machine Learning Courses Market Report | Global Forecast From 2025 To 2033

Machine Learning Courses Market Outlook

Course Type Analysis

Data from: Leveraging Unlabeled Data for Superior ROC Curve Estimation via a...

SemiEvol

Hyper Kvasir Dataset

LEPset

AI in Unsupervised Learning Market Market Research Report 2033

AI in Unsupervised Learning Market Outlook

Component Analysis

Classification of gravure printed patterns using singular value...

Adverse Drug Reaction (ADR) Text Dataset

DC_inside_comments

Load the unlabeled dataset… See the full description on the dataset page: https://huggingface.co/datasets/Dasool/DC_inside_comments.
