66 datasets found

Human Inner Ear Anatomy: Labeled Volume CT Data of Inner Ear Fluid Space and...
data.niaid.nih.gov
zenodo.org
Updated Aug 30, 2023
Cite
Stebani, Jannik (2023). Human Inner Ear Anatomy: Labeled Volume CT Data of Inner Ear Fluid Space and Anatomical Landmarks [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8277158
Dataset updated
Aug 30, 2023
Dataset provided by
Blaimer, Martin
Rak, Kristen
Neun, Tilman
Stebani, Jannik
Pelt, Daniël M.
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The provided dataset comprises 43 instances of temporal bone volume CT scans. The scans were performed on human cadaveric specimens with a resulting isotropic voxel size of 99 × 99 × 99 µm³. Voxel-wise image labels of the fluid space of the bony labyrinth, subdivided into the three semantic classes cochlear volume, vestibular volume and semicircular canal volume, are provided. In addition, each dataset contains JSON-like descriptor data defining the voxel coordinates of the anatomical landmarks: (1) apex of the cochlea, (2) oval window and (3) round window. The dataset can be used to train and evaluate algorithmic machine learning models for automated inner ear analysis in the context of the supervised learning paradigm.

Usage Notes

The datasets are stored in the HDF5 format developed by The HDF Group. We used, and therefore recommend, the h5py Python bindings to handle the datasets.

The flat-panel volume CT raw data, labels and landmarks are saved in the HDF5-internal file structure using the respective group and datasets:

raw/raw-0
label/label-0
landmark/landmark-0
landmark/landmark-1
landmark/landmark-2

Array raw and label data can be read from the file by indexing into an opened h5py file handle, for example as numpy.ndarray. Further metadata is contained in the attribute dictionaries of the raw and label datasets.

Landmark coordinate data is available as an attribute dict and contains the coordinate system (LPS or RAS), IJK voxel coordinates and label information. The helicotrema or cochlea top is globally saved in landmark 0, the oval window in landmark 1 and the round window in landmark 2. Read as a Python dictionary, example landmark information for a dataset may read as follows:

{'coordsys': 'LPS', 'id': 1, 'ijk_position': array([181, 188, 100]), 'label': 'CochleaTop', 'orientation': array([-1., -0., -0., -0., -1., -0., 0., 0., 1.]), 'xyz_position': array([ 44.21109689, -139.38058589, -183.48249736])}

{'coordsys': 'LPS', 'id': 2, 'ijk_position': array([222, 182, 145]), 'label': 'OvalWindow', 'orientation': array([-1., -0., -0., -0., -1., -0., 0., 0., 1.]), 'xyz_position': array([ 48.27890112, -139.95991131, -179.04103763])}

{'coordsys': 'LPS', 'id': 3, 'ijk_position': array([223, 209, 147]), 'label': 'RoundWindow', 'orientation': array([-1., -0., -0., -0., -1., -0., 0., 0., 1.]), 'xyz_position': array([ 48.33120126, -137.27135678, -178.8665465 ])}
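
The access pattern described in the usage notes can be sketched as follows. This is a minimal illustration, not the authors' code: the function names read_sample and landmark_distance are hypothetical, the internal paths are taken from the list above, and the file is assumed to be opened with h5py. The unit of xyz_position is not stated in the description, so the distance is returned in whatever unit the file uses.

```python
import math

# Internal HDF5 paths as listed in the usage notes.
RAW_PATH = "raw/raw-0"
LABEL_PATH = "label/label-0"
LANDMARK_PATHS = [f"landmark/landmark-{i}" for i in range(3)]


def read_sample(path):
    """Return (raw, label, landmarks) for one HDF5 file of the dataset."""
    # h5py is a third-party package; it is imported lazily so that the
    # pure-Python helper below stays usable without it.
    import h5py

    with h5py.File(path, "r") as handle:
        raw = handle[RAW_PATH][()]      # full volume as numpy.ndarray
        label = handle[LABEL_PATH][()]  # voxel-wise class labels
        # Each landmark is exposed as an attribute dictionary.
        landmarks = [dict(handle[p].attrs) for p in LANDMARK_PATHS]
    return raw, label, landmarks


def landmark_distance(a, b):
    """Euclidean distance between two landmark dicts via 'xyz_position'."""
    pa = [float(v) for v in a["xyz_position"]]
    pb = [float(v) for v in b["xyz_position"]]
    return math.dist(pa, pb)
```

For example, landmark_distance applied to the OvalWindow and RoundWindow dictionaries shown above gives the separation of the two windows in the coordinate units of the file.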
Data from: SRL4ORL: Improving Opinion Role Labeling Using Multi-Task...
heidata.uni-heidelberg.de
tudatalib.ulb.tu-darmstadt.de
zip
Updated Feb 4, 2019
+ more versions
Cite
Ana Marasovic (2019). SRL4ORL: Improving Opinion Role Labeling Using Multi-Task Learning With Semantic Role Labeling [Source Code] [Dataset]. http://doi.org/10.11588/DATA/LWN9XE
Available download formats: zip (14676065)
Unique identifier
https://doi.org/10.11588/DATA/LWN9XE
Dataset updated
Feb 4, 2019
Dataset provided by
heiDATA
Authors
Ana Marasovic
License
https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.11588/DATA/LWN9XE
Description
This repository contains code for reproducing experiments done in Marasovic and Frank (2018).

Paper abstract: For over a decade, machine learning has been used to extract opinion-holder-target structures from text to answer the question "Who expressed what kind of sentiment towards what?". Recent neural approaches do not outperform the state-of-the-art feature-based models for Opinion Role Labeling (ORL). We suspect this is due to the scarcity of labeled training data and address this issue using different multi-task learning (MTL) techniques with a related task which has substantially more data, i.e. Semantic Role Labeling (SRL). We show that two MTL models improve significantly over the single-task model for labeling of both holders and targets, on the development and the test sets. We found that the vanilla MTL model, which makes predictions using only shared ORL and SRL features, performs the best. With deeper analysis, we determine what works and what might be done to make further improvements for ORL.

Data for ORL
Download MPQA 2.0 corpus. Check mpqa2-pytools for example usage. Splits can be found in the datasplit folder.

Data for SRL
The data is provided by: CoNLL-2005 Shared Task, but the original words are from the Penn Treebank dataset, which is not publicly available.

How to train models?
python main.py --adv_coef 0.0 --model fs --exp_setup_id new --n_layers_orl 0 --begin_fold 0 --end_fold 4
python main.py --adv_coef 0.0 --model html --exp_setup_id new --n_layers_orl 1 --n_layers_shared 2 --begin_fold 0 --end_fold 4
python main.py --adv_coef 0.0 --model sp --exp_setup_id new --n_layers_orl 3 --begin_fold 0 --end_fold 4
python main.py --adv_coef 0.1 --model asp --exp_setup_id prior --n_layers_orl 3 --begin_fold 0 --end_fold 10
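
The four training commands in the description differ only in a handful of flags, so they can be kept as a small table and rendered on demand. The sketch below is a convenience wrapper, not part of the repository: the names RUNS, build_command, command_line and run_all are hypothetical, while the flags and values are copied verbatim from the commands in the description.

```python
import shlex
import subprocess

# One dict per run; keys are the flag names used by main.py in the
# description, values are copied unchanged from the four commands.
RUNS = [
    {"adv_coef": "0.0", "model": "fs", "exp_setup_id": "new",
     "n_layers_orl": "0", "begin_fold": "0", "end_fold": "4"},
    {"adv_coef": "0.0", "model": "html", "exp_setup_id": "new",
     "n_layers_orl": "1", "n_layers_shared": "2",
     "begin_fold": "0", "end_fold": "4"},
    {"adv_coef": "0.0", "model": "sp", "exp_setup_id": "new",
     "n_layers_orl": "3", "begin_fold": "0", "end_fold": "4"},
    {"adv_coef": "0.1", "model": "asp", "exp_setup_id": "prior",
     "n_layers_orl": "3", "begin_fold": "0", "end_fold": "10"},
]


def build_command(run):
    """Turn one run description into an argv list for subprocess."""
    argv = ["python", "main.py"]
    for flag, value in run.items():  # dicts keep insertion order
        argv += [f"--{flag}", value]
    return argv


def command_line(run):
    """Render the run as a single shell-safe command string."""
    return shlex.join(build_command(run))


def run_all(dry_run=True):
    """Print every command; execute it too when dry_run is False."""
    for run in RUNS:
        print(command_line(run))
        if not dry_run:
            subprocess.run(build_command(run), check=True)
```

Calling run_all() prints the four commands without starting any training; run_all(dry_run=False) would execute them from the repository root.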
AI Training Dataset Market Report
archivemarketresearch.com
doc, pdf, ppt
Updated Nov 22, 2024
Cite
AI Training Dataset Market Report [Dataset]. https://www.archivemarketresearch.com/reports/ai-training-dataset-market-5881
Available download formats: ppt, pdf, doc
Dataset updated
Nov 22, 2024
Dataset authored and provided by
Archive Market Research
License
https://www.archivemarketresearch.com/privacy-policy
Time period covered
2025 - 2033
Area covered
global
Variables measured
Market Size
Description
The AI Training Dataset Market size was valued at USD 2124.0 million in 2023 and is projected to reach USD 8593.38 million by 2032, exhibiting a CAGR of 22.1% during the forecast period. An AI training dataset is a collection of data used to train machine learning models. It typically includes labeled examples, where each data point has an associated output label or target value. The quality and quantity of this data are crucial for the model's performance. A well-curated dataset ensures the model learns relevant features and patterns, enabling it to generalize effectively to new, unseen data. Training datasets can encompass various data types, including text, images, audio, and structured data. The driving forces behind this growth include:
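
The growth figure quoted above can be checked with the standard compound annual growth rate formula, (end / start) ** (1 / periods) - 1. The sketch below is only an arithmetic check; the function name cagr is hypothetical, and the number of compounding periods is an assumption because the description does not say which base year the report used. With the quoted endpoints, seven periods reproduce roughly 22.1%, whereas the nine years from 2023 to 2032 give roughly 16.8%.

```python
def cagr(start_value, end_value, periods):
    """Compound annual growth rate as a fraction (0.10 means 10%)."""
    if start_value <= 0 or periods <= 0:
        raise ValueError("start_value and periods must be positive")
    return (end_value / start_value) ** (1.0 / periods) - 1.0


# Endpoints quoted in the description, in USD million.
START, END = 2124.0, 8593.38

rate_7 = cagr(START, END, 7)  # assumption: seven compounding periods
rate_9 = cagr(START, END, 9)  # nine periods, 2023 to 2032

print(f"7 periods: {rate_7:.1%}")
print(f"9 periods: {rate_9:.1%}")
```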
Data from: DeepLabCut: markerless pose estimation of user-defined body parts...
zenodo.org
data.niaid.nih.gov
zip
Updated Oct 13, 2023
Cite
Alexander Mathis; Pranav Mamidanna; Kevin M. Cury; Taiga Abe; Venkatesh N. Murthy; Mackenzie Weygandt Mathis; Matthias Bethge (2023). DeepLabCut: markerless pose estimation of user-defined body parts with deep learning [Dataset]. http://doi.org/10.5281/zenodo.4008504
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.4008504
Dataset updated
Oct 13, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Alexander Mathis; Pranav Mamidanna; Kevin M. Cury; Taiga Abe; Venkatesh N. Murthy; Mackenzie Weygandt Mathis; Matthias Bethge
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This data entry contains a public release of annotated mouse data from the DeepLabCut Nature Neuroscience paper.

The trail-tracking behavior is part of an investigation into odor-guided navigation, where one or multiple wildtype (C57BL/6J) mice are running on a paper spool and following odor trails. These experiments were carried out by Alexander Mathis & Mackenzie Mathis in the Murthy lab at Harvard University.

Data was recorded by two different cameras (640×480 pixels with Point Grey Firefly (FMVU-03MTM-CS), and at approximately 1,700×1,200 pixels with Grasshopper 3 4.1MP Mono USB3 Vision (CMOSIS CMV4000-3E12)) at 30 Hz. The latter images were cropped around mice to generate images that are approximately 800×800.

Here we share 1,066 frames from multiple experimental sessions observing 7 different mice. Pranav Mamidanna labeled the snout, the tips of the left and right ears, and the base of the tail in the example images. The data is organized in the DeepLabCut 2.0 project structure with images and annotations in the labeled-data folder. The names are pseudocodes indicating mouse id and session id, e.g. m4s1 = mouse 4 session 1.
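
The naming scheme above (m4s1 = mouse 4 session 1) is easy to decode programmatically. The following is a small sketch; the function name parse_name is hypothetical, and it assumes every name consists only of "m", a mouse number, "s" and a session number, which the description implies but does not guarantee for every folder.

```python
import re

# "m<mouse id>s<session id>", e.g. "m4s1".
_NAME = re.compile(r"^m(?P<mouse>\d+)s(?P<session>\d+)$")


def parse_name(name):
    """Return (mouse_id, session_id) for a pseudocode such as 'm4s1'."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a mouse/session pseudocode: {name!r}")
    return int(match["mouse"]), int(match["session"])
```

Grouping the labeled-data folders by the first element of the returned tuple would then allow splits that keep all frames of one mouse together.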

Code for loading, visualizing and training deep neural networks is available at https://github.com/DeepLabCut/DeepLabCut.
The algorithms ranking.
figshare.com
bin
Updated Jun 16, 2023
+ more versions
Cite
Moein E. Samadi; Sandra Kiefer; Sebastian Johaness Fritsch; Johannes Bickenbach; Andreas Schuppert (2023). The algorithms ranking. [Dataset]. http://doi.org/10.1371/journal.pone.0274569.t006
Available download formats: bin
Unique identifier
https://doi.org/10.1371/journal.pone.0274569.t006
Dataset updated
Jun 16, 2023
Dataset provided by
PLOS ONE
Authors
Moein E. Samadi; Sandra Kiefer; Sebastian Johaness Fritsch; Johannes Bickenbach; Andreas Schuppert
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The algorithms ranking.
Data from: DIPSER: A Dataset for In-Person Student Engagement Recognition in...
observatorio-cientifico.ua.es
scidb.cn
Updated 2025
Cite
Márquez-Carpintero, Luis; Suescun-Ferrandiz, Sergio; Álvarez, Carolina Lorenzo; Fernandez-Herrero, Jorge; Viejo, Diego; Rosabel Roig-Vila; Cazorla, Miguel (2025). DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild [Dataset]. https://observatorio-cientifico.ua.es/documentos/67321d21aea56d4af0484172
Dataset updated
2025
Authors
Márquez-Carpintero, Luis; Suescun-Ferrandiz, Sergio; Álvarez, Carolina Lorenzo; Fernandez-Herrero, Jorge; Viejo, Diego; Rosabel Roig-Vila; Cazorla, Miguel
Description
Data Description
The DIPSER dataset is designed to assess student attention and emotion in in-person classroom settings, consisting of RGB camera data, smartwatch sensor data, and labeled attention and emotion metrics. It includes multiple camera angles per student to capture posture and facial expressions, complemented by smartwatch data for inertial and biometric metrics. Attention and emotion labels are derived from self-reports and expert evaluations. The dataset includes diverse demographic groups, with data collected in real-world classroom environments, facilitating the training of machine learning models for predicting attention and correlating it with emotional states.

Data Collection and Generation Procedures
The dataset was collected in a natural classroom environment at the University of Alicante, Spain. The recording setup consisted of six general cameras positioned to capture the overall classroom context and individual cameras placed at each student’s desk. Additionally, smartwatches were used to collect biometric data, such as heart rate, accelerometer, and gyroscope readings.

Experimental Sessions
Nine distinct educational activities were designed to ensure a comprehensive range of engagement scenarios:
News Reading – Students read projected or device-displayed news.
Brainstorming Session – Idea generation for problem-solving.
Lecture – Passive listening to an instructor-led session.
Information Organization – Synthesizing information from different sources.
Lecture Test – Assessment of lecture content via mobile devices.
Individual Presentations – Students present their projects.
Knowledge Test – Conducted using Kahoot.
Robotics Experimentation – Hands-on session with robotics.
MTINY Activity Design – Development of educational activities with computational thinking.

Technical Specifications
RGB Cameras: Individual cameras recorded at 640×480 pixels, while context cameras captured at 1280×720 pixels.
Frame Rate: 9-10 FPS depending on the setup.
Smartwatch Sensors: Collected heart rate, accelerometer, gyroscope, rotation vector, and light sensor data at a frequency of 1–100 Hz.

Data Organization and Formats
The dataset follows a structured directory format:
/groupX/experimentY/subjectZ.zip
Each subject-specific folder contains:
images/ (individual facial images)
watch_sensors/ (sensor readings in JSON format)
labels/ (engagement & emotion annotations)
metadata/ (subject demographics & session details)

Annotations and Labeling
Each data entry includes engagement levels (1-5) and emotional states (9 categories) based on both self-reported labels and evaluations by four independent experts. A custom annotation tool was developed to ensure consistency across evaluations.

Missing Data and Data Quality
Synchronization: A centralized server ensured time alignment across devices. Brightness changes were used to verify synchronization.
Completeness: No major missing data, except for occasional random frame drops due to embedded device performance.
Data Consistency: Uniform collection methodology across sessions, ensuring high reliability.

Data Processing Methods
To enhance usability, the dataset includes preprocessed bounding boxes for face, body, and hands, along with gaze estimation and head pose annotations. These were generated using YOLO, MediaPipe, and DeepFace.

File Formats and Accessibility
Images: Stored in standard JPEG format.
Sensor Data: Provided as structured JSON files.
Labels: Available as CSV files with timestamps.
The dataset is publicly available under the CC-BY license and can be accessed along with the necessary processing scripts via the DIPSER GitHub repository.

Potential Errors and Limitations
Due to camera angles, some student movements may be out of frame in collaborative sessions.
Lighting conditions vary slightly across experiments.
Sensor latency variations are minimal but exist due to embedded device constraints.

Citation
If you find this project helpful for your research, please cite our work using the following bibtex entry:
@misc{marquezcarpintero2025dipserdatasetinpersonstudent1,
  title={DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild},
  author={Luis Marquez-Carpintero and Sergio Suescun-Ferrandiz and Carolina Lorenzo Álvarez and Jorge Fernandez-Herrero and Diego Viejo and Rosabel Roig-Vila and Miguel Cazorla},
  year={2025},
  eprint={2502.20209},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2502.20209},
}

Usage and Reproducibility
Researchers can utilize standard tools like OpenCV, TensorFlow, and PyTorch for analysis. The dataset supports research in machine learning, affective computing, and education analytics, offering a unique resource for engagement and attention studies in real-world classroom environments.
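
The directory layout /groupX/experimentY/subjectZ.zip described above lends itself to a small path parser for indexing the archives. This is a sketch under stated assumptions: the function name parse_archive_path is hypothetical, and X, Y and Z are assumed to be plain alphanumeric identifiers, which the description does not specify.

```python
import re
from pathlib import PurePosixPath

# Assumed shape: group<X>/experiment<Y>/subject<Z>.zip
_ARCHIVE = re.compile(
    r"^group(?P<group>\w+)/experiment(?P<experiment>\w+)"
    r"/subject(?P<subject>\w+)\.zip$"
)

# Sub-folders named in the description for each subject archive.
SUBJECT_FOLDERS = ("images", "watch_sensors", "labels", "metadata")


def parse_archive_path(path):
    """Split an archive path into its group, experiment and subject ids."""
    relative = PurePosixPath(path).as_posix().lstrip("/")
    match = _ARCHIVE.match(relative)
    if match is None:
        raise ValueError(f"unexpected archive path: {path!r}")
    return match.groupdict()
```

The returned dictionary can serve as a key when joining images, sensor readings and labels that belong to the same subject and session.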
Data from: Semi-supervised non-negative matrix factorization with structure...
data.mendeley.com
Updated Dec 9, 2024
Cite
Wenjing Jing (2024). Semi-supervised non-negative matrix factorization with structure preserving for image clustering [Dataset]. http://doi.org/10.17632/gf67wvrhbs.1
Unique identifier
https://doi.org/10.17632/gf67wvrhbs.1
Dataset updated
Dec 9, 2024
Authors
Wenjing Jing
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The code for the paper "Semi-supervised non-negative matrix factorization with structure preserving for image clustering". This paper constructs a new label matrix with weights and further constructs a label constraint regularizer to both utilize the label information and maintain the intrinsic structure of NMF. Based on the label constraint regularizer, the basis images of labeled data are extracted for monitoring and modifying the basis images learning of all data by establishing a basis regularizer. By incorporating the label constraint regularizer and the basis regularizer into NMF, a new semi-supervised NMF method is introduced. The proposed method is applied to image clustering and experimental results demonstrate the effectiveness of the proposed method in contrast with state-of-the-art unsupervised and semi-supervised algorithms.

Global Data Classification Tool Market Research Report: By Deployment Model...

wiseguyreports.com

Updated Jun 21, 2024

Cite

Wiseguy Research Consultants Pvt Ltd (2024). Global Data Classification Tool Market Research Report: By Deployment Model (On-Premises, Cloud-Based, SaaS-Based), By Organization Size (Small & Medium-Sized Enterprises (SMEs), Large Enterprises), By Industry Vertical (Healthcare, Financial Services, Government and Public Sector, Retail and E-commerce, Manufacturing and Logistics), By Data Type (Structured Data, Semi-Structured Data, Unstructured Data), By Functionality (Automated Data Classification, Manual Data Classification, Data Discovery, Data Labeling, Data Masking) and By Regional (North America, Europe, South America, Asia Pacific, Middle East and Africa) - Forecast to 2032. [Dataset]. https://www.wiseguyreports.com/reports/data-classification-tool-market

Dataset updated

Jun 21, 2024

Dataset authored and provided by

Wiseguy Research Consultants Pvt Ltd

License

https://www.wiseguyreports.com/pages/privacy-policy

Time period covered

Jan 6, 2024

Area covered

Global

Description

BASE YEAR	2024
HISTORICAL DATA	2019 - 2024
REPORT COVERAGE	Revenue Forecast, Competitive Landscape, Growth Factors, and Trends
MARKET SIZE 2023	2.83(USD Billion)
MARKET SIZE 2024	3.38(USD Billion)
MARKET SIZE 2032	14.02(USD Billion)
SEGMENTS COVERED	Deployment Model, Organization Size, Industry Vertical, Data Type, Application, Regional
COUNTRIES COVERED	North America, Europe, APAC, South America, MEA
KEY MARKET DYNAMICS	Increasing data privacy regulations; Growing need for data security and compliance; Proliferation of unstructured data; Rise of artificial intelligence and machine learning; Adoption of cloud-based data storage
MARKET FORECAST UNITS	USD Billion
KEY COMPANIES PROFILED	Informatica, Oracle, Symantec, IBM, Splunk, Varonis Systems, Digital Guardian, STEALTHbits Technologies, Cybereason, Netskope, FireEye, Trustwave, Check Point Software Technologies
MARKET FORECAST PERIOD	2024 - 2032
KEY MARKET OPPORTUNITIES	Increase in data breaches; Growing adoption of cloud and SaaS solutions; Need for data protection and compliance regulations; Emergence of AI and ML technologies; Growing focus on data privacy
COMPOUND ANNUAL GROWTH RATE (CAGR)	19.46% (2024 - 2032)

CheckMyBlob evaluation data set (CL)
data.niaid.nih.gov
zenodo.org
Updated Aug 8, 2023
Cite
CheckMyBlob evaluation data set (CL) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1040856
Dataset updated
Aug 8, 2023
Dataset provided by
Kowiel, Marcin
Jaskolski, Mariusz
Porebski, Przemyslaw J.
Brzezinski, Dariusz
Minor, Wladek
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A data set of ligands used to evaluate the CheckMyBlob method, described in the Kowiel et al. paper "Automatic recognition of ligands in electron density by machine learning methods".

This data set repeats the setup used in the study of Carolan & Lamzin titled "Automated identification of crystallographic ligands using sparse-density representations". It consists of ligands from X-ray diffraction experiments with 1.0–2.5 Å resolution. Adjacent PDB ligands were not connected. Ligands were labeled according to the PDB naming convention. The data set was limited to the 82 ligand types listed by Carolan & Lamzin. The resulting data set consists of 121,360 examples with ligand counts ranging from 42,622 examples for SO4 to 16 for SPO (spheroidene).

For machine learning (classification) purposes, the target attribute is: res_name.
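
The class counts quoted above imply a strongly imbalanced classification target, which matters when judging accuracy on res_name. The short calculation below uses only the three numbers given in the description; the variable names are hypothetical, and the "majority baseline" is simply the accuracy of always predicting SO4.

```python
# Figures quoted in the description.
TOTAL_EXAMPLES = 121_360
LARGEST_CLASS = ("SO4", 42_622)
SMALLEST_CLASS = ("SPO", 16)


def class_share(count, total=TOTAL_EXAMPLES):
    """Fraction of all examples that belong to one class."""
    return count / total


# Accuracy obtained by always predicting the most frequent ligand.
majority_baseline = class_share(LARGEST_CLASS[1])

# How many times more frequent the largest class is than the smallest.
imbalance_ratio = LARGEST_CLASS[1] / SMALLEST_CLASS[1]

print(f"majority baseline: {majority_baseline:.1%}")
print(f"imbalance ratio:   {imbalance_ratio:.0f}:1")
```

A classifier on this data set therefore needs to beat roughly 35% accuracy to improve on the trivial baseline, and per-class metrics are more informative than overall accuracy for the rare ligands.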
3D Microvascular Image Data and Labels for Machine Learning - Dataset -...
b2find.dkrz.de
Updated May 7, 2024
+ more versions
Cite
(2024). 3D Microvascular Image Data and Labels for Machine Learning - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/fc51c530-c314-5fd4-8979-1032b6f798cf
Dataset updated
May 7, 2024
Description
These images and associated binary labels were collected from collaborators across multiple universities to serve as a diverse representation of biomedical images of vessel structures, for use in the training and validation of machine learning tools for vessel segmentation. The dataset contains images from a variety of imaging modalities, at different resolutions, using different sources of contrast and featuring different organs/pathologies. This data was used to train, test and validate a foundational model for 3D vessel segmentation, tUbeNet, which can be found on GitHub. The paper describing the training and validation of the model can be found here.

Filenames are structured as follows:
Data - [Modality]_[species Organ]_[resolution].tif
Labels - [Modality]_[species Organ]_[resolution]_labels.tif
Sub-volumes of larger dataset - [Modality]_[species Organ]_subvolume[dimensions in pixels].tif

Manual labelling of blood vessels was carried out using Amira (2020.2, Thermo-Fisher, UK).

Training data:

opticalHREM_murineLiver_2.26x2.26x1.75um.tif: A high resolution episcopic microscopy (HREM) dataset, acquired in house by staining a healthy mouse liver with Eosin B and imaged using a standard HREM protocol. NB: 25% of this image volume was withheld from training, for use as test data.

CT_murineTumour_20x20x20um.tif: X-ray microCT images of a microvascular cast, taken from a subcutaneous mouse model of colorectal cancer (acquired in house). NB: 25% of this image volume was withheld from training, for use as test data.

RSOM_murineTumour_20x20um.tif: Raster-Scanning Optoacoustic Mesoscopy (RSOM) data from a subcutaneous tumour model (provided by Emma Brown, Bohndiek Group, University of Cambridge). The image data has undergone filtering to reduce the background (Brown et al., 2019).

OCTA_humanRetina_24x24um.tif: retinal angiography data obtained using Optical Coherence Tomography Angiography (OCT-A) (provided by Dr Ranjan Rajendram, Moorfields Eye Hospital).

Test data:

MRI_porcineLiver_0.9x0.9x5mm.tif: T1-weighted Balanced Turbo Field Echo Magnetic Resonance Imaging (MRI) data from a machine-perfused porcine liver, acquired in-house.

MFHREM_murineTumourLectin_2.76x2.76x2.61um.tif: a subcutaneous colorectal tumour mouse model was imaged in house using Multi-fluorescence HREM, with Dylight 647 conjugated lectin staining the vasculature (Walsh et al., 2021). The image data has been processed using an asymmetric deconvolution algorithm described by Walsh et al., 2020. NB: A sub-volume of 480x480x640 voxels was manually labelled (MFHREM_murineTumourLectin_subvolume480x480x640.tif).

MFHREM_murineBrainLectin_0.85x0.85x0.86um.tif: an MF-HREM image of the cortex of a mouse brain, stained with Dylight-647 conjugated lectin, was acquired in house (Walsh et al., 2021). The image data has been downsampled and processed using an asymmetric deconvolution algorithm described by Walsh et al., 2020. NB: A sub-volume of 1000x1000x99 voxels was manually labelled. This sub-volume is provided at full resolution and without preprocessing (MFHREM_murineBrainLectin_subvol_0.57x0.57x0.86um.tif).

2Photon_murineOlfactoryBulbLectin_0.2x0.46x5.2um.tif: two-photon data of mouse olfactory bulb blood vessels, labelled with sulforhodamine 101, was kindly provided by Yuxin Zhang at the Sensory Circuits and Neurotechnology Lab, the Francis Crick Institute (Bosch et al., 2022). NB: A sub-volume of 500x500x79 voxels was manually labelled (2Photon_murineOlfactoryBulbLectin_subvolume500x500x79.tif).

References:

Bosch, C., Ackels, T., Pacureanu, A., Zhang, Y., Peddie, C. J., Berning, M., Rzepka, N., Zdora, M. C., Whiteley, I., Storm, M., Bonnin, A., Rau, C., Margrie, T., Collinson, L., & Schaefer, A. T. (2022). Functional and multiscale 3D structural investigation of brain tissue through correlative in vivo physiology, synchrotron microtomography and volume electron microscopy. Nature Communications 2022 13:1, 13(1), 1–16. https://doi.org/10.1038/s41467-022-30199-6

Brown, E., Brunker, J., & Bohndiek, S. E. (2019). Photoacoustic imaging as a tool to probe the tumour microenvironment. DMM Disease Models and Mechanisms, 12(7). https://doi.org/10.1242/DMM.039636

Walsh, C., Holroyd, N. A., Finnerty, E., Ryan, S. G., Sweeney, P. W., Shipley, R. J., & Walker-Samuel, S. (2021). Multifluorescence High-Resolution Episcopic Microscopy for 3D Imaging of Adult Murine Organs. Advanced Photonics Research, 2(10), 2100110. https://doi.org/10.1002/ADPR.202100110

Walsh, C., Holroyd, N., Shipley, R., & Walker-Samuel, S. (2020). Asymmetric Point Spread Function Estimation and Deconvolution for Serial-Sectioning Block-Face Imaging. Communications in Computer and Information Science, 1248 CCIS, 235–249. https://doi.org/10.1007/978-3-030-52791-4_19
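
The data filenames listed in this entry encode modality, specimen and voxel size, so they can be turned into metadata with a small parser. The sketch below is an illustration inferred from the example filenames, not a tool shipped with the dataset: the function name parse_data_filename is hypothetical, it handles only the plain data files (not the label or sub-volume variants), and it assumes underscores separate the three fields.

```python
import re

# <modality>_<species+organ>_<AxB[xC]><unit>.tif, e.g.
# "CT_murineTumour_20x20x20um.tif" or "RSOM_murineTumour_20x20um.tif".
_DATA_NAME = re.compile(
    r"^(?P<modality>[^_]+)_(?P<subject>[^_]+)_"
    r"(?P<dims>\d+(?:\.\d+)?(?:x\d+(?:\.\d+)?)+)(?P<unit>um|mm)\.tif$"
)


def parse_data_filename(name):
    """Extract modality, subject and voxel size from a data filename."""
    match = _DATA_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised data filename: {name!r}")
    return {
        "modality": match["modality"],
        "subject": match["subject"],
        # One float per axis, in the order given in the filename.
        "resolution": tuple(float(v) for v in match["dims"].split("x")),
        "unit": match["unit"],
    }
```

Because the resolution is returned with its unit, volumes given in micrometres and millimetres can be brought to a common scale before training or evaluation.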
Performance metrics for the validation dataset for models trained with all...
plos.figshare.com
xls
Updated Mar 29, 2024
Cite
Performance metrics for the validation dataset for models trained with all modifications except for over-sampling (Combined changes) and with just the rotation modification (Rotate). [Dataset]. https://plos.figshare.com/articles/dataset/Performance_metrics_for_the_validation_dataset_for_models_trained_with_all_modifications_except_for_over-sampling_Combined_changes_and_with_just_the_rotation_modification_Rotate_/25510249
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0293856.t005
Dataset updated
Mar 29, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Marjolein Oostrom; Michael A. Muniak; Rogene M. Eichler West; Sarah Akers; Paritosh Pande; Moses Obiri; Wei Wang; Kasey Bowyer; Zhuhao Wu; Lisa M. Bramer; Tianyi Mao; Bobbie Jo M. Webb-Robertson
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Std Dev Diff represents the standard deviation over validation sections of the differences of the metrics to the default rotate augmentation and the Avg Diff is the overall average differences compared to the default model. All models had a background weight of 1.0, all layers fully trainable, and a 0.001 learning rate.
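
The two summary columns described above can be computed from paired per-section scores as sketched here. The function name diff_summary and the example numbers are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption; the description does not say which estimator the authors used.

```python
from statistics import mean, stdev


def diff_summary(model_scores, reference_scores):
    """Return (Avg Diff, Std Dev Diff) of model minus reference scores."""
    if len(model_scores) != len(reference_scores):
        raise ValueError("score lists must have the same length")
    # One difference per validation section.
    diffs = [m - r for m, r in zip(model_scores, reference_scores)]
    # Sample standard deviation; needs at least two sections.
    return mean(diffs), stdev(diffs)


# Made-up scores for three validation sections.
avg_diff, std_dev_diff = diff_summary([0.80, 0.90, 0.70], [0.70, 0.70, 0.70])
```

A positive Avg Diff means the modified model scored higher on average than the reference, while Std Dev Diff indicates how consistent that change is across sections.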
Labelled data for fine tuning a geological Named Entity Recognition and...
data-search.nerc.ac.uk
metadata.bgs.ac.uk
html
Updated Dec 19, 2024
Cite
British Geological Survey (2024). Labelled data for fine tuning a geological Named Entity Recognition and Entity Relation Extraction model [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/api/records/15ac4ca9-3be0-119e-e063-0937940a8990
Available download formats: html
Dataset updated
Dec 19, 2024
Dataset authored and provided by
British Geological Survey (https://www.bgs.ac.uk/)
License
http://inspire.ec.europa.eu/metadata-codelist/LimitationsOnPublicAccess/noLimitations
Time period covered
Nov 1, 2023 - Feb 15, 2024
Description
This dataset consists of sentences extracted from BGS memoirs, DECC/OGA onshore hydrocarbons well reports and Mineral Reconnaissance Programme (MRP) reports. The sentences have been annotated to enable the dataset to be used as labelled training data for a Named Entity Recognition model and Entity Relation Extraction model, both of which are Natural Language Processing (NLP) techniques that assist with extracting structured data from unstructured text. The entities of interest are rock formations, geological ages, rock types, physical properties and locations, with inter-relations such as overlies, observedIn. The entity labels for rock formations and geological ages in the BGS memoirs were an extract from earlier published work (https://github.com/BritishGeologicalSurvey/geo-ner-model, https://zenodo.org/records/4181488).

The data can be used to fine tune a pre-trained large language model using transfer learning, to create a model that can be used in inference mode to automatically create the labels, thereby creating structured data useful for geological modelling and subsurface characterisation. The data is provided in JSONL(Relation) format, which is the export format from the doccano open source text annotation software (https://doccano.github.io/doccano/) used to create the labels.

The source documents are already publicly available, but the MRP and DECC reports are only published in pdf image form. These latter documents had to undergo OCR, which resulted in lower quality text and lower quality training data. The majority of the labelled data is from the higher quality BGS memoirs text. The dataset is a proof of concept. Minimal peer review of the labelling has been conducted, so this should not be treated as a gold standard labelled dataset, and it is of insufficient volume to build a performant model.

The development of this training data and the text processing scripts were supported by a grant from UK Government Office for Technology Transfer (GOTT) Knowledge Asset Grant Fund Project 10083604
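
A file in the JSONL(Relation) format mentioned above can be read line by line with the standard library. The sketch below assumes the usual doccano relation export, where each line is a JSON object with "text", an "entities" list (id, label, start_offset, end_offset) and a "relations" list (from_id, to_id, type); this layout should be checked against the actual files. The function name iter_triples, the sample sentence and the entity label ROCK_FORMATION are hypothetical.

```python
import json


def iter_triples(lines):
    """Yield (head text, relation type, tail text) from JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        text = record["text"]
        # Map entity id to the text span it annotates.
        spans = {
            ent["id"]: text[ent["start_offset"]:ent["end_offset"]]
            for ent in record.get("entities", [])
        }
        for rel in record.get("relations", []):
            yield spans[rel["from_id"]], rel["type"], spans[rel["to_id"]]


# Hypothetical record in the assumed layout, for illustration only.
SAMPLE = json.dumps({
    "id": 1,
    "text": "Formation A overlies Formation B.",
    "entities": [
        {"id": 0, "label": "ROCK_FORMATION", "start_offset": 0, "end_offset": 11},
        {"id": 1, "label": "ROCK_FORMATION", "start_offset": 21, "end_offset": 32},
    ],
    "relations": [{"id": 0, "from_id": 0, "to_id": 1, "type": "overlies"}],
})

triples = list(iter_triples([SAMPLE]))
```

With a real export, passing an open file handle instead of the one-element list yields one triple per annotated relation.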
Optimized network structure.
plos.figshare.com
figshare.com
xls
Updated Jun 21, 2023
Cite
Menaa Nawaz; Jameel Ahmed (2023). Optimized network structure. [Dataset]. http://doi.org/10.1371/journal.pone.0279305.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0279305.t003
Dataset updated
Jun 21, 2023
Dataset provided by
PLOS ONE
Authors
Menaa Nawaz; Jameel Ahmed
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Optimized network structure.
Performance metrics on the test datasets for the best model, which is...
plos.figshare.com
xls
Updated Mar 29, 2024
Cite
Marjolein Oostrom; Michael A. Muniak; Rogene M. Eichler West; Sarah Akers; Paritosh Pande; Moses Obiri; Wei Wang; Kasey Bowyer; Zhuhao Wu; Lisa M. Bramer; Tianyi Mao; Bobbie Jo M. Webb-Robertson (2024). Performance metrics on the test datasets for the best model, which is trained with 1.0 background weights and rotation augmentations, compared with the original model without fine-tuning. [Dataset]. http://doi.org/10.1371/journal.pone.0293856.t006
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0293856.t006
Dataset updated
Mar 29, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Marjolein Oostrom; Michael A. Muniak; Rogene M. Eichler West; Sarah Akers; Paritosh Pande; Moses Obiri; Wei Wang; Kasey Bowyer; Zhuhao Wu; Lisa M. Bramer; Tianyi Mao; Bobbie Jo M. Webb-Robertson
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Std Dev Diff is the standard deviation of the differences in each metric relative to the original model, and Avg Diff is the average of those differences. The trained model had a 1.0 background weight, all layers trainable, a 0.001 learning rate, and rotation augmentation (Rotate).
Privacy-Sensitive Conversations between Care Workers and Care Home Residents...
test.researchdata.tuwien.ac.at
researchdata.tuwien.ac.at
bin, text/markdown
Updated Dec 6, 2024
Reinhard Grabler; Michael Starzinger; Matthias Hirschmanner; Helena Anna Frijns (2024). Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home [Dataset]. http://doi.org/10.70124/hbtq5-ykv92
Available download formats: bin, text/markdown
Unique identifier
https://doi.org/10.70124/hbtq5-ykv92
Dataset updated
Dec 6, 2024
Dataset provided by
TU Wien
Authors
Reinhard Grabler; Michael Starzinger; Matthias Hirschmanner; Helena Anna Frijns
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Apr 2024 - Aug 2024
Description
Dataset Card for "privacy-care-interactions"

Table of Contents

Dataset Description

Purpose and Features

Dataset Overview

Language Distribution

Locale Distribution

Key Facts

Dataset Structure

Data Instances

Data Fields

Data Splits

Dataset Creation

Curation Rationale

Source Data

Annotations

Personal and Sensitive Information

Considerations for Using the Data

Social Impact of Dataset

Discussion of Biases

Other Known Limitations

Additional Information

Dataset Curators

Licensing Information

Citation Information

Contributions

Dataset Description

Purpose and Features

🔒 Collection of Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home 🔒

The dataset is useful to train and evaluate models to identify and classify privacy-sensitive parts of conversations from text, especially in the context of AI assistants and LLMs.

Dataset Overview

Total entries: 95

Number of distinct taxonomy categories in the public dataset: 4

Number of distinct conversational categories in public dataset: 7

Papers:

Continues the work of: Privacy Agents: Utilizing Large Language Models to Safeguard Contextual Integrity in Elderly Care

Continues the work of: Prototype of a care documentation support system using audio recordings of care actions and large language models

Language Distribution 🌍

English (en): 95

Locale Distribution 🌎

United States (US) 🇺🇸: 95

Key Facts 🔑

This is synthetic data! Generated using proprietary algorithms - no privacy violations!

Conversations are classified following the taxonomy for privacy-sensitive robotics by Rueben et al. (2017).

The data was manually labeled by an expert.

Dataset Structure

Data Instances

The provided data format is .jsonl, the JSON Lines text format, also called newline-delimited JSON. An example entry looks as follows.

{ "text": "CW: Have you ever been to Italy? CR: Oh, yes... many years ago.", "taxonomy": 0, "category": 0, "affected_speaker": 1, "language": "en", "locale": "US", "data_type": 1, "uid": 16, "split": "train" }

Data Fields

The data fields are:

text: a string feature. The speaker abbreviations refer to the care worker (CW) and the care recipient (CR).

taxonomy: a classification label, with possible values including informational (0), invasion (1), collection (2), processing (3), dissemination (4), physical (5), personal-space (6), territoriality (7), intrusion (8), obtrusion (9), contamination (10), modesty (11), psychological (12), interrogation (13), psychological-distance (14), social (15), association (16), crowding-isolation (17), public-gaze (18), solitude (19), intimacy (20), anonymity (21), reserve (22). The taxonomy is derived from Rueben et al. (2017). The classifications were manually labeled by an expert.

category: a classification label, with possible values including personal-information (0), family (1), health (2), thoughts (3), values (4), acquaintance (5), appointment (6). The privacy category affected in the conversation. The classifications were manually labeled by an expert.

affected_speaker: a classification label, with possible values including care-worker (0), care-recipient (1), other (2), both (3). The speaker whose privacy is impacted during the conversation. The classifications were manually labeled by an expert.

language: a string feature. Language code as defined by ISO 639.

locale: a string feature. Regional code as defined by ISO 3166-1 alpha-2.

data_type: a classification label, with possible values including real (0), synthetic (1).

uid: an int64 feature. A unique identifier within the dataset.

split: a string feature. Either train, validation or test.
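
The field list above can be turned into a small decoder. The following is a minimal sketch, assuming each line of a .jsonl file holds one record shaped like the example entry; the helper name decode_record and the added *_name keys are hypothetical, and only the label names given in the field descriptions are used.

```python
import json

# Label names copied from the field descriptions above (index = label value).
CATEGORY = ["personal-information", "family", "health", "thoughts",
            "values", "acquaintance", "appointment"]
AFFECTED_SPEAKER = ["care-worker", "care-recipient", "other", "both"]
DATA_TYPE = ["real", "synthetic"]


def decode_record(line: str) -> dict:
    """Parse one JSON Lines row and add readable label names (hypothetical helper)."""
    record = json.loads(line)
    record["category_name"] = CATEGORY[record["category"]]
    record["affected_speaker_name"] = AFFECTED_SPEAKER[record["affected_speaker"]]
    record["data_type_name"] = DATA_TYPE[record["data_type"]]
    return record


# The example entry from "Data Instances", written as a single JSONL line.
example_line = (
    '{"text": "CW: Have you ever been to Italy? CR: Oh, yes... many years ago.", '
    '"taxonomy": 0, "category": 0, "affected_speaker": 1, "language": "en", '
    '"locale": "US", "data_type": 1, "uid": 16, "split": "train"}'
)
decoded = decode_record(example_line)
print(decoded["category_name"], decoded["affected_speaker_name"], decoded["data_type_name"])
```

For a whole file, the same function could be applied to each non-empty line of, for example, split-train-en.jsonl.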

Dataset Splits

The dataset has 2 subsets:

split: with a total of 95 examples split into train, validation and test (70%-15%-15%)

unsplit: with a total of 95 examples in a single train split

name train validation test
split 66 14 15
unsplit 95 n/a n/a

The files follow the naming convention subset-split-language.jsonl. The following files are contained in the dataset:

split-train-en.jsonl

split-validation-en.jsonl

split-test-en.jsonl

unsplit-train-en.jsonl

Dataset Creation

Curation Rationale

Recording audio of care workers and residents during care interactions, which includes partial and full body washing, giving of medication, as well as wound care, is a highly privacy-sensitive use case. Therefore, a dataset is created, which includes privacy-sensitive parts of conversations, synthesized from real-world data. This dataset serves as a basis for fine-tuning a local LLM to highlight and classify privacy-sensitive sections of transcripts created in care interactions, to further mask them to protect privacy.

Source Data

Initial Data Collection

The initial data was collected in the Caring Robots project of TU Wien in cooperation with Caritas Wien. One project track aims to use Large Language Models (LLMs) to support care workers' documentation with LLM-generated summaries of audio recordings of interactions between care workers and care home residents. The initial data are the transcriptions of those care interactions.

Data Processing

The transcriptions were thoroughly reviewed, and sections containing privacy-sensitive information were identified and marked by two experts using qualitative data analysis software. Subsequently, the accessible portions of the interviews were translated from German to US English using the locally executed LLM icky/translate. In the next step, llama3.1:70b, also run locally, was used to synthesize the conversation segments. This process involved generating similar, yet distinct and new, conversations that are not linked to the original data. The dataset was split using the train_test_split function from scikit-learn (https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.train_test_split.html).
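
To make the 70%-15%-15% split concrete, here is a standard-library sketch that reproduces the published counts (66 / 14 / 15) for 95 entries. It is not the authors' script: the parameters they passed to train_test_split are not stated, and the function name three_way_split, the seed, and the rounding rule (hold-out rounded up, then halved with the test part rounded up) are assumptions chosen to match the table.

```python
import math
import random


def three_way_split(items, holdout_percent=30, seed=0):
    """Shuffle, then split into train / validation / test.

    Assumption: the hold-out size is rounded up, and the hold-out is then
    halved with the test half rounded up. This matches the published counts
    but is not a documented setting of the dataset.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_total = len(shuffled)
    n_holdout = math.ceil(n_total * holdout_percent / 100)
    n_test = math.ceil(n_holdout / 2)
    n_train = n_total - n_holdout
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_total - n_test]
    test = shuffled[n_total - n_test:]
    return train, validation, test


train, validation, test = three_way_split(range(95))
print(len(train), len(validation), len(test))
```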
The global Artificial Intelligence Chip market size is USD 21584.2 million...
cognitivemarketresearch.com
pdf,excel,csv,ppt
Updated Apr 20, 2024
Cognitive Market Research (2024). The global Artificial Intelligence Chip market size is USD 21584.2 million in 2024. [Dataset]. https://www.cognitivemarketresearch.com/artificial-intelligence-chip-market-report
Available download formats: pdf, excel, csv, ppt
Dataset updated
Apr 20, 2024
Dataset authored and provided by
Cognitive Market Research
License
https://www.cognitivemarketresearch.com/privacy-policy
Time period covered
2021 - 2033
Area covered
Global
Description
According to Cognitive Market Research, the global Artificial Intelligence Chip market size will be USD 21584.2 million in 2024. It will expand at a compound annual growth rate (CAGR) of 39.50% from 2024 to 2031.

North America held the major market share for more than 40% of the global revenue with a market size of USD 8633.68 million in 2024 and will grow at a compound annual growth rate (CAGR) of 37.7% from 2024 to 2031. Europe accounted for a market share of over 30% of the global revenue with a market size of USD 6475.26 million. Asia Pacific held a market share of around 23% of the global revenue with a market size of USD 4964.37 million in 2024 and will grow at a compound annual growth rate (CAGR) of 41.5% from 2024 to 2031. Latin America had a market share of more than 5% of the global revenue with a market size of USD 1079.21 million in 2024 and will grow at a compound annual growth rate (CAGR) of 38.9% from 2024 to 2031. Middle East and Africa had a market share of around 2% of the global revenue and was estimated at a market size of USD 431.68 million in 2024 and will grow at a compound annual growth rate (CAGR) of 39.2% from 2024 to 2031. The BFSI held the highest Artificial Intelligence Chip market revenue share in 2024.
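
The regional figures above can be cross-checked with a few lines of arithmetic. This is a minimal sketch using only the numbers quoted in the description; the helper name compound is hypothetical, and the 2031 value it prints is simply what the stated CAGR implies if it held for all seven years, not a figure taken from the report.

```python
# Figures quoted in the description above (USD million, 2024).
GLOBAL_2024 = 21584.2
GLOBAL_CAGR = 0.395  # 39.50% per year, 2024 to 2031
REGIONS_2024 = {
    "North America": 8633.68,
    "Europe": 6475.26,
    "Asia Pacific": 4964.37,
    "Latin America": 1079.21,
    "Middle East and Africa": 431.68,
}


def compound(value: float, rate: float, years: int) -> float:
    """Value after `years` of growth at a constant annual `rate`."""
    return value * (1.0 + rate) ** years


# Share of the global 2024 market held by each region.
shares = {name: size / GLOBAL_2024 for name, size in REGIONS_2024.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")

# Implied 2031 size if the stated CAGR held for all seven years.
projected_2031 = compound(GLOBAL_2024, GLOBAL_CAGR, 2031 - 2024)
print(f"Implied 2031 size: USD {projected_2031:,.0f} million")
```

The five regional sizes correspond to shares of about 40%, 30%, 23%, 5% and 2%, and together they add up to the quoted global total.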

Market Dynamics of Artificial Intelligence Chip Market

Key Drivers for Artificial Intelligence Chip Market

Rapid data growth and computational power demand to Increase the Demand Globally

A processor built for compute-intensive workloads is critical for running AI algorithms. The faster the chip, the more quickly it can process the data needed to build an AI system. AI processors are primarily used in data centers and high-end servers because end-user computers lack the power and time to handle such substantial workloads. AMD provides a series of EPYC processors that target cloud services, data analytics, and visualization. It boasts an Ethernet bandwidth of 8–10 GB and a memory capacity of up to 4 TB. It provides security capabilities, flexibility, and sophisticated I/O integration. Cloud computing, high-performance computing (HPC), and numerous other applications are well served by AMD EPYC processors.

Growing potential of AI-based healthcare tools to Propel Market Growth

AI improves emergency care monitoring, real-time patient data collecting, and preventative healthcare suggestions. Health and wellness services like mobile apps may track patients' movements using AI. With AI-based tools, in-home health monitoring and information access, personalized health management, and treatment devices like better hearing aids, visual assistive devices, and physical assistive devices like intelligent walkers can be implemented efficiently. Thus, AI-based solutions are being used to improve the physical, emotional, social, and mental health of the elderly globally. Future applications may combine ML, DL, and computer vision for posture detection and geriatric behavior learning.

Restraint Factor for the Artificial Intelligence Chip Market

Minimal organized data for AI system development to Limit the Sales

Training and building a complete, powerful AI system requires data. Earlier, structured datasets were created through manual data entry. The growing digital footprint and technology trends like IoT and Industry 4.0 have generated large amounts of data from wearable devices, smart homes, intelligent thermostats, connected cars, IP cameras, smart devices, manufacturing machines, industrial equipment, and other remotely connected devices. Text, audio, and pictures make up this unstructured data. Without an organized internal structure, developers can't extract relevant data. Training machine learning tools requires high-quality labelled data and skilled human trainers. Time and skill are needed to extract and label unstructured data. Structured data is essential for AI system development. Companies are using semi-structured data to get insights from groupings.

Impact of Covid-19 on the Artificial Intelligence Chip Market

The long-term impact of the initial outbreak has been beneficial, despite the disruptions to the supply chain and manufacturing delays. The pandemic has expedited the process of AI adoption in a variety of industries, such as healthcare, retail, and manufacturing. The demand for AI processors was driven by the heightened necessity for automation, remote monitoring, and data and analytics. ...
FSDnoisy18k
data.niaid.nih.gov
paperswithcode.com
Updated Jan 24, 2020
FSDnoisy18k [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2529933
Dataset updated
Jan 24, 2020
Dataset provided by
Xavier Favory
Manoj Plakal
Mercedes Collado
Xavier Serra
Frederic Font
Eduardo Fonseca
Daniel P. W. Ellis
Description
FSDnoisy18k is an audio dataset collected with the aim of fostering the investigation of label noise in sound event classification. It contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data.

Data curators

Eduardo Fonseca and Mercedes Collado

Contact

You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

Citation

If you use this dataset or part of it, please cite the following ICASSP 2019 paper:

Eduardo Fonseca, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory, and Xavier Serra, “Learning Sound Event Classifiers from Web Audio with Noisy Labels”, arXiv preprint arXiv:1901.01189, 2019

You can also consider citing our ISMIR 2017 paper that describes the Freesound Annotator, which was used to gather the manual annotations included in FSDnoisy18k:

Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, “Freesound Datasets: A Platform for the Creation of Open Audio Datasets”, In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

FSDnoisy18k description

What follows is a summary of the most basic aspects of FSDnoisy18k. For a complete description of FSDnoisy18k, make sure to check:

the FSDnoisy18k companion site: http://www.eduardofonseca.net/FSDnoisy18k/

the description provided in Section 2 of our ICASSP 2019 paper

The source of audio content is Freesound, a sound sharing site created and maintained by the Music Technology Group that hosts over 400,000 clips uploaded by its community of users, who additionally provide some basic metadata (e.g., tags and title). The 20 classes of FSDnoisy18k are drawn from the AudioSet Ontology and are selected based on data availability as well as on their suitability to allow the study of label noise. The 20 classes are: "Acoustic guitar", "Bass guitar", "Clapping", "Coin (dropping)", "Crash cymbal", "Dishes, pots, and pans", "Engine", "Fart", "Fire", "Fireworks", "Glass", "Hi-hat", "Piano", "Rain", "Slam", "Squeak", "Tearing", "Walk, footsteps", "Wind", and "Writing". FSDnoisy18k was created with the Freesound Annotator, which is a platform for the collaborative creation of open audio datasets.

We defined a clean portion of the dataset consisting of correct and complete labels. The remaining portion is referred to as the noisy portion. Each clip in the dataset has a single ground truth label (singly-labeled data).

The clean portion of the data consists of audio clips whose labels are rated as present in the clip and predominant (almost all with full inter-annotator agreement), meaning that the label is correct and, in most cases, there is no additional acoustic material other than the labeled class. A few clips may contain some additional sound events, but they occur in the background and do not belong to any of the 20 target classes. This is more common for some classes that rarely occur alone, e.g., “Fire”, “Glass”, “Wind” or “Walk, footsteps”.

The noisy portion of the data consists of audio clips that received no human validation. In this case, they are categorized on the basis of the user-provided tags in Freesound. Hence, the noisy portion features a certain amount of label noise.

Code

We've released the code for our ICASSP 2019 paper at https://github.com/edufonseca/icassp19. The framework comprises all the basic stages: feature extraction, training, inference and evaluation. After loading the FSDnoisy18k dataset, log-mel energies are computed and a CNN baseline is trained and evaluated. The code also allows testing four noise-robust loss functions. Please check our paper for more details.

Label noise characteristics

FSDnoisy18k features real label noise that is representative of audio data retrieved from the web, particularly from Freesound. The analysis of a per-class, random, 15% of the noisy portion of FSDnoisy18k revealed that roughly 40% of the analyzed labels are correct and complete, whereas 60% of the labels show some type of label noise. Please check the FSDnoisy18k companion site for a detailed characterization of the label noise in the dataset, including a taxonomy of label noise for singly-labeled data as well as a per-class description of the label noise.

FSDnoisy18k basic characteristics

The dataset's most relevant characteristics are as follows:

FSDnoisy18k contains 18,532 audio clips (42.5h) unequally distributed in the 20 aforementioned classes drawn from the AudioSet Ontology.

The audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.

The audio clips are of variable length ranging from 300ms to 30s, and each clip has a single ground truth label (singly-labeled data).

The dataset is split into a test set and a train set. The test set is drawn entirely from the clean portion, while the remainder of data forms the train set.

The train set is composed of 17,585 clips (41.1h) unequally distributed among the 20 classes. It features a clean subset and a noisy subset. In terms of number of clips their proportion is 10%/90%, whereas in terms of duration the proportion is slightly more extreme (6%/94%). The per-class percentage of clean data within the train set is also imbalanced, ranging from 6.1% to 22.4%. The number of audio clips per class ranges from 51 to 170, and from 250 to 1000 in the clean and noisy subsets, respectively. Further, a noisy small subset is defined, which includes an amount of (noisy) data comparable (in terms of duration) to that of the clean subset.

The test set is composed of 947 clips (1.4h) that belong to the clean portion of the data. Its class distribution is similar to that of the clean subset of the train set. The number of per-class audio clips in the test set ranges from 30 to 72. The test set enables a multi-class classification problem.

FSDnoisy18k is an expandable dataset that features a per-class varying degree of types and amount of label noise. The dataset allows investigation of label noise as well as other approaches, from semi-supervised learning, e.g., self-training to learning with minimal supervision.

License

FSDnoisy18k has licenses at two different levels, as explained next. All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. In particular, all Freesound clips included in FSDnoisy18k are released under either CC-BY or CC0. For attribution purposes and to facilitate attribution of these files to third parties, we include a relation of audio clips and their corresponding license in the LICENSE-INDIVIDUAL-CLIPS file downloaded with the dataset.

In addition, FSDnoisy18k as a whole is the result of a curation process and it has an additional license. FSDnoisy18k is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the dataset.

Files

FSDnoisy18k can be downloaded as a series of zip files with the following directory structure:

root
│
└───FSDnoisy18k.audio_train/          Audio clips in the train set
│
└───FSDnoisy18k.audio_test/           Audio clips in the test set
│
└───FSDnoisy18k.meta/                 Files for evaluation setup
│   │
│   └───train.csv                     Data split and ground truth for the train set
│   │
│   └───test.csv                      Ground truth for the test set
│
└───FSDnoisy18k.doc/
    │
    └───README.md                     The dataset description file that you are reading
    │
    └───LICENSE-DATASET               License of the FSDnoisy18k dataset as an entity
    │
    └───LICENSE-INDIVIDUAL-CLIPS.csv  Licenses of the individual audio clips from Freesound

Each row (i.e. audio clip) of the train.csv file contains the following information:

fname: the file name

label: the audio classification label (ground truth)

aso_id: the id of the corresponding category as per the AudioSet Ontology

manually_verified: Boolean (1 or 0) flag to indicate whether the clip belongs to the clean portion (1), or to the noisy portion (0) of the train set

noisy_small: Boolean (1 or 0) flag to indicate whether the clip belongs to the noisy_small portion (1) of the train set

Each row (i.e. audio clip) of the test.csv file contains the following information:

fname: the file name

label: the audio classification label (ground truth)

aso_id: the id of the corresponding category as per the AudioSet Ontology
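
The column descriptions above suggest a simple way to separate the clean and noisy parts of the train set. The following is a minimal sketch, assuming train.csv has a header row whose column names match the field names listed above; the function name split_clean_noisy is hypothetical, and the sample rows (file names and aso_id values) are invented for illustration only.

```python
import csv
import io


def split_clean_noisy(csv_file):
    """Partition train.csv rows by the manually_verified flag (1 = clean)."""
    clean, noisy = [], []
    for row in csv.DictReader(csv_file):
        (clean if row["manually_verified"] == "1" else noisy).append(row)
    return clean, noisy


# Hypothetical rows for illustration only; real file names and ids differ.
sample = io.StringIO(
    "fname,label,aso_id,manually_verified,noisy_small\n"
    "0001.wav,Rain,/m/example1,1,0\n"
    "0002.wav,Wind,/m/example2,0,1\n"
    "0003.wav,Rain,/m/example1,0,0\n"
)
clean, noisy = split_clean_noisy(sample)

# The noisy_small subset is flagged within the noisy portion.
noisy_small = [row for row in noisy if row["noisy_small"] == "1"]
print(len(clean), len(noisy), len(noisy_small))
```

With the real data, the in-memory sample would be replaced by a file opened with open(..., newline="") on the train.csv inside FSDnoisy18k.meta/.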

Links

Source code for our preprint: https://github.com/edufonseca/icassp19

Freesound Annotator: https://annotator.freesound.org/

Freesound: https://freesound.org

Eduardo Fonseca’s personal website: http://www.eduardofonseca.net/

Acknowledgments

This work is partially supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688382 AudioCommons. Eduardo Fonseca is also sponsored by a Google Faculty Research Award 2017. We thank everyone who contributed to FSDnoisy18k with annotations.
Sample 8 rows of ADE-TABLE.
plos.figshare.com
xlsx
Updated Sep 11, 2024
Shuntaro Yada; Tomohiro Nishiyama; Shoko Wakamiya; Yoshimasa Kawazoe; Shungo Imai; Satoko Hori; Eiji Aramaki (2024). Sample 8 rows of ADE-TABLE. [Dataset]. http://doi.org/10.1371/journal.pone.0310432.s001
Available download formats: xlsx
Unique identifier
https://doi.org/10.1371/journal.pone.0310432.s001
Dataset updated
Sep 11, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Shuntaro Yada; Tomohiro Nishiyama; Shoko Wakamiya; Yoshimasa Kawazoe; Shungo Imai; Satoko Hori; Eiji Aramaki
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Surface expressions derived from private in-hospital texts in Japanese (TEXT). For readability, we have used their English translations. These surface expressions, which are often symptom-like phrases, are categorized as parent diseases and labeled with corresponding ICD codes. We further label each surface expression with relevance to frequent adverse effects of anti-cancer drugs, such as stomatitis, peripheral neuropathy, and hand-foot syndrome (e.g., and ). The original table has eight relevant labels containing binary values (1 if relevant and 0 otherwise). The samples were randomly selected from entries relevant to stomatitis or peripheral neuropathy. (XLSX)
OSM buildings noisy labels dataset
zenodo.org
explore.openaire.eu
zip
Updated Apr 27, 2022
Jonas Gütter (2022). OSM buildings noisy labels dataset [Dataset]. http://doi.org/10.5281/zenodo.6477788
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.6477788
Dataset updated
Apr 27, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Jonas Gütter
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains tile imagery from the OpenStreetMap project alongside label masks for buildings from OpenStreetMap. Besides the original clean label set, additional noisy label sets for random noise, removed and added buildings are provided.

The purpose of this dataset is to provide training data for analysing the impact of noisy labels on the performance of models for semantic segmentation in Earth observation.

The code for downloading and creating the datasets as well as for performing some preliminary analyses is also provided, however it is necessary to have access to a tile server where OpenStreetMap tiles can be downloaded in sufficient amounts.

To reproduce the dataset and perform analysis on it, do the following:

1. Unzip data.zip and code.zip.

2. Create the folder structure from data.

3. Build and activate a Python environment from environment.yml.

4. Insert the URL of a suitable tile server for OSM tiles in line 76 of utils.py.

5. Execute download_OSM_dataset.py to download OSM image tiles alongside OSM labels.

6. Execute create_noisy_labels.py for the OSM dataset to create noisy label sets.

7. Divide the images and labels into train and test data. split_data.py can be used as a baseline for this, but pathnames have to be adjusted and the corresponding directories have to be created first.

8. Call train_model.py to train a model on the data. Specify the data size and the label set by giving command line arguments as shown in train_model.sh.
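
To illustrate what a "random noise" label set can look like, here is a toy sketch that flips a fixed fraction of pixels in a binary building mask. It is not the method implemented in create_noisy_labels.py, whose details are not described here; the function name flip_random_pixels, the noise rate and the 4x4 mask are all assumptions made for the example.

```python
import random


def flip_random_pixels(mask, noise_rate, seed=0):
    """Return a copy of a binary mask with a fraction of its pixels flipped."""
    rng = random.Random(seed)
    height, width = len(mask), len(mask[0])
    n_flip = round(noise_rate * height * width)
    noisy = [row[:] for row in mask]
    # Sample distinct pixel indices so each chosen pixel is flipped exactly once.
    for index in rng.sample(range(height * width), n_flip):
        r, c = divmod(index, width)
        noisy[r][c] = 1 - noisy[r][c]
    return noisy


# Toy 4x4 building mask (1 = building, 0 = background).
clean_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
noisy_mask = flip_random_pixels(clean_mask, noise_rate=0.25)
changed = sum(
    a != b
    for clean_row, noisy_row in zip(clean_mask, noisy_mask)
    for a, b in zip(clean_row, noisy_row)
)
print(changed)
```

Removed or added buildings would need object-level edits rather than pixel flips, which this sketch does not attempt.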
Data from: Long-lasting vocal plasticity in adult marmoset monkeys
datadryad.org
data.niaid.nih.gov
zip
Updated Jun 7, 2019
Long-lasting vocal plasticity in adult marmoset monkeys [Dataset]. https://datadryad.org/stash/dataset/doi:10.5061/dryad.7nq1c6s
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.7nq1c6s
Dataset updated
Jun 7, 2019
Dataset provided by
Dryad
Authors
Lingyun Zhao; Bahar Boroumand Rad; Xiaoqin Wang
Time period covered
2019
Description
Marmoset vocalization data: JammingAllData.mat contains a MATLAB structure array “SbjData” which stores the experimental data for the four subjects. Each row in “SbjData” is a subject. The first column is for low-frequency noise perturbation (and the baseline preceding it) whereas the second column is for high-frequency noise perturbation (and the baseline preceding it). “SbjData.Param” stores various parameters used in the analysis and the experiment info. For example, “exp_day” contains the day number for each experimental session. Perturbation sessions are labeled as “Jamming” in the data structures. “SbjData.Measure.FreqPrePostWin” stores the actual data, as described below for its several fields. “Jamming” and “Baseline” contain the original recorded data for the perturbation and baseline sessions, respectively. “JammingDT” and “BaselineDT” are the detrended data. Inside these matrices, Column 1 labels separate sessions; Column 2 is the call onset time (sec); Column 4 is the fundame...


Stebani, Jannik (2023). Human Inner Ear Anatomy: Labeled Volume CT Data of Inner Ear Fluid Space and Anatomical Landmarks [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8277158

Human Inner Ear Anatomy: Labeled Volume CT Data of Inner Ear Fluid Space and Anatomical Landmarks

Dataset updated

Aug 30, 2023

Dataset provided by

Blaimer, Martin
Rak, Kristen
Neun, Tilman
Stebani, Jannik
Pelt, Daniël M.

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The provided dataset comprises 43 instances of temporal bone volume CT scans. The scans were performed on human cadaveric specimens with a resulting isotropic voxel size of 99 × 99 × 99 µm³. Voxel-wise image labels of the fluid space of the bony labyrinth, subdivided into the three semantic classes cochlear volume, vestibular volume and semicircular canal volume, are provided. In addition, each dataset contains JSON-like descriptor data defining the voxel coordinates of the anatomical landmarks: (1) apex of the cochlea, (2) oval window and (3) round window. The dataset can be used to train and evaluate algorithmic machine learning models for automated inner ear analysis in the context of the supervised learning paradigm.

Usage Notes

The datasets are formatted in the HDF5 format developed by the HDF5 Group. We utilized and thus recommend the usage of Python bindings pyHDF to handle the datasets.

The flat-panel volume CT raw data, labels and landmarks are saved in the HDF5-internal file structure using the respective group and datasets:

raw/raw-0 label/label-0 landmark/landmark-0 landmark/landmark-1 landmark/landmark-2

Array raw and label data can be read from the file by indexing into an opened h5py file handle, for example as numpy.ndarray. Further metadata is contained in the attribute dictionaries of the raw and label datasets.

Landmark coordinate data is available as an attribute dict and contains the coordinate system (LPS or RAS), IJK voxel coordinates and label information. The helicotrema or cochlea top is globally saved in landmark 0, the oval window in landmark 1 and the round window in landmark 2. Read as a Python dictionary, exemplary landmark information for a dataset may read as follows:

{'coordsys': 'LPS', 'id': 1, 'ijk_position': array([181, 188, 100]), 'label': 'CochleaTop', 'orientation': array([-1., -0., -0., -0., -1., -0., 0., 0., 1.]), 'xyz_position': array([ 44.21109689, -139.38058589, -183.48249736])}

{'coordsys': 'LPS', 'id': 2, 'ijk_position': array([222, 182, 145]), 'label': 'OvalWindow', 'orientation': array([-1., -0., -0., -0., -1., -0., 0., 0., 1.]), 'xyz_position': array([ 48.27890112, -139.95991131, -179.04103763])}

{'coordsys': 'LPS', 'id': 3, 'ijk_position': array([223, 209, 147]), 'label': 'RoundWindow', 'orientation': array([-1., -0., -0., -0., -1., -0., 0., 0., 1.]), 'xyz_position': array([ 48.33120126, -137.27135678, -178.8665465 ])}
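
As a small worked example of using the landmark dictionaries, the sketch below computes the distance between the oval window and the round window, once from xyz_position and once from ijk_position. The coordinates are copied from the example dictionaries above; the helper name distance is hypothetical, and treating xyz_position as millimetres is an assumption, supported only by the fact that the implied spacing per voxel comes out close to the stated 99 µm voxel size.

```python
import math

# xyz_position values copied from the example landmark dictionaries above.
landmarks_xyz = {
    "CochleaTop": (44.21109689, -139.38058589, -183.48249736),
    "OvalWindow": (48.27890112, -139.95991131, -179.04103763),
    "RoundWindow": (48.33120126, -137.27135678, -178.8665465),
}
# ijk_position values (voxel indices) from the same dictionaries.
landmarks_ijk = {
    "CochleaTop": (181, 188, 100),
    "OvalWindow": (222, 182, 145),
    "RoundWindow": (223, 209, 147),
}


def distance(points, a, b):
    """Euclidean distance between two named landmarks."""
    return math.dist(points[a], points[b])


window_gap_xyz = distance(landmarks_xyz, "OvalWindow", "RoundWindow")
window_gap_voxels = distance(landmarks_ijk, "OvalWindow", "RoundWindow")
spacing = window_gap_xyz / window_gap_voxels

print(f"xyz distance: {window_gap_xyz:.3f}")
print(f"voxel distance: {window_gap_voxels:.2f}")
print(f"implied spacing per voxel: {spacing:.4f}")
```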
