66 datasets found
  1. Human Inner Ear Anatomy: Labeled Volume CT Data of Inner Ear Fluid Space and...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 30, 2023
    Cite
    Stebani, Jannik (2023). Human Inner Ear Anatomy: Labeled Volume CT Data of Inner Ear Fluid Space and Anatomical Landmarks [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8277158
    Explore at:
    Dataset updated
    Aug 30, 2023
    Dataset provided by
    Blaimer, Martin
    Rak, Kristen
    Neun, Tilman
    Stebani, Jannik
    Pelt, Daniël M.
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The provided dataset comprises 43 instances of temporal bone volume CT scans. The scans were performed on human cadaveric specimens with a resulting isotropic voxel size of 99 × 99 × 99 µm³. Voxel-wise image labels of the fluid space of the bony labyrinth, subdivided into the three semantic classes cochlear volume, vestibular volume and semicircular canal volume, are provided. In addition, each dataset contains JSON-like descriptor data defining the voxel coordinates of the anatomical landmarks: (1) apex of the cochlea, (2) oval window and (3) round window. The dataset can be used to train and evaluate machine learning models for automated inner ear analysis in the context of the supervised learning paradigm.

    Usage Notes

    The datasets are formatted in the HDF5 format developed by The HDF Group. We used, and therefore recommend, the Python bindings h5py to handle the datasets.

    The flat-panel volume CT raw data, labels and landmarks are saved in the HDF5-internal file structure using the following groups and datasets:

    raw/raw-0
    label/label-0
    landmark/landmark-0
    landmark/landmark-1
    landmark/landmark-2

    Array raw and label data can be read from the file by indexing into an opened h5py file handle, for example as numpy.ndarray. Further metadata is contained in the attribute dictionaries of the raw and label datasets.

    Landmark coordinate data is available as an attribute dict and contains the coordinate system (LPS or RAS), IJK voxel coordinates and label information. The helicotrema (cochlea top) is saved in landmark 0, the oval window in landmark 1 and the round window in landmark 2. Read as a Python dictionary, exemplary landmark information for a dataset reads as follows:

    {'coordsys': 'LPS', 'id': 1, 'ijk_position': array([181, 188, 100]), 'label': 'CochleaTop', 'orientation': array([-1., -0., -0., -0., -1., -0., 0., 0., 1.]), 'xyz_position': array([ 44.21109689, -139.38058589, -183.48249736])}

    {'coordsys': 'LPS', 'id': 2, 'ijk_position': array([222, 182, 145]), 'label': 'OvalWindow', 'orientation': array([-1., -0., -0., -0., -1., -0., 0., 0., 1.]), 'xyz_position': array([ 48.27890112, -139.95991131, -179.04103763])}

    {'coordsys': 'LPS', 'id': 3, 'ijk_position': array([223, 209, 147]), 'label': 'RoundWindow', 'orientation': array([-1., -0., -0., -0., -1., -0., 0., 0., 1.]), 'xyz_position': array([ 48.33120126, -137.27135678, -178.8665465 ])}
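    A minimal sketch of reading the arrays and landmark attributes described above, assuming the h5py and numpy packages; the file name dataset_00.h5 is a placeholder for an actual file from the dataset:

    import h5py
    import numpy as np

    # Placeholder file name; substitute an actual HDF5 file from the dataset.
    with h5py.File("dataset_00.h5", "r") as f:
        raw = np.asarray(f["raw/raw-0"])        # CT volume as a numpy.ndarray
        label = np.asarray(f["label/label-0"])  # voxel-wise semantic labels
        raw_meta = dict(f["raw/raw-0"].attrs)   # further metadata from the attribute dictionary

        # Landmark 0: cochlea top (helicotrema), 1: oval window, 2: round window.
        for name in ("landmark-0", "landmark-1", "landmark-2"):
            lm = dict(f["landmark/" + name].attrs)
            print(name, lm.get("label"), lm.get("ijk_position"), lm.get("coordsys"))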

  2. Data from: SRL4ORL: Improving Opinion Role Labeling Using Multi-Task...

    • heidata.uni-heidelberg.de
    • tudatalib.ulb.tu-darmstadt.de
    zip
    Updated Feb 4, 2019
    + more versions
    Cite
    Ana Marasovic (2019). SRL4ORL: Improving Opinion Role Labeling Using Multi-Task Learning With Semantic Role Labeling [Source Code] [Dataset]. http://doi.org/10.11588/DATA/LWN9XE
    Explore at:
    zip (14676065)
    Dataset updated
    Feb 4, 2019
    Dataset provided by
    heiDATA
    Authors
    Ana Marasovic
    License

    https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.11588/DATA/LWN9XE

    Description

    This repository contains code for reproducing the experiments in Marasovic and Frank (2018).

    Paper abstract: For over a decade, machine learning has been used to extract opinion-holder-target structures from text to answer the question "Who expressed what kind of sentiment towards what?". Recent neural approaches do not outperform the state-of-the-art feature-based models for Opinion Role Labeling (ORL). We suspect this is due to the scarcity of labeled training data and address this issue using different multi-task learning (MTL) techniques with a related task which has substantially more data, i.e. Semantic Role Labeling (SRL). We show that two MTL models improve significantly over the single-task model for labeling of both holders and targets, on the development and the test sets. We found that the vanilla MTL model, which makes predictions using only shared ORL and SRL features, performs the best. With deeper analysis, we determine what works and what might be done to make further improvements for ORL.

    Data for ORL: Download the MPQA 2.0 corpus. Check mpqa2-pytools for example usage. Splits can be found in the datasplit folder.

    Data for SRL: The data is provided by the CoNLL-2005 Shared Task, but the original words are from the Penn Treebank dataset, which is not publicly available.

    How to train models?

    python main.py --adv_coef 0.0 --model fs --exp_setup_id new --n_layers_orl 0 --begin_fold 0 --end_fold 4
    python main.py --adv_coef 0.0 --model html --exp_setup_id new --n_layers_orl 1 --n_layers_shared 2 --begin_fold 0 --end_fold 4
    python main.py --adv_coef 0.0 --model sp --exp_setup_id new --n_layers_orl 3 --begin_fold 0 --end_fold 4
    python main.py --adv_coef 0.1 --model asp --exp_setup_id prior --n_layers_orl 3 --begin_fold 0 --end_fold 10

  3. AI Training Dataset Market Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Nov 22, 2024
    Cite
    AI Training Dataset Market Report [Dataset]. https://www.archivemarketresearch.com/reports/ai-training-dataset-market-5881
    Explore at:
    ppt, pdf, doc
    Dataset updated
    Nov 22, 2024
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    global
    Variables measured
    Market Size
    Description

    The AI Training Dataset Market size was valued at USD 2124.0 million in 2023 and is projected to reach USD 8593.38 million by 2032, exhibiting a CAGR of 22.1% during the forecast period. An AI training dataset is a collection of data used to train machine learning models. It typically includes labeled examples, where each data point has an associated output label or target value. The quality and quantity of this data are crucial for the model's performance. A well-curated dataset ensures the model learns relevant features and patterns, enabling it to generalize effectively to new, unseen data. Training datasets can encompass various data types, including text, images, audio, and structured data. The driving forces behind this growth include:

  4. Data from: DeepLabCut: markerless pose estimation of user-defined body parts...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Oct 13, 2023
    Cite
    Alexander Mathis; Pranav Mamidanna; Kevin M. Cury; Taiga Abe; Venkatesh N. Murthy; Mackenzie Weygandt Mathis; Matthias Bethge (2023). DeepLabCut: markerless pose estimation of user-defined body parts with deep learning [Dataset]. http://doi.org/10.5281/zenodo.4008504
    Explore at:
    zip
    Dataset updated
    Oct 13, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alexander Mathis; Pranav Mamidanna; Kevin M. Cury; Taiga Abe; Venkatesh N. Murthy; Mackenzie Weygandt Mathis; Matthias Bethge
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data entry contains a public release of annotated mouse data from the DeepLabCut Nature Neuroscience paper. The trail-tracking behavior is part of an investigation into odor-guided navigation, where one or multiple wildtype (C57BL/6J) mice run on a paper spool and follow odor trails. These experiments were carried out by Alexander Mathis & Mackenzie Mathis in the Murthy lab at Harvard University.

    Data was recorded by two different cameras at 30 Hz: at 640×480 pixels with a Point Grey Firefly (FMVU-03MTM-CS), and at approximately 1,700×1,200 pixels with a Grasshopper 3 4.1 MP Mono USB3 Vision (CMOSIS CMV4000-3E12). The latter images were cropped around the mice to generate images of approximately 800×800 pixels.

    Here we share 1,066 frames from multiple experimental sessions observing 7 different mice. Pranav Mamidanna labeled the snout, the tips of the left and right ears, and the base of the tail in the example images. The data is organized in the DeepLabCut 2.0 project structure, with images and annotations in the labeled-data folder. The names are pseudocodes indicating mouse id and session id, e.g. m4s1 = mouse 4, session 1.

    Code for loading, visualizing & training deep neural networks available at https://github.com/DeepLabCut/DeepLabCut.
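    As a rough sketch of how the annotations could be inspected, assuming the standard DeepLabCut 2.0 layout in which each labeled-data/<session> folder holds a CollectedData_<scorer>.h5 (or .csv) table of body-part coordinates; the session and scorer names used here are placeholders:

    import pandas as pd

    # Placeholder path following the DeepLabCut 2.0 convention
    # labeled-data/<session>/CollectedData_<scorer>.h5; adjust to the actual files.
    df = pd.read_hdf("labeled-data/m4s1/CollectedData_Pranav.h5")

    # Columns form a (scorer, bodypart, coordinate) MultiIndex; rows are the labeled image paths.
    print(df.columns.get_level_values(1).unique())  # e.g. snout, ear tips, tail base
    print(df.head())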

  5. The algorithms ranking.

    • figshare.com
    bin
    Updated Jun 16, 2023
    + more versions
    Cite
    Moein E. Samadi; Sandra Kiefer; Sebastian Johaness Fritsch; Johannes Bickenbach; Andreas Schuppert (2023). The algorithms ranking. [Dataset]. http://doi.org/10.1371/journal.pone.0274569.t006
    Explore at:
    bin
    Dataset updated
    Jun 16, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Moein E. Samadi; Sandra Kiefer; Sebastian Johaness Fritsch; Johannes Bickenbach; Andreas Schuppert
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The algorithms ranking.

  6. Data from: DIPSER: A Dataset for In-Person Student Engagement Recognition in...

    • observatorio-cientifico.ua.es
    • scidb.cn
    Updated 2025
    Cite
    Márquez-Carpintero, Luis; Suescun-Ferrandiz, Sergio; Álvarez, Carolina Lorenzo; Fernandez-Herrero, Jorge; Viejo, Diego; Rosabel Roig-Vila; Cazorla, Miguel (2025). DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild [Dataset]. https://observatorio-cientifico.ua.es/documentos/67321d21aea56d4af0484172
    Explore at:
    Dataset updated
    2025
    Authors
    Márquez-Carpintero, Luis; Suescun-Ferrandiz, Sergio; Álvarez, Carolina Lorenzo; Fernandez-Herrero, Jorge; Viejo, Diego; Rosabel Roig-Vila; Cazorla, Miguel
    Description

    Data Description

    The DIPSER dataset is designed to assess student attention and emotion in in-person classroom settings, consisting of RGB camera data, smartwatch sensor data, and labeled attention and emotion metrics. It includes multiple camera angles per student to capture posture and facial expressions, complemented by smartwatch data for inertial and biometric metrics. Attention and emotion labels are derived from self-reports and expert evaluations. The dataset includes diverse demographic groups, with data collected in real-world classroom environments, facilitating the training of machine learning models for predicting attention and correlating it with emotional states.

    Data Collection and Generation Procedures

    The dataset was collected in a natural classroom environment at the University of Alicante, Spain. The recording setup consisted of six general cameras positioned to capture the overall classroom context and individual cameras placed at each student's desk. Additionally, smartwatches were used to collect biometric data, such as heart rate, accelerometer, and gyroscope readings.

    Experimental Sessions

    Nine distinct educational activities were designed to ensure a comprehensive range of engagement scenarios:

    • News Reading – Students read projected or device-displayed news.
    • Brainstorming Session – Idea generation for problem-solving.
    • Lecture – Passive listening to an instructor-led session.
    • Information Organization – Synthesizing information from different sources.
    • Lecture Test – Assessment of lecture content via mobile devices.
    • Individual Presentations – Students present their projects.
    • Knowledge Test – Conducted using Kahoot.
    • Robotics Experimentation – Hands-on session with robotics.
    • MTINY Activity Design – Development of educational activities with computational thinking.

    Technical Specifications

    • RGB Cameras: Individual cameras recorded at 640×480 pixels, while context cameras captured at 1280×720 pixels.
    • Frame Rate: 9-10 FPS depending on the setup.
    • Smartwatch Sensors: Collected heart rate, accelerometer, gyroscope, rotation vector, and light sensor data at a frequency of 1–100 Hz.

    Data Organization and Formats

    The dataset follows a structured directory format: /groupX/experimentY/subjectZ.zip

    Each subject-specific folder contains:

    • images/ (individual facial images)
    • watch_sensors/ (sensor readings in JSON format)
    • labels/ (engagement & emotion annotations)
    • metadata/ (subject demographics & session details)

    Annotations and Labeling

    Each data entry includes engagement levels (1-5) and emotional states (9 categories) based on both self-reported labels and evaluations by four independent experts. A custom annotation tool was developed to ensure consistency across evaluations.

    Missing Data and Data Quality

    • Synchronization: A centralized server ensured time alignment across devices. Brightness changes were used to verify synchronization.
    • Completeness: No major missing data, except for occasional random frame drops due to embedded device performance.
    • Data Consistency: Uniform collection methodology across sessions, ensuring high reliability.

    Data Processing Methods

    To enhance usability, the dataset includes preprocessed bounding boxes for face, body, and hands, along with gaze estimation and head pose annotations. These were generated using YOLO, MediaPipe, and DeepFace.

    File Formats and Accessibility

    • Images: Stored in standard JPEG format.
    • Sensor Data: Provided as structured JSON files.
    • Labels: Available as CSV files with timestamps.

    The dataset is publicly available under the CC-BY license and can be accessed, along with the necessary processing scripts, via the DIPSER GitHub repository.

    Potential Errors and Limitations

    • Due to camera angles, some student movements may be out of frame in collaborative sessions.
    • Lighting conditions vary slightly across experiments.
    • Sensor latency variations are minimal but exist due to embedded device constraints.

    Citation

    If you find this project helpful for your research, please cite our work using the following bibtex entry:

    @misc{marquezcarpintero2025dipserdatasetinpersonstudent1,
      title={DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild},
      author={Luis Marquez-Carpintero and Sergio Suescun-Ferrandiz and Carolina Lorenzo Álvarez and Jorge Fernandez-Herrero and Diego Viejo and Rosabel Roig-Vila and Miguel Cazorla},
      year={2025},
      eprint={2502.20209},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.20209},
    }

    Usage and Reproducibility

    Researchers can utilize standard tools like OpenCV, TensorFlow, and PyTorch for analysis. The dataset supports research in machine learning, affective computing, and education analytics, offering a unique resource for engagement and attention studies in real-world classroom environments.
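    A minimal sketch of how one subject's smartwatch recordings might be read, based only on the /groupX/experimentY/subjectZ.zip layout and JSON sensor format described above; the concrete paths inside the archive are assumptions:

    import json
    import zipfile
    from pathlib import Path

    # Placeholder path following the documented layout /groupX/experimentY/subjectZ.zip.
    archive = Path("group1/experiment1/subject1.zip")

    with zipfile.ZipFile(archive) as zf:
        # Sensor readings are documented as JSON files under watch_sensors/.
        sensor_files = [n for n in zf.namelist()
                        if "watch_sensors/" in n and n.endswith(".json")]
        for name in sensor_files:
            with zf.open(name) as fh:
                readings = json.load(fh)
            print(name, type(readings))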

  7. Data from: Semi-supervised non-negative matrix factorization with structure...

    • data.mendeley.com
    Updated Dec 9, 2024
    Cite
    Wenjing Jing (2024). Semi-supervised non-negative matrix factorization with structure preserving for image clustering [Dataset]. http://doi.org/10.17632/gf67wvrhbs.1
    Explore at:
    Dataset updated
    Dec 9, 2024
    Authors
    Wenjing Jing
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The code for the paper "Semi-supervised non-negative matrix factorization with structure preserving for image clustering". This paper constructs a new label matrix with weights and further constructs a label constraint regularizer to both utilize the label information and maintain the intrinsic structure of NMF. Based on the label constraint regularizer, the basis images of labeled data are extracted for monitoring and modifying the basis image learning of all data by establishing a basis regularizer. By incorporating the label constraint regularizer and the basis regularizer into NMF, a new semi-supervised NMF method is introduced. The proposed method is applied to image clustering, and experimental results demonstrate its effectiveness in comparison with state-of-the-art unsupervised and semi-supervised algorithms.

  8. Global Data Classification Tool Market Research Report: By Deployment Model...

    • wiseguyreports.com
    Updated Jun 21, 2024
    Cite
    Wiseguy Research Consultants Pvt Ltd (2024). Global Data Classification Tool Market Research Report: By Deployment Model (On-Premises, Cloud-Based, SaaS-Based), By Organization Size (Small & Medium-Sized Enterprises (SMEs), Large Enterprises), By Industry Vertical (Healthcare, Financial Services, Government and Public Sector, Retail and E-commerce, Manufacturing and Logistics), By Data Type (Structured Data, Semi-Structured Data, Unstructured Data), By Functionality (Automated Data Classification, Manual Data Classification, Data Discovery, Data Labeling, Data Masking) and By Regional (North America, Europe, South America, Asia Pacific, Middle East and Africa) - Forecast to 2032. [Dataset]. https://www.wiseguyreports.com/reports/data-classification-tool-market
    Explore at:
    Dataset updated
    Jun 21, 2024
    Dataset authored and provided by
    Wiseguy Research Consultants Pvt Ltd
    License

    https://www.wiseguyreports.com/pages/privacy-policy

    Time period covered
    Jan 6, 2024
    Area covered
    Global
    Description
    BASE YEAR: 2024
    HISTORICAL DATA: 2019 - 2024
    REPORT COVERAGE: Revenue Forecast, Competitive Landscape, Growth Factors, and Trends
    MARKET SIZE 2023: 2.83 (USD Billion)
    MARKET SIZE 2024: 3.38 (USD Billion)
    MARKET SIZE 2032: 14.02 (USD Billion)
    SEGMENTS COVERED: Deployment Model, Organization Size, Industry Vertical, Data Type, Application, Regional
    COUNTRIES COVERED: North America, Europe, APAC, South America, MEA
    KEY MARKET DYNAMICS: Increasing data privacy regulations; growing need for data security and compliance; proliferation of unstructured data; rise of artificial intelligence and machine learning; adoption of cloud-based data storage
    MARKET FORECAST UNITS: USD Billion
    KEY COMPANIES PROFILED: Informatica, Oracle, Symantec, IBM, Splunk, Varonis Systems, Digital Guardian, STEALTHbits Technologies, Cybereason, Netskope, FireEye, Trustwave, Check Point Software Technologies
    MARKET FORECAST PERIOD: 2024 - 2032
    KEY MARKET OPPORTUNITIES: Increase in data breaches; growing adoption of cloud and SaaS solutions; need for data protection and compliance regulations; emergence of AI and ML technologies; growing focus on data privacy
    COMPOUND ANNUAL GROWTH RATE (CAGR): 19.46% (2024 - 2032)
  9. CheckMyBlob evaluation data set (CL)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 8, 2023
    Cite
    CheckMyBlob evaluation data set (CL) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1040856
    Explore at:
    Dataset updated
    Aug 8, 2023
    Dataset provided by
    Kowiel, Marcin
    Jaskolski, Mariusz
    Porebski, Przemyslaw J.
    Brzezinski, Dariusz
    Minor, Wladek
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A data set of ligands used to evaluate the CheckMyBlob method, described in the Kowiel et al. paper "Automatic recognition of ligands in electron density by machine learning methods".

    This data set repeats the setup used in the study of Carolan & Lamzin titled "Automated identification of crystallographic ligands using sparse-density representations". It consists of ligands from X-ray diffraction experiments with 1.0–2.5 Å resolution. Adjacent PDB ligands were not connected. Ligands were labeled according to the PDB naming convention. The data set was limited to the 82 ligand types listed by Carolan & Lamzin. The resulting data set consists of 121,360 examples with ligand counts ranging from 42,622 examples for SO4 to 16 for SPO (spheroidene).

    For machine learning (classification) purposes, the target attribute is: res_name.
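    A hedged sketch of how the res_name target could be separated from the descriptive features, assuming the set is distributed as a tabular file; the file name cmb_cl.csv below is hypothetical and should be replaced by the actual download:

    import pandas as pd

    # Hypothetical file name for the tabular ligand descriptors; adjust to the actual download.
    df = pd.read_csv("cmb_cl.csv")

    # res_name is the classification target (the PDB ligand code, e.g. SO4 or SPO).
    y = df["res_name"]
    X = df.drop(columns=["res_name"])

    print(y.value_counts().head())  # most frequent ligand types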

  10. 3D Microvascular Image Data and Labels for Machine Learning - Dataset -...

    • b2find.dkrz.de
    Updated May 7, 2024
    + more versions
    Cite
    (2024). 3D Microvascular Image Data and Labels for Machine Learning - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/fc51c530-c314-5fd4-8979-1032b6f798cf
    Explore at:
    Dataset updated
    May 7, 2024
    Description

    These images and associated binary labels were collected from collaborators across multiple universities to serve as a diverse representation of biomedical images of vessel structures, for use in the training and validation of machine learning tools for vessel segmentation. The dataset contains images from a variety of imaging modalities, at different resolutions, using different sources of contrast and featuring different organs/pathologies. This data was used to train, test and validate a foundational model for 3D vessel segmentation, tUbeNet, which can be found on GitHub. The paper describing the training and validation of the model can be found here.

    Filenames are structured as follows:

    Data - [Modality][species Organ][resolution].tif
    Labels - [Modality][species Organ][resolution]labels.tif
    Sub-volumes of larger dataset - [Modality][species Organ]_subvolume[dimensions in pixels].tif

    Manual labelling of blood vessels was carried out using Amira (2020.2, Thermo-Fisher, UK).

    Training data:

    opticalHREM_murineLiver_2.26x2.26x1.75um.tif: A high resolution episcopic microscopy (HREM) dataset, acquired in house by staining a healthy mouse liver with Eosin B and imaging it using a standard HREM protocol. NB: 25% of this image volume was withheld from training, for use as test data.

    CT_murineTumour_20x20x20um.tif: X-ray microCT images of a microvascular cast, taken from a subcutaneous mouse model of colorectal cancer (acquired in house). NB: 25% of this image volume was withheld from training, for use as test data.

    RSOM_murineTumour_20x20um.tif: Raster-Scanning Optoacoustic Mesoscopy (RSOM) data from a subcutaneous tumour model (provided by Emma Brown, Bohndiek Group, University of Cambridge). The image data has undergone filtering to reduce the background (Brown et al., 2019).

    OCTA_humanRetina_24x24um.tif: Retinal angiography data obtained using Optical Coherence Tomography Angiography (OCT-A) (provided by Dr Ranjan Rajendram, Moorfields Eye Hospital).

    Test data:

    MRI_porcineLiver_0.9x0.9x5mm.tif: T1-weighted Balanced Turbo Field Echo Magnetic Resonance Imaging (MRI) data from a machine-perfused porcine liver, acquired in-house.

    MFHREM_murineTumourLectin_2.76x2.76x2.61um.tif: A subcutaneous colorectal tumour mouse model imaged in house using Multi-fluorescence HREM, with Dylight 647 conjugated lectin staining the vasculature (Walsh et al., 2021). The image data has been processed using an asymmetric deconvolution algorithm described by Walsh et al., 2020. NB: A sub-volume of 480x480x640 voxels was manually labelled (MFHREM_murineTumourLectin_subvolume480x480x640.tif).

    MFHREM_murineBrainLectin_0.85x0.85x0.86um.tif: An MF-HREM image of the cortex of a mouse brain, stained with Dylight-647 conjugated lectin, acquired in house (Walsh et al., 2021). The image data has been downsampled and processed using an asymmetric deconvolution algorithm described by Walsh et al., 2020. NB: A sub-volume of 1000x1000x99 voxels was manually labelled. This sub-volume is provided at full resolution and without preprocessing (MFHREM_murineBrainLectin_subvol_0.57x0.57x0.86um.tif).

    2Photon_murineOlfactoryBulbLectin_0.2x0.46x5.2um.tif: Two-photon data of mouse olfactory bulb blood vessels, labelled with sulforhodamine 101, kindly provided by Yuxin Zhang at the Sensory Circuits and Neurotechnology Lab, the Francis Crick Institute (Bosch et al., 2022). NB: A sub-volume of 500x500x79 voxels was manually labelled (2Photon_murineOlfactoryBulbLectin_subvolume500x500x79.tif).

    References:

    Bosch, C., Ackels, T., Pacureanu, A., Zhang, Y., Peddie, C. J., Berning, M., Rzepka, N., Zdora, M. C., Whiteley, I., Storm, M., Bonnin, A., Rau, C., Margrie, T., Collinson, L., & Schaefer, A. T. (2022). Functional and multiscale 3D structural investigation of brain tissue through correlative in vivo physiology, synchrotron microtomography and volume electron microscopy. Nature Communications, 13(1), 1–16. https://doi.org/10.1038/s41467-022-30199-6

    Brown, E., Brunker, J., & Bohndiek, S. E. (2019). Photoacoustic imaging as a tool to probe the tumour microenvironment. DMM Disease Models and Mechanisms, 12(7). https://doi.org/10.1242/DMM.039636

    Walsh, C., Holroyd, N. A., Finnerty, E., Ryan, S. G., Sweeney, P. W., Shipley, R. J., & Walker-Samuel, S. (2021). Multifluorescence High-Resolution Episcopic Microscopy for 3D Imaging of Adult Murine Organs. Advanced Photonics Research, 2(10), 2100110. https://doi.org/10.1002/ADPR.202100110

    Walsh, C., Holroyd, N., Shipley, R., & Walker-Samuel, S. (2020). Asymmetric Point Spread Function Estimation and Deconvolution for Serial-Sectioning Block-Face Imaging. Communications in Computer and Information Science, 1248 CCIS, 235–249. https://doi.org/10.1007/978-3-030-52791-4_19
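    As a small, hedged example of loading one of the volumes named above together with its labels (assuming the tifffile package; the exact label file name is inferred from the stated naming pattern and should be checked against the download):

    import numpy as np
    import tifffile

    # File names follow the pattern stated above; the label file name is inferred
    # from that pattern and should be verified against the actual download.
    image = tifffile.imread("CT_murineTumour_20x20x20um.tif")
    labels = tifffile.imread("CT_murineTumour_20x20x20umlabels.tif")

    print(image.shape, image.dtype)  # 3D volume
    print(np.unique(labels))         # binary vessel labels
    assert image.shape == labels.shape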

  11. Performance metrics for the validation dataset for models trained with all...

    • plos.figshare.com
    xls
    Updated Mar 29, 2024
    Cite
    Performance metrics for the validation dataset for models trained with all modifications except for over-sampling (Combined changes) and with just the rotation modification (Rotate). [Dataset]. https://plos.figshare.com/articles/dataset/Performance_metrics_for_the_validation_dataset_for_models_trained_with_all_modifications_except_for_over-sampling_Combined_changes_and_with_just_the_rotation_modification_Rotate_/25510249
    Explore at:
    xls
    Dataset updated
    Mar 29, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Marjolein Oostrom; Michael A. Muniak; Rogene M. Eichler West; Sarah Akers; Paritosh Pande; Moses Obiri; Wei Wang; Kasey Bowyer; Zhuhao Wu; Lisa M. Bramer; Tianyi Mao; Bobbie Jo M. Webb-Robertson
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Std Dev Diff is the standard deviation, over validation sections, of the differences of the metrics relative to the default rotate-augmentation model, and the Avg Diff is the overall average difference compared to the default model. All models had a background weight of 1.0, all layers fully trainable, and a 0.001 learning rate.

  12. Labelled data for fine tuning a geological Named Entity Recognition and...

    • data-search.nerc.ac.uk
    • metadata.bgs.ac.uk
    html
    Updated Dec 19, 2024
    Cite
    British Geological Survey (2024). Labelled data for fine tuning a geological Named Entity Recognition and Entity Relation Extraction model [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/api/records/15ac4ca9-3be0-119e-e063-0937940a8990
    Explore at:
    html
    Dataset updated
    Dec 19, 2024
    Dataset authored and provided by
    British Geological Survey (https://www.bgs.ac.uk/)
    License

    http://inspire.ec.europa.eu/metadata-codelist/LimitationsOnPublicAccess/noLimitations

    Time period covered
    Nov 1, 2023 - Feb 15, 2024
    Description

    This dataset consists of sentences extracted from BGS memoirs, DECC/OGA onshore hydrocarbons well reports and Mineral Reconnaissance Programme (MRP) reports. The sentences have been annotated so that the dataset can be used as labelled training data for a Named Entity Recognition model and an Entity Relation Extraction model, both of which are Natural Language Processing (NLP) techniques that assist with extracting structured data from unstructured text. The entities of interest are rock formations, geological ages, rock types, physical properties and locations, with inter-relations such as overlies and observedIn. The entity labels for rock formations and geological ages in the BGS memoirs were an extract from earlier published work (https://github.com/BritishGeologicalSurvey/geo-ner-model, https://zenodo.org/records/4181488). The data can be used to fine-tune a pre-trained large language model using transfer learning, to create a model that can be used in inference mode to automatically create the labels, thereby creating structured data useful for geological modelling and subsurface characterisation.

    The data is provided in JSONL (Relation) format, which is the export format of the doccano open-source text annotation software (https://doccano.github.io/doccano/) used to create the labels. The source documents are already publicly available, but the MRP and DECC reports are only published in PDF image form. These latter documents had to undergo OCR, which resulted in lower-quality text and lower-quality training data. The majority of the labelled data is from the higher-quality BGS memoirs text.

    The dataset is a proof of concept. Minimal peer review of the labelling has been conducted, so it should not be treated as a gold-standard labelled dataset, and it is of insufficient volume to build a performant model. The development of this training data and the text processing scripts was supported by a grant from the UK Government Office for Technology Transfer (GOTT) Knowledge Asset Grant Fund, Project 10083604.
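    As a rough sketch of how such a JSONL (Relation) export could be inspected, assuming the usual doccano export fields (text, entities with character offsets, relations); the file name annotations.jsonl is hypothetical and the field names should be verified against the downloaded files:

    import json

    # annotations.jsonl is a placeholder name; field names reflect doccano's typical
    # JSONL (Relation) export and should be checked against the actual files.
    with open("annotations.jsonl", encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            text = record["text"]
            for ent in record.get("entities", []):
                span = text[ent["start_offset"]:ent["end_offset"]]
                print(ent["label"], span)
            for rel in record.get("relations", []):
                print(rel.get("type"), rel.get("from_id"), rel.get("to_id"))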

  13. Optimized network structure.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Menaa Nawaz; Jameel Ahmed (2023). Optimized network structure. [Dataset]. http://doi.org/10.1371/journal.pone.0279305.t003
    Explore at:
    xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Menaa Nawaz; Jameel Ahmed
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Optimized network structure.

  14. Performance metrics on the test datasets for the best model, which is...

    • plos.figshare.com
    xls
    Updated Mar 29, 2024
    Cite
    Marjolein Oostrom; Michael A. Muniak; Rogene M. Eichler West; Sarah Akers; Paritosh Pande; Moses Obiri; Wei Wang; Kasey Bowyer; Zhuhao Wu; Lisa M. Bramer; Tianyi Mao; Bobbie Jo M. Webb-Robertson (2024). Performance metrics on the test datasets for the best model, which is trained with 1.0 background weights and rotation augmentations, compared with the original model without fine-tuning. [Dataset]. http://doi.org/10.1371/journal.pone.0293856.t006
    Explore at:
    xls
    Dataset updated
    Mar 29, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Marjolein Oostrom; Michael A. Muniak; Rogene M. Eichler West; Sarah Akers; Paritosh Pande; Moses Obiri; Wei Wang; Kasey Bowyer; Zhuhao Wu; Lisa M. Bramer; Tianyi Mao; Bobbie Jo M. Webb-Robertson
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Std Dev Diff is the standard deviation of the differences of the metrics relative to the original model, and the Avg Diff is the average difference compared to the original model. The trained model had a 1.0 background weight, all layers trainable, a 0.001 learning rate, and rotation augmentation (Rotate).

  15. Privacy-Sensitive Conversations between Care Workers and Care Home Residents...

    • test.researchdata.tuwien.ac.at
    • researchdata.tuwien.ac.at
    • +1more
    bin, text/markdown
    Updated Dec 6, 2024
    + more versions
    Cite
    Reinhard Grabler; Michael Starzinger; Matthias Hirschmanner; Helena Anna Frijns (2024). Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home [Dataset]. http://doi.org/10.70124/hbtq5-ykv92
    Explore at:
    bin, text/markdown
    Dataset updated
    Dec 6, 2024
    Dataset provided by
    TU Wien
    Authors
    Reinhard Grabler; Michael Starzinger; Matthias Hirschmanner; Helena Anna Frijns
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 2024 - Aug 2024
    Description

    Dataset Card for "privacy-care-interactions"


    Dataset Description

    Purpose and Features

    🔒 Collection of Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home 🔒

    The dataset is useful to train and evaluate models to identify and classify privacy-sensitive parts of conversations from text, especially in the context of AI assistants and LLMs.

    Dataset Overview

    Language Distribution 🌍

    • English (en): 95

    Locale Distribution 🌎

    • United States (US) 🇺🇸: 95

    Key Facts 🔑

    • This is synthetic data! Generated using proprietary algorithms - no privacy violations!
    • Conversations are classified following the taxonomy for privacy-sensitive robotics by Rueben et al. (2017).
    • The data was manually labeled by an expert.

    Dataset Structure

    Data Instances

    The provided data format is .jsonl, the JSON Lines text format, also called newline-delimited JSON. An example entry looks as follows.

    { "text": "CW: Have you ever been to Italy? CR: Oh, yes... many years ago.", "taxonomy": 0, "category": 0, "affected_speaker": 1, "language": "en", "locale": "US", "data_type": 1, "uid": 16, "split": "train" }

    Data Fields

    The data fields are:

    • text: a string feature. The speaker abbreviations refer to the care worker (CW) and the care recipient (CR).
    • taxonomy: a classification label, with possible values including informational (0), invasion (1), collection (2), processing (3), dissemination (4), physical (5), personal-space (6), territoriality (7), intrusion (8), obtrusion (9), contamination (10), modesty (11), psychological (12), interrogation (13), psychological-distance (14), social (15), association (16), crowding-isolation (17), public-gaze (18), solitude (19), intimacy (20), anonymity (21), reserve (22). The taxonomy is derived from Rueben et al. (2017). The classifications were manually labeled by an expert.
    • category: a classification label, with possible values including personal-information (0), family (1), health (2), thoughts (3), values (4), acquaintance (5), appointment (6). The privacy category affected in the conversation. The classifications were manually labeled by an expert.
    • affected_speaker: a classification label, with possible values including care-worker (0), care-recipient (1), other (2), both (3). The speaker whose privacy is impacted during the conversation. The classifications were manually labeled by an expert.
    • language: a string feature. Language code as defined by ISO 639.
    • locale: a string feature. Regional code as defined by ISO 3166-1 alpha-2.
    • data_type: a classification label, with possible values including real (0), synthetic (1).
    • uid: an int64 feature. A unique identifier within the dataset.
    • split: a string feature. Either train, validation or test.

    Dataset Splits

    The dataset has 2 subsets:

    • split: with a total of 95 examples split into train, validation and test (70%-15%-15%)
    • unsplit: with a total of 95 examples in a single train split
    name | train | validation | test
    split | 66 | 14 | 15
    unsplit | 95 | n/a | n/a

    The files follow the naming convention subset-split-language.jsonl. The following files are contained in the dataset:

    • split-train-en.jsonl
    • split-validation-en.jsonl
    • split-test-en.jsonl
    • unsplit-train-en.jsonl
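    As a minimal sketch, using only the file names and fields documented above, the training split could be loaded like this:

    import json

    # File names and fields are taken from the dataset description above.
    def load_jsonl(path):
        with open(path, encoding="utf-8") as fh:
            return [json.loads(line) for line in fh]

    train = load_jsonl("split-train-en.jsonl")

    example = train[0]
    print(example["text"])  # conversation snippet (CW = care worker, CR = care recipient)
    print(example["taxonomy"], example["category"], example["affected_speaker"])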

    Dataset Creation

    Curation Rationale

    Recording audio of care workers and residents during care interactions, which includes partial and full body washing, giving of medication, as well as wound care, is a highly privacy-sensitive use case. Therefore, a dataset is created, which includes privacy-sensitive parts of conversations, synthesized from real-world data. This dataset serves as a basis for fine-tuning a local LLM to highlight and classify privacy-sensitive sections of transcripts created in care interactions, to further mask them to protect privacy.

    Source Data

    Initial Data Collection

    The initial data was collected in the project Caring Robots of TU Wien in cooperation with Caritas Wien. One project track aims to use Large Language Models (LLMs) to support the documentation duties of care workers, with LLM-generated summaries of audio recordings of interactions between care workers and care home residents. The initial data are the transcriptions of those care interactions.

    Data Processing

    The transcriptions were thoroughly reviewed, and sections containing privacy-sensitive information were identified and marked by two experts using qualitative data analysis software. Subsequently, the accessible portions of the interviews were translated from German to US English using the locally executed LLM icky/translate. In the next step, another locally run model, llama3.1:70b, was used to synthesize the conversation segments. This process involved generating similar, yet distinct and new, conversations that are not linked to the original data. The dataset was split using the train_test_split function from scikit-learn (https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.train_test_split.html).
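    The 70%-15%-15% split described above can be reproduced in outline with two nested calls to scikit-learn's train_test_split; the random seed and any stratification used by the authors are not stated, so the parameters below are assumptions.

    from sklearn.model_selection import train_test_split

    # Placeholder records; in practice these would be the 95 labeled conversation snippets.
    examples = list(range(95))

    train, rest = train_test_split(examples, test_size=0.30, random_state=42)
    validation, test = train_test_split(rest, test_size=0.50, random_state=42)

    print(len(train), len(validation), len(test))  # 66 14 15 with these sizes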

  16. The global Artificial Intelligence Chip market size is USD 21584.2 million...

    • cognitivemarketresearch.com
    pdf,excel,csv,ppt
    Updated Apr 20, 2024
    Cite
    Cognitive Market Research (2024). The global Artificial Intelligence Chip market size is USD 21584.2 million in 2024. [Dataset]. https://www.cognitivemarketresearch.com/artificial-intelligence-chip-market-report
    Explore at:
    pdf, excel, csv, ppt
    Dataset updated
    Apr 20, 2024
    Dataset authored and provided by
    Cognitive Market Research
    License

    https://www.cognitivemarketresearch.com/privacy-policy

    Time period covered
    2021 - 2033
    Area covered
    Global
    Description

    According to Cognitive Market Research, the global Artificial Intelligence Chip market size will be USD 21584.2 million in 2024. It will expand at a compound annual growth rate (CAGR) of 39.50% from 2024 to 2031.

    North America held the major market share for more than 40% of the global revenue with a market size of USD 8633.68 million in 2024 and will grow at a compound annual growth rate (CAGR) of 37.7% from 2024 to 2031.
    Europe accounted for a market share of over 30% of the global revenue with a market size of USD 6475.26 million.
    Asia Pacific held a market share of around 23% of the global revenue with a market size of USD 4964.37 million in 2024 and will grow at a compound annual growth rate (CAGR) of 41.5% from 2024 to 2031.
    Latin America had a market share of more than 5% of the global revenue with a market size of USD 1079.21 million in 2024 and will grow at a compound annual growth rate (CAGR) of 38.9% from 2024 to 2031.
    Middle East and Africa had a market share of around 2% of the global revenue and was estimated at a market size of USD 431.68 million in 2024 and will grow at a compound annual growth rate (CAGR) of 39.2% from 2024 to 2031.
    The BFSI held the highest Artificial Intelligence Chip market revenue share in 2024.
    

    Market Dynamics of Artificial Intelligence Chip Market

    Key Drivers for Artificial Intelligence Chip Market

    Rapid data growth and computational power demand to Increase the Demand Globally

    A compute-intensive processor is critical for processing AI algorithms: the faster the chip, the more quickly it can process the data necessary to construct an AI system. AI processors are primarily utilized in data centers and high-end servers, because end-user computers are unable to manage such substantial workloads given their limited power and time. AMD provides a series of EPYC processors covering cloud services, data analytics, and visualization. It boasts an Ethernet bandwidth of 8–10 GB and a memory capacity of up to 4 TB. It provides security capabilities, flexibility, and sophisticated I/O integration. Cloud computing, high-performance computing (HPC), and numerous other applications are optimally served by AMD EPYC processors.

    Growing potential of AI-based healthcare tools to Propel Market Growth

    AI improves emergency care monitoring, real-time patient data collecting, and preventative healthcare suggestions. Health and wellness services like mobile apps may track patients' movements using AI. With AI-based tools, in-home health monitoring and information access, personalized health management, and treatment devices like better hearing aids, visual assistive devices, and physical assistive devices like intelligent walkers can be implemented efficiently. Thus, AI-based solutions are being used to improve the physical, emotional, social, and mental health of the elderly globally. Future applications may combine ML, DL, and computer vision for posture detection and geriatric behavior learning.

    Restraint Factor for the Artificial Intelligence Chip Market

    Minimal organized data for AI system development to Limit the Sales

    Training and building a complete and powerful AI system requires data. Earlier, structured datasets were built through manual data entry. The growing digital footprint and technology trends like IoT and Industry 4.0 generate large amounts of data from wearable devices, smart homes, intelligent thermostats, connected cars, IP cameras, smart devices, manufacturing machines, industrial equipment, and other remotely connected devices. Text, audio, and pictures make up this unstructured data. Without an organized internal structure, developers cannot extract relevant data from it. Training machine learning tools requires high-quality labelled data and skilled human trainers, and time and skill are needed to extract and label unstructured data. Structured data is therefore essential for AI system development, and companies are using semi-structured data to get insights from groupings.

    Impact of Covid-19 on the Artificial Intelligence Chip Market

    The long-term impact of the initial outbreak has been beneficial, despite the disruptions to the supply chain and manufacturing delays. The pandemic has expedited the process of AI adoption in a variety of industries, such as healthcare, retail, and manufacturing. The demand for AI processors was driven by the heightened necessity for automation, remote monitoring, and data and analytics. ...

  17. FSDnoisy18k

    • data.niaid.nih.gov
    • paperswithcode.com
    • +2more
    Updated Jan 24, 2020
    Cite
    FSDnoisy18k [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2529933
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Xavier Favory
    Manoj Plakal
    Mercedes Collado
    Xavier Serra
    Frederic Font
    Eduardo Fonseca
    Daniel P. W. Ellis
    Description

    FSDnoisy18k is an audio dataset collected with the aim of fostering the investigation of label noise in sound event classification. It contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data.

    Data curators

    Eduardo Fonseca and Mercedes Collado

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    Citation

    If you use this dataset or part of it, please cite the following ICASSP 2019 paper:

    Eduardo Fonseca, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory, and Xavier Serra, “Learning Sound Event Classifiers from Web Audio with Noisy Labels”, arXiv preprint arXiv:1901.01189, 2019

    You can also consider citing our ISMIR 2017 paper that describes the Freesound Annotator, which was used to gather the manual annotations included in FSDnoisy18k:

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, “Freesound Datasets: A Platform for the Creation of Open Audio Datasets”, In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    FSDnoisy18k description

    What follows is a summary of the most basic aspects of FSDnoisy18k. For a complete description of FSDnoisy18k, make sure to check:

    the FSDnoisy18k companion site: http://www.eduardofonseca.net/FSDnoisy18k/

    the description provided in Section 2 of our ICASSP 2019 paper

    FSDnoisy18k is an audio dataset collected with the aim of fostering the investigation of label noise in sound event classification. It contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data.

    The source of audio content is Freesound, a sound sharing site created and maintained by the Music Technology Group, hosting over 400,000 clips uploaded by its community of users, who additionally provide some basic metadata (e.g., tags and title). The 20 classes of FSDnoisy18k are drawn from the AudioSet Ontology and are selected based on data availability as well as on their suitability to allow the study of label noise. The 20 classes are: "Acoustic guitar", "Bass guitar", "Clapping", "Coin (dropping)", "Crash cymbal", "Dishes, pots, and pans", "Engine", "Fart", "Fire", "Fireworks", "Glass", "Hi-hat", "Piano", "Rain", "Slam", "Squeak", "Tearing", "Walk, footsteps", "Wind", and "Writing". FSDnoisy18k was created with the Freesound Annotator, which is a platform for the collaborative creation of open audio datasets.

    We defined a clean portion of the dataset consisting of correct and complete labels. The remaining portion is referred to as the noisy portion. Each clip in the dataset has a single ground truth label (singly-labeled data).

    The clean portion of the data consists of audio clips whose labels are rated as present in the clip and predominant (almost all with full inter-annotator agreement), meaning that the label is correct and, in most cases, there is no additional acoustic material other than the labeled class. A few clips may contain some additional sound events, but they occur in the background and do not belong to any of the 20 target classes. This is more common for some classes that rarely occur alone, e.g., “Fire”, “Glass”, “Wind” or “Walk, footsteps”.

    The noisy portion of the data consists of audio clips that received no human validation. In this case, they are categorized on the basis of the user-provided tags in Freesound. Hence, the noisy portion features a certain amount of label noise.

    Code

    We've released the code for our ICASSP 2019 paper at https://github.com/edufonseca/icassp19. The framework comprises all the basic stages: feature extraction, training, inference and evaluation. After loading the FSDnoisy18k dataset, log-mel energies are computed and a CNN baseline is trained and evaluated. The code also allows testing four noise-robust loss functions. Please check our paper for more details.

    Label noise characteristics

    FSDnoisy18k features real label noise that is representative of audio data retrieved from the web, particularly from Freesound. The analysis of a per-class, random, 15% of the noisy portion of FSDnoisy18k revealed that roughly 40% of the analyzed labels are correct and complete, whereas 60% of the labels show some type of label noise. Please check the FSDnoisy18k companion site for a detailed characterization of the label noise in the dataset, including a taxonomy of label noise for singly-labeled data as well as a per-class description of the label noise.

    FSDnoisy18k basic characteristics

    The dataset most relevant characteristics are as follows:

    FSDnoisy18k contains 18,532 audio clips (42.5h) unequally distributed in the 20 aforementioned classes drawn from the AudioSet Ontology.

    The audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.

    The audio clips are of variable length ranging from 300ms to 30s, and each clip has a single ground truth label (singly-labeled data).

    The dataset is split into a test set and a train set. The test set is drawn entirely from the clean portion, while the remainder of data forms the train set.

    The train set is composed of 17,585 clips (41.1h) unequally distributed among the 20 classes. It features a clean subset and a noisy subset. In terms of number of clips their proportion is 10%/90%, whereas in terms of duration the proportion is slightly more extreme (6%/94%). The per-class percentage of clean data within the train set is also imbalanced, ranging from 6.1% to 22.4%. The number of audio clips per class ranges from 51 to 170, and from 250 to 1000 in the clean and noisy subsets, respectively. Further, a noisy small subset is defined, which includes an amount of (noisy) data comparable (in terms of duration) to that of the clean subset.

    The test set is composed of 947 clips (1.4h) that belong to the clean portion of the data. Its class distribution is similar to that of the clean subset of the train set. The number of per-class audio clips in the test set ranges from 30 to 72. The test set enables a multi-class classification problem.

    FSDnoisy18k is an expandable dataset that features a per-class varying degree of types and amount of label noise. The dataset allows investigation of label noise as well as other approaches, from semi-supervised learning, e.g., self-training to learning with minimal supervision.

    License

    FSDnoisy18k has licenses at two different levels, as explained next. All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. In particular, all Freesound clips included in FSDnoisy18k are released under either CC-BY or CC0. For attribution purposes and to facilitate attribution of these files to third parties, we include a relation of audio clips and their corresponding license in the LICENSE-INDIVIDUAL-CLIPS file downloaded with the dataset.

    In addition, FSDnoisy18k as a whole is the result of a curation process and it has an additional license. FSDnoisy18k is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the dataset.

    Files

    FSDnoisy18k can be downloaded as a series of zip files with the following directory structure:

    root
    ├── FSDnoisy18k.audio_train/          Audio clips in the train set
    ├── FSDnoisy18k.audio_test/           Audio clips in the test set
    ├── FSDnoisy18k.meta/                 Files for evaluation setup
    │   ├── train.csv                     Data split and ground truth for the train set
    │   └── test.csv                      Ground truth for the test set
    ├── FSDnoisy18k.doc/
    │   ├── README.md                     The dataset description file that you are reading
    │   └── LICENSE-DATASET               License of the FSDnoisy18k dataset as an entity
    └── LICENSE-INDIVIDUAL-CLIPS.csv      Licenses of the individual audio clips from Freesound

    Each row (i.e. audio clip) of the train.csv file contains the following information:

    fname: the file name

    label: the audio classification label (ground truth)

    aso_id: the id of the corresponding category as per the AudioSet Ontology

    manually_verified: Boolean (1 or 0) flag to indicate whether the clip belongs to the clean portion (1), or to the noisy portion (0) of the train set

    noisy_small: Boolean (1 or 0) flag to indicate whether the clip belongs to the noisy_small portion (1) of the train set

    Each row (i.e. audio clip) of the test.csv file contains the following information:

    fname: the file name

    label: the audio classification label (ground truth)

    aso_id: the id of the corresponding category as per the AudioSet Ontology
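    A minimal sketch of working with these metadata files, using only the paths and column names documented above:

    import pandas as pd

    # Paths follow the FSDnoisy18k.meta/ layout shown in the directory structure above.
    train = pd.read_csv("FSDnoisy18k.meta/train.csv")
    test = pd.read_csv("FSDnoisy18k.meta/test.csv")

    # Clean portion of the train set (manually verified labels) and the noisy_small subset.
    clean = train[train["manually_verified"] == 1]
    noisy_small = train[train["noisy_small"] == 1]

    print(len(train), len(clean), len(noisy_small), len(test))
    print(train["label"].value_counts().head())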

    Links

    • Source code for our preprint: https://github.com/edufonseca/icassp19
    • Freesound Annotator: https://annotator.freesound.org/
    • Freesound: https://freesound.org
    • Eduardo Fonseca's personal website: http://www.eduardofonseca.net/

    Acknowledgments

    This work is partially supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688382 AudioCommons. Eduardo Fonseca is also sponsored by a Google Faculty Research Award 2017. We thank everyone who contributed to FSDnoisy18k with annotations.

  18. Sample 8 rows of ADE-TABLE.

    • plos.figshare.com
    xlsx
    Updated Sep 11, 2024
    Cite
    Shuntaro Yada; Tomohiro Nishiyama; Shoko Wakamiya; Yoshimasa Kawazoe; Shungo Imai; Satoko Hori; Eiji Aramaki (2024). Sample 8 rows of ADE-TABLE. [Dataset]. http://doi.org/10.1371/journal.pone.0310432.s001
    Explore at:
    xlsx (available download formats)
    Dataset updated
    Sep 11, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Shuntaro Yada; Tomohiro Nishiyama; Shoko Wakamiya; Yoshimasa Kawazoe; Shungo Imai; Satoko Hori; Eiji Aramaki
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Surface expressions derived from private in-hospital texts in Japanese (TEXT). For readability, we have used their English translations. These surface expressions, which are often symptom-like phrases, are categorized as parent diseases and labeled with corresponding ICD codes. We further label each surface expression with its relevance to frequent adverse effects of anti-cancer drugs, such as stomatitis, peripheral neuropathy, and hand-foot syndrome. The original table has eight relevance labels containing binary values (1 if relevant and 0 otherwise). The samples were randomly selected from entries relevant to stomatitis or peripheral neuropathy. (XLSX)
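
    As a hedged sketch of how such a table could be used, the spreadsheet can be read with pandas and filtered by the binary relevance labels (the file name and column names below are hypothetical, for illustration only; the released file may use different headers):

    import pandas as pd

    # Hypothetical file and column names, for illustration only
    ade = pd.read_excel("ADE-TABLE-sample.xlsx")

    # Suppose two of the eight binary relevance columns are named
    # "stomatitis" and "peripheral_neuropathy" (1 = relevant, 0 = not)
    stomatitis_rows = ade[ade["stomatitis"] == 1]
    neuropathy_rows = ade[ade["peripheral_neuropathy"] == 1]
    print(len(stomatitis_rows), len(neuropathy_rows))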

  19. OSM buildings noisy labels dataset

    • zenodo.org
    • explore.openaire.eu
    • +1more
    zip
    Updated Apr 27, 2022
    + more versions
    Cite
    Jonas Gütter; Jonas Gütter (2022). OSM buildings noisy labels dataset [Dataset]. http://doi.org/10.5281/zenodo.6477788
    Explore at:
    zip (available download formats)
    Dataset updated
    Apr 27, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jonas Gütter; Jonas Gütter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains tile imagery from the OpenStreetMap project alongside label masks for buildings from OpenStreetMap. Besides the original clean label set, additional noisy label sets for random noise, removed and added buildings are provided.

    The purpose of this dataset is to provide training data for analysing the impact of noisy labels on the performance of models for semantic segmentation in Earth observation.
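
    As a rough illustration of the three noise regimes, the sketch below perturbs a binary building mask (1 = building) in the spirit of random noise, removed buildings, and added buildings. It is a hypothetical example, not the released create_noisy_labels.py:

    import numpy as np
    from scipy import ndimage

    def make_noisy_label(mask, mode, rng, p=0.05):
        """Illustrative noise injection for a binary building mask (hypothetical sketch)."""
        noisy = mask.copy()
        if mode == "random":
            flip = rng.random(mask.shape) < p            # flip a fraction p of all pixels
            noisy[flip] = 1 - noisy[flip]
        elif mode == "removed":
            instances, n = ndimage.label(mask)           # connected components = buildings
            drop = rng.random(n + 1) < p                 # choose instances to delete
            drop[0] = False                              # never touch the background label
            noisy[drop[instances]] = 0
        elif mode == "added":
            for _ in range(max(1, int(p * 20))):         # paste a few fake 10x10 footprints
                r = rng.integers(0, mask.shape[0] - 10)
                c = rng.integers(0, mask.shape[1] - 10)
                noisy[r:r + 10, c:c + 10] = 1
        return noisy

    rng = np.random.default_rng(0)
    clean = np.zeros((256, 256), dtype=np.uint8)
    clean[100:140, 80:130] = 1                           # one toy building
    noisy = make_noisy_label(clean, "removed", rng, p=0.5)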

    The code for downloading and creating the datasets, as well as for performing some preliminary analyses, is also provided; however, access to a tile server from which OpenStreetMap tiles can be downloaded in sufficient quantities is required.

    To reproduce the dataset and perform analysis on it, do the following:

    • unzip data.zip and code.zip
    • create the folder structure from data
    • Build and activate a python environment from environment.yml
    • Insert the URL of a suitable tile server for OSM tiles in line 76 of utils.py
    • Execute download_OSM_dataset.py to download OSM image tiles alongside OSM labels
    • Execute create_noisy_labels.py for the OSM dataset to create noisy label sets
    • Divide the images and labels into train and test data. split_data.py can be used as a baseline for this, but pathnames have to be adjusted and the corresponding directories have to be created first.
    • Call train_model.py to train a model on the data. Specify the data size and the label set by giving command line arguments as shown in train_model.sh

  20. Data from: Long-lasting vocal plasticity in adult marmoset monkeys

    • datadryad.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Jun 7, 2019
    Cite
    Long-lasting vocal plasticity in adult marmoset monkeys [Dataset]. https://datadryad.org/stash/dataset/doi:10.5061/dryad.7nq1c6s
    Explore at:
    zip (available download formats)
    Dataset updated
    Jun 7, 2019
    Dataset provided by
    Dryad
    Authors
    Lingyun Zhao; Bahar Boroumand Rad; Xiaoqin Wang
    Time period covered
    2019
    Description

    Marmoset vocalization data: JammingAllData.mat contains a MATLAB structure array “SbjData” which stores the experimental data for the four subjects. Each row in “SbjData” is a subject. The first column is for low-frequency noise perturbation (and the baseline preceding it), whereas the second column is for high-frequency noise perturbation (and the baseline preceding it). “SbjData.Param” stores various parameters used in the analysis and the experiment info; for example, “exp_day” contains the day number for each experimental session. Perturbation sessions are labeled as “Jamming” in the data structures. “SbjData.Measure.FreqPrePostWin” stores the actual data, as described below for its several fields. “Jamming” and “Baseline” contain the original recorded data for the perturbation and baseline sessions, respectively. “JammingDT” and “BaselineDT” are the detrended data. Inside these matrices, Column 1 labels separate sessions; Column 2 is the call onset time (sec); Column 4 is the fundame...
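
    A hedged sketch of reading this structure from Python (assuming the .mat file is not in the v7.3/HDF5 format, so that scipy.io.loadmat can open it; the exact nesting may differ from what is shown here):

    from scipy.io import loadmat

    # Load MATLAB structs as attribute-style objects
    data = loadmat("JammingAllData.mat", squeeze_me=True, struct_as_record=False)
    sbj = data["SbjData"]                 # rows: subjects; columns: low/high-frequency condition

    low_freq = sbj[0, 0]                  # subject 1, low-frequency noise perturbation
    measure = low_freq.Measure.FreqPrePostWin
    jam_detrended = measure.JammingDT     # detrended data from the perturbation sessions

    session_id = jam_detrended[:, 0]      # "Column 1": session label
    onset_sec = jam_detrended[:, 1]       # "Column 2": call onset time (s)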
