100+ datasets found
  1. Small Data Subset Dataset

    • universe.roboflow.com
    zip
    Updated Jul 26, 2023
    Cite
    Summer Project 2 (2023). Small Data Subset Dataset [Dataset]. https://universe.roboflow.com/summer-project-2/small-data-subset
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 26, 2023
    Dataset authored and provided by
    Summer Project 2
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Faces Bounding Boxes
    Description

    Small Data Subset

    ## Overview
    
    Small Data Subset is a dataset for object detection tasks - it contains Faces annotations for 215 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License

    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  2. Data subset summary.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jan 24, 2022
    Cite
    Berman, David M.; Gooding, Robert J.; Davey, Scott K.; Garven, Andrew; Ghaedi, Hamid; Sangster, Ami G. (2022). Data subset summary. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000323776
    Explore at:
    Dataset updated
    Jan 24, 2022
    Authors
    Berman, David M.; Gooding, Robert J.; Davey, Scott K.; Garven, Andrew; Ghaedi, Hamid; Sangster, Ami G.
    Description

    This supplementary table contains a data summary that breaks down the number of mutations and their DDR and/or CM classification. There is a summary for each data subset: Least Conservative (High and Moderate), Least Conservative (High), Mid Conservative (High and Moderate) and Most Conservative (High and Moderate). (XLSX)

  3. AORC Subset

    • hydroshare.org
    • beta.hydroshare.org
    zip
    Updated Dec 6, 2023
    Cite
    Ayman Nassar; David Tarboton; Anthony M. Castronova (2023). AORC Subset [Dataset]. https://www.hydroshare.org/resource/c1bce473fff641d7a678565af9785c31
    Explore at:
    Available download formats: zip (28.3 KB)
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    HydroShare
    Authors
    Ayman Nassar; David Tarboton; Anthony M. Castronova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2010 - Dec 31, 2019
    Description

    The objective of this HydroShare resource is to query AORC v1.0 Forcing data stored on HydroShare's Thredds server and create a subset of this dataset for a designated watershed and timeframe. The user is prompted to define a temporal frame of interest, which specifies the start and end dates for the data subset. Additionally, the user is prompted to define a spatial frame of interest, which could be a bounding box or a shapefile, to subset the data spatially.

    Before the subsetting is performed, data is queried, and geospatial metadata is added to ensure that the data is correctly aligned with its corresponding location on the Earth's surface. To achieve this, two separate notebooks were created, which explain in detail how to query the dataset and how to add geospatial metadata to AORC v1.0 data, respectively. In this notebook, we call functions from the AORC.py script to perform these preprocessing steps, resulting in a cleaner notebook that focuses solely on the subsetting process.
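
    The temporal and bounding-box subsetting described above can be illustrated with a minimal sketch. It does not use the AORC.py script or the Thredds server; the `GridRecord` type, the `subset` function, and the sample values are hypothetical and exist only to show the filtering logic on in-memory records.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class GridRecord:
    """One hypothetical forcing value at a grid cell and time step."""
    time: datetime
    lat: float
    lon: float
    value: float


def subset(records, start, end, bbox):
    """Keep records inside [start, end] and inside the bounding box.

    bbox is (min_lon, min_lat, max_lon, max_lat); all bounds are inclusive.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    return [
        r for r in records
        if start <= r.time <= end
        and min_lat <= r.lat <= max_lat
        and min_lon <= r.lon <= max_lon
    ]


# Made-up records: two fall inside the period and the box, two do not.
records = [
    GridRecord(datetime(2010, 1, 1), 41.7, -111.8, 0.2),
    GridRecord(datetime(2015, 6, 1), 41.7, -111.8, 1.4),
    GridRecord(datetime(2015, 6, 1), 35.0, -100.0, 0.9),  # outside the box
    GridRecord(datetime(2020, 1, 1), 41.7, -111.8, 0.0),  # outside the period
]
selected = subset(
    records,
    start=datetime(2010, 1, 1),
    end=datetime(2019, 12, 31),
    bbox=(-112.5, 41.0, -111.0, 42.5),
)
```

    A shapefile-based subset would replace the bounding-box test with a point-in-polygon test, which this sketch does not attempt.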

  4. MIMIC-III Clinical Database CareVue subset

    • physionet.org
    Updated Sep 21, 2022
    Cite
    Alistair Johnson; Tom Pollard; Roger Mark (2022). MIMIC-III Clinical Database CareVue subset [Dataset]. http://doi.org/10.13026/8a4q-w170
    Explore at:
    Dataset updated
    Sep 21, 2022
    Authors
    Alistair Johnson; Tom Pollard; Roger Mark
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    MIMIC-III is a database of critically ill patients admitted to an intensive care unit (ICU) at the Beth Israel Deaconess Medical Center (BIDMC) in Boston, MA. MIMIC-III has seen broad use, and was updated with the release of MIMIC-IV. MIMIC-IV contains more contemporaneous stays, higher granularity data, and expanded domains of information. To maximize the sample size of MIMIC-IV, the database overlaps with MIMIC-III, and specifically both databases contain the same admissions which occurred between 2008 and 2012. This overlap complicates analyses of the two databases simultaneously. Here we provide a subset of MIMIC-III containing patients who are not in MIMIC-IV. The goal of this project is to simplify the combination of MIMIC-III with MIMIC-IV.
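
    Conceptually, "patients in MIMIC-III who are not in MIMIC-IV" is a set difference. The sketch below shows that idea only. How patients are actually matched between the two databases is not described here, so it assumes a shared identifier; the function name and the identifiers are made up.

```python
def only_in_first(first_ids, second_ids):
    """Return the sorted identifiers that occur in first_ids but not in second_ids."""
    return sorted(set(first_ids) - set(second_ids))


# Made-up identifiers, assuming both databases share one identifier scheme.
mimic_iii_ids = [101, 102, 103, 104]
mimic_iv_ids = [103, 104, 105]

non_overlapping = only_in_first(mimic_iii_ids, mimic_iv_ids)
```

    Combining the non-overlapping subset with MIMIC-IV then avoids counting the shared admissions twice.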

  5. finetune-data-28fee8943227

    • huggingface.co
    Updated Aug 4, 2023
    + more versions
    Cite
    Subset Data, Inc. (2023). finetune-data-28fee8943227 [Dataset]. https://huggingface.co/datasets/subset-data/finetune-data-28fee8943227
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 4, 2023
    Dataset authored and provided by
    Subset Data, Inc.
    Description

    Dataset Card for "finetune-data-28fee8943227"

    More Information needed

  6. CELEX Dutch lexical database - Syntax Subset

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Oct 5, 2005
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2005). CELEX Dutch lexical database - Syntax Subset [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-L0029_06/
    Explore at:
    Dataset updated
    Oct 5, 2005
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    The Dutch CELEX data is derived from R.H. Baayen, R. Piepenbrock & L. Gulikers, The CELEX Lexical Database (CD-ROM), Release 2, Dutch Version 3.1, Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, 1995.

    Apart from orthographic features, the CELEX database comprises representations of the phonological, morphological, syntactic and frequency properties of lemmata. For the Dutch data, frequencies have been disambiguated on the basis of the 42.4m Dutch Instituut voor Nederlandse Lexicologie text corpora.

    To make for greater compatibility with other operating systems, the databases have not been tailored to fit any particular database management program. Instead, the information is presented in a series of plain ASCII files, which can be queried with tools such as AWK and ICON. Unique identity numbers allow the linking of information from different files.

    This database can be divided into different subsets:
    · orthography: with or without diacritics, with or without word division positions, alternative spellings, number of letters/syllables;
    · phonology: phonetic transcriptions with syllable boundaries or primary and secondary stress markers, consonant-vowel patterns, number of phonemes/syllables, alternative pronunciations, frequency per phonetic syllable within words;
    · morphology: division into stems and affixes, flat or hierarchical representations, stems and their inflections;
    · syntax: word class, subcategorisations per word class;
    · frequency of the entries: disambiguated for homographic lemmata.

  7. finetune-data-e4da7017fcce

    • huggingface.co
    Updated Aug 4, 2023
    Cite
    Subset Data, Inc. (2023). finetune-data-e4da7017fcce [Dataset]. https://huggingface.co/datasets/subset-data/finetune-data-e4da7017fcce
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 4, 2023
    Dataset authored and provided by
    Subset Data, Inc.
    Description

    Dataset Card for "finetune-data-e4da7017fcce"

    More Information needed

  8. finetune-data-1215cfd29a6d

    • huggingface.co
    Updated Aug 4, 2023
    + more versions
    Cite
    Subset Data, Inc. (2023). finetune-data-1215cfd29a6d [Dataset]. https://huggingface.co/datasets/subset-data/finetune-data-1215cfd29a6d
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 4, 2023
    Dataset authored and provided by
    Subset Data, Inc.
    Description

    Dataset Card for "finetune-data-1215cfd29a6d"

    More Information needed

  9. Storms dataset, a subset of the NOAA Dataset

    • kaggle.com
    Updated Nov 27, 2023
    Cite
    Md. Zubayer (2023). Storms dataset, a subset of the NOAA Dataset [Dataset]. https://www.kaggle.com/datasets/mdzubayer/storms-dataset-a-subset-of-the-noaa-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 27, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Md. Zubayer
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    The Storms dataset, a subset of the NOAA (National Oceanic and Atmospheric Administration) Atlantic hurricane database best track data, encompasses information about tropical storms measured at different time points over the years. The dataset contains 13 variables, including:
    · name: The name of the tropical storm.
    · year: The year in which the storm occurred.
    · month: The month in which the storm occurred.
    · day: The day on which the storm occurred.
    · hour: The hour at which the storm was recorded.
    · lat: Latitude coordinates of the storm.
    · long: Longitude coordinates of the storm.
    · status: The status of the storm (e.g., tropical depression, tropical storm, hurricane).
    · category: The category of the storm.
    · wind: Wind speed associated with the storm.
    · pressure: Atmospheric pressure associated with the storm.
    · tropicalstorm_force_diameter: Diameter of tropical storm force winds.
    · hurricane_force_diameter: Diameter of hurricane-force winds.
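
    The 13 variables listed above can be pictured as one record type. The sketch below is illustrative only: the field names come from the description, but the Python types, the use of `Optional` for possibly missing values, and the sample observation are assumptions, not something the dataset page states.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class StormObservation:
    """One row of the Storms dataset; all types here are assumed."""
    name: str
    year: int
    month: int
    day: int
    hour: int
    lat: float
    long: float
    status: str
    category: Optional[int]  # assumed to be absent for some rows
    wind: float
    pressure: float
    tropicalstorm_force_diameter: Optional[float]  # assumed to be absent for some rows
    hurricane_force_diameter: Optional[float]  # assumed to be absent for some rows


# Column names in declaration order, e.g. for checking a CSV header.
column_names = [f.name for f in fields(StormObservation)]

# A made-up observation, not taken from the dataset.
example = StormObservation(
    name="Example", year=2000, month=9, day=1, hour=12,
    lat=25.0, long=-70.0, status="hurricane", category=1,
    wind=65.0, pressure=985.0,
    tropicalstorm_force_diameter=None, hurricane_force_diameter=None,
)
```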

  10. AOE analysis subset of the Arthropod Easy Capture (AEC) database

    • hub.arcgis.com
    • figshare.com
    • +2more
    Updated Jun 1, 2016
    Cite
    University of California, Santa Barbara (2016). AOE analysis subset of the Arthropod Easy Capture (AEC) database [Dataset]. https://hub.arcgis.com/datasets/890dc2eae81b41ae9b1562339137248f
    Explore at:
    Dataset updated
    Jun 1, 2016
    Dataset authored and provided by
    University of California, Santa Barbara
    Description

    Based on the default parameters used in the analysis, the entire AOE database available through figshare (doi: 10.6084/m9.figshare.2060979), represents a subset of the AMNH instance of the AEC database, which includes additional tables to capture host plant data and host analysis.

    1) Miridae subFamily(id) = Mirinae(id:8150), Orthotylinae(id:6294), Phylinae(id:6295), Deraeocorinae(id:8163) from AEC database sql.
    2) geographic range: North America Country.UID = Canada(id:2), Mexico(id:8), USA(id:11)
    3) complete plant host analysis
    4) cleaned plant host data

  11. autotrain-data-03e895593c12

    • huggingface.co
    Updated Aug 2, 2023
    + more versions
    Cite
    Subset Data, Inc. (2023). autotrain-data-03e895593c12 [Dataset]. https://huggingface.co/datasets/subset-data/autotrain-data-03e895593c12
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 2, 2023
    Dataset authored and provided by
    Subset Data, Inc.
    Description

    Dataset Card for "autotrain-data-03e895593c12"

    More Information needed

  12. Subset of Data Citation Corpus version 4

    • kaggle.com
    zip
    Updated Aug 14, 2025
    Cite
    RodericD.M.Page (2025). Subset of Data Citation Corpus version 4 [Dataset]. https://www.kaggle.com/datasets/rdmpage/subset-of-data-citation-corpus-version-4
    Explore at:
    Available download formats: zip (59591902 bytes)
    Dataset updated
    Aug 14, 2025
    Authors
    RodericD.M.Page
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is a subset of version 4.0 of the Data Citation Corpus. It contains article_ids as cleaned DOIs, dataset ids (e.g., accession numbers, DOIs) and the name of the repository of the data (e.g., Dryad, European Nucleotide Archive). It was extracted from the file 2025-07-27-data-citation-corpus-01-v4.0.json which is one of 11 JSONL files in the corpus.

  13. ethic-subset-data

    • huggingface.co
    Updated Nov 1, 2024
    Cite
    ethicsadvisorproject (2024). ethic-subset-data [Dataset]. https://huggingface.co/datasets/ethicsadvisorproject/ethic-subset-data
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 1, 2024
    Authors
    ethicsadvisorproject
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    ethicsadvisorproject/ethic-subset-data dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. Data from: Database used for the evaluation of data used to identify...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 21, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Database used for the evaluation of data used to identify groundwater sources under the direct influence of surface water in Pennsylvania [Dataset]. https://catalog.data.gov/dataset/database-used-for-the-evaluation-of-data-used-to-identify-groundwater-sources-under-the-di
    Explore at:
    Dataset updated
    Nov 21, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Pennsylvania
    Description

    The U.S. Geological Survey (USGS), in cooperation with the Pennsylvania Department of Environmental Protection (PADEP), conducted an evaluation of data used by the PADEP to identify groundwater sources under the direct influence of surface water (GUDI) in Pennsylvania (Gross and others, 2022). The data used in this evaluation and the processes used to compile them from multiple sources are described and provided herein. Data were compiled primarily but not exclusively from PADEP resources, including (1) source-information for public water-supply systems and Microscopic Particulate Analysis (MPA) results for public water-supply system groundwater sources from the agency’s Pennsylvania Drinking Water Information System (PADWIS) database (Pennsylvania Department of Environmental Protection, 2016), and (2) results associated with MPA testing from the PADEP Bureau of Laboratories (BOL) files and water-quality analyses obtained from the PADEP BOL, Sample Information System (Pennsylvania Department of Environmental Protection, written commun., various dates). Information compiled from sources other than the PADEP includes anthropogenic (land cover and PADEP region) and naturogenic (geologic and physiographic, hydrologic, soil characterization, and topographic) spatial data. Quality control (QC) procedures were applied to the PADWIS database to verify spatial coordinates, verify collection type information, exclude sources not designated as wells, and verify or remove values that were either obvious errors or populated as zero rather than as “no data.” The QC process reduced the original PADWIS dataset to 12,147 public water-supply system wells (hereafter referred to as the PADWIS database). 

    An initial subset of the PADWIS database, termed the PADWIS database subset, was created to include 4,018 public water-supply system community wells that have undergone the Surface Water Identification Protocol (SWIP), a protocol used by the PADEP to classify sources as GUDI or non-GUDI (Gross and others, 2022). A second subset of the PADWIS database, termed the MPA database subset, represents MPA results for 631 community and noncommunity wells and includes water-quality data (alkalinity, chloride, Escherichia coli, fecal coliform, nitrate, pH, sodium, specific conductance, sulfate, total coliform, total dissolved solids, total residue, and turbidity) associated with groundwater-quality samples typically collected concurrently with the MPA sample. The PADWIS database and two subsets (PADWIS database subset and MPA database subset) are compiled in a single data table (DR_2022_Table.xlsx), with the two subsets differentiated using attributes that are defined in the associated metadata table (DR_2022_Metadata_Table_Variables.xlsx). This metadata file (DR_2022_Metadata.xml) describes data resources, data compilation, and QC procedures in greater detail.

  15. Titanic subset

    • kaggle.com
    zip
    Updated May 11, 2017
    + more versions
    Cite
    Rajaganapathy M (2017). Titanic subset [Dataset]. https://www.kaggle.com/datasets/rganapathy/titanic-subset
    Explore at:
    Available download formats: zip (22548 bytes)
    Dataset updated
    May 11, 2017
    Authors
    Rajaganapathy M
    Description

    Context

    There's a story behind every dataset and here's your opportunity to share yours.

    Content

    What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too.

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  16. finetune-data-5bb8b9feb9b9

    • huggingface.co
    Updated Aug 20, 2023
    Cite
    Subset Data, Inc. (2023). finetune-data-5bb8b9feb9b9 [Dataset]. https://huggingface.co/datasets/subset-data/finetune-data-5bb8b9feb9b9
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 20, 2023
    Dataset authored and provided by
    Subset Data, Inc.
    Description

    Dataset Card for "finetune-data-5bb8b9feb9b9"

    More Information needed

  17. Open Data Subset

    • find.data.gov.scot
    json
    Updated Jan 14, 2019
    Cite
    Smartline (uSmart) (2019). Open Data Subset [Dataset]. https://find.data.gov.scot/datasets/39420
    Explore at:
    Available download formats: json
    Dataset updated
    Jan 14, 2019
    Dataset provided by
    Smartline (uSmart)
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Open Data Subset

  18. Data from: MIMIC-IV-ECG: Diagnostic Electrocardiogram Matched Subset

    • registry.opendata.aws
    • physionet.org
    Updated Dec 19, 2024
    + more versions
    Cite
    PhysioNet (2024). MIMIC-IV-ECG: Diagnostic Electrocardiogram Matched Subset [Dataset]. https://registry.opendata.aws/mimic-iv-ecg/
    Explore at:
    Dataset updated
    Dec 19, 2024
    Dataset provided by
    PhysioNet (https://physionet.org/)
    Description

    The MIMIC-IV-ECG module contains approximately 800,000 diagnostic electrocardiograms across nearly 160,000 unique patients. These diagnostic ECGs use 12 leads and are 10 seconds in length. They are sampled at 500 Hz. This subset contains all of the ECGs for patients who appear in the MIMIC-IV Clinical Database. When a cardiologist report is available for a given ECG, we provide the needed information to link the waveform to the report. The patients in MIMIC-IV-ECG have been matched against the MIMIC-IV Clinical Database, making it possible to link to information across the MIMIC-IV modules.

  19. AURORA Project database - Subset of SpeechDat-Car - Spanish database -...

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Aug 16, 2017
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). AURORA Project database - Subset of SpeechDat-Car - Spanish database - Evaluation Package [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-AURORA-CD0003_02/
    Explore at:
    Dataset updated
    Aug 16, 2017
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The Aurora project was originally set up to establish a world wide standard for the feature extraction software which forms the core of the front-end of a DSR (Distributed Speech Recognition) system. ETSI formally adopted this activity as work items 007 and 008.

    The two work items within ETSI are:
    - ETSI DES/STQ WI007: Distributed Speech Recognition - Front-End Feature Extraction Algorithm & Compression Algorithm
    - ETSI DES/STQ WI008: Distributed Speech Recognition - Advanced Feature Extraction Algorithm

    This database is a subset of the SpeechDat-Car database in Spanish language which has been collected as part of the European Union funded SpeechDat-Car project. It contains isolated and connected Spanish digits spoken in the following noise and driving conditions inside a car:
    1. Quiet environment. Stop motor running.
    2. Low noise. Town traffic + low speed rough road.
    3. High noise: High speed good road.

  20. Curated Breast Imaging Subset of Digital Database for Screening Mammography

    • cancerimagingarchive.net
    csv, dicom, n/a
    Updated Sep 14, 2017
    + more versions
    Cite
    The Cancer Imaging Archive (2017). Curated Breast Imaging Subset of Digital Database for Screening Mammography [Dataset]. http://doi.org/10.7937/K9/TCIA.2016.7O02S9CY
    Explore at:
    Available download formats: csv, dicom, n/a
    Dataset updated
    Sep 14, 2017
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    Sep 14, 2017
    Dataset funded by
    National Cancer Institute (http://www.cancer.gov/)
    Description

    This CBIS-DDSM (Curated Breast Imaging Subset of DDSM) is an updated and standardized version of the Digital Database for Screening Mammography (DDSM). The DDSM is a database of 2,620 scanned film mammography studies. It contains normal, benign, and malignant cases with verified pathology information. The scale of the database along with ground truth validation makes the DDSM a useful tool in the development and testing of decision support systems. The CBIS-DDSM collection includes a subset of the DDSM data selected and curated by a trained mammographer. The images have been decompressed and converted to DICOM format. Updated ROI segmentation and bounding boxes, and pathologic diagnosis for training data are also included. A manuscript describing how to use this dataset in detail is available at https://www.nature.com/articles/sdata2017177.

    Published research results from work in developing decision support systems in mammography are difficult to replicate due to the lack of a standard evaluation data set; most computer-aided diagnosis (CADx) and detection (CADe) algorithms for breast cancer in mammography are evaluated on private data sets or on unspecified subsets of public databases. Few well-curated public datasets have been provided for the mammography community. These include the DDSM, the Mammographic Imaging Analysis Society (MIAS) database, and the Image Retrieval in Medical Applications (IRMA) project. Although these public data sets are useful, they are limited in terms of data set size and accessibility.

    For example, most researchers using the DDSM do not leverage all its images for a variety of historical reasons. When the database was released in 1997, computational resources to process hundreds or thousands of images were not widely available. Additionally, the DDSM images are saved in non-standard compression files that require the use of decompression code that has not been updated or maintained for modern computers. Finally, the ROI annotations for the abnormalities in the DDSM were provided to indicate a general position of lesions, but not a precise segmentation for them. Therefore, many researchers must implement segmentation algorithms for accurate feature extraction. This causes an inability to directly compare the performance of methods or to replicate prior results. The CBIS-DDSM collection addresses that challenge by publicly releasing a curated and standardized version of the DDSM for evaluation of future CADx and CADe systems (sometimes referred to generally as CAD) research in mammography.

    Please note that the image data for this collection is structured such that each participant has multiple patient IDs. For example, participant 00038 has 10 separate patient IDs which provide information about the scans within the IDs (e.g. Calc-Test_P_00038_LEFT_CC, Calc-Test_P_00038_RIGHT_CC_1). This makes it appear as though there are 6,671 patients according to the DICOM metadata, but there are only 1,566 actual participants in the cohort.
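
    One way to count participants rather than DICOM patient IDs is to extract the participant number from each ID. The sketch below assumes every patient ID contains a `_P_<digits>_` segment, as in the two examples above; the regular expression, the function names, and the third sample ID are illustrative assumptions, not part of the collection's documentation.

```python
import re

# Assumed pattern: a "_P_" marker followed by the participant digits.
PARTICIPANT_RE = re.compile(r"_P_(\d+)_")


def participant_of(patient_id):
    """Return the participant number embedded in a patient ID, or None if absent."""
    match = PARTICIPANT_RE.search(patient_id)
    return match.group(1) if match else None


def count_participants(patient_ids):
    """Count distinct participants across a list of DICOM patient IDs."""
    found = (participant_of(pid) for pid in patient_ids)
    return len({p for p in found if p is not None})


patient_ids = [
    "Calc-Test_P_00038_LEFT_CC",
    "Calc-Test_P_00038_RIGHT_CC_1",
    "Mass-Training_P_00001_LEFT_CC",  # made-up ID for illustration
]
n_participants = count_participants(patient_ids)
```

    Grouping by the extracted number, rather than by the raw patient ID, also keeps all scans of one participant on the same side of a train/test split.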

    For scientific and other inquiries about this dataset, please contact TCIA's Helpdesk.
