100+ datasets found
  1. CSAW-CC (mammography) – a dataset for AI research to improve screening,...

    • researchdata.se
    Updated Jan 7, 2025
    Cite
    Fredrik Strand (2025). CSAW-CC (mammography) – a dataset for AI research to improve screening, diagnostics and prognostics of breast cancer [Dataset]. http://doi.org/10.5878/45vm-t798
    Explore at:
    Available download formats: (9211529), (29050)
    Dataset updated
    Jan 7, 2025
    Dataset provided by
    Karolinska Institutet
    Authors
    Fredrik Strand
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2008 - 2015
    Area covered
    Stockholm County
    Description

    The dataset contains x-ray images, mammography, from breast cancer screening at the Karolinska University Hospital, Stockholm, Sweden, collected by principal investigator Fredrik Strand at Karolinska Institutet. The purpose for compiling the dataset was to perform AI research to improve screening, diagnostics and prognostics of breast cancer.

    The dataset is based on a selection of cases with and without a breast cancer diagnosis, taken from a more comprehensive source dataset.

    1,103 cases of first-time breast cancer for women in the screening age range (40-74 years) during the included time period (November 2008 to December 2015) were included. Of these, a random selection of 873 cases have been included in the published dataset.

    A random selection of 10,000 healthy controls during the same time period were included. Of these, a random selection of 7,850 controls have been included in the published dataset.

    For each individual all screening mammograms, also repeated over time, were included, as well as the date of screening and the age. In addition, there are pixel-level annotations of the tumors created by a breast radiologist (small lesions such as micro-calcifications have been annotated as an area). Annotations were also drawn in mammograms prior to diagnosis; an annotation consisting of a single pixel means that no cancer was seen, and the pixel marks the estimated location of the center of the future cancer.

    In addition to images, the dataset also contains cancer data created at the Karolinska University Hospital and extracted through the Regional Cancer Center Stockholm-Gotland. This data contains information about the time of diagnosis and cancer characteristics including tumor size, histology and lymph node metastasis.

    The precision of non-image data was decreased, through categorisation and jittering, to ensure that no single individual can be identified.

    The following types of files are available:

    - CSV: The following data is included (if applicable): cancer/no cancer (meaning breast cancer during 2008 to 2015), age group at screening, days from image to diagnosis (if any), cancer histology, cancer size group, ipsilateral axillary lymph node metastasis. There is one CSV file for the entire dataset, with one row per image. Any information about cancer diagnosis is repeated for all rows for an individual who was diagnosed (i.e., it is also included in rows before diagnosis). For each exam date there is the assessment by radiologist 1, radiologist 2 and the consensus decision.
    - DICOM: Mammograms. For each screening, four images for the standard views were acquired: left and right, mediolateral oblique and craniocaudal. There should be four files per examination date.
    - PNG: Cancer annotations, one for each DICOM image containing a visible tumor.
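
The "one row per image, four files per examination date" layout described above lends itself to a simple completeness check. The following is a minimal sketch only: the column names (`patient_id`, `exam_date`, `laterality`, `view`) and the sample rows are hypothetical and are not taken from the actual CSAW-CC CSV header, which may differ.

```python
import csv
import io
from collections import defaultdict

# The four standard views named in the description: left/right x CC/MLO.
EXPECTED_VIEWS = {("L", "CC"), ("L", "MLO"), ("R", "CC"), ("R", "MLO")}


def incomplete_exams(csv_file):
    """Map (patient_id, exam_date) to the set of standard views missing from that exam."""
    seen = defaultdict(set)
    # Column names below are hypothetical, not taken from the real CSV header.
    for row in csv.DictReader(csv_file):
        key = (row["patient_id"], row["exam_date"])
        seen[key].add((row["laterality"], row["view"]))
    return {key: EXPECTED_VIEWS - views
            for key, views in seen.items()
            if EXPECTED_VIEWS - views}


# Made-up sample: p1 has all four views, p2 lacks the right MLO image.
SAMPLE = """patient_id,exam_date,laterality,view
p1,2010-03-01,L,CC
p1,2010-03-01,L,MLO
p1,2010-03-01,R,CC
p1,2010-03-01,R,MLO
p2,2011-06-15,L,CC
p2,2011-06-15,L,MLO
p2,2011-06-15,R,CC
"""

print(incomplete_exams(io.StringIO(SAMPLE)))
```

Grouping by individual and exam date mirrors the description's statement that each examination date should have four files; exams that the function returns would be candidates for manual inspection.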

    Access: The dataset is available upon request due to the size of the material. The image files in DICOM and PNG format comprise approximately 2.5 TB. Access to the CSV file including parametric data is possible via download as associated documentation.

  2. commoncatalog-cc-by

    • huggingface.co
    Updated May 16, 2024
    Cite
    CommonCanvas (2024). commoncatalog-cc-by [Dataset]. https://huggingface.co/datasets/common-canvas/commoncatalog-cc-by
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 16, 2024
    Dataset authored and provided by
    CommonCanvas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for CommonCatalog CC-BY

    This dataset is a large collection of high-resolution Creative Commons images (covering several different licenses; see Table 1 in the Appendix of the paper) collected in 2014 from users of Yahoo Flickr. The dataset contains images of up to 4K resolution, making it one of the highest-resolution captioned image datasets.

    Dataset Details

    Dataset Description

    We provide synthetic captions for approximately 100 million high… See the full description on the dataset page: https://huggingface.co/datasets/common-canvas/commoncatalog-cc-by.

  3. Cc Dataset

    • universe.roboflow.com
    zip
    Updated Jun 10, 2024
    Cite
    CC (2024). Cc Dataset [Dataset]. https://universe.roboflow.com/cc-ayonm/cc-icdzq/model/3
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 10, 2024
    Dataset authored and provided by
    CC
    Variables measured
    Frames Bounding Boxes
    Description

    CC

    ## Overview
    
    CC is a dataset for object detection tasks - it contains Frames annotations for 237 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
  4. OmniCorpus-CC-210M

    • huggingface.co
    Updated Aug 30, 2024
    Cite
    OpenGVLab (2024). OmniCorpus-CC-210M [Dataset]. https://huggingface.co/datasets/OpenGVLab/OmniCorpus-CC-210M
    Explore at:
    Dataset updated
    Aug 30, 2024
    Dataset authored and provided by
    OpenGVLab
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    🐳 OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

    This repository contains 210 million image-text interleaved documents filtered from the OmniCorpus-CC dataset, which was sourced from Common Crawl.

    Repository: https://github.com/OpenGVLab/OmniCorpus
    Paper (ICLR 2025 Spotlight): https://arxiv.org/abs/2406.08418

    The OmniCorpus dataset is a large-scale image-text interleaved dataset, which pushes the boundaries of scale and diversity by encompassing… See the full description on the dataset page: https://huggingface.co/datasets/OpenGVLab/OmniCorpus-CC-210M.

  5. CSAW-CC (mammography)

    • gts.ai
    json
    Updated Jan 24, 2025
    Cite
    GTS (2025). CSAW-CC (mammography) [Dataset]. https://gts.ai/dataset-download/csaw-cc-mammography/
    Explore at:
    Available download formats: json
    Dataset updated
    Jan 24, 2025
    Dataset provided by
    GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
    Authors
    GTS
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Access the CSAW-CC dataset featuring mammography images from Karolinska University Hospital, including over 1,100 breast cancer cases and over 10,000 healthy controls for AI-driven medical imaging research.

  6. MAP-CC

    • huggingface.co
    Updated Apr 5, 2024
    Cite
    Multimodal Art Projection (2024). MAP-CC [Dataset]. https://huggingface.co/datasets/m-a-p/MAP-CC
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 5, 2024
    Dataset authored and provided by
    Multimodal Art Projection
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    MAP-CC

    🌐 Homepage | 🤗 MAP-CC | 🤗 CHC-Bench | 🤗 CT-LLM | 📖 arXiv | GitHub

    An open-source Chinese pretraining dataset with a scale of 800 billion tokens, offering the NLP community high-quality Chinese pretraining data.

    Disclaimer

    This model, developed for academic purposes, employs rigorously compliance-checked training data to uphold the highest standards of integrity and compliance. Despite our efforts, the inherent complexities of data and the broad spectrum of… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/MAP-CC.

  7. Mammo Only Cc Dataset

    • universe.roboflow.com
    zip
    Updated Apr 3, 2022
    Cite
    Zakir Alam (2022). Mammo Only Cc Dataset [Dataset]. https://universe.roboflow.com/zakir-alam/mammo-only-cc
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 3, 2022
    Dataset authored and provided by
    Zakir Alam
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Tumor Bounding Boxes
    Description

    Mammo Only CC

    ## Overview
    
    Mammo Only CC is a dataset for object detection tasks - it contains Tumor annotations for 627 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  8. Armenian language dataset from CC-100, monolingual Datasets from Web Crawl...

    • data.opendata.am
    Updated Apr 6, 2023
    Cite
    (2023). Armenian language dataset from CC-100, monolingual Datasets from Web Crawl Data [Dataset]. https://data.opendata.am/dataset/cc100arm
    Explore at:
    Dataset updated
    Apr 6, 2023
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Area covered
    Armenia
    Description

    Armenian language dataset extracted from the CC-100 research dataset.

    Description from website: This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository by processing January-December 2018 Commoncrawl snapshots. Each file comprises documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open source CC-Net repository. No claims of intellectual property are made on the work of preparation of the corpus.
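
The file layout described above (documents separated by a blank line, paragraphs separated by a single newline) can be read with a few lines of code. This is a minimal sketch under that stated layout; the function name `parse_cc100` and the sample text are illustrative assumptions, not part of the CC-100 or CC-Net tooling.

```python
def parse_cc100(text):
    """Split CC-100 style text into documents, each a list of paragraph strings."""
    documents = []
    # A double newline marks a document boundary.
    for block in text.split("\n\n"):
        # Within a document, each non-empty line is one paragraph.
        paragraphs = [line for line in block.split("\n") if line.strip()]
        if paragraphs:
            documents.append(paragraphs)
    return documents


# Made-up sample with two documents; the first has two paragraphs.
sample = (
    "Doc one, paragraph one.\n"
    "Doc one, paragraph two.\n"
    "\n"
    "Doc two, only paragraph.\n"
)
docs = parse_cc100(sample)
print(len(docs), [len(d) for d in docs])
```

For files too large to hold in memory, the same logic could be applied line by line, starting a new document whenever an empty line is encountered.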

  9. (CC-CCII)

    • kaggle.com
    zip
    Updated Sep 25, 2023
    Cite
    faka_frame_ (2023). (CC-CCII) [Dataset]. https://www.kaggle.com/datasets/fakaframe082/cc-ccii
    Explore at:
    Available download formats: zip (2085897531 bytes)
    Dataset updated
    Sep 25, 2023
    Authors
    faka_frame_
    Description

    Dataset

    This dataset was created by faka_frame_

  10. Cancer-CC-Dataset-without-Mask

    • kaggle.com
    zip
    Updated Dec 26, 2023
    Cite
    Orzlala (2023). Cancer-CC-Dataset-without-Mask [Dataset]. https://www.kaggle.com/datasets/orzlala/cancer-cc-dataset-without-mask/code
    Explore at:
    Available download formats: zip (3385355183 bytes)
    Dataset updated
    Dec 26, 2023
    Authors
    Orzlala
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Orzlala

    Released under MIT

  11. CC-Bench-trajectories

    • huggingface.co
    Cite
    Z.ai, CC-Bench-trajectories [Dataset]. https://huggingface.co/datasets/zai-org/CC-Bench-trajectories
    Explore at:
    Dataset provided by
    Z.ai
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    CC-Bench Trajectories Overview

    To evaluate GLM-4.6's agentic coding capabilities in real-world scenarios, we developed CC-Bench-V1.1 using Claude Code as the agentic coding testbed. Building on CC-Bench-V1.0, we added 22 more challenging coding tasks and conducted comprehensive evaluations against Claude-Sonnet-4, GLM-4.5, Kimi-K2-0905, and DeepSeek-V3.1-Terminus. The benchmark comprises 74 coding tasks spanning frontend development, tool development, data analysis, testing, and… See the full description on the dataset page: https://huggingface.co/datasets/zai-org/CC-Bench-trajectories.

  12. Conceptual Captions (CC-3M) - Dataset - LDM

    • service.tib.eu
    Updated Dec 2, 2024
    Cite
    (2024). Conceptual Captions (CC-3M) - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/conceptual-captions--cc-3m-
    Explore at:
    Dataset updated
    Dec 2, 2024
    Description

    Conceptual Captions (CC-3M) is a large-scale dataset of approximately 3.3 million image-caption pairs.

  13. Deep (15-second) seismic reflection profiles CC-1 and CC-2 extending from...

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Nov 26, 2025
    Cite
    U.S. Geological Survey (2025). Deep (15-second) seismic reflection profiles CC-1 and CC-2 extending from the eastern California Coast Ranges across the Great Valley into the Sierran foothills at about latitude 37.25° N [Dataset]. https://catalog.data.gov/dataset/deep-15-second-seismic-reflection-profiles-cc-1-and-cc-2-extending-from-the-eastern-califo
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    United States Geological Survey: http://www.usgs.gov/
    Area covered
    Coast Ranges
    Description

    This data release contains deep seismic reflection profiles CC-1 and CC-2, which extend eastward from within the California Coast Ranges across the Great Valley and into the Sierran foothills, with a combined east-west length of about 140 km at about the latitude of the town of Merced (37.25° north latitude). The records are processed to 15 seconds two-way time and thus extend deep into the lithosphere as well as capturing detail in the shallow crust. Field data (no longer available) were collected in 1982-85 with vibrator source, an 800-channel, split-spread receiver array using SIGN-BIT technology, and a maximum offset of 12.2 km. Line CC-1 extends from Franciscan Complex of the eastern Coast Ranges east to Merced in the Great Valley; line CC-2 is offset 12.75 km to the south with a 10.8 km overlap of CC-1 and extends east into batholithic rocks of the Sierran Foothills. The included data consist of (1) raster images of stacked and migrated profiles and (2) the ground location of their reconstruction lines and points as scans of 1:4,000-scale paper plots and their digital representations.

  14. Cc amharic Image Classification Dataset

    • images.cv
    zip
    Updated Mar 23, 2023
    Cite
    (2023). Cc amharic Image Classification Dataset [Dataset]. https://images.cv/dataset/cc-amharic-image-classification-dataset
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 23, 2023
    License

    https://images.cv/license

    Description

    Labeled Cc amharic images suitable for training and evaluating computer vision and deep learning models.

  15. cc-stories

    • huggingface.co
    • opendatalab.com
    Updated Jun 15, 2023
    Cite
    Daniel Campos (2023). cc-stories [Dataset]. https://huggingface.co/datasets/spacemanidol/cc-stories
    Explore at:
    Dataset updated
    Jun 15, 2023
    Authors
    Daniel Campos
    Description

    CC-Stories (or STORIES) is a dataset for common sense reasoning and language modeling. It was constructed by aggregating the documents from the CommonCrawl dataset that have the most overlapping n-grams with the questions in commonsense reasoning tasks. The top 1.0% of highest-ranked documents are chosen as the new training corpus.
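
The selection step described above (rank documents by n-gram overlap with the questions, then keep the top fraction) might be sketched as follows. This is an illustrative assumption, not the procedure used to build CC-Stories: the description does not give the n-gram size, tokenization, or scoring function, so the sketch uses whitespace tokens, bigrams, and a raw count of shared n-grams. The function names and sample texts are hypothetical.

```python
def ngrams(text, n=2):
    """Return the set of word n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def top_overlapping(documents, questions, n=2, fraction=0.01):
    """Keep the top `fraction` of documents, ranked by n-grams shared with the questions."""
    question_grams = set()
    for question in questions:
        question_grams |= ngrams(question, n)
    # Assumed score: number of distinct n-grams a document shares with any question.
    ranked = sorted(
        documents,
        key=lambda doc: len(ngrams(doc, n) & question_grams),
        reverse=True,
    )
    keep = max(1, int(len(ranked) * fraction))  # always keep at least one document
    return ranked[:keep]


# Made-up inputs for illustration only.
questions = ["the trophy does not fit in the suitcase because it is too big"]
documents = [
    "stock prices rose sharply today",
    "the trophy did not fit in the suitcase",
]
print(top_overlapping(documents, questions, fraction=0.5))
```

With `fraction=0.01` the function corresponds to the 1.0% cutoff mentioned in the description; the larger value in the example is only there because the sample has two documents.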

  16. CC Price Prediction Data

    • coinbase.com
    Updated Nov 16, 2025
    Cite
    (2025). CC Price Prediction Data [Dataset]. https://www.coinbase.com/price-prediction/cc
    Explore at:
    Dataset updated
    Nov 16, 2025
    Variables measured
    Growth Rate, Predicted Price
    Measurement technique
    User-defined projections based on compound growth. This is not a formal financial forecast.
    Description

    This dataset contains the predicted prices of the asset CC over the next 16 years. This data is calculated initially using a default 5 percent annual growth rate, and after page load, it features a sliding scale component where the user can then further adjust the growth rate to their own positive or negative projections. The maximum positive adjustable growth rate is 100 percent, and the minimum adjustable growth rate is -100 percent.
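
The projection described above is plain compound growth: a price P growing at annual rate r is worth P × (1 + r)^t after t years. The following minimal sketch assumes yearly compounding and uses a hypothetical function name and starting price; it is not Coinbase's implementation.

```python
def project_prices(current_price, annual_rate=0.05, years=16):
    """Return the projected price at the end of each year under compound growth."""
    # The description limits the adjustable rate to between -100% and +100%.
    if not -1.0 <= annual_rate <= 1.0:
        raise ValueError("annual_rate must be between -1.0 and 1.0")
    return [current_price * (1 + annual_rate) ** year
            for year in range(1, years + 1)]


# Hypothetical starting price of 100.0 with the default 5 percent rate.
projection = project_prices(100.0)
print(len(projection), round(projection[0], 2))
```

A rate of -1.0 (that is, -100 percent) drives every projected value to zero, which is why it is the lowest setting that still makes sense for a price.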

  17. sameAs.cc - Dataset of 558M owl:sameAs statements

    • data.niaid.nih.gov
    • live.european-language-grid.eu
    Updated Jan 24, 2020
    Cite
    Beek, Wouter; Raad, Joe; Wielemaker, Jan; Van Harmelen, Frank (2020). sameAs.cc - Dataset of 558M owl:sameAs statements [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1973098
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Vrije Universiteit Amsterdam
    Authors
    Beek, Wouter; Raad, Joe; Wielemaker, Jan; Van Harmelen, Frank
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    sameAs.cc is the largest dataset of identity statements that has been gathered from the LOD Cloud to date. This dataset is available in HDT format (Header Dictionary, Triples), and contains 558,943,116 distinct owl:sameAs statements collected from the LOD Laundromat corpus.

  18. Cc Street Cross Street Data in Woodland, WA

    • ownerly.com
    Updated Dec 6, 2021
    Cite
    Ownerly (2021). Cc Street Cross Street Data in Woodland, WA [Dataset]. https://www.ownerly.com/wa/woodland/cc-st-home-details
    Explore at:
    Dataset updated
    Dec 6, 2021
    Dataset authored and provided by
    Ownerly
    Area covered
    Woodland, CC Street, Washington
    Description

    This dataset provides information about the number of properties, residents, and average property values for Cc Street cross streets in Woodland, WA.

  19. Campaign Spending OE Ledger CC Dataset

    • datasets.ai
    8
    Updated Apr 10, 2024
    Cite
    State of Hawaii (2024). Campaign Spending OE Ledger CC Dataset [Dataset]. https://datasets.ai/datasets/campaign-spending-oe-ledger-cc-dataset
    Explore at:
    Available download formats: 8
    Dataset updated
    Apr 10, 2024
    Dataset authored and provided by
    State of Hawaii
    Description

    Campaign Spending OE Ledger CC Dataset as of December 31, 2024

  20. CC Harbor Dist

    • data.ca.gov
    • data.cnra.ca.gov
    • +2 more
    html
    Updated Sep 16, 2022
    Cite
    California State Lands Commission (2022). CC Harbor Dist [Dataset]. https://data.ca.gov/dataset/cc-harbor-dist
    Explore at:
    Available download formats: html
    Dataset updated
    Sep 16, 2022
    Dataset authored and provided by
    California State Lands Commission: https://www.slc.ca.gov/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically


CSAW-CC (mammography) – a dataset for AI research to improve screening, diagnostics and prognostics of breast cancer

Cohort of Screen-age Women - Case control (CSAW-CC)

Explore at:
7 scholarly articles cite this dataset (View in Google Scholar)
