Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains x-ray images, mammography, from breast cancer screening at the Karolinska University Hospital, Stockholm, Sweden, collected by principal investigator Fredrik Strand at Karolinska Institutet. The purpose for compiling the dataset was to perform AI research to improve screening, diagnostics and prognostics of breast cancer.
The dataset is based on a selection of cases with and without a breast cancer diagnosis, taken from a more comprehensive source dataset.
1,103 cases of first-time breast cancer for women in the screening age range (40-74 years) during the included time period (November 2008 to December 2015) were included. Of these, a random selection of 873 cases have been included in the published dataset.
A random selection of 10,000 healthy controls during the same time period were included. Of these, a random selection of 7,850 cases have been included in the published dataset.
For each individual, all screening mammograms, including those repeated over time, were included, as well as the date of screening and the age. In addition, there are pixel-level annotations of the tumors created by a breast radiologist (small lesions such as micro-calcifications have been annotated as an area). Annotations were also drawn in mammograms prior to diagnosis; an annotation consisting of a single pixel means that no cancer was seen, and the pixel marks the estimated location of the center of the future cancer.
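The single-pixel convention described above can be told apart from an area annotation by counting marked pixels. The following is a minimal sketch, assuming the PNG has already been decoded into rows of pixel values where any non-zero value counts as annotated; the actual pixel encoding of the annotation files is not stated here, and the function name and labels are hypothetical.

```python
def classify_annotation(mask):
    """Classify a decoded annotation mask (rows of pixel values).

    Assumption: non-zero pixels are annotated. Returns a (label, location)
    pair; location is (row, column) only for single-pixel annotations.
    """
    marked = [
        (r, c)
        for r, row in enumerate(mask)
        for c, value in enumerate(row)
        if value
    ]
    if not marked:
        return ("no annotation", None)
    if len(marked) == 1:
        # One pixel: no cancer was visible, the pixel marks the estimated
        # center of the future cancer.
        return ("estimated future location", marked[0])
    # More than one pixel: an annotated tumor area.
    return ("visible lesion area", None)


empty_mask = [[0, 0], [0, 0]]
prior_mask = [[0, 0], [0, 1]]
tumor_mask = [[1, 1], [0, 1]]
```

Separating the two cases early matters for training, since a single-pixel mask is a location hint rather than a segmentation target.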
In addition to images, the dataset also contains cancer data created at the Karolinska University Hospital and extracted through the Regional Cancer Center Stockholm-Gotland. This data contains information about the time of diagnosis and cancer characteristics including tumor size, histology and lymph node metastasis.
The precision of non-image data was decreased, through categorisation and jittering, to ensure that no single individual can be identified.
The following types of files are available:
- CSV: The following data is included (if applicable): cancer/no cancer (meaning breast cancer during 2008 to 2015), age group at screening, days from image to diagnosis (if any), cancer histology, cancer size group, ipsilateral axillary lymph node metastasis. There is one CSV file for the entire dataset, with one row per image. Any information about cancer diagnosis is repeated for all rows for an individual who was diagnosed (i.e., it is also included in rows before diagnosis). For each exam date there is the assessment by radiologist 1, radiologist 2 and the consensus decision.
- DICOM: Mammograms. For each screening, four images for the standard views were acquired: left and right, mediolateral oblique and craniocaudal. There should be four files per examination date.
- PNG: Cancer annotations, for each DICOM image containing a visible tumor.
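Since the CSV has one row per image and each examination date should have four files, a quick completeness check is possible. This is a rough sketch only: the column names `patient_id`, `exam_date`, `laterality` and `view` are hypothetical placeholders, not the dataset's actual header, and the sample rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# The four standard views: left and right, craniocaudal (CC) and
# mediolateral oblique (MLO).
EXPECTED_VIEWS = {
    ("left", "CC"), ("left", "MLO"), ("right", "CC"), ("right", "MLO"),
}


def find_incomplete_exams(csv_text):
    """Return {(patient, exam date): [missing views]} for incomplete exams."""
    views_by_exam = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["patient_id"], row["exam_date"])
        views_by_exam[key].add((row["laterality"], row["view"]))
    return {
        key: sorted(EXPECTED_VIEWS - views)
        for key, views in views_by_exam.items()
        if EXPECTED_VIEWS - views
    }


# Invented sample: p1 has all four views, p2 lacks both MLO views.
sample = """patient_id,exam_date,laterality,view
p1,2010-01-01,left,CC
p1,2010-01-01,left,MLO
p1,2010-01-01,right,CC
p1,2010-01-01,right,MLO
p2,2011-05-05,left,CC
p2,2011-05-05,right,CC
"""
incomplete = find_incomplete_exams(sample)
```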
Access: The dataset is available upon request due to the size of the material. The image files in DICOM and PNG format comprise approximately 2.5 TB. The CSV file with parametric data can be downloaded as associated documentation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for CommonCatalog CC-BY
This dataset is a large collection of high-resolution Creative Common images (composed of different licenses, see paper Table 1 in the Appendix) collected in 2014 from users of Yahoo Flickr. The dataset contains images of up to 4k resolution, making this one of the highest resolution captioned image datasets.
Dataset Details
Dataset Description
We provide synthetic captions to approximately 100 million high… See the full description on the dataset page: https://huggingface.co/datasets/common-canvas/commoncatalog-cc-by.
## Overview
CC is a dataset for object detection tasks - it contains Frames annotations for 237 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
🐳 OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
This repository contains 210 million image-text interleaved documents filtered from the OmniCorpus-CC dataset, which was sourced from Common Crawl.
Repository: https://github.com/OpenGVLab/OmniCorpus Paper (ICLR 2025 Spotlight): https://arxiv.org/abs/2406.08418
OmniCorpus dataset is a large-scale image-text interleaved dataset, which pushes the boundaries of scale and diversity by encompassing… See the full description on the dataset page: https://huggingface.co/datasets/OpenGVLab/OmniCorpus-CC-210M.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Access the CSAW-CC dataset featuring mammography images from Karolinska University Hospital, including over 1,100 breast cancer cases and over 10,000 healthy controls for AI-driven medical imaging research.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
MAP-CC
An open-source Chinese pretraining dataset with a scale of 800 billion tokens, offering the NLP community high-quality Chinese pretraining data.
Disclaimer
This model, developed for academic purposes, employs rigorously compliance-checked training data to uphold the highest standards of integrity and compliance. Despite our efforts, the inherent complexities of data and the broad spectrum of… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/MAP-CC.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Mammo Only CC is a dataset for object detection tasks - it contains Tumor annotations for 627 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
Armenian-language dataset extracted from the CC-100 research dataset. Description from the website: This corpus is an attempt to recreate the dataset used for training XLM-R. The corpus comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository by processing January-December 2018 Commoncrawl snapshots. Each file comprises documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open source CC-Net repository. No claims of intellectual property are made on the work of preparation of the corpus.
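The file layout described above (documents separated by double newlines, paragraphs by a newline) can be read with plain string splitting. This is a minimal sketch that assumes the whole file fits in memory as one string; the function name `parse_corpus` is hypothetical, and a real corpus file would more likely be streamed line by line.

```python
def parse_corpus(text):
    """Split corpus text into documents, each a list of paragraphs."""
    documents = []
    # A blank line (double newline) ends a document.
    for chunk in text.split("\n\n"):
        # Within a document, each non-empty line is one paragraph.
        paragraphs = [line for line in chunk.split("\n") if line.strip()]
        if paragraphs:
            documents.append(paragraphs)
    return documents


parsed = parse_corpus("first paragraph\nsecond paragraph\n\nanother document\n")
```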
This dataset was created by faka_frame_
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Orzlala
Released under MIT
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
CC-Bench Trajectories Overview
To evaluate GLM-4.6's agentic coding capabilities in real-world scenarios, we developed CC-Bench-V1.1 using Claude Code as the agentic coding testbed. Building on CC-Bench-V1.0, we added 22 more challenging coding tasks and conducted comprehensive evaluations against Claude-Sonnet-4, GLM-4.5, Kimi-K2-0905, and DeepSeek-V3.1-Terminus. The benchmark comprises 74 coding tasks spanning frontend development, tool development, data analysis, testing, and… See the full description on the dataset page: https://huggingface.co/datasets/zai-org/CC-Bench-trajectories.
Conceptual Captions (CC-3M) is a large-scale dataset of approximately 3.3 million image-caption pairs.
This data release contains deep seismic reflection profiles CC-1 and CC-2, which extend eastward from within the California Coast Ranges across the Great Valley and into the Sierran foothills, with a combined east-west length of about 140 km at about the latitude of the town of Merced (37.25° north latitude). The records are processed to 15 seconds two-way time and thus extend deep into the lithosphere as well as capturing detail in the shallow crust. Field data (no longer available) were collected in 1982-85 with vibrator source, an 800-channel, split-spread receiver array using SIGN-BIT technology, and a maximum offset of 12.2 km. Line CC-1 extends from Franciscan Complex of the eastern Coast Ranges east to Merced in the Great Valley; line CC-2 is offset 12.75 km to the south with a 10.8 km overlap of CC-1 and extends east into batholithic rocks of the Sierran Foothills. The included data consist of (1) raster images of stacked and migrated profiles and (2) the ground location of their reconstruction lines and points as scans of 1:4,000-scale paper plots and their digital representations.
https://images.cv/license
Labeled Cc amharic images suitable for training and evaluating computer vision and deep learning models.
CC-Stories (or STORIES) is a dataset for common sense reasoning and language modeling. It was constructed by aggregating documents from the CommonCrawl dataset that have the most overlapping n-grams with the questions in commonsense reasoning tasks. The top 1.0% of highest ranked documents is chosen as the new training corpus.
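The selection idea described above can be illustrated with a toy ranking. This is only a sketch under stated assumptions: whitespace tokenization, bigrams, and a score equal to the number of distinct n-grams shared with the questions are all choices made here for illustration, not details confirmed for the original construction. The function names are hypothetical.

```python
import math


def ngrams(text, n):
    """Return the set of word n-grams in lower-cased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def select_top_documents(documents, questions, n=2, fraction=0.01):
    """Keep the top `fraction` of documents by n-gram overlap with questions."""
    question_ngrams = set()
    for question in questions:
        question_ngrams |= ngrams(question, n)
    # Score = number of distinct n-grams a document shares with the questions.
    ranked = sorted(
        documents,
        key=lambda doc: len(ngrams(doc, n) & question_ngrams),
        reverse=True,
    )
    keep = max(1, math.ceil(len(documents) * fraction))
    return ranked[:keep]


docs = [
    "stock prices rose sharply today",
    "the trophy did not fit in the suitcase",
]
questions = ["the trophy does not fit in the suitcase because it is too big"]
selected = select_top_documents(docs, questions, n=2, fraction=0.5)
```

With `fraction=0.01` the same function keeps the top 1.0% of a larger collection, matching the cut-off quoted in the description.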
This dataset contains the predicted prices of the asset CC over the next 16 years. This data is calculated initially using a default 5 percent annual growth rate, and after page load, it features a sliding scale component where the user can then further adjust the growth rate to their own positive or negative projections. The maximum positive adjustable growth rate is 100 percent, and the minimum adjustable growth rate is -100 percent.
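A projection of this kind can be sketched as follows, assuming the growth rate is compounded once per year; the description does not state the compounding scheme or the starting price, so both are assumptions here, and `project_prices` is a hypothetical name.

```python
def project_prices(start_price, annual_rate=0.05, years=16):
    """Project yearly prices as start_price * (1 + rate) ** year.

    Assumptions: annual compounding; the rate is limited to the
    -100% to +100% range of the adjustable scale.
    """
    if not -1.0 <= annual_rate <= 1.0:
        raise ValueError("annual_rate must be between -1.0 and 1.0")
    return [
        start_price * (1 + annual_rate) ** year
        for year in range(1, years + 1)
    ]


# Hypothetical starting price of 100.0 with the default 5 percent rate.
projection = project_prices(100.0)
```

At the -100 percent extreme every projected value is zero, and at 0 percent the price stays flat, which is a quick sanity check on the formula.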
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
sameAs.cc is the largest dataset of identity statements that has been gathered from the LOD Cloud to date. This dataset is available in HDT format (Header Dictionary, Triples), and contains 558,943,116 distinct owl:sameAs statements collected from the LOD Laundromat corpus.
This dataset provides information about the number of properties, residents, and average property values for Cc Street cross streets in Woodland, WA.
Campaign Spending OE Ledger CC Dataset as of December 31, 2024