43 datasets found
  1. P

    COCO-Text Dataset

    • paperswithcode.com
    Updated Dec 5, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Andreas Veit; Tomas Matera; Lukas Neumann; Jiri Matas; Serge Belongie (2024). COCO-Text Dataset [Dataset]. https://paperswithcode.com/dataset/coco-text
    Explore at:
    Dataset updated
    Dec 5, 2024
    Authors
    Andreas Veit; Tomas Matera; Lukas Neumann; Jiri Matas; Serge Belongie
    Description

    The COCO-Text dataset is a dataset for text detection and recognition. It is based on the MS COCO dataset, which contains images of complex everyday scenes. The COCO-Text dataset contains non-text images, legible text images and illegible text images. In total there are 22184 training images and 7026 validation images with at least one instance of legible text.

  2. h

    COCO-Text

    • huggingface.co
    Updated May 11, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    VLM-Perception (2025). COCO-Text [Dataset]. https://huggingface.co/datasets/VLM-Perception/COCO-Text
    Explore at:
    Dataset updated
    May 11, 2025
    Dataset authored and provided by
    VLM-Perception
    Description

    VLM-Perception/COCO-Text dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. P

    MS COCO Dataset

    • paperswithcode.com
    Updated Dec 10, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Tsung-Yi Lin; Michael Maire; Serge Belongie; Lubomir Bourdev; Ross Girshick; James Hays; Pietro Perona; Deva Ramanan; C. Lawrence Zitnick; Piotr Dollár (2023). MS COCO Dataset [Dataset]. https://paperswithcode.com/dataset/coco
    Explore at:
    Dataset updated
    Dec 10, 2023
    Authors
    Tsung-Yi Lin; Michael Maire; Serge Belongie; Lubomir Bourdev; Ross Girshick; James Hays; Pietro Perona; Deva Ramanan; C. Lawrence Zitnick; Piotr Dollár
    Description

    The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.

    Splits: The first version of MS COCO dataset was released in 2014. It contains 164K images split into training (83K), validation (41K) and test (41K) sets. In 2015 additional test set of 81K images was released, including all the previous test images and 40K new images.

    Based on community feedback, in 2017 the training/validation split was changed from 83K/41K to 118K/5K. The new split uses the same images and annotations. The 2017 test set is a subset of 41K images of the 2015 test set. Additionally, the 2017 release contains a new unannotated dataset of 123K images.

    Annotations: The dataset has annotations for

    object detection: bounding boxes and per-instance segmentation masks with 80 object categories, captioning: natural language descriptions of the images (see MS COCO Captions), keypoints detection: containing more than 200,000 images and 250,000 person instances labeled with keypoints (17 possible keypoints, such as left eye, nose, right hip, right ankle), stuff image segmentation – per-pixel segmentation masks with 91 stuff categories, such as grass, wall, sky (see MS COCO Stuff), panoptic: full scene segmentation, with 80 thing categories (such as person, bicycle, elephant) and a subset of 91 stuff categories (grass, sky, road), dense pose: more than 39,000 images and 56,000 person instances labeled with DensePose annotations – each labeled person is annotated with an instance id and a mapping between image pixels that belong to that person body and a template 3D model. The annotations are publicly available only for training and validation images.

  4. R

    Cocotext Dataset

    • universe.roboflow.com
    zip
    Updated Apr 29, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hawkins Library (2024). Cocotext Dataset [Dataset]. https://universe.roboflow.com/hawkins-library-cpnul/cocotext-mbfa4/model/22
    Explore at:
    zipAvailable download formats
    Dataset updated
    Apr 29, 2024
    Dataset authored and provided by
    Hawkins Library
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Text Bounding Boxes
    Description

    COCOTEXT

    ## Overview
    
    COCOTEXT is a dataset for object detection tasks - it contains Text annotations for 8,474 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
    
  5. P

    SPEECH-COCO Dataset

    • paperswithcode.com
    Updated Feb 22, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    William Havard; Laurent Besacier; Olivier Rosec (2021). SPEECH-COCO Dataset [Dataset]. https://paperswithcode.com/dataset/speech-coco
    Explore at:
    Dataset updated
    Feb 22, 2021
    Authors
    William Havard; Laurent Besacier; Olivier Rosec
    Description

    SPEECH-COCO contains speech captions that are generated using text-to-speech (TTS) synthesis resulting in 616,767 spoken captions (more than 600h) paired with images.

  6. h

    coco2017

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Philipp, coco2017 [Dataset]. https://huggingface.co/datasets/phiyodr/coco2017
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Philipp
    Description

    coco2017

    Image-text pairs from MS COCO2017.

      Data origin
    

    Data originates from cocodataset.org While coco-karpathy uses a dense format (with several sentences and sendids per row), coco-karpathy-long uses a long format with one sentence (aka caption) and sendid per row. coco-karpathy-long uses the first five sentences and therefore is five times as long as coco-karpathy. phiyodr/coco2017: One row corresponds one image with several sentences. phiyodr/coco2017-long: One row… See the full description on the dataset page: https://huggingface.co/datasets/phiyodr/coco2017.

  7. Z

    COCO, LVIS, Open Images V4 classes mapping

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 13, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nicola Messina (2022). COCO, LVIS, Open Images V4 classes mapping [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7194299
    Explore at:
    Dataset updated
    Oct 13, 2022
    Dataset provided by
    Nicola Messina
    Lucia Vadicamo
    Fabio Carrara
    Fabrizio Falchi
    Paolo Bolettieri
    Claudio Gennaro
    Giuseppe Amato
    Claudio Vairo
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains a mapping between the classes of COCO, LVIS, and Open Images V4 datasets into a unique set of 1460 classes.

    COCO [Lin et al 2014] contains 80 classes, LVIS [gupta2019lvis] contains 1460 classes, Open Images V4 [Kuznetsova et al. 2020] contains 601 classes.

    We built a mapping of these classes using a semi-automatic procedure in order to have a unique final list of 1460 classes. We also generated a hierarchy for each class, using wordnet

    This repository contains the following files:

    coco_classes_map.txt, contains the mapping for the 80 coco classes

    lvis_classes_map.txt, contains the mapping for the 1460 coco classes

    openimages_classes_map.txt, contains the mapping for the 601 coco classes

    classname_hyperset_definition.csv, contains the final set of 1460 classes, their definition and hierarchy

    all-classnames.xlsx, contains a side-by-side view of all classes considered

    This mapping was used in VISIONE [Amato et al. 2021, Amato et al. 2022] that is a content-based retrieval system that supports various search functionalities (text search, object/color-based search, semantic and visual similarity search, temporal search). For the object detection VISIONE uses three pre-trained models: VfNet Zhang et al. 2021, Mask R-CNN He et al. 2017, and a Faster R-CNN+Inception ResNet (trained on the Open Images V4).

    This is repository is released under a Creative Commons Attribution license, please cite the following paper if you use it in your work in any form:

    @inproceedings{amato2021visione, title={The visione video search system: exploiting off-the-shelf text search engines for large-scale video retrieval}, author={Amato, Giuseppe and Bolettieri, Paolo and Carrara, Fabio and Debole, Franca and Falchi, Fabrizio and Gennaro, Claudio and Vadicamo, Lucia and Vairo, Claudio}, journal={Journal of Imaging}, volume={7}, number={5}, pages={76}, year={2021}, publisher={Multidisciplinary Digital Publishing Institute} }

    References:

    [Amato et al. 2022] Amato, G. et al. (2022). VISIONE at Video Browser Showdown 2022. In: , et al. MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol 13142. Springer, Cham. https://doi.org/10.1007/978-3-030-98355-0_52

    [Amato et al. 2021] Amato, G., Bolettieri, P., Carrara, F., Debole, F., Falchi, F., Gennaro, C., Vadicamo, L. and Vairo, C., 2021. The visione video search system: exploiting off-the-shelf text search engines for large-scale video retrieval. Journal of Imaging, 7(5), p.76.

    [Gupta et al.2019] Gupta, A., Dollar, P. and Girshick, R., 2019. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5356-5364).

    [He et al. 2017] He, K., Gkioxari, G., Dollár, P. and Girshick, R., 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision (pp. 2961-2969).

    [Kuznetsova et al. 2020] Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A. and Duerig, T., 2020. The open images dataset v4. International Journal of Computer Vision, 128(7), pp.1956-1981.

    [Lin et al. 2014] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. and Zitnick, C.L., 2014, September. Microsoft coco: Common objects in context. In European conference on computer vision (pp. 740-755). Springer, Cham.

    [Zhang et al. 2021] Zhang, H., Wang, Y., Dayoub, F. and Sunderhauf, N., 2021. Varifocalnet: An iou-aware dense object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8514-8523).

  8. h

    COCO

    • huggingface.co
    • datasets.activeloop.ai
    Updated Feb 6, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    HuggingFaceM4 (2023). COCO [Dataset]. https://huggingface.co/datasets/HuggingFaceM4/COCO
    Explore at:
    Dataset updated
    Feb 6, 2023
    Dataset authored and provided by
    HuggingFaceM4
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MS COCO is a large-scale object detection, segmentation, and captioning dataset. COCO has several features: Object segmentation, Recognition in context, Superpixel stuff segmentation, 330K images (>200K labeled), 1.5 million object instances, 80 object categories, 91 stuff categories, 5 captions per image, 250,000 people with keypoints.

  9. coco14

    • kaggle.com
    Updated Jul 24, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    RohanS13 (2020). coco14 [Dataset]. https://www.kaggle.com/datasets/rohans13/coco14/data
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 24, 2020
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    RohanS13
    Description

    Dataset

    This dataset was created by RohanS13

    Contents

  10. t

    Spoken-COCO - Dataset - LDM

    • service.tib.eu
    Updated Dec 2, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2024). Spoken-COCO - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/spoken-coco
    Explore at:
    Dataset updated
    Dec 2, 2024
    Description

    Spoken-COCO is a large-scale dataset of audio and text pairs.

  11. O

    COCO 2017

    • opendatalab.com
    • huggingface.co
    zip
    Updated Sep 30, 2017
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Microsoft (2017). COCO 2017 [Dataset]. https://opendatalab.com/OpenDataLab/COCO_2017
    Explore at:
    zip(49105147630 bytes)Available download formats
    Dataset updated
    Sep 30, 2017
    Dataset provided by
    Microsoft
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    COCO is a large-scale object detection, segmentation, and captioning dataset. COCO has several features: Object segmentation Recognition in context Superpixel stuff segmentation 330K images (>200K labeled) 1.5 million object instances 80 object categories 91 stuff categories 5 captions per image 250,000 people with keypoints

  12. coco_text_data_train_2014

    • kaggle.com
    Updated May 19, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    SourojitBhaduri (2021). coco_text_data_train_2014 [Dataset]. https://www.kaggle.com/sourojit/coco-text-data-train-2014/activity
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 19, 2021
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    SourojitBhaduri
    Description

    Dataset

    This dataset was created by SourojitBhaduri

    Contents

  13. E

    SPEECH-COCO

    • live.european-language-grid.eu
    audio wav
    Updated Dec 10, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2023). SPEECH-COCO [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7686
    Explore at:
    audio wavAvailable download formats
    Dataset updated
    Dec 10, 2023
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: Our corpus is an extension of the MS COCO image recognition and captioning dataset. MS COCO comprises images paired with a set of five captions. Yet, it does not include any speech. Therefore, we used Voxygen's text-to-speech system to synthesise the available captions. The addition of speech as a new modality enables MSCOCO to be used for researches in the field of language acquisition, unsupervised term discovery, keyword spotting, or semantic embedding using speech and vision. Our corpus is licensed under a Creative Commons Attribution 4.0 License. Data Set: This corpus contains 616,767 spoken captions from MSCOCO's val2014 and train2014 subsets (respectively 414,113 for train2014 and 202,654 for val2014). We used 8 different voices. 4 of them have a British accent (Paul, Bronwen, Judith, and Elizabeth) and the 4 others have an American accent (Phil, Bruce, Amanda, Jenny). In order to make the captions sound more natural, we used SOX tempo command, enabling us to change the speed without changing the pitch. 1/3 of the captions are 10% slower than the original pace, 1/3 are 10% faster. The last third of the captions was kept untouched. We also modified approximately 30% of the original captions and added disfluencies such as "um", "uh", "er" so that the captions would sound more natural. Each WAV file is paired with a JSON file containing various information: timecode of each word in the caption, name of the speaker, name of the WAV file, etc. The JSON files have the following data structure: {"duration": float, "speaker": string, "synthesisedCaption": string, "timecode": list, "speed": float, "wavFilename": string, "captionID": int, "imgID": int, "disfluency": list}. On average, each caption comprises 10.79 tokens, disfluencies included. The WAV files are on average 3.52 seconds long.

  14. Microsoft COCO (Zhao et al 2017)

    • kaggle.com
    Updated Oct 21, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rachael Tatman (2019). Microsoft COCO (Zhao et al 2017) [Dataset]. https://www.kaggle.com/rtatman/ms-coco/metadata
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 21, 2019
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Rachael Tatman
    Description

    Context

    This dataset contains pickled Python objects with data from the annotations of the Microsoft (MS) COCO dataset. COCO is a large-scale object detection, segmentation, and captioning dataset.

    Content

    Except for the objs file, which is a plain text file continuing a list of objects, the data in this dataset is all in the pickle format, a way of storing Python objects at binary data files.

    Important: These pickles were pickled using Python 2. Since Kernels use Python 3, you will need to specify the encoding when unpickling these files. The Python utility scripts here have been updated to correctly unpickle these files.

    # the correct syntax to read these pickled files into Python 3
    pickle.load(open('file_path, 'rb'), encoding = "latin1")
    

    Acknowledgements

    As a derivative of the original COCO dataset, this dataset is distributed under a CC-BY 4.0 license. These files were distributed as part of the supporting materials for Zhao et al 2017. If you use these files in your work, please cite the following paper:

    Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K. W. (2017). Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 2979-2989).

  15. P

    COST Dataset

    • paperswithcode.com
    Updated Mar 3, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jitesh Jain; Jianwei Yang; Humphrey Shi (2025). COST Dataset [Dataset]. https://paperswithcode.com/dataset/cost
    Explore at:
    Dataset updated
    Mar 3, 2025
    Authors
    Jitesh Jain; Jianwei Yang; Humphrey Shi
    Description

    Click to add a brief description of the dataset (Markdown and LaTeX enabled).

    Provide:

    a high-level explanation of the dataset characteristics explain motivations and summary of its content potential use cases of the dataset

  16. P

    COCO Captions Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Feb 2, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Xinlei Chen; Hao Fang; Tsung-Yi Lin; Ramakrishna Vedantam; Saurabh Gupta; Piotr Dollar; C. Lawrence Zitnick (2021). COCO Captions Dataset [Dataset]. https://paperswithcode.com/dataset/coco-captions
    Explore at:
    Dataset updated
    Feb 2, 2021
    Authors
    Xinlei Chen; Hao Fang; Tsung-Yi Lin; Ramakrishna Vedantam; Saurabh Gupta; Piotr Dollar; C. Lawrence Zitnick
    Description

    COCO Captions contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human generated captions are be provided for each image.

  17. t

    FS-COCO - Dataset - LDM

    • service.tib.eu
    Updated Dec 16, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2024). FS-COCO - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/fs-coco
    Explore at:
    Dataset updated
    Dec 16, 2024
    Description

    FS-COCO: A large-scale scene sketch dataset with fine-grained alignment among sketch, text, and photo.

  18. h

    laion-coco-aesthetic

    • huggingface.co
    Updated Feb 15, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Guangyi Liu (2019). laion-coco-aesthetic [Dataset]. https://huggingface.co/datasets/guangyil/laion-coco-aesthetic
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 15, 2019
    Authors
    Guangyi Liu
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    LAION COCO with aesthetic score and watermark score

    This dataset contains 10% samples of the LAION-COCO dataset filtered by some text rules (remove url, special tokens, etc.), and image rules (image size > 384x384, aesthetic score>4.75 and watermark probability<0.5). There are total 8,563,753 data instances in this dataset. And the corresponding aesthetic score and watermark score are also included. Noted: watermark score in the table means the probability of the existence of the… See the full description on the dataset page: https://huggingface.co/datasets/guangyil/laion-coco-aesthetic.

  19. h

    Recap-COCO-30K

    • huggingface.co
    Updated Jun 28, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    UCSC-VLAA (2024). Recap-COCO-30K [Dataset]. https://huggingface.co/datasets/UCSC-VLAA/Recap-COCO-30K
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 28, 2024
    Dataset authored and provided by
    UCSC-VLAA
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Llava recaptioned COCO2014 ValSet.

    Used for text-to-image generation evaluaion. More detial can be found in What If We Recaption Billions of Web Images with LLaMA-3?

      Dataset Structure
    

    "image_id" (str): COCO image id. "coco_url" (image): the COCO image url. "caption" (str): the original COCO caption. "recaption" (str): the llava recaptioned COCO caption.

      Citation
    

    BibTeX: @article{li2024recapdatacomp, title={What If We Recaption Billions of Web Images with… See the full description on the dataset page: https://huggingface.co/datasets/UCSC-VLAA/Recap-COCO-30K.

  20. R

    Face Features Test Dataset

    • universe.roboflow.com
    zip
    Updated Dec 6, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Peter Lin (2021). Face Features Test Dataset [Dataset]. https://universe.roboflow.com/peter-lin/face-features-test
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 6, 2021
    Dataset authored and provided by
    Peter Lin
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Variables measured
    Face Features Bounding Boxes
    Description

    A simple dataset for benchmarking CreateML object detection models. The images are sampled from COCO dataset with eyes and nose bounding boxes added. It’s not meant to be serious or useful in a real application. The purpose is to look at how long it takes to train CreateML models with varying dataset and batch sizes.

    Training performance is affected by model configuration, dataset size and batch configuration. Larger models and batches require more memory. I used CreateML object detection project to compare the performance.

    Hardware

    M1 Macbook Air * 8 GPU * 4/4 CPU * 16G memory * 512G SSD

    M1 Max Macbook Pro * 24 GPU * 2/8 CPU * 32G memory * 2T SSD

    Small Dataset Train: 144 Valid: 16 Test: 8

    Results |batch | M1 ET | M1Max ET | peak mem G | |--------|:------|:---------|:-----------| |16 | 16 | 11 | 1.5 | |32 | 29 | 17 | 2.8 | |64 | 56 | 30 | 5.4 | |128 | 170 | 57 | 12 |

    Larger Dataset Train: 301 Valid: 29 Test: 18

    Results |batch | M1 ET | M1Max ET | peak mem G | |--------|:------|:---------|:-----------| |16 | 21 | 10 | 1.5 | |32 | 42 | 17 | 3.5 | |64 | 85 | 30 | 8.4 | |128 | 281 | 54 | 16.5 |

    CreateML Settings

    For all tests, training was set to Full Network. I closed CreateML between each run to make sure memory issues didn't cause a slow down. There is a bug with Monterey as of 11/2021 that leads to memory leak. I kept an eye on the memory usage. If it looked like there was a memory leak, I restarted MacOS.

    Observations

    In general, more GPU and memory with MBP reduces the training time. Having more memory lets you train with larger datasets. On M1 Macbook Air, the practical limit is 12G before memory pressure impacts performance. On M1 Max MBP, the practical limit is 26G before memory pressure impacts performance. To work around memory pressure, use smaller batch sizes.

    On the larger dataset with batch size 128, the M1Max is 5x faster than Macbook Air. Keep in mind a real dataset should have thousands of samples like Coco or Pascal. Ideally, you want a dataset with 100K images for experimentation and millions for the real training. The new M1 Max Macbooks is a cost effective alternative to building a Windows/Linux workstation with RTX 3090 24G. For most of 2021, the price of RTX 3090 with 24G is around $3,000.00. That means an equivalent windows workstation would cost the same as the M1Max Macbook pro I used to run the benchmarks.

    Full Network vs Transfer Learning

    As of CreateML 3, training with full network doesn't fully utilize the GPU. I don't know why it works that way. You have to select transfer learning to fully use the GPU. The results of transfer learning with the larger dataset. In general, the training time is faster and loss is better.

    batchET minTrain AccVal AccTest AccTop IU TrainTop IU ValidTop IU TestPeak mem Gloss
    1647519127823131.50.41
    3287521107826112.760.02
    641375238782495.30.017
    128257522137825148.40.012

    Github Project

    The source code and full results are up on Github https://github.com/woolfel/createmlbench

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Andreas Veit; Tomas Matera; Lukas Neumann; Jiri Matas; Serge Belongie (2024). COCO-Text Dataset [Dataset]. https://paperswithcode.com/dataset/coco-text

COCO-Text Dataset

Explore at:
Dataset updated
Dec 5, 2024
Authors
Andreas Veit; Tomas Matera; Lukas Neumann; Jiri Matas; Serge Belongie
Description

The COCO-Text dataset is a dataset for text detection and recognition. It is based on the MS COCO dataset, which contains images of complex everyday scenes. The COCO-Text dataset contains non-text images, legible text images and illegible text images. In total there are 22184 training images and 7026 validation images with at least one instance of legible text.

Search
Clear search
Close search
Google apps
Main menu