7 datasets found
  1. Post-OCR-Correction

    • huggingface.co
    • opendatalab.com
    Updated Apr 26, 2024
    PleIAs (2024). Post-OCR-Correction [Dataset]. https://huggingface.co/datasets/PleIAs/Post-OCR-Correction
    Dataset updated
    Apr 26, 2024
    Dataset authored and provided by
    PleIAs
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Post-OCR correction is a large corpus of 1 billion words containing original texts with a varying number of OCR mistakes and an experimental multilingual post-OCR correction output created by Pleias. Generation of Post-OCR correction was performed using HPC resources from GENCI–IDRIS (Grant 2023-AD011014736) on Jean-Zay.

    All the texts come from collections integrated into Common Corpus, the largest open corpus for pretraining previously released by Pleias on HuggingFace.… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/Post-OCR-Correction.

  2. Pleias-French-Post-OCR-Correction-Instructions

    • huggingface.co
    Updated Feb 20, 2025
    Emanuela Boros (2025). Pleias-French-Post-OCR-Correction-Instructions [Dataset]. https://huggingface.co/datasets/emanuelaboros/Pleias-French-Post-OCR-Correction-Instructions
    Dataset updated
    Feb 20, 2025
    Authors
    Emanuela Boros
    Description

    emanuelaboros/Pleias-French-Post-OCR-Correction-Instructions dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. Pleias-French-Post-OCR-Correction-Corrected-Instructions-4000

    • huggingface.co
    Emanuela Boros, Pleias-French-Post-OCR-Correction-Corrected-Instructions-4000 [Dataset]. https://huggingface.co/datasets/emanuelaboros/Pleias-French-Post-OCR-Correction-Corrected-Instructions-4000
    Authors
    Emanuela Boros
    Description

    emanuelaboros/Pleias-French-Post-OCR-Correction-Corrected-Instructions-4000 dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. Latin-PD

    • huggingface.co
    Updated Mar 20, 2024
    PleIAs (2024). Latin-PD [Dataset]. https://huggingface.co/datasets/PleIAs/Latin-PD
    Dataset updated
    Mar 20, 2024
    Dataset authored and provided by
    PleIAs
    Description

    🇲🇪 Latin Public Domain Books (Latin) 🇲🇪

    Latin-Public Domain or Latin-PD is a large collection aiming to aggregate all Latin monographs and periodicals in the public domain. As of June 2024, it is the largest Latin open corpus.

    Dataset summary

    The collection contains 16,521,454,086 words (159,070 titles) recovered from multiple sources, including the Internet Archive and various European national libraries and cultural heritage institutions (BDH, BNF). Each parquet… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/Latin-PD.

  5. Danish-PD

    • huggingface.co
    Updated Nov 6, 2024
    PleIAs (2024). Danish-PD [Dataset]. https://huggingface.co/datasets/PleIAs/Danish-PD
    Dataset updated
    Nov 6, 2024
    Dataset authored and provided by
    PleIAs
    Description

    🇩🇰 Danish Public Domain 🇩🇰

    Danish-Public Domain or Danish-PD is a large collection aiming to aggregate all Danish monographs and periodicals in the public domain. As of March 2024, it is the biggest Danish open corpus.

    Dataset summary

    The collection contains 3113 individual titles making up 322,141,347 words recovered from multiple sources, including Internet Archive and various European national libraries and cultural heritage institutions. Each parquet file has… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/Danish-PD.

  6. Czech-PD

    • huggingface.co
    Updated Nov 6, 2024
    PleIAs (2024). Czech-PD [Dataset]. https://huggingface.co/datasets/PleIAs/Czech-PD
    Dataset updated
    Nov 6, 2024
    Dataset authored and provided by
    PleIAs
    Description

    🇨🇿 Czech Public Domain 🇨🇿

    Czech-Public Domain or Czech-PD is a large collection aiming to aggregate all Czech monographs and periodicals in the public domain. As of March 2024, it is the biggest Czech open corpus.

    Dataset summary

    The collection contains 1585 individual titles making up 259,435,959 words recovered from multiple sources, including Internet Archive and various European national libraries and cultural heritage institutions. Each parquet file has the… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/Czech-PD.

  7. Portuguese-PD

    • huggingface.co
    Updated Nov 6, 2024
    PleIAs (2024). Portuguese-PD [Dataset]. https://huggingface.co/datasets/PleIAs/Portuguese-PD
    Dataset updated
    Nov 6, 2024
    Dataset authored and provided by
    PleIAs
    Description

    🇵🇹 Portuguese Public Domain 🇵🇹

    Portuguese-Public Domain or Portuguese-PD is a large collection aiming to aggregate all Portuguese monographs and periodicals in the public domain. As of March 2024, it is the biggest Portuguese open corpus.

    Dataset summary

    The collection contains 7,840 individual titles making up 672,197,538 words recovered from multiple sources, including Internet Archive and various European national libraries and cultural heritage institutions.… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/Portuguese-PD.


Post-OCR-Correction

PleIAs/Post-OCR-Correction

232 scholarly articles cite this dataset (View in Google Scholar)