8 datasets found
  1. Polish-PD

    • huggingface.co
    Updated Nov 6, 2024
    Cite
    PleIAs (2024). Polish-PD [Dataset]. https://huggingface.co/datasets/PleIAs/Polish-PD
    Dataset authored and provided by
    PleIAs
    Description

    🇵🇱 Polish Public Domain 🇵🇱

    Polish-Public Domain or Polish-PD is a large collection aiming to aggregate all Polish monographies and periodicals in the public domain. As of March 2024, it is the biggest Polish open corpus.

    Dataset summary

    The collection contains 247,491 individual texts making up 2,697,414,811 words recovered from multiple sources, including Internet Archive and various European national libraries and cultural heritage institutions. Each parquet file… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/Polish-PD.

  2. Czech-PD

    • huggingface.co
    Updated Nov 6, 2024
    Cite
    PleIAs (2024). Czech-PD [Dataset]. https://huggingface.co/datasets/PleIAs/Czech-PD
    Dataset authored and provided by
    PleIAs
    Description

    🇨🇿 Czech Public Domain 🇨🇿

    Czech-Public Domain or Czech-PD is a large collection aiming to aggregate all Czech monographies and periodicals in the public domain. As of March 2024, it is the biggest Czech open corpus.

    Dataset summary

    The collection contains 1,585 individual titles making up 259,435,959 words recovered from multiple sources, including Internet Archive and various European national libraries and cultural heritage institutions. Each parquet file has the… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/Czech-PD.

  3. German-PD

    • huggingface.co
    Updated Nov 6, 2024
    Cite
    PleIAs (2024). German-PD [Dataset]. https://huggingface.co/datasets/PleIAs/German-PD
    Dataset authored and provided by
    PleIAs
    Description

    🇩🇪 German Public Domain 🇩🇪

    German-Public Domain or German-PD is a large collection aiming to aggregate all German monographies and periodicals in the public domain. As of March 2024, it is the biggest German open corpus.

    Dataset summary

    The collection contains 260,638 individual texts making up 37,650,706,611 words recovered from multiple sources, including Internet Archive and various European national libraries and cultural heritage institutions. Each parquet file… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/German-PD.

  4. Serbian-PD

    • huggingface.co
    Updated Nov 6, 2024
    Cite
    PleIAs (2024). Serbian-PD [Dataset]. https://huggingface.co/datasets/PleIAs/Serbian-PD
    Dataset authored and provided by
    PleIAs
    Description

    🇷🇸 Serbian Public Domain 🇷🇸

    Serbian-Public Domain or Serbian-PD is a large collection aiming to aggregate all Serbian monographies and periodicals in the public domain. As of March 2024, it is the biggest Serbian open corpus.

    Dataset summary

    The collection contains 1,405 titles making up 156,712,807 words recovered from multiple sources, including Internet Archive and various European national libraries and cultural heritage institutions. Each parquet file has the… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/Serbian-PD.

  5. Italian-PD

    • huggingface.co
    Updated Nov 6, 2024
    Cite
    PleIAs (2024). Italian-PD [Dataset]. https://huggingface.co/datasets/PleIAs/Italian-PD
    Dataset authored and provided by
    PleIAs
    Description

    🇮🇹 Italian Public Domain Books (Italian) 🇮🇹

    Italian-Public Domain-Book or Italian-PD-Books is a large collection aiming to aggregate all Italian monographies in the public domain. As of March 2024, it is the biggest Italian open corpus.

    Dataset summary

    The collection contains 12,945,781,983 words (171,113 titles) recovered from multiple sources, including Internet Archive and various European national libraries and cultural heritage institutions. Each parquet file… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/Italian-PD.

  6. Latin-PD

    • huggingface.co
    Updated Mar 20, 2024
    Cite
    PleIAs (2024). Latin-PD [Dataset]. https://huggingface.co/datasets/PleIAs/Latin-PD
    Dataset authored and provided by
    PleIAs
    Description

    🇲🇪 Latin Public Domain Books (Latin) 🇲🇪

    Latin-Public Domain or Latin-PD is a large collection aiming to aggregate all Latin monographies and periodicals in the public domain. As of June 2024, it is the largest Latin open corpus.

    Dataset summary

    The collection contains 16,521,454,086 words (159,070 titles) recovered from multiple sources, including the Internet Archive and various European national libraries and cultural heritage institutions (BDH, BNF). Each parquet… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/Latin-PD.

  7. Portuguese-PD

    • huggingface.co
    Updated Nov 6, 2024
    Cite
    PleIAs (2024). Portuguese-PD [Dataset]. https://huggingface.co/datasets/PleIAs/Portuguese-PD
    Dataset authored and provided by
    PleIAs
    Description

    🇵🇹 Portuguese Public Domain 🇵🇹

    Portuguese-Public Domain or Portuguese-PD is a large collection aiming to aggregate all Portuguese monographies and periodicals in the public domain. As of March 2024, it is the biggest Portuguese open corpus.

    Dataset summary

    The collection contains 7,840 individual titles making up 672,197,538 words recovered from multiple sources, including Internet Archive and various European national libraries and cultural heritage institutions.… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/Portuguese-PD.

  8. Danish-PD

    • huggingface.co
    Updated Nov 6, 2024
    Cite
    PleIAs (2024). Danish-PD [Dataset]. https://huggingface.co/datasets/PleIAs/Danish-PD
    Dataset authored and provided by
    PleIAs
    Description

    🇩🇰 Danish Public Domain 🇩🇰

    Danish-Public Domain or Danish-PD is a large collection aiming to aggregate all Danish monographies and periodicals in the public domain. As of March 2024, it is the biggest Danish open corpus.

    Dataset summary

    The collection contains 3,113 individual titles making up 322,141,347 words recovered from multiple sources, including Internet Archive and various European national libraries and cultural heritage institutions. Each parquet file has… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/Danish-PD.

