100+ datasets found
  1. The CommonCrawl Corpus

    • marketplace.sshopencloud.eu
    Updated Apr 24, 2020
    + more versions
    Cite
    (2020). The CommonCrawl Corpus [Dataset]. https://marketplace.sshopencloud.eu/dataset/93FNrL
    Explore at:
    146 scholarly articles cite this dataset (View in Google Scholar)
    Dataset updated
    Apr 24, 2020
    Description

    The Common Crawl corpus contains petabytes of data collected over 8 years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.
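    For context, Common Crawl also publishes a public CDX index API for locating captures of a given URL. A minimal sketch in Python (the crawl label CC-MAIN-2020-16 is an assumption; substitute any published crawl ID):

      import requests

      # Query the Common Crawl CDX index for captures of a URL.
      # "CC-MAIN-2020-16" is an assumed crawl ID; any published crawl works.
      resp = requests.get(
          "https://index.commoncrawl.org/CC-MAIN-2020-16-index",
          params={"url": "example.com", "output": "json"},
          timeout=30,
      )
      for line in resp.text.splitlines():
          print(line)  # one JSON record per capture (WARC filename, offset, length, ...)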

  2. A corpus of web crawl data composed of 5 billion web pages.

    • data.wu.ac.at
    Updated Oct 10, 2013
    Cite
    Global (2013). A corpus of web crawl data composed of 5 billion web pages. [Dataset]. https://data.wu.ac.at/schema/datahub_io/ZDVlZWJkNmItNThlNC00ZmE1LWE4MGQtNWUwODRjY2ZhZDk5
    Explore at:
    application/download (31232.0); available download formats
    Dataset updated
    Oct 10, 2013
    Dataset provided by
    Global
    Description

    A corpus of web crawl data composed of 5 billion web pages. This data set is freely available on Amazon S3 at s3://aws-publicdatasets/common-crawl/crawl-002/ and formatted in the ARC (.arc) file format.

    Common Crawl is a non-profit organization that builds and maintains an open repository of web crawl data for the purpose of driving innovation in research, education and technology. This data set contains web crawl data from 5 billion web pages and is released under the Common Crawl Terms of Use.
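    A minimal sketch of listing those ARC files with boto3 using anonymous credentials (the bucket and prefix come from the description above; this legacy bucket may no longer be online):

      import boto3
      from botocore import UNSIGNED
      from botocore.config import Config

      # Anonymous client: the public data set required no AWS credentials.
      s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

      # Bucket and prefix taken from the dataset description (crawl-002).
      pages = s3.get_paginator("list_objects_v2").paginate(
          Bucket="aws-publicdatasets", Prefix="common-crawl/crawl-002/"
      )
      for page in pages:
          for obj in page.get("Contents", []):
              print(obj["Key"], obj["Size"])  # one .arc file per key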

  3. NIF Registry Automated Crawl Data

    • neuinfo.org
    • rrid.site
    • +2more
    Updated Aug 29, 2012
    Cite
    (2012). NIF Registry Automated Crawl Data [Dataset]. http://identifiers.org/RRID:SCR_012862
    Dataset updated
    Aug 29, 2012
    Description

    An automated pipeline that identifies new resources in publications each month to make NIF curators more efficient. The pipeline also determines when a resource's webpage was last updated and whether its URL is still valid, helping curators see which resources need attention. Additionally, the pipeline identifies publications that reference existing NIF Registry resources, as these mentions are also of interest; they are available through the Data Federation version of the NIF Registry, http://neuinfo.org/nif/nifgwt.html?query=nlx_144509. Ranking is based on how related each candidate resource is to neuroscience (hits on neuroscience-related terms): each potential resource is assigned a relatedness score, the resources are ranked by that score, and a list is generated.

  4. Crawl-data-English

    • kaggle.com
    zip
    Updated Dec 8, 2023
    Cite
    Sushii (2023). Crawl-data-English [Dataset]. https://www.kaggle.com/datasets/sushii2512/crawl-data-english
    Explore at:
    zip (156 bytes); available download formats
    Dataset updated
    Dec 8, 2023
    Authors
    Sushii
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset was created by Sushii and released under the MIT License.

  5. Job Posts Data Crawling Project (Vietnam)

    • kaggle.com
    zip
    Updated Dec 31, 2023
    Cite
    Văn Duy Cao (2023). Job Posts Data Crawling Project (Vietnam) [Dataset]. https://www.kaggle.com/datasets/vnduycao/job-posts-data-crawling-project-vietnam
    Explore at:
    zip (53707 bytes); available download formats
    Dataset updated
    Dec 31, 2023
    Authors
    Văn Duy Cao
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Vietnam
    Description

    This is a semi-cleaned dataset containing information from job posts related to the data science field. The data was scraped from 4 websites in December 2023. The LangChain framework, together with OpenAI models, was used to support the data extraction task; for example, extracting the soft skills and tools mentioned in each job post's description.

    Here is the data schema for this dataset:

    Data schema diagram: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F14229286%2Fcd5c6bc8700ad49f34a48b61981625c4%2Fimage%20(2).png?generation=1703998231851462&alt=media

    31/12/2023: The data set's description is not finished.
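    As an illustration of the extraction step described above, here is a minimal sketch using the OpenAI Python client directly (the dataset authors used LangChain; the model name, prompt wording, and output keys below are assumptions):

      from openai import OpenAI

      client = OpenAI()  # reads OPENAI_API_KEY from the environment

      def extract_skills(job_description: str) -> str:
          # Ask the model for the soft skills and tools a job post mentions.
          # Model name and prompt are illustrative, not the authors' originals.
          response = client.chat.completions.create(
              model="gpt-4o-mini",
              messages=[{
                  "role": "user",
                  "content": "Return JSON with keys 'soft_skills' and 'tools' "
                             "for this job post:\n" + job_description,
              }],
          )
          return response.choices[0].message.content

      print(extract_skills("Data Analyst: SQL, Tableau; strong communication skills."))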

  6. Armenian language dataset from CC-100, monolingual Datasets from Web Crawl Data

    • data.opendata.am
    Updated Apr 6, 2023
    Cite
    (2023). Armenian language dataset from CC-100, monolingual Datasets from Web Crawl Data [Dataset]. https://data.opendata.am/dataset/cc100arm
    Dataset updated
    Apr 6, 2023
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Area covered
    Armenia
    Description

    Armenian language dataset extracted from the CC-100 research dataset. Description from the website: This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository, by processing January-December 2018 Common Crawl snapshots. Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open-source CC-Net repository. No claims of intellectual property are made on the work of preparing the corpus.
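    Given the file layout described above (documents separated by blank lines, paragraphs by single newlines), a minimal parsing sketch ("hy.txt" is a hypothetical filename for the Armenian shard):

      # Split a CC-100-style file into documents and paragraphs.
      with open("hy.txt", encoding="utf-8") as f:   # hypothetical filename
          text = f.read()

      documents = [
          doc.split("\n")                # paragraphs within one document
          for doc in text.split("\n\n")  # documents separated by blank lines
          if doc.strip()
      ]
      print(len(documents), "documents;", len(documents[0]), "paragraphs in the first")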

  7. LLMs.txt Bot Crawl Analysis Data

    • archeredu.com
    html
    Updated Jul 23, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Archer Education (2025). LLMs.txt Bot Crawl Analysis Data [Dataset]. https://www.archeredu.com/hemj/page/2/
    Explore at:
    html; available download formats
    Dataset updated
    Jul 23, 2025
    Dataset authored and provided by
    Archer Education
    Variables measured
    User Agent, Total Requests, Percentage Distribution
    Description

    Comprehensive crawl data showing user agent distribution and frequency for LLMs.txt file requests across eight test websites.
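    A minimal sketch of how such a user-agent distribution can be produced from web server access logs (the combined log format and the file name are assumptions, not details from the dataset):

      from collections import Counter

      # Tally user agents for requests to /llms.txt in a combined-format log.
      # "access.log" and the quoting layout are assumptions.
      counts = Counter()
      with open("access.log", encoding="utf-8") as log:
          for line in log:
              if "/llms.txt" not in line:
                  continue
              parts = line.split('"')  # user agent is the third quoted field
              if len(parts) >= 6:
                  counts[parts[5]] += 1

      total = sum(counts.values())
      for agent, n in counts.most_common(10):
          print(f"{agent}: {n} requests ({100 * n / total:.1f}%)")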

  8. Crawl data lazada

    • kaggle.com
    zip
    Updated Dec 26, 2024
    Cite
    Vân Dung (2024). Crawl data lazada [Dataset]. https://www.kaggle.com/datasets/vndung/crawl-data-lazada/code
    Explore at:
    zip (7816 bytes); available download formats
    Dataset updated
    Dec 26, 2024
    Authors
    Vân Dung
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset was created by Vân Dung and released under CC0: Public Domain.

  9. GUI-Net-Crawler

    • huggingface.co
    Updated Nov 3, 2025
    Cite
    Bofei Zhang (2025). GUI-Net-Crawler [Dataset]. https://huggingface.co/datasets/Bofeee5675/GUI-Net-Crawler
    Dataset updated
    Nov 3, 2025
    Authors
    Bofei Zhang
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    How to use this data?

    After downloading this repo, use cat to reassemble the zip file:

      cat baidu_wiki_part_* > merge.zip

    Then simply unzip it:

      unzip merge.zip

    What is in this data?

    Image (screenshot): raw images are in the images folder:

      /wikihow$ ls data/images | head -5
      1111-4.jpg
      111-15.jpg
      1-draw-7.png
      20200613_130717.jpg
      22-19.jpg

    Index page: an index page is a collection of web URLs; this is how crawling of these websites starts. wikihow$ cat… See the full description on the dataset page: https://huggingface.co/datasets/Bofeee5675/GUI-Net-Crawler.

  10. A Crawl of the Mobile Web Measuring Sensor Accesses

    • databank.illinois.edu
    • aws-databank-alb.library.illinois.edu
    Cite
    Anupam Das; Gunes Acar; Nikita Borisov; Amogh Pradeep, A Crawl of the Mobile Web Measuring Sensor Accesses [Dataset]. http://doi.org/10.13012/B2IDB-9213932_V1
    Authors
    Anupam Das; Gunes Acar; Nikita Borisov; Amogh Pradeep
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is the result of three crawls of the web performed in May 2018. It contains raw crawl data and instrumentation captured by OpenWPM-Mobile, together with analysis that identifies which scripts access mobile sensors and which perform some form of browser fingerprinting, as well as a clustering of scripts based on their intended use. The dataset is described in the included README.md file; more details about the methodology can be found in our ACM CCS'18 paper: Anupam Das, Gunes Acar, Nikita Borisov, Amogh Pradeep. The Web's Sixth Sense: A Study of Scripts Accessing Smartphone Sensors. In Proceedings of the 25th ACM Conference on Computer and Communications Security (CCS), Toronto, Canada, October 15–19, 2018. (Forthcoming)

  11. Common-Crawl-2025-June

    • huggingface.co
    Updated Jun 25, 2025
    Cite
    Shirova AI (2025). Common-Crawl-2025-June [Dataset]. https://huggingface.co/datasets/Shirova/Common-Crawl-2025-June
    Dataset updated
    Jun 25, 2025
    Dataset authored and provided by
    Shirova AI
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Common Crawl 2025 June

    Common-Crawl-2025-June is a curated, processed, and filtered dataset built from the June 2025 Common Crawl web corpus. It contains data crawled between June 1, 2025, and June 10, 2025, processed using Hugging Face's DataTrove pipeline and several AI-based content filters to remove unsafe, harmful, or low-quality text.

      Dataset Summary
    

    This dataset represents one of the latest structured Common Crawl releases with high-quality web data. The… See the full description on the dataset page: https://huggingface.co/datasets/Shirova/Common-Crawl-2025-June.
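    A minimal sketch of loading this dataset with the Hugging Face datasets library, streaming because Common Crawl derivatives are large (the split and column names are assumptions):

      from datasets import load_dataset

      # Stream rather than download in full; "train" and "text" are assumed.
      ds = load_dataset("Shirova/Common-Crawl-2025-June", split="train", streaming=True)
      for i, record in enumerate(ds):
          print(record.get("text", "")[:200])
          if i >= 2:
              break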

  12. Data from: Web Data Commons Training and Test Sets for Large-Scale Product...

    • linkagelibrary.icpsr.umich.edu
    • da-ra.de
    Updated Nov 26, 2020
    + more versions
    Cite
    Ralph Peeters; Anna Primpeli; Christian Bizer (2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 [Dataset]. http://doi.org/10.3886/E127481V1
    Dataset updated
    Nov 26, 2020
    Dataset provided by
    University of Mannheim (Germany)
    Authors
    Ralph Peeters; Anna Primpeli; Christian Bizer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, for each training set there are sets of ids available for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
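    For context, a stratified validation draw like the one described can be reproduced with scikit-learn; a minimal sketch (the file name and the "label" column are assumptions, not the WDC release's exact layout):

      import pandas as pd
      from sklearn.model_selection import train_test_split

      # Load a training set of labeled product pairs; file and column
      # names are assumptions.
      pairs = pd.read_json("computers_train_small.json.gz", lines=True)

      # Stratified random draw: validation keeps the match/no-match ratio.
      train_df, valid_df = train_test_split(
          pairs, test_size=0.2, stratify=pairs["label"], random_state=42
      )
      print(len(train_df), "training pairs,", len(valid_df), "validation pairs")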

  13. Data from: esCorpius: A Massive Spanish Crawling Corpus

    • live.european-language-grid.eu
    binary format
    Updated Jun 30, 2022
    + more versions
    Cite
    (2022). esCorpius: A Massive Spanish Crawling Corpus [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/20458
    Explore at:
    binary format; available download formats
    Dataset updated
    Jun 30, 2022
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    In recent years, Transformer-based models have led to significant advances in language modelling for natural language processing. However, they require a vast amount of data to be (pre-)trained, and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, the results in Spanish present important shortcomings, as they are either too small in comparison with other languages or of low quality derived from sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content. Our data curation process involves a novel highly parallel cleaning pipeline and encompasses a series of deduplication mechanisms that together ensure the integrity of both document and paragraph boundaries. Additionally, we retain both the source web page URL and the WARC shard origin URL in order to comply with EU regulations. esCorpius has been released under the CC BY-NC-ND 4.0 license.
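    As an illustration of the deduplication idea (not the authors' actual pipeline), a minimal sketch that drops exact-duplicate paragraphs by hashing:

      import hashlib

      def dedupe_paragraphs(paragraphs):
          # Keep the first occurrence of each paragraph, comparing by hash.
          # Exact-match dedup only; esCorpius's pipeline is far more elaborate.
          seen, kept = set(), []
          for p in paragraphs:
              digest = hashlib.sha1(p.strip().lower().encode("utf-8")).hexdigest()
              if digest not in seen:
                  seen.add(digest)
                  kept.append(p)
          return kept

      print(dedupe_paragraphs(["Hola mundo.", "hola mundo.", "Otro párrafo."]))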

  14. retweet-crawl

    • networkrepository.com
    csv
    Updated Aug 18, 2018
    Cite
    Network Data Repository (2018). retweet-crawl [Dataset]. https://networkrepository.com/rt-retweet-crawl.php
    Explore at:
    csv; available download formats
    Dataset updated
    Aug 18, 2018
    Dataset authored and provided by
    Network Data Repository
    License

    https://networkrepository.com/policy.php

    Description

    Twitter retweet network: nodes are Twitter users and edges are retweets, collected from various social and political hashtags.
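    A minimal sketch of loading such an edge list with networkx ("rt-retweet-crawl.edges" is an assumed file name for the downloaded archive's contents):

      import networkx as nx

      # Build the retweet graph from an edge list of user-id pairs.
      G = nx.read_edgelist("rt-retweet-crawl.edges", nodetype=int)
      print(G.number_of_nodes(), "users,", G.number_of_edges(), "retweet edges")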

  15. Data from: Wide range screening of algorithmic bias in word embedding models using large sentiment lexicons reveals underreported bias types

    • datadryad.org
    • datasetcatalog.nlm.nih.gov
    • +1more
    zip
    Updated Apr 7, 2020
    Cite
    David Rozado (2020). Wide range screening of algorithmic bias in word embedding models using large sentiment lexicons reveals underreported bias types [Dataset]. http://doi.org/10.5061/dryad.rbnzs7h7w
    Explore at:
    zip; available download formats
    Dataset updated
    Apr 7, 2020
    Dataset provided by
    Dryad
    Authors
    David Rozado
    Time period covered
    Mar 22, 2020
    Description

    This dataset collects several popular pre-trained word embedding models:

    -Word2vec Skip-Gram trained on the Google News corpus (100B tokens): https://code.google.com/archive/p/word2vec/

    -GloVe trained on Wikipedia 2014 + Gigaword 5 (6B tokens): http://nlp.stanford.edu/data/glove.6B.zip

    -GloVe trained on a 2B-tweet Twitter corpus (27B tokens): http://nlp.stanford.edu/data/glove.twitter.27B.zip

    -GloVe trained on Common Crawl (42B tokens): http://nlp.stanford.edu/data/glove.42B.300d.zip

    -GloVe trained on Common Crawl (840B tokens): http://nlp.stanford.edu/data/glove.840B.300d.zip

    -FastText trained with subword information on Wikipedia 2017, the UMBC webbase corpus and the statmt.org news dataset (16B tokens): https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.vec.zip

    -FastText trained with subword information on Common Crawl (600B tokens): https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
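    A minimal sketch of loading one of the GloVe files listed above into a dictionary ("glove.6B.300d.txt" is the file obtained by unzipping glove.6B.zip; treat the name as an assumption):

      import numpy as np

      def load_glove(path):
          # Each line is a word followed by its vector components.
          vectors = {}
          with open(path, encoding="utf-8") as f:
              for line in f:
                  parts = line.rstrip().split(" ")
                  vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
          return vectors

      emb = load_glove("glove.6B.300d.txt")  # assumed extracted file name
      print(emb["king"].shape)  # (300,)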

  16. Common Crawl

    • registry.opendata.aws
    Updated Apr 18, 2018
    Cite
    Common Crawl (2018). Common Crawl [Dataset]. https://registry.opendata.aws/commoncrawl/
    Dataset updated
    Apr 18, 2018
    Dataset provided by
    Common Crawl: http://commoncrawl.org/
    Description

    A corpus of web crawl data composed of over 300 billion web pages.

  17. RDFa, Microdata, and Microformat Data Set

    • data.wu.ac.at
    html
    Updated Aug 3, 2014
    + more versions
    Cite
    Web Data Commons (2014). RDFa, Microdata, and Microformat Data Set [Dataset]. https://data.wu.ac.at/schema/datahub_io/MDhkYWU2ODMtNmFjYi00NDgxLWFjODMtMjFjOGUzYTVlNzFm
    Explore at:
    html; available download formats
    Dataset updated
    Aug 3, 2014
    Dataset provided by
    Web Data Commons
    Description

    More and more websites have started to embed structured data describing products, people, organizations, places, and events into their HTML pages using markup standards such as RDFa, Microdata and Microformats. The Web Data Commons project extracts this data from several billion web pages, provides the extracted data for download, and publishes statistics about the deployment of the different formats.
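    For context, this kind of embedded markup can be pulled from a single page with the third-party extruct library; a minimal sketch (the URL is illustrative, and extruct is not part of the Web Data Commons tooling):

      import requests
      import extruct

      # Fetch a page and extract embedded RDFa, Microdata and microformats.
      url = "https://example.com/product"  # illustrative URL
      html = requests.get(url, timeout=30).text
      data = extruct.extract(html, base_url=url,
                             syntaxes=["rdfa", "microdata", "microformat"])
      for syntax, items in data.items():
          print(syntax, len(items), "items")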

  18. Válasz

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Apr 13, 2022
    + more versions
    Cite
    Gábor Palkó; Balázs Indig; Zsófia Fellegi; Zsófia Sárközi-Lindner (2022). Válasz [Dataset]. http://doi.org/10.5281/zenodo.5849730
    Explore at:
    bin; available download formats
    Dataset updated
    Apr 13, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Gábor Palkó; Balázs Indig; Zsófia Fellegi; Zsófia Sárközi-Lindner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 13, 2001 - Aug 3, 2018
    Description

    This object has been made as part of the web harvesting project of the Eötvös Loránd University Department of Digital Humanities (ELTE DH). Learn more about the workflow HERE and about the software used HERE. The aim of the project is to make online news articles and their metadata suitable for research purposes. The archiving workflow is designed to prevent modification or manipulation of the downloaded content. The current version of the curated content, with normalized formatting in standard TEI XML format and Schema.org-encoded metadata, is available HERE. The detailed description of the raw content is the following:

    • The portal's archived content (from 2001-04-13 to 2018-08-03) is available in WARC format HERE (crawled: 2019-09-01 11:36:19.949569 - 2021-03-06 20:24:33.398056). The crawling happened in multiple phases, hence the date intervals are unusually wide. No further versions are expected because the crawl was created after the portal stopped publication.

    Please fill in the following form before requesting access to this dataset: ACCESS FORM

  19. crawl-books

    • huggingface.co
    Updated Oct 19, 2023
    Cite
    Hussein F. Hussein (2023). crawl-books [Dataset]. https://huggingface.co/datasets/krvhrv/crawl-books
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Oct 19, 2023
    Authors
    Hussein F. Hussein
    Description

    Dataset Card for "crawl-books"

    More Information needed

  20. Chinese-Common-Crawl-Filtered

    • huggingface.co
    Updated Jun 2, 2025
    + more versions
    Cite
    Jed Cheng (2025). Chinese-Common-Crawl-Filtered [Dataset]. https://huggingface.co/datasets/jed351/Chinese-Common-Crawl-Filtered
    Dataset updated
    Jun 2, 2025
    Authors
    Jed Cheng
    Description

    Traditional Chinese C4

      Dataset Summary
    

    Data obtained from the 2025-18 and 2025-13 Common Crawl snapshots, downloaded and processed using code based on another project attempting to recreate the C4 dataset. The resulting dataset contains both simplified and traditional Chinese. It was then filtered using a modified list of simplified Chinese characters to obtain another, traditional-Chinese-only dataset. I am still ironing out the filtering process. The 2025-13 dataset was deduplicated… See the full description on the dataset page: https://huggingface.co/datasets/jed351/Chinese-Common-Crawl-Filtered.
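    As an illustration of the character-based filtering described above (a toy character set, not the author's actual list), a minimal sketch that drops records containing simplified-only characters:

      # Keep records whose text avoids simplified-only characters.
      # The real filter uses a much longer, curated list.
      SIMPLIFIED_ONLY = set("这简国门书东车")  # tiny illustrative sample

      def is_traditional(text: str) -> bool:
          return not any(ch in SIMPLIFIED_ONLY for ch in text)

      records = ["這是繁體中文。", "这是简体中文。"]
      print([r for r in records if is_traditional(r)])  # traditional record only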
