100+ datasets found
  1. Common Crawl Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Oct 8, 2014
    + more versions
    Cite
    Common Crawl Dataset [Dataset]. https://paperswithcode.com/dataset/common-crawl
    Description

    The Common Crawl corpus contains petabytes of data collected over 12 years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.
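    Because the corpus sits on AWS Public Data Sets, individual crawl files can be fetched anonymously. A minimal sketch, assuming the public https://data.commoncrawl.org HTTPS mirror; the snapshot label "CC-MAIN-2024-10" is just one example crawl name:

    import gzip
    import urllib.request

    # List the WET (extracted text) files of one crawl snapshot.
    url = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-10/wet.paths.gz"
    with urllib.request.urlopen(url) as resp:
        paths = gzip.decompress(resp.read()).decode("utf-8").splitlines()
    print(paths[0])  # relative path of the first WET file in the snapshot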

  2. CommonCrawl-CreativeCommons

    • huggingface.co
    Updated Jan 28, 2025
    + more versions
    Cite
    CommonCrawl-CreativeCommons [Dataset]. https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons
    Authors
    Bram Vanroy
    License

    https://choosealicense.com/licenses/cc/

    Description

    Raw CommonCrawl crawls, annotated with potential Creative Commons license information

    The licensing information is extracted from the web pages based on whether they link to Creative Commons licenses, but false positives may occur. While further filtering based on the location type of the license should improve precision (e.g. by removing hyperlink (a_tag) references), false positives may still occur; see the Recommendations and Caveats section on the dataset page.

      Usage
    

    from datasets import… See the full description on the dataset page: https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons.
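    The usage snippet above is cut off on this page; a hedged sketch of the usual Hugging Face loading pattern (the split name is an assumption, and a config name may also be required, so consult the dataset card):

    from datasets import load_dataset

    # Streaming avoids downloading the full crawl before inspecting records.
    ds = load_dataset("BramVanroy/CommonCrawl-CreativeCommons", streaming=True)
    first = next(iter(ds["train"]))  # split name "train" is an assumption
    print(sorted(first))             # field names of one record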

  3. Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2021 – VERSION 1)

    • lindat.mff.cuni.cz
    Updated Dec 3, 2024
    + more versions
    Cite
    Jan Oliver Rüdiger (2024). Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2021 – VERSION 1) [Dataset]. https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-5809
    Authors
    Jan Oliver Rüdiger
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description


    The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed to enable a broad-based linguistic analysis of the German-language (visible) internet over time, with the aim of achieving comparability with the DeReKo (‘German Reference Corpus’ of the Leibniz Institute for the German Language; volume: 57 billion tokens; status: DeReKo Release 2024-I). The corpus is separated by year (here: year 2021) and versioned (here: version 1). Version 1 comprises 97.45 billion tokens across all years (2013-2024).

    The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/). CommonCrawl is a non-profit organisation that provides copies of the visible Internet free of charge for research purposes.

    The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages ending in the following TLDs were taken into account: ‘.at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich’. These are the exclusive German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024 - TLDs with a purely corporate reference (e.g. ‘.edeka; .bmw; .ford’) were excluded. The language of the individual documents (URLs) was then estimated with the help of NTextCat (https://github.com/ivanakcheurov/ntextcat) (via the CORE14 profile of NTextCat) - only those documents/URLs for which German was the most likely language were processed further (e.g. to exclude foreign-language material such as individual subpages). The third step involved filtering for manual selectors and filtering for 1:1 duplicates (within one year).
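    Illustratively, the TLD pre-filter could look like the following sketch (not the authors' actual code, which used CorpusExplorer and supplementary scripts):

    from urllib.parse import urlparse

    GERMAN_TLDS = {
        "at", "bayern", "berlin", "ch", "cologne", "de", "gmbh", "hamburg",
        "koeln", "nrw", "ruhr", "saarland", "swiss", "tirol", "wien", "zuerich",
    }

    def has_german_tld(url: str) -> bool:
        """Keep only URLs whose TLD is in the exclusive German-language set."""
        host = urlparse(url).hostname or ""
        return host.rsplit(".", 1)[-1].lower() in GERMAN_TLDS

    urls = ["https://example.de/seite", "https://example.com/page"]
    print([u for u in urls if has_german_tld(u)])  # keeps only example.de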

    The filtering and subsequent processing was carried out using CorpusExplorer (http://hdl.handle.net/11234/1-2634) and our own (supplementary) scripts, and the TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation. The corpus was processed on the HELIX HPC cluster. The author would like to take this opportunity to thank the state of Baden-Württemberg and the German Research Foundation (DFG) for the possibility to use the bwHPC/HELIX HPC cluster - funding code HPC cluster: INST 35/1597-1 FUGG.

    Data content:
    • Tokens and record boundaries
    • Automatic lemma and POS annotation (using TreeTagger)
    • Metadata:
      • GUID: unique identifier of the document
      • YEAR: year of capture (please use this information for data slices)
      • Url: full URL
      • Tld: top-level domain
      • Domain: domain without TLD (but with sub-domains if applicable)
      • DomainFull: complete domain (incl. TLD)
      • Datum (system information): date recorded by CorpusExplorer (date of capture by CommonCrawl, not date of creation/modification of the document)
      • Hash (system information): SHA1 hash of the CommonCrawl
      • Pfad (system information): path on the cluster (raw data), supplied by the system

    Please note that the files are saved as *.cec6.gz. These are binary files of the CorpusExplorer (see above). These files ensure efficient archiving. You can use both CorpusExplorer and the ‘CEC6-Converter’ (available for Linux, MacOS and Windows - see: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-5705) to convert the data. The data can be exported in the following formats:

    • CATMA v6
    • CoNLL
    • CSV
    • CSV (only meta-data)
    • DTA TCF-XML
    • DWDS TEI-XML
    • HTML
    • IDS I5-XML
    • IDS KorAP XML
    • IMS Open Corpus Workbench
    • JSON
    • OPUS Corpus Collection XCES
    • Plaintext
    • SaltXML
    • SlashA XML
    • SketchEngine VERT
    • SPEEDy/CODEX (JSON)
    • TLV-XML
    • TreeTagger
    • TXM
    • WebLicht
    • XML

    Please note that exporting increases storage requirements considerably. The ‘CorpusExplorerConsole’ (https://github.com/notesjor/CorpusExplorer.Terminal.Console - available for Linux, MacOS and Windows) also offers a simple solution for editing and analysing the data. If you have any questions, please contact the author.

    Legal information: The data was downloaded on 01.11.2024. Its use, processing and distribution are subject to §60d UrhG (German copyright law), which authorises use for non-commercial purposes in research and teaching. LINDAT/CLARIN is responsible for long-term archiving in accordance with §69d para. 5 and ensures that only authorised persons can access the data. The data has been checked to the best of our knowledge and belief (on a random basis). Should you nevertheless find legal violations (e.g. right to be forgotten, personal rights, etc.), please write an e-mail to the author (amc_report@jan-oliver-ruediger.de) with the following information: 1) why this content is undesirable (please outline only briefly) and 2) how the content can be identified, e.g. file name, URL or domain. The author will endeavour to identify and remove the content and re-upload the data (modified) within two weeks (new version).

  4. Crawl attributes ranking by importance on websites in France 2020

    • statista.com
    • flwrdeptvarieties.store
    Updated Nov 30, 2022
    Cite
    Crawl attributes ranking by importance on websites in France 2020 [Dataset]. https://www.statista.com/statistics/1220602/crawl-attributes-ranking-by-importance-on-websites-france/
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2020
    Area covered
    France
    Description

    Ensuring crawlability and mobile indexing makes it easier for crawlers and Internet users to visit a site, and for search engines to discover it. According to the source, in 2020 more than 60 percent of SEOs attached great importance to internal linking, that is, the presence of internal links pointing to the page to be highlighted. They considered all the criteria in the crawl category to be important, with the exception of the indication of priority in the sitemap, which only 11 percent of SEOs considered meaningful, with an average importance of 1.53 out of five.

  5. wdc-common-crawl-embedded-jsonld

    • huggingface.co
    Updated Aug 9, 2013
    Cite
    Louis Maddox (2013). wdc-common-crawl-embedded-jsonld [Dataset]. https://huggingface.co/datasets/permutans/wdc-common-crawl-embedded-jsonld
    Available download formats
    Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Authors
    Louis Maddox
    Description

    permutans/wdc-common-crawl-embedded-jsonld dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. CommonCrawl News Articles by Political Orientation

    • webis.de
    • anthology.aicmu.ac.cn
    Updated 2022
    Cite
    Maximilian Keiff; Henning Wachsmuth (2022). CommonCrawl News Articles by Political Orientation [Dataset]. http://doi.org/10.5281/zenodo.7476697
    Dataset provided by
    The Web Technology & Information Systems Network
    Leibniz Universität Hannover
    Authors
    Maximilian Keiff; Henning Wachsmuth
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset includes news articles gathered from CommonCrawl for media outlets that were selected based on their political orientation. The news articles span publication dates from 2010 to 2021.

  7. GloVe: Common Crawl 42B tokens

    • kaggle.com
    zip
    Updated Jan 20, 2020
    Cite
    Gerwyn (2020). GloVe: Common Crawl 42B tokens [Dataset]. https://www.kaggle.com/gerwynng/glove-common-crawl-42b-tokens
    Available download formats
    zip (1928408067 bytes)
    Authors
    Gerwyn
    Description

    Dataset

    This dataset was created by Gerwyn


  8. NIF Registry Automated Crawl Data

    • rrid.site
    • scicrunch.org
    • +2more
    Updated Feb 17, 2025
    Cite
    (2025). NIF Registry Automated Crawl Data [Dataset]. http://identifiers.org/RRID:SCR_012862
    Description

    An automatic pipeline, based on an algorithm that identifies new resources in publications every month, built to improve the efficiency of NIF curators. The pipeline can also find the last time a resource's webpage was updated and whether its URL is still valid, helping curators know which resources need attention. Additionally, the pipeline identifies publications that reference existing NIF Registry resources, as these mentions are also of interest; they are available through the Data Federation version of the NIF Registry, http://neuinfo.org/nif/nifgwt.html?query=nlx_144509. The RDF ranking is based on hits of neuroscience-related terms: each potential resource is assigned a score based on how related it is to neuroscience, and the resources are then ranked and a list is generated.
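    As a toy illustration of that scoring idea (the term list and candidate records below are invented for the example):

    NEURO_TERMS = {"neuron", "cortex", "synapse", "fmri", "hippocampus"}

    def neuro_score(text: str) -> int:
        # Count hits of neuroscience-related terms in a description.
        words = text.lower().split()
        return sum(words.count(term) for term in NEURO_TERMS)

    candidates = {
        "Resource A": "a database of cortex and hippocampus neuron recordings",
        "Resource B": "a directory of agricultural suppliers",
    }
    ranked = sorted(candidates, key=lambda n: neuro_score(candidates[n]), reverse=True)
    print(ranked)  # ['Resource A', 'Resource B']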

  9. FastText Common Crawl bin model

    • kaggle.com
    zip
    Updated Nov 20, 2019
    Cite
    Arthur Stsepanenka (2019). FastText Common Crawl bin model [Dataset]. https://www.kaggle.com/kingarthur7/fasttext-common-crawl-bin-model
    Available download formats
    zip (4506059373 bytes)
    Authors
    Arthur Stsepanenka
    Description

    Dataset

    This dataset was created by Timo Bozsolik


  10. Full-population web crawl of .gov.uk web domain, 2014

    • datacatalogue.cessda.eu
    Updated Mar 26, 2025
    + more versions
    Cite
    Nicholls, T, Oxford Internet Institute (2025). Full-population web crawl of .gov.uk web domain, 2014 [Dataset]. http://doi.org/10.5255/UKDA-SN-852205
    Dataset provided by
    University of Oxford
    Authors
    Nicholls, T, Oxford Internet Institute
    Time period covered
    Apr 14, 2014 - Oct 29, 2014
    Area covered
    United Kingdom
    Variables measured
    Other
    Measurement technique
    A web crawl was carried out with Heritrix, the Internet Archive's web crawler. A list of all registered domains in .gov.uk (and their www.x.gov.uk equivalents) was used as a set of start seeds. Sites outside .gov.uk were excluded; robots.txt files were respected, with the consequence that some .gov.uk sites (and some parts of other .gov.uk sites) were not fetched. Certain other areas were manually excluded, particularly crawling traps (e.g. calendars which will serve infinite numbers of pages in the past and future, and websites returning different URLs for each browser session) and the contents of certain large peripheral databases such as online local authority library catalogues. A full set of regular expressions used to filter the URLs fetched is included in the archive.

    On completion of the crawl, the page URLs and link data were extracted from the output WARC files. The page URLs were manually examined and re-filtered to handle various broken web servers and to reduce duplication of content where multiple views were presented onto the same content (for example, where a site was presented at both http://organisation.gov.uk/ and http://www.organisation.gov.uk/ without HTTP redirection between the two). Finally, the link list was filtered against the URL list to remove bogus links, and both lists were map/reduced to a single set of files.

    Also included in this data release is a derived dataset more useful for high-level work. This is a GraphML file containing all the link and page information reduced to third-level-domain level (so darlington.gov.uk is considered as a single node, not a large set of pages), with the links binarised to present/not present between each node. Each graph node also has various attributes, including the name of the registering organisation and various webometric measures including PageRank, indegree and betweenness centrality.
    Description

    This dataset is the result of a full-population crawl of the .gov.uk web domain, aiming to capture a full picture of the scope of public-facing government activity online and the links between different government bodies.

    Local governments have been developing online services, aiming to better serve the public and reduce administrative costs. However, the impact of this work, and the links between governments’ online and offline activities, remain uncertain. The overall research question examines whether local e-government has met the expectations of Digital Era Governance and of its practitioners. The aim was to directly analyse the structure and content of government online. The project shows that recent digital-centric public administration theories, typified by the Digital Era Governance quasi-paradigm, are not empirically supported by the UK local government experience.

    The data consist of a file of individual Uniform Resource Locators (URLs) fetched during the crawl, and a further file containing pairs of URLs reflecting the Hypertext Markup Language (HTML) links between them. In addition, a GraphML format file is presented for a version of the data reduced to third-level-domains, with accompanying attribute data for the publishing government organisations and calculated webometric statistics based on the third-level-domain link network.
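    The GraphML file can be explored directly with standard graph tooling; a minimal sketch, assuming the networkx library (the filename here is a placeholder, and measures such as PageRank also ship as node attributes, so recomputing them is optional):

    import networkx as nx

    g = nx.read_graphml("govuk_third_level_domains.graphml")  # placeholder name
    pagerank = nx.pagerank(g)
    top = sorted(pagerank, key=pagerank.get, reverse=True)[:10]
    print(top)  # the ten most central third-level domains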

    This project engages with the Digital Era Governance (DEG) work of Dunleavy et al. and draws upon new empirical methods to explore local government and its use of Internet-related technology. It challenges the existing literature, arguing that e-government benefits have been oversold, particularly for transactional services; it updates DEG with insights from local government.

    The distinctive methodological approach is to use full-population datasets and large-scale web data to provide an empirical foundation for theoretical development, and to test existing theorists’ claims. A new full-population web crawl of .gov.uk is used to analyse the shape and structure of online government using webometrics. Tools from computer science, such as automated classification, are used to enrich our understanding of the dataset. A new full-population panel dataset is constructed covering council performance, cost, web quality, and satisfaction.

    The local government web shows a wide scope of provision but only limited evidence in support of the existing rhetorics of Internet-enabled service delivery. In addition, no evidence is found of a link between web development and performance, cost, or satisfaction. DEG is challenged and developed in light of these findings.

    The project adds value by developing new methods for the use of big data in public administration, by empirically challenging long-held assumptions on the value of the web for government, and by building a foundation of knowledge about local government online to be built on by further research.

    This is an ESRC-funded DPhil research project.

  11. crawl-300d-2M

    • kaggle.com
    zip
    Updated Apr 14, 2019
    + more versions
    Cite
    Josh Ko (2019). crawl-300d-2M [Dataset]. https://www.kaggle.com/nowave/crawl300d2m
    Available download formats
    zip (1545551987 bytes)
    Authors
    Josh Ko
    Description

    Dataset

    This dataset was created by Josh Ko


  12. Fast Text Word Embeddings 300d 1M

    • kaggle.com
    Updated Jun 20, 2019
    Cite
    Manish Maharjan (2019). Fast Text Word Embeddings 300d 1M [Dataset]. https://www.kaggle.com/datasets/mmanishh/fast-text-word-embeddings/versions/1
    Available download formats
    Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Manish Maharjan
    Description

    300-dimensional pretrained FastText English word vectors released by Facebook.

    The first line of the file contains the number of words in the vocabulary and the size of the vectors. Each subsequent line contains a word followed by its vector values, as in the default fastText text format. Values are space-separated, and words are ordered by descending frequency.
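    Given that layout, a short parsing sketch (the filename argument and the word limit are placeholders):

    import numpy as np

    def load_vec(path: str, limit: int = 50_000) -> dict:
        """Read a fastText .vec text file into a {word: vector} dict."""
        vectors = {}
        with open(path, encoding="utf-8", errors="ignore") as f:
            n_words, dim = map(int, f.readline().split())  # header line
            for i, line in enumerate(f):
                if i >= limit:
                    break
                parts = line.rstrip().split(" ")
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
        return vectors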

  13. olm-CC-MAIN-2022-49-sampling-ratio-olm-0.15114822547

    • huggingface.co
    Updated Dec 15, 2022
    + more versions
    Cite
    Online Language Modelling (2022). olm-CC-MAIN-2022-49-sampling-ratio-olm-0.15114822547 [Dataset]. https://huggingface.co/datasets/olm/olm-CC-MAIN-2022-49-sampling-ratio-olm-0.15114822547
    Available download formats
    Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset authored and provided by
    Online Language Modelling
    License

    https://choosealicense.com/licenses/undefined/

    Description

    Dataset Card for OLM November/December 2022 Common Crawl

    Cleaned and deduplicated pretraining dataset, created with the OLM repo from 15% of the November/December 2022 Common Crawl snapshot. Note: last_modified_timestamp was parsed from whatever a website returned in its Last-Modified header; there are likely a small number of incorrect outliers, so we recommend removing the outliers before doing statistics with last_modified_timestamp.
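    One possible way to follow that recommendation, assuming the timestamps are in a pandas Series (the plausible-range bounds below are an assumption, not part of the dataset card):

    import pandas as pd

    ts = pd.Series(pd.to_datetime(
        ["2022-11-02", "2022-12-01", "1970-01-01", "2095-06-15"]  # toy values
    ))
    mask = (ts >= "1995-01-01") & (ts <= "2022-12-31")  # drop implausible dates
    clean = ts[mask]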

  14. German CBOW FastText embeddings with min count 250

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 26, 2021
    + more versions
    Cite
    Bocharov, Victor (2021). German CBOW FastText embeddings with min count 250 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5598143
    Dataset authored and provided by
    Bocharov, Victor
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    FastText embeddings built from the Common Crawl German dataset.

    Parameters

    • Dimensions: 256 and 384
    • Context window: 5
    • Negative sampled: 10
    • Epochs: 1
    • Number of buckets: 131072 or 262144
    • Min n: 3
    • Max n: 6

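    For reference, a hedged sketch of reproducing these settings with gensim's FastText implementation (the embeddings themselves were presumably built with the fastText tool; gensim's parameter names are the closest equivalents):

    from gensim.models import FastText

    model = FastText(
        vector_size=256,  # or 384, per the parameter list above
        window=5,         # context window
        negative=10,      # negative sampling
        epochs=1,
        bucket=131072,    # or 262144
        min_n=3,
        max_n=6,
        sg=0,             # CBOW, as the dataset title states
    )
    # model.build_vocab(...) and model.train(...) on the Common Crawl
    # German text would follow.
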
  15. Data from: esCorpius: A Massive Spanish Crawling Corpus

    • lindat.cz
    • live.european-language-grid.eu
    • +1more
    Updated Sep 10, 2022
    + more versions
    Cite
    Gutiérrez-Fandiño Asier; Pérez-Fernández David; Armengol-Estapé Jordi; Griol David; Callejas Zoraida (2022). esCorpius: A Massive Spanish Crawling Corpus [Dataset]. https://lindat.cz/repository/xmlui/handle/11372/LRT-4807?show=full
    Authors
    Gutiérrez-Fandiño Asier; Pérez-Fernández David; Armengol-Estapé Jordi; Griol David; Callejas Zoraida
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    In recent years, Transformer-based models have led to significant advances in language modelling for natural language processing. However, they require vast amounts of data to be (pre-)trained, and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, the results in Spanish present important shortcomings, as they are either too small in comparison with other languages or of low quality derived from sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content. Our data curation process involves a novel highly parallel cleaning pipeline and encompasses a series of deduplication mechanisms that together ensure the integrity of both document and paragraph boundaries. Additionally, we maintain both the source web page URL and the WARC shard origin URL in order to comply with EU regulations. esCorpius has been released under a CC BY-NC-ND 4.0 license.

  16. Data from: mC4 Dataset

    • paperswithcode.com
    • opendatalab.com
    • +1more
    Updated Jun 8, 2022
    Cite
    Linting Xue; Noah Constant; Adam Roberts; Mihir Kale; Rami Al-Rfou; Aditya Siddhant; Aditya Barua; Colin Raffel (2022). mC4 Dataset [Dataset]. https://paperswithcode.com/dataset/mc4
    Authors
    Linting Xue; Noah Constant; Adam Roberts; Mihir Kale; Rami Al-Rfou; Aditya Siddhant; Aditya Barua; Colin Raffel
    Description

    mC4 is a multilingual variant of the C4 dataset, comprising natural text in 101 languages drawn from the public Common Crawl web scrape.
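    A minimal loading sketch, assuming the Hugging Face hosting of mC4 under the allenai/c4 repository (the config and split names may differ; check the dataset card):

    from datasets import load_dataset

    # Stream one language subset rather than downloading terabytes of text.
    mc4_de = load_dataset("allenai/c4", "de", split="train", streaming=True)
    print(next(iter(mc4_de))["text"][:200])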

  17. Crawl Space Encapsulation Service Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Mar 14, 2025
    Cite
    Archive Market Research (2025). Crawl Space Encapsulation Service Report [Dataset]. https://www.archivemarketresearch.com/reports/crawl-space-encapsulation-service-57004
    Available download formats
    doc, pdf, ppt
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The crawl space encapsulation service market is experiencing robust growth, projected to reach a market size of $722.3 million in 2025 and maintain a Compound Annual Growth Rate (CAGR) of 8.9% from 2025 to 2033. This expansion is driven by several key factors. Increasing awareness of the benefits of crawl space encapsulation, such as improved indoor air quality, reduced energy costs, and protection against moisture damage and pest infestations, is fueling demand. Furthermore, the rising prevalence of older homes with inadequate crawl spaces, coupled with stringent building codes emphasizing energy efficiency and moisture control, is creating a significant market opportunity. The segment breakdown reveals a strong preference for plastic-based encapsulation solutions due to their cost-effectiveness and ease of installation, while the residential sector accounts for the largest share of market applications, reflecting the high concentration of older homes in need of this service. Competition in the market is relatively fragmented, with numerous regional and national companies providing crawl space encapsulation services. The geographical spread indicates strong demand across North America and Europe, with emerging markets in Asia-Pacific showing significant growth potential. Future market growth will likely be influenced by technological advancements in encapsulation materials, increasing government regulations on building standards and the expansion of the service into commercial and industrial spaces.

    The continued growth trajectory of the crawl space encapsulation market is likely to be fueled by several factors. Firstly, the increasing consumer awareness of health risks associated with mold and poor indoor air quality is driving homeowners to seek solutions like encapsulation. Secondly, the rising construction of new homes that may require preventive crawl space encapsulation measures will contribute to market expansion. Thirdly, innovative encapsulation materials and techniques, coupled with improved service offerings by companies, are enhancing the overall value proposition and driving adoption. The industry's focus on providing efficient, cost-effective, and environmentally sound solutions will continue to attract both homeowners and commercial clients. The diverse range of companies operating within the market also points to healthy competition and continued innovation in service offerings, further driving market expansion. The projections suggest the crawl space encapsulation market is poised for substantial growth over the forecast period, propelled by these factors and the increasing recognition of the long-term benefits of this essential home improvement service.

  18. Whole-of-Australian Government Web Crawl

    • data.gov.au
    html, warc
    Updated Jul 29, 2019
    + more versions
    Cite
    Digital Transformation Agency (2019). Whole-of-Australian Government Web Crawl [Dataset]. https://data.gov.au/data/dataset/groups/whole-of-australian-government-web-crawl
    Available download formats
    html, warc
    Dataset provided by
    Digital Transformation Agency (http://dta.gov.au/)
    Area covered
    Australia
    Description

    Includes publicly-accessible, human-readable material from Australian government websites, using the Organisations Register (AGOR) as a seed list and domain categorisation source, obeying robots.txt and sitemap.xml directives, gathered over a 10-day period.

    Several non-*.gov.au domains are included in the AGOR - these have been crawled up to a limit of 10K URLs.

    Several binary file formats are included and converted to HTML: doc, docm, docx, dot, epub, keys, numbers, pages, pdf, ppt, pptm, pptx, rtf, xls, xlsm, xlsx.

    URLs returning responses larger than 10MB are not included in the dataset.

    Raw gathered data (including metadata) is published in the Web Archive (WARC) format, in both a single, multi-gigabyte WARC file and split series.

    Metadata extracted from pages after filtering is published in JSON format, with fields defined in a data dictionary.
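    The WARC files can be read with standard tooling; a brief sketch, assuming the warcio library (the filename is a placeholder for one of the split-series files):

    from warcio.archiveiterator import ArchiveIterator

    with open("crawl-part-00000.warc", "rb") as stream:  # placeholder name
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                url = record.rec_headers.get_header("WARC-Target-URI")
                body = record.content_stream().read()
                print(url, len(body))
                break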

    Licence

    Web content contained within these WARC files has originally been authored by the agency hosting the referenced material. Authoring agencies are responsible for the choice of licence attached to the original material.

    A consistent licence across the entirety of the WARC files' contents should not be assumed. Agencies may have indicated copyright and licence information for a given URL as metadata embedded in a WARC file entry, but this should not be assumed to be present for all WARC entries.

  19. Global Toy Crawl Baby buyers list and Global importers directory of Toy...

    • volza.com
    csv
    Updated Mar 22, 2025
    + more versions
    Cite
    Volza FZ LLC (2025). Global Toy Crawl Baby buyers list and Global importers directory of Toy Crawl Baby [Dataset]. https://www.volza.com/p/toy-crawl-baby/buyers/buyers-in-india/
    Available download formats
    csv
    Dataset provided by
    Volza
    Authors
    Volza FZ LLC
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Count of exporters, Count of importers, Count of shipments, Sum of import value, 2014-01-01/2021-09-30
    Description

    A list of 36 active global Toy Crawl Baby buyers and a global directory of Toy Crawl Baby importers, compiled from actual global import shipments of Toy Crawl Baby.

  20. Stateless deep crawl EU Feeds (3 August 2018)

    • narcis.nl
    • ssh.datastations.nl
    json
    Updated Jan 20, 2019
    + more versions
    Cite
    Eijk, RJW van (Leiden University) (2019). Stateless deep crawl EU Feeds (3 August 2018) [Dataset]. http://doi.org/10.17026/dans-zyd-4468
    Available download formats
    json
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Eijk, RJW van (Leiden University)
    Description

    Web crawl: automated visits to European media websites with the Netograph capture framework. The crawl collects news articles from the most popular (1) national, (2) regional, and (3) local newspapers in 28 European countries.
