100+ datasets found
  1. The CommonCrawl Corpus

    • marketplace.sshopencloud.eu
    Updated Apr 24, 2020
    + more versions
    Cite
    (2020). The CommonCrawl Corpus [Dataset]. https://marketplace.sshopencloud.eu/dataset/93FNrL
    Dataset updated
    Apr 24, 2020
    Description

    The Common Crawl corpus contains petabytes of data collected over 8 years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.

  2. A corpus of web crawl data composed of 5 billion web pages.

    • data.wu.ac.at
    Updated Oct 10, 2013
    Cite
    Global (2013). A corpus of web crawl data composed of 5 billion web pages. [Dataset]. https://data.wu.ac.at/schema/datahub_io/ZDVlZWJkNmItNThlNC00ZmE1LWE4MGQtNWUwODRjY2ZhZDk5
    Available download formats: application/download (31232.0)
    Dataset updated
    Oct 10, 2013
    Dataset provided by
    Global
    Description

    A corpus of web crawl data composed of 5 billion web pages. This data set is freely available on Amazon S3 at s3://aws-publicdatasets/common-crawl/crawl-002/ and formatted in the ARC (.arc) file format.

    Common Crawl is a non-profit organization that builds and maintains an open repository of web crawl data for the purpose of driving innovation in research, education and technology. This data set contains web crawl data from 5 billion web pages and is released under the Common Crawl Terms of Use.

  3. The Global Anti crawling Techniques Market is Growing at Compound Annual...

    • cognitivemarketresearch.com
    pdf,excel,csv,ppt
    Updated Dec 22, 2024
    Cite
    Cognitive Market Research (2024). The Global Anti crawling Techniques Market is Growing at Compound Annual Growth Rate of 6.00% from 2023 to 2030. [Dataset]. https://www.cognitivemarketresearch.com/anti-crawling-techniques-market-report
    Available download formats: pdf, excel, csv, ppt
    Dataset updated
    Dec 22, 2024
    Dataset authored and provided by
    Cognitive Market Research
    License

    https://www.cognitivemarketresearch.com/privacy-policy

    Time period covered
    2021 - 2033
    Area covered
    Global
    Description

    According to Cognitive Market Research, the global anti-crawling techniques market size is USD XX million in 2023 and will expand at a compound annual growth rate (CAGR) of 6.00% from 2023 to 2030.

    North America held the largest share of the anti-crawling techniques market, more than 40% of global revenue, and will grow at a CAGR of 4.2% from 2023 to 2030.
    Europe accounted for over 30% of the global market and is projected to expand at a CAGR of 4.5% from 2023 to 2030.
    Asia Pacific held more than 23% of global revenue and will grow at a CAGR of 8.0% from 2023 to 2030.
    South America held more than 5% of global revenue and will grow at a CAGR of 5.4% from 2023 to 2030.
    The Middle East and Africa held more than 2% of global revenue and will grow at a CAGR of 5.7% from 2023 to 2030.
    The market for anti-crawling techniques has grown dramatically as a result of the increasing number of data breaches and public awareness of the need to protect sensitive data.
    Demand for bot fingerprint databases remains higher in the anti-crawling techniques market.
    The content protection category held the highest anti-crawling techniques market revenue share in 2023.
    

    Increasing Demand for Protection and Security of Online Data to Provide Viable Market Output

    The market for anti-crawling techniques is expanding due in large part to the growing requirement for online data security and protection. Due to an increase in digital activity, organizations are processing and storing enormous volumes of sensitive data online. Organizations are being forced to invest in strong anti-crawling techniques due to the growing threat of data breaches, illegal access, and web scraping occurrences. By protecting online data from harmful activity and guaranteeing its confidentiality and integrity, these technologies advance the industry. Moreover, the significance of protecting digital assets is increased by the widespread use of the Internet for e-commerce, financial transactions, and sensitive data transfers. Anti-crawling techniques are essential for reducing the hazards connected to online scraping, which is a tactic often used by hackers to obtain important data.

    Increasing Incidence of Cyber Threats to Propel Market Growth
    

    The growing prevalence of cyber risks, such as site scraping and data harvesting, is driving growth in the market for anti-crawling techniques. Organizations that rely heavily on digital platforms run a higher risk of illicit data extraction. To safeguard sensitive data and preserve the integrity of digital assets, organizations have been forced to invest in sophisticated anti-crawling techniques that strengthen online defenses. The market's growth also reflects growing awareness of cybersecurity issues and the need for effective defenses against changing cyber threats. In addition, cybersecurity is constantly challenged by the spread of advanced, automated crawling programs. The ever-changing threat landscape forces enterprises to implement anti-crawling techniques, which use tools such as rate limiting, IP blocking, and CAPTCHAs to prevent fraudulent scraping efforts.

    Market Restraints of the Anti crawling Techniques

    Increasing Demand for Ethical Web Scraping to Restrict Market Growth
    

    The growing demand for ethical web scraping presents a unique challenge to the anti-crawling techniques market. Ethical web scraping is the process of obtaining data from websites for lawful purposes, such as market research or data analysis, without breaching the terms of service. The restraint arises because anti-crawling techniques must distinguish between criminal and ethical scraping operations, balancing the protection of websites from misuse against permitting authorized data harvesting. This dynamic calls for more complex and adaptable anti-crawling techniques that can distinguish between destructive and ethical scraping.

    Impact of COVID-19 on the Anti Crawling Techniques Market

    The demand for online material has increased as a result of the COVID-19 pandemic, which has...

  4. NIF Registry Automated Crawl Data

    • neuinfo.org
    • rrid.site
    • +2more
    Updated Aug 29, 2012
    Cite
    (2012). NIF Registry Automated Crawl Data [Dataset]. http://identifiers.org/RRID:SCR_012862
    Dataset updated
    Aug 29, 2012
    Description

    An automatic pipeline based on an algorithm that identifies new resources in publications every month to improve the efficiency of NIF curators. The pipeline can also find the last time a resource's webpage was updated and whether the URL is still valid, which helps the curator know which resources need attention. Additionally, the pipeline identifies publications that reference existing NIF Registry resources, as this is also of interest. These mentions are available through the Data Federation version of the NIF Registry (http://neuinfo.org/nif/nifgwt.html?query=nlx_144509). The RDF is based on an algorithm that measures how related a resource is to neuroscience (hits of neuroscience-related terms). Each potential resource is assigned a score on that basis, and the resources are then ranked and a list is generated.

  5. CommonCrawl WET Sample

    • kaggle.com
    zip
    Updated May 1, 2023
    Cite
    Jye (2023). CommonCrawl WET Sample [Dataset]. https://www.kaggle.com/datasets/jyesawtellrickson/commoncrawl
    Available download formats: zip (109213996 bytes)
    Dataset updated
    May 1, 2023
    Authors
    Jye
    Description

    A sample of the Common Crawl dataset. The archive has 38,079 rows, and is one of 80,000 samples.

    "The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions."

    https://commoncrawl.org/

    WET Response Format: "As many tasks only require textual information, the Common Crawl dataset provides WET files that only contain extracted plaintext. The way in which this textual data is stored in the WET format is quite simple. The WARC metadata contains various details, including the URL and the length of the plaintext data, with the plaintext data following immediately afterwards."
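
    The record layout described above (WARC headers carrying the URL and plaintext length, followed immediately by the plaintext) can be sketched as a small reader. This is a minimal illustration rather than the dataset author's code: it assumes an uncompressed byte stream, and the helper names `iter_wet_records` and `make_record` as well as the sample URLs are hypothetical. The header names `WARC-Target-URI` and `Content-Length` follow the WARC format.

```python
import io


def iter_wet_records(stream):
    """Yield (headers, text) for each record in an uncompressed WET byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length gives the size of the plaintext payload in bytes.
        length = int(headers["Content-Length"])
        text = stream.read(length).decode("utf-8", errors="replace")
        yield headers, text


def make_record(url, text):
    """Build one synthetic WET-style record (hypothetical helper for the demo)."""
    body = text.encode("utf-8")
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: conversion\r\n"
        f"WARC-Target-URI: {url}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


sample = make_record("http://example.com/a", "Hello world") + make_record(
    "http://example.com/b", "Second page\nwith two lines"
)
records = [
    (headers["WARC-Target-URI"], text)
    for headers, text in iter_wet_records(io.BytesIO(sample))
]
for url, text in records:
    print(url, len(text))
```

    Published WET files are typically gzip-compressed, so in practice the stream would likely come from something like `gzip.open(path, "rb")` instead of the in-memory sample used here.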

  6. Armenian language dataset from CC-100, monolingual Datasets from Web Crawl...

    • data.opendata.am
    Updated Apr 6, 2023
    Cite
    (2023). Armenian language dataset from CC-100, monolingual Datasets from Web Crawl Data [Dataset]. https://data.opendata.am/dataset/cc100arm
    Dataset updated
    Apr 6, 2023
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Area covered
    Armenia
    Description

    Armenian language dataset extracted from the CC-100 research dataset. Description from website: This corpus is an attempt to recreate the dataset used for training XLM-R. This corpus comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). This was constructed using the urls and paragraph indices provided by the CC-Net repository by processing January-December 2018 Commoncrawl snapshots. Each file comprises documents separated by double-newlines and paragraphs within the same document separated by a newline. The data is generated using the open source CC-Net repository. No claims of intellectual property are made on the work of preparation of the corpus.
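
    The file layout described above (documents separated by double newlines, paragraphs within a document separated by a single newline) can be read with a few lines of code. This is a minimal sketch under that stated layout; the function name `parse_cc100` and the sample text are hypothetical and not part of the CC-100 tooling.

```python
def parse_cc100(text):
    """Split CC-100 style text into documents, each a list of paragraphs."""
    docs = []
    for block in text.split("\n\n"):  # a blank line separates documents
        paragraphs = [p for p in block.split("\n") if p.strip()]
        if paragraphs:
            docs.append(paragraphs)
    return docs


sample = (
    "First paragraph of doc one.\n"
    "Second paragraph of doc one.\n"
    "\n"
    "Only paragraph of doc two.\n"
)
docs = parse_cc100(sample)
print(len(docs), [len(d) for d in docs])
```

    For a full-size language file, reading line by line and starting a new document at each empty line would avoid loading the whole file into memory.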

  7. Yoox products database

    • crawlfeeds.com
    csv, zip
    Updated Sep 11, 2025
    Cite
    Crawl Feeds (2025). Yoox products database [Dataset]. https://crawlfeeds.com/datasets/yoox-products-database
    Available download formats: csv, zip
    Dataset updated
    Sep 11, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    The Yoox Products Database is a comprehensive, ready-to-use dataset featuring over 250,000 product listings from the Yoox online fashion platform. This database is ideal for eCommerce analytics, price comparison tools, trend forecasting, competitor research, and building product recommendation engines.

    Inside, you’ll find structured CSV files neatly compressed in a ZIP archive, making it simple to import into any BI tool, database, or application.

    Key Data Fields:

    • Product IDs & SKUs

    • Product Titles & Descriptions

    • Categories & Subcategories

    • Brand Information

    • Pricing & Discounts

    • Availability & Stock Status

    • Image Links

    Perfect for data analysts, developers, marketers, and online retailers looking to harness fashion retail insights.

  8. Web Crawler Tool Report

    • marketresearchforecast.com
    doc, pdf, ppt
    Updated Apr 26, 2025
    Cite
    Market Research Forecast (2025). Web Crawler Tool Report [Dataset]. https://www.marketresearchforecast.com/reports/web-crawler-tool-542102
    Available download formats: pdf, doc, ppt
    Dataset updated
    Apr 26, 2025
    Dataset authored and provided by
    Market Research Forecast
    License

    https://www.marketresearchforecast.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global web crawler tool market is experiencing robust growth, driven by the increasing need for data extraction and analysis across diverse sectors. The market's expansion is fueled by the exponential growth of online data, the rise of big data analytics, and the increasing adoption of automation in business processes. Businesses leverage web crawlers for market research, competitive intelligence, price monitoring, and lead generation, leading to heightened demand. While cloud-based solutions dominate due to scalability and cost-effectiveness, on-premises deployments remain relevant for organizations prioritizing data security and control. The large enterprise segment currently leads in adoption, but SMEs are increasingly recognizing the value proposition of web crawling tools for improving business decisions and operations. Competition is intense, with established players like UiPath and Scrapy alongside a growing number of specialized solutions. Factors such as data privacy regulations and the complexity of managing web crawlers pose challenges to market growth, but ongoing innovation in areas such as AI-powered crawling and enhanced data processing capabilities are expected to mitigate these restraints. We estimate the market size in 2025 to be $1.5 billion, growing at a CAGR of 15% over the forecast period (2025-2033). The geographical distribution of the market reflects the global nature of internet usage, with North America and Europe currently holding the largest market share. However, the Asia-Pacific region is anticipated to witness significant growth driven by increasing internet penetration and digital transformation initiatives across countries like China and India. The ongoing development of more sophisticated and user-friendly web crawling tools, coupled with decreasing implementation costs, is projected to further stimulate market expansion. 
Future growth will depend heavily on the ability of vendors to adapt to evolving web technologies, address increasing data privacy concerns, and provide robust solutions that cater to the specific needs of various industry verticals. Further research and development into AI-driven crawling techniques will be pivotal in optimizing efficiency and accuracy, which in turn will encourage wider adoption.
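
    As a worked illustration of the growth figures quoted above, the snippet below compounds the estimated 2025 market size of $1.5 billion at a 15% CAGR. It is only arithmetic on the stated numbers, not a figure from the report: it assumes eight compounding years between 2025 and 2033, and the function name `project` is hypothetical.

```python
def project(base, cagr, years):
    """Compound a base value at a constant annual growth rate."""
    return base * (1 + cagr) ** years


base_2025 = 1.5  # USD billions, the estimate quoted in the description
cagr = 0.15      # 15% compound annual growth rate
years = 2033 - 2025  # assumption: eight compounding years

value_2033 = project(base_2025, cagr, years)
print(f"Implied 2033 market size: about ${value_2033:.2f} billion")
```

    Under these assumptions the implied 2033 value is roughly $4.6 billion; a different reading of the forecast period would change the result.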

  9. A Crawl of the Mobile Web Measuring Sensor Accesses

    • databank.illinois.edu
    • aws-databank-alb.library.illinois.edu
    Cite
    Anupam Das; Gunes Acar; Nikita Borisov; Amogh Pradeep, A Crawl of the Mobile Web Measuring Sensor Accesses [Dataset]. http://doi.org/10.13012/B2IDB-9213932_V1
    Authors
    Anupam Das; Gunes Acar; Nikita Borisov; Amogh Pradeep
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is the result of three crawls of the web performed in May 2018. The data contains raw crawl data and instrumentation captured by OpenWPM-Mobile, as well as analysis that identifies which scripts access mobile sensors and which ones perform some form of browser fingerprinting, along with a clustering of scripts based on their intended use. The dataset is described in the included README.md file; more details about the methodology can be found in our ACM CCS'18 paper: Anupam Das, Gunes Acar, Nikita Borisov, Amogh Pradeep. The Web's Sixth Sense: A Study of Scripts Accessing Smartphone Sensors. In Proceedings of the 25th ACM Conference on Computer and Communications Security (CCS), Toronto, Canada, October 15–19, 2018. (Forthcoming)

  10. Crawl data lazada

    • kaggle.com
    zip
    Updated Dec 26, 2024
    Cite
    Vân Dung (2024). Crawl data lazada [Dataset]. https://www.kaggle.com/datasets/vndung/crawl-data-lazada/code
    Available download formats: zip (7816 bytes)
    Dataset updated
    Dec 26, 2024
    Authors
    Vân Dung
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Vân Dung

    Released under CC0: Public Domain

    Contents

  11. A Massive Spanish Crawling Corpus

    • live.european-language-grid.eu
    json
    Updated Sep 4, 2022
    + more versions
    Cite
    (2022). A Massive Spanish Crawling Corpus [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/20267
    Available download formats: json
    Dataset updated
    Sep 4, 2022
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    In recent years, Transformer-based models have led to significant advances in language modelling for natural language processing. However, they require a vast amount of data to be (pre-)trained and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, the results in Spanish present important shortcomings, as they are either too small in comparison with other languages, or present a low quality derived from sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content. Our data curation process involves a novel highly parallel cleaning pipeline and encompasses a series of deduplication mechanisms that together ensure the integrity of both document and paragraph boundaries. Additionally, we maintain both the source web page URL and the WARC shard origin URL in order to comply with EU regulations. esCorpius has been released under the CC BY-NC-ND 4.0 license.

  12. Data from: Web Data Commons Training and Test Sets for Large-Scale Product...

    • linkagelibrary.icpsr.umich.edu
    • da-ra.de
    Updated Nov 26, 2020
    + more versions
    Cite
    Ralph Peeters; Anna Primpeli; Christian Bizer (2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 [Dataset]. http://doi.org/10.3886/E127481V1
    Dataset updated
    Nov 26, 2020
    Dataset provided by
    University of Mannheim (Germany)
    Authors
    Ralph Peeters; Anna Primpeli; Christian Bizer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, sets of ids are available for each training set for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.

  13. Web Crawler Tool Report

    • marketresearchforecast.com
    doc, pdf, ppt
    Updated Aug 25, 2025
    Cite
    Market Research Forecast (2025). Web Crawler Tool Report [Dataset]. https://www.marketresearchforecast.com/reports/web-crawler-tool-542101
    Available download formats: ppt, pdf, doc
    Dataset updated
    Aug 25, 2025
    Dataset authored and provided by
    Market Research Forecast
    License

    https://www.marketresearchforecast.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Discover the booming Web Crawler Tool market! This analysis reveals key trends, drivers, and restraints, plus a detailed look at leading companies like Scrapy, Mozenda, and UiPath. Learn about market size projections, CAGR, and regional market share for informed decision-making.

  14. LLMs.txt Bot Crawl Analysis Data

    • archeredu.com
    html
    Updated Jul 23, 2025
    Cite
    Archer Education (2025). LLMs.txt Bot Crawl Analysis Data [Dataset]. https://www.archeredu.com/hemj/page/2/
    Available download formats: html
    Dataset updated
    Jul 23, 2025
    Dataset authored and provided by
    Archer Education
    Variables measured
    User Agent, Total Requests, Percentage Distribution
    Description

    Comprehensive crawl data showing user agent distribution and frequency for LLMs.txt file requests across eight test websites

  15. Common-Crawl-2025-June

    • huggingface.co
    Updated Jun 25, 2025
    Cite
    Shirova AI (2025). Common-Crawl-2025-June [Dataset]. https://huggingface.co/datasets/Shirova/Common-Crawl-2025-June
    Dataset updated
    Jun 25, 2025
    Dataset authored and provided by
    Shirova AI
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Common Crawl 2025 June

    Common-Crawl-2025-June is a curated, processed, and filtered dataset built from the June 2025 Common Crawl web corpus. It contains data crawled between June 1, 2025, and June 10, 2025, processed using Hugging Face’s Data Trove pipeline and several AI-based content filters to remove unsafe, harmful, or low-quality text.

      Dataset Summary
    

    This dataset represents one of the latest structured Common Crawl releases with high-quality web data. The… See the full description on the dataset page: https://huggingface.co/datasets/Shirova/Common-Crawl-2025-June.

  16. Sample Common Crawl URLs and JS Libs used

    • kaggle.com
    zip
    Updated Jul 13, 2022
    Cite
    HiHarshSinghal (2022). Sample Common Crawl URLs and JS Libs used [Dataset]. https://www.kaggle.com/datasets/harshsinghal/common-crawl-sample-urls-js-libs
    Available download formats: zip (209416257 bytes)
    Dataset updated
    Jul 13, 2022
    Authors
    HiHarshSinghal
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is compiled from Common Crawl data.

    A sample of Common Crawl WARC files was processed and URLs to their Javascript libraries were mapped.

    Common Crawl https://commoncrawl.org/ publishes an Internet-wide crawl every year (sometimes multiple times a year). These files capture very rich information of the websites that were crawled.

    I wanted to build something like BuiltWith which tells you what technologies a website is using. One place to look for this is what Javascript libraries are being used by the website.

    I'm actually writing a book on how to build a data product for those who own a keyboard.

    You can read along as I continue writing the book at https://buildadataproduct.netlify.app/

  17. Global import data of Crawler

    • volza.com
    csv
    Updated Dec 10, 2025
    + more versions
    Cite
    Volza FZ LLC (2025). Global import data of Crawler [Dataset]. https://www.volza.com/p/crawler/import/
    Available download formats: csv
    Dataset updated
    Dec 10, 2025
    Dataset authored and provided by
    Volza FZ LLC
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Count of importers, Sum of import value, 2014-01-01/2021-09-30, Count of import shipments
    Description

    81,368 global import shipment records of Crawler with prices, volumes, and current buyer-supplier relationships, based on an actual global export trade database.

  18. Chinese-Common-Crawl-Filtered

    • huggingface.co
    Updated Jun 2, 2025
    + more versions
    Cite
    Jed Cheng (2025). Chinese-Common-Crawl-Filtered [Dataset]. https://huggingface.co/datasets/jed351/Chinese-Common-Crawl-Filtered
    Dataset updated
    Jun 2, 2025
    Authors
    Jed Cheng
    Description

    Traditional Chinese C4

      Dataset Summary
    

    Data obtained from 2025-18 and 2025-13 Common Crawl. Downloaded and processed using code based on another project attempting to recreate the C4 dataset. The resultant dataset contains both simplified and traditional Chinese. It was then filtered using a modified list of simplified Chinese characters to obtain another traditional Chinese dataset. I am still ironing out the process of filtering. The 2025-13 dataset was deduplicated… See the full description on the dataset page: https://huggingface.co/datasets/jed351/Chinese-Common-Crawl-Filtered.

  19. Integrated Datasets

    • neuinfo.org
    • dknet.org
    • +2more
    Updated Jun 14, 2021
    Cite
    (2021). Integrated Datasets [Dataset]. http://identifiers.org/RRID:SCR_010503
    Dataset updated
    Jun 14, 2021
    Description

    A virtual database cataloging numerous data set resources, including: BrainMaps.org, Cell Centered Database, Clinical Trials Network (CTN) Data Share, ClinicalTrials.gov, CRCNS, Gene Expression Omnibus, ArrayExpress, MPD - Mouse Phenome Database, BioSharing, Gene Weaver, XNAT Central, 1000 Functional Connectomes Project, Health.Data.gov, SciCrunch Registry, NIF Registry Automated Crawl Data, NeuroVault, OpenfMRI, Physiobank, RanchoBiosciences, YPED, Data.gov Science, and Research Data Catalog.

  20. Google Play Store Reviews Database

    • crawlfeeds.com
    csv, zip
    Updated Aug 27, 2024
    Cite
    Crawl Feeds (2024). Google Play Store Reviews Database [Dataset]. https://crawlfeeds.com/datasets/google-play-store-reviews-database
    Available download formats: zip, csv
    Dataset updated
    Aug 27, 2024
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Explore the Google Play Store Reviews Database, a comprehensive collection of user reviews for various apps available on the Google Play Store.

    This dataset includes millions of reviews across a wide range of categories such as games, productivity, social media, finance, health, and more. Each review entry provides essential details, including app names, user ratings, review texts, review dates, and user feedback, offering valuable insights for developers, data analysts, and market researchers.

    Key Features:

    • Extensive Review Coverage: Contains millions of user reviews from the Google Play Store, covering various app categories like games, productivity, social media, finance, and health.
    • Detailed Review Information: Each review includes key details such as app name, user rating, review text, review date, and user feedback, allowing for in-depth analysis of user sentiment and app performance.
    • Ideal for Market Analysis: Perfect for developers, data scientists, and market researchers interested in analyzing user feedback, studying trends in app usage, or optimizing app development strategies based on user reviews.
    • Rich Source of User Insights: Provides a comprehensive overview of user experiences and preferences, helping professionals stay updated on the latest trends, popular apps, and user satisfaction levels.

    Whether you're analyzing user feedback, researching market trends, or developing new app strategies, the Google Play Store Reviews Database is an invaluable resource that provides detailed insights and extensive coverage of app reviews on the Google Play Store.


The CommonCrawl Corpus

146 scholarly articles cite this dataset (View in Google Scholar)