100+ datasets found
  1. SEO Crawl Datasets

    • kaggle.com
    zip
    Updated Nov 20, 2020
    Cite
    Elias Dabbas (2020). SEO Crawl Datasets [Dataset]. https://www.kaggle.com/eliasdabbas/seocrawldatasets
    Available download formats: zip (222489274 bytes)
    Dataset updated
    Nov 20, 2020
    Authors
    Elias Dabbas
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Crawling a website is an important way to understand and analyze its content, both for a general overview of the site and for SEO purposes. A crawl essentially converts a site into a table in which each row represents a URL and each column holds one attribute of that URL (title, h1, h2, meta description, etc.).
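
    As a rough illustration of the row-per-URL layout described above, the sketch below models a crawl as a list of records and flags pages with an empty attribute. The URLs and values are invented for the example and are not taken from the dataset; only the attribute names (title, h1, meta description) come from the description.

```python
# Hypothetical crawl output: one record per URL, one key per attribute.
crawl_rows = [
    {"url": "https://example.com/", "title": "Home", "h1": "Welcome", "meta_desc": "Landing page"},
    {"url": "https://example.com/docs", "title": "Docs", "h1": "", "meta_desc": "Documentation"},
    {"url": "https://example.com/blog", "title": "", "h1": "Blog", "meta_desc": ""},
]


def missing(rows, column):
    """Return the URLs whose value in `column` is empty or absent."""
    return [row["url"] for row in rows if not row.get(column)]


missing_h1 = missing(crawl_rows, "h1")
missing_title = missing(crawl_rows, "title")
print(missing_h1, missing_title)
```

    The same check maps directly onto a column filter if the crawl is loaded into a dataframe instead.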

    Content

    A set of crawl datasets of various websites, as well as supporting datasets (XML sitemaps, crawl logs, robots.txt)

    • Django
      • crawl dataset of the docs as well as the www sub-domains
      • crawl logs

    Acknowledgements

    Scrapy, pandas, advertools

    Inspiration

    Trying to come up with a standardized procedure for analyzing websites that others can use and build upon.

  2. The Global Anti crawling Techniques Market is Growing at Compound Annual...

    • cognitivemarketresearch.com
    pdf,excel,csv,ppt
    Updated Dec 22, 2024
    Cite
    Cognitive Market Research (2024). The Global Anti crawling Techniques Market is Growing at Compound Annual Growth Rate of 6.00% from 2023 to 2030. [Dataset]. https://www.cognitivemarketresearch.com/anti-crawling-techniques-market-report
    Available download formats: pdf, excel, csv, ppt
    Dataset updated
    Dec 22, 2024
    Dataset authored and provided by
    Cognitive Market Research
    License

    https://www.cognitivemarketresearch.com/privacy-policy

    Time period covered
    2021 - 2033
    Area covered
    Global
    Description

    According to Cognitive Market Research, The Global Anti crawling Techniques market size is USD XX million in 2023 and will expand at a compound annual growth rate (CAGR) of 6.00% from 2023 to 2030.

    • North America held the largest share, more than 40% of global revenue, and will grow at a compound annual growth rate (CAGR) of 4.2% from 2023 to 2030.
    • Europe accounted for over 30% of the global market and is projected to expand at a CAGR of 4.5% from 2023 to 2030.
    • Asia Pacific held more than 23% of global revenue and will grow at a CAGR of 8.0% from 2023 to 2030.
    • South America held more than 5% of global revenue and will grow at a CAGR of 5.4% from 2023 to 2030.
    • The Middle East and Africa held more than 2% of global revenue and will grow at a CAGR of 5.7% from 2023 to 2030.
    • The market for anti-crawling techniques has grown dramatically as a result of the increasing number of data breaches and public awareness of the need to protect sensitive data.
    • Demand for bot fingerprint databases remains higher in the anti-crawling techniques market.
    • The content protection category held the highest anti-crawling techniques market revenue share in 2023.

    Increasing Demand for Protection and Security of Online Data to Provide Viable Market Output

    The market for anti-crawling techniques is expanding due in large part to the growing requirement for online data security and protection. Due to an increase in digital activity, organizations are processing and storing enormous volumes of sensitive data online. Organizations are being forced to invest in strong anti-crawling techniques due to the growing threat of data breaches, illegal access, and web scraping occurrences. By protecting online data from harmful activity and guaranteeing its confidentiality and integrity, these technologies advance the industry. Moreover, the significance of protecting digital assets is increased by the widespread use of the Internet for e-commerce, financial transactions, and sensitive data transfers. Anti-crawling techniques are essential for reducing the hazards connected to online scraping, which is a tactic often used by hackers to obtain important data.

    Increasing Incidence of Cyber Threats to Propel Market Growth
    

    The growing prevalence of cyber risks, such as site scraping and data harvesting, is driving growth in the market for anti-crawling techniques. Organizations that rely heavily on digital platforms run a higher risk of illicit data extraction. To safeguard sensitive data and preserve the integrity of digital assets, they have been forced to invest in sophisticated anti-crawling techniques that strengthen online defenses. The market's growth also reflects growing awareness of cybersecurity issues and the need for effective defenses against changing cyber threats. In addition, cybersecurity is constantly challenged by the spread of advanced, automated crawling programs. This ever-changing threat landscape pushes enterprises to implement anti-crawling techniques, which use tools such as rate limiting, IP blocking, and CAPTCHAs to prevent fraudulent scraping.
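
    To make the rate-limiting idea mentioned above concrete, here is a minimal sketch of a per-client sliding-window limiter. It is not taken from the report: the class name, the limits, and the example IP address are all assumptions chosen for illustration, and a production system would also need shared state and expiry of idle clients.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.history = defaultdict(deque)  # client IP -> recent request times

    def allow(self, client_ip, now):
        """Return True if the request at time `now` (in seconds) is permitted."""
        times = self.history[client_ip]
        # Drop timestamps that have fallen out of the window.
        while times and now - times[0] >= self.window_seconds:
            times.popleft()
        if len(times) >= self.max_requests:
            return False
        times.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10)
decisions = [limiter.allow("203.0.113.7", t) for t in (0, 1, 2, 3, 11)]
print(decisions)
```

    Passing the clock value in explicitly keeps the sketch deterministic; a real deployment would typically read a monotonic clock instead.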

    Market Restraints of the Anti crawling Techniques

    Increasing Demand for Ethical Web Scraping to Restrict Market Growth
    

    The growing desire for ethical web scraping presents a unique challenge to the anti-crawling techniques market. Ethical web scraping is the process of obtaining data from websites for lawful objectives, such as market research or data analysis, without breaching the terms of service. The restraint arises because anti-crawling techniques must distinguish between criminal and ethical scraping operations, balancing protection of websites from misuse against permitting authorized data harvesting. This dynamic calls for more complex and adaptable anti-crawling techniques that can tell destructive scraping from ethical scraping.

    Impact of COVID-19 on the Anti Crawling Techniques Market

    The demand for online material has increased as a result of the COVID-19 pandemic, which has...

  3. News sites blocking Google's AI crawler worldwide 2023, by country

    • statista.com
    Updated May 2, 2024
    Cite
    Statista (2024). News sites blocking Google's AI crawler worldwide 2023, by country [Dataset]. https://www.statista.com/statistics/1463530/google-crawlers-and-news-websites-worldwide-by-country/
    Dataset updated
    May 2, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2023
    Area covered
    Worldwide
    Description

    By the end of 2023, ** percent of the top most used news websites in Germany were blocking Google's AI crawler, having been quick to act after the crawlers were launched. The figure was substantially lower in Spain and Poland, and in both cases, news publishers were slower to react, meaning that by the end of 2023 just ***** percent of top news sites (print, broadcast, and digital-born) in each country were blocking Google's AI from crawling their content.
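
    For readers unfamiliar with how a site blocks an AI crawler, the sketch below checks a robots.txt rule set with Python's standard `urllib.robotparser`. `Google-Extended` is the robots.txt token Google publishes for this purpose; the rules and the URL are invented for the example and do not describe any site in the statistic.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the AI token, allow everything else.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://news.example.com/politics/story"
ai_allowed = parser.can_fetch("Google-Extended", url)
search_allowed = parser.can_fetch("Googlebot", url)
print(ai_allowed, search_allowed)
```

    With these rules the AI token is refused while the regular search crawler is still allowed, which is the pattern the statistic counts.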

  4. Crawl-data-English

    • kaggle.com
    zip
    Updated Dec 8, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sushii (2023). Crawl-data-English [Dataset]. https://www.kaggle.com/datasets/sushii2512/crawl-data-english
    Available download formats: zip (156 bytes)
    Dataset updated
    Dec 8, 2023
    Authors
    Sushii
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Sushii

    Released under MIT

    Contents

  5. Global Live Crawling Service Market Research Report: By Service Type (Web...

    • wiseguyreports.com
    Updated Sep 15, 2025
    Cite
    (2025). Global Live Crawling Service Market Research Report: By Service Type (Web Crawling, Data Extraction, Content Scraping, SEO Monitoring), By Deployment Type (Cloud-Based, On-Premises), By End User (E-commerce, Travel and Hospitality, Healthcare, Finance), By Industry (Retail, Media and Entertainment, Education, Real Estate) and By Regional (North America, Europe, South America, Asia Pacific, Middle East and Africa) - Forecast to 2035 [Dataset]. https://www.wiseguyreports.com/reports/live-crawling-service-market
    Dataset updated
    Sep 15, 2025
    License

    https://www.wiseguyreports.com/pages/privacy-policy

    Time period covered
    Sep 25, 2025
    Area covered
    Global
    Description
    BASE YEAR: 2024
    HISTORICAL DATA: 2019 - 2023
    REGIONS COVERED: North America, Europe, APAC, South America, MEA
    REPORT COVERAGE: Revenue Forecast, Competitive Landscape, Growth Factors, and Trends
    MARKET SIZE 2024: 1.74 (USD Billion)
    MARKET SIZE 2025: 1.92 (USD Billion)
    MARKET SIZE 2035: 5.25 (USD Billion)
    SEGMENTS COVERED: Service Type, Deployment Type, End User, Industry, Regional
    COUNTRIES COVERED: US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA
    KEY MARKET DYNAMICS: Rising demand for real-time data, Increasing e-commerce activities, Advancements in AI technologies, Growing need for competitive intelligence, Enhanced customer engagement strategies
    MARKET FORECAST UNITS: USD Billion
    KEY COMPANIES PROFILED: Octoparse, Apify, Bright Data, Diffbot, Mozenda, Crawling API, ScrapingBee, DataMiner, WebHarvy, WebScraper.io, Import.io, Zyte, Scrapinghub, ParseHub, Content Grabber, Scrapy
    MARKET FORECAST PERIOD: 2025 - 2035
    KEY MARKET OPPORTUNITIES: Increased demand for real-time data, Expansion into emerging markets, Integration with AI technologies, Enhanced compliance and monitoring solutions, Growing interest in web analytics tools
    COMPOUND ANNUAL GROWTH RATE (CAGR): 10.6% (2025 - 2035)
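
    The stated growth rate can be sanity-checked against the market sizes listed above. The sketch below applies the standard compound annual growth rate formula, (end / start)^(1 / years) - 1, to the 2025 and 2035 figures; the function name is our own, and the only inputs are the numbers from this entry.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures from the report summary above (USD billion).
size_2025 = 1.92
size_2035 = 5.25

rate = cagr(size_2025, size_2035, years=10)
print(f"{rate:.1%}")  # prints 10.6%
```

    The result agrees with the 10.6% quoted for the 2025 - 2035 forecast period.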

  6. SM01: Web crawler benchmark QES15 and QES30 experiments results

    • data.mendeley.com
    Updated Nov 1, 2017
    Cite
    Goran Grubić (2017). SM01: Web crawler benchmark QES15 and QES30 experiments results [Dataset]. http://doi.org/10.17632/yb4mph9wgv.1
    Dataset updated
    Nov 1, 2017
    Authors
    Goran Grubić
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Research project SM01 (Parallel Semantic Crawler for manufacturing multilingual web...).

    The experiments QES15 and QES30 were performed over the Sd sample subset using the BF, PR, HITS and SM crawlers.

    The Sd subset includes 50 web sites with the most challenging multilingual content in the domain of the metal manufacturing business in Serbia.

    The two experiments differ in their Page Load limit (PL_max), which is set to 15 pages per domain for QES15 and 30 pages per domain for QES30.
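
    The effect of a per-domain page load limit can be sketched with a small breadth-first crawl over an in-memory link graph. This assumes that "BF" stands for a breadth-first crawler, which the description does not spell out; the link graph, the function name, and the limit of 4 are hypothetical and unrelated to the SM01 data.

```python
from collections import deque


def breadth_first_crawl(links, start_url, pl_max):
    """Visit pages in breadth-first order, loading at most `pl_max` pages."""
    loaded = []
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(loaded) < pl_max:
        url = queue.popleft()
        loaded.append(url)  # stands in for fetching and parsing the page
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return loaded


# Hypothetical single-domain link graph: page -> outgoing links.
site = {
    "/": ["/a", "/b"],
    "/a": ["/c", "/d"],
    "/b": ["/d", "/e"],
    "/c": [], "/d": [], "/e": [],
}
pages = breadth_first_crawl(site, "/", pl_max=4)
print(pages)
```

    Raising `pl_max` from 15 to 30, as in QES15 versus QES30, simply lets the same loop run longer on each domain.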

    Please refer to the Crawl Report Content Guide to learn about files in the report archives.

  7. Job Posts Data Crawling Project (Vietnam)

    • kaggle.com
    zip
    Updated Dec 31, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Văn Duy Cao (2023). Job Posts Data Crawling Project (Vietnam) [Dataset]. https://www.kaggle.com/datasets/vnduycao/job-posts-data-crawling-project-vietnam
    Available download formats: zip (53707 bytes)
    Dataset updated
    Dec 31, 2023
    Authors
    Văn Duy Cao
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Vietnam
    Description

    This is a semi-cleaned dataset containing information from job posts related to the data science field. The data was scraped from 4 websites in December 2023. The LangChain framework (apparently with OpenAI models) was used to support the data extraction task, for example extracting the soft skills and tools mentioned in each job post's description.

    Here is the data schema for this data set

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F14229286%2Fcd5c6bc8700ad49f34a48b61981625c4%2Fimage%20(2).png?generation=1703998231851462&alt=media

    31/12/2023: The data set's description is not finished.

  8. Anti-crawling Techniques Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Nov 8, 2025
    Cite
    Data Insights Market (2025). Anti-crawling Techniques Report [Dataset]. https://www.datainsightsmarket.com/reports/anti-crawling-techniques-1956630
    Available download formats: doc, ppt, pdf
    Dataset updated
    Nov 8, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Explore the booming Anti-crawling Techniques market, driven by sophisticated bot threats and essential data protection needs. Discover key insights, growth drivers, and leading solutions safeguarding businesses worldwide.

  9. Common-Crawl-2025-June

    • huggingface.co
    Updated Jun 25, 2025
    Cite
    Shirova AI (2025). Common-Crawl-2025-June [Dataset]. https://huggingface.co/datasets/Shirova/Common-Crawl-2025-June
    Dataset updated
    Jun 25, 2025
    Dataset authored and provided by
    Shirova AI
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Common Crawl 2025 June

    Common-Crawl-2025-June is a curated, processed, and filtered dataset built from the June 2025 Common Crawl web corpus. It contains data crawled between June 1, 2025, and June 10, 2025, processed using Hugging Face’s DataTrove pipeline and several AI-based content filters to remove unsafe, harmful, or low-quality text.

      Dataset Summary
    

    This dataset represents one of the latest structured Common Crawl releases with high-quality web data. The… See the full description on the dataset page: https://huggingface.co/datasets/Shirova/Common-Crawl-2025-June.

  10. AI-paper-crawl

    • huggingface.co
    Updated Jan 1, 1980
    Cite
    Forty-Two AI Lab (1980). AI-paper-crawl [Dataset]. https://huggingface.co/datasets/Seed42Lab/AI-paper-crawl
    Dataset updated
    Jan 1, 1980
    Dataset authored and provided by
    Forty-Two AI Lab
    Description

    Dataset Card for "AI-paper-crawl"

    The dataset contains 11 splits, corresponding to 11 conferences. For each split, there are several fields:

    • "index": Index number starting from 0. It's the primary key.
    • "text": The content of the paper in pure text form. Newline is turned into 3 spaces if "-" is not detected.
    • "year": A string of the paper's publication year, like "2018". Transform it into int if you need to.
    • "No": A string of index number within a year. 1-indexed.

    In "ECCV" split… See the full description on the dataset page: https://huggingface.co/datasets/Seed42Lab/AI-paper-crawl.

  11. crawleval

    • huggingface.co
    Updated May 9, 2025
    Cite
    Crawlab Team (2025). crawleval [Dataset]. https://huggingface.co/datasets/crawlab/crawleval
    Dataset updated
    May 9, 2025
    Dataset authored and provided by
    Crawlab Team
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    CrawlEval

    Resources and tools for evaluating the performance and behavior of web crawling systems.

      Overview
    

    CrawlEval provides a comprehensive suite of tools and datasets for evaluating web crawling systems, with a particular focus on HTML pattern extraction and content analysis. The project includes:

    • A curated dataset of web pages with ground truth patterns
    • Tools for fetching and analyzing web content
    • Evaluation metrics and benchmarking capabilities

      Dataset… See the full description on the dataset page: https://huggingface.co/datasets/crawlab/crawleval.
    
  12. Common Crawl Micro Subset English

    • kaggle.com
    zip
    Updated Apr 10, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nikhil R (2025). Common Crawl Micro Subset English [Dataset]. https://www.kaggle.com/datasets/nikhilr612/common-crawl-micro-subset-english
    Available download formats: zip (5504236429 bytes)
    Dataset updated
    Apr 10, 2025
    Authors
    Nikhil R
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    A subset of Common Crawl, extracted from Colossally Cleaned Common Crawl (C4) dataset with the additional constraint that extracted text safely encodes to ASCII. A Unigram tokenizer of vocabulary 12.228k tokens is provided, along with pre-tokenized data.
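
    One plausible reading of the "safely encodes to ASCII" constraint is a filter that keeps only texts whose characters all fall in the ASCII range. The sketch below shows that reading; the dataset's actual filtering code is not given here, so the function and the sample strings are assumptions.

```python
def encodes_to_ascii(text):
    """Return True if every character in `text` is plain ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


samples = ["Plain English sentence.", "Café menu", "naïve approach", "Tabs\tand\nnewlines"]
kept = [s for s in samples if encodes_to_ascii(s)]
print(kept)
```

    Texts containing accented letters are dropped under this rule, while control characters such as tabs and newlines pass because they are part of ASCII.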

  13. Mobile First Indexing attributes ranking on websites in France 2020, by...

    • statista.com
    Cite
    Statista, Mobile First Indexing attributes ranking on websites in France 2020, by importance [Dataset]. https://www.statista.com/statistics/1220608/mobile-first-indexing-attributes-ranking-by-importance-on-websites-france/
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2020
    Area covered
    France
    Description

    According to French SEO professionals, the single most important factor for mobile-first indexing in 2020 was adapting the size of the content to the size of the screen. Beyond that, ensuring crawlability mattered, since it makes a site easier for crawlers and Internet users to visit and easier for search engines to discover.

  14. Válasz

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Apr 13, 2022
    + more versions
    Cite
    Gábor Palkó; Balázs Indig; Zsófia Fellegi; Zsófia Sárközi-Lindner (2022). Válasz [Dataset]. http://doi.org/10.5281/zenodo.5849730
    Available download formats: bin
    Dataset updated
    Apr 13, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gábor Palkó; Balázs Indig; Zsófia Fellegi; Zsófia Sárközi-Lindner
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Time period covered
    Apr 13, 2001 - Aug 3, 2018
    Description

    This object was created as part of the web harvesting project of the Eötvös Loránd University Department of Digital Humanities (ELTE DH). Learn more about the workflow HERE and about the software used HERE. The aim of the project is to make online news articles and their metadata suitable for research purposes. The archiving workflow is designed to prevent modification or manipulation of the downloaded content. The current version of the curated content, with normalized formatting in standard TEI XML format and Schema.org encoded metadata, is available HERE. The detailed description of the raw content is the following:

    • The portal's archived content (from 2001-04-13 to 2018-08-03) in WARC format available HERE (crawled: 2019-09-01 11:36:19.949569 - 2021-03-06 20:24:33.398056). The crawling happened in multiple phases, hence the date intervals are unusually wide. No further versions are expected because the crawl was created after the portal had stopped publication.

    Please fill in the following form before requesting access to this dataset: ACCESS FORM

  15. Chinese-Common-Crawl-Filtered

    • huggingface.co
    Updated Jun 2, 2025
    + more versions
    Cite
    Jed Cheng (2025). Chinese-Common-Crawl-Filtered [Dataset]. https://huggingface.co/datasets/jed351/Chinese-Common-Crawl-Filtered
    Dataset updated
    Jun 2, 2025
    Authors
    Jed Cheng
    Description

    Traditional Chinese C4

      Dataset Summary
    

    Data obtained from the 2025-18 and 2025-13 Common Crawl snapshots. Downloaded and processed using code based on another project attempting to recreate the C4 dataset. The resultant dataset contains both simplified and traditional Chinese. It was then filtered using a modified list of simplified Chinese characters to obtain another traditional Chinese dataset. I am still ironing out the process of filtering. The 2025-13 dataset was deduplicated… See the full description on the dataset page: https://huggingface.co/datasets/jed351/Chinese-Common-Crawl-Filtered.
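
    The character-list filtering described above can be pictured roughly as follows. This is only a guess at the general shape of such a filter: the three-character set is a tiny illustrative stand-in, not the author's modified list, and the function name and sample sentences are invented.

```python
# Tiny illustrative set of simplified-only characters (hypothetical; the
# dataset author's modified list is not reproduced here).
SIMPLIFIED_ONLY = set("国们这")


def looks_traditional(text):
    """Return True if `text` contains none of the listed simplified characters."""
    return not any(ch in SIMPLIFIED_ONLY for ch in text)


docs = ["這是我們的國家", "这是我们的国家", "天氣很好"]
traditional_docs = [d for d in docs if looks_traditional(d)]
print(traditional_docs)
```

    A real list would need to cover far more characters, and characters shared by both scripts mean a filter like this can only reject text, never positively confirm it as traditional.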

  16. Data from: AICC

    • huggingface.co
    Updated Nov 28, 2025
    Cite
    OpenDataLab (2025). AICC [Dataset]. https://huggingface.co/datasets/opendatalab/AICC
    Dataset updated
    Nov 28, 2025
    Dataset authored and provided by
    OpenDataLab
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    🔧 Our new-generation HTML parser MinerU-HTML is now released!

      AICC: AI-ready Common Crawl Dataset
    

    Paper | Project page

    AICC (AI-ready Common Crawl) is a large-scale, AI-ready web dataset derived from Common Crawl, containing semantically extracted Markdown-formatted main content from diverse web pages. The dataset is constructed using MinerU-HTML, a web extraction pipeline developed by OpenDataLab.

    High-quality main content: High-fidelity main content extracted from diverse Common… See the full description on the dataset page: https://huggingface.co/datasets/opendatalab/AICC.

  17. Data from: R crawlers for five Slovenian web media 1.0

    • live.european-language-grid.eu
    Updated Apr 22, 2017
    Cite
    (2017). R crawlers for five Slovenian web media 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/20080
    Dataset updated
    Apr 22, 2017
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    Five web-crawlers written in the R language for retrieving Slovenian texts from the news portals 24ur, Dnevnik, Finance, Rtvslo, and Žurnal24. These portals contain political, business, economic and financial content.

  18. cccc_filtered

    • huggingface.co
    Updated Jun 4, 2024
    + more versions
    Cite
    Common Pile (2024). cccc_filtered [Dataset]. https://huggingface.co/datasets/common-pile/cccc_filtered
    Dataset updated
    Jun 4, 2024
    Dataset authored and provided by
    Common Pile
    Description

    Creative Commons Common Crawl

      Description
    

    This dataset contains text from 52 Common Crawl snapshots, covering about half of Common Crawl snapshots available to date and covering all years of operations of Common Crawl up to 2024. We found a higher level of duplication across this collection, suggesting that including more snapshots would lead to a modest increase in total token yield. From these snapshots, we extract HTML content using FastWarc. Then, using a regular… See the full description on the dataset page: https://huggingface.co/datasets/common-pile/cccc_filtered.

  19. Bar Crawl: Detecting Heavy Drinking Data Set

    • kaggle.com
    Updated Mar 25, 2020
    + more versions
    Cite
    Gaafa (2020). Bar Crawl: Detecting Heavy Drinking Data Set [Dataset]. https://www.kaggle.com/nautiyalamit/bar-crawl-detecting-heavy-drinking-data-set/code
    Available in Croissant, a format for machine-learning datasets (see mlcommons.org/croissant).
    Dataset updated
    Mar 25, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gaafa
    Description

    Data Set Information:

    Relevant Information: All data is fully anonymized.

    Data was originally collected from 19 participants, but the TAC readings of 6 participants were deemed unusable by SCRAM [1]. The data included is from the remaining 13 participants.

    Accelerometer data was collected from smartphones at a sampling rate of 40Hz (file: all_accelerometer_data_pids_13.csv). The file contains 5 columns: a timestamp, a participant ID, and a sample from each axis of the accelerometer. Data was collected from a mix of 11 iPhones and 2 Android phones as noted in phone_types.csv. TAC data was collected using SCRAM [2] ankle bracelets and was collected at 30 minute intervals. The raw TAC readings are in the raw_tac directory. TAC readings which are more readily usable for processing are in clean_tac directory and have two columns: a timestamp and TAC reading. The cleaned TAC readings: (1) were processed with a zero-phase low-pass filter to smooth noise without shifting phase; (2) were shifted backwards by 45 minutes so the labels more closely match the true intoxication of the participant (since alcohol takes about 45 minutes to exit through the skin.) Please see the above referenced study for more details on how the data was processed ([Web Link]).

    1 - [Web Link] 2 - J. Robert Zettl. The determination of blood alcohol concentration by transdermal measurement. [Web Link], 2002.

    Number of Instances:
    • Accelerometer readings: 14,057,567
    • TAC readings: 715
    • Participants: 13

    Number of Attributes:
    • Time series: 3 axes of accelerometer data (columns x, y, z in all_accelerometer_data_pids_13.csv)
    • Static: 1 phone-type feature (in phone_types.csv)
    • Target: 1 time series of TAC for each of the 13 participants (in clean_tac directory)

    For Each Attribute:

    (Main)
    all_accelerometer_data_pids_13.csv:
    • time: integer, unix timestamp, milliseconds
    • pid: symbolic, 13 categories listed in pids.txt
    • x: continuous, time-series
    • y: continuous, time-series
    • z: continuous, time-series
    clean_tac/*.csv:
    • timestamp: integer, unix timestamp, seconds
    • TAC_Reading: continuous, time-series
    phone_types.csv:
    • pid: symbolic, 13 categories listed in pids.txt
    • phonetype: symbolic, 2 categories (iPhone, Android)

    (Other)
    raw/*.xlsx:
    • TAC Level: continuous, time-series
    • IR Voltage: continuous, time-series
    • Temperature: continuous, time-series
    • Time: datetime
    • Date: datetime

    Missing Attribute Values: None

    Target Distribution: TAC is measured in g/dl, where 0.08 is the legal limit for intoxication while driving
    • Mean TAC: 0.065 +/- 0.182
    • Max TAC: 0.443
    • TAC Inner Quartiles: 0.002, 0.029, 0.092
    • Mean Time-to-last-drink: 16.1 +/- 6.9 hrs

    Relevant Papers:

    Past Usage:
    (a) Complete reference of article where it was described/used: Killian, J.A., Passino, K.M., Nandi, A., Madden, D.R. and Clapp, J., Learning to Detect Heavy Drinking Episodes Using Smartphone Accelerometer Data. In Proceedings of the 4th International Workshop on Knowledge Discovery in Healthcare Data co-located with the 28th International Joint Conference on Artificial Intelligence (IJCAI 2019) (pp. 35-42). [Web Link]
    (b) Indication of what attribute(s) were being predicted: Features: three-axis time series accelerometer data. Target: time series transdermal alcohol content (TAC) data (real-time measure of intoxication).
    (c) Indication of study's results: The study decomposed each time series into 10 second windows and performed binary classification to predict if windows corresponded to an intoxicated participant (TAC >= 0.08) or sober participant (TAC < 0.08). The study tested several models and achieved a test accuracy of 77.5% with a random forest.
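
    The windowing and labeling steps summarized above can be sketched in a few lines. This is an illustration under stated assumptions, not the study's code: the sample values are invented, the helper names are our own, and only the 10-second window length, the millisecond timestamps, and the 0.08 threshold come from the description.

```python
WINDOW_MS = 10_000       # 10-second windows, as in the study summary
INTOXICATED_TAC = 0.08   # threshold quoted above


def window_index(timestamp_ms, start_ms):
    """Map a millisecond timestamp to its 10-second window number."""
    return (timestamp_ms - start_ms) // WINDOW_MS


def label(tac_reading):
    """Binary target: 1 for intoxicated (TAC >= 0.08), 0 for sober."""
    return 1 if tac_reading >= INTOXICATED_TAC else 0


# Hypothetical accelerometer samples: (time in ms, x, y, z).
samples = [(0, 0.1, 0.0, 0.2), (9_975, 0.0, 0.1, 0.1), (10_000, 0.3, 0.2, 0.1), (25_000, 0.0, 0.0, 0.0)]
windows = {}
for t, x, y, z in samples:
    windows.setdefault(window_index(t, start_ms=0), []).append((x, y, z))

window_sizes = {idx: len(rows) for idx, rows in windows.items()}
labels = [label(tac) for tac in (0.002, 0.08, 0.443)]
print(window_sizes, labels)
```

    At the stated 40 Hz sampling rate a full window would hold about 400 samples; the handful used here just keeps the example readable.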

    Citation Request:

    When using this dataset, please cite: Killian, J.A., Passino, K.M., Nandi, A., Madden, D.R. and Clapp, J., Learning to Detect Heavy Drinking Episodes Using Smartphone Accelerometer Data. In Proceedings of the 4th International Workshop on Knowledge Discovery in Healthcare Data co-located with the 28th International Joint Conference on Artificial Intelligence (IJCAI 2019) (pp. 35-42). [Web Link]

  20. Data from: Székelyhon

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Apr 13, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Gábor Palkó; Balázs Indig; Zsófia Fellegi; Zsófia Sárközi-Lindner (2022). Székelyhon [Dataset]. http://doi.org/10.5281/zenodo.5849138
    Available download formats: bin
    Dataset updated
    Apr 13, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gábor Palkó; Balázs Indig; Zsófia Fellegi; Zsófia Sárközi-Lindner
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Time period covered
    Jan 29, 2017 - May 21, 2021
    Description

    This object was created as part of the web harvesting project of the Eötvös Loránd University Department of Digital Humanities (ELTE DH). Learn more about the workflow HERE and about the software used HERE. The aim of the project is to make online news articles and their metadata suitable for research purposes. The archiving workflow is designed to prevent modification or manipulation of the downloaded content. The current version of the curated content, with normalized formatting in standard TEI XML format and Schema.org encoded metadata, is available HERE. The detailed description of the raw content is the following:

    • The portal's archived content (from 2017-01-29 to 2021-05-21) in WARC format available HERE (crawled: 2021-05-21T09:51:12.531750 - 2021-05-21T18:38:24.961226).

    Please fill in the following form before requesting access to this dataset: ACCESS FORM

SEO Crawl Datasets

Crawled websites for SEO analysis

55 scholarly articles cite this dataset (View in Google Scholar)