36 datasets found
  1. statistics

    • huggingface.co
    Updated Nov 20, 2024
    Cite
    Common Crawl Foundation (2024). statistics [Dataset]. https://huggingface.co/datasets/commoncrawl/statistics
    Explore at:
    Dataset updated
    Nov 20, 2024
    Dataset provided by
    Common Crawl (http://commoncrawl.org/)
    Authors
    Common Crawl Foundation
    Description

    Common Crawl Statistics

    Number of pages, distribution of top-level domains, crawl overlaps, etc.: basic metrics about the Common Crawl Monthly Crawl Archives. For more detailed information and graphs, please visit our official statistics page. Here you can find the following statistics files:

      Charsets
    

    The character set or encoding is identified for HTML pages only, using Tika's AutoDetectReader. The table shows the percentage of HTML pages encoded with each character set… See the full description on the dataset page: https://huggingface.co/datasets/commoncrawl/statistics.
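
    The statistics files can be pulled straight from the Hugging Face repository. A minimal sketch, assuming a charsets CSV exists under a name like "charsets.csv" (the actual filenames are listed on the dataset page):

    import pandas as pd
    from huggingface_hub import hf_hub_download

    # Download one statistics file from the dataset repo (filename assumed).
    path = hf_hub_download(
        repo_id="commoncrawl/statistics",
        filename="charsets.csv",  # hypothetical; check the repo file listing
        repo_type="dataset",
    )
    charsets = pd.read_csv(path)
    print(charsets.head())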

  2. vietvault

    • huggingface.co
    Updated Jul 9, 2024
    Cite
    Nam Pham (2024). vietvault [Dataset]. http://doi.org/10.57967/hf/2210
    Explore at:
    Dataset updated
    Jul 9, 2024
    Authors
    Nam Pham
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    VietVault

    VietVault is a large-scale Vietnamese language corpus, carefully filtered and curated from Common Crawl dataset dumps prior to 2023. This dataset is designed to serve as a high-quality resource for Vietnamese language model pretraining and various natural language processing tasks.

      Dataset Statistics
    

    Size: 80GB of raw text
    Language: Vietnamese
    Source: Common Crawl dataset (all dumps in 2013-2023)
    Preprocessing: Cleaned, deduplicated, filtered for Vietnamese… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/vietvault.
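
    A minimal loading sketch; streaming avoids downloading all 80GB up front. The split name "train" and the "text" field are assumptions based on typical Hugging Face pretraining corpora:

    from datasets import load_dataset

    # Stream the corpus instead of materializing 80GB on disk.
    ds = load_dataset("nampdn-ai/vietvault", split="train", streaming=True)
    for doc in ds.take(3):
        print(doc.get("text", "")[:200])  # field name assumed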

  3. CC100 Dataset

    • paperswithcode.com
    Updated Jul 14, 2020
    Cite
    Alexis Conneau; Kartikay Khandelwal; Naman Goyal; Vishrav Chaudhary; Guillaume Wenzek; Francisco Guzmán; Edouard Grave; Myle Ott; Luke Zettlemoyer; Veselin Stoyanov (2020). CC100 Dataset [Dataset]. https://paperswithcode.com/dataset/cc100
    Explore at:
    Dataset updated
    Jul 14, 2020
    Authors
    Alexis Conneau; Kartikay Khandelwal; Naman Goyal; Vishrav Chaudhary; Guillaume Wenzek; Francisco Guzmán; Edouard Grave; Myle Ott; Luke Zettlemoyer; Veselin Stoyanov
    Description

    This corpus comprises monolingual data for 100+ languages and also includes data for romanized languages. It was constructed using the URLs and paragraph indices provided by the CC-Net repository, by processing the January-December 2018 Common Crawl snapshots. Each file comprises documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open-source CC-Net repository.
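
    The file format described above (documents separated by blank lines, paragraphs by single newlines) is easy to parse. A small illustrative sketch, not part of the CC-Net tooling:

    def parse_cc100(raw_text: str):
        """Yield each document as a list of its paragraphs."""
        for block in raw_text.split("\n\n"):
            paragraphs = [p for p in block.split("\n") if p.strip()]
            if paragraphs:
                yield paragraphs

    sample = "Para one.\nPara two.\n\nSecond document, single paragraph."
    for doc in parse_cc100(sample):
        print(doc)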

  4. Webis Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18)

    • live.european-language-grid.eu
    • explore.openaire.eu
    • +1 more
    json
    Updated Apr 11, 2024
    + more versions
    Cite
    (2024). Webis Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7748
    Explore at:
    json (available download format)
    Dataset updated
    Apr 11, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) contains text reuse cases extracted from within Wikipedia and between Wikipedia and a sample of the Common Crawl. The corpus has the following structure:

    wikipedia.jsonl.bz2: Each line, representing a Wikipedia article, contains a json array of article_id, article_title, and article_body

    within-wikipedia-tr-01.jsonl.bz2: Each line, representing a text reuse case, contains a json array of s_id (source article id), t_id (target article id), s_text (source text), t_text (target text)

    within-wikipedia-tr-02.jsonl.bz2: Each line, representing a text reuse case, contains a json array of s_id (source article id), t_id (target article id), s_text (source text), t_text (target text)

    preprocessed-web-sample.jsonl.xz: Each line, representing a web page, contains a json object of d_id, d_url, and content

    without-wikipedia-tr.jsonl.bz2: Each line, representing a text reuse case, contains a json array of s_id (Wikipedia article id), d_id (web page id), s_text (article text), d_content (web page content)

    The datasets were extracted in the work by Alshomary et al. 2018, which aimed to study the text reuse phenomenon related to Wikipedia at scale. A pipeline for large-scale text reuse extraction was developed and used on Wikipedia and the Common Crawl.
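
    Given the structure above, the corpus files can be read line by line with the standard library. A sketch for wikipedia.jsonl.bz2, whose lines are JSON arrays of article_id, article_title, and article_body:

    import bz2
    import json

    with bz2.open("wikipedia.jsonl.bz2", mode="rt", encoding="utf-8") as f:
        for line in f:
            article_id, title, body = json.loads(line)
            print(article_id, title, body[:80])
            break  # peek at the first article only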

  5. German CBOW FastText embeddings with min count 250

    • zenodo.org
    • data.niaid.nih.gov
    xz
    Updated Oct 26, 2021
    + more versions
    Cite
    Victor Bocharov; Victor Bocharov (2021). German CBOW FastText embeddings with min count 250 [Dataset]. http://doi.org/10.5281/zenodo.5598144
    Explore at:
    xz (available download format)
    Dataset updated
    Oct 26, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Victor Bocharov
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    FastText embeddings built from the Common Crawl German dataset

    Parameters
    Parameter          Value(s)
    Dimensions         256 and 384
    Context window     5
    Negative sampled   10
    Epochs             1
    Number of buckets  131072 or 262144
    Min n              3
    Max n              6
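
    The parameter table maps directly onto the fasttext Python bindings. A sketch of reproducing one configuration (dim=256, bucket=131072); the input file name is a placeholder, and minCount=250 is inferred from the dataset title:

    import fasttext

    model = fasttext.train_unsupervised(
        "de_commoncrawl.txt",  # hypothetical plain-text German corpus
        model="cbow",
        dim=256,               # Dimensions (a 384-dim variant also exists)
        ws=5,                  # Context window
        neg=10,                # Negative sampled
        epoch=1,               # Epochs
        bucket=131072,         # Number of buckets (or 262144)
        minn=3,                # Min n
        maxn=6,                # Max n
        minCount=250,          # "min count 250" from the dataset title
    )
    model.save_model("cc.de.256.bin")
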
  6. Data from: mC4 Dataset

    • paperswithcode.com
    • opendatalab.com
    • +1 more
    Updated Mar 24, 2021
    Cite
    Linting Xue; Noah Constant; Adam Roberts; Mihir Kale; Rami Al-Rfou; Aditya Siddhant; Aditya Barua; Colin Raffel (2021). mC4 Dataset [Dataset]. https://paperswithcode.com/dataset/mc4
    Explore at:
    Dataset updated
    Mar 24, 2021
    Authors
    Linting Xue; Noah Constant; Adam Roberts; Mihir Kale; Rami Al-Rfou; Aditya Siddhant; Aditya Barua; Colin Raffel
    Description

    mC4 is a multilingual variant of the C4 dataset. It comprises natural text in 101 languages drawn from the public Common Crawl web scrape.

  7. cc_news

    • huggingface.co
    Updated Jul 3, 2018
    Cite
    Vladimir Blagojevic (2018). cc_news [Dataset]. https://huggingface.co/datasets/vblagoje/cc_news
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 3, 2018
    Authors
    Vladimir Blagojevic
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Dataset Card for CC-News

      Dataset Summary
    

    The CC-News dataset contains news articles from news sites all over the world. The data is available on AWS S3 in the Common Crawl bucket at /crawl-data/CC-NEWS/. This version of the dataset has been prepared using news-please, an integrated web crawler and information extractor for news. It contains 708,241 English-language news articles published between January 2017 and December 2019. It represents a small portion of the English… See the full description on the dataset page: https://huggingface.co/datasets/vblagoje/cc_news.
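
    Since the dataset was prepared with news-please, that extractor's single-URL API illustrates the kind of record it produces (title, maintext, date_publish, per the news-please documentation); the URL below is a placeholder:

    from newsplease import NewsPlease

    article = NewsPlease.from_url("https://www.example.com/some-article")  # placeholder URL
    print(article.title)
    print(article.date_publish)
    print(article.maintext[:200])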

  8. French CBOW FastText embeddings with min count 250

    • data.niaid.nih.gov
    Updated Apr 20, 2022
    Cite
    de Chalendar, Gaël (2022). French CBOW FastText embeddings with min count 250 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6471440
    Explore at:
    Dataset updated
    Apr 20, 2022
    Dataset provided by
    Bocharov, Victor
    de Chalendar, Gaël
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    French
    Description

    FastText embeddings built from Common Crawl French dataset

    Parameters

    Parameter          Value(s)
    Dimensions         512
    Context window     5
    Negative sampled   10
    Epochs             1
    Number of buckets  262144
    Min n              3
    Max n              6
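
    If the published vectors are in the standard textual .vec format (an assumption; check the Zenodo record for the actual file layout), gensim can load them directly. The filename is hypothetical:

    from gensim.models import KeyedVectors

    # Load the 512-dimensional French vectors (decompress the .xz first).
    kv = KeyedVectors.load_word2vec_format("cc.fr.512.vec")
    print(kv.most_similar("fromage", topn=5))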
    
  9. fineweb

    • huggingface.co
    Cite
    FineData, fineweb [Dataset]. http://doi.org/10.57967/hf/2493
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    🍷 FineWeb

    15 trillion tokens of the finest data the 🌐 web has to offer

      What is it?
    

    The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and was run with the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release of the full dataset under… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
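
    A hedged sketch of sampling FineWeb without downloading 15T tokens of data; the "sample-10BT" subset name follows the dataset card, but verify it there:

    from datasets import load_dataset

    fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                      split="train", streaming=True)
    for row in fw.take(2):
        print(row["url"], row["text"][:120])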

  10. CCQA Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Oct 13, 2021
    Cite
    Patrick Huber; Armen Aghajanyan; Barlas Oğuz; Dmytro Okhonko; Wen-tau Yih; Sonal Gupta; Xilun Chen (2021). CCQA Dataset [Dataset]. https://paperswithcode.com/dataset/ccqa
    Explore at:
    Dataset updated
    Oct 13, 2021
    Authors
    Patrick Huber; Armen Aghajanyan; Barlas Oğuz; Dmytro Okhonko; Wen-tau Yih; Sonal Gupta; Xilun Chen
    Description

    CCQA is a new web-scale dataset for in-domain model pre-training: a novel QA dataset based on the Common Crawl project. Using the readily available schema.org annotations, around 130 million multilingual question-answer pairs are extracted, including about 60 million English data points.
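
    As an illustration of the general schema.org pattern (not the authors' extraction pipeline), question-answer pairs can be pulled from JSON-LD blocks embedded in HTML:

    import json
    import re

    JSONLD_RE = re.compile(
        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
        re.DOTALL,
    )

    def extract_qa(html: str):
        """Yield (question, answer) pairs from schema.org Question items."""
        for match in JSONLD_RE.finditer(html):
            try:
                data = json.loads(match.group(1))
            except json.JSONDecodeError:
                continue
            for item in data if isinstance(data, list) else [data]:
                if item.get("@type") == "Question":
                    answer = (item.get("acceptedAnswer") or {}).get("text")
                    yield item.get("name"), answer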

  11. CCNet Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Apr 3, 2024
    Cite
    Guillaume Wenzek; Marie-Anne Lachaux; Alexis Conneau; Vishrav Chaudhary; Francisco Guzmán; Armand Joulin; Edouard Grave (2024). CCNet Dataset [Dataset]. https://paperswithcode.com/dataset/ccnet
    Explore at:
    Dataset updated
    Apr 3, 2024
    Authors
    Guillaume Wenzek; Marie-Anne Lachaux; Alexis Conneau; Vishrav Chaudhary; Francisco Guzmán; Armand Joulin; Edouard Grave
    Description

    CCNet is a dataset extracted from Common Crawl with a different filtering process than the one used for OSCAR. It was built using a language model trained on Wikipedia in order to filter out low-quality texts such as code or tables. CCNet contains longer documents on average compared to OSCAR, with smaller (and often noisier) documents weeded out.
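
    The Wikipedia-based filtering can be sketched with the kenlm Python bindings: score each document's perplexity under a Wikipedia language model and drop high-perplexity text. The model path and threshold below are placeholders, not values from the CCNet paper:

    import kenlm

    model = kenlm.Model("wikipedia.arpa.bin")  # hypothetical pre-trained LM

    def keep_document(text: str, max_perplexity: float = 1000.0) -> bool:
        """Keep documents whose Wikipedia-LM perplexity is low enough."""
        return model.perplexity(text) <= max_perplexity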

  12. oscar

    • huggingface.co
    Updated Aug 1, 2023
    Cite
    OSCAR (2023). oscar [Dataset]. https://huggingface.co/datasets/oscar-corpus/oscar
    Explore at:
    Dataset updated
    Aug 1, 2023
    Dataset authored and provided by
    OSCAR
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
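
    The language-classification step can be illustrated with fastText's published language-identification model (a sketch of the idea, not the goclassy code itself):

    import fasttext

    lid = fasttext.load_model("lid.176.bin")  # downloadable from fasttext.cc
    labels, probs = lid.predict("Ceci est une phrase en français.")
    print(labels[0], probs[0])  # e.g. ('__label__fr', 0.99...)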

  13. WDC LSPM Dataset

    • paperswithcode.com
    Updated Dec 2, 2020
    Cite
    (2020). WDC LSPM Dataset [Dataset]. https://paperswithcode.com/dataset/wdc-products
    Explore at:
    Dataset updated
    Dec 2, 2020
    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches, and shoes.

    In order to support the evaluation of machine-learning-based matching methods, the data is split into training, validation, and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, sets of IDs for each training set are available for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived via weak supervision from shared product identifiers on the Web.

    The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites.
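
    A hedged sketch of inspecting such binary pair data with pandas; the file name and column names ("title_left", "title_right", "label") are illustrative assumptions, not the documented WDC schema:

    import pandas as pd

    pairs = pd.read_json("computers_train_small.json.gz", lines=True)
    print(pairs["label"].value_counts())              # match / no-match balance
    print(pairs[["title_left", "title_right"]].head())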

  14. Web_DomURLs

    • huggingface.co
    Updated Sep 17, 2024
    Cite
    Abdelkader El Mahdaouy (2024). Web_DomURLs [Dataset]. https://huggingface.co/datasets/amahdaouy/Web_DomURLs
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 17, 2024
    Authors
    Abdelkader El Mahdaouy
    Description

    Datasets Overview

    The dataset URLs and Domain Names are collected from the following sources:

      mC4
    

    Description: The Multilingual Colossal Common Crawl Corpus (mC4) is a cleaned version of Common Crawl's web corpus, curated by the Allen Institute for Artificial Intelligence. It contains approximately 170 million URLs. Source: mC4 Dataset on Hugging Face

      falcon-refinedweb
    

    Description: An English large-scale dataset curated for large language model… See the full description on the dataset page: https://huggingface.co/datasets/amahdaouy/Web_DomURLs.
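
    Since the dataset pairs URLs with domain names, here is a small illustration of deriving a domain from a URL using only the standard library (a production solution would consult the Public Suffix List):

    from urllib.parse import urlparse

    def domain_of(url: str) -> str:
        """Naive registrable-domain extraction from a URL."""
        host = urlparse(url).netloc.lower()
        return host[4:] if host.startswith("www.") else host

    print(domain_of("https://www.example.co.uk/page?q=1"))  # example.co.uk (naive)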

  15. Web Data Commons (November 2018) Property and Datatype Usage Dataset

    • zenodo.org
    application/gzip
    Updated May 10, 2022
    + more versions
    Cite
    Jan Martin Keil; Jan Martin Keil (2022). Web Data Commons (November 2018) Property and Datatype Usage Dataset [Dataset]. http://doi.org/10.5281/zenodo.6477443
    Explore at:
    application/gzip (available download format)
    Dataset updated
    May 10, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jan Martin Keil
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dataset about the usage of properties and datatypes in the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets (November 2018) based on the Common Crawl November 2018 archive. The dataset has been produced using the RDF Property and Datatype Usage Scanner v2.1.1, which is based on the Apache Jena framework. Only RDFa and embedded JSON-LD data were considered, as Microdata and Microformats do not incorporate explicit datatypes.

    Dataset Properties

    • Size: 22.2 MiB compressed, 569.6 MiB uncompressed; 2,608,325 rows plus 1 header line (determined using gunzip -c measurements.csv.gz | wc -l)
    • Parsing Failures: The scanner failed to parse 4,135,842 triples (~0.077 %) of the source dataset (which contains 5,367,569,192 triples).
    • Content:
      • CATEGORY: The category (html-embedded-jsonld or html-rdfa) of the Web Data Commons file that has been measured.
      • FILE_URL: The URL of the Web Data Commons file that has been measured.
      • MEASUREMENT: The applied measurement with specific conditions, one of:
        • UnpreciseRepresentableInDouble: The number of lexicals that are in the lexical space but not in the value space of xsd:double.
        • UnpreciseRepresentableInFloat: The number of lexicals that are in the lexical space but not in the value space of xsd:float.
        • UsedAsDatatype: The total number of literals with the datatype.
        • UsedAsPropertyRange: The number of statements that specify the datatype as range of the property.
        • ValidDateNotation: The number of lexicals that are in the lexical space of xsd:date.
        • ValidDateTimeNotation: The number of lexicals that are in the lexical space of xsd:dateTime.
        • ValidDecimalNotation: The number of lexicals that represent a number with decimal notation and whose lexical representation is thereby in the lexical space of xsd:decimal, xsd:float, and xsd:double.
        • ValidExponentialNotation: The number of lexicals that represent a number with exponential notation and whose lexical representation is thereby in the lexical space of xsd:float, and xsd:double.
        • ValidInfOrNaNNotation: The number of lexicals that equals either INF, +INF, -INF or NaN and whose lexical representation is thereby in the lexical space of xsd:float, and xsd:double.
        • ValidIntegerNotation: The number of lexicals that represent an integer number and whose lexical representation is thereby in the lexical space of xsd:integer, xsd:decimal, xsd:float, and xsd:double.
        • ValidTimeNotation: The number of lexicals that are in the lexical space of xsd:time.
        • ValidTrueOrFalseNotation: The number of lexicals that equal either true or false and whose lexical representation is thereby in the lexical space of xsd:boolean.
        • ValidZeroOrOneNotation: The number of lexicals that equal either 0 or 1 and whose lexical representation is thereby in the lexical space of xsd:boolean, and xsd:integer, xsd:decimal, xsd:float, and xsd:double.
        Note: The lexical representation of xsd:double values in embedded JSON-LD was normalized to always use exponential notation with up to 16 fractional digits (see the related code). Be careful when drawing conclusions from the corresponding Valid… and Unprecise… measures.
      • PROPERTY: The property that has been measured.
      • DATATYPE: The datatype that has been measured.
      • QUANTITY: The count of statements that fulfill the condition specified by the measurement per file, property and datatype.

    Preview

    "CATEGORY","FILE_URL","MEASUREMENT","PROPERTY","DATATYPE","QUANTITY"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://www.w3.org/2006/vcard/ns#longitude","https://www.w3.org/2001/XMLSchema#float","4"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://www.w3.org/2006/vcard/ns#latitude","https://www.w3.org/2001/XMLSchema#float","4"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://purl.org/goodrelations/v1#hasCurrencyValue","https://www.w3.org/2001/XMLSchema#float","6"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","http://purl.org/goodrelations/v1#hasCurrencyValue","http://www.w3.org/2001/XMLSchema#floatfloat","8"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://opengraphprotocol.org/schema/latitude","http://www.w3.org/2001/XMLSchema#string","30"
    …
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-embedded-jsonld.nq-00734.gz","ValidZeroOrOneNotation","http://schema.org/numberOfItems","http://www.w3.org/2001/XMLSchema#integer","40"
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-embedded-jsonld.nq-00734.gz","ValidZeroOrOneNotation","http://schema.org/ratingValue","http://www.w3.org/2001/XMLSchema#integer","431"
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-embedded-jsonld.nq-00734.gz","ValidZeroOrOneNotation","http://schema.org/width","http://www.w3.org/2001/XMLSchema#integer","122"
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-embedded-jsonld.nq-00734.gz","ValidZeroOrOneNotation","http://schema.org/minValue","http://www.w3.org/2001/XMLSchema#integer","63"
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2018-12/quads/dpef.html-embedded-jsonld.nq-00734.gz","ValidZeroOrOneNotation","http://schema.org/pageEnd","http://www.w3.org/2001/XMLSchema#integer","139"
    

    Note: The data contains malformed IRIs, like "xsd:dateTime" (instead of, presumably, "http://www.w3.org/2001/XMLSchema#dateTime"), which are caused by missing namespace definitions in the original source websites.
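
    With the columns documented above, the compressed CSV can be explored directly with pandas; a small sketch that totals QUANTITY per MEASUREMENT:

    import pandas as pd

    df = pd.read_csv("measurements.csv.gz")  # pandas decompresses gzip transparently
    totals = (df.groupby("MEASUREMENT")["QUANTITY"]
                .sum()
                .sort_values(ascending=False))
    print(totals.head(10))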

    Reproduce

    To reproduce this dataset, check out the RDF Property and Datatype Usage Scanner v2.1.1 and execute:

    # Build the scanner
    mvn clean package
    # Provide the file lists of both categories to the scanner
    java -jar target/Scanner.jar --category html-rdfa --list http://webdatacommons.org/structureddata/2018-12/files/html-rdfa.list November2018
    java -jar target/Scanner.jar --category html-embedded-jsonld --list http://webdatacommons.org/structureddata/2018-12/files/html-embedded-jsonld.list November2018
    # Run the scan over the listed files
    ./measure.sh November2018
    # Wait until the scan has completed. This will take a few days.
    # Aggregate the measurements into a single CSV
    java -jar target/Scanner.jar --results ./November2018/measurements.csv.gz November2018
    
  16. Data from: Wide range screening of algorithmic bias in word embedding models using large sentiment lexicons reveals underreported bias types

    • data.niaid.nih.gov
    • zenodo.org
    zip
    Updated Apr 7, 2020
    Cite
    David Rozado (2020). Wide range screening of algorithmic bias in word embedding models using large sentiment lexicons reveals underreported bias types [Dataset]. http://doi.org/10.5061/dryad.rbnzs7h7w
    Explore at:
    zipAvailable download formats
    Dataset updated
    Apr 7, 2020
    Dataset provided by
    Otago Polytechnic
    Authors
    David Rozado
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Concerns about gender bias in word embedding models have captured substantial attention in the algorithmic bias research literature. Other bias types, however, have received less scrutiny. This work describes a large-scale analysis of sentiment associations in popular word embedding models along the lines of gender and ethnicity, but also along the less frequently studied dimensions of socioeconomic status, age, physical appearance, sexual orientation, religious sentiment, and political leanings.

    Consistent with previous scholarly literature, this work has found systemic bias against given names popular among African-Americans in most embedding models examined. Gender bias in embedding models, however, appears to be multifaceted and often reversed in polarity to what has been regularly reported. Interestingly, using the common operationalization of the term bias in the fairness literature, novel, so far unreported bias types in word embedding models have also been identified. Specifically, the popular embedding models analyzed here display negative biases against middle and working-class socioeconomic status, male children, senior citizens, plain physical appearance, and intellectual phenomena such as Islamic religious faith, non-religiosity, and conservative political orientation.

    Reasons for the paradoxical underreporting of these bias types in the relevant literature are probably manifold, but widely held blind spots when searching for algorithmic bias and a lack of widespread technical jargon to unambiguously describe a variety of algorithmic associations could conceivably be playing a role. The causal origins of the multiplicity of loaded associations attached to distinct demographic groups within embedding models are often unclear, but the heterogeneity of said associations and their potential multifactorial roots raise doubts about the validity of grouping them all under the umbrella term bias. Richer and more fine-grained terminology, as well as a more comprehensive exploration of the bias landscape, could help the fairness epistemic community to characterize and neutralize algorithmic discrimination more efficiently.

    Methods: This dataset collects several popular pre-trained word embedding models:

    -Word2vec Skip-Gram trained on Google News corpus (100B tokens) https://code.google.com/archive/p/word2vec/

    -Glove trained on Wikipedia 2014 + Gigaword 5 (6B tokens) http://nlp.stanford.edu/data/glove.6B.zip

    -Glove trained on 2B tweets Twitter corpus (27B tokens) http://nlp.stanford.edu/data/glove.twitter.27B.zip

    -Glove trained on Common Crawl (42B tokens) http://nlp.stanford.edu/data/glove.42B.300d.zip

    -Glove trained on Common Crawl (840B tokens) http://nlp.stanford.edu/data/glove.840B.300d.zip

    -FastText trained with subword information on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens) https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.vec.zip

    -FastText trained with subword information on Common Crawl (600B tokens) https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
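
    A small sketch of loading one of the GloVe text files listed above into a dictionary of vectors (each line is a word followed by whitespace-separated floats):

    import numpy as np

    def load_glove(path: str) -> dict:
        """Map each word to its embedding vector."""
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, *values = line.rstrip().split(" ")
                vectors[word] = np.asarray(values, dtype=np.float32)
        return vectors

    # glove = load_glove("glove.6B.300d.txt")  # file from the 6B zip above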

  17. hplt-vi

    • huggingface.co
    Updated Oct 1, 2024
    Cite
    Symato Team (2024). hplt-vi [Dataset]. https://huggingface.co/datasets/Symato/hplt-vi
    Explore at:
    Dataset updated
    Oct 1, 2024
    Dataset authored and provided by
    Symato Team
    Description

    NOTE: Newer (not yet filtered) data is already available at:

    https://github.com/hplt-project/data-analytics-tool/blob/main/reports/mono-2.0/HPLT-v2-vie_Latn.lite.pdf https://hplt-project.org/datasets/v2.0

    Vietnamese data from https://hplt-project.org/datasets/v1, with the data originating from Common Crawl (CC) removed. Statistics by domain name:

    SIZE       DOCS     DOMAIN
    40855.5mb  3586.6k  http://dongtrieu.edu.vn
    30012.1mb  112.8k   http://hamtruyentranh.net
    … See the full description on the dataset page: https://huggingface.co/datasets/Symato/hplt-vi.

  18. Data from: CCMatrix Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Feb 14, 2021
    Cite
    Holger Schwenk; Guillaume Wenzek; Sergey Edunov; Edouard Grave; Armand Joulin (2021). CCMatrix Dataset [Dataset]. https://paperswithcode.com/dataset/ccmatrix
    Explore at:
    Dataset updated
    Feb 14, 2021
    Authors
    Holger Schwenk; Guillaume Wenzek; Sergey Edunov; Edouard Grave; Armand Joulin
    Description

    CCMatrix uses ten snapshots of a curated Common Crawl corpus (Wenzek et al., 2019), totalling 32.7 billion unique sentences.

  19. DCAD-2000

    • huggingface.co
    Cite
    OpenBMB, DCAD-2000 [Dataset]. https://huggingface.co/datasets/openbmb/DCAD-2000
    Explore at:
    Dataset authored and provided by
    OpenBMB
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Summary

    DCAD-2000 is a large-scale multilingual corpus built using newly extracted Common Crawl data (CC-MAIN-2024-46) and existing multilingual datasets. It includes over 2,282 languages, 46.72TB of data, and 8.63 billion documents, spanning 155 high- and medium-resource languages and 159 writing scripts. We propose reframing data cleaning as an anomaly detection task. This dynamic filtering approach significantly enhances data quality by identifying and removing noisy or… See the full description on the dataset page: https://huggingface.co/datasets/openbmb/DCAD-2000.
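
    The card frames data cleaning as anomaly detection; as a generic illustration of that idea (not the authors' method), an IsolationForest can flag outlier documents from simple surface features:

    import numpy as np
    from sklearn.ensemble import IsolationForest

    def features(doc: str) -> list:
        """Toy surface features: length, mean word length, digit ratio."""
        words = doc.split()
        mean_len = sum(len(w) for w in words) / len(words) if words else 0.0
        digit_ratio = sum(c.isdigit() for c in doc) / max(len(doc), 1)
        return [len(doc), mean_len, digit_ratio]

    docs = ["A normal sentence about the weather today.",
            "0x1f 0x2e 0x3d 0x4c 0x5b 0x6a 0x7f 0x8e",
            "Another ordinary paragraph of running text."]
    X = np.array([features(d) for d in docs])
    flags = IsolationForest(random_state=0).fit_predict(X)  # -1 marks anomalies
    print(flags)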

  20. Tracking the Trackers

    • opendatalab.com
    • paperswithcode.com
    zip
    Updated Mar 24, 2023
    + more versions
    Cite
    Technical University of Berlin (2023). Tracking the Trackers [Dataset]. https://opendatalab.com/OpenDataLab/Tracking_the_Trackers
    Explore at:
    zip (1,857,478,517 bytes; available download format)
    Dataset updated
    Mar 24, 2023
    Dataset provided by
    Technical University of Berlin
    University of Koblenz-Landau
    License

    https://commoncrawl.org/terms-of-use/

    Description

    Tracking the Trackers is a large-scale analysis of third-party trackers on the World Wide Web. We extract third-party embeddings from more than 3.5 billion web pages of the CommonCrawl 2012 corpus and aggregate them into a dataset containing more than 140 million third-party embeddings in over 41 million domains. We provide the data used in our recent large-scale analysis of third-party trackers on the web. We created an extractor that finds embedded third-party resources in HTML pages and ran it on the 3.5 billion web pages contained in the CommonCrawl 2012 web crawl.
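
    A minimal sketch of the extraction idea (standard library only, not the authors' extractor): collect script/img/iframe sources whose host differs from the embedding page's host:

    from html.parser import HTMLParser
    from urllib.parse import urlparse

    class ThirdPartyExtractor(HTMLParser):
        """Collect third-party resource hosts from an HTML page."""

        def __init__(self, page_host: str):
            super().__init__()
            self.page_host = page_host
            self.third_parties = set()

        def handle_starttag(self, tag, attrs):
            if tag in ("script", "img", "iframe"):
                src = dict(attrs).get("src") or ""
                host = urlparse(src).netloc
                if host and host != self.page_host:
                    self.third_parties.add(host)

    parser = ThirdPartyExtractor("example.com")
    parser.feed('<script src="https://tracker.ads-example.net/t.js"></script>')
    print(parser.third_parties)  # {'tracker.ads-example.net'}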
