100+ datasets found
  1. wikipedia

    • huggingface.co
    • tensorflow.org
    Updated Feb 21, 2023
    + more versions
    Cite
    Online Language Modelling (2023). wikipedia [Dataset]. https://huggingface.co/datasets/olm/wikipedia
    Explore at:
    Dataset updated
    Feb 21, 2023
    Dataset authored and provided by
    Online Language Modelling
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Wikipedia dataset containing cleaned articles in all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/), with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip markup and unwanted sections (references, etc.).
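    A minimal loading sketch (the language and date arguments here are illustrative; the date must match a dump currently listed at https://dumps.wikimedia.org/, since the builder downloads and cleans that snapshot):

    from datasets import load_dataset

    # illustrative snapshot; any language/date pair with an available dump should work
    ds = load_dataset("olm/wikipedia", language="en", date="20221101")
    article = ds["train"][0]
    print(article["title"])
    print(article["text"][:300])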

  2. Plaintext Wikipedia dump 2018 - Dataset - B2FIND

    • b2find.eudat.eu
    Cite
    Plaintext Wikipedia dump 2018 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/3074cb26-6a0d-5803-8520-d0050a22c66e
    Explore at:
    Description

    Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018. The data come from all Wikipedias for which dumps could be downloaded at [https://dumps.wikimedia.org/]. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages). For a list of all the Wikipedias, see [https://meta.wikimedia.org/wiki/List_of_Wikipedias]. The script which can be used to get a new version of the data is included, but note that Wikipedia limits the download speed when fetching many of the dumps, so it takes a few days to download all of them (though one or a few can be downloaded quickly). Also, the format of the dumps changes from time to time, so the script will probably stop working eventually. The WikiExtractor tool [http://medialab.di.unipi.it/wiki/Wikipedia_Extractor] used to extract text from the Wikipedia dumps is not mine; I only modified it slightly to produce plaintext outputs [https://github.com/ptakopysk/wikiextractor].

  3. Wikipedia Knowledge Graph dataset

    • zenodo.org
    • produccioncientifica.ugr.es
    • +1more
    pdf, tsv
    Updated Jul 17, 2024
    Cite
    Wenceslao Arroyo-Machado; Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Daniel Torres-Salinas; Rodrigo Costas; Rodrigo Costas (2024). Wikipedia Knowledge Graph dataset [Dataset]. http://doi.org/10.5281/zenodo.6346900
    Explore at:
    Available download formats: tsv, pdf
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Wenceslao Arroyo-Machado; Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Daniel Torres-Salinas; Rodrigo Costas; Rodrigo Costas
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Wikipedia is the largest and most read free online encyclopedia currently in existence. As such, Wikipedia offers a large amount of data on all its own contents and the interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to get an overview, and many of the analytical possibilities that Wikipedia offers remain unknown. In order to reduce the complexity of identifying and collecting data on Wikipedia and to expand its analytical potential, after collecting different data from various sources and processing them, we have generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis and contextualization of the activity and relations of Wikipedia pages, in this case limited to its English edition. We share this Knowledge Graph dataset openly, aiming for it to be useful to a wide range of researchers, such as informetricians, sociologists or data scientists.

    There are a total of 9 files, all of them in tsv format, built under a relational structure. The main one, which acts as the core of the dataset, is the page file; around it there are 4 files with different entities related to the Wikipedia pages (category, url, pub and page_property files) and 4 other files that act as "intermediate tables", making it possible to connect the pages both with those entities and with other pages (page_category, page_url, page_pub and page_link files).

    The document Dataset_summary includes a detailed description of the dataset.

    Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.

  4. wikipedia-small-3000-embedded

    • huggingface.co
    Updated Apr 6, 2024
    Cite
    Hafedh Hichri (2024). wikipedia-small-3000-embedded [Dataset]. https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 6, 2024
    Authors
    Hafedh Hichri
    License

    GNU Free Documentation License (GFDL): https://choosealicense.com/licenses/gfdl/

    Description

    This is a subset of the wikimedia/wikipedia dataset. Code for creating this dataset:

    from datasets import load_dataset, Dataset
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

    # load dataset in streaming mode (no download and it's fast)
    dataset = load_dataset(
        "wikimedia/wikipedia", "20231101.en", split="train", streaming=True
    )

    # select 3000 samples
    from tqdm import tqdm
    data = Dataset.from_dict({})
    for i, entry in…

    See the full description on the dataset page: https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded.
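    A hedged sketch of querying the published dataset with the same embedding model (the "train" split and the "embedding"/"text" column names are assumptions; check the dataset viewer for the actual schema):

    import numpy as np
    from datasets import load_dataset
    from sentence_transformers import SentenceTransformer

    # assumed schema: a "train" split with "text" and "embedding" columns
    ds = load_dataset("not-lain/wikipedia-small-3000-embedded", split="train")
    model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

    query_vec = model.encode("history of the printing press")
    emb = np.array(ds["embedding"], dtype=np.float32)

    # cosine similarity between the query and every stored passage embedding
    scores = emb @ query_vec / (np.linalg.norm(emb, axis=1) * np.linalg.norm(query_vec))
    best = int(np.argmax(scores))
    print(ds[best]["text"][:300])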

  5. rag-mini-wikipedia

    • huggingface.co
    Updated May 5, 2025
    Cite
    RAG Datasets (2025). rag-mini-wikipedia [Dataset]. https://huggingface.co/datasets/rag-datasets/rag-mini-wikipedia
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 5, 2025
    Dataset authored and provided by
    RAG Datasets
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    In this Hugging Face discussion you can share what you used the dataset for. The dataset derives from https://www.kaggle.com/datasets/rtatman/questionanswer-dataset?resource=download; we generated our own subset using generate.py.

  6. simple-wiki

    • huggingface.co
    Cite
    Embedding Training Data, simple-wiki [Dataset]. https://huggingface.co/datasets/embedding-data/simple-wiki
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    Embedding Training Data
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for "simple-wiki"

      Dataset Summary
    

    This dataset contains pairs of equivalent sentences obtained from Wikipedia.

      Supported Tasks
    

    Sentence Transformers training; useful for semantic search and sentence similarity.

      Languages
    

    English.

      Dataset Structure
    

    Each example in the dataset contains a pair of equivalent sentences and is formatted as a dictionary with the key "set" and the list of sentences as its value. {"set":… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/simple-wiki.
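    A minimal loading sketch (assuming the default "train" split):

    from datasets import load_dataset

    # each example is a dict like {"set": [sentence_a, sentence_b]}
    pairs = load_dataset("embedding-data/simple-wiki", split="train")
    first = pairs[0]["set"]
    print(first[0])
    print(first[1])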

  7. Plaintext Wikipedia dump 2018

    • live.european-language-grid.eu
    binary format
    Updated Feb 24, 2018
    Cite
    (2018). Plaintext Wikipedia dump 2018 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1242
    Explore at:
    Available download formats: binary format
    Dataset updated
    Feb 24, 2018
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018.

    The data come from all Wikipedias for which dumps could be downloaded at [https://dumps.wikimedia.org/]. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages).

    For a list of all the Wikipedias, see [https://meta.wikimedia.org/wiki/List_of_Wikipedias].

    The script which can be used to get a new version of the data is included, but note that Wikipedia limits the download speed when fetching many of the dumps, so it takes a few days to download all of them (though one or a few can be downloaded quickly).

    Also, the format of the dumps changes from time to time, so the script will probably stop working eventually.

    The WikiExtractor tool [http://medialab.di.unipi.it/wiki/Wikipedia_Extractor] used to extract text from the Wikipedia dumps is not mine; I only modified it slightly to produce plaintext outputs [https://github.com/ptakopysk/wikiextractor].

  8. Dataset of Pairs of an Image and Tags for Cataloging Image-based Records

    • narcis.nl
    • data.mendeley.com
    Updated Apr 19, 2022
    Cite
    Suzuki, T (via Mendeley Data) (2022). Dataset of Pairs of an Image and Tags for Cataloging Image-based Records [Dataset]. http://doi.org/10.17632/msyc6mzvhg.2
    Explore at:
    Dataset updated
    Apr 19, 2022
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Suzuki, T (via Mendeley Data)
    Description

    Brief Explanation

    This dataset was created to develop and evaluate a cataloging system which assigns appropriate metadata to an image record for database management in digital libraries. It is intended for evaluating a task in which, given an image and its assigned tags, an appropriate Wikipedia page is selected for each of the given tags. A main characteristic of the dataset is that it includes ambiguous tags, so the visual contents of images are not unique to their tags. For example, it includes the tag 'mouse', which here refers not to the mammal but to the computer pointing device. The annotations are the corresponding Wikipedia articles for the tags, selected as correct entities by human judgement. The dataset offers both data and programs that reproduce experiments for the above-mentioned task. Its data consist of image sources and annotations. The image sources are URLs of 420 images uploaded to Flickr. The annotations are a total of 2,464 relevant Wikipedia pages manually judged for the tags of the images. The dataset also provides programs in a Jupyter notebook (scripts.ipynb) to conduct a series of experiments that run some baseline methods for the designated task and evaluate the results.

    ## Structure of the Dataset

    1. data directory
    1.1. image_URL.txt: lists URLs of the image files.
    1.2. rels.txt: lists the correct Wikipedia pages for each topic in topics.txt.
    1.3. topics.txt: lists each target pair (called a topic in this dataset) of an image and a tag to be disambiguated.
    1.4. enwiki_20171001.xml: texts extracted from the title and body parts of English Wikipedia articles as of 1 October 2017. This is modified data from the Wikipedia dump (https://archive.org/download/enwiki-20171001).
    2. img directory: a placeholder directory into which image files are fetched when downloading.
    3. results directory: a placeholder directory to store result files for evaluation. It contains three results of baseline methods in sub-directories, each holding JSON files (one per topic) that are ready to be evaluated with the evaluation scripts in scripts.ipynb, for reference of both usage and performance.
    4. scripts.ipynb: the scripts for running the baseline methods and the evaluation are provided in this Jupyter notebook file.
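    A small sketch of fetching the images into the img directory (assuming image_URL.txt lists one URL per line; the actual scripts.ipynb may do this differently):

    import os
    import requests

    os.makedirs("img", exist_ok=True)
    with open("data/image_URL.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    for url in urls:
        target = os.path.join("img", url.rsplit("/", 1)[-1])
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        with open(target, "wb") as out:
            out.write(resp.content)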

  9. Wikipedia Dataset

    • universe.roboflow.com
    zip
    Updated Jul 10, 2025
    Cite
    yolov8ui (2025). Wikipedia Dataset [Dataset]. https://universe.roboflow.com/yolov8ui/wikipedia/model/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 10, 2025
    Dataset authored and provided by
    yolov8ui
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    UI Elements Bounding Boxes
    Description

    Wikipedia

    ## Overview
    
    Wikipedia is a dataset for object detection tasks - it contains UI Elements annotations for 5,522 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
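    A hedged download sketch using the Roboflow Python package (the API key placeholder is yours to fill in; the workspace, project and version identifiers are read off the dataset URL above and may need adjusting):
    
    from roboflow import Roboflow
    
    rf = Roboflow(api_key="YOUR_API_KEY")
    project = rf.workspace("yolov8ui").project("wikipedia")
    # export in YOLOv8 format; other export formats are also supported
    dataset = project.version(1).download("yolov8")
    print(dataset.location)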
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  10. wiki40b

    • tensorflow.org
    • opendatalab.com
    • +1more
    Updated Aug 30, 2023
    + more versions
    Cite
    (2023). wiki40b [Dataset]. https://www.tensorflow.org/datasets/catalog/wiki40b
    Explore at:
    Dataset updated
    Aug 30, 2023
    Description

    Cleaned-up text for 40+ Wikipedia language editions, restricted to pages that correspond to entities. The datasets have train/dev/test splits per language. The dataset is cleaned up by page filtering to remove disambiguation pages, redirect pages, deleted pages, and non-entity pages. Each example contains the wikidata id of the entity, and the full Wikipedia article after page processing that removes non-content sections and structured objects. The language models trained on this corpus - including 41 monolingual models, and 2 multilingual models - can be found at https://tfhub.dev/google/collections/wiki40b-lm/1.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wiki40b', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  11. Data from: WikiHist.html: English Wikipedia's Full Revision History in HTML...

    • zenodo.org
    application/gzip, zip
    Updated Jun 8, 2020
    Cite
    Blagoj Mitrevski; Tiziano Piccardi; Tiziano Piccardi; Robert West; Robert West; Blagoj Mitrevski (2020). WikiHist.html: English Wikipedia's Full Revision History in HTML Format [Dataset]. http://doi.org/10.5281/zenodo.3605388
    Explore at:
    Available download formats: application/gzip, zip
    Dataset updated
    Jun 8, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Blagoj Mitrevski; Tiziano Piccardi; Tiziano Piccardi; Robert West; Robert West; Blagoj Mitrevski
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Introduction

    Wikipedia is written in the wikitext markup language. When serving content, the MediaWiki software that powers Wikipedia parses wikitext to HTML, thereby inserting additional content by expanding macros (templates and modules). Hence, researchers who intend to analyze Wikipedia as seen by its readers should work with HTML, rather than wikitext. Since Wikipedia’s revision history is made publicly available by the Wikimedia Foundation exclusively in wikitext format, researchers have had to produce HTML themselves, typically by using Wikipedia’s REST API for ad-hoc wikitext-to-HTML parsing. This approach, however, (1) does not scale to very large amounts of data and (2) does not correctly expand macros in historical article revisions.

    We have solved these problems by developing a parallelized architecture for parsing massive amounts of wikitext using local instances of MediaWiki, enhanced with the capacity of correct historical macro expansion. By deploying our system, we produce and hereby release WikiHist.html, English Wikipedia’s full revision history in HTML format. It comprises the HTML content of 580M revisions of 5.8M articles generated from the full English Wikipedia history spanning 18 years from 1 January 2001 to 1 March 2019. Boilerplate content such as page headers, footers, and navigation sidebars is not included in the HTML.

    For more details, please refer to the description below and to the dataset paper:
    Blagoj Mitrevski, Tiziano Piccardi, and Robert West: WikiHist.html: English Wikipedia’s Full Revision History in HTML Format. In Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020.
    https://arxiv.org/abs/2001.10256

    When using the dataset, please cite the above paper.

    Dataset summary

    The dataset consists of three parts:

    1. English Wikipedia’s full revision history parsed to HTML,
    2. a table of the creation times of all Wikipedia pages (page_creation_times.json.gz),
    3. a table that allows for resolving redirects for any point in time (redirect_history.json.gz).

    Part 1 is our main contribution, while parts 2 and 3 contain complementary information that can aid researchers in their analyses.

    Getting the data

    Parts 2 and 3 are hosted in this Zenodo repository. Part 1 is 7TB large -- too large for Zenodo -- and is therefore hosted externally on the Internet Archive. For downloading part 1, you have multiple options:

    Dataset details

    Part 1: HTML revision history
    The data is split into 558 directories, named enwiki-20190301-pages-meta-history$1.xml-p$2p$3, where $1 ranges from 1 to 27, and p$2p$3 indicates that the directory contains revisions for pages with ids between $2 and $3. (This naming scheme directly mirrors that of the wikitext revision history from which WikiHist.html was derived.) Each directory contains a collection of gzip-compressed JSON files, each containing 1,000 HTML article revisions. Each row in the gzipped JSON files represents one article revision. Rows are sorted by page id, and revisions of the same page are sorted by revision id. We include all revision information from the original wikitext dump, the only difference being that we replace the revision’s wikitext content with its parsed HTML version (and that we store the data in JSON rather than XML):

    • id: id of this revision
    • parentid: id of revision modified by this revision
    • timestamp: time when revision was made
    • cont_username: username of contributor
    • cont_id: id of contributor
    • cont_ip: IP address of contributor
    • comment: comment made by contributor
    • model: content model (usually "wikitext")
    • format: content format (usually "text/x-wiki")
    • sha1: SHA-1 hash
    • title: page title
    • ns: namespace (always 0)
    • page_id: page id
    • redirect_title: if page is redirect, title of target page
    • html: revision content in HTML format
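    A hedged sketch of iterating over one of the gzip-compressed JSON files (assuming one JSON object per line, as the row-per-revision description suggests; the file path is illustrative):

    import gzip
    import json

    # illustrative path inside one of the 558 directories
    path = "enwiki-20190301-pages-meta-history1.xml-p1p1036/part-000.json.gz"

    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            rev = json.loads(line)
            print(rev["page_id"], rev["id"], rev["timestamp"], len(rev["html"]))
            break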

    Part 2: Page creation times (page_creation_times.json.gz)

    This JSON file specifies the creation time of each English Wikipedia page. It can, e.g., be used to determine if a wiki link was blue or red at a specific time in the past. Format:

    • page_id: page id
    • title: page title
    • ns: namespace (0 for articles)
    • timestamp: time when page was created

    Part 3: Redirect history (redirect_history.json.gz)

    This JSON file specifies all revisions corresponding to redirects, as well as the target page to which the respective page redirected at the time of the revision. This information is useful for reconstructing Wikipedia's link network at any time in the past. Format:

    • page_id: page id of redirect source
    • title: page title of redirect source
    • ns: namespace (0 for articles)
    • revision_id: revision id of redirect source
    • timestamp: time at which redirect became active
    • redirect: page title of redirect target (in 1st item of array; 2nd item can be ignored)

    The repository also contains two additional files, metadata.zip and mysql_database.zip. These two files are not part of WikiHist.html per se, and most users will not need to download them manually. The file metadata.zip is required by the download script (and will be fetched by the script automatically), and mysql_database.zip is required by the code used to produce WikiHist.html. The code that uses these files is hosted at GitHub, but the files are too big for GitHub and are therefore hosted here.

    WikiHist.html was produced by parsing the 1 March 2019 dump of https://dumps.wikimedia.org/enwiki/20190301 from wikitext to HTML. That old dump is not available anymore on Wikimedia's servers, so we make a copy available at https://archive.org/details/enwiki-20190301-original-full-history-dump_dlab .

  12. Kaggle Wikipedia Web Traffic Daily Dataset (without Missing Values)

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Apr 1, 2021
    + more versions
    Cite
    Rakshitha Godahewa; Rakshitha Godahewa; Christoph Bergmeir; Christoph Bergmeir; Geoff Webb; Geoff Webb; Rob Hyndman; Rob Hyndman; Pablo Montero-Manso; Pablo Montero-Manso (2021). Kaggle Wikipedia Web Traffic Daily Dataset (without Missing Values) [Dataset]. http://doi.org/10.5281/zenodo.4656075
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 1, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rakshitha Godahewa; Rakshitha Godahewa; Christoph Bergmeir; Christoph Bergmeir; Geoff Webb; Geoff Webb; Rob Hyndman; Rob Hyndman; Pablo Montero-Manso; Pablo Montero-Manso
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was used in the Kaggle Wikipedia Web Traffic forecasting competition. It contains 145063 daily time series representing the number of hits or web traffic for a set of Wikipedia pages from 2015-07-01 to 2017-09-10.

    The original dataset contains missing values, which have simply been replaced by zeros.

  13. Wikipedia Training Data for Megatron-LM

    • academictorrents.com
    bittorrent
    Updated Aug 28, 2021
    Cite
    None (2021). Wikipedia Training Data for Megatron-LM [Dataset]. https://academictorrents.com/details/b6215a898a2a08b6061d23f2e4e1094121fb7082
    Explore at:
    Available download formats: bittorrent (7840268306 bytes)
    Dataset updated
    Aug 28, 2021
    Authors
    None
    License

    No license specified: https://academictorrents.com/nolicensespecified

    Description

    A preprocessed dataset for training. Please see instructions in for how to use it. Note: the author does not own any copyrights of the data.

  14. Wikipedia pagecounts sorted by page (year 2014)

    • figshare.com
    txt
    Updated Feb 15, 2016
    Cite
    Alessio Bogon; Cristian Consonni; Alberto Montresor (2016). Wikipedia pagecounts sorted by page (year 2014) [Dataset]. http://doi.org/10.6084/m9.figshare.2085643.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 15, 2016
    Dataset provided by
    figshare
    Authors
    Alessio Bogon; Cristian Consonni; Alberto Montresor
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the page view statistics for all the WikiMedia projects in the year 2014, ordered by (project, page, timestamp). It has been generated starting from WikiMedia's pagecounts-raw [1] dataset.

    The CSV uses spaces as the delimiter, without any form of escaping, because it is not needed. It has 5 columns:

    • project: the project name
    • page: the page requested, url-escaped
    • timestamp: the timestamp of the hour (format: "%Y%m%d-%H%M%S")
    • count: the number of times the page has been requested (in that hour)
    • bytes: the number of bytes transferred (in that hour)

    You can download the full dataset via torrent [2].

    Further information about this dataset is available at: http://disi.unitn.it/~consonni/datasets/wikipedia-pagecounts-sorted-by-page-year-2014/

    [1] https://dumps.wikimedia.org/other/pagecounts-raw/
    [2] http://disi.unitn.it/~consonni/datasets/wikipedia-pagecounts-sorted-by-page-year-2014/#download
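    A minimal parsing sketch for one record of the space-delimited file (column order as listed above; the example line is illustrative):

    # illustrative record: project, page, timestamp, count, bytes
    line = "en Main_Page 20140101-000000 42 1234567"
    project, page, timestamp, count, transferred = line.split(" ")
    print(project, page, timestamp, int(count), int(transferred))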

  15. simple-wikipedia

    • huggingface.co
    Updated Aug 17, 2023
    Cite
    Rahul Aralikatte (2023). simple-wikipedia [Dataset]. https://huggingface.co/datasets/rahular/simple-wikipedia
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 17, 2023
    Authors
    Rahul Aralikatte
    Description

    simple-wikipedia

    Processed, text-only dump of the Simple Wikipedia (English). Contains 23,886,673 words.

  16. Data from: WikiMuTe: A web-sourced dataset of semantic descriptions for...

    • zenodo.org
    csv
    Updated Apr 17, 2024
    Cite
    Benno Weck; Benno Weck; Holger Kirchhoff; Holger Kirchhoff; Peter Grosche; Peter Grosche; Serra Xavier; Serra Xavier (2024). WikiMuTe: A web-sourced dataset of semantic descriptions for music audio [Dataset]. http://doi.org/10.5281/zenodo.10223363
    Explore at:
    Available download formats: csv
    Dataset updated
    Apr 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Benno Weck; Benno Weck; Holger Kirchhoff; Holger Kirchhoff; Peter Grosche; Peter Grosche; Serra Xavier; Serra Xavier
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This upload contains the supplementary material for our paper presented at the MMM2024 conference.

    Dataset

    The dataset contains rich text descriptions for music audio files collected from Wikipedia articles.

    The audio files are freely accessible and available for download through the URLs provided in the dataset.

    Example

    A few hand-picked, simplified examples from the dataset (each listing the audio file, its aspect tags, and its sentence descriptions):

    • 🔈 Bongo sound.wav
    aspects: ['bongoes', 'percussion instrument', 'cumbia', 'drums']
    sentences: ['a loop of bongoes playing a cumbia beat at 99 bpm']

    • 🔈 Example of double tracking in a pop-rock song (3 guitar tracks).ogg
    aspects: ['bass', 'rock', 'guitar music', 'guitar', 'pop', 'drums']
    sentences: ['a pop-rock song']

    • 🔈 OriginalDixielandJassBand-JazzMeBlues.ogg
    aspects: ['jazz standard', 'instrumental', 'jazz music', 'jazz']
    sentences: ['Considered to be a jazz standard', 'is an jazz composition']

    • 🔈 Colin Ross - Etherea.ogg
    aspects: ['chirping birds', 'ambient percussion', 'new-age', 'flute', 'recorder', 'single instrument', 'woodwind']
    sentences: ['features a single instrument with delayed echo, as well as ambient percussion and chirping birds', 'a new-age composition for recorder']

    • 🔈 Belau rekid (instrumental).oga
    aspects: ['instrumental', 'brass band']
    sentences: ['an instrumental brass band performance']

    ...

    Dataset structure

    We provide three variants of the dataset in the data folder.

    All are described in the paper.

    1. all.csv contains all the data we collected, without any filtering.
    2. filtered_sf.csv contains the data obtained using the self-filtering method.
    3. filtered_mc.csv contains the data obtained using the MusicCaps dataset method.

    File structure

    Each CSV file contains the following columns:

    • file: the name of the audio file
    • pageid: the ID of the Wikipedia article where the text was collected from
    • aspects: the short-form (tag) description texts collected from the Wikipedia articles
    • sentences: the long-form (caption) description texts collected from the Wikipedia articles
    • audio_url: the URL of the audio file
    • url: the URL of the Wikipedia article where the text was collected from
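    A hedged loading sketch (assuming the aspects and sentences columns are serialised as Python-style list literals, as the examples above suggest; adjust the parsing if the CSV stores them differently):

    import ast
    import pandas as pd

    df = pd.read_csv("data/all.csv")

    row = df.iloc[0]
    aspects = ast.literal_eval(row["aspects"])
    sentences = ast.literal_eval(row["sentences"])
    print(row["file"], row["url"])
    print(aspects)
    print(sentences)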

    Citation

    If you use this dataset in your research, please cite the following paper:

    @inproceedings{wikimute,
    title = {WikiMuTe: {A} Web-Sourced Dataset of Semantic Descriptions for Music Audio},
    author = {Weck, Benno and Kirchhoff, Holger and Grosche, Peter and Serra, Xavier},
    booktitle = "MultiMedia Modeling",
    year = "2024",
    publisher = "Springer Nature Switzerland",
    address = "Cham",
    pages = "42--56",
    doi = {10.1007/978-3-031-56435-2_4},
    url = {https://doi.org/10.1007/978-3-031-56435-2_4},
    }

    License

    The data is available under the Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) license.

    Each entry in the dataset contains a URL linking to the article, where the text data was collected from.

  17. wikipedia_toxicity_subtypes

    • tensorflow.org
    Updated Dec 6, 2022
    Cite
    (2022). wikipedia_toxicity_subtypes [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia_toxicity_subtypes
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    The comments in this dataset come from an archive of Wikipedia talk page comments. These have been annotated by Jigsaw for toxicity, as well as (for the main config) a variety of toxicity subtypes, including severe toxicity, obscenity, threatening language, insulting language, and identity attacks. This dataset is a replica of the data released for the Jigsaw Toxic Comment Classification Challenge and Jigsaw Multilingual Toxic Comment Classification competition on Kaggle, with the test dataset merged with the test_labels released after the end of the competitions. Test data not used for scoring has been dropped. This dataset is released under CC0, as is the underlying comment text.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wikipedia_toxicity_subtypes', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  18. wikipedia-2023-11-embed-multilingual-v3

    • huggingface.co
    Updated Nov 1, 2023
    Cite
    Cohere (2023). wikipedia-2023-11-embed-multilingual-v3 [Dataset]. https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 1, 2023
    Dataset authored and provided by
    Cohere (https://cohere.com/)
    Description

    Multilingual Embeddings for Wikipedia in 300+ Languages

    This dataset contains the wikimedia/wikipedia dataset dump from 2023-11-01 from Wikipedia in all 300+ languages. The individual articles have been chunked and embedded with the state-of-the-art multilingual Cohere Embed V3 embedding model. This enables an easy way to semantically search across all of Wikipedia or to use it as a knowledge source for your RAG application. In total it is close to 250M paragraphs / embeddings. You… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3.
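    A hedged streaming sketch (the "en" config name and the "title"/"emb" field names are assumptions based on the dataset card; verify them in the dataset viewer):

    from datasets import load_dataset

    # stream the English subset without downloading the full dump
    docs = load_dataset(
        "Cohere/wikipedia-2023-11-embed-multilingual-v3", "en",
        split="train", streaming=True,
    )

    for doc in docs:
        # assumed fields: "title", "text" (the chunk) and "emb" (the Cohere Embed V3 vector)
        print(doc["title"], len(doc["emb"]))
        break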

  19. Multilingual NER Data (English)

    • kaggle.com
    zip
    Updated Apr 12, 2021
    Cite
    Raj Nath Patel (2021). Multilingual NER Data (English) [Dataset]. https://www.kaggle.com/rajnathpatel/multilingual-ner-data-english
    Explore at:
    Available download formats: zip (1078306 bytes)
    Dataset updated
    Apr 12, 2021
    Authors
    Raj Nath Patel
    License

    GNU General Public License 2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Context

    The dataset is a processed version of the following:
    GitHub: https://github.com/afshinrahimi/mmner
    Download: https://www.amazon.com/clouddrive/share/d3KGCRCIYwhKJF0H3eWA26hjg2ZCRhjpEQtDL70FSBN

    The datasets are available for 218 languages at the above download link. I processed a few languages and uploaded them here. Let me know in the comments if you need data in any specific language.

    Content

    The dataset is annotated with the following 4 entity types: PER, LOC, ORG, and MISC.

    Acknowledgements

    Massively Multilingual Transfer for NER https://arxiv.org/abs/1902.00193

  20. wiki_dpr

    • huggingface.co
    Updated May 29, 2024
    Cite
    AI at Meta (2024). wiki_dpr [Dataset]. https://huggingface.co/datasets/facebook/wiki_dpr
    Explore at:
    Dataset updated
    May 29, 2024
    Dataset authored and provided by
    AI at Meta
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This is the Wikipedia split used to evaluate the Dense Passage Retrieval (DPR) model. It contains 21M passages from Wikipedia along with their DPR embeddings. The Wikipedia articles were split into multiple, disjoint text blocks of 100 words as passages.
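    A hedged loading sketch (the config name is one of the published wiki_dpr configurations, and the field names are assumptions taken from the dataset card; "no_index" variants skip building a FAISS index):

    from datasets import load_dataset

    # config and field names are assumptions; check the wiki_dpr card for exact values
    passages = load_dataset(
        "facebook/wiki_dpr", "psgs_w100.nq.no_index",
        split="train", streaming=True,
    )

    for p in passages:
        # each passage is a ~100-word block with its DPR embedding
        print(p["id"], p["title"], len(p["embeddings"]))
        break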
