100+ datasets found
  1. Armenian wikipedia (hywiki) XML dumps - Dataset - Data Catalog Armenia

    • data.opendata.am
    Updated Apr 6, 2023
    Cite
    (2023). Armenian wikipedia (hywiki) XML dumps - Dataset - Data Catalog Armenia [Dataset]. https://data.opendata.am/dataset/hywiki-xml-dumps
    Explore at:
    Dataset updated
    Apr 6, 2023
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Armenia
    Description

    Dumps of the Armenian Wikipedia provided by the Wikimedia Foundation, available as gzipped XML files.
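
Because the dumps are gzipped XML, they can be read as a stream instead of being decompressed in full first. The following is a minimal sketch using only the Python standard library; the function name `iter_page_titles` is hypothetical, and the `page`/`title` element names are an assumption based on the usual MediaWiki export layout rather than something this listing states.

```python
import gzip
import xml.etree.ElementTree as ET


def iter_page_titles(dump_path):
    """Yield page titles from a gzipped XML dump, one page at a time."""
    with gzip.open(dump_path, "rb") as handle:
        # iterparse emits each element once its closing tag has been read.
        for _event, elem in ET.iterparse(handle, events=("end",)):
            # Tags may carry a namespace prefix such as "{...}page"; drop it.
            tag = elem.tag.rsplit("}", 1)[-1]
            if tag == "page":  # assumed element name
                for child in elem:
                    if child.tag.rsplit("}", 1)[-1] == "title":
                        yield child.text
                # Drop the finished page's children so memory stays small.
                elem.clear()
```

Calling `list(iter_page_titles(path))` on a downloaded dump would collect every title; iterating lazily is usually preferable for large files.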

  2. wikipedia

    • tensorflow.org
    • huggingface.co
    Cite
    wikipedia [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia
    Explore at:
    Description

    Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).

    To use this dataset:

    import tensorflow_datasets as tfds

    ds = tfds.load('wikipedia', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  3. Search State Program Report (SPR) Projects

    • catalog.data.gov
    Updated Mar 27, 2025
    Cite
    Institute of Museum and Library Services (2025). Search State Program Report (SPR) Projects [Dataset]. https://catalog.data.gov/dataset/search-state-program-report-spr-projects
    Explore at:
    Dataset updated
    Mar 27, 2025
    Dataset provided by
    Institute of Museum and Library Services (https://www.imls.gov/)
    Description

    A dedicated dataset for searching State Program Report (SPR) projects from IMLS.

  4. wiki_auto

    • tensorflow.org
    • opendatalab.com
    • +1more
    Updated Dec 6, 2022
    + more versions
    Cite
    (2022). wiki_auto [Dataset]. https://www.tensorflow.org/datasets/catalog/wiki_auto
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    WikiAuto provides a set of aligned sentences from English Wikipedia and Simple English Wikipedia as a resource to train sentence simplification systems. The authors first crowd-sourced a set of manual alignments between sentences in a subset of the Simple English Wikipedia and their corresponding versions in English Wikipedia (this corresponds to the manual config), then trained a neural CRF system to predict these alignments. The trained model was then applied to the other articles in Simple English Wikipedia with an English counterpart to create a larger corpus of aligned sentences (corresponding to the auto, auto_acl, auto_full_no_split, and auto_full_with_split configs here).

    To use this dataset:

    import tensorflow_datasets as tfds

    ds = tfds.load('wiki_auto', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  5. Data Registry

    • catalog.data.gov
    • data.wu.ac.at
    Updated Dec 1, 2023
    Cite
    Bureau of the Fiscal Service (2023). Data Registry [Dataset]. https://catalog.data.gov/dataset/data-registry
    Explore at:
    Dataset updated
    Dec 1, 2023
    Dataset provided by
    Bureau of the Fiscal Service (https://www.fiscal.treasury.gov/)
    Description

    The purpose of the Fiscal Service Data Registry is to promote the common identification, use and sharing of data/information across the federal government.

  6. Data from: Wikipedia on the CompTox Chemicals Dashboard: Connecting...

    • catalog.data.gov
    • datasets.ai
    Updated Nov 3, 2022
    + more versions
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). Wikipedia on the CompTox Chemicals Dashboard: Connecting Resources to Enrich Public Chemical Data [Dataset]. https://catalog.data.gov/dataset/wikipedia-on-the-comptox-chemicals-dashboard-connecting-resources-to-enrich-public-chemica
    Explore at:
    Dataset updated
    Nov 3, 2022
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Spreadsheet summaries of identifier availability and correctness in Wikipedia. Tabular summaries of identifier availability and correctness in Wikipedia; summary statistics of drugboxes and chemboxes. Investigation of John W. Huffman cannabinoid dataset. Summary of Wikipedia pages linked to DSSTox records. Complete identifier data scraped from Wikipedia Chembox and Drugbox pages.

    This dataset is associated with the following publication: Sinclair, G., I. Thillainadarajah, B. Meyer, V. Samano, S. Sivasupramaniam, L. Adams, E. Willighagen, A. Richard, M. Walker, and A. Williams. Wikipedia on the CompTox Chemicals Dashboard: Connecting Resources to Enrich Public Chemical Data. Journal of Chemical Information and Modeling. American Chemical Society, Washington, DC, USA, 62(20): 4888-4905, (2022).

  7. The Long Dark Item Catalog

    • kaggle.com
    zip
    Updated Sep 7, 2024
    Cite
    mckenna (2024). The Long Dark Item Catalog [Dataset]. https://www.kaggle.com/datasets/mckennae34/the-long-dark-item-catalog
    Explore at:
    zip (4009 bytes)
    Dataset updated
    Sep 7, 2024
    Authors
    mckenna
    Description

    Created myself using references from The Long Dark Fandom Wiki.

    The Long Dark is a survival game created by Hinterland Games in 2017. Set in the Canadian wilderness, a geomagnetic storm has destroyed all electronics. You must brave the elements and wildlife to survive, collecting resources where you can.

    I created this dataset because I wanted to do a case-study inspired by The Long Dark, but couldn't find any csv files. This is my first dataset, so please give suggestions! I'm new to the world of data analysis and hoping to learn as much as I can.

  8. Recalls API

    • s.cnmilf.com
    • catalog.data.gov
    • +1more
    Updated Mar 4, 2021
    Cite
    U.S. Consumer Product Safety Commission (2021). Recalls API [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/recalls-api
    Explore at:
    Dataset updated
    Mar 4, 2021
    Dataset provided by
    U.S. Consumer Product Safety Commission (http://cpsc.gov/)
    Description

    CPSC provides access to recalls via a recall database. The information is publicly available to consumers and businesses as well as software and application developers.

  9. wikipedia_toxicity_subtypes

    • tensorflow.org
    Updated Dec 6, 2022
    Cite
    (2022). wikipedia_toxicity_subtypes [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia_toxicity_subtypes
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    The comments in this dataset come from an archive of Wikipedia talk page comments. These have been annotated by Jigsaw for toxicity, as well as (for the main config) a variety of toxicity subtypes, including severe toxicity, obscenity, threatening language, insulting language, and identity attacks. This dataset is a replica of the data released for the Jigsaw Toxic Comment Classification Challenge and Jigsaw Multilingual Toxic Comment Classification competition on Kaggle, with the test dataset merged with the test_labels released after the end of the competitions. Test data not used for scoring has been dropped. This dataset is released under CC0, as is the underlying comment text.

    To use this dataset:

    import tensorflow_datasets as tfds

    ds = tfds.load('wikipedia_toxicity_subtypes', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  10. Wikimedia Structured Dataset Navigator (JSONL)

    • kaggle.com
    zip
    Updated Apr 23, 2025
    Cite
    Mehranism (2025). Wikimedia Structured Dataset Navigator (JSONL) [Dataset]. https://www.kaggle.com/datasets/mehranism/wikimedia-structured-dataset-navigator-jsonl
    Explore at:
    zip (266196504 bytes)
    Dataset updated
    Apr 23, 2025
    Authors
    Mehranism
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📚 Overview: This dataset provides a compact and efficient way to explore the massive "Wikipedia Structured Contents" dataset by Wikimedia Foundation, which consists of 38 large JSONL files (each ~2.5GB). Loading these directly in Kaggle or Colab is impractical due to resource constraints. This file index solves that problem.

    🔍 What's Inside: This dataset includes a single JSONL file named wiki_structured_dataset_navigator.jsonl that contains metadata for every file in the English portion of the Wikimedia dataset.

    Each line in the JSONL file is a JSON object with the following fields:
    - file_name: the actual filename in the source dataset (e.g., enwiki_namespace_0_0.jsonl)
    - file_index: the numeric row index of the file
    - name: the Wikipedia article title or identifier
    - url: a link to the full article on Wikipedia
    - description: a short description or abstract of the article (when available)

    🛠 Use Case: Use this dataset to search by keyword, article name, or description to find which specific files from the full Wikimedia dataset contain the topics you're interested in. You can then download only the relevant file(s) instead of the entire dataset.

    ⚡️ Benefits:
    - Lightweight (~MBs vs. GBs)
    - Easy to load and search
    - Great for indexing, previewing, and subsetting the Wikimedia dataset
    - Saves time, bandwidth, and compute resources

    📎 Example Usage (Python):

    ```python
    import kagglehub
    import json
    import pandas as pd
    import numpy as np
    import os
    from tqdm import tqdm
    from datetime import datetime
    import re

    def read_jsonl(file_path, max_records=None):
        data = []
        with open(file_path, 'r', encoding='utf-8') as f:
            for i, line in enumerate(tqdm(f)):
                if max_records and i >= max_records:
                    break
                data.append(json.loads(line))
        return data

    file_path = kagglehub.dataset_download("mehranism/wikimedia-structured-dataset-navigator-jsonl", path="wiki_structured_dataset_navigator.jsonl")
    data = read_jsonl(file_path)
    print(f"Successfully loaded {len(data)} records")

    df = pd.DataFrame(data)
    print(f"Dataset shape: {df.shape}")
    print("\nColumns in the dataset:")
    for col in df.columns:
        print(f"- {col}")
    ```

    This dataset is perfect for developers working on:
    - Retrieval-Augmented Generation (RAG)
    - Large Language Model (LLM) fine-tuning
    - Search and filtering pipelines
    - Academic research on structured Wikipedia content

    💡 Tip:
    Pair this index with the original [Wikipedia Structured Contents dataset](https://www.kaggle.com/datasets/wikimedia-foundation/wikipedia-structured-contents) for full article access.

    📃 Format:
    - File: `wiki_structured_dataset_navigator.jsonl`
    - Format: JSON Lines (1 object per line)
    - Encoding: UTF-8

    Tags

    wikipedia, wikimedia, jsonl, structured-data, search-index, metadata, file-catalog, dataset-index, large-language-models, machine-learning

    Licensing

    CC0: Public Domain Dedication
    

    (Recommended for open indexing tools with no sensitive data.)
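
The use case described for this index (search by keyword, then download only the matching source files) could be sketched roughly as follows. The field names `file_name`, `name`, and `description` come from the dataset description; the helper names `search_index` and `files_to_download` are hypothetical and not part of the dataset or of any library.

```python
import json


def search_index(jsonl_path, keyword, limit=10):
    """Return up to `limit` records whose name or description contains `keyword`."""
    needle = keyword.lower()
    matches = []
    with open(jsonl_path, "r", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            # `description` is only present "when available", so default to "".
            name = record.get("name") or ""
            description = record.get("description") or ""
            if needle in f"{name} {description}".lower():
                matches.append(record)
                if len(matches) >= limit:
                    break
    return matches


def files_to_download(matches):
    """Collect the distinct source file names referenced by matching records."""
    return sorted({m["file_name"] for m in matches if m.get("file_name")})
```

Reading line by line keeps memory use flat, which suits the stated goal of avoiding the full multi-gigabyte download; a pandas-based filter would work equally well once the index is loaded.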

  11. AOP-Wiki Event Component Annotation

    • catalog.data.gov
    • datasets.ai
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). AOP-Wiki Event Component Annotation [Dataset]. https://catalog.data.gov/dataset/aop-wiki-event-component-annotation
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This dataset contains ontology terms associated with key events from the AOP-Wiki. This information was used to seed the AOP-Wiki with a carefully selected set of ontology terms prior to opening up the option for authors to tag their own AOPs. This is intended to provide existing examples for authors and improve consistency when assigning terms to the key events. This dataset is associated with the following publication: Ives, C., I. Campia, R. Wang, C. Wittwehr, and S. Edwards. Creating a Structured Adverse Outcome Pathway Knowledgebase via Ontology-Based Annotations. Applied In Vitro Toxicology. Mary Ann Liebert, Inc., Larchmont, NY, USA, 3(4): 298-311, (2017).

  12. Public Library Survey (PLS) 2015

    • s.cnmilf.com
    • catalog.data.gov
    Updated Mar 27, 2025
    + more versions
    Cite
    Institute of Museum and Library Services (2025). Public Library Survey (PLS) 2015 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/public-library-survey-pls-2015
    Explore at:
    Dataset updated
    Mar 27, 2025
    Dataset provided by
    Institute of Museum and Library Services (https://www.imls.gov/)
    Description

    Data Files - CSV, SAS, and SPSS; Documentation; Annual Report; State Profiles; Supplementary Tables with User Note; Data Element Definitions; and News Release.

  13. State Library Administrative Agencies Survey: Fiscal Year 2016

    • s.cnmilf.com
    • catalog.data.gov
    Updated Mar 27, 2025
    + more versions
    Cite
    Institute of Museum and Library Services (2025). State Library Administrative Agencies Survey: Fiscal Year 2016 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/state-library-administrative-agencies-survey-fiscal-year-2016-dbb44
    Explore at:
    Dataset updated
    Mar 27, 2025
    Dataset provided by
    Institute of Museum and Library Services (https://www.imls.gov/)
    Description

    Data Files - CSV (29 KB) and SAS (40 KB); Documentation (PDF, 1.43 MB); Report.

  14. Cumulative Numismatic Sales Figures

    • s.cnmilf.com
    • datasets.ai
    • +1more
    Updated Jun 5, 2025
    Cite
    United States Mint, Sales and Marketing (SAM) Department (2025). Cumulative Numismatic Sales Figures [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/cumulative-numismatic-sales-figures
    Explore at:
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    United States Mint (http://usmint.gov/)
    Description

    The cumulative total of net sales demand for numismatic products reported from the launch of each product through the report date. Updated weekly by 5 p.m. (ET) Tuesdays. These figures should not be considered final and are subject to change as audited. Products are classified by program category. Some products may contain multiple coins that could fall under various categories.

  15. Search Awarded Grants

    • catalog.data.gov
    Updated Mar 27, 2025
    Cite
    Institute of Museum and Library Services (2025). Search Awarded Grants [Dataset]. https://catalog.data.gov/dataset/search-awarded-grants
    Explore at:
    Dataset updated
    Mar 27, 2025
    Dataset provided by
    Institute of Museum and Library Services (https://www.imls.gov/)
    Description

    A dedicated dataset for searching awarded grants from the Institute of Museum and Library Services.

  16. Recall Violations

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Mar 4, 2021
    Cite
    U.S. Consumer Product Safety Commission (2021). Recall Violations [Dataset]. https://catalog.data.gov/dataset/recall-violations
    Explore at:
    Dataset updated
    Mar 4, 2021
    Dataset provided by
    U.S. Consumer Product Safety Commission (http://cpsc.gov/)
    Description

    For all products regulated by the CPSC, the Commission issues a Letter of Advice (LOA) when there is a violation of a mandatory standard.

  17. Principal Outstanding Detail Reports

    • catalog.data.gov
    • data.wu.ac.at
    Updated Dec 1, 2023
    Cite
    Bureau of the Fiscal Service (2023). Principal Outstanding Detail Reports [Dataset]. https://catalog.data.gov/dataset/principal-outstanding-detail-reports
    Explore at:
    Dataset updated
    Dec 1, 2023
    Dataset provided by
    Bureau of the Fiscal Service (https://www.fiscal.treasury.gov/)
    Description

    Daily Principal Outstanding

  18. Buyback Results

    • catalog.data.gov
    • s.cnmilf.com
    Updated Dec 1, 2023
    Cite
    Bureau of the Fiscal Service (2023). Buyback Results [Dataset]. https://catalog.data.gov/dataset/buyback-results
    Explore at:
    Dataset updated
    Dec 1, 2023
    Dataset provided by
    Bureau of the Fiscal Service (https://www.fiscal.treasury.gov/)
    Description

    Buyback Results displays results of periodic buy backs of unmatured U.S. Treasury marketable securities.

  19. Buyback Announcements

    • catalog.data.gov
    • datasets.ai
    Updated Dec 1, 2023
    Cite
    Bureau of the Fiscal Service (2023). Buyback Announcements [Dataset]. https://catalog.data.gov/dataset/buyback-announcements
    Explore at:
    Dataset updated
    Dec 1, 2023
    Dataset provided by
    Bureau of the Fiscal Service (https://www.fiscal.treasury.gov/)
    Description

    Buyback Announcements displays announcements of periodic buy backs of unmatured U.S. Treasury marketable securities.

  20. Federal Credit Similar Maturity Rates

    • catalog.data.gov
    • datasets.ai
    Updated Dec 1, 2023
    + more versions
    Cite
    Bureau of the Fiscal Service (2023). Federal Credit Similar Maturity Rates [Dataset]. https://catalog.data.gov/dataset/federal-credit-similar-maturity-rates
    Explore at:
    Dataset updated
    Dec 1, 2023
    Dataset provided by
    Bureau of the Fiscal Service (https://www.fiscal.treasury.gov/)
    Description

    Federal Credit Similar Maturity Rates: Rates are displayed from 1 year or less to 20 years or more for fiscal years 1992 through the previous fiscal year.
