Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dumps of the Armenian Wikipedia provided by the Wikimedia Foundation, available as gzipped XML files.
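Such a dump can be streamed without decompressing it to disk first. A minimal sketch in Python, assuming a locally downloaded file (the filename hywiki-latest-pages-articles.xml.gz follows the usual Wikimedia naming for the Armenian wiki and is an assumption here, as is the export namespace version):

import gzip
import xml.etree.ElementTree as ET

# Assumed local filename; Armenian dumps are published under
# https://dumps.wikimedia.org/hywiki/
DUMP_PATH = "hywiki-latest-pages-articles.xml.gz"

# MediaWiki export namespace; the version may differ between dumps.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

with gzip.open(DUMP_PATH, "rb") as f:
    # iterparse streams the XML, so the whole dump never sits in memory.
    for event, elem in ET.iterparse(f, events=("end",)):
        if elem.tag == NS + "page":
            print(elem.findtext(NS + "title"))
            elem.clear()  # free the subtree we just processed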
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
To use this dataset:
import tensorflow_datasets as tfds

ds = tfds.load('wikipedia', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
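Since the dataset is built with one config per language, a specific language can also be requested by name. A sketch, assuming the usual date.language config naming (the dump dates actually available depend on the installed tensorflow_datasets version):

import tensorflow_datasets as tfds

# Config names combine a dump date and a language code; '20201201.en'
# here is illustrative -- check the tfds catalog for available dates.
ds = tfds.load('wikipedia/20201201.en', split='train')
for ex in ds.take(1):
    print(ex['title'])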
A dedicated dataset for searching State Program Report (SPR) projects from IMLS.
WikiAuto provides a set of aligned sentences from English Wikipedia and
Simple English Wikipedia as a resource to train sentence simplification
systems. The authors first crowd-sourced a set of manual alignments between
sentences in a subset of the Simple English Wikipedia and their corresponding
versions in English Wikipedia (this corresponds to the manual config),
then trained a neural CRF system to predict these alignments. The trained
model was then applied to the other articles in Simple English Wikipedia
with an English counterpart to create a larger corpus of aligned sentences
(corresponding to the auto, auto_acl, auto_full_no_split, and
auto_full_with_split configs here).
To use this dataset:
import tensorflow_datasets as tfds

ds = tfds.load('wiki_auto', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
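The configs named above are selected the same way. For example, to load the crowd-sourced manual alignments instead of the model-predicted ones (a sketch; the splits available differ per config, so check the catalog entry):

import tensorflow_datasets as tfds

# 'manual' holds the crowd-sourced alignments described above;
# swap in 'auto', 'auto_acl', etc. for the model-predicted corpora.
ds = tfds.load('wiki_auto/manual', split='train')
for ex in ds.take(1):
    print(ex)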
The purpose of the Fiscal Service Data Registry is to promote the common identification, use, and sharing of data and information across the federal government.
Spreadsheet summaries of identifier availability and correctness in Wikipedia:
- Tabular summaries of identifier availability and correctness in Wikipedia; summary statistics of drugboxes and chemboxes
- Investigation of the John W. Huffman cannabinoid dataset
- Summary of Wikipedia pages linked to DSSTox records
- Complete identifier data scraped from Wikipedia Chembox and Drugbox pages
This dataset is associated with the following publication: Sinclair, G., I. Thillainadarajah, B. Meyer, V. Samano, S. Sivasupramaniam, L. Adams, E. Willighagen, A. Richard, M. Walker, and A. Williams. Wikipedia on the CompTox Chemicals Dashboard: Connecting Resources to Enrich Public Chemical Data. Journal of Chemical Information and Modeling. American Chemical Society, Washington, DC, USA, 62(20): 4888-4905, (2022).
Created by me using references from The Long Dark Fandom Wiki.
The Long Dark is a survival game created by Hinterland Games in 2017. Set in the Canadian wilderness after a geomagnetic storm has destroyed all electronics, you must brave the elements and wildlife to survive, collecting resources where you can.
I created this dataset because I wanted to do a case study inspired by The Long Dark, but couldn't find any CSV files. This is my first dataset, so please give suggestions! I'm new to the world of data analysis and hoping to learn as much as I can.
CPSC provides access to recalls via a recall database. The information is publicly available to consumers and businesses, as well as software and application developers.
The comments in this dataset come from an archive of Wikipedia talk page comments. These have been annotated by Jigsaw for toxicity, as well as (for the main config) a variety of toxicity subtypes, including severe toxicity, obscenity, threatening language, insulting language, and identity attacks. This dataset is a replica of the data released for the Jigsaw Toxic Comment Classification Challenge and Jigsaw Multilingual Toxic Comment Classification competition on Kaggle, with the test dataset merged with the test_labels released after the end of the competitions. Test data not used for scoring has been dropped. This dataset is released under CC0, as is the underlying comment text.
To use this dataset:
import tensorflow_datasets as tfds

ds = tfds.load('wikipedia_toxicity_subtypes', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
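Each example carries the comment text alongside the toxicity and subtype labels, so the dataset can be filtered before training. A sketch, assuming the feature names 'text' and 'toxicity' (verify against ds.element_spec):

import tensorflow_datasets as tfds

ds = tfds.load('wikipedia_toxicity_subtypes', split='train')
# Keep only comments labelled toxic; subtype fields such as 'insult'
# or 'threat' can be filtered the same way if present in the config.
toxic = ds.filter(lambda ex: ex['toxicity'] > 0)
for ex in toxic.take(2):
    print(ex['text'])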
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Overview: This dataset provides a compact and efficient way to explore the massive "Wikipedia Structured Contents" dataset by the Wikimedia Foundation, which consists of 38 large JSONL files (each ~2.5 GB). Loading these directly in Kaggle or Colab is impractical due to resource constraints. This file index solves that problem.
What's Inside:
This dataset includes a single JSONL file named wiki_structured_dataset_navigator.jsonl that contains metadata for every file in the English portion of the Wikimedia dataset.
Each line in the JSONL file is a JSON object with the following fields:
- file_name: the actual filename in the source dataset (e.g., enwiki_namespace_0_0.jsonl)
- file_index: the numeric row index of the file
- name: the Wikipedia article title or identifier
- url: a link to the full article on Wikipedia
- description: a short description or abstract of the article (when available)
Use Case: Use this dataset to search by keyword, article name, or description to find which specific files from the full Wikimedia dataset contain the topics you're interested in. You can then download only the relevant file(s) instead of the entire dataset.
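As a concrete example of that workflow, here is a minimal sketch that scans the index for a keyword and collects the source files to download; the field names follow the schema listed above, and the keyword is illustrative:

```python
import json

# Scan the navigator index for records mentioning a keyword and collect
# which files of the full Wikimedia dataset contain those articles.
keyword = "quantum"
files_to_fetch = set()
with open("wiki_structured_dataset_navigator.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        haystack = f"{rec.get('name', '')} {rec.get('description') or ''}".lower()
        if keyword in haystack:
            files_to_fetch.add(rec["file_name"])

print(sorted(files_to_fetch))
```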
Benefits:
- Lightweight (~MBs vs. GBs)
- Easy to load and search
- Great for indexing, previewing, and subsetting the Wikimedia dataset
- Saves time, bandwidth, and compute resources
Example Usage (Python):
```python
import json

import kagglehub
import pandas as pd
from tqdm import tqdm

def read_jsonl(file_path, max_records=None):
    """Read a JSONL file into a list of dicts, optionally capped at max_records."""
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for i, line in enumerate(tqdm(f)):
            if max_records and i >= max_records:
                break
            data.append(json.loads(line))
    return data

file_path = kagglehub.dataset_download(
    "mehranism/wikimedia-structured-dataset-navigator-jsonl",
    path="wiki_structured_dataset_navigator.jsonl",
)
data = read_jsonl(file_path)
print(f"Successfully loaded {len(data)} records")

df = pd.DataFrame(data)
print(f"Dataset shape: {df.shape}")
print("Columns in the dataset:")
for col in df.columns:
    print(f"- {col}")
```
This dataset is perfect for developers working on:
- Retrieval-Augmented Generation (RAG)
- Large Language Model (LLM) fine-tuning
- Search and filtering pipelines
- Academic research on structured Wikipedia content
Tip:
Pair this index with the original [Wikipedia Structured Contents dataset](https://www.kaggle.com/datasets/wikimedia-foundation/wikipedia-structured-contents) for full article access.
Format:
- File: `wiki_structured_dataset_navigator.jsonl`
- Format: JSON Lines (1 object per line)
- Encoding: UTF-8
---
### **Tags**
wikipedia, wikimedia, jsonl, structured-data, search-index, metadata, file-catalog, dataset-index, large-language-models, machine-learning
CC0: Public Domain Dedication
(Recommended for open indexing tools with no sensitive data.)
This dataset contains ontology terms associated with key events from the AOP-Wiki. This information was used to seed the AOP-Wiki with a carefully selected set of ontology terms prior to opening up the option for authors to tag their own AOPs. This is intended to provide existing examples for authors and improve consistency when assigning terms to the key events. This dataset is associated with the following publication: Ives, C., I. Campia, R. Wang, C. Wittwehr, and S. Edwards. Creating a Structured Adverse Outcome Pathway Knowledgebase via Ontology-Based Annotations. Applied In Vitro Toxicology. Mary Ann Liebert, Inc., Larchmont, NY, USA, 3(4): 298-311, (2017).
Data Files: CSV, SAS, and SPSS; Documentation; Annual Report; State Profiles; Supplementary Tables with User Note; Data Element Definitions; and News Release.
Data Files: CSV (29 KB) and SAS (40 KB); Documentation (PDF, 1.43 MB); Report.
The cumulative total of net sales demand for numismatic products reported from the launch of each product through the report date. Updated weekly by 5 p.m. (ET) Tuesdays. These figures should not be considered final and are subject to change as audited. Products are classified by program category. Some products may contain multiple coins that could fall under various categories.
A dedicated dataset for searching awarded grants from the Institute of Museum and Library Services.
For all products regulated by the CPSC, the Commission issues a Letter of Advice (LOA) when there is a violation of a mandatory standard.
Daily Principal Outstanding
Buyback Results displays the results of periodic buybacks of unmatured U.S. Treasury marketable securities.
Buyback Announcements displays announcements of periodic buybacks of unmatured U.S. Treasury marketable securities.
Federal Credit Similar Maturity Rates: Rates are displayed from 1 year or less to 20 years or more for fiscal years 1992 through the previous fiscal year.