Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dumps of the Armenian Wikipedia provided by the Wikimedia Foundation, available as gzipped XML files.
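Such a dump can be streamed without decompressing it to disk first. A minimal sketch in Python, assuming a locally downloaded file (the filename hywiki-latest-pages-articles.xml.gz follows the usual Wikimedia naming for the Armenian wiki and is an assumption here, as is the export namespace version):

import gzip
import xml.etree.ElementTree as ET

# Assumed local filename; Armenian dumps are published under
# https://dumps.wikimedia.org/hywiki/
DUMP_PATH = "hywiki-latest-pages-articles.xml.gz"

# MediaWiki export namespace; the version may differ between dumps.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

with gzip.open(DUMP_PATH, "rb") as f:
    # iterparse streams the XML, so the whole dump never sits in memory.
    for event, elem in ET.iterparse(f, events=("end",)):
        if elem.tag == NS + "page":
            print(elem.findtext(NS + "title"))
            elem.clear()  # free the subtree we just processed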
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
To use this dataset:
import tensorflow_datasets as tfds

ds = tfds.load('wikipedia', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
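Since the dataset is built with one config per language, a specific language can also be requested by name. A sketch, assuming the usual date.language config naming (the dump dates actually available depend on the installed tensorflow_datasets version):

import tensorflow_datasets as tfds

# Config names combine a dump date and a language code; '20201201.en'
# here is illustrative -- check the tfds catalog for available dates.
ds = tfds.load('wikipedia/20201201.en', split='train')
for ex in ds.take(1):
    print(ex['title'])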
A dedicated dataset for searching State Program Report (SPR) projects from IMLS.
WikiAuto provides a set of aligned sentences from English Wikipedia and
Simple English Wikipedia as a resource to train sentence simplification
systems. The authors first crowd-sourced a set of manual alignments between
sentences in a subset of the Simple English Wikipedia and their corresponding
versions in English Wikipedia (this corresponds to the manual config),
then trained a neural CRF system to predict these alignments. The trained
model was then applied to the other articles in Simple English Wikipedia
with an English counterpart to create a larger corpus of aligned sentences
(corresponding to the auto, auto_acl, auto_full_no_split, and
auto_full_with_split configs here).
To use this dataset:
import tensorflow_datasets as tfds

ds = tfds.load('wiki_auto', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
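The configs named above are selected the same way. For example, to load the crowd-sourced manual alignments instead of the model-predicted ones (a sketch; the splits available differ per config, so check the catalog entry):

import tensorflow_datasets as tfds

# 'manual' holds the crowd-sourced alignments described above;
# swap in 'auto', 'auto_acl', etc. for the model-predicted corpora.
ds = tfds.load('wiki_auto/manual', split='train')
for ex in ds.take(1):
    print(ex)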
The purpose of the Fiscal Service Data Registry is to promote the common identification, use, and sharing of data and information across the federal government.
Spreadsheet summaries of identifier availability and correctness in Wikipedia:
- Tabular summaries of identifier availability and correctness in Wikipedia; summary statistics of drugboxes and chemboxes
- Investigation of the John W. Huffman cannabinoid dataset
- Summary of Wikipedia pages linked to DSSTox records
- Complete identifier data scraped from Wikipedia Chembox and Drugbox pages
This dataset is associated with the following publication: Sinclair, G., I. Thillainadarajah, B. Meyer, V. Samano, S. Sivasupramaniam, L. Adams, E. Willighagen, A. Richard, M. Walker, and A. Williams. Wikipedia on the CompTox Chemicals Dashboard: Connecting Resources to Enrich Public Chemical Data. Journal of Chemical Information and Modeling. American Chemical Society, Washington, DC, USA, 62(20): 4888-4905, (2022).
Created by me using references from The Long Dark Fandom Wiki.
The Long Dark is a survival game created by Hinterland Games in 2017. Set in the Canadian wilderness after a geomagnetic storm has destroyed all electronics, you must brave the elements and wildlife to survive, collecting resources where you can.
I created this dataset because I wanted to do a case study inspired by The Long Dark, but couldn't find any CSV files. This is my first dataset, so please give suggestions! I'm new to the world of data analysis and hoping to learn as much as I can.
CPSC provides access to recalls via a recall database. The information is publicly available to consumers and businesses, as well as software and application developers.
The comments in this dataset come from an archive of Wikipedia talk page comments. These have been annotated by Jigsaw for toxicity, as well as (for the main config) a variety of toxicity subtypes, including severe toxicity, obscenity, threatening language, insulting language, and identity attacks. This dataset is a replica of the data released for the Jigsaw Toxic Comment Classification Challenge and Jigsaw Multilingual Toxic Comment Classification competition on Kaggle, with the test dataset merged with the test_labels released after the end of the competitions. Test data not used for scoring has been dropped. This dataset is released under CC0, as is the underlying comment text.
To use this dataset:
import tensorflow_datasets as tfds

ds = tfds.load('wikipedia_toxicity_subtypes', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
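Each example carries the comment text alongside the toxicity and subtype labels, so the dataset can be filtered before training. A sketch, assuming the feature names 'text' and 'toxicity' (verify against ds.element_spec):

import tensorflow_datasets as tfds

ds = tfds.load('wikipedia_toxicity_subtypes', split='train')
# Keep only comments labelled toxic; subtype fields such as 'insult'
# or 'threat' can be filtered the same way if present in the config.
toxic = ds.filter(lambda ex: ex['toxicity'] > 0)
for ex in toxic.take(2):
    print(ex['text'])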
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Overview: This dataset provides a compact and efficient way to explore the massive "Wikipedia Structured Contents" dataset by the Wikimedia Foundation, which consists of 38 large JSONL files (each ~2.5 GB). Loading these directly in Kaggle or Colab is impractical due to resource constraints. This file index solves that problem.
What's Inside:
This dataset includes a single JSONL file named wiki_structured_dataset_navigator.jsonl that contains metadata for every file in the English portion of the Wikimedia dataset.
Each line in the JSONL file is a JSON object with the following fields:
- file_name: the actual filename in the source dataset (e.g., enwiki_namespace_0_0.jsonl)
- file_index: the numeric row index of the file
- name: the Wikipedia article title or identifier
- url: a link to the full article on Wikipedia
- description: a short description or abstract of the article (when available)
Use Case: Use this dataset to search by keyword, article name, or description to find which specific files from the full Wikimedia dataset contain the topics you're interested in. You can then download only the relevant file(s) instead of the entire dataset.
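As a concrete example of that workflow, here is a minimal sketch that scans the index for a keyword and collects the source files to download; the field names follow the schema listed above, and the keyword is illustrative:

```python
import json

# Scan the navigator index for records mentioning a keyword and collect
# which files of the full Wikimedia dataset contain those articles.
keyword = "quantum"
files_to_fetch = set()
with open("wiki_structured_dataset_navigator.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        haystack = f"{rec.get('name', '')} {rec.get('description') or ''}".lower()
        if keyword in haystack:
            files_to_fetch.add(rec["file_name"])

print(sorted(files_to_fetch))
```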
Benefits:
- Lightweight (~MBs vs. GBs)
- Easy to load and search
- Great for indexing, previewing, and subsetting the Wikimedia dataset
- Saves time, bandwidth, and compute resources
Example Usage (Python):
```python
import json

import kagglehub
import pandas as pd
from tqdm import tqdm

def read_jsonl(file_path, max_records=None):
    """Read a JSONL file into a list of dicts, optionally capped at max_records."""
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for i, line in enumerate(tqdm(f)):
            if max_records and i >= max_records:
                break
            data.append(json.loads(line))
    return data

file_path = kagglehub.dataset_download(
    "mehranism/wikimedia-structured-dataset-navigator-jsonl",
    path="wiki_structured_dataset_navigator.jsonl",
)
data = read_jsonl(file_path)
print(f"Successfully loaded {len(data)} records")

df = pd.DataFrame(data)
print(f"Dataset shape: {df.shape}")
print("Columns in the dataset:")
for col in df.columns:
    print(f"- {col}")
```
This dataset is perfect for developers working on:
- Retrieval-Augmented Generation (RAG)
- Large Language Model (LLM) fine-tuning
- Search and filtering pipelines
- Academic research on structured Wikipedia content
Tip:
Pair this index with the original [Wikipedia Structured Contents dataset](https://www.kaggle.com/datasets/wikimedia-foundation/wikipedia-structured-contents) for full article access.
Format:
- File: `wiki_structured_dataset_navigator.jsonl`
- Format: JSON Lines (1 object per line)
- Encoding: UTF-8
---
### **Tags**
wikipedia, wikimedia, jsonl, structured-data, search-index, metadata, file-catalog, dataset-index, large-language-models, machine-learning
CC0: Public Domain Dedication
(Recommended for open indexing tools with no sensitive data.)
This dataset contains ontology terms associated with key events from the AOP-Wiki. This information was used to seed the AOP-Wiki with a carefully selected set of ontology terms prior to opening up the option for authors to tag their own AOPs. This is intended to provide existing examples for authors and improve consistency when assigning terms to the key events. This dataset is associated with the following publication: Ives, C., I. Campia, R. Wang, C. Wittwehr, and S. Edwards. Creating a Structured Adverse Outcome Pathway Knowledgebase via Ontology-Based Annotations. Applied In Vitro Toxicology. Mary Ann Liebert, Inc., Larchmont, NY, USA, 3(4): 298-311, (2017).
Data Files: CSV, SAS, and SPSS; Documentation; Annual Report; State Profiles; Supplementary Tables with User Note; Data Element Definitions; and News Release.
Data Files: CSV (29 KB) and SAS (40 KB); Documentation (PDF, 1.43 MB); Report.
The cumulative total of net sales demand for numismatic products reported from the launch of each product through the report date. Updated weekly by 5 p.m. (ET) Tuesdays. These figures should not be considered final and are subject to change as audited. Products are classified by program category. Some products may contain multiple coins that could fall under various categories.
A dedicated dataset for searching awarded grants from the Institute of Museum and Library Services.
For all products regulated by the CPSC, the Commission issues a Letter of Advice (LOA) when there is a violation of a mandatory standard.
Daily Principal Outstanding
Buyback Results displays the results of periodic buybacks of unmatured U.S. Treasury marketable securities.
Buyback Announcements displays announcements of periodic buybacks of unmatured U.S. Treasury marketable securities.
Federal Credit Similar Maturity Rates: Rates are displayed from 1 year or less to 20 years or more for fiscal years 1992 through the previous fiscal year.