Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for Wikimedia Structured Wikipedia
Dataset Description
Dataset Summary
Early beta release of pre-parsed English and French Wikipedia articles, including infoboxes; feedback is invited. This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and output as structured JSON files with a consistent schema (JSONL compressed as zip). Each JSON line holds the content of one full Wikipedia article stripped of… See the full description on the dataset page: https://huggingface.co/datasets/wikimedia/structured-wikipedia.
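As a rough illustration of the JSONL layout, here is a minimal sketch that parses one article per line from a decompressed dump file and inspects the available fields; the file name is hypothetical and the full schema is documented on the dataset page.

import json

# Hypothetical local path to one decompressed JSONL dump file; the actual
# file names and the full field schema are documented on the dataset page.
path = "enwiki_structured.jsonl"

with open(path, encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)     # each line holds one full article
        print(sorted(article.keys()))  # inspect the available fields
        break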
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The dataset contains metadata encoded in JSON and extracted from more than one million arXiv articles that were put online before the end of 2016. The metadata includes the arXiv id, category names, title, author names, abstract, link to article, publication date and table of contents.
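A minimal sketch of how this metadata might be consumed, assuming the records are stored as a single JSON array and using field names implied by the description above (both are assumptions; check the actual files before use):

import json

# Hypothetical file name; field names such as "categories" follow the
# description above but are assumptions.
with open("arxiv_metadata.json", encoding="utf-8") as f:
    records = json.load(f)  # assuming a single JSON array of records

# Count records whose category names mention "math".
math_records = [r for r in records if "math" in str(r.get("categories", "")).lower()]
print(len(math_records), "records with a math-related category")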
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data collected during a study ("Towards High-Value Datasets determination for data-driven development: a systematic literature review") conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun, Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean) and Andrea Miletič (University of Zagreb).
It is being made public both to serve as supplementary data for the "Towards High-Value Datasets determination for data-driven development: a systematic literature review" paper (a pre-print is available in Open Access at https://arxiv.org/abs/2305.10234) and to allow other researchers to use these data in their own work.
The protocol is intended for a Systematic Literature Review (SLR) on the topic of high-value datasets, with the aim of gathering information on how the topic of high-value datasets (HVD) and their determination has been reflected in the literature over the years and what these studies have found to date, including the indicators used, the stakeholders involved, data-related aspects, and frameworks. The data in this dataset were collected as a result of the SLR over Scopus, Web of Science, and the Digital Government Research library (DGRL) in 2023.
***Methodology***
To understand how HVD determination has been reflected in the literature over the years and what these studies have found to date, all relevant literature covering this topic was studied. To this end, the SLR was carried out by searching the digital libraries covered by Scopus, Web of Science (WoS), and the Digital Government Research library (DGRL).
These databases were queried for the keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), applied to the article title, keywords, and abstract, to limit the results to papers in which these concepts were primary research objects rather than merely mentioned in the body, e.g., as future work. After deduplication, 11 unique articles remained and were checked for relevance; as a result, a total of 9 articles were examined further. Each study was independently examined by at least two authors.
To attain the objective of our study, we developed a protocol in which the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design-related information, (3) quality-related information, (4) HVD determination-related information.
***Test procedure***
Each study was independently examined by at least two authors; after an in-depth examination of the full text of the article, the structured protocol was filled in for each study.
The structure of the survey is available in the supplementary files (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx).
The data collected for each study by two researchers were then synthesized into one final version by a third researcher.
***Description of the data in this data set***
Protocol_HVD_SLR provides the structure of the protocol.
Spreadsheet #1 provides the filled protocol for the relevant studies.
Spreadsheet #2 provides the list of results of the search over the three indexing databases, i.e., before filtering out irrelevant studies.
The information on each selected study was collected in four categories:
(1) descriptive information,
(2) approach- and research design- related information,
(3) quality-related information,
(4) HVD determination-related information
Descriptive information
1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet
2) Complete reference - the complete source information to refer to the study
3) Year of publication - the year in which the study was published
4) Journal article / conference paper / book chapter - the type of the paper: {journal article, conference paper, book chapter}
5) DOI / Website - a link to the website where the study can be found
6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science
7) Availability in OA - availability of an article in the Open Access
8) Keywords - keywords of the paper as indicated by the authors
9) Relevance for this study - what is the relevance level of the article for this study? {high / medium / low}
Approach- and research design-related information
10) Objective / RQ - the research objective / aim, established research questions
11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analysis (country, organisation, specific unit that has been analysed, e.g., the number of use-cases, scope of the SLR etc.)
12) Contributions - the contributions of the study
13) Method - whether the study uses a qualitative, quantitative, or mixed methods approach?
14) Availability of the underlying research data - whether there is a reference to the publicly available underlying research data, e.g., transcriptions of interviews, collected data, or an explanation why these data are not shared?
15) Period under investigation - period (or moment) in which the study was conducted
16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is theory used in the study?
Quality- and relevance- related information
17) Quality concerns - whether there are any quality concerns (e.g., limited information about the research methods used)?
18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused around the HVD determination, secondary - mentioned but not studied (e.g., as part of discussion, future work etc.))
HVD determination-related information
19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term?
20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, “input -> output")
21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the relationships between these components? (detailed description)
22) Stakeholders and their roles - what stakeholders or actors does HVD determination involve? What are their roles?
23) Data - what data do HVD cover?
24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)
***Format of the file***
.xls, .csv (for the first spreadsheet only), .odt, .docx
***Licenses or restrictions***
CC-BY
For more info, see README.txt
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset of African Journals Online publications and journals (last update: February 2024). The dataset contains metadata for articles and journals indexed in African Journals Online (AJOL). It provides the information contained in AJOL in a structured format that can be downloaded and used easily. It also contains a unique identifier matching AJOL articles with their OpenAlex records in order to facilitate the use, comparison, and combination of both data sources.
Details about the download, methods, and findings are reported in the following preprint:
Alonso-Álvarez, P. (2025). A small step towards the epistemic decentralization of science: a dataset of journals and publications indexed in African Journals Online. Zenodo. 10.5281/zenodo.14900054
Detailed information on the database construction process is reported in the following file:
ajol_database_report.pdf
Data files (a short joining sketch follows this list):
ajol_journals.csv: contains metadata from journals, including title, eISSN, ISSN print, country, JPPS category, and open access status (binary for diamond journals).
ajol_journals_area.csv: relates journals to their AJOL research area categories. A journal can belong to up to three categories.
ajol_pub.csv: contains articles’ metadata, including journal identifiers, article URL, doi, issue, volume, date, year, title, first page, and last page.
ajol_pub_author.csv: relates articles to their authors.
ajol_pub_keyword.csv: includes article keywords.
ajol_pub_openalex.csv: relates AJOL articles to their OpenAlex records using the unique identifiers of each data source.
readme.csv: contains the description of the variables in all data files.
ajol_database_report.pdf: detailed information on the database construction process.
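As a sketch of how the files listed above could be combined, the snippet below joins the publication table with the OpenAlex mapping using pandas; the join column names are assumptions, since the actual variable names are documented in readme.csv.

import pandas as pd

pubs = pd.read_csv("ajol_pub.csv")
links = pd.read_csv("ajol_pub_openalex.csv")

# "ajol_id" and "openalex_id" are hypothetical column names; readme.csv
# documents the actual variables used as unique identifiers.
merged = pubs.merge(links, on="ajol_id", how="left")
print("Share of AJOL articles matched to OpenAlex:",
      round(merged["openalex_id"].notna().mean(), 3))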
https://crawlfeeds.com/privacy_policy
Get access to a premium Medium articles dataset containing 500,000+ curated articles with metadata including author profiles, publication dates, reading time, tags, claps, and more. Ideal for natural language processing (NLP), machine learning, content trend analysis, and AI model training.
Training language models (LLMs)
Analyzing content trends and engagement
Sentiment and text classification
SEO research and author profiling
Academic or commercial research
High-volume, cleanly structured JSON
Ideal for developers, researchers, and data scientists
Easy integration with Python, R, SQL, and other data pipelines
Affordable and ready-to-use
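A minimal loading sketch for the dataset described above, assuming the articles are delivered as JSON Lines; the file name and field names such as "claps" follow the description but are assumptions, not a confirmed export format.

import json

# Hypothetical file name and field name ("claps"); adjust to the actual
# export format supplied with the dataset.
articles = []
with open("medium_articles.jsonl", encoding="utf-8") as f:
    for line in f:
        articles.append(json.loads(line))

popular = [a for a in articles if a.get("claps", 0) > 1000]
print(len(popular), "articles with more than 1000 claps")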
https://choosealicense.com/licenses/other/
Dataset Card for "BrightData/Wikipedia-Articles"
Dataset Summary
Explore a collection of millions of Wikipedia articles with the Wikipedia dataset, comprising over 1.23M structured records and 10 data fields updated and refreshed regularly. Each entry includes all major data points such as timestamp, URLs, article titles, raw and cataloged text, images, "see also" references, external links, and a structured table of contents. For a complete list of data points, please… See the full description on the dataset page: https://huggingface.co/datasets/BrightData/Wikipedia-Articles.
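If the repository exposes a default configuration on the Hugging Face Hub, it could be loaded with the datasets library roughly as follows; this is a sketch, not a confirmed recipe, and the field name "url" is assumed from the documented data points.

from datasets import load_dataset

# Assumes the repository exposes a default configuration and split; check
# the dataset page for the exact loading instructions.
ds = load_dataset("BrightData/Wikipedia-Articles", split="train")
print(ds.column_names)
print(ds[0]["url"])  # field name "url" is assumed from the documented data points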
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Development of the number of categorized articles and of authors.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The obtained 3-D P-wave anisotropic and isotropic velocity models in the Tonga subduction zone.
Please see the related article: Z. Yu*, D. Zhao and J. Li*. Structure and dynamics of the Tonga subduction zone: New insight from P-wave anisotropic tomography. Earth and Planetary Science Letters, https://doi.org/10.1016/j.epsl.2022.117844
https://creativecommons.org/publicdomain/zero/1.0/
It contains the text of each article and all the images from that article, along with metadata such as image titles and descriptions. From Wikipedia, we selected featured articles, which are only a small subset of all available ones, because they are manually reviewed and protected from edits. They therefore represent the best quality that human editors on Wikipedia can offer.
You can find more details in the "Image Recommendation for Wikipedia Articles" thesis.
The high-level structure of the dataset is as follows:
.
+-- page1
| +-- text.json
| +-- img
| +-- meta.json
+-- page2
| +-- text.json
| +-- img
| +-- meta.json
:
+-- pageN
| +-- text.json
| +-- img
| +-- meta.json
label | description |
---|---|
pageN | the title of the N-th Wikipedia page; the directory contains all information about the page |
text.json | text of the page saved as JSON. Please refer to the details of the JSON schema below. |
meta.json | a collection of all images of the page. Please refer to the details of the JSON schema below. |
imageN | the N-th image of an article, saved in jpg format with the width of each image set to 600px. The name of the image is the md5 hash of the original image title. |
Below is an example of how the data is stored:
{
  "title": "Naval Battle of Guadalcanal",
  "id": 405411,
  "url": "https://en.wikipedia.org/wiki/Naval_Battle_of_Guadalcanal",
  "html": "...",
  "wikitext": "... The '''Naval Battle of Guadalcanal''', sometimes referred to as ..."
}
key | description |
---|---|
title | page title |
id | unique page id |
url | url of a page on Wikipedia |
html | HTML content of the article |
wikitext | wikitext content of the article |
Please note that @html and @wikitext properties represent the same information in different formats, so just choose the one which is easier to parse in your circumstances.
{
  "img_meta": [
    {
      "filename": "702105f83a2aa0d2a89447be6b61c624.jpg",
      "title": "IronbottomSound.jpg",
      "parsed_title": "ironbottom sound",
      "url": "https://en.wikipedia.org/wiki/File%3AIronbottomSound.jpg",
      "is_icon": false,
      "on_commons": true,
      "description": "A U.S. destroyer steams up what later became known as ...",
      "caption": "Ironbottom Sound. The majority of the warship surface ...",
      "headings": ["Naval Battle of Guadalcanal", "First Naval Battle of Guadalcanal", ...],
      "features": ["4.8618264", "0.49436468", "7.0841103", "2.7377882", "2.1305492", ...]
    },
    ...
  ]
}
key | description |
---|---|
filename | unique image id, the md5 hash of the original image title |
title | image title retrieved from Commons, if applicable |
parsed_title | image title split into words, i.e. "helloWorld.jpg" -> "hello world" |
url | url of the image on Wikipedia |
is_icon | true if the image is an icon, e.g. a category icon. An image is treated as an icon if no preview can be loaded on Wikipedia after clicking on it |
on_commons | true if the image is available from the Wikimedia Commons dataset |
description | description of the image parsed from its Wikimedia Commons page, if available |
caption | caption of the image parsed from the Wikipedia article, if available |
headings | list of all nested headings of the location where the image is placed in the Wikipedia article. The first element is the top-most heading |
features | output of the 5th convolutional layer of ResNet152 trained on the ImageNet dataset. That output of shape (19, 24, 2048) is then max-pooled to shape (2048,). Features are taken from the original images downloaded in jpeg format with a fixed width of 600px. Practically, it is a list of floats with len = 2048 |
Data was collected by fetching featured articles' text and image content with the pywikibot library and then parsing additional metadata out of the HTML pages on Wikipedia and Commons.
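As a worked sketch under the layout described above, the snippet below loads text.json and meta.json for one page and compares two images via cosine similarity of their precomputed ResNet152 feature vectors. The location of meta.json is assumed to be inside the img directory; adjust the path if it sits next to text.json in your copy of the dataset.

import json
import numpy as np
from pathlib import Path

page = Path("Naval Battle of Guadalcanal")  # one pageN directory

text = json.loads((page / "text.json").read_text(encoding="utf-8"))
# meta.json is assumed to live inside the img directory; move the path up
# one level if it sits beside text.json in your copy of the dataset.
meta = json.loads((page / "img" / "meta.json").read_text(encoding="utf-8"))

print(text["title"], "-", len(meta["img_meta"]), "images")

# Cosine similarity between the 2048-dimensional feature vectors of the
# first two images (values may be stored as strings, hence dtype=float).
if len(meta["img_meta"]) >= 2:
    a = np.asarray(meta["img_meta"][0]["features"], dtype=float)
    b = np.asarray(meta["img_meta"][1]["features"], dtype=float)
    print(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))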
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created specifically to test LLMs' capabilities in processing and extracting topic-specific articles from historical, unstructured newspaper issues. While traditional article separation tasks rely on layout information or a combination of layout and semantic understanding, this dataset evaluates a novel approach using OCR'd text and context understanding. This method can considerably improve the corpus-building process for individual researchers working on specific topics such as migration or disasters. The dataset consists of French, German, and English newspapers from 1909 and contains multiple layers of information: detailed metadata about each newspaper issue (including identifiers, titles, dates, and institutional information), full-text content of newspaper pages or sections, a context window for processing, and human-annotated ground-truth extractions. The dataset is structured to enable a three-step evaluation of LLMs: first, their ability to classify content as relevant or not relevant to a specific topic (such as the 1908 Messina earthquake); second, their accuracy in extracting complete relevant articles from the broader newspaper text; and third, their ability to correctly mark the beginning and end of articles, especially when several articles were published in the same newspaper issue. By providing human-annotated ground truth, the dataset allows for systematic assessment of how well LLMs can understand historical text, maintain contextual relevance, and perform precise information extraction. This testing framework helps evaluate LLMs' effectiveness in handling real-world historical document processing tasks while maintaining accuracy and contextual understanding.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
American Stories offers high-quality structured data from historical newspapers suitable for pre-training large language models to enhance the understanding of historical English and world knowledge. It can also be integrated into external databases of retrieval-augmented language models, enabling broader access to historical information, including interpretations of political events and intricate details about people's ancestors. Additionally, the structured article texts facilitate the application of transformer-based methods for popular tasks like detecting reproduced content, significantly improving accuracy compared to traditional OCR methods. American Stories serves as a substantial and valuable dataset for advancing multimodal layout analysis models and other multimodal applications.
Background: The Structured Operational Research and Training Initiative (SORT IT) teaches the practical skills of conducting and publishing operational research (OR) to influence health policy and/or practice. In addition to original research articles, viewpoint articles are also produced and published as secondary outputs of SORT IT courses. We assessed the characteristics, use and influence of viewpoint articles derived from all SORT IT courses.
Methods: This was a cross-sectional study involving all published viewpoint articles derived from the SORT IT courses held between August 2009 and March 2020. Characteristics of these papers were sourced from the papers themselves and from SORT-IT members involved in writing the papers. Data on use were sourced from the metrics provided on the online publishing platforms and from Google Scholar. Influence on policy and practice was self-assessed by the authors of the papers and was performed only for papers deemed to be ‘calls for action’.
...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The structure of a primary research article.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains 20,000 pieces of text collected from Wikipedia, Gutenberg, and CNN/DailyMail. The text was cleaned by replacing symbols such as (.*?/) with a whitespace using automatic scripts and regex.
The data was collected from these sources to ensure the highest level of integrity against AI-generated text.
* Wikipedia: The 20220301 dataset was chosen to minimize the chance of including articles generated or heavily edited by AI.
* Gutenberg: Books from this source are guaranteed to be written by real humans and span various genres and time periods.
* CNN/DailyMail: These news articles were written by professional journalists and cover a variety of topics, ensuring diversity in writing style and subject matter.
The dataset consists of 5 CSV files:
1. CNN_DailyMail.csv: Contains all processed news articles.
2. Gutenberg.csv: Contains all processed books.
3. Wikipedia.csv: Contains all processed Wikipedia articles.
4. Human.csv: Combines all three datasets in order.
5. Shuffled_Human.csv: A randomly shuffled version of Human.csv.
Each file has 2 columns (a short loading sketch follows this list):
- Title: The title of the item.
- Text: The content of the item.
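A minimal pandas sketch, assuming the CSV files have been downloaded to the working directory:

import pandas as pd

# Assumes Shuffled_Human.csv is in the working directory.
df = pd.read_csv("Shuffled_Human.csv")

print(df.columns.tolist())               # expected: ['Title', 'Text']
print(df["Text"].str.len().describe())   # rough length distribution of the texts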
This dataset is suitable for a wide range of NLP tasks, including:
- Training models to distinguish between human-written and AI-generated text (Human/AI classifiers).
- Training LSTMs or Transformers for chatbots, summarization, or topic modeling.
- Sentiment analysis, genre classification, or linguistic research.
Although the data was collected from these sources, it may not be entirely free of AI-generated text. Wikipedia articles may reflect systemic biases in contributor demographics. CNN/DailyMail articles may focus on specific news topics or regions.
For details on how the dataset was created, see the Kaggle notebook used to build it.
This dataset is published under the MIT License, allowing free use for both personal and commercial purposes. Attribution is encouraged but not required.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Development of the number of articles with new contributions per period.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset is a manually curated collection of structured data extracted from peer-reviewed food extrusion research articles. The dataset captures key parameters relevant to food extrusion processes, including equipment configurations, processing conditions, formulation details, and characterization methods. It is intended to support literature synthesis, meta-analyses, and knowledge representation in food extrusion research. This dataset provides a searchable, structured repository for researchers to efficiently access and analyse trends in food extrusion studies beyond what is typically available in standard academic databases. Lineage: This dataset was manually curated from 335 peer-reviewed food extrusion research articles sourced from the Web of Science database. The literature search used the following search syntax: "extru*" (Topic) AND "food" (Topic) NOT “packaging” (Topic). WoS Category filter: Food Science Technology, Nutrition & Dietetics, and Agriculture Dairy Animal Science. Key parameters—including equipment configurations, processing conditions, formulation details, and characterisation methods—were extracted, structured, and categorised by a domain expert in food engineering following a predefined schema. Citation screening was performed to ensure dataset quality.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset used for the article: "Weak genetic structure despite strong genomic signal in lesser sandeel in the North Sea".
The dataset consists of a VCF file from 471 individuals of lesser sandeel, Ammodytes marinus (L.). This VCF is the end product of the bioinformatic analysis described in the paper Jimenez-Mena et al. (2019). Data were obtained from double-digest Restriction-site Associated DNA (ddRAD) sequencing. More information can be found in the Methods section of the article. Information on each of the individuals in the VCF is also included as a separate file, as well as the supplementary tables of the article.
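A dependency-free sketch for a first look at the VCF, assuming the standard VCF layout (## meta-information lines, one #CHROM header row with sample columns, then one variant per line); the file name is hypothetical.

# Count samples and variant records in the VCF without external libraries.
n_variants = 0
samples = []
with open("sandeel_ddrad.vcf") as f:  # hypothetical file name
    for line in f:
        if line.startswith("##"):
            continue                  # meta-information lines
        if line.startswith("#CHROM"):
            samples = line.rstrip("\n").split("\t")[9:]  # sample IDs start at column 10
            continue
        n_variants += 1

print(len(samples), "individuals,", n_variants, "variant sites")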
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Publication and dataset for article "Structure and transport properties of FeS at planetary core conditions" in Earth and Planetary Science Letters, Volume 646, 15 November 2024, 118959.
The scientific papers dataset contains two sets of long, structured documents, obtained from the ArXiv and PubMed OpenAccess repositories.
Both "arxiv" and "pubmed" have two features:
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('scientific_papers', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.