Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for Wikimedia Structured Wikipedia
Dataset Description
Dataset Summary
Early beta release of pre-parsed English and French Wikipedia articles, including infoboxes; feedback is invited. This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and output as structured JSON files with a consistent schema (JSONL compressed as zip). Each JSON line holds the content of one full Wikipedia article stripped of… See the full description on the dataset page: https://huggingface.co/datasets/wikimedia/structured-wikipedia.
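As a rough illustration of the JSONL layout, here is a minimal sketch that parses one article per line from a decompressed dump file and inspects the available fields; the file name is hypothetical and the full schema is documented on the dataset page.

import json

# Hypothetical local path to one decompressed JSONL dump file; the actual
# file names and the full field schema are documented on the dataset page.
path = "enwiki_structured.jsonl"

with open(path, encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)     # each line holds one full article
        print(sorted(article.keys()))  # inspect the available fields
        break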
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The dataset contains metadata encoded in JSON and extracted from more than one million arXiv articles that were put online before the end of 2016. The metadata includes the arXiv id, category names, title, author names, abstract, link to article, publication date and table of contents.
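A minimal sketch of how this metadata might be consumed, assuming the records are stored as a single JSON array and using field names implied by the description above (both are assumptions; check the actual files before use):

import json

# Hypothetical file name; field names such as "categories" follow the
# description above but are assumptions.
with open("arxiv_metadata.json", encoding="utf-8") as f:
    records = json.load(f)  # assuming a single JSON array of records

# Count records whose category names mention "math".
math_records = [r for r in records if "math" in str(r.get("categories", "")).lower()]
print(len(math_records), "records with a math-related category")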
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data collected during a study ("Towards High-Value Datasets determination for data-driven development: a systematic literature review") conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun, Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean) and Andrea Miletič (University of Zagreb).
It is being made public both to serve as supplementary data for the "Towards High-Value Datasets determination for data-driven development: a systematic literature review" paper (a pre-print is available in Open Access at https://arxiv.org/abs/2305.10234) and to allow other researchers to use these data in their own work.
The protocol is intended for a Systematic Literature Review (SLR) on the topic of high-value datasets, with the aim of gathering information on how the topic of high-value datasets (HVD) and their determination has been reflected in the literature over the years and what these studies have found to date, including the indicators used, the stakeholders involved, data-related aspects, and frameworks. The data in this dataset were collected as a result of the SLR over Scopus, Web of Science, and the Digital Government Research library (DGRL) in 2023.
***Methodology***
To understand how HVD determination has been reflected in the literature over the years and what these studies have found to date, all relevant literature covering this topic was studied. To this end, the SLR was carried out by searching the digital libraries covered by Scopus, Web of Science (WoS), and the Digital Government Research library (DGRL).
These databases were queried for the keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), applied to the article title, keywords, and abstract, to limit the results to papers in which these concepts were primary research objects rather than merely mentioned in the body, e.g., as future work. After deduplication, 11 unique articles remained and were checked for relevance; as a result, a total of 9 articles were examined further. Each study was independently examined by at least two authors.
To attain the objective of our study, we developed a protocol in which the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design-related information, (3) quality-related information, (4) HVD determination-related information.
***Test procedure***
Each study was independently examined by at least two authors; after an in-depth examination of the full text of the article, the structured protocol was filled in for each study.
The structure of the survey is available in the supplementary files (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx).
The data collected for each study by two researchers were then synthesized into one final version by a third researcher.
***Description of the data in this data set***
Protocol_HVD_SLR provides the structure of the protocol.
Spreadsheet #1 provides the filled protocol for the relevant studies.
Spreadsheet #2 provides the list of results of the search over the three indexing databases, i.e., before filtering out irrelevant studies.
The information on each selected study was collected in four categories:
(1) descriptive information,
(2) approach- and research design- related information,
(3) quality-related information,
(4) HVD determination-related information
Descriptive information
1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet
2) Complete reference - the complete source information to refer to the study
3) Year of publication - the year in which the study was published
4) Journal article / conference paper / book chapter - the type of the paper: {journal article, conference paper, book chapter}
5) DOI / Website - a link to the website where the study can be found
6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science
7) Availability in OA - availability of an article in the Open Access
8) Keywords - keywords of the paper as indicated by the authors
9) Relevance for this study - what is the relevance level of the article for this study? {high / medium / low}
Approach- and research design-related information
10) Objective / RQ - the research objective / aim, established research questions
11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analysis (country, organisation, specific unit that has been analysed, e.g., the number of use-cases, scope of the SLR etc.)
12) Contributions - the contributions of the study
13) Method - whether the study uses a qualitative, quantitative, or mixed methods approach?
14) Availability of the underlying research data - whether there is a reference to the publicly available underlying research data, e.g., transcriptions of interviews, collected data, or an explanation why these data are not shared?
15) Period under investigation - period (or moment) in which the study was conducted
16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is theory used in the study?
Quality- and relevance- related information
17) Quality concerns - whether there are any quality concerns (e.g., limited information about the research methods used)?
18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused around the HVD determination, secondary - mentioned but not studied (e.g., as part of discussion, future work etc.))
HVD determination-related information
19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term?
20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, “input -> output")
21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the relationships between these components? (detailed description)
22) Stakeholders and their roles - what stakeholders or actors does HVD determination involve? What are their roles?
23) Data - what data do HVD cover?
24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)
***Format of the file***
.xls, .csv (for the first spreadsheet only), .odt, .docx
***Licenses or restrictions***
CC-BY
For more info, see README.txt
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset of African Journals Online publications and journals (last update: February 2024). The dataset contains metadata for articles and journals indexed in African Journals Online (AJOL). It provides the information contained in AJOL in a structured format that can be downloaded and used easily. It also contains a unique identifier matching AJOL articles with their OpenAlex records in order to facilitate the use, comparison, and combination of both data sources.
Details about the download, methods, and findings are reported in the following preprint:
Alonso-Álvarez, P. (2025). A small step towards the epistemic decentralization of science: a dataset of journals and publications indexed in African Journals Online. Zenodo. 10.5281/zenodo.14900054
Detailed information on the database construction process is reported in the following file:
ajol_database_report.pdf
Data files (a short joining sketch follows this list):
ajol_journals.csv: contains metadata from journals, including title, eISSN, ISSN print, country, JPPS category, and open access status (binary for diamond journals).
ajol_journals_area.csv: relates journals to their AJOL research area categories. A journal can belong to up to three categories.
ajol_pub.csv: contains articles’ metadata, including journal identifiers, article URL, doi, issue, volume, date, year, title, first page, and last page.
ajol_pub_author.csv: relates articles to their authors.
ajol_pub_keyword.csv: includes article keywords.
ajol_pub_openalex.csv: relates AJOL articles to their OpenAlex records using the unique identifiers of each data source.
readme.csv: contains the description of the variables in all data files.
ajol_database_report.pdf: detailed information on the database construction process.
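As a sketch of how the files listed above could be combined, the snippet below joins the publication table with the OpenAlex mapping using pandas; the join column names are assumptions, since the actual variable names are documented in readme.csv.

import pandas as pd

pubs = pd.read_csv("ajol_pub.csv")
links = pd.read_csv("ajol_pub_openalex.csv")

# "ajol_id" and "openalex_id" are hypothetical column names; readme.csv
# documents the actual variables used as unique identifiers.
merged = pubs.merge(links, on="ajol_id", how="left")
print("Share of AJOL articles matched to OpenAlex:",
      round(merged["openalex_id"].notna().mean(), 3))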
https://crawlfeeds.com/privacy_policy
Get access to a premium Medium articles dataset containing 500,000+ curated articles with metadata including author profiles, publication dates, reading time, tags, claps, and more. Ideal for natural language processing (NLP), machine learning, content trend analysis, and AI model training.
Training language models (LLMs)
Analyzing content trends and engagement
Sentiment and text classification
SEO research and author profiling
Academic or commercial research
High-volume, cleanly structured JSON
Ideal for developers, researchers, and data scientists
Easy integration with Python, R, SQL, and other data pipelines
Affordable and ready-to-use
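A minimal loading sketch for the dataset described above, assuming the articles are delivered as JSON Lines; the file name and field names such as "claps" follow the description but are assumptions, not a confirmed export format.

import json

# Hypothetical file name and field name ("claps"); adjust to the actual
# export format supplied with the dataset.
articles = []
with open("medium_articles.jsonl", encoding="utf-8") as f:
    for line in f:
        articles.append(json.loads(line))

popular = [a for a in articles if a.get("claps", 0) > 1000]
print(len(popular), "articles with more than 1000 claps")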
https://choosealicense.com/licenses/other/
Dataset Card for "BrightData/Wikipedia-Articles"
Dataset Summary
Explore a collection of millions of Wikipedia articles with the Wikipedia dataset, comprising over 1.23M structured records and 10 data fields updated and refreshed regularly. Each entry includes all major data points such as timestamp, URLs, article titles, raw and cataloged text, images, "see also" references, external links, and a structured table of contents. For a complete list of data points, please… See the full description on the dataset page: https://huggingface.co/datasets/BrightData/Wikipedia-Articles.
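If the repository exposes a default configuration on the Hugging Face Hub, it could be loaded with the datasets library roughly as follows; this is a sketch, not a confirmed recipe, and the field name "url" is assumed from the documented data points.

from datasets import load_dataset

# Assumes the repository exposes a default configuration and split; check
# the dataset page for the exact loading instructions.
ds = load_dataset("BrightData/Wikipedia-Articles", split="train")
print(ds.column_names)
print(ds[0]["url"])  # field name "url" is assumed from the documented data points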
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Development of the number of categorized articles and of authors.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The obtained 3-D P-wave anisotropic and isotropic velocity models in the Tonga subduction zone.
Please see the related article: Z. Yu*, D. Zhao and J. Li*. Structure and dynamics of the Tonga subduction zone: New insight from P-wave anisotropic tomography. Earth and Planetary Science Letters, https://doi.org/10.1016/j.epsl.2022.117844
https://creativecommons.org/publicdomain/zero/1.0/
It contains the text of each article and all the images from that article, along with metadata such as image titles and descriptions. From Wikipedia, we selected featured articles, which are only a small subset of all available ones, because they are manually reviewed and protected from edits. They therefore represent the best quality that human editors on Wikipedia can offer.
You can find more details in the "Image Recommendation for Wikipedia Articles" thesis.
The high-level structure of the dataset is as follows:
.
+-- page1
| +-- text.json
| +-- img
| +-- meta.json
+-- page2
| +-- text.json
| +-- img
| +-- meta.json
:
+-- pageN
| +-- text.json
| +-- img
| +-- meta.json
label | description |
---|---|
pageN | the title of the N-th Wikipedia page; the directory contains all information about the page |
text.json | text of the page saved as JSON. Please refer to the details of the JSON schema below. |
meta.json | a collection of all images of the page. Please refer to the details of the JSON schema below. |
imageN | the N-th image of an article, saved in jpg format with the width of each image set to 600px. The name of the image is the md5 hash of the original image title. |
Below is an example of how the data is stored:
{
  "title": "Naval Battle of Guadalcanal",
  "id": 405411,
  "url": "https://en.wikipedia.org/wiki/Naval_Battle_of_Guadalcanal",
  "html": "...",
  "wikitext": "... The '''Naval Battle of Guadalcanal''', sometimes referred to as ..."
}
key | description |
---|---|
title | page title |
id | unique page id |
url | url of a page on Wikipedia |
html | HTML content of the article |
wikitext | wikitext content of the article |
Please note that @html and @wikitext properties represent the same information in different formats, so just choose the one which is easier to parse in your circumstances.
{
  "img_meta": [
    {
      "filename": "702105f83a2aa0d2a89447be6b61c624.jpg",
      "title": "IronbottomSound.jpg",
      "parsed_title": "ironbottom sound",
      "url": "https://en.wikipedia.org/wiki/File%3AIronbottomSound.jpg",
      "is_icon": false,
      "on_commons": true,
      "description": "A U.S. destroyer steams up what later became known as ...",
      "caption": "Ironbottom Sound. The majority of the warship surface ...",
      "headings": ["Naval Battle of Guadalcanal", "First Naval Battle of Guadalcanal", ...],
      "features": ["4.8618264", "0.49436468", "7.0841103", "2.7377882", "2.1305492", ...]
    },
    ...
  ]
}
key | description |
---|---|
filename | unique image id, the md5 hash of the original image title |
title | image title retrieved from Commons, if applicable |
parsed_title | image title split into words, i.e. "helloWorld.jpg" -> "hello world" |
url | url of the image on Wikipedia |
is_icon | true if the image is an icon, e.g. a category icon. An image is treated as an icon if no preview can be loaded on Wikipedia after clicking on it |
on_commons | true if the image is available from the Wikimedia Commons dataset |
description | description of the image parsed from its Wikimedia Commons page, if available |
caption | caption of the image parsed from the Wikipedia article, if available |
headings | list of all nested headings of the location where the image is placed in the Wikipedia article. The first element is the top-most heading |
features | output of the 5th convolutional layer of ResNet152 trained on the ImageNet dataset. That output of shape (19, 24, 2048) is then max-pooled to shape (2048,). Features are taken from the original images downloaded in jpeg format with a fixed width of 600px. Practically, it is a list of floats with len = 2048 |
Data was collected by fetching featured articles' text and image content with the pywikibot library and then parsing additional metadata out of the HTML pages on Wikipedia and Commons.
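As a worked sketch under the layout described above, the snippet below loads text.json and meta.json for one page and compares two images via cosine similarity of their precomputed ResNet152 feature vectors. The location of meta.json is assumed to be inside the img directory; adjust the path if it sits next to text.json in your copy of the dataset.

import json
import numpy as np
from pathlib import Path

page = Path("Naval Battle of Guadalcanal")  # one pageN directory

text = json.loads((page / "text.json").read_text(encoding="utf-8"))
# meta.json is assumed to live inside the img directory; move the path up
# one level if it sits beside text.json in your copy of the dataset.
meta = json.loads((page / "img" / "meta.json").read_text(encoding="utf-8"))

print(text["title"], "-", len(meta["img_meta"]), "images")

# Cosine similarity between the 2048-dimensional feature vectors of the
# first two images (values may be stored as strings, hence dtype=float).
if len(meta["img_meta"]) >= 2:
    a = np.asarray(meta["img_meta"][0]["features"], dtype=float)
    b = np.asarray(meta["img_meta"][1]["features"], dtype=float)
    print(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))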
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created specifically to test LLMs' capabilities in processing and extracting topic-specific articles from historical, unstructured newspaper issues. While traditional article separation tasks rely on layout information or a combination of layout and semantic understanding, this dataset evaluates a novel approach using OCR'd text and context understanding. This method can considerably improve the corpus-building process for individual researchers working on specific topics such as migration or disasters. The dataset consists of French, German, and English newspapers from 1909 and contains multiple layers of information: detailed metadata about each newspaper issue (including identifiers, titles, dates, and institutional information), full-text content of newspaper pages or sections, a context window for processing, and human-annotated ground-truth extractions. The dataset is structured to enable a three-step evaluation of LLMs: first, their ability to classify content as relevant or not relevant to a specific topic (such as the 1908 Messina earthquake); second, their accuracy in extracting complete relevant articles from the broader newspaper text; and third, their ability to correctly mark the beginning and end of articles, especially when several articles were published in the same newspaper issue. By providing human-annotated ground truth, the dataset allows for systematic assessment of how well LLMs can understand historical text, maintain contextual relevance, and perform precise information extraction. This testing framework helps evaluate LLMs' effectiveness in handling real-world historical document processing tasks while maintaining accuracy and contextual understanding.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
American Stories offers high-quality structured data from historical newspapers suitable for pre-training large language models to enhance the understanding of historical English and world knowledge. It can also be integrated into external databases of retrieval-augmented language models, enabling broader access to historical information, including interpretations of political events and intricate details about people's ancestors. Additionally, the structured article texts facilitate the application of transformer-based methods for popular tasks like detecting reproduced content, significantly improving accuracy compared to traditional OCR methods. American Stories serves as a substantial and valuable dataset for advancing multimodal layout analysis models and other multimodal applications.
Background: The Structured Operational Research and Training Initiative (SORT IT) teaches the practical skills of conducting and publishing operational research (OR) to influence health policy and/or practice. In addition to original research articles, viewpoint articles are also produced and published as secondary outputs of SORT IT courses. We assessed the characteristics, use and influence of viewpoint articles derived from all SORT IT courses.
Methods: This was a cross-sectional study involving all published viewpoint articles derived from the SORT IT courses held between August 2009 and March 2020. Characteristics of these papers were sourced from the papers themselves and from SORT-IT members involved in writing the papers. Data on use were sourced from the metrics provided on the online publishing platforms and from Google Scholar. Influence on policy and practice was self-assessed by the authors of the papers and was performed only for papers deemed to be ‘calls for action’.
...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The structure of a primary research article.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains 20,000 pieces of text collected from Wikipedia, Gutenberg, and CNN/DailyMail. The text was cleaned by replacing symbols such as (.*?/) with a whitespace using automatic scripts and regex.
The data was collected from these sources to ensure the highest level of integrity against AI-generated text.
* Wikipedia: The 20220301 dataset was chosen to minimize the chance of including articles generated or heavily edited by AI.
* Gutenberg: Books from this source are guaranteed to be written by real humans and span various genres and time periods.
* CNN/DailyMail: These news articles were written by professional journalists and cover a variety of topics, ensuring diversity in writing style and subject matter.
The dataset consists of 5 CSV files:
1. CNN_DailyMail.csv: Contains all processed news articles.
2. Gutenberg.csv: Contains all processed books.
3. Wikipedia.csv: Contains all processed Wikipedia articles.
4. Human.csv: Combines all three datasets in order.
5. Shuffled_Human.csv: A randomly shuffled version of Human.csv.
Each file has 2 columns (a short loading sketch follows this list):
- Title: The title of the item.
- Text: The content of the item.
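A minimal pandas sketch, assuming the CSV files have been downloaded to the working directory:

import pandas as pd

# Assumes Shuffled_Human.csv is in the working directory.
df = pd.read_csv("Shuffled_Human.csv")

print(df.columns.tolist())               # expected: ['Title', 'Text']
print(df["Text"].str.len().describe())   # rough length distribution of the texts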
This dataset is suitable for a wide range of NLP tasks, including:
- Training models to distinguish between human-written and AI-generated text (Human/AI classifiers).
- Training LSTMs or Transformers for chatbots, summarization, or topic modeling.
- Sentiment analysis, genre classification, or linguistic research.
Although the data was collected from these sources, it may not be entirely free of AI-generated text. Wikipedia articles may reflect systemic biases in contributor demographics. CNN/DailyMail articles may focus on specific news topics or regions.
For details on how the dataset was created, see the Kaggle notebook used to build it.
This dataset is published under the MIT License, allowing free use for both personal and commercial purposes. Attribution is encouraged but not required.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Development of the number of articles with new contributions per period.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset is a manually curated collection of structured data extracted from peer-reviewed food extrusion research articles. The dataset captures key parameters relevant to food extrusion processes, including equipment configurations, processing conditions, formulation details, and characterization methods. It is intended to support literature synthesis, meta-analyses, and knowledge representation in food extrusion research. This dataset provides a searchable, structured repository for researchers to efficiently access and analyse trends in food extrusion studies beyond what is typically available in standard academic databases. Lineage: This dataset was manually curated from 335 peer-reviewed food extrusion research articles sourced from the Web of Science database. The literature search used the following search syntax: "extru*" (Topic) AND "food" (Topic) NOT “packaging” (Topic). WoS Category filter: Food Science Technology, Nutrition & Dietetics, and Agriculture Dairy Animal Science. Key parameters—including equipment configurations, processing conditions, formulation details, and characterisation methods—were extracted, structured, and categorised by a domain expert in food engineering following a predefined schema. Citation screening was performed to ensure dataset quality.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset used for the article: "Weak genetic structure despite strong genomic signal in lesser sandeel in the North Sea".
The dataset consists of a VCF file from 471 individuals of lesser sandeel, Ammodytes marinus (L.). This VCF is the end product of the bioinformatic analysis described in the paper Jimenez-Mena et al. (2019). Data were obtained from double-digest Restriction-site Associated DNA (ddRAD) sequencing. More information can be found in the Methods section of the article. Information on each of the individuals in the VCF is also included as a separate file, as well as the supplementary tables of the article.
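A dependency-free sketch for a first look at the VCF, assuming the standard VCF layout (## meta-information lines, one #CHROM header row with sample columns, then one variant per line); the file name is hypothetical.

# Count samples and variant records in the VCF without external libraries.
n_variants = 0
samples = []
with open("sandeel_ddrad.vcf") as f:  # hypothetical file name
    for line in f:
        if line.startswith("##"):
            continue                  # meta-information lines
        if line.startswith("#CHROM"):
            samples = line.rstrip("\n").split("\t")[9:]  # sample IDs start at column 10
            continue
        n_variants += 1

print(len(samples), "individuals,", n_variants, "variant sites")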
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Publication and dataset for article "Structure and transport properties of FeS at planetary core conditions" in Earth and Planetary Science Letters, Volume 646, 15 November 2024, 118959.
The scientific papers dataset contains two sets of long, structured documents, obtained from the ArXiv and PubMed OpenAccess repositories.
Both "arxiv" and "pubmed" have two features:
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('scientific_papers', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.